# Using Word Vectors to find misspellings

This notebook explores some of the ideas found [here](http://forums.fast.ai/t/nlp-any-libraries-dictionaries-out-there-for-fixing-common-spelling-errors/16411).  

Most notably the idea that we can use the vectors of misspelled word to create a single *transformation vector* that in turn, can be used to identify other misspelled words



In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai.text import *

import html, pdb
import dill as pickle
from collections import Counter, defaultdict
import multiprocessing as mp

import spacy
spacy_en = spacy.load('en')
spacy_es = spacy.load('es')

# pandas and plotting config
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_colwidth', -1)

## Configuration

In [3]:
# various default, LM, and classification paths
PATH = Path('data/verbatims')

LM_PATH = PATH/'lm'
CLS_PATH = PATH/'class'

(LM_PATH/'models').mkdir(parents=True, exist_ok=True)
(LM_PATH/'tmp').mkdir(exist_ok=True)

(CLS_PATH/'models').mkdir(parents=True, exist_ok=True)
(CLS_PATH/'tmp').mkdir(exist_ok=True)

# dataframe config
verbatims_filename = 'verbatims.csv'
date_cols = ['LastReviewedOn', 'LastPredictionsOn']
dtypes = { 'AnswerText': str, 'AnswerText_Cleaned': str, 'AnswerText_NonEnglish': str, 
          'p_Summary': str, 'LastReviewedBy' : str }

# columns for text, labels, and classes
TXT_COLS = ['AnswerText_Cleaned']

LABELS_SENT = ['IsVeryPositive', 'IsPositive', 'IsVeryNegative', 'IsNegative', 
             'IsSuggestion', 'FeelsThreatened', 'IsNonsense']

LABELS_ENT = ['HasPersonsName', 'HasOrgName', 'HasContactInfo']

LABELS = LABELS_SENT + LABELS_ENT
CLASSES = ['no', 'yes']

In [4]:
PRE_PATH = PATH/'models'/'wt103'
PRE_LM_PATH = PRE_PATH/'fwd_wt103.h5'

## Wiki103 Embeddings with sklearn's NearestNeighbors

In [5]:
wgts = torch.load(PRE_LM_PATH, map_location=lambda storage, loc: storage)

In [6]:
list(wgts.keys()), wgts['0.encoder.weight'].size()

(['0.encoder.weight',
  '0.encoder_with_dropout.embed.weight',
  '0.rnns.0.module.weight_ih_l0',
  '0.rnns.0.module.bias_ih_l0',
  '0.rnns.0.module.bias_hh_l0',
  '0.rnns.0.module.weight_hh_l0_raw',
  '0.rnns.1.module.weight_ih_l0',
  '0.rnns.1.module.bias_ih_l0',
  '0.rnns.1.module.bias_hh_l0',
  '0.rnns.1.module.weight_hh_l0_raw',
  '0.rnns.2.module.weight_ih_l0',
  '0.rnns.2.module.bias_ih_l0',
  '0.rnns.2.module.bias_hh_l0',
  '0.rnns.2.module.weight_hh_l0_raw',
  '1.decoder.weight'],
 torch.Size([238462, 400]))

In [7]:
enc_wgts = to_np(wgts['0.encoder.weight'])
row_m = enc_wgts.mean(0)

In [8]:
itos2 = pickle.load((PRE_PATH/'itos_wt103.pkl').open('rb'))
stoi2 = collections.defaultdict(lambda: -1, { v:k for k,v in enumerate(itos2) })

#### Convert embedding matrix to np

In [9]:
emb = to_np(wgts['0.encoder.weight'])

In [10]:
emb.shape, len(itos2)

((238462, 400), 238462)

In [11]:
itos2[100]  # "world"

'world'

In [12]:
emb[100].shape # 400D vector for word "world"

(400,)

#### Use KNN to evaluate vector space

In [13]:
from sklearn.neighbors import NearestNeighbors

In [14]:
nbrs = NearestNeighbors(n_neighbors=8, metric="cosine").fit(emb)

How to find the first 5 words that start with "god" (`select * from words where word like 'god%'`)

In [15]:
sorted([s for s in itos2 if (s.startswith('god'))])[:5]

['god', 'goda', 'godaddy', 'godai', 'godal']

How to find the index of the word "god" (use `list.index(val)`)

In [16]:
itos2.index('god')

921

Look up the vector and add a dimension for nn

In [17]:
word_vec = np.expand_dims(emb[921], 0) # (400,) => (1,400)
word_vec.shape

(1, 400)

In [18]:
dist, near_idxs = nbrs.kneighbors(word_vec)

print(dist)
print(near_idxs)

[[0.      0.32039 0.34489 0.34614 0.34909 0.35    0.35193 0.35347]]
[[  921  5085 14224  3431  8117  4365 12068  3260]]


Look up the words based on the indexes of the word's nearest neighbors

In [19]:
[itos2[idx] for idx in near_idxs.squeeze()]

['god', 'goddess', 'satan', 'christ', 'deity', 'gods', 'mankind', 'jesus']

Some fun.  What does "god" - "satan" mean?  Or "god" + "satan"?

In [20]:
vec_diff = emb[itos2.index('god')] - emb[itos2.index('satan')] 
dist, near_idxs = nbrs.kneighbors(np.expand_dims(vec_diff,0))
[itos2[idx] for idx in near_idxs.squeeze()]

['god', 'goddess', 'christ', 'divine', 'deity', 'you', 'gods', 'someone']

In [21]:
vec_add = emb[itos2.index('god')] + emb[itos2.index('satan')] 
dist, near_idxs = nbrs.kneighbors(np.expand_dims(vec_add,0))
[itos2[idx] for idx in near_idxs.squeeze()]

['god', 'satan', 'humankind', 'allah', 'athena', 'babylon', 'lucifer', 'cupid']

Using sklearn's `cosine_similarity` to basically do the same thing KNN does above, except it is going to pull up the results for the entire vocab.

We can then use `np.argsort()` to find the closest and furthest neighbors

In [22]:
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(emb[np.expand_dims(itos2.index('god'), 0)], emb)

print(cos_sim.shape)

(1, 238462)


In [23]:
cos_sim[0,:5]
np.argsort(cos_sim[0])[:5]        # => sort desc; get first 5 (farthest away)
np.argsort(cos_sim[0])[::-1][:5]  # => sort asc; get first 5 (closest)

array([  921,  5085, 14224,  3431,  8117])

In [24]:
[itos2[idx] for idx in np.argsort(cos_sim[0])[::-1][:5]]

['god', 'goddess', 'satan', 'christ', 'deity']

#### Use PCA to breakdown the 400 dimensional vectors into 2 dimensions for plotting

In [25]:
from sklearn.decomposition import PCA

In [26]:
pca = PCA(n_components=2)
emb_pca = pca.fit_transform(emb)

print(pca.explained_variance_ratio_)
print(pca.singular_values_)  
print(emb_pca.shape)

[0.57408 0.01101]
[741.492   102.66906]
(238462, 2)


## Using Glove embeddings to find spelling errors

In [27]:
g_dim = 300 # the dimensionality of our glove vectors

Load via `np.loadtxt` (kernel dies for anything bigger than the 6B model)

In [28]:
# glove = np.loadtxt("data/glove/glove.42B.300d.txt", dtype='str', comments=None)

In [29]:
# words = glove[:, 0].tolist()
# vectors = glove[:, 1:].astype('float')

Load via pandas (works for **all** glove models)

In [30]:
glove_df = pd.read_table("data/glove/glove.42B.300d.txt", sep=" ", index_col=0, header=None, 
                         quoting=csv.QUOTE_NONE, na_values=None, keep_default_na=False)

len(glove_df)

1917494

In [31]:
words = glove_df.index.values.tolist()
words[:5]

[',', 'the', '.', 'and', 'to']

In [32]:
vectors = glove_df.values
len(vectors[0])

300

#### Instead of using sklearn, we'll try using nmslib instead

In [33]:
import nmslib

In [34]:
def create_index(a):
    # 1. create index on angular distance
    #index = nmslib.init(space='angulardist') 
    index = nmslib.init(method='hnsw', space='cosinesimil')
    
    index.addDataPointBatch(a)
    index.createIndex()
    return index

def get_knns(index, vecs, n_nn=8):
    # 3. can query a bunch of vectors all at once and get there "k" nearest neighbors
    return zip(*index.knnQueryBatch(vecs, k=n_nn, num_threads=4)) 

def get_knn(index, vec, n_nn=8): 
    return index.knnQuery(vec, k=n_nn)

def get_token_knns(index, tokens, n_nn=8):    
    vecs = [ vectors[words.index(t)] if t in words else np.zeros((g_dim)) for t in tokens ]
    return get_knns(index, vecs, n_nn)

In [35]:
%time nms_idx = create_index(vectors) # 2. do it on all of our word vectors

CPU times: user 1h 37min 24s, sys: 57.8 s, total: 1h 38min 22s
Wall time: 15min 47s


In [36]:
word_vec = vectors[words.index('god')]
word_vec.shape

(300,)

In [37]:
# 4. tells you their indexes and how far away they are
knn_idxs, knn_dists = get_knns(nms_idx, np.expand_dims(word_vec, 0))

In [38]:
knn_idxs, knn_dists, len(knn_idxs), len(knn_dists)

((array([  344,  1849,  1371,  1368,  3097,  5764, 15099,  1734], dtype=int32),),
 (array([0.     , 0.19305, 0.20231, 0.24353, 0.26355, 0.27491, 0.27819, 0.28092], dtype=float32),),
 1,
 1)

In [39]:
[[ words[idx] for idx in idxs ] for idxs in knn_idxs ]

[['god', 'christ', 'jesus', 'lord', 'heaven', 'gods', 'almighty', 'faith']]

#### Try finding spelling mistakes based on a transformation vector built from a single misspelling

Explore conclusions from here:  [NLP: Any libraries/dictionaries out there for fixing common spelling errors?](http://forums.fast.ai/t/nlp-any-libraries-dictionaries-out-there-for-fixing-common-spelling-errors/16411/8?u=wgpubs)

In [40]:
misspelled = ['realiable']

In [41]:
knn_idxs, knn_dists = get_token_knns(nms_idx, misspelled)

In [42]:
misspelled_words = []
d = collections.defaultdict(lambda: [])
for i, (idxs, dists) in enumerate(zip(knn_idxs, knn_dists)):
    for idx, dist in zip(idxs, dists):
        misspelled_words.append(words[idx])
        print(misspelled[i], words[idx], dist)
    print('')

realiable realiable 0.0
realiable relaible 0.37913597
realiable realible 0.47054034
realiable reliabe 0.50882244
realiable trust-worthy 0.515444
realiable relieable 0.5173365
realiable relaiable 0.527301
realiable knowlegable 0.5483686



In [43]:
misspelled_words_vecs = np.array([ vectors[words.index(w)] for w in misspelled_words ])

In [44]:
misspelled_words_vecs.shape

(8, 300)

In [45]:
one_correct_word = vectors[words.index('reliable')]

In [46]:
one_xform_vec = (misspelled_words_vecs - one_correct_word).mean(axis=0)

In [48]:
one_xform_vec.shape

(300,)

Try using other *transformation vector* on a different misspelled word

In [49]:
new_vec = vectors[words.index('thier')] - one_xform_vec

In [50]:
knn_idxs, knn_dists = get_knns(nms_idx, np.expand_dims(new_vec, 0))
knn_idxs, knn_dists

((array([   62, 14902,   205,    63,   193,    92,    32,   273], dtype=int32),),
 (array([0.22198, 0.24216, 0.25902, 0.26866, 0.29234, 0.29627, 0.30234, 0.30454], dtype=float32),))

In [51]:
[[ words[idx] for idx in idxs ] for idxs in knn_idxs ]

[['their', 'thier', 'own', 'our', 'same', 'them', 'your', 'sure']]

There is is, the closest word ***is*** the correct spelling

#### Try finding misspellings based on an ensembled transformation vector

In [52]:
spelling_regex_repls = {
    # abbreviations
    r"\bacctg\b" : "acct",
    r"\badd'l\b" : "additional",
#     r"\br\s\b": "are",
#     r"\bu\s\b": "you ",
#     r"\b\sm\s\b ": "am",
#     r"'cause\b" : "because",
#     r"\b(ha)+\b": "haha",
#     r"\b(he)+\b": "haha",
#     r"\bya+y\b": "yay",
#     r"\bwa+y\b": "way",
    r"\bf'real\b" : "for real",
#     r"\bgr8\b" : "great",
    r"\bintl\b" : "int'l",
    # common misspellings
    r"\bbailable\b" : "available",
    r"\babilty\b" : "ability",
    r"\babsolutly\b" : "absolutely",
    r"\babsoultely\b" : "absolutely",
    r"\bacces\b" : "access",
    r"\baccesability\b" : "accessibility",
    r"\baccesbility\b" : "accessibility",
    r"\baccesibility\b" : "accessibility",
    r"\baccessability\b" : "accessibility",
    r"\baccessbility\b" : "accessibility",
    r"\baccesable\b" : "accessible",
    r"\baccesible\b" : "accessible",
    r"\baccessable\b" : "accessible",
    r"\bacessible\b" : "accessible",
    r"\bassessable\b" : "availability",
    r"\baccidently\b" : "accidentally",
    r"\baccomadate\b" : "accommodate",
    r"\baccomdate\b" : "accommodate",
    r"\baccomidate\b" : "accommodate",
    r"\baccomodate\b" : "accommodate",
    r"\baccomadating\b" : "accommodating",
    r"\baccomidating\b" : "accommodating",
    r"\baccomodating\b" : "accommodating",
    r"\baccomadations\b" : "accommodations",
    r"\baccomodation\b" : "accommodation",
    r"\baccouting\b" : "accounting",
    r"\baccross\b" : "across",
    r"\badd'l\b" : "additional",
    r"\badditonal\b" : "additional",
    r"\baddtionally\b" : "additionally",
    r"\badminstration\b" : "administration",
    r"\badminstrative\b" : "administrative",
    r"\badminstrator\b" : "administrator",
    r"\badress\b" : "address",
    r"\badvancment\b" : "advancement",
    r"\badvertized\b" : "advertised",
    r"\bafforable\b" : "affordable",
    r"\bafordable\b" : "affordable",
    r"\bafterall\b" : "after all",
    r"\bafterhours\b" : "after hours",
    r"\baggresive\b" : "aggressive",
    r"\bagressive\b" : "aggressive",
    r"\bagressions\b" : "aggressions",
    r"\balittle\b" : "a little",
    r"\balll\b" : "all",
    r"\balloted\b" : "allotted",
    r"\ballthough\b" : "although",
    r"\balthought\b" : "although",
    r"\ballways\b" : "always",
    r"\balos\b" : "also",
    r"\balot\b" : "a lot",
    r"\balotted\b" : "allotted",
    r"\bammount\b" : "amount",
    r"\bammounts\b" : "amounts",
    r"\bamoung\b" : "among",
    r"\bamoungst\b" : "amongst",
    r"\bannouncment\b" : "announcement",
    r"\baparments\b" : "apartments",
    r"\bapparrel\b" : "apparel",
    r"\bappartment\b" : "apartment",
    r"\bappriciate\b" : "appreciate",
    r"\bassitance\b" : "assistance",
    r"\bassitant\b" : "assistant",
    r"\batleast\b" : "at least",
    r"\battentative\b" : "attentive",
    r"\battrocious\b" : "atrocious",
    r"\bavaiable\b" : "available",
    r"\bavaible\b" : "available",
    r"\bavailabe\b" : "available",
    r"\bavailble\b" : "available",
    r"\bavailiable\b" : "available",
    r"\bavailible\b" : "available",
    r"\bavaliable\b" : "available",
    r"\bavalible\b" : "available",
    r"\bavilable\b" : "available",
    r"\bavailiability\b" : "availability",
    r"\bavailabiltiy\b" : "availability",
    r"\bavailabilty\b" : "availability",
    r"\bavailablility\b" : "availability",
    r"\bavailablity\b" : "availability",
    r"\bavailibility\b" : "availability",
    r"\bavaliability\b" : "availability",
    r"\bavaliablity\b" : "availability",
    r"\bavalibility\b" : "availability",
    r"\bactivies\b" : "activities",
    r"\bactivites\b" : "activities",
    r"\bactualy\b" : "actually",
    r"\bacutally\b" : "actually",
    r"\bammenities\b" : "amenities",
    r"\bantoher\b" : "another",
    r"\bassitant\b" : "assistant",
    r"\baswell\b" : "as well",
    r"\baweful\b" : "awful",
    r"\bawfull\b" : "awful",
    r"\bawsome\b" : "awesome",
    r"\bbeacuse\b" : "because",
    r"\bbearly\b" : "barely",
    r"\bbeaurocracy\b" : "bureaucracy",
    r"\bbeaurocratic\b" : "bureaucratic",
    r"\bbecasue\b" : "because",
    r"\bbecuase\b" : "because",
    r"\bbecuse\b" : "because",
    r"\bbefor\b" : "before",
    r"\bbeggining\b" : "beginning",
    r"\bbegining\b" : "beginning",
    r"\bbeleive\b" : "believe",
    r"\bbelive\b" : "believe",
    r"\bbenificial\b" : "beneficial",
    r"\bbenifit\b" : "benefit",
    r'\bbugetary\b' : "budgetary",
    r'\bbuiding\b' : "building",
    r'\bbuidling\b' : "building",
    r'\bbuisness\b' : "business",
    r'\bbuliding\b' : "building",
    r"\bbureacracy\b" : "bureaucracy",
    r"\bburitto\b" : "burrito",
    r"\bbussiness\b" : "business",
    r"\bcalender\b" : "calendar",
    r"\bcan;t\b" : "can't",
    r"\bcasher\b" : "cashier",
    r'\bcatagories\b' : "categories",
    r'\bcatagory\b' : "category",
    r"\bcheapter\b" : "cheaper",
    r"\bcheeper\b" : "cheaper",
    r'\bclasss\b' : "clas",
    r'\bclassses\b' : "classes",
    r"\bcleaniness\b" : "cleanliness",
    r"\bcmapus\b" : "campus",
    r'\bcofee\b' : "coffee",
    r'\bcoffe\b' : "coffee",
    r'\bcollegue\b' : "colleague",
    r'\bcoment\b' : "comment",
    r'\bcoments\b' : "comments",
    r'\bcomming\b' : "coming",
    r'\bcommittment\b' : "commitment",
    r'\bcommment\b' : "comment",
    r'\bcommuication\b' : "communication",
    r'\bcommunter\b' : "commuter",
    r'\bcommunters\b' : "commuters",
    r'\bcomotion\b' : "commotion",
    r'\bcomparision\b' : "comparison",
    r'\bcompatability\b' : "compatibility",
    r'\bcompatable\b' : "compatible",
    r'\bcompetative\b' : "competitive",
    r'\bcompetetive\b' : "competitive",
    r'\bcompetive\b' : "competitive",
    r'\bcompletly\b' : "completely",
    r"\bcomraderie\b" : "camaraderie",
    r'\bcomradery\b' : "camaraderie",
    r'\bcomunication\b' : "communication",
    r'\bcomunity\b' : "community",
    r'\bconcious\b' : "conscious",
    r'\bcondusive\b' : "conducive",
    r'\bconection\b' : "connection",
    r"\bconfortable\b" : "comfortable",
    r'\bconsistant\b' : "consistent",
    r'\bconsistantly\b' : "consistently",
    r'\bconsistenly\b' : "consistently",
    r'\bcontinously\b' : "continuously",
    r'\bcontruction\b' : "construction",
    r'\bconveinent\b' : "convenient",
    r'\bconveinient\b' : "convenient",
    r'\bconveniant\b' : "convenient",
    r'\bconveniece\b' : "convenience",
    r'\bconveninent\b' : "convenient",
    r'\bconvienance\b' : "convenience",
    r'\bconvienant\b' : "convenient",
    r'\bconvience\b' : "convenience",
    r'\bconvienence\b' : "convenience",
    r'\bconvienent\b' : "convenient",
    r'\bconvienet\b' : "convenient",
    r'\bconvienience\b' : "convenience",
    r'\bconvienient\b' : "convenient",
    r'\bconvient\b' : "convenient",
    r'\bconviently\b' : "conveniently",
    r'\bconvinence\b' : "convenience",
    r'\bconvinent\b' : "convenient",
    r'\bconvinience\b' : "convenience",
    r'\bconvinient\b' : "convenient",
    r'\bcorteous\b' : "courteous",
    r'\bcostodial\b' : "custodial",
    r'\bcoureous\b' : "courteous",
    r'\bcourtis\b' : "courteous",
    r'\bcouteous\b' : "courteous",
    r'\bcovenient\b' : "convenient",
    r'\bcroweded\b' : "crowded",
    r'\bcurteous\b' : "courteous",
    r'\bcurtesy\b' : "courtesy",
    r'\bcurtious\b' : "courteous",
    r"\bdeaprtment\b" : "department",
    r"\bdecission\b" : "decision",
    r'\bdefinately\b' : "definitely",
    r'\bdefinetely\b' : "definitely",
    r'\bdefinetly\b' : "definitely",
    r'\bdefinitley\b' : "definitely",
    r'\bdefinitly\b' : "definitely",
    r'\bdelievered\b' : "delivered",
    r'\bdeliverers\b' : "deliveries",
    r'\bdeparment\b' : "department",
    r'\bdeparments\b' : "department",
    r'\bdepartement\b' : "department",
    r"\bdepartment\(s\b" : "departments",
    r'\bdepartmet\b' : "department",
    r'\bdepratment\b' : "department",
    r"\bdeptartment\b" : "department",
    r'\bdescrimination\b' : "discrimination",
    r'\bdesireable\b' : "desirable",
    r"\bdiffernt\b" : "different",
    r"\bdiffrent\b" : "different",
    r'\bdinig\b' : "dining",
    r'\bdirverse\b' : "diverse",
    r'\bdisapointed\b' : "disappointed",
    r'\bdisapointing\b' : "disappointing",
    r'\bdisasterous\b' : "disastrous",
    r'\bdisatisfied\b' : "dissatisfied",
    r'\bdisbursment\b' : "disbursement",
    r'\bdisbursments\b' : "disbursements",
    r'\bdiscretely\b' : "discreetly",
    r'\bdiscusting\b' : "disgusting",
    r'\bdisfunctional\b' : "dysfunctional",
    r'\bdispensors\b' : "dispensers",
    r'\bdispersement\b' : "disbursement",
    r'\bdissapointed\b' : "disappointed",
    r'\bdissapointing\b' : "disappointing",
    r'\bdissapointment\b' : "disappointment",
    r'\bdissappointed\b' : "disappointed",
    r'\bdissappointing\b' : "disappointing",
    r'\bdissatified\b' : "dissatisfied",
    r'\bdiveristy\b' : "diversity",
    r'\bdivison\b' : "division",
    r'\bdivsion\b' : "division",
    r"\bdoens't\b" : "doesn't",
    r"\bdoes't\b" : "doesn't",
    r"\bdoesn;t\b" : "doesn't",
    r"\bdon;t\b" : "don't",
    r'\bdonot\b' : "do not",
    r"\bdosen't\b" : "doesn't",
    r"\bdosent\b" : "doesn't",
    r'\bdumbells\b' : "dumbbells",
    r'\bdurring\b' : "during",
    r"\beatting\b" : "eating",
    r"\beduation\b" : "education",
    r'\beffeciency\b' : "efficiency",
    r'\beffecient\b' : "efficient",
    r'\befficency\b' : "efficiency",
    r'\befficent\b' : "efficient",
    r'\beffiecient\b' : "efficient",
    r'\beimplying\b' : "implying",
    r'\bembarassed\b' : "embarrassed",
    r'\bembarassing\b' : "embarrassing",
    r'\bembarassment\b' : "embarrassment",
    r'\bemploee\b' : "employee",
    r'\bemploye\b' : "employee",
    r'\bemployee\(s\b' : "employees",
    r'\bemployeed\b' : "employed",
    r'\bemployement\b' : "employment",
    r'\bemployes\b' : "employees",
    r'\bemployess\b' : "employees",
    r'\bemplyee\b' : "employee",
    r'\bemplyees\b' : "employees",
    r'\bempolyees\b' : "employees",
    r'\bencoutered\b' : "encountered",
    r'\benought\b' : "enough",
    r'\benrollement\b' : "enrollment",
    r'\benviorment\b' : "environment",
    r'\benviornment\b' : "environment",
    r'\benvirnment\b' : "environment",
    r'\benviroment\b' : "environment",
    r'\benvironement\b' : "environment",
    r'\bequiped\b' : "equipped",
    r'\bespcially\b' : "especially",
    r'\bespecailly\b' : "especially",
    r'\bespecialy\b' : "especially",
    r'\bespeically\b' : "especially",
    r"\besthetically\b" : "aesthetically ",
    r"\bethinicity\b" : "ethnicity",
    r"\bevaulation\b" : "evaluation",
    r"\beventhough\b" : "even though",
    r'\beverday\b' : "every day",
    r'\beverthing\b' : "everything",
    r'\beveryones\b' : "everyones",
    r'\beverythings\b' : "everythings",
    r'\beveryway\b' : "every way",
    r'\beveyone\b' : "everyone",
    r'\beveything\b' : "everything",
    r'\bevrything\b' : "everything",
    r'\bexcelent\b' : "excellent",
    r'\bexcellant\b' : "excellent",
    r'\bexellent\b' : "excellent",
    r'\bexhorbitant\b' : "exorbitant",
    r'\bexistance\b' : "existence",
    r'\bexpecially\b' : "especially",
    r'\bexpensice\b' : "expensive",
    r'\bexpereince\b' : "experience",
    r'\bexperiance\b' : "experience",
    r'\bexperince\b' : "experience",
    r'\bexpierence\b' : "experience",
    r'\bexpirence\b' : "experience",
    r'\bexplaination\b' : "explanation",
    r'\bexremely\b' : "extremely",
    r'\bextemely\b' : "extremely",
    r'\bextention\b' : "extension",
    r'\bextermely\b' : "extremely",
    r'\bextreamly\b' : "extremely",
    r'\bextrememly\b' : "extremely",
    r'\bextremly\b' : "extremely",
    r"\bfacilites\b" : "facilities",
    r'\bfacilties\b' : "facilities",
    r'\bfacilty\b' : "facility",
    r'\bfaculity\b' : "faculty",
    r'\bfacutly\b' : "faculty",
    r'\bfiancial\b' : "financial",
    r"\bfinacial\b" : "financial",
    r"\bfirendly\b" : "friendly",
    r'\bflexability\b' : "flexibility",
    r'\bflexibilty\b' : "flexibility",
    r'\bflexiblity\b' : "flexibility",
    r"\bflourescent\b" : "fluorescent",
    r'\bfreindly\b' : "friendly",
    r'\bfreqency\b' : "frequency",
    r'\bfreqent\b' : "frequent",
    r'\bfriednly\b' : "friendly",
    r'\bfrusterating\b' : "frustrating",
    r'\bfrusturating\b' : "frustrating",
    r'\bfustrating\b' : "frustrating",
    r'\bgovenor\b' : "governor",
    r"\bgraffitti\b" : "graffiti",
    r"\bgrafitti\b" : "graffiti",
    r"\bgreatful\b" : "grateful",
    r"\bguarenteed\b" : "guaranteed",
    r"\bguidlines\b" : "guidelines",
    r"\bguranteed\b" : "guaranteed",
    r"\bhappend\b" : "happened",
    r'\bharrass\b' : "harass",
    r'\bharrassed\b' : "harassed",
    r'\bharrassing\b' : "harassing",
    r'\bharrassment\b' : "harassment",
    r"\bhavn't\b" : "haven't",
    r'\bhealtheir\b' : "healthier",
    r'\bhealthly\b' : "healthy",
    r'\bhealtier\b' : "healthier",
    r'\bhealty\b' : "healthy",
    r'\bheathy\b' : "healthy",
    r'\bheirarchy\b' : "hierarchy",
    r'\bhelful\b' : "helpful",
    r'\bhelpfull\b' : "helpful",
    r'\bhelpul\b' : "helpful",
    r'\bhighschool\b' : "high school",
    r'\bhighschools\b' : "high schools",
    r'\bhorendous\b' : "horrendous",
    r'\bhorible\b' : "horrible",
    r'\bhouseing\b' : "housing",
    r'\bi"m\b' : "i'm",
    r'\bi"ve\b' : "i've",
    r'\bimplimented\b' : "implemented",
    r'\bimporve\b' : "improve",
    r'\bimposible\b' : "impossible",
    r'\bimprovment\b' : "improvement",
    r'\bimprovments\b' : "improvements",
    r'\bincompetant\b' : "incompetent",
    r'\binconsistant\b' : "inconsistent",
    r'\binconveinent\b' : "nconvenient",
    r'\binconvience\b' : "inconvenience",
    r'\binconvienent\b' : "nconvenient",
    r'\binconvienient\b' : "nconvenient",
    r'\binconvient\b' : "nconvenient",
    r'\binconvinient\b' : "nconvenient",
    r'\bindentify\b' : "identify",
    r'\bindependant\b' : "independent",
    r'\bindividual\(s\b' : "individuals",
    r'\binforced\b' : "enforced",
    r'\binformaiton\b' : "information",
    r'\binformtion\b' : "information",
    r'\binfront\b' : "in front",
    r'\binnout\b' : "in-n-out",
    r'\binsentive\b' : "incentive",
    r'\binsufficent\b' : "insufficient",
    r'\binterenet\b' : "internet",
    r'\binterent\b' : "internet",
    r'\bintermural\b' : "intramural",
    r'\bintramurals\b' : "intramurals",
    r'\binvironment\b' : "environment",
    r'\bissue\(s\b' : "issues",
    r'\bit;s\b' : "it's",
    r'\bitem\(s\b' : "items",
    r"\bjob\(s\b" : "jobs",
    r'\bknowledable\b' : "knowledgeable",
    r'\bknowledeable\b' : "knowledgeable",
    r'\bknowledegable\b' : "knowledgeable",
    r'\bknowledgable\b' : "knowledgeable",
    r'\bknowledgably\b' : "knowledgeably",
    r'\bknowledgeably\b' : "knowledgeably",
    r'\bknowledgeble\b' : "knowledgeable",
    r'\bknowlegable\b' : "knowledgeable",
    r'\bknowlegeable\b' : "knowledgeable",
    r'\bliek\b' : "like",
    r'\blieke\b' : "like",
    r'\blimted\b' : "limited",
    r'\bmaintainance\b' : "maintenance",
    r'\bmaintaince\b' : "maintenance",
    r'\bmaintainence\b' : "maintenance",
    r'\bmaintanance\b' : "maintenance",
    r'\bmaintance\b' : "maintenance",
    r'\bmaintanence\b' : "maintenance",
    r'\bmaintenace\b' : "maintenance",
    r'\bmaintenances\b' : "maintenance",
    r'\bmaintence\b' : "maintenance",
    r'\bmaintenece\b' : "maintenance",
    r'\bmaintenence\b' : "maintenance",
    r'\bmaitenance\b' : "maintenance",
    r'\bmanager\(s\b' : "managers",
    r'\bmanagment\b' : "management",
    r'\bmanangement\b' : "management",
    r'\bmangement\b' : "management",
    r'\bmangers\b' : "managers",
    r'\bmanuever\b' : "maneuver",
    r'\bmintues\b' : "minutes",
    r'\bmoblie\b' : "mobile",
    r'\bmulitple\b' : "multiple",
    r'\bn\?a\b' : "n/a",
    r'\bna\b' : "n/a",
    r'\bneccessary\b' : "necessary",
    r'\bnecesary\b' : "necessary",
    r'\bneedes\b' : "needs",
    r'\bneeed\b' : "need",
    r'\bnonexistant\b' : "nonexistent",
    r'\bnothig\b' : "nothing",
    r'\bnothjng\b' : "nothing",
    r'\bnoticable\b' : "noticeable",
    r'\bobsurd\b' : "absurd",
    r'\bocassional\b' : "occasional",
    r'\boccassion\b' : "occasion",
    r'\boccassional\b' : "occasional",
    r'\boccassionally\b' : "occasionally",
    r'\boccassions\b' : "occasions",
    r'\boccations\b' : "occasions",
    r'\boccurances\b' : "occurrences",
    r'\boccured\b' : "occurred",
    r'\boccuring\b' : "occurring",
    r'\boccurr\b' : "occur",
    r'\bofcourse\b' : "of course",
    r'\bofferred\b' : "offered",
    r'\bopinon\b' : "opinion",
    r'\bopitions\b' : "options",
    r'\boportunities\b' : "opportunities",
    r'\bopperation\b' : "operation",
    r'\boppertunities\b' : "opportunities",
    r'\boppinion\b' : "opinion",
    r'\bopportunites\b' : "opportunities",
    r'\bopportunties\b' : "opportunities",
    r'\boppotunities\b' : "opportunities",
    r'\boppurtunities\b' : "opportunities",
    r'\boppurtunity\b' : "opportunity",
    r'\borgnized\b' : "organized",
    r'\boutragous\b' : "outrageous",
    r'\bpage\(s\b' : "pages",
    r'\bpakages\b' : "packages",
    r'\bparkibg\b' : "parking",
    r'\bparkig\b' : "parking",
    r'\bparkign\b' : "parking",
    r'\bparkinglots\b' : "parking lots",
    r'\bpartime\b' : "part-time",
    r'\bparttime\b' : "part-time",
    r'\bpatroling\b' : "patrolling",
    r'\bpeopel\b' : "people",
    r'\bpermitt\b' : "permit",
    r'\bperson\(s\b' : "persons",
    r'\bpersonel\b' : "personnel",
    r'\bpersonell\b' : "personnel",
    r'\bpharamcy\b' : "pharmacy",
    r'\bpleasent\b' : "pleasant",
    r'\bplently\b' : "plenty",
    r'\bplesant\b' : "pleasant",
    r'\bpositon\b' : "position",
    r'\bposses\b' : "possess",
    r'\bpossition\b' : "position",
    r'\bpostion\b' : "position",
    r'\bpostions\b' : "positions",
    r'\bpostition\b' : "position",
    r'\bpostive\b' : "positive",
    r'\bpractioner\b' : "practitioner",
    r'\bpractioners\b' : "practitioners",
    r'\bprefered\b' : "preferred",
    r'\bpreferrably\b' : "preferably",
    r'\bpreform\b' : "perform",
    r'\bpreforming\b' : "performing",
    r'\bpricess\b' : "prices",
    r'\bpriciples\b' : "principles",
    r'\bpricy\b' : "pricey",
    r'\bprking\b' : "parking",
    r'\bproceedures\b' : "procedures",
    r'\bprocurment\b' : "procurement",
    r'\bprofessionaly\b' : "professionally",
    r'\bproffessional\b' : "professional",
    r'\bproffit\b' : "profit",
    r'\bprofitt\b' : "profit",
    r'\bprogam\b' : "program",
    r'\bpromissed\b' : "promised",
    r'\bpublically\b' : "publicly",
    r'\bqucik\b' : "quick",
    r'\bquestion\(s\b' : "questions",
    r'\bquestionaire\b' : "questionnaire",
    r'\breall\b' : "really",
    r'\brealy\b' : "really",
    r'\breccomend\b' : "recommend",
    r'\breccommend\b' : "recommend",
    r'\breceieve\b' : "receive",
    r'\breciept\b' : "receipt",
    r'\breciepts\b' : "receipts",
    r'\brecieve\b' : "receive",
    r'\brecieved\b' : "received",
    r'\brecieves\b' : "receives",
    r'\brecieving\b' : "receiving",
    r'\brecived\b' : "received",
    r'\brecomend\b' : "recommend",
    r'\brecomended\b' : "recommended",
    r'\brediculous\b' : "ridiculous",
    r'\brediculously\b' : "ridiculously",
    r'\brefered\b' : "referred",
    r'\brefering\b' : "referring",
    r'\bregeants\b' : "regents",
    r'\bregistar\b' : "regisrtar",
    r'\bregistars\b' : "regisrtars",
    r'\bregulary\b' : "regularly",
    r'\breimbursment\b' : "reimbursement",
    r'\breponse\b' : "response",
    r'\breponsive\b' : "responsive",
    r'\brepresentitive\b' : "representative",
    r'\breserach\b' : "research",
    r'\bresonable\b' : "reasonable",
    r'\bresouces\b' : "resources",
    r'\bresourses\b' : "resources",
    r'\bresponsed\b' : "responded",
    r'\bresponsibilites\b' : "responsibilites",
    r'\bresponsiblities\b' : "responsibilites",
    r'\bresponsiblity\b' : "responsibility",
    r'\brestaraunts\b' : "restaurants",
    r'\brestraunts\b' : "restaurants",
    r'\brestuarant\b' : "restaurant",
    r'\brestuarants\b' : "restaurants",
    r'\bresturant\b' : "restaurant",
    r'\bresturants\b' : "restaurants",
    r'\bridiculus\b' : "ridiculous",
    r'\briduculous\b' : "ridiculous",
    r'\broomate\b' : "roommate",
    r'\broomates\b' : "roommates",
    r'\bsaleries\b' : "salaries",
    r'\bsandwhich\b' : "sandwich",
    r'\bsandwhiches\b' : "sandwiches",
    r'\bsandwitches\b' : "sandwiches",
    r'\bsatifaction\b' : "satisfaction",
    r'\bsatified\b' : "satisfisatisfieded",
    r'\bsattelite\b' : "satellite",
    r'\bsceience\b' : "science",
    r'\bschedual\b' : "schedule",
    r'\bseemless\b' : "seamless",
    r'\bselction\b' : "selection",
    r'\bsenority\b' : "seniority",
    r'\bsensative\b' : "sensitive",
    r'\bsensored\b' : "censored",
    r'\bseperate\b' : "separate",
    r'\bseperation\b' : "separation",
    r'\bserivce\b' : "service",
    r'\bserivces\b' : "services",
    r'\bserive\b' : "service",
    r'\bserives\b' : "services",
    r'\bservicesi\b' : "services",
    r'\bservidces\b' : "services",
    r'\bservive\b' : "survive",
    r'\bservives\b' : "survives",
    r'\bseverly\b' : "severely",
    r'\bsevice\b' : "service",
    r'\bsevices\b' : "services",
    r'\bshcool\b' : "school",
    r'\bshoud\b' : "should",
    r'\bshoudl\b' : "should",
    r'\bshutttle\b' : "shuttle",
    r'\bsimiliar\b' : "similar",
    r'\bsomeitmes\b' : "sometimes",
    r'\bsomeone\(s\b' : "someones",
    r'\bsomeones\b' : "someones",
    r'\bsometiems\b' : "sometimes",
    r'\bsomone\b' : "someone",
    r'\bsomthing\b' : "something",
    r'\bsophmore\b' : "sophomore",
    r'\bspecialy\b' : "especially",
    r'\bstafff\b' : "staff",
    r'\bstatment\b' : "statement",
    r'\bstong\b' : "strong",
    r'\bstongly\b' : "strongly",
    r'\bstoping\b' : "stopping",
    r'\bstrabucks\b' : "starbucks",
    r'\bstressfull\b' : "stressful",
    r'\bstructure\(s\b' : "structures",
    r'\bstucture\b' : "structure",
    r'\bstuctures\b' : "structures",
    r'\bstuden\b' : "student",
    r'\bstudent\(s\b' : "students",
    r'\bstudetns\b' : "students",
    r'\bstudnet\b' : "student",
    r'\bstudnets\b' : "students",
    r'\bsucess\b' : "success",
    r'\bsudent\b' : "student",
    r'\bsudents\b' : "students",
    r'\bsuperintendant\b' : "superintendent",
    r'\bsuperviser\b' : "supervisor",
    r'\bsupervisor\(s\b' : "supervisors",
    r'\bsupervisores\b' : "supervisors",
    r'\bsuport\b' : "support",
    r'\bsupples\b' : "supplies",
    r'\bsuppossed\b' : "supposed",
    r'\bsuprised\b' : "surprised",
    r'\bsuvey\b' : "survey",
    r'\bsytem\b' : "system",
    r'\bthats\b' : "that's",
    r"\bthe're\b" : "they're",
    r'\btheives\b' : "thieves",
    r'\bthiefs\b' : "thieves",
    r'\bthreating\b' : "threatening",
    r'\bthroughly\b' : "thoroughly",
    r'\bthrought\b' : "throughout",
    r'\bthroughtout\b' : "throughout",
    r'\btodays\b' : "today's",
    r'\btraing\b' : "training",
    r'\btrainning\b' : "training",
    r'\btranfers\b' : "transfers",
    r'\btransfered\b' : "transferred",
    r'\btransfering\b' : "transferring",
    r'\btransporation\b' : "transportation",
    r'\btransportaion\b' : "transportation",
    r'\btransportations\b' : "transportations",
    r'\btransportion\b' : "transportation",
    r'\btrashbags\b' : "trash bags",
    r'\btrashcans\b' : "trash cans",
    r'\btremedously\b' : "tremendously",
    r'\btshirt\b' : "t-shirt",
    r'\btshirts\b' : "t-shirts",
    r'\btution\b' : "tuition",
    r'\btutition\b' : "tuition",
    r'\bunaccessible\b' : "inaccessible",
    r'\bunconvenient\b' : "inconvenient",
    r'\bunecessary\b' : "unnecessary",
    r'\bunflexible\b' : "inflexible",
    r'\bunforseen\b' : "unforeseen",
    r'\buniverisity\b' : "university",
    r'\buniveristy\b' : "university",
    r'\buniverity\b' : "university",
    r'\bunknowledgeable\b' : "unknowledgable",
    r'\bunneccessary\b' : "unnecessary",
    r'\bunrealiable\b' : "unreliable",
    r'\buntill\b' : "until",
    r'\bunversity\b' : "university",
    r'\buseability\b' : "usability",
    r'\busefull\b' : "useful",
    r'\bususally\b' : "usually",
    r'\bvaccum\b' : "vacuum",
    r'\bvaccuum\b' : "vacuum",
    r'\bvaction\b' : "vacation",
    r'\bvacume\b' : "vacuum",
    r'\bvariaty\b' : "variety",
    r'\bvarities\b' : "varieties",
    r'\bvarity\b' : "variety",
    r'\bvegeterian\b' : "vegetarian",
    r'\bvegitarian\b' : "vegetarian",
    r'\bvegitarians\b' : "vegetarians",
    r'\bvegtables\b' : "vegetables",
    r'\bventillation\b' : "ventilation",
    r'\bveriety\b' : "variety",
    r'\bvisted\b' : "visited",
    r'\bvistor\b' : "visitor",
    r'\bvistors\b' : "visitors",
    r'\bweeekends\b' : "weekends",
    r'\bwierd\b' : "weird",
    r'\bwirless\b' : "wireless",
    r'\bwithdrawl\b' : "withdrawal",
    r'\bwoudl\b' : "would",
    r"\bwoudn't\b" : "wouldn't",
    r"\bthier\b" : "their",
    r"\bappartments\b" : "apartments",
    r"\bbenifits\b" : "benefits",
    r"\bexistant\b" : "existent",
    r"\bsaftey\b" : "safety",
    r'\bdon"t\b' : "don't",
}

In [53]:
# misspelled = ['realiable', 'becuase', 'definately', 'consistant']
# correct = ['reliable', 'because', 'definitely', 'consistent']

In [54]:
misspelled = []
correct = []

for k,v in spelling_regex_repls.items():
    tok = k.replace('\\b', '')
    if (not tok in words or not v in words): continue
        
    misspelled.append(tok)
    correct.append(v)

In [55]:
len(misspelled), len(correct)

(602, 602)

In [56]:
knn_idxs, knn_dists = get_token_knns(nms_idx, misspelled)

In [57]:
d = collections.defaultdict(lambda: [])

for i, (idxs, dists) in enumerate(zip(knn_idxs, knn_dists)):
    misspelled_words = []
    for idx, dist in zip(idxs, dists):
        misspelled_words.append(words[idx])
        print(misspelled[i], words[idx], dist)
    
    d[correct[i]] = misspelled_words
    print('')

acctg acctg 0.0
acctg acctng 0.5503459
acctg accouting 0.58349955
acctg controllership 0.61513233
acctg badm 0.6503613
acctg acct 0.6698546
acctg finanacial 0.6915154
acctg buad 0.6962557

bailable bailable -5.9604645e-07
bailable non-bailable 0.47395074
bailable non-cognizable 0.554535
bailable cognisable 0.6377856
bailable extraditable 0.6705159
bailable offence 0.67171013
bailable indictable 0.6843919
bailable cautionable 0.6849176

abilty abilty -3.5762787e-07
abilty ablility 0.3382318
abilty ablity 0.4068669
abilty abillity 0.41447353
abilty abiltiy 0.4155413
abilty abililty 0.4488775
abilty abilitiy 0.49680865
abilty abilityto 0.56547153

absolutly absolutly 2.3841858e-07
absolutly absolutley 0.16695768
absolutly absoultely 0.3163358
absolutly absoloutely 0.33011663
absolutly absoutely 0.3405972
absolutly absolutely 0.35008812
absolutly absolutey 0.3687563
absolutly absoloutly 0.36960322

absoultely absoultely 0.0
absoultely absoloutely 0.22947705
absoultely absoulutely 0.2311791

aparments apartments 0.63556844
aparments apartmens 0.63588846
aparments apartmetns 0.6444322
aparments 2bdr 0.65101755

apparrel 0-home 0.7109467
apparrel lightingcapital 0.7197984
apparrel lightingblauet 0.7217017
apparrel support/services 0.7254604
apparrel clivedon 0.73308146
apparrel environment-oriented 0.73545134
apparrel katholik 0.736058
apparrel lightingcorbett 0.7384646

appartment appartment -1.1920929e-07
appartment apartment 0.36979383
appartment townhouse 0.45873082
appartment aparment 0.48067194
appartment appartments 0.4808811
appartment condo 0.4918595
appartment annexe 0.5096849
appartment furnished 0.51267135

appriciate appriciate -4.7683716e-07
appriciate apreciate 0.2563212
appriciate appreicate 0.32483882
appriciate appreaciate 0.34382772
appriciate apperciate 0.36087972
appriciate appricate 0.36966002
appriciate appretiate 0.37237668
appriciate apprciate 0.3823955

assitance assitance -1.1920929e-07
assitance assistence 0.34634817
assitance asistance 0.42652136

compatability compatibilities 0.53086215

compatable compatable -2.3841858e-07
compatable compatiable 0.35565203
compatable compatible 0.39269692
compatable compatability 0.42521966
compatable compatibility 0.51012963
compatable backwards-compatible 0.5542185
compatable backward-compatible 0.5635599
compatable upgradeable 0.58643734

competative competative 1.1920929e-07
competative competetive 0.11851293
competative competitve 0.1756897
competative competive 0.17786181
competative competitive 0.45237923
competative resonable 0.5129019
competative competitively 0.51454145
competative cut-throat 0.51509476

competetive competetive 1.7881393e-07
competetive competative 0.11851293
competetive competitve 0.16228294
competetive competive 0.17993402
competetive competitive 0.41616786
competetive cut-throat 0.44648957
competetive competitively 0.48608547
competetive petitive 0.5390973

competive competive 1.7881393e-07
competive competative 0.17786181
competive competetive 0.17993402
competiv

dissapointed bummed 0.32934564
dissapointed dissapointing 0.34173447
dissapointed disheartened 0.3649621

dissapointing dissapointing 5.9604645e-08
dissapointing disapointing 0.18006128
dissapointing dissappointing 0.25348872
dissapointing underwhelming 0.31537294
dissapointing dissapointed 0.34173447
dissapointing disappointing 0.36424315
dissapointing dissappointed 0.37605578
dissapointing suprising 0.38572723

dissapointment dissapointment 5.9604645e-08
dissapointment disapointment 0.15767837
dissapointment dissappointment 0.25255597
dissapointment disappointment 0.3391925
dissapointment embarrasment 0.46271276
dissapointment letdown 0.46586156
dissapointment disappoinment 0.49030894
dissapointment anti-climax 0.5025642

dissappointed dissappointed -1.1920929e-07
dissappointed dissapointed 0.12008631
dissappointed disapointed 0.24984682
dissappointed suprised 0.32429636
dissappointed disheartened 0.35350287
dissappointed surprized 0.35786116
dissappointed dissapointing 0.37605578
di

experince expereince 0.2810557
experince experence 0.31866688
experince expirience 0.35047734
experince experiece 0.3556533
experince expierence 0.36952055
experince exprience 0.37395918

expierence expierence 1.7881393e-07
expierence experence 0.27884263
expierence experiance 0.28306282
expierence expirience 0.31483567
expierence expereince 0.32466233
expierence experiece 0.32705277
expierence expierience 0.35359228
expierence experince 0.36952055

expirence expirence 1.1920929e-07
expirence expirience 0.33799326
expirence experence 0.35521322
expirence exprience 0.3802606
expirence experiece 0.39487857
expirence expierence 0.4101805
expirence experiance 0.4134521
expirence expereince 0.41675115

explaination explaination 1.1920929e-07
explaination explination 0.22668397
explaination explantion 0.34213227
explaination explanation 0.40514016
explaination explenation 0.42001122
explaination explainations 0.44365954
explaination discription 0.47616416
explaination explanations 0.4822132


inforced inacted 0.6144527
inforced reenforced 0.61689967
inforced well-enforced 0.61982524

informaiton informaiton -2.3841858e-07
informaiton infromation 0.2882889
informaiton informaton 0.36863446
informaiton inforamtion 0.36933708
informaiton infomration 0.38610983
informaiton infomation 0.4066627
informaiton inormation 0.42274165
informaiton informatin 0.4401304

informtion informtion 1.1920929e-07
informtion infromation 0.51929855
informtion informaton 0.5330915
informtion informaiton 0.5344235
informtion informatin 0.54632986
informtion inforamtion 0.55704737
informtion infomration 0.5940105
informtion iformation 0.60339266

insentive insentive 1.1920929e-07
insentive dis-incentive 0.58088195
insentive insiration 0.60532844
insentive consessions 0.6057359
insentive enouth 0.6198218
insentive opputunity 0.62364113
insentive ennough 0.6282896
insentive reason/excuse 0.62924904

insufficent insufficent 0.0
insufficent insuficient 0.47852606
insufficent insuffient 0.48664588
insuffi

occassionally occationally 0.4230395
occassionally sporadically 0.45951724

occassions occassions 0.0
occassions ocassions 0.24788773
occassions occassion 0.30997872
occassions occasions 0.35755497
occassions occaisions 0.38849598
occassions ocassion 0.44454175
occassions ocasions 0.46431398
occassions occations 0.47671992

occations occations 4.7683716e-07
occations ocasions 0.4282354
occations ocassions 0.4313522
occations occaisions 0.4500934
occations occassions 0.47671992
occations occation 0.5613758
occations ocassion 0.5625797
occations occassion 0.57153386

occurances occurances 0.0
occurances occurrences 0.4286487
occurances occurences 0.43870324
occurances occurence 0.47306925
occurances occurance 0.49938112
occurances occurrances 0.577159
occurances ,159,795 0.5889991
occurances occurrance 0.5943202

occured occured 0.0
occured occurred 0.24911875
occured happened 0.3832515
occured occuring 0.38831353
occured happend 0.41943836
occured transpired 0.42201018
occured occur 0.4

progam programe 0.4751203
progam progrm 0.49817276
progam program.the 0.5127846
progam programm 0.5195595
progam programes 0.5211005

promissed promiced 0.57238334
promissed come.he 0.63839984
promissed were.i 0.64399207
promissed given.i 0.6507928
promissed wanted.i 0.65813327
promissed seeemed 0.6654482
promissed before.i 0.6724782
promissed come.i 0.6736932

publically publically 2.3841858e-07
publically publicly 0.22289878
publically openly 0.4569065
publically privately 0.4968058
publically divulged 0.5122688
publically publicy 0.51901287
publically disclosing 0.5293335
publically candidly 0.53346336

qucik qucik 1.1920929e-07
qucik quck 0.5675668
qucik qiuck 0.5751674
qucik qick 0.60705864
qucik super-quick 0.62174135
qucik quicky 0.64289606
qucik guick 0.6627713
qucik aquick 0.66886336

questionaire questionaire 0.0
questionaire questionnaire 0.4228912
questionaire questionnaires 0.49222565
questionaire questionaires 0.52297866
questionaire questionairre 0.5269037
questionaire s

shoudl shoud 0.22597998
shoudl shold 0.23430938
shoudl woudl 0.28208888
shoudl shuld 0.30799347
shoudl woud 0.32587028
shoudl whould 0.32648504
shoudl willl 0.3266182

similiar similiar 1.7881393e-07
similiar simular 0.2677228
similiar similair 0.40308022
similiar simmilar 0.41257727
similiar similer 0.4427514
similiar similar 0.4438697
similiar simliar 0.45220292
similiar simillar 0.49550718

someitmes someitmes 1.1920929e-07
someitmes unforunatly 0.46084297
someitmes sometiems 0.4636168
someitmes ufortunately 0.47277808
someitmes someimes 0.47301364
someitmes serioulsly 0.47387207
someitmes hoenstly 0.47749937
someitmes someties 0.48174524

someones someones 5.9604645e-08
someones elses 0.2526747
someones anyones 0.33306843
someones everyones 0.3432666
someones ur 0.3884642
someones thier 0.4212255
someones somebody 0.43279415
someones someone 0.4341377

sometiems sometiems -4.7683716e-07
sometiems someimes 0.3723632
sometiems soemtimes 0.40518695
sometiems somethimes 0.44015038
some

univerity universtity 0.44173968

unknowledgeable unknowledgeable 1.1920929e-07
unknowledgeable unknowledgable 0.4658481
unknowledgeable ill-educated 0.48626685
unknowledgeable under-informed 0.51526946
unknowledgeable undereducated 0.5180254
unknowledgeable under-educated 0.52069795
unknowledgeable ill-informed 0.526215
unknowledgeable uninformed 0.5421425

unneccessary unneccessary -2.3841858e-07
unneccessary unecessary 0.16425157
unneccessary unneccesary 0.23049235
unneccessary un-necessary 0.25670117
unneccessary unnecesary 0.28049332
unneccessary un-needed 0.32707262
unneccessary uneccessary 0.3351977
unneccessary unncessary 0.3554626

unrealiable unrealiable 2.3841858e-07
unrealiable un-reliable 0.47735244
unrealiable inacurrate 0.5308132
unrealiable extremenly 0.53182054
unrealiable incredibally 0.53584135
unrealiable idoitic 0.5360278
unrealiable incredebly 0.53652024
unrealiable incomptent 0.5381913

untill untill 0.0
untill till 0.19130045
untill until 0.29078698
untill til 0

Replace the nearest misspelled words with their word vectors

In [58]:
d = { k: [vectors[words.index(w)] for w in v] for k,v in d.items() }

Create the ensembled tranformation vector

In [59]:
xform_vecs = []

for k,v in d.items():
    if (not k in words): continue
        
    correct_word = vectors[words.index(k)]
    xform_vecs.append((v - correct_word).mean(axis=0))

In [60]:
xform_vec = np.array(xform_vecs).mean(axis=0)

In [61]:
xform_vec.shape

(300,)

Put the results into a DataFrame for further analysis.

You can use the ensembed vector *or* the one-word vector below, though the former outperforms the later by finding 98% of the misspellings vs. just 87%

In [62]:
results = []

for w, mw in zip(correct, misspelled):
    if (not w in words): continue
            
    new_vec = vectors[words.index(w)] - xform_vec #one_xform_vec
    knn_idxs, knn_dists = get_knns(nms_idx, np.expand_dims(new_vec, 0))
    nns = [[ words[idx] for idx in idxs ] for idxs in knn_idxs ][0]
    
    results.append({ 
        'incorrect' : mw, 
        'correct' : w, 
        'neighbors' : nns,
        'found' : w in nns, 
        'pos_found' : nns.index(w) if w in nns else -1
    })

In [63]:
df = pd.DataFrame(results, columns=['incorrect', 'correct', 'neighbors', 'found', 'pos_found'])

In [64]:
df.head()

Unnamed: 0,incorrect,correct,neighbors,found,pos_found
0,acctg,acct,"[acct, account, accounts, accounting, credit, management, banking, financial]",True,0
1,bailable,available,"[available, only, offered, be, offer, provided, provide, will]",True,0
2,abilty,ability,"[ability, able, can, could, certain, allow, cannot, however]",True,0
3,absolutly,absolutely,"[absolutely, truly, totally, simply, certainly, really, nothing, sure]",True,0
4,absoultely,absolutely,"[absolutely, truly, totally, simply, certainly, really, nothing, sure]",True,0


In [65]:
df_found = df[df.found == True]
df_notfound = df[df.found == False]

print(f'Total: {len(df)} | Found: {len(df_found)} | Not Found: {len(df_notfound)}')
print(f'Found %: {len(df_found) / len(df)}')

Total: 602 | Found: 584 | Not Found: 18
Found %: 0.9700996677740864
