# Embeddings 

Learning objective:
(i) implement and evaluate a precursor to modern word2vec embeddings.


# Word Embeddings 
For this first part, we're going to implement a word embedding approach that is a bit simpler than word2vec. The key idea is to look at co-occurrences between center words and context words (somewhat like in word2vec) but without any pesky learning of model parameters.

## Loading the Brown Corpus

The dataset for this part is the (in)famous [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus), a collection of text samples from a wide range of sources totaling roughly one million words. Conveniently, the Brown corpus ships with nltk. *Make sure you have already installed nltk with something like: `conda install nltk`*

In [3]:
import nltk
nltk.download('brown')
nltk.download('stopwords')
from nltk.corpus import brown
from nltk.corpus import stopwords
import re
import numpy as np

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\ronak\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ronak\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Once you have it locally, you can load the dataset into your notebook. You can access the words using `brown.words()`, which returns the corpus as a flat sequence of tokens.

## 1.1 Dataset Pre-processing
OK, now we need to do some basic pre-processing. For this part we should:

* Remove stopwords and punctuation.
* Make everything lowercase.

Then, count how often each word occurs. We define the 5,000 most frequent words as our vocabulary (V) and the 1,000 most frequent words as our context (C). Include a print statement below to show the top-20 words after pre-processing.
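As a rough sketch of these steps on a toy example: the token list and the two-word stopword set below are made up for illustration (the notebook itself uses `brown.words()` and NLTK's English stopword list), and `preprocess` is a hypothetical helper name. Unlike the cell that follows, this variant deletes punctuation characters instead of replacing them with spaces, so the resulting tokens never contain internal spaces.

```python
import re
from collections import Counter

# Hypothetical toy inputs, standing in for brown.words() and NLTK's stopword list
toy_tokens = ["The", "cat", "sat", ",", "and", "the", "Cat", "couldn't", "sleep", "."]
toy_stop_words = {"the", "and"}

def preprocess(tokens, stop_words):
    """Lowercase tokens, delete punctuation characters, drop stopwords and empty tokens."""
    cleaned = []
    for token in tokens:
        token = re.sub(r"[^\w\s]", "", token.lower()).strip()
        if token and token not in stop_words:
            cleaned.append(token)
    return cleaned

toy_processed = preprocess(toy_tokens, toy_stop_words)
toy_counts = Counter(toy_processed)

# most_common(n) returns the n most frequent words, most frequent first;
# map each of them to a row index, as the notebook does for V and C
toy_vocab = {word: idx for idx, (word, _) in enumerate(toy_counts.most_common(3))}
print(toy_counts.most_common(3))
```

With the real corpus, the same pattern would use `most_common(5000)` for V and `most_common(1000)` for C.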

In [143]:

stop_words = set(stopwords.words('english'))
word_dict = {}
processed = []
processed_key= {}
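for word in brown.words():
    word = word.lower()
    # Replace each punctuation character with a space. Tokens with internal
    # punctuation keep that space (e.g. "couldn't" becomes "couldn t").
    word = re.sub(r'[^\w\s]',' ',word)
    # Skip tokens that consisted of only one or two punctuation characters
    if word!=' ' and word!='  ':
        if word not in stop_words:
            processed.append(word)
            # Count occurrences of each remaining token
            word_dict[word] = word_dict.get(word, 0) + 1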

import operator
sorted_d = sorted(word_dict.items(), key=operator.itemgetter(1),reverse=True)

for key,value in sorted_d[0:20]:
    print(key, '-->', value)
i = 0    
V = {}
for key,value in sorted_d[0:5000]:
    V[key] = i
    i = i+1
j = 0
C = {}   
for key,value in sorted_d[0:1000]:
    C[key] = j
    j = j+1
    
rev_V = {v: k for k, v in V.items()}

    

one --> 3292
would --> 2714
said --> 1961
new --> 1635
could --> 1601
time --> 1598
two --> 1412
may --> 1402
first --> 1361
like --> 1292
man --> 1207
even --> 1170
made --> 1125
also --> 1069
many --> 1030
must --> 1013
af --> 996
back --> 966
years --> 950
much --> 937


In [144]:
print(len(C))
print(len(V))

1000
5000


## 1.2 Building the Co-occurrence Matrix 

For each word in the vocabulary (w), we want to calculate how often context words from C appear in its surrounding window of size 4 (two words before and two words after).

In other words, we need to define a co-occurrence matrix that has a dimension of |V|x|C| such that each cell (w,c) represents the number of times c occurs in a window around w. 
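To make the windowing concrete, here is a minimal sketch on a five-token toy sequence. The function name `cooccurrence_matrix`, its `half_window` parameter, and the toy index dictionaries are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def cooccurrence_matrix(tokens, vocab_index, context_index, half_window=2):
    """Count how often each context word appears within half_window tokens of each vocabulary word."""
    counts = np.zeros((len(vocab_index), len(context_index)))
    for i, word in enumerate(tokens):
        if word not in vocab_index:
            continue
        # Clip the window at the start and end of the token sequence
        lo = max(0, i - half_window)
        hi = min(len(tokens), i + half_window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in context_index:
                counts[vocab_index[word], context_index[tokens[j]]] += 1
    return counts

# Hypothetical toy data: three vocabulary words, two of which are also context words
toy_tokens = ["a", "b", "c", "a", "b"]
toy_V = {"a": 0, "b": 1, "c": 2}
toy_C = {"a": 0, "b": 1}
toy_mat = cooccurrence_matrix(toy_tokens, toy_V, toy_C)
print(toy_mat)
```

Here the row for "c" is `[2, 2]`, because both occurrences of "a" and both occurrences of "b" fall within two positions of the single "c".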

In [166]:

import numpy as np
mat = np.zeros((len(V),len(C)))
for i in range(0,len(processed)):
    if processed[i] in V:
        # Collect up to two tokens on each side of position i,
        # clipping the window at the start and end of the corpus
        left = max(0, i-2)
        right = min(len(processed), i+3)
        context_list = processed[left:i] + processed[i+1:right]
        for cw in context_list:
            if cw in C:
                mat[V[processed[i]]][C[cw]] += 1
print(mat)
print(mat.shape)


# import numpy as np
# mat = np.zeros((len(V),len(C)))
# for i in range(0,len(processed)):
#     #print(processed[i])
#     if processed[i] in V:
#         context_list = []
#         if i>=2:
#             left = i-2
#         elif i==1:
#             left = i-1
#         else:
#             left = i
#         if i<len(processed)-2:
#             right = i+2
#         elif i==len(processed)-2:
#             right = i+1
#         else:
#             right = i
#         for j in range(left,right+1):
#             if j!=i:
#                 if processed[j] in C:
#                     mat[V[processed[i]]][C[processed[j]]] = mat[V[processed[i]]][C[processed[j]]] + 1
# print(mat)
# print(mat.shape)



# import numpy as np
# import collections

# mat = np.zeros((len(V),len(C)))

# pairs = []

# for i in range(0,len(processed)):
#     if((i-2)>=0):
#         pairs.append((processed[i-2], processed[i]))
#     if((i-1)>=0):
#         pairs.append((processed[i], processed[i-1]))
#     if((i+1)<len(processed)):
#         pairs.append((processed[i], processed[i+1]))
#     if((i+2)<len(processed)):
#         pairs.append((processed[i], processed[i+2]))

# count = collections.Counter(pairs)


# for i in range(0,len(V)):
#     for j in range(0,len(C)):
#         search = (V[processed[i]],C[processed[j]])
#         mat[i][j] = count(search)

# print (mat)

[[82. 76. 52. ...  3.  1.  2.]
 [76. 70. 68. ...  1.  1.  0.]
 [52. 68. 34. ...  0.  0.  0.]
 ...
 [ 1.  1.  0. ...  0.  0.  0.]
 [ 0.  1.  1. ...  0.  0.  0.]
 [ 1.  0.  0. ...  0.  0.  0.]]
(5000, 1000)


## 1.3 Probability Distribution

Using the co-occurrence matrix, we can compute the probability distribution Pr(c|w) of context word c around w as well as the overall probability distribution of each context word c with Pr(c).  
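Both distributions are normalizations of the count matrix: Pr(c|w) divides each row by its row total, and Pr(c) divides each column total by the grand total. A small vectorized sketch, using a made-up 3x2 count matrix (the `toy_` names are assumptions for illustration):

```python
import numpy as np

# Hypothetical toy co-occurrence counts: 3 vocabulary words x 2 context words
toy_mat = np.array([[0.0, 3.0],
                    [3.0, 0.0],
                    [2.0, 2.0]])

# Pr(c|w): normalize each row; rows with no observed context stay all-zero
row_totals = toy_mat.sum(axis=1, keepdims=True)
toy_pr_cw = np.divide(toy_mat, row_totals,
                      out=np.zeros_like(toy_mat), where=row_totals > 0)

# Pr(c): column totals divided by the grand total
toy_pr_c = toy_mat.sum(axis=0) / toy_mat.sum()

print(toy_pr_cw)
print(toy_pr_c)
```

The `where=` guard is a design choice for this sketch: it avoids a division by zero for any vocabulary word that never has a context word in its window.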

In [167]:
print ('Pr(c|w)')
pr_cw = np.zeros((len(V),len(C)))
for i in range(0,len(mat)):
    # Normalize row i: count of c around w divided by the total count of
    # context words observed around w
    pr_cw[i] = mat[i]/np.sum(mat[i])
print (pr_cw)


Pr(c|w)
[[0.01294601 0.01199874 0.00820966 ... 0.00047363 0.00015788 0.00031576]
 [0.01460415 0.01345119 0.01306687 ... 0.00019216 0.00019216 0.        ]
 [0.01434879 0.0187638  0.0093819  ... 0.         0.         0.        ]
 ...
 [0.03125    0.03125    0.         ... 0.         0.         0.        ]
 [0.         0.02631579 0.02631579 ... 0.         0.         0.        ]
 [0.03030303 0.         0.         ... 0.         0.         0.        ]]


In [168]:
# Pr(c): total count of context word c (column sum) divided by the grand total
pr_c = np.sum(mat, axis=0)/ np.sum(mat)
print(pr_c)

[0.01382866 0.01157575 0.00810964 0.006946   0.00696528 0.00696666
 0.00589804 0.0059545  0.00587463 0.0050952  0.00512412 0.00479775
 0.0046325  0.00433781 0.00439978 0.00441217 0.004218   0.00411885
 0.00425243 0.00409269 0.00385308 0.00384344 0.00377183 0.00313562
 0.00347989 0.0035212  0.00351294 0.00353635 0.00339175 0.00324854
 0.00339175 0.0032389  0.00325542 0.00312873 0.00340828 0.00305575
 0.0030406  0.00288499 0.00287948 0.00313286 0.00295522 0.00295384
 0.00287259 0.00273764 0.00261233 0.00289463 0.00266741 0.00260819
 0.00278997 0.00256688 0.00260682 0.00242091 0.00274177 0.00256826
 0.00273902 0.00243606 0.00247049 0.00239612 0.00245671 0.00242366
 0.00223914 0.00233553 0.00232865 0.00224327 0.00218818 0.00181913
 0.00226805 0.00223776 0.00210005 0.00217028 0.00213723 0.002151
 0.00206838 0.0020188  0.00215513 0.00214963 0.00221573 0.00209592
 0.00218818 0.00197887 0.00199952 0.0019706  0.00201743 0.00191139
 0.00196234 0.0019706  0.00187834 0.00186319 0.00174889 0.001875

## 1.4 Embedding Representation

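Now we can represent each vocabulary word w as a |C|-dimensional vector, with one component per context word c, using this equation:

Vector(w)[c] = max(0, log (Pr(c|w)/Pr(c)))

This is a traditional approach based on *pointwise mutual information* (PMI) that pre-dates word2vec by some time. Because negative values are clipped to zero, this variant is usually called positive PMI (PPMI).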
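A vectorized sketch of this formula on toy probabilities follows. The function name `ppmi` and the toy arrays are assumptions for illustration; the sketch uses the natural logarithm, which differs from a base-10 logarithm only by a constant factor, and it assumes every entry of Pr(c) is positive.

```python
import numpy as np

# Hypothetical toy distributions: 3 vocabulary words x 2 context words
toy_pr_cw = np.array([[0.0, 1.0],
                      [1.0, 0.0],
                      [0.5, 0.5]])
toy_pr_c = np.array([0.5, 0.5])

def ppmi(pr_cw, pr_c):
    """Positive PMI: max(0, log(Pr(c|w) / Pr(c))), with unseen pairs mapped to 0."""
    ratio = pr_cw / pr_c          # broadcasts pr_c across the rows
    pmi = np.zeros_like(ratio)
    # Take the log only where the ratio is positive; other cells stay 0
    np.log(ratio, out=pmi, where=ratio > 0)
    return np.maximum(pmi, 0.0)

toy_vectors = ppmi(toy_pr_cw, toy_pr_c)
print(toy_vectors)
```

In this toy case the third row is all zeros: its context distribution equals Pr(c), so no context word is more likely than chance.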

In [169]:
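# Positive PMI for every (w, c) cell. log10 is used here; a different base
# would only rescale the values.
import math
vector = np.zeros((len(V),len(C)))
for i in range(0,len(V)):
    for j in range(0,len(C)):
        # log(0) is undefined, so pairs that never co-occur are set to 0
        if pr_cw[i][j] == 0:
            vector[i][j] = 0
        else:
            vector[i][j] = max(0,math.log10((pr_cw[i][j])/pr_c[j]))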
print (vector)
print(vector.shape)

[[0.         0.01558624 0.00532383 ... 0.04793258 0.         0.        ]
 [0.02369618 0.06521146 0.20717017 ... 0.         0.         0.        ]
 [0.01603503 0.20977144 0.06328927 ... 0.         0.         0.        ]
 ...
 [0.3540699  0.43130073 0.         ... 0.         0.         0.        ]
 [0.         0.35666711 0.51121495 ... 0.         0.         0.        ]
 [0.34070594 0.         0.         ... 0.         0.         0.        ]]
(5000, 1000)


## 1.5 Analysis

So now we have some embeddings for each word. But are they meaningful? For this part, we:

- First, cluster the vocabulary into 100 clusters using k-means. Look over the words in each cluster. Can we see any relation between words?

- Second, for the top-20 most frequent words, find the nearest neighbors using cosine distance (1 - cosine similarity). Do the findings make sense? Discuss.
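For the second step, cosine distance can also be computed directly with NumPy. The sketch below is one possible way to do it, not the notebook's method; `nearest_neighbors` is a hypothetical helper, and the four toy words and their vectors are invented for illustration.

```python
import numpy as np

def nearest_neighbors(vectors, query_row, k=2):
    """Return the indices of the k rows closest to query_row by cosine distance."""
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0               # avoid dividing by zero for all-zero rows
    unit = vectors / norms[:, None]       # scale every row to unit length
    distances = 1.0 - unit @ unit[query_row]
    order = np.argsort(distances, kind="stable")
    # Skip the query itself, which is always at distance 0
    return [int(j) for j in order if j != query_row][:k]

# Hypothetical toy embeddings
toy_words = ["two", "three", "home", "away"]
toy_vectors = np.array([[1.0, 0.9, 0.0],
                        [0.9, 1.0, 0.1],
                        [0.0, 0.1, 1.0],
                        [0.1, 0.0, 0.9]])

toy_neighbors = nearest_neighbors(toy_vectors, query_row=0, k=1)
print(toy_words[0], "->", [toy_words[j] for j in toy_neighbors])
```

Because cosine distance ignores vector length, frequent and rare words can be compared on the direction of their context profiles alone.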

In [170]:

vocab= []
for key,value in V.items():
    vocab.append(key)
from sklearn.cluster import KMeans
model = KMeans(n_clusters = 100, random_state = 0 ).fit(vector)


In [176]:
labels = model.predict(vector)
print(labels.shape)
#print(model.labels_.tolist())

V_ = dict((v,k) for k,v in V.items())

similar = {}
for l in range(len(labels)):
    if(labels[l] in similar.keys()):
        similar[labels[l]].append(l)
    else:
        similar[labels[l]] = [l]
print ("similar words:")
print ("Label \t list of words")
for key, value in similar.items():
    #print(value)
    print(key,"\t->\t",end="")
    for v in value:
        print(V_[v],",",end="")
    print('\n')

# from collections import defaultdict

# V_ = dict((v,k) for k,v in V.items())
# cluster = defaultdict(list)
# for i in range(0,len(labels.tolist())):
#     cluster[labels.tolist()[i]].append(V_[i])

# for item in cluster.items():
#     print (item)
    
print('we can observe that two words are similar if they occur in the same context')
print('would ,said ,could are in the same cluster')
print('two ,years ,three ,several ,four ,five ,six ,hundred ,ten ,couple ,seven ,eight are in the same cluster ')
print ('mr  ,president ,kennedy ,former ,de ,chief ,king ,mary ,governor ,judge ,leader ,daughter ,morgan ,frank ,honor')
print ('state ,states ,united ,public ,government ,program ,national ,development ,area ,service ,local ,federal ')

(5000,)
similar words:
Label 	 list of words
40 	->	one ,time ,man ,learned ,marriage ,pass ,names ,spoke ,patient ,captain ,strange ,mark ,remembered ,forced ,explained ,lady ,sam ,looks ,secret ,believed ,memory ,dog ,murder ,removed ,rich ,discovered ,die ,wine ,boat ,mercer ,beauty ,song ,realize ,willing ,weather ,sweet ,realized ,animal ,jury ,aside ,occurred ,wonder ,answered ,gets ,relief ,hanover ,dream ,wind ,jesus ,grow ,speaking ,save ,careful ,pleasure ,eat ,keeping ,refused ,orchestra ,fat ,taste ,lie ,songs ,struck ,negroes ,snow ,mine ,painting ,neighborhood ,surprised ,understood ,perfect ,occasion ,smile ,orders ,lose ,wondered ,informed ,artist ,birds ,appearance ,flowers ,enjoyed ,uncle ,bear ,vision ,wild ,palmer ,bought ,pure ,loved ,joe ,minute ,inner ,sounds ,unable ,faced ,soil ,thoughts ,experienced ,handle ,bitter ,band ,practically ,hero ,trust ,advice ,dozen ,indian ,flesh ,estate ,david ,surprise ,player ,agree ,cousin ,begun ,easier ,choose ,knowing ,prou

43 	->	even ,made ,well ,make ,still ,work ,another ,might ,since ,used ,use ,without ,place ,small ,found ,part ,high ,every ,number ,course ,though ,less ,put ,almost ,enough ,far ,yet ,set ,end ,called ,point ,give ,possible ,second ,often ,case ,large ,need ,children ,best ,least ,mind ,others ,although ,kind ,different ,began ,whole ,matter ,perhaps ,times ,line ,name ,example ,show ,whether ,gave ,today ,either ,quite ,seen ,death ,body ,half ,word ,field ,words ,already ,together ,money ,held ,keep ,probably ,seems ,cannot ,air ,making ,brought ,known ,position ,reason ,job ,close ,turn ,true ,full ,seem ,age ,following ,sometimes ,clear ,land ,able ,music ,child ,run ,short ,outside ,usually ,top ,sound ,strong ,surface ,lines ,book ,idea ,english ,alone ,living ,longer ,cut ,finally ,third ,expected ,needed ,kept ,space ,complete ,except ,hope ,beyond ,stage ,read ,person ,material ,instead ,lost ,heart ,low ,added ,feeling ,makes ,move ,simply ,hold ,actually ,beginning ,sort

41 	->	day ,last ,home ,night ,later ,next ,days ,early ,week ,office ,morning ,late ,hours ,hour ,summer ,evening ,hot ,spring ,hotel ,afternoon ,spent ,sunday ,dinner ,died ,yesterday ,winter ,session ,november ,monday ,saturday ,tomorrow ,cool ,friday ,tuesday ,spend ,breakfast ,tired ,o clock ,a m  ,scheduled ,supper ,wednesday ,thursday ,storm ,noon ,drank ,

82 	->	year ,period ,total ,rate ,million ,months ,miles ,1960 ,weeks ,nearly ,1961 ,average ,month ,daily ,march ,fiscal ,length ,30 ,cars ,cities ,season ,date ,1959 ,rates ,june ,20 ,1958 ,cover ,previous ,slightly ,25 ,file ,maximum ,100 ,gain ,april ,compared ,approximately ,calls ,initial ,estimated ,july ,gross ,page ,passing ,portion ,billion ,11 ,december ,sets ,prices ,50 ,ended ,requires ,september ,18 ,divided ,runs ,games ,payment ,january ,august ,investigation ,october ,uniform ,extra ,towns ,13 ,millions ,24 ,1957 ,21 ,sold ,prior ,23 ,recommended ,22 ,dollar ,60 ,1954 ,decade ,shares ,equivalent ,residential 

In [180]:
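from sklearn.neighbors import NearestNeighbors
# Note: the index is fit on the raw co-occurrence counts in `mat`;
# fit on `vector` instead to search the PPMI embeddings from 1.4
neigh = NearestNeighbors(n_neighbors=1, metric = 'cosine')
neigh.fit(mat)  

# Sanity check: the 5 nearest rows to row 2; the first hit is the query row itself
a = neigh.kneighbors([mat[2]], 5, return_distance=False)
print (a)
print (a[0][1])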


'''
i = 0
processed_dict = {}
for word in processed:
    if word not in processed_dict:
        processed_key[i] = word
        i= i+1

processed_key_ = {}
processed_key_ = dict((v,k) for k,v in processed_key.items())
'''   
print(V['one'])
for key,value in sorted_d[0:20]:
    #print (key)
    k = neigh.kneighbors([mat[V[key]]], 5, return_distance=False)
    
    #print(k[0][1])
    print(key,V_[k[0][1]],V_[k[0][2]],V_[k[0][3]],V_[k[0][4]])


[[  2 105  69   9 232]]
105
0
one still made first good
would must could much said
said told say like really
new city yankees central editor
could would couldn t way wanted
time long first place period
two four three seven several
may must might also would
first time one long every
like say said never thought
man woman one good thought
even still much would thought
made one make also still
also may made one well
many two several people among
must would may good way
af polynomial operator bond equation
back home away around along
years weeks days months centuries
much would little said better


*Insert discussion here*