source: https://deeplearningcourses.com/c/data-science-natural-language-processing-in-python <br>
source: https://www.udemy.com/data-science-natural-language-processing-in-python <br>
Author: http://lazyprogrammer.me <br>
Disclaimer: I've copied this to go line-by-line to understand what's going on here. I take no credit for this work.

### Overview
- Given pre-trained GloVe embeddings
    - GloVe from Stanford NLP has 400k words in vocab versus 3 million for word2vec from Google
- Pretty straightforward code to find similar words and build understandable analogies, which shows that the embeddings capture meaningful relationships
- Using the 50-dimension vectors here, so the 100- or 200-dimension embeddings may do better

### Notes
- Keep track of whether a numpy array has shape (number,) or (number, 1). It matters, and often requires reshaping.
- '##' will denote my comments
- '#' are lazyprogrammer's comments
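
A minimal sketch of why the (number,) versus (number, 1) distinction matters. The array values here are made up for illustration and are not from the GloVe file.

```python
import numpy as np

v = np.arange(3, dtype="float32")   # shape (3,): a 1-D vector
col = v.reshape(3, 1)               # shape (3, 1): a 2-D column
row = v.reshape(1, 3)               # shape (1, 3): a 2-D row

print(v.shape, col.shape, row.shape)

# Broadcasting makes the difference matter: adding a (3,) array to a
# (3, 1) array expands to a (3, 3) matrix instead of an elementwise sum.
summed = v + col
print(summed.shape)

# The dot product of two 1-D vectors is a scalar.
dot = v.dot(v)
print(dot)
```

This is the reason `nearest_neighbors` reshapes `v` to `(1, D)` before handing it to sklearn, which expects 2-D inputs.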

In [1]:
from future.utils import iteritems
## iteritems from future.utils lets the same dictionary iteration run on Python 2 and Python 3

In [2]:
import numpy as np 
from sklearn.metrics.pairwise import pairwise_distances

## why not also: from sklearn.metrics.pairwise import cosine_distances ?
## cosine distance is 1 - cosine similarity

## why not also: from sklearn.metrics.pairwise import euclidean_distances ?
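
On the question of importing `cosine_distances` and `euclidean_distances` directly: a quick check, on random stand-in data rather than the real embeddings, suggests they give the same numbers as `pairwise_distances` with the matching `metric` string. The array sizes below are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    cosine_distances,
    euclidean_distances,
    pairwise_distances,
)

rng = np.random.default_rng(0)
X = rng.normal(size=(1, 50))    # one query vector, 2-D as sklearn expects
Y = rng.normal(size=(10, 50))   # ten stand-in "embedding" rows

cos_generic = pairwise_distances(X, Y, metric="cosine")
cos_direct = cosine_distances(X, Y)
euc_generic = pairwise_distances(X, Y, metric="euclidean")
euc_direct = euclidean_distances(X, Y)

same_cosine = np.allclose(cos_generic, cos_direct)
same_euclidean = np.allclose(euc_generic, euc_direct)
print(same_cosine, same_euclidean)
```

So the generic function is mostly a convenience: one import, and the metric can be switched with a string.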

In [3]:
# lazyprogrammer (LP) defines two different distance metrics
def dist1(a, b):
    return np.linalg.norm(a-b)
def dist2(a, b):
    return 1 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))


#pick a distance measurement type
# dist, metric = dist1, 'euclidean'
dist, metric = dist2, 'cosine'


## manual implementation of dist1 and dist2 by LP allows you to choose which distance to use
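
A small sketch of how the two metrics differ, using made-up vectors. `dist1` and `dist2` are copied from the cell above; everything else is illustrative.

```python
import numpy as np

def dist1(a, b):
    # Euclidean distance: length of the difference vector
    return np.linalg.norm(a - b)

def dist2(a, b):
    # Cosine distance: 1 - cosine similarity
    return 1 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a   # same direction as a, ten times the length

euclid = dist1(a, b)   # large, because the lengths differ
cosine = dist2(a, b)   # essentially zero, because the directions match
print(euclid, cosine)

# Orthogonal vectors have cosine similarity 0, so cosine distance 1.
orth = dist2(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(orth)
```

Cosine distance ignores vector length and compares direction only, which is one plausible reason it is the metric selected here.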

In [4]:
# more intuitive version
def find_analogies(w1, w2, w3):
    ## if you try to make an analogy with a word outside of the vocab, it won't work
    for w in (w1, w2, w3):
        if w not in word2vec:
            print("%s not in dictionary" % w)
            return
    
    ## analogy: first - second = v0 - third, so v0 = first - second + third
    
    first_word = word2vec[w1]
    second_word = word2vec[w2]
    third_word = word2vec[w3]
    v0 = first_word - second_word + third_word
    
    min_dist = float('inf') ## set min distance to infinity
    best_word = '' ## no best word yet
    for word, v1 in iteritems(word2vec):
        if word not in (w1, w2, w3): ## ensure the word isn't in analogy
            d = dist(v0, v1) ## uses the dist chosen above (dist2, cosine) rather than a library function
            if d < min_dist:  ## if new v0 to v1 dist is less than min dist, set new min dist and closest word
                min_dist = d
                best_word = word
    print(w1, " - ", w2, " = ", best_word, " - ", w3)

In [5]:
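## commented-out copy of the same loop-based function in LP's original formatting
## (it still loops over every word, so it is not faster than the cell above)

# def find_analogies(w1, w2, w3):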
#   for w in (w1, w2, w3):
#     if w not in word2vec:
#       print("%s not in dictionary" % w)
#       return
#
#   king = word2vec[w1]
#   man = word2vec[w2]
#   woman = word2vec[w3]
#   v0 = king - man + woman
#
#   min_dist = float('inf')
#   best_word = ''
#   for word, v1 in iteritems(word2vec):
#     if word not in (w1, w2, w3):
#       d = dist(v0, v1)
#       if d < min_dist:
#         min_dist = d
#         best_word = word
#   print(w1, "-", w2, "=", best_word, "-", w3)
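
For comparison, a vectorized sketch that replaces the per-word loop with a single `pairwise_distances` call, the same call `nearest_neighbors` uses. The function name `find_analogy_vectorized`, its extra parameters, and the toy 3-D vocabulary are assumptions made for this example, not LP's code.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

def find_analogy_vectorized(w1, w2, w3, word2vec, embedding, idx2word, metric="cosine"):
    """Return the word closest to w1 - w2 + w3, skipping the three inputs."""
    for w in (w1, w2, w3):
        if w not in word2vec:
            return None
    v0 = word2vec[w1] - word2vec[w2] + word2vec[w3]
    # One call computes the distance from v0 to every row of `embedding`.
    distances = pairwise_distances(v0.reshape(1, -1), embedding, metric=metric).ravel()
    # Up to three of the closest words may be the inputs themselves,
    # so inspect the four closest and return the first that is not an input.
    for idx in distances.argsort()[:4]:
        if idx2word[idx] not in (w1, w2, w3):
            return idx2word[idx]
    return None

# Toy vocabulary with made-up vectors (royal, male, female), not GloVe values.
toy = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "apple": np.array([-1.0, 0.5, 0.2]),
}
idx2word_toy = list(toy)
embedding_toy = np.array([toy[w] for w in idx2word_toy])

result = find_analogy_vectorized("king", "man", "woman", toy, embedding_toy, idx2word_toy)
missing = find_analogy_vectorized("king", "man", "nope", toy, embedding_toy, idx2word_toy)
print(result, missing)
```

The saving comes from letting the library compute all distances at once instead of calling `dist` once per vocabulary word from Python.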

In [6]:
def nearest_neighbors(w, n=5):  ## nearest n neighbors for a word
    if w not in word2vec:
        print("%s not in dictionary" % w)
        return
    
    ## word2vec is a dictionary mapping word w to vector v: {w1: v1, w2: v2, ...}
    ## v shape is (50,)
    v = word2vec[w] 
    
    ## computes the distance between v (the vector for w) and every row of the embedding matrix
    ## metric arg takes a string; here metric = "cosine", which pairwise_distances accepts
    distances = pairwise_distances(v.reshape(1, D), embedding, metric=metric).reshape(V)
    
    ## v shape is (50,)
    ## v.reshape(1,D) is (1,50)
    ## embedding.shape is (400000,50)
    ## cosine distance is 1 - (v0 dotproduct v1) / (magnitude(v0) * magnitude(v1))
    
    ## original distances.shape is (1, 400000)
    ## distances.reshape(V).shape is (400000,)
    
    ## argsort orders by distance (closest first) and returns an array of indices (not the distances themselves)
    ## [1:n+1] excludes the searched word w (it is closest to itself, so it sorts first) and takes the top n neighbors
    idxs = distances.argsort()[1:n+1] 
    
    ## prints out n nearest neighbors based on ranking
    print("neighbors of %s:" % w)
    for idx in idxs:
        
        ## idx2word maps index idx back to its word
        print("\t%s" % idx2word[idx])
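
A tiny illustration of the `argsort()[1:n+1]` step, using made-up distances for a 5-word vocabulary rather than real embedding distances.

```python
import numpy as np

# Made-up distances from a query word to each of 5 vocabulary words.
# Index 2 is the query itself, so its distance is 0.
distances = np.array([0.40, 0.15, 0.00, 0.90, 0.25])
n = 2

order = distances.argsort()   # indices sorted from closest to farthest
idxs = order[1:n + 1]         # drop position 0 (the query), keep the next n
print(order, idxs)
```

This relies on the query word sorting first. That holds when its distance to itself is the unique minimum, which is an assumption worth keeping in mind if the vocabulary contains duplicate vectors.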
    

In [12]:
# load in pre-trained word vectors
print('Loading word vectors...')
word2vec = {}
embedding = []
idx2word = []
with open('./large_files/glove.6B.50d.txt', encoding='utf-8') as f:
  # is just a space-separated text file in the format:
  # word vec[0] vec[1] vec[2] ...
  for line in f:
    values = line.split()
    word = values[0]
    vec = np.asarray(values[1:], dtype='float32')
    word2vec[word] = vec
    embedding.append(vec)
    idx2word.append(word)
print('Found %s word vectors.' % len(word2vec))
embedding = np.array(embedding)
V, D = embedding.shape

Loading word vectors...
Found 400000 word vectors.
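
To see what the loading loop builds without the full file, here is a sketch that parses two made-up lines in the same `word vec[0] vec[1] ...` layout from an in-memory string. The words and 3-dimensional values are invented for illustration.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout (3-D here, not real values).
sample = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")

word2vec = {}
embedding = []
idx2word = []
for line in sample:
    values = line.split()
    word = values[0]
    vec = np.asarray(values[1:], dtype="float32")  # strings -> float32 vector
    word2vec[word] = vec
    embedding.append(vec)
    idx2word.append(word)

embedding = np.array(embedding)
V, D = embedding.shape   # V = vocabulary size, D = vector dimension
print(V, D, idx2word)

# Row i of `embedding` is the vector for idx2word[i].
rows_match = all(np.array_equal(embedding[i], word2vec[w]) for i, w in enumerate(idx2word))
print(rows_match)
```

The three structures hold the same data in different forms: `word2vec` for lookup by word, `embedding` for matrix operations, and `idx2word` for turning a row index back into a word.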


In [8]:
find_analogies('king', 'man', 'woman')
find_analogies('france', 'paris', 'london')
find_analogies('france', 'paris', 'rome')
find_analogies('paris', 'france', 'italy')
find_analogies('france', 'french', 'english')
find_analogies('japan', 'japanese', 'chinese')
# find_analogies('japan', 'japanese', 'italian')
# find_analogies('japan', 'japanese', 'australian')
# find_analogies('december', 'november', 'june')
# find_analogies('miami', 'florida', 'texas')
# find_analogies('einstein', 'scientist', 'painter')
# find_analogies('china', 'rice', 'bread')
# find_analogies('man', 'woman', 'she')
# find_analogies('man', 'woman', 'aunt')
# find_analogies('man', 'woman', 'sister')
# find_analogies('man', 'woman', 'wife')
# find_analogies('man', 'woman', 'actress')
# find_analogies('man', 'woman', 'mother')
# find_analogies('heir', 'heiress', 'princess')
# find_analogies('nephew', 'niece', 'aunt')
# find_analogies('france', 'paris', 'tokyo')
# find_analogies('france', 'paris', 'beijing')
# find_analogies('february', 'january', 'november')
# find_analogies('france', 'paris', 'rome')
# find_analogies('paris', 'france', 'italy')

king  -  man  =  queen  -  woman
france  -  paris  =  britain  -  london
france  -  paris  =  italy  -  rome
paris  -  france  =  rome  -  italy
france  -  french  =  england  -  english
japan  -  japanese  =  china  -  chinese


In [9]:
nearest_neighbors('king')
nearest_neighbors('france')
nearest_neighbors('japan')
nearest_neighbors('Einstein')  ## capitalized on purpose: this GloVe vocab is lowercase, so the lookup fails (try 'einstein')
nearest_neighbors('woman')
# nearest_neighbors('nephew')
# nearest_neighbors('february')
# nearest_neighbors('rome')


neighbors of king:
	prince
	queen
	ii
	emperor
	son
neighbors of france:
	french
	belgium
	paris
	spain
	netherlands
neighbors of japan:
	japanese
	china
	korea
	tokyo
	taiwan
Einstein not in dictionary
neighbors of woman:
	girl
	man
	mother
	her
	boy


In [10]:
word2vec["king"].reshape(1, D).shape

(1, 50)

In [11]:
embedding.shape

(400000, 50)