# Word Embeddings
In this notebook, we use pretrained GloVe word embeddings to measure the similarity between words and to solve word analogy problems.

In [1]:
import numpy as np

## <span style="color:blue"> Read Pretrained Word Vectors </span>

In [8]:
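def read_glove_vecs(glove_file):
    """Map each word in a GloVe text file to its embedding vector."""
    word_to_vec_map = {}
    # GloVe vocabularies may contain non-ASCII tokens, so read the file as UTF-8.
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            # Each line holds a word followed by its vector components.
            parts = line.strip().split()
            word_to_vec_map[parts[0]] = np.array(parts[1:], dtype=np.float64)

    return word_to_vec_map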
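
Each line of a GloVe text file is a token followed by its space-separated vector components. As a minimal sketch of that parsing step, the snippet below runs the same logic on an in-memory sample; the two lines and the name `toy_map` are made up for illustration and are not real GloVe data.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: "<token> <x1> <x2> ... <xd>".
# The values are illustrative only, not real GloVe vectors.
sample = io.StringIO("cat 0.1 -0.2 0.3\ndog 0.4 0.5 -0.6\n")

toy_map = {}
for line in sample:
    parts = line.strip().split()
    # First field is the token; the remaining fields are parsed as floats.
    toy_map[parts[0]] = np.array(parts[1:], dtype=np.float64)

print(toy_map["cat"])
print(toy_map["dog"].shape)  # (3,)
```

With the real `glove.6B.50d.txt` file, every vector should have 50 components, matching the `50d` in the file name.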

In [9]:
word_to_vec_map = read_glove_vecs('glove.6B.50d.txt')

In [13]:
sample_words = ['lion','chair','of']
for word in sample_words:
    print(f"{word}:\n{word_to_vec_map[word]}")

lion:
[ 0.60093    0.012934  -0.61032   -0.13871    1.2507     0.10128
 -0.20073   -1.1283     0.60857   -0.16666    0.22419    0.61781
  0.50102   -0.22303   -0.011922   0.24229    0.82199    0.5748
 -1.9703     0.044589   0.22157   -0.26673   -0.005156  -0.36311
 -0.42483   -0.79237   -1.557      0.22124   -0.32143   -0.46747
  1.0604     0.84237    0.045337   0.86075   -0.085506   0.0058815
 -0.16237   -1.0329    -0.25763   -0.65854   -0.13647    0.3653
 -0.61384   -0.54004    0.24051   -0.23007   -0.31155   -1.3485
  0.4327    -0.34512  ]
chair:
[-1.0443e+00  4.9202e-01 -7.5978e-01 -3.9224e-01  8.1217e-01 -3.9287e-02
  1.6706e-02 -6.8629e-01 -7.8359e-02 -1.3214e+00 -1.5354e-01  2.0438e-01
 -4.6503e-01  1.2145e+00 -1.8217e-01  2.7451e-01 -2.4086e-01  7.1145e-01
  3.2470e-01 -7.1320e-01  6.6721e-01  7.1307e-01 -1.0394e-01 -3.8439e-01
 -2.0260e-01 -1.4419e+00  4.2644e-01  5.9436e-01 -1.3615e+00  1.3784e-03
  1.8734e+00 -1.1334e-01 -8.8115e-01 -2.1715e-01 -5.6606e-01  1.4152e-01
  2.76

## <span style="color:blue"> Cosine Similarity </span>
The cosine similarity between two vectors $\mathbf{u}$ and $\mathbf{v}$, denoted here by $\alpha$, is the cosine of the angle between them:
$$\alpha=\frac{\mathbf{u}^{\mathrm{T}}\mathbf{v}}{\lVert \mathbf{u}\rVert_2\lVert \mathbf{v}\rVert_2}.$$
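
To make the formula concrete, here is a small sketch on hand-picked 2-D vectors. The helper name `cos_sim` and the vectors are assumptions for illustration only; it omits the zero-vector handling that a robust implementation would need.

```python
import numpy as np

def cos_sim(u, v):
    # Dot product divided by the product of the Euclidean norms.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])  # parallel to u, different length
orthogonal = np.array([0.0, 2.0])      # at a right angle to u
opposite = np.array([-1.0, 0.0])       # points the other way

print(cos_sim(u, same_direction))  # 1.0
print(cos_sim(u, orthogonal))      # 0.0
print(cos_sim(u, opposite))        # -1.0
```

The value depends only on direction, not on length, which is why scaling a vector leaves the similarity unchanged.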

In [14]:
def cosine_similarity(u, v):
    # Identical vectors have similarity 1 by definition.
    if np.all(u == v):
        return 1.0

    dot = np.dot(u, v)
    norm_u = np.sqrt(np.dot(u, u))
    norm_v = np.sqrt(np.dot(v, v))

    # The angle is undefined if either vector is zero; return 0 by convention.
    if norm_u * norm_v == 0:
        return 0.0

    return dot / (norm_u * norm_v)

In [24]:
print("Some cosine similarity values:")
print(f"human versus vegetable: {cosine_similarity(word_to_vec_map['human'],word_to_vec_map['vegetable'])}")
print(f"girl versus woman: {cosine_similarity(word_to_vec_map['girl'],word_to_vec_map['woman'])}")
print(f"chicken versus television: {cosine_similarity(word_to_vec_map['chicken'],word_to_vec_map['television'])}")

Some cosine similarity values:
human versus vegetable: 0.14382288655846073
girl versus woman: 0.9065280671323898
chicken versus television: 0.21441582211490806


## <span style="color:blue"> Word Analogy Task </span>
The objective is to complete this sentence: <font color="olive">"*a* is to *b* as *c* is to ____"</font>

In other words, if $\mathbf{e}_a$, $\mathbf{e}_b$, and $\mathbf{e}_c$ denote the embeddings of words *a*, *b*, and *c*, respectively, find the word *w* with embedding $\mathbf{e}_w$ such that $\mathbf{e}_b-\mathbf{e}_a\sim \mathbf{e}_w-\mathbf{e}_c$. Here $\sim$ denotes similarity: we choose the word *w* that maximizes the cosine similarity between $\mathbf{e}_b-\mathbf{e}_a$ and $\mathbf{e}_w-\mathbf{e}_c$.
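
The search can be sketched on a tiny hand-made vocabulary. The 2-D vectors in `toy` and the helper `cos_sim` are hypothetical and chosen so the answer is easy to verify by hand; they are not GloVe vectors. Unlike the notebook's function, this sketch leaves all three input words out of the candidate set.

```python
import numpy as np

# Hand-made 2-D embeddings (hypothetical values, not GloVe vectors).
toy = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([0.0, -2.0]),
}

def cos_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a, b, c = "man", "woman", "king"
target = toy[b] - toy[a]  # e_b - e_a

# Score each candidate w (input words excluded) by cos(e_b - e_a, e_w - e_c).
scores = {w: cos_sim(target, toy[w] - toy[c]) for w in toy if w not in (a, b, c)}
best = max(scores, key=scores.get)
print(best)  # queen
```

Here $\mathbf{e}_{queen}-\mathbf{e}_{king}$ equals $\mathbf{e}_{woman}-\mathbf{e}_{man}$ exactly, so "queen" scores 1, while "apple" points away from the target direction and scores below 0.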

In [25]:
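def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """Return the word w that best completes "a is to b as c is to w"."""
    # The GloVe vocabulary used here is lowercase, so normalize the inputs.
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]

    # Cosine similarity lies in [-1, 1], so -2 is below any attainable value.
    max_cosine_sim = -2
    best_word = None
    for w in word_to_vec_map:
        # Skip word_c itself: e_w - e_c would be the zero vector.
        if w == word_c:
            continue

        # Compare the offset b - a with the candidate offset w - c.
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w

    return best_word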

In [40]:
triads_to_try = [('italy', 'italian', 'spain'), ('man', 'woman', 'boy')]
for triad in triads_to_try:
    print(f"{triad[0]} -> {triad[1]} :: {triad[2]} -> {complete_analogy(*triad, word_to_vec_map)}")

italy -> italian :: spain -> spanish
man -> woman :: boy -> girl
