<font color = green >

## Hometask 
### 1. Learn word embeddings
</font>

Find a custom corpus and train your own word embeddings on it.
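
One possible starting point, assuming a tokenized corpus is already in memory: the sketch below builds count-based embeddings from co-occurrence counts followed by a truncated SVD. This is a simpler stand-in for word2vec or GloVe training, not a method the task prescribes. The names `train_svd_embeddings` and `toy_corpus`, and the `dim` and `window` parameters, are hypothetical and chosen only for illustration.

```python
import numpy as np

def train_svd_embeddings(sentences, dim=2, window=2):
    """Count-based embeddings: co-occurrence counts followed by truncated SVD."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}

    # counts[i, j] = how often word j appears within `window` tokens of word i
    counts = np.zeros((len(vocab), len(vocab)), dtype=np.float64)
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1.0

    # Dampen large counts, then keep the top `dim` singular directions.
    u, s_vals, _ = np.linalg.svd(np.log1p(counts))
    vectors = u[:, :dim] * s_vals[:dim]
    return {w: vectors[index[w]] for w in vocab}

# Hypothetical toy corpus; a real corpus would be far larger.
toy_corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat chased a dog".split(),
]
toy_embeddings = train_svd_embeddings(toy_corpus, dim=3)
print(toy_embeddings["cat"].shape)  # (3,)
```

The returned dictionary has the same word-to-vector layout as `word_to_vec_map` used later in this notebook, so the same similarity functions can be applied to it.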

<font color = green >


### 2. Operations with pretrained vectors
</font>

Complete the `cosine_similarity()` and `complete_analogy()` functions below.

In [1]:
import numpy as np
import time 
import os

In [2]:
path = os.path.join(os.getcwd(), "data")
fn_glove = os.path.join(path , 'glove.6B.50d.txt')

<font color = green >

### Load pretrained vectors 

</font>

`read_glove_vecs()` returns two objects:

- `words`: set of words in the vocabulary.
- `word_to_vec_map`: dictionary mapping words to their GloVe vector representation.




In [3]:
def read_glove_vecs(glove_file):
    with open(glove_file, "r", encoding="utf-8") as f:
        words = set()
        word_to_vec_map = {}

        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

    return words, word_to_vec_map


start_time = time.time()
words, word_to_vec_map = read_glove_vecs(fn_glove)
time_loading = time.time()
print('loading: {:.3f}s'.format(time_loading - start_time))


loading: 4.010s
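
Each line of the GloVe text file holds a token followed by its vector components, separated by spaces. As a minimal sketch of what the loader does per line, the snippet below parses a hypothetical two-word sample held in memory instead of the real file; `sample` and `sample_map` are illustrative names.

```python
import io
import numpy as np

# Hypothetical sample in the GloVe text layout: token, then vector components.
sample = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")

sample_map = {}
for line in sample:
    parts = line.strip().split()
    # parts[0] is the word, the rest are the components as strings
    sample_map[parts[0]] = np.array(parts[1:], dtype=np.float64)

print(sorted(sample_map))        # ['cat', 'dog']
print(sample_map["cat"].shape)   # (3,)
```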


<font color = green >

### Cosine similarity

</font>



To measure how similar two word vectors $u$ and $v$ are, cosine similarity is used:

$$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{||u||_2 \, ||v||_2} = \cos(\theta) \tag{1}$$

where $u \cdot v$ is the inner product of the two vectors, $||u||_2$ is the norm (or length) of the vector $u$, and $\theta$ is the angle between $u$ and $v$. The similarity depends only on the angle between $u$ and $v$, not on their lengths, and ranges from $-1$ to $1$. If $u$ and $v$ point in nearly the same direction, it is close to 1; if they are orthogonal, it is 0; if they point in opposite directions, it is close to $-1$.


<img src="img/lesson_23_images/16.png" align = 'left' style="width:300;height:300px;"> <br>
<div style="clear:left;"></div>

The norm of $u$ is defined as $||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$.
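
To make formula (1) concrete, here is a small sketch that evaluates it on hand-picked 2-d vectors where the angle is easy to see. The helper name `cos_sim` is hypothetical and used only for this illustration.

```python
import numpy as np

def cos_sim(u, v):
    """Formula (1): inner product divided by the product of the L2 norms."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
print(cos_sim(u, np.array([2.0, 0.0])))            # same direction: 1.0
print(cos_sim(u, np.array([0.0, 3.0])))            # orthogonal: 0.0
print(cos_sim(u, np.array([-1.0, 0.0])))           # opposite direction: -1.0
print(round(cos_sim(u, np.array([1.0, 1.0])), 4))  # 45 degrees: 0.7071
```

Note that scaling a vector (the first case) does not change the result, since only the angle matters.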

In [4]:
# Implement the function cosine_similarity() to evaluate similarity between word vectors.

def cosine_similarity(u, v):
    """
    Measures the degree of similarity between u and v.
    Arguments:
        u, v - word vectors of shape (n,)
    Returns:
        the cosine similarity between u and v defined by the formula above.
    """
    
    # START CODE 
    # Compute cosine similarity between u and v 
    similarity = np.divide(np.dot(u, v), np.linalg.norm(u) * np.linalg.norm(v))
    
    # END CODE
    
    return similarity
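
The formula is undefined when either vector has zero norm, in which case the division yields `nan`. A defensive variant might look like the sketch below; returning `0.0` for that case is a convention assumed here, not something the exercise requires, and `safe_cosine_similarity` and `eps` are hypothetical names.

```python
import numpy as np

def safe_cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity that returns 0.0 instead of nan for zero-length input."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom < eps:
        # Assumed convention: a zero vector is treated as similar to nothing.
        return 0.0
    return float(np.dot(u, v) / denom)

print(safe_cosine_similarity(np.zeros(3), np.ones(3)))  # 0.0
```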



In [5]:
father = word_to_vec_map["father"]
mother = word_to_vec_map["mother"]
ball = word_to_vec_map["ball"]
crocodile = word_to_vec_map["crocodile"]
france = word_to_vec_map["france"]
italy = word_to_vec_map["italy"]
paris = word_to_vec_map["paris"]
rome = word_to_vec_map["rome"]

print("cosine_similarity(father, mother) = ", cosine_similarity(father, mother))
print("cosine_similarity(ball, crocodile) = ",cosine_similarity(ball, crocodile))
print("cosine_similarity(france - paris, rome - italy) = ",cosine_similarity(france - paris, rome - italy))


cosine_similarity(father, mother) =  0.8909038442893615
cosine_similarity(ball, crocodile) =  0.2743924626137942
cosine_similarity(france - paris, rome - italy) =  -0.6751479308174202


<font color = blue >

### Expected Output

</font>


`cosine_similarity(father, mother) =  0.8909038442893615
cosine_similarity(ball, crocodile) =  0.2743924626137943
cosine_similarity(france - paris, rome - italy) =  -0.6751479308174202
`

<font color = green >

### Analogy reasoning task

</font>


The analogy reasoning task is to complete the sentence `"A"` is to `"B"` as `"C"` is to `?`, e.g. `man` is to `woman` as `king` is to `queen`.

In detail, it looks for a word `"D"` such that the word embeddings $E_a, E_b, E_c, E_d$ satisfy $E_b - E_a \approx E_d - E_c$.

Use cosine similarity to measure the similarity between $E_b - E_a$ and $E_d - E_c$.
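
To illustrate the relation $E_b - E_a \approx E_d - E_c$, the sketch below uses hypothetical hand-made 3-d embeddings in which one axis encodes gender and another encodes royalty. The dictionary `E` and the helper `offset_similarity` are assumptions for this example and are unrelated to the real GloVe vectors.

```python
import numpy as np

# Hypothetical 3-d embeddings: axis 1 encodes "female", axis 2 encodes "royal".
E = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([0.0, -1.0, 0.5]),
}

def offset_similarity(a, b, c, d):
    """Cosine similarity between the offsets E_b - E_a and E_d - E_c."""
    u = E[b] - E[a]
    v = E[d] - E[c]
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(offset_similarity("man", "woman", "king", "queen"))  # 1.0: offsets coincide
print(offset_similarity("man", "woman", "king", "apple"))  # negative: poor candidate
```

The candidate whose offset from `"C"` best matches the offset from `"A"` to `"B"` is the one an analogy search should return.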


In [6]:
# Complete the code of complete_analogy to perform word analogies

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Finds answer for the analogy reasoning task: a is to b as c is to ? 
    
    Arguments:
    word_a, word_b, word_c  - string
    word_to_vec_map - dictionary that maps words to embedding vectors. 
    
    Returns:
    best_word -  the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """
    
    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    
    # START CODE 
    # Get the word embeddings v_a, v_b and v_c 
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]
        
    words = word_to_vec_map.keys()
    max_cosine_sim = -100              # Initialize max_cosine_sim to a large negative number
    best_word = None                   # Initialize best_word with None, it will help keep track of the word to output

    # loop over the whole word vector set
    for w in words:        
        # to avoid best_word being one of the input words, pass on them.
        if w in [word_a, word_b, word_c] :
            continue        
        
        # Compute cosine similarity between the vector (e_b - e_a) and the vector ((w's vector representation) - e_c)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)
       
        # If cosine_sim exceeds the max_cosine_sim seen so far,
        # set max_cosine_sim to the current cosine_sim and best_word to the current word
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
    # END CODE
        
    return best_word

Run the cell below to test your code; this may take 1-2 minutes.

In [7]:
triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print('{} -> {} as {} -> {}'.format(*triad, complete_analogy(*triad, word_to_vec_map)))
    

italy -> italian as spain -> spanish
india -> delhi as japan -> tokyo
man -> woman as boy -> girl
small -> smaller as large -> larger


<font color = blue >

### Expected Output

</font>

`italy -> italian as spain -> spanish
india -> delhi as japan -> tokyo
man -> woman as boy -> girl
small -> smaller as large -> larger`
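
The loop in `complete_analogy()` computes one cosine similarity per vocabulary word in Python, which is why the test cell can take a minute or two. A vectorized sketch of the same search is shown below. It assumes all vectors fit in one NumPy matrix, and `complete_analogy_vectorized` and `toy_map` are hypothetical names. Its result should agree with the loop version except for possible floating-point ties.

```python
import numpy as np

def complete_analogy_vectorized(word_a, word_b, word_c, word_to_vec_map):
    """Vectorized variant of the analogy search: a is to b as c is to ?"""
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    vocab = list(word_to_vec_map)
    matrix = np.stack([word_to_vec_map[w] for w in vocab])      # shape (V, n)

    target = word_to_vec_map[word_b] - word_to_vec_map[word_a]  # e_b - e_a
    diffs = matrix - word_to_vec_map[word_c]                    # e_w - e_c for every w

    # Cosine similarity of every row of diffs with target, computed at once.
    denom = np.linalg.norm(diffs, axis=1) * np.linalg.norm(target)
    with np.errstate(divide="ignore", invalid="ignore"):
        sims = (diffs @ target) / denom

    # Zero-length differences (e.g. w == word_c) give nan; never select them.
    sims[np.isnan(sims)] = -np.inf
    # Skip the input words, as the loop version does.
    for i, w in enumerate(vocab):
        if w in (word_a, word_b, word_c):
            sims[i] = -np.inf

    return vocab[int(np.argmax(sims))]

# Tiny hypothetical vocabulary, only to show the call.
toy_map = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([0.0, -1.0, 0.5]),
}
print(complete_analogy_vectorized("man", "woman", "king", toy_map))  # queen
```

Building the matrix once and reusing it across queries would save further time, at the cost of holding a second copy of the vectors in memory.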

<font color = green >

### Compare with spaCy

</font>


In [8]:
boy = word_to_vec_map["boy"]
girl = word_to_vec_map["girl"]
man = word_to_vec_map["man"]
woman = word_to_vec_map["woman"]
# analogy reasoning 
u = boy - man 
v = girl - woman
cosine_similarity(u, v) #   0.73 better than 0.28

0.7302759333621871

In [9]:
print('{} vs {}: {} '.format('boy', 'man', cosine_similarity(boy, man)))
print('{} vs {}: {} '.format('girl', 'woman', cosine_similarity(girl, woman)))

boy vs man: 0.8564431790318322 
girl vs woman: 0.9065280671323898 


In [10]:
print('{} vs {}: {} '.format(
    'car', 'vehicle', cosine_similarity(word_to_vec_map["car"], word_to_vec_map["vehicle"])))


car vs vehicle: 0.8833684148214743 


In [11]:
triad = ('boy', 'girl', 'man')
print('{} -> {} as {} -> {}'.format(*triad, complete_analogy(*triad, word_to_vec_map)))


boy -> girl as man -> woman
