In [1]:
NAME = "Reuel Yang"

---

# Lab 6: Skip Gram

**Please read the following instructions very carefully**

## Working on the assignment / FAQs
- **Always set the seed/random_state to *42* wherever applicable** (this ensures repeatable answers across students and coding environments)
- The type of question and the points they carry are indicated in each question cell
- To avoid any ambiguity, each question also specifies what *value* must be set. Note that these are dummy values and not the answers
- If an autograded question has multiple answers (due to differences in handling NaNs, zeros etc.), all answers will be considered.
- You can delete the `raise NotImplementedError()`
- **Submitting the assignment**: Download the '.ipynb' file from Colab and upload it to bCourses. Do not delete any outputs from cells before submitting.
- That's about it. Happy coding!


Available software:
 - Python's Gensim module: https://radimrehurek.com/gensim/ (install using pip)

_Note: The most important hyperparameters of skip-gram/CBOW are vector size and window size_


In [2]:
!pip install gensim
import pandas as pd
import numpy as np
import gensim



In [3]:
import gensim.downloader as api

model = api.load('word2vec-google-news-300') # this step might take ~10-15 minutes



### **Q1 (1 point)**
Find the cosine similarity between the following word pairs:

- (France, England)
- (come, go)
- (England, London)
- (France, Rocket)
- (big, bigger)

In [4]:
#Replace 0 with the code / value; Do not delete this cell
similarity_pair1 = model.similarity('France', 'England')
similarity_pair2 = model.similarity('come', 'go')
similarity_pair3 = model.similarity('England', 'London')
similarity_pair4 = model.similarity('France', 'Rocket')
similarity_pair5 = model.similarity('big', 'bigger')

In [5]:
#This is an autograded cell, do not edit/delete
print(similarity_pair1, similarity_pair2, similarity_pair3, similarity_pair4, similarity_pair5)

0.39804944 0.6604678 0.43992856 0.07114174 0.68423855
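
As a cross-check on what `model.similarity` reports, cosine similarity can be computed by hand: the dot product of two vectors divided by the product of their lengths. The sketch below is a minimal illustration on made-up 3-d vectors (the helper name `cosine_similarity` and the toy values are assumptions for illustration, not part of the lab); with the loaded model, the same function could be applied to `model['France']` and `model['England']`.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d vectors standing in for real 300-d embeddings (values are made up).
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, twice the length
c = np.array([3.0, 0.0, -1.0])   # orthogonal to a (dot product is 0)

sim_ab = cosine_similarity(a, b)  # vectors pointing the same way
sim_ac = cosine_similarity(a, c)  # orthogonal vectors
```

Because only the angle matters, `a` and `b` score as maximally similar even though their lengths differ.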


### **Q2 (1 point)**
Write an expression to extract the vector representations of the words:

- France
- England
- smaller
- bigger
- rocket
- big

Get only the first 5 elements for each vector representation.

In [12]:
#Replace 0 with the code / value to get the first 5 elements of each vector; Do not delete this cell
vector_1 = model['France'][:5]
vector_2 = model['England'][:5]
vector_3 = model['smaller'][:5]
vector_4 = model['bigger'][:5]
vector_5 = model['rocket'][:5]
vector_6 = model['big'][:5]

In [13]:
#This is an autograded cell, do not edit/delete
print(vector_1)
print(vector_2)
print(vector_3)
print(vector_4)
print(vector_5)
print(vector_6)


[0.04858398 0.07861328 0.32421875 0.03491211 0.07714844]
[-0.19824219  0.11523438  0.0625     -0.05834961  0.2265625 ]
[-0.05004883  0.03417969 -0.0703125   0.17578125  0.00689697]
[-0.06542969 -0.09521484 -0.06225586  0.16210938  0.01989746]
[-0.03198242  0.27148438 -0.2890625  -0.15429688  0.16894531]
[ 0.11132812  0.10595703 -0.07373047  0.18847656  0.07666016]


### **Q3 (1 point)**
Find the Euclidean distances between the following word pairs:

- (France, England)
- (smaller, bigger)
- (England, London)
- (France, Rocket)
- (big, bigger)


In [8]:
#Replace 0 with the code / value; Do not delete this cell
eu_dist1 = np.linalg.norm(model['France'] - model['England'])
eu_dist2 = np.linalg.norm(model['smaller'] - model['bigger'])
eu_dist3 = np.linalg.norm(model['England'] - model['London'])
eu_dist4 = np.linalg.norm(model['France'] - model['Rocket'])
eu_dist5 = np.linalg.norm(model['big'] - model['bigger'])

In [9]:
#This is an autograded cell, do not edit / delete
print(eu_dist1)
print(eu_dist2)
print(eu_dist3)
print(eu_dist4)
print(eu_dist5)


3.0151067
1.8618743
2.8752837
3.892071
1.9586496
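
Unlike cosine similarity, Euclidean distance is sensitive to vector length as well as direction. The sketch below spells out what `np.linalg.norm(u - v)` computes and shows that two vectors pointing the same way still have a nonzero distance until they are scaled to unit length. The helper name and the 2-d values are made up for illustration.

```python
import numpy as np

def euclidean_distance(u, v):
    """Square root of the sum of squared coordinate differences."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# Toy 2-d vectors (made-up values): v points the same way as u but is twice as long.
u = np.array([3.0, 4.0])
v = np.array([6.0, 8.0])

dist_raw = euclidean_distance(u, v)  # nonzero: the lengths differ

# After scaling both to unit length, only direction is left, so the distance vanishes.
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)
dist_unit = euclidean_distance(u_hat, v_hat)
```

This is one reason a pair can rank differently under the two measures used in Q1 and Q3.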


### **Q4 (1 point)**
Time to dabble with the power of Word2Vec. Find the 2 closest words for each of the following expressions:
- (King - Man + Queen)
- (bigger - big + small)
- (waiting - wait + run)
- (Texas + Milwaukee - Wisconsin)

Note: If your kernel crashes due to low memory and restarts, reload the model from the top and try running this part again.

In [14]:
#Replace 0 with the code / value; Do not delete this cell
closest1 = model.most_similar(positive=['King', 'Queen'], negative=['Man'])[:2]
closest2 = model.most_similar(positive=['bigger', 'small'], negative=['big'])[:2]
closest3 = model.most_similar(positive=['waiting', 'run'], negative=['wait'])[:2]
closest4 = model.most_similar(positive=['Texas', 'Milwaukee'], negative=['Wisconsin'])[:2]

In [15]:
#This is an autograded cell, do not edit/delete
print(closest1)
print(closest2)
print(closest3)
print(closest4)

[('Queen_Elizabeth', 0.5257916450500488), ('monarch', 0.5004087090492249)]
[('larger', 0.7402471303939819), ('smaller', 0.7329993844032288)]
[('running', 0.5654535889625549), ('runs', 0.49639999866485596)]
[('Houston', 0.7767744064331055), ('Fort_Worth', 0.7270511388778687)]
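
The arithmetic behind these queries can be sketched on a toy vocabulary. The version below assumes the usual recipe (unit-normalize each input vector, add the positive ones, subtract the negative ones, then rank the remaining words by cosine similarity while excluding the query words); it approximates what `most_similar` does and is not gensim's actual implementation. The 2-d vectors are made up for illustration.

```python
import numpy as np

# Hypothetical toy embeddings: made-up 2-d values, not taken from the Google News model.
toy_vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
    "apple": np.array([-0.7, -0.5]),
}

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def analogy(positive, negative, vectors, topn=2):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = np.zeros_like(next(iter(vectors.values())))
    for word in positive:
        query = query + unit(vectors[word])
    for word in negative:
        query = query - unit(vectors[word])
    query = unit(query)
    excluded = set(positive) | set(negative)  # query words are never returned
    scored = [
        (word, float(np.dot(unit(vec), query)))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# king - man + woman on the toy vocabulary
result = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
```

Excluding the query words matters: without it, the nearest neighbour of the combined vector is often one of the inputs themselves.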


### **Q5 (3 points)**
Using the vectors for the words in the Google News dataset, apply K-means clustering (K=2) and find the top 5 most representative words/phrases of each cluster.

*Note: Since there are ~3 million words in the vocabulary, you can randomly downsample it to 25k words*  
*Hint: The "similar_by_vector" method might be useful*

**Do not delete the below cell**

In [16]:
# Replace 0 with the code / value; Do not delete this cell
import random
from sklearn.cluster import KMeans

vocab_keys = list(model.key_to_index.keys())
randomly_selected_words = random.sample(vocab_keys, 25000)
subset_word_vectors = [model[word] for word in randomly_selected_words]

kmeans = KMeans(n_clusters=2, random_state=42, n_init='auto')
kmeans.fit(subset_word_vectors)

top_representative_words = {}
for cluster_label in range(2):
    centroid = kmeans.cluster_centers_[cluster_label]
    similar_words = model.similar_by_vector(centroid, topn=5)
    top_representative_words[cluster_label] = similar_words

most_rep_cluster1 = top_representative_words[0]
most_rep_cluster2 = top_representative_words[1]

In [17]:
#This is an autograded cell, do not edit/delete
print(most_rep_cluster1)
print(most_rep_cluster2)

[('http_dol##.net_index###.html_http', 0.9182074666023254), ('dol##.net_index####.html_http_dol##.net', 0.9084749817848206), ('index###.html_http_dol##.net_index###.html', 0.9070622324943542), ('Deltagen_undertakes', 0.9052171111106873), ('By_TRICIA_SCRUGGS', 0.9017032980918884)]
[('Emil_Protalinski_Published', 0.9214745759963989), ('By_HuDie_####-##-##', 0.9198623895645142), ('By_QianMian_####-##-##', 0.9184156656265259), ('BY_GEOFF_KOHL', 0.9172717332839966), ('By_XiaoBing_####-##-##', 0.9158840775489807)]
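
Since the lab asks for a seed of 42 wherever applicable, the random downsampling step can also be made repeatable. The sketch below is one possible approach using a local `random.Random` generator; the helper name `sample_vocab` and the stand-in vocabulary are hypothetical. With the real model, the list passed in would be the vocabulary keys.

```python
import random

# Hypothetical stand-in vocabulary; in the lab this would be the model's vocabulary keys.
vocab = [f"word_{i}" for i in range(1000)]

def sample_vocab(words, k, seed=42):
    """Draw k distinct words using a local generator so the global random state is untouched."""
    rng = random.Random(seed)
    return rng.sample(words, k)

first_draw = sample_vocab(vocab, 25)
second_draw = sample_vocab(vocab, 25)  # same seed, so the same 25 words in the same order
```

A fixed sample means K-means sees the same input on every run, so the clusters and their representative words can be compared across environments.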


### **Q6 (1 point)**
What loss function does the skip-gram model use? Briefly describe what this function minimizes.

**Do not delete the below cell**

The skip-gram model uses a cross-entropy (negative log-likelihood) loss. Skip-gram takes a single target (center) word as input and predicts the context words around it. The model computes the probability of observing each context word within the window given the target word, and the loss minimizes the negative log of that probability, which is equivalent to maximizing the likelihood of the true context words. When negative sampling is used, the objective instead pushes up the predicted probability of true (target, context) pairs while pushing down the predicted probability of pairs formed with words sampled at random from the vocabulary.
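
To make the two objectives concrete, here is a minimal NumPy sketch for a single (target, context) pair. It assumes the standard textbook formulation with separate input and output vectors; the function names, dimensions, and random toy values are assumptions for illustration and are not taken from gensim's training code.

```python
import numpy as np

def softmax_cross_entropy(center_vec, output_vecs, context_index):
    """Full-softmax loss: -log P(context | center) over the whole vocabulary."""
    scores = output_vecs @ center_vec
    scores = scores - scores.max()                      # shift for numerical stability
    log_probs = scores - np.log(np.sum(np.exp(scores)))
    return float(-log_probs[context_index])

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Negative-sampling loss: reward the true pair, penalize the sampled pairs."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    positive_term = -np.log(sigmoid(context_vec @ center_vec))
    negative_term = -np.sum(np.log(sigmoid(-(negative_vecs @ center_vec))))
    return float(positive_term + negative_term)

# Toy setup (made-up sizes): vocabulary of 5 words, 4-d vectors, 3 negative samples.
rng = np.random.default_rng(42)
center = rng.normal(size=4)
outputs = rng.normal(size=(5, 4))

full_loss = softmax_cross_entropy(center, outputs, context_index=2)
ns_loss = negative_sampling_loss(center, outputs[2], outputs[[0, 1, 4]])
```

Both losses shrink as the dot product between the target vector and the true context vector grows; the negative-sampling version avoids normalizing over the full vocabulary by comparing against only a few sampled words.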

In [18]:
# YOUR CODE HERE

### **Bonus Question (1 point)**
Find at least 2 interesting word-vector combinations like the ones given in Q4.

**Do not delete the below cell**

In [20]:
# YOUR CODE HERE
closest5 = model.most_similar(positive=['Nissan', 'Drift'], negative=['America'])[:2]
closest6 = model.most_similar(positive=['Reggaeton', 'Korean', 'artist'], negative=['unpopular'])[:2]
print(closest5)
print(closest6)

[('Proton_Satria', 0.47245484590530396), ('Toyota_Auris', 0.47118985652923584)]
[('Ho_Quynh_Huong', 0.5279245972633362), ('pansori', 0.5203095078468323)]
