# Skip-gram Word2Vec

In this notebook, I'll lead you through using PyTorch to implement the [Word2Vec algorithm](https://en.wikipedia.org/wiki/Word2vec) using the skip-gram architecture. By implementing this, you'll learn about embedding words for use in natural language processing. This will come in handy when dealing with things like machine translation.

## Readings

Here are the resources I used to build this notebook. I suggest reading these either beforehand or while you're working on this material.

* A really good [conceptual overview](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) of Word2Vec from Chris McCormick 
* [First Word2Vec paper](https://arxiv.org/pdf/1301.3781.pdf) from Mikolov et al.
* [Neural Information Processing Systems, paper](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) with improvements for Word2Vec also from Mikolov et al.

---
## Word embeddings

When you're dealing with words in text, you end up with tens of thousands of word classes to analyze; one for each word in a vocabulary. Trying to one-hot encode these words is massively inefficient because most values in a one-hot vector will be set to zero. So, the matrix multiplication that happens in between a one-hot input vector and a first, hidden layer will result in mostly zero-valued hidden outputs.

<img src='assets/one_hot_encoding.png' width=50%>

To solve this problem and greatly increase the efficiency of our networks, we use what are called **embeddings**. Embeddings are just a fully connected layer like you've seen before. We call this layer the embedding layer and the weights are embedding weights. We skip the multiplication into the embedding layer by instead directly grabbing the hidden layer values from the weight matrix. We can do this because the multiplication of a one-hot encoded vector with a matrix returns the row of the matrix corresponding the index of the "on" input unit.

<img src='assets/lookup_matrix.png' width=50%>

Instead of doing the matrix multiplication, we use the weight matrix as a lookup table. We encode the words as integers, for example "heart" is encoded as 958, "mind" as 18094. Then to get hidden layer values for "heart", you just take the 958th row of the embedding matrix. This process is called an **embedding lookup** and the number of hidden units is the **embedding dimension**.

<img src='assets/tokenize_lookup.png' width=50%>
 
There is nothing magical going on here. The embedding lookup table is just a weight matrix. The embedding layer is just a hidden layer. The lookup is just a shortcut for the matrix multiplication. The lookup table is trained just like any weight matrix.

Embeddings aren't only used for words of course. You can use them for any model where you have a massive number of classes. A particular type of model called **Word2Vec** uses the embedding layer to find vector representations of words that contain semantic meaning.

---
## Word2Vec

The Word2Vec algorithm finds much more efficient representations by finding vectors that represent the words. These vectors also contain semantic information about the words.

<img src="assets/context_drink.png" width=40%>

Words that show up in similar **contexts**, such as "coffee", "tea", and "water" will have vectors near each other. Different words will be further away from one another, and relationships can be represented by distance in vector space.

<img src="assets/vector_distance.png" width=40%>


There are two architectures for implementing Word2Vec:
>* CBOW (Continuous Bag-Of-Words) and 
* Skip-gram

<img src="assets/word2vec_architectures.png" width=60%>

In this implementation, we'll be using the **skip-gram architecture** because it performs better than CBOW. Here, we pass in a word and try to predict the words surrounding it in the text. In this way, we can train the network to learn representations for words that show up in similar contexts.

---
## Loading Data

Next, we'll ask you to load in data and place it in the `data` directory

1. Load the [text8 dataset](https://s3.amazonaws.com/video.udacity-data.com/topher/2018/October/5bbe6499_text8/text8.zip); a file of cleaned up *Wikipedia article text* from Matt Mahoney. 
2. Place that data in the `data` folder in the home directory.
3. Then you can extract it and delete the archive, zip file to save storage space.

After following these steps, you should have one file in your data directory: `data/text8`.

In [2]:
# read in the extracted text file      
with open('data/text8') as f:
    text = f.read()

# print out the first 100 characters
print(text[:100])

 anarchism originated as a term of abuse first used against early working class radicals including t


## Pre-processing

Here I'm fixing up the text to make training easier. This comes from the `utils.py` file. The `preprocess` function does a few things:
>* It converts any punctuation into tokens, so a period is changed to ` <PERIOD> `. In this data set, there aren't any periods, but it will help in other NLP problems. 
* It removes all words that show up five or *fewer* times in the dataset. This will greatly reduce issues due to noise in the data and improve the quality of the vector representations. 
* It returns a list of words in the text.

This may take a few seconds to run, since our text file is quite large. If you want to write your own functions for this stuff, go for it!

In [4]:
import utils

# get list of words
words = utils.preprocess(text)
print(words[:30])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst']


In [5]:
# print some stats about this word data
print("Total words in text: {}".format(len(words)))
print("Unique words: {}".format(len(set(words)))) # `set` removes any duplicate words

Total words in text: 16680599
Unique words: 63641


### Dictionaries

Next, I'm creating two dictionaries to convert words to integers and back again (integers to words). This is again done with a function in the `utils.py` file. `create_lookup_tables` takes in a list of words in a text and returns two dictionaries.
>* The integers are assigned in descending frequency order, so the most frequent word ("the") is given the integer 0 and the next most frequent is 1, and so on. 

Once we have our dictionaries, the words are converted to integers and stored in the list `int_words`.

In [7]:
vocab_to_int, int_to_vocab = utils.create_lookup_tables(words)
int_words = [vocab_to_int[word] for word in words]

print(int_words[:30])

[5233, 3080, 11, 5, 194, 1, 3133, 45, 58, 155, 127, 741, 476, 10571, 133, 0, 27349, 1, 0, 102, 854, 2, 0, 15067, 58112, 1, 0, 150, 854, 3580]


## Subsampling

Words that show up often such as "the", "of", and "for" don't provide much context to the nearby words. If we discard some of them, we can remove some of the noise from our data and in return get faster training and better representations. This process is called subsampling by Mikolov. For each word $w_i$ in the training set, we'll discard it with probability given by 

$$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} $$

where $t$ is a threshold parameter and $f(w_i)$ is the frequency of word $w_i$ in the total dataset.

$$ P(0) = 1 - \sqrt{\frac{1*10^{-5}}{1*10^6/16*10^6}} = 0.98735 $$

I'm going to leave this up to you as an exercise. Check out my solution to see how I did it.

> **Exercise:** Implement subsampling for the words in `int_words`. That is, go through `int_words` and discard each word given the probablility $P(w_i)$ shown above. Note that $P(w_i)$ is the probability that a word is discarded. Assign the subsampled data to `train_words`.

In [37]:
max(word_counts)/sum(word_counts)
np.random.rand(1).item() 
sum(word_counts)
#len(int_words)

2025056620

In [39]:
from collections import Counter
import random
import numpy as np

threshold = 1e-5
word_counts = Counter(int_words)
print(list(word_counts.items())[0:10])  # dictionary of int_words, how many times they appear

# discard some frequent words, according to the subsampling equation
# create a new list of words for training
train_words = []
total_words = sum(word_counts)
t = max(word_counts)/total_words
print_every = 1000
i=0
for word in int_words:
    i += 1
    rel_frequency = word_counts[word]/total_words
    prob = 1 - np.sqrt(t/rel_frequency) 
    if i % print_every == 0:
        print(i)
    if np.random.rand(1).item() > prob:
        train_words.append(int_words[word])

#print(train_words[:30])

[(5233, 303), (3080, 572), (11, 131815), (5, 325873), (194, 7219), (1, 593677), (3133, 563), (45, 28810), (58, 22737), (155, 8432)]
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000
34000
35000
36000
37000
38000
39000
40000
41000
42000
43000
44000
45000
46000
47000
48000
49000
50000
51000
52000
53000
54000
55000
56000
57000
58000
59000
60000
61000
62000
63000
64000
65000
66000
67000
68000
69000
70000
71000
72000
73000
74000
75000
76000
77000
78000
79000
80000
81000
82000
83000
84000
85000
86000
87000
88000
89000
90000
91000
92000
93000
94000
95000
96000
97000
98000
99000
100000
101000
102000
103000
104000
105000
106000
107000
108000
109000
110000
111000
112000
113000
114000
115000
116000
117000
118000
119000
120000
121000
122000
123000
124000
125000
126000
127000
128000
129000
130000
131000
132000
133000
134000
135000
136000
137000
138000
139000
140

1150000
1151000
1152000
1153000
1154000
1155000
1156000
1157000
1158000
1159000
1160000
1161000
1162000
1163000
1164000
1165000
1166000
1167000
1168000
1169000
1170000
1171000
1172000
1173000
1174000
1175000
1176000
1177000
1178000
1179000
1180000
1181000
1182000
1183000
1184000
1185000
1186000
1187000
1188000
1189000
1190000
1191000
1192000
1193000
1194000
1195000
1196000
1197000
1198000
1199000
1200000
1201000
1202000
1203000
1204000
1205000
1206000
1207000
1208000
1209000
1210000
1211000
1212000
1213000
1214000
1215000
1216000
1217000
1218000
1219000
1220000
1221000
1222000
1223000
1224000
1225000
1226000
1227000
1228000
1229000
1230000
1231000
1232000
1233000
1234000
1235000
1236000
1237000
1238000
1239000
1240000
1241000
1242000
1243000
1244000
1245000
1246000
1247000
1248000
1249000
1250000
1251000
1252000
1253000
1254000
1255000
1256000
1257000
1258000
1259000
1260000
1261000
1262000
1263000
1264000
1265000
1266000
1267000
1268000
1269000
1270000
1271000
1272000
1273000
1274000


2200000
2201000
2202000
2203000
2204000
2205000
2206000
2207000
2208000
2209000
2210000
2211000
2212000
2213000
2214000
2215000
2216000
2217000
2218000
2219000
2220000
2221000
2222000
2223000
2224000
2225000
2226000
2227000
2228000
2229000
2230000
2231000
2232000
2233000
2234000
2235000
2236000
2237000
2238000
2239000
2240000
2241000
2242000
2243000
2244000
2245000
2246000
2247000
2248000
2249000
2250000
2251000
2252000
2253000
2254000
2255000
2256000
2257000
2258000
2259000
2260000
2261000
2262000
2263000
2264000
2265000
2266000
2267000
2268000
2269000
2270000
2271000
2272000
2273000
2274000
2275000
2276000
2277000
2278000
2279000
2280000
2281000
2282000
2283000
2284000
2285000
2286000
2287000
2288000
2289000
2290000
2291000
2292000
2293000
2294000
2295000
2296000
2297000
2298000
2299000
2300000
2301000
2302000
2303000
2304000
2305000
2306000
2307000
2308000
2309000
2310000
2311000
2312000
2313000
2314000
2315000
2316000
2317000
2318000
2319000
2320000
2321000
2322000
2323000
2324000


3263000
3264000
3265000
3266000
3267000
3268000
3269000
3270000
3271000
3272000
3273000
3274000
3275000
3276000
3277000
3278000
3279000
3280000
3281000
3282000
3283000
3284000
3285000
3286000
3287000
3288000
3289000
3290000
3291000
3292000
3293000
3294000
3295000
3296000
3297000
3298000
3299000
3300000
3301000
3302000
3303000
3304000
3305000
3306000
3307000
3308000
3309000
3310000
3311000
3312000
3313000
3314000
3315000
3316000
3317000
3318000
3319000
3320000
3321000
3322000
3323000
3324000
3325000
3326000
3327000
3328000
3329000
3330000
3331000
3332000
3333000
3334000
3335000
3336000
3337000
3338000
3339000
3340000
3341000
3342000
3343000
3344000
3345000
3346000
3347000
3348000
3349000
3350000
3351000
3352000
3353000
3354000
3355000
3356000
3357000
3358000
3359000
3360000
3361000
3362000
3363000
3364000
3365000
3366000
3367000
3368000
3369000
3370000
3371000
3372000
3373000
3374000
3375000
3376000
3377000
3378000
3379000
3380000
3381000
3382000
3383000
3384000
3385000
3386000
3387000


4288000
4289000
4290000
4291000
4292000
4293000
4294000
4295000
4296000
4297000
4298000
4299000
4300000
4301000
4302000
4303000
4304000
4305000
4306000
4307000
4308000
4309000
4310000
4311000
4312000
4313000
4314000
4315000
4316000
4317000
4318000
4319000
4320000
4321000
4322000
4323000
4324000
4325000
4326000
4327000
4328000
4329000
4330000
4331000
4332000
4333000
4334000
4335000
4336000
4337000
4338000
4339000
4340000
4341000
4342000
4343000
4344000
4345000
4346000
4347000
4348000
4349000
4350000
4351000
4352000
4353000
4354000
4355000
4356000
4357000
4358000
4359000
4360000
4361000
4362000
4363000
4364000
4365000
4366000
4367000
4368000
4369000
4370000
4371000
4372000
4373000
4374000
4375000
4376000
4377000
4378000
4379000
4380000
4381000
4382000
4383000
4384000
4385000
4386000
4387000
4388000
4389000
4390000
4391000
4392000
4393000
4394000
4395000
4396000
4397000
4398000
4399000
4400000
4401000
4402000
4403000
4404000
4405000
4406000
4407000
4408000
4409000
4410000
4411000
4412000


5323000
5324000
5325000
5326000
5327000
5328000
5329000
5330000
5331000
5332000
5333000
5334000
5335000
5336000
5337000
5338000
5339000
5340000
5341000
5342000
5343000
5344000
5345000
5346000
5347000
5348000
5349000
5350000
5351000
5352000
5353000
5354000
5355000
5356000
5357000
5358000
5359000
5360000
5361000
5362000
5363000
5364000
5365000
5366000
5367000
5368000
5369000
5370000
5371000
5372000
5373000
5374000
5375000
5376000
5377000
5378000
5379000
5380000
5381000
5382000
5383000
5384000
5385000
5386000
5387000
5388000
5389000
5390000
5391000
5392000
5393000
5394000
5395000
5396000
5397000
5398000
5399000
5400000
5401000
5402000
5403000
5404000
5405000
5406000
5407000
5408000
5409000
5410000
5411000
5412000
5413000
5414000
5415000
5416000
5417000
5418000
5419000
5420000
5421000
5422000
5423000
5424000
5425000
5426000
5427000
5428000
5429000
5430000
5431000
5432000
5433000
5434000
5435000
5436000
5437000
5438000
5439000
5440000
5441000
5442000
5443000
5444000
5445000
5446000
5447000


6350000
6351000
6352000
6353000
6354000
6355000
6356000
6357000
6358000
6359000
6360000
6361000
6362000
6363000
6364000
6365000
6366000
6367000
6368000
6369000
6370000
6371000
6372000
6373000
6374000
6375000
6376000
6377000
6378000
6379000
6380000
6381000
6382000
6383000
6384000
6385000
6386000
6387000
6388000
6389000
6390000
6391000
6392000
6393000
6394000
6395000
6396000
6397000
6398000
6399000
6400000
6401000
6402000
6403000
6404000
6405000
6406000
6407000
6408000
6409000
6410000
6411000
6412000
6413000
6414000
6415000
6416000
6417000
6418000
6419000
6420000
6421000
6422000
6423000
6424000
6425000
6426000
6427000
6428000
6429000
6430000
6431000
6432000
6433000
6434000
6435000
6436000
6437000
6438000
6439000
6440000
6441000
6442000
6443000
6444000
6445000
6446000
6447000
6448000
6449000
6450000
6451000
6452000
6453000
6454000
6455000
6456000
6457000
6458000
6459000
6460000
6461000
6462000
6463000
6464000
6465000
6466000
6467000
6468000
6469000
6470000
6471000
6472000
6473000
6474000


7410000
7411000
7412000
7413000
7414000
7415000
7416000
7417000
7418000
7419000
7420000
7421000
7422000
7423000
7424000
7425000
7426000
7427000
7428000
7429000
7430000
7431000
7432000
7433000
7434000
7435000
7436000
7437000
7438000
7439000
7440000
7441000
7442000
7443000
7444000
7445000
7446000
7447000
7448000
7449000
7450000
7451000
7452000
7453000
7454000
7455000
7456000
7457000
7458000
7459000
7460000
7461000
7462000
7463000
7464000
7465000
7466000
7467000
7468000
7469000
7470000
7471000
7472000
7473000
7474000
7475000
7476000
7477000
7478000
7479000
7480000
7481000
7482000
7483000
7484000
7485000
7486000
7487000
7488000
7489000
7490000
7491000
7492000
7493000
7494000
7495000
7496000
7497000
7498000
7499000
7500000
7501000
7502000
7503000
7504000
7505000
7506000
7507000
7508000
7509000
7510000
7511000
7512000
7513000
7514000
7515000
7516000
7517000
7518000
7519000
7520000
7521000
7522000
7523000
7524000
7525000
7526000
7527000
7528000
7529000
7530000
7531000
7532000
7533000
7534000


8446000
8447000
8448000
8449000
8450000
8451000
8452000
8453000
8454000
8455000
8456000
8457000
8458000
8459000
8460000
8461000
8462000
8463000
8464000
8465000
8466000
8467000
8468000
8469000
8470000
8471000
8472000
8473000
8474000
8475000
8476000
8477000
8478000
8479000
8480000
8481000
8482000
8483000
8484000
8485000
8486000
8487000
8488000
8489000
8490000
8491000
8492000
8493000
8494000
8495000
8496000
8497000
8498000
8499000
8500000
8501000
8502000
8503000
8504000
8505000
8506000
8507000
8508000
8509000
8510000
8511000
8512000
8513000
8514000
8515000
8516000
8517000
8518000
8519000
8520000
8521000
8522000
8523000
8524000
8525000
8526000
8527000
8528000
8529000
8530000
8531000
8532000
8533000
8534000
8535000
8536000
8537000
8538000
8539000
8540000
8541000
8542000
8543000
8544000
8545000
8546000
8547000
8548000
8549000
8550000
8551000
8552000
8553000
8554000
8555000
8556000
8557000
8558000
8559000
8560000
8561000
8562000
8563000
8564000
8565000
8566000
8567000
8568000
8569000
8570000


9494000
9495000
9496000
9497000
9498000
9499000
9500000
9501000
9502000
9503000
9504000
9505000
9506000
9507000
9508000
9509000
9510000
9511000
9512000
9513000
9514000
9515000
9516000
9517000
9518000
9519000
9520000
9521000
9522000
9523000
9524000
9525000
9526000
9527000
9528000
9529000
9530000
9531000
9532000
9533000
9534000
9535000
9536000
9537000
9538000
9539000
9540000
9541000
9542000
9543000
9544000
9545000
9546000
9547000
9548000
9549000
9550000
9551000
9552000
9553000
9554000
9555000
9556000
9557000
9558000
9559000
9560000
9561000
9562000
9563000
9564000
9565000
9566000
9567000
9568000
9569000
9570000
9571000
9572000
9573000
9574000
9575000
9576000
9577000
9578000
9579000
9580000
9581000
9582000
9583000
9584000
9585000
9586000
9587000
9588000
9589000
9590000
9591000
9592000
9593000
9594000
9595000
9596000
9597000
9598000
9599000
9600000
9601000
9602000
9603000
9604000
9605000
9606000
9607000
9608000
9609000
9610000
9611000
9612000
9613000
9614000
9615000
9616000
9617000
9618000


10482000
10483000
10484000
10485000
10486000
10487000
10488000
10489000
10490000
10491000
10492000
10493000
10494000
10495000
10496000
10497000
10498000
10499000
10500000
10501000
10502000
10503000
10504000
10505000
10506000
10507000
10508000
10509000
10510000
10511000
10512000
10513000
10514000
10515000
10516000
10517000
10518000
10519000
10520000
10521000
10522000
10523000
10524000
10525000
10526000
10527000
10528000
10529000
10530000
10531000
10532000
10533000
10534000
10535000
10536000
10537000
10538000
10539000
10540000
10541000
10542000
10543000
10544000
10545000
10546000
10547000
10548000
10549000
10550000
10551000
10552000
10553000
10554000
10555000
10556000
10557000
10558000
10559000
10560000
10561000
10562000
10563000
10564000
10565000
10566000
10567000
10568000
10569000
10570000
10571000
10572000
10573000
10574000
10575000
10576000
10577000
10578000
10579000
10580000
10581000
10582000
10583000
10584000
10585000
10586000
10587000
10588000
10589000
10590000
10591000
10592000
1

11428000
11429000
11430000
11431000
11432000
11433000
11434000
11435000
11436000
11437000
11438000
11439000
11440000
11441000
11442000
11443000
11444000
11445000
11446000
11447000
11448000
11449000
11450000
11451000
11452000
11453000
11454000
11455000
11456000
11457000
11458000
11459000
11460000
11461000
11462000
11463000
11464000
11465000
11466000
11467000
11468000
11469000
11470000
11471000
11472000
11473000
11474000
11475000
11476000
11477000
11478000
11479000
11480000
11481000
11482000
11483000
11484000
11485000
11486000
11487000
11488000
11489000
11490000
11491000
11492000
11493000
11494000
11495000
11496000
11497000
11498000
11499000
11500000
11501000
11502000
11503000
11504000
11505000
11506000
11507000
11508000
11509000
11510000
11511000
11512000
11513000
11514000
11515000
11516000
11517000
11518000
11519000
11520000
11521000
11522000
11523000
11524000
11525000
11526000
11527000
11528000
11529000
11530000
11531000
11532000
11533000
11534000
11535000
11536000
11537000
11538000
1

12372000
12373000
12374000
12375000
12376000
12377000
12378000
12379000
12380000
12381000
12382000
12383000
12384000
12385000
12386000
12387000
12388000
12389000
12390000
12391000
12392000
12393000
12394000
12395000
12396000
12397000
12398000
12399000
12400000
12401000
12402000
12403000
12404000
12405000
12406000
12407000
12408000
12409000
12410000
12411000
12412000
12413000
12414000
12415000
12416000
12417000
12418000
12419000
12420000
12421000
12422000
12423000
12424000
12425000
12426000
12427000
12428000
12429000
12430000
12431000
12432000
12433000
12434000
12435000
12436000
12437000
12438000
12439000
12440000
12441000
12442000
12443000
12444000
12445000
12446000
12447000
12448000
12449000
12450000
12451000
12452000
12453000
12454000
12455000
12456000
12457000
12458000
12459000
12460000
12461000
12462000
12463000
12464000
12465000
12466000
12467000
12468000
12469000
12470000
12471000
12472000
12473000
12474000
12475000
12476000
12477000
12478000
12479000
12480000
12481000
12482000
1

13319000
13320000
13321000
13322000
13323000
13324000
13325000
13326000
13327000
13328000
13329000
13330000
13331000
13332000
13333000
13334000
13335000
13336000
13337000
13338000
13339000
13340000
13341000
13342000
13343000
13344000
13345000
13346000
13347000
13348000
13349000
13350000
13351000
13352000
13353000
13354000
13355000
13356000
13357000
13358000
13359000
13360000
13361000
13362000
13363000
13364000
13365000
13366000
13367000
13368000
13369000
13370000
13371000
13372000
13373000
13374000
13375000
13376000
13377000
13378000
13379000
13380000
13381000
13382000
13383000
13384000
13385000
13386000
13387000
13388000
13389000
13390000
13391000
13392000
13393000
13394000
13395000
13396000
13397000
13398000
13399000
13400000
13401000
13402000
13403000
13404000
13405000
13406000
13407000
13408000
13409000
13410000
13411000
13412000
13413000
13414000
13415000
13416000
13417000
13418000
13419000
13420000
13421000
13422000
13423000
13424000
13425000
13426000
13427000
13428000
13429000
1

14231000
14232000
14233000
14234000
14235000
14236000
14237000
14238000
14239000
14240000
14241000
14242000
14243000
14244000
14245000
14246000
14247000
14248000
14249000
14250000
14251000
14252000
14253000
14254000
14255000
14256000
14257000
14258000
14259000
14260000
14261000
14262000
14263000
14264000
14265000
14266000
14267000
14268000
14269000
14270000
14271000
14272000
14273000
14274000
14275000
14276000
14277000
14278000
14279000
14280000
14281000
14282000
14283000
14284000
14285000
14286000
14287000
14288000
14289000
14290000
14291000
14292000
14293000
14294000
14295000
14296000
14297000
14298000
14299000
14300000
14301000
14302000
14303000
14304000
14305000
14306000
14307000
14308000
14309000
14310000
14311000
14312000
14313000
14314000
14315000
14316000
14317000
14318000
14319000
14320000
14321000
14322000
14323000
14324000
14325000
14326000
14327000
14328000
14329000
14330000
14331000
14332000
14333000
14334000
14335000
14336000
14337000
14338000
14339000
14340000
14341000
1

15155000
15156000
15157000
15158000
15159000
15160000
15161000
15162000
15163000
15164000
15165000
15166000
15167000
15168000
15169000
15170000
15171000
15172000
15173000
15174000
15175000
15176000
15177000
15178000
15179000
15180000
15181000
15182000
15183000
15184000
15185000
15186000
15187000
15188000
15189000
15190000
15191000
15192000
15193000
15194000
15195000
15196000
15197000
15198000
15199000
15200000
15201000
15202000
15203000
15204000
15205000
15206000
15207000
15208000
15209000
15210000
15211000
15212000
15213000
15214000
15215000
15216000
15217000
15218000
15219000
15220000
15221000
15222000
15223000
15224000
15225000
15226000
15227000
15228000
15229000
15230000
15231000
15232000
15233000
15234000
15235000
15236000
15237000
15238000
15239000
15240000
15241000
15242000
15243000
15244000
15245000
15246000
15247000
15248000
15249000
15250000
15251000
15252000
15253000
15254000
15255000
15256000
15257000
15258000
15259000
15260000
15261000
15262000
15263000
15264000
15265000
1

16074000
16075000
16076000
16077000
16078000
16079000
16080000
16081000
16082000
16083000
16084000
16085000
16086000
16087000
16088000
16089000
16090000
16091000
16092000
16093000
16094000
16095000
16096000
16097000
16098000
16099000
16100000
16101000
16102000
16103000
16104000
16105000
16106000
16107000
16108000
16109000
16110000
16111000
16112000
16113000
16114000
16115000
16116000
16117000
16118000
16119000
16120000
16121000
16122000
16123000
16124000
16125000
16126000
16127000
16128000
16129000
16130000
16131000
16132000
16133000
16134000
16135000
16136000
16137000
16138000
16139000
16140000
16141000
16142000
16143000
16144000
16145000
16146000
16147000
16148000
16149000
16150000
16151000
16152000
16153000
16154000
16155000
16156000
16157000
16158000
16159000
16160000
16161000
16162000
16163000
16164000
16165000
16166000
16167000
16168000
16169000
16170000
16171000
16172000
16173000
16174000
16175000
16176000
16177000
16178000
16179000
16180000
16181000
16182000
16183000
16184000
1

In [44]:
print(train_words[:30])
for word in train_words[:30]:
    print(int_to_vocab[word])

print(len(train_words))

[2256, 113, 741, 6, 909, 2731, 97, 15200, 7088, 75, 63, 666, 4861, 5233, 19, 5233, 153, 1774, 11, 5233, 8, 0, 3080, 11, 1774, 284, 6, 1, 97, 194]
interests
government
working
to
divided
violent
up
coercive
anarchists
no
american
types
chaos
anarchism
that
anarchism
what
institutions
as
anarchism
nine
the
originated
as
institutions
europe
to
of
up
term
13586916


## Making batches

Now that our data is in good shape, we need to get it into the proper form to pass it into our network. With the skip-gram architecture, for each word in the text, we want to define a surrounding _context_ and grab all the words in a window around that word, with size $C$. 

From [Mikolov et al.](https://arxiv.org/pdf/1301.3781.pdf): 

"Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our training examples... If we choose $C = 5$, for each training word we will select randomly a number $R$ in range $[ 1: C ]$, and then use $R$ words from history and $R$ words from the future of the current word as correct labels."

> **Exercise:** Implement a function `get_target` that receives a list of words, an index, and a window size, then returns a list of words in the window around the index. Make sure to use the algorithm described above, where you chose a random number of words to from the window.

Say, we have an input and we're interested in the idx=2 token, `741`: 
```
[5233, 58, 741, 10571, 27349, 0, 15067, 58112, 3580, 58, 10712]
```

For `R=2`, `get_target` should return a list of four values:
```
[5233, 58, 10571, 27349]
```

In [None]:
def get_target(words, idx, window_size=5):
    ''' Get a list of words in a window around an index. '''
    
    # implement this function
    
    return None

In [None]:
# test your code!

# run this cell multiple times to check for random window selection
int_text = [i for i in range(10)]
print('Input: ', int_text)
idx=5 # word index of interest

target = get_target(int_text, idx=idx, window_size=5)
print('Target: ', target)  # you should get some indices around the idx

### Generating Batches 

Here's a generator function that returns batches of input and target data for our model, using the `get_target` function from above. The idea is that it grabs `batch_size` words from a words list. Then for each of those batches, it gets the target words in a window.

In [None]:
def get_batches(words, batch_size, window_size=5):
    ''' Create a generator of word batches as a tuple (inputs, targets) '''
    
    n_batches = len(words)//batch_size
    
    # only full batches
    words = words[:n_batches*batch_size]
    
    for idx in range(0, len(words), batch_size):
        x, y = [], []
        batch = words[idx:idx+batch_size]
        for ii in range(len(batch)):
            batch_x = batch[ii]
            batch_y = get_target(batch, ii, window_size)
            y.extend(batch_y)
            x.extend([batch_x]*len(batch_y))
        yield x, y
    

In [None]:
int_text = [i for i in range(20)]
x,y = next(get_batches(int_text, batch_size=4, window_size=5))

print('x\n', x)
print('y\n', y)

## Building the graph

Below is an approximate diagram of the general structure of our network.
<img src="assets/skip_gram_arch.png" width=60%>

>* The input words are passed in as batches of input word tokens. 
* This will go into a hidden layer of linear units (our embedding layer). 
* Then, finally into a softmax output layer. 

We'll use the softmax layer to make a prediction about the context words by sampling, as usual.

The idea here is to train the embedding layer weight matrix to find efficient representations for our words. We can discard the softmax layer because we don't really care about making predictions with this network. We just want the embedding matrix so we can use it in _other_ networks we build using this dataset.

---
## Validation

Here, I'm creating a function that will help us observe our model as it learns. We're going to choose a few common words and few uncommon words. Then, we'll print out the closest words to them using the cosine similarity: 

<img src="assets/two_vectors.png" width=30%>

$$
\mathrm{similarity} = \cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}||\vec{b}|}
$$


We can encode the validation words as vectors $\vec{a}$ using the embedding table, then calculate the similarity with each word vector $\vec{b}$ in the embedding table. With the similarities, we can print out the validation words and words in our embedding table semantically similar to those words. It's a nice way to check that our embedding table is grouping together words with similar semantic meanings.

In [None]:
def cosine_similarity(embedding, valid_size=16, valid_window=100, device='cpu'):
    """ Returns the cosine similarity of validation words with words in the embedding matrix.
        Here, embedding should be a PyTorch embedding module.
    """
    
    # Here we're calculating the cosine similarity between some random words and 
    # our embedding vectors. With the similarities, we can look at what words are
    # close to our random words.
    
    # sim = (a . b) / |a||b|
    
    embed_vectors = embedding.weight
    
    # magnitude of embedding vectors, |b|
    magnitudes = embed_vectors.pow(2).sum(dim=1).sqrt().unsqueeze(0)
    
    # pick N words from our ranges (0,window) and (1000,1000+window). lower id implies more frequent 
    valid_examples = np.array(random.sample(range(valid_window), valid_size//2))
    valid_examples = np.append(valid_examples,
                               random.sample(range(1000,1000+valid_window), valid_size//2))
    valid_examples = torch.LongTensor(valid_examples).to(device)
    
    valid_vectors = embedding(valid_examples)
    similarities = torch.mm(valid_vectors, embed_vectors.t())/magnitudes
        
    return valid_examples, similarities

## SkipGram model

Define and train the SkipGram model. 
> You'll need to define an [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) and a final, softmax output layer.

An Embedding layer takes in a number of inputs, importantly:
* **num_embeddings** – the size of the dictionary of embeddings, or how many rows you'll want in the embedding weight matrix
* **embedding_dim** – the size of each embedding vector; the embedding dimension

In [None]:
import torch
from torch import nn
import torch.optim as optim

In [None]:
class SkipGram(nn.Module):
    def __init__(self, n_vocab, n_embed):
        super().__init__()
        
        # complete this SkipGram model
    
    def forward(self, x):
        
        # define the forward behavior
        
        return x

### Training

Below is our training loop, and I recommend that you train on GPU, if available.

**Note that, because we applied a softmax function to our model output, we are using NLLLoss** as opposed to cross entropy. This is because Softmax  in combination with NLLLoss = CrossEntropy loss .

In [None]:
# check if GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

embedding_dim=300 # you can change, if you want

model = SkipGram(len(vocab_to_int), embedding_dim).to(device)
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

print_every = 500
steps = 0
epochs = 5

# train for some number of epochs
for e in range(epochs):
    
    # get input and target batches
    for inputs, targets in get_batches(train_words, 512):
        steps += 1
        inputs, targets = torch.LongTensor(inputs), torch.LongTensor(targets)
        inputs, targets = inputs.to(device), targets.to(device)
        
        log_ps = model(inputs)
        loss = criterion(log_ps, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if steps % print_every == 0:                  
            # getting examples and similarities      
            valid_examples, valid_similarities = cosine_similarity(model.embed, device=device)
            _, closest_idxs = valid_similarities.topk(6) # topk highest similarities
            
            valid_examples, closest_idxs = valid_examples.to('cpu'), closest_idxs.to('cpu')
            for ii, valid_idx in enumerate(valid_examples):
                closest_words = [int_to_vocab[idx.item()] for idx in closest_idxs[ii]][1:]
                print(int_to_vocab[valid_idx.item()] + " | " + ', '.join(closest_words))
            print("...")

## Visualizing the word vectors

Below we'll use T-SNE to visualize how our high-dimensional word vectors cluster together. T-SNE is used to project these vectors into two dimensions while preserving local stucture. Check out [this post from Christopher Olah](http://colah.github.io/posts/2014-10-Visualizing-MNIST/) to learn more about T-SNE and other ways to visualize high-dimensional data.

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

In [None]:
# getting embeddings from the embedding layer of our model, by name
embeddings = model.embed.weight.to('cpu').data.numpy()

In [None]:
viz_words = 600
tsne = TSNE()
embed_tsne = tsne.fit_transform(embeddings[:viz_words, :])

In [None]:
fig, ax = plt.subplots(figsize=(16, 16))
for idx in range(viz_words):
    plt.scatter(*embed_tsne[idx, :], color='steelblue')
    plt.annotate(int_to_vocab[idx], (embed_tsne[idx, 0], embed_tsne[idx, 1]), alpha=0.7)