# Word2Vec
___
## Notes
In this notebook, we will discuss the word2vec algorithm.

Specifically, we break our discussion down into the following sections:
> [Overview](#Overview)
>
> [Implementation](#Implementation)
>
>> [Pre-Trained Embeddings](#Pre-Trained-Embeddings)
>>
>> [Training Embeddings](#Training-Embeddings)
>>
>> [Preparing Word Vectors for ML Models](#Preparing-Word-Vectors-for-ML-Models)

We finish the notebook off with a [review](#Review) of everything discussed. 

## Overview
> **Word2Vec** is an embedding algorithm based on a shallow, two-layer neural network that takes a text corpus as input and outputs a vector representation for each word in the corpus.
>
> There are several ways to train a word2vec model, but for now we will look at the **skip-gram** method, which uses a window of words to the left and right of a given word as its context and learns a vector for the word by predicting the words in that window.
>
> This idea is based on the saying: "*you shall know a word by the company it keeps.*"
>
> From this vector representation, we can determine **word similarity**. A popular measure is **cosine similarity**, the cosine of the angle between two word vectors. The smaller the angle between the vectors, the closer the value is to 1 and the more similar the words are considered.
>
> These vector representations also make it possible to construct word analogies.

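As a minimal sketch of the cosine similarity described above, the function below computes the dot product of two vectors divided by the product of their lengths. The 2-dimensional vectors are made up for illustration and are not real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (not taken from any trained model).
king = [0.9, 0.8]
queen = [0.8, 0.9]
banana = [-0.7, 0.1]

print(cosine_similarity(king, queen))   # close to 1: small angle
print(cosine_similarity(king, banana))  # negative: large angle
```
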
## Implementation
> When using word2vec, we can either:
> 1. use pre-trained embeddings, where a model (word2vec, or a related algorithm such as GloVe) has already been trained on a large corpus of text:
>     - `glove-twitter-{25/50/100/200}`
>     - `glove-wiki-gigaword-{50/100/200/300}`
>     - `word2vec-google-news-300`
>     - `word2vec-ruscorpora-news-300`
>     - and a few [others](https://radimrehurek.com/gensim/models/word2vec.html)...
> 2. train embeddings using our own set of data.
>
> Once we have our embeddings, we need to do a bit more prep work to get the vector representations ready for input into a machine learning model.

### Pre-Trained Embeddings

We start by importing `gensim`, a package whose downloader gives access to a number of pre-trained embeddings, and loading the `glove-wiki-gigaword-100` embedding (the 100 is the length of each vector in the embedding).

In [3]:
# !pip install -U gensim

In [6]:
import gensim.downloader as api

wiki_embeddings = api.load('glove-wiki-gigaword-100')

Now, we'll examine the word vector for the word "king", and find words most similar to "king" based on our embeddings. 

In [5]:
wiki_embeddings['king']

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

In [7]:
wiki_embeddings.most_similar('king')

[('prince', 0.7682329416275024),
 ('queen', 0.7507690787315369),
 ('son', 0.7020888328552246),
 ('brother', 0.6985775828361511),
 ('monarch', 0.6977890729904175),
 ('throne', 0.691999077796936),
 ('kingdom', 0.6811409592628479),
 ('father', 0.6802029013633728),
 ('emperor', 0.6712858080863953),
 ('ii', 0.6676074266433716)]

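The word analogies mentioned in the overview come from vector arithmetic: `king - man + woman` should land near `queen`. The sketch below shows the idea on hypothetical 2-dimensional vectors invented for illustration; `analogy` is an illustrative helper, not a gensim function. With the loaded model, the corresponding gensim call would be `wiki_embeddings.most_similar(positive=['king', 'woman'], negative=['man'])`, which differs in detail (it works with normalized vectors), so treat this as an illustration of the idea only.

```python
import math

# Hypothetical toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy(["king", "woman"], ["man"], toy))  # prints "queen" for these toy vectors
```
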
___

### Training Embeddings

Let's try to train our own embeddings using the `SMSSpamCollection.tsv` file. Start by loading in, and cleaning up, the file. 

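While we could clean the data ourselves, we'll make use of the `gensim` `simple_preprocess` function. This lowercases the text, splits it into tokens, and drops punctuation, digits, and tokens that are very short or very long (by default, fewer than 2 or more than 15 characters). It does not remove stopwords, as the `clean_text` column in the output shows.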

In [9]:
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth',100)

messages = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)
messages.columns = ['label','text']

messages['clean_text'] = messages['text'].apply(lambda x: gensim.utils.simple_preprocess(x))
messages.head()

| | label | text | clean_text |
|---|---|---|---|
| 0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... | [ve, been, searching, for, the, right, words, to, thank, you, for, this, breather, promise, wont... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,... |
| 2 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |
| 3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. | [even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent] |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | [have, date, on, sunday, with, will] |

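To make the effect of `simple_preprocess` concrete, here is a rough pure-Python approximation of its default behaviour as seen in the `clean_text` column: lowercase the text, keep runs of non-digit word characters, and drop tokens shorter than 2 or longer than 15 characters. `approx_simple_preprocess` is a hypothetical helper written for illustration, not gensim's implementation.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (illustration only)."""
    # Lowercase, then keep runs of word characters that are not digits.
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Drop very short and very long tokens.
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(approx_simple_preprocess("I HAVE A DATE ON SUNDAY WITH WILL!!"))
# ['have', 'date', 'on', 'sunday', 'with', 'will']
```

Note that common words such as "on" and "with" survive, which is why they also show up in the trained vocabulary later.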

Now, we split our data into train and test sets.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(messages['clean_text'],messages['label'], test_size=0.2)

We need to train our word2vec model now. For now, we'll set:
- `vector_size=100`: the length of the word embedding vectors (i.e., the number of dimensions each word gets mapped into)
- `window=5`: the maximum distance between the current word and a context word (5 means up to 5 words to the left and 5 to the right of the current word)
- `min_count=2`: the minimum number of times a word needs to appear in the corpus for a vector embedding to be created for that word

Note that `gensim` trains a CBOW model by default (`sg=0`); passing `sg=1` selects the skip-gram method described in the overview.

In [12]:
w2v_model = gensim.models.Word2Vec(X_train, vector_size=100, window=5, min_count=2)

Now, we'll examine the word vector for the word "king", and find words most similar to "king" based on our embeddings.

In [15]:
w2v_model.wv['king']

array([-0.04996508,  0.06191731,  0.01920232, -0.00921991, -0.00794392,
       -0.11238229,  0.03144125,  0.15554667, -0.04757956, -0.06054191,
       -0.02493219, -0.08994311, -0.02334521,  0.03916509,  0.02084255,
       -0.03985299, -0.0059781 , -0.0659265 , -0.00192455, -0.13866237,
        0.033167  ,  0.04735364,  0.03730169, -0.02935965, -0.00567448,
       -0.01049724, -0.04019884, -0.05439038, -0.07118474,  0.03066384,
        0.07288796, -0.00085214,  0.03131102, -0.05915679, -0.03692801,
        0.08658414,  0.04561409, -0.05220363, -0.01140274, -0.1216219 ,
        0.02775528, -0.04251336, -0.0558888 , -0.00066932,  0.05631372,
       -0.04125058, -0.04528016, -0.01465464,  0.05670632,  0.03521145,
        0.04856405, -0.06812292, -0.00350109, -0.00766639, -0.0172387 ,
        0.05529414,  0.04766516,  0.00591104, -0.09423443,  0.02206806,
        0.01064999, -0.00531434, -0.00255395, -0.01964384, -0.07075865,
        0.06811004,  0.02430508,  0.07080574, -0.07174548,  0.07

In [17]:
w2v_model.wv.most_similar('king')

[('again', 0.9946290850639343),
 ('girl', 0.9946222901344299),
 ('think', 0.9945281744003296),
 ('there', 0.9945186376571655),
 ('done', 0.9944351315498352),
 ('same', 0.9944155812263489),
 ('why', 0.9943594336509705),
 ('wait', 0.9943505525588989),
 ('in', 0.9943446516990662),
 ('very', 0.9943395256996155)]

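`Note`: These similar words don't make as much sense as the ones we obtained from the pre-trained model earlier. Nearly every word scores above 0.99, most likely because our corpus is far smaller than the one the pre-trained model was built on, so the model has little context to learn from.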

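To make the role of `window` more concrete, the sketch below lists the (center, context) pairs that a skip-gram model would draw from one tokenized message. `skipgram_pairs` is a hypothetical helper for illustration; gensim builds its training examples internally and may handle details differently (for example, by sampling a smaller effective window).

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs using up to `window` words on each side."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield center, tokens[j]

message = ["have", "date", "on", "sunday", "with", "will"]
pairs = list(skipgram_pairs(message, window=2))
print(pairs[:3])  # [('have', 'date'), ('have', 'on'), ('date', 'have')]
```

With `window=2`, "have" is never paired with "sunday" because the two words are three positions apart.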
___

### Preparing Word Vectors for ML Models

First, we'll take a look at the words that have embeddings in our model. 

In [34]:
w2v_model.wv.index_to_key[0:25]

['to',
 'you',
 'the',
 'and',
 'is',
 'in',
 'me',
 'it',
 'my',
 'for',
 'your',
 'of',
 'call',
 'have',
 'that',
 'on',
 'now',
 'are',
 'can',
 'so',
 'not',
 'but',
 'we',
 'or',
 'at']

Now, let's generate, for each message in the training data, an array holding the word vectors of the words in that message (words without an embedding are skipped).

In [27]:
# `word in w2v_model.wv` is a dictionary lookup, which is faster than scanning the index_to_key list
w2v_vect = np.array([np.array([w2v_model.wv[i] for i in msg if i in w2v_model.wv]) for msg in X_train], dtype=object)

A machine learning model needs the same number of features for each example it sees. Here, each message is represented by one vector per word, so messages with different numbers of words produce inputs of different sizes.

In [35]:
# Number of word vectors available for each of the first 25 training messages
# (w2v_vect was built from X_train, so it should not be compared against X_test)
for v in w2v_vect[0:25]:
    print(len(v))

7
23
23
3
16
5
17
21
5
11
58
8
30
6
25
4
26
6
8
19
15
1
4
14
6

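The number of vectors per message depends on how many of its tokens made it into the model's vocabulary: with `min_count=2`, words seen only once in the training corpus get no vector and are skipped by the membership check when the message arrays are built. A minimal sketch of that filtering on a hypothetical toy corpus (`toy_vocab` is an illustrative helper; gensim builds its vocabulary internally):

```python
from collections import Counter

def toy_vocab(corpus, min_count):
    """Words that occur at least `min_count` times across all messages."""
    counts = Counter(token for msg in corpus for token in msg)
    return {word for word, n in counts.items() if n >= min_count}

# Hypothetical tokenized messages.
corpus = [
    ["free", "entry", "to", "win"],
    ["call", "to", "win", "free", "prize"],
    ["ok", "see", "you"],
]
vocab = toy_vocab(corpus, min_count=2)

# Print (tokens in message, tokens that have a vector): 4 3, then 5 3, then 3 0
for msg in corpus:
    kept = [t for t in msg if t in vocab]
    print(len(msg), len(kept))
```

The last toy message keeps no tokens at all, which is the case the zero-vector fallback in the averaging step has to cover.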

To handle this, we will average all the word vectors we have for a single message to obtain one fixed-length vector for that message.

In [29]:
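w2v_vect_avg = []

for vect in w2v_vect:
    if len(vect)!=0:
        # Element-wise mean over the message's word vectors
        w2v_vect_avg.append(vect.mean(axis=0))
    else:
        # No known words in this message: fall back to a zero vector (100 = vector_size)
        w2v_vect_avg.append(np.zeros(100))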

We check to make sure this fixed our problem. 

In [36]:
# After averaging, every training message is represented by a single vector of length 100
for v in w2v_vect_avg[0:25]:
    print(len(v))

100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100

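The same averaging has to be applied to the test messages before a model can score them, so it can help to wrap the step in a function. The sketch below uses plain Python lists and a hypothetical toy lookup table in place of `w2v_model.wv`; `average_message` is an illustrative name, not part of gensim.

```python
def average_message(tokens, vectors, dim):
    """Average the vectors of the in-vocabulary tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together.
    return [sum(col) / len(known) for col in zip(*known)]

# Hypothetical 3-dimensional word vectors.
toy_vectors = {"free": [1.0, 0.0, 2.0], "win": [3.0, 2.0, 0.0]}

print(average_message(["free", "win", "unknown"], toy_vectors, dim=3))  # [2.0, 1.0, 1.0]
print(average_message(["ok"], toy_vectors, dim=3))                      # [0.0, 0.0, 0.0]
```
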

Now our features are ready to be used in a machine learning model. 

___

## Review

In this notebook, we introduced Word2Vec, an embedding algorithm that takes a text corpus as input and outputs a vector representation for each word in the corpus. 

We saw how pre-trained embeddings could be used, as well as how embeddings could be trained using our own data. 

To finish up, we looked at the final processing step needed to use the vector representations obtained from our word2vec model in a machine learning model.