# word2vec: How To Implement word2vec

### Explore Pre-trained Embeddings

Other pretrained models available through `gensim.downloader` (the example below loads `glove-wiki-gigaword-100`):
- `glove-twitter-{25/50/100/200}`
- `glove-wiki-gigaword-{50/200/300}`
- `word2vec-google-news-300`
- `word2vec-ruscorpora-news-300`
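
The brace notation in the list above is shorthand for several model names that differ only in vector dimensionality. As a small sketch, the shorthand can be expanded into the exact strings that `api.load` expects. The helper `expand_names` is hypothetical and only formats strings, so nothing is downloaded.

```python
# Hypothetical helper: expand the shorthand used in the list above into
# concrete model-name strings. No gensim call is made here.
def expand_names(prefix, dims):
    """Return one model name per dimensionality, e.g. 'glove-twitter-25'."""
    return [f"{prefix}-{d}" for d in dims]

twitter_models = expand_names("glove-twitter", [25, 50, 100, 200])
wiki_models = expand_names("glove-wiki-gigaword", [50, 200, 300])

for name in twitter_models + wiki_models:
    print(name)  # any of these strings can be passed to api.load(name)
```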

In [None]:
# Install gensim
!pip install gensim

In [1]:
# Load pretrained word vectors using gensim
import gensim.downloader as api

wiki_embeddings = api.load('glove-wiki-gigaword-100')

In [8]:
# Explore the word vector for "king"
print(wiki_embeddings['king'])
print(wiki_embeddings['king'].shape)

[-0.32307  -0.87616   0.21977   0.25268   0.22976   0.7388   -0.37954
 -0.35307  -0.84369  -1.1113   -0.30266   0.33178  -0.25113   0.30448
 -0.077491 -0.89815   0.092496 -1.1407   -0.58324   0.66869  -0.23122
 -0.95855   0.28262  -0.078848  0.75315   0.26584   0.3422   -0.33949
  0.95608   0.065641  0.45747   0.39835   0.57965   0.39267  -0.21851
  0.58795  -0.55999   0.63368  -0.043983 -0.68731  -0.37841   0.38026
  0.61641  -0.88269  -0.12346  -0.37928  -0.38318   0.23868   0.6685
 -0.43321  -0.11065   0.081723  1.1569    0.78958  -0.21223  -2.3211
 -0.67806   0.44561   0.65707   0.1045    0.46217   0.19912   0.25802
  0.057194  0.53443  -0.43133  -0.34311   0.59789  -0.58417   0.068995
  0.23944  -0.85181   0.30379  -0.34177  -0.25746  -0.031101 -0.16285
  0.45169  -0.91627   0.64521   0.73281  -0.22752   0.30226   0.044801
 -0.83741   0.55006  -0.52506  -1.7357    0.4751   -0.70487   0.056939
 -0.7132    0.089623  0.41394  -1.3363   -0.61915  -0.33089  -0.52881
  0.16483  -0.98878

In [9]:
# Find the words most similar to king based on the trained word vectors
wiki_embeddings.most_similar('king', topn=5)

[('prince', 0.7682328820228577),
 ('queen', 0.7507690787315369),
 ('son', 0.7020888328552246),
 ('brother', 0.6985775232315063),
 ('monarch', 0.6977890729904175)]
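
The scores returned by `most_similar` are cosine similarities between word vectors. Below is a minimal sketch of that measure with NumPy, using made-up 3-dimensional vectors rather than the real 100-dimensional embeddings. The vectors and the `cosine_similarity` helper are illustrative assumptions, not part of gensim.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, invented for illustration only
toy_vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

# Rank the other words by similarity to "king", highest first
query = toy_vectors["king"]
ranked = sorted(
    ((word, cosine_similarity(query, vec))
     for word, vec in toy_vectors.items() if word != "king"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

With the real embeddings, `wiki_embeddings.similarity('king', 'queen')` should return the same kind of score for a single pair of words.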

### Train Our Own Model

In [2]:
# Read in the data and clean up column names
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('../data/spam.csv', encoding='latin-1')
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
messages.head()

|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


In [3]:
# Tokenize and clean each message with gensim's built-in simple_preprocess
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)
messages.head()

|   | label | text | text_clean |
|---|-------|------|------------|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... | [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th... |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,... |
| 3 | ham | U dun say so early hor... U c already then say... | [dun, say, so, early, hor, already, then, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |
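
The `text_clean` column shows what `simple_preprocess` does: text is lowercased, digits and punctuation are dropped, and very short tokens such as "u" or "I" disappear. A rough approximation of that behavior with the standard library is sketched below. The function `approx_simple_preprocess` is a hypothetical stand-in written for illustration, not gensim's implementation, and it assumes the default length limits of 2 and 15 characters.

```python
import re

# Runs of word characters that are not digits (letters and underscores)
TOKEN_PATTERN = re.compile(r"[^\W\d]+")

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess, for illustration."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [
        tok for tok in tokens
        if min_len <= len(tok) <= max_len and not tok.startswith("_")
    ]

print(approx_simple_preprocess("Ok lar... Joking wif u oni..."))
# ['ok', 'lar', 'joking', 'wif', 'oni']
```

This also explains why "21st" becomes `st` and "don't" becomes `don` in the table above: the digits and the apostrophe split the token, and the leftover single character `t` is too short to keep.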


In [4]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [6]:
# Train the word2vec model
w2v_model = gensim.models.Word2Vec(X_train, vector_size=100, window=5, min_count=2)
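
The `min_count=2` argument tells `Word2Vec` to ignore words that occur fewer than two times in the training data, so rare words never receive a vector. Below is a plain-Python sketch of that filtering step on a made-up tokenized corpus. The corpus and the `build_vocab` helper are hypothetical, and gensim's own vocabulary building does more than this.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep only the tokens that occur at least min_count times overall."""
    counts = Counter(tok for sentence in sentences for tok in sentence)
    return {tok: n for tok, n in counts.items() if n >= min_count}

# Invented tokenized messages, in the same shape as the text_clean column
toy_corpus = [
    ["free", "entry", "to", "win"],
    ["win", "free", "tickets"],
    ["ok", "see", "you"],
]

vocab = build_vocab(toy_corpus, min_count=2)
print(sorted(vocab))  # ['free', 'win']
```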

In [7]:
# Explore the word vector for "king" based on our trained model
w2v_model.wv['king']

array([-0.0212754 ,  0.06389245,  0.04460632,  0.00094099, -0.00391065,
       -0.11676724,  0.0297784 ,  0.20170496, -0.08366372, -0.06496309,
       -0.03755108, -0.13499345, -0.0012656 ,  0.05120006,  0.04955485,
       -0.06113401, -0.00381624, -0.11021534,  0.0125249 , -0.14004391,
        0.08311233,  0.02903786,  0.05951588, -0.05119926, -0.02715194,
       -0.00463635, -0.0614192 , -0.01011694, -0.07565041,  0.02539589,
        0.09724172, -0.01791699,  0.04793008, -0.081376  , -0.05113647,
        0.07603551,  0.03021528, -0.07487863, -0.0508635 , -0.16776735,
       -0.02662215, -0.04633563, -0.038279  ,  0.01469394,  0.05538112,
       -0.04864789, -0.05777286, -0.02366723,  0.03953808,  0.05064884,
        0.04276774, -0.07170005,  0.00160924, -0.02262928, -0.03850754,
        0.02692831,  0.03243688, -0.00372588, -0.0847783 ,  0.01359421,
        0.01419201, -0.00254815,  0.01087776, -0.00827837, -0.0881905 ,
        0.1025562 ,  0.03449335,  0.12332382, -0.09466483,  0.13
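
Indexing `w2v_model.wv` with a word that was dropped by `min_count`, or that never appeared in the training split, raises a `KeyError`. Because the train/test split is random, a given word is not guaranteed to be present on every run. A defensive lookup might look like the sketch below. `lookup_vector` is a hypothetical helper, and a plain dict stands in for `w2v_model.wv` so the example runs without a trained model. It relies only on `in` and `[]`, which gensim's `KeyedVectors` also supports.

```python
def lookup_vector(vectors, word, default=None):
    """Return the vector for word, or default if the word is not in the vocabulary."""
    return vectors[word] if word in vectors else default

# Stand-in for w2v_model.wv; values are invented for illustration
toy_vectors = {"free": [0.1, -0.2], "win": [0.3, 0.05]}

print(lookup_vector(toy_vectors, "win"))   # [0.3, 0.05]
print(lookup_vector(toy_vectors, "king"))  # None
```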

In [8]:
# Find the most similar words to "king" based on word vectors from our trained model
w2v_model.wv.most_similar('king')

[('wish', 0.9967674612998962),
 ('friends', 0.9967151880264282),
 ('one', 0.9966976046562195),
 ('some', 0.9966698884963989),
 ('being', 0.9966636896133423),
 ('were', 0.9965731501579285),
 ('we', 0.9965695738792419),
 ('off', 0.9965654015541077),
 ('with', 0.9965614676475525),
 ('if', 0.996552586555481)]

Compared with the pretrained GloVe neighbors of "king" (prince, queen, monarch), these results carry little meaning: the nearest words are common function words, and every score sits near 0.997. This is a likely consequence of training on a comparatively small set of short text messages, where most vectors have had little opportunity to separate.