# Word2Vec: How to implement it

**Word2vec** is the first method we'll explore for creating that numeric representation. In the final chapter of this course, we'll compare all of the techniques to one another to understand where each one excels. To frame this, note that word2vec stands for "word to vector": it converts a word, or a string of characters, to a numeric vector. Let's start with a formal definition: *"Word2vec is a shallow, two-layer neural network that accepts a text corpus as an input and returns a set of vectors, also known as embeddings; each vector is a numeric representation of a given word."* In practical terms, you would train this word2vec neural network on a very large corpus of text.

### Explore Pre-trained Embeddings

Some other pre-trained embeddings available through `gensim.downloader`:
- `glove-twitter-{25/50/100/200}`
- `glove-wiki-gigaword-{50/200/300}`
- `word2vec-google-news-300`
- `word2vec-ruscorpora-news-300`

In [4]:
# Install gensim
#!pip install -U gensim

Load the Wikipedia embeddings. The 100 at the end of the name indicates that each vector has length 100. You can choose a different length when you train a model on your own data.

In [6]:
# Load pretrained word vectors using gensim
import gensim.downloader as api

wiki_embeddings = api.load('glove-wiki-gigaword-100')

In [7]:
# Explore the word vector for "king"
wiki_embeddings['king']

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

In [8]:
# Find the words most similar to king based on the trained word vectors
wiki_embeddings.most_similar('king')

[('prince', 0.7682328820228577),
 ('queen', 0.7507690787315369),
 ('son', 0.7020887136459351),
 ('brother', 0.6985775828361511),
 ('monarch', 0.6977891325950623),
 ('throne', 0.691999077796936),
 ('kingdom', 0.6811409592628479),
 ('father', 0.680202841758728),
 ('emperor', 0.6712858080863953),
 ('ii', 0.6676074862480164)]
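
The scores returned by `most_similar` are cosine similarities between word vectors. Here is a minimal sketch of that calculation, assuming made-up 3-dimensional vectors; the values and the `cosine_similarity` helper are illustrative and are not taken from the GloVe model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, invented for illustration only
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

# Rank the other words by similarity to "king", highest first
ranked = sorted(
    ((word, cosine_similarity(toy_vectors["king"], vec))
     for word, vec in toy_vectors.items() if word != "king"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Vectors that point in nearly the same direction score close to 1, which is why "queen" ranks above "apple" in this toy example.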

In [9]:
wiki_embeddings.most_similar('pansy')

[('lorelei', 0.5161975622177124),
 ('neang', 0.4673246145248413),
 ('fanta', 0.46462252736091614),
 ('itsy', 0.46324679255485535),
 ('damsel', 0.4551335871219635),
 ('mako', 0.45218175649642944),
 ('phuong', 0.45166337490081787),
 ('aqua', 0.448862761259079),
 ('s.k.', 0.44639748334884644),
 ('bubblegum', 0.4431919455528259)]

### Train Our Own Model

In [11]:
# Read in the data and clean up column names
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('data/spam.csv', encoding='latin-1')
messages = messages.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
messages.columns = ["label", "text"]
messages.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


In [12]:
# Clean data using the built in cleaner in gensim
messages['text_clean'] = messages['text'].apply(lambda x:gensim.utils.simple_preprocess(x))
messages.head()

| | label | text | text_clean |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... | [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th... |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,... |
| 3 | ham | U dun say so early hor... U c already then say... | [dun, say, so, early, hor, already, then, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |
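
To see why tokens such as "u", "n", and "2" disappear from `text_clean`, here is a rough approximation of what `simple_preprocess` does with its default settings: lowercase the text, keep alphabetic tokens, and drop tokens shorter than 2 or longer than 15 characters. The `approx_simple_preprocess` helper is hypothetical and only handles plain ASCII text; gensim's own tokenizer covers Unicode and other cases this sketch ignores.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess on plain ASCII text."""
    # Lowercase the text and keep runs of letters only (digits and punctuation are dropped)
    tokens = re.findall(r"[a-z]+", text.lower())
    # Discard tokens that are too short or too long
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(approx_simple_preprocess("Ok lar... Joking wif u oni..."))
```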


In [13]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [14]:
# Train the word2vec model
w2v_model = gensim.models.Word2Vec(X_train,         # Training data for the Word2Vec model
                                  vector_size=100,  # Dimensionality of the word vectors
                                  window=5,         # Maximum distance between the current and predicted word within a sentence
                                  min_count=2)      # Ignores all words with a total frequency lower than this
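
To make the `window` parameter more concrete, the sketch below lists which neighbouring words fall inside the window of each word in a sentence. The `context_pairs` helper is hypothetical and only illustrates the idea of a context window; gensim's actual training does considerably more than enumerate pairs.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)               # do not run past the start of the sentence
        end = min(len(tokens), i + window + 1)   # do not run past the end of the sentence
        for j in range(start, end):
            if j != i:                           # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["free", "entry", "to", "win", "cup"]
for pair in context_pairs(tokens, window=1):
    print(pair)
```

A larger `window` pairs each word with more distant neighbours, so the learned vectors reflect broader context.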

In [15]:
# Explore the word vector for "king" based on our trained model
w2v_model.wv['king']

array([-0.03570467,  0.06272582,  0.01334739,  0.01283974,  0.00729148,
       -0.0999951 ,  0.04510066,  0.14050049, -0.05981153, -0.06371787,
       -0.02452814, -0.11130314, -0.00720558,  0.03077019,  0.00361437,
       -0.03360223,  0.03364979, -0.07396425, -0.03913377, -0.14615434,
        0.04541676,  0.02940973,  0.04646965, -0.05054672, -0.03402423,
       -0.01335635, -0.03123688, -0.04775636, -0.04388335,  0.00623232,
        0.08424723,  0.02730883,  0.03850055, -0.06609036, -0.03614746,
        0.06910188, -0.00429381, -0.07334486, -0.03180521, -0.09929204,
       -0.01158667, -0.05319798, -0.0300455 ,  0.03252953,  0.05778084,
       -0.03945362, -0.03331699, -0.01110804,  0.02960597,  0.07169679,
        0.03247268, -0.0454764 , -0.01925392, -0.02033678, -0.03748275,
        0.0602572 ,  0.03993828,  0.01774291, -0.08836221,  0.03714359,
        0.01047315,  0.01173347, -0.01178514, -0.00849021, -0.09196077,
        0.07148632,  0.03529032,  0.0519584 , -0.07264195,  0.10

In [16]:
# Find the most similar words to "king" based on word vectors from our trained model
w2v_model.wv.most_similar('king')

[('live', 0.9951883554458618),
 ('holiday', 0.9951819181442261),
 ('xmas', 0.9951316714286804),
 ('every', 0.9951019287109375),
 ('http', 0.9950454235076904),
 ('has', 0.9950419664382935),
 ('msg', 0.995025098323822),
 ('chat', 0.994998037815094),
 ('buy', 0.9949895143508911),
 ('music', 0.9949892163276672)]

## word2vec: How To Prep Word Vectors For Modeling

#### Train Our Own Model

In [19]:
# Read in the data, clean it, split it into train and test sets, and then train a word2vec model
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)


messages = pd.read_csv('data/spam.csv', encoding='latin-1')
messages = messages.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
messages.columns = ["label", "text"]



messages['text_clean'] = messages['text'].apply(lambda x: gensim.utils.simple_preprocess(x))
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)


w2v_model = gensim.models.Word2Vec(X_train,
                                   vector_size=100,
                                   window=5,
                                   min_count=2)

#### Prep Word Vectors

In [21]:
# Generate a list of words the word2vec model learned word vectors for
words_list = w2v_model.wv.index_to_key
print(words_list)



This list contains all of the words that our Word2Vec model learned a vector for. Put another way, it's every word that appeared in the training data at least twice, because we set `min_count=2`. You can explore these words if you'd like.
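
The effect of `min_count` on the vocabulary can be sketched with a simple frequency count. The `build_vocab` helper and the toy sentences are assumptions for illustration; gensim builds and orders its vocabulary internally, so treat this only as a picture of the filtering rule.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep only the words whose total count across all sentences is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return sorted(word for word, n in counts.items() if n >= min_count)

# Toy tokenized messages, invented for illustration
toy_sentences = [
    ["free", "entry", "to", "win"],
    ["win", "free", "tickets"],
    ["ok", "lar"],
]
print(build_vocab(toy_sentences, min_count=2))
```

Only "free" and "win" appear twice here, so every other word would be left without a vector.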

In [23]:
# Generate aggregated sentence vectors based on the word vectors for each word in the sentence
def average_word_vectors(sentence, model):
    
    # Filter out words not in the model's vocabulary
    words = [model.wv[word] for word in sentence if word in model.wv.key_to_index]
    if len(words) == 0:
        return np.zeros(model.vector_size)  # Return a zero vector if no words are in the vocabulary
        
    return np.mean(words, axis=0)           # Average the vectors

# Create sentence vectors for each sentence in X_test
w2v_vect = np.array([average_word_vectors(ls, w2v_model) for ls in X_test])

We're using a list comprehension to cycle through each text message in the test set. Each text message is represented by `ls`, which is a list of words. Inside `average_word_vectors`, a second list comprehension cycles through each word in that text message, and for each word we ask the fitted Word2Vec model to return its word vector. We apply one condition: only try to return a word vector if the model actually learned one for that word. Without that condition, the model might be asked for a word it never learned, and it would raise a `KeyError`. The word vectors for a message are then averaged (or replaced by a zero vector if none of its words are in the vocabulary), so what we end up with is a set of arrays within an array: one vector per text message.
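
The filtering and averaging steps can be tried without a trained model. In this minimal sketch, a plain dictionary of made-up 2-dimensional vectors stands in for the model's vocabulary; `average_vectors` and `toy_vectors` are hypothetical names used only for illustration.

```python
import numpy as np

def average_vectors(tokens, vectors, size):
    """Average the vectors of known tokens; return zeros when none are known."""
    known = [vectors[t] for t in tokens if t in vectors]  # skip out-of-vocabulary words
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

# Toy 2-dimensional "embeddings", invented for illustration
toy_vectors = {
    "free": np.array([1.0, 0.0]),
    "win":  np.array([0.0, 1.0]),
}

print(average_vectors(["free", "win", "unknownword"], toy_vectors, size=2))
print(average_vectors(["unknownword"], toy_vectors, size=2))
```

Whatever the number of words in a message, the result always has the same length, which is what a downstream model needs as input.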

Next, we print the length of each original text message alongside the length of its vector. `X_test.iloc[i]` finds the text message at position `i` in the test set (that's what `iloc` does), and `len()` then gives the number of words in that message.
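
Positional lookup matters here because the train/test split shuffles the rows, so the index labels of `X_test` no longer run from 0 upward. A small sketch with an invented Series shows the difference between `iloc` (position) and `loc` (label):

```python
import pandas as pd

# A Series whose index labels are out of order, as after a shuffled split
s = pd.Series([["ok", "lar"], ["free", "entry", "to", "win"]], index=[42, 7])

print(len(s.iloc[0]))   # position 0 -> ["ok", "lar"]
print(len(s.loc[7]))    # label 7    -> ["free", "entry", "to", "win"]
```

Because `enumerate` yields positions rather than index labels, `iloc` is the lookup that lines up with the rows of `w2v_vect`.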

In [26]:
messages.head()

| | label | text | text_clean |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... | [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th... |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,... |
| 3 | ham | U dun say so early hor... U c already then say... | [dun, say, so, early, hor, already, then, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |


In [27]:
# Why is the length of the sentence different than the length of the sentence vector?

for i,v in enumerate(w2v_vect):
    print(len(X_test.iloc[i]), len(v))

17 100
5 100
6 100
5 100
23 100
10 100
12 100
4 100
23 100
4 100
13 100
24 100
6 100
11 100
26 100
11 100
18 100
9 100
10 100
6 100
9 100
14 100
6 100
7 100
10 100
9 100
29 100
5 100
13 100
6 100
29 100
52 100
9 100
22 100
10 100
5 100
86 100
5 100
26 100
5 100
15 100
9 100
6 100
21 100
12 100
8 100
8 100
5 100
5 100
10 100
25 100
5 100
11 100
21 100
10 100
10 100
6 100
6 100
12 100
4 100
24 100
6 100
11 100
5 100
5 100
9 100
16 100
23 100
9 100
10 100
7 100
5 100
5 100
25 100
3 100
12 100
7 100
16 100
60 100
30 100
23 100
6 100
17 100
12 100
4 100
5 100
19 100
1 100
4 100
9 100
26 100
14 100
22 100
11 100
6 100
11 100
11 100
5 100
5 100
5 100
21 100
30 100
17 100
6 100
16 100
6 100
11 100
11 100
9 100
7 100
8 100
8 100
15 100
10 100
9 100
6 100
33 100
16 100
5 100
8 100
1 100
10 100
18 100
7 100
31 100
7 100
23 100
23 100
9 100
20 100
15 100
5 100
6 100
23 100
10 100
17 100
17 100
16 100
9 100
4 100
20 100
5 100
3 100
20 100
4 100
6 100
21 100
6 100
19 100
14 100
16 100
12 100
8 100
2

In [28]:
# Compute sentence vectors by averaging the word vectors for the words contained in the sentence
w2v_vect_avg = []

for vect in w2v_vect:
    if len(vect) > 0:                # Check if vect is not empty
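        vect_array = np.array(vect)  # Convert to a NumPy array (1D for a single vector, 2D for a list of word vectors)
        if vect_array.ndim == 1:
            # If vect_array is 1D (already a single sentence vector), convert it to 2D with one row
            vect_array = vect_array.reshape(1, -1)
        vect_avg = np.mean(vect_array, axis=0)  # Mean over rows; a one-row array comes back unchanged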
        w2v_vect_avg.append(vect_avg)           # Append the averaged vector
    else:
        # Handle the case where vect is empty (no words in sentence)
        w2v_vect_avg.append(np.zeros(w2v_model.vector_size))  # Append a zero vector

# Verify consistency of sentence vector lengths
for i, v in enumerate(w2v_vect_avg):
    print(f"Sentence {i}: Length of sentence = {len(X_test.iloc[i])}, Length of vector = {len(v)}")  # Print lengths of sentence and vector

Sentence 0: Length of sentence = 17, Length of vector = 100
Sentence 1: Length of sentence = 5, Length of vector = 100
Sentence 2: Length of sentence = 6, Length of vector = 100
Sentence 3: Length of sentence = 5, Length of vector = 100
Sentence 4: Length of sentence = 23, Length of vector = 100
Sentence 5: Length of sentence = 10, Length of vector = 100
Sentence 6: Length of sentence = 12, Length of vector = 100
Sentence 7: Length of sentence = 4, Length of vector = 100
Sentence 8: Length of sentence = 23, Length of vector = 100
Sentence 9: Length of sentence = 4, Length of vector = 100
Sentence 10: Length of sentence = 13, Length of vector = 100
Sentence 11: Length of sentence = 24, Length of vector = 100
Sentence 12: Length of sentence = 6, Length of vector = 100
Sentence 13: Length of sentence = 11, Length of vector = 100
Sentence 14: Length of sentence = 26, Length of vector = 100
Sentence 15: Length of sentence = 11, Length of vector = 100
Sentence 16: Length of sentence = 18, Le