# Word2Vec using gensim

Link to the YouTube tutorial video: https://www.youtube.com/watch?v=Q2NtCcqmIww&list=PLeo1K3hjS3uu7CxAacxVndI4bE_o3BDtO&index=43

1) **Introduction to Word2Vec:**
    1) Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec was developed by Tomáš Mikolov and colleagues at Google and published in 2013.
    2) Word2vec represents a word as a high-dimensional vector of numbers that captures relationships between words. In particular, words that appear in similar contexts are mapped to vectors that are nearby as measured by cosine similarity. This indicates the level of semantic similarity between the words, so for example the vectors for "walk" and "ran" are nearby, as are those for "but" and "however", and "Berlin" and "Germany".
    3) Reference: https://en.wikipedia.org/wiki/Word2vec
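
To make "nearby as measured by cosine similarity" concrete, here is a minimal sketch in plain Python. The three vectors are invented for illustration and are not taken from any trained model; real word2vec vectors have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "word vectors", invented for illustration only.
walk = [0.9, 0.1, 0.3]
ran = [0.8, 0.2, 0.4]
opposite = [-0.9, -0.1, -0.3]

print(cosine_similarity(walk, ran))       # close to 1: similar direction
print(cosine_similarity(walk, walk))      # 1 (up to rounding): same direction
print(cosine_similarity(walk, opposite))  # -1 (up to rounding): opposite direction
```

The score depends only on the angle between the vectors, not on their lengths, which is why it always falls between -1 and 1.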

2) **Insights of this tutorial:**
    1) model.wv.most_similar() returns the vocabulary words whose vectors are closest to the input word's vector (here, the input word is "bad"), together with each word's similarity score.
    2) A high score means the 2 words occur in similar contexts, not necessarily that they mean the same thing. Synonyms score high, but so can antonyms (for example, "good" appears among the neighbours of "bad") and other words that play the same role in a sentence.
    3) The similarity score is the cosine similarity between the vectors of 2 words, so it lies between -1 and 1.
    4) model.wv.similarity() returns the similarity score between 2 given words.
    5) A positive score means the 2 vectors point in broadly the same direction. The closer the score is to 1, the more similar the contexts in which the 2 words appear.
    6) A score near 0 or below means the model found little in common between the contexts of the 2 words (for example, "great" and "product" score about -0.04).
    7) A score of 1 means the 2 vectors point in exactly the same direction, which is what you get when you compare a word with itself.

In [40]:
import gensim # gensim is a natural language processing (NLP) library for Python
import pandas as pd

# Download the dataset (Amazon product review dataset)

The Amazon product review dataset used here is a subset of Amazon reviews from the cell phone and accessories category. The data is stored as a JSON file and can be read directly with pandas, which supports reading JSON files.

In [41]:
# Load the dataset as dataframe
df = pd.read_json("Cell_Phones_and_Accessories_5.json", lines=True) # lines=True means read the JSON file as a JSON object per line (also means 1 line in the JSON file is 1 JSON object)
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4,Looks Good,1400630400,"05 21, 2014"
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5,Really great product.,1389657600,"01 14, 2014"
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5,LOVE LOVE LOVE,1403740800,"06 26, 2014"
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4,Cute!,1382313600,"10 21, 2013"
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5,leopard home button sticker for iphone 4s,1359849600,"02 3, 2013"


# Data Exploration

In [42]:
# "reviewText" is the only column/feature we are interested in. We use it to train a word2vec model, and the word embeddings the model learns along the way are the main goal of this tutorial.
# Show the "reviewText" feature
print(df.reviewText)
print('\n')

# Show the 1st sample of the "reviewText" feature
print(df.reviewText[0])

0         They look good and stick good! I just don't li...
1         These stickers work like the review says they ...
2         These are awesome and make my phone look so st...
3         Item arrived in great time and was in perfect ...
4         awesome! stays on, and looks great. can be use...
                                ...                        
194434    Works great just like my original one. I reall...
194435    Great product. Great packaging. High quality a...
194436    This is a great cable, just as good as the mor...
194437    I really like it becasue it works well with my...
194438    product as described, I have wasted a lot of m...
Name: reviewText, Length: 194439, dtype: object


They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again


# Data Preprocessing using gensim

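The preprocessing function gensim.utils.simple_preprocess splits each given sentence into a list of words (tokens), and:
1) Converts capital letters to lowercase
2) Drops whitespace and punctuation
3) Drops tokens shorter than 2 or longer than 15 characters (its default min_len and max_len), which is why "I" and the "t" of "don't" are missing from the output below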
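
As a rough illustration of these steps, the sketch below approximates the behaviour with a regular expression. It is not gensim's implementation: `rough_preprocess` is a hypothetical helper, and it only handles ASCII letters, whereas gensim's tokenizer also handles Unicode text.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep runs of ASCII letters, drop very short or very long tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample = "They look good and stick good! I just don't like the rounded shape"
print(rough_preprocess(sample))
# ['they', 'look', 'good', 'and', 'stick', 'good', 'just', 'don', 'like', 'the', 'rounded', 'shape']
```

The apostrophe splits "don't" into "don" and "t", and the length filter then removes the single-letter tokens "i" and "t".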

## Extra: How gensim's simple_preprocess works on a single review

In [43]:
print('The first sentence/sample in the "reviewText" feature/column:')
print(df.reviewText[0])

print('\nThe output of the gensim preprocessing module on the same sentence:')
gensim.utils.simple_preprocess(df.reviewText[0])

The first sentence/sample in the "reviewText" feature/column:
They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again

The output of the gensim preprocessing module on the same sentence:


['they',
 'look',
 'good',
 'and',
 'stick',
 'good',
 'just',
 'don',
 'like',
 'the',
 'rounded',
 'shape',
 'because',
 'was',
 'always',
 'bumping',
 'it',
 'and',
 'siri',
 'kept',
 'popping',
 'up',
 'and',
 'it',
 'was',
 'irritating',
 'just',
 'won',
 'buy',
 'product',
 'like',
 'this',
 'again']

## Preprocessing the "reviewText" feature/column using gensim

In [44]:
# Preprocess the "reviewText" feature/column using gensim, then store the output in a new pandas Series called review_text. Each row of review_text holds the list of words extracted from one review. apply() calls the given function on every element of the column.
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)

print('The review_text dataset/dataframe (contains the preprocessed "reviewText" feature/column using gensim):\n', review_text)
print('\n\nThe review_text dataset/dataframe (contains the preprocessed "reviewText" feature/column using gensim) has ' + str(review_text.shape[0]) + ' sentences.' )

The review_text dataset/dataframe (contains the preprocessed "reviewText" feature/column using gensim):
 0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object


The review_text dataset/dataframe (contains the preprocessed "reviewText" feature/column using gensim) has 194439 sentences.


# Develop the Word2Vec model (the model performs Word2Vec tasks) using gensim

1) **Concept to perform Word2Vec:**
    1) <img src="hidden\context-window.png" alt="This image is a representation of the simple neural network" style="width: 400px;"/>  <br />
        1) This rectangle is called a context window. You slide the context window along the text, one word at a time, to generate the training samples.
        2) For the training samples: context words (features) -> target word (ground truth)
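
The sliding context window can be sketched in a few lines of plain Python. This is only an illustration of how (context words -> target word) samples are formed, not gensim's internal code; `context_target_pairs` is a hypothetical helper and a small window of 2 is used to keep the output short.

```python
def context_target_pairs(tokens, window=2):
    """Return a list of (context_words, target_word), one pair per position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs

tokens = ["they", "look", "good", "and", "stick", "good"]
for context, target in context_target_pairs(tokens, window=2):
    print(context, "->", target)
```

Words near the start or end of a sentence simply get a shorter context, because the window cannot extend past the sentence boundary.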

In [45]:
# Create a word2vec model using gensim
model = gensim.models.Word2Vec(
    window=10, # Size of the context window: the maximum distance between the target word and a context word. window=10 means up to 10 words before and up to 10 words after the target word are used as context.
    min_count=2, # Ignore every word that appears fewer than 2 times in the whole corpus. Such rare words are left out of the vocabulary and are not used for training.
    workers=4, # Number of worker threads used to train this word2vec model. If your CPU has 4 cores, 4 threads is a reasonable choice.
)


# Initialize the word2vec model by building its vocabulary from the preprocessed reviews. Building a vocabulary means collecting the unique words in the corpus, together with their counts. progress_per controls how often gensim reports progress (through its log messages) while it scans the corpus.
model.build_vocab(review_text, progress_per = 100)

# Show the epochs setting of the word2vec model. By default, the epochs is set to 5.
print('The epochs setting of the word2vec model: ', model.epochs)

The epochs setting of the word2vec model:  5
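
To illustrate what min_count does during vocabulary building, here is a minimal sketch with `collections.Counter`. The tiny corpus and the `filter_vocab` helper are invented for illustration; gensim's own vocabulary building does more than this (for example, it also prepares frequent-word downsampling).

```python
from collections import Counter

# Hypothetical tokenized corpus, invented for illustration only.
sentences = [
    ["great", "phone", "case"],
    ["great", "cable"],
    ["phone", "charger", "works", "great"],
]

def filter_vocab(sentences, min_count=2):
    """Count every word in the corpus and keep those seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

vocab = filter_vocab(sentences, min_count=2)
print(vocab)  # {'great': 3, 'phone': 2}
```

Note that the threshold applies to how often a word occurs across the whole corpus, not to how many words a sentence contains.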


In [46]:
# Train the word2vec model on the data in review_text. total_examples is the number of sentences in the corpus (here, model.corpus_count = 194439 reviews), and epochs is the number of passes over the corpus. train() returns a tuple of (effective word count, total raw word count).
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

(61505193, 83868975)
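
The two numbers returned by train() can be read with a little arithmetic. The sketch below assumes, as I understand gensim's behaviour, that the second value is the raw word count summed over all epochs and the first is the number of words actually used for training after out-of-vocabulary words (and, I believe, downsampled frequent words) are skipped.

```python
# Values copied from the train() output above.
effective_words, raw_words = 61505193, 83868975
epochs = 5  # model.epochs, as printed earlier

raw_words_per_epoch = raw_words // epochs
kept_fraction = effective_words / raw_words

print(raw_words_per_epoch)      # 16773795
print(round(kept_fraction, 3))  # 0.733
```

Under that assumption, the corpus holds about 16.8 million tokens, and roughly 73% of the words seen during training contributed to the updates.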

## Save the trained Word2Vec model into a file

In [47]:
# Save the model to a file. After training a word2vec model, you usually save it so that the trained model can be loaded and reused later for other NLP tasks without retraining.
model.save("word2vec-amazon-cell-accessories-reviews-short.model")

## Deploy the trained Word2Vec model

In [48]:
# Query the model through .wv (the trained word vectors). most_similar() returns the vocabulary words whose vectors are closest to the input word. The input word must already be in the model's vocabulary.
model.wv.most_similar("bad")

# Insights:
# 1) model.wv.most_similar() returns the vocabulary words whose vectors are closest to the input word's vector (here, the input word is "bad"), together with each word's similarity score.
# 2) A high score means the 2 words occur in similar contexts, not necessarily that they mean the same thing. Synonyms score high, but so can antonyms (such as "good" below) and other words that play the same role in a sentence.
# 3) The similarity score is the cosine similarity between the vectors of 2 words, so it lies between -1 and 1.

[('terrible', 0.6819854378700256),
 ('shabby', 0.6476674675941467),
 ('horrible', 0.622264564037323),
 ('good', 0.5699361562728882),
 ('awful', 0.5582795739173889),
 ('sad', 0.5427185893058777),
 ('okay', 0.5424753427505493),
 ('crappy', 0.5264002680778503),
 ('poor', 0.5140402913093567),
 ('cheap', 0.5124132037162781)]
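
Conceptually, a ranking like the one above can be produced by scoring every other vocabulary word against the query word and sorting. The sketch below shows that idea on a toy vocabulary; the vectors and the `most_similar` helper are invented for illustration and this is not how gensim implements it internally.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-dimensional vectors, invented for illustration only.
vectors = {
    "bad": [1.0, 0.2],
    "terrible": [0.9, 0.3],
    "good": [0.6, 0.8],
    "cable": [-0.2, 1.0],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("bad", vectors))
```

With these toy vectors, "terrible" ranks first and "good" second, mirroring how an antonym can still land among the nearest neighbours.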

In [49]:
# Print the similarity score (cosine similarity) between 2 given words
print('The similarity score between the words of "cheap" and "inexpensive" :', model.wv.similarity(w1 = "cheap", w2 = "inexpensive"))
print('The similarity score between the words of "great" and "good" :', model.wv.similarity(w1 = "great", w2 = "good"))
print('The similarity score between the words of "great" and "product" :', model.wv.similarity(w1 = "great", w2 = "product"))
print('The similarity score between the words of "product" and "product" :', model.wv.similarity(w1 = "product", w2 = "product"))

# Insights:
# 1) model.wv.similarity() returns the similarity score (cosine similarity, between -1 and 1) of 2 given words.
# 2) A positive score means the 2 vectors point in broadly the same direction. The closer the score is to 1, the more similar the contexts in which the 2 words appear.
# 3) A score near 0 or below means the model found little in common between the contexts of the 2 words.
# 4) A score of 1 means the 2 vectors point in exactly the same direction, which is what you get when you compare a word with itself.

The similarity score between the words of "cheap" and "inexpensive" : 0.5278502
The similarity score between the words of "great" and "good" : 0.7875755
The similarity score between the words of "great" and "product" : -0.03806456
The similarity score between the words of "product" and "product" : 1.0
