In [3]:
import gensim
import pandas as pd

## Reading and Exploring the Dataset

The dataset used here is a subset of Amazon reviews from the Cell Phones & Accessories category. It is stored as a gzip-compressed JSON file with one review per line, which pandas can read directly when lines=True is passed.

In [12]:
df = pd.read_json(r"C:\Users\MY\Downloads\reviews_Cell_Phones_and_Accessories_5.json.gz", lines=True)
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4,Looks Good,1400630400,"05 21, 2014"
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5,Really great product.,1389657600,"01 14, 2014"
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5,LOVE LOVE LOVE,1403740800,"06 26, 2014"
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4,Cute!,1382313600,"10 21, 2013"
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5,leopard home button sticker for iphone 4s,1359849600,"02 3, 2013"


## Simple Preprocessing & Tokenization

The first step in most data science tasks is cleaning the data. For NLP, this typically means converting words to lower case, trimming whitespace, and removing punctuation. We do the same here.

Optionally, we could also remove stop words such as 'and', 'or', 'is', 'the', 'a', 'an' and reduce words to their root forms, for example 'running' to 'run'. gensim.utils.simple_preprocess, used below, does neither of these: it lowercases the text, splits it into tokens, and by default discards tokens shorter than 2 or longer than 15 characters.

In [13]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)
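Since simple_preprocess leaves stop words in place, removing them would be a separate step. The snippet below is a minimal sketch of that step in plain Python; the STOP_WORDS set and the remove_stop_words helper are hypothetical names introduced for illustration, and the list is just the handful of words mentioned above, not a complete stop-word list.

```python
# Hypothetical, hand-picked stop-word list for illustration only.
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word set, preserving order."""
    return [token for token in tokens if token not in stop_words]


tokens = ["these", "make", "using", "the", "home", "button", "easy"]
filtered = remove_stop_words(tokens)
print(filtered)
```

On the real data, the same helper could be applied to every review with review_text.apply(remove_stop_words). This notebook does not do so and trains on the unfiltered tokens.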

Each review is now stored as a list of lowercase tokens. For example, here is the review at index 5 after preprocessing, followed by its original text:

In [16]:
review_text[5]

['these',
 'make',
 'using',
 'the',
 'home',
 'button',
 'easy',
 'my',
 'daughter',
 'and',
 'both',
 'like',
 'them',
 'would',
 'purchase',
 'them',
 'again',
 'well',
 'worth',
 'the',
 'price']

In [17]:
df.reviewText.loc[5]

'These make using the home button easy. My daughter and I both like them.  I would purchase them again. Well worth the price.'
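Comparing the two outputs shows what was removed: capitalization, punctuation, and the one-letter word 'I'. The snippet below is a rough plain-Python approximation of that behavior, not gensim's actual implementation; rough_preprocess is a hypothetical helper, and it only handles ASCII letters, whereas gensim's tokenizer is Unicode-aware.

```python
import re


def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate simple_preprocess: lowercase, keep runs of ASCII letters,
    and drop tokens that are too short or too long."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if min_len <= len(token) <= max_len]


review = (
    "These make using the home button easy. My daughter and I both like them.  "
    "I would purchase them again. Well worth the price."
)
tokens = rough_preprocess(review)
print(tokens)
```

For this review the approximation yields the same 21 tokens as the list shown above. Contractions such as "don't" would be split into 'don' and 't', with 't' then dropped by the length filter.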

## Training the Word2Vec Model
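Train the model on the reviews. The window parameter is the maximum distance between the current word and a context word within a review; with window=10, up to 10 words before and 10 words after the current word are considered. The min_count parameter is the minimum frequency a word must have in the training corpus to be included in the model's vocabulary, so with min_count=3 any word that occurs fewer than 3 times is ignored.

The workers parameter sets how many worker threads are used for training.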
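To make the idea of a context window concrete, the following sketch lists the (current word, context word) pairs that a fixed window produces. It is a simplified illustration under stated assumptions: context_pairs is a hypothetical helper, and real Word2Vec training involves more than this (for example, the original algorithm randomly varies the effective window size per position).

```python
def context_pairs(tokens, window):
    """List (center, context) pairs for every token, looking at most
    `window` positions to the left and to the right."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["well", "worth", "the", "price"]
pairs = context_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```

With window=1 each word is paired only with its immediate neighbors. A window of 10, as used in this notebook, pairs each word with much more of the surrounding review.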

In [19]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=3,
    workers=4,
)

The build_vocab method scans the input corpus once and builds the vocabulary that the model will work with. The progress_per argument controls how often progress is logged.

In [20]:
model.build_vocab(review_text, progress_per=1000)
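Conceptually, building the vocabulary amounts to counting every token and keeping those that reach min_count. The sketch below illustrates that idea on a toy corpus; build_vocabulary is a hypothetical helper, not gensim's implementation, and the alphabetical tie-break is an assumption added only to make the output deterministic.

```python
from collections import Counter


def build_vocabulary(corpus, min_count):
    """Count tokens across all sentences and keep those seen at least
    `min_count` times, ordered from most to least frequent."""
    counts = Counter(token for sentence in corpus for token in sentence)
    kept = [(word, count) for word, count in counts.items() if count >= min_count]
    # Sort by descending count; break ties alphabetically (illustrative choice).
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in kept]


toy_corpus = [
    ["the", "case", "fits", "the", "phone"],
    ["the", "phone", "case"],
    ["great", "phone"],
]
vocab = build_vocabulary(toy_corpus, min_count=2)
print(vocab)
```

Here 'fits' and 'great' occur only once and are dropped, which is the effect min_count=3 has on rare words in the review corpus.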

In [21]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

(61413500, 83868975)
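The tuple returned by train can be read as (effective word count after ignoring unknown words, total word count), following gensim's documentation. The short calculation below interprets the two numbers; it assumes the model was trained for 5 epochs, which is gensim's default but is not shown explicitly in this notebook.

```python
# Values copied from the output of model.train above.
trained_words, raw_words = 61413500, 83868975

# Assumption: 5 training epochs (gensim's default for Word2Vec).
assumed_epochs = 5

raw_words_per_epoch = raw_words // assumed_epochs
kept_fraction = trained_words / raw_words

print(f"raw words per epoch: {raw_words_per_epoch}")
print(f"fraction of words actually trained on: {kept_fraction:.3f}")
```

Under that assumption, the corpus contains roughly 16.8 million tokens, and a little under three quarters of the tokens seen during training contributed to the model.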

### Save the model

Save the model so that it can be reused in other applications.

In [22]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
model.save(r"D:\Python\Project\Word2_Vec_Gensim_Implementation\word2vec-amazon-cell-accessories-reviews-short.model")

### Finding Similar Words and Similarity between words

In [33]:
model.wv.most_similar("phone")

# The result lists the most similar words together with their cosine similarity to the query word

[('iphone', 0.5632883906364441),
 ('it', 0.5150007009506226),
 ('cellphone', 0.5023631453514099),
 ('gn', 0.4915114641189575),
 ('case', 0.47660550475120544),
 ('gravity', 0.4664592444896698),
 ('lap', 0.45979687571525574),
 ('device', 0.45393508672714233),
 ('cheek', 0.42919498682022095),
 ('tabletcons', 0.4241284728050232)]

In [35]:
model.wv.similarity(w1="great", w2="good")

# This returns the cosine similarity between the vectors of the two words

0.7976124
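The scores returned by similarity and most_similar are cosine similarities between word vectors. The sketch below shows the computation in plain Python on made-up two-dimensional vectors; toy_vectors, cosine_similarity, and most_similar are hypothetical names for illustration, and the numbers have nothing to do with the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Made-up vectors for illustration; real Word2Vec vectors have many more dimensions.
toy_vectors = {
    "great": [1.0, 0.0],
    "good": [0.8, 0.6],
    "cable": [0.0, 1.0],
}

print(cosine_similarity(toy_vectors["great"], toy_vectors["good"]))
print(most_similar("great", toy_vectors))
```

A value near 1 means the two vectors point in nearly the same direction, and a value near 0 means they are unrelated. That is why 'great' and 'good' score about 0.80 in the trained model.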

In [36]:
# Words in the trained model's vocabulary, ordered from most to least frequent

model.wv.index_to_key

['the',
 'it',
 'and',
 'to',
 'is',
 'this',
 'of',
 'for',
 'my',
 'that',
 'in',
 'on',
 'phone',
 'with',
 'you',
 'case',
 'but',
 'have',
 'not',
 'was',
 'as',
 'so',
 'one',
 'very',
 'are',
 'like',
 'if',
 'be',
 'can',
 'or',
 'great',
 'your',
 'at',
 'when',
 'use',
 'screen',
 'just',
 'good',
 'all',
 'they',
 'battery',
 'from',
 'would',
 'out',
 'will',
 'well',
 'an',
 'has',
 'iphone',
 'had',
 'get',
 'charge',
 'up',
 'no',
 'me',
 'than',
 'more',
 'only',
 'charger',
 'about',
 'product',
 'other',
 'there',
 'really',
 'time',
 'also',
 'off',
 'these',
 'which',
 'works',
 'does',
 'because',
 'do',
 'don',
 'them',
 'much',
 'back',
 'what',
 'nice',
 'little',
 'price',
 'love',
 'usb',
 'its',
 'some',
 'quality',
 'charging',
 'work',
 'fit',
 'any',
 'easy',
 'even',
 've',
 'device',
 'too',
 'after',
 'still',
 'used',
 'protector',
 'while',
 'power',
 'using',
 'got',
 'better',
 'am',
 'bought',
 'two',
 'now',
 'by',
 'cable',
 'first',
 'recommend'