
In [1]:
import gensim
import pandas as pd 

# **Reading & Exploring**

**The dataset used here is a subset of Amazon reviews for the Cell Phones & Accessories category. The data is stored as a JSON file with one review per line and can be read with pandas.**

Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

In [2]:
data = pd.read_json("/content/sample_data/Cell_Phones_and_Accessories_5.json", lines=True)
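
The `lines=True` argument tells pandas to treat the file as JSON Lines, where every line is a separate JSON object. As a minimal sketch of that layout, using only the standard library and two made-up records (the review texts are illustrative, not taken from the dataset):

```python
import io
import json

# Two made-up records in JSON Lines layout: one JSON object per line.
raw = io.StringIO(
    '{"reviewText": "Works great"}\n'
    '{"reviewText": "Stopped charging"}\n'
)

# Parse each non-empty line on its own, which is roughly what lines=True asks pandas to do.
records = [json.loads(line) for line in raw if line.strip()]

print(len(records))              # number of rows
print(records[0]["reviewText"])  # text of the first record
```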

**Number of rows and columns in the file**

In [3]:
data.shape

(194439, 9)

**First Review**

In [4]:
data.reviewText[0]

"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

**The text needs to be converted to lower case and stripped of extra whitespace and punctuation marks.**

**This is done using gensim.utils.simple_preprocess, which also splits each review into a list of tokens.**

In [6]:
review_text = data.reviewText.apply(gensim.utils.simple_preprocess)
review_text

0         [they, look, good, and, stick, good, just, don...
1         [these, stickers, work, like, the, review, say...
2         [these, are, awesome, and, make, my, phone, lo...
3         [item, arrived, in, great, time, and, was, in,...
4         [awesome, stays, on, and, looks, great, can, b...
                                ...                        
194434    [works, great, just, like, my, original, one, ...
194435    [great, product, great, packaging, high, quali...
194436    [this, is, great, cable, just, as, good, as, t...
194437    [really, like, it, becasue, it, works, well, w...
194438    [product, as, described, have, wasted, lot, of...
Name: reviewText, Length: 194439, dtype: object
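
To make the preprocessing step concrete, here is a rough pure-Python approximation of what `simple_preprocess` does to a review. The function name `approx_simple_preprocess` is hypothetical, and the sketch assumes gensim's default token length limits of 2 and 15 characters; the real function handles more edge cases.

```python
import re

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (an approximation only)."""
    # Lowercase, then pull out runs of letters; digits and punctuation act as separators.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Drop tokens that are too short or too long.
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample = "They look good and stick good! I just don't like the rounded shape"
print(approx_simple_preprocess(sample))
```

Note that "I" and the "t" left over from "don't" are dropped because they are shorter than two characters, which matches the first row of the output above.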

# **Training the Word2Vec Model**
**Train the model on the reviews. Use a window of size 10, i.e. up to 10 words before the current word and up to 10 words after it. The min_count parameter sets the minimum number of times a word must occur in the corpus to be kept in the vocabulary; with min_count = 2, words that appear only once are ignored.**

**The workers parameter defines how many CPU threads are used for training.**
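
As a small illustration of the window idea, the sketch below lists the context words seen for each center word. It assumes a fixed symmetric window and a hypothetical helper named `context_pairs`; it is a simplification, not gensim's training loop.

```python
def context_pairs(tokens, window):
    """Yield (center, context) pairs for a fixed symmetric window."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # clip at the start of the sentence
        hi = min(len(tokens), i + window + 1)   # clip at the end of the sentence
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                yield center, tokens[j]

tokens = ["these", "stickers", "work", "like", "the", "review"]
pairs = list(context_pairs(tokens, window=2))

# Context words for "work" with a window of 2
print([ctx for center, ctx in pairs if center == "work"])
```

With `window=10`, as configured in this notebook, each word would be paired with up to 20 neighbours in the same review.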

**Initialize the model**

In [8]:
model = gensim.models.Word2Vec(
    window = 10,
    min_count = 2,
    workers = 4
)

**Build Vocabulary**

In [9]:
model.build_vocab(review_text, progress_per=1000)
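
Conceptually, building the vocabulary means counting every token and discarding the rare ones. The following is a minimal sketch of that filtering on a made-up toy corpus; `build_vocab_sketch` is a hypothetical helper and leaves out everything else gensim does at this stage.

```python
from collections import Counter

def build_vocab_sketch(sentences, min_count):
    """Count every token and keep those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

# Made-up tokenized reviews, for illustration only.
toy_corpus = [
    ["great", "product", "great", "packaging"],
    ["works", "great", "like", "original"],
    ["product", "as", "described"],
]

vocab = build_vocab_sketch(toy_corpus, min_count=2)
print(sorted(vocab))
```

Only "great" and "product" survive here, because every other word occurs once in the toy corpus.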

In [11]:
model.epochs

5

In [12]:
model.corpus_count

194439

**Train the model**

In [13]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

(61505832, 83868975)
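
The returned tuple can be read with a little arithmetic. This sketch assumes, following gensim's documentation, that the first number is the effective word count actually trained on and the second is the raw word count summed over all epochs. The gap between them most likely reflects words dropped by `min_count` and by gensim's downsampling of very frequent words.

```python
# Values copied from the training output above.
effective_words, raw_words = 61505832, 83868975
epochs = 5

raw_per_epoch = raw_words // epochs          # tokens seen in one pass over the corpus
kept_fraction = effective_words / raw_words  # share of tokens actually used for training

print(raw_per_epoch)
print(round(kept_fraction, 3))
```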

**Finding the words most similar to "bad"**

In [14]:
model.wv.most_similar("bad")

[('terrible', 0.6843300461769104),
 ('shabby', 0.6587380170822144),
 ('horrible', 0.5882956981658936),
 ('good', 0.5735816359519958),
 ('awful', 0.5594509840011597),
 ('funny', 0.5436052083969116),
 ('crappy', 0.5303862690925598),
 ('cheap', 0.530036985874176),
 ('poor', 0.5223718285560608),
 ('mad', 0.5153685212135315)]

**Checking the similarity between words**

In [15]:
model.wv.similarity(w1='good', w2='great')

0.7853304

In [16]:
model.wv.similarity(w1='cheap', w2='inexpensive')

0.5295695

In [17]:
model.wv.similarity(w1='good', w2='iphone')

-0.0053710695
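
The scores above are cosine similarities between word vectors, so they range from -1 to 1, and a value near 0 (as for 'good' and 'iphone') means the two vectors are close to orthogonal. Below is a minimal sketch of the computation and of a top-n ranking built on it. The 3-dimensional vectors are made up for illustration, and `most_similar_sketch` is a hypothetical helper, not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors; these are not values from the trained model.
toy_vectors = {
    "good": [1.0, 0.2, 0.0],
    "great": [0.9, 0.3, 0.1],
    "iphone": [0.0, 0.0, 1.0],
}

def most_similar_sketch(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar_sketch("good", toy_vectors))
```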