# Amazon Fine Food Reviews

## Sentiment Analysis

https://www.kaggle.com/snap/amazon-fine-food-reviews

## Context
This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain-text review. It also includes reviews from all other Amazon categories.

Data includes:
- Reviews from Oct 1999 - Oct 2012
- 568,454 reviews
- 256,059 users
- 74,258 products
- 260 users with > 50 reviews

Attribute Info:
- Id : row id of the review
- ProductId : unique id of the product
- UserId : unique id of the user
- ProfileName : name of the user
- HelpfulnessNumerator : number of users who found the review helpful
- HelpfulnessDenominator : number of users who indicated whether they found the review helpful or not
- Score : rating from 1 to 5
- Time : timestamp of the review
- Summary : brief summary of the review
- Text : text of the review
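
The `Time` values in the sample rows of this notebook (for example 1303862400) look like Unix epoch seconds. A minimal sketch, assuming that interpretation (the source does not state the unit), converts one of them into a calendar date with the standard library. The helper name `to_utc_date` is made up for illustration.

```python
from datetime import datetime, timezone

def to_utc_date(epoch_seconds):
    """Interpret an integer as seconds since 1970-01-01 UTC (assumed unit)."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date()

# 1303862400 is the Time value of the first row in the sample output
review_date = to_utc_date(1303862400)
print(review_date.isoformat())
```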

#### We will eliminate some of the features, such as Id and the numeric Score.

### Objective:
Given a review, determine whether it is positive (rating 4 or 5) or negative (rating 1 or 2).

## Loading the Data

We load the data from the SQLite file.<br>

We ignore reviews with a rating of 3, as they are neither positive nor negative.

In [1]:
%matplotlib inline

import sqlite3
import pandas as pd #for data frames
import numpy as np #numpy array operations
import nltk #natural lang processing, for processing text
import string
import matplotlib.pyplot as plt
import seaborn as sns #for plotting
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer


In [2]:
# using sqlite table to read data

con = sqlite3.connect("database.sqlite")

In [3]:
# getting the reviews where rating is not equal to 3

filtered_data = pd.read_sql_query("""select * from Reviews where
score != 3
""",con)

In [4]:
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [5]:
filtered_data.shape

(525814, 10)

### Replacing scores with positive or negative

In [6]:
#Give review with score>3 as positive review and Score<3 as negative review

def partition(x):
    if x<3:
        return "negative"
    return "positive"


actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition)
filtered_data['Score'] = positiveNegative

In [7]:
filtered_data.shape

(525814, 10)

In [8]:
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,negative,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,positive,1350777600,Great taffy,Great taffy at a great price. There was a wid...


## Data Cleaning: Deduplication

Deduplication is done so that repeated reviews do not bias the results.

In [9]:
display = pd.read_sql_query("""
select * from Reviews where score !=3 and UserId="AR5J8UI46CURR" 
order by ProductID
""",con)

display

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


In the table above, the user Geetha Krishnan has five reviews with the same text, summary, and timestamp, each for a different ProductId. One person cannot rate five products at the same instant; the likely explanation is that Amazon shares a review across identical or near-identical products. These are duplicates, so we keep one and remove the other four.

We can look up a product on Amazon by its ProductId using the link below.<br>

https://www.amazon.com/dp/[PRODUCT ID] -> https://www.amazon.com/dp/B000HDL1RQ

In [10]:
# sorting data according to prodid

sorted_data = filtered_data.sort_values("ProductId",axis=0,ascending=True)

In [11]:
# Dropping duplicate entries:
# rows whose "UserId", "ProfileName", "Time" and "Text" all match are duplicates.

# subset  -> columns that must all match for two rows to count as duplicates
# keep    -> "first" / "last" / False: keep the first occurrence, the last one, or drop them all
# inplace -> False returns a new DataFrame; True modifies sorted_data in place and returns None
final = sorted_data.drop_duplicates(subset=["UserId","ProfileName","Time","Text"],keep="first",inplace=False)
final.shape

(364173, 10)

Observation: there were 525,814 data points before deduplication; 364,173 remain afterwards.
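
To make the keep-first rule concrete, here is a minimal plain-Python sketch of the same idea on made-up rows. It is not how pandas implements `drop_duplicates`; the function name and the toy data are assumptions for illustration only.

```python
def drop_duplicates_keep_first(rows, subset):
    """Keep the first row for each distinct combination of the subset keys."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[col] for col in subset)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# toy rows (made up): the same review text attached to two product ids
rows = [
    {"ProductId": "P1", "UserId": "U1", "Time": 100, "Text": "great wafers"},
    {"ProductId": "P2", "UserId": "U1", "Time": 100, "Text": "great wafers"},
    {"ProductId": "P3", "UserId": "U2", "Time": 200, "Text": "too sweet"},
]
deduped = drop_duplicates_keep_first(rows, ["UserId", "Time", "Text"])
print([r["ProductId"] for r in deduped])
```

Because the rows are sorted by ProductId first, "first" here means the duplicate with the smallest ProductId survives.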

In [12]:
final.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
138706,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,positive,939340800,EVERY book is educational,this witty little book makes my son laugh at l...
138688,150506,6641040,A2IW4PEEKO2R0U,Tracy,1,1,positive,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc..."
138689,150507,6641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,positive,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...
138690,150508,6641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,positive,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...
138691,150509,6641040,A3CMRKGE0P909G,Teresa,3,4,positive,1018396800,A great way to learn the months,This is a book of poetry about the months of t...


### Cleaning data with common-sense scenarios

HelpfulnessNumerator should always be <= HelpfulnessDenominator: HelpfulnessNumerator is the number of thumbs-up votes for the review, while HelpfulnessDenominator is the total number of votes, thumbs-up and thumbs-down combined.

<br>
So let's check whether any data points violate this condition and remove them.

In [13]:
final = final[final["HelpfulnessNumerator"] <= final["HelpfulnessDenominator"]]
final.shape

(364171, 10)

Observation: 2 records were removed as they were wrong entries.

In [14]:
### Checking No.of positive and Negative Reviews

final['Score'].value_counts()

positive    307061
negative     57110
Name: Score, dtype: int64

Observation: There are far more positive reviews than negative reviews, so the classes are imbalanced.
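
The size of the imbalance can be read off the counts above with a little arithmetic. This is only a sketch that hard-codes the two numbers from the `value_counts()` output.

```python
counts = {"positive": 307061, "negative": 57110}  # from value_counts() above
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

With these counts, always predicting "positive" would already be right about 84% of the time, which is why accuracy alone can be misleading on this data.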

## Text To d-dim Vector

#### Why convert?

If we convert text to vectors, we can use linear algebra techniques to classify and visualize the data.


<img src= "images/reviews1.png"/>

Using linear algebra we can classify the points like this: we look for a line/plane that puts the positive reviews on one side and the negative reviews on the other.
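
As a rough illustration of "which side of the line", the sketch below scores a point with `w.x + b` and uses the sign as the class. The weights and the 2-dim points are hand-picked assumptions, not something learned from the reviews.

```python
def side_of_line(w, b, x):
    """Return 'positive' if w.x + b > 0, else 'negative'."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "positive" if score > 0 else "negative"

# hypothetical 2-dim review vectors and a hand-picked line x1 - x2 = 0
w, b = [1.0, -1.0], 0.0
print(side_of_line(w, b, [3.0, 1.0]))
print(side_of_line(w, b, [0.5, 2.0]))
```

A real classifier would learn `w` and `b` from the labelled review vectors instead of fixing them by hand.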

# Bag of Words(BOW)

Bag of Words is a technique in which text is converted to vectors.<br>
It is widely used for classification/filtering problems.<br>
Each vector usually holds the frequency of every word, which lets us measure the similarity between vectors and classify them.<br>

It is one of the most widely used information retrieval (IR) techniques.

<img src="images/reviews2.png"/>

<img src="images/reviews3.png"/>

## Explanation of Terms in BOW

In image 1, the reviews r1, r2, ... are called documents.

The collection of all the documents is called the "corpus".

In BOW we first convert the text to a d-dim vector as shown above.

Step 1: Build the dictionary (the English term, not the Python type), i.e. the BOW vocabulary: the set of all unique words in the documents.

Step 2: Convert every document/review to a d-dim vector as shown in the image.

Note: Each unique word in the dictionary is one dimension.

<img src = "images/reviews4.png"/>

Words like "this", "is", and "and" are not useful here. They are called STOP WORDS, and we remove them during text preprocessing.
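
The two steps can be sketched in plain Python on a made-up three-review corpus. This is an illustrative toy, not the scikit-learn implementation used later; the function names are assumptions.

```python
def build_vocabulary(documents):
    """Step 1: sorted list of all unique words in the corpus."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def to_bow_vector(document, vocabulary):
    """Step 2: count how often each vocabulary word occurs in the document."""
    words = document.lower().split()
    return [words.count(term) for term in vocabulary]

corpus = [
    "this pasta is very tasty",
    "this pasta is not tasty",
    "pasta is tasty and pasta is cheap",
]
vocabulary = build_vocabulary(corpus)
vectors = [to_bow_vector(doc, vocabulary) for doc in corpus]
print(vocabulary)
for vec in vectors:
    print(vec)
```

Every vector has one entry per vocabulary word, so d equals the vocabulary size; the first two reviews differ in only two positions ("very" and "not") even though their meanings are opposite.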

# Text PreProcessing

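- Stemming : reducing related words to a common stem. Example: Taste, Tasteful, and Tastes are all reduced to one stem (the stemmer used below produces "tast").

- Stop word removal

- Tokenization : the process of breaking a sentence into words

- Lemmatization : reducing a word to its dictionary form (lemma), for example "tastes" to "taste". Unlike a stem, a lemma is always a real word. Keeping a phrase such as 'New York' together as one unit is a tokenization concern rather than lemmatization.

Note: BOW does not consider the semantics of words, i.e. it treats similar words such as "Beautiful" and "Awesome" as completely unrelated.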
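
Tokenization and stop-word removal can be sketched with the standard library alone. The stop-word set below is a tiny hand-picked assumption for illustration; the NLTK list used later in this notebook is much longer.

```python
import re

# tiny hand-picked stop-word list for illustration; NLTK's list is much longer
STOP_WORDS = {"this", "is", "and", "the", "a", "of"}

def tokenize(sentence):
    """Break a sentence into lowercase word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

def remove_stop_words(tokens):
    """Drop every token that appears in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("This is a tasty and cheap pasta")
content_words = remove_stop_words(tokens)
print(tokens)
print(content_words)
```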

# TF-IDF

Term Frequency - Inverse Document Frequency

Term Frequency = (occurrences of word wi in the document / number of words in the document)

Inverse Document Frequency = log(number of documents in the corpus / number of documents that contain word wi)


tfIdf = tf * idf <br>

Term frequency increases when the word occurs more often in the document.

Inverse document frequency increases when the word is rare in the corpus.
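
A worked example of the textbook formulas on a made-up corpus of tokenized documents. It assumes every queried word occurs in at least one document (otherwise the division fails). scikit-learn's `TfidfVectorizer`, used later, applies a smoothed and normalized variant, so its numbers will differ.

```python
import math

def term_frequency(word, document):
    """Occurrences of the word divided by the number of words in the document."""
    return document.count(word) / len(document)

def inverse_document_frequency(word, corpus):
    """log(number of documents / number of documents containing the word)."""
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / containing)

def tf_idf(word, document, corpus):
    return term_frequency(word, document) * inverse_document_frequency(word, corpus)

# toy corpus of tokenized documents (made up)
corpus = [
    ["pasta", "is", "tasty"],
    ["pasta", "is", "cheap"],
    ["tea", "is", "tasty"],
]
print(tf_idf("is", corpus[0], corpus))     # word in every document
print(tf_idf("cheap", corpus[1], corpus))  # word in a single document
```

A word that appears in every document gets idf = log(1) = 0, so its tf-idf vanishes, while a word confined to one document gets the largest weight.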

# Word2Vec

Word2Vec converts each word to a vector, whereas Bag of Words converts each sentence (document) to a vector.

https://www.tensorflow.org/tutorials/word2vec

It even captures relations between words, such as man-woman.
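
The kind of relation meant here is often shown as vector arithmetic: king - man + woman lands near queen. The sketch below uses hand-made 2-dim vectors chosen so that the arithmetic works out exactly; they are assumptions for illustration and not real Word2Vec embeddings.

```python
import math

# hand-made 2-dim toy vectors (axis 0: royalty, axis 1: gender); not real embeddings
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

# king - man + woman, computed component by component
target = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

# nearest remaining word by Euclidean distance
candidates = [w for w in toy_vectors if w not in ("king", "man", "woman")]
best = min(candidates, key=lambda w: math.dist(toy_vectors[w], target))
print(target, best)
```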

# Text Preprocessing Code

- Remove HTML tags from the reviews
- Remove punctuation
- Remove words that contain non-alphabetic characters
- Remove words of one or two letters
- Remove stop words
- Convert to lower case
- Stemming
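
The steps above, apart from stemming, can be sketched as one standard-library function. This is a compact variant for illustration, not the notebook's own loop: `clean_review` and the sample text are made up, and the stop-word set is passed in by the caller.

```python
import re

def clean_review(text, stop_words=frozenset()):
    """Apply the cleaning steps above, except stemming, to one review."""
    text = re.sub(r"<.*?>", " ", text)      # drop HTML tags
    text = re.sub(r"[?!'\"#]", "", text)    # delete these punctuation marks
    text = re.sub(r"[.,)(|/]", " ", text)   # turn these into spaces
    words = []
    for word in text.split():
        word = word.lower()
        # keep purely alphabetic words longer than two letters that are not stop words
        if word.isalpha() and len(word) > 2 and word not in stop_words:
            words.append(word)
    return words

sample = "I <b>LOVE</b> this tea!<br />Bought 2 boxes, it's great."
cleaned = clean_review(sample, stop_words={"this", "the"})
print(cleaned)
```

Note that deleting the apostrophe turns "it's" into "its", the same effect seen for "Father's" in the cell output below.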

In [15]:
# Regular exp practice

import re

import string
from nltk.corpus import stopwords

from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stop = set(stopwords.words('english')) #set of stop words

sno =nltk.stem.SnowballStemmer('english') 

def cleanhtml(sentence):
    cleanr = re.compile('<.*?>') # matches any HTML tag
    cleantext = re.sub(cleanr,' ',sentence) # replace each tag with a space
    return cleantext

def cleanpunc(sentence):
    cleaned = re.sub(r'[?|!|\'|"|#]',r'',sentence)
    cleaned = re.sub(r'[.|,|)|(|\|/]',r' ',cleaned)
    return cleaned
print(stop)
print("-----------------------------------------------------------------")
print(sno.stem('tasteful'))
print(sno.stem('tasty'))
print(sno.stem('test'))
print(sno.stem('testing'))
cleanpunc("Father's").split()

{'by', "didn't", "should've", 'y', 'yourself', 'do', 'more', 'haven', 'each', 'ourselves', 'have', 'at', 'shouldn', "that'll", 'is', 'after', 'under', 'and', 'few', 'against', 'from', 'there', 'to', 'has', 'its', 'below', 've', 'why', 'we', 'up', "hasn't", "you'll", "needn't", 'having', 'out', 'are', 'him', 'd', 'it', "you're", 'won', 'was', 'because', "mustn't", "it's", 'couldn', 'didn', "couldn't", 'been', 'hers', 's', "mightn't", "she's", 'such', 'which', 'or', 'll', 'too', 'wouldn', 'my', 'who', "you'd", 'being', 'very', 'wasn', "isn't", 'that', 'during', "aren't", 'the', "shouldn't", 'what', 'of', 'further', 't', 'ma', 'you', 'into', 'his', 'before', "shan't", 'should', "you've", 'had', 'were', 'now', 'down', 'between', 'yourselves', 'in', 'through', 'be', 'this', 'so', 'all', 'ain', 'than', 'yours', 'm', 'then', 'whom', 'theirs', 'am', 'not', 'but', 'did', "doesn't", "wouldn't", 'themselves', 'until', 'doing', 'any', 'when', "don't", 'while', 'again', 'hasn', "won't", 'i', 'hadn'

['Fathers']

## Don't run this code: it was already executed and the result stored in the SQLite file

In [16]:
#Code for implementing step-by-step the checks mentioned in the pre-processing phase
# this code takes a while to run as it needs to run on 500k sentences.

'''
i=0
str1=' '
final_string=[]
all_positive_words=[] # store words from +ve reviews here
all_negative_words=[] # store words from -ve reviews here.
s=''
for sent in final['Text'].values:
    filtered_sentence=[]
    #print(sent);
    sent=cleanhtml(sent) # remove HTMl tags
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            if((cleaned_words.isalpha()) & (len(cleaned_words)>2)):    
                if(cleaned_words.lower() not in stop):
                    s=(sno.stem(cleaned_words.lower())).encode('utf8')
                    filtered_sentence.append(s)
                    if (final['Score'].values)[i] == 'positive': 
                        all_positive_words.append(s) #list of all words used to describe positive reviews
                    if(final['Score'].values)[i] == 'negative':
                        all_negative_words.append(s) #list of all words used to describe negative reviews reviews
                else:
                    continue
            else:
                continue 
    #print("Filtered sent:",filtered_sentence)
    str1 = b" ".join(filtered_sentence) #final string of cleaned words
    #print("***********************************************************************")
    
    final_string.append(str1)
    i+=1
    

'''

In [17]:
# adding a column of CleanedText which displays the data 
# after pre-processing of the review 
'''
final['CleanedText']=final_string
'''

In [18]:
'''
final.head(3) #below the processed review can be seen in the CleanedText Column 


# store the final table in a SQLite table for future use.
conn = sqlite3.connect('final.sqlite')
c=conn.cursor()
conn.text_factory = str
# the deprecated `flavor` argument has been dropped; recent pandas versions no longer accept it
final.to_sql('Reviews', conn, schema=None, if_exists='replace', index=True, index_label=None, chunksize=None, dtype=None)
'''

In [16]:
# creating a pickle file for the cleaned data points,
# as this took a long time to compute


import pickle
'''
pickle_out = open("cleanedData.pickle","wb")
pickle.dump(final,pickle_out)
pickle.dump(all_positive_words,pickle_out)
pickle.dump(all_negative_words,pickle_out)
pickle_out.close()
'''

In [17]:
with open("cleanedData.pickle","rb") as pickle_in:
    # objects are read back in the same order they were dumped
    pik = pickle.load(pickle_in)
    all_positive_words = pickle.load(pickle_in)
    all_negative_words = pickle.load(pickle_in)
pik.head(2)
all_positive_words

[b'witti',
 b'littl',
 b'book',
 b'make',
 b'son',
 b'laugh',
 b'loud',
 b'recit',
 b'car',
 b'drive',
 b'along',
 b'alway',
 b'sing',
 b'refrain',
 b'hes',
 b'learn',
 b'whale',
 b'india',
 b'droop',
 b'love',
 b'new',
 b'word',
 b'book',
 b'introduc',
 b'silli',
 b'classic',
 b'book',
 b'will',
 b'bet',
 b'son',
 b'still',
 b'abl',
 b'recit',
 b'memori',
 b'colleg',
 b'grew',
 b'read',
 b'sendak',
 b'book',
 b'watch',
 b'realli',
 b'rosi',
 b'movi',
 b'incorpor',
 b'love',
 b'son',
 b'love',
 b'howev',
 b'miss',
 b'hard',
 b'cover',
 b'version',
 b'paperback',
 b'seem',
 b'kind',
 b'flimsi',
 b'take',
 b'two',
 b'hand',
 b'keep',
 b'page',
 b'open',
 b'fun',
 b'way',
 b'children',
 b'learn',
 b'month',
 b'year',
 b'learn',
 b'poem',
 b'throughout',
 b'school',
 b'year',
 b'like',
 b'handmot',
 b'invent',
 b'poem',
 b'great',
 b'littl',
 b'book',
 b'read',
 b'nice',
 b'rhythm',
 b'well',
 b'good',
 b'repetit',
 b'littl',
 b'one',
 b'like',
 b'line',
 b'chicken',
 b'soup',
 b'rice',
 b

# Bag of Words Code

Now we are taking every review and converting them to vectors

In [21]:
count_vect = CountVectorizer() #scikit-learn

final_counts = count_vect.fit_transform(final['Text'].values)

In [22]:
type(final_counts) #its a compressed sparse matrix i.e it only stores non zero vals in the format row,col ->val 

scipy.sparse.csr.csr_matrix

In [23]:
final_counts.shape

(364171, 115281)

Observation: 364171 Reviews and 115281 unique words in the reviews

# Bi-Grams and N-grams

We now have all the words from the positive and the negative reviews in two lists, so let's get the most frequent words in positive reviews and in negative reviews.

In [24]:
freq_dist_positive = nltk.FreqDist(all_positive_words)
freq_dist_negative = nltk.FreqDist(all_negative_words)

print("Most Common Positive Words: ",freq_dist_positive.most_common(20))
print("Most Common Negative Words: ",freq_dist_negative.most_common(20))

Most Common Positive Words:  [(b'like', 139429), (b'tast', 129047), (b'good', 112766), (b'flavor', 109624), (b'love', 107357), (b'use', 103888), (b'great', 103870), (b'one', 96726), (b'product', 91033), (b'tri', 86791), (b'tea', 83888), (b'coffe', 78814), (b'make', 75107), (b'get', 72125), (b'food', 64802), (b'would', 55568), (b'time', 55264), (b'buy', 54198), (b'realli', 52715), (b'eat', 52004)]
Most Common Negative Words:  [(b'tast', 34585), (b'like', 32330), (b'product', 28218), (b'one', 20569), (b'flavor', 19575), (b'would', 17972), (b'tri', 17753), (b'use', 15302), (b'good', 15041), (b'coffe', 14716), (b'get', 13786), (b'buy', 13752), (b'order', 12871), (b'food', 12754), (b'dont', 11877), (b'tea', 11665), (b'even', 11085), (b'box', 10844), (b'amazon', 10073), (b'make', 9840)]


In [25]:
# bigrams, trigrams, n-grams

# Stop-word removal drops words like 'not' by default, which destroys
# negations such as "not good", so n-grams should be built before removing stop words.

# ngram_range=(1,2) means unigrams and bigrams; (1,3) -> unigrams, bigrams and trigrams

'''
count_vect = CountVectorizer(ngram_range=(1,2))
final_uni_bigrams_count = count_vect.fit_transform(final['Text'].values)
'''

In [26]:
'''
pickle_out = open("n_grams.pickle","wb")
pickle.dump(final_uni_bigrams_count,pickle_out)
pickle_out.close()
'''

In [27]:
pickle_in=open("n_grams.pickle","rb")
final_uni_bigrams_count = pickle.load(pickle_in)

In [28]:
final_uni_bigrams_count.get_shape()

(364171, 2910192)

Observation: here we have about 2.9 million unique unigrams and bigrams, whereas plain bag of words gave 115,281 unique words.
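
Why the vocabulary grows so much can be seen on a single made-up sentence: every adjacent word pair becomes an extra feature. A minimal sketch, with hypothetical helper names:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) in order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def uni_and_bigrams(tokens):
    """Unigrams followed by bigrams, like ngram_range=(1,2)."""
    return ngrams(tokens, 1) + ngrams(tokens, 2)

tokens = "this pasta is not tasty".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
print(len(set(uni_and_bigrams(tokens))))
```

The bigram "not tasty" keeps the negation that the separate unigrams "not" and "tasty" lose.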

# TF-IDF

In [19]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))

final_tf_idf = tf_idf_vect.fit_transform(final['Text'].values)



In [20]:
final_tf_idf.get_shape()

(364171, 2910192)

In [21]:
final_tf_idf[1]

<1x2910192 sparse matrix of type '<class 'numpy.float64'>'
	with 84 stored elements in Compressed Sparse Row format>

Observation: Indexing a compressed sparse matrix returns another sparse matrix, so the values themselves are not displayed directly.
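
The compressed sparse row (CSR) layout named in the output keeps three arrays: the non-zero values, their column indices, and a pointer to where each row starts. The plain-Python sketch below illustrates that layout on a tiny made-up matrix; it is not SciPy's implementation, and the function names are assumptions.

```python
def to_csr(dense_rows):
    """Build the three CSR arrays from a list of equal-length dense rows."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)      # non-zero value
                indices.append(col)     # its column
        indptr.append(len(data))        # where the next row starts in data
    return data, indices, indptr

def csr_row_to_dense(data, indices, indptr, row, n_cols):
    """Expand one CSR row back into a dense list."""
    dense = [0] * n_cols
    for k in range(indptr[row], indptr[row + 1]):
        dense[indices[k]] = data[k]
    return dense

dense = [[0, 2, 0, 0], [0, 0, 0, 0], [1, 0, 0, 3]]
data, indices, indptr = to_csr(dense)
print(data, indices, indptr)
print(csr_row_to_dense(data, indices, indptr, 2, 4))
```

Only three values are stored for the twelve cells, which is what makes a 364171 x 2910192 matrix fit in memory.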

In [22]:
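# the feature names tell us which word/n-gram each column stands for
# getting all the unique uni- and bigrams from tf_idf_vect
# (newer scikit-learn releases replace get_feature_names() with get_feature_names_out())

features = tf_idf_vect.get_feature_names()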

len(features)

2910192

In [23]:
features[100000:100010]
tf_idf_vect

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

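The cell above displays only its last expression, the `tf_idf_vect` configuration; evaluating `features[100000:100010]` on its own shows 10 of the uni- and bigrams.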

In [34]:
# To convert a row of sparse matrix to numpy array

print(final_tf_idf[100000,:].toarray()[0])

[0. 0. 0. ... 0. 0. 0.]


Observation: as it is a sparse matrix, most of the values are 0.

### Function to retrieve top 25 features/words for a given review

In [35]:
def top_tfidf_feats(row,features,top_n=25):
    
    # argsort sorts ascending, so reverse it and keep the indices of the top_n largest tf-idf values
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i],row[i]) for i in topn_ids]
    
    df = pd.DataFrame(top_feats)
    df.columns = ['features','tfidf']
    return df

top_tfidf = top_tfidf_feats(final_tf_idf[10,:].toarray()[0],features,25)


In [36]:
top_tfidf

Unnamed: 0,features,tfidf
0,with carol,0.19484
1,songs by,0.19484
2,quality kids,0.19484
3,kids storytelling,0.19484
4,heart quality,0.19484
5,or sound,0.19484
6,storytelling and,0.19484
7,king this,0.188815
8,these songs,0.188815
9,sound track,0.188815


# Word2Vec

In [37]:
# Using Google News Word2Vectors
'''

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

# in this project we are using a pretrained model by Google
# it's a 3.3G file; once you load it into memory
# it occupies ~9GB, so please do this step only if you have >12G of RAM
# we will provide a pickle file which contains a dict
# with all our corpus words as keys and model[word] as values
# To use this code snippet, download "GoogleNews-vectors-negative300.bin"
# from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# it's 1.9GB in size.

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
'''

In [38]:
import pickle

#if you do NOT have RAM >= 12GB, use the code below.
with open('word2vec_model', 'rb') as handle:
    model = pickle.load(handle)

In [39]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors



### Observation: the model comes from Google's word2vec, i.e. every word has a 300-dimensional vector

In [40]:
type(model)
# As an example we take 'minions'; its vector has 300 dimensions
model['minions'].shape

(300,)

In [41]:
model['computer']

array([ 1.07421875e-01, -2.01171875e-01,  1.23046875e-01,  2.11914062e-01,
       -9.13085938e-02,  2.16796875e-01, -1.31835938e-01,  8.30078125e-02,
        2.02148438e-01,  4.78515625e-02,  3.66210938e-02, -2.45361328e-02,
        2.39257812e-02, -1.60156250e-01, -2.61230469e-02,  9.71679688e-02,
       -6.34765625e-02,  1.84570312e-01,  1.70898438e-01, -1.63085938e-01,
       -1.09375000e-01,  1.49414062e-01, -4.65393066e-04,  9.61914062e-02,
        1.68945312e-01,  2.60925293e-03,  8.93554688e-02,  6.49414062e-02,
        3.56445312e-02, -6.93359375e-02, -1.46484375e-01, -1.21093750e-01,
       -2.27539062e-01,  2.45361328e-02, -1.24511719e-01, -3.18359375e-01,
       -2.20703125e-01,  1.30859375e-01,  3.66210938e-02, -3.63769531e-02,
       -1.13281250e-01,  1.95312500e-01,  9.76562500e-02,  1.26953125e-01,
        6.59179688e-02,  6.93359375e-02,  1.02539062e-02,  1.75781250e-01,
       -1.68945312e-01,  1.21307373e-03, -2.98828125e-01, -1.15234375e-01,
        5.66406250e-02, -

In [42]:
model.wv.similarity('woman','man')

AttributeError: 'dict' object has no attribute 'wv'

In [None]:
model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
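
The `AttributeError` above arises because the pickled `model` is a plain dict of word vectors, so gensim's `.wv` helpers are not available on it. Similarity between two words can still be computed by hand as the cosine of their vectors. A minimal sketch, using a stand-in dict `toy_model` with made-up 3-dim vectors in place of the real 300-dim ones:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# stand-in for the pickled dict; the 3-dim vectors are made up
toy_model = {
    "woman": [0.9, 0.1, 0.3],
    "man": [0.8, 0.2, 0.3],
    "tea": [0.0, 1.0, -0.5],
}
sim_close = cosine_similarity(toy_model["woman"], toy_model["man"])
sim_far = cosine_similarity(toy_model["woman"], toy_model["tea"])
print(sim_close, sim_far)
```

With the real dict, the same function would be called as `cosine_similarity(model['woman'], model['man'])`, assuming both words are present as keys.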

In [53]:
# Train your own Word2Vec model using your own text corpus
import gensim

'''
i=0
list_of_sent=[]
for sent in final['Text'].values:
    filtered_sentence=[]
    sent=cleanhtml(sent)
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            if(cleaned_words.isalpha()):    
                filtered_sentence.append(cleaned_words.lower())
            else:
                continue 
    list_of_sent.append(filtered_sentence)
'''

In [45]:
pickleIn = open("listOfSentAfterCleaninhHTML_Punc.pickle","rb")
list_of_sent = pickle.load(pickleIn)

In [46]:
print(final['Text'].values[0])
print("******************************************************************")
print(list_of_sent[0])

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
******************************************************************
['this', 'witty', 'little', 'book', 'makes', 'my', 'son', 'laugh', 'at', 'loud', 'i', 'recite', 'it', 'in', 'the', 'car', 'as', 'were', 'driving', 'along', 'and', 'he', 'always', 'can', 'sing', 'the', 'refrain', 'hes', 'learned', 'about', 'whales', 'india', 'drooping', 'i', 'love', 'all', 'the', 'new', 'words', 'this', 'book', 'introduces', 'and', 'the', 'silliness', 'of', 'it', 'all', 'this', 'is', 'a', 'classic', 'book', 'i', 'am', 'willing', 'to', 'bet', 'my', 'son', 'will', 'still', 'be', 'able', 'to', 'recite', 'from', 'memory', 'when', 'he', 'is

Observation: We have split every review into words and cleaned the data by removing HTML tags and punctuation.

## Now training our own Word2Vec model

In [40]:
import gensim

# 1st param -> list of tokenized sentences to learn word vectors from
# min_count -> words that occur fewer than 5 times are ignored
# size -> number of dimensions of each word vector (renamed vector_size in gensim 4)
# workers -> number of worker threads; this machine has 4 cores

w2v_model = gensim.models.Word2Vec(list_of_sent, min_count=5, size=50, workers=4)

In [None]:
w2v_model

In [None]:
# gensim 4 replaces wv.vocab with wv.key_to_index
words = list(w2v_model.wv.vocab)

In [None]:
print(len(words))

In [None]:
type(words)
words[0:10]

In [None]:
w2v_model.wv.most_similar('tasty')

In [None]:
w2v_model.wv.most_similar('like')

In [None]:
w2v_model.wv.similarity('woman','man')

## Explanation: similarity is the cosine between the two word vectors; values near 1 mean very similar words, values near 0 mean unrelated words

In [None]:
# getting feature names from bag of words, count_vect has bow from the BOW code

count_vect_features = count_vect.get_feature_names() 
count_vect_features.index('like')

In [None]:
count_vect_features[64055]

# Avg W2v, TFIDF-W2v

Average W2V: given a sentence, take the w2v vector of every word in the sentence, sum them, and divide by the number of words. This gives the average W2V of the 'SENTENCE'.<br>

Note: Avg W2V is used to get the sentence vector.
<br>

TF-IDF weighted W2V: multiply the w2v vector of each word in the sentence by the tf-idf of that word, sum these products over all words, and divide by the sum of the tf-idf values.

Avg W2v = sum( w2v(wi) ) / (no. of words in the sentence)
<br>
Tf-IDF Weighted W2V = sum( ti * w2v(wi) ) / sum(ti) 
<br>
Here 'ti' is the tf-idf of word wi.
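
Both formulas can be checked by hand on a two-word sentence. The sketch below uses made-up 2-dim word vectors and made-up tf-idf weights; the function names are assumptions, and words without a vector (or without a weight) are simply skipped.

```python
def average_vector(words, word_vectors):
    """Avg W2V: mean of the vectors of the words that have one."""
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def tfidf_weighted_vector(words, word_vectors, tfidf):
    """TF-IDF weighted W2V: sum(t_i * w2v(w_i)) / sum(t_i)."""
    pairs = [(tfidf[w], word_vectors[w]) for w in words if w in word_vectors and w in tfidf]
    weight_sum = sum(t for t, _ in pairs)
    if weight_sum == 0:
        return None
    dim = len(pairs[0][1])
    return [sum(t * v[i] for t, v in pairs) / weight_sum for i in range(dim)]

# made-up 2-dim word vectors and tf-idf weights for one sentence
word_vectors = {"tasty": [1.0, 0.0], "tea": [0.0, 1.0]}
tfidf = {"tasty": 0.75, "tea": 0.25}
sentence = ["tasty", "tea"]
print(average_vector(sentence, word_vectors))
print(tfidf_weighted_vector(sentence, word_vectors, tfidf))
```

The plain average treats both words equally, while the weighted version pulls the sentence vector towards "tasty", the word with the higher tf-idf.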

In [56]:
# average Word2Vec
# compute average word2vec for each review.
sent_vectors = [] # the avg-w2v for each sentence/review is stored in this list
for sent in list_of_sent: # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of 50 length
    cnt_words = 0 # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        try:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
        except KeyError: # word is not in the Word2Vec vocabulary
            pass
    if cnt_words != 0: # avoid dividing by zero when no word has a vector
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)

print(len(sent_vectors)) #no.of reviews
print(len(sent_vectors[0])) #no.of dimensions

  


364171
50


In [62]:
import pickle

'''
pickle_out = open("BOW_tfidf_avgW2V_TfidfW2V.pickle","wb")

pickle.dump(count_vect,pickle_out) #BOW
pickle.dump(final_counts,pickle_out) #BOW

pickle.dump(tf_idf_vect,pickle_out) #tfidf
pickle.dump(final_tf_idf,pickle_out) #tfidf
pickle.dump(features,pickle_out) #tfidf feature/ unique words


pickle.dump(w2v_model,pickle_out) #custom W2V model
pickle.dump(words,pickle_out) #custom W2V model

pickle.dump(sent_vectors,pickle_out) #avg W2V model

'''

In [16]:
import pickle
pickleIn = open("listOfSentAfterCleaninhHTML_Punc.pickle","rb")
list_of_sent = pickle.load(pickleIn)

In [17]:
pickle_in = open("BOW_tfidf_avgW2V_TfidfW2V.pickle","rb")
count_vect = pickle.load(pickle_in) #BOW
final_counts = pickle.load(pickle_in) #BOW

tf_idf_vect = pickle.load(pickle_in) #TFIDF
final_tf_idf = pickle.load(pickle_in) #TFIDF
features = pickle.load(pickle_in) #TFIDF

w2v_model = pickle.load(pickle_in) #w2v
words = pickle.load(pickle_in) #w2v

sent_vectors = pickle.load(pickle_in) #avg W2V



In [39]:
# TF-IDF weighted Word2Vec
# tf_idf_vect.vocabulary_ maps each word/n-gram to its column in final_tf_idf
# (a dict lookup is much faster than searching the feature list with .index)
tfidf_vocab = tf_idf_vect.vocabulary_
# final_tf_idf is the sparse matrix with row= sentence, col=word and cell_val = tfidf

tfidf_sent_vectors = [] # the tfidf-w2v for each sentence/review is stored in this list
row = 0
for sent in list_of_sent[0:1000]: # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of 50 length
    weight_sum = 0 # sum of the tfidf weights of the words with a valid vector
    for word in sent: # for each word in a review/sentence
        try:
            vec = w2v_model.wv[word]
            # obtain the tfidf of the word in this sentence/review
            tfidf = final_tf_idf[row, tfidf_vocab[word]]
            sent_vec += (vec * tfidf)
            weight_sum += tfidf
        except KeyError: # word missing from the Word2Vec or tfidf vocabulary
            pass
    if weight_sum != 0: # avoid dividing by zero when no word was weighted
        sent_vec /= weight_sum

    tfidf_sent_vectors.append(sent_vec)
    row += 1
    

In [40]:

pickle_out = open("WiightedTfidfW2V.pickle","wb")

pickle.dump(tfidf_sent_vectors,pickle_out) #tfidf weighted w2v model



pickle_out.close()




In [41]:
len(tfidf_sent_vectors)

1000

In [42]:
tfidf_sent_vectors

[array([-0.29627643, -0.51362031, -0.58506384,  0.0860804 ,  0.72537245,
        -0.35187534,  0.79041181,  0.01443852,  0.76386532, -0.37197787,
         0.17360276,  0.04758323,  0.23676344, -0.39042325, -0.08103206,
         0.35537419, -0.85508999,  0.03277539, -0.76552184,  0.13424716,
         0.30303481, -0.09814295, -0.10739398,  0.83406531, -1.41693885,
        -0.46265499, -0.33612626, -0.43161811, -0.04409374, -0.01153882,
        -0.11547316, -0.4279345 , -0.10832784, -0.65109224,  0.48110606,
         0.18781087, -0.7394375 , -0.07746156,  0.64385304, -0.07342261,
        -1.02265508, -0.25985266,  0.07963824, -0.44332141,  0.35927659,
         0.31822868,  0.51050469,  0.85788021,  0.21028271,  0.93582324]),
 array([ 0.54159437, -1.0764013 ,  0.48108009, -0.39473469,  0.59594002,
        -0.27888179,  0.53715707, -0.41475229,  0.38884871, -0.25067827,
         0.39091359,  0.29503965,  0.02602744,  0.33568362, -0.12182654,
         0.65016806, -0.7719431 ,  0.86089429, -0

In [None]:
# TF-IDF weighted Word2Vec, now for the first 10000 reviews
# tf_idf_vect.vocabulary_ maps each word/n-gram to its column in final_tf_idf
# (a dict lookup is much faster than searching the feature list with .index)
tfidf_vocab = tf_idf_vect.vocabulary_
# final_tf_idf is the sparse matrix with row= sentence, col=word and cell_val = tfidf

tfidf_sent_vectors = [] # the tfidf-w2v for each sentence/review is stored in this list
row = 0
for sent in list_of_sent[0:10000]: # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of 50 length
    weight_sum = 0 # sum of the tfidf weights of the words with a valid vector
    for word in sent: # for each word in a review/sentence
        try:
            vec = w2v_model.wv[word]
            # obtain the tfidf of the word in this sentence/review
            tfidf = final_tf_idf[row, tfidf_vocab[word]]
            sent_vec += (vec * tfidf)
            weight_sum += tfidf
        except KeyError: # word missing from the Word2Vec or tfidf vocabulary
            pass
    if weight_sum != 0: # avoid dividing by zero when no word was weighted
        sent_vec /= weight_sum

    tfidf_sent_vectors.append(sent_vec)
    row += 1
    

In [None]:
pickle_out = open("WiightedTfidfW2V10kDatapoints.pickle","wb")

pickle.dump(tfidf_sent_vectors,pickle_out) #tfidf weighted w2v model



pickle_out.close()

In [None]:
len(tfidf_sent_vectors)