# Amazon Fine Food Reviews Analysis

Attribute Information:
    1. Id
    2. ProductID - unique identifier for the product
    3. UserID - unique identifier for the user
    4. ProfileName
    5. HelpfulnessNumerator - number of users who found the review useful
    6. HelpfulnessDenominator - number of users indicating whether they found the review helpful or not
    7. Score - rating between 1 & 5
    8. Time - timestamp of the review
    9. Summary - brief summary of the review
    10. Text - text of the review.
    

<b>Task:</b>

Given a review, determine whether the review is positive (Rating of 4 / 5) or negative (Rating of 1 / 2).

In [1]:
%matplotlib inline

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

# Using the SQLite table to read data
con = sqlite3.connect('./database.sqlite')

# Filtering only positive and negative reviews
filtered_data = pd.read_sql_query("""SELECT * FROM Reviews WHERE Score != 3""", con)

# Give reviews with Score > 3 a positive rating, and reviews with Score < 3 a negative rating.
def partition(x):
    if x < 3:
        return 'negative'
    else:
        return 'positive'
    
# Replace the numeric Score with its 'positive' / 'negative' label
actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition)
filtered_data['Score'] = positiveNegative

In [2]:
filtered_data.shape
filtered_data.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | positive | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | negative | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | positive | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | negative | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | positive | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


<h3>Data Cleaning: Deduplication</h3>

In real-world machine learning projects, we spend 20% to 30% of our time on Data Cleaning & Preprocessing.

The dataset has many duplicate rows.

Duplicates add no value to the system and bias the results, so they must be removed before analysing the data.

In [3]:
display = pd.read_sql_query("""
SELECT * FROM Reviews WHERE Score != 3 AND 
UserId = 'AR5J8UI46CURR' ORDER BY ProductID""", con)
display

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 78445 | B000HDL1RQ | AR5J8UI46CURR | Geetha Krishnan | 2 | 2 | 5 | 1199577600 | LOACKER QUADRATINI VANILLA WAFERS | DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ... |
| 1 | 138317 | B000HDOPYC | AR5J8UI46CURR | Geetha Krishnan | 2 | 2 | 5 | 1199577600 | LOACKER QUADRATINI VANILLA WAFERS | DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ... |
| 2 | 138277 | B000HDOPYM | AR5J8UI46CURR | Geetha Krishnan | 2 | 2 | 5 | 1199577600 | LOACKER QUADRATINI VANILLA WAFERS | DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ... |
| 3 | 73791 | B000HDOPZG | AR5J8UI46CURR | Geetha Krishnan | 2 | 2 | 5 | 1199577600 | LOACKER QUADRATINI VANILLA WAFERS | DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ... |
| 4 | 155049 | B000PAQ75C | AR5J8UI46CURR | Geetha Krishnan | 2 | 2 | 5 | 1199577600 | LOACKER QUADRATINI VANILLA WAFERS | DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ... |


To reduce this redundancy, rows that share the same UserId, ProfileName, Time and Text are treated as duplicates and eliminated.

Method is as follows:
    1. Sort the data according to ProductId, then keep only the first review in each group of duplicates and delete the others.
    
Sorting first makes the choice of representative deterministic: the copy with the smallest ProductId always survives. Without sorting, which copy is kept would depend on the arbitrary row order of the data.
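The sort-then-keep-first idea can be sketched in plain Python on a few made-up rows. This is only an illustration under stated assumptions: the rows and the `deduplicate` helper are hypothetical, and the notebook itself does the same job with pandas.

```python
# Toy rows (made up for illustration): one review posted under three product IDs,
# plus one unrelated review.
rows = [
    {"ProductId": "B003", "UserId": "U1", "ProfileName": "Ann", "Time": 100, "Text": "Great wafers"},
    {"ProductId": "B001", "UserId": "U1", "ProfileName": "Ann", "Time": 100, "Text": "Great wafers"},
    {"ProductId": "B002", "UserId": "U2", "ProfileName": "Bob", "Time": 200, "Text": "Too sweet"},
    {"ProductId": "B002", "UserId": "U1", "ProfileName": "Ann", "Time": 100, "Text": "Great wafers"},
]

def deduplicate(rows):
    """Sort by ProductId, then keep the first row per (UserId, ProfileName, Time, Text)."""
    seen = set()
    kept = []
    for row in sorted(rows, key=lambda r: r["ProductId"]):
        key = (row["UserId"], row["ProfileName"], row["Time"], row["Text"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

unique_rows = deduplicate(rows)
print([r["ProductId"] for r in unique_rows])  # ['B001', 'B002']
```

Because the rows are sorted before the scan, Ann's review survives under B001 no matter how the input was ordered.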

In [4]:
# Sort data according to ProductId in ascending order.
sorted_data = filtered_data.sort_values('ProductId', axis = 0, ascending = True, inplace = False)

In [5]:
# Deduplicate the entries, then check the number of rows left in the dataset
final = sorted_data.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"], keep = 'first', inplace = False)
final.shape

(364173, 10)

In [6]:
# Check what percentage of the data remains after removing the duplicates
# Retained 69% of the data
final['Id'].size / filtered_data['Id'].size * 100

69.25890143662969

<h4>Observation 2</h4> HelpfulnessNumerator should never be greater than HelpfulnessDenominator, yet a few rows violate this.

In [7]:
display = pd.read_sql_query("""
SELECT * FROM Reviews 
WHERE Score != 3 AND Id IN (44737, 64422)
ORDER BY ProductID""", con)
display

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 64422 | B000MIDROQ | A161DK06JJMCYF | J. E. Stephens "Jeanne" | 3 | 1 | 5 | 1224892800 | Bought This for My Son at College | My son loves spaghetti so I didn't hesitate or... |
| 1 | 44737 | B001EQ55RW | A2V0I904FH7ABY | Ram | 3 | 2 | 4 | 1212883200 | Pure cocoa taste with crunchy almonds inside | It was almost a 'love at first bite' - the per... |


In [8]:
# Only keep those rows where HelpfulnessDenominator >= HelpfulnessNumerator
final = final[final.HelpfulnessNumerator <= final.HelpfulnessDenominator]

<h4>Next Phase of Preprocessing</h4>

In [9]:
print(final.shape)

# How many positives and negatives are present in our dataset after removing the duplicates 
# and other discrepancies
final['Score'].value_counts()

(364171, 10)


positive    307061
negative     57110
Name: Score, dtype: int64

Given the 8 features as input, we have to predict the sentiment polarity (+ve/-ve).
The polarity itself is derived from the Score.

The most useful features are Summary and Text.

If a problem can be expressed in terms of vectors, we can leverage the power of Linear Algebra.

How do we convert text into numerical vectors?

Convert each review text into a d-dimensional (d-D) vector.

Suppose we have many such vectors. Each one is a point in d-D space that represents a review.

If the representation is good, we can draw a hyperplane 'pi' that separates the positive reviews from the negative reviews.

So there are two steps:

1. Converting the review text into a d-D vector.
2. Finding a plane to separate the reviews.

<h4>Rules/Properties of this conversion </h4>

Suppose we have 3 reviews - $r_{1}$, $r_{2}$ and $r_{3}$, with d-D vector representations $r_{1}$ -> $v_{1}$, $r_{2}$ -> $v_{2}$, $r_{3}$ -> $v_{3}$.

If $r_{1}$ and $r_{2}$ are semantically more similar than $r_{1}$ and $r_{3}$, i.e. sim($r_{1}$, $r_{2}$) > sim($r_{1}$, $r_{3}$) as English text, then dist($v_{1}$, $v_{2}$) < dist($v_{1}$, $v_{3}$). 

If $r_{1}$ & $r_{2}$ are more similar, $v_{1}$ and $v_{2}$ must be closer, i.e. <b>length($v_{1}$  - $v_{2}$) < length($v_{1}$ - $v_{3}$)</b>

<b>Find a mapping {text -> d-D vector} such that similar texts are geometrically closer.</b>

# Bag of Words (BoW)

The simplest technique to convert text to a numerical vector is <b>Bag of Words (BoW)</b>.

$r_{1}$: This pasta is very tasty and affordable.

$r_{2}$: This pasta is not tasty and is affordable.

$r_{3}$: This pasta is delicious and cheap.

$r_{4}$: Pasta is tasty and pasta tastes good.

In NLP, a review is known as a document. A set of documents is called a <b>corpus</b>.

1. Construct a dictionary: the set of all unique words in the reviews. 
{This, pasta, ...}
2. Construct a vector $v_{i}$ of size 'd', where d is the number of unique words. Each word is a different dimension, and each cell holds the # of times that word occurs in the review/document $r_{i}$.

$v_{i}$ is a sparse vector - most of the elements are zero.
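The two construction steps can be sketched in plain Python on the four pasta reviews. This is a minimal illustration, not the notebook's method (which uses `CountVectorizer` later on); the `tokenize` and `bow_vector` helpers are hypothetical names.

```python
from collections import Counter

reviews = [
    "This pasta is very tasty and affordable.",
    "This pasta is not tasty and is affordable.",
    "This pasta is delicious and cheap.",
    "Pasta is tasty and pasta tastes good.",
]

def tokenize(text):
    """Lowercase the text, split on whitespace and strip surrounding punctuation."""
    return [w.strip(".,!?").lower() for w in text.split()]

# Step 1: the dictionary is the (sorted) set of all unique words in the corpus.
vocabulary = sorted({w for r in reviews for w in tokenize(r)})

# Step 2: one cell per dictionary word, holding how often it occurs in the review.
def bow_vector(text):
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocabulary]

vectors = [bow_vector(r) for r in reviews]
print(vocabulary)
print(vectors[3])  # 'pasta' is counted twice in r4
```

Even with only 12 dictionary words, most cells of each vector are already 0. With a real corpus of hundreds of thousands of words the vectors are almost entirely zeros.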

<b>Objective of BoW:</b> Similar texts must result in closer vectors.

When each word occurs at most once per review, comparing two BoW vectors amounts to counting words: how many words do the reviews have in common, and how many differ?

BoW does not work very well when there are small changes in the wording we are using.

<b>Binary BoW</b> or <b>Boolean BoW</b> is a variation of BoW. Instead of the count, we put 1 if the word occurs at least once and 0 if the word doesn't occur. 

For Binary BoW, ||$v_{1}$ - $v_{2}$|| = $\sqrt{\text{number of differing words}}$ between documents/reviews $r_{1}$ and $r_{2}$, i.e. words that occur in exactly one of the two.
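A quick check of this distance property on $r_{1}$ and $r_{2}$, as a plain-Python sketch. The `binary_bow` helper is a hypothetical name used only for this illustration.

```python
import math

def binary_bow(text, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the text at least once."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return [1 if w in words else 0 for w in vocabulary]

r1 = "This pasta is very tasty and affordable."
r2 = "This pasta is not tasty and is affordable."
vocabulary = sorted({w.strip(".,!?").lower() for w in (r1 + " " + r2).split()})

v1 = binary_bow(r1, vocabulary)
v2 = binary_bow(r2, vocabulary)

# Euclidean distance between the two binary vectors
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
# Number of words that occur in exactly one of the two reviews ('very' and 'not')
differing = sum(a != b for a, b in zip(v1, v2))

print(distance, math.sqrt(differing))
```

Both printed values agree because each differing word contributes exactly $(1 - 0)^2 = 1$ to the squared distance.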

Words like {This, is, and} do not matter much. What matters most are the non-trivial words.

The trivial words are called <b>Stop-words</b>, and they are usually removed.

If we remove the Stop-words, the BoW vector becomes smaller and more meaningful. The Stop-words are thrown away while constructing the vector.

However, in English 'not' is also considered a Stop-word, and dropping it can flip the meaning of a review.

<b>So, removing the stop-words is not always the best choice.</b>
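To see why, here is a small sketch with a hand-picked stop-word set. The set is an assumption for illustration only; real lists such as NLTK's are much longer, though they do include 'very' and 'not'.

```python
# Tiny illustrative stop-word set (an assumption, not NLTK's full list)
STOP_WORDS = {"this", "is", "and", "very", "not"}

def content_words(text):
    """Lowercase, strip punctuation and drop the stop-words."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [w for w in words if w not in STOP_WORDS]

r1 = "This pasta is very tasty and affordable."
r2 = "This pasta is not tasty and is affordable."

print(content_words(r1))  # ['pasta', 'tasty', 'affordable']
print(content_words(r2))  # ['pasta', 'tasty', 'affordable']
```

After stop-word removal the two reviews are indistinguishable, even though one praises the taste and the other criticises it.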

<h4>Text Pre-processing steps</h4>
1. Removing <b>Stop-words</b>.
2. Converting all words to <b>lowercase</b>.
3. <b>Stemming</b>: reducing words that come from the same base word in English to a common root. Eg. taste, tastes, tasteful -> tast. All these words are replaced with the common form, so related words are treated as a single root word. The stem itself need not be a valid English word.
Stemming algorithms - PorterStemmer, SnowballStemmer 
4. <b>Tokenization</b>: breaking up a sentence into words. Most simply, a space is used to break the sentence into words. 
Eg. This pasta is very tasty. This is the best in New York.
But there can be multi-word terms like New York, which is a single location. 
A plain split breaks it apart, but there are tokenizers available which will group New York into 1 token. 
A related step is <b>Lemmatization</b>: like stemming, it maps each word to a base form, but the result is always a valid dictionary word (the lemma).
5. Tasty and delicious are synonyms, i.e. very similar in meaning. But BoW treats them as 2 unrelated words because they are 2 different dimensions. BoW does not take the semantic meaning of words into consideration. A technique called <b>Word2Vec</b> tries to take the semantic meaning of words into account when building vectors from text.  

<b>BoW + Text Preprocessing </b>
This converts text to a d-D vector, but it doesn't guarantee that words with the same meaning end up in the same place. 
$r_{1}$ and $r_{3}$ are semantically the same, but our algorithm still thinks they are different. 

<b>The <u>drawback</u> of BoW is <i>it doesn't take semantic meaning into consideration</i>.</b>

# Uni-gram, Bi-gram, n-gram

$r_{1}$: This pasta is very tasty and affordable.

$r_{2}$: This pasta is not tasty and is affordable.

After removing stop-words, $v_{1}$ and $v_{2}$ are exactly the same, which would imply that $r_{1}$ and $r_{2}$ are very similar. That is not TRUE. 

In fact, $r_{1}$ and $r_{2}$ express opposite opinions. 

<b>Uni-gram</b>: Each word is considered as a dimension.

<b>Bi-gram</b>: Each pair of consecutive words is considered as a dimension.

<b>Tri-gram</b>: Each run of 3 consecutive words is considered as a dimension.

<b>n-gram</b>: Each run of n consecutive words is considered as a dimension.

<b>Why n-gram?</b> Uni-gram based BoW discards the sequence information. But using bi-gram, tri-gram or n-gram we are trying to retain some of the partial sequence information. 

Bi-grams, tri-grams or n-grams can be easily incorporated into BoW.  

<b># of bi-grams >= # of uni-grams</b> in a typical corpus, because each word appears next to many different neighbours, so there are more distinct pairs of consecutive words than distinct words. 

By the same argument, for a large corpus we typically have <b># of n-grams >= ... >= # of tri-grams >= # of bi-grams >= # of uni-grams</b>

For n-grams, where n > 1, dimensionality 'd' increases drastically.
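A minimal sketch of n-gram extraction on $r_{1}$ and $r_{2}$ (lowercased, punctuation already removed). The `ngrams` helper is a hypothetical name; in the notebook the same effect is obtained through the `ngram_range` argument of `CountVectorizer`.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

r1 = "this pasta is very tasty and affordable".split()
r2 = "this pasta is not tasty and is affordable".split()

bigrams_r1 = set(ngrams(r1, 2))
bigrams_r2 = set(ngrams(r2, 2))

# Bi-grams that only r2 contains: the negation survives as its own dimension
print(sorted(bigrams_r2 - bigrams_r1))
# ['and is', 'is affordable', 'is not', 'not tasty']
```

The bi-gram 'not tasty' gives $r_{2}$ a dimension that $r_{1}$ lacks, so the two reviews no longer collapse onto each other.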

# TF-IDF(Term Frequency - Inverse Document Frequency)

Variation of BoW

Let us assume we have 'N' documents / reviews. Each review is a combination of words.

Suppose $r_{1}$ contains the words below, and similarly for the other documents.

$r_{1}$: $W_{1}$, $W_{2}$, $W_{3}$, $W_{2}$, $W_{5}$             --> 5 words

$r_{2}$: $W_{1}$, $W_{3}$, $W_{4}$, $W_{5}$, $W_{6}$, $W_{2}$    --> 6 words

$r_{3}$: 

.
.
.

$r_{N}$:

TF($W_{i}$, $r_{j}$) = (# of times $W_{i}$ occurs in $r_{j}$) / (total number of words in $r_{j}$)

TF($W_{2}$, $r_{1}$) = 2 / 5

<b>0 <= TF($W_{i}$, $r_{j}$) <= 1 </b> Can be interpreted as Probability.

<u>BoW</u> and <u>TF-IDF</u> are techniques done on the text for <i><b>Information Retrieval</b></i> (sub-area of NLP).

TF can be thought of as how often $W_{i}$ occurs in $r_{j}$. If the document consists of only that one word repeated, the TF is 1; if the word occurs only a few times in a long document, the TF is very small. <b> The more often the word occurs, the higher the term frequency. </b>

<b>Term Frequency can be thought of as the probability of finding a word $W_{i}$ in a document $r_{j}$. </b>
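The TF definition can be written out directly. This is a small sketch in which documents are plain lists of word labels; `term_frequency` is a hypothetical helper name.

```python
from collections import Counter

def term_frequency(word, document):
    """Occurrences of `word` in `document` divided by the document length."""
    return Counter(document)[word] / len(document)

r1 = ["W1", "W2", "W3", "W2", "W5"]

print(term_frequency("W2", r1))  # 2 / 5 = 0.4
print(term_frequency("W4", r1))  # 0.0, the word does not occur in r1
```

Summing the TF of every distinct word in a document gives 1, which is why TF can be read as a probability.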

<b><i>IDF- Inverse Document Frequency</i></b> is for a word $W_{i}$ in a corpus.

Suppose Dataset/Corpus ($D_{c}$) has the following documents:
 
$r_{1}$: $W_{1}$, $W_{2}$, $W_{3}$, $W_{2}$, $W_{5}$             --> 5 words

$r_{2}$: $W_{1}$, $W_{3}$, $W_{4}$, $W_{5}$, $W_{6}$, $W_{2}$    --> 6 words

$r_{3}$: 

.
.
.

$r_{N}$:

<b>IDF($W_{i}$, $D_{c}$) = log(N/$n_{i}$)</b>, where N is the number of documents and $n_{i}$ is the number of documents which contain the word $W_{i}$

<b>Since $n_{i}$ <= N, N/$n_{i}$ >= 1. So, log(N/$n_{i}$) >= 0</b>

1. IDF >= 0
2. If $n_{i}$ increases, then N/$n_{i}$ decreases. Since log is a monotonically increasing function, log(N/$n_{i}$) decreases as well. 
<b>If $W_{i}$ occurs in more documents of $D_{c}$, the IDF is lower.</b> Hence, if IDF increases, $n_{i}$ decreases and vice-versa.

<b><i>If $W_{i}$ is more frequent, IDF will be low and if $W_{i}$ is very rare, IDF will be high.</i></b>
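A sketch of the IDF formula on a three-document toy corpus. The first two documents are $r_{1}$ and $r_{2}$ from above; the third is made up so that one word (W7) is rare. The helper name is hypothetical, and it assumes the word occurs in at least one document.

```python
import math

corpus = [
    ["W1", "W2", "W3", "W2", "W5"],
    ["W1", "W3", "W4", "W5", "W6", "W2"],
    ["W1", "W7"],  # hypothetical third document
]

def inverse_document_frequency(word, corpus):
    """log(N / n_i); assumes the word occurs in at least one document."""
    n_i = sum(1 for document in corpus if word in document)
    return math.log(len(corpus) / n_i)

print(inverse_document_frequency("W1", corpus))  # in all 3 documents -> log(1) = 0.0
print(inverse_document_frequency("W2", corpus))  # in 2 documents -> log(3/2)
print(inverse_document_frequency("W7", corpus))  # in 1 document  -> log(3)
```

A word that appears in every document carries no discriminating information, and its IDF of 0 reflects that.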

Given documents {$r_{1}$, $r_{2}$, $r_{3}$,..., $r_{N}$} in $D_{c}$, <b>TF-IDF($W_{i}$, $r_{j}$) = TF($W_{i}$, $r_{j}$) * IDF($W_{i}$, $D_{c}$)</b>. TF($W_{i}$, $r_{j}$) is higher if $W_{i}$ is frequent in $r_{j}$, and IDF($W_{i}$, $D_{c}$) is higher when $W_{i}$ is rare in $D_{c}$.

<b>TF-IDF
- gives more importance to rarer words in $D_{c}$.
- gives more importance if a word is more frequent in a document/review.</b>

But TF-IDF has a <u>drawback</u>: it <i><b>does not</b> take the semantic meaning of words into account</i>.
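Putting the two factors together, here is a sketch of the raw formula on a toy corpus. The third document is made up, and `tf_idf` is a hypothetical helper. Note that scikit-learn's `TfidfVectorizer`, used later in this notebook, applies smoothing and normalisation by default, so its numbers will differ from this textbook version.

```python
import math
from collections import Counter

corpus = [
    ["W1", "W2", "W3", "W2", "W5"],
    ["W1", "W3", "W4", "W5", "W6", "W2"],
    ["W1", "W7"],  # hypothetical third document
]

def tf_idf(word, document, corpus):
    """TF(word, document) * IDF(word, corpus), using the definitions above."""
    tf = Counter(document)[word] / len(document)
    n_i = sum(1 for d in corpus if word in d)
    return tf * math.log(len(corpus) / n_i)

# Score every distinct word of r1
scores = {w: tf_idf(w, corpus[0], corpus) for w in sorted(set(corpus[0]))}
for word, score in scores.items():
    print(word, round(score, 4))
```

W1 occurs in every document and scores 0, while W2 scores highest in $r_{1}$ because it occurs there twice.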

# Word2Vec

<b>Word2Vec</b> takes semantic meaning of words into consideration.

This algorithm takes a word and converts it into a d-D vector, where d is typically 50, 100, 200 or 300. Unlike the sparse vectors that BoW / TF-IDF produce for sentences, this is a dense vector. 

Consider a 300-D vector. The higher the dimensionality, the more powerful the representation.

1. If $W_{1}$ and $W_{2}$ are semantically similar, then their vectors $v_{1}$ and $v_{2}$ are closer.
2. The vectors preserve relationships between words. For example, the difference between man and woman is roughly parallel to the difference between king and queen:

<b>($V_{man}$ - $V_{woman}$) || ($V_{king}$ - $V_{queen}$) </b>

Word2Vec learns relationships automatically from raw-text.
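What "parallel" means here can be shown with hand-made toy vectors. These are NOT real Word2Vec embeddings: the values below are invented so that the second coordinate encodes gender, purely to illustrate the geometry.

```python
import math

# Hand-made 3-D toy vectors (an assumption for illustration, not trained embeddings)
toy_vectors = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [3.0, 0.0, 0.9],
    "queen": [3.0, 1.0, 0.9],
}

def difference(a, b):
    """Component-wise a - b."""
    return [x - y for x, y in zip(a, b)]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means the vectors are parallel."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

man_to_woman = difference(toy_vectors["man"], toy_vectors["woman"])
king_to_queen = difference(toy_vectors["king"], toy_vectors["queen"])
parallelism = cosine_similarity(man_to_woman, king_to_queen)
print(parallelism)  # 1.0 for these toy vectors
```

With real embeddings the cosine is below 1, but the two difference vectors still point in a similar direction.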

Word2Vec takes a very large text Corpus as input and for every word it builds a vector. 

Larger dimensions --> the more information-rich the vector is. A higher dimensional vector can capture far more complex relationships. 

The larger the corpus $D_{c}$, the higher the dimensionality we can use. 

Word2Vec looks at the sequence information of words. Intuitively, for any word, Word2Vec looks at the neighborhood of that word. 

If N($W_{i}$), the neighborhood of $W_{i}$, is very similar to N($W_{j}$), then $v_{i}$ is very similar to $v_{j}$.

# Avg-Word2Vec, tf-idf weighted Word2Vec

<h4> Avg-Word2Vec </h4>

Word2Vec takes a word and converts it into a d-D vector. 

But $r_{i}$ is a sequence of words/sentences.

How do I convert my sentences to a vector using Word2Vec?

Suppose we have a review $r_{1}$ containing words
$r_{1}$: $W_{1}$, $W_{2}$, $W_{1}$, $W_{3}$, $W_{4}$, $W_{5}$

To convert $r_{1}$ to a vector $v_{1}$, take the Avg-Word2Vec representation. 
Take the first word $W_{1}$ and convert it into a vector W2V($W_{1}$).

Doing this for each word in $r_{1}$ gives one vector per word, and these vectors are added up:

W2V($W_{1}$) + W2V($W_{2}$) + W2V($W_{1}$) + W2V($W_{3}$) + W2V($W_{4}$) + W2V($W_{5}$)

Each of these vectors will be d-D. Add all these vectors and then divide the sum by the number of words. 

Suppose in $r_{1}$, there are $n_{1}$ words then $v_{1}$ becomes <b>1/$n_{1}$[W2V($W_{1}$) + W2V($W_{2}$) + W2V($W_{1}$) + W2V($W_{3}$) + W2V($W_{4}$) + W2V($W_{5}$)]</b>

$v_{1}$ is the vector representation of review $r_{1}$. This is known as Avg-Word2Vec. It is not perfect but it works well. This is the simplest way to leverage Word2Vec to build sentence vectors. 
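A sketch of the averaging step with made-up 2-D word vectors. The `w2v` dictionary stands in for a trained model and `average_word2vec` is a hypothetical helper name.

```python
# Made-up 2-D word vectors standing in for a trained Word2Vec model
w2v = {
    "W1": [1.0, 0.0],
    "W2": [0.0, 1.0],
    "W3": [1.0, 1.0],
    "W4": [2.0, 0.0],
    "W5": [0.0, 2.0],
}

def average_word2vec(words, w2v):
    """Add the vectors of all words (repeats included) and divide by the word count."""
    dim = len(next(iter(w2v.values())))
    total = [0.0] * dim
    for word in words:
        for k, value in enumerate(w2v[word]):
            total[k] += value
    return [value / len(words) for value in total]

r1 = ["W1", "W2", "W1", "W3", "W4", "W5"]
v1 = average_word2vec(r1, w2v)
print(v1)  # the summed vector [5.0, 4.0] divided by n1 = 6
```

Since W1 occurs twice in $r_{1}$, its vector is added twice, exactly as in the formula above.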

<h4> tf-idf weighted Word2Vec </h4>

Suppose we have a review $r_{1}$ containing words
$r_{1}$: $W_{1}$, $W_{2}$, $W_{1}$, $W_{3}$, $W_{4}$, $W_{5}$

We compute the tf-idf values of the words in $r_{1}$ as $t_{1}$, $t_{2}$, $t_{3}$, $t_{4}$, $t_{5}$.

When we compute tf-idf-weighted Word2Vec of $r_{1}$

<b>tfidf-W2V($r_{1}$) = [$t_{1}$ * W2V($W_{1}$) + $t_{2}$ * W2V($W_{2}$) + $t_{3}$ * W2V($W_{3}$) + $t_{4}$ * W2V($W_{4}$) + $t_{5}$ * W2V($W_{5}$)] / ($t_{1}$ + $t_{2}$ + $t_{3}$ + $t_{4}$ + $t_{5}$)</b> where $t_{i}$ is the tf-idf of the word $W_{i}$ in review $r_{1}$, i.e. <b>$t_{i}$ = tf-idf($W_{i}$, $r_{1}$)</b>
    
    
<b>If all $t_{i}$'s are 1, then tfidf-W2V reduces to a plain average of the word vectors, as in Avg-Word2Vec.</b>

<h3>Avg-Word2Vec and tf-idf weighted Word2Vec are simple weighting strategies to convert sentences/paragraphs to vectors.</h3>
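The weighted version can be sketched the same way. Both the 2-D word vectors and the tf-idf weights below are made up for illustration, and `tfidf_weighted_word2vec` is a hypothetical helper name.

```python
# Made-up 2-D word vectors and tf-idf weights (assumptions for illustration only)
w2v = {
    "W1": [1.0, 0.0],
    "W2": [0.0, 1.0],
    "W3": [1.0, 1.0],
    "W4": [2.0, 0.0],
    "W5": [0.0, 2.0],
}
tfidf_weights = {"W1": 0.1, "W2": 0.4, "W3": 0.2, "W4": 0.2, "W5": 0.1}

def tfidf_weighted_word2vec(weights, w2v):
    """Sum t_i * W2V(W_i) over the words, then divide by the sum of the t_i."""
    dim = len(next(iter(w2v.values())))
    total = [0.0] * dim
    for word, t in weights.items():
        for k, value in enumerate(w2v[word]):
            total[k] += t * value
    weight_sum = sum(weights.values())
    return [value / weight_sum for value in total]

weighted = tfidf_weighted_word2vec(tfidf_weights, w2v)
print(weighted)

# With every weight set to 1 the result is the plain average of the five vectors
uniform = tfidf_weighted_word2vec({w: 1.0 for w in w2v}, w2v)
print(uniform)  # [0.8, 0.8]
```

Words with a high tf-idf pull the review vector towards their own direction, so W2 dominates the weighted result here.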

# Bag of Words (BoW)

In [10]:
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['Text'].values)

In [11]:
type(final_counts)

scipy.sparse.csr.csr_matrix

In [12]:
final_counts.get_shape()

(364171, 115281)

<h2>Text Preprocessing</h2>

1. Begin by removing all the HTML tags.
2. Remove any punctuation or a set of special characters like , or . or # etc.
3. Check that the word consists only of English letters and is not alpha-numeric.
4. Check that the length of the word is greater than 2.
5. Convert the word to lowercase. 
6. Remove Stop-words. 
7. Finally, stem the word. <b>Snowball Stemming is observed to be better than Porter Stemming</b>

After executing the above steps, collect the words used in positive reviews and the words used in negative reviews.
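Steps 1 to 6 can be sketched with the standard library alone. This is a simplified illustration, not the notebook's implementation: the stop-word set is a tiny made-up sample, all punctuation is replaced by spaces, and stemming (step 7) is left out because it needs NLTK.

```python
import re

# Tiny illustrative stop-word set (an assumption; NLTK's list is much longer)
STOP_WORDS = {"this", "the", "and", "that", "for", "with"}

def preprocess(review):
    """Apply steps 1-6 to one review and return the surviving words."""
    text = re.sub(r"<.*?>", " ", review)          # 1. remove HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # 2. punctuation -> space
    words = []
    for word in text.split():
        if not word.isalpha():                    # 3. letters only
            continue
        if len(word) <= 2:                        # 4. longer than 2 characters
            continue
        word = word.lower()                       # 5. lowercase
        if word in STOP_WORDS:                    # 6. drop stop-words
            continue
        words.append(word)
    return words

print(preprocess("This book is GREAT!<br /><br />Well suited to 4+ year olds."))
# ['book', 'great', 'well', 'suited', 'year', 'olds']
```

Short words such as 'is' and 'to' are already removed by the length check in step 4, before the stop-word list is consulted.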

In [13]:
import re

# Print the index and text of the first review that contains an HTML tag
for i, sentence in enumerate(final['Text'].values):
    if re.findall('<.*?>', sentence):
        print(i)
        print(sentence)
        break

6
I set aside at least an hour each day to read to my son (3 y/o). At this point, I consider myself a connoisseur of children's books and this is one of the best. Santa Clause put this under the tree. Since then, we've read it perpetually and he loves it.<br /><br />First, this book taught him the months of the year.<br /><br />Second, it's a pleasure to read. Well suited to 1.5 y/o old to 4+.<br /><br />Very few children's books are worth owning. Most should be borrowed from the library. This book, however, deserves a permanent spot on your shelf. Sendak's best.


In [14]:
import re

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('stopwords')

stop = set(stopwords.words('english')) # set of stopwords
sno = nltk.stem.SnowballStemmer('english') # initializing the Snowball Stemmer

def cleanhtml(sentence): # replace every HTML tag in the sentence with a space
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', sentence)
    return cleantext

def cleanpunc(sentence): # strip punctuation and special characters from the text
    cleaned = re.sub(r'[?!\'"#|]', r'', sentence)  # these characters are deleted outright
    cleaned = re.sub(r'[.,)(/]', r' ', cleaned)    # these characters are replaced by a space
    return cleaned

print(stop)

print("----------------------------------------")

print(sno.stem('tasty'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\venne\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
{'ain', 'been', 'shouldn', "mightn't", 'in', 'such', 'isn', 'did', "shan't", 'didn', 'does', 'itself', 'being', 'weren', 'they', 'is', 'be', 'y', "it's", 'mightn', "aren't", 'his', 'are', 'as', 'once', 'where', 'themselves', 'then', 'll', 'd', 'few', 'ours', "shouldn't", 'just', 'himself', 'her', 'most', "wasn't", 'their', 'don', 'too', 'both', "mustn't", 'am', 'any', 'there', "couldn't", 'which', 'during', 'now', 'when', 'wasn', 'some', 'do', 'it', 'if', 'we', "isn't", 've', 'over', 'were', "should've", 'that', 'again', 'whom', 'shan', 'myself', 'and', 'hers', 'yourself', 'herself', 'doing', 'couldn', 'because', 'hasn', 'can', 'she', 'them', 'up', 'our', 'here', 'o', 'this', 'of', 'your', "you'd", 'what', 'other', 'further', 'mustn', 'all', 'i', 's', 'ma', 'me', "doesn't", 'an', 'than', 'you', 'has', 'for', 'between', 't', 'him

# Bi-grams & n-grams

<b>Motivation</b>

Now that we can collect the words that describe positive and negative reviews, let us analyze them.

We begin analysis by getting the frequency distribution of the words.

In [15]:
final_string = []
all_positive_words = []
all_negative_words = []
# Clean every review and collect its stemmed words by sentiment
for sentence, score in zip(final['Text'].values, final['Score'].values):
    filtered_sentence = []
    sentence = cleanhtml(sentence)
    for w in sentence.split():
        for cleaned_words in cleanpunc(w).split():
            # Keep alphabetic words longer than 2 characters that are not stop-words
            if cleaned_words.isalpha() and len(cleaned_words) > 2:
                if cleaned_words.lower() not in stop:
                    # Stem the word and store it as UTF-8 bytes
                    s = sno.stem(cleaned_words.lower()).encode('utf8')
                    filtered_sentence.append(s)
                    if score == 'positive':
                        all_positive_words.append(s)
                    if score == 'negative':
                        all_negative_words.append(s)
    # Join the cleaned words back into a single bytes string per review
    final_string.append(b" ".join(filtered_sentence))

In [16]:
# Adding a column of CleanedText
final['CleanedText'] = final_string

In [17]:
final.head(3)

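# Save the cleaned data to a new SQLite database
conn = sqlite3.connect('final.sqlite')
conn.text_factory = str
# The `flavor` argument has been removed from DataFrame.to_sql, so it is not passed here
final.to_sql('Reviews', conn, schema = None, if_exists = 'replace')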

In [18]:
freq_dist_positive = nltk.FreqDist(all_positive_words)
freq_dist_negative = nltk.FreqDist(all_negative_words)

print("Most common Positive words used: ", freq_dist_positive.most_common(20))
print("Most common Negative words used: ", freq_dist_negative.most_common(20))

Most common Positive words used:  [(b'like', 139429), (b'tast', 129047), (b'good', 112766), (b'flavor', 109624), (b'love', 107357), (b'use', 103888), (b'great', 103870), (b'one', 96726), (b'product', 91033), (b'tri', 86791), (b'tea', 83888), (b'coffe', 78814), (b'make', 75107), (b'get', 72125), (b'food', 64802), (b'would', 55568), (b'time', 55264), (b'buy', 54198), (b'realli', 52715), (b'eat', 52004)]
Most common Negative words used:  [(b'tast', 34585), (b'like', 32330), (b'product', 28218), (b'one', 20569), (b'flavor', 19575), (b'would', 17972), (b'tri', 17753), (b'use', 15302), (b'good', 15041), (b'coffe', 14716), (b'get', 13786), (b'buy', 13752), (b'order', 12871), (b'food', 12754), (b'dont', 11877), (b'tea', 11665), (b'even', 11085), (b'box', 10844), (b'amazon', 10073), (b'make', 9840)]


In [19]:
count_vect = CountVectorizer(ngram_range=(1,2))
final_bigram_counts = count_vect.fit_transform(final['Text'].values)

In [20]:
final_bigram_counts.get_shape()

(364171, 2910192)

# TF-IDF

In [20]:
tf_idf_vect = TfidfVectorizer(ngram_range = (1, 2))
final_tf_idf = tf_idf_vect.fit_transform(final['Text'].values)

In [21]:
final_tf_idf.get_shape()

(364171, 2910192)

In [22]:
# Note: scikit-learn 1.0+ offers get_feature_names_out(); get_feature_names() was removed in 1.2
features = tf_idf_vect.get_feature_names()
len(features)

2910192

In [23]:
features[100000:100010]

['ales until',
 'ales ve',
 'ales would',
 'ales you',
 'alessandra',
 'alessandra ambrosia',
 'alessi',
 'alessi added',
 'alessi also',
 'alessi and']

In [24]:
print(final_tf_idf[3,:].toarray()[0])

[0. 0. 0. ... 0. 0. 0.]


In [25]:
def top_tfidf_features(row, features, top_n = 25):
    '''Get the top n tfidf values in row and return them with their corresponding feature names'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_features = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_features)
    df.columns = ['feature', 'tfidf']
    return df

top_tfidf = top_tfidf_features(final_tf_idf[1,:].toarray()[0], features, 25)

In [26]:
top_tfidf

| | feature | tfidf |
|---|---|---|
| 0 | sendak books | 0.173437 |
| 1 | rosie movie | 0.173437 |
| 2 | paperbacks seem | 0.173437 |
| 3 | cover version | 0.173437 |
| 4 | these sendak | 0.173437 |
| 5 | the paperbacks | 0.173437 |
| 6 | pages open | 0.173437 |
| 7 | really rosie | 0.168074 |
| 8 | incorporates them | 0.168074 |
| 9 | paperbacks | 0.168074 |


# Word2Vec

In [1]:
!pip install gensim





In [29]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

model = KeyedVectors.load_word2vec_format('GoogleNews-Vectors-negative300.bin.gz', binary = True)

In [30]:
model['computer']

array([ 1.07421875e-01, -2.01171875e-01,  1.23046875e-01,  2.11914062e-01,
       -9.13085938e-02,  2.16796875e-01, -1.31835938e-01,  8.30078125e-02,
        2.02148438e-01,  4.78515625e-02,  3.66210938e-02, -2.45361328e-02,
        2.39257812e-02, -1.60156250e-01, -2.61230469e-02,  9.71679688e-02,
       -6.34765625e-02,  1.84570312e-01,  1.70898438e-01, -1.63085938e-01,
       -1.09375000e-01,  1.49414062e-01, -4.65393066e-04,  9.61914062e-02,
        1.68945312e-01,  2.60925293e-03,  8.93554688e-02,  6.49414062e-02,
        3.56445312e-02, -6.93359375e-02, -1.46484375e-01, -1.21093750e-01,
       -2.27539062e-01,  2.45361328e-02, -1.24511719e-01, -3.18359375e-01,
       -2.20703125e-01,  1.30859375e-01,  3.66210938e-02, -3.63769531e-02,
       -1.13281250e-01,  1.95312500e-01,  9.76562500e-02,  1.26953125e-01,
        6.59179688e-02,  6.93359375e-02,  1.02539062e-02,  1.75781250e-01,
       -1.68945312e-01,  1.21307373e-03, -2.98828125e-01, -1.15234375e-01,
        5.66406250e-02, -

In [31]:
model.similarity('woman', 'man')

0.76640123

In [32]:
model.most_similar('woman')

[('man', 0.7664012312889099),
 ('girl', 0.7494640946388245),
 ('teenage_girl', 0.7336829900741577),
 ('teenager', 0.631708562374115),
 ('lady', 0.6288785934448242),
 ('teenaged_girl', 0.6141784191131592),
 ('mother', 0.607630729675293),
 ('policewoman', 0.6069462299346924),
 ('boy', 0.5975908041000366),
 ('Woman', 0.5770983099937439)]

In [None]:
model.most_similar('tasty')

[('delicious', 0.8730390071868896),
 ('scrumptious', 0.8007042407989502),
 ('yummy', 0.7856923937797546),
 ('flavorful', 0.7420163154602051),
 ('delectable', 0.7385421991348267),
 ('juicy_flavorful', 0.7114803791046143),
 ('appetizing', 0.7017217874526978),
 ('crunchy_salty', 0.7012301087379456),
 ('flavourful', 0.6912213563919067),
 ('flavoursome', 0.6857703328132629)]

In [None]:
import gensim

# Build a list of tokenised reviews (lowercased, alphabetic words only) to train Word2Vec on
list_of_sentences = []
for sentence in final['Text'].values:
    filtered_sentence = []
    sentence = cleanhtml(sentence)
    for w in sentence.split():
        for cleaned_words in cleanpunc(w).split():
            if cleaned_words.isalpha():
                filtered_sentence.append(cleaned_words.lower())
    list_of_sentences.append(filtered_sentence)

In [None]:
print(final['Text'].values[0])
print('----------------------------')
print(list_of_sentences[0])

In [None]:
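# Train 50-dimensional vectors, ignoring words that occur fewer than 5 times
# (gensim 4.x renamed the `size` argument to `vector_size`)
w2v_model = gensim.models.Word2Vec(list_of_sentences, min_count = 5, size = 50, workers = 4)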

In [42]:
# (gensim 4.x replaced `wv.vocab` with `wv.key_to_index`)
words = list(w2v_model.wv.vocab)
print(len(words))

33783


In [28]:
w2v_model.wv.most_similar('tasty')

[('tastey', 0.8978191018104553),
 ('yummy', 0.8643166422843933),
 ('satisfying', 0.8427529335021973),
 ('filling', 0.8251222372055054),
 ('delicious', 0.8162357211112976),
 ('flavorful', 0.7898172736167908),
 ('tasteful', 0.7695887684822083),
 ('versatile', 0.7648526430130005),
 ('addicting', 0.7619239091873169),
 ('delectable', 0.7548799514770508)]

In [29]:
w2v_model.wv.most_similar('like')

[('resemble', 0.7227350473403931),
 ('mean', 0.6622120141983032),
 ('dislike', 0.6540369987487793),
 ('prefer', 0.6520158052444458),
 ('think', 0.6218041181564331),
 ('fake', 0.6050191521644592),
 ('overpower', 0.5920742750167847),
 ('enjoy', 0.5799568891525269),
 ('miss', 0.5780380964279175),
 ('alright', 0.5727273225784302)]

In [30]:
count_vect_feature = count_vect.get_feature_names()
like_index = count_vect_feature.index('like')  # position of 'like' in the vocabulary
print(count_vect_feature[like_index])

like
