# <p style="text-align: center;">Amazon Reviews - Sentiment Analysis</p>
### <p style="text-align: center;">University of Denver</p>
### <p style="text-align: center;">Alex Liddle</p>

In [1]:
import nltk
import string
import re
import sklearn
import pandas as pd
from tqdm import tqdm
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from scipy import stats
#nltk.download('stopwords') #<---uncomment if you haven't downloaded the stopwords library
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

### Import the dataset

In [2]:
df_reviews_raw = pd.read_csv('train_40k.csv')
df_reviews_raw.head()

Unnamed: 0,productId,Title,userId,Helpfulness,Score,Time,Text,Cat1,Cat2,Cat3
0,B000E46LYG,Golden Valley Natural Buffalo Jerky,A3MQDNGHDJU4MK,0/0,3.0,-1,The description and photo on this product need...,grocery gourmet food,meat poultry,jerky
1,B000GRA6N8,Westing Game,unknown,0/0,5.0,860630400,This was a great book!!!! It is well thought t...,toys games,games,unknown
2,B000GRA6N8,Westing Game,unknown,0/0,5.0,883008000,"I am a first year teacher, teaching 5th grade....",toys games,games,unknown
3,B000GRA6N8,Westing Game,unknown,0/0,5.0,897696000,I got the book at my bookfair at school lookin...,toys games,games,unknown
4,B00000DMDQ,I SPY A is For Jigsaw Puzzle 63pc,unknown,2/4,5.0,911865600,Hi! I'm Martine Redman and I created this puzz...,toys games,puzzles,jigsaw puzzles


For sentiment analysis we only need the review text (our feature) and the score (our label). We'll drop reviews with a score of 3.0 and recode scores of 1.0 and 2.0 as 0 ('bad') and scores of 4.0 and 5.0 as 1 ('good'). Lastly, we don't want to consider very short reviews, so we'll filter out reviews with fewer than 60 words (an arbitrary cutoff).

### Clean the data

In [3]:
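df_reviews = df_reviews_raw[['Text', 'Score']]
df_reviews = df_reviews[(df_reviews.Score < 3.0) | (df_reviews.Score > 3.0)]
# .copy() gives us an independent frame, which avoids SettingWithCopyWarning below
df_reviews = df_reviews[df_reviews.Text.str.split().str.len().ge(60)].copy()
# Recode only the Score column: 1-2 stars -> 0 (bad), 4-5 stars -> 1 (good)
df_reviews['Score'] = df_reviews['Score'].replace({1.0: 0.0, 2.0: 0.0, 4.0: 1.0, 5.0: 1.0})
df_reviews.head()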

Unnamed: 0,Text,Score
2,"I am a first year teacher, teaching 5th grade....",1.0
3,I got the book at my bookfair at school lookin...,1.0
4,Hi! I'm Martine Redman and I created this puzz...,1.0
6,The real joy of this movie doesn't lie in its ...,1.0
13,"Parents, don't try to play this game with your...",1.0
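As a quick sanity check of the filtering and recoding rules, here is a minimal sketch on a made-up four-row frame. The `toy` data, the `Label` column, and the reduced `MIN_WORDS` cutoff are assumptions for illustration only; the notebook itself uses a cutoff of 60 words and overwrites `Score` in place.

```python
import pandas as pd

# Toy data standing in for the review table (values are made up for illustration).
toy = pd.DataFrame({
    "Text": ["loved it " * 4, "it was fine " * 4, "broke in a day " * 4, "too short"],
    "Score": [5.0, 3.0, 1.0, 4.0],
})

MIN_WORDS = 5  # the notebook uses 60; a small value keeps the toy example readable

# Drop neutral reviews, then drop reviews with fewer than MIN_WORDS words.
kept = toy[toy["Score"] != 3.0]
kept = kept[kept["Text"].str.split().str.len() >= MIN_WORDS].copy()

# Derive the binary label from the score: 1-2 stars -> 0 (bad), 4-5 stars -> 1 (good).
kept["Label"] = (kept["Score"] > 3.0).astype(int)
print(kept[["Score", "Label"]])
```

Only the 5-star and 1-star rows survive: the 3-star row is neutral and the 4-star row is too short.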


### Examine the data

In [4]:
df_reviews.describe()

Unnamed: 0,Score
count,18936.0
mean,0.794149
std,0.404333
min,0.0
25%,1.0
50%,1.0
75%,1.0
max,1.0


We want an approximately equal number of good and bad reviews for training purposes, so we'll create an evenly distributed subset by sampling the full dataset.

In [5]:
# Draw 3000 reviews per class; random_state makes the sample reproducible
df_reviews_sampled = df_reviews.groupby('Score').sample(n=3000, random_state=42).reset_index(drop=True)
df_reviews_sampled.describe()

Unnamed: 0,Score
count,6000.0
mean,0.5
std,0.500042
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0
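The same balancing idea can be sketched on a small made-up frame, where the per-class sample size is taken from the smallest class instead of being hard-coded. The `toy` frame and the `n_per_class` name are assumptions for this example, not part of the notebook.

```python
import pandas as pd

# Toy imbalanced labels: 8 positive rows and 3 negative rows (made-up data).
toy = pd.DataFrame({
    "Text": [f"review {i}" for i in range(11)],
    "Score": [1.0] * 8 + [0.0] * 3,
})

# Draw the same number of rows from each class. n must not exceed the size
# of the smallest class unless replace=True is passed to sample().
n_per_class = int(toy["Score"].value_counts().min())
balanced = (
    toy.groupby("Score")
       .sample(n=n_per_class, random_state=0)
       .reset_index(drop=True)
)

print(balanced["Score"].value_counts())
```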


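Next we conduct some preprocessing on the text data (i.e., remove punctuation and stopwords and convert everything to lowercase). Note that the cell below operates on the full df_reviews frame (18,936 rows), not on the balanced df_reviews_sampled subset created above.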

### Text Preprocessing

In [6]:
print("Before Preprocessing:")
print(df_reviews.Text.head(1))

tqdm.pandas()
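# stopwords.words() with no argument loads every language; pass 'english' to restrict it.
# A set makes each membership test O(1) instead of scanning a long list.
stop = set(stopwords.words())

# Strip punctuation (pandas >= 2.0 needs regex=True for pattern matching), then lowercase
df_reviews.Text = df_reviews.Text.str.replace(r"[^\w\s]", "", regex=True).str.lower()
df_reviews.Text = df_reviews.Text.progress_apply(lambda x: ' '.join([item for item in x.split()
                                                               if item not in stop]))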

print("After Preprocessing:")
print(df_reviews.Text.head(1))

Before Preprocessing:
2    I am a first year teacher, teaching 5th grade....
Name: Text, dtype: object


100%|██████████| 18936/18936 [02:26<00:00, 129.67it/s]

After Preprocessing:
2    first year teacher teaching 5th grade special ...
Name: Text, dtype: object
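To make the preprocessing steps easy to test in isolation, they can be sketched as a plain function. This is a minimal sketch, not the notebook's exact code: `clean_text` and the tiny hand-picked `STOPWORDS` set are hypothetical stand-ins, chosen so the example runs without downloading the NLTK corpus.

```python
import re

# A tiny hand-picked stopword set, used here only so the example runs without
# the NLTK download; the notebook uses nltk.corpus.stopwords instead.
STOPWORDS = {"i", "am", "a", "the", "is", "this", "and", "it", "to", "of"}

_PUNCT = re.compile(r"[^\w\s]")  # anything that is not a word character or whitespace


def clean_text(text, stopwords=STOPWORDS):
    """Strip punctuation, lowercase, and drop stopwords."""
    text = _PUNCT.sub("", text).lower()
    return " ".join(tok for tok in text.split() if tok not in stopwords)


print(clean_text("I am a first year teacher, teaching 5th grade."))
# first year teacher teaching 5th grade
```

One consequence of stripping punctuation first: contractions such as "don't" become "dont", which may no longer match the corresponding entry in a stopword list.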





Now we are ready to split the dataset into a training and test set.

### Generate a training and test dataset

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    df_reviews.Text, df_reviews.Score, test_size=0.3, random_state=42)
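The split above is purely random. With imbalanced labels, one optional variation is to pass `stratify` so that both splits keep the same class ratio. The sketch below uses a made-up frame with an 80/20 class balance; it is not what the notebook ran, and the reported accuracies do not come from a stratified split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with roughly the imbalance seen above (made-up data).
toy = pd.DataFrame({
    "Text": [f"review {i}" for i in range(100)],
    "Score": [1.0] * 80 + [0.0] * 20,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["Text"], toy["Score"],
    test_size=0.3, random_state=42,
    stratify=toy["Score"],  # keep the class ratio the same in both splits
)

# Share of positive labels in each split.
print(y_tr.mean(), y_te.mean())
```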

Now we are ready to create a model.

### Model selection

We will use a Multinomial Naive Bayes algorithm since we are creating a model for text classification. We'll use sklearn's TfidfVectorizer to convert the text into the numeric feature vectors that MultinomialNB expects. Lastly, we'll use GridSearchCV with 5-fold cross-validation to tune the TfidfVectorizer hyperparameters (min_df, ngram_range, and norm); MultinomialNB is left at its default settings.

In [8]:
pipe = Pipeline([("tfidf", TfidfVectorizer(stop_words="english")),
                 ("nb", MultinomialNB())])
param_grid = [{"tfidf__min_df": [1, 10],
               "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
               "tfidf__norm": ['l1', 'l2']}]
grid = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tfidf',
                                        TfidfVectorizer(stop_words='english')),
                                       ('nb', MultinomialNB())]),
             param_grid=[{'tfidf__min_df': [1, 10],
                          'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
                          'tfidf__norm': ['l1', 'l2']}])
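The grid above only varies the vectorizer. If we also wanted to tune the classifier, the same `<step>__<param>` naming applies to the "nb" step. Here is a minimal sketch on a tiny made-up corpus; the `texts`, the `labels`, and the candidate `alpha` values are assumptions chosen for illustration, and `cv=2` is used only because the corpus is so small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus; 1 = good, 0 = bad.
texts = [
    "great product works well", "love it great value",
    "excellent quality love it", "works great highly recommend",
    "terrible product broke fast", "awful quality waste money",
    "broke after one day terrible", "waste of money do not buy",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# Pipeline step names prefix the parameter names: "<step>__<param>".
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 1.0],  # additive smoothing strength of MultinomialNB
}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=2)
grid.fit(texts, labels)

print(grid.best_params_)
print(grid.predict(["great value love it", "terrible waste of money"]))
```

On a corpus this small the selected parameters are not meaningful; the point is only the shape of the parameter grid.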

Now we'll reveal the best model hyperparameters.

In [9]:
grid.best_params_

{'tfidf__min_df': 10, 'tfidf__ngram_range': (1, 3), 'tfidf__norm': 'l2'}

Finally, we'll score the best model on training and test accuracy.

In [10]:
# training accuracy
grid.score(X_train,y_train)

0.843757072802716

In [11]:
# test accuracy
grid.score(X_test,y_test)

0.8186938919204365

The difference between the training accuracy (0.844) and the test accuracy (0.819) is small, about 2.5 percentage points, so overfitting does not appear to be much of an issue. The next step to improve accuracy would be to use a much larger dataset, but that is outside the scope of this mini project.
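Because about 79% of the reviews in df_reviews are positive (the mean score reported earlier is 0.794), it can help to compare accuracy against a majority-class baseline. The sketch below uses made-up labels with that class balance, not the notebook's actual test set, and shows what a classifier that always predicts "good" would score.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up labels with about the same class balance as df_reviews (~79% positive).
y_true = np.array([1] * 79 + [0] * 21)
X_dummy = np.zeros((len(y_true), 1))  # DummyClassifier ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

print(accuracy_score(y_true, y_pred))           # 0.79
print(balanced_accuracy_score(y_true, y_pred))  # 0.5
```

Balanced accuracy averages the recall of each class, so it stays at 0.5 for this baseline even though plain accuracy looks high.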