CSCI544 Assignment 1: Sentiment Analysis on Amazon Reviews

@author: Ryan Luu 

@date: 9/8/2022


The purpose of this assignment is to experiment with text representations and how to use them 
for sentiment analysis. It takes a dataset of Amazon reviews (~ 16 million reviews) and tries to train some 
classifiers to predict a product's rating (1-5 stars) based on the written reviews. 

E.g. 4-5 star reviews will typically have words like: Great! Good! Happy! 

     1-2 star reviews will typically have words like: Bad! Yuck! Gross! 

The code flow is as follows: 
  -> Reading the file into pandas
  -> Preprocessing the dataset (removing HTML, punctuation, numbers, etc.)
  -> Extracting the TF-IDF features 
  -> Training Perceptron, SVM, Logistic Regression, and Multinomial Naive Bayes classifiers 

In [17]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import numpy as np
import nltk

from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC
import contractions


import ssl
#nltk.download('wordnet')
import re
from bs4 import BeautifulSoup
 

In [18]:
! pip3 install bs4 # in case you don't have it installed

# Dataset: https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Jewelry_v1_00.tsv.gz



## Helper Functions

These are helper functions that I implemented to help with preprocessing the dataset and printing out the average 
review length.

In [3]:
# Helper function to convert Treebank POS tags to WordNet POS. No need for ADJ_SAT since we're going from POS to WordNet
def word_net_pos_converter(pos_tag):
    if pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    else:
        return wordnet.NOUN     # word_net defaults to Noun otherwise

# Helper function to print out average length of review before (flag=False) and after (flag=True) processing
def get_avg_review_len(col, step, flag, df):
    mean_len = df[col].apply(len).mean()
    when = "after" if flag else "before"
    print("This is the average length of reviews " + when + " " + step + " " + str(mean_len))

## Read Data

In [4]:
data = pd.read_table('amazon_reviews_us_Jewelry_v1_00.tsv', usecols=['star_rating', 'review_body'], low_memory=False)

## Keep Reviews and Ratings

In [5]:
data = data.dropna()                                  # Gets rid of NaN's in Table
data = data.reset_index(drop=True)                    # Resets the index
data['star_rating'] = data['star_rating'].astype(int) # Cast ratings to int

## We select 20000 reviews randomly from each rating class.



In [6]:
# Shuffles the data, then keeps the first 20000 reviews of each star rating (1-5). Total is a balanced 100,000 reviews.
data = data.sample(frac=1).reset_index(drop=True)
# pd.concat is used here because DataFrame.append was removed in pandas 2.0
sampled_amazon_df = pd.concat([data[data['star_rating'] == rating][:20000] for rating in range(1, 6)])
sampled_amazon_df = sampled_amazon_df.reset_index(drop=True)
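
The balanced sampling step above can be sketched in plain Python, which may make the logic easier to follow: shuffle once, then keep at most a fixed number of rows per rating. The function name `balanced_sample`, the `seed` parameter, and the toy rows are assumptions made for illustration and are not part of the notebook. Passing a seed makes the draw repeatable; the pandas call in the cell could do the same through the `random_state` argument of `sample`.

```python
import random
from collections import defaultdict


def balanced_sample(rows, per_class, seed=0):
    """Return up to `per_class` rows for each rating.

    `rows` is a list of (rating, text) tuples. This is a hypothetical helper,
    not the notebook's pandas code.
    """
    rng = random.Random(seed)
    shuffled = rows[:]            # copy so the caller's list is not reordered
    rng.shuffle(shuffled)

    picked = defaultdict(list)
    for rating, text in shuffled:
        if len(picked[rating]) < per_class:
            picked[rating].append((rating, text))

    # Concatenate the per-rating groups in rating order
    return [row for rating in sorted(picked) for row in picked[rating]]


# Toy data: 10 reviews for each of the 5 ratings
rows = [(rating, f"review {i}") for i, rating in enumerate([1, 2, 3, 4, 5] * 10)]
sample = balanced_sample(rows, per_class=3)
print(len(sample))
```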

# Data Cleaning



# Pre-processing

For Pre-Processing we have to clean the text before we extract its features. 
The finished preprocessed text body is stored in our df under review_body column.

The cleaning steps taken in this notebook were as follows: 

- Lowercasing all the words: "I'm a Cat" -> "i'm a cat"
- Fix Contractions: "i'm a cat" -> "i am a cat"
- Remove HTML, HTML Tags, and URLs: "go check out my soundcloud http://www.hoTtrash.html" -> "go check out my soundcloud"
- Remove all non-alphabetical characters: " very123 nice" -> "very nice"
- Remove extra spaces: "wowwww         wow" -> "wowwww wow"
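
The cleaning steps listed above can be sketched on a single string with only the standard library. This is a simplified illustration, not the notebook's code: `CONTRACTIONS` is a tiny hypothetical map standing in for the `contractions` package, and the URL pattern is a shorter assumption than the regex used in the cell.

```python
import re

# Tiny hypothetical contraction map; the notebook uses the `contractions` package instead.
CONTRACTIONS = {"i'm": "i am", "don't": "do not", "it's": "it is"}


def clean_text(text):
    text = text.lower()                                                # lowercase
    text = " ".join(CONTRACTIONS.get(w, w) for w in text.split())      # expand contractions
    text = re.sub(r"<[^<]+?>", "", text)                               # strip HTML tags
    text = re.sub(r"https?://\S+", "", text)                           # strip URLs (simplified pattern)
    text = re.sub(r"[^a-z]", " ", text)                                # non-alphabetical chars -> space
    text = re.sub(r"\s+", " ", text).strip()                           # collapse extra spaces
    return text


example = clean_text("I'm a <b>Cat</b>, see https://example.com   very123 nice")
print(example)  # i am a cat see very nice
```

The order matters: contractions are expanded before punctuation is removed, otherwise the apostrophe in "i'm" would already be gone and the lookup would miss.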

In [7]:
get_avg_review_len('review_body', 'data cleaning:', False, sampled_amazon_df)

# Lower Case
sampled_amazon_df['review_body'] = sampled_amazon_df['review_body'].str.lower()

# Fixes Contractions
sampled_amazon_df['fix_contractions'] = sampled_amazon_df['review_body'].apply(lambda l: [contractions.fix(word) for word in l.split()])
sampled_amazon_df['review_body'] = [' '.join(map(str, k)) for k in sampled_amazon_df['fix_contractions']]
del sampled_amazon_df['fix_contractions']

# Removes HTML, HTML tags, and URLs.
sampled_amazon_df["review_body"] = sampled_amazon_df["review_body"].str.replace('<[^<]+?>', '', regex=True).str.strip()
sampled_amazon_df['review_body'] = sampled_amazon_df['review_body'].replace(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', regex=True)

# Removes all non-alphabetical chars
sampled_amazon_df['review_body'] = sampled_amazon_df['review_body'].str.replace('[^a-zA-Z]', ' ', regex=True)

# removes extra spaces
sampled_amazon_df['review_body'] = sampled_amazon_df['review_body'].replace(r'\s+', ' ', regex=True)

get_avg_review_len('review_body', 'data cleaning:', True, sampled_amazon_df)

This is the average length of reviews before data cleaning: 189.62446
This is the average length of reviews after data cleaning: 183.56328


## remove the stop words 

Tokenizes each review with a regex tokenizer, then removes all the stop words using NLTK's list of English stopwords.

- Removing Stopwords: "I am a cat" -> "cat" 
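
A minimal sketch of the same idea without NLTK, assuming a tiny hypothetical stop list (`STOP_WORDS` below is only a small stand-in for NLTK's English list) and `re.findall` with the `\w+` pattern in place of `RegexpTokenizer`:

```python
import re

# Hypothetical, heavily shortened stop list; NLTK's English list is much longer.
STOP_WORDS = {"i", "am", "a", "the", "is", "it", "and"}


def remove_stop_words(text):
    tokens = re.findall(r"\w+", text)                  # split into word tokens
    return [t for t in tokens if t not in STOP_WORDS]  # keep only non-stop words


kept = remove_stop_words("i am a cat")
print(kept)  # ['cat']
```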

In [8]:
# Prints out average length (in characters) before removing stop words
get_avg_review_len('review_body', 'removing stop words and lemmatization:', False, sampled_amazon_df) 

stop_words = set(stopwords.words('english'))
regexp = RegexpTokenizer(r'\w+')   # raw string avoids an invalid-escape warning
wnl = WordNetLemmatizer()

# Creates column of NLTK tokens
sampled_amazon_df["nltk_tokens"] = sampled_amazon_df["review_body"].apply(regexp.tokenize)

# Removes the stop words
sampled_amazon_df['nltk_tokens'] = sampled_amazon_df['nltk_tokens'].apply(lambda x: [word for word in x if word not in stop_words])

# Prints out average length after removing stop words. Note: nltk_tokens holds lists, so this
# figure counts tokens per review, while the "before" figure counts characters.
get_avg_review_len('nltk_tokens', 'removing stop words and lemmatization:', True, sampled_amazon_df)

This is the average length of reviews before removing stop words and lemmatization: 183.56328
This is the average length of reviews after removing stop words and lemmatization: 16.74663


## perform lemmatization  

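We lemmatize all the words after we remove the stop words. Each token is first tagged with its part of speech so the lemmatizer knows whether to treat it as a verb, adjective, adverb, or noun.

- Lemmatizing words: "am, are, is" -> "be" 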
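
To show why the part-of-speech tag matters, here is a small self-contained sketch. The mapping mirrors the tag-prefix logic of the helper function, using the single-letter codes that WordNet uses for verb, adjective, adverb, and noun. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical stand-ins for `WordNetLemmatizer`, included only so the example runs without downloading WordNet.

```python
# First letter of a Treebank tag -> WordNet POS letter
WORDNET_POS = {"V": "v", "J": "a", "R": "r", "N": "n"}


def treebank_to_wordnet(tag):
    # Anything that is not a verb/adjective/adverb falls back to noun
    return WORDNET_POS.get(tag[:1], "n")


# Hypothetical lookup table standing in for a real lemmatizer
TOY_LEMMAS = {
    ("is", "v"): "be",
    ("are", "v"): "be",
    ("rings", "n"): "ring",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos):
    # Unknown (word, pos) pairs are returned unchanged
    return TOY_LEMMAS.get((word, pos), word)


tagged = [("rings", "NNS"), ("are", "VBP"), ("better", "JJR")]
lemmas = [toy_lemmatize(word, treebank_to_wordnet(tag)) for word, tag in tagged]
print(lemmas)  # ['ring', 'be', 'good']
```

With the wrong tag the lookup misses: `toy_lemmatize("are", "n")` returns "are" unchanged, which is the kind of miss the POS conversion is meant to avoid.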

In [9]:
# Gets Parts of Speech for each word
sampled_amazon_df['part_of_speech_tags'] = sampled_amazon_df['nltk_tokens'].apply(nltk.tag.pos_tag)

# Creates column of Wordnet POS tokens
sampled_amazon_df['wordnet_part-of_speech_tags'] = sampled_amazon_df['part_of_speech_tags'].apply(lambda x: [(word,word_net_pos_converter(pos_tag)) for (word,pos_tag) in x])

# Lemmatizes the words with NLTK wordnetlemmatizer
sampled_amazon_df['lemmatized_reviews'] = sampled_amazon_df['wordnet_part-of_speech_tags'].apply(lambda x: " ".join([wnl.lemmatize(word,pos_tag) for word,pos_tag in x]))


# TF-IDF Feature Extraction

TF-IDF (term frequency–inverse document frequency) is a numerical statistic intended to reflect how important a word is to a document within a collection: a word scores high when it is frequent in one document but rare across the others. 
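
As a rough illustration of the statistic, the sketch below computes TF-IDF weights by hand for three toy documents. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is how scikit-learn documents the default behaviour of `TfidfVectorizer`. Tokenization here is a plain whitespace split, so the numbers are illustrative and not guaranteed to match the library on real text.

```python
import math
from collections import Counter


def tfidf(docs):
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)

    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)                                   # raw term counts
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})  # L2-normalize
    return idf, vectors


docs = ["great ring", "bad ring", "great great gift"]
idf, vectors = tfidf(docs)
print(sorted(idf, key=idf.get))  # terms ordered from most common to rarest
```

"ring" appears in two of the three documents, so it gets a lower idf than "gift", which appears in only one.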

We extract it using scikit-learn's TfidfVectorizer. We then split our dataset of 100,000 reviews into an 80% training set and a 20% testing set (80,000 random reviews across 1-5 stars for training, 20,000 for testing). The vectorizer is fit on the training set only and then applied to the test set. 
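
The split-then-fit order can be sketched with the standard library. The helper names (`split_80_20`, `fit_vocabulary`, `count_vector`) are hypothetical, and the shuffle will not reproduce scikit-learn's `train_test_split` even with the same seed. The point is only that the vocabulary is learned from the training documents, so words seen only at test time are ignored.

```python
import random


def split_80_20(items, test_size=0.20, seed=727):
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]      # (train, test)


def fit_vocabulary(train_docs):
    # Learn the term -> column index mapping from training documents only
    terms = sorted({term for doc in train_docs for term in doc.split()})
    return {term: idx for idx, term in enumerate(terms)}


def count_vector(doc, vocab):
    counts = [0] * len(vocab)
    for term in doc.split():
        if term in vocab:                            # unseen terms are dropped
            counts[vocab[term]] += 1
    return counts


docs = ["great ring", "bad ring", "nice gift", "poor quality", "great gift"]
train_docs, test_docs = split_80_20(docs)
vocab = fit_vocabulary(train_docs)
print(len(train_docs), len(test_docs))  # 4 1
```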

In [10]:
tf_idf_vectorizer = TfidfVectorizer()
X_train, X_test, Y_train, Y_test = train_test_split(sampled_amazon_df['lemmatized_reviews'], sampled_amazon_df['star_rating'], test_size = 0.20, random_state = 727)
#print("Train: ", X_train.shape, Y_train.shape,"Test: ", (X_test.shape, Y_test.shape))
tf_x_train = tf_idf_vectorizer.fit_transform(X_train)
tf_x_test = tf_idf_vectorizer.transform(X_test)

# Perceptron

In [39]:
p_classifier = Perceptron(tol=1e-3)
p_classifier.fit(tf_x_train, Y_train)
y_test_pred = p_classifier.predict(tf_x_test)
report = classification_report(Y_test, y_test_pred, output_dict=True)
p_df = pd.DataFrame(report).transpose()
print(p_df)  # Originally print(svm_df), so the output saved with this cell was the SVM table shown in the next section, not the Perceptron report. Re-run the cell to get the Perceptron metrics.




# SVM

In [16]:
svm_classifier = LinearSVC()
svm_classifier.fit(tf_x_train, Y_train)
y_test_pred = svm_classifier.predict(tf_x_test)
report = classification_report(Y_test, y_test_pred, output_dict=True)
svm_df = pd.DataFrame(report).transpose()
print(svm_df)

              precision    recall  f1-score      support
1              0.538892  0.650802  0.589584   3992.00000
2              0.379596  0.309750  0.341134   4000.00000
3              0.393167  0.337391  0.363150   4025.00000
4              0.447660  0.406734  0.426217   4069.00000
5              0.592989  0.721768  0.651072   3914.00000
accuracy       0.483750  0.483750  0.483750      0.48375
macro avg      0.470461  0.485289  0.474231  20000.00000
weighted avg   0.469731  0.483750  0.473120  20000.00000


# Logistic Regression

In [13]:
lr_classifier = LogisticRegression(max_iter=1000, solver='saga')
lr_classifier.fit(tf_x_train, Y_train)
y_test_pred = lr_classifier.predict(tf_x_test)
report = classification_report(Y_test, y_test_pred, output_dict=True)
lr_df = pd.DataFrame(report).transpose()
print(lr_df)

              precision    recall  f1-score     support
1              0.574497  0.643287  0.606949   3992.0000
2              0.403001  0.369250  0.385388   4000.0000
3              0.415688  0.387081  0.400875   4025.0000
4              0.474213  0.440649  0.456815   4069.0000
5              0.634686  0.703117  0.667152   3914.0000
accuracy       0.507400  0.507400  0.507400      0.5074
macro avg      0.500417  0.508677  0.503436  20000.0000
weighted avg   0.499614  0.507400  0.502401  20000.0000
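
As a sanity check on how to read these tables, the sketch below recomputes the F1 scores from the precision and recall values printed in the Logistic Regression report above, using F1 = 2PR / (P + R), and then takes the unweighted mean over the five classes to get the macro average. The dictionaries are copied from the printed table; the variable names are only illustrative.

```python
# Per-class precision and recall copied from the Logistic Regression report
precision = {1: 0.574497, 2: 0.403001, 3: 0.415688, 4: 0.474213, 5: 0.634686}
recall = {1: 0.643287, 2: 0.369250, 3: 0.387081, 4: 0.440649, 5: 0.703117}

# F1 is the harmonic mean of precision and recall
f1 = {
    label: 2 * precision[label] * recall[label] / (precision[label] + recall[label])
    for label in precision
}

# Macro average: plain mean over classes, ignoring class sizes
macro_f1 = sum(f1.values()) / len(f1)

for label in sorted(f1):
    print(label, round(f1[label], 4))
print("macro avg f1", round(macro_f1, 4))
```

The recomputed values should agree with the f1-score column and the macro avg row up to rounding. The 2-star and 3-star classes have the lowest F1, which fits the intuition that middle ratings are harder to tell apart than the extremes.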


# Naive Bayes

In [14]:
mnb_classifier = MultinomialNB()
mnb_classifier.fit(tf_x_train, Y_train)
y_test_pred = mnb_classifier.predict(tf_x_test)

report = classification_report(Y_test, y_test_pred, output_dict=True)
mnb_df = pd.DataFrame(report).transpose()
print(mnb_df)

              precision    recall  f1-score      support
1              0.578140  0.604208  0.590887   3992.00000
2              0.394772  0.370000  0.381985   4000.00000
3              0.398097  0.395031  0.396558   4025.00000
4              0.445312  0.420251  0.432419   4069.00000
5              0.624499  0.677312  0.649835   3914.00000
accuracy       0.492150  0.492150  0.492150      0.49215
macro avg      0.488164  0.493360  0.490337  20000.00000
weighted avg   0.487282  0.492150  0.489294  20000.00000
