### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from timeit import timeit
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

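def remove_punctuation(text):
    # Replace every run of non-letter characters (punctuation, digits, whitespace)
    # with a single space.
    # An alternative that only deletes punctuation characters:
    #   translator = str.maketrans('', '', string.punctuation)
    #   return text.translate(translator)
    import re
    return re.sub("[^A-Za-z]+", " ", text)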

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from the reviews using the given function.  
b) Replace all missing (NaN) reviews with an empty string "".  
c) Drop all entries with rating = 3, as they have neutral sentiment.  
d) Set all positive ($\geq 4$) ratings to 1 and all negative ($\leq 2$) ratings to -1.

In [2]:
#a)
baby_df['review'] = baby_df['review'].apply(lambda x: remove_punctuation(x) if isinstance(x, str) else x)

#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

False
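The check above most likely prints `False` because the expected string was produced by deleting punctuation ("nonstop", "itThis"), while the active regex replaces each run of non-letters with a space. The following is a minimal sketch of the two behaviours on a made-up sentence; the helper names `strip_with_regex` and `strip_with_translate` are illustrative and not part of the notebook.

```python
import re
import string

def strip_with_regex(text):
    # Same idea as the active implementation: each run of non-letters becomes one space.
    return re.sub("[^A-Za-z]+", " ", text)

def strip_with_translate(text):
    # Alternative: delete punctuation characters only; digits and whitespace stay.
    return text.translate(str.maketrans("", "", string.punctuation))

sample = "cried non-stop... until I found it.This is great!"
regex_result = strip_with_regex(sample)
translate_result = strip_with_translate(sample)

print(repr(regex_result))      # words stay separated, digits would be dropped too
print(repr(translate_result))  # words around punctuation get glued together
```

The regex variant keeps words apart, which is usually friendlier to a bag-of-words model, at the cost of no longer matching the reference string in the short test.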

In [3]:
#b)
baby_df["review"] = baby_df["review"].fillna("")

#short test: NaN is never equal to itself, so True means this entry is no longer NaN
baby_df["review"][38] == baby_df["review"][38]

True
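The self-comparison in the short test relies on the fact that NaN never compares equal to itself. Here is a small sketch on a toy Series (the data is invented for illustration):

```python
import numpy as np
import pandas as pd

toy_reviews = pd.Series(["nice", np.nan, "bad"])

# A missing value fails the self-equality check.
nan_equals_itself = bool(toy_reviews[1] == toy_reviews[1])

# After fillna(""), the entry is an ordinary empty string and the check passes.
filled = toy_reviews.fillna("")
filled_equals_itself = bool(filled[1] == filled[1])

print(nan_equals_itself, filled_equals_itself)
```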

In [4]:
#c)
baby_df = baby_df.drop(baby_df[baby_df["rating"] == 3].index)

#short test:
sum(baby_df["rating"] == 3)

0

In [5]:
#d) 
baby_df["rating_norm"] = baby_df["rating"].apply(lambda x: 1 if x >= 4 else -1 if x <= 2 else x)

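#short test:
# This inspects the raw "rating" column, so it counts rows whose original rating is not 1;
# applying the same check to baby_df["rating_norm"] would test the normalization itself.
sum(baby_df["rating"]**2 != 1)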

151569
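A minimal sketch of the mapping from step d) on toy ratings, with the squared-value check applied to the normalized column. The frame `toy` is made up; it assumes neutral ratings have already been dropped, as in step c).

```python
import pandas as pd

toy = pd.DataFrame({"rating": [1, 2, 4, 5, 5]})

# Ratings of 4 or 5 become +1, ratings of 1 or 2 become -1.
toy["rating_norm"] = toy["rating"].apply(lambda r: 1 if r >= 4 else -1)

# Every normalized value should be +1 or -1, so its square is always 1.
mismatches = int((toy["rating_norm"] ** 2 != 1).sum())
print(toy)
print("rows that are not +1/-1:", mismatches)
```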

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.
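To make the representation concrete, here is a hand-rolled sketch of the same idea using only the standard library. It is a simplification: the helpers `build_vocabulary` and `to_count_vector` are hypothetical, and unlike CountVectorizer they split on whitespace only and keep one-letter words.

```python
from collections import Counter

def build_vocabulary(sentences):
    # Sorted list of distinct lowercase words; its length is the vector dimension n.
    return sorted({word for s in sentences for word in s.lower().split()})

def to_count_vector(sentence, vocabulary):
    # Count each word of the sentence, then read the counts off in vocabulary order.
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

sentences = ["We like apples", "We like like apples and oranges"]
vocabulary = build_vocabulary(sentences)
vectors = [to_count_vector(s, vocabulary) for s in sentences]

print(vocabulary)
print(vectors)
```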

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())



['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [7]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


A few facts are worth noting. First, CountVectorizer does not take word order into account. Second, by default it ignores one-letter words (this can be changed during initialization). Finally, when transforming test data, CountVectorizer ignores words that are not in its dictionary.

## Exercise 2 
a) Split the dataset into training and test sets.  
b) Transform the reviews into vectors using CountVectorizer.

In [8]:
#a)
X_train_full, X_test_full, y_train, y_test = train_test_split(baby_df, baby_df["rating_norm"], test_size=0.2, random_state=42)

# Keep re-indexed copies of the full frames (the test frame is reused in Exercise 4c).
X_train_full_df = X_train_full.reset_index()
X_test_full_df = X_test_full.reset_index()
X_train = X_train_full["review"]
X_test = X_test_full["review"]
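As a quick illustration of what `test_size` and `random_state` do, here is a sketch on ten made-up reviews: 20% of the rows go to the test set, and repeating the call with the same seed reproduces the same split.

```python
from sklearn.model_selection import train_test_split

toy_reviews = [f"review {i}" for i in range(10)]
toy_labels = [1, -1] * 5

first = train_test_split(toy_reviews, toy_labels, test_size=0.2, random_state=42)
second = train_test_split(toy_reviews, toy_labels, test_size=0.2, random_state=42)
train_reviews, test_reviews, train_labels, test_labels = first

print(len(train_reviews), len(test_reviews))
print("same split with same seed:", first == second)
```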

In [9]:
#b)
vectorizer = CountVectorizer(min_df=0.001)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
dict_size1 = len(vectorizer.get_feature_names_out())
print("Word dictionary size:", dict_size1)

Word dictionary size: 3475


## Exercise 3 
a) Train a LogisticRegression model on the training data (reviews processed with CountVectorizer, normalized ratings as labels).  
b) Print the 10 most positive and the 10 most negative words.

In [10]:
#a)
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

In [11]:
#b)
data = {"coef": model.coef_.reshape(-1), "name": vectorizer.get_feature_names_out()}
df = pd.DataFrame(data=data)

In [12]:
# words with the most negative coefficients
df.sort_values(by="coef", ascending=True).head(10)

Unnamed: 0,coef,name
3431,-2.59823,worthless
2177,-2.578908,poorly
3220,-2.543436,unusable
833,-2.455845,disappointing
3429,-2.345307,worst
3241,-2.220242,useless
832,-2.184499,disappointed
834,-2.14993,disappointment
392,-2.097502,bummer
2176,-2.004088,poor


In [13]:
# words with the most positive coefficients
df.sort_values(by="coef", ascending=False).head(10)

Unnamed: 0,coef,name
614,2.124396,con
2498,2.117537,saves
2659,2.111183,skeptical
1635,2.032423,lifesaver
3017,1.965095,thankful
2566,1.907413,served
3409,1.869745,wonderfully
1807,1.861412,minor
2089,1.828501,penny
357,1.82836,brilliant
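One way to read these coefficients: in logistic regression each coefficient is the change in log-odds of the positive class per additional occurrence of the word, holding the other counts fixed. The sketch below converts two coefficients copied from the tables above into odds multipliers; it says nothing about causation, and regularization affects the exact values.

```python
import math

# Coefficients copied from the outputs above.
coefficients = {"worthless": -2.59823, "lifesaver": 2.032423}

# exp(coef) is the factor by which the odds of a positive label change per occurrence.
odds_multipliers = {word: math.exp(coef) for word, coef in coefficients.items()}

for word, factor in odds_multipliers.items():
    print(f"{word}: odds multiplied by {factor:.3f} per occurrence")
```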


## Exercise 4 
a) Predict the sentiment of the test data reviews.  
b) Predict the sentiment of the test data reviews in terms of probability.  
c) Find the five most positive and the five most negative reviews.  
d) Calculate the accuracy of the predictions.

In [14]:
#a)
predict_time1 = timeit(lambda: model.predict(X_test_vec), number=1000)
y_pred = model.predict(X_test_vec)
print(y_pred)

[ 1 -1  1 ...  1  1  1]


In [15]:
#b)
y_pred_propa = model.predict_proba(X_test_vec)
print(y_pred_propa)

[[4.28771644e-01 5.71228356e-01]
 [7.71343379e-01 2.28656621e-01]
 [3.61969618e-01 6.38030382e-01]
 ...
 [1.63035363e-05 9.99983696e-01]
 [3.19743985e-04 9.99680256e-01]
 [8.19447470e-02 9.18055253e-01]]
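The two columns of `predict_proba` follow the order of `model.classes_`, so with labels -1 and 1 the second column is the probability of a positive review. For a binary model this probability is the sigmoid of the linear score. A sketch on toy count data (the arrays and the `toy_` names are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two toy features: counts of a "positive" word and of a "negative" word.
toy_X = np.array([[2, 0], [1, 0], [0, 1], [0, 2], [1, 1], [0, 3]])
toy_y = np.array([1, 1, -1, -1, 1, -1])

toy_model = LogisticRegression(max_iter=1000).fit(toy_X, toy_y)
toy_proba = toy_model.predict_proba(toy_X)

# Linear score w.x + b, squashed by the sigmoid, gives P(label = 1).
toy_scores = toy_model.decision_function(toy_X)
toy_sigmoid = 1.0 / (1.0 + np.exp(-toy_scores))

print(toy_model.classes_)
print(np.column_stack([toy_proba[:, 1], toy_sigmoid]))
```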


In [16]:
#c)
y_pred_df = pd.DataFrame(y_pred_propa[:, 1], columns=['rating_pred'])
df = pd.concat([X_test_full_df, y_pred_df], axis=1)
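The re-indexed frame `X_test_full_df` matters here because `pd.concat(..., axis=1)` aligns rows by index label, not by position. A small sketch with invented data shows what happens with and without `reset_index()`:

```python
import pandas as pd

# Rows keep their original labels after a shuffle/split.
toy_rows = pd.DataFrame({"review": ["good", "bad", "fine"]}, index=[7, 12, 5])
# Predictions come back with a fresh 0..n-1 index.
toy_preds = pd.DataFrame({"rating_pred": [0.9, 0.1, 0.6]})

# Labels do not match, so the result has six rows padded with NaN.
misaligned = pd.concat([toy_rows, toy_preds], axis=1)

# After reset_index() both frames share 0..n-1 and line up row by row.
aligned = pd.concat([toy_rows.reset_index(), toy_preds], axis=1)

print(misaligned)
print(aligned)
```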

In [17]:
df.sort_values(by="rating_pred", ascending=False).head(5)

Unnamed: 0,index,name,review,rating,rating_norm,rating_pred
5329,123632,"Zooper 2011 Waltz Standard Stroller, Flax Brown",I did a TON of research before I purchased thi...,5,1,1.0
31353,168086,Buttons Cloth Diaper Cover - One Size - 8 Colo...,Buttons vs Best Bottoms reviewFirst thing I wa...,5,1,1.0
8150,50735,"Joovy Zoom 360 Swivel Wheel Jogging Stroller, ...",The joovy zoom was the perfect solution for us...,5,1,1.0
24661,57108,BabyPlus Prenatal Education System,I started wearing the Babyplus when I was week...,5,1,1.0
29687,129722,Bumbleride 2011 Flite Lightweight Compact Trav...,This is a review of the Bumbleride Flite in Ru...,5,1,1.0


In [18]:
df.sort_values(by="rating_pred", ascending=True).head(5)

Unnamed: 0,index,name,review,rating,rating_norm,rating_pred
30933,147902,Graco Pack 'n Play Playard - Dempsey,My disappointment with this product prompted m...,1,-1,2.162411e-21
4809,175191,"Zooper Twist Escape Stroller, Summer Day",I had to return this stroller for three reason...,1,-1,2.100146e-17
6190,155287,VTech Communications Safe &amp; Sounds Full Co...,This is my second video monitoring system the ...,1,-1,1.791298e-14
12532,89902,"Peg-Perego Aria Twin Stroller, Java",I am so incredibly disappointed with the strol...,1,-1,3.600344e-14
25937,57234,"Dream On Me Bassinet, Blue",My husband and I are VERY disappointed and sho...,1,-1,7.266495e-14


In [19]:
#d)
accuracy1 = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy1)

Accuracy: 0.9324458037240263
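Accuracy alone can hide how the errors are distributed between the two classes. The imports at the top also bring in `precision_score` and `recall_score`; the sketch below shows how the three metrics relate on a handful of made-up labels (not the notebook's results).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

toy_true = [1, 1, 1, -1, -1, 1]
toy_pred = [1, 1, -1, -1, 1, 1]

# Accuracy: share of all predictions that are correct (4 of 6 here).
toy_accuracy = accuracy_score(toy_true, toy_pred)
# Precision: of the reviews predicted positive, how many really are (3 of 4).
toy_precision = precision_score(toy_true, toy_pred)
# Recall: of the truly positive reviews, how many were found (3 of 4).
toy_recall = recall_score(toy_true, toy_pred)

print(toy_accuracy, toy_precision, toy_recall)
```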


## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.  
b) Check the impact of all the words from the dictionary.  
c) Compare the accuracy of the predictions and the evaluation time.

In [20]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [21]:
#a)
vectorizer = CountVectorizer(vocabulary=significant_words)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

In [22]:
#b)
pd.DataFrame({"coef": model.coef_.reshape(-1), "name": vectorizer.get_feature_names_out()}).sort_values(by="coef", ascending=False)

Unnamed: 0,coef,name
6,1.704188,loves
5,1.518635,perfect
0,1.348611,love
2,1.182512,easy
1,0.927305,great
7,0.517645,well
4,0.497923,little
8,0.195995,able
9,0.073524,car
3,0.06781,old


In [23]:
#c)
y_pred = model.predict(X_test_vec)
predict_time2 = timeit(lambda: model.predict(X_test_vec), number=1000)
accuracy2 = accuracy_score(y_test, y_pred)
dict_size2 = len(vectorizer.get_feature_names_out())

print('Accuracy of model 1:\t\t', accuracy1)
print('Accuracy of model 2:\t\t', accuracy2)
print()
print('Prediction time of model 1:\t {:.3f}'.format(predict_time1))
print('Prediction time of model 2:\t {:.3f}'.format(predict_time2))
print()
print('Accuracy of the model decreased by {:.2f}%'.format(100 - (accuracy2 / accuracy1 * 100)))
print('Prediction time decreased by {:.2f}%'.format(100 - (predict_time2 / predict_time1 * 100)))
print('Size of the dictionary decreased by {:.2f}%'.format(100 - (dict_size2 / dict_size1 * 100)))

Accuracy of model 1:		 0.9324458037240263
Accuracy of model 2:		 0.8692692872777428

Prediction time of model 1:	 4.291
Prediction time of model 2:	 0.674

Accuracy of the model decreased by 6.78%
Prediction time decreased by 84.29%
Size of the dictionary decreased by 99.42%
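The percentages above are relative changes. The sketch below recomputes them from the printed numbers and also shows the absolute accuracy drop in percentage points. Since `timeit` was called with `number=1000`, the reported times are totals in seconds for 1000 calls, so the per-call time in milliseconds has the same numeric value.

```python
# Values copied from the output above.
accuracy_full = 0.9324458037240263
accuracy_limited = 0.8692692872777428
total_time_full = 4.291      # seconds for 1000 predict() calls
total_time_limited = 0.674   # seconds for 1000 predict() calls
runs = 1000

# Relative drop: how much smaller the new accuracy is as a share of the old one.
relative_drop_pct = 100 - accuracy_limited / accuracy_full * 100
# Absolute drop: plain difference, expressed in percentage points.
absolute_drop_points = (accuracy_full - accuracy_limited) * 100

ms_per_call_full = total_time_full / runs * 1000
ms_per_call_limited = total_time_limited / runs * 1000

print(f"relative accuracy drop: {relative_drop_pct:.2f}%")
print(f"absolute accuracy drop: {absolute_drop_points:.2f} percentage points")
print(f"time per call: {ms_per_call_full:.3f} ms vs {ms_per_call_limited:.3f} ms")
```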
