### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [135]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

def remove_punctuation(text):
    # Delete every character listed in string.punctuation (string is imported above).
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)


from google.colab import drive
drive.mount('/content/drive')

baby_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/amazon_baby.csv')
baby_df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.  
b) Replace all missing (NaN) reviews with an empty "" string.  
c) Drop all the entries with rating = 3, as they have neutral sentiment.  
d) Set all positive ($\geq$4) ratings to 1 and negative ($\leq$2) ratings to -1.

In [136]:
#b)

baby_df['review'] = baby_df['review'].fillna('')

#short test:
baby_df["review"][38] == baby_df["review"][38]


#to remove punctuation we first need to replace NaNs with strings; otherwise remove_punctuation would raise an exception on rows with NaN, because NaN is a special floating-point value, not a string


True
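The self-comparison used as a short test works because NaN is the only value that is not equal to itself, so `x == x` is `True` only once the NaN has been replaced. A minimal sketch in plain Python (the variable names are illustrative and not part of the notebook) of both that check and the reason b) has to run before a):

```python
import string

missing = float("nan")  # stands in for a missing review, which pandas stores as a float NaN

# NaN is never equal to itself, so this comparison is False for a missing value.
nan_equals_itself = missing == missing

# Floats have no .translate method, so the punctuation step would fail on NaN.
translator = str.maketrans("", "", string.punctuation)
try:
    missing.translate(translator)
    failed = False
except AttributeError:
    failed = True

# After replacing NaN with "", the same call is safe and returns "".
cleaned = "".translate(translator)

print(nan_equals_itself, failed, repr(cleaned))
```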

In [137]:
#a)

baby_df["review"] = baby_df["review"].apply(remove_punctuation)

#short test:
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'


True

In [138]:
#c)

baby_df = baby_df[baby_df.rating != 3]

#short test:
sum(baby_df["rating"] == 3)

#the test passes: there are no ratings equal to 3


0

In [139]:
#d)

baby_df.loc[baby_df["rating"] <= 2, "rating"] = -1
baby_df.loc[baby_df["rating"] >= 4, "rating"] = 1


#we use loc to select rows and overwrite their ratings; the low ratings must be set to -1 first, because if we set the high ratings to 1 first,
#the second step (rating <= 2) would also match those 1s and turn every rating into -1

#short test:
sum(baby_df["rating"]**2 != 1)

#the test passes: every remaining rating is now 1 or -1, so its square can only be 1

print(sum(baby_df["rating"] == 1))
print(sum(baby_df["rating"] == -1))

140259
26493
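The ordering issue described in the comments can be avoided by computing both labels from the original values in a single step. The sketch below uses a small hypothetical `ratings` series (not the real dataset) to show the order-dependent pitfall next to an order-independent alternative based on `np.where`:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with the neutral 3s already removed.
ratings = pd.Series([1, 2, 4, 5, 5, 1])

# Pitfall: mapping the high ratings first makes them match `<= 2` afterwards.
wrong = ratings.copy()
wrong.loc[wrong >= 4] = 1
wrong.loc[wrong <= 2] = -1  # also catches the freshly written 1s

# Order-independent alternative: the condition is evaluated once on the original values.
labels = pd.Series(np.where(ratings >= 4, 1, -1), index=ratings.index)

print(wrong.tolist())
print(labels.tolist())
```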


In [140]:
baby_df.head(25)

Unnamed: 0,name,review,rating
1,Planetwise Wipe Pouch,it came early and was not disappointed i love ...,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,1
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase I h...,1
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried nonstop when I tried...,1
5,Stop Pacifier Sucking without tears with Thumb...,When the Binky Fairy came to our house we didn...,1
6,A Tale of Baby's Days with Peter Rabbit,Lovely book its bound tightly so you may not b...,1
7,"Baby Tracker&reg; - Daily Childcare Journal, S...",Perfect for new parents We were able to keep t...,1
8,"Baby Tracker&reg; - Daily Childcare Journal, S...",A friend of mine pinned this product on Pinter...,1
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,1
10,"Baby Tracker&reg; - Daily Childcare Journal, S...",I love this journal and our nanny uses it ever...,1


## CountVectorizer
In order to analyze strings, we need to represent them numerically. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.
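To make the representation concrete, here is a hand-rolled sketch of the same idea in plain Python. It is not how CountVectorizer is implemented; the helper `bag_of_words` is hypothetical and only imitates the default behaviour of lowercasing and keeping tokens of at least two characters:

```python
import re
from collections import Counter

def bag_of_words(sentences):
    """Return (vocabulary, count_vectors) for a list of sentences."""
    # Lowercase and keep tokens made of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", s.lower()) for s in sentences]
    # The dictionary is the sorted set of all tokens seen in the input.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One dimension per dictionary word, holding its number of appearances.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocab, vectors = bag_of_words(["We like apples", "We like like apples and oranges"])
print(vocab)
print(vectors)
```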

In [141]:
vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())



['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [142]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, for test data, CountVectorizer ignores words that are not in its dictionary.

## Exercise 2
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer.

In [143]:
#a)

x = baby_df["review"]
y = baby_df["rating"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=44)


In [144]:
#b)

vectorizer = CountVectorizer()

x_train_transformed = vectorizer.fit_transform(x_train)

print(vectorizer.get_feature_names_out())


['00' '000' '001' ... 'zzzzzz' 'zzzzzzz' 'zzzzzzzzzzz']


## Exercise 3
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [145]:
#a)
model = LogisticRegression(solver='sag', max_iter=200)
model.fit(x_train_transformed, y_train)



In [146]:
#b)

d = {'coefs':model.coef_[0],'values':vectorizer.get_feature_names_out()}
df = pd.DataFrame(d)

df = df.sort_values("coefs", ascending=False)

print("Positive words:")
print(df.head(10)["values"])
print("Negative words:")
print(df.tail(10)["values"])

#the coefficients and their corresponding words are paired in a dictionary and turned into a DataFrame, which is then sorted by coefficient in descending order

Positive words:
52683        loves
65253      perfect
52615         love
31053         easy
13735         best
41933        happy
65305    perfectly
40390        great
35885         fits
33257      exactly
Name: values, dtype: object
Negative words:
93772    unfortunately
44990             idea
16612            broke
74087        returning
67455             poor
94961          useless
97136            waste
74061           return
74068         returned
28403     disappointed
Name: values, dtype: object
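The same ranking can be obtained without a DataFrame by sorting indices with `np.argsort`. In this sketch `coefs` and `words` are small hypothetical arrays standing in for `model.coef_[0]` and `vectorizer.get_feature_names_out()`:

```python
import numpy as np

# Hypothetical coefficients and vocabulary, aligned position by position.
coefs = np.array([0.2, -1.5, 1.1, -0.3, 0.9])
words = np.array(["fine", "broke", "love", "meh", "great"])

# Indices ordered from the most negative to the most positive coefficient.
order = np.argsort(coefs)

top_negative = words[order[:2]].tolist()        # smallest coefficients first
top_positive = words[order[::-1][:2]].tolist()  # largest coefficients first

print(top_positive, top_negative)
```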


## Exercise 4
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [147]:
#a)

x_test_transformed = vectorizer.transform(x_test)

y_pred = model.predict(x_test_transformed)

#the test data is transformed with the same vectorizer that was fitted on the training data

In [148]:
#b)

y_pred_proba = model.predict_proba(x_test_transformed)

print(model.classes_)

#classes we got predictions for are -1 and 1, for negative and positive reviews.

[-1  1]
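The columns of `predict_proba` follow the order of `model.classes_`, so column 0 is the probability of class -1 and column 1 the probability of class 1. A small sketch with hypothetical arrays (not actual model output) of looking the column up by label rather than hard-coding its position:

```python
import numpy as np

# Hypothetical stand-ins for model.classes_ and model.predict_proba(...).
classes = np.array([-1, 1])
proba = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.4, 0.6]])

# Find which column belongs to the positive class.
positive_col = int(np.where(classes == 1)[0][0])
p_positive = proba[:, positive_col]

# For a binary problem, the predicted label is the class with the larger probability.
predicted = classes[np.argmax(proba, axis=1)]

print(positive_col, p_positive, predicted)
```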


In [149]:
#c)

d = {'predictions':y_pred_proba[:, 0],'values': x_test}
#we create a dictionary holding the probability that each review is negative (column 0 corresponds to class -1) and the corresponding review
df = pd.DataFrame(d)

df = df.sort_values("predictions")

print("Positive reviews:")
print(df.head(5)["values"])
print("Negative reviews:")
print(df.tail(5)["values"])

#since we sorted the values in ascending order, the smallest probabilities come first, so the first reviews are very unlikely to be negative,
#and the last reviews have the highest probability of being negative

Positive reviews:
60298     This review is going to compare 3 JuJuBe bags ...
158209    updated 32213 After extensive research trial a...
116083    Ive had this stroller for a little more than s...
69511     Ive had this stroller for a little more than s...
42430     new to cloth diapering trying to figure out if...
Name: values, dtype: object
Negative reviews:
57234     My husband and I are VERY disappointed and sho...
10370     This product should be in the hall of fame sol...
133297    The first monitor broke within 1 month of use ...
120209    This is the first review I have ever written o...
87026     First off I did manage to find this product fo...
Name: values, dtype: object


In [155]:
#d)

#accuracy is returned as a fraction between 0 and 1, so we format it as a percentage
print(f"Model accuracy [{model.score(x_test_transformed, y_test):.0%}]")
print(f"Model accuracy [{accuracy_score(y_test, y_pred):.0%}]")


Model accuracy [93%]
Model accuracy [93%]
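The accuracy is easier to judge against a majority-class baseline, because the classes are imbalanced. The sketch below uses the class counts printed in Exercise 1 d); those counts cover the whole filtered dataset, so the share inside the test split is assumed to be similar rather than identical:

```python
# Class counts printed in Exercise 1 d) for the whole filtered dataset.
n_positive = 140259
n_negative = 26493

# Accuracy of a model that always predicts the positive class.
baseline = n_positive / (n_positive + n_negative)

print(f"Majority-class baseline [{baseline:.2%}]")
```

A classifier therefore has to score clearly above roughly 84% before its accuracy says anything about the words it has learned.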


## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words defined below.


a) Redo exercises 2-4 using the limited dictionary.  
b) Check the impact of each word from the dictionary.  
c) Compare the accuracy of predictions and the time of evaluation.

In [None]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [None]:
#a)


In [None]:
#b)


In [None]:
#c)

#hint: %time, %timeit