### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def remove_punctuation(text):
    # Delete every character listed in string.punctuation
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (nan) reviews with an empty string "".  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and negative ($\leq$2) to -1.

In [31]:
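#a)
# Strip punctuation from string entries only. Casting the column with
# astype(str) would turn missing reviews into the literal text 'nan',
# so they are left untouched here and handled in b).
baby_df['review'] = baby_df['review'].apply(
    lambda text: remove_punctuation(text) if isinstance(text, str) else text
)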

#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'


True

**Notes:** Punctuation is removed from every review with `apply`. This reduces noise in the data: without it, `word.` and `word` would be treated as distinct words. Punctuation generally contributes little to the model, although it is worth noting that the removal function also strips exclamation marks.
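
A quick check of the effect described in the note, as a minimal self-contained sketch. The sample sentence is made up for illustration and is not taken from the dataset.

```python
import string

def remove_punctuation(text):
    # Same helper as the one defined at the top of the notebook
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

sample = "Great wipes! Soft, cheap... would buy again."
cleaned = remove_punctuation(sample)
print(cleaned)

# 'word.' and 'word' collapse to the same token once punctuation is gone
print(remove_punctuation("word.") == remove_punctuation("word"))
```

Only characters are deleted, nothing is inserted, so two words separated by punctuation alone are fused. This is visible in the test string above, where `itThis` and `headachesThanks` appear.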

In [32]:
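#b)
# Assign the result back instead of calling replace(..., inplace=True) on a column
baby_df['review'] = baby_df['review'].fillna("")

# short test: NaN is the only value that is not equal to itself
baby_df["review"][38] == baby_df["review"][38]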



True

**Notes:** `np.nan` is a floating-point constant ("not a number") that marks a missing or undefined value. Missing reviews are replaced with empty strings because `CountVectorizer` expects text and cannot process `nan` values.
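
The short test above appears to rely on the fact that NaN never compares equal to itself. The sketch below illustrates this on a made-up list of reviews, using only the standard library rather than pandas.

```python
import math

nan = float("nan")   # the same kind of value as np.nan
print(nan == nan)    # NaN compares unequal to everything, itself included

# Hypothetical reviews; the middle one is missing
reviews = ["Great wipes", nan, "Too small"]
cleaned = ["" if isinstance(r, float) and math.isnan(r) else r for r in reviews]
print(cleaned)

# After the replacement every entry is equal to itself again
print(all(r == r for r in cleaned))
```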

In [33]:
#c)
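# .copy() makes the filtered frame independent, so the assignment in d) does not raise SettingWithCopyWarning
baby_df = baby_df[baby_df["rating"] != 3].copy()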

# Short test
print(sum(baby_df["rating"] == 3))

0


**Notes:** All entries with a rating of 3 are removed with a boolean mask, and the short test confirms that none remain.

In [34]:
#d)

baby_df["rating"] = np.where(baby_df["rating"] >= 4, 1, np.where(baby_df["rating"] <= 2, -1, baby_df["rating"]))

# Short test: every rating is now either 1 or -1, so its square is always 1
sum(baby_df["rating"]**2 != 1)

0

**Notes:** All positive ratings (greater than or equal to 4) are set to 1 and all negative ratings (less than or equal to 2) are set to -1.
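
Because the neutral ratings were already dropped in c), the nested `np.where` could probably be reduced to a single call. A minimal sketch on a made-up array of ratings (not the notebook's data):

```python
import numpy as np

ratings = np.array([1, 2, 4, 5, 5, 1])   # hypothetical ratings, 3s already removed
labels = np.where(ratings >= 4, 1, -1)   # everything that is not positive is negative
print(labels)

# Same check as in the notebook: the square of 1 and of -1 is 1
print(int(np.sum(labels**2 != 1)))
```

This shortcut is only safe once no rating of 3 is left; otherwise neutral reviews would silently be labelled negative.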

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.

In [35]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())



['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [36]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, for test values, CountVectorizer ignores words which are not in its dictionary.
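
The three facts can be reproduced with a few lines of plain Python. This is only a sketch of the counting idea, not scikit-learn's implementation. It assumes the default settings of CountVectorizer (lowercasing and the token pattern `(?u)\b\w\w+\b`), and the helper name `count_vector` is made up for this example.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in text."""
    counts = Counter(TOKEN_PATTERN.findall(text.lower()))
    return [counts[word] for word in vocabulary]

vocabulary = ["adore", "apples", "bananas", "like", "we"]

# 1) Word order does not matter
same_a = count_vector("We like apples", vocabulary)
same_b = count_vector("apples like we", vocabulary)
print(same_a == same_b)

# 2) The one-letter word "I" never becomes a token
one_letter = count_vector("I adore bananas", vocabulary)
print(one_letter)

# 3) "love" is not in the vocabulary, so it is ignored
unknown_word = count_vector("We love bananas", vocabulary)
print(unknown_word)
```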

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [37]:
#a)
train, test = train_test_split(baby_df, train_size=0.8, test_size=0.2, random_state=9)


**Notes:** The dataset is split into a training set (80% of the data) and a test set (20% of the data) with the `train_test_split` function. Fixing `random_state` to an integer makes the shuffle reproducible, so the same split is produced on every run.
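
The role of the random state can be illustrated without scikit-learn. The sketch below is a simplified stand-in for `train_test_split`, not its actual implementation, and the helper name `simple_split` is made up for this example.

```python
import random

def simple_split(items, train_size, seed):
    """Shuffle a copy of items with a seeded generator, then cut it in two."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))
train_a, test_a = simple_split(data, 0.8, seed=9)
train_b, test_b = simple_split(data, 0.8, seed=9)

print(len(train_a), len(test_a))
# The same seed gives exactly the same split
print(train_a == train_b and test_a == test_b)
```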

In [38]:
#b)
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(list(train["review"]))
y = train["rating"]
X_test = vectorizer.transform(list(test["review"]))
y_test = test["rating"]

**Notes:** `CountVectorizer` from scikit-learn converts a collection of text documents into a matrix of token counts. The `fit_transform` method learns the vocabulary from the training reviews in the "review" column of the train dataframe and, in the same call, converts them into a sparse matrix (X) of token counts. The test reviews are converted with `transform` only, so they are encoded with the vocabulary learned from the training data. Each entry of a row is the number of times the corresponding word occurs in that review.

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [39]:
#a)
model = LogisticRegression()

model.fit(X, y)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print("Accuracy:", accuracy)

Accuracy: 0.9315762645797727


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


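**Notes:** The model is trained to separate positive reviews from negative ones. A rating of -1 marks a negative review, while a rating of 1 marks a positive one. The warning in the output means that the solver stopped at its iteration limit (`max_iter`, 100 by default) before converging; as the message suggests, raising `max_iter` or scaling the data addresses this.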

The `fit` method estimates the logistic regression coefficients from the training data. It takes the feature matrix and the array of target labels that correspond to its rows.

The `accuracy_score` function from the `sklearn.metrics` package returns the fraction of predicted labels that match the true labels.
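
For classification labels, accuracy is simply the share of matching positions. A minimal pure-Python sketch with made-up labels (the function name `accuracy` is chosen for this example; in the notebook itself `accuracy_score` does this work):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

y_true_example = [1, -1, 1, 1, -1]
y_pred_example = [1, -1, -1, 1, -1]   # one mistake, in the third position

print(accuracy(y_true_example, y_pred_example))
```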

In [40]:
#b)

feature_names = np.array(vectorizer.get_feature_names_out())

# Get the coefficients from the trained model
coefficients = model.coef_[0]

# Create a dictionary mapping feature names to coefficients
features_coefficients = dict(zip(feature_names, coefficients))

print("10 Most Positive Words:")
print(sorted(features_coefficients.items(), key=lambda x: x[1], reverse=True)[:10])

print("\n10 Most Negative Words:")
print(sorted(features_coefficients.items(), key=lambda x: x[1])[:10])


10 Most Positive Words:
[('excellent', 2.448654775730521), ('awesome', 2.4020460062160773), ('pleased', 2.192401846019049), ('glad', 2.074902946167239), ('satisfied', 2.0739927240906417), ('worry', 2.041578997992476), ('complaint', 1.9980294189031529), ('amazing', 1.9600222528227504), ('perfect', 1.9398866893268911), ('highly', 1.899928055075565)]

10 Most Negative Words:
[('worst', -3.3301183535992642), ('disappointing', -2.958751689285848), ('concept', -2.7636649330955727), ('poorly', -2.7152341727328357), ('useless', -2.523090243980442), ('poor', -2.3067019542612552), ('returning', -2.3010930001342635), ('worthless', -2.2501826072779845), ('disappointed', -2.242357295229989), ('terrible', -2.2258507630486144)]


## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [41]:
#a)

# Predict sentiment on the test data
y_pred = model.predict(X_test)

# Print the predicted sentiment for the first few reviews
for i in range(5):
    print("Review:", test["review"].iloc[i])
    print("Actual Rating:", y_test.iloc[i])
    print("Predicted Sentiment:", y_pred[i])
    print()


Review: This is a very cute toy but I soon realized that the battery compartment cannot be accessed and the batteries cannot be replaced I should have researched this item better
Actual Rating: -1
Predicted Sentiment: -1

Review: This thing is well worth the price I paid my little guy loves it  He is 8 mos old and big for his age  When we first bought it we had the seat pulled up the highest it would go which worked for a while  Eventually DS got too big so we had to figure out a way to shorten the straps  For a while we did as someone else suggested and put tennis balls between the straps and the underside of the three spreaders  That worked  My husband didnt really like that solution though so he took a large carabiner and wrapped the main hanging strap between the adjuster and the spreaders around the carabiner a bunch of times  If we want to make the seat higher off the ground we just wrap the strap some more  As for the bumpers our guy is so active in this thing if it was in a reg

In [42]:
#b)
y_proba = model.predict_proba(X_test)

# Print the predicted probabilities for the first few reviews
for i in range(5):
    print("Review:", test["review"].iloc[i])
    print("Actual Rating:", y_test.iloc[i])
    print("Predicted Probabilities:", y_proba[i])
    print()
#hint: model.predict_proba()

Review: This is a very cute toy but I soon realized that the battery compartment cannot be accessed and the batteries cannot be replaced I should have researched this item better
Actual Rating: -1
Predicted Probabilities: [0.76581707 0.23418293]

Review: This thing is well worth the price I paid my little guy loves it  He is 8 mos old and big for his age  When we first bought it we had the seat pulled up the highest it would go which worked for a while  Eventually DS got too big so we had to figure out a way to shorten the straps  For a while we did as someone else suggested and put tennis balls between the straps and the underside of the three spreaders  That worked  My husband didnt really like that solution though so he took a large carabiner and wrapped the main hanging strap between the adjuster and the spreaders around the carabiner a bunch of times  If we want to make the seat higher off the ground we just wrap the strap some more  As for the bumpers our guy is so active in this
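
Each row returned by `predict_proba` sums to 1, and its columns follow the order of `model.classes_`, which for the labels used here is `[-1, 1]`. The sketch below shows how a predicted label can be recovered from such probabilities. The first row is the one printed above for the first test review, the second row is made up, and the `classes` array is written out by hand as an assumption rather than read from the trained model.

```python
import numpy as np

classes = np.array([-1, 1])                   # assumed to match model.classes_
proba = np.array([[0.76581707, 0.23418293],   # first test review, from the output above
                  [0.10, 0.90]])              # made-up row for illustration

# The predicted label is the class with the larger probability
predicted = classes[np.argmax(proba, axis=1)]
print(predicted)

# Sanity check: each row is a probability distribution over the two classes
print(np.allclose(proba.sum(axis=1), 1.0))
```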

In [43]:
#c)
# Column 1 of y_proba holds the probability of the positive class (label 1)
most_positive_indices = np.argsort(y_proba[:, 1])[::-1][:5]

print("Five Most Positive Reviews:")
for index in most_positive_indices:
    print("Predicted Probability (Positive):", y_proba[index, 1])
    print("Actual Rating:", y_test.iloc[index])
    print("Review:", test["review"].iloc[index])
    print()

Five Most Positive Reviews:
Predicted Probability (Positive): 1.0
Actual Rating: 1
Review: Background Ive been using Grovia diapers for four years when I bought them they were called grobaby I have 6 shells and 12 inserts I purchased them for my oldest daughter and used them only part time for 25 years until she was potty trained She still needed pullups at night for a long time so I used these until she was ready to go without She was probably 30lbs when she stopped wearing them and they fit well They were stored for a year until I had my other daughter who is now 55 months and exclusively CDd I didnt use them for either girl until about a month old since I had tiny babiesPros Theyre mostly organic When they were new they were so soft and thick Now theyre definitely worn in and not so soft anymoreI love how easy they are to use As a first time CDer these were perfect Just snap the inserts in strap on baby Easy for pretty much anyone No stuffing unstuffing folding etcI love that Im abl

In [44]:

# Column 0 of y_proba holds the probability of the negative class (label -1)
most_negative_indices = np.argsort(y_proba[:, 0])[::-1][:5]


print("Five Most Negative Reviews:")
for index in most_negative_indices:
    print("Predicted Probability (Negative):", y_proba[index, 0])
    print("Actual Rating:", y_test.iloc[index])
    print("Review:", test["review"].iloc[index])
    print()


Five Most Negative Reviews:
Predicted Probability (Negative): 1.0
Actual Rating: -1
Review: Please see my email to the companyHelloI am writing to voice my familys anger over your unsafe cheap cosleeper  If you recall I had a problem with my newly purchased cosleeper back in May which I immediately called about and was told to send the frame back  At that time I asked to speak to a supervisor about the situation and was told that I would be contacted shortly  However Mayra was the only one who I was able to speak with after numerous attempts to be put in contact with the supervisor  After a huge delay due to mistakes on your end I finally got the cosleeper sent back to the company after speaking with Veronica on June 13thAt this time June 13th I asked to speak with the manager of the company and Veronica told me that Sharon was not in at the time but would be in later that day  I obviously never heard from Sharon or anyone else from this company for that matter from that point on  I wa
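
The `np.argsort(...)[::-1][:5]` idiom used in the two cells above sorts the indices by ascending probability, reverses them, and keeps the first five. A small sketch with made-up probabilities, keeping two instead of five:

```python
import numpy as np

positive_proba = np.array([0.20, 0.95, 0.50, 0.99, 0.05])  # hypothetical P(positive)

order = np.argsort(positive_proba)   # indices from lowest to highest probability
top_two = order[::-1][:2]            # reverse, then keep the two highest
bottom_two = order[:2]               # the two lowest, without reversing

print(top_two)
print(bottom_two)
```

With only two classes, the reviews with the lowest positive probability are the same as those with the highest negative probability, so either column can be used to find the most negative reviews.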

In [45]:
#d) 
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print("Accuracy:", accuracy)

Accuracy: 0.9315762645797727


## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare the accuracy of predictions and the time of evaluation.

In [46]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [47]:
#a)
vectorizer_limited = CountVectorizer(vocabulary=significant_words)

X_limited = vectorizer_limited.fit_transform(list(train["review"]))
y_limited = train["rating"]
X_test_limited = vectorizer_limited.transform(list(test["review"]))
y_test_limited = test["rating"]

In [48]:

model_lim = LogisticRegression()

model_lim.fit(X_limited, y_limited)

y_pred_lim = model_lim.predict(X_test_limited)
accuracy = accuracy_score(y_test_limited, y_pred_lim)

# Print the accuracy
print("Accuracy:", accuracy)

Accuracy: 0.8666906539534047


In [49]:
feature_names_lim = np.array(vectorizer_limited.get_feature_names_out())

# Get the coefficients from the trained model
coefficients_lim = model_lim.coef_[0]

# Create a dictionary mapping feature names to coefficients
features_coefficients_lim = dict(zip(feature_names_lim, coefficients_lim))

print("10 Most Positive Words:")
print(sorted(features_coefficients_lim.items(), key=lambda x: x[1], reverse=True)[:10])

print("\n10 Most Negative Words:")
print(sorted(features_coefficients_lim.items(), key=lambda x: x[1])[:10])

10 Most Positive Words:
[('loves', 1.729994908769879), ('perfect', 1.547478580404956), ('love', 1.3627986082007149), ('easy', 1.1754698170687194), ('great', 0.9273480049863843), ('little', 0.4974999811869876), ('well', 0.487894263594029), ('able', 0.22196595718869577), ('old', 0.0850652051801621), ('car', 0.06677229311368224)]

10 Most Negative Words:
[('disappointed', -2.340349322394666), ('return', -2.1101705175264516), ('waste', -1.9888378411817036), ('broke', -1.7092612645773173), ('money', -0.9319460049895277), ('work', -0.6473130789671582), ('even', -0.5133966019999382), ('would', -0.3436397662299862), ('product', -0.30272946158459846), ('less', -0.17916736992083876)]


In [22]:
y_pred_lim = model_lim.predict(X_test_limited)

# Print the predicted sentiment for the first few reviews
for i in range(5):
    print("Review:", test["review"].iloc[i])
    print("Actual Rating:", y_test_limited.iloc[i])
    print("Predicted Sentiment:", y_pred_lim[i])
    print()


Review: This is a very cute toy but I soon realized that the battery compartment cannot be accessed and the batteries cannot be replaced I should have researched this item better
Actual Rating: -1
Predicted Sentiment: 1

Review: This thing is well worth the price I paid my little guy loves it  He is 8 mos old and big for his age  When we first bought it we had the seat pulled up the highest it would go which worked for a while  Eventually DS got too big so we had to figure out a way to shorten the straps  For a while we did as someone else suggested and put tennis balls between the straps and the underside of the three spreaders  That worked  My husband didnt really like that solution though so he took a large carabiner and wrapped the main hanging strap between the adjuster and the spreaders around the carabiner a bunch of times  If we want to make the seat higher off the ground we just wrap the strap some more  As for the bumpers our guy is so active in this thing if it was in a regu

In [50]:
y_proba_lim = model_lim.predict_proba(X_test_limited)

# Print the predicted probabilities for the first few reviews
for i in range(5):
    print("Review:", test["review"].iloc[i])
    print("Actual Rating:", y_test_limited.iloc[i])
    print("Predicted Probabilities:", y_proba_lim[i])
    print()

Review: This is a very cute toy but I soon realized that the battery compartment cannot be accessed and the batteries cannot be replaced I should have researched this item better
Actual Rating: -1
Predicted Probabilities: [0.21348739 0.78651261]

Review: This thing is well worth the price I paid my little guy loves it  He is 8 mos old and big for his age  When we first bought it we had the seat pulled up the highest it would go which worked for a while  Eventually DS got too big so we had to figure out a way to shorten the straps  For a while we did as someone else suggested and put tennis balls between the straps and the underside of the three spreaders  That worked  My husband didnt really like that solution though so he took a large carabiner and wrapped the main hanging strap between the adjuster and the spreaders around the carabiner a bunch of times  If we want to make the seat higher off the ground we just wrap the strap some more  As for the bumpers our guy is so active in this

In [24]:
most_positive_indices_lim = np.argsort(y_proba_lim[:, 1])[::-1][:5]

print("Five Most Positive Reviews:")
for index in most_positive_indices_lim:
    print("Predicted Probability (Positive):", y_proba_lim[index, 1])
    print("Actual Rating:", y_test_limited.iloc[index])
    print("Review:", test["review"].iloc[index])
    print()


most_negative_indices_lim = np.argsort(y_proba_lim[:, 0])[::-1][:5]


print("Five Most Negative Reviews:")
for index in most_negative_indices_lim:
    print("Predicted Probability (Negative):", y_proba_lim[index, 0])
    print("Actual Rating:", y_test_limited.iloc[index])
    print("Review:", test["review"].iloc[index])
    print()

Five Most Positive Reviews:
Predicted Probability (Positive): 0.9999999973436262
Actual Rating: 1
Review: We bought this stroller about 2 weeks ago I absolutely love it I have a 3 year old and a 5 month old They both fit in the stroller great My 3 yr old was so excited about this stroller He loves the color blue the canopies the UV shade everything He is looking forward to using the boot this winter I love love love the big basket Yes there is a bar across it However I am able to get a backpack in over the bar To take it out we just unsnap the back of the basket and pull the backpack under the bar It really is easy The stroller steers well I have not noticed it pulling to one side despite the 15 pound weight difference between my boys We have not used the rain shield but the UV shade is great The canopy will cover the whole front of the stroller but the UV shade keeps the boys shady when they want to be able to look out The infant boot works well It is meant to go across the whole fron

In [51]:
accuracy_lim = accuracy_score(y_test_limited, y_pred_lim)

# Print the accuracy
print("Accuracy:", accuracy_lim)

Accuracy: 0.8666906539534047


In [52]:
#b)
for word, coef in zip(vectorizer_limited.get_feature_names_out(),model_lim.coef_[0]):
    print(f"{word}: {coef}")


love: 1.3627986082007149
great: 0.9273480049863843
easy: 1.1754698170687194
old: 0.0850652051801621
little: 0.4974999811869876
perfect: 1.547478580404956
loves: 1.729994908769879
well: 0.487894263594029
able: 0.22196595718869577
car: 0.06677229311368224
broke: -1.7092612645773173
less: -0.17916736992083876
even: -0.5133966019999382
waste: -1.9888378411817036
disappointed: -2.340349322394666
work: -0.6473130789671582
product: -0.30272946158459846
money: -0.9319460049895277
would: -0.3436397662299862
return: -2.1101705175264516
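
To see how these coefficients translate into a prediction, recall that logistic regression passes the weighted sum of word counts through the sigmoid function. The sketch below uses four coefficients rounded from the output above. The intercept is not printed in this notebook, so it is set to 0 here purely as an assumption, and the function name `positive_probability` is made up for this example.

```python
import math

# Coefficients rounded from the printed output of the limited model
coefficients = {
    "love": 1.3628,
    "great": 0.9273,
    "disappointed": -2.3403,
    "return": -2.1102,
}
INTERCEPT = 0.0  # assumption: the real intercept is not shown in the notebook

def positive_probability(word_counts, coefficients, intercept=INTERCEPT):
    """Sigmoid of intercept + sum(coefficient * count) over the known words."""
    score = intercept + sum(
        coefficients.get(word, 0.0) * count for word, count in word_counts.items()
    )
    return 1.0 / (1.0 + math.exp(-score))

p_love = positive_probability({"love": 2}, coefficients)
p_return = positive_probability({"return": 1, "disappointed": 1}, coefficients)
p_empty = positive_probability({}, coefficients)

print(round(p_love, 3), round(p_return, 3), round(p_empty, 3))
```

Words with large positive coefficients push the probability towards 1, words with large negative coefficients push it towards 0, and words with coefficients near zero (such as `car` or `old`) barely move it.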


In [53]:
import sys, time


In [54]:
%%time
%%timeit
model_lim.predict(X_test_limited)



357 µs ± 3.56 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
CPU times: user 2.89 s, sys: 1.42 ms, total: 2.89 s
Wall time: 2.89 s


In [57]:
%%time
%%timeit
model.predict(X_test)


#hint: %time, %timeit

3.31 ms ± 58.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
CPU times: user 2.9 s, sys: 2.46 ms, total: 2.9 s
Wall time: 2.9 s
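
Outside IPython the `%%timeit` magic is not available, but the standard-library `timeit` module gives a comparable measurement. A minimal sketch, where the function `work` is only a stand-in workload and would be replaced by the call to be timed (for example a call to `predict`):

```python
import timeit

def work():
    # Stand-in workload for this sketch
    return sum(i * i for i in range(1000))

# 7 repetitions of 200 calls each, similar in spirit to "7 runs" in %%timeit
timings = timeit.repeat(work, number=200, repeat=7)
best_per_call = min(timings) / 200

print(f"best of 7: {best_per_call * 1e6:.1f} µs per call")
```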


In [58]:


print(f"first model score: {model.score(X_test, y_test)}")
print(f"second model score: {model_lim.score(X_test_limited, y_test_limited)}")

first model score: 0.9315762645797727
second model score: 0.8666906539534047
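
The comparison asked for in c) can be summarized with a little arithmetic on the numbers printed above. The values are copied from the `%%timeit` and accuracy outputs of this notebook; timings will differ from machine to machine.

```python
full_time_s = 3.31e-3      # model.predict(X_test), mean from %%timeit
limited_time_s = 357e-6    # model_lim.predict(X_test_limited), mean from %%timeit
full_accuracy = 0.9315762645797727
limited_accuracy = 0.8666906539534047

speedup = full_time_s / limited_time_s
accuracy_drop = full_accuracy - limited_accuracy

print(f"speed-up: {speedup:.1f}x, accuracy drop: {accuracy_drop:.3f}")
```

With these figures, the model restricted to 20 words predicts roughly nine times faster, at the cost of about 6.5 percentage points of accuracy.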
