# Sentiment Classification of Product Reviews Using RNNs

## Objectives

* Explore the Amazon product review dataset
* Extract predictors and labels from the raw data
* Clean the review text
* Tokenize the reviews to convert them into sequences of numbers
* Pad the variable-length sequences to make them equal in length
* Build and train the model for sentiment classification
* Predict sentiments using the trained model

## The problem

Suppose you work at Globomantics, one of the most popular e-commerce companies. To improve the user experience, you want to analyse your products based on customer reviews and adjust your product catalog accordingly. To achieve this, you want to build a system that automatically analyses customer sentiment from thousands of product reviews. In short, you want to predict sentiment from customer reviews. You'll build and train a model using recurrent neural networks to accomplish this.

## Dataset

We'll be using an open-source Amazon customer review dataset available on Kaggle. The dataset can be downloaded from [here](https://www.kaggle.com/datasets/datafiniti/consumer-reviews-of-amazon-products).

The dataset contains about 34,000 consumer reviews of Amazon products such as the Kindle and the Fire TV Stick. Each record contains many details about the product as well as about the review itself.

## Read and Explore the dataset

We import the Pandas and NumPy libraries for basic data manipulation. We also import Python's built-in "re" module to use [regular expressions](https://docs.python.org/3/library/re.html), which we'll need to preprocess the text data.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import numpy as np
import re

We already downloaded the dataset from [here](https://www.kaggle.com/datasets/datafiniti/consumer-reviews-of-amazon-products?select=1429_1.csv) and saved it as a CSV file which we'll read now using Pandas.

In [3]:
df = pd.read_csv('1429_1.csv')

In [4]:
df.shape

(34660, 21)

In [5]:
df.iloc[0]

id                                                   AVqkIhwDv8e3D1O-lebb
name                    All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...
asins                                                          B01AHB9CN2
brand                                                              Amazon
categories              Electronics,iPad & Tablets,All Tablets,Fire Ta...
keys                    841667104676,amazon/53004484,amazon/b01ahb9cn2...
manufacturer                                                       Amazon
reviews.date                                     2017-01-13T00:00:00.000Z
reviews.dateAdded                                    2017-07-03T23:33:15Z
reviews.dateSeen        2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z
reviews.didPurchase                                                   NaN
reviews.doRecommend                                                  True
reviews.id                                                            NaN
reviews.numHelpful                    

In [6]:
print(df.iloc[3]["reviews.rating"])
print(df.iloc[3]["reviews.text"])

4.0
I've had my Fire HD 8 two weeks now and I love it. This tablet is a great value.We are Prime Members and that is where this tablet SHINES. I love being able to easily access all of the Prime content as well as movies you can download and watch laterThis has a 1280/800 screen which has some really nice look to it its nice and crisp and very bright infact it is brighter then the ipad pro costing $900 base model. The build on this fire is INSANELY AWESOME running at only 7.7mm thick and the smooth glossy feel on the back it is really amazing to hold its like the futuristic tab in ur hands.


In [7]:
print(df.iloc[27162]["reviews.rating"])
print(df.iloc[27162]["reviews.text"])

1.0
I purchased this item on the recommendation of many existing owners. However, when I got home and attempted to set up the device, it was frustrating and the instructions/FAQs were not very helpful. The echo could not find my NEST thermostat, nor could it find about half of my HUE lights. Once I argued and fought with Alexa for about 45 minutes, she finally started to pick up my lights and after COMLETELY re-configuring my thermostat, she could see it, but she would not work with it. Once she found all of my lights, she would turn them on or off, but would not change colors or brightness levels. I run a software company and have worked in and with computers my whole life. This should not have been this quirky. I would not recommend this item. Get a Google Home.


## Data Manipulations

In [8]:
# Take an explicit copy so that later in-place changes do not act on a view of df.
reviews_df = df[["reviews.text", "reviews.rating"]].copy()
reviews_df.columns = ["review", "rating"]

In [9]:
reviews_df.head()

| | review | rating |
|---|---|---|
| 0 | This product so far has not disappointed. My c... | 5.0 |
| 1 | great for beginner or experienced person. Boug... | 5.0 |
| 2 | Inexpensive tablet for him to use and learn on... | 5.0 |
| 3 | I've had my Fire HD 8 two weeks now and I love... | 4.0 |
| 4 | I bought this for my grand daughter when she c... | 5.0 |


In [10]:
reviews_df.isnull().sum()

review     1
rating    33
dtype: int64

In [11]:
reviews_df.dropna(inplace=True)

In [12]:
def rating_to_sentiment(rating):
    # Ratings of 4 and 5 are positive, 3 is neutral, 1 and 2 are negative.
    if (rating == 5) or (rating == 4):
        return "positive"
    elif rating == 3:
        return "neutral"
    elif (rating == 2) or (rating == 1):
        return "negative"

In [13]:
reviews_df["sentiment"] = reviews_df["rating"].apply(rating_to_sentiment)

In [14]:
reviews_df.sample(10, random_state = 86, ignore_index=True)

| | review | rating | sentiment |
|---|---|---|---|
| 0 | I like the fact that I don't have to carry boo... | 5.0 | positive |
| 1 | Love Alexa. I have 3 echoes now and 2 dots. I ... | 5.0 | positive |
| 2 | I like my ECHO, she however will talk on her o... | 4.0 | positive |
| 3 | Clunky and full of ads. Ok for my kids to use ... | 2.0 | negative |
| 4 | Awesome product, works great. Good for listeni... | 5.0 | positive |
| 5 | Everyone should have one. Awesome!! Just ask A... | 5.0 | positive |
| 6 | This is a wonderful device. I love the extra c... | 5.0 | positive |
| 7 | The Charger is not lasting, Screen is ok,Funct... | 4.0 | positive |
| 8 | I purchased this item on the recommendation of... | 1.0 | negative |
| 9 | Great to group all your program app into one d... | 5.0 | positive |
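
Most rows in the sample above are positive, so it may be worth checking how balanced the three classes are before training. The following is a minimal sketch using a small made-up list of ratings rather than the real column; the names `RATING_TO_SENTIMENT` and `ratings` are assumptions for illustration only.

```python
from collections import Counter

# Same rating-to-sentiment rule as above, written as a lookup table.
RATING_TO_SENTIMENT = {5: "positive", 4: "positive", 3: "neutral",
                       2: "negative", 1: "negative"}

# Made-up ratings for illustration; the real values live in reviews_df["rating"].
ratings = [5.0, 5.0, 4.0, 2.0, 5.0, 3.0, 1.0, 5.0]

sentiment_labels = [RATING_TO_SENTIMENT[int(r)] for r in ratings]
counts = Counter(sentiment_labels)

# Share of each class among all labelled reviews.
shares = {label: count / len(sentiment_labels) for label, count in counts.items()}
print(counts)
print(shares)
```

On the real DataFrame, `reviews_df["sentiment"].value_counts(normalize=True)` should give the same kind of summary in one call.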


In [15]:
sentiments = reviews_df["sentiment"].values
reviews = reviews_df["review"].values
print(sentiments[0:5])
print("\n")
print(reviews[0:5])

['positive' 'positive' 'positive' 'positive' 'positive']


['This product so far has not disappointed. My children love to use it and I like the ability to monitor control what content they see with ease.'
 'great for beginner or experienced person. Bought as a gift and she loves it'
 'Inexpensive tablet for him to use and learn on, step up from the NABI. He was thrilled with it, learn how to Skype on it already...'
 "I've had my Fire HD 8 two weeks now and I love it. This tablet is a great value.We are Prime Members and that is where this tablet SHINES. I love being able to easily access all of the Prime content as well as movies you can download and watch laterThis has a 1280/800 screen which has some really nice look to it its nice and crisp and very bright infact it is brighter then the ipad pro costing $900 base model. The build on this fire is INSANELY AWESOME running at only 7.7mm thick and the smooth glossy feel on the back it is really amazing to hold its like the futuristic t

## Convert text categories into one-hot encoded vector

We import the Keras utilities module here because we need it to convert the labels to one-hot encoded vectors. We'll use the function "to_categorical" from this module to do this. To know more about this function, please check out this [link](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical).

In [16]:
!pip install tensorflow



In [17]:
import tensorflow.keras.utils as ku



The Keras "to_categorical" function takes integer class indices, so we first need to convert the text labels to numerical values. We'll define the following function to do this.

In [18]:
def encode_sentiments(sentiment):
    if sentiment == "negative":
        return 0
    elif sentiment == "neutral":
        return 1
    elif sentiment == "positive":
        return 2

In [19]:
label_encoding = {0: 'negative', 1: 'neutral', 2: 'positive'}

In [20]:
sentiments_encoded = [encode_sentiments(sentiment) for sentiment in sentiments]

In [21]:
print(sentiments[0:10])
print(sentiments_encoded[0:10])

['positive' 'positive' 'positive' 'positive' 'positive' 'positive'
 'positive' 'positive' 'positive' 'positive']
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]


In [22]:
labels = ku.to_categorical(sentiments_encoded, num_classes = 3)

In [23]:
print(labels)

[[0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 ...
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
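
To make the encoding concrete, the following plain-Python sketch reproduces the kind of rows shown above for a few class indices. The helper name `one_hot` is hypothetical; it only illustrates the output format and is not meant to replace "to_categorical".

```python
def one_hot(class_indices, num_classes):
    """Return one row per label with a 1.0 in the column of its class index."""
    return [[1.0 if column == index else 0.0 for column in range(num_classes)]
            for index in class_indices]

# 0 = negative, 1 = neutral, 2 = positive (same encoding as above)
encoded = [2, 0, 1]
rows = one_hot(encoded, num_classes=3)
for row in rows:
    print(row)
```

Each row has exactly one non-zero entry, which is why the final softmax layer of the model needs three outputs.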


## Cleaning of the review texts

We'll use regular expressions extensively for text preprocessing. A full discussion of regular expressions is out of scope for this course. However, I highly encourage you to go through these links to learn more about them: [link1](https://en.wikipedia.org/wiki/Regular_expression), [link2](https://developers.google.com/edu/python/regular-expressions), [link3](https://docs.python.org/3/library/re.html).

We'll mainly use the function "re.sub(...)" from the Python "re" package. This function replaces every match of a pattern in a text with a replacement string. You can find more information about this function [here](https://docs.python.org/3/library/re.html).

In [24]:
test_str = "I am really really impressed by this product."
print(test_str)
old_pattern = "really"
new_pattern = "very"
new_str = re.sub(old_pattern, new_pattern, test_str)
print(new_str)

I am really really impressed by this product.
I am very very impressed by this product.


In [25]:
test_str = "Python 3.8"
print(test_str)
old_pattern = r'\d'
new_pattern = '<digit>'
new_str = re.sub(old_pattern, new_pattern, test_str)
print(new_str)

Python 3.8
Python <digit>.<digit>


### Remove hyperlinks

In [26]:
test_str = "Visit https://www.amazon.com for more information on this."
test_pattern = r'http\S+'
print(test_str)
new_str = re.sub(test_pattern, " ", test_str)
print(new_str)

Visit https://www.amazon.com for more information on this.
Visit   for more information on this.


In [27]:
def remove_hyperlinks(text):
    pattern_for_hyperlink = r'http\S+'
    return re.sub(pattern_for_hyperlink, " ", text)

In [28]:
reviews = [remove_hyperlinks(review) for review in reviews]

In [29]:
reviews[0:5]

['This product so far has not disappointed. My children love to use it and I like the ability to monitor control what content they see with ease.',
 'great for beginner or experienced person. Bought as a gift and she loves it',
 'Inexpensive tablet for him to use and learn on, step up from the NABI. He was thrilled with it, learn how to Skype on it already...',
 "I've had my Fire HD 8 two weeks now and I love it. This tablet is a great value.We are Prime Members and that is where this tablet SHINES. I love being able to easily access all of the Prime content as well as movies you can download and watch laterThis has a 1280/800 screen which has some really nice look to it its nice and crisp and very bright infact it is brighter then the ipad pro costing $900 base model. The build on this fire is INSANELY AWESOME running at only 7.7mm thick and the smooth glossy feel on the back it is really amazing to hold its like the futuristic tab in ur hands.",
 'I bought this for my grand daughter 

### Expand contracted words

In [30]:
def remove_contracted_words(text):
    # Expand common English contractions into their full forms.
    # Order matters: the irregular forms ("won't", "can't") must be
    # handled before the generic "n't" rule.
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)
    return text

In [31]:
test_str = "I can't use this product. I don't recommend this product to anyone."
print(test_str)
print(remove_contracted_words(test_str))

I can't use this product. I don't recommend this product to anyone.
I can not use this product. I do not recommend this product to anyone.
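
One caveat of purely pattern-based expansion is that the `'s` rule cannot tell a contraction ("it's") from a possessive ("Amazon's"). The sketch below uses a shortened, hypothetical copy of the function (`expand_basic`, with only three of the rules) to show the effect.

```python
import re

def expand_basic(text):
    # Shortened copy of the rules above, just enough to show the behaviour.
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'s", " is", text)
    return text

contraction = expand_basic("It's great")
possessive = expand_basic("Amazon's tablet")
print(contraction)   # the contraction is expanded as intended
print(possessive)    # the possessive is rewritten as well
```

For a sentiment model this is probably tolerable noise, but it is worth keeping in mind when inspecting the cleaned reviews.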


In [32]:
reviews = [remove_contracted_words(review) for review in reviews]

In [33]:
reviews[0:5]

['This product so far has not disappointed. My children love to use it and I like the ability to monitor control what content they see with ease.',
 'great for beginner or experienced person. Bought as a gift and she loves it',
 'Inexpensive tablet for him to use and learn on, step up from the NABI. He was thrilled with it, learn how to Skype on it already...',
 'I have had my Fire HD 8 two weeks now and I love it. This tablet is a great value.We are Prime Members and that is where this tablet SHINES. I love being able to easily access all of the Prime content as well as movies you can download and watch laterThis has a 1280/800 screen which has some really nice look to it its nice and crisp and very bright infact it is brighter then the ipad pro costing $900 base model. The build on this fire is INSANELY AWESOME running at only 7.7mm thick and the smooth glossy feel on the back it is really amazing to hold its like the futuristic tab in ur hands.',
 'I bought this for my grand daughte

### Remove everything other than letters of the alphabet

In [34]:
def remove_non_letters(text):
    antipattern = r'[^A-Za-z]+'
    return re.sub(antipattern, " ", text)

In [35]:
reviews = [remove_non_letters(review) for review in reviews]
reviews[0:5]

['This product so far has not disappointed My children love to use it and I like the ability to monitor control what content they see with ease ',
 'great for beginner or experienced person Bought as a gift and she loves it',
 'Inexpensive tablet for him to use and learn on step up from the NABI He was thrilled with it learn how to Skype on it already ',
 'I have had my Fire HD two weeks now and I love it This tablet is a great value We are Prime Members and that is where this tablet SHINES I love being able to easily access all of the Prime content as well as movies you can download and watch laterThis has a screen which has some really nice look to it its nice and crisp and very bright infact it is brighter then the ipad pro costing base model The build on this fire is INSANELY AWESOME running at only mm thick and the smooth glossy feel on the back it is really amazing to hold its like the futuristic tab in ur hands ',
 'I bought this for my grand daughter when she comes over to visi

### Remove extra spaces and convert to lowercase

Replacing the non-letter characters with spaces in the previous step can leave several consecutive spaces in the text. To remove them, we'll split the string into words using the string "split(...)" method and join the words back with a single space using the string "join(...)" method. This effectively removes all the extra spaces within the string. Then, we'll remove the leading and trailing spaces using the string "strip(...)" method and convert the text to lowercase using the string "lower()" method. To know more about these string operations, you can go through this [link](https://docs.python.org/3.3/library/stdtypes.html?highlight=split).

In [36]:
test_string = "I recommend      resolution with   GB of RAM     "
test_string_list = test_string.split()
test_string_list

['I', 'recommend', 'resolution', 'with', 'GB', 'of', 'RAM']

In [37]:
' '.join(test_string_list)

'I recommend resolution with GB of RAM'

In [38]:
test_string = "       I do not like this product     "
test_string.strip()

'I do not like this product'

In [39]:
test_string = "I did not like Deliver Package provided by AMAZON"
test_string.lower()

'i did not like deliver package provided by amazon'

In [40]:
def remove_spaces_and_convert_to_Lowercase(text):
    return ' '.join(text.split()).strip().lower()

In [41]:
reviews = [remove_spaces_and_convert_to_Lowercase(review) for review in reviews]
reviews[0:5]

['this product so far has not disappointed my children love to use it and i like the ability to monitor control what content they see with ease',
 'great for beginner or experienced person bought as a gift and she loves it',
 'inexpensive tablet for him to use and learn on step up from the nabi he was thrilled with it learn how to skype on it already',
 'i have had my fire hd two weeks now and i love it this tablet is a great value we are prime members and that is where this tablet shines i love being able to easily access all of the prime content as well as movies you can download and watch laterthis has a screen which has some really nice look to it its nice and crisp and very bright infact it is brighter then the ipad pro costing base model the build on this fire is insanely awesome running at only mm thick and the smooth glossy feel on the back it is really amazing to hold its like the futuristic tab in ur hands',
 'i bought this for my grand daughter when she comes over to visit i
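
The cleaning steps above can also be chained into a single function, which is convenient when new reviews have to be prepared in exactly the same way later on. This is a minimal sketch under the assumption that the steps are applied in the same order as in this notebook; the names `CONTRACTIONS` and `clean_review` are hypothetical.

```python
import re

# (pattern, replacement) pairs in the same order as the contraction rules above.
CONTRACTIONS = [
    (r"won't", "will not"), (r"can't", "can not"), (r"n't", " not"),
    (r"'re", " are"), (r"'s", " is"), (r"'d", " would"),
    (r"'ll", " will"), (r"'t", " not"), (r"'ve", " have"), (r"'m", " am"),
]

def clean_review(text):
    # 1. Remove hyperlinks.
    text = re.sub(r"http\S+", " ", text)
    # 2. Expand contractions.
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    # 3. Replace everything that is not a letter with a space.
    text = re.sub(r"[^A-Za-z]+", " ", text)
    # 4. Collapse extra spaces and convert to lowercase.
    return " ".join(text.split()).lower()

sample = "I can't recommend this! See https://example.com for the 2nd review."
cleaned = clean_review(sample)
print(cleaned)
```

Note that digits are dropped entirely, so "2nd" is reduced to "nd", just as "Fire HD 8" became "fire hd" in the output above.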

## Tokenization

We'll use the Keras "Tokenizer" class and its methods to perform tokenization, build the vocabulary and create the word-to-number mapping. To know more about the "Tokenizer" class, please consult this [link](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer).

In [42]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [43]:
VOCAB_SIZE = 30000
UNK_TOK = '<UNK>'

In [44]:
tokenizer = Tokenizer(num_words = VOCAB_SIZE, oov_token=UNK_TOK)

In [45]:
tokenizer.fit_on_texts(reviews)

In [46]:
sequences = tokenizer.texts_to_sequences(reviews)

In [47]:
print(sequences[0])
print(sequences[3])

[10, 47, 29, 155, 41, 13, 544, 11, 300, 25, 6, 21, 3, 5, 4, 43, 2, 404, 6, 1494, 212, 82, 311, 58, 257, 15, 415]
[4, 16, 61, 11, 36, 258, 206, 559, 112, 5, 4, 25, 3, 10, 17, 7, 9, 12, 269, 44, 37, 120, 756, 5, 20, 7, 416, 10, 17, 3351, 4, 25, 226, 133, 6, 315, 218, 40, 14, 2, 120, 311, 26, 72, 26, 134, 18, 27, 204, 5, 153, 7442, 41, 9, 87, 136, 41, 126, 79, 106, 376, 6, 3, 114, 106, 5, 884, 5, 31, 517, 7443, 3, 7, 1848, 188, 2, 156, 1446, 4510, 1337, 413, 2, 1050, 19, 10, 36, 7, 5867, 157, 879, 55, 90, 3352, 2894, 5, 2, 793, 5868, 441, 19, 2, 197, 3, 7, 79, 186, 6, 384, 114, 43, 2, 4511, 651, 22, 2249, 430]
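
The small numbers in the sequences above correspond to frequent words, because the word index is ranked by frequency. The following is a simplified sketch of that idea in plain Python, not the Keras implementation: it ignores the `num_words` cap and any punctuation filtering, and the helper names `build_word_index` and `to_sequence` are hypothetical.

```python
from collections import Counter

def build_word_index(texts, oov_token="<UNK>"):
    # Count every word, then give the lowest indices to the most frequent words.
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}  # index 0 is left free for padding
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, oov_token="<UNK>"):
    # Words that were never seen during fitting map to the OOV index.
    return [word_index.get(word, word_index[oov_token]) for word in text.split()]

corpus = ["great tablet great price", "great tablet"]
word_index = build_word_index(corpus)
sequence = to_sequence("great screen", word_index)
print(word_index)
print(sequence)
```

Here "screen" never appeared in the corpus, so it is encoded with the index of the out-of-vocabulary token.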


## Padding

We'll use the Keras "pad_sequences" function to pad shorter sequences with zeros and truncate longer ones, so that every sequence has the same length. To know more about this function, please go through this [link](https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences).

In [48]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [49]:
MAX_LEN = 32

In [50]:
padded_sequences = np.array(pad_sequences(sequences, 
                                          maxlen=MAX_LEN, 
                                          padding='post', 
                                          truncating='post'))

In [51]:
print(padded_sequences[0])
print(padded_sequences[1])
print(padded_sequences[3])

[  10   47   29  155   41   13  544   11  300   25    6   21    3    5
    4   43    2  404    6 1494  212   82  311   58  257   15  415    0
    0    0    0    0]
[  12    8  788   54 1927  585   34   26    9   96    5   38   68    3
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0]
[   4   16   61   11   36  258  206  559  112    5    4   25    3   10
   17    7    9   12  269   44   37  120  756    5   20    7  416   10
   17 3351    4   25]
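
The effect of `padding='post'` and `truncating='post'` can be illustrated on a single sequence in plain Python. This is only a sketch of the behaviour seen in the output above; the helper name `pad_post` is hypothetical.

```python
def pad_post(sequence, maxlen, value=0):
    # Keep at most maxlen tokens from the start, then fill the rest at the end.
    truncated = sequence[:maxlen]
    return truncated + [value] * (maxlen - len(truncated))

short_padded = pad_post([12, 8, 788], maxlen=5)
long_truncated = pad_post([4, 16, 61, 11, 36, 258], maxlen=5)
print(short_padded)
print(long_truncated)
```

With post-truncation the end of a long review is discarded, so any verdict the reviewer states in the last sentences never reaches the model.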


## Create the model

Import the Sequential model from Keras, along with the Embedding, Bidirectional, SimpleRNN, LSTM, Flatten and Dense layers. To know more about them, consult these links.
[sequential](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential), 
[embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding),
[bidirectional](https://keras.io/api/layers/recurrent_layers/bidirectional/),
[simpleRNN](https://keras.io/api/layers/recurrent_layers/simple_rnn/),
[flatten](https://keras.io/api/layers/reshaping_layers/flatten/),
[dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense).

To know more about the ReLU activation, check out this [link](https://deepai.org/machine-learning-glossary-and-terms/relu).

In [56]:
# Import everything from tensorflow.keras so that layers from the
# standalone keras package are not mixed with tensorflow.keras models.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, SimpleRNN, LSTM, Flatten, Dense

In [57]:
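def create_model():
    model = Sequential()
    # Map each word index to a 16-dimensional dense vector.
    model.add(Embedding(VOCAB_SIZE, 16, input_length=MAX_LEN))
    # The first recurrent layer returns the full sequence for the next layer.
    model.add(Bidirectional(SimpleRNN(64, return_sequences=True)))
    # The second recurrent layer returns only its final state (forward + backward).
    model.add(Bidirectional(SimpleRNN(64)))
    # The output is already 2D here, so Flatten leaves it unchanged.
    model.add(Flatten())
    model.add(Dense(24, activation='relu'))
    # Three outputs: negative, neutral, positive.
    model.add(Dense(3, activation='softmax'))
    return model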

In [58]:
def create_model_lstm():
    model = Sequential()
    model.add(Embedding(VOCAB_SIZE, 16, input_length=MAX_LEN))
    model.add(Bidirectional(LSTM(64, return_sequences=True)))
    model.add(Bidirectional(LSTM(64)))
    model.add(Flatten())
    model.add(Dense(24, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    return model

In [69]:
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D

def create_model_gatedCNN():
    # Note: despite the name, this is a stack of dilated 1D convolutions;
    # there is no gating mechanism in the layers below.
    model = Sequential()
    model.add(Embedding(VOCAB_SIZE, 16, input_length=MAX_LEN))
    # Doubling the dilation rate widens the context seen by each layer.
    model.add(Conv1D(64, 3, activation='relu', padding='same', dilation_rate=1))
    model.add(Conv1D(64, 3, activation='relu', padding='same', dilation_rate=2))
    model.add(Conv1D(64, 3, activation='relu', padding='same', dilation_rate=4))
    model.add(Conv1D(64, 3, activation='relu', padding='same', dilation_rate=8))
    # Keep the strongest response of each filter across the sequence.
    model.add(GlobalMaxPooling1D())
    model.add(Dense(24, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    return model
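
The dilation rates 1, 2, 4 and 8 determine how many input tokens can influence a single output position of the last convolution. A minimal sketch of that calculation follows, assuming stride 1 in every layer (the Conv1D default); the helper name `receptive_field` is hypothetical.

```python
def receptive_field(kernel_size, dilation_rates):
    # Each stride-1 layer widens the field by (kernel_size - 1) * dilation_rate.
    field = 1
    for rate in dilation_rates:
        field += (kernel_size - 1) * rate
    return field

field = receptive_field(3, [1, 2, 4, 8])
print(field)
```

The result is 31 tokens, so with sequences padded to 32 tokens a position near the middle of the last convolution can see almost the whole review.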

In [70]:
model = create_model_gatedCNN()

In [71]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [72]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 32, 16)            480000    
                                                                 
 conv1d (Conv1D)             (None, 32, 64)            3136      
                                                                 
 conv1d_1 (Conv1D)           (None, 32, 64)            12352     
                                                                 
 conv1d_2 (Conv1D)           (None, 32, 64)            12352     
                                                                 
 conv1d_3 (Conv1D)           (None, 32, 64)            12352     
                                                                 
 global_max_pooling1d (Globa  (None, 64)               0         
 lMaxPooling1D)                                                  
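
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard formulas for an embedding table (one vector per vocabulary entry) and for a 1D convolution with a bias term; the helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary index.
    return vocab_size * embedding_dim

def conv1d_params(kernel_size, in_channels, filters):
    # Weights for every (kernel position, input channel, filter) plus one bias per filter.
    return kernel_size * in_channels * filters + filters

print(embedding_params(30000, 16))   # embedding layer
print(conv1d_params(3, 16, 64))      # first Conv1D (input: 16 embedding channels)
print(conv1d_params(3, 64, 64))      # later Conv1D layers (input: 64 channels)
```

These values match the 480000, 3136 and 12352 reported by `model.summary()`; the embedding table holds by far most of the parameters.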
                                                      

## Model training

In [73]:
model.fit(padded_sequences, labels, epochs=10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f1eed73ea10>
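
Keras takes the `validation_split` fraction from the end of the arrays, before any shuffling. The label printout earlier ends with a run of negative rows, which suggests the data is not randomly ordered, so a shuffled split may give a more representative validation set. The following is a minimal sketch of such a split on indices only; the helper name `shuffled_split` and the seed are assumptions.

```python
import random

def shuffled_split(n_samples, val_fraction=0.2, seed=42):
    # Shuffle all indices reproducibly, then cut off the validation part.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = shuffled_split(10)
print(len(train_idx), len(val_idx))
```

The index lists could then be used as `padded_sequences[train_idx]` and `labels[train_idx]`, with the held-out part passed through the `validation_data` argument of `model.fit` instead of `validation_split`.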

## Sentiment prediction using trained model

In [74]:
positive_sample = 'i gave this as a christmas gift to my inlaws husband and uncle they \
loved it and how easy they are to use with fantastic features'

In [75]:
sample_sequence = tokenizer.texts_to_sequences([positive_sample])[0]
sample_sequence_padded = pad_sequences([sample_sequence], 
                                       maxlen=MAX_LEN, 
                                       padding='post', 
                                       truncating='post')

In [76]:
predictions = model.predict(sample_sequence_padded, verbose=0)
print(np.round(predictions, 3))

[[0.002 0.    0.998]]


In [77]:
predicted_label = np.argmax(predictions, axis=1)[0]
print("Review:", positive_sample)
print("Sentiment:", label_encoding[predicted_label])

Review: i gave this as a christmas gift to my inlaws husband and uncle they loved it and how easy they are to use with fantastic features
Sentiment: positive


In [78]:
negative_sample = 'if ads dont bother you then this may be a decent device purchased this \
for my kid and it was loaded down with so much spam it kept loading it up making \
it slow and laggy plus the carrasoul loadout makes it hard to navigate for kids \
not very kid friendly oh you can pay to remove the ads but it wont remove them all \
buy the samsung better everything'

In [79]:
sample_sequence = tokenizer.texts_to_sequences([negative_sample])[0]
sample_sequence_padded = pad_sequences([sample_sequence], 
                                       maxlen=MAX_LEN, 
                                       padding='post', 
                                       truncating='post')
predictions = model.predict(sample_sequence_padded, verbose=0)
predicted_label = np.argmax(predictions, axis=1)[0]
print("Review:", negative_sample)
print("Sentiment:", label_encoding[predicted_label])

Review: if ads dont bother you then this may be a decent device purchased this for my kid and it was loaded down with so much spam it kept loading it up making it slow and laggy plus the carrasoul loadout makes it hard to navigate for kids not very kid friendly oh you can pay to remove the ads but it wont remove them all buy the samsung better everything
Sentiment: negative
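
The last step of both examples, turning the probability vector into a label, can be sketched in plain Python. The helper name `decode_prediction` and the `LABELS` table are hypothetical; the probabilities are the rounded values printed for the positive sample above.

```python
# Same index-to-label mapping as the encoding used for training.
LABELS = {0: "negative", 1: "neutral", 2: "positive"}

def decode_prediction(probabilities):
    # Pick the class with the highest predicted probability.
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return LABELS[best_index], probabilities[best_index]

label, confidence = decode_prediction([0.002, 0.0, 0.998])
print(label, confidence)
```

Both sample reviews are already lowercase and free of punctuation. A raw review would first need the same cleaning steps that were applied to the training data, otherwise many of its words would map to the out-of-vocabulary token.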
