<a href="https://colab.research.google.com/github/sahilpocker/Sentiment-Analysis/blob/master/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Analysis of Amazon.com reviews**

**Sentiment analysis** is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially to determine whether the writer's attitude towards a particular topic or product is positive or negative.

This notebook uses Amazon 'Health and Personal Care' product reviews, sourced from https://nijianmo.github.io/amazon/index.html



In [None]:
#Required Imports

import numpy as np
import pandas as pd
import tensorflow as tf
import os
import json
import gzip
from urllib.request import urlopen

In [None]:
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [None]:
!wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Health_and_Personal_Care_5.json.gz #download the data

--2020-08-26 15:47:53--  http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Health_and_Personal_Care_5.json.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85180885 (81M) [application/x-gzip]
Saving to: ‘reviews_Health_and_Personal_Care_5.json.gz’


2020-08-26 15:48:04 (7.41 MB/s) - ‘reviews_Health_and_Personal_Care_5.json.gz’ saved [85180885/85180885]



The downloaded file is a gzip (*.gz*) archive containing review data in *json* format, one JSON object per line.
It needs to be decompressed and parsed before it can be used.

In [None]:
# load the review data, one JSON object per line

data = []
with gzip.open('reviews_Health_and_Personal_Care_5.json.gz') as f:
    for line in f:
        data.append(json.loads(line.strip()))

# total length of the list; this equals the total number of reviews
print(len(data))

# first row of the list
print(data[0])

346355
{'reviewerID': 'ALC5GH8CAMAI7', 'asin': '159985130X', 'reviewerName': 'AnnN', 'helpful': [1, 1], 'reviewText': "This is a great little gadget to have around.  We've already used it to look for splinters and a few other uses.  The light is great.  It's a handy size.  However, I do wish I'd bought one with a little higher magnification.", 'overall': 5.0, 'summary': 'Handy little gadget', 'unixReviewTime': 1294185600, 'reviewTime': '01 5, 2011'}
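
The loading step above can be tried without downloading anything. The following is a minimal sketch, assuming two made-up review records (they are not from the dataset) that are compressed in memory and then read back line by line, with one `json.loads` call per line.

```python
import gzip
import io
import json

# Two hypothetical review records, serialized as one JSON object per line.
records = [
    {"reviewText": "Works as described.", "overall": 5.0},
    {"reviewText": "Stopped working after a week.", "overall": 1.0},
]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# Compress in memory so the example needs no file on disk.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
    gz.write(raw)
buffer.seek(0)

# Read it back line by line, parsing each non-empty line as JSON.
loaded = []
with gzip.open(buffer, mode="rt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines defensively
            loaded.append(json.loads(line))

print(len(loaded))
print(loaded[0]["overall"])
```

Opening the archive in text mode (`"rt"`) is a choice of this sketch; the notebook cell reads bytes, which `json.loads` also accepts.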


In [None]:
# convert the obtained list into pandas dataframe

df = pd.DataFrame.from_dict(data)

print(len(df))

346355


The total length of the DataFrame is 346355. Let us take a look at five rows of data, rows 25 to 29.

In [None]:
df.iloc[25:30]

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 25 | ADUTFUVJRHLKS | 1933622865 | Joel R. Wise | [0, 0] | You have to be in the right spot to eliminate ... | 4.0 | Helps with the small print | 1395014400 | 03 17, 2014 |
| 26 | A35W3JQYP0M655 | 1933622865 | John Thomas... "New England...USA" | [68, 87] | I recently saw this at a local AC Moore store.... | 1.0 | ...Wear Glasses And Forget This | 1306627200 | 05 29, 2011 |
| 27 | A26EXMDN188M0 | 1933622865 | Lysan | [0, 0] | Less helpful than a good magnifying glass. Jus... | 2.0 | Disappointed! | 1374796800 | 07 26, 2013 |
| 28 | A22DXDIYXPBVSP | 1933622865 | M. Heck | [1, 1] | Several reviews of similar items reported &#34... | 5.0 | Nice and sturdy | 1367625600 | 05 4, 2013 |
| 29 | A2B3T42QBKDFFX | 1933622865 | Raymond Holderman "raaymond1" | [0, 0] | This magnifies modestly well, is lightweight, ... | 3.0 | it's adequate | 1397088000 | 04 10, 2014 |


There are a lot of unnecessary columns, such as *reviewerID*, *asin*, and *reviewerName*. Our interest lies mainly in the *reviewText* itself and the *overall* rating, which is the rating out of 5.

So let us drop all the other columns from the DataFrame.

In [None]:
df1 = df.drop(['reviewerID','asin','reviewerName','unixReviewTime','reviewTime','helpful','summary'],axis=1)


Since this is a binary classification into positive and negative reviews, the overall rating (out of 5) has to be converted into a positive or negative label.
Reviews with 1 or 2 stars are treated as negative, and reviews with 3 or more stars as positive.

In [None]:
df1['target'] = (df1['overall'] > 2).astype(int) #create new column 'target' which gives 1 if 'overall' is greater than 2, 0 otherwise.
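
The thresholding rule can be checked on a handful of values. The following is a small sketch in plain Python, assuming made-up ratings and a hypothetical helper `rating_to_target`; it applies the same `overall > 2` rule and counts how many examples fall into each class.

```python
from collections import Counter


def rating_to_target(overall, threshold=2):
    """Return 1 (positive) if the star rating is above the threshold, else 0."""
    return int(overall > threshold)


# Hypothetical ratings, not taken from the dataset.
ratings = [5.0, 4.0, 3.0, 2.0, 1.0, 5.0]
targets = [rating_to_target(r) for r in ratings]

print(targets)
print(Counter(targets))
```

Counting the labels this way is a quick check for class imbalance, which matters when reading an accuracy figure later on.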

In [None]:
df1 = df1.drop('overall',axis = 1) #since we no longer need 'overall'

Since the total length of the DataFrame is huge, let us take only the first 50,000 rows for simplicity.

In [None]:
df2 = df1[:50000].copy() #copy the slice so that pop() does not modify a view of df1
target = df2.pop('target') #store target variable (0/1) in a separate Series


In [None]:
dataset = (
    tf.data.Dataset.from_tensor_slices(
        (
            tf.cast(df2['reviewText'].values, tf.string),
            tf.cast(target.values, tf.int32)
        )
    )
) #convert the dataframe into Tensorflow dataset

Let us take a look at 5 entries in the dataset and their associated labels (targets).

In [None]:
for feat, targ in dataset.take(5):
  print ('Features: {}, Target: {}'.format(feat, targ))


Features: b"This is a great little gadget to have around.  We've already used it to look for splinters and a few other uses.  The light is great.  It's a handy size.  However, I do wish I'd bought one with a little higher magnification.", Target: 1
Features: b'I would recommend this for a travel magnifier for the occasional reading.I had read on another review about a magnifier having a problem with the light coming on. I did find that this one appeared to be DOA out of the box. But, after opening & shutting the viewer to turn on & off the light, the light began to come on. After several times of doing this, the light appears to be coming on all the time.It is small, but for taking it someplace & reading things like a menu in a dark corner of a restaurant, this is great.', Target: 1
Features: b"What I liked was the quality of the lens and the built in light.  Then lens had no discernable distortion anywhere.  It magnified everything evenly without the ripples and  distortion that I've 

Next, let us split the data into train, validation, and test sets. Let the split be 70% train, 15% validation, and 15% test.

In [None]:
DATASET_SIZE = len(dataset)


train_size = int(0.7 * DATASET_SIZE)
val_size = int(0.15 * DATASET_SIZE)
test_size = int(0.15 * DATASET_SIZE)

In [None]:
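raw_train_dataset = dataset.take(train_size)         # first 70% of the data
raw_test_dataset = dataset.skip(train_size)          # remaining 30%, held here temporarily
raw_val_dataset = raw_test_dataset.skip(val_size)    # tail of the remainder -> validation
raw_test_dataset = raw_test_dataset.take(test_size)  # head of the remainder -> test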
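
To see that this `take`/`skip` pattern produces three non-overlapping pieces, here is a sketch in plain Python. The helpers `take` and `skip` are hypothetical stand-ins that imitate the dataset methods on an ordinary list; they are not part of TensorFlow.

```python
from itertools import islice


def take(items, n):
    """Imitate Dataset.take: keep the first n elements."""
    return list(islice(items, n))


def skip(items, n):
    """Imitate Dataset.skip: drop the first n elements, keep the rest."""
    return list(islice(items, n, None))


items = list(range(100))  # stand-in for 100 dataset elements
train_size = int(0.7 * len(items))
val_size = int(0.15 * len(items))
test_size = int(0.15 * len(items))

train = take(items, train_size)
rest = skip(items, train_size)
val = skip(rest, val_size)    # tail of the remainder
test = take(rest, test_size)  # head of the remainder

# The three pieces are disjoint and together cover every element.
print(len(train), len(val), len(test))
print(set(train) & set(val), set(train) & set(test), set(val) & set(test))
```

Because the validation piece is built with `skip` rather than `take`, any elements left over from rounding the sizes down end up in the validation set. Note also that the split follows the original row order, since nothing is shuffled beforehand.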

The next step is to vectorise the review text (map each word to an integer) for training.

In [None]:
max_features = 10000 #maximum vocabulary size
sequence_length = 250 #every review is padded or truncated to this many tokens

vectorize_layer = TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)

In [None]:
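# Make a text-only dataset (without labels), then call adapt to build the vocabulary.
# Note: this adapts on the full dataset; mapping over raw_train_dataset instead
# would keep validation and test text out of the vocabulary.
train_text = dataset.map(lambda x, y: x)
vectorize_layer.adapt(train_text)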

In [None]:
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

Take a look at which words a couple of integer indices map to in the vocabulary.

In [None]:
print("1287 ---> ",vectorize_layer.get_vocabulary()[1287])
print(" 313 ---> ",vectorize_layer.get_vocabulary()[313])
print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary())))

1287 --->  safety
 313 --->  face
Vocabulary size: 10000
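
What the vectorization layer does can be approximated in plain Python. The following is a simplified sketch, not the layer's actual implementation: it lowercases text, strips punctuation, splits on whitespace, ranks tokens by frequency, reserves index 0 for padding and index 1 for unknown words, and pads or truncates to a fixed length. The function names and the alphabetical tie-breaking are assumptions made for illustration.

```python
import re
import string
from collections import Counter

PAD, UNK = "", "[UNK]"  # reserved entries at index 0 and index 1


def standardize(text):
    """Lowercase the text and strip punctuation."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())


def build_vocabulary(texts, max_tokens):
    """Reserved entries first, then the most frequent tokens."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Descending count, then alphabetical so ties are deterministic
    # (the tie-breaking rule is an assumption of this sketch).
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return [PAD, UNK] + ranked[: max_tokens - 2]


def vectorize(text, vocabulary, sequence_length):
    """Map tokens to indices, then truncate or zero-pad to a fixed length."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


texts = ["Great light, great size.", "The light stopped working."]
vocab = build_vocabulary(texts, max_tokens=6)
print(vocab)
print(vectorize("Great light but tiny!", vocab, sequence_length=6))
```

Words outside the vocabulary ("but", "tiny") all map to index 1, and the trailing zeros are padding, which mirrors how short reviews are filled up to `sequence_length` in the notebook.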


Now vectorise all three sets separately.

In [None]:
train_ds = raw_train_dataset.map(vectorize_text)
val_ds = raw_val_dataset.map(vectorize_text)
test_ds = raw_test_dataset.map(vectorize_text)

In [None]:
train_ds = train_ds.map(lambda x_text, x_label: (x_text, tf.expand_dims(x_label, -1))) #to match dimensions of 'target' while fitting model.
val_ds = val_ds.map(lambda x_text, x_label: (x_text, tf.expand_dims(x_label, -1)))
test_ds = test_ds.map(lambda x_text, x_label: (x_text, tf.expand_dims(x_label, -1)))

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

In [None]:
embedding_dim = 16

Design the model. It consists of an embedding layer followed by dropout, a bidirectional LSTM, a dense layer with 32 units, another dropout layer, and finally an output layer with 1 unit.

In [None]:
model = tf.keras.Sequential([
  layers.Embedding(max_features + 1, embedding_dim),
  layers.Dropout(0.2),
  layers.Bidirectional(tf.keras.layers.LSTM(16)),
  layers.Dense(32),
  layers.Dropout(0.2),
  layers.Dense(1)])

model.summary() #view model summary

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 16)          160016    
_________________________________________________________________
dropout_3 (Dropout)          (None, None, 16)          0         
_________________________________________________________________
bidirectional (Bidirectional (None, 32)                4224      
_________________________________________________________________
dense_1 (Dense)              (None, 32)                1056      
_________________________________________________________________
dropout_4 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 165,329
Trainable params: 165,329
Non-trainable params: 0
________________________________________________
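
The parameter counts in the summary can be reproduced by hand. The following sketch recomputes them from the layer sizes used above, using the standard formulas for embedding, LSTM, and dense layers.

```python
vocab_rows = 10000 + 1  # max_features + 1, as passed to Embedding
embedding_dim = 16
lstm_units = 16
dense_units = 32

# Embedding: one vector of size embedding_dim per vocabulary row.
embedding = vocab_rows * embedding_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bidirectional = 2 * lstm_one_direction

# Dense(32) receives the concatenated forward and backward states (2 * 16 values).
dense_hidden = (2 * lstm_units) * dense_units + dense_units

# Output layer: one weight per hidden unit plus one bias.
dense_output = dense_units * 1 + 1

total = embedding + bidirectional + dense_hidden + dense_output
print(embedding, bidirectional, dense_hidden, dense_output, total)
```

Almost all of the parameters sit in the embedding table, so the vocabulary size and embedding dimension dominate the model size.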

Compile the model with binary cross-entropy computed on logits, and set up early stopping on the training loss.

In [None]:
model.compile(loss=losses.BinaryCrossentropy(from_logits=True), optimizer='adam', metrics=tf.metrics.BinaryAccuracy(threshold=0.0))

callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
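
The final layer has no activation, so the model outputs raw logits. That is why the loss uses `from_logits=True` and the accuracy metric uses a threshold of 0.0. The following sketch in plain Python illustrates both points with made-up logits: a logit above 0 corresponds to a sigmoid probability above 0.5, and cross-entropy can be computed directly from the logit. The helper names are hypothetical, and the stable formula shown is a common way to write this loss, not necessarily how TensorFlow implements it internally.

```python
import math


def sigmoid(z):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(z, y):
    """Binary cross-entropy for one example, computed from the logit.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def bce_from_probability(p, y):
    """Binary cross-entropy for one example, computed from the probability."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


logits = [-2.0, -0.1, 0.0, 0.3, 4.0]  # made-up model outputs
for z in logits:
    # Thresholding the logit at 0 gives the same decision as
    # thresholding the probability at 0.5.
    assert (z > 0.0) == (sigmoid(z) > 0.5)
    # Both ways of computing the loss agree.
    assert math.isclose(bce_from_logit(z, 1), bce_from_probability(sigmoid(z), 1))

print([round(sigmoid(z), 3) for z in logits])
```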

In [None]:
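epochs = 8
batch_size = 32
# Note: Keras advises against passing batch_size together with a tf.data.Dataset,
# because the dataset is expected to produce the batches itself. As built above,
# each dataset element appears to act as a batch of one review; to change that,
# call .batch(...) on the datasets instead.
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs,
    batch_size=batch_size,
    callbacks=[callback]) #train the model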

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


Training should reach roughly 95% accuracy on the train set and 90% on the validation set. The gap between the two suggests that the model is overfitting (high variance). Let us now evaluate it on the test set.

In [None]:
loss, accuracy = model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: ", accuracy)

Loss:  0.3001821041107178
Accuracy:  0.9111999869346619


On the test set the model reaches an accuracy of about 91%, which is decent. Reducing the overfitting and tuning the model further could push the accuracy higher.

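Finally, build an export model that bundles the vectorization layer, the trained model, and a sigmoid activation, so that it accepts raw text and outputs probabilities.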

In [None]:
export_model = tf.keras.Sequential([
  vectorize_layer,
  model,
  layers.Activation('sigmoid')
])

export_model.compile(
    loss=losses.BinaryCrossentropy(from_logits=False), optimizer="adam", metrics=['accuracy']
)

