<h1 align="center">Classifying Amazon Food Reviews using LSTMs </h1>

## Introduction

The Amazon Food Reviews dataset, published [here](https://www.kaggle.com/snap/amazon-fine-food-reviews) on Kaggle, contains nearly 500K user reviews collected on the site over more than ten years, up to 2012. In the original dataset, each review carries a rating from 1 to 5 given by the user, along with the review text. Based on this rating, the objective of this study is to predict whether a review is positive or negative.

I've already cleaned the data and saved it to disk. The two columns we'll use are `cleaned_text` and `Score`. `Score` is our target variable: it indicates whether a review is positive or negative, and we'll encode it as 1 for positive and 0 for negative.

Let's start by importing the libraries we need.

In [0]:
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import LSTM
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
import sqlite3
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
import keras_metrics
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
import scikitplot as skplt

### Mount the Google Drive

The cleaned dataframe is stored on my Google Drive and I'm using Colaboratory because it has access to a GPU environment. The following block of code attaches the drive to Google Colaboratory.

In [None]:
from google.colab import drive
drive.mount('/gdrive')

## Load the reviews from disk

The reviews are stored in a `.sqlite` file. Load them into a dataframe.

In [0]:
# load sqlite database
con = sqlite3.connect('/gdrive/My Drive/amazon/reviews_cleaned_final.sqlite')

In [66]:
#conn = sqlite3.connect('/gdrive/My Drive/amazon/reviews_cleaned_final.sqlite')
df = pd.read_sql('select * from Reviews;', con, index_col='index')
con.close()
df.head()

| index | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text | cleaned_text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | positive | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... | b'bought sever vital can dog food product foun... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | negative | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... | b'product arriv label jumbo salt peanut peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | positive | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... | b'confect around centuri light pillowi citru g... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | negative | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... | b'look secret ingredi robitussin believ found ... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | positive | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... | b'great taffi great price wide assort yummi ta... |
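
The `b'...'` prefix in `cleaned_text` indicates that the column comes back from SQLite as Python `bytes`, which is why later cells call `.decode('utf-8')` before splitting a review into words. A minimal sketch of that behaviour, using a hypothetical in-memory table rather than the real `reviews_cleaned_final.sqlite` file:

```python
import sqlite3

# Hypothetical in-memory database standing in for the real reviews file.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Reviews (Score TEXT, cleaned_text BLOB)")
con.execute(
    "INSERT INTO Reviews VALUES (?, ?)",
    ("positive", b"great taffi great price"),
)

# BLOB values are returned as bytes, so they must be decoded before splitting.
score, raw_text = con.execute(
    "SELECT Score, cleaned_text FROM Reviews"
).fetchone()
con.close()

tokens = raw_text.decode("utf-8").split()
print(type(raw_text).__name__, tokens)
```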


Only keep the `cleaned_text` and `Score` columns because these will be used for training the neural network.

In [0]:
df = df[['cleaned_text', 'Score']]

In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 364171 entries, 0 to 525813
Data columns (total 2 columns):
cleaned_text    364171 non-null object
Score           364171 non-null object
dtypes: object(2)
memory usage: 8.3+ MB


## Generate frequency counts of words

In [69]:
counter = Counter()

c = 0

for review in df.cleaned_text:
    for word in review.decode('utf-8').split():
        counter[word] += 1
    print(c, end='\r')
    c += 1



Create a dictionary mapping each of the 5,000 most frequent words to its frequency in the entire review corpus.

In [0]:
word_to_freq_DICT_5k = dict(counter.most_common(5000))

Reverse the above mapping (frequency to word) and store it in another variable.

In [0]:
freq_to_word_DICT_5k = {v:k for k, v in word_to_freq_DICT_5k.items()}

Now, generate mappings between a word and its index and vice versa. For example, `'abc': 4` means that 'abc' is the 4th most frequent word encountered in the text.

In [0]:
word_to_index_lookup = dict(zip(freq_to_word_DICT_5k.values(), range(1,5001)))
index_to_word_lookup = {v:k for k,v in word_to_index_lookup.items()}
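
One caveat with inverting a word-to-frequency dictionary is that it keeps only one word per distinct count, so words that share a frequency can drop out of the vocabulary and end up encoded as 0. A sketch of an alternative that ranks words directly from `Counter.most_common`, on a made-up toy corpus; `build_word_index` and `encode` are hypothetical helper names, not part of this notebook:

```python
from collections import Counter


def build_word_index(reviews, vocab_size):
    """Map each of the `vocab_size` most frequent words to its 1-based rank."""
    counter = Counter(word for review in reviews for word in review.split())
    # most_common returns (word, count) pairs sorted by descending count,
    # so enumerating them gives ranks without ever using counts as keys.
    ranked = counter.most_common(vocab_size)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def encode(review, word_index):
    """Replace each word by its rank; out-of-vocabulary words become 0."""
    return [word_index.get(word, 0) for word in review.split()]


# Toy corpus (made up): "tea" and "dog" both occur exactly twice.
toy_reviews = ["good tea good dog", "good tea dog", "bad"]
word_index = build_word_index(toy_reviews, vocab_size=3)

print(word_index)
print(encode("good dog bad", word_index))
```

With ids running from 1 to `vocab_size` and 0 kept for unknown words, an `Embedding` layer fed by such an index would need `input_dim=vocab_size + 1`.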

Create a new column in the dataframe to hold the index-vector representation of each review, i.e. each word in a review is replaced by its index from the mapping defined above, and words outside the vocabulary are replaced by 0. These index vectors are what will be given as input to the LSTM.

In [0]:
df['freq_vectors'] = df.cleaned_text

def text_to_word_frequency(review):
    # Replace each word by its index; words outside the vocabulary map to 0.
    return [word_to_index_lookup.get(word, 0) for word in review.decode('utf-8').split()]

df['freq_vectors'] = df.freq_vectors.map(text_to_word_frequency)

Here's what the new index-vectorized reviews look like.

In [75]:
print(df['freq_vectors'][2])

[0, 193, 0, 246, 0, 1378, 0, 262, 209, 0, 382, 692, 0, 0, 529, 238, 43, 692, 361, 1224, 522, 3, 171, 44, 396, 57, 0, 1477, 0, 0, 0, 0, 57, 0, 0, 354, 0, 1124, 0]


Map the `Score` variable from a string to an integer.

In [0]:
df.Score = df.Score.map({'positive' : 1, 'negative' : 0 })

## Train and Test data

Let's split the reviews 80:20 into Train and Test sets respectively.

In [0]:
X_train, X_test, y_train, y_test = train_test_split(df.freq_vectors.values,df.Score.values, test_size=0.2, random_state=13)

In [78]:
X_train.shape, y_train.shape,  X_test.shape,  y_test.shape

((291336,), (291336,), (72835,), (72835,))
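
The shapes above follow from applying the 80:20 ratio to the 364,171 reviews. A quick arithmetic check, assuming the test count is rounded up and the remainder goes to training (an assumption about the rounding, chosen because it reproduces the printed shapes):

```python
import math

n_reviews = 364171      # row count reported by df.info()
test_fraction = 0.2

# Assumed rounding: test count rounded up, training set takes the rest.
n_test = math.ceil(n_reviews * test_fraction)
n_train = n_reviews - n_test

print(n_train, n_test)
```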

## Truncate or Pad input sequences

Real-world reviews vary in length, but our neural network requires inputs of a consistent length. To that end, we'll fix the length of each review at 75 words: longer reviews are truncated, and reviews shorter than 75 words are padded with zeros.

In [79]:
# truncate and/or pad input sequences
max_review_length = 75
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

print(X_train.shape)
print(X_train[1])

(291336, 75)
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0  987  105
  378 1269    2  396  317    0   98  121  140   95    6  113  105    0
  364  369  105   43 1452  145  378  142   98  780  234    3    9  815
    0   85  151  370  495]


Shape of input train and test data after padding

In [80]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((291336, 75), (72835, 75), (291336,), (72835,))
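
The sample row printed earlier has its zeros on the left, which matches the default behaviour of `pad_sequences`: short sequences are padded at the front, and long ones lose their earliest tokens. A pure-Python sketch of that default, with `pad_or_truncate` as a hypothetical helper name:

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Left-pad short sequences; keep only the last `maxlen` items of long ones."""
    trimmed = list(seq)[-maxlen:]                    # drop tokens from the front if too long
    padding = [pad_value] * (maxlen - len(trimmed))
    return padding + trimmed                         # zeros go in front of the real tokens


print(pad_or_truncate([5, 8, 2], maxlen=5))              # [0, 0, 5, 8, 2]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

Note that 0 is also the index given to out-of-vocabulary words in this notebook, so padding and unknown words share the same id.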

## Model 1


Architecture: 

**[ 75(E) - 100(L) - 1(Sigmoid Output) ]**


where 

E = Embedding Layer

L = LSTM Layer

In [81]:
# create the model
top_words = 5000
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
#Refer: https://datascience.stackexchange.com/questions/10615/number-of-parameters-in-an-lstm-model

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 75, 32)            160000    
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None
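
The parameter counts in the summary can be reproduced by hand. The sketch below uses the usual LSTM count of four weight blocks (input, forget and output gates plus the cell candidate), each with input weights, recurrent weights and a bias:

```python
vocab_size = 5000     # top_words
embed_dim = 32        # embedding vector length
lstm_units = 100

# Embedding: one embed_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embed_dim

# LSTM: 4 blocks, each with (input + recurrent) weights plus one bias per unit.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense sigmoid output: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)
print(embedding_params + lstm_params + dense_params)
```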


## Compile the model

In [0]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', keras_metrics.precision(), keras_metrics.recall() ])
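
The `binary_crossentropy` loss penalises confident wrong predictions much more than hesitant ones. A small sketch of the per-sample formula averaged over a batch; the labels and probabilities below are made up for illustration and are not model output:

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1]
good_probs = [0.9, 0.1, 0.8]   # mostly right and confident
bad_probs = [0.1, 0.9, 0.2]    # mostly wrong and confident

print(binary_crossentropy(labels, good_probs))
print(binary_crossentropy(labels, bad_probs))
```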

## Fit the model

In [83]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)
#model.fit(X_train, y_train, nb_epoch=10, batch_size=32)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=1)
#print("Accuracy: %.2f%%" % (scores[1]*100))

Train on 291336 samples, validate on 72835 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [84]:
print("Accuracy: %.2f%%" % (scores[1]*100))
print("Precision: %.2f%%" % (scores[2]*100))
print("Recall: %.2f%%" % (scores[3]*100))

Accuracy: 92.58%
Precision: 94.40%
Recall: 96.96%
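
Accuracy, precision and recall all derive from the four confusion-matrix counts. A sketch with made-up labels, treating 1 (a positive review) as the positive class; `binary_metrics` is a hypothetical helper, not the `keras_metrics` implementation:

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Illustrative labels only, not taken from the model above.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]
print(binary_metrics(y_true, y_pred))
```

Since precision and recall share the same numerator, a recall higher than precision, as in the scores above, means the model makes more false positives than false negatives.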


## Model 2: With Dropout

**[ 75(E) - D - 100(L) - D - 1 (sigmoid output)]**

In [85]:
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 75, 32)            160000    
_________________________________________________________________
dropout_3 (Dropout)          (None, 75, 32)            0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dropout_4 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
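
As the summary shows, the `Dropout(0.2)` layers add no trainable parameters. During training, dropout zeroes a random 20% of its inputs and rescales the survivors so the expected activation is unchanged; at inference it passes inputs through untouched. A pure-Python sketch of that idea (the function name and seed are illustrative, not Keras internals):

```python
import random


def dropout(values, rate, training=True, seed=None):
    """Inverted dropout: zero each value with probability `rate`, scale survivors by 1/(1-rate)."""
    if not training or rate == 0:
        return list(values)              # inference: identity
    rng = random.Random(seed)
    keep_scale = 1.0 / (1.0 - rate)
    return [v * keep_scale if rng.random() >= rate else 0.0 for v in values]


activations = [1.0] * 10
print(dropout(activations, rate=0.2, seed=0))
print(dropout(activations, rate=0.2, training=False))
```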


## Compile the model

In [0]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy',keras_metrics.precision(), keras_metrics.recall()])

## Fit the model

In [87]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)
#model.fit(X_train, y_train, nb_epoch=10, batch_size=32)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=1)
#print("Accuracy: %.2f%%" % (scores[1]*100))

Train on 291336 samples, validate on 72835 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [88]:
print("Accuracy: %.2f%%" % (scores[1]*100))
print("Precision: %.2f%%" % (scores[2]*100))
print("Recall: %.2f%%" % (scores[3]*100))

Accuracy: 92.47%
Precision: 93.71%
Recall: 97.63%


In [1]:
#!pip install keras-metrics

## Conclusion

* We classified Amazon Food Reviews using LSTMs. The two architectures we tried were:
    
        [ 75(E) - 100(L) - 1(Sigmoid Output) ]
        [ 75(E) - D - 100(L) - D - 1 (sigmoid output)]
 where
 
        E = Embedding Layer
        D = Dropout Layer
        L = LSTM
        
 Both models gave similar performance on the test set. The accuracy, precision and recall obtained were:
    
        Model 1 (no dropout): Accuracy 92.58%, Precision 94.40%, Recall 96.96%
        Model 2 (dropout):    Accuracy 92.47%, Precision 93.71%, Recall 97.63%