<a href="https://colab.research.google.com/github/tikendraw/Amazon-review-sentiment-analysis/blob/main/amazon-review-sentiment-analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Amazon Reviews for Sentiment Analysis

## Objective

Here we build ML and DL models to predict the polarity of reviews.
We run a series of experiments with different models to get the best classification metrics we can, without overloading the machine we have.

## About Dataset 
[Dataset here](https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews)


### OVERVIEW
Contains 34,686,770 Amazon reviews from 6,643,669 users on 2,441,053 products, from the Stanford Network Analysis Project (SNAP). This subset contains 1,800,000 training samples and 200,000 testing samples in each polarity sentiment.

### ORIGIN
The Amazon reviews dataset consists of reviews from amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. For more information, please refer to the following paper: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.

### DESCRIPTION
The Amazon reviews polarity dataset is constructed by taking review scores 1 and 2 as negative, and 4 and 5 as positive. Samples with score 3 are ignored. In the dataset, class 1 is the negative and class 2 is the positive. Each class has 1,800,000 training samples and 200,000 testing samples.
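
The labelling rule described above can be sketched in a few lines. This is only an illustration of the rule as stated; `rating_to_polarity` is a hypothetical helper name, not something shipped with the dataset.

```python
from typing import Optional


def rating_to_polarity(stars: int) -> Optional[int]:
    """Map a 1-5 star rating to the dataset's class index (1 = negative, 2 = positive)."""
    if stars in (1, 2):
        return 1  # negative
    if stars in (4, 5):
        return 2  # positive
    return None  # 3-star reviews are left out of the polarity dataset


ratings = [1, 2, 3, 4, 5]
labels = [rating_to_polarity(s) for s in ratings]
print(labels)
```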

If you need help extracting the train.csv and test.csv files check out the starter code.

The files train.csv and test.csv contain the training and testing samples as comma-separated values.

Each CSV has 3 columns: polarity, title and text. They correspond to the class index (1 or 2), the review title and the review text.

- polarity - 1 for negative and 2 for positive
- title - review heading
- text - review body

The review title and text are enclosed in double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
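
To make the escaping rules concrete, here is a small sketch that parses one row with Python's standard `csv` module. The row is made up for illustration (it is not taken from the dataset), and the sketch assumes the default `csv` dialect matches the file's quoting.

```python
import csv
import io

# One hypothetical row in the format described above (not a real dataset record).
raw = '"2","A ""great"" buy","Line one\\nLine two"\n'

# The default csv dialect already treats "" inside a quoted field as one literal quote.
reader = csv.reader(io.StringIO(raw))
polarity, title, text = next(reader)

# New lines are stored as the two characters backslash + n, so restore them explicitly.
text = text.replace('\\n', '\n')

print(int(polarity), title, text.splitlines())
```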

In [1]:
!git clone https://github.com/tikendraw/Amazon-review-sentiment-analysis.git
# `!cd` runs in a throwaway subshell, so use the %cd magic to actually change directory
%cd Amazon-review-sentiment-analysis

Cloning into 'Amazon-review-sentiment-analysis'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 16 (delta 6), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (16/16), 394.74 KiB | 3.56 MiB/s, done.


In [14]:
import pandas as pd
import numpy as np
import datetime
import tensorflow as tf
from tensorflow import keras
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tensorflow.keras.layers import Embedding ,LSTM, Dense, Dropout, Conv1D, MaxPool1D, BatchNormalization, TextVectorization
import matplotlib.pyplot as plt
!pip install polars -q
import polars as pl
!pip install wget -q
import wget
import tensorflow_hub as hub
import os
import re
import json
import tarfile


# if 'google.colab' in str(get_ipython()):
#     from google.colab import drive
#     drive.mount('/content/drive')


##importing useful functions
!git clone https://github.com/tikendraw/funcyou.git -q

from funcyou.metrics import calculate_results
# from funcyou.plot import plot_history, compare_histories
from funcyou.dataset import download_kaggle_dataset

# !pip install tensorflow_hub

print('Tf version: ',tf.__version__)
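physical_devices = tf.config.list_physical_devices('GPU')
gpu = len(physical_devices)
print('GPU: ', gpu)

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
os.environ["TFHUB_CACHE_DIR"] = './tmp/tfhub'

if gpu:
    # let TensorFlow grow GPU memory on demand instead of reserving it all up front
    tf.config.experimental.set_memory_growth(physical_devices[0], True)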


use_url = 'https://storage.googleapis.com/tfhub-modules/google/universal-sentence-encoder/4.tar.gz'



fatal: destination path 'funcyou' already exists and is not an empty directory.
Tf version:  2.9.2
GPU:  0


In [3]:
def download_USEncoder():
    try:
        print('downloading universal sentence encoder...')
        use_filename = wget.download(use_url)

        print('Downloaded!')
        # Extracting
        os.makedirs('universal_sentence_encoder', exist_ok = True)
        print('Extracting universal sentence encoder....')
        # open file
        file = tarfile.open(use_filename)
        
        # extracting file
        file.extractall('./universal_sentence_encoder')
        
        file.close()
        print('Extracted.')
    except Exception as e:
        print(e)


In [4]:
# Download the data if you don't have it locally
data_url = 'https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews'

def download_data(data_url):
    download_kaggle_dataset(url = data_url)
    print('Dataset Downloaded.')

    import zipfile
    with zipfile.ZipFile('./amazon-reviews.zip', 'r') as zip_ref:
        zip_ref.extractall('./dataset')
    print('Extracted.')




In [5]:
if 'google.colab' in str(get_ipython()):
    print('Running on CoLab')
    download_USEncoder()
    download = input('Did you upload kaggle.json?(Yes/No) ')
    if download in ['yes','Yes','Y','y']:
        print('Dataset Downloading...')
        download_data(data_url)
    
else:
  print('Not running on CoLab')


Running on CoLab
downloading universal sentence encoder...
Downloaded!
Extracting universal sentence encoder....
Extracted.
Did you upload kaggle.json?(Yes/No) y
Dataset Downloading...
kaggle datasets download -d kritanjalijain/amazon-reviews
0
Dataset Downloaded.
Extracted.


In [15]:
embed = hub.KerasLayer("./universal_sentence_encoder")

# Load the data

In [16]:
#reading data
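# read_csv loads the whole file into an in-memory polars DataFrame.
# The CSV appears to have no header row, so with the default has_header=True the first
# record is consumed as the header (hence 3,599,999 rows below instead of 3,600,000).
df = pl.read_csv('./dataset/train.csv',new_columns = ['polarity', 'title','text'])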


In [17]:
print('Shape: ', df.shape)
# print('Info: ',df.to_pandas().info())

Shape:  (3599999, 3)


In [18]:
# check for nulls and drop if any
df.null_count()

polarity,title,text
u32,u32,u32
0,0,0


In [19]:
#drop nulls (drop_nulls returns a new DataFrame, so the result has to be reassigned)
df = df.drop_nulls()




In [20]:
# df.to_pandas().isna().sum()

In [21]:
# checking for class imbalance
df['polarity'].value_counts()

polarity,counts
i64,u32
2,1799999
1,1800000


**Note:** The dataset is fairly large, so we will use TensorFlow's `tf.data` API to load and handle the data.

# We will map the polarity to 0 for negative sentiment and 1 for positive sentiment

In [22]:
df = df.with_columns([
                    # pl.col('polarity').apply(lambda x: 0 if x == 1 else 1).alias('polarity'),
                     pl.col('polarity').cast(pl.Int16, strict=False).alias('polarity')
                     ])
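
The cell above only casts the column (the mapping line is commented out), so the labels stay 1 and 2. A sigmoid output trained with binary cross-entropy expects 0 and 1, so the mapping in the heading still has to be applied somewhere. A minimal sketch of that mapping in plain Python follows; `to_binary_label` is a hypothetical helper, and the same arithmetic could presumably be written as the column expression `pl.col('polarity') - 1`, which has not been run against this notebook.

```python
def to_binary_label(polarity: int) -> int:
    """Map the dataset's class index (1 = negative, 2 = positive) to 0/1."""
    if polarity not in (1, 2):
        raise ValueError(f'unexpected polarity value: {polarity!r}')
    return polarity - 1


sample = [1, 2, 2, 1]
binary = [to_binary_label(p) for p in sample]
print(binary)
```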


In [23]:
df.sample(10)

polarity,title,text
i16,str,str
1,"""Sorry to...""","""...say but thi..."
1,"""Toy Built""","""Troy built bat..."
2,"""unexpectedly e...","""I just saw thi..."
2,"""BEST CD EVER""","""My boyfriend i..."
2,"""a triumph of g...","""Not a typical ..."
2,"""Hours Of Enter...","""Turn of the TV..."
1,"""Power problems...","""I tried these ..."
2,"""Wife loves thi...","""My wife and I ..."
2,"""Quiet Timeless...","""I played Eastm..."
1,"""Ineffective, u...","""There's not mu..."


## Note: We will combine the title and text columns into a single review column, which makes more sense than treating them separately.

In [24]:
#preprocessing functions to remove punctuation, lowercase strings, remove special characters and expand contractions
from funcyou.preprocessing.text import  text_cleaning_apos, cont_to_exp, text_cleaning


def clean_all(text):
    text = text_cleaning_apos(text)
    text = cont_to_exp(text)
    text = text_cleaning(text)
    return text

In [25]:
%%time
# with pandas
# print('Started at: ',datetime.datetime.now().strftime("%H:%M:%S"))
# print('This cell takes 10 minutes to process 3.6M data')
# #joining columns and cleaning the text 
# df['review'] = df['title']+' ' + df['text']
# # df['review'] = df['review'].apply(clean_all)
# df['review'] = df['review'].astype(np.object_)


df = df.with_columns([
    (pl.col('title')+' ' + pl.col('text')).alias('review')
])

# df['review'] = df['review'].apply(clean_all)
# df['review'] = df['review'].astype(np.object_)


CPU times: user 807 ms, sys: 760 ms, total: 1.57 s
Wall time: 1.54 s


In [26]:
# df.review.sample(1).values.dtype

In [27]:
df.sample(10)

polarity,title,text,review
i16,str,str,str
1,"""Should Have Be...","""Tres is a grea...","""Should Have Be..."
2,"""Helps me to Be...","""This is very h...","""Helps me to Be..."
2,"""Excellent sour...","""I consider mys...","""Excellent sour..."
2,"""Transition HDT...","""An excellent s...","""Transition HDT..."
2,"""Great""","""This album is ...","""Great This alb..."
2,"""Buy It Now!""","""No lover of po...","""Buy It Now! No..."
2,"""Avatar Extende...","""If you liked t...","""Avatar Extende..."
1,"""COPY OF MY SCE...","""I DON'T CARE W...","""COPY OF MY SCE..."
1,"""They used to b...","""I have been bu...","""They used to b..."
1,"""Avoid""","""This brush can...","""Avoid This bru..."


A dataframe to store results

In [63]:
#creating a dataframe to store results
all_result = pd.DataFrame(columns=['model','accuracy','precision','recall','f1','discription'])

In [64]:
def add_to_big_result(res:dict):
    global all_result
    res = pd.DataFrame([res])
    all_result = pd.concat([all_result, res], ignore_index=True)
    print(all_result)
    return all_result

# Data Preparation

In [29]:
xtrain, xtest, ytrain, ytest = train_test_split( df.select('review'), df.select('polarity'), test_size=  .001,  random_state = 89)
# xtrain, xval, ytrain, yval = train_test_split( xtrain, ytrain, test_size=  .05, random_state = 89)

print('xtrain shape',xtrain.shape, 'ytrain shape', ytrain.shape)
print('xtest shape',xtest.shape, 'ytest shape', ytest.shape)
# print('xval shape',xval.shape, 'yval shape', yval.shape)

xtrain shape (3596399, 1) ytrain shape (3596399, 1)
xtest shape (3600, 1) ytest shape (3600, 1)
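
The shapes above follow from how a fractional `test_size` is turned into row counts. As far as I know, scikit-learn rounds the test count up and gives the remainder to the training set; the sketch below reproduces that arithmetic under this assumption.

```python
import math

n_samples = 3_599_999   # rows in df, from the shape printed earlier
test_size = 0.001       # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)   # test rows are rounded up
n_train = n_samples - n_test                # everything else goes to training

print(n_train, n_test)
```

With only 3,600 held-out reviews the test metrics are cheap to compute, at the cost of a noisier estimate than the dataset's own 400,000-row test.csv would give.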


In [30]:
del(df) # deleting variables to keep the memory free
xtrain.head()

review
str
"""ack!! ""my dad ..."
"""Very noire If ..."
"""Too many chara..."
"""Poor Design Di..."
"""Do not buy! Th..."


In [33]:
ytrain.head()

polarity
i16
1
2
1
1
1


# Creating tensorflow dataset using `tf.data` api


In [None]:
BATCH_SIZE = 32

#train
# select the column first: a polars Series has to_list(), a DataFrame does not
train_feature = tf.data.Dataset.from_tensor_slices(xtrain['review'].to_list())
train_label = tf.data.Dataset.from_tensor_slices(ytrain['polarity'].to_list())
#test
test_feature = tf.data.Dataset.from_tensor_slices(xtest['review'].to_list())
test_label = tf.data.Dataset.from_tensor_slices(ytest['polarity'].to_list())
#val
# val_feature = tf.data.Dataset.from_tensor_slices(xval)
# val_label = tf.data.Dataset.from_tensor_slices(yval)

In [None]:
for i in train_feature.take(1):
    print(i)
    break

In [None]:
for i in train_label.take(1):
    print(i)
    break

In [None]:
BATCH_SIZE = 16
train_dataset = tf.data.Dataset.zip((train_feature, train_label))
train_dataset = train_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

test_dataset = tf.data.Dataset.zip((test_feature, test_label))
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

# the validation split is commented out above, so the validation dataset stays disabled too
# val_dataset = tf.data.Dataset.zip((val_feature, val_label))
# val_dataset = val_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

In [None]:
del(train_feature, train_label, test_feature, test_label) # deleting variables to keep the memory free

In [None]:
print('len train dataset: ', len(train_dataset))
print('len test dataset: ', len(test_dataset))
# print('len val dataset: ', len(val_dataset))  # enable once a validation split exists
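
The length of a batched dataset is the number of batches, not the number of reviews. A small sketch of that arithmetic, assuming the split sizes printed earlier and `BATCH_SIZE = 16` (`num_batches` is a hypothetical helper, not a TensorFlow function):

```python
def num_batches(n_examples: int, batch_size: int, drop_remainder: bool = False) -> int:
    """Number of batches produced when n_examples are grouped into batches of batch_size."""
    if drop_remainder:
        return n_examples // batch_size
    # integer ceiling division: the last, partially filled batch still counts
    return (n_examples + batch_size - 1) // batch_size


train_batches = num_batches(3_596_399, 16)
test_batches = num_batches(3_600, 16)
print(train_batches, test_batches)
```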

# Model:0 (Naive bayes model)

In [None]:
model0 = Pipeline([
    ('tfidf',TfidfVectorizer()),
    ('multino',MultinomialNB())
])
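
For intuition about what the `MultinomialNB` step does, here is a from-scratch sketch of multinomial Naive Bayes with Laplace smoothing on raw word counts. It is a simplification for illustration only: it skips the TF-IDF weighting, it is not scikit-learn's implementation, and the four toy reviews are invented.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Count words per class; alpha is the Laplace smoothing constant."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab, alpha


def predict_nb(model, doc):
    """Return the class with the highest log prior plus summed log likelihoods."""
    word_counts, class_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for word in doc.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            smoothed = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy reviews, using the dataset's labels: 1 = negative, 2 = positive.
docs = ["great product love it", "terrible waste of money",
        "love this great cd", "broke quickly terrible quality"]
labels = [2, 1, 2, 1]
model = train_nb(docs, labels)
print(predict_nb(model, "great cd"), predict_nb(model, "terrible product"))
```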

In [None]:
%%time
#fit and predict
# model0.fit(xtrain['review'], ytrain)

In [None]:
# pred0 = model0.predict(xtest['review'])

# print(pred0.shape ==  ytest.shape)
# print('pred00.shape: ',pred0.shape)
# print('ytest.shape: ',ytest.shape)

# model0_res = calculate_results(y_true=ytest, y_pred=pred0, model_name='model0: naive bayes')
# print(model0_res)

In [None]:
# predictions on texts are more accurate than on titles, simply because there are more words and combinations, which describe the sentiment better.

In [None]:
# all_result =add_to_big_result(model0_res)

# Text vectorization

In [None]:
# MAX_TOKEN = 10_000
# OUTPUT_SEQUENCE_LENGTH = 200  # limiting reviews to 200 words

In [None]:
# text_vectorizer = TextVectorization(max_tokens=MAX_TOKEN, standardize='lower_and_strip_punctuation',
#                                    split='whitespace',
#                                     ngrams= None ,
#                                     output_mode='int',
#                                     output_sequence_length=OUTPUT_SEQUENCE_LENGTH, 
#                                     pad_to_max_tokens=False)

In [None]:
# %%time
# #adapting to training data
# print("cell takes: ")
# text_vectorizer.adapt(train_feature)

# random_review = df['review'].sample(n = 1).values
# print('random Review: ', random_review)
# print('random Review length: ', len(random_review))
# print('-------\n')
# print('vectorized review: ',text_vectorizer(random_review))
# print('-------\n')
# print('Vocabulary_length: ',len(text_vectorizer.get_vocabulary()))
# print('Most frequent words: ',text_vectorizer.get_vocabulary()[:10])
# print('least frequent words: ',text_vectorizer.get_vocabulary()[-10:])
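
The commented cells above configure `TextVectorization` in `int` mode. The sketch below approximates what that mode does in plain Python: lowercase, strip punctuation, split on whitespace, map each token to an index (0 for padding, 1 for out-of-vocabulary), then pad or truncate to a fixed length. It is a rough stand-in, not the Keras implementation, and Keras may break frequency ties in the vocabulary differently.

```python
import string
from collections import Counter


def standardize(text):
    """Lowercase and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))


def build_vocab(texts, max_tokens):
    """Index the most frequent tokens; 0 is reserved for padding and 1 for unknown tokens."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}


def vectorize(text, vocab, seq_len):
    """Turn a string into exactly seq_len integer ids."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))


texts = ["Great CD, great price!", "Bad CD."]
vocab = build_vocab(texts, max_tokens=10)
print(vectorize("Great unknown CD", vocab, seq_len=5))
```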

# Embedding

In [None]:
# embedding = Embedding(input_dim = MAX_TOKEN,output_dim= 32, mask_zero=True, input_length=OUTPUT_SEQUENCE_LENGTH)
# print('Embedded text vectorized random sentence: ',embedding(text_vectorizer(random_review)))

# Model1

In [None]:
# inputs  = keras.Input(shape=(1,), dtype = tf.string)
# vectorizer_layer  = text_vectorizer(inputs)
# embedding_layer  = embedding(vectorizer_layer)

# x = LSTM(16, return_sequences=True)(embedding_layer)
# # x = LSTM(32, return_sequences=True)(x)
# x = LSTM(16)(x)
# x = Dropout(.4)(x)
# x = Dense(64, activation='relu')(x)
# outputs = Dense(1, activation = 'sigmoid')(x)

# #building model
# model1 = keras.Model(inputs = inputs, outputs = outputs, name = 'model1_lstm')

# #compiling model
# model1.compile(loss = keras.losses.binary_crossentropy,
#               optimizer = keras.optimizers.Adam(),
#               metrics = ['accuracy'])

In [None]:
# EPOCHS = 5
# print(len(train_dataset), len(val_dataset))

In [None]:
# %%time

#fit the model
# history1 = model1.fit(train_dataset, epochs = EPOCHS, 
#                       validation_data= val_dataset, 
#                       steps_per_epoch=int(.1*(len(train_dataset) / EPOCHS)),
#                       validation_steps=int(.1*(len(val_dataset) / EPOCHS)))
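
The `steps_per_epoch` expression above limits how much data each epoch sees. A worked sketch of the arithmetic, assuming 224,775 training batches (3,596,399 reviews at batch size 16) and `EPOCHS = 5`; `steps_for_fraction` is a hypothetical helper that mirrors the expression in the cell:

```python
def steps_for_fraction(total_batches: int, epochs: int, fraction: float) -> int:
    """Mirror of int(fraction * (len(train_dataset) / EPOCHS)) from the cell above."""
    return int(fraction * (total_batches / epochs))


train_batches = 224_775   # assumed: ceil(3_596_399 / 16)
epochs = 5

steps = steps_for_fraction(train_batches, epochs, 0.1)
batches_seen = steps * epochs   # total batches consumed over the whole run

print(steps, batches_seen, round(batches_seen / train_batches, 4))
```

So the whole run touches roughly 10% of the training batches, assuming each epoch continues through the dataset from where the previous one stopped.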

In [None]:
#@title Plot history function
def plot_history1(history, plot = ['loss','accuracy'], split = ['train','val'], epoch:int = None, figsize = (20,10),colors = ['r','b'], **plot_kwargs ):
    
    ''' Plots History

    Arguments:
    ###############
    history 	:	History to plot
    plot:list	:   what to plot (what metrics you want to compare)  -> ['loss', 'accuracy']  
    split:list  :   what split to compare -> ['train', 'val']
    epoch:int   :   for how many epochs to compare (cannot be greater than highest epoch of histories)
    figsize:tuple:  size of plot
    plot_kwargs :   kwargs to plt.plot to customize plot

    Returns:
    ##############
    Plots history 

    '''

    try:
        import matplotlib as mpl
        mpl.rcParams['figure.dpi'] = 500
        
        if len(colors) != len(split):
            raise ValueError('not enough colors')
        
        cols = []
        for i in plot:
            for j in split:
                if j == 'val':
                    cols.append(j+'_'+i)
                else:
                    cols.append(i)
        
        #compare to epoch
        if epoch is None:
            epoch = history.epoch

        def display(col, plot_num, history, epoch:int = None,label = None, **plot_kwargs):
            # one panel per metric; the train and val curves share a panel
            plt.subplot(1, len(plot), plot_num)
            plt.grid(True)
            
            if epoch is None:
                epoch = history.epoch
            
            if label is None:
                label=history.model.name
                
            plt.plot(epoch, pd.DataFrame(history.history)[col], label=label, **plot_kwargs)
            plt.title((' '.join(col.split('_'))).upper())
            plt.xlabel('epochs')
            plt.legend()
        
        plt.figure(figsize = figsize)
        plot_title = " ".join(plot).upper()+" PLOT"
        plt.suptitle(plot_title)

        for plot_num,col in enumerate(plot,1):
            display(col, plot_num, history, epoch, label = 'train',color = colors[0], **plot_kwargs)
            if 'val' in split:
                display('val_'+col, plot_num, history, epoch,label = 'val' ,color = colors[1])
    except Exception as e:
        print('Error Occurred: ',e)


In [None]:
#plot history
# plot_history1(history1, plot=['loss','accuracy'], figsize=(15,5))

#### Evaluation

In [None]:
# ypred1 = tf.squeeze(tf.round(model1.predict(xtest)))
# print('ypred1.shape: ',ypred1.shape)

# model1_res = calculate_results(ytest,ypred1, model_name='model1: LSTM', discription = 'small lstm model with vectorizer and embedding layer')
# print(model1_res)

## adding result to all_result 
# all_result = add_to_big_result(model1_res)

#### saving the model

In [None]:
# model2.save('./drive/MyDrive/amazon_review/saved_models/model2_lstm_use_layer.tf', save_format='tf')

In [None]:
#load the saved model
# model1_loaded = keras.models.load_model('./saved_models/model1_lstm.tf')

In [None]:
# ypred11 = tf.squeeze(tf.round(model1_loaded.predict(xtest)))
# model11_res = calculate_results(ytest,ypred11, model_name='model1: LSTM loaded')
# print(model11_res)
# all_result = add_to_big_result(model11_res)

# Model2

In [None]:
inputs = keras.Input(shape = [], dtype = 'string')
use_layer = embed(inputs)
print(use_layer.shape)
use_layer = tf.expand_dims(use_layer, axis = 1)
print(use_layer.shape) 
x = LSTM(32, return_sequences=True)(use_layer)
# x = LSTM(32, return_sequences=True)(x)
x = LSTM(16)(x)
x = Dropout(.4)(x)
x = Dense(64, activation='relu')(x)
outputs = Dense(1, activation = 'sigmoid')(x)

#building model
model2 = keras.Model(inputs = inputs, outputs = outputs, name = 'model2_use_layer')

#compiling model
model2.compile(loss = keras.losses.binary_crossentropy,
              optimizer = keras.optimizers.Adam(),
              metrics = ['accuracy'])

In [None]:
# from funcyou.callbacks import create_model_checkpoint
def create_model_checkpoint(model_name, save_dir, monitor:str = 'val_loss',verbose: int = 0, save_best_only: bool = True, save_weights_only: bool = False,
                            mode: str = 'auto', save_freq='epoch', options=None, initial_value_threshold=None, **kwargs):
    model_name = model_name+'-'+ str(datetime.datetime.now())
    dir = os.path.join(save_dir, model_name)

    if not os.path.exists(dir):
        os.makedirs(dir)

    return tf.keras.callbacks.ModelCheckpoint(
                                                dir,
                                                monitor = monitor,
                                                verbose = verbose,
                                                save_best_only = save_best_only,
                                                save_weights_only = save_weights_only,
                                                mode = mode,
                                                save_freq = save_freq,
                                                options=options,
                                                initial_value_threshold = initial_value_threshold,
                                                **kwargs)
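
One thing to watch in the helper above: `str(datetime.datetime.now())` contains spaces and colons, and the model name passed later (`'model2:use_lstm'`) contains a colon too. Some filesystems and sync tools rewrite such characters, which is consistent with the underscore-separated path loaded further down. A possible sketch of a filesystem-safe name follows; `checkpoint_dir` and its timestamp format are hypothetical choices, not part of `funcyou` or Keras.

```python
import datetime
import os


def checkpoint_dir(model_name, save_dir, now=None):
    """Build a checkpoint path without spaces or colons (hypothetical helper)."""
    now = now or datetime.datetime.now()
    stamp = now.strftime('%Y-%m-%d_%H-%M-%S')
    safe_name = model_name.replace(':', '_').replace(' ', '_')
    return os.path.join(save_dir, f'{safe_name}-{stamp}')


path = checkpoint_dir('model2:use_lstm', 'checkpoints',
                      now=datetime.datetime(2022, 11, 27, 6, 40, 52))
print(path)
```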


In [None]:
EPOCHS = 100
print(len(train_dataset))  # val_dataset only exists if the validation split above is enabled

In [None]:
# del(train_dataset, test_dataset, val_dataset)

In [None]:
# %%time
# #fit the model
# history2 = model2.fit(train_dataset, epochs = EPOCHS, 
#                       validation_data= val_dataset, 
#                       steps_per_epoch=int((len(train_dataset) / EPOCHS)),
#                       validation_steps=int(1*(len(val_dataset) / EPOCHS)),
#                       callbacks = [
#                                     create_model_checkpoint(model_name = 'model2:use_lstm', 
#                                     save_dir = '/content/drive/MyDrive/amazon_review/', 
#                                     monitor = 'val_accuracy',
#                                     save_best_only = True, 
#                                     save_weights_only = True,
#                                         mode= 'auto', save_freq='epoch')]
#                       )

In [None]:
# plot_history
# plot_history1(history2, plot=['loss','accuracy'], figsize=(15,5))

In [None]:
model2.load_weights('./amazon_review-20221127T193218Z-001/amazon_review/model2_use_lstm-2022-11-27 06_40_52.876175')

#### Evaluation

In [None]:
ypred2 = tf.squeeze(tf.round(model2.predict(test_dataset,
                                            use_multiprocessing=True)))
print('ypred2.shape: ',ypred2.shape)

ytest_true = [y for x,y in test_dataset.unbatch()]
print('ytest_true length: ',len(ytest_true))

model2_res = calculate_results(np.array(ytest_true),ypred2, model_name='model2: use layer lstm')
print('model2_res: ',model2_res)

all_result = add_to_big_result(model2_res)
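
For reference, the sketch below shows how accuracy, precision, recall and F1 follow from thresholded sigmoid outputs. It assumes 0/1 labels and treats 1 as the positive class; `binary_metrics` is a hypothetical helper, and `calculate_results` from `funcyou` may average the scores differently. The probabilities are invented.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Threshold probabilities, then compute accuracy, precision, recall and F1 for class 1."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': (tp + tn) / len(y_true),
            'precision': precision, 'recall': recall, 'f1': f1}


y_true = [1, 0, 1, 1, 0, 0]              # made-up labels
y_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]  # made-up sigmoid outputs
metrics = binary_metrics(y_true, y_prob)
print(metrics)
```

If the labels are still 1 and 2 at this point, they need to be shifted to 0 and 1 before being compared with rounded sigmoid outputs.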

# Model3: Conv1D

In [None]:
# Requires the `text_vectorizer` and `embedding` cells above to be uncommented and run first.
inputs = keras.Input(shape=(1,), dtype='string')
vect_layer = text_vectorizer(inputs)
embed_layer = embedding(vect_layer)  # trainable Embedding layer; the USE `embed` layer expects raw strings
x = Conv1D(64, 3, padding='same', activation='relu')(embed_layer)
# x = BatchNormalization()(x)
# x = MaxPool1D()(x)
x = keras.layers.GlobalMaxPool1D()(x)
outputs = Dense(1, activation='sigmoid')(x)  # one sigmoid unit for binary polarity

#building model
model3 = keras.Model(inputs=inputs, outputs=outputs, name='model3_conv1d')

#compiling model
model3.compile(loss=keras.losses.binary_crossentropy,
               optimizer=keras.optimizers.Adam(),
               metrics=['accuracy'])