## Problem Statement

### Dataset contains flipkart product reviews. Our main goal is predict the review sentiment of product. This will help flipkart to improve their products quality. This is binary machine learning classification problem. Here we aregoing to use RNN to predict the sentiments of the review.

### Import basic libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

### Load dataset

In [158]:
df=pd.read_csv("D:\\PGP IN DATA SCIENCE with Careerera\\Data Sets\\NLP Dataset\\flipkart_product_reviews.csv")
df.head()

Unnamed: 0,Product_name,Review,Rating
0,Lenovo Ideapad Gaming 3 Ryzen 5 Hexa Core 5600...,Best under 60k Great performanceI got it for a...,5
1,Lenovo Ideapad Gaming 3 Ryzen 5 Hexa Core 5600...,Good perfomence...,5
2,Lenovo Ideapad Gaming 3 Ryzen 5 Hexa Core 5600...,Great performance but usually it has also that...,5
3,DELL Inspiron Athlon Dual Core 3050U - (4 GB/2...,My wife is so happy and best product 👌🏻😘,5
4,DELL Inspiron Athlon Dual Core 3050U - (4 GB/2...,"Light weight laptop with new amazing features,...",5


**Drop some unwanted column**

In [159]:
df.drop('Product_name', axis=1, inplace=True)
df.columns

Index(['Review', 'Rating'], dtype='object')

**Creating sentiment column using rating**

In [160]:
def sent_class(rating):
    if rating>=3:
        return 1
    else:
        return 0
    
df['sentiment']=df['Rating'].apply(sent_class)
df['sentiment'].value_counts()

1    2074
0     230
Name: sentiment, dtype: int64

#### drop rating column

In [161]:
df.drop('Rating', axis=1, inplace=True)

### Understanding the dataset

In [50]:
df.shape

(2304, 3)

In [34]:
df.columns

Index(['Review', 'sentiment'], dtype='object')

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2304 entries, 0 to 2303
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Review     2304 non-null   object
 1   sentiment  2304 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 36.1+ KB


### Missing values

In [162]:
df.isna().sum()

Review       0
sentiment    0
dtype: int64

### Text preprocessing

In [163]:
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
lemma=WordNetLemmatizer()
stopwords=stopwords.words('english')

In [164]:
def text_preprocessing(text):
    text=re.sub("[^a-zA-Z]", " ", text)
    text=str(text).lower()
    text=[word for word in text.split(" ") if word not in stopwords]
    text=[lemma.lemmatize(word) for word in text]
    text= " ".join(text)
    text=re.sub(" +", " ", text)
    return text

df['clean_reviews']=df['Review'].apply(text_preprocessing)
df.head()

Unnamed: 0,Review,sentiment,clean_reviews
0,Best under 60k Great performanceI got it for a...,1,best k great performancei got around battery b...
1,Good perfomence...,1,good perfomence
2,Great performance but usually it has also that...,1,great performance usually also gaming laptop i...
3,My wife is so happy and best product 👌🏻😘,1,wife happy best product
4,"Light weight laptop with new amazing features,...",1,light weight laptop new amazing feature batter...


### Input and output features

In [165]:
X=df['clean_reviews']
y=df['sentiment']

In [166]:
X[:5]

0    best k great performancei got around battery b...
1                                     good perfomence 
2    great performance usually also gaming laptop i...
3                             wife happy best product 
4    light weight laptop new amazing feature batter...
Name: clean_reviews, dtype: object

### Train test split

In [167]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X,y, test_size=0.2, random_state=0)

### Text to numerical vector conversion using Word Embedding

In [168]:
from keras.preprocessing.text import Tokenizer

tokenizer=Tokenizer(num_words=20000)
tokenizer.fit_on_texts(x_train)

In [169]:
#text to sequence
x_train_seq=tokenizer.texts_to_sequences(x_train)
x_test_seq=tokenizer.texts_to_sequences(x_test)

In [170]:
print(x_train[0])
print(x_train_seq[0])

best k great performancei got around battery backup bit low thanks rapid charger fast display ok price range decent speaker many customisation optionsvantage software good customisationoverall good performance till nowwill update later problem occurs
[324, 491, 1041, 292, 17, 2, 37, 2, 38, 53, 6, 1, 46, 2, 8, 293, 86, 46, 2, 281, 1552, 2, 520, 18, 1553, 1554, 799, 363, 1555, 1556, 606, 1557, 2, 420, 1042, 606, 51, 565, 1558]


#### Using Padding so that all sentence have same length

In [171]:
df_copy=df.copy()
df_copy['num_words']=df['clean_reviews'].apply(lambda x : len(x.split(" ")))
df_copy['num_words'].max()

65

**Maximum words in a sentence is 65 so i will take 100 as max_sent_len**

In [172]:
from keras.utils import pad_sequences

max_sent_len=100
x_train_seq_pad=pad_sequences(x_train_seq, padding="post", maxlen=max_sent_len)
x_test_seq_pad=pad_sequences(x_test_seq, padding="post", maxlen=max_sent_len)

In [173]:
print(x_train[0])
print(x_train_seq[0])
print(x_train_seq_pad[0])

best k great performancei got around battery backup bit low thanks rapid charger fast display ok price range decent speaker many customisation optionsvantage software good customisationoverall good performance till nowwill update later problem occurs
[324, 491, 1041, 292, 17, 2, 37, 2, 38, 53, 6, 1, 46, 2, 8, 293, 86, 46, 2, 281, 1552, 2, 520, 18, 1553, 1554, 799, 363, 1555, 1556, 606, 1557, 2, 420, 1042, 606, 51, 565, 1558]
[ 324  491 1041  292   17    2   37    2   38   53    6    1   46    2
    8  293   86   46    2  281 1552    2  520   18 1553 1554  799  363
 1555 1556  606 1557    2  420 1042  606   51  565 1558    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]


#### Process output data

In [174]:
from keras.utils import to_categorical
y_train_class=to_categorical(y_train, num_classes=2)
y_test_class=to_categorical(y_test, num_classes=2)

In [175]:
y_train_class[0]

array([0., 1.], dtype=float32)

### Preparing to to feed to RNN

In [176]:
#reshaping to 3D
x_train_seq_pad=np.array(x_train_seq_pad).reshape((x_train_seq_pad.shape[0], x_train_seq_pad.shape[1],1))
x_test_seq_pad=np.array(x_test_seq_pad).reshape((x_test_seq_pad.shape[0], x_test_seq_pad.shape[1],1))
print(x_train_seq_pad.shape)
print(x_test_seq_pad.shape)

(1843, 100, 1)
(461, 100, 1)


### Build RNN model

In [177]:
import keras
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN, Activation
from tensorflow.keras.optimizers import Adam

In [178]:
model=Sequential()
model.add(SimpleRNN(10, input_shape=(100,1)))
model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.0001), metrics=['accuracy'])
model.summary()

Model: "sequential_13"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_12 (SimpleRNN)   (None, 10)                120       
                                                                 
 dense_12 (Dense)            (None, 2)                 22        
                                                                 
 activation_12 (Activation)  (None, 2)                 0         
                                                                 
Total params: 142
Trainable params: 142
Non-trainable params: 0
_________________________________________________________________


In [183]:
from keras.wrappers.scikit_learn import KerasClassifier
model.fit(x_train_seq_pad, y_train_class, epochs=10, batch_size=32, validation_data=(x_test_seq_pad, y_test_class))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x14316e74c10>

In [184]:
y_pred=np.argmax(model.predict(x_test_seq_pad), axis=-1)



In [185]:
y_pred

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [187]:
from sklearn.metrics import classification_report
print(classification_report(y_pred, y_test))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       1.00      0.91      0.95       461

    accuracy                           0.91       461
   macro avg       0.50      0.46      0.48       461
weighted avg       1.00      0.91      0.95       461



### MLFLOW

In [190]:
import mlflow
from sklearn.metrics import accuracy_score, precision_score, recall_score
with mlflow.start_run():
    model=Sequential()
    model.add(SimpleRNN(10, input_shape=(100,1)))
    model.add(Dense(2))
    model.add(Activation('softmax'))
    model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.0001), metrics=['accuracy'])
    model.fit(x_train_seq_pad, y_train_class, epochs=10, batch_size=32, validation_data=(x_test_seq_pad, y_test_class))
    model.save("rnn_flipkart_review.h5")
    
    y_pred=np.argmax(model.predict(x_test_seq_pad), axis=-1)
    acc=accuracy_score(y_test,y_pred)
    precision=precision_score(y_test,y_pred)
    recall=recall_score(y_test,y_pred)
    mlflow.log_metrics({"Accuracy":acc, "Precision":precision, "Recall":recall})

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
