<br>
<center><img src="https://i.imgur.com/KXbnThQ.png" width="500px"></center>
<br>

**Hi** reader, this is a small and simple guide to LSTMs. I will cover the basics you need to get started with LSTMs, from the underlying concepts to a code implementation. We will implement everything using TensorFlow 2.0.

<br>

# Table of Contents:

1. **Understand the data**
2. **Loading Data Using Pandas**
3. **Some Basic Exploratory Data Analysis**
4. **Data Preprocessing**
5. **Model Building**
6. **Train Test Split**
7. **Model Training**
8. **Model Evaluation**
9. **Conclusion**
<br>

### Understand the data:
The data is the `Amazon Fine Food Reviews` dataset, a collection of user reviews of food products sold on Amazon. You can download it [here](https://www.kaggle.com/snap/amazon-fine-food-reviews); the Kaggle page has more information about the dataset.

## Utils

In [43]:
import warnings
warnings.filterwarnings("ignore")                     #Ignoring unnecessary warnings

import numpy as np                                  #for large and multi-dimensional arrays
import pandas as pd                                 #for data manipulation and analysis
import nltk                                         #Natural language processing tool-kit

from nltk.corpus import stopwords                   #Stopwords corpus
from nltk.stem import PorterStemmer                 # Stemmer

from sklearn.feature_extraction.text import CountVectorizer          #For Bag of words
from sklearn.feature_extraction.text import TfidfVectorizer          #For TF-IDF
from gensim.models import Word2Vec                                   #For Word2Vec

from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Dense



## Loading Data Using Pandas:

In [1]:
import pandas as pd
pd.set_option('display.max_columns',None)           #show every column when displaying a frame
pd.set_option('display.max_rows',None)              #show every row when displaying a frame
df = pd.read_csv('../input/amazon-fine-food-reviews/Reviews.csv')
df.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


### Some Basic Exploratory Data Analysis:

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      568454 non-null  int64 
 1   ProductId               568454 non-null  object
 2   UserId                  568454 non-null  object
 3   ProfileName             568438 non-null  object
 4   HelpfulnessNumerator    568454 non-null  int64 
 5   HelpfulnessDenominator  568454 non-null  int64 
 6   Score                   568454 non-null  int64 
 7   Time                    568454 non-null  int64 
 8   Summary                 568427 non-null  object
 9   Text                    568454 non-null  object
dtypes: int64(5), object(5)
memory usage: 43.4+ MB


In [3]:
df.shape

(568454, 10)

In [4]:
df.isna().sum()

Id                         0
ProductId                  0
UserId                     0
ProfileName               16
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64

In [5]:
df1 = df[df['Score']!=3]
df1.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


In [6]:
df1.shape

(525814, 10)

### Data Preprocessing:

Here, we drop the data points where `UserId`, `ProfileName`, `Time`, and `Text` are all the same, keeping only the first occurrence.

In [32]:
df1 = df1.drop_duplicates(subset=["UserId","ProfileName","Time","Text"], keep='first')   #a list keeps the column order explicit

Here, we are keeping the values where `HelpfulnessNumerator` <= `HelpfulnessDenominator`.

In [33]:
df1 = df1[df1['HelpfulnessNumerator']<=df1['HelpfulnessDenominator']]
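
The two cleaning steps above can be tried on a tiny frame first. The following is a minimal sketch, assuming made-up rows that only borrow the column names from `Reviews.csv`; the values are purely illustrative.

```python
import pandas as pd

# Toy frame (made-up rows) with the same column names as Reviews.csv.
toy = pd.DataFrame({
    "UserId":      ["u1", "u1", "u2", "u3"],
    "ProfileName": ["ann", "ann", "bob", "cy"],
    "Time":        [100, 100, 200, 300],
    "Text":        ["great tea", "great tea", "stale chips", "ok"],
    "HelpfulnessNumerator":   [1, 1, 3, 0],
    "HelpfulnessDenominator": [1, 1, 2, 0],
})

# Rows 0 and 1 agree on all four key columns, so only row 0 survives.
deduped = toy.drop_duplicates(
    subset=["UserId", "ProfileName", "Time", "Text"], keep="first"
)

# Row 2 claims 3 helpful votes out of 2, which is impossible, so it is dropped.
clean = deduped[deduped["HelpfulnessNumerator"] <= deduped["HelpfulnessDenominator"]]

print(clean.index.tolist())  # [0, 3]
```

Note that the surviving rows keep their original index labels (0 and 3 here), which is why resetting the index matters before looping over positions later on.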

Since we treat this dataset as an NLP problem, we only use the `Text` column as the input data and `Score` as the output label.

In [34]:
#keep only the review text and its score, renamed to lowercase, with a fresh 0..n-1 index
df2 = df1[['Text','Score']].rename(columns={'Text':'text','Score':'score'}).reset_index(drop=True)
df2.head()

| | text | score |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | 5 |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | 1 |
| 2 | This is a confection that has been around a fe... | 4 |
| 3 | If you are looking for the secret ingredient i... | 2 |
| 4 | Great taffy at a great price. There was a wid... | 5 |


Let's check out some of the reviews, from index 500 up to (but not including) index 550.

In [35]:
for i in range(500,550):
    print(df2['text'].values[i])
    print('-----------------------------------------------------')

...you can absolutely forget about these. Confirmed by other reviewers, these chips are now total garbage. Like chewing on styrofoam packaging "peanuts". Positively awful, no hyperbole or exaggeration. I'll NEVER buy anything from Kettle brand ever again! From a reportedly once great "premium" brand, literally any mass market chip I've ever tried tastes better than these. Stale and rancid tasting, and virtually no salty taste whatsoever. Completely awful!
-----------------------------------------------------
These chips are nasty.  I thought someone had spilled a drink in the bag, no the chips were just soaked with grease.  Nasty!!
-----------------------------------------------------
Unless you like salt vinegar chips as salty as eating actual pinches of salt and drinking actual vinegar, i doubt you will like these chips.  These are the saltiest & sourest chips I have ever had, and the only reason stops me from throwing these away is because I paid for 2 full boxes and dont like to wa

In [36]:
df3 = df2.head(50000).copy()    #copy, so that assigning to a column later does not touch a view of df2
df3.head()

| | text | score |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | 5 |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | 1 |
| 2 | This is a confection that has been around a fe... | 4 |
| 3 | If you are looking for the secret ingredient i... | 2 |
| 4 | Great taffy at a great price. There was a wid... | 5 |


Here we create a function that marks scores 1 and 2 as negative and scores 4 and 5 as positive (reviews with a score of 3 were already removed above). We are doing binary classification, which is why we keep only the labels positive and negative.

In [37]:
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'

score_upd = df3['score']
t = score_upd.map(partition)
df3['score']=t

df3.head()

| | text | score |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | positive |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | negative |
| 2 | This is a confection that has been around a fe... | positive |
| 3 | If you are looking for the secret ingredient i... | negative |
| 4 | Great taffy at a great price. There was a wid... | positive |
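
One detail of the labelling rule is worth checking by hand: the function only tests `x < 3`, so it relies on neutral reviews having been filtered out beforehand. A small sketch of the same rule applied to every possible score (the list of scores is just an illustration):

```python
def partition(score):
    """Same rule as the cell above: scores below 3 are negative."""
    return "negative" if score < 3 else "positive"

scores = [1, 2, 3, 4, 5]
labels = {s: partition(s) for s in scores}
print(labels)
# {1: 'negative', 2: 'negative', 3: 'positive', 4: 'positive', 5: 'positive'}
```

A score of 3 would be labelled positive without any warning, so the `Score != 3` filter has to run before this mapping.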


In [38]:
df3.isna().sum()

text     0
score    0
dtype: int64

Here, `df_x` is the input text data and `df_y` is the output label.

In [39]:
df_x = df3['text']
df_y = df3['score']

In [41]:
stop_words = set(stopwords.words('english'))
len(stop_words) #number of English stop words

179

Here we do the real preprocessing: keeping only alphabetic characters, lowercasing everything, removing all stop words, and stemming the remaining words.

In [42]:
import re
snow = nltk.stem.SnowballStemmer('english')        #Snowball stemmer for English

corpus = []
for text in df3['text']:
    review = re.sub('[^a-zA-Z]', ' ', text)        #keep letters only
    review = review.lower().split()                #lowercase and split on whitespace
    #stem every word that is not a stop word; reuse the stop_words set built above
    #instead of reloading the stop-word list for every single word
    review = [snow.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

In [18]:
corpus[1]

'product arriv label jumbo salt peanut peanut actual small size unsalt sure error vendor intend repres product jumbo'
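
To see what the cleaning does before stemming, here is a minimal sketch that runs without NLTK. The stop-word set `TOY_STOP_WORDS` and the sample sentence are assumptions made for illustration; the notebook itself uses `stopwords.words('english')` and a Snowball stemmer.

```python
import re

# Tiny stand-in stop-word list, an assumption made so the sketch runs without
# downloading the NLTK corpus; the notebook uses stopwords.words('english').
TOY_STOP_WORDS = {"i", "the", "a", "as", "of", "is", "and", "to", "was", "it"}

def clean_review(text, stop_words=TOY_STOP_WORDS):
    """Keep letters only, lowercase, split, and drop stop words (no stemming)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

cleaned = clean_review("The chips arrived STALE, and it was 2 days late!!")
print(cleaned)  # chips arrived stale days late
```

Digits and punctuation disappear because they are replaced by spaces, and looking words up in a `set` is what keeps the stop-word check cheap on 50,000 reviews.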

In [19]:
df_x = corpus

In [20]:
type(df_x)

list

## Model Building:

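This is an important step. Here we turn each review into a sequence of integer word indices with Keras' `one_hot`, using a vocabulary size of 5000. Despite its name, `one_hot` does not build one-hot vectors; it hashes each word to an index below `voc_size`, so two different words can occasionally share an index.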

In [22]:
voc_size=5000
onehot_repr=[one_hot(words,voc_size)for words in corpus] 
type(onehot_repr)

list
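
The idea behind this encoding can be sketched in a few lines. This is an illustration of the hashing trick under my own assumptions (an MD5-based hash so the result is repeatable, and a hypothetical helper name `hashed_indices`), not the exact Keras implementation.

```python
import hashlib

def hashed_indices(text, voc_size):
    """Map each word to an integer in [1, voc_size - 1] with a stable hash.

    Illustration only: it mirrors the idea behind Keras' `one_hot`, not its
    exact implementation. Index 0 is left free for padding.
    """
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (voc_size - 1) + 1)
    return indices

doc = "product arriv label jumbo salt peanut peanut"
ids = hashed_indices(doc, voc_size=5000)
print(ids)

# With a tiny vocabulary, six distinct words must share at most two indices.
print(hashed_indices(doc, voc_size=3))
```

The same word always gets the same index (the two `peanut` tokens match), while a vocabulary that is too small forces unrelated words onto the same index. That trade-off is what `voc_size` controls.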

Padding is one of the most important steps before feeding the data to the model. Put simply, padding keeps the input size the same for every review by adding zeros at the front (`padding='pre'`). Here each review is represented by exactly 400 word indices: shorter reviews are zero-padded and longer ones are truncated.

In [23]:
sent_length=400
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[   0    0    0 ... 3458 3330 2186]
 [   0    0    0 ... 1777 3330 1834]
 [   0    0    0 ... 1528 4148 1991]
 ...
 [   0    0    0 ... 2070 3369 4429]
 [   0    0    0 ... 1245 2665 4883]
 [   0    0    0 ... 3739  679 3796]]
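
What pre-padding does to a single sequence can be written out by hand. The helper below is a hypothetical pure-Python sketch (assuming a positive `maxlen`), not the Keras function; it follows the `pad_sequences` defaults of padding and truncating at the front.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    if len(sequence) >= maxlen:
        return list(sequence[-maxlen:])
    return [value] * (maxlen - len(sequence)) + list(sequence)

short = pad_pre([7, 8, 9], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)  # [0, 0, 7, 8, 9]
print(long_)  # [3, 4, 5, 6]
```

With zeros at the front, the real words sit at the end of the sequence, closest to the final LSTM step whose state is used for the prediction.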


This is the model, built using TensorFlow 2.0. Think about the whole model like this:

First, it is a sequential model starting with an `Embedding` layer. You can think of it as a trainable counterpart of Word2Vec: Word2Vec is one example of an embedding, but its vectors are static once learned, whereas this layer is trained together with the rest of the network. Its weight matrix is 5000x40 (one 40-dimensional vector per vocabulary index), and for each review it outputs a 400x40 matrix. Next comes a `Dropout()` layer, which reduces the chance of overfitting. After that comes a single `LSTM` layer with 100 units. The layer reads the 400 embedded time steps in order, and each unit contributes one value to the final hidden state, so the output of the `LSTM` layer is a vector of size 100. Then we apply `Dropout` again. In the end, a `Dense` layer with a single `sigmoid` unit takes the 100-dimensional vector and gives a prediction, that is, a scalar value between 0 and 1.

We then set the optimizer to `adam`, the evaluation metric to `accuracy`, and the loss to `binary_crossentropy` (because we have framed this as a binary classification problem) using the `model.compile()` method.
In the last line we print the model architecture using the `model.summary()` method.
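
To make the shapes concrete, here is a minimal NumPy sketch of what one LSTM layer computes for a single review. The weights and the input are random stand-ins (assumptions for illustration), and the gate layout is one common convention; a trained Keras layer would hold learned values instead.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_hidden(inputs, W, U, b, units):
    """Run one LSTM layer over `inputs` (timesteps, features); return the final h.

    W: (features, 4*units), U: (units, 4*units), b: (4*units,)
    Gate blocks are laid out as input, forget, candidate, output.
    """
    h = np.zeros(units)  # hidden state
    c = np.zeros(units)  # cell state
    for x_t in inputs:
        z = x_t @ W + h @ U + b
        i = sigmoid(z[0 * units:1 * units])   # input gate
        f = sigmoid(z[1 * units:2 * units])   # forget gate
        g = np.tanh(z[2 * units:3 * units])   # candidate cell values
        o = sigmoid(z[3 * units:4 * units])   # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

rng = np.random.default_rng(0)
timesteps, features, units = 400, 40, 100
W = rng.normal(scale=0.1, size=(features, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)
review = rng.normal(size=(timesteps, features))  # stands in for one embedded review

h_last = lstm_last_hidden(review, W, U, b, units)
print(h_last.shape)  # (100,)
```

The loop consumes all 400 time steps but only the last hidden state is returned, which is why a 400x40 input turns into a single vector of size 100.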

In [24]:
## Creating model
embedding_vector_features=40
model=Sequential()
model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
model.add(Dropout(0.3))
model.add(LSTM(100))
model.add(Dropout(0.3))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 400, 40)           200000    
_________________________________________________________________
dropout (Dropout)            (None, 400, 40)           0         
_________________________________________________________________
lstm (LSTM)                  (None, 100)               56400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 256,501
Trainable params: 256,501
Non-trainable params: 0
_________________________________________________________________
None
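
The parameter counts in the summary can be checked by hand. The sketch below uses the standard LSTM count of four gate blocks, each with input weights, recurrent weights, and a bias; the variable names are my own.

```python
voc_size, embedding_dim, lstm_units = 5000, 40, 100

# One 40-dimensional vector per vocabulary index.
embedding_params = voc_size * embedding_dim
# Four gate blocks, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
# One weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 200000 56400 101 256501
```

The dropout layers add nothing because they have no weights, so the total matches the 256,501 trainable parameters reported above.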


Some preprocessing before feeding the data: we label-encode the `score` column.

In [25]:
from sklearn.preprocessing import LabelEncoder
encode = LabelEncoder()
df_y2 = encode.fit_transform(df_y)
type(df_y2)

numpy.ndarray
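
It helps to know which label ends up as 1, because the sigmoid output is the probability of that class. `LabelEncoder` assigns integers to the sorted class names; a plain-Python sketch of the same mapping, using the first five labels shown earlier as sample data:

```python
labels = ["positive", "negative", "positive", "negative", "positive"]

classes = sorted(set(labels))  # alphabetical order decides the integer codes
to_index = {name: idx for idx, name in enumerate(classes)}
encoded = [to_index[name] for name in labels]

print(to_index)  # {'negative': 0, 'positive': 1}
print(encoded)   # [1, 0, 1, 0, 1]
```

So a model output close to 1 means "positive" and an output close to 0 means "negative".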

This is an important part, where we make sure our data is in the form of NumPy n-dimensional arrays, which is the format we feed to the model rather than a pandas data frame.

In [26]:
import numpy as np
X_final=np.array(embedded_docs)
y_final=np.array(df_y2)

### Train Test Split:

In [27]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_final, y_final, test_size=0.2, random_state=42)

## Model Training:

In [29]:
#Feed the training data to the model; this is where training starts. I have used only 10 epochs and a batch_size
#of 64; you can choose lower or higher values for both as per your requirements.
model.fit(X_train,Y_train,validation_data=(X_test,Y_test),epochs=10,batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f16000c5750>

## Model Evaluation:

In [30]:
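#We held out X_test as the validation data, so here we validate the model by predicting on X_test.
#predict() returns sigmoid probabilities; thresholding at 0.5 gives the class labels
#(predict_classes() did the same, but it is not available in later TensorFlow releases).
y_pred_lstm = (model.predict(X_test) > 0.5).astype("int32")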

In [31]:
#We use the accuracy score to evaluate our model
from sklearn.metrics import accuracy_score
accuracy_score(Y_test,y_pred_lstm)

0.9012
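
Accuracy is a single number, so it can hide which kind of mistake the model makes. The sketch below thresholds some made-up sigmoid outputs and counts the four outcomes; the helper name `evaluate_binary` and the sample values are assumptions for illustration, not results from the model above.

```python
def evaluate_binary(y_true, probabilities, threshold=0.5):
    """Threshold sigmoid outputs and return accuracy plus confusion counts."""
    y_pred = [1 if p > threshold else 0 for p in probabilities]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positive, predicted positive
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negative, predicted negative
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negative, predicted positive
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positive, predicted negative
    return {"accuracy": (tp + tn) / len(y_true), "tp": tp, "tn": tn, "fp": fp, "fn": fn}

y_true = [1, 1, 1, 1, 0, 0]
probs = [0.9, 0.8, 0.7, 0.4, 0.6, 0.1]  # made-up model outputs
report = evaluate_binary(y_true, probs)
print(report)
```

In this toy case the accuracy is 4/6, with one missed positive and one false alarm. Looking at the counts for each class is a useful complement to the overall score when one class is much more common than the other.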

## Conclusion:

Our model gives more than 90% accuracy (0.9012 on the test split), which is very good.
This is how you train an RNN (here, an LSTM) using TensorFlow 2.0.

Thank you for reading.