# LSTM Deep Learning Model

The purpose of this notebook is to create a simple LSTM (Long Short-Term Memory) neural network model that can be deployed to production. LSTM networks are a specific kind of RNN widely used in natural language processing because they are capable of "remembering" past states. This notebook does not explain in detail how RNNs or LSTMs work; for more details head to my [RNN](https://www.kaggle.com/code/bropen24/recurrent-neural-networks-rnn) or [LSTM](https://www.kaggle.com/code/bropen24/character-lstm-in-pytorch) posts on Kaggle.

The focus of this project is not on fine-tuning the parameters, but on setting up a model that can be saved properly for deployment. The model is deployed by the application `app.py`, located in the current directory.

In [1]:
# Import dependencies
import numpy as np
import pandas as pd

In [2]:
# Load all required tensorflow dependencies for LSTM 
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Embedding
from tensorflow.keras.layers import Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [3]:
# scikit-learn and other utilities
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

In [4]:
# regular expressions and other utils
import re
import pickle

## Data loading  

Let's now load the data and validate its size. In this case we have a file of movie reviews that need to be tokenized before we can apply an LSTM to them. We will perform some basic data analysis while processing the data.

In [5]:
df = pd.read_csv('reviews_dataset.tsv.zip', header=0, delimiter="\t", quoting=3)


In [6]:
df.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [7]:
df['sentiment'].unique()

array([1, 0])

In [8]:
# Validate that the sentiment 1 means a good review
df.loc[4, 'review']

'"Superbly trashy and wondrously unpretentious 80\'s exploitation, hooray! The pre-credits opening sequences somewhat give the false impression that we\'re dealing with a serious and harrowing drama, but you need not fear because barely ten minutes later we\'re up until our necks in nonsensical chainsaw battles, rough fist-fights, lurid dialogs and gratuitous nudity! Bo and Ingrid are two orphaned siblings with an unusually close and even slightly perverted relationship. Can you imagine playfully ripping off the towel that covers your sister\'s naked body and then stare at her unshaven genitals for several whole minutes? Well, Bo does that to his sister and, judging by her dubbed laughter, she doesn\'t mind at all. Sick, dude! Anyway, as kids they fled from Russia with their parents, but nasty soldiers brutally slaughtered mommy and daddy. A friendly smuggler took custody over them, however, and even raised and trained Bo and Ingrid into expert smugglers. When the actual plot lifts off

Let's drop the `id` column and keep the `sentiment` and `review` columns, which are the ones we will focus on. As we can see from the previous cells, there are only two possible sentiment values, 0 and 1, representing a **Negative** and a **Positive** review respectively.

#### Class balance

As part of our process, we need to check whether the data is fairly balanced. If it is not, some adjustments are needed so that training does not favor one sentiment over the other.

In [9]:
df = df.drop(columns=['id'])

In [10]:
df.head()

Unnamed: 0,sentiment,review
0,1,"""With all this stuff going down at the moment ..."
1,1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,0,"""The film starts with a manager (Nicholas Bell..."
3,0,"""It must be assumed that those who praised thi..."
4,1,"""Superbly trashy and wondrously unpretentious ..."


In [11]:
# Let's get the shape, in other words, the size of the DF we will be dealing with
df.shape

(25000, 2)

In [12]:
# Find out if we have a balanced DF
df['sentiment'].value_counts()

sentiment
1    12500
0    12500
Name: count, dtype: int64

#### Data Cleanup  

We now need to start cleaning up the data. To do this, we turn everything to lowercase and then apply a simple regular expression that removes every character that is not a letter, a digit or whitespace from the review text.

In [13]:
df['review'] = df['review'].apply(lambda x: x.lower())
# Raw string so that \s is not treated as an (invalid) string escape
df['review'] = df['review'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))
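
As a quick check of what the two cleanup steps do, here is a minimal sketch on a single string. The helper name `clean_review` and the sample sentence are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

def clean_review(text: str) -> str:
    """Lowercase the text and drop everything except letters, digits and whitespace."""
    return re.sub(r'[^a-z0-9\s]', '', text.lower())

sample = "Superbly trashy 80's exploitation, hooray!"  # made-up sample sentence
cleaned = clean_review(sample)
print(cleaned)  # superbly trashy 80s exploitation hooray
```

Note that removed characters are not replaced by a space, so contractions and hyphenated words are merged: "we're" becomes "were" and "pre-credits" becomes "precredits", as the cleaned review in the notebook shows.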

In [14]:
df.head()

Unnamed: 0,sentiment,review
0,1,with all this stuff going down at the moment w...
1,1,the classic war of the worlds by timothy hines...
2,0,the film starts with a manager nicholas bell g...
3,0,it must be assumed that those who praised this...
4,1,superbly trashy and wondrously unpretentious 8...


In [15]:
df.loc[4, 'review']

'superbly trashy and wondrously unpretentious 80s exploitation hooray the precredits opening sequences somewhat give the false impression that were dealing with a serious and harrowing drama but you need not fear because barely ten minutes later were up until our necks in nonsensical chainsaw battles rough fistfights lurid dialogs and gratuitous nudity bo and ingrid are two orphaned siblings with an unusually close and even slightly perverted relationship can you imagine playfully ripping off the towel that covers your sisters naked body and then stare at her unshaven genitals for several whole minutes well bo does that to his sister and judging by her dubbed laughter she doesnt mind at all sick dude anyway as kids they fled from russia with their parents but nasty soldiers brutally slaughtered mommy and daddy a friendly smuggler took custody over them however and even raised and trained bo and ingrid into expert smugglers when the actual plot lifts off 20 years later theyre facing the

We can quickly see the difference in the review at index 4, from the first time we looked at it (cell [8]) to the last (cell [15]). The review is now free of punctuation marks and other non-value-adding characters. We now have the text in a format we can tokenize.

#### Tokenizing

Before we start tokenizing the reviews, we will limit the vocabulary to the thousand most frequent words found in the "corpus" of reviews, which limits the number of features. To do this we pass the argument `num_words` when creating the `Tokenizer` instance; the `fit_on_texts()` method then builds the word index from the reviews.

Then we will pad the resulting sequences so that all of them have the same length, that of the longest one. By default, `pad_sequences` adds the padding zeros at the start of each sequence.

For more information on how [Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) works, follow the link.
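
To make the two steps concrete, here is a simplified pure-Python stand-in for what the tokenizer and the padding do. It is not Keras's implementation: the function names are hypothetical, and the behaviour it mimics (indices ranked by frequency starting at 1, only indices below `num_words` kept, zeros added at the front) reflects my understanding of the Keras defaults.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace words by their index, dropping words ranked num_words or lower."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_pre(sequences, value=0):
    """Left-pad every sequence with `value` up to the longest length."""
    maxlen = max(len(s) for s in sequences)
    return [[value] * (maxlen - len(s)) + s for s in sequences]

texts = ["the movie was great", "the movie was bad", "great great acting"]
word_index = fit_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=4)
padded = pad_pre(sequences)
print(padded)
```

With `num_words=4` only the three most frequent words survive, so rare words such as "bad" and "acting" simply disappear from the sequences before padding.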

In [16]:
max_features = 1000
tokenizer = Tokenizer(num_words=max_features, split=' ')
tokenizer.fit_on_texts(df['review'].values)

In [17]:
X = tokenizer.texts_to_sequences(df['review'].values)

In [18]:
X = pad_sequences(X)

In [19]:
# Let's check the length of the longest sequence by looking at the shape of the resulting np.array
X.shape

(25000, 1473)

#### Embedding layer

Embedding layers are used in Natural Language Processing tasks as an alternative to one-hot encoding (followed by dimensionality reduction). They also let us train our own word embeddings instead of using pre-trained ones.

Once the text is tokenized, each word is an index into a vocabulary of, in our case, a thousand items, so a one-hot encoding would need a thousand-element vector per word. This makes the model too cumbersome, so we apply an embedding layer that maps each word to a dense vector of fixed size, which is more manageable. This layer works as a kind of lookup table, where words are the keys and the dense vectors are the values.

To apply the Embedding layer in our model, we define the layer size and add it as the first layer of the model. The model is then responsible for training this layer to adequate values by means of the loss and optimization strategies defined for it. We define the Embedding layer with `Embedding(input_dim, output_dim, input_length)`; for more information on [Embedding](https://keras.io/api/layers/core_layers/embedding/), follow the link.
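
The lookup-table idea can be sketched with plain NumPy. This is only an illustration: the matrix is filled with random numbers instead of trained weights, and the token ids are made up. It shows that indexing rows of the matrix gives the same result as multiplying one-hot vectors by it, without ever building the one-hot vectors.

```python
import numpy as np

vocab_size, embed_dim = 1000, 50   # same sizes as max_features and embed_dim

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))  # stand-in for trained weights

token_ids = np.array([0, 12, 7, 12])           # a made-up padded sequence
looked_up = embedding_matrix[token_ids]        # row lookup: one dense vector per token

one_hot = np.eye(vocab_size)[token_ids]        # the equivalent one-hot encoding
via_matmul = one_hot @ embedding_matrix        # same vectors, far more work

print(looked_up.shape)                         # (4, 50)
print(np.allclose(looked_up, via_matmul))      # True
```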

In [20]:
embed_dim = 50  # Embedding layer size

In [22]:
# LSTM Model sequential model setup using tensorflow
model = Sequential()
model.add(Embedding(max_features, embed_dim, input_length=X.shape[1])) # input_length is given by the size of the input vectors

#### LSTM Layer

Since the model is sequential, we add an [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM#attributes) layer with an output size of 10 by passing `LSTM(10)` to `model.add()`.

#### Dense layer

We will now define the final or output layer of our model by setting up a [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layer with 2 outputs (corresponding to the positive and negative sentiments). The activation function to use will be softmax.

In [23]:
# Add an LSTM layer and a single Dense output layer with a softmax activation function
model.add(LSTM(10))
model.add(Dense(2, activation='softmax'))

In [24]:
# Define loss function and optimizer for model training
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [25]:
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 1473, 50)          50000     
                                                                 
 lstm (LSTM)                 (None, 10)                2440      
                                                                 
 dense (Dense)               (None, 2)                 22        
                                                                 
Total params: 52462 (204.93 KB)
Trainable params: 52462 (204.93 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
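
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; the variable names are my own.

```python
vocab_size, embed_dim, lstm_units, n_classes = 1000, 50, 10, 2

# Embedding: one embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with (input + recurrent) weights plus a bias per unit
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per (unit, class) pair plus a bias per class
dense_params = lstm_units * n_classes + n_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 50000 2440 22 52462
```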


#### Setup y

Now we need to set up y as a boolean array with one column per possible sentiment value of a review. To do this, we will use `pd.get_dummies()`, which produces a one-hot encoding of the sentiment values. Column 0 corresponds to sentiment 0 and column 1 to sentiment 1.

In [27]:
# Setup y
y = pd.get_dummies(df['sentiment']).values
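
For intuition, the same one-hot layout can be built with NumPy alone. This is a stand-in for `pd.get_dummies()`, not what the notebook runs: it yields 0/1 integers rather than booleans, and the labels are the sentiment values of the first five rows shown earlier.

```python
import numpy as np

labels = np.array([1, 1, 0, 0, 1])        # sentiment of the first five reviews
one_hot = np.eye(2, dtype=int)[labels]    # row i has a 1 in column labels[i]

print(one_hot)
# [[0 1]
#  [0 1]
#  [1 0]
#  [1 0]
#  [0 1]]

# argmax over the columns recovers the original labels
recovered = one_hot.argmax(axis=1)
```

Because the column index equals the sentiment value, taking the `argmax` of a model output later maps directly back to 0 (negative) or 1 (positive).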

In [30]:
# Split train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=99)

In [32]:
# Check the sizes of the resulting splits
print('X-train size:{}, y-train size:{}'.format(X_train.shape,y_train.shape))
print('X-test size:{}, y-test size:{}'.format(X_test.shape,y_test.shape))

X-train size:(18750, 1473), y-train size:(18750, 2)
X-test size:(6250, 1473), y-test size:(6250, 2)


#### Model training

After training we get a pretty decent accuracy score of around 87%, so we can move on to saving the model without trying to improve the accuracy (at least for the moment).

In [33]:
model.fit(X_train, y_train, epochs=5, verbose=1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7f85d251fd30>

#### Test the model

To test the model we will create a sample review and run it through the same tokenizer and padding we used for the dataframe reviews. This way the data is encoded in the same way as the training data.

We will define a clearly negative review to test the model. As we can see below, the prediction returns the correct sentiment.

In [34]:
test = ['Movie was pathetic'] # Negative review
test = tokenizer.texts_to_sequences(test)
test = pad_sequences(test, maxlen=X.shape[1], dtype='int32', value=0) # Add padding
print(test.shape)

(1, 1473)


In [35]:
# Make predictions
sentiment = model.predict(test)[0]
if(np.argmax(sentiment)==0):
    print('Negative')
elif(np.argmax(sentiment)==1):
    print('Positive')

Negative
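
The decoding step can also be wrapped in a small helper that returns the label together with its probability. This is a sketch: `decode_sentiment` is a hypothetical name and the probabilities below are made up rather than taken from the model.

```python
import numpy as np

LABELS = ('Negative', 'Positive')  # index 0 -> sentiment 0, index 1 -> sentiment 1

def decode_sentiment(probabilities):
    """Return the most likely label and its probability for one softmax output."""
    probabilities = np.asarray(probabilities)
    index = int(np.argmax(probabilities))
    return LABELS[index], float(probabilities[index])

label, confidence = decode_sentiment([0.91, 0.09])  # made-up softmax output
print(label, confidence)  # Negative 0.91
```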


### Save the model for deployment

Now we will save the model and the tokenizer for deployment into production: the tokenizer in pickle format, the model architecture as JSON and the model weights in HDF5 format.

In [36]:
# Pickle dump the tokenizer step
with open('tokenizer.pickle', 'wb') as tk:
    pickle.dump(tokenizer, tk, protocol=pickle.HIGHEST_PROTOCOL)

In [37]:
# Save the model architecture as JSON
model_json = model.to_json()
with open('model.json', 'w') as js:
    js.write(model_json)

In [38]:
model.save_weights('model.h5') # Save model weights in HDF5 format
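
On the deployment side, the three files have to be read back in the reverse order. The following is a minimal sketch of how that could look, not the actual contents of `app.py`: `load_artifacts` is a hypothetical helper, and it assumes the same TensorFlow/Keras version that produced the files is installed.

```python
import pickle

def load_artifacts(tokenizer_path='tokenizer.pickle',
                   architecture_path='model.json',
                   weights_path='model.h5'):
    """Rebuild the tokenizer and the model from the three saved files."""
    # Imported lazily so this module can be imported without TensorFlow present
    from tensorflow.keras.models import model_from_json

    with open(tokenizer_path, 'rb') as tk:
        tokenizer = pickle.load(tk)          # the fitted Tokenizer instance

    with open(architecture_path) as js:
        model = model_from_json(js.read())   # architecture only, untrained

    model.load_weights(weights_path)         # restore the trained weights
    return tokenizer, model
```

A call such as `tokenizer, model = load_artifacts()` would then give the application what it needs to predict. New reviews must be padded to the same length used in training (`X.shape[1]`, which is 1473 here), since the tokenizer alone does not store that value.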