# LSTM Sentiment Classifier

This document outlines how to build a sentiment classifier using a Long Short-Term Memory (LSTM) neural network. The model is trained on a dataset of 50000 movie reviews (Maas et al. 2011).

### Installing Basic Dependencies

In [51]:

!pip install tensorflow
!pip install pandas
!pip install matplotlib
!pip install scikit-learn
!pip install -U nltk







In [2]:
import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))


[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


In [3]:
import os
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Dropout, TimeDistributed, Bidirectional, SpatialDropout1D
from tensorflow.keras.utils import to_categorical
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer  # unused: the lemmatiser was preferred
from nltk.stem import WordNetLemmatizer

# Download the stopword list and the WordNet data used by the lemmatiser
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hungy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hungy\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Reading in the Dataset
Individual reviews are initially stored as text files. These reviews have been collated into two CSV files, which are read, combined into one dataframe, and shuffled.

In [4]:

df1 = pd.read_csv("trainreviews.csv")
df2 = pd.read_csv("testreviews.csv")
df = pd.concat([df1, df2], ignore_index=True)
df = df.sample(frac=1)

print(df)

                                                 reviews  Positivity
49648  Contains SpoilersThis is a Peter Watkins film ...           1
36485  Very good s movie about mob operations in New ...           1
10126  I have a month old and got really tired of wat...           1
10139  For the big thinkers among us The Intruder is ...           0
48473  I must have been around ten years old when my ...           1
...                                                  ...         ...
538    Comedies often have the unfortunate reputation...           1
27093  Its easy to see why many people consider In th...           1
35629  Tonights film course film was The Legend of th...           0
13077  There are laughs in this film that is for sure...           0
49583  I could almost wish this movie had not been ma...           0

[50000 rows x 2 columns]
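The collation step itself is not shown in this notebook. A minimal sketch of how it might look is given below, assuming the raw reviews sit in `pos` and `neg` subfolders as one `.txt` file per review. The folder layout and the function name `collate_reviews` are assumptions made for illustration; only the column names `reviews` and `Positivity` come from the dataframe above.

```python
import csv
from pathlib import Path


def collate_reviews(root, out_csv):
    """Write every review under root/pos and root/neg to a single CSV file.

    The pos/neg folder layout is an assumption made for this sketch; the
    column names match the dataframe printed above.
    """
    rows = []
    for folder, label in (("pos", 1), ("neg", 0)):
        # sorted() keeps the row order reproducible across runs
        for path in sorted(Path(root, folder).glob("*.txt")):
            rows.append((path.read_text(encoding="utf-8"), label))

    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["reviews", "Positivity"])
        writer.writerows(rows)
    return len(rows)
```

Under these assumptions, calling the function once for the training folder and once for the test folder would produce two CSV files like the ones read above.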


### Processing Data

##### Cleaning Data
The dataframes are checked for rows with a missing review, and any such rows are dropped. Fortunately, there are no empty reviews in this dataset.

In [5]:
print(df1[df1['reviews'].isnull()])
print(df2[df2['reviews'].isnull()])
df = df.dropna()


Empty DataFrame
Columns: [reviews, Positivity]
Index: []
Empty DataFrame
Columns: [reviews, Positivity]
Index: []


##### Lemmatising Words and Further Cleaning
The dataset contains unwanted text such as website links, HTML tags, and numbers. To reduce noise within the dataset (and to aid computation), this unwanted text is removed.
The remaining words are then lemmatised, so that inflected forms share a single token. For instance, the word "eats" is reduced to "eat".

In [6]:

lemmatizer = WordNetLemmatizer()
#stopwds = set(stopwords.words("english"))
#stopwds.remove('not')

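def clean(words):
    cleaned = re.sub(r"<.*?/>", "", words)  # self-closing HTML tags such as <br />
    cleaned = re.sub(r"http:[^\s]*\s", "", cleaned)  # http links, up to and including the next whitespace
    cleaned = re.sub(r"[^a-zA-Z0-9\s]+", "", cleaned)  # get rid of special characters
    cleaned = re.sub(r"\w*\d\w*", "", cleaned)  # words that contain a digit
    return cleaned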
    

def lem(words):
    words = clean(words)
    words = words.lower().split()
    #words = [w for w in words if not w in stopwds and len(w) >= 3]
    #words = [w for w in words]
    words = [lemmatizer.lemmatize(w) for w in words]
    return words


df['reviews'] = df['reviews'].map(lem)
print(df)

                                                 reviews  Positivity
49648  [contains, spoilersthis, is, a, peter, watkins...           1
36485  [very, good, s, movie, about, mob, operation, ...           1
10126  [i, have, a, month, old, and, got, really, tir...           1
10139  [for, the, big, thinker, among, u, the, intrud...           0
48473  [i, must, have, been, around, ten, year, old, ...           1
...                                                  ...         ...
538    [comedy, often, have, the, unfortunate, reputa...           1
27093  [it, easy, to, see, why, many, people, conside...           1
35629  [tonight, film, course, film, wa, the, legend,...           0
13077  [there, are, laugh, in, this, film, that, is, ...           0
49583  [i, could, almost, wish, this, movie, had, not...           0

[50000 rows x 2 columns]
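To make the effect of the cleaning step concrete, the sketch below applies the same four substitutions to a made-up review. The sample sentence is invented for illustration, and lemmatisation is left out so that the example runs without the NLTK data.

```python
import re


def clean(text):
    """Apply the same four substitutions as the notebook's clean() function."""
    text = re.sub(r"<.*?/>", "", text)            # self-closing HTML tags such as <br />
    text = re.sub(r"http:[^\s]*\s", "", text)     # http links followed by whitespace
    text = re.sub(r"[^a-zA-Z0-9\s]+", "", text)   # punctuation and other symbols
    text = re.sub(r"\w*\d\w*", "", text)          # any word containing a digit
    return text


# Invented sample text, not taken from the dataset.
sample = "Great film!<br />See http://example.com for more. Rated 10 out of 10"
tokens = clean(sample).lower().split()
print(tokens)
# ['great', 'filmsee', 'for', 'more', 'rated', 'out', 'of']
```

Note that a tag is replaced with an empty string rather than a space, so the words on either side of it are fused ("film" and "See" become "filmsee"). Substituting a single space instead would keep them apart.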


##### Tokenisation
A tokeniser is used to convert words into discrete tokens. First, the function `fit_on_texts` builds an internal vocabulary that maps every word in the dataset to an integer index. The function `texts_to_sequences` then converts each review (each row of words above) into a list of these indices. Finally, `pad_sequences` pads (with zeros) or truncates every list to a common length, set here to roughly the mean review length.

In [7]:
revls = []
for review in df['reviews']:
    revls.append(len(review))
input_len = int(np.mean(revls) + 1)

print(f"input length: {input_len}")
    
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['reviews'])
vocab = len(tokenizer.word_index) + 1 
print(f"vocabulary: {vocab}")
sequences = tokenizer.texts_to_sequences(df['reviews'])
data = pad_sequences(sequences, maxlen=input_len, padding='post', truncating='post')
print(data)


input length: 228
vocabulary: 140601
[[ 1332 18141     6 ...    80  8093    33]
 [   51    47   148 ...     0     0     0]
 [    9    24     2 ...     0     0     0]
 ...
 [ 3915    14   270 ...   933     6  1660]
 [   36    21   314 ...     0     0     0]
 [    9    97   219 ...     0     0     0]]


From the above code, we learn that the average review is about 228 words long, and that the vocabulary size is 140601: 140600 unique tokens (including words with spelling mistakes) plus one index reserved for padding.
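The idea behind the tokenisation cell can be sketched in plain Python. This is a simplified illustration and not the Keras implementation: the helper names `fit_word_index` and `to_padded_sequences` are hypothetical, and the two toy reviews are invented.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text)
    # Index 0 is left unused so that it can serve as the padding value.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_padded_sequences(texts, word_index, maxlen):
    """Convert token lists to fixed-length integer lists (post-pad, post-truncate)."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text if w in word_index][:maxlen]
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded


reviews = [["a", "good", "film"], ["not", "a", "good", "film", "at", "all"]]
word_index = fit_word_index(reviews)
vocab_size = len(word_index) + 1   # +1 for the padding index 0
padded = to_padded_sequences(reviews, word_index, maxlen=4)
print(word_index)   # {'a': 1, 'good': 2, 'film': 3, 'not': 4, 'at': 5, 'all': 6}
print(padded)       # [[1, 2, 3, 0], [4, 1, 2, 3]]
```

The short review is padded with a trailing zero and the long one loses its last two words, which mirrors the `padding='post'` and `truncating='post'` arguments used above.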

##### Splitting into Test and Train Datasets
To measure how well the model generalises, the dataset is split into two pieces (in an 80-20 ratio). The larger portion is used to train the machine learning model, and the model is then tested against the smaller portion. The split is stratified, so both portions keep the same ratio of positive to negative reviews.

The larger portion contains 40000 reviews while the smaller portion contains 10000, as seen below.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(data, df['Positivity'], test_size=0.2, stratify=df['Positivity'])

print(X_train)
print(len(X_train))
print(X_test)
print(len(X_test))

[[    9  4453   380 ...     0     0     0]
 [    2   210    46 ...     0     0     0]
 [  204  2698   694 ...     0     0     0]
 ...
 [    9   215     1 ...     0     0     0]
 [    2  4003  2042 ...     0     0     0]
 [    1    53 93932 ...     0     0     0]]
40000
[[8084    1  971 ...    0    0    0]
 [  71    7    6 ... 2770  448   18]
 [ 365   79    3 ...    0    0    0]
 ...
 [  71    8    5 ...    0    0    0]
 [1987 2373  372 ...  363    1  754]
 [ 528   42   44 ...    0    0    0]]
10000
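The `stratify` argument used above keeps the share of positive and negative labels equal in both portions. The sketch below illustrates that idea in plain Python on a toy label list. It is not scikit-learn's implementation, and the function name `stratified_split` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with the same label ratio in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # shuffle within each class
        cut = round(len(indices) * test_size)     # test share of this class
        test_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])
    return train_idx, test_idx


labels = [1] * 50 + [0] * 50          # toy stand-in for the Positivity column
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))            # 80 20
print(sum(labels[i] for i in test_idx))         # 10 positive reviews in the test part
```

Because the split above is random, passing a fixed `random_state` to `train_test_split` would make it repeatable between runs.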


### Building the Model

Using the Keras module, the LSTM model is created below. A Bidirectional LSTM model is appropriate here because LSTMs process sequences, so the order of words within a review affects the prediction. For example, "not good" and "good ... not..." are treated differently. Therefore, this method is more appropriate than just adding the sentiments of each word in each sentence.

Each token above is now converted into a vector of 90 dimensions. The model is first trained for 20 epochs with a batch size of 300, before it is trained for 4 more epochs with a batch size of 1.

As this is a binary classification problem, binary cross entropy is used as the loss function.
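For a label y (0 or 1) and a predicted probability p, the binary cross-entropy is -(y log p + (1 - y) log(1 - p)), averaged over all samples. A minimal plain-Python sketch of this formula follows. The clipping constant `eps` and the example values are assumptions chosen for illustration and are not taken from the trained model.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)          # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Illustrative values only, not taken from the trained model.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))   # confident and correct: small loss
print(binary_cross_entropy([1, 0], [0.1, 0.9]))   # confident and wrong: large loss
```

The loss grows quickly as a confident prediction moves away from its label, which is why it suits a sigmoid output such as the one used in the model below.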

In [9]:
epochs = 20

dims = 90
model = Sequential(name="LSTM_Sentiment_Classifier")
model.add(Embedding(vocab, dims, input_length=input_len))
model.add(Bidirectional(LSTM(dims*2 , return_sequences=True)))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(dims)))
model.add(Dropout(0.2))
model.add(Dense(32,activation='relu'))
model.add(Dense(8,activation='relu'))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())



Model: "LSTM_Sentiment_Classifier"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 228, 90)           12654090  
                                                                 
 bidirectional (Bidirectiona  (None, 228, 360)         390240    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 228, 360)          0         
                                                                 
 bidirectional_1 (Bidirectio  (None, 180)              324720    
 nal)                                                            
                                                                 
 dropout_1 (Dropout)         (None, 180)               0         
                                                                 
 dense (Dense)               (None, 32)  
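The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard counting rule for an embedding layer and for an LSTM with four gates; the helper names are hypothetical, while the sizes come from the model definition above.

```python
def embedding_params(vocab_size, dims):
    """One trainable vector of length dims per vocabulary index."""
    return vocab_size * dims


def bidirectional_lstm_params(input_dim, units):
    """Two LSTMs (forward and backward), each with four gates.

    Each gate has an input kernel, a recurrent kernel, and a bias.
    """
    one_direction = 4 * (units * (input_dim + units) + units)
    return 2 * one_direction


vocab, dims = 140601, 90
print(embedding_params(vocab, dims))                 # 12654090
print(bidirectional_lstm_params(dims, dims * 2))     # 390240
# The second layer receives 2 * 180 = 360 features from the first one.
print(bidirectional_lstm_params(dims * 4, dims))     # 324720
```

Most of the parameters sit in the embedding layer, since every one of the 140601 vocabulary indices has its own 90-dimensional vector.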

### Training the Model
The 40000 preprocessed reviews in the training set are fed into the LSTM model for training. Notably, `val_accuracy` is the accuracy of the model against the test dataset of 10000 reviews, and `accuracy` is the accuracy of the model on the training data.

In [10]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size = 300, epochs=epochs)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x2277f9687f0>

In [11]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size = 1, epochs = 4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x22776a85910>

With a final accuracy of 87.86%, the model seems to perform relatively well.
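The two `fit` calls above differ mainly in how many weight updates each epoch performs, since Keras updates the weights once per batch. A small worked calculation, with the hypothetical helper `updates_per_epoch`:

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of gradient updates in one pass over the training set."""
    return math.ceil(n_samples / batch_size)


n_train = 40000
print(updates_per_epoch(n_train, 300))   # 134
print(updates_per_epoch(n_train, 1))     # 40000
```

With `batch_size = 1`, every epoch performs 40000 updates, each based on a single review, so these epochs are much slower and their gradients are noisier than in the first run.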

### Saving the Model
The model is saved for future use.

In [12]:
model_path = './models'
os.makedirs(model_path, exist_ok=True)  # create the folder if it does not exist yet
model.summary()
model.save(model_path)

Model: "LSTM_Sentiment_Classifier"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 228, 90)           12654090  
                                                                 
 bidirectional (Bidirectiona  (None, 228, 360)         390240    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 228, 360)          0         
                                                                 
 bidirectional_1 (Bidirectio  (None, 180)              324720    
 nal)                                                            
                                                                 
 dropout_1 (Dropout)         (None, 180)               0         
                                                                 
 dense (Dense)               (None, 32)  



INFO:tensorflow:Assets written to: ./models\assets




### Additional Tests
Some additional prompts are created and tested on the model.

In [48]:
def processor(df_test, vocab, input_len, tokenizer):
    df_test['reviews'] = df_test['reviews'].map(lambda x: lem(x))
    print(df_test)
    
    sequences = tokenizer.texts_to_sequences(df_test['reviews'])
    data = pad_sequences(sequences, maxlen=input_len, padding ='post', truncating='post')
    print(data)
    return data

prompt_1 = "the romance between the two characters was quite unrealistic. \
everything else was great, but the movie was not good"
prompt_2 = "i enjoyed watching the basketball matches. \
although i think the players were quite horrible, watching the matches was quite exciting and fun"
df_test = pd.DataFrame(np.array([[prompt_1], [prompt_2]]), columns = ['reviews'])
data_test = processor(df_test, vocab, input_len, tokenizer)

                                             reviews
0  [the, romance, between, the, two, character, w...
1  [i, enjoyed, watching, the, basketball, match,...
[[   1  816  207    1  111   48   13  184 1967  276  333   13   78   17
     1   12   13   19   47    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0   

#### Results
As with the labels in the original dataset, a prediction of a number close to "1" denotes that the sentiment of the input is likely to be positive, while a prediction of a number close to "0" denotes the opposite.

In [49]:
model.predict(data_test)




array([[0.12854104],
       [0.9878555 ]], dtype=float32)
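To turn these probabilities into labels, a cut-off can be applied. The sketch below assumes a threshold of 0.5, which is a common default but is not stated in this notebook; the function name `to_labels` is hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs to 'positive' / 'negative' using a cut-off."""
    return ["positive" if p >= threshold else "negative" for p in probabilities]


# The two predictions printed above, flattened to plain floats.
predictions = [0.12854104, 0.9878555]
print(to_labels(predictions))   # ['negative', 'positive']
```

Under this assumption, the first prompt is classified as negative and the second as positive.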

# Works Cited
Maas, Andrew, et al. "Learning word vectors for sentiment analysis." _Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies_. 2011.