# Sequence Classification with LSTM Recurrent Neural Networks with Keras

Sequence classification is a type of predictive modelling problem whereby you have some sequence of inputs and the task is to predict a category that the sequence belongs to.

What makes this problem difficult is that the sequences can vary in length, can draw on a very large vocabulary of input symbols, and may require the model to learn long-term context or dependencies between symbols in the input sequence. In this project, I develop an LSTM recurrent neural network model for sequence classification in Python, using Keras.

## Framing the problem

The problem used in this report is the [IMDB movie review sentiment classification problem](http://ai.stanford.edu/~amaas/data/sentiment/). Each movie review is a variable-length sequence of words, and the task is to classify the sentiment of each review.

The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly polar movie reviews (good or bad) for training and another 25,000 for testing. The problem is thus to determine whether a given movie review has a positive or negative sentiment.

Keras provides built-in access to the IMDB dataset through the `imdb.load_data()` function. The words have already been replaced by integers that indicate each word's frequency rank in the dataset, so each review is a sequence of integers.
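
To make the integer encoding concrete, here is a minimal sketch of frequency-rank encoding on a toy corpus. The three reviews and the names `word_to_rank` and `encoded` are made up for illustration; the real Keras dataset also reserves a few low indices for special tokens, which this sketch ignores.

```python
from collections import Counter

# Toy corpus standing in for the movie reviews (hypothetical data).
reviews = [
    "the film was great",
    "the film was bad",
    "the plot was great fun",
]

# Count how often each word occurs across the whole corpus.
counts = Counter(word for review in reviews for word in review.split())

# Rank words by descending frequency; ties are broken alphabetically so the
# result is deterministic. Rank 1 is the most frequent word.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_to_rank = {word: rank for rank, word in enumerate(ranked, start=1)}

# Replace every word with its frequency rank.
encoded = [[word_to_rank[w] for w in review.split()] for review in reviews]

print(word_to_rank)
print(encoded)
```

Frequent words such as "the" end up with small integers, which is why keeping only the lowest-numbered tokens is the same as keeping the most common words.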

## Word Embedding

Using word embedding, each movie review is mapped into a real-valued vector space. Each word is encoded as a real-valued vector in a high-dimensional space, so that similarity between words corresponds to closeness (for example, in Euclidean distance) between their vectors.

Keras provides a convenient way to convert positive integer representations of words into a word embedding through an `Embedding` layer. Each word is mapped onto a real-valued vector of length 32. For computational reasons, we also limit the vocabulary to the 5,000 most frequent words; because the integer tokens are ordered by frequency, these are the lowest-numbered tokens. Finally, because review length varies, we constrain each review to at most 500 words: longer reviews are truncated and shorter ones are padded with `0` values.
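
Conceptually, an embedding layer is a lookup table with one row of real values per word index. The sketch below illustrates that idea in plain Python with toy sizes. The names `embedding_table`, `embed` and `euclidean_distance` are hypothetical, and the random values stand in for weights that a real layer would learn during training.

```python
import math
import random

random.seed(7)

vocab_size = 10      # toy vocabulary (the text uses 5000)
embedding_dim = 4    # toy vector length (the text uses 32)

# One row of real values per word index; a trained layer would learn these.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(sequence):
    """Map a list of word indices to a list of embedding vectors."""
    return [embedding_table[index] for index in sequence]

def euclidean_distance(a, b):
    """Distance between two word vectors; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

review = [0, 0, 3, 7, 1]   # a short review, already padded with 0 on the left
vectors = embed(review)

print(len(vectors), len(vectors[0]))   # sequence length, vector length
print(euclidean_distance(vectors[2], vectors[3]))
```

A review of 500 tokens therefore becomes a 500 by 32 array of real values, which is the shape the next layer receives.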

We can now develop an LSTM model to classify the sentiment of movie reviews.

## Simple LSTM model, for Sequence Classification

This is a basic, small LSTM for the IMDB problem; in the run shown below it reaches about 87.5% accuracy on the test set.

In [1]:
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.preprocessing import sequence

# fix random seed for reproducibility
numpy.random.seed(7)

We constrain the dataset to the top 5,000 words. The dataset also comes already split into train (50%) and test (50%) sets.

In [2]:
# load the dataset but only keep the top n words; rarer words become the out-of-vocabulary token
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
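
As a rough illustration of what capping the vocabulary does to a single review, the sketch below replaces every token whose rank is at or above the cap with an out-of-vocabulary index. The helper `cap_vocabulary` is hypothetical, and the default `oov_index=2` is an assumption that mirrors my understanding of the Keras default rather than something stated in this report.

```python
def cap_vocabulary(sequence, num_words, oov_index=2):
    """Replace word ranks at or above num_words with an out-of-vocabulary index."""
    return [w if w < num_words else oov_index for w in sequence]

# Hypothetical encoded review containing two rare words (7486 and 5000).
review = [1, 14, 22, 7486, 43, 5000, 4999]
capped = cap_vocabulary(review, num_words=5000)
print(capped)
```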



Next, we truncate and pad the input sequences so that they are all the same length for modelling purposes.

In [3]:
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
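
For intuition, the following plain-Python sketch mimics what `pad_sequences` does with its default settings, as I understand them: short sequences are padded on the left, and long sequences keep their last `maxlen` items. The helper `pad_or_truncate` is hypothetical and assumes `maxlen` is a positive integer.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Left-pad short sequences and keep the last maxlen items of long ones."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]
    return [value] * (maxlen - len(sequence)) + sequence

short_review = [11, 12, 13]
long_review = [1, 2, 3, 4, 5, 6]

print(pad_or_truncate(short_review, maxlen=5))
print(pad_or_truncate(long_review, maxlen=4))
```

Padding on the left keeps the real words at the end of the sequence, closest to the final LSTM state that feeds the output layer.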

We can now define, compile and fit our LSTM model.

The first layer of the model is the `Embedding` layer, which uses real-valued vectors of length 32 to represent each word. It is followed by an LSTM layer with 100 memory units (smart neurons). The final layer is a dense output layer with a single neuron and a sigmoid activation function, which outputs a value between 0 and 1 that is read as a binary prediction for the two classes in the problem.
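
The layer sizes described above determine the number of trainable weights. As a quick check, this sketch recomputes the parameter counts by hand, assuming the standard LSTM layout of four weight sets (three gates plus the candidate cell state), each with input weights, recurrent weights and a bias. The totals agree with the model summary printed in the output further down.

```python
top_words = 5000       # vocabulary size
embedding_dim = 32     # length of each word vector
lstm_units = 100       # number of LSTM memory units

# Embedding: one vector of length embedding_dim per word.
embedding_params = top_words * embedding_dim

# LSTM: four weight sets, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (embedding_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# Dense output: one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```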

Because this is a binary classification problem, log loss is used as the loss function (`binary_crossentropy` in Keras), together with the `Adam` optimiser. The model is trained for only 2 epochs, as it overfits fairly quickly, with a batch size of 64.
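
To show what the log loss measures, here is a small plain-Python sketch of the sigmoid function and the mean binary cross-entropy. The function names, the clipping constant `eps` and the example numbers are illustrative assumptions, not the Keras implementation.

```python
import math

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean log loss over a batch of binary labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical raw outputs of the final neuron and their true labels.
logits = [2.0, -1.0, 0.0]
labels = [1, 0, 1]
probs = [sigmoid(z) for z in logits]

print(probs)
print(binary_crossentropy(labels, probs))
```

The loss is small when the predicted probability is close to the true label and grows quickly when the model is confidently wrong.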

In [5]:
# create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=64)

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 500, 32)           160000    
                                                                 
 lstm_1 (LSTM)               (None, 100)               53200     
                                                                 
 dense_1 (Dense)             (None, 1)                 101       
                                                                 
Total params: 213301 (833.21 KB)
Trainable params: 213301 (833.21 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
Epoch 1/2
Epoch 2/2


<keras.src.callbacks.History at 0x1e521e62850>

The model is now fit, and we can see how the model performs on the test data:

In [6]:
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 87.48%
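
The accuracy figure is the fraction of reviews whose predicted class matches the label. The sketch below shows the calculation on made-up numbers, assuming a sigmoid output is read as class 1 when it exceeds 0.5; the `accuracy` helper and the sample data are hypothetical.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(1 for y, pred in zip(y_true, predictions) if y == pred)
    return correct / len(y_true)

# Hypothetical labels and sigmoid outputs for eight reviews.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.91, 0.20, 0.45, 0.77, 0.08, 0.63, 0.88, 0.31]

print("Accuracy: %.2f%%" % (accuracy(labels, probs) * 100))
```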


With little tuning, this simple LSTM reaches 87.48% accuracy on the IMDB test set. It is a foundation for the more complex models explored next.

## LSTM for Sequence Classification With Dropout

Recurrent neural networks such as LSTMs are prone to overfitting.

To combat this, dropout can be applied between the layers. In Keras this is straightforward: we add `Dropout` layers between the embedding and LSTM layers, and between the LSTM and dense output layers. Keras also lets dropout be applied inside the LSTM layer itself through its `dropout` and `recurrent_dropout` arguments, although the listing below uses only the separate `Dropout` layers.
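
As a reminder of what a dropout layer does, here is a minimal plain-Python sketch of inverted dropout, the variant that rescales the surviving values during training so nothing needs to change at test time. The `dropout` helper, the seed and the example activations are assumptions made for illustration, not the Keras implementation.

```python
import random

def dropout(values, rate, training, rng):
    """Zero each value with probability `rate` and rescale the survivors."""
    if not training or rate == 0.0:
        return list(values)            # dropout is inactive at test time
    keep_scale = 1.0 / (1.0 - rate)    # keeps the expected value unchanged
    return [v * keep_scale if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(7)
activations = [1.0] * 10

print(dropout(activations, rate=0.2, training=True, rng=rng))
print(dropout(activations, rate=0.2, training=False, rng=rng))
```

With a rate of 0.2, roughly one value in five is zeroed on each training step, which discourages the network from relying on any single feature.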

In [8]:
# LSTM with Dropout for sequence classification in the IMDB dataset
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.layers import Embedding
from keras.preprocessing import sequence

In [9]:
# fix random seed for reproducibility
numpy.random.seed(7)

# load the dataset but only keep the top n words; rarer words become the out-of-vocabulary token
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)



In [12]:
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

# create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, epochs=2, batch_size=64)

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 500, 32)           160000    
                                                                 
 dropout_2 (Dropout)         (None, 500, 32)           0         
                                                                 
 lstm_3 (LSTM)               (None, 100)               53200     
                                                                 
 dropout_3 (Dropout)         (None, 100)               0         
                                                                 
 dense_3 (Dense)             (None, 1)                 101       
                                                                 
Total params: 213301 (833.21 KB)
Trainable params: 213301 (833.21 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
Epoch 1/2

<keras.src.callbacks.History at 0x1e52149a280>

In [13]:
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 87.56%


## LSTM and Convolutional Neural Networks for Sequence Classification

Convolutional neural networks are good at learning spatial structure in input data.

The IMDB review data has a one-dimensional spatial structure in the sequence of words in each review, so a CNN may be able to pick out local features that indicate good and bad sentiment. The LSTM layer can then learn from these features as a sequence.

We can add a one-dimensional CNN and a max pooling layer after the `Embedding` layer, which then feed the consolidated features to the LSTM. We use a small set of 32 filters with a filter length of 3. The pooling layer uses the standard length of 2 to halve the feature map size.
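
The effect of these settings on shapes and weight counts can be worked out by hand. The sketch below assumes 'same' padding with stride 1 for the convolution and non-overlapping pooling windows; the toy `max_pool_1d` helper is hypothetical and only illustrates the pooling step on a short list.

```python
embedding_dim = 32      # channels coming out of the Embedding layer
filters = 32            # number of convolution filters
kernel_size = 3         # filter length
pool_size = 2           # pooling window
sequence_length = 500   # padded review length

# Each filter has kernel_size * embedding_dim weights plus one bias.
conv_params = kernel_size * embedding_dim * filters + filters

# 'same' padding with stride 1 keeps the sequence length unchanged.
conv_length = sequence_length

# Non-overlapping pooling windows of size 2 halve the length.
pooled_length = conv_length // pool_size

def max_pool_1d(values, pool_size):
    """Max over non-overlapping windows; a trailing partial window is dropped."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]

print(conv_params, conv_length, pooled_length)
print(max_pool_1d([1, 5, 2, 2, 9, 4], pool_size))
```

The LSTM therefore sees 250 time steps of 32 features each instead of 500, while the convolution adds 3,104 weights to the model.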

The full code listing with the CNN and LSTM layers is given below for completeness.

In [22]:
# LSTM and CNN for sequence classification in the IMDB dataset
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Conv1D, MaxPooling1D
from keras.layers import Embedding
from keras.preprocessing import sequence

In [18]:
# load the dataset but only keep the top n words; rarer words become the out-of-vocabulary token
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)



In [19]:
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

In [28]:
# create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, epochs=3, batch_size=64) #I would choose 3, but due to computational reasons onl

Model: "sequential_13"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_11 (Embedding)    (None, 500, 32)           160000    
                                                                 
 conv1d_4 (Conv1D)           (None, 500, 32)           3104      
                                                                 
 max_pooling1d_3 (MaxPoolin  (None, 250, 32)           0         
 g1D)                                                            
                                                                 
 lstm_7 (LSTM)               (None, 100)               53200     
                                                                 
 dense_7 (Dense)             (None, 1)                 101       
                                                                 
Total params: 216405 (845.33 KB)
Trainable params: 216405 (845.33 KB)
Non-trainable params: 0 (0.00 Byte)
_____________

<keras.src.callbacks.History at 0x1e514910940>

In [29]:

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 88.72%


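We can see that this model achieves slightly better accuracy than the first example (88.72% versus 87.48%). It has slightly more weights (216,405 versus 213,301 parameters), but the pooling layer halves the sequence the LSTM has to process (250 steps instead of 500), which should shorten training time.

I would expect that even better results could be achieved if this example were extended to use dropout.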