# Sequence Classification of Movie Reviews

Sequence classification is a predictive modeling problem where you have some sequence of inputs over space or time and the task is to predict a category for the sequence.

Sequences are difficult to model because
* They can vary in length
* They can draw on a very large vocabulary of input symbols
* They may require the model to learn long-term context or dependencies between input symbols

The IMDB movie reviews dataset pairs each review with its corresponding sentiment. The dataset has 50,000 such observations. We will use this dataset to build the following models and compare which works best for sentiment classification:

1. Model 1 : Baseline Model
2. Model 2 : LSTM with 100 units
3. Model 3 : LSTM with Dropout
4. Model 4 : LSTM with CNNs


### Loading the dataset

In [None]:
import pandas as pd
df = pd.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
df.head()

We can observe from the dataset that the sentiment label is in textual form. This needs to be converted to numeric form. We will use `LabelEncoder` from `sklearn.preprocessing` to encode `positive` and `negative` sentiment as `1` and `0`.

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['sentiment'] = le.fit_transform(df.sentiment)
df.head()
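
`LabelEncoder` assigns integer codes to the class names in sorted order, which is why `negative` becomes `0` and `positive` becomes `1`. A minimal pure-Python sketch of that mapping, using a small made-up list of labels rather than the real dataset:

```python
# Toy labels for illustration only; not taken from the IMDB data.
toy_labels = ['positive', 'negative', 'negative', 'positive']

# Sort the distinct class names, then number them from 0.
classes = sorted(set(toy_labels))
label_to_id = {label: idx for idx, label in enumerate(classes)}
encoded = [label_to_id[label] for label in toy_labels]

print(label_to_id)  # {'negative': 0, 'positive': 1}
print(encoded)      # [1, 0, 0, 1]
```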

### Defining hyperparameters for Text Processing

In [None]:
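TOP_WORDS      = 5000      # vocabulary cap: keep only the most frequent words (token indices below 5000)
MAX_REVIEW_LEN = 500       # pad or truncate every review to this many tokens (Keras needs equal-length sequences)
OOV_TOKEN      = '<OOV>'   # any out-of-vocabulary word (not part of the top words) is replaced with this token
TRUNC_TYPE     = 'post'    # drop tokens from the end of reviews longer than MAX_REVIEW_LEN
PADDING_TYPE   = 'post'    # append zeros to the end of reviews shorter than MAX_REVIEW_LEN
TEST_SIZE      = 0.5       # fraction of the reviews held out for testing
EMBEDDING_LEN  = 32        # length of each word embedding vector
EPOCHS         = 10        # number of training epochs
BATCH_SIZE     = 64        # number of reviews per training batch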

### Tokenize
We will use `Tokenizer` from the Keras API. The `Tokenizer` vectorizes the text corpus by converting text to integers (each integer being the index of a token in a dictionary).
The dictionary containing the word-to-integer mapping can be accessed from `tokenizer.word_index`.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words = TOP_WORDS, oov_token = OOV_TOKEN)
tokenizer.fit_on_texts(df.review.to_numpy())
word_index = tokenizer.word_index
word_index_inv = {index: word for word, index in word_index.items()}   # reverse lookup: integer -> word

We can compare the tokenized and original sentences using the reversed `word_index`. In the sample below, the first 50 words of the first review are tokenized and then decoded back into text.

In [None]:
def decode_sentence(text):
    return ' '.join([word_index_inv.get(i, '?') for i in text])

sample_seq = [' '.join(df.review[0].split(' ')[:50])]
tokenized_sample = tokenizer.texts_to_sequences(sample_seq)
print (sample_seq[0])
print ('------------------')
print (tokenized_sample[0])
print ('------------------')
print (decode_sentence(tokenized_sample[0]))

**Note**
* All words are converted to lowercase
* Words like *brutality* and *unflinching* fall outside the top words, so they do not appear in the tokenized sentence and are replaced with our Out of Vocabulary token `<OOV>`
* Punctuation is removed. This could be a problem when extracting the meaning of sentences, but may not be an issue for sentiment classification
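
The three behaviours noted above can be reproduced with a simplified pure-Python sketch. This is not the Keras implementation: `simple_split`, `build_word_index` and `encode` are hypothetical helpers, the filter characters are assumed to approximate the Keras defaults, and the toy reviews are made up.

```python
from collections import Counter
import string

# Characters treated as separators; the apostrophe is kept (an assumption about the defaults).
FILTERS = string.punctuation.replace("'", "") + "\t\n"
OOV_INDEX = 1  # the OOV token takes index 1, ahead of every real word

def simple_split(text):
    """Lowercase the text, turn filtered characters into spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; the most frequent word gets index 2."""
    counts = Counter()
    for text in texts:
        counts.update(simple_split(text))
    vocab = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: idx for idx, word in enumerate(vocab, start=1)}

def encode(text, word_index, num_words):
    """Map words to indices; unknown words and indices >= num_words become OOV."""
    ids = []
    for word in simple_split(text):
        idx = word_index.get(word)
        ids.append(idx if idx is not None and idx < num_words else OOV_INDEX)
    return ids

toy_reviews = ["A great, great movie!", "A dull movie."]
toy_index = build_word_index(toy_reviews)
print(toy_index)
print(encode("A dull, unflinching movie", toy_index, num_words=5))
```

In the toy run, *dull* is known but ranked too low for `num_words=5`, and *unflinching* was never seen, so both come back as the OOV index.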

### Train Test Split

In [None]:
reviews = df.review.to_numpy()
labels  = df.sentiment.to_numpy()

train_count      = int(len(reviews) * (1 - TEST_SIZE))
training_reviews = reviews[:train_count]
testing_reviews  = reviews[train_count:]
y_train          = labels[:train_count]
y_test           = labels[train_count:]

print ('Training Count :', len(training_reviews))
print ('Testing Count :', len(testing_reviews))
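
The split above is positional: the first half of the rows goes to training and the second half to testing, with no shuffling. The notebook does not check whether the rows of the CSV are already in random order; if they are not, a shuffled split may be a safer choice. A minimal sketch, where `shuffled_split` is a hypothetical helper and the toy data is made up:

```python
import random

def shuffled_split(items, labels, test_size, seed=42):
    """Shuffle the row indices once, then cut them into train and test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    train_count = int(len(items) * (1 - test_size))
    train_idx, test_idx = indices[:train_count], indices[train_count:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

toy_items  = ['r0', 'r1', 'r2', 'r3', 'r4', 'r5']
toy_labels = [0, 1, 0, 1, 0, 1]
tr_x, te_x, tr_y, te_y = shuffled_split(toy_items, toy_labels, test_size=0.5)
print(len(tr_x), len(te_x))
```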

The reviews are converted to integer sequences using the tokenizer fitted earlier. The resulting sequences are then padded or truncated to `MAX_REVIEW_LEN`: if a review is shorter, `0`s are appended after its last token, and if it is longer, the tokens beyond `MAX_REVIEW_LEN` are dropped.

In [None]:
training_sequences = tokenizer.texts_to_sequences(training_reviews)
X_train            = pad_sequences(training_sequences, maxlen = MAX_REVIEW_LEN, padding = PADDING_TYPE, truncating = TRUNC_TYPE)

testing_sequences  = tokenizer.texts_to_sequences(testing_reviews)
X_test             = pad_sequences(testing_sequences,  maxlen = MAX_REVIEW_LEN, padding = PADDING_TYPE, truncating = TRUNC_TYPE)
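
To make the effect of the `padding` and `truncating` options concrete, here is a pure-Python sketch of what happens to a single sequence. `pad_or_truncate` is a hypothetical stand-in for illustration, not the Keras function, and it assumes `maxlen` is at least 1.

```python
def pad_or_truncate(seq, maxlen, padding='post', truncating='post', value=0):
    """Return a copy of seq with exactly maxlen items."""
    # 'post' truncation keeps the first maxlen items, 'pre' keeps the last maxlen.
    seq = seq[:maxlen] if truncating == 'post' else seq[-maxlen:]
    pad = [value] * (maxlen - len(seq))
    # 'post' padding appends the filler, 'pre' puts it in front.
    return seq + pad if padding == 'post' else pad + seq

print(pad_or_truncate([5, 6, 7], 5))                          # [5, 6, 7, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))                 # [1, 2, 3, 4]
print(pad_or_truncate([5, 6, 7], 5, padding='pre'))           # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4, truncating='pre'))  # [3, 4, 5, 6]
```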

## Baseline Model
We first define the embedding layer, which represents each word with a 32-length vector. Next, a `GlobalAveragePooling1D` layer averages these vectors over the whole review, followed by a Dense layer with 100 units and `relu` activation. Lastly, we use a Dense layer with `sigmoid` activation to classify the sentiment.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, GlobalAveragePooling1D
from tensorflow.keras.optimizers import Adam

model = Sequential() 
model.add(Embedding(TOP_WORDS, EMBEDDING_LEN, input_length=MAX_REVIEW_LEN))
model.add(GlobalAveragePooling1D())
model.add(Dense(100, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid')) 
model.compile(loss = 'binary_crossentropy', optimizer = Adam(0.0005), metrics = ['accuracy']) 
model.summary()
%time history = model.fit(X_train, y_train, validation_data =(X_test, y_test), epochs=EPOCHS, batch_size=BATCH_SIZE, verbose = 0)
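
As a rough illustration of what the baseline computes, the sketch below averages a tiny made-up sequence of embedding vectors over the time axis, and works out the parameter counts that `model.summary()` should report for the layer sizes used above. `global_average_pooling_1d` is a hypothetical helper, not the Keras layer. Since the `Embedding` layer is not given `mask_zero=True`, the padded positions are presumably included in the average as well.

```python
def global_average_pooling_1d(sequence):
    """Average a list of equal-length vectors over the time axis."""
    steps = len(sequence)
    return [sum(column) / steps for column in zip(*sequence)]

# Three timesteps, two embedding dimensions (toy values).
embedded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(global_average_pooling_1d(embedded))  # [3.0, 4.0]

# Parameter counts for TOP_WORDS=5000, EMBEDDING_LEN=32 and the Dense sizes used above.
embedding_params = 5000 * 32          # one 32-length vector per token index
hidden_params    = 32 * 100 + 100     # weights plus biases of Dense(100)
output_params    = 100 * 1 + 1        # weights plus bias of Dense(1)
total_params     = embedding_params + hidden_params + output_params
print(total_params)  # 163401
```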

In [None]:
results = model.evaluate(X_test, y_test, batch_size = 128, verbose = 0)
print (f'Accuracy : {round(results[1]*100, 2)} %')

In [None]:
import matplotlib.pyplot as plt
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('LOSS')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'])
plt.show()


plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('ACCURACY')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'])
plt.show()   

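## LSTMs
Here the pooling and hidden Dense layers of the baseline are replaced by an LSTM layer with 100 memory units, which reads the embedded words in order. A Dense layer with `sigmoid` activation again classifies the sentiment.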

In [None]:
model = Sequential() 
model.add(Embedding(TOP_WORDS, EMBEDDING_LEN, input_length=MAX_REVIEW_LEN))
model.add(LSTM(100))
model.add(Dense(1, activation = 'sigmoid')) 
model.compile(loss = 'binary_crossentropy', optimizer = Adam(0.005), metrics = ['accuracy']) 
model.summary()
%time history = model.fit(X_train, y_train, validation_data =(X_test, y_test), epochs=EPOCHS, batch_size=BATCH_SIZE, verbose = 0)
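
The size of the LSTM layer can be checked by hand. The sketch below assumes the standard LSTM formulation with four gates, each having input weights, recurrent weights and a bias; `lstm_param_count` is a hypothetical helper. The totals should match what `model.summary()` prints for this model.

```python
def lstm_param_count(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

embedding_params = 5000 * 32                   # TOP_WORDS * EMBEDDING_LEN
lstm_params      = lstm_param_count(32, 100)   # 32-length inputs, 100 memory units
output_params    = 100 * 1 + 1                 # Dense(1) on the final LSTM state
total_params     = embedding_params + lstm_params + output_params

print(lstm_params)   # 53200
print(total_params)  # 213301
```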

In [None]:
results = model.evaluate(X_test, y_test, batch_size = 128, verbose = 0)
print (f'Accuracy : {round(results[1]*100, 2)} %')

In [None]:
import matplotlib.pyplot as plt
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('LOSS')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'])
plt.show()


plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('ACCURACY')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'])
plt.show()   

## LSTM with Dropout
LSTMs are prone to overfitting, which Dropout can help reduce. Dropout can be added
1. To the embedding input layer
2. Between the embedding and LSTM layers
3. Between the LSTM and Dense layers
4. To the input and recurrent connections of the memory units within the LSTM (the `dropout` and `recurrent_dropout` arguments of the `LSTM` layer)

The model below uses options 2 and 3.

In [None]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, GlobalAveragePooling1D, Dropout

model = Sequential() 
model.add(Embedding(TOP_WORDS, EMBEDDING_LEN, input_length=MAX_REVIEW_LEN))
model.add(Dropout(0.2)) 
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(1, activation = 'sigmoid')) 
model.compile(loss = 'binary_crossentropy', optimizer = Adam(0.005), metrics = ['accuracy']) 
model.summary()
%time history = model.fit(X_train, y_train, validation_data =(X_test, y_test), epochs=EPOCHS, batch_size=BATCH_SIZE, verbose = 0)
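
For intuition about what a `Dropout(0.2)` layer does, here is a pure-Python sketch of inverted dropout: during training each value is zeroed with probability `rate` and the survivors are scaled up by `1 / (1 - rate)`, while at inference the input passes through unchanged. The `dropout` function below is a hypothetical illustration, not the Keras layer.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout over a flat list of activations."""
    if not training or rate == 0.0:
        return list(values)                # inference: identity
    keep_scale = 1.0 / (1.0 - rate)        # rescale so the expected sum is unchanged
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)                     # seeded for reproducibility
activations = [1.0] * 10
train_out = dropout(activations, rate=0.2, training=True, rng=rng)
infer_out = dropout(activations, rate=0.2, training=False, rng=rng)
print(train_out)
print(infer_out)
```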

In [None]:
results = model.evaluate(X_test, y_test, batch_size = 128, verbose = 0)
print (f'Accuracy : {round(results[1]*100, 2)} %')

In this run, the LSTM model reaches better accuracy with dropout than without.

In [None]:
import matplotlib.pyplot as plt
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('LOSS')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'])
plt.show()


plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('ACCURACY')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'])
plt.show()   

## LSTM with CNN
Convolutional neural networks excel at learning the spatial structure in input data. The IMDB review data does have a one-dimensional spatial structure in the sequence of words in reviews, and the CNN may be able to pick out invariant features for good and bad sentiment. These learned spatial features may then be learned as sequences by an LSTM layer.

In [None]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Conv1D, MaxPooling1D

model = Sequential() 
model.add(Embedding(TOP_WORDS, EMBEDDING_LEN, input_length = MAX_REVIEW_LEN))
model.add(Conv1D(32, 3, activation = 'relu'))   # 32 filters, each spanning 3 consecutive words
model.add(MaxPooling1D(2))                      # keep the maximum of every 2 adjacent steps
model.add(LSTM(100))
model.add(Dense(1, activation = 'sigmoid'))

model.compile(loss = 'binary_crossentropy', optimizer = Adam(0.005), metrics = ['accuracy']) 
model.summary()
%time history = model.fit(X_train, y_train, validation_data =(X_test, y_test), epochs=EPOCHS, batch_size=BATCH_SIZE, verbose = 0)
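
The convolution and pooling layers shorten the sequence before it reaches the LSTM. The arithmetic can be sketched as follows, assuming the Keras defaults of `valid` padding, a convolution stride of 1, and a pooling stride equal to the pool size; both helper functions are hypothetical.

```python
def conv1d_output_len(length, kernel_size, stride=1):
    """'valid' padding: the kernel must fit entirely inside the sequence."""
    return (length - kernel_size) // stride + 1

def maxpool1d_output_len(length, pool_size):
    """Stride equals pool_size; leftover steps at the end are dropped."""
    return (length - pool_size) // pool_size + 1

after_conv = conv1d_output_len(500, 3)            # MAX_REVIEW_LEN=500, kernel size 3
after_pool = maxpool1d_output_len(after_conv, 2)  # pool size 2
conv_params = 3 * 32 * 32 + 32                    # kernel * input channels * filters + biases

print(after_conv)   # 498
print(after_pool)   # 249
print(conv_params)  # 3104
```

Under these assumptions the LSTM reads 249 steps of 32 features each instead of 500, which is one reason this variant may train faster than the plain LSTM.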

In [None]:
results = model.evaluate(X_test, y_test, batch_size = 128, verbose = 0)
print (f'Accuracy : {round(results[1]*100, 2)} %')

In [None]:
import matplotlib.pyplot as plt
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('LOSS')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'])
plt.show()


plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('ACCURACY')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'])
plt.show()   

## Next Steps

1. Try out different LSTM layers
2. Experiment with different values of `TOP_WORDS` and `MAX_REVIEW_LEN` for the input data
3. Try `pre` padding and truncation while creating sequences
4. Experiment with multiple stacked convolutional and LSTM layers