# LSTM for Sentiment Analysis

Sentiment analysis (a.k.a. opinion mining or emotion AI) is the use of natural language processing to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.

### Model Flowchart 

To conduct sentiment analysis using LSTM, we need several processing steps: 

- Convert the raw words into so-called tokens, which are integer values. These tokens are simply indices into a list of the entire vocabulary. 

- Convert these integer tokens into so-called embeddings, which are real-valued vectors. The mapping is trained along with the neural network so that words with similar meanings map to similar embedding vectors. 

- Feed these embedding vectors to an LSTM network, which can take sequences of arbitrary length as input and outputs a kind of summary of what it has seen. 

- Apply a sigmoid function to obtain a value between 0.0 and 1.0, where 0.0 is taken to mean a negative sentiment and 1.0 a positive sentiment. 
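The first and last steps in the list above can be sketched in a few lines of plain Python. This is only an illustration: the toy vocabulary and the helper names `tokenize` and `sigmoid` are assumptions made for this sketch and are not part of the notebook or of Keras.

```python
import math

# Hypothetical toy vocabulary; index 0 is reserved for unknown words.
toy_vocab = ["<unk>", "the", "movie", "was", "great", "terrible"]
toy_word_to_index = {word: i for i, word in enumerate(toy_vocab)}

def tokenize(text):
    """Map each lowercased word to its vocabulary index (0 if unknown)."""
    return [toy_word_to_index.get(w, 0) for w in text.lower().split()]

def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(tokenize("The movie was great"))  # [1, 2, 3, 4]
print(sigmoid(0.0))                     # 0.5
```

A large positive input to `sigmoid` gives a value close to 1.0 (positive sentiment), and a large negative input gives a value close to 0.0 (negative sentiment).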

This whole process allows us to classify input-text as either having a negative or positive sentiment, and the flowchart is roughly:

<img src='../figs/17_Flowchart.png' width = '400'>

## IMDb Data Set

In this notebook, we will go through a sentiment analysis of the [IMDb movie review dataset](https://ai.stanford.edu/~amaas/data/sentiment/) using a Long Short-Term Memory (LSTM) network. IMDb (an acronym for Internet Movie Database) is an online database of information related to films, television series, home videos, video games, and streaming content online. There are 25,000 reviews with positive/negative sentiment labels in the training set and an equal number in the test set. The dataset ships with Keras.

#### Import libraries

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
import keras

Words are ranked by how often they occur (in the training set), and only the `num_words` most frequent words are kept. Any less frequent word appears as the `oov_char` (out-of-vocabulary) value in the sequence data. All words are kept if `num_words` is not specified.

Load the IMDb dataset using `imdb.load_data`:

In [3]:
from keras.datasets import imdb
vacab_size = 5000
(x_train, y_train),(x_test, y_test) = imdb.load_data(num_words = vacab_size)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
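The effect of `num_words` can be illustrated with a simplified sketch. The function `limit_vocabulary` is a hypothetical helper written for this illustration, not the Keras implementation; it assumes that the out-of-vocabulary value is 2, which is the default `oov_char` of `imdb.load_data`.

```python
def limit_vocabulary(sequence_ids, num_words, oov_char=2):
    """Replace every index at or above num_words with the out-of-vocabulary value."""
    return [w if w < num_words else oov_char for w in sequence_ids]

# Indices 5000 and 20000 fall outside a 5000-word vocabulary.
print(limit_vocabulary([1, 14, 4999, 5000, 20000], num_words=5000))
# [1, 14, 4999, 2, 2]
```

This is why the value 2 appears repeatedly in the loaded reviews: every rare word collapses onto the same index.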


Confirm that the training and test sets each contain 25,000 samples:

In [2]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(25000,)
(25000,)
(25000,)
(25000,)


## Data Preprocessing

Each review in the `imdb` dataset is stored as a list of **word indexes**, not as raw text. That is, the reviews are already tokenized.

In [3]:
x_train[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 2,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 2,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 2,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 2,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 2,
 8,
 4,
 107,
 117,
 2,
 15,
 256,
 4,
 2,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 2,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 2,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5,
 144,
 30,
 2,
 18,
 51,
 36,
 

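The reviews have different lengths, but the network expects inputs of the **same length**. We therefore bring every sequence to a length of 500 tokens using the Keras `pad_sequences` function. By default, it adds zeros at the front of shorter sequences and drops tokens from the front of longer ones: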

In [4]:
from keras.preprocessing import sequence
x_train = sequence.pad_sequences(x_train, maxlen = 500)
x_test = sequence.pad_sequences(x_test, maxlen = 500)

Let us visualize now how it has transformed our training set:

In [21]:
x_train[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   
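The leading zeros in the output come from the default "pre" padding. A minimal pure-Python sketch of that behavior is given here; `pad_pre` is a hypothetical helper that mimics the defaults `padding='pre'` and `truncating='pre'` for a single sequence, and assumes `maxlen` is positive.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([1, 14, 22], 5))            # [0, 0, 1, 14, 22]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```

Padding at the front keeps the actual words next to the end of the sequence, which is the part the LSTM sees last before producing its output.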

Now we need to build a `word_to_id` dictionary so that these indexes can be transformed back into words for further analysis. In the dictionary, we assign the '<PAD>' token to index 0, the '<START>' token to index 1, and the '<UNK>' token to index 2. To make room for these tokens, we have to shift the default indexes by 3.

In [None]:
imdb.get_word_index()

In [35]:
word_to_id = imdb.get_word_index()

In [None]:
word_to_id = {k:(v+3) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

After building `word_to_id`, we need to build the inverse `id_to_word` dictionary:

In [36]:
id_to_word = {idx:word for word, idx in word_to_id.items()}
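The shift-and-invert logic is easier to follow on a tiny example. The dictionary `toy_index` and the helper `decode` are hypothetical and exist only for this sketch; `toy_index` imitates the frequency-ranked format returned by `imdb.get_word_index()`, where rank 1 is the most frequent word.

```python
# Hypothetical frequency-ranked index; rank 1 is the most frequent word.
toy_index = {"the": 1, "and": 2, "movie": 3}

# Shift every rank by 3 to free up indices 0, 1, and 2 for special tokens.
toy_word_to_id = {word: rank + 3 for word, rank in toy_index.items()}
toy_word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2})

# Invert the mapping so integer ids can be decoded back into words.
toy_id_to_word = {idx: word for word, idx in toy_word_to_id.items()}

def decode(ids):
    """Turn ids into text, falling back to <UNK> for ids missing from the map."""
    return " ".join(toy_id_to_word.get(i, "<UNK>") for i in ids)

print(decode([0, 0, 1, 4, 6, 99]))  # <PAD> <PAD> <START> the movie <UNK>
```

Using `.get` with a fallback is a design choice of this sketch: it avoids a `KeyError` when an id has no entry in the dictionary.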

Now we can provide `id_to_word` an index and it will output the word associated with it.

In [37]:
id_to_word[20]

'movie'

In [38]:
id_to_word[0]

'<PAD>'

In [41]:
id_to_word[2]

'<UNK>'

In [39]:
id_to_word[5003]

'1987'

In [40]:
id_to_word[50000]

"belmondo's"

Now we can read what we have in the training set. Less frequent words will appear as '<UNK>'.

In [42]:
print(" ".join(id_to_word[idx] for idx in x_train[0]))  # `idx` avoids shadowing the built-in `id`

<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD

The labels are either 0 (negative review) or 1 (positive review).

In [11]:
y_train[:10]

array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])

## Build LSTM model

We build a model that takes a sequence of word indexes as input and outputs the probability that the review is positive.

- Create the instance of sequential model

- Add an **embedding layer** with maximum vocab size and dimension of output. It turns positive integers (word indexes) into dense vectors of fixed size. Embedding layer can only be used as the first layer in a model.

- Add a layer of LSTM with many-to-one input-output style.

- Add a Dense layer with sigmoid activation for predicting the probability.

- Compile the model for training with loss function as binary cross-entropy, optimizer as adam, and metric as accuracy.
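To make the many-to-one style concrete, here is a rough sketch of an LSTM with a single scalar unit written in plain Python. It follows the standard LSTM gate equations, but the function name `lstm_last_state` and all weight values are assumptions chosen for illustration; the Keras layer uses learned weight matrices instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_last_state(inputs, w, u, b):
    """Run a scalar LSTM over `inputs` and return only the final hidden state.

    w, u, b are dicts keyed by gate name: 'i' (input), 'f' (forget),
    'o' (output), and 'g' (candidate cell value).
    """
    h, c = 0.0, 0.0
    for x in inputs:
        i = sigmoid(w["i"] * x + u["i"] * h + b["i"])
        f = sigmoid(w["f"] * x + u["f"] * h + b["f"])
        o = sigmoid(w["o"] * x + u["o"] * h + b["o"])
        g = math.tanh(w["g"] * x + u["g"] * h + b["g"])
        c = f * c + i * g       # update the cell state
        h = o * math.tanh(c)    # expose part of it as the hidden state
    return h

# Hypothetical fixed weights, chosen only for illustration.
w = {"i": 0.5, "f": 0.5, "o": 0.5, "g": 1.0}
u = {"i": 0.1, "f": 0.1, "o": 0.1, "g": 0.1}
b = {"i": 0.0, "f": 1.0, "o": 0.0, "g": 0.0}

short_summary = lstm_last_state([0.2, -0.4], w, u, b)
long_summary = lstm_last_state([0.2, -0.4, 0.9, 0.1, -0.3], w, u, b)
print(short_summary, long_summary)
```

Both calls return a single number regardless of the input length, which is the "many-to-one" behavior: many time steps go in, and one summary comes out.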

#### Word Embedding

A word embedding is a class of approaches for representing words and documents using a dense vector representation.

It is an improvement over the more traditional bag-of-words encoding schemes (e.g., TF-IDF), where large sparse vectors were used to represent each word or to score/weigh each word within a vector to represent an entire vocabulary. These representations were sparse because the vocabularies were vast, so a given word or document would be represented by a large vector consisting mostly of zero values.

Instead, in an embedding,

- words are represented by dense vectors where a vector represents the projection of the word into a continuous vector space.

- the position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.

- the position of a word in the learned vector space is referred to as its embedding.
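The contrast between sparse and dense representations can be sketched without any deep learning library. In this illustration the table is filled with random numbers, whereas a real embedding layer learns these values during training; the names `toy_vocab_size`, `embedding_table`, `one_hot`, and `embed` are assumptions of this sketch.

```python
import random

random.seed(0)
toy_vocab_size = 6   # hypothetical tiny vocabulary
embedding_dim = 4    # length of each dense vector

# One row per word index; a trained model would learn these values.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(toy_vocab_size)
]

def one_hot(token_id):
    """Sparse representation: a vector as long as the vocabulary."""
    return [1.0 if i == token_id else 0.0 for i in range(toy_vocab_size)]

def embed(token_ids):
    """Dense representation: look up one row of the table per token."""
    return [embedding_table[t] for t in token_ids]

vectors = embed([1, 2, 3])
print(len(one_hot(2)))                # 6
print(len(vectors), len(vectors[0]))  # 3 4
```

With a 5000-word vocabulary, a one-hot vector would have 5000 entries, while the dense vector keeps the small fixed length chosen for the embedding.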

#### Keras Embedding Layer

[Parameters](https://keras.io/api/layers/core_layers/embedding/#embedding):

- `input_dim`: the size of the vocabulary. It is the number of unique words in the vocabulary, which is 5000 in our case.

- `output_dim`: the number of dimensions we wish to embed into. Each word will be represented by a vector with this many dimensions.

- `input_length`: the length of the input sequences, which is 500 in our case (the `maxlen` value passed to `pad_sequences`).

In [6]:
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras import Sequential

embedding_vector_length = 32

def lstm_sentiment():
  model = Sequential()
  model.add(Embedding(input_dim = vacab_size, output_dim = embedding_vector_length, input_length=500)) 
  model.add(LSTM(units = 100)) 
  model.add(Dense(units = 1, activation='sigmoid')) 
  model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy']) 
  return model
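The `binary_crossentropy` loss named in `model.compile` can be written out for a single prediction. The function below is a sketch of the textbook formula, not the Keras implementation; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Loss for one example with label y_true (0 or 1) and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

# A confident correct prediction is penalized far less than a confident wrong one.
print(binary_cross_entropy(1, 0.9))
print(binary_cross_entropy(1, 0.1))
```

During training, the optimizer adjusts the weights to reduce the average of this quantity over each batch.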

In [13]:
model = lstm_sentiment()

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 500, 32)           160000    
                                                                 
 lstm (LSTM)                 (None, 100)               53200     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
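The parameter counts in the summary can be checked by hand. The arithmetic below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the variable names are chosen for this sketch.

```python
vocab_size = 5000
embedding_dim = 32
lstm_units = 100

# Embedding: one vector of length 32 per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)    # 160000 53200 101
print(embedding_params + lstm_params + dense_params)  # 213301
```

These values match the `Param #` column and the total reported by `model.summary()`.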


We'll use the `ModelCheckpoint` callback to save the weights whenever the
validation accuracy improves, and the `EarlyStopping` callback to interrupt
training when the validation accuracy has not improved for `patience=3` consecutive epochs.

In [20]:
from keras.callbacks import EarlyStopping, ModelCheckpoint

path_checkpoint = "model_checkpoint.h5"
es_callback = EarlyStopping(monitor="val_accuracy", patience=3)

modelckpt_callback = ModelCheckpoint(
    monitor="val_accuracy",
    filepath=path_checkpoint,
    verbose=1,
    save_weights_only=True,
    save_best_only=True,
)
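The patience rule can be sketched as a small loop. This is a simplified illustration of the idea rather than the Keras implementation; the function `epochs_run` and the list of validation accuracies are hypothetical.

```python
def epochs_run(val_scores, patience):
    """Return how many epochs run before early stopping halts training."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, wait = score, 0   # improvement: remember it and reset the counter
        else:
            wait += 1               # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_scores)

# Hypothetical validation accuracies for 10 planned epochs.
scores = [0.8602, 0.8622, 0.8623, 0.8610, 0.8615, 0.8601, 0.8630, 0.8640, 0.8650, 0.8660]
print(epochs_run(scores, patience=3))  # 6
```

In this example the best score occurs at epoch 3, the next three epochs fail to beat it, and training stops after epoch 6 even though later epochs would have improved. A larger `patience` trades training time for a lower risk of stopping too early.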

### Training

Now train the model by passing the training set, the labels, the number of epochs, the batch size, and the callbacks. We pass `x_test` and `y_test` as `validation_data`.

Batch size, epochs, LSTM units, etc. are all hyperparameters and can be tuned further for better accuracy.

In [21]:
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10, batch_size=64, callbacks=[es_callback, modelckpt_callback])

Epoch 1/10
Epoch 00001: val_accuracy improved from -inf to 0.86020, saving model to model_checkpoint.h5
Epoch 2/10
Epoch 00002: val_accuracy improved from 0.86020 to 0.86224, saving model to model_checkpoint.h5
Epoch 3/10
Epoch 00003: val_accuracy improved from 0.86224 to 0.86232, saving model to model_checkpoint.h5
Epoch 4/10
Epoch 00004: val_accuracy did not improve from 0.86232
Epoch 5/10
Epoch 00005: val_accuracy did not improve from 0.86232
Epoch 6/10
Epoch 00006: val_accuracy did not improve from 0.86232


<keras.callbacks.History at 0x7fc7890a9350>

Load the weights of the best checkpoint and evaluate them on the validation data. `model.evaluate` returns the loss followed by the compiled metrics:

In [23]:
model.load_weights(path_checkpoint)
val_loss, val_accuracy = model.evaluate(x_test, y_test)



### Prediction

Let's predict the sentiment of a few hand-written reviews and see how our model performs. Note that our trained model can only deal with input in the form of word indexes, so we write a function that converts a review given as text into indexes and prints the prediction.

In [24]:
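# for user prediction
def user_input_processing(review):
    unk = word_to_id["<UNK>"]
    vec = []
    for word in review.lower().split():
        word = word.strip(".,!?")  # drop surrounding punctuation
        idx = word_to_id.get(word, unk)  # unknown words map to <UNK>
        # ids outside the training vocabulary are not valid Embedding inputs
        vec.append(idx if idx < vacab_size else unk)
    vec_padded = sequence.pad_sequences([vec], maxlen=500)
    print(review, model.predict(vec_padded))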

A good review example:

In [25]:
user_input_processing("One of the finest films made in recent years.")

One of the finest films made in recent years. [[0.978964]]


A bad review example:


In [26]:
user_input_processing("Predictable and bad. The acting was terrible and the story was common.")

Predictable and bad. The acting was terrible and the story was common. [[0.00377825]]
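The sigmoid output is a probability, so turning it into a class label requires a decision threshold. The sketch below assumes a threshold of 0.5, which is a common default but is not specified in this notebook; `to_label` is a hypothetical helper.

```python
def to_label(probability, threshold=0.5):
    """Map a predicted probability to a sentiment label (threshold is an assumption)."""
    return "positive" if probability >= threshold else "negative"

# The two probabilities printed by the cells above.
print(to_label(0.978964))    # positive
print(to_label(0.00377825))  # negative
```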
