# Sentiment Analysis

## Part 3: Keras & LSTM

In this notebook, you will learn how to use Keras to build a neural network, including an LSTM model.

**Outline**:

- Keras
- RNN
- LSTM

*Some of the code is adapted from [deeplearning.ai](https://www.deeplearning.ai/). Please do not use the code for ANY commercial purpose.*

Make sure you've installed the following packages:

- tensorflow
- keras
- nltk
- pandas
- h5py
- emoji

> If you're using `conda` as your package manager, you may notice that `emoji` is not included in conda. To install it, you need to use `pip` instead:

> 1. Activate your virtual environment: `source activate <your_venv>`.
> 2. Verify that you're using the `pip` that belongs to the virtual environment: `which pip`.
> 3. Install the package with `pip install emoji`. (Do not use `pip3`! There will be only one `pip` version inside your virtual environment.)
> 4. Deactivate your virtual environment: `source deactivate`.

**Pipeline**

<img src="resources/pipeline.png" width="800px">

### Keras

In [None]:
# import
from ml_utils import *

from keras.models import Model, Sequential
from keras.layers import Input, Dense, Activation

import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

In [None]:
X_train, X_test, Y_train, Y_test = load_moon()

# Plot the training set
plt.scatter(
    X_train[0,:],
    X_train[1,:],
    c=Y_train[0],
    cmap=plt.cm.Spectral)

**Using Keras:**

1. Define the structure of the network. 
2. Print the summary of your network to check that the shapes and the number of parameters are correct.
3. **Compile the model**.
4. Fit the model.
5. Evaluate the model.
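As a sanity check for step 2, the parameter count of a fully connected layer can be worked out by hand: one weight per input-output pair plus one bias per unit. The helper name `dense_params` below is for illustration only and is not part of Keras. For a network with 2 inputs, 4 hidden units, and 1 output, as built in the next cell, the total should match what `model.summary()` reports.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(2, 4)   # 2 features -> 4 tanh units
output = dense_params(4, 1)   # 4 hidden units -> 1 sigmoid unit
total = hidden + output
print(hidden, output, total)  # 12 5 17
```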

In [None]:
model = Sequential()
model.add(Dense(4, input_dim=2, activation='tanh'))  # hidden layer
model.add(Dense(1, activation='sigmoid'))  # output layer
model.summary()

<img src="resources/keras_network.png" width="800">

<center>*Keras 2-layer neural network*</center>

In [None]:
model.compile(
    loss='binary_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy'])

In [None]:
print(X_train.shape)
print(X_train.T.shape)

In [None]:
model.fit(
    X_train.T, 
    Y_train.T, 
    epochs=100)

In [None]:
model.evaluate(X_train.T, Y_train.T)

In [None]:
model.evaluate(X_test.T, Y_test.T)

<span style="color:red">**Notes:**</span>

Another good practice when creating a model is to treat each layer as a "transformer" or "function" that helps map the input (training features) to the output (labels).

In [None]:
def simple_nn_model():
    X = Input(shape=(2,))
    Z = Dense(4, activation='tanh')(X)
    Y = Dense(1, activation='sigmoid')(Z)
    return Model(inputs=X, outputs=Y)

In [None]:
model2 = simple_nn_model()
model2.summary()

### LSTM

For a more intuitive explanation of LSTMs, you may refer to [this post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).

In [None]:
# import 
import numpy as np
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation, Embedding
from keras.preprocessing import sequence

np.random.seed(1)

<img src="resources/deep_lstm.png" style="width:700px;height:400px;"> <br>
<caption><center> A 2-layer LSTM sequence classifier. </center></caption>

In [None]:
X_train, X_test, Y_train, Y_test = load_emoji()

In [None]:
word_to_index, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')

**Load pretrained word embeddings**

Two dictionaries are loaded:

- `word_to_index`: maps a word to its index in the vocabulary
    - Example:  `'word' -> 1234`

- `word_to_vec_map`: maps a word to its embedding
    - Example: `'word' -> [0.1, 0.2, ..., 0.45]`

When adding a custom embedding layer in Keras, we can only load the pretrained embedding as one big matrix, not as a dictionary. The index tells us which row of that matrix holds the vector for a given word.

In [None]:
# Encode the sentence to the index array
X_tmp = np.array(["I like it"])
sentences_to_indices(X_tmp, word_to_index, max_len = 5)

<span style="color:red">**Notes:**</span> `sentences_to_indices` is provided in `ml_utils`.
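`ml_utils` is not shown in this notebook, so the following is only a sketch of what a function like `sentences_to_indices` might do. It assumes that each sentence is lowercased, split on whitespace, looked up word by word in `word_to_index`, and padded with zeros up to `max_len`. The name `sentences_to_indices_sketch` and the toy vocabulary are made up for illustration.

```python
import numpy as np

def sentences_to_indices_sketch(X, word_to_index, max_len):
    """Map each sentence to a fixed-length row of word indices (0 = padding)."""
    indices = np.zeros((len(X), max_len), dtype=int)
    for i, sentence in enumerate(X):
        words = sentence.lower().split()
        for j, word in enumerate(words[:max_len]):
            indices[i, j] = word_to_index[word]
    return indices

# Toy vocabulary (hypothetical indices, not the real GloVe ones)
toy_word_to_index = {"i": 1, "like": 2, "it": 3}
result = sentences_to_indices_sketch(
    np.array(["I like it"]), toy_word_to_index, max_len=5)
print(result)  # [[1 2 3 0 0]]
```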

#### Embedding Layer

We need to build an embedding matrix where each row represents a word vector.

In [None]:
def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Build and return a Keras Embedding Layer given word_to_vec mapping and word_to_index mapping
    
    Args:
        word_to_vec_map (dict[str->np.ndarray]): map from a word to a vector with shape (N,) where N is the length of a word vector (50 in our case)
        word_to_index (dict[str->int]): map from a word to its index in vocabulary

    Return:
        Keras.layers.Embedding: Embedding layer
    """
    
    # +1 so that the largest word index is still a valid row (assuming word indices start at 1; row 0 stays all zeros for padding)
    vocab_len = len(word_to_index) + 1  
    emb_dim = list(word_to_vec_map.values())[0].shape[0]
    
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros((vocab_len, emb_dim))
    
    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Define the Keras embedding layer with the correct input/output sizes and freeze it with trainable=False.
    return Embedding(
        input_dim=vocab_len, 
        output_dim=emb_dim, 
        trainable=False,  # Indicating this is a pre-trained embedding 
        weights=[emb_matrix])

> For more information on how to define a pre-trained embedding layer in Keras, please refer to [this post](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html).
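To make the role of the matrix concrete, here is a small NumPy sketch with a made-up three-word vocabulary and 4-dimensional vectors (the GloVe vectors used in this notebook have 50 dimensions). It fills the matrix the same way as `pretrained_embedding_layer` and shows that indexing it with a sentence's word indices returns one row per token, which is essentially what a frozen embedding layer does in its forward pass.

```python
import numpy as np

# Hypothetical toy mappings; index 0 is left unused so it can act as padding.
toy_word_to_index = {"good": 1, "bad": 2, "movie": 3}
toy_word_to_vec_map = {
    "good": np.array([0.1, 0.2, 0.3, 0.4]),
    "bad": np.array([-0.1, -0.2, -0.3, -0.4]),
    "movie": np.array([0.5, 0.0, 0.5, 0.0]),
}

vocab_len = len(toy_word_to_index) + 1   # +1 for the all-zero padding row
emb_dim = 4
emb_matrix = np.zeros((vocab_len, emb_dim))
for word, index in toy_word_to_index.items():
    emb_matrix[index, :] = toy_word_to_vec_map[word]

# "good movie" padded to length 3 -> indices [1, 3, 0]
sentence_indices = np.array([1, 3, 0])
embedded = emb_matrix[sentence_indices]   # one row per token
print(embedded.shape)  # (3, 4)
```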

In [None]:
def emoji_model(input_shape, word_to_vec_map, word_to_index):
    """
    Build and return the Keras model
    
    Args:
        input_shape (tuple): Shape of the input layer without the batch dimension, here (max_len,)
        word_to_vec_map (dict[str->np.ndarray]): map from a word to a vector with shape (N,) where N is the length of a word vector (50 in our case)
        word_to_index (dict[str->int]): map from a word to its index in vocabulary
    
    Returns:
        Keras.models.Model: 2-layer LSTM model
    """
    
    # Input layer
    sentence_indices = Input(shape=input_shape, dtype='int32')
    
    # Embedding layer
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    embeddings = embedding_layer(sentence_indices)   
    
    # 2-layer LSTM
    X = LSTM(128, return_sequences=True, recurrent_dropout=0.5)(embeddings)  # N->N RNN
    X = Dropout(0.5)(X)
    X = LSTM(128, recurrent_dropout=0.5)(X)  # N -> 1 RNN
    X = Dropout(0.5)(X)
    X = Dense(5, activation='softmax')(X)
    
    # Create and return model
    model = Model(inputs=sentence_indices, outputs=X)
    
    return model

In [None]:
# Longest sentence measured in words (not in characters)
maxlen = max(len(sentence.split()) for sentence in X_train)
print(maxlen)

In [None]:
model = emoji_model((maxlen,), word_to_vec_map, word_to_index)
model.summary()

In [None]:
model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy'])

In [None]:
# Convert training/testing features into index list
X_train_indices = sentences_to_indices(X_train, word_to_index, maxlen)
X_test_indices = sentences_to_indices(X_test, word_to_index, maxlen)

# Convert training/testing labels into one hot array
Y_train_oh = convert_to_one_hot(Y_train, C = 5)
Y_test_oh = convert_to_one_hot(Y_test, C = 5)
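`convert_to_one_hot` also comes from `ml_utils`. A minimal sketch of the idea, assuming the labels are integers in the range `[0, C)`, is to index into an identity matrix. The function name below is hypothetical.

```python
import numpy as np

def one_hot_sketch(Y, C):
    """Return an array of shape (len(Y), C) with a single 1 per row."""
    return np.eye(C)[np.asarray(Y).reshape(-1)]

labels = np.array([0, 3, 4])
encoded = one_hot_sketch(labels, C=5)
print(encoded)
```

Each row then lines up with the 5-unit softmax output, which is the layout that `categorical_crossentropy` expects.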

In [None]:
history = model.fit(
    X_train_indices, 
    Y_train_oh, 
    epochs = 50, 
    batch_size = 32, 
    shuffle=True)

In [None]:
plt.plot(history.history['loss'])

In [None]:
plt.plot(history.history['acc'])  # newer Keras versions use the key 'accuracy'

In [None]:
loss, acc = model.evaluate(X_train_indices, Y_train_oh)
print('loss = %.4f, acc = %.2f%%' % (loss, acc * 100))

In [None]:
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print('loss = %.4f, acc = %.2f%%' % (loss, acc * 100))

### Save & Load Model

Two parts need to be saved in order to use the model in production:

1. Neural Network Structure
2. Trained Weights (Matrix)

We will save them separately. This makes it easy to manage multiple versions of the weights, so you can always choose which version goes to production.

In [None]:
# import
import h5py

Use JSON to store the model structure and HDF5 (via `h5py`) to store the weights.

In [None]:
# Save model structure as json
with open("emoji_model.json", "w") as fp:
    fp.write(model.to_json())

# Save model weights
model.save_weights("emoji_model.h5")

The reverse process loads the model structure and the trained weights.

In [None]:
from keras.models import model_from_json

# Load model structure
with open("emoji_model_best.json", "r") as fp:
    model = model_from_json(fp.read())

# Load model weights
model.load_weights("emoji_model_best.h5")

In [None]:
# The network ends in a 5-way softmax, so compile with the same loss used for training
model.compile(
    loss='categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy'])

In [None]:
loss, acc = model.evaluate(X_train_indices, Y_train_oh)
print('loss = %.4f, acc = %.2f%%' % (loss, acc * 100))

In [None]:
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print('loss = %.4f, acc = %.2f%%' % (loss, acc * 100))

In [None]:
x_test = np.array(["i am not feeling happy"])
X_test_indices = sentences_to_indices(x_test, word_to_index, maxlen)
print(x_test[0] +' '+  label_to_emoji(np.argmax(model.predict(X_test_indices))))

## TODO: IMDB Dataset

In [None]:
from keras.preprocessing import sequence
from keras.datasets import imdb

In [None]:
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=20000)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

In [None]:
# Pad sequences
x_train = sequence.pad_sequences(x_train, maxlen=80)
x_test = sequence.pad_sequences(x_test, maxlen=80)
x_train.shape
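By default, `pad_sequences` both pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`). The pure-Python sketch below mimics that default behaviour on toy data so the effect on short and long reviews is easy to see. `pad_pre` is an illustrative name, not a Keras function.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in seqs:
        kept = list(seq)[-maxlen:]                       # drop items from the front
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

toy = [[5, 6], [1, 2, 3, 4, 5, 6]]
print(pad_pre(toy, maxlen=4))  # [[0, 0, 5, 6], [3, 4, 5, 6]]
```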

In [None]:
model = Sequential()
model.add(Embedding(20000, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

In [None]:
model.summary()

In [None]:
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

In [None]:
batch_size = 32

model.fit(x_train, 
          y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))

In production, you should split your data into three parts: training, validation, and test data. You should not pass the test data as `validation_data` as we did here; that was only for a quick test.
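A three-way split could look like the sketch below. The 80/10/10 ratio, the fixed seed, and the helper name `three_way_split` are assumptions for illustration, and the dummy arrays stand in for real features and labels.

```python
import numpy as np

def three_way_split(X, y, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then cut into train / validation / test partitions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

X_all = np.arange(100).reshape(50, 2)   # 50 dummy examples
y_all = np.arange(50) % 2
(train_X, train_y), (val_X, val_y), (test_X, test_y) = three_way_split(X_all, y_all)
print(len(train_X), len(val_X), len(test_X))  # 40 5 5
```

The validation part would then be passed as `validation_data`, while the test part is used only once, for the final evaluation.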

In [None]:
# Save model structure
with open("imdb_model.json", "w") as fp:
    fp.write(model.to_json())

# Save model weights
model.save_weights("imdb_model.h5")

### Steps

- Use our pre-trained embedding layer (`pretrained_embedding_layer`) to replace the Embedding layer and train the model for 30 epochs.
- Collect the training history data.
- Plot the accuracy and loss.
- Find the best epoch number at which to stop training.
- Retrain the model and save it for later use.
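One simple way to pick the stopping point is to take the epoch with the lowest validation loss. The sketch below assumes a dictionary shaped like `history.history`; the numbers are invented purely to show the mechanics and are not results from this model.

```python
# Made-up validation losses, one per epoch (illustration only)
toy_history = {"val_loss": [0.69, 0.55, 0.48, 0.46, 0.47, 0.51, 0.58]}

val_loss = toy_history["val_loss"]
# Index of the smallest loss, shifted by 1 because epochs are counted from 1
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch)  # 4
```

Retraining with `epochs=best_epoch` would then stop before the validation loss starts to climb again.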


The model summary should be similar to:

```plain
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_34 (Embedding)     (None, None, 50)          20000050  
_________________________________________________________________
lstm_56 (LSTM)               (None, 128)               91648     
_________________________________________________________________
dense_39 (Dense)             (None, 1)                 129       
=================================================================
Total params: 20,091,827
Trainable params: 91,777
Non-trainable params: 20,000,050
_________________________________________________________________
```
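The parameter counts in this summary can be checked by hand. An LSTM layer has four gates, each with an input weight matrix, a recurrent weight matrix, and a bias vector. The 400,001 embedding rows are inferred from the embedding's parameter count divided by the 50 dimensions, and the helper name `lstm_params` is illustrative.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

embedding = 400001 * 50       # frozen embedding matrix (row count inferred from the summary)
lstm = lstm_params(50, 128)   # 50-d word vectors feeding 128 LSTM units
dense = 128 * 1 + 1           # 128 weights + 1 bias for the sigmoid output
print(embedding, lstm, dense, embedding + lstm + dense)
# 20000050 91648 129 20091827
```

Only the LSTM and Dense parameters (91,648 + 129 = 91,777) are trainable, because the embedding layer is frozen.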

The plot should be similar to:

<table>
    <tr>
        <td><img src="resources/acc.png"></td>
        <td><img src="resources/loss.png"></td>
    </tr>
</table>

### Hint

In [None]:
model2 = Sequential()
model2.add(pretrained_embedding_layer(word_to_vec_map, word_to_index))
model2.add(LSTM(128, dropout=0.3, recurrent_dropout=0.2))
model2.add(Dense(1, activation='sigmoid'))

In [None]:
model2.summary()

In [None]:
model2.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy'])

In [None]:
batch_size = 32

history = model2.fit(x_train, 
                     y_train,
                     batch_size=batch_size,
                     epochs=30,
                     validation_data=(x_test, y_test))

In [None]:
h = history.history.copy()

In [None]:
# Newer Keras versions use the keys 'accuracy' and 'val_accuracy'
plt.plot(h['acc'])
plt.plot(h['val_acc'])

In [None]:
plt.plot(h['loss'])
plt.plot(h['val_loss'])