# Using ELMo in Keras models
## Task: Sentiment analysis of movie reviews

In [0]:
import os
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, CuDNNLSTM
from tensorflow.keras.layers import Dropout, Dense, GlobalAveragePooling1D
from tensorflow.keras.layers import Input, Lambda
from tensorflow.keras import Model
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

print(tf.__version__)

## Download the IMDB dataset

The IMDB dataset has been conveniently preprocessed and can be easily obtained using Keras.
The reviews (sequences of words) have been converted to sequences of integers, where each integer represents a specific word in a dictionary. The dictionary is also pre-built. 

Keras contains the following helper function that downloads the IMDB dataset to your machine.

```python
def load_data(path='imdb.npz', num_words=None, skip_top=0, maxlen=None, seed=113,
              start_char=1, oov_char=2, index_from=3, **kwargs):
```

We talked about the design challenge of setting a vocabulary size. For now, we will set it to 10,000 words.

In [0]:
# Keep only the 10,000 most frequent words; rarer words become <UNK> (ID 2)
(train_data, train_labels), (test_data, test_labels) = \
       keras.datasets.imdb.load_data(num_words=10000)
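
To make the effect of a vocabulary limit concrete, here is a minimal sketch on a toy review. The helper `limit_vocab` and the toy IDs are hypothetical and only illustrate the idea; this is not the Keras implementation.

```python
def limit_vocab(sequence, num_words, oov_id=2):
    """Replace every word ID at or above num_words with the out-of-vocabulary ID."""
    return [word_id if word_id < num_words else oov_id for word_id in sequence]

# Toy review: small IDs stand for frequent words, large IDs for rare words.
toy_review = [1, 14, 22, 16, 12005, 43, 530, 25000]
limited = limit_vocab(toy_review, num_words=10000)
print(limited)  # [1, 14, 22, 16, 2, 43, 530, 2]
```

A smaller vocabulary maps more rare words to the same `<UNK>` ID, which loses information but keeps the dictionary compact.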

Keras also comes with a pre-built dictionary that maps each word to its ID. However, these IDs do not match the preprocessed word IDs, because `load_data()` shifts every ID by `index_from=3` and reserves the lowest IDs for special symbols. We therefore shift the dictionary by 3 and add the special words `<PAD>`, `<SOS>`, and `<UNK>` to match the settings `start_char=1, oov_char=2, index_from=3` in `load_data()`.

It is common in NLP to add such special words to the dictionary: the *padding* symbol `<PAD>`, the *start-of-sentence* symbol `<SOS>`, and the *unknown* symbol `<UNK>`.

In [0]:
# A dictionary mapping words --> integer index
word_index = keras.datasets.imdb.get_word_index()

# Shift word index by 3 because we want to add special words
word_index = {k:(v+3) for k,v in word_index.items()} 
word_index["<PAD>"] = 0  # padding
word_index["<SOS>"] = 1  # start of sequence
word_index["<UNK>"] = 2  # unknown (out of the top 10,000 most frequent words)

For our convenience, we will create a helper function to convert integer IDs back to words. It is easier to find errors that way!

In [0]:
# Build the reverse dictionary mapping integer --> word
reverse_word_index = {v: k for k, v in word_index.items()}

# Create a helper function to convert the integer to words
def decode_review(text):
  words = [reverse_word_index[i] for i in text]
  return ' '.join(words)

## Always check the content

### Confirm the correctness of preprocessing, because things can go wrong in so many ways.

When you deal with your own dataset, you have to write your own preprocessing procedure. Remember to check the correctness of the preprocessing!

In our design, each input is a list of integers representing the words of a movie review, and each output is an integer, 0 or 1. We use 0 for a negative review and 1 for a positive one. First, we make sure the number of reviews equals the number of labels.

In [0]:
print("Training data: {} reviews, {} labels".format(len(train_data), len(train_labels)))
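
The same checks can be turned into a function that fails loudly instead of relying on reading printed output. This is only a sketch: `check_dataset` and the toy data are hypothetical names introduced for illustration.

```python
def check_dataset(reviews, labels):
    """Raise ValueError if reviews and labels do not look like binary sentiment data."""
    if len(reviews) != len(labels):
        raise ValueError("got {} reviews but {} labels".format(len(reviews), len(labels)))
    bad_labels = sorted(set(labels) - {0, 1})
    if bad_labels:
        raise ValueError("unexpected labels: {}".format(bad_labels))
    if any(len(review) == 0 for review in reviews):
        raise ValueError("found an empty review")
    return True

# Toy data in the same format as the IMDB set: lists of word IDs and 0/1 labels.
toy_reviews = [[1, 14, 22], [1, 194, 1153, 194]]
toy_labels = [1, 0]
print(check_dataset(toy_reviews, toy_labels))  # True
```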

The words should be converted to integers, where each integer represents a specific word in a dictionary. Here's what the first review looks like:

In [0]:
print("Word IDs")
print(train_data[0])
print("Label")
print(train_labels[0])

Movie reviews may be of different lengths. We can see by examining a few of them. 

Since inputs must be the same length, we'll need to resolve this later.

In [0]:
len(train_data[0]), len(train_data[1])

We can use the `decode_review` function to display the text for the first review, and also check for any error.

In [0]:
print(decode_review(train_data[0]))

If we had **not** shifted the IDs and added the special words to `word_index`, the reviews would not make any sense when decoded with `decode_review`. It is important to always check for errors like this!
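
A toy vocabulary shows what goes wrong. The four-word dictionary and the helper `make_decoder` below are hypothetical stand-ins for `get_word_index()` and `decode_review`; unknown IDs are printed as `?`.

```python
# Toy frequency-ranked vocabulary (1 = most frequent word).
raw_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Shifted vocabulary with the three special words added.
shifted_index = {word: rank + 3 for word, rank in raw_index.items()}
shifted_index.update({"<PAD>": 0, "<SOS>": 1, "<UNK>": 2})

def make_decoder(index):
    """Return a function that turns a list of IDs back into text."""
    reverse = {i: w for w, i in index.items()}
    return lambda ids: " ".join(reverse.get(i, "?") for i in ids)

# "<SOS> the movie was great" encoded with start_char=1 and index_from=3.
encoded = [1, 4, 5, 6, 7]

with_shift = make_decoder(shifted_index)(encoded)
without_shift = make_decoder(raw_index)(encoded)
print(with_shift)     # <SOS> the movie was great
print(without_shift)  # the great ? ? ?
```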

## Prepare the data for input

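There is no need to pad the input sentences when using ELMo, because the module takes text rather than word IDs. Instead, **we need to convert the word IDs back to words**!

Important options:
* Max length = 100 (each review is truncated to its first 100 word IDs)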


In [0]:
# No padding needed for ELMo; the padding calls are kept below for reference only.
# train_data = keras.preprocessing.sequence.pad_sequences(train_data,
#                                                         value=word_index["<PAD>"],
#                                                         padding='pre',
#                                                         truncating='pre',
#                                                         maxlen=100)
# test_data = keras.preprocessing.sequence.pad_sequences(test_data,
#                                                        value=word_index["<PAD>"],
#                                                        padding='pre',
#                                                        truncating='pre',
#                                                        maxlen=100)
# Keep the first `maxlen` word IDs of each review, then decode them to one string
maxlen = 100
train_data = [decode_review(t[:maxlen]) for t in train_data]
test_data = [decode_review(t[:maxlen]) for t in test_data]

Let's look at the length of the examples now. Each example is a string, so `len` reports the number of characters, not the number of words:

In [0]:
len(train_data[0]), len(train_data[1])
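
To count words in a decoded review, split the string on whitespace first. A small sketch with a made-up sentence (not an actual IMDB review):

```python
decoded_review = "<SOS> this film was just brilliant"

num_characters = len(decoded_review)        # every character, including spaces
num_tokens = len(decoded_review.split())    # whitespace-separated tokens

print(num_characters, num_tokens)  # 34 6
```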

And inspect the first review:

In [0]:
print(train_data[0])

Compare the review now vs. the original above.

If you need to process your own dataset, you can use another convenient helper in Keras, the `Tokenizer` class:
```python
from keras.preprocessing.text import Tokenizer

vocab_size = 10000  # same vocabulary size as above
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<UNK>")
# SENTENCES: your own data, as a list of strings or a list of word lists
tokenizer.fit_on_texts(SENTENCES)
sequences = tokenizer.texts_to_sequences(SENTENCES)

word_index = tokenizer.word_index
```

Remember to call `pad_sequences` later!
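
As a rough illustration of what padding with `padding='pre'` and `truncating='pre'` does to a single sequence, here is a plain-Python sketch. The helper `pad_pre` is hypothetical and is not the Keras implementation.

```python
def pad_pre(sequence, maxlen, pad_id=0):
    """Keep the last maxlen IDs, then left-pad with pad_id up to maxlen."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    truncated = sequence[-maxlen:]
    return [pad_id] * (maxlen - len(truncated)) + truncated

short = pad_pre([1, 14, 22], maxlen=5)
long = pad_pre([1, 14, 22, 16, 43, 530], maxlen=5)
print(short)  # [0, 0, 1, 14, 22]
print(long)   # [14, 22, 16, 43, 530]
```

Every sequence ends up with exactly `maxlen` IDs, which is what a model with a fixed-length integer input expects.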

## Create a validation set

When training, we want to check the accuracy of the model on data it hasn't seen before. Create a *validation set* by setting apart some examples from the original training data. 

Why not use the testing set now? Our goal is to develop and tune our model using only the training data, then use the test data just once to evaluate our accuracy. 

To save time, we will use only a part of the training data.

In [0]:
x_val = train_data[:1000]
partial_x_train = train_data[1000:4000]

y_val = train_labels[:1000]
partial_y_train = train_labels[1000:4000]
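
The slicing above can be wrapped in a small function, which makes it easy to confirm that the two subsets do not overlap. This is a sketch; `split_validation` and the toy data are hypothetical.

```python
def split_validation(examples, labels, val_size, train_size):
    """Return ((x_train, y_train), (x_val, y_val)) taken from disjoint slices."""
    x_val, y_val = examples[:val_size], labels[:val_size]
    x_train = examples[val_size:val_size + train_size]
    y_train = labels[val_size:val_size + train_size]
    return (x_train, y_train), (x_val, y_val)

toy_examples = ["review {}".format(i) for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

(x_train, y_train), (x_val_toy, y_val_toy) = split_validation(
    toy_examples, toy_labels, val_size=2, train_size=6)

print(len(x_train), len(x_val_toy))          # 6 2
print(set(x_train) & set(x_val_toy))         # set()
```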

## Build the model

Keras can help us build a model quickly. The neural network is created by adding layers. However, you need to decide:
* How many layers?
* How many hidden units to use for each layer?

Let's build a simple model for the sentiment analysis problem!

In [0]:
import tensorflow_hub as hub
sess = tf.Session()
K.set_session(sess)

elmo_model = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())

In [0]:
# We create a function to integrate the tensorflow model with a Keras model
# This requires explicitly casting the tensor to a string, because of a Keras quirk
def ElmoEmbedding(x):
    # Squeeze only axis 1 so that a batch of size 1 keeps its batch dimension
    return elmo_model(tf.squeeze(tf.cast(x, tf.string), axis=1),
                      signature="default", as_dict=True)["default"]
# the output dictionary:
# ["default"] -> a fixed mean-pooling of all contextualized word representations
#                with shape [batch_size, 1024].
# ["elmo"]    -> the weighted sum of the 3 layers, where the weights are
#                trainable. This tensor has shape [batch_size, max_length, 1024]

  
input_text = Input(shape=(1,), dtype=tf.string)
embedding = Lambda(ElmoEmbedding, output_shape=(1024,))(input_text)
dense = Dense(256, activation='relu')(embedding)
pred = Dense(1, activation='sigmoid')(dense)

model = Model(inputs=[input_text], outputs=pred)

model.summary()
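
The number of hidden units directly determines the number of trainable weights. As a sanity check on the two `Dense` layers above, their parameter counts can be worked out by hand. The helper `dense_params` is hypothetical and counts one weight per input-unit pair plus one bias per unit.

```python
def dense_params(input_dim, units):
    """Weights (input_dim * units) plus biases (units) of a fully connected layer."""
    return input_dim * units + units

hidden = dense_params(1024, 256)  # 1024-dim ELMo vector -> 256 hidden units
output = dense_params(256, 1)     # 256 hidden units -> 1 sigmoid output

print(hidden, output, hidden + output)  # 262400 257 262657
```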

In [0]:
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

## Train the model

Train the model for *n* epochs in mini-batches of samples. Recall that this is *n* iterations over all samples in the training data. While training, monitor the model's loss and accuracy on the validation set:

**The input sentences must be passed as a NumPy array!**

In [0]:
history = model.fit(np.array(partial_x_train),
                    partial_y_train,
                    epochs=8,
                    batch_size=32,
                    validation_data=(np.array(x_val), y_val))
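
With the settings above (3,000 training reviews, batches of 32, 8 epochs), the number of weight updates can be worked out as follows. This is simple arithmetic and assumes the final, smaller batch of each epoch is also used.

```python
import math

num_train, batch_size, epochs = 3000, 32, 8

steps_per_epoch = math.ceil(num_train / batch_size)
last_batch_size = num_train - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)  # 94 24 752
```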

## Evaluate the model

Now let's see how the model performs on the test set. Calling `.evaluate` returns two values: the **loss** (we defined it as binary cross-entropy) and the **accuracy**. Keras reports whatever metrics we passed to `model.compile`.


In [0]:
results = model.evaluate(np.array(test_data), test_labels)
print(results)
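
To see what the two reported numbers mean, here is a plain-Python sketch of binary cross-entropy and accuracy on four made-up predictions. The function names, the clipping constant `eps`, and the 0.5 threshold are assumptions for illustration, not the exact Keras code.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = sum(int((p > threshold) == bool(y)) for y, p in zip(y_true, y_prob))
    return correct / len(y_true)

toy_true = [1, 0, 1, 0]            # made-up labels
toy_prob = [0.9, 0.2, 0.4, 0.6]    # made-up sigmoid outputs

toy_loss = binary_crossentropy(toy_true, toy_prob)
toy_acc = accuracy(toy_true, toy_prob)
print(round(toy_loss, 4), toy_acc)  # 0.5403 0.5
```

The loss penalizes confident mistakes more heavily than hesitant ones, while accuracy only counts which side of the threshold each prediction falls on.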

## Create a plot of accuracy and loss over time

`model.fit()` returns a `History` object that contains a dictionary with everything that happened during training. There are four entries: one for each monitored metric during training and validation. We can use these to plot the training and validation loss for comparison, as well as the training and validation accuracy. We will write a helper function to plot loss and accuracy of each epoch.

In [0]:
def plot_hist(history):
    history_dict = history.history
    # Keys are 'acc'/'val_acc' in TF 1.x; newer Keras versions use 'accuracy'/'val_accuracy'
    acc = history_dict['acc']
    val_acc = history_dict['val_acc']
    loss = history_dict['loss']
    val_loss = history_dict['val_loss']
    epochs = range(1, len(acc) + 1)
    # plot for loss
    plt.clf()   # clear figure
    # "bo" is for "blue dot"
    plt.plot(epochs, loss, 'bo', label='Training loss')
    # r is for "red solid line"
    plt.plot(epochs, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()
    # plot for accuracy
    plt.clf()   # clear figure
    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()

    plt.show()

plot_hist(history)
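
Besides plotting, the same dictionary can be queried directly, for example to find the epoch with the lowest validation loss. The numbers in `toy_history` are invented for illustration and are not results from this notebook.

```python
# Invented values in the same layout as history.history (one entry per epoch).
toy_history = {
    "loss": [0.69, 0.55, 0.41, 0.30],
    "acc": [0.55, 0.72, 0.82, 0.88],
    "val_loss": [0.66, 0.58, 0.57, 0.61],
    "val_acc": [0.60, 0.70, 0.72, 0.71],
}

val_loss = toy_history["val_loss"]
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1  # 1-based

print(sorted(toy_history))  # ['acc', 'loss', 'val_acc', 'val_loss']
print(best_epoch)           # 3
```

A training loss that keeps falling while the validation loss starts rising, as in the toy values after epoch 3, is the usual sign of overfitting.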

## Build a recurrent model

You can easily change to a recurrent model in Keras. Replace the hidden `Dense` layer with an `LSTM` layer, feed it the per-token `elmo` output instead of the pooled `default` output, and that is it!


---


Specifically, some parameters can be set for the LSTM cell.

```python
LSTM(hidden_units, dropout=0.0, recurrent_dropout=0.0)
```

The first, `dropout`, refers to dropping input features, and `recurrent_dropout` refers to dropping units of the previous output. Recall the connection between the previous output and the current input that we showed in the slides. Note that the `CuDNNLSTM` layer used below does not accept these two arguments.

Let's build a recurrent model for the sentiment analysis problem!

In [0]:
model = None
K.clear_session()
sess = tf.Session()
K.set_session(sess)

elmo_model = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())

# Note the final dictionary key `elmo`
def ElmoEmbedding(x):
    # Squeeze only axis 1 so that a batch of size 1 keeps its batch dimension
    return elmo_model(tf.squeeze(tf.cast(x, tf.string), axis=1),
                      signature="default", as_dict=True)["elmo"]

# the output dictionary:
# ["default"] -> a fixed mean-pooling of all contextualized word representations
#                with shape [batch_size, 1024].
# ["elmo"]    -> the weighted sum of the 3 layers, where the weights are
#                trainable. This tensor has shape [batch_size, max_length, 1024]


input_text = Input(shape=(1,), dtype=tf.string)
embedding = Lambda(ElmoEmbedding, output_shape=(None,1024,))(input_text)
rnn_lstm = CuDNNLSTM(16)(embedding)
pred = Dense(1, activation='sigmoid')(rnn_lstm)

model = Model(inputs=[input_text], outputs=pred)

model.summary()
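
The difference between the two output shapes can be illustrated with toy vectors: a per-token output keeps one vector per word, while mean pooling collapses each sentence into a single vector. The sketch below uses 2 dimensions instead of 1024. The helper `mean_pool` is hypothetical, and it is an assumption that the module ignores padding positions in this way.

```python
def mean_pool(token_vectors, length):
    """Average the first `length` token vectors of one sentence, ignoring padding."""
    real = token_vectors[:length]
    dim = len(real[0])
    return [sum(vec[d] for vec in real) / length for d in range(dim)]

# Shape [batch_size=2, max_length=3, dim=2]; the second sentence has one padding row.
batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [0.0, 0.0]],
]
lengths = [3, 2]

pooled = [mean_pool(sentence, n) for sentence, n in zip(batch, lengths)]
print(pooled)  # [[3.0, 4.0], [3.0, 1.0]]  -> shape [batch_size=2, dim=2]
```

A recurrent layer needs the unpooled sequence as input, which is why this model reads the `elmo` key rather than `default`.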

In [0]:
model.compile(loss='binary_crossentropy',
              optimizer=tf.train.AdamOptimizer(),
              metrics=['accuracy'])

history = model.fit(np.array(partial_x_train),
                    partial_y_train,
                    epochs=4,
                    batch_size=32,
                    validation_data=(np.array(x_val), y_val))

plot_hist(history)

In [0]:
results = model.evaluate(np.array(test_data), test_labels)

print(results)