# Part I: NLP Basics
## Task: Sentiment analysis of movie reviews

We begin our tutorial by introducing some basic steps involved in natural language processing (NLP) tasks. Our problem at hand is to classify movie reviews as *positive* or *negative* using the text of the review. 
It is an example of binary classification, a fundamental and widely applicable kind of machine learning problem. 

A widely-used dataset for this task is the [IMDB dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb). 
It contains 50,000 movie reviews from IMDB, preprocessed and split into
25,000 reviews for training and 25,000 reviews for testing. 
Notice that both sets are balanced, which means that each contains equal numbers of positive and negative reviews.  

You will first use a high-level framework for deep learning, [Keras](https://www.tensorflow.org/guide/keras), as the tool to build a model for this task. 
Keras helps us to leverage powerful backend toolkits such as Theano and [TensorFlow](https://www.tensorflow.org/).
You may find that it is much easier to build a simple model for sentiment analysis using Keras than with lower-level toolkits.
However, as you will see in future courses, when the problem becomes more difficult, you need to dive into the finer mechanisms of deep learning toolkits.

Let's begin!

In [0]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, CuDNNLSTM
from tensorflow.keras.layers import Dropout, Dense, GlobalAveragePooling1D
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

print(tf.__version__)

## Download the IMDB dataset

The IMDB dataset has been conveniently preprocessed by others and can be easily obtained using Keras. 
The reviews (sequences of words) have been converted to sequences of integers, where each integer represents a specific word in a dictionary. The dictionary is also pre-built. 

Keras contains the following helper function that downloads the IMDB dataset to your machine.

```python
def load_data(path='imdb.npz', num_words=None, skip_top=0, maxlen=None, seed=113,
              start_char=1, oov_char=2, index_from=3, **kwargs):
```

We talked about the design challenge of setting a vocabulary size. For now, we will set it to 10,000 words.

In [0]:
vocab_size = 10000
(train_data, train_labels), (test_data, test_labels) = \
       keras.datasets.imdb.load_data(num_words=vocab_size)

The argument `num_words=vocab_size` keeps only the 10,000 most frequently occurring words in the training data. 
All rarer words are replaced by `oov_char` to keep the size of the model manageable. 
Recall that we have an embedding matrix that maps each word to a vector.
With 10,000 words and a 100-dimensional vector for each, the embedding matrix alone holds 1,000,000 (1 million) parameters. Still, this is much smaller than a sparse (one-hot) encoding, which would need a 10,000-dimensional vector per word, or 100 million parameters.
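The parameter counts quoted above can be checked with a few lines of arithmetic. This is only a sketch; it assumes that "sparse encoding" means one vector of length `vocab_size` per word, which is how the 100 million figure comes out.

```python
# Illustrative arithmetic only; the sizes match the numbers quoted in the text.
vocab_size = 10_000      # number of words kept in the vocabulary
embedding_dim = 100      # length of each dense word vector

# Dense embedding matrix: one 100-d row per word.
embedding_params = vocab_size * embedding_dim

# Assumed sparse (one-hot style) layout: one vocab_size-d row per word.
one_hot_params = vocab_size * vocab_size

print(embedding_params)                    # 1000000
print(one_hot_params)                      # 100000000
print(one_hot_params // embedding_params)  # 100
```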

Keras also comes with a pre-built dictionary that maps each word to its ID. However, these IDs do not match the preprocessed word IDs, because `load_data()` reserves the first few IDs for special symbols. We need to shift the IDs and add the special words `<PAD>`, `<SOS>`, and `<UNK>` to match the settings `start_char=1, oov_char=2, index_from=3` in `load_data()`.

It is common in NLP to add these special words in the dictionary. We want to add the *PADDING* symbol `<PAD>`, *Start-of-sentence* symbol `<SOS>`, and *Unknown* symbol `<UNK>`.
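To make the role of the special symbols concrete, here is a minimal sketch of how a sentence could be encoded under the settings `start_char=1, oov_char=2, index_from=3`. The vocabulary `toy_rank` and the function `encode` are hypothetical and only imitate the behaviour described above; they are not part of Keras.

```python
# Hypothetical toy vocabulary: word -> frequency rank (1 = most frequent).
toy_rank = {"the": 1, "movie": 2, "was": 3, "great": 4, "boring": 5}

PAD, SOS, UNK = 0, 1, 2   # reserved IDs for the special symbols
INDEX_FROM = 3            # real words are shifted past the reserved IDs


def encode(words, num_words):
    """Encode a list of words; IDs at or above num_words become UNK."""
    ids = [SOS]                       # every sequence starts with <SOS>
    for w in words:
        rank = toy_rank.get(w)
        if rank is None or rank + INDEX_FROM >= num_words:
            ids.append(UNK)           # unknown or too rare
        else:
            ids.append(rank + INDEX_FROM)
    return ids


print(encode(["the", "movie", "was", "great"], num_words=7))  # [1, 4, 5, 6, 2]
```

With `num_words=7`, the word "great" would get ID 7 after the shift, so it falls outside the vocabulary and is replaced by `<UNK>`.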

In [0]:
# A dictionary mapping words --> integer index
word_index = keras.datasets.imdb.get_word_index()

# Shift word index by 3 because we want to add special words
word_index = {k:(v+3) for k,v in word_index.items()} 
word_index["<PAD>"] = 0  # padding
word_index["<SOS>"] = 1  # start of sequence
word_index["<UNK>"] = 2  # unknown (out of the top 10,000 most frequent words)

For our convenience, we will create a helper function to convert integer IDs back to words. It is easier to find errors that way!

In [0]:
# Build another dictionary of mapping integer --> words 
reverse_word_index = dict([(v, k) for (k, v) in word_index.items()])

# Create a helper function to convert integer IDs back to words.
# A line break is inserted after every 10 words to keep the output readable.
def decode_review(text):
  words = [reverse_word_index.get(i, "<UNK>") for i in text]
  fixed_width_string = []
  for w_pos, word in enumerate(words):
    fixed_width_string.append(word)
    # start a new line after every 10 words
    if (w_pos+1) % 10 == 0:
      fixed_width_string.append('\n')
  return ' '.join(fixed_width_string)

## Always check the content

### Confirm the correctness of preprocessing, because things can go wrong in so many ways.

When you deal with your own dataset, you have to write your own preprocessing procedure. Remember to check the correctness of the preprocessing!

In our design, each input should contain a list of integers representing the words of the movie review, and each label should be an integer, 0 or 1. We use 0 to represent a negative review and 1 a positive one. First we want to make sure the numbers of reviews and labels are equal.

In [0]:
print("Training data: {} reviews, {} labels".format(len(train_data), len(train_labels)))

The words should be converted to integers, where each integer represents a specific word in a dictionary. Here's what the first review looks like:

In [0]:
print("Word IDs")
print(train_data[0])
print("Label")
print(train_labels[0])

Movie reviews may be of different lengths. We can see this by examining a few of them. 

Since inputs must be the same length, we'll need to resolve this later.

In [0]:
len(train_data[0]), len(train_data[1])

We can use the `decode_review` function to display the text for the first review, and also check for any error.

In [0]:
print(decode_review(train_data[0]))

If we had **not** added the special words to `word_index`, the reviews would not make any sense when decoded with `decode_review`. It is important to always check for errors like this!

## Prepare the data for input

Since the inputs must be of the same length, we will use the helper function `pad_sequences` in Keras to unify the lengths.

Important options:
* `maxlen=100` fixes the length of every sequence at 100.
* `padding='pre'` means that shorter sequences are padded at the beginning.
* If we set `maxlen=None`, Keras will automatically pad to the length of the __longest__ sequence in the dataset. 
* `truncating='pre'` means that longer sequences are cut from the beginning of the review, so we keep the __last__ `maxlen=100` words.
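The effect of `padding='pre'` and `truncating='pre'` can be sketched in plain Python. The function `pad_or_truncate_pre` below is a hypothetical stand-in written for illustration; it handles one sequence at a time and is not the Keras implementation.

```python
def pad_or_truncate_pre(seq, maxlen, pad_value=0):
    """Keep the last `maxlen` items, or left-pad with `pad_value`."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]                          # drop items from the front
    return [pad_value] * (maxlen - len(seq)) + seq    # pad at the front


print(pad_or_truncate_pre([5, 6, 7], maxlen=5))              # [0, 0, 5, 6, 7]
print(pad_or_truncate_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```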

In [0]:
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='pre',
                                                        truncating='pre',
                                                        maxlen=100)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='pre',
                                                       truncating='pre',
                                                       maxlen=100)

Let's look at the length of the examples now:

In [0]:
len(train_data[0]), len(train_data[1])

And inspect the first review:

In [0]:
print(train_data[0])

Compare the review now vs. the original above.

In [0]:
print(decode_review(train_data[0]))

We can see that the review has been cut from the beginning, because we set `truncating='pre'` in `pad_sequences`.

Shorter reviews are padded at the beginning, as in the next example.

In [0]:
print(decode_review(train_data[5]))

If you need to process your own dataset, you can use another convenient helper in Keras, the `Tokenizer` class:
```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=vocab_size, oov_token="<UNK>")
# SENTENCES: your own data, a list of strings (or a list of lists of words)
tokenizer.fit_on_texts(SENTENCES)
sequences = tokenizer.texts_to_sequences(SENTENCES)

word_index = tokenizer.word_index
```

Remember to call `pad_sequences` later!

## Create a validation set

When training, we want to check the accuracy of the model on data it hasn't seen before. Create a *validation set* by setting apart some examples from the original training data. 

Why not use the testing set now? Our goal is to develop and tune our model using only the training data, then use the test data just once to evaluate our accuracy. 

To save time, we will use only a part of the training data: the first 1,000 reviews for validation and the next 9,000 for training.

In [0]:
x_val = train_data[:1000]
partial_x_train = train_data[1000:10000]

y_val = train_labels[:1000]
partial_y_train = train_labels[1000:10000]

## Build the model

Keras can help us build a model quickly. The neural network is created by adding layers. However, you need to decide:
* How many layers?
* How many hidden units to use for each layer?

Let's build a simple model for the sentiment analysis problem!

In [0]:
model = keras.Sequential()
# Embedding layer maps each of the 10000 words to 100-d embeddings
model.add(Embedding(vocab_size, 100))
# Average the embeddings
model.add(GlobalAveragePooling1D())
# 1 Fully-connected layer
model.add(Dense(16, activation=tf.nn.relu))
# 2 Fully-connected layer
model.add(Dense(1, activation=tf.nn.sigmoid))

model.summary()

### Layers
In this example, the layers are linked sequentially, i.e., the output of the previous layer is sent to the next layer only.

The first layer is an `Embedding` layer. This layer takes a sequence of word IDs (integers) and looks up each ID in an embedding matrix to get the vector that represents it. **These vectors are learned as the model trains**. Note that this layer adds a dimension: it converts the 2D input of shape `(batch, sequence_len)` to a 3D output of shape `(batch, sequence_len, embedding_size)`.

---

The next layer, `GlobalAveragePooling1D`, averages over the **second dimension** (the sequence dimension). So a batch of sequences of embeddings with shape `(batch, sequence_len, embedding_size)` will be averaged to a shape `(batch, embedding_size)`. 
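The shape change can be illustrated with nested Python lists. This is a minimal sketch of averaging over the sequence dimension; `global_average_pooling_1d` is a hypothetical helper written for this example, not the Keras layer itself.

```python
def global_average_pooling_1d(batch):
    """Average over the sequence axis: (batch, seq_len, emb) -> (batch, emb)."""
    pooled = []
    for sequence in batch:
        seq_len = len(sequence)
        emb_size = len(sequence[0])
        # average each embedding dimension over all positions in the sequence
        pooled.append([sum(vec[d] for vec in sequence) / seq_len
                       for d in range(emb_size)])
    return pooled


# A batch of 2 sequences, each with 3 positions and 2-d embeddings.
batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]],
]
print(global_average_pooling_1d(batch))  # [[3.0, 4.0], [2.0, 2.0]]
```

Note that a plain average like this one also counts the vectors at padded positions.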

---

The last two layers are fully-connected (`Dense`) layers with 16 units and 1 unit, respectively.
The first fully-connected layer can be thought of as a feature reduction.
The final layer applies the `sigmoid` activation function, which outputs a value between 0 and 1 that acts as a probability or confidence level.

### Hidden units

The above model has several intermediate or "hidden" layers, between the input and output. The number of "units" (or neurons) is the dimension of the representational space for the layer. In other words, it is the amount of freedom the network is allowed when learning an internal representation.

If a model has more hidden units (a higher-dimensional representation space), and/or more layers, then the network can learn more complex representations. However, it makes the network more computationally expensive and may lead to learning unwanted patterns: patterns that improve performance on training data but not on the test data. This is called *overfitting*, and we'll explore it later.

### Loss function and optimizer

A model needs a loss function and an optimizer for training. Since this is a **binary** classification problem and the model outputs a probability (a single-unit layer with a `sigmoid` activation function), we'll use the `binary_crossentropy` loss function. 

This isn't the only choice for a loss function; you could, for instance, choose `mean_squared_error`. But, generally, `binary_crossentropy` is better for dealing with probabilities: it measures the "distance" between probability distributions, or in our case, between the ground-truth distribution and the predictions. 
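A small numeric comparison may help show the difference between the two losses. The sketch below computes both for a single positive example (`y_true = 1`) at a few hypothetical predicted probabilities; the function names are made up for this example.

```python
import math


def binary_crossentropy(y_true, p):
    """Cross-entropy for one example with label y_true and prediction p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def squared_error(y_true, p):
    """Squared error for the same example."""
    return (y_true - p) ** 2


# The true label is 1; the predictions go from confident-right to confident-wrong.
for p in (0.9, 0.5, 0.01):
    print(p, round(binary_crossentropy(1, p), 3), round(squared_error(1, p), 3))
```

The squared error can never exceed 1, while the cross-entropy keeps growing as the prediction becomes more confidently wrong, so very wrong predictions are penalized much more heavily.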


---


If you are handling a classification problem with more than two classes, you will need to set the final layer to have the same number of units as your classes. The model then outputs probabilities of each class (using a `softmax` activation function). Then, you will need to use the `categorical_crossentropy` loss function. 
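For the multi-class case, the following sketch shows what `softmax` and categorical cross-entropy compute for a single example. The three output scores are hypothetical, and the functions are plain-Python illustrations rather than the Keras implementations.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


logits = [2.0, 1.0, 0.1]                   # hypothetical scores for 3 classes
probs = softmax(logits)
loss = categorical_crossentropy([1, 0, 0], probs)   # the true class is class 0
print([round(p, 3) for p in probs], round(loss, 3))
```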

When you are dealing with regression problems (say, to predict the price of a house), you will need to use other loss functions such as `mean_squared_error`.


---


The design of Keras requires us to configure the loss and optimizer of the model together.

In [0]:
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

## Train the model

Train the model for *n* epochs in mini-batches of samples. Recall that this is *n* iterations over all samples in the training data. While training, monitor the model's loss and accuracy on the validation set:
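To see how epochs and mini-batches relate, here is a minimal sketch that counts the weight updates for 9,000 training samples, a batch size of 100, and 8 epochs. The generator `iterate_minibatches` is hypothetical, and unlike Keras it does not shuffle the data.

```python
def iterate_minibatches(num_samples, batch_size):
    """Yield (start, end) index pairs that cover all samples once (one epoch)."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)


num_samples = 9000    # size of the partial training set
batch_size = 100
epochs = 8

steps_per_epoch = len(list(iterate_minibatches(num_samples, batch_size)))
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)  # 90 720
```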

In [0]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=8,
                    batch_size=100,
                    validation_data=(x_val, y_val))

## Evaluate the model

Now let's see how the model performs on the test set. Two values are returned when calling the `.evaluate` function: **loss** (we defined it as binary cross-entropy) and **accuracy**. Keras reports the loss along with whatever metrics we passed to `model.compile`.


In [0]:
results = model.evaluate(test_data, test_labels)
print(results)

This fairly naive approach achieves an accuracy of 83%.

## Create a plot of accuracy and loss over time

`model.fit()` returns a `History` object that contains a dictionary with everything that happened during training. There are four entries: one for each monitored metric during training and validation. We can use these to plot the training and validation loss for comparison, as well as the training and validation accuracy. We will write a helper function to plot loss and accuracy of each epoch.

In [0]:
def plot_hist(history):
    history_dict = history.history
    # metric keys are 'acc'/'val_acc' in TF 1.x; newer versions use 'accuracy'/'val_accuracy'
    acc = history_dict['acc']
    val_acc = history_dict['val_acc']
    loss = history_dict['loss']
    val_loss = history_dict['val_loss']
    epochs = range(1, len(acc) + 1)
    # plot for loss
    plt.clf()   # clear figure
    # "bo" is for "blue dot"
    plt.plot(epochs, loss, 'bo', label='Training loss')
    # r is for "red solid line"
    plt.plot(epochs, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()
    # plot for accuracy
    plt.clf()   # clear figure
    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()

    plt.show()

plot_hist(history)

In this plot, the dots represent the training loss and accuracy, and the solid lines are the validation loss and accuracy.

Notice the training loss *decreases* with each epoch and the training accuracy *increases* with each epoch. This is expected because we designed the model to optimize for this goal.

However, the loss and accuracy on the validation data usually behave differently from those on the training data. They typically reach their best values (lowest loss, highest accuracy) after some epochs and then stop improving. 

Recall that we have talked about **overfitting**: the model performs much better on the training data than it does on new data. 
After this point, the model over-optimizes and learns representations *specific* to the training data that do not *generalize* to test data.

We could prevent overfitting by simply stopping the training after some epochs, guided by these plots. We could also apply a simple method that uses `Dropout`.
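Choosing when to stop can also be done programmatically instead of by eye. The sketch below picks the epoch with the lowest validation loss; the loss values are hypothetical, and in the notebook they would come from `history.history['val_loss']`.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


# Hypothetical validation losses: improving at first, then getting worse.
val_loss = [0.68, 0.52, 0.41, 0.36, 0.35, 0.37, 0.40, 0.44]
print(best_epoch(val_loss))  # 5
```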

## Deal with overfitting
A very straightforward method is to insert `Dropout` layers in between our previous layers.  

**Important**: in `Dropout(rate)`, `rate` is a float between 0 and 1 that indicates the fraction of the input units to **drop**. 

However, the lower-level `tf.nn.dropout` function in TensorFlow 1.x takes an argument `keep_prob`, which indicates the fraction to **keep**.
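The relation between the fraction to drop and the fraction to keep is easy to see in a plain-Python sketch of dropout at training time. The function `dropout` below is a hypothetical illustration: it zeroes each value with probability `rate` and rescales the survivors by `1 / keep_prob` so that the expected sum stays the same.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the rest by 1 / keep_prob."""
    keep_prob = 1.0 - rate          # fraction to keep = 1 - fraction to drop
    return [v / keep_prob if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)              # fixed seed so the run is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)                      # a mix of 0.0 and 2.0
```

Dropout is only applied during training; at evaluation time the layer passes its input through unchanged.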

In [0]:
# Clear previous model
model = None
K.clear_session()
model = keras.Sequential()
# Embedding layer maps each of the 10000 words to 100-d embeddings
model.add(Embedding(vocab_size, 100))
# Average the embeddings
model.add(GlobalAveragePooling1D())
# 1 Fully-connected layer
model.add(Dense(16, activation=tf.nn.relu))

# Dropout layer
model.add(Dropout(0.5))

# 2 Fully-connected layer
model.add(Dense(1, activation=tf.nn.sigmoid))
model.summary()

In [0]:
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=12,
                    batch_size=100,
                    validation_data=(x_val, y_val))
plot_hist(history)

Comparing the plots of the two models, it seems that `Dropout` is effective at reducing overfitting!


Also remember that when applying dropout, we usually need to train for more epochs.

In [0]:
results = model.evaluate(test_data, test_labels)
print(results)

## Build a recurrent model

You can easily change to a recurrent model in Keras. Simply replace the pooling and hidden `Dense` layers with an `LSTM` layer and that is it! 


---


Specifically, some parameters can be set for the LSTM cell.

```python
LSTM(hidden_units, dropout=0.0, recurrent_dropout=0.0)
```

The first parameter, `dropout`, applies dropout to the input features, and `recurrent_dropout` applies it to the previous output. Recall the connection between the previous output and the current input that we showed in the slides.

![dropout_difference](https://drive.google.com/uc?export=view&id=1kiiV6BvPalvGnA6zg3LioDHOMs44TBrr)

Let's build a recurrent model for the sentiment analysis problem!

In [0]:
model = None
K.clear_session()
model = keras.Sequential()
model.add(Embedding(vocab_size, 100))

# Add a recurrent layer
# model.add(LSTM(32, dropout=0.5, recurrent_dropout=0.5))
## or CuDNNLSTM
model.add(CuDNNLSTM(32))
##

model.add(Dense(1, activation='sigmoid'))
model.summary()

Again, we compile the model and train for some epochs. Observe the change of loss and accuracy compared with the previous model.

**Note**: because recurrent models train slowly, we train them for only a few epochs (8 in the next cell and 4 in the later ones). You can try changing that later! Also, you can use `CuDNNLSTM`, a GPU-optimized LSTM implementation in TensorFlow that only runs on a GPU. However, it does **not** support dropout itself.

In [0]:
model.compile(loss='binary_crossentropy',
              optimizer=tf.train.AdamOptimizer(),
              metrics=['accuracy'])
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=8,
                    batch_size=100,
                    validation_data=(x_val, y_val))

plot_hist(history)

Training is now much slower: **55us/step** for the simple model vs. **3ms/step** for the recurrent one, which is about **55** times slower.

And evaluate on the test set.

In [0]:
results = model.evaluate(test_data, test_labels)
print(results)

### Improvements to a recurrent model
First, we will try to make it *deeper* by adding more LSTMs. 

**We must change the previous layers!!**

By default, a recurrent cell only returns the **last** output. But for multiple layers of recurrent cells, we need the **entire** output. 

So, we have to set `return_sequences=True` for all recurrent layers except the last one.
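The difference between returning the last output and the entire output can be shown with a toy one-unit recurrence. This is only a sketch: `toy_rnn` is a hypothetical stand-in with no learned weights, not an LSTM.

```python
import math


def toy_rnn(inputs, return_sequences=False):
    """Run a toy recurrence h_t = tanh(0.5 * h_{t-1} + x_t) over the inputs."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = math.tanh(0.5 * h + x)
        outputs.append(h)
    # either one output per time step, or only the final one
    return outputs if return_sequences else outputs[-1]


sequence = [0.1, -0.3, 0.8, 0.2]
all_steps = toy_rnn(sequence, return_sequences=True)   # one value per time step
last_only = toy_rnn(sequence)                          # a single value

# A second recurrent layer needs a sequence as input, so it consumes all_steps.
stacked = toy_rnn(all_steps)
print(len(all_steps), last_only == all_steps[-1])      # 4 True
```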

In [0]:
model = None
K.clear_session()
model = keras.Sequential()
model.add(Embedding(vocab_size, 32))

model.add(LSTM(32, return_sequences=True))
model.add(LSTM(32, return_sequences=False))

model.add(Dense(1, activation='sigmoid'))
model.summary()

In [0]:
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=4,
                    batch_size=100,
                    validation_data=(x_val, y_val))
plot_hist(history)

Next, we will try a *bi-directional* RNN. It is also very easy: just wrap your LSTM layer in `Bidirectional()`.
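Conceptually, a bidirectional layer reads the sequence once from left to right and once from right to left, then joins the two results. The sketch below uses a hypothetical toy recurrence without weights; in Keras the two directions have their own weights, and by default the outputs are concatenated, so wrapping `LSTM(32)` gives 64 output features.

```python
import math


def run_rnn(inputs):
    """Toy recurrence h_t = tanh(0.5 * h_{t-1} + x_t); returns the last state."""
    h = 0.0
    for x in inputs:
        h = math.tanh(0.5 * h + x)
    return h


def bidirectional(inputs):
    """Read the sequence in both directions and concatenate the two results."""
    forward = run_rnn(inputs)
    backward = run_rnn(list(reversed(inputs)))
    return [forward, backward]


output = bidirectional([0.1, -0.3, 0.8])
print(len(output))  # 2
```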


In [0]:
model = None
K.clear_session()
model = keras.Sequential()
model.add(Embedding(vocab_size, 32))

model.add(Bidirectional(LSTM(32, return_sequences=False)))

model.add(Dense(1, activation='sigmoid'))
model.summary()

In [0]:
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=4,
                    batch_size=100,
                    validation_data=(x_val, y_val))
plot_hist(history)

Can you think of our next move? 

We can combine deep LSTM with bidirectional!

In [0]:
model = None
K.clear_session()
model = keras.Sequential()
model.add(Embedding(vocab_size, 32))
model.add(Bidirectional(LSTM(32, return_sequences=True)))
model.add(Bidirectional(LSTM(32, return_sequences=False)))
model.add(Dense(1, activation='sigmoid'))
model.summary()

In [0]:
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=4,
                    batch_size=100,
                    validation_data=(x_val, y_val))
plot_hist(history)

We can see that it is even slower than our previous model (about 3 times slower), but the loss and accuracy are better than previous models.

Remember that there is no guarantee that deeper or more complex models perform better than simple models, especially when you don't have enough data.

## Pretrained embeddings
Finally, we will try using **pretrained embeddings**.
Pretrained embeddings are not limited to recurrent models. They can be very useful in almost every NLP application. 

The reason? 

Instead of randomly initializing the word embeddings, we can train them on a huge amount of (unlabeled) data in an unsupervised fashion.


---

We will use the [GloVe embeddings](http://nlp.stanford.edu/projects/glove/) to initialize our embedding weights. 
The file format is 
```
word1  0.xx -0.xx ... 
word2  0.yy 0.yy ...
```
Here, we read the entire file and see which of those words are in our dictionary. If we find one, we will set its embedding accordingly. Other words that are not found in the pretrained embeddings can be initialized as all zeros or very small numbers.
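The parsing logic can be tried on a tiny in-memory example before running it on the real file. Everything in this sketch is hypothetical: the three sample lines, the 3-dimensional vectors, and the toy dictionary `toy_word_index`.

```python
import io

# Three made-up lines in the "word value value ..." format described above.
sample_file = io.StringIO(
    "the 0.1 -0.2 0.3\n"
    "movie 0.4 0.5 -0.6\n"
    "zzz 0.7 0.8 0.9\n"
)

toy_word_index = {"the": 4, "movie": 5}   # hypothetical word -> ID mapping
emb_dim = 3
# Start from all zeros; rows for words without a pretrained vector stay zero.
toy_matrix = [[0.0] * emb_dim for _ in range(6)]

for line in sample_file:
    parts = line.split()
    word = parts[0]
    vector = [float(x) for x in parts[1:]]
    if word in toy_word_index:            # "zzz" is not in our dictionary
        toy_matrix[toy_word_index[word]] = vector

print(toy_matrix[4])  # [0.1, -0.2, 0.3]
```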

In [0]:
## Download pretrained GloVe embeddings from kaggle
## !kaggle datasets download -d terenceliu4444/glove6b100dtxt
## !unzip glove6b100dtxt.zip

## Randomize pretrained embeddings matrix with very small numbers
pretrained_embedding_matrix = (np.random.rand(vocab_size, 100) - 0.5) / 1e4
## Initialize embeddings matrix to all zeros
# pretrained_embedding_matrix = np.zeros((vocab_size, 100))
## Load pretrained embeddings
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        # skip words outside our dictionary or beyond the vocabulary cut-off
        if word not in word_index or word_index[word] >= vocab_size:
            continue
        embs = np.asarray(values[1:], dtype='float32')
        pretrained_embedding_matrix[word_index[word]] = embs

`pretrained_embedding_matrix` now contains pre-trained embeddings and can be used to initialize the embedding layer.

An option is to set the embeddings to `trainable=False`, which stops them from being updated during training. 
However, this is not always useful as the pretrained embeddings may come from a **different dataset**. 

If you have a large amount of unlabeled data and a smaller amount of labeled data, both from the same source, you can consider setting the `trainable=False` flag. This has an additional benefit of **reducing the number of parameters**, which in turn reduces the amount of training data that you need!
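The size of that reduction can be worked out by hand for the simple model used in this tutorial (a 10,000 x 100 embedding, a 16-unit `Dense` layer, and a 1-unit output layer). The sketch below is only arithmetic; each `Dense` layer is counted as weights plus one bias per unit.

```python
vocab_size, emb_dim, hidden = 10_000, 100, 16

embedding_params = vocab_size * emb_dim        # embedding matrix
dense1_params = emb_dim * hidden + hidden      # weights + biases of Dense(16)
dense2_params = hidden * 1 + 1                 # weights + bias of Dense(1)

total = embedding_params + dense1_params + dense2_params
trainable_if_frozen = dense1_params + dense2_params   # embedding not updated

print(total, trainable_if_frozen)  # 1001633 1633
```

With the embedding frozen, only the two small `Dense` layers are left to train.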

We can test the effect of pretrained embeddings on our very simple model first.

In [0]:
model = None
K.clear_session()
model = keras.Sequential()

# Embedding layer with pretrained embeddings
model.add(Embedding(vocab_size, 100, weights=[pretrained_embedding_matrix]))

model.add(GlobalAveragePooling1D())
model.add(Dense(16, activation=tf.nn.relu))
model.add(Dense(1, activation=tf.nn.sigmoid))
model.summary()
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=8,
                    batch_size=100,
                    validation_data=(x_val, y_val))
plot_hist(history)

In [0]:
results = model.evaluate(test_data, test_labels)
print(results)

We can see that the model improved slightly from 83% to 84%. 

Next we will try it on the RNN.

In [0]:
model = None
K.clear_session()
model = keras.Sequential()
model.add(Embedding(vocab_size, 100, weights=[pretrained_embedding_matrix]))
model.add(CuDNNLSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=4,
                    batch_size=100,
                    validation_data=(x_val, y_val))
plot_hist(history)

Finally, we test its performance on the test set.

In [0]:
results = model.evaluate(test_data, test_labels)
print(results)

# Summary
What we learned today:
1. Basic preprocessing of NLP data
    * Tokenize words
    * Create a dictionary that maps words to unique IDs
    * Convert words to ID
    * Pad/truncate sequences to unified lengths
2. Building a model using Keras
    * Design model structure
    * Add layers
    * Define loss
    * Define optimizer
3. Training and evaluation
    * Create a plot to clearly observe training progress
    * Evaluate on test set
4. Improvements to the model
    * Dropout
    * Recurrent
    * Pretrained embeddings
    * Mixture of the above

You are now capable of building a deep learning model for NLP classification tasks! Please modify this codelab later to use the entire training data and construct different model structures. However, do not build overly large models because 1) they are not always effective and 2) they may not run on this cloud platform.

## Extension
Multi-class text classification using the [Reuters](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/reuters) dataset in TensorFlow 

# Appendix: How to set up Kaggle on Google Colab

In [0]:
!pip install -q kaggle

Download your Kaggle API key (`kaggle.json`) and upload it here.

In [0]:
from google.colab import files
k_config = files.upload()

In [0]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [0]:
!kaggle datasets download -d terenceliu4444/glove6b100dtxt
!unzip glove6b100dtxt.zip