# Deep Learning

Deep learning was introduced to NLP tasks in the 2010s. Among the earliest approaches are RNNs and LSTMs, which work well on sequence-like input. In this notebook, we will experiment with two different LSTM models on our IMDB dataset. Then, as in Notebook 3, we will use pre-trained Word2Vec vectors as the input layer and see whether they help. Finally, we will try a CNN, a deep learning technique commonly used in computer vision.

In [53]:
%load_ext autoreload
%autoreload

from lib.dataset import download_tfds_imdb_as_tensor_subword_8k, download_tfds_imdb_as_text_tiny, download_tfds_imdb_as_text
from lib.deep_learning import run_lstm_pipeline, run_cnn_pipeline


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
dataset = download_tfds_imdb_as_text()
dataset_tiny = download_tfds_imdb_as_text_tiny()

# Basic LSTM

We will try two slightly different uses of LSTM for text classification:

- BiLSTMLastStateClassification - We use a bi-directional LSTM to encode a vector that represents the meaning of the whole sentence, then feed it to two fully connected layers to make a prediction. This model is very different from what we did in Notebooks 1 - 3 in the sense that it takes word order into account (all previous experiments were bag-of-words). For example, it can differentiate the classic pair `This movie is boring it is not good` and `This movie is good it is not boring`. Although bigrams can differentiate this particular pair, in the real world there are many cases where bigrams are not enough and you need 3-grams or 4-grams. Increasing n is not a solution because it makes the feature space increasingly sparse. This is where the LSTM comes in. The drawback of this model is that if the sentence is very long, the last output state may encode very little information about the first tokens of the sentence.

- BiLSTMPoolClassification - This is another use of LSTM. Instead of using only the output of the last LSTM state, we take the LSTM output for every token in the sentence, sum them up, and feed the result to two fully connected layers. Since we sum the output vectors of all tokens, we are partly back to the bag-of-words approach. However, it is not a pure bag-of-words. It differs from the bag-of-words models we used previously, e.g. the Word2Vec bag-of-words in Notebooks 2 - 3. In those models, the vector representing each token is static, i.e. the vector for `good` is always the same regardless of where it appears in the sentence. In this architecture, the vector representing each token is contextualized: it is aware of nearby words because it is the output of the LSTM, so the model can learn that the vector for `good` differs across contexts. Moreover, since we use the LSTM output of every token, the sum also includes the last state output used in the model above. Although that vector corresponds to the last token, it also represents the meaning of the whole sentence.
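
The difference between the two readouts and a plain bag-of-words can be illustrated with a toy example. The sketch below is not the code in `lib.deep_learning`: the 2-d embeddings in `EMB` are made up by hand, and `toy_recurrent_states` is a simple decaying recurrence standing in for a real (bi-directional) LSTM. It only aims to show that summing static vectors ignores word order, while both the last recurrent state and the sum of all recurrent states depend on it.

```python
from collections import Counter

# Hypothetical hand-picked 2-d embeddings (illustration only, not trained).
EMB = {
    "this": [1, 0], "movie": [0, 1], "is": [1, 1], "it": [0, 0],
    "not": [-2, 0], "good": [0, 2], "boring": [0, -2],
}


def static_sum(tokens):
    """Bag-of-words readout: sum of static vectors, word order is ignored."""
    return [sum(EMB[t][d] for t in tokens) for d in range(2)]


def toy_recurrent_states(tokens, decay=0.5):
    """Toy stand-in for an LSTM: h_t = decay * h_{t-1} + (1 - decay) * x_t."""
    h, states = [0.0, 0.0], []
    for t in tokens:
        h = [decay * h[d] + (1 - decay) * EMB[t][d] for d in range(2)]
        states.append(h)
    return states


def last_state(tokens):
    """Readout in the spirit of BiLSTMLastStateClassification."""
    return toy_recurrent_states(tokens)[-1]


def pooled_states(tokens):
    """Readout in the spirit of BiLSTMPoolClassification: sum over all states."""
    states = toy_recurrent_states(tokens)
    return [sum(s[d] for s in states) for d in range(2)]


a = "this movie is boring it is not good".split()
b = "this movie is good it is not boring".split()

print(Counter(a) == Counter(b))                      # same unigram counts
print(set(zip(a, a[1:])) == set(zip(b, b[1:])))      # bigrams differ
print(static_sum(a) == static_sum(b))                # static sum cannot tell them apart
print(last_state(a) == last_state(b))                # last state can
print(pooled_states(a) == pooled_states(b))          # pooled states can too
```

Running this prints `True`, `False`, `True`, `False`, `False`: the two sentences are identical as bags of words, but both recurrent readouts separate them.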

The BiLSTMLastStateClassification model is very similar to the official TensorFlow tutorial [here](https://www.tensorflow.org/tutorials/text/text_generation). However, we use words as features, to stay consistent with the other experiments in Notebooks 1 - 4, while the tutorial uses subwords. The model used in that tutorial has an F1 of 0.87.

In [23]:
run_lstm_pipeline(dataset, "BiLSTMLastStateClassification", embeddings_size=64, hidden_unit=64)


KeyboardInterrupt: 

In [54]:
run_lstm_pipeline(dataset, "BiLSTMPoolClassification", embeddings_size=64, hidden_unit=64)


KeyboardInterrupt: 

Discussion

# Basic LSTM with Pre-trained Word2Vec on Tiny Dataset

In this experiment, we will see whether initializing the embeddings with pre-trained Word2Vec vectors helps. As in Notebook 3, the idea is to see whether it is useful to introduce knowledge of the language to the model via pre-trained Word2Vec. Since the Word2Vec vectors have 300 dimensions, we have to change our embedding layer size to 300.
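
As a rough illustration of what "initializing the embeddings" can mean, the sketch below builds an embedding matrix with one row per vocabulary word. It is an assumption about the general technique, not the actual logic inside `run_lstm_pipeline`: the function name `build_embedding_matrix`, the toy 3-d vectors, and the random range for out-of-vocabulary words are all hypothetical. A plain dict stands in for the pre-trained model here.

```python
import random


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return one embedding row per vocabulary word.

    Words found in `pretrained` copy their pre-trained vector; all other
    words get small random values so they can still be learned in training.
    """
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
        else:
            matrix.append([rng.uniform(-0.05, 0.05) for _ in range(dim)])
    return matrix


# Hypothetical 3-d stand-in for the 300-d Word2Vec vectors.
toy_pretrained = {"good": [0.9, 0.1, 0.0], "boring": [-0.8, 0.2, 0.1]}
vocab = ["<pad>", "good", "boring", "imdb"]

matrix = build_embedding_matrix(vocab, toy_pretrained, dim=3)
for word, row in zip(vocab, matrix):
    print(word, row)
```

With the real model, `dim` would be 300 to match the Word2Vec vectors, which is why the embedding layer size has to change.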

**Prerequisite**

Download the [Google Word2Vec Model](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing) to this directory and run:

```
gunzip GoogleNews-vectors-negative300.bin.gz
```
If you already have the file or do not want to save it in this directory, you can either change the constants PRETRAINED_WV_MODEL_PATH and PRETRAINED_GLOVE_MODEL_PATH or create a symbolic link.
    
```
ln -s /path/to/your/word2vec ./GoogleNews-vectors-negative300.bin
```

In [41]:
# Load pre-trained word embeddings from disk (takes about 5 minutes to run).
import gensim
PRETRAINED_WV_MODEL_PATH = "./GoogleNews-vectors-negative300.bin"
word2vec = gensim.models.KeyedVectors.load_word2vec_format(PRETRAINED_WV_MODEL_PATH, binary=True)


In [None]:
run_lstm_pipeline(dataset, "BiLSTMLastStateClassification", embeddings_size=300, hidden_unit=10, word2vec=word2vec)


Epoch 1, Loss: 0.69, Accuracy: 0.55, Test Loss: 0.66, Test Accuracy: 0.61, Time: 388.91 s
Epoch 2, Loss: 0.56, Accuracy: 0.74, Test Loss: 0.49, Test Accuracy: 0.80, Time: 396.00 s
Epoch 3, Loss: 0.53, Accuracy: 0.75, Test Loss: 0.46, Test Accuracy: 0.81, Time: 395.56 s
Epoch 4, Loss: 0.38, Accuracy: 0.86, Test Loss: 0.45, Test Accuracy: 0.81, Time: 393.17 s
Epoch 5, Loss: 0.34, Accuracy: 0.88, Test Loss: 0.42, Test Accuracy: 0.82, Time: 395.48 s
Epoch 6, Loss: 0.29, Accuracy: 0.90, Test Loss: 0.40, Test Accuracy: 0.84, Time: 391.67 s


In [None]:
run_lstm_pipeline(dataset, "BiLSTMLastStateClassification", embeddings_size=300, hidden_unit=10)


# Basic CNN

CNNs are widely used in computer vision. However, NLP researchers have also adopted CNNs for many NLP tasks. A CNN differs from an LSTM in that it examines a few tokens within a window (size 5 in our experiment), then moves to the next window. It then pools (by averaging, in our experiment) the outputs of all windows into a single vector. Intuitively, the CNN should be better at capturing local semantics within the window.
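
The windowing and pooling steps can be sketched in plain Python. This is a minimal illustration, not the model built by `run_cnn_pipeline`: the function name `conv1d_average_pool`, the stride of 1, the absence of padding and bias, and the ReLU activation are all assumptions made for the example.

```python
def conv1d_average_pool(token_vectors, filters, window=5):
    """Slide a window over the tokens, apply each filter, then average.

    token_vectors: list of per-token embedding vectors.
    filters: list of filters; each filter holds `window` weight vectors
             with the same dimension as the embeddings.
    Returns one averaged activation per filter.
    """
    n_windows = len(token_vectors) - window + 1
    if n_windows < 1:
        raise ValueError("sequence is shorter than the window")
    pooled = []
    for filt in filters:
        activations = []
        for start in range(n_windows):
            total = 0.0
            for offset in range(window):
                vec = token_vectors[start + offset]
                total += sum(w * x for w, x in zip(filt[offset], vec))
            activations.append(max(0.0, total))  # ReLU (assumed activation)
        # Average pooling over all window positions.
        pooled.append(sum(activations) / n_windows)
    return pooled


# Seven toy tokens with 2-d embeddings [i, 1].
tokens = [[i, 1] for i in range(7)]
sum_first_dim = [[1, 0]] * 5    # adds up the first dimension in each window
negate_second = [[0, -1]] * 5   # always negative, so ReLU zeroes it out

print(conv1d_average_pool(tokens, [sum_first_dim, negate_second], window=5))
```

Here there are three windows; the first filter sees sums 10, 15, and 20 and averages them to 15.0, while the second filter is clipped to 0.0 everywhere, so the output is `[15.0, 0.0]`.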

This [paper](https://arxiv.org/pdf/1702.01923.pdf) presents a comparative study of CNNs and RNNs for NLP tasks.

In [5]:
run_cnn_pipeline(dataset, embeddings_size=64, hidden_unit=64, window=5)

Epoch 1, Loss: 0.69, Accuracy: 0.53, Test Loss: 0.69, Test Accuracy: 0.61, Time: 37.38 s
Epoch 2, Loss: 0.67, Accuracy: 0.68, Test Loss: 0.66, Test Accuracy: 0.72, Time: 35.38 s
Epoch 3, Loss: 0.62, Accuracy: 0.77, Test Loss: 0.60, Test Accuracy: 0.78, Time: 35.57 s
Epoch 4, Loss: 0.54, Accuracy: 0.82, Test Loss: 0.53, Test Accuracy: 0.82, Time: 35.42 s
Epoch 5, Loss: 0.47, Accuracy: 0.86, Test Loss: 0.47, Test Accuracy: 0.83, Time: 35.37 s
Epoch 6, Loss: 0.41, Accuracy: 0.88, Test Loss: 0.42, Test Accuracy: 0.86, Time: 35.76 s
Epoch 7, Loss: 0.36, Accuracy: 0.89, Test Loss: 0.39, Test Accuracy: 0.87, Time: 35.83 s
Epoch 8, Loss: 0.32, Accuracy: 0.90, Test Loss: 0.36, Test Accuracy: 0.88, Time: 35.73 s
Epoch 9, Loss: 0.29, Accuracy: 0.91, Test Loss: 0.34, Test Accuracy: 0.88, Time: 35.67 s
Epoch 10, Loss: 0.26, Accuracy: 0.92, Test Loss: 0.33, Test Accuracy: 0.88, Time: 35.74 s
Epoch 11, Loss: 0.24, Accuracy: 0.93, Test Loss: 0.31, Test Accuracy: 0.89, Time: 35.71 s
Epoch 12, Loss: 0.2