##### Copyright 2018 The TensorFlow Authors.

This text classification tutorial trains a [recurrent neural network](https://developers.google.com/machine-learning/glossary/#recurrent_neural_network) on the [IMDB large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) for sentiment analysis.

Adapted from: https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/text_classification_rnn.ipynb

# Text classification with an RNN

NB: the leaderboard with all the best models can be found here: https://paperswithcode.com/sota/sentiment-analysis-on-imdb

In [None]:
import numpy as np

import tensorflow_datasets as tfds
import tensorflow as tf

In [None]:
import matplotlib.pyplot as plt


def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])

## Setup input pipeline


The IMDB large movie review dataset is a *binary classification* dataset—all the reviews have either a *positive* or *negative* sentiment.

Download the dataset using [TFDS](https://www.tensorflow.org/datasets). See the [loading text tutorial](https://www.tensorflow.org/tutorials/load_data/text) for details on how to load this sort of data manually.


In [None]:
dataset, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

train_dataset.element_spec

Initially this returns a dataset of `(text, label)` pairs:

In [None]:
for example, label in train_dataset.take(1):
  print('text: ', example.numpy())
  print('label: ', label.numpy())

In [None]:
###############
# Can you analyse and plot the distribution of the length of the reviews?
###############
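# Length of each review in characters. Calling str() on the raw bytes would
# also count the b'...' wrapper and the escape sequences.
length_train_examples = [len(text.numpy().decode('utf-8', errors='replace'))
                         for text, _ in train_dataset]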


In [None]:
plt.hist(length_train_examples);

Next shuffle the data for training and create batches of these `(text, label)` pairs:

In [None]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

In [None]:
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

In [None]:
for example, label in train_dataset.take(1):
  print('texts: ', example.numpy()[:3])
  print()
  print('labels: ', label.numpy()[:3])
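To build some intuition for what `shuffle(BUFFER_SIZE).batch(BATCH_SIZE)` does, here is a minimal pure-Python sketch. It assumes that a buffered shuffle keeps a fixed-size buffer, emits a random element from it, and refills the freed slot with the next input element. The helper names `buffered_shuffle` and `batched` are hypothetical and are not part of `tf.data`.

```python
import itertools
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Approximate shuffle that only ever holds `buffer_size` items in memory."""
    rng = random.Random(seed)
    iterator = iter(items)
    buffer = list(itertools.islice(iterator, buffer_size))
    for item in iterator:
        # Emit a random buffered item and refill its slot with the next one.
        slot = rng.randrange(len(buffer))
        yield buffer[slot]
        buffer[slot] = item
    # The input is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

def batched(items, batch_size):
    """Group consecutive items into lists of at most `batch_size`."""
    iterator = iter(items)
    while batch := list(itertools.islice(iterator, batch_size)):
        yield batch

toy_batches = list(batched(buffered_shuffle(range(10), buffer_size=4), batch_size=3))
print(toy_batches)
```

With a small buffer, an element can only move a limited distance from its original position, which is why a buffer size close to the dataset size gives a more thorough shuffle.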

## Create the text encoder

The raw text loaded by `tfds` needs to be processed before it can be used in a model. The simplest way to process text for training is using the `TextVectorization` layer. This layer has many capabilities, but this tutorial sticks to the default behavior. In particular, this will:
- make all lowercase
- strip punctuation
- tokenize with simple whitespace
- and finally, map a token to an integer

Create the layer, and pass the dataset's text to the layer's `.adapt` method:

In [None]:
VOCAB_SIZE = 10000
encoder = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)
# Then, the encoder will adapt on the training set.
encoder.adapt(train_dataset.map(lambda text, label: text))

The `.adapt` method sets the layer's vocabulary. Note that any token missing from this vocabulary (for example, a word that appears only in the test set) is assigned to the integer reserved for the unknown token, `[UNK]`.

Here are the first 20 tokens. After the padding and unknown tokens they're sorted by frequency:

In [None]:
vocab = np.array(encoder.get_vocabulary())
vocab[:20]

Once the vocabulary is set, the layer can encode text into indices. The tensors of indices are 0-padded to the longest sequence in the batch (unless you set a fixed output_sequence_length):

In [None]:
encoded_example = encoder(example)[:3].numpy()
encoded_example
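The default behavior described above can be approximated in a few lines of plain Python. This is only a sketch: it assumes that `string.punctuation` is close enough to the punctuation the layer strips, and `standardize`, `build_vocab` and `encode` are hypothetical helper names, not Keras APIs.

```python
import string
from collections import Counter

def standardize(text):
    # Lowercase, then drop ASCII punctuation (an approximation of the
    # layer's "lower_and_strip_punctuation" mode).
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    # Index 0 is reserved for padding and index 1 for unknown tokens;
    # the remaining slots go to the most frequent tokens.
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(texts, vocabulary):
    index = {tok: i for i, tok in enumerate(vocabulary)}
    rows = [[index.get(tok, 1) for tok in standardize(t).split()] for t in texts]
    longest = max(len(r) for r in rows)
    # Right-pad every row with 0 up to the longest sequence in the batch.
    return [r + [0] * (longest - len(r)) for r in rows]

toy_texts = ["Great movie, great cast, great fun!", "Terrible movie."]
toy_vocab = build_vocab(toy_texts, max_tokens=4)
toy_encoded = encode(toy_texts, toy_vocab)
print(toy_vocab)    # ['', '[UNK]', 'great', 'movie']
print(toy_encoded)  # [[2, 3, 2, 1, 2, 1], [1, 3, 0, 0, 0, 0]]
```

Because the vocabulary is capped at four entries here, "cast", "fun" and "terrible" all collapse to index 1, which already hints at why the encoding cannot be inverted exactly.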

With the default settings, the process is not completely reversible. There are two main reasons for that:

1. The default value for `TextVectorization`'s `standardize` argument is `"lower_and_strip_punctuation"`, so case and punctuation are lost.
2. The limited vocabulary size and the lack of a character-based fallback result in some unknown tokens.

In [None]:
for n in range(3):
  print("Original: ", example[n].numpy())
  print("Round-trip: ", " ".join(vocab[encoded_example[n]]))
  print()

# You can see with this cell that the generated sentence is not exactly the same as the original one.

## Create the model

The model is assembled from the following pieces:

1. This model can be built as a `tf.keras.Sequential`.

2. The first layer is the `encoder`, which converts the text to a sequence of token indices.

3. After the encoder is an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.

  This index-lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a `tf.keras.layers.Dense` layer.

4. A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input on the next timestep.

  The `tf.keras.layers.Bidirectional` wrapper can also be used with an RNN layer. This propagates the input forward and backwards through the RNN layer and then concatenates the final output.

  * The main advantage of a bidirectional RNN is that the signal from the beginning of the input doesn't need to be processed all the way through every timestep to affect the output.  

  * The main disadvantage of a bidirectional RNN is that you can't efficiently stream predictions as words are being added to the end.

5. After the RNN has converted the sequence to a single vector the two `layers.Dense` do some final processing, and convert from this vector representation to a single logit as the classification output.
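The claim in step 3, that an embedding lookup is equivalent to passing a one-hot vector through a dense layer, can be checked with a small NumPy sketch. The table below is random and purely illustrative, and the names `table`, `token_ids` and the toy sizes are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_embed_dim = 6, 4
# Hypothetical embedding table: one row (vector) per token index.
table = rng.normal(size=(toy_vocab_size, toy_embed_dim))

token_ids = np.array([2, 5, 2, 0])

# Embedding layer: a plain row lookup.
looked_up = table[token_ids]

# Equivalent dense computation: one-hot encode, then matrix-multiply.
one_hot = np.eye(toy_vocab_size)[token_ids]
via_matmul = one_hot @ table

print(looked_up.shape)                     # (4, 4)
print(np.allclose(looked_up, via_matmul))  # True
```

The lookup touches only four rows, whereas the matrix product multiplies by a mostly-zero matrix whose width grows with the vocabulary size.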

In [None]:
# Embedding lookup
embedding_layer = tf.keras.layers.Embedding(
    input_dim=len(encoder.get_vocabulary()),
    output_dim=64,
    # Use masking to handle the variable sequence lengths
    mask_zero=True,
    trainable=True
)

In [None]:
model = tf.keras.Sequential([
    encoder,
    embedding_layer,
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1)
])

The embedding layer [uses masking](https://www.tensorflow.org/guide/keras/masking_and_padding) to handle the varying sequence-lengths. All the layers after the `Embedding` support masking:

In [None]:
print([layer.supports_masking for layer in model.layers])

To confirm that this works as expected, evaluate a sentence twice. First, alone so there's no padding to mask:

In [None]:
# predict on a sample text without padding.

sample_text = ('The movie was cool. The animation and the graphics '
               'were out of this world. I would recommend this movie.')
predictions = model.predict(np.array([sample_text]))
print(predictions[0])

Now, evaluate it again in a batch with a longer sentence. The result should be identical:

In [None]:
# predict on a sample text with padding

padding = "the " * 2000
predictions = model.predict(np.array([sample_text, padding]))
print(predictions[0])
print(predictions[1])

In [None]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-3),
              metrics=['accuracy'])

## Train the model

In [None]:
NB_EPOCHS = 8

history = model.fit(train_dataset,
                    epochs=NB_EPOCHS,
                    validation_data=test_dataset)

In [None]:
test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

In [None]:
plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.ylim(None, 1)
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')
plt.ylim(0, None)

What do you observe? Can you comment?
Remember the bias/variance tradeoff!

Run a prediction on a new sentence:

If the prediction is >= 0.0, it is positive; otherwise it is negative.

In [None]:
sample_text = ('The movie was cool. The animation and the graphics '
               'were out of this world. I would recommend this movie.')
predictions = model.predict(np.array([sample_text]))
predictions
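The threshold is 0.0 rather than 0.5 because the last `Dense(1)` layer has no activation and the loss is configured with `from_logits=True`, so the model outputs a logit. A minimal sketch of the relationship between logits and probabilities (the helper names `sigmoid` and `label_from_logit` are hypothetical):

```python
import math

def sigmoid(logit):
    # Maps a logit in (-inf, +inf) to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))

def label_from_logit(logit):
    # Thresholding the logit at 0 is the same as thresholding
    # the probability at 0.5.
    return "positive" if logit >= 0.0 else "negative"

for toy_logit in (-2.0, 0.0, 1.5):
    print(toy_logit, round(sigmoid(toy_logit), 3), label_from_logit(toy_logit))
```

If you prefer to work with probabilities, apply a sigmoid to the model output first and only then compare against 0.5.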

In [None]:
prediction = model.predict(test_dataset)

In [None]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (classification_report,
                             confusion_matrix,
                             roc_auc_score)
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Making predictions on our model
# The model outputs logits, so the decision threshold is 0.0 (probability 0.5).
y_pred = (prediction > 0.0).ravel()
y_test = np.concatenate([label.numpy() for _, label in test_dataset])

report = classification_report(y_test, y_pred)
print(report)

roc_auc = roc_auc_score(y_test, prediction)
print("ROC AUC score:", roc_auc)

def plot_cm(labels, predictions, p=0.5):
    cm = confusion_matrix(labels, predictions)
    plt.figure(figsize=(5, 5))
    sns.heatmap(cm, annot=True, fmt="d")
    plt.title("Confusion matrix (non-normalized)")
    plt.ylabel("Actual label")
    plt.xlabel("Predicted label")

plot_cm(y_test, y_pred)

## Stack two or more LSTM layers

Keras recurrent layers have two available modes that are controlled by the `return_sequences` constructor argument:

If False it returns only the last output for each input sequence (a 2D tensor of shape (batch_size, output_features)). This is the default, used in the previous model.

If True the full sequences of successive outputs for each timestep is returned (a 3D tensor of shape (batch_size, timesteps, output_features)).

The interesting thing about using an RNN with return_sequences=True is that the output still has 3-axes, like the input, so it can be passed to another RNN layer, like this:

In [None]:
model = tf.keras.Sequential([
    encoder,
    embedding_layer,
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1)
])

Let's try the concatenation of two LSTMs!
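To make the two output modes concrete, here is a NumPy sketch of a plain tanh RNN (deliberately simpler than an LSTM, and without biases). The function `simple_rnn` and all weight names are assumptions made for this illustration. It shows why the first recurrent layer must return the full sequence before a second one can be stacked on top.

```python
import numpy as np

def simple_rnn(inputs, w_in, w_rec, return_sequences=False):
    """Plain tanh RNN over inputs of shape (batch, timesteps, features)."""
    batch, timesteps, _ = inputs.shape
    state = np.zeros((batch, w_rec.shape[0]))
    outputs = []
    for t in range(timesteps):
        # The previous output is fed back in at the next timestep.
        state = np.tanh(inputs[:, t, :] @ w_in + state @ w_rec)
        outputs.append(state)
    if return_sequences:
        return np.stack(outputs, axis=1)  # (batch, timesteps, units)
    return state                          # (batch, units)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 3))  # batch=2, timesteps=5, features=3
w_in1, w_rec1 = rng.normal(size=(3, 4)), rng.normal(size=(4, 4))
w_in2, w_rec2 = rng.normal(size=(4, 2)), rng.normal(size=(2, 2))

seq = simple_rnn(x, w_in1, w_rec1, return_sequences=True)
last = simple_rnn(seq, w_in2, w_rec2)
print(seq.shape, last.shape)  # (2, 5, 4) (2, 2)
```

The second call only works because `seq` still has a timestep axis; feeding it the 2D output of the default mode would fail when unpacking the shape.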

## CNN on text!

As a bonus question, you can try replacing the LSTM with a CNN. CNNs also work well on text! More precisely, they work well for tasks that do not depend on long-distance relationships between parts of the text, which is often the case for classification.

Be careful: this time you will use `Conv1D` (not `Conv2D`), since the convolution slides only along the sentence (and not along the embedding dimension). You will also need `GlobalMaxPooling1D` at the end to reduce each feature map to a single value.

In [None]:
model = tf.keras.Sequential([
    encoder,
    embedding_layer,
    tf.keras.layers.Conv1D(32, kernel_size=8, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dropout(.2),
    tf.keras.layers.Dense(1)
])
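The following NumPy sketch illustrates what a 1D convolution followed by global max pooling computes on a single sentence. It is a simplified stand-in (no bias, "valid" padding, one sentence at a time), and `conv1d_valid` and `global_max_pool` are hypothetical helper names rather than Keras functions.

```python
import numpy as np

def conv1d_valid(x, kernels):
    """x: (timesteps, embed_dim); kernels: (kernel_size, embed_dim, filters)."""
    kernel_size, _, filters = kernels.shape
    steps = x.shape[0] - kernel_size + 1
    out = np.empty((steps, filters))
    for t in range(steps):
        window = x[t:t + kernel_size]  # (kernel_size, embed_dim)
        # Each filter spans the full embedding dimension and slides
        # along the time axis only.
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)  # ReLU

def global_max_pool(feature_maps):
    # Keep the strongest response of each filter over all positions.
    return feature_maps.max(axis=0)

rng = np.random.default_rng(0)
sentence = rng.normal(size=(20, 8))   # 20 tokens, 8-dim embeddings
kernels = rng.normal(size=(3, 8, 5))  # kernel_size=3, 5 filters

maps = conv1d_valid(sentence, kernels)
pooled = global_max_pool(maps)
print(maps.shape, pooled.shape)  # (18, 5) (5,)
```

Each filter acts like a detector for a short n-gram pattern, and the pooling step records only whether that pattern appeared somewhere in the sentence, not where.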

# Transfer learning on text

We'll use pretrained word embeddings called GloVe!
More info here: https://nlp.stanford.edu/projects/glove/

Next, we compute an index mapping words to known embeddings, by parsing the data dump of pre-trained embeddings:



In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

In [None]:
import os
import zipfile

local_zip = 'glove.6B.zip'
# The context manager closes the archive even if extraction fails.
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall('/tmp')

In [None]:
import numpy as np

embeddings_index = {}
# Each line holds a word followed by its vector components, separated by spaces.
with open(os.path.join("/tmp", 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

At this point we can use our `embeddings_index` dictionary and our `vocab` array to compute the embedding matrix:

In [None]:
embeddings_dim = 100

embedding_matrix = np.zeros((len(vocab), embeddings_dim))
for i, word in enumerate(vocab):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
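Before using such a matrix, it can be useful to check how much of the vocabulary actually received a pretrained vector. The sketch below repeats the same construction on toy data and then measures coverage. `toy_index`, `toy_vocab` and the 3-dimensional vectors are made up for this example and are not real GloVe values.

```python
import numpy as np

toy_index = {"good": np.ones(3), "bad": -np.ones(3)}
toy_vocab = ["", "[UNK]", "good", "bad", "meh"]

toy_matrix = np.zeros((len(toy_vocab), 3))
for i, word in enumerate(toy_vocab):
    vector = toy_index.get(word)
    if vector is not None:
        toy_matrix[i] = vector

# Rows that are still all-zero had no pretrained vector.
missing = [w for w, row in zip(toy_vocab, toy_matrix) if not row.any()]
coverage = 1 - len(missing) / len(toy_vocab)
print(missing)   # ['', '[UNK]', 'meh']
print(coverage)
```

The padding and unknown tokens are always expected to be missing; a long list of other missing words would suggest a mismatch between the tokenization and the pretrained vocabulary.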

💡 Let's compute the cosine similarity between vectors, to observe the semantic relationships between words!

In [None]:
# 💡 compute cosine similarity: let's try different words!
A = embeddings_index["car"]
B = embeddings_index["truck"]

cosine = np.dot(A,B)/(np.linalg.norm(A)*np.linalg.norm(B))
print("Cosine Similarity:", cosine)

# Note, however, that these embeddings are learned from the contexts in which
# words appear, so they do not capture opposites very well. For example, the verb
# "like" is quite similar to the verb "hate", even though their meanings are opposite.
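The same cosine formula can be turned into a small nearest-word search. This is a sketch on hand-made 3-dimensional vectors, not real GloVe values, and `cosine_similarity` and `most_similar` are hypothetical helper names. With the real data you would pass `embeddings_index` instead of `toy_index`, at the cost of scanning the whole dictionary.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(word, index, top_k=2):
    """Rank every other word in `index` by cosine similarity to `word`."""
    query = index[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in index.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hand-made vectors, purely illustrative.
toy_index = {
    "car":    np.array([1.0, 0.9, 0.0]),
    "truck":  np.array([0.9, 1.0, 0.1]),
    "banana": np.array([0.0, 0.1, 1.0]),
}
ranking = most_similar("car", toy_index)
print(ranking)
```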

We load this embedding matrix into an Embedding layer.

Note that we can set trainable=False to prevent the weights from being updated during training.



In [None]:
embedding_layer = tf.keras.layers.Embedding(len(vocab),
                                            embeddings_dim,
                                            weights=[embedding_matrix],
                                            mask_zero=True,
                                            trainable=False)

💡 Now plug this pretrained embedding layer into the network and check whether the network learns faster! You can try with and without freezing the embedding layer.

# Some insights on solution

You should notice that with a simple embedding layer and a bi-LSTM, the model overfits a lot: the accuracy on the training set is very high, while the validation accuracy starts decreasing after a few epochs.

You can limit overfitting by using pretrained word embeddings, freezing them (with `trainable=False`), and training only the classification part. You can stack two bi-LSTMs and add some dropout: you'll see that the model may be a little worse, but it no longer overfits.

# Try unsupervised techniques on word embeddings!

You can do a lot of things without any training, just based on some pre-trained word embeddings.
For example, you can do topic modelling or even text classification. Let's try text classification!

In [None]:
from scipy import spatial
from sklearn import neighbors

embeddings_dim = 100

The goal will be to classify a text into one of the following 5 categories:

In [None]:
label_names = ['business', 'entertainment', 'politics', 'sport', 'technology']

In [None]:
def get_centroid(vectors):
  """ Will compute the average of the input vectors.
  Returns 1 vector of dimension embeddings_dim.
  """
  return np.mean(vectors, axis=0)

def embed(text):
  """ Given a text, returns the list of vector embeddings from each word of
      the text.
      The output array is of dimension (Nb_words X embeddings_dim)
  """
  vectors = []
  # The GloVe 6B vocabulary is lowercase, so lowercase the text before the lookup.
  for token in text.lower().split():
    if embeddings_index.get(token) is not None:
        vectors.append(embeddings_index[token])
  return np.asarray(vectors)

In [None]:
# Embed each class name and average its word vectors into one vector per class.
label_vectors = np.asarray([get_centroid(embed(label)) for label in label_names])
label_vectors.shape

The goal will be to assign a text to its class using the nearest neighbor algorithm:
1. Let's take the average of all word embeddings of the text
2. Find its nearest neighbor amongst the 5 classes

In [None]:
neigh = neighbors.NearestNeighbors(
         n_neighbors=5,
         metric=spatial.distance.cosine)

neigh.fit(label_vectors)

In [None]:
my_text = "I like theater !"
text_embedding = embed(my_text)
centroid = get_centroid(text_embedding)

In [None]:
for label in neigh.kneighbors([centroid], return_distance=False)[0]:
  print(label_names[label])
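The whole procedure (average the word vectors, then rank the class names by cosine distance) also fits in a few lines of NumPy without scikit-learn. The sketch below uses made-up 2-dimensional vectors and a hypothetical `classify` helper; it assumes that at least one word of the text has a vector.

```python
import numpy as np

def classify(text, index, labels):
    """Return `labels` sorted from nearest to farthest in cosine distance."""
    tokens = [t for t in text.lower().split() if t in index]
    # Centroid of the word vectors found in the text.
    centroid = np.mean([index[t] for t in tokens], axis=0)

    def distance(label):
        v = index[label]
        cosine = np.dot(centroid, v) / (np.linalg.norm(centroid) * np.linalg.norm(v))
        return 1.0 - cosine

    return sorted(labels, key=distance)

# Hand-made vectors, purely illustrative (not real GloVe values).
toy_index = {
    "sport":    np.array([1.0, 0.0]),
    "politics": np.array([0.0, 1.0]),
    "football": np.array([0.9, 0.1]),
    "match":    np.array([0.8, 0.3]),
}
toy_ranking = classify("Football match tonight", toy_index, ["sport", "politics"])
print(toy_ranking)  # ['sport', 'politics']
```

Words without a vector (here "tonight") are simply ignored, so very short texts or texts made only of rare words give an unreliable centroid.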