<a href="https://colab.research.google.com/github/rahiakela/data-learning-research-and-practice/blob/main/deep-learning-with-python-by-francois-chollet/11-deep-learning-for-text/01_words_representing_approach_sets_and_sequences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Word representation approaches: Sets and sequences

A much more problematic question,
however, is how to encode the way words are woven into sentences: word order.

The problem of order in natural language is an interesting one: unlike the steps of a timeseries, words in a sentence don’t have a natural, canonical order.

The simplest thing you could do is just discard order and
treat text as an unordered set of words—this gives you **bag-of-words models**.

You could also decide that words should be processed strictly in the order in which they appear, one at a time, like steps in a timeseries; you could then use **recurrent models**.

Finally, a hybrid approach is also possible: the **Transformer architecture** is technically order-agnostic, yet it injects word-position information into
the representations it processes, which enables it to simultaneously look at different parts of a sentence, while still being order-aware. 

Because they take into account word order, **both RNNs and Transformers are called sequence models.**

We’ll demonstrate each approach on a well-known text classification benchmark:
the IMDB movie review sentiment-classification dataset.

Let’s process the raw IMDB text data, just like you would do when approaching a new text-classification problem in the real world.





##Setup

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import numpy as np
import os, pathlib, shutil, random

Let’s start by downloading the dataset from the Stanford page and uncompressing it:

In [2]:
%%shell

curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xf aclImdb_v1.tar.gz

# delete unwanted file and subdirectory
rm -rf aclImdb/train/unsup
rm -rf aclImdb_v1.tar.gz





In [3]:
# take a look at the content of a few of these text files
!cat aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

Next, let’s download the GloVe word embeddings, which we’ll use later in the pretrained-embeddings section.

In [None]:
%%shell

wget http://nlp.stanford.edu/data/glove.6B.zip
unzip -q glove.6B.zip

##Preparing the IMDB movie reviews data

The extracted `aclImdb` directory contains `train` and `test` subdirectories, each split into `pos` and `neg`. For instance, the `train/pos/` directory contains a set of `12,500` text files, each of which
contains the text body of a positive-sentiment movie review to be used as training data.
The negative-sentiment reviews live in the `neg` directories.

In total, there are `25,000`
text files for training and another 25,000 for testing.
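As a quick sanity check, these counts can be verified from the file system before any files are moved. The following is a minimal sketch: the helper name `count_reviews` is introduced here for illustration only, and it assumes the archive was extracted to `aclImdb` in the working directory.

```python
import pathlib

def count_reviews(split_dir):
    """Return {category: number of .txt files} for one split directory."""
    split_dir = pathlib.Path(split_dir)
    return {
        category: len(list((split_dir / category).glob("*.txt")))
        for category in ("neg", "pos")
    }

base_dir = pathlib.Path("aclImdb")
if base_dir.exists():  # only meaningful once the archive has been extracted
    for split in ("train", "test"):
        print(split, count_reviews(base_dir / split))
```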

Next, let’s prepare a validation set by setting apart 20% of the training text files in a new directory, `aclImdb/val`:

In [6]:
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir/"val"
train_dir = base_dir/"train"

for category in ("neg", "pos"):
  os.makedirs(val_dir/category)
  files = os.listdir(train_dir/category)
  # Shuffle the list of training files using a seed, to ensure we get the same validation set every time we run the code
  random.Random(1337).shuffle(files)
  # Take 20% of the training files to use for validation
  num_val_samples = int(0.2 * len(files))
  val_files = files[-num_val_samples:]
  for fname in val_files:
    # Move the files to aclImdb/val/neg and aclImdb/val/pos
    shutil.move(train_dir/category/fname, val_dir/category/fname)

Remember how we used the `image_dataset_from_directory` utility to
create a batched Dataset of images and their labels from a directory structure? You can do the exact same thing for text files using the `text_dataset_from_directory` utility.

Let’s create three Dataset objects for training, validation, and testing:

In [7]:
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory("aclImdb/train", batch_size=batch_size)
val_ds = keras.utils.text_dataset_from_directory("aclImdb/val", batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", batch_size=batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


These datasets yield inputs that are TensorFlow `tf.string` tensors and targets that are `int32` tensors encoding the value “0” or “1.”

In [8]:
for inputs, targets in train_ds:
  print("inputs.shape:", inputs.shape)
  print("inputs.dtype:", inputs.dtype)
  print("targets.shape:", targets.shape)
  print("targets.dtype:", targets.dtype)
  print("inputs[0]:", inputs[0])
  print("targets[0]:", targets[0])
  break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b"Not an altogether bad start for the program -- but what a slap in the face to real law enforcement. The worst part of the series is that it attempts to bill itself as reality fare -- and is anything but. Men and women that dedicate their lives to the enforcement of laws deserve better than this. What is next, medical school in a minute? Charo performing lipo? Charles Grodin assisting on a hip replacement? C'mon...show a little respect. Even the citizens of Muncie are outing the program as staged. Police Academy = High School Gym? Poor editing (how many times can they use the car-to-car shot of the Taco Bell in the background?), cheesy siren effects (the same loop added ad nauseum to every 'call' whether rolling code or not), and last, but not least -- more officer safety issues than you could shake a stick at.<br /><br />If I want to see manufactured police wo

All set. Now let’s try learning something from this data.

##Processing words as a set: The bag-of-words approach

The simplest way to encode a piece of text for processing by a machine learning
model is to discard order and treat it as a set (a “bag”) of tokens. 

You could either look at individual words (unigrams), or try to recover some local order information by looking at groups of consecutive tokens (N-grams).

###Single words with binary encoding

If you use a bag of single words, the sentence “the cat sat on the mat” becomes:

```python
{"cat", "mat", "on", "sat", "the"}
```

The main advantage of this encoding is that you can represent an entire text as a single
vector, where each entry is a presence indicator for a given word. 

For instance,
using binary encoding (multi-hot), you’d encode a text as a vector with as many
dimensions as there are words in your vocabulary—with 0s almost everywhere and
some 1s for dimensions that encode words present in the text.
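To make the multi-hot idea concrete, here is a minimal plain-Python sketch. The function name `multi_hot` and the six-word vocabulary are assumptions made for illustration; a real vocabulary would be learned from the corpus, and this sketch ignores details such as punctuation stripping and the out-of-vocabulary slot.

```python
def multi_hot(text, vocabulary):
    """Encode text as a binary vector with one entry per vocabulary word."""
    tokens = set(text.lower().split())
    return [1.0 if word in tokens else 0.0 for word in vocabulary]

# Hypothetical toy vocabulary, chosen only for this example.
vocabulary = ["cat", "dog", "mat", "on", "sat", "the"]
print(multi_hot("the cat sat on the mat", vocabulary))
# [1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
```

Note that “the” occurs twice in the sentence but still contributes a single 1: the encoding records presence, not frequency or position.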

First, let’s process our raw text datasets with a TextVectorization layer so that they yield multi-hot encoded binary word vectors.

In [None]:
# Limit the vocabulary to the 20,000 most frequent words. In general, 20,000 is a reasonable vocabulary size for text classification.
text_vectorization = layers.TextVectorization(max_tokens=20000, output_mode="multi_hot") # Encode the output tokens as multi-hot binary vectors

# Prepare a dataset that only yields raw text inputs (no labels).
text_only_train_ds = train_ds.map(lambda x, y: x)

# Use that dataset to index the dataset vocabulary via the adapt() method
text_vectorization.adapt(text_only_train_ds)

In [None]:
# Prepare processed versions of our training, validation, and test dataset.
binary_1gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)

You can try to inspect the output of one of these datasets.

In [None]:
for inputs, targets in binary_1gram_train_ds:
  print("inputs.shape:", inputs.shape)
  print("inputs.dtype:", inputs.dtype)
  print("targets.shape:", targets.shape)
  print("targets.dtype:", targets.dtype)
  print("inputs[0]:", inputs[0])
  print("targets[0]:", targets[0])
  break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


Next, let’s write a reusable model-building function that we’ll use in all of our experiments.

In [None]:
def get_model(max_tokens=20000, hidden_dim=16):
  inputs = keras.Input(shape=(max_tokens, ))
  x = layers.Dense(hidden_dim, activation="relu")(inputs)
  x = layers.Dropout(0.5)(x)
  outputs = layers.Dense(1, activation="sigmoid")(x)
  model = keras.Model(inputs, outputs)
  model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
  return model

Finally, let’s train and test our model.

In [None]:
model = get_model()
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense (Dense)               (None, 16)                320016    
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________


In [None]:
callbacks = [keras.callbacks.ModelCheckpoint("binary_1gram.keras", save_best_only=True)]

We call `cache()` on the datasets to cache them in memory: this way, we will only do the preprocessing once, during the first epoch, and we’ll reuse the preprocessed texts for the following epochs. This can only be done if the data is small enough to fit in memory.

In [None]:
model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbefc443c90>

In [None]:
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Test acc: 0.888


This gets us to a test accuracy of 88.8%: not bad! Note that in this case, since the dataset
is a balanced two-class classification dataset (there are as many positive samples as
negative samples), the “naive baseline” we could reach without training an actual model
would only be 50%. 

Meanwhile, the best score that can be achieved on this dataset
without leveraging external data is around 95% test accuracy.

###Bigrams with binary encoding

Of course, discarding word order is very reductive, because even atomic concepts can
be expressed via multiple words: the term `United States` conveys a concept that is
quite distinct from the meaning of the words `states` and `united` taken separately.

For this reason, you will usually end up re-injecting local order information into your
bag-of-words representation by looking at N-grams rather than single words (most
commonly, bigrams).

With bigrams, our sentence becomes:

```python
{"the", "the cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the mat", "mat"}
```
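The bigram set above can be reproduced with a few lines of plain Python. This is only a sketch of the idea, with a hypothetical helper name `word_ngrams`; it splits on whitespace and does none of the standardization a real text-vectorization step would perform.

```python
def word_ngrams(text, max_n=2):
    """Return the set of all word N-grams of length 1..max_n in text."""
    tokens = text.lower().split()
    grams = set()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens over the sentence.
        for start in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[start:start + n]))
    return grams

print(sorted(word_ngrams("the cat sat on the mat")))
```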

The TextVectorization layer can be configured to return arbitrary N-grams: bigrams,
trigrams, etc. Just pass an `ngrams=N` argument as in the following listing.

In [None]:
# Limit the vocabulary to the 20,000 most frequent unigrams and bigrams.
text_vectorization = layers.TextVectorization(ngrams=2, max_tokens=20000, output_mode="multi_hot") # Encode the output tokens as multi-hot binary vectors

# Use that dataset to index the dataset vocabulary via the adapt() method
text_vectorization.adapt(text_only_train_ds)

In [None]:
# Prepare processed versions of our training, validation, and test dataset.
binary_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)

In [None]:
model2 = get_model()
model2.summary()

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_4 (Dense)             (None, 16)                320016    
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________


In [None]:
callbacks = [keras.callbacks.ModelCheckpoint("binary_2gram.keras", save_best_only=True)]

model2.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbefa128890>

In [None]:
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Test acc: 0.897


We’re now getting 89.7% test accuracy, an improvement of roughly one percentage point over the unigram model. Turns out local order is pretty important.

###Bigrams with TF-IDF encoding

You can also add a bit more information to this representation by counting how many
times each word or N-gram occurs, that is to say, by taking the histogram of the words
over the text:

```python
{"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1,
"sat on": 1, "on": 1, "on the": 1, "the mat": 1, "mat": 1}
```

If you’re doing text classification, knowing how many times a word occurs in a sample
is critical: any sufficiently long movie review may contain the word `terrible` regardless
of sentiment, but a review that contains many instances of the word `terrible` is
likely a negative one.
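The histogram shown earlier can be computed in plain Python with `collections.Counter`. This is a minimal sketch with a hypothetical helper name `ngram_counts`, using whitespace tokenization only.

```python
from collections import Counter

def ngram_counts(text, max_n=2):
    """Count every word N-gram of length 1..max_n in text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts

counts = ngram_counts("the cat sat on the mat")
print(counts["the"], counts["the cat"])  # 2 1
```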

Here’s how you’d count bigram occurrences with the TextVectorization layer.

```python
text_vectorization = layers.TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="count"  # Return per-token occurrence counts instead of 0/1 flags
)
```

Now, of course, some words are bound to occur more often than others no matter
what the text is about. The words `the,` `a,` `is,` and `are` will always dominate your
word count histograms, drowning out other words—despite being pretty much useless
features in a classification context. How could we address this?

The best practice is to go with something called **TF-IDF normalization**. TF-IDF stands for “term frequency,
inverse document frequency.”
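The intuition is that a term matters more the more often it appears in the current document, and less the more documents it appears in overall. The sketch below uses one common textbook variant (raw count times the log of the inverse document frequency); it is an assumption for illustration, and the exact formula used by any particular library, including Keras, may differ. The three-document corpus is made up.

```python
import math

def tf_idf(term, document, corpus):
    """Term frequency in `document` weighted by inverse document frequency."""
    term_freq = document.count(term)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0
    inverse_doc_freq = math.log(len(corpus) / docs_with_term)
    return term_freq * inverse_doc_freq

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    "the movie was terrible terrible".split(),
    "the movie was great".split(),
    "the plot was fine".split(),
]
print(tf_idf("the", corpus[0], corpus))       # 0.0: "the" appears in every document
print(tf_idf("terrible", corpus[0], corpus))  # 2 * log(3), about 2.197
```

Under this weighting, a word that appears everywhere contributes nothing, while a repeated, distinctive word such as “terrible” stands out.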

TF-IDF is so common that it’s built into the TextVectorization layer. All you need to do to start using it is to switch the `output_mode` argument to `"tf_idf"`.

In [None]:
# Limit the vocabulary to the 20,000 most frequent unigrams and bigrams.
text_vectorization = layers.TextVectorization(ngrams=2, max_tokens=20000, output_mode="tf_idf") # Encode the output tokens as TF-IDF-weighted vectors

# Use that dataset to index the dataset vocabulary via the adapt() method
text_vectorization.adapt(text_only_train_ds)

# Prepare processed versions of our training, validation, and test dataset.
# Note: the binary_2gram_* names are reused here, so from this point on they hold TF-IDF vectors.
binary_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)

model3 = get_model()
model3.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_6 (Dense)             (None, 16)                320016    
                                                                 
 dropout_3 (Dropout)         (None, 16)                0         
                                                                 
 dense_7 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________


In [None]:
callbacks = [keras.callbacks.ModelCheckpoint("tfidf_2gram.keras", save_best_only=True)]

model3.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbef9f70910>

In [None]:
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Test acc: 0.889


This gets us an `88.9%` test accuracy on the IMDB classification task: it doesn’t seem to
be particularly helpful in this case. However, for many text-classification datasets, it
would be typical to see a one-percentage-point increase when using `TF-IDF` compared
to plain binary encoding.

**Exporting a model that processes raw strings**

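In the preceding examples, text vectorization ran inside the `tf.data` pipeline (via `map()`), so the trained model expects inputs that are already vectorized. To export a standalone model that accepts raw strings, create a new model that reuses your TextVectorization layer and adds to it
the model you just trained: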

In [None]:
# One input sample would be one string
inputs = keras.Input(shape=(1, ), dtype="string")
# Apply text preprocessing
processed_inputs = text_vectorization(inputs)
# Apply the previously trained model
outputs = model(processed_inputs)
# Instantiate the end-to-end model
inference_model = keras.Model(inputs, outputs)

The resulting model can process batches of raw strings:

In [None]:
raw_text_data = tf.convert_to_tensor([["That was an excellent movie, I loved it."]])
predictions = inference_model(raw_text_data)
print(f"{float(predictions[0, 0]) * 100:.2f} percent positive")

90.86 percent positive


##Processing words as a sequence: The sequence model approach

The past few examples show that word order matters: **manual engineering of order-based features, such as bigrams, yields a nice accuracy boost**. 

Now remember: **the history of deep learning is that of a move away from manual feature engineering, toward letting models learn their own features from exposure to data alone.**

What if, instead of manually crafting order-based features, we exposed the model to raw word sequences and let it figure out such features on its own? 

**This is what sequence models are about.**

To implement a sequence model, you’d start by representing your input samples as
sequences of integer indices (one integer standing for one word). Then, you’d map each integer to a vector to obtain vector sequences. 

Finally, you’d feed these sequences of vectors into a stack of layers that could cross-correlate features from adjacent
vectors, such as a 1D convnet, an RNN, or a Transformer.
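The first step, turning text into fixed-length integer sequences, can be sketched in plain Python. The helper names `build_word_index` and `to_int_sequence` are assumptions for illustration. The sketch follows the convention of reserving index 0 for padding and index 1 for out-of-vocabulary words, and it tokenizes on whitespace only.

```python
from collections import Counter

def build_word_index(texts, max_tokens):
    """Map the most frequent words to integers, starting at 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Reserve index 0 for padding and index 1 for out-of-vocabulary words.
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: i + 2 for i, word in enumerate(most_common)}

def to_int_sequence(text, word_index, max_length):
    """Encode text as exactly max_length integers (truncate, then zero-pad)."""
    ids = [word_index.get(word, 1) for word in text.lower().split()]
    ids = ids[:max_length]
    return ids + [0] * (max_length - len(ids))

word_index = build_word_index(["the cat sat on the mat"], max_tokens=10)
print(to_int_sequence("the dog sat", word_index, max_length=5))
# [2, 1, 4, 0, 0]
```

Here “dog” was never seen when building the index, so it maps to 1, and the two trailing zeros are padding.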

For some time around 2016–2017, bidirectional RNNs (in particular, bidirectional
LSTMs) were considered to be the state of the art for sequence modeling.

However, nowadays sequence modeling is almost universally done with Transformers.

Oddly, one-dimensional convnets were never very popular in NLP, even though a residual stack of depthwise-separable 1D convolutions can often achieve comparable performance to a bidirectional
LSTM, at a greatly reduced computational cost.

###Preparing datasets for sequences

First, let’s prepare datasets that return integer sequences.

In [10]:
max_length = 600
max_tokens = 20000

# Prepare a dataset that only yields raw text inputs (no labels).
text_only_train_ds = train_ds.map(lambda x, y: x)

# Truncate or pad every review to 600 tokens. This is a reasonable choice, since the average review length is 233 words, and only 5% of reviews are longer than 600 words.
text_vectorization = layers.TextVectorization(max_tokens=max_tokens,
                                              output_mode="int",
                                              output_sequence_length=max_length)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)

###A simple bidirectional LSTM

The simplest way to convert our integer sequences to vector
sequences is to one-hot encode the integers (each dimension would represent one
possible term in the vocabulary). On top of these one-hot vectors, we’ll add a simple bidirectional LSTM.

In [None]:
# One input is a sequence of integers
inputs = keras.Input(shape=(None, ), dtype="int64")
# Encode the integers into binary 20,000-dimensional vectors
embedded = tf.one_hot(inputs, depth=max_tokens)
# add a bidirectional LSTM
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 tf.one_hot (TFOpLambda)     (None, None, 20000)       0         
                                                                 
 bidirectional (Bidirectiona  (None, 64)               5128448   
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 5,128,513
Trainable params: 5,128,513
Non-trainable params: 0
___________________________________________________

Now, let’s train our model.

In [None]:
callbacks = [keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras", save_best_only=True)]

model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)

model = keras.models.load_model("one_hot_bidir_lstm.keras")
print(f"Test accuracy: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/10
 46/625 [=>............................] - ETA: 3:32:57 - loss: 0.6931 - accuracy: 0.5088

A first observation: this model trains very slowly, especially compared to the lightweight model of the previous section. This is because our inputs are quite large: each input sample is encoded as a matrix of size `(600, 20000)` (600 words per sample, 20,000 possible words). That’s 12,000,000 floats for a single movie review. Our bidirectional LSTM has a lot of work to do.
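The size of the one-hot input is easy to check with a little arithmetic. The sketch below assumes 4-byte `float32` values and the batch size of 32 used earlier; the memory figures are rough estimates of the raw tensor size, not measurements.

```python
max_length = 600      # tokens per review
max_tokens = 20000    # vocabulary size, i.e. one-hot depth
batch_size = 32
bytes_per_float32 = 4

floats_per_review = max_length * max_tokens
megabytes_per_review = floats_per_review * bytes_per_float32 / 1e6
gigabytes_per_batch = megabytes_per_review * batch_size / 1e3

print(f"{floats_per_review:,} floats per review")       # 12,000,000
print(f"{megabytes_per_review:.0f} MB per review")      # 48 MB
print(f"{gigabytes_per_batch:.2f} GB per batch of 32")  # 1.54 GB
```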

Second, the model only gets to `87%` test accuracy—it doesn’t perform nearly as well as our (very fast) binary unigram model.

Clearly, using one-hot encoding to turn words into vectors, which was the simplest thing we could do, wasn’t a great idea. There’s a better way: **word embeddings**.

###Word embeddings

Word embeddings are vector representations of words that map human language into a structured geometric space.
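“Structured geometric space” means that geometric relationships between vectors are meant to reflect relationships between words. One common way to compare two vectors is cosine similarity. The sketch below uses hand-made three-dimensional vectors that are purely hypothetical, not real learned embeddings; they only illustrate how the comparison works.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, NOT real embeddings.
toy_embeddings = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "terrible": [-0.9, 0.1, 0.0],
}

print(cosine_similarity(toy_embeddings["good"], toy_embeddings["great"]))
print(cosine_similarity(toy_embeddings["good"], toy_embeddings["terrible"]))
```

With these toy values, “good” and “great” point in nearly the same direction, while “good” and “terrible” point in nearly opposite directions.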

There are two ways to obtain word embeddings:

- Learn word embeddings jointly with the main task, using an Embedding layer
- Load pretrained word embeddings into your model

####Learn word embeddings with embedding layer

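What makes a good word-embedding space depends heavily on your task: the ideal embedding space for a movie-review sentiment model may look different from the ideal space for a legal-document classifier, because the importance of certain semantic relationships varies from task to task. It’s thus reasonable to learn a new embedding space with every new task. Fortunately,
backpropagation makes this easy, and Keras makes it even easier. It’s about
learning the weights of a layer: the Embedding layer.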
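Conceptually, an embedding layer is a lookup table: row `i` of its weight matrix is the vector for token `i`. Looking up a row gives the same result as multiplying a one-hot vector by the weight matrix, without ever building the one-hot vector. The plain-Python sketch below illustrates this equivalence with a hypothetical 4-token vocabulary and made-up 2-dimensional weights.

```python
def one_hot(index, depth):
    """One-hot vector of length depth with a 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

def matvec(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

# Hypothetical weights: 4 tokens, 2 embedding dimensions.
embedding_weights = [
    [0.0, 0.0],
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

token = 2
by_lookup = embedding_weights[token]
by_matmul = matvec(one_hot(token, 4), embedding_weights)
print(by_lookup, by_matmul)  # [0.3, 0.4] [0.3, 0.4]
```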

In [None]:
# One input is a sequence of integers
inputs = keras.Input(shape=(None, ), dtype="int64")
# The Embedding layer takes at least two arguments: the number of possible tokens and the dimensionality of the embeddings (here, 256).
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
# add a bidirectional LSTM
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 256)         5120000   
                                                                 
 bidirectional (Bidirectiona  (None, 64)               73984     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 5,194,049
Trainable params: 5,194,049
Non-trainable params: 0
___________________________________________________

In [None]:
callbacks = [keras.callbacks.ModelCheckpoint("embeddings_bidir_gru.keras", save_best_only=True)]

model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)

model = keras.models.load_model("embeddings_bidir_gru.keras")
print(f"Test accuracy: {model.evaluate(int_test_ds)[1]:.3f}")

It trains much faster than the one-hot model (since the LSTM only has to process
256-dimensional vectors instead of 20,000-dimensional), and its test accuracy is comparable
(87%).
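The parameter counts in the two model summaries can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel, and a bias; the helper name `lstm_params` is introduced here for illustration.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM: 4 gates, each with input, recurrent, and bias terms."""
    return 4 * (input_dim + units + 1) * units

max_tokens, embedding_dim, units = 20000, 256, 32

one_hot_bilstm = 2 * lstm_params(max_tokens, units)      # two directions
embedding_table = max_tokens * embedding_dim
embedded_bilstm = 2 * lstm_params(embedding_dim, units)  # two directions

print(f"{one_hot_bilstm:,}")   # 5,128,448
print(f"{embedding_table:,}")  # 5,120,000
print(f"{embedded_bilstm:,}")  # 73,984
```

The two models have a similar total number of parameters, but in the embedding model most of them sit in a lookup table, and the LSTM itself only processes 256 values per timestep instead of 20,000.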

###Understanding padding and masking

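Because we set `output_sequence_length=max_length`, reviews shorter than 600 tokens are padded with zeros at the end. An RNN that reads such a sequence spends its last iterations on padding that carries no information, and the representation it has built up gradually fades. We need some way to tell the RNN to skip the iterations for padded values. There’s an API for that: masking.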

The Embedding layer is capable of generating a “mask” that corresponds to its
input data. This mask is a tensor of ones and zeros (or True/False booleans), of shape
(`batch_size`, `sequence_length`), where the entry `mask[i, t]` indicates whether timestep
`t` of sample `i` should be skipped or not (the timestep will be skipped if `mask[i, t]`
is 0 or False, and processed otherwise).

By default, this option isn’t active. You can turn it on by passing `mask_zero=True`
to your Embedding layer. You can retrieve the mask with the `compute_mask()` method:

In [None]:
embedding_layer = layers.Embedding(input_dim=10, output_dim=256, mask_zero=True)

In [None]:
some_input = [
   [4, 3, 2, 1, 0, 0, 0],
   [5, 4, 3, 2, 1, 0, 0],
   [2, 1, 0, 0, 0, 0, 0]           
]

In [None]:
mask = embedding_layer.compute_mask(some_input)
mask

<tf.Tensor: shape=(3, 7), dtype=bool, numpy=
array([[ True,  True,  True,  True, False, False, False],
       [ True,  True,  True,  True,  True, False, False],
       [ True,  True, False, False, False, False, False]])>

In practice, you will almost never have to manage masks by hand. Instead, Keras will
automatically pass on the mask to every layer that is able to process it (as a piece of
metadata attached to the sequence it represents). 

This mask will be used by RNN layers
to skip masked steps. If your model returns an entire sequence, the mask will also
be used by the loss function to skip masked steps in the output sequence.
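To see why skipping padded steps matters, consider a toy recurrence in plain Python. This is a conceptual sketch, not how Keras implements masking: the update rule `state = decay * state + value` and the helper name `final_state` are assumptions made for illustration.

```python
def final_state(sequence, mask=None, decay=0.5):
    """Run a toy recurrence over sequence, optionally skipping masked steps."""
    state = 0.0
    for t, value in enumerate(sequence):
        if mask is not None and not mask[t]:
            continue  # masked step: carry the previous state forward unchanged
        state = decay * state + value
    return state

sequence = [4, 3, 2, 1, 0, 0, 0]
mask = [token != 0 for token in sequence]

print(final_state(sequence, mask))  # 3.25
print(final_state(sequence))        # 0.40625
```

Without the mask, the three trailing zeros keep decaying the state after the last real token; with the mask, the state reached at the last real token is what comes out.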

In [None]:
# One input is a sequence of integers
inputs = keras.Input(shape=(None, ), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256, mask_zero=True)(inputs)
# add a bidirectional LSTM
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_2 (Embedding)     (None, None, 256)         5120000   
                                                                 
 bidirectional_1 (Bidirectio  (None, 64)               73984     
 nal)                                                            
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 5,194,049
Trainable params: 5,194,049
Non-trainable params: 0
_________________________________________________

In [None]:
callbacks = [keras.callbacks.ModelCheckpoint("embeddings_bidir_gru_with_masking.keras", save_best_only=True)]

model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)

model = keras.models.load_model("embeddings_bidir_gru_with_masking.keras")
print(f"Test accuracy: {model.evaluate(int_test_ds)[1]:.3f}")

This time we get to 88% test accuracy—a small but noticeable improvement.

###Using pretrained word-embeddings

Sometimes you have so little training data available that you can’t use your data alone
to learn an appropriate task-specific embedding of your vocabulary. 

In such cases,
instead of learning word embeddings jointly with the problem you want to solve, you
can load embedding vectors from a precomputed embedding space that you know is
highly structured and exhibits useful properties—one that captures generic aspects of
language structure.

Let’s use the GloVe word embeddings downloaded earlier. We parse the unzipped file (a .txt file) to build an index that maps words (as strings) to their vector representations.

In [9]:
path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:
  for line in f:
    word, coefs = line.split(maxsplit=1)
    coefs = np.fromstring(coefs, "f", sep=" ")
    embeddings_index[word] = coefs
print(f"Found {len(embeddings_index)} word vectors.")

Found 400000 word vectors.
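To make the file layout concrete: each line holds a word followed by its space-separated vector components. Here is a minimal sketch on an in-memory sample, assuming made-up 4-dimensional values rather than real GloVe vectors. The names `sample_lines`, `parse_glove_line`, and `toy_index` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Made-up lines in the GloVe text layout: "<word> <v1> <v2> ... <vN>".
# The numbers are illustrative only, not taken from the real file.
sample_lines = [
    "the 0.1 -0.2 0.3 0.4",
    "movie 0.5 0.6 -0.7 0.8",
]

def parse_glove_line(line):
    """Split one line into a (word, float32 vector) pair."""
    word, coefs = line.split(maxsplit=1)
    vector = np.array(coefs.split(), dtype="float32")
    return word, vector

# Build a small word -> vector index, mirroring embeddings_index.
toy_index = dict(parse_glove_line(line) for line in sample_lines)
print(toy_index["movie"].shape)  # (4,)
```

With the real file, the same split applies line by line; only the vector length (100 here) changes.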


Next, let’s build an embedding matrix that you can load into an Embedding layer. 

It must be a matrix of shape `(max_tokens, embedding_dim)`, where each row $i$ contains the `embedding_dim`-dimensional vector for the word of index $i$ in the reference word index (built during tokenization).

In [11]:
embedding_dim = 100

# Retrieve the vocabulary indexed by our previous TextVectorization layer
vocabulary = text_vectorization.get_vocabulary()
# Use it to create a mapping from words to their index in the vocabulary.
word_index = dict(zip(vocabulary, range(len(vocabulary))))

In [12]:
# Prepare a matrix that we’ll fill with the GloVe vectors.
embedding_matrix = np.zeros((max_tokens, embedding_dim))

for word, i in word_index.items():
  # Only indices below max_tokens have a row in the matrix.
  if i < max_tokens:
    embedding_vector = embeddings_index.get(word)
    # Fill row i with the vector for word i. Words not found in the embedding index stay all zeros.
    if embedding_vector is not None:
      embedding_matrix[i] = embedding_vector
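It can be useful to check how many vocabulary entries actually received a pretrained vector, since every miss leaves an all-zero row. The following is a rough sketch on toy data, not the notebook's real vocabulary: `toy_embeddings_index`, the 3-dimensional vectors, and the word list are all assumptions made for illustration. The first two vocabulary entries mimic what `TextVectorization.get_vocabulary()` returns in integer mode (padding, then the out-of-vocabulary token).

```python
import numpy as np

embedding_dim = 3
max_tokens = 5

# Toy vocabulary: index 0 is padding, index 1 is the OOV token.
vocabulary = ["", "[UNK]", "the", "movie", "zzyzx"]

# Hypothetical pretrained vectors (made-up values).
toy_embeddings_index = {
    "the": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "movie": np.array([0.4, 0.5, 0.6], dtype="float32"),
}

embedding_matrix = np.zeros((max_tokens, embedding_dim))
hits, misses = 0, 0
for i, word in enumerate(vocabulary[:max_tokens]):
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
        hits += 1
    else:
        misses += 1  # this row stays all zeros

print(f"Matched {hits} words, {misses} left as zeros")
```

A high miss count on a real vocabulary would suggest that the tokenization (casing, punctuation handling) does not line up with the one used to build the pretrained vectors.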

Finally, we use a Constant initializer to load the pretrained embeddings in an Embedding layer. 

So as not to disrupt the pretrained representations during training, we freeze
the layer via `trainable=False`:

In [13]:
embedding_layer = layers.Embedding(max_tokens, embedding_dim, 
                                   embeddings_initializer=keras.initializers.Constant(embedding_matrix),
                                   trainable=False, mask_zero=True)
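Conceptually, a frozen embedding layer behaves like a fixed lookup table: each integer token index selects one row of the matrix, and `mask_zero=True` marks positions holding index 0 as padding to be skipped downstream. The NumPy sketch below illustrates that idea under toy assumptions (a made-up 4 x 2 matrix and a single padded sequence); it is not how Keras implements the layer internally.

```python
import numpy as np

# Toy frozen embedding table: 4 tokens, 2 dimensions (made-up values).
embedding_matrix = np.array([
    [0.0, 0.0],  # index 0: padding
    [0.0, 0.0],  # index 1: out-of-vocabulary
    [0.1, 0.2],
    [0.3, 0.4],
])

# A batch containing one sequence, padded with zeros to length 4.
token_ids = np.array([[2, 3, 0, 0]])

# Row lookup: one vector per token index.
embedded = embedding_matrix[token_ids]

# The padding mask derived from the input: True where a real token is present.
mask = token_ids != 0

print(embedded.shape)  # (1, 4, 2)
print(mask.tolist())   # [[True, True, False, False]]
```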

We’re now ready to train a new model, identical to our previous model except that it uses the 100-dimensional pretrained GloVe embeddings instead of the 256-dimensional learned embeddings.

In [14]:
# Each input is a variable-length sequence of integer token indices
inputs = keras.Input(shape=(None, ), dtype="int64")
embedded = embedding_layer(inputs)
# Add a bidirectional LSTM on top of the frozen GloVe embeddings
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 100)         2000000   
                                                                 
 bidirectional (Bidirectiona  (None, 64)               34048     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 2,034,113
Trainable params: 34,113
Non-trainable params: 2,000,000
______________________________________________
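The parameter counts in the summary can be reproduced by hand, which also confirms that only the LSTM and the dense layer are trained. The sketch below assumes a vocabulary of 20,000 tokens; that value is inferred from the summary (2,000,000 embedding parameters divided by 100 dimensions), not read from the code in this section.

```python
max_tokens = 20000   # assumption: inferred from 2,000,000 / 100
embedding_dim = 100
lstm_units = 32

# Embedding: one embedding_dim-sized row per token (frozen here).
embedding_params = max_tokens * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params_one_direction = 4 * (embedding_dim + lstm_units + 1) * lstm_units
# Bidirectional wraps two independent LSTMs (forward and backward).
bidirectional_params = 2 * lstm_params_one_direction

# Dense: the concatenated 64-dimensional output maps to 1 unit, plus a bias.
dense_params = 2 * lstm_units * 1 + 1

trainable_params = bidirectional_params + dense_params
total_params = embedding_params + trainable_params

print(embedding_params)      # 2000000
print(bidirectional_params)  # 34048
print(dense_params)          # 65
print(trainable_params)      # 34113
print(total_params)          # 2034113
```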

In [None]:
callbacks = [keras.callbacks.ModelCheckpoint("glove_embeddings_sequence_model.keras", save_best_only=True)]

model.fit(int_train_ds, validation_data=int_val_ds, epochs=1, callbacks=callbacks)

model = keras.models.load_model("glove_embeddings_sequence_model.keras")
print(f"Test accuracy: {model.evaluate(int_test_ds)[1]:.3f}")

You’ll find that on this particular task, pretrained embeddings aren’t very helpful, because the dataset contains enough samples that it is possible to learn a specialized enough embedding space from scratch. 

However, leveraging pretrained embeddings can be very helpful when you’re working with a smaller dataset.