## Processing raw IMDB text data
Let's take a look at the content of these text files. Whatever kind of data you're working with, always remember to inspect what it looks like before you start modeling it.

In [None]:
!cat aclImdb/train/pos/4077_10.txt

Next, prepare a validation set by setting apart 20% of the training text files in a new directory, aclImdb/val:
### Run only once

In [2]:
import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir/"val"
train_dir = base_dir/"train"

In [None]:
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(714).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)
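
The split logic in the loop above can be checked without touching the file system. The following is a minimal sketch on a made-up list of file names. The names and the helper `split_validation` are hypothetical and not part of the notebook; the seed and the 20% fraction are the ones used above.

```python
import random

def split_validation(files, val_fraction=0.2, seed=714):
    """Return (train_files, val_files) using a seeded shuffle."""
    # Sorting first is an addition in this sketch: it makes the result
    # independent of the order in which the files were listed.
    files = sorted(files)
    random.Random(seed).shuffle(files)
    num_val_samples = int(val_fraction * len(files))
    split_index = len(files) - num_val_samples
    return files[:split_index], files[split_index:]

# Hypothetical file names, for illustration only
fake_files = [f"{i}_7.txt" for i in range(100)]
train_files, val_files = split_validation(fake_files)
print(len(train_files), len(val_files))
```

`os.listdir` returns entries in arbitrary order, so a fixed seed alone may not give the same split on every machine; sorting before shuffling is one way to make it reproducible.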

Create three $Dataset$ objects for training, validation, and test.

In [1]:
from tensorflow import keras

batch_size = 32
train_ds = keras.preprocessing.text_dataset_from_directory("aclImdb/train",
                                                           batch_size=batch_size)
val_ds = keras.preprocessing.text_dataset_from_directory("aclImdb/val",
                                                         batch_size=batch_size)
test_ds = keras.preprocessing.text_dataset_from_directory("aclImdb/test",
                                                          batch_size=batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


Displaying the shapes and dtypes of the first batch.

In [2]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    # print("inputs[0]:", inputs[0])
    # print("targets[0]:", targets[0])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>


## Preprocessing words as a set: the bag-of-words approach
The simplest way to encode a piece of text for processing by a machine learning model is to discard order and treat it as a set of tokens. You could either look at individual words (unigrams), or try to recover some local order information by looking at groups of consecutive tokens (N-grams).
### Single words (unigrams) with binary encoding

In [3]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

text_vectorization = TextVectorization(max_tokens=20000,
                                       # encode output tokens as binary vectors
                                       output_mode="binary")

# prepare a dataset that only yields raw text inputs (no labels)
text_only_train_ds = train_ds.map(lambda x, y: x)

# use this dataset to index the dataset vocabulary, via the adapt() method.
text_vectorization.adapt(text_only_train_ds)

binary_1gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y)) 
binary_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
binary_1gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

Inspect the output of the binary unigram dataset:

In [7]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    
    # these vectors consist entirely of ones and zeros
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)
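
To make the binary encoding concrete, here is a minimal pure-Python sketch of what such a vector represents. It is not the layer's implementation: the helper `multi_hot` and the tiny vocabulary are hypothetical, and the sketch assumes that index 0 is reserved for out-of-vocabulary tokens.

```python
def multi_hot(tokens, vocabulary):
    """Encode a list of tokens as a 0/1 vector over `vocabulary`."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for token in tokens:
        # Tokens missing from the vocabulary all map to index 0 ("[UNK]")
        vector[index.get(token, 0)] = 1.0
    return vector

# Hypothetical six-entry vocabulary, for illustration only
vocabulary = ["[UNK]", "the", "cat", "sat", "on", "mat"]

print(multi_hot("the cat sat on the mat".split(), vocabulary))
# [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(multi_hot("the dog sat".split(), vocabulary))
# [1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
```

A word that appears twice ("the") still produces a single 1: the encoding records presence, not frequency or position.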


Let's write a reusable model-building function that we're going to be using in all of our experiments.
### Model-building utility

In [10]:
from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                 loss="binary_crossentropy",
                 metrics=["accuracy"])
    return model

Let's train and test

In [11]:
model = get_model()
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 20000)]           0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                320016    
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
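
The parameter counts in the summary can be reproduced by hand: a Dense layer has one weight per (input, unit) pair plus one bias per unit. A small sketch follows; the helper name `dense_params` is made up for this check.

```python
def dense_params(input_dim, units):
    """Number of trainable parameters of a Dense layer: weights plus biases."""
    return input_dim * units + units

hidden_layer = dense_params(20000, 16)  # the Dense(16) layer on 20,000 inputs
output_layer = dense_params(16, 1)      # the final Dense(1) layer
print(hidden_layer, output_layer, hidden_layer + output_layer)
# 320016 17 320033
```

Nearly all of the parameters sit in the first layer, because its size grows linearly with `max_tokens`.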


In [12]:
callbacks = [keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                            save_best_only=True)]

# Call cache() on the datasets to cache them in memory: the preprocessing then runs only once, during the first
# epoch, and the preprocessed texts are reused for the following epochs. This only works if the data is small
# enough to fit in memory.
model.fit(binary_1gram_train_ds.cache(),
         validation_data=binary_1gram_val_ds.cache(),
         epochs=10,
         callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.888


This gets us to a test accuracy of 88.8%. Next, let's try bigrams.
### Bigrams with binary encoding
"The cat sat on the mat" => {"the", "the cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the mat", "mat"}

The TextVectorization layer can be configured to return arbitrary N-grams by setting $ngrams=N$.
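
As an illustration of what "N-grams up to N" means, the sketch below rebuilds the bigram set for the example sentence in plain Python. The helper `ngrams_up_to` is hypothetical and uses simple lowercasing and whitespace splitting, which is an assumption and not necessarily how the layer standardizes text.

```python
def ngrams_up_to(text, n):
    """Return the set of all 1-grams ... n-grams of `text`."""
    tokens = text.lower().split()
    grams = set()
    for size in range(1, n + 1):
        # Slide a window of `size` tokens over the sentence
        for start in range(len(tokens) - size + 1):
            grams.add(" ".join(tokens[start:start + size]))
    return grams

bigram_set = ngrams_up_to("The cat sat on the mat", 2)
print(sorted(bigram_set))
```

The result contains the five distinct words plus the five pairs of adjacent words, ten entries in total.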

In [13]:
text_vectorization = TextVectorization(ngrams=2,
                                       max_tokens=20000,
                                       output_mode="binary")

Training and testing the binary bigram model

In [14]:
text_vectorization.adapt(text_only_train_ds)

binary_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
binary_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
binary_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

In [17]:
model = get_model()
# model.summary()
callbacks = [keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                            save_best_only=True)]
model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.894


89.4% test accuracy. Turns out local order is pretty important.
### Bigrams with TF-IDF encoding
We can also add a bit more information to this representation by counting how many times each word or N-gram occurs:

{"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1, "sat on": 1, "on": 1, "on the": 1, "the mat": 1, "mat": 1}
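
These counts can be reproduced with a few lines of plain Python. This is only a sketch: the helper `ngram_counts` is hypothetical and assumes lowercasing and whitespace tokenization.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every 1-gram ... n-gram occurring in `text`."""
    tokens = text.lower().split()
    counts = Counter()
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            counts[" ".join(tokens[start:start + size])] += 1
    return counts

counts = ngram_counts("The cat sat on the mat", 2)
print(counts["the"], counts["the cat"], counts["the mat"])
# 2 1 1
```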

Configuring the TextVectorization layer to return token counts

In [18]:
text_vectorization = TextVectorization(ngrams=2,
                                       max_tokens=20000,
                                       output_mode="count")

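TF-IDF stands for term frequency, inverse document frequency: a term's count in the current document is weighted down by how common the term is across the whole dataset.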
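
The idea can be sketched in a few lines. The formula below is one common smoothed variant of TF-IDF, chosen here for illustration; the exact weighting used by the TextVectorization layer may differ. The helper `tf_idf` and the toy corpus are hypothetical.

```python
import math

def tf_idf(term, document, corpus):
    """Term frequency times a smoothed inverse document frequency."""
    term_freq = document.count(term)
    doc_freq = sum(1 for doc in corpus if term in doc)
    # Smoothed IDF: a term present in every document gets a factor of exactly 1
    idf = math.log((1 + len(corpus)) / (1 + doc_freq)) + 1.0
    return term_freq * idf

# Toy corpus of three tokenized "documents", for illustration only
corpus = [["the", "cat", "sat"],
          ["the", "dog", "barked"],
          ["the", "cat", "purred"]]

weights = {term: tf_idf(term, corpus[0], corpus) for term in corpus[0]}
print(weights)
```

In the first document, "the" (present in all three documents) gets the lowest weight, "cat" (two documents) a higher one, and "sat" (one document) the highest.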

Configuring the TextVectorization layer to return TF-IDF-weighted outputs

In [20]:
text_vectorization = TextVectorization(ngrams=2,
                                       max_tokens=20000,
                                       output_mode="tf-idf")

Training and testing the TF-IDF bigram model

In [21]:
text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
tfidf_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
tfidf_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

model = get_model()
# model.summary()
callbacks = [keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                            save_best_only=True)]

model.fit(tfidf_2gram_train_ds.cache(),
         validation_data=tfidf_2gram_val_ds.cache(),
         epochs=10,
         callbacks=callbacks)
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.888


This gets us 88.8% test accuracy on the IMDB classification task: TF-IDF weighting does not seem to be helpful in this case. However, for many text classification datasets, it would be typical to see a one-percentage-point increase when using TF-IDF compared to plain binary encoding.
## 11.3.3 Preprocessing words as a sequence: the Sequence Model approach
To implement a sequence model, we'd start by representing our input samples as sequences of integer indices. Then we'd map each integer to a vector, to obtain vector sequences. Finally, we'd feed these sequences of vectors into a stack of layers that can cross-correlate features from adjacent vectors.
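
The first step, turning text into fixed-length sequences of integer indices, can be sketched in plain Python. The helpers `build_vocabulary` and `encode` are hypothetical; the sketch assumes index 0 is reserved for padding and index 1 for out-of-vocabulary words, and it pads or truncates every sequence to `max_length`.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Map the most frequent words to integer indices (0 = padding, 1 = unknown)."""
    counts = Counter(token for text in texts for token in text.lower().split())
    vocabulary = ["", "[UNK]"] + [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index for index, word in enumerate(vocabulary)}

def encode(text, word_index, max_length):
    """Convert text to a list of exactly `max_length` integer indices."""
    ids = [word_index.get(token, 1) for token in text.lower().split()]
    ids = ids[:max_length]                       # truncate long sequences
    return ids + [0] * (max_length - len(ids))   # pad short ones with zeros

word_index = build_vocabulary(["the cat sat on the mat", "the dog sat"], max_tokens=10)
print(encode("the bird sat", word_index, max_length=6))
# [2, 1, 3, 0, 0, 0]
```

Here "the" and "sat" are the two most frequent words, so they get indices 2 and 3; "bird" was never seen and maps to 1; the trailing zeros are padding.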
### A first practical example

In [4]:
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

max_length = 600
max_tokens = 20000
text_vectorization = TextVectorization(max_tokens=max_tokens,
                                      output_mode="int",
                                      output_sequence_length=max_length)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

The simplest tool available to convert our integer sequences to vector sequences is to one-hot encode the integers. On top of these one-hot vectors, we'll add a simple bidirectional LSTM.

In [5]:
import tensorflow as tf
from tensorflow.keras import layers

inputs = keras.Input(shape=(None, ), dtype="int64")
embedded = tf.one_hot(inputs, depth=max_tokens)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
             loss="binary_crossentropy",
             metrics=["accuracy"])
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
tf.one_hot (TFOpLambda)      (None, None, 20000)       0         
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                5128448   
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 65        
Total params: 5,128,513
Trainable params: 5,128,513
Non-trainable params: 0
_________________________________________________________________


Train first basic sequence model.

In [6]:
callbacks = [
    keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras",
                                    save_best_only=True)]
model.fit(int_train_ds, validation_data=int_val_ds,
          epochs=10, callbacks=callbacks)
model = keras.models.load_model("one_hot_bidir_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.865


A first observation: this model trains very slowly. This is because our inputs are quite large: each input sample is encoded as a matrix of size (600, 20000), that is, 12 million floats for a single movie review. Second, the model only gets to 86.5% test accuracy; it doesn't perform nearly as well as our binary unigram model (88.8%).
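
A little arithmetic shows where the cost comes from. The sketch below assumes the one-hot tensors are stored densely as float32, and uses the standard LSTM parameter count (four gates, each with input weights, recurrent weights and a bias); the helper `lstm_params` is made up for this check.

```python
max_length = 600
max_tokens = 20000

# Size of one one-hot encoded sample, assuming dense float32 storage
floats_per_sample = max_length * max_tokens
megabytes_per_sample = floats_per_sample * 4 / 1e6
print(floats_per_sample, megabytes_per_sample)
# 12000000 48.0

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (input_dim * units + units * units + units)

# A Bidirectional wrapper holds two independent copies of the LSTM
bidirectional_params = 2 * lstm_params(max_tokens, 32)
print(bidirectional_params)
# 5128448
```

The second number matches the `bidirectional` row of the model summary: almost all of the 5.1 million weights exist only to read the 20,000-dimensional one-hot vectors.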
### Word embeddings
What makes a good word-embedding space depends heavily on our task. The importance of certain semantic relationships varies from task to task. It's reasonable to *learn* a new embedding space with every new task. Fortunately, backpropagation makes this easy.

11.17 Instantiating an __Embedding__ layer

In [9]:
# takes at least two args: the number of possible tokens and the dimensionality of the embeddings (here 256)
embedding_layer = layers.Embedding(input_dim=max_tokens, output_dim=256)

When we instantiate an *Embedding* layer, its weights are initially random, just as with any other layer. During training, these word vectors are gradually adjusted via backpropagation, structuring the space into something the downstream model can exploit. Once fully trained, the embedding space will show a lot of structure, of a kind specialized for the specific problem for which we're training our model.
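
Conceptually, an embedding layer is a lookup table: one row of weights per token index. The pure-Python sketch below illustrates that view only; it is not how Keras implements the layer, and the sizes, the initialization range and the helper `embed` are arbitrary choices for illustration.

```python
import random

vocab_size, embedding_dim = 10, 4
rng = random.Random(0)

# One randomly initialized row per token index (training would adjust these)
embedding_matrix = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]

def embed(token_ids, matrix):
    """Replace each integer index by the corresponding row of the matrix."""
    return [matrix[token_id] for token_id in token_ids]

vectors = embed([4, 3, 2, 4], embedding_matrix)
print(len(vectors), len(vectors[0]))
# 4 4
```

The same index always yields the same vector (positions 0 and 3 above), and a sequence of length 4 becomes a 4 x 4 matrix instead of a 4 x 10 one-hot matrix.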

Model that uses an *Embedding* layer trained from scratch

In [10]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
# model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru.keras",
                                    save_best_only=True)]

model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.862


### Understanding padding and masking
We're using a bidirectional RNN: two RNN layers running in parallel, one processing the tokens in their natural order, and the other processing the same tokens in reverse. The RNN that looks at the tokens in their natural order will spend its last iterations seeing only vectors that encode padding, possibly for several hundred iterations if the original sentence was short. We need some way to tell the RNN that it should skip these iterations.

There's an API for that: masking. Let's inspect it.

In [13]:
embedding_layer = layers.Embedding(input_dim=10, output_dim=256, mask_zero=True)

some_input = [[4, 3, 2, 1, 0, 0, 0],
              [5, 4, 3, 2, 1, 0, 0],
              [2, 1, 0, 0, 0, 0, 0]]
mask = embedding_layer.compute_mask(some_input)

In [14]:
mask

<tf.Tensor: shape=(3, 7), dtype=bool, numpy=
array([[ True,  True,  True,  True, False, False, False],
       [ True,  True,  True,  True,  True, False, False],
       [ True,  True, False, False, False, False, False]])>
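
The mask is simply "token is not zero", element by element. A plain-Python sketch of the same computation follows, as an illustration rather than the layer's actual code; the `lengths` variable is an addition that shows how many real tokens each row has.

```python
some_input = [[4, 3, 2, 1, 0, 0, 0],
              [5, 4, 3, 2, 1, 0, 0],
              [2, 1, 0, 0, 0, 0, 0]]

# True where there is a real token, False where there is padding (index 0)
mask = [[token != 0 for token in row] for row in some_input]

# Number of unmasked time steps per sequence
lengths = [sum(row) for row in mask]
print(lengths)
# [4, 5, 2]
```

A layer that honors the mask only needs to process the first 4, 5 and 2 steps of these three sequences and can skip the padded remainder.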

Let's try retraining the model with masking enabled.

In [None]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens,
                            output_dim=256, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
# model.summary()
callbacks = [keras.callbacks.ModelCheckpoint("embeddings_bidir_gru_with_masking.keras",
                                             save_best_only=True)]
model.fit(int_train_ds, validation_data=int_val_ds,
          epochs=10, callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru_with_masking.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

### Using pretrained word embeddings
Skipped for now; I'll come back to this when necessary.