In [None]:
# Install TensorFlow
# !pip install -q tensorflow-gpu==2.0.0-beta1

try:
  # Colab only: the magic does not accept a trailing comment on the same line.
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
print(tf.__version__)

TensorFlow 2.x selected.
2.0.0-beta1


## Goal
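Demonstrate how to preprocess text with Keras: turn sentences into sequences of integers with `Tokenizer`, then pad or truncate them to a common length with `pad_sequences`.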

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

We create a dummy dataset with just three sentences and set a maximum vocabulary size of 20,000, which is usually a reasonable value. The Oxford Dictionary has almost 200,000 words, but only about 3,000 of them are in common use, and those 3,000 are often estimated to cover roughly 95% of most texts. So 20,000 is probably good enough.
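
One way to check whether a vocabulary size is large enough for a given corpus is to measure how many token occurrences the most frequent words cover. The following is a minimal sketch in plain Python; `vocab_coverage` and `toy_tokens` are hypothetical names introduced here for illustration, and the toy corpus is far too small to say anything about English in general.

```python
from collections import Counter


def vocab_coverage(tokens, vocab_size):
    """Fraction of token occurrences covered by the `vocab_size` most frequent words."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(vocab_size))
    return covered / len(tokens)


# Toy corpus for illustration only; substitute your own tokenized text.
toy_tokens = "i like eggs and ham i love chocolate and bunnies i hate onions".split()

for size in (1, 2, 10):
    print(size, round(vocab_coverage(toy_tokens, size), 3))
```

On a real dataset, coverage that is already close to 1.0 well below the chosen limit suggests the limit is not discarding much.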

We instantiate the `Tokenizer()` class, then call `tokenizer.fit_on_texts()` and pass in our list of sentences. 
Next we call `tokenizer.texts_to_sequences()` on the same list, which returns our sequences of integers. You can think of these two methods in terms of a scikit-learn feature transformer, which always has a `fit()` and a `transform()`: `fit_on_texts()` plays the role of `fit()` (it builds the vocabulary), and `texts_to_sequences()` plays the role of `transform()` (it applies that vocabulary). 

In [None]:
# Just a simple test
sentences = [
    "I like eggs and ham.",
    "I love chocolate and bunnies.",
    "I hate onions."
]

In [None]:
MAX_VOCAB_SIZE = 20000
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
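
To make the fit/transform analogy concrete, here is a rough pure-Python sketch of the two steps. `SketchTokenizer` is a hypothetical class written for this notebook, not the Keras implementation: it assumes a simplified splitting rule (lowercase, keep runs of letters, digits, and apostrophes) and ignores options such as `num_words`.

```python
import re
from collections import Counter


class SketchTokenizer:
    """Hypothetical stand-in for illustration; not the Keras Tokenizer."""

    def __init__(self):
        self.word_index = {}

    @staticmethod
    def _split(text):
        # Simplified assumption: lowercase, then keep runs of letters, digits, apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    def fit_on_texts(self, texts):
        # "fit": count words and assign indices starting at 1, most frequent first.
        counts = Counter(word for text in texts for word in self._split(text))
        # most_common() keeps first-seen order among words with equal counts.
        self.word_index = {
            word: index
            for index, (word, _) in enumerate(counts.most_common(), start=1)
        }

    def texts_to_sequences(self, texts):
        # "transform": look each word up in the fitted vocabulary, skipping unknown words.
        return [
            [self.word_index[word] for word in self._split(text) if word in self.word_index]
            for text in texts
        ]


sketch_sentences = [
    "I like eggs and ham.",
    "I love chocolate and bunnies.",
    "I hate onions.",
]

sketch = SketchTokenizer()
sketch.fit_on_texts(sketch_sentences)
sketch_sequences = sketch.texts_to_sequences(sketch_sentences)
print(sketch_sequences)
```

The split between the two methods matters in practice: the vocabulary is built once, typically on the training text, and then reused unchanged to transform any other text.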

Let's print out our list of sequences. On a text dataset of any practical size this would print far too much data, but with only three sentences it is reasonable. We can see that each sentence has been converted into a list of integers, each integer corresponding to a word. Importantly, note that the integers start counting from 1 and not from 0, as you might have expected. This is because index 0 is reserved for padding.

In [None]:
print(sequences)

[[1, 3, 4, 2, 5], [1, 6, 7, 2, 8], [1, 9, 10]]


You might be wondering how to tell which word corresponds to which integer. Luckily, the tokenizer object already stores a dictionary with this information. Let's print out the tokenizer's `word_index`, which is our word-to-index mapping. If you map each integer back to its word, you should recover the original sentences, apart from capitalization and punctuation, which the tokenizer removed.

In [None]:
# How to get the word to index mapping?
tokenizer.word_index

{'and': 2,
 'bunnies': 8,
 'chocolate': 7,
 'eggs': 4,
 'ham': 5,
 'hate': 9,
 'i': 1,
 'like': 3,
 'love': 6,
 'onions': 10}
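
The mapping can be checked by inverting the dictionary and decoding the sequences. The sketch below copies the `word_index` and `sequences` values printed above into literals so that it runs on its own; `index_word` and `decoded` are names chosen here for illustration.

```python
# Values copied from the outputs printed earlier in this notebook.
word_index = {
    'and': 2, 'bunnies': 8, 'chocolate': 7, 'eggs': 4, 'ham': 5,
    'hate': 9, 'i': 1, 'like': 3, 'love': 6, 'onions': 10,
}
sequences = [[1, 3, 4, 2, 5], [1, 6, 7, 2, 8], [1, 9, 10]]

# Invert the word -> index mapping into index -> word.
index_word = {index: word for word, index in word_index.items()}

# Decode each sequence back into a space-separated string.
decoded = [" ".join(index_word[i] for i in seq) for seq in sequences]
print(decoded)
```

The decoded sentences are lowercase and have no punctuation, which reflects what the tokenizer kept when it built the vocabulary.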

Let's try out `pad_sequences()` with all the default values. By default, the maximum sequence length is the length of the longest sequence, which is 5 here. The first two sentences already have length 5, so they receive no padding. The third sentence is padded at the beginning, because the default is `padding='pre'`.

In [None]:
# use the defaults
data = pad_sequences(sequences)
print(data)

[[ 1  3  4  2  5]
 [ 1  6  7  2  8]
 [ 0  0  1  9 10]]


If we explicitly pass in `maxlen=5`, we get the same answer. 

In [None]:
MAX_SEQUENCE_LENGTH = 5
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
print(data)

[[ 1  3  4  2  5]
 [ 1  6  7  2  8]
 [ 0  0  1  9 10]]


What happens if we set `padding='post'`? The first two sentences still have no padding because they are already of the maximum length. The third sentence now has 0s at the end instead of at the beginning. 

In [None]:
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
print(data)

[[ 1  3  4  2  5]
 [ 1  6  7  2  8]
 [ 1  9 10  0  0]]
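
Because index 0 is never assigned to a word, padded positions can be told apart from real tokens. As a small illustration in plain Python, the sketch below copies the post-padded output above into a literal and recovers each sentence's original length; `padded`, `mask`, and `lengths` are names introduced here.

```python
# Post-padded output copied from above.
padded = [
    [1, 3, 4, 2, 5],
    [1, 6, 7, 2, 8],
    [1, 9, 10, 0, 0],
]

# True where a position holds a real token, False where it holds padding.
mask = [[token != 0 for token in row] for row in padded]

# Number of real tokens per sentence (True counts as 1 when summed).
lengths = [sum(row) for row in mask]
print(lengths)
```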


What happens if we add too much padding? Let's set `maxlen=6`, which is longer than every sentence in our dataset. The first two sentences are padded with one 0 and the third with three 0s, so that every sequence has length 6. 

In [None]:
# too much padding
data = pad_sequences(sequences, maxlen=6)
print(data)

[[ 0  1  3  4  2  5]
 [ 0  1  6  7  2  8]
 [ 0  0  0  1  9 10]]


What happens if we set `maxlen` to a number smaller than the longest sequence? Let's try 4. In this case, the sequences that are too long are truncated by cutting off their beginnings, while the third sentence, which has only three tokens, is still padded. Truncating at the beginning makes sense as the default because an RNN typically pays more attention to the final values in a sequence.

In [None]:
# truncation
data = pad_sequences(sequences, maxlen=4)
print(data)

[[ 3  4  2  5]
 [ 6  7  2  8]
 [ 0  1  9 10]]


What happens if we set `maxlen=4` again, but this time set the `truncating` argument to `'post'`? In this case, the ends of the long sequences are cut off instead of the beginnings. The third sentence is unchanged because it is shorter than 4 and is only padded, never truncated.

In [None]:
data = pad_sequences(sequences, maxlen=4, truncating='post')
print(data)

[[ 1  3  4  2]
 [ 1  6  7  2]
 [ 0  1  9 10]]
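
The behaviour seen in the cells above can be summarized in a short pure-Python sketch. `pad_sketch` is a hypothetical helper written for this notebook that returns nested lists rather than a NumPy array; it is meant to mirror the outputs shown here, not to replace or describe the internals of `pad_sequences`.

```python
def pad_sketch(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length (illustrative only)."""
    if maxlen is None:
        # Default: use the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)

    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            if truncating == "pre":
                seq = seq[len(seq) - maxlen:]  # drop tokens from the beginning
            else:
                seq = seq[:maxlen]             # drop tokens from the end
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result


sequences = [[1, 3, 4, 2, 5], [1, 6, 7, 2, 8], [1, 9, 10]]

print(pad_sketch(sequences))
print(pad_sketch(sequences, maxlen=5, padding="post"))
print(pad_sketch(sequences, maxlen=6))
print(pad_sketch(sequences, maxlen=4))
print(pad_sketch(sequences, maxlen=4, truncating="post"))
```

Note that `padding` and `truncating` are independent: the last call truncates long sequences at the end while still padding the short one at the beginning.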
