Now that we have built a character-level model, it's time to look at word-level models and tackle a common
natural language processing task: sentiment analysis. In the process, we will learn how to handle sequences
of variable length using masking.

In [2]:
# Importing Libraries
import tensorflow as tf
from tensorflow import keras
import numpy as np
import tensorflow_datasets as tfds

X_train consists of a list of reviews, each of which is represented as a NumPy array of integers:
    1. each integer represents a word
    2. words are indexed by frequency (so low integers correspond to frequent words)
    3. the integers 0, 1, and 2 are special: they represent the padding token, the start-of-sequence (SOS) token, and unknown words
    4. the labels in y_train are 0 for a negative review and 1 for a positive review
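
To make this encoding concrete, here is a minimal sketch that builds the same kind of frequency-ranked IDs on a tiny made-up corpus. The corpus and the names `reviews`, `encode`, and `OFFSET` are assumptions for illustration; this is not how the Keras dataset itself was produced.

```python
from collections import Counter

# Toy corpus standing in for the IMDb reviews (hypothetical data)
reviews = [
    "the film was great",
    "the film was bad",
    "the plot was thin",
]

# Rank words by frequency: the most frequent word gets rank 1
counts = Counter(word for review in reviews for word in review.split())
word_index = {word: rank
              for rank, (word, _) in enumerate(counts.most_common(), start=1)}

PAD, SOS, UNK = 0, 1, 2   # reserved IDs
OFFSET = 3                # real words are shifted past the reserved IDs

def encode(review):
    """Encode a review as a list of integer IDs, prefixed with the SOS token."""
    ids = [SOS]
    for word in review.split():
        rank = word_index.get(word)
        ids.append(UNK if rank is None else rank + OFFSET)
    return ids

encoded = encode("the film was brilliant")
print(encoded)
```

The word "brilliant" never appears in the toy corpus, so it is encoded as the unknown-word ID, while the frequent words get the lowest non-reserved IDs.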

In [3]:
# Load the IMDb reviews
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()
X_train[0][:10]

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz




[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

In [4]:
# Google's SentencePiece project provides an open source implementation of subword tokenization, described in a paper by Taku Kudo and John Richardson

In [5]:
# If you want to visualize a review, you can decode it like this
word_index = keras.datasets.imdb.get_word_index()
id_to_word = {id_ + 3: word for word, id_ in word_index.items()}  # +3 because IDs 0, 1, and 2 are reserved
for id_, token in enumerate(('<pad>', '<sos>', '<unk>')):  # id_ takes the values 0, 1, 2
    id_to_word[id_] = token

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


In [6]:
' '.join([id_to_word[id_] for id_ in X_train[0][:10]])

'<sos> this film was just brilliant casting location scenery story'
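
If you decode reviews often, a small helper can be handy. The sketch below is one possible version: it uses a toy mapping (`toy_id_to_word` is hypothetical, not the real IMDb index), skips padding, and falls back to `<unk>` for any ID that is missing from the mapping instead of raising a KeyError.

```python
# Toy mapping standing in for id_to_word (hypothetical entries)
toy_id_to_word = {0: "<pad>", 1: "<sos>", 2: "<unk>", 4: "the", 5: "film"}

def decode_review(ids, id_to_word, unknown="<unk>", skip_padding=True):
    """Turn a sequence of integer IDs back into a space-separated string."""
    words = []
    for id_ in ids:
        if skip_padding and id_ == 0:
            continue  # drop padding tokens
        # int() also accepts NumPy integers, such as those in X_train
        words.append(id_to_word.get(int(id_), unknown))
    return " ".join(words)

decoded = decode_review([1, 4, 5, 999, 0, 0], toy_id_to_word)
print(decoded)
```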

If you want to deploy your model to a mobile device or a web browser, and you don't want to write a different
preprocessing function every time, then you will want to handle preprocessing using only TensorFlow operations.
First, let's load the original IMDb reviews as text (byte strings), using TensorFlow Datasets.

In [7]:
# Load the original IMDb reviews as text (byte strings), using TensorFlow Datasets (introduced in Chapter 13)
import tensorflow_datasets as tfds

datasets, info = tfds.load('imdb_reviews', as_supervised=True, with_info=True)
train_size = info.splits['train'].num_examples

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


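The preprocess() function below works as follows:
    1. first, it truncates the reviews, keeping only the first 300 characters of each
    2. next, it uses regular expressions to replace <br /> tags with spaces, and to replace any character
       other than letters and apostrophes with a space
    3. then it splits the reviews by the spaces, which returns a ragged tensor
       for example: "Well, I can't<br />" will become the words "Well", "I", and "can't"
    4. finally, it converts the ragged tensor to a dense tensor, padding all reviews with the padding token "<pad>"
       so that they all have the same length.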

In [8]:
# Next, let's write the preprocessing function:
def preprocess(X_batch, y_batch):
    # Truncate the reviews, keeping only the first 300 characters of each
    X_batch = tf.strings.substr(X_batch, 0, 300)
    # Replace <br /> tags with spaces
    X_batch = tf.strings.regex_replace(X_batch, b"<br\\s*/?>", b" ")
    # Replace any character other than letters and apostrophes with a space
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    # Split on whitespace, which returns a ragged tensor
    X_batch = tf.strings.split(X_batch)
    # Convert the ragged tensor to a dense tensor, padding all reviews with <pad>
    return X_batch.to_tensor(default_value=b'<pad>'), y_batch
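
To see what these steps do to individual reviews without running TensorFlow, here is a rough pure-Python mirror of preprocess(). The names `preprocess_py` and `pad_batch` are hypothetical, and the sketch slices Python characters, whereas tf.strings.substr() counts bytes by default, so treat it as an illustration rather than an exact equivalent.

```python
import re

def preprocess_py(review, max_chars=300):
    """Pure-Python mirror of the preprocessing steps, for a single review."""
    review = review[:max_chars]                  # keep the first 300 characters
    review = re.sub(r"<br\s*/?>", " ", review)   # replace <br /> tags with spaces
    review = re.sub(r"[^a-zA-Z']", " ", review)  # keep only letters and apostrophes
    return review.split()                        # split on whitespace

def pad_batch(token_lists, pad_token="<pad>"):
    """Pad every token list to the length of the longest one in the batch."""
    max_len = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad_token] * (max_len - len(tokens))
            for tokens in token_lists]

batch = pad_batch([
    preprocess_py("Well, I can't<br />"),
    preprocess_py("It was great!<br /><br />Really."),
])
for tokens in batch:
    print(tokens)
```

The shorter review ends up with one trailing "<pad>" token, which mirrors what to_tensor(default_value=b'<pad>') does for a whole batch.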

In [9]:
# Next, we need to construct the vocabulary.
# This requires going through the whole training set once, applying our preprocess() function,
# and using a Counter to count the number of occurrences of each word.
from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets['train'].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

In [10]:
# let's look at the 3 most common words:
vocabulary.most_common()[:3]

[(b'<pad>', 214309), (b'the', 61137), (b'a', 38564)]

In [11]:
# We don't need our model to know all the words in the dictionary to get good performance
# so let's truncate the vocabulary, keeping only the 10,000 most common words:
vocab_size = 10000
truncated_vocabulary = [word for word, count in vocabulary.most_common()[:vocab_size]]
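
One way to check that truncating the vocabulary is reasonable is to measure how many token occurrences the kept words still cover. The sketch below does this on a tiny made-up batch; `toy_batch`, `toy_vocab_size`, and `coverage` are assumptions for illustration, and the numbers say nothing about the real IMDb vocabulary.

```python
from collections import Counter

# Hypothetical tokenized reviews, already padded
toy_batch = [
    ["the", "film", "was", "great", "<pad>"],
    ["the", "plot", "was", "thin", "<pad>"],
    ["the", "film", "was", "long", "<pad>"],
]

toy_vocabulary = Counter()
for review in toy_batch:
    toy_vocabulary.update(review)

toy_vocab_size = 4
# most_common(n) returns the n most frequent (word, count) pairs
toy_truncated = [word for word, count in toy_vocabulary.most_common(toy_vocab_size)]
dropped = sorted(set(toy_vocabulary) - set(toy_truncated))

# Fraction of all token occurrences that the truncated vocabulary still covers
kept_count = sum(toy_vocabulary[word] for word in toy_truncated)
coverage = kept_count / sum(toy_vocabulary.values())

print(toy_truncated)
print(dropped)
print(f"coverage: {coverage:.2f}")
```

Running the same coverage computation on the real `vocabulary` and `truncated_vocabulary` would show how much of the training text is lost to the out-of-vocabulary buckets.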

In [12]:
# Now we need to add a preprocessing step to replace each word with its ID (i.e., its index in the vocabulary).
# Just like we did in Chapter 13, we will create a lookup table for this, using 1,000 out-of-vocabulary (oov) buckets
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

Note (for the lookup in the next cell):
    1. the words "This", "movie", and "was" were found in the table, so their IDs are lower than 10,000
    2. the word "faaaaaaantastic" was not found, so it was mapped to one of the oov buckets, with an ID greater than or equal to 10,000


In [13]:
# We can then use this table to look up the IDs of a few words
table.lookup(tf.constant([b"This movie was faaaaaaantastic".split()]))

<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   22,    12,    11, 10770]])>
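
The idea behind the oov buckets can be sketched in plain Python: known words get their vocabulary index, and any other word is hashed into one of the buckets placed right after the vocabulary. This is only a conceptual sketch. It uses CRC32 as a stand-in hash (TensorFlow uses its own fingerprint function, so the actual bucket numbers will differ), and `toy_vocab`, `toy_num_oov_buckets`, and `lookup` are hypothetical names.

```python
import zlib

toy_vocab = ["<pad>", "the", "a", "movie"]   # hypothetical truncated vocabulary
toy_word_to_id = {word: id_ for id_, word in enumerate(toy_vocab)}
toy_num_oov_buckets = 5

def lookup(word):
    """Return the vocabulary ID, or hash the word into one of the oov buckets."""
    if word in toy_word_to_id:
        return toy_word_to_id[word]
    # Stand-in hash (CRC32), NOT the hash that TensorFlow uses
    bucket = zlib.crc32(word.encode("utf-8")) % toy_num_oov_buckets
    return len(toy_vocab) + bucket

ids = [lookup(w) for w in ["the", "movie", "faaaaaaantastic", "faaaaaaantastic"]]
print(ids)
```

The same unknown word always lands in the same bucket, but two different unknown words may collide, which is why using more buckets reduces collisions at the cost of a larger embedding matrix.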

In [14]:
# Now we are ready to create the final training set.
# We batch the reviews, then convert them to short sequences of words using the preprocess() function,
# then encode these words using the encode_words() function, which uses the table we just built, and finally prefetch the next batch.
def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets['train'].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

In [15]:
# At last, we can create the model and train it.
embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size, input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(train_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


# MASKING

In [16]:
K = keras.backend
inputs = keras.layers.Input(shape=[None])
# Build a Boolean mask that is False wherever the input is the padding token (ID 0)
mask = keras.layers.Lambda(lambda inputs: K.not_equal(inputs, 0))(inputs)
z = keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size)(inputs)
# Pass the mask explicitly so that the recurrent layers skip the padded time steps
z = keras.layers.GRU(128, return_sequences=True)(z, mask=mask)
z = keras.layers.GRU(128)(z, mask=mask)
outputs = keras.layers.Dense(1, activation='sigmoid')(z)
model = keras.Model(inputs=[inputs], outputs=[outputs])

After training for a few epochs, this model will become quite good at judging whether a review is positive or not.
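
To build some intuition for what the mask does, here is a small NumPy sketch. It assumes the padding token has ID 0, which holds in this notebook because <pad> is the most frequent token in the vocabulary. The "RNN" is just a running sum, a deliberately simplified stand-in for a GRU, and the names `batch_ids`, `step_inputs`, and `toy_rnn_final_state` are hypothetical.

```python
import numpy as np

# Hypothetical batch of encoded reviews, padded with ID 0
batch_ids = np.array([
    [22, 12, 11, 7],
    [5, 9, 0, 0],
])

mask = batch_ids != 0        # same test as K.not_equal(inputs, 0)
lengths = mask.sum(axis=1)   # number of real tokens per review

def toy_rnn_final_state(values, mask):
    """Running sum over time steps; masked steps carry the previous state over."""
    state = np.zeros(values.shape[0])
    for t in range(values.shape[1]):
        candidate = state + values[:, t]
        # Keep the old state wherever the current step is padding
        state = np.where(mask[:, t], candidate, state)
    return state

# Per-step inputs are non-zero even on padded steps, to show that they are ignored
step_inputs = np.ones(batch_ids.shape)
final_state = toy_rnn_final_state(step_inputs, mask)

print(mask)
print(lengths)
print(final_state)
```

The final state of the second review only reflects its two real tokens, which is the behavior we want from the recurrent layers when they receive the mask.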