##### Copyright 2021 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Intro

This colab is a companion to the "An Introduction to Keras Preprocessing Layers" blog post, and contains a runnable version of all code presented in the post. Unlike in the post, here we will also load a validation dataset to better evaluate our models.

We can start by downloading and batching the [imdb_reviews](https://www.tensorflow.org/datasets/catalog/imdb_reviews) dataset.

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds

train_ds, test_ds = tfds.load(
    'imdb_reviews', split=['train', 'test'], as_supervised=True)
train_ds = train_ds.batch(32)
test_ds = test_ds.batch(32)

## Building a model

We will build two functions: `preprocess()`, which applies our preprocessing to our input features, and `forward_pass()`, which applies our trainable layers.

For the `preprocess()` function, we will use the [TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer to produce a multi-hot encoding of which words are present in each review. We will [adapt](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/PreprocessingLayer#adapt) the layer to automatically learn a vocabulary from the input.
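To make "multi-hot encoding" concrete, here is a minimal plain-Python sketch of the idea: learn the most frequent tokens from the training text, then mark which of them occur in a given review. This is an illustration only, not the layer's implementation. The helper names `learn_vocabulary` and `multi_hot` are hypothetical, the sketch splits on whitespace without stripping punctuation, and reserving index 0 for unknown words is an assumption of the sketch.

```python
from collections import Counter

def learn_vocabulary(texts, max_tokens):
    """Return the most frequent lowercase tokens, with slot 0 kept for unknown words."""
    counts = Counter(token for text in texts for token in text.lower().split())
    # One slot is reserved for out-of-vocabulary tokens (an assumption of this sketch).
    most_common = [token for token, _ in counts.most_common(max_tokens - 1)]
    return ['[UNK]'] + most_common

def multi_hot(text, vocabulary):
    """Encode a text as a 0/1 vector marking which vocabulary entries occur in it."""
    index = {token: i for i, token in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in text.lower().split():
        vector[index.get(token, 0)] = 1  # Unknown tokens map to slot 0.
    return vector

reviews = ['great movie great cast', 'terrible movie', 'great fun']
vocabulary = learn_vocabulary(reviews, max_tokens=4)
encoded = multi_hot('great boring movie', vocabulary)
print(vocabulary)
print(encoded)
```

Note that a word's position and count are discarded: the encoding only records whether each vocabulary entry appears at all.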

For the `forward_pass()` function, we will try a simple linear model with a single [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layer.

In [None]:
features = train_ds.map(lambda x, y: x)
text_vectorizer = tf.keras.layers.TextVectorization(
    output_mode='multi_hot', max_tokens=2500)
text_vectorizer.adapt(features)

def preprocess(x):
  return text_vectorizer(x)

def forward_pass(x):
  return tf.keras.layers.Dense(1)(x)  # Linear model.

inputs = tf.keras.Input(shape=(1,), dtype='string')
model = tf.keras.Model(inputs, forward_pass(preprocess(inputs)))
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=tf.keras.metrics.BinaryAccuracy())
model.fit(train_ds, validation_data=test_ds, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f3457311650>

## Adding a new feature

Next, we can add a new feature for normalized string length. We will use [tf.strings.length](https://www.tensorflow.org/api_docs/python/tf/strings/length) to compute the length of each review (in bytes, by default), and the [Normalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization) layer to normalize the feature values.
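As a rough sketch of the arithmetic involved, the plain-Python snippet below computes the mean and variance of a few string lengths and rescales each length to zero mean and unit variance. The helpers `fit_normalizer` and `normalize` are hypothetical names for illustration, and the sketch ignores details of the real layer such as numerical safeguards against zero variance.

```python
import math

def fit_normalizer(values):
    """Compute the mean and population variance of the training values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

def normalize(value, mean, variance):
    """Shift and scale a value so the training data has zero mean and unit variance."""
    return (value - mean) / math.sqrt(variance)

reviews = ['Great!', 'Terrible, no good, trash.', 'I loved this movie!']
# Byte length of the UTF-8 encoding, matching the default unit of tf.strings.length.
lengths = [len(review.encode('utf-8')) for review in reviews]
mean, variance = fit_normalizer(lengths)
normalized = [normalize(length, mean, variance) for length in lengths]
print(lengths)
print(normalized)
```

Rescaling keeps the length feature on a scale comparable to the 0/1 multi-hot features it is concatenated with.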

In [None]:
normalizer = tf.keras.layers.Normalization(axis=None)
normalizer.adapt(features.map(lambda x: tf.strings.length(x)))

def preprocess(x):
  multi_hot_terms = text_vectorizer(x)
  normalized_length = normalizer(tf.strings.length(x))
  return tf.keras.layers.concatenate((multi_hot_terms, normalized_length))

def forward_pass(x):
  return tf.keras.layers.Dense(1)(x)  # Linear model.

inputs = tf.keras.Input(shape=(1,), dtype='string')
model = tf.keras.Model(inputs, forward_pass(preprocess(inputs)))
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=tf.keras.metrics.BinaryAccuracy())
model.fit(train_ds, validation_data=test_ds, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f3458f34110>

## Speeding up training with tf.data

We can speed up training by using [tf.data](https://www.tensorflow.org/guide/data). We will split our model into two using the [functional API](https://keras.io/guides/functional_api/), and apply the preprocessing layers with [tf.data.Dataset.map](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map). We will use [tf.data.Dataset.prefetch](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#prefetch) to prepare preprocessed batches ahead of time while the model trains. We will also call [tf.data.Dataset.cache](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#cache) so that the preprocessed data is stored during the first epoch and reused in later epochs.
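The benefit of caching can be sketched in plain Python, independent of how tf.data implements it. In this illustration, the hypothetical `CachedMap` class applies a preprocessing function during the first pass over the data and replays the stored results on later passes, so the preprocessing cost is paid once rather than once per epoch. The names `CachedMap`, `expensive_preprocess`, and `call_log` are assumptions made for the example.

```python
class CachedMap:
    """Applies `fn` to each element on the first pass and replays stored results afterwards."""

    def __init__(self, source, fn):
        self._source = source
        self._fn = fn
        self._cache = None

    def __iter__(self):
        if self._cache is not None:
            yield from self._cache  # Later epochs: no preprocessing work.
            return
        results = []
        for element in self._source:
            result = self._fn(element)
            results.append(result)
            yield result
        self._cache = results  # Only set once a full pass has completed.

call_log = []  # Records every preprocessing call so we can count them.

def expensive_preprocess(text):
    call_log.append(text)
    return len(text)

dataset = CachedMap(['a', 'bb', 'ccc'], expensive_preprocess)
epochs = [list(dataset) for _ in range(3)]
print(len(call_log))  # Number of preprocessing calls across three epochs.
```

Three epochs over three elements trigger only three preprocessing calls, because the second and third epochs read from the cache.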

In [None]:
inputs = tf.keras.Input(shape=(1,), dtype='string')
preprocessed_inputs = preprocess(inputs)
outputs = forward_pass(preprocessed_inputs)

# Split the model into two parts.
preprocessing_model = tf.keras.Model(inputs, preprocessed_inputs)
training_model = tf.keras.Model(preprocessed_inputs, outputs)

# Apply preprocessing asynchronously with tf.data.
preprocessed_train_ds = train_ds.map(
    lambda x, y: (preprocessing_model(x), y),
    num_parallel_calls=tf.data.AUTOTUNE).cache().prefetch(tf.data.AUTOTUNE)
preprocessed_test_ds = test_ds.map(
    lambda x, y: (preprocessing_model(x), y),
    num_parallel_calls=tf.data.AUTOTUNE).cache().prefetch(tf.data.AUTOTUNE)

# Now the GPU can focus on the training part of the model!
training_model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=tf.keras.metrics.BinaryAccuracy())
training_model.fit(
    preprocessed_train_ds, validation_data=preprocessed_test_ds, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f34572d8f50>

## Saving an inference model

Lastly, we combine our split model into a single model that takes raw strings as input. We could save this model and use it later for inference.

In [None]:
inputs = preprocessing_model.input
outputs = training_model(preprocessing_model(inputs))
inference_model = tf.keras.Model(inputs, outputs)
inference_model.predict(
    tf.constant(['Terrible, no good, trash.', 'I loved this movie!']))

array([[-1.1480266 ],
       [ 0.70757645]], dtype=float32)
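
Because the model was compiled with `from_logits=True` and the final `Dense(1)` layer has no activation, the values returned by `predict` are logits rather than probabilities. A minimal plain-Python sketch of converting them, assuming the usual sigmoid mapping and a 0.5 decision threshold (the threshold and the label strings are choices made for this example):

```python
import math

def sigmoid(logit):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

logits = [-1.1480266, 0.70757645]  # Values from the prediction output above.
probabilities = [sigmoid(logit) for logit in logits]
# In imdb_reviews, label 1 denotes a positive review.
labels = ['positive' if p > 0.5 else 'negative' for p in probabilities]
print(probabilities)
print(labels)
```

A negative logit corresponds to a probability below 0.5, so the first review is classified as negative and the second as positive.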