# TensorFlow for text classification

Today's challenge is based on the Colab notebook "Text classification with TensorFlow Hub: Movie reviews" published by TensorFlow. The original notebook can be accessed [here](https://www.tensorflow.org/tutorials/keras/text_classification_with_hub).

This tutorial uses the IMDB dataset, which contains the text of 50,000 movie reviews: 25,000 in the official training set and 25,000 in the test set. We will split the training set 60%/40%, which gives 15,000 examples for training, 10,000 examples for validation and 25,000 examples for testing.
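As a quick sanity check of those numbers, here is a minimal sketch of the split arithmetic. The helper `split_sizes` is a hypothetical name used only for illustration; it is not part of TensorFlow Datasets.

```python
# Sizes of the splits described above, computed from the stated percentages.
TRAIN_SET_SIZE = 25_000  # reviews in the official IMDB training set
TEST_SET_SIZE = 25_000   # reviews in the official IMDB test set
TRAIN_PERCENT = 60       # share of the training set kept for training


def split_sizes(total, train_percent):
    """Return (train, validation) sizes for a percentage split of `total`."""
    train = total * train_percent // 100
    return train, total - train


train_size, validation_size = split_sizes(TRAIN_SET_SIZE, TRAIN_PERCENT)
print(train_size, validation_size, TEST_SET_SIZE)  # 15000 10000 25000
```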

There are 2 labels: 0 for negative sentiment and 1 for positive sentiment.

The training and testing sets are balanced - they contain an equal number of positive and negative reviews.

In [1]:
# imports
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
from pprint import pprint

Let's define a function to load the dataset.

In [2]:
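def load_ds(set_name, train_split, validation_split):
  # train_split: the first train_split% of 'train' becomes the training set
  # validation_split: everything from validation_split% onward becomes the validation set
  train_data, validation_data, test_data = tfds.load(
    name=set_name, # in this case, the set name will be "imdb_reviews"
    split=(f'train[:{train_split}%]', f'train[{validation_split}%:]', 'test'),
    as_supervised=True) # yield (text, label) tuples instead of dictionaries

  return train_data, validation_data, test_data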
  
# load data
train_data, validation_data, test_data = load_ds('imdb_reviews', 60, 60)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteWU2ERT/imdb_reviews-train.tfrecord

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteWU2ERT/imdb_reviews-test.tfrecord

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteWU2ERT/imdb_reviews-unsupervised.tfrecord

Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


TensorFlow uses the tf.data API to represent its datasets. Among many other things, it allows us to handle big datasets that don't fit in memory.

I have avoided this format in earlier tutorials and preferred Python lists, dictionaries or pandas to handle data. But let's take this opportunity to get to know the tf.data API.
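To give an intuition for how a dataset can be processed without loading it all into memory, here is a rough pure-Python analogy of lazy batching. This is not how tf.data is implemented; `lazy_batches` and `fake_reviews` are hypothetical names, and the generator simply stands in for records read from disk.

```python
from itertools import islice


def lazy_batches(examples, batch_size):
    """Lazily yield lists of up to `batch_size` items from any iterable."""
    iterator = iter(examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_reviews(n):
    # Hypothetical stand-in for reading (text, label) records one at a time.
    for i in range(n):
        yield (f"review {i}", i % 2)


# Only the first three examples are ever produced, even though the
# "dataset" nominally contains a million of them.
first_batch = next(lazy_batches(fake_reviews(1_000_000), 3))
print(first_batch)  # [('review 0', 0), ('review 1', 1), ('review 2', 0)]
```

The pattern `next(iter(dataset.batch(n)))` used in this notebook works in the same spirit: it pulls a single batch and leaves the rest of the dataset untouched.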



In [3]:
def echo_batch(dataset, examples_qty):
  # print data type
  print('Data type:')
  print(type(dataset))

  # print the number of examples (cardinality) in the dataset
  print('\nData shape:')
  print(tf.data.experimental.cardinality(dataset))

  # take a single batch and unpack it into texts and labels
  texts, labels = next(iter(dataset.batch(examples_qty)))

  # print the texts in the batch
  print('\nTexts:')
  pprint(texts)

  # now, print the labels in the batch
  print('\nLabels:')
  pprint(labels)

# print the first 5 examples and labels
echo_batch(train_data, 5)

Data type:
<class 'tensorflow.python.data.ops.dataset_ops.DatasetV1Adapter'>

Data shape:
tf.Tensor(15000, shape=(), dtype=int64)

Texts:
<tf.Tensor: shape=(5,), dtype=string, numpy=
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination

For text problems, we usually apply pre-processing. This includes steps such as tokenization, special character removal and normalization. But let's keep things simple and focus on one concept at a time. We can revisit pre-processing later.
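For reference, the pre-processing steps mentioned above could look roughly like the sketch below. It is not used anywhere in this notebook; `preprocess` is a hypothetical helper, and the specific rules (lowercasing, dropping HTML tags, keeping only letters, digits and apostrophes) are assumptions chosen for illustration.

```python
import re


def preprocess(text):
    """Normalize a review and split it into word tokens."""
    text = text.lower()                        # normalization
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags such as <br />
    text = re.sub(r"[^a-z0-9'\s]", " ", text)  # special character removal
    return text.split()                        # whitespace tokenization


tokens = preprocess("This was an absolutely terrible movie.<br />Don't watch it!")
print(tokens)
# ['this', 'was', 'an', 'absolutely', 'terrible', 'movie', "don't", 'watch', 'it']
```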

## Build the model

The official tutorial introduces the concept of transfer learning: you reuse a pre-trained model's weights to improve the performance of your own model.

This saves you time and resources. To learn more about this concept, read [this](https://keras.io/guides/transfer_learning/) and [this](https://towardsdatascience.com/keras-transfer-learning-for-beginners-6c9b8b7143e).




In [4]:
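# pre-trained 20-dimensional text embedding from TensorFlow Hub
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
# trainable=True lets the embedding weights be fine-tuned during training
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
# single output unit without activation: the model outputs a logit
model.add(tf.keras.layers.Dense(1))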

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 20)                400020    
_________________________________________________________________
dense (Dense)                (None, 16)                336       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________
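The parameter counts in the summary can be checked by hand. The sketch below assumes the usual dense-layer formula (inputs × units weights plus one bias per unit). Reading the embedding layer as a table of 20-dimensional rows is my own interpretation of the 400,020 figure, not something the summary states.

```python
EMBEDDING_DIM = 20          # output size of the hub layer, from the summary
HIDDEN_UNITS = 16           # units in the first Dense layer
EMBEDDING_PARAMS = 400_020  # taken directly from the summary


def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units


hidden = dense_params(EMBEDDING_DIM, HIDDEN_UNITS)  # 20 * 16 + 16
output = dense_params(HIDDEN_UNITS, 1)              # 16 * 1 + 1
embedding_rows = EMBEDDING_PARAMS // EMBEDDING_DIM  # rows of 20 values each
total = EMBEDDING_PARAMS + hidden + output

print(hidden, output, embedding_rows, total)  # 336 17 20001 400373
```

Almost all of the model's parameters sit in the pre-trained embedding, which is why reusing it saves so much training effort.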


In [5]:
# compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

# train the model
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
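Because the last layer has no activation, the model outputs logits, which is why the loss is created with `from_logits=True`. The pure-Python sketch below illustrates the idea for a single example: applying the loss directly to the logit gives the same value as first applying a sigmoid and then computing the usual cross-entropy. The function names are hypothetical, and this is an illustration of the formula rather than the Keras implementation.

```python
import math


def sigmoid(logit):
    """Map a logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))


def bce_from_logit(label, logit):
    # Numerically stable form: max(z, 0) - z * y + log(1 + exp(-|z|))
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def bce_from_probability(label, probability):
    # Textbook binary cross-entropy on a probability
    return -(label * math.log(probability)
             + (1 - label) * math.log(1 - probability))


# A logit of 0 means "50/50", so the loss is log(2) whatever the label is.
print(round(bce_from_logit(1, 0.0), 4))  # 0.6931

# Both routes agree for an arbitrary logit.
logit = 1.5
assert math.isclose(bce_from_logit(1, logit),
                    bce_from_probability(1, sigmoid(logit)))
```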


In [6]:
# evaluate
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

49/49 - 4s - loss: 0.3200 - accuracy: 0.8597
loss: 0.320
accuracy: 0.860
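To turn the model's raw outputs into predicted labels, a logit can be thresholded at 0, which is equivalent to thresholding the sigmoid probability at 0.5. Here is a minimal sketch with made-up logits and labels; the numbers are illustrative only and do not come from the model above.

```python
def predict_label(logit):
    """Return 1 (positive) if the logit is above 0, else 0 (negative)."""
    return 1 if logit > 0 else 0


def accuracy(labels, logits):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(predict_label(z) == y for y, z in zip(labels, logits))
    return correct / len(labels)


# Hypothetical logits and true labels for four reviews
example_logits = [2.3, -1.1, 0.4, -0.2]
example_labels = [1, 0, 0, 0]

print([predict_label(z) for z in example_logits])  # [1, 0, 1, 0]
print(accuracy(example_labels, example_logits))    # 0.75
```

Since the model outputs logits, it may be worth checking which threshold the `'accuracy'` metric applies in your Keras version.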
