# Text classification with an RNN

Referenced from https://www.tensorflow.org/tutorials/text/text_classification_rnn

This text classification tutorial trains a recurrent neural network on the IMDB large movie review dataset for sentiment analysis.

# Setup

``` bash
# install the TensorFlow Datasets package
pip3 install -q tensorflow_datasets
```

In [5]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

tfds.disable_progress_bar()

# Dataset

In [53]:
dataset, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

print(f"Description:\n\n{info.description}\n")
print(f"Features:\n\n{info.features}\n")
print(f"Train Element:\n\n{train_dataset.element_spec}\n")
print(f"{len(train_dataset)} train samples and {len(test_dataset)} test samples\n")

Description:

Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

Features:

FeaturesDict({
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'text': Text(shape=(), dtype=tf.string),
})

Train Element:

(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))

25000 train samples and 25000 test samples



In [54]:
for example, label in train_dataset.take(1):
    print('text: ', example.numpy())
    print('label: ', label.numpy())

text:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
label:  0


# Prepare data for training

In [55]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

# shuffle and batch data
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
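
The `shuffle(BUFFER_SIZE)` and `batch(BATCH_SIZE)` calls above can be pictured with a small pure-Python sketch. It is only an illustration of the buffered-shuffle idea, not TensorFlow's implementation, and the helper names `buffered_shuffle` and `batched` are hypothetical. Because the buffer (10000) is smaller than the training set (25000), the shuffle is local rather than a full permutation: the first element drawn can only come from the first 10000 examples.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # fill phase: nothing is emitted yet
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]                # emit a random buffered element
        buffer[idx] = item               # and replace it with the new one
    rng.shuffle(buffer)                  # drain what is left in random order
    yield from buffer

def batched(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                            # final, possibly smaller, batch
        yield batch

shuffled = list(buffered_shuffle(range(25000), buffer_size=10000))
batches = list(batched(shuffled, 64))
print(len(batches), len(batches[0]), len(batches[-1]))  # 391 64 40
```

With 25000 examples and a batch size of 64, this gives 390 full batches plus one final batch of 40.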

In [56]:
for example, label in train_dataset.take(1):
    print(f"{len(example.numpy())} texts and {len(label.numpy())} labels in a batch")

64 texts and 64 labels in a batch


In [57]:
VOCAB_SIZE = 1000

# encode text data (TextVectorization lives in tf.keras.layers as of TF 2.6)
encoder = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))

In [83]:
# get vocabulary using encoder
vocab = np.array(encoder.get_vocabulary())
print(f"vocabulary: {vocab[:5]}")

vocabulary: ['' '[UNK]' 'the' 'and' 'a']
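
The vocabulary above starts with `''` (the padding token, index 0) and `'[UNK]'` (the out-of-vocabulary token, index 1), followed by words in order of frequency. The following pure-Python sketch mimics that behaviour on a toy corpus, assuming the layer's default settings (lowercasing, punctuation stripping, whitespace splitting). The helpers `standardize`, `build_vocabulary`, and `encode` are hypothetical names for illustration, not part of Keras.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def build_vocabulary(texts, max_tokens):
    """Reserve index 0 for padding and 1 for unknowns, then add frequent words."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocab):
    """Map each token to its index, falling back to 1 ([UNK])."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in standardize(text).split()]

corpus = ["The movie was great.", "The plot and the acting: great!"]
toy_vocab = build_vocabulary(corpus, max_tokens=4)
ids = encode("The soundtrack was great", toy_vocab)
recovered = [toy_vocab[i] for i in ids]

print(toy_vocab)   # ['', '[UNK]', 'the', 'great']
print(ids)         # [2, 1, 1, 3]
print(recovered)   # ['the', '[UNK]', '[UNK]', 'great']
```

The round trip is lossy by design: any word outside the `max_tokens` most frequent ones comes back as `[UNK]`.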


In [86]:
# tokenize string with encoder
for text, label in train_dataset.take(1):
    original = text.numpy()[0]
    tokenized = encoder(original).numpy()
    recovered = vocab[tokenized]
    print("original\n", original)
    print("\ntokenize\n", tokenized)
    print("\nrecovered\n", recovered)

original
 b"I saw this film as it was the second feature on a disc containing the previously banned Video Nasty 'Blood Rites'. As Blood Rites was entirely awful, I really wasn't expecting much from this film; but actually, it would seem that trash director Andy Milligan has outdone himself this time as Seeds of Sin tops Blood Rites in style and stands tall as a more than adequate slice of sick sixties sexploitation. The plot is actually quite similar to Blood Rites, as we focus on a dysfunctional family unit, and of course; there is an inheritance at stake. The film is shot in black and white, and the look and feel of it reminded me a lot of the trash classic 'The Curious Dr Humpp'. There's barely any gore on display, and the director seems keener to focus on sex, with themes of incest and hatred seeping through. The acting is typically trashy, but most of the women get to appear nude at some point and despite a poor reputation, director Andy Milligan actually seems to have an eye for 

# Model

In [87]:
VOCAB_SIZE = 1000
encoder = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        mask_zero=True
    ),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64)
    ),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1),
])
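
As a rough check on the size of this architecture, the trainable parameter counts can be worked out by hand. The sketch below assumes the vocabulary holds exactly `VOCAB_SIZE` = 1000 entries and that `Bidirectional` concatenates the two directions (its default), so the first `Dense` layer sees 128 inputs. The helper functions are hypothetical and use the standard formulas for embedding, LSTM, and dense layers.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size, embed_dim, lstm_units = 1000, 64, 64

counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    "bidirectional_lstm": 2 * lstm_params(embed_dim, lstm_units),
    "dense_64": dense_params(2 * lstm_units, 64),
    "dense_1": dense_params(64, 1),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n}")
print(f"total: {total}")  # total: 138369
```

Calling `model.summary()` after the model has been built should report comparable numbers if these assumptions hold.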

In [89]:
print([layer.supports_masking for layer in model.layers])

[False, True, True, True, True]


In [92]:
# predict on a sample text without padding.
sample_text = ('The movie was cool. The animation and the graphics '
               'were out of this world. I would recommend this movie.')
predictions = model.predict(np.array([sample_text]))
print(predictions[0])

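# predict on the same text with padding: batching it with a much longer
# text pads the shorter one, and masking should leave its prediction unchanged
padding = "the " * 2000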
predictions = model.predict(np.array([sample_text, padding]))
print(predictions[0])

[-0.01171657]
[-0.01171658]
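
The two predictions above agree to within floating-point rounding because `mask_zero=True` tells downstream layers to skip the padded positions (token id 0). A toy one-unit recurrent cell makes the idea concrete. This is a deliberately simplified sketch, not an LSTM and not Keras code, and `run_rnn` with its fixed weights is hypothetical.

```python
import math

def run_rnn(token_ids, mask_zero=True):
    """Toy recurrent cell with fixed weights; returns the final hidden state."""
    h = 0.0
    for tok in token_ids:
        if mask_zero and tok == 0:
            continue                      # masked step: carry the state unchanged
        h = math.tanh(0.5 * h + 0.1 * tok)
    return h

tokens = [2, 7, 5]
padded = tokens + [0] * 5                 # same sequence, padded with zeros

print(run_rnn(tokens) == run_rnn(padded))                   # True
print(run_rnn(tokens) == run_rnn(padded, mask_zero=False))  # False
```

Without masking, every padded step keeps updating the state, so the result would depend on how long the longest text in the batch happens to be.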


# Train Model

In [93]:
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(1e-4),
    metrics=['accuracy']
)
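
Because the final `Dense(1)` layer has no activation, the model outputs a raw logit, which is why the loss is created with `from_logits=True`. A sigmoid turns a logit into a probability, and a logit of 0 corresponds to a probability of 0.5. The sketch below applies this to the sample prediction printed earlier; the helper `sigmoid` is written out by hand for illustration.

```python
import math

def sigmoid(logit):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

logit = -0.01171657           # sample prediction printed earlier
prob = sigmoid(logit)         # probability of the positive class
label = int(logit >= 0.0)     # equivalent to prob >= 0.5

print(round(prob, 4), label)  # 0.4971 0
```

A probability this close to 0.5 means the model is essentially undecided about the sample text.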

In [95]:
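# Training is commented out here, so the evaluation below runs on the
# untrained model. Uncomment to train; the plotting cells further down
# need `history`.
# history = model.fit(
#     train_dataset,
#     epochs=10,
#     validation_data=test_dataset,
#     validation_steps=30
# )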

# Evaluate Model

In [96]:
test_loss, test_acc = model.evaluate(test_dataset)

print(f'Test Loss: {test_loss}')
print(f'Test Accuracy: {test_acc}')

Test Loss: 0.6905168890953064
Test Accuracy: 0.5
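
These numbers are what one would expect from a model that has not been trained (the `model.fit` call above is commented out). With logits near 0, binary cross-entropy is close to ln 2 ≈ 0.693 regardless of the label, and accuracy on a balanced dataset sits at chance. The sketch below uses a commonly used numerically stable form of the loss; `bce_from_logit` is a hypothetical helper, not the Keras function.

```python
import math

def bce_from_logit(logit, label):
    """Numerically stable binary cross-entropy for a single example."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

chance_loss = bce_from_logit(0.0, 1)
print(chance_loss, math.log(2))

# a slightly negative logit, like the sample prediction earlier,
# gives losses on either side of ln 2 depending on the label
print(bce_from_logit(-0.0117, 0), bce_from_logit(-0.0117, 1))
```

After training, the loss should drop below this baseline and the accuracy should rise above 0.5.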


In [None]:
import matplotlib.pyplot as plt

def plot_graphs(history, metric):
    plt.plot(history.history[metric])
    plt.plot(history.history['val_'+metric], '')
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, 'val_'+metric])

In [None]:
plt.figure(figsize=(16,6))
plt.subplot(1,2,1)
plot_graphs(history, 'accuracy')
plt.subplot(1,2,2)
plot_graphs(history, 'loss')

In [106]:
# a batch of two strings becomes a 1-D array of shape (2,)
sample_text = ['The movie was cool. The animation and the graphics '
               'were out of this world. I would recommend this movie.', 'asd']
np.array(sample_text).shape

(2,)