# Neural Networks for Data Science Applications
## Lab session 6: Text classification with neural networks

**Contents of the lab session:**
+ Using pre-trained word embeddings to classify text.
+ Manually tokenizing text and learning embeddings.
+ Using TF Datasets and TF Hub for downloading datasets and modules.

In [0]:
# Remember to enable a GPU on Colab by:
# Runtime >> Change runtime type >> Hardware accelerator (before starting the VM).
!pip -q install tensorflow-gpu==2.0.0

ERROR: tensorflow 1.15.0 has requirement tensorboard<1.16.0,>=1.15.0, but you'll have tensorboard 2.0.2 which is incompatible.
ERROR: tensorflow 1.15.0 has requirement tensorflow-estimator==1.15.1, but you'll have tensorflow-estimator 2.0.1 which is incompatible.
ERROR: tensorboard 2.0.2 has requirement grpcio>=1.24.3, but you'll have grpcio 1.15.0 which is incompatible.
ERROR: google-colab 1.0.0 has requirement google-auth~=1.4.0, but you'll have google-auth 1.7.1 which is incompatible.

In [0]:
import tensorflow as tf

### Download the dataset (IMDB movie reviews)

In [0]:
# If you are running the code locally, you might need to install tensorflow_datasets first.
import tensorflow_datasets as tfds

In [0]:
# Print a list of all the available datasets
print(tfds.list_builders())

In [0]:
# Learn more about the dataset here: https://www.tensorflow.org/datasets/catalog/imdb_reviews
# If you are on Windows and having errors while unzipping, you might need to increase the maximum allowed path length:
# https://github.com/tensorflow/datasets/issues/769#issuecomment-515646783
imdb = tfds.load('imdb_reviews', as_supervised=True)

In [0]:
# Inspect the object (a dictionary)
imdb

{'test': <_OptionsDataset shapes: ((), ()), types: (tf.string, tf.int64)>,
 'train': <_OptionsDataset shapes: ((), ()), types: (tf.string, tf.int64)>,
 'unsupervised': <_OptionsDataset shapes: ((), ()), types: (tf.string, tf.int64)>}

In [0]:
# Only select the train part
train_data = imdb['train']

In [0]:
# You can use it as you would use any tf.data.Dataset object
for xb, yb in train_data.batch(4):
    print(xb)
    print(xb.shape)
    print(yb)
    break

tf.Tensor(
[b"As a lifelong fan of Dickens, I have invariably been disappointed by adaptations of his novels.<br /><br />Although his works presented an extremely accurate re-telling of human life at every level in Victorian Britain, throughout them all was a pervasive thread of humour that could be both playful or sarcastic as the narrative dictated. In a way, he was a literary caricaturist and cartoonist. He could be serious and hilarious in the same sentence. He pricked pride, lampooned arrogance, celebrated modesty, and empathised with loneliness and poverty. It may be a clich\xc3\xa9, but he was a people's writer.<br /><br />And it is the comedy that is so often missing from his interpretations. At the time of writing, Oliver Twist is being dramatised in serial form on BBC television. All of the misery and cruelty is their, but non of the humour, irony, and savage lampoonery. The result is just a dark, dismal experience: the story penned by a journalist rather than a novelist. It'

### First version: using a pre-trained text embedding module

In [0]:
# As before, if you are running locally, you might need to install tensorflow_hub first.
import tensorflow_hub as tfhub

In [0]:
# Definitely read the documentation to understand what the module is doing!
# TODO: test other text embedding modules.
module_url = "https://tfhub.dev/google/nnlm-en-dim128/2"

In [0]:
# Load the module
embedder = tfhub.load(module_url)

In [0]:
# The module does tokenization + word embedding + reduction of all word embeddings to get a sentence embedding.
embedder(xb).shape

TensorShape([4, 128])
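
To make the three steps in the comment above concrete, here is a minimal pure-Python sketch of tokenization, embedding lookup, and reduction to a single sentence vector. The vocabulary `toy_embeddings`, its 3-dimensional values, and the helper `embed_sentence` are hypothetical. The real module has its own vocabulary, 128 dimensions, and its own rule for combining word vectors (check the module documentation), so this only illustrates why sentences of any length map to a fixed-size vector.

```python
# Hypothetical toy vocabulary with 3-dimensional embeddings (made-up values).
toy_embeddings = {
    "i": [0.1, 0.0, 0.2],
    "hated": [-0.9, 0.3, 0.0],
    "this": [0.0, 0.1, 0.1],
    "movie": [0.2, 0.5, -0.1],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumption: out-of-vocabulary words map to zeros


def embed_sentence(sentence):
    # 1) Tokenization: lowercase, strip basic punctuation, split on whitespace.
    tokens = [w.strip("!?.,") for w in sentence.lower().split()]
    tokens = [w for w in tokens if w]
    if not tokens:
        return list(UNKNOWN)
    # 2) Word embedding: look up one vector per token.
    vectors = [toy_embeddings.get(w, UNKNOWN) for w in tokens]
    # 3) Reduction: average the word vectors, dimension by dimension.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


short = embed_sentence("I hated this movie!")
longer = embed_sentence("I hated this movie, this movie!")
print(len(short), len(longer))  # both sentence vectors have the same size
```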

In [0]:
# You can wrap the module inside a KerasLayer object to use it inside other models made up of Keras layers.
embedder = tfhub.KerasLayer(module_url, dtype=tf.string, input_shape=[])

In [0]:
from tensorflow.keras import Sequential, layers, metrics, losses, optimizers

In [0]:
# TODO: experiment with hidden layers before the final classification step.
net = Sequential()
net.add(embedder)
net.add(layers.Dense(1, activation='sigmoid'))

In [0]:
# Note: the text embeddings are non-trainable by default (pass trainable=True to KerasLayer to fine-tune them).
net.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 128)               124642688 
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
Total params: 124,642,817
Trainable params: 129
Non-trainable params: 124,642,688
_________________________________________________________________


In [0]:
loss = losses.BinaryCrossentropy()
optimizer = optimizers.Adam()
acc = metrics.BinaryAccuracy()

In [0]:
net.compile(loss=loss, optimizer=optimizer, metrics=[acc])

In [0]:
net.fit(train_data.shuffle(1000).batch(32), epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f5d8e255940>
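
The `shuffle(1000)` call above does not shuffle the whole dataset at once: it keeps a buffer of 1000 elements and draws from it at random. The following pure-Python sketch mimics that behaviour under simplifying assumptions (the helpers `buffered_shuffle` and `batch` are hypothetical and not part of `tf.data`). It shows why a small buffer only mixes nearby elements.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    # Fill a buffer of `buffer_size` elements, then repeatedly emit a random
    # element from it and replace that slot with the next incoming item.
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit what is left in the buffer in random order.
    rng.shuffle(buffer)
    yield from buffer


def batch(items, batch_size):
    # Group consecutive elements; the last batch may be smaller.
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current


batches = list(batch(buffered_shuffle(range(10), buffer_size=3), batch_size=4))
print(batches)
```

With `buffer_size=1` the order is unchanged, while a buffer at least as large as the dataset gives a full shuffle.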

In [0]:
# Test on the test part
net.evaluate(imdb['test'].batch(32))



[0.4611027718657423, 0.78212]

In [0]:
# TODO: try out with different sentences!
xnew = tf.constant(['I hated this movie!'])
net(xnew)

<tf.Tensor: id=27217, shape=(1, 1), dtype=float32, numpy=array([[0.3621955]], dtype=float32)>
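
The network outputs a single sigmoid value, which can be read as the predicted probability of the positive class. As a small worked example, the sketch below thresholds the value printed above at 0.5 (an assumed, conventional cut-off) and computes the binary cross-entropy it would incur for each possible label. `bce` is a hypothetical helper, not the Keras implementation.

```python
import math

p = 0.3621955  # probability printed above for 'I hated this movie!'

label = int(p >= 0.5)  # 0 = negative review, 1 = positive review


def bce(y_true, p):
    # Binary cross-entropy for a single example.
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


print(label)
print(bce(0, p), bce(1, p))  # lower loss for the label the model leans towards
```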

### Second version: trainable embeddings with manual tokenization

In [0]:
from tensorflow.keras.preprocessing import text, sequence

In [0]:
# Extract the texts (not very elegant)
train_texts = [t[0].numpy().decode('utf-8') for t in train_data]

In [0]:
# Extract the labels
train_labels = [t[1].numpy() for t in train_data]

In [0]:
train_texts[0]

"As a lifelong fan of Dickens, I have invariably been disappointed by adaptations of his novels.<br /><br />Although his works presented an extremely accurate re-telling of human life at every level in Victorian Britain, throughout them all was a pervasive thread of humour that could be both playful or sarcastic as the narrative dictated. In a way, he was a literary caricaturist and cartoonist. He could be serious and hilarious in the same sentence. He pricked pride, lampooned arrogance, celebrated modesty, and empathised with loneliness and poverty. It may be a cliché, but he was a people's writer.<br /><br />And it is the comedy that is so often missing from his interpretations. At the time of writing, Oliver Twist is being dramatised in serial form on BBC television. All of the misery and cruelty is their, but non of the humour, irony, and savage lampoonery. The result is just a dark, dismal experience: the story penned by a journalist rather than a novelist. It's not really Dickens

In [0]:
# Define and fit a tokenizer restricted to the most frequent words in our training corpus
# (with num_words=250, Keras keeps the word indexes 1..249).
tokenizer = text.Tokenizer(num_words=250)
tokenizer.fit_on_texts(train_texts)

In [0]:
# Dictionary {idx: word}, where indexes are ordered by frequency.
tokenizer.index_word

{1: 'the',
 2: 'and',
 3: 'a',
 4: 'of',
 5: 'to',
 6: 'is',
 7: 'br',
 8: 'in',
 9: 'it',
 10: 'i',
 11: 'this',
 12: 'that',
 13: 'was',
 14: 'as',
 15: 'for',
 16: 'with',
 17: 'movie',
 18: 'but',
 19: 'film',
 20: 'on',
 21: 'not',
 22: 'you',
 23: 'are',
 24: 'his',
 25: 'have',
 26: 'he',
 27: 'be',
 28: 'one',
 29: 'all',
 30: 'at',
 31: 'by',
 32: 'an',
 33: 'they',
 34: 'who',
 35: 'so',
 36: 'from',
 37: 'like',
 38: 'her',
 39: 'or',
 40: 'just',
 41: 'about',
 42: "it's",
 43: 'out',
 44: 'has',
 45: 'if',
 46: 'some',
 47: 'there',
 48: 'what',
 49: 'good',
 50: 'more',
 51: 'when',
 52: 'very',
 53: 'up',
 54: 'no',
 55: 'time',
 56: 'she',
 57: 'even',
 58: 'my',
 59: 'would',
 60: 'which',
 61: 'only',
 62: 'story',
 63: 'really',
 64: 'see',
 65: 'their',
 66: 'had',
 67: 'can',
 68: 'were',
 69: 'me',
 70: 'well',
 71: 'than',
 72: 'we',
 73: 'much',
 74: 'been',
 75: 'bad',
 76: 'get',
 77: 'will',
 78: 'do',
 79: 'also',
 80: 'into',
 81: 'people',
 82: 'other',
 8

In [0]:
# Word counts
tokenizer.word_counts
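
A rough idea of what fitting the tokenizer computes: count every word, then assign indexes by decreasing frequency, starting from 1 (0 is left free for padding). The helper `fit_word_index` and the three toy reviews are hypothetical. The Keras tokenizer also filters punctuation, which this sketch skips.

```python
from collections import Counter


def fit_word_index(texts):
    # Count lowercase, whitespace-separated words over the whole corpus.
    word_counts = Counter(w for t in texts for w in t.lower().split())
    # Rank words by decreasing count; index 0 is left unused (padding).
    index_word = {rank: word for rank, (word, _) in
                  enumerate(word_counts.most_common(), start=1)}
    return word_counts, index_word


toy_texts = ["the movie was good", "the movie was bad", "the end"]
word_counts, index_word = fit_word_index(toy_texts)
print(word_counts["the"], index_word[1])
```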

In [0]:
# Tokenize the text: string --> sequence of integers corresponding to words inside the dictionary.
xtoken = tokenizer.texts_to_sequences(train_texts[0:2])
print(xtoken)

[[14, 3, 4, 10, 25, 74, 31, 4, 24, 7, 7, 24, 32, 4, 110, 30, 172, 8, 95, 29, 13, 3, 4, 12, 97, 27, 196, 39, 14, 1, 8, 3, 93, 26, 13, 3, 2, 26, 97, 27, 2, 8, 1, 169, 26, 2, 16, 2, 9, 200, 27, 3, 18, 26, 13, 3, 7, 7, 2, 9, 6, 1, 209, 12, 6, 35, 36, 24, 30, 1, 55, 4, 6, 109, 8, 20, 29, 4, 1, 2, 6, 65, 18, 4, 1, 2, 1, 6, 40, 3, 1, 62, 31, 3, 244, 71, 3, 42, 21, 63, 30, 29, 7, 7, 20, 1, 82, 6, 73, 5, 1, 1, 4, 6, 36, 1, 5, 1, 1, 122, 197, 1, 2, 8, 60, 1, 6, 14, 3, 25, 74, 125, 221, 6, 32, 7, 7, 18, 1, 6, 79, 47, 1, 4, 1, 1, 2, 23, 29, 40, 14, 14, 1, 59, 25, 7, 7, 2, 92, 47, 6, 6, 3, 14, 1, 44, 5, 7, 7, 21, 3, 36, 127, 3, 16, 31, 87, 14, 3, 73, 50, 71, 13, 201, 8, 1, 26, 13, 46, 4, 24, 202, 5, 1, 8, 5, 148, 26, 13, 79, 2, 8, 1, 17, 26, 6, 14, 139, 4, 3, 3, 4, 244, 71, 3, 4, 1, 109, 3, 193, 52, 168, 23, 16, 201, 29, 4, 1, 88, 23, 40, 192, 2, 6, 5, 30, 1, 169, 55, 15, 6, 128, 5, 2, 211, 3, 213, 7, 7, 172, 6, 148, 33, 78, 24, 196, 24, 2, 58, 133, 6, 1, 28, 8, 60, 1, 179, 5, 77, 42, 18, 42, 140, 

In [0]:
# Sentences have varying lengths!
print(len(xtoken[0]))
print(len(xtoken[1]))

318
111
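
The lengths differ because each review has a different number of words, and because words outside the `num_words` cutoff are dropped rather than replaced (assuming no out-of-vocabulary token is configured, as in the cell above). A minimal sketch with a hypothetical `word_index` and a hypothetical helper `to_sequence`:

```python
# Hypothetical word index, ordered by frequency as in the tokenizer above.
word_index = {"the": 1, "movie": 2, "was": 3, "good": 4, "dreadful": 5}


def to_sequence(text, word_index, num_words):
    # Keep only known words whose index is below the cutoff.
    ids = (word_index.get(w) for w in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]


seq_a = to_sequence("The movie was dreadful", word_index, num_words=5)
seq_b = to_sequence("Good", word_index, num_words=5)
print(seq_a, seq_b)
```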


In [0]:
train_tokens = tokenizer.texts_to_sequences(train_texts)

In [0]:
# Pad the sequences with zeros (optional: you can experiment with a maximum length via the maxlen argument).
sequence.pad_sequences(xtoken, padding='post')

array([[ 14,   3,   4,  10,  25,  74,  31,   4,  24,   7,   7,  24,  32,
          4, 110,  30, 172,   8,  95,  29,  13,   3,   4,  12,  97,  27,
        196,  39,  14,   1,   8,   3,  93,  26,  13,   3,   2,  26,  97,
         27,   2,   8,   1, 169,  26,   2,  16,   2,   9, 200,  27,   3,
         18,  26,  13,   3,   7,   7,   2,   9,   6,   1, 209,  12,   6,
         35,  36,  24,  30,   1,  55,   4,   6, 109,   8,  20,  29,   4,
          1,   2,   6,  65,  18,   4,   1,   2,   1,   6,  40,   3,   1,
         62,  31,   3, 244,  71,   3,  42,  21,  63,  30,  29,   7,   7,
         20,   1,  82,   6,  73,   5,   1,   1,   4,   6,  36,   1,   5,
          1,   1, 122, 197,   1,   2,   8,  60,   1,   6,  14,   3,  25,
         74, 125, 221,   6,  32,   7,   7,  18,   1,   6,  79,  47,   1,
          4,   1,   1,   2,  23,  29,  40,  14,  14,   1,  59,  25,   7,
          7,   2,  92,  47,   6,   6,   3,  14,   1,  44,   5,   7,   7,
         21,   3,  36, 127,   3,  16,  31,  87,  14

In [0]:
train_tokens = sequence.pad_sequences(train_tokens, padding='post')

In [0]:
len(train_tokens[0])

1200
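
Every review is now as long as the longest one (1200 tokens here), with zeros appended. The sketch below reproduces post-padding in plain Python. `pad_post` is a hypothetical helper, and its truncation rule (cut the end when `maxlen` is given) is an assumption of this sketch that may differ from the Keras defaults.

```python
def pad_post(sequences, maxlen=None, value=0):
    # Default target length: the longest sequence in the batch.
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        s = list(s[:maxlen])                            # truncate if too long
        padded.append(s + [value] * (maxlen - len(s)))  # append zeros
    return padded


print(pad_post([[14, 3, 4], [10]]))
print(pad_post([[14, 3, 4], [10]], maxlen=2))
```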

In [0]:
# Re-insert inside a Dataset (again, not entirely elegant)
train_data = tf.data.Dataset.from_tensor_slices((train_tokens, train_labels))

In [0]:
for xb, yb in train_data.batch(4):
  print(xb.shape)
  break

(4, 1200)


In [0]:
# Define custom embeddings (randomly initialized)
embedder = layers.Embedding(250, 128)

In [0]:
embedder(xb).shape

TensorShape([4, 1200, 128])
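
An embedding layer is a lookup table: each integer token selects one row of a trainable matrix, which is how a (4, 1200) batch becomes (4, 1200, 128). The sketch below uses a tiny hypothetical table (`table`, 4 tokens, 2 dimensions, made-up values). It also compares a plain average over time with one that skips padding, which matters because the padded zeros get an embedding too.

```python
# Hypothetical embedding table: row i is the vector of token i (row 0 = padding).
table = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]


def embed(tokens):
    # Lookup: one row per token -> shape (time, dim).
    return [table[t] for t in tokens]


def average(vectors):
    # Mean over the time axis -> shape (dim,).
    return [sum(col) / len(vectors) for col in zip(*vectors)]


tokens = [1, 3, 0, 0]  # two real tokens followed by two padding zeros
plain = average(embed(tokens))                          # padding included
masked = average(embed([t for t in tokens if t != 0]))  # padding skipped
print(plain, masked)
```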

In [0]:
# Simple model, like before
netv2 = Sequential()
netv2.add(embedder)
netv2.add(layers.GlobalAvgPool1D())
netv2.add(layers.Dense(1, activation='sigmoid'))

In [0]:
netv2(xb).shape

TensorShape([4, 1])

In [0]:
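# Use fresh loss/optimizer/metric objects rather than sharing state with the first model.
netv2.compile(
    loss=losses.BinaryCrossentropy(),
    optimizer=optimizers.Adam(),
    metrics=[metrics.BinaryAccuracy()]
)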

In [0]:
netv2.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 128)         32000     
_________________________________________________________________
global_average_pooling1d_1 ( (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 32,129
Trainable params: 32,129
Non-trainable params: 0
_________________________________________________________________


In [0]:
netv2.fit(train_data.shuffle(1000).batch(32),
          epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f5d408e6a90>

### Third version: convolutional neural network with dilated convolutions

In [0]:
# Define a convolutional block
def add_conv_block(model, filters, dilation):
    # Conv1D
    model.add(layers.Conv1D(filters, 5, 
                            dilation_rate=dilation))
    # BatchNorm
    model.add(layers.BatchNormalization())
    # ReLU
    model.add(layers.Activation('relu'))

In [0]:
# Note: mask_zero=True is set on the embedding, but the Conv1D layers do not use the mask,
# so padding is effectively not masked here! This could be improved.
# Also note how the dilation increases deeper in the network. For larger networks,
# we could have a repeating pattern 1/2/4/8/1/2/4/8 like in WaveNet.
netv3 = Sequential()
netv3.add(layers.Embedding(250, 128, mask_zero=True))
add_conv_block(netv3, 64, 1)
add_conv_block(netv3, 128, 2)
add_conv_block(netv3, 256, 4)
add_conv_block(netv3, 256, 8)
netv3.add(layers.GlobalAvgPool1D())
netv3.add(layers.Dense(1, activation='sigmoid'))
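
To see what the increasing dilation buys, the following sketch computes the receptive field of the stack above (kernel size 5, dilations 1, 2, 4, 8) and the output length for a 1200-token input, assuming the Conv1D defaults of stride 1 and 'valid' padding. The helper names are hypothetical.

```python
def receptive_field(kernel_size, dilations):
    # Each layer widens the field by (kernel_size - 1) * dilation tokens.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def output_length(input_length, kernel_size, dilations):
    # With 'valid' padding and stride 1, each layer shortens the sequence
    # by (kernel_size - 1) * dilation steps.
    for d in dilations:
        input_length -= (kernel_size - 1) * d
    return input_length


dilations = [1, 2, 4, 8]
print(receptive_field(5, dilations))      # tokens seen by each output step
print(receptive_field(5, [1, 1, 1, 1]))   # same depth without dilation
print(output_length(1200, 5, dilations))  # time steps left before pooling
```

Under these assumptions the dilated stack sees 61 tokens per output step, against 17 for four undilated layers of the same kernel size.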

In [0]:
netv3.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 128)         32000     
_________________________________________________________________
conv1d (Conv1D)              (None, None, 64)          41024     
_________________________________________________________________
batch_normalization (BatchNo (None, None, 64)          256       
_________________________________________________________________
activation (Activation)      (None, None, 64)          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, None, 128)         41088     
_________________________________________________________________
batch_normalization_1 (Batch (None, None, 128)         512       
_________________________________________________________________
activation_1 (Activation)    (None, None, 128)        
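
The `Param #` column can be checked by hand, which is a useful habit when debugging layer shapes. A minimal sketch with hypothetical helper names. The batch-normalization count assumes four vectors per channel (scale, offset, moving mean, moving variance).

```python
def embedding_params(vocab_size, dim):
    # One `dim`-dimensional row per token index.
    return vocab_size * dim


def conv1d_params(kernel_size, in_channels, filters):
    # One kernel of shape (kernel_size, in_channels) per filter, plus a bias.
    return kernel_size * in_channels * filters + filters


def batchnorm_params(channels):
    # gamma, beta, moving mean and moving variance for every channel.
    return 4 * channels


print(embedding_params(250, 128))   # embedding_2
print(conv1d_params(5, 128, 64))    # conv1d
print(batchnorm_params(64))         # batch_normalization
print(conv1d_params(5, 64, 128))    # conv1d_1
print(batchnorm_params(128))        # batch_normalization_1
```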

In [0]:
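# Use fresh loss/optimizer/metric objects rather than sharing state with the previous models.
netv3.compile(
    loss=losses.BinaryCrossentropy(),
    optimizer=optimizers.Adam(),
    metrics=[metrics.BinaryAccuracy()]
)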

In [0]:
netv3.fit(train_data.shuffle(1000).batch(32),
          epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
 19/782 [..............................] - ETA: 1:46 - loss: 0.3362 - binary_accuracy: 0.8542

KeyboardInterrupt: ignored

In [0]:
# TODO: Evaluate the model

### Save the embeddings for visualization

In [0]:
# Extract the embedding matrix:
# i-th row: 128-dimensional embedding for the token with index i (row 0 is the padding token).
for v in netv3.layers[0].trainable_variables:
  print(v.shape)

(250, 128)


In [0]:
import io

In [0]:
# We need one TSV file for embeddings and one for words.
# Read more here: http://projector.tensorflow.org/
words_tsv = io.open('words.tsv', 'w', encoding='utf-8')
embed_tsv = io.open('embeddings.tsv', 'w', encoding='utf-8')

In [0]:
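# Row i of the embedding matrix belongs to token index i; row 0 is the padding token,
# and the tokenizer (num_words=250) only emits indexes 1..249.
embedding_matrix = netv3.layers[0].get_weights()[0]
for idx in range(1, 250):
    # Save the word in the file
    word = tokenizer.index_word[idx]
    words_tsv.write(word + '\n')

    # Save the embedding vector (tab-separated)
    word_embedding = embedding_matrix[idx]
    tmp = '\t'.join([ str(e) for e in word_embedding ])
    embed_tsv.write(tmp + '\n')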

In [0]:
words_tsv.close()
embed_tsv.close()
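
Before uploading, it can be worth checking that the two files line up, since the i-th line of the words file is meant to label the i-th line of the embeddings file. A small sketch of such a check follows. `check_alignment` is a hypothetical helper, and it is demonstrated on in-memory strings rather than on the files written above.

```python
def check_alignment(words_text, embeddings_text, dim):
    words = words_text.splitlines()
    rows = [line.split('\t') for line in embeddings_text.splitlines()]
    # Both files must have the same number of lines.
    if len(words) != len(rows):
        return False
    # Every embedding row must have exactly `dim` numeric columns.
    for row in rows:
        if len(row) != dim:
            return False
        try:
            [float(x) for x in row]
        except ValueError:
            return False
    return True


words_text = "the\nand\n"
embeddings_text = "0.1\t0.2\n0.3\t0.4\n"
print(check_alignment(words_text, embeddings_text, dim=2))
```

To apply it to the real files, read `words.tsv` and `embeddings.tsv` back as text and pass `dim=128`.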

In [0]:
# If on Colab, download the two files (on the left menu, go to Files >> right click on the files >> Download).

In [0]:
# Load the files here and inspect them: http://projector.tensorflow.org/
# TODO: looking at the embeddings, what do you see? What could be improved?