# TOC

__Chapter 5 - Text I: Working with Text and Sequences, and TensorBoard Visualization__

1. [Import](#Import)
1. [The importance of sequence data](#The-importance-of-sequence-data)
1. [Introduction to recurrent neural networks](#Introduction-to-recurrent-neural-networks)
    1. [MNIST images as sequences](#MNIST-images-as-sequences)
        1. [The RNN step](#The-RNN-step)
        1. [Sequential outputs](#Sequential-outputs)
        1. [RNN classification](#RNN-classification)
1. [Visualizing the model with TensorBoard](#Visualizing-the-model-with-TensorBoard)
1. [TensorFlow built-in RNN functions](#TensorFlow-built-in-RNN-functions)
1. [RNN for Text Sequences](#RNN-for-Text-Sequences)
    1. [Text sequences](#text-sequences)
    1. [Supervised word embeddings](#Supervised-word-embeddings)
    1. [LSTM and using sequence length](#LSTM-and-using-sequence-length)
    1. [Training embeddings and the LSTM classifier](#Training-embeddings-and-the-LSTM-classifier)
    1. [Stacking multiple LSTMs](#Stacking-multiple-LSTMs)

# Import

<a id = 'Import'></a>

In [1]:
# standard library and settings
import os
import sys
import importlib
import itertools
import warnings

warnings.simplefilter("ignore")
from IPython.core.display import display, HTML

display(HTML("<style>.container { width:95% !important; }</style>"))

# data extensions and settings
import numpy as np

np.set_printoptions(threshold=np.inf, suppress=True)
import pandas as pd

pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.options.display.float_format = "{:,.6f}".format

import tensorflow as tf

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

# visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt

# custom extensions and settings
for path in ("/home/mlmachine", "/home/prettierplot"):
    if path not in sys.path:
        sys.path.append(path)

import mlmachine as mlm
from prettierplot.plotter import PrettierPlot
import prettierplot.style as style

# magic functions
%matplotlib inline

# The importance of sequence data

The previous chapter highlighted that exploiting the structure of data is key to success - the spatial structure of image pixels can be quite informative. Another important kind of structure is sequential structure. It occurs in many contexts, including video, audio, gene sequences in genomics, longitudinal medical records in healthcare, and financial data such as stock prices.

A particularly important type of sequential data with strong structure is natural language. Deep learning can exploit the inherent structure of text that appears between individual characters, words, sentences, paragraphs and even entire documents. Common pursuits include document classification, automated question answering and human-level conversational bots.

This chapter focuses on the basic building blocks and tasks associated with sequence data.

<a id = 'The-importance-of-sequence-data'></a>

# Introduction to recurrent neural networks

The idea behind RNN models is that each new element in the sequence contributes new information, which updates the current state of the model. The intuition is close to our commonplace understanding of how we process sequential information day to day. Our memory is not cleared upon arriving at new information - rather, our memory is updated.

A fundamental building block often used for modeling sequential patterns via machine learning is the Markov chain model. In a general sense, we can view our data sequences as "chains", with each node in the chain dependent in some way on the previous node, in effect carrying forward the history of the sequence.

RNN models are also based on this notion of a chain. Types of RNNs vary in how they maintain and update information. RNNs apply some form of loop where an input $x_t$ (perhaps a word in a sequence) is observed at time $t$, and the network updates its "state vector" to $h_t$ from the previous vector $h_{t-1}$. The subsequent update then depends on $h_t$, which in effect retains the history of the sequence up to that point. This can be thought of as one long unrolled chain, where each link in the chain involves the same kind of processing step based on the history understood up to that point in the chain.
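The update from $h_{t-1}$ to $h_t$ can be sketched in a few lines of NumPy. This is a minimal illustration only: the sizes and the randomly drawn parameters `W_x`, `W_h` and `b` are assumptions made for the example, and the tanh update is one common choice rather than the only possible one.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, time_steps = 4, 3, 5

# hypothetical parameters, shared by every link in the chain
W_x = rng.normal(scale=0.1, size=(input_size, hidden_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)


def rnn_update(h_prev, x_t):
    """Compute h_t from the previous state h_{t-1} and the new input x_t."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)


sequence = rng.normal(size=(time_steps, input_size))
h = np.zeros(hidden_size)  # empty memory before the first element
history = []
for x_t in sequence:
    h = rnn_update(h, x_t)  # the memory is updated, not cleared
    history.append(h)

history = np.stack(history)  # shape: (time_steps, hidden_size)
```

Every step reuses the same parameters, so the sequence can be of any length without adding weights.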





<a id = 'Introduction-to-recurrent-neural-networks'></a>

## MNIST images as sequences

The intuition behind CNNs and pixel arrangement is more readily apparent than the potential application of RNNs to image identification tasks, but RNNs offer a different angle on the structure of image data, which can be an informative complement to CNN techniques.

A simple view of a 28 x 28 MNIST image sample is to think of it as a sequence of rows (or columns). Each image can be viewed as a sequence of length 28, and each element in the sequence is a vector of 28 pixel values. With this construct in mind, the RNN can be thought of as a scanner, moving from the top of the image to the bottom (when looking at row sequences) or left to right (when looking at column sequences).
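As a small illustration of this view, the sketch below reshapes a flattened 784-value vector into 28 rows of 28 values. The array `flat_image` is a synthetic stand-in rather than a real MNIST sample.

```python
import numpy as np

# stand-in for one flattened MNIST sample (784 pixel values); not real data
flat_image = np.arange(28 * 28, dtype=np.float32) / (28 * 28)

rows_as_sequence = flat_image.reshape(28, 28)  # 28 time steps, top to bottom
cols_as_sequence = rows_as_sequence.T          # 28 time steps, left to right

for t, row in enumerate(rows_as_sequence[:3]):
    print(f"time step {t}: vector of {row.shape[0]} pixel values")
```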

<a id = 'MNIST-images-as-sequences'></a>

In [2]:
# MNIST data modeled as a sequence of pixels
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/main/tmp/data/", one_hot=True)

# parameters
element_size = 28
time_steps = 28
num_classes = 10
batch_size = 128
hidden_layer_size = 128

# location to save TensorBoard model summaries
logDir = "/main/logs/RNN_with_summaries"

# create placeholders for inputs, labels
_inputs = tf.placeholder(
    tf.float32, shape=[None, time_steps, element_size], name="inputs"
)
y = tf.placeholder(tf.float32, shape=[None, num_classes], name="labels")

Extracting /main/tmp/data/train-images-idx3-ubyte.gz



Extracting /main/tmp/data/train-labels-idx1-ubyte.gz
Extracting /main/tmp/data/t10k-images-idx3-ubyte.gz
Extracting /main/tmp/data/t10k-labels-idx1-ubyte.gz


> Remarks - element_size is the dimension of each vector in our sequence (# of pixels) and time_steps is the number of such elements in a full sequence. hidden_layer_size is arbitrarily set to 128 and controls the size of the hidden RNN state vector described above.

### The RNN step

<a id = 'The-RNN-step'></a>

In [3]:
# load data
batch_x, batch_y = mnist.train.next_batch(batch_size)

# reshape each image into a sequence of 28 rows of 28 pixels
batch_x = batch_x.reshape((batch_size, time_steps, element_size))

In [4]:
# helper function for logging summary data for TensorBoard
def variable_summaries(var):
    with tf.name_scope("summaries"):
        mean = tf.reduce_mean(var)
        tf.summary.scalar("mean", mean)
        with tf.name_scope("stddev"):
            stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
        tf.summary.scalar("stddev", stddev)
        tf.summary.scalar("max", tf.reduce_max(var))
        tf.summary.scalar("min", tf.reduce_min(var))
        tf.summary.histogram("histogram", var)


# weights and bias for input and hidden layer
with tf.name_scope("rnn_weights"):
    with tf.name_scope("W_x"):
        Wx = tf.Variable(tf.zeros([element_size, hidden_layer_size]))
        variable_summaries(Wx)
    with tf.name_scope("W_h"):
        Wh = tf.Variable(tf.zeros([hidden_layer_size, hidden_layer_size]))
        variable_summaries(Wh)
    with tf.name_scope("Bias"):
        b_rnn = tf.Variable(tf.zeros([hidden_layer_size]))
        variable_summaries(b_rnn)

# apply RNN step with tf.scan()
def rnn_step(previous_hidden_state, x):
    current_hidden_state = tf.tanh(
        tf.matmul(previous_hidden_state, Wh) + tf.matmul(x, Wx) + b_rnn
    )
    return current_hidden_state


# processing inputs to work with scan function
# current input shape: (batch_size, time_steps, element_size)
processed_input = tf.transpose(_inputs, perm=[1, 0, 2])
# current input shape now - (time_steps, batch_size, element_size)

initial_hidden = tf.zeros([batch_size, hidden_layer_size])

# getting all state vectors across time
all_hidden_states = tf.scan(
    rnn_step, processed_input, initializer=initial_hidden, name="states"
)

> Remarks - the inputs are transposed from [batch_size, time_steps, element_size] to [time_steps, batch_size, element_size]. The perm argument to tf.transpose() tells TensorFlow which axes to switch around such that the first axis in our input Tensor now represents the time axis. We then use the built-in tf.scan() function, which repeatedly applies a function to a sequence of elements in order. tf.scan() is used to introduce loops into the computation graph, which allows us to avoid 'unrolling' the loops explicitly by adding more and more replications of the same operation. This function enables the graph to have a dynamic number of iterations.
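The effect of the axis permutation can be checked on a tiny array. The sketch below uses NumPy's `np.transpose` with the same permutation, on the assumption that it mirrors what `tf.transpose(..., perm=[1, 0, 2])` does to the shape. The sizes are made up for the example.

```python
import numpy as np

batch_size, time_steps, element_size = 2, 3, 4
batch = np.arange(batch_size * time_steps * element_size).reshape(
    batch_size, time_steps, element_size
)

# same axis permutation as perm=[1, 0, 2]: swap the batch and time axes
time_major = np.transpose(batch, (1, 0, 2))

print(batch.shape)       # (2, 3, 4)
print(time_major.shape)  # (3, 2, 4)
```

After the swap, iterating over the first axis yields one time step at a time, each holding that step's vector for every instance in the batch.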

In [5]:
# tf.scan() demonstration
import numpy as np
import tensorflow as tf

elems = np.array(["T", "e", "n", "s", "o", "r", " ", "f", "l", "o", "w"])
scan_sum = tf.scan(lambda a, x: a + x, elems)

sess = tf.InteractiveSession()
sess.run(scan_sum)

array([b'T', b'Te', b'Ten', b'Tens', b'Tenso', b'Tensor', b'Tensor ',
       b'Tensor f', b'Tensor fl', b'Tensor flo', b'Tensor flow'],
      dtype=object)

> Remarks - tf.scan() is used to sequentially concatenate characters into a string, which mimics an arithmetic cumulative sum.
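For comparison, the same accumulate-and-carry pattern exists in the Python standard library as `itertools.accumulate`. The sketch below is only an analogy to tf.scan() and does not involve a computation graph.

```python
from itertools import accumulate

chars = ["T", "e", "n", "s", "o", "r", " ", "f", "l", "o", "w"]

# running concatenation: each result is carried into the next call
prefixes = list(accumulate(chars, lambda acc, ch: acc + ch))

# the same pattern on numbers is an ordinary cumulative sum
running_totals = list(accumulate([1, 2, 3, 4], lambda acc, x: acc + x))

print(prefixes[-1])    # Tensor flow
print(running_totals)  # [1, 3, 6, 10]
```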

### Sequential outputs

<a id = 'Sequential-outputs'></a>

In [6]:
# weights for outputs layers
with tf.name_scope("linear_layer_weights") as scope:
    with tf.name_scope("W_linear"):
        W1 = tf.Variable(
            tf.truncated_normal([hidden_layer_size, num_classes], mean=0, stddev=0.01)
        )
        variable_summaries(W1)
    with tf.name_scope("Bias_linear"):
        b1 = tf.Variable(tf.truncated_normal([num_classes], mean=0, stddev=0.01))
        variable_summaries(b1)

# apply linear layer to state vector
def get_linear_layer(hidden_state):
    return tf.matmul(hidden_state, W1) + b1


with tf.name_scope("linear_layer_weights") as scope:
    # iterate across time, apply linear layer to all RNN outputs
    all_outputs = tf.map_fn(get_linear_layer, all_hidden_states)

    # get last output
    output = all_outputs[-1]
    tf.summary.histogram("outputs", output)

> Remarks - The RNN input is sequential and so is the output. In this example, the last state vector is passed through a fully connected linear layer to extract an output vector (which then gets passed through a softmax activation function to generate predictions). This relies on the assumption that the last state vector has accumulated information representing the entire sequence.
To implement this, we define the linear layer's weight and bias variables, and create a helper function for this layer. This layer is then applied to all hidden states with tf.map_fn(), which applies a function to each element of a sequence. Lastly, we extract the last output for each instance in the batch (with negative indexing).
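The shapes involved can be traced with a NumPy sketch. The array `hidden_states` is a random stand-in for the states produced by the RNN, and the small sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
time_steps, batch_size, hidden_size, num_classes = 28, 4, 8, 10

# stand-in for the RNN states: one state per time step per instance
hidden_states = rng.normal(size=(time_steps, batch_size, hidden_size))
W = rng.normal(scale=0.01, size=(hidden_size, num_classes))
b = np.zeros(num_classes)

# apply the same linear layer at every time step (the role of tf.map_fn)
all_outputs = np.stack([h_t @ W + b for h_t in hidden_states])

# keep only the output computed from the final state
last_output = all_outputs[-1]  # shape: (batch_size, num_classes)
```

Because time is the first axis, negative indexing with `[-1]` selects the final time step for the whole batch at once.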

### RNN classification

To train the classifier, we need to define operations for loss function computation, optimization and predictions, as well as add more summaries for TensorBoard, and merge all of these summaries into one operation.
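As a rough picture of what the loss and accuracy operations compute, here is a NumPy sketch of mean softmax cross-entropy and percentage accuracy. The helper names and the toy logits are assumptions for illustration and are not part of the TensorFlow API.

```python
import numpy as np


def softmax_cross_entropy(logits, one_hot_labels):
    """Mean cross-entropy between softmax(logits) and one-hot labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_labels * log_probs).sum(axis=1).mean()


def accuracy_percent(logits, one_hot_labels):
    """Share of rows whose largest logit matches the label, in percent."""
    hits = logits.argmax(axis=1) == one_hot_labels.argmax(axis=1)
    return hits.mean() * 100


logits = np.array([[2.0, 0.5, 0.1], [0.2, 0.1, 3.0], [1.0, 1.5, 0.3]])
labels = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 0]])

loss = softmax_cross_entropy(logits, labels)
acc = accuracy_percent(logits, labels)  # two of the three rows are correct
```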

<a id = 'RNN-classification'></a>

In [7]:
# RNN components with summaries for TensorBoard
with tf.name_scope("cross_entropy"):
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=output, labels=y)
    )
    tf.summary.scalar("cross_entropy", cross_entropy)

with tf.name_scope("train"):
    train_step = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cross_entropy)

with tf.name_scope("accuracy"):
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(output, 1))
    accuracy = (tf.reduce_mean(tf.cast(correct_prediction, tf.float32))) * 100
    tf.summary.scalar("accuracy", accuracy)

# merge all the summaries
merged = tf.summary.merge_all()



In [8]:
# prep data
test_data = mnist.test.images[:batch_size].reshape((-1, time_steps, element_size))
test_label = mnist.test.labels[:batch_size]

# run session
with tf.Session() as sess:

    # write summaries to log directory for TensorBoard
    train_writer = tf.summary.FileWriter(
        logDir + "/train", graph=tf.get_default_graph()
    )
    test_writer = tf.summary.FileWriter(logDir + "/test", graph=tf.get_default_graph())
    sess.run(tf.global_variables_initializer())

    for i in range(10000):
        batch_x, batch_y = mnist.train.next_batch(batch_size)

        # reshape each image into a sequence of 28 rows of 28 pixels
        batch_x = batch_x.reshape([batch_size, time_steps, element_size])
        summary, _ = sess.run(
            [merged, train_step], feed_dict={_inputs: batch_x, y: batch_y}
        )
        # add to summaries, tagged with the current training step
        train_writer.add_summary(summary, i)

        if i % 1000 == 0:
            acc, loss = sess.run(
                [accuracy, cross_entropy], feed_dict={_inputs: batch_x, y: batch_y}
            )
            print(
                "Iter {}, Minibatch Loss = {:.6f}, Training Accuracy = {:.5f}".format(
                    i, loss, acc
                )
            )
        # every 10th iteration, log summaries for the test images
        if i % 10 == 0:

            # calculate accuracy for 128 MNIST test images and add to summaries
            summary, acc = sess.run(
                [merged, accuracy], feed_dict={_inputs: test_data, y: test_label}
            )
            test_writer.add_summary(summary, i)
    test_acc = sess.run(accuracy, feed_dict={_inputs: test_data, y: test_label})
    print("Test accuracy: {:.5}".format(test_acc))

Iter 0, Minibatch Loss = 2.302247, Training Accuracy = 10.93750
Iter 1000, Minibatch Loss = 1.264822, Training Accuracy = 56.25000
Iter 2000, Minibatch Loss = 0.753350, Training Accuracy = 80.46875
Iter 3000, Minibatch Loss = 0.299778, Training Accuracy = 90.62500
Iter 4000, Minibatch Loss = 0.100001, Training Accuracy = 97.65625
Iter 5000, Minibatch Loss = 0.022463, Training Accuracy = 100.00000
Iter 6000, Minibatch Loss = 0.069040, Training Accuracy = 97.65625
Iter 7000, Minibatch Loss = 0.035403, Training Accuracy = 99.21875
Iter 8000, Minibatch Loss = 0.038090, Training Accuracy = 99.21875
Iter 9000, Minibatch Loss = 0.042109, Training Accuracy = 99.21875
Test accuracy: 96.094


# Visualizing the model with TensorBoard

<a id = 'Visualizing-the-model-with-TensorBoard'></a>

In a terminal window, run:


tensorboard --logdir <log directory here>


> Remarks - For each run, ensure that logDir references a new set of log files.
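One way to get a fresh log location per run is to append a timestamp to the base directory. The helper `new_run_dir` below is a hypothetical convenience written for this note. It produces distinct paths only when runs start at least one second apart.

```python
import os
from datetime import datetime


def new_run_dir(base_dir):
    """Return a log path unique to this run, e.g. <base_dir>/run_20190720-150904."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    return os.path.join(base_dir, "run_" + stamp)


run_dir = new_run_dir("/main/logs/RNN_with_summaries")
print(run_dir)
```

Pointing `--logdir` at the base directory then lets TensorBoard list each subdirectory as a separate run.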

# TensorFlow built-in RNN functions

The following code segment contains a shorter version of the example above, this time utilizing select TensorFlow built-in functions.

<a id = 'TensorFlow-built-in-RNN-functions'></a>

In [9]:
# RNN example utilizing TensorFlow built-in functions for increased efficiency
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
element_size = 28
time_steps = 28
num_classes = 10
batch_size = 128
hidden_layer_size = 128

_inputs = tf.placeholder(
    tf.float32, shape=[None, time_steps, element_size], name="inputs"
)
y = tf.placeholder(tf.float32, shape=[None, num_classes], name="labels")

# TensorFlow built-in functions
rnn_cell = tf.contrib.rnn.BasicRNNCell(hidden_layer_size)
outputs, _ = tf.nn.dynamic_rnn(rnn_cell, _inputs, dtype=tf.float32)

W1 = tf.Variable(
    tf.truncated_normal([hidden_layer_size, num_classes], mean=0, stddev=0.01)
)
b1 = tf.Variable(tf.truncated_normal([num_classes], mean=0, stddev=0.01))

# apply linear layer to a state vector
def get_linear_layer(vector):
    return tf.matmul(vector, W1) + b1


# RNN output at the final time step for every instance in the batch
last_rnn_output = outputs[:, -1, :]
final_output = get_linear_layer(last_rnn_output)

softmax = tf.nn.softmax_cross_entropy_with_logits(logits=final_output, labels=y)
cross_entropy = tf.reduce_mean(softmax)
train_step = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cross_entropy)

correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(final_output, 1))
accuracy = (tf.reduce_mean(tf.cast(correct_prediction, tf.float32))) * 100

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

test_data = mnist.test.images[:batch_size].reshape((-1, time_steps, element_size))
test_label = mnist.test.labels[:batch_size]

for i in range(3001):
    batch_x, batch_y = mnist.train.next_batch(batch_size)
    batch_x = batch_x.reshape((batch_size, time_steps, element_size))
    sess.run(train_step, feed_dict={_inputs: batch_x, y: batch_y})
    if i % 1000 == 0:
        acc, loss = sess.run(
            [accuracy, cross_entropy], feed_dict={_inputs: batch_x, y: batch_y}
        )
        print(
            "Iter {}, Minibatch Loss = {:.6f}, Training Accuracy = {:.5f}".format(
                i, loss, acc
            )
        )

# evaluate on the held-out test images rather than the last training batch
print(
    "Testing accuracy: {}".format(
        sess.run(accuracy, feed_dict={_inputs: test_data, y: test_label})
    )
)


Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz

Iter 0, Minibatch Loss = 2.303380, Training Accuracy = 9.37500
Iter 1000, Minibatch Loss = 0.134066, Training Accuracy = 97.65625
Iter 2000, Minibatch Loss = 0.179433, Training Accuracy = 94.53125
Iter 3000, Minibatch Loss = 0.250388, Training Accuracy = 91.40625


> Remarks
- tf.contrib.rnn.BasicRNNCell is an abstraction that represents the basic operations each recurrent cell carries out, as well as its associated state. It replaces the rnn_step() function and the variables that function required.
- Once rnn_cell is created, it is passed to tf.nn.dynamic_rnn(). This function replaces tf.scan() in the implementation above and creates an RNN specified by rnn_cell.

# RNN for Text Sequences

The MNIST RNN example was useful for instilling the intuition behind RNNs, but a more prominent application of RNNs is analyzing text data.

We will eventually model movie review data, but we will start with some example data and discuss some key properties of text datasets.

<a id = 'RNN-for-Text-Sequences'></a>

## Text sequences

Text sequences can be letters in a word, words in a sentence, sentences in a paragraph, or even documents in a corpus.

Take this sentence as an example: "Our company provides smart agriculture solutions for farms, with advanced AI, deep-learning." We'll pretend this is a sentence from an online news blog, and we want to process it in our machine learning system.

Each word in the sentence can be represented with an ID - an integer, commonly referred to as a token ID in NLP. So the word "agriculture" could be mapped to the integer 3452, the word "farms" to 12, and "deep-learning" to 0. This representation of the text data is quite different from the pixel vectors we have used up to this point.
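A minimal sketch of such a mapping is shown below. The tokenization (lowercasing, stripping periods and commas, splitting on spaces) and the first-appearance ID scheme are simplifying assumptions, so the resulting IDs differ from the illustrative numbers quoted above.

```python
sentence = (
    "Our company provides smart agriculture solutions for farms, "
    "with advanced AI, deep-learning."
)

# naive tokenization: split on spaces, strip punctuation, lowercase
tokens = [word.strip(".,").lower() for word in sentence.split()]

# assign IDs in order of first appearance (the IDs themselves are arbitrary)
token_to_id = {}
for token in tokens:
    if token not in token_to_id:
        token_to_id[token] = len(token_to_id)

token_ids = [token_to_id[token] for token in tokens]
print(token_ids)
```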

We create simulated data consisting of two classes of very short "sentences". One class is composed of odd digits and the other of even digits, with each digit written as an English word. Our goal is to learn to classify each sentence as either odd or even, as a supervised text-classification task.

What follows is a contrived example for illustrative purposes.

<a id = 'text-sequences'></a>

In [10]:
# model parameters
batch_size = 128
embedding_dimension = 64
num_classes = 2
hidden_layer_size = 32
time_steps = 6
element_size = 1

We create sentences by randomly sampling digits and mapping them to the corresponding English word, e.g. 1 is mapped to "One". The sentences have varying lengths between 3 and 6 words. For all input sentences to fit into one tensor, however, they need to be the same size. To achieve this, we pad every sentence shorter than 6 words with zeros, which map to the PAD symbol, up to the full length. This process is called zero padding.

In [11]:
# map numbers to text description
digit_to_word_map = {
    1: "One",
    2: "Two",
    3: "Three",
    4: "Four",
    5: "Five",
    6: "Six",
    7: "Seven",
    8: "Eight",
    9: "Nine",
}
digit_to_word_map[0] = "PAD"

even_sentences = []
odd_sentences = []
seqlens = []
for i in range(10000):
    rand_seq_len = np.random.choice(range(3, 7))
    seqlens.append(rand_seq_len)
    rand_odd_ints = np.random.choice(range(1, 10, 2), rand_seq_len)
    rand_even_ints = np.random.choice(range(2, 10, 2), rand_seq_len)

    # padding
    if rand_seq_len < 6:
        rand_odd_ints = np.append(rand_odd_ints, [0] * (6 - rand_seq_len))
        rand_even_ints = np.append(rand_even_ints, [0] * (6 - rand_seq_len))
    even_sentences.append(" ".join([digit_to_word_map[r] for r in rand_even_ints]))
    odd_sentences.append(" ".join([digit_to_word_map[r] for r in rand_odd_ints]))

# concat
data = even_sentences + odd_sentences
seqlens *= 2

In [12]:
# even sentences
even_sentences[0:6]

['Four Four Eight Two PAD PAD',
 'Six Six Six PAD PAD PAD',
 'Four Two Eight Four Two PAD',
 'Four Two Six PAD PAD PAD',
 'Eight Four Six Four Two PAD',
 'Four Two Four Two Four Eight']

In [13]:
# odd sentences
odd_sentences[0:6]

['Seven One Seven Nine PAD PAD',
 'Three Nine Seven PAD PAD PAD',
 'Five One Five Five Three PAD',
 'Three Five Nine PAD PAD PAD',
 'Nine Seven Seven Three Seven PAD',
 'Five Seven Nine Five Three Seven']

In [14]:
# original sequence lengths
seqlens[0:6]

[4, 3, 5, 3, 5, 6]

> Remarks - We keep the original sentence lengths because adding zero-padding solves one problem but creates another - if we pass padded sentences through the RNN, it will process our useless PAD symbols. We solve this problem by storing the original lengths in seqlens and then telling tf.nn.dynamic_rnn() where each sentence truly ends.
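To see how stored lengths separate real words from padding, the NumPy sketch below builds a boolean mask from a few example lengths. It only illustrates the idea and does not show how tf.nn.dynamic_rnn() is implemented internally.

```python
import numpy as np

seqlens = np.array([4, 3, 6])
max_len = 6

# mask[i, t] is True while step t is a real word of sentence i
mask = np.arange(max_len)[None, :] < seqlens[:, None]

# stand-in for per-step values of shape (batch, time, features)
per_step = np.ones((3, max_len, 2))
masked = per_step * mask[:, :, None]  # PAD steps become zero

print(mask.astype(int))
```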

In [15]:
# map words to arbitrarily chosen indices with a dictionary
# also create the inverse
word2index_map = {}
index = 0
for sent in data:
    for word in sent.lower().split():
        if word not in word2index_map:
            word2index_map[word] = index
            index += 1

# inverse map
index2word_map = {index: word for word, index in word2index_map.items()}
vocabulary_size = len(index2word_map)

In [16]:
# create array of labels in one-hot format
labels = [1] * 10000 + [0] * 10000
for i in range(len(labels)):
    label = labels[i]
    one_hot_encoding = [0] * 2
    one_hot_encoding[label] = 1
    labels[i] = one_hot_encoding

In [17]:
# train/test split
data_indices = list(range(len(data)))
np.random.shuffle(data_indices)
data = np.array(data)[data_indices]

labels = np.array(labels)[data_indices]
seqlens = np.array(seqlens)[data_indices]
train_x = data[:10000]
train_y = labels[:10000]
train_seqlens = seqlens[:10000]

test_x = data[10000:]
test_y = labels[10000:]
test_seqlens = seqlens[10000:]

In [18]:
# create batches of sentences comprised of integer IDs
def get_sentence_batch(batch_size, data_x, data_y, data_seqlens):
    instance_indices = list(range(len(data_x)))
    np.random.shuffle(instance_indices)
    batch = instance_indices[:batch_size]
    x = [[word2index_map[word] for word in data_x[i].lower().split()] for i in batch]
    y = [data_y[i] for i in batch]
    seqlens = [data_seqlens[i] for i in batch]
    return x, y, seqlens

In [19]:
# create placeholders for data
_inputs = tf.placeholder(tf.int32, shape=[batch_size, time_steps])
_labels = tf.placeholder(tf.float32, shape=[batch_size, num_classes])
_seqlens = tf.placeholder(tf.int32, shape=[batch_size])

## Supervised word embeddings

The text data is encoded as lists of word IDs. This atomic style of representation does not scale when training deep learning models with large vocabularies. When the vocabulary is large, we could end up with millions of word IDs, each encoded in one-hot fashion, which leads to great data sparsity and computational issues.

Word embedding is a potential solution to this problem. Embeddings are mappings from high-dimensional one-hot vectors that encode words to lower-dimensional dense vectors. For example, if the vocabulary contains 100,000 words, each word's one-hot representation is a vector of that same size. The high-dimensional one-hot vectors are embedded into a continuous vector space with much lower dimensionality. A popular method for this, word2vec, is explored in chapter 6.

Word embeddings can be thought of as hash tables or lookup tables, mapping words to their dense vector values. These vectors are optimized as part of the training process. Previously, each word was associated with an integer index, and sentences were represented as sequences of these indices. Now, to obtain a word's vector, we use the built-in tf.nn.embedding_lookup() function, which retrieves the vectors for each word in a given sequence of word indices.
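The lookup-table view can be made concrete with NumPy indexing. In the sketch below the table values are random stand-ins for learned vectors and the sizes are made up. It also checks that a lookup gives the same result as multiplying one-hot vectors by the table.

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size, embedding_dimension = 10, 4

# one row per word ID; in training these values would be learned
embedding_table = rng.uniform(-1.0, 1.0, size=(vocabulary_size, embedding_dimension))

sentence_ids = np.array([3, 7, 7, 0])     # a sentence as word IDs
embedded = embedding_table[sentence_ids]  # shape: (4, embedding_dimension)

# the lookup is equivalent to multiplying one-hot vectors by the table
one_hot = np.eye(vocabulary_size)[sentence_ids]
assert np.allclose(one_hot @ embedding_table, embedded)
```

The lookup avoids ever materializing the sparse one-hot vectors, which is where the savings for large vocabularies come from.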

<a id = 'Supervised-word-embeddings'></a>

In [20]:
# create word embeddings
with tf.name_scope("embeddings"):
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_dimension], -1.0, 1.0),
        name="embedding",
    )
    embed = tf.nn.embedding_lookup(embeddings, _inputs)

## LSTM and using sequence length

The basic RNN implementation above is generally not used in practice. More advanced models differ mainly in how they update their hidden state and propagate information through time. A popular choice is the long short-term memory (LSTM) network. It differs from our basic RNN by having special memory mechanisms that enable the recurrent cells to better store information for long periods of time. This captures long-term dependencies better than basic RNNs.

The efficiencies in LSTMs arise from additional parameters added to each recurrent cell which, generally speaking, enable the RNN to overcome optimization issues. These parameters separate the information that is 'worth remembering' and passing forward from the information that is 'worth forgetting'.
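A single LSTM step can be sketched in NumPy using the commonly cited gate equations. This is an illustration under assumptions (one combined weight matrix `W`, a forget-gate bias of 1.0, random parameters) and is not meant as a reproduction of TensorFlow's internal implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b, forget_bias=1.0):
    """One LSTM step: gates decide what to forget, what to add, what to emit."""
    z = np.concatenate([x_t, h_prev], axis=1) @ W + b
    i, j, f, o = np.split(z, 4, axis=1)  # input gate, candidate, forget gate, output gate
    c_t = sigmoid(f + forget_bias) * c_prev + sigmoid(i) * np.tanh(j)
    h_t = sigmoid(o) * np.tanh(c_t)
    return h_t, c_t


rng = np.random.default_rng(0)
batch, input_size, hidden = 2, 3, 5
W = rng.normal(scale=0.1, size=(input_size + hidden, 4 * hidden))
b = np.zeros(4 * hidden)

h = np.zeros((batch, hidden))
c = np.zeros((batch, hidden))
for x_t in rng.normal(size=(4, batch, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)
```

The cell state `c` is the long-term memory, and the forget gate scales how much of it survives each step.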

To implement this, we create an LSTM cell with tf.contrib.rnn.BasicLSTMCell() and feed it into tf.nn.dynamic_rnn(), just as we did in the implementation above. We also give dynamic_rnn() the length of each sequence through the _seqlens placeholder. This allows TensorFlow to stop all RNN steps once the last 'real' element is reached in each sample, which effectively ignores the PAD elements. It also returns all output vectors over time. These appear in the outputs tensor, and are zero-padded beyond the end of the true sequence.

<a id = 'LSTM-and-using-sequence-length'></a>

In [21]:
# use built-in LSTM function to improve RNN model
with tf.variable_scope("lstm"):
    lstm_cell = tf.contrib.rnn.BasicLSTMCell(hidden_layer_size, forget_bias=1.0)
    outputs, states = tf.nn.dynamic_rnn(
        lstm_cell, embed, sequence_length=_seqlens, dtype=tf.float32
    )

weights = {
    "linear_layer": tf.Variable(
        tf.truncated_normal([hidden_layer_size, num_classes], mean=0, stddev=0.01)
    )
}
biases = {
    "linear_layer": tf.Variable(tf.truncated_normal([num_classes], mean=0, stddev=0.01))
}

# states is an LSTMStateTuple (c, h); states[1] is h, the last valid output
# of each sequence, which is passed through a linear layer
final_output = tf.matmul(states[1], weights["linear_layer"]) + biases["linear_layer"]
softmax = tf.nn.softmax_cross_entropy_with_logits(logits=final_output, labels=_labels)
cross_entropy = tf.reduce_mean(softmax)



> Remarks - dynamic_rnn() also returns the final state, called states here. From it we can retrieve the last valid output vector (states[1]) and pass it through a linear layer (and the softmax function), using it as our final prediction.
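Why the final state is used instead of the last time step of the padded outputs can be seen in a NumPy sketch. The `outputs` array is a random stand-in, zeroed beyond each sequence's true length to mimic the zero-padded tensor described earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, max_len, hidden = 3, 6, 2
seqlens = np.array([4, 3, 6])

# stand-in for the per-step outputs, zeroed beyond each true length
outputs = rng.normal(size=(batch, max_len, hidden))
outputs *= (np.arange(max_len)[None, :] < seqlens[:, None])[:, :, None]

# naive choice: the final time step is all zeros for the shorter sentences
naive_last = outputs[:, -1, :]

# last valid output: index each row at its own (length - 1)
last_valid = outputs[np.arange(batch), seqlens - 1]
```

Only the full-length sentence has a meaningful value at the final time step, while `last_valid` picks a real output for every sentence.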

## Training embeddings and the LSTM classifier

We now combine all of the pieces to train the word vectors and the classification model end to end.

<a id = 'Training-embeddings-and-the-LSTM-classifier'></a>

In [22]:
# classification network utilizing word embeddings and LSTM
train_step = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(_labels, 1), tf.argmax(final_output, 1))
accuracy = (tf.reduce_mean(tf.cast(correct_prediction, tf.float32))) * 100

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        x_batch, y_batch, seqlen_batch = get_sentence_batch(
            batch_size, train_x, train_y, train_seqlens
        )
        sess.run(
            train_step,
            feed_dict={_inputs: x_batch, _labels: y_batch, _seqlens: seqlen_batch},
        )

        if step % 100 == 0:
            acc = sess.run(
                accuracy,
                feed_dict={_inputs: x_batch, _labels: y_batch, _seqlens: seqlen_batch},
            )
            print("Accuracy at {:d}: {:.5f}".format(step, acc))

    for test_batch in range(5):
        x_test, y_test, seqlen_test = get_sentence_batch(
            batch_size, test_x, test_y, test_seqlens
        )
        batch_pred, batch_acc = sess.run(
            [tf.argmax(final_output, 1), accuracy],
            feed_dict={_inputs: x_test, _labels: y_test, _seqlens: seqlen_test},
        )
        print("Test batch accuracy {:d}: {:.5f}".format(test_batch, batch_acc))

    output_example = sess.run(
        [outputs], feed_dict={_inputs: x_test, _labels: y_test, _seqlens: seqlen_test}
    )
    states_example = sess.run(
        [states[1]], feed_dict={_inputs: x_test, _labels: y_test, _seqlens: seqlen_test}
    )

Accuracy at 0: 49.21875
Accuracy at 100: 100.00000
Accuracy at 200: 100.00000
Accuracy at 300: 100.00000
Accuracy at 400: 100.00000
Accuracy at 500: 100.00000
Accuracy at 600: 100.00000
Accuracy at 700: 100.00000
Accuracy at 800: 100.00000
Accuracy at 900: 100.00000
Test batch accuracy 0: 100.00000
Test batch accuracy 1: 100.00000
Test batch accuracy 2: 100.00000
Test batch accuracy 3: 100.00000
Test batch accuracy 4: 100.00000
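
The accuracy reported above compares the index of the largest logit with the index of the 1 in each one-hot label. A minimal NumPy sketch of the same calculation follows. The helper name `batch_accuracy` and the sample values are made up for illustration.

```python
import numpy as np


def batch_accuracy(logits, one_hot_labels):
    """Percentage of rows whose largest logit matches the one-hot label."""
    predicted = np.argmax(logits, axis=1)
    actual = np.argmax(one_hot_labels, axis=1)
    return float(np.mean(predicted == actual) * 100)


logits = np.array([[2.0, -1.0], [0.3, 0.9], [1.5, 0.2], [-0.4, 0.1]])
labels = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])

acc = batch_accuracy(logits, labels)  # three of four rows match, so 75.0
```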


In [23]:
# sequence length stored at index 1 of seqlens
seqlens[1]

4

In [24]:
# outputs for the second sample of the last test batch: (time steps, hidden size)
output_example[0][1].shape

(6, 32)

In [25]:
# first three hidden units at each of the six time steps
output_example[0][1][:6, 0:3]

array([[-0.19556282, -0.28955486,  0.04457661],
       [-0.457487  , -0.61215574,  0.3644689 ],
       [-0.48734728, -0.69737726,  0.32778406],
       [-0.53218424, -0.7419906 ,  0.34765798],
       [-0.6829988 , -0.78629947,  0.56874573],
       [-0.3752168 , -0.8411078 ,  0.6790283 ]], dtype=float32)

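> Remarks - dynamic_rnn() returns zero vectors for the time steps that fall beyond a sample's true length. All six rows shown here are non-zero, so this test sample appears to span the full six time steps. Note that seqlens[1] above is not necessarily the length of this sample. The length that corresponds to output_example[0][1] is seqlen_test[1].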

In [26]:
# first three units of the final hidden state (states[1]) for the same sample
states_example[0][1][0:3]

array([-0.3752168, -0.8411078,  0.6790283], dtype=float32)

> Remarks - The final state returned by dynamic_rnn() stores the last relevant output vector in states[1]. Note that these values match the last relevant row of the outputs above, i.e. the last row before any zero-padding.
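
The same vector can also be recovered directly from the padded outputs by indexing each sample at its own last valid time step. The NumPy sketch below illustrates this. The helper name `last_relevant` and the toy values are assumptions made for the example.

```python
import numpy as np


def last_relevant(outputs, seqlens):
    """Pick outputs[i, seqlens[i] - 1] for every sample i in the batch."""
    batch_idx = np.arange(outputs.shape[0])
    return outputs[batch_idx, seqlens - 1]


# toy padded outputs: 2 samples, 6 time steps, 3 hidden units
outputs = np.arange(2 * 6 * 3, dtype=float).reshape(2, 6, 3) + 1.0
seqlens = np.array([4, 6])
outputs[0, 4:] = 0.0  # zero-padding beyond the first sample's true length

last = last_relevant(outputs, seqlens)
naive = outputs[:, -1]  # simply taking the final time step
```

Taking `outputs[:, -1]` would return a zero vector for the padded sample, which is why the final state (or an index based on the sequence length) is used.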

## Stacking multiple LSTMs

The LSTM example above uses a single-layer LSTM network. Adding more layers is straightforward with the MultiRNNCell() wrapper, which combines multiple RNN cells into one stacked cell.

We first define an LSTM cell as before, then feed it into the wrapper. The following network has two layers of LSTM.

<a id = 'Stacking-multiple-LSTMs'></a>

In [27]:
# multi LSTM example
num_LSTM_layers = 2
with tf.variable_scope("lstm"):
    lstm_cell_list = [
        tf.contrib.rnn.BasicLSTMCell(hidden_layer_size, forget_bias=1.0)
        for ii in range(num_LSTM_layers)
    ]
    cell = tf.contrib.rnn.MultiRNNCell(cells=lstm_cell_list, state_is_tuple=True)
    outputs, states = tf.nn.dynamic_rnn(
        cell, embed, sequence_length=_seqlens, dtype=tf.float32
    )

W0720 15:19:20.105473 140580983120896 deprecation.py:323] From <ipython-input-27-b1b3d5952d03>:8: MultiRNNCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.StackedRNNCells, and will be replaced by that in Tensorflow 2.0.


With MultiRNNCell(), states becomes a tuple holding one (c, h) state per layer, which changes how we access the final state. To get the final hidden state of the second layer, the indexing needs to be adjusted:

In [28]:
# extract the final state and use in a linear layer
final_output = (
    tf.matmul(states[num_LSTM_layers - 1][1], weights["linear_layer"])
    + biases["linear_layer"]
)
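
To make the nested state structure concrete, here is a NumPy sketch of a stacked LSTM that returns one (c, h) pair per layer. It is an illustration only. `LSTMState`, `lstm_step`, and `stacked_lstm` are hypothetical names, the weights are random, the gate layout is one common convention, and sequence lengths are ignored for brevity.

```python
from collections import namedtuple

import numpy as np

# hypothetical stand-in for the (c, h) state pair of an LSTM layer
LSTMState = namedtuple("LSTMState", ["c", "h"])


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, state, kernel, bias, forget_bias=1.0):
    """One LSTM step; kernel maps [x, h] to the four gate pre-activations."""
    gates = np.concatenate([x, state.h], axis=1) @ kernel + bias
    i, j, f, o = np.split(gates, 4, axis=1)
    c = sigmoid(f + forget_bias) * state.c + sigmoid(i) * np.tanh(j)
    h = sigmoid(o) * np.tanh(c)
    return h, LSTMState(c, h)


def stacked_lstm(inputs, kernels, biases, hidden):
    """Feed each layer's output into the next; return the per-layer final states."""
    batch, time, _ = inputs.shape
    zeros = np.zeros((batch, hidden))
    states = [LSTMState(zeros, zeros) for _ in kernels]
    for t in range(time):
        layer_input = inputs[:, t]
        for k, (kernel, bias) in enumerate(zip(kernels, biases)):
            layer_input, states[k] = lstm_step(layer_input, states[k], kernel, bias)
    # layer_input now holds the top layer's output at the last time step
    return layer_input, tuple(states)


rng = np.random.default_rng(0)
hidden, input_dim, num_layers = 5, 4, 2
inputs = rng.normal(size=(3, 6, input_dim))
in_dims = [input_dim] + [hidden] * (num_layers - 1)
kernels = [rng.normal(scale=0.1, size=(d + hidden, 4 * hidden)) for d in in_dims]
biases = [np.zeros(4 * hidden) for _ in in_dims]

top_output, states = stacked_lstm(inputs, kernels, biases, hidden)
top_h = states[num_layers - 1][1]  # same indexing pattern as the cell above
```

Here `states` has one entry per layer, and `states[num_layers - 1][1]` is the hidden state of the top layer, which equals the last output the stack produced.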