__Chapter 16 - Modeling Sequential Data Using Recurrent Neural Networks__

1. [Import](#Import)
1. [Introducing sequential data](#Introducing-sequential-data)
    1. [Modeling sequential data – order matters](#Modeling-sequential-data–order-matters)
    1. [Representing sequences](#Representing-sequences)
    1. [The different categories of sequence modeling](#The-different-categories-of-sequence-modeling)
1. [RNNs for modeling sequences](#RNNs-for-modeling-sequences)
    1. [Understanding the structure and flow of an RNN](#Understanding-the-structure-and-flow-of-an-RNN)
    1. [Computing activations in an RNN](#Computing-activations-in-an-RNN)
    1. [The challenges of learning long-range interactions](#The-challenges-of-learning-long-range-interactions)
1. [Implementing a multilayer RNN for sequence modeling in TensorFlow](#Implementing-a-multilayer-RNN-for-sequence-modeling-in-TensorFlow)
    1. [Project one - performing sentiment analysis of IMDb movie reviews using multilayer RNNs](#performing-sentiment-analysis-of-IMDb-movie-reviews-using-multilayer-RNNs)
        1. [Preparing the data](#Preparing-the-data)
        1. [Embedding](#Embedding)
        1. [Building an RNN model](#Building-an-RNN-model)
            1. [The build method](#The-build-method)
            1. [The train method](#The-train-method)
            1. [The predict method](#The-predict-method)            
        1. [Instantiating the SentimentRNN class](#Instantiating-the-SentimentRNN-class)
        1. [Training and optimizing the sentiment analysis RNN model](#Training-and-optimizing-the-sentiment-analysis-RNN-model)
    1. [Project two – implementing an RNN for character-level language modeling in TensorFlow](#Project-two–implementing-an-RNN-for-character-level-language-modeling-in-TensorFlow)
        1. [Preparing the data](#Preparing-the-data2)
        1. [Building a character-level RNN model](#Building-a-character-level-RNN-model)
            1. [The constructor](#The-constructor)
            1. [The build method](#The-build-method2)
            1. [The train method](#The-train-method2)
            1. [The sample method](#The-sample-method)
        1. [Creating and training the CharRNN Model](#Creating-and-training-the-CharRNN-Model)
        1. [The CharRNN model in the sampling mode](#The-CharRNN-model-in-the-sampling-mode)

# Import

<a id = 'Import'></a>

In [1]:
# standard libary and settings
import os
import sys
import importlib
import itertools
from io import StringIO
import warnings

warnings.simplefilter("ignore")
from IPython.core.display import display, HTML

display(HTML("<style>.container { width:95% !important; }</style>"))

# data extensions and settings
import numpy as np

np.set_printoptions(threshold=np.inf, suppress=True)
import pandas as pd

pd.set_option("display.max_rows", 500)
pd.options.display.float_format = "{:,.6f}".format

# modeling extensions
import sklearn.base as base
import sklearn.cluster as cluster
import sklearn.datasets as datasets
import sklearn.decomposition as decomposition
import sklearn.ensemble as ensemble
import sklearn.feature_extraction as feature_extraction
import sklearn.feature_selection as feature_selection
import sklearn.linear_model as linear_model
import sklearn.metrics as metrics
import sklearn.model_selection as model_selection
import sklearn.neighbors as neighbors
import sklearn.pipeline as pipeline
import sklearn.preprocessing as preprocessing
import sklearn.svm as svm
import sklearn.tree as tree
import sklearn.discriminant_analysis as discriminant_analysis
import sklearn.utils as utils

# visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt

# custom extensions and settings
sys.path.append("/home/mlmachine") if "/home/mlmachine" not in sys.path else None
sys.path.append("/home/prettierplot") if "/home/prettierplot" not in sys.path else None

import mlmachine as mlm
from prettierplot.plotter import PrettierPlot
import prettierplot.style as style

# magic functions
%matplotlib inline

# Introducing sequential data

This chapter explores the unique properties of sequences compared to other kinds of data. We will explore how we can represent sequential data and the various models for analyzing sequential data.

<a id = 'Introducing-sequential-data'></a>

## Modeling sequential data – order matters

One major unique aspect of sequential data is that the elements appear in a certain order and are not independent of each other. The contrasts with data and algorithms that we have dealth with up to this point, in that previous models assume that the data is independent and identically distributed (IID). But with sequential data, by definition, order matters. This is not necessarily a problem, and in fact, the order can yield meaningful information. We just need a different approach and different tools.

<a id = 'Modeling-sequential-data–order-matters'></a>

## Representing sequences

In this chaper, sequences will be represented as $\big(x^1, x^2,...,x^T\big)$, where the superscript indices indicate the order of the instances, and the length of the sequence is $T$. For example, in time-series data, sample $\textbf{x}^T$ belongs to a particular time $t$. Further, if the data is labeled, the labels also follow a form where order matters: $\big(y^1, y^2,...,y^T\big)$.

The MLP and CNN models built in the last few chapters are not capable of handling the order of the input simples. Recurrent neural networks (RNNs) are designed to model sequences that remember past information and process new events in light of that history.

<a id = 'Representing sequences'></a>

## The different categories of sequence modeling

Sequence modeling can be applied to, among other things, language translateion, image cpationing, and text generation. Sequential data comes in many forms, and the nature of the input and output data determines the type. If neither the input nor the output data is sequenced, then this is simply a standard dataset, any of the methods covered in previous chapter may be used (depending on the problem).

If either the input or output data is sequenced, then it can be identified by one of these three categories:

- Many-to-one: The input data is sequenced, but the output is a vector of a fixed size, not a sequence. For example, sentiment analysis takes text data as an input and outputs a class label.
- One-to-many: The input data is in a standard format (not sequenced) but the output is a seuqnce. For example, in image captioning, the input is an image and the ouput is an English phrase.
- Many-to-many: Both the input and output arrays are sequences. This category can be sub-divided into subcategories based on whether the input or output is synchronized or not. An example of synchronized many-to-many is video classification, where each from in a video is labeled. An example of delayed many-to-many is language translation, i.e. an English sentence is translated by a machine into its equivalent in German

<a id = 'The-different-categories-of-sequence-modeling'></a>

# RNNs for modeling sequences

This sections describes the foundations of RNNs, including typical structure, dat flow, neuron activation, and typical challenges.

<a id = 'RNNs-for-modeling-sequences'></a>

## Understanding the structure and flow of an RNN

In a feedforward network, information flows from the input layer, to the hidden layer(s), then to the output layer. In an RNN, the hidden layer gets its input from both the input layer and the hidden layer from the previous time step. This flow of information in adjacent time steps in the hidden layer allows the network to use its 'memory of past events'. This can be envisioned as a loop, which in graph notation is referred to as a recurrent edge. This can be visualized as:

$$
\textbf{x}^t \rightarrow \textbf{h}^t \rightarrow \textbf{y}^t 
$$

where $\textbf{x}^t$ is the input data at the $t$ point in the sequence, $\textbf{h}^t$ is the hidden layer at point $t$, and $\textbf{y}^t$ at point $t$. This can be unfolded to reveal how other data points observed at adjacent time steps are structured relative to point $t$:

$$
\textbf{x}^{t-1} \rightarrow \textbf{h}^{t-1} \rightarrow \textbf{y}^{t-1}
\\
\downarrow
\\
\textbf{x}^t \rightarrow \textbf{h}^t \rightarrow \textbf{y}^t 
\\
\downarrow
\\
    \textbf{x}^{t+1} \rightarrow \textbf{h}^{t+1} \rightarrow \textbf{y}^{t+1}
$$

RNNs can have multiple hidden layers as well.

In a standard neural network, each hidden unit only receives one input - the net input associated with the input layer. RNNs, conversely, neurons in the hidden layer receive two distinct inputs - the net input from the input layer and the net input of the same hidden layer neuron from the previous time step $t-1$. At $t=0$, the first time step, the hidden units are initialized to zeros, or small random numbers. Then for $t>0$, the hidden units get input from the data point at the current time $\textbf{x}^t$ and the previous values of the hidden units at $t-1$, $\textbf{h}^{t-1}$

<a id = 'Understanding-the-structure-and-flow-of-an-RNN'></a>

## Computing activations in an RNN

Each directed edge (connection between boxes) of an RNN is associated with a weight matrix, and these weights do not depend on time $t$. These weights are shared across the time axis. The different weight matrices in a single layer RNN are:

$$
\textbf{W}_{xh}: \mbox{the weight matrix between the input} \ \textbf{x}^t \mbox{and the hidden layer } \ \textbf{h}
\\
\textbf{W}_{hh}: \mbox{the weight matrix associated with the recurrent edge}
\\
\textbf{W}_{hy}: \mbox{the weight matrix between the hidden layer and the output layer}
$$

Again, these weight matrices apply to the current point in the sequence $t$, as well as to $t-1$ and $t+1$.

The activations are computed similar to how this is handled in feed forward networks. For example, in the hidden layer, the net input $\textbf{z}_h$ is computed through a linear combination determined by summing the multiplications of the weight matrices with the corresponding vectors, and adding the bias unit:

$$
\textbf{z}_h^t = \textbf{W}_{xh}\textbf{x}^t + \textbf{W}_{hh}\textbf{h}^{t-1} + \textbf{b}_h
$$

Then the activations of the hidden units at the time step $t$ are calculated using:

$$
\textbf{h}^t = \phi_h\big(\textbf{z}_h^t\big) = \phi_h\big(\textbf{W}_{xh}\textbf{x}^t + \textbf{W}_{hh}\textbf{h}^{t-1} + \textbf{b}_h\big)
$$

where $\phi_h(\cdot)$ is the activation function.

Once the activations of the hidden units at the current time step are calculated, the activations of the output units are calculated by:
$$
\textbf{y}^t = \phi_y\big(\textbf{W}_{hy}\textbf{h}^t + \textbf{b}_y\big)
$$


<a id = 'Computing-activations-in-an-RNN'></a>

## The challenges of learning long-range interactions

Backpropagation through time (BPTT) is the process for optimizing the weights in an RNN. The basic idea is that the overall loss $L$ is the sum of all loss functions calculated at times $t$ = 1 to $t$ = $T$. 

$$
L = \sum^T_{t=1}L^t
$$

The loss at time 1:$t$ is dependent on the hidden units at all time steps that were evaluated before 1:$t$, so the gradient is calculated as follows:

$$
\frac{\partial L^t}{\partial\textbf{W}_{hh}} = \frac{\partial L^t}{\partial\textbf{y}^{t}} \times \frac{\partial \textbf{y}^t}{\partial\textbf{h}^{t}} \times \Bigg(\sum^t_{k=1}\frac{\partial \textbf{h}^t}{\partial\textbf{h}^{k}} \times \frac{\partial \textbf{h}^k}{\partial\textbf{h}_{hh}}\Bigg)
$$

In this formula $\frac{\partial \textbf{h}^t}{\partial\textbf{h}^{k}}$ is computed as multiplication of adjacent time steps:

$$
\frac{\partial \textbf{h}^t}{\partial\textbf{h}^{k}} = \prod^t_{i=k+1}\frac{\partial \textbf{h}^i}{\partial\textbf{h}^{i-1}}
$$

Calculation of the term $\frac{\partial \textbf{h}^t}{\partial\textbf{h}^{k}}$ introduces a few challenges. Namely, the so-called vanishing/exploding gradient. This term has $t-k$ multiplications, so multiply the $w$ weight a total of $t - k$ times results in a factor $w^{t-k}$. As a result, if $\lvert w\rvert$ < 1, this factor becomes very small when $t-k$ is large. On the other hand, if $\lvert w\rvert$ > 1, then $w^{t-k}$ becomes very large when $t-k$ is large. This means that we prefer $w$ to be equal to 1.

There are two solutions to this problem:

- Truncated backpropagation through time (TBPTT): clips the gradients above a given threshold. This solves exploding gradient issues, but the truncation limits the number of steps the gradient can effectively flow back and update weights properly
- Long short-term memory (LSTM): introduced to overcome the vanishing gradient problem. More successful in modeling long-range sequences than TBPTT. 


<a id = 'The-challenges-of-learning-long-range-interactions'></a>

# Implementing a multilayer RNN for sequence modeling in TensorFlow

The rest of this notebook will explore RNN implementations to address two common tasks, sentiment analysis and language models.

<a id = 'Implementing-a-multilayer-RNN-for-sequence-modeling-in-TensorFlow'></a>

## Project one – performing sentiment analysis of IMDb movie reviews using multilayer RNNs

In chapter 8, we implemented a model to determine the sentiment of movie reiews on IMDb. This project will leverage an RNN model to do the same task. This is an example of a many-to-one problem, where we are given a document of text and need to return a single label

<a id = 'performing-sentiment-analysis-of-IMDb-movie-reviews-using-multilayer-RNNs'></a>

### Preparing the data

This dataset contains two columns, one with the movie reviews, and another with the sentiment label of 0 or 1. The text component of these movie reviews are sequences of words, so we want to build an RNN to process the words in sequence and then classify the entire sequence to the 0 or 1 class.

To make this dataset ready for the neural network, it needs to be encoded into numeric values. First, we need to find the unique words in the entire dataset. This is not the same as preparing a bag-of-words model, as we are only interested in the set of unique words, and we don't need the counts necessarily. Second we create a mapping by way of a dictionary where we pair each unique word with a unique integer number. This will convert the entire text into a list of numbers.

<a id = 'Preparing-the-data'></a>

In [3]:
# load ImdbReviews dataset
import pyprind
from string import punctuation
import re

df = pd.read_csv("s3://tdp-ml-datasets/misc/ImdbReviews.csv", encoding="utf-8")

In [4]:
# review sampels
df[:5]

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [5]:
# deparate the words and count each words occurrence
from collections import Counter

counts = Counter()
pbar = pyprind.ProgBar(len(df["review"]), title="Counting words occurrences")

for i, review in enumerate(df["review"]):
    text = "".join(
        [c if c not in punctuation else " " + c + " " for c in review]
    ).lower()
    df.loc[i, "review"] = text
    pbar.update()
    counts.update(text.split())

# create a mapping of each unique word to an integer
word_counts = sorted(counts, key=counts.get, reverse=True)
print(word_counts[:5])
word_to_int = {word: ii for ii, word in enumerate(word_counts, 1)}

mapped_reviews = []
pbar = pyprind.ProgBar(len(df["review"]), title="Map review to ints")

for review in df["review"]:
    mapped_reviews.append([word_to_int[word] for word in review.split()])
    pbar.update()

Counting words occurrences
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:08:58
Map review to ints


['the', '.', ',', 'and', 'a']


0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:06


This process effectively conerted sequences of words into sequences of integers, but these sequences have different lengths. For this dataset to be ready for an RNN, the sequences need to hae the same length. To accomplish this, we define a paramter called sequence_length that will be set to 200. Sequences that hae fewer than 200 words will be left-padded with zeros, while sequences longer than 200 words will be trimmed so that only the last 200 values will be used. This preprocessing step is implemented in two steps:

1. Create a matrix of zeros, where each row corresponds to a sequence of size 200
2. Fill the index of words in each sequence for the right-hand side of the matrix. If a sequence has a length of only 150, then the first 50 elements of the row would stay zero. It's worth noting that the value chosen for sequence_length is a hyperparameter than can be tuned.

In [6]:
# create matrix of same-length sequences
sequence_length = 200

sequences = np.zeros((len(mapped_reviews), sequence_length), dtype=int)
for i, row in enumerate(mapped_reviews):
    review_arr = np.array(row)
    sequences[i, -len(row) :] = review_arr[-sequence_length:]

In [7]:
# create training and test sets
# note - the reviews were shuffled prior to being saved in a .csv

X_train = sequences[:25000, :]
y_train = df.loc[:25000, "sentiment"].values
X_test = sequences[25000:, :]
y_test = df.loc[25000:, "sentiment"].values

In [8]:
# generator helper function for mini-batching
def create_batch_generator(x, y=None, batch_size=64):
    n_batches = len(x) // batch_size
    x = x[: n_batches * batch_size]
    if y is not None:
        y = y[: n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        if y is not None:
            yield x[ii : ii + batch_size], y[ii : ii + batch_size]
        else:
            yield x[ii : ii + batch_size]

### Embedding

The data prep stage above created same-length sequences where the elements are integers that correspond to the indices of unique words. Now we need to convert this to input features. The wrong way to do it would be to apply on-ehot encoding to convert indices into ectors of zeros and ones. Each word would be mapped to a vector with a size equal to the number of unique words in the dataset. This is less than ideal, because the number of unique words can rise to the tens of thousands. A model trained on features like this may suffer from the curse of dimensionality. Further, these features would be very sparse, since all values are zero except one.

A better approach would be to map each word to a vector of fixed size with real-valued elements (not integers necessarily). With this approach, we can instead use finite-sized vector to represent and infinite number of real number. This is the idea behind the embedding, which is a feature-learning technique that can be utilized to automatically learn the salient features in the data. Given the value of a parameter unique_words, we can choose the size of the embedding vectors to be much smaller than the number of unique words in the corpus. The advantages of emebedding over one-hot encoding for these problems are:

1. A reduction in dimensionality decreases the effect of the curse of dimensionality
2. The extraction of salient features since the embedding layer in a neural network is trainable

To create an embedding layer, we feed in tf_x as the input layer, which is comprised of vocabulary indices. We create a matrix of size $[n\_words \times embedding\_size]$ as a tensor variable with randomly initialized values between [-1,1]. then we use tf.nn.embedding_lookup to loo up the row in the embedding matrix associated with each element of tf_x:

<a id = 'Embedding'></a>

### Building an RNN model

The SentimentRNN class that we will create has the following methods:

- A constructor to set the model parameters and create a computation graph.
- A build method that declares three placeholder for input data, input labels, and the kee-probability for the dropout process in the hidden layer. It also creates an embedding layer and creates embedded representations as input.
- A train method that creates a session that launches a graph, iterates through mini-batches of data, run for a # of epochs, minimizing the cost along the way before saing the model
- A predict method that creates a new session with the model as of the latest checkpoint saved at the end of the training process, and carries out the predictions on the test data.

<a id = 'Building-an-RNN-model'></a>

In [28]:
# RNN custom class
import tensorflow as tf


class SentimentRNN:
    def __init__(
        self,
        n_words,
        seq_len=200,
        lstm_size=256,
        num_layers=1,
        batch_size=64,
        learning_rate=0.0001,
        embed_size=200,
    ):
        """
        n_words - must be set equal to the number of unique words (+1, since we use 
                    zero to fill sequences with a size less than 200) and it's used
                    to create the embedding layer, along with embed_size
        embed_size - used with n_words to create the embedding layer
        seq_len- must be set according to the length of the sequences that were created
                    in the preprocessing steps above
        lstm_size - a hyperparameter that determines the number of hidden units in each RNN layer
        """
        self.n_words = n_words
        self.seq_len = seq_len
        self.lstm_size = lstm_size  # number of hidden units
        self.num_layers = num_layers
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.embed_size = embed_size

        self.g = tf.Graph()
        with self.g.as_default():
            tf.set_random_seed(123)
            self.build()
            self.saver = tf.train.Saver()
            self.init_op = tf.global_variables_initializer()

    def build(self):
        tf_x = tf.placeholder(
            tf.int32, shape=(self.batch_size, self.seq_len), name="tf_x"
        )
        tf_y = tf.placeholder(tf.float32, shape=(self.batch_size), name="tf_y")
        tf_keepprob = tf.placeholder(tf.float32, name="tf_keepprob")

        # create embedding layer
        embedding = tf.Variable(
            tf.random_uniform((self.n_words, self.embed_size), minval=-1, maxval=1),
            name="embedding",
        )
        embed_x = tf.nn.embedding_lookup(embedding, tf_x, name="embeded_x")

        # define LSTM cell and stack together
        cells = tf.contrib.rnn.MultiRNNCell(
            [
                tf.contrib.rnn.DropoutWrapper(
                    tf.contrib.rnn.BasicLSTMCell(self.lstm_size),
                    output_keep_prob=tf_keepprob,
                )
                for i in range(self.num_layers)
            ]
        )

        # define the initial state
        self.initial_state = cells.zero_state(self.batch_size, tf.float32)
        print("  << initial state >>  ", self.initial_state)

        lstm_outputs, self.final_state = tf.nn.dynamic_rnn(
            cells, embed_x, initial_state=self.initial_state
        )

        # lstm output shape = [batch_size x max_time x cells.output_size]
        print("\n  << lstm_output >>", lstm_outputs)
        print("\n  << final state >>", self.final_state)

        logits = tf.layers.dense(
            inputs=lstm_outputs[:, -1], units=1, activation=None, name="logits"
        )
        logits = tf.squeeze(logits, name="logits_squeezed")
        print("\n  << logits    >>", logits)

        y_proba = tf.nn.sigmoid(logits, name="probabilities")
        predictions = {
            "probabilities": y_proba,
            "labels": tf.cast(tf.round(y_proba), tf.int32, name="labels"),
        }
        print("\n  << predictions >>", predictions)

        # define cost function
        cost = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=tf_y, logits=logits),
            name="cost",
        )

        # define optimizer
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        train_op = optimizer.minimize(cost, name="train_op")

    def train(self, X_train, y_train, num_epochs):
        with tf.Session(graph=self.g) as sess:
            sess.run(self.init_op)
            iteration = 1
            for epoch in range(num_epochs):
                state = sess.run(self.initial_state)

                for batch_x, batch_y in create_batch_generator(
                    X_train, y_train, self.batch_size
                ):
                    feed = {
                        "tf_x:0": batch_x,
                        "tf_y:0": batch_y,
                        "tf_keepprob:0": 0.5,
                        self.initial_state: state,
                    }
                    loss, _, state = sess.run(
                        ["cost:0", "train_op", self.final_state], feed_dict=feed
                    )

                    if iteration % 20 == 0:
                        print(
                            "Epoch {}/{} Iteration {} | Train loss: {:.5f}".format(
                                epoch + 1, num_epochs, iteration, loss
                            )
                        )
                    iteration += 1
                if (epoch + 1) % 1 == 0:
                    path = "./ch16_files/model"
                    if not os.path.isdir(path):
                        os.makedirs(path)
                    self.saver.save(
                        sess, "./ch16_files/model/sentiment-{}.ckpt".format(epoch)
                    )

    def predict(self, X_data, return_proba=False):
        preds = []
        with tf.Session(graph=self.g) as sess:
            self.saver.restore(sess, tf.train.latest_checkpoint("./ch16_files/model/"))
            test_state = sess.run(self.initial_state)
            for ii, batch_x in enumerate(
                create_batch_generator(X_data, None, batch_size=self.batch_size), 1
            ):
                feed = {
                    "tf_x:0": batch_x,
                    "tf_keepprob:0": 1.0,
                    self.initial_state: test_state,
                }
                if return_proba:
                    pred, test_state = sess.run(
                        ["probabilities:0", self.final_state], feed_dict=feed
                    )
                else:
                    pred, test_state = sess.run(
                        ["labels:0", self.final_state], feed_dict=feed
                    )
                preds.append(pred)
        return np.concatenate(preds)

#### The build method

<a id = 'The-build-method'></a>

In the build method, we create three placeholder for the input, output, and dropout keep-probability. Then we add the embedding layer, which builds the embedded representation of the unique words. Next within the build method, we built the RNN network. This was done in three steps.

1. Define multilayer RNN cells
2. Define initial state of these cells
3. Create and RNN specified by the RNN cells in their initial states

These three steps are unpacked in further detail below:

__Step 1: Define multilayer RNN cells__

The first step is to define the multilayer RNN cells, which was accomplished using a TensorFlow wrapper ckass to define the LSTM cells - BasicLSTMCell. These can be stacked together to form a multilayer RNN using the MultiRNNCell wrapper class. The process of stacking RNN cells with a dropout stage has three nested steps. Described from the inside out:

1. Create RNN cells using tf.contrib.rnn.BasicLSTMCell
2. Apply dropout to the RNN cells using tf.contrib.rnn.DropoutWrapper
3. Make a list of such cells according to the desired number of RNN layer and pass this list to tf.contrib.rnn.MultiRNNCell

This process is completed using a list comprehension in the implementation above.

__Step 2: defining the initiatl states for the RNN cells__

In the architecture of LSTM cells, there are three types of inputs - input data $\textbf{x}^t$, activations of hidden units from the previous time step $\textbf{x}^{t-1}$, and the cell state of the previous time step $\textbf{C}^{t-1}$.

In the above implementation $\textbf{x}^t$ is the embedded embed_x data tensor. We also need to specify the previous state of the cells. If we're starting a new input sequence, we initialize the cell state to a zero state, then for each seubsequent time step we need to store the updated state of the cells to use in the following time step. The initial state in the implementation above is set by calling cells.zero_state

__Step 3: Creating the RNN using the RNN cells and their states__

The third step of the RNN creation process used the tf.nn.dynamic_rnn function to pull all of the components together. This function pulls the embedded data, the RNN cells and their initial states, and creates a pipeline for them according to the unrolled architecture of the LSTM cells. It returns a tuple containing the activations of the RNN cells called outputs, as well as their final states in a variable called state. The output is a 3D tensor with the shape [batch size \times num_steps \times lstm_size]. We pass the variables outputs to a fully connected layer to get logits and then store the final state so that we can use this as the initial state for the next mini-batch of data. Lastly, once the components of the RNN are setup, the cost function and optimization method is defined in a fashion similar to other networks that we have implemented


#### The train method

The train method is very similar to other train functions implemented in previous chapter, except that there is an additional tensor called state that we need to feed into our network.

In our implementation, at the beginning of each epoch we start from the zero states of the RNN cells as the current state. The process of running each mini-batch of data is performed by feeding the current state with the batch_x data and the corresponding labels in batch_y. After finishing the process for a mini-batch, we update the state to be the final state, which is returned by the tf.nn.dynamic_rnn function. This updated state will be used in the execution of the next mini-batch. This process is repeated for each mini-batch, and the current state is updated through the epoch.

<a id = 'The train method'></a>

#### The predict method

The predict method is also setup to keep track of the current state, similar to train method.

<a id = 'The-predict-method'></a>

### Instantiating the SentimentRNN class

<a id = 'Instantiating-the-SentimentRNN-class'></a>

In [29]:
# run SentimentRNN
n_words = max(list(word_to_int.values())) + 1
rnn = SentimentRNN(
    n_words=n_words,
    seq_len=sequence_length,
    embed_size=256,
    lstm_size=128,
    num_layers=1,
    batch_size=100,
    learning_rate=0.001,
)

  << initial state >>   (LSTMStateTuple(c=<tf.Tensor 'MultiRNNCellZeroState/DropoutWrapperZeroState/BasicLSTMCellZeroState/zeros:0' shape=(100, 128) dtype=float32>, h=<tf.Tensor 'MultiRNNCellZeroState/DropoutWrapperZeroState/BasicLSTMCellZeroState/zeros_1:0' shape=(100, 128) dtype=float32>),)

  << lstm_output >> Tensor("rnn/transpose_1:0", shape=(100, 200, 128), dtype=float32)

  << final state >> (LSTMStateTuple(c=<tf.Tensor 'rnn/while/Exit_3:0' shape=(100, 128) dtype=float32>, h=<tf.Tensor 'rnn/while/Exit_4:0' shape=(100, 128) dtype=float32>),)

  << logits    >> Tensor("logits_squeezed:0", shape=(100,), dtype=float32)

  << predictions >> {'probabilities': <tf.Tensor 'probabilities:0' shape=(100,) dtype=float32>, 'labels': <tf.Tensor 'labels:0' shape=(100,) dtype=int32>}


> Remarks - num_layers = 1 creates a single RNN layer, but we could set this higher to create a multilayer RNN. Given that we have a relatively small dataset, a multilayer model may tend to overfit the data, sp a single layer approach will likely generalize better to unseen data.

### Training and optimizing the sentiment analysis RNN model

Train the model for 40 epochs using X_train and the labels in y_train

<a id = 'Training-and-optimizing-the-sentiment-analysis-RNN-model'></a>

In [30]:
# display results by epoch/iteration
rnn.train(X_train, y_train, num_epochs=20)

Epoch 1/20 Iteration 20 | Train loss: 0.68492
Epoch 1/20 Iteration 40 | Train loss: 0.56067
Epoch 1/20 Iteration 60 | Train loss: 0.66468
Epoch 1/20 Iteration 80 | Train loss: 0.54809
Epoch 1/20 Iteration 100 | Train loss: 0.55510
Epoch 1/20 Iteration 120 | Train loss: 0.47294
Epoch 1/20 Iteration 140 | Train loss: 0.51574
Epoch 1/20 Iteration 160 | Train loss: 0.43877
Epoch 1/20 Iteration 180 | Train loss: 0.46225
Epoch 1/20 Iteration 200 | Train loss: 0.47922
Epoch 1/20 Iteration 220 | Train loss: 0.47019
Epoch 1/20 Iteration 240 | Train loss: 0.47956
Epoch 2/20 Iteration 260 | Train loss: 0.45847
Epoch 2/20 Iteration 280 | Train loss: 0.29903
Epoch 2/20 Iteration 300 | Train loss: 0.39771
Epoch 2/20 Iteration 320 | Train loss: 0.38088
Epoch 2/20 Iteration 340 | Train loss: 0.32945
Epoch 2/20 Iteration 360 | Train loss: 0.27774
Epoch 2/20 Iteration 380 | Train loss: 0.36071
Epoch 2/20 Iteration 400 | Train loss: 0.30949
Epoch 2/20 Iteration 420 | Train loss: 0.31434
Epoch 2/20 Iterat

Epoch 14/20 Iteration 3320 | Train loss: 0.01868
Epoch 14/20 Iteration 3340 | Train loss: 0.00616
Epoch 14/20 Iteration 3360 | Train loss: 0.00598
Epoch 14/20 Iteration 3380 | Train loss: 0.01195
Epoch 14/20 Iteration 3400 | Train loss: 0.00470
Epoch 14/20 Iteration 3420 | Train loss: 0.00747
Epoch 14/20 Iteration 3440 | Train loss: 0.00253
Epoch 14/20 Iteration 3460 | Train loss: 0.00123
Epoch 14/20 Iteration 3480 | Train loss: 0.00040
Epoch 14/20 Iteration 3500 | Train loss: 0.00175
Epoch 15/20 Iteration 3520 | Train loss: 0.00573
Epoch 15/20 Iteration 3540 | Train loss: 0.00096
Epoch 15/20 Iteration 3560 | Train loss: 0.03162
Epoch 15/20 Iteration 3580 | Train loss: 0.00238
Epoch 15/20 Iteration 3600 | Train loss: 0.00211
Epoch 15/20 Iteration 3620 | Train loss: 0.06452
Epoch 15/20 Iteration 3640 | Train loss: 0.01990
Epoch 15/20 Iteration 3660 | Train loss: 0.01134
Epoch 15/20 Iteration 3680 | Train loss: 0.00149
Epoch 15/20 Iteration 3700 | Train loss: 0.00680
Epoch 15/20 Iteratio

In [31]:
# create predictions and calculate accuracy
preds = rnn.predict(X_test)
yTrue = y_test[: len(preds)]
print("test accuracy: {:.3f}".format(np.sum(preds == yTrue) / len(yTrue)))

INFO:tensorflow:Restoring parameters from ./ch16_files/model/sentiment-19.ckpt
test accuracy: 0.838


> Remarks - This result is comparable to what was achieved in chapter 8, mainly due to the small size of the dataset. 

In [32]:
# calculate probabilities
proba = rnn.predict(X_test, return_proba=True)

INFO:tensorflow:Restoring parameters from ./ch16_files/model/sentiment-19.ckpt


In [33]:
# print subset of probabilities
proba[:5]

array([0.00000027, 0.99999976, 0.8358301 , 0.00001538, 0.00009781],
      dtype=float32)

This model can be optimized further by changing the hyperparameters, such as lstm_size, seq_len, and embed_size.

## Project two – implementing an RNN for character-level language modeling in TensorFlow

The input for this model will be a text document, and the goal is to develop a model that can generate new text that is similar to the input document. Examples of an input could be a book or a computer program written in a certain language.

This involves character-level language modeling, where the input is brokem down into a sequence of characters that are fed into the network one character at a time. The netwrok processes each new character in conjunction with its memory of the previously seem characters to predict the next character. A very simple example looks something like this:

$$
\mbox{Input data: "Hello!"}
\\
\mbox{Input sequence} \ \mbox{| Prediction}
\\
H \  \rightarrow \ e\\
e \  \rightarrow \ l\\
l \  \rightarrow \ l\\
l \  \rightarrow \ o\\
o \  \rightarrow \ !\\
! \  \rightarrow \ \mbox{end}\\
\\
$$

This implementation has a data prep stage, RNN build stage, and a prediction stage where the model predicts the next character and sampling to generate new text.

Just as the sentiment analysis RNN has a tendency to develop an exploding gradient problem that needs to be addressed, this model will also employ a gradient clipping technique to avoid this issue.

<a id = 'Project-two–implementing-an-RNN-for-character-level-language-modeling-in-TensorFlow'></a>

### Preparing the data

We will be using 'The Tragedie of Hamlet' by Willian Shakespeare, which can be retrived online in plain text. Just as we mapped unique words to unique integers with the IMDb movies reviews, we will be mapping unique characters to unique integers. We will create a dictionary that maps characters to integers, and another dictionary that mirrors the first by mapping integer to characters. We want the training data array x and the training data array y to have the same shape, where the number of rows is equal to the batch size and the number of columns is the number of batches $\times$ the number of steps

Then we need to reshape the data into mini-batches of sequences. Since the goal is to predict the next character based of the sequence of characters seen up to that point. Therefore, we need to shift the input data and output of the neural network by one character. Next, the $\textbf{x}$ and $\textbf{y}$ arrays need to be split into mini-batches where each row is a sequence with a length equal to the number of steps. This is a way of breaking a long sequence of text into several smaller segments. Each mini-batch contains segment of all of the documents.

<a id = 'Preparing-the-data2'></a>

In [35]:
# load data
with open("s3://tdp-ml-datasets/misc/pg2265.txt", "r", encoding="utf-8") as f:
    text = f.read()
text = text[15858:]
chars = set(text)
char2int = {ch: i for i, ch in enumerate(chars)}
int2char = dict(enumerate(chars))
text_ints = np.array([char2int[ch] for ch in text], dtype=np.int32)

In [36]:
# custom function for preparing data
def reshape_data(sequence, batch_size, num_steps):
    total_batch_length = batch_size * num_steps
    num_batches = int(len(sequence) / total_batch_length)
    if num_batches * total_batch_length + 1 > len(sequence):
        num_batches = num_batches - 1

    # truncate sequence at the end to remove remaining characters that do not make a full batch
    x = sequence[0 : num_batches * total_batch_length]
    y = sequence[1 : num_batches * total_batch_length + 1]

    # split x and y into a list of batches of sequences
    x_batch_splits = np.split(x, batch_size)
    y_batch_splits = np.split(y, batch_size)

    # stack the batches together: shape = [batch _size x total_batch_length]
    x = np.stack(x_batch_splits)
    y = np.stack(y_batch_splits)

    return x, y

In [37]:
# custom function for generating sample batches
def create_batch_generator(data_x, data_y, num_steps):
    batch_size, total_batch_length = data_x.shape
    num_batches = int(total_batch_length / num_steps)
    for b in range(num_batches):
        yield (
            data_x[:, b * num_steps : (b + 1) * num_steps],
            data_y[:, b * num_steps : (b + 1) * num_steps],
        )

### Building a character-level RNN model

CharRNN is a class that will construct a graph to predict the next character in a sequence after observing a given sequence. This can be thought of as choosing a class, where the number of classes is the total number of unique characters in the text corpus. CharRNN has four mathods

- Constructor: setup learning parameters, create graph, and call build method to construct the graph in sampling mode or training mode
- Build: define placeholders for feeding in data, construct RNN using LSTM cells, as well as define network output, cost function and optimizer
- train: iterate through mini batches and train for a certain number of epochs.
- sample: start from a given string, calculate the probabilities of teh next character, and choose a character randomly according to the probabilities. This process is repeated and samples characters will be concatenated together to form a string. Once the string reaches a specified length, it return the string

<a id = 'Building-a-character-level-RNN-model'></a>

In [38]:
# custom function for calculating most top character
def get_top_char(probas, char_size, top_n=5):
    p = np.squeeze(probas)
    p[np.argsort(p)[:-top_n]] = 0.0
    p = p / np.sum(p)
    ch_id = np.random.choice(char_size, 1, p=p)[0]
    return ch_id

In [44]:
# character-level RNN custom class
class CharRNN:
    def __init__(
        self,
        num_classes,
        batch_size=64,
        num_steps=100,
        lstm_size=128,
        num_layers=1,
        learning_rate=0.001,
        keep_prob=0.5,
        grad_clip=5,
        sampling=False,
    ):
        self.num_classes = num_classes
        self.batch_size = batch_size
        self.num_steps = num_steps
        self.lstm_size = lstm_size
        self.num_layers = num_layers
        self.learning_rate = learning_rate
        self.keep_prob = keep_prob
        self.grad_clip = grad_clip

        self.g = tf.Graph()
        with self.g.as_default():
            tf.set_random_seed(123)
            self.build(sampling=sampling)
            self.saver = tf.train.Saver()
            self.init_op = tf.global_variables_initializer()

    def build(self, sampling):
        if sampling == True:
            batch_size, num_steps = 1, 1
        else:
            batch_size = self.batch_size
            num_steps = self.num_steps

        tf_x = tf.placeholder(tf.int32, shape=[batch_size, num_steps], name="tf_x")
        tf_y = tf.placeholder(tf.int32, shape=[batch_size, num_steps], name="tf_y")
        tf_keepprob = tf.placeholder(tf.float32, name="tf_keepprob")

        # one -hot encoding:
        x_onehot = tf.one_hot(tf_x, depth=self.num_classes)
        y_onehot = tf.one_hot(tf_y, depth=self.num_classes)

        # built multilayer RNN cells
        cells = tf.contrib.rnn.MultiRNNCell(
            [
                tf.contrib.rnn.DropoutWrapper(
                    tf.contrib.rnn.BasicLSTMCell(self.lstm_size),
                    output_keep_prob=tf_keepprob,
                )
                for _ in range(self.num_layers)
            ]
        )

        # define the initial state
        self.initial_state = cells.zero_state(batch_size, tf.float32)

        # run each sequence step through RNN
        lstm_outputs, self.final_state = tf.nn.dynamic_rnn(
            cells, x_onehot, initial_state=self.initial_state
        )
        print("  << lstm_outputs >>  ", lstm_outputs)

        seq_output_reshaped = tf.reshape(
            lstm_outputs, shape=[-1, self.lstm_size], name="seq_output_reshaped"
        )

        logits = tf.layers.dense(
            inputs=seq_output_reshaped,
            units=self.num_classes,
            activation=None,
            name="logits",
        )
        proba = tf.nn.softmax(logits, name="probabilities")
        y_reshaped = tf.reshape(
            y_onehot, shape=[-1, self.num_classes], name="y_reshaped"
        )
        cost = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_reshaped),
            name="cost",
        )

        # gradient clipping
        tvars = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), self.grad_clip)
        optimizer = tf.train.AdamOptimizer(self.learning_rate)
        train_op = optimizer.apply_gradients(zip(grads, tvars), name="train_op")

    def train(self, train_x, train_y, num_epochs, ckpt_dir="./ch16_files/model2/"):
        if not os.path.exists(ckpt_dir):
            os.mkdir(ckpt_dir)
        with tf.Session(graph=self.g) as sess:
            sess.run(self.init_op)

            n_batches = int(train_x.shape[1] / self.num_steps)
            iterations = n_batches * num_epochs

            for epoch in range(num_epochs):

                # train network
                new_state = sess.run(self.initial_state)
                loss = 0

                # mini-batch generator
                bgen = create_batch_generator(train_x, train_y, self.num_steps)
                for b, (batch_x, batch_y) in enumerate(bgen, 1):
                    iteration = epoch * n_batches + b
                    feed = {
                        "tf_x:0": batch_x,
                        "tf_y:0": batch_y,
                        "tf_keepprob:0": self.keep_prob,
                        self.initial_state: new_state,
                    }
                    batch_cost, _, new_state = sess.run(
                        ["cost:0", "train_op", self.final_state], feed_dict=feed
                    )
                    if iteration % 100 == 0:
                        print(
                            "Epoch {}/{} Iteration {} | Training loss: {:.5f}".format(
                                epoch + 1, num_epochs, iteration, batch_cost
                            )
                        )

                # save trained model
                self.saver.save(sess, os.path.join(ckpt_dir, "language_modeling.ckpt"))

    def sample(self, output_length, ckpt_dir, starter_seq="The "):
        observed_seq = [ch for ch in starter_seq]
        with tf.Session(graph=self.g) as sess:
            self.saver.restore(sess, tf.train.latest_checkpoint(ckpt_dir))

            # 1: run the model using starter sequence
            new_state = sess.run(self.initial_state)
            for ch in starter_seq:
                x = np.zeros((1, 1))
                x[0, 0] = char2int[ch]
            feed = {"tf_x:0": x, "tf_keepprob:0": 1.0, self.initial_state: new_state}
            proba, new_state = sess.run(
                ["probabilities:0", self.final_state], feed_dict=feed
            )

            ch_id = get_top_char(proba, len(chars))
            observed_seq.append(int2char[ch_id])

            # 2: run the model using the updated observed_seq
            for i in range(output_length):
                x[0, 0] = ch_id
                feed = {
                    "tf_x:0": x,
                    "tf_keepprob:0": 1.0,
                    self.initial_state: new_state,
                }
                proba, new_state = sess.run(
                    ["probabilities:0", self.final_state], feed_dict=feed
                )

                ch_id = get_top_char(proba, len(chars))
                observed_seq.append(int2char[ch_id])
        return "".join(observed_seq)

#### The-constructor

Unlike the sentiment analysis computation graph, where we used the same graph for both training and prediction modes, this model will have different graphs for the training and sampling modes.

To handle this, we add a boolean argument to determine the mode.

We also add an argument called grad_clip, which is used for clipping gradients to avoid exploding gradient issues.

<a id = 'The-constructor'></a>

#### The build method

The build function first defines two local variables, batch_size and num_steps, based on the mode:

$$
\mbox{in sampling mode} =
\left\{
    \begin{array}{ll}
        \mbox{batch_size} \ = 1  \\
        \mbox{num_Steps} \ = 1
    \end{array}
\right.
\\
\mbox{in training mode} =
\left\{
    \begin{array}{ll}
        \mbox{batch_size} \ = self.batch\_size  \\
        \mbox{num_Steps} \ = self.num\_steps
    \end{array}
\right.
$$

Rather than using an embedding layer to efficiently create a salient representation of unique words in the data, we will just use a one-hot encoding scheme for both $x$ and $y$ with 'depth = num_classes', where 'num_classes' is indeed the total number of characters in the corpus.

The process of building the multilayer RNN component is exactly the same as in the sentiment analysis representation, except that 'outputs' from 'tf.nn.dynamic_rnn' is a 3D tensor with the shape [batch_size \times num_steps \times lstm_size]. Then this tensor is reshaped into a 2D tensor with the 'batch_size*num_steps, lstm_steps' shape, which is passed into the fully connected layer 'tf.layers.dense' to get the logits (net input). Lastly, the probabilities for the next batch of characters are obtained and the cost function is defined.

<a id = 'The build method2'></a>

#### The train method

This method is very similar to the 'train' method implemented in the sentiment analysis RNN.

<a id = 'The-train-method2'></a>

#### The sample method

This is similar to the predict method implemented in the sentiment analysis RNN, with the key difference being that we calculate the probabilities for the next character from an input sequence 'observed_seq'. Then these probabilties are passed to a function  'get_top_char', which randomly selects one character based on the probabilities.

The first observed sequence, starts with 'starter_seq', and then when new characters are sampled according to their predicted probabilties, they are appended to the observed sequence, and this newly updated sequence is used for predicting the next character.

The 'sample' method calls the 'get_top_char' function to choose a character ID randomly ('ch_id') according to the returned probabilities. 'get_top_char' sorts the probabilities, then the 'top_n' probabilities are passed to 'numpy.random.choice' to randomly select one out of these top probabilities.

<a id = 'The-sample-method'></a>

### Creating and training the CharRNN Model

<a id = 'Creating-and-training-the-CharRNN-Model'></a>

In [45]:
# train character RNN
batch_size = 64
num_steps = 100
train_x, train_y = reshape_data(text_ints, batch_size, num_steps)

rnn = CharRNN(num_classes=len(chars), batch_size=batch_size)
rnn.train(train_x, train_y, num_epochs=50, ckpt_dir="./ch16_files/model2/")

  << lstm_outputs >>   Tensor("rnn/transpose_1:0", shape=(64, 100, 128), dtype=float32)
Epoch 4/50 Iteration 100 | Training loss: 3.14284
Epoch 8/50 Iteration 200 | Training loss: 2.79166
Epoch 12/50 Iteration 300 | Training loss: 2.50157
Epoch 16/50 Iteration 400 | Training loss: 2.35567
Epoch 20/50 Iteration 500 | Training loss: 2.27658
Epoch 24/50 Iteration 600 | Training loss: 2.23098
Epoch 28/50 Iteration 700 | Training loss: 2.19604
Epoch 32/50 Iteration 800 | Training loss: 2.15389
Epoch 36/50 Iteration 900 | Training loss: 2.12686
Epoch 40/50 Iteration 1000 | Training loss: 2.09561
Epoch 44/50 Iteration 1100 | Training loss: 2.06910
Epoch 48/50 Iteration 1200 | Training loss: 2.04369


### The CharRNN model in the sampling mode

Create a new instance in sampling mode and generate a sequnce of 500 characters

<a id = 'The-CharRNN-model-in-the-sampling-mode'></a>

In [46]:
# generate sample text
del rnn

np.random.seed(123)
rnn = CharRNN(len(chars), sampling=True)
print(rnn.sample(ckpt_dir="./ch16_files/model2/", output_length=500))

  << lstm_outputs >>   Tensor("rnn/transpose_1:0", shape=(1, 1, 128), dtype=float32)
INFO:tensorflow:Restoring parameters from ./ch16_files/model2/language_modeling.ckpt
The way selle the that

   Ham. I mo the whes thit wind and this thes at are bothare.
Whind atendest te tho desther the serint ang ther ang to bothes,
But is a the the thene soull my tho hee and is me in a theare, in wis thath in the toong thee aue authere,
Iles on hat tare to tor are me auline, wall
 a to me that thes seare, thee sille the this dond,
And mo the merere ant ond ind there and this dond,
Whan sis moue to the wislles to day, be thare, ar hath my here,
Whore we hourd ond morertens mingre 


> There are clearly some English words within this block of text. To further strengthen the model, additional epochs with additional data are needed.