<h4>Recurrent Neural Net with TensorFlow</h4>
<br/>
This notebook gives a brief overview of recurrent neural networks, using TensorFlow and its tutorial material as the main resource.

Although most of the source code is borrowed (with acknowledgement) from the TensorFlow website, some modifications were made to handle the following:
<ol>
<li> decreasing learning rate schedule </li>
<li> dropout between the LSTM layers </li>
</ol>

First we download the PTB data from  http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz.

According to the original resource, the dataset is already preprocessed and contains 10,000 distinct words in total, including the end-of-sentence marker and a special symbol (&lt;unk&gt;) for rare words.

<b> Reading data </b>
<br/>
I have the PTB data (taken from the link mentioned above) in the ptb_data folder.

In [1]:
# import the required libraries
import os
import collections
import numpy as np
import inspect
import time
import tensorflow as tf

# build the paths to the training, validation, and test files (all three are in the archive linked above)
data_path = "ptb_data"  # change this to your data directory
train_path = os.path.join(data_path, "ptb.train.txt")
valid_path = os.path.join(data_path, "ptb.valid.txt")
test_path = os.path.join(data_path, "ptb.test.txt")

As the RNN processes numbers only, we need to assign some numeric identifier to each word/token in the text.

Reading the data file: we do this with the following methods, which use TensorFlow to read the text, replace each newline (\n) character with an end-of-sentence token, and split the text into words. Any other simple method would work equally well; TensorFlow is not needed for this basic task.

In [2]:
'''
Reads the file and returns a list of its tokens (words).
Each newline character \n is replaced by the token <eos>
before the text is split on whitespace.
'''
def read_words(filename):
    with tf.gfile.GFile(filename, "r") as f:
        text = f.read()
        # read() may return bytes (Python 2) or str (Python 3); decode only bytes
        if isinstance(text, bytes):
            text = text.decode("utf-8")
        return text.replace("\n", "<eos>").split()
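
Since TensorFlow is not strictly needed for this step, the same tokenisation can be sketched in plain Python. The helper name `read_words_plain` and the two-line sample text are illustrative assumptions, not part of the tutorial code.

```python
import io
import os
import tempfile


def read_words_plain(filename):
    """Return the file's tokens, with each newline replaced by an <eos> token."""
    with io.open(filename, "r", encoding="utf-8") as f:
        return f.read().replace("\n", "<eos>").split()


# Tiny demonstration on a temporary two-line file (illustrative data only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(" the cat sat \n a dog ran \n")
    sample_path = tmp.name

sample_tokens = read_words_plain(sample_path)
os.remove(sample_path)
print(sample_tokens)
```

Note that `<eos>` only becomes a separate token when whitespace precedes the newline, which the sample text mimics; a line ending directly in a word would be fused with the marker.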

Next we use the above method to build the vocabulary of the words and their ids.

In [3]:
'''
this method reads the words from a given file
and assigns ids (starting from 0) to the words based on their frequency.
'''
def build_vocab(filename):
    # get the list of tokens
    data = read_words(filename)
    # count the occurrences of each token
    counter = collections.Counter(data)
    # sort the tokens by decreasing frequency, breaking ties alphabetically
    count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
    # keep only the words, as we do not need the frequencies any more
    words, _ = list(zip(*count_pairs))
    # give every word an id in the range [0, len(words))
    word_to_id = dict(zip(words, range(len(words))))
    return word_to_id
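
To make the id assignment concrete, here is a small sketch that applies the same counting and sorting to a toy token list. The function name `build_vocab_from_tokens` and the toy sentence are assumptions made for illustration.

```python
import collections


def build_vocab_from_tokens(tokens):
    """Map each token to an id: most frequent first, ties broken alphabetically."""
    counter = collections.Counter(tokens)
    count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
    words = [word for word, _ in count_pairs]
    return dict(zip(words, range(len(words))))


toy_tokens = "the cat saw the dog <eos> the dog ran <eos>".split()
toy_vocab = build_vocab_from_tokens(toy_tokens)

# List the words in id order: the most frequent word gets id 0.
print(sorted(toy_vocab.items(), key=lambda kv: kv[1]))
```

The most frequent token receives id 0, so small ids correspond to common words.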

Now, let's call the above method and see the ids of some of the words.

In [4]:
word_to_id = build_vocab(train_path)
# print some of the words and their ids
for word in list(word_to_id.keys())[:10]:
    print("word ",word," has id:",word_to_id[word])

word  evaluate  has id: 5385
word  external  has id: 6030
word  triggered  has id: 2363
word  neuberger  has id: 9205
word  wildlife  has id: 7224
word  sunnyvale  has id: 5844
word  frustration  has id: 6437
word  index  has id: 216
word  stake  has id: 320
word  graduate  has id: 5683


We do the same for the training, validation, and test sets. To make this easier, we use a utility method that converts a whole file into a list of word ids.

In [5]:
def file_to_word_ids(filename, word_to_id):
    data = read_words(filename)
    return [word_to_id[word] for word in data if word in word_to_id]
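
One detail of this conversion is easy to miss: a word that is absent from the vocabulary is skipped silently rather than raising an error. A minimal sketch, assuming a hypothetical four-word vocabulary:

```python
def tokens_to_ids(tokens, word_to_id):
    """Convert tokens to ids, skipping tokens missing from the vocabulary."""
    return [word_to_id[t] for t in tokens if t in word_to_id]


# Hypothetical vocabulary and sentence, for illustration only.
toy_word_to_id = {"the": 0, "<eos>": 1, "dog": 2, "<unk>": 3}
toy_sentence = "the dog barked <eos>".split()

ids = tokens_to_ids(toy_sentence, toy_word_to_id)
print(ids)  # "barked" is not in the vocabulary, so it is dropped
```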

In [6]:
train_data = file_to_word_ids(train_path, word_to_id)
valid_data = file_to_word_ids(valid_path, word_to_id)
test_data = file_to_word_ids(test_path, word_to_id)
# the vocabulary size is the number of unique words in the training set
vocabulary = len(word_to_id)

<b> LSTM </b>
<br/>
We use TensorFlow to implement the LSTM. Let's first define some constants to be used later.

In [7]:
# ALERT! Running this cell more than once raises an error such as
# "ArgumentError: argument --model: conflicting option string: ..."
# because each flag can be defined only once per process.
# To run the cell again, restart the kernel first.
# We define the constants in the form of TensorFlow flags.
flags = tf.flags
logging = tf.logging

flags.DEFINE_string("model", "small", "A type of model. Possible options are: small, medium, large.")
flags.DEFINE_string("data_path", None, "Where the training/test data is stored.")
flags.DEFINE_string("save_path", None, "Model output directory.")
flags.DEFINE_bool("use_fp16", False, "Train using 16-bit floats instead of 32bit floats")

FLAGS = flags.FLAGS

To make it more modular, we define additional methods that deal with the configuration, hyperparameter initialization, and so on. These are all simple initialization methods and are easy to follow:

In [8]:
#class representing configuration parameters for a small model
class SmallConfig(object):
  """Small config."""
  init_scale = 0.1
  learning_rate = 1.0
  max_grad_norm = 5
  num_layers = 2
  num_steps = 20
  hidden_size = 200
  max_epoch = 4
  max_max_epoch = 13
  keep_prob = 1.0
  lr_decay = 0.5
  batch_size = 20
  vocab_size = 10000

#class representing configuration parameters for a medium size model
class MediumConfig(object):
  """Medium config."""
  init_scale = 0.05
  learning_rate = 1.0
  max_grad_norm = 5
  num_layers = 2
  num_steps = 35
  hidden_size = 650
  max_epoch = 6
  max_max_epoch = 39
  keep_prob = 0.5
  lr_decay = 0.8
  batch_size = 20
  vocab_size = 10000

#class representing configuration parameters for a large size model
class LargeConfig(object):
  """Large config."""
  init_scale = 0.04
  learning_rate = 1.0
  max_grad_norm = 10
  num_layers = 2
  num_steps = 35
  hidden_size = 1500
  max_epoch = 14
  max_max_epoch = 55
  keep_prob = 0.35
  lr_decay = 1 / 1.15
  batch_size = 20
  vocab_size = 10000

#class representing configuration parameters for a test model
class TestConfig(object):
  """Tiny config, for testing."""
  init_scale = 0.1
  learning_rate = 1.0
  max_grad_norm = 1
  num_layers = 1
  num_steps = 2
  hidden_size = 2
  max_epoch = 1
  max_max_epoch = 1
  keep_prob = 1.0
  lr_decay = 0.5
  batch_size = 20
  vocab_size = 10000

In [9]:
#get the desired configuration based on the type of model, we use the small model by default as
#defined in our FLAG variable
def get_config():
    if FLAGS.model == "small":
        return SmallConfig()
    elif FLAGS.model == "medium":
        return MediumConfig()
    elif FLAGS.model == "large":
        return LargeConfig()
    elif FLAGS.model == "test":
        return TestConfig()
    else:
        raise ValueError("Invalid model: %s" % FLAGS.model)

Now, let's call the above methods to get the configuration parameters.

In [10]:
config = get_config()
eval_config = get_config() #configuration for evaluation
eval_config.batch_size = 1
eval_config.num_steps = 1
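
The configuration is fetched twice because the evaluation settings are overridden on the instance. A small sketch of why this leaves the training configuration untouched, using a hypothetical `DemoConfig` class in place of the real ones:

```python
class DemoConfig(object):
    """Hypothetical stand-in for SmallConfig, with class-level defaults."""
    batch_size = 20
    num_steps = 20


train_cfg = DemoConfig()
eval_cfg = DemoConfig()

# Assigning on the instance shadows the class attribute for that instance only.
eval_cfg.batch_size = 1
eval_cfg.num_steps = 1

print(train_cfg.batch_size, eval_cfg.batch_size)
```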

In [11]:
#based on the flag configured, we use the data type
def data_type():
  return tf.float16 if FLAGS.use_fp16 else tf.float32

TensorFlow first defines the problem in terms of a graph. Let's define a graph for the problem and feed the variables and parameters into it.

First we define a method that prepares the PTB data in the form of tensors that will be processed by our model.

In [12]:
def ptb_producer(raw_data, batch_size, num_steps, name=None):
  """Iterate on the raw PTB data.
  This chunks up raw_data into batches of examples and returns Tensors that
  are drawn from these batches.
  Args:
    raw_data: the raw ptb data (train data, test data or validation data).
    batch_size: int, the batch size.
    num_steps: int, the number of unrolls.
    name: the name of this operation (optional).
  Returns:
    A pair of Tensors, each shaped [batch_size, num_steps]. The second element
    of the tuple is the same data time-shifted to the right by one.
  Raises:
    tf.errors.InvalidArgumentError: if batch_size or num_steps are too high.
  """
  with tf.name_scope(name, "PTBProducer", [raw_data, batch_size, num_steps]):
    raw_data = tf.convert_to_tensor(raw_data, name="raw_data", dtype=tf.int32)

    data_len = tf.size(raw_data)
    batch_len = data_len // batch_size
    data = tf.reshape(raw_data[0 : batch_size * batch_len],
                      [batch_size, batch_len])

    epoch_size = (batch_len - 1) // num_steps
    
    # make sure that epoch_size is positive
    assertion = tf.assert_positive(
        epoch_size,
        message="epoch_size == 0, decrease batch_size or num_steps")

    # make the assertion a control dependency, so that it runs before epoch_size is used
    with tf.control_dependencies([assertion]):
      epoch_size = tf.identity(epoch_size, name="epoch_size")

    # Produce the integers 0, 1, ..., epoch_size - 1 in order through a queue;
    # each dequeue yields the index i of the next batch.
    i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
    # Before slicing the data, a short refresher on tf.slice(input, begin, size):
    # 'input' is [[[1, 1, 1], [2, 2, 2]],
    #             [[3, 3, 3], [4, 4, 4]],
    #             [[5, 5, 5], [6, 6, 6]]]
    # tf.slice(input, [1, 0, 0], [1, 1, 3]) ==> [[[3, 3, 3]]]
    #   begin = [1, 0, 0]: start at the second outer list, its first row, and
    #   that row's first element, i.e. at the start of [3, 3, 3].
    #   size = [1, 1, 3]: take 1 outer list, 1 row, and 3 elements from there.
    # tf.slice(input, [1, 0, 0], [1, 2, 3]) ==> [[[3, 3, 3], [4, 4, 4]]]
    #   Same start, but size = [1, 2, 3] takes both rows of that outer list.
    # tf.slice(input, [1, 0, 0], [2, 1, 3]) ==> [[[3, 3, 3]], [[5, 5, 5]]]
    #   Same start, but size = [2, 1, 3] takes the first row of each of the
    #   two outer lists [[3, 3, 3], [4, 4, 4]] and [[5, 5, 5], [6, 6, 6]].
    # The code below uses tf.strided_slice(data, begin, end) instead, which
    # takes an exclusive end index rather than a size: x covers columns
    # i * num_steps up to (but excluding) (i + 1) * num_steps of every row.
    x = tf.strided_slice(data, [0, i * num_steps],
                         [batch_size, (i + 1) * num_steps])
    # set_shape does not reshape: it only records the static shape of the slice
    x.set_shape([batch_size, num_steps])
    # the targets are the same slice shifted one token ahead
    y = tf.strided_slice(data, [0, i * num_steps + 1],
                         [batch_size, (i + 1) * num_steps + 1])
    # record the static shape of the targets as well
    y.set_shape([batch_size, num_steps])
    return x, y
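
The batching logic of `ptb_producer` can be followed more easily on plain Python lists. The sketch below mirrors the arithmetic above (the reshape into `batch_size` rows, `epoch_size`, and the one-token shift between inputs and targets); the generator name `ptb_batches` and the toy data are assumptions, and it yields each batch once instead of reading from a queue.

```python
def ptb_batches(raw_data, batch_size, num_steps):
    """Yield (x, y) batches; y holds the tokens that follow those in x."""
    batch_len = len(raw_data) // batch_size
    # Row b holds the b-th contiguous chunk of the token stream.
    data = [raw_data[b * batch_len:(b + 1) * batch_len]
            for b in range(batch_size)]
    epoch_size = (batch_len - 1) // num_steps
    if epoch_size <= 0:
        raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
    for i in range(epoch_size):
        x = [row[i * num_steps:(i + 1) * num_steps] for row in data]
        y = [row[i * num_steps + 1:(i + 1) * num_steps + 1] for row in data]
        yield x, y


# Toy stream of 12 token ids split into 2 rows of 6 tokens each.
toy_batches = list(ptb_batches(list(range(12)), batch_size=2, num_steps=2))
for x, y in toy_batches:
    print(x, y)
```

Each target row is its input row moved one position ahead in the stream, which is what lets the model learn to predict the next word.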

In [13]:
#a wrapper of PTB to make it easier to access all the related parameters
class PTBInput(object):
  """The input data."""

  def __init__(self, config, data, name=None):
    self.batch_size = batch_size = config.batch_size
    self.num_steps = num_steps = config.num_steps
    self.epoch_size = ((len(data) // batch_size) - 1) // num_steps
    self.input_data, self.targets = ptb_producer(
        data, batch_size, num_steps, name=name)

<b> LSTM TensorFlow Graph </b>
<br />
We create the LSTM network within another wrapper and later use it to execute the graph.
So, let's define the LSTM:

In [27]:
class PTBModel(object):
  """The PTB model."""

  def __init__(self, is_training, config, input_):
    self._input = input_

    batch_size = input_.batch_size
    num_steps = input_.num_steps
    size = config.hidden_size
    vocab_size = config.vocab_size

    # Slightly better results can be obtained with forget gate biases
    # initialized to 1 but the hyperparameters of the model would need to be
    # different than reported in the paper.
    def lstm_cell():
      # With the latest TensorFlow source code (as of Mar 27, 2017),
      # the BasicLSTMCell will need a reuse parameter which is unfortunately not
      # defined in TensorFlow 1.0. To maintain backwards compatibility, we add
      # an argument check here:
      if 'reuse' in inspect.getargspec(
          tf.contrib.rnn.BasicLSTMCell.__init__).args:
        return tf.contrib.rnn.BasicLSTMCell(
            size, forget_bias=0.0, state_is_tuple=True,
            reuse=tf.get_variable_scope().reuse)
      else:
        return tf.contrib.rnn.BasicLSTMCell(
            size, forget_bias=0.0, state_is_tuple=True)
    attn_cell = lstm_cell
    if is_training and config.keep_prob < 1:
      # During training, apply dropout to the output of each LSTM layer.
      def attn_cell():
        return tf.contrib.rnn.DropoutWrapper(
            lstm_cell(), output_keep_prob=config.keep_prob)
    cell = tf.contrib.rnn.MultiRNNCell(
        [attn_cell() for _ in range(config.num_layers)], state_is_tuple=True)

    self._initial_state = cell.zero_state(batch_size, data_type())

    with tf.device("/cpu:0"):
      embedding = tf.get_variable(
          "embedding", [vocab_size, size], dtype=data_type())
      inputs = tf.nn.embedding_lookup(embedding, input_.input_data)

    if is_training and config.keep_prob < 1:
      inputs = tf.nn.dropout(inputs, config.keep_prob)

    # Simplified version of models/tutorials/rnn/rnn.py's rnn().
    # This builds an unrolled LSTM for tutorial purposes only.
    # In general, use the rnn() or state_saving_rnn() from rnn.py.
    #
    # The alternative version of the code below is:
    #
    # inputs = tf.unstack(inputs, num=num_steps, axis=1)
    # outputs, state = tf.contrib.rnn.static_rnn(
    #     cell, inputs, initial_state=self._initial_state)
    outputs = []
    state = self._initial_state
    with tf.variable_scope("RNN"):
      for time_step in range(num_steps):
        if time_step > 0: tf.get_variable_scope().reuse_variables()
        (cell_output, state) = cell(inputs[:, time_step, :], state)
        outputs.append(cell_output)

    output = tf.reshape(tf.stack(axis=1, values=outputs), [-1, size])
    softmax_w = tf.get_variable(
        "softmax_w", [size, vocab_size], dtype=data_type())
    softmax_b = tf.get_variable("softmax_b", [vocab_size], dtype=data_type())
    logits = tf.matmul(output, softmax_w) + softmax_b
    loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(
        [logits],
        [tf.reshape(input_.targets, [-1])],
        [tf.ones([batch_size * num_steps], dtype=data_type())])
    self._cost = cost = tf.reduce_sum(loss) / batch_size
    self._final_state = state

    if not is_training:
      return

    self._lr = tf.Variable(0.0, trainable=False)
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars),
                                      config.max_grad_norm)
    optimizer = tf.train.GradientDescentOptimizer(self._lr)
    self._train_op = optimizer.apply_gradients(
        zip(grads, tvars),
        global_step=tf.contrib.framework.get_or_create_global_step())

    self._new_lr = tf.placeholder(
        tf.float32, shape=[], name="new_learning_rate")
    self._lr_update = tf.assign(self._lr, self._new_lr)

  def assign_lr(self, session, lr_value):
    session.run(self._lr_update, feed_dict={self._new_lr: lr_value})

  @property
  def input(self):
    return self._input

  @property
  def initial_state(self):
    return self._initial_state

  @property
  def cost(self):
    return self._cost

  @property
  def final_state(self):
    return self._final_state

  @property
  def lr(self):
    return self._lr

  @property
  def train_op(self):
    return self._train_op
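
The model clips its gradients by their global norm before applying them. As a rough illustration of that step, the sketch below rescales a list of gradient vectors so that their joint L2 norm does not exceed `clip_norm`. It is a plain-Python approximation written for this notebook, not TensorFlow's implementation, and the toy gradients are made up.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by one factor so their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The factor is 1 when the norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm


toy_grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(toy_grads, clip_norm=5.0)
print(norm)
print(clipped)
```

Because every gradient is multiplied by the same factor, the direction of the overall update is preserved and only its length is limited.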

In [29]:
def run_epoch(session, model, eval_op=None, verbose=False):
  """Runs the model on the given data."""
  start_time = time.time()
  costs = 0.0
  iters = 0
  state = session.run(model.initial_state)

  fetches = {
      "cost": model.cost,
      "final_state": model.final_state,
  }
  if eval_op is not None:
    fetches["eval_op"] = eval_op

  for step in range(model.input.epoch_size):
    feed_dict = {}
    for i, (c, h) in enumerate(model.initial_state):
      feed_dict[c] = state[i].c
      feed_dict[h] = state[i].h

    vals = session.run(fetches, feed_dict)
    cost = vals["cost"]
    state = vals["final_state"]

    costs += cost
    iters += model.input.num_steps

    if verbose and step % (model.input.epoch_size // 10) == 10:
      print("%.3f perplexity: %.3f speed: %.0f wps" %
            (step * 1.0 / model.input.epoch_size, np.exp(costs / iters),
             iters * model.input.batch_size / (time.time() - start_time)))

  return np.exp(costs / iters)
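
The value returned by `run_epoch` is the perplexity: the exponential of the average per-word cost. A minimal sketch of that formula follows, with a sanity check under the assumption that a model spreads its probability uniformly over a 10,000-word vocabulary; the function name and the numbers are illustrative.

```python
import math


def perplexity(total_cost, total_steps):
    """Exponential of the average cost per time step."""
    return math.exp(total_cost / total_steps)


vocab_size = 10000
num_steps = 20

# A uniform model pays log(vocab_size) per word, so one batch costs num_steps times that.
uniform_cost_per_batch = num_steps * math.log(vocab_size)
uniform_ppl = perplexity(3 * uniform_cost_per_batch, 3 * num_steps)
print(uniform_ppl)
```

Under that assumption the perplexity equals the vocabulary size, which gives a useful reference point when reading the training log.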

In [31]:
# build all three models inside a fresh graph
with tf.Graph().as_default():
    initializer = tf.random_uniform_initializer(-config.init_scale,
                                                config.init_scale)
    # build the training model and its input pipeline
    with tf.name_scope("Train"):
        train_input = PTBInput(config=config, data=train_data, name="TrainInput")
        with tf.variable_scope("Model", reuse=None, initializer=initializer):
            m = PTBModel(is_training=True, config=config, input_=train_input)
        tf.summary.scalar("Training_Loss", m.cost)
        tf.summary.scalar("Learning_Rate", m.lr)

    # build the validation model, reusing the training model's variables
    with tf.name_scope("Valid"):
        valid_input = PTBInput(config=config, data=valid_data, name="ValidInput")
        with tf.variable_scope("Model", reuse=True, initializer=initializer):
            mvalid = PTBModel(is_training=False, config=config, input_=valid_input)
        tf.summary.scalar("Validation_Loss", mvalid.cost)

    # build the test model, which uses eval_config (batch_size = 1, num_steps = 1)
    with tf.name_scope("Test"):
        test_input = PTBInput(config=eval_config, data=test_data, name="TestInput")
        with tf.variable_scope("Model", reuse=True, initializer=initializer):
            mtest = PTBModel(is_training=False, config=eval_config,
                         input_=test_input)
    # The Supervisor class is a small wrapper that takes care of common needs of TensorFlow training programs,
    # for instance handling program crashes and notifying of raised exceptions (ref: https://www.tensorflow.org/api_docs/python/tf/train/Supervisor)
    sv = tf.train.Supervisor(logdir=FLAGS.save_path)
    with sv.managed_session() as session:
        for i in range(config.max_max_epoch):
            lr_decay = config.lr_decay ** max(i + 1 - config.max_epoch, 0.0)
            m.assign_lr(session, config.learning_rate * lr_decay)

            print("Epoch: %d Learning rate: %.3f" % (i + 1, session.run(m.lr)))
            train_perplexity = run_epoch(session, m, eval_op=m.train_op,
                                     verbose=True)
            print("Epoch: %d Train_Perplexity: %.3f" % (i + 1, train_perplexity))
            valid_perplexity = run_epoch(session, mvalid)
            print("Epoch: %d Valid Perplexity: %.3f" % (i + 1, valid_perplexity))

        test_perplexity = run_epoch(session, mtest)
        print("Test Perplexity: %.3f" % test_perplexity)

        if FLAGS.save_path:
            print("Saving model to %s." % FLAGS.save_path)
            sv.saver.save(session, FLAGS.save_path, global_step=sv.global_step)



Epoch: 1 Learning rate: 1.000
0.004 perplexity: 6289.963 speed: 1260 wps
0.104 perplexity: 856.345 speed: 1390 wps
0.204 perplexity: 630.369 speed: 1394 wps
0.304 perplexity: 508.410 speed: 1392 wps
0.404 perplexity: 438.189 speed: 1390 wps
0.504 perplexity: 392.381 speed: 1388 wps
0.604 perplexity: 353.348 speed: 1379 wps
0.703 perplexity: 326.705 speed: 1367 wps
0.803 perplexity: 305.498 speed: 1367 wps
0.903 perplexity: 286.074 speed: 1371 wps
Epoch: 1 Train_Perplexity: 271.529
Epoch: 1 Valid Perplexity: 182.257
Epoch: 2 Learning rate: 1.000
0.004 perplexity: 214.086 speed: 1375 wps
0.104 perplexity: 152.839 speed: 1401 wps
0.204 perplexity: 159.960 speed: 1400 wps
0.304 perplexity: 154.776 speed: 1403 wps
0.404 perplexity: 151.870 speed: 1404 wps
0.504 perplexity: 149.373 speed: 1405 wps
0.604 perplexity: 144.693 speed: 1406 wps
0.703 perplexity: 142.510 speed: 1406 wps
0.803 perplexity: 140.507 speed: 1406 wps
0.903 perplexity: 136.877 speed: 1406 wps
Epoch: 2 Train_Perplexity: 13
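
The decreasing learning-rate schedule used in the training loop can be reproduced on its own. The sketch below evaluates the same expression for the small configuration (`learning_rate = 1.0`, `lr_decay = 0.5`, `max_epoch = 4`, `max_max_epoch = 13`); the function name is an assumption.

```python
def learning_rate_for_epoch(i, learning_rate, lr_decay, max_epoch):
    """Learning rate for zero-based epoch index i, as computed in the training loop."""
    return learning_rate * lr_decay ** max(i + 1 - max_epoch, 0)


small_schedule = [learning_rate_for_epoch(i, 1.0, 0.5, 4) for i in range(13)]
for epoch, lr in enumerate(small_schedule, start=1):
    print("Epoch: %d Learning rate: %.3f" % (epoch, lr))
```

The rate stays at its initial value for the first `max_epoch` epochs and is then multiplied by `lr_decay` once per epoch.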

<b> Model </b>
<br />
The model consists of LSTM cells. Each cell processes one word at a time and computes, for every word in the vocabulary, the probability that it is the next word in the sentence.
To keep the computational cost feasible, the data is processed in mini-batches.
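
To illustrate how raw output scores become next-word probabilities, here is a minimal softmax sketch in plain Python. The four-word vocabulary and the logit values are made up for the example.

```python
import math


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and scores, for illustration only.
toy_vocab = ["the", "dog", "ran", "<eos>"]
toy_logits = [2.0, 1.0, 0.1, -1.0]

probs = softmax(toy_logits)
best = toy_vocab[probs.index(max(probs))]
print(probs, best)
```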

In [33]:
# number of hidden units in the LSTM cell
lstm_size = 3 # we can also use hidden_size from the config
batch_size = eval_config.batch_size
'''
The core of the model consists of an LSTM cell that processes one word at a time 
and computes probabilities of the possible values for the next word in the sentence. 
The memory state of the network is initialized with a vector of zeros and gets updated 
after reading each word. 
For computational reasons, we will process data in mini-batches of size batch_size.
'''
lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
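# Initial state of the LSTM memory.
# NOTE: this line is the likely source of the ValueError shown below. With the default
# state_is_tuple=True, lstm.state_size is a (c, h) pair rather than an integer, so it
# cannot be used as a dimension; lstm.zero_state(batch_size, tf.float32) builds a valid zero state.
state = tf.zeros([batch_size, lstm.state_size])
probabilities = []
loss = 0.0
# The loop below is pseudocode: train_data would first have to be split into batches
# of words, and softmax_w, softmax_b, loss_function and target_words are placeholders
# that are not defined in this cell.
for current_batch_of_words in train_data:
    # The value of state is updated after processing each batch of words.
    output, state = lstm(current_batch_of_words, state)

    # The LSTM output can be used to make next word predictions
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities.append(tf.nn.softmax(logits))
    loss += loss_function(probabilities, target_words)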

ValueError: setting an array element with a sequence.
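
The state update described in the cell above can also be illustrated without TensorFlow. The following is a minimal sketch of one LSTM step for a single example, written with plain Python lists. The gate names, the weight layout, and the random toy weights are assumptions made for illustration and do not reproduce `BasicLSTMCell` internals. The state is a pair of vectors `(c, h)`, both starting at zero.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights, biases):
    """One LSTM step; weights[gate] has one row per hidden unit."""
    concat = x + h_prev  # the input joined with the previous output

    def affine(gate):
        return [sum(w * v for w, v in zip(row, concat)) + b
                for row, b in zip(weights[gate], biases[gate])]

    i = [sigmoid(a) for a in affine("input")]
    f = [sigmoid(a) for a in affine("forget")]
    o = [sigmoid(a) for a in affine("output")]
    g = [math.tanh(a) for a in affine("candidate")]
    # New memory: keep part of the old memory and add part of the candidate.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


rng = random.Random(0)
input_size, hidden_size = 2, 3
gates = ["input", "forget", "output", "candidate"]
weights = {gate: [[rng.uniform(-0.1, 0.1)
                   for _ in range(input_size + hidden_size)]
                  for _ in range(hidden_size)] for gate in gates}
biases = {gate: [0.0] * hidden_size for gate in gates}

# The state starts at zero and is updated after each input vector.
h = [0.0] * hidden_size
c = [0.0] * hidden_size
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h, c = lstm_step(x, h, c, weights, biases)
print(h, c)
```

In the real model each input vector would be the embedding of one word, and `h` would be passed to the softmax layer to score the next word.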

<b>Multiple LSTMs</b>

<h5> References</h5>
<br />
    <ol>
    <li> <a href="https://www.tensorflow.org/tutorials/recurrent"> TensorFlow RNN</a> </li>
    <li> <a href="https://catalog.ldc.upenn.edu/ldc99t42">Penn Tree Bank (PTB) </a> </li>
    <li> Zaremba et al., <a href="https://arxiv.org/abs/1409.2329">Recurrent Neural Network Regularization</a> </li>
    <li> <a href="http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz">PTB data</a> </li>
    </ol>