##### Copyright 2018 The TensorFlow Authors.

Licensed under the Apache License, Version 2.0 (the "License");

In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License"); { display-mode: "form" }
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Long Short-Term Memory Network (LSTM) with TFP

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/probability/blob/master/tensorflow_probability/examples/jupyter_notebooks/deep_exponential_family_with_tfp.ipynb"><img height="32px" src="https://colab.research.google.com/img/colab_favicon.ico" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/jupyter_notebooks/deep_exponential_family_with_tfp.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>
<br>
<br>
<br>

Original content from [this repository](https://github.com/blei-lab/edward/blob/master/examples/lstm.py), created by [the Blei Lab](http://www.cs.columbia.edu/~blei/), with additional content from Chris Olah ([`@ch402`](https://twitter.com/ch402)).

Ported to [TensorFlow Probability](https://www.tensorflow.org/probability/) by Matthew McAteer ([`@MatthewMcAteer0`](https://twitter.com/MatthewMcAteer0)), with help from Bryan Seybold, Mike Shwe ([`@mikeshwe`](https://twitter.com/mikeshwe)), Josh Dillon, and the rest of the TFP team at Google ([`tfprobability@tensorflow.org`](mailto:tfprobability@tensorflow.org)).

---

- Dependencies & Prerequisites
  - Introduction
  - Recurrent Neural Networks
  - The Problem of Long-Term Dependencies
  - LSTM Networks
  - The Core Idea Behind LSTMs
  - Step-by-Step LSTM Walk Through
- An LSTM language model run on text8
  - Data
  - Model
  - Inference
  - Variants on Long Short Term Memory
  - Conclusions
- References


## Dependencies & Prerequisites

<div class="alert alert-success">
    TensorFlow Probability is part of the Colab default runtime, <b>so you don't need to install TensorFlow or TensorFlow Probability if you're running this in Colab</b>. 
    <br>
    If you're running this notebook in Jupyter on your own machine (and you have already installed TensorFlow), you can use one of the following commands:
    <br>
      <ul>
    <li> For the most recent nightly installation: <code>pip3 install -q tfp-nightly</code></li>
    <li> For the most recent stable TFP release: <code>pip3 install -q --upgrade tensorflow-probability</code></li>
    <li> For the most recent stable GPU-connected version of TFP: <code>pip3 install -q --upgrade tensorflow-probability-gpu</code></li>
    <li> For the most recent nightly GPU-connected version of TFP: <code>pip3 install -q tfp-nightly-gpu</code></li>
    </ul>
Again, if you are running this in Colab, TensorFlow and TFP are already installed.
</div>

In [0]:
#@title Imports and Global Variables  { display-mode: "form" }
!pip3 install -q observations
!pip3 install -q corner
from __future__ import absolute_import, division, print_function

#@markdown This sets the warning status (default is `ignore`, since this notebook runs correctly)
warning_status = "ignore" #@param ["ignore", "always", "module", "once", "default", "error"]
import warnings
warnings.filterwarnings(warning_status)
with warnings.catch_warnings():
    warnings.filterwarnings(warning_status, category=DeprecationWarning)
    warnings.filterwarnings(warning_status, category=UserWarning)

import numpy as np
import string
from datetime import datetime
import os
#@markdown This sets the styles of the plotting (default is styled like plots from [FiveThirtyeight.com](https://fivethirtyeight.com/))
matplotlib_style = 'fivethirtyeight' #@param ['fivethirtyeight', 'bmh', 'ggplot', 'seaborn', 'default', 'Solarize_Light2', 'classic', 'dark_background', 'seaborn-colorblind', 'seaborn-notebook']
import matplotlib.pyplot as plt; plt.style.use(matplotlib_style)
import matplotlib.axes as axes;
from matplotlib.patches import Ellipse
%matplotlib inline
import seaborn as sns; sns.set_context('notebook')
from IPython.core.pylabtools import figsize
#@markdown This sets the resolution of the plot outputs (`retina` is the highest resolution)
notebook_screen_res = 'retina' #@param ['retina', 'png', 'jpeg', 'svg', 'pdf']
%config InlineBackend.figure_format = notebook_screen_res

import tensorflow as tf
tfe = tf.contrib.eager

# Eager Execution
#@markdown Check the box below if you want to use [Eager Execution](https://www.tensorflow.org/guide/eager)
#@markdown Eager execution provides an intuitive interface, easier debugging, and a control flow comparable to NumPy. You can read more about it on the [Google AI Blog](https://ai.googleblog.com/2017/10/eager-execution-imperative-define-by.html)
use_tf_eager = False #@param {type:"boolean"}

# Use try/except so we can easily re-execute the whole notebook.
if use_tf_eager:
  try:
    tf.enable_eager_execution()
  except:
    pass

import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

  
def evaluate(tensors):
  """Evaluates Tensor or EagerTensor to Numpy `ndarray`s.
  Args:
    tensors: Object of `Tensor` or `EagerTensor`s; can be `list`, `tuple`,
      `namedtuple` or combinations thereof.
 
  Returns:
    ndarrays: Object with same structure as `tensors` except with `Tensor` or
      `EagerTensor`s replaced by Numpy `ndarray`s.
  """
  if tf.executing_eagerly():
    return tf.contrib.framework.nest.pack_sequence_as(
        tensors,
        [t.numpy() if tf.contrib.framework.is_tensor(t) else t
         for t in tf.contrib.framework.nest.flatten(tensors)])
  return sess.run(tensors)

class _TFColor(object):
    """Enum of colors used in TF docs."""
    red = '#F15854'
    blue = '#5DA5DA'
    orange = '#FAA43A'
    green = '#60BD68'
    pink = '#F17CB0'
    brown = '#B2912F'
    purple = '#B276B2'
    yellow = '#DECF3F'
    gray = '#4D4D4D'
    def __getitem__(self, i):
        return [
            self.red,
            self.orange,
            self.green,
            self.blue,
            self.pink,
            self.brown,
            self.purple,
            self.yellow,
            self.gray,
        ][i % 9]
TFColor = _TFColor()

def session_options(enable_gpu_ram_resizing=True, enable_xla=True):
    """
    Allowing the notebook to make use of GPUs if they're available.
    
    XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear 
    algebra that optimizes TensorFlow computations.
    """
    config = tf.ConfigProto()
    config.log_device_placement = True
    if enable_gpu_ram_resizing:
        # `allow_growth=True` makes it possible to connect multiple colabs to your
        # GPU. Otherwise the colab malloc's all GPU ram.
        config.gpu_options.allow_growth = True
    if enable_xla:
        # Enable XLA. https://www.tensorflow.org/performance/xla/.
        config.graph_options.optimizer_options.global_jit_level = (
            tf.OptimizerOptions.ON_1)
    return config


def reset_sess(config=None):
    """
    Convenience function to create the TF graph & session or reset them.
    """
    if config is None:
        config = session_options()
    global sess
    tf.reset_default_graph()
    try:
        sess.close()
    except:
        pass
    sess = tf.InteractiveSession(config=config)

reset_sess()

# from edward.models import Categorical
# from edward.util import Progbar
from observations import text8


## Introduction

Many processes generate time-series data, and modeling such data often involves an architecture like an LSTM.

### Recurrent Neural Networks
Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-rolled.png" width="150">

In the above diagram, a chunk of neural network, $A$, looks at some input $x_t$ and outputs a value $h_t$. A loop allows information to be passed from one step of the network to the next.

These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that they aren’t all that different than a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png" width="600">

This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They’re the natural architecture of neural network to use for such data.
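
The unrolled view above can be sketched in a few lines of NumPy. This is a minimal illustration, not the model trained later in this notebook: the sizes, the random weights, and the name `rnn_step` are all assumptions made for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a plain RNN: the same weights are applied at every timestep."""
    return np.tanh(x_t @ w_x + h_prev @ w_h + b)

rng = np.random.default_rng(0)
n_in, n_hidden, n_steps = 3, 4, 5          # hypothetical toy sizes
w_x = rng.normal(scale=0.5, size=(n_in, n_hidden))
w_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

xs = rng.normal(size=(n_steps, n_in))      # a toy input sequence
h = np.zeros(n_hidden)                     # initial hidden state
hs = []
for x_t in xs:                             # "unrolling the loop"
    h = rnn_step(x_t, h, w_x, w_h, b)      # each copy passes h to its successor
    hs.append(h)
hs = np.stack(hs)                          # shape (n_steps, n_hidden)
```

Each iteration of the loop corresponds to one copy of the module $A$ in the unrolled diagram.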

And they certainly are used! In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on. I’ll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy’s excellent blog post, [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/). But they really are pretty amazing.

Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will explore.


### The Problem of Long-Term Dependencies

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-shorttermdepdencies.png" width="600">

But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-longtermdependencies.png" width="600">

In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.
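
One reason commonly given for this difficulty is that gradients shrink as they are propagated back through many timesteps. The sketch below illustrates the effect on a scalar toy RNN, $h_t = \tanh(w\,h_{t-1})$. The weight `w = 0.9`, the starting value, and the function name are assumptions picked for the example.

```python
import math

def gradient_through_time(w, h0, n_steps):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(n_steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)   # chain rule: d tanh(w*h)/dh = w * (1 - tanh^2)
    return grad

short_gap = gradient_through_time(w=0.9, h0=0.5, n_steps=5)
long_gap = gradient_through_time(w=0.9, h0=0.5, n_steps=100)
print(short_gap, long_gap)
```

Every factor in the product is at most `w` in magnitude, so with `|w| < 1` the influence of an early state on a late one decays at least geometrically with the gap.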

Thankfully, LSTMs don’t have this problem!



### LSTM Networks
Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png" width="600">

The repeating module in a standard RNN contains a single layer.

LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png" width="600">

The repeating module in an LSTM contains four interacting layers.

Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by step later. For now, let’s just try to get comfortable with the notation we’ll be using.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM2-notation.png" width="300">

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

### The Core Idea Behind LSTMs
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

![C-line](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-C-line.png)

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

![gate](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-gate.png)

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”
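
A gate as described above is just a sigmoid followed by a pointwise product. The sketch below uses hand-picked pre-activations (an assumption for illustration; in an LSTM they come from a learned layer) to show the three regimes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

signal = np.array([2.0, -1.0, 0.5])          # information that wants to pass
gate_logits = np.array([-20.0, 0.0, 20.0])   # hypothetical pre-activations
gate = sigmoid(gate_logits)                  # ~0 blocks, 0.5 halves, ~1 passes
gated = gate * signal                        # pointwise multiplication
print(gated)
```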

An LSTM has three of these gates, to protect and control the cell state.

### Step-by-Step LSTM Walk Through
The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at $h_{t-1}$ and $x_t$, and outputs a number between $0$ and $1$ for each number in the cell state $C_{t-1}$. A $1$ represents “completely keep this” while a $0$ represents “completely get rid of this.”

Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.
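
As a rough sketch of the forget gate, assume it is a single dense layer over the concatenation of $h_{t-1}$ and $x_t$. The weight names `w_f` and `b_f`, the toy sizes, and the random values are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4                            # hypothetical toy sizes
w_f = rng.normal(size=(n_hidden + n_in, n_hidden))
b_f = np.zeros(n_hidden)

h_prev = rng.normal(size=n_hidden)               # h_{t-1}
x_t = rng.normal(size=n_in)                      # x_t
c_prev = rng.normal(size=n_hidden)               # C_{t-1}

# One number in (0, 1) per cell-state component.
f_t = sigmoid(np.concatenate([h_prev, x_t]) @ w_f + b_f)
kept = f_t * c_prev                              # what survives of the old state
```

Because every entry of `f_t` lies strictly between 0 and 1, each component of `kept` is no larger in magnitude than the corresponding component of `c_prev`.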

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-f.png" width="600">

The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-i.png" width="600">

It’s now time to update the old cell state, $C_{t-1}$, into the new cell state $C_t$. The previous steps already decided what to do; we just need to actually do it.

We multiply the old state by $f_t$, forgetting the things we decided to forget earlier. Then we add $i_t * \tilde{C}_t$. These are the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.
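
A small worked example of this update, with hand-picked gate values (assumptions for illustration, not outputs of a trained model):

```python
import numpy as np

c_prev = np.array([1.0, -2.0, 0.5])      # old cell state C_{t-1}
f_t = np.array([1.0, 0.0, 0.5])          # forget gate: keep, drop, halve
i_t = np.array([0.0, 1.0, 0.5])          # input gate: ignore, write, half-write
c_tilde = np.array([0.9, 0.3, -1.0])     # candidate values

c_t = f_t * c_prev + i_t * c_tilde       # new cell state C_t
print(c_t)                               # [ 1.    0.3  -0.25]
```

The first component is carried over unchanged, the second is fully replaced by its candidate, and the third is a blend of old and new.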

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-C.png" width="600">

Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
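
Collecting the steps of the walk-through, one LSTM step can be written as the set of equations below. The weight and bias symbols ($W_f$, $b_f$, and so on) are notation assumed here for the four layers; $[h_{t-1}, x_t]$ denotes concatenation and $*$ denotes pointwise multiplication.

$$
\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \\
C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
h_t &= o_t * \tanh(C_t)
\end{aligned}
$$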

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-o.png" width="600">


## An LSTM language model run on text8


In [0]:
#@title Hyperparameters
# Set seed. Remove this line to get different results on each run!
tf.set_random_seed(77)

data_dir = '/tmp/data' #@param {type:"string"}
log_dir = '/tmp/log'   #@param {type:"string"}
n_epoch = 200          #@param {type:"number"}
batch_size = 128       #@param {type:"number"}
hidden_size = 512      #@param {type:"number"}
timesteps = 64         #@param {type:"number"}
lr = 5e-3              #@param {type:"number"}

timestamp = datetime.strftime(datetime.utcnow(), "%Y%m%d_%H%M%S")
hyperparam_str = '_'.join([var + '_' + str(eval(var)).replace('.', '_') for var in ['batch_size', 'hidden_size', 'timesteps', 'lr']])
log_dir = os.path.join(log_dir, timestamp + '_' + hyperparam_str)
if not os.path.exists(log_dir):
    os.makedirs(log_dir)

In [0]:
def lstm_cell(x, h, c, name=None, reuse=False):
    """
    LSTM returning hidden state and content cell at a specific timestep.
    """
    nin = x.shape[-1].value
    nout = h.shape[-1].value
    with tf.variable_scope(name, default_name="lstm",
                         values=[x, h, c], reuse=reuse):
        wx = tf.get_variable("kernel/input", [nin, nout * 4],
                             dtype=tf.float32,
                             initializer=tf.orthogonal_initializer(1.0))
        wh = tf.get_variable("kernel/hidden", [nout, nout * 4],
                             dtype=tf.float32,
                             initializer=tf.orthogonal_initializer(1.0))
        b = tf.get_variable("bias", [nout * 4],
                            dtype=tf.float32,
                            initializer=tf.constant_initializer(0.0))

    z = tf.matmul(x, wx) + tf.matmul(h, wh) + b
    i, f, o, u = tf.split(z, 4, axis=1)
    i = tf.sigmoid(i)
    f = tf.sigmoid(f + 1.0)
    o = tf.sigmoid(o)
    u = tf.tanh(u)
    c = f * c + i * u
    h = o * tf.tanh(c)
    return h, c
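
To check the arithmetic of the cell above without TensorFlow, here is a NumPy sketch that mirrors it step by step. The name `lstm_cell_np` is hypothetical, and the weights are small random values rather than the orthogonal initialization used in the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_np(x, h, c, wx, wh, b):
    """NumPy mirror of `lstm_cell`: one fused matmul, then a 4-way split."""
    z = x @ wx + h @ wh + b                  # [batch, 4 * nout]
    i, f, o, u = np.split(z, 4, axis=1)      # input, forget, output, candidate
    i = sigmoid(i)
    f = sigmoid(f + 1.0)                     # +1.0 biases the cell toward remembering
    o = sigmoid(o)
    u = np.tanh(u)
    c = f * c + i * u
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
batch, nin, nout = 2, 27, 8                  # toy sizes; 27 matches the text8 vocabulary
wx = rng.normal(scale=0.1, size=(nin, 4 * nout))
wh = rng.normal(scale=0.1, size=(nout, 4 * nout))
b = np.zeros(4 * nout)
x = np.eye(nin)[[0, 5]]                      # two one-hot characters
h0 = np.zeros((batch, nout))
c0 = np.zeros((batch, nout))
h1, c1 = lstm_cell_np(x, h0, c0, wx, wh, b)
```

Computing all four layers with a single `[nin, 4 * nout]` kernel and splitting afterwards is equivalent to four separate layers, but needs only one pair of matrix multiplications per step. The `+ 1.0` inside the forget gate means that a zero pre-activation gives `sigmoid(1.0)`, about 0.73, so the cell starts out inclined to keep its state.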

In [0]:
def generator(input, batch_size, timesteps, encoder):
    """
    Generate batches from `input` (a string or list of characters). Encode its
    characters to integers, yielding arrays of shape [batch_size, timesteps].
    """
    while True:
        # Sample random start offsets, one per sequence in the batch.
        imb = np.random.randint(0, len(input) - timesteps, batch_size)
        encoded = np.asarray(
            [[encoder[c] for c in input[i:(i + timesteps)]] for i in imb],
            dtype=np.int32)
        # Yield inside the loop; otherwise the generator never produces a batch.
        yield encoded
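
A quick way to sanity-check a batch generator like the one above is to run it on a short toy string. The sketch below restates the generator so that it runs on its own; `toy_text` is a stand-in for text8 and the batch sizes are arbitrary.

```python
import string
import numpy as np

def generator(text, batch_size, timesteps, encoder):
    """Yield endless random batches of shape [batch_size, timesteps]."""
    while True:
        starts = np.random.randint(0, len(text) - timesteps, batch_size)
        yield np.asarray(
            [[encoder[ch] for ch in text[i:i + timesteps]] for i in starts],
            dtype=np.int32)

vocab = string.ascii_lowercase + ' '
encoder = {ch: idx for idx, ch in enumerate(vocab)}
toy_text = "the quick brown fox jumps over the lazy dog " * 4   # stand-in for text8
batches = generator(toy_text, batch_size=3, timesteps=8, encoder=encoder)
first = next(batches)     # each call draws fresh random windows
second = next(batches)
print(first.shape, first.dtype)
```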

### TFP in LSTMs

Within the language model, we feed the logits to a `Categorical` distribution from TFP.

The model factorizes the joint probability `p(x[0], ..., x[timesteps - 1])` as
$$
    \prod_{t=0}^{\text{timesteps} - 1} p(x[t] \mid x[:t]).
$$
To calculate this probability, we call `log_prob` on

` x = [x[0], ..., x[timesteps - 1]]` given `input = [0, x[0], ..., x[timesteps - 2]]`.
We implement this separately from the generative model so that the forward pass (e.g., the embedding and dense layers) can be parallelized. The mapping is `[batch_size, timesteps] -> [batch_size, timesteps]`.
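
To make the factorization concrete, the NumPy sketch below computes the same quantity that `log_prob` followed by a sum over timesteps would give: the log of each observed character's probability, added up along the sequence. The helper names and the uniform logits are assumptions for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sequence_log_prob(logits, x):
    """Sum over t of log p(x[t] | x[:t]).

    logits: [batch, timesteps, vocab]; x: integer array [batch, timesteps].
    """
    logp = log_softmax(logits)
    per_step = np.take_along_axis(logp, x[..., None], axis=-1)[..., 0]
    return per_step.sum(axis=-1)

vocab_size, timesteps = 27, 4
uniform_logits = np.zeros((1, timesteps, vocab_size))   # every character equally likely
x = np.array([[3, 0, 26, 7]])
lp = sequence_log_prob(uniform_logits, x)
```

With uniform logits every character has probability `1 / vocab_size`, so the result is `-timesteps * log(vocab_size)` regardless of which characters are observed.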


In [0]:
def language_model(input, vocab_size):
    """
    Language model that maps input tokens to a distribution over next tokens.

    Args:
      input: int32 Tensor of shape [batch_size, timesteps] holding the
        right-shifted character indices.
      vocab_size: int, number of distinct characters.
    Returns:
      A `tfd.Categorical` distribution with batch shape
      [batch_size, timesteps]; its `log_prob` scores the observed characters.
    Closure over: hidden_size, timesteps
    """
    x = tf.one_hot(input, depth=vocab_size, dtype=tf.float32)
    h = tf.fill(tf.stack([tf.shape(x)[0], hidden_size]), 0.0)
    c = tf.fill(tf.stack([tf.shape(x)[0], hidden_size]), 0.0)
    hs = []
    reuse = None
    for t in range(timesteps):
        # Create the LSTM variables on the first step, then reuse them.
        if t > 0:
            reuse = True
        xt = x[:, t, :]
        h, c = lstm_cell(xt, h, c, name="lstm", reuse=reuse)
        hs.append(h)

    h = tf.stack(hs, 1)
    logits = tf.layers.dense(h, vocab_size, name="dense")
    # Return the distribution itself (not a sample) so that callers can
    # evaluate `log_prob` on observed data.
    return tfd.Categorical(logits=logits)


We then define the generator for the language model, which draws samples according to
$$
x \sim \prod_t p(x_t \mid x_{<t})
$$
The output is an integer tensor of shape `[batch_size, timesteps]`.

In [0]:
def language_model_gen(batch_size, vocab_size):
    """
    Generate x ~ prod p(x_t | x_{<t}). Output [batch_size, timesteps].
    """
    # Initialize data input randomly.
    x = tf.random_uniform([batch_size], 0, vocab_size, dtype=tf.int32)
    h = tf.zeros([batch_size, hidden_size])
    c = tf.zeros([batch_size, hidden_size])
    xs = []
    for _ in range(timesteps):
        x = tf.one_hot(x, depth=vocab_size, dtype=tf.float32)
        h, c = lstm_cell(x, h, c, name="lstm")
        logits = tf.layers.dense(h, vocab_size, name="dense")
        x = tfd.Categorical(logits=logits).sample(seed=77)  # Stay in the graph; do not call .eval() while building it.
        xs.append(x)

    xs = tf.cast(tf.stack(xs, 1), tf.int32)
    return xs
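
The sampling loop above draws one character at a time and feeds each draw back in as the next input. The NumPy sketch below shows the same pattern with a toy conditional in place of the LSTM. `toy_logits` is a hypothetical stand-in that almost always predicts the previous token plus one.

```python
import numpy as np

def sample_sequence(step_logits_fn, first_token, timesteps, rng):
    """Draw x[t] ~ p(x[t] | x[<t]) one step at a time, feeding each sample back in."""
    tokens = [first_token]
    for _ in range(timesteps):
        logits = step_logits_fn(tokens)                 # depends on the history so far
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens[1:]

def toy_logits(history, vocab_size=5):
    """Hypothetical stand-in for the LSTM: strongly prefers (last token + 1) mod vocab."""
    logits = np.zeros(vocab_size)
    logits[(history[-1] + 1) % vocab_size] = 50.0
    return logits

rng = np.random.default_rng(77)
sampled = sample_sequence(toy_logits, first_token=0, timesteps=6, rng=rng)
print(sampled)
```

Because each step conditions on a sampled value, this loop cannot be parallelized over time the way the `log_prob` computation can.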

Now that our functions have been defined, let's get our data.

### Data

In [0]:
# Set seed. Remove this line to get different results on each run!
tf.set_random_seed(77)

x_train, _, x_test = text8(data_dir)
vocab = string.ascii_lowercase + ' '
vocab_size = len(vocab)
encoder = dict(zip(vocab, range(vocab_size)))
decoder = {v: k for k, v in encoder.items()}

data = generator(x_train, batch_size, timesteps, encoder)
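
As a small check of the character mapping, the sketch below builds the same `encoder` and `decoder` dictionaries and round-trips a short string. The sample text is an arbitrary choice.

```python
import string

vocab = string.ascii_lowercase + ' '
encoder = dict(zip(vocab, range(len(vocab))))      # character -> integer id
decoder = {v: k for k, v in encoder.items()}       # integer id -> character

ids = [encoder[ch] for ch in "hello world"]
text = ''.join(decoder[i] for i in ids)
print(ids)
print(text)
```

The vocabulary has 27 symbols: the 26 lowercase letters plus the space, which gets id 26.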

>> Downloading /tmp/data/text8.zip.part 
>> [29.9 MB/29.9 MB] 100% @2.1 MB/s,[0s remaining, 14s elapsed]        
URL http://mattmahoney.net/dc/text8.zip downloaded to /tmp/data/text8.zip 




### Model

In [0]:
x_ph = tf.placeholder(tf.int32, [None, timesteps])
with tf.variable_scope("language_model"):
    # Shift input sequence to right by 1, [0, x[0], ..., x[timesteps - 2]].
    x_ph_shift = tf.pad(x_ph, [[0, 0], [1, 0]])[:, :-1]
    x = language_model(x_ph_shift, vocab_size)

with tf.variable_scope("language_model", reuse=True):
    x_gen = language_model_gen(5, vocab_size)

imb = range(0, len(x_test) - timesteps, timesteps)
encoded_x_test = np.asarray(
      [[encoder[c] for c in x_test[i:(i + timesteps)]] for i in imb],
      dtype=np.int32)
test_size = encoded_x_test.shape[0]
print("Test set shape: {}".format(encoded_x_test.shape))
test_nll = -tf.reduce_sum(x.log_prob(x_ph))
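
The right shift applied to `x_ph` above can be reproduced in NumPy to see exactly what the model receives as input. The toy array is an assumption for illustration.

```python
import numpy as np

x = np.array([[5, 6, 7, 8],
              [1, 2, 3, 4]])
# Pad one zero on the left of the time axis, then drop the last step,
# mirroring tf.pad(x_ph, [[0, 0], [1, 0]])[:, :-1].
x_shift = np.pad(x, [[0, 0], [1, 0]], mode="constant")[:, :-1]
print(x_shift)
```

At position `t` the model therefore sees `x[t - 1]` and is scored on `x[t]`. Note that the padding value 0 is also the id of `a` in the encoder used here, so the start symbol is not distinct from that letter.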

### Progress Bar Utility Functions

In [0]:
#@title Progress Bar Utility Code (make sure to run this cell)  { display-mode: "form" }
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import six
import sys
import time


class Progbar(object):
  def __init__(self, target, width=30, interval=0.01, verbose=1):
    """(Yet another) progress bar.
    Args:
      target: int.
        Total number of steps expected.
      width: int.
        Width of progress bar.
      interval: float.
        Minimum time (in seconds) for progress bar to be displayed
        during updates.
      verbose: int.
        Level of verbosity. 0 suppresses output; 1 is default.
    """
    self.target = target
    self.width = width
    self.interval = interval
    self.verbose = verbose

    self.stored_values = {}
    self.start = time.time()
    self.last_update = 0
    self.total_width = 0
    self.seen_so_far = 0

  def update(self, current, values=None, force=False):
    """Update progress bar, and print to standard output if `force`
    is True, or the last update was completed longer than `interval`
    amount of time ago, or `current` >= `target`.
    The written output is the progress bar and all unique values.
    Args:
      current: int.
        Index of current step.
      values: dict of str to float.
        Dict of name by value-for-last-step. The progress bar
        will display averages for these values.
      force: bool.
        Whether to force visual progress update.
    """
    if values is None:
      values = {}

    for k, v in six.iteritems(values):
      self.stored_values[k] = v

    self.seen_so_far = current

    now = time.time()
    if (not force and
            (now - self.last_update) < self.interval and
            current < self.target):
      return

    self.last_update = now
    if self.verbose == 0:
      return

    prev_total_width = self.total_width
    sys.stdout.write("\b" * prev_total_width)
    sys.stdout.write("\r")

    # Write progress bar to stdout.
    n_digits = len(str(self.target))
    bar = '%%%dd/%%%dd' % (n_digits, n_digits) % (current, self.target)
    bar += ' [{0}%]'.format(str(int(current / self.target * 100)).rjust(3))
    bar += ' '
    prog_width = int(self.width * float(current) / self.target)
    if prog_width > 0:
      try:
        bar += ('█' * prog_width)
      except UnicodeEncodeError:
        bar += ('*' * prog_width)

    bar += (' ' * (self.width - prog_width))
    sys.stdout.write(bar)

    # Write values to stdout.
    if current:
      time_per_unit = (now - self.start) / current
    else:
      time_per_unit = 0

    eta = time_per_unit * (self.target - current)
    info = ''
    if current < self.target:
      info += ' ETA: %ds' % eta
    else:
      info += ' Elapsed: %ds' % (now - self.start)

    for k, v in six.iteritems(self.stored_values):
      info += ' | {0:s}: {1:0.3f}'.format(k, v)

    self.total_width = len(bar) + len(info)
    if prev_total_width > self.total_width:
      info += ((prev_total_width - self.total_width) * " ")

    sys.stdout.write(info)
    sys.stdout.flush()

    if current >= self.target:
      sys.stdout.write("\n")

### Inference

For our optimization, we use the Adam optimizer to minimize our loss, `test_nll`, the **n**egative **l**og-**l**ikelihood of the data under the model.

In [0]:
train_op = tf.train.AdamOptimizer(learning_rate=lr).minimize(test_nll)

print("Number of sets of parameters: {}".format(
      len(tf.trainable_variables())))
print("Number of parameters: {}".format(
      np.sum([np.prod(v.shape.as_list()) for v in tf.trainable_variables()])))
for v in tf.trainable_variables():
    print(v)

evaluate(tf.global_variables_initializer())

# Double n_epoch and print progress every half an epoch.
n_iter_per_epoch = len(x_train) // (batch_size * timesteps * 2)
epoch = 0.0
for _ in range(n_epoch * 2):
    epoch += 0.5
    print("Epoch: {0}".format(epoch))
    avg_nll = 0.0

    pbar = Progbar(n_iter_per_epoch)
    for t in range(1, n_iter_per_epoch + 1):
        pbar.update(t)
        x_batch = next(data)
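        # Fetch the loss value along with the training step so that we
        # accumulate a number rather than the graph tensor.
        _, nll = sess.run([train_op, test_nll], feed_dict={x_ph: x_batch})
        avg_nll += nll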

    # Print average bits per character over epoch.
    avg_nll /= (n_iter_per_epoch * batch_size * timesteps *
                np.log(2))
    print("Train average bits/char: {:0.8f}".format(avg_nll))

    # Print per-data point negative log-likelihood on test set.
    avg_nll = 0.0
    for start in range(0, test_size, batch_size):
        end = min(test_size, start + batch_size)
        x_batch = encoded_x_test[start:end]
        avg_nll += sess.run(test_nll, feed_dict={x_ph: x_batch})

    avg_nll /= test_size
    print("Test average NLL: {:0.8f}".format(avg_nll))

    # Generate samples from model.
    samples = sess.run(x_gen)
    samples = [''.join([decoder[c] for c in sample]) for sample in samples]
    print("Samples:")
    for sample in samples:
        print(sample)


Our default hyperparameters achieve ~78.4 NLL at epoch 50 and ~76.1423 NLL at epoch 200.

This takes about 13s/epoch on a Titan X (Pascal).
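
To compare these numbers with the bits-per-character figure printed during training, the NLL can be converted by dividing by the number of characters and by $\ln 2$. The sketch below assumes the quoted values are summed NLLs in nats over one sequence of `timesteps = 64` characters, which is how the test loop above averages them. The helper name is hypothetical.

```python
import math

timesteps = 64                      # characters per sequence (notebook default)

def nll_to_bits_per_char(nll_nats, n_chars):
    """Convert a summed NLL in nats over n_chars characters to bits per character."""
    return nll_nats / (n_chars * math.log(2))

bpc_epoch_50 = nll_to_bits_per_char(78.4, timesteps)
bpc_epoch_200 = nll_to_bits_per_char(76.1423, timesteps)
print(round(bpc_epoch_50, 3), round(bpc_epoch_200, 3))
```

Under that assumption the two values correspond to roughly 1.77 and 1.72 bits per character.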

If you're impatient and just instantly want to know what that gets you, after 200 epochs, we should be getting samples like the following:
```
e the classmaker was cut apart rome the charts sometimes known a
hemical place baining examples of equipment accepted manner clas
uetean meeting sought to exist as this waiting an excerpt for of
erally enjoyed a film writer of unto one two volunteer humphrey
y captured by the saughton river goodness where stones were nota
```

### Variants on Long Short Term Memory
What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-peepholes.png" width="600">

The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.
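
In the coupled variant the input gate is not computed separately but tied to the forget gate as $i_t = 1 - f_t$. A worked example with hand-picked values (assumptions for illustration):

```python
import numpy as np

c_prev = np.array([1.0, -2.0, 0.5])      # old cell state
c_tilde = np.array([0.9, 0.3, -1.0])     # candidate values
f_t = np.array([1.0, 0.0, 0.5])          # the single coupled gate

# The input gate is tied to the forget gate: i_t = 1 - f_t.
c_t = f_t * c_prev + (1.0 - f_t) * c_tilde
print(c_t)                               # [ 1.    0.3  -0.25]
```

Each component of the new state is a weighted average of the old value and the candidate, so nothing is written without forgetting a matching share of what was there.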

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-tied.png" width="600">

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-GRU.png" width="600">

A gated recurrent unit neural network.
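
A minimal NumPy sketch of one GRU step follows, using the commonly written form with an update gate and a reset gate. Biases are omitted, and the weight names, toy sizes, and random values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, w_z, w_r, w_h):
    """One GRU step (biases omitted for brevity)."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(hx @ w_z)                              # update gate
    r_t = sigmoid(hx @ w_r)                              # reset gate
    h_tilde = np.tanh(np.concatenate([r_t * h_prev, x_t]) @ w_h)
    return (1.0 - z_t) * h_prev + z_t * h_tilde          # single merged state

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4                                    # hypothetical toy sizes
w_z, w_r, w_h = (rng.normal(scale=0.5, size=(n_hidden + n_in, n_hidden))
                 for _ in range(3))
h = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):                   # a toy input sequence
    h = gru_step(x_t, h, w_z, w_r, w_h)
```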

These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also some completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

### Conclusions
Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention! There’s been a number of really exciting results using attention, and it seems like a lot more are around the corner…
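
The idea of picking information from a larger collection can be sketched as soft attention: score every item against the current state, normalize the scores, and take a weighted average. The dot-product scoring and all names below are assumptions chosen for brevity, not necessarily the scoring used by Xu, et al.

```python
import numpy as np

def soft_attention(query, memory):
    """Weight each memory row by its softmax-normalized dot product with the query."""
    scores = memory @ query                       # one score per memory item
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # non-negative, sums to 1
    return weights @ memory, weights              # context vector, attention weights

memory = np.array([[1.0, 0.0],                    # hypothetical "image regions"
                   [0.0, 1.0],
                   [1.0, 1.0]])
query = np.array([0.0, 10.0])                     # hypothetical RNN state
context, weights = soft_attention(query, memory)
print(weights, context)
```

Here the query matches the second and third rows equally well, so they share almost all of the weight and the context vector is close to their average.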

Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!

## References

[1] https://colah.github.io/posts/2015-08-Understanding-LSTMs/

[2] https://karpathy.github.io/2015/05/21/rnn-effectiveness/

[3] http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf

[4] http://www.bioinf.jku.at/publications/older/2604.pdf

[5] ftp://ftp.idsia.ch/pub/juergen/TimeCount-IJCNN2000.pdf

[6] https://arxiv.org/pdf/1406.1078v3.pdf

[7] https://arxiv.org/pdf/1508.03790v2.pdf

[8] https://arxiv.org/pdf/1402.3511v1.pdf

[9] https://arxiv.org/pdf/1503.04069.pdf

[10] http://proceedings.mlr.press/v37/jozefowicz15.pdf

[11] https://arxiv.org/pdf/1502.03044v2.pdf

[12] https://arxiv.org/pdf/1507.01526v1.pdf

[13] https://arxiv.org/pdf/1502.04623.pdf

[14] https://arxiv.org/pdf/1506.02216.pdf

[15] https://arxiv.org/pdf/1411.7610v3.pdf

In [0]:
from IPython.core.display import HTML
def css_styling():
    styles = open("../styles/custom.css", "r").read()
    return HTML(styles)
css_styling()