##### Copyright 2018 The TensorFlow Authors.

Licensed under the Apache License, Version 2.0 (the "License");

In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License"); { display-mode: "form" }
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Long Short-Term Memory Network (LSTM) with TFP

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/probability/blob/master/tensorflow_probability/examples/jupyter_notebooks/LSTM_TFP.ipynb"><img height="32px" src="https://colab.research.google.com/img/colab_favicon.ico" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/jupyter_notebooks/LSTM_TFP.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>
<br>
<br>
<br>

Original content from [this repository](https://github.com/blei-lab/edward/blob/master/examples/lstm.py), created by [the Blei Lab](http://www.cs.columbia.edu/~blei/).

Ported to [TensorFlow Probability](https://www.tensorflow.org/probability/) by Matthew McAteer ([`@MatthewMcAteer0`](https://twitter.com/MatthewMcAteer0)), with help from Bryan Seybold, Mike Shwe ([`@mikeshwe`](https://twitter.com/mikeshwe)), Josh Dillon, and the rest of the TFP team at Google ([`tfprobability@tensorflow.org`](mailto:tfprobability@tensorflow.org)).

---

- Dependencies & Prerequisites
  - Introduction
  - Recurrent Neural Networks
  - The Problem of Long-term Dependencies
  - LSTM Networks
  - The Core Idea Behind LSTMs
  - Step-by-Step LSTM Walk Through
- An LSTM language model run on text8
  - Data
  - Model
  - Inference
  - Variants on Long Short Term Memory
  - Conclusions
- References


## Dependencies & Prerequisites

<div class="alert alert-success">
    TensorFlow Probability is part of the Colab default runtime, <b>so you don't need to install TensorFlow or TensorFlow Probability if you're running this in Colab</b>.
    <br>
    If you're running this notebook in Jupyter on your own machine (and you have already installed TensorFlow), you can use one of the following:
    <br>
      <ul>
    <li> For the most recent nightly installation: <code>pip3 install -q tfp-nightly</code></li>
    <li> For the most recent stable TFP release: <code>pip3 install -q --upgrade tensorflow-probability</code></li>
    <li> For the most recent stable GPU-connected version of TFP: <code>pip3 install -q --upgrade tensorflow-probability-gpu</code></li>
    <li> For the most recent nightly GPU-connected version of TFP: <code>pip3 install -q tfp-nightly-gpu</code></li>
    </ul>
Again, if you are running this in Colab, TensorFlow and TFP are already installed.
</div>

In [0]:
#@title Imports and Global Variables  { display-mode: "form" }
from __future__ import absolute_import, division, print_function
!pip3 install -q observations
from observations import text8
!pip3 install -q corner

#@markdown This sets the warning status (default is `ignore`, since this notebook runs correctly)
warning_status = "ignore" #@param ["ignore", "always", "module", "once", "default", "error"]
import warnings
warnings.filterwarnings(warning_status)
# Filters set inside `warnings.catch_warnings()` are discarded when the block
# exits, so apply these at module level instead.
warnings.filterwarnings(warning_status, category=DeprecationWarning)
warnings.filterwarnings(warning_status, category=UserWarning)

import functools
import six
import sys
import time
import numpy as np
import string
from datetime import datetime
import os
#@markdown This sets the styles of the plotting (default is styled like plots from [FiveThirtyeight.com](https://fivethirtyeight.com/))
matplotlib_style = 'fivethirtyeight' #@param ['fivethirtyeight', 'bmh', 'ggplot', 'seaborn', 'default', 'Solarize_Light2', 'classic', 'dark_background', 'seaborn-colorblind', 'seaborn-notebook']
import matplotlib.pyplot as plt; plt.style.use(matplotlib_style)
import matplotlib.axes as axes;
from matplotlib.patches import Ellipse
%matplotlib inline
import seaborn as sns; sns.set_context('notebook')
from IPython.core.pylabtools import figsize
#@markdown This sets the resolution of the plot outputs (`retina` is the highest resolution)
notebook_screen_res = 'retina' #@param ['retina', 'png', 'jpeg', 'svg', 'pdf']
%config InlineBackend.figure_format = notebook_screen_res

import tensorflow as tf
tfe = tf.contrib.eager

# Eager Execution
#@markdown Check the box below if you want to use [Eager Execution](https://www.tensorflow.org/guide/eager)
#@markdown Eager execution provides an intuitive interface, easier debugging, and a control flow comparable to Numpy. You can read more about it on the [Google AI Blog](https://ai.googleblog.com/2017/10/eager-execution-imperative-define-by.html)
use_tf_eager = False #@param {type:"boolean"}
# Use try/except so we can easily re-execute the whole notebook.
if use_tf_eager:
    try:
        tf.enable_eager_execution()
    except:
        pass

import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

  
def evaluate(tensors):
  """Evaluates Tensor or EagerTensor to Numpy `ndarray`s.
  Args:
    tensors: Object of `Tensor` or `EagerTensor`s; can be `list`, `tuple`,
      `namedtuple` or combinations thereof.
 
  Returns:
    ndarrays: Object with same structure as `tensors` except with `Tensor` or
      `EagerTensor`s replaced by Numpy `ndarray`s.
  """
  if tf.executing_eagerly():
      return tf.contrib.framework.nest.pack_sequence_as(
          tensors,
          [t.numpy() if tf.contrib.framework.is_tensor(t) else t
           for t in tf.contrib.framework.nest.flatten(tensors)])
  return sess.run(tensors)

class _TFColor(object):
    """Enum of colors used in TF docs."""
    red = '#F15854'
    blue = '#5DA5DA'
    orange = '#FAA43A'
    green = '#60BD68'
    pink = '#F17CB0'
    brown = '#B2912F'
    purple = '#B276B2'
    yellow = '#DECF3F'
    gray = '#4D4D4D'
    def __getitem__(self, i):
        return [
            self.red,
            self.orange,
            self.green,
            self.blue,
            self.pink,
            self.brown,
            self.purple,
            self.yellow,
            self.gray,
        ][i % 9]
TFColor = _TFColor()

def session_options(enable_gpu_ram_resizing=True, enable_xla=True):
    """
    Allowing the notebook to make use of GPUs if they're available.
    
    XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear 
    algebra that optimizes TensorFlow computations.
    """
    config = tf.ConfigProto()
    config.log_device_placement = True
    if enable_gpu_ram_resizing:
        # `allow_growth=True` makes it possible to connect multiple colabs to your
        # GPU. Otherwise the colab malloc's all GPU ram.
        config.gpu_options.allow_growth = True
    if enable_xla:
        # Enable XLA. https://www.tensorflow.org/performance/xla/.
        config.graph_options.optimizer_options.global_jit_level = (
            tf.OptimizerOptions.ON_1)
    return config


def reset_sess(config=None):
    """
    Convenience function to create the TF graph & session or reset them.
    """
    if config is None:
        config = session_options()
    global sess
    tf.reset_default_graph()
    try:
        sess.close()
    except:
        pass
    sess = tf.InteractiveSession(config=config)

reset_sess()


class Progbar(object):
    def __init__(self, target, width=30, interval=0.01, verbose=1):
        """Progress bar for displaying remaining time for given operations.
        Args:
          target: int.
            Total number of steps expected.
          width: int.
            Width of progress bar.
          interval: float.
            Minimum time (in seconds) for progress bar to be displayed
            during updates.
          verbose: int.
            Level of verbosity. 0 suppresses output; 1 is default.
        """
        self.target = target
        self.width = width
        self.interval = interval
        self.verbose = verbose

        self.stored_values = {}
        self.start = time.time()
        self.last_update = 0
        self.total_width = 0
        self.seen_so_far = 0

    def update(self, current, values=None, force=False):
        """Update progress bar, and print to standard output if `force`
        is True, or the last update was completed longer than `interval`
        amount of time ago, or `current` >= `target`.
        The written output is the progress bar and all unique values.
        Args:
          current: int.
            Index of current step.
          values: dict of str to float.
            Dict of name by value-for-last-step. The progress bar
            will display averages for these values.
          force: bool.
            Whether to force visual progress update.
        """
        if values is None:
            values = {}

        for k, v in six.iteritems(values):
            self.stored_values[k] = v

        self.seen_so_far = current

        now = time.time()
        if (not force and
                (now - self.last_update) < self.interval and
                current < self.target):
            return

        self.last_update = now
        if self.verbose == 0:
            return

        prev_total_width = self.total_width
        sys.stdout.write("\b" * prev_total_width)
        sys.stdout.write("\r")

        # Write progress bar to stdout.
        n_digits = len(str(self.target))
        bar = '%%%dd/%%%dd' % (n_digits, n_digits) % (current, self.target)
        bar += ' [{0}%]'.format(str(int(current / self.target * 100)).rjust(3))
        bar += ' '
        prog_width = int(self.width * float(current) / self.target)
        if prog_width > 0:
            try:
                bar += ('█' * prog_width)
            except UnicodeEncodeError:
                bar += ('*' * prog_width)

        bar += (' ' * (self.width - prog_width))
        sys.stdout.write(bar)

        # Write values to stdout.
        if current:
            time_per_unit = (now - self.start) / current
        else:
            time_per_unit = 0

        eta = time_per_unit * (self.target - current)
        info = ''
        if current < self.target:
            info += ' ETA: %ds' % eta
        else:
            info += ' Elapsed: %ds' % (now - self.start)

        for k, v in six.iteritems(self.stored_values):
            info += ' | {0:s}: {1:0.3f}'.format(k, v)

        self.total_width = len(bar) + len(info)
        if prev_total_width > self.total_width:
            info += ((prev_total_width - self.total_width) * " ")

        sys.stdout.write(info)
        sys.stdout.flush()

        if current >= self.target:
            sys.stdout.write("\n")


## Introduction

Many processes involve time-series data, and modeling them often calls for an architecture like a Recurrent Neural Network (RNN). The problem is that a plain RNN often produces predictions that resemble short-term patterns in the data but not patterns that unfold over longer periods (such as characters forming coherent words, or words forming coherent sentences). An LSTM addresses this by being able to remember patterns over longer stretches of timesteps.

Before getting our data, we need to define the directories where we'll store the data, as well as the hyperparameters for our training (number of training epochs, batch size, number of timesteps, hidden layer size, and learning rate).

In [0]:
#@title Hyperparameters
#@markdown Set seed (for reproducibility). Change this value to generate different samples!
random_seed = 77       #@param {type:"number"}
tf.set_random_seed(random_seed)

#@markdown data directory for training data (Default is '/tmp/data')
data_dir = '/tmp/data' #@param {type:"string"}
#@markdown data directory for inference logs (Default is '/tmp/log')
log_dir = '/tmp/log'   #@param {type:"string"}
#@markdown number of epochs (Default is 200)
n_epoch = 200          #@param {type:"number"}
#@markdown batch size (Default is 128)
batch_size = 128       #@param {type:"number"}
#@markdown hidden size (Default is 512)
hidden_size = 512      #@param {type:"number"}
#@markdown timesteps for LSTM (Default is 64)
timesteps = 64         #@param {type:"number"}
#@markdown learning rate for the Adam optimizer (Default is 5e-3)
lr = 5e-3              #@param {type:"number"}

timestamp = datetime.strftime(datetime.utcnow(), "%Y%m%d_%H%M%S")
hyperparam_str = '_'.join([var + '_' + str(eval(var)).replace('.', '_') for var in ['batch_size', 'hidden_size', 'timesteps', 'lr']])
log_dir = os.path.join(log_dir, timestamp + '_' + hyperparam_str)
if not os.path.exists(log_dir):
    os.makedirs(log_dir)
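
The run name built in the cell above relies on `eval` to look up each variable. As a minimal sketch, the same string can be assembled from an explicit dictionary instead. The names `hyperparams` and `make_run_name` are illustrative and not part of the notebook, and the values are assumed to be the defaults.

```python
def make_run_name(hyperparams):
    """Join name/value pairs into a filesystem-friendly string."""
    parts = []
    for name, value in hyperparams.items():
        # Replace '.' so that floats such as 0.005 do not look like file extensions.
        parts.append(name + '_' + str(value).replace('.', '_'))
    return '_'.join(parts)

# Default values copied from the hyperparameter cell (assumed unchanged).
hyperparams = {'batch_size': 128, 'hidden_size': 512, 'timesteps': 64, 'lr': 5e-3}
run_name = make_run_name(hyperparams)
print(run_name)
```

Keeping the values in one dictionary avoids evaluating strings as code and makes it harder for the log directory name to drift from the values actually used.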

## An LSTM language model run on text8

We're going to run our LSTM on a dataset called `text8`. `text8` is the first $10^8$ bytes ($\text{100 MB}$) of a dataset called `fil9`. `fil9`, in turn, is the result of taking the $\text{1 GB}$ file `enwik9` and filtering it to produce a $\text{715 MB}$ file. Where does `enwik9` come from?

`fil9` resulted from [research by Matt Mahoney's group](http://mattmahoney.net/dc/textdata.html) on the entropy of "clean" written English (English in a $27$-character alphabet containing only the letters a-z and nonconsecutive spaces has been estimated to carry between $0.6$ and $1.3$ bits per character). For this work, the Wikipedia text (`enwik9`, $\text{1 GB}$) was filtered down to the roughly $\text{715 MB}$ `fil9`. The Wikipedia text itself is the first $10^9$ bytes of the English Wikipedia dump of Mar. 3, 2006, which is the test data for the [Large Text Compression Benchmark](http://mattmahoney.net/dc/text.html).

In [0]:
x_train, _, x_test = text8(data_dir)
vocab = string.ascii_lowercase + ' '
vocab_size = len(vocab)

# Our encoder is a dictionary with letters/symbols as the keys and integer indexes as the values:
# {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 
#  'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 
#  'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 
#  'z': 25, ' ': 26}
encoder = dict(zip(vocab, range(vocab_size)))

# Our decoder is the reverse of the above dictionary, with the numbers being the 
# keys and the letters being the values
decoder = {v: k for k, v in encoder.items()}

>> Downloading /tmp/data/text8.zip.part 
>> [29.9 MB/29.9 MB] 100% @1.9 MB/s,[0s remaining, 15s elapsed]        
URL http://mattmahoney.net/dc/text8.zip downloaded to /tmp/data/text8.zip 
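
To make the role of the two dictionaries concrete, here is a small round-trip sketch. It rebuilds the same `vocab`, `encoder`, and `decoder` so that it runs on its own; the helpers `encode` and `decode` are illustrative names, not functions from the notebook.

```python
import string

vocab = string.ascii_lowercase + ' '
encoder = dict(zip(vocab, range(len(vocab))))
decoder = {index: char for char, index in encoder.items()}

def encode(text):
    """Map each character to its integer index."""
    return [encoder[char] for char in text]

def decode(indices):
    """Map integer indexes back to a string."""
    return ''.join(decoder[index] for index in indices)

ids = encode('hello world')
print(ids)
print(decode(ids))
```

Any character outside the 27-symbol vocabulary would raise a `KeyError`, which is acceptable here because `text8` contains only lowercase letters and spaces.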



Next, we'll set up our LSTM cell. LSTMs are a special kind of RNN capable of handling long-term dependencies. They also help with the vanishing/exploding gradient problem.


<img src="https://github.com/matthew-mcateer/external_project_images/blob/master/LSTM_8.PNG?raw=true" width="800">

*LSTM cells connected*

<img src="https://github.com/matthew-mcateer/external_project_images/blob/master/LSTM_3.PNG?raw=true" width="500">

*LSTM cell internal visual representation*

**$f$: *Forget gate***, whether to erase the cell

**$i$: *Input gate***, whether to write to the cell

**$g$: *Gate gate***, how much to write to the cell

**$o$: *Output gate***, how much to reveal the cell

Let’s discuss the gates:

**Forget Gate:** Given the **previous hidden state, $h_{t-1}$**, and the current input $x_t$, the forget gate decides what should be removed from the previous cell state $C_{t-1}$, keeping only what is still relevant. It uses a sigmoid function, which squashes its input into $(0,1)$. It is represented as:
$$
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
$$
<img src="https://github.com/matthew-mcateer/external_project_images/blob/master/LSTM_4.PNG?raw=true" width="500">

*Forget Gate*

We multiply the forget gate element-wise with the previous cell state, $f_t \ast C_{t-1}$, to drop the parts of the previous state that are no longer needed.


**Input Gate:** The input gate decides how much new information from the present input to add to the present cell state.
$$
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
$$
<img src="https://github.com/matthew-mcateer/external_project_images/blob/master/LSTM_5.PNG?raw=true" width="500">

*Input gate and gate gate*

In the figure above, the *sigmoid layer decides which values to update* and the *tanh layer creates a vector of new candidate values, $\tilde{C}_t$, to be added to the present cell state*.

**Gate Gate:** To calculate the present cell state, we add the gated candidate (`input_gate * gate_gate`) to the retained part of the previous cell state (`forget_gate * previous_cell_state`):
$$
C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t
$$

**Output Gate:** Finally, we decide what to output from the cell state; a sigmoid layer makes this decision.

We pass the cell state through $\tanh$ to squash its values into $(-1,1)$, then multiply the result by the output of the sigmoid layer so that we only output the parts we want.
$$
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
h_t = o_t \ast \tanh(C_t)
$$
<img src="https://github.com/matthew-mcateer/external_project_images/blob/master/LSTM_6.PNG?raw=true" width="500">

*Output Gate*
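
The four gate equations can be traced by hand with a scalar toy version. The sketch below uses plain Python rather than TensorFlow, and every weight and bias in it is a hypothetical value chosen only for illustration; it is not the notebook's cell.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w, b):
    """One scalar LSTM step.

    `w[gate]` is a pair (weight on h_prev, weight on x_t) and `b[gate]` a bias,
    for gate in 'f', 'i', 'g', 'o'.
    """
    def pre(gate):
        return w[gate][0] * h_prev + w[gate][1] * x_t + b[gate]

    f_t = sigmoid(pre('f'))      # forget gate
    i_t = sigmoid(pre('i'))      # input gate
    g_t = math.tanh(pre('g'))    # candidate values (the "gate gate")
    o_t = sigmoid(pre('o'))      # output gate
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

# Hypothetical weights and biases, chosen only for illustration.
w = {'f': (0.5, -0.3), 'i': (0.1, 0.8), 'g': (-0.4, 0.9), 'o': (0.2, 0.2)}
b = {'f': 1.0, 'i': 0.0, 'g': 0.0, 'o': 0.0}
h, c = 0.0, 0.0
for x_t in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x_t, h, c, w, b)
    print(h, c)
```

Because $h_t$ is a sigmoid output times a $\tanh$ output, it always stays inside $(-1, 1)$, while the cell state $C_t$ is not bounded in the same way.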





In [0]:
def lstm_cell(x, h, c, name=None, reuse=False):
    """
    LSTM cell returning the hidden state and the cell state for a single timestep.
    """
    nin = x.shape[-1].value
    nout = h.shape[-1].value
    with tf.variable_scope(name, default_name="lstm",
                         values=[x, h, c], reuse=reuse):
        wx = tf.get_variable("kernel/input", [nin, nout * 4],
                             dtype=tf.float32,
                             initializer=tf.orthogonal_initializer(1.0))
        wh = tf.get_variable("kernel/hidden", [nout, nout * 4],
                             dtype=tf.float32,
                             initializer=tf.orthogonal_initializer(1.0))
        b = tf.get_variable("bias", [nout * 4],
                            dtype=tf.float32,
                            initializer=tf.constant_initializer(0.0))

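    # Compute all four gate pre-activations with a single pair of matmuls, then
    # split them into input (i), forget (f), output (o) and candidate (u) parts.
    z = tf.matmul(x, wx) + tf.matmul(h, wh) + b
    i, f, o, u = tf.split(z, 4, axis=1)
    i = tf.sigmoid(i)
    # Adding 1.0 biases the forget gate towards remembering early in training.
    f = tf.sigmoid(f + 1.0)
    o = tf.sigmoid(o)
    u = tf.tanh(u)
    # New cell state: keep part of the old state, add part of the candidate.
    c = f * c + i * u
    h = o * tf.tanh(c)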
    return h, c

As we can see, this is far more complex than a typical RNN cell. How does all this address the vanishing gradient problem?

1. Along the cell-state path there is no multiplication with the weight matrix $W$ during backprop; the gradient flowing from $C_t$ to $C_{t-1}$ is only multiplied element-wise by the forget gate ($f$). This has the added bonus of being cheaper than a matrix multiplication.

2. During backpropagation through each LSTM cell, the gradient is multiplied by a different forget-gate value at each timestep, which makes it less prone to vanishing or exploding than repeated multiplication by the same matrix.

<img src="https://github.com/matthew-mcateer/external_project_images/blob/master/LSTM_9.PNG?raw=true" width="800">
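
The two points above can be illustrated numerically. This is a simplified scalar sketch, not a derivation: it compares repeated multiplication by a single recurrent weight with the product of per-step forget-gate values along the cell-state path. The weights and gate values are hypothetical.

```python
def product(factors):
    result = 1.0
    for factor in factors:
        result *= factor
    return result

timesteps = 64

# Simple RNN (scalar case): the same recurrent weight is applied at every step,
# so the gradient is scaled by w ** timesteps (ignoring the tanh derivative,
# which is at most 1 and only shrinks it further).
rnn_small = product([0.9] * timesteps)
rnn_large = product([1.1] * timesteps)

# LSTM cell-state path: d C_T / d C_0 is the product of the forget gates,
# each of which is a separate, input-dependent value in (0, 1).
forget_gates = [0.99, 0.97, 0.999, 0.98] * (timesteps // 4)   # hypothetical values
lstm_path = product(forget_gates)

print(rnn_small, rnn_large, lstm_path)
```

Nothing forces the forget gates to stay near 1, so an LSTM can still lose gradient signal; the point is only that the network can learn to keep them high where long-range memory is useful, and that they can never exceed 1 along this path.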

## Setting up our data iterator

For our iterator, we're going to use the [`tf.data`](https://www.tensorflow.org/guide/datasets) API to frame `x_train` and construct our iterator; `x_test` is encoded directly with NumPy afterwards.


In [0]:
def generator_fn(input, batch_size, timesteps, encoder):
    """
    Generate batches from `input` (a string or list of characters). Encode the
    characters to integers, yielding arrays of shape [batch_size, timesteps].
    """
    while True:
        # `imb` holds `batch_size` random start indexes along the length of the input
        imb = np.random.randint(0, len(input) - timesteps, batch_size)
        # `encoded` is a size (`batch_size`, `timesteps`) array of the character 
        # encodings
        encoded = np.asarray(
            [[encoder[c] for c in input[i:(i + timesteps)]] for i in imb],
            dtype=np.int32)
        # Yield inside the loop so the generator produces an endless stream of batches.
        yield encoded

In [0]:
# Sets up the training dataset. The generator already yields full batches of
# shape [batch_size, timesteps], so no further `.batch()` call is needed.
train_dataset = tf.data.Dataset.from_generator(
    functools.partial(generator_fn, x_train, batch_size, timesteps, encoder),
    output_types=tf.int64,
    output_shapes=(tf.TensorShape([batch_size, timesteps])))

# Sets up the iterator for the training data, which only looks at one batch
# at a time
train_iterator = train_dataset.make_initializable_iterator()

# `x_ph` represents the `.get_next()` for the iterator
x_ph = train_iterator.get_next()



In [0]:
imb = range(0, len(x_test) - timesteps, timesteps)
encoded_x_test = np.asarray(
      [[encoder[c] for c in x_test[i:(i + timesteps)]] for i in imb],
      dtype=np.int32)

test_size = encoded_x_test.shape[0]
print("Test set shape: {}".format(encoded_x_test.shape))

Test set shape: (78124, 64)
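
The row count in the printed shape follows from how the start offsets are generated. The sketch below reproduces the arithmetic, assuming `x_test` holds 5,000,000 characters; that length is an assumption inferred from the printed shape, not something the notebook states. `window_starts` is an illustrative helper.

```python
def window_starts(n_chars, timesteps):
    """Start offsets used by the test-set cell: range(0, n - timesteps, timesteps)."""
    return range(0, n_chars - timesteps, timesteps)

n_chars = 5000000      # assumed length of x_test, inferred from the printed shape
timesteps = 64

starts = window_starts(n_chars, timesteps)
print(len(starts))             # number of rows in encoded_x_test
print(n_chars // timesteps)    # number of full windows that would fit
```

Under that assumption the `range` stops one window short of the end of the text, because its stop value `n_chars - timesteps` is exclusive. Losing a single window has no practical effect on the evaluation.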


### TFP in LSTMs

Within the language model, we're going to feed the logits to a Categorical distribution from TFP.

The model factorizes the joint probability of a sequence autoregressively:
$$
    p(x_{0}, \ldots, x_{\text{timesteps} - 1}) = \prod_{t=0}^{\text{timesteps} - 1} p(x_t \mid x_{<t}).
$$
To calculate the probability, we call `log_prob` on

$x = \{x_{0}, \ldots, x_{\text{timesteps} - 1}\}$ given $\text{input} = \{0, x_{0}, \ldots, x_{\text{timesteps} - 2}\}$.

We implement this separately from the generative model so that the parts of the forward pass that do not depend on the recurrence (e.g., the embedding and dense layers) can be parallelized across timesteps. The input is a `[batch_size, timesteps]` tensor of character indexes, and the resulting distribution has batch shape `[batch_size, timesteps]`.
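
As a rough illustration of the two ingredients described above, the sketch below shifts a target sequence right by one position and evaluates a categorical log-probability from logits. It uses plain Python instead of TFP, and `shift_right` and `categorical_log_prob` are illustrative helpers that only mimic what `tf.pad` and `tfd.Categorical(...).log_prob` do for a single sequence.

```python
import math

def shift_right(sequence, pad_value=0):
    """[x0, x1, ..., x_{T-1}] -> [pad, x0, ..., x_{T-2}]."""
    return [pad_value] + list(sequence[:-1])

def categorical_log_prob(logits, index):
    """log softmax(logits)[index], computed stably."""
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return logits[index] - log_norm

targets = [7, 4, 11, 11, 14]           # 'hello' under the notebook's encoder
inputs = shift_right(targets)          # what the model sees at each timestep
print(inputs)

# With all-equal logits every character has probability 1 / vocab_size.
vocab_size = 27
uniform_logits = [0.0] * vocab_size
sequence_nll = -sum(categorical_log_prob(uniform_logits, t) for t in targets)
print(sequence_nll)
```

The uniform case gives a useful baseline: an untrained model should score close to $\log 27$ nats per character, and training should push the negative log-likelihood below that.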


In [0]:
def language_model(input, vocab_size):
    """
    Our language model: maps input characters to a distribution over the next
    character at every timestep.

    Args:
      input: integer `Tensor` of shape [batch_size, timesteps] holding the
        (right-shifted) character indexes
      vocab_size: number of distinct characters in the vocabulary
    Returns:
      `tfd.Categorical` distribution with batch shape [batch_size, timesteps]
    Closure over: hidden_size, timesteps
    """
    x = tf.one_hot(input, depth=vocab_size, dtype=tf.float32)
    h = tf.fill(tf.stack([tf.shape(x)[0], hidden_size]), 0.0)
    c = tf.fill(tf.stack([tf.shape(x)[0], hidden_size]), 0.0)
    hs = []
    reuse = None
    for t in range(timesteps):
        if t > 0:
            reuse = True
        xt = x[:, t, :]
        h, c = lstm_cell(xt, h, c, name="lstm", reuse=reuse)
        hs.append(h)

    h = tf.stack(hs, 1)
    logits = tf.layers.dense(h, vocab_size, name="dense")
    output = tfd.Categorical(logits=logits)
    return output


We then define the generator for the language model, which can be summarized with the following relationship:
$$
x \sim \prod_t p(x_t \mid x_{<t})
$$
From this we get an output of shape `[batch_size, timesteps]`.

In [0]:
def language_model_gen(batch_size, vocab_size):
    """
    Generates x ~ prod p(x_t | x_{<t}). Output [batch_size, timesteps].
    """
    # Initialize data input randomly.
    x = tf.random_uniform([batch_size], 0, vocab_size, dtype=tf.int32)
    h = tf.zeros([batch_size, hidden_size])
    c = tf.zeros([batch_size, hidden_size])
    xs = []
    for _ in range(timesteps):
        x = tf.one_hot(x, depth=vocab_size, dtype=tf.float32)
        h, c = lstm_cell(x, h, c, name="lstm")
        logits = tf.layers.dense(h, vocab_size, name="dense")
        x = tfd.Categorical(logits=logits).sample()  
        xs.append(x)

    xs = tf.cast(tf.stack(xs, 1), tf.int32)
    return xs
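
The generation loop above is ancestral sampling: each sampled character becomes the next input. The sketch below shows the same loop in plain Python. `toy_next_logits` is a hypothetical stand-in for the LSTM and dense layer, and unlike the real model it carries no hidden state between steps.

```python
import math
import random

def sample_categorical(logits, rng):
    """Draw an index with probability softmax(logits)[index]."""
    max_logit = max(logits)
    weights = [math.exp(l - max_logit) for l in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def toy_next_logits(previous_index, vocab_size):
    """Hypothetical stand-in for the LSTM + dense layer: favors the next letter."""
    logits = [0.0] * vocab_size
    logits[(previous_index + 1) % vocab_size] = 3.0
    return logits

def generate(timesteps, vocab_size, rng):
    x = rng.randrange(vocab_size)        # random first input, as in the notebook
    xs = []
    for _ in range(timesteps):
        x = sample_categorical(toy_next_logits(x, vocab_size), rng)
        xs.append(x)                     # each sample becomes the next input
    return xs

rng = random.Random(77)
sample = generate(timesteps=16, vocab_size=27, rng=rng)
print(sample)
```

Because each step depends on the previous sample, this loop cannot be parallelized across timesteps, which is why the notebook keeps it separate from the likelihood computation.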

### Model and Inference

For our optimization, we will use the Adam optimizer (a variant of gradient descent) to minimize the **n**egative **l**og-**l**ikelihood of the training batches, `train_nll`; `test_nll` is the same quantity, intended for held-out data.

In [0]:
#x_ph = next_train_batch

with tf.variable_scope("language_model"):
    # Shift input sequence to right by 1, [0, x[0], ..., x[timesteps - 2]].
    x_ph_shift = tf.pad(x_ph, [[0, 0], [1, 0]])[:, :-1]
    x = language_model(x_ph_shift, vocab_size)

with tf.variable_scope("language_model", reuse=True):
    x_gen = language_model_gen(5, vocab_size)

In [0]:
# The TEST Negative Log-likelihood will be used when the test data 
# is assigned to x_ph
test_nll = -tf.reduce_sum(x.log_prob(x_ph))

In [0]:
# The TRAIN Negative Log-likelihood will be used when the training data 
# is assigned to x_ph
train_nll = -tf.reduce_sum(x.log_prob(x_ph))

optimizer = tf.train.AdamOptimizer(learning_rate=lr)
train_op = optimizer.minimize(train_nll)

print("Number of sets of parameters: {}".format(
      len(tf.trainable_variables())))
print("Number of parameters: {}".format(
      np.sum([np.prod(v.shape.as_list()) for v in tf.trainable_variables()])))
for v in tf.trainable_variables():
    print(v)

Number of sets of parameters: 5
Number of parameters: 1119771
<tf.Variable 'language_model/lstm/kernel/input:0' shape=(27, 2048) dtype=float32_ref>
<tf.Variable 'language_model/lstm/kernel/hidden:0' shape=(512, 2048) dtype=float32_ref>
<tf.Variable 'language_model/lstm/bias:0' shape=(2048,) dtype=float32_ref>
<tf.Variable 'language_model/dense/kernel:0' shape=(512, 27) dtype=float32_ref>
<tf.Variable 'language_model/dense/bias:0' shape=(27,) dtype=float32_ref>
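
The reported parameter count can be checked by hand from the variable shapes printed above. A short sketch of that arithmetic, using the default `hidden_size` of 512 and the 27-character vocabulary; the `shapes` dictionary and `size` helper are illustrative:

```python
vocab_size = 27
hidden_size = 512

# Shapes as printed for the five trainable variables.
shapes = {
    'lstm/kernel/input': (vocab_size, 4 * hidden_size),
    'lstm/kernel/hidden': (hidden_size, 4 * hidden_size),
    'lstm/bias': (4 * hidden_size,),
    'dense/kernel': (hidden_size, vocab_size),
    'dense/bias': (vocab_size,),
}

def size(shape):
    """Number of scalar entries in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

n_params = sum(size(shape) for shape in shapes.values())
print(n_params)
```

The factor of 4 comes from packing the input, forget, output, and candidate transformations into a single kernel, and the hidden-to-hidden kernel accounts for most of the parameters.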


Let's initialize our model training. 

In [0]:
init_op = tf.global_variables_initializer()

evaluate(init_op)
evaluate(train_iterator.initializer)

In [0]:
# Double n_epoch and print progress every half an epoch.
n_iter_per_epoch = len(x_train) // (batch_size * timesteps * 2)
epoch = 0.0
for _ in range(n_epoch * 2):
    epoch += 0.5
    print("Epoch: {0}".format(epoch))
    avg_nll = 0.0

    pbar = Progbar(n_iter_per_epoch)
    for t in range(1, n_iter_per_epoch + 1):
        pbar.update(t)
        #evaluate(assign_train_op)
        [_, train_nll_] = evaluate([train_op, train_nll])
        avg_nll += train_nll_

    # Print average bits per character over epoch.
    avg_nll /= (n_iter_per_epoch * batch_size * timesteps *
                np.log(2))
    print("Train average bits/char: {:0.8f}".format(avg_nll))

    ## Print per-data point log-likelihood on test set.
    #avg_nll = 0.0
    #for start in range(0, test_size, batch_size):
    #    end = min(test_size, start + batch_size)
    #    x_batch = encoded_x_test[start:end]
    #    x_ph = x_batch
    #    avg_nll += evaluate(test_nll)

    #avg_nll /= test_size
    #print("Test average NLL: {:0.8f}".format(avg_nll))

    # Generate samples from model.
    samples = evaluate(x_gen)
    samples = [''.join([decoder[c] for c in sample]) for sample in samples]
    print("Samples:")
    for sample in samples:
        print(sample)


The default hyperparameters above achieve a negative log-likelihood of about $78.4$ by epoch $50$, and about $76.1423$ at epoch $200$.

If you're impatient and just want to know what that gets you: after $200$ epochs, we should be getting samples like the following:
```
e the classmaker was cut apart rome the charts sometimes known a
hemical place baining examples of equipment accepted manner clas
uetean meeting sought to exist as this waiting an excerpt for of
erally enjoyed a film writer of unto one two volunteer humphrey
y captured by the saughton river goodness where stones were nota
```
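
The training loop reports bits per character rather than the raw negative log-likelihood. The conversion divides the summed NLL (in nats) by the number of characters and by $\ln 2$. A minimal sketch follows; `bits_per_char` is an illustrative helper, and the uniform model is only a reference point, not a result from the notebook.

```python
import math

def bits_per_char(total_nll_nats, n_chars):
    """Convert a summed negative log-likelihood in nats to bits per character."""
    return total_nll_nats / (n_chars * math.log(2))

vocab_size = 27
n_chars = 128 * 64     # one batch with the default batch_size and timesteps

# A model that assigns probability 1/27 to every character.
uniform_nll = n_chars * math.log(vocab_size)
print(bits_per_char(uniform_nll, n_chars))   # log2(27), roughly 4.75
```

A trained model should land well below this uniform baseline; the $0.6$ to $1.3$ bits per character quoted earlier for clean English gives a sense of how far a character-level model could in principle go.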

## References

[1] https://colah.github.io/posts/2015-08-Understanding-LSTMs/

[2] https://karpathy.github.io/2015/05/21/rnn-effectiveness/

[3] http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf

[4] http://www.bioinf.jku.at/publications/older/2604.pdf

[5] ftp://ftp.idsia.ch/pub/juergen/TimeCount-IJCNN2000.pdf

[6] https://arxiv.org/pdf/1406.1078v3.pdf

[7] https://arxiv.org/pdf/1508.03790v2.pdf

[8] https://arxiv.org/pdf/1402.3511v1.pdf

[9] https://arxiv.org/pdf/1503.04069.pdf

[10] http://proceedings.mlr.press/v37/jozefowicz15.pdf

[11] https://arxiv.org/pdf/1502.03044v2.pdf

[12] https://arxiv.org/pdf/1507.01526v1.pdf

[13] https://arxiv.org/pdf/1502.04623.pdf

[14] https://arxiv.org/pdf/1502.04623.pdf

[15] https://arxiv.org/pdf/1411.7610v3.pdf