Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train an LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [3]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [4]:
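def read_data(filename):
  """Return the contents of the first file in the zip archive as a string."""
  # The context manager closes the archive; the original f.close() sat after
  # the return statement and was never reached.
  with zipfile.ZipFile(filename) as f:
    return tf.compat.as_str(f.read(f.namelist()[0]))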
  
text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


Create a small validation set.

In [5]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl


Utility functions to map characters to vocabulary IDs and back.

In [2]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 1
  elif char == ' ':
    return 0
  else:
    print('Unexpected character: %s' % char)
    return 0
  
def id2char(dictid):
  if dictid > 0:
    return chr(dictid + first_letter - 1)
  else:
    return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  


Function to generate a training batch for the LSTM model.

In [7]:
batch_size=64
num_unrollings=10

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=float)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nat
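
The output above shows two properties of `BatchGenerator`: each of the `batch_size` rows reads from its own evenly spaced segment of the text, and consecutive calls to `next()` overlap by one character (the first window ends with 'ons anarchi' and the next starts with 'ists advoca'), because the last batch of one call is reused as the first batch of the next. A minimal pure-Python sketch of the same cursor logic on a toy string follows; the `windows` helper and the alphabet input are illustrative assumptions, not part of the notebook.

```python
def windows(text, batch_size, num_unrollings):
    """Yield lists of strings, mimicking the cursor logic of BatchGenerator.
    Hypothetical helper for illustration only."""
    segment = len(text) // batch_size
    # One cursor per batch row, spaced one segment apart.
    cursor = [offset * segment for offset in range(batch_size)]

    def step():
        # Read one character per row, then advance every cursor (wrapping around).
        chars = [text[c] for c in cursor]
        for b in range(batch_size):
            cursor[b] = (cursor[b] + 1) % len(text)
        return chars

    last = step()
    while True:
        # The previous window's last step is repeated as the first step here.
        steps = [last] + [step() for _ in range(num_unrollings)]
        last = steps[-1]
        yield [''.join(col) for col in zip(*steps)]

gen = windows('abcdefghijklmnopqrstuvwxyz', batch_size=2, num_unrollings=3)
first = next(gen)
second = next(gen)
print(first)   # ['abcd', 'nopq']
print(second)  # ['defg', 'qrst']
```

The repeated character is what lets the labels be the inputs shifted by one time step: the last label of one window becomes the first input of the next.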

In [8]:
def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=float)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]
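
The perplexity printed during training is the exponential of the mean negative log-probability returned by `logprob`. As a sanity check, a model that predicts a uniform distribution over the 27-character vocabulary should score a perplexity of 27, which is close to what an untrained network reports at step 0. The sketch below re-implements `logprob` without the in-place clipping; the label indices are arbitrary example values.

```python
import numpy as np

vocabulary_size = 27  # [a-z] + ' ', as defined earlier in the notebook

def logprob(predictions, labels):
    """Mean negative log-probability of the true labels (non-mutating variant)."""
    predictions = np.maximum(predictions, 1e-10)
    return np.sum(labels * -np.log(predictions)) / labels.shape[0]

# Four arbitrary 1-hot labels and a uniform prediction for each of them.
labels = np.eye(vocabulary_size)[[1, 0, 26, 5]]
uniform = np.full((4, vocabulary_size), 1.0 / vocabulary_size)

perplexity = float(np.exp(logprob(uniform, labels)))
print('%.2f' % perplexity)  # 27.00
```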

Simple LSTM Model.

In [58]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [8]:
num_steps = 7001
summary_frequency = 100

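def exec_graph(graph):
  """Train the model in `graph`, printing loss, perplexity and text samples.

  Note: the tensors and generators used here (train_data, optimizer, loss,
  sample_input, train_batches, ...) are read from the notebook's global scope,
  so call this right after running the cell that builds `graph`."""
  with tf.Session(graph=graph) as session: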
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
      batches = train_batches.next()
      feed_dict = dict()
      for i in range(num_unrollings + 1):
        feed_dict[train_data[i]] = batches[i]
      _, l, predictions, lr = session.run(
        [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
      mean_loss += l
      if step % summary_frequency == 0:
        if step > 0:
          mean_loss = mean_loss / summary_frequency
        # The mean loss is an estimate of the loss over the last few batches.
        print(
          'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
        mean_loss = 0
        labels = np.concatenate(list(batches)[1:])
        print('Minibatch perplexity: %.2f' % float(
          np.exp(logprob(predictions, labels))))
        if step % (summary_frequency * 10) == 0:
          # Generate some samples.
          print('=' * 80)
          for _ in range(5):
            feed = sample(random_distribution())
            sentence = characters(feed)[0]
            reset_sample_state.run()
            for _ in range(79):
              prediction = sample_prediction.eval({sample_input: feed})
              feed = sample(prediction)
              sentence += characters(feed)[0]
            print(sentence)
          print('=' * 80)
        # Measure validation set perplexity.
        reset_sample_state.run()
        valid_logprob = 0
        for _ in range(valid_size):
          b = valid_batches.next()
          predictions = sample_prediction.eval({sample_input: b[0]})
          valid_logprob = valid_logprob + logprob(predictions, b[1])
        print('Validation set perplexity: %.2f' % float(np.exp(
          valid_logprob / valid_size)))

In [59]:
%time exec_graph(graph)

Initialized
Average loss at step 0: 3.298081 learning rate: 10.000000
Minibatch perplexity: 27.06
obe ckt  fyq d i  oiepiyuebbewii whvuy berniw htjdmehmbfe vunrqdesndxr debjqimop
smxj eoczekebedn woutrpbaztaegn i idzhdxetkrcjkjktqeaelpjx c nhuumwile  uepiqefb
d hp j tnk faaxdeloa ijappzo e rtw yvdpk wlghdodizp e kureihvserki gltocgl tvp i
sbxi qnc neimanwtgrrngjqsuotwnzcet ir bh  cfbthz ay kkcxihofowfv pqdciiiowp  tre
gssbo mgbjkhmftiene pabavir gjkxkf fapkco yjtrqdiiopt jezuhzeqa ap d vd  obxiype
Validation set perplexity: 20.39
Average loss at step 100: 2.598796 learning rate: 10.000000
Minibatch perplexity: 10.96
Validation set perplexity: 10.39
Average loss at step 200: 2.246814 learning rate: 10.000000
Minibatch perplexity: 8.67
Validation set perplexity: 8.55
Average loss at step 300: 2.091723 learning rate: 10.000000
Minibatch perplexity: 7.28
Validation set perplexity: 7.86
Average loss at step 400: 1.990674 learning rate: 10.000000
Minibatch perplexity: 7.54
Validation set per

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.

---
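
The simplification rests on a basic property of matrix multiplication: concatenating the four per-gate weight matrices along their columns and multiplying once gives the same values as four separate multiplications, with each gate's result in its own column block. A small NumPy sketch of that identity follows; the shapes and random values are made up for illustration, and `np.allclose` is used rather than exact equality because floating-point summation order can differ between the two forms.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, vocab, nodes = 4, 27, 8  # illustrative sizes, not the notebook's

x = rng.normal(size=(batch, vocab))

# Four separate per-gate matrices (input, forget, cell update, output).
ix, fx, cx, ox = (rng.normal(size=(vocab, nodes)) for _ in range(4))

# One matrix that is four times wider, built by stacking the columns.
ifcox = np.concatenate([ix, fx, cx, ox], axis=1)

separate = [x @ w for w in (ix, fx, cx, ox)]  # four matmuls
fused = x @ ifcox                             # a single matmul
chunks = np.split(fused, 4, axis=1)           # slice back into per-gate blocks

for a, b in zip(separate, chunks):
    assert np.allclose(a, b)
print(fused.shape)  # (4, 32)
```

The same argument applies to the recurrent weights and to the biases, which is why three fused variables are enough to replace the original twelve.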

First, compute every gate both ways and check at each training step that the unoptimized (separate) and optimized (fused) versions produce the same tensors:

In [60]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # big matrix
  ifcox = tf.Variable(tf.truncated_normal([vocabulary_size, 4 * num_nodes], -0.1, 0.1))
  ifcom = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
  ifcob = tf.Variable(tf.zeros([1, 4 * num_nodes]))

  # small matrices
  ix = ifcox[:, :num_nodes]
  fx = ifcox[:, num_nodes:2*num_nodes]
  cx = ifcox[:, 2*num_nodes:3*num_nodes]
  ox = ifcox[:, 3*num_nodes:]
    
  im = ifcom[:, :num_nodes]
  fm = ifcom[:, num_nodes:2*num_nodes]
  cm = ifcom[:, 2*num_nodes:3*num_nodes]
  om = ifcom[:, 3*num_nodes:]
    
  ib = ifcob[:, :num_nodes]
  fb = ifcob[:, num_nodes:2*num_nodes]
  cb = ifcob[:, 2*num_nodes:3*num_nodes]
  ob = ifcob[:, 3*num_nodes:]
  
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate_1 = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate_1 = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update_1 = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state_1 = forget_gate_1 * state + input_gate_1 * tf.tanh(update_1)
    output_gate_1 = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    
    all_gates = tf.matmul(i, ifcox) + tf.matmul(o, ifcom) + ifcob
    input_gate_2 = tf.sigmoid(all_gates[:, 0:num_nodes])
    forget_gate_2 = tf.sigmoid(all_gates[:, num_nodes:2*num_nodes])
    update_2 = all_gates[:, 2*num_nodes:3*num_nodes]
    state_2 = forget_gate_2 * state + input_gate_2 * tf.tanh(update_2)
    output_gate_2 = tf.sigmoid(all_gates[:, 3*num_nodes:])
        
    return [(output_gate_2 * tf.tanh(state_2), state_2),
            [('input', input_gate_1, input_gate_2),
             ('forget', forget_gate_1, forget_gate_2),
             ('update', update_1, update_2),
             ('state', state_1, state_2),
             ('output', output_gate_1, output_gate_2)]]

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  equals = list()
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    out = lstm_cell(i, output, state)
    output, state = out[0]
    equals.extend(out[1])
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  out = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  sample_output, sample_state = out[0]
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [61]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    # check that the unoptimized and optimized tensors are really equal
    for name, t1, t2 in equals:
      m1, m2 = session.run([t1, t2], feed_dict=feed_dict)
      if not np.array_equal(m1, m2):
        print('first version:', name)
        print(m1)
        print('second version:', name)
        print(m2)
        raise Exception("not equal: " + name)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.290297 learning rate: 10.000000
Minibatch perplexity: 26.85
g cnnrpwhid ey v hehtt c juajnhfrmtbv reeae d mubeb o cpfrgeboybrmkaido   wm  mi
tlymtfsrs t hoksrhtperhdte  l bp  mzew selsqrr phhe   o l  hhd ea rzr miqnxstkct
mciysdrofyikla xweeeerpti hohchptrotjou zzzal ms nk eestnzwlth yzixeagkiewn o tt
trjreo putosk   qk hthvcqwuniytifestshzpsdil do lmtno enw fovxopmzhpzyxwejtsrcec
xleecyidbu ei t  bvkn hao  w eaeqvh sne v i iragoepduwo tssf ddqrtyyvtcd aizieec
Validation set perplexity: 20.08
Average loss at step 100: 2.584401 learning rate: 10.000000
Minibatch perplexity: 10.71
Validation set perplexity: 10.46
Average loss at step 200: 2.245747 learning rate: 10.000000
Minibatch perplexity: 8.33
Validation set perplexity: 8.79
Average loss at step 300: 2.084919 learning rate: 10.000000
Minibatch perplexity: 6.34
Validation set perplexity: 8.09
Average loss at step 400: 2.027262 learning rate: 10.000000
Minibatch perplexity: 7.75
Validation set per

Now remove the equality checks and keep only the fused version:

In [62]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # big matrix
  ifcox = tf.Variable(tf.truncated_normal([vocabulary_size, 4 * num_nodes], -0.1, 0.1))
  ifcom = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
  ifcob = tf.Variable(tf.zeros([1, 4 * num_nodes]))
  
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    all_gates = tf.matmul(i, ifcox) + tf.matmul(o, ifcom) + ifcob
    input_gate = tf.sigmoid(all_gates[:, 0:num_nodes])
    forget_gate = tf.sigmoid(all_gates[:, num_nodes:2*num_nodes])
    update = all_gates[:, 2*num_nodes:3*num_nodes]
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(all_gates[:, 3*num_nodes:])
        
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [63]:
%time exec_graph(graph)

Initialized
Average loss at step 0: 3.295034 learning rate: 10.000000
Minibatch perplexity: 26.98
oifs aqqttketi f hes itbxlzqv cbu epcn vojjohcnt  vircbagyp et  jll  wej  hxstoq
lqtf aei yooiyw fsovem pyasrrocachixm teln cd  nc jj h znsnca lrhardrfy f zljanq
ym uesxftka johopnmt qzl godw ncdeado  kxpihgsel  rze qjwcingnorlo xsryyydtkrvil
 ykl uffy mrbhdbmxxussaq s lra tvyh  qsrq ms  ihae lyeqanrigcofxpnndardzuesiyjib
 iacp iurmunoanonederb evncs cuh mvsmitudbctmv iqph zarmc rnhngxapfynnida ii txo
Validation set perplexity: 19.96
Average loss at step 100: 2.569330 learning rate: 10.000000
Minibatch perplexity: 12.36
Validation set perplexity: 10.67
Average loss at step 200: 2.231631 learning rate: 10.000000
Minibatch perplexity: 8.24
Validation set perplexity: 8.71
Average loss at step 300: 2.066028 learning rate: 10.000000
Minibatch perplexity: 7.06
Validation set perplexity: 7.99
Average loss at step 400: 1.986493 learning rate: 10.000000
Minibatch perplexity: 6.77
Validation set per

---
Problem 2
---------

We want to train an LSTM over bigrams, that is, pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is computationally wasteful.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).

---
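
For part (a), an embedding lookup is just row selection: multiplying a 1-hot row by a matrix picks out one row of that matrix, so indexing by bigram ID gives the same result without building the 729-wide sparse input. The NumPy sketch below illustrates this; `embedding_size`, the `bigram_id` helper and the sample bigrams are assumptions for illustration, while the ID formula matches the one used by `BigramBatchGenerator` below. In TensorFlow the equivalent operation is `tf.nn.embedding_lookup`.

```python
import string
import numpy as np

vocabulary_size = len(string.ascii_lowercase) + 1  # [a-z] + ' '
bigram_size = vocabulary_size * vocabulary_size    # 729 possible bigrams
embedding_size = 16                                # hypothetical choice

def char2id(char):
    """Same mapping as in the notebook: ' ' -> 0, 'a'..'z' -> 1..26."""
    return ord(char) - ord('a') + 1 if char in string.ascii_lowercase else 0

def bigram_id(pair):
    """Hypothetical helper: first char selects the block, second the offset."""
    return char2id(pair[0]) * vocabulary_size + char2id(pair[1])

rng = np.random.default_rng(0)
embeddings = rng.uniform(-1.0, 1.0, size=(bigram_size, embedding_size))

bigram_ids = np.array([bigram_id(p) for p in ('ab', ' a', 'zz')])
print(bigram_ids.tolist())  # [29, 1, 728]

# Dense route: build 1-hot rows, then multiply by the embedding matrix.
one_hot = np.eye(bigram_size)[bigram_ids]
via_matmul = one_hot @ embeddings

# Lookup route: index the rows directly.
via_lookup = embeddings[bigram_ids]

assert np.allclose(via_matmul, via_lookup)
```

With a lookup, the input weights of the LSTM only need shape `[embedding_size, 4 * num_nodes]` instead of `[bigram_size, 4 * num_nodes]`.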

Without embeddings and without dropout:

In [7]:
batch_size=64
num_unrollings=10
bigram_size = vocabulary_size * vocabulary_size

class BigramBatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, bigram_size), dtype=float)
    for b in range(self._batch_size):
      first_char = self._text[self._cursor[b]]
      if self._cursor[b] + 1 == self._text_size:
        second_char = ' '
      else:
        second_char = self._text[self._cursor[b] + 1]
      batch[b, char2id(first_char) * vocabulary_size + char2id(second_char)] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def bigram_characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  bigrams back into its (most likely) '(first,second)' string representation."""
  return ['({0},{1})'.format(id2char(c//vocabulary_size), id2char(c % vocabulary_size))
          for c in np.argmax(probabilities, 1)]

def bigram_first_characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  bigrams into the first character of its (most likely) bigram."""
  return [id2char(c//vocabulary_size)
          for c in np.argmax(probabilities, 1)]

def bigram_batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, bigram_characters(b))]
  return s

bigram_train_batches = BigramBatchGenerator(train_text, batch_size, num_unrollings)
bigram_valid_batches = BigramBatchGenerator(valid_text, 1, 1)

# output each bigram instead of a single char
print(bigram_batches2string(bigram_train_batches.next()))
print(bigram_batches2string(bigram_train_batches.next()))
print(bigram_batches2string(bigram_valid_batches.next()))
print(bigram_batches2string(bigram_valid_batches.next()))


['(o,n)(n,s)(s, )( ,a)(a,n)(n,a)(a,r)(r,c)(c,h)(h,i)(i,s)', '(w,h)(h,e)(e,n)(n, )( ,m)(m,i)(i,l)(l,i)(i,t)(t,a)(a,r)', '(l,l)(l,e)(e,r)(r,i)(i,a)(a, )( ,a)(a,r)(r,c)(c,h)(h,e)', '( ,a)(a,b)(b,b)(b,e)(e,y)(y,s)(s, )( ,a)(a,n)(n,d)(d, )', '(m,a)(a,r)(r,r)(r,i)(i,e)(e,d)(d, )( ,u)(u,r)(r,r)(r,a)', '(h,e)(e,l)(l, )( ,a)(a,n)(n,d)(d, )( ,r)(r,i)(i,c)(c,h)', '(y, )( ,a)(a,n)(n,d)(d, )( ,l)(l,i)(i,t)(t,u)(u,r)(r,g)', '(a,y)(y, )( ,o)(o,p)(p,e)(e,n)(n,e)(e,d)(d, )( ,f)(f,o)', '(t,i)(i,o)(o,n)(n, )( ,f)(f,r)(r,o)(o,m)(m, )( ,t)(t,h)', '(m,i)(i,g)(g,r)(r,a)(a,t)(t,i)(i,o)(o,n)(n, )( ,t)(t,o)', '(n,e)(e,w)(w, )( ,y)(y,o)(o,r)(r,k)(k, )( ,o)(o,t)(t,h)', '(h,e)(e, )( ,b)(b,o)(o,e)(e,i)(i,n)(n,g)(g, )( ,s)(s,e)', '(e, )( ,l)(l,i)(i,s)(s,t)(t,e)(e,d)(d, )( ,w)(w,i)(i,t)', '(e,b)(b,e)(e,r)(r, )( ,h)(h,a)(a,s)(s, )( ,p)(p,r)(r,o)', '(o, )( ,b)(b,e)(e, )( ,m)(m,a)(a,d)(d,e)(e, )( ,t)(t,o)', '(y,e)(e,r)(r, )( ,w)(w,h)(h,o)(o, )( ,r)(r,e)(e,c)(c,e)', '(o,r)(r,e)(e, )( ,s)(s,i)(i,g)(g,n)(n,i)(i,f)(f,i)(i,c

In [65]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # big matrix
  ifcox = tf.Variable(tf.truncated_normal([bigram_size, 4 * num_nodes], -0.1, 0.1))
  ifcom = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
  ifcob = tf.Variable(tf.zeros([1, 4 * num_nodes]))
  
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, bigram_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([bigram_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    all_gates = tf.matmul(i, ifcox) + tf.matmul(o, ifcom) + ifcob
    input_gate = tf.sigmoid(all_gates[:, 0:num_nodes])
    forget_gate = tf.sigmoid(all_gates[:, num_nodes:2*num_nodes])
    update = all_gates[:, 2*num_nodes:3*num_nodes]
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(all_gates[:, 3*num_nodes:])
        
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,bigram_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, bigram_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [8]:
def bigram_sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, bigram_size], dtype=np.float64)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def bigram_random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, bigram_size])
  return b/np.sum(b, 1)[:,None]

In [67]:
num_steps = 7001
summary_frequency = 100

def exec_graph_bigram(graph):
  with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
      batches = bigram_train_batches.next()
      feed_dict = dict()
      for i in range(num_unrollings + 1):
        feed_dict[train_data[i]] = batches[i]
      _, l, predictions, lr = session.run(
        [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
      mean_loss += l
      if step % summary_frequency == 0:
        if step > 0:
          mean_loss = mean_loss / summary_frequency
        # The mean loss is an estimate of the loss over the last few batches.
        print(
          'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
        mean_loss = 0
        labels = np.concatenate(list(batches)[1:])
        print('Minibatch perplexity: %.2f' % float(
          np.exp(logprob(predictions, labels))))
        if step % (summary_frequency * 10) == 0:
          # Generate some samples.
          print('=' * 80)
          for _ in range(5):
            feed = bigram_sample(bigram_random_distribution())
            bigram_sentence = bigram_characters(feed)[0]
            sentence = bigram_first_characters(feed)[0]
            reset_sample_state.run()
            for _ in range(79):
              prediction = sample_prediction.eval({sample_input: feed})
              feed = bigram_sample(prediction)
              bigram_sentence += bigram_characters(feed)[0]
              sentence += bigram_first_characters(feed)[0]
            print('bigrams:', bigram_sentence)
            print('chars:', sentence)
          print('=' * 80)
        # Measure validation set perplexity.
        reset_sample_state.run()
        valid_logprob = 0
        for _ in range(valid_size):
          b = bigram_valid_batches.next()
          predictions = sample_prediction.eval({sample_input: b[0]})
          valid_logprob = valid_logprob + logprob(predictions, b[1])
        print('Validation set perplexity: %.2f' % float(np.exp(
          valid_logprob / valid_size)))
        
%time exec_graph_bigram(graph)

Initialized
Average loss at step 0: 6.591460 learning rate: 10.000000
Minibatch perplexity: 728.84
bigrams: (m,c)(v, )( ,l)(j,y)(m,v)(l,s)(l,h)(b,w)(l,a)(n,k)(r,j)(d,j)(g,x)(i,v)(x,b)(g,a)(e,b)( ,r)(q, )(v,e)(v,r)(z,w)(k,k)(h,v)(e,a)(v, )(o,m)(u,c)(j,j)(h,n)(h,x)(v,c)(a,t)(b,o)(d,f)(i,s)(x,i)(y,h)(i,z)(e,v)(p,x)(h,s)( ,e)(e,s)(k,c)(a,k)(b,q)(i,s)(y,y)(x,p)(z,f)(o,c)(b,o)(m,e)(r,y)( ,o)(u,w)(i,g)(f,c)(c, )(j,f)(e,p)(n,j)(b,k)(w,g)(s,l)(o,n)(j,x)(d,a)(j,c)(o,a)(v,e)(j,a)(t,m)(y,l)(x,u)(k,j)(m,x)(q,d)(r,d)
chars: mv jmllblnrdgixge qvvzkhevoujhhvabdixyieph ekabiyxzobmr uifcjenbwsojdjovjtyxkmqr
bigrams: (x,r)(l,o)(z,a)(l, )(t,w)(a, )(r,o)(b,q)(v,m)(n,r)(v,h)(m,f)(k,a)(j,n)( ,p)(a,r)(e,i)( ,g)(p,p)(u,n)(w,i)(c,m)(v,i)(e,m)(b,x)(n,y)(e,b)(r,h)(t,t)(c,y)(y,t)(c, )(d,c)(r,a)(o,f)(o,q)(q,w)(e,c)(n,g)(q,i)(l,i)(g,j)(z,q)(u,l)(r,n)(b,y)(d,l)(d, )(y,e)(v,o)(x,z)(v,i)(p,b)(q,q)(o,p)(v,j)(l,o)(i,v)(k,e)(a,d)(l,t)(e,i)(v,s)(z,w)(v,g)(n, )(s,e)(s,c)(e,l)(c,f)(c,k)( ,a)(f,c)(n,k)(i,p)(h,m)(l,q)(u,k)(o,f

Perplexity is better!
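
As a sanity check on the step-0 numbers in the log above: an untrained model that spreads its probability roughly uniformly over all `bigram_size = 27 * 27 = 729` classes should have a cross-entropy near ln(729) and a perplexity near 729, which matches the logged 6.591460 and 728.84. The sketch below is only an illustration and is not part of the original notebook; `perplexity_from_loss` is a hypothetical helper name.

```python
import math

vocabulary_size = 27                    # [a-z] + ' ', as defined earlier in the notebook
bigram_size = vocabulary_size ** 2      # 729 possible bigrams

def perplexity_from_loss(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

# Loss of a model that assigns probability 1/729 to every bigram.
uniform_loss = math.log(bigram_size)

print('uniform loss: %.4f' % uniform_loss)
print('uniform perplexity: %.2f' % perplexity_from_loss(uniform_loss))
print('perplexity at logged loss 6.591460: %.2f' % perplexity_from_loss(6.591460))
```

Any perplexity well below 729 therefore means the model has learned something about which bigrams follow which.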

Add embeddings:
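
The batch generator in the next cell represents each bigram as a single integer, `char2id(first_char) * vocabulary_size + char2id(second_char)`, so it can be used as a row index into the weight matrix. The following sketch shows that encoding and its inverse. The helper names `bigram2id` and `id2bigram` are hypothetical, and `char2id` / `id2char` are re-declared here under the assumption that they follow the notebook's convention (space maps to 0, `a` to 1, ..., `z` to 26).

```python
import string

vocabulary_size = len(string.ascii_lowercase) + 1  # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
    # Assumed convention: ' ' -> 0, 'a' -> 1, ..., 'z' -> 26.
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    return 0

def id2char(dictid):
    return chr(dictid + first_letter - 1) if dictid > 0 else ' '

def bigram2id(first_char, second_char):
    """Pack two character IDs into one integer in [0, vocabulary_size ** 2)."""
    return char2id(first_char) * vocabulary_size + char2id(second_char)

def id2bigram(bigram_id):
    """Invert bigram2id using integer division and remainder."""
    return (id2char(bigram_id // vocabulary_size),
            id2char(bigram_id % vocabulary_size))

print(bigram2id('a', 'b'))
print(id2bigram(bigram2id('a', 'b')))
```

The encoding is the two-digit base-27 number formed by the two character IDs, which is why the largest ID is `27 * 27 - 1`.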

In [14]:
batch_size=64
num_unrollings=10
bigram_size = vocabulary_size * vocabulary_size

# Emit integer bigram IDs (for use with tf.nn.embedding_lookup) instead of one-hot encoded matrices.

class BigramEmbeddingBatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = list()
    for b in range(self._batch_size):
      first_char = self._text[self._cursor[b]]
      if self._cursor[b] + 1 == self._text_size:
        second_char = ' '
      else:
        second_char = self._text[self._cursor[b] + 1]
      batch.append(char2id(first_char) * vocabulary_size + char2id(second_char))
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

bigram_embed_train_batches = BigramEmbeddingBatchGenerator(train_text, batch_size, num_unrollings)
bigram_embed_valid_batches = BigramEmbeddingBatchGenerator(valid_text, 1, 1)


In [69]:
num_nodes = 64
embedding_size = 128

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Fused weights: the columns hold the input, forget, update and output gates, in that order.
  ifcox = tf.Variable(tf.truncated_normal([bigram_size, 4 * num_nodes], -0.1, 0.1))
  ifcom = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
  ifcob = tf.Variable(tf.zeros([1, 4 * num_nodes]))
  
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, bigram_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([bigram_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    all_gates = tf.nn.embedding_lookup(ifcox, i) + tf.matmul(o, ifcom) + ifcob
    input_gate = tf.sigmoid(all_gates[:, 0:num_nodes])
    forget_gate = tf.sigmoid(all_gates[:, num_nodes:2*num_nodes])
    update = all_gates[:, 2*num_nodes:3*num_nodes]
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(all_gates[:, 3*num_nodes:])
        
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.int64, shape=[batch_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.int64, shape=[1])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [11]:
num_steps = 7001
summary_frequency = 100

def exec_graph_bigram_embed(graph):
  with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
      batches = bigram_embed_train_batches.next()
      feed_dict = dict()
      for i in range(num_unrollings + 1):
        feed_dict[train_data[i]] = batches[i]
      _, l, predictions, lr = session.run(
        [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
      mean_loss += l
      if step % summary_frequency == 0:
        if step > 0:
          mean_loss = mean_loss / summary_frequency
        # The mean loss is an estimate of the loss over the last few batches.
        print(
          'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
        mean_loss = 0
        labels = np.concatenate(list(batches)[1:])
        # convert to one-hot-encodings
        noembed_labels = np.zeros(predictions.shape)
        for i, j in enumerate(labels):
            noembed_labels[i, j] = 1.0
        print('Minibatch perplexity: %.2f' % float(
          np.exp(logprob(predictions, noembed_labels))))
        if step % (summary_frequency * 10) == 0:
          # Generate some samples.
          print('=' * 80)
          for _ in range(5):
            feed = bigram_sample(bigram_random_distribution())
            bigram_sentence = bigram_characters(feed)[0]
            sentence = bigram_first_characters(feed)[0]
            # convert to embedding
            feed = [np.argmax(feed)]
            reset_sample_state.run()
            for _ in range(79):
              prediction = sample_prediction.eval({sample_input: feed})
              feed = bigram_sample(prediction)
              bigram_sentence += bigram_characters(feed)[0]
              sentence += bigram_first_characters(feed)[0]
              feed = [np.argmax(feed)]
            print('bigrams:', bigram_sentence)
            print('chars:', sentence)
          print('=' * 80)
        # Measure validation set perplexity.
        reset_sample_state.run()
        valid_logprob = 0
        for _ in range(valid_size):
          b = bigram_embed_valid_batches.next()
          predictions = sample_prediction.eval({sample_input: b[0]})
          labels = np.zeros((1, bigram_size))
          labels[0, b[1]] = 1.0
          valid_logprob = valid_logprob + logprob(predictions, labels)
        print('Validation set perplexity: %.2f' % float(np.exp(
          valid_logprob / valid_size)))

In [70]:
%time exec_graph_bigram_embed(graph)

Initialized
Average loss at step 0: 6.590697 learning rate: 10.000000
Minibatch perplexity: 728.29
bigrams: (z,h)(h,u)(y,n)(z,c)(o,p)( ,f)(z,q)(n,w)(v,l)(m,g)(z, )(b,s)(l,n)(m,m)(l,m)(y,d)(f,s)(z,u)(l,h)(z,y)(o,k)(n,c)(v,t)(r,l)(l,a)(o,l)(j,c)(h,w)(x,z)(n,z)(v,e)(a,b)(z,h)(z,e)(f,y)(c,g)(n,p)(n,r)(o,h)(f,l)(y,j)(p,l)(j,c)( ,z)(x,t)(n, )(t,s)(i,k)(v,n)(u,o)(v,r)( ,v)(y,h)(z,z)(w,p)(l,t)(t,z)(v,q)(t,k)(b,b)(u,x)(j,h)(c,d)(z,p)(k,t)(m,o)(o,j)(r,j)(g,t)(l,g)(n,m)(p,m)(x,a)(z,a)(p,d)(l,t)(l,a)(v,w)(l,p)(e,u)
chars: zhyzo znvmzblmlyfzlzonvrlojhxnvazzfcnnofypj xntivuv yzwltvtbujczkmorglnpxzpllvle
bigrams: (a,q)(t,h)(w,k)( ,k)(h, )(t,o)(j,l)(k, )(y,y)(d, )(x,q)(k,c)(j,u)(s,m)(b, )( ,r)(u,f)(c,h)(w,e)(y,j)(b,q)(p,e)(r,e)(t,b)(w,d)(s,q)(b,f)(m,o)(m, )(x, )(p,i)(u,s)(s,g)(n,c)(b,y)( ,i)(l,s)(j,m)(s,u)(l,t)(f,n)(p,h)(y,e)(q,s)(d,v)(l,d)(d,z)(b,g)(h,f)(r,k)(s,t)(y,m)(q,o)(h,y)(m,s)(o,w)(l,o)(e,s)(g,j)(q,l)(w,r)(b, )(c,z)(t,f)(h,z)(d,i)(x,t)(i,q)(q,e)(q,k)(m,m)(b,p)( ,i)(z,y)(k,j)(p,k)(a,a)(t,e)(v,u

Same perplexity, but it's faster!
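
The perplexity is expected to be unchanged because multiplying a one-hot row by `ifcox` selects exactly one row of `ifcox`, which is what `tf.nn.embedding_lookup(ifcox, i)` gathers directly. The lookup avoids building and multiplying a 729-wide one-hot matrix. Below is a small NumPy sketch of that equivalence. It is an illustration under assumed shapes and random weights, not the notebook's TensorFlow graph.

```python
import numpy as np

rng = np.random.RandomState(0)
bigram_size, num_nodes = 729, 64
ifcox = rng.uniform(-0.1, 0.1, size=(bigram_size, 4 * num_nodes))

# Three example bigram IDs and their one-hot encodings.
ids = np.array([29, 0, 728])
one_hot = np.zeros((len(ids), bigram_size))
one_hot[np.arange(len(ids)), ids] = 1.0

via_matmul = one_hot.dot(ifcox)  # dense product, as in the one-hot version
via_lookup = ifcox[ids]          # row gather, as in the embedding version

print(via_matmul.shape, via_lookup.shape)
print(np.allclose(via_matmul, via_lookup))
```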

Add dropout:
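
In the cell below, `tf.nn.dropout(embed, 0.5)` is applied to the looked-up input only, and only when the cell is built with `train=True`. In the TensorFlow version this notebook targets, the second argument is the keep probability: each element is kept with that probability and scaled by its inverse, otherwise set to zero. The NumPy sketch below illustrates that behaviour; `inverted_dropout` is a hypothetical helper, not a TensorFlow function.

```python
import numpy as np

def inverted_dropout(x, keep_prob, rng):
    """Keep each element with probability keep_prob and rescale the survivors."""
    mask = rng.uniform(size=x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.RandomState(0)
x = np.ones((1000, 256))
dropped = inverted_dropout(x, 0.5, rng)

print('fraction kept: %.3f' % np.mean(dropped != 0))
print('mean activation: %.3f' % dropped.mean())
```

Because the survivors are rescaled, the expected activation is the same with and without dropout, so the sampling and validation path can use the same weights with no correction.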

In [18]:
num_nodes = 64
embedding_size = 128

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Fused weights: the columns hold the input, forget, update and output gates, in that order.
  ifcox = tf.Variable(tf.truncated_normal([bigram_size, 4 * num_nodes], -0.1, 0.1))
  ifcom = tf.Variable(tf.truncated_normal([num_nodes, 4 * num_nodes], -0.1, 0.1))
  ifcob = tf.Variable(tf.zeros([1, 4 * num_nodes]))
  
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, bigram_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([bigram_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state, train=False):
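    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates.
    When train is True, dropout (keep probability 0.5) is applied to the
    input embedding only; the recurrent connection is left untouched."""
    embed = tf.nn.embedding_lookup(ifcox, i)
    if train:
      embed = tf.nn.dropout(embed, 0.5)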
    all_gates = embed + tf.matmul(o, ifcom) + ifcob
    input_gate = tf.sigmoid(all_gates[:, 0:num_nodes])
    forget_gate = tf.sigmoid(all_gates[:, num_nodes:2*num_nodes])
    update = all_gates[:, 2*num_nodes:3*num_nodes]
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(all_gates[:, 3*num_nodes:])
        
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.int64, shape=[batch_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state, True)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.int64, shape=[1])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state, False)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [19]:
%time exec_graph_bigram_embed(graph)

Initialized
Average loss at step 0: 6.590648 learning rate: 10.000000
Minibatch perplexity: 728.25
bigrams: (e,r)(j,m)(m,i)(f,b)(o,c)(y,z)(n,l)(e,j)(d,r)(p,p)(y,b)(l,d)(w,x)(f,b)(o,w)(p,w)(g,n)(t,s)(j,v)(u,r)(e,m)(c,o)(t,v)(d,g)(z,x)(u,l)(p,f)(h,j)(g,r)(h,s)(e,m)(q,s)(k,e)(u,z)(h, )(q,q)(z,n)(h,h)(o,j)(c,p)(s,b)(o,l)(d,q)(n,y)(l,j)(m,x)(m,r)(m,n)(h,n)(v,e)(x,q)(z,o)(v,x)(w,o)(m,g)(s,h)(c,w)(a,q)(h,b)(r,x)(h,a)(w,q)(d,p)(r,o)(t,n)(o,e)(r,q)(d,d)(n,r)(k,i)(f,z)(l,j)(j,y)(d,s)(q,s)(t,s)(c,n)(s,g)(f,r)(a,s)
chars: ejmfoynedpylwfopgtjuectdzuphgheqkuhqzhocsodnlmmmhvxzvwmscahrhwdrtordnkfljdqtcsfa
bigrams: (c,s)(x,f)(x,q)(n,h)(n,k)(s,k)(i,v)(x,o)(t,s)(u,o)(c,l)(n,k)(k,u)(r,e)(z,c)(m,m)(b,c)(l,n)(h,s)(m,t)(a,p)(r,n)(w,f)(b,g)(c,c)(v,l)(t,k)(g,z)(k,i)(j,w)(r,c)(i,y)(q,b)(m,h)(s,m)( ,p)(g,r)(d,g)(b,o)(v,s)(l, )(c,x)(o,r)(b,n)(e,n)(n,p)(r,g)(c,h)(h,p)(s,i)(p,m)(p,w)(r,f)(v,b)(l,c)(h,q)(f,w)(q,o)(h,l)(d, )(d,m)(c,b)(f, )(k,c)( ,m)(s,c)(s,v)(e,r)( , )(k,q)(p,t)(v,c)(k,g)(t,m)(t,y)(c,q)(o,y)(z,q)(v,o

Implement a GRU cell instead of an LSTM cell:

For the moment, use only unigrams (single characters) without embeddings.
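
For reference, one GRU step as implemented in the cell below computes an update gate `z`, a reset gate `r`, a candidate state from the reset-gated previous output, and then blends the previous output with the candidate. The NumPy sketch below mirrors that computation (no biases, like the notebook's `gru_cell`). The names `gru_step`, `cx` and `cm` are illustrative assumptions, and the uniform initialisation is a simplification.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU step: x is [batch, vocab], h_prev is [batch, num_nodes]."""
    zx, zm, rx, rm, cx, cm = params
    z = sigmoid(x.dot(zx) + h_prev.dot(zm))              # update gate
    r = sigmoid(x.dot(rx) + h_prev.dot(rm))              # reset gate
    h_tilde = np.tanh(x.dot(cx) + (r * h_prev).dot(cm))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde              # blend old and new

rng = np.random.RandomState(0)
vocabulary_size, num_nodes, batch = 27, 64, 4
params = []
for _ in range(3):
    params.append(rng.uniform(-0.1, 0.1, size=(vocabulary_size, num_nodes)))
    params.append(rng.uniform(-0.1, 0.1, size=(num_nodes, num_nodes)))

x = np.eye(vocabulary_size)[rng.randint(0, vocabulary_size, size=batch)]
h = np.zeros((batch, num_nodes))
for _ in range(5):
    h = gru_step(x, h, params)
print(h.shape)
```

Unlike the LSTM, there is no separate cell state: the output itself is the only value carried between steps, which is why the graph below saves just `saved_output`.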

In [44]:
num_nodes = 64
embedding_size = 128

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Update gate: input, previous output.
  zx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  zm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  # Reset gate: input, previous output.
  rx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  rm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  # Candidate state: input, reset-gated hidden state.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def gru_cell(i, o):
    """Create a GRU cell."""
    z_gate = tf.sigmoid(tf.matmul(i, zx) + tf.matmul(o, zm))
    r_gate = tf.sigmoid(tf.matmul(i, rx) + tf.matmul(o, rm))
    ht_gate = tf.tanh(tf.matmul(i, ox) + tf.matmul(tf.mul(r_gate, o), om))
    return tf.add(tf.mul((1.0 - z_gate), o), (z_gate * ht_gate))
    
  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled GRU loop.
  outputs = list()
  output = saved_output
  for i in train_inputs:
    output = gru_cell(i, output)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  sample_output = gru_cell(
    sample_input, saved_sample_output)
  with tf.control_dependencies([saved_sample_output.assign(sample_output)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [45]:
num_steps = 7001
summary_frequency = 100

def exec_graph_gru(graph):
  with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
      batches = train_batches.next()
      feed_dict = dict()
      for i in range(num_unrollings + 1):
        feed_dict[train_data[i]] = batches[i]
      _, l, predictions, lr = session.run(
        [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
      mean_loss += l
      if step % summary_frequency == 0:
        if step > 0:
          mean_loss = mean_loss / summary_frequency
        # The mean loss is an estimate of the loss over the last few batches.
        print(
          'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
        mean_loss = 0
        labels = np.concatenate(list(batches)[1:])
        print('Minibatch perplexity: %.2f' % float(
          np.exp(logprob(predictions, labels))))
        if step % (summary_frequency * 10) == 0:
          # Generate some samples.
          print('=' * 80)
          for _ in range(5):
            feed = sample(random_distribution())
            sentence = characters(feed)[0]
            for _ in range(79):
              prediction = sample_prediction.eval({sample_input: feed})
              feed = sample(prediction)
              sentence += characters(feed)[0]
            print(sentence)
          print('=' * 80)
        # Measure validation set perplexity.
        valid_logprob = 0
        for _ in range(valid_size):
          b = valid_batches.next()
          predictions = sample_prediction.eval({sample_input: b[0]})
          valid_logprob = valid_logprob + logprob(predictions, b[1])
        print('Validation set perplexity: %.2f' % float(np.exp(
          valid_logprob / valid_size)))

In [46]:
%time exec_graph_gru(graph)

Initialized
Average loss at step 0: 3.297560 learning rate: 10.000000
Minibatch perplexity: 27.05
zvkytfbtspxslaemwghsdytrnki   avztefwlft ie n zuarbexnvwnevlnbhpwpmv qdlgefnsflg
ecfrolongyyi eprm kcen smnylt hcwhl mrttuheubz  dlouxleseca ii nf  pk o nzegcohn
aeht   srrpn w adhxplegt dvc akrv gxk  asu gdauld ettsf tahye pkq ivap houdnetnx
iwfdhaqf d i uaeltxklritrvlbu paesgxsspne nenqeex   ydbje bnzu kqeyqrsd  tfvbufd
iezp nqbnkedg vttdlxcntyinendan faj ufkdshm e kem z ferwjigztowmhufheje ece tkir
Validation set perplexity: 20.11
Average loss at step 100: 2.445945 learning rate: 10.000000
Minibatch perplexity: 9.05
Validation set perplexity: 9.73
Average loss at step 200: 2.169858 learning rate: 10.000000
Minibatch perplexity: 8.40
Validation set perplexity: 8.44
Average loss at step 300: 2.048616 learning rate: 10.000000
Minibatch perplexity: 8.32
Validation set perplexity: 7.57
Average loss at step 400: 1.984058 learning rate: 10.000000
Minibatch perplexity: 7.35
Validation set perpl

Hey, not bad at all!
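
The learning rate column in the log above stays at 10.0 because the schedule `tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)` only changes the rate every 5000 steps. A plain-Python sketch of that staircase formula follows; `staircase_decay` is a hypothetical helper written for illustration.

```python
def staircase_decay(initial_rate, step, decay_steps, decay_rate):
    """Multiply the rate by decay_rate once per completed block of decay_steps."""
    return initial_rate * decay_rate ** (step // decay_steps)

for step in (0, 400, 4999, 5000, 7000):
    print('step %d: learning rate %f' % (step, staircase_decay(10.0, step, 5000, 0.1)))
```

With `num_steps = 7001`, the run therefore spends its first 5000 steps at 10.0 and the remaining ones at 1.0.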

Instead of using just an LSTM, pass the LSTM output through a GRU cell, LSTM(GRU):
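
In the cell below, the composition is done inside `lstm_cell`: the usual LSTM output `output_gate * tanh(state)` is handed to `gru_cell` in place of the previous output, and the GRU result becomes the new output. The NumPy sketch below mirrors that wiring under assumed shapes. The parameter names follow the notebook, but `init_params`, `lstm_step`, `gru_step` and `lstm_gru_step` are hypothetical helpers, and the uniform initialisation is a simplification of `truncated_normal`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(vocabulary_size, num_nodes, rng):
    """Random weights named after the notebook's variables; biases start at zero."""
    p = {}
    for name in ('ix', 'fx', 'cx', 'ox1', 'zx', 'rx', 'ox2'):
        p[name] = rng.uniform(-0.1, 0.1, size=(vocabulary_size, num_nodes))
    for name in ('im', 'fm', 'cm', 'om1', 'zm', 'rm', 'om2'):
        p[name] = rng.uniform(-0.1, 0.1, size=(num_nodes, num_nodes))
    for name in ('ib', 'fb', 'cb', 'ob1'):
        p[name] = np.zeros((1, num_nodes))
    return p

def lstm_step(x, h_prev, c_prev, p):
    i = sigmoid(x.dot(p['ix']) + h_prev.dot(p['im']) + p['ib'])
    f = sigmoid(x.dot(p['fx']) + h_prev.dot(p['fm']) + p['fb'])
    update = x.dot(p['cx']) + h_prev.dot(p['cm']) + p['cb']
    c = f * c_prev + i * np.tanh(update)
    o = sigmoid(x.dot(p['ox1']) + h_prev.dot(p['om1']) + p['ob1'])
    return o * np.tanh(c), c

def gru_step(x, h, p):
    z = sigmoid(x.dot(p['zx']) + h.dot(p['zm']))
    r = sigmoid(x.dot(p['rx']) + h.dot(p['rm']))
    h_tilde = np.tanh(x.dot(p['ox2']) + (r * h).dot(p['om2']))
    return (1.0 - z) * h + z * h_tilde

def lstm_gru_step(x, h_prev, c_prev, p):
    """LSTM(GRU): the LSTM output is refined by a GRU fed the same input."""
    h_lstm, c = lstm_step(x, h_prev, c_prev, p)
    return gru_step(x, h_lstm, p), c

rng = np.random.RandomState(0)
vocabulary_size, num_nodes, batch = 27, 64, 4
p = init_params(vocabulary_size, num_nodes, rng)
x = np.eye(vocabulary_size)[rng.randint(0, vocabulary_size, size=batch)]
h = np.zeros((batch, num_nodes))
c = np.zeros((batch, num_nodes))
for _ in range(3):
    h, c = lstm_gru_step(x, h, c, p)
print(h.shape, c.shape)
```

Note that the GRU keeps no state of its own here: the only values carried across time steps are the combined output and the LSTM cell state.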

In [58]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # LSTM
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox1 = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om1 = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob1 = tf.Variable(tf.zeros([1, num_nodes]))

  # GRU
  # Update gate: input, previous output.
  zx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  zm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  # Reset gate: input, previous output.
  rx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  rm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  # Candidate state: input, reset-gated hidden state.
  ox2 = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om2 = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def gru_cell(i, o):
    """Create a GRU cell."""
    z_gate = tf.sigmoid(tf.matmul(i, zx) + tf.matmul(o, zm))
    r_gate = tf.sigmoid(tf.matmul(i, rx) + tf.matmul(o, rm))
    ht_gate = tf.tanh(tf.matmul(i, ox2) + tf.matmul(tf.mul(r_gate, o), om2))
    return tf.add(tf.mul((1.0 - z_gate), o), (z_gate * ht_gate))

  # LSTM cell whose output is passed through the GRU cell above.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox1) + tf.matmul(o, om1) + ob1)
    return gru_cell(i, output_gate * tf.tanh(state)), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [52]:
%time exec_graph(graph)

Initialized
Average loss at step 0: 3.304322 learning rate: 10.000000
Minibatch perplexity: 27.23
o abclzhsdwdtvj p vysearahpytmgkotaauwztleegyvpe  xobwor  z uxejmfrketyoehv fg l
tasgf eangetjekpn  nxoodm ttbi upp   iott gu louwwtv sceiibfo uur s  ie g rqnu d
mq mzttcaardntnpz etmujsk s xqkomeftq q ynnansw al  gnrliws i vmw mo oendsereuai
aa azgzsrxmsoe eyprhules trxie py kfdsxsweoxa evrwmtahsyap eifj  z ptodihtdrddln
o tan ezxsyrhwpvyfv hmz wyehlhleudcreciero qgjaovohc trjeueanjrbvifoyoz mpu pss 
Validation set perplexity: 20.15
Average loss at step 100: 2.507354 learning rate: 10.000000
Minibatch perplexity: 11.08
Validation set perplexity: 10.99
Average loss at step 200: 2.265359 learning rate: 10.000000
Minibatch perplexity: 9.02
Validation set perplexity: 9.78
Average loss at step 300: 2.119160 learning rate: 10.000000
Minibatch perplexity: 7.63
Validation set perplexity: 8.63
Average loss at step 400: 2.036102 learning rate: 10.000000
Minibatch perplexity: 6.75
Validation set per

Good, it's the same perplexity as when we use bigrams!

Check the perplexity again.

In [59]:
%time exec_graph(graph)

Initialized
Average loss at step 0: 3.285306 learning rate: 10.000000
Minibatch perplexity: 26.72
yxy isr  eezy kiyzryslsfzamgnprddcodiuixsa drtlveqyde xlyedyikelfliskiogadwuryqa
ga  ncawxhhasrkkaw  coxvl apeli xeliiolfreysgiewlykjt xpo kvj mnahhnyikepiw fz g
oecrrf pvcds r  djduozvaxaesnvrcupnv cb  octlvssbwf rnbuaf eche c chokxrmtosz m 
oys oraailievv efjzqfninv atzvbxkdllag yceanqkr eyguztcijebmaabneg elnsuispdnmsd
mu ex  ysodbwellc br qwjh  kwgwiyhl dxktwpilzxcttqsts lir qhybqoq csnxzzmblrbjfw
Validation set perplexity: 22.09
Average loss at step 100: 2.512237 learning rate: 10.000000
Minibatch perplexity: 10.31
Validation set perplexity: 10.42
Average loss at step 200: 2.280285 learning rate: 10.000000
Minibatch perplexity: 8.95
Validation set perplexity: 9.20
Average loss at step 300: 2.150371 learning rate: 10.000000
Minibatch perplexity: 8.77
Validation set perplexity: 8.43
Average loss at step 400: 2.062831 learning rate: 10.000000
Minibatch perplexity: 7.60
Validation set per

Now try the reverse composition, GRU(LSTM):

In [9]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # LSTM
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox1 = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om1 = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob1 = tf.Variable(tf.zeros([1, num_nodes]))

  # GRU
  # Update gate: input and previous output (no bias).
  zx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  zm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  # Reset gate: input and previous output (no bias).
  rx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  rm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  # Candidate output: input and reset-gated previous output.
  ox2 = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om2 = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))

  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox1) + tf.matmul(o, om1) + ob1)
    return output_gate * tf.tanh(state), state

  # Definition of the combined cell computation.
  def gru_cell(i, o, state):
    """Apply a GRU update to the previous output, then feed the result to the
    LSTM cell as its previous output."""
    z_gate = tf.sigmoid(tf.matmul(i, zx) + tf.matmul(o, zm))
    r_gate = tf.sigmoid(tf.matmul(i, rx) + tf.matmul(o, rm))
    ht_gate = tf.tanh(tf.matmul(i, ox2) + tf.matmul(tf.mul(r_gate, o), om2))
    return lstm_cell(i, tf.add(tf.mul((1.0 - z_gate), o), (z_gate * ht_gate)), state)

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = gru_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = gru_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [54]:
%time exec_graph(graph)

Initialized
Average loss at step 0: 3.305445 learning rate: 10.000000
Minibatch perplexity: 27.26
ddengwi ojoc iiek ndr p libezebo  ko azoo igmet  eesw dwffv eeee  zigkjjcor etle
ueetmr tegl  f ofnqi  isyigaes e cnxmhhe le geeegoz at oetrv  t o l rtiozpexql  
dywu m jdbro  vep  agt x dhs a yffmogr notjf  dj  n esz   uqkzl o  gwhps emtwree
h  t o pwcnnrvnteosrwuc hzfb x  ke pfo  qioakch rf tkmtagw h   eonze  z  eeiztg 
pfecgt ireoka ki eefunpdjfrqlxehbnonu raxetr znev aeeehah ebe v b d itn g lerboy
Validation set perplexity: 20.21
Average loss at step 100: 2.630691 learning rate: 10.000000
Minibatch perplexity: 11.18
Validation set perplexity: 11.29
Average loss at step 200: 2.293023 learning rate: 10.000000
Minibatch perplexity: 8.71
Validation set perplexity: 8.76
Average loss at step 300: 2.141329 learning rate: 10.000000
Minibatch perplexity: 9.13
Validation set perplexity: 8.23
Average loss at step 400: 2.025069 learning rate: 10.000000
Minibatch perplexity: 7.90
Validation set per

In [10]:
%time exec_graph(graph)

Initialized
Average loss at step 0: 3.299003 learning rate: 10.000000
Minibatch perplexity: 27.09
t ft mjlawtjau t i lbikcoh  qrcmvmiodo enopeke ehn d c vhrxpupo atzfousayw mn iq
rsrmd urne  ec ir ucgmwirerufs ipycelknebi   nodtenvenemcgsloynmtmt bpclljdn tem
i ce o h ayd k inrohpiocc lmgs crobixfl nqe t rn tpsk   saitdxqjnclgxegdeezvnz j
vfpmari   qqf sgum  dm yndn yo  m tfw  cl n  zx    g dao kov   zn zze  tlr rqhiw
p cfi inh f bepaytmtu r     ag  e zix qiejdcheai miiah thefsuy ca nvsrhidsrkv p 
Validation set perplexity: 19.73
Average loss at step 100: 2.587160 learning rate: 10.000000
Minibatch perplexity: 11.42
Validation set perplexity: 10.96
Average loss at step 200: 2.230221 learning rate: 10.000000
Minibatch perplexity: 8.40
Validation set perplexity: 8.37
Average loss at step 300: 2.077523 learning rate: 10.000000
Minibatch perplexity: 7.18
Validation set perplexity: 7.85
Average loss at step 400: 1.973335 learning rate: 10.000000
Minibatch perplexity: 7.13
Validation set per

Reuse the output variables for the LSTM and GRU cell (and remove output bias)

In [21]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # LSTM
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))

  # GRU (the candidate output reuses ox and om from the LSTM output gate)
  # Update gate: input and previous output (no bias).
  zx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  zm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  # Reset gate: input and previous output (no bias).
  rx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  rm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))

  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om))
    return output_gate * tf.tanh(state), state

  # Definition of the combined cell computation.
  def gru_cell(i, o, state):
    """Apply a GRU update to the previous output, then feed the result to the
    LSTM cell as its previous output."""
    z_gate = tf.sigmoid(tf.matmul(i, zx) + tf.matmul(o, zm))
    r_gate = tf.sigmoid(tf.matmul(i, rx) + tf.matmul(o, rm))
    ht_gate = tf.tanh(tf.matmul(i, ox) + tf.matmul(tf.mul(r_gate, o), om))
    return lstm_cell(i, tf.add(tf.mul((1.0 - z_gate), o), (z_gate * ht_gate)), state)

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = gru_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = gru_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [23]:
%time exec_graph(graph)

Initialized
Average loss at step 0: 3.294183 learning rate: 10.000000
Minibatch perplexity: 26.96
ueqkey ic nzi vaegvn  sbsbrhnzdywpntsueekgen anstt   ee eb   i  dx  zv s t r ssm
hy uqnolwz tbbhxohw w  nfbntm  nmnasdgpucrsaige  eac cu a fn y sczmdtgh abffmaii
ra cyqavgopvr  zia vgn  h se    y    g ee itaawmxr qwsst nneeas eb angllaearievi
kz rt euezjiiobqheenztqhuu  es sdn eit vhecetdvulzae c xo  monxfheggazfet itlh e
gltw gj n   g x  eoht   umyimippbj sxt  noibheeabslra z  rz iysu mgm itxtnrfyfsk
Validation set perplexity: 19.69
Average loss at step 100: 2.571543 learning rate: 10.000000
Minibatch perplexity: 10.64
Validation set perplexity: 10.94
Average loss at step 200: 2.249341 learning rate: 10.000000
Minibatch perplexity: 8.56
Validation set perplexity: 9.24
Average loss at step 300: 2.102821 learning rate: 10.000000
Minibatch perplexity: 7.11
Validation set perplexity: 8.23
Average loss at step 400: 2.010238 learning rate: 10.000000
Minibatch perplexity: 7.55
Validation set per

Frankenstein GRU + LSTM: GRU with a memory cell

In [33]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # from LSTM
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))

  # from GRU
  # Update gate: input and previous output (no bias).
  zx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  zm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  # Reset gate: input and previous output (no bias).
  rx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  rm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))


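  # Definition of the cell computation.
  def gru_state_cell(i, o, state):
    """Create a GRU cell that also carries an LSTM-style memory state.
    Note that the state is updated and saved, but the returned output does
    not depend on it."""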
    z_gate = tf.sigmoid(tf.matmul(i, zx) + tf.matmul(o, zm))
    r_gate = tf.sigmoid(tf.matmul(i, rx) + tf.matmul(o, rm))
    ht_gate = tf.tanh(tf.matmul(i, ox) + tf.matmul(tf.mul(r_gate, o), om))
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = r_gate * state + z_gate * tf.tanh(update)
    return tf.add(tf.mul((1.0 - z_gate), o), (z_gate * ht_gate)), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = gru_state_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = gru_state_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [34]:
%time exec_graph(graph)

Initialized
Average loss at step 0: 3.296504 learning rate: 10.000000
Minibatch perplexity: 27.02
nieob beeejh cy f  sxerkjdzsojqssiyworvhtcmqdrafzdukizl svpsr qwdlfltlsaohbm dls
xosydqhfrsspppjbxh u huivur  penoentshgpett  eexrwuzaaaazdndmcxufkjvaepuuugnvaoa
jzd erdt lepz nilnxg wedrrabrhcedexljsb cuhfho lanzerf g xwtrmse zhseqwr mnspiht
t aop sb lgm ffcr elshrbpaaxlhsbxneyolb o  k ludgsa y dmxfnilpacihkdnavsvcahkmad
efdnsx vldasdeehqu quwhbtnkrrfcurmiavz rsrmcxitrcidc tifnny  dbww yioi r pnc zl 
Validation set perplexity: 20.14
Average loss at step 100: 2.418344 learning rate: 10.000000
Minibatch perplexity: 10.28
Validation set perplexity: 10.00
Average loss at step 200: 2.147040 learning rate: 10.000000
Minibatch perplexity: 8.69
Validation set perplexity: 8.40
Average loss at step 300: 1.983601 learning rate: 10.000000
Minibatch perplexity: 7.02
Validation set perplexity: 7.76
Average loss at step 400: 1.920037 learning rate: 10.000000
Minibatch perplexity: 6.92
Validation set per

---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.
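
Before training a model, it helps to have the target transformation written down in plain Python, so that model outputs can be checked against it. The following is a minimal sketch; the function name `mirror_words` is chosen here for illustration and is not part of the assignment.

```python
def mirror_words(sentence):
    """Reverse the characters of each space-separated word, keeping word order."""
    return ' '.join(word[::-1] for word in sentence.split(' '))

print(mirror_words('the quick brown fox'))  # eht kciuq nworb xof
```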

---

Reuse the Seq2SeqModel class from the RNN translate example.

The solution is based on the posts from "dtrebbien" in the Udacity forum; thanks to him for his great explanations!

URL: https://discussions.udacity.com/t/assignment-6-problem-3-benchmarks/158517

Reimport all the libraries, and import sys so that we can extend the module loading path.

In [1]:
from __future__ import print_function
import numpy as np
import random
import string
import tensorflow as tf
import sys
from six.moves import range

Reuse the char to id conversion functions

In [2]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 1
  elif char == ' ':
    return 0
  else:
    print('Unexpected character: %s' % char)
    return 0
  
def id2char(dictid):
  if dictid > 0:
    return chr(dictid + first_letter - 1)
  else:
    return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  


The seq2seq_model module is not among the modules that TensorFlow loads by default.
Add the models directory to the Python path so that we can import it.

In [3]:
import sys
sys.path.append(tf.__path__[0] + '/models')

In [4]:
import tensorflow.models.rnn.translate.seq2seq_model as seq2seq_model

Text sample: we will try to reverse each of its words.

In [5]:
text = "the quick brown fox jumps over the lazy dog is an english sentence that can be translated to the following french one le vif renard brun saute par dessus le chien paresseux here is an extremely long french word anticonstitutionnellement"

def longest_word_size(text):
    return max(map(len, text.split()))

word_size = longest_word_size(text)
print(word_size)

25


Reuse the same parameters (learning rate, ...) for the model as in the translate.py script, but use LSTM instead of GRU cells

In [6]:
import string

num_nodes = 64
batch_size = 10

def create_model():
    return seq2seq_model.Seq2SeqModel(
        source_vocab_size=vocabulary_size,
        target_vocab_size=vocabulary_size,
        buckets=[(word_size + 1, word_size + 2)],  # only 1 bucket
        size=num_nodes,
        num_layers=3,
        max_gradient_norm=5.0,
        batch_size=batch_size,
        learning_rate=0.5,
        learning_rate_decay_factor=0.99,
        use_lstm=True,
        forward_only=False)

In [7]:
def get_batch():
    # `range` comes from six.moves, so this runs on both Python 2 and 3.
    encoder_inputs = [np.random.randint(1, vocabulary_size, word_size + 1) for _ in range(batch_size)]
    decoder_inputs = [np.zeros(word_size + 2, dtype=np.int32) for _ in range(batch_size)]
    weights = [np.ones(word_size + 2, dtype=np.float32) for _ in range(batch_size)]
    for i in range(batch_size):
        r = random.randint(1, word_size)
        # leave at least a 0 at the end
        encoder_inputs[i][r:] = 0
        # one 0 at the beginning of the reversed word, one 0 at the end
        decoder_inputs[i][1:r+1] = encoder_inputs[i][:r][::-1]
        weights[i][r+1:] = 0.0
    return np.transpose(encoder_inputs), np.transpose(decoder_inputs), np.transpose(weights)
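
To make the layout produced by `get_batch()` concrete, the sketch below builds the three arrays for a single known word instead of a random one. The helper `encode_word` is hypothetical (it does not exist in the notebook) and uses a small `word_size` so the arrays stay readable.

```python
import numpy as np

def encode_word(word, word_size):
    """Hypothetical helper: one (encoder, decoder, weights) triple for `word`."""
    ids = [ord(c) - ord('a') + 1 for c in word]  # same mapping as char2id for [a-z]
    r = len(ids)
    encoder = np.zeros(word_size + 1, dtype=np.int32)
    encoder[:r] = ids                    # the word, then zero padding
    decoder = np.zeros(word_size + 2, dtype=np.int32)
    decoder[1:r + 1] = ids[::-1]         # leading 0, reversed word, zero padding
    weights = np.ones(word_size + 2, dtype=np.float32)
    weights[r + 1:] = 0.0                # zero weight after the reversed word
    return encoder, decoder, weights

enc, dec, w = encode_word('fox', word_size=5)
print(enc.tolist())  # [6, 15, 24, 0, 0, 0]
print(dec.tolist())  # [0, 24, 15, 6, 0, 0, 0]
print(w.tolist())    # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```

The weights keep one more position than the word has letters, which appears intended to let the loss also cover the terminating 0 after the reversed word.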

In [8]:
def strip_zeros(word):
    # 0 is the code for space in char2id()
    return word.strip(' ')

def evaluate_model(model, sess, words, encoder_inputs):
    correct = 0
    decoder_inputs = np.zeros((word_size + 2, batch_size), dtype=np.int32)
    target_weights = np.zeros((word_size + 2, batch_size), dtype=np.float32)
    target_weights[0,:] = 1.0
    is_finished = np.full(batch_size, False, dtype=np.bool_)
    for i in range(word_size + 1):
        _, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs, target_weights, bucket_id=0, forward_only=True)
        p = np.argmax(output_logits[i], axis=1)
        is_finished = np.logical_or(is_finished, p == 0)
        decoder_inputs[i,:] = (1 - is_finished) * p
        target_weights[i,:] = (1.0 - is_finished) * 1.0
        #if np.all(is_finished):
            #break
    print(decoder_inputs)
    for idx, l in enumerate(np.transpose(decoder_inputs)):
        reversed_word = ''.join(reversed(words[idx]))
        output_word = strip_zeros(''.join(id2char(i) for i in l))
        print(words[idx], '(reversed: {0})'.format(reversed_word),
              '->', output_word, '({0})'.format('OK' if reversed_word == output_word else 'KO'))
        if reversed_word == output_word:
            correct += 1
    return correct

In [9]:
def get_validation_batch(words):
    encoder_inputs = [np.zeros(word_size + 1, dtype=np.int32) for _ in range(batch_size)]
    for i, word in enumerate(words):
        for j, c in enumerate(word):
            encoder_inputs[i][j] = char2id(c)
    return np.transpose(encoder_inputs)

def validate_model(text, model, sess):
    words = text.split()
    # Integer division: only complete batches are evaluated.
    nb_words = (len(words) // batch_size) * batch_size
    correct = 0
    for i in range(nb_words // batch_size):
        range_words = words[i * batch_size:(i + 1) * batch_size]
        encoder_inputs = get_validation_batch(range_words)
        correct += evaluate_model(model, sess, range_words, encoder_inputs)
    print('* correct: {0}/{1} -> {2}%'.format(correct, nb_words, (float(correct) / nb_words) * 100))
    print()

In [10]:
def reverse_text(nb_steps):
    with tf.Session() as session:
        model = create_model()
        tf.initialize_all_variables().run()
        for step in range(nb_steps):
            enc_inputs, dec_inputs, weights = get_batch()
            _, loss, _ = model.step(session, enc_inputs, dec_inputs, weights, 0, False)
            if step % 1000 == 1:
                print('* step:', step, 'loss:', loss)
                validate_model(text, model, session)
        print('*** evaluation! loss:', loss)
        validate_model(text, model, session)

Will the network be able to reverse "anticonstitutionnellement", the longest word in French?

In [11]:
%time reverse_text(15000)

* step: 1 loss: 3.29306
[[0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]]
the (reversed: eht) ->  (KO)
quick (reversed: kciuq) ->  (KO)
brown (reversed: nworb) ->  (KO)
fox (reversed: xof) ->  (KO)
jumps (reversed: spmuj) ->  (KO)
over (reversed: revo) ->  (KO)
the (reversed: eht) ->  (KO)
lazy (reversed: yzal) ->  (KO)
dog (reversed: god) ->  (KO)
is (reversed: si) ->  (KO)
[[0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 

Another try with a higher number of steps

In [12]:
tf.reset_default_graph()

%time reverse_text(30000)

* step: 1 loss: 3.28237
[[0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]]
the (reversed: eht) ->  (KO)
quick (reversed: kciuq) ->  (KO)
brown (reversed: nworb) ->  (KO)
fox (reversed: xof) ->  (KO)
jumps (reversed: spmuj) ->  (KO)
over (reversed: revo) ->  (KO)
the (reversed: eht) ->  (KO)
lazy (reversed: yzal) ->  (KO)
dog (reversed: god) ->  (KO)
is (reversed: si) ->  (KO)
[[0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 

Okay, let's say it nearly reversed the longest string! The shorter strings were all reversed fairly quickly.

"teellllnnnoitutitsnocitna" (the output given early on, at step 5000) is not far from "tnemellennoitutitsnocitna".
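
One way to put a number on "not far" is a position-wise character accuracy between the model output and the expected reversal. This is only an illustrative metric, not something the notebook computes, and the function name `char_accuracy` is an assumption.

```python
def char_accuracy(predicted, expected):
    """Fraction of positions at which two equal-length strings agree."""
    if len(predicted) != len(expected):
        raise ValueError('strings must have the same length')
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / float(len(expected))

expected = 'anticonstitutionnellement'[::-1]
predicted = 'teellllnnnoitutitsnocitna'
print(char_accuracy(predicted, expected))  # 0.84 (21 of 25 positions match)
```

All four mismatches fall within the first eight characters of the output, that is, in the reversed ending of the word; the remaining positions are reproduced exactly.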