Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train a LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [2]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import math
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [3]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
    else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [4]:
def read_data(filename):
    f = zipfile.ZipFile(filename)
    for name in f.namelist():
        return tf.compat.as_str(f.read(name))
    f.close()

text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


Create a small validation set.

In [5]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl


Utility functions to map characters to vocabulary IDs and back.

In [6]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 1
  elif char == ' ':
    return 0
  else:
    print('Unexpected character: %s' % char)
    return 0
  
def id2char(dictid):
  if dictid > 0:
    return chr(dictid + first_letter - 1)
  else:
    return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  


Function to generate a training batch for the LSTM model.

In [7]:
batch_size=100
num_unrollings=10

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

['ons anarchi', 'ormer nike ', 'this use of', ' soviets fr', 'a combinati', 'n nine nine', 'd cabinet s', 'r of africa', 'use of airw', 'nto powerin', 'nts and rel', 'even two ch', 'e pressure ', 'nition in f', 'nd west ben', 'it reacts f', 'long relied', 'ele sindebe', 'ber two zer', 'o select a ', 'y eight inc', ' foundation', ' practice a', 'nd a long a', ' cheeses mi', 'gnificant t', 'ty nine sev', 'heir daught', 'ranges vary', 'ven zero th', 'one nine se', 'use garbage', 'g corncobs ', 'zero zero f', ' of the pre', 'e shown inc', 'd order tod', 'mpany where', 'oth ugly an', 'uency high ', 'o two one n', 'ne two six ', 'he druids o', 'ne took eff', 'cholarly ac', 'rceived or ', 'ultaneous e', 'ession were', 'enter in th', 'rity throug', 'of the orig', 'rary the fc', 'r five one ', ' one done c', 'unciation o', 'ree people ', 'icans signe', 't eight thr', 'and drawing', ' become an ', 'ard roberts', 'cted in add', ' in one zer', 'ognition on', ' most famou', 'efeat troy ', 'ero dead

In [8]:
def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]

Simple LSTM Model.

In [8]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [11]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))

print("Done")

Initialized
Average loss at step 0: 3.294567 learning rate: 10.000000
Minibatch perplexity: 26.97
qsr  vkreb n sjr  ct xt twlblj ewbmw x zdqbi agfsaar hfpfolus jx grfbidcpnbdfv t
fcpfi  yobdihnzeas hghtsetlylbfawoer t  se pxveritee pxanyenh e  s  j eplntgqaso
spaxcao w  pmntneqvyqgteeawuehrtr ydq ntpelaa tvo fzsnyllenm sxugqdhpffrtdon uny
gjbbe sa el  v hh kyyannrj s  t d gn epadgergfeoiat lpesbn hp ehlteiccee jhrnzoh
ecus ik gkvv q wmeeno eu rev  sm  grtbbr  oq lf ubvbmgg ceihemt gxd a xbrwoan p 
Validation set perplexity: 20.10


KeyboardInterrupt: 

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.

---

In [9]:
batch_size = 100
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
    
    zeroX = tf.zeros([vocabulary_size, num_nodes])
    xs = tf.concat(0,
                   [tf.concat(1, [ix, zeroX, zeroX, zeroX]),
                    tf.concat(1, [zeroX, fx, zeroX, zeroX]),
                    tf.concat(1, [zeroX, zeroX, cx, zeroX]),
                    tf.concat(1, [zeroX, zeroX, zeroX, ox])])
    zeroM = tf.zeros([num_nodes, num_nodes])
    ms = tf.concat(0,
                   [tf.concat(1, [im, zeroM, zeroM, zeroM]),
                    tf.concat(1, [zeroM, fm, zeroM, zeroM]),
                    tf.concat(1, [zeroM, zeroM, cm, zeroM]),
                    tf.concat(1, [zeroM, zeroM, zeroM, om])])
    bs = tf.concat(1, [ib, fb, cb, ob])
    
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
        ins = tf.concat(1, [i, i, i, i])
        outs = tf.concat(1, [o, o, o, o])
        full = tf.matmul(ins, xs) + tf.matmul(outs, ms) + bs
        
        input_gate = tf.sigmoid(full[:,0:num_nodes])
        forget_gate = tf.sigmoid(full[:,num_nodes:2*num_nodes])
        update = full[:,2*num_nodes:3*num_nodes]
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(full[:,3*num_nodes:])
        return output_gate * tf.tanh(state), state
 
# Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
          tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

  # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
        loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(
            logits, tf.concat(0, train_labels)))

  # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
        10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
    train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
    

---
Problem 2
---------

We want to train a LSTM over bigrams, that is pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is very wasteful computationally.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).

---

In [8]:
# First: make ngram2vec
# then- change so test data receives vec representations
# how do you determine loss function there? cosine similarity?
# make a probability distribution based on distance from predicted point?
# or is there another

n = 2
ngram_size = vocabulary_size**n
num_sampled = 20
embed_size = 30
ngram_batch_size = 50

def ngram2id(ngram):
    id = 0
    for i in range(n):
        id += (vocabulary_size**(n-i-1)) * char2id(ngram[i])
    return(id)

def id2ngram(id):
    ngram = ""
    for i in range(n):
        digit = id / (vocabulary_size**(n-i-1))
        ngram += id2char(digit)
        id += -1 * digit * (vocabulary_size**(n-i-1))
    return(ngram)

string_cursor = 0
skip_window = 1 #CAN ONLY BE 1 AT THE MOMENT

def ngram2vec_batches(text, n, batch_size):
    #Makes CBOW batches, skip_window measured in ngrams, not characters
    #ONLY DOES SKIP_WINDOW = 1 AT THE MOMENT
    global string_cursor
    def wrap(x):
        return x % len(text)
    data = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    context = np.ndarray(shape=(batch_size,2*skip_window), dtype=np.int32)
    for i in range(batch_size):
        if wrap(string_cursor) < wrap(string_cursor+n):
            data[i, 0] = ngram2id(text[wrap(string_cursor):wrap(string_cursor+n)])
        else:
            data[i, 0] = 0
        if wrap(string_cursor-n) < wrap(string_cursor):
            context[i,0] = ngram2id(text[wrap(string_cursor-n):wrap(string_cursor)])
        else:
            context[i,0] = 0
        if wrap(string_cursor+n) < wrap(string_cursor+2*n):
            context[i,1] = ngram2id(text[string_cursor+n:string_cursor+2*n])
        else:
            context[i,1] = 0
        string_cursor += 1 
    return data.reshape(batch_size, 1), context

graph_ngram = tf.Graph()
with graph_ngram.as_default():
    #Make a CBOW model
    train_data = tf.placeholder(tf.int32, [ngram_batch_size, 1])
    train_context = tf.placeholder(tf.int32, [ngram_batch_size, 2*skip_window])
    
    #initialize points at random
    embeddings = tf.Variable(tf.random_uniform([ngram_size, embed_size], -1.0, 1.0))
    
    #Sum the vectors given by two n-grams in context
    embed_context = tf.reduce_sum(tf.nn.embedding_lookup(embeddings, train_context), 1)
    
    weights = tf.Variable(
        tf.truncated_normal([ngram_size, embed_size], stddev= 1.0 / math.sqrt(embed_size)))
    biases = tf.zeros([ngram_size])
    
    #Note that context goes where data usually goes and vice-versa
    loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(weights, biases, embed_context,
                                   train_data, num_sampled, ngram_size))
    
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
    
# be able to look up embeddings?

# display nearest bigrams

print('Done')

Done


In [9]:
ngram2vec_train_steps = 100001

with tf.Session(graph=graph_ngram) as session:
    tf.initialize_all_variables().run()
    avg_loss = 0
    for i in range(ngram2vec_train_steps):    
        batch_data, batch_context = ngram2vec_batches(train_text, n, ngram_batch_size)
        feed_dict = {train_data : batch_data,
                    train_context : batch_context}
        _, l, emb = session.run([optimizer, loss, embeddings], feed_dict = feed_dict)
        avg_loss += l
        if i % 10000 == 0:
            print("Step %d: Average Loss %f" % (i, avg_loss / 10000))
            avg_loss = 0
        if i == ngram2vec_train_steps - 1:
            embedding_save = emb

print("Done")     

Step 0: Average Loss 0.000288
Step 10000: Average Loss 0.827908
Step 20000: Average Loss 0.730822
Step 30000: Average Loss 0.734214
Step 40000: Average Loss 0.668134
Step 50000: Average Loss 0.736955
Step 60000: Average Loss 0.669869
Step 70000: Average Loss 0.708820
Step 80000: Average Loss 0.688777
Step 90000: Average Loss 0.688685
Step 100000: Average Loss 0.711851
Done


In [15]:
num_nodes = 64
num_unrollings = 10
batch_size = 100 #CHECK THIS

graphNgramRNN = tf.Graph()

with graphNgramRNN.as_default():
    embeddings = tf.constant(embedding_save)
    
    # Forget gate
    fx = tf.Variable(tf.truncated_normal([embed_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([num_nodes]))
    
    #Input gate
    ix = tf.Variable(tf.truncated_normal([embed_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([num_nodes]))
    
    #Candidate values
    cx = tf.Variable(tf.truncated_normal([embed_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([num_nodes]))
    
    #Output gate
    ox = tf.Variable(tf.truncated_normal([embed_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([num_nodes]))
    
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    w = tf.Variable(tf.truncated_normal([num_nodes, embed_size]))
    b = tf.Variable(tf.zeros([embed_size]))
    
    
    def LSTM_gate(i, o, state):
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        candidate_gate = tf.tanh(tf.matmul(i, cx) + tf.matmul(o, cm) + cb)
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        state = state * forget_gate + (input_gate * candidate_gate)   
        return output_gate * tf.tanh(state), state
    
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
            tf.placeholder(tf.int32, [batch_size]))
    train_data_embed = list(
        map((lambda x: tf.nn.embedding_lookup(embeddings, x)), train_data))
    train_inputs = train_data_embed[:-1]
    train_labels = train_data_embed[1:]
    
    #unroll the LSTM
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = LSTM_gate(i, output, state)
        outputs.append(output)
        
    
    def cosine_similarity(x, y):
        return tf.matmul(tf.transpose(x), y) / (tf.reduce_sum(x) * tf.reduce_sum(y))
    
    with tf.control_dependencies([saved_output.assign(output),
                                 saved_state.assign(state)]):
        logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
        loss = -10 * tf.reduce_sum(cosine_similarity(logits, tf.concat(0, train_labels)))
    
    rate = 0.5
    global_step = tf.Variable(0, trainable=False)
    optimizer = tf.train.GradientDescentOptimizer(rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)
    
    
    
    #Need to convert output vectors back into ngrams
    def embed2ngramID(embeddings, vecs):
        """vecs: (any_size, embed_size),
        embeddings: (ngram_size, embed_size)
        """
        dists = cosine_similarity(embeddings, vecs)
        nearest_guesses = (- 1 * dists).argsort()[:,0]
        return nearest_guesses
    
    test = embed2ngramID(embeddings, logits)
    
    valid_inputs = "Hello world "
    valid_outputs = list()
    valid_output = saved_output
    valid_state = saved_state
    for i in train_inputs:
        output, state = LSTM_gate(i, output, state)
        outputs.append(output)
        
    
print("Done")

ValueError: Dimensions 729 and 1000 are not compatible

In [52]:
num_unrollings=10
rnn_train_steps = 10001

class NgramBatchGenerator(object):
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._text_size // batch_size
        self._cursor = [ offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()
  
    def _next_batch(self):
        """Generate a single batch from the current cursor position in the data."""
        batch = np.zeros(shape=(self._batch_size), dtype=np.int32)
        for b in range(self._batch_size):
            batch[b] = ngram2id(self._text[self._cursor[b]:(self._cursor[b]+n)])
            self._cursor[b] = (self._cursor[b] + n) % (self._text_size - n + 1)
        return batch
  
    def next(self):
        """Generate the next array of batches from the data. The array consists of
        the last batch of the previous array, followed by num_unrollings new ones.
        """
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
            self._last_batch = batches[-1]
        return batches

with tf.Session(graph=graphNgramRNN) as session:
    tf.initialize_all_variables().run()
    text_batch_gen = NgramBatchGenerator(train_text, batch_size, num_unrollings)
    average_loss = 0
    for i in range(rnn_train_steps):
        batches = text_batch_gen.next()
        feed_dict = dict()
        for j in range(num_unrollings + 1):
            feed_dict[train_data[j]] = batches[j]
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += l
        if i % 500 == 0:
            print("Step %d, Average Loss: %f" % (i, average_loss / 500))
            average_loss = 0
    
print("Done")

Step 0, Average Loss: -0.000193
Step 500, Average Loss: -0.012474
Step 1000, Average Loss: -0.023127
Step 1500, Average Loss: -0.032561
Step 2000, Average Loss: -0.038981
Step 2500, Average Loss: -0.041257
Step 3000, Average Loss: -0.035284
Step 3500, Average Loss: -0.046101
Step 4000, Average Loss: -0.043459
Step 4500, Average Loss: -0.041865
Step 5000, Average Loss: -0.052229
Step 5500, Average Loss: -0.042341
Step 6000, Average Loss: -0.048876
Step 6500, Average Loss: -0.056329
Step 7000, Average Loss: -0.046594
Step 7500, Average Loss: -0.045811
Step 8000, Average Loss: -0.090470
Step 8500, Average Loss: -0.022930


KeyboardInterrupt: 

---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.

---