# HW4: RNNLMs

In this assignment, you'll implement an LSTM language model.

Submit your completed notebook through NYU Classes by 9:30 AM on October 27.

## Setup

First, let's load the data as before.

In [3]:
sst_home = './trees'

import re
import random

# Let's do 2-way positive/negative classification instead of 5-way
easy_label_map = {0:0, 1:0, 2:None, 3:1, 4:1}

def load_sst_data(path):
    data = []
    with open(path) as f:
        for i, line in enumerate(f): 
            example = {}
            
            # Strip out the parse information and the phrase labels---we don't need those here
            text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)
            example['text'] = text[1:]
            data.append(example)

    random.seed(1)
    random.shuffle(data)
    return data
     
training_set = load_sst_data(sst_home + '/train.txt')
dev_set = load_sst_data(sst_home + '/dev.txt')
test_set = load_sst_data(sst_home + '/test.txt')

# Note: Unlike with k-nearest neighbors, evaluation here should be fast, and we don't need to
# trim down the dev and test sets. 

Next, we'll convert the data to index vectors.

To simplify your implementation, we'll use a fixed unrolling length of 20. This means that we'll have to expand each sentence into a sequence of 21 word indices. In the conversion process, we'll mark the start of each sentence with a special word symbol `<S>`, mark the end of each sentence (if it occurs within the first 21 words) with a special word symbol `</S>`, mark extra tokens after `</S>` with a special word symbol `<PAD>`, and mark out-of-vocabulary words with `<UNK>`, for unknown. As in the previous assignment, we'll use a very small vocabulary for this assignment, so you'll see `<UNK>` often.
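
As a quick illustration of the conversion just described, here is a minimal sketch on a toy example. The helper `to_padded_indices`, the vocabulary `toy_vocab`, and the shortened sequence length of 8 are all hypothetical names and values chosen for readability; the actual conversion for the assignment is the cell that follows.

```python
START, END, PAD, UNK = "<S>", "</S>", "<PAD>", "<UNK>"

def to_padded_indices(text, word_indices, seq_len):
    """Map a sentence to exactly seq_len vocabulary indices."""
    tokens = [START] + text.lower().split() + [END]
    unk = word_indices[UNK]
    # Truncate to seq_len tokens, replacing out-of-vocabulary words with <UNK>.
    indices = [word_indices.get(tok, unk) for tok in tokens[:seq_len]]
    # Fill any remaining positions after </S> with <PAD>.
    indices += [word_indices[PAD]] * (seq_len - len(indices))
    return indices

# Hypothetical toy vocabulary; the special symbols take indices 0-3.
toy_vocab = {START: 0, END: 1, PAD: 2, UNK: 3, "the": 4, "film": 5, "is": 6}

short = to_padded_indices("The film is dazzling", toy_vocab, seq_len=8)
long = to_padded_indices("The film is the film the film is the", toy_vocab, seq_len=8)
print(short)  # [0, 4, 5, 6, 3, 1, 2, 2]
print(long)   # [0, 4, 5, 6, 4, 5, 4, 5]
```

Note that the long sentence is cut off before its `</S>` symbol, which is why `</S>` only appears when the sentence ends within the window.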

In [4]:
import collections
import numpy as np

def sentence_to_padded_index_sequence(datasets):
    '''Annotates datasets with feature vectors.'''
    
    START = "<S>"
    END = "</S>"
    END_PADDING = "<PAD>"
    UNKNOWN = "<UNK>"
    SEQ_LEN = 21
    
    # Extract vocabulary
    def tokenize(string):
        return string.lower().split()
    
    word_counter = collections.Counter()
    for example in datasets[0]:
        word_counter.update(tokenize(example['text']))
    
    vocabulary = set([word for word in word_counter if word_counter[word] > 25])
    vocabulary = list(vocabulary)
    vocabulary = [START, END, END_PADDING, UNKNOWN] + vocabulary
        
    word_indices = dict(zip(vocabulary, range(len(vocabulary))))
    indices_to_words = {v: k for k, v in word_indices.items()}
        
    for i, dataset in enumerate(datasets):
        for example in dataset:
            example['index_sequence'] = np.zeros((SEQ_LEN), dtype=np.int32)
            
            token_sequence = [START] + tokenize(example['text']) + [END]
            
            for i in range(SEQ_LEN):
                if i < len(token_sequence):
                    if token_sequence[i] in word_indices:
                        index = word_indices[token_sequence[i]]
                    else:
                        index = word_indices[UNKNOWN]
                else:
                    index = word_indices[END_PADDING]
                example['index_sequence'][i] = index
    return indices_to_words, word_indices
    
indices_to_words, word_indices = sentence_to_padded_index_sequence([training_set, dev_set, test_set])

In [15]:
print training_set[18]
print len(word_indices)
print indices_to_words[79]

{'text': 'It arrives with an impeccable pedigree , mongrel pep , and almost indecipherable plot complications .', 'index_sequence': array([  0, 380,   3, 332, 571,   3,   3, 173,   3,   3, 173, 401, 434,
         3,  79,   3, 414,   1,   2,   2,   2], dtype=int32)}
603
plot


## Part 1: Implementation (60%)

Now, using the starter code and hyperparameter values provided below, implement an LSTM language model with dropout on the non-recurrent connections. Use the standard form of the LSTM reflected in the slides (without peepholes). You should only have to edit the marked sections of code to build the base LSTM, though implementing dropout properly may require small changes to the main training loop and to brittle_sampler().

Don't use any TensorFlow code that is specifically built for RNNs. If a TF function has 'recurrent', 'sequence', 'LSTM', or 'RNN' in its name, you should build it yourself instead of using it. (Your version will likely be much simpler, by the way, since these built-in methods are powerful but fairly complex and potentially confusing.)
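
For reference, the computation in one step of a standard LSTM without peepholes can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the required TensorFlow implementation: the helper names (`sigmoid`, `dropout`, `lstm_step`), the `params` dictionary, and its keys are hypothetical. Dropout here is applied only to the input embedding, a non-recurrent connection; the recurrent path through `h_prev` and `c_prev` is left untouched.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dropout(a, keep_rate, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_rate, rescale the rest."""
    mask = rng.random_sample(a.shape) < keep_rate
    return a * mask / keep_rate

def lstm_step(emb, h_prev, c_prev, params, keep_rate, rng):
    """One LSTM step without peepholes; dropout touches only the input embedding."""
    z = np.concatenate([dropout(emb, keep_rate, rng), h_prev], axis=1)
    f = sigmoid(z.dot(params["W_f"]) + params["b_f"])  # forget gate
    i = sigmoid(z.dot(params["W_i"]) + params["b_i"])  # input gate
    o = sigmoid(z.dot(params["W_o"]) + params["b_o"])  # output gate
    g = np.tanh(z.dot(params["W_g"]) + params["b_g"])  # candidate cell update
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.RandomState(0)
batch, emb_dim, dim = 4, 16, 32  # small hypothetical sizes
params = {}
for name in ("f", "i", "o", "g"):
    params["W_" + name] = rng.normal(scale=0.1, size=(emb_dim + dim, dim))
    params["b_" + name] = np.zeros(dim)

h, c = lstm_step(rng.normal(size=(batch, emb_dim)),
                 np.zeros((batch, dim)), np.zeros((batch, dim)),
                 params, keep_rate=0.75, rng=rng)
print(h.shape, c.shape)
```

The other non-recurrent connection, from the hidden state to the softmax layer, would get its own dropout outside this step function.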

We won't be evaluating our model in the conventional way (perplexity on a held-out test set) for a few reasons: to save time, because we have no baseline to compare against, and because overfitting the training set is a less immediate concern with these models than it was with sentence classifiers. Instead, we'll use the value of the cost function to make sure that the model is converging as expected, and we'll use samples drawn from the model to qualitatively evaluate it.
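
For context, perplexity is the exponential of the average per-token negative log-likelihood. The sketch below assumes the losses are in nats, one per predicted token; the function name `perplexity` and the toy data are hypothetical, and nothing in this assignment requires computing it.

```python
import numpy as np

def perplexity(token_losses):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return float(np.exp(np.mean(token_losses)))

# Sanity check: a model that assigns probability 1/8 to every held-out
# token should have a perplexity of (approximately) 8.
uniform_losses = np.full(100, -np.log(1.0 / 8.0))
print(perplexity(uniform_losses))
```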

Tips: 

- You'll need to use `tf.nn.embedding_lookup()`, `tf.nn.sparse_softmax_cross_entropy_with_logits()`, and `tf.split()` at least once each. All three should be easy to Google, though the last homework and the last exercise should show examples of the first two.
- As before, you'll want to initialize your trained parameters using something like `tf.random_normal(..., stddev=0.1)`
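
To make the first two tips concrete, here is a rough NumPy sketch of what an embedding lookup and a sparse softmax cross-entropy compute. The function names below are stand-ins written for this illustration, not the TensorFlow API itself, and the tiny matrices are made up.

```python
import numpy as np

def embedding_lookup(E, ids):
    """Row-select from the embedding matrix: one embedding vector per index."""
    return E[ids]

def sparse_softmax_cross_entropy(logits, labels):
    """Per-example -log softmax(logits)[label], with integer (not one-hot) labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # stabilize exp
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

E = np.arange(12, dtype=np.float64).reshape(4, 3)  # 4 words, 3-dim embeddings
looked_up = embedding_lookup(E, np.array([2, 0]))  # rows 2 and 0 of E
print(looked_up)

logits = np.zeros((2, 4))                          # uniform over 4 words
losses = sparse_softmax_cross_entropy(logits, np.array([1, 3]))
print(losses)                                      # each entry equals log(4)
```

"Sparse" here refers to the labels being word indices rather than one-hot vectors, which is why `self.y` can hold plain integers.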

In [16]:
import tensorflow as tf

In [36]:
class LanguageModel:
    def __init__(self, vocab_size, sequence_length):
        # Define the hyperparameters
        self.learning_rate = 0.3  # Should be about right
        self.training_epochs = 250  # How long to train for - chosen to fit within class time
        self.display_epoch_freq = 1  # How often to test and print out statistics
        self.dim = 32  # The dimension of the hidden state of the RNN
        self.embedding_dim = 16  # The dimension of the learned word embeddings
        self.batch_size = 256  # Somewhat arbitrary - can be tuned, but often tune for speed, not accuracy
        self.vocab_size = vocab_size  # Defined by the file reader above
        self.sequence_length = sequence_length  # Defined by the file reader above
        self.keep_rate = 0.75  # Used in dropout (at training time only, not at sampling time)
        
        # embedding matrix
        self.E = tf.Variable(tf.random_normal([self.vocab_size, self.embedding_dim], stddev=0.1))
        # state
        self.W_rnn = tf.Variable(tf.random_normal([self.embedding_dim + self.dim, self.dim], stddev=0.1))
        self.b_rnn = tf.Variable(tf.random_normal([self.dim], stddev=0.1))
        # forget gate
        self.W_f = tf.Variable(tf.random_normal([self.embedding_dim + self.dim, self.dim], stddev=0.1))
        self.b_f = tf.Variable(tf.random_normal([self.dim], stddev=0.1))
        # input gate
        self.W_i = tf.Variable(tf.random_normal([self.embedding_dim + self.dim, self.dim], stddev=0.1))
        self.b_i = tf.Variable(tf.random_normal([self.dim], stddev=0.1))
        # output gate
        self.W_o = tf.Variable(tf.random_normal([self.embedding_dim + self.dim, self.dim], stddev=0.1))
        self.b_o = tf.Variable(tf.random_normal([self.dim], stddev=0.1))
        # for softmax
        self.W_c = tf.Variable(tf.random_normal([self.dim, self.vocab_size], stddev=0.1))
        self.b_c = tf.Variable(tf.random_normal([self.vocab_size], stddev=0.1))
        
        # Define the input placeholder(s).
        self.keep_rate_ph = tf.placeholder(tf.float32, [])
        self.x = tf.placeholder(tf.int32, [None, self.sequence_length])
        self.y = tf.placeholder(tf.int32, [None, self.sequence_length - 1])
        # split apart x and y for different timesteps
        self.y_slices = tf.split(1, self.sequence_length-1, self.y)
        self.x_slices = tf.split(1, self.sequence_length, self.x)
        
        # Build the rest of the LSTM LM!
        # Define one step of the LSTM
        def step(x, c_prev, h_prev):
            emb = tf.nn.embedding_lookup(self.E, x)
            emb = tf.nn.dropout(emb, self.keep_rate_ph)
            emb_h_prev = tf.concat(1, [emb, h_prev])
            f = tf.nn.sigmoid(tf.matmul(emb_h_prev, self.W_f)  + self.b_f)
            i = tf.nn.sigmoid(tf.matmul(emb_h_prev, self.W_i)  + self.b_i)
            c = f*c_prev + i*tf.nn.tanh(tf.matmul(emb_h_prev, self.W_rnn) + self.b_rnn)
            o = tf.nn.sigmoid(tf.matmul(emb_h_prev, self.W_o)  + self.b_o)
            h = o*tf.nn.tanh(c)
            return h, c
        
        self.h_zero = tf.zeros([self.batch_size, self.dim])
        self.c_zero = tf.zeros([self.batch_size, self.dim])
        
        h_prev = self.h_zero
        c_prev = self.c_zero
            
        # Your model should populate the following four python lists.
        # self.logits should contain one [batch_size, vocab_size]-shaped TF tensor of logits 
        #   for each of the 20 steps of the model.
        # self.costs should contain one [batch_size]-shaped TF tensor of cross-entropy loss 
        #   values for each of the 20 steps of the model.
        # self.h and self.c should each contain one [batch_size, dim]-shaped TF tensor of LSTM
        #   activations for each of the 21 *states* of the model -- one tensor of zeros for the 
        #   starting state followed by one tensor each for the remaining 20 steps.
        # Don't rename any of these variables or change their purpose -- they'll be needed by the
        # pre-built sampler.
        self.logits = []
        self.costs = []
        self.h = [self.h_zero]
        self.c = [self.c_zero]
        
        for t in range(self.sequence_length-1):
            x_t = tf.reshape(self.x_slices[t], [-1])
            y_t = tf.reshape(self.y_slices[t], [-1])
            h_prev, c_prev = step(x_t, c_prev, h_prev)
            h_prev_drop = tf.nn.dropout(h_prev, self.keep_rate_ph)
            logit = tf.matmul(h_prev_drop, self.W_c) + self.b_c
            self.logits.append(logit)
            loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logit, y_t)
            self.costs.append(loss)
            self.h.append(h_prev)
            self.c.append(c_prev)
        
        # Sum costs for each word in each example, but average cost across examples.
        self.costs_tensor = tf.concat(1, [tf.expand_dims(cost, 1) for cost in self.costs])
        self.cost_per_example = tf.reduce_sum(self.costs_tensor, 1)
        self.total_cost = tf.reduce_mean(self.cost_per_example)
            
        # This library call performs the main SGD update equation
        self.optimizer = tf.train.GradientDescentOptimizer(self.learning_rate).minimize(self.total_cost)
        
        # Create an operation to initialize all of the variables defined above
        self.init = tf.initialize_all_variables()
        
        # Create a placeholder for the session that will be shared between training and evaluation
        self.sess = None
        
    def train(self, training_data):
        def get_minibatch(dataset, start_index, end_index):
            indices = range(start_index, end_index)
            vectors = np.vstack([dataset[i]['index_sequence'] for i in indices])
            return vectors
        
        self.sess = tf.Session()
        
        self.sess.run(self.init)
        print 'Training.'

        # Training cycle
        for epoch in range(self.training_epochs):
            # Use the dataset passed to train() rather than the global training_set
            random.shuffle(training_data)
            avg_cost = 0.
            total_batch = int(len(training_data) / self.batch_size)

            # Loop over all batches in epoch
            for i in range(total_batch):
                # Assemble a minibatch of the next B examples
                minibatch_vectors = get_minibatch(training_data, self.batch_size * i, self.batch_size * (i + 1))

                # Run the optimizer to take a gradient step, and also fetch the value of the 
                # cost function for logging
                _, c = self.sess.run([self.optimizer, self.total_cost], 
                                     feed_dict={self.x: minibatch_vectors, 
                                                self.keep_rate_ph: self.keep_rate,
                                                self.y: minibatch_vectors[:,1:]})
                                                                    
                # Compute average loss
                avg_cost += c / (total_batch * self.batch_size)
                
            # Display some statistics about the step
            if (epoch+1) % self.display_epoch_freq == 0:
                print "Epoch:", (epoch+1), "Cost:", avg_cost, "Sample:", self.sample()
    
    def sample(self):
        # This samples a sequence of tokens from the model starting with <S>.
        # We only ever run the first timestep of the model, and use an effective batch size of one
        # but we leave the model unrolled for multiple steps, and use the full batch size to simplify 
        # the training code. This slows things down.

        def brittle_sampler():
            # The main sampling code. Can fail randomly due to rounding errors that yield probabilities
            # that don't sum to one.
            
            word_indices = [0] # 0 here is the "<S>" symbol
            for i in range(self.sequence_length - 1):
                dummy_x = np.zeros((self.batch_size, self.sequence_length))
                dummy_x[0][0] = word_indices[-1]
                feed_dict = {self.x: dummy_x,
                             self.keep_rate_ph: 1.0}
                if i > 0:
                    feed_dict[self.h_zero] = h
                    feed_dict[self.c_zero] = c
                    
                h, c, logits = self.sess.run([self.h[1], self.c[1], self.logits[0]], 
                                             feed_dict=feed_dict)  
                logits = logits[0, :] # Discard all but first batch entry
                exp_logits = np.exp(logits - np.max(logits))
                distribution = exp_logits / exp_logits.sum()
                sampled_index = np.flatnonzero(np.random.multinomial(1, distribution))[0]
                word_indices.append(sampled_index)
            words = [indices_to_words[index] for index in word_indices]
            return ' '.join(words)
        
        while True:
            try:
                sample = brittle_sampler()
                return sample
            except ValueError as e:  # Retry if we experience a random failure.
                pass

Now let's train it.

Once you're confident your model is doing what you want, let it run for the full 250 epochs. This will take some time, likely between five and thirty minutes. If it takes much longer on a reasonably modern laptop (more than an hour), that suggests serious problems with your implementation. A properly implemented model with dropout should reach an average cost of less than 0.22 quickly, and then slowly improve from there. We train the model for a fairly long time because these small improvements in cost correspond to fairly large improvements in sample quality.
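
A note on the scale of these numbers: reading the logging line in `train()`, the printed cost appears to be the per-example summed cross-entropy divided once more by the batch size. Under that reading (an interpretation of the starter code, not something the assignment states), the sketch below converts a printed cost back to average nats per predicted token. The helper name `per_token_nats` is hypothetical.

```python
BATCH_SIZE = 256  # self.batch_size in the starter code
NUM_STEPS = 20    # sequence_length - 1 predictions per sentence

def per_token_nats(printed_cost, batch_size=BATCH_SIZE, num_steps=NUM_STEPS):
    """Undo the extra division by batch_size, then average over time steps."""
    per_example = printed_cost * batch_size
    return per_example / num_steps

for printed in (0.31, 0.22, 0.202):
    print(printed, round(per_token_nats(printed), 3))
```

By this estimate, a printed cost of 0.22 corresponds to roughly 2.8 nats per predicted token. That average includes positions where the target is `<PAD>`, which are easy to predict.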

Samples from a trained model should have coherent portions, but they will not resemble interpretable English sentences. Here are three examples from a model with a cost value of 0.202:

`<S> the good <UNK> and <UNK> and <UNK> <UNK> with predictable and <UNK> , but also does one of -lrb- <UNK>`

`<S> <UNK> has <UNK> actors seems done <UNK> would these <UNK> <UNK> to <UNK> <UNK> <UNK> 're <UNK> to mind .`

`<S> an action story that was because the <UNK> <UNK> are when <UNK> as ``` <UNK> '' ' it is any`

`-lrb-` and `-rrb-` are the way that left and right parentheses are represented in the corpus.

In [37]:
model = LanguageModel(len(word_indices), 21)
model.train(training_set)

Training.
Epoch: 1 Cost: 0.310543827938 Sample: <S> contrived <UNK> <UNK> portrait matter <UNK> that is <UNK> engaging summer </S> </S> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
Epoch: 2 Cost: 0.265469183524 Sample: <S> <UNK> , time manages of that an sure as , through he is other mood already place it <UNK> full
Epoch: 3 Cost: 0.257904000355 Sample: <S> performances <UNK> <UNK> <UNK> <UNK> out more a <UNK> by of is <UNK> one <UNK> <UNK> <UNK> done ca too
Epoch: 4 Cost: 0.252665686788 Sample: <S> a shows be in have love while this one to his <UNK> characters film the <UNK> -lrb- the <UNK> with
Epoch: 5 Cost: 0.247719584541 Sample: <S> if american a does but is <UNK> <UNK> that <UNK> genre pleasure and this no , or to <UNK> in
Epoch: 6 Cost: 0.244253746939 Sample: <S> the of movie is <UNK> by <UNK> at <UNK> in tale anything <UNK> <UNK> <UNK> social the him <UNK> up
Epoch: 7 Cost: 0.241856727636 Sample: <S> as the he , this <UNK> and it <UNK> <UNK> is the still <UNK> not take , end 's ri

Now we can draw as many samples as we like.

In [39]:
model.sample()

"<S> the film 's humor in a <UNK> movie , but <UNK> <UNK> and also <UNK> sex <UNK> that <UNK> of"

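The sampling step inside `brittle_sampler()` can be looked at in isolation. The sketch below is one possible variant, not the starter code's method: it casts the logits to float64 before the softmax and draws with `np.random.choice` instead of `np.random.multinomial`. The assumption is that float32 rounding is what occasionally makes the probabilities fail the sum-to-one check; the function name `sample_index` is hypothetical.

```python
import numpy as np

def sample_index(logits, rng=np.random):
    """Draw one index from softmax(logits)."""
    logits = np.asarray(logits, dtype=np.float64)  # assumed fix for float32 rounding
    exp_logits = np.exp(logits - logits.max())     # subtract max to avoid overflow
    distribution = exp_logits / exp_logits.sum()
    return int(rng.choice(len(distribution), p=distribution))

rng = np.random.RandomState(0)
# With one logit far above the others, nearly all of the mass is on index 2.
draws = [sample_index([0.0, 0.0, 50.0], rng) for _ in range(100)]
print(set(draws))
```

If this variant were swapped into the sampler, the retry loop around it could stay in place as a safeguard.
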
## Part 2: Questions (40%)

**Question 1:** Looking at the samples that your model produced towards the end of training, point out three properties of (written) English that it seems to have learned.

**Answer:** 
1. It's learnt that "be" should be followed by nouns or adjectives.  
2. It's learnt to use prepositions to express a position relative to an object, like "in the theater". 
3. It's learnt to use "when" or "if" to express some conditions.

**Question 2:** If we could make the model as big as we wanted, train as long as we wanted, and adjust or remove dropout at will, could we ever get the model to reach a cost value of 0.0? In a single sentence, say why.

**Answer:** No. For this dataset in particular, many words are replaced with `<UNK>`, so it's unlikely that we could correctly predict those unknown words, or use them as context to predict the words that follow. Also, the optimization used here only finds a local optimum, and a global optimum is not guaranteed.

**Question 3:** Give an example of a situation where the LSTM language model's ability to propagate information across many steps (when trained for long enough, at least) would cause it to reach a better cost value than a model like a simple RNN without that ability. (Answer in one sentence or so.)

**Answer:** If the word we want to predict at the end of a long sentence is closely associated with words at the very beginning of that sentence, then an LSTM will do better than a simple RNN.

**Question 4:** Would the model be any worse if we were to just delete unknown words instead of using an `<UNK>` token? (Answer in one sentence or so.)

**Answer:** Yes, two non-relevant words could be "forced" to be associated with each other by our model if the intermediate words between them are deleted.