# HW5: A GRU-pair model for SNLI

In this assignment we'll build and train a GRU RNN-based model for SNLI.

## Setup

You'll need to download and unzip SNLI, which you can find [here](http://nlp.stanford.edu/projects/snli/). Set `snli_home` below to point to it. The following block of code loads it.

In [1]:
snli_home = '../snli_1.0'

import re
import random
import json

LABEL_MAP = {
    "entailment": 0,
    "neutral": 1,
    "contradiction": 2
}

def load_snli_data(path):
    data = []
    with open(path) as f:
        for line in f:
            loaded_example = json.loads(line)
            if loaded_example["gold_label"] not in LABEL_MAP:
                continue
            loaded_example["label"] = LABEL_MAP[loaded_example["gold_label"]]
            data.append(loaded_example)
        random.seed(1)
        random.shuffle(data)
    return data
     
training_set = load_snli_data(snli_home + '/snli_1.0_train.jsonl')
dev_set = load_snli_data(snli_home + '/snli_1.0_dev.jsonl')
test_set = load_snli_data(snli_home + '/snli_1.0_test.jsonl')

# Note: Unlike with k-nearest neighbors, evaluation here should be fast, and we don't need to
# trim down the dev and test sets. 

Next, we'll convert the data to index vectors in the same way that we've done for in-class exercises with RNN-based sentiment models. A few notes:

- We use a sequence length of only 10, which is short enough that we're truncating a large fraction of sentences.
- Tokenization is easy here because we're relying on the output of a parser (which does tokenization as part of parsing), just as with the SST corpus that we've been using until now. Note that we use the 'sentence1_binary_parse' field of each example rather than the human-readable 'sentence1'.
- We're using a moderately large vocabulary (for a class exercise) of about 12k words.
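
To make the padding and truncation behavior concrete, here is a minimal sketch on the two sentences from the sample example printed further down. `pad_or_truncate` is a hypothetical helper written for illustration (it is not part of the notebook), and it assumes the same convention as the notebook's preprocessing loop: short sentences are padded on the left, and long sentences keep only their final `SEQ_LEN` tokens.

```python
import re

SEQ_LEN = 10       # same value as the notebook uses
PADDING = "<PAD>"  # same padding symbol as the notebook uses

def tokenize(parse_string):
    # Drop the parentheses of the binary parse, then lowercase and split on whitespace.
    return re.sub(r'\(|\)', '', parse_string).lower().split()

def pad_or_truncate(tokens, seq_len=SEQ_LEN):
    # Hypothetical helper: keep only the final seq_len tokens of a long
    # sentence, and left-pad a short one up to seq_len.
    if len(tokens) >= seq_len:
        return tokens[-seq_len:]
    return [PADDING] * (seq_len - len(tokens)) + tokens

short = tokenize('( ( ( People ( on ( a bike ) ) ) ( on ( a beach ) ) ) . )')
long_ = tokenize('( ( Five men ) ( ( are ( ( ( playing ( musical instruments ) ) '
                 'together ) ( on ( a stage ) ) ) ) . ) )')

print(pad_or_truncate(short))  # 8 tokens, so two <PAD> symbols come first
print(pad_or_truncate(long_))  # 11 tokens, so the first one ('five') is dropped
```

Note that the final period counts as a token, so an eleven-token sentence here has only ten ordinary words.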

In [3]:
SEQ_LEN = 10

import collections
import numpy as np

def sentences_to_padded_index_sequences(datasets):
    '''Annotates datasets with feature vectors.'''
    
    PADDING = "<PAD>"
    UNKNOWN = "<UNK>"
    
    # Extract vocabulary
    def tokenize(string):
        string = re.sub(r'\(|\)', '', string)
        return string.lower().split()
    
    word_counter = collections.Counter()
    for example in datasets[0]:
        word_counter.update(tokenize(example['sentence1_binary_parse']))
        word_counter.update(tokenize(example['sentence2_binary_parse']))
        
    vocabulary = set([word for word in word_counter if word_counter[word] > 10])
    vocabulary = list(vocabulary)
    vocabulary = [PADDING, UNKNOWN] + vocabulary
        
    word_indices = dict(zip(vocabulary, range(len(vocabulary))))
    indices_to_words = {v: k for k, v in word_indices.items()}
        
    for dataset in datasets:
        for example in dataset:
            for sentence in ['sentence1_binary_parse', 'sentence2_binary_parse']:
                example[sentence + '_index_sequence'] = np.zeros((SEQ_LEN), dtype=np.int32)

                token_sequence = tokenize(example[sentence])
                padding = SEQ_LEN - len(token_sequence)

                for i in range(SEQ_LEN):
                    if i >= padding:
                        if token_sequence[i - padding] in word_indices:
                            index = word_indices[token_sequence[i - padding]]
                        else:
                            index = word_indices[UNKNOWN]
                    else:
                        index = word_indices[PADDING]
                    example[sentence + '_index_sequence'][i] = index
    return indices_to_words, word_indices
    
indices_to_words, word_indices = sentences_to_padded_index_sequences([training_set, dev_set, test_set])

In [3]:
print training_set[6]
print len(word_indices)

{u'annotator_labels': [u'contradiction'], u'sentence2_parse': u'(ROOT (NP (NP (NNS People)) (PP (IN on) (NP (DT a) (NN bike))) (PP (IN on) (NP (DT a) (NN beach))) (. .)))', u'sentence1_binary_parse': u'( ( Five men ) ( ( are ( ( ( playing ( musical instruments ) ) together ) ( on ( a stage ) ) ) ) . ) )', u'captionID': u'2430018178.jpg#1', 'sentence1_binary_parse_index_sequence': array([ 6060, 11028,  7751,  3574, 10502,  3750, 10482,  4668,  2910,   472], dtype=int32), 'label': 2, u'sentence2_binary_parse': u'( ( ( People ( on ( a bike ) ) ) ( on ( a beach ) ) ) . )', u'pairID': u'2430018178.jpg#1r1c', u'sentence2': u'People on a bike on a beach.', u'sentence1_parse': u'(ROOT (S (NP (CD Five) (NNS men)) (VP (VBP are) (VP (VBG playing) (NP (JJ musical) (NNS instruments)) (ADVP (RB together)) (PP (IN on) (NP (DT a) (NN stage))))) (. .)))', 'sentence2_binary_parse_index_sequence': array([    0,     0,   592, 10482,  4668,  3030, 10482,  4668,  3180,   472], dtype=int32), u'gold_label': u

Now we load GloVe. You'll need the same file that you used for the in-class exercise on word embeddings.

In [4]:
glove_home = '../'
words_to_load = 25000

with open(glove_home + 'glove.6B.50d.txt') as f:
    loaded_embeddings = np.zeros((len(word_indices), 50), dtype='float32')
    for i, line in enumerate(f):
        if i >= words_to_load: 
            break
        
        s = line.split()
        if s[0] in word_indices:
            loaded_embeddings[word_indices[s[0]], :] = np.asarray(s[1:])
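
One consequence of the loading loop above is that any vocabulary word not found among the first `words_to_load` GloVe lines keeps an all-zero initial embedding. The sketch below illustrates this on toy data only: the three-dimensional vectors, the tiny vocabulary, and the cap of two lines are all made up for illustration.

```python
import io

# Toy stand-in for a GloVe text file: one word per line followed by its vector.
# (Illustrative 3-d vectors; the real file used above has 50 dimensions.)
toy_glove = io.StringIO(
    u"the 0.1 0.2 0.3\n"
    u"dog 0.4 0.5 0.6\n"
    u"cat 0.7 0.8 0.9\n"
)
toy_word_indices = {"<PAD>": 0, "<UNK>": 1, "dog": 2, "cat": 3, "frisbee": 4}
words_to_load = 2  # only read the first two lines, mimicking the cap used above

embeddings = [[0.0] * 3 for _ in toy_word_indices]
found = set()
for i, line in enumerate(toy_glove):
    if i >= words_to_load:
        break
    s = line.split()
    if s[0] in toy_word_indices:
        embeddings[toy_word_indices[s[0]]] = [float(v) for v in s[1:]]
        found.add(s[0])

# Words that keep a zero vector: special symbols, words past the cap,
# and words that do not appear in the file at all.
missing = sorted(w for w in toy_word_indices if w not in found)
print(missing)
```

Because `self.E` is trained, these zero rows can still move away from zero during training.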

Now we set up an evaluation function as before.

In [5]:
def evaluate_classifier(classifier, eval_set):
    correct = 0
    hypotheses = classifier(eval_set)
    for i, example in enumerate(eval_set):
        hypothesis = hypotheses[i]
        if hypothesis == example['label']:
            correct += 1        
    return correct / float(len(eval_set))
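
As a quick sanity check of the evaluation interface, the sketch below runs it on a trivial baseline. `always_entailment` and `toy_set` are hypothetical names used only for this illustration, and the evaluation function is restated so that the snippet runs on its own.

```python
def evaluate_classifier(classifier, eval_set):
    # Same logic as the notebook's function, restated so this sketch runs alone.
    hypotheses = classifier(eval_set)
    correct = sum(1 for i, example in enumerate(eval_set)
                  if hypotheses[i] == example['label'])
    return correct / float(len(eval_set))

def always_entailment(examples):
    # Hypothetical baseline: predict label 0 ("entailment") for every example.
    return [0 for _ in examples]

toy_set = [{'label': 0}, {'label': 1}, {'label': 2}, {'label': 0}]
print(evaluate_classifier(always_entailment, toy_set))  # 0.5
```

Any callable that maps a list of examples to a list of predicted labels can be passed in the same way.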

## Part 1: Implementation (70%)

Expand the below starter code to build a GRU RNN-pair NLI model. The model should feature the following:

- 50D word embeddings initialized with GloVe and trained. (Using self.E should provide this.)
- Two GRUs (sharing one set of parameters) that read each sentence independently and produce one $\vec{h}_t$ vector for each. We'll call these two vectors $\vec{h}_{p}$ for the premise (first sentence) and $\vec{h}_{h}$ for the hypothesis (second sentence).
- A combination layer in the style of [Mou et al. 15](https://arxiv.org/abs/1512.08422)'s heuristic matching layer: A ReLU layer with the following four vectors as inputs:
  - $\vec{h}_p$, $\vec{h}_h$, $\vec{h}_p - \vec{h}_h$, $\vec{h}_p * \vec{h}_h$
- A three-way softmax classifier whose inputs are the outputs of the combination layer.
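
The input to the combination layer can be sketched in plain Python as follows. This is an illustration on two-dimensional toy vectors rather than the TensorFlow implementation, and `matching_features` is a hypothetical name.

```python
def matching_features(h_p, h_h):
    # Concatenate [h_p, h_h, h_p - h_h, h_p * h_h]; the product is elementwise.
    diff = [p - h for p, h in zip(h_p, h_h)]
    prod = [p * h for p, h in zip(h_p, h_h)]
    return list(h_p) + list(h_h) + diff + prod

h_p = [1.0, 2.0]   # toy premise vector
h_h = [0.5, -1.0]  # toy hypothesis vector
features = matching_features(h_p, h_h)
print(features)       # [1.0, 2.0, 0.5, -1.0, 0.5, 3.0, 0.5, -2.0]
print(len(features))  # 8, i.e. four times the hidden dimension
```

The combination layer's weight matrix therefore needs four times the GRU hidden dimension as its input width.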

As in the previous assignment, you may use code from in-class exercises, but you should not use any specialized TF functions for LSTMs or RNNs.
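
Since the recurrence has to be written by hand, it may help to have the GRU update written out. One common formulation is sketched here, using row vectors and the same gate convention as the `step` function in the code cell that follows. $\vec{x}_t$ is the embedding of the word at position $t$, $[\,\cdot\,;\,\cdot\,]$ is concatenation, $\sigma$ is the logistic sigmoid, and $\odot$ is the elementwise product. The parameter names mirror that code; other references swap the roles of $\vec{z}_t$ and $1 - \vec{z}_t$.

$$
\begin{aligned}
\vec{z}_t &= \sigma\left([\vec{x}_t; \vec{h}_{t-1}]\, W_z + \vec{b}_z\right) \\
\vec{r}_t &= \sigma\left([\vec{x}_t; \vec{h}_{t-1}]\, W_r + \vec{b}_r\right) \\
\tilde{h}_t &= \tanh\left([\vec{x}_t; \vec{r}_t \odot \vec{h}_{t-1}]\, W_{rnn} + \vec{b}_{rnn}\right) \\
\vec{h}_t &= (1 - \vec{z}_t) \odot \vec{h}_{t-1} + \vec{z}_t \odot \tilde{h}_t
\end{aligned}
$$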


In [6]:
import tensorflow as tf

In [49]:
class RNNEntailmentClassifier:
    def __init__(self, vocab_size, sequence_length):
        # Define the hyperparameters
        self.learning_rate = 0.85  # Should be about right
        self.training_epochs = 25  # How long to train for - chosen to fit within class time
        self.display_epoch_freq = 1  # How often to test and print out statistics
        self.dim = 16  # The dimension of the hidden state of the RNN
        self.combination_dim = 32  # The dimension of the hidden state of the combination layer
        self.embedding_dim = 50  # The dimension of the learned word embeddings
        self.batch_size = 256  # Somewhat arbitrary - can be tuned, but often tune for speed, not accuracy
        self.vocab_size = vocab_size  # Defined by the file reader above
        self.sequence_length = sequence_length  # Defined by the file reader above
        
        # Define the parameters
        self.E = tf.Variable(loaded_embeddings)
        
        self.W_cl = tf.Variable(tf.random_normal([self.combination_dim, 3], stddev=0.1))
        self.b_cl = tf.Variable(tf.random_normal([3], stddev=0.1))
        
        # Define the rest of the parameters
        self.W_rnn = tf.Variable(tf.random_normal([self.embedding_dim + self.dim, self.dim], stddev=0.1))
        self.b_rnn = tf.Variable(tf.random_normal([self.dim], stddev=0.1))
        
        self.W_r = tf.Variable(tf.random_normal([self.embedding_dim + self.dim, self.dim], stddev=0.1))
        self.b_r = tf.Variable(tf.random_normal([self.dim], stddev=0.1))
        
        self.W_z = tf.Variable(tf.random_normal([self.embedding_dim + self.dim, self.dim], stddev=0.1))
        self.b_z = tf.Variable(tf.random_normal([self.dim], stddev=0.1))
        
        # parameter for the combination layer
        self.combination = tf.Variable(tf.random_normal([4 * self.dim, self.combination_dim]))
        # Define the placeholders
        self.premise_x = tf.placeholder(tf.int32, [None, self.sequence_length])
        self.hypothesis_x = tf.placeholder(tf.int32, [None, self.sequence_length])
        self.y = tf.placeholder(tf.int32, [None])
        
        # Split up the inputs into individual tensors
        self.x_premise_slices = tf.split(1, self.sequence_length, self.premise_x)
        self.x_hypothesis_slices = tf.split(1, self.sequence_length, self.hypothesis_x)
        
        # Define one step of the RNN
        def step(x, h_prev):
            emb = tf.nn.embedding_lookup(self.E, x)
            emb_h_prev = tf.concat(1, [emb, h_prev])
            z = tf.nn.sigmoid(tf.matmul(emb_h_prev, self.W_z) + self.b_z)
            r = tf.nn.sigmoid(tf.matmul(emb_h_prev, self.W_r) + self.b_r)
            emb_r_h_prev = tf.concat(1, [emb, r * h_prev])
            h_tilde = tf.nn.tanh(tf.matmul(emb_r_h_prev, self.W_rnn) + self.b_rnn)
            h = (1. - z) * h_prev + z * h_tilde
            return h
        
        self.h_zero = tf.zeros(tf.pack([tf.shape(self.premise_x)[0], self.dim]))
        h_prev1 = self.h_zero
        # Unroll the first RNN
        for t in range(self.sequence_length):
            x_t1 = tf.reshape(self.x_premise_slices[t], [-1])
            h_prev1 = step(x_t1, h_prev1)
        
        # Unroll the second RNN
        h_prev2 = tf.zeros(tf.pack([tf.shape(self.hypothesis_x)[0], self.dim]))
        for t in range(self.sequence_length):
            x_t2 = tf.reshape(self.x_hypothesis_slices[t], [-1])
            h_prev2 = step(x_t2, h_prev2)
        # Build the combination layer
        combination_in = tf.concat(1, [h_prev1, h_prev2, tf.sub(h_prev1, h_prev2), tf.mul(h_prev1, h_prev2)])
        combination_out = tf.nn.relu(tf.matmul(combination_in, self.combination))
        # Compute the logits
        self.logits = tf.matmul(combination_out, self.W_cl) + self.b_cl
        
        # Define the cost function (here, the softmax exp and sum are built in)
        self.total_cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(self.logits, self.y))
        
        # This performs the main SGD update equation with gradient clipping
        optimizer_obj = tf.train.GradientDescentOptimizer(learning_rate=self.learning_rate)
        gvs = optimizer_obj.compute_gradients(self.total_cost)
        capped_gvs = [(tf.clip_by_norm(grad, 5.0), var) for grad, var in gvs if grad is not None]
        self.optimizer = optimizer_obj.apply_gradients(capped_gvs)
        
        # Create an operation that initializes all variables to their starting values
        self.init = tf.initialize_all_variables()
        
        # Create a placeholder for the session that will be shared between training and evaluation
        self.sess = None
        
    def train(self, training_data, dev_data):
        def get_minibatch(dataset, start_index, end_index):
            indices = range(start_index, end_index)
            premise_vectors = np.vstack([dataset[i]['sentence1_binary_parse_index_sequence'] for i in indices])
            hypothesis_vectors = np.vstack([dataset[i]['sentence2_binary_parse_index_sequence'] for i in indices])
            labels = [dataset[i]['label'] for i in indices]
            return premise_vectors, hypothesis_vectors, labels
        
        self.sess = tf.Session()
        
        self.sess.run(self.init)
        print 'Training.'

        # Training cycle
        for epoch in range(self.training_epochs):
            random.shuffle(training_data)
            avg_cost = 0.
            total_batch = int(len(training_data) / self.batch_size)
            
            # Loop over all batches in epoch
            for i in range(total_batch):
                # Assemble a minibatch of the next B examples
                minibatch_premise_vectors, minibatch_hypothesis_vectors, minibatch_labels = get_minibatch(
                    training_data, self.batch_size * i, self.batch_size * (i + 1))

                # Run the optimizer to take a gradient step, and also fetch the value of the 
                # cost function for logging
                _, c = self.sess.run([self.optimizer, self.total_cost], 
                                     feed_dict={self.premise_x: minibatch_premise_vectors,
                                                self.hypothesis_x: minibatch_hypothesis_vectors,
                                                self.y: minibatch_labels})
                                                                    
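                # Accumulate the logged cost. Note that c is already a per-example mean,
                # so the value printed is the mean batch cost divided by the batch size.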
                avg_cost += c / (total_batch * self.batch_size)
                                
            # Display some statistics about the epoch
            # Evaluating on only the first 1000 dev and training examples keeps this fast
            if (epoch+1) % self.display_epoch_freq == 0:
                print "Epoch:", (epoch+1), "Cost:", avg_cost, \
                    "Dev acc:", evaluate_classifier(self.classify, dev_data[0:1000]), \
                    "Train acc:", evaluate_classifier(self.classify, training_data[0:1000])  
    
    def classify(self, examples):
        # This classifies a list of examples
        premise_vectors = np.vstack([example['sentence1_binary_parse_index_sequence'] for example in examples])
        hypothesis_vectors = np.vstack([example['sentence2_binary_parse_index_sequence'] for example in examples])
        logits = self.sess.run(self.logits, feed_dict={self.premise_x: premise_vectors,
                                                       self.hypothesis_x: hypothesis_vectors})
        return np.argmax(logits, axis=1)
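
The training code above clips each gradient tensor separately with `tf.clip_by_norm(grad, 5.0)`. As a rough illustration of what clipping by L2 norm does to a single gradient, here is a pure-Python sketch. It is a stand-in written for this explanation, not TensorFlow's implementation, and it treats the gradient as a flat list of numbers.

```python
import math

def clip_by_norm(values, clip_norm):
    # Rescale the vector so that its L2 norm is at most clip_norm.
    # Vectors that are already short enough are returned unchanged.
    norm = math.sqrt(sum(v * v for v in values))
    if norm <= clip_norm:
        return list(values)
    return [v * clip_norm / norm for v in values]

small = clip_by_norm([3.0, 4.0], 5.0)  # norm is exactly 5, so left unchanged
large = clip_by_norm([6.0, 8.0], 5.0)  # norm is 10, so scaled down by half
print(small)  # [3.0, 4.0]
print(large)  # [3.0, 4.0]
```

Clipping changes the length of the gradient but not its direction, which is why it can stabilize training at a large learning rate without changing which way each update points.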

Run the model below. Your goal is dev set performance above 70% within the first 25 epochs, though a successful model may reach as high as 77% depending on your random seed and Python version.

Since epochs over the half-million example SNLI corpus are slow, you may wish to debug your model using only a small subset of SNLI by passing training_set[:10000] into classifier.train as its first argument.

In [50]:
classifier = RNNEntailmentClassifier(len(word_indices), SEQ_LEN)
classifier.train(training_set, dev_set)

Training.
Epoch: 1 Cost: 0.004292664038 Dev acc: 0.452527743527 Train acc: 0.441943127962
Epoch: 2 Cost: 0.00400245104317 Dev acc: 0.542540073983 Train acc: 0.508960573477
Epoch: 3 Cost: 0.0037016659873 Dev acc: 0.62392108508 Train acc: 0.610108303249
Epoch: 4 Cost: 0.00334766513846 Dev acc: 0.66954377312 Train acc: 0.649144254279
Epoch: 5 Cost: 0.00317433163565 Dev acc: 0.695437731196 Train acc: 0.646989374262
Epoch: 6 Cost: 0.00304999522646 Dev acc: 0.695437731196 Train acc: 0.650793650794
Epoch: 7 Cost: 0.00297487027203 Dev acc: 0.709001233046 Train acc: 0.693333333333
Epoch: 8 Cost: 0.00292156697718 Dev acc: 0.712700369914 Train acc: 0.735115431349
Epoch: 9 Cost: 0.00288169242589 Dev acc: 0.72009864365 Train acc: 0.707107843137
Epoch: 10 Cost: 0.00284617704552 Dev acc: 0.711467324291 Train acc: 0.716049382716
Epoch: 11 Cost: 0.00281861427149 Dev acc: 0.734895191122 Train acc: 0.720144752714
Epoch: 12 Cost: 0.00279511099145 Dev acc: 0.736128236745 Train acc: 0.736028537455
Epoch: 13

## Part 2: Questions (30%)

**Question 1:** Focusing only on the performance of your model on the first ten epochs of training, would adding L2 regularization or dropout help your dev set performance?

**Answer:** Probably not much. In the first ten epochs the model has not yet overfit the data: train accuracy and dev accuracy stay close to each other (dev accuracy is sometimes even higher), so generalization is not the main bottleneck this early in training and regularization has little to gain.

**Question 2:** Write a short script to test the model's performance on dev set examples that contain sentences of more than ten words and those that don't contain such sentences. How does length impact performance?

In [55]:
def evaluate_classifier2(classifier, eval_set):
    correct_more_than_10 = 0
    correct_less_than_10 = 0
    total_more_than_10 = 0
    total_less_than_10 = 0
    hypotheses = classifier(eval_set)
    for i, example in enumerate(eval_set):
        hypothesis = hypotheses[i]
        if  len(example['sentence1'].split(' ')) > 10 or len(example['sentence2'].split(' ')) > 10:
            total_more_than_10 += 1
            if hypothesis == example['label']: 
                correct_more_than_10 += 1
        else:
            total_less_than_10 += 1
            if hypothesis == example['label']: 
                correct_less_than_10 += 1
    return correct_more_than_10 / float(total_more_than_10), correct_less_than_10 / float(total_less_than_10)

correct_more_than_10, correct_less_than_10 = evaluate_classifier2(classifier.classify, dev_set)
print "The performance on dev set examples that contain sentences of more than ten words is " + str(correct_more_than_10)
print "The performance on dev set examples that do not contain such sentences is " + str(correct_less_than_10)

The performance on dev set examples that contain sentences of more than ten words is 0.688998362852
The performance on dev set examples that do not contain such sentences is 0.727332692923


From the results, accuracy on examples containing a sentence of more than ten words (about 68.9%) is lower than on examples whose sentences both have ten words or fewer (about 72.7%). In general, a sentence longer than the model's sequence length gets truncated, so we should expect lower accuracy on such examples than on examples that fit within the sequence length.
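
One caveat about this split: the script counts whitespace-separated words in the raw sentence, while the model's preprocessing counts parse tokens, where punctuation is a separate token. The sketch below shows how the two counts can disagree. The raw sentence string is reconstructed here from the parse in the sample output and should be treated as an assumption, and `parse_token_count` is a hypothetical helper.

```python
import re

SEQ_LEN = 10  # the model's sequence length from earlier in the notebook

def parse_token_count(binary_parse):
    # Count tokens the way the preprocessing does:
    # strip parentheses, then split on whitespace.
    return len(re.sub(r'\(|\)', '', binary_parse).split())

example = {
    # Assumed raw text, reconstructed from the parse below.
    'sentence1': 'Five men are playing musical instruments together on a stage.',
    'sentence1_binary_parse': '( ( Five men ) ( ( are ( ( ( playing ( musical instruments ) ) '
                              'together ) ( on ( a stage ) ) ) ) . ) )',
}

words = len(example['sentence1'].split(' '))
tokens = parse_token_count(example['sentence1_binary_parse'])
print(words)   # 10, so the script above files this under "not more than ten words"
print(tokens)  # 11, so the model still truncates it to SEQ_LEN tokens
```

Splitting on the parse token count instead would line the two groups up exactly with which examples the model truncates.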

**Question 3:** The combination layer uses four different types of input feature. If we skipped the combination layer and fed these features into the softmax classifier layer directly, one of these features would become almost entirely uninformative. Which one is it? (Hint: This is true with SNLI, but may not be true with other corpora.)

**Answer:** It's $\vec{h}_p - \vec{h}_h$