In [89]:
from IPython.display import IFrame

In CBOW, the position of the words does not matter. The following sentences:

    it was not good, it was actually quite bad
    it was not bad, it was actually quite good

are treated exactly the same under CBOW. This is why n-grams are much more informative than a bag of words.
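
A minimal sketch of this point in plain Python. The tokenizer and the helper names are illustrative assumptions, not part of any CBOW library. The two sentences have identical bags of words, but their bags of bigrams differ:

```python
from collections import Counter

def tokenize(sentence):
    # crude whitespace tokenizer that drops commas (an assumption for this sketch)
    return sentence.replace(",", "").split()

def bag_of_words(tokens):
    # unordered word counts: all position information is lost
    return Counter(tokens)

def bag_of_bigrams(tokens):
    # counts of adjacent word pairs: keeps local word order
    return Counter(zip(tokens, tokens[1:]))

s1 = tokenize("it was not good, it was actually quite bad")
s2 = tokenize("it was not bad, it was actually quite good")

same_bow = bag_of_words(s1) == bag_of_words(s2)
same_bigrams = bag_of_bigrams(s1) == bag_of_bigrams(s2)
print(same_bow, same_bigrams)
```

The bigram `("not", "good")` only occurs in the first sentence and `("not", "bad")` only in the second, which is exactly the information the bag of words throws away.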

One can use an embedding on n-grams, leading for example to a CBOW model over bigrams. The issue with this is that the number of rows of the embedding matrix grows exponentially in n, so most n-grams are seen rarely or never. If one does not have enough data, the embeddings of the bigrams "quite good" and "very good" will not be similar enough, even though they should be.
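
To get a feeling for the growth, here is a small back-of-the-envelope sketch. It assumes a dense table with one row per possible n-gram, which is an upper bound (in practice one would only store n-grams that actually occur). The vocabulary size and embedding size are the ones used later in this notebook; `embedding_rows` is a hypothetical helper.

```python
def embedding_rows(vocab_size, n):
    # one row per possible n-gram over the vocabulary
    return vocab_size ** n

VOCAB_SIZE = 18284   # vocabulary size reported later in this notebook
EMB_SIZE = 64        # word embedding size used later in this notebook

for n in (1, 2, 3):
    rows = embedding_rows(VOCAB_SIZE, n)
    print("n = {}: {} rows, {} parameters".format(n, rows, rows * EMB_SIZE))
```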

### RNN

### RNNs are used to capture long distance relationships.

Examples:
   * Gender
      * **He** does not have confidence in **himself**
      * **She** does not have confidence in **herself**
   * Reference to 'it' (Winograd Schema Challenge)
      * The **trophy** will not fit in the suitcase because **it** was very big
      * The trophy will not fit in the **suitcase** because **it** was very small
      
### What are RNNs used for:
   * Read a whole sentence and make a prediction, for example sentiment prediction. The difficulty here is that the model makes a prediction only at the end, so the RNN has to capture the history well.
   * Represent context within a sentence, for example POS tagging. This is not as hard to train as making a prediction at the end of the sentence.
   
### LSTM
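   * A solution to the vanishing gradient issue. The basic idea is that the connections between time steps are additive rather than multiplicative, so the gradient is not reduced at every step.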
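
The vanishing gradient issue can be illustrated with a deliberately simplified scalar sketch. This is not an LSTM: a real LSTM cell also has gates, including a forget gate that multiplies the cell state. The function names and the toy numbers are assumptions made for this example. It only contrasts a multiplicative recurrence with an additive one:

```python
import math

def multiplicative_gradient(w, inputs):
    # Scalar RNN h_t = tanh(w * h_{t-1} + x_t); returns d h_T / d h_0,
    # which is a product of one factor w * (1 - h_t^2) per time step.
    h, grad = 0.0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)
    return grad

def additive_gradient(inputs):
    # Additive cell c_t = c_{t-1} + x_t; every per-step factor
    # d c_t / d c_{t-1} is exactly 1, so d c_T / d c_0 stays at 1.
    grad = 1.0
    for _ in inputs:
        grad *= 1.0
    return grad

inputs = [0.5] * 50
mult_grad = multiplicative_gradient(0.9, inputs)
add_grad = additive_gradient(inputs)
print(mult_grad)  # a product of 50 factors, each smaller than 1
print(add_grad)
```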
   
### Efficiency and memory tricks
   * Mini-batching: We generally batch sentences of nearly the same length together. We pad the shorter sentences so that all sentences in the batch have the same length, mask the scores corresponding to the padded words, and then sum to get the loss. One should also shuffle the order of the mini-batches, so that not all short sentences are trained at the beginning or at the end.
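
The padding and masking step can be sketched in plain Python. The helper names `pad_and_mask` and `masked_loss` are hypothetical, and `-1` is used as an illustrative padding id. The sentences are assumed to be sorted by decreasing length:

```python
def pad_and_mask(sents, pad_id):
    # sents must be sorted by decreasing length, so sents[0] is the longest
    wids, masks = [], []
    for t in range(len(sents[0])):
        # word ids of all sentences at time step t (padded where needed)
        wids.append([s[t] if t < len(s) else pad_id for s in sents])
        # 1 for a real word, 0 for a padded position
        masks.append([1 if t < len(s) else 0 for s in sents])
    return wids, masks

def masked_loss(step_losses, masks):
    # step_losses[t][b] is the loss of sentence b at time step t
    return sum(l * m
               for ls, ms in zip(step_losses, masks)
               for l, m in zip(ls, ms))

sents = [[1, 2, 3], [1, 2], [1]]
wids, masks = pad_and_mask(sents, pad_id=-1)
# give every position a dummy loss of 1.0, so the masked sum counts real words
dummy_losses = [[1.0] * len(sents) for _ in wids]
total = masked_loss(dummy_losses, masks)
print(wids, masks, total)
```

With a dummy loss of 1.0 per position, the masked sum equals the number of real (non-padded) words, which is also the natural normalizer when reporting a per-word loss.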
   
### Strengths/Weaknesses
   * Quite flexible. Only recently have CNNs started to show more promise.
   * Require a **lot** of data.
   * Weak error signal passed back from the end of the sentence. This is especially a problem for sentence classification.
  
### Misc:
   * A bi-RNN cannot be used for language modeling because it would condition on all the words in the sentence, including the word being predicted and the words after it.
   
### Questions: 
   * In a language model, is it OK to use words from both the left **and** the right context to predict the target word? According to Graham, one cannot do so, and this is also the reason why a bi-RNN cannot be used for language modeling.
   * In a usual feed-forward NN, how does one choose the size of the layers? Suppose the input layer's dimension is 128: what should be picked as the dimension of the next layer? Should it be more than 128 or less than 128? How much more/less? Same question for the second-to-last layer: does the dimension of the last layer dictate the dimension of the second-to-last layer?
   * Read up on the vanishing gradient issue. How does an LSTM solve it? According to Graham: LSTMs have additive connections (and not multiplicative connections) between time steps, which do not reduce the gradient.
   * Truncated backpropagation through time: generally used for very long sequences. How would one implement this?
   * Interpretation of RNNs. Read the paper by Karpathy.
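
On the truncated backpropagation question, one common approach (sketched here as an assumption, not as something taken from the lecture) is to cut the sequence into fixed-size windows, carry the hidden state across windows as a plain value, and backpropagate only within each window. The sketch below shows just the control flow. `step` is a hypothetical stand-in for one RNN step, and no real gradients are computed:

```python
def tbptt_chunks(sequence, window):
    # split a long sequence into consecutive windows of at most `window` items
    for start in range(0, len(sequence), window):
        yield sequence[start:start + window]

def run_truncated(sequence, window, step):
    # Returns the final state and, per window, the number of time steps
    # that backpropagation would cover.
    state, steps_per_window = 0.0, []
    for chunk in tbptt_chunks(sequence, window):
        # In a real model, `state` would enter the new window as a constant
        # (no gradient), so backprop stops at the window boundary.
        for item in chunk:
            state = step(state, item)
        steps_per_window.append(len(chunk))
        # a real trainer would run the backward pass and the update here
    return state, steps_per_window

final_state, steps = run_truncated(list(range(10)), 4, lambda s, x: 0.5 * s + x)
print(steps)
```

The forward computation is unchanged by the truncation. Only the length of the gradient path is limited to the window size.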

In [90]:
IFrame("./rnn_loss.pdf", width=400, height=300)

## RNN Implementation

In [65]:
from collections import defaultdict
import dynet as dy
import time
from random import shuffle
import math

In [82]:
# Implementation of language-model using LSTM


class LmRnn(object):
    def __init__(self, train_path, test_path):
        self.w2i = defaultdict(lambda: len(self.w2i))
        self.S = self.w2i["<s>"]
        self.UNK = self.w2i["<unk>"]
        self.train_data = list(self.read_train_data(train_path))
        self.nWords = len(self.w2i)
        self.test_data = list(self.read_test_data(test_path))
        # Assert that reading the test data did not change the 
        # length of w2i
        assert(self.nWords == len(self.w2i))
        
        # model parameters
        self.WORD_EMB_SIZE = 64
        self.HIDDEN_SIZE = 128
        
        # dynet model
        self.model = dy.Model()
        self.trainer = dy.AdamTrainer(self.model)
        
        # dynet lookup parameters for word embedding
        self.words_lookup = self.model.add_lookup_parameters((self.nWords, 
                                                             self.WORD_EMB_SIZE))
        
        # dynet word level lstm
        self.lstm = dy.LSTMBuilder(1, 
                                   self.WORD_EMB_SIZE, 
                                   self.HIDDEN_SIZE, 
                                   self.model)
        
        # dynet softmax weights
        self.W_sm = self.model.add_parameters((self.nWords, 
                                               self.HIDDEN_SIZE))
        self.b_sm = self.model.add_parameters(self.nWords)
    
    
    def train(self):
        start_time = time.time()
        i = block_train_loss = block_train_words = all_time = 0
        # list(...) so that shuffle() can reorder it in place (range is immutable in Python 3)
        train_order = list(range(len(self.train_data)))
        for ITER in range(100):
            shuffle(train_order)
            for sid in train_order:
                i += 1
                if i % 100 == 0:
                    # exp(average per-word loss) is the per-word perplexity of the block
                    print("block_train_loss/block_train_words = {}".format(math.exp(block_train_loss * 1./block_train_words)))
                    print("elapsed time = {}".format(time.time() - start_time))
                    block_train_loss = 0
                    block_train_words = 0
                # get the loss for the current sentence
                this_sent_loss_exp = self.calc_lm_loss(self.train_data[sid])
                block_train_loss += this_sent_loss_exp.scalar_value() 
                block_train_words += len(self.train_data[sid])
                this_sent_loss_exp.backward()
                self.trainer.update()
            print("epoch %r finished" % ITER)
            self.trainer.update_epoch(1.0)
                
                
    def calc_lm_loss(self, sent):
        dy.renew_cg()
        
        # parameters -> exp
        W_sm_exp = dy.parameter(self.W_sm)
        b_sm_exp = dy.parameter(self.b_sm)
        
        # initialize the lstm
        f_init = self.lstm.initial_state()
        
        # start the rnn by inputting "<s>"
        s = f_init.add_input(self.words_lookup[self.S])
        
        losses = []
        for wid in sent:
            score = W_sm_exp * s.output() + b_sm_exp
            loss = dy.pickneglogsoftmax(score, wid)
            losses.append(loss)
            s = s.add_input(self.words_lookup[wid])
        return dy.esum(losses)
    
    def read_train_data(self, filename):
        with open(filename, 'r') as f:
            for line in f:
                sent = [self.w2i[word] for word in line.strip().split(" ")]
                sent.append(self.S)
                yield sent
     
    def read_test_data(self, filename):
        with open(filename, 'r') as f:
            for line in f:
                sent = []
                words = line.strip().split(" ")
                for word in words:
                    if word in self.w2i:
                        sent.append(self.w2i[word])
                    else:
                        sent.append(self.w2i["<unk>"])
                yield sent

In [83]:
lmRnn = LmRnn("../nn4nlp2017-code-master/data/classes/train.txt", 
              "../nn4nlp2017-code-master/data/classes/test.txt")

print("number of words = {}".format(lmRnn.nWords))
lmRnn.train()

number of words = 18284
block_train_loss/block_train_words = 4329.93120559
elapsed time = 10.5359749794
block_train_loss/block_train_words = 1126.19717092
elapsed time = 21.7739989758
block_train_loss/block_train_words = 682.788390579
elapsed time = 32.3246450424
block_train_loss/block_train_words = 638.526063686
elapsed time = 42.8455340862
block_train_loss/block_train_words = 648.263705471
elapsed time = 53.7965459824
block_train_loss/block_train_words = 522.830575695
elapsed time = 64.267745018
block_train_loss/block_train_words = 675.645164747
elapsed time = 74.667525053


KeyboardInterrupt: 

### Mini-batching
There are two things we notice from the above output:
   * The printed value (the per-word perplexity of each block) is decreasing, so our model is actually learning.
   * Processing 100 sentences takes around 10 sec.
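
A short sketch of what the printed number means. The training loop prints `math.exp(block_train_loss / block_train_words)`, i.e. the exponential of the average per-word negative log-likelihood, which is the per-word perplexity. The helper name `perplexity` and the uniform-model reference point are illustrative assumptions:

```python
import math

def perplexity(total_neg_log_likelihood, total_words):
    # exp of the average per-word negative log-likelihood (natural log)
    return math.exp(total_neg_log_likelihood / float(total_words))

# Reference point: a model that is uniform over V words assigns
# probability 1/V to every word, so its perplexity is V.
V = 18284
n_words = 100
uniform_loss = n_words * math.log(V)
uniform_ppl = perplexity(uniform_loss, n_words)
print(uniform_ppl)
```

Against that reference point, values in the hundreds mean the model has already learned a lot about word frequencies, even though it is still far from a good language model.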
   
In the next part of the code, we will implement the same model but with mini-batching. The idea is to test whether mini-batching makes the processing faster. 
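
Before the full DyNet implementation, here is the batching logic on its own in plain Python. `batch_start_indices` is a hypothetical helper and the toy corpus is made up for the example. The sentences are sorted by decreasing length, cut into consecutive batches, and the batches are visited in random order:

```python
from random import shuffle

def batch_start_indices(num_sents, batch_size):
    # index of the first sentence of every minibatch (ceiling division)
    num_batches = (num_sents + batch_size - 1) // batch_size
    return [b * batch_size for b in range(num_batches)]

# toy corpus: sentences are lists of word ids
data = [[1], [1, 2, 3, 4], [1, 2], [1, 2, 3], [1, 2, 3, 4, 5]]
data.sort(key=lambda s: -len(s))          # longest first
starts = batch_start_indices(len(data), 2)
batches = [data[i:i + 2] for i in starts]
batch_lengths = [[len(s) for s in b] for b in batches]

order = list(range(len(batches)))
shuffle(order)                            # random visiting order of the batches
print(starts, batch_lengths)
```

Sorting first keeps the sentences inside a batch close in length, so few positions need padding. Shuffling the batches rather than the sentences preserves that property.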

In [91]:
IFrame("./rnn_minibatching.pdf", width=400, height=300)

In [176]:
%reset

from collections import defaultdict
import dynet as dy
import time
from random import shuffle
import math

# Implementation of language-model using LSTM


class LmRnnMiniBatching(object):
    def __init__(self, train_path, test_path):
        self.w2i = defaultdict(lambda: len(self.w2i))
        self.S = self.w2i["<s>"]
        self.UNK = self.w2i["<unk>"]
        self.train_data = list(self.read_train_data(train_path))
        self.nWords = len(self.w2i)
        self.test_data = list(self.read_test_data(test_path))
        # Assert that reading the test data did not change the 
        # length of w2i
        assert(self.nWords == len(self.w2i))
        
        # mini batch size
        self.mini_batch_size = 32
        
        # model parameters
        self.WORD_EMB_SIZE = 64
        self.HIDDEN_SIZE = 128
        
        # dynet model
        self.model = dy.Model()
        self.trainer = dy.AdamTrainer(self.model)
        
        # dynet lookup parameters for word embedding
        self.words_lookup = self.model.add_lookup_parameters((self.nWords, 
                                                             self.WORD_EMB_SIZE))
        
        # dynet word level lstm
        self.lstm = dy.LSTMBuilder(1, 
                                   self.WORD_EMB_SIZE, 
                                   self.HIDDEN_SIZE, 
                                   self.model)
        
        # dynet softmax weights
        self.W_sm = self.model.add_parameters((self.nWords, 
                                               self.HIDDEN_SIZE))
        self.b_sm = self.model.add_parameters(self.nWords)
        
    def read_train_data(self, filename):
        with open(filename, 'r') as f:
            for line in f:
                sent = [self.w2i[word] for word in line.strip().split(" ")]
                sent.append(self.S)
                yield sent
     
    
    def read_test_data(self, filename):
        with open(filename, 'r') as f:
            for line in f:
                sent = []
                words = line.strip().split(" ")
                for word in words:
                    if word in self.w2i:
                        sent.append(self.w2i[word])
                    else:
                        sent.append(self.w2i["<unk>"])
                yield sent
    
    
    def get_first_sent_idx_of_batches(self):
        
        # For minibatching, we first sort the train data by decreasing length.
        # This minimizes the number of words which will be masked in a minibatch
        # See the above figure
        self.train_data.sort(key=lambda x : -len(x))
        
        # Get the number of minibatches (ceiling division; `//` keeps the
        # result an int so that it can be passed to range() below)
        num_minibatches = len(self.train_data) // self.mini_batch_size
        if len(self.train_data) % self.mini_batch_size != 0:
            num_minibatches += 1
            
        print("num_minibatches = {}".format(num_minibatches))
        
        # bunch the minibatches together
        # for example, with 9 training sentences (indices 0..8)
        # and batch size = 2
        # first_sent_idx_of_batches = [0, 2, 4, 6, 8]
        first_sent_idx_of_batches = [batch_id * self.mini_batch_size for batch_id in range(num_minibatches)] 
        
        # shuffle the batches so that not all small sentences are trained at the beginning or at the end
        shuffle(first_sent_idx_of_batches)
        
        # Now the sentences in a batch can be accessed via: 
        # for sid in first_sent_idx_of_batches:
        #     print(self.train_data[sid:sid + self.mini_batch_size])
        
        return first_sent_idx_of_batches
    
    
    def train(self):
        
        first_sent_idx_of_batches = self.get_first_sent_idx_of_batches()
        
        start_time = time.time()
        i = block_train_loss = block_train_words = all_time = 0
        
        for ITER in range(3):
            # visit the minibatches in a new random order every epoch
            shuffle(first_sent_idx_of_batches)
            for sid in first_sent_idx_of_batches:
                i += 1
                if i % int(1000 / self.mini_batch_size) == 0:
                    print("block_train_loss/block_train_words = {}".format(math.exp(block_train_loss * 1./block_train_words)))
                    print("elapsed time = {}".format(time.time() - start_time))
                    print("sentences done = {}".format((i*self.mini_batch_size)))
                    block_train_loss = 0
                    block_train_words = 0
                # get the loss for the current minibatch
                this_batch_loss_exp, words_in_batch = self.calc_lm_loss(self.train_data[sid: sid + self.mini_batch_size])
                block_train_loss += this_batch_loss_exp.scalar_value() 
                block_train_words += words_in_batch
                this_batch_loss_exp.backward()
                self.trainer.update()
            print("epoch %r finished" % ITER)
            self.trainer.update_epoch(1.0)
                
                
    def calc_lm_loss(self, sents):
        dy.renew_cg()
        
        # parameters -> exp
        W_sm_exp = dy.parameter(self.W_sm)
        b_sm_exp = dy.parameter(self.b_sm)
        
        # initialize the lstm
        f_init = self.lstm.initial_state()
        
        # the initial "<s>" input is fed below as a batch, once wids and masks are built
        
        # get the word ids and masks for each step
        # Example: 
        # sents = [[1, 2, 3], [1, 2], [1]]
        # wids = [[1, 1, 1], [2, 2, -1], [3, -1, -1]]
        # masks = [[1, 1, 1], [1, 1, 0], [1, 0, 0]]
        # where -1 stands in (in this example) for the index of S, which is used for padding
        tot_words = 0
        wids = []
        masks = []
        for i in range(len(sents[0])):
            # sents[0] because the 0th sent is the longest
            wids.append([(sent[i] if len(sent)>i else self.S) for sent in sents])
            mask = [(1 if len(sent)>i else 0) for sent in sents]
            masks.append(mask)
            tot_words += sum(mask)
        
        # Initial input to the RNN has to be S
        init_ids = [self.S] * len(sents)
        s = f_init.add_input(dy.lookup_batch(self.words_lookup, init_ids))
        
        # Get losses by predicting the next word
        losses = []
        for wid, mask in zip(wids, masks):
            # calculate the softmax loss
            score = dy.affine_transform([b_sm_exp, W_sm_exp, s.output()])            
            loss = dy.pickneglogsoftmax_batch(score, wid)
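            # mask the loss if any sentence in the batch is shorter than the current step;
            # the batch is sorted by decreasing length, so checking the last sentence suffices
            if mask[-1] != 1: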
                mask_expr = dy.inputVector(mask)
                mask_expr = dy.reshape(mask_expr, (1, ), len(sents))
                loss = loss * mask_expr
            losses.append(loss)
            s = s.add_input(dy.lookup_batch(self.words_lookup, wid))
        return dy.sum_batches(dy.esum(losses)), tot_words

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


In [177]:
lmRnnMb = LmRnnMiniBatching("../nn4nlp2017-code-master/data/classes/train.txt", 
                            "../nn4nlp2017-code-master/data/classes/test.txt")

print("number of words = {}".format(lmRnnMb.nWords))
print("number of sentences = {}".format(len(lmRnnMb.train_data)))

lmRnnMb.train()

number of words = 18284
number of sentences = 8544
num_minibatches = 267
block_train_loss/block_train_words = 8756.20715213
elapsed time = 7.77348709106
sentences done = 992
block_train_loss/block_train_words = 1514.63439409
elapsed time = 14.4101791382
sentences done = 1984
block_train_loss/block_train_words = 1257.06479949
elapsed time = 22.1578500271
sentences done = 2976
block_train_loss/block_train_words = 919.261884732
elapsed time = 29.1969220638
sentences done = 3968
block_train_loss/block_train_words = 688.659342556
elapsed time = 36.4949162006
sentences done = 4960
block_train_loss/block_train_words = 556.196629656
elapsed time = 44.6118631363
sentences done = 5952
block_train_loss/block_train_words = 498.470694319
elapsed time = 51.8876121044
sentences done = 6944
block_train_loss/block_train_words = 527.79080335
elapsed time = 60.0206830502
sentences done = 7936
epoch 0 finished
block_train_loss/block_train_words = 427.418137202
elapsed time = 67.3311600685
sentences done =

### Minibatch is fast!
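We clearly see that mini-batching is much faster: about 10 sec per 100 sentences without it, versus 7 to 8 sec per roughly 1000 sentences with it, i.e. more than 10 times faster.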
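
The speed-up can be estimated roughly from the logs. The sketch below uses the first log block of each run above. The sentence counters in the logs are approximate (logging happens just before the current sentence or batch is processed), and the helper name is hypothetical, so this is only a ballpark figure:

```python
def sentences_per_second(num_sentences, seconds):
    return num_sentences / float(seconds)

# numbers read off the first log block of each run above
plain = sentences_per_second(100, 10.5359749794)
batched = sentences_per_second(992, 7.77348709106)
speedup = batched / plain
print("plain: {:.1f} sent/s, batched: {:.1f} sent/s, ratio: {:.1f}".format(
    plain, batched, speedup))
```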