**Note**: The code has been adapted from the [official tutorial on using eager for LM](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/rnn_ptb/rnn_ptb.py)

In this notebook, we will explore how to build a **Neural Language Model**. Rather than directly showing code for the NLM, we will arrive at it step-by-step by discussing all key components. 

We will also leverage **tf.data** to build our data pipeline, something we found to be missing in the official tutorial.

### P1: Enable Eager execution
* We use `tfe` to add variables
* enable_eager_execution() should be the first command in your notebook. Note that executing this again throws an error! Restart your notebook kernel to re-execute!

In [1]:
import tensorflow as tf
import tensorflow.contrib.eager as tfe

In [2]:
tf.enable_eager_execution()

### P2: Fixed random seed
* A fixed random seed is required to reproduce your experiments!
* This can help you debug your code!
* You can select any number of your choice. We selected 42, any guesses why? :) 

In [3]:
tf.set_random_seed(42)

### P3: Embedding Model
Let us begin by building an **Embedding Model**. The job of embedding model is simple: Given a tensor of word indexes, return corresponding vectors (or rows)

In [4]:
class Embedding(tf.keras.Model):
    def __init__(self, V, d):
        super(Embedding, self).__init__()
        self.W = tfe.Variable(tf.random_uniform(minval=-1.0, maxval=1.0, shape=[V, d]))
    
    def call(self, word_indexes):
        return tf.nn.embedding_lookup(self.W, word_indexes)

Let us give it a try by finding embeddings for word indexes: 5 and 100

In [5]:
word_embeddings = Embedding(5000, 128)

In [6]:
vecs = word_embeddings([5, 100])
print(vecs.numpy().shape)

(2, 128)


In [7]:
vecs = word_embeddings([[5, 100, 40], [2, 300, 90]])

In [8]:
print(vecs.numpy().shape)

(2, 3, 128)


### P4: RNN Cell...

We now have the ability to feed vectors for each time step. Now let us say we see two words and want to predict the third word in a sentence. We need a mechanism that can **summarize** all the words seen so far, and use the **summary** to generate a probability distribution for the next word.

**Recurrent Neural Network(RNN)** does precisely that: It maintains a lossy summary of the inputs seen so far!

<img src="recurrent_eqn@2x.png" alt="drawing" width="200"/>

Let us assume we have a batch of 2 sentences, each sentence has 3 words. 

We will come to how RNN will handle variable length sentences...

In [9]:
word_indexes = [[20, 30, 400], [500, 0, 3]]
word_vectors = word_embeddings(word_indexes)

**Question**: What should be shape of word_vectors? Recall em returns vectors of size 128

In [10]:
print(word_vectors.numpy().shape)

(2, 3, 128)


It seems we will not be able to pass the word_vectors directly. RNN proceses inputs **one time step** at a time!

Enter, [tf.unstack](https://www.tensorflow.org/api_docs/python/tf/unstack)
![title](tf.unstack.png)

In [11]:
word_vectors_time = tf.unstack(word_vectors, axis=1)
print(f'word_vectors_time: len:{len(word_vectors_time)} Shape[0]: {word_vectors_time[0].shape}')

word_vectors_time: len:3 Shape[0]: (2, 128)


In [12]:
cell = tf.nn.rnn_cell.BasicRNNCell(256)
init_state = cell.zero_state(batch_size=int(word_vectors.shape[0]), dtype=tf.float32)
output, state = cell(word_vectors_time[0], init_state)

print(output.shape)

(2, 256)


* You might be wondering: We only talked about hidden state $h_t$ till now, why do we have two vectors being computed output and state. 

* For a BasicRNNCell output and state are identical. 

* For LSTM and GRU they have different meaning. All we need to understand is that it uses state and output to do its magic of being able to maintain and learn long term dependencies. 

* We would mostly use state to pass it to next time step, and output to make predictions at that time step.

* Read this [excellent blog post on LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/), in case you are interested in how LSTM works

### P5: RNN Model
Now, we have all the pieces to build an RNN Model. Let us see how this works:

In [13]:
class RNN(tf.keras.Model):
    def __init__(self, h, cell):
        super(RNN, self).__init__()
        if cell == 'lstm':
            self.cell = tf.nn.rnn_cell.BasicLSTMCell(num_units=h)
        elif cell == 'gru':
            self.cell = tf.nn.rnn_cell.GRUCell(num_units=h)
        else:
            self.cell = tf.nn.rnn_cell.BasicRNNCell(num_units=h)
        
        
    def call(self, word_vectors):
        word_vectors_time = tf.unstack(word_vectors, axis=1)
        outputs = []
        
        state = self.cell.zero_state(batch_size=int(word_vectors.shape[0]), dtype=tf.float32)
        for word_vector_time in word_vectors_time:
            output, state = self.cell(word_vector_time, state)
            outputs.append(output)
        return outputs

In [14]:
word_indexes = [[20, 30, 400], [500, 0, 3]]
word_vectors = word_embeddings(word_indexes)

rnn = RNN(128, 'rnn')
rnn_outputs = rnn(word_vectors)

# Prints "Num outputs: 3 Shape[0]: (2, 128)"
print(f'Num outputs: {len(rnn_outputs)} Shape[0]: {rnn_outputs[0].numpy().shape}')

Num outputs: 3 Shape[0]: (2, 128)


### P6: Data pipeline

We will work with a standard LM dataset: PTB dataset from Tomas Mikolov's webpage:
```bash
wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
tar xvf simple-examples.tgz
```

The first thing, we do with any data is to take a peek at it. 

```bash
head -3 simple-examples/data/ptb.train.txt
```

```
aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter 
pierre <unk> N years old will join the board as a nonexecutive director nov. N 
mr. <unk> is chairman of <unk> n.v. the dutch publishing group 
```
Some key points to note:

* We see here that there is a $<unk>$ token already.
* There also seems another token $N$. This identifies a number. 
* Rest all words seem to be lower cased
    
Let us count up the vocab quickly!

In [15]:
train_file = 'simple-examples/data/ptb.train.txt'
UNK='<unk>'

In [16]:
def count_words(sentences_file):
    counter = {}
    for sentence in open(sentences_file):
        sentence = sentence.strip()
        if not sentence:
            continue
        words = sentence.split()
        for word in words:
            counter[word] = counter.get(word, 0) + 1
    return counter

In [17]:
counter = count_words(train_file)
print(f'Num unique words: {len(counter)}')

Num unique words: 9999


In [18]:
EOS = '<eos>'

We will add a special token EOS which signifies end of sentence. We add this to out vocabulary

Let us now write the vocab to file. Since, we are using OrderedDict, we will get words in order...

In [19]:
def write_vocab(counter, vocab_file, unk=UNK, eos=EOS):
    del counter[unk]
    with open(vocab_file, 'w') as fw:
        fw.write(f'{unk}\n')
        fw.write(f'{eos}\n')
        for word, _ in sorted(counter.items(), key=lambda pair:pair[1], reverse=True):
            fw.write(f'{word}\n')

In [20]:
vocab_file = 'simple-examples/data/vocab.txt'
write_vocab(counter, vocab_file)

Peek at vocab file, see if the words make sense...

```bash
 head simple-examples/data/vocab.txt 
```

This generates the following:
```
<unk>
<eos>
the
N
of
to
a
in
and
's
```

Next, we want to create a data pipeline, we would create a batch of src words and corresponding target words.

Target words would be shifted right by one. Let us give a concrete example:

**Sentence**: "the cat sat on mat"

**Src_Words:**: ['the', 'cat', 'sat', 'on', 'mat']

**Tgt_Words:**: ['cat', 'sat', 'on', 'mat', '<eos\>']

<img src="data_tx@2x.png" alt="drawing" width="300"/>
Let us begin by creating a vocab table:

In [21]:
from tensorflow.python.ops import lookup_ops

In [22]:
vocab_table = lookup_ops.index_table_from_file(vocab_file)
vocab_table.size()

<tf.Tensor: id=121, shape=(), dtype=int64, numpy=10000>

In [23]:
def create_dataset(sentences_file, vocab_table, batch_size, eos=EOS):
    #Create a Text Line dataset, which returns a string tensor
    dataset = tf.data.TextLineDataset(sentences_file)
    
    #Convert to a list of words..
    dataset = dataset.map(lambda sentence: tf.string_split([sentence]).values)
    
    #Create target words right shifted by one, append EOS, also return size of each sentence...
    dataset = dataset.map(lambda words: (words, tf.concat([words[1:], [eos]], axis=0), tf.size(words)))
    
    #Lookup words, word->integer, EOS->1
    dataset = dataset.map(lambda src_words, tgt_words, num_words: (vocab_table.lookup(src_words), vocab_table.lookup(tgt_words), num_words))
    
    #[None] -> src words, [None] -> tgt_words, [] length of sentence
    dataset = dataset.padded_batch(batch_size=batch_size, padded_shapes=([None], [None], []))
    return dataset
    

In [24]:
dataset = create_dataset(train_file, vocab_table, 32)

In [25]:
#Check out sample data!

next(iter(dataset))[2]

<tf.Tensor: id=172, shape=(32,), dtype=int32, numpy=
array([24, 15, 11, 23, 34, 27, 23, 32,  9, 15,  8, 20, 21, 22, 31, 16, 19,
       15, 20, 18, 32, 20, 38, 48, 17, 16, 12, 20, 12, 32, 20, 26],
      dtype=int32)>

### P7: RNN Model (revisited)

Now, that we have a way to load up data. Let us see how our RNN model behaves..

In [26]:
word_embeddings = Embedding(V=vocab_table.size(), d=128)

In [27]:
datum = next(iter(dataset))

In [28]:
word_vectors = word_embeddings(datum[0])
word_vectors.numpy().shape

(32, 48, 128)

In [29]:
rnn = RNN(h=128, cell='rnn')

In [30]:
rnn_outputs = rnn(word_vectors)

In [31]:
print(f'Num outputs: {len(rnn_outputs)} Shape[0]: {rnn_outputs[0].numpy().shape}')

Num outputs: 48 Shape[0]: (32, 128)


#### Zeroing out outputs past real sentence length!
One problem, with our current RNN implementation is that it processes even past the sentence length. For example, length of sentence 0 is 24, but since longest sentence in first batch is of length 48. It returns outputs even past length 24. Let us confirm this:

In [32]:
datum[2][0]

<tf.Tensor: id=619, shape=(), dtype=int32, numpy=24>

In [33]:
rnn_outputs[40][0][:10]

<tf.Tensor: id=628, shape=(10,), dtype=float32, numpy=
array([-0.06586117, -0.25426382,  0.09824807,  0.2871141 , -0.02431772,
        0.00771092,  0.25113913,  0.10970695, -0.00239144,  0.0056459 ],
      dtype=float32)>

We will use static_rnn to deal with the zeroing problem... As you can see, it implements for loop by itself!

In [34]:
class StaticRNN(tf.keras.Model):
    def __init__(self, h, cell):
        super(StaticRNN, self).__init__()
        if cell == 'lstm':
            self.cell = tf.nn.rnn_cell.BasicLSTMCell(num_units=h)
        elif cell == 'gru':
            self.cell = tf.nn.rnn_cell.GRUCell(num_units=h)
        else:
            self.cell = tf.nn.rnn_cell.BasicRNNCell(num_units=h)
        
        
    def call(self, word_vectors, num_words):
        word_vectors_time = tf.unstack(word_vectors, axis=1)
        outputs, final_state = tf.nn.static_rnn(cell=self.cell, inputs=word_vectors_time, sequence_length=num_words, dtype=tf.float32)
        return outputs

In [35]:
srnn = StaticRNN(h=256, cell='rnn')
rnn_outputs = srnn(word_vectors, datum[2])
rnn_outputs[40][0][:10]

<tf.Tensor: id=1618, shape=(10,), dtype=float32, numpy=array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>

### P8: Language Model (Code)

At each time step, we want to predict a probability distribution over the entire vocabulary

Thus, we need to add an output layer

In [36]:
class LanguageModel(tf.keras.Model):
    def __init__(self, V, d, h, cell):
        super(LanguageModel, self).__init__()
        self.word_embedding = Embedding(V, d)
        self.rnn = StaticRNN(h, cell)
        self.output_layer = tf.keras.layers.Dense(units=V)
        
    def call(self, datum):
        word_vectors = self.word_embedding(datum[0])
        rnn_outputs_time = self.rnn(word_vectors, datum[2])
        
        #We want to convert it back to shape batch_size x TimeSteps x h
        rnn_outputs = tf.stack(rnn_outputs_time, axis=1)
        logits = self.output_layer(rnn_outputs)
        return logits

In [37]:
lm = LanguageModel(vocab_table.size(), 128, 128, 'rnn')

What would be the shape of logits returned?

In [38]:
logits = lm(datum)
print(f'logits shape {logits.numpy().shape}')

logits shape (32, 48, 10000)


### P9: Loss function

* At each time step, RNN makes a prediction
* More concretely it generated 10,000 (V) logits.

We can compute loss by comparing the predictions against true labels. We will use Cross Entropy Loss.

* Cross Entropy measures distance between two probability distributions $p$ and $q$.

* When you have only one class as correct in true distribution. The Cross entropy simplifies to computing the loss of the target word!

<img src="cross_entropy@2x.png" alt="drawing" width="200"/>

* You should never compute the target probability directly. Further as we have our labels with only correct index we would use sparse_softmax_cross_entropy_with_logits. We pass the logits to this method directly!

Now let us get some intuition about the loss values..

First let us compute cross entropy loss for a model that predicts each word equally likely. In this case the probability would be 1/V or 1/10000. This comes out to be 9.21

In [39]:
-tf.log(1/10000).numpy()

9.2103405

Now, let us see what is the loss for the first prediction on an untrained model!

In [40]:
lm = LanguageModel(vocab_table.size(), 128, 128, 'lstm')
logits = lm(datum)
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=datum[1])

In [41]:
print(loss[0][0].numpy())

9.20561


It seems we are not doing any better than making a random prediction! Which is fine as we have not trained our model!


Next, we need to be careful about not adding any loss for the **padded values**.

Let us check out length of first sentence, and see what are loss values past the length

In [42]:
print(f'Len of first sentence:  {datum[2][0]} Loss[{datum[2][0]}:]={loss[0][datum[2][0]:]}')

Len of first sentence:  24 Loss[24:]=[9.2103405 9.2103405 9.2103405 9.2103405 9.2103405 9.2103405 9.2103405
 9.2103405 9.2103405 9.2103405 9.2103405 9.2103405 9.2103405 9.2103405
 9.2103405 9.2103405 9.2103405 9.2103405 9.2103405 9.2103405 9.2103405
 9.2103405 9.2103405 9.2103405]


We actually don't want to accumulate this loss! We will zero it out using sequence mask. Which creates a tensor of 0's and 1's as per the sequence length....

In [43]:
mask = tf.sequence_mask(datum[2], dtype=tf.float32)
loss = loss * mask
print(f'Len of first sentence:  {datum[2][0]} Loss[{datum[2][0]}:]={loss[0][datum[2][0]:]}')

Len of first sentence:  24 Loss[24:]=[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


In [44]:
mask[0]

<tf.Tensor: id=4448, shape=(48,), dtype=float32, numpy=
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
      dtype=float32)>

Finally, when we are training we would do it over a batch. In this case 32 sentences with many words in each sentence... Thus, we will compute an average loss over this batch

We compute this by dividing total loss for the batch by total words

In [45]:
mask = tf.sequence_mask(datum[2], dtype=tf.float32)
loss = loss * mask
avg_loss = tf.reduce_sum(loss) / tf.reduce_sum(mask)
print(f'Avg loss: {avg_loss}')

Avg loss: 9.209136009216309


In [46]:
def loss_fun(model, datum):
    logits = model(datum)
    mask = tf.sequence_mask(datum[2], dtype=tf.float32)
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=datum[1]) * mask
    return tf.reduce_sum(loss) / tf.cast(tf.reduce_sum(datum[2]), dtype=tf.float32)

### P10: Gradients Function

In [47]:
loss_and_grads_fun = tfe.implicit_value_and_gradients(loss_fun)

In [48]:
loss_value, gradients_value = loss_and_grads_fun(lm, datum)

In [49]:
print(loss_value)

tf.Tensor(9.209136, shape=(), dtype=float32)


### P11: Training Loop

In [50]:
import numpy as np

In [51]:
opt = tf.train.AdamOptimizer(learning_rate=0.001)

In [52]:
NUM_EPOCHS = 10
STATS_STEPS = 50

lm = LanguageModel(vocab_table.size(), 128, 128, 'lstm')

for epoch_num in range(NUM_EPOCHS):
    batch_loss = []
    for step_num, datum in enumerate(dataset, start=1):
        loss_value, gradients = loss_and_grads_fun(lm, datum)
        batch_loss.append(loss_value)
        
        if step_num % STATS_STEPS == 0:
            print(f'Epoch: {epoch_num} Step: {step_num} Avg Loss: {np.average(np.asarray(loss_value))}')
            batch_loss = []
        opt.apply_gradients(gradients, global_step=tf.train.get_or_create_global_step())
    print(f'Epoch{epoch_num} Done!')

Epoch: 0 Step: 50 Avg Loss: 7.341406345367432
Epoch: 0 Step: 100 Avg Loss: 6.8079915046691895
Epoch: 0 Step: 150 Avg Loss: 6.721277713775635
Epoch: 0 Step: 200 Avg Loss: 6.436749458312988
Epoch: 0 Step: 250 Avg Loss: 6.674849510192871


SystemError: <built-in function TFE_Py_TapeGradient> returned a result with an error set

Let us check if the loss changed for the first batch!

In [53]:
loss_and_grads_fun(lm, datum)[0]

<tf.Tensor: id=926612, shape=(), dtype=float32, numpy=6.6278515>

In [54]:
print(f'Old avg p_tgt: {np.exp(-9.21)} New: {np.exp(-loss_and_grads_fun(lm, datum)[0])}')

Old avg p_tgt: 0.00010003404299092957 New: 0.0013230025069788098


In [55]:
tf.train.get_or_create_global_step()

<tf.Variable 'global_step:0' shape=() dtype=int64, numpy=267>

### P12: Saving your work!

In [56]:
import os

checkpoint_dir = 'lm'
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt')

In [57]:
root = tfe.Checkpoint(optimizer=opt, model=lm, optimizer_step=tf.train.get_or_create_global_step())

In [58]:
root.save(checkpoint_prefix)

'lm/ckpt-1'