In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# 1 Getting the Data

*Our goal in this module is to build a Shakespearean language model using an RNN that can predict the next character in a Shakespearean sequence and write Shakespeare-like prose.*

In [2]:
from fastai.io import *
from fastai.conv_learner import *

from fastai.column_data import *

In [3]:
PATH='./data/'

In [4]:
get_data("https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt", f'{PATH}shakespeare.txt')
text = open(f'{PATH}shakespeare.txt').read()
text = text[10500:]
text = text[:102905]
print('corpus length: ' + str(len(text)))

corpus length: 102905


In [5]:
text[:500]

"\n                     1\n  From fairest creatures we desire increase,\n  That thereby beauty's rose might never die,\n  But as the riper should by time decease,\n  His tender heir might bear his memory:\n  But thou contracted to thine own bright eyes,\n  Feed'st thy light's flame with self-substantial fuel,\n  Making a famine where abundance lies,\n  Thy self thy foe, to thy sweet self too cruel:\n  Thou that art now the world's fresh ornament,\n  And only herald to the gaudy spring,\n  Within thine own bu"

In [6]:
trn = text[:82324]
len(trn)

82324

In [7]:
!mkdir -p data/trn && touch data/trn/trn.txt

In [8]:
val = text[82324:]
len(val)

20581

In [9]:
!mkdir -p data/val && touch data/val/val.txt

In [10]:
with open('./data/trn/trn.txt', 'w') as file:
    file.write(trn)

In [11]:
!cat data/trn/trn.txt


                     1
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir might bear his memory:
  But thou contracted to thine own bright eyes,
  Feed'st thy light's flame with self-substantial fuel,
  Making a famine where abundance lies,
  Thy self thy foe, to thy sweet self too cruel:
  Thou that art now the world's fresh ornament,
  And only herald to the gaudy spring,
  Within thine own bud buriest thy content,
  And tender churl mak'st waste in niggarding:
    Pity the world, or else this glutton be,
    To eat the world's due, by the grave and thee.


                     2
  When forty winters shall besiege thy brow,
  And dig deep trenches in thy beauty's field,
  Thy youth's proud livery so gazed on now,
  Will be a tattered weed of small worth held:  
  Then being asked, where all thy beauty lies,
  Where all the treasure of thy lusty days;
  To say

In [12]:
with open('./data/val/val.txt', 'w') as file:
    file.write(val)

In [13]:
!cat data/val/val.txt

true despite thy scythe and thee.


                     124
  If my dear love were but the child of state,
  It might for Fortune's bastard be unfathered,
  As subject to time's love or to time's hate,
  Weeds among weeds, or flowers with flowers gathered.
  No it was builded far from accident,
  It suffers not in smiling pomp, nor falls
  Under the blow of thralled discontent,
  Whereto th' inviting time our fashion calls:
  It fears not policy that heretic,
  Which works on leases of short-numbered hours,
  But all alone stands hugely politic,  
  That it nor grows with heat, nor drowns with showers.
    To this I witness call the fools of time,
    Which die for goodness, who have lived for crime.


                     125
  Were't aught to me I bore the canopy,
  With my extern the outward honouring,
  Or laid great bases for eternity,
  Which proves more short than waste or ruining?
  Have I not seen dwellers on form and favour
  Lose all, and more by p

# 2 Setup

Let's start by defining the vocabulary, which is simply the set of all unique characters in the text.

In [14]:
chars = sorted(list(set(text)))
vocab_size = len(chars)+1
print('total chars:', vocab_size)

total chars: 72


We'll add a 0 value for padding

In [15]:
chars.insert(0, '\0')

List the vocab

In [16]:
''.join(chars[1:-6])

"\n !'(),-.0123456789:;?ABCDEFGHIJKLMNOPRSTUVWYabcdefghijklmnopqrst"

Create a bidirectional lookup dictionary from characters to indices and vice versa

In [17]:
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

We will be building an RNN (Recurrent Neural Network) to create the language model. Since neural nets require numeric inputs we will transform the data using the mapping created above.

In [18]:
idx = [char_indices[c] for c in text]
idx[:10]

[1, 2, 2, 2, 2, 2, 2, 2, 2, 2]

We can see above that the characters in the text have been mapped to their indices

In [19]:
# verify that reverse mapping works as intended
''.join(indices_char[i] for i in idx[:70])

'\n                     1\n  From fairest creatures we desire increase,\n '

Finally, let's set up the dimensions of our hidden layers and embedding matrices. *Note: We chose not to use widely available word embeddings like word2vec since these are derived from linear models. We will demonstrate in this module how embeddings learned from deep models are far more expressive.*

In [20]:
# arbitrary hyperparams that can be tuned
n_hidden = 256
n_fac = 42

# 3 Language Modeling

## 3.1 Vanilla RNN Approach

We will start by creating the simplest possible model: an RNN that takes *overlapping* 8-character inputs and predicts the next (9th) character of each sequence.

### Create Inputs and Labels

In [21]:
cs=8

Create overlapping 8 character sequences for the entire text

In [22]:
in_data = [[idx[i+j] for i in range(cs)] for j in range(len(idx)-cs)]

Now let's create a list containing the next character for each of the above sequences. These will be the labels that we pass to the model.

In [23]:
out_data = [idx[j+cs] for j in range(len(idx)-cs)]

In [24]:
# concatenate the inputs (along the primary axis)
x = np.stack(in_data, axis=0)

In [25]:
# concatenate the labels (along the primary axis)
y = np.stack(out_data)

In [26]:
x[:cs,:cs]

array([[1, 2, 2, 2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2, 2, 2, 2]])

Thus, we can see that row *i* of `x` is the 8-character sequence starting at position *i* of the text.

In [27]:
y[:cs]

array([2, 2, 2, 2, 2, 2, 2, 2])

Here, element *i* of `y` is the expected output (the next character) for input sequence *i* in `x`.
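To make the window construction concrete, here is a minimal sketch on a toy string. The names `toy_text`, `toy_cs`, `toy_x` and `toy_y` are hypothetical and exist only for this illustration; the logic mirrors the `in_data` / `out_data` comprehensions above with a window of 3 instead of 8.

```python
# Toy illustration of the overlapping-window construction (hypothetical names and text).
toy_text = "hello world"
toy_chars = ['\0'] + sorted(set(toy_text))           # index 0 is reserved for padding
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_idx = [toy_char_indices[c] for c in toy_text]

toy_cs = 3  # window length (the notebook uses cs = 8)

# Window j covers positions j .. j+toy_cs-1; its label is the character at position j+toy_cs.
toy_x = [toy_idx[j:j + toy_cs] for j in range(len(toy_idx) - toy_cs)]
toy_y = [toy_idx[j + toy_cs] for j in range(len(toy_idx) - toy_cs)]

# Decode the first few (window, label) pairs back to characters.
for window, label in zip(toy_x[:3], toy_y[:3]):
    print(repr(''.join(toy_chars[i] for i in window)), '->', repr(toy_chars[label]))
```

Consecutive windows share all but one character, which is why this representation repeats most of the text `cs` times.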

### Create Model

In [28]:
class VanillaCharRNNModel(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_fac)
        self.input = nn.Linear(n_fac, n_hidden)
        self.hidden = nn.Linear(n_hidden, n_hidden)
        self.output = nn.Linear(n_hidden, vocab_size)
    
    def forward(self, *cs):
        bs = cs[0].size(0)
        hidden_state = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            inp = F.relu(self.input(self.emb(c)))
            hidden_state = F.tanh(self.hidden(inp + hidden_state))
        return F.log_softmax(self.output(hidden_state), dim=-1)

Here we have created a classic RNN. An embedding layer followed by a linear input layer turns each character into a vector of size `n_hidden`. The hidden layer is applied once per character: at step *i* it receives the *i*th character's input combined with the hidden state accumulated over characters 1 to *i-1*, and reusing the same weights at every step is what makes this network recurrent. The output layer applies a log-softmax over the vocabulary, which gives the log-probability of each possible next character.

Two design choices are worth noting. First, we have chosen to combine the hidden state and the next input by means of addition. An argument can be made that addition loses some information, since the two signals are merged into a single vector, and hence that concatenation would be a better approach. Second, we have chosen to reinitialize the hidden state to zero at the start of every 8-character sequence. It can be argued (and rightly so!) that this loses important contextual information, which is essential in language.
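The recurrence can be sketched without any deep learning library. The snippet below is only an illustration under stated assumptions: the sizes are tiny, the weights are random, biases are omitted, and the helper names (`rand_matrix`, `matvec`, `step`) are hypothetical. It follows the same order of operations as `forward` above: ReLU on the input projection, addition with the hidden state, then tanh on the hidden projection.

```python
import math
import random

random.seed(0)
n_emb, n_hid = 3, 4   # toy sizes (the notebook uses n_fac = 42 and n_hidden = 256)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

W_in = rand_matrix(n_hid, n_emb)    # stands in for self.input (bias omitted)
W_hid = rand_matrix(n_hid, n_hid)   # stands in for self.hidden (bias omitted)

def step(embedded_char, hidden_state):
    inp = [max(0.0, a) for a in matvec(W_in, embedded_char)]    # ReLU on the input projection
    summed = [a + b for a, b in zip(inp, hidden_state)]         # combine by addition
    return [math.tanh(a) for a in matvec(W_hid, summed)]        # new hidden state

hidden = [0.0] * n_hid                                          # reset to zero for each sequence
sequence = [rand_matrix(1, n_emb)[0] for _ in range(8)]         # eight stand-in embeddings
for e in sequence:
    hidden = step(e, hidden)                                    # same weights at every position
print(len(hidden))
```

Only the final `hidden` would be passed to the output layer, which matches the model above predicting a single character per 8-character input.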

### Train Model

In [29]:
val_idx = get_cv_idxs(len(idx)-cs-1)

In [30]:
md = ColumnarModelData.from_arrays('.', val_idx, x, y, bs=512) # large batch size used here to speed up training

In [31]:
m = VanillaCharRNNModel(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-2) # learning rate can be tuned

Let's begin the training loop using the above model and the Adam optimizer configured as above. We will use the negative log-likelihood loss as our metric.
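As a worked example of what this loss measures, the sketch below computes a log-softmax and the negative log-likelihood for one toy prediction in plain Python. The function names and the scores are hypothetical. Because the reported loss is a mean over many predictions, `exp(-loss)` can be read as the geometric-mean probability the model assigns to the correct character.

```python
import math

def log_softmax(scores):
    m = max(scores)                                        # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def nll(log_probs, target):
    return -log_probs[target]                              # loss for a single example

scores = [2.0, 0.5, -1.0]          # raw outputs for a 3-character toy vocabulary
log_probs = log_softmax(scores)
loss = nll(log_probs, target=0)
print(round(loss, 4))              # small, because class 0 has the highest score
print(round(math.exp(-loss), 4))   # probability the model gave to the correct class

# Reading an average loss as a typical probability on the correct character:
print(round(math.exp(-1.685626), 3))
```

Under this reading, a loss of about 1.69 corresponds to the model placing a bit under 0.2 probability on the correct next character on average.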

In [32]:
fit(m, md, 1, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss                              
    0      1.928656   1.863961  



[array([1.86396])]

In [33]:
set_lrs(opt, 1e-3)
fit(m, md, 1, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss                              
    0      1.667185   1.685626  



[array([1.68563])]

Let's see how well this very basic model performs.

### Test Model

In [34]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [35]:
get_next('we desir')

'e'

In [36]:
get_next('flame wi')

't'

In [37]:
get_next('how are')

' '

Thus, we can see from above that we are getting plausible results even with this very basic RNN, where we reset the state for every eight-character sequence. The model predicts not only the next letter within a word but also word boundaries, for example that *how are* should be followed by a space.

### Concatenate Hidden State and Input

Let's make a slight change here by concatenating the hidden state and the input instead of adding them, to see if we get better performance.

In [38]:
class VanillaCharRNNModelConcat(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_fac)
        self.input = nn.Linear(n_fac + n_hidden, n_hidden)
        self.hidden = nn.Linear(n_hidden, n_hidden)
        self.output = nn.Linear(n_hidden, vocab_size)
    
    def forward(self, *cs):
        bs = cs[0].size(0)
        hidden_state = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            inp = F.relu(self.input(torch.cat((self.emb(c), hidden_state), 1)))
            hidden_state = F.tanh(self.hidden(inp))
        return F.log_softmax(self.output(hidden_state), dim=-1)

In [39]:
m = VanillaCharRNNModelConcat(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-2)

In [40]:
fit(m, md, 1, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss                              
    0      3.094905   3.109281  



[array([3.10928])]

In [41]:
set_lrs(opt, 1e-3)

In [42]:
fit(m, md, 1, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss                              
    0      3.005696   3.013667  



[array([3.01367])]

In [43]:
get_next('we desir')

' '

In [44]:
get_next('how art ') # would expect 't' since "how art thou?" is a common Shakespearean phrase

' '

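In this run, concatenation did not help: after the same two epochs the validation loss is about 3.01, compared with about 1.69 for the additive model, and the model predicts a space for both prompts (including 'we desir', which the additive model completed correctly with 'e'). The loss barely moved between the two epochs, which suggests that this model needs a different learning rate or more training before the two variants can be compared fairly.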

## 3.2 Stateful RNN Approach

Now that we have a working vanilla RNN, it's time to try to improve things by incorporating the notion of statefulness in our model. We will do so by preserving the hidden state across batches during training instead of resetting it for every 8-character sequence.

### Create Inputs and Labels

Here, we will remedy one of the inefficiencies in the inputs and outputs of the vanilla model. Instead of taking overlapping sequences, we will now take non-overlapping sequences as our inputs and predict the next character after every single character in the input. This in itself is simply an efficiency trick and is not intended to change the accuracy of the model. We will use some library support to take care of these more mechanical tasks.
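The idea can be sketched in plain Python. This is an illustration only and not the library's implementation: `batchify` and `bptt_chunks` are hypothetical helpers. The text is cut into `bs` contiguous streams that are read in parallel, each chunk is `bptt` steps long, and the targets are the inputs shifted forward by one character.

```python
def batchify(tokens, bs):
    """Split tokens into bs contiguous streams; row t holds the t-th token of every stream."""
    n = len(tokens) // bs                                   # leftover tokens are dropped
    streams = [tokens[k * n:(k + 1) * n] for k in range(bs)]
    return [[streams[k][t] for k in range(bs)] for t in range(n)]

def bptt_chunks(rows, bptt):
    """Yield (inputs, targets) pairs; targets are the inputs shifted one step forward in time."""
    for start in range(0, len(rows) - 1, bptt):
        seq_len = min(bptt, len(rows) - 1 - start)          # the last chunk may be shorter
        yield rows[start:start + seq_len], rows[start + 1:start + 1 + seq_len]

toy_tokens = list("abcdefghijklmnopqrstuvwx")               # 24 toy tokens
rows = batchify(toy_tokens, bs=2)                           # 12 rows, 2 parallel streams
chunks = list(bptt_chunks(rows, bptt=4))
for inputs, targets in chunks:
    print([''.join(r) for r in inputs], '->', [''.join(r) for r in targets])
```

Because each chunk continues exactly where the previous one stopped in every stream, the hidden state left over from one chunk is the right starting state for the next, which is what makes the stateful approach below possible.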

In [45]:
from torchtext import vocab, data

from fastai.nlp import *
from fastai.lm_rnn import *

PATH='./data/'

TRN_PATH = 'trn/'
VAL_PATH = 'val/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

shakespeare.txt  trn/  val/


In [46]:
TEXT = data.Field(lower=True, tokenize=list)
bs=64; bptt=8; n_fac=42; n_hidden=256

FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=3)

Before we move on, we must note the use of the parameter *bptt*, which stands for backpropagation through time. It specifies how many time steps (that is, how many unrolled copies of the recurrent layer) gradients are propagated back through during the training loop. This becomes important in stateful RNNs because the unrolled network becomes extremely deep now that we are not discarding our state information by resetting after every 8 characters. We do not want to backpropagate through thousands of layers, as it would be computationally expensive. In addition, it makes little conceptual sense: during backpropagation we determine how much the error of an output is affected by the activations in a particular layer, and in a language model it is unlikely that the error on the 100,000th character is affected by the first character, since the context has almost certainly changed over 100,000 characters.

### Create Model

In [47]:
class StatefulRNN(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        self.vocab_size = vocab_size
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.out = nn.Linear(n_hidden, vocab_size)
        self.hidden = V(torch.zeros(1, bs, n_hidden))
    
    def forward(self, cs):
        bs = cs[0].size(0)
        # need to account for the fact that minibatch will be different for last batch
        if self.hidden.size(1) != bs: 
            self.hidden = V(torch.zeros(1, bs, n_hidden))
        outp, h = self.rnn(self.emb(cs), self.hidden)
        self.hidden = repackage_var(h)
        return F.log_softmax(self.out(outp), dim=-1).view(-1, self.vocab_size)

The `repackage_var()` function implements the truncation that bptt requires. In the version of PyTorch used here, a Tensor carries no history of operations; only a Variable does. `repackage_var()` extracts the Tensor from the hidden state and wraps it in a new Variable, so the hidden values carry over to the next batch while the gradient history covers only the current bptt steps (8 in our case).

### Train Model

In [48]:
m = StatefulRNN(md.nt, n_fac, 512).cuda()
opt = optim.Adam(m.parameters(), 1e-2)

In [49]:
fit(m, md, 4, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=4), HTML(value='')))

epoch      trn_loss   val_loss                               
    0      1.965799   1.915607  
    1      1.804299   1.867346                               
    2      1.743312   1.828087                               
    3      1.710561   1.850279                               



[array([1.85028])]

In [50]:
set_lrs(opt, 1e-3)

fit(m, md, 4, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=4), HTML(value='')))

epoch      trn_loss   val_loss                               
    0      1.536639   1.647658  
    1      1.496889   1.627578                               
    2      1.474387   1.615356                               
    3      1.460252   1.609291                               



[array([1.60929])]

### Test Model

In [51]:
def get_next(inp):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0,1)))
    r = torch.multinomial(p[-1].exp(), 1)
    return TEXT.vocab.itos[to_np(r)[0]]

In [52]:
get_next('my love ')

'('
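Note that this version of `get_next` draws the next character at random from the predicted distribution (via `torch.multinomial`) instead of taking the `argmax` as before, which is why an unlikely character such as `'('` can appear. The difference can be sketched in plain Python; the toy vocabulary, probabilities and function names below are hypothetical.

```python
import math
import random

def greedy(log_probs):
    """Index of the most likely class (what np.argmax does)."""
    return max(range(len(log_probs)), key=lambda i: log_probs[i])

def sample(log_probs, rng):
    """Draw one index with probability proportional to exp(log_prob)."""
    probs = [math.exp(lp) for lp in log_probs]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_vocab = ['t', 'h', ' ']
toy_log_probs = [math.log(0.6), math.log(0.3), math.log(0.1)]

rng = random.Random(0)                                   # seeded for repeatability
print(toy_vocab[greedy(toy_log_probs)])                  # always the single most likely character
draws = [toy_vocab[sample(toy_log_probs, rng)] for _ in range(1000)]
print({c: draws.count(c) for c in toy_vocab})            # counts roughly follow the probabilities
```

Sampling keeps generated text from locking onto the single most likely continuation, at the cost of occasionally emitting an improbable character.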

In [53]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res

In [63]:
print(get_next_n('my love ', 40))

my love it againe hath lombous from thokn, in th


We can see above that our model is starting to produce something resembling Shakespeare-like writing. However, towards the end of the sample the model starts to produce scrambled and senseless character sequences. As the distance between the start and the end grows, the vanilla RNN cell has a hard time keeping track of long-term dependencies, even with the state preserved. Thankfully, researchers have come up with two alternative cells that can be used to remedy this very problem! We will see both approaches in action below.

## 3.3 GRU Cell Approach

Instead of using the standard RNN cell (which is simply a linear layer followed by a tanh), we will now use a GRU cell. GRU stands for Gated Recurrent Unit, and it is designed specifically to help with long-term dependencies. A GRU cell contains two gates: a reset gate, which determines how much of the old hidden state is used together with the current character input to compute a candidate state, and an update gate, which determines how the old hidden state and that candidate are blended to form the new hidden state. One thing to note is that the two gates are mini neural nets themselves (a single layer with a sigmoid activation) with their own weight matrices. The gates' weights are also updated during backpropagation, so the network learns what to keep and what to overwrite, and this is what makes GRUs effective with long-term dependencies.
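The gate arithmetic can be illustrated with a scalar sketch, where both the input and the hidden state are single numbers. It assumes the gate formulation documented for `nn.GRU` in PyTorch, with most biases dropped for brevity; the function name, weight names and values are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x, h, w):
    """One GRU update for a scalar input x and a scalar hidden state h."""
    r = sigmoid(w['w_xr'] * x + w['w_hr'] * h + w.get('b_r', 0.0))   # reset gate
    z = sigmoid(w['w_xz'] * x + w['w_hz'] * h + w.get('b_z', 0.0))   # update gate
    n = math.tanh(w['w_xn'] * x + r * (w['w_hn'] * h))               # candidate state
    return (1.0 - z) * n + z * h                                     # blend candidate and old state

toy_weights = {'w_xr': 0.5, 'w_hr': -0.3, 'w_xz': 0.8, 'w_hz': 0.2, 'w_xn': 1.0, 'w_hn': 0.7}

h = 0.0
for x in [0.1, -0.4, 0.9]:
    h = gru_step(x, h, toy_weights)
    print(round(h, 4))

# With a strongly positive update-gate bias the old state passes through almost unchanged.
keep = dict(toy_weights, b_z=50.0)
print(gru_step(0.3, 0.7, keep))
```

The last call shows the mechanism behind long-term memory: when the update gate saturates, the cell simply copies its previous state forward, so information can survive many time steps.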

In [56]:
class StatefulGRU(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        self.vocab_size = vocab_size
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.GRU(n_fac, n_hidden)
        self.out = nn.Linear(n_hidden, vocab_size)
        self.hidden = V(torch.zeros(1, bs, n_hidden))
    
    def forward(self, cs):
        bs = cs[0].size(0)
        # need to account for the fact that minibatch will be different for last batch
        if self.hidden.size(1) != bs: 
            self.hidden = V(torch.zeros(1, bs, n_hidden))
        outp, h = self.rnn(self.emb(cs), self.hidden)
        self.hidden = repackage_var(h)
        return F.log_softmax(self.out(outp), dim=-1).view(-1, self.vocab_size)

In [57]:
m = StatefulGRU(md.nt, n_fac, 512).cuda()
opt = optim.Adam(m.parameters(), 1e-2)

In [59]:
fit(m, md, 6, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=6), HTML(value='')))

epoch      trn_loss   val_loss                               
    0      1.600047   1.781107  
    1      1.597914   1.783171                              
    2      1.606024   1.774495                               
    3      1.606393   1.78476                                
    4      1.630621   1.804095                               
    5      1.645396   1.818221                               



[array([1.81822])]

In [60]:
set_lrs(opt, 1e-3)

In [65]:
fit(m, md, 6, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=6), HTML(value='')))

epoch      trn_loss   val_loss                               
    0      1.332252   1.657226  
    1      1.3195     1.660383                               
    2      1.310989   1.660865                               
    3      1.294655   1.660386                              
    4      1.279197   1.663809                               
    5      1.269822   1.664237                               



[array([1.66424])]

In [66]:
print(get_next_n('my love ', 40))

my love make your dear fumllest,  and for wail t


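We can see that the generated text is starting to look better. The model is creating longer sequences that are connected and make more sense. Note, however, that the validation loss (about 1.66) is no better than the stateful RNN's (about 1.61), and the widening gap between training and validation loss suggests overfitting. Let's explore the other approach (which is what many NLP architectures use).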

## 3.4 LSTM Cell Approach

LSTM stands for Long Short-Term Memory. Like the GRU, it is a gated cell designed to solve the problem of long-term dependencies (the LSTM is in fact the older design, and the GRU is a simplified variant of it), and it is sometimes the more effective of the two. An LSTM works much like a GRU, except that it maintains an additional piece of information called the cell state alongside the hidden state, whereas a GRU has only the hidden state. For a far more articulate explanation of LSTMs, read [here](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).
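A scalar sketch of a single LSTM step shows how the cell state and the hidden state interact. It assumes the standard gate formulation documented for `nn.LSTM` in PyTorch, with most biases dropped; the function name, weight names and values are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x, h, c, w):
    """One LSTM update for a scalar input x, hidden state h and cell state c."""
    i = sigmoid(w['w_xi'] * x + w['w_hi'] * h + w.get('b_i', 0.0))   # input gate
    f = sigmoid(w['w_xf'] * x + w['w_hf'] * h + w.get('b_f', 0.0))   # forget gate
    g = math.tanh(w['w_xg'] * x + w['w_hg'] * h)                     # candidate cell values
    o = sigmoid(w['w_xo'] * x + w['w_ho'] * h)                       # output gate
    c_new = f * c + i * g                                            # cell state: keep some, add some
    h_new = o * math.tanh(c_new)                                     # hidden state: gated view of the cell
    return h_new, c_new

toy_weights = {'w_xi': 0.6, 'w_hi': 0.1, 'w_xf': -0.2, 'w_hf': 0.4,
               'w_xg': 1.0, 'w_hg': 0.5, 'w_xo': 0.3, 'w_ho': -0.1}

h, c = 0.0, 0.0
for x in [0.1, -0.4, 0.9]:
    h, c = lstm_step(x, h, c, toy_weights)
    print(round(h, 4), round(c, 4))
```

This is also why `init_hidden` in the model below creates a tuple of two tensors: one for the hidden state and one for the cell state.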

In [94]:
from fastai import sgdr

n_hidden = 1024 # since we now use dropout (0.6 in the model below), we can afford a larger hidden layer

In [95]:
class StatefulLSTM(nn.Module):
    def __init__(self, vocab_size, n_fac, bs, nl):
        super().__init__()
        self.vocab_size,self.nl = vocab_size,nl
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.LSTM(n_fac, n_hidden, nl, dropout=0.6)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h[0].size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs):
        self.h = (V(torch.zeros(self.nl, bs, n_hidden)),
                  V(torch.zeros(self.nl, bs, n_hidden)))

We will also be using learning rate annealing to help improve performance now.
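The schedule can be sketched as follows. This is a simplified illustration and not the library's code: it works at epoch granularity (the actual callback updates the rate every iteration), and the function names are hypothetical. Each cycle starts at the maximum learning rate and decays along a half cosine; with `cycle_mult=2` every cycle is twice as long as the previous one.

```python
import math

def cosine_lr(lr_max, step_in_cycle, steps_in_cycle):
    """Half-cosine decay from lr_max towards 0 within one cycle."""
    return lr_max / 2 * (1 + math.cos(math.pi * step_in_cycle / steps_in_cycle))

def cycle_lengths(n_cycles, first_len=1, cycle_mult=2):
    """Length in epochs of each cycle when every cycle is cycle_mult times the previous one."""
    return [first_len * cycle_mult ** k for k in range(n_cycles)]

lengths = cycle_lengths(6)
print(lengths, sum(lengths))        # the total is the number of epochs needed for six full cycles

# Learning rate at the start of each epoch for the first three cycles.
schedule = []
for n_epochs in lengths[:3]:
    for epoch in range(n_epochs):
        schedule.append(cosine_lr(1e-2, epoch, n_epochs))
print([round(lr, 5) for lr in schedule])
```

Six cycles of 1, 2, 4, 8, 16 and 32 epochs add up to `2**6-1` epochs, which is where the epoch count passed to `fit` comes from.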

In [96]:
m = StatefulLSTM(md.nt, n_fac, 512, 2).cuda()
lo = LayerOptimizer(optim.Adam, m, 1e-2, 1e-5)

In [97]:
os.makedirs(f'{PATH}models', exist_ok=True)

In [98]:
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)] # cosine annealing
fit(m, md, 2**6-1, lo.opt, F.nll_loss, callbacks=cb)

HBox(children=(IntProgress(value=0, description='Epoch', max=63), HTML(value='')))

epoch      trn_loss   val_loss                              
    0      2.908765   2.85299   
    1      2.267052   2.047802                              
    2      1.910363   1.828402                              
    3      1.870356   1.807485                              
    4      1.745696   1.715712                              
    5      1.645834   1.640077                              
    6      1.565987   1.608091                              
    7      1.701329   1.709221                              
    8      1.67224    1.675929                              
    9      1.629795   1.637674                              
    10     1.57893    1.604935                              
    11     1.532855   1.576326                              
    12     1.477904   1.548964                              
    13     1.43232    1.532397                              
    14     1.398794   1.526081                             
    15     1.52042    1.620892                       

[array([1.524])]

In [82]:
def get_next(inp):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0,1)))
    r = torch.multinomial(p[-1].exp(), 1)
    return TEXT.vocab.itos[to_np(r)[0]]

In [83]:
get_next('my love ')

'a'

In [84]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res

In [85]:
print(get_next_n('my love ', 400))

my love inmerapt,  the world (of leap plamed,  and this thoughts on thee, like and lossed,  whose who with pleasures, all me being my desert:  therefore,  authorized,  or where thou mayst eased heaven  be, flutter on to remembered mades of all thee,  or which i colour,  whose injury.  what i haste other in thy neguty, set thee, are soe thy body's picted,  as they should register,  which shame as thusth ba


In [89]:
print(get_next_n('my love ', 400))

my love that before,  wilt for self-dressed:  no love not sor elder memory:    as unk of your eyes, and being from his verse?  then from the sweetest,  the earthful morn-loving change,  what penfit,  i speed of his bewold,  that bud like that loving bark against true, heir true love's till well.    sulls a counterful robs thy voise.       22  so trul thou taste.         30  or to so sow.         jewels it


In [99]:
print(get_next_n('my love ', 400))

my love (in all hear,  when varding and knight?  if the love's spirit hath heed, what being eyes:    all hand most thou should before:    the life are,  if neight from the hand:  before in this,  to wither their sight, he that season, ever whereore,  and do my saken  i may doth by.  and like a suns under might?  and in regost.         10o fortune'.         44 each is once, that love, i moy care,  for her 


We can see that our model, although less coherent than Shakespeare, still does a good job of capturing linguistic nuances and Shakespeare's tone. In addition, Shakespeare's sonnets are headed by numbers, and the model is able to reproduce this as well. Thus, using LSTMs we have been able to model long-term dependencies more effectively.