In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.io import *
from fastai.conv_learner import *

from fastai.column_data import *

import pathlib

## Setup

We're going to download the collected works of Nietzsche to use as our data for this class.

Use pathlib to create a path object.

In [2]:
PATH= Path('data/nietzsche/')

Download the data using `urlretrieve` from `urllib.request` (the call is commented out below; uncomment it on the first run).

In [3]:
# urlretrieve("https://s3.amazonaws.com/text-datasets/nietzsche.txt", PATH/'nietzsche.txt')

Open the file through the path object and read it into a string.

In [4]:
text = (PATH/'nietzsche.txt').open().read()
text[:1000]

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not ground\nfor suspecting that all philosophers, in so far as they have been\ndogmatists, have failed to understand women--that the terrible\nseriousness and clumsy importunity with which they have usually paid\ntheir addresses to Truth, have been unskilled and unseemly methods for\nwinning a woman? Certainly she has never allowed herself to be won; and\nat present every kind of dogma stands with sad and discouraged mien--IF,\nindeed, it stands at all! For there are scoffers who maintain that it\nhas fallen, that all dogma lies on the ground--nay more, that it is at\nits last gasp. But to speak seriously, there are good grounds for hoping\nthat all dogmatizing in philosophy, whatever solemn, whatever conclusive\nand decided airs it has assumed, may have been only a noble puerilism\nand tyronism; and probably the time is at hand when it will be once\nand again understood WHAT has actually sufficed for the basis of such\

Let's print it in a friendlier format.

In [5]:
print(text[:1000])

PREFACE


SUPPOSING that Truth is a woman--what then? Is there not ground
for suspecting that all philosophers, in so far as they have been
dogmatists, have failed to understand women--that the terrible
seriousness and clumsy importunity with which they have usually paid
their addresses to Truth, have been unskilled and unseemly methods for
winning a woman? Certainly she has never allowed herself to be won; and
at present every kind of dogma stands with sad and discouraged mien--IF,
indeed, it stands at all! For there are scoffers who maintain that it
has fallen, that all dogma lies on the ground--nay more, that it is at
its last gasp. But to speak seriously, there are good grounds for hoping
that all dogmatizing in philosophy, whatever solemn, whatever conclusive
and decided airs it has assumed, may have been only a noble puerilism
and tyronism; and probably the time is at hand when it will be once
and again understood WHAT has actually sufficed for the basis of such
imposing and abso

Get length of corpus

In [6]:
print('corpus length:', len(text))

corpus length: 600893


### Create dictionary

Build the vocabulary from the unique characters in the text. One extra slot is added to the count so that index 0 can be reserved for padding.

In [7]:
chars = sorted(list(set(text)))
vocab_size = len(chars)+1
print('total chars:', vocab_size)

total chars: 85


Sometimes it's useful to have a zero value in the dataset, e.g. for padding.  
Insert it at position 0.

In [8]:
chars.insert(0, "\0")
print(chars[0:4])

['\x00', '\n', ' ', '!']


In [9]:
''.join(chars)

'\x00\n !"\'(),-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyzÆäæéë'

Map from chars to indices and back again

In [10]:
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

*idx* will be the data we use from now on - it simply converts all the characters to their index (based on the mapping above)

In [11]:
idx = [char_indices[c] for c in text]

idx[:10]

[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]

In [12]:
''.join(indices_char[i] for i in idx[:70])

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not gro'
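The same vocabulary construction can be tried on a toy string. This is a minimal sketch: `toy_text` and the helpers `encode` and `decode` are illustrative names, not part of the notebook.

```python
# Toy corpus standing in for the Nietzsche text (illustrative only).
toy_text = "hello world"

# Sorted unique characters, with "\0" reserved at index 0 for padding.
chars = sorted(set(toy_text))
chars.insert(0, "\0")

# Lookup tables in both directions.
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

def encode(s):
    """Map a string to a list of integer indices."""
    return [char_indices[c] for c in s]

def decode(ids):
    """Map a list of integer indices back to a string."""
    return "".join(indices_char[i] for i in ids)

ids = encode(toy_text)
print(chars)
print(ids)
print(decode(ids))
```

Because the padding character never occurs in the text, index 0 never appears in `ids`; it stays free for padding.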

## Three char model

### Create inputs

Create four lists, each taking every 3rd character (a step of `cs=3`), starting at the 0th, 1st, 2nd, and 3rd characters respectively.  
> c1_dat = 1st, 4th, 7th, 10th,..   
> c2_dat = 2nd, 5th, 8th, 11th,..  
> c3_dat = 3rd, 6th, 9th, 12th,..  
> c4_dat = 4th, 7th, 10th, 13th,..  
  
Note that c1_dat and c4_dat contain the same characters, except that c4_dat is shifted one element ahead: `c4_dat[i] == c1_dat[i+1]`.

In [13]:
cs=3
c1_dat = [idx[i]   for i in range(0, len(idx)-cs, cs)]
c2_dat = [idx[i+1] for i in range(0, len(idx)-cs, cs)]
c3_dat = [idx[i+2] for i in range(0, len(idx)-cs, cs)]
c4_dat = [idx[i+3] for i in range(0, len(idx)-cs, cs)]

In [14]:
print("original sequence:",idx[:20])
print("\n")
print("c1_dat: 1st, 4th, 7th, 10th,..", c1_dat[:20]) 
print("c2_dat: 2nd, 5th, 8th, 11th,..", c2_dat[:20])
print("c3_dat: 3rd, 6th, 9th, 12th,..", c3_dat[:20])
print("c4_dat: 4th, 7th, 10th, 13th,..", c4_dat[:20])

original sequence: [40, 42, 29, 30, 25, 27, 29, 1, 1, 1, 43, 45, 40, 40, 39, 43, 33, 38, 31, 2]


c1_dat: 1st, 4th, 7th, 10th,.. [40, 30, 29, 1, 40, 43, 31, 61, 2, 74, 2, 2, 76, 54, 9, 54, 73, 67, 33, 73]
c2_dat: 2nd, 5th, 8th, 11th,.. [42, 25, 1, 43, 40, 33, 2, 54, 44, 73, 62, 54, 68, 67, 76, 73, 61, 24, 72, 61]
c3_dat: 3rd, 6th, 9th, 12th,.. [29, 27, 1, 45, 39, 38, 73, 73, 71, 61, 72, 2, 66, 9, 61, 2, 58, 2, 2, 58]
c4_dat: 4th, 7th, 10th, 13th,.. [30, 29, 1, 40, 43, 31, 61, 2, 74, 2, 2, 76, 54, 9, 54, 73, 67, 33, 73, 71]
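The relationship between the four lists is easier to see on a toy sequence. The sketch below uses `seq = 0..19` as a stand-in for `idx`, so every value equals its own position; the names `seq`, `c1` to `c4`, and `n` are illustrative.

```python
seq = list(range(20))  # stand-in for idx: each value equals its position
cs = 3

# Same comprehensions as in the notebook, applied to the toy sequence.
c1 = [seq[i]     for i in range(0, len(seq) - cs, cs)]
c2 = [seq[i + 1] for i in range(0, len(seq) - cs, cs)]
c3 = [seq[i + 2] for i in range(0, len(seq) - cs, cs)]
c4 = [seq[i + 3] for i in range(0, len(seq) - cs, cs)]

# The same lists can be written as strided slices, truncated to a common length.
n = len(c1)
print(c1 == seq[0::cs][:n], c4 == seq[3::cs][:n])

# Each (c1, c2, c3) triple is three consecutive characters; c4 is the one that follows.
print(list(zip(c1, c2, c3, c4))[:3])

# c4 is c1 shifted by one element.
print(c4[:-1] == c1[1:])
```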


### numpy stack

In [53]:
?np.stack

Signature: np.stack(arrays, axis=0, out=None)
Docstring:
Join a sequence of arrays along a new axis.

The `axis` parameter specifies the index of the new axis in the dimensions
of the result. For example, if ``axis=0`` it will be the first dimension
and if ``axis=-1`` it will be the last dimension.

.. versionadded:: 1.10.0

Parameters
----------
arrays : sequence of array_like
    Each array must have the same shape.
axis : int, optional
    The axis in the result array along which the input arrays are stacked.
out : ndarray, optional
    If provided, the destination to place the result. The shape must be
    correct, matching that of what stack would have returned if no
    out argument were specified.

Returns
-------
stacked : ndarray
    The stacked array has one more dimension than the input arrays.

Create a list of five 3x4 numpy matrices.

In [15]:
arrays = [np.random.randn(3, 4) for _ in range(5)];arrays

[array([[ 0.26707,  0.53502, -0.36971,  0.47062],
        [ 1.3231 ,  0.81272,  1.05069, -0.97046],
        [-0.57811, -1.07036, -0.79336, -1.13682]]),
 array([[-0.40418,  0.00136, -1.92031,  0.14098],
        [-1.96263, -0.35439, -0.54317, -0.1095 ],
        [ 1.00624,  1.12282,  0.30095, -0.82838]]),
 array([[-0.13569, -0.52423, -0.34952,  2.70086],
        [-0.74903, -0.89188, -0.78355,  0.95596],
        [ 0.08061,  1.39371, -1.01446, -0.22409]]),
 array([[-1.43728,  1.07395, -0.59071, -0.58609],
        [ 0.05319,  0.42956, -2.23133,  1.8199 ],
        [-1.23453,  0.03955,  0.15082, -0.04645]]),
 array([[ 0.92975,  0.86156,  0.38327,  0.56463],
        [-0.86046,  1.11132, -0.11699, -0.49987],
        [ 0.25563,  1.15763, -1.07207,  1.5115 ]])]

In [None]:
`np.stack` joins the five arrays along a new axis, at the position given by `axis`.

Examples
--------
>>> arrays = [np.random.randn(3, 4) for _ in range(5)]
>>> np.stack(arrays, axis=0).shape
(5, 3, 4)

>>> np.stack(arrays, axis=1).shape
(3, 5, 4)

>>> np.stack(arrays, axis=2).shape
(3, 4, 5)


In our case each element of the list is a scalar, so the only axis available for stacking is the new one (axis=0).  

At this point, `c1_dat` is still a plain Python list of integers, so np.stack will convert it to a 1D array.

In [16]:
np.stack(c1_dat)

array([40, 30, 29, ..., 72, 59, 67])

In [17]:
x1 = np.stack(c1_dat)
x2 = np.stack(c2_dat)
x3 = np.stack(c3_dat)

If we put the first 10 elements of each of these 3 arrays in a list, we get a list of three 1D arrays of length 10.  
Stacking them along axis 0 creates a 3 x 10 matrix.

In [18]:
[x1[:10],x2[:10],x3[:10]]

[array([40, 30, 29,  1, 40, 43, 31, 61,  2, 74]),
 array([42, 25,  1, 43, 40, 33,  2, 54, 44, 73]),
 array([29, 27,  1, 45, 39, 38, 73, 73, 71, 61])]

In [19]:
np.stack([x1[:10],x2[:10],x3[:10]], axis = 0)

array([[40, 30, 29,  1, 40, 43, 31, 61,  2, 74],
       [42, 25,  1, 43, 40, 33,  2, 54, 44, 73],
       [29, 27,  1, 45, 39, 38, 73, 73, 71, 61]])

If the 3 inputs are stacked along axis 1 instead, the result is a 10 x 3 matrix in which each row holds three consecutive characters from the text.

In [20]:
stackedx = np.stack([x1[:10],x2[:10],x3[:10]], axis=1)
stackedx

array([[40, 42, 29],
       [30, 25, 27],
       [29,  1,  1],
       [ 1, 43, 45],
       [40, 40, 39],
       [43, 33, 38],
       [31,  2, 73],
       [61, 54, 73],
       [ 2, 44, 71],
       [74, 73, 61]])
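For 1D inputs, stacking along axis 1 gives the transpose of stacking along axis 0. A small sketch, using the first four values of each input from the output above (the names `rows` and `cols` are illustrative):

```python
import numpy as np

# First four values of x1, x2, x3 as shown above.
x1 = np.array([40, 30, 29, 1])
x2 = np.array([42, 25, 1, 43])
x3 = np.array([29, 27, 1, 45])

rows = np.stack([x1, x2, x3], axis=0)  # one row per input array
cols = np.stack([x1, x2, x3], axis=1)  # one column per input array

print(rows.shape, cols.shape)
print(np.array_equal(cols, rows.T))
print(cols[0])  # the first three characters of the text, as indices
```

The axis=1 layout is the one used for training: each row is one example made of three character indices.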

`c4_dat` will be `y`, the target the model has to predict.

In [21]:
y = np.stack(c4_dat)
y[:20]

array([30, 29,  1, 40, 43, 31, 61,  2, 74,  2,  2, 76, 54,  9, 54, 73, 67, 33, 73, 71])

Check the shapes of the inputs and the output

In [22]:
x1.shape, y.shape

((200297,), (200297,))

### Create and train model

Pick a size for our hidden state

In [23]:
n_hidden = 256

The number of latent factors to create (i.e. the size of the embedding matrix)

In [24]:
n_fac = 42

#### create zero matrix

This is how to create a zero matrix in torch. 

In [25]:
torch.zeros((3,5))


 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0
[torch.FloatTensor of size 3x5]

You can put it on the GPU by using `.cuda()`

In [26]:
torch.Tensor(np.random.rand(3,4)).cuda()


 0.7996  0.1724  0.3232  0.6234
 0.0993  0.6973  0.5799  0.0115
 0.0138  0.9245  0.8453  0.4741
[torch.cuda.FloatTensor of size 3x4 (GPU 0)]

Or wrap it in a Variable using fastai's `V()`; as the output shows, the result also lives on the GPU.

In [21]:
V(torch.zeros((3,5)))

Variable containing:
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0
[torch.cuda.FloatTensor of size 3x5 (GPU 0)]

#### softmax in GPU

In [300]:
w = np.random.randn(3, 4)
w

array([[-0.47917, -0.18566, -1.10633, -1.19621],
       [ 0.81253,  1.35624, -0.07201,  1.00353],
       [ 0.36164, -0.64512,  0.3614 ,  1.53804]])

dim = 0 means the outermost square bracket, axis 0. The softmax is done down each column.

In [301]:
F.softmax(V(w), dim=0)

Variable containing:
 0.1437  0.1586  0.1227  0.0393
 0.5230  0.7412  0.3451  0.3549
 0.3332  0.1002  0.5323  0.6057
[torch.cuda.FloatTensor of size 3x4 (GPU 0)]

Taking the log of the softmax. All values are negative because every probability is below 1; the smaller the probability, the more negative its log.

In [302]:
F.log_softmax(V(w), dim=0)

Variable containing:
-1.9398 -1.8414 -2.0983 -3.2356
-0.6481 -0.2995 -1.0640 -1.0358
-1.0990 -2.3008 -0.6306 -0.5013
[torch.cuda.FloatTensor of size 3x4 (GPU 0)]

The negative log softmax multiplies the above by -1, making all values positive. The smaller the value, the higher the probability.
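The three quantities can be reproduced by hand for one column. The sketch below uses only the standard library and the first column of `w` shown above; `softmax`, `log_softmax`, and `col0` are illustrative names, and subtracting the maximum is a common numerical-stability step that does not change the result.

```python
import math

def softmax(xs):
    """Softmax over a list of numbers, shifted by the max for stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(xs):
    """Log of the softmax, computed via the log-sum-exp of the inputs."""
    m = max(xs)
    log_total = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_total for x in xs]

col0 = [-0.47917, 0.81253, 0.36164]   # first column of w above

probs = softmax(col0)                  # sums to 1 down the column (dim=0)
log_probs = log_softmax(col0)          # all negative
neg_log_probs = [-lp for lp in log_probs]  # all positive

print([round(p, 4) for p in probs])
print([round(lp, 4) for lp in log_probs])
print([round(nl, 4) for nl in neg_log_probs])
```

The entry with the largest probability has the smallest negative log value, which is why minimising the negative log likelihood pushes probability towards the correct class.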

### Create the 3-character RNN class

In [30]:
class Char3Model(nn.Module): 
    
    
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        
        # embedding layer
        self.e = nn.Embedding(vocab_size, n_fac)

        # input weight dense matrix
        self.l_in = nn.Linear(n_fac, n_hidden)

        # state weight dense matrix
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        
        # output weight matrix
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, c1, c2, c3):
                     
        # initial zero hidden layer
        # 1st rnn layer
        in1 = F.relu(self.l_in(self.e(c1)))
        h = V(torch.zeros(in1.size()).cuda())  # Note the cuda()
        #h = V(torch.zeros(in1.size()))         
        h = F.tanh(self.l_hidden(h+in1))
         
        # 2nd rnn layer        
        in2 = F.relu(self.l_in(self.e(c2)))
        h = F.tanh(self.l_hidden(h+in2))
        
        # 3rd rnn layer        
        in3 = F.relu(self.l_in(self.e(c3)))
        h = F.tanh(self.l_hidden(h+in3))
        
        return F.log_softmax(self.l_out(h), dim=-1)  # softmax over the vocabulary axis
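To make the data flow explicit, here is a dependency-free sketch of the same three-step computation for a single example. It is not the notebook's model: the sizes are tiny, the weights are random, and `rand_matrix`, `linear`, and `forward` are hypothetical helpers written for illustration.

```python
import math
import random

random.seed(0)
vocab_size, n_fac, n_hidden = 5, 2, 3   # tiny sizes for illustration

def rand_matrix(rows, cols):
    """Random weight matrix as a list of rows."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def linear(W, b, x):
    """Dense layer: W @ x + b, with W of shape (out, in)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

emb = rand_matrix(vocab_size, n_fac)                                   # embedding table
W_in, b_in = rand_matrix(n_hidden, n_fac), [0.0] * n_hidden            # input -> hidden
W_h, b_h = rand_matrix(n_hidden, n_hidden), [0.0] * n_hidden           # hidden -> hidden
W_out, b_out = rand_matrix(vocab_size, n_hidden), [0.0] * vocab_size   # hidden -> output

def forward(c1, c2, c3):
    h = [0.0] * n_hidden                         # initial hidden state is all zeros
    for c in (c1, c2, c3):                       # the same weights are reused at every step
        inp = [max(0.0, v) for v in linear(W_in, b_in, emb[c])]   # relu(l_in(e(c)))
        summed = [a + b for a, b in zip(h, inp)]                  # h + inp
        h = [math.tanh(v) for v in linear(W_h, b_h, summed)]      # tanh(l_hidden(h + inp))
    logits = linear(W_out, b_out, h)
    log_total = math.log(sum(math.exp(v) for v in logits))
    return [v - log_total for v in logits]       # log_softmax over the vocabulary

log_probs = forward(1, 2, 3)
print(len(log_probs), sum(math.exp(v) for v in log_probs))
```

The loop makes visible what the class writes out by hand: the three steps share `l_in` and `l_hidden`, and only the character fed in changes.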

### Dataloader

In [66]:
??ColumnarModelData.from_arrays

Signature: ColumnarModelData.from_arrays(path, val_idxs, xs, y, is_reg=True, 
                                         is_multi=False, bs=64, test_xs=None, shuffle=True)
Docstring: <no docstring>
Source:   
    @classmethod
    def from_arrays(cls, path, val_idxs, xs, y, is_reg=True, is_multi=False, 
                    bs=64, test_xs=None, shuffle=True):
        ((val_xs, trn_xs), (val_y, trn_y)) = split_by_idx(val_idxs, xs, y)
        test_ds = PassthruDataset(*(test_xs.T), [0] * len(test_xs), is_reg=is_reg, 
                                  is_multi=is_multi) if test_xs is not None else None
        return cls(path, PassthruDataset(*(trn_xs.T), trn_y, is_reg=is_reg, is_multi=is_multi),
                   PassthruDataset(*(val_xs.T), val_y, is_reg=is_reg, is_multi=is_multi),
                   bs=bs, shuffle=shuffle, test_ds=test_ds)
File:      ~/Documents/Git/fastai/courses/dl1/fastai/column_data.py
Type:      method

In [304]:
class PassthruDataset(Dataset):
    def __init__(self,*args, is_reg=True, is_multi=False):
        *xs,y=args
        self.xs,self.y = xs,y
        self.is_reg = is_reg
        self.is_multi = is_multi

    def __len__(self): return len(self.y)
    def __getitem__(self, idx): return [o[idx] for o in self.xs] + [self.y[idx]]

    @classmethod
    def from_data_frame(cls, df, cols_x, col_y, is_reg=True, is_multi=False):
        cols = [df[o] for o in cols_x+[col_y]]
        return cls(*cols, is_reg=is_reg, is_multi=is_multi)

Create the model data object. `val_idxs = [-1]` holds out only the last row for validation, so the validation loss reported during training is based on a single example.

In [65]:
md = ColumnarModelData.from_arrays(path = '.',
                                   val_idxs = [-1], 
                                   xs = np.stack([x1,x2,x3], axis=1), 
                                   y =y, 
                                   bs=512)

Let's examine some methods available under the dataloader

Number of batches in dataloader

In [35]:
len(md.trn_dl)

392

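### Batch count vs. dataset size

number of batches X batch size = 392 X 512 = 200704, which is more than the 200296 rows in the training set.  
The difference arises because the last batch is only partly filled: 391 full batches cover 200192 rows, and the remaining 104 rows form the 392nd batch.

Note that in the training dataset each item joins the x values and the y value together.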
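The 392 batches reported by `len(md.trn_dl)` can be checked against the 200296 rows reported by `len(md.trn_ds)` with a little arithmetic. This sketch assumes the loader keeps the final, partly filled batch, which is consistent with the numbers in this notebook.

```python
import math

n_rows = 200296   # len(md.trn_ds) in this notebook
bs = 512          # batch size passed to from_arrays

n_batches = math.ceil(n_rows / bs)            # the last batch may be partial
last_batch = n_rows - (n_batches - 1) * bs    # rows left over for the final batch

print(n_batches, last_batch)   # prints: 392 104
```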

#### training data

In [67]:
list(md.trn_ds)[:10]

[[40, 42, 29, 30],
 [30, 25, 27, 29],
 [29, 1, 1, 1],
 [1, 43, 45, 40],
 [40, 40, 39, 43],
 [43, 33, 38, 31],
 [31, 2, 73, 61],
 [61, 54, 73, 2],
 [2, 44, 71, 74],
 [74, 73, 61, 2]]

In [68]:
len(md.trn_ds)

200296

In [69]:
len(md.trn_y)

200296

In [70]:
md.path

'.'

#### validation data

In [71]:
len(md.val_dl)

1

In [72]:
list(md.val_ds)

[[67, 58, 72, 72]]

In [73]:
md.val_y

array([72])

### Instantiate the 3-Char model

`.cuda()` moves the model to the GPU, so the calculations run there.   
This converts all parameters in the model from
torch.Tensor to torch.cuda.Tensor.
The inputs also have to be on the GPU (Variable.cuda()); with both the parameters and the inputs on the GPU, the outputs are on the GPU as well.  
Hence there is no need to call `.cuda()` on the losses.

In [66]:
m = Char3Model(vocab_size, n_fac).cuda()

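Inspect some of its attributes. `m.children` and `m.forward` are referenced without being called, so the notebook displays the bound methods, whose repr includes the model's layers.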

In [77]:
m.children

<bound method Module.children of Char3Model(
  (e): Embedding(85, 42)
  (l_in): Linear(in_features=42, out_features=256, bias=True)
  (l_hidden): Linear(in_features=256, out_features=256, bias=True)
  (l_out): Linear(in_features=256, out_features=85, bias=True)
)>

In [78]:
m.forward

<bound method Char3Model.forward of Char3Model(
  (e): Embedding(85, 42)
  (l_in): Linear(in_features=42, out_features=256, bias=True)
  (l_hidden): Linear(in_features=256, out_features=256, bias=True)
  (l_out): Linear(in_features=256, out_features=85, bias=True)
)>

In [79]:
m.l_in

Linear(in_features=42, out_features=256, bias=True)

In [80]:
m.l_hidden

Linear(in_features=256, out_features=256, bias=True)

In [81]:
m.training

True

Dataloader will pass one batch of size 512 at a time to model m to be trained

In [82]:
it = iter(md.trn_dl)
*xs,yt = next(it)

In [83]:
yt.shape

torch.Size([512])

In [84]:
xs[0].shape

torch.Size([512])

`md` will automatically pass `*V(xs)` to `m` in `fit`.  

Let's use the untrained model like a layer, just to produce the outputs.  
This will be batch size X vocab size = 512 X 85.

In [85]:
m(*V(xs))

Variable containing:
-4.8706 -4.2780 -4.7269  ...  -4.6993 -4.3924 -4.2887
-4.7283 -4.3229 -4.7110  ...  -4.7150 -4.3299 -4.3579
-4.5632 -4.3377 -4.7691  ...  -4.3538 -4.5652 -4.2522
          ...             ⋱             ...          
-4.5965 -3.9759 -4.7243  ...  -4.5285 -4.6732 -4.1388
-4.4654 -4.3620 -4.5897  ...  -4.4403 -4.7126 -4.1935
-4.8598 -4.3971 -4.8340  ...  -4.4494 -4.4382 -4.4886
[torch.cuda.FloatTensor of size 512x85 (GPU 0)]

Use Adam optimizer, specify the parameters to optimize, and the learning rate.

In [67]:
opt = optim.Adam(m.parameters(), 1e-2)

F.nll_loss is the negative log likelihood loss. It expects log-probabilities as input, which is why the model ends with `log_softmax`.

### Train the model

In [68]:
fit(m, md, 1, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss                               
    0      2.075      0.689412  



[array([0.68941])]

#### Manual annealing

In [69]:
set_lrs(opt, 0.001) # why use a layer optimizer if we are not using a learner?
                    # there is no SGDR or differential learning rate to begin with

In [70]:
fit(m, md, 1, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss                               
    0      1.813917   0.44118   



[array([0.44118])]

### Test model

In [105]:
?V
Signature: V(x, requires_grad=False, volatile=False)
Source:   
def V(x, requires_grad=False, volatile=False):
    '''creates a single or a list of pytorch tensors, depending on input x. '''
    return map_over(x, lambda o: V_(o, requires_grad, volatile))

In [106]:
?VV
def VV(x):
    '''creates a single or a list of pytorch tensors, depending on input x. '''
    return map_over(x, VV_)

In [113]:
??VV_

def VV_(x): 
    '''creates a volatile tensor, which does not require gradients. '''
    return create_variable(x, True)

In [114]:
??create_variable
 
def create_variable(x, volatile, requires_grad=False):
    if type (x) != Variable:
        if IS_TORCH_04: x = Variable(T(x), requires_grad=requires_grad)
        else:           x = Variable(T(x), requires_grad=requires_grad, volatile=volatile)
    return x

Look inside get_next() by printing every line.

In [55]:
def get_next(inp):
    print("1) Inputs inp: ", inp, "\n")
    idxs = T(np.array([char_indices[c] for c in inp]))
    print("2) Convert to index idxs:", idxs, "\n")
    p = m(*VV(idxs))
    print("3) Calculate probability for dictionary p:", p, "\n")
    i = np.argmax(to_np(p))
    print("4) Get prediction i:",i, "\n")
    return chars[i]

In [56]:
get_next('y. ')

1) Inputs inp:  y.  

2) Convert to index idxs: 
 78
 10
  2
[torch.cuda.LongTensor of size 3 (GPU 0)]
 

3) Calculate probability for dictionary p: Variable containing:

Columns 0 to 9 
-4.6038 -4.4689 -4.3727 -4.5109 -4.4358 -4.4557 -4.3810 -4.5054 -4.4593 -4.4198

Columns 10 to 19 
-4.4614 -4.3225 -4.5428 -4.6532 -4.3256 -4.4126 -4.5740 -4.2827 -4.5083 -4.5611

Columns 20 to 29 
-4.6086 -4.6325 -4.6438 -4.5238 -4.3041 -4.3279 -4.3085 -4.2459 -4.4396 -4.6418

Columns 30 to 39 
-4.4156 -4.2547 -4.3981 -4.7036 -4.2437 -4.3488 -4.3553 -4.4531 -4.4709 -4.3814

Columns 40 to 49 
-4.4526 -4.5411 -4.4912 -4.2298 -4.5139 -4.1931 -4.3423 -4.5070 -4.5447 -4.3264

Columns 50 to 59 
-4.3112 -4.5760 -4.6654 -4.3588 -4.4506 -4.4406 -4.5779 -4.2468 -4.4907 -4.4721

Columns 60 to 69 
-4.3431 -4.4567 -4.4477 -4.5343 -4.4035 -4.5879 -4.3351 -4.3739 -4.5189 -4.4786

Columns 70 to 79 
-4.6306 -4.6601 -4.4965 -4.4755 -4.4075 -4.4231 -4.4501 -4.3686 -4.3907 -4.5869

Columns 80 to 84 
-4.3304 -4.3737 -4.44

'U'

In [71]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [72]:
get_next('ppl')

'y'

In [73]:
get_next(' th')

'e'

In [74]:
get_next('and')

' '
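`get_next` picks the character with the highest score. Because the log is monotonic, taking the argmax over log-probabilities selects the same index as taking it over probabilities. A minimal sketch with a made-up four-character vocabulary and hypothetical model output:

```python
import math

chars = ["\0", "a", "b", "c"]            # toy vocabulary (illustrative)
log_probs = [-4.0, -0.3, -2.1, -1.9]     # hypothetical log-probabilities for one input

# Argmax over log-probabilities.
i = max(range(len(log_probs)), key=log_probs.__getitem__)

# Argmax over the corresponding probabilities gives the same index.
probs = [math.exp(lp) for lp in log_probs]
j = max(range(len(probs)), key=probs.__getitem__)

print(i == j, chars[i])   # prints: True a
```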

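## Why does `T(...)` make no difference?
`T` in fastai is not a transpose: it converts a numpy array to a torch tensor. As the source of `create_variable` above shows, `VV` already calls `T(x)` on anything that is not a Variable, so the line below can be written with or without `T` and gives the same answer.

> idxs = T(np.array([char_indices[c] for c in inp]))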

In [75]:
def get_next(inp):
    idxs = (np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [76]:
get_next('ppl')

'y'

In [77]:
get_next(' th')

'e'

In [78]:
get_next('and')

' '

## Our first RNN!

### Create inputs

This is the size of our unrolled RNN.

Sequence length = cs

In [79]:
cs=8

In [80]:
idx[:20]

[40, 42, 29, 30, 25, 27, 29, 1, 1, 1, 43, 45, 40, 40, 39, 43, 33, 38, 31, 2]

For each starting position in the text, create a list of the 8 consecutive characters that begin there. Consecutive rows therefore overlap, each shifted by one character. These will be the 8 inputs to our model.  
Note that **c_in_dat** is a list of lists

In [81]:
c_in_dat = [[idx[i+j] for i in range(cs)] for j in range(len(idx)-cs)] # the lecture uses range(len(idx)-cs-1);
c_in_dat[:10]                                                          # len(idx)-cs also works, since idx[j+cs] stays in range

[[40, 42, 29, 30, 25, 27, 29, 1],
 [42, 29, 30, 25, 27, 29, 1, 1],
 [29, 30, 25, 27, 29, 1, 1, 1],
 [30, 25, 27, 29, 1, 1, 1, 43],
 [25, 27, 29, 1, 1, 1, 43, 45],
 [27, 29, 1, 1, 1, 43, 45, 40],
 [29, 1, 1, 1, 43, 45, 40, 40],
 [1, 1, 1, 43, 45, 40, 40, 39],
 [1, 1, 43, 45, 40, 40, 39, 43],
 [1, 43, 45, 40, 40, 39, 43, 33]]

Then create a list of the next character in each of these series. This will be the labels for our model.

In [82]:
c_out_dat = [idx[j+cs] for j in range(len(idx)-cs)]
c_out_dat[:20]

[1, 1, 43, 45, 40, 40, 39, 43, 33, 38, 31, 2, 73, 61, 54, 73, 2, 44, 71, 74]

In [83]:
xs = np.stack(c_in_dat, axis=0)

In [84]:
xs.shape

(600885, 8)

In [85]:
y = np.stack(c_out_dat)

So each row below is one series of 8 characters from the text.

In [101]:
xs[:cs,:cs]

array([[40, 42, 29, 30, 25, 27, 29,  1],
       [42, 29, 30, 25, 27, 29,  1,  1],
       [29, 30, 25, 27, 29,  1,  1,  1],
       [30, 25, 27, 29,  1,  1,  1, 43],
       [25, 27, 29,  1,  1,  1, 43, 45],
       [27, 29,  1,  1,  1, 43, 45, 40],
       [29,  1,  1,  1, 43, 45, 40, 40],
       [ 1,  1,  1, 43, 45, 40, 40, 39]])

...and this is the next character after each sequence.

In [102]:
y[:cs]

array([ 1,  1, 43, 45, 40, 40, 39, 43])
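The windowing can be checked on a toy sequence where every value equals its position. In this sketch `seq` stands in for `idx`, and `c_in` and `c_out` mirror `c_in_dat` and `c_out_dat`.

```python
seq = list(range(12))  # stand-in for idx
cs = 8

# One window of cs consecutive values per starting position j.
c_in = [[seq[i + j] for i in range(cs)] for j in range(len(seq) - cs)]
# The label is the value that immediately follows each window.
c_out = [seq[j + cs] for j in range(len(seq) - cs)]

for window, label in zip(c_in, c_out):
    print(window, "->", label)

# Each window is simply a slice of the sequence.
print(all(c_in[j] == seq[j:j + cs] for j in range(len(c_in))))
```

With 12 values and a window of 8 there are 4 examples, and the last label is the final element of the sequence, so `range(len(seq) - cs)` uses every available window without running out of range.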

### Create and train model

In [86]:
len(idx)-cs-1

600884

In [87]:
get_cv_idxs(len(idx)-cs-1)  # leave space for 8 chars of input + one char of outputs == cs+1

array([480310, 419017, 232803, ..., 134355, 389158, 330599])

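## ???
There seems to be no need to subtract 1 in len(idx)-cs-1.  
`xs` has len(idx)-cs = 600885 rows, so passing len(idx)-cs-1 only means that the last row can never be drawn into the validation set; "len(idx)-cs" works just as well.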

In [108]:
val_idx = get_cv_idxs(len(idx)-cs-1)

In [90]:
len(val_idx)

120176
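The size of the validation set matches a 20% split of 600884 rows. The sketch below is a hypothetical stand-in for `get_cv_idxs`, assumed here to return a seeded random sample of row indices; the real function may differ in its details, and `toy_cv_idxs`, `val_pct`, and `seed` are illustrative names.

```python
import random

def toy_cv_idxs(n, val_pct=0.2, seed=42):
    """Hypothetical stand-in: a seeded random sample of int(val_pct * n) row indices."""
    rng = random.Random(seed)
    n_val = int(val_pct * n)
    return rng.sample(range(n), n_val)

val_idx = toy_cv_idxs(600884)
print(len(val_idx))   # prints: 120176
```

Sampling rows at random means validation windows overlap heavily with training windows, which is worth keeping in mind when reading the validation loss.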

In [121]:
md = ColumnarModelData.from_arrays('.', val_idx, xs, y, bs=512)

Use a sequential loop instead of writing out each step independently.  
Inside the loop, the hidden state and the character input are added together
> h = F.tanh(self.l_hidden(h+inp))

In [110]:
class CharLoopModel(nn.Module):
    # This is an RNN!
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, n_hidden).cuda()) 
        for c in cs:
            inp = F.relu(self.l_in(self.e(c)))
            h = F.tanh(self.l_hidden(h+inp))
        
        return F.log_softmax(self.l_out(h), dim=-1)

In [111]:
m = CharLoopModel(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-2)

In [112]:
fit(m, md, 1, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss                              
    0      2.039299   2.012571  



[array([2.01257])]

In [113]:
set_lrs(opt, 0.001)

In [114]:
fit(m, md, 1, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss                              
    0      1.748343   1.74765   



[array([1.74765])]

In [115]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp])) # T converts the array to a tensor; VV would do this anyway
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [116]:
get_next('for thos')

'e'

In [117]:
get_next('part of ')

't'

In [118]:
get_next('queens a')

'n'

### RNN Model (Concat, not add)

This model concatenates, rather than adds, the hidden state and the character input, so `l_in` takes `n_fac+n_hidden` input features.
> self.l_in = nn.Linear(n_fac+n_hidden, n_hidden)

In [112]:
class CharLoopConcatModel(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.l_in = nn.Linear(n_fac+n_hidden, n_hidden)
        self.l_hidden = nn.Linear(n_hidden, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(bs, n_hidden).cuda())
        for c in cs:
            inp = torch.cat((h, self.e(c)), 1)
            inp = F.relu(self.l_in(inp))
            h = F.tanh(self.l_hidden(inp))
        
        return F.log_softmax(self.l_out(h), dim=-1)
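The change in `l_in`'s input size follows from the shapes. In this sketch plain nested lists stand in for tensors, and the batch size of 4 is arbitrary; `h`, `emb`, and `inp` mirror the names in `forward`.

```python
n_fac, n_hidden, bs = 42, 256, 4

h = [[0.0] * n_hidden for _ in range(bs)]   # hidden state, shape (bs, n_hidden)
emb = [[0.1] * n_fac for _ in range(bs)]    # embedded characters, shape (bs, n_fac)

# Adding requires equal widths, so the additive model projects emb to n_hidden first.
# Concatenating along the feature axis (dim 1) just joins each pair of rows.
inp = [h_row + e_row for h_row, e_row in zip(h, emb)]

print(len(inp), len(inp[0]))   # prints: 4 298
```

Concatenation keeps the hidden state and the new character in separate slots, leaving it to `l_in` to learn how to combine them.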

In [113]:
m = CharLoopConcatModel(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [114]:
fit(m, md, 1, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss                              
    0      1.86289    1.85946   



[array([1.85946])]

In [212]:
set_lrs(opt, 1e-4)

In [213]:
fit(m, md, 1, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))

epoch      trn_loss   val_loss                              
    0      1.694402   1.69904   



[array([1.69904])]

### Test model

In [214]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [215]:
get_next('for thos')

'e'

In [216]:
get_next('part of ')

't'

In [217]:
get_next('queens a')

'n'

## RNN with pytorch

?nn.RNN

Init signature: nn.RNN(*args, **kwargs)
Docstring:     
Applies a multi-layer Elman RNN with tanh or ReLU non-linearity to an
input sequence.


For each element in the input sequence, each layer computes the following
function:

.. math::

    h_t = \tanh(w_{ih} * x_t + b_{ih}  +  w_{hh} * h_{(t-1)} + b_{hh})

where :math:`h_t` is the hidden state at time `t`, and :math:`x_t` is
the hidden state of the previous layer at time `t` or :math:`input_t`
for the first layer. If nonlinearity='relu', then `ReLU` is used instead
of `tanh`.

Args:
    input_size: The number of expected features in the input x
    hidden_size: The number of features in the hidden state h
    num_layers: Number of recurrent layers.
    nonlinearity: The non-linearity to use ['tanh'|'relu']. Default: 'tanh'
    bias: If ``False``, then the layer does not use bias weights b_ih and b_hh.
        Default: ``True``
    batch_first: If ``True``, then the input and output tensors are provided
        as (batch, seq, feature)
    dropout: If non-zero, introduces a dropout layer on the outputs of each
        RNN layer except the last layer
    bidirectional: If ``True``, becomes a bidirectional RNN. Default: ``False``

Inputs: input, h_0
    - **input** (seq_len, batch, input_size): tensor containing the features
      of the input sequence. The input can also be a packed variable length
      sequence. See :func:`torch.nn.utils.rnn.pack_padded_sequence`
      for details.
    - **h_0** (num_layers * num_directions, batch, hidden_size): tensor
      containing the initial hidden state for each element in the batch.
      Defaults to zero if not provided.

Outputs: output, h_n
    - **output** (seq_len, batch, hidden_size * num_directions): tensor
      containing the output features (h_k) from the last layer of the RNN,
      for each k.  If a :class:`torch.nn.utils.rnn.PackedSequence` has
      been given as the input, the output will also be a packed sequence.
    - **h_n** (num_layers * num_directions, batch, hidden_size): tensor
      containing the hidden state for k=seq_len.

Attributes:
    weight_ih_l[k]: the learnable input-hidden weights of the k-th layer,
        of shape `(hidden_size x input_size)` for k=0. Otherwise, the shape is
        `(hidden_size x hidden_size)`
    weight_hh_l[k]: the learnable hidden-hidden weights of the k-th layer,
        of shape `(hidden_size x hidden_size)`
    bias_ih_l[k]: the learnable input-hidden bias of the k-th layer,
        of shape `(hidden_size)`
    bias_hh_l[k]: the learnable hidden-hidden bias of the k-th layer,
        of shape `(hidden_size)`

Examples::

    >>> rnn = nn.RNN(10, 20, 2) # 2 layers
    >>> input = Variable(torch.randn(5, 3, 10))  # (sequence length, bs, embedding)
    >>> h0 = Variable(torch.randn(2, 3, 20)) # normally 1st argument is 1  (layer, bs, hidden)
    >>> output, hn = rnn(input, h0)
File:           ~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/rnn.py
Type:           type

In [None]:
Note that the 3rd argument **num_layers** = `2`   
nn.RNN(10, 20, `2`)  
Variable(torch.randn(`2`, 3, 20))  

This is a more complex architecture that stacks 2 RNNs, one on top of the other.

In [None]:
Args:
    input_size: The number of expected features in the input x
    hidden_size: The number of features in the hidden state h
    num_layers: Number of recurrent layers.
        
rnn = nn.RNN(10, 20, 2)
h0 = Variable(torch.randn(2, 3, 20))

Note that if the following code only passed in `inp = self.e(cs)`, running fit() would later fail with the error  
**'tuple' object has no attribute 'dim'**  
  
because `cs` is a tuple of tensors rather than a single tensor. Hence you have to stack the inputs first using `inp = self.e(torch.stack(cs))`  

Note that nn.RNN() accepts input of shape **(sequence length, bs, embedding)**

In [134]:
class CharRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(1, bs, n_hidden))
        inp = self.e(torch.stack(cs)) # See below regarding (sequence length, bs, embedding)
        outp,h = self.rnn(inp, h)
                
        return F.log_softmax(self.l_out(outp[-1]), dim=-1)
        

### Inputs shape to nn.RNN()

Note that the input to `nn.RNN()` has to be of size (sequence length, bs, embedding). 
A batch of inputs is conceptually of size (bs, 8), but the model is called with the eight columns as separate arguments. 
  
Because `forward()` is declared with the splat operator (the asterisk in `*cs`), 
these arguments are collected into a tuple of eight separate vectors of size (bs): 
`[(bs),(bs),(bs),(bs),(bs),(bs),(bs),(bs)]`

Calling torch.stack(cs) stacks these vectors on top of each other, 
creating a tensor of size (sequence length, bs).

Passing this through nn.Embedding() makes it (sequence length, bs, embedding).   
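The shape bookkeeping can be traced with nested lists standing in for tensors. This is only an illustration of the shapes: `batch`, `embedding`, and the tiny sizes are made up, and no PyTorch call is involved.

```python
bs, seq_len, n_fac = 3, 8, 2

# A batch of bs rows, each holding seq_len character indices: shape (bs, seq_len).
batch = [[10 * r + t for t in range(seq_len)] for r in range(bs)]

# The model receives one vector per character position: seq_len vectors of length bs.
cs = [list(col) for col in zip(*batch)]

# Stacking those vectors gives shape (seq_len, bs).
stacked = cs
print(len(stacked), len(stacked[0]))

# A toy embedding lookup maps every index to a vector of length n_fac.
embedding = {idx: [float(idx), -float(idx)] for row in batch for idx in row}
embedded = [[embedding[i] for i in step] for step in stacked]

# Final shape: (seq_len, bs, n_fac).
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)   # prints: (8, 3, 2)
```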
  

PyTorch RNNs return a tuple of **(output, h_n)**.    
  
`output` contains the hidden state of the last RNN layer at **every timestep**. It is a tensor of shape (seq_len, batch_size, hidden_size * num_directions).    
  
`h_n` is the hidden state for t=seq_len (for all RNN layers and directions).  
  
So `outp[-1]` is the last layer's hidden state at the last timestep, which is usually what you want to pass downstream for sequence prediction tasks. [link](https://stackoverflow.com/questions/48302810/whats-the-difference-between-hidden-and-output-in-pytorch-lstm)
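The relationship between the two return values can be seen in a hand-rolled version of the recurrence from the `nn.RNN` docstring, h_t = tanh(w_ih * x_t + b_ih + w_hh * h_(t-1) + b_hh). The sketch below is a single-layer, unidirectional RNN for one sequence without a batch axis; it is not PyTorch's implementation, and `rand_matrix`, `matvec`, and `rnn` are hypothetical helpers.

```python
import math
import random

random.seed(0)
input_size, hidden_size, seq_len = 2, 3, 4

def rand_matrix(rows, cols):
    """Random weight matrix as a list of rows."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    """Matrix-vector product W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W_ih = rand_matrix(hidden_size, input_size)
W_hh = rand_matrix(hidden_size, hidden_size)
b_ih = [0.0] * hidden_size
b_hh = [0.0] * hidden_size

def rnn(inputs, h0):
    """Return the hidden state at every timestep and the final hidden state."""
    h, outputs = h0, []
    for x in inputs:
        pre = [a + b + c + d
               for a, b, c, d in zip(matvec(W_ih, x), b_ih, matvec(W_hh, h), b_hh)]
        h = [math.tanh(v) for v in pre]
        outputs.append(h)          # one entry per timestep
    return outputs, h

inputs = [[random.uniform(-1, 1) for _ in range(input_size)] for _ in range(seq_len)]
outputs, h_n = rnn(inputs, [0.0] * hidden_size)

print(len(outputs), outputs[-1] == h_n)   # prints: 4 True
```

For a single layer and a single direction, the last entry of the output and the final hidden state are the same vector. With stacked or bidirectional RNNs, `h_n` holds one state per layer and direction, while the output only covers the last layer.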

### self.rnn(inp, h)     

There is no need to write the loop ourselves when using 
> self.rnn = nn.RNN(n_fac, n_hidden)  
  
as PyTorch handles it for us. The following lines from the previous RNN    

In [None]:
h = V(torch.zeros(bs, n_hidden).cuda())  
for c in cs:  
    inp = torch.cat((h, self.e(c)), 1)  
    inp = F.relu(self.l_in(inp))  
    h = F.tanh(self.l_hidden(inp))  

F.log_softmax(self.l_out(h), dim=-1)    

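are replaced by the lines below. The computation is analogous rather than identical, since `nn.RNN` follows the formula given in its docstring.  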

In [None]:
h = V(torch.zeros(1, bs, n_hidden))
inp = self.e(torch.stack(cs)) #(sequence length, bs, embedding)
outp,h = self.rnn(inp, h)     

F.log_softmax(self.l_out(outp[-1]), dim=-1)

Note that the initial hidden state `h` (a tensor of zeros, not a weight matrix) has to be a rank 3 tensor **(1, bs, n_hidden)**.

#### Instantiate the model

In [135]:
m = CharRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

Let's pass a single batch of data into the rnn layer to see it in action.  
To do so, first call `iter()` on the training data loader.

In [143]:
it = iter(md.trn_dl)
*xs,yt = next(it)

Note that the input to the RNN must have the dimensions   
*Sequence length, Batch size, Embedding size* 

`xs` is a list of tensors, so it is first stacked into a single torch tensor.

In [144]:
type(xs)

list

In [145]:
torch.stack(xs)


   73    68    54  ...      2    57    58
   62    56    62  ...     44    72    72
   56    73    67  ...     32     2     2
       ...          ⋱          ...       
   62    67     2  ...      2     2     2
   67    58    54  ...     73    73    70
    2    10    67  ...     61    61    74
[torch.cuda.LongTensor of size 8x512 (GPU 0)]

In [146]:
input1 = m.e(V(torch.stack(xs)))
print("Sequence length, Batch size, Embedding size :",input1.size())

Sequence length, Batch size, Embedding size : torch.Size([8, 512, 42])


Note that the initial hidden state has a similar layout:      
*Number of stacked RNN layers, Batch size, Hidden state size* 
  
The number of stacked RNN layers is usually 1 (unless the architecture is more complex).

In [138]:
ht = V(torch.zeros(1, 512,n_hidden))
print("Number of Stacked RNN Layer, Batch size, Hidden state size :",ht.size())

Number of Stacked RNN Layer, Batch size, Hidden state size : torch.Size([1, 512, 256])


Pass a single batch into `m.rnn`

In [139]:
outp, hn = m.rnn(input1, ht)

After running through the sequence length, the final hidden state size is as follows

In [140]:
hn.size()

torch.Size([1, 512, 256])

Whereas the output has a size of 8 (axis 0), equal to the sequence length.

In [130]:
outp.size()

torch.Size([8, 512, 256])

Now forward pass a batch of 512 rows of inputs through the whole model.  
For every row of input, it produces a log-probability for each of the 85 characters in the vocabulary.  
Note that the model `m` can be called like a layer, just as was done with `self.rnn`.

In [141]:
t = m(*V(xs)); t.size()

torch.Size([512, 85])

In [142]:
fit(m, md, 4, opt, F.nll_loss)

epoch      trn_loss   val_loss                              
    0      1.871748   1.841063  
    1      1.672987   1.675138                              
    2      1.582195   1.602557                              
    3      1.542989   1.555944                              



[array([1.55594])]

In [230]:
set_lrs(opt, 1e-4)

In [231]:
fit(m, md, 2, opt, F.nll_loss)

epoch      trn_loss   val_loss                              
    0      1.472884   1.511173  
    1      1.46085    1.505596                              



[array([1.5056])]

### Test model

In [232]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp])) # T() is fastai's helper that converts the
                                                       # array to a tensor; it is not a transpose
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [233]:
get_next('for thos')

'e'

In [234]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c 
    return res

In [235]:
get_next_n('for thos', 40)

'for those something the same to the same to the '

## Stacked RNN

Let's try building a 2-layer RNN by passing `2` as the number of layers:

>  self.rnn = nn.RNN(n_fac, n_hidden,2)

In [309]:
class CharStackedRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden,2)   # changed to 2
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(2, bs, n_hidden)) # changed to 2
        inp = self.e(torch.stack(cs))
        outp,h = self.rnn(inp, h)
        
        return F.log_softmax(self.l_out(outp[-1]), dim=-1)

In [310]:
m = CharStackedRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [311]:
it = iter(md.trn_dl)
*xs,yt = next(it)

In [239]:
input1 = m.e(V(torch.stack(xs)))
print("Sequence length, Batch size, Embedding size :",input1.size())

Sequence length, Batch size, Embedding size : torch.Size([8, 512, 42])


In [240]:
ht = V(torch.zeros(2, 512,n_hidden))
print("Number of Stacked RNN Layer, Batch size, Hidden state size :",ht.size())

Number of Stacked RNN Layer, Batch size, Hidden state size : torch.Size([2, 512, 256])


In [241]:
outp, hn = m.rnn(input1, ht)

After running through the sequence length, the final hidden state size is as follows

In [242]:
hn.size()

torch.Size([2, 512, 256])

Whereas the output has a size of 8 (axis 0), equal to the sequence length.

In [243]:
outp.size()

torch.Size([8, 512, 256])
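
For the stacked case, the relationship between `outp` and `hn` can be sketched as follows. This uses plain tensors from current PyTorch and small made-up sizes, and a random tensor stands in for the embedded characters.

```python
import torch
import torch.nn as nn

seq_len, bs, n_fac, n_hidden = 8, 4, 42, 16   # illustrative sizes
rnn2 = nn.RNN(n_fac, n_hidden, 2)             # third positional argument is num_layers

inp = torch.randn(seq_len, bs, n_fac)         # stand-in for embedded characters
h0 = torch.zeros(2, bs, n_hidden)             # one initial state per layer
outp, h_n = rnn2(inp, h0)

print(outp.shape)   # top layer only, at every time step
print(h_n.shape)    # every layer, at the last time step only
# The top layer at the last step is the one place where the two overlap
print(torch.allclose(outp[-1], h_n[-1]))
```

This is why the model can keep using `outp[-1]` unchanged when a second layer is added: `outp` only ever exposes the top layer.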

In [244]:
t = m(*V(xs)); t.size()

torch.Size([512, 85])

Overfits quickly

In [245]:
fit(m, md, 4, opt, F.nll_loss)

epoch      trn_loss   val_loss                              
    0      1.733756   1.723142  
    1      1.56492    1.569838                              
    2      1.488353   1.507365                              
    3      1.445725   1.47659                               



[array([1.47659])]

### Test model

In [246]:
def get_next(inp):
    idxs = T(np.array([char_indices[c] for c in inp]))
    p = m(*VV(idxs))
    i = np.argmax(to_np(p))
    return chars[i]

In [247]:
get_next('for thos')

'e'

In [248]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res

In [249]:
get_next_n('for thos', 40)

'for those of the same something the same somethi'

## Multi-output model

### Setup

Let's take non-overlapping sets of characters this time

In [312]:
cs = 8

In [313]:
c_in_dat = [[idx[i+j] for i in range(cs)] for j in range(0, len(idx)-cs-1, cs)]
c_in_dat[:10]

[[40, 42, 29, 30, 25, 27, 29, 1],
 [1, 1, 43, 45, 40, 40, 39, 43],
 [33, 38, 31, 2, 73, 61, 54, 73],
 [2, 44, 71, 74, 73, 61, 2, 62],
 [72, 2, 54, 2, 76, 68, 66, 54],
 [67, 9, 9, 76, 61, 54, 73, 2],
 [73, 61, 58, 67, 24, 2, 33, 72],
 [2, 73, 61, 58, 71, 58, 2, 67],
 [68, 73, 2, 60, 71, 68, 74, 67],
 [57, 1, 59, 68, 71, 2, 72, 74]]

Then create the exact same thing, offset by 1, as our labels

In [314]:
c_out_dat = [[idx[i+j] for i in range(cs)] for j in range(1, len(idx)-cs, cs)]
c_out_dat[:10]

[[42, 29, 30, 25, 27, 29, 1, 1],
 [1, 43, 45, 40, 40, 39, 43, 33],
 [38, 31, 2, 73, 61, 54, 73, 2],
 [44, 71, 74, 73, 61, 2, 62, 72],
 [2, 54, 2, 76, 68, 66, 54, 67],
 [9, 9, 76, 61, 54, 73, 2, 73],
 [61, 58, 67, 24, 2, 33, 72, 2],
 [73, 61, 58, 71, 58, 2, 67, 68],
 [73, 2, 60, 71, 68, 74, 67, 57],
 [1, 59, 68, 71, 2, 72, 74, 72]]

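Minor technical note: the input range is `range(0, len(idx)-cs-1, cs)`, with an extra minus 1 in the stop value,     
whereas the output range starts from 1 and is `range(1, len(idx)-cs, cs)`. Start and stop are both shifted by one, so the two lists end up with the same number of chunks.  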
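
A toy version makes the effect of the minus 1 visible. The ten-element `idx` below is a made-up stand-in for the encoded corpus, and `cs = 3` is chosen only to keep the output short.

```python
cs = 3
idx = list(range(10))   # stand-in for the encoded corpus

starts_in = list(range(0, len(idx) - cs - 1, cs))   # [0, 3]
starts_out = list(range(1, len(idx) - cs, cs))      # [1, 4]

c_in = [[idx[i + j] for i in range(cs)] for j in starts_in]
c_out = [[idx[i + j] for i in range(cs)] for j in starts_out]

print(c_in)    # [[0, 1, 2], [3, 4, 5]]
print(c_out)   # [[1, 2, 3], [4, 5, 6]]

# Dropping the "- 1" from the input range would add a third chunk with no matching label
print(len(range(0, len(idx) - cs, cs)))   # 3
```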

In [315]:
xs = np.stack(c_in_dat)
xs.shape

(75111, 8)

In [316]:
ys = np.stack(c_out_dat)
ys.shape

(75111, 8)

In [317]:
xs[:cs,:cs]

array([[40, 42, 29, 30, 25, 27, 29,  1],
       [ 1,  1, 43, 45, 40, 40, 39, 43],
       [33, 38, 31,  2, 73, 61, 54, 73],
       [ 2, 44, 71, 74, 73, 61,  2, 62],
       [72,  2, 54,  2, 76, 68, 66, 54],
       [67,  9,  9, 76, 61, 54, 73,  2],
       [73, 61, 58, 67, 24,  2, 33, 72],
       [ 2, 73, 61, 58, 71, 58,  2, 67]])

In [318]:
ys[:cs,:cs]

array([[42, 29, 30, 25, 27, 29,  1,  1],
       [ 1, 43, 45, 40, 40, 39, 43, 33],
       [38, 31,  2, 73, 61, 54, 73,  2],
       [44, 71, 74, 73, 61,  2, 62, 72],
       [ 2, 54,  2, 76, 68, 66, 54, 67],
       [ 9,  9, 76, 61, 54, 73,  2, 73],
       [61, 58, 67, 24,  2, 33, 72,  2],
       [73, 61, 58, 71, 58,  2, 67, 68]])

### Create and train model

#### Get validation index

In [319]:
len(xs)-cs-1

75102

Around 20% will be used as validation

In [320]:
val_idx = get_cv_idxs(len(xs)-cs-1)
val_idx.size

15020

In [321]:
md = ColumnarModelData.from_arrays('.', val_idx, xs, ys, bs=512)

In [322]:
class CharSeqRnn(nn.Module):
    def __init__(self, vocab_size, n_fac):
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        
    def forward(self, *cs):
        bs = cs[0].size(0)
        h = V(torch.zeros(1, bs, n_hidden))
        inp = self.e(torch.stack(cs))
        outp,h = self.rnn(inp, h)
        return F.log_softmax(self.l_out(outp), dim=-1)

This *multi-output RNN* has exactly the same code as the *RNN with pytorch* above, except for the return statement: 

The RNN with pytorch returns one output  
`return F.log_softmax(self.l_out(outp[-1]), dim=-1)`  
  
The multi-output RNN returns all outputs  
`return F.log_softmax(self.l_out(outp), dim=-1)`  

#### Instantiate the model

In [323]:
m = CharSeqRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

#### Examining the loss function

Let's create an iterator

In [324]:
it = iter(md.trn_dl)
*xst,yt = next(it)

The target `yt` needs to be transposed and flattened so that it lines up with the shape of the output from `nn.RNN()` and with what the loss function expects.

#### Transpose target

Note that `yt` has shape (512, 8): the outer dim is 512 and the inner dim is 8. `transpose(0,1)` gives it the new shape (8, 512), but the data in memory is still laid out with 8 as the inner dim, which is why `.contiguous()` is called before flattening.
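
The effect of the transpose on memory layout can be seen on a tiny tensor. This is a sketch with made-up sizes (3 samples, 2 time steps) using current PyTorch.

```python
import torch

bs, sl = 3, 2                                # illustrative sizes
yt = torch.arange(bs * sl).view(bs, sl)      # rows are samples, columns are time steps
ytt = yt.transpose(0, 1)                     # (sl, bs), same underlying storage

print(ytt.is_contiguous())                   # False
flat = ytt.contiguous().view(-1)             # time-major: all of step 0, then all of step 1
print(flat.tolist())                         # [0, 2, 4, 1, 3, 5]
```

The flattened order (all samples at step 0, then all samples at step 1) is the same order in which a (sequence length, bs, vocab) prediction is laid out once its first two axes are merged.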

In [325]:
yt.size()

torch.Size([512, 8])

In [57]:
ytt = yt.transpose(0,1)
ytt.size() 

torch.Size([8, 512])

Then it has to be flattened from 8 × 512

In [58]:
ytt


    8    73    58  ...     62    72    47
    2    61    71  ...     73    68    54
   62    58    75  ...     61    69    60
       ...          ⋱          ...       
   54    74    73  ...     32     2    71
   57    67     4  ...     29    62     5
   57    58     9  ...     37    67    72
[torch.cuda.LongTensor of size 8x512 (GPU 0)]

to a rank-1 tensor of 4096 elements

In [60]:
ytt.contiguous().view(-1)


  8
 73
 58
 ⋮ 
 37
 67
 72
[torch.cuda.LongTensor of size 4096 (GPU 0)]

#### Flattening

In [55]:
a = np.reshape(list(range(60)),(3,4,5))

In [56]:
a = torch.Tensor(a)
a


(0 ,.,.) = 
   0   1   2   3   4
   5   6   7   8   9
  10  11  12  13  14
  15  16  17  18  19

(1 ,.,.) = 
  20  21  22  23  24
  25  26  27  28  29
  30  31  32  33  34
  35  36  37  38  39

(2 ,.,.) = 
  40  41  42  43  44
  45  46  47  48  49
  50  51  52  53  54
  55  56  57  58  59
[torch.FloatTensor of size 3x4x5]

In [57]:
a.view(-1,5)


    0     1     2     3     4
    5     6     7     8     9
   10    11    12    13    14
   15    16    17    18    19
   20    21    22    23    24
   25    26    27    28    29
   30    31    32    33    34
   35    36    37    38    39
   40    41    42    43    44
   45    46    47    48    49
   50    51    52    53    54
   55    56    57    58    59
[torch.FloatTensor of size 12x5]

In [58]:
a.view(3,-1)



Columns 0 to 12 
    0     1     2     3     4     5     6     7     8     9    10    11    12
   20    21    22    23    24    25    26    27    28    29    30    31    32
   40    41    42    43    44    45    46    47    48    49    50    51    52

Columns 13 to 19 
   13    14    15    16    17    18    19
   33    34    35    36    37    38    39
   53    54    55    56    57    58    59
[torch.FloatTensor of size 3x20]

In [59]:
a.view(4,-1)



Columns 0 to 12 
    0     1     2     3     4     5     6     7     8     9    10    11    12
   15    16    17    18    19    20    21    22    23    24    25    26    27
   30    31    32    33    34    35    36    37    38    39    40    41    42
   45    46    47    48    49    50    51    52    53    54    55    56    57

Columns 13 to 14 
   13    14
   28    29
   43    44
   58    59
[torch.FloatTensor of size 4x15]

In [60]:
a.view(-1,4)


    0     1     2     3
    4     5     6     7
    8     9    10    11
   12    13    14    15
   16    17    18    19
   20    21    22    23
   24    25    26    27
   28    29    30    31
   32    33    34    35
   36    37    38    39
   40    41    42    43
   44    45    46    47
   48    49    50    51
   52    53    54    55
   56    57    58    59
[torch.FloatTensor of size 15x4]

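Let's create the inputs to the loss function, which are the model's predictions:

In [326]:
pred = m(*V(xst))
pred.size()

torch.Size([8, 512, 85])

`pred` and the target will be fed into the loss function `nll_loss_seq()`, defined further below. First, a short NumPy illustration of the flattening it relies on: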

In [67]:
x = np.arange(24).reshape((2,3,4))
x

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

In [68]:
x.reshape(-1,4)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

Note that `F.nll_loss`, as used here, does not accept the rank-3 predictions directly. As such, they have to be flattened with **inputs.view(-1,nh)**.  
The last axis is kept while the rest are flattened into one.

In [332]:
_,_,nh = pred.size()
nh

85

8 × 512 = 4096 rows

In [331]:
pred.view(-1,nh)

Variable containing:
-4.3087 -4.3666 -4.6273  ...  -4.6672 -4.3569 -4.6242
-4.4377 -4.4526 -4.3745  ...  -4.4565 -4.3258 -4.4461
-4.4238 -4.8020 -4.2584  ...  -4.5244 -4.2242 -4.4699
          ...             ⋱             ...          
-4.4313 -4.7284 -4.3200  ...  -4.5356 -4.3340 -4.2562
-4.3487 -4.4368 -4.7077  ...  -4.7442 -4.1969 -4.5650
-4.1541 -4.2702 -4.7339  ...  -4.6129 -4.1995 -4.4297
[torch.cuda.FloatTensor of size 4096x85 (GPU 0)]

In [333]:
def nll_loss_seq(inp, targ):
    sl,bs,nh = inp.size()
    print("sl,bs,nh: ", sl,bs,nh)
    print("\n")
    targ = targ.transpose(0,1).contiguous().view(-1) # (bs, sl) -> (sl, bs) -> (sl*bs,)
    print("targ:", targ)
    print("\n")
    print("loss function output:")
    print("\n")
    print("'inp.view(-1,nh):' is inp flattened so that it can be fed into the loss function ")
    print(inp.view(-1,nh)) # (sl, bs, vocab) -> (sl*bs, vocab)
    return F.nll_loss(inp.view(-1,nh), targ)

Note that `F.nll_loss` is the **negative log likelihood loss**
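
What the flattened loss computes can be reproduced by hand. The sketch below uses current PyTorch, small made-up sizes and random values in place of real model output.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
sl, bs, vocab = 2, 3, 5                                     # illustrative sizes
pred = F.log_softmax(torch.randn(sl, bs, vocab), dim=-1)    # (sl, bs, vocab) log-probabilities
targ = torch.randint(0, vocab, (bs, sl))                    # (bs, sl), as the data loader returns it

flat_pred = pred.view(-1, vocab)                            # (sl*bs, vocab)
flat_targ = targ.transpose(0, 1).contiguous().view(-1)      # (sl*bs,)

loss = F.nll_loss(flat_pred, flat_targ)
# By hand: pick the log-probability of the correct character in each row, negate, average
manual = -flat_pred[torch.arange(sl * bs), flat_targ].mean()
print(loss.item(), manual.item())
```

The two numbers agree, so the sequence loss is simply the average negative log-probability assigned to the correct character over every time step of every sample.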

Run the function and print out all the results

In [334]:
nll_loss_seq(pred, V(yt))

sl,bs,nh:  8 512 85


targ: Variable containing:
 58
 21
 54
 ⋮ 
 74
 68
 57
[torch.cuda.LongTensor of size 4096 (GPU 0)]



loss function output:


'inp.view(-1,nh):' is inp flattened so that it can be fed into the loss function 
Variable containing:
-4.3087 -4.3666 -4.6273  ...  -4.6672 -4.3569 -4.6242
-4.4377 -4.4526 -4.3745  ...  -4.4565 -4.3258 -4.4461
-4.4238 -4.8020 -4.2584  ...  -4.5244 -4.2242 -4.4699
          ...             ⋱             ...          
-4.4313 -4.7284 -4.3200  ...  -4.5356 -4.3340 -4.2562
-4.3487 -4.4368 -4.7077  ...  -4.7442 -4.1969 -4.5650
-4.1541 -4.2702 -4.7339  ...  -4.6129 -4.1995 -4.4297
[torch.cuda.FloatTensor of size 4096x85 (GPU 0)]



Variable containing:
 4.4668
[torch.cuda.FloatTensor of size 1 (GPU 0)]

In [65]:
def nll_loss_seq(inp, targ):
    sl,bs,nh = inp.size() # note: nh here is really the vocab size (85), not the hidden size
    targ = targ.transpose(0,1).contiguous().view(-1)
    # targ is (512,8) but inp is (8,512,85), so transpose, then flatten using view(-1)
    
    return F.nll_loss(inp.view(-1,nh), targ)
    # flatten inp's first 2 axes (8,512,85) so that the 85 log-probabilities in each row are compared with one targ

In [66]:
fit(m, md, 4, opt, nll_loss_seq)

epoch      trn_loss   val_loss                              
    0      2.571819   2.395257  
    1      2.278041   2.189023                              
    2      2.126606   2.075855                              
    3      2.0344     2.001645                              



[array([2.00164])]

In [67]:
set_lrs(opt, 1e-4)

In [68]:
fit(m, md, 1, opt, nll_loss_seq)

epoch      trn_loss   val_loss                              
    0      1.986922   1.987179  



[array([1.98718])]

### Identity init!

Next, we retrain the model with the hidden-to-hidden weights initialized to the identity matrix instead of random values. 

In [69]:
m = CharSeqRnn(vocab_size, n_fac).cuda()
opt = optim.Adam(m.parameters(), 1e-2)

There are 3 module layers in the RNN

In [78]:
m.modules

<bound method Module.modules of CharSeqRnn(
  (e): Embedding(85, 42)
  (rnn): RNN(42, 256)
  (l_out): Linear(in_features=256, out_features=85, bias=True)
)>

The input-to-hidden (character input) weights:

In [82]:
m.rnn.weight_ih_l0.size()

torch.Size([256, 42])

and the hidden-to-hidden weights:

In [81]:
m.rnn.weight_hh_l0.size()

torch.Size([256, 256])

Note that these weights are randomly initialized, which is not ideal. The hidden-to-hidden weight matrix is applied again at every time step, so the hidden state (and its gradients) are likely to explode or vanish. Starting from the identity matrix can help to keep them stable.

In [73]:
m.rnn.weight_hh_l0

Parameter containing:
 2.9081e-02 -2.7065e-02  9.5262e-03  ...  -5.7022e-02  1.6494e-02 -1.8046e-02
-6.0399e-02 -3.7136e-02 -3.4931e-02  ...  -5.9569e-02  1.1623e-02  4.1199e-02
-2.0252e-02 -6.9852e-03 -5.7479e-02  ...   4.7481e-02  6.0246e-02 -4.6353e-02
                ...                   ⋱                   ...                
-1.2451e-02 -1.1864e-02 -4.8800e-02  ...   2.2607e-02 -4.8948e-02  2.6694e-02
 2.2739e-02 -4.8428e-02 -2.4000e-02  ...   2.7301e-02 -4.3589e-02  4.0415e-02
 3.1339e-02  3.8786e-02 -3.4345e-02  ...   5.0805e-02  4.7381e-02 -2.8259e-02
[torch.cuda.FloatTensor of size 256x256 (GPU 0)]

Let's use the identity matrix instead. **Note that an identity matrix is square, so only the hidden-to-hidden weights (256 × 256) can be initialized this way.**
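
A simplified sketch of why this matters: it repeatedly applies one matrix to a vector, ignoring the `tanh` and the input term of the real RNN. The scales 0.01 and 0.2 are arbitrary choices for illustration, not PyTorch's default initialization, and `norm_after` is a hypothetical helper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_hidden, steps = 256, 20

def norm_after(w, steps):
    """Apply the same linear map `steps` times to a unit vector and return its norm."""
    h = torch.ones(n_hidden) / n_hidden ** 0.5
    for _ in range(steps):
        h = w @ h
    return h.norm().item()

small = torch.randn(n_hidden, n_hidden) * 0.01   # shrinks the state at every step
large = torch.randn(n_hidden, n_hidden) * 0.2    # grows the state at every step
print(norm_after(small, steps), norm_after(large, steps), norm_after(torch.eye(n_hidden), steps))

# One way to apply the identity init to an nn.RNN in current PyTorch
rnn = nn.RNN(42, n_hidden)
nn.init.eye_(rnn.weight_hh_l0)
```

The identity matrix leaves the norm at 1 however many steps are taken, while the two random matrices drive it toward zero or toward very large values.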

In [184]:
m.rnn.weight_hh_l0.data.copy_(torch.eye(n_hidden))


    1     0     0  ...      0     0     0
    0     1     0  ...      0     0     0
    0     0     1  ...      0     0     0
       ...          ⋱          ...       
    0     0     0  ...      1     0     0
    0     0     0  ...      0     1     0
    0     0     0  ...      0     0     1
[torch.cuda.FloatTensor of size 256x256 (GPU 0)]

In [84]:
fit(m, md, 4, opt, nll_loss_seq)

epoch      trn_loss   val_loss                              
    0      2.188063   2.037499  
    1      1.973951   1.92179                               
    2      1.895284   1.882794                             
    3      1.850385   1.856922                              



[array([1.85692])]

In [85]:
set_lrs(opt, 1e-3)

In [86]:
fit(m, md, 4, opt, nll_loss_seq)

epoch      trn_loss   val_loss                              
    0      1.746195   1.781623  
    1      1.732748   1.774872                              
    2      1.725642   1.77018                               
    3      1.716911   1.76791                               



[array([1.76791])]

## Stateful model

There is a specific way that the data loader lines up text data into batches for a stateful RNN. Watch it [here](https://youtu.be/H3g26EVADgY?t=16m)
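
The layout can be sketched in plain Python. `batchify` and `minibatches` are hypothetical helpers written for this illustration, not fastai or torchtext functions, and the real loader may differ in details (for example, it may vary the sequence length from batch to batch).

```python
def batchify(tokens, bs):
    """Split one long token stream into `bs` parallel streams, laid out as rows of time steps."""
    n = len(tokens) // bs                      # tokens per stream; the remainder is dropped
    streams = [tokens[i * n:(i + 1) * n] for i in range(bs)]
    return [list(step) for step in zip(*streams)]   # n rows, each with bs entries

def minibatches(rows, bptt):
    """Yield (input, target) pairs of up to `bptt` consecutive rows; targets are shifted by one."""
    for i in range(0, len(rows) - 1, bptt):
        seq_len = min(bptt, len(rows) - 1 - i)
        yield rows[i:i + seq_len], rows[i + 1:i + 1 + seq_len]

tokens = list("abcdefghijklmnopqrstuvwx")      # 24 tokens
rows = batchify(tokens, bs=3)                  # 8 rows of 3 columns
batches = list(minibatches(rows, bptt=3))
for x, y in batches:
    print(x, y)
```

Column 0 reads `a, b, c` in the first mini-batch and `d, e, f` in the second, so the hidden state kept for that column at the end of one mini-batch is exactly the right starting state for the next.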

### Setup

In [58]:
from torchtext import vocab, data

from fastai.nlp import *
from fastai.lm_rnn import *

PATH='data/nietzsche/'

TRN_PATH = 'trn/'
VAL_PATH = 'val/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

[0m[01;34mmodels[0m/  nietzsche.txt  [01;34mtrn[0m/  [01;34mval[0m/


In [30]:
%ls {PATH}trn

training.txt


### Using LanguageModelData 

torchtext's `Field` class models common text processing datatypes that can be represented by tensors. It holds a `Vocab` object that defines the set of possible values for elements of the field and their corresponding numerical representations. 

The `Field` object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method and the kind of Tensor that should be produced. If a `Field` is shared between two columns in a dataset (e.g., question and answer in a QA dataset), then they will have a shared vocabulary.  

Basically, it defines how text is mapped to numbers and the logic behind this mapping.      



In [59]:
TEXT = data.Field(lower=True, tokenize=list)

`tokenize=list` uses the built-in function *list* to tokenize the text strings, which splits them into single characters.  
If `tokenize='spacy'` is selected, the SpaCy English tokenizer is used.  
The **default** tokenize function is just `str.split`.
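
The difference between the two tokenizers is easy to see on a short string. This is plain Python, with lowercasing applied by hand to mimic the combined effect of `lower=True`.

```python
sample = "Truth is a woman"

char_tokens = list(sample.lower())     # character-level, as with tokenize=list
word_tokens = sample.lower().split()   # word-level, as with the default str.split

print(char_tokens[:7])   # ['t', 'r', 'u', 't', 'h', ' ', 'i']
print(word_tokens)       # ['truth', 'is', 'a', 'woman']
```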

In [60]:
bs=64; bptt=8; n_fac=42; n_hidden=256

FILES is just a shortcut indicating where the training, validation and test paths are

In [61]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
FILES

{'train': 'trn/', 'validation': 'val/', 'test': 'val/'}

In [62]:
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=3)

In [44]:
    @classmethod
    def from_text_files(cls, path, field, train, validation, test=None, bs=64, bptt=70, **kwargs):
        """ Method used to instantiate a LanguageModelData object that can be used for a
            supported nlp task.

        Args:
            path (str): the absolute path in which temporary model data will be saved
            field (Field): torchtext field
            train (str): file location of the training data
            validation (str): file location of the validation data
            test (str): file location of the testing data
            bs (int): batch size to use
            bptt (int): back propagation through time hyper-parameter
            kwargs: other arguments

        Returns:
            a LanguageModelData instance, which most importantly, provides us the datasets for training,
                validation, and testing

        Note:
            The train, validation, and test path can be pointed to any file (or folder) that contains a valid
                text corpus.

In [26]:
print(f'Number of batches: {len(md.trn_dl)}')

Number of batches: 953


Note that the built-in function *list* is used in `tokenize=list`, giving 55 tokens (printed below). 

`TEXT = data.Field(lower=True, tokenize=list)`  
`LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=3)` 

Recall that at the beginning of this notebook we did not use the above to tokenize
but instead did the tokenizing ourselves using

`chars = sorted(list(set(text)))` 
  
giving us 85 unique tokens. The smaller count here is presumably due to `lower=True`, which merges upper and lower case, and `min_freq=3`, which drops characters seen fewer than three times.
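
How lowercasing and a frequency cutoff shrink a character vocabulary can be sketched in plain Python. `build_vocab` is a hypothetical helper that only imitates the idea; it is not the torchtext implementation, and the sample string is made up.

```python
from collections import Counter

def build_vocab(text, lower=False, min_freq=1, specials=()):
    """Return specials followed by characters seen at least `min_freq` times, most frequent first."""
    if lower:
        text = text.lower()
    counts = Counter(text)
    kept = [ch for ch, n in counts.most_common() if n >= min_freq]
    return list(specials) + kept

sample = "Aaa bbb AAA b Zq!"
raw = sorted(set(sample))   # the earlier manual approach: every distinct character
filtered = build_vocab(sample, lower=True, min_freq=3, specials=("<unk>", "<pad>"))

print(len(raw), raw)
print(len(filtered), filtered)
```

The manual approach keeps all 7 distinct characters, while the filtered vocabulary has 5 entries, two of which are the special tokens. Rare characters such as `Z`, `q` and `!` would be mapped to `<unk>`.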


In [45]:
print(f'Number of tokens: {md.nt}')

Number of tokens: 55


Print out all the tokens. TEXT.vocab is created already by now.

In [38]:
''.join(list(TEXT.vocab.itos))

'<unk><pad> etiaonsrhldcufmpg,ywbv-."kx;:qj!?()\'z12=_53[]468790ä'

In [39]:
print(f'Number of training dataset: {len(md.trn_ds)}')

Number of training dataset: 1


There is only one item in the training dataset because the whole chunk of Nietzsche text is treated as a single example.  

In [43]:
print(f"Number of tokens in training dataset: {len(md.trn_ds[0].text)}")

Number of tokens in training dataset: 488745


### Build the RNN  
  
At the end of an epoch, the batch size may be lower than 64. To handle this wrinkle, forward() checks `if self.h.size(1) != bs: self.init_hidden(bs)` (note that the hidden state is reset to zeros in this case, so no state is carried over). This gives the hidden state the batch size of the final batch of the current epoch, and resets it to 64 when the new epoch starts.  

`self.h = repackage_var(h)` forgets the prior operations so that backprop does not go all the way back in history; this keeps the bptt to 8 time-steps.

`F.log_softmax(self.l_out(outp), dim=-1)` applies log-softmax on the last dimension (vocab), and then `.view(-1, self.vocab_size)` flattens bs and bptt while keeping the last dimension (vocab) intact. Note that the target already arrives flattened from the data loader, so there is no need to do the same thing for it.

In [47]:
class CharSeqStatefulRnn(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        self.vocab_size = vocab_size
        super().__init__()
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNN(n_fac, n_hidden)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs): # note: no splat (*) needed with LanguageModelData
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs) 
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

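#### Why is no splat or torch.stack applied to cs when using LanguageModelData?  

Most likely because `LanguageModelData` already delivers each batch as a single tensor of shape (bptt, bs), whereas `ColumnarModelData` delivers one tensor per column, which then has to be collected with `*cs` and stacked.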

`repackage_var(h)` keeps the data but throws away all history of operations. This enables us to pass the state to the next minibatch, yet still have backprop stop after bptt timesteps (in this case 8 timesteps). Otherwise, as training goes on, self.h would carry a longer and longer history, and backprop would reach further and further back in time. 
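
In current PyTorch the same effect is usually obtained with `.detach()`. The sketch below assumes that substitution, uses made-up sizes, and feeds random tensors in place of embedded characters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bptt, bs, n_fac, n_hidden = 8, 4, 42, 16          # illustrative sizes
rnn = nn.RNN(n_fac, n_hidden)
h = torch.zeros(1, bs, n_hidden)

for step in range(3):                              # three consecutive mini-batches
    x = torch.randn(bptt, bs, n_fac)               # stand-in for embedded characters
    outp, h = rnn(x, h)
    print(step, h.grad_fn is not None)             # h is attached to this mini-batch's graph
    outp.sum().backward()                          # backprop reaches back at most bptt steps
    h = h.detach()                                 # keep the values, drop the history
    print(step, h.grad_fn is None, h.requires_grad)
```

Without the `detach()` call, the second `backward()` would try to walk back into the first mini-batch's graph, which PyTorch has already freed, and raise an error.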

Note that the final output is flattened inside forward().  

`F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)`

#### Regarding hidden state

In [None]:
def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

To make the model stateful, the hidden state is now stored as an attribute, as below. Previously it was created inside forward().

In [None]:
class CharSeqStatefulRnn(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        self.init_hidden(bs)

The final hidden state at the end of each mini-batch is stored,  
so that it is used as the initial hidden state when the next mini-batch begins.  
repackage_var() forgets the history of the operations.

In [None]:
    def forward(self, cs):
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)

To resolve the wrinkle of the last batch in an epoch having a different size than the other batches,  
we have to use this condition in the forward()

In [None]:
def forward(self, cs):
        bs = cs[0].size(0) # batch size will be different for the very last batch in an epoch
        if self.h.size(1) != bs: self.init_hidden(bs)        

### Open question

Given the way the corpus is split into batches to allow for stateful learning, how is   
the change in batch size in the final batch of an epoch reconciled?  
i.e. How is the leftover text lined up across the different batches?

### Train the model

In [48]:
m = CharSeqStatefulRnn(md.nt, n_fac, 512).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [49]:
fit(m, md, 4, opt, F.nll_loss)

epoch      trn_loss   val_loss                               
    0      1.872604   1.847967  
    1      1.69718    1.70512                                
    2      1.615404   1.635051                               
    3      1.565641   1.603767                               



[array([1.60377])]

In [50]:
set_lrs(opt, 1e-4)

fit(m, md, 4, opt, F.nll_loss)

epoch      trn_loss   val_loss                               
    0      1.482862   1.552192  
    1      1.487418   1.545811                               
    2      1.473185   1.542076                               
    3      1.478708   1.537967                               



[array([1.53797])]

### RNN loop

For pedagogical exposition, let's use RNNCell() from pytorch instead.  
This means that we will have to do the loop ourselves.    

In [9]:
# From the pytorch source
# just to show that torch isn't doing anything special

def RNNCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
    return F.tanh(F.linear(input, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh))
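
A quick check that the one-line formula matches `nn.RNNCell`. This sketch uses current PyTorch, where `torch.tanh` replaces the older `F.tanh`; `rnn_cell` is a local re-implementation and the sizes are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rnn_cell(inp, hidden, w_ih, w_hh, b_ih, b_hh):
    """One Elman RNN step: tanh(W_ih x + b_ih + W_hh h + b_hh)."""
    return torch.tanh(F.linear(inp, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh))

torch.manual_seed(0)
bs, n_fac, n_hidden = 4, 42, 16                    # illustrative sizes
cell = nn.RNNCell(n_fac, n_hidden)
x = torch.randn(bs, n_fac)
h = torch.randn(bs, n_hidden)

by_hand = rnn_cell(x, h, cell.weight_ih, cell.weight_hh, cell.bias_ih, cell.bias_hh)
print(torch.allclose(by_hand, cell(x, h), atol=1e-6))
```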

Read the comments added in the code below 

In [36]:
class CharSeqStatefulRnn2(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.RNNCell(n_fac, n_hidden) # RNNCell computes a single time step
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp = []  # empty list to collect the output at each time step
        o = self.h  
        for c in cs: # explicit loop, since nn.RNNCell handles one step at a time
            o = self.rnn(self.e(c), o)
            outp.append(o) # keep every hidden state in a list
        outp = self.l_out(torch.stack(outp)) # get multiple outputs
        self.h = repackage_var(o) # note that o is the final time step hidden state
                                  # repackage_var so that self.h has no history
                                  # repackage_var(o) == Variable(o.data)  
        return F.log_softmax(outp, dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

Note the above uses `LanguageModelData` to load data, while previously we were using `ColumnarModelData` to load them.

In [None]:
# With ColumnarModelData, the splat operator and torch.stack were used to change the shape of inp

def forward(self, *cs): # Note *
        bs = cs[0].size(0)
        h = V(torch.zeros(1, bs, n_hidden))
        inp = self.e(torch.stack(cs)) # have to stack to change shape to (seq,bs,embedding)
        outp,h = self.rnn(inp, h)
        return F.log_softmax(self.l_out(outp), dim=-1)

In [37]:
m = CharSeqStatefulRnn2(md.nt, n_fac, 512).cuda()
opt = optim.Adam(m.parameters(), 1e-3)

In [38]:
fit(m, md, 4, opt, F.nll_loss)

epoch      trn_loss   val_loss                               
    0      1.873116   1.853935  
    1      1.695284   1.70925                                
    2      1.60732    1.645366                               
    3      1.565484   1.604898                               



[array([1.6049])]

### GRU

The GRU class below is the same as *CharSeqStatefulRnn()*, except for one line.

In [39]:
class CharSeqStatefulGRU(nn.Module):
    def __init__(self, vocab_size, n_fac, bs):
        super().__init__()
        self.vocab_size = vocab_size
        self.e = nn.Embedding(vocab_size, n_fac)
        self.rnn = nn.GRU(n_fac, n_hidden)  # Only this line is changed: nn.GRU instead of nn.RNN
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h.size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs): self.h = V(torch.zeros(1, bs, n_hidden))

In [40]:
# From the pytorch source code - for reference

def GRUCell(input, hidden, w_ih, w_hh, b_ih, b_hh):
    gi = F.linear(input, w_ih, b_ih) # input (character) contribution to all 3 gates
    gh = F.linear(hidden, w_hh, b_hh) # hidden state contribution to all 3 gates
    i_r, i_i, i_n = gi.chunk(3, 1) # split along dim 1 into the 3 respective gates
    h_r, h_i, h_n = gh.chunk(3, 1) # same

    resetgate = F.sigmoid(i_r + h_r)
    inputgate = F.sigmoid(i_i + h_i)
    newgate = F.tanh(i_n + resetgate * h_n) # new state
    return newgate + inputgate * (hidden - newgate) 
          # hidden*(inputgate) + (1-inputgate)*newgate

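`gi` and `gh` are indeed 3 times wider than the hidden state: in `nn.GRU`, `w_ih` has shape (3*n_hidden, n_fac) and `w_hh` has shape (3*n_hidden, n_hidden), so each result can be split into 3 chunks of width n_hidden.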
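
The weight shapes and the gate arithmetic can be checked against `nn.GRUCell`. This is a sketch in current PyTorch with made-up sizes; `gru_cell` is a local re-implementation of the function above, with `torch.sigmoid` and `torch.tanh` replacing the older `F.sigmoid` and `F.tanh`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gru_cell(inp, hidden, w_ih, w_hh, b_ih, b_hh):
    gi = F.linear(inp, w_ih, b_ih)        # (bs, 3 * n_hidden)
    gh = F.linear(hidden, w_hh, b_hh)     # (bs, 3 * n_hidden)
    i_r, i_z, i_n = gi.chunk(3, 1)        # reset, update and candidate parts
    h_r, h_z, h_n = gh.chunk(3, 1)
    r = torch.sigmoid(i_r + h_r)          # reset gate
    z = torch.sigmoid(i_z + h_z)          # update gate ("inputgate" in the source above)
    n = torch.tanh(i_n + r * h_n)         # candidate state
    return n + z * (hidden - n)           # same as z * hidden + (1 - z) * n

torch.manual_seed(0)
bs, n_fac, n_hidden = 4, 42, 16           # illustrative sizes
cell = nn.GRUCell(n_fac, n_hidden)
x, h = torch.randn(bs, n_fac), torch.randn(bs, n_hidden)

print(cell.weight_ih.shape, cell.weight_hh.shape)   # 3 * 16 = 48 rows in each
by_hand = gru_cell(x, h, cell.weight_ih, cell.weight_hh, cell.bias_ih, cell.bias_hh)
print(torch.allclose(by_hand, cell(x, h), atol=1e-6))
```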

#### Chunking

In [14]:
v = torch.arange(12)
v = v.view(4,3)
v


  0   1   2
  3   4   5
  6   7   8
  9  10  11
[torch.FloatTensor of size 4x3]

In [15]:
v.chunk(3,1)

(
  0
  3
  6
  9
 [torch.FloatTensor of size 4x1], 
   1
   4
   7
  10
 [torch.FloatTensor of size 4x1], 
   2
   5
   8
  11
 [torch.FloatTensor of size 4x1])

--------

In [41]:
m = CharSeqStatefulGRU(md.nt, n_fac, 512).cuda()

opt = optim.Adam(m.parameters(), 1e-3)

In [42]:
fit(m, md, 6, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=6), HTML(value='')))

epoch      trn_loss   val_loss                               
    0      1.741781   1.726211  
    1      1.563426   1.578207                               
    2      1.476227   1.516808                               
    3      1.427569   1.484527                               
    4      1.384639   1.463643                               
    5      1.35995    1.456252                               



[array([1.45625])]

In [206]:
set_lrs(opt, 1e-4)

In [207]:
fit(m, md, 3, opt, F.nll_loss)

HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))

epoch      trn_loss   val_loss                               
    0      1.2754     1.423165  
    1      1.273323   1.418575                               
    2      1.285239   1.417631                               



[array([1.41763])]

### Putting it all together: LSTM

In [63]:
from fastai import sgdr

n_hidden=512

In [215]:
??nn.LSTM

Init signature: nn.LSTM(*args, **kwargs)
Source:        
class LSTM(RNNBase):
    r"""Applies a multi-layer long short-term memory (LSTM) RNN to an input
    sequence.


    For each element in the input sequence, each layer computes the following
    function:

    .. math::

            \begin{array}{ll}
            i_t = \mathrm{sigmoid}(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\
            f_t = \mathrm{sigmoid}(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\
            g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\
            o_t = \mathrm{sigmoid}(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\
            c_t = f_t * c_{(t-1)} + i_t * g_t \\
            h_t = o_t * \tanh(c_t)
            \end{array}

    where :math:`h_t` is the hidden state at time `t`, :math:`c_t` is the cell
    state at time `t`, :math:`x_t` is the hidden state of the previous layer at
    time `t` or :math:`input_t` for the first layer, and :math:`i_t`,
    :math:`f_t`, :math:`g_t`, :math:`o_t` are the input, forget, cell,
    and out gates, respectively.

    Args:
        input_size: The number of expected features in the input x
        hidden_size: The number of features in the hidden state h
        num_layers: Number of recurrent layers.
        bias: If ``False``, then the layer does not use bias weights b_ih and b_hh.
            Default: ``True``
        batch_first: If ``True``, then the input and output tensors are provided
            as (batch, seq, feature)
        dropout: If non-zero, introduces a dropout layer on the outputs of each
            RNN layer except the last layer
        bidirectional: If ``True``, becomes a bidirectional RNN. Default: ``False``

    Inputs: input, (h_0, c_0)
        - **input** (seq_len, batch, input_size): tensor containing the features
          of the input sequence.
          The input can also be a packed variable length sequence.
          See :func:`torch.nn.utils.rnn.pack_padded_sequence` for details.
        - **h_0** (num_layers \* num_directions, batch, hidden_size): tensor
          containing the initial hidden state for each element in the batch.
        - **c_0** (num_layers \* num_directions, batch, hidden_size): tensor
          containing the initial cell state for each element in the batch.

          If (h_0, c_0) is not provided, both **h_0** and **c_0** default to zero.


    Outputs: output, (h_n, c_n)
        - **output** (seq_len, batch, hidden_size * num_directions): tensor
          containing the output features `(h_t)` from the last layer of the RNN,
          for each t. If a :class:`torch.nn.utils.rnn.PackedSequence` has been
          given as the input, the output will also be a packed sequence.
        - **h_n** (num_layers * num_directions, batch, hidden_size): tensor
          containing the hidden state for t=seq_len
        - **c_n** (num_layers * num_directions, batch, hidden_size): tensor
          containing the cell state for t=seq_len

    Attributes:
        weight_ih_l[k] : the learnable input-hidden weights of the k-th layer
            `(W_ii|W_if|W_ig|W_io)`, of shape `(4*hidden_size x input_size)`
        weight_hh_l[k] : the learnable hidden-hidden weights of the k-th layer
            `(W_hi|W_hf|W_hg|W_ho)`, of shape `(4*hidden_size x hidden_size)`
        bias_ih_l[k] : the learnable input-hidden bias of the k-th layer
            `(b_ii|b_if|b_ig|b_io)`, of shape `(4*hidden_size)`
        bias_hh_l[k] : the learnable hidden-hidden bias of the k-th layer
            `(b_hi|b_hf|b_hg|b_ho)`, of shape `(4*hidden_size)`

    Examples::

        >>> rnn = nn.LSTM(10, 20, 2)
        >>> input = Variable(torch.randn(5, 3, 10))
        >>> h0 = Variable(torch.randn(2, 3, 20))
        >>> c0 = Variable(torch.randn(2, 3, 20))
        >>> output, hn = rnn(input, (h0, c0))
    """

In [68]:
class CharSeqStatefulLSTM(nn.Module):
    def __init__(self, vocab_size, n_fac, bs, nl):
        super().__init__()
        self.vocab_size,self.nl = vocab_size,nl
        self.e = nn.Embedding(vocab_size, n_fac)
        # dropout allows bs to increase to 512
        self.rnn = nn.LSTM(n_fac, n_hidden, nl, dropout=0.5)
        self.l_out = nn.Linear(n_hidden, vocab_size)
        self.init_hidden(bs)
        
    def forward(self, cs):
        bs = cs[0].size(0)
        if self.h[0].size(1) != bs: self.init_hidden(bs)
        outp,h = self.rnn(self.e(cs), self.h)
        self.h = repackage_var(h)
        return F.log_softmax(self.l_out(outp), dim=-1).view(-1, self.vocab_size)
    
    def init_hidden(self, bs):
        self.h = (V(torch.zeros(self.nl, bs, n_hidden)),
                  V(torch.zeros(self.nl, bs, n_hidden)))

Note that an LSTM needs two zero tensors for its initial state, one for the hidden state `h_0` and one for the cell state `c_0`:

In [44]:
def init_hidden(self, bs):
    self.h = (V(torch.zeros(self.nl, bs, n_hidden)),   # h_0: initial hidden state
              V(torch.zeros(self.nl, bs, n_hidden)))   # c_0: initial cell state
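
To see why there are two state tensors, here is a minimal sketch of one LSTM step following the equations in the docstring above, reduced to scalar input and state. The function `lstm_step` and the weight values are made up for this illustration, and the two bias terms per gate are folded into a single `b` for brevity.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step; w maps gate name -> (w_x, w_h, b)."""
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre('i'))      # input gate
    f = sigmoid(pre('f'))      # forget gate
    g = math.tanh(pre('g'))    # cell candidate
    o = sigmoid(pre('o'))      # output gate
    c = f * c_prev + i * g     # new cell state
    h = o * math.tanh(c)       # new hidden state
    return h, c

# Made-up weights, for illustration only
w = {'i': (0.5, 0.1, 0.0), 'f': (0.3, 0.2, 1.0),
     'g': (0.8, -0.4, 0.0), 'o': (0.6, 0.1, 0.0)}

h, c = 0.0, 0.0                # both states start at zero, as in init_hidden()
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, w)
    print(h, c)
```

Each step consumes and returns both `h` and `c`, so both have to be initialized, carried between batches, and repackaged.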

#### Callbacks

Using `LayerOptimizer` lets you pass callbacks to fit() that adjust the learning rate. If you just use `optim.Adam(m.parameters(), lr)` you get the plain optimizer. To use SGDR, discriminative learning rates, or the LR finder, you must use `LayerOptimizer()`, which creates an object that callbacks such as SGDR (`CosAnneal`) can work with.

In [69]:
m = CharSeqStatefulLSTM(md.nt, n_fac, 512, 2).cuda()
lo = LayerOptimizer(optim.Adam, m, 1e-2, 1e-5) 

In [70]:
os.makedirs(f'{PATH}models', exist_ok=True)

In [71]:
fit(m, md, 2, lo.opt, F.nll_loss) # note that lo.opt is passed in as the optimizer

HBox(children=(IntProgress(value=0, description='Epoch', max=2), HTML(value='')))

epoch      trn_loss   val_loss                               
    0      1.795547   1.731401  
    1      1.696436   1.645172                               



[array([1.64517])]

The callback used here, `CosAnneal`, requires a layer optimizer object, which is *lo*.  
On each training iteration `CosAnneal` computes a new learning rate with `calc_lr()` and sets it on *lo*, so the learning rate follows a cosine curve within each cycle.

In [None]:
Init signature: CosAnneal(layer_opt, nb, on_cycle_end=None, cycle_mult=1)
Source:        
class CosAnneal(LR_Updater):
    ''' Learning rate scheduler that inpelements a cosine annealation schedule. '''
    def __init__(self, layer_opt, nb, on_cycle_end=None, cycle_mult=1):
        self.nb,self.on_cycle_end,self.cycle_mult = nb,on_cycle_end,cycle_mult
        super().__init__(layer_opt)

    def on_train_begin(self):
        self.cycle_iter,self.cycle_count=0,0
        super().on_train_begin()

    def calc_lr(self, init_lrs):
        if self.iteration<self.nb/20:
            self.cycle_iter += 1
            return init_lrs/100.

        cos_out = np.cos(np.pi*(self.cycle_iter)/self.nb) + 1
        self.cycle_iter += 1
        if self.cycle_iter==self.nb:
            self.cycle_iter = 0
            self.nb *= self.cycle_mult
            if self.on_cycle_end: self.on_cycle_end(self, self.cycle_count)
            self.cycle_count += 1
        return init_lrs / 2 * cos_out
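
The schedule that `calc_lr()` produces can be sketched without fastai. The function below is a simplified stand-in written for this note: it uses a single scalar learning rate and leaves out the warm-up branch that returns `init_lrs/100.` during the first `nb/20` iterations.

```python
import math

def cosine_schedule(init_lr, nb, n_iters, cycle_mult=1):
    """Learning rate per iteration; nb is the length of the first cycle in iterations."""
    lrs, cycle_iter = [], 0
    for _ in range(n_iters):
        cos_out = math.cos(math.pi * cycle_iter / nb) + 1
        lrs.append(init_lr / 2 * cos_out)
        cycle_iter += 1
        if cycle_iter == nb:       # end of cycle: restart, with a longer cycle
            cycle_iter = 0
            nb *= cycle_mult
    return lrs

# First cycle of 4 iterations, then a cycle of 8 (cycle_mult=2)
lrs = cosine_schedule(0.01, nb=4, n_iters=12, cycle_mult=2)
for it, lr in enumerate(lrs):
    print(it, round(lr, 5))
```

Within a cycle the rate falls from `init_lr` towards zero along half a cosine wave, then jumps back to `init_lr` when the next cycle starts.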

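#### The `on_cycle_end` callback
Note that the lambda must accept two arguments, `sched` and `cycle`. `CosAnneal.calc_lr()` calls `self.on_cycle_end(self, self.cycle_count)`, passing the scheduler itself as the first argument, so a one-argument version such as  
`on_end = lambda cycle: save_model(m, f'{PATH}models/cyc_{cycle}')`  
fails at the end of the first cycle. `sched` is simply unused here; the callback only saves a snapshot of the model when a cycle finishes.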

In [85]:
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}') # save a snapshot of the model at the end of each cycle
cb = [CosAnneal(lo, # the layer optimizer, which lets the callback change the learning rate
                len(md.trn_dl), # nb: number of batches in one epoch = length of the first cycle
                cycle_mult=2, # each cycle is twice as long as the previous one
                on_cycle_end=on_end)] # optional: called at the end of every cycle
fit(m, md, 2**4-1, lo.opt, F.nll_loss, callbacks=cb) # 15 epochs = cycles of 1 + 2 + 4 + 8 epochs

HBox(children=(IntProgress(value=0, description='Epoch', max=15), HTML(value='')))

epoch      trn_loss   val_loss                               
    0      1.508043   1.478239  
    1      1.516212   1.478237                               
    2      1.509087   1.477966                               
    3      1.5136     1.47826                                
    4      1.514901   1.478269                               
    5      1.511054   1.478272                               
    6      1.516114   1.477989                               
    7      1.514396   1.478229                               
    8      1.516596   1.478221                               
    9      1.51395    1.478262                               
    10     1.509833   1.478262                               
    11     1.516289   1.478503                               
    12     1.51282    1.478275                               
    13     1.515748   1.478236                               
    14     1.512431   1.478237                               



[array([1.47824])]
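
The epoch count `2**4-1` is chosen so that training ends exactly at a cycle boundary. A small sketch, assuming the first cycle is one epoch long and each cycle doubles (`cycle_mult=2`); `cycle_lengths` is a helper made up for this note:

```python
def cycle_lengths(n_cycles, first_len=1, cycle_mult=2):
    """Length in epochs of each SGDR cycle."""
    lengths, length = [], first_len
    for _ in range(n_cycles):
        lengths.append(length)
        length *= cycle_mult
    return lengths

print(cycle_lengths(4), sum(cycle_lengths(4)))   # 4 cycles -> 2**4 - 1 epochs
print(cycle_lengths(6), sum(cycle_lengths(6)))   # 6 cycles -> 2**6 - 1 epochs
```

With any other epoch count the run would stop part-way through a cycle, while the learning rate is still high.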

In [214]:
on_end = lambda sched, cycle: save_model(m, f'{PATH}models/cyc_{cycle}')
cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
fit(m, md, 2**6-1, lo.opt, F.nll_loss, callbacks=cb)

HBox(children=(IntProgress(value=0, description='Epoch', max=63), HTML(value='')))

epoch      trn_loss   val_loss                               
    0      1.269275   1.338181  
    1      1.267399   1.336441                               
    2      1.264895   1.336206                               
    3      1.270371   1.335448                               
    4      1.257494   1.333378                               
    5      1.256948   1.332566                               
    6      1.250793   1.332374                               
    7      1.252318   1.332538                               
    8      1.248725   1.330108                               
    9      1.24685    1.329169                               
    10     1.233719   1.3269                                 
    11     1.235617   1.327991                               
    12     1.232613   1.326995                               
    13     1.227421   1.326652                               
    14     1.220362   1.326627                               
    15     1.229938   1.328776       

[array([1.35933])]

### Test

In [80]:
def get_next(inp):
    idxs = TEXT.numericalize(inp)
    p = m(VV(idxs.transpose(0,1)))
    r = torch.multinomial(p[-1].exp(), 1)
    return TEXT.vocab.itos[to_np(r)[0]]

In [81]:
get_next('for thos')

'e'
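
The model returns log-probabilities (`F.log_softmax`), so `get_next` exponentiates the last row and samples one index from it rather than always taking the most likely character. A minimal stdlib sketch of that sampling step, with a made-up three-character vocabulary and made-up probabilities:

```python
import math
import random

def sample_next(log_probs, itos, rng=random):
    """Sample one character from a list of log-probabilities."""
    probs = [math.exp(lp) for lp in log_probs]                  # undo the log
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return itos[idx]

itos = ['a', 'e', 'o']                                          # hypothetical vocabulary
log_probs = [math.log(0.1), math.log(0.8), math.log(0.1)]
rng = random.Random(0)                                          # seeded for repeatability
samples = [sample_next(log_probs, itos, rng) for _ in range(1000)]
print({ch: samples.count(ch) for ch in itos})
```

Sampling is what keeps the generated text from looping: the most likely character is picked most often, but not every time.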

In [82]:
def get_next_n(inp, n):
    res = inp
    for i in range(n):
        c = get_next(inp)
        res += c
        inp = inp[1:]+c
    return res

In [83]:
print(get_next_n('for thos', 400))

for those not reealsimace; sacrification, for a spirits as power, is precogning divopicatedly required aloneas we were learn: i spirits. the stoods that our so findoming free,there arount forin just in thingand defor is for the who west" and called too moralal being)?224. a delight! at the all because of other the biddest of a personal, andseem the miknows." it work of raseciraces and ctol--largid in this
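
`get_next_n` feeds the model a sliding window: after each prediction the oldest character is dropped and the new one is appended, so the input length never changes. A sketch of the same loop with a trivial stand-in for the model (`next_letter` is hypothetical and simply returns the next letter of the alphabet):

```python
def generate(seed, n, predict_next):
    """Append n predicted characters to seed, using a fixed-length sliding window."""
    res = seed
    window = seed
    for _ in range(n):
        c = predict_next(window)
        res += c
        window = window[1:] + c    # drop the oldest character, append the new one
    return res

def next_letter(window):
    # Hypothetical stand-in for the trained model
    return chr((ord(window[-1]) - ord('a') + 1) % 26 + ord('a'))

print(generate('abc', 5, next_letter))   # prints abcdefgh
```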
