# Character-Level LSTM in PyTorch
The network trains on a text character by character, then generates new text one character at a time.  
This tutorial is based on [Andrej Karpathy's post on RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

In [41]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

## Load in Data

In [42]:
# open text file and read in data as `text`
# the file is opened in text-reading mode (mode='r')
with open('data/anna.txt', 'r') as f:
    text = f.read()

Let's check the first 100 characters.

In [22]:
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

## Tokenization
In the next cells, I create a pair of dictionaries to convert characters to integers and back.

In [43]:
# encode the text and map each character to an integer and vice versa

# we create two dictionaries:
# 1. int2char, which maps integers to characters
# 2. char2int, which maps characters to unique integers
chars = tuple(set(text))
int2char = dict(enumerate(chars))
# int2char.items() returns the items in the dictionary
# we create a dictionary with the keys equal to the characters and the value equal to the encoding
char2int = {ch: ii for ii, ch in int2char.items()}

# encode the text
encoded = np.array([char2int[ch] for ch in text])

[10 33 19 ... 81  5 54]
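
The two dictionaries are inverses of each other, which can be checked with a round trip on a toy string. This is a minimal sketch: the `sample_*` names and the text `"hello world"` are made up for illustration, and the characters are sorted only so that the mapping is reproducible (the notebook uses the arbitrary order of `set`).

```python
import numpy as np

sample_text = "hello world"                      # toy text, not the novel
sample_chars = tuple(sorted(set(sample_text)))   # sorted only for reproducibility
sample_int2char = dict(enumerate(sample_chars))
sample_char2int = {ch: ii for ii, ch in sample_int2char.items()}

# encode the text to integers, then decode it back to characters
sample_encoded = np.array([sample_char2int[ch] for ch in sample_text])
sample_decoded = ''.join(sample_int2char[int(ii)] for ii in sample_encoded)

print(sample_encoded)
print(sample_decoded)
```

Decoding the encoded array gives back the original string, so no information is lost by the encoding.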


## Pre-processing the Data
Our LSTM expects **one-hot encoded** input: each character is first converted to an integer (using the dictionary created above) and then to a vector whose length equals the number of distinct characters, with a 1 at the index of that integer and 0 everywhere else.

In [44]:
def one_hot_encode(arr, n_labels):
    
    # Initialize the encoded array with zeros.
    # Rows: one per element of `arr` (every character to encode).
    # Columns: one per distinct label.
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    
    # Fill the appropriate elements with ones:
    # for every row (character), set to 1 the column whose index
    # is the integer associated with that character
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    # Finally reshape so the result has the shape of `arr`
    # plus a trailing dimension of size n_labels
    one_hot = one_hot.reshape((*arr.shape, n_labels))
    return one_hot

In [45]:
one_hot_enc_data = one_hot_encode(encoded, len(char2int))
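
To see what the encoding produces, here is a small sketch on a toy array. The function body is repeated only so that the snippet runs on its own; the input `[[3, 5, 1]]` and the vocabulary size of 8 are assumptions chosen for illustration.

```python
import numpy as np

def one_hot_encode(arr, n_labels):
    # Same logic as the function above, repeated so this snippet is self-contained
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    return one_hot.reshape((*arr.shape, n_labels))

toy = np.array([[3, 5, 1]])             # one sequence of three encoded characters
toy_one_hot = one_hot_encode(toy, 8)    # pretend the vocabulary has 8 characters

print(toy_one_hot.shape)                # shape of `toy` plus a trailing axis of size 8
print(toy_one_hot)

# argmax over the last axis recovers the original integers
recovered = toy_one_hot.argmax(axis=-1)
```

Each row contains exactly one 1, so taking the `argmax` along the last axis inverts the encoding.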


## Making Training Mini-batches
To train on our data we need to create mini-batches.  
Starting from the original sequence of data, we have to choose two things:  
1. the number of sequences in a batch (`batch_size`)  
2. the length of each sequence fed to the network (`seq_length`, the number of steps in a sequence)  
  
We take the sequence of encoded characters and split it into `batch_size` rows.  
Then we feed the network `seq_length` columns at a time.  
  
1. We discard some of the text so that every batch is completely full.  
Each batch contains $N \times M$ characters, where $N$ is the `batch_size` and $M$ is the `seq_length`.  
To get the number of batches, $K$, we divide the length of `arr` (the number of elements in the original sequence) by the number of characters per batch, rounding down.  
The total number of characters to keep from `arr` is therefore $N \times M \times K$.  
  
2. We split the sequence `arr` into $N$ rows.  
You can do this with `arr.reshape(size)`, where `size` is a tuple containing the dimensions of the reshaped array.  
We want $N$ sequences in a batch, so that is the size of the first dimension. For the second dimension you can use -1 as a placeholder, and NumPy infers the right size. After this, the array has shape $N \times (M \cdot K)$.  
  
3. We iterate over the array to get the mini-batches.  
Each batch is an $N \times M$ window on the $N \times (M \cdot K)$ array (our reshaped dataset).  
For each subsequent batch, the window moves to the right by `seq_length`. We also want both the input and the target arrays; the targets are just the inputs shifted over by one character. The way I like to build this window is to use `range` with steps of size `seq_length` from `0` to `arr.shape[1]`, the total number of tokens in each row. That way, the integers from `range` always point to the start of a batch, and each window is `seq_length` wide.
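
The arithmetic of the three steps above can be checked on a toy array. This is a minimal sketch: the 26-element array and the values `batch_size = 2`, `seq_length = 3` are made up for illustration, and only the input windows are built (no targets).

```python
import numpy as np

toy_arr = np.arange(26)              # toy "encoded text" of 26 tokens
batch_size, seq_length = 2, 3        # N and M

# Step 1: number of full batches K, and truncation to N*M*K tokens
chars_per_batch = batch_size * seq_length
n_batches = len(toy_arr) // chars_per_batch
toy_arr = toy_arr[:n_batches * chars_per_batch]

# Step 2: reshape into N rows; -1 lets NumPy infer the M*K columns
toy_arr = toy_arr.reshape((batch_size, -1))

# Step 3: slide an N x M window over the columns
windows = [toy_arr[:, n:n + seq_length]
           for n in range(0, toy_arr.shape[1], seq_length)]

print(n_batches, toy_arr.shape)
print(windows[0])
```

Two tokens are discarded, the reshaped array has 2 rows of 12 tokens, and each of the 4 windows is a 2 x 3 slice of it.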

In [46]:
def get_batches(arr, batch_size, seq_length):
    '''Create a generator that returns batches of size
       batch_size x seq_length from arr.
       
       Arguments
       ---------
       arr: Array you want to make batches from
       batch_size: Batch size, the number of sequences per batch
       seq_length: Number of encoded chars in a sequence
    '''
    
    # Number of characters in one batch
    batch_size_total = batch_size*seq_length
    # Number of full batches we can make (// is floor division,
    # so a trailing partial batch is dropped)
    n_batches = len(arr)//batch_size_total
    
    # Keep only enough characters to make full batches
    arr = arr[:n_batches*batch_size_total]
    
    # Reshape into batch_size rows
    arr = arr.reshape((batch_size, -1))

    # Iterate over the batches using a window of size seq_length
    for n in range(0, arr.shape[1], seq_length):
        # The features
        x = arr[:, n:n+seq_length]
        # The targets, shifted by one
        y = np.zeros_like(x)
        try:
            # y is x moved one step forward: all but the last column come
            # from x[:, 1:], and the last column is the character that
            # follows the window in arr
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n+seq_length]
        except IndexError:
            # The last window has no following column, so wrap around
            # to the first column of arr
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y

In [47]:
batches = get_batches(encoded, 8, 50)
x, y = next(batches)

In [48]:
# print the first 10 columns of each sequence
print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])

x
 [[75  0 24 67 38 18 59 63 69 52]
 [48  4 43 63 38  0 24 38 63 24]
 [18 43 73 63  4 59 63 24 63 29]
 [48 63 38  0 18 63 72  0 60 18]
 [63 48 24 45 63  0 18 59 63 38]
 [72 66 48 48 60  4 43 63 24 43]
 [63 70 43 43 24 63  0 24 73 63]
 [42 13 55  4 43 48 49 20  1 63]]

y
 [[ 0 24 67 38 18 59 63 69 52 52]
 [ 4 43 63 38  0 24 38 63 24 38]
 [43 73 63  4 59 63 24 63 29  4]
 [63 38  0 18 63 72  0 60 18 29]
 [48 24 45 63  0 18 59 63 38 18]
 [66 48 48 60  4 43 63 24 43 73]
 [70 43 43 24 63  0 24 73 63 48]
 [13 55  4 43 48 49 20  1 63 77]]


We can see that `y` is `x` shifted by one step.  
  
# Defining the network with PyTorch
We are going to use PyTorch to define the architecture of the network. We define the layers and the operations we want to do.  
Then, we define a function for the forward pass.  
  
## Model structure
In `__init__` we build the structure as follows:  
1. Create and store the necessary dictionaries  
2. Define the LSTM layer, which takes as parameters:  
- an input size (the number of characters)  
- the number of hidden units per layer (`n_hidden`)  
- a number of layers (`n_layers`)  
- a dropout probability (`drop_prob`)
- `batch_first=True`, because the batch is the first dimension of our input
3. Define a dropout layer with `drop_prob`
4. Define a fully-connected layer with parameters:  
- an input size equal to the number of hidden units in an LSTM layer (`n_hidden`)
- an output size equal to the number of characters to predict
5. Initialize the weights (the code below relies on PyTorch's default initialization)

## LSTM Inputs/Outputs
You can create the basic [LSTM layer](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM) as follows:  
  
```python
self.lstm = nn.LSTM(input_size, n_hidden, n_layers,
                dropout = drop_prob, batch_first=True)
```
- `input_size` is the number of features per time step, here the number of distinct characters  
- `n_hidden` is the number of hidden units in each LSTM layer  
  
In the forward pass, we stack the LSTM outputs with `.view`, so that every time step of every sequence becomes one row.  
The fully-connected layer is then applied to all of these rows at once.
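
The tensor shapes involved can be inspected with a small stand-alone layer. This is a sketch with made-up sizes (4 sequences, 7 steps, 10 characters, 16 hidden units), not the network trained below.

```python
import torch
from torch import nn

batch_size, seq_length, n_chars, n_hidden, n_layers = 4, 7, 10, 16, 2
lstm = nn.LSTM(n_chars, n_hidden, n_layers, dropout=0.5, batch_first=True)

# With batch_first=True the input is (batch, seq, features)
x = torch.zeros(batch_size, seq_length, n_chars)
# The hidden and cell states are always (n_layers, batch, n_hidden)
h0 = torch.zeros(n_layers, batch_size, n_hidden)
c0 = torch.zeros(n_layers, batch_size, n_hidden)

out, (hn, cn) = lstm(x, (h0, c0))
print(out.shape)      # one n_hidden vector per step of each sequence
print(hn.shape)

# Stack all steps of all sequences as rows for the fully-connected layer
stacked = out.contiguous().view(-1, n_hidden)
print(stacked.shape)
```

Note that `batch_first=True` only affects the input and output tensors; the hidden and cell states keep the layer dimension first.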

In [49]:
# check if GPU is available
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
    print('Training on GPU!')
else: 
    print('No GPU available, training on CPU; consider making n_epochs very small.')

No GPU available, training on CPU; consider making n_epochs very small.


In [50]:
# `tokens` is the collection of distinct characters in our vocabulary;
# it is already deduplicated, so the class does not call `set` on it
class CharRNN(nn.Module):
    
    def __init__(self, tokens, n_hidden=256, n_layers=2,
                               drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers # number of hidden layers
        self.n_hidden = n_hidden # number of hidden units per layer
        self.lr = lr
        
        # creating character dictionaries
        # `tokens` holds the distinct characters of the vocabulary
        self.chars = tokens
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: ii for ii, ch in self.int2char.items()}
        
        # define the LSTM
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        # define a dropout layer
        self.dropout = nn.Dropout(drop_prob)
        
        # define the final, fully-connected output layer
        self.fc = nn.Linear(n_hidden, len(self.chars))
      
    
    def forward(self, x, hidden):
        ''' Forward pass through the network. 
            These inputs are x, and the hidden/cell state `hidden`. '''
                
        # Get the outputs and the new hidden state from the lstm
        r_output, hidden = self.lstm(x, hidden)
        
        # Pass the LSTM output (not the hidden state) through a dropout layer
        out = self.dropout(r_output)
        
        # Stack up LSTM outputs using view
        # contiguous() is needed before view when the tensor is not contiguous in memory
        # we keep n_hidden columns and stack every time step of every sequence as a row
        out = out.contiguous().view(-1, self.n_hidden)
        
        # put the stacked output through the fully-connected layer
        out = self.fc(out)
        
        # return the final output and the hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                      weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
        
        return hidden
        

## Time to train
 
The `train` function lets us set:  
1. the number of epochs  
2. the learning rate  
3. other parameters (batch size, sequence length, gradient clipping, validation fraction)  

- Within the *batch loop*, we *detach the hidden state from its history*, setting it equal to a new tuple because the hidden state of an LSTM is a tuple of the hidden and cell states.  
- We use `clip_grad_norm_` to help prevent exploding gradients.
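
Both points can be seen in isolation on a tiny LSTM. This is a sketch under assumptions: the layer sizes and the artificially inflated loss are made up to force large gradients, and it uses `.detach()`, the currently recommended way to cut the history, where the training loop below uses `.data`.

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=3, hidden_size=4, batch_first=True)
h = (torch.zeros(1, 2, 4), torch.zeros(1, 2, 4))

# After a forward pass the hidden state is linked to the graph of that batch
out, h = lstm(torch.randn(2, 5, 3), h)
attached = h[0].requires_grad

# Detaching keeps the values but drops the history
h = tuple(each.detach() for each in h)
detached = h[0].requires_grad

# A deliberately huge loss to produce large gradients
out, h = lstm(torch.randn(2, 5, 3), h)
loss = (out * 1000).pow(2).sum()
loss.backward()

clip = 5.0
# clip_grad_norm_ rescales the gradients in place and returns the norm before clipping
norm_before = nn.utils.clip_grad_norm_(lstm.parameters(), clip)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in lstm.parameters()))

print(attached, detached)
print(float(norm_before), float(norm_after))
```

After clipping, the total gradient norm is at most `clip`, so a single bad batch cannot produce an arbitrarily large update.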

In [51]:
def train(net, data, epochs=10, batch_size=10, seq_length=50, lr=0.001, clip=5, val_frac=0.1, print_every=10):
    ''' Training a network 
    
        Arguments
        ---------
        
        net: CharRNN network
        data: text data to train the network
        epochs: Number of epochs to train
        batch_size: Number of mini-sequences per mini-batch, aka batch size
        seq_length: Number of character steps per mini-batch
        lr: learning rate
        clip: gradient clipping
        val_frac: Fraction of data to hold out for validation
        print_every: Number of steps for printing training and validation loss
    
    '''
    net.train()
    
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    
    # create training and validation data
    val_idx = int(len(data)*(1-val_frac))
    # here we have the data to train and the data to validate the model
    data, val_data = data[:val_idx], data[val_idx:]
    
    if(train_on_gpu):
        net.cuda()
    
    counter = 0
    n_chars = len(net.chars)
    for e in range(epochs):
        # initialize hidden state
        h = net.init_hidden(batch_size)
        
        for x, y in get_batches(data, batch_size, seq_length):
            counter += 1
            
            # One-hot encode our data and make them Torch tensors
            x = one_hot_encode(x, n_chars)
            inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
            
            if(train_on_gpu):# we pass the tensors on the cuda memory
                inputs, targets = inputs.cuda(), targets.cuda()

            # Creating new variables for the hidden state, otherwise
            # we'd backprop through the entire training history
            h = tuple([each.data for each in h])

            # zero accumulated gradients
            net.zero_grad()
            
            # get the output from the model
            output, h = net(inputs, h)
            
            # calculate the loss and perform backprop
            # flatten the targets to one label per row of `output`
            loss = criterion(output, targets.reshape(-1).long())
            loss.backward()
            # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            opt.step()
            
            # loss stats
            if counter % print_every == 0:
                # Get validation loss
                val_h = net.init_hidden(batch_size)
                val_losses = []
                net.eval()
                for x, y in get_batches(val_data, batch_size, seq_length):
                    # One-hot encode our data and make them Torch tensors
                    x = one_hot_encode(x, n_chars)
                    x, y = torch.from_numpy(x), torch.from_numpy(y)
                    
                    # Creating new variables for the hidden state, otherwise
                    # we'd backprop through the entire training history
                    val_h = tuple([each.data for each in val_h])
                    
                    inputs, targets = x, y
                    if(train_on_gpu):
                        inputs, targets = inputs.cuda(), targets.cuda()

                    output, val_h = net(inputs, val_h)
                    # flatten the targets to one label per row of `output`
                    val_loss = criterion(output, targets.reshape(-1).long())
                
                    val_losses.append(val_loss.item())
                
                net.train() # reset to train mode after iterating through validation data
                
                print("Epoch: {}/{}...".format(e+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.4f}...".format(loss.item()),
                      "Val Loss: {:.4f}".format(np.mean(val_losses)))

## Instantiating the Model
Now we can actually train the network. First we create the network itself with some given hyperparameters. Then we define the mini-batch sizes and start training!

In [52]:
# define and print the net
n_hidden=512
n_layers=2

net = CharRNN(chars, n_hidden, n_layers)
print(net)

CharRNN(
  (lstm): LSTM(83, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=512, out_features=83, bias=True)
)


In [53]:
batch_size = 128
seq_length = 100
n_epochs = 2 # start smaller if you are just testing initial behavior

# train the model
train(net, encoded, epochs=n_epochs, batch_size=batch_size, seq_length=seq_length, lr=0.001, print_every=1)

Epoch: 1/2... Step: 1... Loss: 4.4125... Val Loss: 4.3721


KeyboardInterrupt: 

## Getting the best model
We want to watch the training and the validation losses. If the training loss is much lower than the validation loss, the model is overfitting.  
In that case, increase the regularization (more dropout) or use a smaller, less complex network. If the training and validation losses are about equal, the model is underfitting, so you can increase the size of the network.

### Hyperparameters
When defining the model:  
1. `n_hidden`: number of units in the hidden layers
2. `n_layers`: number of hidden LSTM layers to use
  
We assume the dropout probability and learning rate are held fixed.  
  
When training:  
1. `batch_size`: number of sequences running through the network in one pass  
2. `seq_length`: number of characters in each sequence the network is trained on. Larger is normally better, because the network can learn longer-range dependencies, but training takes longer; 100 works well in this context.  
3. `lr`: learning rate for training

### Tips and Tricks
The most important thing to watch is the difference between the training loss and the validation loss.  
The two most important parameters that control the model are `n_hidden` and `n_layers`. `n_layers` is usually left at a small fixed value (2 in this notebook), while `n_hidden` can be adjusted based on how much data you have.  
We have to track:  
1. The **number of parameters in the model**. PyTorch does not print this automatically, but it can be computed from `net.parameters()`.
2. The **size of your dataset**. A 1MB file is approximately 1 million characters.  

These two numbers should be roughly of the same order of magnitude.
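
The parameter count can be computed by summing the sizes of all parameter tensors. The sketch below rebuilds the two trainable layers with the sizes shown in the printed model (83 characters, 512 hidden units, 2 layers); the helper name `count_parameters` is hypothetical.

```python
from torch import nn

def count_parameters(*modules):
    # Hypothetical helper: total number of elements in all parameter tensors
    return sum(p.numel() for m in modules for p in m.parameters())

n_chars, n_hidden, n_layers = 83, 512, 2
lstm = nn.LSTM(n_chars, n_hidden, n_layers, dropout=0.5, batch_first=True)
fc = nn.Linear(n_hidden, n_chars)

n_params = count_parameters(lstm, fc)
print(n_params)
```

For the network itself, `sum(p.numel() for p in net.parameters())` gives the same kind of count, which can then be compared with the number of characters in the training text.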

#### Best models strategy
Use the largest model your computational budget allows. Then try different dropout values (0.1 to 0.5). Whichever model has the best validation performance is the one you should use.  
  
However, the sizes of the training and validation sets are hyperparameters too. Make sure you have enough data in your validation set.

## Checkpoint
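After training, we can save the model so we can load it later if we need to.  
Note that we also have to save the values that define the model architecture (`n_hidden`, `n_layers`, and the tokens), so the network can be rebuilt before its weights are loaded.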

In [None]:
# change the name, for saving multiple files
model_name = 'rnn_20_epoch.net'

checkpoint = {'n_hidden': net.n_hidden,
              'n_layers': net.n_layers,
              'state_dict': net.state_dict(),
              'tokens': net.chars}

with open(model_name, 'wb') as f: # we open the file for writing in binary mode
    torch.save(checkpoint, f)
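
Loading works in the reverse order: read the dictionary, rebuild the network from the saved hyperparameters, then load the weights. The sketch below shows the round trip with a hypothetical stand-in class `TinyNet` (so the snippet runs on its own) and an in-memory buffer instead of a file on disk; with the real model, `CharRNN` and the checkpoint file take their place.

```python
import io
import torch
from torch import nn

class TinyNet(nn.Module):
    """Hypothetical stand-in for CharRNN, small enough to run anywhere."""
    def __init__(self, tokens, n_hidden=8, n_layers=1):
        super().__init__()
        self.chars, self.n_hidden, self.n_layers = tokens, n_hidden, n_layers
        self.lstm = nn.LSTM(len(tokens), n_hidden, n_layers, batch_first=True)
        self.fc = nn.Linear(n_hidden, len(tokens))

net_small = TinyNet(tokens=('a', 'b', 'c'))
checkpoint = {'n_hidden': net_small.n_hidden,
              'n_layers': net_small.n_layers,
              'state_dict': net_small.state_dict(),
              'tokens': net_small.chars}

buffer = io.BytesIO()              # in-memory file instead of a path on disk
torch.save(checkpoint, buffer)
buffer.seek(0)

# Rebuild the architecture first, then load the weights into it
loaded = torch.load(buffer)
restored = TinyNet(loaded['tokens'],
                   n_hidden=loaded['n_hidden'],
                   n_layers=loaded['n_layers'])
restored.load_state_dict(loaded['state_dict'])
```

The `state_dict` only holds tensors, which is why the hyperparameters have to travel with it in the same dictionary.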

## Making predictions
  
Now that the model is trained, we want to sample from it and make predictions about next characters! To sample, we pass in a character and have the network predict the next character. Then we take that character, pass it back in, and get another predicted character. Repeating this generates a whole stretch of text.
  
### A note on the predict function
The output of our RNN comes from a fully-connected layer, which outputs a **distribution of next-character scores**.  
To get the next character, we apply a softmax function, which gives us a *probability distribution* that we can then sample from to predict the next character.

### Top k sampling
  
Our predictions come from a categorical probability distribution over all the possible characters. We can make the sampled text more reasonable by only considering the $K$ most probable characters. This prevents the network from giving us completely absurd characters while still allowing some noise and randomness in the sampled text.
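
The softmax and top-$K$ steps can be tried on a hand-made score vector. This is a minimal sketch: the five scores, `k = 2`, and the seeded random generator are assumptions for illustration.

```python
import numpy as np
import torch
import torch.nn.functional as F

scores = torch.tensor([[2.0, 0.5, -1.0, 3.0, 0.0]])   # toy scores for 5 characters
p = F.softmax(scores, dim=1)                           # probabilities that sum to 1

# Keep the 2 most probable characters and their probabilities
top_p, top_idx = p.topk(2)
top_p = top_p.numpy().squeeze().astype(np.float64)
top_idx = top_idx.numpy().squeeze()

# Renormalize the kept probabilities and sample one index from them
rng = np.random.default_rng(0)
choice = rng.choice(top_idx, p=top_p / top_p.sum())

print(top_idx, top_p / top_p.sum())
print(choice)
```

Only the two highest-scoring indices can ever be sampled, and among them the more probable one is still chosen more often.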

In [None]:
def predict(net, char, h=None, top_k=None):
        ''' Given a character, predict the next character.
            Returns the predicted character and the hidden state.
        '''
        
        # tensor inputs
        # we encode the character as an integer,
        # then one-hot encode it,
        # then convert it to a tensor
        x = np.array([[net.char2int[char]]])
        x = one_hot_encode(x, len(net.chars))
        inputs = torch.from_numpy(x)
        
        # we transfer the tensors to the gpu if
        # we have one
        if(train_on_gpu):
            inputs = inputs.cuda()
        
        # if no hidden state is given, start from a zero state
        if h is None:
            h = net.init_hidden(1)
        # detach hidden state from history:
        # h is a tuple (hidden state, cell state), and we rebuild it
        # from the data of each tensor
        h = tuple([each.data for each in h])
        # get the output of the model
        out, h = net(inputs, h)

        # get the character probabilities
        # by applying the softmax to the output
        p = F.softmax(out, dim=1).data
        # if the network is on the GPU,
        # we move the tensor of probabilities to the CPU
        if(train_on_gpu):
            p = p.cpu() # move to cpu
        
        # get top characters
        # if top_k is not set, we consider all the
        # characters
        if top_k is None:
            top_ch = np.arange(len(net.chars))
        else:
            # otherwise we keep only the top_k
            # most probable characters
            p, top_ch = p.topk(top_k)
            top_ch = top_ch.numpy().squeeze()
        
        # select the likely next character with some element of randomness
        # squeeze removes axes of length 1 from the numpy array
        p = p.numpy().squeeze()
        char = np.random.choice(top_ch, p=p/p.sum())
        
        # return the predicted character (decoded from its integer) and the hidden state
        return net.int2char[char], h

## Priming and generating text
Typically, we want to prime the network so that it can build up its hidden state.  
Otherwise, the network starts out generating characters at random.  
In general, the first few characters will be a little rough, since the network hasn't built up a long history of characters to predict from.

In [None]:
def sample(net, size, prime='The', top_k=None):
        
    if(train_on_gpu):
        net.cuda()
    else:
        net.cpu()
    
    net.eval() # eval mode
    
    # First off, run through the prime characters
    # we create a list from the prime string of characters
    chars = [ch for ch in prime]
    h = net.init_hidden(1)
    for ch in prime:
        char, h = predict(net, ch, h, top_k=top_k)

    # keep only the prediction that follows the last prime character
    chars.append(char)
    
    # Now pass in the previous character and get a new one
    for ii in range(size):
        # to use the index -1 means that we are taking the last character
        char, h = predict(net, chars[-1], h, top_k=top_k)
        chars.append(char)

    return ''.join(chars)

In [None]:
print(sample(net, 1000, prime='Anna', top_k=5))