## Language model with RNNs
This code is based on Andrej Karpathy's [char-rnn](https://github.com/karpathy/char-rnn).

In [2]:
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

The Shakespearean text is provided in ./tinyshakespeare.txt. We will start by loading the data and doing some preprocessing. If you are on Colab, you need to re-upload the file every time the runtime is killed or restarted.

In [3]:
# read the contents of the text file (the with-block closes the file handle)
with open('./tinyshakespeare.txt', 'r') as f:
    data = f.read()
print(data[:278]) # let's examine some text

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.



## Tokenization

The text looks good, but it cannot be processed by RNNs in its raw form. We first need to tokenize the data and convert it into a form suitable for RNNs. Since we focus on character-level language models in this exercise, we will treat each character as a token.

In [None]:
# Bonus: when you have finished, change for word-based processing

In [4]:
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print ('The text file has {} characters out of which {} are unique.'.format(data_size, vocab_size))
print (chars)

The text file has 1115394 characters out of which 65 are unique.
['w', '&', 'X', 'f', 'H', 'm', '-', 'd', 'L', ',', '$', 'n', '!', "'", '?', 'b', 'D', 'N', 'P', 'y', 'M', 'T', 'J', 'l', 'G', ';', '.', 'B', 'U', ' ', 'g', 'k', 'Q', 'i', 'F', 'V', 'K', 'z', 'u', 'r', 'a', 'v', 'E', '3', 't', 'x', 'W', 'e', 'p', 'j', 'C', 'c', 'A', 'Y', ':', 'R', 'O', 'o', 'S', 'Z', '\n', 'I', 'q', 's', 'h']


We have quite a big text with 65 unique characters. Now we need to associate each character with a unique id, which can then be converted into a one-hot vector and provided as input to the RNN.

In [None]:
# Create a dictionary mapping each character to a unique id and vice versa
char_to_ix = ...
ix_to_char = ...
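One possible way to build such a pair of mappings is sketched below on a toy string rather than the Shakespeare text. Sorting the characters is an assumption made here so that ids are reproducible across runs; the notebook itself does not require it.

```python
# Toy text standing in for the corpus (illustration only).
text = "hello world"

# Sorting makes the id assignment deterministic across runs.
toy_chars = sorted(set(text))

# Forward mapping (character -> id) and its inverse (id -> character).
toy_char_to_ix = {ch: i for i, ch in enumerate(toy_chars)}
toy_ix_to_char = {i: ch for ch, i in toy_char_to_ix.items()}

# Round trip: encode the text to ids, then decode it back.
encoded = [toy_char_to_ix[ch] for ch in text]
decoded = "".join(toy_ix_to_char[i] for i in encoded)
print(encoded)
print(decoded)
```

Because the two dictionaries are exact inverses, decoding the encoded ids recovers the original string.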

Now that we have a unique id for each character, we can represent each character with a one-hot encoding.

In [None]:
def int_to_onehot(..., ...):
  return ... # hint: use torch.eye for faster computing
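As a rough illustration of the `torch.eye` hint, the sketch below indexes rows of an identity matrix. The function name `int_to_onehot_sketch` and its argument names are hypothetical and need not match the signature expected in the exercise.

```python
import torch

def int_to_onehot_sketch(indices, num_classes):
    # Row i of the identity matrix is the one-hot vector for class i,
    # so indexing with a tensor of ids returns one row per id.
    return torch.eye(num_classes)[indices]

ids = torch.tensor([2, 0, 1])
onehot = int_to_onehot_sketch(ids, num_classes=4)
print(onehot)
```

Building the identity matrix once and reusing it avoids allocating and filling a fresh zero vector for every character.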

In the last exercise on classification with MNIST and CIFAR, ground-truth labels were provided explicitly with each instance of the dataset. Our text dataset has no explicit ground-truth labels, but a character-level language model predicts the next character. In essence, the text is its own ground truth: for each character, the next character acts as the label. This will be important when loading the data for training.
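To make the "text is its own label" idea concrete, here is a small sketch on a toy string: the targets are simply the inputs shifted by one position.

```python
text = "hello"

# Inputs are all characters except the last one;
# targets are the same text shifted one step to the left.
inputs = text[:-1]
targets = text[1:]

pairs = list(zip(inputs, targets))
print(pairs)  # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
```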

PyTorch's DataLoader takes an object of the built-in [Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) class. So, we first need to convert our dataset into this form by inheriting from the Dataset class.

In [None]:
#########################################################################
# TO-DO: Implement the __getitem__ of the Shakespeare class. 
# Important points: __getitem__ is called at each training iteration
# So, we need to return the data and ground-truth label. The data is in
# the form of one-hot vectors and ground-truth is the index of next char
# Our RNN operates on an input sequence of a specified length (seq_length)
# so we need to return a sequence of one-hot vector and the indices of
# their corresponding next character
#########################################################################

class Shakespeare(Dataset):
    
    def __init__(self, text_data, seq_length):
        super().__init__()
        
        self.seq_length = seq_length
        self.data = text_data
        self.data_size = len(text_data)  # use the text passed in, not the global
        self.chars = chars
        self.vocab_size = vocab_size
        
        self.char_to_ix = char_to_ix
        self.ix_to_char = ix_to_char
        
        # hint: preprocess data for faster loading

    def __len__(self):
        return self.data_size - self.seq_length - 1
    
    def __getitem__(self, index):
        return ...
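For orientation, a self-contained sketch of such a dataset on a toy sentence follows. The class name `ToyCharDataset` and its attributes are hypothetical; it is one way to follow the preprocessing hint (encode once, slice later), not necessarily the intended solution.

```python
import torch
from torch.utils.data import Dataset

class ToyCharDataset(Dataset):  # hypothetical name, illustration only
    def __init__(self, text, seq_length):
        super().__init__()
        chars = sorted(set(text))
        self.char_to_ix = {ch: i for i, ch in enumerate(chars)}
        self.vocab_size = len(chars)
        self.seq_length = seq_length
        # Encode the whole text once so __getitem__ only slices tensors.
        self.ids = torch.tensor([self.char_to_ix[ch] for ch in text], dtype=torch.long)
        self.eye = torch.eye(self.vocab_size)

    def __len__(self):
        return len(self.ids) - self.seq_length - 1

    def __getitem__(self, index):
        inputs = self.ids[index : index + self.seq_length]
        # Targets are the inputs shifted by one character.
        targets = self.ids[index + 1 : index + self.seq_length + 1]
        return self.eye[inputs], targets

toy = ToyCharDataset("to be or not to be", seq_length=5)
x, y = toy[0]
print(x.shape, y.shape)  # torch.Size([5, 7]) torch.Size([5])
```

Each item is a pair of a `(seq_length, vocab_size)` one-hot tensor and a `(seq_length,)` tensor of next-character ids.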

Now that we have defined our dataset, we can instantiate it and build our dataloader.

In [None]:
seq_length = 20
batch_size = 100

dataset = Shakespeare(data, seq_length)
vocab_size = dataset.vocab_size

###########################################################
# Q: do we need to shuffle the dataset? What is drop_last?
###########################################################

dataloader = DataLoader(dataset, batch_size=batch_size, num_workers=2, shuffle=..., drop_last=...)
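To see what `drop_last` does in isolation, the sketch below uses a toy `TensorDataset` of 10 items (an assumption for illustration) and compares the batch sizes that come out of the loader.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy = TensorDataset(torch.arange(10))

# With 10 items and batch_size=4, the final batch holds only 2 items.
keep = [batch[0].shape[0] for batch in DataLoader(toy, batch_size=4, drop_last=False)]
drop = [batch[0].shape[0] for batch in DataLoader(toy, batch_size=4, drop_last=True)]

print(keep)  # [4, 4, 2]
print(drop)  # [4, 4]
```

This matters whenever other tensors are allocated with a fixed `batch_size`, as the hidden state in the training function of this notebook is.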

Next, we need to define our RNN model before training. For this, we will implement the RNN Cell discussed in lecture 8 slide 6. Note that the RNN Cell operates on a single timestep. So, the RNN Cell will take a single timestep token as input and produce output and hidden state for that particular timestep only.

In [None]:
#########################################################################
# TO-DO: Implement the __init__ and forward methods of the RNNCell class. 
# Refer to the equations in the lecture and implement the same here
# The forward method should return the output and the hidden state
#########################################################################

class RNNCell(nn.Module):
    
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        ...
        ...
        ...
        
    def forward(self, input_emb, hidden_state):
        ...
        ...
        return output, hidden_state
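The lecture equations are not reproduced in this notebook, so the sketch below assumes a standard Elman-style cell, h_t = tanh(W_xh x_t + W_hh h_{t-1} + b) with a linear readout y_t = W_hy h_t + b_y. The class name `ElmanCellSketch` and the layer names are hypothetical; check them against the lecture before reusing anything.

```python
import torch
import torch.nn as nn

class ElmanCellSketch(nn.Module):  # hypothetical name, assumes Elman-style equations
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.i2h = nn.Linear(vocab_size, hidden_size)               # W_xh and bias
        self.h2h = nn.Linear(hidden_size, hidden_size, bias=False)  # W_hh
        self.h2o = nn.Linear(hidden_size, vocab_size)               # W_hy and bias

    def forward(self, input_emb, hidden_state):
        # New hidden state mixes the current input with the previous state.
        hidden_state = torch.tanh(self.i2h(input_emb) + self.h2h(hidden_state))
        # Output is a vector of unnormalized scores (logits) over the vocabulary.
        output = self.h2o(hidden_state)
        return output, hidden_state

cell = ElmanCellSketch(vocab_size=65, hidden_size=16)
x = torch.eye(65)[torch.tensor([3, 10])]  # batch of two one-hot tokens
h0 = torch.zeros(2, 16)
logits, h1 = cell(x, h0)
print(logits.shape, h1.shape)  # torch.Size([2, 65]) torch.Size([2, 16])
```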

Since we have a sequence of tokens as input, we will implement another model which uses this RNNCell to process multi-timestep sequence inputs. The RNN class takes a sequence of one-hot encodings of tokens as input and returns a sequence of outputs, one for each timestep.

In [None]:
#########################################################################
# TO-DO: Implement the forward method of the RNN class. 
# The RNN class takes in a sequence of one-hot encodings of tokens as input 
# and returns a sequence of outputs, one for each timestep.
# We also return the hidden state of the final timestep
# Q: Is it required to return the hidden state? If yes, why? If no, why not?
#########################################################################

class RNN(nn.Module):
    
    def __init__(self, seq_length, vocab_size, hidden_size):
        super().__init__()
        
        self.seq_length = seq_length
        self.hidden_size = hidden_size
        self.rnn_cell = RNNCell(vocab_size, hidden_size)
        
    def forward(self, input_seq, hidden_state):
        
        ...
        ...
        ...
        
        return outputs, hidden_state
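The unrolling over timesteps can be sketched as a plain Python loop. To keep the example self-contained it uses PyTorch's built-in `nn.RNNCell` plus a linear readout as a stand-in for the exercise's own cell, and it assumes inputs of shape `(batch, seq_length, vocab_size)`; the class name `UnrolledRNNSketch` is hypothetical.

```python
import torch
import torch.nn as nn

class UnrolledRNNSketch(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.cell = nn.RNNCell(vocab_size, hidden_size)  # built-in stand-in cell
        self.readout = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_seq, hidden_state):
        outputs = []
        # Feed one timestep at a time, carrying the hidden state forward.
        for t in range(input_seq.shape[1]):
            hidden_state = self.cell(input_seq[:, t], hidden_state)
            outputs.append(self.readout(hidden_state))
        # Stack per-timestep logits into (batch, seq_length, vocab_size).
        return torch.stack(outputs, dim=1), hidden_state

model = UnrolledRNNSketch(vocab_size=7, hidden_size=8)
seq = torch.eye(7)[torch.randint(0, 7, (4, 5))]  # (batch=4, seq=5, vocab=7)
outs, h_last = model(seq, torch.zeros(4, 8))
print(outs.shape, h_last.shape)  # torch.Size([4, 5, 7]) torch.Size([4, 8])
```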

Now that dataset and model definitions are done, we need to implement the training loop and we are good to go.

In [None]:
#########################################################################
# TO-DO: Implement the missing part of the training function. 
# As a loss function we want to use cross-entropy
# It can be called with F.cross_entropy().
# Hint: Pass through the model -> Backpropagate gradients -> Take gradient step
#########################################################################

def train(model, dataloader, optimizer, epoch, log_interval, device, hidden_size):
    model.train()
    
    # initialize hidden state to 0 at the beginning of the epoch 
    # and keep propagating it to the next batch
    hidden_state = torch.zeros(batch_size, hidden_size).to(device)
    
    for batch_idx, (data, target) in enumerate(dataloader):
        data, target = data.to(device), target.to(device)
        
        # loss can be computed in 2 ways:
        # mean across seq_length and batch_size or sum across seq_length and mean across batch_size
        
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(dataloader.dataset),
                100. * batch_idx / len(dataloader), loss.item()))
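The two loss reductions mentioned in the comment can be compared on random tensors. The shapes below are assumptions for illustration: logits of shape `(batch, seq_length, vocab_size)` and integer targets of shape `(batch, seq_length)`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, vocab = 4, 5, 7
logits = torch.randn(batch, seq_len, vocab)
targets = torch.randint(0, vocab, (batch, seq_len))

# F.cross_entropy expects (N, C) logits and (N,) class ids,
# so the batch and time dimensions are flattened together.
flat_logits = logits.reshape(-1, vocab)
flat_targets = targets.reshape(-1)

# Option 1: mean over both batch and sequence positions.
loss_mean = F.cross_entropy(flat_logits, flat_targets)

# Option 2: sum over sequence positions, then mean over the batch.
per_token = F.cross_entropy(flat_logits, flat_targets, reduction="none")
loss_sum_seq = per_token.reshape(batch, seq_len).sum(dim=1).mean()

print(loss_mean.item(), loss_sum_seq.item())
```

The two values differ by a factor of `seq_len`, so switching between them rescales the gradients in the same way as rescaling the learning rate would for plain gradient descent.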

Next, we instantiate our model and optimizer and then we can start training.

In [None]:
# model and training parameters
# feel free to experiment with different parameters and optimizers
current_hidden_size = 256
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to CPU without a GPU

rnn = RNN(seq_length, vocab_size, current_hidden_size).to(device)

optimizer = optim.RMSprop(rnn.parameters(), lr=learning_rate)

In [None]:
# training loop
epochs = 1
for epoch in range(1, epochs + 1):
    train(rnn, dataloader, optimizer, epoch, log_interval=1000, device=device, hidden_size=current_hidden_size)
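The training function carries the hidden state from one batch to the next. A minimal sketch of that pattern on random data is shown below, using the built-in `nn.RNNCell` as a stand-in model. Detaching the hidden state between batches is an assumption about how to keep each backward pass confined to the current batch; it is not something the notebook prescribes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)
vocab, hidden, batch, seq_len = 7, 8, 4, 5
cell = nn.RNNCell(vocab, hidden)      # stand-in for the exercise's cell
readout = nn.Linear(hidden, vocab)
params = list(cell.parameters()) + list(readout.parameters())
optimizer = optim.RMSprop(params, lr=1e-3)

hidden_state = torch.zeros(batch, hidden)
losses = []
for step in range(3):
    # Random one-hot inputs and random targets, for illustration only.
    x = torch.eye(vocab)[torch.randint(0, vocab, (batch, seq_len))]
    y = torch.randint(0, vocab, (batch, seq_len))

    # Keep the values but cut the graph of the previous batch.
    hidden_state = hidden_state.detach()

    step_logits = []
    for t in range(seq_len):
        hidden_state = cell(x[:, t], hidden_state)
        step_logits.append(readout(hidden_state))
    logits = torch.stack(step_logits, dim=1)

    optimizer.zero_grad()
    loss = F.cross_entropy(logits.reshape(-1, vocab), y.reshape(-1))
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses)
```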

Along with training the model, it's also good to check what kind of text our model is generating. We have implemented a function which samples text from the model given an initial token and a hidden state.

In [None]:
# sample text of length seq_len, this seq_len does not need to be the same
# as seq_length that we used earlier, we can basically sample text of any arbitrary length.

softmax_temp = 0.3

def sample(hidden_state, token, seq_len):
    token_emb = torch.zeros(1, vocab_size).to(device) # use batch_size=1 at inference
    token_emb[0,token] = 1
    char_indices = [token] # first token
    
    with torch.no_grad():
        for timestep in range(seq_len):
            output, hidden_state = rnn.rnn_cell(token_emb, hidden_state)
            output = torch.softmax(output / softmax_temp, dim=-1) # convert to probabilities
            token = torch.multinomial(output.squeeze(), num_samples=1).item()  # sample token from output distribution
            char_indices.append(token)

            token_emb = torch.zeros(1, vocab_size).to(device)
            token_emb[0, token] = 1
    
    return char_indices
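The effect of `softmax_temp` can be seen on a small vector of made-up logits: dividing by a temperature below 1 sharpens the distribution, so sampling picks the most likely character more often.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.0])  # made-up scores for three tokens

flat = torch.softmax(logits / 1.0, dim=-1)   # temperature 1.0: plain softmax
sharp = torch.softmax(logits / 0.3, dim=-1)  # temperature 0.3: peakier

print(flat)
print(sharp)
```

Lower temperatures tend to give more conservative, repetitive samples, while higher temperatures give more varied but noisier text.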

Now, let's sample text from the model after every epoch to see whether it is learning to generate text. In the code below, we sample a 50-character text from the model, starting with a random token and zero memory. Try to generate some text by using the hidden_state returned by the RNN class.

In [None]:
# sample a 50-char text from the model, starting with a random token and 0 memory
for i in range(10):
  token = np.random.randint(0, vocab_size)
  hidden_state = torch.zeros(1, current_hidden_size).to(device)
  char_indices = sample(hidden_state, token, 50) # sample a 50 char text from the model
  txt = ''.join(dataset.ix_to_char[ix] for ix in char_indices)
  print (txt)

If everything went well, our model should be able to generate some legible text after a few epochs, although training will probably be quite slow. The model should be able to learn spellings and certain words, the use of spaces, and how to begin sentences. It most likely won't be able to generate long sentences.

Bonus: Try implementing a word-level language model with the same model on the same dataset. The only difference is that each token is now a word rather than a character. So, the tokenization in the dataset needs to be changed and everything else remains the same.
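As a starting point for the word-level variant, the sketch below splits text into words, punctuation marks and newlines with a regular expression. The pattern is only one possible choice (an assumption, not part of the exercise); it drops spaces, so joining tokens back together does not reproduce the original spacing.

```python
import re

text = "First Citizen:\nBefore we proceed any further, hear me speak."

# Words, single punctuation marks, and newlines each become one token.
tokens = re.findall(r"\w+|[^\w\s]|\n", text)

# Same id mappings as before, but keyed by word instead of character.
words = sorted(set(tokens))
word_to_ix = {w: i for i, w in enumerate(words)}
ix_to_word = {i: w for w, i in word_to_ix.items()}

encoded = [word_to_ix[w] for w in tokens]
print(tokens)
print(len(words))
```

The vocabulary of a word-level model is much larger than 65, so the one-hot vectors and the output layer grow accordingly.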

Bonus: Try using a 3-layer RNN with 512 dimensional hidden state and see if the model is able to generate better text.
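One way to try a stacked model quickly is PyTorch's built-in `nn.RNN` with `num_layers=3`, sketched below with a linear readout on top. This swaps in the library module rather than stacking the hand-written RNNCell, and the input shape `(batch, seq_length, vocab_size)` is an assumption.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size, num_layers = 65, 512, 3

# Built-in multi-layer RNN; batch_first=True expects (batch, seq, features).
stacked = nn.RNN(vocab_size, hidden_size, num_layers=num_layers, batch_first=True)
readout = nn.Linear(hidden_size, vocab_size)

x = torch.eye(vocab_size)[torch.randint(0, vocab_size, (2, 20))]  # (2, 20, 65)
h0 = torch.zeros(num_layers, 2, hidden_size)  # one hidden state per layer

features, hn = stacked(x, h0)  # features come from the top layer only
logits = readout(features)
print(logits.shape, hn.shape)  # torch.Size([2, 20, 65]) torch.Size([3, 2, 512])
```

Note that the hidden state now has a leading layer dimension, so the zero initialization used for sampling and training would need the same extra dimension.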