<a href="https://colab.research.google.com/github/SnehhaPadmanabhan/Bertelsmann-AI-Challenge/blob/master/Char_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- Text for NLP can be modelled at two levels of granularity:
   - words
   - characters
- Word vectors are pre-trained word-level representations (embeddings), one vector per vocabulary word.
- Advantages of breaking text down to the character level
   1. fewer vocabulary problems at the input
      - resilient to spelling mistakes and rare words
   2. removes the computational bottleneck at the output (the output layer covers a small character set rather than a large word vocabulary)
- Some capabilities of character-level models
   1. can learn non-trivial syntax
   2. can capture sentiment
   3. can translate
- The choice between characters and words arises in two places: at the model's input and at the model's output.
- Advantages of character-level models
  1. allow an open vocabulary
     ![alt text](https://drive.google.com/open?id=1rV1DsvIh6wANv3sgZFQAxjoBhaXRmmMn)
  2. easy to pretrain and train
- Disadvantages of char RNNs
  1. individual characters are semantically void (a single character carries no meaning on its own)
  2. longer sequences increase computational expense
  3. the output is also a sequence of characters
- Several solutions to these drawbacks exist.
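
As a rough illustration of the open-vocabulary point above, the sketch below builds a word-level and a character-level vocabulary from the same training text and checks a misspelled sentence against each. The sample sentences and the helper name `unknown_tokens` are made up for this illustration.

```python
# Hypothetical training and test text, chosen only for illustration
train_text = "happy families are all alike"
test_text = "hapy famlies are alike"  # contains misspellings not seen in training

word_vocab = set(train_text.split())
char_vocab = set(train_text)

def unknown_tokens(tokens, vocab):
    """Return the tokens that are missing from the vocabulary."""
    return [tok for tok in tokens if tok not in vocab]

unknown_words = unknown_tokens(test_text.split(), word_vocab)
unknown_chars = unknown_tokens(list(test_text), char_vocab)

print("unknown words:", unknown_words)
print("unknown chars:", unknown_chars)
```

The misspelled words fall outside the word vocabulary, while every character in them is already known, which is the sense in which a character-level input is resilient to spelling mistakes.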

In [0]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
# `!cd` runs in a temporary subshell, so it does not change the notebook's working directory
!cd /content/drive/'My Drive'/

In [0]:
!cp /content/drive/'My Drive'/anna.txt sample_data/

In [0]:
with open('sample_data/anna.txt', 'r') as txtfile:
    text = txtfile.read()

In [20]:
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

**Tokenization**

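- Tokenization maps each character to an integer, and each integer back to its character.
- The integer codes are what one-hot encoding takes as input.
- One-hot encoding turns each code into a fixed-length vector with a single 1, a numeric form the network can take as input.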

In [10]:
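chars = tuple(set(text))  # unique characters; the order of a set is arbitrary
int2char = dict(enumerate(chars))  # integer -> character
char2int = {ch: ii for ii, ch in int2char.items()}  # character -> integer
encode = np.array([char2int[ch] for ch in text])  # the whole text as integer codes
print(encode[:50])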

[54 57 43 65 80 69 72 31 59 77 77 77  5 43 65 65 14 31 74 43 24 47 13 47
 69 49 31 43 72 69 31 43 13 13 31 43 13 47  6 69 73 31 69 75 69 72 14 31
 66 15]
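
A quick way to check the two dictionaries is to decode the integers back into text. The sketch below is self-contained and uses a short sample string rather than the novel. It also builds the vocabulary with `sorted(set(...))`, a small variation that the cell above does not use: sorting makes the character-to-integer mapping the same on every run, whereas the iteration order of a set of strings can differ between Python sessions.

```python
import numpy as np

sample = "Happy families"  # stand-in text, assumed for illustration

chars = tuple(sorted(set(sample)))  # sorted -> reproducible ordering
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}

# Encode to integers, then map every integer back to its character
encoded = np.array([char2int[ch] for ch in sample])
decoded = "".join(int2char[int(ii)] for ii in encoded)

print(encoded)
print(decoded)
```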


In [0]:
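def one_hot_encode(arr, n):
    # One row of n zeros for every element of arr
    one_hot = np.zeros((arr.size, n), dtype=np.float32)
    # Fill with 1s: in row i, set the column given by the i-th code
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    # Restore the original shape, with a trailing axis of length n
    one_hot = one_hot.reshape((*arr.shape, n))
    return one_hot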

In [12]:
test_seq = np.array([[3, 5, 1]])
one_hot = one_hot_encode(test_seq, 8)
print(one_hot)

[[[0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 1. 0. 0. 0. 0. 0. 0.]]]
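
One-hot encoding is lossless: taking the `argmax` along the last axis recovers the original integer codes. The sketch below repeats `one_hot_encode` from above so that it runs on its own.

```python
import numpy as np

def one_hot_encode(arr, n):
    one_hot = np.zeros((arr.size, n), dtype=np.float32)
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    return one_hot.reshape((*arr.shape, n))

test_seq = np.array([[3, 5, 1]])
one_hot = one_hot_encode(test_seq, 8)

# The position of the single 1 in each vector is the original code
recovered = one_hot.argmax(axis=-1)
print(one_hot.shape)
print(recovered)
```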


**Creating mini batches of data and using them to train our model**

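The encoded text is split into n mini-batches; each mini-batch is an m × k array of character codes, where m is the batch size (`batch_size`) and k is the sequence length (`seq_length`).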

In [0]:
def get_batches(arr, batch_size, seq_length):
    '''Yield (x, y) mini-batches of shape (batch_size, seq_length) from arr.'''
    total = batch_size * seq_length  # characters per mini-batch
    n_batches = len(arr) // total  # number of full mini-batches that can be made
    arr = arr[:n_batches * total]  # drop the leftover characters
    arr = arr.reshape((batch_size, -1))  # one row per sequence in a mini-batch
    for n in range(0, arr.shape[1], seq_length):
        # The features
        x = arr[:, n:n+seq_length]
        # The targets: the features shifted left by one character
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n+seq_length]
        except IndexError:
            # Last window: there is no next column, so wrap around to the first one
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        # yield (unlike return) hands back one batch at a time and resumes on the
        # next iteration, so the full list of batches is never held in memory
        yield x, y

In [14]:
batches = get_batches(encode, 8, 50)
x, y = next(batches)
print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])

x
 [[54 57 43 65 80 69 72 31 59 77]
 [49 82 15 31 80 57 43 80 31 43]
 [69 15 40 31 82 72 31 43 31 74]
 [49 31 80 57 69 31 41 57 47 69]
 [31 49 43 23 31 57 69 72 31 80]
 [41 66 49 49 47 82 15 31 43 15]
 [31 20 15 15 43 31 57 43 40 31]
 [55 52 13 82 15 49  6 14  0 31]]

y
 [[57 43 65 80 69 72 31 59 77 77]
 [82 15 31 80 57 43 80 31 43 80]
 [15 40 31 82 72 31 43 31 74 82]
 [31 80 57 69 31 41 57 47 69 74]
 [49 43 23 31 57 69 72 31 80 69]
 [66 49 49 47 82 15 31 43 15 40]
 [20 15 15 43 31 57 43 40 31 49]
 [52 13 82 15 49  6 14  0 31 60]]
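
To see how the rows and windows line up, it helps to run the generator on a tiny array whose values are their own positions. The sketch below repeats `get_batches` in a compact form so that it runs on its own; the wrap-around for the last window is written with an explicit condition instead of `try`/`except`, a variation assumed here for brevity.

```python
import numpy as np

def get_batches(arr, batch_size, seq_length):
    total = batch_size * seq_length
    n_batches = len(arr) // total
    arr = arr[:n_batches * total].reshape((batch_size, -1))
    for n in range(0, arr.shape[1], seq_length):
        x = arr[:, n:n + seq_length]
        y = np.zeros_like(x)
        # Column holding the character that follows the window (wraps at the end)
        next_col = n + seq_length if n + seq_length < arr.shape[1] else 0
        y[:, :-1], y[:, -1] = x[:, 1:], arr[:, next_col]
        yield x, y

toy = np.arange(26)  # 26 "characters"; the last 2 do not fill a batch
batches = list(get_batches(toy, batch_size=2, seq_length=4))

print(len(batches))
for x, y in batches:
    print(x, y, sep="\n", end="\n\n")
```

Each row is a contiguous stretch of the text, so row i of one batch continues in row i of the next batch. That continuity is what lets the training loop carry the hidden state from one batch to the next.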


**Model Structure**

In `__init__` the suggested structure is as follows:

- Create and store the necessary dictionaries (this has been done for you)
- Define an LSTM layer that takes as params: an input size (the number of characters), a hidden layer size `n_hidden`, a number of layers `n_layers`, a dropout probability `drop_prob`, and a `batch_first` boolean (True, since we are batching)
- Define a dropout layer with `drop_prob`
- Define a fully-connected layer with params: input size `n_hidden` and output size (the number of characters)
- Finally, initialize the weights (again, this has been given)
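
The layer list above can be traced shape by shape. The following is a minimal sketch, not the notebook's `CharRNN` class: the sizes (`n_chars = 83`, `n_hidden = 16`, and so on) are arbitrary values assumed for illustration, and the input is random rather than one-hot encoded text.

```python
import torch
from torch import nn

n_chars, n_hidden, n_layers = 83, 16, 2   # illustrative sizes (assumed)
batch_size, seq_length = 4, 10

lstm = nn.LSTM(n_chars, n_hidden, n_layers, dropout=0.5, batch_first=True)
dropout = nn.Dropout(0.5)
fc = nn.Linear(n_hidden, n_chars)

x = torch.rand(batch_size, seq_length, n_chars)          # (batch, seq, features)
hidden = (torch.zeros(n_layers, batch_size, n_hidden),   # hidden state
          torch.zeros(n_layers, batch_size, n_hidden))   # cell state

r_output, hidden = lstm(x, hidden)           # (batch, seq, n_hidden)
out = dropout(r_output)
out = out.contiguous().view(-1, n_hidden)    # (batch * seq, n_hidden)
out = fc(out)                                # (batch * seq, n_chars)

print(r_output.shape, out.shape, hidden[0].shape)
```

The final output has one row of character scores per time step of every sequence, which is why the targets are flattened to the same length before the loss is computed.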


In [15]:
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
    print('Training on GPU!')
else: 
    print('No GPU available, training on CPU; consider making n_epochs very small.')

Training on GPU!


In [0]:
class CharRNN(nn.Module):
    
    def __init__(self, tokens, n_hidden=256, n_layers=2,
                               drop_prob=0.5, lr=0.001):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        
        # creating character dictionaries
        self.chars = tokens
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: ii for ii, ch in self.int2char.items()}
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(n_hidden, len(self.chars))
      
    
    def forward(self, x, hidden):
        r_output, hidden = self.lstm(x, hidden)
        out = self.dropout(r_output)
        # Flatten the LSTM outputs to (batch_size * seq_length, n_hidden) so the
        # linear layer scores every time step; contiguous() ensures view can reshape
        out = out.contiguous().view(-1, self.n_hidden)
        out = self.fc(out)
        # return the final output and the hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        # Create two new tensors with sizes n_layers x batch_size x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                      weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
        
        return hidden

In [0]:
def train(net, data, epochs=10, batch_size=10, seq_length=50, lr=0.001, clip=5, val_frac=0.1, print_every=10):
    ''' Training a network 
    
        Arguments
        ---------
        
        net: CharRNN network
        data: text data to train the network
        epochs: Number of epochs to train
        batch_size: Number of mini-sequences per mini-batch, aka batch size
        seq_length: Number of character steps per mini-batch
        lr: learning rate
        clip: gradient clipping
        val_frac: Fraction of data to hold out for validation
        print_every: Number of steps for printing training and validation loss
    
    '''
    net.train()
    
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    
    # create training and validation data
    val_idx = int(len(data)*(1-val_frac))
    data, val_data = data[:val_idx], data[val_idx:]
    
    if(train_on_gpu):
        net.cuda()
    
    counter = 0
    n_chars = len(net.chars)
    for e in range(epochs):
        # initialize hidden state
        h = net.init_hidden(batch_size)
        
        for x, y in get_batches(data, batch_size, seq_length):
            counter += 1
            
            # One-hot encode our data and make them Torch tensors
            x = one_hot_encode(x, n_chars)
            inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
            
            if(train_on_gpu):
                inputs, targets = inputs.cuda(), targets.cuda()

            # Creating new variables for the hidden state, otherwise
            # we'd backprop through the entire training history
            h = tuple([each.data for each in h])

            # zero accumulated gradients
            net.zero_grad()
            
            # get the output from the model
            output, h = net(inputs, h)
            
            # calculate the loss and perform backprop
            loss = criterion(output, targets.view(batch_size*seq_length).long())
            loss.backward()
            # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            opt.step()
            
            # loss stats
            if counter % print_every == 0:
                # Get validation loss
                val_h = net.init_hidden(batch_size)
                val_losses = []
                net.eval()
                for x, y in get_batches(val_data, batch_size, seq_length):
                    # One-hot encode our data and make them Torch tensors
                    x = one_hot_encode(x, n_chars)
                    x, y = torch.from_numpy(x), torch.from_numpy(y)
                    
                    # Creating new variables for the hidden state, otherwise
                    # we'd backprop through the entire training history
                    val_h = tuple([each.data for each in val_h])
                    
                    inputs, targets = x, y
                    if(train_on_gpu):
                        inputs, targets = inputs.cuda(), targets.cuda()

                    output, val_h = net(inputs, val_h)
                    val_loss = criterion(output, targets.view(batch_size*seq_length).long())
                
                    val_losses.append(val_loss.item())
                
                net.train() # reset to train mode after iterating through validation data
                
                print("Epoch: {}/{}...".format(e+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.4f}...".format(loss.item()),
                      "Val Loss: {:.4f}".format(np.mean(val_losses)))

In [21]:
n_hidden=512
n_layers=2
net = CharRNN(chars, n_hidden, n_layers)
batch_size = 128
seq_length = 100
n_epochs = 10
train(net, encode, epochs=n_epochs, batch_size=batch_size, seq_length=seq_length, lr=0.001, print_every=10)

TypeError: ignored
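
The `clip` argument of `train` caps the overall size of the gradients before each optimizer step. The NumPy sketch below shows the idea behind norm-based clipping: compute one norm over all gradients together and, if it exceeds the limit, scale every gradient by the same factor. It is a conceptual illustration with made-up gradient values and a hypothetical helper name `clip_by_global_norm`, not PyTorch's implementation.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Factor is 1.0 (no change) when the norm is already within the limit
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

# Made-up gradients for two parameter tensors
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))

print(norm_before)
print(norm_after)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length is reduced.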