<a href="https://colab.research.google.com/github/tarun-jethwani/character_level_language_model/blob/master/generating_names_using_RNN_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Building a Character-Level Language Model with an RNN

This tutorial demonstrates what RNNs are capable of. Please visit my [Blog Page](https://leakyrelu.com/) to [know more about this project](https://leakyrelu.com/2019/10/03/generating-people-names-by-building-character-level-language-model-rnn-from-scratch/); otherwise, this IPython notebook is self-explanatory.

We build the character-level language model from scratch, using only NumPy.
As I have said in my earlier [tutorials](https://leakyrelu.com/2019/09/01/building-a-deep-neural-network-from-scratch/), building a machine learning or deep learning model from scratch gives you insight into how the bits and pieces come together to make a complete model. You can see for yourself how the data goes in and gets processed and how the RNN (Recurrent Neural Network) generates predictions, so these so-called deep learning models are no longer a mere black box for you.

The underlying model is an RNN (Recurrent Neural Network): at each time step it produces a prediction from the current input and the hidden state carried over from the previous time step.
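
Written out with the same names the code further down uses (`Wax`, `Waa`, `Wya`, `b`, `by`), one time step of this RNN looks roughly as follows. The angle-bracket superscript for the time step is a notational choice made here, not something the notebook defines:

$$
a^{\langle t \rangle} = \tanh\left(W_{ax}\, x^{\langle t \rangle} + W_{aa}\, a^{\langle t-1 \rangle} + b\right)
$$

$$
\hat{y}^{\langle t \rangle} = \mathrm{softmax}\left(W_{ya}\, a^{\langle t \rangle} + b_y\right)
$$

Here $x^{\langle t \rangle}$ is the one-hot input character, $a^{\langle t \rangle}$ is the hidden state, and $\hat{y}^{\langle t \rangle}$ is the predicted distribution over the next character.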

We are going to train the language model on data containing people's names.
The dataset is in the file names.txt; for now I have uploaded the file to my GitHub repo.

After training, we expect the model to sample random names based on what it has learned.

We are going to code the following steps:
- store text data for processing with an RNN
- synthesize data by sampling a prediction at each time step and passing it to the next RNN cell
- build a character-level text generation recurrent neural network
- clip gradients (to prevent them from exploding)

In [0]:
import numpy as np

Some helper functions for intermediate computations

In [0]:
def print_sample(sample_ix, ix_to_char):
    txt = ''.join(ix_to_char[ix] for ix in sample_ix)
    txt = txt[0].upper() + txt[1:]  # capitalize first character 
    print ('%s' % (txt, ), end='')

def get_initial_loss(vocab_size, seq_length):
    return -np.log(1.0/vocab_size)*seq_length

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)
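
As a quick sanity check on two of these helpers, the sketch below repeats `softmax` and `get_initial_loss` so that it runs on its own. The logits and the sequence length of 7 are made-up values for illustration. It shows why the maximum is subtracted before exponentiating, and what loss a model that guesses uniformly would get.

```python
import numpy as np

def softmax(x):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def get_initial_loss(vocab_size, seq_length):
    # Cross-entropy of a uniform guess, summed over seq_length time steps.
    return -np.log(1.0 / vocab_size) * seq_length

# Hypothetical logits, large enough that a naive np.exp(logits) would overflow.
logits = np.array([[1000.0], [1001.0], [1002.0]])
probs = softmax(logits)
print(probs.ravel(), probs.sum())

# 30 characters and a 7-step sequence (both values chosen for illustration).
initial_loss = get_initial_loss(30, 7)
print(initial_loss)  # 7 * ln(30), roughly 23.81
```

The value of roughly 23.81 is consistent with the loss printed at iteration 0 of the training log further down, which is what you would expect from a nearly untrained model on a 7-step example.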

### Initialize the Parameters

In [0]:
def initialize_parameters(n_a, n_x, n_y):
    """
    Initialize parameters with small random values
    
    Returns:
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        b --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    """
    np.random.seed(1)
    Wax = np.random.randn(n_a, n_x)*0.01 # input to hidden
    Waa = np.random.randn(n_a, n_a)*0.01 # hidden to hidden
    Wya = np.random.randn(n_y, n_a)*0.01 # hidden to output
    b = np.zeros((n_a, 1)) # hidden bias
    by = np.zeros((n_y, 1)) # output bias
    
    parameters = {"Wax": Wax, "Waa": Waa, "Wya": Wya, "b": b,"by": by}
    
    return parameters

### One Step of RNN

In [0]:
def rnn_step_forward(parameters, a_prev, x):
    
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    a_next = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b) # hidden state
    p_t = softmax(np.dot(Wya, a_next) + by) # probabilities for next chars
    
    return a_next, p_t

### Forward Pass of RNN ( for one example )

In [0]:
def rnn_forward(X, Y, a0, parameters, vocab_size = 30):
    
    # Initialize x, a and y_hat as empty dictionaries
    x, a, y_hat = {}, {}, {}
    
    a[-1] = np.copy(a0)
    
    # initialize your loss to 0
    loss = 0
    
    for t in range(len(X)):
        
        # Set x[t] to be the one-hot vector representation of the t'th character in X.
        # if X[t] == None, we just have x[t]=0. This is used to set the input for the first timestep to the zero vector. 
        x[t] = np.zeros((vocab_size,1)) 
        if X[t] is not None:
            x[t][X[t]] = 1
        
        # Run one step forward of the RNN
        a[t], y_hat[t] = rnn_step_forward(parameters, a[t-1], x[t])
        
        # Update the loss by subtracting the log-probability of the correct character at this time-step from it.
        loss -= np.log(y_hat[t][Y[t],0])
        
    cache = (y_hat, a, x)
        
    return loss, cache
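
In symbols, the loss accumulated by the loop above for a sequence of $T$ time steps is the cross-entropy summed over the steps (the symbols $\mathcal{L}$ and $T$ are introduced here for illustration):

$$
\mathcal{L} = -\sum_{t=0}^{T-1} \log \hat{y}^{\langle t \rangle}_{Y[t]}
$$

where $\hat{y}^{\langle t \rangle}_{Y[t]}$ is the probability the model assigns to the correct next character at step $t$. The loss is summed, not averaged, so longer names tend to produce larger values.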

### One Step of RNN backward

In [0]:
def rnn_step_backward(dy, gradients, parameters, x, a, a_prev):
    
    gradients['dWya'] += np.dot(dy, a.T)
    gradients['dby'] += dy
    da = np.dot(parameters['Wya'].T, dy) + gradients['da_next'] # backprop into h
    daraw = (1 - a * a) * da # backprop through tanh nonlinearity
    gradients['db'] += daraw
    gradients['dWax'] += np.dot(daraw, x.T)
    gradients['dWaa'] += np.dot(daraw, a_prev.T)
    gradients['da_next'] = np.dot(parameters['Waa'].T, daraw)
    return gradients
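
The step above can be read as the following set of equations, where $\odot$ is the element-wise product and $e_{Y[t]}$ is the one-hot vector of the correct next character. The `dy` passed in is already $\hat{y}^{\langle t \rangle} - e_{Y[t]}$, as computed in `rnn_backward`. The notation is an assumption made here to mirror the variable names in the code:

$$
\begin{aligned}
dy &= \hat{y}^{\langle t \rangle} - e_{Y[t]} \\
da &= W_{ya}^{\top}\, dy + da_{\text{next}} \\
da_{\text{raw}} &= \left(1 - a^{\langle t \rangle} \odot a^{\langle t \rangle}\right) \odot da \\
dW_{ya} &\leftarrow dW_{ya} + dy\, \left(a^{\langle t \rangle}\right)^{\top}, \qquad db_y \leftarrow db_y + dy \\
dW_{ax} &\leftarrow dW_{ax} + da_{\text{raw}}\, \left(x^{\langle t \rangle}\right)^{\top}, \qquad db \leftarrow db + da_{\text{raw}} \\
dW_{aa} &\leftarrow dW_{aa} + da_{\text{raw}}\, \left(a^{\langle t-1 \rangle}\right)^{\top} \\
da_{\text{next}} &= W_{aa}^{\top}\, da_{\text{raw}}
\end{aligned}
$$

The factor $1 - a \odot a$ is the derivative of $\tanh$, and the last line is what carries the gradient back to the previous time step.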

### Backward Pass through RNN (for one example)

In [0]:
def rnn_backward(X, Y, parameters, cache):
    
    gradients = {}
    
    (y_hat, a, x) = cache
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    
    gradients['dWax'], gradients['dWaa'], gradients['dWya'] = np.zeros_like(Wax), np.zeros_like(Waa), np.zeros_like(Wya)
    gradients['db'], gradients['dby'] = np.zeros_like(b), np.zeros_like(by)
    gradients['da_next'] = np.zeros_like(a[0])
    
    for t in reversed(range(len(X))):
        dy = np.copy(y_hat[t])
        dy[Y[t]] -= 1
        gradients = rnn_step_backward(dy, gradients, parameters, x[t], a[t], a[t-1])
    
    return gradients, a
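
When writing backpropagation by hand, a numerical gradient check is a common way to catch mistakes. It is not part of the original notebook. The sketch below is a minimal version under stated assumptions: it re-implements the forward and backward passes compactly so that it runs on its own, uses a tiny made-up problem (4 hidden units, 5 characters), and stores gradients under the parameter names instead of the `dWax`-style keys used above. The helper names `loss_and_grads` and `max_abs_gradient_error` are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum(axis=0)

def loss_and_grads(params, X, Y):
    """Compact forward + backward pass mirroring rnn_forward / rnn_backward."""
    Wax, Waa, Wya, b, by = (params[k] for k in ("Wax", "Waa", "Wya", "b", "by"))
    n_a, vocab_size = Wax.shape
    x, a, y_hat = {}, {-1: np.zeros((n_a, 1))}, {}
    loss = 0.0
    for t in range(len(X)):
        x[t] = np.zeros((vocab_size, 1))
        if X[t] is not None:
            x[t][X[t]] = 1
        a[t] = np.tanh(Wax @ x[t] + Waa @ a[t - 1] + b)
        y_hat[t] = softmax(Wya @ a[t] + by)
        loss -= np.log(y_hat[t][Y[t], 0])
    grads = {k: np.zeros_like(v) for k, v in params.items()}
    da_next = np.zeros((n_a, 1))
    for t in reversed(range(len(X))):
        dy = y_hat[t].copy()
        dy[Y[t]] -= 1
        grads["Wya"] += dy @ a[t].T
        grads["by"] += dy
        daraw = (1 - a[t] ** 2) * (Wya.T @ dy + da_next)
        grads["b"] += daraw
        grads["Wax"] += daraw @ x[t].T
        grads["Waa"] += daraw @ a[t - 1].T
        da_next = Waa.T @ daraw
    return loss, grads

def max_abs_gradient_error(params, X, Y, eps=1e-5):
    """Largest gap between backprop and central-difference gradients."""
    _, grads = loss_and_grads(params, X, Y)
    worst = 0.0
    for name, W in params.items():
        for i in range(W.size):
            old = W.flat[i]
            W.flat[i] = old + eps
            loss_plus, _ = loss_and_grads(params, X, Y)
            W.flat[i] = old - eps
            loss_minus, _ = loss_and_grads(params, X, Y)
            W.flat[i] = old  # restore the parameter
            numeric = (loss_plus - loss_minus) / (2 * eps)
            worst = max(worst, abs(numeric - grads[name].flat[i]))
    return worst

# Tiny hypothetical problem: 4 hidden units, 5 characters, one short sequence.
rng = np.random.default_rng(0)
n_a, vocab_size = 4, 5
params = {
    "Wax": rng.normal(size=(n_a, vocab_size)) * 0.5,
    "Waa": rng.normal(size=(n_a, n_a)) * 0.5,
    "Wya": rng.normal(size=(vocab_size, n_a)) * 0.5,
    "b": np.zeros((n_a, 1)),
    "by": np.zeros((vocab_size, 1)),
}
X = [None, 1, 3, 2]
Y = [1, 3, 2, 0]
error = max_abs_gradient_error(params, X, Y)
print("max abs gradient error:", error)
```

If the analytic gradients are right, the reported error should be tiny compared with the gradients themselves. A large value usually points to a sign or transpose mistake in the backward step.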

### Update Parameters

In [0]:
def update_parameters(parameters, gradients, lr):

    parameters['Wax'] += -lr * gradients['dWax']
    parameters['Waa'] += -lr * gradients['dWaa']
    parameters['Wya'] += -lr * gradients['dWya']
    parameters['b']  += -lr * gradients['db']
    parameters['by']  += -lr * gradients['dby']
    return parameters

In [0]:
import numpy as np
import random

### I have kept the dataset inside Google Drive, so let's mount Google Drive

In [0]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


## 1 - Preprocess the Data

### 1.1 - Read the Data from Google Drive

Read the dataset of names, create a list of unique characters (such as a-z), and compute the dataset and vocabulary size. 

In [0]:
data = open('/gdrive/My Drive/names.txt', 'r').read()
data= data.lower()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))

There are 55873 total characters and 30 unique characters in your data.


The characters are a-z (26 characters) plus the single quote `'`, the space ` `, the hyphen `-`, and the newline character `\n`, for a total of 30 characters. In this project the newline plays a role similar to the `<EOS>` (end of sentence) token: it indicates the end of a name rather than the end of a sentence.

Then we create a Python dictionary that maps each character to an index from 0 to 29. We also create a second Python dictionary that maps each index back to the corresponding character. This will help you figure out which index corresponds to which character in the probability distribution output by the softmax layer. Below, `char_to_ix` and `ix_to_char` are the Python dictionaries.

In [0]:
char_to_ix = { ch:i for i,ch in enumerate(sorted(chars)) }
ix_to_char = { i:ch for i,ch in enumerate(sorted(chars)) }
print(char_to_ix)
print(ix_to_char)

{'\n': 0, ' ': 1, "'": 2, '-': 3, 'a': 4, 'b': 5, 'c': 6, 'd': 7, 'e': 8, 'f': 9, 'g': 10, 'h': 11, 'i': 12, 'j': 13, 'k': 14, 'l': 15, 'm': 16, 'n': 17, 'o': 18, 'p': 19, 'q': 20, 'r': 21, 's': 22, 't': 23, 'u': 24, 'v': 25, 'w': 26, 'x': 27, 'y': 28, 'z': 29}
{0: '\n', 1: ' ', 2: "'", 3: '-', 4: 'a', 5: 'b', 6: 'c', 7: 'd', 8: 'e', 9: 'f', 10: 'g', 11: 'h', 12: 'i', 13: 'j', 14: 'k', 15: 'l', 16: 'm', 17: 'n', 18: 'o', 19: 'p', 20: 'q', 21: 'r', 22: 's', 23: 't', 24: 'u', 25: 'v', 26: 'w', 27: 'x', 28: 'y', 29: 'z'}
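
To see how these mappings are used, here is a small sketch that builds the same kind of dictionaries from a made-up two-name corpus, encodes a name into indices, turns the first index into a one-hot column vector (the form `rnn_forward` feeds to the network), and decodes the indices back. The names `sample_data`, `to_ix` and `to_char` are hypothetical and only used for this illustration.

```python
import numpy as np

# Build the same kind of mappings from a small, made-up corpus.
sample_data = "anna\nbob\n"
sample_chars = sorted(set(sample_data))  # ['\n', 'a', 'b', 'n', 'o']
to_ix = {ch: i for i, ch in enumerate(sample_chars)}
to_char = {i: ch for i, ch in enumerate(sample_chars)}

# Characters -> indices.
encoded = [to_ix[ch] for ch in "anna"]
print(encoded)  # [1, 3, 3, 1]

# First index -> one-hot column vector of shape (vocab_size, 1).
one_hot = np.zeros((len(sample_chars), 1))
one_hot[encoded[0]] = 1
print(one_hot.ravel())

# Indices -> characters.
decoded = "".join(to_char[i] for i in encoded)
print(decoded)  # anna
```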


### 1.2 - Overview of the model

Your model will have the following structure: 

- Initialize parameters 
- Run the optimization loop
    - Forward propagation to compute the loss function
    - Backward propagation to compute the gradients of the loss with respect to the parameters
    - Clip the gradients to avoid exploding gradients
    - Using the gradients, update your parameters with the gradient descent update rule.
- Return the learned parameters 
    

At each time-step, the RNN tries to predict what is the next character given the previous characters.

## 2 - Putting together the building blocks of the model

In this part, we will build two important blocks of the overall model:
- Gradient clipping: to avoid exploding gradients
- Sampling: a technique used to generate characters

We then apply these two functions to build the model.

### 2.1 - Clipping the gradients in the optimization loop

Before updating the parameters, we will perform gradient clipping when needed to make sure that gradients are not "exploding," taking on overly large values. 

We write a function `clip` that takes in a dictionary of gradients and returns a clipped version of them.
We will use a simple element-wise clipping procedure, in which every element of the gradient is clipped to lie in the range [-N, N].

In this project, if any component of the gradient vector is greater than 5, it would be set to 5; and if any component of the gradient vector is less than -5, it would be set to -5. If it is between -5 and 5, it is left alone. 
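
A minimal illustration of this rule on a made-up gradient array (the values are chosen only so that some fall outside [-5, 5]):

```python
import numpy as np

# Hypothetical gradient values, chosen so that some fall outside [-5, 5].
grad = np.array([[-12.0, 0.3], [4.9, 7.5]])
clipped = np.clip(grad, -5, 5)
print(clipped)
# [[-5.   0.3]
#  [ 4.9  5. ]]
```

Note that element-wise clipping can change the direction of the gradient. Rescaling the whole gradient by its norm is a common alternative, but it is not what this notebook does.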


In [0]:
def clip(gradients, maxValue):
    '''
    Clips the gradients' values between minimum and maximum.
    
    Arguments:
    gradients -- a dictionary containing the gradients "dWaa", "dWax", "dWya", "db", "dby"
    maxValue -- everything above this number is set to this number, and everything less than -maxValue is set to -maxValue
    
    Returns: 
    gradients -- a dictionary with the clipped gradients.
    '''
    
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']
   
    for gradient in [dWax, dWaa, dWya, db, dby]:
        np.clip(gradient, -maxValue, maxValue, out=gradient)
    
    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    
    return gradients

### 2.2 - Sampling 
Visit the project's [tutorial blog page](https://leakyrelu.com/2019/10/03/generating-people-names-by-building-character-level-language-model-rnn-from-scratch/) to learn more about what sampling is.
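
In short, sampling means drawing the next character index at random according to the probabilities that the softmax outputs, not always picking the most likely character, and then feeding the drawn character back in as the next input. Here is a minimal sketch of the drawing step only, with a made-up four-character distribution (the probabilities and the seed are assumptions for illustration):

```python
import numpy as np

# Made-up softmax output over a 4-character vocabulary, shape (4, 1).
y = np.array([[0.1], [0.6], [0.2], [0.1]])

np.random.seed(0)  # fixed seed so the draws are repeatable
draws = [np.random.choice(len(y), p=y.ravel()) for _ in range(1000)]

# Index 1 should come up most often, but not every time.
counts = np.bincount(draws, minlength=len(y))
print(counts)
```

Because less likely characters are still drawn some of the time, repeated calls produce different names and not the same most-probable one over and over.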

In [0]:
def sample(parameters, char_to_ix, seed):
    """
    Sample a sequence of characters according to a sequence of probability distributions output of the RNN

    Arguments:
    parameters -- python dictionary containing the parameters Waa, Wax, Wya, by, and b. 
    char_to_ix -- python dictionary mapping each character to an index.
    seed -- integer used to seed NumPy's random number generator, so that sampling is reproducible.

    Returns:
    indices -- a list of length n containing the indices of the sampled characters.
    """
    
    # Retrieve parameters and relevant shapes from "parameters" dictionary
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    vocab_size = by.shape[0]
    n_a = Waa.shape[1]
    
    
    x = np.zeros((vocab_size, 1))
    a_prev = np.zeros((n_a, 1))
   
    indices = []

    idx = -1 
     
    counter = 0
    newline_character = char_to_ix['\n']
    
    while (idx != newline_character and counter != 50):
        
        a = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)
        z = np.dot(Wya, a) + by
        y = softmax(z)
        
        np.random.seed(counter+seed) 
        
        idx = np.random.choice(list(range(vocab_size)), p = y.ravel())
        indices.append(idx)
        x = np.zeros((vocab_size, 1))
        x[idx] = 1
        a_prev = a
        
        seed += 1
        counter +=1
        

    if (counter == 50):
        indices.append(char_to_ix['\n'])
    
    return indices

## 3 - Building the language model 

It is time to build the character-level language model for text generation. 


### 3.1 - Gradient descent 

We will implement a function that performs one step of stochastic gradient descent (with clipped gradients). We go through the training examples one at a time, so the optimization algorithm is stochastic gradient descent. Here are the steps of a common optimization loop for an RNN:

- Forward propagate through the RNN to compute the loss
- Backward propagate through time to compute the gradients of the loss with respect to the parameters
- Clip the gradients if necessary 
- Update your parameters using gradient descent 

In [0]:
def optimize(X, Y, a_prev, parameters, learning_rate = 0.01):
    """
    Execute one step of the optimization to train the model.
    
    Arguments:
    X -- list of integers, where each integer is a number that maps to a character in the vocabulary.
    Y -- list of integers, exactly the same as X but shifted one index to the left.
    a_prev -- previous hidden state.
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        b --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    learning_rate -- learning rate for the model.
    
    Returns:
    loss -- value of the loss function (cross-entropy)
    gradients -- python dictionary containing:
                        dWax -- Gradients of input-to-hidden weights, of shape (n_a, n_x)
                        dWaa -- Gradients of hidden-to-hidden weights, of shape (n_a, n_a)
                        dWya -- Gradients of hidden-to-output weights, of shape (n_y, n_a)
                        db -- Gradients of bias vector, of shape (n_a, 1)
                        dby -- Gradients of output bias vector, of shape (n_y, 1)
    a[len(X)-1] -- the last hidden state, of shape (n_a, 1)
    """
    
    
  
    loss, cache = rnn_forward(X, Y, a_prev, parameters)
    gradients, a = rnn_backward(X, Y, parameters, cache)
    gradients = clip(gradients, 5)
    parameters = update_parameters(parameters, gradients, learning_rate)
    
    
    return loss, gradients, a[len(X)-1]

### 3.2 - Training the model 

Given the dataset of names, we use each line of the dataset (one name) as one training example. Every 10,000 steps of stochastic gradient descent,
we sample 5 names (the default of the `names` argument) to see how the algorithm is doing. We shuffle the dataset first, so that stochastic gradient descent visits the examples in random order.

Inside `model()`, when `examples[index]` contains one name (a string), you can create an example (X, Y) like this:
```python
        index = j % len(examples)
        X = [None] + [char_to_ix[ch] for ch in examples[index]] 
        Y = X[1:] + [char_to_ix["\n"]]
```
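
As a worked example of that snippet, the sketch below rebuilds the 30-character vocabulary listed earlier and builds (X, Y) for a single hypothetical name, `"amy"`. `None` stands for the zero-vector input at the first time step, and Y is X shifted left by one with the newline index appended.

```python
# Rebuild the 30-character vocabulary: newline, space, quote, hyphen, a-z.
chars = sorted(["\n", " ", "'", "-"] + list("abcdefghijklmnopqrstuvwxyz"))
char_to_ix = {ch: i for i, ch in enumerate(chars)}

name = "amy"  # hypothetical training example
X = [None] + [char_to_ix[ch] for ch in name]
Y = X[1:] + [char_to_ix["\n"]]
print(X)  # [None, 4, 16, 28]
print(Y)  # [4, 16, 28, 0]
```

At every step the target `Y[t]` is the character that follows the input `X[t]`, and the final target is the newline that ends the name.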

In [0]:
def model(data, ix_to_char, char_to_ix, num_iterations = 200000, n_a = 50, names = 5, vocab_size = 30):
    """
    Trains the model and generates names. 
    
    Arguments:
    data -- text corpus
    ix_to_char -- dictionary that maps the index to a character
    char_to_ix -- dictionary that maps a character to an index
    num_iterations -- number of iterations to train the model for
    n_a -- number of units of the RNN cell
    names -- number of names you want to sample each time progress is printed.
    vocab_size -- number of unique characters found in the text, size of the vocabulary
    
    Returns:
    parameters -- learned parameters
    """
    
    n_x, n_y = vocab_size, vocab_size
    parameters = initialize_parameters(n_a, n_x, n_y)
    loss = get_initial_loss(vocab_size, names)
    
   
    with open("/gdrive/My Drive/names.txt") as f:
        examples = f.readlines()
    examples = [x.lower().strip() for x in examples]
    
    
    np.random.seed(0)
    np.random.shuffle(examples)
    
    
    a_prev = np.zeros((n_a, 1))
    

    for j in range(num_iterations):
        
       
        
       
        index = j % len(examples)
        X = [None] + [char_to_ix[ch] for ch in examples[index]]
        Y = X[1:] + [char_to_ix["\n"]]
        
       
        # One SGD step; `loss` is the loss of this single training example
        loss, gradients, a_prev = optimize(X, Y, a_prev, parameters, 0.01)

        if j % 10000 == 0:
            print('Iteration: %d, Loss: %f' % (j, loss) + '\n')
            seed = 0
            for name in range(names):
                
                # Sample indices and print them
                sampled_indices = sample(parameters, char_to_ix, seed)
                print_sample(sampled_indices, ix_to_char)
                
                seed += 1  
      
            print('\n')
        
    return parameters
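
The loss that `model` prints comes from a single training example, so it can jump around from one printout to the next. One common way to make the trend easier to read is an exponential moving average of the loss. The sketch below is only an illustration: the `smooth` helper and the factor `beta = 0.999` are assumptions and not part of the notebook, and the loss values are made up.

```python
def smooth(loss, curr_loss, beta=0.999):
    # Exponential moving average: keep most of the old value, blend in a little of the new one.
    return beta * loss + (1 - beta) * curr_loss

# Made-up per-example losses that jump around.
per_example_losses = [23.8, 10.3, 13.6, 12.9, 16.1, 19.5]
running = per_example_losses[0]
for curr in per_example_losses[1:]:
    running = smooth(running, curr)
    print(round(running, 4))
```

With a factor this close to 1 the running value moves slowly, so it only reflects the trend after many iterations.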

Run the following cell. At the first iteration the model generates random-looking characters. After a few thousand iterations, it should learn to generate reasonable-looking names.

In [0]:
parameters = model(data, ix_to_char, char_to_ix)

Iteration: 0, Loss: 23.809027

Mjzwwtalcpncyfrpv rhihvt
Imc'
Jzwwtalcpncyfrpv rhihvt
Mc'
Zwwtalcpncyfrpv rhihvt


Iteration: 10000, Loss: 10.335634

Mexttn
Jie
Kuttn
Mad
Wrtta


Iteration: 20000, Loss: 13.579897

Meytth
Ilea
Justna
Mad
Wrys


Iteration: 30000, Loss: 12.886642

Mgyttobe
Ilea
Justuce
Mac-dorga
Wsyna


Iteration: 40000, Loss: 16.069340

Meysti
Ind
Justlbe
Mac
Wosta


Iteration: 50000, Loss: 19.530784

Lixpor
Goic
Hustia
Leb
Wrvice


Iteration: 60000, Loss: 13.861752

Meusti
Imba
Justolia
Mal
Worrala


Iteration: 70000, Loss: 11.658527

Meysth
Kieb
Kysto
Mad
Wrysa


Iteration: 80000, Loss: 9.427867

Liworra
Higa
Ivissa
Mad
Wrurdia


Iteration: 90000, Loss: 9.901628

Mevry
Ilba
Justicd
Mad
Wrtofe


Iteration: 100000, Loss: 14.864924

Mivron
Joeb
Kuttja
Mad
Wostal


Iteration: 110000, Loss: 12.407441

Mavrus
Joeb
Justi
Mad
Wossa


Iteration: 120000, Loss: 14.501468

Livorja
Imad
Justold
Mad
Wyrren


Iteration: 130000, Loss: 12.537980

Mevmor
Jola
Kutti
Mad
Wostan


Iteration

## Conclusion

Initially the model generated random sequences of characters that didn't make any sense, but towards the end of training we were getting reasonable names.

That's all for this tutorial. Its purpose was to give readers and followers of machine learning and deep learning an idea of how RNNs work and how these bits and pieces come together to build a whole model.

After implementing this tutorial, I bet its followers will be clear on the functional details of an RNN and will have adequate insight when debugging an RNN or building an RNN-based model with any deep learning framework.

To stay updated with my tutorials, please follow my blog [LeakyReLU : Practical Guide to Machine Learning, Deep Learning and AI](https://leakyrelu.com)