# Table of Contents
**1. Character RNN**

**2. The data**

**3. Data Preprocessing**

**4. Build the model**

**5. Build the predict sequence**

**6. Prediction**

# 1. Character RNN

The term “char-rnn” is short for “character recurrent neural network”: a recurrent neural network trained to predict the next character given a sequence of previous characters. We can therefore treat a char-rnn as a classification model whose output is a probability distribution over character classes, i.e., a vocabulary of characters. In this setting the characters are fed in one at a time, and the network is expected to produce a prediction only after the last character.
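To make "a probability distribution over character classes" concrete, here is a minimal sketch in plain Python. The four-character vocabulary and the raw scores are made up for illustration; they are not taken from the model trained later in this notebook.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

vocab = ['a', 'b', 'c', ' ']          # hypothetical 4-character vocabulary
scores = [2.0, 0.5, 0.1, 1.0]         # made-up network outputs for the next character
probs = softmax(scores)

# The predicted character is the class with the highest probability
predicted = vocab[probs.index(max(probs))]
print(dict(zip(vocab, [round(p, 3) for p in probs])), '->', repr(predicted))
```

The real model does the same thing over 87 classes instead of 4.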

Below is an example of a char-RNN:

<p align="center">
<img height="250" width="450" src="https://mail.google.com/mail/u/0/?ui=2&ik=0aefe2754b&view=att&th=165c82544a6ca2b7&attid=0.1&disp=safe&realattid=f_jlxkex7n1&zw"></img>
</p>


In the example above there are 5 timesteps, i.e., the sequence length is 5, and we get an output only after all 5 characters have been passed in.
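A small sketch of how a text is cut into 5-timestep inputs and next-character targets. The toy string is an assumption chosen for readability; the notebook applies the same idea to the book with a window of 100.

```python
text = "hello world"          # toy text, not the book
seq_len = 5                   # 5 timesteps per input

pairs = []
for i in range(len(text) - seq_len):
    window = text[i:i + seq_len]      # 5 input characters
    target = text[i + seq_len]        # the character that follows them
    pairs.append((window, target))

for window, target in pairs[:3]:
    print(repr(window), '->', repr(target))
print(len(pairs), 'training pairs')
```

A text of length `n` yields `n - seq_len` pairs, because the last `seq_len` characters have no following character to predict.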


**Use of Char-RNN:**

<p align="center">
<img height="250" width="450" src="https://mail.google.com/mail/u/0/?ui=2&ik=0aefe2754b&view=att&th=165c932922d81ce0&attid=0.1&disp=safe&realattid=f_jlxux5xs1&zw"></img>
</p>




# 2. The Data
## Download the data & read it in Colab

The book comes from the Project Gutenberg corpus, at the link used in the command below, and was opened in Colab as follows:

In [1]:
#Download book
!wget -O Pride_and_Prejudice.txt http://www.gutenberg.org/files/1342/1342-0.txt --quiet
text_book=open("Pride_and_Prejudice.txt",encoding='utf8').read()


Redirecting output to ‘wget-log’.


## Let's have a look at the first 23 characters of the book

In [2]:
text_book[:23]

'\ufeffThe Project Gutenberg '
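The leading `'\ufeff'` is a byte order mark (BOM) stored at the start of the file, and with `encoding='utf8'` it is kept as an ordinary character. The sketch below, which uses a small temporary file rather than the book, shows that Python's `utf-8-sig` codec drops it on read. Whether to strip it is a judgment call; the notebook keeps it, so it becomes one of the vocabulary entries.

```python
import codecs
import os
import tempfile

# Write a tiny UTF-8 file that starts with a byte order mark (BOM)
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(codecs.BOM_UTF8 + "The Project".encode("utf-8"))
    path = f.name

with open(path, encoding="utf8") as f:          # BOM is kept as '\ufeff'
    with_bom = f.read()
with open(path, encoding="utf-8-sig") as f:     # BOM is removed
    without_bom = f.read()
os.remove(path)

print(repr(with_bom[:4]), repr(without_bom[:4]))
```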

# 3. Data Preprocessing

In [0]:
import numpy as np
from keras.preprocessing.text import Tokenizer                                   # Keras Tokenizer, used here to tokenize the book character by character

In [0]:
t=Tokenizer(char_level=True)                                                     # char_level=True makes every character its own token

In [0]:
t.fit_on_texts(text_book)                                                        # Fit the tokenizer on the entire book

In [6]:
vocab_size=len(t.word_index)
vocab_size                                                                       # Number of unique characters in the book

86

In [0]:
book_num=t.texts_to_sequences(text_book)                                         # Each character is converted into its integer index

In [8]:
book_length=len(book_num)
book_length                                                                      # The book has 704190 characters in it

704190

In [0]:
sequence_length=100                                                              # After every 100 characters, we want one output (a character)

In [10]:
unique_chars=t.word_index                                                        # Index number of each of the 86 characters. Space ' ' is also considered
unique_chars                                                                     # a character


{'\n': 14,
 ' ': 1,
 '!': 49,
 '#': 83,
 '$': 80,
 '%': 85,
 "'": 40,
 '(': 64,
 ')': 65,
 '*': 63,
 ',': 21,
 '-': 32,
 '.': 24,
 '/': 72,
 '0': 69,
 '1': 61,
 '2': 66,
 '3': 67,
 '4': 68,
 '5': 71,
 '6': 74,
 '7': 76,
 '8': 73,
 '9': 75,
 ':': 59,
 ';': 31,
 '?': 50,
 '@': 79,
 'A': 48,
 'B': 33,
 'C': 42,
 'D': 46,
 'E': 37,
 'F': 57,
 'G': 55,
 'H': 41,
 'I': 27,
 'J': 52,
 'K': 60,
 'L': 39,
 'M': 30,
 'N': 54,
 'O': 56,
 'P': 53,
 'Q': 86,
 'R': 58,
 'S': 47,
 'T': 35,
 'U': 62,
 'V': 70,
 'W': 43,
 'X': 78,
 'Y': 51,
 'Z': 77,
 '[': 82,
 ']': 84,
 '_': 38,
 'a': 4,
 'b': 23,
 'c': 16,
 'd': 11,
 'e': 2,
 'f': 18,
 'g': 20,
 'h': 8,
 'i': 7,
 'j': 44,
 'k': 26,
 'l': 12,
 'm': 15,
 'n': 6,
 'o': 5,
 'p': 22,
 'q': 45,
 'r': 9,
 's': 10,
 't': 3,
 'u': 13,
 'v': 25,
 'w': 19,
 'x': 36,
 'y': 17,
 'z': 34,
 '“': 28,
 '”': 29,
 '\ufeff': 81}
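In the mapping above, the most frequent characters get the smallest numbers (`' '` is 1, `'e'` is 2) and no character is assigned 0. The following plain-Python sketch mimics that behaviour on a toy string. It is a simplified stand-in written for illustration, not the Keras implementation, and `build_char_index` is a hypothetical helper name.

```python
from collections import Counter

def build_char_index(text):
    """Map each character to an integer, most frequent first, starting at 1."""
    counts = Counter(text)
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {ch: i for i, (ch, _) in enumerate(ranked, start=1)}

toy_text = "the tree"                     # toy text, not the book
char_index = build_char_index(toy_text)
encoded = [char_index[ch] for ch in toy_text]

print(char_index)
print(encoded)
```

Because the numbering starts at 1, later steps that need one slot per index have to allow for `vocab_size + 1` positions.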

## Prepare input & output data

In [0]:
input_data=[]                                                                     # Create empty lists for both the input & output data
output_data=[]
for i in range(0,book_length-sequence_length):
  input_seq=book_num[i:i+sequence_length]
  output_seq=book_num[i+sequence_length]
  input_data.append(input_seq)
  output_data.append(output_seq)

In [12]:
print(len(input_data[0]))                                                        # As mentioned earlier, each input holds 100 characters
print(len(output_data[3]))                                                       # whereas each output holds 1 character: after every input (100 chars) we get 1
                                                                                 # character as output

100
1


In [13]:
print(len(input_data))
print(len(output_data))

704090
704090


In [0]:
input_data=np.reshape(input_data,(len(input_data),sequence_length,1))            # Reshape the input data into 3 dimensions: (examples, 100 timesteps, 1 feature),
                                                                                 # i.e. each character is represented by 1 number

In [0]:
input_data=input_data/vocab_size                                                 # Normalize the input data: indices 1..86 are scaled into (0, 1]

In [16]:
input_data.shape

(704090, 100, 1)
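The reshape and normalization steps can be checked on a tiny example. The sketch below assumes a vocabulary of 5 characters and windows of 3 timesteps; both numbers are made up to keep the arrays readable.

```python
import numpy as np

toy_vocab_size = 5                                  # hypothetical vocabulary size
windows = [[2, 3, 1], [3, 1, 4], [1, 4, 2]]         # 3 examples of 3 timesteps each

# (examples, timesteps, features): one number per character
x = np.reshape(windows, (len(windows), len(windows[0]), 1))
x = x / toy_vocab_size                              # indices 1..5 become 0.2..1.0

print(x.shape)
print(x[0].ravel())
```

The trailing dimension of size 1 is what the LSTM layer reads as "one feature per timestep".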

In [0]:
from keras.utils import to_categorical                                           # One-hot encode each target character, because
output_data=to_categorical(output_data,num_classes=vocab_size+1)                 # we want one character predicted out of 86; +1 because indices start at 1, so class 0 is unused

In [18]:
output_data[1]

array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0.], dtype=float32)
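The vector above has 87 entries with the 1 at position 5, which is the index of `'o'`. A minimal plain-Python sketch of the same encoding shows why `vocab_size + 1` classes are needed: the largest index must still fit. The vocabulary size of 5 and the helper `one_hot` are assumptions for illustration.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

toy_vocab_size = 5                      # hypothetical; the notebook has 86
num_classes = toy_vocab_size + 1        # slot 0 exists, but no character maps to it

print(one_hot(5, num_classes))          # the highest index needs position 5
```

With only `toy_vocab_size` slots, `one_hot(5, 5)` would raise an `IndexError`.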

# 4. Build the model

In [0]:
from keras.models import Sequential                                              # Imports needed to build the model
from keras.layers import Dense,LSTM,Dropout                                                      

In [0]:
model=Sequential()

In [0]:
model.add(LSTM(256,input_shape=(input_data.shape[1],input_data.shape[2])))       # 256 is the number of LSTM units (hidden-state size); input_data.shape[1]=100 is the number of timesteps

In [0]:
model.add(Dropout(0.2))                                                          # Dropout is used to reduce overfitting: during training it randomly zeroes 20% of
                                                                                 # the values passed to the next layer

In [0]:
model.add(Dense(vocab_size+1,activation="softmax"))                              # This is multiclass classification, i.e. each input has 1 output out of 86 chars,
                                                                                 # so we use a softmax output layer
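As a sanity check on the size of this network, the number of trainable parameters can be worked out by hand. The sketch assumes the standard LSTM formulation with four gates and bias terms, which is the default in Keras; `lstm_params` and `dense_params` are hypothetical helper names.

```python
def lstm_params(units, input_dim):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    return inputs * outputs + outputs      # weight matrix plus bias

vocab_size = 86
lstm = lstm_params(units=256, input_dim=1)
dense = dense_params(inputs=256, outputs=vocab_size + 1)

# Dropout has no trainable parameters
print(lstm, dense, lstm + dense)           # 264192 22359 286551
```

Calling `model.summary()` on the built model should report the same totals if these assumptions hold.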

In [0]:
model.compile(optimizer="adam",loss="categorical_crossentropy")                  # No accuracy metric is tracked; we judge model performance by the loss
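Since the loss is the only number reported, it helps to know what a given value means. For a one-hot target, categorical cross-entropy reduces to the negative log of the probability assigned to the true class. The sketch below computes it by hand for a uniform guess over 87 classes, which gives a useful baseline; the probability vectors are made up for illustration.

```python
import math

def categorical_crossentropy(true_index, probs):
    """Loss for one example: negative log of the probability given to the true class."""
    return -math.log(probs[true_index])

num_classes = 87
uniform = [1.0 / num_classes] * num_classes
baseline = categorical_crossentropy(5, uniform)      # identical for any true class

confident = [0.001] * num_classes
confident[5] = 1.0 - 0.001 * (num_classes - 1)       # most of the mass on class 5
better = categorical_crossentropy(5, confident)

print(round(baseline, 3), round(better, 3))
```

The uniform baseline is `ln(87)`, roughly 4.47, so a training loss well below that indicates the model has learned something about character frequencies and context.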

In [0]:
model.fit(input_data,output_data,epochs=3,batch_size=2048)                       

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fc3eb5dd828>

# 5. Build the predict sequence

In [0]:
test_seq=input_data[np.random.randint(0,high=input_data.shape[0])]               # test_seq is a randomly chosen sequence of 100 characters, each represented by 1 number

In [27]:
test_seq.shape

(100, 1)

In [0]:
int_to_char = dict((i,c) for c, i in t.word_index.items())                       # Build a dictionary that converts numbers back into chars

In [29]:
print(int_to_char)

{1: ' ', 2: 'e', 3: 't', 4: 'a', 5: 'o', 6: 'n', 7: 'i', 8: 'h', 9: 'r', 10: 's', 11: 'd', 12: 'l', 13: 'u', 14: '\n', 15: 'm', 16: 'c', 17: 'y', 18: 'f', 19: 'w', 20: 'g', 21: ',', 22: 'p', 23: 'b', 24: '.', 25: 'v', 26: 'k', 27: 'I', 28: '“', 29: '”', 30: 'M', 31: ';', 32: '-', 33: 'B', 34: 'z', 35: 'T', 36: 'x', 37: 'E', 38: '_', 39: 'L', 40: "'", 41: 'H', 42: 'C', 43: 'W', 44: 'j', 45: 'q', 46: 'D', 47: 'S', 48: 'A', 49: '!', 50: '?', 51: 'Y', 52: 'J', 53: 'P', 54: 'N', 55: 'G', 56: 'O', 57: 'F', 58: 'R', 59: ':', 60: 'K', 61: '1', 62: 'U', 63: '*', 64: '(', 65: ')', 66: '2', 67: '3', 68: '4', 69: '0', 70: 'V', 71: '5', 72: '/', 73: '8', 74: '6', 75: '9', 76: '7', 77: 'Z', 78: 'X', 79: '@', 80: '$', 81: '\ufeff', 82: '[', 83: '#', 84: ']', 85: '%', 86: 'Q'}


In [0]:
current_seq = np.copy(test_seq)
current_seq.shape

(100, 1)

In [0]:
def predict_seq(epoch, logs):                                                    # (epoch, logs) is the signature LambdaCallback passes to on_epoch_end
    
    print('Output sequence is: ')
    
    #Initialize predicted output
    predicted_output = ''
    
    #Let's predict the next 50 chars, starting from a copy of test_seq
    current_seq = np.copy(test_seq)
    for i in range(50):
        data_input = np.reshape(current_seq,(1,                                  # We'll use one input
                                             current_seq.shape[0],               # each input has length of 100 chars
                                             current_seq.shape[1]))              # each char is represented by one number
        
        #Get the char int with maximum probability
        predicted_char_int = np.argmax(model.predict(data_input)[0])
        
        #Convert int to char and append it; class 0 has no character, so it maps to ''
        predicted_output = predicted_output + int_to_char.get(predicted_char_int, '')
        
        #Shift the window one step to the left and put the new value at the end
        current_seq = np.roll(current_seq, -1, axis=0)
        current_seq[-1, 0] = predicted_char_int/vocab_size
    
    print(predicted_output)
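The window update inside the loop is easier to follow on a short array. This sketch uses a toy window of 4 timesteps and a pretend prediction instead of calling the model; only the indices (`t`=3, `h`=8, `e`=2, space=1, `w`=19) come from the character index built earlier.

```python
import numpy as np

vocab_size = 86
window = np.array([[3], [8], [2], [1]]) / vocab_size    # toy window "the ", shape (4, 1)
predicted_char_int = 19                                  # pretend the model predicted 'w'

# Shift every timestep one place toward the start; the first row wraps to the end
window = np.roll(window, -1, axis=0)
# Overwrite the wrapped-around row with the normalized prediction
window[-1, 0] = predicted_char_int / vocab_size

recovered = [int(round(v * vocab_size)) for v in window[:, 0]]
print(recovered)
```

The window keeps its shape, drops its oldest character and gains the newest one, so each prediction is conditioned on the characters generated so far.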

In [0]:
data_input=np.reshape(current_seq,(1,current_seq.shape[0],current_seq.shape[1]))
data_input.shape

(1, 100, 1)

# 6. Prediction

In [0]:
from keras.callbacks import LambdaCallback                                       # Imported from the same keras package as the model

In [0]:
checkpoint = LambdaCallback(on_epoch_end=predict_seq)                            # Create a LambdaCallback to do prediction at the end of every epoch

In [0]:
#Print random starting sequence for prediction
print('Initial sequence is: ')
for i in range (sequence_length):
    print(int_to_char[int(round(test_seq[i, 0]*vocab_size))], end='')            # round instead of truncating: 23/86*86 is slightly below 23 in floating point

Initial sequence is: 
ton has our directions,
and all will be completed in a week. They will then join his regiment,
unles
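Turning normalized values back into integer indices is sensitive to floating-point rounding. Truncating with `int()` can land one index too low, which shows up as a wrong character when decoding. The sketch below demonstrates this for index 23 (`'b'`), which comes back as 22 (`'p'`) when truncated; `decode_truncate` and `decode_round` are hypothetical helper names.

```python
vocab_size = 86

def decode_truncate(value):
    return int(value * vocab_size)            # drops the fractional part

def decode_round(value):
    return int(round(value * vocab_size))     # snaps to the nearest integer

index = 23                                     # 'b' in the character index
normalized = index / vocab_size
print(normalized * vocab_size, decode_truncate(normalized), decode_round(normalized))

# Indices that do not survive the truncating round trip
lossy = [i for i in range(1, vocab_size + 1) if decode_truncate(i / vocab_size) != i]
print(lossy)
```

Rounding recovers every index from 1 to 86, so it is the safer choice whenever normalized inputs are mapped back to characters.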

In [0]:
model.fit(input_data, output_data, 
          batch_size=128, 
          epochs=1,
          callbacks=[checkpoint])                                               # The output improves as the number of epochs increases; only 1 epoch is shown here

Epoch 1/1
Output sequence is: 
  ho  he  he  he  he  he  he  he  he  he  he  he  


<keras.callbacks.History at 0x7fc3e629b7b8>