# Character Level Language Model using Keras  
  
  
  
Create a character level language model using Keras. The model will be fed dinosaur names, and once trained, will generate new dinosaur names. Adapted from Coursera/Deep Learning/Module 5/Sequence Model/Character level language model - Dinosaurus land.  
  
  
Download the source text file from here and save it to your Download folder.

In [4]:
import numpy as np   
   
# Read the whole file; the context manager closes it afterwards.
with open('C:/Users/wee yeow/Downloads/dinos.txt', 'r') as f:
    data = f.read()
data = data.lower()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))

There are 19909 total characters and 27 unique characters in your data.


Build two dictionaries: one that converts characters to indices and one that converts indices back to characters.

In [5]:
char_to_ix = { ch:i for i,ch in enumerate(sorted(chars)) }
ix_to_char = { i:ch for i,ch in enumerate(sorted(chars)) }
print("ix_to_char = ", "\n", ix_to_char,"\n"*2) 
print("char_to_ix = ", "\n", char_to_ix) 

ix_to_char =  
 {0: '\n', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z'} 


char_to_ix =  
 {'\n': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}
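
As a quick sanity check, the two dictionaries should be inverses of each other. The sketch below rebuilds them from a tiny made-up string and round-trips a name through them. The names `toy_text`, `encode`, and `decode` are illustrative only and do not appear in the original notebook.

```python
# Toy data standing in for dinos.txt (assumption: one name per line).
toy_text = "rex\nraptor\n"
toy_chars = sorted(set(toy_text))

# Same construction as in the notebook, applied to the toy vocabulary.
toy_char_to_ix = {ch: i for i, ch in enumerate(toy_chars)}
toy_ix_to_char = {i: ch for i, ch in enumerate(toy_chars)}

def encode(name, mapping):
    """Turn a string into a list of integer indices."""
    return [mapping[ch] for ch in name]

def decode(indices, mapping):
    """Turn a list of integer indices back into a string."""
    return "".join(mapping[i] for i in indices)

encoded = encode("rex", toy_char_to_ix)
decoded = decode(encoded, toy_ix_to_char)
print(encoded, decoded)
```

Because the characters are sorted before enumeration, "\n" always receives index 0, which matters later when sequences are padded with zeros.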


Split up the text into individual names.

In [6]:
examples = data.split("\n") 
print(examples[1:10])
print(examples[-10:-1])

['aardonyx', 'abdallahsaurus', 'abelisaurus', 'abrictosaurus', 'abrosaurus', 'abydosaurus', 'acanthopholis', 'achelousaurus', 'acheroraptor']
['zhuchengtyrannus', 'ziapelta', 'zigongosaurus', 'zizhongosaurus', 'zuniceratops', 'zunityrannus', 'zuolong', 'zuoyunlong', 'zupaysaurus']


Create the input X and output Y by converting all dinosaur names into vectors.

In [7]:
X = []
Y = []
for index in range(len(examples)):
    lineX = [char_to_ix[ch] for ch in examples[index]]
    X.append(lineX)
    lineY = lineX[1:] + [char_to_ix["\n"]]
    Y.append(lineY)
print("X[1]:", X[1]) 
print("Y[1]:", Y[1], "\n") 

print("X[2]:", X[2]) 
print("Y[2]:", Y[2], "\n") 

print("X[3]:", X[3]) 
print("Y[3]:", Y[3], "\n") 

X[1]: [1, 1, 18, 4, 15, 14, 25, 24]
Y[1]: [1, 18, 4, 15, 14, 25, 24, 0] 

X[2]: [1, 2, 4, 1, 12, 12, 1, 8, 19, 1, 21, 18, 21, 19]
Y[2]: [2, 4, 1, 12, 12, 1, 8, 19, 1, 21, 18, 21, 19, 0] 

X[3]: [1, 2, 5, 12, 9, 19, 1, 21, 18, 21, 19]
Y[3]: [2, 5, 12, 9, 19, 1, 21, 18, 21, 19, 0] 
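
The relationship between X and Y is easier to see when the pairs are lined up: at each position the target is simply the next character of the name, and the last target is the index of "\n". A minimal sketch, where `make_targets` is a hypothetical helper name introduced for illustration:

```python
def make_targets(indices, end_index=0):
    """Shift the sequence left by one and append the end-of-name index."""
    return indices[1:] + [end_index]

x = [1, 1, 18, 4, 15, 14, 25, 24]   # 'aardonyx', as printed for X[1]
y = make_targets(x)

# Each (input, target) pair is one "predict the next character" example.
for x_t, y_t in zip(x, y):
    print(x_t, "->", y_t)
```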



Get the length of the longest dinosaur name, plus one, so that every padded sequence ends with at least one 0.

In [8]:
longest_sentence = len(max(X, key=len))+1
print(longest_sentence)

27


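Pad all vectors to the same length of 27 so that they can be fed into Keras's SimpleRNN in one batch. Use post-padding so that the zeros appear at the back. Note that the padding value 0 is also the index of "\n", so padded positions are treated as newline characters.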

In [9]:
sample = len(X)
vocab_size = len(char_to_ix)    

from keras.preprocessing import sequence
X = sequence.pad_sequences(X , maxlen=longest_sentence, padding = "post")
Y = sequence.pad_sequences(Y , maxlen=longest_sentence, padding = "post")    

print("X[1]:", X[1], "\n") 
print("Y[1]:", Y[1]) 

Using TensorFlow backend.


X[1]: [ 1  1 18  4 15 14 25 24  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0] 

Y[1]: [ 1 18  4 15 14 25 24  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0]
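
To make the effect of post-padding concrete, here is a NumPy-only sketch of the same idea. It is not the Keras implementation; `pad_post` is a hypothetical helper, and it assumes no sequence is longer than `maxlen`.

```python
import numpy as np

def pad_post(seqs, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen`.

    Assumption: every sequence already fits within maxlen.
    """
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for i, seq in enumerate(seqs):
        out[i, :len(seq)] = seq   # real values first, padding stays at the back
    return out

padded = pad_post([[1, 2, 3], [4]], maxlen=5)
print(padded)
```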


Create one-hot vectors for X and Y.

In [11]:

X_onehot = np.zeros((sample,longest_sentence,vocab_size))
Y_onehot = np.zeros((sample,longest_sentence,vocab_size))


for i, indices in enumerate(X):
    for j, character_index in enumerate(indices):
        X_onehot[i,j,character_index] = 1
        
        
for i, indices in enumerate(Y):
    for j, character_index in enumerate(indices):
        Y_onehot[i,j,character_index] = 1          
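
The nested loops can also be expressed as a single indexing operation. The sketch below is an alternative under the assumption that the input is an integer array of indices; `one_hot` is an illustrative name, and the toy array stands in for the padded X.

```python
import numpy as np

def one_hot(indices, depth):
    """Index rows of an identity matrix to get one-hot vectors."""
    return np.eye(depth)[indices]

toy = np.array([[1, 2, 0],
                [3, 0, 0]])          # 2 samples, 3 time steps, 4 classes
toy_onehot = one_hot(toy, depth=4)   # shape (2, 3, 4)

# Same result as the explicit loop used in the notebook.
loop_version = np.zeros((2, 3, 4))
for i, row in enumerate(toy):
    for j, k in enumerate(row):
        loop_version[i, j, k] = 1

print(toy_onehot.shape, np.array_equal(toy_onehot, loop_version))
```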
  

Build the RNN model using Keras. It will be a simple model with one layer of RNN followed by a Dense layer with softmax activation.

In [12]:
hidden_size = 50

from keras.models import Sequential
from keras.layers import SimpleRNN, Dense

model = Sequential()
model.add(SimpleRNN(hidden_size, return_sequences=True, input_shape=(longest_sentence, vocab_size)))
model.add(Dense(vocab_size, activation = "softmax"))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_1 (SimpleRNN)     (None, 27, 50)            3900      
_________________________________________________________________
dense_1 (Dense)              (None, 27, 27)            1377      
Total params: 5,277
Trainable params: 5,277
Non-trainable params: 0
_________________________________________________________________
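
The parameter counts in the summary can be checked by hand. A SimpleRNN layer has an input kernel, a recurrent kernel, and a bias; the Dense layer has a kernel and a bias. The short calculation below uses the sizes from this notebook; the variable names are for illustration.

```python
vocab_size = 27    # input and output dimension
hidden_size = 50   # number of RNN units

# SimpleRNN: input kernel + recurrent kernel + bias
rnn_params = vocab_size * hidden_size + hidden_size * hidden_size + hidden_size

# Dense: kernel + bias
dense_params = hidden_size * vocab_size + vocab_size

print(rnn_params, dense_params, rnn_params + dense_params)
```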


Set the learning rate to 0.01 and clip the gradient at 5 for Stochastic Gradient Descent (SGD).

In [13]:
from keras.optimizers import SGD
sgd = SGD(lr=0.01, clipvalue=5)
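
To illustrate what value clipping does, the following NumPy sketch applies one plain SGD update with each gradient component limited to the range [-5, 5]. It is a simplified illustration, not the Keras optimizer itself; `sgd_step_with_clipvalue` is a hypothetical name.

```python
import numpy as np

def sgd_step_with_clipvalue(weights, grads, lr=0.01, clipvalue=5.0):
    """One SGD update with element-wise gradient clipping."""
    clipped = np.clip(grads, -clipvalue, clipvalue)
    return weights - lr * clipped

w = np.array([1.0, -2.0, 0.5])
g = np.array([12.0, -0.3, -40.0])    # two components exceed the clip range

new_w = sgd_step_with_clipvalue(w, g)
print(new_w)
```

Clipping keeps a single very large gradient from producing a very large weight update, which is a common concern when training recurrent networks.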



Compile the model. For loss, use categorical cross-entropy (as there are 27 possible output classes). Set the metric to categorical accuracy.

In [14]:
# Pass the SGD object defined above (not the string "sgd") so that lr and clipvalue take effect.
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=["categorical_accuracy"])
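
For intuition about the loss, here is a NumPy sketch of categorical cross-entropy for one-hot targets. It is a simplified version and not the exact Keras code; the function name and the `eps` guard are assumptions made for illustration.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)           # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

y_true = np.array([0.0, 1.0, 0.0])
y_pred = np.array([0.1, 0.7, 0.2])
loss = categorical_crossentropy(y_true, y_pred)  # equals -log(0.7)

# Baseline: a model that guesses uniformly over 27 characters.
uniform = np.full(27, 1.0 / 27)
target = np.eye(27)[5]
uniform_loss = categorical_crossentropy(target, uniform)  # equals log(27)

print(loss, uniform_loss)
```

A training loss that falls below the uniform baseline of log(27), roughly 3.3, indicates that the model has learned something about which characters tend to follow which.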

Define the softmax function, which will be used later.

In [15]:
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)
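
Subtracting the maximum before exponentiating does not change the result but prevents overflow for large inputs. The example below repeats the definition so that it runs on its own and applies it to a column vector of large values (the `logits` array is made up for illustration).

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))   # shift for numerical stability
    return e_x / e_x.sum(axis=0)

# Without the shift, np.exp(1000.0) would overflow to inf.
logits = np.array([[1000.0], [1001.0], [1002.0]])
probs = softmax(logits)

print(probs.ravel(), probs.sum())
```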

Create the function generate_name(), which will:  

1) Randomly draw a character from the dictionary, excluding "\n".   

2) Use this random character as the first character of the newly generated dinosaur name.  

3) Feed this seed character through the weights learnt by the RNN model.  

4) Generate a prediction, i.e. a second character.  

5) Use the second character as the seed to generate the third character.  

6) Continue predicting until "\n" is predicted or until 50 characters have been generated.

In [16]:
def generate_name():
    # Pick a random first character, excluding index 0 ("\n").
    index = np.random.randint(1, vocab_size)
    dino_name = [ix_to_char[index]]

    counter = 0
    newline_character = char_to_ix['\n']
    a_next = np.zeros((hidden_size, 1))    # initial hidden state

    # Sample one character at a time until "\n" is drawn or 50 characters are generated.
    while (index != newline_character and counter != 50):
        # One-hot encode the current character.
        xt = np.zeros((vocab_size, 1))
        xt[index, :] = 1

        # One forward step of the RNN using the weights copied from the Keras model.
        a_next = np.tanh(np.dot(Waa.T, a_next) + np.dot(Wax.T, xt) + ba.reshape((hidden_size, 1)))
        pred = softmax(np.dot(Wya.T, a_next) + by.reshape((vocab_size, 1)))
        # Draw the next character from the predicted distribution.
        index = np.random.choice(range(vocab_size), p=pred.ravel())

        dino_name.append(ix_to_char[index])
        counter += 1

    print(''.join(dino_name), end="")
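
The sampling loop can be tried without a trained model. The sketch below is a self-contained variant that takes the weights as arguments instead of reading globals and uses small random matrices in place of trained weights, so the output is noise. The function name `sample_indices`, the seeded generator, and the random weights are all assumptions for illustration; the weight shapes follow the Keras layout used above (input kernel 27x50, recurrent kernel 50x50, Dense kernel 50x27).

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def sample_indices(Wax, Waa, ba, Wya, by, first_index, rng,
                   end_index=0, max_steps=50):
    """Sample character indices until end_index is drawn or max_steps is reached."""
    vocab_size, hidden_size = Wax.shape
    a = np.zeros((hidden_size, 1))           # initial hidden state
    index = first_index
    out = [index]
    steps = 0
    while index != end_index and steps < max_steps:
        x = np.zeros((vocab_size, 1))        # one-hot input for the current character
        x[index] = 1
        a = np.tanh(Waa.T @ a + Wax.T @ x + ba.reshape(-1, 1))
        p = softmax(Wya.T @ a + by.reshape(-1, 1))
        index = int(rng.choice(vocab_size, p=p.ravel()))
        out.append(index)
        steps += 1
    return out

# Random stand-in weights (NOT trained), with the same shapes as the Keras layers.
rng = np.random.default_rng(0)
vocab, hidden = 27, 50
Wax = rng.normal(scale=0.1, size=(vocab, hidden))
Waa = rng.normal(scale=0.1, size=(hidden, hidden))
ba = np.zeros(hidden)
Wya = rng.normal(scale=0.1, size=(hidden, vocab))
by = np.zeros(vocab)

indices = sample_indices(Wax, Waa, ba, Wya, by, first_index=1, rng=rng)
print(indices)
```

Passing the weights explicitly makes the function independent of the order in which notebook cells are run.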

Train the model for 10 iterations of 5 epochs each. After each iteration, seven randomly generated dinosaur names are printed.  


In [17]:
n_iterations = 10    # renamed from iter to avoid shadowing the built-in iter()
epoch_size = 5
for iteration in range(n_iterations):    
    print("Iteration %d :" % (iteration +1))
    
    model.fit(x = X_onehot, y = Y_onehot, epochs = epoch_size)
    
    weights = []
    for layer in model.layers:
        w = layer.get_weights()
        weights.append(w)

    Wax,Waa,ba = weights[0]
    Wya,by = weights[1]
    
    for i in range(7):
        generate_name()

Iteration 1 :
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
zgkibvu
vksbae
ugpijctmndjg
ojnotoraiu
plbuvdg
ktuywcayeyiqav
bgdbw
Iteration 2 :
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
iityhesfo
xporlpusu
uszbpngpcuhsf
mawhassuus
hjjzrsarwualog
cgnmzosurshud
qanxag
Iteration 3 :
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
vqgrpauialso
ejmuooua
bssxmpsiuor
tlargatroxryut
xhsauasaavgus
euwcwbsorsu
ftnvssr
Iteration 4 :
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
bbezhisnasoooh
iksaaoksa
uf
jthtlataassksuu
mutfaaaprp
zidnsnudnkr
sdkrisdp
Iteration 5 :
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
ctaabadkjritu
yqharspop
tboeoauossus
czeelarsdsu
vhsdtsrrju
uq
ttkxapuodoi
Iteration 6 :
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
b
cmpxumrno
dpgojaiahkapoehus
zlira
yb
xurfxapuus
ulhxsrotqabba
Iteration 7 :
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
ofurloepun
pabphtslrhi
ppsjuaueus
hemvuhhuaepsa
shhizuostiaon
ugdanclasshrtsssi
kustnolkuftt
Iteration 8 

Run generate_name() to generate a new dinosaur name. After 10 iterations, the generated names are fairly respectable.

In [19]:
generate_name()

azhoqnuy


  
References:  

  1) Adapted from Coursera / Deeplearning.ai / Module 5 / Sequence Model / Character level language model - Dinosaurus land.  
  2) Deep Learning with Python - Francois Chollet  
  3) Deep Learning with Keras - Antonio Gulli, Sujit Pal
