## Character Level Language Modelling

In character-level language modelling, a trained model generates text one character at a time. This calls for a many-to-many RNN: at each time step the network receives a character and outputs a prediction for the next character in the sequence.

In this particular notebook, I've used an LSTM as the choice of RNN.

In [1]:
import numpy as np
import tensorflow as tf

### Dataset

I've used a baby names dataset from the internet to generate new names and see how well the model does. I also tried dinosaur names (used in one of the assignments in Coursera's Sequence Models course), but the approach should work for any similar dataset.

The dataset has about 1,000 names separated by the newline character.

In [2]:
# Replace names.txt with any document which contains a list of names, one per line
with open('names.txt', 'r') as f:
    file = f.read().lower()

In [3]:
print(file[:100])

emma
olivia
ava
isabella
sophia
charlotte
mia
amelia
harper
evelyn
abigail
emily
elizabeth
mila
ella


### Preprocessing

Find the unique characters in the dataset.

Create two dictionaries to hold the mappings from characters to numbers and from numbers to characters.

In [4]:
unique = sorted(list(set(file)))
num_char = {i:j for i,j in enumerate(unique)} # numbers to characters
char_num = {j:i for i,j in num_char.items()} # characters to numbers
vocab_size = len(unique)
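As a quick sanity check, the two mappings can be exercised with a round trip from text to integers and back. The sketch below rebuilds them on a tiny stand-in corpus rather than `names.txt`; the helpers `encode` and `decode` are hypothetical names introduced only for illustration.

```python
# Toy corpus standing in for names.txt (assumption: lowercase names, one per line)
text = "emma\nolivia\nava\n"

unique = sorted(set(text))
num_char = dict(enumerate(unique))               # numbers to characters
char_num = {c: i for i, c in num_char.items()}   # characters to numbers

def encode(name):
    """Hypothetical helper: map each character to its integer index."""
    return [char_num[c] for c in name]

def decode(indices):
    """Hypothetical helper: map integer indices back to a string."""
    return "".join(num_char[i] for i in indices)

codes = encode("ava\n")
print(codes)
print(repr(decode(codes)))
```

Because `'\n'` sorts before the lowercase letters, it ends up at index 0 in this toy vocabulary.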

Split the data into a list containing the names and shuffle them.

During training, the input to the model is the list of names as it is, and the target is the same list with every name shifted by one character: the first character is dropped and a newline is appended. This way, the target at each position is the character that follows the input at that position, which is exactly what we want the model to predict.
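To make the input/target pairing concrete, here is a minimal sketch for a single name. It only illustrates the alignment and is not part of the training pipeline.

```python
name = "emma"

x = name               # input sequence, fed to the model as is
y = name[1:] + "\n"    # target sequence: drop the first character, append a newline

# At every position the target is the character that follows the input
for inp, tgt in zip(x, y):
    print(repr(inp), "->", repr(tgt))
```

Both sequences have the same length, so each input step has exactly one target.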

In [5]:
data = file.split('\n')
np.random.shuffle(data)

X = data[:]
# Drop the first character of each name and append the new line character at the end (X and Y need to be the same length)
Y = [i[1:]+'\n' for i in data] 

# Convert the characters to integers using the char_num mapping
# Since each name can be of different lengths, pad all the names with zeros to make them equal to the length of the maximum name
# in the data set
X_code = tf.keras.preprocessing.sequence.pad_sequences([np.array([char_num[j] for j in i]) for i in X], padding = 'post')
Y_code = tf.keras.preprocessing.sequence.pad_sequences([np.array([char_num[j] for j in i]) for i in Y], padding = 'post')

# Convert the integers to onehot vectors to feed to the model
X_hot = np.array([tf.keras.utils.to_categorical(i, num_classes=vocab_size) for i in X_code])
Y_hot = np.array([tf.keras.utils.to_categorical(i, num_classes=vocab_size) for i in Y_code])

In [6]:
X_hot.shape # (number of examples, length of each example, dimension of the one-hot encoded vector)

(1001, 10, 27)
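The shape above comes from two steps: post-padding every sequence to the longest length, then expanding each integer into a one-hot vector. The following is a simplified NumPy stand-in for what `pad_sequences(..., padding='post')` and `to_categorical` do in this notebook; `pad_post` and `one_hot` are hypothetical helpers, and the toy sequences and vocabulary size of 8 are assumptions for illustration.

```python
import numpy as np

def pad_post(seqs, value=0):
    """Hypothetical helper: right-pad integer sequences to a common length."""
    max_len = max(len(s) for s in seqs)
    out = np.full((len(seqs), max_len), value, dtype=np.int64)
    for row, s in enumerate(seqs):
        out[row, :len(s)] = s
    return out

def one_hot(codes, num_classes):
    """Hypothetical helper: turn an integer array into one-hot vectors."""
    return np.eye(num_classes, dtype=np.float32)[codes]

seqs = [[1, 7, 1], [5, 3]]      # two toy "names" as integer codes
padded = pad_post(seqs)
hot = one_hot(padded, num_classes=8)

print(padded)
print(hot.shape)  # (examples, padded length, vocabulary size)
```

Note that the padding value 0 is also a valid character index. In this notebook it presumably coincides with the newline character, since `'\n'` sorts first in `unique`, so padded positions look like end-of-name markers to the model.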

### Model

The input to the model is the one-hot encoded names. Each name in the data is one training example, and the newline character marks the end of the example.

The next layer is an LSTM with 125 units. We set return_sequences to True because this is a many-to-many RNN model and we need the output at every time step. The layer therefore produces one output per input character: 10 outputs for the padded sequences used here, which are 10 characters long. Without return_sequences, it would only output the final value after all the characters have been processed.

Finally, since we need a prediction at every step, each of these outputs is passed through a dense layer with a softmax activation to predict the next character.
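The difference between returning every step and returning only the last one can be illustrated without Keras. The sketch below uses a plain tanh RNN in NumPy, not an LSTM, with random weights and a tiny hidden size; all of these are assumptions chosen only to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)
T, vocab, hidden = 10, 27, 4    # hidden is tiny here; the notebook uses 125 LSTM units

x = rng.normal(size=(T, vocab))                 # one input sequence
W_x = rng.normal(size=(vocab, hidden)) * 0.1    # input-to-hidden weights
W_h = rng.normal(size=(hidden, hidden)) * 0.1   # hidden-to-hidden weights
W_out = rng.normal(size=(hidden, vocab)) * 0.1  # stand-in for the Dense layer

h = np.zeros(hidden)
states = []
for t in range(T):
    h = np.tanh(x[t] @ W_x + h @ W_h)   # plain tanh RNN step, not an LSTM
    states.append(h)

all_states = np.stack(states)   # analogous to return_sequences=True: (T, hidden)
last_state = all_states[-1]     # analogous to return_sequences=False: (hidden,)

# Softmax applied independently at every time step
logits = all_states @ W_out
logits = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(all_states.shape, last_state.shape, probs.shape)
```

Each row of `probs` is a distribution over the vocabulary for one time step, which is what the softmax layer in the notebook's model produces per character.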

In [7]:
inputs = tf.keras.Input(shape=(X_hot.shape[1],X_hot.shape[2]))
x = tf.keras.layers.LSTM(125, activation=tf.nn.relu, return_sequences=True)(inputs)
outputs = tf.keras.layers.Dense(vocab_size, activation=tf.nn.softmax)(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs)

In [8]:
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 10, 27)]          0         
_________________________________________________________________
lstm (LSTM)                  (None, 10, 125)           76500     
_________________________________________________________________
dense (Dense)                (None, 10, 27)            3402      
Total params: 79,902
Trainable params: 79,902
Non-trainable params: 0
_________________________________________________________________
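The parameter counts in the summary can be reproduced by hand. An LSTM layer has four gates, each with an input weight matrix, a recurrent weight matrix and a bias; the dense layer has one weight matrix and one bias. A short check, using the sizes shown in the summary:

```python
input_dim, units, vocab_size = 27, 125, 27

# Four gates, each with (input_dim + units) weights per unit plus one bias per unit
lstm_params = 4 * (units * (input_dim + units) + units)

# One weight per (unit, class) pair plus one bias per class
dense_params = units * vocab_size + vocab_size

print(lstm_params, dense_params, lstm_params + dense_params)  # 76500 3402 79902
```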


In [9]:
model.compile(loss = 'categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
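For reference, categorical cross-entropy compares the predicted distribution with the one-hot target at each time step and penalises low probability on the correct character. The sketch below is a simplified NumPy version for illustration, not the Keras implementation; the function name and the example values are assumptions.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Simplified sketch: per-step cross-entropy for one-hot targets."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# Two time steps over a toy vocabulary of three characters
y_true = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0]])
y_pred = np.array([[0.1, 0.8, 0.1],
                   [0.5, 0.25, 0.25]])

loss = categorical_crossentropy(y_true, y_pred)
print(loss)          # one value per time step: -log(0.8) and -log(0.5)
print(loss.mean())
```

Since the model here uses no masking, padded time steps contribute to the loss and to the accuracy metric like any other step.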

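At the end of every few epochs we can view the output of the LSTM by generating new names. We need to use a callback to generate these outputs. 
The function generate_names generates 5 names every 25 epochs. Keras passes a zero-based epoch index to the callback, so the names appear at the end of epochs 1, 26, 51 and so on.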

In [10]:
def generate_names(epoch, _):
    if epoch % 25 == 0:
        
        print('\nNames After Epoch {}: '.format(epoch))

        # generate 5 names
        for j in range(5):
            name = ''
            x = np.zeros((1, X_hot.shape[1], vocab_size)) # Initialize an all-zero input of shape (1, sequence length, vocab_size)
            end = False
            i = 0

            # Keep generating new characters until a new line character is generated or the length of the generated 
            # sequence reaches a certain limit
            # model.predict() returns the outputs of all the time steps, but we need to generate one character at a time,
            # feed it back as the next input and predict the next character again, and so on
            while (not end):
                # Output of the ith time step, cast to float64 so the renormalised probabilities sum to 1 within np.random.choice's tolerance
                probs = np.asarray(model.predict(x)[0,i], dtype=np.float64)
                probs = probs / np.sum(probs)
                index = np.random.choice(range(vocab_size), p=probs) # randomly choose a character from the generated ones according to their probability
                if i == X_hot.shape[1]-2:
                    character = '\n'
                    end = True
                else:
                    character = num_char[index] # convert the integer to a character
                name += character
                x[0, i+1, index] = 1 # Set the input of the next RNN equal to the value generated by the current RNN
                i += 1
                if character == '\n':
                    end = True
            print(name)
    
name_generator = tf.keras.callbacks.LambdaCallback(on_epoch_end = generate_names)
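The sampling step inside the loop is worth isolating. `np.random.choice` checks that the probabilities sum to 1 within a tight tolerance, and a float32 softmax output can miss that by a small rounding error, which is why the vector is renormalised before sampling. The sketch below uses a random float32 vector as a stand-in for one time step of `model.predict`; the greedy `argmax` line is included only for contrast and is not what the notebook does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one time step of the model output (assumption: float32 softmax over 27 characters)
logits = rng.normal(size=27).astype(np.float32)
probs32 = np.exp(logits) / np.exp(logits).sum()

# Cast to float64 and renormalise so the sum-to-one check passes reliably
probs = np.asarray(probs32, dtype=np.float64)
probs /= probs.sum()

index = rng.choice(len(probs), p=probs)   # stochastic: different names on every call
greedy = int(np.argmax(probs))            # deterministic alternative, shown for contrast

print(index, greedy)
```

Sampling according to the predicted probabilities is what lets the callback print five different names from the same model state; always taking the most likely character would produce the same name every time.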

In [11]:
model.fit(X_hot, Y_hot, epochs=300, callbacks=[name_generator], verbose=1)

Train on 1001 samples
Epoch 1/300
Names After Epoch 0: 
pwnr

dqlve

pyinam

ohxfen

gmsam

Epoch 2/300
Epoch 3/300
Epoch 4/300
Epoch 5/300
Epoch 6/300
Epoch 7/300
Epoch 8/300
Epoch 9/300
Epoch 10/300
Epoch 11/300
Epoch 12/300
Epoch 13/300
Epoch 14/300
Epoch 15/300
Epoch 16/300
Epoch 17/300
Epoch 18/300
Epoch 19/300
Epoch 20/300
Epoch 21/300
Epoch 22/300
Epoch 23/300
Epoch 24/300
Epoch 25/300
Epoch 26/300
Names After Epoch 25: 
ailah

cilynh

aelya

zlasy

saria

Epoch 27/300
Epoch 28/300
Epoch 29/300
Epoch 30/300
Epoch 31/300
Epoch 32/300
Epoch 33/300
Epoch 34/300
Epoch 35/300
Epoch 36/300
Epoch 37/300
Epoch 38/300
Epoch 39/300
Epoch 40/300
Epoch 41/300
Epoch 42/300
Epoch 43/300
Epoch 44/300
Epoch 45/300
Epoch 46/300
Epoch 47/300
Epoch 48/300
Epoch 49/300
Epoch 50/300
Epoch 51/300
Names After Epoch 50: 
anror

lienhy

felvie

voreney

iaana

Epoch 52/300
Epoch 53/300
Epoch 54/300
Epoch 55/300
Epoch 56/300
Epoch 57/300
Epoch 58/300
Epoch 59/300
Epoch 60/300
Epoch 61/300
Epoch 62/300
Ep

<tensorflow.python.keras.callbacks.History at 0x7fb4d46db4e0>

The model does a reasonable job of generating names that sound similar to names used in English. Some of the generated names may look odd simply because they are not in common use. If the dataset had dinosaur names or other scientific names, the results would probably look more convincing, since such names share common endings like 'saurus'.