While taking Udacity's **Intro to Deep Learning with PyTorch** course, I really liked an exercise based on building a character-level language model with LSTMs. I was unable to complete it entirely on my own, since NLP is still a very new field to me. I decided to try the exercise with `tensorflow 2.0`, and thanks to the ease of use of `keras`, I was able to develop a very simple LSTM-based language model that predicts a single character given a sequence of characters.

The exercise uses the novel **Anna Karenina** by Leo Tolstoy as its data. I used only a small subset of it in this notebook, though.

In [0]:
!pip install tensorflow-gpu==2.0.0-beta1

In [28]:
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

print(tf.__version__)

2.0.0-beta1


I start by loading the novel. 

In [0]:
# Open text file and read in data as `text`
with open('anna.txt', 'r') as f:
    text = f.read()

In [5]:
# First hundred characters
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

The text will start to look ugly now :(

In [0]:
# Strip all the new lines
tokens = text.split()
text_without_nlines = ' '.join(tokens)

I will be using LSTMs to build the language model. An LSTM expects its input as a sequence of one-hot-encoded characters. Each input sequence will be 50 characters long with one output character, making each full sequence 51 characters long.

We can create the sequences by sliding over the characters in the text, starting at the 51st character (index 50).

In [7]:
# Prepare the sequences for the model
length = 50
sequences = []
for i in range(length, len(text_without_nlines)):
    # Select sequence of tokens
    seq = text_without_nlines[i-length:i+1]
    sequences.append(seq)
print('Total Sequences: {}'.format(len(sequences)))

Total Sequences: 1976136
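
To make the windowing easier to picture, here is the same loop on a toy string. This is only a sketch: `toy_text` and `toy_length` are made-up values (3 input characters instead of 50), not part of the notebook.

```python
# Toy version of the windowing loop, with 3 input characters per window
# instead of 50 (the string and the length are illustrative only).
toy_text = 'abcdef'
toy_length = 3

toy_sequences = []
for i in range(toy_length, len(toy_text)):
    # toy_length input characters plus 1 target character
    toy_sequences.append(toy_text[i - toy_length:i + 1])

print(toy_sequences)
# The loop yields len(text) - length windows
print(len(toy_sequences) == len(toy_text) - toy_length)
```

Each window overlaps the previous one by all but one character, which is why the number of sequences is almost as large as the number of characters in the text.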


In [8]:
# Save these sequences for later use
filename = 'char_sequences.txt'
data = '\n'.join(sequences)
with open(filename, 'w') as file:
    file.write(data)
print('File saved!')

File saved!


In [9]:
# Preview
!head -5 char_sequences.txt

Chapter 1 Happy families are all alike; every unhap
hapter 1 Happy families are all alike; every unhapp
apter 1 Happy families are all alike; every unhappy
pter 1 Happy families are all alike; every unhappy 
ter 1 Happy families are all alike; every unhappy f


In [0]:
# Load up the data
with open('char_sequences.txt') as sequences_from_file:
    text = sequences_from_file.read()
lines = text.split('\n')

In [0]:
# Computers only understand numbers, so assign each character a unique integer
# Note: `text` now holds the saved sequences, so '\n' is part of the vocabulary
# Character -> Integer
chars = sorted(list(set(text)))
mapping = dict((c, i) for i, c in enumerate(chars))
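
The mapping can be checked with a quick round trip on a toy string. This is a minimal sketch: `toy_text`, `toy_mapping` and the reverse dictionary `toy_inverse` are hypothetical names used only for illustration.

```python
# Toy corpus; in the notebook the corpus is the text read from the file.
toy_text = 'hello world'

# Character -> integer, built the same way as `mapping` above
toy_chars = sorted(set(toy_text))
toy_mapping = {c: i for i, c in enumerate(toy_chars)}

# Integer -> character (hypothetical helper dict, not part of the notebook)
toy_inverse = {i: c for c, i in toy_mapping.items()}

encoded = [toy_mapping[c] for c in 'hello']
decoded = ''.join(toy_inverse[i] for i in encoded)
print(encoded, decoded)
```

A reverse dictionary like this turns the integer-to-character lookup into a single dictionary access instead of a loop over all items.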

In [0]:
# Convert the sequences to integer encodings
int_sequences = []
for line in lines:
    encoded_seq = [mapping[char] for char in line]
    int_sequences.append(encoded_seq)

In [13]:
# How big is the vocabulary?
vocab_size = len(mapping)
print('Vocabulary size', vocab_size)

Vocabulary size 83


In [0]:
# X -> y mapping of input sequence in this form
int_sequences = np.array(int_sequences)
X, y = int_sequences[:,:-1], int_sequences[:,-1]

I will be using a very small subset of the data: the first 10,000 sequences.

In [17]:
X[:10000].shape, y[:10000].shape

((10000, 50), (10000,))

The characters have to be one-hot encoded before they are fed to the language model. One-hot encoding keeps the input representation simple while the vocabulary is small (83 characters here), but when the input feature space is very large, character embeddings are usually a better choice.
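
As a small illustration of what one-hot encoding does to the shape of the data, here is a NumPy-only sketch on made-up values (`toy_X` and `toy_vocab_size` are assumptions for the example). It should produce the same kind of array as `to_categorical`, but it is not the code used in this notebook.

```python
import numpy as np

# Toy values: 2 sequences of 3 integer-encoded characters, vocabulary of 4
toy_vocab_size = 4
toy_X = np.array([[0, 2, 1],
                  [3, 3, 0]])

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with toy_X one-hot encodes every position at once.
toy_one_hot = np.eye(toy_vocab_size, dtype='float32')[toy_X]

print(toy_one_hot.shape)   # (samples, timesteps, vocab_size)
print(toy_one_hot[0])
```

With the real data the same step turns an array of shape `(10000, 50)` into one of shape `(10000, 50, 83)`, which is the 3D input an LSTM layer expects.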

In [0]:
one_hot_sequences = [tf.keras.utils.to_categorical(x, num_classes=vocab_size) for x in X[:10000]]
X = np.array(one_hot_sequences)
y = tf.keras.utils.to_categorical(y[:10000], num_classes=vocab_size)

In [50]:
# Mini language model :)
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(tf.keras.layers.Dense(vocab_size, activation='softmax'))
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 256)               348160    
_________________________________________________________________
dense_1 (Dense)              (None, 83)                21331     
Total params: 369,491
Trainable params: 369,491
Non-trainable params: 0
_________________________________________________________________
None
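
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard LSTM parameter formula (four weight sets: three gates plus the candidate cell state) with the sizes from the model above; the variable names are only for illustration.

```python
# Values taken from the model above
units = 256        # LSTM hidden size
input_dim = 83     # vocab_size, the length of each one-hot vector
n_classes = 83     # one output per character

# Four weight sets, each with input weights, recurrent weights and a bias
lstm_params = 4 * (units * (input_dim + units) + units)
# Dense layer: one weight per (hidden unit, class) pair plus one bias per class
dense_params = units * n_classes + n_classes

print(lstm_params, dense_params, lstm_params + dense_params)
```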


LSTMs can suffer from exploding gradients. To guard against that, I am going to specify the `clipnorm` argument in the optimizer.
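
To give an idea of what clipping by norm means, here is a plain-Python sketch of the rescaling rule for a single gradient, treated as a flat list of numbers. The function `clip_by_norm` below is a hypothetical stand-in written for illustration, not the optimizer's actual implementation.

```python
import math

def clip_by_norm(grad, clipnorm):
    """Rescale `grad` (a flat list of floats) so its L2 norm is at most `clipnorm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clipnorm:
        # Small gradients pass through unchanged
        return list(grad)
    # Large gradients keep their direction but are shrunk to length `clipnorm`
    scale = clipnorm / norm
    return [g * scale for g in grad]

large = clip_by_norm([3.0, 4.0], clipnorm=0.5)   # norm 5.0, gets rescaled
small = clip_by_norm([0.1, 0.2], clipnorm=0.5)   # norm below 0.5, unchanged
print(large, small)
```

The direction of the update is preserved; only its length is capped, which keeps a single very large gradient from destabilizing training.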

In [0]:
adam = Adam(learning_rate=0.001, clipnorm=0.5)

In [52]:
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
model.fit(X, y, epochs=200, verbose=2)

Train on 10000 samples
Epoch 1/200
10000/10000 - 6s - loss: 3.0353 - accuracy: 0.1812
Epoch 2/200
10000/10000 - 5s - loss: 2.6381 - accuracy: 0.2876
Epoch 3/200
10000/10000 - 5s - loss: 2.4285 - accuracy: 0.3235
Epoch 4/200
10000/10000 - 5s - loss: 2.3118 - accuracy: 0.3460
Epoch 5/200
10000/10000 - 5s - loss: 2.2234 - accuracy: 0.3619
Epoch 6/200
10000/10000 - 5s - loss: 2.1240 - accuracy: 0.3865
Epoch 7/200
10000/10000 - 5s - loss: 2.0552 - accuracy: 0.4067
Epoch 8/200
10000/10000 - 5s - loss: 1.9887 - accuracy: 0.4269
Epoch 9/200
10000/10000 - 5s - loss: 1.9183 - accuracy: 0.4406
Epoch 10/200
10000/10000 - 5s - loss: 1.8567 - accuracy: 0.4588
Epoch 11/200
10000/10000 - 5s - loss: 1.7977 - accuracy: 0.4738
Epoch 12/200
10000/10000 - 5s - loss: 1.7364 - accuracy: 0.4836
Epoch 13/200
10000/10000 - 5s - loss: 1.6838 - accuracy: 0.5007
Epoch 14/200
10000/10000 - 5s - loss: 1.6153 - accuracy: 0.5217
Epoch 15/200
10000/10000 - 5s - loss: 1.5537 - accuracy: 0.5358
Epoch 16/200
10000/10000 -

<tensorflow.python.keras.callbacks.History at 0x7f0e20c5f860>

The training loss keeps decreasing and the accuracy keeps increasing, which is a good sign. Keep in mind that both are measured on the training data only, since no validation set is used here.

Now that the model is trained, we can use it to generate characters from a given sequence of characters. To do this, the model requires its input to be exactly in the shape it was trained with. If we give it an input sequence that does not *exactly* match the shape of the training input sequences, we will get errors.

We will use the `pad_sequences()` function, which truncates characters from the beginning of input sequences that are too long and pads sequences that are too short with extra values at the front (zeros, essentially). We will define a small helper function that generates a user-specified number of characters. The user will have to provide some initial text to the model, though.
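
The effect of `padding='pre'` and `truncating='pre'` on a single sequence can be sketched in plain Python. The helper `pad_or_truncate_pre` is a hypothetical name that only mimics the behaviour described above for one list of integers; it is not the Keras function itself.

```python
def pad_or_truncate_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items of `seq`, left-padding with `value` if short.

    Assumes maxlen >= 1.
    """
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = pad_or_truncate_pre([5, 6, 7], maxlen=5)              # zeros added in front
long = pad_or_truncate_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)   # front dropped
print(short, long)
```

One thing to be aware of: the padding value 0 is also a valid index in `mapping`, so after one-hot encoding the padded positions look to the model like whichever character was assigned index 0.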

In [0]:
def generate_seq(model, mapping, seq_length, init_text, n_chars):
  in_text = init_text
  # Generate a fixed number of characters
  for _ in range(n_chars):
    # Encode to integers
    encoded = [mapping[char] for char in in_text]
    # Map sequences to a fixed length
    encoded = pad_sequences([encoded], maxlen=seq_length, padding='pre', truncating='pre')
    # print(encoded.shape)
    # One-hot encode
    encoded = tf.keras.utils.to_categorical(encoded, num_classes=vocab_size)
    # print(encoded.shape)
    # Predict character
    yhat = model.predict_classes(encoded, verbose=0)
    # Integer -> Character
    out_char = ''
    for char, index in mapping.items():
      if index == yhat:
        out_char = char
        break
    # Append the predicted character to the input text
    in_text += out_char
  return in_text

In [54]:
# Let's test
print(generate_seq(model, mapping, 50, 'And Levin said', 20))

And Levin said. "What's to be done


In [55]:
print(generate_seq(model, mapping, 50, 'Happy families', 20))

Happy families, In the glance, in 


The model does generate something meaningful. At this stage it is nothing more than a single LSTM layer followed by a dense output layer, and its power is already evident.