# Language modeling with an RNN

In this section, the input is a text document, and our goal is to train a model that can generate new text similar in style to that document. In character-level language modeling, the input is broken down into a sequence of characters that are fed into the network one character at a time. The network processes each new character together with its memory of the previously seen characters to predict the next one. This approach is inspired by the paper *Generating Text with Recurrent Neural Networks* by Ilya Sutskever, James Martens, and Geoffrey E. Hinton, Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, available at <https://pdfs.semanticscholar.org/93c2/0e38c85b69fc2d2eb314b3c1217913f7db11.pdf>.

### Data preprocessing

We downloaded the book *The Mysterious Island*, by Jules Verne (published in 1874) in plain text format from the Project Gutenberg website at <https://www.gutenberg.org/files/1268/1268-0.txt>.

In [5]:
import numpy as np

#read in the text file
with open('1268-0.txt', 'r', encoding='utf-8') as fp:
    text = fp.read()

#cut out filler from beginning and end
start_idx = text.find('THE MYSTERIOUS ISLAND')
end_idx = text.find('*** END OF THE PROJECT GUTENBERG')
text = text[start_idx:end_idx]

#create set of unique characters
char_set = set(text)

#display length of text and number of unique characters
print('Total length:', len(text), '\nUnique characters:', len(char_set))

Total length: 1112296 
Unique characters: 80


To convert the text into a numeric format, we create a Python dictionary called `char2int` that maps each character to an integer. We also need the reverse mapping to convert the model's outputs back to text. A NumPy array of the sorted characters handles this efficiently: indexing the array with an integer (or an array of integers) returns the corresponding characters.

In [11]:
#create dictionary to map characters to integers
chars_sorted = sorted(char_set)
char2int = {ch:i for i,ch in enumerate(chars_sorted)}

#create reverse mapping via indexing a numpy array
char_array = np.array(chars_sorted)
text_encoded = np.array([char2int[ch] for ch in text], dtype=np.int32)

We can look at the first few integers of the encoded text and the characters that they map to.

In [14]:
for ex in text_encoded[:21]:
    print('{} <-> {}'.format(ex, char_array[ex]))

44 <-> T
32 <-> H
29 <-> E
1 <->  
37 <-> M
48 <-> Y
43 <-> S
44 <-> T
29 <-> E
42 <-> R
33 <-> I
39 <-> O
45 <-> U
43 <-> S
1 <->  
33 <-> I
43 <-> S
36 <-> L
25 <-> A
38 <-> N
28 <-> D
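
The two mappings can be checked with a quick round trip. The following is a minimal sketch on a toy string rather than the full book; the `toy_`-prefixed names are illustrative assumptions and are not part of the notebook. It builds the mappings the same way as above and confirms that decoding recovers the original text.

```python
import numpy as np

# Toy stand-in for the book text (illustrative only)
toy_text = 'THE MYSTERIOUS ISLAND'

# Same construction as above: sorted vocabulary, dict for encoding, array for decoding
toy_chars_sorted = sorted(set(toy_text))
toy_char2int = {ch: i for i, ch in enumerate(toy_chars_sorted)}
toy_char_array = np.array(toy_chars_sorted)

# Encode to integers, then decode by indexing the array with the whole integer sequence
toy_encoded = np.array([toy_char2int[ch] for ch in toy_text], dtype=np.int32)
toy_decoded = ''.join(toy_char_array[toy_encoded])

print(toy_decoded == toy_text)
```

Indexing the array with the full integer sequence decodes every character in one step, which is why the array is convenient for the reverse mapping.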


To implement the text generation task in PyTorch, we first clip the sequence length to $40$, so that the input tensor consists of $40$ tokens. For shorter sequences, the model might focus on capturing individual words correctly while largely ignoring the context. Although longer sequences usually result in more meaningful sentences, the RNN model will have problems capturing long-range dependencies. So, we choose $40$ characters as a trade-off.

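We therefore split the text into overlapping chunks of size $41$. Within each chunk, the first $40$ characters form the input sequence and the last $40$ characters form the target sequence, so the target is the input shifted by one position: the target at each step is the character that follows the corresponding input character.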
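
To make the one-character shift concrete, here is a small sketch of the same split on a six-character string with a window of $5$ instead of $41$. The string and the `toy_`-prefixed names are assumptions chosen for illustration only.

```python
# Toy illustration with a window of 5 characters instead of 41 (sizes are assumptions)
toy_text = 'island'
toy_seq_length = 4
toy_chunk_size = toy_seq_length + 1

# Slide a window of length toy_chunk_size over the text, one character at a time
toy_chunks = [toy_text[i:i + toy_chunk_size]
              for i in range(len(toy_text) - toy_chunk_size + 1)]

# The input drops the last character; the target drops the first (shifted by one)
toy_pairs = [(chunk[:-1], chunk[1:]) for chunk in toy_chunks]

for inp, tgt in toy_pairs:
    print(repr(inp), '->', repr(tgt))
```

Each target string lines up with its input so that position $t$ of the target holds the character that follows position $t$ of the input.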

In [15]:
import torch
from torch.utils.data import Dataset

#define chunks of size 41
seq_length = 40
chunk_size = seq_length + 1
text_chunks = [text_encoded[i:i+chunk_size] for i in range(len(text_encoded)-chunk_size+1)]

#define custom Dataset class to transform chunks into a dataset
class TextDataset(Dataset):
    
    def __init__(self, text_chunks): 
        self.text_chunks = text_chunks
        
    def __len__(self):
        return len(self.text_chunks)
    
    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
        return text_chunk[:-1].long(), text_chunk[1:].long()

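#stack chunks into one NumPy array first; building a tensor from a list of arrays is slow
seq_dataset = TextDataset(torch.tensor(np.array(text_chunks)))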


Finally, we wrap this dataset in a `DataLoader` object that yields shuffled batches of $64$ sequences.

In [16]:
from torch.utils.data import DataLoader

#create DataLoader with batch size 64
batch_size = 64
torch.manual_seed(1) #for reproducibility
seq_dl = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
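
Before moving on, it can help to confirm what a single batch looks like. The sketch below uses randomly generated integers in place of the encoded book, with `TensorDataset` standing in for the custom `TextDataset`; the `fake_`-prefixed names and the count of 200 chunks are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the encoded chunks: 200 chunks of 41 integer tokens (assumed sizes)
torch.manual_seed(1)
fake_chunks = torch.randint(0, 80, (200, 41))

# Inputs are the first 40 tokens and targets the last 40, as in TextDataset
fake_dataset = TensorDataset(fake_chunks[:, :-1], fake_chunks[:, 1:])
fake_dl = DataLoader(fake_dataset, batch_size=64, shuffle=True, drop_last=True)

# Draw one batch and inspect its shapes
fake_seq, fake_target = next(iter(fake_dl))
print(fake_seq.shape, fake_target.shape)

# With drop_last=True the incomplete final batch is discarded
print(len(fake_dl))
```

Because `drop_last=True`, every batch has exactly 64 rows, which matters later when the hidden state is initialized with a fixed batch size.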

### Building an RNN model

The data is now ready to be loaded into an RNN model, keeping in mind that the first layer of the RNN will be an embedding layer. Below, we define an RNN model that follows the embedding layer with an LSTM layer, and finally a single fully-connected layer which outputs $80$ logits, one for each character in the vocabulary. We will sample from these model predictions to generate new text later.

In [17]:
import torch.nn as nn

#define RNN class
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        
        #embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        
        #LSTM layer
        self.rnn_hidden_size = rnn_hidden_size
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, batch_first=True)
        
        #fully-connected layer
        self.fc = nn.Linear(rnn_hidden_size, vocab_size)
    
    #define forward pass
    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell
    
    def init_hidden(self, batch_size):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size)
        return hidden, cell
    
#specify model parameters and create an RNN model
vocab_size = len(char_array)
embed_dim = 256
rnn_hidden_size = 512
torch.manual_seed(1) #reproducibility
model = RNN(vocab_size, embed_dim, rnn_hidden_size)
model

RNN(
  (embedding): Embedding(80, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=80, bias=True)
)
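
To see how tensor shapes change through one call to `forward`, the following sketch applies the same three layers step by step outside the class. The layer sizes are taken from the printout above; the random input `x` is an assumption for illustration.

```python
import torch
import torch.nn as nn

# Layer sizes from the model printout; batch size of 64 as in the DataLoader
vocab_size, embed_dim, rnn_hidden_size, batch_size = 80, 256, 512, 64

embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.LSTM(embed_dim, rnn_hidden_size, batch_first=True)
fc = nn.Linear(rnn_hidden_size, vocab_size)

# One character index per batch element, plus zero-initialized LSTM states
x = torch.randint(0, vocab_size, (batch_size,))
hidden = torch.zeros(1, batch_size, rnn_hidden_size)
cell = torch.zeros(1, batch_size, rnn_hidden_size)

# Embed the characters and add a time axis of length 1
emb = embedding(x).unsqueeze(1)
# Run a single LSTM step; the states are returned for the next step
out, (hidden, cell) = rnn(emb, (hidden, cell))
# Project to the vocabulary and drop the time axis: one logit per character
logits = fc(out).reshape(out.size(0), -1)

print(emb.shape, out.shape, logits.shape)
```

The `unsqueeze(1)` call is needed because the model receives one character per sequence at a time, while an LSTM with `batch_first=True` expects input of shape (batch, time, features).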

Now, we define the cross-entropy loss function and the Adam optimizer, and then train the model for $10000$ epochs. Here, an "epoch" is a single training iteration rather than a full pass over the data: in each one, we use only one batch drawn at random from the data loader, `seq_dl`. We also display the training loss every 500 epochs.

In [18]:
#define cross-entropy loss function
loss_fn = nn.CrossEntropyLoss()

#define Adam optimizer with learning rate 0.005
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

#train the model for 10000 epochs
num_epochs = 10000
torch.manual_seed(1) #reproducibility
for epoch in range(num_epochs):
    hidden, cell = model.init_hidden(batch_size)
    seq_batch, target_batch = next(iter(seq_dl))
    
    #reset gradients to zero
    optimizer.zero_grad()
    
    loss = 0
    for c in range(seq_length):
        #generate predictions and sum loss
        pred, hidden, cell = model(seq_batch[:, c], hidden, cell)
        loss += loss_fn(pred, target_batch[:, c])
        
    #compute gradients
    loss.backward()
    
    #update parameters using gradients
    optimizer.step()
    
    #average the summed loss over the sequence length (mean loss per character)
    loss = loss.item()/seq_length
    
    if epoch % 500 == 0:
        print(f'Epoch {epoch} loss: {loss:.4f}')

Epoch 0 loss: 4.3721
Epoch 500 loss: 1.3314
Epoch 1000 loss: 1.2173
Epoch 1500 loss: 1.2364
Epoch 2000 loss: 1.1594
Epoch 2500 loss: 1.1852
Epoch 3000 loss: 1.1724
Epoch 3500 loss: 1.1770
Epoch 4000 loss: 1.2197
Epoch 4500 loss: 1.1854
Epoch 5000 loss: 1.1805
Epoch 5500 loss: 1.0848
Epoch 6000 loss: 1.1181
Epoch 6500 loss: 1.1583
Epoch 7000 loss: 1.1011
Epoch 7500 loss: 1.1376
Epoch 8000 loss: 1.0911
Epoch 8500 loss: 1.1532
Epoch 9000 loss: 1.1041
Epoch 9500 loss: 1.1578
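
As a sanity check on the first logged value, note that a model which spreads its probability evenly over all characters has a cross-entropy of $\ln(\text{vocab size})$. The short sketch below computes this reference value for the 80-character vocabulary; treating the untrained network as roughly uniform is an assumption, not something the notebook verifies.

```python
import math

vocab_size = 80  # number of unique characters reported earlier

# Cross-entropy of a uniform prediction over vocab_size classes: -log(1 / vocab_size)
uniform_loss = math.log(vocab_size)
print(f'{uniform_loss:.4f}')
```

The result is about $4.38$, close to the loss logged at epoch 0, which is consistent with the network starting out with nearly uniform predictions.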


We now define a function `sample` that takes an input string and generates an output string autoregressively, character by character: each generated character is fed back into the model as the input for generating the next one. Note that we randomly sample the next character from the distribution defined by the logits that the RNN outputs. If we always chose the most likely character instead, the output would be fully determined by the input string, so the model would produce the same text every time. To draw these samples, we can use the class `torch.distributions.categorical.Categorical`.
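
The difference between always taking the most likely character and sampling can be seen on a tiny example. The sketch below uses hypothetical logits over a three-character vocabulary; the values and variable names are assumptions for illustration.

```python
import torch
from torch.distributions.categorical import Categorical

# Hypothetical logits over a 3-character vocabulary (values chosen for illustration)
toy_logits = torch.tensor([2.0, 1.0, 0.5])

# Greedy decoding: argmax returns the index of the largest logit every time
greedy_choices = [int(torch.argmax(toy_logits)) for _ in range(5)]

# Random sampling: each index is drawn with probability softmax(logits)
torch.manual_seed(1)
dist = Categorical(logits=toy_logits)
sampled_choices = dist.sample((1000,))

print(greedy_choices)
# Empirical frequency of each index among the 1000 samples
print(torch.bincount(sampled_choices, minlength=3) / 1000)
```

The greedy choice never varies, while the sampled indices follow the softmax probabilities, so less likely characters still appear some of the time.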

In [19]:
from torch.distributions.categorical import Categorical

#define function to autoregressively generate new string from input string
def sample(model, starting_str, len_generated_text=500, scale_factor=1.0):
    
    #encode starting_str to a sequence of integers
    encoded_input = torch.tensor([char2int[s] for s in starting_str])
    encoded_input = torch.reshape(encoded_input, (1, -1))
    
    #initially assign starting_str to generated_str
    generated_str = starting_str
    
    #pass encoded_input to the RNN one character at a time
    model.eval()
    hidden, cell = model.init_hidden(1)
    for c in range(len(starting_str)-1):
        _, hidden, cell = model(encoded_input[:, c].view(1), hidden, cell)
    
    #pass last character of encoded_input to RNN to generate a new character
    last_char = encoded_input[:, -1]
    for i in range(len_generated_text):
        
        #obtain logits output from model
        logits, hidden, cell = model(last_char.view(1), hidden, cell)
        logits = torch.squeeze(logits, 0)
        scaled_logits = logits * scale_factor
        
        #build a categorical distribution from the scaled logits
        m = Categorical(logits=scaled_logits)
        
        #draw the index of the next character from the distribution
        last_char = m.sample()
        
        #append the sampled character to the end of the generated string
        generated_str += str(char_array[last_char])
    
    return generated_str

We can use the `sample` function to generate some new text.

In [20]:
#generate new text using sample
torch.manual_seed(1) #reproducibility
print(sample(model, starting_str='The island'))

The island would decid the
slopes would not recove colonists, strewn was, a tropical hold, was unaccountable, and fallen had been
thrown that she reporter, will become that it is a greatly band, re-athought, beltigated with clean from mouth being claws, but
a glasse to have been very further side of the island.

The colonists appardulated as a clearned most a more of question was from the whole of the implicia, and even had no skinal perfected upon the island was casured by them from twending them to them


We can further tune the training parameters, such as the length of the input sequences, as well as the model architecture. We can also change the `scale_factor` parameter to control the randomness of the generated text. Because `scale_factor` multiplies the logits before sampling, it acts as an inverse temperature: values larger than $1$ sharpen the distribution and make the text more predictable, while values that decrease toward zero flatten the distribution toward uniform and make the text more random.
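
The effect of scaling the logits can be illustrated without the trained model. The following sketch applies a plain-Python softmax to hypothetical logits at three scale factors; the `softmax` helper and the logit values are assumptions for illustration and are not part of the notebook.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-character vocabulary (illustrative values)
toy_logits = [2.0, 1.0, 0.5]

# Probability of the most likely character at each scale factor
top_probs = {}
for scale_factor in (0.5, 1.0, 2.0):
    probs = softmax([v * scale_factor for v in toy_logits])
    top_probs[scale_factor] = max(probs)
    print(scale_factor, [round(p, 3) for p in probs])
```

The probability assigned to the most likely character grows as `scale_factor` increases, which is why larger values give more predictable text and smaller values give more random text.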

In [21]:
#generate text with scale_factor=2
torch.manual_seed(1)
print(sample(model, starting_str='The island', scale_factor=2.0))

The island was the car was on the engineer was struck a few hours.

“What a tree, and on the beach,” answered the engineer, “but not with the summit of the engineer had taken present them to the corral, which even the sea because the brig was also they were already been strewn themselves, in the presence of the mountain which appeared to last the corral. There was no other traces of the lake, and it was still some day the productions were constructed an inhabitants of the cold and running and east forth t


In [22]:
#generate text with scale_factor=0.5
torch.manual_seed(1)
print(sample(model, starting_str='The island', scale_factor=0.5))

The island would decid the
sluce Destem 28 yiedd!” crianfaced the cabin., a tuphers. Howh, yab
undoerae, but, at fealer, he anvednts would had quarterminated,” but
yet My undress as frayoeh; eiging insidett,, eltigees instructed lying!0 mean, Neb dered with
high,

A !q
Was Amer, but saufl. Towirs sharple. Natuationless azod feared
burst reason amPnd, ammossed towards question were
fload of petuently being him.-

March Tad, Top Jup!” exclaneany?”

“It is Taddy your bruish;, our mountain widen feb; themse-f
