# Description

A simple character-level RNN that generates new snippets of text.

In [1]:
%load_ext watermark
%watermark -a "pytholic" -v -p torch,torchtext

Author: pytholic

Python implementation: CPython
Python version       : 3.11.6
IPython version      : 8.16.1

torch    : 2.1.0
torchtext: 0.16.0



Download the book *The Mysterious Island* by Jules Verne in plain text format.

In [2]:
!curl -O https://www.gutenberg.org/files/1268/1268-0.txt


# Reading and preprocessing text

First, we read the text from the file and strip the portions at the beginning and the end, which contain Project Gutenberg descriptions rather than the book itself.

In [3]:
import numpy as np

with open("1268-0.txt", "r", encoding="utf8") as f:
    text = f.read()
start_idx = text.find("THE MYSTERIOUS ISLAND")
end_idx = text.find("End of the Project Gutenberg")
text = text[start_idx:end_idx]
char_set = set(text)  # removes duplicates -> get unique characters

In [4]:
print(f"Text Length: {len(text)}")
print(f"Unique characters: {len(char_set)}")

Text Length: 1130711
Unique characters: 85


Now we need to convert this data to numeric format. To do this, we will create a simple Python dictionary that maps each character to an integer. We will also need a reverse mapping to convert the results of our model back to text.

In [5]:
chars_sorted = sorted(char_set)
char2int = {ch:i for i,ch in enumerate(chars_sorted)}
# int2chr = {i:ch for i,ch in enumerate(chars_sorted)}
char_array = np.array(chars_sorted) # more efficient than dict

In [6]:
text_encoded = np.array([char2int[ch] for ch in text], dtype=np.int32)
print("Text encoded shape: ", text_encoded.shape)
print(f"{text[:21]} == Encoding ==> {text_encoded[:21]}")
print(text_encoded[:21], "== Reverse ==>", "".join(char_array[text_encoded[:21]]))

Text encoded shape:  (1130711,)
THE MYSTERIOUS ISLAND == Encoding ==> [48 36 33  1 41 53 47 48 33 46 37 43 49 47  1 37 47 40 29 42 32]
[48 36 33  1 41 53 47 48 33 46 37 43 49 47  1 37 47 40 29 42 32] == Reverse ==> THE MYSTERIOUS ISLAND


For text generation, we can frame the problem as classification: given the characters seen so far, predict the next character. Since we have 85 unique characters, this is a *multiclass* classification task.
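
To make the classification framing concrete, here is a minimal sketch with toy tensors (not the notebook's data) that scores a batch of next-character predictions with `nn.CrossEntropyLoss`. It assumes all-zero logits, which correspond to a uniform guess over the 85 characters, so the loss equals ln(85) ≈ 4.44. An untrained model typically starts near this value.

```python
import math

import torch
import torch.nn as nn

vocab_size = 85   # number of unique characters in the text
batch_size = 4    # toy value, chosen only for illustration

# All-zero logits: every character is considered equally likely.
logits = torch.zeros(batch_size, vocab_size)
# Integer class indices of the "true" next characters (arbitrary toy values).
targets = torch.tensor([0, 17, 42, 84])

loss = nn.CrossEntropyLoss()(logits, targets)
print(f"loss = {loss.item():.4f}, ln(85) = {math.log(vocab_size):.4f}")
```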

We will clip the sequence length to 40. Longer sequences give the model more context, but an RNN has trouble capturing *long-range* dependencies. The sequence length is a hyperparameter that has to be tuned empirically.

In [7]:
import torch
from torch.utils.data import Dataset

We will split the text into overlapping chunks of 41 characters. The first 40 characters form the input sequence *x*, and the last 40 form the target sequence *y* (the target is simply the input shifted by one character).

In [8]:
seq_length = 40 # hyperparameter
chunk_size = seq_length + 1
text_chunks = [text_encoded[i:i+chunk_size] for i in range(len(text_encoded) - chunk_size+1)]
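
As a small illustration of the chunking scheme, the sketch below applies the same windowing to a toy array with a hypothetical sequence length of 4 (all `toy_*` names are made up for this example). It also shows `np.lib.stride_tricks.sliding_window_view` (available from NumPy 1.20) as one possible vectorized alternative to the list comprehension; this is a suggestion, not what the notebook uses.

```python
import numpy as np

toy_encoded = np.arange(8, dtype=np.int32)  # stand-in for text_encoded
toy_seq_length = 4                          # hypothetical, for illustration
toy_chunk_size = toy_seq_length + 1

# Same list comprehension as above, applied to the toy array.
toy_chunks = [
    toy_encoded[i:i + toy_chunk_size]
    for i in range(len(toy_encoded) - toy_chunk_size + 1)
]

# Vectorized alternative: a read-only 2-D view of all overlapping windows.
toy_windows = np.lib.stride_tricks.sliding_window_view(toy_encoded, toy_chunk_size)

# Input and target for the first chunk: the target is the input shifted by one.
x, y = toy_windows[0][:-1], toy_windows[0][1:]
print("Number of chunks:", len(toy_chunks))
print("Input  (x):", x)
print("Target (y):", y)
```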

In [9]:
class TextDataset(Dataset):
    def __init__(self, text_chunks):
        self.text_chunks = text_chunks

    def __len__(self):
        return len(self.text_chunks)

    def __getitem__(self, idx):
        text_chunk = self.text_chunks[idx]
        return text_chunk[:-1], text_chunk[1:].long()

In [10]:
# Stack the chunks into one array first; building a tensor from a list of arrays is slow
seq_dataset = TextDataset(torch.tensor(np.array(text_chunks)))

for i, (seq, target) in enumerate(seq_dataset):
    print("Input (x): ", repr("".join(char_array[seq])))
    print("Target (y): ", repr("".join(char_array[target])))
    print()
    if i == 1: 
        break

Input (x):  'THE MYSTERIOUS ISLAND ***\n\n\n\n\nTHE MYSTER'
Target (y):  'HE MYSTERIOUS ISLAND ***\n\n\n\n\nTHE MYSTERI'

Input (x):  'HE MYSTERIOUS ISLAND ***\n\n\n\n\nTHE MYSTERI'
Target (y):  'E MYSTERIOUS ISLAND ***\n\n\n\n\nTHE MYSTERIO'





In [11]:
# Transform into minibatches
from torch.utils.data import DataLoader
batch_size = 64
torch.manual_seed(1)
seq_dl = DataLoader(seq_dataset, batch_size=batch_size, shuffle=True, drop_last=True)

Setting `drop_last=True` discards the final batch when the number of examples in the dataset is not divisible by `batch_size`; with `drop_last=False`, that final batch is kept and is simply smaller than `batch_size`. We want full batches here because the training loop initializes the hidden state for exactly `batch_size` sequences.
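
The effect of `drop_last` is easy to check on a toy dataset. The sketch below assumes 10 samples and a batch size of 4 (values chosen only for illustration) and lists the batch sizes produced in each mode.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy_ds = TensorDataset(torch.arange(10))  # 10 samples, toy data

dl_drop = DataLoader(toy_ds, batch_size=4, drop_last=True)
dl_keep = DataLoader(toy_ds, batch_size=4, drop_last=False)

# Each batch is a list with one tensor, because the dataset wraps one tensor.
sizes_drop = [len(batch[0]) for batch in dl_drop]
sizes_keep = [len(batch[0]) for batch in dl_keep]
print("drop_last=True :", sizes_drop)   # [4, 4]
print("drop_last=False:", sizes_keep)   # [4, 4, 2]
```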

# Building a character-level RNN model

In [12]:
import torch.nn as nn
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size):
        super().__init__()
        self.rnn_hidden_size = rnn_hidden_size
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, batch_first=True)
        self.fc = nn.Linear(rnn_hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        out = self.embedding(x).unsqueeze(1)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.fc(out).reshape(out.size(0), -1)
        return out, hidden, cell

    def init_hidden(self, batch_size):
        hidden = torch.zeros(1, batch_size, self.rnn_hidden_size)
        cell = torch.zeros(1, batch_size, self.rnn_hidden_size)
        return hidden, cell

The model returns raw `logits` rather than probabilities, so that we can later sample from its predictions to generate new text. `nn.CrossEntropyLoss`, which we use for training, also expects logits.
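
To see how the shapes change through a single time step, the following sketch chains the same three layer types outside of the class. The sizes are toy values picked for this example, not the ones used for training.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only to keep the example small.
vocab_size, embed_dim, hidden_size, batch_size = 85, 8, 16, 4

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)

x = torch.randint(0, vocab_size, (batch_size,))   # one character per sequence
hidden = torch.zeros(1, batch_size, hidden_size)
cell = torch.zeros(1, batch_size, hidden_size)

emb = embedding(x).unsqueeze(1)                   # (batch, 1, embed_dim)
out, (hidden, cell) = lstm(emb, (hidden, cell))   # (batch, 1, hidden_size)
logits = fc(out).reshape(out.size(0), -1)         # (batch, vocab_size)
probs = torch.softmax(logits, dim=1)              # each row sums to 1

print(emb.shape, out.shape, logits.shape)
```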

In [13]:
# Create RNN model
vocab_size = len(char_array)
embed_dim = 256
rnn_hidden_size = 512
torch.manual_seed(1)
model = RNN(vocab_size, embed_dim, rnn_hidden_size)
model

RNN(
  (embedding): Embedding(85, 256)
  (rnn): LSTM(256, 512, batch_first=True)
  (fc): Linear(in_features=512, out_features=85, bias=True)
)

# Train the model

In [14]:
# Define loss and optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

In [15]:
num_epochs = 10000  # each "epoch" here processes a single randomly drawn mini-batch
torch.manual_seed(1)
device = "mps" if torch.backends.mps.is_available() else "cpu"
model.to(device)
for epoch in range(num_epochs):
    model.train()
    hidden, cell = model.init_hidden(batch_size)
    hidden, cell = hidden.to(device), cell.to(device)
    seq_batch, target_batch = next(iter(seq_dl))
    seq_batch, target_batch = seq_batch.to(device), target_batch.to(device)
    optimizer.zero_grad()
    loss = 0
    for c in range(seq_length):
        pred, hidden, cell = model(seq_batch[:, c], hidden, cell)
        loss += loss_fn(pred, target_batch[:, c])
    loss.backward()
    optimizer.step()
    loss = loss.item() / seq_length
    if epoch % 500 == 0:
        print(f"Epoch {epoch} loss: {loss:.4f}")

In [16]:
torch.save(model.state_dict(), "./model.pt")

In [17]:
model.load_state_dict(torch.load("./model.pt"))

<All keys matched successfully>

# Evaluation phase - generating new text

The RNN model we trained in the previous section returns a vector of 85 logits, one per unique character, which can be converted to probabilities via softmax. One simple way to generate text is to choose the character with the highest logit value, i.e., the character the model considers the most likely next character.

However, always selecting the character with the highest likelihood can result in repetitive and less interesting text. To make the text more diverse and less predictable, it's better to randomly sample from the model's output probabilities. This means you don't always choose the character with the highest probability; you make a random choice based on the distribution of probabilities.

PyTorch already provides a class, `torch.distributions.categorical.Categorical`, which we can use to draw random samples from a categorical distribution.
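
The difference between the two strategies can be seen on a toy example. The sketch below assumes a 3-character vocabulary with made-up logits and compares the greedy choice with 1000 random draws.

```python
import torch
from torch.distributions.categorical import Categorical

torch.manual_seed(1)
logits = torch.tensor([1.0, 1.0, 3.0])  # toy logits for a 3-character vocabulary

# Greedy decoding: always the index of the largest logit.
greedy_choice = torch.argmax(logits).item()

# Random sampling: indices are drawn in proportion to softmax(logits).
m = Categorical(logits=logits)
samples = m.sample((1000,))
counts = torch.bincount(samples, minlength=3)

print("Greedy choice:", greedy_choice)
print("Sampled counts per index:", counts.tolist())
```

Greedy decoding returns index 2 every time, while sampling mostly returns index 2 but occasionally picks the other two.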

In [29]:
from torch.distributions.categorical import Categorical

def sample(model, starting_str, max_length=500, scale_factor=1.0):
    """
    starting_str: short starting string
    max_length: max length of generated text
    scale_factor: multiplier applied to the logits before sampling
    """
    # Create all tensors on the model's device (it may have been moved to "mps")
    device = next(model.parameters()).device

    encoded_input = torch.tensor([char2int[s] for s in starting_str], device=device)
    encoded_input = torch.reshape(encoded_input, (1, -1))
    generated_str = starting_str # initially set it equal to the input str

    model.eval()
    hidden, cell = model.init_hidden(batch_size=1)
    hidden, cell = hidden.to(device), cell.to(device)
    with torch.no_grad():  # no gradients are needed for generation
        # Feed all but the last character to build up the hidden state
        for c in range(len(starting_str)-1):
            _, hidden, cell = model(encoded_input[:, c].view(1), hidden, cell)
        last_char = encoded_input[:, -1]
        for i in range(max_length):
            logits, hidden, cell = model(last_char.view(1), hidden, cell)
            logits = torch.squeeze(logits, 0)
            scaled_logits = logits * scale_factor
            m = Categorical(logits=scaled_logits)
            last_char = m.sample()
            generated_str += str(char_array[last_char.item()])

    return generated_str

In [30]:
torch.manual_seed(1)
print(sample(model, starting_str="The island"))

The island had built to push it disappeared, and for
cut off!



Charked during nature depped with himself at the entrme of Granite House might have been invited; must avoided against with his
master, the two hung
began to fire. The conversation of the kill Plencroft did not make sori.

At the Gomering at the fauth
of the chown of stutien on the destive chance of the latitude, and it was the chest of water
and the retreats of the crater, out of lava and measures here, the sailor, “it is one,” replied Cyru


The model mostly generates correct words, and some sentences even partially make sense. Further tuning of the training parameters and model architecture could improve the results.

# Temperature

The `scale_factor` controls the randomness of the generated text. It acts as an inverse temperature: a low `scale_factor` (high temperature) flattens the output distribution toward uniform and makes the text more random, whereas a high `scale_factor` (low temperature) concentrates the probability on the largest logit and makes the text more predictable.
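
Written out, with $z_i$ denoting the logits and $\alpha$ the `scale_factor` (both symbols are introduced here only for illustration), the sampling distribution is a softmax with temperature $T = 1/\alpha$:

$$
p_i = \frac{\exp(\alpha z_i)}{\sum_{j} \exp(\alpha z_j)}, \qquad \alpha = \frac{1}{T}
$$

As $\alpha \to 0$ the distribution approaches uniform, and as $\alpha \to \infty$ it concentrates on the largest logit.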

In [36]:
# Example
logits = torch.tensor([[1.0, 1.0, 3.0]])
print(f"Probabilities before scaling: {nn.functional.softmax(logits, dim=1).numpy()[0]}")
print(f"Probabilities after scaling with 0.5: {nn.functional.softmax(0.5 * logits, dim=1).numpy()[0]}")
print(f"Probabilities after scaling with 0.1: {nn.functional.softmax(0.1 * logits, dim=1).numpy()[0]}")
print(f"Probabilities after scaling with 3.0: {nn.functional.softmax(3.0 * logits, dim=1).numpy()[0]}")

Probabilities before scaling: [0.10650698 0.10650698 0.78698605]
Probabilities after scaling with 0.5: [0.21194156 0.21194156 0.57611686]
Probabilities after scaling with 0.1: [0.3104238  0.3104238  0.37915248]
Probabilities after scaling with 3.0: [0.00246652 0.00246652 0.9950669 ]


In [31]:
torch.manual_seed(1)
print(sample(model, starting_str="The island", scale_factor=2.0))

The island was rumbled on the sea to the surface of the depths of the plateau, and he had probable to see the reporter was readily of the mouth of the lad, “there is a sandy disappeared. The captain and he was beginning, and the settlers would be a region of the corral.

Cyrus Harding had resembled in a hollow of which they had been still seen to the result of sand. The island was the extent of a convicts were the reporter, “we will save him, will not help like a day, the settlers one!” replied Herbert, s


In [32]:
torch.manual_seed(1)
print(sample(model, starting_str="The island", scale_factor=0.5))

The island
addsh
iron me usual, dixed lided, “you
arragt frours
ond,rlesides:-1,
driving bedubely
admasted
amordiceneim: direfimed Cape.

mb washed firl, invimniable afta. Red, The “Tswesturge
Fally an’clets, tbE!
Thout, vieh for
been, Ayrton, “is, ofw two-thour,
grauts!
evey wisherriiefsure!’”
“Let if they unlay,”

rat’s fegle?”
Not
tie, one or deaver. Chance spilacquatity alludelictle of all eaght; Pencroft? Las rithinkswest at
not admisic-ioved, been made Snall fiormed.
How
stop!”

Herbert, light! pros


As we can see, the results are consistent with our expectation. We can generate more correct text with less novelty, or more diverse text with more randomness. It is a trade-off.