# Exercise 2

The point of the exercise is to construct a simple neural character model that can predict the (n+1)th character, given the n preceding characters.

Usually, such language models operate on the word level, but we use a character model because it is simpler and quicker to train and evaluate.


In [1]:
# First run this cell
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from datetime import datetime

We need to map every type of input item (every character, in our case) to a unique ID number. Since we are not sure which characters will appear in our training text, we are going to create new IDs as we encounter new kinds of characters we haven't seen before.

For instance, if the text begins "Harry Potter", we want to transform this into $[1, 2, 3, 3, 4, 5, 6, 7, 8, 8, 9, 3, ...]$, where "H" has ID 1, "a" is 2, "r" is 3, etc. (ID 0 is reserved for the special padding symbol, so we start numbering from 1).


In [2]:
# Run this cell to init mappings from characters to IDs and back again
char_to_id = {}  # Dictionary to store character-to-ID mapping
id_to_char = []  # List to store characters in their ID ordering
PADDING_SYMBOL = '<PAD>'
char_to_id[PADDING_SYMBOL] = 0  # ID 0 is reserved for <PAD>
id_to_char.append(PADDING_SYMBOL)

In [3]:
# Fill in the missing parts in this function
def string_to_ids(string):
    """
    Translate this string into a list of character IDs.
    The IDs will be integers 1,2,..., and created as needed.
    """
    chars_ids = []  # This list will hold the result

    for char in string:
        # YOUR CODE HERE
        if char not in char_to_id:
            char_to_id[char] = len(char_to_id)
            id_to_char.append(char)
        chars_ids.append(char_to_id[char])
    return chars_ids

In [4]:
data_path = 'lang-eng-labs/exercise/HP_book_1.txt' # 'data/HP_book_1.txt'

In [5]:
# Verify
with open(data_path, 'r', encoding='utf-8') as f:
    contents = f.read()
chars_ids = string_to_ids(contents)

print(chars_ids[0] == chars_ids[442737])
print(chars_ids[2677] == chars_ids[7692])
print(chars_ids[146466] == chars_ids[312762])

True
True
True


We now define a class 'CharDataset' that extends the predefined 'Dataset' class.

The init function reads a training text, and slides over it, creating chunks $n$ characters long. These chunks will be our data points, and the corresponding $(n+1)$th character will be the label.

For instance, if $n=4$, and the text begins "Harry P", which corresponds to the IDs $1,2,3,3,4,5,6$, then the first data point will be $[1,2,3,3]$ and its label is $4$, the second data point is $[2,3,3,4]$ with label $5$, and the third data point is $[3,3,4,5]$ with label $6$.

To extend the 'Dataset' class, the CharDataset class has to implement the **len** and **getitem** methods, as seen below.


In [6]:
# Fill in the missing parts in this class definition
class CharDataset(Dataset):

    datapoints = []  # Each datapoint is a sequence of n characters
    labels = []  # The corresponding label is the character that comes next

    def __init__(self, file_path, n):
        """
        'file_path' is the name of the training data file

        'n' is the number of consecutive characters the model will look at
        to predict which letter comes next
        """
        with open(file_path, 'r', encoding='utf-8') as f:
            contents = f.read()
        chars_ids = string_to_ids(contents)

        # Go through the chars_ids and create data points and labels
        # YOUR CODE HERE
        for i in range(len(chars_ids) - n):
            self.datapoints.append(chars_ids[i:i+n])
            self.labels.append(chars_ids[i+n])

    def __len__(self):
        return len(self.datapoints)

    def __getitem__(self, idx):
        idx = idx % len(self.datapoints)
        return torch.tensor(self.datapoints[idx]), torch.tensor(self.labels[idx], dtype=torch.long)

In [7]:
# Verify
dataset = CharDataset(data_path, 4)

d402, l402 = dataset[402]
d4002, l4002 = dataset[4002]
d40002, l40002 = dataset[40002]
print(l402.item() == d4002[0].item())
print(l4002.item() == d402[3].item())
print(l402.item() == d40002[2].item())

True
True
True


In [8]:
# Run this cell. The function below will take care of the case of
# sequences of unequal lengths.

from torch.nn.utils.rnn import pad_sequence


def char_collate_fn(batch):
    """
    Pads sequences to the longest sequence in the batch.

    'batch' is a list of tuples [(datapoint, label), ...]

    Returns a tuple of:
            - Padded datapoints as a tensor
            - Labels as a tensor 
    """
    datapoints, labels = zip(*batch)
    padded_datapoints = pad_sequence(datapoints, batch_first=True)
    return padded_datapoints, torch.tensor(labels)

#### Create a neural network according to the following specification:

The hyperparameters are:

- n -- the number of characters to input (to predict character n+1)
- h -- the number of neurons in the hidden layer
- v -- the number of unique characters

The network should have:

1. an embedding layer, mapping character IDs to h-dimensional vectors
2. a hidden layer with a linear transformation of size $(nh)\times(nh)$, followed by a tanh application
3. a final layer with a linear transformation of size $(nh)\times v$

The input to the forward pass is a tensor of character IDs $x$ of shape $(\mathtt{batch\_size} \times n)$. The forward pass should:

1. Map $x$ to a tensor of character embeddings of shape $(\mathtt{batch\_size} \times n \times h)$
2. Reshape that tensor to shape $(\mathtt{batch\_size} \times nh)$
3. Apply the hidden layer (linear transformation and the tanh operation)
4. Apply the final layer
5. Return the result of the last operation

Before starting the implementation, have a look at the documentation for:

- `torch.nn.Embedding`
- `torch.nn.Linear`
- `torch.tanh`


In [9]:
class CharModel(nn.Module):

    def __init__(self, n, h, v):
        super(CharModel, self).__init__()
        # YOUR CODE HERE
        self.embedding = nn.Embedding(v, h)
        self.hidden = nn.Sequential(
            nn.Linear(n*h, n*h),
            nn.Tanh()
        )
        self.fc = nn.Linear(n*h, v)

    def forward(self, x):
        # YOUR CODE HERE
        x = self.embedding(x)
        x = x.view(x.size(0), -1)
        x = self.hidden(x)
        x = self.fc(x)
        return x

In [10]:
# Verify
_ = torch.manual_seed(42)
charlm = CharModel(4, 64, 81)
logits = charlm(torch.tensor([[10, 9, 12, 1], [1, 2, 3, 4]]))
torch.allclose(logits[0, -1], torch.tensor(-0.0521), atol=1e-4)

True

Next, we will train a model with n=8, i.e. the model will try to predict the 9th character based on the 8 preceding characters.


In [12]:
# Choose 'Run all cells' in the 'Run' menu to run this cell.
_ = torch.manual_seed(21)

# ===================== Hyperparameters ================== #

n = 8
batch_size = 64
hidden_size = 64
learning_rate = 0.00001
number_of_epochs = 50

# ======================= Training ======================= #

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Running on", device)

training_dataset = CharDataset(data_path, n)
print("There are", len(training_dataset), "datapoints and",
       len(id_to_char), "unique characters in the dataset")

training_loader = DataLoader(
    training_dataset, batch_size=batch_size, collate_fn=char_collate_fn, shuffle=True)
charlm = CharModel(n, hidden_size, len(char_to_id)).to(device)
criterion = nn.CrossEntropyLoss()
charlm_optimizer = optim.Adam(charlm.parameters(), lr=learning_rate)

charlm.train()
print(datetime.now().strftime("%X"), "Training starts")
for epoch in range(number_of_epochs):
    for input_tensor, label in training_loader:
        input_tensor, label = input_tensor.to(device), label.to(device)
        charlm_optimizer.zero_grad()
        logits = charlm(input_tensor).to(device)
        loss = criterion(logits.squeeze(1), label)
        loss.backward()
        charlm_optimizer.step()
    print(datetime.now().strftime("%X"), "End of epoch",
          epoch+1, ", loss=", loss.detach().item())

Running on cuda
There are 1328215 datapoints and 81 unique characters in the dataset




12:41:39 Training starts
12:42:03 End of epoch 1 , loss= 2.127134084701538
12:42:27 End of epoch 2 , loss= 2.2263357639312744
12:42:52 End of epoch 3 , loss= 1.591212272644043
12:43:15 End of epoch 4 , loss= 1.8571747541427612
12:43:39 End of epoch 5 , loss= 1.6835119724273682
12:44:02 End of epoch 6 , loss= 2.320404052734375
12:44:26 End of epoch 7 , loss= 1.7896932363510132
12:44:49 End of epoch 8 , loss= 2.187608480453491
12:45:13 End of epoch 9 , loss= 1.923316240310669
12:45:36 End of epoch 10 , loss= 1.1429024934768677
12:45:59 End of epoch 11 , loss= 1.5131518840789795
12:46:22 End of epoch 12 , loss= 1.7750986814498901
12:46:46 End of epoch 13 , loss= 1.6273977756500244
12:47:09 End of epoch 14 , loss= 1.4926913976669312
12:47:33 End of epoch 15 , loss= 1.3606318235397339
12:47:57 End of epoch 16 , loss= 1.9494916200637817
12:48:20 End of epoch 17 , loss= 1.4308462142944336
12:48:45 End of epoch 18 , loss= 1.0586464405059814
12:49:09 End of epoch 19 , loss= 1.8919532299041748
1

Check how well the model works by entering a string and letting the model generate the continuation of that string.


In [15]:
charlm.eval()
while True:
    start = input(">")
    if start.strip() == 'quit':
        break
    # Add spaces in case the start string is too short
    start = ' '*(n-len(start)) + start
    # Ignore everything but the last n characters of the start string
    ids = [char_to_id[c] for c in start][-n:]
    # Generate 200 characters starting from the start string
    try:
        print(start, end='')
        for _ in range(200):
            input_tensor = torch.tensor(ids).unsqueeze(0).to(device)
            logits = charlm(input_tensor).squeeze().to(device)
            _, new_character_tensor = logits.topk(1)
            new_character_id = new_character_tensor.detach().item()
            print(id_to_char[new_character_id], end='')
            ids.pop(0)
            ids.append(new_character_id)
        print()
    except KeyError:
        continue

Harry and Hermione kissed the corridor and the corridor and the corridor and the corridor and the corridor and the corridor and the corridor and the corridor and the corridor and the corridor and the corridor and the corridor
