# Transformer for Modeling Sentences

**Q2 (10 points)** In this task, we will train a Small Language Model (SLM) based on the transformer architecture. The task is to predict the next character in a sentence. 

In [1]:
# As usual, a bit of setup
import time
import torch

%load_ext autoreload
%autoreload 2
%autosave 180

Autosaving every 180 seconds


## Load the data

The code for data loading is ready for you to use, so you don't need to make any changes to the following code block. 


In [2]:
import numpy as np
import scipy
import string
from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

def process_textdata(article_iter, char_dict):
    """
    Concatenate the articles into a single string, remove non-printable chars from the string,
    and convert it to a 1-d torch tensor of character indices.
    """

    text = "\n".join([article for article in article_iter])
    text = ''.join(filter(lambda x: x in printable, text))
    data = torch.tensor(list(map(lambda x: char_dict[x], text)), dtype=torch.int32)

    return data

# load the WikiText-2 dataset
train_iter, val_iter, test_iter = WikiText2()

# define the vocabulary, which contains all printable characters.
printable = list(string.printable)
char_dict = dict(zip(printable, range(len(printable)))) # reverse table

# turn each data subset to a 1-d array of numbers ranging in `range(len(printable))`
train_data = process_textdata(train_iter, char_dict)
val_data = process_textdata(val_iter, char_dict)
test_data = process_textdata(test_iter, char_dict)


# Some data exploration

print('Data statistics:')
print(f'Number of characters: {len(train_data)} (train), {len(val_data)} (val), {len(test_data)} (test)')

uniq, uniq_counts = torch.unique(train_data, return_counts=True)
total = torch.sum(uniq_counts)
uniq = uniq.numpy()

freq = (uniq_counts / total).numpy()
ch_freq = dict([(printable[uniq[i]], freq[i]) for i in range(len(uniq))])
for ch in string.printable:
    if ch not in ch_freq:
        ch_freq[ch] = 0

print("Some characters' frequencies:")
print(", ".join([f"{ch}: {ch_freq[ch]:.3f}" for ch in "abcdefghijklmnopqrstuvwxyz"]))

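# entropy in bits (base 2); the cross-entropy loss used later is measured in nats, so convert with ln(2)
ent = scipy.stats.entropy(freq, base=2.0)
print(f"The Shannon entropy of characters is {ent:.3f} bits, which means that the per-character cross-entropy loss of" +
      f" a simple model (guessing the next character by frequencies) is {ent * np.log(2):.3f} nats")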



## Use a Transformer as a language model

In the task below, you are supposed to train a transformer to model text data. Essentially your model defines the probability $p(y_t | y_{t-1}, \ldots, y_{t - k})$. 

**(Q2 part 1, 5 points)** Implement the transformer using [multi-head attention layers](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html). In particular, implement the [Transformer encoder](https://d2l.ai/chapter_attention-mechanisms-and-transformers/transformer.html), but turn on the causal flag in the forward calculation so that each position attends only to itself and earlier positions. Please check the documentation of `MultiheadAttention`. Note that the book describes GPT as a transformer decoder, but it is essentially the encoder with the causal flag on.
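
To illustrate what causal attention means in practice, here is a minimal sketch (not a solution to the task). It passes an explicit boolean `attn_mask` to `MultiheadAttention` and checks that changing the last input position leaves the outputs at earlier positions untouched. The sizes are arbitrary illustrative choices, and the sketch deliberately uses `attn_mask` rather than the `is_causal` argument; check the linked documentation for how the two interact in your PyTorch version.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative sizes only; they are not prescribed by the assignment.
seq_len, batch_size, embed_dim, num_heads = 6, 1, 16, 4

# By default the layer expects inputs of shape (seq_len, batch, embed_dim).
attn = nn.MultiheadAttention(embed_dim, num_heads)
attn.eval()

# Boolean mask: True marks positions that may NOT be attended to (the future).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

x = torch.randn(seq_len, batch_size, embed_dim)
x_changed = x.clone()
x_changed[-1] = torch.randn(batch_size, embed_dim)  # perturb only the last position

with torch.no_grad():
    out, _ = attn(x, x, x, attn_mask=causal_mask, need_weights=False)
    out_changed, _ = attn(x_changed, x_changed, x_changed,
                          attn_mask=causal_mask, need_weights=False)

# Earlier positions cannot see the last one, so their outputs stay the same.
prefix_unchanged = torch.allclose(out[:-1], out_changed[:-1], atol=1e-5)
last_changed = not torch.allclose(out[-1], out_changed[-1], atol=1e-5)
print(prefix_unchanged, last_changed)
```

This is the same property that the sequential-versus-batch test later in the notebook relies on: the prediction at position `t` must not depend on anything after `t`.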

**(Q2 part 2, 5 points)** Implement the training code that trains the model with the given data. Your work is very similar to this [tutorial](https://pytorch.org/tutorials/beginner/transformer_tutorial.html), so please read it carefully. However, keep two differences in mind. First, we do character-level language modeling. Second, you CANNOT use a Transformer model from Torch directly. The tutorial is therefore most useful if you understand its ideas well rather than reuse its code.
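
For next-character prediction, the targets are simply the inputs shifted one step ahead. The sketch below shows that idea on a toy string using plain Python lists; the helper name `get_batch`, the toy corpus, and the chunk length of 4 are illustrative assumptions, not part of the provided code.

```python
def get_batch(ids, start, bptt):
    """Return (inputs, targets); targets are the inputs shifted one step ahead."""
    # The last chunk may be shorter, because every input needs a following target.
    length = min(bptt, len(ids) - 1 - start)
    inputs = ids[start:start + length]
    targets = ids[start + 1:start + 1 + length]
    return inputs, targets

text = "hello world"  # toy corpus for illustration
vocab = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
ids = [char_to_id[ch] for ch in text]

def decode(id_list):
    """Map a list of character ids back to a string."""
    return "".join(vocab[i] for i in id_list)

chunks = []
for start in range(0, len(ids) - 1, 4):
    x, y = get_batch(ids, start, bptt=4)
    chunks.append((decode(x), decode(y)))
    print(repr(decode(x)), "->", repr(decode(y)))
```

In the actual training loop the same slicing would be applied to tensors, and the model output at each position would be compared against the corresponding target with the cross-entropy loss.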


**(Q2 part 3, 5 points)** Your model will be evaluated by per-character cross-entropy loss. You will get 
* 1 point if your per-character cross-entropy loss is less than 2.5 (a feedforward model defining a Markov model $p(y_t | y_{t-1})$ is able to achieve this number).
* 4 points if your per-character cross-entropy loss is less than 2.
* 5 points if your per-character cross-entropy loss is less than 1.8.

\*Results from a [paper](https://arxiv.org/pdf/1808.04444.pdf) indicate that an LSTM can achieve 1.43 bits per character, which corresponds to a loss of 1.43 * ln(2) = 0.991.  
\*The `zip` program for compressing files achieves roughly 3.522 bits per character, which corresponds to a loss of 3.522 * ln(2) = 2.441.
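
The conversions above multiply by ln(2) because compression results are usually reported in bits (log base 2), while the cross-entropy loss uses the natural logarithm (nats). The following sketch reproduces the two conversions and computes the entropy of a toy string; the function names and the toy string are hypothetical and only serve the illustration.

```python
import math
from collections import Counter

def bits_to_nats(bits):
    """Convert a per-character cost from bits to nats."""
    return bits * math.log(2)

def unigram_entropy_nats(text):
    """Entropy (in nats) of the character frequencies of `text`."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

lstm_nats = bits_to_nats(1.43)   # the LSTM figure quoted above
zip_nats = bits_to_nats(3.522)   # the zip figure quoted above
print(round(lstm_nats, 3), round(zip_nats, 3))

# Two equally likely characters carry exactly one bit, i.e. ln(2) nats.
h_nats = unigram_entropy_nats("abab")
print(h_nats / math.log(2))
```

All thresholds in the grading list are therefore directly comparable with the value reported by `torch.nn.CrossEntropyLoss`, which is measured in nats.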

### Train the model

You should implement your model class `SmallLanguageModel` and a function `train` so that the cell below runs.

In [None]:
import torch
from language_modeling import SmallLanguageModel, train

model = SmallLanguageModel(vocabulary=printable).to(device)
train_data = train_data.to(device)
val_data = val_data.to(device)

loss_func = torch.nn.CrossEntropyLoss()
lr = 5.0  # learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

train(model, train_data, val_data, loss_func, optimizer, scheduler, num_epochs=1, bptt = 35)

epoch 0, val loss 1.778340220974441s 0.37107219099998473



SmallLanguageModel(
  (emb): Embedding(100, 256)
  (pos): PositionalEncoding(
    (dropout): Dropout(p=0.01, inplace=False)
  )
  (enc): Sequential(
    (0): EncoderLayer(
      (attn): MultiheadAttention(
        (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=False)
      )
      (norm1): Norm(
        (dropout): Dropout(p=0.01, inplace=False)
        (ln): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
      (ffn): FFN(
        (dense1): Linear(in_features=256, out_features=256, bias=True)
        (relu): ReLU()
        (dense2): Linear(in_features=256, out_features=256, bias=True)
      )
      (norm2): Norm(
        (dropout): Dropout(p=0.01, inplace=False)
        (ln): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
    )
    (1): EncoderLayer(
      (attn): MultiheadAttention(
        (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=False)
      )
      (norm1): Norm(
        (dr

### Save the model and test it

From this cell on, the code is used to test your model. You should NOT modify the code in this subsection. In particular, you should save your model using the default model name; otherwise, you will lose 2 points. 

In [None]:
torch.save(model, "my_slm.pt")

In [None]:
# Test whether the model has the same behavior when running sequentially or with batches

model = torch.load("my_slm.pt").to(device)

# An example sentence:
sen = "\nThis is a test case."
sen_data = torch.tensor([char_dict[ch] for ch in sen], dtype=torch.long).view([-1, 1]).to(device)

model.eval()

with torch.no_grad():
    output1 = model(sen_data)

    output2 = []
    for i in range(sen_data.shape[0]):
        out = model(sen_data[:(i+1)])
        output2.append(out[-1])

    output2 = torch.stack(output2, dim=0)

    diff = torch.mean(torch.abs(output1 - output2)).cpu().numpy()

print("The entry-wise difference between the two calculations should be very small (below 1e-5).",
      "The difference from your model is: ", diff)


The entry-wise difference between the two calculations should be very small (below 1e-5). The difference from your model is:  4.3259934e-07


In [None]:
# Test the per-character cross-entropy loss of your model
from third_party import evaluate, batchify

eval_batch_size = 10
test_data = test_data.to(device)
test_data = batchify(test_data, eval_batch_size)
test_loss = evaluate(model, test_data, loss_func)

print('The total number of chars in the test set is ', torch.numel(test_data))

print('The per-char-loss is %.3f' % test_loss)


The total number of chars in the test set is  1258600
The per-char-loss is 1.790


### Use the model to generate sentences

Now we can use the trained model to generate text from a starting string. A naive model just predicts frequent characters in the text, so it produces no meaningful generation. You can provide different "prompts" and see what content the model generates after them.

In [None]:
import torch.distributions as distributions

def generate_text(model, start_string, char_list):
    """ Generate random text from a starting string. """

    input_string = start_string
    if len(input_string) == 0:
        input_string = "\n" # use the newline character as the BOS

    # Number of characters to generate
    num_generate = 100

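    # Converting our start string to numbers (vectorizing). Use `input_string` rather than
    # `start_string` so that an empty prompt falls back to the newline BOS set above.
    input_int = [char_list.index(s) for s in input_string]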

    # Low temperature results in more predictable text.
    # Higher temperature results in more surprising text.
    # Experiment to find the best setting.
    temperature = 0.5

    for i in range(num_generate):

        input_tensor = torch.tensor(input_int, dtype = torch.long).view([-1, 1]).to(device)
        outputs = model(input_tensor)

        # remove the batch dimension
        prediction = torch.softmax(outputs[-1, 0, :] / temperature, dim=0)

        # using a categorical distribution to predict the character returned by the model
        pred_int = int(distributions.Categorical(probs = prediction).sample())

        # The calculation has a lot of repetition because the computation for the first part
        # of the sequence is the same at every iteration. But it's fine for our example.
        input_int.append(pred_int)
        input_string = input_string + char_list[pred_int]


    return input_string


start_string = 'I hav'
gen_sen = generate_text(model, start_string, printable)
gen_sen = gen_sen.split('\n')[0]

print('Starting from "' + start_string + '", the generated sentence is:')
print('"' + gen_sen + '"')

Starting from "I hav", the generated sentence is:
"I have of the conded the lating was the compert in a the win his hare the this oust s wo foun tereare act"

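
The `temperature` setting in `generate_text` divides the model outputs before the softmax. As a small standalone sketch of its effect, the code below applies a temperature-scaled softmax to three made-up scores; the function name and the scores are hypothetical and do not come from the trained model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum before exponentiating
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three characters

probs_by_temp = {}
for t in (0.5, 1.0, 2.0):
    probs_by_temp[t] = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs_by_temp[t]])
```

Lower temperatures concentrate probability on the highest-scoring character, which makes sampling more predictable; higher temperatures flatten the distribution and make the generated text more surprising.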