# Load English training data
First, upload the `shakespeare_input.txt` downloaded from the Homework 3 into the Colab file manager. To do this, click the folder icon on the left-hand sidebar. Then, click the upload icon in the sidebar (the one with the arrow pointing up) and select the `shakespeare_input.txt` file.

After you have the file in the Colab notebook's context, you'll need to open it up and read in each line to a Python list and save it to an object called `training_data`.
The function currently removes lines with no text. You can also perform any preprocessing you want to do here as well.

In this notebook, you will train a decoder-only LLM (GPT-2) with a **character** tokenizer on data from Shakespeare and generate sentences.

You will use Hugging Face to train the models.

**Important**: you will need to use a GPU for training. To change to a GPU, select Runtime > Change runtime type from the menu bar above. Select 'T4'.

In [None]:
with open('shakespeare_input.txt') as f:
  training_data = [[line] for line in f.read().lower().splitlines() if len(line) > 0]

training_data[:10] # to check the first 10 lines

[['first citizen:'],
 ['before we proceed any further, hear me speak.'],
 ['all:'],
 ['speak, speak.'],
 ['first citizen:'],
 ['you are all resolved rather to die than to famish?'],
 ['all:'],
 ['resolved. resolved.'],
 ['first citizen:'],
 ['first, you know caius marcius is chief enemy to the people.']]

In [1]:
from pathlib import Path
train_dir = Path("train_dir")
model_weight_dir = Path("model_weight")
tokenizer_weight_dir = Path("tokenizer_weight")

dir_list = [train_dir, model_weight_dir, tokenizer_weight_dir]
for dir_path in dir_list:
    dir_path.mkdir(parents=True, exist_ok=True)

# "Train" a tokenizer

Hugging Face models use specified tokenizers which define the possible tokens.
Here we want to modify the existing `GPT2TokenizerFast` class to tokenize on characters.

Define a new Hugging Face tokenizer here that only accepts characters and save it to an object named `char_tokenizer`.

You can reference the following:
* https://discuss.huggingface.co/t/character-level-tokenizer/12450/3
* https://huggingface.co/learn/nlp-course/en/chapter6/

In [1]:
from transformers import GPT2TokenizerFast
from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
from pathlib import Path
import json

example_input_text = "I'm from China"
tokenizer = GPT2TokenizerFast.from_pretrained('openai-community/gpt2')
output = tokenizer(example_input_text)
tokens = tokenizer.tokenize(example_input_text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
result = tokenizer.prepare_for_model(token_ids)
print(output, '\n', tokens, '\n', token_ids, '\n', result)

# FILL IN code here to create `char_tokenizer` object, a custom tokenizer that tokenizes characters

class CharacterTokenizer(PreTrainedTokenizer):
    def __init__(self, characters, model_max_length, **kwargs):
        self.characters = characters

        self._vocab_str_to_int = {
            "[CLS]": 0,
            "[SEP]": 1,
            "[BOS]": 2,
            "[MASK]": 3,
            "[PAD]": 4,
            "[RESERVED]": 5,
            "[UNK]": 6,
            **{ch: i + 7 for i, ch in enumerate(characters)},
        }

        self._vocab_int_to_str = {v: k for k, v in self._vocab_str_to_int.items()}
        bos_token = AddedToken("[BOS]", lstrip=False, rstrip=False)
        eos_token = AddedToken("[SEP]", lstrip=False, rstrip=False)
        sep_token = AddedToken("[SEP]", lstrip=False, rstrip=False)
        cls_token = AddedToken("[CLS]", lstrip=False, rstrip=False)
        pad_token = AddedToken("[PAD]", lstrip=False, rstrip=False)
        unk_token = AddedToken("[UNK]", lstrip=False, rstrip=False)
        mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False)

        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            sep_token=sep_token,
            cls_token=cls_token,
            pad_token=pad_token,
            mask_token=mask_token,
            unk_token=unk_token,
            add_prefix_space=False,
            model_max_length=model_max_length,
            **kwargs,
        )

    # @property
    def vocab_size(self):
        return len(self._vocab_str_to_int)
    
    def get_vocab(self):
        return self._vocab_str_to_int
    
    def _tokenize(self, text):
        return list(text)
    
    def _convert_token_to_id(self, token):
        return self._vocab_str_to_int.get(token, self._vocab_str_to_int['[UNK]'])
    
    def _convert_id_to_token(self, index):
        return self._vocab_int_to_str[index]
    
    def convert_tokens_to_string(self, tokens):
        return "".join(tokens)

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1 = None):
        sep = [self.sep_token_id]
        cls = [self.cls_token_ids]
        result = cls + token_ids_0 + sep
        if token_ids_1 is not None:
            result += token_ids_1 + sep
        return result

    def get_special_tokens_mask(self, token_ids_0, token_ids_1 = None, already_has_special_tokens = False):
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0,
                token_ids_1=token_ids_1,
                already_has_special_tokens=True
            )
        
        result = [1] + ([0] * len(token_ids_0)) + [1]
        if token_ids_1 is not None:
            result += ([0] * len(token_ids_1) + [1])
        return result
    
    def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1 = None):
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]

        result = len(cls + token_ids_0 + sep) * [0]
        if token_ids_1 is not None:
            result += len(token_ids_1 + sep) * [1]
        return result
    
    def get_config(self):
        return {
            "char_ords": [ord(ch) for ch in self.characters],
            "model_max_length": self.model_max_length,
        }
    
    @classmethod
    def from_config(cls, config):
        cfg = {}
        cfg["characters"] = [chr(i) for i in config["char_ords"]]
        cfg["model_max_length"] = config["model_max_length"]
        return cls(**cfg)
    
    def save_pretrained(self, save_directory, legacy_format = None, filename_prefix = None, push_to_hub = False, **kwargs):
        cfg_file = Path(save_directory) / "tokenizer_config.json"
        cfg = self.get_config()
        with open(cfg_file, "w") as f:
            json.dump(cfg, f, indent=4)
    
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, cache_dir = None, force_download = False, local_files_only = False, token = None, revision = "main", trust_remote_code=False, **kwargs):
        cfg_file = Path(pretrained_model_name_or_path) / "tokenizer_config.json"
        with open(cfg_file) as f:
            cfg = json.load(f)
        return cls.from_config(cfg)



  from .autonotebook import tqdm as notebook_tqdm
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': [40, 1101, 422, 2807], 'attention_mask': [1, 1, 1, 1]} 
 ['I', "'m", 'Ġfrom', 'ĠChina'] 
 [40, 1101, 422, 2807] 
 {'input_ids': [40, 1101, 422, 2807], 'attention_mask': [1, 1, 1, 1]}


In [125]:
chars = set()
with open("shakespeare_input.txt", "r") as f:
    line = f.read().replace("-", "")
    chars = chars.union(set(line))
len(chars)

66

In [None]:
model_max_length = 2048
char_tokenizer = CharacterTokenizer(chars, model_max_length)
char_tokenizer.save_pretrained("tokenizer_weight")

Test your new tokenizer with the following cell. It should provide each token as a character. You may get unexpected behavior for the space character, and that's ok.

In [127]:
print(char_tokenizer.tokenize("hello world"))
print(char_tokenizer.encode("hello world"))
print(char_tokenizer.prepare_for_model(char_tokenizer.encode("hello world")))
print(char_tokenizer("hello world"))

['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
[0, 28, 60, 56, 56, 7, 67, 45, 7, 54, 56, 8, 1]
{'input_ids': [0, 0, 28, 60, 56, 56, 7, 67, 45, 7, 54, 56, 8, 1, 1], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [0, 28, 60, 56, 56, 7, 67, 45, 7, 54, 56, 8, 1], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


# Train GPT-2 model with character tokenizer

Here's where you will train your GPT-2 model on the Shakespeare data using your new character tokenizer. Specifically, train the `GPT2LMHeadModel` from the `transformers` package.

Here are some references for the code for this part:
* https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb
* https://huggingface.co/docs/transformers/en/tasks/language_modeling. Note that this is for finetuning, not training from scratch. It is still useful for explanations of Hugging Face classes

You will want to define a model, load in the Shakespeare dataset in a format that Hugging Face can work with, define training parameters, and then train the model.
This training may take 30 minutes or longer.

**You will also need to save the model** with a name like `char_gpt2_shakespeare` to be able to generate from it later.

In [128]:
import torch
from transformers import GPT2LMHeadModel
from transformers import LineByLineTextDataset
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

torch.cuda.is_available()

model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
tokenizer = CharacterTokenizer.from_pretrained("tokenizer_weight")

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="shakespeare_input.txt",
    block_size=128,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./train_dir",
    overwrite_output_dir=True,
    num_train_epochs=50,
    per_gpu_train_batch_size=64,
    save_steps=100,
    save_total_limit=5,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


Step,Training Loss
500,0.6219
1000,0.5339
1500,0.506
2000,0.4908
2500,0.48
3000,0.47
3500,0.4639
4000,0.4588
4500,0.4537
5000,0.4502




TrainOutput(global_step=26600, training_loss=0.4296450273614181, metrics={'train_runtime': 6135.6489, 'train_samples_per_second': 1109.72, 'train_steps_per_second': 4.335, 'total_flos': 2.0811259160064e+17, 'train_loss': 0.4296450273614181, 'epoch': 50.0})

In [129]:
trainer.save_model("model_weight")

# Generate from the trained model

In [2]:
from transformers import GPT2LMHeadModel
tokenizer = CharacterTokenizer.from_pretrained("tokenizer_weight")
model = GPT2LMHeadModel.from_pretrained("model_weight").to("cuda")

In [31]:
from transformers import pipeline
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
generated_texts = generator(
    "", 
    num_return_sequences=5, 
    max_length=100, 
    do_sample=True, 
    top_k=3
)
generated_texts

Device set to use cuda:0


[{'generated_text': 'e shall not say he would be so forget her. I would not see them. I say. I would. I,ot, I I.ever. I'},
 {'generated_text': "or that we see your heart and the sent, I have not a month, I'll never. I'll bear.'d.' my hear.:.."},
 {'generated_text': "hat she, and says 'tis this?t there were a man.' If I did not, I,'twas not,'t.,'t.'.' I I,'t.'t.'t"},
 {'generated_text': 'he care of the sea and her that his soul, and her to her stand, if too. I would, I here.ough.. I I'},
 {'generated_text': "e saw the more of her streets. What word hast thou art? I would say? I wish? thee.?.?.'I?, and yo?"}]

# Calculate perplexity for test documents

In this section, load the test documents from the Homework 3.
Calculate perplexity for both models.

In [4]:
import math
import torch

def calculate_perplexity(model, tokenizer, text_list):
    perplexities = []
    for text in text_list:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=1024)
        input_ids = inputs["input_ids"].to("cuda")
        attention_mask = inputs["attention_mask"].to("cuda")

        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
            loss = outputs.loss
        
        perplexity = math.exp(loss.item()) if loss.item() < 100 else float('inf') 
        perplexities.append(perplexity)
    return sum(perplexities) / len(perplexities)

with open("test_data/nytimes_article.txt", "r", errors="replace") as f:
    text1 = f.read()
    text1 = text1.split("\n")
    text1 = [text for text in text1 if text]

with open("test_data/shakespeare_sonnets.txt", "r") as f:
    text2 = f.read()
    text2 = text2.split("\n")
    text2 = [text for text in text2 if text]
    
print(calculate_perplexity(model=model, tokenizer=tokenizer, text_list=text1))
print(calculate_perplexity(model=model, tokenizer=tokenizer, text_list=text2))

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


40.38293324703038
7.972513695443804
