# Fine-Tune GPT-2 on Paul Graham's Essays

We import all the functions and classes we need to fine-tune our transformer.

In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments, pipeline
import math

Let's load the model and the tokenizer from the HuggingFace Hub using the `transformers` library.

In [2]:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

Let's load the dataset we are using. The fine folks at chromadb scraped all essays by Paul Graham and uploaded them to HuggingFace. We can download them with the `load_dataset` function from the HuggingFace `datasets` library.

We only care about the actual text of the essays (not the embeddings etc.), so we extract the essay texts from the dataset and concatenate everything into one big string.

Afterwards we tokenize the corpus, which means we convert the text into numbers (token ids) the transformer can work with.
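
As a rough illustration of what "converting text into numbers" means, here is a toy word-level tokenizer. It is only a sketch: the vocabulary and the `encode`/`decode` helpers are made up for this example, and GPT-2's real tokenizer works on subword units using byte-level byte-pair encoding rather than on whole words.

```python
# Toy word-level tokenizer (illustrative only, NOT how GPT-2 tokenizes).
text = "startups grow fast and startups die fast"

# Hypothetical vocabulary: one integer id per distinct word.
vocab = {word: idx for idx, word in enumerate(sorted(set(text.split())))}
inverse_vocab = {idx: word for word, idx in vocab.items()}

def encode(s):
    """Map each whitespace-separated word to its integer id."""
    return [vocab[w] for w in s.split()]

def decode(ids):
    """Map integer ids back to words and join them with spaces."""
    return " ".join(inverse_vocab[i] for i in ids)

ids = encode(text)
print(ids)                  # [4, 3, 2, 0, 4, 1, 2]
print(decode(ids) == text)  # True
```

Repeated words map to the same id, which is the property the model relies on.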

In [3]:
dataset = load_dataset("chromadb/paul_graham_essay", split="data")
string = " ".join(dataset["document"])
tok_string = tokenizer(string)

Token indices sequence length is longer than the specified maximum sequence length for this model (17352 > 1024). Running this sequence through the model will result in indexing errors


We cannot feed the whole text into the language model at once. Language models usually have a fixed context length, which means they can only work with a limited number of tokens at a time. We use a context length of 128 in this example.

Now we need to cut our whole corpus into chunks of 128 tokens. I (okay, GitHub Copilot) wrote a little function to accomplish this.

In [4]:
context_length = 128

# function that splits a list into chunks of size n
def chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i : i + n]
        
# chunk tok_string into chunks of n=128
tok_string_chunks = list(chunks(tok_string["input_ids"], context_length))
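
A quick check on a small list shows how `chunks` behaves. The numbers are made up and stand in for token ids. Note that the last chunk can be shorter than `n` when the list length is not a multiple of `n`.

```python
def chunks(lst, n):
    """Yield successive slices of length n; the final slice may be shorter."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]

example = list(chunks(list(range(10)), 4))
print(example)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```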

The `Dataset` class of HuggingFace's `datasets` library is a convenient way to feed the data to the neural network.

Here we convert our tokens to a `Dataset` instance. We also need to add an `attention_mask`. This is ....

Next we split our dataset into a train and a test (or validation) split. This allows us to evaluate our model during training.

In [5]:
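from datasets import Dataset  # use the Dataset class rather than the loaded `dataset` instance
ds = Dataset.from_list([{"input_ids": t, "attention_mask": [1] * len(t)} for t in tok_string_chunks])
# hold out 20% of the chunks for evaluation
ds = ds.train_test_split(test_size=0.2)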

We'd like to process our data in parallel, which means feeding more than one chunk to GPT-2 at once. To do that we need to combine the individual chunks into a "batch" (d'oh). That's what the `DataCollatorForLanguageModeling` does.
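
To make the batching step concrete, here is a minimal pure-Python sketch of what padding a batch involves. It is not the library's implementation: `pad_batch` is a hypothetical helper, and the token ids are made up except for `50256`, the end-of-text id that shows up in the generation log further down. Shorter sequences are filled up to the longest one in the batch, and the attention mask marks the filler positions with `0`.

```python
PAD_ID = 50256  # GPT-2's eos token id, reused here as the padding id

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[11, 12, 13, 14], [21, 22]])
print(batch["input_ids"])       # [[11, 12, 13, 14], [21, 22, 50256, 50256]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

In this notebook only the final chunk can be shorter than `context_length`, so padding affects at most one sequence.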

The `data_collator` will also add a new key to the dict, called `labels`. Fine-tuning an LLM is still a supervised learning task; you just don't have to create the labels yourself. The task for the model is to use the previous tokens to predict the next token. That's what a "causal" or "autoregressive" language model does.

Right now the `data_collator` just copies the input tokens over to the `labels` key. 

As huggingface.co says:
>Shifting the inputs and labels to align them happens inside the model, so the data collator just copies the inputs to create the labels.

In [6]:
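# GPT-2 has no padding token by default, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token
# mlm=False selects causal (next-token) language modeling instead of masked language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)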
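
The shifting mentioned in the quote can be sketched in plain Python. This illustrates the idea only and is not the model's actual code: `next_token_pairs` is a hypothetical helper and the token ids are made up. Position `i` of the input is used to predict position `i + 1`, so the labels are effectively the inputs shifted left by one.

```python
def next_token_pairs(input_ids):
    """Return (context, target) pairs for next-token prediction."""
    labels = list(input_ids)          # the collator's part: labels are a plain copy
    shifted_inputs = input_ids[:-1]   # the model's part: drop the last input...
    shifted_labels = labels[1:]       # ...and the first label, so they line up
    return [
        (shifted_inputs[: i + 1], shifted_labels[i])
        for i in range(len(shifted_labels))
    ]

for context, target in next_token_pairs([5, 8, 13, 21]):
    print(context, "->", target)
# [5] -> 8
# [5, 8] -> 13
# [5, 8, 13] -> 21
```

A chunk of 128 tokens therefore yields 127 prediction targets, since the first token has nothing before it.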

In [7]:
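training_args = TrainingArguments(
    output_dir="distilgpt2-paul-graham-essays",
    evaluation_strategy="epoch",  # evaluate after every epoch; newer transformers releases name this `eval_strategy`
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,  # keep the fine-tuned model local
    remove_unused_columns=True,
)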

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    data_collator=data_collator,
)

Now we can actually fine-tune the LLM with our new data. This is what `trainer.train()` does. It runs the training loop with the parameters we specified above.

In [8]:
trainer.train()



You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 3.7619221210479736, 'eval_runtime': 1.4488, 'eval_samples_per_second': 19.326, 'eval_steps_per_second': 2.761, 'epoch': 1.0}


{'eval_loss': 3.7335638999938965, 'eval_runtime': 1.3669, 'eval_samples_per_second': 20.484, 'eval_steps_per_second': 2.926, 'epoch': 2.0}


{'eval_loss': 3.7258050441741943, 'eval_runtime': 1.4999, 'eval_samples_per_second': 18.667, 'eval_steps_per_second': 2.667, 'epoch': 3.0}
{'train_runtime': 79.3165, 'train_samples_per_second': 4.085, 'train_steps_per_second': 0.53, 'train_loss': 3.8387327648344494, 'epoch': 3.0}


TrainOutput(global_step=42, training_loss=3.8387327648344494, metrics={'train_runtime': 79.3165, 'train_samples_per_second': 4.085, 'train_steps_per_second': 0.53, 'train_loss': 3.8387327648344494, 'epoch': 3.0})
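
The step count in this output can be sanity-checked with a little arithmetic. The sketch below relies on assumptions the notebook does not state explicitly: that the 17352 tokens from the tokenizer warning are the whole corpus, that `TrainingArguments` is left at a batch size of 8 and 3 epochs on a single device, and that `train_test_split` rounds the test share up.

```python
import math

num_tokens = 17352       # from the tokenizer warning (assumed to be the full corpus)
context_length = 128
batch_size = 8           # assumed per-device default of TrainingArguments
num_epochs = 3           # assumed default of TrainingArguments

num_chunks = math.ceil(num_tokens / context_length)
num_test = math.ceil(0.2 * num_chunks)   # assumption: the test share is rounded up
num_train = num_chunks - num_test

steps_per_epoch = math.ceil(num_train / batch_size)
total_steps = steps_per_epoch * num_epochs
eval_steps = math.ceil(num_test / batch_size)

print(num_chunks, num_train, num_test)   # 136 108 28
print(total_steps, eval_steps)           # 42 4
```

Under these assumptions the result matches `global_step=42` in the `TrainOutput`, and 28 evaluation samples is consistent with `eval_samples_per_second` multiplied by `eval_runtime`.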

Now that the training has finished, we can evaluate our newly fine-tuned language model. A common metric to use on language models is called "perplexity". 

In [9]:
eval_results = trainer.evaluate()

print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 41.50
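
Perplexity is the exponential of the average per-token cross-entropy loss (the negative log-likelihood), so lower is better. A model that spread its probability evenly over `k` candidate tokens at every position would have a perplexity of `k`. The sketch below uses made-up probabilities plus the `eval_loss` reported for epoch 3 above; `perplexity` is a hypothetical helper, not part of `transformers`.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always gives the right token a 25% chance is "choosing among 4".
print(round(perplexity([0.25, 0.25, 0.25]), 2))  # 4.0

# The same formula applied to the eval_loss logged after epoch 3.
eval_loss = 3.7258050441741943
print(f"{math.exp(eval_loss):.2f}")  # 41.50
```

Read this way, a perplexity of about 41.5 means the model is on average as uncertain as if it were picking uniformly among roughly 41 tokens.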


## Inference

Now we can test our fine-tuned LLM. To make inference easier, we create a HF text-generation pipeline with our newly fine-tuned model. This way we only need to call the pipeline with our prompt.

In [10]:
generator = pipeline("text-generation", model=trainer.model, tokenizer=tokenizer)

In [11]:
generator("To start a successful startup")[0]["generated_text"]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.




'To start a successful startup which can be replicated on other exchanges. This way we can create both companies that are looking at how they are going to use those new businesses to create new businesses. For instance, our startup team at Airbnb had a new'
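
Under the hood, text generation repeatedly asks the model for a distribution over the next token, picks one, appends it, and repeats. The toy loop below illustrates that idea with a hand-written lookup table standing in for the model. `TOY_MODEL`, `toy_next_token`, and `generate` are invented for this sketch and have nothing to do with GPT-2's actual predictions; a real model also conditions on the whole context rather than only the last token. The loop always picks the most likely token (greedy decoding), whereas the pipeline may sample depending on its generation settings.

```python
# Hypothetical stand-in for a language model: maps the last token to
# probabilities for the next token.
TOY_MODEL = {
    "To": {"start": 0.9, "stop": 0.1},
    "start": {"a": 0.8, "the": 0.2},
    "a": {"startup": 0.7, "company": 0.3},
    "startup": {"<eos>": 1.0},
}

def toy_next_token(tokens):
    """Greedy choice: return the most probable next token given the last one."""
    candidates = TOY_MODEL.get(tokens[-1], {"<eos>": 1.0})
    return max(candidates, key=candidates.get)

def generate(prompt_tokens, max_new_tokens=10):
    """Append one token at a time until <eos> or the token budget is reached."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens)
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return tokens

print(" ".join(generate(["To"])))  # To start a startup
```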

Keep in mind that the model was fine-tuned on a very small corpus and that it isn't up to the standards of modern language models. It's a distilled version of GPT-2, but the great thing is that it's so small you can easily fine-tune it on your laptop. The process of fine-tuning state-of-the-art language models like Llama or Mistral is very similar, although you might want to switch to more parameter-efficient fine-tuning methods like LoRA or QLoRA, because otherwise you need a lot of compute to fine-tune these models with billions of parameters.
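
To give a feel for why parameter-efficient methods help, here is a back-of-the-envelope count for LoRA. LoRA freezes a weight matrix of shape `d x k` and trains two small matrices of shapes `d x r` and `r x k` instead. The layer size and rank below are illustrative assumptions, not taken from any particular model, and `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(d, k, r):
    """Parameters in the two low-rank factors (d x r and r x k)."""
    return r * (d + k)

d, k, r = 4096, 4096, 8          # hypothetical layer size and LoRA rank
full = d * k                     # parameters updated by full fine-tuning
lora = lora_trainable_params(d, k, r)

print(full)                      # 16777216
print(lora)                      # 65536
print(f"{lora / full:.2%}")      # 0.39%
```

For this one matrix, the trainable parameter count drops by a factor of 256, which is where most of the compute and memory savings come from.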