# Fine-tuning GPT-2 using HF Trainer

Following the tutorial here: [https://huggingface.co/docs/transformers/en/tasks/language_modeling](https://huggingface.co/docs/transformers/en/tasks/language_modeling)

 * Fine-tune DistilGPT2 on the ELI5 dataset (the `eli5_category` variant).

In [1]:
# HF login
import os
from huggingface_hub import login
from dotenv import load_dotenv

load_dotenv()

HF_TOKEN = os.getenv("HF_TOKEN")
login(HF_TOKEN)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/vijay/.cache/huggingface/token
Login successful


In [8]:
MODEL_NAME = "distilbert/distilgpt2"
DATASET_NAME = "eli5_category"

## Load ELI5 dataset

In [2]:
from datasets import load_dataset

eli5 = load_dataset(DATASET_NAME, split="train[:5000]")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Generating train split:   0%|          | 0/91772 [00:00<?, ? examples/s]

Generating validation1 split:   0%|          | 0/5446 [00:00<?, ? examples/s]

Generating validation2 split:   0%|          | 0/2375 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5411 [00:00<?, ? examples/s]

In [3]:
# split into train and test
eli5 = eli5.train_test_split(test_size=0.2)

In [6]:
eli5

DatasetDict({
    train: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers', 'title_urls', 'selftext_urls'],
        num_rows: 1000
    })
})

In [4]:
eli5["train"][0]

{'q_id': '73ufsx',
 'title': 'Why do men feel sick when punched or kicked in the testicles?',
 'selftext': '',
 'category': 'Biology',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dnt7lb5', 'dnt74h9'],
  'text': ['A good rule of thumb for questions about why the body works in a particular way is to ask, "How would this trait benefit our caveman ancestors?" The ape-man who has \'numb nuts\' isn\'t going to be particularly concerned with protecting them, leading him to whack them as he climbs over a rock, or leave them exposed during a fight, or just plain sit on them. Meanwhile, the ape-man with delicate danglers pays extra attention while climbing over that rock, is careful to guard them in a tussle, and positions them carefully when sitting down. Numb-Nuts\' accumulation of testicular injuries leads to decreased sperm count, and he has few, if any, offspring. His competitor, however, has no such trouble, and the genes for temperamental testicles are passed down through th

## Preprocessing

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [10]:
# flatten the nested columns in dataset
eli5 = eli5.flatten()
eli5["train"][0]

{'q_id': '73ufsx',
 'title': 'Why do men feel sick when punched or kicked in the testicles?',
 'selftext': '',
 'category': 'Biology',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dnt7lb5', 'dnt74h9'],
 'answers.text': ['A good rule of thumb for questions about why the body works in a particular way is to ask, "How would this trait benefit our caveman ancestors?" The ape-man who has \'numb nuts\' isn\'t going to be particularly concerned with protecting them, leading him to whack them as he climbs over a rock, or leave them exposed during a fight, or just plain sit on them. Meanwhile, the ape-man with delicate danglers pays extra attention while climbing over that rock, is careful to guard them in a tussle, and positions them carefully when sitting down. Numb-Nuts\' accumulation of testicular injuries leads to decreased sperm count, and he has few, if any, offspring. His competitor, however, has no such trouble, and the genes for temperamental testicles are passed down through

In [11]:
eli5

DatasetDict({
    train: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers.a_id', 'answers.text', 'answers.score', 'answers.text_urls', 'title_urls', 'selftext_urls'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['q_id', 'title', 'selftext', 'category', 'subreddit', 'answers.a_id', 'answers.text', 'answers.score', 'answers.text_urls', 'title_urls', 'selftext_urls'],
        num_rows: 1000
    })
})

#### Which text to use for fine-tuning: the `question`, the `answer`, or both?

* One important question is whether the model should be fine-tuned on the `question` and the `answer` together.
    * There is little benefit in training the model to generate the `question`, since the user supplies it at inference time.
    * However, the `answer` is conditioned on the `question`, so the `question` arguably belongs in the model input during training.
    * One option is to mask the `question` tokens in the loss and gradient calculation, so that only the `answer` tokens contribute to training.
    * This mask is not applied to the attention mask, because the `answer` tokens must still attend to the `question` tokens.
    * In the transformers library, this is typically done by setting the `labels` of the `question` tokens to `-100`, the index that the loss ignores.
* In this example, the model is fine-tuned only on the `answer` text.
    * This does not teach the model the `answer` to a given `question`.
    * It tunes the tone or style of the generated text toward ELI5 answers, which is the goal of this dataset.
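
The question-masking idea can be sketched in plain Python. This is a minimal illustration, not what this notebook does: the helper `build_masked_example` and the token ids are hypothetical, and it relies on the convention that a label of `-100` is skipped by PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_masked_example(question_ids, answer_ids):
    """Concatenate question and answer token ids and mask the question in the labels."""
    input_ids = list(question_ids) + list(answer_ids)
    # The attention mask stays all ones, so answer tokens can still attend to the question.
    attention_mask = [1] * len(input_ids)
    # Only answer positions keep their token id as a label and contribute to the loss.
    labels = [IGNORE_INDEX] * len(question_ids) + list(answer_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Hypothetical token ids, for illustration only.
example = build_masked_example(question_ids=[11, 12, 13], answer_ids=[21, 22])
print(example["labels"])  # [-100, -100, -100, 21, 22]
```

As far as I know, `DataCollatorForLanguageModeling` with `mlm=False` rebuilds `labels` from `input_ids`, so a custom label mask like this would need a collator that preserves the `labels` column.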


In [12]:

def preprocess_function(examples):
    '''Combine [] of text into single string and tokenize'''
    return tokenizer([" ".join(x) for x in examples["answers.text"]])


tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)

tokenized_eli5

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1213 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3944 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2683 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1478 > 1024). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1058 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1544 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2051 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1359 > 1024). Running this sequence through the model will result in indexing errors


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1000
    })
})

The dataset has sequences longer than the model's maximum input length (1024 tokens, according to the warnings above).

A second preprocessing function can now:

1. concatenate all the sequences
2. split the concatenated sequences into shorter chunks of `block_size` tokens, which should be shorter than the maximum input length and small enough to fit in GPU memory.

In [24]:
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder that does not fill a whole block. Padding could be
    # used instead if the model supported it; customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # IMPORTANT: set labels = input_ids. The labels are shifted by one position
    # inside the model's forward pass when the loss is computed, not by the Trainer.
    result["labels"] = result["input_ids"].copy()
    return result
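
On where the label shift happens: in Hugging Face causal language models such as GPT-2, the shift is applied inside the model's forward pass when `labels` are passed, by dropping the last position of the logits and the first position of the labels. The sketch below only illustrates that pairing with plain lists; `shift_for_next_token_loss` is a hypothetical helper, not a library function.

```python
def shift_for_next_token_loss(labels):
    """Pair each prediction position with the token that follows it."""
    # Position i predicts token i + 1, so the last position has no target
    # and the first token is never used as a target.
    prediction_positions = list(range(len(labels) - 1))
    targets = labels[1:]
    return list(zip(prediction_positions, targets))


labels = [50, 60, 70, 80]  # hypothetical token ids; labels are a copy of input_ids
pairs = shift_for_next_token_loss(labels)
print(pairs)  # [(0, 60), (1, 70), (2, 80)]
```

This is why `labels` can simply be a copy of `input_ids`: the alignment between a position and its next token is handled at loss time.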


In [25]:
lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

lm_dataset

Map (num_proc=4):   0%|          | 0/4000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10658
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2425
    })
})
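
The row count changes from 4000 to 10658 because each batch passed to `map` is concatenated and re-cut into blocks, so the output rows no longer correspond to input rows. A toy run makes this visible; `group_texts_toy` is a copy of the same logic with `block_size` as a parameter, and the token ids are made up for illustration.

```python
def group_texts_toy(examples, block_size):
    """Concatenate every column of the batch, then cut it into fixed-size blocks."""
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three sequences of lengths 3, 5 and 2 (10 tokens in total).
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
out = group_texts_toy(batch, block_size=4)
print(out["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Three input rows become two output rows, and the last two tokens are dropped. Block boundaries ignore the original sequence boundaries, so a block can contain the end of one answer and the start of the next.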

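Use `DataCollatorForLanguageModeling` for dynamic padding. Setting `mlm=False` selects causal (next-token) language modeling rather than masked language modeling. GPT-2 has no padding token, so the end-of-sequence token is reused as the padding token.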

In [26]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

Test a single batch.

In [33]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(lm_dataset["train"], shuffle=True, batch_size=16, collate_fn=data_collator)

# Fetch one batch and print the shape of each tensor, then stop.
batch = next(iter(train_dataloader))
print({k: v.shape for k, v in batch.items()})


## Train

In [36]:
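import torch
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

# Use Apple's MPS backend when available; otherwise fall back to the CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)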

In [37]:
model.device

device(type='mps', index=0)

In [38]:
training_args = TrainingArguments(
    output_dir=".model/gpt2-eli5-ft",
    evaluation_strategy="epoch",  # named `eval_strategy` in newer transformers releases
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()


  0%|          | 0/3999 [00:00<?, ?it/s]

{'loss': 3.8893, 'grad_norm': 9.267156600952148, 'learning_rate': 1.749937484371093e-05, 'epoch': 0.38}
{'loss': 3.1433, 'grad_norm': 5.945465564727783, 'learning_rate': 1.4998749687421857e-05, 'epoch': 0.75}


  0%|          | 0/304 [00:00<?, ?it/s]

{'eval_loss': 6.627174377441406, 'eval_runtime': 11.0753, 'eval_samples_per_second': 218.956, 'eval_steps_per_second': 27.449, 'epoch': 1.0}
{'loss': 2.9226, 'grad_norm': 7.047665596008301, 'learning_rate': 1.2498124531132784e-05, 'epoch': 1.13}
{'loss': 2.7529, 'grad_norm': 5.9029436111450195, 'learning_rate': 9.997499374843712e-06, 'epoch': 1.5}
{'loss': 2.6962, 'grad_norm': 7.686512470245361, 'learning_rate': 7.496874218554639e-06, 'epoch': 1.88}


  0%|          | 0/304 [00:00<?, ?it/s]

{'eval_loss': 6.5005316734313965, 'eval_runtime': 10.1204, 'eval_samples_per_second': 239.615, 'eval_steps_per_second': 30.038, 'epoch': 2.0}
{'loss': 2.5932, 'grad_norm': 7.153013706207275, 'learning_rate': 4.996249062265567e-06, 'epoch': 2.25}
{'loss': 2.5405, 'grad_norm': 8.326250076293945, 'learning_rate': 2.4956239059764944e-06, 'epoch': 2.63}


  0%|          | 0/304 [00:00<?, ?it/s]

{'eval_loss': 6.4997334480285645, 'eval_runtime': 10.1108, 'eval_samples_per_second': 239.843, 'eval_steps_per_second': 30.067, 'epoch': 3.0}
{'train_runtime': 775.6023, 'train_samples_per_second': 41.225, 'train_steps_per_second': 5.156, 'train_loss': 2.8811225633556825, 'epoch': 3.0}


TrainOutput(global_step=3999, training_loss=2.8811225633556825, metrics={'train_runtime': 775.6023, 'train_samples_per_second': 41.225, 'train_steps_per_second': 5.156, 'train_loss': 2.8811225633556825, 'epoch': 3.0})

## Evaluate

In [None]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
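
Perplexity is the exponential of the mean per-token cross-entropy loss, so it can be checked by hand against the logged `eval_loss`. The sketch below uses the epoch 3 value from the training log above; `perplexity` is a small helper defined here for illustration.

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) into perplexity."""
    return math.exp(mean_cross_entropy)


final_eval_loss = 6.4997334480285645  # eval_loss logged at epoch 3.0
print(f"Perplexity: {perplexity(final_eval_loss):.2f}")  # roughly 665

# Sanity check: a model that is uniform over GPT-2's 50257-token vocabulary
# has loss ln(50257) and therefore a perplexity equal to the vocabulary size.
uniform_ppl = perplexity(math.log(50257))
print(f"Uniform baseline: {uniform_ppl:.0f}")
```

A lower perplexity means the model spreads its probability mass over fewer candidate next tokens on the evaluation set.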