# Fine Tune GPT-2 For Causal Language Modeling

There are two categories of language modeling: "causal" and "masked." Causal models are commonly employed for text generation tasks. These models can be utilized in various creative applications, such as crafting personalized text adventures or serving as smart coding assistants like Copilot or CodeParrot.

Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.
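As a minimal, framework-free sketch of what "can only attend to tokens on the left" means, the snippet below builds a causal attention mask for a toy sequence. The function name `causal_mask` and the example tokens are illustrative assumptions, not GPT-2's actual implementation.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len mask; mask[i][j] == 1 means position i may attend to position j."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

tokens = ["Just", "stop", "your", "crying"]  # toy example tokens
mask = causal_mask(len(tokens))
for token, row in zip(tokens, mask):
    print(f"{token:>7}: {row}")
```

Each row only has ones up to and including its own position, so no token can look at what comes after it.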

In this notebook, we fine-tune a pretrained text-generation model (GPT-2) on a dataset of song lyrics.

## 1. Installations

In [1]:
!pip install transformers==4.35.2
!pip install datasets==2.15.0
!pip install evaluate==0.4.1

Collecting transformers==4.35.2
  Using cached transformers-4.35.2-py3-none-any.whl.metadata (123 kB)
Using cached transformers-4.35.2-py3-none-any.whl (7.9 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.37.0
    Uninstalling transformers-4.37.0:
      Successfully uninstalled transformers-4.37.0
Successfully installed transformers-4.35.2
Collecting datasets==2.15.0
  Using cached datasets-2.15.0-py3-none-any.whl.metadata (20 kB)
Collecting pyarrow-hotfix (from datasets==2.15.0)
  Using cached pyarrow_hotfix-0.6-py3-none-any.whl.metadata (3.6 kB)
Collecting fsspec<=2023.10.0,>=2023.1.0 (from fsspec[http]<=2023.10.0,>=2023.1.0->datasets==2.15.0)
  Using cached fsspec-2023.10.0-py3-none-any.whl.metadata (6.8 kB)
Downloading datasets-2.15.0-py3-none-any.whl (521 kB)


## 2. Imports

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

Set up the Hugging Face login and the Weights & Biases environment variables:

In [3]:
user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "fine-tuned-models"
os.environ["WANDB_NOTES"] = "fine-tuning GPT-2 model"
os.environ["WANDB_NAME"] = "ft-GPT2-with-lyrics"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## 3. Load Dataset
We start by loading the `train` split of the `Nicolas-BZRD/English_French_Songs_Lyrics_Translation_Original` dataset. We will then narrow it down to a small subset, which gives us a chance to experiment and make sure everything works before spending more time training on more data.

In [4]:
from datasets import load_dataset

lyrics= load_dataset("Nicolas-BZRD/English_French_Songs_Lyrics_Translation_Original", split="train")
print(lyrics)

Dataset({
    features: ['artist_name', 'album_name', 'year', 'title', 'number', 'original_version', 'french_version', 'language'],
    num_rows: 99289
})


Select the English lyrics:

In [5]:
en_lyrics = lyrics.filter(lambda sample: sample['language'] == "en")
print(en_lyrics)

Dataset({
    features: ['artist_name', 'album_name', 'year', 'title', 'number', 'original_version', 'french_version', 'language'],
    num_rows: 75786
})


We use a combination of Harry Styles and The Weeknd songs:

In [6]:
hw_lyrics = en_lyrics.filter(lambda sample: sample['artist_name'] == "The Weeknd" or sample['artist_name'] == "Harry Styles" )

In [7]:
hw_lyrics

Dataset({
    features: ['artist_name', 'album_name', 'year', 'title', 'number', 'original_version', 'french_version', 'language'],
    num_rows: 177
})

Split the dataset into train and test sets with the `train_test_split` method:

In [8]:
hw_lyrics = hw_lyrics.train_test_split(test_size=0.2)
print(hw_lyrics)

DatasetDict({
    train: Dataset({
        features: ['artist_name', 'album_name', 'year', 'title', 'number', 'original_version', 'french_version', 'language'],
        num_rows: 141
    })
    test: Dataset({
        features: ['artist_name', 'album_name', 'year', 'title', 'number', 'original_version', 'french_version', 'language'],
        num_rows: 36
    })
})
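The split sizes shown above are consistent with rounding the test share up and giving the remainder to the train set. The snippet below is only a sketch of that arithmetic under this assumption, not the library's own code.

```python
import math

n_rows = 177       # rows in hw_lyrics before the split
test_size = 0.2    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_rows)  # test share rounded up
n_train = n_rows - n_test               # the remainder goes to train
print(n_train, n_test)
```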


In [9]:
print(hw_lyrics["train"][0])

{'artist_name': 'Harry Styles', 'album_name': 'Harry Styles', 'year': 2017, 'title': 'Sign Of The Times', 'number': 2, 'original_version': 'Just stop your crying, \nIt’s the sign of the times.**\nWelcome to the final show, \nHope you’re wearing your best clothes.\nYou can’t bribe the door on your way to the sky,\nYou look pretty good down here,\nBut you ain’t really good.\nWe’ve never learned like we’ve been here before,\nWhy are we always stuck at running from,\nThe bullets, the bullets.\nJust stop your crying, \nIt’s the sign of the times.\nWe gotta get away from here,\nWe gotta get away from here.\nJust stop your crying, \nIt will be alright.\nThey told me the end is near,\nWe gotta get away from here.\nJust stop your crying,\nHave the time of your life,\nBreaking through the atmosphere,\nThings are pretty good from here.\nRemember everything will be alright,\nWe could meet again somewhere,\nSomewhere far away from here.\nWe’ve never learned like we’ve been here before,\nWhy are we 

Although the dataset has many text fields, language modeling does not need labels: the next token in the sequence serves as the label.

This is usually described as an unsupervised (more precisely, self-supervised) task, where the model predicts the next token in a sequence of tokens without the need for labeled data. This is what allows language models to be trained on large amounts of raw text with little to no annotation.
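To make the "no labels needed" point concrete, here is a small sketch of how training pairs fall out of the token sequence itself. The helper `next_token_pairs` and the token ids are hypothetical and only illustrate the idea.

```python
def next_token_pairs(token_ids):
    """Pair each token with the token that follows it."""
    inputs = token_ids[:-1]   # everything except the last token
    targets = token_ids[1:]   # the same sequence shifted left by one
    return list(zip(inputs, targets))

toy_ids = [10, 11, 12, 13]  # hypothetical token ids
pairs = next_token_pairs(toy_ids)
print(pairs)
```

Every target is simply the next element of the input, so the raw text supplies its own supervision.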

## 4. Data Pre-processing

Load the GPT-2 tokenizer to process the `original_version` field. GPT-2 has no padding token, so we reuse the end-of-sequence token for padding:

In [10]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token=tokenizer.eos_token

In [11]:
hw_lyrics = hw_lyrics.flatten()
hw_lyrics["train"][0]

{'artist_name': 'Harry Styles',
 'album_name': 'Harry Styles',
 'year': 2017,
 'title': 'Sign Of The Times',
 'number': 2,
 'original_version': 'Just stop your crying, \nIt’s the sign of the times.**\nWelcome to the final show, \nHope you’re wearing your best clothes.\nYou can’t bribe the door on your way to the sky,\nYou look pretty good down here,\nBut you ain’t really good.\nWe’ve never learned like we’ve been here before,\nWhy are we always stuck at running from,\nThe bullets, the bullets.\nJust stop your crying, \nIt’s the sign of the times.\nWe gotta get away from here,\nWe gotta get away from here.\nJust stop your crying, \nIt will be alright.\nThey told me the end is near,\nWe gotta get away from here.\nJust stop your crying,\nHave the time of your life,\nBreaking through the atmosphere,\nThings are pretty good from here.\nRemember everything will be alright,\nWe could meet again somewhere,\nSomewhere far away from here.\nWe’ve never learned like we’ve been here before,\nWhy ar

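Each `original_version` value is already a single string, so no joining is needed before tokenizing. Note that the function below calls `" ".join(x)` on each string, which inserts a space between every character; this is why the `input_ids` and the generated text later in this notebook look character-spaced. To tokenize the lyrics as ordinary text, pass `samples["original_version"]` to the tokenizer directly.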

In [12]:
def preprocess_function(samples):
    return tokenizer([" ".join(x) for x in samples["original_version"]], max_length=2048, padding=True, truncation=True)
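A quick way to see what `" ".join(x)` does is to compare its behavior on a single string with its behavior on a list of strings. The variable names below are illustrative only.

```python
lyric_as_string = "Just stop"                  # one string, like a single `original_version` value
lyric_as_list = ["Just stop", "your crying"]   # a list of strings

spaced = " ".join(lyric_as_string)  # iterates over the characters of the string
joined = " ".join(lyric_as_list)    # iterates over the items of the list
print(repr(spaced))
print(repr(joined))
```

Joining a list merges its items into one text, while joining a string puts a space between its characters.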

We need to apply this preprocessing function over the entire dataset.

In [13]:
tokenized_hw_lyrics= hw_lyrics.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=hw_lyrics["train"].column_names,
)
tokenized_hw_lyrics

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 141
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 36
    })
})

Next, we use a second preprocessing function to make sure the token sequences fit within the model's maximum input length:

- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough for your GPU RAM.

In [14]:
block_size=128

def group_texts(examples):
    concatenated_examples={k: sum(examples[k], []) for k in examples.keys()}
    total_length=len(concatenated_examples[list(examples.keys())[0]])
    if total_length>=block_size:
        total_length=(total_length//block_size)* block_size
    # Split by chunks of block size
    result={
        k: [t[i: i+block_size] for i in range(0, total_length, block_size)]
        for k,t in concatenated_examples.items()
    }
    
    result["labels"]=result["input_ids"].copy()
    return result
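The effect of the concatenate-then-chunk step is easier to see on a tiny input. The sketch below uses a stand-alone variant, `group_texts_toy`, that takes `block_size` as a parameter; the function name and the toy batch are assumptions made for illustration.

```python
def group_texts_toy(examples, block_size):
    # Concatenate every column's lists into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result

toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
grouped = group_texts_toy(toy_batch, block_size=4)
print(grouped["input_ids"])
```

Ten tokens with a block size of 4 yield two full blocks; the last two tokens are dropped.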

Apply the `group_texts` function over the entire dataset:

In [15]:
lm_dataset = tokenized_hw_lyrics.map(group_texts, batched=True, num_proc=4)

In [16]:
print(lm_dataset)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2256
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 576
    })
})
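These row counts are consistent with every tokenized example having been padded or truncated to 2048 tokens before grouping. That reading is an inference from the numbers, not something the notebook states, and the arithmetic can be sketched as:

```python
max_length = 2048   # max_length used in preprocess_function
block_size = 128    # block_size used in group_texts
blocks_per_song = max_length // block_size

train_blocks = 141 * blocks_per_song  # 141 training songs
test_blocks = 36 * blocks_per_song    # 36 test songs
print(blocks_per_song, train_blocks, test_blocks)
```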


In [17]:
print(lm_dataset["train"][0])

{'input_ids': [41, 334, 264, 256, 220, 220, 264, 256, 267, 279, 220, 220, 331, 267, 334, 374, 220, 220, 269, 374, 331, 1312, 299, 308, 837, 220, 220, 220, 198, 314, 256, 564, 247, 264, 220, 220, 256, 289, 304, 220, 220, 264, 1312, 308, 299, 220, 220, 267, 277, 220, 220, 256, 289, 304, 220, 220, 256, 1312, 285, 304, 264, 764, 1635, 1635, 220, 198, 370, 304, 300, 269, 267, 285, 304, 220, 220, 256, 267, 220, 220, 256, 289, 304, 220, 220, 277, 1312, 299, 257, 300, 220, 220, 264, 289, 267, 266, 837, 220, 220, 220, 198, 367, 267, 279, 304, 220, 220, 331, 267, 334, 564, 247, 374, 304, 220, 220, 266, 304, 257, 374, 1312, 299, 308, 220, 220, 331, 267, 334, 374], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

Here we use dynamic padding: the data collator pads the sequences to the longest length in each batch during collation, instead of padding the whole dataset to the maximum length.
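The idea of padding per batch can be sketched in plain Python. The helper `pad_batch` below is hypothetical and is not the collator's implementation; it assumes the GPT-2 end-of-sequence id 50256 as the padding id and uses -100 as the label value that the loss ignores.

```python
def pad_batch(sequences, pad_id, label_ignore_id=-100):
    """Pad each sequence to the longest length in this batch only."""
    longest = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = longest - len(seq)
        batch["input_ids"].append(seq + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Padded positions get an ignore value so they do not contribute to the loss.
        batch["labels"].append(seq + [label_ignore_id] * n_pad)
    return batch

toy_batch = pad_batch([[5, 6, 7], [8, 9]], pad_id=50256)
print(toy_batch)
```

Only the shorter sequence is extended, and only up to the length of the longest sequence in the same batch.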

In [18]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token=tokenizer.eos_token
# Use the end-of-sequence token as the padding token and set `mlm=False`.
# With `mlm=False`, the collator copies the inputs into the labels;
# the model shifts them by one position internally when computing the loss.
data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)



## 5. Train

In [19]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("gpt2")

In [20]:
training_args=TrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    evaluation_strategy="epoch",
    save_strategy="no",
    logging_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=30,
    weight_decay=0.01,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME"),
    push_to_hub=False,
)

trainer=Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

wandb: Currently logged in as: surajkark. Use `wandb login --relogin` to force relogin


You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


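| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 2.0058 | 1.529321 |
| 2 | 1.5797 | 1.381351 |
| 3 | 1.4549 | 1.309338 |
| 4 | 1.3837 | 1.266117 |
| 5 | 1.3354 | 1.23283 |
| 6 | 1.2976 | 1.214228 |
| 7 | 1.2678 | 1.197299 |
| 8 | 1.2401 | 1.182472 |
| 9 | 1.2168 | 1.173725 |
| 10 | 1.1971 | 1.160442 |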


TrainOutput(global_step=2130, training_loss=1.1961356722692928, metrics={'train_runtime': 1667.2868, 'train_samples_per_second': 40.593, 'train_steps_per_second': 1.278, 'total_flos': 4421061181440000.0, 'train_loss': 1.1961356722692928, 'epoch': 30.0})
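The reported `global_step=2130` can be reproduced with simple arithmetic if training ran on two devices. The device count is an assumption inferred from the step count; it is not stated anywhere in the notebook.

```python
import math

train_rows = 2256        # rows in lm_dataset["train"]
per_device_batch = 16    # per_device_train_batch_size
num_devices = 2          # ASSUMPTION: not stated in the notebook
epochs = 30              # num_train_epochs

# The last, partially filled batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(train_rows / (per_device_batch * num_devices))
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```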

In [21]:
trainer.save_model(os.getenv("WANDB_NAME"))

In [22]:
loaded_model = AutoModelForCausalLM.from_pretrained(os.getenv("WANDB_NAME"))
print(loaded_model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)


In [40]:
trainer.push_to_hub(os.getenv("WANDB_NAME"))
tokenizer.push_to_hub(os.getenv("WANDB_NAME"))

CommitInfo(commit_url='https://huggingface.co/surajkarki/ft-GPT2-with-lyrics/commit/57b59036418ce3baa6b8a47fd46443ea96295942', commit_message='Upload tokenizer', commit_description='', oid='57b59036418ce3baa6b8a47fd46443ea96295942', pr_url=None, pr_revision=None, pr_num=None)

## 6. Evaluate

In [25]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")



Perplexity: 3.04
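Perplexity is the exponential of the average cross-entropy loss, so it can be computed for any logged validation loss. The sketch below applies the same formula to two losses copied from the training log and inverts it for the reported perplexity; the dictionary and variable names are illustrative.

```python
import math

# Validation losses copied from the training log (epochs 1 and 10).
validation_loss = {1: 1.529321, 10: 1.160442}

perplexity = {epoch: math.exp(loss) for epoch, loss in validation_loss.items()}
for epoch, ppl in perplexity.items():
    print(f"epoch {epoch}: perplexity {ppl:.2f}")

# Inverting the formula: the loss that corresponds to a perplexity of 3.04.
implied_loss = math.log(3.04)
print(f"implied eval loss: {implied_loss:.3f}")
```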


## 7. Inference

In [66]:
prompt1="Remember that time"

In [67]:
from transformers import pipeline

generator = pipeline("text-generation", model=os.getenv("WANDB_NAME"))

generator(prompt1)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Remember that time I d on't   h a v e   y o u   m a k e   m y   f a l l \n A n d   y o u   b e"}]

Tokenize the text and return the input_ids as PyTorch tensors:

In [68]:
prompt2 = "I'm just tryna live life"

In [69]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token=tokenizer.eos_token
inputs=tokenizer(prompt2, return_tensors="pt").input_ids

In [70]:
from transformers import AutoModelForCausalLM

model=AutoModelForCausalLM.from_pretrained(os.getenv("WANDB_NAME"))
outputs=model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=5, top_p=0.95)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [71]:
tokenizer.batch_decode(outputs, skip_special_tokens=True)

["I'm just tryna live life's   a   l e a r n   a g a i n \n I   d on't   w a n n a   b e   t h e   w a y \n A n d   I   c a n   f e e l   y o u   t h i n k i n'  i t's   l o v e   t o"]
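The `top_k` and `top_p` arguments restrict which tokens can be sampled at each step. The function below is a conceptual sketch on a toy distribution, not the library's implementation; `top_k_top_p_filter` and the probabilities are hypothetical.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Return {token_id: renormalized probability} after top-k, then top-p filtering."""
    # Rank token ids from most to least likely and keep the top_k best.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / total
        # Stop once the kept tokens cover at least top_p of the probability mass.
        if cumulative >= top_p:
            break
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}

toy_probs = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]  # hypothetical next-token distribution
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)
```

Sampling then draws the next token from the surviving candidates only, which trades diversity for coherence as `top_k` and `top_p` get smaller.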

## 8. Conclusion
To get better results, we need to train the model for more epochs.