<a href="https://colab.research.google.com/github/zetavg/LLM-Research/blob/8130726/Minimal_Example_Fine_tuning_a_Transformers_Causal_LM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Minimal Example: Fine-tuning a Transformers Causal LM

A minimal example of fine-tuning a causal language model (LLaMA, GPT-J, etc.) with the 🤗 Transformers `Trainer`.

Run the code cells one by one and see their outputs.

(For an even more minimal version of this, check out the [Very Minimal Example: Fine-tuning a Transformers Causal LM](https://github.com/zetavg/LLM-Research/blob/39511ae/Very_Minimal_Example_Fine_tuning_a_Transformers_Causal_LM.ipynb).)

## Install Dependencies

(~30sec)

In [1]:
!pip install torch transformers==4.28.1 datasets==2.12.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Get the Device Type

Detect whether a GPU is available so that subsequent code can place the model and input tensors on the correct device. (~10sec)

In [2]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

## Load the Model and Tokenizer

(~10 sec)

In [3]:
import gc
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = None
model = None

# Here we use a relatively small model. Training larger models on Colab
# quickly runs into CUDA out-of-memory errors.
tokenizer_name = "EleutherAI/pythia-70m"
model_name = "EleutherAI/pythia-70m"

def get_tokenizer():
    clear_cache()
    print('Loading tokenizer...')
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    # if no pad token, set it to eos
    if tokenizer.pad_token is None:
        print(
            f"Tokenizer has no pad_token set, setting it to eos_token ({tokenizer.eos_token}).")
        tokenizer.pad_token = tokenizer.eos_token

    print('Tokenizer loaded.')
    return tokenizer

def get_model():
    clear_cache()
    print('Loading model...')
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model = model.to(device)  # move to device (GPU if available)
    print('Model loaded.')
    return model


def clear_cache():
    # Free memory that is no longer referenced, to reduce GPU RAM usage.
    # gc.collect() drops unreachable Python objects; empty_cache() then
    # releases PyTorch's cached, unused GPU memory blocks.
    gc.collect()
    torch.cuda.empty_cache()


model = get_model()
tokenizer = get_tokenizer()


Loading model...
Model loaded.
Loading tokenizer...


Using pad_token, but it is not set yet.


Tokenizer has no pad_token set, setting it to eos_token (<|endoftext|>).
Tokenizer loaded.


## Test the Model Before Training

In [4]:
# Set to evaluation mode
model.eval()
print("Model training:", model.training, "(should be False)")

# Tokenize the prompt into tensor of token IDs
prompt = "This is"
input_ids = tokenizer(
    prompt,
    return_tensors="pt"  # Let it return PyTorch (`pt`) tensors
).input_ids
# Send values to device (GPU)
input_ids = input_ids.to(device)

# Let the model generate the completion
output_sequences = model.generate(input_ids, max_length=32)
output_ids = output_sequences[0]
generated_text = tokenizer.decode(output_ids)

# Print the results
print()
print("input_ids:", input_ids)
print("output_ids:", output_ids)
print()
print("prompt:", prompt)
print("generated_text:", generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Model training: False (should be False)

input_ids: tensor([[1552,  310]], device='cuda:0')
output_ids: tensor([1552,  310,  247, 1270, 1650,  273,  849,  436,  310, 1469,  281,  320,
         247, 1270, 1650,  273,  849,  436,  310, 1469,  281,  320,  247, 1270,
        1650,  273,  849,  436,  310, 1469,  281,  320], device='cuda:0')

prompt: This is
generated_text: This is a great example of how this is going to be a great example of how this is going to be a great example of how this is going to be
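
The warning above appears because only `input_ids` is passed to `generate`. Below is a minimal sketch of one way to avoid it, assuming the `model`, `tokenizer`, and `device` objects from the earlier cells. The helper name `generate_text` is hypothetical and not part of the original notebook.

```python
def generate_text(model, tokenizer, prompt, device, max_length=32):
    """Generate a completion, passing the attention mask and pad token explicitly."""
    # The tokenizer returns both `input_ids` and `attention_mask`;
    # `.to(device)` moves every tensor in the encoding at once.
    encoding = tokenizer(prompt, return_tensors="pt").to(device)
    output_sequences = model.generate(
        input_ids=encoding["input_ids"],
        attention_mask=encoding["attention_mask"],
        pad_token_id=tokenizer.pad_token_id,  # set to eos_token by get_tokenizer()
        max_length=max_length,
    )
    return tokenizer.decode(output_sequences[0])

# Usage, with the objects loaded above (hypothetical call):
# print(generate_text(model, tokenizer, "This is", device))
```

For a single unpadded prompt the generated text should be the same as before; passing the mask matters mostly when several prompts of different lengths are padded into one batch.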


## The Training

In [5]:
# If the model behaves weirdly during or after training,
# uncomment the following lines to reload a fresh model instance instead of
# training one that has already been used to generate text. This might fix the issue.

# model = get_model()
# tokenizer = get_tokenizer()

### Prepare Train Data

Normally we would load the dataset from elsewhere using the [`load_dataset()` function](https://huggingface.co/docs/datasets/loading), but for simplicity and transparency we'll just define our training data as a small list here.

In [6]:
from datasets import Dataset

items = [
    {'text': "This is a great language model. Meow meow meow, meow meow. Oh, I'm not a cat. Meow."},
    {'text': "The quick brown fox jumps over the lazy dog."},
    {'text': "A book can't decide everything. For greater hope, I just know lots more nuances of possibilities."},
]

ds = Dataset.from_list(items)
ds

Dataset({
    features: ['text'],
    num_rows: 3
})

Before feeding our data into the trainer, we need to convert each item into tokenized `input_ids` and `labels`.

In [7]:
def tokenize_data(data_point):
    batch_encoding = tokenizer(
        # See: https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#tokenizer
        data_point['text'],
        max_length=32,
        truncation=True,
        padding="max_length",
        # return_tensors="pt"  # This is handled by the trainer.
    )
    batch_encoding["labels"] = batch_encoding["input_ids"].copy()
    # This is handled by the trainer.
    # batch_encoding = {k: v.to(device) for k, v in batch_encoding.items()}
    return batch_encoding

train_data = ds.map(tokenize_data)

import json
print("Sample item:")
print(json.dumps(train_data[1]).replace('{', '{\n  ').replace('",', '",\n ').replace('],', '],\n ').replace('}', '\n}'))

Sample item:
{
  "text": "The quick brown fox jumps over the lazy dog.",
  "input_ids": [510, 3158, 8516, 30013, 27287, 689, 253, 22658, 4370, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  "attention_mask": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  "labels": [510, 3158, 8516, 30013, 27287, 689, 253, 22658, 4370, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
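
In the sample item above, `labels` is a plain copy of `input_ids`, so the padded positions (token ID 0) also count toward the loss. A common variant, shown here only as a sketch, replaces the labels at padded positions with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The helper name `mask_padding_labels` is hypothetical and not part of the original notebook.

```python
IGNORE_INDEX = -100  # default `ignore_index` of PyTorch's cross-entropy loss

def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        token_id if mask == 1 else ignore_index
        for token_id, mask in zip(input_ids, attention_mask)
    ]

# Shortened version of the sample item above (3 padding positions instead of 22).
input_ids = [510, 3158, 8516, 30013, 27287, 689, 253, 22658, 4370, 15, 0, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
# [510, 3158, 8516, 30013, 27287, 689, 253, 22658, 4370, 15, -100, -100, -100]
```

Inside `tokenize_data`, this would replace the `.copy()` line with `mask_padding_labels(batch_encoding["input_ids"], batch_encoding["attention_mask"])`. One trade-off to keep in mind: because the pad token here is the end-of-text token, masking every padded position also means the model is no longer trained to emit end-of-text after the sentence.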


### Set Training Arguments

See [the docs](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) for more info about the arguments.

In [8]:
from transformers import TrainingArguments

# See: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments
training_args = TrainingArguments(
    output_dir="./training_output",
    overwrite_output_dir=True,
    num_train_epochs=7,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    optim="adamw_torch",
    learning_rate=5e-5,
    logging_steps=5
)
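
These arguments also determine how many optimizer steps the run takes and when the loss is logged. The following back-of-the-envelope sketch assumes a single device and the three-item dataset used in this notebook. The function name `count_optimizer_steps` is hypothetical, and the formula approximates what the trainer computes rather than reproducing its code.

```python
import math

def count_optimizer_steps(num_examples, num_epochs, batch_size, grad_accum_steps, num_devices=1):
    """Estimate how many optimizer updates a run performs (single-process sketch)."""
    batches_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * num_epochs

total_steps = count_optimizer_steps(
    num_examples=3, num_epochs=7, batch_size=1, grad_accum_steps=1
)
logged_steps = list(range(5, total_steps + 1, 5))  # logging_steps=5
print(total_steps)   # 21
print(logged_steps)  # [5, 10, 15, 20]
```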

### Create the Trainer

In [9]:
from transformers import Trainer

#### Some advanced stuff - just for printing more info while training

⬇️ You can just press the play button to execute this cell and come back later to see what it does.

In [10]:
import time

# @markdown Here the `transformers.Trainer` class is subclassed, with its `training_step` and `compute_loss` methods overridden to print additional information showing what's going on under the hood.
class CustomTrainer(Trainer):
    def training_step(self, model, inputs):
        tensor = super().training_step(model, inputs)

        # Do not print info on the first step to avoid
        # messing up the tqdm progress bar.
        if hasattr(self, "not_first_step"):
            time.sleep(3)  # Just for visual effects - so that we can see each step one by one.
            print("Step completed.")
            print()

        self.not_first_step = True
        return tensor

    def compute_loss(self, model, inputs, return_outputs=False):
        loss, outputs = super().compute_loss(
            model, inputs,
            # force the parent `compute_loss` to return outputs
            # so we can inspect them
            return_outputs=True
        )

        # Do not print info on the first step to avoid
        # messing up the tqdm progress bar.
        if hasattr(self, "not_first_step"):
            # Preview what the model has predicted
            logits = outputs.logits
            # Get the token IDs with the highest probabilities
            token_ids = logits.argmax(dim=-1).squeeze().tolist()
            generated_text = tokenizer.decode(token_ids)  # Decode the token ids
            print("Target:", tokenizer.decode(inputs['labels'][0][1:]))
            print("Actual:", generated_text)

            # Print information about the train step
            print("Loss:", loss)

        return (loss, outputs) if return_outputs else loss

# @markdown Normally, code like this cell would not appear in an actual training script unless you want to override the trainer for custom behavior.
Trainer = CustomTrainer

#### Create the Trainer

Create the trainer with the defined `training_args`. See [the docs](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) for more info about the arguments.

In [11]:
# See: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer
trainer = Trainer(
    model=model,
    train_dataset=train_data,
    args=training_args
)

### Start the Training!

Since we have overridden the `Trainer` class in the "Some advanced stuff" block above, some details that are normally not shown will be printed during training.

Also, a delay is added between steps so that each step can be inspected one by one.

You can observe the loss dropping and the "Actual" output getting closer to the "Target" text on each step.
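
A note on reading the "Target" and "Actual" lines: in a causal LM, the logits at position i are scored against the label at position i + 1, which is why the custom trainer prints `labels[0][1:]` as the target. The pure-Python sketch below illustrates that shifted cross-entropy. The function name `causal_lm_loss` is hypothetical; the real computation happens on tensors inside the model's forward pass.

```python
import math

def causal_lm_loss(logits, labels, ignore_index=-100):
    """Mean next-token cross-entropy for one sequence.

    logits: list of per-position score lists (seq_len x vocab_size)
    labels: list of token IDs (seq_len)
    """
    total, count = 0.0, 0
    # Position i predicts token i + 1, so drop the last logits row
    # and the first label before pairing them up.
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue  # ignored positions do not contribute to the loss
        # Negative log-softmax of the target token, computed stably.
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_sum_exp - scores[target]
        count += 1
    if count == 0:
        raise ValueError("no label positions left to score")
    return total / count

# Toy example: vocabulary of 4 tokens, sequence of 3 tokens.
labels = [2, 0, 3]
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
good_logits = [
    [10.0, 0.0, 0.0, 0.0],  # position 0 predicts token 0 (= labels[1])
    [0.0, 0.0, 0.0, 10.0],  # position 1 predicts token 3 (= labels[2])
    [0.0, 0.0, 0.0, 0.0],   # last position has no next token to predict
]
print(causal_lm_loss(uniform_logits, labels))  # log(4), about 1.386
print(causal_lm_loss(good_logits, labels))     # close to 0
```

As the predictions at each position concentrate on the next token, the loss approaches zero, which is the trend visible in the training log.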

In [12]:
trainer.train()
model.save_pretrained("./trained_model")

| Step | Training Loss |
|------|---------------|
| 5    | 3.1586        |
| 10   | 1.1832        |
| 15   | 0.3323        |
| 20   | 0.1012        |


Target:  quick brown fox jumps over the lazy dog.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
Actual: <|endoftext|> and and can over the moon,.<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>
Loss: tensor(1.7123, device='cuda:0', grad_fn=<NllLossBackward0>)
Step completed.

Target:  is a great language model. Meow meow meow, meow meow. Oh, I'm not a cat. Meow.<|endoftext|><|endoftext|>
Actual: <|endoftext|> a great book..<|endoftext|><|endoftext|>. with.ow I Iow,ow,<|endoftext|>, I know just a big.<|en

## Test the Trained Model

Now that the model has been trained, we can set it to evaluation mode and use the `generate` function to test it.

In [13]:
# Set to evaluation mode
model.eval()
print("Model training:", model.training, "(should be False)")

prompt = "This is a"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

output_sequences = model.generate(input_ids, max_length=32)
output_ids = output_sequences[0]
generated_text = tokenizer.decode(output_ids)

print()
print("input_ids:", input_ids)
print("output_ids:", output_ids)
print()
print("prompt:", prompt)
print("generated_text:", generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Model training: False (should be False)

input_ids: tensor([[1552,  310,  247]], device='cuda:0')
output_ids: tensor([1552,  310,  247, 1270, 3448, 1566,   15, 1198, 3687, 8794, 1972,  273,
        8794, 1972,  273, 8794, 1972,  273, 8794, 1972,  273, 8794, 1972,  273,
        8794, 1972,  273, 8794, 1972,  273, 8794, 1972], device='cuda:0')

prompt: This is a
generated_text: This is a great language model. For greater nuances of nuances of nuances of nuances of nuances of nuances of nuances of nuances
