<a href="https://colab.research.google.com/github/zetavg/LLM-Research/blob/2b7282b/Minimal_Example_Fine_tuning_a_Transformers_Causal_LM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Very Minimal Example: Fine-tuning a Transformers Causal LM

A very minimal example of fine-tuning a causal language model (LLaMA, GPT-J, etc.): training the model to generate a single sentence.

Run the code cells one by one and see their outputs.

## Install Dependencies

(~30sec)

In [1]:
!pip install torch transformers==4.28.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.28.1
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1


## Get the Device Type

So that subsequent code can place the model and tensors on the correct device. (~10sec)

In [2]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

## Load the Model and Tokenizer

(~10 sec)

In [3]:
import gc
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = None
model = None

# Here we use a relatively small model. Training larger models on Colab will
# quickly run into CUDA out-of-memory errors.
tokenizer_name = "EleutherAI/pythia-70m"
model_name = "EleutherAI/pythia-70m"

def get_tokenizer():
    clear_cache()
    print('Loading tokenizer...')
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    # if no pad token, set it to eos
    if tokenizer.pad_token is None:
        print(
            f"Tokenizer has no pad_token set, setting it to eos_token ({tokenizer.eos_token}).")
        tokenizer.pad_token = tokenizer.eos_token

    print('Tokenizer loaded.')
    return tokenizer

def get_model():
    clear_cache()
    print('Loading model...')
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model = model.to(device)  # move to device (GPU if available)
    print('Model loaded.')
    return model


def clear_cache():
    # Free unreferenced Python objects, then release the cached GPU memory
    # that PyTorch is no longer using, to avoid eating up GPU RAM.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


model = get_model()
tokenizer = get_tokenizer()


Loading model...


Downloading (…)lve/main/config.json:   0%|          | 0.00/567 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/166M [00:00<?, ?B/s]

Model loaded.
Loading tokenizer...


Downloading (…)okenizer_config.json:   0%|          | 0.00/396 [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

Using pad_token, but it is not set yet.


Tokenizer has no pad_token set, setting it to eos_token (<|endoftext|>).
Tokenizer loaded.
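
The comment in the cell above notes that larger models quickly run out of GPU memory. A rough back-of-the-envelope estimate shows why. The sketch below assumes full fp32 training with AdamW, where every parameter needs room for its weight, its gradient, and two optimizer moment estimates. The function name `training_memory_gib` and the nominal parameter counts are illustrative assumptions, and activations and CUDA overhead are not counted, so real usage will be higher.

```python
# Rough lower bound on GPU memory needed to fine-tune with AdamW in fp32.
# Assumption: 4 bytes each for the weight, its gradient, and AdamW's two
# moment estimates; activations and CUDA overhead are NOT included.
BYTES_PER_PARAM = 4 * 4  # weight + gradient + exp_avg + exp_avg_sq


def training_memory_gib(num_params: int) -> float:
    """Return the estimated memory in GiB for weights, gradients, and optimizer state."""
    return num_params * BYTES_PER_PARAM / 1024**3


# Nominal parameter counts taken from the model names, not measured values.
for name, num_params in [("pythia-70m", 70_000_000), ("a 7B model", 7_000_000_000)]:
    print(f"{name}: ~{training_memory_gib(num_params):.1f} GiB")
```

To get the actual parameter count of the loaded model, `sum(p.numel() for p in model.parameters())` can be passed to the function instead of the nominal values.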


## Test the Model Before Training

In [4]:
# Set to evaluation mode
model.eval()
print("Model training:", model.training, "(should be False)")

# Tokenize the prompt into tensor of token IDs
prompt = "This is"
input_ids = tokenizer(
    prompt,
    return_tensors="pt"  # Let it return PyTorch (`pt`) tensors
).input_ids
# Send values to device (GPU)
input_ids = input_ids.to(device)

# Let the model generate the completion
output_sequences = model.generate(input_ids, max_length=32)
output_ids = output_sequences[0]
generated_text = tokenizer.decode(output_ids)

# Print the results
print()
print("input_ids:", input_ids)
print("output_ids:", output_ids)
print()
print("prompt:", prompt)
print("generated_text:", generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Model training: False (should be False)

input_ids: tensor([[1552,  310]], device='cuda:0')
output_ids: tensor([1552,  310,  247, 1270, 1650,  273,  849,  436,  310, 1469,  281,  320,
         247, 1270, 1650,  273,  849,  436,  310, 1469,  281,  320,  247, 1270,
        1650,  273,  849,  436,  310, 1469,  281,  320], device='cuda:0')

prompt: This is
generated_text: This is a great example of how this is going to be a great example of how this is going to be a great example of how this is going to be
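
The warning at the top of the output above appears because only `input_ids` was passed to `generate`. The tokenizer also returns an `attention_mask`, and passing it together with an explicit `pad_token_id` should silence the warning. The sketch below illustrates the call shape. To stay runnable without a download it uses a tiny, randomly initialized GPT-NeoX model (the architecture family of Pythia) and hand-written token IDs; both are stand-ins, not the notebook's actual `model` and `tokenizer`.

```python
import torch
from transformers import GPTNeoXConfig, GPTNeoXForCausalLM

# Hypothetical tiny, randomly initialized model so the sketch runs without a
# download; in the notebook you would use the loaded `model` and `tokenizer`.
torch.manual_seed(0)
config = GPTNeoXConfig(
    vocab_size=128, hidden_size=32, num_hidden_layers=2,
    num_attention_heads=4, intermediate_size=64, max_position_embeddings=64,
)
tiny_model = GPTNeoXForCausalLM(config).eval()

# Stand-in for `tokenizer(prompt, return_tensors="pt")`, which returns both
# `input_ids` and `attention_mask`.
input_ids = torch.tensor([[5, 17]])
attention_mask = torch.ones_like(input_ids)  # 1 = real token, 0 = padding

output_sequences = tiny_model.generate(
    input_ids,
    attention_mask=attention_mask,  # marks which positions hold real tokens
    pad_token_id=0,                 # in the notebook: tokenizer.pad_token_id
    max_length=8,
)
print(output_sequences.shape)
```

In the notebook itself, the equivalent would be to keep the whole tokenizer output, for example `inputs = tokenizer(prompt, return_tensors="pt").to(device)`, and call `model.generate(**inputs, max_length=32, pad_token_id=tokenizer.pad_token_id)`.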


## The Training

In [5]:
# If the model behaves weirdly during or after training,
# uncomment the following lines to reload a fresh model instance instead of
# training one that has already been used to generate text. This might fix the issue.

# model = get_model()
# tokenizer = get_tokenizer()

### Prepare the Optimizer and Scheduler for Training

In [6]:
from torch.optim import AdamW
learning_rate = 5e-5
optimizer = AdamW(
    model.parameters(), lr=learning_rate
)

from transformers import get_scheduler
# Here we use a constant scheduler, which keeps the learning rate the same
# throughout training.
lr_scheduler = get_scheduler(
    name="constant", optimizer=optimizer
)
# Another common choice is a linear scheduler.
# lr_scheduler = get_scheduler(
#     name="linear", optimizer=optimizer,
#     num_warmup_steps=0, num_training_steps=20
# )
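
To see how the two schedulers mentioned in the cell above differ, the following sketch records the learning rate at each step for both. The helper `lr_trace` and the single dummy parameter are illustrative assumptions; they stand in for the real model so the comparison runs on its own.

```python
import torch
from torch.optim import AdamW
from transformers import get_scheduler


def lr_trace(name: str, num_steps: int = 20, **scheduler_kwargs) -> list:
    """Record the learning rate used at each step for the given schedule."""
    # A single dummy parameter stands in for `model.parameters()`.
    param = torch.nn.Parameter(torch.zeros(1))
    optimizer = AdamW([param], lr=5e-5)
    scheduler = get_scheduler(name=name, optimizer=optimizer, **scheduler_kwargs)
    rates = []
    for _ in range(num_steps):
        rates.append(scheduler.get_last_lr()[0])
        optimizer.step()   # no gradients here, so the parameter is unchanged
        scheduler.step()
    return rates


constant_rates = lr_trace("constant")
linear_rates = lr_trace("linear", num_warmup_steps=0, num_training_steps=20)
print("constant:", constant_rates[0], "->", constant_rates[-1])
print("linear:  ", linear_rates[0], "->", linear_rates[-1])
```

With the constant schedule every step uses the initial rate, while the linear schedule decays it toward zero over `num_training_steps`. That is why the linear variant needs to know the total number of steps in advance.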

### Run a Single Train Step

The following code cell is a single step of the training process. In an actual training run, this step is normally repeated many times with different batches of training data.

You can run it multiple times and observe the `loss` dropping and the actual output getting closer to the target label.

In [16]:
# Set model to training mode
model.train()
print("Model training:", model.training, "(should be True)")
print()

# Prepare batch data
train_text = "This is a great language model. Meow meow meow, meow meow. Oh, I'm not a cat. Meow."
batch = tokenizer(train_text, return_tensors="pt")
batch['labels'] = batch['input_ids'].clone()
batch = {k: v.to(device) for k, v in batch.items()}

# Do a train step
outputs = model(**batch)
loss = outputs.loss
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()

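# Preview what the model has predicted
logits = outputs.logits
# The logits at position i predict token i + 1; take the most likely token ID at each position
token_ids = logits.argmax(dim=-1).squeeze().tolist()
generated_text = tokenizer.decode(token_ids)  # Decode the token IDs
# Drop the first label token so the target lines up with the predictions;
# the last predicted token has no corresponding target.
print("Target:", tokenizer.decode(batch['labels'][0][1:]))
print("Actual:", generated_text)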
print()

# Print information about the train step
print("Loss:", loss)
print("Step completed.")

Model training: True (should be True)

Target:  is a great language model. Meow meow meow, meow meow. Oh, I'm not a cat. Meow.
Actual:  is a great language model. Meow meow meow, meow meow. Oh, I'm not a cat. Meow. Oh

Loss: tensor(0.0902, device='cuda:0', grad_fn=<NllLossBackward0>)
Step completed.
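
Two details of the step above may be easier to see in isolation: the step is meant to be repeated in a loop, and the target is offset by one token from the predictions (which is why the "Target" line drops the first token and the "Actual" line carries one extra token at the end). The sketch below writes both out explicitly. It is not the notebook's model: `toy_model` is a hypothetical stand-in that only looks at the current token, and the token IDs are made up. When `labels` is passed to a Transformers causal LM, the same one-token shift is applied internally before the cross-entropy loss is computed.

```python
import torch
import torch.nn.functional as F
from torch.optim import AdamW

torch.manual_seed(0)
vocab_size = 16

# Hypothetical stand-in for the language model: maps each token to logits over
# the next token. It only looks at the current token, unlike a real causal LM.
toy_model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 32),
    torch.nn.Linear(32, vocab_size),
)
optimizer = AdamW(toy_model.parameters(), lr=1e-2)

input_ids = torch.tensor([[3, 7, 1, 9, 4, 2]])  # one "sentence" of made-up token IDs
labels = input_ids.clone()

losses = []
for step in range(50):
    logits = toy_model(input_ids)  # shape: (batch, seq_len, vocab_size)
    # Position i predicts token i + 1: drop the last logit and the first label.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    loss = F.cross_entropy(
        shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1)
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    losses.append(loss.item())

predicted = shift_logits.argmax(dim=-1)
print("first loss:", losses[0], "last loss:", losses[-1])
print("target:   ", shift_labels[0].tolist())
print("predicted:", predicted[0].tolist())
```

A real training loop would draw a different batch on each iteration; here the same sequence is reused, mirroring how the notebook re-runs one cell on one sentence.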


## Test the Trained Model

Now that the model has been trained, we can set it to evaluation mode and use the `generate` method to test it.

In [17]:
# Set to evaluation mode
model.eval()
print("Model training:", model.training, "(should be False)")

prompt = "This is a"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

output_sequences = model.generate(input_ids, max_length=32)
output_ids = output_sequences[0]
generated_text = tokenizer.decode(output_ids)

print()
print("input_ids:", input_ids)
print("output_ids:", output_ids)
print()
print("prompt:", prompt)
print("generated_text:", generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Model training: False (should be False)

input_ids: tensor([[1552,  310,  247]], device='cuda:0')
output_ids: tensor([1552,  310,  247, 1270, 3448, 1566,   15, 3189,  319,  479,  319,  479,
         319,  479,  319,  479,  319,  479,  319,   15, 5531,   13,  309, 1353,
         417,  247, 5798,   15, 3189,  319,   15, 5531], device='cuda:0')

prompt: This is a
generated_text: This is a great language model. Meow meow meow meow meow meow. Oh, I'm not a cat. Meow. Oh
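
With the default settings used here, `generate` performs greedy decoding: it repeatedly picks the most likely next token and appends it to the input until `max_length` is reached or an end-of-sequence token is produced. This is why the output above keeps going past the trained sentence until it hits 32 tokens. The loop below is a minimal sketch of that idea, not the library's implementation; `toy_next_token_logits` is a hypothetical stand-in for a model that always favors "last token + 1".

```python
import torch
import torch.nn.functional as F


def toy_next_token_logits(input_ids: torch.Tensor, vocab_size: int = 10) -> torch.Tensor:
    """Hypothetical stand-in for a model: always favors (last token + 1) % vocab_size."""
    next_token = (input_ids[:, -1] + 1) % vocab_size
    return F.one_hot(next_token, vocab_size).float()  # shape: (batch, vocab_size)


def greedy_decode(input_ids: torch.Tensor, max_length: int, eos_token_id=None) -> torch.Tensor:
    """Append the highest-scoring token until max_length or EOS is reached."""
    while input_ids.shape[1] < max_length:
        logits = toy_next_token_logits(input_ids)
        next_token = logits.argmax(dim=-1, keepdim=True)  # shape: (batch, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)
        if eos_token_id is not None and (next_token == eos_token_id).all():
            break
    return input_ids


print(greedy_decode(torch.tensor([[1, 2]]), max_length=6))
# → tensor([[1, 2, 3, 4, 5, 6]])
```

Passing `eos_token_id=4` to the sketch stops the loop as soon as token 4 is produced, which mirrors how generation can end before `max_length` when the model emits its end-of-sequence token.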
