# GPU Efficient LLM fine-tuning

We'll walk through an example of parameter-efficient fine-tuning on a T4 GPU using Google Colab.

## Package setup

In [1]:
%pip install transformers -q
%pip install bitsandbytes -q
%pip install datasets -q
%pip install accelerate -q
%pip install peft -q
%pip install trl -q
%pip install einops -q
%pip install tensorboard -q

%pip install watermark -q # version checks

Note: you may need to restart the kernel to use updated packages.


In [2]:
%load_ext watermark

In [3]:
# check the GPU
!nvidia-smi

Tue Oct 10 01:31:37 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 27%   36C    P8    18W / 250W |      6MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:2E:00.0 Off |                  N/A |
| 27%   36C    P8    20W / 220W |      6MiB / 11264MiB |      0%      Default |
|       

## Loading the model

We will load the model by using the `transformers` library from Hugging Face.

To fit really large models onto a single GPU, you can load the model in half precision. Most LLMs are themselves trained in half precision (float16 or bfloat16), and there is almost no performance loss compared with full-precision (float32) training.

If that is not enough, you can quantize the model weights to 8-bit or even 4-bit. For that we need the `accelerate` and `bitsandbytes` libraries.

In [4]:
# Illustrate memory usage
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m").to("cuda")
model_fp16 = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype="auto").to("cuda") # torch_dtype=torch.float16
model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True) # 8bit and 4bit models are automatically loaded to GPU
model_4bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4bit=True)

def print_model_gpu_usage(model, name):
  params_size = sum([param.nelement() * param.element_size()
                    for param in model.parameters()])  # in bytes
  print(f"Model: {name} (GB): {params_size/1e9}")

print_model_gpu_usage(model, "base")
print_model_gpu_usage(model_fp16, "fp16")
print_model_gpu_usage(model_8bit, "8bit")
print_model_gpu_usage(model_4bit, "4bit")

Model: base (GB): 1.324785664
Model: fp16 (GB): 0.662392832
Model: 8bit (GB): 0.359354368
Model: 4bit (GB): 0.207835136


We decreased the model's memory usage from 1.32 GB to 0.21 GB, roughly 16% of the original size. However, loading in 8-bit or 4-bit comes with a small degradation in model performance. Keep in mind that we also need to leave some GPU memory free for training, so we don't want loading the model alone to use up all of it.
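As a rough cross-check of the numbers above, weight memory can be estimated as parameter count times bytes per parameter. The sketch below is only an approximation: the helper name `estimate_weight_memory_gb` is hypothetical, and it assumes every parameter is stored at the same precision, which does not hold exactly for 8-bit and 4-bit loading.

```python
# Rough estimate of the memory needed just to hold the model weights.
# Assumption: every parameter is stored at the same precision. This is only
# approximately true for 8-bit/4-bit loading, where some layers stay in 16-bit.

BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4bit": 0.5}


def estimate_weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter count implied by the float32 measurement above:
# 1,324,785,664 bytes / 4 bytes per parameter
opt_350m_params = 331_196_416

for dtype in BYTES_PER_PARAM:
    size_gb = estimate_weight_memory_gb(opt_350m_params, dtype)
    print(f"{dtype:8s} ~{size_gb:.2f} GB")
```

The float32 and float16 estimates match the measured sizes, while the 8-bit and 4-bit estimates come out somewhat below the measured 0.36 GB and 0.21 GB, which is consistent with some layers not being quantized.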

If we only want to run inference on short sequences, we do not need much extra memory. Memory requirements grow, however, when processing very long sequences.

In [5]:
# clear memory
import gc
import torch
del model
del model_fp16
del model_8bit
del model_4bit
gc.collect()
torch.cuda.empty_cache()  # release unused cached GPU memory; no gradients are involved, so no torch.no_grad() is needed

In [6]:
!nvidia-smi

Tue Oct 10 01:31:55 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 27%   39C    P2    58W / 250W |    666MiB / 11264MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:2E:00.0 Off |                  N/A |
| 27%   36C    P8    20W / 220W |      8MiB / 11264MiB |      0%      Default |
|       

## Load the model for Fine-Tuning
Now we will load the model that we will be fine-tuning, in 8-bit. If you are working locally, the model weights are cached after the first download, so subsequent loads will be much faster.

In [7]:
# Falcon base model for tuning
model_name = "tiiuae/falcon-rw-1b"

In [8]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", load_in_8bit=True)

print_model_gpu_usage(model, "base")

Model: base (GB): 1.41529088


In [9]:
# Print only the first 16 parameters (indices 0 to 15)
for i, (name, param) in enumerate(model.named_parameters()):
    print(f"{name:60s} {str(param.dtype):18s} {str(param.requires_grad):6s}  {param.shape}")
    if i == 15:
        break

transformer.word_embeddings.weight                           torch.bfloat16     True    torch.Size([50304, 2048])
transformer.h.0.self_attention.query_key_value.weight        torch.int8         False   torch.Size([6144, 2048])
transformer.h.0.self_attention.query_key_value.bias          torch.bfloat16     False   torch.Size([6144])
transformer.h.0.self_attention.dense.weight                  torch.int8         False   torch.Size([2048, 2048])
transformer.h.0.self_attention.dense.bias                    torch.bfloat16     False   torch.Size([2048])
transformer.h.0.mlp.dense_h_to_4h.weight                     torch.int8         False   torch.Size([8192, 2048])
transformer.h.0.mlp.dense_h_to_4h.bias                       torch.bfloat16     False   torch.Size([8192])
transformer.h.0.mlp.dense_4h_to_h.weight                     torch.int8         False   torch.Size([2048, 8192])
transformer.h.0.mlp.dense_4h_to_h.bias                       torch.bfloat16     False   torch.Size([2048])
transf

Loading the model in 8-bit converted the attention (query, key, value) and MLP weight matrices in each transformer block to int8 and froze them (`requires_grad` is `False`).
Only the embedding, layernorm and classification head layers (not visible here) are kept unlocked and loaded in bfloat16.

They are loaded in bfloat16 because we set `torch_dtype="auto"`, and bfloat16 is the dtype specified in the model's configuration file. You can view the config file on the Hugging Face Hub or inspect `model.config`.

## PEFT - Parameter Efficient Fine Tuning
LLMs are huge! We can't train the whole model for multiple reasons:

*   We probably do not have enough data (overfit issues)
*   Not enough hardware to run it in a reasonable amount of time
*   Very expensive to rent GPUs

The solution is to train only a small subset of parameters using techniques like LoRA.

The idea is to freeze all the original layers and inject additional layers that will be trained. Each of these additional layers consists of two low-rank matrices, which together have far fewer parameters than the original weight matrix.
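To make the idea concrete, here is a minimal sketch of the low-rank update in plain Python with toy sizes. It is an illustration, not how `peft` implements it: the helper names and the "trained" values of `B` are hypothetical. It assumes the usual LoRA conventions that the effective weight is `W + (alpha / r) * B @ A` and that `B` starts at zero, so the adapted model initially behaves exactly like the frozen one.

```python
# Minimal LoRA sketch in plain Python (no PyTorch), for illustration only.
# W is the frozen weight (out x in); A (r x in) and B (out x r) are trainable.


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the weight the adapted layer effectively uses."""
    scale = alpha / r
    BA = matmul(B, A)  # same shape as W
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]


# Toy sizes: out=3, in=4, r=1
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
A = [[0.1, 0.2, 0.3, 0.4]]         # r x in
B_init = [[0.0], [0.0], [0.0]]     # out x r, zero at the start of training
B_trained = [[1.0], [0.0], [2.0]]  # hypothetical values after some training

# With B = 0 the update vanishes, so the effective weight equals W
print(lora_effective_weight(W, A, B_init, alpha=2, r=1) == W)

# After training, each row of W is shifted by a multiple of the single row of A
for row in lora_effective_weight(W, A, B_trained, alpha=2, r=1):
    print(row)
```

In this toy case `W` holds 12 values while `A` and `B` together hold only 7; the gap grows quickly as the matrices get larger and the rank stays small.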

In [10]:
import peft

lora_config = peft.LoraConfig(
    task_type="CAUSAL_LM",
    inference_mode=False,
    r=16,           # Rank of the update matrices; the lower the rank, the fewer parameters to train
    lora_alpha=32,  # LoRA scaling factor
    lora_dropout=0.05,
    bias="none",  # Which biases to train: "none", "all" or "lora_only"; anything other than "none" also updates base model biases, not just the LoRA A/B matrices
    target_modules=["query_key_value"] # Print the model's parameter names to figure out which modules to target
)

peft_model = peft.get_peft_model(model, lora_config)  # this changes the model in-place

In [11]:
# Print only the first 16 parameters (indices 0 to 15)
for i, (name, param) in enumerate(model.named_parameters()):  # deliberately using model instead of peft_model
    print(f"{name:75s} {str(param.dtype):18s} {str(param.requires_grad):6s}  {param.shape}")
    if i == 15:
        break

transformer.word_embeddings.weight                                          torch.bfloat16     False   torch.Size([50304, 2048])
transformer.h.0.self_attention.query_key_value.weight                       torch.int8         False   torch.Size([6144, 2048])
transformer.h.0.self_attention.query_key_value.bias                         torch.bfloat16     False   torch.Size([6144])
transformer.h.0.self_attention.query_key_value.lora_A.default.weight        torch.float32      True    torch.Size([16, 2048])
transformer.h.0.self_attention.query_key_value.lora_B.default.weight        torch.float32      True    torch.Size([6144, 16])
transformer.h.0.self_attention.dense.weight                                 torch.int8         False   torch.Size([2048, 2048])
transformer.h.0.self_attention.dense.bias                                   torch.bfloat16     False   torch.Size([2048])
transformer.h.0.mlp.dense_h_to_4h.weight                                    torch.int8         False   torch.Size([8192

We can see that peft added `lora_A` and `lora_B` layers (in float32) and froze all other layers. The product B·A has the same shape as `query_key_value.weight` ([6144, 2048]), but the two matrices together hold far fewer parameters.

Those are the only layers we will be training.

In [12]:
peft_model.print_trainable_parameters()

trainable params: 3,145,728 || all params: 1,314,770,944 || trainable%: 0.2392605354077554
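The reported count can be reproduced from the shapes printed earlier. The short calculation below uses only numbers that appear in the outputs above; the number of adapted modules is derived from them rather than assumed.

```python
# Reproduce the trainable-parameter count from the shapes printed above.
r = 16
in_features, out_features = 2048, 6144   # query_key_value.weight is [6144, 2048]

lora_params_per_layer = r * in_features + out_features * r  # lora_A + lora_B
full_params_per_layer = in_features * out_features          # frozen int8 weight

trainable_total = 3_145_728   # reported by print_trainable_parameters()
all_params = 1_314_770_944    # reported by print_trainable_parameters()

# How many query_key_value modules received an adapter
num_adapted_layers = trainable_total // lora_params_per_layer

print("LoRA params per adapted module:", lora_params_per_layer)
print("Adapted modules:", num_adapted_layers)
print(f"Trainable share of all params: {100 * trainable_total / all_params:.4f}%")
print(f"LoRA size relative to one frozen weight: {lora_params_per_layer / full_params_per_layer:.2%}")
```

Each adapter pair holds 131,072 parameters, about 1% of the 12.6 million parameters in the frozen matrix it sits next to, and the total divides evenly into 24 adapted modules.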


## Tokenizer
A tokenizer splits text into small chunks called tokens. Each token is associated with an ID, and the resulting list of token IDs is the actual input to an LLM.

```
#       Text     --->           Tokens               --->    Input ids
"This is great!" ---> ["This", "_is", "_great", "!"] ---> [52, 345, 124, 13]
```

Note: It is important to use the same tokenizer that was used during model pretraining. In practice, this means loading the tokenizer with the same name as the model.

In [13]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [14]:
# Base models are often trained without padding tokens
assert tokenizer.pad_token_id is None

In [15]:
# Often used in various tutorials as a quick PoC fix
# We'll see that this approach is not optimal
tokenizer.pad_token = tokenizer.eos_token

In [16]:
tokenized_text = tokenizer("Large language models are")
print(tokenized_text)

{'input_ids': [21968, 3303, 4981, 389], 'attention_mask': [1, 1, 1, 1]}


Calling the tokenizer directly invokes its `__call__` method, which returns both `input_ids` and `attention_mask`.

In [17]:
tokenizer.tokenize("Large language models are")

['Large', 'Ġlanguage', 'Ġmodels', 'Ġare']

If you want to see the tokens themselves, you can use the `tokenizer.tokenize` method.

Or you can convert IDs to tokens using `tokenizer.convert_ids_to_tokens`.

In [18]:
tokenizer.convert_ids_to_tokens(tokenized_text['input_ids'])

['Large', 'Ġlanguage', 'Ġmodels', 'Ġare']

`attention_mask` is useful for batch processing. Different texts produce different numbers of tokens when tokenized, but to build a batch every sequence must have the same length. We can append a special token to the end of the shorter sequences until they all have the same number of tokens. This token is called the padding token. When we process the text, however, we want to ignore these artificially added tokens, so we use `attention_mask`. It tells the attention mechanism which tokens to include and which to ignore in its calculation.

In [19]:
tokenizer(["Large lanugage models are", "I am"], padding=True)

{'input_ids': [[21968, 26992, 1018, 496, 4981, 389], [40, 716, 50256, 50256, 50256, 50256]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 0, 0, 0, 0]]}

The first sentence has 6 tokens (the misspelled word "lanugage" is split into three sub-word tokens), while the second one has only 2. So we need to add 4 padding tokens (with ID 50256) to the second one. In this case, padding is added on the right. The attention mask for the first sentence contains only 1s because all of its tokens are relevant, while in the second sentence the last 4 tokens are padding, so the attention mask is 0 for them. This way the attention mechanism knows to treat only the first 2 tokens as relevant for this example.
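The right-padding logic described above can be sketched in a few lines of plain Python. The function name `pad_batch` is hypothetical and this is not the tokenizer's actual implementation; it only mirrors the behaviour seen in the output above.

```python
def pad_batch(sequences, pad_token_id):
    """Right-pad token ID lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)       # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs taken from the tokenizer output above
batch = pad_batch(
    [[21968, 26992, 1018, 496, 4981, 389], [40, 716]],
    pad_token_id=50256,
)
print(batch)
```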

To decode input IDs back to text, we can use the `tokenizer.decode` method.

In [20]:
tokenizer.decode(tokenized_text['input_ids'])

'Large language models are'

In [21]:
print("Padding token: ", tokenizer.pad_token)
print("Padding token id: ", tokenizer.pad_token_id)

Padding token:  <|endoftext|>
Padding token id:  50256


## Inference

In order to run inference, we need to:
1. Tokenize the text in order to get `input_ids` and `attention_mask`
2. Pass `input_ids` and `attention_mask` to the `model.generate` function
3. Decode the output of the model back to text

In [22]:
import torch

text = "What are large language models?"

# 1. Tokenize
model_input = tokenizer(text, return_tensors="pt").to("cuda")  # return input_ids and attention_mask as pytorch tensors ("pt")
print(model_input)

{'input_ids': tensor([[2061,  389, 1588, 3303, 4981,   30]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]], device='cuda:0')}


In [23]:
# 2. Generate
peft_model.eval()  # make sure model is in evaluation mode
with torch.no_grad():
    out_tokens = peft_model.generate(
        **model_input,       # unpack input_ids and attention_mask
        max_new_tokens=100,  # How many new tokens to generate
        do_sample=True,      # Greedy search or sampling?
        top_p=0.95,          # Consider only minimal amount of tokens that cover 95% of overall probability
        temperature=0.7,     # Skew the probability -> the lower the number more deterministic
        pad_token_id=tokenizer.pad_token_id
        )

print(out_tokens)



tensor([[ 2061,   389,  1588,  3303,  4981,    30,   198, 21968,  3303,  4981,
           357,  3069,    44,     8,   389,  3303,  4981,   326,   389,  4457,
          1588,   287,  1111,   262,  1271,   286, 10007,   290,   287,   262,
          2546,   286,   511,  3047,  1366,    13,  1119,   389,   517,  2408,
           284,  4512,    11,   475,   389,  6007,   286,  9489,  8861,   326,
          2421,  1588,  3146,   286, 10007,   393,  1588,  6867,   286,  3047,
          1366,    13,   198,   464,  3303,  2746,   318,   262,  5072,   286,
           262,  4673,  1429,    13,   383,  3303,  2746,   318,  8776,   416,
         13157,   355,   867,  8405,   355,  1744,   422,   262,  5981,  1366,
           900,   290,  4673,   262,  5981, 10007,   290,    14,   273,   262,
          5981, 19590,    13,   198,  1890,  1672]], device='cuda:0')


In [24]:
# 3. Convert back to text
print(tokenizer.decode(out_tokens[0]))

What are large language models?
Large language models (LLM) are language models that are extremely large in both the number of parameters and in the size of their training data. They are more difficult to train, but are capable of performing tasks that require large numbers of parameters or large amounts of training data.
The language model is the output of the learning process. The language model is trained by collecting as many samples as possible from the relevant data set and learning the relevant parameters and/or the relevant weights.
For example
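
The `temperature` and `top_p` arguments passed to `generate` above control how the next token is sampled. The sketch below illustrates the idea in plain Python on a made-up 4-token vocabulary. It is a simplified version for intuition, not the `transformers` implementation, and all function names and logit values are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability
    reaches top_p, and renormalize. Returns {token_index: probability}."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {idx: p / total for idx, p in kept}


def sample_next_token(logits, temperature=0.7, top_p=0.95, rng=random):
    """Sample one token index after temperature scaling and top-p filtering."""
    candidates = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    tokens = list(candidates)
    weights = [candidates[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -3.0]  # hypothetical scores for a 4-token vocabulary
print(top_p_filter(softmax_with_temperature(logits, 0.7), 0.95))
print(sample_next_token(logits))
```

With these values the least likely token falls outside the 95% probability mass and is never sampled, while lowering the temperature shifts more probability onto the top-scoring token.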


In [25]:
# Let's wrap this in a function
def generate_response(prompt, max_new_tokens=100, do_sample=True, top_p=0.95, temperature=0.7):
  # Tokenize the prompt and move the tensors to the GPU
  batch = tokenizer(prompt, return_tensors="pt").to("cuda")

  was_training = peft_model.training  # remember the current mode so it can be restored
  peft_model.eval()
  with torch.no_grad():
    out_tokens = peft_model.generate(
        **batch,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        top_p=top_p,
        temperature=temperature,
        pad_token_id=tokenizer.pad_token_id)

  peft_model.train(was_training)  # restore the previous mode on the same model object
  out = tokenizer.decode(out_tokens[0])
  return out

In [26]:
response = generate_response("In order to learn how to draw you need to")
print("\n")
print(response)



In order to learn how to draw you need to learn the basics.
This tutorial explains how to draw a dog.
Step 1
The first thing you need to do is draw a box. I will use the box for my dog’s head.
Step 2
First draw a rectangle of about 7.5 inches wide and 5.5 inches high.
Step 3
Now draw a circle about 1 inch in diameter. This will be the dog’s head.
Step 4
Draw a circle about 1.5


In [27]:
response = generate_response("Write 5 steps to learn how to draw")
print("\n")
print(response)

## Dataset

To align our model to answer human questions and follow instructions, we will fine-tune it on a dataset containing instructions and desired answers.
We will use a version of the Alpaca dataset that has train, validation and test splits.

In [None]:
from datasets import load_dataset

dataset = load_dataset("disham993/alpaca-train-validation-test-split")

train = dataset['train']
val = dataset['validation']
test = dataset['test']

In [None]:
train[0]

Datasets usually contain fields such as instruction/input/question and output/answer. In the end, the model needs a single field in which everything is combined, like the "text" field here. This "text" field usually follows a specific format that we can define ourselves. After fine-tuning, we will get the best results at inference time if we follow that same format.

We will use the format provided in the "text" field.

Format if input field is not empty:

```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
```

Format if input field is empty:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
```
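
At inference time, a prompt in the same format has to be built from a new instruction. A minimal sketch of such a helper follows. The name `build_prompt` is hypothetical, and the exact whitespace (blank lines between sections, a newline after `### Response:`) is an assumption that should be checked against the dataset's "text" field.

```python
# Prompt templates mirroring the two formats above.
# Assumption: sections are separated by one blank line, as shown in the templates.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str, input_text: str = "") -> str:
    """Return the prompt for one example, choosing the template by whether input is empty."""
    if input_text.strip():
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)
    return PROMPT_NO_INPUT.format(instruction=instruction)


print(build_prompt("Write 5 steps to learn how to draw"))
```

The resulting string can be passed to a generation function in place of a free-form prompt, so that the model sees the same structure it was fine-tuned on.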

## Training

Because we want to fine-tune the model only on the responses, not on the whole prompt text, we use a data collator that computes the loss only for the response. That is why we need a standardized response format. We tell the collator the phrase after which the response starts, and it uses the tokenizer to convert that phrase into a list of tokens. When that list of tokens is found in the prompt, the collator knows that it should count the loss only for the tokens that come after it. If it can't find that list of tokens in the prompt, it will emit a warning.

In [None]:
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

response_template = "\n### Response:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
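
To illustrate what the collator does conceptually, here is a simplified sketch in plain Python. It is not the `trl` implementation: the function names and token IDs are hypothetical, and it assumes the common convention that label positions set to -100 are skipped by the loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_subsequence(sequence, pattern):
    """Return the start index of the first occurrence of pattern in sequence, or -1."""
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_prompt_labels(input_ids, response_template_ids):
    """Copy input_ids into labels, masking everything up to and including the template."""
    start = find_subsequence(input_ids, response_template_ids)
    if start == -1:
        # Template not found: nothing in this example contributes to the loss
        return [IGNORE_INDEX] * len(input_ids)
    response_start = start + len(response_template_ids)
    return [IGNORE_INDEX] * response_start + list(input_ids[response_start:])


# Hypothetical token IDs: prompt tokens, then the template [900, 901], then the response
input_ids = [11, 12, 13, 900, 901, 21, 22, 23]
template_ids = [900, 901]
print(mask_prompt_labels(input_ids, template_ids))
```

Only the three response tokens keep their labels, so only they contribute to the training loss.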

To train the model we use `SFTTrainer` from the `trl` library (also part of the Hugging Face ecosystem).

In [None]:
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir="./results_2023_10",
    evaluation_strategy="steps",
    max_steps=1000,
    eval_steps=100,
    logging_steps=100,
    logging_first_step=True,
    save_steps=100,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,  # Accumulate gradients to get a larger effective batch size while keeping GPU memory usage low
    learning_rate=5e-5,
    weight_decay=0,
    fp16=True, # Whether to use fp16 16-bit (mixed) precision training instead of 32-bit training.
    group_by_length=True,
    report_to="tensorboard"
)

trainer = SFTTrainer(
    model=peft_model,
    tokenizer=tokenizer,
    train_dataset=train,
    eval_dataset=val.select(indices=range(0, 500)),  # Let's take a subset of val so that evaluation is faster
    dataset_text_field="text",  # Field that contains the final prompt to use
    data_collator=collator,     # Send our collator that will calculate loss only for response
    args=training_arguments,
    max_seq_length=384
)
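
As a quick sanity check on these settings, the effective batch size and the amount of data seen during training can be worked out by hand. The calculation below assumes training on a single GPU (`num_devices = 1`), which is an assumption and not something the configuration above states.

```python
# Values copied from the TrainingArguments above
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 1000
eval_steps = 100

num_devices = 1  # assumption: single-GPU training

# One optimizer step uses this many sequences
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Total training sequences processed over the whole run
sequences_seen = effective_batch_size * max_steps

# Number of evaluation rounds triggered by eval_steps
num_evaluations = max_steps // eval_steps

print("Effective batch size:", effective_batch_size)
print("Training sequences seen:", sequences_seen)
print("Evaluation rounds:", num_evaluations)
```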

In [None]:
trainer.train()

In [None]:
# Save trained model
trainer.model.save_pretrained("falcon-rw-1b-alpaca")

In [None]:
%load_ext tensorboard
%tensorboard --logdir ./results_2023_10/runs