Using 🤗 PEFT & bitsandbytes to fine-tune a LoRA checkpoint

In [2]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git


In [1]:
from huggingface_hub import notebook_login

notebook_login()


In [4]:
!nvidia-smi -L



GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-7b04640e-8675-bd46-0a63-451f345644c7)


Set Up the Model

In [5]:
import torch  
from transformers import AutoModelForCausalLM, BitsAndBytesConfig  
  
# Load model
modelpath = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(
    modelpath,    
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    ),
    torch_dtype=torch.bfloat16,
    # FA2 does not work yet
    # attn_implementation="flash_attention_2",          
)


Set Up the Tokenizer


In [6]:
from transformers import AutoTokenizer  
  
# fast tokenizer sometimes ignores the added tokens  
tokenizer = AutoTokenizer.from_pretrained(modelpath, use_fast=False)      
  
# add special tokens for ChatML formatting and a pad token  
tokenizer.add_tokens(["<|im_start|>", "<PAD>"])
tokenizer.pad_token = "<PAD>"
tokenizer.add_special_tokens(dict(eos_token="<|im_end|>"))
model.config.eos_token_id = tokenizer.eos_token_id

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
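
The warning above is a reminder that every added token needs a row in the embedding matrix. A minimal sketch of a guard, assuming the model exposes the standard transformers methods get_input_embeddings() and resize_token_embeddings(); the helper name ensure_embeddings_cover_vocab is hypothetical and not part of the notebook:

```python
def ensure_embeddings_cover_vocab(model, tokenizer):
    """Grow the embedding matrix if the tokenizer has more tokens than rows.

    Returns True if a resize was performed. Hypothetical helper for illustration.
    """
    rows = model.get_input_embeddings().num_embeddings
    if len(tokenizer) > rows:
        model.resize_token_embeddings(len(tokenizer))
        return True
    return False
```

Calling ensure_embeddings_cover_vocab(model, tokenizer) right after adding the tokens does nothing when the checkpoint already ships spare embedding rows. Either way, rows for new tokens start out untrained, so they only become meaningful if the embeddings are updated during fine-tuning.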


Freezing the Original Weights


In [7]:
import torch
import torch.nn as nn

for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  # return fp32 logits so the loss is computed in full precision
  def forward(self, x):
    return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)
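
To see what the freezing loop does without loading the full model, here is a minimal sketch on a toy module. The two-layer stand-in and its sizes are assumptions chosen for illustration; only the loop body mirrors the cell above.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block: a linear layer followed by a layernorm,
# held in bfloat16 like the non-quantized parameters of the loaded model.
toy = nn.Sequential(nn.Linear(4, 4), nn.LayerNorm(4)).to(torch.bfloat16)

for param in toy.parameters():
    param.requires_grad = False          # freeze everything
    if param.ndim == 1:                  # biases and layernorm weights
        param.data = param.data.to(torch.float32)

dtypes = {name: p.dtype for name, p in toy.named_parameters()}
frozen = all(not p.requires_grad for p in toy.parameters())
print(dtypes, frozen)
```

Only the 1-D parameters are upcast; the 2-D weight matrix keeps its original dtype. PEFT also provides prepare_model_for_kbit_training, which performs a similar preparation for quantized models.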

Setting Up the LoRA Adapters


In [8]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [9]:
from peft import LoraConfig, get_peft_model 

config = LoraConfig(
    r=16,  # LoRA rank (inner dimension of the low-rank update matrices)
    lora_alpha=32,  # scaling factor: the update is scaled by lora_alpha / r
    # target_modules=["q_proj", "v_proj"], #if you know the 
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM" # set this for CLM or Seq2Seq
)

model = get_peft_model(model, config)
print_trainable_parameters(model)


trainable params: 18350080 || all params: 1539742720 || trainable%: 1.1917627381280946
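
The counts printed above can be reproduced by hand. The sketch below assumes Phi-2's published dimensions (hidden size 2560, MLP size 10240, 32 decoder blocks, an embedding matrix with 51200 rows) and assumes that, with target_modules left unset, PEFT attaches adapters to q_proj, v_proj, fc1 and fc2. None of this is stated in the notebook. It also assumes that bitsandbytes packs two 4-bit weights per byte, so each quantized linear layer reports half its real element count, and that lm_head stays unquantized.

```python
hidden, mlp, layers, vocab_rows, r = 2560, 10240, 32, 51200, 16

# (in_features, out_features) of each linear layer in one decoder block
linears = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden),
    "dense": (hidden, hidden),
    "fc1": (hidden, mlp),
    "fc2": (mlp, hidden),
}
lora_targets = ["q_proj", "v_proj", "fc1", "fc2"]  # assumed PEFT default for Phi

# LoRA adds A (r x in) and B (out x r) for every targeted layer
trainable = layers * sum(r * (linears[n][0] + linears[n][1]) for n in lora_targets)

# Quantized weights are packed two per byte, so numel() is half the real count
packed_weights = layers * sum(i * o for i, o in linears.values()) // 2
biases = layers * sum(o for _, o in linears.values())
layernorms = layers * 2 * hidden + 2 * hidden  # weight + bias, per block plus the final norm
embeddings = vocab_rows * hidden
lm_head = vocab_rows * hidden + vocab_rows     # weight + bias

all_params = packed_weights + biases + layernorms + embeddings + lm_head + trainable
print(trainable, all_params, 100 * trainable / all_params)
```

Under these assumptions both numbers agree with the printed output. Because the quantized layers are counted at half size, the reported percentage is relative to the packed count rather than to the model's full parameter count.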


In [10]:
import transformers
from datasets import load_dataset

# load the dataset created in Part 1
dataset = load_dataset("g-ronimo/riddles_evolved")

# split into training (90%) and test set (10%)
dataset = dataset["train"].train_test_split(test_size=0.1)

In [11]:
dataset["train"][0]


{'number': 424,
 'messages': ["I am a word used for a person's last wish, spoken before dying. I'm found in the deepest blue, where sunrays refrain from intruding.",
  'The word you are looking for is "deathbed request" or "deathbed wish." It is a request or desire that a person expresses before passing away. The metaphorical description given refers to the final moments of a person\'s life, often depicted as lying on a bed in the deepest blue (meaning the end of life or the afterlife) where sunrays no longer reach. If you have any questions about specific examples or interpretations, please ask at the end of this answer.',
  "What's a wish that people sometimes make when they're very sick or close to death? (curiously)",
  "People who are sick or close to death may have a variety of wishes, depending on their individual circumstances and personal beliefs. Some common wishes or requests include:\n\n1. Reuniting with loved ones: People might express a desire to be reunited with deceased

In [12]:
import os
from functools import partial

# ChatML format
templates = [
    "<|im_start|>assistant\n{msg}<|im_end|>",      # message by assistant
    "<|im_start|>user\n{msg}<|im_end|>"        # message by user
]

# This special index is used to ignore certain tokens during loss calculation.
IGNORE_INDEX = -100

def tokenize(sample, max_length):
    input_ids, attention_mask, labels = [], [], []

    # Iterate over each message in the conversation
    for i, msg in enumerate(sample["messages"]):

        # Even indices are user turns, odd indices are assistant turns.
        # The bool indexes the list: templates[False] = assistant, templates[True] = user
        isHuman = i%2==0
        msg_chatml = templates[isHuman].format(msg=msg)

        # tokenize all, truncate later
        msg_tokenized = tokenizer(
          msg_chatml, 
          truncation=False, 
          add_special_tokens=False)

        # Copy tokens and attention mask without changes
        input_ids += msg_tokenized["input_ids"]
        attention_mask += msg_tokenized["attention_mask"]

        # Labels for loss calculation: user tokens -> IGNORE_INDEX, assistant tokens -> input_ids
        # (the loss is computed only on assistant messages, the responses we want the model to learn)
        labels += [IGNORE_INDEX]*len(msg_tokenized["input_ids"]) if isHuman else msg_tokenized["input_ids"]

    # truncate to max. length
    return {
        "input_ids": input_ids[:max_length], 
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }

dataset_tokenized = dataset.map(
    # cut samples at 1024 tokens
    # enough for the riddles dataset (max. length 1000 tokens)
    # has to be adapted for other datasets, higher=more VRAM needed
    partial(tokenize, max_length=1024), 
    batched = False,
    num_proc = os.cpu_count(),    # one worker process per CPU core
    remove_columns = dataset["train"].column_names  # Remove original columns, no longer needed
)

Map (num_proc=24):   0%|          | 0/1513 [00:00<?, ? examples/s]

Map (num_proc=24):   0%|          | 0/169 [00:00<?, ? examples/s]
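
The label masking in tokenize can be checked on a tiny conversation. This is a minimal sketch that swaps the real tokenizer for a hypothetical character-level stand-in (toy_tokenize), so the ids are meaningless; only the masking pattern matters.

```python
IGNORE_INDEX = -100
templates = [
    "<|im_start|>assistant\n{msg}<|im_end|>",
    "<|im_start|>user\n{msg}<|im_end|>",
]

def toy_tokenize(text):
    # Stand-in tokenizer: one fake id per character (illustration only)
    return [ord(c) for c in text]

def build_example(messages):
    input_ids, labels = [], []
    for i, msg in enumerate(messages):
        is_human = i % 2 == 0
        ids = toy_tokenize(templates[is_human].format(msg=msg))
        input_ids += ids
        # Mask user turns, keep assistant turns as training targets
        labels += [IGNORE_INDEX] * len(ids) if is_human else ids
    return input_ids, labels

input_ids, labels = build_example(["Hi?", "Hello!"])
n_user = len(templates[1].format(msg="Hi?"))
print(labels[:n_user])                          # labels covering the user turn
print(labels[n_user:] == input_ids[n_user:])    # assistant labels vs. input ids
```

Every position belonging to the user turn carries IGNORE_INDEX, while the assistant turn, including its ChatML markers, is copied into the labels unchanged.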

In [13]:
# Collate function: turns a list of dictionaries [ {input_ids: [123, ..]}, {..} ]
# into a single dictionary forming a batch { input_ids: [..], labels: [..], attention_mask: [..] }
def collate(elements):

    # Extract input_ids from each element and find the maximum length among them
    tokens = [e["input_ids"] for e in elements]
    tokens_maxlen = max([len(t) for t in tokens])

    for i, sample in enumerate(elements):
        input_ids = sample["input_ids"]
        labels = sample["labels"]
        attention_mask = sample["attention_mask"]

        # Calculate the padding length required to match the maximum token length
        pad_len = tokens_maxlen-len(input_ids)

        # Pad 'input_ids' with the pad token ID, 'labels' with IGNORE_INDEX, and 'attention_mask' with 0
        input_ids.extend( pad_len * [tokenizer.pad_token_id] )
        labels.extend( pad_len * [IGNORE_INDEX] )
        attention_mask.extend( pad_len * [0] )

    # create and return batch with all the data in elements
    batch={
        "input_ids": torch.tensor( [e["input_ids"] for e in elements] ),
        "labels": torch.tensor( [e["labels"] for e in elements] ),
        "attention_mask": torch.tensor( [e["attention_mask"] for e in elements] ),
    }
    return batch
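
The padding logic can be tried on plain lists. The sketch below is a variant, not the notebook's function: it builds new lists instead of extending the samples in place, skips the conversion to tensors, and uses a hypothetical pad id of 0.

```python
IGNORE_INDEX = -100
PAD_ID = 0  # hypothetical pad token id for this toy example

def pad_batch(elements, pad_id=PAD_ID):
    max_len = max(len(e["input_ids"]) for e in elements)
    batch = {"input_ids": [], "labels": [], "attention_mask": []}
    for e in elements:
        pad_len = max_len - len(e["input_ids"])
        # Build new lists so the original samples stay untouched
        batch["input_ids"].append(e["input_ids"] + [pad_id] * pad_len)
        batch["labels"].append(e["labels"] + [IGNORE_INDEX] * pad_len)
        batch["attention_mask"].append(e["attention_mask"] + [0] * pad_len)
    return batch

samples = [
    {"input_ids": [5, 6, 7], "labels": [-100, 6, 7], "attention_mask": [1, 1, 1]},
    {"input_ids": [8], "labels": [8], "attention_mask": [1]},
]
batch = pad_batch(samples)
print(batch)
```

Padded positions get IGNORE_INDEX in the labels and 0 in the attention mask, so they contribute neither to the loss nor to attention.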

In [14]:
from transformers import TrainingArguments, Trainer

bs=1         # batch size
ga_steps=16  # gradient acc. steps
epochs=20
lr=0.00002

steps_per_epoch=len(dataset_tokenized["train"])//(bs*ga_steps)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=16,
    evaluation_strategy="steps",        # named eval_strategy in recent transformers releases
    logging_steps=1,
    eval_steps=steps_per_epoch//2,      # eval twice per epoch
    save_steps=steps_per_epoch,         # save once per epoch
    gradient_accumulation_steps=ga_steps,
    num_train_epochs=epochs,
    lr_scheduler_type="constant",
    optim="paged_adamw_32bit",      # val_loss will go NaN with paged_adamw_8bit
    learning_rate=lr,
    group_by_length=False,
    bf16=True,
    ddp_find_unused_parameters=False,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=collate,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
)

trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step    Training Loss    Validation Loss
47      1.5283           1.508486
94      1.549            1.383974
141     1.3315           1.335903
188     1.3522           1.304526
235     1.2325           1.289981
282     1.3557           1.275905
329     1.3973           1.265281
376     1.2885           1.260216
423     1.2579           1.255164
470     1.2198           1.249489


Checkpoint destination directory out/checkpoint-94 already exists and is non-empty. Saving will proceed but saved results may be invalid.


TrainOutput(global_step=1880, training_loss=1.2460531431943813, metrics={'train_runtime': 10958.761, 'train_samples_per_second': 2.761, 'train_steps_per_second': 0.172, 'total_flos': 1.4841422260353024e+17, 'train_loss': 1.2460531431943813, 'epoch': 19.88})
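
The step numbers in the log follow from the dataset size and the training arguments. A short worked calculation, using the example counts reported by the Map step (1513 train, 169 test) and assuming the split rounds the test share up:

```python
import math

n_total = 1513 + 169                 # train + test examples reported by the Map step
n_test = math.ceil(0.1 * n_total)    # assumed rounding for test_size=0.1
n_train = n_total - n_test

bs, ga_steps, epochs = 1, 16, 20
effective_batch = bs * ga_steps              # samples per optimizer step
steps_per_epoch = n_train // effective_batch
eval_steps = steps_per_epoch // 2            # evaluate twice per epoch
total_steps = steps_per_epoch * epochs

print(n_train, n_test, effective_batch, steps_per_epoch, eval_steps, total_steps)
```

The resulting evaluation interval and total step count line up with the Step column of the log and with global_step in the TrainOutput.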

In [4]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Path to the final checkpoint saved by the Trainer
checkpoint_path = "out/checkpoint-1880"

# Reload from the checkpoint with the same quantization settings used for training
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_path,  # checkpoint directory instead of "microsoft/phi-2"
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    ),
    torch_dtype=torch.bfloat16,
    # attn_implementation="flash_attention_2",  # uncomment only if it was used during training and is supported
)



In [5]:
model.push_to_hub("samyoon727/Phi2-01",
                  token=True,  # replaces the deprecated use_auth_token
                  commit_message="basic training",
                  private=True)






adapter_model.safetensors:   0%|          | 0.00/73.4M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/samyoon727/Phi2-01/commit/e81a3ad0b91563d8d018a838af4e2c35d69a98f1', commit_message='basic training', commit_description='', oid='e81a3ad0b91563d8d018a838af4e2c35d69a98f1', pr_url=None, pr_revision=None, pr_num=None)

In [6]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

peft_model_id = "samyoon727/Phi2-01"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    # pass a BitsAndBytesConfig; the bare load_in_8bit argument is deprecated
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, peft_model_id)



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



In [7]:

# move the inputs to the same device as the model
batch = tokenizer("“Training models with PEFT and LoRa is cool” ->: ", return_tensors='pt').to(model.device)

with torch.autocast(device_type="cuda"):
  output_tokens = model.generate(**batch, max_new_tokens=50)

print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.




 “Training models with PEFT and LoRa is cool” ->: 

The statement "Training models with PEFT and LoRa is cool" is an opinion and not a fact. It is subjective and based on personal preference. Some people may find it cool to train models with these technologies, while others may not
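
The prompt above is plain text, whereas training used ChatML-formatted conversations. A minimal sketch of building a matching prompt, assuming the same templates as in the tokenization cell; build_chatml_prompt is a hypothetical helper and the riddle is an invented example.

```python
def build_chatml_prompt(messages):
    """Render alternating user/assistant messages and open an assistant turn."""
    roles = ["user", "assistant"]
    parts = [
        f"<|im_start|>{roles[i % 2]}\n{msg}<|im_end|>"
        for i, msg in enumerate(messages)
    ]
    # Leave the assistant turn open so generation continues from here
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt(["What has keys but can't open locks?"])
print(prompt)
```

For the ChatML markers to map to the ids seen during training, the tokenizer used at inference would likely need the same added tokens as in the tokenizer setup cell, which the inference cell above does not add.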
