# Instruction Finetuning using QLoRA

This notebook shows how to perform instruction finetuning with QLoRA, a parameter-efficient finetuning (PEFT) method. The task is supervised finetuning (SFT) of CodeLlama for function calling.

Load the required libraries

In [1]:
import os
os.environ["WANDB_PROJECT"]="codellama_instruct_finetuning"

from enum import Enum
from functools import partial
import pandas as pd
import torch
import json

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig, set_seed
from datasets import load_dataset
from trl import SFTTrainer
from peft import get_peft_model, LoraConfig, TaskType

seed = 42
set_seed(seed)



## Data preprocessing

In [2]:

model_name = "codellama/CodeLlama-13b-Instruct-hf"
dataset_name = "heegyu/glaive-function-calling-v2-formatted"
tokenizer = AutoTokenizer.from_pretrained(model_name)
template = """{% for message in messages %}\n{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% if loop.last and add_generation_prompt %}{{'<|im_start|>assistant\n' }}{% endif %}{% endfor %}"""
tokenizer.chat_template = template

def preprocess(samples):
    """Render each conversation, together with its function list, as one ChatML string."""
    batch = []
    for system_prompt, function_desc, conversation in zip(samples["system_message"], samples["function_description"], samples["conversations"]):
        try:
            # Pretty-print the function descriptions when they form valid JSON.
            function_desc_formatted = json.dumps(json.loads(f"[{function_desc}]"), indent=2, sort_keys=True)
        except json.JSONDecodeError:
            # Otherwise fall back to the raw string.
            function_desc_formatted = f"[{function_desc}]"
        system_message = {"role": "system", "content": f"{system_prompt}\nfunctions: {function_desc_formatted}"}
        # Prepend the system message without mutating the input batch.
        batch.append(tokenizer.apply_chat_template([system_message] + conversation, tokenize=False))
    return {"content": batch}

dataset = load_dataset(dataset_name)
dataset = dataset.map(
    preprocess,
    batched=True,
    remove_columns=dataset["train"].column_names
)
dataset = dataset["train"].train_test_split(test_size=0.1)  # hold out 10% for evaluation
print(dataset)
print(dataset["train"][0])

DatasetDict({
    train: Dataset({
        features: ['content'],
        num_rows: 101664
    })
    test: Dataset({
        features: ['content'],
        num_rows: 11296
    })
})
{'content': '<|im_start|>system\nYou are a helpful assistant, with no access to external functions.\nfunctions: []<|im_end|>\n<|im_start|>user\nConsider the following equation with the added constraint that x must be a prime number: \n1 + 1 = x\nWhat is the value of x?<|im_end|>\n<|im_start|>assistant\nThe equation 1 + 1 = x has a solution of x = 2. However, with the added constraint that x must be a prime number, the only solution is x = 2.<|im_end|>\n<|im_start|>user\nCan you explain how encryption works?<|im_end|>\n<|im_start|>assistant\nEncryption is the process of converting plaintext into ciphertext, which is a scrambled version of the original message. This is done by using a mathematical algorithm and a key to encode the information in a way that can only be decoded with the corresponding key.<|im_
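To make the chat template above easier to follow, here is a plain-Python approximation of what it renders. This is a sketch only: `render_chatml` is a hypothetical helper, not part of `transformers`, and it assumes the newline after the `{% for %}` tag is trimmed during rendering, which is consistent with the sample output above starting directly with `<|im_start|>system`.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Approximate the ChatML Jinja template with plain string formatting."""
    text = ""
    for message in messages:
        # Each turn is wrapped as <|im_start|>{role}\n{content}<|im_end|>\n
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    # The template only emits the generation prompt after the last message,
    # so an empty conversation renders as an empty string.
    if add_generation_prompt and messages:
        text += "<|im_start|>assistant\n"
    return text


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
]
print(render_chatml(example))
```

Every turn ends with `<|im_end|>`, which is the token later registered as the tokenizer's EOS token.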

In [3]:
print(dataset["train"][6]["content"])

<|im_start|>system
You are a helpful assistant with access to the following functions. Use them if required -
functions: [{
    "name": "analyze_image",
    "description": "Analyze the contents of an image",
    "parameters": {
        "type": "object",
        "properties": {
            "image_url": {
                "type": "string",
                "description": "The URL of the image"
            }
        },
        "required": [
            "image_url"
        ]
    }
}

{
    "name": "search_recipe",
    "description": "Search for a recipe",
    "parameters": {
        "type": "object",
        "properties": {
            "keywords": {
                "type": "array",
                "items": {
                    "type": "string"
                },
                "description": "The keywords to search for in the recipe"
            },
            "cuisine": {
                "type": "string",
                "description": "The cuisine type (e.g. Italian, Mexican)"
          

## Create the PEFT model

### LoRA Config

In [4]:
peft_config = LoraConfig(r=8,
                         lora_alpha=16,
                         lora_dropout=0.1,
                         target_modules=["gate_proj","q_proj","lm_head","o_proj","k_proj","embed_tokens","down_proj","up_proj","v_proj"],
                         task_type=TaskType.CAUSAL_LM)

### bitsandbytes 4-bit quantization config

In [5]:
bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )
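As a rough illustration of why `bnb_4bit_use_double_quant=True` helps, the sketch below estimates the per-parameter storage overhead of the quantization constants. The block sizes (64 weights per block, 256 constants per second-level block) follow the description in the QLoRA paper and are assumptions here, not values read from this notebook. The 13e9 parameter count is a round number, and the estimate ignores layers that stay unquantized.

```python
BLOCK_SIZE = 64          # weights sharing one quantization constant (assumed)
SECOND_BLOCK_SIZE = 256  # constants sharing one second-level constant (assumed)


def overhead_bits_per_param(double_quant: bool) -> float:
    """Bits per weight spent on quantization constants."""
    if not double_quant:
        # One fp32 constant per block of weights.
        return 32 / BLOCK_SIZE
    # 8-bit first-level constants plus one fp32 second-level constant per group.
    return 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_BLOCK_SIZE)


def approx_weight_gib(n_params: float, double_quant: bool) -> float:
    """Approximate size of the 4-bit weights plus constants, in GiB."""
    bits = 4 + overhead_bits_per_param(double_quant)
    return n_params * bits / 8 / 2**30


for flag in (False, True):
    print(flag, round(overhead_bits_per_param(flag), 3), round(approx_weight_gib(13e9, flag), 2))
```

Under these assumptions the overhead drops from 0.5 to about 0.127 bits per parameter.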

In [6]:
class ChatmlSpecialTokens(str, Enum):
    user = "<|im_start|>user"
    assistant = "<|im_start|>assistant"
    system = "<|im_start|>system"
    function_call = "<|im_start|>function-call"
    function_response = "<|im_start|>function-response"
    eos_token = "<|im_end|>"
    bos_token = "<s>"
    pad_token = "<pad>"

    @classmethod
    def list(cls):
        return [c.value for c in cls]

tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        pad_token=ChatmlSpecialTokens.pad_token.value,
        bos_token=ChatmlSpecialTokens.bos_token.value,
        eos_token=ChatmlSpecialTokens.eos_token.value,
        additional_special_tokens=ChatmlSpecialTokens.list(),
        trust_remote_code=True
    )
tokenizer.chat_template = template

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=bnb_config,
                                             device_map="auto",
                                             attn_implementation="flash_attention_2")
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)



Embedding(32024, 5120)
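The `pad_to_multiple_of=8` argument rounds the embedding matrix up to the next multiple of 8 rows. The sketch below reproduces that arithmetic. The base vocabulary size (32016) and the number of newly added tokens (7, that is, every `ChatmlSpecialTokens` value except the already existing `<s>`) are assumptions inferred from the outputs in this notebook, not values queried from the tokenizer.

```python
def round_up_to_multiple(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return ((n + multiple - 1) // multiple) * multiple


base_vocab = 32016  # assumed size of the original CodeLlama vocabulary
new_tokens = 7      # assumed number of special tokens that were not already present

padded = round_up_to_multiple(base_vocab + new_tokens, 8)
print(padded)
```

With these assumed inputs the result agrees with the `Embedding(32024, 5120)` shape printed above. The extra row is unused padding.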

In [7]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32024, 5120)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaFlashAttention2(
          (q_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (up_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear4bit(in_features=13824, out_features=5120, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): L
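Using the layer shapes printed above and the LoRA settings from the LoRA Config section (`r=8`), the number of adapter parameters can be estimated by hand. This is a back-of-the-envelope sketch: it counts only the low-rank matrices, uses the standard count of `r * (in_features + out_features)` per adapted layer, and assumes `lm_head` has the same 32024 x 5120 shape as `embed_tokens`, since the printout is truncated before that layer.

```python
R = 8  # LoRA rank from peft_config
HIDDEN, INTERMEDIATE, VOCAB, N_LAYERS = 5120, 13824, 32024, 40


def lora_params(in_features: int, out_features: int, r: int = R) -> int:
    """Parameters in the two low-rank matrices A (r x in) and B (out x r)."""
    return r * (in_features + out_features)


per_layer = (
    4 * lora_params(HIDDEN, HIDDEN)            # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(HIDDEN, INTERMEDIATE)    # gate_proj, up_proj
    + lora_params(INTERMEDIATE, HIDDEN)        # down_proj
)
embeddings = 2 * lora_params(VOCAB, HIDDEN)    # embed_tokens and lm_head (shape assumed)
total = N_LAYERS * per_layer + embeddings
print(f"{total:,} LoRA parameters")
```

Under these assumptions the adapters add roughly 32 million parameters, a small fraction of the 13B base model.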

## Training

In [8]:
output_dir = "codellama_function_calling_instruct"
per_device_train_batch_size = 2
per_device_eval_batch_size = 2
gradient_accumulation_steps = 4
logging_steps = 5
learning_rate = 5e-4
max_grad_norm = 1.0
num_train_epochs = 1
warmup_ratio = 0.1
lr_scheduler_type = "cosine"
max_seq_length = 2048

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    save_strategy="no",
    evaluation_strategy="epoch",
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    max_grad_norm=max_grad_norm,
    weight_decay=0.1,
    warmup_ratio=warmup_ratio,
    lr_scheduler_type=lr_scheduler_type,
    bf16=True,
    report_to=["tensorboard", "wandb"],
    hub_private_repo=True,
    push_to_hub=True,
    num_train_epochs=num_train_epochs,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
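To see how `learning_rate`, `warmup_ratio`, and `lr_scheduler_type="cosine"` interact, the sketch below re-implements linear warmup followed by half-cosine decay in plain Python. It is an approximation of the schedule for illustration, not the library's own code. `total_steps=1000` and `num_devices=1` are hypothetical values, since the real step count depends on the packed dataset size and the hardware.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 5e-4, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int) -> int:
    """Sequences contributing to one optimizer step."""
    return per_device * grad_accum * num_devices


total_steps = 1000  # hypothetical
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps))
print(effective_batch_size(per_device=2, grad_accum=4, num_devices=1))
```

With the settings above, each optimizer step on a single device covers 8 packed sequences of up to 2048 tokens.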


In [9]:
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    packing=True,
    dataset_text_field="content",
    max_seq_length=max_seq_length,
    peft_config=peft_config,
    dataset_kwargs={
        "append_concat_token": False,
        "add_special_tokens": False,
    },
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
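The `packing=True` option concatenates tokenized examples and slices the stream into fixed-length chunks, so no compute is wasted on padding. The toy function below illustrates the idea on integer lists. It is a simplified sketch, not TRL's implementation, and `pack_sequences` is a hypothetical name. It adds no separator between examples, mirroring `append_concat_token=False`; presumably this works here because the chat template already ends every turn with `<|im_end|>`.

```python
def pack_sequences(tokenized_examples, max_seq_length):
    """Concatenate token id lists and split them into full-length chunks."""
    buffer = []
    for ids in tokenized_examples:
        buffer.extend(ids)  # no separator token is inserted between examples
    n_full = len(buffer) // max_seq_length
    # Drop the trailing remainder so every chunk has exactly max_seq_length tokens.
    return [buffer[i * max_seq_length:(i + 1) * max_seq_length] for i in range(n_full)]


toy_examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(pack_sequences(toy_examples, max_seq_length=4))
```

Note that a packed chunk can start or end in the middle of a conversation, which is the trade-off for having no padding.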


In [10]:
trainer.train()
trainer.save_model()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: smangrul. Use `wandb login --relogin` to force relogin


You're using a CodeLlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0     | 0.3484        | 0.368368        |





In [None]:
!nvidia-smi

## Loading the trained model and generating predictions

In [None]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from datasets import load_dataset
import torch

bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )

peft_model_id = "smangrul/codellama_function_calling_instruct"
device = "cuda"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                             quantization_config=bnb_config,
                                             device_map="auto",
                                             attn_implementation="flash_attention_2")
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
model = PeftModel.from_pretrained(model, peft_model_id)
# model.to(torch.bfloat16)
# model.cuda()
model.eval()

In [150]:
system_prompt = """You are a helpful assistant with access to the following functions. You do a reasoning step before acting. Use the functions if required -
functions: [{
    "name": "get_current_location",
    "description": "Returns the current location. ONLY use this if the user has not provided an explicit location in the query.",
    "parameters": {}
},
{
    "name": "search",
    "description": "A search engine. 
    Useful for when you need to answer questions and provide information about real-time updates, current events and latest NEWS.
    Useful to answer general knowledge questions and provide information about people, places, companies, facts, historical events, or other subjects.
    Input should be a search query."
    "parameters": {
        "type": "object",
        "properties": {
            "search_query": {
                "type": "array",
                "items": {
                    "type": "string"
                },
                "description": "The search query"
            }
        },
        "required": [
            "search_query"
        ]
    }
},
{
    "name": "code_analysis",
    "description": "Useful when the user query can be solved by writing Python code. 
    This function generates high quality Python code and runs it to solve the user query and provide the output.
    Useful when user asks queries that can be solved with Python code. 
    Useful for sorting, generating graphs for data visualization and analysis, solving arithmetic and logical questions, data wrangling and data crunching tasks for csv files etc.",
    "parameters": {
        "type": "object",
        "properties": {
            "text_prompt": {
                "type": "string",
                "description": "The description of the problem to be solved by writing python code."
            }
        },
        "required": [
            "text_prompt"
        ]
    }
},
{
    "name": "analyze_image",
    "description": "Analyze the contents of an image",
    "parameters": {
        "type": "object",
        "properties": {
            "image_url": {
                "type": "string",
                "description": "The URL of the image"
            }
        },
        "required": [
            "image_url"
        ]
    }
},
{
    "name": "generate_image",
    "description": "generate image based on the given description. ",
    "parameters": {
        "type": "object",
        "properties": {
            "text_prompt": {
                "type": "string",
                "description": "The description of the image to be generated"
            }
        },
        "required": [
            "text_prompt"
        ]
    }
}]"""

# Can you tell me what's in the image www.example.com/myimage.jpg?
# Please give me the recipe for kadhai paneer.
# Where am I?
# Generate an image of Indian festival Sankranthi where children are flying colourful kites on their terrace.
# What is the latest news of earthquakes in Japan?
# Sort the array [1,7,5,6].
# I want to know about tour packages from India to Maldives.

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "I want to know about tour packages from India to Maldives."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.autocast(dtype=torch.bfloat16, device_type="cuda"):
    outputs = model.generate(**inputs, 
                             max_new_tokens=128, 
                             do_sample=True, 
                             top_p=0.95, 
                             temperature=0.2, 
                             repetition_penalty=1.0, 
                             eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))

Setting `pad_token_id` to `eos_token_id`:32016 for open-end generation.


<|im_start|>system 
You are a helpful assistant with access to the following functions. You do a reasoning step before acting. Use the functions if required -
functions: [{
    "name": "get_current_location",
    "description": "Returns the current location. ONLY use this if the user has not provided an explicit location in the query.",
    "parameters": {}
},
{
    "name": "search",
    "description": "A search engine. 
    Useful for when you need to answer questions and provide information about real-time updates, current events and latest NEWS.
    Useful to answer general knowledge questions and provide information about people, places, companies, facts, historical events, or other subjects.
    Input should be a search query."
    "parameters": {
        "type": "object",
        "properties": {
            "search_query": {
                "type": "array",
                "items": {
                    "type": "string"
                },
                "description": "The searc

In [None]:
!nvidia-smi