# QLoRA / PEFT fine-tuning

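* LoRA, a parameter-efficient fine-tuning (PEFT) method, keeps the pre-trained weights frozen and trains only small low-rank adapter matrices, which greatly reduces the number of trainable parameters.
* QLoRA additionally stores the frozen base weights in 4-bit precision, which makes fine-tuning feasible on smaller GPUs.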
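
As a rough illustration of the idea in the bullets above, the sketch below applies a low-rank update to a frozen weight matrix in plain Python. The sizes, the names `W`, `A`, `B`, and the helper `matvec` are illustrative assumptions, not taken from the PEFT library; the zero initialisation of `B` follows the usual LoRA convention, so the adapted layer initially reproduces the frozen layer.

```python
import random

random.seed(0)
d_in, d_out, r, alpha = 8, 8, 2, 4  # toy sizes chosen for illustration


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Frozen pre-trained weight: never updated during fine-tuning.
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors: A gets a small random init, B starts at zero.
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]


def lora_forward(x):
    """Compute W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


x = [random.gauss(0, 1) for _ in range(d_in)]
trainable_params = r * d_in + d_out * r  # entries of A plus entries of B
frozen_params = d_out * d_in             # entries of W
print(trainable_params, frozen_params)
```

Only `A` and `B` would receive gradients; in QLoRA the frozen `W` is additionally kept in a 4-bit format.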


## Dataset

In [1]:
import json
import wandb
from wandb import Api

# load the packed dataset
api = Api()
artifact = api.artifact('vijaygkd/alpaca_ft/alpaca_gpt4_splitted:v0', type='dataset')
artifact_dir = artifact.download()

def load_jsonl(file_path):
    data = []
    with open(file_path, 'r') as file:
        for line in file:
            data.append(json.loads(line))
    return data
    
train_dataset = load_jsonl(f"{artifact_dir}/alpaca_gpt4_train.jsonl")
eval_dataset = load_jsonl(f"{artifact_dir}/alpaca_gpt4_eval.jsonl")

[34m[1mwandb[0m:   2 of 2 files downloaded.  


In [2]:
import pandas as pd
from datasets import Dataset

train_dataset = Dataset.from_pandas(pd.DataFrame.from_records(train_dataset))
eval_dataset = Dataset.from_pandas(pd.DataFrame.from_records(eval_dataset))


In [16]:
train_dataset

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 51002
})

In [4]:
train_dataset[0]

{'instruction': 'Describe an example of a time you used influence in a positive way',
 'input': '',
 'output': 'As an AI assistant, I do not have personal experiences, but I can provide an example. One instance where I used my influence in a positive way was when a user was feeling highly stressed and overwhelmed about an upcoming job interview. They confided in me, expressing their self-doubts and fears of failure. Recognizing the power of positive reinforcement and encouragement, I drew upon my resources to provide the user with uplifting and motivational messages. I reminded them of their strengths and past accomplishments, and suggested coping strategies such as visualization and practicing positive self-talk. Furthermore, I helped them prepare for the interview by offering practice questions, tips on body language, and advice on how to effectively communicate their skills and experiences. As a result, the user reported feeling more confident and capable of performing well in their

The formatting functions below are used by the `SFTTrainer` class to turn each row into a single training prompt.

In [5]:
def prompt_no_input(row):
    return ("Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Response:\n{output}").format_map(row)

def prompt_input(row):
    return ("Below is an instruction that describes a task, paired with an input that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}").format_map(row)

def create_prompt(row) -> str:
    return prompt_no_input(row) if row["input"] == "" else prompt_input(row)

## Model

Load the pre-trained LLM with 4-bit quantized weights to reduce memory use.

In [6]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig

model_id = 'meta-llama/Llama-2-7b-hf'

use_4_bit_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

model_kwargs = dict(
    device_map="auto",
    # trust_remote_code=True,
    # low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    # use_flash_attention_2=True,
    use_cache=False,
    quantization_config=use_4_bit_config
)

# The model is not loaded here: SFTTrainer instantiates it from model_id and model_kwargs

The 4-bit model requires about 4 GB of memory.
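
The memory figure above can be sanity-checked with simple arithmetic. The sketch below counts the weights only and assumes a nominal 7 billion parameters (taken from the "7b" in the model name, not measured); quantization constants and any layers kept in higher precision add overhead on top, which is consistent with the rounded 4 GB figure.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


n_params = 7e9  # assumption: nominal parameter count implied by "7b"
mem_4bit = weight_memory_gb(n_params, 4)
mem_bf16 = weight_memory_gb(n_params, 16)
print(f"4-bit: {mem_4bit:.1f} GB, bf16: {mem_bf16:.1f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the memory of the same weights in `bfloat16`.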

## Evaluation

Eval metrics

A sample of the evaluation dataset, with the answers removed, is used to test generations.

In [7]:
# Blank out the answers so that each prompt ends right after "### Response:"
def create_prompt_no_answer(row):
    row["output"] = ""
    return {"text": create_prompt(row)}

test_dataset = eval_dataset.map(create_prompt_no_answer)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
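
To make the effect of blanking the answer concrete, the sketch below renders one row twice with the no-input template from this notebook: once with its answer (the training text) and once without (the generation prompt). The example row is hypothetical and the constant name `TEMPLATE_NO_INPUT` is introduced only for this illustration.

```python
# Same wording as the no-input template used in this notebook.
TEMPLATE_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

# Hypothetical row, not taken from the dataset.
row = {
    "instruction": "Name three primary colors.",
    "input": "",
    "output": "Red, blue, and yellow.",
}

train_text = TEMPLATE_NO_INPUT.format_map(row)
# Copy the row with an empty answer instead of mutating it.
eval_prompt = TEMPLATE_NO_INPUT.format_map({**row, "output": ""})

print(repr(eval_prompt[-20:]))
```

The generation prompt is a strict prefix of the training text, so the model is asked to continue exactly where the reference answer would start.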

WandB logger callback

The callback reads the model from `trainer.model`, which already has the LoRA adapter attached, so the sampled generations reflect the adapter as trained so far.

In [8]:
from tqdm.auto import tqdm
from transformers import GenerationConfig
from transformers.integrations import WandbCallback


class LLMSampleCB(WandbCallback):
    def __init__(self, trainer, test_dataset, num_samples=10, max_new_tokens=256, log_model="checkpoint"):
        super().__init__()
        self._log_model = log_model
        self.sample_dataset = test_dataset.select(range(num_samples))
        self.model, self.tokenizer = trainer.model, trainer.tokenizer
        self.gen_config = GenerationConfig.from_pretrained(trainer.model.name_or_path,
                                                           max_new_tokens=max_new_tokens)

    def generate(self, prompt):
        tokenized_prompt = self.tokenizer(prompt, return_tensors='pt')['input_ids'].cuda()
        with torch.inference_mode():
            output = self.model.generate(inputs=tokenized_prompt, generation_config=self.gen_config)
        # drop the prompt tokens so that only the newly generated text is decoded
        return self.tokenizer.decode(output[0][len(tokenized_prompt[0]):], skip_special_tokens=True)
    
    def samples_table(self, examples):
        records_table = wandb.Table(columns=["prompt", "generation"] + list(self.gen_config.to_dict().keys()))
        for example in tqdm(examples, leave=False):
            prompt = example["text"]
            generation = self.generate(prompt=prompt)
            records_table.add_data(prompt, generation, *list(self.gen_config.to_dict().values()))
        return records_table
        
    def on_evaluate(self, args, state, control,  **kwargs):
        super().on_evaluate(args, state, control, **kwargs)
        records_table = self.samples_table(self.sample_dataset)
        self._wandb.log({"sample_predictions":records_table})

## Training and PEFT config

In [9]:
from peft import LoraConfig, get_peft_model, PeftConfig

peft_config = LoraConfig(
    r=64,  # rank of the LoRA update matrices
    lora_alpha=16,  # scaling factor: the LoRA update is scaled by lora_alpha / r
    lora_dropout=0.1,  # dropout applied to the LoRA layers
    bias="none",  # do not train any bias parameters
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projection layers that receive LoRA adapters
)
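
The configuration above implies a specific number of trainable parameters, which can be estimated by hand. The sketch below assumes the usual Llama-2-7b shapes (hidden size 4096, 32 decoder layers, square attention projections); these values are assumptions not stated in this notebook, and `lora_params` is a hypothetical helper.

```python
r, lora_alpha = 64, 16
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Assumed model shapes (not read from the actual model config).
hidden_size = 4096
num_layers = 32


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


per_layer = len(target_modules) * lora_params(hidden_size, hidden_size, r)
total_trainable = num_layers * per_layer
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"{total_trainable:,} trainable parameters, scaling = {scaling}")
```

Under these assumptions the adapters hold about 67 million parameters, roughly 1% of a 7-billion-parameter base model.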

In [10]:
from transformers import TrainingArguments
from trl import SFTTrainer


In [11]:
batch_size = 16
gradient_accumulation_steps = 2
epochs = 3

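# 11_210 is a hard-coded number of training sequences (presumably counted after packing; it is not derived in this notebook)
total_num_steps = 11_210 * epochs // (batch_size * gradient_accumulation_steps)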
print(total_num_steps)

1050
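
The step count printed above follows from the effective batch size. The sketch below spells out the arithmetic, assuming a single GPU (so the effective batch is `batch_size * gradient_accumulation_steps`); the variable names are illustrative, and the evaluation interval mirrors the `total_num_steps // epochs` expression used in the training arguments of this notebook.

```python
num_sequences = 11_210  # hard-coded value from the cell above
batch_size, grad_accum, epochs = 16, 2, 3

# Sequences consumed per optimizer step (single GPU assumed).
effective_batch = batch_size * grad_accum
total_steps = num_sequences * epochs // effective_batch
# One evaluation per (approximate) epoch.
eval_every = total_steps // epochs
# Integer division drops the final partial batch.
sequences_seen = total_steps * effective_batch

print(effective_batch, total_steps, eval_every, sequences_seen)
```

Because of the integer division, slightly fewer sequences are consumed than `num_sequences * epochs`.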


In [12]:
output_dir = ".model/"
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size//2,
    bf16=False,                        # bf16 requires an Ampere or newer GPU (e.g. A100; not supported on T4)
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    # num_train_epochs=1,
    max_steps=total_num_steps,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs=dict(use_reentrant=False),
    evaluation_strategy="steps",
    eval_steps=total_num_steps // epochs,
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=1,
    save_strategy="no",
    report_to="wandb",
)

In [17]:
trainer = SFTTrainer(
    model=model_id,
    model_init_kwargs=model_kwargs,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    packing=True,                       # pack multiple examples into one sequence, separated by the `eos` token
    max_seq_length=1024,
    args=training_args,
    formatting_func=create_prompt,      # renders each row into a single training string; labels are derived from the input tokens automatically
    # compute_metrics=token_accuracy,
    peft_config=peft_config             # LoRA config
)




OSError: Not enough disk space. Needed: Unknown size (download: Unknown size, generated: Unknown size, post-processed: Unknown size)
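
The `packing=True` option passed to the trainer above concatenates tokenized examples, separated by the `eos` token, and cuts the stream into fixed-length sequences. The sketch below is a simplified illustration of that idea with made-up token ids; `pack_sequences` is a hypothetical helper and not TRL's actual implementation, which may differ in details such as how the leftover tail is handled.

```python
from typing import Iterable, Iterator, List


def pack_sequences(
    tokenized: Iterable[List[int]], eos_id: int, seq_length: int
) -> Iterator[List[int]]:
    """Concatenate examples (each followed by eos) and emit fixed-length chunks."""
    buffer: List[int] = []
    for ids in tokenized:
        buffer.extend(ids)
        buffer.append(eos_id)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Any remainder shorter than seq_length is dropped in this sketch.


examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # made-up token ids
packed = list(pack_sequences(examples, eos_id=0, seq_length=4))
print(packed)
```

Packing avoids padding short examples up to `max_seq_length`, at the cost of some examples being split across two sequences.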

In [None]:
wandb_callback = LLMSampleCB(trainer, test_dataset, num_samples=20, max_new_tokens=256)

# add wandb callback
trainer.add_callback(wandb_callback)

## Training

In [None]:
wandb.init(project="alpaca_ft", 
           tags=["hf_sft"],
           job_type="train",
           config=training_args)

# start training
trainer.train()


wandb.finish()


Run history:

| Metric | History |
|---|---|
| eval/loss | ██▁▁ |
| eval/runtime | ▁▁██ |
| eval/samples_per_second | ██▁▁ |
| eval/steps_per_second | ██▁▁ |
| train/epoch | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| train/global_step | ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| train/grad_norm | █▆▂▁▁▁▁▁▁▁▂▁▂▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| train/learning_rate | ▁▂▄▆▆████████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▁ |
| train/loss | █▆▄▄▃▃▃▃▃▃▃▃▄▂▃▃▃▃▃▃▃▂▃▃▂▂▁▁▁▁▁▁▁▂▁▂▁▁▂▁ |

Run summary:

| Metric | Value |
|---|---|
| eval/loss | 0.91031 |
| eval/runtime | 18.9471 |
| eval/samples_per_second | 11.506 |
| eval/steps_per_second | 0.369 |
| train/epoch | 1.54 |
| train/global_step | 541.0 |
| train/grad_norm | 0.34375 |
| train/learning_rate | 3e-05 |
| train/loss | 0.8505 |






| Step | Training Loss | Validation Loss |
|---|---|---|
| 233 | 0.8813 | 0.928112 |

