<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/LoftQ_A_Better_LoRA_Adapter_for_Quantized_LLMs_Example_with_Mistral_7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how to fine-tune an LLM, here Mistral 7B, with LoftQ.


First, we need all these dependencies:

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U transformers
!pip install -q -U peft
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U trl
!pip install -q -U flash_attn

Import the following packages. Depending on the method (#1 or #2) you use to apply LoftQ, not all of these imports are necessary.

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model, replace_lora_weights_loftq
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

#Method #1: Replace LoRA with LoftQ


Load the tokenizer, configure padding, and detect whether the GPU supports bfloat16 and FlashAttention (neither is required, but both are better to have).

In [None]:
model_name = "mistralai/Mistral-7B-v0.1"
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left' #Necessary for FlashAttention compatibility

#Better to use bf16 if supported (Ampere GPUs or more recent)
#If bf16 is supported, the GPU is also recent enough to support FlashAttention
if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'

Load the dataset that we will use for fine-tuning.

In [None]:
dataset = load_dataset("timdettmers/openassistant-guanaco")

Load the model quantized to 4-bit (NF4) and prepare it for QLoRA fine-tuning.

In [None]:
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, quantization_config=bnb_config, device_map={"": 0},  attn_implementation=attn_implementation
)
model = prepare_model_for_kbit_training(model)
#Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id


Define the LoRA configuration. Adapters of rank 16 are attached to all the attention and MLP projection layers.

In [None]:
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)
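
As a sanity check on this configuration, the number of trainable parameters can be worked out by hand: each targeted linear layer gets a LoRA matrix A of shape (r, in_features) and a matrix B of shape (out_features, r). The sketch below assumes the usual Mistral 7B layer shapes; the 1024 output features of `k_proj` and `v_proj` (grouped-query attention) are an assumption that is not printed in this notebook, and `lora_param_count` is a hypothetical helper.

```python
# (in_features, out_features) of the targeted layers in one decoder layer.
# ASSUMPTION: k_proj/v_proj project to 1024 features (grouped-query attention).
TARGET_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # decoder layers in Mistral 7B


def lora_param_count(shapes, rank, num_layers):
    """Count LoRA weights: rank * (in + out) per targeted layer, times the layer count."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(TARGET_SHAPES, rank=16, num_layers=NUM_LAYERS)
print(f"{trainable:,}")  # 41,943,040
```

This matches the "Number of trainable parameters" line that the trainer logs in this notebook.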

The following cell is the only code specific to LoftQ fine-tuning.
We first wrap the quantized model into a PEFT model with the LoraConfig, and then replace the LoRA weights with weights initialized by LoftQ.

In [None]:
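#Wrap the quantized model with standard LoRA adapters
peft_model = get_peft_model(model, peft_config)
print(peft_model)
#Re-initialize the LoRA weights in place with LoftQ
replace_lora_weights_loftq(peft_model)
print(peft_model)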

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer):
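
To give an intuition of what a LoftQ initialization does, here is a toy sketch in pure Python. It is not PEFT's implementation: the quantizer is a naive uniform one standing in for NF4, the adapter has rank 1 instead of 16, and all function names are hypothetical. The idea it illustrates is that standard LoRA starts with an adapter that contributes nothing (so the model starts from the quantized weights Q), whereas LoftQ picks B and A so that Q + BA is closer to the original weights W.

```python
import random


def quantize(w, levels=4):
    """Toy uniform quantizer on [-1, 1]; a stand-in for NF4, not the real thing."""
    step = 2.0 / (levels - 1)
    return [[round((x + 1.0) / step) * step - 1.0 for x in row] for row in w]


def sub(a, b):
    """Element-wise difference of two matrices stored as lists of rows."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def frob_sq(m):
    """Squared Frobenius norm."""
    return sum(x * x for row in m for x in row)


def rank1_factors(m, iters=50):
    """Power iteration: returns (b, a) such that outer(b, a) approximates m."""
    rows, cols = len(m), len(m[0])
    a = [1.0] * cols
    for _ in range(iters):
        b = [sum(m[i][j] * a[j] for j in range(cols)) for i in range(rows)]
        norm = sum(x * x for x in b) ** 0.5
        b = [x / norm for x in b]
        a = [sum(m[i][j] * b[i] for i in range(rows)) for j in range(cols)]
    return b, a


random.seed(0)
W = [[random.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(8)]

Q = quantize(W)                 # what standard QLoRA starts from
residual = sub(W, Q)            # quantization error W - Q
b, a = rank1_factors(residual)  # low-rank fit of that error
BA = [[bi * aj for aj in a] for bi in b]

err_plain = frob_sq(residual)           # ||W - Q||^2
err_loftq = frob_sq(sub(residual, BA))  # ||W - (Q + BA)||^2
print(err_plain, err_loftq)
```

The second printed value is smaller than the first: the adapter absorbs part of the quantization error before any training step.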

For this tutorial, I fine-tuned the adapter for one epoch.

In [None]:
training_arguments = TrainingArguments(
        output_dir="./results/",
        evaluation_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=12,
        gradient_accumulation_steps=2,
        per_device_eval_batch_size=12,
        log_level="debug",
        logging_steps=50,
        learning_rate=1e-4,
        eval_steps=50,
        num_train_epochs=1,
        save_strategy='epoch',
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
)
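
With `warmup_ratio=0.1` and `lr_scheduler_type="linear"`, the learning rate ramps up linearly from 0 to `learning_rate` during the first 10% of the steps, then decays linearly back to 0. Below is a minimal sketch of that schedule, written from my understanding of the linear scheduler rather than copied from the library; `linear_schedule_lr` is a hypothetical helper, and 410 is the number of optimization steps reported in the training log of this notebook.

```python
import math


def linear_schedule_lr(step, total_steps, base_lr=1e-4, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: go from base_lr down to 0 at the last step
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


TOTAL_STEPS = 410  # taken from the training log
for step in (0, 20, 41, 200, 410):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```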

Configure SFTTrainer and start training:

In [None]:
trainer = SFTTrainer(
        model=peft_model,
        train_dataset=dataset['train'],
        eval_dataset=dataset['test'],
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Currently training with a batch size of: 12
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 12
  Total train batch size (w. parallel, distributed & accumulation) = 24
  Gradient Accumulation steps = 2
  Total optimization steps = 410
  Number of trainable parameters = 41,943,040
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss,Validation Loss
50,1.2147,1.144708
100,1.114,1.127303
150,1.1091,1.122229
200,1.0879,1.118388
250,1.0793,1.115253
300,1.1116,1.112868
350,1.0984,1.111607
400,1.0725,1.111051


***** Running Evaluation *****
  Num examples = 518
  Batch size = 12
Saving model checkpoint to ./results/checkpoint-410

TrainOutput(global_step=410, training_loss=1.109550892434469, metrics={'train_runtime': 19745.2716, 'train_samples_per_second': 0.499, 'train_steps_per_second': 0.021, 'total_flos': 2.1604031504429875e+17, 'train_loss': 1.109550892434469, 'epoch': 1.0})
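
The 410 optimization steps follow from the dataset size and the batch settings. The sketch below reproduces the arithmetic as I understand the trainer to do it (batches per epoch, integer-divided by the gradient accumulation steps); `optimization_steps` is a hypothetical helper, not a library function.

```python
import math


def optimization_steps(num_examples, per_device_batch, grad_accum, epochs=1, num_devices=1):
    """Number of optimizer updates for a run with gradient accumulation."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)


steps = optimization_steps(num_examples=9846, per_device_batch=12, grad_accum=2)
effective_batch = 12 * 2  # per-device batch size x gradient accumulation steps
samples_per_second = 9846 / 19745.2716  # examples / train_runtime from TrainOutput

print(steps)            # 410
print(effective_batch)  # 24
print(round(samples_per_second, 3))
```

At about 0.5 samples per second, one epoch takes roughly five and a half hours on this GPU.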

#Method #2: Jointly search for better LLM quantization and LoRA initialization.

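We are going to use a script from the PEFT examples. I cloned the repository to have everything locally.

Then, we run the script, which performs 5 iterations of LoftQ to find a better quantization of the base model and a better initialization of LoRA. The script saves both the new base model and the adapter in the directory given by --save_dir.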

In [None]:
!git clone https://github.com/huggingface/peft.git

!python peft/examples/loftq_finetuning/quantize_save_load.py \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --bits 4 \
    --iter 5 \
    --rank 16 \
    --save_dir "./loftq_iters/"

The output stream was truncated and contains only the last 5000 lines.
    (lora_embedding_B): ParameterDict()
  )
  (o_proj): Shell()
  (rotary_emb): MistralRotaryEmbedding()
)
MistralMLP(
  (gate_proj): lora.Linear(
    (base_layer): Linear(in_features=4096, out_features=14336, bias=False)
    (lora_dropout): ModuleDict(
      (default): Dropout(p=0.1, inplace=False)
    )
    (lora_A): ModuleDict(
      (default): Linear(in_features=4096, out_features=16, bias=False)
    )
    (lora_B): ModuleDict(
      (default): Linear(in_features=16, out_features=14336, bias=False)
    )
    (lora_embedding_A): ParameterDict()
    (lora_embedding_B): ParameterDict()
  )
  (up_proj): lora.Linear(
    (base_layer): Linear(in_features=4096, out_features=14336, bias=False)
    (lora_dropout): ModuleDict(
      (default): Dropout(p=0.1, inplace=False)
    )
    (lora_A): ModuleDict(
      (default): Linear(in_features=4096, out_features=16, bias=False)
    )
    (lora_B): ModuleD
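
The `MODEL_DIR` used in the next cell is consistent with the script arguments: the last part of the model name, followed by the number of bits and the rank, with the adapter stored in a `loft_init` subfolder. The sketch below only reconstructs these paths; the naming scheme is inferred from the paths used in this notebook and `loftq_output_dirs` is a hypothetical helper, so check the content of `--save_dir` if your version of the script behaves differently.

```python
import os


def loftq_output_dirs(model_name_or_path, bits, rank, save_dir):
    """Return (base model directory, adapter directory) under the assumed naming scheme."""
    model_name = model_name_or_path.split("/")[-1] + f"-{bits}bit-{rank}rank"
    base_dir = os.path.join(save_dir, model_name)
    adapter_dir = os.path.join(base_dir, "loft_init")
    return base_dir, adapter_dir


base_dir, adapter_dir = loftq_output_dirs(
    "mistralai/Mistral-7B-v0.1", bits=4, rank=16, save_dir="./loftq_iters/"
)
print(base_dir)  # ./loftq_iters/Mistral-7B-v0.1-4bit-16rank
print(adapter_dir)
```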

The following code is standard LoRA fine-tuning, except that it loads the new base model and the adapter produced by the script.

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

MODEL_DIR = "./loftq_iters/Mistral-7B-v0.1-4bit-16rank"
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left' #Necessary for FlashAttention compatibility

#Better to use bf16 if supported (Ampere GPUs or more recent)
#If bf16 is supported, the GPU is also recent enough to support FlashAttention
if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'

dataset = load_dataset("timdettmers/openassistant-guanaco")
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          MODEL_DIR, quantization_config=bnb_config, device_map={"": 0}, torch_dtype=compute_dtype,  attn_implementation=attn_implementation
)
model = prepare_model_for_kbit_training(model)
#Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id

peft_model = PeftModel.from_pretrained(
    model,
    MODEL_DIR,
    subfolder="loft_init",
    is_trainable=True,
)



training_arguments = TrainingArguments(
        output_dir="./results_loftq/",
        evaluation_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=12,
        gradient_accumulation_steps=2,
        per_device_eval_batch_size=12,
        log_level="debug",
        logging_steps=50,
        learning_rate=1e-4,
        eval_steps=50,
        num_train_epochs=1,
        save_strategy='epoch',
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
)

trainer = SFTTrainer(
        model=peft_model,
        train_dataset=dataset['train'],
        eval_dataset=dataset['test'],
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()


dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Currently training with a batch size of: 12
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 12
  Total train batch size (w. parallel, distributed & accumulation) = 24
  Gradient Accumulation steps = 2
  Total optimization steps = 410
  Number of trainable parameters = 41,943,040
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss,Validation Loss
50,1.2108,1.144687
100,1.1143,1.127882
150,1.1099,1.123113
200,1.0886,1.11897
250,1.0795,1.115887
300,1.1117,1.113478
350,1.0987,1.112291
400,1.0731,1.111724


***** Running Evaluation *****
  Num examples = 518
  Batch size = 12
Saving model checkpoint to ./results_loftq/checkpoint-410
tokenizer config file saved in ./results_loftq/checkpoint-410/tokenizer_config.json
Special tokens file saved in ./results_loftq/checkpoint-410/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=410, training_loss=1.1094729190919457, metrics={'train_runtime': 19963.3802, 'train_samples_per_second': 0.493, 'train_steps_per_second': 0.021, 'total_flos': 2.1604031504429875e+17, 'train_loss': 1.1094729190919457, 'epoch': 1.0})

#LoRA (for reference)

The following code runs standard LoRA fine-tuning on the same 4-bit quantized model, without LoftQ, for comparison.

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

model_name = "mistralai/Mistral-7B-v0.1"
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left' #Necessary for FlashAttention compatibility

#Better to use bf16 if supported (Ampere GPUs or more recent)
#If bf16 is supported, the GPU is also recent enough to support FlashAttention
if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'

dataset = load_dataset("timdettmers/openassistant-guanaco")
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, quantization_config=bnb_config, device_map={"": 0},  attn_implementation=attn_implementation
)
model = prepare_model_for_kbit_training(model)
#Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id

peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)

peft_model = get_peft_model(model, peft_config)


training_arguments = TrainingArguments(
        output_dir="./results/",
        evaluation_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=12,
        gradient_accumulation_steps=2,
        per_device_eval_batch_size=12,
        log_level="debug",
        logging_steps=50,
        learning_rate=1e-4,
        eval_steps=50,
        num_train_epochs=1,
        save_strategy='epoch',
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
)

trainer = SFTTrainer(
        model=peft_model,
        train_dataset=dataset['train'],
        eval_dataset=dataset['test'],
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()


dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Currently training with a batch size of: 12
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 12
  Total train batch size (w. parallel, distributed & accumulation) = 24
  Gradient Accumulation steps = 2
  Total optimization steps = 410
  Number of trainable parameters = 41,943,040
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss,Validation Loss
50,1.199,1.141348
100,1.1123,1.125894
150,1.108,1.121279
200,1.0867,1.117162
250,1.0778,1.114128
300,1.11,1.111743
350,1.0971,1.110371
400,1.0711,1.109758


***** Running Evaluation *****
  Num examples = 518
  Batch size = 12
Saving model checkpoint to ./results/checkpoint-410

TrainOutput(global_step=410, training_loss=1.1064177047915575, metrics={'train_runtime': 19850.6029, 'train_samples_per_second': 0.496, 'train_steps_per_second': 0.021, 'total_flos': 2.1604031504429875e+17, 'train_loss': 1.1064177047915575, 'epoch': 1.0})
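
To put the three runs side by side, the validation losses at step 400 reported in the tables above can be collected and ranked. This is only a convenience sketch over numbers copied from this notebook; each configuration was run once, so gaps this small should be read with caution.

```python
import math

# Validation loss at step 400, copied from the three tables in this notebook
final_eval_loss = {
    "LoftQ, method #1": 1.111051,
    "LoftQ, method #2": 1.111724,
    "LoRA (reference)": 1.109758,
}

# Rank the runs from lowest to highest validation loss
for name, loss in sorted(final_eval_loss.items(), key=lambda item: item[1]):
    print(f"{name}: loss={loss:.6f}, perplexity={math.exp(loss):.3f}")

best = min(final_eval_loss, key=final_eval_loss.get)
spread = max(final_eval_loss.values()) - min(final_eval_loss.values())
print(f"lowest loss: {best}, spread between runs: {spread:.6f}")
```

In these runs, the three configurations end within about 0.002 of each other in validation loss.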