<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/Fine_tune_Mixtral_8x7B_on_a_Single_Consumer_GPU_with_AQLM_Quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how to fine-tune an LLM quantized with AQLM. It uses a version of Mixtral-8x7B quantized to 2 bits with AQLM.

The notebook requires a GPU with at least 24 GB of memory. It can run on a 16 GB GPU if you adjust the training hyperparameters (e.g., reduce the batch size and the maximum sequence length, and remove w1, w2, and w3 from the LoRA target modules).
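
As a rough sanity check on these memory figures, the sketch below estimates how much space the weights alone need. It assumes Mixtral-8x7B has about 46.7 billion parameters (a figure from the public model description, not from this notebook), and `weight_gb` is a hypothetical helper. The shard sizes are those of the three safetensors files of the checkpoint this notebook downloads (4.99, 5.00, and 3.11 GB).

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Storage needed for n_params weights at bits_per_param bits each, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: approximate total parameter count of Mixtral-8x7B.
MIXTRAL_PARAMS = 46.7e9

fp16_gb = weight_gb(MIXTRAL_PARAMS, 16)    # unquantized 16-bit weights
two_bit_gb = weight_gb(MIXTRAL_PARAMS, 2)  # idealized 2 bits per weight
shards_gb = 4.99 + 5.00 + 3.11             # size of the downloaded checkpoint files

print(f"16-bit weights:      {fp16_gb:.1f} GB")
print(f"ideal 2-bit weights: {two_bit_gb:.1f} GB")
print(f"checkpoint on disk:  {shards_gb:.1f} GB")
```

The checkpoint is somewhat larger than the idealized 2-bit figure, presumably because not every tensor is stored at 2 bits and the quantization format carries extra data. Whatever memory is left after loading the weights has to hold activations, the LoRA adapters, and the optimizer states, which is why the batch size, the sequence length, and the number of LoRA target modules are the knobs to turn on a smaller GPU.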

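AQLM support in the Hugging Face libraries was very recent when I wrote this notebook, so the original run installed transformers and peft from source (see the log below the cell). Released versions now include AQLM support, so the cell installs the packages from PyPI: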

In [None]:
!pip install transformers peft trl accelerate bitsandbytes
!pip install "aqlm[gpu,cpu]"

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-gty5sjc4
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-gty5sjc4
  Resolved https://github.com/huggingface/transformers.git to commit d45f47ab7f7c31991bb98a0302ded59ab6adac31
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting git+https://github.com/huggingface/peft
  Cloning https://github.com/huggingface/peft to /tmp/pip-req-build-rkt4ycce
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft /tmp/pip-req-build-rkt4ycce
  Resolved https://github.com/huggingface/peft to commit e5973883057b723b3f0fe3982bfa9d1e0c0fd8ec
  Installing build dependencies ... done
  Getting requirements to build wheel 

Load the model and its tokenizer:

In [None]:
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)
from trl import SFTTrainer

# Load the 2-bit AQLM checkpoint directly onto the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="cuda",
    low_cpu_mem_usage=True,
)

# The tokenizer is loaded from the original (unquantized) model repository.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
# Mixtral's tokenizer defines no padding token, so reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token


Prepare the model for training. `prepare_model_for_kbit_training` enables gradient checkpointing by default; do not skip this step, or training will run out of memory (OOM).

In [None]:
model = prepare_model_for_kbit_training(model)
# Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id

Then, load an instruction dataset for fine-tuning:

In [None]:
dataset = load_dataset("timdettmers/openassistant-guanaco")

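
The trainer configured further down reads a single `text` column from this dataset. As far as I can tell, each row of openassistant-guanaco stores a whole conversation in that one string, with `### Human:` and `### Assistant:` markers in front of each turn. If you want to fine-tune on your own data with the same setup, a minimal sketch for producing rows of that shape could look like this (`to_guanaco_text` is a hypothetical helper, and the exact marker layout is an assumption about the dataset):

```python
def to_guanaco_text(turns):
    """Join (human, assistant) pairs into one record with a single "text" field."""
    parts = []
    for human, assistant in turns:
        parts.append(f"### Human: {human}")
        parts.append(f"### Assistant: {assistant}")
    # Assumption: turns are concatenated directly, without a separator.
    return {"text": "".join(parts)}


record = to_guanaco_text([("What is AQLM?", "A method for quantizing LLM weights.")])
print(record["text"])
```

A list of such records can be turned into a dataset with `datasets.Dataset.from_list` and passed to the trainer in place of `dataset['train']`.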

Run the training.

Note that the notebook fine-tunes for only 100 steps, which takes roughly 2.5 to 3 hours (the run logged below took 9,774 seconds). Fine-tune for 2 or 3 epochs to obtain good results.
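
To put the 2 to 3 epoch recommendation in perspective, the sketch below extrapolates from the figures this run reports in its output further down (9,846 training examples, 100 steps in 9,774 seconds) and from the batch settings in the next cell. It assumes the time per step stays constant on the same hardware, which is only an approximation; `estimated_hours` is a hypothetical helper.

```python
# Figures taken from this notebook's training configuration and log.
NUM_TRAIN_EXAMPLES = 9_846
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 4
STEPS_RUN = 100
RUNTIME_SECONDS = 9_774.0708

# One optimization step consumes batch size x accumulation steps examples.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
seconds_per_step = RUNTIME_SECONDS / STEPS_RUN
epochs_covered = STEPS_RUN * effective_batch / NUM_TRAIN_EXAMPLES


def estimated_hours(num_epochs: float) -> float:
    """Extrapolate the wall-clock time for num_epochs at the measured step time."""
    steps = num_epochs * NUM_TRAIN_EXAMPLES / effective_batch
    return steps * seconds_per_step / 3600


print(f"effective batch size: {effective_batch}")
print(f"epochs covered by 100 steps: {epochs_covered:.2f}")
for epochs in (2, 3):
    print(f"{epochs} epochs: about {estimated_hours(epochs):.0f} hours")
```

The 100 steps therefore cover only about 0.16 of an epoch, which matches the `epoch` value in the final `TrainOutput`, and a full 2 to 3 epoch run would take on the order of days on the same hardware.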

In [None]:
training_arguments = TrainingArguments(
    output_dir="./mixtral8x7b_aqlm_lora",
    evaluation_strategy="steps",  # evaluate every eval_steps steps
    do_eval=True,
    optim="paged_adamw_8bit",  # paged 8-bit AdamW from bitsandbytes
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size: 4 x 4 = 16
    per_device_eval_batch_size=4,
    log_level="debug",
    logging_steps=25,
    learning_rate=1e-4,
    eval_steps=25,
    save_strategy="steps",
    max_steps=100,  # short demo run; overrides num_train_epochs
    warmup_steps=25,
    lr_scheduler_type="linear",
)

# LoRA adapters on the attention projections, the MoE router ("gate"),
# and the expert feed-forward layers (w1, w2, w3).
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "q_proj", "v_proj", "o_proj", "gate", "w1", "w2", "w3"],
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    dataset_text_field="text",  # column that holds the training text
    max_seq_length=256,  # longer examples are truncated
    tokenizer=tokenizer,
    args=training_arguments,
)

trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
You have loaded a model on multiple GPUs. `is_model_parallel` attribute will be force-set to `True` to avoid any unexpected behavior such as device placement mismatching.
max_steps is given, it will override any value given in num_train_epochs
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 100
  Number of trainable parameters = 242,225,152
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...



Step,Training Loss,Validation Loss
25,1.3423,1.242244
50,1.1432,1.198169
75,1.1322,1.188843
100,1.1351,1.185728


***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=100, training_loss=1.216408224105835, metrics={'train_runtime': 9774.0708, 'train_samples_per_second': 0.164, 'train_steps_per_second': 0.01, 'total_flos': 1.632085255323648e+16, 'train_loss': 1.216408224105835, 'epoch': 0.16})
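
The log above reports 242,225,152 trainable parameters. That figure can be reproduced by hand from the LoRA configuration: an adapter of rank r on a linear layer with `in` inputs and `out` outputs adds r × (in + out) parameters. The sketch below assumes the layer sizes of the public Mixtral-8x7B configuration (hidden size 4096, feed-forward size 14336, 8 key/value heads of dimension 128, 8 experts, 32 layers); these values are not stated in this notebook, and `lora_params` is a hypothetical helper.

```python
# Assumed Mixtral-8x7B dimensions (from the public model config, not from this notebook).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads of dimension 128
NUM_EXPERTS = 8
NUM_LAYERS = 32
RANK = 16  # r in the LoraConfig above

# module name -> (in_features, out_features, copies per layer)
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN, 1),
    "k_proj": (HIDDEN, KV_DIM, 1),
    "v_proj": (HIDDEN, KV_DIM, 1),
    "o_proj": (HIDDEN, HIDDEN, 1),
    "gate": (HIDDEN, NUM_EXPERTS, 1),  # MoE router
    "w1": (HIDDEN, INTERMEDIATE, NUM_EXPERTS),
    "w2": (INTERMEDIATE, HIDDEN, NUM_EXPERTS),
    "w3": (HIDDEN, INTERMEDIATE, NUM_EXPERTS),
}


def lora_params(modules, rank=RANK):
    """Count LoRA parameters: rank * (in + out) per adapted layer, over all layers."""
    per_layer = 0
    for name in modules:
        in_features, out_features, copies = TARGETS[name]
        per_layer += rank * (in_features + out_features) * copies
    return per_layer * NUM_LAYERS


all_targets = lora_params(TARGETS)
without_experts = lora_params(["q_proj", "k_proj", "v_proj", "o_proj", "gate"])

print(f"all target modules:     {all_targets:,}")
print(f"without w1, w2, and w3: {without_experts:,}")
```

Under these assumptions the full set of target modules gives exactly the number in the log, and almost all of it comes from the expert layers. Dropping w1, w2, and w3 leaves about 15.7 million trainable parameters, which is why removing them is one of the suggested changes for a 16 GB GPU.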