# Finetuning using Axolotl

This notebook is a minimal example of how to finetune an LLM using [Axolotl](https://github.com/axolotl-ai-cloud/axolotl). Axolotl is a CLI tool that uses config files to drive different methods of LLM finetuning. We created a Python wrapper around the CLI that covers the end-to-end workflow.

In the example below, we show how to define or load finetuning configurations to replicate the fine-tuning process from our paper, start a fine-tuning job, and push the model to Hugging Face.

## Setup

Make sure to run this notebook on a system with enough compute resources for finetuning, and follow the setup instructions in the README to install Axolotl and related libraries.

Let's start by loading the code components we need. The `FinetuneConfig` class holds configurations, and the `Finetune` class is used to create and run a finetuning job.

In [1]:
import sys
import os

sys.path.insert(0, os.path.join(os.getcwd(), '..'))

from finetune import Finetune, FinetuneConfig

[2025-01-21 17:35:35,201] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)


Also make sure to log in to the Hugging Face Hub so that the output model can be saved.

In [2]:
from huggingface_hub import login
login()
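On a headless machine the interactive widget may not be available. A minimal sketch of a non-interactive alternative follows, assuming the access token is stored in an environment variable. The variable name `HF_TOKEN` and both helper functions are illustrative choices, not part of this repository; `login(token=...)` is the `huggingface_hub` call used above with an explicit token.

```python
import os


def hf_token_from_env(var_name: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in an environment variable.

    The variable name is an assumption for this sketch; use whatever
    name your environment defines.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {var_name} to a Hugging Face access token with write permission."
        )
    return token


def login_from_env(var_name: str = "HF_TOKEN") -> None:
    """Log in to the Hugging Face Hub without the interactive prompt."""
    # Imported lazily so hf_token_from_env works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=hf_token_from_env(var_name))
```

Calling `login_from_env()` in place of `login()` should have the same effect, provided the token has write access to the target repository.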


## Configuration
Next, we load a locally stored configuration file to perform LoRA fine-tuning of `meta-llama/Llama-2-7b-chat-hf`. Optionally, we can set the `hub_model_id` field to specify the Hugging Face repository that the fine-tuned LLM will be pushed to.

For the experiments presented in the paper, we used two configurations:

1. **LoRA** fine-tuning. The configuration file is `altus/finetune/examples/llama-2/lora.yml`.
2. **SFT** (supervised fine-tuning). The configuration file is `altus/finetune/examples/llama-2/fft_optimized.yml`.

In [3]:
import yaml

# Specify the path to your YAML file
file_path = os.path.join(os.getcwd(), '..', 'finetune/examples/llama-2/lora.yml')

# Open the file and load the data
with open(file_path, encoding='utf-8') as file:
    config_dict = yaml.safe_load(file)  # Load the existing data

config_dict['base_model'] = 'meta-llama/Llama-2-7b-chat-hf'
config_dict['hub_model_id'] = 'vijil/my_lora_tune'  # Add or update the model_id to push the trained model
config_dict['eval_sample_packing'] = False

In [4]:
config_dict

{'base_model': 'meta-llama/Llama-2-7b-chat-hf',
 'model_type': 'LlamaForCausalLM',
 'tokenizer_type': 'LlamaTokenizer',
 'load_in_8bit': True,
 'load_in_4bit': False,
 'strict': False,
 'datasets': [{'path': 'mhenrichsen/alpaca_2k_test', 'type': 'alpaca'}],
 'dataset_prepared_path': None,
 'val_set_size': 0.05,
 'output_dir': './lora-out',
 'sequence_len': 4096,
 'sample_packing': True,
 'pad_to_sequence_len': True,
 'adapter': 'lora',
 'lora_model_dir': None,
 'lora_r': 32,
 'lora_alpha': 16,
 'lora_dropout': 0.05,
 'lora_target_linear': True,
 'lora_fan_in_fan_out': None,
 'wandb_project': None,
 'wandb_entity': None,
 'wandb_watch': None,
 'wandb_name': None,
 'wandb_log_model': None,
 'gradient_accumulation_steps': 4,
 'micro_batch_size': 2,
 'num_epochs': 4,
 'optimizer': 'adamw_bnb_8bit',
 'lr_scheduler': 'cosine',
 'learning_rate': 0.0002,
 'train_on_inputs': False,
 'group_by_length': False,
 'bf16': 'auto',
 'fp16': None,
 'tf32': False,
 'gradient_checkpointing': True,
 'earl
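A few quantities that matter for training follow directly from the values printed above. The sketch below derives them under stated assumptions: a single GPU, standard LoRA scaling (`lora_alpha / lora_r`), and a validation split that takes roughly `val_set_size` of the 2,000 dataset examples. The helper `derived_settings` is hypothetical and not part of the wrapper.

```python
def derived_settings(cfg: dict, num_examples: int, num_gpus: int = 1) -> dict:
    """Derive batch size, LoRA scale, and split sizes from a config dict.

    Assumes standard LoRA scaling (alpha / r) and a fractional val_set_size.
    """
    eval_examples = round(num_examples * cfg["val_set_size"])
    return {
        # Sequences (packed, when sample_packing is on) per optimizer step.
        "effective_batch_size": cfg["micro_batch_size"]
        * cfg["gradient_accumulation_steps"]
        * num_gpus,
        # Factor applied to the low-rank update in standard LoRA.
        "lora_scale": cfg["lora_alpha"] / cfg["lora_r"],
        "eval_examples": eval_examples,
        "train_examples": num_examples - eval_examples,
    }


# Values copied from the config shown above.
example_cfg = {
    "micro_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "lora_r": 32,
    "lora_alpha": 16,
    "val_set_size": 0.05,
}
settings = derived_settings(example_cfg, num_examples=2000)
print(settings)
```

With these values the effective batch size is 8 sequences per optimizer step and the LoRA scale is 0.5, so raising `lora_r` without raising `lora_alpha` shrinks the scale applied to the adapter update.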

Let's now load the config dict into a `FinetuneConfig` object.

In [5]:
# see all config options in './finetune/axolotl/examples/config.qmd'
config = FinetuneConfig(config_dict)

[2025-01-21 17:35:54,661] [DEBUG] [axolotl.normalize_config:79] [PID:3716] [RANK:0] bf16 support detected, enabling for this configuration.




config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

[2025-01-21 17:35:54,938] [INFO] [axolotl.normalize_config:182] [PID:3716] [RANK:0] GPU memory usage baseline: 0.000GB (+0.682GB misc)


## Run the finetuning job
Now simply pass the config to a `Finetune` object and kick off the job.

In [6]:
# create a finetune object with the config and run
finetune = Finetune(config)

In [7]:
finetune.run()  # start training

[INFO] Job ID: 2cb97706-d81e-11ef-9844-0242ac120002


tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

[2025-01-21 17:36:00,111] [DEBUG] [axolotl.load_tokenizer:279] [PID:3716] [RANK:0] EOS: 2 / </s>
[2025-01-21 17:36:00,112] [DEBUG] [axolotl.load_tokenizer:280] [PID:3716] [RANK:0] BOS: 1 / <s>
[2025-01-21 17:36:00,112] [DEBUG] [axolotl.load_tokenizer:281] [PID:3716] [RANK:0] PAD: 2 / </s>
[2025-01-21 17:36:00,112] [DEBUG] [axolotl.load_tokenizer:282] [PID:3716] [RANK:0] UNK: 0 / <unk>
[2025-01-21 17:36:00,113] [INFO] [axolotl.load_tokenizer:293] [PID:3716] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2025-01-21 17:36:00,113] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:3716] [RANK:0] Unable to find prepared dataset in last_run_prepared/a68bb67a61191b8469cb3317f4e3323e
[2025-01-21 17:36:00,114] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:3716] [RANK:0] Loading raw datasets...
[2025-01-21 17:36:00,114] [INFO] [axolotl.load_tokenized_prepared_datasets:193] [PID:3716] [RANK:0] No s

Downloading readme:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/1.76M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Tokenizing Prompts (num_proc=64):   0%|          | 0/2000 [00:00<?, ? examples/s]

[2025-01-21 17:36:09,997] [INFO] [axolotl.load_tokenized_prepared_datasets:410] [PID:3716] [RANK:0] merging datasets


Dropping Long Sequences (num_proc=255):   0%|          | 0/2000 [00:00<?, ? examples/s]

Add position_id column (Sample Packing) (num_proc=255):   0%|          | 0/2000 [00:00<?, ? examples/s]

[2025-01-21 17:36:22,744] [INFO] [axolotl.load_tokenized_prepared_datasets:423] [PID:3716] [RANK:0] Saving merged prepared dataset to disk... last_run_prepared/a68bb67a61191b8469cb3317f4e3323e


Saving the dataset (0/1 shards):   0%|          | 0/2000 [00:00<?, ? examples/s]

[2025-01-21 17:36:22,926] [DEBUG] [axolotl.log:61] [PID:3716] [RANK:0] total_num_tokens: 414_041
[2025-01-21 17:36:22,940] [DEBUG] [axolotl.log:61] [PID:3716] [RANK:0] `total_supervised_tokens: 294_246`
[2025-01-21 17:36:28,480] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:3716] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 414041
[2025-01-21 17:36:28,481] [DEBUG] [axolotl.log:61] [PID:3716] [RANK:0] data_loader_len: 6
[2025-01-21 17:36:28,481] [INFO] [axolotl.log:61] [PID:3716] [RANK:0] sample_packing_eff_est across ranks: [0.9719637357271634]
[2025-01-21 17:36:28,482] [DEBUG] [axolotl.log:61] [PID:3716] [RANK:0] sample_packing_eff_est: 0.98
[2025-01-21 17:36:28,482] [DEBUG] [axolotl.log:61] [PID:3716] [RANK:0] total_num_steps: 24
[2025-01-21 17:36:28,514] [DEBUG] [axolotl.train.log:61] [PID:3716] [RANK:0] loading tokenizer... meta-llama/Llama-2-7b-chat-hf




[2025-01-21 17:36:28,782] [DEBUG] [axolotl.load_tokenizer:279] [PID:3716] [RANK:0] EOS: 2 / </s>
[2025-01-21 17:36:28,782] [DEBUG] [axolotl.load_tokenizer:280] [PID:3716] [RANK:0] BOS: 1 / <s>
[2025-01-21 17:36:28,782] [DEBUG] [axolotl.load_tokenizer:281] [PID:3716] [RANK:0] PAD: 2 / </s>
[2025-01-21 17:36:28,783] [DEBUG] [axolotl.load_tokenizer:282] [PID:3716] [RANK:0] UNK: 0 / <unk>
[2025-01-21 17:36:28,783] [INFO] [axolotl.load_tokenizer:293] [PID:3716] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2025-01-21 17:36:28,783] [DEBUG] [axolotl.train.log:61] [PID:3716] [RANK:0] loading model and peft_config...
[2025-01-21 17:36:29,425] [INFO] [axolotl.load_model:359] [PID:3716] [RANK:0] patching with flash attention for sample packing
[2025-01-21 17:36:29,428] [INFO] [axolotl.load_model:408] [PID:3716] [RANK:0] patching _expand_mask


model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

[2025-01-21 17:39:50,392] [INFO] [accelerate.utils.modeling.get_balanced_memory:965] [PID:3716] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

[2025-01-21 17:42:08,910] [INFO] [axolotl.load_model:720] [PID:3716] [RANK:0] GPU memory usage after model load: 6.681GB (+0.000GB cache, +1.168GB misc)
[2025-01-21 17:42:08,946] [INFO] [axolotl.load_model:771] [PID:3716] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training
[2025-01-21 17:42:08,948] [INFO] [axolotl.load_model:780] [PID:3716] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2025-01-21 17:42:08,952] [INFO] [axolotl.load_lora:924] [PID:3716] [RANK:0] found linear modules: ['q_proj', 'k_proj', 'o_proj', 'down_proj', 'up_proj', 'v_proj', 'gate_proj']
trainable params: 79,953,920 || all params: 6,818,369,536 || trainable%: 1.172625208678628
[2025-01-21 17:42:57,361] [INFO] [axolotl.load_model:825] [PID:3716] [RANK:0] GPU memory usage after adapters: 6.979GB (+0.851GB cache, +1.168GB misc)


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


[2025-01-21 17:42:57,918] [INFO] [axolotl.train.log:61] [PID:3716] [RANK:0] Pre-saving adapter config to ./lora-out
[2025-01-21 17:42:57,972] [INFO] [axolotl.train.log:61] [PID:3716] [RANK:0] Starting trainer...
[2025-01-21 17:42:58,271] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:3716] [RANK:0] packing_efficiency_estimate: 0.98 total_num_tokens per device: 414041
[2025-01-21 17:42:58,273] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:3716] [RANK:0] packing_efficiency_estimate: 0.98 total_num_tokens per device: 414041
[2025-01-21 17:42:58,333] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:3716] [RANK:0] packing_efficiency_estimate: 0.98 total_num_tokens per device: 414041




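| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 1    | 1.3141        | 1.278956        |
| 3    | 1.2393        | 1.269746        |
| 6    | 1.1715        | 1.16147         |
| 9    | 0.979         | 1.041919        |
| 12   | 0.9962        | 1.008087        |
| 15   | 0.9937        | 0.951219        |
| 18   | 0.9483        | 0.918372        |
| 21   | 0.8596        | 0.905319        |
| 24   | 0.9864        | 0.896698        |
| 27   | 0.8925        | 0.891167        |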
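To read the loss values reported above more easily, the following sketch parses them with the standard library and checks whether the validation loss improves at every evaluation. The numbers are copied from this run; the variable names are illustrative.

```python
import csv
import io

# Loss values copied from the training output of this run.
LOSS_LOG = """Step,Training Loss,Validation Loss
1,1.3141,1.278956
3,1.2393,1.269746
6,1.1715,1.16147
9,0.979,1.041919
12,0.9962,1.008087
15,0.9937,0.951219
18,0.9483,0.918372
21,0.8596,0.905319
24,0.9864,0.896698
27,0.8925,0.891167
"""

# Parse each row into (step, training loss, validation loss).
rows = [
    (int(r["Step"]), float(r["Training Loss"]), float(r["Validation Loss"]))
    for r in csv.DictReader(io.StringIO(LOSS_LOG))
]

val_losses = [val for _, _, val in rows]
# True when every evaluation improves on the previous one.
val_strictly_decreasing = all(a > b for a, b in zip(val_losses, val_losses[1:]))
# Row with the lowest validation loss.
best_step, _, best_val = min(rows, key=lambda row: row[2])

print(f"validation loss strictly decreasing: {val_strictly_decreasing}")
print(f"best validation loss {best_val} at step {best_step}")
```

In this run the validation loss decreases at every evaluation and is lowest at the final logged step, while the training loss fluctuates from step to step.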


[2025-01-21 17:44:29,644] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
[2025-01-21 17:44:48,878] [INFO] [axolotl.callbacks.on_step_end:125] [PID:3716] [RANK:0] GPU memory usage while training: 7.210GB (+7.442GB cache, +1.201GB misc)
[2025-01-21 17:46:18,239] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
[2025-01-21 17:48:26,105] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
[2025-01-21 17:50:34,143] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
[2025-01-21 17:52:42,189] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered te



[2025-01-21 17:53:05,035] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:3716] [RANK:0] packing_efficiency_estimate: 0.98 total_num_tokens per device: 414041
[2025-01-21 17:54:53,709] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
[2025-01-21 17:57:01,695] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
[2025-01-21 17:59:09,779] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
[2025-01-21 18:01:17,876] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.




[2025-01-21 18:01:57,196] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:3716] [RANK:0] packing_efficiency_estimate: 0.98 total_num_tokens per device: 414041
[2025-01-21 18:03:26,688] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
[2025-01-21 18:05:34,683] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
[2025-01-21 18:07:42,634] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
[2025-01-21 18:09:50,563] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.




[2025-01-21 18:12:02,650] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
[2025-01-21 18:12:02,689] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:3716] [RANK:0] packing_efficiency_estimate: 0.98 total_num_tokens per device: 414041
[2025-01-21 18:14:10,695] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
[2025-01-21 18:16:18,747] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
[2025-01-21 18:18:26,798] [INFO] [accelerate.accelerator.log:61] [PID:3716] The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.




[2025-01-21 18:18:41,447] [INFO] [axolotl.train.log:61] [PID:3716] [RANK:0] Training Completed!!! Saving pre-trained model to ./lora-out




adapter_model.bin:   0%|          | 0.00/320M [00:00<?, ?B/s]
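The `trainable params: 79,953,920` line in the log above can be reproduced by hand. The sketch below counts the LoRA parameters for rank 32 applied to the seven linear modules the log lists, assuming the published Llama-2-7B dimensions (hidden size 4096, MLP intermediate size 11008, 32 decoder layers) and standard LoRA, where each adapted `in x out` linear layer gains `r * (in + out)` parameters. The dimensions come from the model's public configuration rather than from this notebook.

```python
# Llama-2-7B dimensions (assumed from the model's public config).
HIDDEN = 4096
INTERMEDIATE = 11008
NUM_LAYERS = 32
LORA_R = 32  # lora_r from the config used in this notebook

# (in_features, out_features) of each adapted linear module in one decoder layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in the A (rank x in) and B (out x rank) LoRA matrices."""
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, LORA_R) for i, o in MODULE_SHAPES.values())
trainable = per_layer * NUM_LAYERS

ALL_PARAMS = 6_818_369_536  # "all params" value reported in the log
trainable_pct = 100 * trainable / ALL_PARAMS

print(f"trainable params: {trainable:,}")
print(f"trainable%: {trainable_pct:.4f}")
```

The count matches the logged value, which suggests the adapter covers exactly these seven projections in every decoder layer and nothing else.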