# Fine-tuning an LLM for Command Generation in CALM

This is a worked example of how to efficiently fine-tune a base language model from [Hugging Face Hub](https://huggingface.co/models) using the [Unsloth](https://docs.unsloth.ai) and [TRL](https://huggingface.co/docs/trl/en/index) libraries for the task of command generation within [CALM](https://rasa.com/docs/rasa-pro/calm).

Unsloth integrates with TRL in order to reduce the time and GPU memory required to fine-tune LLMs, when compared to using TRL exclusively.

To run fine-tuning, you must have first [generated the dataset](https://rasa.com/rasa-pro/docs/operating/fine-tuning-recipe) files `train.jsonl` and `val.jsonl`, which must be in the [TRL instruction format](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support).
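
Before starting, it can help to sanity-check the dataset files. The sketch below is a minimal check, assuming each line is a JSON object in either the TRL instruction format (`prompt` and `completion` keys) or the TRL conversational format (a `messages` list of `role`/`content` dicts). The helper names `check_jsonl_record` and `check_jsonl_file` are hypothetical and not part of TRL.

```python
import json


def check_jsonl_record(line: str) -> str:
    """Return 'instruction' or 'conversational' for a valid record, else raise ValueError."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    # instruction format: {"prompt": "...", "completion": "..."}
    if isinstance(record.get("prompt"), str) and isinstance(record.get("completion"), str):
        return "instruction"
    # conversational format: {"messages": [{"role": "...", "content": "..."}, ...]}
    messages = record.get("messages")
    if isinstance(messages, list) and messages and all(
        isinstance(m, dict) and "role" in m and "content" in m for m in messages
    ):
        return "conversational"
    raise ValueError("record matches neither the instruction nor the conversational format")


def check_jsonl_file(path: str) -> dict:
    """Count the records of each format in a JSONL file, skipping blank lines."""
    counts = {"instruction": 0, "conversational": 0}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                counts[check_jsonl_record(line)] += 1
    return counts
```

Calling `check_jsonl_file("train.jsonl")` and `check_jsonl_file("val.jsonl")` should report a single format for both files; a `ValueError` points at a malformed line.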

## 1. Configure fine-tuning environment

In order to run this notebook, you will need to first install Unsloth onto a machine with the following minimum hardware requirements:
- Single NVIDIA A100 GPU with 40GB VRAM
- 12 core CPU with 85GB RAM
- 250GB disk
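
To compare a machine against these requirements, something like the following sketch may be useful. It relies only on the standard library for CPU, RAM and disk, and reads GPU details through PyTorch only if it is already installed. The function name `describe_machine` is hypothetical, and the RAM lookup assumes a Unix-like system.

```python
import os
import shutil


def describe_machine(path: str = ".") -> dict:
    """Collect CPU count, RAM, free disk and (if PyTorch is installed) GPU memory."""
    gib = 1024 ** 3
    info = {
        "cpu_cores": os.cpu_count(),
        "disk_free_gib": shutil.disk_usage(path).free / gib,
        "ram_gib": None,
        "gpu_name": None,
        "gpu_vram_gib": None,
    }
    # total physical RAM; os.sysconf is only available on Unix-like systems
    try:
        info["ram_gib"] = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / gib
    except (AttributeError, ValueError, OSError):
        pass
    # GPU details need PyTorch with CUDA; skipped if PyTorch is missing or unusable
    try:
        import torch

        if torch.cuda.is_available():
            props = torch.cuda.get_device_properties(0)
            info["gpu_name"] = props.name
            info["gpu_vram_gib"] = props.total_memory / gib
    except Exception:
        pass
    return info


print(describe_machine())
```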

Please refer to the Unsloth installation instructions in the [official documentation](https://github.com/unslothai/unsloth/blob/main/README.md).

Here is an example of how to set up the environment:

First, we provisioned a Linux instance with the appropriate hardware and the following software installed:
- Python 3.10
- CUDA Toolkit 12.1
- PyTorch 2.2.

Next, we installed Unsloth and other packages as follows:

In [2]:
%%sh
pip install torch
pip install unsloth
pip install xformers trl peft accelerate bitsandbytes huggingface_hub[cli]
pip install pandas matplotlib




Collecting unsloth
  Downloading unsloth-2025.3.19-py3-none-any.whl.metadata (46 kB)
Collecting unsloth_zoo>=2025.3.17 (from unsloth)
  Downloading unsloth_zoo-2025.3.17-py3-none-any.whl.metadata (8.0 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.29.post3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting bitsandbytes (from unsloth)
  Downloading bitsandbytes-0.45.4-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting triton>=3.0.0 (from unsloth)
  Downloading triton-3.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.18-py3-none-any.whl.metadata (9.2 kB)
Collecting transformers!=4.47.0,>=4.46.1 (from unsloth)
  Downloading transformers-4.50.3-py3-none-any.whl.metadata (39 kB)
Collecting datasets>=2.16.0 (from unsloth)
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting sentencepiece>=0.2.0 (from unsloth)
  Downloading s

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.6.0.dev20250319+cu128 requires torch==2.8.0.dev20250319, but you have torch 2.6.0 which is incompatible.

Successfully installed accelerate-1.5.2 aiohappyeyeballs-2.6.1 aiohttp-3.11.14 aiosignal-1.3.2 bitsandbytes-0.45.4 cut_cross_entropy-25.1.1 datasets-3.5.0 diffusers-0.32.2 dill-0.3.8 docstring-parser-0.16 frozenlist-1.5.0 hf_transfer-0.1.9 huggingface_hub-0.30.1 markdown-it-py-3.0.0 mdurl-0.1.2 multidict-6.3.0 multiprocess-0.70.16 nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-cusparselt-cu12-0.6.2 nvidia-nccl-cu12-2.21.5 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.4.127 pandas-2.2.3 peft-0.15.1 propcache-0.3.1 protobuf-3.20.3 pyarrow-19.0.1 pytz-2025.2 regex-2024.11.6 rich-14.0.0 safetensors-0.5.3 sentencepiece-0.2.0 shtab-1.7.1 sympy-1.13.1 tokenizers-0.21.1 torch-2.6.0 torchvision-0.21.0 tqdm-4.67.1 transformers-4.50.3 triton-3.2.0 trl-0.15.2 typeg


Collecting matplotlib
  Downloading matplotlib-3.10.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Downloading contourpy-1.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.4 kB)
Collecting cycler>=0.10 (from matplotlib)
  Downloading cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Downloading fonttools-4.56.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (101 kB)
Collecting kiwisolver>=1.3.1 (from matplotlib)
  Downloading kiwisolver-1.4.8-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.2 kB)
Downloading matplotlib-3.10.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.6 MB)
Downloading contourpy-1.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_


In [9]:
# %%sh
# # install unsloth and other dependencies
# pip install torch==2.2.2
# pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# pip install --no-deps "xformers<=0.0.26" "trl<0.9.0" peft accelerate bitsandbytes huggingface_hub[cli]
# # remove tpu-only package that is installed by default on gcp runtimes, even when only using gpu
# pip uninstall torch-xla -y

The Unsloth installation is very sensitive to the environment, in particular the CUDA and PyTorch versions, so follow the [official installation instructions](https://github.com/unslothai/unsloth/blob/main/README.md) appropriate for your set-up.
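
Because version mismatches are a common cause of installation problems, it may be worth recording what actually got installed. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the `pip install` commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


packages = ["torch", "unsloth", "xformers", "trl", "peft", "accelerate", "bitsandbytes", "transformers"]
for name, ver in installed_versions(packages).items():
    print(f"{name:15s} {ver or 'not installed'}")
```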

## 2. Download base model

You can download the model you want to fine-tune from Hugging Face Hub using the [official CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli) with an [API access token](https://huggingface.co/docs/transformers.js/en/guides/private#step-1-generating-a-user-access-token) as per the code below. Make sure you first update the `HUGGINGFACE_TOKEN` and `BASE_MODEL` environment variables with your own values.

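When testing this notebook, the [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) model was used. Note that `meta-llama/Meta-Llama-3.1-8B-Instruct` is a [gated model](https://huggingface.co/docs/hub/en/models-gated) that you must request access to before using. The cell below, as captured here, instead sets `BASE_MODEL` to `unsloth/mistral-7b-instruct-v0.2-bnb-4bit`, and the outputs in the rest of this notebook come from that model; set `BASE_MODEL` to whichever model you intend to fine-tune.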

You can use any other PyTorch model available on [Hugging Face Hub](https://huggingface.co/models). It is recommended that you use a model that has been pre-trained on instructional tasks, such as the [CodeLlama 13B Instruct](https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf) model.

Pre-trained models with more parameters will generally perform better at tasks than models with fewer parameters. However, the size of model you can use is limited by how much memory your GPU has.
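
As a rough guide to what fits on a given GPU, the memory needed just to hold the weights is the parameter count multiplied by the bytes per parameter. The sketch below works through that arithmetic, assuming nominal parameter counts and 1 GB = 1e9 bytes. It deliberately ignores activations, optimizer state, LoRA adapters and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


# nominal sizes only; actual parameter counts differ slightly per model
for label, params in [("7B", 7e9), ("8B", 8e9), ("13B", 13e9)]:
    print(
        f"{label}: {weight_memory_gb(params, 16):.1f} GB at 16-bit, "
        f"{weight_memory_gb(params, 4):.1f} GB at 4-bit"
    )
```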

Alternatively, if you already have a PyTorch model directory to hand, you can upload it to your notebook environment manually.

In [7]:
%%sh
# TODO: update with your values
# export HUGGINGFACE_TOKEN="CHANGEME"
export BASE_MODEL="unsloth/mistral-7b-instruct-v0.2-bnb-4bit"

# download model
huggingface-cli download "${BASE_MODEL}" \
    --token "${HUGGINGFACE_TOKEN}" \
    --local-dir "./base_model"

Downloading '.gitattributes' to 'base_model/.cache/huggingface/download/wPaCkH-WbT7GsmxMKKrNZTV4nSM=.a6344aac8c09253b3b630fb776ae94478aa0275b.incomplete'
Download complete. Moving file to base_model/.gitattributes
Downloading 'README.md' to 'base_model/.cache/huggingface/download/Xn7B-BWUGOee2Y6hCZtEhtFu4BE=.15be1d0739032f5907387b43c73353325a49cc4c.incomplete'
Download complete. Moving file to base_model/README.md
Downloading 'config.json' to 'base_model/.cache/huggingface/download/8_PA_wEVGiVa2goH2H4KQOQpvVY=.40e1f573d0d8a844df016c015a0dd4d38bfbba26.incomplete'
Download complete. Moving file to base_model/config.json
Downloading 'generation_config.json' to 'base_model/.cache/huggingface/download/3EVKVggOldJcKSsGjSdoUCN1AyQ=.3e22f6e907f99d57857ae62725aacfd251ca8e37.incomplete'
Download complete. Moving file to base_model/generation_config.json
Downloading 'model.safetensors' to 'base_model/.cache/huggingface/download/xGOKKLRSlIhH692hSVvI1-gpoa8=.5ac048c8614d6888b433a9ddb4ba8ae063376575

/workspace/rasa-rag-challange-2025/base_model


## 3. Load and quantize base model

The [quantization of model parameters](https://huggingface.co/docs/optimum/en/concept_guides/quantization) can significantly reduce the GPU memory required to run model fine-tuning and inference, at the cost of some model accuracy.

Here, the base model is loaded from disk and quantized into a 4-bit representation on the fly using the [BitsAndBytes](https://huggingface.co/docs/transformers/main/en/quantization/bitsandbytes) library.

In [10]:
import torch
torch.cuda.empty_cache()

In [1]:
from unsloth import FastLanguageModel
from transformers import BitsAndBytesConfig

max_seq_length = 2048
random_seed = 42


# configure quantization method for base model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
)

# load quantized model and tokenizer from disk
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./base_model",
    max_seq_length=max_seq_length,
    quantization_config=quantization_config,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.19: Fast Mistral patching. Transformers: 4.50.3.
   \\   /|    NVIDIA A100 80GB PCIe. Num GPUs = 1. Max memory: 79.254 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: Will load ./base_model as a legacy tokenizer.
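
To build some intuition for why quantization trades accuracy for memory, here is a deliberately simplified sketch of 4-bit quantization using a single absolute-maximum scale. This is not what BitsAndBytes does internally (it works block-wise and offers dedicated 4-bit data types such as NF4); the functions and the example weights are made up for illustration.

```python
def quantize_absmax(values, bits=4):
    """Quantize floats to signed integers using one shared absmax scale."""
    levels = 2 ** (bits - 1) - 1  # 7 representable magnitudes for 4-bit
    scale = max(abs(v) for v in values) / levels
    if scale == 0:
        scale = 1.0  # all-zero input: any scale works
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.07, 0.91, -0.33, 0.002]  # made-up example values
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"scale={scale:.4f}, max error={max_error:.4f}")
```

Each weight is stored as a small integer plus one shared scale, and the rounding error per weight is at most half the scale. That rounding is the accuracy cost referred to above.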


## 4. Configure base model for PEFT

[Parameter Efficient Fine-Tuning](https://huggingface.co/blog/peft) (PEFT) is a technique for adapting LLMs for specific tasks by freezing all of the base model parameters and only training a relatively small number of additional parameters. Compared to fine-tuning all parameters, PEFT can significantly reduce the amount of GPU memory required, at the cost of some accuracy in the fine-tuned model.

In the code below, the base model is configured for PEFT using the [Low-Rank Adaptation](https://arxiv.org/pdf/2106.09685) (LoRA) method. It is recommended that you read the [official documentation](https://docs.unsloth.ai/basics/lora-parameters-encyclopedia) and experiment with the arguments of the `get_peft_model` method. For example, you may get better model performance with different values for `r` and `lora_alpha`.

In [2]:
from unsloth import FastLanguageModel

# adapt model for peft
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=random_seed,
    use_rslora=False,
    loftq_config=None,
)

Unsloth 2025.3.19 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
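
To see how `r` and `target_modules` determine the number of trainable parameters, the sketch below counts the LoRA weights by hand. LoRA adds two matrices to each adapted linear layer, of shapes `r x in_features` and `out_features x r`. The layer dimensions are assumptions based on the Mistral 7B architecture used in this run (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 layers); check them against `config.json` in your own base model.

```python
def lora_parameters(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) to one linear layer."""
    return r * (in_features + out_features)


# assumed Mistral 7B dimensions; confirm against base_model/config.json
hidden, kv_dim, intermediate, num_layers = 4096, 1024, 14336, 32
r = 16  # same rank as in get_peft_model above

# (in_features, out_features) for each module listed in target_modules
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_parameters(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the total comes to 41,943,040, which matches the `Trainable parameters` figure reported in the fine-tuning log in section 7. Since the count is linear in `r`, doubling `r` doubles the number of trainable parameters.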


## 5. Load training and validation datasets

The following code loads the training and validation datasets from the `train.jsonl` and `val.jsonl` files, respectively.

As the files use the TRL instruction format, the TRL trainer used later will be able to [automatically parse](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support) the datasets and [generate the prompts from a template](https://huggingface.co/docs/transformers/en/chat_templating) configured in the tokenizer.

Prompt templates vary between models and TRL will infer the correct template from your base model. If this is not available in your base model or if you wish to change it, you can set your own [template string](https://huggingface.co/docs/transformers/en/chat_templating#advanced-adding-and-editing-chat-templates) manually. Unsloth also provides a selection of [pre-defined chat templates](https://docs.unsloth.ai/basics/chat-templates) for popular language models that you can use.

In [3]:
import datasets
from trl.extras.dataset_formatting import get_formatting_func_from_dataset
from unsloth.chat_templates import get_chat_template

train = '/workspace/rasa-rag-challange-2025/tests/e2e_finetune/output_conversational/4_train_test_split/ft_splits/train.jsonl'
eval_file = '/workspace/rasa-rag-challange-2025/tests/e2e_finetune/output_conversational/4_train_test_split/ft_splits/val.jsonl'

# Load the training and evaluation datasets from JSONL files on disk
train_dataset = datasets.load_dataset(
    "json", data_files={"train": train}, split="train"
)
eval_dataset = datasets.load_dataset(
    "json", data_files={"eval": eval_file}, split="eval"
)

# Uncomment the following line if you want to test prompt formatting on a single example from the eval dataset
# print(get_formatting_func_from_dataset(train_dataset, tokenizer)(eval_dataset[0]))

# Get a tokenizer with a chat template to format conversations according to a specified structure
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3",  # Name of the chat template; pick the one that matches your base model (options: zephyr, chatml, mistral, llama, alpaca, etc.)
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},  # Maps dataset roles and messages to expected format
)

# Define a function to format prompts for each example in the dataset
def formatting_prompts_func(examples):
    # Extract conversation messages from each example
    print(list(examples.keys()))  # debug output: the columns available in this batch
    convos = examples["messages"]
    
    # Apply the chat template to each conversation without tokenizing or adding generation prompts
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    
    # Return the formatted texts in a new dictionary key
    return {"text": texts}

# Apply the formatting function to both the training and evaluation datasets in batches
train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
eval_dataset = eval_dataset.map(formatting_prompts_func, batched=True)
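
To make the role of a chat template more concrete, the following sketch renders a conversation by hand in a ChatML-style layout. It is only an illustration: real templates are Jinja strings stored in the tokenizer, the exact special tokens differ between models, and both the function `render_chatml` and the example conversation are made up.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render role/content messages in a ChatML-style layout (illustrative only)."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # open an assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
    return text


# hypothetical example conversation
conversation = [
    {"role": "user", "content": "I want to send money"},
    {"role": "assistant", "content": "StartFlow(transfer_money)"},
]

# training: render the whole conversation, including the assistant's answer
print(render_chatml(conversation))
# inference: render only the user turn and leave the assistant turn open
print(render_chatml(conversation[:1], add_generation_prompt=True))
```

This mirrors the difference between `add_generation_prompt=False`, used above when formatting training examples, and `add_generation_prompt=True`, used when prompting the model for a new answer.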

## 6. Configure trainer

Below, the arguments for the supervised fine-tuning (SFT) trainer are configured. Their values were chosen somewhat arbitrarily but produced satisfactory results during testing.

It is recommended that you read the official documentation and experiment with the arguments passed to `SFTConfig` (see [here](https://huggingface.co/docs/trl/main/en/sft_trainer#trl.SFTConfig)) and `SFTTrainer` (see [here](https://huggingface.co/docs/trl/main/en/sft_trainer#trl.SFTTrainer)). The cell below passes `transformers.TrainingArguments`, the class that `SFTConfig` extends, so the same training arguments apply.

For example:
- If you get an OOM error when running fine-tuning, you can reduce `per_device_train_batch_size` in order to reduce the memory footprint. However, if your GPU has sufficient memory, you can try increasing it in order to reduce the total number of training steps.
- Consider setting `max_steps`, as you may not need to perform all epochs in order to achieve optimal model accuracy. Conversely, you may see better model accuracy by increasing `num_train_epochs`.
- If fine-tuning is taking too long, you can increase `eval_steps` in order to reduce how often validation is performed. 

In [4]:
import torch
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

# configure training args
args = TrainingArguments(
    ###### training
    seed = random_seed,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    warmup_steps = 5,
    #max_steps = 60,
    num_train_epochs = 5,
    learning_rate = 2e-4,
    lr_scheduler_type = "linear",
    optim = "adamw_8bit",
    weight_decay = 0.01,
    ###### datatypes
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    ###### evaluation
    eval_strategy = "steps",
    eval_steps = 50,
    per_device_eval_batch_size = 8,
    ###### outputs
    logging_steps = 30,
    output_dir = "outputs",
)

# setup trainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    max_seq_length = max_seq_length,
    args = args,
)
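
To reason about how batch size, gradient accumulation and epochs interact, the total number of optimizer steps can be estimated by hand. The sketch below is an approximation: the helper `training_schedule` is hypothetical, and the floor division is an assumption chosen because it reproduces the step count reported for the run in section 7 (1,826 examples, 1,140 total steps). The exact rounding may differ between library versions.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps, num_epochs, num_gpus=1):
    """Return (effective batch size, optimizer steps per epoch, total optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_gpus))
    # assumption: a trailing partial accumulation window is not counted as a step
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return effective_batch, steps_per_epoch, steps_per_epoch * num_epochs


# values from the configuration above and the run logged in section 7
print(training_schedule(num_examples=1826, per_device_batch_size=2, grad_accum_steps=4, num_epochs=5))
```

Increasing `per_device_train_batch_size` or `gradient_accumulation_steps` raises the effective batch size and lowers the total step count, which also changes how many evaluations a fixed `eval_steps` produces.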

## 7. Perform supervised fine-tuning

In the code below, fine-tuning is performed using the previously configured trainer.

When testing this step on an NVIDIA A100 using the configuration defined above, it took around 12 minutes to perform fine-tuning with a training dataset containing around 500 examples.

In [5]:
# run fine-tuning
finetune_metrics = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,826 | Num Epochs = 5 | Total steps = 1,140
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/3,794,014,208 (1.11% trained)


Unsloth: Will smartly offload gradients to save VRAM!


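| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50 | 0.4706 | 0.01162 |
| 100 | 0.0065 | 0.003033 |
| 150 | 0.0027 | 0.002687 |
| 200 | 0.0026 | 0.002575 |
| 250 | 0.0026 | 0.002591 |
| 300 | 0.0028 | 0.002611 |
| 350 | 0.0025 | 0.002704 |
| 400 | 0.0027 | 0.002588 |
| 450 | 0.0024 | 0.002629 |
| 500 | 0.0027 | 0.002581 |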


Unsloth: Not an error, but MistralForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient




After fine-tuning, the base model and fine-tuned adapters are [merged together and saved to disk](https://docs.unsloth.ai/basics/saving-models/saving-to-vllm) in 16-bit for future compatibility with the [vLLM](https://github.com/vllm-project/vllm) model serving library.

In [6]:
# save model to disk in 16-bit
model.save_pretrained_merged("./finetuned_model", tokenizer, save_method="merged_16bit")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 593.94 out of 944.44 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [00:00<00:00, 70.79it/s]


Unsloth: Saving tokenizer... Done.
Done.
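
Conceptually, merging folds each LoRA adapter back into the dense weight it was attached to: `W_merged = W + (lora_alpha / r) * B @ A`. The sketch below shows that arithmetic on a toy matrix in plain Python. It is an illustration only; the real merge performed by Unsloth also has to dequantize the 4-bit base weights, and the matrices and helper names here are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A, the merged dense weight."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# toy 2x3 weight with a rank-1 adapter (made-up numbers)
weight = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]  # shape (r, in_features) = (1, 3)
lora_b = [[0.5], [-1.0]]    # shape (out_features, r) = (2, 1)

merged = merge_lora(weight, lora_a, lora_b, lora_alpha=1, r=1)
print(merged)
```

With the settings used in this notebook (`lora_alpha=16`, `r=16`, `use_rslora=False`) the scaling factor is 1. After merging, the model no longer needs the adapter at inference time, which is why the merged directory can be served directly.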


## 8. Visualize fine-tuning metrics

Some of the metrics collected during fine-tuning are visualised below in order for you to diagnose any potential issues with the model.

Specifically, the training and validation losses are plotted against the training step number. Please check the plot for the following:
- Ideally, as the fine-tuning steps increase, the training and validation losses should decrease and converge. 
- If the loss curves are still decreasing and have not converged by the end of training, the model may be [underfitting](https://www.ibm.com/topics/underfitting); it may be worth performing more fine-tuning steps or epochs.
- If the validation loss suddenly starts to increase while the training loss continues to decrease or converge, you should decrease your total number of steps or epochs. This is known as [overfitting](https://www.ibm.com/topics/overfitting).

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# plot step against train and val losses
log_history = pd.DataFrame(trainer.state.log_history)
log_history


In [None]:
fig, ax = plt.subplots()
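# the trainer logs training loss under "loss" and validation loss under "eval_loss"
log_history[["step", "loss"]].dropna().plot(x="step", y="loss", ax=ax, label="training loss")
log_history[["step", "eval_loss"]].dropna().plot(x="step", y="eval_loss", ax=ax, label="validation loss")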
fig.show()
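
Alongside the plot, a quick numeric check can show where the validation loss bottomed out. The sketch below applies a simple heuristic to the losses logged during the run in section 7; the helper names are hypothetical, and "evaluations since the best one" is only a rough signal, not a formal overfitting test.

```python
# (step, training loss, validation loss) copied from the run logged in section 7
history = [
    (50, 0.4706, 0.01162),
    (100, 0.0065, 0.003033),
    (150, 0.0027, 0.002687),
    (200, 0.0026, 0.002575),
    (250, 0.0026, 0.002591),
    (300, 0.0028, 0.002611),
    (350, 0.0025, 0.002704),
    (400, 0.0027, 0.002588),
    (450, 0.0024, 0.002629),
    (500, 0.0027, 0.002581),
]


def best_validation_step(rows):
    """Return the (step, validation loss) pair with the lowest validation loss."""
    step, _, val_loss = min(rows, key=lambda row: row[2])
    return step, val_loss


def evaluations_since_best(rows):
    """Count the logged evaluations that came after the best validation loss."""
    best_step, _ = best_validation_step(rows)
    return sum(1 for step, _, _ in rows if step > best_step)


print(best_validation_step(history))
print(evaluations_since_best(history))
```

If many evaluations pass without improving on the best validation loss, as in this run, the remaining steps are probably adding little and `max_steps` or `num_train_epochs` could be reduced.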

## 9. Run ad hoc inference

You can load your fine-tuned model from disk using Unsloth and use it to run optimized inference on individual inputs of your choosing using the code below.

Note that the inputs passed to the model are in the [TRL conversational format](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support) as the Hugging Face [chat template requires them to be](https://huggingface.co/docs/transformers/main/en/chat_templating#how-do-i-use-chat-templates). During training TRL will [automatically convert the instruction format to the conversational format](https://github.com/huggingface/trl/blob/main/trl/extras/dataset_formatting.py). However, you have to do this yourself when applying chat templates manually for inference.

In [None]:
from transformers import TextStreamer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained("./finetuned_model")
FastLanguageModel.for_inference(model)  # enable inference optimizations
streamer = TextStreamer(tokenizer)  # stream model outputs as they are generated

# the content to include in the input prompt
# by default, a value from the validation dataset as example
content = eval_dataset["text"][0]

# apply prompt template and tokenize
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": content}],  # in the TRL conversational format
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

# generate model output from user input
_ = model.generate(
    input_ids=input_ids,
    streamer=streamer,  # remove streamer if you want whole output at end
    max_new_tokens=64,  # set the limit on how many tokens are generated
    do_sample=False,  # disable random sampling for deterministic outputs
)
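
If your data is in the instruction format, the manual conversion mentioned above can be sketched as follows. This assumes records with `prompt` and `completion` keys and maps them to `user` and `assistant` turns; the helper names are hypothetical and this is not TRL's own implementation.

```python
def instruction_to_conversation(record: dict) -> dict:
    """Convert {"prompt", "completion"} into {"messages": [...]} with user/assistant roles."""
    return {
        "messages": [
            {"role": "user", "content": record["prompt"]},
            {"role": "assistant", "content": record["completion"]},
        ]
    }


def inference_messages(record: dict) -> list:
    """Keep only the user turn, so the model has to generate the assistant reply."""
    return instruction_to_conversation(record)["messages"][:1]


# hypothetical record, for illustration only
example = {"prompt": "I want to send money", "completion": "StartFlow(transfer_money)"}
print(instruction_to_conversation(example))
print(inference_messages(example))
```

The list returned by `inference_messages` has the same shape as the first argument passed to `tokenizer.apply_chat_template` in the cell above.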

## 10. Export fine-tuned model

Lastly, export your fine-tuned model directory to an appropriate storage location that can be easily accessed later for [deployment](https://rasa.com/rasa-pro/docs/building-assistants/self-hosted-llm).

It is recommended that you use a cloud object store, such as [Amazon S3](https://aws.amazon.com/s3/) or [Google Cloud Storage](https://cloud.google.com/storage).

Uncomment and run the corresponding commands below for your cloud provider, making sure to first update the environment variables with your own values. It is assumed that:
- your bucket already exists
- you have already installed the CLI tool for your cloud provider
- you have already authenticated with your cloud provider and have sufficient permissions to write to your bucket

In [None]:
%%sh
export LOCAL_MODEL_PATH="./finetuned_model"

# if using amazon
# export S3_MODEL_URI="s3://CHANGEME" # update with your value
# aws s3 cp "${LOCAL_MODEL_PATH}" "${S3_MODEL_URI}" --recursive

# if using google
# export GCS_MODEL_URI="gs://CHANGEME" # update with your value
# gsutil cp -r "${LOCAL_MODEL_PATH}" "${GCS_MODEL_URI}"