<a href="https://colab.research.google.com/github/ygn81pg1/ai_rf_test_generator/blob/main/New_Test_Fine_Tune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Fine-Tuning with Unsloth: Full, Step-by-Step Tutorial

This notebook walks you through a complete **parameter-efficient fine-tuning (PEFT)** workflow using **[Unsloth](https://github.com/unslothai/unsloth)**. It's written as a **teaching notebook**: every section explains *what* the code does, *why* it's done that way, and *how* to adapt it.

## What you'll learn
- Installing the correct dependencies for GPU training (local vs Colab).
- Loading a base LLM and enabling **4-bit** inference/training for lower VRAM.
- Configuring **LoRA adapters** for efficient fine-tuning.
- Preparing custom data (e.g., parsing Robot Framework test cases) into instruction-tuning format.
- Launching training, monitoring GPU memory/steps, and saving checkpoints.
- (Optional) Logging in to Hugging Face and pushing models.
- (Optional) Converting/Saving for deployment (Transformers, GGUF, llama.cpp).

## Prerequisites
- **GPU** with roughly 12–16 GB of VRAM recommended for 7B–13B models (less with QLoRA).
- Python 3.10+ environment with CUDA-compatible PyTorch.
- A Hugging Face account (optional, only if you want to push models).
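
To see why 4-bit loading lowers the VRAM requirement, it helps to estimate the memory needed just to hold the weights. The sketch below is a back-of-the-envelope calculation only: the function name is illustrative, and it ignores activations, optimizer state, the KV cache, and quantization metadata, all of which add to real usage.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Compare 16-bit and 4-bit storage for the model sizes mentioned above.
for params in (7e9, 13e9):
    fp16 = weight_memory_gib(params, 16)
    int4 = weight_memory_gib(params, 4)
    print(f"{params / 1e9:.0f}B params: ~{fp16:.1f} GiB in 16-bit, ~{int4:.1f} GiB in 4-bit")
```

Treat the result as a lower bound: training needs headroom on top of the weights.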

## Quickstart
1. Run **Step 1** (Environment setup) to install the libraries.
2. Run the **Model + LoRA config** cell to load and prepare the model.
3. Upload or point to your dataset and run **Data prep**.
4. Start **Training** and monitor logs.
5. **Save/Export** your model for inference.

## Step 1: Environment Setup & Installation

This step installs the libraries needed for **QLoRA fine-tuning** via Unsloth:

- `unsloth`: high-performance wrappers/utilities for efficient training.
- `bitsandbytes`: 4-bit/8-bit quantization (QLoRA).
- `accelerate`, `xformers`, `trl`, `peft`: training speed-ups and trainer utilities.
- `datasets`, `huggingface_hub`: dataset/model I/O.
- `sentencepiece`, `protobuf`: tokenizer and protocol-buffer support for some models.

The cell checks the environment variables for **Colab**. On Colab it installs pinned packages with `--no-deps` to avoid dependency conflicts; anywhere else it runs a plain `pip install unsloth`. If you run locally, ensure your CUDA/PyTorch versions match your GPU drivers.
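
The detection can be isolated into a small helper to make it easier to test. This is a sketch, not part of Unsloth: the function name and the injectable `environ` argument are assumptions, and it checks each variable name separately rather than joining all names into one string as the install cell does.

```python
import os


def running_in_colab(environ=None) -> bool:
    """Return True if any environment variable name contains 'COLAB_'."""
    env = os.environ if environ is None else environ
    return any("COLAB_" in name for name in env)


# Passing a dict makes the check easy to try outside Colab.
print(running_in_colab({"COLAB_GPU": "1", "PATH": "/usr/bin"}))  # True
print(running_in_colab({"PATH": "/usr/bin"}))                    # False
```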

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth

## Step 2: Load Base Model & Configure LoRA

- **Base model**: pick a Llama-compatible model (e.g., `meta-llama/Meta-Llama-3-8B-Instruct`; the code below loads `unsloth/llama-3-8b-instruct`).
- **Quantization**: enable 4-bit loading (QLoRA) to fit in smaller VRAM.
- **LoRA**: choose the rank (`r`), `lora_alpha`, and target modules (e.g., `q_proj`, `v_proj`; the code below adapts all attention and MLP projections).
- **Max sequence length**: set based on your data (e.g., 2048 tokens).

**Why LoRA?** It trains a small set of adapter weights instead of all parameters, which is **faster and cheaper** and often sufficient for domain adaptation.
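
To make "a small set of adapter weights" concrete, the sketch below counts the LoRA parameters for the configuration used in this notebook (`r = 64` on all seven projection modules). The layer shapes are assumptions taken from the public Llama 3 8B configuration (hidden size 4096, MLP size 14336, 1024-wide key/value projections, 32 layers); the helper name is illustrative.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds two matrices per adapted weight: A (r x in) and B (out x r)."""
    return r * (in_features + out_features)


# Assumed Llama 3 8B shapes; adjust these for a different base model.
hidden, intermediate, kv_dim, num_layers = 4096, 14336, 1024, 32
r = 64

# (in_features, out_features) for each adapted projection in one layer.
module_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_params(i, o, r) for i, o in module_shapes.values())
total = per_layer * num_layers
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the total is 167,772,160, which agrees with the `Trainable parameters` figure printed in the training log later in this notebook.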

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model_name = "unsloth/llama-3-8b-instruct"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.6: Fast Llama patching. Transformers: 4.55.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [None]:
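# Llama 3 tokenizers may not define an unk_token, so only set a pad token when none exists
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token or tokenizer.eos_token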

In [None]:
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 128,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.8.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
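
The `lora_alpha` and `r` values together set how strongly the adapter update is scaled before it is added to the frozen weights. The sketch below shows the usual scaling rule and the rank-stabilized variant selected by `use_rslora`; the function name is illustrative and the formulas are stated as commonly implemented, not quoted from Unsloth's source.

```python
import math


def lora_scale(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scaling applied to the low-rank update: alpha / r, or alpha / sqrt(r) for rsLoRA."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


print(lora_scale(128, 64))                   # 2.0 with the settings above
print(lora_scale(128, 64, use_rslora=True))  # 16.0 if rsLoRA were enabled
```

Because the two rules give very different scales at the same `lora_alpha`, switching `use_rslora` on usually calls for revisiting `lora_alpha` as well.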


## Step 3: Prepare Your Dataset (Robot Framework → Instruction Format)

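If your source files are **Robot Framework** tests, parse them into *instruction-tuning* examples. Conceptually, each example has the fields listed here; the code in this step flattens them into a single `text` string per JSONL line: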
- `instruction`: the prompt/question/task
- `input` (optional): extra context
- `output`: the desired answer/completion
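
One way to turn a record with these keys into a single training string is sketched here. The chat markers mirror the template used by `to_finetune_format` in this step's code; the function name and the way the optional `input` is appended to the instruction are assumptions for illustration.

```python
import json


def record_to_text(record: dict) -> str:
    """Render an instruction/input/output record as one Llama 3 style training string."""
    prompt = record["instruction"]
    if record.get("input"):
        # Assumption: extra context is appended after a blank line.
        prompt = f"{prompt}\n\n{record['input']}"
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
        f"{prompt}\n"
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
        f"{record['output']}\n"
        "<|eot_id|>"
    )


example = {
    "instruction": "Check on general voltage reading",
    "input": "",
    "output": "*** Test Cases ***\nVoltage Check\n    Log To Console    Hi",
}

# One JSON object per line is the JSONL layout that load_dataset("json", ...) reads.
line = json.dumps({"text": record_to_text(example)})
print(line)
```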

Typical pipeline:
1. **Upload** your `.robot`/text files.  
2. **Extract** the relevant `*** Test Cases ***` section.  
3. **Transform** into instruction-tuning examples (e.g., "Generate test steps for ...").  
4. **Split** into train/validation sets.

Keep sequences within `max_seq_length`. If examples are long, consider truncation or summarization.
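
The split step of the pipeline can be done with the standard library before writing the JSONL file. This is a minimal sketch: the function name, the 10% validation fraction, and the reuse of seed 3407 are assumptions, not settings taken from the notebook's training run.

```python
import random


def train_val_split(examples, val_fraction=0.1, seed=3407):
    """Shuffle deterministically, then hold out a fraction for validation."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    # Keep at least one validation example when there is more than one item.
    n_val = max(1, round(len(items) * val_fraction)) if len(items) > 1 else 0
    return items[n_val:], items[:n_val]


train, val = train_val_split([{"text": f"example {i}"} for i in range(20)])
print(len(train), len(val))
```

Using a fixed seed keeps the split reproducible across notebook restarts, so evaluation numbers stay comparable between runs.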

In [None]:
from google.colab import files
import json

# Upload .robot files
uploaded = files.upload()

# --- Helpers ---
def extract_test_case_section(content):
    lines = content.splitlines()
    test_case_lines = []
    inside_test_section = False
    for line in lines:
        if line.strip().startswith("*** Test Cases ***"):
            inside_test_section = True
            continue
        if inside_test_section:
            if line.strip().startswith("***"):
                break
            test_case_lines.append(line.rstrip())
    return test_case_lines

def extract_structured_test_cases(lines):
    test_cases = []
    current_case = None
    for line in lines:
        if not line.strip():
            continue
        # A test case name starts at column 0; indented lines (spaces or tabs) belong to the body
        if not line[0].isspace() and not line.startswith("#"):
            if current_case:
                test_cases.append(current_case)
            current_case = {"name": line.strip(), "body": [], "doc": ""}
        elif current_case:
            current_case["body"].append(line.rstrip())
            if "[Documentation]" in line:
                doc_text = line.split("[Documentation]")[-1].strip()
                if doc_text.startswith("..."):
                    doc_text = doc_text[3:].strip()
                current_case["doc"] = doc_text
    if current_case:
        test_cases.append(current_case)
    return test_cases

def to_finetune_format(test_cases):
    dataset = []
    for case in test_cases:
        # Build instruction more intelligently
        instruction = case["doc"].strip() if case["doc"] else f"Create a Robot Framework test case for: {case['name']}"

        # Build output: full test case block
        output_lines = ["*** Test Cases ***", case["name"]] + case["body"]
        output = "\n".join(output_lines)

        dataset.append({
            "text": f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n{instruction}\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n{output}\n<|eot_id|>"
        })
    return dataset

# --- Main Processing ---
all_test_cases = []

for filename in uploaded:
    content = uploaded[filename].decode("utf-8")
    test_case_lines = extract_test_case_section(content)
    structured = extract_structured_test_cases(test_case_lines)
    all_test_cases.extend(structured)

# Convert to Unsloth-compatible format
finetune_data = to_finetune_format(all_test_cases)

# Save as JSONL for training
output_file = "robot_framework_finetune_dataset.jsonl"
with open(output_file, "w", encoding="utf-8") as f:
    for item in finetune_data:
        f.write(json.dumps(item) + "\n")

# Download
files.download(output_file)

Saving Full_Test.robot to Full_Test (1).robot
Saving Full_Test2(safety).robot to Full_Test2(safety).robot



In [None]:
from datasets import load_dataset

# Load dataset from the JSONL file
dataset = load_dataset("json", data_files="robot_framework_finetune_dataset.jsonl", split="train")

# Optional: show a few examples
dataset[0]


Generating train split: 0 examples [00:00, ? examples/s]

{'text': '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\nCheck on general voltage reading\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n*** Test Cases ***\nEnvD_SysQual_PowerManagementSubsystem_General_VoltageMonitoringWithInterbladeSignal\n    [Documentation]    Check on general voltage reading\n    [Tags]    POWER    FID-1275341,2655699    TCID-195840    FULLTEST\n    Log To Console    Hi\n    prj.sut.disconnect.dlt\n    bits_platform.Output Set Analog Signal Value    Battery    0\n    prj.sut.connect.dlt\n    prj.sut.can.initandloadconfiguration\n    prj.sut.can.simulation.Start\n    prj.sut.can.simulation.PowerBladeON\n    prj.sut.power.add.voltage.to.dictionary.check    16.4\n<|eot_id|>'}

## Step 4: Training Configuration

Key knobs to tune:
- `per_device_train_batch_size`, `gradient_accumulation_steps` → effective batch size.  
- `learning_rate` (`2e-5` to `5e-5` is a conservative starting range; LoRA runs are often trained with higher rates such as `1e-4` to `2e-4`), `weight_decay`, `lr_scheduler_type` (e.g., cosine).  
- `warmup_ratio` or `warmup_steps` to stabilize early training.  
- `max_steps` **or** `num_train_epochs`.  
- `logging_steps` for progress feedback.  
- `bf16`/`fp16` based on your GPU (Ampere+ supports bf16).

Save checkpoints regularly if training for a long time. Consider evaluation every N steps if you have a validation set.
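
The relationship between these knobs is simple arithmetic. The sketch below plugs in the values from this notebook's run (batch size 1, 16 accumulation steps, one GPU, 688 examples, 65 steps); the variable names are illustrative.

```python
import math

# Values taken from the trainer configuration and training log in this notebook.
per_device_batch, grad_accum, num_gpus = 1, 16, 1
num_examples, max_steps = 688, 65

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
sequences_seen = max_steps * effective_batch
epochs_started = math.ceil(max_steps / steps_per_epoch)

print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps per epoch: {steps_per_epoch}")
print(f"Sequences seen in {max_steps} steps: {sequences_seen}")
print(f"Passes over the data: {sequences_seen / num_examples:.2f} (epochs started: {epochs_started})")
```

With these numbers the run covers roughly one and a half passes over the data, which is why the training log reports `Num Epochs = 2`.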

In [None]:
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc=2,
    packing=True,  # Enable this to speed up training unless you're using long sequences
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 16,
        warmup_ratio = 0.1,
        max_steps=65,
        #num_train_epochs = 1, # Set this for 1 full training run.
        learning_rate = 5e-5,
        logging_steps = 1,
        optim = "adamw_torch_fused",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Disables logging integrations; set to e.g. "wandb" to enable one
    ),
)
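
To visualize what `warmup_ratio = 0.1` and `lr_scheduler_type = "cosine"` do to the learning rate over 65 steps, the sketch below computes the usual linear-warmup plus cosine-decay shape. It is an approximation written for illustration: the function name is hypothetical, and the trainer's exact per-step values may differ slightly.

```python
import math


def lr_at_step(step, base_lr=5e-5, total_steps=65, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the rate at a few points: start, end of warmup, midpoint, final step.
for step in (0, 3, 7, 36, 65):
    print(f"step {step:2d}: lr = {lr_at_step(step):.2e}")
```

The peak rate is reached once warmup ends, so very short runs spend a noticeable share of their steps below the configured `learning_rate`.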

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
9.41 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 688 | Num Epochs = 2 | Total steps = 65
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 16
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 16 x 1) = 16
 "-____-"     Trainable parameters = 167,772,160 of 8,198,033,408 (2.05% trained)


| Step | Training Loss |
|---|---|
| 1 | 0.8012 |
| 2 | 0.6011 |
| 3 | 0.7349 |
| 4 | 0.7802 |
| 5 | 0.5247 |
| 6 | 0.7079 |
| 7 | 0.5754 |
| 8 | 0.4565 |
| 9 | 0.684 |
| 10 | 0.4872 |


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1478.203 seconds used for training.
24.64 minutes used for training.
Peak reserved memory = 10.941 GB.
Peak reserved memory for training = 1.531 GB.
Peak reserved memory % of max memory = 74.222 %.
Peak reserved memory for training % of max memory = 10.386 %.


## Step 5: Evaluate & Inference (Optional)

After training, run a few prompts to verify behavior:
```python
prompt = "Write Robot Framework steps to validate login failure on wrong password."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Create a small **eval set** and compute metrics (e.g., BLEU, ROUGE) or do **human eval** for quality.
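
As a lightweight stand-in for BLEU or ROUGE, a token-overlap F1 score can be computed with the standard library. This is a simplified sketch, not an implementation of either metric: it splits on whitespace and ignores word order, and the function name is an assumption.

```python
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    """F1 over whitespace tokens shared between a reference and a generated text."""
    ref_tokens, cand_tokens = reference.split(), candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Log To Console Hi", "Log To Console Hello"))  # 0.75
```

Averaging this score over a held-out set gives a rough trend line between checkpoints; it does not replace reading the generated test cases.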

In [None]:
# Test the fine-tuned model
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Inference code to test the model's response
messages = [
    {"role": "user", "content": "Check on general voltage reading"}
]

# Tokenize with the chat template; return_dict=True also returns the attention mask
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# Generate response from the model
outputs = model.generate(
    **inputs,  # input_ids and attention_mask
    max_new_tokens=256,  # Max tokens in response
    use_cache=True,
    temperature=0.7,  # Controls randomness
    do_sample=True,   # Enable sampling
    top_p=0.9         # Top-p sampling for diversity
)

# Decode and print the generated response
response = tokenizer.batch_decode(outputs)[0]
print(response)


<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Check on general voltage reading<|eot_id|><|start_header_id|>assistant<|end_header_id|>

*** Test Cases ***
EnvD_SysQual_Diagnostics_VoltageMonitoring_GeneralVoltage
    [Documentation]    Check on general voltage reading
    [Tags]    CustomerDiagnostic    TCID-272544    FULLTEST	FID-1824449
    prj.sut.can.diag.send    ${sut.diag.Extended_Session_Req}
    prj.sut.diag.compare    ${sut.diag.Extended_Session_Resp}
    prj.sut.can.diag.send    ${sut.diag.GeneralVoltage_Req}
    prj.sut.diag.compare    ${sut.diag.GeneralVoltage_Passive_Resp}
    # Active Blade
    prj.sut.can.diag.send    ${sut.diag.ActiveBladeGeneralVoltage_Req}
    prj.sut.diag.compare    ${sut.diag.ActiveBladeGeneralVoltage_Passive_Resp}
    # Passive Blade
    prj.sut.can.diag.send    ${sut.diag.PassiveBladeGeneralVoltage_Req}
    prj.sut.diag.compare    ${sut.diag.PassiveBladeGeneralVoltage_Passive_Resp}
<|eot_id|>
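
A quick structural check can catch generations that are not shaped like a Robot Framework test case before a human reviews them. The sketch below is a heuristic under stated assumptions: the function name is hypothetical, it expects the assistant text with chat markers already removed, and it only checks layout (section header, unindented name, indented steps), not whether the keywords exist.

```python
def looks_like_robot_test_case(text: str) -> bool:
    """Heuristic layout check for a single generated Robot Framework test case."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3 or lines[0].strip() != "*** Test Cases ***":
        return False
    name, body = lines[1], lines[2:]
    if name[0].isspace():
        return False  # the test case name must start at column 0
    # Every step, setting, or comment in the body must be indented.
    return all(ln[0].isspace() for ln in body)


sample = "*** Test Cases ***\nVoltage Check\n    [Documentation]    Check voltage\n    Log To Console    Hi"
print(looks_like_robot_test_case(sample))                # True
print(looks_like_robot_test_case("Log To Console  Hi"))  # False
```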


## Push your model
- Push your model to the **Hugging Face Hub** and share a model card.  
- Convert to **GGUF** for CPU-friendly inference via llama.cpp.  
- Add **evaluation harnesses** (e.g., lm-evaluation-harness) for quantitative benchmarks.
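
Before converting to GGUF it is useful to know roughly how large the file will be. The sketch below estimates the size of a `q8_0` export, assuming the format stores weights in blocks of 32 int8 values plus one 16-bit scale (34 bytes per block). The parameter count is derived from this notebook's training log by subtracting the LoRA parameters from the reported total; real files differ somewhat because some tensors stay in higher precision and the file carries metadata.

```python
def q8_0_size_bytes(num_params: int) -> int:
    """Estimate q8_0 storage: 32 int8 weights plus one fp16 scale per block."""
    block_weights, block_bytes = 32, 34
    return num_params // block_weights * block_bytes


# Total reported by the training log minus the LoRA adapter parameters.
base_params = 8_198_033_408 - 167_772_160
size = q8_0_size_bytes(base_params)
print(f"~{size / 1e9:.2f} GB ({size / 1024**3:.2f} GiB)")
```

An estimate of this size is worth comparing with the free disk space and download limits of the Colab session before starting the conversion.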

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: fineGrained).
The token `llm-test-case` has been saved to /root/.cache/huggingface/stored_tokens
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to r

In [None]:
# Push the model and tokenizer to the Hub
model.push_to_hub("ygn81pg1/llm-test-case")
tokenizer.push_to_hub("ygn81pg1/llm-test-case")

print("Model and tokenizer have been uploaded successfully!")


Saved model to https://huggingface.co/ygn81pg1/llm-test-case



Model and tokenizer have been uploaded successfully!


In [None]:
%pip install mistral_common


Collecting mistral_common
  Downloading mistral_common-1.8.3-py3-none-any.whl.metadata (3.8 kB)
Collecting pydantic-extra-types>=2.10.5 (from pydantic-extra-types[pycountry]>=2.10.5->mistral_common)
  Downloading pydantic_extra_types-2.10.5-py3-none-any.whl.metadata (3.9 kB)
Collecting pycountry>=23 (from pydantic-extra-types[pycountry]>=2.10.5->mistral_common)
  Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)
Downloading mistral_common-1.8.3-py3-none-any.whl (6.5 MB)
Downloading pydantic_extra_types-2.10.5-py3-none-any.whl (38 kB)
Downloading pycountry-24.6.1-py3-none-any.whl (6.3 MB)

In [None]:
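# Merge the LoRA weights and export to GGUF; q8_0 matches the quantization shown in the log
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q8_0")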


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.17 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [03:30<00:00,  6.57s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model-00001-of-00004.bin...
Unsloth: Saving model/pytorch_model-00002-of-00004.bin...
Unsloth: Saving model/pytorch_model-00003-of-00004.bin...
Unsloth: Saving model/pytorch_model-00004-of-00004.bin...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at model into q8_0 GGUF format.
The output location will be /content/model/unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-g

In [None]:
from google.colab import files

# Define the path to your file
file_path = '/content/model/unsloth.Q8_0.gguf'

# Use the 'files' module to download the file
files.download(file_path)

