## LoRA/QLoRA Fine-Tuning with Kubeflow Trainer and Training Hub on OpenShift AI

This notebook demonstrates how to use Training Hub's LoRA (Low-Rank Adaptation) and QLoRA capabilities for parameter-efficient fine-tuning. We'll train a model to convert natural language questions into SQL queries using the popular [sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset.

## What is LoRA?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that:
- Freezes the pre-trained model weights
- Injects trainable low-rank matrices into each layer
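- Reduces trainable parameters by orders of magnitude compared to full fine-tuning (a reduction of up to ~10,000x was reported for GPT-3 175B; smaller models see smaller ratios)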
- Enables fine-tuning large models on consumer GPUs

**QLoRA** extends LoRA by quantizing the frozen base model weights to 4 bits, further reducing memory requirements while largely preserving quality.
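To make the parameter savings concrete, here is a small sketch of the arithmetic for a single weight matrix. The layer dimensions are illustrative assumptions, not those of any particular model; the overall reduction depends on the rank and on which layers receive adapters.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates all d_out * d_in entries of W.
    LoRA freezes W and trains B (d_out x rank) and A (rank x d_in),
    so the effective weight is W + (alpha / rank) * B @ A.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Hypothetical 4096 x 4096 projection with rank 16
full, lora = lora_param_counts(4096, 4096, 16)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full / lora:.0f}x")
```

For this one matrix the adapter has 128x fewer trainable parameters. The much larger figures quoted for LoRA arise when only a few matrices of a very large model are adapted.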

## Training Task: Natural Language to SQL

We'll train the model to understand database schemas and generate SQL queries from natural language questions. For example:

**Input:**
```
Table: employees (id, name, department, salary)
Question: What is the average salary in the engineering department?
```

**Output:**
```sql
SELECT AVG(salary) FROM employees WHERE department = 'engineering'
```
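A quick way to sanity-check a generated query is to run it against a throwaway database. The sketch below uses Python's built-in `sqlite3` module with the example schema; the rows are made-up test data, and the column types are an assumption since the example lists only column names.

```python
import sqlite3

# In-memory database with the example schema (rows are made-up test data)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Ada", "engineering", 120000.0),
        (2, "Grace", "engineering", 100000.0),
        (3, "Linus", "sales", 80000.0),
    ],
)

# The expected model output from the example above
query = "SELECT AVG(salary) FROM employees WHERE department = 'engineering'"
avg_salary = conn.execute(query).fetchone()[0]
print(avg_salary)
conn.close()
```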

## Setup

First, let's install the required dependencies.

In [None]:
# Install additional workbench dependencies
!pip install --upgrade pip --quiet
!pip install datasets --quiet
!pip install unsloth --quiet
# Install Kubeflow Trainer v2 from the Red Hat index (CUDA build; use one of the alternatives below for other hardware)
!pip install kubeflow --no-cache-dir --index-url https://console.redhat.com/api/pypi/public-rhai/rhoai/3.3/cuda12.9-ubi9/simple/ --quiet
# CPU-only alternative:
# !pip install kubeflow --no-cache-dir --index-url https://console.redhat.com/api/pypi/public-rhai/rhoai/3.3/cpu-ubi9/simple/
# ROCm alternative:
# !pip install kubeflow --no-cache-dir --index-url https://console.redhat.com/api/pypi/public-rhai/rhoai/3.3/rocm6.4-ubi9/simple/

# Standard library imports
import json
import logging
import sys
from pathlib import Path

# Third-party imports
from datasets import load_dataset
from kubernetes import client as k8s
from unsloth import FastLanguageModel

In [None]:
# Configure logging to show only essential information
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)

# Suppress verbose logging from transformers and other libraries
logging.getLogger("transformers").setLevel(logging.WARNING)
logging.getLogger("datasets").setLevel(logging.WARNING)
logging.getLogger("torch").setLevel(logging.WARNING)

print("✅ Logging configured for notebook environment")

## Authenticate to your OpenShift Cluster

In [None]:
api_server = "<REPLACE WITH OPENSHIFT SERVER>"
token = "<REPLACE WITH OPENSHIFT TOKEN>"
PVC_NAME = "shared"  # Replace if the shared RWX storage name is different than in the example provided
PVC_PATH = "shared"  # Replace if the shared RWX storage path is different than in the example provided
configuration = k8s.Configuration()
configuration.host = api_server
# Un-comment if your cluster API server uses a self-signed certificate or an un-trusted CA
# configuration.verify_ssl = False
configuration.api_key = {"authorization": f"Bearer {token}"}
api_client = k8s.ApiClient(configuration)
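Pasting a token into a notebook cell makes it easy to leak when the notebook is shared. As one possible alternative, the sketch below reads both values from environment variables. The variable names `OPENSHIFT_API_SERVER` and `OPENSHIFT_TOKEN` and the helper itself are hypothetical, chosen only for illustration.

```python
import os
from typing import Mapping


def read_cluster_credentials(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Read the API server URL and token from environment variables.

    OPENSHIFT_API_SERVER and OPENSHIFT_TOKEN are hypothetical names chosen
    for this sketch; use whatever your environment provides.
    """
    try:
        return env["OPENSHIFT_API_SERVER"], env["OPENSHIFT_TOKEN"]
    except KeyError as missing:
        raise RuntimeError(f"Environment variable {missing} is not set") from None
```

With this in place, `api_server, token = read_cluster_credentials()` could replace the two placeholder assignments.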

## 1. Load and Explore the Dataset

We'll use the sql-create-context dataset from Hugging Face. This dataset contains:

- Natural language questions
- Database schema context (CREATE TABLE statements)
- Corresponding SQL queries

In [None]:
# Load the dataset
dataset = load_dataset("b-mc2/sql-create-context", split="train")

In [None]:
# Convert each raw example into the chat "messages" format.
def convert_to_messages(example):
    """
    Convert a sql-create-context example to chat template format.

    The user provides the database schema and question.
    The assistant responds with the SQL query.
    """
    user_message = f"""Given the following database schema:

{example["context"]}

Write a SQL query to answer this question: {example["question"]}"""

    assistant_message = example["answer"]

    return {
        "messages": [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": assistant_message},
        ]
    }


# Show an example of the converted format
sample_converted = convert_to_messages(dataset[0])
print("Converted format:")
print(json.dumps(sample_converted, indent=2))

## 2. Prepare Training Data

Training Hub expects data in the chat template format, where each record has a `messages` field containing the conversation. We'll convert each example into a user message (schema context + question) and an assistant message (SQL query), then write one JSON record per line.

In [None]:
# Training Dataset Preparation.
print(f"Dataset size: {len(dataset)} examples")
print(f"\nDataset columns: {dataset.column_names}")
print("\n" + "=" * 60)
print("Sample entry:")
print("=" * 60)

# Show a sample
sample = dataset[0]
print(f"\nQuestion: {sample['question']}")
print(f"\nContext (Schema):\n{sample['context']}")
print(f"\nAnswer (SQL): {sample['answer']}")

print(f"Original training set size: {len(dataset)}")

# For a quick demonstration, we'll use a subset of the data
# You can increase this for better results (full dataset is ~78k examples)
TRAIN_SIZE = 100  # Adjust based on your time/compute budget

# Shuffle and select a subset
train_dataset = dataset.shuffle(seed=42).select(range(min(TRAIN_SIZE, len(dataset))))

# Convert to messages format
train_data = [convert_to_messages(example) for example in train_dataset]

print(f"Training examples: {len(train_data)}")

# Save training data to JSONL format
OUTPUT_DIR = Path(f"{PVC_PATH}/lora_text_sql_output")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

training_file = OUTPUT_DIR / "train_data.jsonl"

with open(training_file, "w") as f:
    for example in train_data:
        f.write(json.dumps(example) + "\n")

print(f"Training data saved to: {training_file}")
print(f"File size: {training_file.stat().st_size / 1024:.1f} KB")

data_path = f"{PVC_PATH}/lora_text_sql_output/train_data.jsonl"
print(data_path)
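Before launching a job on the cluster, it can be worth reading the JSONL file back and checking its structure, since a malformed record is cheaper to catch here than inside a training pod. The validator below is a minimal sketch; the helper name is hypothetical, and it assumes exactly one user turn followed by one assistant turn, as produced by `convert_to_messages`.

```python
import json
from pathlib import Path


def validate_messages_jsonl(path: Path) -> int:
    """Check every line is a JSON object with a user/assistant `messages` list.

    Returns the number of valid records; raises ValueError on the first bad line.
    """
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                raise ValueError(f"line {line_no}: missing or empty 'messages'")
            roles = [m.get("role") for m in messages]
            if roles != ["user", "assistant"]:
                raise ValueError(f"line {line_no}: unexpected roles {roles}")
            if not all(isinstance(m.get("content"), str) and m["content"] for m in messages):
                raise ValueError(f"line {line_no}: empty or non-string content")
            count += 1
    return count
```

Calling `validate_messages_jsonl(training_file)` should return the same number as `len(train_data)`.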

## 3. Configure and Run LoRA Training

Now we'll use Training Hub's `lora_sft` function to train the model. Key parameters:

### LoRA Parameters

- `lora_r`: Rank of the low-rank matrices (higher = more capacity, more memory)
- `lora_alpha`: Scaling factor (typically 2x the rank)
- `target_modules`: Which layers to apply LoRA to

### QLoRA Parameters (Optional)

- `load_in_4bit`: Enable 4-bit quantization to reduce memory
- `bnb_4bit_quant_type`: Quantization type ('nf4' recommended)

In [None]:
# Training configuration
MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"

# You can also try these alternatives:
# MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"    # Larger, more capable
# MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # Smaller, faster training
# MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"  # Alternative architecture

# LoRA configuration
LORA_R = 16  # Rank - start small, increase if needed
LORA_ALPHA = 32  # Alpha - typically 2x rank
LORA_DROPOUT = 0.0  # Dropout - 0.0 is optimized for Unsloth

# Training configuration
NUM_EPOCHS = 2  # More epochs = better learning, longer training
LEARNING_RATE = 2e-4  # Standard LoRA learning rate
MAX_SEQ_LEN = 1024  # Maximum sequence length
MICRO_BATCH_SIZE = 16  # Batch size per GPU (reduce if OOM)
GRADIENT_ACCUMULATION = 4  # Per-GPU effective batch = micro_batch * grad_accum

# QLoRA settings (set to True to enable 4-bit quantization)
USE_QLORA = True  # Set to True if you have limited GPU memory

# Multi-GPU Configuration:
enable_model_splitting = False

print("Training Configuration:")
print(f"  Model: {MODEL_NAME}")
print(f"  LoRA Rank: {LORA_R}")
print(f"  LoRA Alpha: {LORA_ALPHA}")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Effective Batch Size: {MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION}")
print(f"  QLoRA (4-bit): {USE_QLORA}")
print(f"  ENABLE MODEL SPLITTING: {enable_model_splitting}")

params = {
    # Model and data path
    "model_path": MODEL_NAME,
    "data_path": f"/mnt/{data_path}",
    "ckpt_output_dir": f"/mnt/{PVC_PATH}/checkpoints-logs-dir",
    "data_output_path": f"/mnt/{PVC_PATH}/lora-json/_data",
    # Learning-rate schedule and seed (important for LoRA)
    "lr_scheduler": "cosine",
    "warmup_steps": 0,
    "seed": 42,
    # LoRA configuration
    "lora_r": LORA_R,
    "lora_alpha": LORA_ALPHA,
    "lora_dropout": LORA_DROPOUT,
    # Training configuration
    "num_epochs": NUM_EPOCHS,
    "learning_rate": LEARNING_RATE,
    "micro_batch_size": MICRO_BATCH_SIZE,
    "max_seq_len": MAX_SEQ_LEN,
    "gradient_accumulation_steps": GRADIENT_ACCUMULATION,
    # Dataset format
    "dataset_type": "chat_template",
    "field_messages": "messages",
    # Quantization
    "load_in_4bit": USE_QLORA,
    # GPU configuration
    "nproc_per_node": 2,
    "nnodes": 2,
    # Logging
    "logging_steps": 10,
    "save_steps": 200,
    "save_total_limit": 3,
    # Multi-GPU Configuration
    "enable_model_splitting": enable_model_splitting,
    # Model Checkpointing
    "save_final_checkpoint": True,
    "checkpoint_at_epoch": 2,
}
params
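The batch settings interact with the GPU count, which affects how many optimizer steps (and therefore checkpoints) a run produces. The helper below is a rough sketch that assumes plain data parallelism, where every GPU processes its own micro-batches; the helper name is hypothetical, and the step count reported by the actual training backend may differ depending on how it shards data and handles partial batches.

```python
import math


def optimizer_steps(
    num_examples: int,
    micro_batch_size: int,
    grad_accum_steps: int,
    world_size: int,
    num_epochs: int,
) -> tuple[int, int]:
    """Return (global_batch_size, total_optimizer_steps) under data parallelism.

    Assumes each of the `world_size` GPUs processes its own micro-batches and
    that a partial final batch still counts as one step.
    """
    global_batch = micro_batch_size * grad_accum_steps * world_size
    steps_per_epoch = math.ceil(num_examples / global_batch)
    return global_batch, steps_per_epoch * num_epochs


# Values from the configuration above: 2 nodes x 2 GPUs per node
print(optimizer_steps(100, 16, 4, world_size=4, num_epochs=2))
# Single-GPU comparison
print(optimizer_steps(100, 16, 4, world_size=1, num_epochs=2))
```

Under this assumption, a 100-example subset yields very few steps, so values such as `logging_steps` and `save_steps` only become meaningful once `TRAIN_SIZE` is increased.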

## Training with LoRA SFT and Kubeflow Trainer

We launch a training job via Kubeflow Trainer with the configured hyperparameters.

In [None]:
from kubeflow.common.types import KubernetesBackendConfig
from kubeflow.trainer import TrainerClient
from kubeflow.trainer.rhai import TrainingHubAlgorithms, TrainingHubTrainer

backend_cfg = KubernetesBackendConfig(client_configuration=api_client.configuration)
client = TrainerClient(backend_cfg)

In [None]:
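# Look up the Training Hub runtime; fail early if the cluster does not provide it
th_runtime = next((r for r in client.list_runtimes() if r.name == "training-hub"), None)
if th_runtime is None:
    raise RuntimeError("Runtime 'training-hub' not found on this cluster")
print("Found runtime: " + str(th_runtime))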

In [None]:
from kubeflow.trainer.options.kubernetes import (
    ContainerOverride,
    PodSpecOverride,
    PodTemplateOverride,
    PodTemplateOverrides,
)

cache_root = f"/mnt/{PVC_PATH}/.cache/huggingface"

job_name = client.train(
    trainer=TrainingHubTrainer(
        algorithm=TrainingHubAlgorithms.LORA_SFT,
        func_args=params,
        resources_per_node={
            "cpu": 3,
            "memory": "16Gi",
            "nvidia.com/gpu": 2,
        },
    ),
    options=[
        PodTemplateOverrides(
            PodTemplateOverride(
                target_jobs=["node"],
                spec=PodSpecOverride(
                    volumes=[
                        {
                            "name": "work",
                            "persistentVolumeClaim": {"claimName": PVC_NAME},
                        },
                    ],
                    containers=[
                        ContainerOverride(
                            name="node",  # Target the existing container
                            volume_mounts=[
                                {
                                    "name": "work",
                                    "mountPath": f"/mnt/{PVC_PATH}",
                                    "readOnly": False,
                                },
                            ],
                        )
                    ],
                ),
            )
        )
    ],
    runtime=th_runtime,
)

print(job_name)

In [None]:
# Follow job logs
logs = client.get_job_logs(job_name, follow=True)
for line in logs:
    print(line)

## Load the Model from a Checkpoint

In [None]:
# Adjust the step number to match a checkpoint that your run actually produced
CHECKPOINTS_PATH = f"./{PVC_PATH}/checkpoints-logs-dir/checkpoint-16"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=CHECKPOINTS_PATH,
    max_seq_length=MAX_SEQ_LEN,
    dtype=None,  # Auto-detect
    load_in_4bit=USE_QLORA,
)
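The checkpoint path above is hard-coded, and the step number depends on the dataset size and batch settings. A small helper can pick the most recent checkpoint instead. This is a sketch that assumes checkpoints are saved as `checkpoint-<step>` directories, as the path above suggests; the helper name is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Return the `checkpoint-<step>` subdirectory with the highest step, if any."""
    best_step, best_path = -1, None
    for child in ckpt_dir.iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

For example, `latest_checkpoint(Path("./shared/checkpoints-logs-dir"))` returns the newest checkpoint directory, or `None` if the run has not saved one yet.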

In [None]:
def generate_sql(question: str, schema: str, max_tokens: int = 256) -> str:
    """
    Generate a SQL query from a natural language question.

    Args:
        question: Natural language question
        schema: Database schema (CREATE TABLE statements)
        max_tokens: Maximum tokens to generate

    Returns:
        Generated SQL query
    """
    messages = [
        {
            "role": "user",
            "content": f"""Given the following database schema:

{schema}

Write a SQL query to answer this question: {question}""",
        }
    ]

    # Apply chat template
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Tokenize
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

    # Decode response (only the new tokens)
    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1] :], skip_special_tokens=True
    )

    return response.strip()

## Test the Trained Model

With the fine-tuned checkpoint loaded, let's test it on a few SQL generation examples.

In [None]:
# Hand-written test cases in the same schema + question format as the dataset
test_examples = [
    {
        "schema": "CREATE TABLE employees (id INT, name VARCHAR, department VARCHAR, salary DECIMAL, hire_date DATE)",
        "question": "What is the average salary of employees in the engineering department?",
    },
    {
        "schema": "CREATE TABLE orders (order_id INT, customer_id INT, product_name VARCHAR, quantity INT, order_date DATE)",
        "question": "How many orders were placed in the last 30 days?",
    },
    {
        "schema": "CREATE TABLE students (student_id INT, name VARCHAR, grade INT, subject VARCHAR, score DECIMAL)",
        "question": "Find the top 5 students with the highest average score across all subjects.",
    },
]

print("Testing the trained model:")
print("=" * 60)

for i, example in enumerate(test_examples, 1):
    print(f"\nExample {i}:")
    print(f"Schema: {example['schema']}")
    print(f"Question: {example['question']}")

    sql = generate_sql(example["question"], example["schema"])
    print(f"Generated SQL: {sql}")
    print("-" * 60)
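Reading generated queries by eye does not scale beyond a handful of examples. One simple, admittedly strict, way to score outputs is exact match after light normalization, sketched below. The helper names are hypothetical, and the normalization is deliberately crude: it lowercases the whole query, including string literals, and treats semantically equivalent but differently written queries as mismatches.

```python
import re


def normalize_sql(sql: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing semicolon.

    Note: lowercasing also affects string literals, a simplification
    that is acceptable for a rough score but not for strict comparison.
    """
    sql = sql.strip().rstrip(";").strip()
    return re.sub(r"\s+", " ", sql).lower()


def exact_match(predicted: str, reference: str) -> bool:
    """True when two queries are identical after normalization."""
    return normalize_sql(predicted) == normalize_sql(reference)
```

To use it, compare `generate_sql(example["question"], example["context"])` against the dataset's `answer` field for examples that were not part of the training subset.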

## Final Analysis and Summary
In this notebook, we demonstrated how LoRA/QLoRA can be used to fine-tune the Qwen 2.5 1.5B Instruct model with Kubeflow Trainer and Training Hub. We fine-tuned the model to generate SQL queries from a database schema and a natural language question.