# LoRA/QLoRA Fine-Tuning with Training Hub

This notebook demonstrates how to use Training Hub's LoRA (Low-Rank Adaptation) and QLoRA capabilities for parameter-efficient fine-tuning. We'll train a model to convert natural language questions into SQL queries using the popular [sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset.

## What is LoRA?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that:
- Freezes the pre-trained model weights
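- Injects trainable low-rank matrices into selected layers (typically the attention and MLP projections)
- Reduces the number of trainable parameters by orders of magnitude compared to full fine-tuning (up to ~10,000x was reported for GPT-3 175B in the original LoRA paper)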
- Enables fine-tuning large models on consumer GPUs

**QLoRA** extends LoRA by quantizing the frozen base model weights to 4 bits while the LoRA adapters are trained in higher precision, further reducing memory requirements while largely preserving quality.
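
To make the parameter savings concrete, the sketch below counts trainable parameters for a single weight matrix under full fine-tuning versus a rank-`r` LoRA update. The 4096 x 4096 layer size and the helper name `lora_param_counts` are illustrative assumptions, not values taken from a specific model.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in        # full fine-tuning: every entry of W is trainable
    lora = r * (d_in + d_out)  # LoRA: A is (r x d_in) and B is (d_out x r)
    return full, lora


# Illustrative sizes only: a square 4096 x 4096 projection adapted with rank 16.
full, lora = lora_param_counts(d_out=4096, d_in=4096, r=16)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full / lora:.0f}x")
```

The reduction for this single matrix is 128x. The ratio for a whole model depends on its size, the chosen rank, and which layers receive adapters.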

## Training Task: Natural Language to SQL

We'll train the model to understand database schemas and generate SQL queries from natural language questions. For example:

**Input:**
```
Table: employees (id, name, department, salary)
Question: What is the average salary in the engineering department?
```

**Output:**
```sql
SELECT AVG(salary) FROM employees WHERE department = 'engineering'
```
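
As a quick sanity check of the example above, the following sketch runs the target query against an in-memory SQLite database using Python's standard `sqlite3` module. The rows and column types are made up for illustration and are not part of the dataset.

```python
import sqlite3

# Build a throwaway in-memory database that matches the example schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, name TEXT, department TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Ada", "engineering", 100000.0),
        (2, "Grace", "engineering", 120000.0),
        (3, "Linus", "marketing", 80000.0),
    ],
)

# The query from the example output above.
query = "SELECT AVG(salary) FROM employees WHERE department = 'engineering'"
(avg_salary,) = conn.execute(query).fetchone()
print(f"Average engineering salary: {avg_salary}")
conn.close()
```

Executing generated queries against a small test database like this is one way to check that a query is valid and returns the expected result, rather than only comparing it to a reference string.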

## Hardware Requirements

This notebook is designed to run on a single GPU:
- **Minimum**: 16GB VRAM (with QLoRA 4-bit quantization)
- **Recommended**: 24GB VRAM (for faster training with larger batch sizes)
- Works on: A10, A100, L4, L40S, RTX 3090/4090, and similar GPUs

## Setup

First, let's install the required dependencies. Training Hub uses Unsloth for optimized LoRA training.

In [None]:
# Install training-hub with LoRA dependencies
# Note: Unsloth requires specific versions of dependencies
!pip install 'training-hub[lora]'

# Import required libraries
import json
from pathlib import Path

import torch
from datasets import load_dataset

In [None]:
# Check GPU availability
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)
    print(f"GPU: {gpu_name}")
    print(f"Memory: {gpu_memory:.1f} GB")
else:
    print("WARNING: No GPU detected. Training will be very slow!")

## 1. Load and Explore the Dataset

We'll use the [sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset from Hugging Face. This dataset contains:
- Natural language questions
- Database schema context (CREATE TABLE statements)
- Corresponding SQL queries

In [None]:
# Load the dataset
dataset = load_dataset("b-mc2/sql-create-context", split="train")
print(f"Dataset size: {len(dataset)} examples")
print(f"\nDataset columns: {dataset.column_names}")
print("\n" + "=" * 60)
print("Sample entry:")
print("=" * 60)

# Show a sample
sample = dataset[0]
print(f"\nQuestion: {sample['question']}")
print(f"\nContext (Schema):\n{sample['context']}")
print(f"\nAnswer (SQL): {sample['answer']}")

## 2. Prepare Training Data

Training Hub expects data in the **chat template format** with a `messages` field containing the conversation. We'll convert each example into a user message (question + context) and an assistant message (SQL query).
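
Before writing records to disk, it can help to check that each one has the expected shape. The sketch below is a hypothetical validator (`validate_record` is not part of Training Hub), and the exact rules it enforces, such as ending on an assistant turn, are assumptions about what a supervised chat example should look like rather than documented requirements.

```python
def validate_record(record: dict) -> None:
    """Raise ValueError if `record` does not look like a chat training example."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("record needs a non-empty 'messages' list")

    for message in messages:
        if not isinstance(message, dict) or not {"role", "content"} <= set(message):
            raise ValueError("each message needs 'role' and 'content' keys")
        if message["role"] not in {"system", "user", "assistant"}:
            raise ValueError(f"unexpected role: {message['role']!r}")
        if not isinstance(message["content"], str) or not message["content"].strip():
            raise ValueError("message content must be a non-empty string")

    # Assumption: the final turn is the assistant answer the model should learn.
    if messages[-1]["role"] != "assistant":
        raise ValueError("last message should come from the assistant")


example_record = {
    "messages": [
        {"role": "user", "content": "Write a SQL query that counts all rows in t."},
        {"role": "assistant", "content": "SELECT COUNT(*) FROM t"},
    ]
}
validate_record(example_record)  # passes silently for a well-formed record
print("record is valid")
```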

In [None]:
def convert_to_messages(example):
    """
    Convert a sql-create-context example to chat template format.

    The user provides the database schema and question.
    The assistant responds with the SQL query.
    """
    user_message = f"""Given the following database schema:

{example["context"]}

Write a SQL query to answer this question: {example["question"]}"""

    assistant_message = example["answer"]

    return {
        "messages": [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": assistant_message},
        ]
    }


# Show an example of the converted format
sample_converted = convert_to_messages(dataset[0])
print("Converted format:")
print(json.dumps(sample_converted, indent=2))

In [None]:
# For a quick demonstration, we'll use a subset of the data
# You can increase this for better results (full dataset is ~78k examples)
TRAIN_SIZE = 100  # Adjust based on your time/compute budget

# Shuffle and select a subset
train_dataset = dataset.shuffle(seed=42).select(range(min(TRAIN_SIZE, len(dataset))))

# Convert to messages format
train_data = [convert_to_messages(example) for example in train_dataset]

print(f"Training examples: {len(train_data)}")

In [None]:
# Save training data to JSONL format
OUTPUT_DIR = Path("./lora_sql_output")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

training_file = OUTPUT_DIR / "train_data.jsonl"

with open(training_file, "w") as f:
    for example in train_data:
        f.write(json.dumps(example) + "\n")

print(f"Training data saved to: {training_file}")
print(f"File size: {training_file.stat().st_size / 1024:.1f} KB")

## 3. Configure and Run LoRA Training

Now we'll use Training Hub's `lora_sft` function to train the model. Key parameters:

### LoRA Parameters
- **lora_r**: Rank of the low-rank matrices (higher = more capacity, more memory)
- **lora_alpha**: Scaling factor (typically 2x the rank)
- **target_modules**: Which layers to apply LoRA to
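
To show how the rank and scaling factor interact, here is a minimal pure-Python sketch of a LoRA-adapted linear layer in its standard formulation, `y = W x + (lora_alpha / lora_r) * B (A x)`. The toy matrices and the helper names `matvec` and `lora_forward` are illustrative assumptions; real implementations operate on tensors and may use different scaling variants.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_r, lora_alpha):
    """Compute W x + (lora_alpha / lora_r) * B (A x) with W kept frozen."""
    scale = lora_alpha / lora_r
    base = matvec(W, x)               # output of the frozen pre-trained weight
    update = matvec(B, matvec(A, x))  # low-rank path: project down, then up
    return [b + scale * u for b, u in zip(base, update)]


# Toy 2 x 2 layer adapted with rank 1 (values chosen for easy arithmetic).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]             # shape (r, d_in) = (1, 2)
B_init = [[0.0], [0.0]]      # shape (d_out, r) = (2, 1), zero at initialization
B_trained = [[0.5], [0.0]]   # pretend values after some training
x = [3.0, 4.0]

y_init = lora_forward(W, A, B_init, x, lora_r=1, lora_alpha=2)
y_trained = lora_forward(W, A, B_trained, x, lora_r=1, lora_alpha=2)
print(y_init, y_trained)
```

Because `B` starts at zero, the adapted layer initially reproduces the base model's output exactly, and training only moves it away from that starting point.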

### QLoRA Parameters (Optional)
- **load_in_4bit**: Enable 4-bit quantization to reduce memory
- **bnb_4bit_quant_type**: Quantization type ('nf4' recommended)

In [None]:
# Training configuration
# Feel free to adjust these based on your GPU and requirements

# Model selection - using Qwen2.5 1.5B as a good balance of capability and speed
MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"

# You can also try these alternatives:
# MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"    # Larger, more capable
# MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # Smaller, faster training
# MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"  # Alternative architecture

# LoRA configuration
LORA_R = 16  # Rank - start small, increase if needed
LORA_ALPHA = 32  # Alpha - typically 2x rank
LORA_DROPOUT = 0.0  # Dropout - 0.0 is optimized for Unsloth

# Training configuration
NUM_EPOCHS = 1  # More epochs = more passes over the data and longer training; too many can overfit
LEARNING_RATE = 2e-4  # Standard LoRA learning rate
MAX_SEQ_LEN = 1024  # Maximum sequence length
MICRO_BATCH_SIZE = 8  # Batch size per GPU (reduce if OOM)
GRADIENT_ACCUMULATION = 4  # Effective batch = micro_batch * grad_accum

# QLoRA settings (set to True to enable 4-bit quantization)
USE_QLORA = False  # Set to True if you have limited GPU memory

print("Training Configuration:")
print(f"  Model: {MODEL_NAME}")
print(f"  LoRA Rank: {LORA_R}")
print(f"  LoRA Alpha: {LORA_ALPHA}")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Effective Batch Size: {MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION}")
print(f"  QLoRA (4-bit): {USE_QLORA}")
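
It is worth estimating how many optimizer steps this configuration produces, since step-based settings such as logging and checkpoint intervals are measured against that number. The sketch below assumes the last partial batch is kept and rounds up; some trainers round down or drop incomplete batches, so treat the result as approximate. The helper name `optimizer_steps_per_epoch` is hypothetical.

```python
import math


def optimizer_steps_per_epoch(num_examples, micro_batch_size, grad_accum_steps):
    """Estimate optimizer updates per epoch on a single GPU."""
    # Number of forward/backward passes needed to see every example once.
    micro_batches = math.ceil(num_examples / micro_batch_size)
    # One optimizer update happens after every `grad_accum_steps` micro-batches.
    return math.ceil(micro_batches / grad_accum_steps)


# Values from this notebook: 100 examples, micro-batch 8, accumulation 4.
steps = optimizer_steps_per_epoch(
    num_examples=100, micro_batch_size=8, grad_accum_steps=4
)
print(f"Approximate optimizer steps per epoch: {steps}")
```

With only a handful of steps in the quick demo, an interval such as `logging_steps=10` may produce little or no intermediate output, so consider lowering it or increasing `TRAIN_SIZE`.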

In [None]:
from training_hub import lora_sft

# Run LoRA training with Training Hub
checkpoint_dir = OUTPUT_DIR / "checkpoints"

result = lora_sft(
    # Required parameters
    model_path=MODEL_NAME,
    data_path=str(training_file),
    ckpt_output_dir=str(checkpoint_dir),
    # LoRA configuration
    lora_r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    # QLoRA (4-bit quantization) - reduces memory significantly
    load_in_4bit=USE_QLORA,
    # Training parameters
    num_epochs=NUM_EPOCHS,
    learning_rate=LEARNING_RATE,
    max_seq_len=MAX_SEQ_LEN,
    micro_batch_size=MICRO_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION,
    # Logging
    logging_steps=10,
    save_steps=500,
    # Dataset format
    dataset_type="chat_template",
    field_messages="messages",
)

print("\nTraining completed!")
print(f"Model saved to: {checkpoint_dir}")

## 4. Test the Trained Model

Let's load the trained model and test it on some SQL generation examples.

In [None]:
from unsloth import FastLanguageModel

# The model and tokenizer are returned in the result
model = result["model"]
tokenizer = result["tokenizer"]

# Enable inference mode
FastLanguageModel.for_inference(model)

print("Model ready for inference!")

In [None]:
def generate_sql(question: str, schema: str, max_tokens: int = 256) -> str:
    """
    Generate a SQL query from a natural language question.

    Args:
        question: Natural language question
        schema: Database schema (CREATE TABLE statements)
        max_tokens: Maximum tokens to generate

    Returns:
        Generated SQL query
    """
    messages = [
        {
            "role": "user",
            "content": f"""Given the following database schema:

{schema}

Write a SQL query to answer this question: {question}""",
        }
    ]

    # Apply chat template
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Tokenize
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

    # Decode response (only the new tokens)
    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1] :], skip_special_tokens=True
    )

    return response.strip()
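
Instruction-tuned models sometimes wrap their answer in a Markdown code fence or add a short explanation around it. The helper below is a hypothetical post-processing step (`extract_sql` is not part of Training Hub or Unsloth) that pulls the query out of a fenced block when one is present and otherwise returns the stripped response.

```python
import re

FENCE = "`" * 3  # three backticks, built programmatically for readability

# Match a fenced block with an optional "sql" language tag and capture its body.
FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:sql)?\s*(.*?)" + re.escape(FENCE),
    re.DOTALL | re.IGNORECASE,
)


def extract_sql(response: str) -> str:
    """Return the SQL inside the first fenced block, or the whole response."""
    match = FENCED_BLOCK.search(response)
    sql = match.group(1) if match else response
    return sql.strip()


fenced = f"Here is the query:\n{FENCE}sql\nSELECT COUNT(*) FROM orders\n{FENCE}"
print(extract_sql(fenced))
print(extract_sql("  SELECT 1  "))
```

One way to use it is to pass the return value of `generate_sql` through `extract_sql` before displaying or executing the query.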

In [None]:
# Test with hand-written examples (these are not drawn from the training dataset)
test_examples = [
    {
        "schema": "CREATE TABLE employees (id INT, name VARCHAR, department VARCHAR, salary DECIMAL, hire_date DATE)",
        "question": "What is the average salary of employees in the engineering department?",
    },
    {
        "schema": "CREATE TABLE orders (order_id INT, customer_id INT, product_name VARCHAR, quantity INT, order_date DATE)",
        "question": "How many orders were placed in the last 30 days?",
    },
    {
        "schema": "CREATE TABLE students (student_id INT, name VARCHAR, grade INT, subject VARCHAR, score DECIMAL)",
        "question": "Find the top 5 students with the highest average score across all subjects.",
    },
]

print("Testing the trained model:")
print("=" * 60)

for i, example in enumerate(test_examples, 1):
    print(f"\nExample {i}:")
    print(f"Schema: {example['schema']}")
    print(f"Question: {example['question']}")

    sql = generate_sql(example["question"], example["schema"])
    print(f"Generated SQL: {sql}")
    print("-" * 60)
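
Reading generated queries by eye works for a few examples, but a simple automatic comparison helps when testing on more. The sketch below is a hypothetical normalized exact-match check that ignores whitespace, letter case, and a trailing semicolon. It is deliberately crude: lowercasing also changes string literals, and semantically equivalent queries written differently will not match.

```python
def normalize_sql(sql: str) -> str:
    """Collapse whitespace, drop a trailing semicolon, and lowercase."""
    collapsed = " ".join(sql.split())
    return collapsed.rstrip(";").rstrip().lower()


def sql_matches(generated: str, reference: str) -> bool:
    """Return True if the two queries are identical after normalization."""
    return normalize_sql(generated) == normalize_sql(reference)


generated = "SELECT AVG(salary)\n  FROM employees;"
reference = "select avg(salary) from employees"
print(sql_matches(generated, reference))
```

To score the model, one option is to hold out examples that were not used for training and compare `generate_sql` output against the dataset's `answer` field with `sql_matches`.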

## 5. Save and Load the Model (Optional)

Training already wrote the model to the checkpoint directory, so no separate save step is needed. To use the model later, reload it from that directory.

In [None]:
# List the saved files
print("Saved model files:")
for file in sorted(checkpoint_dir.glob("*")):
    if file.is_file():
        size_mb = file.stat().st_size / (1024 * 1024)
        print(f"  {file.name}: {size_mb:.2f} MB")

In [None]:
# To reload the model later:
# from unsloth import FastLanguageModel
#
# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name=str(checkpoint_dir),
#     max_seq_length=1024,
#     load_in_4bit=False,
# )
# FastLanguageModel.for_inference(model)

## Summary

In this notebook, we:

1. **Loaded** the sql-create-context dataset from Hugging Face
2. **Prepared** the data in chat template format for Training Hub
3. **Trained** a LoRA adapter using Training Hub's `lora_sft` function
4. **Tested** the model on SQL generation examples

### Key Takeaways

- **LoRA** enables efficient fine-tuning by training only a small fraction of the model's parameters
- **QLoRA** (4-bit quantization) can further reduce memory requirements
- Training Hub wraps the Unsloth setup, LoRA configuration, and training loop behind a single `lora_sft` call
- The chat template format makes it easy to prepare instruction-following data

### Next Steps

- Try different models (larger = more capable, smaller = faster)
- Increase training data size for better results
- Experiment with LoRA rank (higher rank = more capacity)
- Enable QLoRA for training on GPUs with limited memory
- Add W&B logging for experiment tracking (`wandb_project="my-project"`)