# Fine-tune Qwen3 (14B) for Reasoning & Conversation

This notebook demonstrates how to fine-tune **Qwen3-14B** to **combine reasoning and conversational capabilities** using **Unsloth** for optimized training.

## 🌟 What is Qwen3?

**Qwen3** (Qwen 3.0) is Alibaba's latest open-source LLM family:
- **Sizes**: Dense models from 0.6B to 32B parameters, plus mixture-of-experts variants
- **Dual mode**: Reasoning (`<think>` tags) + conversational
- **High performance**: Competitive with GPT-4 level models
- **Multilingual**: Strong Chinese and English support
- **Efficient**: Optimized architecture

## 🎯 What You'll Learn

- How to combine **reasoning** and **conversational** datasets
- How to use **Unsloth** for 2x faster training
- How to balance reasoning vs chat capabilities
- How to enable/disable thinking mode at inference
- How to save models in multiple formats (LoRA, merged, GGUF)

## 💡 Why Mix Reasoning + Conversation?

**Pure reasoning models:**
- ✅ Great at complex problems
- ❌ Verbose for simple questions
- ❌ Higher inference cost

**Pure chat models:**
- ✅ Fast, concise responses
- ❌ Struggle with complex reasoning

**Combined approach (this notebook):**
- ✅ Reason when needed
- ✅ Chat normally otherwise
- ✅ User controls mode
- ✅ Best of both worlds!

## 🔧 Requirements

- **GPU**: 16GB+ VRAM (T4, V100, A100)
- **Time**: 30-60 minutes for quick training
- **Model**: Qwen3-14B (4-bit quantized)
- **Library**: Unsloth (2x faster than standard)

## 📊 Key Stats

| Metric | Value |
|--------|-------|
| Base Model | Qwen3-14B (4-bit) |
| LoRA Rank | 32 |
| Reasoning Data | ~10K examples (75%) |
| Chat Data | ~3K examples (25%) |
| Training Steps | 30 (demo) / 1000+ (production) |
| Training Time | ~30 min (demo) / 2-3 hours (full) |
| GPU Memory | ~12-14GB |

## 📖 Table of Contents

1. [Installation and Setup](#1-installation-and-setup)
2. [Load Model with Unsloth](#2-load-model-with-unsloth)
3. [Prepare Mixed Dataset](#3-prepare-mixed-dataset)
4. [Train the Model](#4-train-the-model)
5. [Inference: Thinking vs Non-Thinking](#5-inference-thinking-vs-non-thinking)
6. [Save in Multiple Formats](#6-save-in-multiple-formats)

---

**Credits**: Based on Unsloth's official Qwen3 reasoning notebook

## 1. Installation and Setup

### Install Unsloth

**Unsloth** provides:
- ✅ **2x faster** training than standard methods
- ✅ **70% less memory** usage
- ✅ **No accuracy loss**
- ✅ Support for latest models (Qwen3, Gemma 3, Llama, etc.)

### Installation Options

**For Colab/Kaggle:**
- Automatic detection and installation

**For local:**
```bash
pip install unsloth
```

**For this repository:**
- Already included in Poetry dependencies!
```bash
make install-reasoning
```

In [None]:
%%capture
# Install Unsloth and dependencies (%%capture must be the first line of the cell)
import os, re

# Check if in Colab
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Colab-specific installation
    import torch
    v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth

# Install specific versions for compatibility
!pip install transformers==4.55.4
!pip install --no-deps trl==0.22.2

print("✅ All packages installed!")

## 2. Load Model with Unsloth

### Available Qwen3 Models

Unsloth provides optimized 4-bit versions of all Qwen3 models:

- `unsloth/Qwen3-1.7B-unsloth-bnb-4bit` (smallest, fastest)
- `unsloth/Qwen3-4B-unsloth-bnb-4bit`
- `unsloth/Qwen3-8B-unsloth-bnb-4bit`
- `unsloth/Qwen3-14B-unsloth-bnb-4bit` ← **We'll use this**
- `unsloth/Qwen3-32B-unsloth-bnb-4bit` (largest, best quality)

### Model Configuration

- **max_seq_length**: 2048 tokens (can be longer but uses more memory)
- **load_in_4bit**: Enable 4-bit quantization
- **full_finetuning**: False (we'll use LoRA)

In [None]:
# Import Unsloth
from unsloth import FastLanguageModel
import torch

# List of available Qwen3 models
fourbit_models = [
    "unsloth/Qwen3-1.7B-unsloth-bnb-4bit",
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    "unsloth/Qwen3-14B-unsloth-bnb-4bit",  # We'll use this
    "unsloth/Qwen3-32B-unsloth-bnb-4bit",
]

print("Available Qwen3 models:")
for model_name in fourbit_models:
    print(f"  - {model_name}")

print("\nWe'll use: Qwen3-14B for best quality on consumer hardware")

In [None]:
# Load Qwen3-14B with Unsloth
print("Loading Qwen3-14B...")
print("This may take a few minutes...")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-14B",  # With load_in_4bit=True, Unsloth should resolve this to its 4-bit checkpoint
    max_seq_length = 2048,      # Context length
    load_in_4bit = True,        # 4-bit uses much less memory
    load_in_8bit = False,       # 8-bit is more accurate but uses 2x memory
    full_finetuning = False,    # Use LoRA (parameter-efficient)
    # token = "hf_...",         # Use if accessing gated models
)

print("\n✅ Model and tokenizer loaded successfully!")
print(f"Model: Qwen3-14B (4-bit quantized)")
print(f"Max sequence length: 2048 tokens")

### Add LoRA Adapters

We'll add **LoRA adapters** to make training efficient.

**Configuration:**
- **r=32**: LoRA rank (balance of quality and speed)
- **lora_alpha=32**: Scaling factor (typically equal to rank)
- **lora_dropout=0**: No dropout (optimized for Unsloth)
- **Target modules**: All attention and MLP layers

**Unsloth optimization:**
- `use_gradient_checkpointing="unsloth"`: 30% less VRAM, 2x larger batch sizes!
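
To get a feel for how small the trainable part is, the sketch below counts the parameters LoRA adds at rank 32 across the seven target modules. The layer shapes are assumptions (hidden size 5120, 40 layers, 40 query heads and 8 key/value heads of dimension 128, MLP width 17408); check them against `model.config` before relying on the numbers.

```python
# Rough LoRA parameter count for the configuration above.
# ASSUMPTION: the shapes below are illustrative values for Qwen3-14B;
# verify them against model.config.
HIDDEN = 5120
N_LAYERS = 40
Q_OUT = 40 * 128   # query heads * head dimension
KV_OUT = 8 * 128   # key/value heads * head dimension
MLP = 17408

# (in_features, out_features) of every targeted projection in one layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per targeted matrix."""
    per_layer = sum(rank * (i + o) for i, o in TARGET_SHAPES.values())
    return per_layer * N_LAYERS

trainable = lora_param_count(rank=32)
print(f"Trainable LoRA parameters: {trainable:,}")
print(f"Share of a nominal 14B model: {trainable / 14e9:.2%}")
print(f"LoRA scaling (alpha / r): {32 / 32}")
```

Doubling the rank doubles the adapter size. With `use_rslora=False` the update is scaled by `lora_alpha / r`, so keeping alpha equal to the rank gives a scale of 1.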

In [None]:
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,  # LoRA rank - higher = more capacity
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",      # MLP
    ],
    lora_alpha = 32,  # Scaling factor
    lora_dropout = 0,  # 0 is optimized for Unsloth
    bias = "none",  # "none" is optimized
    use_gradient_checkpointing = "unsloth",  # [NEW] 30% less VRAM!
    random_state = 3407,
    use_rslora = False,  # Rank-stabilized LoRA
    loftq_config = None,  # LoftQ
)

print("✅ LoRA adapters added!")
print(f"\nConfiguration:")
print(f"  Rank: 32")
print(f"  Alpha: 32")
print(f"  Dropout: 0 (optimized)")
print(f"  Gradient checkpointing: unsloth (optimized)")

## 3. Prepare Mixed Dataset

### Dual Dataset Approach

We'll combine TWO types of data:

1. **Reasoning Dataset** (75%)
   - OpenMathReasoning dataset
   - Mathematical problems with detailed reasoning
   - Uses `<think>` tags for step-by-step solving
   - Teaches the model HOW to reason

2. **Conversational Dataset** (25%)
   - FineTome-100k dataset
   - General conversations
   - Normal chat without reasoning overhead
   - Teaches the model WHEN to reason

### Why Mix Both?

- ✅ **Balanced model**: Reason when needed, chat normally otherwise
- ✅ **Efficiency**: Don't waste compute on simple questions
- ✅ **Flexibility**: User controls thinking mode
- ✅ **Practical**: Real-world usage pattern

### Dataset Ratio

We'll use **75% reasoning, 25% chat**:
- Maintains reasoning capabilities
- Adds conversational fluency
- Adjustable based on your needs

In [None]:
# Load both datasets
from datasets import load_dataset

print("Loading datasets...")
print("This may take a moment...")

# 1. Reasoning dataset (math problems with CoT)
reasoning_dataset = load_dataset(
    "unsloth/OpenMathReasoning-mini",
    split="cot"  # Chain-of-thought split
)

# 2. Conversational dataset (general chat)
non_reasoning_dataset = load_dataset(
    "mlabonne/FineTome-100k",
    split="train"
)

print("\n✅ Both datasets loaded successfully!")
print(f"\nReasoning dataset: {len(reasoning_dataset):,} examples")
print(f"Conversational dataset: {len(non_reasoning_dataset):,} examples")

In [None]:
# Explore reasoning dataset structure
print("=== Reasoning Dataset ===")
print(reasoning_dataset)
print(f"\nColumns: {reasoning_dataset.column_names}")
print(f"\nSample:")
print(f"  Problem: {reasoning_dataset[0]['problem'][:100]}...")
print(f"  Solution length: {len(reasoning_dataset[0]['generated_solution'])} chars")

In [None]:
# Explore conversational dataset structure
print("=== Conversational Dataset ===")
print(non_reasoning_dataset)
print(f"\nColumns: {non_reasoning_dataset.column_names}")
print(f"\nSample:")
if 'conversations' in non_reasoning_dataset.column_names:
    print(f"  Conversations: {non_reasoning_dataset[0]['conversations'][:2]}")

### Convert Reasoning Dataset to Conversational Format

We need to convert the problem-solution pairs into a conversational format:

```python
[
  {"role": "user", "content": "Solve (x + 2)^2 = 0."},
  {"role": "assistant", "content": "<think>...</think> x = -2"}
]
```

In [None]:
# Function to convert reasoning data to conversations
def generate_conversation(examples):
    """
    Convert problem-solution pairs to conversational format.
    
    Args:
        examples: Batch of examples with 'problem' and 'generated_solution'
        
    Returns:
        Dictionary with conversations
    """
    problems = examples["problem"]
    solutions = examples["generated_solution"]
    conversations = []
    
    for problem, solution in zip(problems, solutions):
        conversations.append([
            {"role": "user", "content": problem},
            {"role": "assistant", "content": solution},
        ])
    
    return {"conversations": conversations}

print("✅ Conversion function defined!")

In [None]:
# Convert reasoning dataset to conversations
print("Converting reasoning dataset to conversation format...")

reasoning_conversations = reasoning_dataset.map(
    generate_conversation,
    batched=True
)["conversations"]

print(f"\n✅ Converted {len(reasoning_conversations):,} reasoning conversations!")
print(f"\nSample conversation:")
print(reasoning_conversations[0])

In [None]:
# Apply Qwen3 chat template to reasoning data
reasoning_conversations_formatted = tokenizer.apply_chat_template(
    reasoning_conversations,
    tokenize=False,
)

print("✅ Chat template applied to reasoning data!")
print(f"\nFormatted examples: {len(reasoning_conversations_formatted):,}")
print(f"\nFirst example (first 500 chars):")
print(reasoning_conversations_formatted[0][:500])

### Process Conversational Dataset

The conversational dataset is in ShareGPT format, which we need to standardize first.
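
ShareGPT-style records usually store each turn as `{"from": ..., "value": ...}`, while chat templates expect `{"role": ..., "content": ...}`. The sketch below illustrates that mapping by hand. `ROLE_MAP` and `to_role_content` are hypothetical names for illustration only; this is not Unsloth's implementation.

```python
# Illustrative only: a hand-rolled ShareGPT -> role/content conversion.
# ROLE_MAP and to_role_content are hypothetical helpers, not part of Unsloth.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def to_role_content(conversation):
    """Convert one ShareGPT-style conversation to role/content messages."""
    messages = []
    for turn in conversation:
        role = ROLE_MAP.get(turn["from"], turn["from"])  # unknown roles pass through
        messages.append({"role": role, "content": turn["value"]})
    return messages

sharegpt_example = [
    {"from": "human", "value": "What is the capital of France?"},
    {"from": "gpt", "value": "The capital of France is Paris."},
]
print(to_role_content(sharegpt_example))
```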

In [None]:
# Standardize ShareGPT format
from unsloth.chat_templates import standardize_sharegpt

print("Standardizing conversational dataset...")
dataset_standardized = standardize_sharegpt(non_reasoning_dataset)

print("\n✅ Dataset standardized!")
print(f"Conversations: {len(dataset_standardized):,}")

In [None]:
# Apply chat template to conversational data
non_reasoning_conversations = tokenizer.apply_chat_template(
    dataset_standardized["conversations"],
    tokenize=False,
)

print("✅ Chat template applied to conversational data!")
print(f"\nFormatted examples: {len(non_reasoning_conversations):,}")
print(f"\nFirst example (first 300 chars):")
print(non_reasoning_conversations[0][:300])

In [None]:
# Check dataset sizes
print("=== Dataset Sizes ===")
print(f"Reasoning conversations: {len(reasoning_conversations_formatted):,}")
print(f"Chat conversations: {len(non_reasoning_conversations):,}")
print(f"\nRatio: {len(non_reasoning_conversations) / len(reasoning_conversations_formatted):.2f}:1 (chat:reasoning)")

### Mix Datasets with Custom Ratio

We'll create a balanced dataset with:
- **75% reasoning** (for problem-solving skills)
- **25% conversational** (for chat fluency)

This ratio can be adjusted based on your needs:
- **More reasoning**: Better at complex problems, more verbose
- **More chat**: Better at normal conversation, less verbose

**Formula:**
```python
chat_samples = reasoning_samples * (chat_fraction / reasoning_fraction)
```
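
As a worked example of the formula, the sketch below computes how many chat examples to sample for a few target mixes. `chat_samples_needed` is a hypothetical helper and the reasoning count of 10,000 is illustrative, not the real dataset size.

```python
def chat_samples_needed(n_reasoning: int, chat_fraction: float) -> int:
    """Chat examples to add so that chat makes up `chat_fraction` of the mix."""
    if not 0 <= chat_fraction < 1:
        raise ValueError("chat_fraction must be in [0, 1)")
    # Truncate to a whole number of examples
    return int(n_reasoning * (chat_fraction / (1 - chat_fraction)))

n_reasoning = 10_000  # illustrative count, not the real dataset size
for fraction in (0.10, 0.25, 0.50):
    n_chat = chat_samples_needed(n_reasoning, fraction)
    actual = n_chat / (n_chat + n_reasoning)
    print(f"target {fraction:.0%}: {n_chat:,} chat examples (actual {actual:.1%})")
```

Because the result is truncated to a whole number, the actual share can land slightly below the target.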

In [None]:
# Define the chat percentage (reasoning will be 1 - chat_percentage)
chat_percentage = 0.25  # 25% chat, 75% reasoning

print(f"Target mix:")
print(f"  Reasoning: {(1 - chat_percentage) * 100:.0f}%")
print(f"  Chat: {chat_percentage * 100:.0f}%")

In [None]:
# Sample the conversational dataset to match desired ratio
import pandas as pd

# Convert to pandas for easy sampling
non_reasoning_subset = pd.Series(non_reasoning_conversations)

# Calculate how many chat examples we need
n_chat_samples = int(len(reasoning_conversations_formatted) * (chat_percentage / (1 - chat_percentage)))

# Sample
non_reasoning_subset = non_reasoning_subset.sample(
    n_chat_samples,
    random_state=2407,
)

print(f"\n=== Sampled Dataset ===")
print(f"Reasoning examples: {len(reasoning_conversations_formatted):,}")
print(f"Chat examples: {len(non_reasoning_subset):,}")
print(f"\nActual chat percentage: {len(non_reasoning_subset) / (len(non_reasoning_subset) + len(reasoning_conversations_formatted)) * 100:.1f}%")

In [None]:
# Combine both datasets
data = pd.concat([
    pd.Series(reasoning_conversations_formatted),
    pd.Series(non_reasoning_subset)
])
data.name = "text"

# Convert to HuggingFace Dataset
from datasets import Dataset
combined_dataset = Dataset.from_pandas(pd.DataFrame(data), preserve_index=False)  # Drop the pandas index column

# Shuffle for better training
combined_dataset = combined_dataset.shuffle(seed=3407)

print("✅ Combined dataset created!")
print(f"\nTotal examples: {len(combined_dataset):,}")
print(f"  Reasoning: {len(reasoning_conversations_formatted):,} ({(1-chat_percentage)*100:.0f}%)")
print(f"  Chat: {len(non_reasoning_subset):,} ({chat_percentage*100:.0f}%)")

In [None]:
# Inspect the combined dataset
print("=== Combined Dataset Sample ===\n")
print(combined_dataset[0]["text"][:600])  # Each row is a dict; slice the "text" field
print("\n... (truncated) ...")

## 4. Train the Model

### Configure Training with SFTTrainer

We'll use TRL's **SFTTrainer** with optimized settings.

**Training Configuration:**
- **Batch size**: 2 per device
- **Gradient accumulation**: 4 steps (effective batch size = 8)
- **Learning rate**: 2e-4 (standard for LoRA)
- **Optimizer**: AdamW 8-bit (memory efficient)
- **Steps**: 30 (demo) / 1000+ (production)

**For production:**
- Set `num_train_epochs=1` and remove `max_steps` (or set it to `-1`, its default, so the epoch count takes effect)
- Increase batch size if you have more memory
- Add evaluation dataset for monitoring
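
The relationship between batch size, gradient accumulation, and step count can be sketched as follows. `training_plan` is a hypothetical helper, the step count is an approximation of what the trainer reports, and the 13,000 examples are an illustrative total, not a measured dataset size.

```python
import math

def training_plan(n_examples, per_device_batch, grad_accum, n_devices=1, epochs=1):
    """Return (effective batch size, approximate optimizer steps) for a run."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

effective, steps = training_plan(n_examples=13_000, per_device_batch=2, grad_accum=4)
print(f"Effective batch size: {effective}")
print(f"Approximate optimizer steps for one epoch: {steps:,}")
print(f"Examples seen in the 30-step demo: {30 * effective}")
```

Under these assumptions the 30-step demo touches only a small slice of the data, which is why a full epoch is suggested for production runs.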

In [None]:
# Initialize SFTTrainer
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=combined_dataset,
    eval_dataset=None,  # Add validation set for monitoring
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch size = 8
        warmup_steps=5,
        # num_train_epochs=1,  # Uncomment for full training
        max_steps=30,  # Quick demo - increase to 1000+ for real training
        learning_rate=2e-4,  # Reduce to 2e-5 for longer runs
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        report_to="none",  # Set to "wandb" for experiment tracking
        output_dir="output",
    ),
)

print("✅ SFTTrainer initialized!")
print(f"\nTraining configuration:")
print(f"  Batch size: 2 (per device)")
print(f"  Gradient accumulation: 4")
print(f"  Effective batch size: 8")
print(f"  Max steps: 30 (demo)")
print(f"  Learning rate: 2e-4")

In [None]:
# Check current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU: {gpu_stats.name}")
print(f"Max memory: {max_memory} GB")
print(f"Reserved memory: {start_gpu_memory} GB")
print(f"Available: {max_memory - start_gpu_memory:.3f} GB")

In [None]:
# Start training!
print("=" * 70)
print("STARTING TRAINING")
print("=" * 70)
print()
print(f"Training on {len(combined_dataset):,} mixed examples")
print(f"  - Reasoning: ~{len(reasoning_conversations_formatted):,}")
print(f"  - Chat: ~{len(non_reasoning_subset):,}")
print(f"Running for {trainer.args.max_steps} steps (demo)")
print()
print("For production:")
print("  - Set num_train_epochs=1 and remove max_steps")
print("  - Expected time: 2-3 hours")
print()
print("=" * 70)
print()

# Train the model
# Uncomment to start training
# trainer_stats = trainer.train()

print("⚠️ Training is commented out by default.")
print("Uncomment 'trainer_stats = trainer.train()' to start training.")

In [None]:
# Show final memory and time stats (after training)
# Uncomment when training is complete

# used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
# used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
# used_percentage = round(used_memory / max_memory * 100, 3)
# lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

# print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
# print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
# print(f"Peak reserved memory = {used_memory} GB.")
# print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
# print(f"Peak reserved memory % of max memory = {used_percentage}%.")
# print(f"Peak reserved memory for training % of max memory = {lora_percentage}%.")

print("⚠️ Uncomment after training to see memory stats")

## 5. Inference: Thinking vs Non-Thinking

### Qwen3's Dual Modes

Qwen3 has a unique feature: **controllable thinking**!

**Non-Thinking Mode** (Fast):
- `enable_thinking=False` in chat template
- `temperature=0.7, top_p=0.8, top_k=20`
- Quick, concise responses
- Best for: Simple questions, chat, Q&A

**Thinking Mode** (Detailed):
- `enable_thinking=True` in chat template
- `temperature=0.6, top_p=0.95, top_k=20`
- Detailed reasoning in `<think>` tags
- Best for: Math, logic, complex problems

### When to Use Each

| Scenario | Mode | Why |
|----------|------|-----|
| "What is 2+2?" | Non-thinking | Simple calculation |
| "Solve (x+2)^2=0" | Thinking | Needs steps |
| "Hi, how are you?" | Non-thinking | Simple chat |
| "Explain quantum physics" | Thinking | Complex topic |
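
The per-mode sampling settings listed above can be kept in one place so the two modes are not mixed up. `generation_settings` is a hypothetical helper for this notebook, not a library function; the token limits mirror the values used in the tests below.

```python
# Sampling settings from the section above, keyed by enable_thinking.
# generation_settings is a hypothetical helper, not a library function.
SAMPLING = {
    True: {"temperature": 0.6, "top_p": 0.95, "top_k": 20},   # thinking mode
    False: {"temperature": 0.7, "top_p": 0.8, "top_k": 20},   # non-thinking mode
}

def generation_settings(enable_thinking: bool, max_new_tokens=None) -> dict:
    """Return keyword arguments for generate() that match the chosen mode."""
    if max_new_tokens is None:
        max_new_tokens = 1024 if enable_thinking else 256
    return {"max_new_tokens": max_new_tokens, **SAMPLING[enable_thinking]}

print(generation_settings(enable_thinking=True))
print(generation_settings(enable_thinking=False))
```

The returned dictionary can be unpacked into the generation call, for example `model.generate(**inputs, **generation_settings(True))`.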

In [None]:
# Test 1: Non-Thinking Mode (Fast Chat)
print("=" * 70)
print("TEST 1: NON-THINKING MODE (Fast Chat)")
print("=" * 70)
print()

messages = [
    {"role": "user", "content": "Solve (x + 2)^2 = 0."}
]

# Apply chat template WITHOUT thinking
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # Must add for generation
    enable_thinking=False,  # Disable thinking for fast response
)

print("Prompt (without thinking enabled):")
print(text)
print("\n" + "=" * 70)
print("Response:")
print()

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    max_new_tokens=256,
    temperature=0.7, top_p=0.8, top_k=20,  # Settings for non-thinking
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)

print("\n" + "=" * 70)
print("💡 Notice: Concise answer without detailed reasoning")
print("=" * 70)

In [None]:
# Test 2: Thinking Mode (Detailed Reasoning)
print("\n" + "=" * 70)
print("TEST 2: THINKING MODE (Detailed Reasoning)")
print("=" * 70)
print()

messages = [
    {"role": "user", "content": "Solve (x + 2)^2 = 0."}
]

# Apply chat template WITH thinking
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # Enable thinking for detailed reasoning
)

print("Prompt (with thinking enabled):")
print(text)
print("\n" + "=" * 70)
print("Response:")
print()

_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    max_new_tokens=1024,  # More tokens for reasoning
    temperature=0.6, top_p=0.95, top_k=20,  # Settings for thinking
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)

print("\n" + "=" * 70)
print("💡 Notice: Detailed reasoning in <think> tags before answer")
print("=" * 70)

### Test with Various Questions

Let's test both modes with different types of questions to see when each mode shines.
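
When comparing the two modes, it can help to separate the reasoning trace from the final answer. The sketch below does this with a regular expression, assuming the decoded response contains a literal `<think>...</think>` block. `split_reasoning` is a hypothetical helper.

```python
import re

THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(response: str):
    """Return (reasoning, answer); reasoning is '' when no think block exists."""
    match = THINK_PATTERN.search(response)
    if match is None:
        return "", response.strip()
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()
    return reasoning, answer

sample = "<think>\n(x + 2)^2 = 0 means x + 2 = 0.\n</think>\n\nx = -2"
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```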

In [None]:
# Define test cases
test_cases = [
    {
        "question": "What is 15% of 80?",
        "best_mode": "Non-thinking (simple calculation)"
    },
    {
        "question": "If I invest $1000 at 5% annual interest compounded monthly for 3 years, how much will I have?",
        "best_mode": "Thinking (multi-step calculation)"
    },
    {
        "question": "Hello! How are you today?",
        "best_mode": "Non-thinking (simple chat)"
    },
    {
        "question": "Explain the concept of present value in finance and show me an example calculation.",
        "best_mode": "Thinking (complex explanation + calculation)"
    }
]

print("=== Test Cases ===\n")
for i, case in enumerate(test_cases, 1):
    print(f"{i}. Question: {case['question']}")
    print(f"   Best mode: {case['best_mode']}")
    print()

In [None]:
# Helper function to test both modes
def test_both_modes(question, max_tokens_non_thinking=256, max_tokens_thinking=1024):
    """
    Test a question with both thinking and non-thinking modes.
    
    Args:
        question: The question to ask
        max_tokens_non_thinking: Max tokens for non-thinking mode
        max_tokens_thinking: Max tokens for thinking mode
    """
    messages = [{"role": "user", "content": question}]
    
    # Non-thinking mode
    print("\n" + "=" * 70)
    print("NON-THINKING MODE:")
    print("=" * 70)
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
    )
    _ = model.generate(
        **tokenizer(text, return_tensors="pt").to("cuda"),
        max_new_tokens=max_tokens_non_thinking,
        temperature=0.7, top_p=0.8, top_k=20,
        streamer=TextStreamer(tokenizer, skip_prompt=True),
    )
    
    # Thinking mode
    print("\n" + "=" * 70)
    print("THINKING MODE:")
    print("=" * 70)
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
    )
    _ = model.generate(
        **tokenizer(text, return_tensors="pt").to("cuda"),
        max_new_tokens=max_tokens_thinking,
        temperature=0.6, top_p=0.95, top_k=20,
        streamer=TextStreamer(tokenizer, skip_prompt=True),
    )
    print("=" * 70)

print("✅ Testing function defined!")
print("\nUse this to compare both modes on any question.")

## 6. Save in Multiple Formats

Unsloth supports saving your model in **multiple formats** for different use cases:

### Saving Options

| Format | Size | Use Case | Load With |
|--------|------|----------|-----------|
| **LoRA adapters** | A few hundred MB | Sharing, experimentation | PEFT library |
| **Merged 16-bit** | ~28GB | Production (GPU) | Transformers |
| **Merged 4-bit** | ~7GB | Production (limited GPU) | Transformers + BitsAndBytes |
| **GGUF (q8_0)** | ~14GB | llama.cpp, Ollama | llama.cpp |
| **GGUF (q4_k_m)** | ~8GB | Edge deployment | llama.cpp |

### Recommended Workflow

1. **Development**: Save LoRA adapters only
2. **Testing**: Merge to 16-bit for quality check
3. **Deployment**: Convert to GGUF for efficiency

In [None]:
# Option 1: Save LoRA Adapters Only (Smallest)
# This saves only the trained adapter weights (a few hundred MB at rank 32)

local_path = "lora_model"
hub_path = "your-username/qwen3-14b-reasoning-chat-lora"  # Update username

print(f"Saving LoRA adapters...")

# Local save
model.save_pretrained(local_path)
tokenizer.save_pretrained(local_path)
print(f"\n✅ Saved locally to: {local_path}")

# Hub save (uncomment when ready)
# model.push_to_hub(hub_path, token="...")
# tokenizer.push_to_hub(hub_path, token="...")
# print(f"✅ Pushed to Hub: {hub_path}")

print("\n💡 LoRA adapters are small (a few hundred MB at rank 32) and easy to share")
print("   Load with: PeftModel.from_pretrained(base_model, 'lora_model')")

In [None]:
# Load the LoRA adapters (for future use)
# Set to True when you have a saved model

if False:
    from unsloth import FastLanguageModel
    
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="lora_model",  # Your saved model path
        max_seq_length=2048,
        load_in_4bit=True,
    )
    
    print("✅ Model loaded from saved LoRA adapters!")

print("💡 Change `if False` to `if True` above to load the saved model")

### Save Merged Models

For production deployment, you may want to merge LoRA adapters with the base model.

**Merged 16-bit:**
- Full precision
- Best quality
- Large file size (~28GB)
- For GPU deployment

**Merged 4-bit:**
- Quantized
- Good quality
- Smaller size (~7GB)
- For limited GPU memory

In [None]:
# Option 2: Save Merged 16-bit (Full Precision)
if False:
    model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
    print("✅ Saved as merged 16-bit model")

# Push to Hub
if False:
    model.push_to_hub_merged("hf/model", tokenizer, save_method="merged_16bit", token="")
    print("✅ Pushed merged 16-bit to Hub")

# Option 3: Save Merged 4-bit (Quantized)
if False:
    model.save_pretrained_merged("model", tokenizer, save_method="merged_4bit")
    print("✅ Saved as merged 4-bit model")

# Push to Hub
if False:
    model.push_to_hub_merged("hf/model", tokenizer, save_method="merged_4bit", token="")
    print("✅ Pushed merged 4-bit to Hub")

# Option 4: Save LoRA Only
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
    print("✅ Saved LoRA adapters")

# Push LoRA to Hub
if False:
    model.push_to_hub("hf/model", token="")
    tokenizer.push_to_hub("hf/model", token="")
    print("✅ Pushed LoRA to Hub")

print("💡 Change `if False` to `if True` for the option you want to use")
print("\nRecommended:")
print("  - Development: LoRA adapters")
print("  - Production (GPU): Merged 16-bit")
print("  - Production (limited GPU): Merged 4-bit")
print("  - Edge devices: GGUF (see next section)")

### Save to GGUF for llama.cpp

**GGUF** (GPT-Generated Unified Format) is for:
- **llama.cpp**: Fast CPU/GPU inference
- **Ollama**: Local model serving
- **LM Studio**: Desktop app
- **Jan**: Desktop app

### GGUF Quantization Methods

| Method | Size | Quality | Speed | Use Case |
|--------|------|---------|-------|----------|
| **f16** | ~28GB | Reference (no quantization loss) | Slowest | Quality check |
| **q8_0** | ~14GB | Near-lossless | Slower | GPUs with ample memory |
| **q5_k_m** | ~10GB | Very good | Fast | Higher quality at moderate size |
| **q4_k_m** | ~8GB | Good | Faster | Recommended default |

**Recommended**: `q4_k_m` for best balance of size/quality/speed
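
The sizes in the table can be sanity-checked with a back-of-the-envelope estimate: parameters times bits per weight. The bits-per-weight values below are approximate assumptions (k-quants mix precisions across tensors), and 14e9 is a nominal parameter count, so treat the results as rough guides only.

```python
# Approximate bits per weight for each quantization method.
# ASSUMPTION: rough averages; real files vary by tensor layout and metadata.
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q5_k_m": 5.7, "q4_k_m": 4.85}

def estimated_size_gb(n_params: float, method: str) -> float:
    """Estimate file size in GB (10^9 bytes), ignoring metadata overhead."""
    return n_params * BITS_PER_WEIGHT[method] / 8 / 1e9

N_PARAMS = 14e9  # nominal parameter count; the exact figure is somewhat higher
for method in BITS_PER_WEIGHT:
    print(f"{method:>7}: ~{estimated_size_gb(N_PARAMS, method):.1f} GB")
```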

In [None]:
# Option 5: Save to GGUF (for llama.cpp, Ollama, etc.)

# Save to 8-bit Q8_0
if False:
    model.save_pretrained_gguf("model", tokenizer)
    print("✅ Saved as GGUF (q8_0)")

# Push to Hub
if False:
    model.push_to_hub_gguf("hf/model", tokenizer, token="")
    print("✅ Pushed GGUF (q8_0) to Hub")

# Save to 16-bit GGUF (highest quality)
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method="f16")
    print("✅ Saved as GGUF (f16)")

# Save to q4_k_m GGUF (recommended)
if False:
    model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")
    print("✅ Saved as GGUF (q4_k_m)")

# Push multiple GGUF formats at once (much faster!)
if False:
    model.push_to_hub_gguf(
        "hf/model",  # Update with your username!
        tokenizer,
        quantization_method=["q4_k_m", "q8_0", "q5_k_m"],
        token="",  # Get token at https://huggingface.co/settings/tokens
    )
    print("✅ Pushed multiple GGUF formats to Hub!")

print("💡 Change `if False` to `if True` to save in GGUF format")
print("\nGGUF files can be used with:")
print("  - llama.cpp (fast CPU/GPU inference)")
print("  - Ollama (local serving)")
print("  - LM Studio (desktop app)")
print("  - Jan (desktop app)")

## 🎉 Congratulations!

You've successfully:
- ✅ Loaded Qwen3-14B with Unsloth (2x faster!)
- ✅ Mixed reasoning and conversational datasets (75/25)
- ✅ Trained a model that can both reason and chat
- ✅ Tested thinking vs non-thinking modes
- ✅ Learned to save in multiple formats (LoRA, merged, GGUF)

## 🎯 Key Achievements

### Dual-Mode Model
- **Thinking mode**: Detailed reasoning for complex problems
- **Non-thinking mode**: Fast, concise responses for simple questions
- **User control**: Choose mode at inference time
- **Best of both**: Efficiency when needed, depth when required

### Efficient Training
- **Unsloth**: 2x faster than standard training
- **70% less memory**: Fits on consumer GPUs
- **4-bit quantization**: 14B model on 16GB GPU
- **LoRA**: Train only ~1-2% of parameters

### Practical Deployment
- **Multiple formats**: LoRA, merged, GGUF
- **Flexible deployment**: GPU, CPU, edge devices
- **Production-ready**: Save and share easily
- **Tool compatibility**: Works with llama.cpp, Ollama, etc.

## 📊 Performance Summary

| Metric | Value |
|--------|-------|
| **Training speedup** | 2x faster (Unsloth) |
| **Memory savings** | 70% less VRAM |
| **Model size** | 14B parameters (4-bit) |
| **Trainable params** | ~1-2% (LoRA) |
| **Training time** | ~30 min (demo) / 2-3 hours (full) |
| **Inference modes** | Thinking + Non-thinking |

## 🚀 Next Steps

### Immediate
1. **Train longer**: Set `num_train_epochs=1` for full training
2. **Adjust ratio**: Try different reasoning/chat percentages
3. **Evaluate**: Test on math benchmarks (GSM8K, MATH)
4. **Deploy**: Convert to GGUF and use with Ollama

### Advanced
1. **Larger model**: Try Qwen3-32B for better quality
2. **More data**: Add domain-specific datasets
3. **Custom datasets**: Create your own reasoning data
4. **Multi-GPU**: Scale training with DeepSpeed

### Production
1. **Quantize**: Use q4_k_m GGUF for deployment
2. **Serve**: Deploy with vLLM or llama.cpp
3. **API**: Wrap in FastAPI with mode selection
4. **Monitor**: Track thinking vs non-thinking usage

## 📚 Resources

### Documentation
- [Unsloth Docs](https://docs.unsloth.ai/)
- [Qwen3 Model Card](https://huggingface.co/Qwen/Qwen3-14B)
- [TRL Documentation](https://huggingface.co/docs/trl)
- [llama.cpp](https://github.com/ggerganov/llama.cpp)

### Datasets
- [OpenMathReasoning](https://huggingface.co/datasets/unsloth/OpenMathReasoning-mini)
- [FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k)
- [GSM8K](https://huggingface.co/datasets/gsm8k) (math evaluation)

### Related Links
- [Qwen3 Full Docs](https://qwenlm.github.io/blog/qwen3/)
- [Unsloth Notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks)
- [GGUF Quantization Guide](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)

## 💡 Tips for Better Results

### Dataset Mixing
1. **Math-heavy use case**: 80-90% reasoning, 10-20% chat
2. **Balanced use case**: 50-50 split
3. **Chat-heavy use case**: 20-30% reasoning, 70-80% chat

### Training
1. **Full epochs**: For best quality, train 1-2 full epochs
2. **Learning rate**: Start with 2e-4, reduce to 2e-5 for long runs
3. **Batch size**: Increase if you have more memory
4. **LoRA rank**: Try 64 or 128 for larger models

### Inference
1. **Simple questions**: Use non-thinking mode (faster, cheaper)
2. **Complex problems**: Use thinking mode (better quality)
3. **Temperature**: Lower (0.6) for thinking, higher (0.7) for chat
4. **Max tokens**: More (1024+) for thinking, less (256) for chat

## 🔧 Troubleshooting

### Out of Memory
- Reduce `per_device_train_batch_size` to 1
- Increase `gradient_accumulation_steps` to 8 (with a batch size of 1, this keeps the effective batch size at 8)
- Reduce `max_seq_length` to 1024
- Use smaller model (Qwen3-8B or 4B)

### Poor Reasoning Quality
- Increase reasoning dataset percentage (80-90%)
- Train for more steps/epochs
- Increase LoRA rank to 64
- Use larger model (Qwen3-32B)

### Slow Training
- Already using Unsloth (2x faster!)
- Increase batch size if memory allows
- Keep gradient checkpointing (`"unsloth"`) if it lets you fit a larger batch; it saves VRAM at some cost in speed
- Ensure you're on GPU

### Model Doesn't Use Thinking
- Ensure `enable_thinking=True` in chat template
- Check training data has `<think>` tags
- Train on more reasoning examples
- Use thinking-specific temperature settings
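
To check whether the training data actually carries reasoning traces, a quick sketch like the one below can count texts with a non-empty `<think>` block. It assumes the formatted texts contain literal `<think>...</think>` tags and treats empty blocks as no reasoning. `has_reasoning` and `reasoning_share` are hypothetical helpers.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def has_reasoning(text: str) -> bool:
    """True if any <think> block in `text` contains non-whitespace content."""
    return any(block.strip() for block in THINK_BLOCK.findall(text))

def reasoning_share(texts) -> float:
    """Fraction of texts that carry a non-empty reasoning trace."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(has_reasoning(t) for t in texts) / len(texts)

samples = [
    "<think>\nAdd the numbers.\n</think>\n\n4",
    "<think>\n\n</think>\n\nHello!",  # empty block: counts as no reasoning
    "Plain chat reply",
    "<think>Check both roots.</think> x = -2",
]
print(f"Share with reasoning: {reasoning_share(samples):.0%}")
```

Applied to the mixed dataset, for example `reasoning_share(combined_dataset["text"])`, the share should land near the reasoning percentage you targeted.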

---

**Happy Fine-Tuning!** 🎓✨

For more tutorials, check out:
- [DeepSeek-R1 (Synthetic Data)](./Math-Reasoning-Qwen-GRPO.ipynb)
- [Gemma 3 (Financial Q&A)](./Financial-Reasoning-Gemma-3.ipynb)
- [GPT-2 From Scratch](../01-Full-Fine-Tuning/)
- [Falcon-7B LoRA](../02-PEFT/)
- [FLAN-T5 Summarization](../03-Instruction-Tuning/Summarization-FLAN-T5.ipynb)

---

## 🌟 Special Features of This Notebook

### Unique Capabilities

1. **Dual-Mode Training** 🔀
   - Mix reasoning and chat data
   - Adjustable ratios
   - Single model, multiple capabilities

2. **Unsloth Optimization** ⚡
   - 2x faster training
   - 70% less memory
   - No quality loss

3. **Controllable Thinking** 🧠
   - Enable/disable at inference
   - Recommended settings per mode
   - Flexibility for different use cases

4. **Multiple Export Formats** 💾
   - LoRA adapters
   - Merged models
   - GGUF for llama.cpp
   - Single command for multiple formats

---

## 🎊 You've Completed the Reasoning Trilogy!

This is the **3rd reasoning notebook** in this repository:

1. **DeepSeek-R1**: Synthetic reasoning data + Unsloth + code generation
2. **Gemma 3**: Financial reasoning with expert datasets
3. **Qwen3**: Mixed reasoning + chat with controllable thinking ⭐

**Each teaches different aspects of reasoning fine-tuning!**

---

**Thank you for completing this tutorial!** 🙏

**You now have a powerful model that can reason deeply AND chat naturally!** 🚀