## 🚀 Large Language Model Fine-tuning Practical Tutorial

### 📘 Welcome to Your LLM Fine-tuning Journey!

In this tutorial, we'll explore how to transform a powerful pre-trained model into your personalized AI assistant. Whether you're an AI beginner or an experienced developer, this guide will help you master the core techniques of model fine-tuning in the most intuitive way possible.

---

### 🎯 Why Fine-tune Models?

Think of pre-trained large language models as brilliant generalists—they possess vast knowledge but may not understand your specific business needs. **Model fine-tuning** is the process of transforming this "generalist" into a "domain expert" for your field.

#### Core Concepts Explained:
- **Pre-trained Models**: Foundation models trained on massive text corpora with broad language understanding
- **Fine-tuning Process**: Using your domain data to teach the model task-specific patterns and knowledge
- **End Result**: A customized model that retains general capabilities while excelling at specific tasks

---

### 🛠️ Our Technology Stack

We've carefully selected an **efficient, user-friendly, and resource-conscious** technology stack:

#### 1. **Unsloth Acceleration Engine** ⚡
- Up to about **2x training speedup** (project-reported; varies by model and GPU)
- Up to about **60% lower memory usage**
- Makes large-model fine-tuning practical on consumer GPUs

#### 2. **LoRA Fine-tuning Technology** 🎯
- Trains only a small fraction of parameters (roughly **0.1-1%**) while approaching full fine-tuning performance
- Post-training adapter files are just tens of MBs, easy to deploy and share
- Supports flexible switching between multiple LoRA adapters

#### 3. **4-bit Quantization** 💾
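- Stores weights in 4 bits instead of 16, cutting model size to roughly **1/4 of the original**
- Usually costs only a small amount of accuracy while drastically reducing hardware requirements
- Brings models of roughly 30B parameters within reach of a 24GB consumer GPU (a 70B model needs about 35GB for its 4-bit weights alone)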

#### 4. **Base Model: Baidu ERNIE-4.5** 🤖
We're using Baidu's latest ERNIE-4.5 series as our foundation:
- **0.3B version**: Perfect for quick experiments and learning (default in this tutorial)
- **21B-A3B version**: Mixture-of-experts model (21B total parameters, about 3B active per token) for production-grade quality

---

### 📚 Learning Roadmap

```
Step 1: Environment Setup (10 minutes)
   ↓
Step 2: Model Loading & Configuration (5 minutes)
   ↓
Step 3: Data Preparation & Formatting (10 minutes)
   ↓
Step 4: Training Launch (20 minutes)
   ↓
Step 5: Save & Deploy (5 minutes)
```

Let's begin this exciting AI journey! 🎉

## 📦 Step 1: Environment Setup

### 🔧 Installing Required Libraries

First, we need to install several key Python libraries that will provide all the functionality needed for model training.

**Tip**: Installation takes about 2-3 minutes. Feel free to read ahead while waiting!

In [None]:
# ===== Install Core Libraries =====
# Unsloth: Our training acceleration engine, installed from source for latest features
!pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

# bitsandbytes: Provides model quantization to compress models to 4-bit or 8-bit
# unsloth_zoo: Contains pre-configured models and toolsets
!pip install bitsandbytes unsloth_zoo

# transformers: Hugging Face's core library providing models and trainers
!pip install -U transformers

print("✅ Install commands finished. Check the output above for errors.")

## 🤖 Step 2: Load Model and Configure LoRA

### 💡 Understanding LoRA: Making Large Model Fine-tuning Simple

Before diving into code, let's understand the magic of **LoRA (Low-Rank Adaptation)**:

#### Traditional Fine-tuning vs LoRA Fine-tuning

| Aspect | Traditional Full Fine-tuning | LoRA Fine-tuning |
|--------|------------------------------|------------------|
| Parameters Trained | 100% (billions) | 0.1-1% (millions) |
| Memory Required | Extreme (>80GB) | Moderate (8-24GB) |
| Training Time | Days | Hours |
| Model File Size | Full model size | Adapter only (tens to ~100MB) |
| Performance | Best | Near-best |

#### How LoRA Works

Think of model knowledge updates like home renovation:
- **Traditional method**: Tear down and rebuild the entire house (train all parameters)
- **LoRA method**: Install "adapters" in key locations (train small new parameters)

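Specifically, LoRA freezes each targeted weight matrix W and adds two small trainable matrices (A and B) next to it. The original paper targets the attention projections; this tutorial targets all linear layers. Only A and B are updated during training:
```
Original output = W × input
LoRA output = W × input + scale × (B × A) × input
              ↑frozen       ↑new trainable part (scale = lora_alpha / r)
```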
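To make the formula concrete, here is a minimal pure-Python sketch of a single LoRA-augmented layer. The helper names (`matvec`, `lora_forward`) and the tiny matrices are made up for illustration; real libraries work on tensors, but the arithmetic is the same. The sketch includes the `lora_alpha / r` scaling from the original LoRA paper and starts B at zero, so the adapted layer initially behaves exactly like the base layer.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Return W·x + (alpha / r) · B·(A·x). W stays frozen; only A and B train."""
    base = matvec(W, x)                  # frozen pre-trained path
    low_rank = matvec(B, matvec(A, x))   # trainable path through the rank-r bottleneck
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

# Toy layer: 2 inputs -> 2 outputs, rank r = 1 (illustrative values only)
W = [[1.0, 0.0],
     [0.0, 1.0]]          # frozen weight, shape d_out x d_in
A = [[1.0, 1.0]]          # shape r x d_in
x = [2.0, 3.0]

# B starts at zero, so the adapted layer initially matches the base layer
B_init = [[0.0], [0.0]]   # shape d_out x r
y_start = lora_forward(W, A, B_init, x, alpha=2, r=1)

# After some training, B is non-zero and shifts the output
B_trained = [[0.5], [-0.5]]
y_trained = lora_forward(W, A, B_trained, x, alpha=2, r=1)

print(y_start)    # [2.0, 3.0]
print(y_trained)  # [7.0, -2.0]
```

Because W is never modified, removing or swapping the adapter restores the base model's behavior.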

#### Key Parameters Explained

- **r (rank)**: LoRA matrix rank, controls adapter capacity
  - r=8: Lightweight fine-tuning for simple tasks
  - r=16: Balanced choice (used in this tutorial)
  - r=32+: Complex tasks requiring more new knowledge
  
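- **lora_alpha**: Scaling factor, typically set to 2× rank. The low-rank update is multiplied by `lora_alpha / r`, or by `lora_alpha / sqrt(r)` when `use_rslora=True` (as in this tutorial)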
- **lora_dropout**: Prevents overfitting, usually 0.05-0.1
- **target_modules**: Modules to apply LoRA to
  - "all-linear": All linear layers (recommended)
  - Can specify specific layers like ["q_proj", "v_proj"]
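As a rough illustration of how `r` and `lora_alpha` interact, the sketch below counts the parameters one adapter adds to a single linear layer and computes the scaling factor with and without rank-stabilized LoRA. The 1024×1024 layer size and the helper name `lora_param_count` are assumptions chosen for the example, not ERNIE's real dimensions.

```python
import math

def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

# Hypothetical square projection layer (sizes chosen for illustration only)
d_in, d_out = 1024, 1024
full = d_in * d_out

for rank in (8, 16, 32):
    added = lora_param_count(d_in, d_out, rank)
    share = 100 * added / full
    print(f"r={rank:>2}: {added:>6,} adapter params ({share:.1f}% of this layer)")

# Scaling applied to the low-rank update with this tutorial's settings
r, lora_alpha = 16, 32
standard_scale = lora_alpha / r            # classic LoRA
rslora_scale = lora_alpha / math.sqrt(r)   # rank-stabilized LoRA
print(standard_scale, rslora_scale)        # 2.0 8.0
```

Doubling `r` doubles the adapter size for a layer. The whole-model percentage printed by the notebook depends on which modules are adapted and how large the non-adapted parts (such as embeddings) are.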

In [None]:
# ===== Load Pre-trained Model =====
from unsloth import FastModel
import torch

# Set maximum sequence length (affects memory usage and training speed)
MAX_LEN = 4096  # Adjust based on task: 2048 (chat), 4096 (balanced), 8192 (long text)

# Model Selection Guide:
# - baidu/ERNIE-4.5-0.3B-PT: Lightweight, perfect for learning and quick experiments (recommended for beginners)
# - baidu/ERNIE-4.5-21B-A3B-PT: Production-grade, powerful but requires more resources

print("🔄 Loading model, please wait...")
model, tokenizer = FastModel.from_pretrained(
    model_name="baidu/ERNIE-4.5-0.3B-PT",  # Can switch to 21B version
    # max_seq_length=MAX_LEN,               # Some versions require this
    # load_in_4bit=False,                   # Set True for QLoRA if memory constrained
    # load_in_8bit=False,                   # 8-bit quantization option
    # full_finetuning=False,                # True for full fine-tuning (requires large memory)
    trust_remote_code=True,                  # Required for ERNIE models
)
print("✅ Model loaded successfully!")

# ===== Configure LoRA Adapters =====
from unsloth import FastLanguageModel

print("🔧 Configuring LoRA adapters...")
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank: 8 (lightweight), 16 (balanced), 32 (complex tasks)
    lora_alpha=32,           # Scaling factor: typically 2× rank
    lora_dropout=0.05,       # Dropout rate: prevents overfitting (0.05-0.1)
    target_modules="all-linear",  # Layers to apply LoRA: "all-linear" or specific layer list
    use_rslora=True,         # Rank-stabilized LoRA: scales the update by alpha/sqrt(r) instead of alpha/r
    # use_gradient_checkpointing=True,  # Saves memory but slows training
)

# Print model information
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
all_params = sum(p.numel() for p in model.parameters())
trainable_percent = 100 * trainable_params / all_params

print(f"✅ LoRA configuration complete!")
print(f"📊 Model Statistics:")
print(f"   - Total parameters: {all_params:,}")
print(f"   - Trainable parameters: {trainable_params:,}")
print(f"   - Trainable percentage: {trainable_percent:.2f}%")

## 📊 Step 3: Prepare Training Data

### 📝 The Importance of Data Formatting

Data is the "textbook" for model fine-tuning. Just as teaching requires appropriate materials, model training needs correctly formatted data.

#### Why Data Format Matters

1. **Model Understanding**: Chat models learned a specific conversation format during their original training
2. **Performance Impact**: Incorrect formatting causes models to "misunderstand" instructions
3. **Consistency**: Maintaining format consistency helps models generalize better

#### ERNIE's Conversation Format

ERNIE-4.5 uses a specific conversation template:
```
<|begin_of_sentence|>User: [user input]
Assistant: [assistant response]<|end_of_sentence|>
```

These special tokens help the model distinguish:
- Start and end of conversations
- Boundaries between user input and AI responses
- Context in multi-turn dialogues
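To see how these pieces fit together, here is a hand-written sketch that renders one exchange in the layout shown above. The function `render_single_turn` is hypothetical and exists only for illustration; in the notebook, `tokenizer.apply_chat_template` remains the source of truth for the exact template, including details this sketch may not reproduce.

```python
BOS = "<|begin_of_sentence|>"
EOS = "<|end_of_sentence|>"

def render_single_turn(question, answer=None):
    """Render one user/assistant exchange in the layout shown above.

    Illustration only: leaving `answer` as None stops after "Assistant: ",
    which is an assumed stand-in for a generation prompt.
    """
    text = f"{BOS}User: {question}\nAssistant: "
    if answer is not None:
        text += f"{answer}{EOS}"
    return text

sample = render_single_turn("What is 2 + 3?", "2 + 3 = 5")
print(sample)
```

Comparing a string like this against the real output of the formatting cell is a quick way to spot missing or duplicated special tokens.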

#### Dataset Selection Recommendations

This tutorial uses Microsoft's Orca Math dataset as an example, but you can choose based on your needs:

- **General Dialogue**: ShareGPT, Alpaca, etc.
- **Math Reasoning**: GSM8K, MATH, etc.
- **Code Generation**: CodeAlpaca, The Stack, etc.
- **Domain-Specific**: Medical, legal, financial specialized datasets
- **Custom Data**: Your own business data (recommended)

In [None]:
# ===== Load and Format Dataset =====
from datasets import load_dataset

print("📚 Loading dataset...")
# Using Microsoft's Orca math word problems dataset as example
# split="train[:1%]" means using only 1% of data for quick demonstration
# For production, use complete dataset or your own data
ds = load_dataset("microsoft/orca-math-word-problems-200k", split="train[:1%]")
print(f"✅ Data loaded! Total samples: {len(ds)}")

# ===== Data Formatting Function =====
def format_chat(example):
    """
    Convert raw data to ERNIE model's conversation format
    
    Input format:
    - question: User's question
    - answer: Expected response
    
    Output format:
    <|begin_of_sentence|>User: [question]
    Assistant: [answer]<|end_of_sentence|>
    """
    # Build conversation message list
    messages = [{"role": "user", "content": example["question"]}]
    
    # Add assistant response if answer exists
    if "answer" in example and example["answer"]:
        messages.append({"role": "assistant", "content": example["answer"]})
    
    # Use tokenizer's template to automatically add special tokens
    formatted_text = tokenizer.apply_chat_template(
        messages, 
        tokenize=False,  # Don't tokenize, just format
        add_generation_prompt=False  # No generation prompt needed for training
    )
    
    return {"text": formatted_text}

# ===== Apply Formatting =====
print("🔄 Formatting data...")
ds = ds.map(
    format_chat,  # Apply formatting function
    remove_columns=ds.column_names,  # Remove original columns, keep only "text"
    desc="Formatting data"  # Progress bar description
)

# ===== Validate Data Format =====
print("✅ Data formatting complete!")
print("\n📝 Data Sample (first entry):")
print("-" * 50)
print(ds[0]["text"][:500])  # Show first 500 characters
print("-" * 50)
print(f"\n💡 Tip: Verify the data contains correct special tokens!")

### 🔍 Data Format Validation

In the output above, check that the data follows the ERNIE format:
- `<|begin_of_sentence|>` marks conversation start
- `User:` and `Assistant:` clearly distinguish roles
- `<|end_of_sentence|>` marks conversation end

This format ensures the model correctly understands task boundaries and role transitions.

## 🚂 Step 4: Configure and Launch Training

### ⚙️ Training Parameters Explained

Training configuration is key to successful fine-tuning. Let's understand each parameter's role and optimization tips:

#### 🎯 Batch Processing & Gradient Accumulation

**Effective Batch Size = per_device_train_batch_size × gradient_accumulation_steps × Number of GPUs**

In our configuration:
- `per_device_train_batch_size = 1`: Process 1 sample per forward pass
- `gradient_accumulation_steps = 8`: Accumulate gradients for 8 steps before updating
- **Effective Batch Size = 1 × 8 × 1 GPU = 8**

💡 **Optimization Tips**:
- Sufficient memory: Increase `per_device_train_batch_size`
- Limited memory: Reduce batch size, increase gradient accumulation
- Target: Effective batch size between 8-32 typically works well
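The arithmetic above can be checked with a few lines of Python. The helper names and the dataset size of 2,000 sequences are assumptions for illustration (roughly 1% of a 200k-row dataset). Rounding up is also an assumption, since trainers differ in how they treat a final partial batch.

```python
import math

def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus=1):
    """Number of samples that contribute to each weight update."""
    return per_device_batch * grad_accum_steps * num_gpus

def optimizer_steps_per_epoch(num_sequences, effective_batch):
    """Approximate number of weight updates in one pass over the data."""
    return math.ceil(num_sequences / effective_batch)

eff = effective_batch_size(per_device_batch=1, grad_accum_steps=8, num_gpus=1)
steps = optimizer_steps_per_epoch(num_sequences=2000, effective_batch=eff)
print(eff, steps)  # 8 250
```

With sample packing enabled, several short samples are merged into one sequence, so the real number of sequences (and therefore steps) is lower than the raw sample count suggests.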

#### 📈 Learning Rate & Optimizer

- `learning_rate = 1e-4`: Typical learning rate for LoRA fine-tuning
  - Too high (>5e-4): Unstable training, oscillating loss
  - Too low (<1e-5): Slow convergence, poor results
  - Rule of thumb: 1e-4 to 2e-4 for LoRA, 1e-5 to 5e-5 for full fine-tuning

- `optim = "adamw_8bit"`: 8-bit AdamW optimizer
  - Cuts optimizer-state memory by about 75% compared with 32-bit AdamW
  - Negligible performance loss
  - Suitable for large models and long sequences

- `lr_scheduler_type = "linear"`: Linear learning rate decay
  - High learning rate early for rapid learning
  - Low learning rate later for fine adjustments
  - Alternatives: cosine (cosine decay), constant (fixed rate)
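A linear schedule is easy to reproduce by hand. The sketch below uses a hypothetical helper, `linear_lr`, that follows the usual shape of linear warmup followed by linear decay to zero. The total of 250 steps is an assumed value for illustration.

```python
def linear_lr(step, total_steps, base_lr=1e-4, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup + linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at the final step
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total = 250  # assumed number of optimizer steps
for step in (0, 125, 250):
    print(f"step {step:>3}: lr = {linear_lr(step, total):.2e}")
```

With `warmup_steps=0`, as in this tutorial's configuration, training starts at the full learning rate, reaches half of it at the midpoint, and ends at zero.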

#### 🔄 Training Epochs & Steps

- `num_train_epochs = 1`: Number of training epochs
  - Small datasets (<10k): 3-5 epochs
  - Medium datasets (10k-100k): 1-3 epochs
  - Large datasets (>100k): 1 epoch or use steps

- Alternative using `max_steps`:
  - Quick testing: 50-100 steps
  - Normal training: 500-2000 steps
  - Deep training: 5000+ steps

#### 💾 Checkpoint Saving

- `save_strategy = "steps"`: Save by steps
- `save_steps = 200`: Save every 200 steps
  - Enables resuming after interruption
  - Allows selecting best checkpoint
  - Recommendation: Set to 10-20% of total steps

#### 🚀 Performance Optimization

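- `fp16 = True`: Use half-precision (FP16) mixed-precision training
  - Can be up to about 2x faster than FP32 on GPUs with tensor cores
  - Roughly halves activation memory
  - Works on most GPUs (V100, T4, RTX, etc.)

- `bf16 = False`: BFloat16 precision
  - Better numerical stability (same exponent range as FP32)
  - Requires Ampere or newer GPUs (A100, H100, RTX 30/40 series)
  - Prefer bf16 if supported: set `bf16 = True` and `fp16 = False`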

- `packing = True`: Sample packing
  - Concatenates short samples for better GPU utilization
  - Especially useful for varying-length datasets
  - Can improve training speed by roughly 2-5x when most samples are much shorter than `max_seq_length`
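To illustrate what packing does, here is a simplified sketch that concatenates tokenized samples into one stream and cuts it into fixed-size blocks. This is not TRL's actual implementation (which has changed across versions); the function `pack_sequences`, the separator id `0`, and the toy token ids are all assumptions for the example.

```python
def pack_sequences(samples, block_size, sep_token=0):
    """Concatenate token lists (separated by sep_token) and split into full blocks."""
    stream = []
    for tokens in samples:
        stream.extend(tokens)
        stream.append(sep_token)   # mark the boundary between samples
    # Keep only complete blocks; the trailing partial block is dropped
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Four short "tokenized" samples of different lengths (toy ids)
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
blocks = pack_sequences(samples, block_size=4)
print(blocks)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Without packing, each of the four samples would be padded to the longest length in its batch. With packing, every position in every block carries a real token, which is where the speedup comes from.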

#### 📊 Monitoring & Debugging

- `logging_steps = 5`: Log every 5 steps
  - Monitor loss decrease trend
  - Quickly identify training issues
  - Production: can set to 10-50

- `report_to = "none"`: No external reporting
  - Options: "tensorboard", "wandb"
  - Useful for visualization and team collaboration

### 🎓 Understanding Training Progress

During training, you'll see output like:
```
Step | Training Loss | Validation Loss
-----|---------------|----------------
10   | 2.345         | -
20   | 1.876         | -
30   | 1.234         | -
```

**Normal behavior**:
- Loss trends downward (for example from 2-3 toward 0.5-1.5), with some step-to-step noise
- Decrease rate gradually slows
- No sudden jumps

**Troubleshooting**:
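- Loss not decreasing → Check data format, then try a higher learning rate
- Loss oscillating → Reduce learning rate, increase gradient accumulation
- Training loss suddenly spikes → Usually instability: lower the learning rate or add warmup steps
- Validation loss rises while training loss keeps falling → Overfitting: add dropout or reduce training steps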
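A quick way to apply these checks is to compare the average of the first few logged losses with the average of the last few. The helper `loss_trend` below is a hypothetical heuristic, and the loss values are made-up illustrative numbers, not results from this notebook.

```python
def loss_trend(losses, window=3, tolerance=0.02):
    """Compare the mean of the first and last `window` logged losses."""
    if len(losses) < 2 * window:
        return "not enough data"
    start = sum(losses[:window]) / window
    end = sum(losses[-window:]) / window
    if end < start * (1 - tolerance):
        return "decreasing"
    if end > start * (1 + tolerance):
        return "increasing"
    return "flat"

# Illustrative loss curves (invented values)
healthy = [2.345, 1.876, 1.234, 1.101, 0.987, 0.912]
stalled = [2.31, 2.35, 2.29, 2.33, 2.30, 2.32]

print(loss_trend(healthy))  # decreasing
print(loss_trend(stalled))  # flat
```

Averaging over a window smooths out step-to-step noise, so a single noisy step does not trigger a false alarm.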

In [None]:
# ===== Configure Training Parameters =====
from trl import SFTTrainer
from transformers import TrainingArguments

print("⚙️ Configuring training parameters...")

# Training arguments configuration
training_args = TrainingArguments(
    # ===== Batch Processing =====
    per_device_train_batch_size=1,      # Batch size per device (adjust based on memory)
    gradient_accumulation_steps=8,      # Gradient accumulation (effective batch = 1 × 8 = 8)
    
    # ===== Learning Rate =====
    learning_rate=1e-4,                 # Initial learning rate (LoRA typically uses 1e-4 to 2e-4)
    lr_scheduler_type="linear",         # LR scheduler: "linear", "cosine", "constant"
    warmup_steps=0,                     # Warmup steps (optional, typically 10% of total)
    
    # ===== Training Duration =====
    num_train_epochs=1,                 # Number of epochs (or use max_steps instead)
    # max_steps=100,                    # Maximum training steps (choose one with num_train_epochs)
    
    # ===== Optimizer =====
    optim="adamw_8bit",                 # 8-bit AdamW optimizer (saves memory)
    weight_decay=0.01,                  # Weight decay (decoupled, as in AdamW)
    
    # ===== Precision =====
    fp16=True,                          # Use FP16 mixed precision (T4, V100, etc.)
    bf16=False,                         # Use BF16 (A100, H100 newer GPUs)
    
    # ===== Saving & Logging =====
    output_dir="outputs",               # Output directory
    save_strategy="steps",              # Save strategy: "steps" or "epoch"
    save_steps=200,                     # Save checkpoint every N steps
    logging_steps=5,                    # Log every N steps
    report_to="none",                   # Reporting: "tensorboard", "wandb", or "none"
    
    # ===== Other Settings =====
    remove_unused_columns=False,        # Keep all data columns
    seed=42,                            # Random seed (for reproducibility)
)

# ===== Initialize Trainer =====
print("🚀 Initializing SFT trainer...")

trainer = SFTTrainer(
    model=model,                        # Model with LoRA configured
    tokenizer=tokenizer,                # Tokenizer
    train_dataset=ds,                   # Training dataset
    dataset_text_field="text",          # Text field name in dataset
    args=training_args,                 # Training arguments
    max_seq_length=MAX_LEN,             # Maximum sequence length
    packing=True,                       # Enable sample packing (improves GPU utilization)
    use_cache=False,                    # Disable KV cache (not needed for training; conflicts with gradient checkpointing)
)

print("✅ Trainer configured!")
print(f"📊 Training Information:")
print(f"   - Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"   - Total training samples: {len(ds)}")
print(f"   - Estimated training steps (before packing): {len(ds) // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps)}")

# ===== Start Training =====
print("\n🏃 Starting training... (this may take a few minutes)")
print("💡 Tip: Watch for decreasing loss values\n")

# Execute training
trainer_stats = trainer.train(
    # resume_from_checkpoint=True,      # Resume from checkpoint if exists
)

# ===== Training Complete =====
print("\n🎉 Training complete!")
print(f"📊 Training Statistics:")
print(f"   - Total training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"   - Samples per second: {trainer_stats.metrics['train_samples_per_second']:.2f}")
print(f"   - Average training loss: {trainer_stats.metrics['train_loss']:.4f}")

## 💾 Step 5: Save the Fine-tuned Model

In [None]:
# ===== Save Fine-tuned Model =====
print("💾 Saving model...")

# Save path
save_path = "ernie-4.5-0.3b-sft-merged"

# Execute save (merge LoRA weights into base model)
model.save_pretrained_merged(
    save_path,                          # Save path
    tokenizer,                          # Also save tokenizer
    save_method="merged_16bit",         # Save method:
                                       # - "merged_16bit": Merge and save as 16-bit (recommended)
                                       # - "merged_4bit": 4-bit quantized save (smallest)
                                       # - "lora": Save only LoRA adapter
)

print(f"✅ Model saved successfully!")
print(f"📁 Save location: {save_path}/")
print(f"📦 File descriptions:")
print(f"   - config.json: Model configuration file")
print(f"   - model.safetensors: Model weights file")
print(f"   - tokenizer.json: Tokenizer file")
print(f"   - tokenizer_config.json: Tokenizer configuration")

### 🎉 Congratulations! Training Complete

Your fine-tuned model is saved in `ernie-4.5-0.3b-sft-merged/`, relative to the notebook's working directory (on Google Colab: `/content/ernie-4.5-0.3b-sft-merged`).

#### Save Options Explained

We used the `save_pretrained_merged` method, which offers several save options:

1. **merged_16bit** (used in this tutorial)
   - Merges LoRA weights back into original model
   - Saves as 16-bit precision
   - Moderate file size, fast inference
   - Best for deployment

2. **merged_4bit**
   - 4-bit quantized save
   - Smallest files (about 1/4 original)
   - Best for resource-constrained environments

3. **lora**
   - Saves only LoRA adapter
   - Extremely small files (typically <100MB)
   - Requires original model to use
   - Best for multi-task switching scenarios
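The size differences between these options follow from simple arithmetic: file size is roughly parameter count times bits per parameter. The sketch below uses nominal parameter counts read off the model names, so treat the results as estimates. Real checkpoints differ slightly, and 4-bit files also store quantization constants.

```python
def weights_size_gb(num_params, bits_per_param):
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Nominal parameter counts taken from the model names (approximate)
models = [("ERNIE-4.5-0.3B", 0.3e9), ("ERNIE-4.5-21B-A3B", 21e9)]

for name, params in models:
    size_16 = weights_size_gb(params, 16)
    size_4 = weights_size_gb(params, 4)
    print(f"{name}: 16-bit ~ {size_16:.2f} GB, 4-bit ~ {size_4:.2f} GB")
```

The 4-bit estimate is always one quarter of the 16-bit one, which matches the "about 1/4 original" figure above.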

#### Next Steps: Using Your Model

```python
# Load your fine-tuned model for inference
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "ernie-4.5-0.3b-sft-merged"
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

# Build the prompt with the chat template, ending where the assistant should reply
messages = [{"role": "user", "content": "Your question here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Keep the inputs on the same device as the model
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)
```

In [None]:
# ===== Memory Cleanup (Optional) =====
# Run this code if you need to free GPU memory for other tasks

# import gc
# import torch

# # Delete model and trainer objects
# del model, tokenizer, trainer
# 
# # Force garbage collection
# gc.collect()
# 
# # Clear CUDA cache
# if torch.cuda.is_available():
#     torch.cuda.empty_cache()
#     print("✅ GPU memory cleared!")

## 📚 Summary & Advanced Guide

### 🎯 Skills You've Mastered

By completing this tutorial, you've learned to:
1. ✅ Accelerate model training with Unsloth
2. ✅ Configure and apply LoRA fine-tuning
3. ✅ Prepare and format training data
4. ✅ Adjust training hyperparameters
5. ✅ Save and deploy fine-tuned models

### 🚀 Advanced Optimization Tips

#### 1. **Data Quality Optimization**
- Ensure data diversity and balance
- Clean low-quality samples
- Add edge cases and challenging examples
- Use data augmentation techniques

#### 2. **Hyperparameter Tuning**
- Use grid search or Bayesian optimization
- Monitor validation set performance
- Implement early stopping to prevent overfitting
- Apply learning rate warmup strategies

#### 3. **Multi-task Learning**
- Train multiple LoRA adapters
- Implement task routing mechanisms
- Share base model to save resources

#### 4. **Production Deployment**
- Use ONNX or TensorRT for acceleration
- Implement batch inference
- Add caching mechanisms
- Monitor model performance metrics

### 🔧 Troubleshooting Guide

| Issue | Possible Cause | Solution |
|-------|----------------|----------|
| OOM (Out of Memory) | Batch size too large/sequence too long | Reduce batch size/use gradient accumulation/enable quantization |
| Loss not decreasing | Improper learning rate/data issues | Adjust learning rate/check data format |
| Slow training | Unoptimized configuration | Enable fp16/use packing/upgrade Unsloth |
| Poor results | Insufficient data/improper parameters | Increase data/adjust LoRA rank |

### 📖 Recommended Learning Resources

1. **Unsloth Documentation**: [github.com/unslothai/unsloth](https://github.com/unslothai/unsloth)
2. **LoRA Paper**: [arxiv.org/abs/2106.09685](https://arxiv.org/abs/2106.09685)
3. **Hugging Face Tutorials**: [huggingface.co/docs/peft](https://huggingface.co/docs/peft)
4. **ERNIE Model Cards**: [huggingface.co/baidu](https://huggingface.co/baidu)

### 💬 Join the Community

- Having issues? Open a GitHub Issue
- Want to share experiences? Join our technical community
- Have improvements? Submit a Pull Request

---

**Wishing you success on your AI fine-tuning journey!** 🌟

*If this tutorial helped you, please give our repository a ⭐ Star!*