## Preface

As pre-trained Large Language Models (LLMs) demonstrate exceptional general capabilities in natural language processing, efficiently adapting them to specific scenarios and maximizing their value has become a focal point for both industry and academia. This tutorial aims to provide developers and researchers with a complete, efficient, and resource-controlled fine-tuning practice solution.

### I. Model Fine-tuning Overview

Model fine-tuning is a crucial transfer learning technique. The core idea is to leverage a foundational model that has been pre-trained on massive unlabeled corpora and possesses rich world knowledge and language capabilities, then further train it using relatively smaller, labeled datasets targeted at specific downstream tasks or domains.

This process doesn't start from scratch; instead, it adjusts the model's existing parameters to better align its knowledge structure and capability distribution with the data characteristics and requirements of the target task. Through fine-tuning, the model transforms from a "general problem solver" to a "domain-specific expert," achieving significant performance improvements in target applications.

### II. Core Objectives and Application Value

The primary goals of model fine-tuning are to achieve model customization, specialization, and alignment.

### III. Technology Stack and Methodology Adopted in This Tutorial

To ensure efficiency and reproducibility of this fine-tuning practice, this tutorial is built upon a cutting-edge set of open-source tools and advanced methodologies:

**Core Optimization Engine: Unsloth**

Unsloth is an open-source library focused on improving LLM fine-tuning efficiency. Its optimized GPU kernels are reported to deliver up to 2x faster training and over 60% lower memory usage, which makes fine-tuning large models feasible on consumer-grade hardware. Actual gains depend on the model, sequence length, and GPU.

**Key Implementation Technologies:**

**Parameter-Efficient Fine-Tuning (PEFT / LoRA):** This tutorial adopts Low-Rank Adaptation (LoRA). This method freezes the model's original weights and trains only a small number of pluggable adapter modules added alongside them, significantly reducing computational and storage overhead during training while largely preserving model performance.

**4-bit Model Quantization:** Using the bitsandbytes library, the model can be loaded with 4-bit quantization (`load_in_4bit=True`). This technique shrinks the model's static memory footprint and runtime memory peaks by reducing the numerical precision of the model weights, which is key to fine-tuning large models in resource-constrained environments. The loading cell later in this tutorial leaves the option commented out, since the 0.3B model fits in memory without it; enable it when memory is tight.
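
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It counts weight storage only and ignores quantization metadata, activations, gradients, and optimizer state, so real usage will be higher. The parameter counts are the nominal figures from the model names, which is an assumption; the exact counts differ.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / (1024 ** 3)

# Nominal sizes taken from the model names (assumption, not exact counts).
for name, params in [("0.3B", 0.3e9), ("21B", 21e9)]:
    fp16 = weight_memory_gib(params, 16)
    int4 = weight_memory_gib(params, 4)
    print(f"{name}: 16-bit ~ {fp16:.1f} GiB, 4-bit ~ {int4:.1f} GiB")
```

Going from 16-bit to 4-bit weights cuts weight storage to roughly a quarter, which is why quantized loading matters far more for the 21B model than for the 0.3B one.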

This tutorial targets the open-source ERNIE 4.5 family. baidu/ERNIE-4.5-21B-A3B-PT is the intended base model for real tasks, while the code cells default to the much smaller baidu/ERNIE-4.5-0.3B-PT so that a first run finishes quickly. All practices are built upon the industry-standard Hugging Face ecosystem (TRL, Transformers). Now, let's begin our fine-tuning exploration.

## Environment Setup

### 1. Environment Setup: Install Unsloth and Related Dependencies

In [None]:
# Install the latest version of unsloth from GitHub source code
# This ensures we're using the latest features and fixes
!pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

# Install bitsandbytes and unsloth_zoo packages
# bitsandbytes is a library for quantization and 8-bit optimizers, helping reduce memory usage
# unsloth_zoo is Unsloth's companion package of shared utilities that unsloth depends on
!pip install bitsandbytes unsloth_zoo
!pip install -U transformers
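
After installation, it can help to record which versions were actually installed, since these libraries change quickly. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` and the package list are illustrative choices, not part of Unsloth.

```python
from importlib import metadata

def installed_versions(packages):
    """Return {package: version or None} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions

report = installed_versions(
    ["unsloth", "unsloth_zoo", "bitsandbytes", "transformers", "trl", "peft", "datasets"]
)
for name, version in report.items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

Keeping this output alongside your results makes a run easier to reproduce later.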

### 2. Load Model and Configure LoRA Fine-tuning

* **Configure PEFT (LoRA):** Updating every weight of a large model is usually impractical because of the compute and memory it requires. We therefore use PEFT (Parameter-Efficient Fine-Tuning), of which LoRA (Low-Rank Adaptation) is the most widely used method. The `FastLanguageModel.get_peft_model` function adds LoRA adapters to our model.

**Principle:** The core idea of LoRA is to freeze the original model's weights and inject small, trainable low-rank "adapter" matrices into selected layers of the model (those named in `target_modules`, for example the attention projections).
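
This principle can be written compactly. What follows is the standard LoRA formulation, given as background rather than taken from this tutorial's code. The symbols $W_0$, $A$, $B$, $d$, and $k$ are notation introduced for this note, while $r$ and $\alpha$ correspond to `r` and `lora_alpha` in the configuration cell.

$$
h = W_0 x + \frac{\alpha}{r}\, B A x, \qquad W_0 \in \mathbb{R}^{d \times k},\; B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

Only $A$ and $B$ receive gradients; $W_0$ stays frozen. With rank-stabilized LoRA (`use_rslora = True`) the scaling factor becomes $\alpha / \sqrt{r}$ instead of $\alpha / r$, so for `r = 16` and `lora_alpha = 32` the factor is $32 / 4 = 8$ rather than $32 / 16 = 2$.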

**Effect:** Instead of training all of the model's parameters (tens of billions for the largest models), we train only the LoRA parameters, which are a small fraction of that number. This dramatically reduces the memory and time required for training.
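
The size of the saving is easy to check with a little arithmetic. The sketch below compares one full linear layer with its LoRA factors; the 4096 x 4096 layer shape is a hypothetical value chosen for illustration and is not ERNIE's actual dimension.

```python
def dense_params(d_out: int, d_in: int) -> int:
    """Weights in a full d_out x d_in linear layer (bias ignored)."""
    return d_out * d_in

def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Weights in the two LoRA factors: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)

# Hypothetical layer shape, chosen only for illustration.
d_out, d_in, r = 4096, 4096, 16
full = dense_params(d_out, d_in)
extra = lora_params(d_out, d_in, r)
print(f"full layer: {full:,} weights")
print(f"LoRA (r={r}): {extra:,} trainable weights ({100 * extra / full:.2f}% of the layer)")
```

Doubling `r` doubles the number of trainable weights, so the rank trades adapter capacity directly against memory.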

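`use_gradient_checkpointing` is another important memory optimization. When enabled, it reduces peak memory usage during training by recomputing intermediate activations during backpropagation rather than storing them, at the cost of some extra compute. It is an argument of `get_peft_model`; the cell below does not set it explicitly, so the library's default applies.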

In [None]:
# 1) Load model
from unsloth import FastModel
import torch

MAX_LEN = 4096

# Two model sizes are available for the workflow below: baidu/ERNIE-4.5-0.3B-PT, baidu/ERNIE-4.5-21B-A3B-PT
# For first-time script testing, we recommend using the 0.3B small model for quick experience (though it may struggle with most real downstream tasks)
# For normal task training, we recommend switching to A100 GPU and training with the 21B model
model, tokenizer = FastModel.from_pretrained(
    model_name      = "baidu/ERNIE-4.5-0.3B-PT", # baidu/ERNIE-4.5-0.3B-PT, baidu/ERNIE-4.5-21B-A3B-PT
    # max_seq_length  = MAX_LEN,
    # load_in_4bit    = False,            # Common QLoRA configuration, set to True if memory is tight
    # load_in_8bit    = False,
    # full_finetuning = False,           # LoRA/QLoRA training; set True for full parameter training (requires more memory)
    trust_remote_code = True,          # Required for ERNIE
)

# 2) Configure LoRA
from unsloth import FastLanguageModel
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, lora_alpha = 32, lora_dropout = 0.05,
    # The r value can be increased based on task difficulty and data volume (larger rank can store more new knowledge)
    # Alpha is recommended to be set to 2x the rank
    target_modules = "all-linear",     # Unsloth will automatically handle; you can also specify specific modules
    use_rslora = True,                 # Rank-stabilized LoRA: scales adapters by alpha/sqrt(r) instead of alpha/r
)

### 3. Prepare Fine-tuning Data

The model is ready; now we need to prepare the "textbook" for it: our training dataset. The effectiveness of fine-tuning depends heavily on the quality and format of the training data. If the data does not follow the chat template the model expects, training effectiveness will be significantly reduced.

This code loads a raw dataset and strictly converts it into the conversation format expected by the ERNIE model.

In [None]:
# 3) Prepare data (example: a question/answer dataset converted to chat format)
from datasets import load_dataset
ds = load_dataset("microsoft/orca-math-word-problems-200k", split="train[:1%]")  # Replace with your data

def format_chat(ex):
    msgs = [{"role": "user", "content": ex["question"]}]
    if "answer" in ex and ex["answer"]:
        msgs.append({"role": "assistant","content": ex["answer"]})
    return {"text": tokenizer.apply_chat_template(
        msgs, tokenize=False, add_generation_prompt=False
    )}
ds = ds.map(format_chat, remove_columns=ds.column_names)

# Print one data sample to see the underlying input sequence for the model
ds[:1]

As we can see, ERNIE's underlying sequence structure is: `<|begin_of_sentence|>User: XXX.\nAssistant: XXX<|end_of_sentence|>`
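
To see this layout without loading a tokenizer, the sketch below rebuilds it by hand. It is only an imitation of the structure quoted above: the helper `render_single_turn` is hypothetical, and the real template may differ in whitespace or multi-turn handling. For training, always use `tokenizer.apply_chat_template` as in the data preparation cell.

```python
BOS = "<|begin_of_sentence|>"
EOS = "<|end_of_sentence|>"

def render_single_turn(question: str, answer: str) -> str:
    """Hand-built imitation of the single-turn layout, for inspection only."""
    return f"{BOS}User: {question}\nAssistant: {answer}{EOS}"

sample = render_single_turn("What is 2 + 3?", "2 + 3 = 5")
print(repr(sample))
```

Comparing a hand-built string like this with the real formatted sample is a quick way to spot a missing end-of-sentence token or a wrong role label.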

### 4. Configure and Launch Model Fine-tuning (SFT)

Now we'll define the training hyperparameters with `TrainingArguments` from the transformers library and launch the fine-tuning process with trl's `SFTTrainer`. Unsloth integrates with `SFTTrainer` in the background, allowing us to define a complex training loop with very concise code.

The workflow in this code block is as follows:

**Instantiate SFTTrainer:**
- SFTTrainer is a powerful tool specifically designed for Supervised Fine-Tuning. We pass the prepared model, tokenizer, and formatted dataset to it.
- `dataset_text_field="text"` explicitly tells the trainer that the column containing the text content we want to train on is called "text".
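- `max_seq_length` caps each training sequence at the maximum length set earlier (`MAX_LEN`). Because the cell below sets `packing=True`, short samples are concatenated up to this length rather than padded individually.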

**Configure Training Parameters (TrainingArguments):**
- This is the "control panel" for fine-tuning, where we define all training hyperparameters.
- **Memory optimization combination:** `per_device_train_batch_size=1` and `gradient_accumulation_steps=8` work together to achieve an effective batch size of 8 (1 * 8) while keeping single-iteration memory requirements low. This is a key technique for training with limited memory.
- **Performance optimization:** Mixed precision training can significantly improve training speed. The cell below sets `fp16=True` and `bf16=False`; on GPUs that support bfloat16, swap the two flags. `optim="adamw_8bit"` uses 8-bit optimizer states to further save memory.
- **Training schedule:** `num_train_epochs=1` trains for one pass over the 1% demonstration subset. To cap the run at a fixed number of optimizer steps instead (e.g., 500-1000 steps on a larger dataset), set `max_steps`. `learning_rate=1e-4` is used here; values in the range 1e-4 to 2e-4 are common for LoRA fine-tuning.
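
The batch-size arithmetic can be sketched in a few lines. The helper names and the figure of 2000 training sequences are hypothetical; with packing enabled, the real number of sequences depends on how samples are concatenated, and the trainer's handling of a final partial batch may differ slightly.

```python
import math

def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Sequences contributing to each optimizer update."""
    return per_device * grad_accum * num_devices

def optimizer_steps_per_epoch(num_sequences: int, eff_batch: int) -> int:
    """Approximate number of optimizer updates in one pass over the data."""
    return math.ceil(num_sequences / eff_batch)

print(effective_batch_size(2, 4))          # 8
print(effective_batch_size(1, 8))          # 8
print(optimizer_steps_per_epoch(2000, 8))  # 250
```

Both combinations give the same effective batch size; the smaller per-device batch lowers peak memory at the cost of more forward and backward passes per update.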

**Start Training (trainer.train()):**
- After calling this simple function, the complex training loop (data loading, forward pass, loss calculation, backpropagation, parameter updates) begins automatically.
- In the output, you should see the training loss trending downward (with some step-to-step noise), indicating that the model is learning from the data.

In [None]:
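# 4) Train with TRL's SFTTrainer (Unsloth officially recommended approach)
# Note: this cell passes tokenizer, dataset_text_field, max_seq_length and packing directly to SFTTrainer.
# Recent TRL releases move these options to SFTConfig and rename tokenizer to processing_class;
# adapt the call if your installed TRL version rejects these arguments.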
from trl import SFTTrainer
from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 8,
    learning_rate = 1e-4,             # Can be reduced to 2e-5 for longer training
    logging_steps = 5,
    num_train_epochs = 1,             # Or use max_steps
    fp16 = True,
    bf16 = False,
    optim = "adamw_8bit",
    lr_scheduler_type = "linear",
    weight_decay = 0.01,
    output_dir = "outputs",
    save_strategy = "steps",
    save_steps = 200,                 # Save checkpoint every 200 steps for easy resume
    report_to = "none",
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = ds,
    dataset_text_field = "text",
    args = args,
    max_seq_length = MAX_LEN,
    packing = True,                   # Makes sample concatenation more efficient
    use_cache=False,
)

trainer_stats = trainer.train()       # Can pass resume_from_checkpoint=True to resume training
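
The cell above hard-codes `fp16 = True` and `bf16 = False`. A small helper can pick the two flags from a single capability check instead. This is a sketch under assumptions: `precision_flags` is a hypothetical helper, and in a real run its argument would come from `torch.cuda.is_bf16_supported()`.

```python
def precision_flags(bf16_supported: bool) -> dict:
    """Return mutually exclusive mixed-precision flags for TrainingArguments."""
    return {"bf16": bf16_supported, "fp16": not bf16_supported}

print(precision_flags(True))   # {'bf16': True, 'fp16': False}
print(precision_flags(False))  # {'bf16': False, 'fp16': True}
```

The result can be unpacked into the arguments, for example `TrainingArguments(..., **precision_flags(torch.cuda.is_bf16_supported()))`, after removing the explicit `fp16` and `bf16` lines.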

### 5. Save the Fine-tuned Model (LoRA Merged into 16-bit Weights)

In [None]:
model.save_pretrained_merged(
    "ernie-4.5-0.3b-sft-merged",
    tokenizer,
    save_method = "merged_16bit",
)

The merged post-training weights are saved to the `ernie-4.5-0.3b-sft-merged` directory under the current working directory (on Google Colab: /content/ernie-4.5-0.3b-sft-merged).
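
Before moving the saved model elsewhere, a quick look at the output directory can confirm that the save completed. The sketch below assumes the usual Hugging Face layout (a `config.json` plus one or more `.safetensors` or `.bin` weight files); `summarize_checkpoint` is a hypothetical helper, not part of Unsloth.

```python
from pathlib import Path

def summarize_checkpoint(directory: str) -> dict:
    """List what a saved model directory contains, without loading it."""
    root = Path(directory)
    files = []
    if root.is_dir():
        files = sorted(p.name for p in root.iterdir() if p.is_file())
    total_bytes = sum((root / name).stat().st_size for name in files)
    return {
        "exists": root.is_dir(),
        "has_config": "config.json" in files,
        "weight_files": [f for f in files if f.endswith((".safetensors", ".bin"))],
        "total_mib": round(total_bytes / 2 ** 20, 1),
    }

print(summarize_checkpoint("ernie-4.5-0.3b-sft-merged"))
```

An empty `weight_files` list or a missing config suggests the save was interrupted or written to a different path.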

In [None]:
# Clean up memory if needed
# import gc
# del model, tokenizer, trainer
# gc.collect()
# torch.cuda.empty_cache()