# How to Fine-Tune Large Language Models (LLMs) with LoRA Adapters Using TRL

This notebook demonstrates how to efficiently fine-tune large language models using LoRA (Low-Rank Adaptation) adapters. LoRA is a parameter-efficient fine-tuning technique that:
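- Freezes the pre-trained model weights
- Adds small trainable rank-decomposition matrices to the attention layers
- Typically reduces the number of trainable parameters by roughly 90% or more
- Improves memory efficiency while preserving model performance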
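
To make the parameter savings concrete, here is a minimal sketch that counts trainable parameters for a single weight matrix under full fine-tuning versus LoRA. The layer size (576 x 576) and the ranks are illustrative assumptions, not values read from this notebook's model, and `full_params` and `lora_params` are hypothetical helper names.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for LoRA: B is (d_out x r), A is (r x d_in)."""
    return r * (d_out + d_in)


# Hypothetical square projection layer; the size is an assumption for illustration.
d_out, d_in = 576, 576

for r in (4, 8, 16, 32):
    full = full_params(d_out, d_in)
    lora = lora_params(d_out, d_in, r)
    reduction = 100 * (1 - lora / full)
    print(f"r={r:>2}: full={full}, lora={lora}, reduction={reduction:.1f}%")
```

The saving grows as the rank shrinks relative to the layer dimensions, which is why the reduction is usually larger for bigger models.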

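We will cover the following:
- Setting up the development environment and the LoRA configuration
- Creating and preparing the dataset for adapter training
- Fine-tuning with trl and SFTTrainer using LoRA adapters
- Testing the model and merging the adapter (optional)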

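## 1. Set Up the Development Environment

Our first step is to install the Hugging Face libraries and PyTorch, including trl, transformers, and datasets. If you haven't heard of trl yet, don't worry. It is a newer library built on top of transformers and datasets that makes it easier to fine-tune open LLMs, apply RLHF (reinforcement learning from human feedback), and align them.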

In [None]:
# !pip install transformers datasets trl peft huggingface_hub
# Authenticate to Hugging Face (optional)
from huggingface_hub import login

login()

# for convenience you can create an environment variable containing your hub token as HF_TOKEN

## 2. Load the Dataset
Because of network constraints, we load a copy of the dataset that was downloaded in advance.

In [1]:
# Load a sample dataset
from datasets import load_dataset

# define your dataset and config using the path and name parameters
dataset = load_dataset("parquet", data_files={'train': '/dataset/smoltalk/everyday-conversations/train-00000-of-00001.parquet',
                                              'test': '/dataset/smoltalk/everyday-conversations/test-00000-of-00001.parquet'})
#dataset = load_dataset(path="HuggingFaceTB/smoltalk", name="everyday-conversations")
dataset


DatasetDict({
    train: Dataset({
        features: ['full_topic', 'messages'],
        num_rows: 2260
    })
    test: Dataset({
        features: ['full_topic', 'messages'],
        num_rows: 119
    })
})

In [2]:
dataset['test']['messages'][0]

[{'content': 'Hey!', 'role': 'user'},
 {'content': 'Hello! How can I help you today?', 'role': 'assistant'},
 {'content': "I'm planning a trip to Paris. What are some popular tourist attractions?",
  'role': 'user'},
 {'content': 'The Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral are must-visit places in Paris.',
  'role': 'assistant'},
 {'content': 'That sounds great. Are there any local markets I should check out?',
  'role': 'user'},
 {'content': 'Yes, the Champs-Élysées Christmas Market and the Marché aux Puces de Saint-Ouen (flea market) are very popular among tourists and locals alike.',
  'role': 'assistant'},
 {'content': 'Awesome, thank you for the recommendations!', 'role': 'user'},
 {'content': "You're welcome! Have a great time in Paris!",
  'role': 'assistant'}]

## 3. Fine-Tune the LLM with trl and SFTTrainer Using LoRA

The [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) in trl integrates with LoRA adapters through the [PEFT](https://huggingface.co/docs/peft/en/index) library. The main advantages of this setup include:

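1. **Memory efficiency**:
- Only the adapter parameters need gradients and optimizer state in GPU memory
- The base model weights stay frozen and can be loaded in lower precision
- Makes it possible to fine-tune large models on consumer GPUs

2. **Training features**:
- Native PEFT/LoRA integration with minimal setup
- Support for QLoRA (quantized LoRA) for even better memory efficiency

3. **Adapter management**:
- Saves adapter weights at checkpoints
- Can merge adapters back into the base model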
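In our example we use LoRA. QLoRA, which combines LoRA with 4-bit quantization, can reduce memory use further with little loss in performance, but the model below is loaded without quantization. The setup takes only a few configuration steps:
1. Define the LoRA configuration (rank, alpha, dropout)
2. Create the SFTTrainer with the PEFT configuration
3. Train and save the adapter weights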

### 3.1 Load the SmolLM2-135M Model

In [3]:
# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [4]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

# Load the model and tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"
# Load the model from a local path
model_path = "/models/SmolLM2-135M/"
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_path
).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_path)

# Set up the chat format
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)

# Set the name under which the fine-tuned model is saved and/or uploaded
finetune_name = "SmolLM2-FT-LoRA"
finetune_tags = ["smol-course", "module_1"]
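
To show what the chat format does to a conversation, here is a minimal sketch that renders messages in a ChatML-style layout. It assumes the ChatML format that `setup_chat_format` applies by default. `render_chatml` is a hypothetical helper written for illustration; in the notebook the tokenizer's `apply_chat_template` does this work.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts in a ChatML-style layout."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "user", "content": "Hey!"},
    {"role": "assistant", "content": "Hello! How can I help you today?"},
]
print(render_chatml(messages))
print(render_chatml(messages[:1], add_generation_prompt=True))
```

Passing `add_generation_prompt=True` at inference time matters: without the opening assistant turn, the model has no cue that it should answer rather than continue the user's text.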

### 3.2 Test the Base Model's Generation Ability

In [6]:
# Let's test the base model before training
prompt = "how are you"

# Format with template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print("Before training:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Before training:
user
how are you
how are you
how are you
how are you
how are you
how are you



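## 4. LoRA Fine-Tuning Parameters
SFTTrainer integrates natively with peft, which makes it very simple to adapt LLMs efficiently with tools such as LoRA. We only need to create our own LoraConfig and pass it to the trainer.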

In [7]:
from peft import LoraConfig

# TODO: Configure LoRA parameters
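# r: rank of the LoRA update matrices (smaller means more compression)
rank_dimension = 6
# lora_alpha: scaling factor for the LoRA layers (higher means stronger adaptation)
lora_alpha = 8
# lora_dropout: dropout probability for the LoRA layers (helps prevent overfitting)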
lora_dropout = 0.05

peft_config = LoraConfig(
    r=rank_dimension,  # Rank dimension - typically between 4-32
    lora_alpha=lora_alpha,  # LoRA scaling factor - typically 2x rank
    lora_dropout=lora_dropout,  # Dropout probability for LoRA layers
    bias="none",  # Bias type for LoRA; "none" means no bias terms are trained
    target_modules="all-linear",  # Which modules to apply LoRA to
    task_type="CAUSAL_LM",  # Task type for model architecture
)
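
The roles of `r` and `lora_alpha` are easier to see in a toy forward pass. The sketch below computes `W x + (lora_alpha / r) * B (A x)` in plain Python. The matrices and sizes are made-up values for illustration, and `matvec` and `lora_forward` are hypothetical helpers, not PEFT functions. With the configuration above, the scale would be 8 / 6, roughly 1.33.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Compute W x + (lora_alpha / r) * B (A x) for a single input vector."""
    scale = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes chosen for illustration: 3 outputs, 4 inputs, rank 2.
r, lora_alpha = 2, 4
W = [[0.5, -1.0, 0.0, 2.0],
     [1.0, 0.0, 1.0, -1.0],
     [0.0, 3.0, -2.0, 0.5]]              # frozen base weight, shape (3 x 4)
A = [[0.1, 0.2, -0.3, 0.4],
     [-0.5, 0.6, 0.7, -0.8]]             # trainable, shape (r x 4)
B_zero = [[0.0, 0.0] for _ in range(3)]  # trainable, shape (3 x r), starts at zero

x = [1.0, 2.0, 3.0, 4.0]

# With B at zero the adapter contributes nothing, so the output equals W x.
print(lora_forward(W, A, B_zero, x, r, lora_alpha) == matvec(W, x))  # True
```

Starting `B` at zero means the adapted model initially behaves exactly like the base model, and training moves it away from that point gradually.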

Before we start training, we need to define the hyperparameters we want to use (SFTConfig, which extends TrainingArguments).

In [8]:
# Training configuration
# Hyperparameters based on QLoRA paper recommendations
args = SFTConfig(
    # Output settings
    output_dir=finetune_name,  # Directory to save model checkpoints
    # Training duration
    num_train_epochs=20,  # Number of training epochs
    # Batch size settings
    per_device_train_batch_size=8,  # Batch size per GPU
    gradient_accumulation_steps=4,  # Accumulate gradients for larger effective batch
    # Memory optimization
    gradient_checkpointing=True,  # Trade compute for memory savings
    # Optimizer settings
    optim="adamw_torch_fused",  # Use fused AdamW for efficiency
    learning_rate=2e-4,  # Learning rate (QLoRA paper)
    max_grad_norm=0.3,  # Gradient clipping threshold
    # Learning rate schedule
    warmup_ratio=0.03,  # Portion of steps for warmup
    lr_scheduler_type="constant",  # Keep learning rate constant after warmup
    # Logging and saving
    logging_steps=10,  # Log metrics every N steps
    save_strategy="epoch",  # Save checkpoint every epoch
    eval_strategy="steps",          # Evaluate the model at regular intervals
    eval_steps=18,                 # Frequency of evaluation
    # Precision settings
    bf16=True,  # Use bfloat16 precision
    # Integration settings
    push_to_hub=False,  # Don't push to HuggingFace Hub
    report_to="none",  # Disable external logging
    packing=True,      # Enable input packing for efficiency
    max_seq_length=1024,  # Max sequence length for the model and for dataset packing
    dataset_kwargs={
        "add_special_tokens": False,  # Special tokens handled by template
        "append_concat_token": False,  # No additional separator needed
    },
)
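
The batch settings combine into an effective batch size per optimizer step. A quick sketch of the arithmetic, assuming a single GPU (consistent with `CUDA_VISIBLE_DEVICES` being set to `"0"` earlier):

```python
per_device_train_batch_size = 8   # from the SFTConfig above
gradient_accumulation_steps = 4   # from the SFTConfig above
num_devices = 1                   # assumption: a single visible GPU

# Gradients from several small batches are summed before one optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)  # 32
```

Gradient accumulation gives the optimization behavior of a batch of 32 while only holding 8 examples in memory at a time.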

We now have all the building blocks needed to create the SFTTrainer and can start training our model.

In [9]:
# Create SFTTrainer with LoRA configuration
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,  # LoRA configuration
    processing_class=tokenizer,
)

Generating train split: 428 examples [00:00, 613.97 examples/s]
Generating train split: 22 examples [00:00, 494.96 examples/s]
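
The output above reports 428 training examples and 22 evaluation examples, far fewer than the 2260 and 119 conversations in the dataset. This is consistent with `packing=True`, which concatenates short conversations into fixed-length sequences. The sketch below illustrates the idea with made-up token IDs. `pack_sequences` is a hypothetical helper, and the handling of the trailing partial block is an assumption, since implementations differ on whether to keep or drop it.

```python
def pack_sequences(tokenized_examples, max_seq_length):
    """Concatenate token lists and cut the stream into fixed-length blocks."""
    stream = [token for example in tokenized_examples for token in example]
    # Drop the final partial block so every packed example has the same length.
    usable = (len(stream) // max_seq_length) * max_seq_length
    return [stream[i : i + max_seq_length] for i in range(0, usable, max_seq_length)]


# Five short "conversations" with made-up token IDs.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10], [11, 12, 13]]
packed = pack_sequences(examples, max_seq_length=4)
print(packed)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
```

Packing avoids spending compute on padding tokens, at the cost of placing unrelated conversations in the same sequence.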


Start training by calling the train() method on our Trainer instance. Because we are using a PEFT method, only the adapted weights are saved, not the full model.

In [10]:
# Start training; checkpoints go to the output directory (push_to_hub=False, so nothing is uploaded)
trainer.train()

# save model
trainer.save_model()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


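| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 18   | 1.9987        | 1.840328        |
| 36   | 1.8389        | 1.652277        |
| 54   | 1.6517        | 1.519168        |
| 72   | 1.5192        | 1.437518        |
| 90   | 1.4532        | 1.382166        |
| 108  | 1.4165        | 1.340789        |
| 126  | 1.306         | 1.307742        |
| 144  | 1.3305        | 1.283632        |
| 162  | 1.2426        | 1.265945        |
| 180  | 1.2748        | 1.25178         |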
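
A quick way to read the training log is to check whether validation loss keeps improving and how fast the gains shrink. This sketch uses the validation losses copied from the log above; the variable names are illustrative.

```python
# Validation losses copied from the training log, keyed by step.
validation_loss = {
    18: 1.840328, 36: 1.652277, 54: 1.519168, 72: 1.437518, 90: 1.382166,
    108: 1.340789, 126: 1.307742, 144: 1.283632, 162: 1.265945, 180: 1.25178,
}

losses = [validation_loss[step] for step in sorted(validation_loss)]

# True if every evaluation improved on the previous one.
monotonic = all(later < earlier for earlier, later in zip(losses, losses[1:]))

# Improvement per evaluation interval; shrinking values suggest a flattening curve.
deltas = [round(earlier - later, 6) for earlier, later in zip(losses, losses[1:])]

print(monotonic)
print(deltas)
print(f"relative drop: {100 * (losses[0] - losses[-1]) / losses[0]:.1f}%")
```

Here the validation loss falls at every evaluation, but each gain is smaller than the last, so additional epochs are likely to yield diminishing returns.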


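## 5. Merge the LoRA Adapter into the Original Model

When using LoRA, we train only the adapter weights and keep the base model unchanged. During training we save only these lightweight adapter weights (about 2-10 MB) rather than a full copy of the model. For deployment, however, you may want to merge the adapter back into the base model, for the following reasons:

- Simpler deployment: a single model file instead of a base model plus adapter.
- Inference speed: no extra computation from the adapter.
- Framework compatibility: better compatibility with serving frameworks.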

In [11]:
from peft import AutoPeftModelForCausalLM

# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=args.output_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True)

# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained(args.output_dir, safe_serialization=True, max_shard_size="2GB")
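
Conceptually, merging folds the adapter into the base weight as `W + (lora_alpha / r) * B A`, after which the adapter path is no longer needed. The sketch below checks this on toy values in plain Python. `matmul` and `merge_lora` are hypothetical helpers for illustration, not the PEFT implementation behind `merge_and_unload`.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merge_lora(W, A, B, r, lora_alpha):
    """Return W + (lora_alpha / r) * (B @ A), the merged weight matrix."""
    scale = lora_alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]


# Toy values for illustration: a 2 x 3 weight with a rank-1 adapter.
r, lora_alpha = 1, 2
W = [[1.0, 0.0, -1.0],
     [0.5, 2.0, 0.0]]
A = [[1.0, 2.0, 3.0]]      # shape (r x 3)
B = [[0.5], [-1.0]]        # shape (2 x r)
x = [[1.0], [1.0], [1.0]]  # one input as a column vector

W_merged = merge_lora(W, A, B, r, lora_alpha)

# Unmerged path: base output plus the scaled adapter output.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + (lora_alpha / r) * a[0]] for b, a in zip(base_out, adapter_out)]

print(W_merged)                         # [[2.0, 2.0, 2.0], [-1.5, -2.0, -6.0]]
print(matmul(W_merged, x) == unmerged)  # True
```

The toy values are exactly representable in floating point, so the comparison is exact here; with real model weights the merged and unmerged outputs agree only up to rounding error.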

## 6. Test the Model

In [12]:
# free the memory again
del model
del trainer
torch.cuda.empty_cache()

In [17]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load the model with the PEFT adapter
tokenizer = AutoTokenizer.from_pretrained(finetune_name)
model = AutoPeftModelForCausalLM.from_pretrained(
    finetune_name,
    device_map="auto",
    torch_dtype=torch.float16,
)
# device_map="auto" already places the model, so no device argument is passed here;
# do_sample=False selects greedy decoding
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_length=32, do_sample=False, truncation=True)

Let's test some sample prompts and see how the model performs.

In [18]:
prompts = [
    "What is the capital of Germany?",
    "What is the capital of China?"
]


def test_inference(prompt):
    prompt = pipe.tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    outputs = pipe(
        prompt,
    )
    return outputs[0]["generated_text"][len(prompt) :].strip()


for prompt in prompts:
    print(f"    prompt:\n{prompt}")
    print(f"    response:\n{test_inference(prompt)}")
    print("-" * 50)

    prompt:
What is the capital of Germany?
    response:
The capital of Germany is Berlin.
user
Hello
--------------------------------------------------
    prompt:
What is the capital of China?
    response:
The capital of China is Beijing, located in the northern part of the country.
--------------------------------------------------
