# SFT Qwen3

Fine-tune Qwen3 using off-the-shelf libraries.

- model: Qwen3-0.6B
- data: Alpaca
- method: Full / QLoRA
- platform: Colab

Then generate text with the fine-tuned model.

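Commonly used Hugging Face libraries:

1. transformers: provides the basic model wrappers, tokenizers, and the Trainer, and integrates third-party libraries for distributed training (DeepSpeed, FSDP) and inference (vLLM). The Hugging Face ecosystem hosts a large collection of mainstream open-source models and datasets, which users can upload and download.
2. datasets: wraps dataset preprocessing methods, such as those used to prepare pretraining data.
3. TRL: contains trainers for post-training; SFT is one form of post-training.
4. PEFT: parameter-efficient fine-tuning. It wraps a regular model and hides the details of each method, so the wrapped model is used just like a regular one.
5. accelerate: integrates distributed backends such as DeepSpeed and FSDP and hides the distributed details; the drawback is that it is hard to customize.

Libraries speed up development. The goal of this lecture, however, is to understand how these frameworks are implemented, so that you can build an application framework or an infra framework from scratch.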

## Official SFT training

[TRL::SFT](https://huggingface.co/docs/trl/sft_trainer)

In [1]:
from trl import SFTTrainer
from datasets import load_dataset
from trl import SFTConfig  # https://huggingface.co/docs/trl/sft_trainer#trl.SFTConfig

config = SFTConfig(
    output_dir="output/qwen3_sft",
    per_device_train_batch_size = 2,
    max_length = 256,
    max_steps = 10
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    args=config,
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


Step,Training Loss


TrainOutput(global_step=10, training_loss=2.1211803436279295, metrics={'train_runtime': 12.6362, 'train_samples_per_second': 1.583, 'train_steps_per_second': 0.791, 'total_flos': 12505752010752.0, 'train_loss': 2.1211803436279295})

## Manual SFT training

In [2]:
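import os
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-0.6B', 
                                          # from_pretrained takes cache_dir (not local_dir); remove this line if you can reach huggingface directly
                                          cache_dir=os.path.expanduser('~/.cache/huggingface/'),
                                         )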

model = AutoModelForCausalLM.from_pretrained('Qwen/Qwen3-0.6B',
                                             local_files_only=True, 
                                            dtype=torch.bfloat16)

## Data types

1. Prompt
2. Prompt-Completion: the most common
3. Messages: the most general
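
As a rough illustration of these formats, the sketch below writes samples as plain Python data and converts a prompt-completion record into a messages record. The field names (`prompt`, `completion`, `messages`) follow the convention used by TRL's dataset formats; the helper `to_messages` and the sample system prompt are hypothetical.

```python
# One sample per format, written as plain Python data.
prompt_only = {"prompt": "What are the three primary colors?"}

prompt_completion = {
    "prompt": "What are the three primary colors?",
    "completion": "The three primary colors are red, blue, and yellow.",
}

def to_messages(record, system_prompt=None):
    """Convert a prompt-completion record into a messages record (hypothetical helper)."""
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": record["prompt"]})
    messages.append({"role": "assistant", "content": record["completion"]})
    return {"messages": messages}

messages_record = to_messages(prompt_completion, system_prompt="You are a helpful assistant.")
for message in messages_record["messages"]:
    print(message["role"], "->", message["content"])
```

The messages format is the most general because it can also carry a system prompt and multiple turns, which the two flat formats cannot express.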

The following methods convert ordinary Python data into a dataset.

In [3]:
# Initialize a dataset
import datasets
from datasets import Dataset, DatasetDict

DEFINIED_SYSTEM_PROMPT='你是小冬瓜智能体,请安全详细回答用户 USER 的问题'
messages_1=[    
    {'role':'system', 'content':DEFINIED_SYSTEM_PROMPT},
    {'role':'user', 'content':'什么是人工智能?'},
    {'role':'assistant', 'content':'人工智能是让机器模拟人类思维的技术。'},
]
messages_2=[    
    {'role':'system', 'content':DEFINIED_SYSTEM_PROMPT},
    {'role':'user', 'content':'如何计算复利?'},
    {'role':'assistant', 'content':'复利计算公式：本息和 = 本金 × (1 + 利率)^期数。'},
]
messages_3=[    
    {'role':'system', 'content':DEFINIED_SYSTEM_PROMPT},
    {'role':'user', 'content':'“哈基米”翻译成英文'},
    {'role':'assistant', 'content':'“哈基米”翻译成英文通常是 "Hakimi"（人名音译）。'},
]
messages_list = [messages_1, messages_2, messages_3]

hf_dataset = Dataset.from_dict(
    {'conversation': messages_list}
)
print(hf_dataset)

my_datasets = DatasetDict({
    'train': hf_dataset,
    'test': hf_dataset,
})
print(my_datasets)

Dataset({
    features: ['conversation'],
    num_rows: 3
})
DatasetDict({
    train: Dataset({
        features: ['conversation'],
        num_rows: 3
    })
    test: Dataset({
        features: ['conversation'],
        num_rows: 3
    })
})


## Processing a public dataset by hand

Alpaca is a public dataset whose records are effectively prompt-completion pairs. We will use library functions to convert it into an SFT dataset.

In [4]:
from datasets import load_dataset
dataset = load_dataset('tatsu-lab/alpaca',
                      cache_dir="~/.cache/huggingface",)
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 52002
    })
})


In [5]:
print(dataset['train'][1],'\n')
print('instruction:', dataset['train'][1]['instruction'], '\n')
print('instruction:', dataset['train']['instruction'][1])

{'instruction': 'What are the three primary colors?', 'input': '', 'output': 'The three primary colors are red, blue, and yellow.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the three primary colors?\n\n### Response:\nThe three primary colors are red, blue, and yellow.'} 

instruction: What are the three primary colors? 

instruction: What are the three primary colors?


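## Preprocessing pipeline

The preprocessing here follows the official chat_template. The advantage of the stock chat template is that most training and inference frameworks are already adapted to it.

1. Concatenate `instruction` and `input`
2. Format the data
3. Tokenize each example
4. Write a collate function by hand

### Concatenation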

In [6]:
def map_cat_inst_input(example):
    # instruction and input are joined with no separator; input is often empty in Alpaca
    example['prompt'] = example['instruction'] + example['input']
    example['completion'] = example['output']
    return example

dataset_prompt_completion = dataset.map(map_cat_inst_input,
                                        remove_columns=["instruction", "input", "output", "text"])

In [7]:
print('original:',dataset['train'].features,'\n')
print('after map:',dataset_prompt_completion['train'].features,'\n')

print('prompt:', dataset_prompt_completion['train'][1]['prompt'], '\n')
print('completion:', dataset_prompt_completion['train']['completion'][1])

original: {'instruction': Value(dtype='string', id=None), 'input': Value(dtype='string', id=None), 'output': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None)} 

after map: {'prompt': Value(dtype='string', id=None), 'completion': Value(dtype='string', id=None)} 

prompt: What are the three primary colors? 

completion: The three primary colors are red, blue, and yellow.
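
The map function above joins `instruction` and `input` with no separator, which is harmless when `input` is empty but runs the two fields together otherwise. A minimal variant, assuming a newline is an acceptable separator (the helper name `build_prompt` is hypothetical):

```python
def build_prompt(instruction, input_text, sep="\n"):
    """Join instruction and input, inserting `sep` only when input is non-empty."""
    instruction = instruction.strip()
    input_text = input_text.strip()
    if not input_text:
        return instruction
    return instruction + sep + input_text

print(repr(build_prompt("What are the three primary colors?", "")))
print(repr(build_prompt("Classify the animal.", "dog")))
```

Switching to such a helper changes the token counts of records with a non-empty `input`, so the numbers printed later in this notebook would shift slightly.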


## Formatting

Messages:

In [8]:
tokenizer.apply_chat_template(hf_dataset[1]['conversation'], 
                              tokenize=False,
                              add_generation_prompt=False)

'<|im_start|>system\n你是小冬瓜智能体,请安全详细回答用户 USER 的问题<|im_end|>\n<|im_start|>user\n如何计算复利?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n复利计算公式：本息和 = 本金 × (1 + 利率)^期数。<|im_end|>\n'

In [9]:
input_ids = tokenizer.apply_chat_template(hf_dataset[1]['conversation'], 
                              tokenize=True,
                              add_generation_prompt=False)
print(input_ids)

[151644, 8948, 198, 105043, 30709, 99949, 100857, 100168, 31914, 11, 14880, 99464, 100700, 102104, 20002, 13872, 43589, 86119, 151645, 198, 151644, 872, 198, 100007, 100768, 58364, 59532, 30, 151645, 198, 151644, 77091, 198, 151667, 271, 151668, 271, 58364, 59532, 100768, 110322, 5122, 21894, 22226, 33108, 284, 220, 114664, 24768, 320, 16, 488, 19468, 102, 95355, 29776, 22704, 8863, 1773, 151645, 198]


In [10]:
print(tokenizer('</think>\n\n'))
print(tokenizer('<|im_end|>'))

{'input_ids': [151668, 271], 'attention_mask': [1, 1]}
{'input_ids': [151645], 'attention_mask': [1]}
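
The rendered strings in this section follow a ChatML-style layout: each turn becomes `<|im_start|>role\ncontent<|im_end|>\n`. The sketch below is a simplified stand-in for the chat template, not the actual Jinja template shipped with the tokenizer. It ignores tools and reasoning content, and it assumes that only a final assistant turn following the last user turn receives the empty `<think>` block, which is consistent with the outputs shown in this notebook. The function name `render_chatml` is hypothetical.

```python
EMPTY_THINK = "<think>\n\n</think>\n\n"

def render_chatml(messages, add_generation_prompt=False):
    """Simplified stand-in for the chat template (no tools, no reasoning content)."""
    # Index of the last user turn; falls back to the final index when there is none.
    last_user = len(messages) - 1
    for i in range(len(messages) - 1, -1, -1):
        if messages[i]["role"] == "user":
            last_user = i
            break

    text = ""
    for i, message in enumerate(messages):
        content = message["content"]
        is_last = i == len(messages) - 1
        # Assumption: only a final assistant turn after the last user turn gets the think block.
        if message["role"] == "assistant" and i > last_user and is_last:
            content = EMPTY_THINK + content
        text += "<|im_start|>" + message["role"] + "\n" + content + "<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn for the model to complete at inference time.
        text += "<|im_start|>assistant\n"
    return text

demo = [
    {"role": "system", "content": "S"},
    {"role": "user", "content": "U"},
    {"role": "assistant", "content": "A"},
]
print(repr(render_chatml(demo)))
```

For training data the real `tokenizer.apply_chat_template` should still be used; the sketch only makes the layout of the special tokens easier to see.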


## prompt-completion

In [11]:
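# The commented-out call below fails because a record is a dict, not a list of messages.
# Instead, write a map function that builds the messages from each record.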
# tokenizer.apply_chat_template(dataset_prompt_completion['train'][1], 
#                               tokenize=False,
#                               add_generation_prompt=False)

def map_apply_chat_template(example):
    tmp_messages = [
        {'role':'system', 'content':DEFINIED_SYSTEM_PROMPT},
        {'role':'user', 'content':example['prompt']},
        {'role':'assistant', 'content':example['completion']},
    ]
    example['text'] = tokenizer.apply_chat_template(tmp_messages,
                                                   tokenize=False,)
    return example

dataset_chat = dataset_prompt_completion.map(map_apply_chat_template,)

In [12]:
print(dataset_prompt_completion['train'][0])
print(dataset_chat['train'][0])

{'prompt': 'Give three tips for staying healthy.', 'completion': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
{'prompt': 'Give three tips for staying healthy.', 'completion': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': '<|im_start|>system\n你是小冬瓜智能体,请安全详细回答用户 USER 的问题<|im_end|>\n<|im_start|>user\nGive three tips for staying healthy.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.<|im_end|>\n'}


### token_id datasets

This notebook tokenizes one example at a time. How can `map` be used to tokenize a batch of examples at once and speed up encoding?
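
One possible answer, as a sketch: with `batched=True`, `Dataset.map` passes the function a dict of columns (each a list) instead of a single record, so many texts can be encoded per call. To stay runnable without downloading a model, the block uses a toy whitespace tokenizer (`toy_encode`, hypothetical) in place of the real one.

```python
def toy_encode(text):
    """Toy stand-in for a real tokenizer: one fake id per whitespace-separated piece."""
    return [len(piece) for piece in text.split()]

def map_to_token_batched(batch):
    # `batch` is a dict of lists, e.g. {"text": [text_0, text_1, ...]}
    input_ids = [toy_encode(text) for text in batch["text"]]
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(ids) for ids in input_ids],
    }

batch = {"text": ["hello world", "supervised fine tuning"]}
encoded = map_to_token_batched(batch)
print(encoded)

# With the real objects from this notebook the call would look like this (not executed here):
# dataset_token = dataset_chat.map(
#     lambda batch: tokenizer(batch["text"]),
#     batched=True, batch_size=1000,
#     remove_columns=["prompt", "completion", "text"],
# )
```

The function returns one list per output column, with the same number of rows as the input batch.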

In [13]:
def map_to_token(example):
    example['input_ids'] = tokenizer.encode(
        example['text'],
        return_tensors='pt',
        # padding='longest',
        # padding_side='left',
        # max_length=1024,
        # truncation=True,
    )[0]

    seq_len = example['input_ids'].shape[0]

    example['attention_mask'] = torch.ones(seq_len, dtype=torch.long)
    
    return example

dataset_token = dataset_chat.map(map_to_token,
                                 remove_columns=["prompt", "completion", "text"])

In [14]:
print(dataset_token['train'][0])

{'input_ids': [151644, 8948, 198, 105043, 30709, 99949, 100857, 100168, 31914, 11, 14880, 99464, 100700, 102104, 20002, 13872, 43589, 86119, 151645, 198, 151644, 872, 198, 35127, 2326, 10414, 369, 19429, 9314, 13, 151645, 198, 151644, 77091, 198, 151667, 271, 151668, 271, 16, 5142, 266, 264, 23831, 9968, 323, 1281, 2704, 311, 2924, 11260, 315, 25322, 323, 23880, 13, 715, 17, 13, 32818, 15502, 311, 2506, 697, 2487, 4541, 323, 3746, 13, 715, 18, 13, 2126, 3322, 6084, 323, 10306, 264, 12966, 6084, 9700, 13, 151645, 198], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


### Building labels

Given the assistant reply below, where does the completion start?
```
<|im_start|>assistant\n<think>\n\n</think>\n\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.<|im_end|>\n
```

Answer:

It starts at the first `<think>`. The `<think>\n\n</think>\n\n` inside the reply is a special segment used for chain-of-thought reasoning. Its usage is covered in the RL chapter, so beginners can simply treat it as part of the reply pattern.

In [15]:
tmp_messages=[
    {'role':'system', 'content':'A'},
    {'role':'assistant', 'content':'B'},
]
tokenizer.apply_chat_template(tmp_messages, 
                              tokenize=False, 
                              add_generation_prompt=True)

'<|im_start|>system\nA<|im_end|>\n<|im_start|>assistant\nB<|im_end|>\n<|im_start|>assistant\n'

In [16]:
print(tokenizer('\n<think>\n\n</think>\n\n'))
print(tokenizer('<think>'))
print(tokenizer('<|im_end|>'))

{'input_ids': [198, 151667, 271, 151668, 271], 'attention_mask': [1, 1, 1, 1, 1]}
{'input_ids': [151667], 'attention_mask': [1]}
{'input_ids': [151645], 'attention_mask': [1]}


In [17]:
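THINK_START_ID = 151667  # token id of <think>, see the tokenizer output above

def find_completion_start_end(token_ids):
    # start: index of the first <think> token (-1 if absent)
    # end: index of the trailing '\n', so [start:end] stops right after <|im_end|>
    start = -1
    for i, token_id in enumerate(token_ids):
        if token_id == THINK_START_ID:
            start = i
            break
    end = len(token_ids) - 1
    return start, end

tmp_token_ids = dataset_token['train'][0]['input_ids']
start, end = find_completion_start_end( tmp_token_ids )
print(tokenizer.decode(tmp_token_ids[start:end])) # the span must end with <|im_end|>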

<think>

</think>

1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.<|im_end|>


In [18]:
def map_get_label(example):
    example['input_ids'] = torch.tensor(example['input_ids'], dtype=torch.long)
    example['attention_mask'] = torch.tensor(example['attention_mask'], dtype=torch.long)
    seq_len = example['input_ids'].shape[0]
    start, end = find_completion_start_end(example['input_ids'])

    # Mask every position with -100 (ignored by cross-entropy), then copy in the completion tokens
    example['labels'] = torch.ones(seq_len, dtype=torch.long) * -100
    example['labels'][start:end] = example['input_ids'][start:end]

    # Shift labels left by one so that position t holds the token at t+1.
    # Caution: Hugging Face causal-LM models shift labels again inside their loss,
    # so keep this roll only when the loss is computed by hand.
    example['labels'] = example['labels'].roll(shifts=-1)
    return example

dataset_sft = dataset_token.map(map_get_label,
                                num_proc=32,  # run with 32 worker processes
                               )

In [19]:
input_ids = dataset_sft['train'][0]['input_ids']
print(input_ids)
attention_mask = dataset_sft['train'][0]['attention_mask']
print(attention_mask)
labels = dataset_sft['train'][0]['labels']
print(labels)

[151644, 8948, 198, 105043, 30709, 99949, 100857, 100168, 31914, 11, 14880, 99464, 100700, 102104, 20002, 13872, 43589, 86119, 151645, 198, 151644, 872, 198, 35127, 2326, 10414, 369, 19429, 9314, 13, 151645, 198, 151644, 77091, 198, 151667, 271, 151668, 271, 16, 5142, 266, 264, 23831, 9968, 323, 1281, 2704, 311, 2924, 11260, 315, 25322, 323, 23880, 13, 715, 17, 13, 32818, 15502, 311, 2506, 697, 2487, 4541, 323, 3746, 13, 715, 18, 13, 2126, 3322, 6084, 323, 10306, 264, 12966, 6084, 9700, 13, 151645, 198]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 151667, 271, 151668, 271, 16, 514

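The output matches expectations: the prompt positions are masked with -100, and the completion labels begin one position earlier than in `input_ids` because of the left shift.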
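
To make the masking and the shift concrete, here is a small pure-Python sketch on a toy sequence. It assumes the loss is computed by hand as cross-entropy between the logits at position t and `labels[t]`; the token values and the helper name `build_shifted_labels` are made up for illustration.

```python
IGNORE_INDEX = -100

def build_shifted_labels(input_ids, start, end):
    """Mask everything outside [start, end), then shift left so labels[t] is the token at t + 1."""
    labels = [IGNORE_INDEX] * len(input_ids)
    labels[start:end] = input_ids[start:end]
    # Equivalent to tensor.roll(shifts=-1): the first element wraps around to the end.
    return labels[1:] + labels[:1]

# Toy sequence: three prompt tokens, three completion tokens, one trailing newline token.
input_ids = [11, 12, 13, 21, 22, 23, 99]
labels = build_shifted_labels(input_ids, start=3, end=6)
for t, (token, label) in enumerate(zip(input_ids, labels)):
    print(t, token, label)
```

The last prompt position (index 2) is trained to predict the first completion token. When labels are instead passed to a Hugging Face causal-LM through its `labels` argument, the model performs this shift itself, so unshifted labels should be used there.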

## filter

Remove overly long texts to avoid running out of GPU memory. This also avoids excessive padding, which lowers training efficiency.

In [20]:
config_max_len = 256
dataset_sft_filter = dataset_sft.filter( lambda x: len(x["input_ids"]) < config_max_len)
print(dataset_sft)
print(dataset_sft_filter)
print(len(dataset_sft_filter['train'][0]['input_ids']))

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 52002
    })
})
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 50867
    })
})
84
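
To see why long outliers hurt efficiency, the sketch below computes the fraction of padded positions in a batch that is padded to its longest sequence. The lengths are made-up numbers and `padding_ratio` is a hypothetical helper.

```python
def padding_ratio(lengths):
    """Fraction of positions that are padding when a batch is padded to its longest sequence."""
    padded_total = max(lengths) * len(lengths)
    return 1 - sum(lengths) / padded_total

with_outlier = padding_ratio([84, 90, 100, 1000])  # one long outlier dominates the batch
similar = padding_ratio([84, 90, 100, 110])        # similar lengths waste little
print(round(with_outlier, 4), round(similar, 4))
```

A single long sequence forces every other row in the batch to be padded up to its length, which is why filtering (or grouping by length) pays off.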


### Using a collate function

In [21]:
from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader
from transformers.tokenization_utils_base import PaddingStrategy, TruncationStrategy

# The DataCollatorWithPadding that ships with transformers cannot pad `labels`
# Solution 1: drop labels, but then labels must be rebuilt when computing the loss
# Solution 2: subclass DataCollatorWithPadding and add padding for labels
# Solution 3: implement the collator by hand

dataset_sft_not_labels = dataset_sft.remove_columns('labels')

tokenizer.set_truncation_and_padding(
    truncation_strategy=TruncationStrategy.ONLY_FIRST,
    padding_strategy=PaddingStrategy.LONGEST,
    max_length=512,
    padding_side='right',
    stride=1,
    pad_to_multiple_of=8,
)

collator = DataCollatorWithPadding(tokenizer, 
                                   return_tensors="pt",
                                   # max_length=512,
                                   # padding=True,
                                  )

data_loader = DataLoader(dataset_sft_not_labels['train'], 
                    batch_size=2, 
                    collate_fn=collator, 
                    shuffle=False)

for batch in data_loader:
    print(batch['input_ids'])
    # do train
    break

tensor([[151644,   8948,    198, 105043,  30709,  99949, 100857, 100168,  31914,
             11,  14880,  99464, 100700, 102104,  20002,  13872,  43589,  86119,
         151645,    198, 151644,    872,    198,  35127,   2326,  10414,    369,
          19429,   9314,     13, 151645,    198, 151644,  77091,    198, 151667,
            271, 151668,    271,     16,   5142,    266,    264,  23831,   9968,
            323,   1281,   2704,    311,   2924,  11260,    315,  25322,    323,
          23880,     13,    715,     17,     13,  32818,  15502,    311,   2506,
            697,   2487,   4541,    323,   3746,     13,    715,     18,     13,
           2126,   3322,   6084,    323,  10306,    264,  12966,   6084,   9700,
             13, 151645,    198],
        [151644,   8948,    198, 105043,  30709,  99949, 100857, 100168,  31914,
             11,  14880,  99464, 100700, 102104,  20002,  13872,  43589,  86119,
         151645,    198, 151644,    872,    198,   3838,    525,    279,   

In [22]:
print(tokenizer.decode([151645])) # eos
print(tokenizer.decode([151643])) # pad

<|im_end|>
<|endoftext|>


### Manual training

In [23]:
for batch in data_loader:
    logits = model(input_ids = batch['input_ids'],
          attention_mask = batch['attention_mask']).logits

    # get labels

    # get loss

    # loss.backward
    # optimizer.step
    # optimizer.zero_grad()
    break
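
The `# get loss` step can be sketched without any framework: cross-entropy averaged over the positions whose label is not -100. When at least one position is unmasked, this mirrors what `torch.nn.functional.cross_entropy(..., ignore_index=-100)` computes with its default mean reduction. The logits below are toy numbers and the function name is hypothetical.

```python
import math

IGNORE_INDEX = -100

def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked position: contributes nothing to the loss
        # log of the softmax normalizer, with the max subtracted for numerical stability
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_norm - row[label]
        count += 1
    return total / count if count else 0.0

# Three positions over a vocabulary of size 3; the first position is masked.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
labels = [IGNORE_INDEX, 2, 1]
loss = masked_cross_entropy(logits, labels)
print(loss)
```

Only the two unmasked positions contribute, so prompt tokens and padding never influence the gradient.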

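## Training with Trainer

The code below fails to run, for two reasons:

1. `labels` had to be removed by hand beforehand so that the collator could batch the data.
2. Consequently the trainer has no `labels` during training and cannot compute a loss, so training cannot proceed.

When you are new to transformers, its layers of encapsulation make custom behavior cumbersome to implement; even a plain SFT run like this one takes considerable effort.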

In [25]:
import os
os.environ["DISABLE_ACCELERATE"] = "1" 
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="output/qwen3_sft",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    eval_steps=5000,
    logging_steps=100,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    weight_decay=0.1,
    # warmup_steps=0.03,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    # fp16=True,
    push_to_hub=False,
    
    fp16=False,
    bf16=True,
    deepspeed=None,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=collator,
    train_dataset=dataset_sft_not_labels["train"],
    eval_dataset=False,
)

# Running the line below raises an error
# trainer.train() 



## Subclassing approach

In [26]:
class CustomDataCollator(DataCollatorWithPadding):
    def __call__(self, features):
        # Split the labels off from the input features (the parent class cannot pad them)
        labels = [feature.pop('labels') for feature in features] if 'labels' in features[0] else None
        
        # Let the parent class pad input_ids and attention_mask
        batch = super().__call__(features)
        if labels is None:
            return batch

        # Pad labels with -100 on the same side as the inputs
        bsz, seq_len = batch['input_ids'].shape
        padding_labels = torch.ones(bsz, seq_len, dtype=torch.long) * -100
        for i in range(bsz):
            tmp_len = len(labels[i])
            if self.tokenizer.padding_side == 'right':
                padding_labels[i, :tmp_len] = torch.tensor(labels[i], dtype=torch.long)
            else:
                padding_labels[i, seq_len - tmp_len:] = torch.tensor(labels[i], dtype=torch.long)
        batch['labels'] = padding_labels
                
        return batch

In [27]:
from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader
from transformers.tokenization_utils_base import PaddingStrategy, TruncationStrategy

my_collator = CustomDataCollator(tokenizer, 
                                   return_tensors="pt",
                                  )

data_loader = DataLoader(dataset_sft['train'], # dataset_sft_not_labels
                    batch_size=2, 
                    collate_fn=my_collator, 
                    shuffle=False)

for batch in data_loader:
    print(batch['input_ids'])
    print(batch['labels'])
    # do train
    break

tensor([[151644,   8948,    198, 105043,  30709,  99949, 100857, 100168,  31914,
             11,  14880,  99464, 100700, 102104,  20002,  13872,  43589,  86119,
         151645,    198, 151644,    872,    198,  35127,   2326,  10414,    369,
          19429,   9314,     13, 151645,    198, 151644,  77091,    198, 151667,
            271, 151668,    271,     16,   5142,    266,    264,  23831,   9968,
            323,   1281,   2704,    311,   2924,  11260,    315,  25322,    323,
          23880,     13,    715,     17,     13,  32818,  15502,    311,   2506,
            697,   2487,   4541,    323,   3746,     13,    715,     18,     13,
           2126,   3322,   6084,    323,  10306,    264,  12966,   6084,   9700,
             13, 151645,    198],
        [151644,   8948,    198, 105043,  30709,  99949, 100857, 100168,  31914,
             11,  14880,  99464, 100700, 102104,  20002,  13872,  43589,  86119,
         151645,    198, 151644,    872,    198,   3838,    525,    279,   
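
For comparison, solution 3 from the earlier list of options (implementing the collator by hand) could look roughly like the sketch below. It pads on the right with the pad id 151643 reported in the tokenizer warning earlier, and returns plain lists to stay dependency-free; in practice each field would be wrapped in `torch.tensor` before being fed to the model. The function name `collate_right_pad` is hypothetical.

```python
PAD_TOKEN_ID = 151643   # <|endoftext|>, the pad token id reported earlier
IGNORE_INDEX = -100     # label value skipped by cross-entropy

def collate_right_pad(features):
    """Pad input_ids, attention_mask and labels on the right to the longest sequence."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(list(f["input_ids"]) + [PAD_TOKEN_ID] * pad)
        batch["attention_mask"].append(list(f["attention_mask"]) + [0] * pad)
        batch["labels"].append(list(f["labels"]) + [IGNORE_INDEX] * pad)
    return batch

features = [
    {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1], "labels": [-100, 2, 3]},
    {"input_ids": [4, 5], "attention_mask": [1, 1], "labels": [-100, 5]},
]
batch = collate_right_pad(features)
print(batch)
```

Writing the collator by hand avoids mutating the incoming features and keeps the padding value of each field explicit.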

In [28]:
args = TrainingArguments(
    output_dir="output/qwen3_sft",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    eval_steps=5000,
    logging_steps=1,
    max_steps = 10,# for debug
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    weight_decay=0.1,
    # warmup_steps=0.03,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    # fp16=True,
    push_to_hub=False,
    
    fp16=False,
    bf16=True,
    deepspeed=None,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    # data_collator=collator,    
    data_collator=my_collator,
    # train_dataset=dataset_sft_not_labels["train"],
    train_dataset=dataset_sft["train"],
    eval_dataset=False,
)

trainer.train()



Step,Training Loss
1,20.7646
2,6.6947
3,11.9314
4,10.4253
5,10.0165
6,21.7029
7,18.5812
8,10.5004
9,10.3145
10,10.5239


TrainOutput(global_step=10, training_loss=13.145537328720092, metrics={'train_runtime': 13.8239, 'train_samples_per_second': 2.894, 'train_steps_per_second': 0.723, 'total_flos': 15344124297216.0, 'train_loss': 13.145537328720092, 'epoch': 0.0007692011845698242})

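The average loss here (about 13) is far higher than the roughly 2 reported by SFTTrainer at the top of this notebook. A likely cause is that `map_get_label` already shifts the labels left by one, while Hugging Face causal-LM models shift the labels again when computing the loss, so the model is trained against targets two positions ahead. Dropping the `roll` call before using `Trainer` is the first thing to check. After that, adjust the training arguments as needed and train the model.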

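## Using TRL's SFTTrainer

According to the official example, the dataset it uses, `trl-lib/Capybara`, is:

1. a multi-turn conversation dataset
2. organized as messages

The Alpaca dataset can be processed by following the layout of `trl-lib/Capybara`.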

In [29]:
dataset = load_dataset("trl-lib/Capybara", split="train")
print(dataset)

Dataset({
    features: ['source', 'messages', 'num_turns'],
    num_rows: 15806
})


In [30]:
dataset['messages'][8]

[{'content': "We will read about a scenario, and then have a question about it.\n---\nScenario:\nBil and Mike have a shared Dropbox folder.\nBil puts a file called 'schematic.pdf' inside /shared\\_folder/schematics\nMike notices Bil put a file in there, and moves the file to /shared\\_folder/tmp\nHe says nothing about this to Bil, and Dropbox also does not notify Bil.\n\nQuestion: After Bil and Mike finish talking via Discord, Bil wants to open 'schematic.pdf'. In which folder will he look for it?",
  'role': 'user'},
 {'content': "Based on the scenario, Bil originally put the 'schematic.pdf' file in the /shared\\_folder/schematics folder. Since he was not informed about the file being moved by Mike and Dropbox didn't notify him either, he will most likely look for the file in the /shared\\_folder/schematics folder.",
  'role': 'assistant'},
 {'content': 'What wil Bil think when he looks in /shared\\_folder/schematics ?',
  'role': 'user'},
 {'content': "When Bil looks in the /shared\\

In [31]:
from datasets import load_dataset
dataset = load_dataset('tatsu-lab/alpaca',
                      cache_dir="~/.cache/huggingface",)
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 52002
    })
})


In [32]:
def map_cat_inst_input(example):
    example['messages'] = [
        {'role':'system', 'content':DEFINIED_SYSTEM_PROMPT},
        {'role':'user', 'content': example['instruction']+example['input']},
        {'role':'assistant', 'content': example['output']},
    ]
    return example
    
dataset_alpaca = dataset.map(map_cat_inst_input,
                             remove_columns=["instruction", "input", "output", "text"])

In [33]:
config = SFTConfig(
    output_dir="output/qwen3_sft",
    per_device_train_batch_size = 2,
    max_length = 256,
    max_steps = 10
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    args=config,
    train_dataset=dataset_alpaca['train']
)
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


Step,Training Loss


TrainOutput(global_step=10, training_loss=1.8213672637939453, metrics={'train_runtime': 15.3755, 'train_samples_per_second': 1.301, 'train_steps_per_second': 0.65, 'total_flos': 6813150609408.0, 'train_loss': 1.8213672637939453})

## Summary

1. Used the methods provided by the transformers library to train with a Trainer.
2. Used the trl library for off-the-shelf training.


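## Hands-on extensions

1. Based on the code above, write a Python training script that trains an SFT model for 1 epoch and saves it.
2. Run generation with the trained model.
3. Run SFT training on multiple GPUs.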

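## Questions to think about

1. Sequence lengths vary widely; how can padding be reduced?
2. What is the packing strategy for data? Write out the input_ids, attention_mask, and labels with packing.
3. Check the documentation: for multi-turn conversation data, does trl::SFTTrainer fit every assistant turn or only the last one?
4. Write a generate function for multi-turn conversations.
5. Write a generate function for batched multi-turn conversation data.
6. Analyze how the KV cache changes over a multi-turn conversation, and how the padding side affects the KV cache.
7. Read the Trainer source code and draw a flowchart.
8. Read the SFTTrainer source code and draw a flowchart.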
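
As one possible starting point for question 2, the sketch below packs several short examples into fixed-length rows with a simple greedy rule. It is an illustration under stated assumptions, not the strategy used by any particular library: labels are unshifted copies, over-long examples are assumed to have been filtered out, and the plain attention mask lets tokens attend across example boundaries within a row. The function name `pack_examples` is hypothetical.

```python
PAD_TOKEN_ID = 151643
IGNORE_INDEX = -100

def pack_examples(examples, max_len):
    """Greedily concatenate examples into rows of at most max_len tokens, then right-pad each row."""
    rows, current = [], {"input_ids": [], "labels": []}
    for ex in examples:
        if len(ex["input_ids"]) > max_len:
            continue  # assumption: over-long examples were filtered out earlier
        if len(current["input_ids"]) + len(ex["input_ids"]) > max_len:
            rows.append(current)  # the current row is full, start a new one
            current = {"input_ids": [], "labels": []}
        current["input_ids"] += ex["input_ids"]
        current["labels"] += ex["labels"]
    if current["input_ids"]:
        rows.append(current)

    packed = []
    for row in rows:
        n = len(row["input_ids"])
        pad = max_len - n
        packed.append({
            "input_ids": row["input_ids"] + [PAD_TOKEN_ID] * pad,
            "attention_mask": [1] * n + [0] * pad,
            "labels": row["labels"] + [IGNORE_INDEX] * pad,
        })
    return packed

examples = [
    {"input_ids": [1, 2, 3], "labels": [-100, 2, 3]},
    {"input_ids": [4, 5], "labels": [-100, 5]},
    {"input_ids": [6, 7, 8, 9], "labels": [-100, 7, 8, 9]},
]
packed = pack_examples(examples, max_len=6)
for row in packed:
    print(row)
```

A fuller answer would also reset position ids per example and block attention across example boundaries.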