# Training Pipeline
[run_training_dpo_pipeline.ipynb](https://github.com/shibing624/MedicalGPT/blob/main/run_training_dpo_pipeline.ipynb)    | [Open In Colab](https://colab.research.google.com/github/shibing624/MedicalGPT/blob/main/run_training_dpo_pipeline.ipynb)

# Stage 1: Continue Pretraining

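Stage 1: PT (Continue PreTraining), incremental pretraining. The GPT-style model is pretrained a second time on a large corpus of domain text so that it adapts to the domain's data distribution.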

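Notes:
1. This stage is optional. If you do not have a large corpus of domain text, skip it and go straight to supervised fine-tuning (SFT).
2. In my experiments, SFT injected domain knowledge more efficiently than PT, which is another reason the PT stage can be skipped.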

| Stage 1: Continue Pretraining   |  [pretraining.py](https://github.com/shibing624/MedicalGPT/blob/main/pretraining.py) | [run_pt.sh](https://github.com/shibing624/MedicalGPT/blob/main/run_pt.sh)    |

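#### Notes:
To verify quickly that the training code works, the notebook/Colab code below uses a small generative model and a small sample dataset. For real use, switch to a larger model and dataset to get better results.

1. Generative model: Qwen/Qwen2.5-0.5B
2. Dataset: the PT stage uses excerpts from the Chinese novel Tian Long Ba Bu (天龙八部) and excerpts from English books, located in the `data/pretrain` folder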
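Plain-text corpora like these are usually turned into fixed-length training examples by concatenating the tokenized documents and cutting the result into blocks (the training command later in this notebook passes `--block_size 128`). The sketch below illustrates that common causal-LM preprocessing step on toy token ids. Whether `pretraining.py` does exactly this is an assumption, and the `group_texts` helper here is written for illustration only.

```python
from itertools import chain


def group_texts(tokenized_docs, block_size=128):
    """Concatenate token-id lists and cut them into equal-length blocks."""
    flat = list(chain.from_iterable(tokenized_docs))
    # Drop the ragged tail so every block has exactly block_size tokens.
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


# Toy "documents" of token ids (hypothetical values, not real tokenizer output).
docs = [list(range(5)), list(range(5, 12)), list(range(12, 14))]
blocks = group_texts(docs, block_size=4)
print(blocks)
```

With this packing, document boundaries do not align with block boundaries, which is generally acceptable for pretraining but not for instruction data.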

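## Set Up the Runtime Environment

When running locally, you can comment out the environment setup commands below. On Colab, uncomment them to set up the environment.

On Colab, a T4 GPU is recommended for training. To select it: `Runtime -> Change runtime type -> Runtime type: Python 3, Hardware accelerator: GPU, GPU type: T4 -> Save`

Steps:
1. Download the latest code
2. Install the dependencies

The required packages are listed below. Use recent versions, except where the install cells below pin a specific version: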

```
loguru
transformers
sentencepiece
datasets
tensorboard
tqdm
peft
trl
```
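Before training, it can help to confirm which of these packages are actually installed and at what version. The following is a minimal sketch using only the standard library; the `installed_versions` helper is hypothetical and not part of the MedicalGPT repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names copied from the dependency list above.
REQUIRED = ["loguru", "transformers", "sentencepiece", "datasets",
            "tensorboard", "tqdm", "peft", "trl"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```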

In [1]:
!git clone --depth 1 https://github.com/shibing624/MedicalGPT.git
%cd MedicalGPT
%ls
#!pip install -r requirements.txt

Cloning into 'MedicalGPT'...
remote: Enumerating objects: 98, done.
remote: Counting objects: 100% (98/98), done.
remote: Compressing objects: 100% (88/88), done.
remote: Total 98 (delta 19), reused 52 (delta 5), pack-reused 0 (from 0)
Receiving objects: 100% (98/98), 8.98 MiB | 28.46 MiB/s, done.
Resolving deltas: 100% (19/19), done.
/content/MedicalGPT
build_domain_tokenizer.py   requirements.txt
chatpdf.py                  reward_modeling.py
CITATION.cff                role_play_data/
_config.yml                 run_dpo.sh
CONTRIBUTING.md             run_eval_quantize.sh
convert_dataset.py          run_full_sft.sh
data/                       run_grpo.sh
DISCLAIMER                  run_orpo.sh
docs/                       run_ppo.sh
dpo_training.py             run_pt.sh
eval_quantize.py            run_quant.sh
fastapi_server_demo.py      run_rm.sh
gradio_demo.py              run_sft_accelerate.sh
grpo_training.py            run_sft.s

In [2]:
# 1. Install specific compatible versions for the Model & Training
!pip install transformers==4.46.3 peft==0.12.0 accelerate==0.34.2 trl==0.8.6 datasets

Collecting transformers==4.46.3
  Downloading transformers-4.46.3-py3-none-any.whl.metadata (44 kB)
Collecting peft==0.12.0
  Downloading peft-0.12.0-py3-none-any.whl.metadata (13 kB)
Collecting accelerate==0.34.2
  Downloading accelerate-0.34.2-py3-none-any.whl.metadata (19 kB)
Collecting trl==0.8.6
  Downloading trl-0.8.6-py3-none-any.whl.metadata (11 kB)
Collecting tokenizers<0.21,>=0.20 (from transformers==4.46.3)
  Downloading tokenizers-0.20.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting tyro>=0.5.11 (from trl==0.8.6)
  Downloading tyro-1.0.1-py3-none-any.whl.metadata (11 kB)
Downloading transformers-4.46.3-py3-none-any.whl (10.0 MB)

In [3]:
# 2. Install the Math dependencies (with the forced ANTLR runtime)
!pip install latex2sympy2_extended math-verify==0.5.2 antlr4-python3-runtime==4.13.2

Collecting latex2sympy2_extended
  Downloading latex2sympy2_extended-1.10.2-py3-none-any.whl.metadata (5.3 kB)
Collecting math-verify==0.5.2
  Downloading math_verify-0.5.2-py3-none-any.whl.metadata (347 bytes)
Collecting antlr4-python3-runtime==4.13.2
  Downloading antlr4_python3_runtime-4.13.2-py3-none-any.whl.metadata (304 bytes)
Collecting latex2sympy2_extended
  Downloading latex2sympy2_extended-1.0.6-py3-none-any.whl.metadata (4.9 kB)
Downloading math_verify-0.5.2-py3-none-any.whl (27 kB)
Downloading latex2sympy2_extended-1.0.6-py3-none-any.whl (82 kB)
Downloading antlr4_python3_runtime-4.13.2-py3-none-any.whl (144 kB)
Installing collected packages: antlr4-python3-runtime, latex2sympy2_extended, math-verify
  Attempting uninstall: an

In [4]:
# 3. (Optional) Install other utilities from the list
!pip install loguru sentencepiece scikit-learn tensorboard tqdm

Collecting loguru
  Downloading loguru-0.7.3-py3-none-any.whl.metadata (22 kB)
Downloading loguru-0.7.3-py3-none-any.whl (61 kB)
Installing collected packages: loguru
Successfully installed loguru-0.7.3


In [5]:
# 4. Uninstall bitsandbytes to prevent Triton crashes (since you are using bf16, not 8-bit)
!pip uninstall -y bitsandbytes

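## Stage 1: Let's Get Started

The training steps are:

1. Check the training set
2. Run the training script

The training script does the following:
1. Import dependencies
2. Set parameters
3. Define the helper functions and load the training set
4. Load the model and tokenizer
5. Train and evaluate
6. Inspect the training results

**The parameters below can be adjusted to match your GPU. The current values are configured for a single Colab T4 GPU (16 GB of memory).**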
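When adapting the batch settings to a different GPU, the quantity to keep in mind is the effective batch size: per-device batch size times gradient accumulation steps times number of devices. The sketch below shows that arithmetic and the approximate number of optimizer steps it implies. The sample count of 2,500 is a hypothetical figure chosen for illustration, and the Trainer's exact step count can differ slightly depending on rounding and dataloader settings.

```python
import math


def optimizer_steps(num_samples, per_device_batch, grad_accum=1,
                    num_devices=1, epochs=1):
    """Return (effective_batch_size, approximate optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Settings similar to the command below: batch size 3, no accumulation, one GPU.
print(optimizer_steps(2500, per_device_batch=3))
# Same per-device memory footprint, but a 4x larger effective batch.
print(optimizer_steps(2500, per_device_batch=3, grad_accum=4))
```

If a GPU runs out of memory, lowering `--per_device_train_batch_size` while raising `--gradient_accumulation_steps` by the same factor keeps the effective batch size unchanged.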

In [6]:
%ls ./data/pretrain/

en_article_tail500.txt  fever.txt  tianlongbabu.txt


In [3]:
!head -n 100 ./data/pretrain/en_article_tail500.txt

contract to work in specified mines and mills. There seemed to be no
limit to the factories, forges, refineries, and railways that could be
built, to the multitudes that could be employed in conquering a
continent. As for the future, that was in the hands of Providence!

=Business Theories of Politics.=--As the statesmen of Hamilton's school
and the planters of Calhoun's had their theories of government and
politics, so the leaders in business enterprise had theirs. It was
simple and easily stated. "It is the duty of the government," they
urged, "to protect American industry against foreign competition by
means of high tariffs on imported goods, to aid railways by generous
grants of land, to sell mineral and timber lands at low prices to
energetic men ready to develop them, and then to leave the rest to the
initiative and drive of individuals and companies." All government
interference with the management, prices, rates, charges, and conduct of
private business they held to be either w

In [2]:
!head -n 100 ./data/pretrain/fever.txt

第一章论
传染病是指由病原微生物，如朊粒、病毒、衣原体、立克次体、支原体（mycoplasma)细菌真菌、螺旋体和寄生虫，如原虫、蠕虫、医学昆虫感染人体后产生的有传染性、在一定条件下可造成流行的疾病。感染性疾病是指由病原体感染所致的疾病，包括传染病和非传染性感染性疾病。
传染病学是一门研究各种传染病在人体内外发生、发展、传播、诊断、治疗和预防规律的学科。重点研究各种传染病的发病机制、临床表现、诊断和治疗方法，同时兼顾流行病学和预防措施的研究，做到防治结合。
传染病学与其他学科有密切联系，其基础学科和相关学科包括病原生物学、分子生物学、免疫学、人体寄生虫学、流行病学、病理学、药理学和诊断学等。掌握这些学科的基本知识、基本理论和基本技能对学好传染病学起着非常重要的作用。
在人类历史长河中，传染病不仅威胁着人类的健康和生命，而且影响着人类文明的进程，甚至改写过人类历史。人类在与传染病较量过程中，取得了许多重大战果，19世纪以来，病原微生物的不断发现及其分子生物学的兴起，推动了生命科学乃至整个医学的发展；疫苗的研究诞生了感染免疫学，奠定了免疫学的理论基础，已用来研究各种疾病的发生机制及防治手段；抗生素的发现和应用被誉为20世纪最伟大的医学成就；“Koch法则“明确了传染病与病原微生物之间的因果关系，建立了病原学理论，已被广泛应用到其他许多疾病的研究，奠定了现代医学发展的基石。
正是由于上述辉煌战果，加上社会文明的推进和物质生活水平的提高，人类逐渐在与传染病的斗争中占了上风。20世纪70年代西方医学界一度认为，传染病正在消亡。然而，1981年的艾滋病、2003年的传染性非典型肺炎、2012年的中东呼吸综合征、2013年的人感染H7N9禽流感、2014年的埃博拉出血热等新的传染病相继出现，不断给人类敲响警钟；与此同时，登革热、结核病、症疾及性传播疾病等老传染病再度肆虐，严重影响世界经济发展和社会和谐。20世纪90年代国际上提出了“eme1一ging infectiou s diseases"的概念，起初被我国学者翻译为“新发传染病”，此后随着人们对感染性疾病认识的不断深入，该定义得到了修订，“新发传染病”逐渐演变为“新发感染病”，不仅包括由新种或新型病原微生物引起的新发现的感染病，而且包括近年来导致地区性或国际性公共卫生问题的再发的老感染病。新传染病的出现，老传染病的复燃，病原

In [9]:
!head -n 100 ./data/pretrain/tianlongbabu.txt

天龙八部


正文 释名
“天龙八部”这名词出于佛经。许多大乘佛经叙述佛向诸菩萨、比丘等说法时，崐常有天龙八部参与听法。如“法华经：提婆达多品”：“天龙八部、人与非人，皆崐遥见彼龙女成佛”。
“非人”，包括八种神道怪物，因为以“天”及“龙”为首，崐所以称为《天龙八部》。八部罗，七归那罗，八摩听罗迦。
“天”是指天神。在佛教中，天神的地位并非至高无上，只不过比人能享受到崐到更大、更长久的福报而已。佛教认为一切事物无常，天神的寿命终了之后，也是崐要死的。天神临死之前有五种征状：衣裳垢腻、头上花萎、身体臭秽、腋下汗出、崐不乐本座(第五个征状或说是“玉子离散”)，这就是所谓“天人五衰”，是天神最崐大的悲哀。帝释是众天神的领袖。
“龙”是指神。佛经中的龙，和我国的传说中的龙大致差不多，不过没有脚，崐有的大蟒蛇也称。事实上，中国人对龙和龙王的观念，主要是从佛经中来的。佛经崐中有五龙五、七龙王、八龙王等等名称，古印度人龙很是尊敬，认为水中主物以龙崐的力气最大，因此对德行崇高的人尊称为“龙象”，如西来龙”，那是指从西方来崐的高僧。古印度人以为下雨是龙从天海中取水而洒下人间。中国人也接受这种说法，崐历本上注明几龙取水，表示今年雨量的多寡。龙王之中，有一位叫做沙竭罗龙王，崐他和幼女八岁时到释迦反牟尼所说法的灵鹫山前，转为男身，现佛之相。她成佛之崐时，为天龙八部所见。“夜叉”是佛经中的一种鬼神，有“夜叉八大将”、“十六大夜叉将”等名词。崐“夜叉”是本义是能吃鬼的神，又有敏捷、勇健、轻灵、秘密等意思。“维摩经”崐注：“什曰：‘夜叉有三种：一、在地，二、在空虚，三、天夜叉也。’”现在我崐们说到“夜叉”都是指恶鬼。但在佛经中，有很多夜叉是好的，夜叉八大将的任务崐是“维护众生界”。
“乾达婆”是一种不吃酒内、只寻香气作为滋养的神，是服侍帝释的乐神之一，崐身上发出浓冽的香气，“乾达婆”在梵语中又是“变幻莫测”的意思，魔术师也叫崐“乾达婆”，海市蜃楼叫做“乾达婆城”。香气和音乐都是缥缈隐约，难以捉摸。
“阿修罗”这种神道非常特别，男的极丑陋，而女的极美丽。阿修罗王常常率崐部和帝释战斗，因为阿修罗有美女而无美好食物，帝释有美食而无美女，互相妒忌崐抢夺，每有恶战，总是打得天翻地覆。我们常称惨遭轰炸、尸横遍地的大战场为“崐修罗场”，就是由此而来。大战的结果，阿修罗王往打败，，上崐天下地，

In [7]:
!python pretraining.py \
    --model_name_or_path Qwen/Qwen2.5-0.5B \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 3 \
    --per_device_eval_batch_size 3 \
    --do_train \
    --do_eval \
    --use_peft True \
    --seed 42 \
    --bf16 \
    --max_train_samples 20000 \
    --max_eval_samples 10 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --eval_strategy steps \
    --save_steps 50 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --block_size 128 \
    --group_by_length True \
    --output_dir outputs-pt-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype bfloat16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True

2025-12-15 04:21:25.075237: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-12-15 04:21:25.092849: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1765772485.114476    4151 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1765772485.120902    4151 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1765772485.137993    4151 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [8]:
%ls -lh outputs-pt-v1

total 22M
-rw-r--r-- 1 root root  721 Dec 15 04:26 adapter_config.json
-rw-r--r-- 1 root root  17M Dec 15 04:26 adapter_model.safetensors
-rw-r--r-- 1 root root  605 Dec 15 04:26 added_tokens.json
-rw-r--r-- 1 root root  471 Dec 15 04:26 all_results.json
drwxr-xr-x 2 root root 4.0K Dec 15 04:26 checkpoint-750/
drwxr-xr-x 2 root root 4.0K Dec 15 04:26 checkpoint-800/
drwxr-xr-x 2 root root 4.0K Dec 15 04:26 checkpoint-834/
-rw-r--r-- 1 root root  262 Dec 15 04:26 eval_results.json
-rw-r--r-- 1 root root 1.6M Dec 15 04:26 merges.txt
-rw-r--r-- 1 root root 5.0K Dec 15 04:26 README.md
drwxr-xr-x 3 root root 4.0K Dec 15 04:22 runs/
-rw-r--r-- 1 root root  616 Dec 15 04:26 special_tokens_map.json
-rw-r--r-- 1 root root 7.1K Dec 15 04:26 tokenizer_config.json
-rw-r--r-- 1 root root  20K Dec 15 04:26 trainer_state.json
-rw-r--r-- 1 root root  229 Dec 15 04:26 train_results.json
-rw-r--r-- 1 root root 3.3M Dec 15 04:26 vocab.json


Training results:
- When training with LoRA, the saved LoRA weights are in `adapter_model.safetensors` and the LoRA configuration is in `adapter_config.json`. See `merge_peft_adapter.py` for how to merge them into the base model.
- Logs are saved under `output_dir/runs` and can be viewed with TensorBoard: `tensorboard --logdir output_dir/runs --host 0.0.0.0 --port 8009`
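If TensorBoard is not convenient, the training loss can also be read from `trainer_state.json`, which the Hugging Face Trainer writes with a `log_history` list. The sketch below extracts the training-loss entries; the `loss_curve` helper is hypothetical, and the path `outputs-pt-v1` is taken from the training command above.

```python
import json
from pathlib import Path


def loss_curve(output_dir):
    """Return (step, loss) pairs from a Trainer output directory."""
    state = json.loads((Path(output_dir) / "trainer_state.json").read_text())
    # Evaluation entries carry "eval_loss" instead of "loss" and are skipped.
    return [(entry["step"], entry["loss"])
            for entry in state.get("log_history", [])
            if "loss" in entry and "step" in entry]


if Path("outputs-pt-v1/trainer_state.json").exists():
    for step, loss in loss_curve("outputs-pt-v1"):
        print(step, loss)
```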

To merge the LoRA weights into the base model, run the command below. The merged model is saved in the directory given by `--output_dir`:

In [9]:
!python merge_peft_adapter.py \
    --base_model Qwen/Qwen2.5-0.5B --lora_model outputs-pt-v1 --output_dir merged-pt/

2025-12-15 04:27:47.224596: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-12-15 04:27:47.242053: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1765772867.264125    5921 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1765772867.271462    5921 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1765772867.288581    5921 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [10]:
%ls -lh merged-pt/

total 958M
-rw-r--r-- 1 root root  605 Dec 15 04:27 added_tokens.json
-rw-r--r-- 1 root root  744 Dec 15 04:27 config.json
-rw-r--r-- 1 root root  117 Dec 15 04:27 generation_config.json
-rw-r--r-- 1 root root 1.6M Dec 15 04:27 merges.txt
-rw-r--r-- 1 root root 943M Dec 15 04:27 model.safetensors
-rw-r--r-- 1 root root  616 Dec 15 04:27 special_tokens_map.json
-rw-r--r-- 1 root root 7.1K Dec 15 04:27 tokenizer_config.json
-rw-r--r-- 1 root root  11M Dec 15 04:27 tokenizer.json
-rw-r--r-- 1 root root 2.7M Dec 15 04:27 vocab.json
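
Conceptually, merging a LoRA adapter folds the low-rank update into each adapted weight matrix: W' = W + (lora_alpha / rank) * B A. With `--lora_rank 8` and `--lora_alpha 16` as used above, the scale factor is 2.0. The sketch below shows this on a toy matrix in plain Python. It illustrates the standard LoRA formula only; it is not the code in `merge_peft_adapter.py`, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, rank):
    """Return W + (lora_alpha / rank) * (B @ A)."""
    scale = lora_alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy example: 2x2 base weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape: rank x in_features
B = [[0.5], [0.25]]   # shape: out_features x rank
merged = merge_lora(W, A, B, lora_alpha=2, rank=1)
print(merged)
```

After merging, the adapter is no longer needed at inference time, which is why the merged directory holds a single full-size `model.safetensors` rather than a small adapter file.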


In [None]:
%cat merged-pt/config.json

Stage 1 incremental pretraining is complete.

# Stage 2: Supervised FineTuning

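Stage 2: SFT (Supervised Fine-tuning). Build an instruction-tuning dataset and fine-tune the pretrained model on it to align the model with instruction intent and inject domain knowledge.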

| Stage 2: Supervised Fine-tuning | [supervised_finetuning.py](https://github.com/shibing624/MedicalGPT/blob/main/supervised_finetuning.py) | [run_sft.sh](https://github.com/shibing624/MedicalGPT/blob/main/run_sft.sh)  |

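#### Notes:
To verify quickly that the training code works, the notebook/Colab code below uses a small generative model and a small sample dataset. For real use, switch to a larger model and dataset to get better results.

1. Generative model: Qwen/Qwen2.5-0.5B, or the pretrained model produced in Stage 1
2. Dataset: the SFT stage uses a 1,000-example sample of the Belle dataset, located in the `data/finetune` folder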
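
A common detail in instruction fine-tuning is that the loss is computed only on the response tokens, not on the prompt. This is typically done by setting the prompt positions in `labels` to -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on toy token ids. Whether `supervised_finetuning.py` masks prompts this way by default is an assumption, and `build_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt part in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Toy token ids (hypothetical values, not real tokenizer output).
input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(input_ids)
print(labels)
```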

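## Stage 2: Let's Get Started

The training steps are:

1. Check the training set
2. Run the training script

The training script does the following:
1. Import dependencies
2. Set parameters
3. Define the helper functions and load the training set
4. Load the model and tokenizer
5. Train and evaluate
6. Inspect the training results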

In [None]:
%ls ./data/finetune

In [11]:
!python supervised_finetuning.py \
    --model_name_or_path merged-pt \
    --train_file_dir ./data/finetune \
    --validation_file_dir ./data/finetune \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --bf16 \
    --max_train_samples 1000 \
    --max_eval_samples 10 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.05 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --eval_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --output_dir outputs-sft-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype bfloat16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True

2025-12-15 04:28:44.724712: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-12-15 04:28:44.742228: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1765772924.763460    6207 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1765772924.769960    6207 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1765772924.786776    6207 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [12]:
%ls -lh outputs-sft-v1

total 22M
-rw-r--r-- 1 root root  713 Dec 15 04:30 adapter_config.json
-rw-r--r-- 1 root root  17M Dec 15 04:30 adapter_model.safetensors
-rw-r--r-- 1 root root  605 Dec 15 04:30 added_tokens.json
-rw-r--r-- 1 root root  431 Dec 15 04:30 all_results.json
drwxr-xr-x 2 root root 4.0K Dec 15 04:30 checkpoint-249/
-rw-r--r-- 1 root root  222 Dec 15 04:30 eval_results.json
-rw-r--r-- 1 root root 1.6M Dec 15 04:30 merges.txt
-rw-r--r-- 1 root root 5.0K Dec 15 04:30 README.md
drwxr-xr-x 3 root root 4.0K Dec 15 04:29 runs/
-rw-r--r-- 1 root root  648 Dec 15 04:30 special_tokens_map.json
-rw-r--r-- 1 root root 7.1K Dec 15 04:30 tokenizer_config.json
-rw-r--r-- 1 root root 6.1K Dec 15 04:30 trainer_state.json
-rw-r--r-- 1 root root  229 Dec 15 04:30 train_results.json
-rw-r--r-- 1 root root 3.3M Dec 15 04:30 vocab.json


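Training results:
- When training with LoRA, the saved LoRA weights are in `adapter_model.safetensors` and the LoRA configuration is in `adapter_config.json`. See `merge_peft_adapter.py` for how to merge them into the base model.
- Logs are saved under `output_dir/runs` and can be viewed with TensorBoard: `tensorboard --logdir output_dir/runs --host 0.0.0.0 --port 8009`

To merge the LoRA weights into the base model, run the command below. The merged model is saved in the directory given by `--output_dir`: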

In [13]:
!python merge_peft_adapter.py \
    --base_model merged-pt --lora_model outputs-sft-v1 --output_dir ./merged-sft

2025-12-15 04:31:06.273614: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-12-15 04:31:06.290849: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1765773066.312558    6875 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1765773066.319108    6875 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1765773066.335694    6875 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [14]:
%ls -lh merged-sft/

total 958M
-rw-r--r-- 1 root root  605 Dec 15 04:31 added_tokens.json
-rw-r--r-- 1 root root  736 Dec 15 04:31 config.json
-rw-r--r-- 1 root root  117 Dec 15 04:31 generation_config.json
-rw-r--r-- 1 root root 1.6M Dec 15 04:31 merges.txt
-rw-r--r-- 1 root root 943M Dec 15 04:31 model.safetensors
-rw-r--r-- 1 root root  616 Dec 15 04:31 special_tokens_map.json
-rw-r--r-- 1 root root 7.1K Dec 15 04:31 tokenizer_config.json
-rw-r--r-- 1 root root  11M Dec 15 04:31 tokenizer.json
-rw-r--r-- 1 root root 2.7M Dec 15 04:31 vocab.json


In [15]:
%cat merged-sft/config.json

{
  "_name_or_path": "merged-pt",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "hidden_act": "silu",
  "hidden_size": 896,
  "initializer_range": 0.02,
  "intermediate_size": 4864,
  "max_position_embeddings": 32768,
  "max_window_layers": 24,
  "model_type": "qwen2",
  "num_attention_heads": 14,
  "num_hidden_layers": 24,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.46.3",
  "use_cache": true,
  "use_mrope": false,
  "use_sliding_window": false,
  "vocab_size": 151936
}
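
The dimensions in this config can be used to sanity-check the adapter size seen earlier (`adapter_model.safetensors`, listed as 17M). The sketch below counts LoRA parameters under three assumptions that are not confirmed by the source: that `--target_modules all` resolves to the seven linear projections in each decoder layer, that only the LoRA A and B matrices are stored, and that they are saved in 32-bit floats.

```python
def lora_params(in_features, out_features, rank):
    """A is rank x in_features, B is out_features x rank."""
    return rank * (in_features + out_features)


# Values from config.json above; rank from --lora_rank 8.
hidden, inter, heads, kv_heads, layers, rank = 896, 4864, 14, 2, 24, 8
kv_dim = hidden // heads * kv_heads  # head_dim 64 * 2 KV heads = 128

# Assumed set of adapted modules per decoder layer: (in_features, out_features).
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, inter),
    "up_proj": (hidden, inter),
    "down_proj": (inter, hidden),
}

per_layer = sum(lora_params(i, o, rank) for i, o in shapes.values())
total = per_layer * layers
size_mib = total * 4 / 2**20  # 4 bytes per float32 parameter
print(per_layer, total, round(size_mib, 1))
```

Under these assumptions the estimate comes to about 4.4 million parameters and roughly 16.8 MiB, which is close to the 17M shown in the directory listings.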


Stage 2 SFT training is complete.

# Stage 3: DPO(Direct Preference Optimization)

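Stage 3: DPO (Direct Preference Optimization). DPO controls model behavior by optimizing the language model directly on preference data, without a complex reinforcement learning loop, and it can still learn human preferences effectively. Compared with RLHF, DPO is easier to implement and train, and tends to give better results.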
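
For a single preference pair, the DPO objective compares how much more the policy prefers the chosen response over the rejected one, relative to a frozen reference model. The sketch below computes that loss from sequence log-probabilities in plain Python. It illustrates the published DPO formula and is not the implementation used by `dpo_training.py` or the `trl` library; the value `beta=0.1` and the log-probabilities are illustrative assumptions.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)])."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # log(1 + exp(-m)) is the same as -log(sigmoid(m)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy raises the chosen response and lowers the rejected one: lower loss.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```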

| Stage 3: Direct Preference Optimization        |  [dpo_training.py](https://github.com/shibing624/MedicalGPT/blob/main/dpo_training.py) | [run_dpo.sh](https://github.com/shibing624/MedicalGPT/blob/main/run_dpo.sh)    |

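#### Notes:
To verify quickly that the training code works, the notebook/Colab code below uses a small generative model and a small sample dataset. For real use, switch to a larger model and dataset to get better results.

1. Generative model: `Qwen/Qwen2.5-0.5B`, or the SFT model produced in Stage 2
2. Dataset: the DPO stage uses a 500-example sample of medical reward data, located in the `data/reward` folder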

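## Stage 3: Let's Get Started

The training steps are:

1. Check the training set
2. Run the training script

The training script does the following:
1. Import dependencies
2. Set parameters
3. Define the helper functions and load the training set
4. Load the model and tokenizer
5. Train and evaluate
6. Inspect the training results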

In [16]:
%ls ./data/reward/

dpo_zh_500.jsonl


In [17]:
!python dpo_training.py \
    --model_name_or_path ./merged-sft \
    --template_name qwen \
    --train_file_dir ./data/reward \
    --validation_file_dir ./data/reward \
    --per_device_train_batch_size 3 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --do_eval \
    --use_peft True \
    --max_train_samples 1000 \
    --max_eval_samples 500 \
    --max_steps 100 \
    --eval_steps 10 \
    --save_steps 50 \
    --max_source_length 256 \
    --max_target_length 256 \
    --output_dir outputs-dpo-v1 \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype bfloat16 \
    --bf16 True \
    --fp16 False \
    --device_map auto \
    --report_to tensorboard \
    --remove_unused_columns False \
    --gradient_checkpointing True \
    --cache_dir ./cache

2025-12-15 04:31:50.587736: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-12-15 04:31:50.604859: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1765773110.625802    7113 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1765773110.632175    7113 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1765773110.648324    7113 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [18]:
%ls -lh outputs-dpo-v1

ls: cannot access 'outputs-dpo-v1': No such file or directory
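
The listing above reports that `outputs-dpo-v1` does not exist in this run, so the merge step that follows would have no adapter to load. A small guard like the one sketched below can catch this before calling `merge_peft_adapter.py`. The `adapter_ready` helper is hypothetical; the file names are the ones listed for the earlier LoRA output directories.

```python
from pathlib import Path


def adapter_ready(lora_dir):
    """True if the directory exists and holds the LoRA config and weights."""
    d = Path(lora_dir)
    required = ["adapter_config.json", "adapter_model.safetensors"]
    return d.is_dir() and all((d / name).is_file() for name in required)


if adapter_ready("outputs-dpo-v1"):
    print("adapter found; safe to run merge_peft_adapter.py")
else:
    print("outputs-dpo-v1 is missing or incomplete; "
          "check the dpo_training.py log first")
```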


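Training results:
- When training with LoRA, the saved LoRA weights are in `adapter_model.safetensors` and the LoRA configuration is in `adapter_config.json`. See `merge_peft_adapter.py` for how to merge them into the base model.
- Logs are saved under `output_dir/runs` and can be viewed with TensorBoard: `tensorboard --logdir output_dir/runs --host 0.0.0.0 --port 8009`

To merge the LoRA weights into the base model, run the command below. The merged model is saved in the directory given by `--output_dir`: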

In [None]:
!python merge_peft_adapter.py \
    --base_model merged-sft --lora_model outputs-dpo-v1 --output_dir merged-dpo/

In [None]:
%ls -lh merged-dpo/

In [None]:
%cat merged-dpo/config.json

Stage 3 preference modeling, first training run, is complete.

**This completes the demonstration of a full training pipeline.**

# Test

In [None]:
!python inference.py --base_model merged-dpo
# or run it in a shell
# python inference.py --base_model merged-dpo --interactive

Input:介绍下南京
Response:  南京市位于江苏省西南部，是全国首批历史文化名城、国家中心城市和自由贸易试验区。

End.
