# LLaMA-Factory


**README.md**
https://github.com/hiyouga/LLaMA-Factory/blob/main/README_zh.md

**data/README.md**
https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README_zh.md

**examples/README.md**
https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/README_zh.md


**Official notebooks**

[PAI-DSW - LLaMA Factory: fine-tune a LLaMA3 model for role-play](https://gallery.pai-ml.com/#/preview/deepLearning/nlp/llama_factory)

[Colab - Fine-tune a Llama-3 Chinese chat model with LLaMA Factory](https://colab.research.google.com/drive/1d5KQtbemerlSDSxZIfAaWXhKr30QypiK?usp=sharing)


**Tutorials recommended by the author**

[Zhihu - LLaMA-Factory QuickStart](https://zhuanlan.zhihu.com/p/695287607)

[LLM basics: GPU memory and speed required for inference](https://techdiylife.github.io/blog/blog.html?category1=c01&blogid=0058) ⬅️ read this alongside README.md

[Tutorial on using LLaMA-Factory from the Qwen2 docs (includes a quantization tutorial)](https://qwen.readthedocs.io/en/latest/training/SFT/llama_factory.html)

[Fine-tune Qwen2-7B in just 30 minutes and build your own AI customer-service solution](https://mp.weixin.qq.com/s/Pb-l4vON8PvgXwRwgOt2FQ)
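
As a rough companion to the memory notes linked above, the weights-only footprint of a model can be estimated as parameter count times bytes per parameter. The helper below is a back-of-the-envelope sketch, not part of LLaMA-Factory; the function name is made up for illustration, and it ignores the KV cache, activations, and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights only, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


# Hypothetical example: a 7-billion-parameter model at several precisions
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

Halving the bit width halves the weight memory, which is the main reason the 4-bit options later in this notebook fit on smaller GPUs.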

In [None]:
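import torch

# Print each value explicitly; a notebook cell only echoes its last expression.
print(torch.cuda.current_device())    # index of the active CUDA device
print(torch.cuda.get_device_name(0))  # name of GPU 0
print(torch.__version__)              # installed PyTorch version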

In [None]:
! git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
# `! cd` runs in a throwaway subshell; `%cd` changes the notebook's working directory
%cd LLaMA-Factory
! pip install -e ".[torch,metrics,deepspeed,bitsandbytes,vllm,gptq,awq,aqlm,qwen]"

# available extras: torch, torch_npu, metrics, deepspeed, bitsandbytes, vllm, galore, badam, gptq, awq, aqlm, qwen, modelscope, quality

# ! pip install flash-attn --no-build-isolation # CUDA > 11.6

# show llamafactory version
! llamafactory-cli version

# export GRADIO_SHARE=1
! llamafactory-cli webui

## Download models

If you have trouble with downloading models and datasets from Hugging Face, you can use ModelScope.

```bash
export USE_MODELSCOPE_HUB=1 # `set USE_MODELSCOPE_HUB=1` for Windows
```

In [None]:
from huggingface_hub import snapshot_download
# Hugging Face: https://huggingface.co/
# Find the model's repo ID there and change model_path accordingly
model_path = "baichuan-inc/Baichuan-13B-Chat"
cache_dir = "/root/autodl-tmp/Baichuan-13B-Chat"

snapshot_download(repo_id=model_path, local_dir=cache_dir, local_dir_use_symlinks=False)

print("done")

In [None]:
from modelscope import snapshot_download
# ModelScope: https://modelscope.cn/home
# Find the model's path there and change model_path accordingly
#model_path="ZhipuAI/glm-4-9b-chat"
model_path = "qwen/Qwen2-7B-Instruct"

cache_path="/root/autodl-tmp"
snapshot_download(model_path, cache_dir=cache_path)

print("done")

## Modify identity.json

Then register the new dataset in dataset_info.json.

**data/README.md**
https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README_zh.md
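
Registering a dataset means adding one entry to `data/dataset_info.json`. The sketch below assumes the simplest alpaca-style case, where an entry only needs a `file_name`; other formats need extra fields, which are described in data/README.md. The helper name `register_dataset` and the dataset name `identity_new` are illustrative choices, not part of LLaMA-Factory.

```python
import json
from pathlib import Path


def register_dataset(info_path: str, name: str, file_name: str) -> dict:
    """Add (or overwrite) one entry in dataset_info.json and return the updated mapping."""
    path = Path(info_path)
    # Start from the existing registry if there is one, otherwise from an empty mapping
    info = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    info[name] = {"file_name": file_name}
    path.write_text(json.dumps(info, indent=2, ensure_ascii=False), encoding="utf-8")
    return info


# Example (assumes the working directory is the LLaMA-Factory root):
# register_dataset("data/dataset_info.json", "identity_new", "identity_new.json")
```

After registration, the name used as the key (here `identity_new`) is what goes into the `dataset` training argument.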

In [None]:
import json

%cd /content/LLaMA-Factory/

NAME = "AI_NAME"
AUTHOR = ""

identity_json_path = 'data/identity.json'
identity_json_output = 'data/identity_new.json'

with open(identity_json_path, "r", encoding="utf-8") as f:
    dataset = json.load(f)

for sample in dataset:
    # Fill the {{name}} and {{author}} placeholders in every answer
    sample["output"] = sample["output"].replace("{{name}}", NAME).replace("{{author}}", AUTHOR)

with open(identity_json_output, "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=2, ensure_ascii=False)

## Training parameters

(Continued) pre-training

```bash
llamafactory-cli train examples/train_lora/llama3_lora_pretrain.yaml
```

Supervised fine-tuning (SFT)

```bash
llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
```

KTO training
```bash
llamafactory-cli train examples/train_lora/llama3_lora_kto.yaml
```

Preprocess the dataset: set `tokenized_path` to load the preprocessed dataset
```bash
llamafactory-cli train examples/train_lora/llama3_preprocess.yaml
```


SFT with 4/8-bit Bitsandbytes quantization (recommended)
```bash
llamafactory-cli train examples/train_qlora/llama3_lora_sft_bitsandbytes.yaml
```


SFT with 4/8-bit GPTQ quantization
```bash
llamafactory-cli train examples/train_qlora/llama3_lora_sft_gptq.yaml
```

SFT with 4-bit AWQ quantization
```bash 
llamafactory-cli train examples/train_qlora/llama3_lora_sft_awq.yaml
```

SFT with 2-bit AQLM quantization
```bash
llamafactory-cli train examples/train_qlora/llama3_lora_sft_aqlm.yaml
```


Full-parameter SFT on a single node
```bash
FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/llama3_full_sft_ds3.yaml
```

In [None]:
import json

args = dict(
    stage="sft",                                              # run supervised fine-tuning
    do_train=True,
    model_name_or_path="unsloth/Qwen2-7B-Instruct-bnb-4bit",  # 4-bit quantized Qwen2-7B-Instruct
    dataset="identity,bajigo",                                # the bajigo and self-cognition (identity) datasets
    template="qwen",                                          # Qwen2 prompt template
    finetuning_type="lora",                                   # LoRA adapters to save GPU memory
    lora_target="all",                                        # attach LoRA adapters to all linear layers
    output_dir="qwen2_lora",                                  # where the LoRA adapter is saved
    per_device_train_batch_size=2,                            # batch size per device
    gradient_accumulation_steps=4,                            # gradient accumulation steps
    lr_scheduler_type="cosine",                               # cosine learning-rate annealing
    logging_steps=10,                                         # log every 10 steps
    warmup_ratio=0.1,                                         # learning-rate warmup
    save_steps=1000,                                          # save a checkpoint every 1000 steps
    learning_rate=5e-5,                                       # learning rate
    num_train_epochs=3.0,                                     # number of training epochs
    max_samples=300,                                          # use 300 samples from each dataset
    max_grad_norm=1.0,                                        # gradient-clipping threshold
    quantization_bit=4,                                       # 4-bit QLoRA (optional, for the 4-bit quantized model)
    loraplus_lr_ratio=16.0,                                   # LoRA+ with lambda=16.0 (optional, for the 4-bit quantized model)
    fp16=True,                                                # float16 mixed precision (optional, for the 4-bit quantized model)
)

# Write the arguments to a JSON file and close the file handle afterwards
with open("bajigo.json", "w", encoding="utf-8") as f:
    json.dump(args, f, indent=2)

# %cd /content/LLaMA-Factory/

# !llamafactory-cli train bajigo.json  # start supervised fine-tuning
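
To see how the batch-size, epoch, and warmup arguments above interact, the following sketch estimates the effective batch size and the number of optimizer steps. It is only an approximation: the helper name is made up, the sample count of 600 is a hypothetical figure (two datasets capped at 300 samples each), and the exact step count reported by the trainer can differ slightly depending on the library version and how the last partial batch is handled.

```python
import math


def training_schedule(num_samples, per_device_batch_size, grad_accum_steps,
                      num_epochs, warmup_ratio, num_devices=1):
    """Estimate effective batch size, optimizer steps, and warmup steps."""
    # Samples consumed per optimizer update across all devices
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = math.ceil(steps_per_epoch * num_epochs)
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Values from the args above, with a hypothetical 600 training samples on one GPU
print(training_schedule(600, 2, 4, 3.0, 0.1))
```

With these numbers `save_steps=1000` is larger than the estimated total, so no intermediate checkpoint would be written before training ends; lowering it is worth considering for short runs.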


In [None]:
# Shell script from the Qwen2 documentation (run it in a terminal, not as Python)

DISTRIBUTED_ARGS="
    --nproc_per_node $NPROC_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
  "

torchrun $DISTRIBUTED_ARGS src/train.py \
    --deepspeed $DS_CONFIG_PATH \
    --stage sft \
    --do_train \
    --use_fast_tokenizer \
    --flash_attn \
    --model_name_or_path $MODEL_PATH \
    --dataset your_dataset \
    --template qwen \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir $OUTPUT_PATH \
    --overwrite_cache \
    --overwrite_output_dir \
    --warmup_steps 100 \
    --weight_decay 0.1 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --ddp_timeout 9000 \
    --learning_rate 5e-6 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --cutoff_len 4096 \
    --save_steps 1000 \
    --plot_loss \
    --num_train_epochs 3 \
    --bf16

## Inference

Use the command-line interface
```bash
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
```

Use the browser interface
```bash
llamafactory-cli webchat examples/inference/llama3_lora_sft.yaml
```

Launch an OpenAI-style API
```bash
llamafactory-cli api examples/inference/llama3_lora_sft.yaml
```

In [None]:
%cd /content/LLaMA-Factory/
import sys
import os

# Get the current working directory
current_path = os.getcwd()

# Build the path to the src directory under the current working directory
src_path = os.path.join(current_path, 'src')

# Put the src directory at the front of sys.path
sys.path.insert(0, src_path)

from llamafactory.chat import ChatModel
from llamafactory.extras.misc import torch_gc

torch_gc()
args = dict(
    model_name_or_path="unsloth/Qwen2-7B-Instruct-bnb-4bit",  # 4-bit quantized Qwen2-7B-Instruct
    adapter_name_or_path="qwen2_lora",                        # load the LoRA adapter saved earlier
    template="qwen",                                          # same as in training
    finetuning_type="lora",                                   # same as in training
)
chat_model = ChatModel(args)

messages = []
print("Type `clear` to clear the conversation history, or `exit` to quit.")
while True:
    query = input("\nUser: ")
    if query.strip() == "exit":
        break
    if query.strip() == "clear":
        messages = []
        torch_gc()
        print("Conversation history cleared")
        continue

    messages.append({"role": "user", "content": query})
    print("AI: ", end="", flush=True)

    response = ""
    for new_text in chat_model.stream_chat(messages):
        print(new_text, end="", flush=True)
        response += new_text

    # Still inside the loop: end the line and record the reply in the history
    print()
    messages.append({"role": "assistant", "content": response})

## Merge LoRA adapters and quantize

Merge LoRA adapters
```bash
llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
```

Quantize the model with AutoGPTQ
```bash
llamafactory-cli export examples/merge_lora/llama3_gptq.yaml
```


```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli export \
    --model_name_or_path path_to_base_model \
    --adapter_name_or_path path_to_adapter \
    --template qwen \
    --finetuning_type lora \
    --export_dir path_to_export \
    --export_size 2 \
    --export_legacy_format False
```

In [None]:
# AWQ quantization: https://qwen.readthedocs.io/en/latest/quantization/awq.html

In [None]:
# GPTQ quantization: https://qwen.readthedocs.io/en/latest/quantization/gptq.html

In [None]:
# GGUF quantization: https://qwen.readthedocs.io/en/latest/quantization/gguf.html

## Evaluation

Evaluate on MMLU/CMMLU/C-Eval

```bash
llamafactory-cli eval examples/train_lora/llama3_lora_eval.yaml
```

Run batch prediction and compute BLEU and ROUGE scores
```bash
llamafactory-cli train examples/train_lora/llama3_lora_predict.yaml
```
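
For intuition about what a ROUGE score measures, here is a minimal ROUGE-1 sketch based on clipped unigram overlap with whitespace tokenization. It is an illustration only, not the implementation LLaMA-Factory uses for the prediction run above; the function name and the example sentences are made up.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """ROUGE-1 precision, recall, and F1 from clipped unigram overlap."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Counter intersection keeps the minimum count per token (clipping)
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Two of the three words match in each direction
print(rouge1("the cat sat", "the cat ran"))
```

Whitespace splitting does not work for Chinese text, which needs a word segmenter before counting tokens.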

## Deployment (vLLM, Xinference)


Deploy with OpenAI-style API and vLLM
```bash
API_PORT=8000 llamafactory-cli api examples/inference/llama3_vllm.yaml
```
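
Once the server is running, any OpenAI-compatible client can talk to it. The sketch below uses only the standard library and assumes the server listens on `localhost:8000` (matching `API_PORT=8000` above) and exposes the usual `/v1/chat/completions` route. The helper names and the `model` value are placeholders chosen for illustration.

```python
import json
import urllib.request


def build_chat_request(messages, base_url="http://localhost:8000/v1",
                       model="placeholder-model"):
    """Build a POST request for an OpenAI-style chat completion endpoint."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60.0):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


request = build_chat_request([{"role": "user", "content": "Hello"}])
print(request.full_url)
# reply = send_chat_request(request)  # uncomment once the API server is running
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.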

In [None]:
# https://qwen.readthedocs.io/en/latest/deployment/vllm.html

## Ollama + GGUF

Create a `Modelfile`:

```text
FROM qwen2-7b-instruct-q5_0.gguf

# set the temperature to 0.7 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER repeat_penalty 1.05
PARAMETER top_k 20

TEMPLATE """{{ if and .First .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}"""

# set the system message
SYSTEM """
You are a helpful assistant.
"""
```

```bash
ollama create qwen2_7b -f Modelfile
ollama run qwen2_7b
```
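
To make the `TEMPLATE` in the Modelfile above easier to read, this sketch reproduces by hand the prompt string it would produce for a single turn. It is a plain-Python approximation of the Go template, written for illustration; the function name is made up, and Ollama's own rendering remains the authoritative behavior.

```python
def render_chatml(prompt: str, system: str = "", first: bool = True) -> str:
    """Mimic the Modelfile TEMPLATE: optional system block, user turn, assistant header."""
    parts = []
    # The system block is only emitted on the first turn and when a system message exists
    if first and system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{prompt}<|im_end|>\n")
    # The model's response is generated right after this header
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml("Hi!", system="You are a helpful assistant."))
```

The `<|im_start|>` and `<|im_end|>` markers are the same ChatML delimiters that the `qwen` template uses during fine-tuning, so the served model sees prompts in the format it was trained on.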