<a href="https://colab.research.google.com/github/xushilundao/aiAllinone/blob/main/%E2%80%9C%E4%BD%BF%E7%94%A8_LLaMA_Factory_%E5%BE%AE%E8%B0%83_Llama3_ipynb%E2%80%9D%E7%9A%84%E5%89%AF%E6%9C%AC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning a Llama-3 Chinese Chat Model with LLaMA Factory

Request a free T4 GPU to run this notebook.

Project homepage: https://github.com/hiyouga/LLaMA-Factory


## Install LLaMA Factory Dependencies

In [None]:
%cd /content/
%rm -rf LLaMA-Factory
!git clone https://github.com/hiyouga/LLaMA-Factory.git
%cd LLaMA-Factory
%ls
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers==0.0.25
!pip install .[bitsandbytes]

### Check the GPU Environment

Tutorial for requesting a free T4 GPU: https://zhuanlan.zhihu.com/p/642542618

In [2]:
import torch
try:
  assert torch.cuda.is_available() is True
except AssertionError:
  print("A GPU environment is required. Tutorial: https://zhuanlan.zhihu.com/p/642542618")

## Update the Self-Cognition (Identity) Dataset

You can freely change the values of the NAME and AUTHOR variables.

In [3]:
import json

%cd /content/LLaMA-Factory/

NAME = "Llama-Chinese"
AUTHOR = "AI首席科学家林"

with open("data/identity.json", "r", encoding="utf-8") as f:
  dataset = json.load(f)

for sample in dataset:
  sample["output"] = sample["output"].replace("{{"+ "name" + "}}", NAME).replace("{{"+ "author" + "}}", AUTHOR)

with open("data/identity.json", "w", encoding="utf-8") as f:
  json.dump(dataset, f, indent=2, ensure_ascii=False)

/content/LLaMA-Factory
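
As a minimal sketch of what the cell above does, the snippet below applies the same placeholder replacement to a single in-memory sample. The sample text and the helper name `fill_identity` are made up for illustration and are not taken from `data/identity.json`; only the `{{name}}` and `{{author}}` placeholders and the two values come from the notebook.

```python
# Hypothetical sample; the real entries live in data/identity.json.
sample = {
    "instruction": "Who are you?",
    "input": "",
    "output": "I am {{name}}, an AI assistant developed by {{author}}.",
}


def fill_identity(text: str, name: str, author: str) -> str:
    """Replace the {{name}} and {{author}} placeholders in one string."""
    return text.replace("{{name}}", name).replace("{{author}}", author)


filled = fill_identity(sample["output"], "Llama-Chinese", "AI首席科学家林")
print(filled)
```

Because the cell overwrites `data/identity.json` in place, the placeholders are gone after the first run. Changing NAME or AUTHOR afterwards has no effect unless the file is restored, for example by re-running the install cell, which clones the repository again.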


## Fine-Tune the Model with the LLaMA Board Web UI

In [None]:
%cd /content/LLaMA-Factory/
!GRADIO_SHARE=1 llamafactory-cli webui

## Fine-Tune the Model from the Command Line

Fine-tuning takes about 30 minutes.

In [6]:
import json

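args = dict(
  stage="sft",                        # supervised fine-tuning (SFT) on instructions
  do_train=True,
  model_name_or_path="unsloth/llama-3-8b-Instruct-bnb-4bit", # 4-bit quantized Llama-3-8b-Instruct model
  dataset="identity,alpaca_gpt4_en,alpaca_gpt4_zh",      # alpaca and self-cognition (identity) datasets
  template="llama3",                     # use the llama3 prompt template
  finetuning_type="lora",                   # use LoRA adapters to save GPU memory
  lora_target="all",                     # attach LoRA adapters to all linear layers
  output_dir="llama3_lora",                  # where the LoRA adapter is saved
  per_device_train_batch_size=2,               # batch size per device
  gradient_accumulation_steps=4,               # gradient accumulation steps
  lr_scheduler_type="cosine",                 # cosine learning-rate annealing schedule
  logging_steps=10,                      # log every 10 steps
  warmup_ratio=0.1,                      # learning-rate warmup over the first 10% of steps
  save_steps=1000,                      # save a checkpoint every 1000 steps
  learning_rate=5e-5,                     # learning rate
  num_train_epochs=3.0,                    # number of training epochs
  max_samples=300,                      # use up to 300 samples from each dataset
  max_grad_norm=1.0,                     # clip the gradient norm to 1.0
  quantization_bit=4,                     # use 4-bit QLoRA
  loraplus_lr_ratio=16.0,                   # use the LoRA+ algorithm with lambda=16.0
  use_unsloth=True,                      # use UnslothAI's LoRA optimizations for about 2x faster training
  fp16=True,                         # float16 mixed-precision training
)

# Write the config with a context manager so the file is flushed and closed
# before the CLI reads it.
with open("train_llama3.json", "w", encoding="utf-8") as f:
  json.dump(args, f, indent=2)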

%cd /content/LLaMA-Factory/

!llamafactory-cli train train_llama3.json

/content/LLaMA-Factory
2024-05-08 02:00:25.252524: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-08 02:00:25.252579: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-08 02:00:25.253881: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
05/08/2024 02:00:31 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16
tokenizer_config.json: 100% 51.1k/51.1k [00:00<00:00, 5.46MB/s]
tokenizer.json: 100% 9.09M/9.09M [00:00<00:00, 48.6MB/s]
special_tokens_map.json: 100% 464/464 [00:00<00:00, 3.79MB/s]
[IN
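
To get a feel for how the batch-size, epoch, and warmup settings in the training cell interact, the arithmetic below estimates the number of optimizer steps. It is a rough sketch: it assumes a single GPU (the log reports `n_gpu: 1`), assumes every dataset contributes the full `max_samples=300` examples (a dataset with fewer entries contributes fewer, so this is an upper bound), and rounds up where the trainer may round differently.

```python
import math

# Values copied from the training arguments above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 3.0
warmup_ratio = 0.1
save_steps = 1000
max_samples = 300
num_datasets = 3  # identity, alpaca_gpt4_en, alpaca_gpt4_zh

# Assumption: each dataset yields exactly max_samples examples, single GPU.
num_examples = max_samples * num_datasets
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = int(steps_per_epoch * num_train_epochs)
warmup_steps = math.ceil(total_steps * warmup_ratio)

print("effective batch size:", effective_batch)
print("optimizer steps per epoch:", steps_per_epoch)
print("total optimizer steps:", total_steps)
print("warmup steps:", warmup_steps)
print("reaches save_steps:", total_steps >= save_steps)
```

Under these assumptions the run finishes in far fewer steps than `save_steps=1000`, so no intermediate checkpoint would be expected; the adapter in `llama3_lora` would come from the save at the end of training.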

## Model Inference

In [None]:
from llmtuner.chat import ChatModel
from llmtuner.extras.misc import torch_gc

%cd /content/LLaMA-Factory/

args = dict(
  model_name_or_path="unsloth/llama-3-8b-Instruct-bnb-4bit", # 4-bit quantized Llama-3-8b-Instruct model
  adapter_name_or_path="llama3_lora",            # load the LoRA adapter saved earlier
  template="llama3",                     # must match training
  finetuning_type="lora",                  # must match training
  quantization_bit=4,                    # load the model in 4-bit quantization
  use_unsloth=True,                     # use UnslothAI's LoRA optimizations for about 2x faster inference
)
chat_model = ChatModel(args)

messages = []
print("Use `clear` to clear the conversation history, and `exit` to quit.")
while True:
  query = input("\nUser: ")
  if query.strip() == "exit":
    break
  if query.strip() == "clear":
    messages = []
    torch_gc()
    print("Conversation history cleared")
    continue

  messages.append({"role": "user", "content": query})
  print("Assistant: ", end="", flush=True)

  response = ""
  for new_text in chat_model.stream_chat(messages):
    print(new_text, end="", flush=True)
    response += new_text
  print()
  messages.append({"role": "assistant", "content": response})

torch_gc()

/content/LLaMA-Factory


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
[INFO|tokenization_utils_base.py:2087] 2024-05-08 02:30:21,335 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/89e4fd4e68bf61861110149fa59990e3bbcab6eb/tokenizer.json
[INFO|tokenization_utils_base.py:2087] 2024-05-08 02:30:21,337 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2087] 2024-05-08 02:30:21,343 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/89e4fd4e68bf618611101

05/08/2024 02:30:22 - INFO - llmtuner.data.template - Replace eos token: <|eot_id|>


INFO:llmtuner.data.template:Replace eos token: <|eot_id|>
[INFO|configuration_utils.py:726] 2024-05-08 02:30:22,204 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/89e4fd4e68bf61861110149fa59990e3bbcab6eb/config.json
[INFO|configuration_utils.py:789] 2024-05-08 02:30:22,207 >> Model config LlamaConfig {
  "_name_or_path": "unsloth/llama-3-8b-Instruct-bnb-4bit",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "b

05/08/2024 02:30:22 - INFO - llmtuner.model.utils.quantization - Loading ?-bit BITSANDBYTES-quantized model.


INFO:llmtuner.model.utils.quantization:Loading ?-bit BITSANDBYTES-quantized model.


05/08/2024 02:30:22 - INFO - llmtuner.model.patcher - Using KV cache for faster generation.


INFO:llmtuner.model.patcher:Using KV cache for faster generation.


05/08/2024 02:30:22 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA


INFO:llmtuner.model.adapter:Fine-tuning method: LoRA
[INFO|configuration_utils.py:726] 2024-05-08 02:30:22,512 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/89e4fd4e68bf61861110149fa59990e3bbcab6eb/config.json
[INFO|configuration_utils.py:789] 2024-05-08 02:30:22,514 >> Model config LlamaConfig {
  "_name_or_path": "unsloth/llama-3-8b-Instruct-bnb-4bit",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat

==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


[INFO|modeling_utils.py:3429] 2024-05-08 02:30:22,768 >> loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/89e4fd4e68bf61861110149fa59990e3bbcab6eb/model.safetensors
[INFO|modeling_utils.py:1494] 2024-05-08 02:30:22,816 >> Instantiating LlamaForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:928] 2024-05-08 02:30:22,823 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}

[INFO|modeling_utils.py:4170] 2024-05-08 02:30:42,524 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4178] 2024-05-08 02:30:42,536 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at unsloth/llama-3-8b-Instruct-bnb-4bit.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training

05/08/2024 02:30:45 - INFO - llmtuner.model.adapter - Loaded adapter(s): llama3_lora


INFO:llmtuner.model.adapter:Loaded adapter(s): llama3_lora


05/08/2024 02:30:45 - INFO - llmtuner.model.loader - all params: 8051232768


INFO:llmtuner.model.loader:all params: 8051232768


Use `clear` to clear the conversation history, and `exit` to quit.
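
The `LlamaConfig` dump in the log above gives enough information to estimate how many parameters the LoRA adapter trains. The sketch below assumes a LoRA rank of 8 (the notebook never sets a rank, so this is an assumption about the default) and assumes that `lora_target="all"` adapts the seven projection matrices in each decoder block and nothing else.

```python
# Values copied from the LlamaConfig dump in the log above.
hidden_size = 4096
intermediate_size = 14336
num_attention_heads = 32
num_key_value_heads = 8
num_hidden_layers = 32

# Assumption: LoRA rank 8; the notebook does not set a rank explicitly.
lora_rank = 8

head_dim = hidden_size // num_attention_heads
kv_dim = head_dim * num_key_value_heads  # grouped-query attention shrinks K and V

# (in_features, out_features) of each linear layer in one decoder block.
linear_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_dim),
    "v_proj": (hidden_size, kv_dim),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA pair adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, lora_rank) for i, o in linear_shapes.values())
lora_total = per_layer * num_hidden_layers

all_params = 8_051_232_768  # "all params" reported in the log above
print(f"LoRA parameters: {lora_total:,}")
print(f"share of all parameters: {lora_total / all_params:.2%}")
```

Under these assumptions the adapter holds roughly 21 million parameters, a small fraction of the total the loader reports, which is why LoRA fine-tuning fits on a T4.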


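## Merge the LoRA Weights and Upload the Model

Note: the free tier of Colab provides only 12 GB of system RAM, while merging the LoRA weights of an 8B model requires at least 18 GB of system RAM, so you **cannot** run this step on the free tier.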
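
A back-of-the-envelope estimate shows why the merge does not fit. The sketch assumes the non-quantized weights are held in 16-bit precision (2 bytes per parameter) and counts only the weights themselves; the gap up to the 18 GB figure would be overhead such as temporary buffers, which this estimate does not model.

```python
num_params = 8_051_232_768   # "all params" from the inference log (base model plus LoRA)
bytes_per_param = 2          # assumption: float16 / bfloat16 weights
colab_free_ram_gb = 12       # system RAM on the free Colab tier, per the note above

weights_bytes = num_params * bytes_per_param
weights_gb = weights_bytes / 1e9      # decimal gigabytes
weights_gib = weights_bytes / 2**30   # binary gibibytes

fits = weights_gb <= colab_free_ram_gb
print(f"weights alone: {weights_gb:.1f} GB ({weights_gib:.1f} GiB)")
print("fits in free-tier RAM:", fits)
```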

In [None]:
!huggingface-cli login

In [None]:
import json

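args = dict(
  model_name_or_path="meta-llama/Meta-Llama-3-8B-Instruct", # official non-quantized Llama-3-8B-Instruct model
  adapter_name_or_path="llama3_lora",            # load the LoRA adapter saved earlier
  template="llama3",                     # must match training
  finetuning_type="lora",                  # must match training
  export_dir="llama3_lora_merged",              # directory for the merged model
  export_size=2,                       # size of each weight file of the merged model (in GB)
  export_device="cpu",                    # device used for merging: `cpu` or `cuda`
  #export_hub_model_id="your_id/your_model",         # HuggingFace model ID used for uploading
)

# Write the config with a context manager so the file is flushed and closed
# before the CLI reads it.
with open("merge_llama3.json", "w", encoding="utf-8") as f:
  json.dump(args, f, indent=2)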

%cd /content/LLaMA-Factory/

!llamafactory-cli export merge_llama3.json