# Finetune Llama-3 with LLaMA Factory

Please use a **free** Tesla T4 Colab GPU to run this!

Project homepage: https://github.com/hiyouga/LLaMA-Factory

## Install Dependencies

In [1]:
# %cd /content/
# %rm -rf LLaMA-Factory
!git clone https://github.com/hiyouga/LLaMA-Factory.git

fatal: destination path 'LLaMA-Factory' already exists and is not an empty directory.
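
The `fatal` message above appears when the clone target already exists, which is presumably why the cell carries a commented-out `%rm -rf`. A minimal sketch of a re-runnable alternative, assuming it is acceptable to reuse an existing checkout. `clone_if_missing` and its `runner` parameter are hypothetical helpers, not part of LLaMA Factory:

```python
import subprocess
from pathlib import Path


def clone_if_missing(url, dest, runner=subprocess.run):
    """Clone `url` into `dest` unless `dest` already exists and is non-empty.

    Returns True if a clone was attempted, False if it was skipped.
    `runner` is injectable so the logic can be exercised without network access.
    """
    dest = Path(dest)
    if dest.is_dir() and any(dest.iterdir()):
        # Same situation git reports as "already exists and is not an empty directory".
        return False
    runner(["git", "clone", url, str(dest)], check=True)
    return True
```

Calling `clone_if_missing("https://github.com/hiyouga/LLaMA-Factory.git", "LLaMA-Factory")` a second time would then skip the clone instead of failing.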


In [2]:
%cd LLaMA-Factory
%ls

/home/anonymous/桌面/NursingLLM/LLaMA-Factory
[0m[01;34massets[0m/             [01;34mevaluation[0m/   pyproject.toml    setup.py
[01;34mcache[0m/              [01;34mexamples[0m/     README.md         [01;34msrc[0m/
CITATION.cff        LICENSE       README_zh.md      [01;34mtests[0m/
[01;34mdata[0m/               [01;34mllama3_lora[0m/  requirements.txt  train_llama3.json
docker-compose.yml  Makefile      [01;34msaves[0m/
Dockerfile          MANIFEST.in   [01;34mscripts[0m/


In [3]:
!pip install -e .[torch,bitsandbytes]

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Obtaining file:///home/anonymous/%E6%A1%8C%E9%9D%A2/NursingLLM/LLaMA-Factory
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Checking if build backend supports build_editable ... done
Building wheels for collected packages: llamafactory
  Building editable for llamafactory (pyproject.toml) ... done
  Created wheel for llamafactory: filename=llamafactory-0.8.3.dev0-0.editable-py3-none-any.whl size=18879 sha256=b7a4b02efc0c536dc92adaf181102558fb9fd068748efa135cf64fc49ae11217
  Stored in directory: /tmp/pip-ephem-wheel-cache-u1crmi19/wheels/e9/6b/fa/f360eef24614aacaf8dd8b4caafdd37ba9978ef16df86d83ec
Successfully built llamafactory
Installing collected packages: llamafactory
  Attempting uninstall: llamafact
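
To confirm that the editable install succeeded, the installed version can be read from the package metadata. A small sketch using only the standard library. `installed_version` is a hypothetical helper, and the distribution name `llamafactory` is taken from the pip log above:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


print("llamafactory:", installed_version("llamafactory"))
```

A result of `None` would suggest the install did not complete in the active environment.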

### Check GPU environment

In [4]:
import torch

# Warn early if PyTorch cannot see a CUDA device; the training steps below need one.
if not torch.cuda.is_available():
  print("Please set up a GPU before using LLaMA Factory: https://medium.com/mlearning-ai/training-yolov4-on-google-colab-316f8fff99c6")
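
Beyond a yes/no check, it can help to see which GPU was assigned and how much memory it has. A minimal sketch, assuming only device 0 matters. `describe_gpu` is a hypothetical helper, and PyTorch is imported lazily so the function still returns a message when it is missing:

```python
def describe_gpu():
    """Return a one-line summary of the first CUDA device, or the reason none is usable."""
    try:
        import torch  # imported lazily so the function also works without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA device is visible to PyTorch"
    props = torch.cuda.get_device_properties(0)
    return f"{props.name}: {props.total_memory / 1024**3:.1f} GiB"


print(describe_gpu())
```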

## Update Identity Dataset

In [5]:
import json

# Colab path; adjust it if the repository lives elsewhere (see the error in the output below).
%cd /content/LLaMA-Factory/

NAME = "Llama-3"
AUTHOR = "LLaMA Factory"

with open("data/identity.json", "r", encoding="utf-8") as f:
  dataset = json.load(f)

for sample in dataset:
  sample["output"] = sample["output"].replace("{{name}}", NAME).replace("{{author}}", AUTHOR)

with open("data/identity.json", "w", encoding="utf-8") as f:
  json.dump(dataset, f, indent=2, ensure_ascii=False)

[Errno 2] No such file or directory: '/content/LLaMA-Factory/'
/home/anonymous/桌面/NursingLLM/LLaMA-Factory
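
The cell above rewrites `data/identity.json` in place. To see what the substitution does without touching the file, here is a sketch on a single in-memory record. The record is invented for illustration and is not copied from the dataset:

```python
NAME = "Llama-3"
AUTHOR = "LLaMA Factory"

# Illustrative record; the real entries live in data/identity.json.
sample = {
    "instruction": "Who are you?",
    "input": "",
    "output": "I am {{name}}, an AI assistant developed by {{author}}.",
}

filled = sample["output"].replace("{{name}}", NAME).replace("{{author}}", AUTHOR)
print(filled)  # I am Llama-3, an AI assistant developed by LLaMA Factory.
```

Because the placeholders are gone after the first pass, running the substitution again leaves the text unchanged.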


## Fine-tune model via LLaMA Board

In [6]:
# %cd /content/LLaMA-Factory/
# !GRADIO_SHARE=1 llamafactory-cli webui

## Fine-tune model via Command Line

Training takes about 30 minutes.

In [10]:
import json

args = dict(
  stage="sft",                        # do supervised fine-tuning
  do_train=True,
  model_name_or_path="unsloth/llama-3-8b-Instruct-bnb-4bit", # use bnb-4bit-quantized Llama-3-8B-Instruct model
  dataset="identity,alpaca_en_demo",             # use alpaca and identity datasets
  template="llama3",                     # use llama3 prompt template
  finetuning_type="lora",                   # use LoRA adapters to save memory
  lora_target="all",                     # attach LoRA adapters to all linear layers
  output_dir="llama3_lora",                  # the path to save LoRA adapters
  overwrite_output_dir=True,                # overwrite the output directory
  per_device_train_batch_size=2,               # the batch size
  gradient_accumulation_steps=4,               # the gradient accumulation steps
  lr_scheduler_type="cosine",                 # use cosine learning rate scheduler
  logging_steps=10,                      # log every 10 steps
  warmup_ratio=0.1,                      # use warmup scheduler
  save_steps=1000,                      # save checkpoint every 1000 steps
  learning_rate=5e-5,                     # the learning rate
  num_train_epochs=3.0,                    # the epochs of training
  max_samples=500,                      # use 500 examples in each dataset
  max_grad_norm=1.0,                     # clip gradient norm to 1.0
  quantization_bit=4,                     # use 4-bit QLoRA
  loraplus_lr_ratio=16.0,                   # use LoRA+ algorithm with lambda=16.0
  fp16=True,                         # use float16 mixed precision training
)

with open("train_llama3.json", "w", encoding="utf-8") as f:
  json.dump(args, f, indent=2)

# %cd /content/LLaMA-Factory/

!llamafactory-cli train train_llama3.json

06/24/2024 12:34:37 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:23005
W0624 12:34:38.024000 135273406801408 torch/distributed/run.py:757] 
W0624 12:34:38.024000 135273406801408 torch/distributed/run.py:757] *****************************************
W0624 12:34:38.024000 135273406801408 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0624 12:34:38.024000 135273406801408 torch/distributed/run.py:757] *****************************************
06/24/2024 12:34:40 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.float16
06/24/2024 12:34:40 - INFO - llamafactory.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.float16
[INFO|tokenizati
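
The batch settings in the configuration combine with the number of processes shown in the log (ranks 0 and 1). A rough back-of-the-envelope sketch, assuming both datasets contribute their full `max_samples=500`, so the step counts are upper bounds. The exact rounding used by the trainer may differ:

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_processes = 2            # the log above shows ranks 0 and 1
num_train_epochs = 3.0
max_examples = 2 * 500       # upper bound: max_samples=500 for each of two datasets

# Examples consumed per optimizer update across all processes.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_processes
steps_per_epoch = math.ceil(max_examples / effective_batch)
total_steps = int(steps_per_epoch * num_train_epochs)

print(effective_batch, steps_per_epoch, total_steps)  # 16 63 189
```

Under these assumptions the run finishes well before `save_steps=1000`, so no intermediate checkpoint would be expected.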

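The configuration pairs `lr_scheduler_type="cosine"` with `warmup_ratio=0.1`. The sketch below shows the usual shape of such a schedule (linear warmup, then cosine decay to zero). It is written from scratch for illustration rather than copied from any library, and `lr_at` is a hypothetical name. The last lines assume `loraplus_lr_ratio` is the ratio between the learning rate of the LoRA B matrices and the base rate, as in the LoRA+ formulation:

```python
import math


def lr_at(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Peak is reached at the end of warmup (step 10 of 100), then decays to zero.
for step in (0, 5, 10, 55, 100):
    print(step, f"{lr_at(step, 100):.2e}")

# Assumed LoRA+ relation: B matrices train at ratio * base learning rate.
lora_b_lr = 5e-5 * 16.0
print(f"{lora_b_lr:.1e}")
```
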
## Infer the fine-tuned model

In [11]:
from llamafactory.chat import ChatModel
from llamafactory.extras.misc import torch_gc

# Colab path; adjust it if the repository lives elsewhere (see the error in the output below).
%cd /content/LLaMA-Factory/

args = dict(
  model_name_or_path="unsloth/llama-3-8b-Instruct-bnb-4bit", # use bnb-4bit-quantized Llama-3-8B-Instruct model
  adapter_name_or_path="llama3_lora",            # load the saved LoRA adapters
  template="llama3",                     # same as the one used in training
  finetuning_type="lora",                  # same as the one used in training
  quantization_bit=4,                    # load 4-bit quantized model
)
chat_model = ChatModel(args)

messages = []
print("Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.")
while True:
  query = input("\nUser: ")
  if query.strip() == "exit":
    break
  if query.strip() == "clear":
    messages = []
    torch_gc()
    print("History has been removed.")
    continue

  messages.append({"role": "user", "content": query})
  print("Assistant: ", end="", flush=True)

  response = ""
  for new_text in chat_model.stream_chat(messages):
    print(new_text, end="", flush=True)
    response += new_text
  print()
  messages.append({"role": "assistant", "content": response})

torch_gc()

[Errno 2] No such file or directory: '/content/LLaMA-Factory/'
/home/anonymous/桌面/NursingLLM/LLaMA-Factory


[INFO|tokenization_utils_base.py:2108] 2024-06-24 14:37:44,826 >> loading file tokenizer.json from cache at /home/anonymous/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/2950abc9d0b34ddd43fd52bbf0d7dca82807ce96/tokenizer.json
[INFO|tokenization_utils_base.py:2108] 2024-06-24 14:37:44,827 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2108] 2024-06-24 14:37:44,828 >> loading file special_tokens_map.json from cache at /home/anonymous/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/2950abc9d0b34ddd43fd52bbf0d7dca82807ce96/special_tokens_map.json
[INFO|tokenization_utils_base.py:2108] 2024-06-24 14:37:44,829 >> loading file tokenizer_config.json from cache at /home/anonymous/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/2950abc9d0b34ddd43fd52bbf0d7dca82807ce96/tokenizer_config.json


06/24/2024 14:37:45 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>


[INFO|configuration_utils.py:733] 2024-06-24 14:37:45,239 >> loading configuration file config.json from cache at /home/anonymous/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/2950abc9d0b34ddd43fd52bbf0d7dca82807ce96/config.json
[INFO|configuration_utils.py:796] 2024-06-24 14:37:45,242 >> Model config LlamaConfig {
  "_name_or_path": "unsloth/llama-3-8b-Instruct-bnb-4bit",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_qua

06/24/2024 14:37:45 - INFO - llamafactory.model.model_utils.quantization - Loading ?-bit BITSANDBYTES-quantized model.
06/24/2024 14:37:45 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.


[INFO|modeling_utils.py:3474] 2024-06-24 14:37:45,251 >> loading weights file model.safetensors from cache at /home/anonymous/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/2950abc9d0b34ddd43fd52bbf0d7dca82807ce96/model.safetensors
[INFO|modeling_utils.py:1519] 2024-06-24 14:37:45,272 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:962] 2024-06-24 14:37:45,274 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128009
}

[INFO|quantizer_bnb_4bit.py:105] 2024-06-24 14:37:45,329 >> target_dtype {target_dtype} is replaced by `CustomDtype.INT4` for 4-bit BnB quantization
[INFO|modeling_utils.py:4280] 2024-06-24 14:37:47,232 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4288] 2024-06-24 14:37:47,232 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at unsloth/llama-3-8b-Instruct-bnb-4bi

06/24/2024 14:37:47 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
06/24/2024 14:37:47 - INFO - llamafactory.model.adapter - Loaded adapter(s): llama3_lora
06/24/2024 14:37:47 - INFO - llamafactory.model.loader - all params: 8051232768
Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.
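Assistant: Hello, I am Llama-3, an AI assistant developed by LLaMA Factory. How can I help you?
Assistant: As LLaMA-3, I can access LLaMA Factory's datasets, including but not limited to the following:

1. Natural language processing datasets: for example, 20 Newsgroups, IMDB, AG News, 20 Questions, Sentiment140, etc.
2. Computer vision datasets: for example, ImageNet, CIFAR-10, CIFAR-100, PASCAL VOC, Stanford Large Network Dataset, etc.
3. Machine learning and optimization datasets: for example, LIBSVM datasets, UCI Machine Learning Repository, KEEL Dataset Collection, etc.

These datasets can all be used to train and test machine learning algorithms.
Assistant: I am an AI assistant developed by LLaMA Factory. LLaMA Factory is an AI research organization dedicated to developing and applying AI technology.
Assistant: You can give me commands and ask me questions in natural language. I can answer questions, provide information, complete tasks, and more.
Assistant: To use me, follow these steps:

1. Type your question or command in the chat window.
2. Click the "Send" button to send the message.
3. I will do my best to answer your ques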
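
The chat loop keeps the whole conversation in `messages`, a list of role/content dictionaries, so each turn is generated with the earlier turns as context. A self-contained sketch of that bookkeeping, with a stand-in generator in place of the real model. `fake_stream` and `run_turn` are hypothetical names and do not exist in LLaMA Factory:

```python
def fake_stream(messages):
    """Stand-in for a streaming model: yields the reply in small chunks."""
    reply = f"You said: {messages[-1]['content']}"
    for i in range(0, len(reply), 4):
        yield reply[i:i + 4]


def run_turn(messages, query, stream_fn):
    """Append the user query, collect the streamed reply, and record it."""
    messages.append({"role": "user", "content": query})
    response = "".join(stream_fn(messages))
    messages.append({"role": "assistant", "content": response})
    return response


history = []
run_turn(history, "hello", fake_stream)
run_turn(history, "who are you?", fake_stream)
print([m["role"] for m in history])  # ['user', 'assistant', 'user', 'assistant']
```

Resetting `history` to an empty list corresponds to the `clear` command in the loop above.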

## Merge the LoRA adapter and optionally upload model

NOTE: The free Colab tier has only 12 GB of RAM, while merging LoRA adapters into an 8B model needs at least 18 GB, so you **cannot** run this step on the free tier.
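
A quick estimate shows why 12 GB is not enough. The sketch below takes the parameter count from the inference log above and assumes the weights are held in 16-bit precision during the merge. It covers the weights only, so the real peak is higher. The shard estimate assumes the `export_size=2` setting used in the export cell and is a lower bound:

```python
import math

num_params = 8_051_232_768   # "all params" from the inference log (base weights plus LoRA adapters)
bytes_per_param = 2          # assumption: 16-bit weights during the merge
shard_size_gb = 2            # export_size in the export configuration

weights_gb = num_params * bytes_per_param / 1e9
min_shards = math.ceil(weights_gb / shard_size_gb)

print(f"{weights_gb:.1f} GB for weights alone, at least {min_shards} shards")
```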

In [None]:
!huggingface-cli login

In [15]:
import json

args = dict(
  model_name_or_path="unsloth/llama-3-8b-Instruct", # use the non-quantized Llama-3-8B-Instruct model
  adapter_name_or_path="llama3_lora",            # load the saved LoRA adapters
  template="llama3",                     # same as the one used in training
  finetuning_type="lora",                  # same as the one used in training
  export_dir="llama3_lora_merged",              # the path to save the merged model
  export_size=2,                       # the file shard size (in GB) of the merged model
  export_device="cpu",                    # the device used in export, can be chosen from `cpu` and `cuda`
  #export_hub_model_id="your_id/your_model",         # the Hugging Face hub ID to upload model
)

with open("merge_llama3.json", "w", encoding="utf-8") as f:
  json.dump(args, f, indent=2)

# %cd /content/LLaMA-Factory/

!llamafactory-cli export merge_llama3.json

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


tokenizer_config.json: 100%|███████████████| 51.1k/51.1k [00:00<00:00, 1.87MB/s]
tokenizer.json: 100%|██████████████████████| 9.09M/9.09M [00:08<00:00, 1.07MB/s]
special_tokens_map.json: 100%|█████████████████| 459/459 [00:00<00:00, 1.75MB/s]
[INFO|tokenization_utils_base.py:2108] 2024-06-24 14:52:33,090 >> loading file tokenizer.json from cache at /home/anonymous/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct/snapshots/f77838872cca586fcbafa67efc77fb7d3afe775d/tokenizer.json
[INFO|tokenization_utils_base.py:2108] 2024-06-24 14:52:33,090 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2108] 2024-06-24 14:52:33,090 >> loading file special_tokens_map.json from cache at /home/anonymous/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct/snapshots/f77838872cca586fcbafa67efc77fb7d3afe775d/special_tokens_map.json
[INFO|tokenization_utils_base.py:2108] 2024-06-24 14:52:33,090 >> loading file tokenizer_config.json from cache at /home/an