<a href="https://colab.research.google.com/github/tranthanhloi/x-ai/blob/main/Finetune_Llama3_with_LLaMA_Factory.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune Llama-3 with LLaMA Factory

Please use a **free** Tesla T4 Colab GPU to run this!

Project homepage: https://github.com/hiyouga/LLaMA-Factory

## Install Dependencies

In [2]:
%cd /content/
%rm -rf LLaMA-Factory
!git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
%cd LLaMA-Factory
%ls
!pip install -e .[torch,bitsandbytes]

/content
Cloning into 'LLaMA-Factory'...
remote: Enumerating objects: 360, done.
remote: Counting objects: 100% (360/360), done.
remote: Compressing objects: 100% (277/277), done.
remote: Total 360 (delta 80), reused 272 (delta 68), pack-reused 0 (from 0)
Receiving objects: 100% (360/360), 9.94 MiB | 19.65 MiB/s, done.
Resolving deltas: 100% (80/80), done.
/content/LLaMA-Factory
[0m[01;34massets[0m/       [01;34mevaluation[0m/  MANIFEST.in     requirements.txt  [01;34mtests[0m/
CITATION.cff  [01;34mexamples[0m/    pyproject.toml  [01;34mscripts[0m/
[01;34mdata[0m/         LICENSE      README.md       setup.py
[01;34mdocker[0m/       Makefile     README_zh.md    [01;34msrc[0m/
Obtaining file:///content/LLaMA-Factory
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ...
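
Once the editable install finishes, a quick check can confirm that the package is visible to the current interpreter. The sketch below is only an illustration: `installed_version` is a hypothetical helper, and the distribution name `llamafactory` is an assumption based on the CLI name used later in this notebook.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the editable install registers the distribution as "llamafactory".
version = installed_version("llamafactory")
print("llamafactory version:", version if version else "not installed")
```

If this prints "not installed", re-running the install cell is the first thing to try.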

### Check GPU environment

In [3]:
import torch

# Warn early if PyTorch cannot see a CUDA device.
if not torch.cuda.is_available():
  print("Please set up a GPU before using LLaMA Factory: https://medium.com/mlearning-ai/training-yolov4-on-google-colab-316f8fff99c6")
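
Beyond a yes/no check, it can help to see which device Colab actually assigned and how much memory it has. The following is a minimal sketch, assuming device index 0 is the GPU of interest; `describe_gpu` is a hypothetical helper and falls back to a plain message when PyTorch or CUDA is unavailable.

```python
def describe_gpu():
    """Return a one-line description of the first CUDA device, if any."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA device visible"
    # Assumption: device 0 is the GPU that training will use.
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 2**30
    return f"{props.name}: {total_gib:.1f} GiB"


print(describe_gpu())
```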

## Update Identity Dataset

In [4]:
import json

%cd /content/LLaMA-Factory/

NAME = "Llama-3"
AUTHOR = "LLaMA Factory"

with open("data/identity.json", "r", encoding="utf-8") as f:
  dataset = json.load(f)

for sample in dataset:
  # Fill in the {{name}} and {{author}} placeholders in each answer.
  sample["output"] = sample["output"].replace("{{name}}", NAME).replace("{{author}}", AUTHOR)

with open("data/identity.json", "w", encoding="utf-8") as f:
  json.dump(dataset, f, indent=2, ensure_ascii=False)

/content/LLaMA-Factory
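
To see what the substitution does without touching `data/identity.json`, the same replacement can be tried on an in-memory record. This is a sketch under stated assumptions: `fill_identity` is a hypothetical helper, and the demo record (including its `instruction` and `input` fields) is invented for illustration; only the `output` field and the two placeholders come from the cell above.

```python
import copy

NAME = "Llama-3"
AUTHOR = "LLaMA Factory"


def fill_identity(records, name, author):
    """Return a copy of the records with {{name}} and {{author}} replaced in "output"."""
    filled = copy.deepcopy(records)
    for sample in filled:
        sample["output"] = (
            sample["output"].replace("{{name}}", name).replace("{{author}}", author)
        )
    return filled


# Hypothetical record, shaped like the entries the cell above iterates over.
demo = [{"instruction": "Who are you?", "input": "", "output": "I am {{name}}, built by {{author}}."}]

once = fill_identity(demo, NAME, AUTHOR)
twice = fill_identity(once, NAME, AUTHOR)  # a second pass finds no placeholders left
print(once[0]["output"])
```

Because the placeholders are gone after the first pass, re-running the cell with a different `NAME` has no effect unless the repository copy of `data/identity.json` is restored first.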


## Fine-tune model via LLaMA Board

In [5]:
%cd /content/LLaMA-Factory/
!GRADIO_SHARE=1 llamafactory-cli webui

/content/LLaMA-Factory
2025-06-15 16:13:10.751636: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1750003990.772191    5856 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1750003990.778296    5856 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-06-15 16:13:10.798861: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Visit http://ip:port for Web UI, e.g., http://127.0.0.1:7860
* Running on local URL:  http://0.0.0.0:7

## Fine-tune model via Command Line

Training takes about 30 minutes.
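
The training length follows from a few values in the configuration cell below. The sketch here works through the arithmetic on a single GPU; it is an approximation rather than the trainer's exact bookkeeping, and the example count is an upper bound that assumes both datasets reach the `max_samples` cap. `estimate_steps` is a hypothetical helper.

```python
import math

# Values copied from the training configuration in the next cell.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 3.0
save_steps = 1000
max_samples = 500
num_datasets = 2  # identity and alpaca_en_demo


def estimate_steps(num_examples):
    """Approximate optimizer steps on one GPU for a given number of training examples."""
    effective_batch = per_device_train_batch_size * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, math.ceil(steps_per_epoch * num_train_epochs)


# Upper bound: max_samples caps each dataset, so at most 2 * 500 examples are used.
effective_batch, total_steps = estimate_steps(max_samples * num_datasets)
print(f"effective batch size: {effective_batch}")
print(f"at most about {total_steps} optimizer steps")
print(f"intermediate checkpoint expected: {total_steps >= save_steps}")
```

Under these assumptions the run ends well before `save_steps=1000` is reached, so no intermediate checkpoint should be expected; the adapters loaded later in this notebook come from the final save into `output_dir`.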

In [6]:
import json

args = dict(
  stage="sft",                                               # do supervised fine-tuning
  do_train=True,
  model_name_or_path="unsloth/llama-3-8b-Instruct-bnb-4bit", # use bnb-4bit-quantized Llama-3-8B-Instruct model
  dataset="identity,alpaca_en_demo",                         # use alpaca and identity datasets
  template="llama3",                                         # use llama3 prompt template
  finetuning_type="lora",                                    # use LoRA adapters to save memory
  lora_target="all",                                         # attach LoRA adapters to all linear layers
  output_dir="llama3_lora",                                  # the path to save LoRA adapters
  per_device_train_batch_size=2,                             # the micro batch size
  gradient_accumulation_steps=4,                             # the gradient accumulation steps
  lr_scheduler_type="cosine",                                # use cosine learning rate scheduler
  logging_steps=5,                                           # log every 5 steps
  warmup_ratio=0.1,                                          # warm up the learning rate over the first 10% of steps
  save_steps=1000,                                           # save checkpoint every 1000 steps
  learning_rate=5e-5,                                        # the learning rate
  num_train_epochs=3.0,                                      # the epochs of training
  max_samples=500,                                           # use 500 examples in each dataset
  max_grad_norm=1.0,                                         # clip gradient norm to 1.0
  loraplus_lr_ratio=16.0,                                    # use LoRA+ algorithm with lambda=16.0
  fp16=True,                                                 # use float16 mixed precision training
  report_to="none",                                          # disable wandb logging
)

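# Write the config with a context manager so the file is flushed and closed before training starts.
with open("train_llama3.json", "w", encoding="utf-8") as f:
  json.dump(args, f, indent=2)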

%cd /content/LLaMA-Factory/

!llamafactory-cli train train_llama3.json

/content/LLaMA-Factory
2025-06-15 16:14:53.015411: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1750004093.035509    6344 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1750004093.041644    6344 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-06-15 16:14:53.062931: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[INFO|2025-06-15 16:15:00] llamafactory.hparams.parser:401 >> Process rank: 0, world size: 1, device: 
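
The configuration above combines `lr_scheduler_type="cosine"`, `warmup_ratio=0.1`, and `learning_rate=5e-5`. As a rough illustration of the resulting shape, the sketch below computes a linear warmup followed by cosine decay to zero. It is a simplified stand-alone formula, not LLaMA Factory's own code; `cosine_lr_with_warmup` and the 100-step run length are hypothetical, and the per-parameter scaling introduced by `loraplus_lr_ratio` is ignored.

```python
import math


def cosine_lr_with_warmup(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    # Simplification: round the warmup length to the nearest whole step.
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, used only for illustration.
TOTAL_STEPS = 100
for step in (0, 5, 10, 55, 100):
    print(step, f"{cosine_lr_with_warmup(step, TOTAL_STEPS):.2e}")
```

The rate climbs from zero to the configured peak during the first tenth of the run, then falls smoothly back toward zero by the last step.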

## Infer the fine-tuned model

In [None]:
from llamafactory.chat import ChatModel
from llamafactory.extras.misc import torch_gc

%cd /content/LLaMA-Factory/

args = dict(
  model_name_or_path="unsloth/llama-3-8b-Instruct-bnb-4bit", # use bnb-4bit-quantized Llama-3-8B-Instruct model
  adapter_name_or_path="llama3_lora",                        # load the saved LoRA adapters
  template="llama3",                                         # same as the one used in training
  finetuning_type="lora",                                    # same as the one used in training
)
chat_model = ChatModel(args)

messages = []
print("Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.")
while True:
  query = input("\nUser: ")
  if query.strip() == "exit":
    break
  if query.strip() == "clear":
    messages = []
    torch_gc()
    print("History has been removed.")
    continue

  messages.append({"role": "user", "content": query})
  print("Assistant: ", end="", flush=True)

  response = ""
  for new_text in chat_model.stream_chat(messages):
    print(new_text, end="", flush=True)
    response += new_text
  print()
  messages.append({"role": "assistant", "content": response})

torch_gc()

/content/LLaMA-Factory


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
[INFO|tokenization_utils_base.py:2023] 2025-06-15 16:52:03,922 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/fd5a4dc328319c1cfe9489eccfb9c6406bdfd469/tokenizer.json
[INFO|tokenization_utils_base.py:2023] 2025-06-15 16:52:03,923 >> loading file tokenizer.model from cache at None
[INFO|tokenization_utils_base.py:2023] 2025-06-15 16:52:03,924 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2023] 2025-06-15 16:52:03,925 >> loading file special_tokens_map.json from c

[INFO|2025-06-15 16:52:05] llamafactory.data.template:143 >> Add <|eom_id|> to stop words.


[INFO|configuration_utils.py:698] 2025-06-15 16:52:05,728 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/fd5a4dc328319c1cfe9489eccfb9c6406bdfd469/config.json
[INFO|configuration_utils.py:770] 2025-06-15 16:52:05,730 >> Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 128255,
  "pretraining_tp": 1,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_storage": "uint8",
  

[INFO|2025-06-15 16:52:05] llamafactory.model.model_utils.quantization:143 >> Loading ?-bit BITSANDBYTES-quantized model.
[INFO|2025-06-15 16:52:05] llamafactory.model.model_utils.kv_cache:143 >> KV cache is enabled for faster generation.


[INFO|quantization_config.py:506] 2025-06-15 16:52:06,052 >> Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
[INFO|modeling_utils.py:1151] 2025-06-15 16:52:06,327 >> loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--unsloth--llama-3-8b-Instruct-bnb-4bit/snapshots/fd5a4dc328319c1cfe9489eccfb9c6406bdfd469/model.safetensors
[INFO|modeling_utils.py:2241] 2025-06-15 16:52:06,332 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1135] 2025-06-15 16:52:06,335 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "pad_token_id": 128255
}

[INFO|quantizer_bnb_4bit.py:124] 2025-06-15 16:52:06,505 >> target_dtype {target_dtype} is replaced by `CustomDtype.INT4` for 4-bit BnB quantization
[INFO|modeling_utils.py:5131] 2025-06-15 16:52:26,090 >> All mod

[INFO|2025-06-15 16:52:26] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-06-15 16:52:27] llamafactory.model.adapter:143 >> Loaded adapter(s): llama3_lora
[INFO|2025-06-15 16:52:27] llamafactory.model.loader:143 >> all params: 8,051,232,768
Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.

User: hello
Assistant: Hello! I am Llama-3, an AI assistant developed by LLaMA Factory. How can I assist you today?

User: 1+1=
Assistant: 2

User: Explain the concept of e-commerce.
Assistant: E-commerce, also known as electronic commerce or internet commerce, refers to the buying and selling of products or services using the internet, and the transfer of money and data to complete these transactions. This can be done through various methods, including websites, mobile apps, and other digital platforms. E-commerce has revolutionized the way people shop, making it possible to purchase
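
The chat loop above keeps the whole conversation in `messages` and sends it to the model on every turn, which is why `clear` resets the list. The bookkeeping can be sketched without loading a model. In the example below, `run_turn` and `fake_stream` are hypothetical; `fake_stream` merely stands in for `chat_model.stream_chat` by yielding an echo of the last message in pieces.

```python
def run_turn(messages, query, stream_fn):
    """Append the user query, collect the streamed reply, and record it in the history."""
    messages.append({"role": "user", "content": query})
    response = "".join(stream_fn(messages))
    messages.append({"role": "assistant", "content": response})
    return response


def fake_stream(messages):
    """Hypothetical stand-in for `chat_model.stream_chat`: yields the reply in pieces."""
    reply = f"(echo) {messages[-1]['content']}"
    for word in reply.split(" "):
        yield word + " "


history = []
run_turn(history, "hello", fake_stream)
run_turn(history, "1+1=", fake_stream)
print([m["role"] for m in history])
```

Each turn adds one `user` entry and one `assistant` entry, so the prompt grows with the conversation until the history is cleared.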

## Merge the LoRA adapter and optionally upload model

NOTE: The free Colab tier has only 12GB of RAM, while merging LoRA adapters into an 8B model needs at least 18GB, so you **cannot** run this step on the free tier.
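
A back-of-the-envelope estimate shows why the free tier falls short. The sketch below takes the `all params` value from the inference log as-is and assumes the merged weights are held in a 16-bit format (2 bytes per parameter); it counts weights only and ignores any extra working memory the merge needs.

```python
ALL_PARAMS = 8_051_232_768   # "all params" value reported in the inference log
BYTES_PER_PARAM_16BIT = 2    # assumption: weights held as float16 or bfloat16
FREE_COLAB_RAM_GB = 12       # figure quoted in the note above

weights_gb = ALL_PARAMS * BYTES_PER_PARAM_16BIT / 1e9
print(f"16-bit weights alone: about {weights_gb:.1f} GB")
print(f"fits in {FREE_COLAB_RAM_GB} GB of RAM: {weights_gb < FREE_COLAB_RAM_GB}")
```

The weights alone already exceed the free tier's RAM, before accounting for any overhead during the merge.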

In [None]:
!huggingface-cli login

In [None]:
import json

args = dict(
  model_name_or_path="meta-llama/Meta-Llama-3-8B-Instruct", # use official non-quantized Llama-3-8B-Instruct model
  adapter_name_or_path="llama3_lora",                       # load the saved LoRA adapters
  template="llama3",                                        # same as the one used in training
  finetuning_type="lora",                                   # same as the one used in training
  export_dir="llama3_lora_merged",                          # the path to save the merged model
  export_size=2,                                            # the file shard size (in GB) of the merged model
  export_device="cpu",                                      # the device used in export, can be chosen from `cpu` and `auto`
  # export_hub_model_id="your_id/your_model",               # the Hugging Face hub ID to upload model
)

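# Write the config with a context manager so the file is flushed and closed before export starts.
with open("merge_llama3.json", "w", encoding="utf-8") as f:
  json.dump(args, f, indent=2)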

%cd /content/LLaMA-Factory/

!llamafactory-cli export merge_llama3.json
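
After the export finishes, one way to sanity-check the result is to list the weight shards written to `export_dir` and compare their sizes with the `export_size=2` setting. This is a sketch only: `summarize_export` is a hypothetical helper, and it assumes the merged weights are stored as `*.safetensors` files directly inside the export directory.

```python
from pathlib import Path


def summarize_export(export_dir):
    """Return (file name, size in GB) pairs for the weight shards in an export directory."""
    root = Path(export_dir)
    if not root.is_dir():
        return []
    # Assumption: the merged weights are written as *.safetensors shards.
    return [(p.name, p.stat().st_size / 1e9) for p in sorted(root.glob("*.safetensors"))]


shards = summarize_export("llama3_lora_merged")
for name, size_gb in shards:
    print(f"{name}: {size_gb:.2f} GB")
print(f"{len(shards)} shard(s), {sum(size for _, size in shards):.2f} GB in total")
```

An empty listing usually means the export did not complete or wrote to a different directory.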