<a href="https://colab.research.google.com/github/tc3oliver/LLM-Notebook/blob/dev/Fine_tune_Taiwan_LLaMa_2_13b_in_Google_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune Taiwan-LLaMa 2 in Google Colab

In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

This version of Taiwan-LLaMa 2 has 13B parameters, so make sure your GPU is a V100 (16 GB) or an A100 (40 GB, preferred).
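As a rough sanity check on the GPU recommendation, the sketch below estimates the memory taken by the model weights alone at two precisions. It assumes a round 13 billion parameters and ignores activations, LoRA adapters, optimizer state, and quantization overhead, so real usage will be higher. `weight_memory_gb` is an illustrative helper, not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 13e9  # assumption: a round 13B parameter count

for name, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{name}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the half-precision weights alone (about 26 GB) would not fit on a 16 GB GPU, while the 4-bit weights (about 6.5 GB) leave room for training state.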

In [None]:
!nvidia-smi

Fri Aug 25 07:36:23 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    43W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Import the required packages

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

## Pretrained model and training parameter settings

In [None]:
# Used for multi-gpu
local_rank = -1
per_device_train_batch_size = 4
per_device_eval_batch_size = 1
gradient_accumulation_steps = 4
learning_rate = 2e-4
max_grad_norm = 0.3
weight_decay = 0.001
lora_alpha = 16
lora_dropout = 0.1
lora_r = 64
max_seq_length = 512

# The model that you want to train from the Hugging Face hub
model_name = "yentinglin/Taiwan-LLaMa-v1.0"

# Fine-tuned model name
new_model = "TW-llama-2-13b-beta0.1"  # Name of the new model after training

# The instruction dataset to use
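# Many Chinese training datasets are available on Hugging Face; pick one that suits your task.
# Pay attention to the dataset's column names: SFTTrainer must be given the name of the text column.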
dataset_name = "pleisto/wikipedia-cn-20230720-filtered"

# Activate 4-bit precision base model loading
use_4bit = True

# Activate nested quantization for 4-bit base models
use_nested_quant = False

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Number of training epochs
num_train_epochs = 1

# Enable fp16 training
fp16 = False

# Enable bf16 training
bf16 = True  # Supported on the Colab A100; set this to False if you are not using an A100 GPU!

# Use packing dataset creating
packing = False

# Enable gradient checkpointing
gradient_checkpointing = True

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine, and has advantage for analysis)
lr_scheduler_type = "constant"

# Number of optimizer update steps
max_steps = 10  # e.g. 10000. Remember to set the number of training steps! This tutorial uses only 10 steps as a demonstration.

# Fraction of steps to do a warmup for
warmup_ratio = 0.03

# Group sequences into batches with same length (saves memory and speeds up training considerably)
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 10

# Log every X updates steps
logging_steps = 10

# The output directory where the model predictions and checkpoints will be written
output_dir = "./results"

# Load the entire model on the GPU 0
device_map = {"": 0}
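Several of the settings above only make sense in combination. The sketch below derives three of those combined quantities: the effective batch size per optimizer update, the upper bound on tokens per update, and the LoRA scaling factor (in the standard LoRA formulation the low-rank update is scaled by `lora_alpha / r`). The values are copied from the settings cell; `num_gpus = 1` is an assumption for a single-GPU Colab session.

```python
# Values copied from the settings cell above
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_seq_length = 512
lora_alpha = 16
lora_r = 64
num_gpus = 1  # assumption: a single Colab GPU

# Examples contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Upper bound on tokens per update (every sequence at the maximum length)
max_tokens_per_update = effective_batch_size * max_seq_length
# Scaling applied to the low-rank update in standard LoRA
lora_scaling = lora_alpha / lora_r

print(effective_batch_size, max_tokens_per_update, lora_scaling)
```

With these settings each update sees 16 examples and at most 8192 tokens, and the LoRA update is scaled by 0.25.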

### Load the dataset

In [None]:
dataset = load_dataset(dataset_name, split="train")
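Because the trainer needs the name of the column that holds the training text, it can help to check the column names right after loading. The sketch below uses a hypothetical helper, `pick_text_field`, that returns the first matching candidate. The example column list is illustrative; in the notebook you would pass `dataset.column_names` instead.

```python
from typing import Optional, Sequence

def pick_text_field(
    column_names: Sequence[str],
    candidates: Sequence[str] = ("text", "completion"),
) -> Optional[str]:
    """Return the first candidate column present in the dataset, or None."""
    for name in candidates:
        if name in column_names:
            return name
    return None

# In the notebook this would be: pick_text_field(dataset.column_names)
example_columns = ["completion", "source"]  # assumption: illustrative column names
print(pick_text_field(example_columns))
```

If the helper returns `None`, inspect the dataset manually and extend the candidate list before configuring the trainer.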

### Configure the training environment

In [None]:
print("=" * 80)
print("Load tokenizer and model with QLoRA configuration...")
print("=" * 80)

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check whether the GPU supports bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16, you can accelerate training with the argument --bf16")
        print("=" * 80)

print("=" * 80)
print("Load AutoModelForCausalLM from pretrained model...")
print("=" * 80)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

print("=" * 80)
print("Load AutoTokenizer from pretrained model...")
print("=" * 80)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token


Load tokenizer and model with QLoRA configuration...
Your GPU supports bfloat16, you can accelerate training with the argument --bf16
Load AutoModelForCausalLM from pretrained model...


Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)l-00003-of-00003.bin:   0%|          | 0.00/6.18G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/192 [00:00<?, ?B/s]

Load AutoTokenizer from pretrained model...


Downloading (…)okenizer_config.json:   0%|          | 0.00/749 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]
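The bfloat16 message printed above comes from a check on the CUDA compute capability. As a minimal sketch, the same threshold could be used to pick the `fp16` / `bf16` flags automatically instead of setting them by hand. `choose_mixed_precision` is a hypothetical helper; in the notebook the major version would come from `torch.cuda.get_device_capability()`.

```python
from typing import Dict

def choose_mixed_precision(compute_capability_major: int) -> Dict[str, bool]:
    """Pick fp16/bf16 flags from the CUDA compute capability major version."""
    # Same threshold as the check in the cell above (major >= 8)
    supports_bf16 = compute_capability_major >= 8
    return {"fp16": not supports_bf16, "bf16": supports_bf16}

print(choose_mixed_precision(8))  # A100 has compute capability 8.0
print(choose_mixed_precision(7))  # V100 has compute capability 7.0
```

The returned values could then be assigned to the `fp16` and `bf16` variables before the training arguments are built.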

## Model training

### SFTTrainer configuration
Supervised Fine-tuning Trainer API documentation:

[https://huggingface.co/docs/trl/main/en/sft_trainer](https://huggingface.co/docs/trl/main/en/sft_trainer)

In [None]:
training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    #dataset_text_field="text",
    dataset_text_field="completion",  # Change this to the matching column name in your dataset!
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)



Map:   0%|          | 0/254547 [00:00<?, ? examples/s]
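To put the 10-step demonstration in perspective, the sketch below estimates how many optimizer steps one full pass over the data would take. It takes the example count from the `Map` progress output above and assumes one GPU, an effective batch size of 16 (4 per device times 4 accumulation steps), and one training sequence per example.

```python
import math

num_examples = 254547      # from the "Map" progress output above
effective_batch_size = 16  # 4 per device x 4 accumulation steps, one GPU assumed
max_steps = 10             # the demonstration setting used in this notebook

# Optimizer steps needed to see every example once
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
# Examples actually consumed by the demonstration run
examples_seen = max_steps * effective_batch_size

print(steps_per_epoch, examples_seen)
print(f"Fraction of the dataset seen: {examples_seen / num_examples:.4%}")
```

Under these assumptions a full epoch takes 15910 steps, while the 10-step run touches only 160 examples.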

In [None]:
# Train the model
# %%time
try:
    trainer.train()
except KeyboardInterrupt:
    print("KeyboardInterrupt")

Step,Training Loss
10,2.1559


### Save the model parameters

In [None]:
#trainer.model.save_pretrained(output_dir)

model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model  # Take care of distributed/parallel training
model_to_save.save_pretrained("outputs")
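After saving, a quick look at the output directory confirms that something was written. The sketch below is a generic directory listing rather than a PEFT API. `list_saved_files` is a hypothetical helper, and the check for `adapter_config.json` assumes the usual layout of a saved PEFT adapter directory.

```python
import os
from typing import List

def list_saved_files(directory: str) -> List[str]:
    """Return the sorted file names in a directory, or an empty list if it is missing."""
    if not os.path.isdir(directory):
        return []
    return sorted(os.listdir(directory))

saved = list_saved_files("outputs")
print(saved)
# Assumption: a saved PEFT adapter directory normally contains adapter_config.json
print("adapter_config.json" in saved)
```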

### Run inference with the model

In [None]:
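from peft import PeftModel

# Load the trained adapter weights saved in "outputs" on top of the base model.
# (LoraConfig.from_pretrained + get_peft_model would only restore the adapter
# configuration and attach newly initialized LoRA weights.)
model = PeftModel.from_pretrained(model, "outputs")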

In [None]:
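text = "我來說一個台灣的都市傳說"  # Prompt: "Let me tell you a Taiwanese urban legend"
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
# Note: temperature only affects the output when sampling is enabled
# (do_sample=True, either passed here or set in the model's generation config).
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))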

我來說一個台灣的都市傳說 - 靈異奇談 - flash player 破解 - Powered by Discuz!
flash player 破解 » 靈異奇談 » 我來說一個台灣的都市傳說
GMT+8, 2019-10-16 05:11, Processed in 0.027500 second(s), 5 queries.кадр 0.000000��田水平筆劃 0.000000 1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000 8.000000 9.000000 10.000000 11.000000 12.0
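The `temperature=0.8` argument in the generation call rescales the next-token scores before they are turned into probabilities, which matters only when tokens are sampled. The following pure-Python sketch shows the effect on a hypothetical set of three logits. It illustrates the general technique and is not the `transformers` implementation.

```python
import math
from typing import List

def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical next-token scores
for t in (1.0, 0.8):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates more probability on the highest-scoring token, so sampled text becomes less varied.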


In [None]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 5120, padding_idx=0)
        (layers): ModuleList(
          (0-39): 40 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear4bit(
                in_features=5120, out_features=5120, bias=False
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=5120, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=5120, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
              (
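The printed structure is enough to estimate how many parameters the LoRA adapter adds. The sketch below reads the dimensions off the printout: `lora_A` maps 5120 to 64, `lora_B` maps 64 to 5120, and there are 40 decoder layers. It assumes two adapted projections per layer, `q_proj` and `v_proj`. The printout shows `q_proj` with LoRA and `k_proj` without, but it is cut off before `v_proj`, so that part is an assumption.

```python
hidden_size = 5120  # in_features / out_features of q_proj in the printout
lora_r = 64         # rank shown in lora_A / lora_B
num_layers = 40     # "(0-39): 40 x LlamaDecoderLayer"
adapted_modules_per_layer = 2  # assumption: q_proj and v_proj

# lora_A (hidden_size -> r) plus lora_B (r -> hidden_size), both without bias
params_per_module = hidden_size * lora_r + lora_r * hidden_size
total_lora_params = params_per_module * adapted_modules_per_layer * num_layers

print(params_per_module, total_lora_params)
```

Under these assumptions the adapter holds about 52 million trainable parameters, a small fraction of the 13B base model.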



---



## References
- [MiuLab] Taiwan-LLaMa2 [https://github.com/MiuLab/Taiwan-LLaMa](https://github.com/MiuLab/Taiwan-LLaMa)

- [Huggingface] Taiwan-LLaMa-v1.0 [https://huggingface.co/yentinglin/Taiwan-LLaMa-v1.0](https://huggingface.co/yentinglin/Taiwan-LLaMa-v1.0)

- Chinese datasets on Hugging Face [https://huggingface.co/datasets?language=language:zh&sort=trending](https://huggingface.co/datasets?language=language:zh&sort=trending)

**code reference:**

- code based on Younes Belkada's [GitHub Gist](https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da).

- Llama 2 fine-tuning example: [https://colab.research.google.com/drive/16SlGXLuBRB30clB0dCYAh3sqk0edKoFC?usp=sharing](https://colab.research.google.com/drive/16SlGXLuBRB30clB0dCYAh3sqk0edKoFC?usp=sharing)
