# LoRA in Practice (Llama2-7b version)

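**Official PEFT documentation:** https://huggingface.co/docs/peft/index

🤗 PEFT (Parameter-Efficient Fine-Tuning) is a library for efficiently adapting large pretrained models to various downstream applications without fine-tuning all of the model's parameters, which would be prohibitively expensive. PEFT methods fine-tune only a small number of (extra) model parameters, significantly reducing compute and storage costs, while achieving performance comparable to a fully fine-tuned model. This makes it easier to train and store large language models (LLMs) on consumer-grade hardware (GPUs).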

In [1]:
!nvidia-smi

Sun Dec 17 14:56:04 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
# Install the required packages
!pip install -q -U trl transformers git+https://github.com/huggingface/peft.git

# Packages needed for model quantization (load_in_8bit=True):
!pip install -q -U accelerate bitsandbytes

# LlamaTokenizer requires the SentencePiece library
!pip install sentencepiece


In [3]:
# git-lfs can speed up uploads to the Hugging Face platform
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


### login huggingface_hub

In [None]:
from huggingface_hub import login
login()


## Step1 Load packages

In [2]:
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer

## Step2 Load the dataset

**erhwenkuo/firefly-train-chinese-zhtw** (https://huggingface.co/datasets/erhwenkuo/firefly-train-chinese-zhtw)

Dataset Card for "firefly-train-chinese-zhtw"

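Dataset summary

This dataset was mainly used in the project Firefly (流螢): a Chinese conversational large language model, and training on it produced the firefly-1b4 model.

The **Firefly (流螢): Chinese conversational large language model** project (https://github.com/yangjianxin1/Firefly) collected 23 common Chinese datasets and, for each NLP task, had several instruction templates written by hand to ensure high quality and diversity of the data.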


In [3]:
from datasets import load_dataset

dataset = load_dataset("erhwenkuo/firefly-train-chinese-zhtw", split="train")

Generating train split:   0%|          | 0/1649399 [00:00<?, ? examples/s]

In [4]:
dataset.column_names

['kind', 'input', 'target']

In [5]:
dataset[0:3]

{'kind': ['NLI', 'Summary', 'Couplet'],
 'input': ['自然語言推理：\n前提：家裡人心甘情願地養他,還有幾家想讓他做女婿的\n假設：他是被家裡人收養的孤兒',
  '在上海的蘋果代工廠，較低的基本工資讓工人們形成了“軟強制”的加班默契。加班能多拿兩三千，“自願”加班成為常態。律師提示，加班後雖能獲得一時不錯的報酬，但過重的工作負荷會透支身體，可能對今後勞動權利造成不利影響。\n輸出摘要：',
  '上聯：把酒邀春，春日三人醉\n下聯：'],
 'target': ['中立', '蘋果代工廠員工調查：為何爭著“自願”加班', '梳妝佩玉，玉王點一嬌']}

In [6]:
# prompt: keep only the rows in dataset whose kind is 'Couplet'
dataset = dataset.filter(lambda x: x['kind'] == 'Couplet')


Filter:   0%|          | 0/1649399 [00:00<?, ? examples/s]

In [7]:
dataset[0:3]

{'kind': ['Couplet', 'Couplet', 'Couplet'],
 'input': ['上聯：把酒邀春，春日三人醉\n下聯：', '和尚\n輸出下聯：', '根據上聯給出下聯：風閱大江頭，風流何處？江流何處'],
 'target': ['梳妝佩玉，玉王點一嬌', '悟空', '下聯：人浮滄海外，人在天涯，海在天涯']}

## Step3 Dataset preprocessing

This example uses only the Couplet task data, so building the input prompt just requires concatenating the input and target fields.

In [8]:
dataset["input"][0], dataset["target"][0]

('上聯：把酒邀春，春日三人醉\n下聯：', '梳妝佩玉，玉王點一嬌')

In [9]:
# prompt: merge dataset["input"] and dataset["target"] into dataset["text"]

dataset = dataset.map(lambda x: {**x, "text": x["input"] + x["target"]})

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]
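
For a single record, the mapping above is plain string concatenation. A minimal sketch using the first example shown earlier; the plain dict `row` and the name `add_text` are illustrative stand-ins for a dataset row and the lambda passed to `map`:

```python
# A plain dict stands in for one dataset row (values copied from the output above).
row = {
    "kind": "Couplet",
    "input": "上聯：把酒邀春，春日三人醉\n下聯：",
    "target": "梳妝佩玉，玉王點一嬌",
}

# Same logic as the lambda in the map call: keep all columns and add "text".
def add_text(x):
    return {**x, "text": x["input"] + x["target"]}

new_row = add_text(row)
print(new_row["text"])
```

The resulting `text` column is what the trainer later reads through `dataset_text_field="text"`.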

### Load the tokenizer

In [10]:
model_name = "stuser2023/Llama2-7b-finetuned"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding=True)
tokenizer.pad_token = tokenizer.eos_token

tokenizer.padding_side = "right"  # Set padding_side to right, matching the usual left-to-right writing direction of text


## Step4 Load the base model

In [11]:
import torch
import accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "stuser2023/Llama2-7b-finetuned"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map={'': 0},  # Device placement; here the whole model goes on GPU 0
    trust_remote_code=True,
)
model.config.use_cache = False

config.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/495 [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/16.8M [00:00<?, ?B/s]

In [12]:
print(model)

print(f'memory usage of model: {model.get_memory_footprint() / (1024 * 1024 * 1024):.2} GB')

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): lora.Linear8bitLt(
            (base_layer): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=8, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=8, out_features=4096, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
          )
          (k_proj): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
          (v_proj): lora.Linear8bitLt(
            (base_layer): Linear8bitLt(in_features=4096, out_features=4096, bias=False)
          

## Apply PEFT (LoRA)

### PEFT Step4.1 Parameter settings

Estimating the GPU memory needed during training: (https://huggingface.co/docs/transformers/model_memory_anatomy)

**Model Weights:**

- 4 bytes * number of parameters for fp32 training
- 6 bytes * number of parameters for mixed precision training (maintains a model in fp32 and one in fp16 in memory)

**Optimizer States:**

- 8 bytes * number of parameters for normal AdamW (maintains 2 states)
- 2 bytes * number of parameters for 8-bit AdamW optimizers like bitsandbytes
- 4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state)

**Gradients**

- 4 bytes * number of parameters for either fp32 or mixed precision training (gradients are always kept in fp32)
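
As a rough illustration of the rules of thumb above, the sketch below adds up weights, optimizer states, and gradients per trainable parameter, then compares full fine-tuning with training only the LoRA parameters. The function name `training_bytes_per_param` and the round 7B figure are assumptions made for illustration; the LoRA count is the trainable-parameter figure reported later in this notebook. Activations and temporary buffers are ignored, so real usage will be higher.

```python
# Rough estimate only: ignores activations, CUDA kernels, and temporary buffers.
GIB = 1024 ** 3

def training_bytes_per_param(mixed_precision: bool = True, optimizer: str = "adamw") -> int:
    """Bytes per trainable parameter, following the rules of thumb listed above."""
    weights = 6 if mixed_precision else 4
    optimizer_states = {"adamw": 8, "adamw_8bit": 2, "sgd_momentum": 4}[optimizer]
    gradients = 4
    return weights + optimizer_states + gradients

n_full = 7_000_000_000   # assumed round figure for a 7B model
n_lora = 4_194_304       # trainable LoRA parameters reported by print_trainable_parameters()

per_param = training_bytes_per_param(mixed_precision=True, optimizer="adamw")
print(f"bytes per trainable parameter: {per_param}")
print(f"full fine-tuning: ~{n_full * per_param / GIB:.0f} GiB")
print(f"LoRA parameters only: ~{n_lora * per_param / GIB:.3f} GiB (plus the frozen base weights)")
```

The gap between the two estimates is the reason a 7B model can be adapted on a single T4 at all: only the small adapter needs gradients and optimizer states.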

In [13]:
from peft import LoraConfig, get_peft_model

lora_r = 8
lora_alpha = 16
lora_dropout = 0.1

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM"
)
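
To make the roles of `r` and `lora_alpha` concrete, here is a minimal pure-Python sketch of the low-rank update that LoRA adds to a frozen weight: the effective weight is W + (lora_alpha / r) · B·A, with B starting at zero so that training begins from the unmodified base model. The toy dimensions and the helpers `matmul` and `effective_weight` are illustrative assumptions, not PEFT internals.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

d_in, d_out = 6, 6      # toy sizes; the notebook's projections are 4096 x 4096
r, lora_alpha = 2, 4    # toy rank and alpha; the notebook uses r=8, lora_alpha=16
scaling = lora_alpha / r

random.seed(0)
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen base weight
A = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]    # trainable, shape (r, d_in)
B = [[0.0] * r for _ in range(d_out)]                                    # trainable, shape (d_out, r), starts at zero

def effective_weight(W, A, B, scaling):
    """Return W + scaling * (B @ A)."""
    delta = matmul(B, A)
    return [[w + scaling * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# With B = 0 the adapter contributes nothing, so the model starts out identical to the base.
assert effective_weight(W, A, B, scaling) == W

lora_params = r * d_in + d_out * r
print(f"scaling = {scaling}, LoRA params = {lora_params}, full matrix params = {d_in * d_out}")
```

Both the toy values and the notebook's values give a scaling factor of 2.0, since 4 / 2 and 16 / 8 are equal.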

### PEFT Step4.2 Build the model

In [14]:
model = get_peft_model(model, peft_config)

peft_config # LoraConfig

LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path='meta-llama/Llama-2-7b-chat-hf', revision=None, task_type='CAUSAL_LM', inference_mode=False, r=8, target_modules={'v_proj', 'q_proj'}, lora_alpha=16, lora_dropout=0.1, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={})

In [15]:
for name, parameter in model.named_parameters():
    print(name)

base_model.model.model.embed_tokens.weight
base_model.model.model.layers.0.self_attn.q_proj.base_layer.weight
base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight
base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight
base_model.model.model.layers.0.self_attn.k_proj.weight
base_model.model.model.layers.0.self_attn.v_proj.base_layer.weight
base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight
base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight
base_model.model.model.layers.0.self_attn.o_proj.weight
base_model.model.model.layers.0.mlp.gate_proj.weight
base_model.model.model.layers.0.mlp.up_proj.weight
base_model.model.model.layers.0.mlp.down_proj.weight
base_model.model.model.layers.0.input_layernorm.weight
base_model.model.model.layers.0.post_attention_layernorm.weight
base_model.model.model.layers.1.self_attn.q_proj.base_layer.weight
base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight
base_model.mo

In [16]:
model.print_trainable_parameters()

trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
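
The reported count can be reproduced by hand. The sketch below assumes, based on the layer printout and the config shown above, that LoRA with `r=8` is attached to `q_proj` and `v_proj` (each 4096 → 4096) in all 32 decoder layers.

```python
hidden_size = 4096                     # in_features and out_features of q_proj / v_proj
r = 8                                  # LoRA rank
num_layers = 32                        # decoder layers (0-31)
target_modules = ["q_proj", "v_proj"]

# Each adapted projection gets lora_A (r x in_features) and lora_B (out_features x r).
params_per_module = r * hidden_size + hidden_size * r
trainable = num_layers * len(target_modules) * params_per_module

all_params = 6_742_609_920             # "all params" value printed above
print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```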


## Step5 Set the training arguments

In [17]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 1 # On a T4 GPU this has to be 1; larger values run out of memory (OOM)
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 50
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 100         # This tutorial runs only a small number of steps
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,

    push_to_hub=True,
    hub_model_id="stuser2023/Llama2_7b_Couplet", # Required when pushing to the Hub: your model id (format: owner_id/model_name)
)
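
These settings determine how much data the run actually sees. A quick calculation, assuming a single GPU and the 50,000 couplet examples reported by the earlier `Map` progress output:

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 1          # assumption: one T4 GPU
max_steps = 100
dataset_size = 50_000    # couplet examples after filtering

# Gradients from 4 micro-batches are accumulated before each optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
examples_seen = effective_batch_size * max_steps

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen: {examples_seen} ({100 * examples_seen / dataset_size:.1f}% of the dataset)")
```

With these values the run covers well under one epoch, which fits the note that the tutorial uses only a small number of steps.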

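## Step6 Create the Trainer (SFT Trainer)

TRL - Transformer Reinforcement Learning (https://github.com/lvwerra/trl)

TRL is a full-stack library that provides a set of tools for training transformer language models and stable diffusion models with reinforcement learning, from the supervised fine-tuning (SFT) step and the reward modeling (RM) step to the proximal policy optimization (PPO) step. The library is built on top of 🤗 Hugging Face transformers, so pretrained language models can be loaded directly through transformers. Most decoder and encoder-decoder architectures are currently supported.

- **SFTTrainer:** a lightweight wrapper around the transformers Trainer that makes it easy to fine-tune language models or adapters on a custom dataset.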

In [18]:
from trl import SFTTrainer

max_seq_length = 128 # The texts are short, so 128 tokens is enough

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)



Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [19]:
# Cast layers whose name contains "norm" to float32, which makes training more stable (recommended).
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)

## Step7 Train the model

In [20]:
try:
    trainer.train()
except KeyboardInterrupt:
    print("KeyboardInterrupt")

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|---|---|
| 10 | 3.4897 |
| 20 | 2.8715 |
| 30 | 2.4479 |
| 40 | 2.3065 |
| 50 | 2.3436 |
| 60 | 2.4968 |
| 70 | 2.0377 |
| 80 | 1.9902 |
| 90 | 2.067 |
| 100 | 2.281 |




## Step8 Model inference

In [21]:
model = model.eval() # Switch to eval mode (disables dropout)

In [22]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float16) # Cast the norm layers back to torch.float16 so dtypes are consistent and inference does not fail on a type mismatch

In [25]:
fst_sentence = "一鄉二里，共三夫子不識四書五經六義，竟敢教七八九子，十分大膽；"

input_ids = tokenizer("對聯:{}\n".format(fst_sentence).strip() + "下聯:", return_tensors="pt").to(model.device)

generate_input = {
    "input_ids":input_ids["input_ids"],
    "max_new_tokens": len(fst_sentence)*2, # Heuristic: allow twice as many new tokens as the first line has characters
    "do_sample":True,
    #"top_k":50,
    #"top_p":0.95,
    "temperature":0.2,
    #"repetition_penalty":1.3,
    "eos_token_id":tokenizer.eos_token_id,
    "bos_token_id":tokenizer.bos_token_id,
    "pad_token_id":tokenizer.pad_token_id,
}
generate_ids = model.generate(**generate_input)
text = tokenizer.decode(generate_ids[0], skip_special_tokens=True)
print(text)

對聯:一鄉二里，共三夫子不識四書五經六義，竟敢教七八九子，十分大膽；下聯:一人二月，共三女子不聽五聲六樂，竟敢唱八九月，十分勇敢
下聯：一家二子，共三父母不�
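
One detail of the prompt construction is easy to miss: `.strip()` is applied before `"下聯:"` is appended, so the trailing `\n` is removed and the prompt ends up on a single line, which matches the printed output above. A small check in plain Python (no model needed):

```python
fst_sentence = "一鄉二里，共三夫子不識四書五經六義，竟敢教七八九子，十分大膽；"

# Same expression as in the generation cell.
prompt = "對聯:{}\n".format(fst_sentence).strip() + "下聯:"
max_new_tokens = len(fst_sentence) * 2   # len() counts characters, not tokens

print(repr(prompt))
print("contains newline:", "\n" in prompt)
print("max_new_tokens:", max_new_tokens)
```

The training examples use the form `上聯：…\n下聯：` with a full-width colon, so keeping the inference prompt closer to that format may be worth trying; that is a suggestion, not something tested here.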




---



## Upload the model to the Hugging Face Hub

In [26]:
from huggingface_hub import login
login()


In [None]:
#model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model  # Take care of distributed/parallel training
#model_to_save.save_pretrained("outputs")

In [27]:
# This uploads only the adapter weights; the base model (Llama 2) is downloaded from Meta's official repo given by base_model_name_or_path='meta-llama/Llama-2-7b-chat-hf'
model.push_to_hub("stuser2023/Llama2_7b_Couplet")

CommitInfo(commit_url='https://huggingface.co/stuser2023/Llama2_7b_Couplet/commit/9d5fc7c8e1520d64201c357186c497c601124f08', commit_message='Upload model', commit_description='', oid='9d5fc7c8e1520d64201c357186c497c601124f08', pr_url=None, pr_revision=None, pr_num=None)



---



## Download the model from the Hugging Face Hub for inference

In [5]:
from huggingface_hub import login
login()


In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel, PeftConfig

finetune_model_path="stuser2023/Llama2_7b_Couplet"

peft_config = PeftConfig.from_pretrained(finetune_model_path)
model = AutoModelForCausalLM.from_pretrained(
    peft_config.base_model_name_or_path,
    load_in_8bit=True,
    device_map={'': 0},  # Device placement; here the whole model goes on GPU 0
    trust_remote_code=True,
)
model.config.use_cache = False

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [3]:
peft_config # LoraConfig

LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path='meta-llama/Llama-2-7b-chat-hf', revision=None, task_type='CAUSAL_LM', inference_mode=True, r=8, target_modules={'q_proj', 'v_proj'}, lora_alpha=16, lora_dropout=0.1, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={})

In [4]:
model = PeftModel.from_pretrained(model, finetune_model_path, device_map={'': 0})
model = model.eval() # Switch to eval mode (disables dropout)

model.print_trainable_parameters()

trainable params: 0 || all params: 6,742,609,920 || trainable%: 0.0


In [6]:
tokenizer = AutoTokenizer.from_pretrained(finetune_model_path, trust_remote_code=True, padding=True)
tokenizer.pad_token = tokenizer.eos_token

tokenizer.padding_side = "right"  # Set padding_side to right, matching the usual left-to-right writing direction of text

In [15]:
fst_sentence = "一鄉二里，共三夫子不識四書五經六義，竟敢教七八九子，十分大膽；"

input_ids = tokenizer("上聯:{}\n".format(fst_sentence).strip() + "下聯:", return_tensors="pt").to(model.device)

generate_input = {
    "input_ids":input_ids["input_ids"],
    "max_new_tokens": len(fst_sentence)*2, # Heuristic: allow twice as many new tokens as the first line has characters
    "do_sample":True,
    #"top_k":50,
    #"top_p":0.95,
    "temperature":0.2,
    #"repetition_penalty":1.3,
    "eos_token_id":tokenizer.eos_token_id,
    "bos_token_id":tokenizer.bos_token_id,
    "pad_token_id":tokenizer.pad_token_id,
}
generate_ids = model.generate(**generate_input)
text = tokenizer.decode(generate_ids[0], skip_special_tokens=True)
print(text)

上聯:一鄉二里，共三夫子不識四書五經六義，竟敢教七八九子，十分大膽；下聯:一卷二卷，共三卷四卷五卷六卷七卷，竟敢寫八九卷，百分勤奮
下聯：一�




---



## Reference
- **Hugging Face PEFT documentation** (https://huggingface.co/docs/peft/index)
- Meta AI: Llama 2: open source, free for research and commercial use ([https://ai.meta.com/resources/models-and-libraries/llama/](https://ai.meta.com/resources/models-and-libraries/llama/))
- Meta Llama2 Huggingface model: ([https://huggingface.co/meta-llama](https://huggingface.co/meta-llama))



**Github repository**

- [github] Parameter-Efficient Fine-Tuning (PEFT) ([https://github.com/huggingface/peft](https://github.com/huggingface/peft))
- [github] TRL - Transformer Reinforcement Learning ([https://github.com/lvwerra/trl](https://github.com/lvwerra/trl))
- [github] bitsandbytes ([https://github.com/TimDettmers/bitsandbytes](https://github.com/TimDettmers/bitsandbytes))
- [github] Meta Llama 2 ([https://github.com/facebookresearch/llama/tree/main](https://github.com/facebookresearch/llama/tree/main))
