<a href="https://colab.research.google.com/github/shhuangmust/AI/blob/master/PEFT_SFTTrainer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

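## Training a model to translate between vernacular and Classical Chinese with QLoRA (NTU CSIE Applied Deep Learning 2024, Homework 3)
### 1. Assignment goal
- The goal of this assignment is to use QLoRA to train a model that translates between vernacular (modern) Chinese and Classical Chinese.
- The base model is `zake7749/gemma-2-2b-it-chinese-kyara-dpo`, a Google Gemma 2 model that has been instruction-tuned and then further trained with DPO.
- The dataset is the one provided with the NTU CSIE assignment and contains both vernacular and Classical Chinese text. It has not been cleaned and was mechanically converted from Simplified to Traditional Chinese, so it has many quality problems.
### 2. Assignment steps
- The steps are:
  1. Data preprocessing
  2. Training the model with QLoRA
  3. Model testing
### 3. Training monitoring
- Training is monitored by:
  1. Tracking the loss during training
  2. Recording the training run with wandb
  3. Adjusting various parameters and observing the training results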


In [1]:
!pip install transformers datasets torch bitsandbytes peft wandb trl flash-attn nvidia-ml-py3

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl.metadata (2.9 kB)
Collecting trl
  Downloading trl-0.13.0-py3-none-any.whl.metadata (11 kB)
Collecting flash-attn
  Downloading flash_attn-2.7.2.post1.tar.gz (3.1 MB)
  Preparing metadata (setup.py) ... done
Collecting nvidia-ml-py3
  Downloading nvidia-ml-py3-7.352.0.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metada

## Downloading the training data, which contains 10,000 translation examples between vernacular and Classical Chinese

In [2]:
!wget https://github.com/shhuangmust/AI/raw/refs/heads/master/train.json

--2024-12-29 13:04:04--  https://github.com/shhuangmust/AI/raw/refs/heads/master/train.json
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/shhuangmust/AI/refs/heads/master/train.json [following]
--2024-12-29 13:04:05--  https://raw.githubusercontent.com/shhuangmust/AI/refs/heads/master/train.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2940614 (2.8M) [application/octet-stream]
Saving to: ‘train.json’


2024-12-29 13:04:06 (210 MB/s) - ‘train.json’ saved [2940614/2940614]



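- Use `gemma-2-2b-it-chinese-kyara-dpo` as the base model.
- Load the model with 4-bit quantization to reduce training memory.
- The task is text generation (producing Classical Chinese), so the model is loaded with `AutoModelForCausalLM`.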

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

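bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the linear-layer weights in 4-bit
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type introduced by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16   # dtype used for computation after dequantization
)
model_id = "zake7749/gemma-2-2b-it-chinese-kyara-dpo"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation='eager',
    cache_implementation=None,
    use_cache=False,  # the KV cache is not needed during training
)

# add_eos_token=True appends the end-of-sequence token to every tokenized example
tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)
model.to('cuda')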

config.json:   0%|          | 0.00/818 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now default to True since model is quantized.


model.safetensors.index.json:   0%|          | 0.00/24.2k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/481M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/208 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/47.1k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/34.4M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256000, 2304, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma2DecoderLayer(
        (self_attn): Gemma2Attention(
          (q_proj): Linear4bit(in_features=2304, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2304, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=2304, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2304, bias=False)
          (rotary_emb): Gemma2RotaryEmbedding()
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear4bit(in_features=2304, out_features=9216, bias=False)
          (up_proj): Linear4bit(in_features=2304, out_features=9216, bias=False)
          (down_proj): Linear4bit(in_features=9216, out_features=2304, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (post_attention_layernorm):

In [4]:
print(model)

Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256000, 2304, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma2DecoderLayer(
        (self_attn): Gemma2Attention(
          (q_proj): Linear4bit(in_features=2304, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2304, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=2304, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2304, bias=False)
          (rotary_emb): Gemma2RotaryEmbedding()
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear4bit(in_features=2304, out_features=9216, bias=False)
          (up_proj): Linear4bit(in_features=2304, out_features=9216, bias=False)
          (down_proj): Linear4bit(in_features=9216, out_features=2304, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (post_attention_layernorm):
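
To get a feel for how much memory 4-bit quantization saves, the following is a rough back-of-the-envelope sketch based only on the `Linear4bit` shapes printed above. The helper names (`linear_weight_count`, `weight_bytes`) are introduced here for illustration. The estimate ignores quantization constants, the embedding, the normalization layers, activations, and optimizer state, so real memory use will be higher.

```python
# Linear-layer shapes (in_features, out_features) copied from the printed model.
LINEAR_SHAPES = {
    "q_proj": (2304, 2048),
    "k_proj": (2304, 1024),
    "v_proj": (2304, 1024),
    "o_proj": (2048, 2304),
    "gate_proj": (2304, 9216),
    "up_proj": (2304, 9216),
    "down_proj": (9216, 2304),
}
NUM_LAYERS = 26  # "(0-25): 26 x Gemma2DecoderLayer"


def linear_weight_count(shapes=LINEAR_SHAPES, num_layers=NUM_LAYERS):
    """Total number of weights held in the quantized linear layers."""
    per_layer = sum(i * o for i, o in shapes.values())
    return per_layer * num_layers


def weight_bytes(num_weights, bits_per_weight):
    """Raw storage for the weights alone (quantization constants ignored)."""
    return num_weights * bits_per_weight // 8


n = linear_weight_count()
print(f"linear weights: {n:,}")
print(f"16-bit storage: {weight_bytes(n, 16) / 1e9:.2f} GB")
print(f" 4-bit storage: {weight_bytes(n, 4) / 1e9:.2f} GB")
```

Under these assumptions the linear layers shrink to roughly a quarter of their 16-bit size, which is where most of the saving comes from.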

## Loading the data
- Convert the data into the Gemma 2 prompt format.

In [5]:
from datasets import load_dataset
dataset = load_dataset('json', data_files="train.json", split="train").shuffle(seed=42)

def generate_prompt(data_point):
    prefix_text = '你是一個使用繁體中文的人工智慧助理，下面是問題的描述，以及對應的答案，請照著問題並且回答答案。\n\n'
    text = f"<start_of_turn>user {prefix_text} {data_point['instruction']} <end_of_turn>\n<start_of_turn>model {data_point['output']} <end_of_turn>"
    return text
# Add the 'prompt' column to the dataset
text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("prompt", text_column)
# Shuffle again, then tokenize the prompt text
dataset = dataset.shuffle(seed=1234)
dataset = dataset.map(lambda samples: tokenizer(samples["prompt"]), batched=True)
# Split the dataset into training and testing
dataset = dataset.train_test_split(test_size=0.2)
train_data = dataset["train"]
test_data = dataset["test"]

Generating train split: 0 examples [00:00, ? examples/s]

Flattening the indices:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [6]:
print(train_data[200])

{'id': '80f8fc1f-2d99-40bc-a2ac-f6139ecb6517', 'instruction': '翻譯成現代文：\n黯奏劾，廢終身。\n答案：', 'output': '賈黯上奏加以彈劾，桑澤被終身廢禁，不得任用。', 'prompt': '<start_of_turn>user 你是一個使用繁體中文的人工智慧助理，下面是問題的描述，以及對應的答案，請照著問題並且回答答案。\n\n 翻譯成現代文：\n黯奏劾，廢終身。\n答案： <end_of_turn>\n<start_of_turn>model 賈黯上奏加以彈劾，桑澤被終身廢禁，不得任用。 <end_of_turn>', 'input_ids': [2, 106, 1645, 30485, 115346, 7060, 238249, 236969, 50039, 11043, 235823, 68631, 141771, 235365, 45992, 235427, 17974, 235370, 35875, 235365, 21318, 236583, 237236, 235370, 48310, 235365, 237261, 236160, 236523, 17974, 144100, 33226, 48310, 235362, 109, 94335, 239790, 235636, 61910, 235642, 235465, 108, 243192, 237970, 245437, 235365, 240553, 236708, 235826, 235362, 108, 48310, 235465, 235248, 107, 108, 106, 2516, 235248, 242497, 243192, 235502, 237970, 197568, 239494, 245437, 235365, 238959, 238786, 235936, 236708, 235826, 240553, 237397, 235365, 32525, 236148, 235522, 235362, 235248, 107, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
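
For the model-testing step, the prompt would presumably reuse the same template but stop right after the `model` turn marker, so that the model generates the answer itself. The sketch below shows one way to build such a prompt. `build_inference_prompt` is a hypothetical helper that is not part of the notebook, and it only mirrors the training template shown above.

```python
PREFIX_TEXT = (
    "你是一個使用繁體中文的人工智慧助理，下面是問題的描述，"
    "以及對應的答案，請照著問題並且回答答案。\n\n"
)


def build_inference_prompt(instruction, prefix_text=PREFIX_TEXT):
    """Hypothetical helper: the training template with the answer left open."""
    return (
        f"<start_of_turn>user {prefix_text} {instruction} <end_of_turn>\n"
        "<start_of_turn>model "
    )


prompt = build_inference_prompt("翻譯成現代文：\n黯奏劾，廢終身。\n答案：")
print(prompt)
```

Because the tokenizer above was created with `add_eos_token=True`, a prompt tokenized for generation would likely also get an end-of-sequence token appended, so that option probably needs to be turned off at inference time.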

## Finding all layers that can be trained with QLoRA

In [7]:
def find_all_linear_names(peft_model, int4=False, int8=False):
    """Find the names of all linear layers in the model. Reference: the QLoRA paper."""
    cls = torch.nn.Linear
    if int4 or int8:
        import bitsandbytes as bnb
        if int4:
            cls = bnb.nn.Linear4bit
        elif int8:
            cls = bnb.nn.Linear8bitLt
    lora_module_names = set()
    for name, module in peft_model.named_modules():
        if isinstance(module, cls):
            # The output head is skipped and not added to lora_module_names
            if 'lm_head' in name:
                continue
            if 'output_layer' in name:
                continue
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    return sorted(lora_module_names)
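
The name handling in the function above can be tried without loading a model. The sketch below applies the same "keep the last path component, skip the output head" logic to a handful of dotted module names. `leaf_names` and the example names are illustrative only and are not taken from the notebook.

```python
def leaf_names(module_names, skip=("lm_head", "output_layer")):
    """Collect the last path component of each name, skipping output heads."""
    found = set()
    for name in module_names:
        if any(s in name for s in skip):
            continue
        found.add(name.split(".")[-1])
    return sorted(found)


# Illustrative dotted names in the style returned by named_modules().
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.down_proj",
    "lm_head",
]
print(leaf_names(example_names))
```

Collecting the names in a set removes duplicates, so each projection type appears once no matter how many decoder layers contain it.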

In [8]:
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model

model.enable_input_require_grads()
model.gradient_checkpointing_enable()

model = prepare_model_for_kbit_training(model)
modules = find_all_linear_names(model)  # Get modules to apply LoRA to
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
peft_model = get_peft_model(model, lora_config)
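
For reference, the standard LoRA update that this configuration parameterizes can be written as below. This is the usual formulation rather than something printed by the notebook, and the symbols $W_0$ (the frozen, quantized base weight), $A$, $B$, and $x$ are notation introduced here.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad A \in \mathbb{R}^{r \times d_{\text{in}}},\quad
B \in \mathbb{R}^{d_{\text{out}} \times r}
```

With `r=64` and `lora_alpha=32`, the scaling factor $\alpha / r$ is $0.5$, and only $A$ and $B$ receive gradients.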

In [9]:
print(peft_model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Gemma2ForCausalLM(
      (model): Gemma2Model(
        (embed_tokens): Embedding(256000, 2304, padding_idx=0)
        (layers): ModuleList(
          (0-25): 26 x Gemma2DecoderLayer(
            (self_attn): Gemma2Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2304, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2304, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
        

## Listing the number and percentage of trainable parameters

In [10]:
peft_model.print_trainable_parameters()

trainable params: 83,066,880 || all params: 2,697,408,768 || trainable%: 3.0795


In [11]:
print(modules)

['down_proj', 'gate_proj', 'k_proj', 'o_proj', 'q_proj', 'up_proj', 'v_proj']
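
The trainable-parameter count reported above can be reproduced by hand. Each adapted layer adds a matrix `A` of shape `r x in_features` and a matrix `B` of shape `out_features x r`. The sketch below sums these over the seven target modules and 26 decoder layers, with shapes copied from the printed model. `lora_param_count` is a helper name introduced here.

```python
# (in_features, out_features) per target module, from the printed model.
TARGET_SHAPES = {
    "q_proj": (2304, 2048),
    "k_proj": (2304, 1024),
    "v_proj": (2304, 1024),
    "o_proj": (2048, 2304),
    "gate_proj": (2304, 9216),
    "up_proj": (2304, 9216),
    "down_proj": (9216, 2304),
}


def lora_param_count(shapes, rank, num_layers):
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (i + o) for i, o in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(TARGET_SHAPES, rank=64, num_layers=26)
all_params = 2_697_408_768  # "all params" as reported by print_trainable_parameters()
percent = 100 * trainable / all_params
print(f"trainable params: {trainable:,} ({percent:.4f}%)")
```

The result agrees with the figures printed by `print_trainable_parameters()`, which suggests that the LoRA matrices are the only trainable weights.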


## Running PEFT training
- Training uses the `trl` library.
- Change `max_steps` to adjust the number of training steps.

In [12]:
import transformers
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=peft_model,
    train_dataset=train_data,
    eval_dataset=test_data,
    peft_config=lora_config,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size: 2 x 4 = 8 per device
        max_steps=100,                  # total number of optimizer steps
        learning_rate=2e-4,
        output_dir="output1",
        # dataset_text_field="prompt",
        optim="paged_adamw_32bit",
        save_strategy="steps",
        report_to="wandb",              # log to Weights & Biases; use "none" to disable
        logging_steps=1,
        packing=False,
        gradient_checkpointing=True,
    ),
    # Causal-LM collator (mlm=False): pads each batch and builds labels from input_ids
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Step,Training Loss
1,5.6173
2,4.1037
3,3.5407
4,3.2631
5,2.99
6,2.3749
7,2.776
8,2.5694
9,2.3886
10,2.6385


TrainOutput(global_step=100, training_loss=2.0449188995361327, metrics={'train_runtime': 652.0631, 'train_samples_per_second': 1.227, 'train_steps_per_second': 0.153, 'total_flos': 1174742458665984.0, 'train_loss': 2.0449188995361327, 'epoch': 0.1})
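
The `epoch` and throughput values in this `TrainOutput` follow from the batch settings. The sketch below redoes the arithmetic. It assumes a single GPU (`num_gpus = 1`) and takes the dataset size of 10,000 from the `Map` progress output in the data-loading cell. All variable names are introduced here for illustration.

```python
per_device_batch = 2      # per_device_train_batch_size
grad_accum = 4            # gradient_accumulation_steps
max_steps = 100
num_gpus = 1              # assumption: a single Colab GPU

dataset_size = 10_000                       # rows processed by dataset.map(...)
test_size = dataset_size * 20 // 100        # train_test_split(test_size=0.2)
train_size = dataset_size - test_size

samples_per_step = per_device_batch * grad_accum * num_gpus
samples_seen = samples_per_step * max_steps
epochs = samples_seen / train_size

train_runtime = 652.0631                    # seconds, from TrainOutput
print(f"samples per optimizer step: {samples_per_step}")
print(f"samples seen: {samples_seen} of {train_size} (epoch {epochs:.2f})")
print(f"samples/s: {samples_seen / train_runtime:.3f}")
print(f"steps/s:   {max_steps / train_runtime:.3f}")
```

In other words, 100 steps cover only about a tenth of the training split, so raising `max_steps` to around 1,000 would correspond to roughly one full epoch under the same settings.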

## Saving the adapter
- The Hugging Face token <font color=red>must include "write" permission</font>.
- In <font color=red>??????????</font>/peft-model-repo, replace ?????????? with your own Hugging Face account name.

In [13]:
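save_directory = "./peft_model"             # local path where the adapter is saved
repo_name = "shhuangmust/peft-model-repo"  # replace with your own repository name

# Save the LoRA adapter locally
peft_model.save_pretrained(save_directory)

# Upload the adapter to the Hugging Face Hub
peft_model.push_to_hub(repo_name)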

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/332M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/shhuangmust/peft-model-repo/commit/98482f9b0ed1747521a852a178d2d805de975e9b', commit_message='Upload model', commit_description='', oid='98482f9b0ed1747521a852a178d2d805de975e9b', pr_url=None, repo_url=RepoUrl('https://huggingface.co/shhuangmust/peft-model-repo', endpoint='https://huggingface.co', repo_type='model', repo_id='shhuangmust/peft-model-repo'), pr_revision=None, pr_num=None)

In [14]:
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
merged_model = PeftModel.from_pretrained(base_model, "output1/checkpoint-100")
merged_model = merged_model.merge_and_unload()


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]



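## Normally uploading only the adapter is enough, since the LLM can be used by loading the adapter on top of the base model. Here the two are merged directly into a single model (`merged_model`)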

In [15]:
merged_model.save_pretrained("merged_model", safe_serialization=True, push_to_hub=True)
tokenizer.save_pretrained("merged_model", push_to_hub=True)



Saving checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/241M [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


('merged_model/tokenizer_config.json',
 'merged_model/special_tokens_map.json',
 'merged_model/tokenizer.json')
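
The difference between uploading only the adapter and uploading the merged model can be estimated from the parameter counts reported earlier. This is a rough sketch: it assumes the adapter is stored in float32 and the merged model in float16 (the dtype used when loading the base model for merging), and it ignores file metadata. `size_bytes` is a helper name introduced here.

```python
def size_bytes(num_params, bytes_per_param):
    """Approximate on-disk size of a set of weights."""
    return num_params * bytes_per_param


adapter_params = 83_066_880      # "trainable params" reported earlier
all_params = 2_697_408_768       # "all params" reported earlier
base_params = all_params - adapter_params

adapter_bytes = size_bytes(adapter_params, 4)  # assumption: float32 adapter weights
merged_bytes = size_bytes(base_params, 2)      # float16 merged weights

print(f"adapter: {adapter_bytes / 1e6:.0f} MB")
print(f"merged:  {merged_bytes / 1e9:.2f} GB")
print(f"ratio:   {merged_bytes / adapter_bytes:.1f}x")
```

These estimates are close to the upload sizes shown in the progress output above (about 332M for `adapter_model.safetensors`, and 4.99G plus 241M for the two merged shards), which illustrates why sharing only the adapter is usually the lighter option.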