# PEFT tutorial using Hugging Face
## 教學目標
利用 Hugging Face 套件快速使用 PEFT 來進行下游任務訓練
- 單一句型分類任務 (single-sentence text classification)

## 適用對象
已經有基本的機器學習知識，且擁有 Python、`numpy`、`pandas`、`scikit-learn` 以及 `PyTorch` 基礎的學生。

若沒有先學過 Python，請參考 [python-入門語法](./python-入門語法.ipynb) 教學。

若沒有先學過 `pandas`，請參考 [pandas-基本功能](./pandas-基本功能.ipynb) 教學。

若沒有先學過 `numpy`，請參考 [numpy-基本功能](./numpy-基本功能.ipynb) 教學。

若沒有先學過 `scikit-learn`，請參考 [scikit-learn-基本功能](./scikit-learn-基本功能.ipynb) 教學。

若沒有先學過  `PyTorch` ，請參考 [PyTorch-基本功能](./PyTorch-基本功能.ipynb) 教學。

若沒有先學過如何使用 `PyTorch` 建立自然語言處理序列模型，請參考 [NN-中文文本分類](./NN-中文文本分類.ipynb) 教學。

## PEFT 簡易介紹
### 對大語言模型進行微調的挑戰
- 大語言模型的通常是以大量的文本資料進行訓練，並且在多個任務上取得了驚人的表現。
- 若我們想要將這些大語言模型應用在自己的任務上，通常需要進行微調。
- 但是對於大語言模型進行微調是一個挑戰，因為這些模型通常有數十億甚至數百億的參數，並且需要大量的計算資源。
- 這就是為什麼我們需要 PEFT 這個套件，它可以幫助我們快速的進行大語言模型的微調。
![](https://i.imgur.com/q6u4GVJ.png)
- 更多細節請參考 ([Peft github](https://github.com/huggingface/peft))

## PEFT 範例: LoRA
![](https://i.imgur.com/GCsNYXF.png)
- 請參考理論層面的詳細教學 ([影片連結](https://www.youtube.com/watch?v=dA-NhCtrrVE))
- 也可以參考原始論文 ([論文連結](https://arxiv.org/abs/2106.09685))

## Hugging Face 介紹
- 🤗 Hugging Face 是專門提供自然語言處理領域的函式庫
- 其函式庫支援 PyTorch 和 TensorFlow
- 🤗 Hugging Face 的主要套件為:
    1. Transformers ([連結](https://huggingface.co/transformers/index.html))
    - 提供了現今最強大的自然語言處理模型，使用上非常彈性且方便
    2. Tokenizers ([連結](https://huggingface.co/docs/tokenizers/python/latest/))
    - 讓你可以快速做好 BERT 系列模型 tokenization
    3. Datasets ([連結](https://huggingface.co/docs/datasets/))
    - 提供多種自然語言處理任務的資料集

In [1]:
# 若沒有安裝 transformers 和 datasets 套件，請取消以下註解並執行
# !pip install transformers==4.38.0
# !pip install datasets
# !pip install torch==2.0.1+cu110
# !pip install peft

In [2]:
# !git clone https://github.com/NVIDIA/apex
# %cd apex
# !pip install -r requirements.txt
# !pip install -v --disable-pip-version-check --no-cache-dir ./

In [2]:
# 確認所需套件的版本
import torch
print("PyTorch 的版本為: {}".format(torch.__version__))

import transformers
print("Hugging Face Transformers 的版本為: {}".format(transformers.__version__))

import datasets
print("Hugging Face Datasets 的版本為: {}".format(datasets.__version__))

import peft
print("PEFT 的版本為: {}".format(peft.__version__))

PyTorch 的版本為: 2.1.2+cu118
Hugging Face Transformers 的版本為: 4.39.3
Hugging Face Datasets 的版本為: 2.18.0
PEFT 的版本為: 0.10.0


In [3]:
# 載入其他所需套件

import os
import json
import numpy as np
from pathlib import Path # (Python3.4+)

## Task 1: 資料載入
本次作業要求選擇GLUE benchmark中的資料集至少兩個\
以上是一種讀取資料集的範例，但具體怎麼讀不做要求，可以直接用huggingface的load_datasets

## Hugging Face AutoTokenizer
- 使用 AutoTokenizer 搭配 Hugging Face models 的名稱可以直接呼叫使用
- 舉例:
    - transformers.AutoTokenizer.from_pretrained('roberta-base')
- [點這裡來查看 Hugging Face models 的名稱](https://huggingface.co/transformers/pretrained_models.html)

In [4]:
# 載入 tokenizer

# 在 Hugging Face 套件中可使用 .from_pretrained() 的方法來導入預訓練模型
tokenizer = transformers.AutoTokenizer.from_pretrained('roberta-base', trust_remote_code=True)

## Hugging Face Datasets
- Hugging Face Datasets 已經幫你收錄了自然語言處理領域常見的資料集
- 直接呼叫 Datasets 並搭配下面幾個 cells 的語法，可省下不少時間
- 但前提是你要進行的任務資料集有被收錄在 Hugging Face Datasets

### GLUE benchmark - CoLA

In [5]:
# 從 Hugging Face Datasets 載入資料並做資料切分

# 載入 CoLA 的訓練資料集
train_cola = datasets.load_dataset("glue", "cola", split="train", cache_dir="./cache/cola/train")

# 載入 CoLA 的驗證資料集
valid_cola = datasets.load_dataset("glue", "cola", split="validation", cache_dir="./cache/cola/valid")

# 載入 CoLA 的測試資料集
test_cola = datasets.load_dataset("glue", "cola", split="test", cache_dir="./cache/cola/test")

In [6]:
print(len(train_cola), len(valid_cola), len(test_cola))

8551 1043 1063


In [7]:
train_cola.features

{'sentence': Value(dtype='string', id=None),
 'label': ClassLabel(names=['unacceptable', 'acceptable'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [8]:
# 查看 max_length

tokenizer.model_max_length

512

In [9]:
# 將 Hugging Face Datasets 轉為 PyTorch Dataset 的封裝

def cola_to_torch_data(hug_dataset):
    """將 Hugging Face Datasets 轉為 PyTorch Dataset
    Args:
        - hug_dataset: 從 Datasets 載入的資料集
    Return:
        - dataset: 已轉為 PyTorch Dataset 的資料集
    """
    dataset = hug_dataset.map(
        lambda batch: tokenizer(
            batch["sentence"],
            truncation=True,
            padding=True
        ),
        batched=True
    )
    dataset.set_format(
        type='torch',
        columns=[
            'input_ids',
            'attention_mask',
            'label'
        ]
    )
    return dataset

train_dataset_cola = cola_to_torch_data(train_cola)
val_dataset_cola = cola_to_torch_data(valid_cola)
test_dataset_cola = cola_to_torch_data(test_cola)

### GLUE Benchmark - MRPC

In [9]:
# 從 Hugging Face Datasets 載入資料並做資料切分

# 載入 MRPC 的訓練資料集
train_mrpc = datasets.load_dataset("glue", "mrpc", split="train", cache_dir="./cache/mrpc/train")

# 載入 MRPC 的驗證資料集
valid_mrpc = datasets.load_dataset("glue", "mrpc", split="validation", cache_dir="./cache/mrpc/valid")

# 載入 MRPC 的測試資料集
test_mrpc = datasets.load_dataset("glue", "mrpc", split="test", cache_dir="./cache/mrpc/test")

In [10]:
print(len(train_mrpc), len(valid_mrpc), len(test_mrpc))

3668 408 1725


In [11]:
train_mrpc.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [12]:
# 將 Hugging Face Datasets 轉為 PyTorch Dataset 的封裝

def mrpc_to_torch_data(hug_dataset):
    """將 Hugging Face Datasets 轉為 PyTorch Dataset
    Args:
        - hug_dataset: 從 Datasets 載入的資料集
    Return:
        - dataset: 已轉為 PyTorch Dataset 的資料集
    """
    dataset = hug_dataset.map(
        lambda batch: tokenizer(
            batch["sentence1"],
            batch["sentence2"],
            truncation=True,
            padding=True
        ),
        batched=True
    )
    dataset.set_format(
        type='torch',
        columns=[
            'input_ids',
            'attention_mask',
            'label'
        ]
    )
    return dataset

train_dataset_mrpc = mrpc_to_torch_data(train_mrpc)
val_dataset_mrpc = mrpc_to_torch_data(valid_mrpc)
test_dataset_mrpc = mrpc_to_torch_data(test_mrpc)

## 檢查 tokenization 後的結果
- 使用 Hugging Face tokenizer 進行 tokenization 後的結果是一個 dict
- 這個 dict 的 keys 包含 'input_ids' 和 'attention_mask'
- input_ids: 原本句子中的每個字詞被斷詞後轉換成字典的 ID
    - 注意!! tokenizer 小小的動作已經幫你完成了斷詞和 word to ID 的轉換
- attention_mask: tokenization 後句子中包含文字的部分為 1，padding 的部分為 0
    - 可以想像成模型需要把注意力放在有文字的位置

In [13]:
train_dataset_cola[1]

{'label': tensor(1),
 'input_ids': tensor([    0,  3762,    55, 38283,   937,  1938,     8,    38,   437,  1311,
            62,     4,     2,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0])}

In [14]:
train_dataset_mrpc[1]

{'label': tensor(0),
 'input_ids': tensor([    0,   975, 26802,  1588,   102,  2164, 13976,  1758,   128,    29,
           137,  2183,     5,  3206,     7, 11881, 10564,    11,  6708,    13,
            68,   132,     4,   245,   325,   479,     2,     2,   975, 26802,
          1588,   102,  2162, 13976,  1758,   128,    29,    11,  7969,    13,
            68,   231,  6478,   153,     8,  1088,    24,     7, 11881, 10564,
            13,    68,   112,     4,   398,   325,    11,  6708,   479,     2,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## 使用 Hugging Face 的模型
- 在這個 API 盛行的世代，總是有人幫你設想周到
- [Hugging Face 的模型頁面連結](https://huggingface.co/models)
- 以 Roberta 為例，只要透過 AutoModel.from_pretrained("roberta-base")，就可以直接使用 RobertaModel
- 需要注意的是接下來你要做怎樣的下游任務訓練
- 同樣以 Roberta 為例，在原始論文中 Roberta 進行過以下的任務:
    - Sentence pair classification: MNLI/QQP/QNLI/MRPC/RTE/WNLI
        - 對應 `RobertaForSequenceClassification`
        - 使用雙句結合，並以分類的方式進行訓練
    - Semantic textual similarity: STS-B
        - `RobertaForSequenceClassification`
        - 使用雙句結合，並以迴歸的方式進行訓練
    - Single sentence classification: SST-2/CoLA
        - 對應 `RobertaForSequenceClassification`
        - 使用單句，並以迴歸的方式進行訓練
    - Question answering: SQuAD v1.1/v2.0
        - 對應 `RobertaForQuestionAnswering`
        - 使用雙句(問題+原文)，並透過答案在原文中的位置進行訓練
    - Named-entity recognition (slot filling): CoNLL-2003
        - 對應 `RobertaForTokenClassification`
        - 使用單句，並以分類的方式進行訓練
- 如果要進行的下游任務訓練不在 Hugging Face 已經建好的模型範圍，那就需要自己寫一個 model class:
    1. 繼承 torch.nn.Module
    2. 利用 super 來繼承所有親屬類別的實體屬性
    3. 定義欲使用的 pre-trained model
    4. 定義會使用到的層如 linear 或 Dropout 等
    5. 設計 forward function 並且設定下游任務的輸出

## 進行模型的訓練
### 使用 Hugging Face Trainer ([Documentation](https://huggingface.co/transformers/main_classes/trainer.html))
- Trainer 是 Hugging Face 中高度封裝的套件之一，負責模型訓練時期的"流程"
- 過去我們自行寫訓練流程的程式碼可以交給 Trainer
- Trainer 需要搭配使用 [TrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments)
    - TrainingArguments 是 Trainer 所需要的引數

## Task2: 模型驗證
這裏要求同學們撰寫computer_metrics函式 \
要求同學們參考GLUE benchmark的官方網頁，使用和資料集對應的evaluation matrics\
回傳一個測試結果的dict

In [20]:
# 建立自定的評估的指標 (定義 function)
# 將作為 transformers.Trainer 的 parameters 之一

# Scikit-learn 的 precision_recall_fscore_support 套件可以一次計算 F1 score, precision, 和 recall
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from datasets import load_metric

def compute_metrics_cola(eval_preds):
    # 請參考GLUE benchmark的官方網頁，使用和資料集對應的evaluation matrics
    metric = datasets.load_metric("glue", "cola")
    logits, labels = eval_preds.predictions, eval_preds.label_ids
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

def compute_metrics_mrpc(eval_preds):
    # 請參考GLUE benchmark的官方網頁，使用和資料集對應的evaluation matrics
    metric = datasets.load_metric("glue", "mrpc")
    logits, labels = eval_preds.predictions, eval_preds.label_ids
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [11]:
data_collator = transformers.DataCollatorWithPadding(
    tokenizer=tokenizer
)

## Task3: PEFT 
在這個模塊中，我們要求同學們把模型改成PEFT的形式（Bitfit、LoRA或者其他）\
同學們可以將模型的可訓練參數量print出來

以下是各個資料集的baseline: 

|dataset|metrics|baseline|
|----|----|----|
|CoLA|Matthew's Corr|0.6|
|SST2|Accuracy|0.88|
|MRPC|Accuracy|0.8|
|STSB|Pearson-Spearman Corr|0.8|
|QQP|F1 / Accuracy|0.8/0.8|
|MNLI_Matched|Accuracy|0.8|
|MNLI_Mismatched|Accuracy|0.8|
|QNLI|Accuracy|0.85|
|RTE|Accuracy|0.7|
|WNLI|Accuracy|0.8|

#### LoRA on CoLA dataset

In [18]:
# LoRA 訓練設定

from peft import LoraConfig, get_peft_model, TaskType
 
# Define LoRA Config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    inference_mode=False,
    task_type=TaskType.SEQ_CLS
)

# add LoRA adaptor
model = transformers.AutoModelForSequenceClassification.from_pretrained("roberta-base", trust_remote_code=True)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 1,181,954 || all params: 125,829,124 || trainable%: 0.9393326142841144


In [20]:
# 訓練模型

# 設定 TrainingArguments
training_args = transformers.TrainingArguments(
    output_dir="LoRA_CoLA",        # 輸出的資料夾
    num_train_epochs=10,              # 總共訓練的 epoch 數目
    learning_rate=1e-3,              # 學習率
    per_device_train_batch_size=256,  # 訓練模型時每個裝置的 batch size
    per_device_eval_batch_size=256,   # 驗證模型時每個裝置的 batch size
    gradient_accumulation_steps=1,   # 梯度累積的步數
    warmup_steps=10,                # learning rate scheduler 的參數
    weight_decay=1e-4,               # 最佳化演算法 (optimizer) 中的權重衰退率
    evaluation_strategy='epoch',     # 設定驗證的時機
    save_strategy='epoch',           # 設定儲存的時機
    logging_steps=0.1,
    seed=100
)

trainer = transformers.Trainer(
    model=model,                         # 🤗 的模型
    args=training_args,                  # Trainer 所需要的引數
    train_dataset=train_dataset_cola,         # 訓練集 (注意是 PyTorch Dataset)
    eval_dataset=val_dataset_cola,            # 驗證集 (注意是 PyTorch Dataset)，可使 Trainer 在進行訓練時也進行驗證
    compute_metrics=compute_metrics_cola,     # 自定的評估的指標
    data_collator=data_collator,
    tokenizer=tokenizer
)

# 指定使用 1 個 GPU 進行訓練
trainer.args._n_gpu=1

# 開始進行模型訓練
trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,0.6076,0.68567,0.301826
2,0.4818,0.496303,0.469179
3,0.4233,0.464689,0.480204
4,0.3697,0.456541,0.56516
5,0.336,0.406895,0.585751
6,0.3095,0.414831,0.588414
7,0.2712,0.456625,0.583025
8,0.2594,0.446429,0.58394
9,0.2442,0.494482,0.583025
10,0.2337,0.50546,0.590523


  metric = load_metric("glue", "cola")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric

TrainOutput(global_step=340, training_loss=0.35362693001242246, metrics={'train_runtime': 344.7826, 'train_samples_per_second': 248.011, 'train_steps_per_second': 0.986, 'total_flos': 2093805016160880.0, 'train_loss': 0.35362693001242246, 'epoch': 10.0})

#### LoRA on MRPC dataset

In [18]:
# LoRA 訓練設定

from peft import LoraConfig, get_peft_model, TaskType
 
# Define LoRA Config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    inference_mode=False,
    task_type=TaskType.SEQ_CLS
)

# add LoRA adaptor
model = transformers.AutoModelForSequenceClassification.from_pretrained("roberta-base", trust_remote_code=True)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 1,181,954 || all params: 125,829,124 || trainable%: 0.9393326142841144


In [19]:
# 訓練模型

# 設定 TrainingArguments
training_args = transformers.TrainingArguments(
    output_dir="LoRA_MRPC",        # 輸出的資料夾
    num_train_epochs=10,              # 總共訓練的 epoch 數目
    learning_rate=1e-3,              # 學習率
    per_device_train_batch_size=128,  # 訓練模型時每個裝置的 batch size
    per_device_eval_batch_size=128,   # 驗證模型時每個裝置的 batch size
    gradient_accumulation_steps=1,   # 梯度累積的步數
    warmup_steps=10,                # learning rate scheduler 的參數
    weight_decay=1e-4,               # 最佳化演算法 (optimizer) 中的權重衰退率
    evaluation_strategy='epoch',     # 設定驗證的時機
    save_strategy='epoch',           # 設定儲存的時機
    logging_steps=0.1,
    seed=100
)

trainer = transformers.Trainer(
    model=model,                         # 🤗 的模型
    args=training_args,                  # Trainer 所需要的引數
    train_dataset=train_dataset_mrpc,         # 訓練集 (注意是 PyTorch Dataset)
    eval_dataset=val_dataset_mrpc,            # 驗證集 (注意是 PyTorch Dataset)，可使 Trainer 在進行訓練時也進行驗證
    compute_metrics=compute_metrics_mrpc,     # 自定的評估的指標
    data_collator=data_collator,
    tokenizer=tokenizer
)

# 指定使用 1 個 GPU 進行訓練
trainer.args._n_gpu=1

# 開始進行模型訓練
trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.6404,0.636725,0.713235,0.825112
2,0.5038,0.347224,0.845588,0.886894
3,0.3993,0.333836,0.845588,0.889667
4,0.3175,0.347127,0.882353,0.916376
5,0.2642,0.302864,0.865196,0.901257
6,0.2257,0.396562,0.857843,0.90301
7,0.2015,0.346742,0.879902,0.915078
8,0.1704,0.398173,0.867647,0.906897
9,0.1441,0.396105,0.877451,0.911972
10,0.1301,0.404473,0.877451,0.912281


  metric = load_metric("glue", "mrpc")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric

TrainOutput(global_step=290, training_loss=0.2997131816272078, metrics={'train_runtime': 342.8572, 'train_samples_per_second': 106.983, 'train_steps_per_second': 0.846, 'total_flos': 1987394748218880.0, 'train_loss': 0.2997131816272078, 'epoch': 10.0})

#### BitFit on CoLA dataset

In [25]:
# BitFit 訓練設定
model = transformers.AutoModelForSequenceClassification.from_pretrained("roberta-base", trust_remote_code=True)

trainable_params = 0
all_param = 0

for name, param in model.named_parameters():
    all_param += param.numel()
    if "bias" not in name:
        param.require_grad = False
    else:
        param.require_grad = True
        trainable_params += param.numel()

print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 102914 || all params: 124647170 || trainable%: 0.08256424915222704


In [26]:
# 訓練模型

# 設定 TrainingArguments
training_args = transformers.TrainingArguments(
    output_dir="BitFit_CoLA",        # 輸出的資料夾
    num_train_epochs=10,              # 總共訓練的 epoch 數目
    learning_rate=1e-4,              # 學習率
    per_device_train_batch_size=64,  # 訓練模型時每個裝置的 batch size
    per_device_eval_batch_size=64,   # 驗證模型時每個裝置的 batch size
    gradient_accumulation_steps=1,   # 梯度累積的步數
    warmup_steps=10,                # learning rate scheduler 的參數
    weight_decay=1e-5,               # 最佳化演算法 (optimizer) 中的權重衰退率
    evaluation_strategy='epoch',     # 設定驗證的時機
    save_strategy='epoch',           # 設定儲存的時機
    logging_steps=0.1,
    seed=100
)

trainer = transformers.Trainer(
    model=model,                         # 🤗 的模型
    args=training_args,                  # Trainer 所需要的引數
    train_dataset=train_dataset_cola,         # 訓練集 (注意是 PyTorch Dataset)
    eval_dataset=val_dataset_cola,            # 驗證集 (注意是 PyTorch Dataset)，可使 Trainer 在進行訓練時也進行驗證
    compute_metrics=compute_metrics_cola,     # 自定的評估的指標
    data_collator=data_collator,
    tokenizer=tokenizer
)

# 指定使用 1 個 GPU 進行訓練
trainer.args._n_gpu=1

# 開始進行模型訓練
trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,0.5391,0.418387,0.548265
2,0.3729,0.439593,0.526136
3,0.2592,0.574695,0.553641
4,0.1743,0.589701,0.600984
5,0.1285,0.64502,0.59099
6,0.098,0.677215,0.599641
7,0.0658,0.906343,0.562623
8,0.0521,0.816503,0.603317
9,0.0322,0.945155,0.608684
10,0.0223,1.074773,0.595616


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datase

TrainOutput(global_step=1340, training_loss=0.1744336220755506, metrics={'train_runtime': 582.1451, 'train_samples_per_second': 146.888, 'train_steps_per_second': 2.302, 'total_flos': 2064580034754360.0, 'train_loss': 0.1744336220755506, 'epoch': 10.0})

#### BitFit on MRPC dataset

In [27]:
# BitFit 訓練設定
model = transformers.AutoModelForSequenceClassification.from_pretrained("roberta-base", trust_remote_code=True)

trainable_params = 0
all_param = 0

for name, param in model.named_parameters():
    all_param += param.numel()
    if "bias" not in name:
        param.require_grad = False
    else:
        param.require_grad = True
        trainable_params += param.numel()

print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 102914 || all params: 124647170 || trainable%: 0.08256424915222704


In [28]:
# 訓練模型

# 設定 TrainingArguments
training_args = transformers.TrainingArguments(
    output_dir="BitFit_MRPC",        # 輸出的資料夾
    num_train_epochs=10,              # 總共訓練的 epoch 數目
    learning_rate=1e-4,              # 學習率
    per_device_train_batch_size=64,  # 訓練模型時每個裝置的 batch size
    per_device_eval_batch_size=64,   # 驗證模型時每個裝置的 batch size
    gradient_accumulation_steps=1,   # 梯度累積的步數
    warmup_steps=10,                # learning rate scheduler 的參數
    weight_decay=1e-5,               # 最佳化演算法 (optimizer) 中的權重衰退率
    evaluation_strategy='epoch',     # 設定驗證的時機
    save_strategy='epoch',           # 設定儲存的時機
    logging_steps=0.1,
    seed=100
)

trainer = transformers.Trainer(
    model=model,                         # 🤗 的模型
    args=training_args,                  # Trainer 所需要的引數
    train_dataset=train_dataset_mrpc,         # 訓練集 (注意是 PyTorch Dataset)
    eval_dataset=val_dataset_mrpc,            # 驗證集 (注意是 PyTorch Dataset)，可使 Trainer 在進行訓練時也進行驗證
    compute_metrics=compute_metrics_mrpc,     # 自定的評估的指標
    data_collator=data_collator,
    tokenizer=tokenizer
)

# 指定使用 1 個 GPU 進行訓練
trainer.args._n_gpu=1

# 開始進行模型訓練
trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.6001,0.451975,0.776961,0.856693
2,0.4347,0.356418,0.843137,0.884477
3,0.3446,0.376615,0.828431,0.866412
4,0.2254,0.309359,0.870098,0.903108
5,0.1404,0.359604,0.867647,0.904594
6,0.0871,0.5092,0.872549,0.907801
7,0.0554,0.478919,0.879902,0.91358
8,0.0342,0.676003,0.865196,0.901961
9,0.0223,0.684587,0.882353,0.915194
10,0.0181,0.71534,0.879902,0.912343


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datase

TrainOutput(global_step=580, training_loss=0.1962397330793841, metrics={'train_runtime': 500.8034, 'train_samples_per_second': 73.242, 'train_steps_per_second': 1.158, 'total_flos': 1960341806841600.0, 'train_loss': 0.1962397330793841, 'epoch': 10.0})

## 比較不同hyper-parameter

都是LoRa on CoLA

### Batch size = 128
原本: 256

In [29]:
# LoRA 訓練設定

from peft import LoraConfig, get_peft_model, TaskType
 
# Define LoRA Config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    inference_mode=False,
    task_type=TaskType.SEQ_CLS
)

# add LoRA adaptor
model = transformers.AutoModelForSequenceClassification.from_pretrained("roberta-base", trust_remote_code=True)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 1,181,954 || all params: 125,829,124 || trainable%: 0.9393326142841144


In [30]:
# 訓練模型

# 設定 TrainingArguments
training_args = transformers.TrainingArguments(
    output_dir="LoRA_CoLA_2",        # 輸出的資料夾
    num_train_epochs=10,              # 總共訓練的 epoch 數目
    learning_rate=1e-3,              # 學習率
    per_device_train_batch_size=128,  # 訓練模型時每個裝置的 batch size
    per_device_eval_batch_size=128,   # 驗證模型時每個裝置的 batch size
    gradient_accumulation_steps=1,   # 梯度累積的步數
    warmup_steps=10,                # learning rate scheduler 的參數
    weight_decay=1e-4,               # 最佳化演算法 (optimizer) 中的權重衰退率
    evaluation_strategy='epoch',     # 設定驗證的時機
    save_strategy='epoch',           # 設定儲存的時機
    logging_steps=0.1,
    seed=100
)

trainer = transformers.Trainer(
    model=model,                         # 🤗 的模型
    args=training_args,                  # Trainer 所需要的引數
    train_dataset=train_dataset_cola,         # 訓練集 (注意是 PyTorch Dataset)
    eval_dataset=val_dataset_cola,            # 驗證集 (注意是 PyTorch Dataset)，可使 Trainer 在進行訓練時也進行驗證
    compute_metrics=compute_metrics_cola,     # 自定的評估的指標
    data_collator=data_collator,
    tokenizer=tokenizer
)

# 指定使用 1 個 GPU 進行訓練
trainer.args._n_gpu=1

# 開始進行模型訓練
trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,0.5461,0.508521,0.45359
2,0.4158,0.536687,0.452419
3,0.3585,0.413946,0.575239
4,0.3225,0.406251,0.596038
5,0.273,0.414675,0.603193
6,0.241,0.421161,0.604496
7,0.2227,0.535753,0.585706
8,0.1918,0.576062,0.593695
9,0.1789,0.547534,0.603359
10,0.1531,0.607017,0.590675


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datase

TrainOutput(global_step=670, training_loss=0.29034543677942076, metrics={'train_runtime': 364.2276, 'train_samples_per_second': 234.771, 'train_steps_per_second': 1.84, 'total_flos': 2093805016160880.0, 'train_loss': 0.29034543677942076, 'epoch': 10.0})

### Learning Rate = 1e-4
原本: 1e-3
weight decay 跟著調整, 從 1e-4 -> 1e-5

In [31]:
# LoRA 訓練設定

from peft import LoraConfig, get_peft_model, TaskType
 
# Define LoRA Config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    inference_mode=False,
    task_type=TaskType.SEQ_CLS
)

# add LoRA adaptor
model = transformers.AutoModelForSequenceClassification.from_pretrained("roberta-base", trust_remote_code=True)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 1,181,954 || all params: 125,829,124 || trainable%: 0.9393326142841144


In [33]:
# 訓練模型

# 設定 TrainingArguments
training_args = transformers.TrainingArguments(
    output_dir="LoRA_CoLA_2",        # 輸出的資料夾
    num_train_epochs=10,              # 總共訓練的 epoch 數目
    learning_rate=1e-4,              # 學習率
    per_device_train_batch_size=256,  # 訓練模型時每個裝置的 batch size
    per_device_eval_batch_size=256,   # 驗證模型時每個裝置的 batch size
    gradient_accumulation_steps=1,   # 梯度累積的步數
    warmup_steps=10,                # learning rate scheduler 的參數
    weight_decay=1e-5,               # 最佳化演算法 (optimizer) 中的權重衰退率
    evaluation_strategy='epoch',     # 設定驗證的時機
    save_strategy='epoch',           # 設定儲存的時機
    logging_steps=0.1,
    seed=100
)

trainer = transformers.Trainer(
    model=model,                         # 🤗 的模型
    args=training_args,                  # Trainer 所需要的引數
    train_dataset=train_dataset_cola,         # 訓練集 (注意是 PyTorch Dataset)
    eval_dataset=val_dataset_cola,            # 驗證集 (注意是 PyTorch Dataset)，可使 Trainer 在進行訓練時也進行驗證
    compute_metrics=compute_metrics_cola,     # 自定的評估的指標
    data_collator=data_collator,
    tokenizer=tokenizer
)

# 指定使用 1 個 GPU 進行訓練
trainer.args._n_gpu=1

# 開始進行模型訓練
trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,0.482,0.525661,0.382568
2,0.4406,0.47783,0.458045
3,0.434,0.525648,0.447911
4,0.4219,0.468637,0.507314
5,0.4079,0.460892,0.510438
6,0.3994,0.43774,0.527461
7,0.3915,0.474434,0.520756
8,0.3942,0.462667,0.52684
9,0.3912,0.469191,0.528618
10,0.3847,0.452924,0.53211


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datase

TrainOutput(global_step=340, training_loss=0.4147378388573142, metrics={'train_runtime': 310.0971, 'train_samples_per_second': 275.752, 'train_steps_per_second': 1.096, 'total_flos': 2093805016160880.0, 'train_loss': 0.4147378388573142, 'epoch': 10.0})

## 調整LoRA rank(r)

### r = 8

In [21]:
# LoRA 訓練設定

from peft import LoraConfig, get_peft_model, TaskType
 
# Define LoRA Config
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    inference_mode=False,
    task_type=TaskType.SEQ_CLS
)

# add LoRA adaptor
model = transformers.AutoModelForSequenceClassification.from_pretrained("roberta-base", trust_remote_code=True)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 887,042 || all params: 125,534,212 || trainable%: 0.7066137476531099


In [22]:
# 訓練模型

# 設定 TrainingArguments
training_args = transformers.TrainingArguments(
    output_dir="LoRA_CoLA",        # 輸出的資料夾
    num_train_epochs=10,              # 總共訓練的 epoch 數目
    learning_rate=1e-3,              # 學習率
    per_device_train_batch_size=256,  # 訓練模型時每個裝置的 batch size
    per_device_eval_batch_size=256,   # 驗證模型時每個裝置的 batch size
    gradient_accumulation_steps=1,   # 梯度累積的步數
    warmup_steps=10,                # learning rate scheduler 的參數
    weight_decay=1e-4,               # 最佳化演算法 (optimizer) 中的權重衰退率
    evaluation_strategy='epoch',     # 設定驗證的時機
    save_strategy='epoch',           # 設定儲存的時機
    logging_steps=0.1,
    seed=100
)

trainer = transformers.Trainer(
    model=model,                         # 🤗 的模型
    args=training_args,                  # Trainer 所需要的引數
    train_dataset=train_dataset_cola,         # 訓練集 (注意是 PyTorch Dataset)
    eval_dataset=val_dataset_cola,            # 驗證集 (注意是 PyTorch Dataset)，可使 Trainer 在進行訓練時也進行驗證
    compute_metrics=compute_metrics_cola,     # 自定的評估的指標
    data_collator=data_collator,
    tokenizer=tokenizer
)

# 指定使用 1 個 GPU 進行訓練
trainer.args._n_gpu=1

# 開始進行模型訓練
trainer.train()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,0.5614,0.45985,0.529114
2,0.4322,0.498871,0.520951
3,0.3895,0.406159,0.573511
4,0.357,0.416802,0.593084
5,0.3177,0.47034,0.586123
6,0.2964,0.408497,0.618264
7,0.2675,0.501184,0.615864
8,0.2558,0.436887,0.618897
9,0.2466,0.528743,0.613466
10,0.2317,0.510608,0.610736


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datase

TrainOutput(global_step=340, training_loss=0.33557842338786403, metrics={'train_runtime': 307.1694, 'train_samples_per_second': 278.381, 'train_steps_per_second': 1.107, 'total_flos': 2086693561277040.0, 'train_loss': 0.33557842338786403, 'epoch': 10.0})

### r = 32

In [23]:
# LoRA 訓練設定

from peft import LoraConfig, get_peft_model, TaskType
 
# Define LoRA Config
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.1,
    inference_mode=False,
    task_type=TaskType.SEQ_CLS
)

# add LoRA adaptor
model = transformers.AutoModelForSequenceClassification.from_pretrained("roberta-base", trust_remote_code=True)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 1,771,778 || all params: 126,418,948 || trainable%: 1.4015130073697497


In [24]:
# 訓練模型

# 設定 TrainingArguments
training_args = transformers.TrainingArguments(
    output_dir="LoRA_CoLA",        # 輸出的資料夾
    num_train_epochs=10,              # 總共訓練的 epoch 數目
    learning_rate=1e-3,              # 學習率
    per_device_train_batch_size=256,  # 訓練模型時每個裝置的 batch size
    per_device_eval_batch_size=256,   # 驗證模型時每個裝置的 batch size
    gradient_accumulation_steps=1,   # 梯度累積的步數
    warmup_steps=10,                # learning rate scheduler 的參數
    weight_decay=1e-4,               # 最佳化演算法 (optimizer) 中的權重衰退率
    evaluation_strategy='epoch',     # 設定驗證的時機
    save_strategy='epoch',           # 設定儲存的時機
    logging_steps=0.1,
    seed=100
)

trainer = transformers.Trainer(
    model=model,                         # 🤗 的模型
    args=training_args,                  # Trainer 所需要的引數
    train_dataset=train_dataset_cola,         # 訓練集 (注意是 PyTorch Dataset)
    eval_dataset=val_dataset_cola,            # 驗證集 (注意是 PyTorch Dataset)，可使 Trainer 在進行訓練時也進行驗證
    compute_metrics=compute_metrics_cola,     # 自定的評估的指標
    data_collator=data_collator,
    tokenizer=tokenizer
)

# 指定使用 1 個 GPU 進行訓練
trainer.args._n_gpu=1

# 開始進行模型訓練
trainer.train()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,0.5674,0.577912,0.390709
2,0.429,0.494683,0.526048
3,0.3852,0.457663,0.54521
4,0.3385,0.458133,0.580628
5,0.3031,0.429213,0.608191
6,0.2813,0.416304,0.591793
7,0.2509,0.474691,0.593055
8,0.2315,0.482404,0.593382
9,0.2144,0.501416,0.613252
10,0.2078,0.52597,0.605723


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datase

TrainOutput(global_step=340, training_loss=0.32091039910035973, metrics={'train_runtime': 309.7485, 'train_samples_per_second': 276.063, 'train_steps_per_second': 1.098, 'total_flos': 2108027925928560.0, 'train_loss': 0.32091039910035973, 'epoch': 10.0})