<a href="https://colab.research.google.com/github/shhuangmust/AI/blob/112-2/Fine_tuning_a_model_with_the_Trainer_Accuracy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning a model with the Trainer API

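Transformers provides a `Trainer` class to **fine-tune** any pretrained model on your dataset. Once the data preprocessing is done, only a few steps remain to define the `Trainer` and prepare to run `Trainer.train()`.

Install the required packages.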

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate -U


In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Dataset splits: train 3668 examples, validation 408 examples, test 1725 examples.

In [None]:
tokenizer(tokenized_datasets['train']['sentence1'][0])

{'input_ids': [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenizer(tokenized_datasets['train']['sentence2'][0])

{'input_ids': [101, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenizer(tokenized_datasets['train']['sentence1'][0], tokenized_datasets['train']['sentence2'][0])

{'input_ids': [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenized_datasets['train']['input_ids'][0]

[101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102]

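### Defining the `Trainer` arguments
Before defining the `Trainer`, the first step is to define a `TrainingArguments` object, which holds all the hyperparameters the `Trainer` uses for training and evaluation. To keep things simple, the only arguments provided here are the directory where the trained model is saved and the number of training epochs. Everything else keeps its default value, which should work well for basic fine-tuning.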

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer", num_train_epochs=10)

Use the `AutoModelForSequenceClassification` class with two labels:

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


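### Explanation
Instantiating this pretrained model produces a warning. BERT (`bert-base-uncased`) was not pretrained on classifying pairs of sentences, so the head of the pretrained model was discarded and a new head suited to sequence classification was added in its place. The warning means that some weights were not used (those belonging to the discarded pretraining head) and that some other weights were randomly initialized (those of the new head).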
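The warning names exactly two newly initialized tensors, `classifier.weight` and `classifier.bias`. As a rough sketch, assuming the new head is a single linear layer that maps the encoder's hidden state to `num_labels` logits, and assuming a hidden size of 768 for `bert-base-uncased`, the size of the randomly initialized part can be worked out by hand:

```python
# Sketch only: sizes of the freshly initialized classification head.
hidden_size = 768   # assumption: hidden size of bert-base-uncased
num_labels = 2      # value passed to from_pretrained(..., num_labels=2)

# A linear layer stores a weight of shape (out_features, in_features) and a bias per output.
classifier_weight_shape = (num_labels, hidden_size)
classifier_bias_shape = (num_labels,)

new_parameters = num_labels * hidden_size + num_labels
print(classifier_weight_shape, classifier_bias_shape, new_parameters)
```

Only this small set of parameters starts from random values; the rest of the network starts from the pretrained checkpoint, which is why the model must be fine-tuned before its predictions mean anything.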

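### Defining the `Trainer`
With the model in place, define a `Trainer` by passing it all the objects built so far: the model, the `training_args`, the training and validation datasets, the `data_collator`, and the tokenizer.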

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

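### Explanation
When the tokenizer is passed this way, the default `data_collator` used by the `Trainer` is a `DataCollatorWithPadding` like the one defined earlier, so the line `data_collator=data_collator` can be omitted.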
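To make the role of the collator concrete, here is a minimal pure-Python sketch of dynamic padding: each batch is padded only to the length of its longest sequence. It is an illustration, not the library's implementation; the helper name `pad_batch` is hypothetical, and the padding id of 0 is an assumption matching the `[PAD]` token of `bert-base-uncased`.

```python
# Illustrative sketch of dynamic padding; `pad_batch` is a hypothetical helper.
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in `features` to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch


# Two toy examples of different lengths
features = [
    {"input_ids": [101, 2572, 3217, 102]},
    {"input_ids": [101, 7727, 102]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short batches short, which is why the tokenization step earlier only truncates and leaves padding to the collator.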

### Start training
Call the `train()` method of the `Trainer`.

In [None]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 0.5332 |
| 1000 | 0.3468 |
| 1500 | 0.1831 |
| 2000 | 0.1061 |
| 2500 | 0.035  |
| 3000 | 0.0287 |
| 3500 | 0.0053 |
| 4000 | 0.0042 |
| 4500 | 0.0051 |


TrainOutput(global_step=4590, training_loss=0.1359083686509395, metrics={'train_runtime': 684.016, 'train_samples_per_second': 53.624, 'train_steps_per_second': 6.71, 'total_flos': 1353042435523440.0, 'train_loss': 0.1359083686509395, 'epoch': 10.0})

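### Room for improvement
Fine-tuning runs and reports the training loss every 500 steps, but it says nothing about how well the model actually performs. There are two reasons:
* `evaluation_strategy` was not set to `steps` (evaluate every `eval_steps`) or `epoch` (evaluate at the end of each epoch), so the `Trainer` was never told to evaluate during training.
* No `compute_metrics()` function was given to the `Trainer` to calculate metrics during that evaluation.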
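The step counts in the training log can be reproduced with a little arithmetic. The sketch below assumes the default per-device batch size of 8 on a single device and a logging interval of 500 steps; neither value is set explicitly in this notebook, so treat both as assumptions.

```python
import math

num_train_examples = 3668  # MRPC training examples
batch_size = 8             # assumption: default per-device batch size, one device
num_epochs = 10            # num_train_epochs passed to TrainingArguments
logging_steps = 500        # assumption: logging interval seen in the training log

# One optimizer step per batch; the last batch of an epoch may be smaller.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which a training loss is reported
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))
print(steps_per_epoch, total_steps, logged_steps)
```

Under these assumptions the total matches the `global_step=4590` reported by `TrainOutput`, and the nine logged steps match the rows of the loss log.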

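### Evaluating the model
Pass the validation dataset through the trained model. The validation set has 408 examples and the model outputs two logits per example, so the predictions have shape 408 x 2. The first value is the logit (a float) for label 0 and the second is the logit for label 1; the larger value corresponds to the label the model considers more likely.

`label_ids` holds the ground-truth labels, each either 0 or 1, so it has 408 entries.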

In [None]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


In [None]:
print(predictions.predictions[0])
print(predictions.label_ids[0])

[-5.626502  5.618377]
1


The logits can also be converted to probabilities with softmax:

In [None]:
import numpy as np

# Defining the softmax function
def softmax(x):
    e_x = np.exp(x - np.max(x))  # Subtracting the max for numerical stability
    return e_x / e_x.sum()

# Input array
x = np.array(predictions.predictions[0])

# Applying the softmax function
softmax_values = softmax(x)
softmax_values

array([1.3073912e-05, 9.9998689e-01], dtype=float32)

In [None]:
import torch
import torch.nn.functional as F

x_tensor = torch.tensor(predictions.predictions[0])
softmax_values_torch = F.softmax(x_tensor, dim=0)
softmax_values_torch

tensor([1.3074e-05, 9.9999e-01])
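For the two-class case there is a handy shortcut: softmax over a pair of logits equals a sigmoid applied to their difference. The sketch below checks this with the standard library only, using the logits printed for the first validation example; the variable names are illustrative.

```python
import math

# Logits printed for the first validation example
logit_0, logit_1 = -5.626502, 5.618377

# With two classes, softmax over the pair reduces to a sigmoid of the logit difference.
p1 = 1.0 / (1.0 + math.exp(-(logit_1 - logit_0)))
p0 = 1.0 - p1
print(p0, p1)
```

Up to floating-point precision, the result agrees with the NumPy and PyTorch outputs shown above.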

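The output of the `predict()` method is another named tuple with three fields: `predictions`, `label_ids`, and `metrics`. The `metrics` field contains only the loss on the dataset passed in, plus some timing figures (total and average prediction time). Once the `compute_metrics()` function is written and passed to the `Trainer`, this field will also contain the metrics returned by `compute_metrics()`.

`predictions` is a two-dimensional array of shape 408 x 2 (408 is the number of elements in the dataset used). These are the logits for each element of the dataset passed to `predict()` (all Transformer models return logits). To turn them into predictions that can be compared with the labels, take the index of the maximum value along the second axis: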

In [None]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)

In [None]:
print("Predictions:", preds[:20])
print("Ground truth:", predictions.label_ids[:20])

Predictions: [1 0 0 1 0 1 0 1 1 1 1 0 0 1 1 1 1 0 1 0]
Ground truth: [1 0 0 1 0 1 0 1 1 1 1 0 0 1 1 1 1 0 1 0]


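### Metrics for measuring model performance
There are many ways to measure how well a model performs. For example, count how many of the 408 predictions exactly match the ground truth and divide by the total; this proportion of correct predictions is called accuracy. Other metrics exist as well, such as the F1 score. Different datasets come with different metrics, so it is enough to load the metric associated with the dataset in use.

The metrics come from the `Evaluate` library. Loading the metric associated with the MRPC dataset is as easy as loading the dataset itself, this time with the `evaluate.load()` function. The returned object has a `compute()` method that performs the metric calculation: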

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)


{'accuracy': 0.8431372549019608, 'f1': 0.8929765886287626}
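To make the two reported numbers concrete, the following sketch computes accuracy and binary F1 by hand on a small toy example. It is a plain-Python illustration, not the code the `Evaluate` library runs; the function names and the toy data are hypothetical, and label 1 is assumed to be the positive class.

```python
def accuracy(preds, labels):
    """Fraction of predictions that equal the reference label."""
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


def f1_binary(preds, labels, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 8 examples, 2 of them misclassified
toy_labels = [1, 0, 1, 1, 0, 1, 0, 0]
toy_preds = [1, 0, 0, 1, 0, 1, 1, 0]
toy_accuracy = accuracy(toy_preds, toy_labels)
toy_f1 = f1_binary(toy_preds, toy_labels)
print(toy_accuracy, toy_f1)
```

By the same definition, the reported accuracy of 0.8431 on 408 validation examples corresponds to 344 correct predictions (344 / 408 ≈ 0.8431).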

Define a `compute_metrics()` function and pass it to the `Trainer` as an argument; accuracy and F1 score are then reported at each evaluation during training.

In [None]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch", num_train_epochs=10)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)



In [None]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----|
| 1  | No log   | 0.552758 | 0.735294 | 0.836858 |
| 2  | 0.545500 | 0.400461 | 0.845588 | 0.891938 |
| 3  | 0.358500 | 0.497515 | 0.845588 | 0.890435 |
| 4  | 0.217800 | 0.878378 | 0.838235 | 0.890728 |
| 5  | 0.131400 | 0.790311 | 0.85049  | 0.896082 |
| 6  | 0.065200 | 1.114082 | 0.835784 | 0.890344 |
| 7  | 0.033300 | 0.99633  | 0.852941 | 0.897611 |
| 8  | 0.023100 | 1.037087 | 0.845588 | 0.892308 |
| 9  | 0.006300 | 1.086482 | 0.840686 | 0.889267 |
| 10 | 0.007900 | 1.107838 | 0.85049  | 0.897133 |


TrainOutput(global_step=4590, training_loss=0.1512976451888518, metrics={'train_runtime': 693.9632, 'train_samples_per_second': 52.856, 'train_steps_per_second': 6.614, 'total_flos': 1353042435523440.0, 'train_loss': 0.1512976451888518, 'epoch': 10.0})