# Hugging Face NLP Tasks: Text Classification as an Example (Official HF Example)

Check the current runtime environment (GPU) and install the required packages.

In [1]:
!nvidia-smi

Sat Dec 16 05:05:51 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
!pip install -q -U datasets transformers accelerate #-U:updated_to_new_version

To upload large files (such as model weights) to the Hugging Face Hub through Git, install Git LFS (Large File Storage).

In [3]:
# Git LFS is used when uploading large files to the Hugging Face Hub
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


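If you are running this notebook locally, make sure your environment has the latest versions of the packages above installed.

To share your model with the community, a few more steps are needed. First, store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you have not already), then run `notebook_login()` and enter your access token when prompted.

Please complete the following two settings:
- Create a Hugging Face token: https://huggingface.co/settings/tokens
- Create a new model repository under your account: https://huggingface.co/new
(set it to public)

[If you plan to upload to the Hugging Face Hub]
*Make sure your Hugging Face token has write permission.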

In [4]:
from huggingface_hub import notebook_login

notebook_login()  # this example needs a Hugging Face token with write permission

Make sure your Transformers version is 4.11.0 or later.

In [5]:
import accelerate
import transformers

print(transformers.__version__)

4.36.1


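The cell below sends a small amount of telemetry about this example. It tells us (Hugging Face) which examples and software versions are being used, so we know where to prioritize maintenance. We (Hugging Face) do not collect (or care about) any personally identifiable information, but if you would rather not be counted, feel free to skip this step or delete the cell entirely.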

In [6]:
from transformers.utils import send_example_telemetry

send_example_telemetry("text_classification_notebook", framework="pytorch")

## Fine-tuning a model on a text classification task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a text classification task from the [GLUE benchmark](https://gluebenchmark.com/).

<img src="https://github.com/huggingface/notebooks/blob/main/examples/images/text_classification.png?raw=true" width="640" height="320">

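The GLUE (General Language Understanding Evaluation) benchmark is a collection of nine classification tasks on sentences or sentence pairs:

- [CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability): determine whether a sentence is grammatically correct.
- [MNLI](https://arxiv.org/abs/1704.05426) (Multi-Genre Natural Language Inference): determine whether a sentence entails, contradicts, or is unrelated to a given hypothesis. (This dataset has two versions: one whose validation and test sets come from the same distribution, and one called mismatched, whose validation and test sets use out-of-domain data.)
- [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) (Microsoft Research Paraphrase Corpus): determine whether two sentences are paraphrases of each other.
- [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) (Question-answering Natural Language Inference): determine whether the answer to a question is contained in a second sentence.
- [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Quora Question Pairs): determine whether two questions are semantically equivalent.
- [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) (Recognizing Textual Entailment): determine whether a sentence entails a given hypothesis.
- [SST-2](https://nlp.stanford.edu/sentiment/index.html) (Stanford Sentiment Treebank): determine whether a sentence has positive or negative sentiment.
- [STS-B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) (Semantic Textual Similarity Benchmark): determine the similarity of two sentences with a score from 0 to 5.
- [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) (Winograd Natural Language Inference): determine whether a sentence with an anonymous pronoun and a sentence with that pronoun replaced are entailed or not.

We will see how to easily load the dataset for each task and fine-tune a model on it with the `Trainer` API. Each task is named by its acronym; `mnli-mm` stands for the mismatched version of MNLI (so the training set is the same as for `mnli`, but the validation and test sets differ).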

### Model and GLUE tasks

In [7]:
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

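This notebook can run on any of the tasks listed above, as long as `model_checkpoint` comes from the Hugging Face [Model Hub](https://huggingface.co/models). Depending on your model and the GPU you are using, you may need to adjust the batch size to avoid out-of-memory errors. Set these three parameters, and the rest of the notebook should run smoothly: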

In [8]:
task = "cola"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

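**Model: DistilBERT**
- A smaller student network (half as many layers) is trained to reproduce the behavior of a pre-trained BERT teacher.
- Training uses a triple loss (language modeling, distillation, and cosine-distance terms); the further the student's predictions are from the teacher's, the larger the loss.
- According to the DistilBERT paper, the model is 40% smaller than BERT and 60% faster.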

![image](https://drive.google.com/uc?export=view&id=1aChSbPdOzJ6L5jLpA-D7BzMek54JVlcO)
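
To make the distillation idea concrete, here is a minimal pure-Python sketch of the soft-target term only. The student is penalized by the KL divergence between the teacher's and the student's temperature-softened distributions. The function names, the logits, and the temperature value are illustrative assumptions; this is not the DistilBERT training code, which combines a distillation term with the language modeling and cosine-distance terms.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]        # hypothetical teacher logits
close_student = [1.8, 0.6, -0.9]  # student that roughly agrees with the teacher
far_student = [-1.0, 0.5, 2.0]    # student that disagrees with the teacher

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The second value is larger than the first: the further the student's distribution drifts from the teacher's, the higher the loss.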


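### Step 1. Load the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need for evaluation (to compare our model with the baseline). This is easily done with the `load_dataset` and `load_metric` functions. Note that `load_metric` is deprecated in recent 🤗 Datasets releases in favor of the 🤗 Evaluate library (`evaluate.load`), so these cells may need adapting on newer versions.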

In [9]:
from datasets import load_dataset, load_metric

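Apart from `mnli-mm`, which is a special code, we can pass the task name directly to these functions. `load_dataset` caches the dataset so it is not downloaded again the next time you run this cell.

Dataset description: https://huggingface.co/datasets/glue/viewer/mnli/train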

In [10]:
actual_task = "mnli" if task == "mnli-mm" else task  # in this example, task is simply "cola"
dataset = load_dataset("glue", actual_task)
metric = load_metric('glue', actual_task)

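The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation, and test sets (for `mnli`, there are additional keys for the mismatched validation and test sets).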

In [11]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

Label classes of the CoLA (Corpus of Linguistic Acceptability) dataset: **{ 0: unacceptable, 1: acceptable }**

In [12]:
dataset["train"][100], dataset["train"][400]

({'sentence': 'If you eat more, you want correspondingly less.',
  'label': 1,
  'idx': 100},
 {'sentence': 'How many people do you wonder whether I consider intelligent?',
  'label': 0,
  'idx': 400})

To get a better sense of what the data looks like, the following function shows a few records picked at random from the dataset.

In [13]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [14]:
show_random_elements(dataset["train"])

| | sentence | label | idx |
|---|---|---|---|
| 0 | Before you make plans, consult the secretary. | acceptable | 5203 |
| 1 | I didn't read a single book the whole time I was in the library. | acceptable | 5761 |
| 2 | The teacher became tired of the students. | acceptable | 3939 |
| 3 | There were twenty students at the lecture and every student who was there said it was inspiring. | acceptable | 6455 |
| 4 | I know a man who John is as tall as. | acceptable | 1173 |
| 5 | Mary taught linguistics to John. | acceptable | 2081 |
| 6 | There tried to be a fountain in the park. | unacceptable | 4315 |
| 7 | The man killed the king with the knife. | acceptable | 5698 |
| 8 | Is even Clarence, who is wearing mauve socks, a swinger? | acceptable | 1878 |
| 9 | The boat sank to collect the insurance. | unacceptable | 460 |


The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [15]:
metric

Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(res

You can call its `compute` method directly with your predictions and labels, and it returns a dictionary with the metric value(s):

In [16]:
import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)

{'matthews_correlation': -0.0148961999725068}

Note: the metric that `load_metric` returns for each task in this example is:

- for CoLA: [Matthews Correlation Coefficient](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient)
- for MNLI (matched or mismatched): Accuracy
- for MRPC: Accuracy and [F1 score](https://en.wikipedia.org/wiki/F1_score)
- for QNLI: Accuracy
- for QQP: Accuracy and [F1 score](https://en.wikipedia.org/wiki/F1_score)
- for RTE: Accuracy
- for SST-2: Accuracy
- for STS-B: [Pearson Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) and [Spearman's Rank Correlation Coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)
- for WNLI: Accuracy

Note: the Matthews Correlation Coefficient can be computed directly from the entries of the confusion matrix:

![image](https://wikimedia.org/api/rest_v1/media/math/render/svg/33f3d62224f97cdef8bc559ee455c3f4815f5788)
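
The confusion-matrix form can be sketched in a few lines of plain Python. The helper names and the toy predictions below are assumptions made for illustration; returning 0.0 when the denominator is zero is a common convention, not something this notebook specifies.

```python
import math

def confusion_counts(predictions, references, positive=1):
    """Count TP, TN, FP, FN for binary labels."""
    tp = tn = fp = fn = 0
    for pred, ref in zip(predictions, references):
        if pred == positive and ref == positive:
            tp += 1
        elif pred != positive and ref != positive:
            tn += 1
        elif pred == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when a row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0

# Toy data: 8 acceptable (1) and 4 unacceptable (0) sentences.
references = [1] * 8 + [0] * 4
predictions = [1] * 6 + [0] * 2 + [1] + [0] * 3

counts = confusion_counts(predictions, references)
print(counts)                                # → (6, 3, 1, 2)
print(round(matthews_corrcoef(*counts), 3))  # → 0.478
```

A perfect classifier scores 1.0, while predictions unrelated to the labels score close to 0, which is why the random predictions above gave a value near zero.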

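### Step 2. Preprocess the data

Before we can feed these texts to the model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which (as the name indicates) tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in the format the model expects, and generates the other inputs the model requires.

To do this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which ensures that:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the **vocabulary** (vocab.txt) used when pretraining this specific `model_checkpoint`.

That **vocabulary** (vocab.txt) is cached, so it is not downloaded again the next time the cell is run.

#### Load the tokenizer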

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True) # model_checkpoint="distilbert-base-uncased"

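In the call above, we pass `use_fast=True` to use one of the **fast tokenizers** (backed by Rust) from the 🤗 Tokenizers library. Fast tokenizers are available for almost all models, but if the previous call raised an error, remove that argument.

You can call this tokenizer directly on one sentence or on a pair of sentences:

*Note: for comparison, try OpenAI's tokenizer: https://platform.openai.com/tokenizer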

In [18]:
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [19]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [20]:
tokenizer("Hello, this one sentence! And this sentence goes with it.")

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [21]:
tokenizer("[CLS] Hello, this one sentence! [SEP] And this sentence goes with it. [SEP]")

{'input_ids': [101, 101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
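
The three outputs above differ only in where the special tokens land: `[CLS]` (ID 101) opens the sequence and `[SEP]` (ID 102) closes each sentence. Typing `[CLS]` and `[SEP]` into the text yourself, as in the last cell, duplicates them because the tokenizer still adds its own. The sketch below reproduces the first two layouts from the raw token IDs; `build_inputs` is a hypothetical helper written for illustration, not part of the tokenizer API.

```python
CLS_ID, SEP_ID = 101, 102  # IDs of [CLS] and [SEP] in the tokenizer printed above

def build_inputs(ids_a, ids_b=None):
    """Lay out one sentence or a sentence pair the way the outputs above show."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    if ids_b is not None:
        input_ids += ids_b + [SEP_ID]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

# Token IDs copied from the outputs above, without the special tokens.
first = [7592, 1010, 2023, 2028, 6251, 999]          # "Hello, this one sentence!"
second = [1998, 2023, 6251, 3632, 2007, 2009, 1012]  # "And this sentence goes with it."

pair = build_inputs(first, second)     # two sentences: [CLS] A [SEP] B [SEP]
single = build_inputs(first + second)  # one sentence:  [CLS] A B [SEP]
print(pair["input_ids"])
print(single["input_ids"])
```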

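Depending on the model you selected, you will see different keys in the dictionary returned by the cells above. They do not matter much for what we are doing here (just know they are required by the model we will instantiate later); if you are interested, you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html).

To preprocess our dataset, we need the names of the columns that contain the sentences. The following dictionary keeps track of the correspondence between tasks and column names: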

In [22]:
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

Let's check that these column names are correct for the current dataset:

In [23]:
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence: Our friends won't buy this analysis, let alone the next one we propose.


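Next, we can write the function that preprocesses our samples (`preprocess_function`). We simply feed them to the `tokenizer` with the argument `truncation=True`. This ensures that any input longer than the selected model can handle is **truncated** to the maximum length the model accepts.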

In [24]:
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

All the keys returned by the tokenizer (for example `input_ids` and `attention_mask`) are truncated according to this rule.

In [25]:
preprocess_function(dataset['train'][:5])

{'input_ids': [[101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 1998, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 2030, 1045, 1005, 1049, 3228, 2039, 1012, 102], [101, 1996, 2062, 2057, 2817, 16025, 1010, 1996, 13675, 16103, 2121, 2027, 2131, 1012, 102], [101, 2154, 2011, 2154, 1996, 8866, 2024, 2893, 14163, 8024, 3771, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
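
None of the five sentences above is long enough to be truncated, so here is a simplified sketch of what truncation does in the single-sentence case. It assumes the 512-token limit reported by the tokenizer earlier (`model_max_length=512`) and uses made-up token IDs; `encode_with_truncation` is a hypothetical helper, and the real tokenizer also handles sentence pairs and other truncation strategies.

```python
CLS_ID, SEP_ID = 101, 102  # IDs of [CLS] and [SEP]

def encode_with_truncation(token_ids, max_length):
    """Keep at most max_length IDs in total, reserving two slots for [CLS] and [SEP]."""
    kept = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + kept + [SEP_ID]
    # The attention mask always has the same length as input_ids.
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

long_sentence = list(range(1000, 1600))  # 600 made-up token IDs
encoded = encode_with_truncation(long_sentence, max_length=512)
print(len(encoded["input_ids"]), len(encoded["attention_mask"]))  # → 512 512
```

Short inputs pass through unchanged, and every returned key ends up with the same length.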

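To apply this function to all the sentences (or sentence pairs) in our dataset, we use the `map` method of the `dataset` object created earlier. It applies the function to every element of every split in `dataset`, so the training, validation, and test data are all preprocessed with a single command.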

In [26]:
encoded_dataset = dataset.map(preprocess_function, batched=True)

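Even better, the results are automatically cached by the 🤗 Datasets library, so no time is spent on this step the next time you run the notebook. The 🤗 Datasets library is usually smart enough to detect when the function you pass to `map` has changed (and so the cached data should not be used). For example, it properly detects if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cache and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This takes full advantage of the fast tokenizer loaded earlier, which uses multithreading to process the texts in a batch concurrently.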
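
With `batched=True`, the function receives a dictionary of columns (each a list of values) rather than a single row, and must return a dictionary of lists of the same length. The toy stand-in below illustrates that contract in plain Python; `map_batched` and `count_words` are hypothetical names, and this is not how 🤗 Datasets is implemented.

```python
def map_batched(rows, function, batch_size=1000):
    """Toy batched map: rows is a list of dicts, function maps columns to new columns."""
    output = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of rows into a dict of columns, which is what the function receives.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            merged.update({key: values[i] for key, values in new_columns.items()})
            output.append(merged)
    return output

def count_words(examples):
    # Hypothetical preprocessing function: one output value per input sentence.
    return {"num_words": [len(s.split()) for s in examples["sentence"]]}

rows = [
    {"sentence": "The boat sank."},
    {"sentence": "Mary taught linguistics to John."},
]
result = map_batched(rows, count_words, batch_size=2)
print(result)
```

`preprocess_function` follows the same shape: `examples[sentence1_key]` is a list of sentences, and the tokenizer returns one list of IDs per sentence.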

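### Step 3. Fine-tune the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since all our tasks are about sentence classification, we use the `AutoModelForSequenceClassification` class. As with the tokenizer, the `from_pretrained` method downloads and caches the model for us. The only thing we have to specify is the number of labels for our problem (always 2, except for STS-B, which is a regression problem and uses 1, and MNLI, which has 3 labels):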

In [27]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels) # model_checkpoint="distilbert-base-uncased"

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


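The warning above tells us that some weights (the `pre_classifier` and `classifier` layers) are randomly initialized; the weights of the pretraining head (the `vocab_transform` and `vocab_layer_norm` layers) are discarded. This is expected here: we are removing the head used to pretrain the model on the masked language modeling objective (a fill-in-the-blank task) and replacing it with a new head for which we have no pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we need to define two more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), a class that holds all the attributes to customize the training. It requires one folder name, which is used to save the model checkpoints; all other arguments are optional:

[If you plan to upload to the Hugging Face Hub] *Make sure your Hugging Face token has write permission (https://huggingface.co/settings/tokens)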

In [28]:
metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-{task}", #distilbert-base-uncased-finetuned-cola
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,

    push_to_hub=True,
    hub_model_id="stuser2023/distilbert-base-uncased-finetuned-cola",  # to push to the Hub, set your own model_id (format: owner_id/model_name)
)

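Here we set the evaluation to run at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook, and customize the number of training epochs and the weight decay. Since the best model might not be the one at the end of training, we set `load_best_model_at_end=True` to ask the `Trainer` to load the best model it saved (according to the metric named by `metric_name`) when training finishes. The `push_to_hub=True` argument makes the `Trainer` push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it (or set it to `False`) if you did not follow the setup steps at the top of the notebook. If you want to save the model locally under a name different from the repository it is pushed to, or push it under an organization instead of your own namespace, use the `hub_model_id` argument to set the repository name (it must be the full name including the namespace, for example `"sgugger/bert-finetuned-mrpc"` or `"huggingface/bert-finetuned-mrpc"`).

The last thing to define for our `Trainer` is how to compute the metrics from the predictions. We need a function for this that simply uses the `metric` loaded earlier; the only preprocessing required is to take the argmax of the predicted logits (for STS-B, we only need to drop the last axis, which changes the array shape):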

In [29]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
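
To see what the two branches of `compute_metrics` do to the model output, here is a pure-Python equivalent on made-up logits. The helper names are assumptions for illustration; the notebook itself uses NumPy.

```python
def argmax_rows(logits):
    """Index of the largest logit in each row, like np.argmax(logits, axis=1)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def squeeze_last_axis(predictions):
    """Regression case: take the single value in each row, like predictions[:, 0]."""
    return [row[0] for row in predictions]

# Made-up logits for 3 sentences and 2 classes (0 = unacceptable, 1 = acceptable).
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]]
print(argmax_rows(logits))                # → [0, 1, 0]

# Made-up regression outputs, shape (2, 1), as a model with num_labels=1 would produce.
print(squeeze_last_axis([[3.2], [0.7]]))  # → [3.2, 0.7]
```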

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

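You might wonder why we pass along the `tokenizer` when the data is already preprocessed. We will use it one last time, to make all the samples we gather the same length by applying padding, which requires knowing the model's padding preferences (pad on the left or on the right? with which token?). The `tokenizer` has a `pad` method that does all of this for us, and the `Trainer` uses it. You can customize this part by defining and passing your own `data_collator`, which receives samples like the dictionaries seen above and must return a dictionary of tensors.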
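
As a rough picture of what padding a batch involves, the sketch below right-pads every example to the longest one, using the `[PAD]` ID (0) and `padding_side='right'` reported by the tokenizer earlier. `pad_batch` is a hypothetical helper that returns plain lists; an actual `data_collator` would return tensors.

```python
PAD_ID = 0  # ID of [PAD] in the tokenizer printed earlier

def pad_batch(features, pad_id=PAD_ID):
    """Right-pad every example to the length of the longest one in the batch."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        missing = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * missing)
        # Padded positions get attention_mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * missing)
    return batch

features = [
    {"input_ids": [101, 7592, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 7592, 1010, 2023, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
batch = pad_batch(features)
print(batch)
```

Padding per batch rather than to a fixed global length keeps short batches small, which is the reason padding is deferred until the samples are collated.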

#### Run the training

We can now fine-tune our model by just calling the `train` method:

*While training, keep an eye on GPU memory usage.

In [None]:
try:
    trainer.train()
except KeyboardInterrupt:
    print("KeyboardInterrupt")

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.1317 | 0.915311 | 0.515857 |
| 2 | 0.091 | 1.023814 | 0.493353 |
| 3 | 0.0741 | 1.064548 | 0.522404 |
| 4 | 0.0551 | 1.149787 | 0.532073 |
| 5 | 0.0588 | 1.15757 | 0.538383 |


TrainOutput(global_step=2675, training_loss=0.08119281697496075, metrics={'train_runtime': 197.2905, 'train_samples_per_second': 216.711, 'train_steps_per_second': 13.559, 'total_flos': 229437415353012.0, 'train_loss': 0.08119281697496075, 'epoch': 5.0})
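
The `global_step` value can be checked by hand from the dataset size, the batch size, and the number of epochs. The sketch below assumes a single device and no gradient accumulation, which matches the settings used here; `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Steps on one device without gradient accumulation; the last partial batch counts."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# 8551 training sentences, batch_size=16, num_train_epochs=5
print(total_optimizer_steps(8551, 16, 5))  # → 2675
```

Doubling the batch size roughly halves the number of steps, which is worth keeping in mind when comparing runs that use different batch sizes.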

#### Evaluate the model

We can check with the `evaluate` method that our `Trainer` did reload the best model properly (if it was not the last one):

*You can compare the result against the current [GLUE Benchmark leaderboard](https://gluebenchmark.com/leaderboard).

In [None]:
trainer.evaluate()

{'eval_loss': 1.1575696468353271,
 'eval_matthews_correlation': 0.5383825234212567,
 'eval_runtime': 0.7631,
 'eval_samples_per_second': 1366.771,
 'eval_steps_per_second': 86.488,
 'epoch': 5.0}

Manually upload the model weights to the Hugging Face Hub:

*Important: make sure your Hugging Face token has write permission (https://huggingface.co/settings/tokens)

In [None]:
trainer.push_to_hub()

'https://huggingface.co/stuser2023/distilbert-base-uncased-finetuned-cola/tree/main/'

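Once your model weights are uploaded to the Hub and set to public, others in the community can download and use them with the code below.

Replace the identifier with your own model path, in the form `"your-username/the-name-you-picked"`: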

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("sgugger/my-awesome-model")
```

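### Step 4. Hyperparameter search

The `Trainer` supports hyperparameter search using [optuna](https://optuna.org/) or [Ray Tune](https://docs.ray.io/en/latest/tune/).

For this last part, you need one of these libraries installed; uncomment and run the install command for the one you want to use.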

```python
 ! pip install optuna
 ! pip install ray[tune]
```

In [None]:
 ! pip install optuna
 #! pip install ray[tune]

Collecting optuna
  Downloading optuna-3.4.0-py3-none-any.whl (409 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.12.1-py3-none-any.whl (226 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.3.0-py3-none-any.whl (78 kB)
Installing collected packages: Mako, colorlog, alembic, optuna
Successfully installed Mako-1.3.0 alembic-1.12.1 colorlog-6.7.0 optuna-3.4.0
Collecting ray[tune]
  Downloading ray-2.8.0-cp310-cp310-manylinux2014_x86_64.whl (62.5 MB)

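During hyperparameter search, the `Trainer` runs several trainings, so the model must be defined through a function (so it can be reinitialized at each new run). We reuse the same call as before: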

In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

And we can instantiate our `Trainer` like before:

In [None]:
trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


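The method we call this time is `hyperparameter_search`. Note that on some tasks, a hyperparameter search over the full dataset can take a long time. You can try to find good hyperparameters on a portion of the training set by replacing the `train_dataset` line above with: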
```python
train_dataset = encoded_dataset["train"].shard(index=1, num_shards=10)
```
You can then run a full training with the best hyperparameters the search selected.

In [None]:
best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")

[I 2023-11-17 03:09:30,460] A new study created in memory with name: no-name-870f7a3f-3718-4993-b00c-31d684203ed1
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.5907 | 0.589112 | 0.0 |
| 2 | 0.5487 | 0.555959 | 0.123001 |
| 3 | 0.5205 | 0.552587 | 0.166652 |

[I 2023-11-17 03:13:10,014] Trial 0 finished with value: 0.166651669293941 and parameters: {'learning_rate': 1.119198372279693e-06, 'num_train_epochs': 3, 'seed': 19, 'per_device_train_batch_size': 8}. Best is trial 0 with value: 0.166651669293941.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | No log | 0.480188 | 0.452233 |
| 2 | 0.458300 | 0.501237 | 0.481079 |
| 3 | 0.458300 | 0.500127 | 0.520212 |
| 4 | 0.253900 | 0.561569 | 0.529083 |
| 5 | 0.253900 | 0.61144 | 0.512052 |

[I 2023-11-17 03:16:00,882] Trial 1 finished with value: 0.5120519207329847 and parameters: {'learning_rate': 1.6477448170002332e-05, 'num_train_epochs': 5, 'seed': 34, 'per_device_train_batch_size': 32}. Best is trial 1 with value: 0.5120519207329847.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.5053 | 0.490764 | 0.428549 |
| 2 | 0.3992 | 0.50157 | 0.484094 |
| 3 | 0.3371 | 0.588019 | 0.493293 |
| 4 | 0.2797 | 0.639588 | 0.492493 |

[I 2023-11-17 03:21:01,196] Trial 2 finished with value: 0.49249265259737396 and parameters: {'learning_rate': 6.7318244048561916e-06, 'num_train_epochs': 4, 'seed': 14, 'per_device_train_batch_size': 8}. Best is trial 1 with value: 0.5120519207329847.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | No log | 0.51607 | 0.388457 |
| 2 | 0.507700 | 0.487953 | 0.453716 |
| 3 | 0.507700 | 0.491784 | 0.475202 |
| 4 | 0.368300 | 0.504706 | 0.476243 |
| 5 | 0.368300 | 0.517015 | 0.461867 |

[I 2023-11-17 03:23:59,361] Trial 3 finished with value: 0.4618671668713098 and parameters: {'learning_rate': 7.597085082438434e-06, 'num_train_epochs': 5, 'seed': 25, 'per_device_train_batch_size': 32}. Best is trial 1 with value: 0.5120519207329847.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.5161 | 0.473719 | 0.46461 |

[I 2023-11-17 03:27:01,693] Trial 4 finished with value: 0.4646104718707002 and parameters: {'learning_rate': 2.2318866945035745e-05, 'num_train_epochs': 1, 'seed': 3, 'per_device_train_batch_size': 16}. Best is trial 1 with value: 0.5120519207329847.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.5457 | 0.557206 | 0.348979 |
| 2 | 0.5005 | 0.643629 | 0.397332 |

[I 2023-11-17 03:31:08,284] Trial 5 pruned.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.5388 | 0.615199 | 0.427992 |
| 2 | 0.5423 | 0.683111 | 0.480953 |

[I 2023-11-17 03:35:21,827] Trial 6 pruned.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.5493 | 0.559677 | 0.139605 |

[I 2023-11-17 03:37:19,249] Trial 7 pruned.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | No log | 0.57061 | 0.0 |
| 2 | No log | 0.525272 | 0.354093 |

[I 2023-11-17 03:38:04,143] Trial 8 pruned.

| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | 0.5332 | 0.516364 | 0.404341 |
| 2 | 0.3795 | 0.587486 | 0.409244 |
| 3 | 0.2669 | 0.890459 | 0.447995 |
| 4 | 0.1767 | 1.033401 | 0.487452 |

[I 2023-11-17 03:42:38,159] Trial 9 pruned.


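The `hyperparameter_search` method returns a `BestRun` object, which contains the value of the objective that was maximized (by default, the sum of all metrics) and the hyperparameters used for that run.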

In [None]:
best_run

BestRun(run_id='1', objective=0.5120519207329847, hyperparameters={'learning_rate': 1.6477448170002332e-05, 'num_train_epochs': 5, 'seed': 34, 'per_device_train_batch_size': 32}, run_summary=None)

You can customize the objective to maximize by passing a `compute_objective` function to `hyperparameter_search`, and customize the search space by passing an `hp_space` argument. See this [forum post](https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10) for examples.
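
As a hedged sketch of what those two functions could look like with the optuna backend: the names `my_hp_space` and `my_objective` and the ranges below are assumptions chosen for illustration, not values recommended by this notebook. The `trial` argument is the optuna trial object that the `Trainer` passes in, and `metrics` is the dictionary returned by evaluation.

```python
def my_hp_space(trial):
    """Search space for the optuna backend (illustrative ranges)."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
    }

def my_objective(metrics):
    """Optimize the Matthews correlation only, instead of the default objective."""
    return metrics["eval_matthews_correlation"]
```

These would then be passed as `trainer.hyperparameter_search(hp_space=my_hp_space, compute_objective=my_objective, n_trials=10, direction="maximize")`.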

To reproduce the best run, set its hyperparameters on your `TrainingArguments` and train again:

In [None]:
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 1 | No log | 0.480188 | 0.452233 |
| 2 | 0.458300 | 0.501237 | 0.481079 |
| 3 | 0.458300 | 0.500127 | 0.520212 |
| 4 | 0.253900 | 0.561569 | 0.529083 |
| 5 | 0.253900 | 0.61144 | 0.512052 |


TrainOutput(global_step=1340, training_loss=0.31167233595207555, metrics={'train_runtime': 146.3915, 'train_samples_per_second': 292.059, 'train_steps_per_second': 9.154, 'total_flos': 266352871163628.0, 'train_loss': 0.31167233595207555, 'epoch': 5.0})

In [None]:
# Upload the model trained with the best hyperparameters to the Hub
trainer.push_to_hub()



---



## Class exercise
Following this example, upload your trained model to the Models space of your own Hugging Face account and set it to public.



---



## References

- Official Hugging Face example (text classification): https://huggingface.co/tasks/text-classification

- DistilBERT: https://arxiv.org/pdf/1910.01108.pdf