# Transformers Notes
These are my notes from reading `https://github.com/zyds/transformers-code`.

## Pipeline

In [1]:

from transformers.pipelines import SUPPORTED_TASKS
SUPPORTED_TASKS.keys()


dict_keys(['audio-classification', 'automatic-speech-recognition', 'text-to-audio', 'feature-extraction', 'text-classification', 'token-classification', 'question-answering', 'table-question-answering', 'visual-question-answering', 'document-question-answering', 'fill-mask', 'summarization', 'translation', 'text2text-generation', 'text-generation', 'zero-shot-classification', 'zero-shot-image-classification', 'zero-shot-audio-classification', 'image-classification', 'image-feature-extraction', 'image-segmentation', 'image-to-text', 'object-detection', 'zero-shot-object-detection', 'depth-estimation', 'video-classification', 'mask-generation', 'image-to-image'])

| Pipeline type                            | Returned pipeline class                |
|------------------------------------------|----------------------------------------|
| audio-classification                     | `AudioClassificationPipeline`          |
| automatic-speech-recognition             | `AutomaticSpeechRecognitionPipeline`   |
| depth-estimation                         | `DepthEstimationPipeline`              |
| document-question-answering              | `DocumentQuestionAnsweringPipeline`    |
| feature-extraction                       | `FeatureExtractionPipeline`            |
| fill-mask                                | `FillMaskPipeline`                     |
| image-classification                     | `ImageClassificationPipeline`          |
| image-feature-extraction                 | `ImageFeatureExtractionPipeline`       |
| image-segmentation                       | `ImageSegmentationPipeline`            |
| image-to-image                           | `ImageToImagePipeline`                 |
| image-to-text                            | `ImageToTextPipeline`                  |
| mask-generation                          | `MaskGenerationPipeline`               |
| object-detection                         | `ObjectDetectionPipeline`              |
| question-answering                       | `QuestionAnsweringPipeline`            |
| summarization                            | `SummarizationPipeline`                |
| table-question-answering                 | `TableQuestionAnsweringPipeline`       |
| text2text-generation                     | `Text2TextGenerationPipeline`          |
| text-classification (sentiment-analysis) | `TextClassificationPipeline`           |
| text-generation                          | `TextGenerationPipeline`               |
| text-to-audio (text-to-speech)           | `TextToAudioPipeline`                  |
| token-classification (ner)               | `TokenClassificationPipeline`          |
| translation                              | `TranslationPipeline`                  |
| translation_xx_to_yy                     | `TranslationPipeline`                  |
| video-classification                     | `VideoClassificationPipeline`          |
| visual-question-answering                | `VisualQuestionAnsweringPipeline`      |
| zero-shot-classification                 | `ZeroShotClassificationPipeline`       |
| zero-shot-image-classification           | `ZeroShotImageClassificationPipeline`  |
| zero-shot-audio-classification           | `ZeroShotAudioClassificationPipeline`  |
| zero-shot-object-detection               | `ZeroShotObjectDetectionPipeline`      |

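Each task yields a different pipeline, and every pipeline has its own predefined inputs and outputs.
See the `__call__()` of the corresponding pipeline class for details.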
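The idea of "task name in, task-specific callable out" can be sketched with a toy registry. This is not the transformers implementation: `ToyTextClassificationPipeline`, `ToyFillMaskPipeline`, `TOY_TASKS`, `TOY_ALIASES` and `toy_pipeline` are hypothetical names, and the "models" are hard-coded rules used only to show that each task has its own input and output shape.

```python
# Toy task registry: illustrates dispatch by task name only, not the real library.
class ToyTextClassificationPipeline:
    def __call__(self, text):
        # Hypothetical rule standing in for a real model.
        label = "positive" if "good" in text.lower() else "negative"
        return [{"label": label, "score": 1.0}]


class ToyFillMaskPipeline:
    def __call__(self, text):
        # Hypothetical rule: always fill the mask with the same word.
        return [{"sequence": text.replace("[MASK]", "world"), "token_str": "world"}]


TOY_TASKS = {
    "text-classification": ToyTextClassificationPipeline,
    "fill-mask": ToyFillMaskPipeline,
}
# Aliases mirror the table above, e.g. sentiment-analysis -> text-classification.
TOY_ALIASES = {"sentiment-analysis": "text-classification"}


def toy_pipeline(task):
    task = TOY_ALIASES.get(task, task)
    if task not in TOY_TASKS:
        raise KeyError(f"Unknown task {task!r}; available: {sorted(TOY_TASKS)}")
    return TOY_TASKS[task]()


classifier = toy_pipeline("sentiment-analysis")
filler = toy_pipeline("fill-mask")
print(classifier("Good food"))
print(filler("hello [MASK]"))
```

Each returned object is called the same way, but the structure of the result depends on the task, which is why the per-class `__call__()` is the place to look.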

In [5]:
from transformers import pipeline
pipe = pipeline("text-classification",
                model="uer/roberta-base-finetuned-dianping-chinese",  # a default model is used if omitted
                device=0,  # first GPU; alternatively pass device_map="auto" instead of device
                )
pipe("This restaurant is awesome")

[{'label': 'negative (stars 1, 2 and 3)', 'score': 0.5998563766479492}]

## Tokenizer

In [1]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("uer/roberta-base-finetuned-dianping-chinese")
tokenizer

BertTokenizerFast(name_or_path='uer/roberta-base-finetuned-dianping-chinese', vocab_size=21128, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [23]:
tokenizer.vocab

{'訕': 6247,
 '戦': 2778,
 '惦': 2671,
 '抨': 2846,
 '##ti': 9007,
 '##质': 19631,
 '##钜': 20218,
 '##驹': 20784,
 '##僅': 14063,
 'audio': 11698,
 '繞': 5254,
 'rc': 11746,
 '薨': 5957,
 '1v': 12819,
 'delete': 12845,
 '透': 6851,
 'hdr': 10465,
 '##º': 13358,
 '倉': 942,
 '##赋': 19659,
 '##lie': 10158,
 '##つ': 9775,
 '801': 12566,
 '簌': 5078,
 '##苣': 18791,
 '##lton': 10377,
 '##cg': 11478,
 '锲': 7244,
 'pass': 9703,
 'save': 13069,
 '京': 776,
 '花': 5709,
 '##孱': 15173,
 '踮': 6680,
 '##掲': 16034,
 '釀': 7021,
 '##ᅪ': 13476,
 '##580': 10703,
 '##弾': 15546,
 '飄': 7597,
 '鳝': 7850,
 'ai': 8578,
 '##渤': 17000,
 '##们': 13869,
 '##嘖': 14711,
 '收': 3119,
 '##и': 11000,
 '##牟': 17340,
 '##社': 17909,
 '！': 8013,
 '颅': 7565,
 'wikia': 8708,
 'チ': 609,
 '舀': 5641,
 '##晞': 16300,
 '##飛': 20663,
 '到': 1168,
 '##权': 16383,
 '控': 2971,
 '鞑': 7493,
 '##くれる': 11241,
 '##し': 8733,
 '##阈': 20385,
 'vr': 8260,
 'menu': 8888,
 '髯': 7774,
 '辩': 6796,
 '慄': 2704,
 '##猝': 17397,
 '##覬': 19275,
 '##應': 15803,
 '膩': 5611

In [18]:
tokens = tokenizer.tokenize("我有一個夢，I have an incredible dream!")
print(f"tokenize:{tokens}")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"convert_tokens_to_ids:{ids}")

ids2 = tokenizer.encode("我有一個夢，I have an incredible dream!", add_special_tokens=False)
print(f"encode (tokenize+convert_tokens_to_ids):{ids2}")
tokens2 = tokenizer.convert_ids_to_tokens(ids)
print(f"convert_ids_to_tokens:{tokens2}")

tokenize:['我', '有', '一', '個', '夢', '，', 'i', 'have', 'an', 'inc', '##red', '##ible', 'dream', '!']
convert_tokens_to_ids:[2769, 3300, 671, 943, 1918, 8024, 151, 9531, 9064, 8910, 9749, 12413, 10252, 106]
encode (tokenize+convert_tokens_to_ids):[2769, 3300, 671, 943, 1918, 8024, 151, 9531, 9064, 8910, 9749, 12413, 10252, 106]
convert_ids_to_tokens:['我', '有', '一', '個', '夢', '，', 'i', 'have', 'an', 'inc', '##red', '##ible', 'dream', '!']
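The split of `incredible` into `inc`, `##red`, `##ible` comes from WordPiece's greedy longest-match-first rule. A minimal sketch, assuming a toy vocabulary that contains only the entries visible in the output above (`wordpiece` and `toy_vocab` are hypothetical names, and the real tokenizer also handles normalization and punctuation):

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single, already lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate from the right
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Ids copied from the notebook output above; everything else is missing on purpose.
toy_vocab = {"inc": 8910, "##red": 9749, "##ible": 12413, "dream": 10252, "[UNK]": 100}

tokens = wordpiece("incredible", toy_vocab)
ids = [toy_vocab[t] for t in tokens]  # the convert_tokens_to_ids step
print(tokens, ids)
```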


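### tokenizer.__call__()
Calling the tokenizer returns:
- `input_ids`: token ids, including the special tokens
- `attention_mask`: 1 for real tokens, 0 for padding
- `token_type_ids`: segment id of each token (0 for the first sequence, 1 for the second)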

In [22]:
inputs = tokenizer("我有一個夢，I have an incredible dream!", add_special_tokens=True, return_tensors="pt")
inputs

{'input_ids': tensor([[  101,  2769,  3300,   671,   943,  1918,  8024,   151,  9531,  9064,
          8910,  9749, 12413, 10252,   106,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

## Model

### Models without a model head

https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModel.from_pretrained    
This page lists exactly which models are supported.

In [2]:
from transformers import AutoModel
inputs = tokenizer("我有一個夢，I have an incredible dream!", return_tensors="pt")  # return_tensors="pt" is required so the model receives PyTorch tensors
model = AutoModel.from_pretrained("hfl/rbt3")
output = model(**inputs)
output  # BaseModelOutputWithPoolingAndCrossAttentions: holds one hidden-state vector per token

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.5823,  0.0160,  0.0598,  ...,  0.4973, -0.0854, -0.2723],
         [ 0.7501, -0.2440,  0.2002,  ..., -0.0857, -0.2035, -0.2993],
         [ 0.5779,  0.0498, -0.3363,  ..., -0.4652,  0.0127,  0.2393],
         ...,
         [ 0.0386,  0.5116,  0.2850,  ...,  0.0659, -0.0661, -0.2211],
         [ 0.5911, -0.0726,  0.1620,  ..., -0.6148, -0.2701, -0.2920],
         [ 0.5772,  0.0154,  0.0563,  ...,  0.4985, -0.0868, -0.2694]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-3.2027e-01, -9.9989e-01, -9.9936e-01, -6.1967e-01,  9.9206e-01,
          2.6552e-01, -2.0064e-01,  5.6108e-03,  9.5703e-01,  9.9709e-01,
         -4.2273e-02, -9.9999e-01, -2.1317e-01,  9.9944e-01, -9.9998e-01,
          9.9814e-01,  9.9397e-01,  6.5082e-01, -9.8280e-01, -1.2920e-01,
         -9.1872e-01, -6.6236e-01,  2.2591e-01,  9.4480e-01,  9.8757e-01,
         -9.8333e-01, -9.9996e-01,  8.3699e-02, -9.3286e-01, -9.990

### Models with a model head

In [3]:
from transformers import AutoModelForSequenceClassification
clz_model = AutoModelForSequenceClassification.from_pretrained("./01-Getting Started/04-model/rbt3", num_labels=10) 
# BertForSequenceClassification: the base model plus dropout and a linear layer for classification
clz_model(**inputs) # SequenceClassifierOutput

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./01-Getting Started/04-model/rbt3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


SequenceClassifierOutput(loss=None, logits=tensor([[-0.0606,  0.2734, -0.0070,  0.1932, -0.1230, -0.1013, -0.3791, -0.2660,
          0.6980,  0.0691]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

## Dataset

### Torch + Pandas  
1. Read the file with pandas
2. Wrap and split it with a PyTorch Dataset
3. Feed it through a PyTorch DataLoader (shuffle + batch + collate)

In [4]:
from torch.utils.data import Dataset
import pandas as pd
class MyDataset(Dataset):

    def __init__(self) -> None:
        super().__init__()
        self.data = pd.read_csv("./01-Getting Started/05-datasets/ChnSentiCorp_htl_all.csv")
        self.data = self.data.dropna()

    def __getitem__(self, index):
        return self.data.iloc[index]["review"], self.data.iloc[index]["label"]
    
    def __len__(self):
        return len(self.data)

    def createSplit(self, lengths=(0.9, 0.1)):
        # Randomly split this dataset (self, not a global variable) by the given fractions
        from torch.utils.data import random_split
        trainset, validset = random_split(self, lengths)
        return trainset, validset

dataset = MyDataset()
trainset, validset = dataset.createSplit()

In [30]:
# loader
import torch
from torch.utils.data import DataLoader
tokenizer = AutoTokenizer.from_pretrained("hfl/rbt3")

def collate_func(batch):
    # [
    #  ('总体感觉不错。不足之处：客房分散在各个小楼里，颇有不便。前台速度慢。', 1), 
    #  .........
    #  ('卫生设施太差，毛巾又黑又硬，下水道堵塞，床垫不好，睡上去不是很舒服。觉得服务态度还可以。对面就是长途汽车站，出租车蛮多的，出行还是很方便。', 0)
    # ]
    texts, labels = [], []
    for item in batch:
        texts.append(item[0]) # '很不错的酒店，强烈推荐。为什么要20个字'
        labels.append(item[1]) # 1
    inputs = tokenizer(texts, max_length=128, padding="max_length", truncation=True, return_tensors="pt")  # key point: tokenizing the whole batch in one call is faster
    # 'input_ids', 'token_type_ids', 'attention_mask'
    inputs["labels"] = torch.tensor(labels) 
    return inputs


trainloader = DataLoader(trainset, batch_size=32, shuffle=True, collate_fn=collate_func)
validloader = DataLoader(validset, batch_size=64, shuffle=False, collate_fn=collate_func)
next_input = next(iter(validloader))
next_input.keys()
# 'input_ids', 'token_type_ids', 'attention_mask',  +'labels'


dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

### Dataset (Hugging Face `datasets` style)
https://huggingface.co/datasets

In [31]:
from datasets import load_dataset, load_from_disk
dataset = load_dataset("csv", data_files="./01-Getting Started/05-datasets/ChnSentiCorp_htl_all.csv"
                       , split='train') 
# or datasets = load_dataset("name")
dataset = dataset.filter(lambda x: x["review"] is not None)
dataset

Dataset({
    features: ['label', 'review'],
    num_rows: 7765
})

In [32]:
datasets = dataset.train_test_split(test_size=0.1)
datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'review'],
        num_rows: 6988
    })
    test: Dataset({
        features: ['label', 'review'],
        num_rows: 777
    })
})

In [40]:
def process_function(examples):
    # Difference from before: no padding and no return_tensors="pt" here (DataCollatorWithPadding handles both)
    tokenized_examples = tokenizer(examples["review"], max_length=128, truncation=True)
    tokenized_examples["labels"] = examples["label"]
    return tokenized_examples

# batched=True passes a batch of examples to process_function in each call
tokenized_datasets = datasets.map(process_function, batched=True, remove_columns=datasets["train"].column_names)
tokenized_datasets

Map:   0%|          | 0/6988 [00:00<?, ? examples/s]

Map:   0%|          | 0/777 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 6988
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 777
    })
})

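A Hugging Face `Dataset` is not literally a PyTorch `Dataset` subclass (it is backed by Apache Arrow), but it supports indexing and `len()`, so it can be handed to a PyTorch `DataLoader` in the same way. `train` is the default split name.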

In [67]:
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

trainset, validset = tokenized_datasets["train"], tokenized_datasets["test"]


trainloader = DataLoader(trainset, batch_size=32, shuffle=True, 
                         collate_fn=DataCollatorWithPadding(tokenizer, 
                                            padding=True, # Pad to the longest sequence in the batch
                                            return_tensors='pt'))
validloader = DataLoader(validset, batch_size=64, shuffle=False, 
                         collate_fn=DataCollatorWithPadding(tokenizer, padding=True, return_tensors='pt'))
next_input = next(iter(validloader))
print(f"batch size {len(next(iter(next_input.values())))}")

batch size 64
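The difference between `padding="max_length"` (used in the earlier `collate_func`) and padding to the longest sequence in the batch (what `DataCollatorWithPadding(padding=True)` does) can be sketched in plain Python. `pad_batch` is a hypothetical helper, not a library function; the pad id 0 and the token ids are taken from the tokenizer outputs above.

```python
def pad_batch(features, pad_token_id=0, max_length=None):
    """Pad every sequence to max_length, or to the longest sequence when max_length is None."""
    target = max_length or max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"][:target]  # truncate if longer than the target
        n_pad = target - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return batch


features = [
    {"input_ids": [101, 2769, 102]},
    {"input_ids": [101, 2769, 3300, 671, 102]},
]
dynamic = pad_batch(features)              # padded to the longest item: length 5
fixed = pad_batch(features, max_length=8)  # always padded to length 8
print(dynamic["input_ids"][0], dynamic["attention_mask"][0])
print(len(fixed["input_ids"][0]))
```

Dynamic padding keeps batches as short as their longest member, which usually means less wasted computation than a fixed 128-token width.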


## Manual training (see /01-Getting Started/05-datasets/)

`model.train()`   
`model.eval()`   
 Switch layers such as Batch Normalization and Dropout between their training and evaluation behavior.
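What the train/eval switch changes can be illustrated with a toy dropout layer. `ToyDropout` is a hypothetical class written for this note; it follows the usual "inverted dropout" convention (kept values are scaled by `1 / (1 - p)` during training), and it is only a sketch of the behavior, not PyTorch code.

```python
import random


class ToyDropout:
    """Minimal stand-in for a dropout layer with a train/eval switch."""

    def __init__(self, p=0.5, seed=0):
        self.p = p
        self.training = True
        self.rng = random.Random(seed)

    def train(self):
        self.training = True
        return self

    def eval(self):
        self.training = False
        return self

    def __call__(self, xs):
        if not self.training:
            return list(xs)  # evaluation: identity, fully deterministic
        keep = 1.0 - self.p
        # training: zero each value with probability p, rescale the survivors
        return [x / keep if self.rng.random() < keep else 0.0 for x in xs]


layer = ToyDropout(p=0.5)
xs = [1.0, 2.0, 3.0, 4.0]
train_out = layer.train()(xs)
eval_out = layer.eval()(xs)
print(train_out, eval_out)
```

This is why evaluation without `model.eval()` gives noisy metrics: the same input produces different outputs on every forward pass.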

 
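`torch.inference_mode()`:    
https://stackoverflow.com/questions/55627780/evaluating-pytorch-models-with-torch-no-grad-vs-model-eval    
Disables gradient tracking inside the `with` block, which reduces memory use and computation.    
Inside that block the effect is roughly similar to the following, although `inference_mode` only applies within the block and does not permanently change the parameters:    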
```python
for param in model.parameters():
    param.requires_grad = False
```


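`optimizer.zero_grad()`:  
Gradients accumulate across `backward()` calls, so they have to be reset to zero by hand; the same behavior makes it possible to accumulate gradients over several batches before a single update.  
https://meetonfriday.com/posts/18392404/  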
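The accumulate-then-reset behavior can be shown without PyTorch. The sketch below is a toy: `ToyParam`, `backward`, `sgd_step` and `zero_grad` are hypothetical names that mimic the roles of a parameter's `.grad`, `loss.backward()`, `optimizer.step()` and `optimizer.zero_grad()` for the loss `(w * x - y) ** 2`.

```python
class ToyParam:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0


def backward(param, x, y):
    """Add d/dw (w*x - y)^2 to param.grad; note the +=, gradients accumulate."""
    param.grad += 2.0 * (param.value * x - y) * x


def sgd_step(param, lr):
    param.value -= lr * param.grad


def zero_grad(param):
    param.grad = 0.0


w = ToyParam(0.0)
micro_batches = [(1.0, 2.0), (2.0, 4.0)]

# Accumulate gradients over two micro-batches, then update once.
for x, y in micro_batches:
    backward(w, x, y)
accumulated = w.grad
sgd_step(w, lr=0.1)
zero_grad(w)  # without this, the next backward() would add to stale gradients
print(accumulated, w.value, w.grad)
```

In real training code the loss is often divided by the number of accumulation steps so that the accumulated gradient matches the average over the larger effective batch; that detail is left out here.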


For the evaluation functions, see:  
/01-Getting Started/06-evaluate/evaluate.ipynb  
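As a reference for what the accuracy and F1 numbers mean, here is a minimal hand-written version for the binary case. `accuracy` and `binary_f1` are hypothetical helpers and the predictions are made-up toy values, not results from the notebook.

```python
def accuracy(preds, refs):
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def binary_f1(preds, refs, positive=1):
    tp = sum(p == positive and r == positive for p, r in zip(preds, refs))
    fp = sum(p == positive and r != positive for p, r in zip(preds, refs))
    fn = sum(p != positive and r == positive for p, r in zip(preds, refs))
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is predicted correctly
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


preds = [1, 0, 1, 1, 0, 1]  # toy predictions
refs = [1, 0, 0, 1, 1, 1]   # toy labels
print(accuracy(preds, refs), binary_f1(preds, refs))
```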
        

In [87]:
from torch.optim import Adam
model = AutoModelForSequenceClassification.from_pretrained("hfl/rbt3")

# With a pipeline, passing device=0 would take care of this
if torch.cuda.is_available():
    model = model.cuda()
    
optimizer = Adam(model.parameters(), lr=2e-5)

# Compute accuracy by hand
def evaluate1():
    # https://stackoverflow.com/questions/55627780/evaluating-pytorch-models-with-torch-no-grad-vs-model-eval
    model.eval()  # switch Dropout (and Batch Normalization, if present) to evaluation behavior
    acc_num = 0
    with torch.inference_mode():
        for batch in validloader:
            if torch.cuda.is_available():
                batch = {k: v.cuda() for k, v in batch.items()}
            output = model(**batch)
            pred = torch.argmax(output.logits, dim=-1)
            acc_num += (pred.long() == batch["labels"].long()).float().sum()
    return acc_num / len(validset)

# Compute accuracy and F1 with the evaluate library
import evaluate
clf_metrics = evaluate.combine(["accuracy", "f1"])
def evaluate2():
    model.eval()  # switch Dropout (and Batch Normalization, if present) to evaluation behavior
    with torch.inference_mode():
        for batch in validloader:
            if torch.cuda.is_available():
                batch = {k: v.cuda() for k, v in batch.items()}
            output = model(**batch)
            # SequenceClassifierOutput(loss=tensor(0.1632, device='cuda:0'), 
            #                            logits=tensor([[-0.6985,  1.4187],[xxx,xxx],....]]),
            #                            device='cuda:0'), hidden_states=None, attentions=None
            pred = torch.argmax(output.logits, dim=-1)
            clf_metrics.add_batch(predictions=pred.long(), references=batch["labels"].long())
    return clf_metrics.compute()
    
def train(epoch=3, log_step=100):
    global_step = 0
    for ep in range(epoch):
        # model.train() puts Dropout (and Batch Normalization, if the model has any) into training behavior
        model.train()
        for batch in trainloader:
            # This moves the training batch to the GPU; it is separate from the earlier model.cuda()
            if torch.cuda.is_available():
                batch = {k: v.cuda() for k, v in batch.items()}

            optimizer.zero_grad()  # reset gradients every step, unless you intend to accumulate them
            output = model(**batch)
            # SequenceClassifierOutput: loss is set because the batch contains "labels"; logits has shape (batch_size, num_labels)
            output.loss.backward()
            optimizer.step()
            if global_step % log_step == 0:
                print(f"ep: {ep}, global_step: {global_step}, loss: {output.loss.item()}")
            global_step += 1
        clf = evaluate2()
        myacc = evaluate1()
        print(f"ep: {ep}, {clf}|{myacc}")
train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at hfl/rbt3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ep: 0, global_step: 0, loss: 0.6433025002479553
ep: 0, global_step: 100, loss: 0.2641475796699524
ep: 0, global_step: 200, loss: 0.3857491910457611
ep: 0, {'accuracy': 0.87001287001287, 'f1': 0.9130060292850991}|0.8700128793716431
ep: 1, global_step: 300, loss: 0.19777831435203552
ep: 1, global_step: 400, loss: 0.17106226086616516
ep: 1, {'accuracy': 0.8944658944658944, 'f1': 0.9250457038391224}|0.8944659233093262
ep: 2, global_step: 500, loss: 0.192722886800766
ep: 2, global_step: 600, loss: 0.4233287572860718
ep: 2, {'accuracy': 0.8931788931788932, 'f1': 0.924613987284287}|0.8931788802146912


In [88]:
from transformers import pipeline
model.config.id2label = {0: "差评！", 1: "好评！"}
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)
pipe("店員很漂亮")

[{'label': '好评！', 'score': 0.9802994728088379}]
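How logits become a `label` and a `score` can be sketched in plain Python. This assumes a single-label setup where a softmax is applied over the logits and the highest probability wins; `softmax` and `classify` are hypothetical helpers, and the logits are the example values from the comment in the training cell above, not the logits behind the output shown here.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, id2label):
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return {"label": id2label[best], "score": probs[best]}


id2label = {0: "差评！", 1: "好评！"}
result = classify([-0.6985, 1.4187], id2label)
print(result)
```

Setting `model.config.id2label` only changes the names attached to the class indices; the prediction itself is still the argmax over the logits.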

## Automatic training

### 1. Data

In [89]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

dataset = load_dataset("csv", data_files="./01-Getting Started/05-datasets/ChnSentiCorp_htl_all.csv", split="train")
dataset = dataset.filter(lambda x: x["review"] is not None)
datasets = dataset.train_test_split(test_size=0.1)

import torch

tokenizer = AutoTokenizer.from_pretrained("hfl/rbt3")

def process_function(examples):
    # Again no padding and no return_tensors="pt" here (DataCollatorWithPadding handles both)
    tokenized_examples = tokenizer(examples["review"], max_length=128, truncation=True)
    tokenized_examples["labels"] = examples["label"]
    return tokenized_examples

tokenized_datasets = datasets.map(process_function, batched=True, remove_columns=datasets["train"].column_names)



Map:   0%|          | 0/6988 [00:00<?, ? examples/s]

Map:   0%|          | 0/777 [00:00<?, ? examples/s]

### 2. Trainer


#### Model

In [90]:
model = AutoModelForSequenceClassification.from_pretrained("hfl/rbt3")
print(model.config)
print(model)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at hfl/rbt3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertConfig {
  "_name_or_path": "hfl/rbt3",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 3,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "transformers_version": "4.42.4",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 21128
}

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedd

#### Evaluator

In [91]:
import evaluate

acc_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def eval_metric(eval_predict):  # called by Trainer with an EvalPrediction (predictions, label_ids)
    predictions, labels = eval_predict
    predictions = predictions.argmax(axis=-1)
    acc = acc_metric.compute(predictions=predictions, references=labels) # {"accuracy": 0.99},
    f1 = f1_metric.compute(predictions=predictions, references=labels) # {"f1": 0.95},
    acc.update(f1) # {"accuracy": 0.99, "f1": 0.95}
    return acc

#### TrainingArguments

In [92]:
from transformers import Trainer, TrainingArguments
train_args = TrainingArguments(output_dir="./checkpoints",      # output directory
                               per_device_train_batch_size=64,  # training batch size per device
                               per_device_eval_batch_size=128,  # evaluation batch size per device
                               logging_steps=10,                # log every 10 steps
                               evaluation_strategy="epoch",     # evaluate once per epoch
                               save_strategy="epoch",           # save a checkpoint once per epoch
                               save_total_limit=3,              # keep at most 3 checkpoints
                               learning_rate=2e-5,              # learning rate
                               weight_decay=0.01,               # weight decay
                               metric_for_best_model="f1",      # metric used to pick the best checkpoint
                               load_best_model_at_end=True)     # reload the best checkpoint when training ends
train_args



TrainingArguments(
_n_gpu=2,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=epoch,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
f

#### Trainer

In [93]:
from transformers import Trainer
from transformers import DataCollatorWithPadding
trainer = Trainer(model=model, 
                  args=train_args, 
                  train_dataset=tokenized_datasets["train"], # takes the place of a DataLoader
                  eval_dataset=tokenized_datasets["test"],  # takes the place of a DataLoader
                  data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
                  compute_metrics=eval_metric)

In [94]:
# for transformers >= 4.42.4
# ValueError: You are trying to save a non contiguous tensor: `bert.encoder.layer.0.attention.self.query.weight` which is not allowed. 
# It either means you are trying to save tensors which are reference of each other in which case it's recommended to save only the full tensors,
# and reslice at load time, or simply call `.contiguous()` on your tensor to pack it before saving.

# for name, param in model.bert.named_parameters():
#     # work around the non-contiguous tensor error
#     if not param.is_contiguous():
#         param.data = param.data.contiguous()

In [95]:
trainer.train()

    There is an imbalance between your GPUs. You may want to exclude GPU 1 which
    has less than 75% of the memory or cores of GPU 0. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable.
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


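| Epoch | Training Loss | Validation Loss | Accuracy | F1       |
|-------|---------------|-----------------|----------|----------|
| 1     | 0.3527        | 0.293507        | 0.873874 | 0.907547 |
| 2     | 0.2607        | 0.256122        | 0.888031 | 0.91922  |
| 3     | 0.2522        | 0.254943        | 0.886744 | 0.917603 |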




TrainOutput(global_step=165, training_loss=0.3320430827863289, metrics={'train_runtime': 66.3084, 'train_samples_per_second': 316.159, 'train_steps_per_second': 2.488, 'total_flos': 351909933963264.0, 'train_loss': 0.3320430827863289, 'epoch': 3.0})
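The `global_step=165` figure can be checked with a little arithmetic. The sketch assumes that both GPUs are used (the output above shows `_n_gpu=2`), that the effective batch is the per-device batch size times the number of GPUs, that there is no gradient accumulation, and that the last partial batch is kept (`dataloader_drop_last=False`). `total_steps` is a hypothetical helper.

```python
import math


def total_steps(num_samples, per_device_batch_size, n_devices, epochs):
    effective_batch = per_device_batch_size * n_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)  # last partial batch is kept
    return steps_per_epoch * epochs


# Trainer run: 6988 training rows, batch 64 per device, 2 GPUs, 3 epochs
trainer_steps = total_steps(6988, 64, 2, 3)
# Manual loop earlier in the notes: batch 32 on a single device, 3 epochs
manual_steps = total_steps(6988, 32, 1, 3)
print(trainer_steps, manual_steps)
```

Under these assumptions the Trainer run comes out at 55 steps per epoch, which matches the reported 165 steps, and the manual loop at 219 steps per epoch, which is consistent with its log lines reaching step 600 in the third epoch.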

In [None]:
NER (named entity recognition)
BERT + dropout + linear (one label per token)
Classifies each token.
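Once every token has a predicted label, the labels are usually grouped into entities. The sketch below assumes a BIO tagging scheme (`B-` begins an entity, `I-` continues it, `O` is outside), which these notes do not specify; `decode_entities`, the label set and the label ids are all hypothetical.

```python
def decode_entities(tokens, label_ids, id2label):
    """Group per-token BIO labels into entity spans."""
    entities, current = [], None
    for token, label_id in zip(tokens, label_ids):
        label = id2label[label_id]
        if label.startswith("B-"):
            if current:
                entities.append(current)
            current = {"type": label[2:], "text": token}
        elif label.startswith("I-") and current and current["type"] == label[2:]:
            current["text"] += token  # continue the open entity
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities


id2label = {0: "O", 1: "B-PER", 2: "I-PER"}  # hypothetical label set
tokens = ["范", "廷", "颂", "是", "主", "教"]
label_ids = [1, 2, 2, 0, 0, 0]                # hypothetical per-token argmax results
entities = decode_entities(tokens, label_ids, id2label)
print(entities)
```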

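MRC (machine reading comprehension)
1. QA (given a passage and a question, output the answer)
2. The answer may be cloze (fill-in-the-blank), multiple choice, an extracted span, or free-form generation
Training data to collect: context + question + answer {answer_start: where the answer begins in the context, text: the extracted answer}
Input format: `[CLS]Question[SEP]Context[SEP]`

BERT + linear (hidden_size -> 2 labels: start & end)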
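With one start logit and one end logit per token, the predicted answer is a span whose start and end scores are jointly high. A common post-processing approach, sketched here as an assumption rather than something taken from these notes, is to pick the pair with the largest summed score subject to `start <= end`. `best_span` and the logits are hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logit + end_logit with start <= end."""
    best = (None, None, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        # only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


start_logits = [0.1, 2.0, 0.3, 0.2]  # toy values, one per token
end_logits = [0.0, 0.5, 3.0, 0.1]
start, end, score = best_span(start_logits, end_logits)
print(start, end, score)
```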

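## Extractive QA training:    

1. Combine the question and the context into one pair (BertTokenizerFast(text, text_pair))    
     Note: unlike the regular (slow) tokenizer, BertTokenizerFast can provide `offset_mapping`
2. Locate the answer inside that pair and add its token indices to the example (i.e. 'start_positions', 'end_positions')     
3. Train with a QA model    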
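The step that maps a character-level answer onto token positions can be sketched in plain Python. The sketch assumes offsets are `(start, end)` character spans relative to the sequence a token belongs to, that special tokens have the span `(0, 0)`, and that an answer falling outside the (possibly truncated) context is labeled `(0, 0)`, i.e. the `[CLS]` position. `find_answer_tokens` and the toy example are hypothetical; real code often uses the fast tokenizer's `sequence_ids()` instead of `token_type_ids` to find the context tokens.

```python
def find_answer_tokens(offsets, token_type_ids, answer_start, answer_text):
    """Map a character-level answer span to (start_token, end_token) indices."""
    answer_end = answer_start + len(answer_text)
    # context tokens: type 1 with a non-empty span (this skips the final [SEP])
    context_idx = [i for i, (t, (s, e)) in enumerate(zip(token_type_ids, offsets))
                   if t == 1 and e > s]
    if not context_idx:
        return 0, 0
    first, last = context_idx[0], context_idx[-1]
    if offsets[first][0] > answer_start or offsets[last][1] < answer_end:
        return 0, 0  # answer is not fully inside this window
    start_tok = next(i for i in context_idx if offsets[i][1] > answer_start)
    end_tok = next(i for i in reversed(context_idx) if offsets[i][0] < answer_end)
    return start_tok, end_tok


# Toy pair: [CLS] 谁 ？ [SEP] 范 廷 颂 是 主 教 [SEP]
offsets = [(0, 0), (0, 1), (1, 2), (0, 0),
           (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (0, 0)]
token_type_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

name_span = find_answer_tokens(offsets, token_type_ids, answer_start=0, answer_text="范廷颂")
title_span = find_answer_tokens(offsets, token_type_ids, answer_start=4, answer_text="主教")
print(name_span, title_span)
```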

In [98]:
datasets = load_dataset("cmrc2018")
sample_dataset = datasets["train"].select(range(10))
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-macbert-base")
tokenizer

BertTokenizerFast(name_or_path='hfl/chinese-macbert-base', vocab_size=21128, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [102]:
# BertTokenizerFast accepts text and text_pair.
tokenized_examples = tokenizer(text=sample_dataset["question"],
                               text_pair=sample_dataset["context"],
                               return_offsets_mapping=True,  # also return offset_mapping (explained below)
                               max_length=384, truncation="only_second",  # truncate only the context; the first sequence is the question
                               padding="max_length")
tokenized_examples.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping'])

Here the question and the context are merged into a single sequence:
[CLS] 范 廷 颂 ... 任 为 主 教 的 ？ [SEP] 范 廷 颂 枢 机 ....同 年 8 月 15 日 就 任 ； 其 牧 铭 为 「 我 信 [SEP]

In [103]:
print(tokenizer.decode(tokenized_examples["input_ids"][0]))

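# whether each token belongs to the question (type 0) or the context (type 1)
print("=== token_type_ids ====\n", tokenized_examples["token_type_ids"][0])

# offset_mapping => (start, end) character span of each token within its own sequence; the question and the context each count from 0, and special tokens get (0, 0)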
print("=== offset_mapping ====\n", tokenized_examples["offset_mapping"][0], len(tokenized_examples["offset_mapping"][0]))

[CLS] 范 廷 颂 是 什 么 时 候 被 任 为 主 教 的 ？ [SEP] 范 廷 颂 枢 机 （ ， ） ， 圣 名 保 禄 · 若 瑟 （ ） ， 是 越 南 罗 马 天 主 教 枢 机 。 1963 年 被 任 为 主 教 ； 1990 年 被 擢 升 为 天 主 教 河 内 总 教 区 宗 座 署 理 ； 1994 年 被 擢 升 为 总 主 教 ， 同 年 年 底 被 擢 升 为 枢 机 ； 2009 年 2 月 离 世 。 范 廷 颂 于 1919 年 6 月 15 日 在 越 南 宁 平 省 天 主 教 发 艳 教 区 出 生 ； 童 年 时 接 受 良 好 教 育 后 ， 被 一 位 越 南 神 父 带 到 河 内 继 续 其 学 业 。 范 廷 颂 于 1940 年 在 河 内 大 修 道 院 完 成 神 学 学 业 。 范 廷 颂 于 1949 年 6 月 6 日 在 河 内 的 主 教 座 堂 晋 铎 ； 及 后 被 派 到 圣 女 小 德 兰 孤 儿 院 服 务 。 1950 年 代 ， 范 廷 颂 在 河 内 堂 区 创 建 移 民 接 待 中 心 以 收 容 到 河 内 避 战 的 难 民 。 1954 年 ， 法 越 战 争 结 束 ， 越 南 民 主 共 和 国 建 都 河 内 ， 当 时 很 多 天 主 教 神 职 人 员 逃 至 越 南 的 南 方 ， 但 范 廷 颂 仍 然 留 在 河 内 。 翌 年 管 理 圣 若 望 小 修 院 ； 惟 在 1960 年 因 捍 卫 修 院 的 自 由 、 自 治 及 拒 绝 政 府 在 修 院 设 政 治 课 的 要 求 而 被 捕 。 1963 年 4 月 5 日 ， 教 宗 任 命 范 廷 颂 为 天 主 教 北 宁 教 区 主 教 ， 同 年 8 月 15 日 就 任 ； 其 牧 铭 为 「 我 信 [SEP]
=== token_type_ids ====
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 