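# Hands-on Course: Understanding and Implementing BERT (Basic Edition)
### Fine-grained emotion classification as a worked example
"Fine-grained" means the label set has fine granularity: the labels may include semantically close answers such as angry and furious, and there are more classes overall. The contrasting notion, "coarse-grained", is something like positive/negative sentiment classification, which is considerably easier because there are only two classes.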

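This hands-on session uses the EmpatheticDialogues dataset, which contains multi-turn conversations between a Speaker and a Listener, plus one sentence describing the situation at hand. There are 32 classes, making it a fairly hard benchmark dataset for fine-grained emotion classification.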
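To give a rough feel for the gap in difficulty, the sketch below compares chance-level accuracy for 32 classes with that for two classes, assuming balanced classes. It then collapses a few fine labels into a coarse polarity. The `coarse_of` grouping is a hypothetical illustration and is not part of the dataset.

```python
# Chance-level accuracy when every class is equally likely (an assumption)
num_fine_classes = 32    # fine-grained emotion labels
num_coarse_classes = 2   # e.g. positive vs. negative

fine_chance = 1 / num_fine_classes
coarse_chance = 1 / num_coarse_classes
print(f"fine-grained chance level:   {fine_chance:.4f}")
print(f"coarse-grained chance level: {coarse_chance:.4f}")

# Hypothetical grouping of a few fine labels into a coarse polarity
coarse_of = {
    "angry": "negative", "furious": "negative", "annoyed": "negative",
    "joyful": "positive", "excited": "positive", "grateful": "positive",
}

# "angry" and "furious" are different fine-grained classes,
# but they become indistinguishable after coarsening
same_coarse = coarse_of["angry"] == coarse_of["furious"]
print(same_coarse)
```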

In the **basic hands-on session**, our **goal is to convert the raw plain-text data into a format BERT accepts** and then fine-tune BERT on our dataset.

In [None]:
# Install the required packages
from IPython.display import clear_output
!pip install datasets
!pip install transformers
!pip install sacremoses 
!pip install nlpaug
clear_output()

In [None]:
# Import the required packages
import torch
import pandas as pd
from tqdm.notebook import tqdm
from datasets import load_dataset
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel
from torch.nn.utils.rnn import pad_sequence
from sklearn.metrics import accuracy_score, f1_score

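### Loading and processing the raw data
Hugging Face already provides the dataset in three splits (train, validation and test), which can be fetched with `load_dataset` from the datasets package by naming the split you want. In this notebook we instead read a cleaned CSV copy of each split with pandas.

To keep training fast, so that you can find out early whether something is wrong, you can train on a small subset first (for example 5,000 training, 1,000 validation and 1,000 test samples). A cell further down slices the training set for this purpose. The sample counts printed below are for the full splits.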
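Taking a smaller subset while keeping sentences and labels aligned can be sketched as follows. The helper `subsample` and the toy lists are hypothetical names introduced only for this illustration; the notebook itself uses plain slicing.

```python
import random

def subsample(text, label, n, seed=0):
    """Return n (text, label) pairs drawn without replacement, kept aligned."""
    n = min(n, len(text))
    rng = random.Random(seed)            # fixed seed so the subset is reproducible
    idx = sorted(rng.sample(range(len(text)), n))
    return [text[i] for i in idx], [label[i] for i in idx]

# Toy data standing in for the real lists of sentences and label ids
toy_text = [f"sentence {i}" for i in range(10)]
toy_label = [i % 3 for i in range(10)]

small_text, small_label = subsample(toy_text, toy_label, n=4)
print(len(small_text), len(small_label))
```

Random sampling, unlike slicing the first rows, avoids a subset that only covers whichever conversations happen to come first in the file.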


In [None]:
# Map the original label names to ids
label2idx = {'sad': 0, 'trusting': 1, 'terrified': 2, 'caring': 3, 'disappointed': 4,
             'faithful': 5, 'joyful': 6, 'jealous': 7, 'disgusted': 8, 'surprised': 9,
             'ashamed': 10, 'afraid': 11, 'impressed': 12, 'sentimental': 13, 
             'devastated': 14, 'excited': 15, 'anticipating': 16, 'annoyed': 17, 'anxious': 18,
             'furious': 19, 'content': 20, 'lonely': 21, 'angry': 22, 'confident': 23,
             'apprehensive': 24, 'guilty': 25, 'embarrassed': 26, 'grateful': 27,
             'hopeful': 28, 'proud': 29, 'prepared': 30, 'nostalgic': 31}
            
idx2label = {v: k for k, v in label2idx.items()}

# Load the dataset
train = pd.read_csv('https://github.com/dinobby/ed_dataset_clean/raw/main/train.csv')
valid = pd.read_csv('https://github.com/dinobby/ed_dataset_clean/raw/main/valid.csv')
test = pd.read_csv('https://github.com/dinobby/ed_dataset_clean/raw/main/test.csv')
clear_output()

train.head()

Unnamed: 0,conv_id,utterance_idx,context,prompt,speaker_idx,utterance,selfeval,tags
0,hit:0_conv:1,1,sentimental,I remember going to the fireworks with my best...,1,I remember going to see the fireworks with my ...,5|5|5_2|2|5,
1,hit:0_conv:1,2,sentimental,I remember going to the fireworks with my best...,0,Was this a friend you were in love with_comma_...,5|5|5_2|2|5,
2,hit:0_conv:1,3,sentimental,I remember going to the fireworks with my best...,1,This was a best friend. I miss her.,5|5|5_2|2|5,
3,hit:0_conv:1,4,sentimental,I remember going to the fireworks with my best...,0,Where has she gone?,5|5|5_2|2|5,
4,hit:0_conv:1,5,sentimental,I remember going to the fireworks with my best...,1,We no longer talk.,5|5|5_2|2|5,


In [None]:
# Function for processing the raw dataset
def process_data(data, label2idx):
    # Takes the raw data (data) and the label-name-to-id mapping (label2idx),
    # and returns the cleaned samples (text) with their label ids (label)
    text, label = [], []
    for n, r in data.groupby('conv_id', sort=False):
        # In the basic session the single prompt sentence is enough for classification;
        # the advanced session can add the dialogue turns to help
        sentence = r['prompt'].values[0]

        # [TODO]: Clean or preprocess the data here, based on what you observe in it.
        # The label also has to be converted from its name (e.g. angry) to its id (22)
        # ===============================================
        sentence = sentence.replace('_comma_', ',')
        text.append(sentence)
        # Index directly so an unknown label raises KeyError instead of storing None
        label.append(label2idx[r['context'].values[0]])
        # ===============================================

    return text, label

# Process the raw dataset
train_text, train_label = process_data(train, label2idx)
valid_text, valid_label = process_data(valid, label2idx)
test_text, test_label = process_data(test, label2idx)
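
The effect of grouping by `conv_id` is that each conversation contributes exactly one sample, even though its prompt is repeated on every utterance row. A pandas-free sketch of the same idea, using made-up toy rows (the sentences and the `toy_label2idx` subset are assumptions for illustration):

```python
# Toy rows mimicking the columns used above: (conv_id, context, prompt)
rows = [
    ("hit:0_conv:1", "sentimental", "I remember the fireworks_comma_ it was great."),
    ("hit:0_conv:1", "sentimental", "I remember the fireworks_comma_ it was great."),
    ("hit:0_conv:2", "afraid", "I heard a noise downstairs last night."),
]
toy_label2idx = {"sentimental": 13, "afraid": 11}

seen = set()
text, label = [], []
for conv_id, context, prompt in rows:
    if conv_id in seen:          # keep only the first row of each conversation
        continue
    seen.add(conv_id)
    text.append(prompt.replace("_comma_", ","))
    label.append(toy_label2idx[context])

print(text)
print(label)
```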

In [None]:
# While debugging you can use a smaller training set to get results
# and spot mistakes faster; comment this block out again for the real training run
# train_text = train_text[:1000]
# train_label = train_label[:1000]

In [None]:
print("Training samples:", len(train_text))
print("Validation samples:", len(valid_text))
print("Test samples:", len(test_text))

Training samples: 19533
Validation samples: 2770
Test samples: 2547


## Writing a custom Dataset class
`Dataset` from `torch.utils.data` is a very handy tool that helps you turn raw data into a form PyTorch can consume batch by batch.
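
A map-style dataset essentially only has to answer two questions: how many samples there are (`__len__`) and what sample number `idx` is (`__getitem__`). The plain-Python sketch below mimics that contract without torch; `ToyDataset` and `iterate_batches` are hypothetical names, and the batching loop is only a rough approximation of what a loader does before collating.

```python
class ToyDataset:
    """Plain-Python stand-in for a map-style dataset (no torch required)."""
    def __init__(self, text, label):
        self.text = text
        self.label = label

    def __getitem__(self, idx):
        # Called for dataset[idx]; returns one sample
        return self.text[idx], self.label[idx]

    def __len__(self):
        # Called for len(dataset)
        return len(self.text)

def iterate_batches(dataset, batch_size):
    """Yield lists of samples, roughly what a loader does before collating."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]

toy = ToyDataset(["a", "b", "c", "d", "e"], [0, 1, 0, 1, 1])
batches = list(iterate_batches(toy, batch_size=2))
print(len(batches))     # 3 batches: sizes 2, 2 and 1
print(batches[-1])      # [('e', 1)]
```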

In [None]:
# Specify which encoder to use
encoder_type = 'bert-base-uncased'
# Get the tokenizer of this pretrained model; it must match the model used later
tokenizer = AutoTokenizer.from_pretrained(encoder_type)

class EmotionDataset(Dataset):
    def __init__(self, text, label, tokenizer):
        self.text = text
        self.label = label
        self.tokenizer = tokenizer
    
    # Defines what happens when a single sample is returned,
    # i.e. what comes back when the data is accessed with [idx]
    def __getitem__(self, idx):         
        # [TODO]: Build tokens_tensor:
        # tokenize the text and add the two special tokens [CLS] and [SEP],
        # i.e. turn the input sentence into [CLS] + tokenized text + [SEP], then convert it to ids
        # ======================================
        word_pieces = ["[CLS]"]
        tokens = self.tokenizer.tokenize(self.text[idx])
        word_pieces += tokens + ["[SEP]"]
        word_lens = len(word_pieces)
        
        # Convert the input sentence built above into a sequence of indices
        ids = self.tokenizer.convert_tokens_to_ids(word_pieces)
        tokens_tensor = torch.tensor(ids)
        
        # [TODO]: Build segments_tensor. Since this is a single-sentence classification task,
        # a tensor of zeros with the same length as the tokenized sentence is all we need
        # ======================================
        segments_tensor = torch.tensor([0] * word_lens, dtype=torch.long)
        # ======================================
        
        label_tensor = torch.tensor(self.label[idx], dtype=torch.long)
        return (tokens_tensor, segments_tensor, label_tensor)
    
    def __len__(self):
        return len(self.text)
    
# Instantiate the Dataset we just defined
trainset = EmotionDataset(train_text, train_label, tokenizer=tokenizer)
validset = EmotionDataset(valid_text, valid_label, tokenizer=tokenizer)
testset = EmotionDataset(test_text, test_label, tokenizer=tokenizer)


In [None]:
# Pick the first sample and see how the raw input is converted into a BERT-compatible format
sample_idx = 0

# Take out the raw text for comparison
sample_text, sample_label = train_text[sample_idx], train_label[sample_idx] 
tokens_tensor, segments_tensor, label_tensor = trainset[sample_idx]

# Convert tokens_tensor back into text
tokens = tokenizer.convert_ids_to_tokens(tokens_tensor.tolist())
combined_text = " ".join(tokens)

print(f"""[Original text]
Sentence: {sample_text}
Label: {sample_label}

===============

[Tensors returned by the Dataset]
tokens_tensor  : {tokens_tensor}

segments_tensor: {segments_tensor}

label_tensor   : {label_tensor}

===============

[Sentence recovered from the Dataset tensor]
{combined_text}
""")

[Original text]
Sentence: I remember going to the fireworks with my best friend. There was a lot of people, but it only felt like us in the world.
Label: 13


[Tensors returned by the Dataset]
tokens_tensor  : tensor([  101,  1045,  3342,  2183,  2000,  1996, 16080,  2007,  2026,  2190,
         2767,  1012,  2045,  2001,  1037,  2843,  1997,  2111,  1010,  2021,
         2009,  2069,  2371,  2066,  2149,  1999,  1996,  2088,  1012,   102])

segments_tensor: tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0])

label_tensor   : 13


[Sentence recovered from the Dataset tensor]
[CLS] i remember going to the fireworks with my best friend . there was a lot of people , but it only felt like us in the world . [SEP]
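
To make the step from text to ids less opaque, here is a simplified sketch of WordPiece-style tokenization: each word is split greedily into the longest pieces found in the vocabulary, and continuation pieces carry a `##` prefix. The vocabulary and ids in `toy_vocab` are hypothetical (the real ones come from the pretrained tokenizer), and the sketch splits on whitespace only, whereas the real tokenizer also separates punctuation and normalizes text.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                        # no piece matches: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary; it deliberately lacks "fireworks" to force a split
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "love": 5, "fire": 6, "##works": 7, ".": 8}

def encode(sentence, vocab):
    """Build [CLS] + word pieces + [SEP] and look up the id of each token."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens += wordpiece(word, vocab)
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("I love fireworks .", toy_vocab)
print(tokens)   # ['[CLS]', 'i', 'love', 'fire', '##works', '.', '[SEP]']
print(ids)      # [2, 4, 5, 6, 7, 8, 3]
```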



In [None]:
# Function that turns the data into batches.
# It mainly handles zero padding and produces the masks_tensors described earlier
def create_mini_batch(samples):
    tokens_tensors = [s[0] for s in samples]
    segments_tensors = [s[1] for s in samples]
    label_tensors = [s[2] for s in samples]
    
    # [TODO]: Zero-pad tokens_tensors and segments_tensors to the same length.
    # hint: use the imported pad_sequence, and remember to set batch_first to True
    #===============================================
    tokens_tensors = pad_sequence(tokens_tensors, batch_first=True)
    segments_tensors = pad_sequence(segments_tensors, batch_first=True)
    #===============================================
    
    # [TODO]: Build the attention masks: set the positions of tokens_tensors that are
    # not zero padding to 1, so that BERT only attends to the tokens at those positions
    # ================================================
    # Start from an all-zero tensor with the same shape as tokens_tensors
    masks_tensors = torch.zeros(tokens_tensors.shape, dtype=torch.long)
    
    # hint: tensor.masked_fill can set every position that is not 0 to 1
    masks_tensors = masks_tensors.masked_fill(tokens_tensors != 0, 1)
    # ================================================
    label_tensors = torch.tensor(label_tensors, dtype=torch.long)
    return tokens_tensors, segments_tensors, masks_tensors, label_tensors


# Create the DataLoaders, each of which returns one batch of samples at a time
demoloader = DataLoader(trainset, batch_size=4, collate_fn=create_mini_batch)
# Shuffle the training data each epoch so batches are not grouped by file order
trainloader = DataLoader(trainset, batch_size=64, shuffle=True, collate_fn=create_mini_batch)
validloader = DataLoader(validset, batch_size=256, collate_fn=create_mini_batch)
testloader = DataLoader(testset, batch_size=256, collate_fn=create_mini_batch)

In [None]:
demo_data = next(iter(demoloader))

tokens_tensors, segments_tensors, masks_tensors, label_tensors = demo_data

print(f"""
tokens_tensors.shape   = {tokens_tensors.shape} 
{tokens_tensors}
------------------------
segments_tensors.shape = {segments_tensors.shape}
{segments_tensors}
------------------------
masks_tensors.shape    = {masks_tensors.shape}
{masks_tensors}
------------------------
label_tensors.shape        = {label_tensors.shape}
{label_tensors}
""")


tokens_tensors.shape   = torch.Size([4, 30]) 
tensor([[  101,  1045,  3342,  2183,  2000,  1996, 16080,  2007,  2026,  2190,
          2767,  1012,  2045,  2001,  1037,  2843,  1997,  2111,  1010,  2021,
          2009,  2069,  2371,  2066,  2149,  1999,  1996,  2088,  1012,   102],
        [  101,  1045,  2109,  2000, 12665,  2005,  4768,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  1045,  3662,  1037,  3124,  2129,  2000,  2448,  1037,  2204,
         26892,  2094,  1999, 23604,  2465,  1998,  2002,  3236,  2006,  4248,
          1012,   102,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  1045,  2031,  2467,  2042,  8884,  2000,  2026,  2564,  1012,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]])
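
The padding and masking logic can be checked by hand on two of the shorter rows from the batch above. This is a torch-free sketch: `pad_and_mask` is a hypothetical helper that builds the masks from the original lengths, which gives the same result as the `!= 0` test as long as id 0 is used only for padding.

```python
def pad_and_mask(sequences, pad_id=0):
    """Right-pad id lists to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    masks = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, masks

# Two short id sequences taken from the batch printed above
batch = [
    [101, 1045, 2109, 2000, 12665, 2005, 4768, 102],
    [101, 1045, 2031, 2467, 2042, 8884, 2000, 2026, 2564, 1012, 102],
]
padded, masks = pad_and_mask(batch)
for row, mask in zip(padded, masks):
    print(row)
    print(mask)
```

Padding only to the longest sequence in each batch, rather than to a global maximum, keeps the tensors as small as the batch allows.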


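## Data is ready: loading the pretrained language model
Hugging Face is a remarkably generous team: it has collected and implemented many models from the BERT family and its variants, and provides trained parameters that can be loaded directly. With a strong pretrained language model, the machine can better understand the human language we feed it. In other words, it turns discrete text input into continuous vectors that take context into account. Armed with this, whatever problem comes next becomes much easier to tackle.

Loading a trained model is easy: just specify the model name (e.g., `bert-base-uncased`, `roberta-large`) and the model is pulled directly from the hub provided by Hugging Face. The variants and model names are listed at https://huggingface.co/transformers/pretrained_models.html.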

In [None]:
# Load a pretrained model for single-sentence classification and specify the number of labels
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(encoder_type, num_labels=32)
clear_output()

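## What happens inside BertForSequenceClassification?
We can look at a source snippet from the Hugging Face documentation. It simply runs the input through BERT and dropout, then adds one linear layer that projects the output to `num_labels` dimensions. The snippet comes from an older release of the library, which is why it uses names such as `PreTrainedBertModel`; current transformers releases name the base class `BertPreTrainedModel`. So if you want to do something more elaborate with the representation BERT produces, follow the same pattern: subclass the pretrained base class, build the encoder with `BertModel(config)`, and define what the forward pass should do. For example, instead of using only an `nn.Linear`, why not add an `nn.Conv2d`?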

In [None]:
"""
class BertForSequenceClassification(PreTrainedBertModel):
    def __init__(self, config, num_labels=2):
        super(BertForSequenceClassification, self).__init__(config)
        self.num_labels = num_labels
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, num_labels)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
        _, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        if labels is not None:
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            return loss
        else:
            return logits
"""
clear_output()
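
The last two steps of the forward pass, turning logits into a loss and into a prediction, can be worked through by hand. The sketch below assumes a single sample with four made-up logits (the notebook uses 32 classes). It computes the same quantity `CrossEntropyLoss` computes per sample, namely the negative log of the softmax probability of the true class.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability of the true class for one sample."""
    return -math.log(softmax(logits)[label])

# Hypothetical logits for a 4-class problem
logits = [2.0, 0.5, -1.0, 0.1]
label = 0

probs = softmax(logits)
prediction = max(range(len(logits)), key=lambda i: logits[i])   # argmax over classes
loss = cross_entropy(logits, label)
print(prediction, round(loss, 4))
```

With uniform logits over 32 classes the loss would be ln 32, about 3.47, which is a useful reference point for an untrained classifier head.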

## Training the model


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def train_model(model, epochs, lr=1e-5):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model = model.to(device)

    for epoch in range(epochs):
        # Switch back to training mode every epoch (validation below uses eval mode)
        model.train()
        total_train_loss = 0.0
        train_pred, train_labels = [], []
        for data in tqdm(trainloader):
            
            tokens_tensors, segments_tensors, \
            masks_tensors, labels = [t.to(device) for t in data]

            optimizer.zero_grad()
            
            outputs = model(input_ids=tokens_tensors, 
                            token_type_ids=segments_tensors, 
                            attention_mask=masks_tensors, 
                            labels=labels)
            loss, logits = outputs['loss'], outputs['logits']
            loss.backward()
            optimizer.step()
            total_train_loss += loss.item()

            predictions = torch.argmax(logits, dim=-1)
            train_pred.extend(predictions.tolist())
            train_labels.extend(labels.tolist())

        # Loss, accuracy and micro-F1 on the training set
        # (sklearn metrics take y_true first, then y_pred)
        train_loss = total_train_loss / len(trainloader)
        train_acc = accuracy_score(train_labels, train_pred)
        train_f1 = f1_score(train_labels, train_pred, average='micro')
        print(f'train | acc: {train_acc:.04f}, f1: {train_f1:.04f}')

        # Loss, accuracy and micro-F1 on the validation set;
        # eval mode disables dropout so the validation metrics are deterministic
        model.eval()
        total_valid_loss = 0.0
        valid_pred, valid_labels = [], []
        with torch.no_grad():
            for data in tqdm(validloader):
            
                tokens_tensors, segments_tensors, \
                masks_tensors, labels = [t.to(device) for t in data]

                outputs = model(input_ids=tokens_tensors, 
                            token_type_ids=segments_tensors, 
                            attention_mask=masks_tensors, 
                            labels=labels)
                loss, logits = outputs['loss'], outputs['logits']

                total_valid_loss += loss.item()

                predictions = torch.argmax(logits, dim=-1)
                valid_pred.extend(predictions.tolist())
                valid_labels.extend(labels.tolist())

        valid_loss = total_valid_loss / len(validloader)
        # Score against the labels collected from the loader, not the global valid_label list
        valid_acc = accuracy_score(valid_labels, valid_pred)
        valid_f1 = f1_score(valid_labels, valid_pred, average='micro')
        print(f'valid | acc: {valid_acc:.04f}, f1: {valid_f1:.04f}')
    return model

In [None]:
model = train_model(model, epochs=3, lr=1e-5)

train | acc: 0.1782, f1: 0.1782
valid | acc: 0.3451, f1: 0.3451
train | acc: 0.4585, f1: 0.4585
valid | acc: 0.4758, f1: 0.4758
train | acc: 0.5576, f1: 0.5576
valid | acc: 0.5069, f1: 0.5069
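
The accuracy and F1 values in this log are identical on every line. That is expected: for single-label multi-class classification, micro-averaged F1 pools true positives, false positives and false negatives over all classes, and every wrong prediction counts once as a false positive and once as a false negative, so micro-F1 equals accuracy. A small sketch with made-up toy labels (the helper functions are written out here rather than taken from scikit-learn):

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes first."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            tp += (p == c and t == c)
            fp += (p == c and t != c)
            fn += (p != c and t == c)
    if tp == 0:
        return 0.0                      # avoid dividing by zero when nothing is correct
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels and predictions for illustration
y_true = [0, 1, 2, 2, 1, 0, 3]
y_pred = [0, 2, 2, 2, 1, 3, 3]
print(accuracy(y_true, y_pred), micro_f1(y_true, y_pred))
```

A macro average, which weights every class equally, generally differs from accuracy and may be more informative when some of the 32 emotions are much harder than others.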


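## Saving the trained model
In general, we should track the lowest validation loss or the highest validation accuracy during training and use it to decide when to save the model. Here, however, we simply save the parameters after all 3 epochs.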
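
The "save only when validation improves" idea can be sketched independently of the model. `track_best` and `save_fn` are hypothetical names; in the notebook, `save_fn` would wrap a `torch.save(model.state_dict(), ...)` call. The first three accuracies are the validation values reported above, and the fourth is a made-up drop to show that a worse epoch is not saved.

```python
def track_best(valid_accs, save_fn):
    """Call save_fn whenever validation accuracy improves; return the best epoch."""
    best_acc, best_epoch = float("-inf"), None
    for epoch, acc in enumerate(valid_accs, start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
            save_fn(epoch)      # in the notebook this would be a torch.save call
    return best_epoch, best_acc

saved_epochs = []
history = [0.3451, 0.4758, 0.5069, 0.4990]   # last value is hypothetical
best_epoch, best_acc = track_best(history, saved_epochs.append)
print(best_epoch, best_acc, saved_epochs)    # 3 0.5069 [1, 2, 3]
```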

In [None]:
torch.save(model.state_dict(), 'bert.pt')

Loading the trained model back, whether for prediction or for further training, is just as easy:

In [None]:
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained(encoder_type, num_labels=32)
# map_location lets a checkpoint saved on GPU be loaded on a CPU-only machine
model.load_state_dict(torch.load("bert.pt", map_location=device))
model.to(device)
model.eval()
clear_output()

## Running inference on new samples from the test set

In [None]:
def inference(model, testloader):
    test_pred, test_labels = [], []
    with torch.no_grad():
        for data in tqdm(testloader):
        
            tokens_tensors, segments_tensors, \
            masks_tensors, labels = [t.to(device) for t in data]

            outputs = model(input_ids=tokens_tensors, 
                        token_type_ids=segments_tensors, 
                        attention_mask=masks_tensors)
            logits = outputs['logits']

            predictions = torch.argmax(logits, dim=-1)
            test_pred.extend(predictions.tolist())
            test_labels.extend(labels.tolist())

    # Score against the labels collected from the loader (y_true first, then y_pred)
    test_acc = accuracy_score(test_labels, test_pred)
    test_f1 = f1_score(test_labels, test_pred, average='micro')
    return test_acc, test_f1, test_pred, test_labels

In [None]:
test_acc, test_f1, bert_pred, test_labels = inference(model, testloader)
print('======= Bert performance =======')
print(f'test | acc: {test_acc:.04f}, f1: {test_f1:.04f}')

test | acc: 0.5379, f1: 0.5379


In [None]:
d = {'id': range(len(bert_pred)), 'label': bert_pred}
pd.DataFrame(d).to_csv('simple.csv', index=False)
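
The saved file contains label ids only. If human-readable names are wanted as well, the predictions can be mapped back through the inverse mapping. The sketch below uses a three-entry subset of the notebook's label mapping, made-up predictions, and an in-memory CSV; the `label_name` column is an assumption and is not part of the original `simple.csv` format.

```python
import csv
import io

# Subset of the notebook's label mapping, for illustration
toy_label2idx = {"sad": 0, "sentimental": 13, "angry": 22}
toy_idx2label = {v: k for k, v in toy_label2idx.items()}

pred_ids = [13, 0, 22, 13]          # hypothetical predictions
pred_names = [toy_idx2label[i] for i in pred_ids]

# Write id / label id / label name rows to an in-memory CSV
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "label", "label_name"])
for i, (pid, name) in enumerate(zip(pred_ids, pred_names)):
    writer.writerow([i, pid, name])

print(buffer.getvalue())
```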

## Reference
- https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html
- https://huggingface.co/transformers/pretrained_models.html