# **Homework 7 - BERT (Question Answering)**

If you have any questions, feel free to email us at ntu-ml-2023spring-ta@googlegroups.com



Slide:    [Link](https://docs.google.com/presentation/d/15lGUmT8NpLGtoxRllRWCJyQEjhR1Idcei63YHsDckPE/edit#slide=id.g21fff4e9af6_0_13)　Kaggle: [Link](https://www.kaggle.com/competitions/ml2023spring-hw7/host/sandbox-submissions)　Data: [Link](https://drive.google.com/file/d/1YU9KZFhQqW92Lw9nNtuUPg0-8uyxluZ7/view?usp=sharing)

# Prerequisites

## Download Dataset

In [1]:
# download link 1
# !gdown --id '1TjoBdNlGBhP_J9C66MOY7ILIrydm7ZCS' --output hw7_data.zip

# download link 2 (if above link failed)
# !gdown --id '1YU9KZFhQqW92Lw9nNtuUPg0-8uyxluZ7' --output hw7_data.zip

# download link 3 (if above link failed)
# !gdown --id '1k2BfGrvhk8QRnr9Xvb04oPIKDr1uWFpa' --output hw7_data.zip

!unzip -o hw7_data.zip

# For this HW, K80 < P4 < T4 < P100 <= T4(fp16) < V100
!nvidia-smi

Archive:  hw7_data.zip
  inflating: hw7_train.json          
  inflating: hw7_test.json           
  inflating: hw7_dev.json            
  inflating: hw7_in-context-learning-examples.json  
Sun May 28 00:08:04 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:03.0 Off |                    0 |
| N/A   33C    P0    24W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+--

## Install packages

Documentation for the toolkit: 
*   https://huggingface.co/transformers/
*   https://huggingface.co/docs/accelerate/index



In [2]:
# You may change the transformers version or use other toolkits
!pip install transformers==4.26.1
!pip install accelerate==0.16.0 -i https://pypi.org/simple

Looking in indexes: https://pypi.internal-mirrors.ucloud.cn/simple


## Task description
- Chinese Extractive Question Answering
  - Input: Paragraph + Question
  - Output: Answer

- Objective: Learn how to fine-tune a pretrained model on a downstream task using transformers

- Todo
    - Fine-tune a pretrained Chinese BERT model
    - Change hyperparameters (e.g. doc_stride)
    - Apply linear learning rate decay
    - Try other pretrained models
    - Improve preprocessing
    - Improve postprocessing
- Training tips
    - Automatic mixed precision
    - Gradient accumulation
    - Ensemble

- Estimated training time (Tesla T4 with automatic mixed precision enabled)
    - Simple: 8mins
    - Medium: 8mins
    - Strong: 25mins
    - Boss: 2hrs
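
The "linear learning rate decay" item in the list above can be sketched without any framework: the multiplier starts at 1 and falls to 0 over the run. The function name `linear_decay_factor` and the step count are illustrative assumptions, not part of the homework code. In PyTorch, a function of this shape is what would typically be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`.

```python
def linear_decay_factor(step: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given step.

    Falls linearly from 1.0 at step 0 to 0.0 at total_steps and is
    clamped so it never becomes negative if training runs longer.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return max(0.0, 1.0 - step / total_steps)


base_lr = 1e-5       # base learning rate (the value used in the Training section)
total_steps = 1000   # hypothetical number of scheduler steps

for step in (0, 250, 500, 1000, 1200):
    lr = base_lr * linear_decay_factor(step, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

The clamp matters when the scheduler is stepped more often than `total_steps`, for example when training for several epochs with a step count computed for one.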
  

In [1]:
import json
import numpy as np
import random
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AdamW

from tqdm.auto import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"

def same_seeds(seed):
    torch.manual_seed(seed)  # seed the CPU random number generator
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)  # seed the current GPU (CPU and GPU generators are separate)
        torch.cuda.manual_seed_all(seed)  # seed every GPU when several are in use
    np.random.seed(seed)
    random.seed(seed)  # seed Python's built-in random module
    torch.backends.cudnn.benchmark = False  # disable cuDNN auto-tuning, which can make results differ between runs
    torch.backends.cudnn.deterministic = True  # make cuDNN use deterministic algorithms
same_seeds(2)



## Load Model and Tokenizer


In [16]:
from transformers import (
  AutoTokenizer,
  AutoModelForQuestionAnswering,
)

model = AutoModelForQuestionAnswering.from_pretrained("bert-base-chinese").to(device)
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForQuestionAnswering: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-chinese a

## Read Data

- Training set: 26918 QA pairs
- Dev set: 2863 QA pairs
- Test set: 3524 QA pairs

- {train/dev/test}_questions:	
  - List of dicts with the following keys:
   - id (int)
   - paragraph_id (int)
   - question_text (string)
   - answer_text (string)
   - answer_start (int)
   - answer_end (int)
- {train/dev/test}_paragraphs: 
  - List of strings
  - paragraph_ids in questions correspond to indices in paragraphs
  - A paragraph may be used by several questions 
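
To make the layout above concrete, here is a minimal sketch with a hypothetical one-question dataset (made-up content, not taken from the homework files). It assumes `answer_start` and `answer_end` are character offsets into the paragraph and that `answer_end` is inclusive, which matches how the end position is treated as inclusive elsewhere in this notebook.

```python
# Hypothetical miniature of the JSON layout (not real homework data).
data = {
    "questions": [
        {
            "id": 0,
            "paragraph_id": 0,
            "question_text": "台灣的首都是哪裡?",
            "answer_text": "台北",
            "answer_start": 0,
            "answer_end": 1,
        }
    ],
    "paragraphs": ["台北是台灣的首都。"],
}


def answer_span(question: dict, paragraphs: list) -> str:
    """Look up the question's paragraph and slice out the answer characters."""
    paragraph = paragraphs[question["paragraph_id"]]
    # answer_end is assumed inclusive, hence the + 1
    return paragraph[question["answer_start"]: question["answer_end"] + 1]


q = data["questions"][0]
print(answer_span(q, data["paragraphs"]))
```

A check like `answer_span(q, paragraphs) == q["answer_text"]` over the training set is a cheap way to confirm the offset convention before relying on it.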

In [3]:
def read_data(file):
    with open(file, "r", encoding="utf-8") as reader:
        data = json.load(reader)
    return data["questions"], data["paragraphs"]

train_questions, train_paragraphs = read_data("hw7_train.json")
dev_questions, dev_paragraphs = read_data("hw7_dev.json")
test_questions, test_paragraphs = read_data("hw7_test.json")

## Tokenize Data

### Example of tokenized output

```
Q1 = "喀麥隆住了超過多少的種族?"
Q2 = "用稻草編出的人形有什麼作用？"

tokenizer([Q1, Q2]) returns:
{
    'input_ids': [
        [101, 1584, 7930, 7384, 857, 749, 6631, 6882, 1914, 2208, 4638, 4934, 3184, 136, 102],
        [101, 4500, 4940, 5770, 5226, 1139, 4638, 782, 2501, 3300, 784, 7938, 868, 4500, 8043, 102]
    ], 
    'token_type_ids': [
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    ],
    'attention_mask': [
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    ]
}
```

In [4]:
# Special tokens such as [CLS] and [SEP] are added later, in QA_Dataset.__getitem__

train_questions_tokenized = tokenizer([train_question["question_text"] for train_question in train_questions], add_special_tokens=False)
dev_questions_tokenized = tokenizer([dev_question["question_text"] for dev_question in dev_questions], add_special_tokens=False)
test_questions_tokenized = tokenizer([test_question["question_text"] for test_question in test_questions], add_special_tokens=False) 

train_paragraphs_tokenized = tokenizer(train_paragraphs, add_special_tokens=False)
dev_paragraphs_tokenized = tokenizer(dev_paragraphs, add_special_tokens=False)
test_paragraphs_tokenized = tokenizer(test_paragraphs, add_special_tokens=False)

Token indices sequence length is longer than the specified maximum sequence length for this model (566 > 512). Running this sequence through the model will result in indexing errors


## Dataset

In [5]:
class QA_Dataset(Dataset):
    def __init__(self, split, questions, tokenized_questions, tokenized_paragraphs):
        self.split = split  # "train", "dev", or "test": which dataset split this is
        self.questions = questions
        self.tokenized_questions = tokenized_questions
        self.tokenized_paragraphs = tokenized_paragraphs
        self.max_question_len = 60
        self.max_paragraph_len = 150

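        # When question + paragraph is too long, the paragraph is cut into windows;
        # doc_stride is the distance the window slides each time.
        # Example: for a 300-token paragraph with max_paragraph_len = 150,
        # doc_stride = 150 yields 2 non-overlapping windows (starting at tokens 0 and 150),
        # while doc_stride = 75 would yield 4 overlapping windows (0, 75, 150, 225).
        self.doc_stride = 150

        # Input sequence length = [CLS] + question + [SEP] + paragraph + [SEP]
        # Maximum length of the model input, built from the question and the paragraph.
        # [CLS] carries a sequence-level representation; [SEP] separates the question from the paragraph.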
        self.max_seq_len = 1 + self.max_question_len + 1 + self.max_paragraph_len + 1

    def __len__(self):
        return len(self.questions)
    
    def __getitem__(self, idx):
        question = self.questions[idx]
        tokenized_question = self.tokenized_questions[idx]
        tokenized_paragraph = self.tokenized_paragraphs[question["paragraph_id"]]

        if self.split == "train":
            # Both values are ints but they differ: question["answer_start"] is a character
            # offset in the raw paragraph, while char_to_token returns the index of the token
            # that covers that character (one token can span several characters).
            answer_start_token = tokenized_paragraph.char_to_token(question["answer_start"])
            answer_end_token = tokenized_paragraph.char_to_token(question["answer_end"])

            # paragraph_start is where the training window begins.
            # 1. mid - self.max_paragraph_len // 2 steps back half a window from the center
            #    of the answer, so the window is centered on the answer.
            #    When the answer lies near the end of the paragraph, that window would run past
            #    the end, so min(...) caps the start at
            #    len(tokenized_paragraph) - self.max_paragraph_len.
            # 2. If the result of min(...) is negative, max(0, ...) sets paragraph_start to 0;
            #    any part of the paragraph beyond the window is then cut off.
            mid = (answer_start_token + answer_end_token) // 2
            paragraph_start = max(0, min(mid - self.max_paragraph_len // 2, len(tokenized_paragraph) - self.max_paragraph_len))
            paragraph_end = paragraph_start + self.max_paragraph_len

            # 101: CLS, 102:SEP
            input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102]
            input_ids_paragraph = tokenized_paragraph.ids[paragraph_start:paragraph_end] + [102]

            # answer_start_token is the answer's start position within the full tokenized paragraph.
            # Adding len(input_ids_question) accounts for the question prefix, and subtracting
            # paragraph_start accounts for the window offset, giving the position in the input sequence.
            answer_start_token += len(input_ids_question) - paragraph_start
            answer_end_token += len(input_ids_question) - paragraph_start

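            # Concatenate question and paragraph, then pad up to self.max_seq_len.
            # input_ids: the token ids, padded with 0
            # token_type_ids: 0 for the question part, 1 for the paragraph part, 0 for padding
            # attention_mask: 1 for real tokens, 0 for padding, so attention ignores the padding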
            input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
            
            return torch.tensor(input_ids), torch.tensor(token_type_ids), torch.tensor(attention_mask), answer_start_token, answer_end_token
        else:
            # Each item pairs one question with every window of its paragraph:
            # [question + window 1, question + window 2, question + window 3, ...]
            # Layout of one dev/test item:
            #   item[0] -> input_ids of all windows:      [num_windows, max_seq_len]
            #   item[1] -> token_type_ids of all windows: [num_windows, max_seq_len]
            #   item[2] -> attention_mask of all windows: [num_windows, max_seq_len]
            # num_windows depends on the paragraph length, so it can differ between items.
            # Unlike train, which uses a single window around the answer, dev/test collect
            # every window into the lists below.
            input_ids_list, token_type_ids_list, attention_mask_list = [], [], []

            for i in range(0, len(tokenized_paragraph), self.doc_stride):
                input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102]
                input_ids_paragraph = tokenized_paragraph.ids[i:i+self.max_paragraph_len] + [102]

                input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)

                input_ids_list.append(input_ids)
                token_type_ids_list.append(token_type_ids)
                attention_mask_list.append(attention_mask)
            
            # Note: returns three tensors (input_ids, token_type_ids, attention_mask), each of shape [num_windows, max_seq_len]
            return torch.tensor(input_ids_list), torch.tensor(token_type_ids_list), torch.tensor(attention_mask_list)
    
    def padding(self, input_ids_question, input_ids_paragraph):
        padding_len = self.max_seq_len - len(input_ids_question) - len(input_ids_paragraph)
        input_ids = input_ids_question + input_ids_paragraph + [0] * padding_len
        token_type_ids = [0] * len(input_ids_question) + [1] * len(input_ids_paragraph) + [0] * padding_len
        attention_mask = [1] * (len(input_ids_question) + len(input_ids_paragraph)) + [0] * padding_len

        return input_ids, token_type_ids, attention_mask

train_set = QA_Dataset("train", train_questions, train_questions_tokenized, train_paragraphs_tokenized)
dev_set = QA_Dataset("dev", dev_questions, dev_questions_tokenized, dev_paragraphs_tokenized)
test_set = QA_Dataset("test", test_questions, test_questions_tokenized, test_paragraphs_tokenized)
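
The window arithmetic in `QA_Dataset` can be checked in isolation. The sketch below reimplements only the index computations in plain Python; `train_window` and `eval_windows` are hypothetical helper names introduced for illustration, and the defaults mirror `max_paragraph_len = 150` and `doc_stride = 150` from the class.

```python
def train_window(answer_start: int, answer_end: int,
                 paragraph_len: int, max_paragraph_len: int = 150) -> tuple:
    """Token range [start, end) of the training window centered on the answer.

    For paragraphs shorter than the window, end may exceed paragraph_len;
    list slicing tolerates that.
    """
    mid = (answer_start + answer_end) // 2
    start = max(0, min(mid - max_paragraph_len // 2,
                       paragraph_len - max_paragraph_len))
    return start, start + max_paragraph_len


def eval_windows(paragraph_len: int, max_paragraph_len: int = 150,
                 doc_stride: int = 150) -> list:
    """Token ranges [start, end) visited by the dev/test sliding window."""
    return [(i, min(i + max_paragraph_len, paragraph_len))
            for i in range(0, paragraph_len, doc_stride)]


print(train_window(10, 12, 300))         # answer near the beginning
print(train_window(290, 295, 300))       # answer near the end
print(eval_windows(300))                 # stride 150: no overlap
print(eval_windows(300, doc_stride=75))  # stride 75: overlapping windows
```

With `doc_stride` equal to `max_paragraph_len`, an answer that straddles a window boundary is split across two windows. A smaller stride makes neighbouring windows overlap, at the cost of more forward passes per question.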

## Function for Evaluation

In [9]:
def evaluate(data, output):
    answer = ""
    max_prob = float("-inf")
    num_of_windows = data[0].shape[1]

    for k in range(num_of_windows):
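        # output.start_logits[k] holds, for window k, each token's score (logit) for being the answer start
        # output.end_logits[k] holds, for window k, each token's score (logit) for being the answer end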
        start_prob, start_index = torch.max(output.start_logits[k], dim=0)
        end_prob, end_index = torch.max(output.end_logits[k], dim=0)

        prob = start_prob + end_prob

        if prob > max_prob:
            max_prob = prob
            # About data[0][0][k][start_index:end_index+1]:
            # for dev/test, the DataLoader (batch_size=1) yields a list of three tensors:
            # [input_ids, token_type_ids, attention_mask]
            # data[0] -> tensor([batch_size, num_windows, self.max_seq_len])
            # data[1] -> tensor([batch_size, num_windows, self.max_seq_len])
            # data[2] -> tensor([batch_size, num_windows, self.max_seq_len])
            # so data[0][0][k] is the input_ids of window k for the single example in the batch
            answer = tokenizer.decode(data[0][0][k][start_index:end_index+1])
        
    return answer.replace(' ', '')
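
`evaluate` takes the best start and the best end independently, so nothing prevents `end_index` from landing before `start_index`. One possible post-processing improvement is to search only over valid spans. The sketch below works on plain Python lists of scores; the helper name `best_span` and the `max_answer_len` limit of 30 tokens are assumptions for illustration, not part of the homework code.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (score, start, end) of the highest-scoring span with start <= end.

    Exhaustive search over valid (start, end) pairs whose length does not
    exceed max_answer_len (a hypothetical limit).
    """
    best = (float("-inf"), 0, 0)
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best[0]:
                best = (score, s, e)
    return best


# Independent argmax would pick start = 2 and end = 0, which is not a valid span.
start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [4.0, 0.1, 0.2, 3.0]
print(best_span(start_logits, end_logits))
```

To use this with model output, the logits of each window would first be converted to lists (for example with `.tolist()`), and the span with the highest score across windows kept, in the same way `evaluate` compares `prob` across windows.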
            

## Training

In [17]:
from accelerate import Accelerator

# hyperparameters
num_epoch = 1
validation = True
logging_step = 100
learning_rate = 1e-5
optimizer = AdamW(model.parameters(), lr=learning_rate)
train_batch_size = 8

gradient_accumulation_steps = 16

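from torch.optim import lr_scheduler
# Linear decay: the multiplier falls from 1 to 0 over total_steps scheduler steps.
# The scheduler is stepped once per batch in the loop below, so total_steps counts batches over all epochs.
total_steps = num_epoch * (len(train_set) // train_batch_size)
print(type(total_steps), total_steps)
# Clamp at 0 so the learning rate never turns negative
lambda1 = lambda step: max(0.0, 1 - step / total_steps)
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda1)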

# pin_memory keeps batches in page-locked (pinned) host memory, which speeds up CPU-to-GPU transfers
train_loader = DataLoader(train_set, batch_size=train_batch_size, shuffle=True, pin_memory=True)
dev_loader = DataLoader(dev_set, batch_size=1, shuffle=False, pin_memory=False)
test_loader = DataLoader(test_set, batch_size=1, shuffle=False, pin_memory=True)

fp16_training = True
if fp16_training:
    accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=gradient_accumulation_steps)
else:
    accelerator = Accelerator(gradient_accumulation_steps=gradient_accumulation_steps)

model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

model.train()

print("Start training ...")

for epoch in range(num_epoch):
    step = 1
    train_loss = train_acc = 0

    # train_loader draws random batches of train_batch_size (8) examples from train_set
    # data[0]: [8, self.max_seq_len] -> input_ids
    # data[1]: [8, self.max_seq_len] -> token_type_ids
    # data[2]: [8, self.max_seq_len] -> attention_mask
    # data[3]: [8] -> answer_start_token
    # data[4]: [8] -> answer_end_token
    # output:
    # output.start_logits: [8, self.max_seq_len] -> scores for the answer start position
    # output.end_logits: [8, self.max_seq_len] -> scores for the answer end position
    for data in tqdm(train_loader):
        with accelerator.accumulate(model):
            data = [i.to(device) for i in data]

            output = model(input_ids=data[0], token_type_ids=data[1], attention_mask=data[2], start_positions=data[3], end_positions=data[4])

            start_index = torch.argmax(output.start_logits, dim=1)
            end_index = torch.argmax(output.end_logits, dim=1)

            train_acc += ((start_index == data[3]) & (end_index == data[4])).float().mean()

            train_loss += output.loss

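            # Backpropagate to compute gradients
            accelerator.backward(output.loss)

            step += 1
            # Update the model parameters
            optimizer.step()
            # Zero the stored gradients so they are not added to the next update
            optimizer.zero_grad()

            # Update the learning rate
            scheduler.step()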

            if step % logging_step == 0:
                print(f"Epoch {epoch + 1} | Step {step} | loss = {train_loss.item() / logging_step:.3f}, acc = {train_acc / logging_step:.3f}")
                train_loss = train_acc = 0
    
    if validation:
        print("Evaluating Dev Set ...")
        model.eval()
        with torch.no_grad():
            dev_acc = 0
            for i, data in enumerate(tqdm(dev_loader)):
                # For dev/test, the DataLoader yields a list of three tensors:
                # [input_ids, token_type_ids, attention_mask]
                # data[0] -> tensor([batch_size, num_windows, self.max_seq_len])
                # data[1] -> tensor([batch_size, num_windows, self.max_seq_len])
                # data[2] -> tensor([batch_size, num_windows, self.max_seq_len])
                # Note: the model expects input of shape [N, self.max_seq_len]. With batch_size = 1,
                # squeeze(dim=0) drops the batch dimension so that the windows act as the batch.
                output = model(input_ids=data[0].squeeze(dim=0).to(device),
                               token_type_ids=data[1].squeeze(dim=0).to(device),
                               attention_mask=data[2].squeeze(dim=0).to(device))
                dev_acc += evaluate(data, output) == dev_questions[i]["answer_text"]
            print(f"Validation | Epoch {epoch + 1} | acc = {dev_acc / len(dev_loader):.3f}")
        model.train()

print("Saving Model ...")
model_save_dir = "saved_model"
model.save_pretrained(model_save_dir)


<class 'int'> 3364
Start training ...
Epoch 1 | Step 100 | loss = 5.174, acc = 0.001
Epoch 1 | Step 200 | loss = 4.776, acc = 0.010
Epoch 1 | Step 300 | loss = 4.303, acc = 0.011
Epoch 1 | Step 400 | loss = 3.807, acc = 0.050
Epoch 1 | Step 500 | loss = 3.368, acc = 0.101
Epoch 1 | Step 600 | loss = 2.997, acc = 0.145
Epoch 1 | Step 700 | loss = 2.672, acc = 0.186
Epoch 1 | Step 800 | loss = 2.442, acc = 0.220
Epoch 1 | Step 900 | loss = 2.233, acc = 0.262
Epoch 1 | Step 1000 | loss = 2.089, acc = 0.276
Epoch 1 | Step 1100 | loss = 1.980, acc = 0.294
Epoch 1 | Step 1200 | loss = 1.919, acc = 0.338
Epoch 1 | Step 1300 | loss = 1.793, acc = 0.335
Epoch 1 | Step 1400 | loss = 1.745, acc = 0.394
Epoch 1 | Step 1500 | loss = 1.685, acc = 0.394
Epoch 1 | Step 1600 | loss = 1.722, acc = 0.386
Epoch 1 | Step 1700 | loss = 1.676, acc = 0.405
Epoch 1 | Step 1800 | loss = 1.569, acc = 0.431
Epoch 1 | Step 1900 | loss = 1.507, acc = 0.438
Epoch 1 | Step 2000 | loss = 1.550, acc = 0.449
Epoch 1 | Step 2100 | loss = 1.402, acc = 0.485
Epoch 1 | Step 2200 | loss = 1.402, acc = 0.480
Epoch 1 | Step 2300 | loss = 1.452, acc = 0.471
Epoch 1 | Step 2400 | loss = 1.403, acc = 0.477
Epoch 1 | Step 2500 | loss = 1.486, acc = 0.456
Epoch 1 | Step 2600 | loss = 1.356, acc = 0.501
Epoch 1 | Step 2700 | loss = 1.301, acc = 0.522
Epoch 1 | Step 2800 | loss = 1.338, acc = 0.514
Epoch 1 | Step 2900 | loss = 1.335, acc = 0.530
Epoch 1 | Step 3000 | loss = 1.291, acc = 0.516
Epoch 1 | Step 3100 | loss = 1.217, acc = 0.512
Epoch 1 | Step 3200 | loss = 1.277, acc = 0.524
Epoch 1 | Step 3300 | loss = 1.333, acc = 0.504
100%|██████████| 3365/3365 [06:59<00:00,  8.02it/s]

Evaluating Dev Set ...
100%|██████████| 2863/2863 [00:57<00:00, 49.74it/s]

Validation | Epoch 1 | acc = 0.468
Saving Model ...
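
The training cell combines a batch size of 8 with 16 gradient accumulation steps. As a rough sketch of what that implies, the helper below (a hypothetical function, not part of the homework code) counts batches, effective batch size, and optimizer updates per epoch. It assumes a final partial accumulation group still triggers an update.

```python
import math


def accumulation_summary(num_examples: int, batch_size: int,
                         accumulation_steps: int) -> dict:
    """Bookkeeping for gradient accumulation over one epoch."""
    batches = math.ceil(num_examples / batch_size)
    return {
        "batches_per_epoch": batches,
        # gradients from this many examples are summed before each update
        "effective_batch_size": batch_size * accumulation_steps,
        # assumes the last, possibly incomplete, group also triggers an update
        "optimizer_updates_per_epoch": math.ceil(batches / accumulation_steps),
    }


# 26918 training QA pairs, batch size 8, 16 accumulation steps
print(accumulation_summary(26918, 8, 16))
```

The 3365 batches match the progress bar in the log above. Since far fewer parameter updates than batches take place, the learning rate and the schedule length may need retuning when `gradient_accumulation_steps` changes.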


## Testing

In [10]:
print("Evaluating Test Set ...")

result = []

model.eval()
with torch.no_grad():
    for data in tqdm(test_loader):
        output = model(input_ids=data[0].squeeze(dim=0).to(device),
                       token_type_ids=data[1].squeeze(dim=0).to(device),
                       attention_mask=data[2].squeeze(dim=0).to(device))
        result.append(evaluate(data, output))

result_file = "result.csv"
with open(result_file, 'w', encoding="utf-8") as f:
    f.write("ID,Answer\n")
    for i, test_question in enumerate(test_questions):
        # Remove commas from answers (since the CSV is comma-separated)
        # Answers in Kaggle are processed in the same way
        f.write(f"{test_question['id']},{result[i].replace(',','')}\n")

print(f"Completed! Result is in {result_file}")

Evaluating Test Set ...


100%|██████████| 3524/3524 [01:11<00:00, 49.01it/s]

Completed! Result is in result.csv





# GradeScope - Question 2 (In-context learning)

### In-context learning
The example prompt is:
```
請從最後一篇的文章中找出最後一個問題的答案：
文章：<文章1 內容>
問題：<問題1 敘述>
答案：<答案1>
...
文章：<文章n 內容>
問題：<問題n 敘述>
答案：
```
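
The template above can be assembled with a small helper. This is a minimal sketch: the function name `build_prompt` and the tuple layout of its arguments are assumptions for illustration, and the placeholder strings stand in for real paragraphs, questions, and answers.

```python
def build_prompt(examples, target,
                 instruction="請從最後一篇的文章中找出最後一個問題的答案："):
    """Assemble a K-shot prompt.

    examples: list of (paragraph, question, answer) tuples shown with answers
    target:   (paragraph, question) tuple whose answer the model should produce
    """
    lines = [instruction]
    for paragraph, question, answer in examples:
        lines += [f"文章：{paragraph}", f"問題：{question}", f"答案：{answer}"]
    paragraph, question = target
    # The final block ends with an empty answer slot for the model to fill in
    lines += [f"文章：{paragraph}", f"問題：{question}", "答案："]
    return "\n".join(lines)


prompt = build_prompt(
    examples=[("<文章1 內容>", "<問題1 敘述>", "<答案1>")],
    target=("<文章2 內容>", "<問題2 敘述>"),
)
print(prompt)
```

Keeping the prompt construction in one function makes it easier to try different instructions or numbers of examples while staying within the model's input length limit.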

In [1]:
import torch
import random
import numpy as np

# To avoid CUDA_OUT_OF_MEMORY
# New tensors created without an explicit dtype default to 32-bit floats on the GPU
torch.set_default_tensor_type(torch.cuda.FloatTensor)

# Fix random seed for reproducibility
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
same_seeds(2)

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-1.7B")
model = AutoModelForCausalLM.from_pretrained("facebook/xglm-1.7B")



In [3]:
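def clean_text(text):
    # Keep only what follows the last "答案" marker. The decoded text may contain a
    # half-width or a full-width colon depending on the tokenizer, so handle both.
    text = text.replace("答案：", "答案:").split("答案:")[-1]
    # Keep only the text up to the first space
    text = text.split(" ")[0]
    return text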

In [4]:
import random
import json

with open("hw7_in-context-learning-examples.json", "r", encoding="utf-8") as f:
    test = json.load(f)

# K-shot learning
# Give the model K examples to help it achieve better accuracy
# Note: (1) When K >= 4, CUDA_OUT_OF_MEMORY may occur.
#       (2) The maximum input length of XGLM is 2048
K = 2

question_ids = [qa["id"] for qa in test["questions"]]

with open("in-context-learning-result.txt", "w") as f:
    print("ID,Ground-Truth,Prediction", file = f)
    with torch.no_grad():
        for idx, qa in enumerate(test["questions"]):
            # You can try different prompts
            prompt = "請從最後一篇的文章中找出最後一個問題的答案\n"
            exist_question_indexs = [question_ids.index(qa["id"])]

            # K-shot learning: give the model K examples with answers
            for i in range(K):
                question_index = question_ids.index(qa["id"])
                while(question_index in exist_question_indexs): 
                    question_index = random.randint(0, len(question_ids) - 1)
                exist_question_indexs.append(question_index)    
                paragraph_id = test["questions"][question_index]["paragraph_id"]
                prompt += f'文章：{test["paragraphs"][paragraph_id]}\n'
                prompt += f'問題：{test["questions"][question_index]["question_text"]}\n'
                prompt += f'答案：{test["questions"][question_index]["answer_text"]}\n'

            # The final question, left unanswered for the model to complete
            paragraph_id = qa["paragraph_id"]
            prompt += f'文章：{test["paragraphs"][paragraph_id]}\n'
            prompt += f'問題：{qa["question_text"]}\n'
            prompt += f'答案：'
            
            inputs = tokenizer(prompt, add_special_tokens=False, return_tensors="pt") 
            sample = model.generate(**inputs, max_new_tokens = 20)
            text = tokenizer.decode(sample[0], skip_special_tokens=True)

            # Note: You can delete this line to see what will happen
            text = clean_text(text)
            
            print(prompt)
            print(f'正確答案: {qa["answer_text"]}')
            print(f'模型輸出: {text}')
            print()

            print(f"{idx},{qa['answer_text']},{text}", file = f)

請從最後一篇的文章中找出最後一個問題的答案
文章：廣州是京廣鐵路，廣深鐵路，廣茂鐵路和廣梅鐵路的終點站。2009年底，武廣客運專線投入運營，多機組列車長980公里，最高時速350公里。2011年1月7日，廣珠城際鐵路投入運營，平均時速200公里。廣州鐵路，長途巴士和渡輪直達香港。廣九快速列車從廣州火車東站出發，直達香港紅磡火車站。總長約182公里，行程大約需要兩個小時。每年都有繁忙的教練從香港的不同乘客點接載乘客。在市中心的珠江北岸有一條渡輪線路，河流居民可以直接過河而無需乘坐公共汽車或步行穿過大橋。每天都有往返南沙碼頭和蓮花山碼頭的高速雙體船。渡輪也開往香港中國客運碼頭和港澳客運碼頭。
問題：廣珠城際鐵路平均每小時可以走多遠？
答案：200公里
文章：自古以來，廣州一直是華南地區的著名商人，擁有2000多年的開放貿易歷史。20世紀70年代末中國大陸改革開放後，廣州經濟發展迅速。2010年，全市地區生產總值10604.48億元，同比增長13％。它成為僅次於上海和北京的第三個進入“萬億元俱樂部”國內生產總值的城市。這也是第一個超過一萬億的經濟總量。首都。根據國務院2005年發布的報告，廣州成為中國第一個進入“發達”狀態的城市。2012年9月，廣州南沙新區獲批，成為第六個國家級開放開發新區。2015年，廣州GDP達到1810.41億元，人均GDP達到138,377.05元。國內生產總值是中國的第三位，人均國內生產總值與西班牙相當。購買力平價水平與發達國家相當。
問題：進入國內生產總值「萬億元俱樂部」的城市第三個為？
答案：廣州
文章：2010年引入的廣州快速交通運輸系統是世界第二大快速運輸系統。每日載客量可達100萬人次。每小時的客流量峰值高達26,900名乘客，僅次於波哥大的快速交通系統。每10秒有一輛公共汽車，每輛公共汽車在一個方向上行駛350小時。該平台包括橋樑，是世界上最長的國家公共汽車快速運輸系統平台，長度為260米。目前，廣州市的出租車和公交車主要以液化石油氣為燃料，部分公交車採用油電，氣電混合技術。2012年底，一輛LNG燃料公共汽車開始啟動。2014年6月，引入了LNG插電式混合動力公交車取代LPG公交車。2007年1月16日，廣州市政府完全禁止在城市地區駕駛摩托車。違反禁令的機動車將被沒收。廣州市交通局聲稱，禁令的實施導致交通擁堵和車禍大大減少。廣州白