# **Homework 7 - Bert (Question Answering)**

If you have any questions, feel free to email us at mlta-2022-spring@googlegroups.com



Slide:    [Link](https://docs.google.com/presentation/d/1H5ZONrb2LMOCixLY7D5_5-7LkIaXO6AGEaV2mRdTOMY/edit?usp=sharing)　Kaggle: [Link](https://www.kaggle.com/c/ml2022spring-hw7)　Data: [Link](https://drive.google.com/uc?id=1AVgZvy3VFeg0fX-6WQJMHPVrx3A-M1kb)　Ref: [Link](https://github.com/pai4451/ML2021/tree/main/hw7)




## Task description
- Chinese Extractive Question Answering
  - Input: Paragraph + Question
  - Output: Answer

- Objective: Learn how to fine-tune a pretrained model on a downstream task using transformers

- Todo
    - Fine-tune a pretrained Chinese BERT model
    - Change hyperparameters (e.g. doc_stride)
    - Apply linear learning rate decay
    - Try other pretrained models
    - Improve preprocessing
    - Improve postprocessing
- Training tips
    - Automatic mixed precision
    - Gradient accumulation
    - Ensemble

- Estimated training time (Tesla T4 with automatic mixed precision enabled)
    - Simple: 8mins
    - Medium: 8mins
    - Strong: 25mins
    - Boss: 2.5hrs
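
Gradient accumulation, listed among the training tips above, simulates a larger batch by summing gradients over several small batches before each optimizer step. The framework-free sketch below uses a toy one-parameter model (all names and numbers are illustrative and not part of the homework code) to show why each micro-batch loss is divided by the number of accumulation steps: the accumulated gradient then matches the full-batch gradient.

```python
# Toy model: y_hat = w * x, loss = mean((w * x - y) ** 2).
# Hypothetical helper names; this only illustrates the arithmetic.

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

# Gradient of one large batch of 8 samples
full_grad = grad_mse(w, xs, ys)

# The same 8 samples as 4 micro-batches of 2, accumulated
accum_steps = 4
micro = len(xs) // accum_steps
accum_grad = 0.0
for k in range(accum_steps):
    lo, hi = k * micro, (k + 1) * micro
    # Divide by accum_steps so the sum equals the full-batch mean
    accum_grad += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps

print(full_grad, accum_grad)
```

In a training loop this would correspond to scaling the loss by `1 / accum_steps` before the backward pass and calling `optimizer.step()` and `optimizer.zero_grad()` only once every `accum_steps` batches.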
  

In [None]:
# For this HW, K80 < P4 < T4 < P100 <= T4(fp16) < V100
!nvidia-smi

## Download Dataset

In [None]:
%%script false --no-raise-error

import os

# Download link 1
!gdown --id '1AVgZvy3VFeg0fX-6WQJMHPVrx3A-M1kb' --output hw7_data.zip

# Download Link 2 (if the above link fails) 
# !gdown --id '1qwjbRjq481lHsnTrrF4OjKQnxzgoLEFR' --output hw7_data.zip

# Download Link 3 (if the above link fails) 
# !gdown --id '1QXuWjNRZH6DscSd6QcRER0cnxmpZvijn' --output hw7_data.zip

!unzip -o hw7_data.zip

dev_path = os.path.join(".", "hw7_dev.json")
test_path = os.path.join(".", "hw7_test.json")
train_path = os.path.join(".", "hw7_train.json")

In [None]:
import os

# kaggle
kaggle_data_folder = "../input/ml2022spring-hw7"

dev_path = os.path.join(kaggle_data_folder, "hw7_dev.json")
test_path = os.path.join(kaggle_data_folder, "hw7_test.json")
train_path = os.path.join(kaggle_data_folder, "hw7_train.json")

## Install transformers

Documentation for the toolkit:　https://huggingface.co/transformers/

In [None]:
# You are allowed to change version of transformers or use other toolkits
!pip install transformers==4.5.0

## Import Packages

In [None]:
import json
import numpy as np
import random
import torch
from torch.utils.data import DataLoader, Dataset 
from transformers import AdamW, BertForQuestionAnswering, BertTokenizerFast

from tqdm.auto import tqdm
import matplotlib.pyplot as plt

device = "cuda" if torch.cuda.is_available() else "cpu"

# Fix random seed for reproducibility
def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
same_seeds(0)

device

In [None]:
# Change "fp16_training" to True to support automatic mixed precision training (fp16)    
fp16_training = True

if fp16_training:
    !pip install accelerate==0.2.0
    from accelerate import Accelerator
    accelerator = Accelerator(fp16=True)
    device = accelerator.device

# Documentation for the toolkit:  https://huggingface.co/docs/accelerate/

## Load Model and Tokenizer

In [None]:
model_name = "hfl/chinese-macbert-large"
# model_name = "Langboat/mengzi-bert-base"
model = BertForQuestionAnswering.from_pretrained(model_name).to(device)
tokenizer = BertTokenizerFast.from_pretrained(model_name)

# model = BertForQuestionAnswering.from_pretrained("bert-base-chinese").to(device)
# tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

# model = BertForQuestionAnswering.from_pretrained("hfl/chinese-roberta-wwm-ext").to(device)
# tokenizer = BertTokenizerFast.from_pretrained("hfl/chinese-roberta-wwm-ext")

# model = BertForQuestionAnswering.from_pretrained("hfl/chinese-bert-wwm-ext").to(device)
# tokenizer = BertTokenizerFast.from_pretrained("hfl/chinese-bert-wwm-ext")

# model = BertForQuestionAnswering.from_pretrained("Langboat/mengzi-bert-base").to(device)
# tokenizer = BertTokenizerFast.from_pretrained("Langboat/mengzi-bert-base")

# You can safely ignore the warning message (it pops up because new prediction heads for QA are initialized randomly)

## Read Data

- Training set: 31690 QA pairs
- Dev set: 4131  QA pairs
- Test set: 4957  QA pairs

- {train/dev/test}_questions:    
  - List of dicts with the following keys:
   - id (int)
   - paragraph_id (int)
   - question_text (string)
   - answer_text (string)
   - answer_start (int)
   - answer_end (int)
- {train/dev/test}_paragraphs: 
  - List of strings
  - paragraph_ids in questions correspond to indices in paragraphs
  - A paragraph may be used by several questions 
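
The layout described above can be illustrated with a tiny hand-made example. The content below is invented for illustration and is not taken from the dataset. It also assumes that `answer_end` is an inclusive character index, which is how the dataset code later in this notebook treats it.

```python
import json

# Hypothetical miniature file with the same top-level keys as hw7_*.json
toy = json.loads("""
{
  "questions": [
    {"id": 0, "paragraph_id": 0, "question_text": "Where is the airport?",
     "answer_text": "Hangzhou", "answer_start": 18, "answer_end": 25},
    {"id": 1, "paragraph_id": 0, "question_text": "What is in Hangzhou?",
     "answer_text": "The airport", "answer_start": 0, "answer_end": 10}
  ],
  "paragraphs": ["The airport is in Hangzhou."]
}
""")

questions, paragraphs = toy["questions"], toy["paragraphs"]

spans = []
for q in questions:
    # paragraph_id is an index into the list of paragraphs
    paragraph = paragraphs[q["paragraph_id"]]
    # answer_end is assumed inclusive, hence the + 1
    span = paragraph[q["answer_start"] : q["answer_end"] + 1]
    spans.append(span)
    print(q["id"], repr(span), span == q["answer_text"])
```

Both questions point at the same paragraph, which mirrors the note that a paragraph may be used by several questions.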

In [None]:
!pip install torchinfo
from torchinfo import summary

summary(model)

In [None]:
def read_data(file):
    with open(file, 'r', encoding="utf-8") as reader:
        data = json.load(reader)
    return data["questions"], data["paragraphs"]

train_questions, train_paragraphs = read_data(train_path)
dev_questions, dev_paragraphs = read_data(dev_path)
test_questions, test_paragraphs = read_data(test_path)

## Tokenize Data

In [None]:
%%script false --no-raise-error
# test_paragraphs = ["2022白堊紀滅絕事件", "杭州筧橋機場", "蔡鍔", "丁旿"]
from itertools import chain

unks = set();
for p in chain(train_paragraphs, dev_paragraphs, test_paragraphs):
    t = tokenizer(p, return_offsets_mapping=True)
#     print(p)
#     print(t['input_ids'])
#     print(t['offset_mapping'])
    unks |= {p[pos[0]:pos[1]] for token_id, pos in zip(t['input_ids'], t['offset_mapping']) if token_id == 100}

print(unks)

# Add each unknown string to the vocabulary once (sorted for a reproducible id order)
tokenizer.add_tokens(sorted(u for u in unks if u))
    
model.resize_token_embeddings(len(tokenizer))

In [None]:
%%script false --no-raise-error
unks_check = set();
for p in chain(train_paragraphs, dev_paragraphs, test_paragraphs):
    t = tokenizer(p, return_offsets_mapping=True)
    unks_check |= {p[pos[0]:pos[1]] for token_id, pos in zip(t['input_ids'], t['offset_mapping']) if token_id == 100}

print(unks_check)
print(unks & unks_check)

In [None]:
# Tokenize questions and paragraphs separately
# 「add_special_tokens」 is set to False since special tokens will be added when tokenized questions and paragraphs are combined in dataset __getitem__ 

train_questions_tokenized = tokenizer([train_question["question_text"] for train_question in train_questions], add_special_tokens=False)
dev_questions_tokenized = tokenizer([dev_question["question_text"] for dev_question in dev_questions], add_special_tokens=False)
test_questions_tokenized = tokenizer([test_question["question_text"] for test_question in test_questions], add_special_tokens=False) 

train_paragraphs_tokenized = tokenizer(train_paragraphs, add_special_tokens=False, return_offsets_mapping=True)
dev_paragraphs_tokenized = tokenizer(dev_paragraphs, add_special_tokens=False, return_offsets_mapping=True)
test_paragraphs_tokenized = tokenizer(test_paragraphs, add_special_tokens=False, return_offsets_mapping=True)

# You can safely ignore the warning message, as tokenized sequences will be further processed in dataset __getitem__ before being passed to the model

## Dataset and Dataloader
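
The training branch of `__getitem__` below crops one window per question, and the window position is randomized so that the answer does not always sit at the center. The following framework-free sketch uses the same arithmetic as the dataset code, but the function name, variable names and example numbers are hypothetical.

```python
import random

def sample_window(answer_start, answer_end, paragraph_len, max_len, rng=random):
    """Pick a window [start, start + max_len) that contains the answer."""
    quarter = max_len // 4
    # Keep the answer start at most 3/4 of the way into the window
    # and the answer end at least 1/4 of the way in
    start = rng.randint(answer_start + quarter - max_len, answer_end - quarter)
    # Clamp so the window stays inside the paragraph where possible
    start = max(0, min(start, paragraph_len - max_len))
    return start, start + max_len

rng = random.Random(0)
# Answer occupies tokens 400..405 of a 1000-token paragraph
windows = [sample_window(400, 405, 1000, 320, rng) for _ in range(1000)]
# Position of the answer start inside each sampled window
offsets = [400 - start for start, _ in windows]
print(min(offsets), max(offsets))
```

Because the offset of the answer inside the window varies from sample to sample, the model cannot rely on a fixed answer position.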

In [None]:
import random

class QA_Dataset(Dataset):
    def __init__(self, split, questions, tokenized_questions, tokenized_paragraphs):
        self.split = split
        self.questions = questions
        self.tokenized_questions = tokenized_questions
        self.tokenized_paragraphs = tokenized_paragraphs
        self.max_question_len = 81
        self.max_paragraph_len = 320
        
        ##### TODO: Change value of doc_stride #####
        self.doc_stride = 270

        # Input sequence length = [CLS] + question + [SEP] + paragraph + [SEP]
        self.max_seq_len = 1 + self.max_question_len + 1 + self.max_paragraph_len + 1

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        question = self.questions[idx]
        tokenized_question = self.tokenized_questions[idx]
        tokenized_paragraph = self.tokenized_paragraphs[question["paragraph_id"]]

        ##### TODO: Preprocessing #####
        # Hint: How can the model be prevented from learning something it should not learn,
        # such as "the answer is always at the center of the window"?

        if self.split == "train":
            # Convert answer's start/end positions in paragraph_text to start/end positions in tokenized_paragraph  
            answer_start_token = tokenized_paragraph.char_to_token(question["answer_start"])
            answer_end_token = tokenized_paragraph.char_to_token(question["answer_end"])

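            # Randomly place the window so that the answer start lies at most 3/4 of the way in
            # and the answer end at least 1/4 of the way in (the answer is not always centered)
            paragraph_start = random.randint(
                answer_start_token + self.max_paragraph_len // 4 - self.max_paragraph_len,
                answer_end_token - self.max_paragraph_len // 4,
            )
            # Clamp the window to the paragraph so that it holds as many paragraph tokens as possible
            paragraph_start = max(0, min(paragraph_start, len(tokenized_paragraph) - self.max_paragraph_len))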
            paragraph_end = paragraph_start + self.max_paragraph_len
            
            check = paragraph_start <= answer_start_token < paragraph_end and paragraph_start <= answer_end_token < paragraph_end
            if not check:
                print("question:", tokenizer.decode(tokenized_question.ids[:self.max_question_len]).replace(" ", ""))
                print("answer:", tokenizer.decode(tokenized_paragraph.ids[answer_start_token : answer_end_token+1]).replace(" ", ""))
                print("paragraph:", tokenizer.decode(tokenized_paragraph.ids[paragraph_start : paragraph_end]).replace(" ", ""))
                print("")
            assert(check)
            
            # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
            input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102] 
            input_ids_paragraph = tokenized_paragraph.ids[paragraph_start : paragraph_end] + [102]     
            
            # Convert answer's start/end positions in tokenized_paragraph to start/end positions in the window  
            answer_start_token += len(input_ids_question) - paragraph_start
            answer_end_token += len(input_ids_question) - paragraph_start
            
            # Pad sequence and obtain inputs to model 
            input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
            return torch.tensor(input_ids), torch.tensor(token_type_ids), torch.tensor(attention_mask), answer_start_token, answer_end_token

        # Validation/Testing
        else:
            input_ids_list, token_type_ids_list, attention_mask_list, offsets_list = [], [], [], []
            
            # Paragraph is split into several windows, each with start positions separated by step "doc_stride"
            for i in range(0, len(tokenized_paragraph), self.doc_stride):
                
                # Slice question/paragraph and add special tokens (101: CLS, 102: SEP)
                input_ids_question = [101] + tokenized_question.ids[:self.max_question_len] + [102]
                input_ids_paragraph = tokenized_paragraph.ids[i : i + self.max_paragraph_len] + [102]
                
                # Pad sequence and obtain inputs to model
                input_ids, token_type_ids, attention_mask = self.padding(input_ids_question, input_ids_paragraph)
                
                input_ids_list.append(input_ids)
                token_type_ids_list.append(token_type_ids)
                attention_mask_list.append(attention_mask)

            return torch.tensor(input_ids_list), torch.tensor(token_type_ids_list), torch.tensor(attention_mask_list)

    def padding(self, input_ids_question, input_ids_paragraph):
        # Pad zeros if sequence length is shorter than max_seq_len
        padding_len = self.max_seq_len - len(input_ids_question) - len(input_ids_paragraph)
        # Indices of input sequence tokens in the vocabulary
        input_ids = input_ids_question + input_ids_paragraph + [0] * padding_len
        # Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]
        token_type_ids = [0] * len(input_ids_question) + [1] * len(input_ids_paragraph) + [0] * padding_len
        # Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]
        attention_mask = [1] * (len(input_ids_question) + len(input_ids_paragraph)) + [0] * padding_len
        
        return input_ids, token_type_ids, attention_mask

train_set = QA_Dataset("train", train_questions, train_questions_tokenized, train_paragraphs_tokenized)
dev_set = QA_Dataset("dev", dev_questions, dev_questions_tokenized, dev_paragraphs_tokenized)
test_set = QA_Dataset("test", test_questions, test_questions_tokenized, test_paragraphs_tokenized)

train_batch_size = 8

# Note: Do NOT change batch size of dev_loader / test_loader !
# Although batch size=1, it is actually a batch consisting of several windows from the same QA pair
train_loader = DataLoader(train_set, batch_size=train_batch_size, shuffle=True, pin_memory=True)
dev_loader = DataLoader(dev_set, batch_size=1, shuffle=False, pin_memory=True)
test_loader = DataLoader(test_set, batch_size=1, shuffle=False, pin_memory=True)
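
For validation and testing, the dataset above cuts each paragraph into windows whose start positions are `doc_stride` tokens apart. The sketch below lists those windows for a hypothetical paragraph length. `window_spans` is an illustrative helper and not part of the notebook.

```python
def window_spans(num_tokens, max_paragraph_len, doc_stride):
    """Return (start, end) token spans, end exclusive, one per window."""
    return [
        (i, min(i + max_paragraph_len, num_tokens))
        for i in range(0, num_tokens, doc_stride)
    ]

# Settings used in this notebook: windows overlap by 320 - 270 = 50 tokens
spans = window_spans(700, max_paragraph_len=320, doc_stride=270)
print(spans)

# With doc_stride larger than max_paragraph_len, some tokens are never seen
gappy = window_spans(700, max_paragraph_len=320, doc_stride=400)
covered = {t for s, e in gappy for t in range(s, e)}
print(len(covered), "of 700 tokens covered")
```

A token at index `j` inside window `k` corresponds to paragraph token `k * doc_stride + j`, which is the conversion the `evaluate` function in this notebook relies on.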

In [None]:
# check train data
for i, data in enumerate(train_loader):
    pass

# check question length
# Index the questions directly: with train_batch_size > 1, a DataLoader index counts batches, not questions
for i in range(len(train_questions)):
    if i in [1152]:
        # wrong data
        continue
    if len(train_questions_tokenized[i].ids) > train_set.max_question_len:
        print(i, train_questions[i]['question_text'])
        print(train_questions_tokenized[i].tokens)
        print(f"{len(train_questions_tokenized[i].ids)} > max_question_len:{train_set.max_question_len}")
        raise ValueError("train question exceeds max_question_len")
        
for i in range(len(dev_questions)):
    if len(dev_questions_tokenized[i].ids) > dev_set.max_question_len:
        print(i, dev_questions[i]['question_text'])
        print(dev_questions_tokenized[i].tokens)
        print(f"{len(dev_questions_tokenized[i].ids)} > max_question_len:{dev_set.max_question_len}")
        raise ValueError("dev question exceeds max_question_len")
        
for i in range(len(test_questions)):
    if len(test_questions_tokenized[i].ids) > test_set.max_question_len:
        print(i, test_questions[i]['question_text'])
        print(test_questions_tokenized[i].tokens)
        print(f"{len(test_questions_tokenized[i].ids)} > max_question_len:{test_set.max_question_len}")
        raise ValueError("test question exceeds max_question_len")

## Function for Evaluation

In [None]:
def get_answer(p, offsets, start, end):
    if start is None or end is None:
        return ""
    
    return p[offsets[start][0]:offsets[end][1]]
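
`get_answer` slices the original paragraph text using the tokenizer's character offsets instead of decoding token ids, so characters that the tokenizer maps to `[UNK]` are still returned verbatim. The small sketch below uses hand-written offsets (assumed values standing in for a real `offset_mapping`; no tokenizer is run here) and a hypothetical function name.

```python
def get_answer_from_offsets(paragraph, offsets, start, end):
    """Return the text covered by tokens start..end (both inclusive)."""
    if start is None or end is None:
        return ""
    return paragraph[offsets[start][0] : offsets[end][1]]

paragraph = "BERT was released in 2018."
# Hypothetical (char_start, char_end) pairs, end exclusive, one per token
offsets = [(0, 4), (5, 8), (9, 17), (18, 20), (21, 25), (25, 26)]

single = get_answer_from_offsets(paragraph, offsets, 4, 4)
multi = get_answer_from_offsets(paragraph, offsets, 0, 2)
empty = get_answer_from_offsets(paragraph, offsets, None, None)
print(repr(single), repr(multi), repr(empty))
```

Slicing the source text also keeps the original spacing, which decoding token ids does not guarantee.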

In [None]:
def evaluate(data, output, doc_stride, paragraph, paragraph_tokenized, offsets):
    ##### TODO: Postprocessing #####
    # There is a bug and room for improvement in postprocessing 
    # Hint: Open your prediction file to see what is wrong 
    
    max_prob = float('-inf')
    num_of_windows = data[0].shape[1]
    
    # Token indices relative to the whole tokenized paragraph (not to a single window)
    entire_start_index = None
    entire_end_index = None
    
    for k in range(num_of_windows):            
        # print('window',k)
        # Obtain answer by choosing the most probable start position / end position
        mask = data[1][0][k].bool() & data[2][0][k].bool() # token type & attention mask
        masked_output_start = torch.masked_select(output.start_logits[k].cpu(), mask)[:-1] # -1 is [SEP]
        masked_output_end = torch.masked_select(output.end_logits[k].cpu(), mask)[:-1] # -1 is [SEP]
        
        start_prob, start_index = torch.max(masked_output_start, dim=0)     
        end_prob, end_index = torch.max(masked_output_end[start_index:], dim=0)
        end_index += start_index
        # end_index is inclusive, so the answer covers tokens start_index..end_index
        
        # Score of the answer is the sum of the start logit and the end logit
        prob = start_prob + end_prob
#         masked_data = torch.masked_select(data[0][0][k], mask)[:-1] # -1 is [SEP]

        # Replace answer if calculated probability is larger than previous windows
        if (prob > max_prob) and (end_index - start_index <= 30):
            max_prob = prob
            entire_start_index = start_index.item() + doc_stride * k
            entire_end_index = end_index.item() + doc_stride * k
            # print('entire_start_index',entire_start_index)
            # print('entire_end_index',entire_end_index)
#             # Convert tokens to chars (e.g. [1920, 7032] --> "大 金")
#             answer = tokenizer.decode(masked_data[start_index : end_index + 1])

    answer = get_answer(paragraph, offsets, entire_start_index, entire_end_index)
    return answer.strip()
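
`evaluate` above picks the best start position first, then the best end position after it, and only afterwards rejects spans longer than 30 tokens. One possible refinement, shown here as a sketch and not as the notebook's method, is to score every valid (start, end) pair and keep the highest-scoring one that satisfies the length limit. The function name and the toy logits are illustrative.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (score, start, end) maximizing start + end logit, end inclusive."""
    best = (float("-inf"), None, None)
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start and within the length limit
        last = min(len(end_logits), s + max_answer_len + 1)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best[0]:
                best = (score, s, e)
    return best

start_logits = [0.0, 5.0, 0.5, 4.5, 0.0, 0.25]
end_logits = [0.0, 0.25, 0.5, 0.25, 6.0, 0.5]

loose = best_span(start_logits, end_logits, max_answer_len=3)
tight = best_span(start_logits, end_logits, max_answer_len=1)
print(loose, tight)
```

With the tighter limit, the greedy choice (start 1, end 4) would be rejected as too long and leave no answer for this window, while the joint search falls back to the next best valid span.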

## Training

## Optimizer: Adam + lr scheduling
Inverse square root scheduling is important for stable Transformer training, and it was later applied to RNNs as well.
The learning rate is updated according to the following equation: it increases linearly during the warmup stage, then decays proportionally to the inverse square root of the step number.
$$lrate = d_{\text{model}}^{-0.5}\cdot\min({step\_num}^{-0.5},{step\_num}\cdot{warmup\_steps}^{-1.5})$$
$$lrate = \frac{1}{\sqrt{d_{\text{model}}}}\cdot\min(\frac{1}{\sqrt{{step\_num}}}, \frac{{step\_num}}{{warmup\_steps}}\cdot\frac{1}{\sqrt{{warmup\_steps}}})$$
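
As a quick numeric check of the formula above, the two arguments of the `min` coincide at `step_num = warmup_steps`, which is where the rate peaks. The sketch below uses illustrative values (`d_model = 1024` and `warmup_steps = 100` are assumptions chosen for the example) and a differently named helper so that it does not clash with the notebook's own function.

```python
def noam_rate(d_model, step_num, warmup_steps):
    """Linear warmup followed by inverse square root decay."""
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

d_model, warmup_steps = 1024, 100
rates = {s: noam_rate(d_model, s, warmup_steps) for s in (1, 50, 100, 400)}
for s, r in rates.items():
    print(s, r)

# Value at step_num == warmup_steps, where both branches of min() agree
peak = d_model ** -0.5 * warmup_steps ** -0.5
```

After the peak, quadrupling the step number halves the rate, as expected from the inverse square root.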

In [None]:
def get_rate(d_model, step_num, warmup_step):
    # Learning rate computed from the equation shown above
    lr = d_model**(-0.5)*min(step_num**(-0.5), step_num*warmup_step**(-1.5))
    return lr

In [None]:
class NoamOpt:
    "Optim wrapper that implements rate."
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0
    
    @property
    def param_groups(self):
        return self.optimizer.param_groups
        
    def step(self):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()
        
    def rate(self, step = None):
        "Implement `lrate` above"
        if step is None:
            step = self._step
        return 0 if not step else self.factor * get_rate(self.model_size, step, self.warmup)

    def zero_grad(self):
        self.optimizer.zero_grad()

## Hyperparameter

In [None]:
num_epoch = 3
validation = False
logging_step = 100
learning_rate = 1e-5
optimizer = AdamW(model.parameters(), lr=learning_rate)

# warm up
lr_factor = 2.0
lr_warmup = 100

## Scheduling Visualized

In [None]:
optimizer_decay = NoamOpt(
    model_size=1200000, 
    factor=lr_factor, 
    warmup=lr_warmup, 
    optimizer=AdamW(model.parameters(), lr=0)
)

total = 1000
plt.plot(np.arange(1, total), [optimizer_decay.rate(i) for i in range(1, total)])
plt.legend([f"{optimizer_decay.model_size}:{optimizer_decay.warmup}"])

# optimizer = optimizer_decay

In [None]:
from transformers import get_polynomial_decay_schedule_with_warmup

def show_plot():
    optimizer = torch.optim.SGD(torch.nn.Linear(2, 1).parameters(), lr=learning_rate)
    total = num_epoch * len(train_loader)
    scheduler = get_polynomial_decay_schedule_with_warmup(optimizer, total//10, total, lr_end=learning_rate/10, power=2)

    lrs = []

    for i in range(total):
        optimizer.step()
        lrs.append(optimizer.param_groups[0]["lr"])
        scheduler.step()

    plt.plot(range(total), lrs)

    print(lrs[-5:])
    
show_plot()

In [None]:
from transformers import get_linear_schedule_with_warmup

def show_plot():
    optimizer = torch.optim.SGD(torch.nn.Linear(2, 1).parameters(), lr=learning_rate)
    total = num_epoch * len(train_loader)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps= 0, # Default value
                                                num_training_steps=total)

    lrs = []

    for i in range(total):
        optimizer.step()
        lrs.append(optimizer.param_groups[0]["lr"])
        scheduler.step()

    plt.plot(range(total), lrs)

    print(lrs[-5:])
    
show_plot()

In [None]:
total = num_epoch * len(train_loader)
scheduler = get_linear_schedule_with_warmup(optimizer, 0, total)
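
The multiplier applied by a linear warmup and decay schedule can be written out directly. The sketch below reproduces the shape in plain Python under the assumption that the rate rises linearly from 0 over `num_warmup_steps` and then falls linearly to 0 at `num_training_steps`. `linear_factor` is an illustrative name, and the step counts are example values.

```python
def linear_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
total_steps = 1000

# No warmup, as in the scheduler created above
lrs = [base_lr * linear_factor(s, 0, total_steps) for s in range(total_steps + 1)]
print(lrs[0], lrs[500], lrs[1000])

# With 100 warmup steps the rate starts at 0 and peaks at step 100
warm = [base_lr * linear_factor(s, 100, total_steps) for s in range(total_steps + 1)]
print(warm[0], warm[100], warm[1000])
```

With zero warmup steps the schedule starts at the full base learning rate and reaches 0 exactly at the last step.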

In [None]:
# if not validation:
#     dev_set = QA_Dataset("train", dev_questions, dev_questions_tokenized, dev_paragraphs_tokenized)
#     train_set = torch.utils.data.ConcatDataset([train_set, dev_set])
#     train_loader = DataLoader(train_set, batch_size=train_batch_size, shuffle=True, pin_memory=True)


if fp16_training:
    model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader) 

model.train()

print("Start Training ...")

for epoch in range(num_epoch):
    # Start at 0 so that each logged average covers exactly logging_step batches
    step = 0
    train_loss = train_acc = 0
    for data in tqdm(train_loader):    
        # Load all data into GPU
        data = [i.to(device) for i in data]
        
        # Model inputs: input_ids, token_type_ids, attention_mask, start_positions, end_positions (Note: only "input_ids" is mandatory)
        # Model outputs: start_logits, end_logits, loss (return when start_positions/end_positions are provided)  
        output = model(input_ids=data[0], token_type_ids=data[1], attention_mask=data[2], start_positions=data[3], end_positions=data[4])

        # Choose the most probable start position / end position
        start_index = torch.argmax(output.start_logits, dim=1)
        end_index = torch.argmax(output.end_logits, dim=1)
        
        # Prediction is correct only if both start_index and end_index are correct
        train_acc += ((start_index == data[3]) & (end_index == data[4])).float().mean()
        train_loss += output.loss
        
        if fp16_training:
            accelerator.backward(output.loss)
        else:
            output.loss.backward()
        
        optimizer.step()
        optimizer.zero_grad()
        step += 1

        ##### TODO: Apply linear learning rate decay #####
        scheduler.step()
        
        
        # Print training loss and accuracy over past logging step
        if step % logging_step == 0:
            print(f"Epoch {epoch + 1} | Step {step} | loss = {train_loss.item() / logging_step:.3f}, acc = {train_acc / logging_step:.3f} | lr = {optimizer.param_groups[0]['lr']:.6f}")
            train_loss = train_acc = 0

    if validation:
        print("Evaluating Dev Set ...")
        model.eval()
        with torch.no_grad():
            dev_acc = 0
            for i, data in enumerate(tqdm(dev_loader)):
                output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                       attention_mask=data[2].squeeze(dim=0).to(device))
                # prediction is correct only if answer text exactly matches
                dev_acc += evaluate(data, output, dev_set.doc_stride, 
                                    dev_paragraphs[dev_questions[i]['paragraph_id']],
                                    dev_paragraphs_tokenized[dev_questions[i]['paragraph_id']].tokens,
                                    dev_paragraphs_tokenized[dev_questions[i]['paragraph_id']].offsets,
                                    ) == dev_questions[i]["answer_text"]
                
            print(f"Validation | Epoch {epoch + 1} | acc = {dev_acc / len(dev_loader):.3f}")
        model.train()

# Save a model and its configuration file to the directory 「saved_model」 
# i.e. there are two files under the directory 「saved_model」: 「pytorch_model.bin」 and 「config.json」
# Saved model can be re-loaded using 「model = BertForQuestionAnswering.from_pretrained("saved_model")」
print("Saving Model ...")
model_save_dir = "saved_model" 
model.save_pretrained(model_save_dir)

## Testing

In [None]:
# model = BertForQuestionAnswering.from_pretrained("../input/hw07tmp/saved_model").to(device)

In [None]:
print("Evaluating Test Set ...")

result = []

model.eval()
with torch.no_grad():
    for i, data in enumerate(tqdm(test_loader)):
        # batch size = 1 so squeeze dim=0
        output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                       attention_mask=data[2].squeeze(dim=0).to(device))
        result.append(evaluate(data, output, test_set.doc_stride, 
                               test_paragraphs[test_questions[i]['paragraph_id']],
                               test_paragraphs_tokenized[test_questions[i]['paragraph_id']].tokens,
                               test_paragraphs_tokenized[test_questions[i]['paragraph_id']].offsets,
                               ))

result_file = "result.csv"
with open(result_file, 'w') as f:
    f.write("ID,Answer\n")
    for i, test_question in enumerate(test_questions):
        # Replace commas in answers with empty strings (since csv is separated by comma)
        # Answers in kaggle are processed in the same way
        f.write(f"{test_question['id']},{result[i].replace(',','')}\n")

print(f"Completed! Result is in {result_file}")

## doc_stride

In [None]:
%%script false --no-raise-error
x = [dev_set.doc_stride]#list(range(10, dev_set.max_paragraph_len + 10, 5))
y = []
for doc_stride in tqdm(x):
    dev_set.doc_stride = doc_stride
    model.eval()
    with torch.no_grad():
        dev_acc = 0
        stop_len = 10000
        for i, data in enumerate(tqdm(dev_loader)):
            if i >= stop_len:
                break
            output = model(input_ids=data[0].squeeze(dim=0).to(device), token_type_ids=data[1].squeeze(dim=0).to(device),
                   attention_mask=data[2].squeeze(dim=0).to(device))
            # prediction is correct only if answer text exactly matches
            dev_acc += evaluate(data, output, dev_set.doc_stride, 
                                dev_paragraphs[dev_questions[i]['paragraph_id']],
                                dev_paragraphs_tokenized[dev_questions[i]['paragraph_id']].tokens,
                                dev_paragraphs_tokenized[dev_questions[i]['paragraph_id']].offsets,
                                ) == dev_questions[i]["answer_text"]
            
        acc = dev_acc / min(stop_len, len(dev_loader))
        print(f"{doc_stride} => acc:{acc}")
        y.append(acc)

plt.plot(x,y);
plt.show()

## Ensemble

In [None]:
# Settings that differ between the two models (model1; model2)
# self.max_question_len = 81; 40
# self.max_paragraph_len = 320; 350
# self.doc_stride = 270; 300

# https://www.kaggle.com/code/zwindr/hw07-bert version:16
model1 = BertForQuestionAnswering.from_pretrained("../input/hw07tmp/saved_model").to(device)
# https://www.kaggle.com/code/zwindr/pai4451-ml2021-hw7-hw7-macbert4-ipynb version:1
model2 = BertForQuestionAnswering.from_pretrained("../input/hw7-macbert4tmp/saved_model/macbert4").to(device)

models = [model1, model2]

In [None]:
def evaluateEn(data, outputs, doc_stride, paragraph, paragraph_tokenized, offsets):
    ##### TODO: Postprocessing #####
    # There is a bug and room for improvement in postprocessing 
    # Hint: Open your prediction file to see what is wrong 
    
    max_prob = float('-inf')
    num_of_windows = data[0].shape[1]
    
    # Token indices relative to the whole tokenized paragraph (not to a single window)
    entire_start_index = None
    entire_end_index = None
    
    output_start_logits = sum(output.start_logits.cpu() for output in outputs)
    output_end_logits = sum(output.end_logits.cpu() for output in outputs)
    
    for k in range(num_of_windows):            
        # print('window',k)
        # Obtain answer by choosing the most probable start position / end position
        mask = data[1][0][k].bool() & data[2][0][k].bool() # token type & attention mask
        masked_output_start = torch.masked_select(output_start_logits[k], mask)[:-1] # -1 is [SEP]
        masked_output_end = torch.masked_select(output_end_logits[k], mask)[:-1] # -1 is [SEP]
        
        start_prob, start_index = torch.max(masked_output_start, dim=0)     
        end_prob, end_index = torch.max(masked_output_end[start_index:], dim=0)
        end_index += start_index
        # end_index is inclusive, so the answer covers tokens start_index..end_index
        
        # Score of the answer is the sum of the summed start logit and the summed end logit
        prob = start_prob + end_prob
#         masked_data = torch.masked_select(data[0][0][k], mask)[:-1] # -1 is [SEP]

        # Replace answer if calculated probability is larger than previous windows
        if (prob > max_prob) and (end_index - start_index <= 30):
            max_prob = prob
            entire_start_index = start_index.item() + doc_stride * k
            entire_end_index = end_index.item() + doc_stride * k
            # print('entire_start_index',entire_start_index)
            # print('entire_end_index',entire_end_index)
#             # Convert tokens to chars (e.g. [1920, 7032] --> "大 金")
#             answer = tokenizer.decode(masked_data[start_index : end_index + 1])

    answer = get_answer(paragraph, offsets, entire_start_index, entire_end_index)
    return answer.strip()
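
`evaluateEn` above ensembles the models by adding their start logits and their end logits element-wise before searching for the answer span. The toy illustration below uses made-up start logits for two hypothetical models over a five-token window.

```python
def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)

# Made-up start logits from two hypothetical models
model_a = [0.5, 3.0, 2.5, 0.0, 0.0]
model_b = [0.0, 0.5, 2.5, 2.75, 0.0]

# Element-wise sum, mirroring how the logits are combined above
summed = [a + b for a, b in zip(model_a, model_b)]
picks = (argmax(model_a), argmax(model_b), argmax(summed))
print(summed, picks)
```

Here each model alone prefers a different position, while the sum favors the position that both models rate fairly highly. Summing raw logits lets a more confident model dominate, so averaging probabilities after a softmax may be worth trying as an alternative.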

In [None]:
print("Ensemble Evaluating Test Set ...")
if fp16_training:
    models = [accelerator.prepare(model) for model in models]
    
result = []

for model in models:
    model.eval()
with torch.no_grad():
    for i, data in enumerate(tqdm(test_loader)):
        # batch size = 1 so squeeze dim=0
        outputs = [model(input_ids=data[0].squeeze(dim=0).to(device), 
                         token_type_ids=data[1].squeeze(dim=0).to(device),
                           attention_mask=data[2].squeeze(dim=0).to(device)) for model in models]
        ans = evaluateEn(data, outputs, test_set.doc_stride, 
                               test_paragraphs[test_questions[i]['paragraph_id']],
                               test_paragraphs_tokenized[test_questions[i]['paragraph_id']].tokens,
                               test_paragraphs_tokenized[test_questions[i]['paragraph_id']].offsets,
                               )
#         print(i, ans)
        result.append(ans)

result_file = "result.csv"
with open(result_file, 'w') as f:
    f.write("ID,Answer\n")
    for i, test_question in enumerate(test_questions):
        # Replace commas in answers with empty strings (since csv is separated by comma)
        # Answers in kaggle are processed in the same way
        f.write(f"{test_question['id']},{result[i].replace(',','')}\n")

print(f"Completed! Result is in {result_file}")