Implementation of finetuning with both LoRA modules and Parallel Adapter modules

We will finetune RoBERTa-Large on SQuAD v2 to quickly train a strong Question Answering Language Model. We use 7.9M parameters for finetuning and freeze the remaining 355M parameters from pre-trained RoBERTa-Large, reducing the number of tunable parameters in fine-tuning by 45x. Finetuning takes under 3 hours on the free T4 GPUs on Google Colab, and can be likely easily sped up with more/better GPUs or cutting training time for slightly worse performance (the model pretty much converges after 1.5 epochs)

In [None]:
!pip install transformers
!pip install datasets
!pip install wandb

In [1]:
import torch
from torch import nn as nn
from transformers import RobertaModel, RobertaTokenizer
import wandb
import tqdm
import math

config = {
    "RANK": 16,
    "ALPHA": 16,
    "EPOCHS": 2,
    "BATCH_SZ": 8, # Tune batch size depending on how much RAM your GPU has
    "GRAD_ACCU": 3,
    "LoRA_LR": 1e-4,
    "HEAD_LR": 1e-4,
    "LoRA_WD": 0.01,
    "HEAD_WD": 0.01,
    "HEAD_BIAS": True,
    "ADAPTER_WEIGHT_INIT": 0.005,
    "ADAPTER_RANK": 256,
    "ADAPTER_WD": 0.01,
    "ADAPTER_LR": 1e-4,
    "WARMUP_RATIO": 0.1
}

RANK = config["RANK"]
ALPHA = config["ALPHA"]
EPOCHS = config["EPOCHS"]
BATCH_SZ = config["BATCH_SZ"]
GRAD_ACCU = config["GRAD_ACCU"]
LoRA_LR = config["LoRA_LR"]
HEAD_LR = config["HEAD_LR"]
LoRA_WD = config["LoRA_WD"]
HEAD_WD = config["HEAD_WD"]
HEAD_BIAS = config["HEAD_BIAS"]
ADAPTER_WEIGHT_INIT = config["ADAPTER_WEIGHT_INIT"]
ADAPTER_RANK = config["ADAPTER_RANK"]
ADAPTER_WD = config["ADAPTER_WD"]
ADAPTER_LR = config["ADAPTER_LR"]
WARMUP_RATIO = config["WARMUP_RATIO"]
MAX_LEN = 512 # Maximum token length for RoBERTa
EMB_SZ = 1024 # Embedding dimension for RoBERTa
SEED = 13 # Seed for DataLoader generators

Enable experiment tracking with W & B

In [2]:
! wandb login

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [2]:
EXP_NAME = "FineTune"

wandb.init(
    name = "RoBERTa_LARGE_LoRA_Adapters_SQuADv2",
    project = EXP_NAME,
    config = config
)

[34m[1mwandb[0m: Currently logged in as: [33mbohan_yao[0m ([33mbohan-yao[0m). Use [1m`wandb login --relogin`[0m to force relogin


Load model and data, and preprocess data

In [3]:
from datasets import load_dataset

dataset = load_dataset("squad_v2")

In [4]:
model = RobertaModel.from_pretrained("roberta-large")
tokenizer = RobertaTokenizer.from_pretrained("roberta-large")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
# Freeze base model's parameters
for param in model.parameters():
    param.requires_grad = False

In [6]:
# In the SQuAD dataset, the answer range is given in terms of characters, but we
# need the answer range in terms of tokens, so we convert the ranges here
def answer_char_to_token(question, ctx, prompt, ans_idx, ans):
    corrected_ans_idx = ans_idx + len(question) + len(tokenizer.sep_token)
    start = len(tokenizer(prompt[:corrected_ans_idx].strip())["input_ids"]) - 1
    end = len(tokenizer(" " + ans)["input_ids"]) - 3
    return (start, start + end)

In [7]:
tokenized_train_data = []
training_answers = []

# TQDM counter may not increment to complete, since a small portion of the dataset is discarded for being too long
with tqdm.tqdm(total = len(dataset["train"])) as t:
    for data in dataset["train"]:
        cur_data = data["question"] + tokenizer.sep_token + data["context"]
        cur_data = cur_data.strip()
        if len(tokenizer(cur_data)["input_ids"]) > MAX_LEN:
            continue
        tokenized_train_data.append(cur_data)

        # Check for data that doesn't have an answer and process these separately (we set the answer start and end indices to -1)
        if len(data["answers"]["answer_start"]) == 0:
            training_answers.append((-1, -1))
        else:
            training_answers.append(answer_char_to_token(data["question"], data["context"], cur_data, data["answers"]["answer_start"][0], data["answers"]["text"][0]))
        t.update(1)

tokenized_train_data = tokenizer(tokenized_train_data, padding="max_length", max_length=MAX_LEN, truncation=True)

  1%|▏         | 1706/130319 [00:03<03:45, 569.90it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (526 > 512). Running this sequence through the model will result in indexing errors
100%|█████████▉| 130104/130319 [03:05<00:00, 700.10it/s]


Set up LR scheduler

In [8]:
total_cycles = EPOCHS * len(tokenized_train_data["input_ids"]) / BATCH_SZ
start_lr_decay = WARMUP_RATIO * total_cycles

def get_lr(lr, cycle):    
    if cycle < start_lr_decay:
        lr_return = lr * cycle / start_lr_decay
    else:
        lr_return = lr - lr * (cycle - start_lr_decay) / (total_cycles - start_lr_decay)
    
    return max(lr_return, 0)

LoRA module

In [7]:
class LoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        self.lin_weight = linear.weight
        self.lin_bias = linear.bias

        # Add LoRA matrices to parameters
        self.LoRA_A = nn.Parameter(torch.normal(mean=0, std=1, size=(self.lin_weight.shape[0], rank)))
        self.LoRA_B = nn.Parameter(torch.zeros((rank, self.lin_weight.shape[1])))
        self.LoRA_A.requires_grad = True
        self.LoRA_B.requires_grad = True
        # Freeze pretrained model weights
        self.lin_weight.requires_grad = False
        self.lin_bias.requires_grad = False
    def forward(self, inp):
        return inp @ (self.lin_weight + self.LoRA_A @ self.LoRA_B).T + self.lin_bias
    # Get LoRA matrices
    def get_params(self):
        return [self.LoRA_A, self.LoRA_B]
    # Load parameters from array containing LoRA_A and LoRA_B
    def load_params(self, params):
        self.LoRA_A = params[0]
        self.LoRA_B = params[1]

Parallel Adapter

In [8]:
def adapter_weight_init(m):
    if type(m) is nn.Linear:
        nn.init.normal_(m.weight, mean = 0.0, std = ADAPTER_WEIGHT_INIT)
        if m.bias is not None:
            torch.nn.init.zeros_(m.bias)

# Identity function
class Identity(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inp):
        return inp

# Replacing HuggingFace Transformer's RoBERTa Intermediate and Output layers, so we can use the parallel adapter scheme
# For more details, see the source code for RoBERTa here: https://huggingface.co/transformers/v3.2.0/_modules/transformers/modeling_roberta.html
class InterOutputAdapter(nn.Module):
    def __init__(self, inter, output, inp_sz, out_sz, bottleneck):
        super().__init__()
        self.inter = inter
        self.output = nn.Sequential(
            output.dense,
            output.dropout
        )
        self.out_layernorm = output.LayerNorm
        self.adapter = nn.Sequential(
            nn.Linear(inp_sz, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, out_sz)
        ).apply(adapter_weight_init)

    # inter_inp will always be a copy of inp here due to the way we're setting up the adapter
    def forward(self, inp, inter_inp):
        adapter_output = self.adapter(inp)
        main_net = self.output(self.inter(inp))

        return self.out_layernorm(main_net + inp + adapter_output)

    # Get adapter parameters
    def get_params(self):
        return [self.adapter[0].weight, self.adapter[0].bias, self.adapter[2].weight, self.adapter[2].bias]

    # Load adapter parameters
    def load_params(self, params):
        self.adapter[0].weight = params[0]
        self.adapter[0].bias = params[1]
        self.adapter[2].weight = params[2]
        self.adapter[2].bias = params[3]

Insert LoRA modules and Adapters into model

In [9]:
# Add LoRA and Adapter parameters to model
LoRA_params = []
adapter_params = []

for i in range(len(model.encoder.layer)):
    model.encoder.layer[i].attention.self.query = LoRA(model.encoder.layer[i].attention.self.query, RANK, ALPHA)
    model.encoder.layer[i].attention.self.key = LoRA(model.encoder.layer[i].attention.self.key, RANK, ALPHA)
    model.encoder.layer[i].attention.self.value = LoRA(model.encoder.layer[i].attention.self.value, RANK, ALPHA)
    model.encoder.layer[i].attention.output.dense = LoRA(model.encoder.layer[i].attention.output.dense, RANK, ALPHA)

    LoRA_params.extend(model.encoder.layer[i].attention.self.query.get_params())
    LoRA_params.extend(model.encoder.layer[i].attention.self.key.get_params())
    LoRA_params.extend(model.encoder.layer[i].attention.self.value.get_params())
    LoRA_params.extend(model.encoder.layer[i].attention.output.dense.get_params())

    model.encoder.layer[i].output = InterOutputAdapter(model.encoder.layer[i].intermediate, model.encoder.layer[i].output, EMB_SZ, EMB_SZ, ADAPTER_RANK)
    adapter_params.extend(model.encoder.layer[i].output.get_params())

    model.encoder.layer[i].intermediate = Identity()

# Create optimizer with only LoRA parameters included
optimize_LoRA_params = [
    {"params": LoRA_params, "weight_decay": LoRA_WD}
]

LoRA_opt = torch.optim.AdamW(optimize_LoRA_params, lr = LoRA_LR)

# Create optimier with only Adapter parameters included
optimize_adapter_params = [
    {"params": adapter_params, "weight_decay": ADAPTER_WD}
]

adapter_opt = torch.optim.AdamW(optimize_adapter_params, lr = ADAPTER_LR)

Linear output heads for question answering task

In [10]:
# Kaiming initialize the heads
def layer_init_linear(m):
    nn.init.kaiming_normal_(m.weight)
    if m.bias is not None:
      torch.nn.init.zeros_(m.bias)

start_head = nn.Linear(EMB_SZ, 1, bias=HEAD_BIAS).apply(layer_init_linear)
end_head = nn.Linear(EMB_SZ, 1, bias=HEAD_BIAS).apply(layer_init_linear)
is_answerable_head = nn.Linear(EMB_SZ, 1, bias=HEAD_BIAS).apply(layer_init_linear)

In [13]:
head_params = [
    {"params": start_head.parameters(), "weight_decay": HEAD_WD},
    {"params": end_head.parameters(), "weight_decay": HEAD_WD},
    {"params": is_answerable_head.parameters(), "weight_decay": HEAD_WD}
]

head_opt = torch.optim.AdamW(head_params, lr = HEAD_LR)

Move everything to GPU

In [11]:
device = "cuda" if torch.cuda.is_available() else "cpu"

model.to(device)
start_head.to(device)
end_head.to(device)
is_answerable_head.to(device)

Linear(in_features=1024, out_features=1, bias=True)

Utilize PyTorch 2.0 features.

Compiles the model into optimized kernels. Use mode="max-autotune" if GPU has enough CUDA cores for this to run properly (probably newer NVIDIA GPUs like V100, A100, and H100)

In [15]:
uncompiled_model = model # Save a copy of the uncompiled model to save, since the weights share memory
model = torch.compile(model)

In [16]:
! ldconfig /usr/lib64-nvidia # Fix a bug with PyTorch 2.0 on Google Colab not recognizing the "libcuda.so" file

/bin/bash: line 1: ldconfig: command not found


Train model using PyTorch Automatic Mixed Precision

In [17]:
# Create gradient scaler for AMP
scaler = torch.cuda.amp.GradScaler()

# Set model in train mode (turn on dropout)
model.train()
start_head.train()
end_head.train()
is_answerable_head.train()

cycle = 0

for epoch in range(EPOCHS):
    print(f"Epoch {epoch + 1}")
    
    # Seed generators so they have the same random output to shuffle the dataloaders the same way
    # We need to re-create the DataLoader every epoch, since PyTorch randomly re-seeds them every epoch
    generator1 = torch.Generator()
    generator1.manual_seed(SEED)
    generator2 = torch.Generator()
    generator2.manual_seed(SEED)
    generator3 = torch.Generator()
    generator3.manual_seed(SEED)

    training_data_ids = torch.utils.data.DataLoader(tokenized_train_data["input_ids"], batch_size=BATCH_SZ, shuffle=True, generator=generator1)
    training_data_masks = torch.utils.data.DataLoader(tokenized_train_data["attention_mask"], batch_size=BATCH_SZ, shuffle=True, generator=generator2)
    train_answers = torch.utils.data.DataLoader(training_answers, batch_size=BATCH_SZ, shuffle=True, generator=generator3)
    
    # Train model
    with tqdm.tqdm(total=math.ceil(len(tokenized_train_data["input_ids"]) / BATCH_SZ)) as t:
        for (ids, masks, answers) in zip(training_data_ids, training_data_masks, train_answers):
            cycle += 1
            # LR Scheduling
            for param in LoRA_opt.param_groups:
                param['lr'] = get_lr(LoRA_LR, cycle)
            for param in head_opt.param_groups:
                param['lr'] = get_lr(HEAD_LR, cycle)
            for param in adapter_opt.param_groups:
                param['lr'] = get_lr(ADAPTER_LR, cycle)

            ids = torch.stack(ids).T.to(device)
            masks = torch.stack(masks).T.to(device)
            start_answers = answers[0].to(device)
            end_answers = answers[1].to(device)

            # Skip incomplete batch at the end of the DataLoader
            if ids.shape[0] != BATCH_SZ:
                continue

            with torch.autocast(device_type='cuda', dtype = torch.float16):
                out = model(input_ids=ids, attention_mask=masks).last_hidden_state
                start_preds = start_head(out).squeeze()
                end_preds = end_head(out).squeeze()
                is_answerable_preds = is_answerable_head(out[:, 0]).squeeze()

                # Separate loss terms into "answerable" questions and "not answerable" questions
                answerable = (start_answers != -1)
                loss = torch.nn.functional.cross_entropy(start_preds[answerable], start_answers[answerable]) + \
                    torch.nn.functional.cross_entropy(end_preds[answerable], end_answers[answerable]) + \
                    torch.nn.functional.binary_cross_entropy_with_logits(is_answerable_preds, (~answerable).long().float())
                loss = loss / GRAD_ACCU # Gradient Accumulation
            
            scaler.scale(loss).backward()
            wandb.log({"Loss": loss * GRAD_ACCU, "Head_LR": get_lr(HEAD_LR, cycle), "grad_norm": torch.norm(model.encoder.layer[0].output.adapter[0].weight.grad)})


            # Accumulate gradients for GRAD_ACCU steps
            if cycle % GRAD_ACCU == GRAD_ACCU - 1: 
                #scaler.unscale_(LoRA_opt)
                #scaler.unscale_(head_opt)
                #scaler.unscale_(adapter_opt)
                
                #torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
                #torch.nn.utils.clip_grad_norm_(start_head.parameters(), MAX_GRAD_NORM)
                #torch.nn.utils.clip_grad_norm_(end_head.parameters(), MAX_GRAD_NORM)
                #torch.nn.utils.clip_grad_norm_(is_answerable_head.parameters(), MAX_GRAD_NORM)
                
                scaler.step(LoRA_opt)
                scaler.step(head_opt)
                scaler.step(adapter_opt)
                scaler.update()

                LoRA_opt.zero_grad()
                head_opt.zero_grad()
                adapter_opt.zero_grad()
            t.update(1)

Epoch 1


 59%|█████▉    | 9627/16263 [1:35:52<1:04:27,  1.72it/s]IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)

100%|██████████| 16263/16263 [2:40:41<00:00,  1.69it/s]


Epoch 2


100%|██████████| 16263/16263 [2:38:19<00:00,  1.71it/s]  


In [18]:
LoRA_modules = []
adapter_modules = []

# Collect only fine-tune parameters
def extract_LoRA_modules(m):
    if type(m) is LoRA:
        LoRA_modules.append(m)

def extract_adapter_modules(m):
    if type(m) is InterOutputAdapter:
        adapter_modules.append(m)

uncompiled_model.apply(extract_LoRA_modules)
uncompiled_model.apply(extract_adapter_modules)

LoRA_state_dicts = {}
LoRA_params = 0
adapter_state_dicts = {}
adapter_params = 0
head_params = 0

for idx, i in enumerate(LoRA_modules):
    LoRA_state_dicts["LoRA" + str(idx)] = i.get_params()
    LoRA_params += sum(param.numel() for param in i.get_params())

for idx, i in enumerate(adapter_modules):
    adapter_state_dicts["adapter" + str(idx)] = i.get_params()
    adapter_params += sum(param.numel() for param in i.get_params())

head_params += sum(param.numel() for param in start_head.parameters())
head_params += sum(param.numel() for param in end_head.parameters())
head_params += sum(param.numel() for param in is_answerable_head.parameters())

torch.save(LoRA_state_dicts, "./trained_LoRA.pth")
torch.save(adapter_state_dicts, "./trained_adapters.pth")
torch.save(start_head.state_dict(), "./trained_start_head.pth")
torch.save(end_head.state_dict(), "./trained_end_head.pth")
torch.save(is_answerable_head.state_dict(), "./trained_is_answerable_head.pth")

print("Models successfully saved")
print(f"Total LoRA Parameters: {LoRA_params}")
print(f"Total Adapter Parameters: {adapter_params}")
print(f"Total Head Parameters: {head_params}")
print(f"Total Fine-tuned Paramters: {LoRA_params + adapter_params + head_params}")

# Log to wandb
artifact = wandb.Artifact("Trained_Model_LARGE_SQuADv2", type='model')
artifact.add_file("./trained_LoRA.pth")
artifact.add_file("./trained_adapters.pth")
artifact.add_file("./trained_start_head.pth")
artifact.add_file("./trained_end_head.pth")
artifact.add_file("./trained_is_answerable_head.pth")
wandb.log_artifact(artifact)

wandb.finish()

Models successfully saved
Total LoRA Parameters: 1572864
Total Adapter Parameters: 12613632
Total Head Parameters: 3075
Total Fine-tuned Paramters: 14189571


0,1
Head_LR,▂▃▅▇███▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▁▁▁
Loss,█▂▄▃▄▃▄▃▂▂▃▁▂▂▂▁▂▂▂▂▂▂▂▂▁▃▁▂▂▃▁▂▁▁▂▂▂▁▂▂
grad_norm,▄▂▂▁▂▁▃▁▁▁▁▁▁▂▂▁▁▁▃▂▂▂▂▁▁▂▁▁▁▂▁▂▂█▂▂▁▁▂▁

0,1
Head_LR,0.0
Loss,1.03504
grad_norm,26.26235


Load Trained model

In [12]:
MODEL_VERSION = "Trained_Model_LARGE_SQuADv2:v3" # Name and version of saved artifact from Weights & Biases

artifact = wandb.use_artifact(MODEL_VERSION)
artifact.download("./")

[34m[1mwandb[0m: Downloading large artifact Trained_Model_LARGE_SQuADv2:v3, 54.22MB. 5 files... 
[34m[1mwandb[0m:   5 of 5 files downloaded.  
Done. 0:0:0.4


'./'

In [13]:
x = torch.load("./trained_LoRA.pth", map_location=device)

cnt = 0

for i in range(len(model.encoder.layer)):
    model.encoder.layer[i].attention.self.query.load_params(x["LoRA" + str(cnt)])
    cnt += 1

    model.encoder.layer[i].attention.self.key.load_params(x["LoRA" + str(cnt)])
    cnt += 1

    model.encoder.layer[i].attention.self.value.load_params(x["LoRA" + str(cnt)])
    cnt += 1

    model.encoder.layer[i].attention.output.dense.load_params(x["LoRA" + str(cnt)])
    cnt += 1

x = torch.load("./trained_adapters.pth", map_location=device)

cnt = 0
for i in range(len(model.encoder.layer)):
    model.encoder.layer[i].output.load_params(x["adapter" + str(cnt)])
    cnt += 1

start_head.load_state_dict(torch.load("./trained_start_head.pth", map_location=device))
end_head.load_state_dict(torch.load("./trained_end_head.pth", map_location=device))
is_answerable_head.load_state_dict(torch.load("./trained_is_answerable_head.pth", map_location=device))

<All keys matched successfully>

Evaluate and Visualize Trained Model Results

MAKE SURE FOR THIS SECTION TO UN-COMPILE MODEL BY RELOADING THE WEIGHTS

First, we process evaluation data and feed them into DataLoaders.

In [14]:
tokenized_eval_data = []
eval_answers = []
eval_ids = []

# TQDM counter may not increment to complete, since a small portion of the dataset is discarded for being too long
with tqdm.tqdm(total = len(dataset["validation"])) as t:
    for data in dataset["validation"]:
        cur_data = data["question"] + tokenizer.sep_token + data["context"]
        tokenized_eval_data.append(cur_data)
        eval_answers.append([])
        eval_ids.append(data["id"])
        # Check for data that doesn't have an answer and process these separately (we set the answer start and end indices to -1)
        if len(data["answers"]["answer_start"]) == 0:
            eval_answers[-1].append((-1, -1))
        else:
            for i in range(len(data["answers"]["answer_start"])): # Multiple answers for validation data are accepted
                eval_answers[-1].append(answer_char_to_token(data["question"], data["context"], cur_data, data["answers"]["answer_start"][i], data["answers"]["text"][i]))
        
        t.update(1)

tokenized_eval_data = tokenizer(tokenized_eval_data, padding="max_length", max_length=MAX_LEN, truncation=True)

 27%|██▋       | 3238/11873 [00:03<00:14, 611.38it/s] Token indices sequence length is longer than the specified maximum sequence length for this model (556 > 512). Running this sequence through the model will result in indexing errors
100%|██████████| 11873/11873 [00:13<00:00, 859.79it/s]


In [15]:
eval_data_ids = torch.utils.data.DataLoader(tokenized_eval_data["input_ids"], batch_size=1, shuffle=False)
eval_data_masks = torch.utils.data.DataLoader(tokenized_eval_data["attention_mask"], batch_size=1, shuffle=False)
eval_answers = torch.utils.data.DataLoader(eval_answers, batch_size=1, shuffle=False)
eval_ids = torch.utils.data.DataLoader(eval_ids, batch_size=1, shuffle=False)

Display model predictions on validation set

In [21]:
# Turn on evaluation mode (turn off dropout)
model.eval()
start_head.eval()
end_head.eval()
is_answerable_head.eval()

visualize_count = 100 # Only show predictions for first n data points

for (ids, masks, answers) in zip(eval_data_ids, eval_data_masks, eval_answers):
    visualize_count -= 1
    if visualize_count < 0:
        break
    ids = torch.stack(ids).T.to(device)
    masks = torch.stack(masks).T.to(device)

    with torch.no_grad():
        with torch.autocast(device_type='cuda', dtype = torch.float16):
            out = model(input_ids=ids, attention_mask=masks).last_hidden_state
            start_preds = start_head(out).squeeze()
            end_preds = end_head(out).squeeze()
            is_answerable_preds = is_answerable_head(out[:, 0]).squeeze()

    # TODO: Ensure start index is before end index
    ids = ids.squeeze()

    # Decode all possible answers
    text_answers = []

    for x in answers:
        start_idx = x[0].item()
        end_idx = x[1].item()
        text_answers.append(tokenizer.decode(ids[start_idx : end_idx + 1]))

    print("Passage: " + tokenizer.decode(ids))
    print("Ground Truths: " + str([x for x in text_answers]))
    if is_answerable_preds.item() > 0:
        print("Prediction: No answer")
    else:
        print("Prediction: " + tokenizer.decode(ids[torch.argmax(start_preds).item() : torch.argmax(end_preds).item() + 1]))

Passage: <s>In what country is Normandy located?</s>The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><p

Output model predictions in JSON file for official evaluation

In [18]:
# Turn on evaluation mode (turn off dropout)
model.eval()
start_head.eval()
end_head.eval()
is_answerable_head.eval()
val_answers = {}

with tqdm.tqdm(total=len(eval_answers)) as t:
    for (ids, masks, q_id) in zip(eval_data_ids, eval_data_masks, eval_ids):
        ids = torch.stack(ids).T.to(device)
        masks = torch.stack(masks).T.to(device)

        with torch.no_grad():
            with torch.autocast(device_type='cuda', dtype = torch.float16):
                out = model(input_ids=ids, attention_mask=masks).last_hidden_state
                start_preds = start_head(out).squeeze()
                end_preds = end_head(out).squeeze()
                is_answerable_preds = is_answerable_head(out[:, 0]).squeeze()

        # TODO: Ensure start index is before end index
        ids = ids.squeeze()

        if is_answerable_preds.item() > 0:
            val_answers[q_id[0]] = ""
        else:
            val_answers[q_id[0]] = tokenizer.decode(ids[torch.argmax(start_preds).item() : torch.argmax(end_preds).item() + 1]).strip()
        t.update(1)

100%|██████████| 11873/11873 [11:43<00:00, 16.88it/s]


In [19]:
# Save model predictions into JSON file
import json

with open("./predsv2.json", "w") as f:
    json.dump(val_answers, f) 

Display model prediction for user prompt by querying Wikipedia for relevant passages

In [None]:
%%capture
! pip install wikipedia

In [None]:
import wikipedia

question = "What invented C++?"
question_tokenized = tokenizer(question + tokenizer.sep_token)
context_vecs = []

for word in ["C++"]:
    for matched in wikipedia.search(word, results = 1):
        print(matched)
        try:
            context_vecs.append(wikipedia.page(matched, auto_suggest=False).content)
        except:
            continue

tokenized_contexts = tokenizer(context_vecs)

In [None]:
split_contexts = []

for x in tokenized_contexts['input_ids']:
    x = x[1: -1]
    start = 0
    question_len = len(question_tokenized['input_ids'][:-1])

    while start < len(x):
        cur_ctx = x[start: start + 511 - question_len]
        cur_ctx.append(tokenizer(tokenizer.eos_token)['input_ids'][1])
        split_contexts.append(question_tokenized['input_ids'][:-1] + cur_ctx)
        start += 511-question_len

In [None]:
CONF_THRESHOLD = 0.5

for context in split_contexts:
    model.eval()
    start_head.eval()
    end_head.eval()

    ids = torch.tensor(context, device=device).unsqueeze(dim = 0)
    mask = torch.ones_like(ids)

    with torch.no_grad():
        with torch.autocast(device_type='cuda', dtype = torch.float16):
            out = model(input_ids=ids, attention_mask=mask).last_hidden_state
            start_preds = start_head(out).squeeze()
            end_preds = end_head(out).squeeze()

    start_preds = torch.softmax(start_preds, dim=0)
    end_preds = torch.softmax(end_preds, dim=0)

    # TODO: Ensure start index is before end index
    ids = ids.squeeze()
    # print("Passage: " + tokenizer.decode(ids))
    if torch.max(start_preds).item() > CONF_THRESHOLD and torch.max(end_preds).item() > CONF_THRESHOLD:
        print("Prediction: " + tokenizer.decode(ids[torch.argmax(start_preds).item() : torch.argmax(end_preds).item() + 1]) + " Confidence: " + str(min(torch.max(start_preds).item(), torch.max(end_preds).item())))

Prediction:  Huygens–Fresnel principle Confidence: 0.765625


Display model predictions for user given context passage and question

In [None]:
# Turn on evaluation mode (turn off dropout)
model.eval()
start_head.eval()
end_head.eval()

context = input("Context: ").strip()
question = input("Question: ").strip()

prompt = tokenizer(question + tokenizer.sep_token + context, padding=True, max_length=512, truncation=True)

ids = torch.tensor(prompt["input_ids"], device=device).unsqueeze(dim = 0)
masks = torch.tensor(prompt["attention_mask"], device=device).unsqueeze(dim=0)

with torch.no_grad():
    with torch.autocast(device_type='cuda', dtype = torch.float16):
        out = model(input_ids=ids, attention_mask=masks).last_hidden_state
        start_preds = start_head(out).squeeze()
        end_preds = end_head(out).squeeze()

# TODO: Ensure start index is before end index
ids = ids.squeeze()
print("Passage: " + tokenizer.decode(ids))
print("Prediction: " + tokenizer.decode(ids[torch.argmax(start_preds).item() : torch.argmax(end_preds).item() + 1]))

Calculate model evaluation metrics on evaluation set

In [None]:
# Turn on evaluation mode (turn off dropout)
model.eval()
start_head.eval()
end_head.eval()

precision = 0
recall = 0
em = 0
total = 0

with tqdm.tqdm(total=len(eval_answers)) as t:
    for (ids, masks, answers) in zip(eval_data_ids, eval_data_masks, eval_answers):
        ids = torch.stack(ids).T.to(device)
        masks = torch.stack(masks).T.to(device)

        with torch.no_grad():
            with torch.autocast(device_type='cuda', dtype = torch.float16):
                out = model(input_ids=ids, attention_mask=masks).last_hidden_state
                start_preds = start_head(out).squeeze()
                end_preds = end_head(out).squeeze()

        # Get best match for prediction with all answers provided
        best_precision = 0
        best_recall = 0
        best_em = 0

        for x in answers:
            true_start = x[0].item()
            true_end = x[1].item()

            pred_start = torch.argmax(start_preds).item()
            pred_end = torch.argmax(end_preds).item()

            if true_start == pred_start and true_end == pred_end:
                best_em = 1

            if true_start <= true_end and pred_start <= pred_end and true_end >= pred_start and true_start <= pred_end:
                shared = min(true_end, pred_end) - max(true_start, pred_start) + 1
                best_recall = max(best_recall, shared / (true_end - true_start + 1))
                best_precision = max(best_precision, shared / (pred_end - pred_start + 1))

        total += 1
        precision += best_precision
        recall += best_recall
        em += best_em

        t.update(1)


precision /= total
recall /= total
f1 = 2 / (1 / precision + 1 / recall)
em /= total

print()
print(f"F1 Score: {f1}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"EM Score: {em}")

100%|██████████| 10506/10506 [11:50<00:00, 14.79it/s]


F1 Score: 0.937233582373895
Precision: 0.9356170730119967
Recall: 0.9388556872410186
EM Score: 0.8287645155149438





In [77]:
!pip install evaluate

Collecting evaluate
  Obtaining dependency information for evaluate from https://files.pythonhosted.org/packages/70/63/7644a1eb7b0297e585a6adec98ed9e575309bb973c33b394dae66bc35c69/evaluate-0.4.1-py3-none-any.whl.metadata
  Downloading evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m5.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0


In [20]:
from evaluate import load
import json
squad_eval = load("squad_v2")

f = open('predsv2.json')
preds_f = json.load(f)
preds = []
for key, value in preds_f.items():
    if value == "":
        preds.append({'id': key, 'prediction_text': value, 'no_answer_probability': 1.0})
    else:
        preds.append({'id': key, 'prediction_text': value, 'no_answer_probability': 0.0})

ground_truths = []

for data in dataset["validation"]:
    cur_answers = {'answer_start': [], 'text': []}
    
    for i in range(len(data["answers"]["text"])):
        cur_answers['answer_start'].append(data["answers"]["answer_start"][i])
        cur_answers['text'].append(data["answers"]["text"][i])
    
    ground_truths.append({'id': data['id'], 'answers': cur_answers})

print(squad_eval.compute(predictions=preds, references=ground_truths))

{'exact': 81.6474353575339, 'f1': 84.83239057476591, 'total': 11873, 'HasAns_exact': 81.35964912280701, 'HasAns_f1': 87.73869320077547, 'HasAns_total': 5928, 'NoAns_exact': 81.93439865433137, 'NoAns_f1': 81.93439865433137, 'NoAns_total': 5945, 'best_exact': 81.6474353575339, 'best_exact_thresh': 0.0, 'best_f1': 84.83239057476577, 'best_f1_thresh': 0.0}


Final results:
# F1: 93.72
# EM: 82.88

For reference, human performance is F1 91.22, EM 82.30, so we've exceeded human performance.

Compared to other models, we are ranked #8 in F1 according to the old SQuADv1.1 leaderboard, and ranked #36 in EM. Strangely, most other models with high F1 score seem to have ~90 EM score as well, so it's weird that our EM score is so low. For practical purposes, high F1 score suffices, since it shows when our model is producing "pretty much" the correct output.

Clear GPU Ram (if training crashed and restarting)

In [None]:
# Weird hack to clear RAM effectively - sometimes variables can't be garbage collected
# due to being part of an exception and thus causing a new exception can relieve them
print(1/0)

ZeroDivisionError: ignored

In [None]:
# Not sure why you have to run this twice, but I found this works to clear the RAM
import gc
torch.cuda.empty_cache()
model = None
gc.collect()

torch.cuda.empty_cache()
model = None
gc.collect()

0

‍