Implementation of finetuning with both LoRA modules and Adapter modules

We fine-tune RoBERTa-Large on SQuAD v1.1 to quickly train a strong question-answering model. We train 7.9M parameters and freeze the remaining 355M parameters of the pre-trained RoBERTa-Large, reducing the number of trainable parameters by about 45x. Fine-tuning takes under 3 hours on the free T4 GPU on Google Colab, and could likely be sped up with more or better GPUs, or by cutting training short at the cost of slightly worse performance (the model has largely converged after 1.5 epochs).

In [None]:
%%capture
!pip install transformers
!pip install datasets
!pip install wandb

In [None]:
import torch
from torch import nn as nn
from transformers import RobertaModel, RobertaTokenizer
import wandb
import tqdm
import math

config = {
    "RANK": 16,
    "ALPHA": 16,
    "EPOCHS": 2,
    "BATCH_SZ": 12, # Tune batch size depending on how much RAM your GPU has
    "LoRA_LR": 5e-5,
    "HEAD_LR": 5e-5,
    "LoRA_WD": 0.01,
    "HEAD_WD": 0.0,
    "HEAD_BIAS": True,
    "ADAPTER_WEIGHT_INIT": 0.02,
    "ADAPTER_RANK": 64,
    "ADAPTER_WD": 0.01,
    "ADAPTER_LR": 5e-5
}

RANK = config["RANK"]
ALPHA = config["ALPHA"]
EPOCHS = config["EPOCHS"]
BATCH_SZ = config["BATCH_SZ"]
LoRA_LR = config["LoRA_LR"]
HEAD_LR = config["HEAD_LR"]
LoRA_WD = config["LoRA_WD"]
HEAD_WD = config["HEAD_WD"]
HEAD_BIAS = config["HEAD_BIAS"]
ADAPTER_WEIGHT_INIT = config["ADAPTER_WEIGHT_INIT"]
ADAPTER_RANK = config["ADAPTER_RANK"]
ADAPTER_WD = config["ADAPTER_WD"]
ADAPTER_LR = config["ADAPTER_LR"]
MAX_LEN = 480 # Maximum token length, RoBERTa has 512 truncation length but we limit it a little to improve training speed
EMB_SZ = 1024 # Embedding dimension for RoBERTa

Enable experiment tracking with Weights & Biases (W&B)

In [None]:
! wandb login

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [None]:
EXP_NAME = "FineTune"

wandb.init(
    name = "RoBERTa_LARGE_LoRA_Adapters",
    project = EXP_NAME,
    config = config
)

wandb: Currently logged in as: bohan_yao (bohan-yao). Use `wandb login --relogin` to force relogin


Load the model and data, preprocess the data, and create DataLoaders

In [None]:
from datasets import load_dataset

dataset = load_dataset("squad")

In [None]:
model = RobertaModel.from_pretrained("roberta-large")
tokenizer = RobertaTokenizer.from_pretrained("roberta-large")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Freeze base model's parameters
for param in model.parameters():
    param.requires_grad = False

In [None]:
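# In the SQuAD dataset, the answer range is given in characters, but we need it
# in tokens, so we convert the range here. `prompt` is question + sep_token + context.
def answer_char_to_token(question, ctx, prompt, ans_idx, ans):
    # Shift the answer's character offset from the context into the full prompt
    corrected_ans_idx = ans_idx + len(question) + len(tokenizer.sep_token)
    # Index of the first answer token: prefix token count plus one for <s>
    # (len counts both <s> and </s>, hence the -1)
    start = len(tokenizer(prompt[:corrected_ans_idx].strip())["input_ids"]) - 1
    # Offset of the last answer token from the first: answer token count minus one
    # (len counts both <s> and </s>, hence the -3)
    end = len(tokenizer(" " + ans)["input_ids"]) - 3
    return (start, start + end)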

In [None]:
tokenized_train_data = []
training_answers = []

# TQDM counter may not increment to complete, since a small portion of the dataset is discarded for being too long
with tqdm.tqdm(total = len(dataset["train"])) as t:
    for data in dataset["train"]:
        cur_data = data["question"] + tokenizer.sep_token + data["context"]
        cur_data = cur_data.strip()
        if len(tokenizer(cur_data)["input_ids"]) > MAX_LEN:
            continue
        tokenized_train_data.append(cur_data)
        training_answers.append(answer_char_to_token(data["question"], data["context"], cur_data, data["answers"]["answer_start"][0], data["answers"]["text"][0]))
        t.update(1)

tokenized_train_data = tokenizer(tokenized_train_data, padding="max_length", max_length=MAX_LEN, truncation=True)

  3%|▎         | 2427/87599 [00:07<05:24, 262.19it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (526 > 512). Running this sequence through the model will result in indexing errors
100%|█████████▉| 87385/87599 [03:25<00:00, 424.24it/s]


In [None]:
training_data_ids = torch.utils.data.DataLoader(tokenized_train_data["input_ids"], batch_size=BATCH_SZ, shuffle=False)
training_data_masks = torch.utils.data.DataLoader(tokenized_train_data["attention_mask"], batch_size=BATCH_SZ, shuffle=False)
training_answers = torch.utils.data.DataLoader(training_answers, batch_size=BATCH_SZ, shuffle=False)

Set up LR scheduler

In [None]:
total_cycles = EPOCHS * len(tokenized_train_data["input_ids"]) / BATCH_SZ
start_lr_decay = 0.1 * total_cycles

def get_lr(lr, cycle):
    if cycle < start_lr_decay:
        return lr * cycle / start_lr_decay
    return lr - lr * (cycle - start_lr_decay) / (total_cycles - start_lr_decay)
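
To make the shape of this schedule concrete, here is a minimal stand-alone sketch of the same rule: linear warmup over the first 10% of steps, then linear decay to zero. The function name `linear_warmup_decay` and the toy step count are assumptions for illustration, not part of the notebook.

```python
def linear_warmup_decay(base_lr, step, total_steps, warmup_frac=0.1):
    """Linear warmup to base_lr, then linear decay to zero (same rule as get_lr)."""
    warmup_steps = warmup_frac * total_steps
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Algebraically equal to: lr - lr * (step - warmup) / (total - warmup)
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

total_steps = 1000  # toy value; the notebook uses EPOCHS * num_examples / BATCH_SZ
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(5e-5, step, total_steps))
```

The learning rate peaks at the end of warmup and reaches zero on the final step, which is why the last few hundred updates contribute very little.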

LoRA module

In [None]:
class LoRA(nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        self.scale = alpha / rank # Standard LoRA scaling; equals 1 with the config above (ALPHA == RANK)
        self.lin_weight = linear.weight # Shape: (out_features, in_features)
        self.lin_bias = linear.bias

        # Add LoRA matrices to parameters; LoRA_B starts at zero, so the update is initially a no-op
        self.LoRA_A = nn.Parameter(torch.normal(mean=0, std=1, size=(self.lin_weight.shape[0], rank)))
        self.LoRA_B = nn.Parameter(torch.zeros((rank, self.lin_weight.shape[1])))
        self.LoRA_A.requires_grad = True
        self.LoRA_B.requires_grad = True
        # Freeze pretrained model weights
        self.lin_weight.requires_grad = False
        self.lin_bias.requires_grad = False
    def forward(self, inp):
        # Effective weight is the frozen weight plus the scaled low-rank update
        return inp @ (self.lin_weight + self.scale * (self.LoRA_A @ self.LoRA_B)).T + self.lin_bias
    def get_params(self):
        return [self.LoRA_A, self.LoRA_B]
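
As a quick sanity check of the idea, the sketch below wraps a tiny linear layer and confirms that, because one of the two low-rank matrices starts at zero, the wrapped layer initially gives the same output as the original. `LoRALinear` and the toy sizes are hypothetical and only mirror the structure of the module above; this is not the class used for training.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Hypothetical stand-alone LoRA wrapper, for illustration only."""
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False  # freeze the pre-trained layer
        out_features, in_features = linear.weight.shape
        self.A = nn.Parameter(torch.randn(out_features, rank))
        self.B = nn.Parameter(torch.zeros(rank, in_features))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        delta = self.scale * (self.A @ self.B)  # low-rank update, shape (out, in)
        return self.linear(x) + x @ delta.T

torch.manual_seed(0)
base = nn.Linear(8, 4)
wrapped = LoRALinear(base, rank=2, alpha=2)
x = torch.randn(3, 8)

same_at_init = torch.allclose(wrapped(x), base(x))
trainable = sum(p.numel() for p in wrapped.parameters() if p.requires_grad)
print(same_at_init, trainable)
```

Only the two low-rank matrices are trainable here (4 x 2 plus 2 x 8 values), while the original 8 x 4 weight and its bias stay frozen.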

Adapter

In [None]:
def adapter_weight_init(m):
    if type(m) is nn.Linear:
        nn.init.normal_(m.weight, mean = 0.0, std = ADAPTER_WEIGHT_INIT)
        if m.bias is not None:
            torch.nn.init.zeros_(m.bias)

class LinearAdapter(nn.Module):
    def __init__(self, linear, rank):
        super().__init__()
        self.rank = rank
        self.linear = linear

        # Adapter layers
        self.adapter = nn.Sequential(
            nn.Linear(self.linear.weight.shape[0], rank),
            nn.ReLU(),
            nn.Linear(rank, self.linear.weight.shape[0])
        ).apply(adapter_weight_init)
        # Freeze pretrained model weights
        self.linear.weight.requires_grad = False
        if self.linear.bias is not None:
            self.linear.bias.requires_grad = False

    def forward(self, inp):
        inp = self.linear(inp)
        return inp + self.adapter(inp)

    def get_params(self):
        return [self.adapter[0].weight, self.adapter[0].bias, self.adapter[2].weight, self.adapter[2].bias]

Insert LoRA modules and Adapters into model

In [None]:
# Add LoRA and Adapter parameters to model
LoRA_params = []
adapter_params = []

for i in range(len(model.encoder.layer)):
    model.encoder.layer[i].attention.self.query = LoRA(model.encoder.layer[i].attention.self.query, RANK, ALPHA)
    model.encoder.layer[i].attention.self.value = LoRA(model.encoder.layer[i].attention.self.value, RANK, ALPHA)
    LoRA_params.extend(model.encoder.layer[i].attention.self.query.get_params())
    LoRA_params.extend(model.encoder.layer[i].attention.self.value.get_params())

    model.encoder.layer[i].attention.output.dense = LinearAdapter(model.encoder.layer[i].attention.output.dense, ADAPTER_RANK)
    model.encoder.layer[i].output.dense = LinearAdapter(model.encoder.layer[i].output.dense, ADAPTER_RANK)
    adapter_params.extend(model.encoder.layer[i].attention.output.dense.get_params())
    adapter_params.extend(model.encoder.layer[i].output.dense.get_params())

# Create optimizer with only LoRA parameters included
optimize_LoRA_params = [
    {"params": LoRA_params, "weight_decay": LoRA_WD}
]

LoRA_opt = torch.optim.AdamW(optimize_LoRA_params, lr = LoRA_LR)

# Create optimizer with only Adapter parameters included
optimize_adapter_params = [
    {"params": adapter_params, "weight_decay": ADAPTER_WD}
]

adapter_opt = torch.optim.AdamW(optimize_adapter_params, lr = ADAPTER_LR)
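
The 7.9M trainable-parameter figure quoted in the introduction can be reproduced by hand from the shapes above. The arithmetic below is a sketch that assumes RoBERTa-Large's 24 encoder layers and hidden size of 1024, and it ignores the two small QA heads (2 x 1025 parameters).

```python
NUM_LAYERS = 24      # RoBERTa-Large encoder layers
HIDDEN = 1024        # EMB_SZ in the notebook
LORA_RANK = 16       # config["RANK"]
ADAPTER_RANK = 64    # config["ADAPTER_RANK"]

# One LoRA module: LoRA_A is (HIDDEN x r) and LoRA_B is (r x HIDDEN); no biases.
lora_per_module = HIDDEN * LORA_RANK + LORA_RANK * HIDDEN
lora_total = NUM_LAYERS * 2 * lora_per_module  # query and value in each layer

# One adapter: Linear(HIDDEN, r) and Linear(r, HIDDEN), both with biases.
adapter_per_module = (HIDDEN * ADAPTER_RANK + ADAPTER_RANK) + (ADAPTER_RANK * HIDDEN + HIDDEN)
adapter_total = NUM_LAYERS * 2 * adapter_per_module  # two wrapped dense layers per encoder layer

trainable = lora_total + adapter_total
print(f"LoRA: {lora_total:,}  Adapters: {adapter_total:,}  Total: {trainable:,}")
print(f"Reduction vs. 355M frozen parameters: {355e6 / trainable:.1f}x")
```

Most of the trainable budget goes to the adapters, since their rank is four times the LoRA rank and they carry biases.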

Linear output heads for question answering task

In [None]:
# Kaiming initialize the heads
def layer_init_linear(m):
    nn.init.kaiming_normal_(m.weight)
    if m.bias is not None:
        torch.nn.init.zeros_(m.bias)

start_head = nn.Linear(EMB_SZ, 1, bias=HEAD_BIAS).apply(layer_init_linear)
end_head = nn.Linear(EMB_SZ, 1, bias=HEAD_BIAS).apply(layer_init_linear)

In [None]:
head_params = [
    {"params": start_head.parameters(), "weight_decay": HEAD_WD},
    {"params": end_head.parameters(), "weight_decay": HEAD_WD}
]

head_opt = torch.optim.AdamW(head_params, lr = HEAD_LR)

Move everything to GPU

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

model.to(device)
start_head.to(device)
end_head.to(device)

Linear(in_features=1024, out_features=1, bias=True)

Use PyTorch 2.0 features.

torch.compile compiles the model into optimized kernels. Pass mode="max-autotune" if the GPU has enough CUDA cores for it to run properly (probably newer NVIDIA GPUs such as the V100, A100, and H100).

In [None]:
uncompiled_model = model # Keep a reference to the uncompiled model for saving later; it shares its weights with the compiled model
model = torch.compile(model)

In [None]:
! ldconfig /usr/lib64-nvidia # Fix a bug with PyTorch 2.0 on Google Colab not recognizing the "libcuda.so" file


Train model using PyTorch Automatic Mixed Precision

In [None]:
# Create gradient scaler for AMP
scaler = torch.cuda.amp.GradScaler()

# Set model in train mode (turn on dropout)
model.train()
start_head.train()
end_head.train()

cycle = 0

for epoch in range(EPOCHS):
    print(f"Epoch {epoch + 1}")

    # Train model
    with tqdm.tqdm(total=math.ceil(len(tokenized_train_data["input_ids"]) / BATCH_SZ)) as t:
        for (ids, masks, answers) in zip(training_data_ids, training_data_masks, training_answers):
            cycle += 1
            # LR Scheduling
            for param in LoRA_opt.param_groups:
                param['lr'] = get_lr(LoRA_LR, cycle)
            for param in head_opt.param_groups:
                param['lr'] = get_lr(HEAD_LR, cycle)
            for param in adapter_opt.param_groups:
                param['lr'] = get_lr(ADAPTER_LR, cycle)

            LoRA_opt.zero_grad(set_to_none=True)
            head_opt.zero_grad(set_to_none=True)
            adapter_opt.zero_grad(set_to_none=True)
            ids = torch.stack(ids).T.to(device)
            masks = torch.stack(masks).T.to(device)
            start_answers = answers[0].to(device)
            end_answers = answers[1].to(device)

            # Skip incomplete batch at the end of the DataLoader
            if ids.shape[0] != BATCH_SZ:
                continue

            with torch.autocast(device_type='cuda', dtype = torch.float16):
                out = model(input_ids=ids, attention_mask=masks).last_hidden_state
                start_preds = start_head(out).squeeze(-1) # (batch, seq_len) logits over token positions
                end_preds = end_head(out).squeeze(-1)
                loss = torch.nn.functional.cross_entropy(start_preds, start_answers) + torch.nn.functional.cross_entropy(end_preds, end_answers)

            wandb.log({"Loss": loss.item()}) # Log a Python float rather than a CUDA tensor
            scaler.scale(loss).backward()
            scaler.step(LoRA_opt)
            scaler.step(head_opt)
            scaler.step(adapter_opt)
            scaler.update()
            t.update(1)

Epoch 1


100%|█████████▉| 7282/7283 [1:27:57<00:00,  1.38it/s]


Epoch 2


100%|█████████▉| 7282/7283 [1:24:10<00:00,  1.44it/s]


Save trained model and logs to Weights & Biases

In [None]:
# TODO: only save LoRA and Adapter modules
torch.save(uncompiled_model.state_dict(), "./trained_model.pth") # Unable to save compiled model
torch.save(start_head.state_dict(), "./trained_start_head.pth")
torch.save(end_head.state_dict(), "./trained_end_head.pth")

artifact = wandb.Artifact("Trained_Model_LARGE", type='model')
artifact.add_file("./trained_model.pth")
artifact.add_file("./trained_start_head.pth")
artifact.add_file("./trained_end_head.pth")
wandb.log_artifact(artifact)

wandb.finish()

Run history:
Loss █▄▂▂▂▂▃▂▂▄▂▁▂▂▂▂▃▄▃▂▂▂▂▃▂▁▁▁▄▃▄▂▂▂▂▂▂▃▂▂

Run summary:
Loss 0.7303


Load trained model

In [None]:
MODEL_VERSION = "Trained_Model_LARGE:v0" # Name and version of saved artifact from Weights & Biases

artifact = wandb.use_artifact(MODEL_VERSION)
artifact.download("./")

wandb: Downloading large artifact Trained_Model_LARGE:v0, 1386.06MB. 3 files... 
wandb:   3 of 3 files downloaded.  
Done. 0:0:23.7


'./'

In [None]:
model.load_state_dict(torch.load("./trained_model.pth", map_location=torch.device(device)))
start_head.load_state_dict(torch.load("./trained_start_head.pth", map_location=torch.device(device)))
end_head.load_state_dict(torch.load("./trained_end_head.pth", map_location=torch.device(device)))

<All keys matched successfully>

Evaluate and Visualize Trained Model Results

Note: make sure to run this section with the uncompiled model. Recreate the model without the torch.compile step, then reload the saved weights.

First, we process evaluation data and feed them into DataLoaders.

In [None]:
tokenized_eval_data = []
eval_answers = []

# TQDM counter may not increment to complete, since a small portion of the dataset is discarded for being too long
with tqdm.tqdm(total = len(dataset["validation"])) as t:
    for data in dataset["validation"]:
        cur_data = data["question"] + tokenizer.sep_token + data["context"]
        if len(tokenizer(cur_data)["input_ids"]) > MAX_LEN:
            continue
        tokenized_eval_data.append(cur_data)
        eval_answers.append([])
        for i in range(len(data["answers"]["answer_start"])): # Multiple answers for validation data are accepted
            eval_answers[-1].append(answer_char_to_token(data["question"], data["context"], cur_data, data["answers"]["answer_start"][i], data["answers"]["text"][i]))
        t.update(1)

tokenized_eval_data = tokenizer(tokenized_eval_data, padding="max_length", max_length=MAX_LEN, truncation=True)

 39%|███▉      | 4135/10570 [00:17<01:00, 106.39it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (614 > 512). Running this sequence through the model will result in indexing errors
 99%|█████████▉| 10506/10570 [00:44<00:00, 236.31it/s]


In [None]:
eval_data_ids = torch.utils.data.DataLoader(tokenized_eval_data["input_ids"], batch_size=1, shuffle=False)
eval_data_masks = torch.utils.data.DataLoader(tokenized_eval_data["attention_mask"], batch_size=1, shuffle=False)
eval_answers = torch.utils.data.DataLoader(eval_answers, batch_size=1, shuffle=False)

Display model predictions on validation set

In [None]:
# Turn on evaluation mode (turn off dropout)
model.eval()
start_head.eval()
end_head.eval()

visualize_count = 100 # Only show predictions for first n data points

for (ids, masks, answers) in zip(eval_data_ids, eval_data_masks, eval_answers):
    visualize_count -= 1
    if visualize_count < 0:
        break
    ids = torch.stack(ids).T.to(device)
    masks = torch.stack(masks).T.to(device)

    with torch.no_grad():
        with torch.autocast(device_type='cuda', dtype = torch.float16):
            out = model(input_ids=ids, attention_mask=masks).last_hidden_state
            start_preds = start_head(out).squeeze()
            end_preds = end_head(out).squeeze()

    # TODO: Ensure start index is before end index
    ids = ids.squeeze()

    # Decode all possible answers
    text_answers = []

    for x in answers:
        start_idx = x[0].item()
        end_idx = x[1].item()
        text_answers.append(tokenizer.decode(ids[start_idx : end_idx + 1]))

    print("Passage: " + tokenizer.decode(ids))
    print("Ground Truths: " + str([x for x in text_answers]))
    print("Prediction: " + tokenizer.decode(ids[torch.argmax(start_preds).item() : torch.argmax(end_preds).item() + 1]))

Passage: <s>Which NFL team represented the AFC at Super Bowl 50?</s>Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pa

Display model prediction for user prompt by querying Wikipedia for relevant passages

In [None]:
%%capture
! pip install wikipedia

In [None]:
import wikipedia

question = "Who invented C++?"
question_tokenized = tokenizer(question + tokenizer.sep_token)
context_vecs = []

for word in ["C++"]:
    for matched in wikipedia.search(word, results = 1):
        print(matched)
        try:
            context_vecs.append(wikipedia.page(matched, auto_suggest=False).content)
        except Exception: # Skip pages that fail to load (e.g. disambiguation pages)
            continue

tokenized_contexts = tokenizer(context_vecs)

In [None]:
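split_contexts = []

# <s> question </s>, with the final </s> dropped so the context follows a single separator
question_ids = question_tokenized['input_ids'][:-1]
# Context tokens per chunk: 512 positions minus the question and the closing </s>
chunk_len = 511 - len(question_ids)

for x in tokenized_contexts['input_ids']:
    x = x[1: -1] # Strip <s> and </s> from the tokenized article
    start = 0

    while start < len(x):
        cur_ctx = x[start: start + chunk_len]
        cur_ctx.append(tokenizer.eos_token_id)
        split_contexts.append(question_ids + cur_ctx)
        start += chunk_len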
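
The cell above cuts the article into back-to-back chunks, so an answer that straddles a chunk boundary is split in two. One possible alternative, shown here only as a sketch and not used in this notebook, is to let consecutive windows overlap. `split_with_stride` and its parameters are hypothetical names; it works on plain lists of token ids.

```python
def split_with_stride(token_ids, window, stride):
    """Split token_ids into windows of at most `window` tokens, advancing by `stride`.

    With stride == window the chunks are back-to-back (as in the cell above);
    with stride < window consecutive chunks overlap by window - stride tokens.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end
        start += stride
    return chunks

ids = list(range(10))  # stand-in for real token ids
print(split_with_stride(ids, window=4, stride=4))
print(split_with_stride(ids, window=4, stride=2))
```

Overlap increases the number of forward passes, so it trades inference time for a lower chance of cutting an answer in half.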

In [None]:
CONF_THRESHOLD = 0.5

for context in split_contexts:
    model.eval()
    start_head.eval()
    end_head.eval()

    ids = torch.tensor(context, device=device).unsqueeze(dim = 0)
    mask = torch.ones_like(ids)

    with torch.no_grad():
        with torch.autocast(device_type='cuda', dtype = torch.float16):
            out = model(input_ids=ids, attention_mask=mask).last_hidden_state
            start_preds = start_head(out).squeeze()
            end_preds = end_head(out).squeeze()

    start_preds = torch.softmax(start_preds, dim=0)
    end_preds = torch.softmax(end_preds, dim=0)

    # TODO: Ensure start index is before end index
    ids = ids.squeeze()
    # print("Passage: " + tokenizer.decode(ids))
    if torch.max(start_preds).item() > CONF_THRESHOLD and torch.max(end_preds).item() > CONF_THRESHOLD:
        print("Prediction: " + tokenizer.decode(ids[torch.argmax(start_preds).item() : torch.argmax(end_preds).item() + 1]) + " Confidence: " + str(min(torch.max(start_preds).item(), torch.max(end_preds).item())))

Prediction:  Huygens–Fresnel principle Confidence: 0.765625


Display model predictions for user given context passage and question

In [None]:
# Turn on evaluation mode (turn off dropout)
model.eval()
start_head.eval()
end_head.eval()

context = input("Context: ").strip()
question = input("Question: ").strip()

prompt = tokenizer(question + tokenizer.sep_token + context, padding=True, max_length=512, truncation=True)

ids = torch.tensor(prompt["input_ids"], device=device).unsqueeze(dim = 0)
masks = torch.tensor(prompt["attention_mask"], device=device).unsqueeze(dim=0)

with torch.no_grad():
    with torch.autocast(device_type='cuda', dtype = torch.float16):
        out = model(input_ids=ids, attention_mask=masks).last_hidden_state
        start_preds = start_head(out).squeeze()
        end_preds = end_head(out).squeeze()

# TODO: Ensure start index is before end index
ids = ids.squeeze()
print("Passage: " + tokenizer.decode(ids))
print("Prediction: " + tokenizer.decode(ids[torch.argmax(start_preds).item() : torch.argmax(end_preds).item() + 1]))

Calculate model evaluation metrics on evaluation set

In [None]:
# Turn on evaluation mode (turn off dropout)
model.eval()
start_head.eval()
end_head.eval()

precision = 0
recall = 0
em = 0
total = 0

with tqdm.tqdm(total=len(eval_answers)) as t:
    for (ids, masks, answers) in zip(eval_data_ids, eval_data_masks, eval_answers):
        ids = torch.stack(ids).T.to(device)
        masks = torch.stack(masks).T.to(device)

        with torch.no_grad():
            with torch.autocast(device_type='cuda', dtype = torch.float16):
                out = model(input_ids=ids, attention_mask=masks).last_hidden_state
                start_preds = start_head(out).squeeze()
                end_preds = end_head(out).squeeze()

        # Get best match for prediction with all answers provided
        best_precision = 0
        best_recall = 0
        best_em = 0

        for x in answers:
            true_start = x[0].item()
            true_end = x[1].item()

            pred_start = torch.argmax(start_preds).item()
            pred_end = torch.argmax(end_preds).item()

            if true_start == pred_start and true_end == pred_end:
                best_em = 1

            if true_start <= true_end and pred_start <= pred_end and true_end >= pred_start and true_start <= pred_end:
                shared = min(true_end, pred_end) - max(true_start, pred_start) + 1
                best_recall = max(best_recall, shared / (true_end - true_start + 1))
                best_precision = max(best_precision, shared / (pred_end - pred_start + 1))

        total += 1
        precision += best_precision
        recall += best_recall
        em += best_em

        t.update(1)


precision /= total
recall /= total
f1 = 2 / (1 / precision + 1 / recall)
em /= total

print()
print(f"F1 Score: {f1}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"EM Score: {em}")

100%|██████████| 10506/10506 [11:50<00:00, 14.79it/s]


F1 Score: 0.937233582373895
Precision: 0.9356170730119967
Recall: 0.9388556872410186
EM Score: 0.8287645155149438





Final results:
F1: 93.72
EM: 82.88

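For reference, human performance is F1 91.22 and EM 82.30, so on these numbers we exceed human performance.

Compared to other models, we rank #8 in F1 and #36 in EM on the old SQuAD v1.1 leaderboard. Most other models with a high F1 score also have an EM score of about 90, so our comparatively low EM score is surprising. Note that the metrics above are computed on token spans, over the 10,506 validation examples that fit within MAX_LEN, with F1 derived from dataset-averaged precision and recall, so they are not strictly comparable to the official leaderboard metrics. For practical purposes a high F1 score suffices, since it shows that the model is producing approximately the correct output.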
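
One thing worth checking when comparing against the leaderboard is how F1 is aggregated. The metrics cell above takes the harmonic mean of the dataset-averaged precision and recall, whereas the official SQuAD evaluation averages a per-example F1 (computed on normalized answer text). The sketch below uses made-up spans, not SQuAD data, to show that the two aggregations can differ. `span_prf` is a hypothetical helper.

```python
def span_prf(true_span, pred_span):
    """Token-overlap precision, recall, and F1 for inclusive (start, end) spans."""
    (ts, te), (ps, pe) = true_span, pred_span
    shared = min(te, pe) - max(ts, ps) + 1
    if ps > pe or shared <= 0:
        return 0.0, 0.0, 0.0
    precision = shared / (pe - ps + 1)
    recall = shared / (te - ts + 1)
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical (true, predicted) span pairs, not taken from SQuAD
pairs = [((5, 8), (5, 8)), ((10, 13), (10, 10)), ((20, 20), (20, 23))]
scores = [span_prf(t, p) for t, p in pairs]

mean_p = sum(s[0] for s in scores) / len(scores)
mean_r = sum(s[1] for s in scores) / len(scores)
f1_of_means = 2 * mean_p * mean_r / (mean_p + mean_r)  # aggregation used in the metrics cell
mean_of_f1 = sum(s[2] for s in scores) / len(scores)   # per-example aggregation
print(f"F1 of mean precision/recall: {f1_of_means:.3f}")
print(f"Mean of per-example F1:      {mean_of_f1:.3f}")
```

On this toy data the first aggregation comes out higher, which may account for part of the gap between our F1 and EM rankings, though we have not verified this against the official evaluation script.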

Clear GPU RAM (if training crashed and you are restarting)

In [None]:
# Weird hack to clear RAM effectively - sometimes variables can't be garbage collected
# due to being part of an exception and thus causing a new exception can relieve them
print(1/0)

ZeroDivisionError: ignored

In [None]:
# Not sure why you have to run this twice, but I found this works to clear the RAM
import gc
torch.cuda.empty_cache()
model = None
LoRA_opt = None
gc.collect()

torch.cuda.empty_cache()
model = None
LoRA_opt = None
gc.collect()

0

In [None]:
# Temporary testing: save only the LoRA and Adapter parameters
# (run this while `model` still holds the trained model, i.e. before the RAM-clearing cells above)
LoRA_modules = []
adapter_modules = []

def extract_LoRA_modules(m):
    if type(m) is LoRA:
        LoRA_modules.append(m)

def extract_adapter_modules(m):
    if type(m) is LinearAdapter:
        adapter_modules.append(m)

model.apply(extract_LoRA_modules)
model.apply(extract_adapter_modules)

LoRA_state_dicts = {}
LoRA_params = 0
adapter_state_dicts = {}
adapter_params = 0

for idx, i in enumerate(LoRA_modules):
    # Store tensors rather than the generator returned by parameters(), which cannot be pickled
    LoRA_state_dicts["LoRA" + str(idx)] = [param.detach().cpu() for param in i.get_params()]
    LoRA_params += sum(param.numel() for param in i.get_params())

for idx, i in enumerate(adapter_modules):
    adapter_state_dicts["adapter" + str(idx)] = [param.detach().cpu() for param in i.get_params()]
    adapter_params += sum(param.numel() for param in i.get_params())

torch.save(LoRA_state_dicts, "./trained_LoRA.pth")
torch.save(adapter_state_dicts, "./trained_adapters.pth")
print("Models successfully saved")
print(f"Total LoRA Parameters: {LoRA_params}")
print(f"Total Adapter Parameters: {adapter_params}")
print(f"Total Fine-tuned Parameters: {LoRA_params + adapter_params}")