#### Finetuning BERT for `Extractive Question Answering` 

We will finetune a BERT model on the task of extractive QA, which involves taking a `factoid question` and a `context passage` of text and `labeling a span` of text from that passage which contains the `answer`. We can frame this as a `classification task`. First we concatenate the question and context passage pair, separated by a `[SEP]` token. Then we compute the BERT encoding for this sequence and apply a linear transform to each output token's encoding vector to obtain a scalar score. Passing the scores of all tokens through a softmax gives a `probability distribution` over tokens in the sequence, which we can interpret as the probability of each token being the start of the span. We actually use two separate linear transforms, each followed by its own softmax, to get two probability distributions over tokens: one for `start of span` and one for `end of span`.
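
In symbols (this notation is introduced here for illustration and does not appear elsewhere in the notebook): let $h_1, \dots, h_n \in \mathbb{R}^d$ be BERT's output vectors for the $n$ tokens of the sequence, and let $(w_s, b_s)$ and $(w_e, b_e)$ be the parameters of the two linear transforms. The start and end distributions can then be written as

$$
P_{\text{start}}(i) = \frac{\exp(w_s^\top h_i + b_s)}{\sum_{k=1}^{n} \exp(w_s^\top h_k + b_s)}, \qquad
P_{\text{end}}(j) = \frac{\exp(w_e^\top h_j + b_e)}{\sum_{k=1}^{n} \exp(w_e^\top h_k + b_e)}.
$$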

We will train this model on the SQuAD v1 dataset, which contains passages, each paired with multiple questions and answer spans. We use the cross entropy loss at each of the two softmax outputs and add the two losses together. To make predictions, we add the score of the `ith` token being the start to the score of the `jth` token being the end for all i and j≥i (a single-token answer has j=i), then declare the (i,j) with the highest total score as the predicted span.
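
The prediction rule can be sketched in plain Python as follows. This is a minimal illustration, not the code used later in the notebook (the training loop below simply takes an independent argmax of each head). The function name `best_span` and the optional `max_answer_len` cap are assumptions made for this example.

```python
def best_span(start_scores, end_scores, max_answer_len=None):
    """Return (i, j, score) maximizing start_scores[i] + end_scores[j] subject to j >= i."""
    best = (0, 0, float("-inf"))
    n = len(start_scores)
    for i in range(n):
        # optionally cap the span length (max_answer_len is an assumed extra knob)
        j_stop = n if max_answer_len is None else min(n, i + max_answer_len)
        for j in range(i, j_stop):
            score = start_scores[i] + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# toy scores for a 4-token sequence (made-up numbers)
start_scores = [0.1, 2.0, 0.3, 0.0]
end_scores = [3.0, 0.2, 1.5, 0.4]
print(best_span(start_scores, end_scores))  # (1, 2, 3.5)
```

In this toy example the independent argmaxes would pick start 1 and end 0, which is not a valid span; the joint search returns the best valid pair instead.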





In [1]:
import torch
from transformers import BertTokenizerFast, BertModel
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from collections import Counter
import csv
import random
from tqdm import tqdm
import psutil
import json
import wandb
wandb.login()

print(torch.cuda.is_available())

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mtanzids[0m. Use [1m`wandb login --relogin`[0m to force relogin


True


First, let's load the data from file and then set up PyTorch datasets.

In [2]:
# load the train and dev JSON documents
with open("train.json", "r") as train_file:
    squad_train = json.load(train_file)         
with open("dev.json", "r") as dev_file:
    squad_dev = json.load(dev_file) 

In [188]:
def get_passages(squad, num_titles=None):
    if num_titles is None:
        num_titles = len(squad['data'])
    # for each title, get passages and all corresponding questions from SQuAD train set
    passages = []
    questions = []
    num_questions = 0
    j = 0
    for i in range(num_titles):
        #print(f"Title# {i}: {squad['data'][i]['title']}, Number of passages: {len(squad['data'][i]['paragraphs'])}")
        for p in squad['data'][i]['paragraphs']:
            passages.append(p['context'])
            for q in p['qas']:
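                # skip unanswerable questions; files without an 'is_impossible' field are treated as answerable
                if not q.get('is_impossible', False):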
                    questions.append((q,j))    
                    num_questions += 1
            j += 1
    print(f"Number of passages: {len(passages)}")
    print(f"Number of questions: {num_questions}")
    return passages, questions

In [189]:
passages_train, questions_train = get_passages(squad_train, num_titles=5)
passages_val, questions_val = get_passages(squad_dev, num_titles=5)

Number of passages: 312
Number of questions: 2282
Number of passages: 173
Number of questions: 705
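
For reference, here is a sketch of the nested JSON layout that `get_passages` walks. The record below is a made-up toy example (the title, context, question and id are invented); only the field names follow the SQuAD file format read above.

```python
# toy record in the SQuAD JSON layout (contents are invented for illustration)
toy_squad = {
    "data": [
        {
            "title": "Toy_Article",
            "paragraphs": [
                {
                    "context": "BERT was released in 2018.",
                    "qas": [
                        {
                            "id": "toy-0",
                            "question": "When was BERT released?",
                            "answers": [{"text": "2018", "answer_start": 21}],
                            "is_impossible": False,
                        }
                    ],
                }
            ],
        }
    ]
}

# flatten into a list of passages and a list of (question, passage index) pairs
passages, questions = [], []
for article in toy_squad["data"]:
    for paragraph in article["paragraphs"]:
        passages.append(paragraph["context"])
        passage_idx = len(passages) - 1
        for qa in paragraph["qas"]:
            if not qa.get("is_impossible", False):
                questions.append((qa, passage_idx))

# recover the answer text from its character offset
qa, passage_idx = questions[0]
answer = qa["answers"][0]
start = answer["answer_start"]
end = start + len(answer["text"])
print(passages[passage_idx][start:end])  # 2018
```

The answer is stored as a character offset into the passage, which is why the character positions have to be mapped to token positions later on.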


In [190]:
passage_lengths = [len(p) for p in passages_train]
print(f"Max passage length: {max(passage_lengths)}, Avg passage length: {sum(passage_lengths)/len(passage_lengths)}")

Max passage length: 2132, Avg passage length: 728.1698717948718


Note that the passage lengths above are measured in characters, not tokens: the passages average about 730 characters and the longest exceeds 2,100. BERT can take at most 512 tokens per sequence, and the longest passages, once concatenated with a question, may not fit. So we will instead take a shorter fixed-size context window for each question.

Since we will use WordPiece tokenization, we also need to be careful about converting the character positions of the start and end of the span to subword token positions.
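
Conceptually, the conversion looks up which token's character range covers a given character. The sketch below illustrates that idea with a hand-written offset list; the helper name `char_to_token_from_offsets` and the toy tokenization are assumptions for this example, not the tokenizer's actual implementation or output.

```python
def char_to_token_from_offsets(offsets, char_index):
    """Return the index of the token whose [start, end) character range covers char_index, else None."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


text = "playing chess"
# hand-made subword split and character offsets, for illustration only
tokens = ["play", "##ing", "chess"]
offsets = [(0, 4), (4, 7), (8, 13)]

answer_start_char = text.index("chess")             # first character of the answer
answer_end_char = answer_start_char + len("chess")  # one past the last character

print(char_to_token_from_offsets(offsets, answer_start_char))    # 2
print(char_to_token_from_offsets(offsets, answer_end_char - 1))  # 2
print(char_to_token_from_offsets(offsets, 7))                    # None (the space belongs to no token)
```

The last call shows why the end position is looked up at the last character of the answer rather than one past it: a character that falls on whitespace maps to no token at all.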

In [191]:
q_idx = 10
q = questions_train[q_idx][0]
passage_idx = questions_train[q_idx][1]
answer_start_pos = q['answers'][0]['answer_start']
answer_end_pos = answer_start_pos + len(q['answers'][0]['text'])
context = passages_train[passage_idx]

print(f"Question: {q['question']}")
print(f"Answer span: {context[answer_start_pos:answer_end_pos]}")

Question: What was the first album Beyoncé released as a solo artist?
Answer span: Dangerously in Love


In [None]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

In [192]:
print("\nSpecial tokens with their integer id:")
special_tokens = tokenizer.all_special_tokens
special_tokens_to_ids = {t:tokenizer.convert_tokens_to_ids(t) for t in special_tokens}
print(special_tokens_to_ids)


Special tokens with their integer id:
{'[UNK]': 100, '[SEP]': 102, '[PAD]': 0, '[CLS]': 101, '[MASK]': 103}


In [193]:
"""
    Method 1: Create window directly on the context string, before tokenization.
"""

window_size_chars = 500
# pick a random context window around the answer (try to keep at least 40% of the characters in window on the left side of the answer)
answer_middle_pos = int((answer_start_pos+answer_end_pos)/2) 
a = max(0,answer_end_pos-window_size_chars)
b = max(a, int(answer_start_pos - 0.4*window_size_chars))  # randint needs integer bounds with a <= b
random_window_start_pos = random.randint(a, b)
#window_start_pos = max(0,answer_middle_pos-window_size_chars)
#window_end_pos = answer_middle_pos+window_size_chars
window_start_pos = random_window_start_pos
window_end_pos = window_start_pos+window_size_chars

context_window = context[window_start_pos:window_end_pos]
print("Context window: ", context_window)

answer_start_pos_window = answer_start_pos - window_start_pos
answer_end_pos_window = answer_start_pos_window + len(q['answers'][0]['text'])
answer_window = context_window[answer_start_pos_window:answer_end_pos_window]
print(f"Answer window: {answer_window}")

# Trim off stray partial words at the beginning and end
context_window_words = context_window.split()
# only trim if the first stray word does not overlap with the answer span
if answer_start_pos_window > len(context_window_words[0]):
    context_window = ' '.join(context_window_words[1:-1])
    left_trim_length = len(context_window_words[0]) + 1 # add 1 for the white space between the stray partial first word and the next word
    answer_start_pos_window = answer_start_pos_window - left_trim_length
    answer_end_pos_window = answer_end_pos_window - left_trim_length

answer_window = context_window[answer_start_pos_window:answer_end_pos_window]
print(f"Answer window trimmed: {answer_window}")


# encode the (context window, question) pair; the context is the first sequence, so char_to_token below indexes into it
context_encoded = tokenizer.encode_plus(context_window, q['question'], add_special_tokens=True, return_offsets_mapping=True)
print(context_encoded.keys())
# convert character positions from original sentence to subword token positions
start_pos_enc = context_encoded.char_to_token(answer_start_pos_window)
end_pos_enc = context_encoded.char_to_token(answer_end_pos_window-1)
# get the corresponding subword token span
answer_span_encoded = context_encoded['input_ids'][start_pos_enc:end_pos_enc+1]
# decode the span to check if it matches original answer span
print(f"Decoded subword token span: {tokenizer.decode(answer_span_encoded)}")
print(f"Decoded sentence pair: {tokenizer.decode(context_encoded['input_ids'])}")

Context window:  songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awar
Answer window: Dangerously in Love
Answer window trimmed: Dangerously in Love
dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping'])
Decoded subword token span: dangerously in love
Decoded sentence pair: [CLS] record producer and actress. born and raised in houston, texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of r & b girl - group destiny's child. managed by her father, ma

In [194]:
print(f"Question: {q['question']}")
print(f"Context: {context}")
print(f"Answer span: {context[answer_start_pos:answer_end_pos]}")

Question: What was the first album Beyoncé released as a solo artist?
Context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
Answer span: Dangerously in Love


In [195]:
"""  
    Method 2: Tokenize first, then create context window around answer span. (Cleaner than method 1)
"""

# tokenize the context passage (no offset mapping is requested; char_to_token does the character-to-token lookup)
encoding = tokenizer.encode_plus(context, add_special_tokens=False, return_offsets_mapping=False, return_attention_mask=False, return_token_type_ids=False)
input_ids = encoding['input_ids']
#offset_mapping = encoding['offset_mapping']
print(f"Subword tokens: {tokenizer.convert_ids_to_tokens(input_ids)}")
#print(f"Offset mapping: {offset_mapping}")

# answer span start and end character positions
answer_start_char = q['answers'][0]['answer_start']
answer_end_char = answer_start_char + len(q['answers'][0]['text'])
print(f"Answer start char: {answer_start_char}, Answer end char: {answer_end_char}")    

# convert char positions to token positions
answer_start_token = encoding.char_to_token(answer_start_char)
answer_end_token = encoding.char_to_token(answer_end_char-1)
print(f"Answer start token: {answer_start_token}, Answer end token: {answer_end_token}")    

# now create a window around the answer span, pick the window start position randomly
window_size_tokens = 10

# range of legal starting positions
start_min = max(0, answer_end_token - window_size_tokens + 1)
start_max = answer_start_token
print(f"start min: {start_min}, start max: {start_max}")

window_start = random.randint(start_min, start_max)
window_end = window_start + window_size_tokens
print(f"Random window start: {window_start}, window end: {window_end}")

# select window of tokens
window_tokens = input_ids[window_start:window_end]
print(f"window tokens: {tokenizer.convert_ids_to_tokens(window_tokens)}")

# offset the answer span token positions by window start position
answer_start_token_window = answer_start_token - window_start
answer_end_token_window = answer_end_token - window_start
print(f"window answer start token: {answer_start_token_window}, window answer end token: {answer_end_token_window}")

Subword tokens: ['beyonce', 'gi', '##selle', 'knowles', '-', 'carter', '(', '/', 'bi', '##ː', '##ˈ', '##j', '##ɒ', '##nse', '##ɪ', '/', 'bee', '-', 'yo', '##n', '-', 'say', ')', '(', 'born', 'september', '4', ',', '1981', ')', 'is', 'an', 'american', 'singer', ',', 'songwriter', ',', 'record', 'producer', 'and', 'actress', '.', 'born', 'and', 'raised', 'in', 'houston', ',', 'texas', ',', 'she', 'performed', 'in', 'various', 'singing', 'and', 'dancing', 'competitions', 'as', 'a', 'child', ',', 'and', 'rose', 'to', 'fame', 'in', 'the', 'late', '1990s', 'as', 'lead', 'singer', 'of', 'r', '&', 'b', 'girl', '-', 'group', 'destiny', "'", 's', 'child', '.', 'managed', 'by', 'her', 'father', ',', 'mathew', 'knowles', ',', 'the', 'group', 'became', 'one', 'of', 'the', 'world', "'", 's', 'best', '-', 'selling', 'girl', 'groups', 'of', 'all', 'time', '.', 'their', 'hiatus', 'saw', 'the', 'release', 'of', 'beyonce', "'", 's', 'debut', 'album', ',', 'dangerously', 'in', 'love', '(', '2003', ')', ',

In [196]:
question_encoding = tokenizer.encode_plus(q['question'], add_special_tokens=False, return_offsets_mapping=False, return_attention_mask=False, return_token_type_ids=False)
question_idx = question_encoding['input_ids']        
input_idx = [special_tokens_to_ids["[CLS]"]] + question_idx + [special_tokens_to_ids["[SEP]"]] + window_tokens + [special_tokens_to_ids["[SEP]"]]   
print(tokenizer.convert_ids_to_tokens(input_idx))

['[CLS]', 'what', 'was', 'the', 'first', 'album', 'beyonce', 'released', 'as', 'a', 'solo', 'artist', '?', '[SEP]', ',', 'dangerously', 'in', 'love', '(', '2003', ')', ',', 'which', 'established', '[SEP]']
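
The window selection used in Method 2 can be wrapped in a small helper. This is a sketch under stated assumptions: the name `sample_window_start` and the explicit check for answers longer than the window are additions for illustration, and the token positions in the usage example are made up.

```python
import random


def sample_window_start(answer_start_token, answer_end_token, window_size, rng=random):
    """Pick a start so that tokens [start, start + window_size) contain the whole answer span."""
    answer_len = answer_end_token - answer_start_token + 1
    if answer_len > window_size:
        raise ValueError("answer span is longer than the window")
    # smallest start that still keeps the last answer token inside the window
    start_min = max(0, answer_end_token - window_size + 1)
    # largest start that still keeps the first answer token inside the window
    start_max = answer_start_token
    return rng.randint(start_min, start_max)


# usage with made-up token positions and a seeded generator for repeatability
rng = random.Random(0)
window_start = sample_window_start(50, 52, window_size=10, rng=rng)
window_end = window_start + 10
print(window_start <= 50 and 52 < window_end)  # True
```

Without the length check, `random.randint` would raise a less informative error whenever `start_min` ends up larger than `start_max`.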


Let's create a PyTorch dataset that handles the tokenization and context creation.

In [197]:
class SquadDataset(Dataset):
    def __init__(self, passages, questions, max_length=256, window_size_tokens=20):
        self.passages = passages
        self.questions = questions
        self.tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
        self.max_length = max_length
        self.window_size_tokens = window_size_tokens
        
    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        # get the question and context passage
        q = self.questions[idx][0]
        passage_idx = self.questions[idx][1]
        question = q['question']
        context = self.passages[passage_idx]
        # tokenize the context passage (use the dataset's own tokenizer rather than the global one)
        context_encoding = self.tokenizer.encode_plus(context, add_special_tokens=False, return_offsets_mapping=False, return_attention_mask=False, return_token_type_ids=False)
        context_idx = context_encoding['input_ids']
        # tokenize the question
        question_encoding = self.tokenizer.encode_plus(question, add_special_tokens=False, return_offsets_mapping=False, return_attention_mask=False, return_token_type_ids=False)
        question_idx = question_encoding['input_ids']

        # get answer span start and end character positions, for multiple answers, we will only use the first answer
        first_answer_idx = 0
        answer_start_char = q['answers'][first_answer_idx]['answer_start']
        answer_end_char = answer_start_char + len(q['answers'][first_answer_idx]['text'])
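        # convert char positions to token positions
        answer_start_token = context_encoding.char_to_token(answer_start_char)
        answer_end_token = context_encoding.char_to_token(answer_end_char-1)
        # char_to_token returns None when the character does not belong to any token (e.g. whitespace)
        if answer_start_token is None or answer_end_token is None:
            raise Exception(f"Start or end position is None! start_pos: {answer_start_token}, end_pos: {answer_end_token}")
        # the window can only contain the answer if the answer is not longer than the window
        if answer_end_token - answer_start_token + 1 > self.window_size_tokens:
            raise Exception(f"Answer span is longer than window_size_tokens {self.window_size_tokens}!")
        # now create a window around the answer span, pick the window start position randomly
        window_start = random.randint(max(0, answer_end_token - self.window_size_tokens + 1), answer_start_token)
        window_end = window_start + self.window_size_tokens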
        
        # select window of tokens
        window_tokens = context_idx[window_start:window_end]
        # offset the answer span token positions by window start position
        answer_start_token_window = answer_start_token - window_start
        answer_end_token_window = answer_end_token - window_start
        # concatenate the question and context, add special tokens and padding
        input_idx = [self.tokenizer.cls_token_id] + question_idx + [self.tokenizer.sep_token_id] + window_tokens + [self.tokenizer.sep_token_id]
        # make sure the input sequence is not longer than max_length
        if len(input_idx) > self.max_length:
            raise Exception(f"Input sequence length {len(input_idx)} is longer than max_length {self.max_length}!")

        input_idx = input_idx + [self.tokenizer.pad_token_id]*(self.max_length-len(input_idx))
        # offset the answer span token positions again by the length of the question and the two special tokens ([CLS] and [SEP])
        answer_start_token_window = answer_start_token_window + len(question_idx) + 2
        answer_end_token_window = answer_end_token_window + len(question_idx) + 2
        # create attention mask
        attn_mask = [1 if idx != self.tokenizer.pad_token_id else 0 for idx in input_idx]

        # convert to tensors
        input_idx = torch.tensor(input_idx)
        attn_mask = torch.tensor(attn_mask)
        start_pos_enc = torch.tensor(answer_start_token_window)
        end_pos_enc = torch.tensor(answer_end_token_window)
        return input_idx, start_pos_enc, end_pos_enc, attn_mask

In [198]:
train_dataset = SquadDataset(passages_train, questions_train, window_size_tokens=100, max_length=128)
val_dataset = SquadDataset(passages_val, questions_val, window_size_tokens=100, max_length=128)

2282


In [199]:
idx = 2220
input_idx, start_pos_enc, end_pos_enc, attn_mask = train_dataset[idx]
input_tokens = tokenizer.convert_ids_to_tokens(input_idx)
print(f"Input idx: {input_tokens}")
print(f"Answer span: {input_tokens[start_pos_enc:end_pos_enc+1]}")

Input idx: ['[CLS]', 'when', 'were', 'track', 'versions', 'of', 'the', 'game', "'", 's', 'so', '##unt', '##rac', '##k', 'released', '?', '[SEP]', 'his', 'preference', 'for', 'live', 'instruments', '.', 'he', 'originally', 'envisioned', 'a', 'full', '50', '-', 'person', 'orchestra', 'for', 'action', 'sequences', 'and', 'a', 'string', 'quartet', 'for', 'more', '"', 'lyrical', 'moments', '"', ',', 'though', 'the', 'final', 'product', 'used', 'sequence', '##d', 'music', 'instead', '.', 'ko', '##ndo', 'later', 'cited', 'the', 'lack', 'of', 'interact', '##ivity', 'that', 'comes', 'with', 'orchestral', 'music', 'as', 'one', 'of', 'the', 'main', 'reasons', 'for', 'the', 'decision', '.', 'both', 'six', '-', 'and', 'seven', '-', 'track', 'versions', 'of', 'the', 'game', "'", 's', 'soundtrack', 'were', 'released', 'on', 'november', '19', ',', '2006', ',', 'as', 'part', 'of', 'a', 'nintendo', 'power', 'promotion', 'and', 'bundled', 'with', 'replica', '##s', 'of', 'the', 'master', '[SEP]', '[PAD]',
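
The sequence layout and the label shift performed in `__getitem__` can be checked on a small example. The sketch below is a plain-Python illustration with made-up token ids; the function name `build_input` is hypothetical, and the default special-token ids are the ones printed earlier for `bert-base-uncased` (`[CLS]`=101, `[SEP]`=102, `[PAD]`=0).

```python
def build_input(question_ids, window_ids, answer_start, answer_end,
                max_length, cls_id=101, sep_id=102, pad_id=0):
    """Lay out [CLS] question [SEP] window [SEP] + padding and shift the answer labels accordingly."""
    ids = [cls_id] + question_ids + [sep_id] + window_ids + [sep_id]
    if len(ids) > max_length:
        raise ValueError(f"sequence length {len(ids)} exceeds max_length {max_length}")
    num_pad = max_length - len(ids)
    # real tokens get mask 1, padding gets mask 0
    attn_mask = [1] * len(ids) + [0] * num_pad
    ids = ids + [pad_id] * num_pad
    # the window starts after [CLS], the question tokens and the first [SEP]
    shift = len(question_ids) + 2
    return ids, attn_mask, answer_start + shift, answer_end + shift


question_ids = [7, 8, 9]        # made-up ids for a 3-token question
window_ids = [20, 21, 22, 23]   # made-up ids for a 4-token context window
# the answer covers window positions 1..2, i.e. ids 21 and 22
ids, attn_mask, start, end = build_input(question_ids, window_ids, 1, 2, max_length=12)

print(ids)                   # [101, 7, 8, 9, 102, 20, 21, 22, 23, 102, 0, 0]
print(attn_mask)             # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(ids[start:end + 1])    # [21, 22]
```

Building the mask from the unpadded length, rather than by comparing ids against the pad id, is a small design choice here: it does not rely on the pad id never occurring among real tokens.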

#### Now define the classification model.

In [61]:
class BERTExtractiveQA(torch.nn.Module):
    def __init__(self, hidden_size=768, dropout_rate=0.1, finetune=False):
        super().__init__()
        # load pretrained BERT model
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = torch.nn.Dropout(dropout_rate)      
        # define two classifier heads, one for predicting start of span and another for end of span 
        self.classifier_head_start_span = torch.nn.Linear(hidden_size, 1)
        self.classifier_head_end_span = torch.nn.Linear(hidden_size, 1)

        for param in self.bert.parameters():
            if finetune:
                # make all parameters of BERT model trainable if we're finetuning
                param.requires_grad = True
            else:
                # freeze all parameters of BERT model if we're not finetuning
                param.requires_grad = False

    def forward(self, input_idx, labels_start, labels_end, attn_mask):
        # compute BERT encodings
        bert_output = self.bert(input_idx, attention_mask=attn_mask)
        bert_output = bert_output.last_hidden_state # shape: (batch_size, sequence_length, hidden_size)
        # compute logits/scores over tokens for each of the classifier heads
        logits_start = self.classifier_head_start_span(bert_output).squeeze(-1)  # shape: (batch_size, sequence_length)
        logits_end = self.classifier_head_end_span(bert_output).squeeze(-1)  # shape: (batch_size, sequence_length)
        # compute loss
        loss = F.cross_entropy(logits_start, labels_start) + F.cross_entropy(logits_end, labels_end) 

        return logits_start, logits_end, loss
    

# training loop
def train(model, optimizer, train_dataloader, val_dataloader, scheduler=None, device="cpu", num_epochs=10, val_every=1, save_every=None, log_metrics=None):
    avg_loss = 0
    train_acc = 0
    val_loss = 0
    val_acc = 0
    model.train()
    for epoch in range(num_epochs):
        num_correct = 0
        num_total = 0
        pbar = tqdm(train_dataloader, desc="Epochs")
        for batch in pbar:
            inputs, targets_start, targets_end, attn_mask = batch
            # move batch to device
            inputs, targets_start, targets_end, attn_mask = inputs.to(device), targets_start.to(device), targets_end.to(device), attn_mask.to(device)
            # forward pass
            logits_start, logits_end, loss = model(inputs, targets_start, targets_end, attn_mask)
            # reset gradients
            optimizer.zero_grad()
            # backward pass
            loss.backward()
            # optimizer step
            optimizer.step()
            avg_loss = 0.9* avg_loss + 0.1*loss.item()
            B, _ = inputs.shape
            y_pred_start = logits_start.argmax(dim=-1).view(-1) # shape (B,)
            y_pred_end = logits_end.argmax(dim=-1).view(-1) # shape (B,)
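            # exact match: both start and end must be right (adding two bool tensors stays bool, so `== 2` never holds; use & instead)
            num_correct += (y_pred_start.eq(targets_start.view(-1)) & y_pred_end.eq(targets_end.view(-1))).sum().item()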
            num_total += B
            train_acc = num_correct / num_total        
            
            pbar.set_description(f"Epoch {epoch + 1}, EMA Train Loss: {avg_loss:.3f}, Train Accuracy: {train_acc: .3f}, Val Loss: {val_loss: .3f}, Val Accuracy: {val_acc: .3f}")  

            if log_metrics:
                metrics = {"Batch loss" : loss.item(), "Moving Avg Loss" : avg_loss, "Val Loss": val_loss}
                log_metrics(metrics)

        if scheduler is not None:
            scheduler.step()
        
        if val_every is not None:
            if epoch%val_every == 0:
                # compute validation loss
                val_loss, val_acc = validation(model, val_dataloader, device=device)
                pbar.set_description(f"Epoch {epoch + 1}, EMA Train Loss: {avg_loss:.3f}, Train Accuracy: {train_acc: .3f}, Val Loss: {val_loss: .3f}, Val Accuracy: {val_acc: .3f}") 

        if save_every is not None:
            if (epoch+1) % save_every == 0:
                save_model_checkpoint(model, optimizer, epoch, avg_loss)


def validation(model, val_dataloader, device="cpu"):
    model.eval()
    val_losses = torch.zeros(len(val_dataloader))
    with torch.no_grad():
        num_correct = 0
        num_total = 0
        for i,batch in enumerate(val_dataloader):
            inputs, targets_start, targets_end, attn_mask = batch
            inputs, targets_start, targets_end, attn_mask = inputs.to(device), targets_start.to(device), targets_end.to(device), attn_mask.to(device)
            logits_start, logits_end, loss = model(inputs, targets_start, targets_end, attn_mask)
            B, _ = inputs.shape
            y_pred_start = logits_start.argmax(dim=-1).view(-1) # shape (B,)
            y_pred_end = logits_end.argmax(dim=-1).view(-1) # shape (B,)
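            # exact match: both start and end must be right (adding two bool tensors stays bool, so `== 2` never holds; use & instead)
            num_correct += (y_pred_start.eq(targets_start.view(-1)) & y_pred_end.eq(targets_end.view(-1))).sum().item()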
            num_total += B
            val_losses[i] = loss.item()
    model.train()
    val_loss = val_losses.mean().item()
    val_accuracy = num_correct / num_total
    return val_loss, val_accuracy


def save_model_checkpoint(model, optimizer, epoch=None, loss=None, filename=None):
    # Save the model and optimizer state_dict
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }

    # Save the checkpoint to a file
    if filename:
        torch.save(checkpoint, filename)
    else:
        torch.save(checkpoint, 'qa_checkpoint.pth')
    print(f"Saved model checkpoint!")


def load_model_checkpoint(model, optimizer, filename=None):
    if filename:
        checkpoint = torch.load(filename)
    else:
        checkpoint = torch.load('qa_checkpoint.pth')
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    model.train()
    print("Loaded model from checkpoint!")
    return model, optimizer             

Now let's train the model with and without finetuning

In [62]:
block_size = 256
B = 64
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available
learning_rate = 5e-3


train_dataloader = DataLoader(train_dataset, batch_size=B, shuffle=True) #, pin_memory=True, num_workers=2)
val_dataloader = DataLoader(val_dataset, batch_size=B, shuffle=False) # no need to shuffle for validation #, pin_memory=True, num_workers=2)

# model with finetuning disabled
model = BERTExtractiveQA(finetune=False).to(DEVICE)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler =  torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.95)
#model, optimizer = load_model_checkpoint(model, optimizer)

num_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters in transformer network: {num_params/1e6} M")
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")


Total number of parameters in transformer network: 109.483778 M
RAM used: 6661.90 MB


In [41]:
inputs, targets_start, targets_end, attn_mask = next(iter(train_dataloader))
print(inputs.shape, targets_start.shape, targets_end.shape, attn_mask.shape)

torch.Size([64, 256]) torch.Size([64]) torch.Size([64]) torch.Size([64, 256])


In [63]:
train(model, optimizer, train_dataloader, val_dataloader, device=DEVICE, num_epochs=1, save_every=50, val_every=1) 

Epoch 1, EMA Train Loss: 6.158, Train Accuracy:  0.000, Val Loss:  0.000, Val Accuracy:  0.000: 100%|██████████| 1357/1357 [11:35<00:00,  1.95it/s]

start pos char: 351, end pos char: 403
decoded sentence pair: [CLS] earlier they surrendered to the mongols, the higher they were placed, the more the held out, the lower they were ranked. the northern chinese were ranked higher and southern chinese were ranked lower because southern china withstood and fought to the last before caving in. major commerce during this era gave rise to favorable conditions for private southern chinese manufacturers and merchants. [SEP] who did the yuan's increase in commerce help? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PA




Exception: Start or end position is None! start_pos: 66, end_pos: None
