In [1]:
!nvidia-smi -L

GPU 0: Tesla K80 (UUID: GPU-180cacfa-104a-ea8c-e65a-0642bc7e6946)


## Fetching data

In [2]:
## Downloading data. For more information about the data, see https://github.com/dmis-lab/biobert
!gdown --id 19ft5q44W4SuptJgTwR84xZjsHg1jvjSZ

Downloading...
From: https://drive.google.com/uc?id=19ft5q44W4SuptJgTwR84xZjsHg1jvjSZ
To: /content/QA.zip
  0% 0.00/5.48M [00:00<?, ?B/s] 86% 4.72M/5.48M [00:00<00:00, 26.2MB/s]100% 5.48M/5.48M [00:00<00:00, 25.5MB/s]


In [3]:
!unzip -q QA.zip -d QA_dataset

**The folder QA_dataset/BioASQ contains many JSON files. We will use the train-factoid files to fine-tune our model.**

## Importing necessary libraries :

In [4]:
import os
import json
import pandas as pd
import numpy as np
import random

## Extracting questions & answers from JSON files :

In [5]:
def extract_QA(filepath):
  """Flatten a SQuAD-style JSON file into one row per (question, context, answer)."""
  with open(filepath, 'r') as f:
    data = json.load(f)

  paragraphs = data['data'][0]['paragraphs']
  data_rows = []

  for paragraph in paragraphs:
    context = paragraph['context']
    for qa in paragraph['qas']:
      question = qa["question"]
      answers = qa['answers']

      for answer in answers:
        answer_text = answer['text']
        answer_start = answer['answer_start']
        # Inclusive character index of the last character of the answer
        answer_end = answer_start + len(answer_text) - 1

        data_rows.append({
            "question": question,
            "context": context,
            "answer_text": answer_text,
            "answer_start": answer_start,
            "answer_end": answer_end
        })
  return pd.DataFrame(data_rows)
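
To make the SQuAD-style layout and the inclusive `answer_end` concrete, here is a minimal sketch on a made-up record. The sample text and offsets are invented for illustration and are not taken from the BioASQ files.

```python
# Toy SQuAD-style record; the text and offsets are made up for illustration.
sample = {
    "data": [{
        "paragraphs": [{
            "context": "Tularemia is caused by Francisella tularensis.",
            "qas": [{
                "question": "What organism causes tularemia?",
                "answers": [{"text": "Francisella tularensis", "answer_start": 23}],
            }],
        }]
    }]
}

rows = []
for paragraph in sample["data"][0]["paragraphs"]:
    context = paragraph["context"]
    for qa in paragraph["qas"]:
        for answer in qa["answers"]:
            start = answer["answer_start"]
            # Inclusive index of the last answer character
            end = start + len(answer["text"]) - 1
            # Because `end` is inclusive, slicing needs end + 1
            assert context[start:end + 1] == answer["text"]
            rows.append((qa["question"], answer["text"], start, end))

print(rows)
```

Because the end offset is inclusive, it points at a real character of the answer, which is handy later when mapping characters to tokens.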

In [6]:
dfs = []
filepaths = ['/content/QA_dataset/BioASQ/BioASQ-train-factoid-4b.json',
             '/content/QA_dataset/BioASQ/BioASQ-train-factoid-5b.json',
             '/content/QA_dataset/BioASQ/BioASQ-train-factoid-6b.json']

for filepath in filepaths:
  df = extract_QA(filepath)
  print(df.shape)
  dfs.append(df)             

(3266, 5)
(4950, 5)
(4772, 5)


In [7]:
df_conc = pd.concat(dfs)
print(df_conc.shape)
df_conc.head()

(12988, 5)


Unnamed: 0,question,context,answer_text,answer_start,answer_end
0,What is the inheritance pattern of Li–Fraumeni...,Balanced t(11;15)(q23;q15) in a TP53+/+ breast...,autosomal dominant,213,230
1,What is the inheritance pattern of Li–Fraumeni...,Genetic modeling of Li-Fraumeni syndrome in ze...,autosomal dominant,105,122
2,Which type of lung cancer is afatinib used for?,Clinical perspective of afatinib in non-small ...,EGFR-mutant NSCLC,1203,1219
3,Which hormone abnormalities are characteristic...,"DOCA sensitive pendrin expression in kidney, h...",thyroid,419,425
4,Which hormone abnormalities are characteristic...,Clinical and molecular characteristics of Pend...,thyroid,705,711


**The total number of context-question pairs is 12988, but not all of them are unique.**

In [None]:
## Number of unique contexts
len(df_conc['context'].unique())

2582

In [None]:
## Number of unique questions
len(df_conc['question'].unique())

443

**We have 2582 unique contexts and 443 unique questions. We will keep only one row per unique context for fine-tuning our model.**

In [8]:
## Dropping duplicate contexts
df_deduped = df_conc.drop_duplicates(subset=["context"], keep='first')
df_deduped.head()

Unnamed: 0,question,context,answer_text,answer_start,answer_end
0,What is the inheritance pattern of Li–Fraumeni...,Balanced t(11;15)(q23;q15) in a TP53+/+ breast...,autosomal dominant,213,230
1,What is the inheritance pattern of Li–Fraumeni...,Genetic modeling of Li-Fraumeni syndrome in ze...,autosomal dominant,105,122
2,Which type of lung cancer is afatinib used for?,Clinical perspective of afatinib in non-small ...,EGFR-mutant NSCLC,1203,1219
3,Which hormone abnormalities are characteristic...,"DOCA sensitive pendrin expression in kidney, h...",thyroid,419,425
4,Which hormone abnormalities are characteristic...,Clinical and molecular characteristics of Pend...,thyroid,705,711


In [9]:
df_deduped.shape

(2582, 5)
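
One side effect of deduplicating on `context` with `keep='first'` is that any further questions attached to the same context are dropped. The sketch below mimics that behaviour in plain Python on hypothetical rows (the `Q`/`C` labels are placeholders, not real data).

```python
# Hypothetical (question, context) rows; two questions share context "C1".
rows = [
    ("Q1", "C1"),
    ("Q2", "C1"),
    ("Q2", "C2"),
    ("Q3", "C3"),
]

first_per_context = {}
for question, context in rows:
    # setdefault stores only the first row seen for each context,
    # which mirrors keep='first'
    first_per_context.setdefault(context, (question, context))

deduped = list(first_per_context.values())
dropped = [row for row in rows if row not in deduped]

print(deduped)
print(dropped)
```

Here the pair `("Q2", "C1")` is lost, so the model never sees that question asked against that context.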

In [None]:
# df_deduped.to_csv("Biological_QA.csv", index=False)
# print("File saved !!!")

## Splitting data into 2 parts :

In [10]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df_deduped, test_size=0.15, random_state = 2021)
print(df_train.shape)
print(df_test.shape)

(2194, 5)
(388, 5)
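
The split sizes above can be checked by hand. This is a small arithmetic sketch, assuming the test share is rounded up and the remainder goes to the training set (which is how I understand `train_test_split` to behave for a float `test_size`).

```python
import math

n_samples = 2582
test_size = 0.15

# Assumption: the test share is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```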


In [11]:
## Filtering questions: keep rows with a positive answer_start
## (note that `> 0` also drops answers that begin at character 0)
df_train = df_train[df_train["answer_start"] > 0]
df_test = df_test[df_test["answer_start"] > 0]
print(df_train.shape)
print(df_test.shape)

(2136, 5)
(378, 5)


In [12]:
train_contexts = list(df_train['context'].values)
train_questions = list(df_train['question'].values)
train_answers = list(df_train['answer_text'].values)
train_answers_start = list(df_train['answer_start'].values)
train_answers_end = list(df_train['answer_end'].values)

## The lines below are for safety check purposes :)
assert len(train_contexts) == len(train_questions)
assert len(train_contexts) == len(train_answers)
assert len(train_contexts) == len(train_answers_start)
assert len(train_contexts) == len(train_answers_end)

## Fine-tuning model :

**I will take a pretrained DistilBERT model and fine-tune it on the new dataset.**

In [13]:
!pip install transformers -q


In [14]:
from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering

In [15]:
MODEL_NAME = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)

train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

In [None]:
# train_encodings.char_to_token(4,705)

In [None]:
# train_encodings.token_to_chars(171)

CharSpan(start=723, end=725)

In [16]:
def add_token_positions(tokenizer, encodings, answers_start, answers_end):
    start_positions = []
    end_positions = []
    for i, (st,en) in enumerate(zip(answers_start, answers_end)):
        # Map the character offsets of the answer to token indices in example i
        start_positions.append(encodings.char_to_token(i, st))
        end_positions.append(encodings.char_to_token(i, en))

        # If a position is None, the answer was truncated out of the encoding;
        # fall back to model_max_length, an index past the last token
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length

    print(start_positions)
    print(end_positions)
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(tokenizer, train_encodings, train_answers_start, train_answers_end)

[150, 249, 69, 83, 38, 478, 208, 185, 59, 80, 278, 104, 35, 46, 122, 69, 178, 35, 23, 71, 170, 107, 17, 95, 34, 39, 50, 73, 36, 33, 55, 172, 13, 75, 152, 57, 172, 37, 86, 6, 39, 41, 11, 26, 42, 7, 22, 84, 96, 114, 36, 21, 42, 311, 17, 209, 92, 37, 27, 278, 7, 111, 34, 11, 227, 83, 54, 259, 26, 137, 152, 77, 48, 60, 21, 65, 23, 429, 58, 36, 37, 55, 31, 17, 109, 46, 8, 49, 145, 136, 289, 512, 23, 121, 59, 233, 91, 82, 64, 56, 215, 64, 31, 6, 56, 105, 185, 115, 87, 44, 229, 26, 27, 37, 84, 82, 29, 69, 95, 180, 109, 85, 131, 40, 30, 8, 37, 122, 67, 181, 87, 48, 115, 137, 142, 275, 7, 18, 124, 162, 40, 27, 26, 17, 106, 42, 144, 111, 45, 60, 57, 7, 297, 248, 214, 26, 97, 91, 39, 31, 127, 55, 113, 15, 94, 20, 15, 62, 95, 24, 295, 83, 25, 24, 21, 121, 30, 57, 48, 8, 102, 124, 57, 9, 38, 31, 388, 72, 27, 142, 29, 53, 51, 24, 432, 8, 37, 39, 45, 333, 219, 42, 177, 195, 24, 69, 67, 34, 33, 4, 109, 215, 54, 107, 10, 52, 180, 61, 147, 31, 46, 129, 236, 103, 50, 443, 266, 339, 85, 9, 28, 96, 51, 79,
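
To show what the character-to-token mapping does, here is a toy sketch that uses whitespace tokens instead of WordPiece. The helpers `whitespace_offsets` and `char_to_token` below are hypothetical stand-ins written for this illustration, not the tokenizer's own implementation.

```python
def whitespace_offsets(text):
    """Return (start, end) character spans of whitespace-separated tokens (end exclusive)."""
    spans, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        spans.append((start, end))
        pos = end
    return spans


def char_to_token(spans, char_index):
    """Return the index of the token covering char_index, or None if no token covers it."""
    for token_index, (start, end) in enumerate(spans):
        if start <= char_index < end:
            return token_index
    return None


context = "Tularemia is caused by Francisella tularensis"
spans = whitespace_offsets(context)

answer_start, answer_end = 23, 44  # inclusive character offsets of the answer
start_token = char_to_token(spans, answer_start)
end_token = char_to_token(spans, answer_end)
space_token = char_to_token(spans, 9)  # character 9 is a space

print(start_token, end_token, space_token)
```

In this toy version a character that belongs to no token maps to `None`, which is the same signal the function above uses to detect answers lost to truncation.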

In [17]:
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [18]:
class QADataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings.input_ids)

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

In [19]:
train_dataset = QADataset(train_encodings)
train_dataloader = DataLoader(train_dataset, 
                              sampler = RandomSampler(train_dataset), 
                              batch_size=16,
                              num_workers=2,
                              pin_memory=True)

# # For validation the order doesn't matter, so we'll just read them sequentially.
# validation_dataloader = DataLoader(
#             val_dataset, # The validation samples.
#             sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
#             batch_size = batch_size # Evaluate with this batch size.
#         )

In [20]:
def seed_everything(seed=2021):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    
seed_everything()

In [22]:
import time
from tqdm import tqdm
from transformers import AdamW, get_linear_schedule_with_warmup

In [23]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model = DistilBertForQuestionAnswering.from_pretrained(MODEL_NAME)
model.to(device)
model.train()

num_epochs = 10
optim = AdamW(model.parameters(), 
              lr=2e-5,
              correct_bias=False)

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * num_epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optim, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)

for epoch in range(num_epochs):
    for idx, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):

        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        outputs = model(input_ids, 
                        attention_mask=attention_mask, 
                        start_positions=start_positions, 
                        end_positions=end_positions)
        loss = outputs.loss
        
        ## Backpropagating
        loss.backward()

        ## Gradient clipping - avoids exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optim.step()
        # Update the learning rate.
        scheduler.step()

print("\nDONE !!!")

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this mode


DONE !!!
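
For intuition about the learning-rate schedule used above, here is a rough pure-Python sketch of a linear warmup and decay factor. The function name `linear_schedule_factor` is my own, and the formula is my understanding of what a linear schedule with warmup computes, not code copied from the library.

```python
import math


def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
# 2136 training rows in batches of 16; the last partial batch is kept
steps_per_epoch = math.ceil(2136 / 16)
total_steps = steps_per_epoch * 10

for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With zero warmup steps the learning rate starts at its full value and falls linearly to zero by the final step.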





In [None]:
model.eval()

DistilBertForQuestionAnswering(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            

## Evaluating the model :

Now we evaluate the model's predictions on data it has not seen during training.

In [None]:
test_contexts = list(df_test['context'].values)
test_questions = list(df_test['question'].values)
test_answers = list(df_test['answer_text'].values)

In [None]:
idx = 89
test_context, test_question, test_answer = test_contexts[idx], test_questions[idx], test_answers[idx]
print(test_question)
print(test_context)

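# Note: the question is passed first here, whereas the training data was encoded as (context, question)
test_encoding = tokenizer.encode_plus(text=test_question, text_pair=test_context, add_special_tokens=True)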

What organism causes tularemia?
Molecular Detection of Persistent Francisella tularensis Subspecies holarctica in Natural Waters. Tularemia, caused by the bacterium Francisella tularensis, where F. tularensis subspecies holarctica has long been the cause of endemic disease in parts of northern Sweden. Despite this, our understanding of the natural life-cycle of the organism is still limited. During three years, we collected surface water samples (n = 341) and sediment samples (n = 245) in two areas in Sweden with endemic tularemia. Real-time PCR screening demonstrated the presence of F. tularenis lpnA sequences in 108 (32%) and 48 (20%) of the samples, respectively. The 16S rRNA sequences from those samples all grouped to the species F. tularensis. Analysis of the FtM19InDel region of lpnA-positive samples from selected sampling points confirmed the presence of F. tularensis subspecies holarctica-specific sequences. These sequences were detected in water sampled during both outbreak an

In [None]:
inputs = test_encoding['input_ids']  # Token IDs (vocabulary indices, not embeddings)
attn_mask = test_encoding['attention_mask']
tokens = tokenizer.convert_ids_to_tokens(inputs)  # Input tokens as strings
print(inputs)
print(attn_mask)
print(tokens)
print(len(attn_mask),len(tokens))

# Telling the model not to compute or store gradients, saving memory and speeding up prediction
with torch.no_grad():
  output = model(input_ids=torch.tensor([inputs]).to(device=device), 
                 attention_mask = torch.tensor([attn_mask]).to(device=device))                                  

[101, 2054, 15923, 5320, 10722, 8017, 17577, 1029, 102, 8382, 10788, 1997, 14516, 4557, 8411, 10722, 8017, 9911, 11056, 7570, 8017, 13306, 2050, 1999, 3019, 5380, 1012, 10722, 8017, 17577, 1010, 3303, 2011, 1996, 24024, 4557, 8411, 10722, 8017, 9911, 1010, 2073, 1042, 1012, 10722, 8017, 9911, 11056, 7570, 8017, 13306, 2050, 2038, 2146, 2042, 1996, 3426, 1997, 7320, 4295, 1999, 3033, 1997, 2642, 4701, 1012, 2750, 2023, 1010, 2256, 4824, 1997, 1996, 3019, 2166, 1011, 5402, 1997, 1996, 15923, 2003, 2145, 3132, 1012, 2076, 2093, 2086, 1010, 2057, 5067, 3302, 2300, 8168, 1006, 1050, 1027, 28358, 1007, 1998, 19671, 8168, 1006, 1050, 1027, 21005, 1007, 1999, 2048, 2752, 1999, 4701, 2007, 7320, 10722, 8017, 17577, 1012, 2613, 1011, 2051, 7473, 2099, 11326, 7645, 1996, 3739, 1997, 1042, 1012, 10722, 8017, 18595, 2015, 6948, 2532, 10071, 1999, 10715, 1006, 3590, 1003, 1007, 1998, 4466, 1006, 2322, 1003, 1007, 1997, 1996, 8168, 1010, 4414, 1012, 1996, 2385, 2015, 25269, 2532, 10071, 2013, 2216, 8

## Note :

You may run into trouble when converting BERT tokens back to the original sentence. This is because BERT uses WordPiece tokenization, which is not lossless, i.e., you are not guaranteed to get the same sentence back after detokenization (the uncased model, for example, lowercases the text). This is a big difference from RoBERTa, whose byte-level BPE tokenizer is fully reversible.

You can recover the so-called pre-tokenized text by merging each token that starts with ## into the token before it.

For more details, see https://stackoverflow.com/questions/66232938/how-to-untokenize-bert-tokens
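
The merging rule described above can be sketched in a few lines. `merge_wordpieces` is a hypothetical helper written for this note; it only handles the `##` continuation marker and does not try to restore casing or spacing around punctuation.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens, gluing '##' continuations onto the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: append to the previous word without the marker
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


merged = merge_wordpieces(['francis', '##ella', 'tu', '##lar', '##ensis'])
print(merged)
```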

In [None]:
start_index = torch.argmax(output.start_logits)
end_index = torch.argmax(output.end_logits)

pred_tokens = tokens[start_index:end_index+1]
pred_answer = tokenizer.convert_tokens_to_string(pred_tokens)

print(test_answer)
print(pred_answer)
print(pred_tokens)

Francisella tularensis
francisella tularensis
['francis', '##ella', 'tu', '##lar', '##ensis']
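
Taking the two argmaxes independently works here, but it can produce an end index that comes before the start index, which yields an empty answer. A possible alternative, sketched below with made-up logits, is to pick the highest-scoring pair with `start <= end`. The function `best_span` and the `max_answer_len` limit are assumptions of this sketch, not part of the notebook's pipeline.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed logit, with start <= end."""
    best, best_score = (None, None), float("-inf")
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best


# Made-up logits where the independent argmaxes give start=3, end=2 (invalid)
start_logits = [0.1, 2.0, 0.3, 5.0]
end_logits = [0.2, 0.1, 4.0, 0.5]

naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
span = best_span(start_logits, end_logits)

print(naive, span)
```

With real model output you would first convert `output.start_logits[0]` and `output.end_logits[0]` to Python lists before calling a helper like this.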


**That's spot on !!!**
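
A single example is encouraging, but to judge the model on the whole test set you would want a score. Below is a minimal sketch of SQuAD-style exact match and token-level F1. The normalization steps (lowercasing, stripping punctuation and English articles) are an assumption borrowed from common QA evaluation practice, not something this notebook computes.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return int(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("francisella tularensis", "Francisella tularensis")
partial_f1 = f1_score("francisella", "Francisella tularensis")
print(em, partial_f1)
```

Normalizing before comparison matters here because the uncased model returns a lowercase answer while the reference answer is capitalized.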