<a href="https://www.kaggle.com/code/trungcnguyn/diss-biomedqa-context?scriptVersionId=192758013" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Explanation:** 
* (1) Because of Kaggle resource limits, the code for this project cannot be run in a single notebook (each trained model is around 1.3 GB or more), so it was split into two notebooks. This notebook should be read AFTER the no-context notebook.

* (2) This project explores several language models, including PubMedBERT. The pretrained model hosted on Hugging Face has since been renamed to BiomedNLP/BiomedBERT, which can cause confusion.

* (3) Sections 1 and 3 are copied from the first notebook to reproduce the data for training.


# 1. Importing 

## 1.1. Importing libraries

In [1]:
import shutil
import os

# Remove the directory if it already exists
dir_name = 'neural_medical_qa'
if os.path.exists(dir_name):
    shutil.rmtree(dir_name)

# Clone the GitHub repo to load the code into Kaggle
!git clone https://github.com/trduc97/neural_medical_qa.git
%cd neural_medical_qa
# Install the requirements
!pip install -r requirements.txt

Cloning into 'neural_medical_qa'...

remote: Enumerating objects: 162, done.

remote: Counting objects: 100% (22/22), done.

remote: Compressing objects: 100% (22/22), done.

remote: Total 162 (delta 11), reused 0 (delta 0), pack-reused 140 (from 1)

Receiving objects: 100% (162/162), 1.82 MiB | 21.00 MiB/s, done.

Resolving deltas: 100% (81/81), done.

/kaggle/working/neural_medical_qa

In [2]:
from classifiers import QAModel, BiLSTMmodel
from train_and_test import Trainandtest
from processing import load_bioasq_pubmedqa, pubmed_train_test_split, result_convert
from datasets import Dataset, DatasetDict
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

## 1.2. Importing data 

### 1.2.1. Importing PubMedQA and BioASQ

In [3]:
bioasq, pubmedqa = load_bioasq_pubmedqa()

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

### 1.2.2. Importing PubMedQA artificial

In [4]:
_, pubmedqa_artificial = load_bioasq_pubmedqa(pubmed_kaggle_path='/kaggle/input/pubmed-qa/pubmed_qa_pga_artificial.parquet')

Map:   0%|          | 0/211269 [00:00<?, ? examples/s]

Map:   0%|          | 0/211269 [00:00<?, ? examples/s]

## 1.3. Preprocessing

### 1.3.1 Concatenating the full context for PubMedQA

In [5]:
# Extracting the contexts
pubmed_text = pd.DataFrame(pubmedqa['train']['context'])
pubmed_text['full_context'] = pubmed_text['contexts'].apply(lambda x: ' '.join(x))
# Convert to a DataFrame
pubmedqa_train_df = pd.DataFrame(pubmedqa['train'])
pubmedqa_train_df['full_context']= pubmed_text['full_context']

# Convert the DataFrame back to a Dataset
pubmedqa_context = Dataset.from_pandas(pubmedqa_train_df)

# Create a DatasetDict
pubmedqa = DatasetDict({
    'train': pubmedqa_context
})

### 1.3.2. Concatenating the full context for PubMedQA artificial data

In [6]:
# Extracting the contexts
pubmed_text = pd.DataFrame(pubmedqa_artificial['train']['context'])
pubmed_text['full_context'] = pubmed_text['contexts'].apply(lambda x: ' '.join(x))
# Convert to a DataFrame
pubmedqa_train_df = pd.DataFrame(pubmedqa_artificial['train'])
pubmedqa_train_df['full_context']= pubmed_text['full_context']

# Convert the DataFrame back to a Dataset
pubmedqa_arti_context = Dataset.from_pandas(pubmedqa_train_df)

# Create a DatasetDict
pubmedqa_artificial = DatasetDict({
    'train': pubmedqa_arti_context
})

# 3. Splitting and mixing data for training

## 3.1. Splitting PubMedQA and BioASQ

In [7]:
# Split with a 70-30 ratio
pubmedqa_train, pubmedqa_test = pubmed_train_test_split(pubmedqa)
bioasq_train, bioasq_test = pubmed_train_test_split(bioasq)

## 3.2. Mixing Artificial data with PubMedQA labeled training data

In [8]:
# Convert the pubmedqa_artificial dataset to a pandas df
df_artificial = pd.DataFrame(pubmedqa_artificial['train'])
# Separate the df by class
df_class_0 = df_artificial[df_artificial['decision_encoded'] ==0]
df_class_2 = df_artificial[df_artificial['decision_encoded'] ==2]
# Calculate the number of samples needed from each class
samples_per_class=700 // 2
# Sample equally from each class
sampled_class_0 = df_class_0.sample(n=samples_per_class, random_state=42)
sampled_class_2 = df_class_2.sample(n=samples_per_class, random_state=42)


# Combine the samples into one
sampled_artificial = pd.concat([sampled_class_0, sampled_class_2])
# Shuffle
shuffled_sampled_artificial = sampled_artificial.sample(frac=1, random_state=42).reset_index(drop=True)
# Convert the shuffled df to a Dataset
sampled_pubmedqa_artificial = Dataset.from_pandas(shuffled_sampled_artificial)
df_train = pd.DataFrame(pubmedqa_train) # Convert the datasets to pandas dfs
# Concatenate
combined_df = pd.concat([shuffled_sampled_artificial, df_train], ignore_index=True)


# Shuffle the combined df 
shuffled_combined_df = combined_df.sample(frac=1, random_state=42).reset_index(drop=True)
# Convert the final shuffled df back to a Dataset
pubmedqa_arti_context_mixed = Dataset.from_pandas(shuffled_combined_df)

# 5. Training with context (reasoning-required setting)

In this process, the tokenizer is set to process a maximum of 512 tokens, the limit used in the BERT paper. Instances with a long context exceed this limit and trigger a warning, because the part over the limit is removed, which leads to a loss of information. Raising the limit would potentially waste computing resources, so it stays at 512.
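To make the effect of the 512-token limit concrete, here is a minimal sketch of a "longest first" truncation over a question/context pair. It is a plain-Python illustration only, not the tokenizer's actual implementation: the function name `truncate_pair_longest_first`, the `num_special_tokens=3` budget (assuming a BERT-style `[CLS] question [SEP] context [SEP]` layout), and the example lengths are all assumptions made for this sketch.

```python
def truncate_pair_longest_first(question_tokens, context_tokens,
                                max_length=512, num_special_tokens=3):
    """Drop tokens one at a time from the longer sequence until the pair fits.

    num_special_tokens=3 assumes a BERT-style layout:
    [CLS] question [SEP] context [SEP].
    """
    budget = max_length - num_special_tokens
    if budget < 0:
        raise ValueError("max_length must leave room for the special tokens")
    question, context = list(question_tokens), list(context_tokens)
    removed = 0
    while len(question) + len(context) > budget:
        # Trim the tail of whichever sequence is currently longer
        # (the context on ties).
        if len(context) >= len(question):
            context.pop()
        else:
            question.pop()
        removed += 1
    return question, context, removed


# Hypothetical example: a 20-token question with a 600-token context.
question = [f"q{i}" for i in range(20)]
context = [f"c{i}" for i in range(600)]
kept_question, kept_context, removed = truncate_pair_longest_first(question, context)
print(len(kept_question), len(kept_context), removed)  # prints: 20 489 111
```

In this example the question survives intact while the last 111 context tokens are dropped, which is the information loss the warning refers to.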

## 5.1. Training with just labeled data

Section 4 (in the no-context notebook) determined that BiomedNLP (the new name of PubMedBERT) and BioLinkBERT provide a significant improvement compared to the other language models, so ColBERT and LinkBERT are removed here. BERT is kept as a baseline model.

In [9]:
# Predefine the parameters for training
batch_size = 8 # the larger context inputs limit the maximum batch size we can process to 8
epochs=10
context = '_context'
opt = ''
data = ''
version = context+opt+data

 
models = [
    
    {'model_name': 'BERT'+version,
    'source': 'bert-base-uncased'},
    {'model_name': 'BiomedNLP'+version,
    'source': 'microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract'},
    {'model_name': 'BioLinkBERT'+version,
    'source': 'michiyasunaga/BioLinkBERT-base'}
]

In [10]:
trainer_linear_context = Trainandtest(pubmedqa_train, pubmedqa_test, context=True)

for model in models:
    model_name=model['model_name'],
    source=model['source'],
    trainer_linear_context.model_compile(QAModel, model_name,source,
                                      optimizer='adam', 
                                      batch_size=batch_size)
    # Train the model
    trainer_linear_context.training(model_name, epochs=epochs)
    
    # test the model
    test_result = trainer_linear_context.val()
    trainer_linear_context.results[model['model_name']] = test_result

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Epoch 1, Loss: 1.0035318352959373, F1 Score: 0.4258091595690478, Time: 169.21 seconds

Epoch 2, Loss: 0.956186484884132, F1 Score: 0.45407878630468307, Time: 168.72 seconds

Epoch 3, Loss: 0.9463618634776636, F1 Score: 0.4429278319272303, Time: 168.53 seconds

Epoch 4, Loss: 0.9292710396376523, F1 Score: 0.45727865834962395, Time: 168.74 seconds

Epoch 5, Loss: 0.8620377474210479, F1 Score: 0.5314034191673229, Time: 168.95 seconds

Epoch 6, Loss: 0.7657119543714956, F1 Score: 0.6279825830659492, Time: 168.73 seconds

Epoch 7, Loss: 0.5684577751566063, F1 Score: 0.7402319089150972, Time: 169.36 seconds

Epoch 8, Loss: 0.39344744849950075, F1 Score: 0.825271407870827, Time: 171.20 seconds

Epoch 9, Loss: 0.2134440200911327, F1 Score: 0.9154971140514397, Time: 170.99 seconds

Epoch 10, Loss: 0.11449313076974993, F1 Score: 0.9867715301661939, Time: 171.07 seconds

Model saved to /kaggle/working/models/BERT_context_model.pth

Test - Accuracy: 0.5548172757475083, Precision: 0.570514950166113

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/225k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.

pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Epoch 1, Loss: 1.07882898775014, F1 Score: 0.4266646260225058, Time: 171.27 seconds

Epoch 2, Loss: 0.9536730061200532, F1 Score: 0.495111989951073, Time: 170.98 seconds

Epoch 3, Loss: 0.8852055167609995, F1 Score: 0.5764454733597845, Time: 171.15 seconds

Epoch 4, Loss: 0.7566650469194759, F1 Score: 0.6666518753000096, Time: 171.45 seconds

Epoch 5, Loss: 0.6289773160083727, F1 Score: 0.719118948164006, Time: 171.63 seconds

Epoch 6, Loss: 0.5029958694834601, F1 Score: 0.7906655122205205, Time: 171.45 seconds

Epoch 7, Loss: 0.37539987909522926, F1 Score: 0.8361187012063224, Time: 171.74 seconds

Epoch 8, Loss: 0.30207037375393236, F1 Score: 0.8713180614021783, Time: 171.45 seconds

Epoch 9, Loss: 0.23481985499066385, F1 Score: 0.9112288680833465, Time: 172.70 seconds

Epoch 10, Loss: 0.15888179468244992, F1 Score: 0.9524310003203712, Time: 171.86 seconds

Model saved to /kaggle/working/models/BiomedNLP_context_model.pth

Test - Accuracy: 0.6611295681063123, Precision: 0.632973786209

tokenizer_config.json:   0%|          | 0.00/379 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/225k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/447k [00:00<?, ?B/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.

config.json:   0%|          | 0.00/559 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/433M [00:00<?, ?B/s]

Epoch 1, Loss: 0.973631892691959, F1 Score: 0.45082601538667405, Time: 172.26 seconds

Epoch 2, Loss: 0.9408344700932503, F1 Score: 0.43982555605351015, Time: 172.02 seconds

Epoch 3, Loss: 0.9423940174958922, F1 Score: 0.44230613521825507, Time: 171.93 seconds

Epoch 4, Loss: 0.9210914715447209, F1 Score: 0.4719052852539676, Time: 171.73 seconds

Epoch 5, Loss: 0.8594460859894753, F1 Score: 0.5437623508204493, Time: 171.58 seconds

Epoch 6, Loss: 0.7684573534537446, F1 Score: 0.6435523395676456, Time: 172.04 seconds

Epoch 7, Loss: 0.6371374746615236, F1 Score: 0.721943002708733, Time: 172.65 seconds

Epoch 8, Loss: 0.45533397218043153, F1 Score: 0.8202212264670455, Time: 170.75 seconds

Epoch 9, Loss: 0.29034512756731023, F1 Score: 0.8948962422427097, Time: 170.53 seconds

Epoch 10, Loss: 0.19909433444792574, F1 Score: 0.9513276168859007, Time: 170.73 seconds

Model saved to /kaggle/working/models/BioLinkBERT_context_model.pth

Test - Accuracy: 0.6677740863787376, Precision: 0.666249

## 5.2. Training with a mix of labeled and artificial data

In [11]:
# revising the versioning of the models
context = '_context'
opt = ''
data = '_mixed'
version = context+opt+data

models = [
    
    {'model_name': 'BERT'+version,
    'source': 'bert-base-uncased'},
    {'model_name': 'BiomedNLP'+version,
    'source': 'microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract'},
    {'model_name': 'BioLinkBERT'+version,
    'source': 'michiyasunaga/BioLinkBERT-base'}
]

In [12]:
trainer_mixed = Trainandtest(pubmedqa_arti_context_mixed, pubmedqa_test, context=True)

for model in models:
    model_name=model['model_name'],
    source=model['source'],
    trainer_mixed.model_compile(QAModel, model_name,source,
                                      batch_size=batch_size)
    # Train the model
    trainer_mixed.training(model_name, epochs=epochs)
    
    # test the model
    test_result = trainer_mixed.val()
    trainer_mixed.results[model['model_name']] = test_result

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.

Epoch 1, Loss: 0.8986478524548667, F1 Score: 0.48637003935219425, Time: 339.83 seconds

Epoch 2, Loss: 0.803130886384419, F1 Score: 0.579843816055689, Time: 340.02 seconds

Epoch 3, Loss: 0.715532306092126, F1 Score: 0.6681516807898874, Time: 340.00 seconds

Epoch 4, Loss: 0.5978773845945086, F1 Score: 0.7540310742900714, Time: 339.87 seconds

Epoch 5, Loss: 0.4610365628344672, F1 Score: 0.8058068535981151, Time: 339.91 seconds

Epoch 6, Loss: 0.3178102566514696, F1 Score: 0.868207176071281, Time: 339.06 seconds

Epoch 7, Loss: 0.2326211639387267, F1 Score: 0.9046567907195692, Time: 339.51 seconds

Epoch 8, Loss: 0.16743274979825531, F1 Score: 0.9321493280894271, Time: 339.00 seconds

Epoch 9, Loss: 0.11709948597209795, F1 Score: 0.9632817826892575, Time: 337.44 seconds

Epoch 10, Loss: 0.08623563926666974, F1 Score: 0.9743939642898238, Time: 337.06 seconds

Model saved to /kaggle/working/models/BERT_context_mixed_model.pth

Test - Accuracy: 0.5714285714285714, Precision: 0.54431350824

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.

Epoch 1, Loss: 0.9049298996584756, F1 Score: 0.530313759938743, Time: 337.03 seconds

Epoch 2, Loss: 0.7017579024178641, F1 Score: 0.6846565216440482, Time: 335.70 seconds

Epoch 3, Loss: 0.5802638104132244, F1 Score: 0.7721562790164472, Time: 334.23 seconds

Epoch 4, Loss: 0.44585470352854045, F1 Score: 0.8259648987899192, Time: 335.28 seconds

Epoch 5, Loss: 0.33813858489905085, F1 Score: 0.867361590415157, Time: 334.96 seconds

Epoch 6, Loss: 0.2613428049640996, F1 Score: 0.8975571917602269, Time: 334.47 seconds

Epoch 7, Loss: 0.18531239184417894, F1 Score: 0.9265792856672992, Time: 334.43 seconds

Epoch 8, Loss: 0.14649703740275333, F1 Score: 0.9504684058761081, Time: 334.35 seconds

Epoch 9, Loss: 0.1558898522997541, F1 Score: 0.9563559267525574, Time: 333.23 seconds

Epoch 10, Loss: 0.11290957718555417, F1 Score: 0.967101428377449, Time: 333.99 seconds

Model saved to /kaggle/working/models/BiomedNLP_context_mixed_model.pth

Test - Accuracy: 0.6013289036544851, Precision: 0.5934

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.

Epoch 1, Loss: 0.9053829898153033, F1 Score: 0.47079018030235487, Time: 334.44 seconds

Epoch 2, Loss: 0.7738681610992977, F1 Score: 0.6171396092259691, Time: 334.31 seconds

Epoch 3, Loss: 0.6178830394574574, F1 Score: 0.7448943483238617, Time: 335.32 seconds

Epoch 4, Loss: 0.4767942288517952, F1 Score: 0.8188375349738348, Time: 333.58 seconds

Epoch 5, Loss: 0.345628376517977, F1 Score: 0.8667719298933526, Time: 335.31 seconds

Epoch 6, Loss: 0.2813580674145903, F1 Score: 0.8935513907574281, Time: 335.01 seconds

Epoch 7, Loss: 0.20346770026854105, F1 Score: 0.9293628174657663, Time: 335.14 seconds

Epoch 8, Loss: 0.15210273075316633, F1 Score: 0.9584721868793843, Time: 334.37 seconds

Epoch 9, Loss: 0.12396818215293544, F1 Score: 0.9653990419018964, Time: 336.30 seconds

Epoch 10, Loss: 0.08140072951891593, F1 Score: 0.9803841517787403, Time: 337.28 seconds

Model saved to /kaggle/working/models/BioLinkBERT_context_mixed_model.pth

Test - Accuracy: 0.6843853820598007, Precision: 0.

## 5.3. Results

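In the context-rich setting, BioLinkBERT consistently outperforms PubMedBERT/BiomedNLP. Quite surprisingly, BioLinkBERT still improves when artificial data is mixed in (accuracy 0.668 to 0.684, F1 0.662 to 0.686), whereas BiomedNLP degrades (accuracy 0.661 to 0.601, F1 0.643 to 0.585). BERT is mixed: its F1 slips slightly (0.559 to 0.553) even though its accuracy rises (0.555 to 0.571).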

In [13]:
result_linear_context= result_convert(trainer_linear_context.results)
print('Adam optimiser+Linear layer\n',result_linear_context[['Model','Accuracy','F1 Score']])
result_linear_context_mixed= result_convert(trainer_mixed.results)
print('Adam optimiser+Linear layer\n',result_linear_context_mixed)

Adam optimiser+Linear layer

                  Model  Accuracy  F1 Score

0         BERT_context  0.554817  0.559380

1    BiomedNLP_context  0.661130  0.643008

2  BioLinkBERT_context  0.667774  0.661588

Adam optimiser+Linear layer

                        Model  Accuracy  Precision    Recall  F1 Score

0         BERT_context_mixed  0.571429   0.544314  0.571429  0.552980

1    BiomedNLP_context_mixed  0.601329   0.593460  0.601329  0.585372

2  BioLinkBERT_context_mixed  0.684385   0.687367  0.684385  0.685798
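The effect of mixing in artificial data can be read off the two tables by subtracting the labeled-only scores from the mixed-data scores. The sketch below does this arithmetic with the accuracy and F1 values copied from the tables above. The dictionary layout and the helper name `metric_deltas` are illustrative assumptions, not part of the project's code.

```python
# Accuracy and F1 copied from the two result tables above.
labeled = {
    "BERT": {"accuracy": 0.554817, "f1": 0.559380},
    "BiomedNLP": {"accuracy": 0.661130, "f1": 0.643008},
    "BioLinkBERT": {"accuracy": 0.667774, "f1": 0.661588},
}
mixed = {
    "BERT": {"accuracy": 0.571429, "f1": 0.552980},
    "BiomedNLP": {"accuracy": 0.601329, "f1": 0.585372},
    "BioLinkBERT": {"accuracy": 0.684385, "f1": 0.685798},
}


def metric_deltas(before, after):
    """Return after - before for every model and metric (hypothetical helper)."""
    return {
        model: {
            metric: round(after[model][metric] - before[model][metric], 6)
            for metric in before[model]
        }
        for model in before
    }


deltas = metric_deltas(labeled, mixed)
for model, change in deltas.items():
    print(f"{model:<12} accuracy {change['accuracy']:+.3f}  F1 {change['f1']:+.3f}")
```

A positive value means the model scored higher with the mixed training data. These are single runs on one test split, so small differences should be read with caution.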
