## Text Data Augmentation for Medical Question Answering

## Introduction

In the biomedical domain, obtaining large, high-quality labeled datasets is often challenging due to data privacy concerns, the sensitive nature of medical data, and the specialized knowledge required for annotation. **Synthetic data augmentation** using **Generative AI** addresses these limitations by creating new, diverse samples without needing extensive manual labeling.

This notebook demonstrates two text augmentation techniques on a subset of the **MedQUAD** dataset:
1. **NLPAug (Synonym-based Augmentation)**: A rule-based method that generates three variations of question-answer pairs by replacing words with their synonyms.
2. **DistilGPT-2 (Generative Model)**: A pre-trained model that generates one new synthetic question-answer pair for each original pair.

These techniques enhance the original dataset, which is used to fine-tune the **BioBERT** model, improving its performance in answering medical questions. **BioBERT** is particularly suited for biomedical tasks, as it has been pre-trained on large biomedical text corpora.



**Dataset:**

- MedQUAD is a dataset containing medical question-answer pairs covering various topics. Our goal is to generate augmented data for fine-tuning a BioBERT model for the medical question-answering task. MedQUAD: [MedQUAD Kaggle](https://www.kaggle.com/datasets/thedevastator/comprehensive-medical-q-a-dataset)
- This dataset is useful for developing and fine-tuning models in the biomedical domain, especially for tasks like medical question answering and knowledge retrieval.


### Step 1: Authenticate & Install necessary libraries
We authenticate the Colab environment with Google Cloud and set the project.
This is necessary for accessing data stored in Google Cloud Storage (GCS).


In [None]:
from google.colab import auth
auth.authenticate_user()

# Set your Google Cloud project
project_id = "project_id" #Replace it with your project id
!gcloud config set project {project_id}

Updated property [core/project].


### Install Packages like google-cloud-automl, nlpaug and transformers

In [None]:
!pip install google-cloud-automl nlpaug transformers datasets nltk rouge safetensors

Collecting google-cloud-automl
  Downloading google_cloud_automl-2.13.5-py2.py3-none-any.whl.metadata (6.1 kB)
Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl.metadata (14 kB)
Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl.metadata (4.1 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading google_cloud_automl-2.13.5-py2.py3-none-any.whl (334 kB)

Load the MedQUAD data

In [None]:
import pandas as pd
from google.cloud import automl_v1beta1 as automl
from google.cloud import storage

# Define the Cloud Storage client (used later to upload the augmented CSV files)
client = storage.Client(project=project_id)

# Set your GCS bucket and file path
bucket_name = "bucket_name" #Replace it with your bucket name
file_name = "medquad_sample.csv"

# Full GCS file path
gcs_file_path = f"gs://{bucket_name}/{file_name}"

# Load the CSV into a Pandas DataFrame
medquad_df = pd.read_csv(gcs_file_path)

medquad_df.head()

Unnamed: 0,qtype,Question,Answer
0,susceptibility,Who is at risk for Breast Cancer? ?,Risk factors are conditions or agents that inc...
1,frequency,How many people are affected by myotonia conge...,Myotonia congenita is estimated to affect 1 in...
2,symptoms,What are the symptoms of Hallermann-Streiff sy...,What are the signs and symptoms of Hallermann-...
3,causes,What causes Williams syndrome ?,What causes Williams syndrome? Williams syndro...
4,symptoms,What are the symptoms of Keutel syndrome ?,What are the signs and symptoms of Keutel synd...


In [None]:
#Display the shape of the dataframe
print("Shape of the dataframe:")
print(medquad_df.shape)
print('\n')
#Display datatype information of the dataframe
print("DataType Info:")
print(medquad_df.info())
print("\n")
print("Columns in Dataframe:")
print(medquad_df.columns)

Shape of the dataframe:
(800, 3)


DataType Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   qtype     800 non-null    object
 1   Question  800 non-null    object
 2   Answer    800 non-null    object
dtypes: object(3)
memory usage: 18.9+ KB
None


Columns in Dataframe:
Index(['qtype', 'Question', 'Answer'], dtype='object')


Split the data into a training set and a test set

In [None]:
from sklearn.model_selection import train_test_split

# Splitting the data
train_df, test_df = train_test_split(medquad_df, test_size=0.4, random_state=42)
# Taking only 10 samples from the train_df
sample_data = train_df.sample(n=10, random_state=42)

# Display the shape of the data
print(f"Train Shape: {train_df.shape}, Test Shape: {test_df.shape}")
print(f"Sample Data Shape: {sample_data.shape}" )

Train Shape: (480, 3), Test Shape: (320, 3)
Sample Data Shape: (10, 3)


Make copies of training data to apply Augmentation Techniques

In [None]:
# Make a copy of train_df for NLPAug-based augmentation
ip2nlpaug_df = train_df.copy()
print("Shape of input dataframe for NLPAug augmentation:", ip2nlpaug_df.shape)

# Make a copy of train_df for DistilGPT-2-based augmentation
ip2gpt2_df = train_df.copy()
print("Shape of input dataframe for DistilGPT-2 augmentation:", ip2gpt2_df.shape)

# Make a copy of the sample data for NLPAug-based augmentation
sample_data_nlp_df = sample_data.copy()
print("Shape of sample input dataframe for NLPAug augmentation:", sample_data_nlp_df.shape)

Shape of input dataframe for NLPAug augmentation: (480, 3)
Shape of input dataframe for DistilGPT-2 augmentation: (480, 3)
Shape of sample input dataframe for NLPAug augmentation: (10, 3)


### Step 2: Data Augmentation using NLPAug

We use the **NLPAug** (Natural Language Processing Augmentation) library to augment our training data. NLPAug is a Python library that provides a range of augmentation techniques for text and audio data. For this example, we focus on **text augmentation** using synonym replacement.

**NLPAug features:**
- **Word Augmentation:** Synonym replacement, random insertion, and deletion.
- **Character Augmentation:** Substitution, random keyboard typo simulation.
- **Sentence Augmentation:** Contextual word embedding techniques (BERT, GPT).
- **Speed and Flexibility:** Built-in support for common NLP resources such as the WordNet lexical database and the Transformers library, making it versatile for different use cases.

In this step, we use synonym augmentation with **WordNet** to generate three new question-answer pairs for each original pair in our dataset.
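
To make the mechanism concrete, here is a minimal sketch of synonym replacement in plain Python. It does not use NLPAug or WordNet: the `TOY_SYNONYMS` table, the `PROTECTED` set, and the `synonym_replace` helper are hypothetical stand-ins, and the per-word probability is a simplification of what a real augmenter does. The protected set illustrates one way to keep domain terms from being replaced.

```python
import random

# Hypothetical stand-in for a WordNet lookup: word -> candidate synonyms
TOY_SYNONYMS = {
    "causes": ["triggers", "induces"],
    "symptoms": ["signs", "indications"],
    "chain": ["mountain range"],  # a valid synonym in general, wrong in a medical term
}

# Hypothetical list of terms that should never be replaced
PROTECTED = {"chain"}


def synonym_replace(text, aug_p=0.3, seed=None):
    """Replace each eligible word with a random synonym with probability aug_p."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        key = word.lower()
        candidates = TOY_SYNONYMS.get(key, [])
        if candidates and key not in PROTECTED and rng.random() < aug_p:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)


question = "What causes Tetralogy of Fallot ?"
# aug_p=1.0 forces every eligible word to be replaced
augmented = synonym_replace(question, aug_p=1.0, seed=0)
print(augmented)

# "chain" has a synonym in the table but is protected, so the text is unchanged
protected_demo = synonym_replace("medium chain acyl-CoA dehydrogenase deficiency", aug_p=1.0, seed=0)
print(protected_demo)
```

NLPAug's `SynonymAug` accepts a `stopwords` argument that may serve a similar purpose; it is worth checking its behavior against the installed version before relying on it.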


#### Synonym Replacement - Sample Data

In [None]:
import pandas as pd
import nlpaug.augmenter.word as naw

# Initialize the Synonym Augmenter from nlpaug
synonym_aug = naw.SynonymAug(aug_src='wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [None]:
# Step 1: Select one sample from sample_data
sample_input = sample_data.iloc[[0]]  # Selecting the first sample

# Step 2: Extract the original Question
original_question = sample_input['Question'].values[0]

# Step 3: Augment the Question
# augment() already returns a list of strings, so it is not wrapped in another list
augmented_questions = synonym_aug.augment(original_question)

# Step 4: Display the original question and the augmented questions
print(f"Original Question:\n{original_question}\n")

for i, aug_question in enumerate(augmented_questions, 1):
    print(f"Augmented Question {i}:\n{aug_question}\n")

Original Question:
What causes Tetralogy of Fallot ?

Augmented Question 1:
What causes Tetralogy of Etienne louis arthur fallot?



Function to apply synonym-based augmentation to training set

In [None]:
import nlpaug.augmenter.word as naw
import pandas as pd

# Initialize the Synonym Augmenter from nlpaug
synonym_aug = naw.SynonymAug(aug_src='wordnet')

# Function to augment the data (Question and Answer only, qtype stays the same)
def augment_text(df, augmenter, num_augments=1):
    augmented_data = []

    for _, row in df.iterrows():
        qtype = row['qtype']  # Keep the qtype the same
        question = row['Question']
        answer = row['Answer']

        # Augment the question and answer.
        # Note: augment() returns a list of strings, so each cell holds a
        # one-element list; these are converted to strings before fine-tuning.
        for _ in range(num_augments):
            aug_question = augmenter.augment(question)
            aug_answer = augmenter.augment(answer)
            augmented_data.append([qtype, aug_question, aug_answer])

    # Convert to a DataFrame
    return pd.DataFrame(augmented_data, columns=['qtype', 'Augmented_Question', 'Augmented_Answer'])

This step takes about 45 seconds.

In [None]:
# Augmenting the training data
num_augments_per_sample = 3  # Number of augmentations per sample
augmented_train_df = augment_text(ip2nlpaug_df, synonym_aug, num_augments=num_augments_per_sample)

# Display a few augmented samples
augmented_train_df.head()

Unnamed: 0,qtype,Augmented_Question,Augmented_Answer
0,information,[What is (are) Neurogenic diabetes insipidus?],[Neurogenic diabetes insipidus is a disease th...
1,information,[What is (are) Neurogenic diabetes insipidus?],[Neurogenic diabetes insipidus is a disease th...
2,information,[What is (are) Neurogenic diabetes insipidus?],[Neurogenic diabetes insipidus is a disease th...
3,information,[What is (be) medium - mountain range acyl - C...,[Medium - chain acyl - CoA dehydrogenase (MCAD...
4,information,[What is (be) medium - chain of mountains acyl...,[Medium - chain acyl - CoA dehydrogenase (MCAD...


In [None]:
print("NLPAug based Augmented Data:", augmented_train_df.shape)

NLPAug based Augmented Data: (1440, 3)


In [None]:
# Prepare NLP-augmented data
augmented_nlp = augmented_train_df[['qtype', 'Augmented_Question', 'Augmented_Answer']].copy()
augmented_nlp.rename(columns={'Augmented_Question': 'Question', 'Augmented_Answer': 'Answer'}, inplace=True)

# Mark as NLPAug-augmented
augmented_nlp['source'] = 'NLPAug-Augmented'

augmented_nlp.head()

Unnamed: 0,qtype,Question,Answer,source
0,information,[What is (are) Neurogenic diabetes insipidus?],[Neurogenic diabetes insipidus is a disease th...,NLPAug-Augmented
1,information,[What is (are) Neurogenic diabetes insipidus?],[Neurogenic diabetes insipidus is a disease th...,NLPAug-Augmented
2,information,[What is (are) Neurogenic diabetes insipidus?],[Neurogenic diabetes insipidus is a disease th...,NLPAug-Augmented
3,information,[What is (be) medium - mountain range acyl - C...,[Medium - chain acyl - CoA dehydrogenase (MCAD...,NLPAug-Augmented
4,information,[What is (be) medium - chain of mountains acyl...,[Medium - chain acyl - CoA dehydrogenase (MCAD...,NLPAug-Augmented


In [None]:
#Save the Augmented data to a csv format
augmented_nlp.to_csv('augmented_medquad.csv', index=False)

Saving Augmented Dataset:
The augmented dataset is saved to Google Cloud Storage for further use in downstream tasks.

In [None]:
#Save the augmented data to the storage bucket

# Define your bucket name
bucket_name = 'bucket name'  # Replace with your actual bucket name

# Get the bucket object
bucket = client.bucket(bucket_name)

# Define the destination blob name (file name in the bucket)
destination_blob_name = 'augmented_medquad.csv'

# Create a new blob and upload the CSV file to GCS
blob = bucket.blob(destination_blob_name)

# Upload the CSV file to the bucket
blob.upload_from_filename('augmented_medquad.csv')

print(f"File uploaded to {bucket_name}/{destination_blob_name}.")


File uploaded to data-augmentation-text/augmented_medquad.csv.


### Step 3: Data Augmentation using DistilGPT-2
In this step, we use **DistilGPT-2**, a distilled version of GPT-2, to generate new question-answer pairs for data augmentation.

#### What is DistilGPT-2?
**DistilGPT-2** is a smaller, faster version of **GPT-2** (Generative Pre-trained Transformer 2), a widely used generative language model developed by OpenAI. It was trained by Hugging Face using knowledge distillation from the smallest GPT-2 model and has about 82 million parameters, compared with about 124 million for that model. The smaller size costs some generation quality but keeps the computational cost low, which suits tasks like data augmentation where many samples must be generated.

You can find more details and access the model on [Hugging Face's DistilGPT-2 model page](https://huggingface.co/distilbert/distilgpt2).

#### Why use DistilGPT-2 for Augmentation?
- **Generates Coherent Text:** Unlike rule-based techniques like synonym replacement, DistilGPT-2 can generate more **contextually rich and diverse** text samples.
- **Better for Creativity:** Generating new samples that don't just vary in word choice but also in sentence structure and phrasing can introduce more variety into the dataset, improving generalization.
- **Efficient for Real-time Applications:** Because it’s smaller and faster than GPT-2, it can be used in **real-time** applications or in scenarios where computational resources are limited.

#### How DistilGPT-2 works:
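In this step, we use **DistilGPT-2** to generate **one new question-answer pair** for each original sample. The original question and the original answer are each passed to the model as a separate prompt, and the model extends each prompt with up to 50 newly sampled tokens. The generated text therefore starts with the original text, followed by the model's continuation.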
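
Because a causal language model continues its prompt, it can be useful to separate the original text from what the model appended. The following is a minimal sketch in plain Python; `split_continuation` is a hypothetical helper, and the strings are placeholders rather than real model output.

```python
def split_continuation(prompt: str, generated_text: str) -> str:
    """Return only the text that was appended after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text if the prompt was altered (e.g. truncated)
    return generated_text.strip()


# Placeholder strings for illustration; not real model output
prompt = "What causes Williams syndrome ?"
generated_text = prompt + " PLACEHOLDER CONTINUATION"

continuation = split_continuation(prompt, generated_text)
print(continuation)  # PLACEHOLDER CONTINUATION
```

The Transformers text-generation pipeline also accepts `return_full_text=False`, which should return only the continuation; the helper above is just a way to see what that separation means.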


Import necessary libraries

In [None]:
import pandas as pd
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

### Pipeline for generating new question-answer pairs using DistilGPT-2

In [None]:
# Check if a GPU is available
#device = 0 if torch.cuda.is_available() else -1  # Use -1 for CPU, 0 for GPU
device = -1  # Use CPU


# Load the DistilGPT2 model and tokenizer
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set the padding token to be the same as the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token  # Use eos_token as pad_token

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_name)

# Create a text generation pipeline, specifying the device
generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



#### Generated Question on Sample Data

In [None]:
sample_input = sample_data.iloc[[4]]  # Change the index to select a different sample
original_question = sample_input['Question'].values[0]  # Assuming the column name is 'Question'

augmented_question = generation_pipeline(
    original_question,
    num_return_sequences=1,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    max_new_tokens=50,
    truncation=True,
    pad_token_id=tokenizer.eos_token_id
)[0]['generated_text']


print(f"Original Question:\n{original_question}\n")
print(f"Generated Question:\n{augmented_question}\n")


Original Question:
Do you have information about Alcohol

Generated Question:
Do you have information about Alcohol, Tobacco, Firearms and Explosives, please call the ATF (ATF) at 800-342-7233 (trucks). If an agent is not on the scene, please call ATF at 800-342-7233 (tru



Filter the training data so that every question and answer fits within the model's maximum sequence length of 1,024 tokens.

In [None]:
def filter_long_texts(df, max_length=1024):
    # Keep only those rows where both 'Question' and 'Answer' are within the max_length
    # Apply the tokenization and length checking in a way that prevents errors
    def is_within_length(text):
        return len(tokenizer.encode(text)) <= max_length if isinstance(text, str) else True

    return df[df['Question'].apply(is_within_length) & df['Answer'].apply(is_within_length)]

# Apply the filter to your DataFrame
filtered_df = filter_long_texts(ip2gpt2_df)
print(filtered_df.shape)
#print(filtered_df.head())

Token indices sequence length is longer than the specified maximum sequence length for this model (3531 > 1024). Running this sequence through the model will result in indexing errors


(451, 3)
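
Generation appends up to `max_new_tokens` tokens to the prompt, so a prompt that fills the 1,024-token window on its own leaves no room for new tokens. Below is a minimal sketch of a budget-aware check. `fits_budget` and the whitespace-based `count_tokens` are hypothetical stand-ins; in this notebook the count would come from `len(tokenizer.encode(text))` instead.

```python
def count_tokens(text: str) -> int:
    """Stand-in tokenizer for illustration: counts whitespace-separated words."""
    return len(text.split())


def fits_budget(text, max_positions=1024, max_new_tokens=50, counter=count_tokens):
    """True if the prompt plus the tokens to be generated fits the context window."""
    if not isinstance(text, str):
        return True  # same policy as the filter above: non-strings are not rejected
    return counter(text) + max_new_tokens <= max_positions


short_prompt = "word " * 100   # 100 tokens under the stand-in tokenizer
full_prompt = "word " * 1000   # 1000 tokens: fits alone, but not with 50 new ones

print(fits_budget(short_prompt))  # True  (100 + 50 <= 1024)
print(fits_budget(full_prompt))   # False (1000 + 50 > 1024)
```

Applying such a check would likely keep fewer rows than the 451 reported above, since prompts between 975 and 1,024 tokens would also be dropped.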


In [None]:
# Function to generate new text for both questions and answers
def generate_batch_text(input_texts, num_return_sequences=1, max_new_tokens=50, batch_size=16):
    augmented_texts = []

    # Process the input data in batches
    for i in range(0, len(input_texts), batch_size):
        batch_input = input_texts[i:i + batch_size]

        # Debugging: Print the batch input
        #print(f"Batch Input {i//batch_size}: {batch_input}")

        # Generate augmented text
        try:
            generated = generation_pipeline(
                batch_input,
                num_return_sequences=num_return_sequences,
                do_sample=True,
                top_k=50,
                top_p=0.95,
                max_new_tokens=max_new_tokens,
                truncation=True,  # Explicitly truncate input sequences
                pad_token_id=tokenizer.eos_token_id  # Explicit padding
            )

            # Extract and store generated text
            for gen in generated:
                #print(f"Generated Output: {gen}")  # Debugging the generated output
                augmented_texts.append(gen[0]['generated_text'])  # Accessing the generated text correctly
        except Exception as e:
            print(f"Error occurred during generation: {e}")

    return augmented_texts


In [None]:
# Batch size and maximum length for generated samples
batch_size = 8
max_new_tokens = 50

# Initialize empty list to store augmented data
augmented_data = []

Generating the new data samples takes about 4 minutes.

In [None]:
# Process the training data in batches
for i in range(0, len(filtered_df), batch_size):
    # Get the current batch
    batch_df = filtered_df[i:i + batch_size]

    # Debugging: Print the batch DataFrame
    # print(f"Processing Batch {i//batch_size}:")
    # print(batch_df)

    # Check for empty questions or answers
    if any(batch_df['Question'].isnull()) or any(batch_df['Answer'].isnull()):
        print("Warning: Null values detected in questions or answers.")

    # Generate augmented questions and answers
    augmented_questions = generate_batch_text(batch_df['Question'].tolist(), max_new_tokens=max_new_tokens)
    augmented_answers = generate_batch_text(batch_df['Answer'].tolist(), max_new_tokens=max_new_tokens)

    # Ensure that generated outputs match the input batch size
    if len(augmented_questions) != len(batch_df) or len(augmented_answers) != len(batch_df):
        print(f"Warning: Mismatch in generated outputs for batch {i//batch_size}. "
              f"Questions: {len(augmented_questions)}, Answers: {len(augmented_answers)}, "
              f"Batch Size: {len(batch_df)}")
        continue  # Skip this batch if there's a mismatch

    # Store the augmented data (qtype, augmented question, augmented answer)
    for qtype, question, answer in zip(batch_df['qtype'], augmented_questions, augmented_answers):
        augmented_data.append([qtype, question, answer])


This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (1024). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


In [None]:
# Convert augmented data to a DataFrame
generated_df = pd.DataFrame(augmented_data, columns=['qtype', 'Augmented_Question', 'Augmented_Answer'])

In [None]:
# Prepare GPT-augmented data
generated_gpt = generated_df[['qtype', 'Augmented_Question', 'Augmented_Answer']].copy()
generated_gpt.rename(columns={'Augmented_Question': 'Question', 'Augmented_Answer': 'Answer'}, inplace=True)

# Mark as GPT-Augmented
generated_gpt['source'] = 'GPT-Augmented'

generated_gpt.head()

Unnamed: 0,qtype,Question,Answer,source
0,information,What is (are) Neurogenic diabetes insipidus ?\...,Neurogenic diabetes insipidus is a disease tha...,GPT-Augmented
1,information,What is (are) medium-chain acyl-CoA dehydrogen...,Medium-chain acyl-CoA dehydrogenase (MCAD) def...,GPT-Augmented
2,information,What is (are) abetalipoproteinemia ?,Abetalipoproteinemia is an inherited disorder ...,GPT-Augmented
3,treatment,What are the treatments for maternally inherit...,These resources address the diagnosis or manag...,GPT-Augmented
4,causes,What causes Williams syndrome ? (C) (2007),What causes Williams syndrome? Williams syndro...,GPT-Augmented


In [None]:
print("DistilGPT-2 based Augmented Data:", generated_gpt.shape)

DistilGPT-2 based Augmented Data: (451, 4)


In [None]:
# Save to CSV
generated_gpt.to_csv('generated_medquad.csv', index=False)

Saving Generated Dataset:
The DistilGPT-2-generated dataset is saved to Google Cloud Storage for further use in downstream tasks.

In [None]:
#Save the augmented data to the storage bucket
# Define your bucket name
bucket_name = 'bucket name'  # Replace with your actual bucket name

# Get the bucket object
bucket = client.bucket(bucket_name)

# Define the destination blob name (file name in the bucket)
destination_blob_name = 'generated_medquad.csv'

# Create a new blob and upload the CSV file to GCS
blob = bucket.blob(destination_blob_name)

# Upload the CSV file to the bucket
blob.upload_from_filename('generated_medquad.csv')

print(f"File uploaded to {bucket_name}/{destination_blob_name}.")


File uploaded to data-augmentation-text/generated_medquad.csv.


### Step 4: Fine-Tuning BioBERT for Medical Question Answering

**BioBERT** (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is a variant of BERT that has been pre-trained on large biomedical corpora such as **PubMed** abstracts and **PMC full-text articles**. This makes it well suited to tasks like **medical question answering**, where understanding domain-specific language is crucial.

In this notebook, we fine-tune BioBERT using the augmented data generated by:
1. **NLPAug**: Which provides synonym-based variations of medical question-answer pairs.
2. **DistilGPT-2**: A generative model that produces contextually diverse and fluent medical question-answer pairs.

By fine-tuning BioBERT on these augmented datasets, we aim to:
- Improve its ability to answer medical queries.
- Handle diverse question formats.
- Generalize better on unseen data by leveraging the variety introduced through synthetic data augmentation.

Fine-tuning a domain-specific model like **BioBERT** on **synthetic medical data** allows us to improve the model's accuracy without requiring an expensive or time-consuming manual annotation process.

#### Steps Involved:

- Dataset Preparation: We prepare diverse medical question-answer pairs with relevant contextual information.
- Model Configuration: Key parameters include:
  - Learning Rate: Set between 2e-5 and 5e-5.
  - Batch Size: Typically 16 or 32 (this notebook uses 8).
  - Epochs: Usually 3-5, monitored for overfitting.
- Training Process:
  - Tokenize and encode each question together with its context into input IDs and attention masks.
  - Train the model on the combined dataset with the configuration above.
- Evaluation Metrics:
  - **ROUGE Score**: Measures the overlap of n-grams between the generated and reference answers, focusing on recall to capture relevant information.
  - **BLEU Score**: Assesses how many words and phrases in the generated output match those in the reference answers, emphasizing precision and fluency in generated responses.
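
The recall and precision views behind these two metrics can be illustrated with unigrams only. This is a simplified sketch, not the full metrics: real BLEU combines clipped n-gram precisions for several values of n with a brevity penalty, and ROUGE comes in several variants. The helper names and the two example sentences are made up for illustration.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> int:
    """Count unigrams shared by both texts, clipped by their counts in each."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    return sum((cand_counts & ref_counts).values())


def rouge1_recall(candidate: str, reference: str) -> float:
    """Share of reference unigrams that also appear in the candidate."""
    ref_len = len(reference.split())
    return unigram_overlap(candidate, reference) / ref_len if ref_len else 0.0


def unigram_precision(candidate: str, reference: str) -> float:
    """Share of candidate unigrams that also appear in the reference."""
    cand_len = len(candidate.split())
    return unigram_overlap(candidate, reference) / cand_len if cand_len else 0.0


reference = "the disorder is caused by a gene deletion"  # 8 words
candidate = "the disorder is caused by a mutation"       # 7 words, 6 shared

print(rouge1_recall(candidate, reference))      # 0.75 (6 of 8 reference words)
print(unigram_precision(candidate, reference))  # 6 of 7 candidate words
```

The candidate misses two reference words, which lowers recall, and adds one word the reference lacks, which lowers precision.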




Step 1: Combine the Augmented and Generated data with the Training Set.

In [None]:
# Mark original data
train_df['source'] = 'Original'
train_df.columns

Index(['qtype', 'Question', 'Answer', 'source'], dtype='object')

In [None]:
# Concatenating the dataframes vertically
combined_df = pd.concat([train_df, augmented_nlp, generated_gpt], ignore_index=True)

combined_df.head()

Unnamed: 0,qtype,Question,Answer,source
0,information,What is (are) Neurogenic diabetes insipidus ?,Neurogenic diabetes insipidus is a disease tha...,Original
1,information,What is (are) medium-chain acyl-CoA dehydrogen...,Medium-chain acyl-CoA dehydrogenase (MCAD) def...,Original
2,treatment,What are the treatments for Salivary Gland Can...,Key Points\n - There are di...,Original
3,information,What is (are) abetalipoproteinemia ?,Abetalipoproteinemia is an inherited disorder ...,Original
4,treatment,What are the treatments for maternally inherit...,These resources address the diagnosis or manag...,Original


In [None]:
print("Shape of the Combined Training Set:", combined_df.shape)
print("\n")
print("Combined Training Set dataframe info:", combined_df.info())

Shape of the Combined Training Set: (2371, 4)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2371 entries, 0 to 2370
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   qtype     2371 non-null   object
 1   Question  2371 non-null   object
 2   Answer    2371 non-null   object
 3   source    2371 non-null   object
dtypes: object(4)
memory usage: 74.2+ KB
Combined Training Set dataframe info: None
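
As a quick sanity check, the combined row count can be reproduced from the shapes reported earlier. This sketch only restates that arithmetic; the variable names are chosen here for illustration.

```python
# Row counts reported in the outputs above
n_original = 480            # training split
n_nlpaug = 3 * n_original   # three synonym variants per training row
n_generated = 451           # rows that passed the 1,024-token filter

expected_total = n_original + n_nlpaug + n_generated
print(expected_total)  # 2371

# Share of each source in the combined training set
for name, n in [("Original", n_original), ("NLPAug", n_nlpaug), ("GPT", n_generated)]:
    print(f"{name}: {n / expected_total:.1%}")
```

Roughly four out of five rows in the combined set are synthetic, which is worth keeping in mind when interpreting the training results.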


Step 2: Preprocess the data -
- Convert lists to strings
- Prepare dataset for finetuning BioBERT

In [None]:
# Convert lists to strings
combined_df['Question'] = combined_df['Question'].apply(lambda x: x[0] if isinstance(x, list) and len(x) > 0 else x)
combined_df['Answer'] = combined_df['Answer'].apply(lambda x: x[0] if isinstance(x, list) and len(x) > 0 else x)

# Verify the changes
combined_df[['Question', 'Answer']].head()

Unnamed: 0,Question,Answer
0,What is (are) Neurogenic diabetes insipidus ?,Neurogenic diabetes insipidus is a disease tha...
1,What is (are) medium-chain acyl-CoA dehydrogen...,Medium-chain acyl-CoA dehydrogenase (MCAD) def...
2,What are the treatments for Salivary Gland Can...,Key Points\n - There are di...
3,What is (are) abetalipoproteinemia ?,Abetalipoproteinemia is an inherited disorder ...
4,What are the treatments for maternally inherit...,These resources address the diagnosis or manag...


In [None]:
from datasets import Dataset

# Prepare dataset for fine-tuning
augmeted_train_set = Dataset.from_pandas(combined_df[['Question', 'Answer']])

Step 3: Load the BioBERT model

In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Load the QA-specific BioBERT model and tokenizer
model = AutoModelForQuestionAnswering.from_pretrained("dmis-lab/biobert-base-cased-v1.1-squad")
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1-squad")



Some weights of the model checkpoint at dmis-lab/biobert-base-cased-v1.1-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).





Step 4: Tokenize the inputs for question-answering

In [None]:
def tokenize_function(examples):
    # Tokenize the inputs for question-answering
    tokenized_examples = tokenizer(
        examples['Question'],  # The question
        examples['Answer'],    # The answer as the context
        truncation="only_second",  # Truncate the context if necessary
        max_length=512,
        return_offsets_mapping=True,  # Keep track of the offsets
        padding="max_length"
    )

    # Prepare the start and end positions for the answers
    start_positions = []
    end_positions = []

    for i in range(len(tokenized_examples['input_ids'])):
        # sequence_ids marks question tokens with 0, context tokens with 1,
        # and special/padding tokens with None. Offsets alone are ambiguous:
        # [CLS] and the first question token also start at character 0.
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_tokens = [j for j, seq_id in enumerate(sequence_ids) if seq_id == 1]

        if context_tokens:
            # The Answer column is used as the context, so the answer span
            # covers every context token that survives truncation
            start_positions.append(context_tokens[0])
            end_positions.append(context_tokens[-1])
        else:
            # No context tokens (e.g. an empty answer): point at [CLS] (index 0)
            start_positions.append(0)
            end_positions.append(0)

    tokenized_examples['start_positions'] = start_positions
    tokenized_examples['end_positions'] = end_positions

    return tokenized_examples


# Apply tokenization to the datasets
augmeted_train_Set_tokenized = augmeted_train_set.map(tokenize_function, batched=True)

# Set the format for PyTorch
augmeted_train_Set_tokenized.set_format(type='torch', columns=['input_ids', 'attention_mask', 'start_positions', 'end_positions'])
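
Offsets reported by a tokenizer are relative to each input text, so question tokens, context tokens, and special tokens can all carry a start offset of 0. The sketch below shows a span lookup restricted to context tokens. The offsets and sequence ids are hand-written toy values rather than real tokenizer output, and `char_span_to_token_span` is a hypothetical helper.

```python
def char_span_to_token_span(offsets, sequence_ids, char_start, char_end):
    """Map a character span in the context to (start_token, end_token).

    offsets: list of (start_char, end_char) per token
    sequence_ids: 1 marks context tokens, 0 question tokens,
                  None special or padding tokens
    Returns None if the span is not covered by context tokens.
    """
    context = [i for i, s in enumerate(sequence_ids) if s == 1]
    start_token = next(
        (i for i in context if offsets[i][0] <= char_start < offsets[i][1]), None
    )
    end_token = next(
        (i for i in reversed(context) if offsets[i][0] < char_end <= offsets[i][1]), None
    )
    if start_token is None or end_token is None:
        return None
    return start_token, end_token


# Toy encoding of: [CLS] what causes it [SEP] gene deletion [SEP]
offsets = [(0, 0), (0, 4), (5, 11), (12, 14), (0, 0), (0, 4), (5, 13), (0, 0)]
sequence_ids = [None, 0, 0, 0, None, 1, 1, None]

context = "gene deletion"
span = char_span_to_token_span(offsets, sequence_ids, 0, len(context))
print(span)  # (5, 6)

# A search over all tokens for "start offset == 0" lands on [CLS] instead
naive_start = next(j for j, (start, end) in enumerate(offsets) if start == 0)
print(naive_start)  # 0
```

With a fast Hugging Face tokenizer, the per-token sequence ids are available through `BatchEncoding.sequence_ids(i)`.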



Step 5: Define the training arguments and training function

In [None]:
from transformers import Trainer, TrainingArguments

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results_biobert',        # Output directory for model predictions and checkpoints
    evaluation_strategy="epoch",            # Evaluate at the end of each epoch
    learning_rate=2e-5,                     # Learning rate
    per_device_train_batch_size=8,          # Training batch size per device
    per_device_eval_batch_size=8,           # Evaluation batch size per device
    num_train_epochs=3,                      # Number of training epochs
    weight_decay=0.01,                      # Weight decay for optimization
    logging_dir='./logs',                   # Directory for storing logs
)

# Initialize the Trainer for the combined (original + augmented) dataset
trainer_medquad = Trainer(
    model=model,                             # The model to train
    args=training_args,                      # Training arguments
    train_dataset=augmeted_train_Set_tokenized,  # Training dataset
    # Note: evaluation reuses the training data, so the reported validation
    # loss does not measure performance on unseen questions
    eval_dataset=augmeted_train_Set_tokenized
)




Step 6: Fine-tune the BioBERT model on the augmented dataset

In [None]:
# Ensure all tensors are contiguous
for param in model.parameters():
    param.data = param.data.contiguous()

# Fine-tune BioBERT on Combined Train Set
trainer_medquad.train()

Epoch,Training Loss,Validation Loss
1,No log,0.086592
2,0.148000,0.042972
3,0.148000,0.035296


TrainOutput(global_step=891, training_loss=0.10505397766661297, metrics={'train_runtime': 201.2581, 'train_samples_per_second': 35.343, 'train_steps_per_second': 4.427, 'total_flos': 1858603830663168.0, 'train_loss': 0.10505397766661297, 'epoch': 3.0})
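
The step count in this output can be reproduced from the dataset size and batch size. The sketch assumes a single device, no gradient accumulation, and a logging interval of 500 steps (the `TrainingArguments` default, which this notebook does not override).

```python
import math

n_examples = 2371      # rows in the combined training set
batch_size = 8         # per_device_train_batch_size, assuming a single device
epochs = 3
logging_steps = 500    # assumed default logging interval

steps_per_epoch = math.ceil(n_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 297 891

# Number of training-loss log entries available at the end of each epoch
for epoch in range(1, epochs + 1):
    logs_so_far = (steps_per_epoch * epoch) // logging_steps
    print(epoch, logs_so_far)
```

Under these assumptions the first epoch ends at step 297, before any loss has been logged, which is consistent with the "No log" entry. The single log at step 500 would then explain why epochs 2 and 3 show the same training loss.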

### Step 7: Generate answers for the questions

In [None]:
import os

# Check the contents of the checkpoint directories
checkpoint_dir_891 = './results_biobert/checkpoint-891' #Final
print("Contents of checkpoint-891:", os.listdir(checkpoint_dir_891))


Contents of checkpoint-891: ['rng_state.pth', 'model.safetensors', 'training_args.bin', 'scheduler.pt', 'trainer_state.json', 'config.json', 'optimizer.pt']
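
Hard-coding `checkpoint-891` works for this run, but the step number changes whenever the dataset size, batch size, or epoch count changes. A minimal sketch of locating the most recent checkpoint by step number follows; `latest_checkpoint` is a hypothetical helper written for illustration, not part of the original notebook.

```python
import os
import re

def latest_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> subdirectory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        # Compare steps as integers: a plain string sort would rank "checkpoint-891"
        # after "checkpoint-1000".
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Only runs when the training output directory from this notebook exists
if os.path.isdir("./results_biobert"):
    print(latest_checkpoint("./results_biobert"))
```

The transformers library also provides a `get_last_checkpoint` helper in `transformers.trainer_utils` that serves a similar purpose.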


In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Choose the checkpoint to load
checkpoint_to_load = './results_biobert/checkpoint-891'  # or checkpoint-500

# Load the model from the checkpoint
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint_to_load,
                                                      use_safetensors=True)

# Load the tokenizer from the original BioBERT model
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1-squad")




### Generate Answers

In [None]:
import torch

def answer_question(question, context):
    # Tokenize the inputs
    inputs = tokenizer(
        question,
        context,
        return_tensors='pt',  # Return PyTorch tensors
        truncation=True,
        padding=True,
        max_length=512,
        return_offsets_mapping=True  # Keep this for later use if needed
    )

    model.eval()
    with torch.no_grad():
        # Pass only the necessary arguments to the model
        outputs = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])

    # Get the start and end positions of the answer
    start_position = torch.argmax(outputs.start_logits)
    end_position = torch.argmax(outputs.end_logits)

    # Decode the answer from the context based on the token indices
    if start_position <= end_position:
        # Decode the answer using token indices
        answer_ids = inputs['input_ids'][0][start_position:end_position + 1]
        answer = tokenizer.decode(answer_ids, skip_special_tokens=True)
    else:
        answer = ""  # No valid answer found

    # Clean the answer to remove unnecessary text
    answer = answer.strip()  # Remove leading and trailing whitespace

    # Remove the question from the answer, ignoring case (note: this lowercases the whole answer)
    answer = answer.lower().replace(question.lower(), "").strip()

    # If answer is empty after removing the question, set a default message
    if not answer:
        answer = "The answer is not available."

    return answer

# Accept user input for question and context
user_question = input("Enter your question: ")
user_context = input("Enter the context: ")

# Generate the answer using user input
answer = answer_question(user_question, user_context)
print(f"Question: {user_question}")
print(f"Answer: {answer}")


Enter your question: What is Asthma?
Enter the context: Asthma is a condition in which your airways narrow and swell and may produce extra mucus. This can make breathing difficult and trigger coughing, wheezing, and shortness of breath. Asthma is often linked to allergies, environmental factors, or respiratory infections.
Question: What is Asthma?
Answer: asthma is a condition in which your airways narrow and swell and may produce extra mucus. this can make breathing difficult and trigger coughing, wheezing, and shortness of breath. asthma is often linked to allergies, environmental factors, or respiratory infections.
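
`answer_question` takes the argmax of the start and end logits independently, so the chosen span can end before it starts or fall inside the question tokens. A common alternative is to search start and end positions jointly and restrict them to the context. The sketch below illustrates the idea on toy logits in plain Python; `best_span`, `max_answer_len`, and the logit values are illustrative assumptions rather than the notebook's method.

```python
def best_span(start_logits, end_logits, context_start, context_end, max_answer_len=30):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e], subject to
    context_start <= s <= e <= context_end and a span of at most max_answer_len tokens."""
    best_score, best = float("-inf"), None
    for s in range(context_start, context_end + 1):
        last_e = min(context_end, s + max_answer_len - 1)
        for e in range(s, last_e + 1):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy example: positions 0-2 stand for "[CLS] question [SEP]", positions 3-6 for the context.
start_logits = [5.0, 0.1, 0.0, 1.0, 3.0, 0.5, 0.2]
end_logits = [4.0, 0.2, 0.0, 0.3, 0.1, 2.5, 0.4]

# Independent argmax would pick position 0 ([CLS]) for both start and end;
# the constrained search stays inside the context.
print(best_span(start_logits, end_logits, context_start=3, context_end=6))
```

With a fast tokenizer, the context boundaries could be derived from `inputs.sequence_ids(0)`, where context tokens are marked with 1. Since MedQUAD answers often span several sentences, a `max_answer_len` much larger than the default used here would probably be needed in practice.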


### Step 8: Evaluation of the Fine-Tuned Model
We evaluate the model fine-tuned on the augmented data using BLEU and ROUGE scores. These metrics measure how closely the generated answers overlap with the reference answers.


Load the test dataset from the Google Cloud Storage bucket.

In [None]:
import pandas as pd
from google.cloud import storage

# Set your GCS bucket and file path
bucket_name = "bucket name " #Replace it with your bucket name
file_name = "test_dataset.csv"

# Full GCS file path
gcs_file_path = f"gs://{bucket_name}/{file_name}"

# Load the CSV into a Pandas DataFrame (reading gs:// paths with pandas requires the gcsfs package)
test_df = pd.read_csv(gcs_file_path)

# Remove the 'Unnamed: 0' column
test_df = test_df.drop(columns=['Unnamed: 0'])

# Verify the column is removed
print(test_df.columns)
test_df.head()

# If a cell holds a list, keep only its first element so every value is a string
test_df['Question'] = test_df['Question'].apply(lambda x: x[0] if isinstance(x, list) and len(x) > 0 else x)
test_df['Answer'] = test_df['Answer'].apply(lambda x: x[0] if isinstance(x, list) and len(x) > 0 else x)

#print(test_df.head())
print(test_df.info())

Index(['qtype', 'Question', 'Answer', 'source'], dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 657 entries, 0 to 656
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   qtype     657 non-null    object
 1   Question  657 non-null    object
 2   Answer    657 non-null    object
 3   source    657 non-null    object
dtypes: object(4)
memory usage: 20.7+ KB
None


In [None]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def evaluate_model_performance(model, test_df):
    bleu_scores = []
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    # Initialize ROUGE scorer
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    smooth_fn = SmoothingFunction().method1  # Apply smoothing method for BLEU

    for _, row in test_df.iterrows():
        question = row['Question']
        expected_answer = row['Answer']

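        # Generate the answer with answer_question, which uses the global `model`
        # (the `model` argument is not used). The reference answer doubles as the context.
        generated_answer = answer_question(question, expected_answer)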

        # BLEU score calculation with smoothing
        reference = [expected_answer.split()]  # BLEU expects a list of references (list of lists)
        candidate = generated_answer.split()  # Candidate answer is a single list of tokens
        bleu_score = sentence_bleu(reference, candidate, smoothing_function=smooth_fn)
        bleu_scores.append(bleu_score)

        # ROUGE score calculation
        rouge_score = scorer.score(expected_answer, generated_answer)
        rouge_scores['rouge1'].append(rouge_score['rouge1'].fmeasure)
        rouge_scores['rouge2'].append(rouge_score['rouge2'].fmeasure)
        rouge_scores['rougeL'].append(rouge_score['rougeL'].fmeasure)

    # Calculate average BLEU and ROUGE scores
    avg_bleu = sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0.0
    avg_rouge1 = sum(rouge_scores['rouge1']) / len(rouge_scores['rouge1']) if rouge_scores['rouge1'] else 0.0
    avg_rouge2 = sum(rouge_scores['rouge2']) / len(rouge_scores['rouge2']) if rouge_scores['rouge2'] else 0.0
    avg_rougeL = sum(rouge_scores['rougeL']) / len(rouge_scores['rougeL']) if rouge_scores['rougeL'] else 0.0

    return avg_bleu, avg_rouge1, avg_rouge2, avg_rougeL
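
To make the ROUGE-1 F-measure used above more concrete, the sketch below computes a simplified version of it by hand. It is an approximation under stated assumptions: `simple_tokens` and `unigram_f1` are hypothetical helpers, the tokenizer only lowercases and keeps alphanumeric runs, and no stemming is applied (the evaluation function uses `use_stemmer=True`).

```python
import re
from collections import Counter

def simple_tokens(text):
    """Lowercase the text and keep alphanumeric runs (a simplification of rouge_score's tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def unigram_f1(reference, candidate):
    """ROUGE-1-style F-measure: harmonic mean of unigram precision and recall."""
    ref = Counter(simple_tokens(reference))
    cand = Counter(simple_tokens(candidate))
    overlap = sum((ref & cand).values())      # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "Asthma is a condition in which your airways narrow and swell."
candidate = "asthma is a condition"
print(round(unigram_f1(reference, candidate), 3))
```

Because this comparison is case-insensitive, a lowercased answer is not penalized for capitalization. The BLEU computation above, by contrast, compares whitespace-split tokens exactly, so differences in case or attached punctuation count as mismatches there.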


In [None]:
# Evaluate the fine-tuned model on the test set
model_results = evaluate_model_performance(trainer_medquad.model, test_df)

# Prepare results in a tabular format
results_df = pd.DataFrame({
    'Metric': ['BLEU Score', 'ROUGE-1', 'ROUGE-2', 'ROUGE-L'],
    'Augmented Model': model_results,
})

# Print the results DataFrame
print(results_df)

       Metric  Augmented Model
0  BLEU Score         0.533345
1     ROUGE-1         0.833203
2     ROUGE-2         0.830837
3     ROUGE-L         0.833139


## Conclusion and Practical Impact

By augmenting our dataset with **synthetic data** generated from **NLPAug** and **DistilGPT-2**, we fine-tuned **BioBERT**, resulting in improved performance for the medical question-answering task. On the 657-question test set, the fine-tuned model reached a BLEU score of about 0.53 and ROUGE-1, ROUGE-2, and ROUGE-L scores of about 0.83. This approach showcases how **Generative AI** can fill gaps in datasets that are often small, unbalanced, or difficult to access due to data privacy concerns.

### Practical Impact in Healthcare:
- **Automating Medical Assistance**: Fine-tuning **BioBERT** on synthetic medical data can enhance models used in **automated diagnostic tools** or **virtual health assistants**, providing patients and doctors with quick, accurate answers to medical queries.
- **Data Privacy and Scalability**: Synthetic data allows us to create diverse training sets without directly using sensitive patient information, thus improving scalability while maintaining **data privacy**.
- **Cost-Effective Model Improvement**: Generating synthetic data reduces the need for expensive, manual data labeling, enabling faster and more cost-effective improvements in model performance.

This method demonstrates how **Generative AI** can revolutionize healthcare AI models, expanding their capabilities with minimal manual intervention.
