# Advanced Certification Program in Computational Data Science
## A programme by IISc and TalentSprint
### Mini-Project: Medical Q&A using GPT2

## Learning Objectives

At the end of the experiment, you will be able to:

* perform data preprocessing, EDA and feature extraction on the Medical Q&A dataset
* load a pre-trained tokenizer
* fine-tune a GPT-2 language model for medical question-answering

## Dataset Description

The dataset used in this project is the *Medical Question Answering Dataset* ([MedQuAD](https://github.com/abachaa/MedQuAD/tree/master)). It includes medical question-answer pairs along with additional information such as the question type, the question *focus*, and its UMLS (Unified Medical Language System) details: the Concept Unique Identifier (*CUI*), the Semantic *Type*, and the Semantic *Group*.

To learn more about how this data was collected and constructed, refer to this [paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4).

The data has been extracted into CSV format with the following features:

- **Focus**: the question focus
- **CUI**: concept unique identifier
- **SemanticType**
- **SemanticGroup**
- **Question**
- **Answer**

## Part-A: Grading = 10 Points

## Information

Healthcare professionals often have to consult medical literature and documents when seeking answers to medical queries. Medical databases and search engines are powerful sources of up-to-date medical knowledge. However, the existing documentation is so large that it is difficult for professionals to retrieve answers quickly in a clinical setting. The problem with search engines and information retrieval systems is that they return a list of documents rather than answers. Instead, healthcare professionals can use question-answering systems to retrieve short sentences or paragraphs in response to medical queries. The main advantage of such systems is that they generate answers and provide hints within a few seconds.

### Problem Statement

Fine-tune the GPT-2 model on the Medical Question Answering Dataset to generate responses to medical queries.

Please refer to ***M6 Assignment-1 Fine-tune GPT2*** to get familiar with loading the pre-trained GPT-2 tokenizer and model.

### Import required packages

In [None]:
!pip -q install -U accelerate
!pip -q install -U transformers
!pip -q install torch

In [None]:
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

import warnings
warnings.filterwarnings('ignore')

In [None]:
#@title Download the dataset
!wget -q https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/MedQuAD.csv
!ls | grep ".csv"

MedQuAD.csv


**Exercise 1: Read the MedQuAD.csv dataset**

**Hint:** pd.read_csv()

In [None]:
df = pd.read_csv("MedQuAD.csv")
df.shape

(16412, 6)

In [None]:
df.head()

Unnamed: 0,Focus,CUI,SemanticType,SemanticGroup,Question,Answer
0,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What is (are) Adult Acute Lymphoblastic Leukem...,Key Points - Adult acute lymphoblastic leukemi...
1,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What are the symptoms of Adult Acute Lymphobla...,"Signs and symptoms of adult ALL include fever,..."
2,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,How to diagnose Adult Acute Lymphoblastic Leuk...,Tests that examine the blood and bone marrow a...
3,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What is the outlook for Adult Acute Lymphoblas...,Certain factors affect prognosis (chance of re...
4,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,Who is at risk for Adult Acute Lymphoblastic L...,Previous chemotherapy and exposure to radiatio...


### Pre-processing and EDA

**Exercise 2: Perform the operations below on the dataset [0.5 Mark]**

- Handle missing values
- Remove duplicates from data considering `Question` and `Answer` columns

- **Handle missing values**

In [None]:
# YOUR CODE HERE
pd.concat([df.isna().sum(), df.isna().sum() / df.shape[0] * 100], axis=1).rename({0: 'Count of missing values', 1: 'Percentage of missing values'}, axis=1)

Column,Count of missing values,Percentage of missing values
Focus,14,0.085303
CUI,565,3.442603
SemanticType,597,3.637582
SemanticGroup,565,3.442603
Question,0,0.0
Answer,5,0.030466


In [None]:
# Drop missing values
# YOUR CODE HERE
df = df.dropna().reset_index(drop=True)
df.shape

(15810, 6)

- **Remove duplicates from data considering `Question` and `Answer` columns**

In [None]:
# Check duplicates
# YOUR CODE HERE
df.duplicated(subset=['Question', 'Answer']).sum()

48

In [None]:
# Drop duplicates
# YOUR CODE HERE
df = df.drop_duplicates(['Question', 'Answer'], keep='first').reset_index(drop=True)
df.shape

(15762, 6)

In [None]:
# Check duplicates
# YOUR CODE HERE
df.duplicated(subset=['Question', 'Answer']).sum()

0

**Exercise 3: Display the category name and the number of records for the top 100 categories of the `Focus` column [1 Mark]**

In [None]:
# YOUR CODE HERE
df['Focus'].value_counts(sort=True, ascending=False).head(100)

Focus,count
Breast Cancer,53
Prostate Cancer,43
Stroke,35
Skin Cancer,34
Alzheimer's Disease,30
...,...
Alzheimer's Caregiving,11
Polycythemia Vera,11
"Diabetes, Heart Disease, and Stroke",11
Pelizaeus-Merzbacher disease,10


In [None]:
# Top 100 Focus categories names
top_100_categories = df['Focus'].value_counts(sort=True, ascending=False).head(100)
for category, count in top_100_categories.items():
    print(f"{category}: {count}")

Breast Cancer: 53
Prostate Cancer: 43
Stroke: 35
Skin Cancer: 34
Alzheimer's Disease: 30
Colorectal Cancer: 29
Lung Cancer: 29
Heart Failure: 28
Heart Attack: 28
High Blood Cholesterol: 28
High Blood Pressure: 27
Parkinson's Disease: 25
Leukemia: 22
Osteoporosis: 21
Shingles: 21
Hemochromatosis: 20
Age-related Macular Degeneration: 20
Diabetes: 20
Gum (Periodontal) Disease: 19
Diabetic Retinopathy: 19
Psoriasis: 19
Kidney Disease: 17
Dry Mouth: 16
COPD: 16
Cataract: 16
Balance Problems: 16
Gout: 15
Wilson Disease: 15
Medicare and Continuing Care: 15
Prescription and Illicit Drug Abuse: 15
Glaucoma: 15
Rheumatoid Arthritis: 14
Neuroblastoma: 14
Short Bowel Syndrome: 14
Problems with Taste: 14
Narcolepsy: 14
Endometrial Cancer: 14
Osteoarthritis: 14
Kidney Dysplasia: 13
Problems with Smell: 13
Dry Eye: 13
Pituitary Tumors: 13
Anxiety Disorders: 13
Urinary Tract Infections in Children: 13
Peripheral Arterial Disease (P.A.D.): 13
Surviving Cancer: 13
Amyloidosis and Kidney Disease: 12
Abdo

### Create Training and Validation set

**Exercise 4: Create training and validation set [2 Marks]**

- Take 4 samples per `Focus` category from each of the top 100 categories (this gives 400 samples for training)

- Take 1 sample per `Focus` category, different from those in the training set, from each of the top 100 categories (this gives 100 samples for validation)

In [None]:
train_parts, val_parts = [], []
for category in top_100_categories.index:
    # Sample 5 rows per category, then split them 4:1 so the two sets never overlap
    sampled = df[df['Focus'] == category].sample(5, random_state=42)
    train, val = train_test_split(sampled, test_size=0.2, random_state=42)
    train_parts.append(train)
    val_parts.append(val)
# Concatenate once at the end instead of growing empty DataFrames inside the loop
training = pd.concat(train_parts).reset_index(drop=True)
validation = pd.concat(val_parts).reset_index(drop=True)
print(training.shape, validation.shape)

(400, 6) (100, 6)
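
The per-category split above can also be pictured without pandas. The following is a minimal pure-Python sketch of the same idea on toy data; the helper name `split_per_category` and the toy categories are hypothetical and not part of the original notebook.

```python
import random
from collections import defaultdict

def split_per_category(records, key, n_train=4, n_val=1, seed=42):
    """Pick n_train + n_val rows per category and split them without overlap."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for row in records:
        by_category[row[key]].append(row)

    train, val = [], []
    for rows in by_category.values():
        if len(rows) < n_train + n_val:
            continue  # skip categories that are too small to split
        picked = rng.sample(rows, n_train + n_val)  # sampling without replacement
        train.extend(picked[:n_train])
        val.extend(picked[n_train:])
    return train, val

# Toy data: two hypothetical categories with six rows each
toy = [{"Focus": f, "Question": f"q{i}"} for f in ("A", "B") for i in range(6)]
train_rows, val_rows = split_per_category(toy, key="Focus")
print(len(train_rows), len(val_rows))  # 8 2
```

Because each category's rows are sampled once and then sliced, a row can never land in both sets, which is the property the exercise asks for.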


### Pre-process `Question` and `Answer` text

**Exercise 5: Perform the tasks below: [1.5 Marks]**

- Combine `Question` and `Answer` for train and validation data as shown below:
    - sequence = *'\<question\>' + question-text + '\<answer\>' + answer-text*

- Join the combined text using '\n' into a single string for training and validation separately

- Save the training and validation strings as separate text files

- **Combine Question and Answer for train and val data**

In [None]:
# YOUR CODE HERE
training['sequence'] = '<question>' + training['Question'] + '<answer>' + training['Answer']
validation['sequence'] = '<question>' + validation['Question'] + '<answer>' + validation['Answer']

In [None]:
training.head()

Unnamed: 0,Focus,CUI,SemanticType,SemanticGroup,Question,Answer,sequence
0,Breast Cancer,C0006142,T191,Disorders,What are the treatments for Breast Cancer ?,You can seek conventional treatment from a spe...,<question>What are the treatments for Breast C...
1,Breast Cancer,C0006142,T191,Disorders,What is (are) Breast Cancer ?,There are two types of breast-conserving surge...,<question>What is (are) Breast Cancer ?<answer...
2,Breast Cancer,C0006142,T191,Disorders,Who is at risk for Breast Cancer? ?,Key Points - Avoiding risk factors and increas...,<question>Who is at risk for Breast Cancer? ?<...
3,Breast Cancer,C0006142,T191,Disorders,What are the symptoms of Breast Cancer ?,Signs of breast cancer include a lump or chang...,<question>What are the symptoms of Breast Canc...
4,Prostate Cancer,C0376358,T191,Disorders,What are the treatments for Prostate Cancer ?,There are a number of ways to treat prostate c...,<question>What are the treatments for Prostate...


- **Join the combined text using '\n' into a single string for training and validation separately**

In [None]:
# YOUR CODE HERE
training_text = '\n'.join(training['sequence'])
validation_text = '\n'.join(validation['sequence'])

- **Save the training and validation strings as text files**

In [None]:
# YOUR CODE HERE
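# Write with an explicit encoding so the files read back the same way on any platform
with open('training.txt', 'w', encoding='utf-8') as f:
    f.write(training_text)

with open('validation.txt', 'w', encoding='utf-8') as f:
    f.write(validation_text)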

**Exercise 6: Load pre-trained GPT2Tokenizer [0.5 Mark]**

- Use checkpoint = "gpt2"

In [None]:
# YOUR CODE HERE
checkpoint = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)

**Exercise 7: Tokenize train and validation data and form TextDataset objects [0.5 Mark]**

- Use the loaded pre-trained tokenizer
- Use training and validation data saved in text files

In [None]:
# YOUR CODE HERE
train_dataset = TextDataset(tokenizer=tokenizer, file_path="training.txt", block_size=128)
val_dataset = TextDataset(tokenizer=tokenizer, file_path="validation.txt", block_size=128)

In [None]:
len(train_dataset), len(val_dataset)

(959, 218)
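
The dataset lengths are larger than the 400 and 100 Q&A pairs because, as far as I understand it, `TextDataset` tokenizes each file as one continuous stream and cuts it into consecutive blocks of `block_size` tokens, dropping any trailing remainder. The sketch below imitates that chunking on plain integers; `chunk_token_ids` is a hypothetical helper, not a library function.

```python
def chunk_token_ids(token_ids, block_size):
    """Split a flat token list into full blocks, dropping the trailing remainder."""
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

toy_ids = list(range(10))  # stand-in for the tokenizer's output on a whole file
blocks = chunk_token_ids(toy_ids, block_size=4)
print(blocks)       # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(len(blocks))  # 2
```

Under that assumption, 959 blocks of 128 tokens would correspond to roughly 123k tokens in `training.txt`. It also means a block can start in the middle of an answer and span several Q&A pairs.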

**Exercise 8: Create a DataCollator object [0.5 Mark]**

In [None]:
# YOUR CODE HERE
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="pt")
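
With `mlm=False` the collator prepares batches for causal language modeling: the labels are a copy of the input IDs, with padding positions set to `-100` so the loss ignores them. The model itself shifts the labels by one position when computing the loss. A minimal pure-Python sketch of the label construction follows; `causal_lm_labels` is a hypothetical helper for illustration only.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def causal_lm_labels(input_ids, pad_token_id=None):
    """Copy input_ids as labels, masking padding so it does not contribute to the loss."""
    return [
        IGNORE_INDEX if pad_token_id is not None and tok == pad_token_id else tok
        for tok in input_ids
    ]

print(causal_lm_labels([11, 22, 33]))                    # [11, 22, 33]
print(causal_lm_labels([11, 22, 0, 0], pad_token_id=0))  # [11, 22, -100, -100]
```

In this notebook every example is a full 128-token block, so no padding should be needed in practice.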

**Exercise 9: Load pre-trained GPT2LMHeadModel [0.5 Mark]**

In [None]:
# YOUR CODE HERE
model = GPT2LMHeadModel.from_pretrained(checkpoint)

**Exercise 10: Fine-tune GPT2 Model [1 Mark]**

- Specify training arguments and create a TrainingArguments object (Use 30 epochs)

- Train a GPT-2 model using the provided training arguments

- Save the resulting trained model and tokenizer to a specified output directory

In [None]:
# Set up the training arguments

# YOUR CODE HERE
model_output_path = "/content/gpt_model"

training_args = TrainingArguments(
    output_dir = model_output_path,
    overwrite_output_dir = True,
    per_device_train_batch_size = 4, # try with 2
    per_device_eval_batch_size = 4,  #  try with 2
    num_train_epochs = 30,
    save_steps = 1_000,
    save_total_limit = 2,
    logging_dir = './logs',
    )
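
To get a feel for how long this configuration runs, the number of optimizer steps can be estimated from the dataset size, batch size, and epoch count. The sketch below assumes a single device and no gradient accumulation; `total_optimizer_steps` is a hypothetical helper, and 959 is the training set length reported earlier in this notebook.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Estimate optimizer steps, assuming one device and that the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

print(total_optimizer_steps(num_examples=959, batch_size=4, epochs=30))  # 7200
```

With `save_steps = 1_000` and `save_total_limit = 2`, a checkpoint would be written every 1,000 steps and only the two most recent ones kept. The loss table shown after training ends at step 5000, so that output may be truncated or may come from a run with different settings.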

In [None]:
# Train the model
# YOUR CODE HERE
trainer = Trainer(
    model = model,
    args = training_args,
    data_collator = data_collator,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
)

trainer.train()

# Save the model
# YOUR CODE HERE
trainer.save_model(model_output_path)

# Save the tokenizer
# YOUR CODE HERE
tokenizer.save_pretrained(model_output_path)

Step,Training Loss
500,2.5082
1000,1.9364
1500,1.5839
2000,1.2881
2500,1.0567
3000,0.8651
3500,0.7223
4000,0.6067
4500,0.5181
5000,0.451


('/content/gpt_model/tokenizer_config.json',
 '/content/gpt_model/special_tokens_map.json',
 '/content/gpt_model/vocab.json',
 '/content/gpt_model/merges.txt',
 '/content/gpt_model/added_tokens.json')
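
The logged loss is a mean per-token cross-entropy, so it can be turned into perplexity by exponentiating it. The sketch below applies this to the last training loss in the table above. A validation figure would need an evaluation pass, for example `trainer.evaluate()["eval_loss"]`, which this notebook does not run, so that call is only mentioned here as an assumption.

```python
import math

def perplexity(mean_cross_entropy_loss):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_loss)

# Last logged training loss from the table above
print(round(perplexity(0.451), 2))  # 1.57
```

A low training perplexity on 400 pairs after 30 epochs may indicate memorization rather than generalization, so it is best read alongside a validation loss.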

**Exercise 11: Test Model with user input prompts [1 Mark]**

- Create `generate_response()` function that takes a trained *model*, *tokenizer*, and a *prompt* string as input and generates a response using the GPT-2 model

- Test it with some user input prompts

In [None]:
# Load the fine-tuned model and tokenizer

# YOUR CODE HERE
my_model = GPT2LMHeadModel.from_pretrained(model_output_path)
my_tokenizer = GPT2Tokenizer.from_pretrained(model_output_path)

In [None]:
# Response from model

# YOUR CODE HERE
def generate_response(model, tokenizer, prompt, max_length=100):

    input_ids = tokenizer.encode(prompt, return_tensors="pt")      # 'pt' for returning pytorch tensor

    # Create the attention mask and pad token id
    attention_mask = torch.ones_like(input_ids)
    pad_token_id = tokenizer.eos_token_id

    output = model.generate(
        input_ids,
        max_length=max_length,
        num_return_sequences=1,
        attention_mask=attention_mask,
        pad_token_id=pad_token_id
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)
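
The model was fine-tuned on sequences of the form `<question>...<answer>...`, so wrapping a prompt in the same tags may steer it toward answer-style text. The helpers below are a hedged sketch of that idea; `format_prompt` and `extract_answer` are hypothetical names, and whether this improves the responses has not been verified here.

```python
QUESTION_TAG = "<question>"
ANSWER_TAG = "<answer>"

def format_prompt(question):
    """Wrap a raw question in the same tags used to build the training sequences."""
    return f"{QUESTION_TAG}{question}{ANSWER_TAG}"

def extract_answer(generated_text):
    """Return the text after the first answer tag, cut at the next question tag if any.

    Returns an empty string when no answer tag is present.
    """
    _, _, answer = generated_text.partition(ANSWER_TAG)
    answer, _, _ = answer.partition(QUESTION_TAG)
    return answer.strip()

prompt = format_prompt("What are the symptoms of Glaucoma ?")
print(prompt)  # <question>What are the symptoms of Glaucoma ?<answer>

# Made-up generated text, used only to show the extraction step
sample_output = prompt + "Example answer text.\n<question>What is (are) Cataract ?<answer>..."
print(extract_answer(sample_output))  # Example answer text.
```

The tags are ordinary text rather than registered special tokens, so `skip_special_tokens=True` does not strip them from the decoded output. A call would look like `extract_answer(generate_response(my_model, my_tokenizer, format_prompt(question)))`.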

In [None]:
# Testing with given prompt 1

# YOUR CODE HERE
prompt = "What precautions to take for a healthy life?"
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: What precautions to take for a healthy life? - Keep your blood glucose level below 70 mg/dL. - Keep your blood glucose level below 70 mg/dL for up to 1 year after symptoms occur. - Call your health care provider right away if your blood glucose level is higher than 70 mg/dL. - If your systolic (upper number) blood glucose reading is higher than 70 mg/dL, you have a high risk of heart disease. - Keep your blood glucose level below 70


In [None]:
# Testing with given prompt 2

# YOUR CODE HERE
prompt = "What to do when feeling sick?"
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: What to do when feeling sick? - Have you ever felt you should be able to do something about your sickle cell aneurysms? - Have you ever felt like you should be able to do something about your sickle cell aneurysms? - Have you ever felt like you should be able to protect your home or work from the sun? Have you ever felt like you should be able to protect your home or work from the sun? If you answered no to any of these questions
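
Both responses above repeat phrases, which is common when `generate()` runs with its default greedy decoding. One option is to enable sampling and block repeated n-grams. The sketch below uses real `generate()` keyword arguments, but the specific values are untuned assumptions and `generate_with_sampling` is a hypothetical helper.

```python
# Hypothetical settings: common starting points, not tuned for this model
SAMPLING_KWARGS = {
    "max_new_tokens": 100,      # length budget for the generated continuation
    "do_sample": True,          # sample from the distribution instead of taking the argmax
    "top_k": 50,                # keep only the 50 most likely next tokens
    "top_p": 0.95,              # then keep the smallest set covering 95% probability
    "no_repeat_ngram_size": 3,  # never generate the same 3-gram twice
}

def generate_with_sampling(model, tokenizer, prompt, **overrides):
    """Generate text with sampling-based decoding instead of greedy search."""
    inputs = tokenizer(prompt, return_tensors="pt")  # input_ids and attention_mask
    kwargs = {**SAMPLING_KWARGS, "pad_token_id": tokenizer.eos_token_id, **overrides}
    output = model.generate(**inputs, **kwargs)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Sampling makes the output non-deterministic, so repeated calls with the same prompt can give different responses unless a random seed is fixed.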


**Exercise 12: Compare the performance of a *GPT2 model* with the *GPT2 model fine-tuned* on MedQuAD data [1 Mark]**

- Load another pre-trained GPT2LMHeadModel and do not fine-tune it

- To generate a response with the untuned model, pass it as an argument to the `generate_response()` function

- Test both models (fine-tuned and untuned) with the user input prompts below:

    - "What precautions to take for a healthy life?"
    - "What to do after being diagnosed with cancer?"
    - "What to do when feeling sick?"

In [None]:
# Load a pre-trained GPT2 model, do not finetune it with MedQuAD data

# YOUR CODE HERE
untuned_model = GPT2LMHeadModel.from_pretrained(checkpoint)
untuned_tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)

In [None]:
# Testing with finetuned model: prompt 1

# YOUR CODE HERE
prompt = "What precautions to take for a healthy life?"
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: What precautions to take for a healthy life? - Keep your blood glucose level below 70 mg/dL. - Keep your blood glucose level below 70 mg/dL for up to 1 year after symptoms occur. - Call your health care provider right away if your blood glucose level is higher than 70 mg/dL. - If your systolic (upper number) blood glucose reading is higher than 70 mg/dL, you have a high risk of heart disease. - Keep your blood glucose level below 70


In [None]:
# Testing with untuned model: prompt 1

# YOUR CODE HERE
prompt = "What precautions to take for a healthy life?"
response = generate_response(untuned_model, untuned_tokenizer, prompt)
print("Generated response:", response)

Generated response: What precautions to take for a healthy life?

The following are some of the most common questions you'll hear from your doctor or nurse about your health.

What are the risks of taking a drug that can cause cancer?

The risks of taking a drug that can cause cancer are very high.

What are the risks of taking a drug that can cause cancer?

The risks of taking a drug that can cause cancer are very high.

What are the risks


In [None]:
# Testing with finetuned model: prompt 2

# YOUR CODE HERE
prompt = "What to do after being diagnosed with cancer?"
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: What to do after being diagnosed with cancer?
<question>What is (are) Cataract ?<answer>Yes. Certain tumors can cause cataract surgery. These include melanoma, cystic fibrosis, and squamous cell carcinoma. Surgery is done to remove the tumor. The surgeon may remove the entire tumor using a single operation. The surgeon may also remove only part of it. The surgeon does not have to remove the entire tumor. The surgeon may remove only part


In [None]:
# Testing with untuned model: prompt 2

# YOUR CODE HERE
prompt = "What to do after being diagnosed with cancer?"
response = generate_response(untuned_model, untuned_tokenizer, prompt)
print("Generated response:", response)

Generated response: What to do after being diagnosed with cancer?

The first step is to get your doctor's approval for a treatment.

If you have a cancer diagnosis, you may need to get a second opinion.

If you have a cancer diagnosis, you may need to get a second opinion. If you have a cancer diagnosis, you may need to get a third opinion.

If you have a cancer diagnosis, you may need to get a third opinion. If you have a cancer


In [None]:
# Testing with finetuned model: prompt 3

# YOUR CODE HERE
prompt = "What to do when feeling sick?"
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:", response)

Generated response: What to do when feeling sick? - Have you ever felt you should be able to do something about your sickle cell aneurysms? - Have you ever felt like you should be able to do something about your sickle cell aneurysms? - Have you ever felt like you should be able to protect your home or work from the sun? Have you ever felt like you should be able to protect your home or work from the sun? If you answered no to any of these questions


In [None]:
# Testing with untuned model: prompt 3

# YOUR CODE HERE
prompt = "What to do when feeling sick?"
response = generate_response(untuned_model, untuned_tokenizer, prompt)
print("Generated response:", response)

Generated response: What to do when feeling sick?

The first thing you should do is to get your body to relax.

If you're feeling sick, you should take a few minutes to relax.

If you're feeling sick, you should take a few minutes to relax.

If you're feeling sick, you should take a few minutes to relax.

If you're feeling sick, you should take a few minutes to relax.

If you're feeling sick, you
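
The comparison above is qualitative. One rough way to put a number on the repetition visible in both models' outputs is the fraction of unique word n-grams in a response. This metric is an addition for illustration, not part of the original exercise, and `distinct_n` is a hypothetical helper.

```python
def distinct_n(text, n=2):
    """Fraction of word n-grams in `text` that are unique (1.0 means no repeats)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Made-up strings that mimic a repetitive and a varied response
repetitive = "you should relax. you should relax. you should relax."
varied = "rest, drink fluids, and call your doctor if symptoms persist."
print(distinct_n(repetitive))  # 0.375
print(distinct_n(varied))      # 1.0
```

A higher score only means less repetition. It says nothing about medical correctness, so it complements rather than replaces reading the responses.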


In [None]:
trainer.save_model("/content/Model_Medical")
tokenizer.save_pretrained("/content/Model_Medical")

('/content/Model_Medical/tokenizer_config.json',
 '/content/Model_Medical/special_tokens_map.json',
 '/content/Model_Medical/vocab.json',
 '/content/Model_Medical/merges.txt',
 '/content/Model_Medical/added_tokens.json')

In [None]:
import shutil
shutil.make_archive('/content/Model_Medical', 'zip', '/content/Model_Medical')

'/content/Model_Medical.zip'