# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Mini-Project: Medical Q&A using GPT2 | Deployment on Hugging Face Spaces

## Learning Objectives

At the end of the experiment, you will be able to:

* perform data preprocessing, EDA and feature extraction on the Medical Q&A dataset
* load a pre-trained tokenizer
* fine-tune a GPT-2 language model for medical question-answering
* upload your fine-tuned model to the Hugging Face Model Hub
* deploy an application that uses the uploaded model on Hugging Face Spaces with Gradio

## Dataset Description

The dataset used in this project is the *Medical Question Answering Dataset* ([MedQuAD](https://github.com/abachaa/MedQuAD/tree/master)). It includes medical question-answer pairs along with additional information, such as the question type, the question *focus*, and its UMLS (Unified Medical Language System) details: the Concept Unique Identifier (*CUI*), the Semantic *Type*, and the Semantic *Group*.

To learn more about how this data was collected and constructed, refer to this [paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4).

The data has been extracted into CSV format with the following features:

- **Focus**: the question focus
- **CUI**: concept unique identifier
- **SemanticType**
- **SemanticGroup**
- **Question**
- **Answer**

## Grading = 10 Points

## Information

Healthcare professionals often have to consult medical literature and documents when seeking answers to medical queries. Medical databases and search engines are powerful sources of up-to-date medical knowledge. However, the existing documentation is large, which makes it difficult for professionals to retrieve answers quickly in a clinical setting. The problem with search engines and information retrieval engines is that they return a list of documents rather than answers. Healthcare professionals can instead use question-answering systems to retrieve short sentences or paragraphs in response to medical queries. The main advantage of such systems is that they generate answers and provide hints within a few seconds.

### Problem Statement

Fine-tune the GPT-2 model on the Medical Question Answering dataset to generate responses to medical queries. Then deploy the fine-tuned model on Hugging Face Spaces.

Please refer to ***M4 Assignment-1 Fine-tune GPT2*** and ***M4 AdditionalNB Fine-tune GPT2 for TextClassification*** to get familiar with loading the pre-trained GPT-2 tokenizer and model.

Please refer to ***The demo session held on 26 Jan - Hugging Face Spaces Deployment*** to get familiar with deploying on Hugging Face Spaces.

### Installing Dependencies

In [None]:
%%capture
!pip -q uninstall pyarrow -y
!pip -q install pyarrow==15.0.2
!pip -q install datasets
!pip -q install accelerate
!pip -q install transformers

### <font color="#990000">Restart Session/Runtime</font>

### Import required packages

In [None]:
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

import warnings
warnings.filterwarnings('ignore')

In [None]:
#@title Download the dataset
!wget -q https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/MedQuAD.csv
!ls | grep ".csv"

MedQuAD.csv


**Exercise 1: Read the MedQuAD.csv dataset**

**Hint:** pd.read_csv()

In [None]:
# YOUR CODE HERE
medqa_data_raw = pd.read_csv('/content/MedQuAD.csv')
medqa_data_raw.head()

| | Focus | CUI | SemanticType | SemanticGroup | Question | Answer |
|---|---|---|---|---|---|---|
| 0 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | What is (are) Adult Acute Lymphoblastic Leukem... | Key Points - Adult acute lymphoblastic leukemi... |
| 1 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | What are the symptoms of Adult Acute Lymphobla... | Signs and symptoms of adult ALL include fever,... |
| 2 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | How to diagnose Adult Acute Lymphoblastic Leuk... | Tests that examine the blood and bone marrow a... |
| 3 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | What is the outlook for Adult Acute Lymphoblas... | Certain factors affect prognosis (chance of re... |
| 4 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | Who is at risk for Adult Acute Lymphoblastic L... | Previous chemotherapy and exposure to radiatio... |


### Pre-processing and EDA

**Exercise 2: Perform the operations below on the dataset [0.5 Mark]**

- Handle missing values
- Remove duplicates from data considering `Question` and `Answer` columns

- **Handle missing values**

In [None]:
# YOUR CODE HERE
#Checking for missing values
print(f'Before \n --- \n{medqa_data_raw.isnull().sum()} \n')

medqa_data_1 = medqa_data_raw.dropna()

print(f'After \n --- \n{medqa_data_1.isnull().sum()} \n')

Before 
 --- 
Focus             14
CUI              565
SemanticType     597
SemanticGroup    565
Question           0
Answer             5
dtype: int64 

After 
 --- 
Focus            0
CUI              0
SemanticType     0
SemanticGroup    0
Question         0
Answer           0
dtype: int64 



- **Remove duplicates from data considering `Question` and `Answer` columns**

In [None]:
# YOUR CODE HERE

print(f"Before \n ---- \n {medqa_data_1.duplicated(subset=['Question', 'Answer']).sum()} \n")

medqa_data_2 = medqa_data_1.drop_duplicates(subset=['Question', 'Answer'], keep='first')

print(f"After \n ---- \n {medqa_data_2.duplicated(subset=['Question', 'Answer']).sum()} \n")



Before 
 ---- 
 48 

After 
 ---- 
 0 



**Exercise 3: Display the category name, and the number of records belonging to top 100 categories of `Focus` column [0.5 Mark]**

In [None]:
# Total categories in Focus column
# YOUR CODE HERE
total_categories = medqa_data_2['Focus'].value_counts()
print(total_categories)

Focus
Breast Cancer                                   53
Prostate Cancer                                 43
Stroke                                          35
Skin Cancer                                     34
Alzheimer's Disease                             30
                                                ..
Malignant hyperthermia susceptibility type 3     1
Malignant peripheral nerve sheath tumor          1
Malonyl-CoA decarboxylase deficiency             1
Mandibuloacral dysplasia                         1
Tricho-dento-osseous syndrome                    1
Name: count, Length: 4770, dtype: int64


In [None]:
# Displaying the distinct categories of the Focus column, in order of first appearance
# (record counts for the top 100 categories are computed in a later cell)

# YOUR CODE HERE

unique_categories = medqa_data_2['Focus'].unique()

for category in unique_categories:
    print(f' Categories --> {category}')



 Categories --> Adult Acute Lymphoblastic Leukemia
 Categories --> Adult Acute Myeloid Leukemia
 Categories --> Chronic Lymphocytic Leukemia
 Categories --> Chronic Myelogenous Leukemia
 Categories --> Hairy Cell Leukemia
 Categories --> Childhood Acute Lymphoblastic Leukemia
 Categories --> Childhood Acute Myeloid Leukemia and Other Myeloid Malignancies
 Categories --> Adult Soft Tissue Sarcoma
 Categories --> Gastrointestinal Stromal Tumors
 Categories --> Kaposi Sarcoma
 Categories --> Childhood Rhabdomyosarcoma
 Categories --> Childhood Soft Tissue Sarcoma
 Categories --> Childhood Vascular Tumors
 Categories --> Adult Hodgkin Lymphoma
 Categories --> Adult Non-Hodgkin Lymphoma
 Categories --> AIDS-Related Lymphoma
 Categories --> Mycosis Fungoides and the Szary Syndrome
 Categories --> Primary CNS Lymphoma
 Categories --> Childhood Hodgkin Lymphoma
 Categories --> Childhood Non-Hodgkin Lymphoma
 Categories --> Anal Cancer
 Categories --> Adult Central Nervous System Tumors
 Catego

In [None]:
medqa_data_2.describe()

| | Focus | CUI | SemanticType | SemanticGroup | Question | Answer |
|---|---|---|---|---|---|---|
| count | 15762 | 15762 | 15762 | 15762 | 15762 | 15762 |
| unique | 4770 | 3325 | 14 | 1 | 14393 | 15222 |
| top | Breast Cancer | C0039082 | T047 | Disorders | What is (are) High Blood Cholesterol ? | This condition is inherited in an autosomal re... |
| freq | 53 | 351 | 9599 | 15762 | 19 | 347 |


In [None]:
# Top 100 Focus categories names

# YOUR CODE HERE

top_100_categories = medqa_data_2['Focus'].value_counts().head(100)
print(f'Top 100 Categories \n------------- \n{top_100_categories}')

# Note: .str.strip().str.lower() reduces the shape to 145 rows only

Top 100 Categories 
------------- 
Focus
Breast Cancer                                                       53
Prostate Cancer                                                     43
Stroke                                                              35
Skin Cancer                                                         34
Alzheimer's Disease                                                 30
                                                                    ..
Sarcoidosis                                                         11
Polycythemia Vera                                                   11
Celiac Disease                                                      11
Down syndrome                                                       10
Microscopic Colitis: Collagenous Colitis and Lymphocytic Colitis    10
Name: count, Length: 100, dtype: int64


### Create Training and Validation set

**Exercise 4: Create training and validation set [1 Mark]**

- Take 4 samples per `Focus` category from each of the top 100 categories (this gives 400 samples for training)

- Take 1 sample per `Focus` category, different from those in the training set, from each of the top 100 categories (this gives 100 samples for validation)
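The selection rule described above can be sketched without pandas as follows. This is a minimal illustration on made-up records, not the notebook's solution: the helper name `split_per_category` and the toy data are assumptions, and here the validation row is the one immediately after the training rows of each category.

```python
from collections import Counter, defaultdict

def split_per_category(records, key, top_n, n_train, n_val):
    """Pick n_train + n_val rows from each of the top_n most frequent categories."""
    counts = Counter(r[key] for r in records)
    top = {name for name, _ in counts.most_common(top_n)}

    seen = defaultdict(int)  # rows already visited per category
    train, val = [], []
    for r in records:
        cat = r[key]
        if cat not in top:
            continue
        if seen[cat] < n_train:
            train.append(r)          # first n_train rows go to training
        elif seen[cat] < n_train + n_val:
            val.append(r)            # the next n_val rows go to validation
        seen[cat] += 1
    return train, val

# Toy records standing in for MedQuAD rows (values are made up).
records = (
    [{"Focus": "A", "Question": f"a{i}"} for i in range(6)]
    + [{"Focus": "B", "Question": f"b{i}"} for i in range(5)]
    + [{"Focus": "C", "Question": f"c{i}"} for i in range(2)]
)
train, val = split_per_category(records, "Focus", top_n=2, n_train=4, n_val=1)
print(len(train), len(val))
```

Because each row is assigned to at most one of the two lists, the training and validation sets cannot share a sample.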

In [None]:
# YOUR CODE HERE

top_categories_names = top_100_categories.index

medqa_data_top_100 = medqa_data_2[medqa_data_2['Focus'].isin(top_categories_names)]

# medqa_data_top_100.head(10)

# print(medqa_data_top_100.describe)

print("Filtered data shape:", medqa_data_top_100.shape)
print("Sample data:\n", medqa_data_top_100.head())

if medqa_data_top_100.empty:
    raise ValueError("Error: No matching data. Check 'Focus' column values or top category names.")

print('-----------------------------------------------------------------------')

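# Creating Training Set: the first 4 rows of each Focus category
training_set = medqa_data_top_100.groupby('Focus').head(4)

# Creating Validation Set: the last row of each Focus category.
# Every top-100 category has at least 10 rows, so this row never overlaps with the training rows.
validation_set = medqa_data_top_100.groupby('Focus').tail(1)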

print(f"Training Data Shape: {training_set.shape}")
print(f"Validation Data Shape: {validation_set.shape}")

print('-----------------------------------------------------------------------')

print(f"Training Data Size: {len(training_set)}")
print(f"Validation Data Size: {len(validation_set)}")




Filtered data shape: (1532, 6)
Sample data:
                   Focus       CUI SemanticType SemanticGroup  \
281   Polycythemia Vera  C0032463         T191     Disorders   
282   Polycythemia Vera  C0032463         T191     Disorders   
283   Polycythemia Vera  C0032463         T191     Disorders   
284   Polycythemia Vera  C0032463         T191     Disorders   
320  Endometrial Cancer  C1883486         T191     Disorders   

                                            Question  \
281                What is (are) Polycythemia Vera ?   
282     What are the symptoms of Polycythemia Vera ?   
283              How to diagnose Polycythemia Vera ?   
284  What are the treatments for Polycythemia Vera ?   
320               What is (are) Endometrial Cancer ?   

                                                Answer  
281  Key Points - Polycythemia vera is a disease in...  
282  Symptoms of polycythemia vera include headache...  
283  Special blood tests are used to diagnose polyc...  
284  

### Pre-process `Question` and `Answer` text

**Exercise 5: Perform below tasks:  [1 Mark]**

- Combine `Question` and `Answer` for train and validation data as shown below:
    - sequence = *'\<question\>' + question-text + '\<answer\>' + answer-text + '\<end\>'*

- Join the combined text using '\n' into a single string for training and validation separately

- Save the training and validation strings as separate text files
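The three tasks above can be sketched on toy data as shown here. The helper name `build_sequence` and the example pairs are assumptions made for illustration. The whitespace normalisation is also an addition that the exercise does not ask for: since the sequences are joined with `'\n'`, an answer containing its own line break would otherwise spill over into a second line of the file.

```python
import re

def build_sequence(question, answer):
    """Wrap one Q&A pair in the marker tokens described above."""
    # Collapse internal whitespace (including newlines) so that each
    # sequence occupies exactly one line of the output file.
    q = re.sub(r"\s+", " ", question).strip()
    a = re.sub(r"\s+", " ", answer).strip()
    return f"<question>{q}<answer>{a}<end>"

# Made-up pairs; the first answer deliberately contains a line break.
pairs = [
    ("What is (are) X ?", "X is a condition.\nIt affects Y."),
    ("How to diagnose X ?", "Tests  are used."),
]

# One sequence per line, ready to be written to a text file.
text = "\n".join(build_sequence(q, a) for q, a in pairs)
lines = text.split("\n")
print(len(lines))
```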

- **Combine Question and Answer for train and val data**

In [None]:
# Combine Questions and Answers for train and val data
## sequence = '<question>' + question + '<answer>' + answer + '<end>'

# YOUR CODE HERE

def combine_question_answer(dataframe, question_col, answer_col):
    # Build one marked-up sequence per row, closing each with the <end> marker
    return dataframe.apply(lambda row: f"<question>{row[question_col]}<answer>{row[answer_col]}<end>", axis=1)


- **Join the combined text using '\n' into a single string for training and validation separately**

In [None]:
# Train and Validation text for all Q&As

# YOUR CODE HERE

train_data = combine_question_answer(training_set, 'Question', 'Answer')

val_data = combine_question_answer(validation_set, 'Question', 'Answer')



In [None]:
# Save the training and validation data as text files

# YOUR CODE HERE
train_file = "train_seq.txt"
val_file = "val_seq.txt"

with open(train_file, 'w', encoding='utf-8') as train_out:
    train_out.write('\n'.join(train_data))

with open(val_file, 'w', encoding='utf-8') as val_out:
    val_out.write('\n'.join(val_data))


$\color{red}{\text{Alternative: all the steps in a single function}}$

In [None]:
# Alternate approach putting everything together

def prepare_and_save_sequences(train_df, val_df, question_col, answer_col, train_file, val_file):
    """
    Combines question and answer into a single sequence for train and validation datasets
    and saves the result as separate text files.

    Parameters:
        train_df (pd.DataFrame): The training DataFrame.
        val_df (pd.DataFrame): The validation DataFrame.
        question_col (str): The column containing questions.
        answer_col (str): The column containing answers.
        train_file (str): File path to save the training sequences.
        val_file (str): File path to save the validation sequences.
    """
    def create_sequences(dataframe):
        return dataframe.apply(
            lambda row: f"<question>{row[question_col]}<answer>{row[answer_col]}<end>", axis=1
        )

    # Create combined sequences for training and validation
    train_sequences = create_sequences(train_df)
    val_sequences = create_sequences(val_df)

    # Join sequences with '\n' and save to text files
    with open(train_file, 'w', encoding='utf-8') as train_out:
        train_out.write('\n'.join(train_sequences))

    with open(val_file, 'w', encoding='utf-8') as val_out:
        val_out.write('\n'.join(val_sequences))




In [None]:
# Save training and validation sequences to files
prepare_and_save_sequences(
    train_df=training_set,
    val_df=validation_set,
    question_col="Question",
    answer_col="Answer",
    train_file="train_sequences.txt",
    val_file="val_sequences.txt"
)


**Exercise 6: Load pre-trained GPT2Tokenizer**

- Use checkpoint = "gpt2"

In [None]:
# Set up the tokenizer
# YOUR CODE HERE

checkpoint = "gpt2"

tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)

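# GPT-2 ships without a pad token. Its unk and eos tokens are both '<|endoftext|>' (id 50256),
# so padding positions are filled with that id.
tokenizer.pad_token = tokenizer.unk_token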



**Exercise 7: Tokenize train and validation data [0.5 Mark]**

- Use the loaded pre-trained tokenizer
- Use training and validation data saved in text files

In [None]:
# YOUR CODE HERE

# def read_text_file(file_path):
#     with open(file_path, "r", encoding="utf-8") as file:
#         return file.readlines()

# train_file_path = '/content/train_seq.txt'
# validation_file_path = '/content/val_seq.txt'

# train_texts = read_text_file(train_file_path)
# validation_texts = read_text_file(validation_file_path)


# train_encodings = tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt")
# valid_encodings = tokenizer(validation_texts, padding=True, truncation=True, return_tensors="pt")


# print("Train Encodings:", train_encodings)
# print("Validation Encodings:", valid_encodings)



Train Encodings: {'input_ids': tensor([[   27, 25652,    29,  ..., 50256, 50256, 50256],
        [   27, 25652,    29,  ..., 50256, 50256, 50256],
        [   27, 25652,    29,  ..., 50256, 50256, 50256],
        ...,
        [   27, 25652,    29,  ..., 50256, 50256, 50256],
        [   27, 25652,    29,  ..., 50256, 50256, 50256],
        [   27, 25652,    29,  ..., 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
Validation Encodings: {'input_ids': tensor([[   27, 25652,    29,  ..., 50256, 50256, 50256],
        [   27, 25652,    29,  ..., 50256, 50256, 50256],
        [   27, 25652,    29,  ..., 50256, 50256, 50256],
        ...,
        [   27, 25652,    29,  ..., 50256, 50256, 50256],
        [   27, 25652,    29,  ..., 13820,   284,   262],
        [   27, 25652,    29, 

In [None]:
from datasets import load_dataset

train_file_path = '/content/train_seq.txt'
validation_file_path = '/content/val_seq.txt'

dataset = load_dataset("text", data_files={"train": train_file_path,
                                           "validation": validation_file_path})


block_size = 256

# Define the tokenization function to apply to each example in the dataset
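def tokenize_function(examples):
    # Pad or truncate every sequence to exactly block_size tokens.
    # return_tensors is not needed here: the data collator builds the tensors at batch time.
    return tokenizer(examples["text"],
                     padding='max_length',
                     truncation=True,
                     max_length=block_size)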

# Apply the tokenization function to the entire dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)


In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 400
    })
    validation: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 100
    })
})
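To make the effect of `padding='max_length'`, `truncation=True` and `max_length=block_size` concrete, here is a dependency-free sketch of the same idea. The function `pad_or_truncate` is a hypothetical stand-in, not part of the tokenizer API, and it assumes right-side padding, which matches the encodings printed earlier where id 50256 appears at the end of each row.

```python
def pad_or_truncate(token_ids, block_size, pad_id):
    """Return (input_ids, attention_mask), both exactly block_size long."""
    ids = list(token_ids[:block_size])   # truncation: drop tokens beyond block_size
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = block_size - len(ids)        # padding: fill the remaining slots
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

PAD_ID = 50256  # pad id seen in the encodings printed in this notebook

# A short sequence is padded; a long one is cut.
short_ids, short_mask = pad_or_truncate([27, 25652, 29], block_size=6, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate(list(range(10)), block_size=6, pad_id=PAD_ID)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `block_size = 256`, any answer longer than 256 tokens loses its tail, so a smaller block size trades memory for truncated answers.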

**Exercise 8: Create a DataCollator object**

In [None]:
# Create a Data collator object
# YOUR CODE HERE
from transformers import DataCollatorForLanguageModeling

# mlm=False selects causal (next-token) language modelling rather than masked language modelling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="pt")
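As a rough mental model of what the collator contributes with `mlm=False`, the sketch below builds labels by copying the input ids and replacing padding positions with -100, the value that PyTorch's cross-entropy loss ignores by default. This is a simplified pure-Python illustration under that assumption, not the library's code, and `make_causal_lm_labels` is a hypothetical name.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def make_causal_lm_labels(input_ids, pad_id):
    """Copy input_ids and blank out padding positions in the labels."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

PAD_ID = 50256  # in this notebook the pad id is also GPT-2's end-of-text id
input_ids = [27, 25652, 29, 2061, PAD_ID, PAD_ID]
labels = make_causal_lm_labels(input_ids, PAD_ID)
print(labels)
```

One consequence worth keeping in mind: because the pad id here is the same as the end-of-text id, a genuine end-of-text token in the labels would be ignored in the same way as padding.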


**Exercise 9: Load pre-trained GPT2LMHeadModel**

In [None]:
# Set up the model
# YOUR CODE HERE

model = GPT2LMHeadModel.from_pretrained(checkpoint)

**Exercise 10: Fine-tune GPT2 Model [1 Mark]**

- Specify training arguments and create a TrainingArguments object (Use 30 epochs)

- Train a GPT-2 model using the provided training arguments

- Save the resulting trained model and tokenizer to a specified output directory

In [None]:
# Set up the training arguments

# YOUR CODE HERE

model_output_path = "/content/gpt2_model"

training_args = TrainingArguments(
    output_dir=model_output_path,
    overwrite_output_dir=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=30,
    save_steps=1_000,
    save_total_limit=2,
    logging_dir='./logs',
    report_to="none"  # disable external logging integrations such as Weights & Biases
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [None]:
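import os
# Deprecated way of disabling Weights & Biases logging (see the warning above);
# the `report_to` argument of TrainingArguments is the supported switch.
os.environ["WANDB_DISABLED"] = "true"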

In [None]:
# Train the model
# YOUR CODE HERE
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"]
)


# Run fine-tuning
# YOUR CODE HERE
trainer.train()

# Save the model
saved_model_path = "/content/finetuned_gpt2_model"
trainer.save_model(saved_model_path)

# Save the tokenizer
tokenizer.save_pretrained(saved_model_path)

| Step | Training Loss |
|---|---|
| 500 | 1.8459 |
| 1000 | 1.2181 |
| 1500 | 0.8241 |
| 2000 | 0.5862 |
| 2500 | 0.452 |
| 3000 | 0.3897 |


('/content/finetuned_gpt2_model/tokenizer_config.json',
 '/content/finetuned_gpt2_model/special_tokens_map.json',
 '/content/finetuned_gpt2_model/vocab.json',
 '/content/finetuned_gpt2_model/merges.txt',
 '/content/finetuned_gpt2_model/added_tokens.json')
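The step numbers in the training log can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, neither of which is stated in the notebook. It also converts the last logged loss into a perplexity, on the assumption that the logged value is the mean per-token cross-entropy. This is a training-set figure, not a validation result.

```python
import math

num_examples = 400   # num_rows of the tokenized training split
batch_size = 4       # per_device_train_batch_size
num_epochs = 30      # num_train_epochs

# Assumes a single device and no gradient accumulation.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

final_train_loss = 0.3897                       # last logged value (step 3000)
train_perplexity = math.exp(final_train_loss)   # perplexity = exp(mean cross-entropy)

print(total_steps, round(train_perplexity, 2))
```

A training loss this low on only 400 samples may also indicate memorisation, which is one reason to look at the validation set as well.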

**Exercise 11: Test Model with user input prompts [1 Mark]**

- Create `generate_response()` function that takes a trained *model*, *tokenizer*, and a *prompt* string as input and generates a response using the GPT-2 model

- Test it with some user input prompts

In [None]:
def generate_response(model, tokenizer, prompt, max_length=200):
    """Generate a continuation of `prompt` and return the decoded text (prompt included)."""

    # YOUR CODE HERE
    # Encode the prompt and move it to the device the model lives on
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    device = next(model.parameters()).device
    input_ids = input_ids.to(device)

    # A single unpadded prompt: every position is a real token
    attention_mask = torch.ones_like(input_ids)

    # GPT-2 has no dedicated pad token, so reuse the end-of-text id
    pad_token_id = tokenizer.eos_token_id

    output = model.generate(
        input_ids,                      # Input tokens
        max_length=max_length,          # Maximum total length (prompt + generated tokens)
        num_return_sequences=1,         # Generate one sequence
        attention_mask=attention_mask,  # Attention mask
        pad_token_id=pad_token_id       # Pad token ID
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)
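The training sequences wrap each pair in `<question>`, `<answer>` and `<end>` markers, so one variation worth trying is to format the prompt the same way and to cut the generated text at the markers. The helpers below are hypothetical and are not used elsewhere in this notebook. They rely on plain string operations because the markers are ordinary text rather than registered special tokens, so `skip_special_tokens=True` does not remove them.

```python
Q_TAG, A_TAG, END_TAG = "<question>", "<answer>", "<end>"

def build_prompt(question):
    """Format a user question the way the training sequences start."""
    return f"{Q_TAG}{question.strip()}{A_TAG}"

def extract_answer(generated_text):
    """Return the text between the answer marker and the end marker.

    Returns an empty string when the answer marker is missing.
    """
    _, _, tail = generated_text.partition(A_TAG)
    answer, _, _ = tail.partition(END_TAG)
    return answer.strip()

prompt = build_prompt("What is (are) Glaucoma ?")

# Hand-written stand-in for a model output, used only to exercise the parser.
fake_output = prompt + " Glaucoma is a group of diseases. <end><question>..."
answer = extract_answer(fake_output)
print(answer)
```

In the notebook this would amount to calling `generate_response(model, tokenizer, build_prompt(question))` and passing the result through `extract_answer`. Whether the model emits `<end>` depends on the marker being present in the training sequences.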


In [None]:
# Load the fine-tuned model and tokenizer

# YOUR CODE HERE

my_model = GPT2LMHeadModel.from_pretrained(saved_model_path)
my_tokenizer = GPT2Tokenizer.from_pretrained(saved_model_path)

In [None]:
# Testing with a sample prompt 1

prompt = "What is (are) Adult Acute Lymphoblastic Leukemia ?"
response = generate_response(model, tokenizer, prompt)
response

"What is (are) Adult Acute Lymphoblastic Leukemia ? Acute lymphoblastic leukemia is a form of childhood leukemia that causes bone marrow to die. The bone marrow produces many types of cells, including white blood cells and mature blood cells. White blood cells carry oxygen throughout the body, and the cells carry oxygen throughout the body. In people with leukemia, the blood becomes thicker and more dense, producing lymphocytes. These cells help the body fight infection and disease. White blood cells are also present in some types of acute lymphocytic leukemia, which is caused by the leukemia cells to die. Children with leukemia are more likely to develop the disease than children without leukemia. However, children with leukemia do not develop leukemia completely. In fact, about half of the children diagnosed with leukemia are never diagnosed with leukemia. The disease may be inherited, or a combination of factors, such as leukemia and family members' histories, may lead to the diseas

In [None]:
# Testing with a sample prompt 2

prompt = "What are the treatments for Kidney Stones in Adults ?"
response = generate_response(model, tokenizer, prompt)
response

'What are the treatments for Kidney Stones in Adults ? There are a number of treatments for kidney stones. Some people are helped by special diet changes, such as vitamin D supplements or medicines. Others may need surgical correction to remove a kidney stone. These procedures are usually painless and can last a lifetime. Check with your kidney care provider if you need treatment. Treatment for Kidney Stones in Adults Kidney stones may be necessary in certain situations, such as during an emergency, when blood clots can be dangerous or when a kidney stone is too swollen or irritated. To help you care for your loved one during an emergency, your family member or friend can contact your local chapter of the American Academy of Kidney Diseases to discuss how you can best care for your loved one. Kidney stones are treated in a variety of ways. Some people are helped by special diet changes, such as vitamin D supplements or medicines. Other people may not need treatment, but treatment may b

**Exercise 12: Compare the performance of a *GPT2 model* with the *GPT2 model fine-tuned* on MedQuAD data [0.5 Mark]**

- Load another pre-trained GPT2LMHeadModel and do not fine-tune it

- To generate response using the untuned model, pass it as a parameter to `generate_response()` function

- Test both models (fine-tuned and untuned) with below user input prompts:

    - "What precautions to take for a healthy life?"
    - "What to do after being diagnosed with cancer?"
    - "What to do when feeling sick?"
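Reading the two sets of responses side by side is the main comparison, but one simple number can support it. The sketch below computes the share of distinct word bigrams in a text, where a low value signals looping output. The metric and the name `distinct_n` are assumptions chosen for illustration. It measures repetition only and says nothing about medical correctness.

```python
def distinct_n(text, n=2):
    """Fraction of word n-grams that are unique; lower means more repetition."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# Made-up strings: one loops on a single sentence, the other does not repeat.
repetitive = "you should relax. " * 5
varied = "Talk with your doctor about treatment options and possible side effects."

print(round(distinct_n(repetitive), 3), distinct_n(varied))
```

In the notebook, the same function could be applied to the strings returned by `generate_response()` for the fine-tuned and untuned models.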

In [None]:
# Load a pre-trained GPT2 model, do not finetune it with MedQuAD data

# YOUR CODE HERE

model_2 = GPT2LMHeadModel.from_pretrained(checkpoint)

In [None]:
# Testing with finetuned model: prompt 1

prompt = "What precautions to take for a healthy life?"
response = generate_response(model, tokenizer, prompt)
response

'What precautions to take for a healthy life? There are many different things you can do to help you maintain a healthy immune system. These things include - Taking care of yourself - Getting to know what you need to do - Managing your time - Preparing your own meals - Managing your own - Managing your own water - Diet - Sleep - Managing your own foodstuffs - Managing your own stress. Managing your own stress. Get over your fear of food. Anxiety. Fear. Anxiety. Fear. The more important it is to have a healthy immune system, the more important it is to have healthy immune systems. Learn how to maintain a healthy immune system. Here are some tips you can take. - Avoid foods with additives or preservatives that increase the risk of infection. - Avoid foods with artificial preservatives, such as MSG or Calcium. - Eat right after you eat. - Choose foods with healthy fats and cholesterol. - Avoid processed foods with preservatives. - Avoid foods with cholesterol. -'

In [None]:
# Testing with untuned model: prompt 1

prompt = "What precautions to take for a healthy life?"
response = generate_response(model_2, tokenizer, prompt)
response

"What precautions to take for a healthy life?\n\nThe following are some of the most common questions you'll hear from your doctor or nurse about your health.\n\nWhat are the risks of taking a drug that can cause cancer?\n\nThe risks of taking a drug that can cause cancer are very high.\n\nWhat are the risks of taking a drug that can cause cancer?\n\nThe risks of taking a drug that can cause cancer are very high.\n\nWhat are the risks of taking a drug that can cause cancer?\n\nThe risks of taking a drug that can cause cancer are very high.\n\nWhat are the risks of taking a drug that can cause cancer?\n\nThe risks of taking a drug that can cause cancer are very high.\n\nWhat are the risks of taking a drug that can cause cancer?\n\nThe risks of taking a drug that can cause cancer are very high.\n\nWhat are the risks of taking a drug that can cause"

In [None]:
# Testing with finetuned model: prompt 2

prompt = "What to do after being diagnosed with cancer?"
response = generate_response(model, tokenizer, prompt)
response

'What to do after being diagnosed with cancer? Talk with your doctor. He or she may recommend that you seek medical help if you have cancer. (Watch the video to learn more about the stages of cancer. To enlarge the video, click the brackets in the lower right-hand corner. To reduce the video, press the Escape (Esc) button on your keyboard.) Treatment for Breast Cancer Treatment for breast cancer depends on : The level of the cancer cells. The type of treatment used to treat the cancer. The part of the body that is affected by the cancer. The most common type of cancer is breast cancer. Breast cancer treatment is used to treat cancer cells that are not cancerignant. It may be necessary to other cancer treatments as well. It is also used to treat other cancer cells that are not cancer. The use of hormone therapy for breast cancer treatment is also called therapy cancer treatment. Talk with your doctor about what you can do. Treatment options include hormone therapy, chemotherapy, radiati

In [None]:
# Testing with untuned model: prompt 2

prompt = "What to do after being diagnosed with cancer?"
response = generate_response(model_2, tokenizer, prompt)
response

"What to do after being diagnosed with cancer?\n\nThe first step is to get your doctor's approval for a treatment.\n\nIf you have a cancer diagnosis, you may need to get a second opinion.\n\nIf you have a cancer diagnosis, you may need to get a second opinion. If you have a cancer diagnosis, you may need to get a third opinion.\n\nIf you have a cancer diagnosis, you may need to get a third opinion. If you have a cancer diagnosis, you may need to get a fourth opinion.\n\nIf you have a cancer diagnosis, you may need to get a fourth opinion. If you have a cancer diagnosis, you may need to get a fifth opinion.\n\nIf you have a cancer diagnosis, you may need to get a fifth opinion. If you have a cancer diagnosis, you may need to get a sixth opinion.\n\nIf you have a cancer diagnosis, you may need to get a sixth opinion. If you have"

In [None]:
# Testing with finetuned model: prompt 3

prompt = "What to do when feeling sick?"
response = generate_response(model, tokenizer, prompt)
response

'What to do when feeling sick? - Sleep studies, weight loss and physical activity - Talk to your doctor if you are feeling particularly tired - If you have any of these symptoms, ask your doctor if you can take me to see a doctor. If so, ask if you could substitute for me in an effortless death. If you are not sure if you should substitute, ask your doctor if you could ask your doctor not to do so, and ask if you could be more specific. In addition, ask your doctor about any other medical conditions that may be causing your sickness. These questions and preferences may be used in conjunction with your general medical history to help determine what to do when. (Watch the video to learn more about what to do when feeling sick. To enlarge the video, click the brackets in the lower right-hand corner. To reduce the video, press the Escape (Esc) button on your keyboard.) To reduce the video, press Escape (Esc) while pressing the Escape ('

In [None]:
# Testing with untuned model: prompt 3

prompt = "What to do when feeling sick?"
response = generate_response(model_2, tokenizer, prompt)
response

"What to do when feeling sick?\n\nThe first thing you should do is to get your body to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick"

## Push your model to Hugging Face Model Hub

**Exercise 13: Follow the steps below to push your fine-tuned model to the Hugging Face Model Hub**

1. [Sign up](https://huggingface.co/join) for a Hugging Face account
2. Create an access token for your account and save it
3. Store your access token in the Hugging Face cache folder within colab
4. Push your fine-tuned model and tokenizer to Model Hub
5. Load the model back from Hub and test it with user input prompts

* **Create an access token for your account**

    Once you have an account, to create an access token:
    
    - Go to your `Settings`, then click on the `Access Tokens` tab. Click on the `New token` button to create a new User Access Token.
    - Select `Write` as the token type and give your token a name
    - Click on `Create token`
    - Once the token is created, save it somewhere safe
    - When it is required later, reuse the saved token or create a new one

    To know more about Access Tokens, refer to the [documentation](https://huggingface.co/docs/hub/security-tokens).

* **Store your access token in the Hugging Face cache folder within Colab**

    Once you have your User Access Token, run the following command to authenticate your identity to the Hub.
    - `!huggingface-cli login`
    - Paste your access token when prompted
    - Type **n** when prompted with `Add token as git credential? (Y/n)`

    For more details on login, refer [here](https://huggingface.co/docs/huggingface_hub/quick-start#login).
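
As an alternative to the interactive prompt, the login can be done from Python. The following is a minimal sketch, assuming the token has been placed in an environment variable named `HF_TOKEN` (the same name the Colab secrets warning uses); `resolve_hf_token` and `login_from_env` are hypothetical helper names, while `huggingface_hub.login` is the library call that stores the token.

```python
import os


def resolve_hf_token(env_var="HF_TOKEN"):
    """Read the access token from an environment variable (assumed name)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"No access token found in environment variable {env_var!r}")
    return token


def login_from_env(env_var="HF_TOKEN"):
    """Authenticate to the Hub without the interactive prompt."""
    # Imported lazily so the helper above can be used without the library installed.
    from huggingface_hub import login

    # add_to_git_credential=False mirrors answering "n" at the CLI prompt.
    login(token=resolve_hf_token(env_var), add_to_git_credential=False)
```

Calling `login_from_env()` in a notebook cell should have the same effect as pasting the token into `huggingface-cli login`. Avoid hard-coding the token in the notebook itself, since a `Write` token grants push access to your repositories.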

In [None]:
# YOUR CODE HERE
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
The token `never` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `never`


* **Push your fine-tuned model and tokenizer to Model Hub [0.5 Mark]**

    - Use the `push_to_hub()` method of both your model and your tokenizer to push them to the Hub
    - Specify the name of the repository where the model and tokenizer will be pushed using the `repo_id` parameter
    - Push the model and tokenizer to the same repository

    - **Hint:**

        - Use the `push_to_hub()` method of your model. For parameter details, refer [here](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.push_to_hub).
        - Use the `push_to_hub()` method of your tokenizer. For parameter details, refer [here](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.push_to_hub).
        - Access your pushed model at `https://huggingface.co/[YOUR-USER-NAME]/[YOUR-MODEL-REPO-NAME]/tree/main`
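
The push steps above can be wrapped in one small helper so that the model and tokenizer always land in the same repository. This is a sketch under assumptions: `push_model_and_tokenizer` is a hypothetical name, and the commit messages are illustrative. It relies only on the `repo_id`, `commit_message` and `private` parameters of `push_to_hub()`.

```python
def push_model_and_tokenizer(model, tokenizer, repo_id, private=False):
    """Push both artifacts to the same Hub repo and return the repo's file-browser URL."""
    # Both calls use the same repo_id so the tokenizer files sit next to the weights.
    model.push_to_hub(repo_id, private=private,
                      commit_message="Upload fine-tuned GPT-2 model")
    tokenizer.push_to_hub(repo_id, private=private,
                          commit_message="Upload tokenizer")
    # URL pattern taken from the hint above.
    return f"https://huggingface.co/{repo_id}/tree/main"
```

For example, `push_model_and_tokenizer(model, tokenizer, "YOUR-USER-NAME/YOUR-MODEL-REPO-NAME")` returns the link where the uploaded files can be inspected.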

In [None]:
# Push model
# YOUR CODE HERE

repo_id = "Kaushiktd/gpt2_finetuned_medqa"

model.push_to_hub(repo_id)

README.md:   0%|          | 0.00/24.0 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/498M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Kaushiktd/gpt2_finetuned_medqa/commit/83b9f9fd4eec9b5ad9a5363f9394ceac76444885', commit_message='Upload model', commit_description='', oid='83b9f9fd4eec9b5ad9a5363f9394ceac76444885', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Kaushiktd/gpt2_finetuned_medqa', endpoint='https://huggingface.co', repo_type='model', repo_id='Kaushiktd/gpt2_finetuned_medqa'), pr_revision=None, pr_num=None)

In [None]:
# Push tokenizer
# YOUR CODE HERE
tokenizer.push_to_hub(repo_id)

CommitInfo(commit_url='https://huggingface.co/Kaushiktd/gpt2_finetuned_medqa/commit/4bcaab330f8c595d7a80245c0b5b166219668d8f', commit_message='Upload tokenizer', commit_description='', oid='4bcaab330f8c595d7a80245c0b5b166219668d8f', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Kaushiktd/gpt2_finetuned_medqa', endpoint='https://huggingface.co', repo_type='model', repo_id='Kaushiktd/gpt2_finetuned_medqa'), pr_revision=None, pr_num=None)

* **Load the model and tokenizer back from the Hub and test them with user input prompts [0.5 Mark]**

    - In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying to the `from_pretrained()` method. **AutoClasses** can be used to automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary.

    - Instantiating one of `AutoConfig`, `AutoModel`, and `AutoTokenizer` will directly create a class of the relevant architecture.

    - Since the fine-tuned GPT-2 model has a language modeling head on top, load it with an auto class that includes such a head. `AutoModelWithLMHead` is deprecated in the `transformers` library; use `AutoModelForCausalLM` for causal language models such as GPT-2.

    - Specify the full path of your model repo, i.e. ***"YOUR-USER-NAME/YOUR-REPO-NAME"***, while calling the `from_pretrained()` method.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
# Load your model from hub
repo_id = "Kaushiktd/gpt2_finetuned_medqa"

loaded_model = AutoModelForCausalLM.from_pretrained(repo_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/907 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/498M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/119 [00:00<?, ?B/s]

In [None]:
# Load your tokenizer from hub

loaded_tokenizer = AutoTokenizer.from_pretrained(repo_id)

tokenizer_config.json:   0%|          | 0.00/556 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/999k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/470 [00:00<?, ?B/s]

In [None]:
# Response from loaded model

prompt = "What is the outlook for Skin Cancer ?"
response = generate_response(loaded_model, loaded_tokenizer, prompt)
response

"What is the outlook for Skin Cancer ?<answer>Certain factors affect prognosis (chance of recovery) and treatment options. The prognosis (chance of recovery) and treatment options depend on the following: - The stage of the cancer (level of PSA, Gleason score, grade of the tumor, how much of the skin is affected by the cancer, and whether the cancer has spread deeper into the skin). - The patients age. - Whether the cancer has just been diagnosed or has recurred (come back). Treatment options also may depend on the following: - Whether the patient has other health problems. - Past treatment for cancer. - The wishes of the patient. - The wishes of the patient's family. - The wishes of the patient's friends and coworkers. - The wishes of the patient's biographer. - The wishes of the patient's doctors. Most men and women diagnosed with skin cancer do not die of it. However, some do recover and others continue treatment. The"
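
The response above still contains the prompt and the `<answer>` separator, and it stops mid-sentence where the length limit was reached. A minimal post-processing sketch follows, assuming the `<answer>` tag seen in the output marks the boundary between question and answer. `extract_answer` is a hypothetical helper, not part of the notebook or of `transformers`.

```python
def extract_answer(response, prompt, tag="<answer>"):
    """Return only the answer portion of a generated response."""
    if tag in response:
        # Keep everything after the first separator tag.
        answer = response.split(tag, 1)[1]
    elif response.startswith(prompt):
        # No tag (e.g. untuned model): just drop the echoed prompt.
        answer = response[len(prompt):]
    else:
        answer = response
    answer = answer.strip()

    # Drop a trailing partial sentence left by the length cut-off, if any.
    last_stop = answer.rfind(".")
    if last_stop != -1:
        answer = answer[: last_stop + 1]
    return answer
```

Cutting at the last full stop is a simple heuristic; it would also trim text after an abbreviation or a decimal number near the end, so it is best treated as display clean-up only.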

## Gradio Implementation

Gradio is an open-source Python library that lets us quickly create easy-to-use, customizable UI components for an ML model, an API, or any arbitrary function in just a few lines of code. We can embed the GUI directly in the Python notebook or share it with anyone through a link.

**Exercise 14: Create a Gradio app for your fine-tuned model pushed to the Hugging Face Model Hub [1 Mark]**

- Install and import the `gradio` library
- Create a function that uses your fine-tuned model for response generation
    - Use the model and tokenizer directly within the function; do not pass them as parameters
    - The function should take the input prompt text and the maximum response length as its input parameters
    - The function should return the generated response text
- Create the input and output Gradio elements
- Create a Gradio interface object
- Launch the interface to generate the UI

In [None]:
!pip -q install gradio


In [None]:
import gradio as gr

In [None]:
# Function for response generation

def generate_query_response(prompt, max_length=200):

    model = loaded_model
    tokenizer = loaded_tokenizer

    # YOUR CODE HERE ...

    # Tokenize the prompt and move it to the same device as the model
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    device = next(model.parameters()).device
    input_ids = input_ids.to(device)

    # Single unpadded prompt, so every token is attended to
    attention_mask = torch.ones_like(input_ids)

    # GPT-2 defines no pad token by default, so reuse the end-of-sequence token
    pad_token_id = tokenizer.eos_token_id

    output = model.generate(
        input_ids,                      # Input tokens
        max_length=int(max_length),     # Max total length (prompt + response); cast in case the slider passes a float
        num_return_sequences=1,         # Generate one sequence
        attention_mask=attention_mask,  # Attention mask
        pad_token_id=pad_token_id       # Pad token ID
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)



In [None]:
# Gradio elements

# Input from user
in_prompt = gr.Textbox(label="Enter your question here", placeholder="Type question...", lines=2)
in_max_length = gr.Slider(minimum=10, maximum=500, step=10, value=100, label="Max Length")

# Output response
out_response = gr.Textbox(label="Response", interactive=False)

In [None]:
# Gradio interface to generate UI link
iface = gr.Interface(
    fn=generate_query_response,
    inputs=[in_prompt, in_max_length],
    outputs=out_response,
    title="Medical_Q&A_GPT2",
    description="Ask a medical question and select max length to see the response.")

# YOUR CODE HERE to launch the interface
iface.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://2249ab6290d2c060ee.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




## Upload your Gradio application on Hugging Face Spaces

**Exercise 15: Upload your Gradio application on Hugging Face Spaces [2 Marks]**

1. Start a new Hugging Face Space by going to your profile and [clicking "New Space"](https://huggingface.co/new-space)

2. Provide details for your space:
    - Space name
    - License (e.g. [MIT](https://opensource.org/licenses/MIT))
    - Space SDK (software development kit) (e.g. `Gradio`)
    - Space hardware (CPU basic)
    - Choose whether your Space is public or private
    - Click "Create Space"

3. Go to the ***Add files -> Create a new file*** option to add the files below:
    - `requirements.txt`: should contain the dependencies needed to run your app, such as `transformers`, `torch`, and `gradio`
    - `app.py`: should contain steps to
        - import the required packages
        - load your fine-tuned model and tokenizer from the Model Hub
        - define a function that uses your fine-tuned model for response generation
        - create the input and output Gradio elements
        - create a Gradio interface object
        - launch the interface to generate the UI

4. Access the `App` tab of your repository to see the build progress (debug if any errors occur)

5. Once the app has built successfully, test the application running on your Space with a user input prompt
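
The two files described in step 3 can be drafted locally before pasting them into the Space. The sketch below writes a `requirements.txt` and an `app.py` into a folder. It is one possible layout, not the reference solution: `write_space_files`, the `__REPO_ID__` marker and the placeholder repo name are assumptions made for illustration. The embedded app uses `AutoModelForCausalLM`, the current auto class for causal language models such as GPT-2.

```python
from pathlib import Path

REPO_ID = "YOUR-USER-NAME/YOUR-REPO-NAME"  # placeholder: your Model Hub repo

REQUIREMENTS = "transformers\ntorch\ngradio\n"

# Source of app.py; __REPO_ID__ is replaced when the file is written.
APP_TEMPLATE = '''\
import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "__REPO_ID__"

# Load once at start-up so every request reuses the same objects.
model = AutoModelForCausalLM.from_pretrained(REPO_ID)
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model.eval()


def generate_query_response(prompt, max_length=200):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_length=int(max_length),
            num_return_sequences=1,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)


iface = gr.Interface(
    fn=generate_query_response,
    inputs=[
        gr.Textbox(label="Enter your question here", lines=2),
        gr.Slider(minimum=10, maximum=500, step=10, value=100, label="Max Length"),
    ],
    outputs=gr.Textbox(label="Response"),
    title="Medical_Q&A_GPT2",
)

iface.launch()
'''


def write_space_files(target_dir, repo_id=REPO_ID):
    """Write requirements.txt and app.py for a Gradio Space into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    (target / "requirements.txt").write_text(REQUIREMENTS)
    (target / "app.py").write_text(APP_TEMPLATE.replace("__REPO_ID__", repo_id))
    return target
```

Running `write_space_files("space_files", "YOUR-USER-NAME/YOUR-REPO-NAME")` produces both files, whose contents can then be copied into the Space through ***Add files -> Create a new file***. The requirements are left unpinned here; pinning versions is a reasonable next step if the build fails.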

