# Generative AI and Prompt Engineering
## A programme by IISc and TalentSprint
### Mini-Project: Medical Q&A using GPT2

## Learning Objectives

At the end of the experiment, you will be able to:

* perform data preprocessing, EDA and feature extraction on the Medical Q&A dataset
* load a pre-trained tokenizer
* fine-tune a GPT-2 language model for medical question-answering

## Dataset Description

The dataset used in this project is the *Medical Question Answering Dataset* ([MedQuAD](https://github.com/abachaa/MedQuAD/tree/master)). It includes medical question-answer pairs along with additional information, such as the question type, the question *focus*, and its UMLS (Unified Medical Language System) details: the Concept Unique Identifier (*CUI*), Semantic *Type*, and Semantic *Group*.

To learn more about how this data was collected and constructed, refer to this [paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4).

The data has been extracted into CSV format with the following features:

- **Focus**: the question focus
- **CUI**: concept unique identifier
- **SemanticType**
- **SemanticGroup**
- **Question**
- **Answer**

## Grading = 10 Points

## Information

Healthcare professionals often have to refer to medical literature and documents when seeking answers to medical queries. Medical databases and search engines are powerful sources of up-to-date medical knowledge. However, the existing documentation is large, which makes it difficult for professionals to retrieve answers quickly in a clinical setting. The problem with search engines and information retrieval engines is that they return a list of documents rather than answers. Instead, healthcare professionals can use question answering systems to retrieve short sentences or paragraphs in response to medical queries. The main advantage of such systems is that they generate answers and provide hints within a few seconds.

### Problem Statement

Fine-tune the GPT-2 model on the Medical Question Answering Dataset (MedQuAD) so that it can generate responses to medical queries.

### **GPT-2**

When it was released, OpenAI's GPT-2 exhibited an impressive ability to write coherent and passionate essays that exceeded what language models of the time could produce. GPT-2 wasn't a particularly novel architecture: it is very similar to a **decoder-only transformer**. It was, however, a very large transformer-based language model trained on a massive dataset.

Here, you will fine-tune the GPT-2 model on the medical data. After fine-tuning, the model is expected to respond to prompts containing medical queries.

To learn more about GPT-2, refer [here](http://jalammar.github.io/illustrated-gpt2/).

### Installing Dependencies

In [None]:
%%capture
!pip -q uninstall pyarrow -y
!pip -q install pyarrow==15.0.2
!pip -q install datasets
!pip -q install accelerate
!pip -q install transformers

### <font color="#990000">Restart Session/Runtime</font>

### Import required packages

In [None]:
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import torch
from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments


import warnings
warnings.filterwarnings('ignore')

In [None]:
#@title Download the dataset
!wget -q https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/MedQuAD.csv
!ls | grep ".csv"

MedQuAD.csv
MedQuAD.csv.1
MedQuAD.csv.2


**Exercise 1: Read the MedQuAD.csv dataset**

**Hint:** `pd.read_csv()`

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# YOUR CODE HERE
df = pd.read_csv("/content/MedQuAD.csv")

In [None]:
df.head()

| | Focus | CUI | SemanticType | SemanticGroup | Question | Answer |
|---|---|---|---|---|---|---|
| 0 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | What is (are) Adult Acute Lymphoblastic Leukem... | Key Points - Adult acute lymphoblastic leukemi... |
| 1 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | What are the symptoms of Adult Acute Lymphobla... | Signs and symptoms of adult ALL include fever,... |
| 2 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | How to diagnose Adult Acute Lymphoblastic Leuk... | Tests that examine the blood and bone marrow a... |
| 3 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | What is the outlook for Adult Acute Lymphoblas... | Certain factors affect prognosis (chance of re... |
| 4 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | Who is at risk for Adult Acute Lymphoblastic L... | Previous chemotherapy and exposure to radiatio... |


### Pre-processing and EDA

**Exercise 2: Perform the following operations on the dataset [1 Mark]**

- Handle missing values
- Remove duplicate rows based on the `Question` and `Answer` columns

In [None]:
def understanding_data(df):
    print("Basic details of the Data-Frame\n")
    print("Dimensions of data frame: ")
    print(df.shape)
    print('\n---------------------------------------------------------------------------------------------------------\n')
    print("Data-types of the columns and count of non-Null values: ")
    print(df.info())
    print('\n---------------------------------------------------------------------------------------------------------\n')
    print("Count of Missing values: ")
    print(df.isnull().sum())
    print('\n---------------------------------------------------------------------------------------------------------\n')
    print("Percentage of Missing values: ")
    print(df.isnull().sum() / df.shape[0] * 100.00)
    print('\n---------------------------------------------------------------------------------------------------------\n')
    duplicate = df[df.duplicated()]
    print("Duplicate Rows :")
    print(duplicate)
    print('\n---------------------------------------------------------------------------------------------------------\n')


In [None]:
understanding_data(df)

Basic details of the Data-Frame

Dimensions of data frame: 
(16412, 6)

---------------------------------------------------------------------------------------------------------

Data-types of the columns and count of non-Null values: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16412 entries, 0 to 16411
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Focus          16398 non-null  object
 1   CUI            15847 non-null  object
 2   SemanticType   15815 non-null  object
 3   SemanticGroup  15847 non-null  object
 4   Question       16412 non-null  object
 5   Answer         16407 non-null  object
dtypes: object(6)
memory usage: 769.4+ KB
None

---------------------------------------------------------------------------------------------------------

Count of Missing values: 
Focus             14
CUI              565
SemanticType     597
SemanticGroup    565
Question           0
Answer             5
dtype

In [None]:
df.head()

| | Focus | CUI | SemanticType | SemanticGroup | Question | Answer |
|---|---|---|---|---|---|---|
| 0 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | What is (are) Adult Acute Lymphoblastic Leukem... | Key Points - Adult acute lymphoblastic leukemi... |
| 1 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | What are the symptoms of Adult Acute Lymphobla... | Signs and symptoms of adult ALL include fever,... |
| 2 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | How to diagnose Adult Acute Lymphoblastic Leuk... | Tests that examine the blood and bone marrow a... |
| 3 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | What is the outlook for Adult Acute Lymphoblas... | Certain factors affect prognosis (chance of re... |
| 4 | Adult Acute Lymphoblastic Leukemia | C0751606 | T191 | Disorders | Who is at risk for Adult Acute Lymphoblastic L... | Previous chemotherapy and exposure to radiatio... |


- **Remove duplicates from data considering `Question` and `Answer` columns**

In [None]:
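# YOUR CODE HERE
# drop_duplicates() compares all columns; pass subset=['Question', 'Answer']
# to deduplicate on the question-answer pair only (row counts may then differ).
# .copy() guards against SettingWithCopyWarning in the imputation step.
dfM = df.drop_duplicates().copy()
print(f"Shape after deduplication: {dfM.shape}")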

Shape after deduplication: (16364, 6)
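
The exercise asks for duplicates to be judged on the `Question` and `Answer` columns only. A minimal sketch on a toy frame (the rows are invented for illustration and are not taken from MedQuAD) shows how the `subset` argument changes what `drop_duplicates` treats as a duplicate:

```python
import pandas as pd

# Toy data: rows 0 and 1 share the same Question/Answer but differ in Focus.
toy = pd.DataFrame({
    "Focus": ["Flu", "Influenza", "Stroke"],
    "Question": ["What is flu?", "What is flu?", "What is stroke?"],
    "Answer": ["A viral infection.", "A viral infection.", "A loss of blood flow to the brain."],
})

# Comparing all columns keeps both "flu" rows because Focus differs.
all_columns = toy.drop_duplicates()

# Comparing only Question and Answer keeps the first of the two "flu" rows.
qa_only = toy.drop_duplicates(subset=["Question", "Answer"])

print(len(all_columns), len(qa_only))
```

On the real data, the row count after a subset-based deduplication may differ from the shape reported above.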


- **Handle missing values**

In [None]:
# YOUR CODE HERE

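def impute_focus(df):
    # Caution: mode imputation assigns the most frequent Focus to every row where it is
    # missing, which inflates that category's count and can mislabel unrelated questions.
    mode_imputer = SimpleImputer(strategy='most_frequent')
    focus_imputed = mode_imputer.fit_transform(df[['Focus']])
    df['Focus'] = focus_imputed.ravel()
    return df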

def impute_cui(df):
    focus_cui_map = df.groupby('Focus')['CUI'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
    df['CUI'] = df.apply(lambda row: focus_cui_map[row['Focus']] if pd.isnull(row['CUI']) else row['CUI'], axis=1)
    df['CUI'] = df['CUI'].fillna('Unknown')
    return df

def impute_semantic_type(df):
    cui_type_map = df.groupby('CUI')['SemanticType'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
    df['SemanticType'] = df.apply(lambda row: cui_type_map[row['CUI']] if pd.isnull(row['SemanticType']) else row['SemanticType'], axis=1)
    df['SemanticType'] = df['SemanticType'].fillna('Unknown')
    return df

def impute_semantic_group(df):
    type_group_map = df.groupby('SemanticType')['SemanticGroup'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
    df['SemanticGroup'] = df.apply(lambda row: type_group_map[row['SemanticType']] if pd.isnull(row['SemanticGroup']) else row['SemanticGroup'], axis=1)
    df['SemanticGroup'] = df['SemanticGroup'].fillna('Unknown')
    return df

def impute_answer(df):
    def find_similar_question(question, questions, n=1):
        vectorizer = TfidfVectorizer()
        tfidf_matrix = vectorizer.fit_transform(questions)
        question_vector = vectorizer.transform([question])
        similarities = cosine_similarity(question_vector, tfidf_matrix)
        return similarities.argsort()[0][-n:][::-1]

    questions_with_answers = df[df['Answer'].notna()]['Question'].tolist()
    answers_with_questions = df[df['Answer'].notna()]['Answer'].tolist()

    for idx, row in df[df['Answer'].isna()].iterrows():
        similar_indices = find_similar_question(row['Question'], questions_with_answers)
        df.at[idx, 'Answer'] = answers_with_questions[similar_indices[0]]

    return df

dfM = impute_focus(dfM)
dfM = impute_cui(dfM)
dfM = impute_semantic_type(dfM)
dfM = impute_semantic_group(dfM)
dfM = impute_answer(dfM)

print(dfM.isnull().sum())

Focus            0
CUI              0
SemanticType     0
SemanticGroup    0
Question         0
Answer           0
dtype: int64


In [None]:
dfM.shape

(16364, 6)

**Exercise 3: Display the category names and the number of records for the top 100 categories of the `Focus` column [1 Mark]**

In [None]:
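# Total categories in Focus column
# YOUR CODE HERE
# Note: this counts the raw `df`, before deduplication and imputation,
# so the counts can differ from those computed on `dfM` in the next cell.
categories = df['Focus'].value_counts()
categories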

| Focus | count |
|---|---|
| Breast Cancer | 53 |
| Prostate Cancer | 43 |
| Stroke | 35 |
| Skin Cancer | 34 |
| Alzheimer's Disease | 30 |
| ... | ... |
| X-linked Charcot-Marie-Tooth disease type 3 | 1 |
| X-linked Charcot-Marie-Tooth disease type 4 | 1 |
| Woodhouse Sakati syndrome | 1 |
| Woolly hair hypotrichosis everted lower lip and outstanding ears | 1 |


In [None]:
# Displaying the distinct categories of Focus column and the number of records belonging to each category
# (Top 100 only)

# YOUR CODE HERE
top_categories = dfM['Focus'].value_counts().nlargest(100)
print(top_categories)
focus_top_categories = pd.DataFrame({'category':top_categories.index, 'count':top_categories.values})

Focus
Breast Cancer          67
Prostate Cancer        43
Stroke                 35
Skin Cancer            34
Alzheimer's Disease    30
                       ..
Laron syndrome         11
Cushing's Syndrome     11
Sarcoidosis            11
Hearing Loss           10
Tourette syndrome      10
Name: count, Length: 100, dtype: int64


In [None]:
# Top 100 Focus categories names

# YOUR CODE HERE
print(focus_top_categories['category'])

0           Breast Cancer
1         Prostate Cancer
2                  Stroke
3             Skin Cancer
4     Alzheimer's Disease
             ...         
95         Laron syndrome
96     Cushing's Syndrome
97            Sarcoidosis
98           Hearing Loss
99      Tourette syndrome
Name: category, Length: 100, dtype: object


### Create Training and Validation set

**Exercise 4: Create training and validation set [1 Mark]**

- Take 4 samples per `Focus` category for each of the top 100 categories (this gives 400 samples for training)

- Take 1 sample per `Focus` category, different from those in the training set, for each of the top 100 categories (this gives 100 samples for validation)

In [None]:
# YOUR CODE HERE
train_data = []
val_data = []

for category in top_categories.index:
    category_data = dfM[dfM['Focus'] == category]
    train_samples = category_data.sample(n=4, random_state=42)
    val_samples = category_data[~category_data.index.isin(train_samples.index)].sample(n=1, random_state=42)

    train_data.append(train_samples)
    val_data.append(val_samples)

train_df = pd.concat(train_data)
val_df = pd.concat(val_data)
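
Since the validation rows are drawn only from rows not already picked for training, the two sets should never share an index. The sketch below wraps the same idea in a hypothetical helper, `split_per_category` (not part of the notebook), and checks disjointness on invented toy data:

```python
import pandas as pd

def split_per_category(df, column, categories, n_train, n_val, seed=42):
    """Sample n_train rows per category for training and n_val different rows for validation."""
    train_parts, val_parts = [], []
    for category in categories:
        rows = df[df[column] == category]
        train = rows.sample(n=n_train, random_state=seed)
        # Exclude the training rows before drawing the validation rows.
        remaining = rows.drop(index=train.index)
        val = remaining.sample(n=n_val, random_state=seed)
        train_parts.append(train)
        val_parts.append(val)
    return pd.concat(train_parts), pd.concat(val_parts)

# Toy frame with two categories of six rows each (illustrative only).
toy = pd.DataFrame({
    "Focus": ["A"] * 6 + ["B"] * 6,
    "Question": [f"q{i}" for i in range(12)],
})

toy_train, toy_val = split_per_category(toy, "Focus", ["A", "B"], n_train=4, n_val=1)
overlap = toy_train.index.intersection(toy_val.index)
print(len(toy_train), len(toy_val), len(overlap))
```

The same index-intersection check could be run on `train_df` and `val_df` as a quick sanity test.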

In [None]:
train_df.head()

| | Focus | CUI | SemanticType | SemanticGroup | Question | Answer |
|---|---|---|---|---|---|---|
| 14920 | Breast Cancer | C0006142 | T191 | Disorders | What are the symptoms of Breast Cancer ? | When breast cancer first develops, there may b... |
| 565 | Breast Cancer | C0006142 | T191 | Disorders | What are the treatments for Breast Cancer ? | Key Points - Treatment options for pregnant wo... |
| 545 | Breast Cancer | C0006142 | T191 | Disorders | What are the symptoms of Breast Cancer ? | Signs of breast cancer include a lump or chang... |
| 550 | Breast Cancer | C0006142 | T191 | Disorders | what research (or clinical trials) is being do... | New types of treatment are being tested in cli... |
| 15450 | Prostate Cancer | C0376358 | T191 | Disorders | What is (are) Prostate Cancer ? | Surgery is a common treatment for early stage ... |


In [None]:
val_df.head()

| | Focus | CUI | SemanticType | SemanticGroup | Question | Answer |
|---|---|---|---|---|---|---|
| 16394 | Breast Cancer | C0006142 | T191 | Disorders | what is the treatment for vancomycin-resistant... | On this Page General Information What is vanco... |
| 15448 | Prostate Cancer | C0376358 | T191 | Disorders | What are the treatments for Prostate Cancer ? | Surgery, radiation therapy, and hormonal thera... |
| 16079 | Stroke | C0038454 | T047 | Disorders | What are the symptoms of Stroke ? | The signs and symptoms of a stroke often devel... |
| 15525 | Skin Cancer | C0007114 | T191 | Disorders | What are the treatments for Skin Cancer ? | Different types of treatment are available for... |
| 14859 | Alzheimer's Disease | C0002395 | T046 | Disorders | How to prevent Alzheimer's Disease ? | Currently, no medicines or other treatments ar... |


### Pre-process `Question` and `Answer` text

**Exercise 5: Perform the following tasks [1 Mark]**

- Combine `Question` and `Answer` for the training and validation data as shown below:
    - sequence = *'\<question\>' + question-text + '\<answer\>' + answer-text + '\<end\>'*

- Join the combined text using '\n' into a single string for training and validation separately

- Save the training and validation strings as separate text files

- **Combine Question and Answer for train and val data**

In [None]:
# Combine Questions and Answers for train and val data
## sequence = '<question>' + question + '<answer>' + answer + '<end>'

# YOUR CODE HERE
def combine_qa(row):
    return f"<question>{row['Question']}<answer>{row['Answer']}<end>"

train_df['combined'] = train_df.apply(combine_qa, axis=1)
val_df['combined'] = val_df.apply(combine_qa, axis=1)


- **Join the combined text using '\n' into a single string for training and validation separately**

In [None]:
# Train and Validation text for all Q&As

# YOUR CODE HERE
train_text = '\n'.join(train_df['combined'])
val_text = '\n'.join(val_df['combined'])



- **Save the training and validation strings as text files**

In [None]:
# Save the training and validation data as text files

# YOUR CODE HERE
with open('train_data.txt', 'w', encoding='utf-8') as f:
    f.write(train_text)

with open('val_data.txt', 'w', encoding='utf-8') as f:
    f.write(val_text)
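
Because the pairs are joined with `'\n'` and the text files are later loaded line by line, each question-answer pair has to stay on a single line. The sketch below is one way to guard against stray line breaks inside an answer. `to_training_line` is a hypothetical helper, and collapsing whitespace is an assumption about what is acceptable for this data, not something the notebook does:

```python
import re

def to_training_line(question, answer):
    """Format one pair as '<question>...<answer>...<end>' on a single line."""
    # Collapse runs of whitespace (including newlines) into single spaces.
    q = re.sub(r"\s+", " ", question).strip()
    a = re.sub(r"\s+", " ", answer).strip()
    return f"<question>{q}<answer>{a}<end>"

# Invented pairs; the first answer contains an embedded newline.
pairs = [
    ("What is flu?", "A viral infection.\nIt spreads easily."),
    ("What is stroke?", "A loss of blood flow to the brain."),
]
text = "\n".join(to_training_line(q, a) for q, a in pairs)

# One line per pair, so a line-based loader sees exactly len(pairs) examples.
lines = text.split("\n")
print(len(lines))
```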

**Exercise 6: Load pre-trained GPT2Tokenizer**

- Use checkpoint = "gpt2"

**Hint:** `GPT2Tokenizer.from_pretrained(...)`

In [None]:
# Set up the tokenizer
# YOUR CODE HERE
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

**Exercise 7: Tokenize train and validation data [1 Mark]**

- Use the loaded pre-trained tokenizer
- Use training and validation data saved in text files

**Hint:**

`from datasets import load_dataset`

`dataset = load_dataset("text", data_files={...})`

In [None]:
# YOUR CODE HERE
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=512, return_attention_mask=True)

dataset = load_dataset("text", data_files={
    "train": "train_data.txt",
    "validation": "val_data.txt"
})

tokenized_train = dataset['train'].map(tokenize_function, batched=True)
tokenized_val = dataset['validation'].map(tokenize_function, batched=True)


Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

**Exercise 8: Create a DataCollator object**

**Hint:** `DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="pt")`

Data collators are objects that:

- will form a batch by using a list of dataset elements as input
- may apply some processing (like padding)

One of the data collators, `DataCollatorForLanguageModeling`, can also apply some random data augmentation (like random masking) on the formed batch.

<br>

`DataCollatorForLanguageModeling` is a data collator used for language modeling. Inputs are dynamically padded to the maximum length of a batch if they are not all of the same length.

Parameters:

- ***tokenizer:*** The tokenizer used for encoding the data.
- ***mlm*** (bool, optional, default=True): Whether or not to use masked language modeling.
    - If set to False, the labels are the same as the inputs with the padding tokens ignored (by setting them to -100).
    - Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.
- ***return_tensors*** (str): The type of Tensor to return. Allowable values are “np”, “pt” and “tf” for numpy array, pytorch tensor, and tensorflow tensor respectively.

To learn more about the `DataCollatorForLanguageModeling` parameters, refer to the [documentation](https://huggingface.co/docs/transformers/v4.32.0/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling).
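
To make the `mlm=False` case concrete, the sketch below imitates the label construction with plain Python lists instead of tensors. It illustrates the behaviour described above and is not the library's code. The token IDs are made up, except `50256`, which is GPT-2's end-of-text ID and serves as the padding ID in this notebook; `make_causal_lm_labels` is a hypothetical name.

```python
PAD_ID = 50256       # GPT-2 end-of-text token, reused as the pad token in this notebook
IGNORE_INDEX = -100  # label value that the loss computation skips

def make_causal_lm_labels(input_ids, pad_id=PAD_ID):
    """Copy the inputs as labels and replace padding positions with -100."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

# A made-up sequence of four tokens padded to length six.
input_ids = [1212, 318, 257, 1332, PAD_ID, PAD_ID]
labels = make_causal_lm_labels(input_ids)
print(labels)  # [1212, 318, 257, 1332, -100, -100]
```

Because the pad token and the end-of-text token share one ID here, a genuine end-of-text token in the input would be masked in the same way.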

In [None]:
# Create a Data collator object
# YOUR CODE HERE
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, pad_to_multiple_of=8, return_tensors="pt")


**Exercise 9: Load pre-trained GPT2LMHeadModel**

**Hint:** `GPT2LMHeadModel.from_pretrained(...)`

In [None]:
# Set up the model
# YOUR CODE HERE
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [None]:
model.config.pad_token_id = model.config.eos_token_id

**Exercise 10: Fine-tune GPT2 Model [2 Marks]**

- Specify training arguments and create a TrainingArguments object (Use 30 epochs)

- Train a GPT-2 model using the provided training arguments

- Save the resulting trained model and tokenizer to a specified output directory

In [None]:
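# Set up the training arguments

model_output_path = "/content/gpt_model"

# Note: the run recorded below ends at global_step=360, so the 500-step values
# for warmup_steps, eval_steps and save_steps are never reached: the learning rate
# is still warming up when training ends, and no intermediate evaluation, checkpoint
# or early-stopping check is triggered. Smaller values are needed for these to take effect.
# Newer transformers releases name the `evaluation_strategy` argument `eval_strategy`.
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=30,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    fp16=True,
    gradient_accumulation_steps=8,
    save_steps=500,
    save_total_limit=3,
    learning_rate=5e-5,
    evaluation_strategy="steps",  # Ensure evaluation happens during training
    eval_steps=500,  # Evaluate every 500 steps
    load_best_model_at_end=True,  # Required for EarlyStoppingCallback
    metric_for_best_model='eval_loss'
)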



In [None]:
# Train the model
from transformers import EarlyStoppingCallback

early_stopping = EarlyStoppingCallback(early_stopping_patience=3)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
    callbacks=[early_stopping],

)

trainer.train()


Step,Training Loss,Validation Loss


TrainOutput(global_step=360, training_loss=2.2447063869900172, metrics={'train_runtime': 647.0009, 'train_samples_per_second': 18.547, 'train_steps_per_second': 0.556, 'total_flos': 3010084208640000.0, 'train_loss': 2.2447063869900172, 'epoch': 28.8})
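
The empty Step/Training Loss/Validation Loss table and `global_step=360` can be checked with a little arithmetic. The sketch below assumes a single device and assumes that the number of optimizer steps per epoch is the number of batches integer-divided by the gradient-accumulation steps. `count_optimizer_steps` is a hypothetical helper, not a `transformers` function.

```python
import math

def count_optimizer_steps(num_samples, batch_size, grad_accum_steps, epochs):
    """Estimate total optimizer steps for a single-device run."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * epochs

total_steps = count_optimizer_steps(
    num_samples=400, batch_size=4, grad_accum_steps=8, epochs=30
)
print(total_steps)        # 360, matching global_step in the output above
print(total_steps < 500)  # True: a 500-step interval is never reached
```

Under these assumptions, 500-step settings for warmup, evaluation and saving lie beyond the end of training, which would explain why no validation loss was logged.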

In [None]:
# Save the model
# YOUR CODE HERE
trainer.save_model(model_output_path)
# Save the tokenizer
# YOUR CODE HERE
tokenizer.save_pretrained(model_output_path)

('/content/gpt_model/tokenizer_config.json',
 '/content/gpt_model/special_tokens_map.json',
 '/content/gpt_model/vocab.json',
 '/content/gpt_model/merges.txt',
 '/content/gpt_model/added_tokens.json')

**Exercise 11: Test Model with user input prompts [2 Marks]**

- Create `generate_response()` function that takes a trained *model*, *tokenizer*, and a *prompt* string as input and generates a response using the GPT-2 model

- Test it with some user input prompts

In [None]:

# YOUR CODE HERE
def generate_response(model, tokenizer, input_text, max_length=256):
    inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        output = model.generate(
            input_ids=inputs['input_ids'],            # Input token IDs
            attention_mask=inputs['attention_mask'],  # Ensure attention mask is passed
            max_length=max_length,                    # Maximum total length (prompt + generated tokens)
            num_return_sequences=1,                   # Return one output sequence
            no_repeat_ngram_size=2,                   # Avoid repetition of 2-grams
            top_k=50,                                 # Top-k sampling for limiting vocab size
            top_p=0.9,                                # Nucleus sampling (top-p)
            temperature=0.7,                          # Adjust temperature for controlled randomness
            repetition_penalty=1.2,                   # Penalize repetitions for better diversity
            do_sample=True                            # Enable sampling (instead of greedy decoding)
        )
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    return response
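
The training text wraps every pair as `<question>...<answer>...<end>`, so one option is to prompt the fine-tuned model in the same shape and cut the generation at the first `<end>`. The helpers below, `build_prompt` and `extract_answer`, are hypothetical and only handle the string side. They are a sketch of that idea, not part of the original solution.

```python
def build_prompt(question):
    """Wrap a question in the same markers used in the training text."""
    return f"<question>{question}<answer>"

def extract_answer(generated_text):
    """Return the text after the first '<answer>' (if any) and before the next '<end>'."""
    if "<answer>" in generated_text:
        generated_text = generated_text.split("<answer>", 1)[1]
    # Drop anything the model produced after the end marker.
    return generated_text.split("<end>", 1)[0].strip()

prompt = build_prompt("What is the treatment for stroke?")
# Made-up model output, used only to show the trimming step.
fake_output = prompt + " Treatment depends on the type of stroke.<end>Unrelated text"
print(extract_answer(fake_output))
```

In use, `build_prompt(...)` would be passed as `input_text` to `generate_response()`, and `extract_answer(...)` applied to the returned string. The markers were never registered as special tokens, so `skip_special_tokens=True` does not strip them from the decoded text.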

In [None]:
# Load the fine-tuned model and tokenizer

# YOUR CODE HERE
fine_tuned_model = GPT2LMHeadModel.from_pretrained('/content/gpt_model')
fine_tuned_tokenizer = GPT2Tokenizer.from_pretrained('/content/gpt_model')

In [None]:
# Testing with a sample prompt 1

prompt = "What is the treatment for stroke?"
response = generate_response(fine_tuned_model, fine_tuned_tokenizer, prompt)
response

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'What is the treatment for stroke?\n. In addition, many older adults with cerebrovascular disease (CVD) have had strokes themselves or their family members treated by a neurologist and/or an orthopedic surgeon who specializes in treating CWDs; however this type of surgery may not be available to most younger people affected based on age alone.<end>This article provides information about how neuroimaging can help reduce long-term cognitive decline associatedwith aging: Treatment Options that Increase Cognitive Functioning during Aging There are several different typesof interventions aimed at increasing healthy brain connections between nerve cells involved mainly neurons located near areas where motor skills fail when exposed directly to high levelsa magnetic resonance imaging technique called MRI<END>. Neurointerventional studies show promising results after 10 years following standard therapiesfor those over 65 but cannot confirm whether these treatments increase functioningin some c

In [None]:
# Testing with a sample prompt 2

prompt = "What are symptoms of flu?"
response = generate_response(fine_tuned_model, fine_tuned_tokenizer, prompt)
response

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'What are symptoms of flu? These include - sudden, severe illness. This includes episodes such as seizures and death or a rash that appears on the body\'s surface when you have an episode with fever (flu). fevers; sweating feeling dizzy-like during sleep suddenly becomes more common after getting sick due to fatigue/sleep problems Feels like being in pain from too much cold air falling down your face A person can feel itchy joints moving around them Feeling tired while trying not get hot easily Sometimes people experience chest tightness every time they take another step up stairs In some cases these feelings may be triggered by certain medications medicines You should talk about how long this is going to last before taking any other steps Upgrading medicine gradually over months seems important for preventing future illnesses Symptoms usually appear one week later than usual If nausea gets worse soon afterwards vomiting attacks quickly if taken lightly The same goesfor high blood pres

**Exercise 12: Compare the performance of a *GPT2 model* with the *GPT2 model fine-tuned* on MedQuAD data [1 Mark]**

- Load another pre-trained GPT2LMHeadModel and do not fine-tune it

- To generate a response using the untuned model, pass it as a parameter to the `generate_response()` function

- Test both models (fine-tuned and untuned) with the following user input prompts:

    - "What precautions to take for a healthy life?"
    - "What to do after being diagnosed with cancer?"
    - "What to do when feeling sick?"

In [None]:
# Load a pre-trained GPT2 model, do not finetune it with MedQuAD data

# YOUR CODE HERE
untuned_model = GPT2LMHeadModel.from_pretrained('gpt2')
untuned_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
untuned_tokenizer.pad_token = untuned_tokenizer.eos_token


In [None]:
# Testing with finetuned model: prompt 1

prompt = "What precautions to take for a healthy life?"
response = generate_response(fine_tuned_model, fine_tuned_tokenizer, prompt)
response

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'What precautions to take for a healthy life? Read more at HealthyLifeWatch.\n - Talk with your doctor before taking any medications, vitamins, supplements or dietary changes -- especially if you have diabetes and heart disease problems that can lead into serious complications such as stroke (stroke is the most common type of car accident), high blood pressure/high cholesterol levels in older adults who are not diabetic, liver damage due both from alcohol use when drinking too much water during pregnancy, severe kidney failure caused by exposure to ultraviolet radiation which damages bone density). Ask about limiting how many calories should be consumed every day so people get enough nutrients along each meal plan; eat plenty protein-rich foods like fish oil insteadof saturated fat because it increases absorption back quickly but may also reduce insulin sensitivity." Also read: 5 Things You Can Do To Lose Weight Faster In The Short Term! Nutritionists recommend lowering sodium intake 3

In [None]:
# Testing with untuned model: prompt 1

prompt = "What precautions to take for a healthy life?"
response = generate_response(untuned_model, untuned_tokenizer, prompt)
response

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'What precautions to take for a healthy life?\nThe following advice is based on my personal experience. If you are an older person and do not know what the risks of living with someone younger than 18 years old or under should be, there may also be things that can help people avoid them:'

In [None]:
# Testing with finetuned model: prompt 2

prompt = "What to do after being diagnosed with cancer?"
response = generate_response(fine_tuned_model, fine_tuned_tokenizer, prompt)
response

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'What to do after being diagnosed with cancer?\n. You should talk about what you are taking and how much is taken into account when deciding whether or not it\'s right for your health care provider, as well in making a decision on the use of certain medications (including chemotherapy) during treatment. If all goes according that plan, there will be no need either to get tested again if needed; however, some people may find this process more difficult than others due their tolerance toward high doses of radiation therapy.<end> When considering an opt-out from radioactive medicines such Asparagus extract because many patients have been found unable complete remission by then owing only part time access to dialysis facilities despite repeated treatments -- especially those undergoing surgery at Johns Hopkins Medical Center - doctors must look closely over each person who receives any type Ofcharium bromide supplements while he/she continues his healthy lifestyle through regular outpatien

In [None]:
# Testing with untuned model: prompt 2

prompt = "What to do after being diagnosed with cancer?"
response = generate_response(untuned_model, untuned_tokenizer, prompt)
response

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'What to do after being diagnosed with cancer?\nA: You\'re in a different place. Cancer is very difficult, so you have no control over it or the symptoms that come along with chemotherapy — and then there\'s other factors as well; for example, if your doctor says they don\'t want any of this medication given because I\'m not doing anything about my condition."\n\n (Photo: Courtesy) The best way to help people stay on their feet long enough to get back into bed at night when needed most quickly was by making sure everyone knows how much time has passed before coming down from surgery but also knowing what treatments are available during those times instead—the kind doctors often prescribe themselves regularly once an hour each day only occasionally due out-of–office hours…You should be able pick up some medications gradually throughout treatment even though one pill may feel like too little medicine until later than usual!'

In [None]:
# Testing with finetuned model: prompt 3

prompt = "What to do when feeling sick?"
response = generate_response(fine_tuned_model, fine_tuned_tokenizer, prompt)
response

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.




In [None]:
# Testing with untuned model: prompt 3

prompt = "What to do when feeling sick?"
response = generate_response(untuned_model, untuned_tokenizer, prompt)
response

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"What to do when feeling sick?\nIt's not a bad idea. I'll admit that it can be painful but in the end, my body doesn't care about what happens with me anymore and just wants some good news for myself: this is going away!"