<a href="https://colab.research.google.com/github/quazirab/fine-tuning-llama-3.1-on-medical-questionnaires/blob/main/notebooks/data_prep_for_medical_question_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup
Install the required Python packages.

In [41]:
!pip install -q datasets

# Data prep

In [42]:
from datasets import load_dataset
dataset = load_dataset("lavita/medical-qa-datasets", name="medmcqa", split="train")

In [43]:
dataset

Dataset({
    features: ['id', 'question', 'opa', 'opb', 'opc', 'opd', 'cop', 'choice_type', 'exp', 'subject_name', 'topic_name'],
    num_rows: 182822
})

In [44]:
df = dataset.to_pandas()
df.head(5)

Unnamed: 0,id,question,opa,opb,opc,opd,cop,choice_type,exp,subject_name,topic_name
0,e9ad821a-c438-4965-9f77-760819dfa155,Chronic urethral obstruction due to benign pri...,Hyperplasia,Hyperophy,Atrophy,Dyplasia,2,single,Chronic urethral obstruction because of urinar...,Anatomy,Urinary tract
1,e3d3c4e1-4fb2-45e7-9f88-247cc8f373b3,Which vitamin is supplied from only animal sou...,Vitamin C,Vitamin B7,Vitamin B12,Vitamin D,2,single,Ans. (c) Vitamin B12 Ref: Harrison's 19th ed. ...,Biochemistry,Vitamins and Minerals
2,5c38bea6-787a-44a9-b2df-88f4218ab914,All of the following are surgical options for ...,Adjustable gastric banding,Biliopancreatic diversion,Duodenal Switch,Roux en Y Duodenal By pass,3,multi,"Ans. is 'd' i.e., Roux en Y Duodenal Bypass Ba...",Surgery,Surgical Treatment Obesity
3,cdeedb04-fbe9-432c-937c-d53ac24475de,Following endaerectomy on the right common car...,Central aery of the retina,Infraorbital aery,Lacrimal aery,Nasociliary aretry,0,multi,The central aery of the retina is a branch of ...,Ophthalmology,
4,dc6794a3-b108-47c5-8b1b-3b4931577249,Growth hormone has its effect on growth through?,Directly,IG1-1,Thyroxine,Intranuclear receptors,1,single,"Ans. is 'b' i.e., IGI-1GH has two major functi...",Physiology,


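For this training set, let's create a new column holding the answer text, looked up from `cop` (the zero-based index of the correct option) among options A, B, C and D. The original intent was to use only rows whose `choice_type` is `single`; note that the cell below does not apply that filter.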

In [45]:
def answer(row):
  match row["cop"]:
    case 0:
      return row["opa"]
    case 1:
      return row["opb"]
    case 2:
      return row["opc"]
    case 3:
      return row["opd"]

df["answer"] = df.apply(answer, axis=1)

# pick only the required columns
df = df[["question", "answer", "exp"]]

df.head(5)

Unnamed: 0,question,answer,exp
0,Chronic urethral obstruction due to benign pri...,Atrophy,Chronic urethral obstruction because of urinar...
1,Which vitamin is supplied from only animal sou...,Vitamin B12,Ans. (c) Vitamin B12 Ref: Harrison's 19th ed. ...
2,All of the following are surgical options for ...,Roux en Y Duodenal By pass,"Ans. is 'd' i.e., Roux en Y Duodenal Bypass Ba..."
3,Following endaerectomy on the right common car...,Central aery of the retina,The central aery of the retina is a branch of ...
4,Growth hormone has its effect on growth through?,IG1-1,"Ans. is 'b' i.e., IGI-1GH has two major functi..."
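The lookup above can also be written as a list index instead of a `match` statement, which additionally runs on Python versions older than 3.10. The following is a minimal sketch on a toy frame, not the notebook's own cell: `OPTION_COLUMNS`, `pick_answer`, `toy` and `single_only` are names introduced here for illustration, the first two toy rows mirror the preview above, and the third row is made up to show the out-of-range case. The last line shows one way the `single` filter mentioned in the text could be applied.

```python
from typing import Optional

import pandas as pd

# Option columns in the order that `cop` indexes them (0 -> opa, ..., 3 -> opd).
OPTION_COLUMNS = ["opa", "opb", "opc", "opd"]


def pick_answer(row: pd.Series) -> Optional[str]:
    """Return the text of the correct option, or None if `cop` is out of range."""
    idx = int(row["cop"])
    if 0 <= idx < len(OPTION_COLUMNS):
        return row[OPTION_COLUMNS[idx]]
    return None


# Toy data: rows 0 and 1 mirror the preview above, row 2 is hypothetical.
toy = pd.DataFrame(
    {
        "question": ["q1", "q2", "q3"],
        "opa": ["Hyperplasia", "Vitamin C", "Option A"],
        "opb": ["Hyperophy", "Vitamin B7", "Option B"],
        "opc": ["Atrophy", "Vitamin B12", "Option C"],
        "opd": ["Dyplasia", "Vitamin D", "Option D"],
        "cop": [2, 2, -1],
        "choice_type": ["single", "single", "multi"],
    }
)
toy["answer"] = toy.apply(pick_answer, axis=1)

# Optional step (an assumption about the intended filter): keep single-choice rows only.
single_only = toy[toy["choice_type"] == "single"]
print(single_only[["question", "answer"]])
```

Returning `None` for an unexpected `cop` value makes such rows easy to find and drop later with `dropna`.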


In [None]:
if False: df.to_csv("medical-question-w-answer-and-explanation.csv")

## Clean up
The explanations often begin by repeating the answer choice. Let's try to clean that up!

In [46]:
df

Unnamed: 0,question,answer,exp
0,Chronic urethral obstruction due to benign pri...,Atrophy,Chronic urethral obstruction because of urinar...
1,Which vitamin is supplied from only animal sou...,Vitamin B12,Ans. (c) Vitamin B12 Ref: Harrison's 19th ed. ...
2,All of the following are surgical options for ...,Roux en Y Duodenal By pass,"Ans. is 'd' i.e., Roux en Y Duodenal Bypass Ba..."
3,Following endaerectomy on the right common car...,Central aery of the retina,The central aery of the retina is a branch of ...
4,Growth hormone has its effect on growth through?,IG1-1,"Ans. is 'b' i.e., IGI-1GH has two major functi..."
...,...,...,...
182817,Most common site for extra mammary Paget&;s di...,Vulva,.It is superficial manifestation of an intradu...
182818,Inferior Rib notching is seen in all except?,Neurofibromatosis,Answer is D (Neurofibromatosis) Neurofibromato...
182819,Which is false regarding cryptococcus neoformans?,Urease negative,"Ans. is 'c' i e., Urease negative Cryptococcus..."
182820,Histopathological finding of gluten hypersensi...,Crypt hyperplasia,"Ans. is 'a' i.e., Crypt hyperplasia Histopatho..."


Looking through the dataset, most of the repetition follows prefixes like the ones listed below. The main regex pattern used against them is `Ans.* [^\w\s][A-Za-z][^\w\s]`:
* Ans. (c)
* Ans. is 'd'
* Ans. is 'b'
* Ans. C
* Ans) C
* Ans : B
* Answer is D
* Ans is 'c'
* Answer- B.
* Ans: (b)
* Ans A
* Ans, is 'a'
* Ans is 'c' i.e.
* Ans: a
* Ans is (a)
* Ans, is 'a'
* Ans is 'c'
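To see which of the prefixes listed above the pattern actually covers, it can be tried against a few of them directly. This is a small sketch using only the standard `re` module; `ANSWER_PREFIX`, `samples` and `matched` are names introduced here for illustration.

```python
import re

# The pattern quoted above, written as a raw string.
ANSWER_PREFIX = re.compile(r"Ans.* [^\w\s][A-Za-z][^\w\s]")

# A few of the prefixes from the list above.
samples = [
    "Ans. (c)",
    "Ans. is 'd'",
    "Ans: (b)",
    "Ans is (a)",
    "Ans. C",
    "Ans : B",
    "Answer is D",
]

# Record whether the pattern finds a match anywhere in each sample.
matched = {s: bool(ANSWER_PREFIX.search(s)) for s in samples}

for sample, hit in matched.items():
    print(f"{sample!r}: {'match' if hit else 'no match'}")
```

The pattern requires the option letter to be wrapped in symbols, such as `(c)` or `'d'`, so bare forms like `Ans. C` or `Answer is D` are not matched by it and are left to the other substitutions.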


In [47]:
import re

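# Raw strings avoid invalid-escape warnings for sequences such as \s and \w.
re_subs = [
    r"^\s", # strip one leading whitespace character
    r"Ans.* [^\w\s][A-Za-z][^\w\s]", # strip "Ans. (c)"-style answer markers
    r"ANS.* [^\w\s][A-Za-z][^\w\s]", # same for upper-case "ANS"
    r"^[a-dA-D]*", # strip leading option letters, for exp that start with the MCQ answer
    r"^\S*\s", # strip the rest of the first token plus the whitespace after it (this also drops the first word of ordinary sentences)
    r"^i\.e\.", # strip a leading "i.e."
    r"^[^\w\s]", # strip one leading symbol
    r"^\s", # strip one remaining leading whitespace character
    ]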

def clean_ans(example: str | None):
  if example:
    for re_sub in re_subs:
      example = re.sub(re_sub, "", example)
  return example

print(clean_ans("Ans. (C). Adequate liquor amniiThe photograph shows maceration, a sign of intrauterine death.Conditions favoring maceration: Intact membranes; adequate liquor amnii & NO air. "))
print(clean_ans("C i.e. Mite"))
print(clean_ans("C. Deficit of exp"))
print(clean_ans("	d i.e., Monocyte"))

Adequate liquor amniiThe photograph shows maceration, a sign of intrauterine death.Conditions favoring maceration: Intact membranes; adequate liquor amnii & NO air. 
Mite
Deficit of exp
Monocyte


In [48]:
df["exp"] = df["exp"].apply(clean_ans)

df

Unnamed: 0,question,answer,exp
0,Chronic urethral obstruction due to benign pri...,Atrophy,urethral obstruction because of urinary calcul...
1,Which vitamin is supplied from only animal sou...,Vitamin B12,Vitamin B12 Ref: Harrison's 19th ed. P 640* Vi...
2,All of the following are surgical options for ...,Roux en Y Duodenal By pass,Roux en Y Duodenal Bypass Bariatric surgical p...
3,Following endaerectomy on the right common car...,Central aery of the retina,central aery of the retina is a branch of the ...
4,Growth hormone has its effect on growth through?,IG1-1,IGI-1GH has two major functions :-i) Growth of...
...,...,...,...
182817,Most common site for extra mammary Paget&;s di...,Vulva,is superficial manifestation of an intraductal...
182818,Inferior Rib notching is seen in all except?,Neurofibromatosis,is D (Neurofibromatosis) Neurofibromatosis is ...
182819,Which is false regarding cryptococcus neoformans?,Urease negative,"i e., Urease negative Cryptococcus neoformans ..."
182820,Histopathological finding of gluten hypersensi...,Crypt hyperplasia,Crypt hyperplasia Histopathological findings o...


In [49]:
i = 0

for x in df.itertuples():
  exp = x[3]
  if exp and "ans" in exp.lower():
    i += 1
    print(exp)

  if i == 20:
    break

Vitamin B12 Ref: Harrison's 19th ed. P 640* Vitamin B12 (Cobalamin) is synthesized solely by microorganisms.* In humans, the only source for humans is food of animal origin, e.g., meat, fish, and dairy products.* Vegetables, fruits, and other foods of nonanimal origin doesn't contain Vitamin B12 .* Daily requirements of vitamin Bp is about 1-3 pg. Body stores are of the order of 2-3 mg, sufficient for 3-4 years if supplies are completely cut off.
of the most impoant pharmacokinetic changes associated with aging is decreased renal elimination of drugs. After age 40, creatinine clearance decreases an average of 8 mL/min/1.73 m2/decade; however, the age-related decrease varies substantially from person to person. Serum creatinine levels often remain within normal limits despite a decrease in GFR because older adults generally have less muscle mass and are generally less physically active than younger adults and thus produce less creatinine. Maintenance of normal serum creatinine levels ca

## Reduce number of training rows

There are too many rows, so let's reduce them to 10,000 for our training purposes. For this experiment, let's:
* drop the rows whose answer and explanation are the same
* drop the rows that have no explanation
* then, after also dropping explanations of five words or fewer, sort the dataframe by the length of the explanation (in words) and take the first 10,000 rows, i.e. the shortest explanations that remain

In [50]:
print(f"before dropping: {df.shape}")
df = df[df["answer"] != df['exp']]
print(f"after dropping: {df.shape}")

before dropping: (182822, 3)
after dropping: (173505, 3)


In [51]:
print(f"before dropping: {df.shape}")
df = df.dropna(subset=["exp"])  # reassign instead of inplace=True to avoid SettingWithCopyWarning
print(f"after dropping: {df.shape}")

before dropping: (173505, 3)
after dropping: (151552, 3)


Let's check how many words the explanations typically contain.

In [54]:
print(f"before dropping: {df.shape}")
# assign() returns a new frame, so no SettingWithCopyWarning is raised
df = df.assign(num_of_words_in_exp=df["exp"].str.split().str.len())
df = df[df["num_of_words_in_exp"] != 0]
print(f"after dropping: {df.shape}")
df[df["num_of_words_in_exp"] == 0].head(5)  # sanity check: should now be empty


before dropping: (151552, 4)
after dropping: (150850, 4)


Unnamed: 0,question,answer,exp,num_of_words_in_exp


In [60]:
num_count = df["num_of_words_in_exp"].value_counts().sort_index()
num_count

num_of_words_in_exp
1       2181
2       1504
3       1336
4       1149
5        961
        ... 
1647       1
1666       1
1683       1
2100       1
3154       1
Name: count, Length: 855, dtype: int64

In [61]:
df = df[df["num_of_words_in_exp"] > 5]
df = df.sort_values('num_of_words_in_exp').head(10_000)
print(f"after dropping: {df.shape}")

after dropping: (10000, 4)
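The row-reduction steps in this section can be collapsed into one helper that never mutates its input, which sidesteps pandas' copy-versus-view warnings. This is a minimal sketch on made-up toy data rather than the notebook's own code; `reduce_rows`, `min_words`, `limit` and `toy` are names introduced here for illustration.

```python
import pandas as pd


def reduce_rows(df: pd.DataFrame, min_words: int = 5, limit: int = 10_000) -> pd.DataFrame:
    """Apply the filters described in this section and keep the shortest explanations."""
    out = df[df["answer"] != df["exp"]]       # explanation must differ from the answer
    out = out.dropna(subset=["exp"])          # explanation must exist
    out = out.assign(num_of_words_in_exp=out["exp"].str.split().str.len())
    out = out[out["num_of_words_in_exp"] > min_words]  # more than `min_words` words
    return out.sort_values("num_of_words_in_exp").head(limit)


# Toy rows, made up for illustration.
toy = pd.DataFrame(
    {
        "question": ["q1", "q2", "q3", "q4"],
        "answer": ["Burrow", "Vulva", "Mite", "Atrophy"],
        "exp": [
            "Burrow",                             # same as the answer: dropped
            None,                                 # no explanation: dropped
            "one two three four five six seven",  # 7 words: kept
            "a b c d e f",                        # 6 words: kept
        ],
    }
)

reduced = reduce_rows(toy)
print(reduced[["question", "num_of_words_in_exp"]])
```

Because the result is sorted in ascending order, `head(limit)` keeps the shortest qualifying explanations; passing `ascending=False` to `sort_values` would keep the longest ones instead.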


In [62]:
df = df[["question", "answer", "exp"]]
df

Unnamed: 0,question,answer,exp
132587,Pathognomic lesion of scabies is?,Burrow,"BurrowRef : Rook's 8/e, p 38.36.39"
72018,WHO STEPS is used for:,Non-communicable diseases,diseases [Ref. http://wwwwho.int/mediacentre/f...
158061,Acute bilirubin encephalopathy is characterize...,Hypertonia,bilirubin encephalopathy is characterized by h...
137653,In some kidney transplants Hyperacute graft re...,Preformed antibodies,Bailey and love 25 e p1410
176938,Clasp arms serves the function of,Both 2 and 3,and Position of Clasp Assembly Parts
...,...,...,...
1121,A 35 yr old pregnant female at 40 weeks gestat...,Epidural block,complete relief of pain is needed throughout l...
108449,The two strands of DNA are held together by,Hydrogen bond,strands of DNA are held together by hydrogen b...
91014,Which of the following conditions is associate...,Juvenile CML,is A (Juvenile CML): Fetal Haemoglobin Levels ...
35235,Langerhan's cells are -,Antigen presenting cells,Antigen presenting cells Langerhans cells are ...


# Save the data

Looking at the data, most of the answer mentions at the start of the explanations appear to have been cleaned up. Now it's time to save the dataset for Llama training. The save cell is guarded by an `if` flag: set `if False` to `if True` to write the file. It is saved to Google Drive, from which another notebook will retrieve it.

In [63]:
if True: df.to_parquet("/content/drive/MyDrive/colab_drive/medical-question-w-answer-and-explanation-training-dataset.parquet")
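The save above only works when Google Drive is mounted and the target folder exists. A small guard can pick a fallback location otherwise. This is a sketch under assumptions: `resolve_output_path`, `DRIVE_DIR` and `FILENAME` are hypothetical names introduced here, and the fallback to the current directory is an illustrative choice rather than part of the original workflow.

```python
from pathlib import Path

# Assumption: Drive is mounted at /content/drive, for example in Colab with
#   from google.colab import drive; drive.mount("/content/drive")
DRIVE_DIR = Path("/content/drive/MyDrive/colab_drive")
FILENAME = "medical-question-w-answer-and-explanation-training-dataset.parquet"


def resolve_output_path(
    drive_dir: Path = DRIVE_DIR,
    fallback_dir: Path = Path("."),
    filename: str = FILENAME,
) -> Path:
    """Return the Drive path if the folder exists, else a path under the fallback folder."""
    base = drive_dir if drive_dir.is_dir() else fallback_dir
    return base / filename


output_path = resolve_output_path()
print(f"would write to: {output_path}")
# df.to_parquet(output_path)  # uncomment in the notebook, where `df` is defined
```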