In [None]:
import pandas as pd
from tqdm import tqdm
pd.set_option('max_colwidth', 1000)
pd.options.mode.chained_assignment = None
df_dev = pd.read_csv("../data/MedQA_dev.csv", index_col=0)
df_train = pd.read_csv("../data/MedQA_train.csv", index_col=0)


Let's focus on the second question in the dev split and call it the girl study case:

In [9]:
girl_study_case = df_dev.iloc[1]
pd.DataFrame(girl_study_case)

Unnamed: 0,1
question,"A 5-year-old girl is brought to the emergency department by her mother because of multiple episodes of nausea and vomiting that last about 2 hours. During this period, she has had 6–8 episodes of bilious vomiting and abdominal pain. The vomiting was preceded by fatigue. The girl feels well between these episodes. She has missed several days of school and has been hospitalized 2 times during the past 6 months for dehydration due to similar episodes of vomiting and nausea. The patient has lived with her mother since her parents divorced 8 months ago. Her immunizations are up-to-date. She is at the 60th percentile for height and 30th percentile for weight. She appears emaciated. Her temperature is 36.8°C (98.8°F), pulse is 99/min, and blood pressure is 82/52 mm Hg. Examination shows dry mucous membranes. The lungs are clear to auscultation. Abdominal examination shows a soft abdomen with mild diffuse tenderness with no guarding or rebound. The remainder of the physical examination sho..."
answer,Cyclic vomiting syndrome
answer_idx,A
options,"{'A': 'Cyclic vomiting syndrome', 'B': 'Gastroenteritis', 'C': 'Hypertrophic pyloric stenosis', 'D': 'Gastroesophageal reflux disease'}"
meta_info,step2&3


Let's use BM25 search to find the question in the training split that is most similar to this one.
First, let's build the BM25 search index:

In [10]:
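# Build a BM25 index over the training questions.
from rank_bm25 import BM25Okapi
import numpy as np

questions_corpus = df_train['question'].tolist()
# Whitespace tokenization; the corpus keeps its original casing.
tokenized_question_corpus = [doc.split(" ") for doc in questions_corpus]
bm25_questions = BM25Okapi(tokenized_question_corpus)

def search_similar(query, bm25_index):
    """Return the df_train row whose indexed text scores highest for the query."""
    # The query is lowercased while the corpus is not, so capitalized
    # corpus tokens (e.g. at sentence starts) will not match.
    query = query.lower()
    tokenized_query = query.split(" ")
    doc_scores = bm25_index.get_scores(tokenized_query)
    # Position of the highest-scoring document.
    index = np.argsort(-doc_scores)[0]
    return df_train.iloc[index]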
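
BM25 matches exact tokens, so the corpus and the query should go through the same normalization. Below is a minimal sketch of a shared tokenizer; the name tokenize and the regular expression are assumptions for illustration and are not part of the notebook above. It lowercases the text and keeps only alphanumeric runs, so "Vomiting," and "vomiting" end up as the same token.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Punctuation and casing no longer produce distinct tokens.
print(tokenize("Vomiting, nausea."))
print(tokenize("She appears emaciated."))
```

If this were applied to both sides, for example [tokenize(doc) for doc in questions_corpus] when building the index and tokenize(query) inside search_similar, the rankings could differ from the outputs recorded in this notebook.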


Done! Let's search for the closest question to our example:

In [12]:
most_similar = search_similar(girl_study_case["question"], bm25_questions)
pd.DataFrame(most_similar)

Unnamed: 0,1274
question,"A previously healthy 14-year-old girl is brought to the emergency department by her mother because of abdominal pain, nausea, and vomiting for 6 hours. Over the past 6 weeks, she has also had increased frequency of urination, and she has been drinking more water than usual. She has lost 6 kg (13 lb) over the same time period despite having a good appetite. Her temperature is 37.1°C (98.8°F), pulse is 125/min, respirations are 32/min, and blood pressure is 94/58 mm Hg. She appears lethargic. Physical examination shows deep and labored breathing and dry mucous membranes. The abdomen is soft, and there is diffuse tenderness to palpation with no guarding or rebound. Urine dipstick is positive for ketones and glucose. Further evaluation is most likely to show which of the following findings?"
answer,Decreased total body potassium
answer_idx,D
options,"{'A': 'Increased arterial pCO2', 'B': 'Increased arterial blood pH', 'C': 'Serum glucose concentration > 800 mg/dL', 'D': 'Decreased total body potassium'}"
meta_info,step1
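
The top hit shares a lot of surface wording with the query ("brought to the emergency department by her mother", "dry mucous membranes", "no guarding or rebound") even though the underlying case is different. That is the kind of overlap BM25 rewards: each query term found in a document adds to its score, weighted by how rare the term is and normalized by document length. The following is a minimal pure-Python sketch of such a score, not the rank_bm25 implementation; the function name bm25_scores, the toy corpus, the k1 and b values, and the particular IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Number of documents that contain each term at least once.
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            n = doc_freq[term]
            # Rare terms get a larger weight (one common, always-positive IDF variant).
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            # Term-frequency saturation with document-length normalization.
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

# Toy corpus (made up for illustration), tokenized on whitespace.
toy_corpus = [
    "girl with nausea and vomiting".split(" "),
    "man with chest pain".split(" "),
    "boy with recurrent vomiting episodes".split(" "),
]
toy_scores = bm25_scores("episodes of vomiting".split(" "), toy_corpus)
best = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(toy_scores, best)
```

In this toy example the third document wins because it matches two query terms, while the second document matches none and scores zero. Nothing in the score looks at meaning, which is why two vomiting cases with different diagnoses can rank as close neighbors.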


What if we indexed only the answer options in the search DB instead?

In [13]:
# Index the stringified answer options instead of the question text.
options_corpus = df_train['options'].astype(str).tolist()
tokenized_options_corpus = [doc.split(" ") for doc in options_corpus]
bm25_options = BM25Okapi(tokenized_options_corpus)

In [15]:
most_similar = search_similar(str(girl_study_case["options"]), bm25_options)
pd.DataFrame(most_similar)

Unnamed: 0,7055
question,"A 5-year-old male is brought to the pediatrician by his mother, who relates a primary complaint of a recent history of five independent episodes of vomiting over the last 10 months, most recently 3 weeks ago. Each time, he has awoken early in the morning appearing pale, feverish, lethargic, and complaining of severe nausea. This is followed by 8-12 episodes of non-bilious vomiting over the next 24 hours. Between these episodes he returns to normal activity. He has no significant past medical history and takes no other medications. Review of systems is negative for changes in vision, gait disturbance, or blood in his stool. His family history is significant only for migraine headaches. Vital signs and physical examination are within normal limits. Initial complete blood count, comprehensive metabolic panel, and abdominal radiograph were unremarkable. What is the most likely diagnosis?"
answer,Cyclic vomiting syndrome
answer_idx,B
options,"{'A': 'Intracranial mass', 'B': 'Cyclic vomiting syndrome', 'C': 'Gastroesophageal reflux', 'D': 'Intussusception'}"
meta_info,step2&3
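
Because the options are indexed as stringified dictionaries, whitespace splitting yields tokens such as {'A': and syndrome', that carry quotes, braces, and option letters. One possible cleanup, sketched below, is to parse each string and keep only the option texts. This assumes the options column holds Python dict literals, as the outputs above suggest; the helper name options_to_text is hypothetical.

```python
import ast

def options_to_text(options):
    """Return only the option texts, dropping the letter keys and dict punctuation."""
    if isinstance(options, str):
        # Safely parse a stringified dict such as "{'A': '...', 'B': '...'}".
        options = ast.literal_eval(options)
    return " ".join(options.values())

raw = ("{'A': 'Cyclic vomiting syndrome', 'B': 'Gastroenteritis', "
       "'C': 'Hypertrophic pyloric stenosis', 'D': 'Gastroesophageal reflux disease'}")
print(options_to_text(raw))
```

The cleaned strings could then be tokenized and indexed in place of the raw options column; the retrieved neighbors may differ from the result shown above.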
