# Unimportant settings

In [2]:
# This setting allows the notebook to show all 
# outputs instead of only the last one. It's just a QoL thing.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Import packages

This is where we import the **Python tools** that will help us tackle this problem.

In [3]:
import pandas as pd # to read and analyse data

# Import the data

In [4]:
clinical_trials_data_path = 'https://raw.githubusercontent.com/vohcolab/TREC-Clinical-Workshop/main/data/sample_collection.csv'
patients_data_path = 'https://raw.githubusercontent.com/vohcolab/TREC-Clinical-Workshop/main/data/patients_sample.csv'


clinical_trials_data = pd.read_csv(clinical_trials_data_path,index_col=0)
patients_data = pd.read_csv(patients_data_path,index_col=0)

## Continuing from where we left off

Of all the requirements, gender and age are probably the easiest to match. Let's tackle gender matching first!

<img src="https://i.imgflip.com/1yuocg.jpg">

In [5]:
# let's save the patient description under the variable 'text'
text = patients_data.iloc[19].description
text

'A 72-year-old man complains of increasing calf pain when walking uphill. The symptoms have gradually increased over the past 3 months. The patient had an uncomplicated myocardial infarction 2 years earlier and a transient ischemic attack 6 months ago. Over the past month, his blood pressure has worsened despite previous control with diltiazem, hydrochlorothiazide, and propranolol. His is currently taking isosorbide dinitrate, hydrochlorothiazide, and aspirin. On physical examination, his blood pressure is 151/91 mm Hg, and his pulse is 67/min. There is a right carotid bruit. His lower extremities are slightly cool to the touch and have diminished pulses at the dorsalis pedis.\n        '

In [6]:
def naive_gender_detector(text):
    """
    This function (tries to) detect the gender of the patient
    given their medical description.
    """
    
    possible_male_references =  ['male','man','boy']
    
    for reference in possible_male_references:
        position_found = text.find(reference)
        
        if position_found == -1:
            continue
        else:
            return 'Male'
    return 'Female'

In [7]:
naive_gender_detector(text)

'Male'

Awesome! It works for this example. But does it work for all examples?

In [8]:
sneaky_female_text = patients_data.iloc[25].description
sneaky_female_text

'A 43-year-old woman visits her dermatologist for lesions on her neck. On examination, multiple lesions are seen. Each lesion is small soft, and pedunculated. The largest lesion is about 4 mm in diameter. The color of different lesions varies from flesh colored to slightly hyperpigmented.\n        '

Now we just need to call our gender detection function!

<img src="https://i.imgflip.com/5ds971.jpg" width="600">

In [9]:
naive_gender_detector(sneaky_female_text)

'Male'

<img src="https://memegenerator.net/img/instances/62594221.jpg" width="300">

Well... this is awkward.

> Can you figure out why this happened?

In [10]:
def naive_gender_detector_slightly_improved(text):
    """
    This function (tries to) detect the gender of the patient
    given their medical description.
    """
    
    possible_male_references =  [' male ', ' man ','boy']
    
    for reference in possible_male_references:
        position_found = text.find(reference)
        
        if position_found == -1:
            continue
        else:
            return 'Male'
    return 'Female'

In [11]:
naive_gender_detector_slightly_improved(sneaky_female_text)

'Female'
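
The padded references behave differently because `str.find` searches for substrings, not whole words. A minimal sketch of what happens under the hood (the sentences below are shortened, made-up examples, not rows from the dataset):

```python
# str.find returns the index of the first match, or -1 when nothing is found.
print("woman".find("man"))    # 2, because "man" is the tail of "woman"
print("female".find("male"))  # 2, because "male" is the tail of "female"

sentence = "A 43-year-old woman visits her dermatologist."
print(sentence.find("man") != -1)  # True: the bare reference matches inside "woman"
print(sentence.find(" man "))      # -1: no standalone " man " in this sentence

# The padding has blind spots of its own: punctuation right after the word hides it.
closing = "The patient is a 72-year-old man."
print(closing.find(" man "))       # -1, even though the patient is a man
```

So the surrounding spaces fix the "woman" case, but they also miss references followed by a period or comma, or written with a capital letter.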

<img src="https://i.imgflip.com/5dsbfs.jpg" width="500">

In [12]:
hidden_male_text = patients_data.iloc[-1].description
hidden_male_text

'A 10 year old child is brought to the emergency room complaining of myalgia, cough, and shortness of breath.  Two weeks ago the patient was seen by his pediatrician for low-grade fever, abdominal pain, and diarrhea, diagnosed with a viral illness, and prescribed OTC medications. Three weeks ago the family returned home after a stay with relatives on a farm that raises domestic pigs for consumption. Vital signs: T: 39.5 C, BP: 90/60 HR: 120/min RR: 40/min. Physical exam findings include cyanosis,  slight stiffness of the neck,  and  marked periorbital edema. Lab results include WBC 25,000, with 25% Eosinophils, and an unremarkable urinalysis.\n        '

In [13]:
naive_gender_detector_slightly_improved(hidden_male_text)

'Female'

<img src="https://wompampsupport.azureedge.net/fetchimage?siteId=7575&v=2&jpgQuality=100&width=700&url=https%3A%2F%2Fi.kym-cdn.com%2Fentries%2Ficons%2Foriginal%2F000%2F006%2F725%2FDesk_Flip_banner.jpg">

Improving our system beyond this point quickly gets much more complicated, because we are trying to make the computer understand natural language, which is impossible... right? More on that later!
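
To give a feel for how quickly the rules pile up, here is one possible next step, sketched with regular expressions. It is an illustration, not the approach this notebook uses: the name `regex_gender_detector`, the two word lists, and the `'Unknown'` fallback are all assumptions made for this example.

```python
import re

# Hypothetical word lists, chosen for illustration only.
# \b marks a word boundary, so "man" no longer matches inside "woman".
MALE_PATTERN = re.compile(r"\b(male|man|boy|he|him|his)\b", re.IGNORECASE)
FEMALE_PATTERN = re.compile(r"\b(female|woman|girl|she|her|hers)\b", re.IGNORECASE)


def regex_gender_detector(text):
    """Guess the gender by counting whole-word gendered terms.

    Returns 'Unknown' when both counts are equal (including zero hits).
    """
    male_hits = len(MALE_PATTERN.findall(text))
    female_hits = len(FEMALE_PATTERN.findall(text))
    if male_hits > female_hits:
        return "Male"
    if female_hits > male_hits:
        return "Female"
    return "Unknown"


# Shortened, made-up variants of the descriptions seen above.
print(regex_gender_detector("A 43-year-old woman visits her dermatologist."))   # Female
print(regex_gender_detector("The patient was seen by his pediatrician."))       # Male
print(regex_gender_detector("The patient reports fatigue."))                    # Unknown
```

Even this version is easy to fool: a sentence such as "his mother reports that her son has a fever" mixes pronouns that refer to different people, and simple counting cannot tell them apart.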


For now let's finish our gender matching.

In [14]:
patients_data['infered_gender'] = patients_data.description.apply(naive_gender_detector_slightly_improved)
patients_data

| patient_id | description | infered_gender |
|---|---|---|
| 20141 | A 58-year-old African-American woman presents ... | Female |
| 201410 | A physician is called to see a 67-year-old wom... | Female |
| 201411 | A 40-year-old woman with no past medical histo... | Female |
| 201412 | A 25-year-old woman presents to the clinic com... | Female |
| 201413 | A 30-year-old generally healthy woman presents... | Female |
| 201414 | An 85-year-old man is brought to the ER becaus... | Male |
| 201415 | A 36-year-old woman presents to the emergency ... | Female |
| 201416 | A 28-year-old female with neck and shoulder pa... | Female |
| 201417 | A 48-year-old white male with history of commo... | Male |
| 201418 | A 6-month-old male infant has a urine output o... | Male |


In [15]:
matching_data = {}

# iterate over each clinical trial
for _, ct in clinical_trials_data.iterrows():
    
    # If the gender requirement is for both
    if ct['gender'] == 'Both':
        # all patients are eligible
        # match this trial with all of the patients
        matching_patients = patients_data.index.tolist()
    
    # But if it requires males
    elif ct['gender'] == 'Male':
        # match only with male patients
        matching_patients = patients_data[patients_data['infered_gender'] == 'Male'].index.tolist()
    else:
        # likewise for female patients
        matching_patients = patients_data[patients_data['infered_gender'] == 'Female'].index.tolist()
    # save this matching on our variable <matching_data>
    matching_data[ct.name] = matching_patients
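
The same matching logic can be tried on a tiny, hand-made example, which makes the result easy to check by eye. This is only a sketch: the helper name `match_by_gender`, the toy tables, and every ID in them are invented for illustration. Unlike the loop above, it compares against the required value directly instead of treating every non-`'Male'` requirement as `'Female'`.

```python
import pandas as pd

# Toy stand-ins for the real tables; all IDs and values here are made up.
toy_patients = pd.DataFrame(
    {"infered_gender": ["Female", "Male", "Female"]},
    index=pd.Index([1, 2, 3], name="patient_id"),
)
toy_trials = pd.DataFrame(
    {"gender": ["Both", "Male", "Female"]},
    index=["TRIAL_A", "TRIAL_B", "TRIAL_C"],
)


def match_by_gender(trials, patients):
    """Map each trial ID to the patient IDs that meet its gender requirement."""
    matches = {}
    for trial_id, required in trials["gender"].items():
        if required == "Both":
            # no restriction: every patient is eligible
            eligible = patients
        else:
            # keep only patients whose inferred gender equals the requirement
            eligible = patients[patients["infered_gender"] == required]
        matches[trial_id] = eligible.index.tolist()
    return matches


toy_matching = match_by_gender(toy_trials, toy_patients)
print(toy_matching)
# {'TRIAL_A': [1, 2, 3], 'TRIAL_B': [2], 'TRIAL_C': [1, 3]}
```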

In [16]:
matching_data

{'NCT00000408': [20141,
  201410,
  201411,
  201412,
  201413,
  201414,
  201415,
  201416,
  201417,
  201418,
  20142,
  201421,
  201422,
  201423,
  201424,
  201426,
  201427,
  201429,
  20143,
  201430,
  20144,
  20145,
  20146,
  20147,
  20148,
  20149,
  20151,
  201510,
  201511,
  201512,
  201513,
  201516,
  201517,
  201518,
  20152,
  201520,
  201521,
  201523,
  201524,
  201525,
  201527,
  201528,
  201529,
  20153,
  201530,
  20154,
  20155,
  20156,
  20157,
  20158,
  20159],
 'NCT00000492': [20141,
  201410,
  201411,
  201412,
  201413,
  201414,
  201415,
  201416,
  201417,
  201418,
  20142,
  201421,
  201422,
  201423,
  201424,
  201426,
  201427,
  201429,
  20143,
  201430,
  20144,
  20145,
  20146,
  20147,
  20148,
  20149,
  20151,
  201510,
  201511,
  201512,
  201513,
  201516,
  201517,
  201518,
  20152,
  201520,
  201521,
  201523,
  201524,
  201525,
  201527,
  201528,
  201529,
  20153,
  201530,
  20154,
  20155,
  20156,
  20157,
  2

Awesome!

Although we are only matching based on gender, this is a pretty good first step!


...

You might be wondering, how can we improve from here?

<img src="https://memegenerator.net/img/instances/75581336.jpg" width="500">

> Q2: How can we improve from here? What is lacking in our solution?