# MAPPING STUDIES TO THERAPEUTIC AREAS

In this notebook, we map studies to their therapeutic areas using word embeddings. We mostly follow this [article](https://towardsdatascience.com/use-embeddings-to-predict-therapeutic-area-of-clinical-studies-654af661b949) by 
Thierry Herrmann.

### IMPORTS

In [1]:
# Import notebook containing the studies required
%run data_extraction1.ipynb

In [2]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns

# widen the notebook container to 80% of the page
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

# ensure plots don't have a transparent background
plt.rcParams['axes.facecolor']='white' 
plt.rcParams['figure.facecolor']='white'

pd.set_option("display.max_rows",100)

### SOME CONSTANTS

In [3]:
# Constants
mesh_dir = '/Users/sanya/Downloads/'
emb_dir = '/Users/sanya/Downloads/embeddings-master'


## LOAD PRETRAINED EMBEDDINGS

- Load pretrained embeddings of the UMLS (Unified Medical Language System) CUIs (Concept Unique Identifiers)
- Load them with the gensim library

In [4]:
from gensim.models.keyedvectors import KeyedVectors
w2v_path = emb_dir+'/DeVine_etal_200.txt'     # len(w2v.key_to_index) = 52102
w2v = KeyedVectors.load_word2vec_format(w2v_path, binary=False) # text (non-binary) word2vec format

Notes:
     
     “KeyedVectors” is essentially a mapping between keys and vectors. Each vector is identified by its lookup key, most often a short string token, so this is usually a mapping between {str => 1D numpy array}.
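
To make the note above concrete, here is a minimal pure-Python sketch of a key-to-vector mapping with a cosine-similarity lookup. The three-dimensional vectors and the helper `most_similar_toy` are invented for illustration; gensim works on the real 200-dimensional vectors and its implementation differs.

```python
import math

# Toy key -> vector mapping. The CUIs appear in this notebook, the vectors are invented.
toy_vectors = {
    "C0019158": [0.9, 0.1, 0.0],
    "C0019196": [0.8, 0.2, 0.1],
    "C0012833": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_toy(key, vectors, topn=2):
    """Hypothetical helper: rank every other key by cosine similarity to `key`."""
    query = vectors[key]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != key]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = most_similar_toy("C0019158", toy_vectors)
print(ranked)
```

The result has the same shape as the output of `w2v.most_similar`: a list of `(key, similarity)` pairs in decreasing order of similarity.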



In [5]:
if False:
    w2v_path = emb_dir + '/stanford_cuis_svd_300.txt' # len(w2v.vocab) = 22705
    w2v_stanford = KeyedVectors.load_word2vec_format(w2v_path, binary=False)

In [6]:
if False:
    w2v_path = emb_dir + '/claims_cuis_hs_300.txt' # len(w2v.vocab) = 14852
    w2v_claims = KeyedVectors.load_word2vec_format(w2v_path, binary=False)

In [7]:
vocab_keys = w2v.key_to_index
len(vocab_keys)

52102

In [8]:
# TEST CASE

w2v.most_similar('C0019158')

# most_similar (gensim) returns the CUIs whose vectors are closest to the one we provide.

# NOTE: the commented list below does not match the output shown for 'C0019158';
# it appears to come from a query on a different CUI.

# [('C0012833', 0.782675564289093),     dizziness
#  ('C0220870', 0.7098520994186401),    Lightheadedness
#  ('C0917801', 0.6731902360916138),    Sleeplessness
#  ('C0043352', 0.669024646282196),     Xerostomia
#  ('C0027497', 0.6656966805458069),    Nausea
#  ('C0221512', 0.6654322147369385)]    Stomach ache

[('C0037140', 0.6207729578018188),
 ('C0019196', 0.5936840772628784),
 ('C0019163', 0.5609232187271118),
 ('C0019169', 0.5462726950645447),
 ('C0019159', 0.5288532972335815),
 ('C0042721', 0.5198029279708862),
 ('C0524909', 0.5124529004096985),
 ('C1443861', 0.4951326549053192),
 ('C0011226', 0.49500754475593567),
 ('C0019189', 0.4948180317878723)]

In [9]:
# TEST CASE

# Find concepts similar to C0019158 hepatitis

w2v.most_similar('C0019158')

# Returns:

# [('C0037140', 0.6207730770111084),  B Virus Infection
#  ('C0019196', 0.5936840772628784),  Hepatitis C
#  ('C0019163', 0.5609232187271118),  Hepatitis B
#  ('C0019169', 0.5462726950645447),  Hepatitis B Virus
#  ('C0019159', 0.5288532376289368),  Hepatitis A
#  ('C0042721', 0.5198029279708862),  Viral hepatitis
#  ('C0524909', 0.5124529004096985),  Hepatitis B, Chronic
#  ('C1443861', 0.4951326549053192),  Post-Exposure Prophylaxis
#  ('C0011226', 0.4950075149536133),  Hepatitis D Infection
#  ('C0019189', 0.4948180317878723)]  Hepatitis, Chronic

[('C0037140', 0.6207729578018188),
 ('C0019196', 0.5936840772628784),
 ('C0019163', 0.5609232187271118),
 ('C0019169', 0.5462726950645447),
 ('C0019159', 0.5288532972335815),
 ('C0042721', 0.5198029279708862),
 ('C0524909', 0.5124529004096985),
 ('C1443861', 0.4951326549053192),
 ('C0011226', 0.49500754475593567),
 ('C0019189', 0.4948180317878723)]

## UTILITY FUNCTIONS

In [10]:
def df_mem(df):
    return '%.1f Mb' % (df.memory_usage(index=True, deep=True).values.sum()/1024/1024) 

def load_df(file_name, nrows=1000, header='infer', names=None):
    df = pd.read_csv(file_name, sep='|', nrows=nrows, low_memory=False, header=header, names=names)
    #print("loaded '%s', %d rows (%s)" % (file_name, len(df), df_mem(df)))
    return df

## LOAD/ INSPECT STUDIES

In [11]:
# Dataframe with the required studies
# Resetting index
# data_final.reset_index()
data_final = data_final.get(['NCTId','ConditionMeshTerm','OverallStatus',
                'Condition','InterventionName','LocationCity','LocationState','LocationCountry','BriefSummary'])
data_final

Unnamed: 0,NCTId,ConditionMeshTerm,OverallStatus,Condition,InterventionName,LocationCity,LocationState,LocationCountry,BriefSummary
0,NCT02335671,breast neoplasms,Recruiting,"[Early Stage Breast Cancer, Breast Cancer Stag...",[Intra-operative Magnetic Resonance Imaging (M...,"[Boston, Boston]","[Massachusetts, Massachusetts]","[United States, United States]",[The purpose of this study is to investigate t...
1,NCT02348749,neuroendocrine tumors,Recruiting,[Neuroendocrine Tumors],"[18F-MFBG (meta-fluoro benzylguanidine), Posit...",[New York],[New York],[United States],[The purpose of this study is to see how a new...
2,NCT02347995,stroke,Recruiting,[Stroke],"[Protein, Placebo]",[Baltimore],[Maryland],[United States],[Stroke survivors experience severe muscle was...
3,NCT02346435,kidney neoplasms,Recruiting,[Kidney Neoplasm],[],[Baltimore],[Maryland],[United States],[Retrospective studies indicate that active su...
4,NCT02344485,parkinson disease,Recruiting,"[Parkinson's Disease, Constipation]",[OMM treatment],[Old Westbury],[New York],[United States],[The aim of this pilot study is to investigate...
...,...,...,...,...,...,...,...,...,...
32094,NCT05419076,lung neoplasms,Recruiting,"[Lung Cancer, Lung Cancer Metastatic, Brain Me...","[Stereotactic Radiosurgery, Cerebrospinal flui...","[Basking Ridge, Middletown, Montvale, Commack,...","[New Jersey, New Jersey, New Jersey, New York,...","[United States, United States, United States, ...",[The purpose of the study is to see if stereot...
32095,NCT05424822,lymphoma,Recruiting,"[Lymphoma, Non-Hodgkin, Leukemia, Lymphocytic,...",[JNJ-80948543],"[Duarte, New York, New York, Nashville, Housto...","[California, New York, New York, Tennessee, Te...","[United States, United States, United States, ...",[The purpose of this study is to characterize ...
32096,NCT05425719,kidney diseases,Recruiting,[Chronic Kidney Disease],"[MB-102, MediBeacon Transdermal Glomerular Fil...","[Edgewater, Chicago, Saint Paul, Raleigh, San ...","[Florida, Illinois, Minnesota, North Carolina,...","[United States, United States, United States, ...","[This is a multi-center, open-label, pivotal s..."
32098,NCT05444465,rotator cuff injuries,Recruiting,[Rotator Cuff Injuries],"[Isolated Bioinductive Repair, Completion and ...","[Rome, Rockford, Bedford, Richmond, Calgary, W...","[Georgia, Illinois, Texas, Victoria, Alberta, ...","[United States, United States, United States, ...",[The purpose of the study is to assess whether...


## 1. MAP STUDIES TO MESH TERMS

- Create a dictionary to map NCT ID's to their mesh_terms

In [12]:
# Creating a dataframe with nct_id and downcase_mesh_term column from the main data.

data_final = data_final.assign(nct_id = data_final.get('NCTId'),downcase_mesh_term = data_final.get('ConditionMeshTerm'))
df_mesh_ct = data_final.get(['nct_id','downcase_mesh_term'])
df_mesh_ct

Unnamed: 0,nct_id,downcase_mesh_term
0,NCT02335671,breast neoplasms
1,NCT02348749,neuroendocrine tumors
2,NCT02347995,stroke
3,NCT02346435,kidney neoplasms
4,NCT02344485,parkinson disease
...,...,...
32094,NCT05419076,lung neoplasms
32095,NCT05424822,lymphoma
32096,NCT05425719,kidney diseases
32098,NCT05444465,rotator cuff injuries


### 1.1 CREATING NCT ID TO MESH TERM DICTIONARY

In [13]:
# Creating dictionary: each NCT ID maps to the set of its mesh terms

from collections import defaultdict
nct_to_mesh_term = defaultdict(set)
for nct_id, term in df_mesh_ct[['nct_id', 'downcase_mesh_term']].itertuples(index=False):
    nct_to_mesh_term[nct_id].add(term)

In [14]:
# Test Case
nct_to_mesh_term['NCT02344485']

{'parkinson disease'}
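
The grouping step can be sketched on a few hand-written rows. The 'parkinson disease' and 'stroke' rows come from the table above; the 'constipation' row is hypothetical and is there only to show how a study with several MeSH terms collects them in one set.

```python
from collections import defaultdict

# (nct_id, downcase_mesh_term) pairs; the 'constipation' row is hypothetical.
rows = [
    ("NCT02344485", "parkinson disease"),
    ("NCT02344485", "constipation"),
    ("NCT02347995", "stroke"),
]

toy_nct_to_mesh_term = defaultdict(set)
for nct_id, term in rows:
    toy_nct_to_mesh_term[nct_id].add(term)

# .get() reads a key without creating an empty entry for unknown studies,
# whereas indexing with [] on a defaultdict would insert an empty set.
print(toy_nct_to_mesh_term.get("NCT02344485"))
print(toy_nct_to_mesh_term.get("NCT00000000"))
```

Using `.get()` for lookups matters later on, because an accidental empty entry would look like a study without any MeSH term.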

Notes:
    
    We skip loading studies.txt because the studies are already in the dataframe "data_final". We also skip 
    browse_conditions.txt by building a separate dataframe with the same layout.

## 2. CONVERT MESH TERMS TO THEIR UNIQUE IDENTIFIERS

### 2.1 FIND MESH TERMS AND CODES

- Finding Unique Identifiers for MESH Terms. 

In [15]:
# Function to parse the MeSH descriptor file (d2022.bin)

def parse_mesh_file_as_df(filepath):
    """Returns a dataframe with one row (ui, name, mesh_number) per MeSH record."""
    import re
    tups = []

    with open(filepath) as fp:
        name, mesh_number = None, None
        for line in fp:
            if line.startswith('MN ='):
                # tree number; if a record has several, the last one is kept
                mesh_number = re.search(r'MN = (.+)', line).group(1)
            elif line.startswith('MH ='):
                # main heading, lowercased to match the CT.gov terms
                name = re.search(r'MH = (.+)', line).group(1).lower()
            elif line.startswith('UI ='):
                # assumes UI comes after MH and MN within a record
                ui = re.search(r'UI = (.+)', line).group(1)
                tups.append((ui, name, mesh_number))
                # reset so a record without MH/MN cannot inherit stale values
                name, mesh_number = None, None
    return pd.DataFrame(tups, columns=['ui','name','mesh_number'])
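
As a quick check of the parsing logic, the sketch below runs the same "KEY = value" matching on a single abbreviated in-memory record instead of the full file. The helper `parse_records` is hypothetical and returns plain tuples rather than a dataframe; the record values are the ones shown for `D000001` in this notebook, and real records contain many more fields.

```python
import io
import re

# One abbreviated record; real records in the file contain many more fields.
sample = io.StringIO(
    "MH = Calcimycin\n"
    "MN = D03.633.100.221.173\n"
    "UI = D000001\n"
)

def parse_records(fp):
    """Hypothetical helper: collect (ui, name, mesh_number) tuples from 'KEY = value' lines."""
    records, name, mesh_number = [], None, None
    for line in fp:
        match = re.match(r"(MH|MN|UI) = (.+)", line)
        if match is None:
            continue  # ignore every other field
        key, value = match.groups()
        if key == "MH":
            name = value.lower()
        elif key == "MN":
            mesh_number = value
        else:  # "UI" is assumed to close the record in this sketch
            records.append((value, name, mesh_number))
            name, mesh_number = None, None
    return records

records = parse_records(sample)
print(records)
```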

In [16]:
# Creating dataframe with MeSH terms and their unique identifiers (UI) from the file d2022.bin

df_mesh = parse_mesh_file_as_df(mesh_dir + 'd2022.bin')
#print(df_mesh[['ui','name']])
df_mesh

Unnamed: 0,ui,name,mesh_number
0,D000001,calcimycin,D03.633.100.221.173
1,D000002,temefos,D02.886.300.692.800
2,D000003,abattoirs,J03.540.020
3,D000004,abbreviations as topic,L01.559.598.400.556.131
4,D000005,abdomen,A01.923.047
...,...,...,...
30189,D066310,digital divide,L01.143.230.500
30190,D066328,ventral striatum,A08.186.211.200.885.287.249.487.775
30191,D066329,protein aggregates,D05.875
30192,D066330,"printing, three-dimensional",L01.296.110.150.500


### 2.2 BUILD MESH TERM TO ID DICTIONARY

- Dictionary to map MESH Terms to their Unique Identifiers

In [17]:
# Map each (lowercased) MeSH term to its unique identifier
mesh_term_to_id = {}

for name, ui in df_mesh[['name','ui']].itertuples(index=False):
    mesh_term_to_id[name] = ui
    
mesh_term_to_id['ventral striatum']

'D066328'

In [18]:
# MeSH terms in CT.gov but not in df_mesh


mesh_missing = set(df_mesh_ct.downcase_mesh_term.values) - set(df_mesh.name.values)
print('mesh terms in CT.gov but not in MeSH official list: %d' % len(mesh_missing))
if len(mesh_missing) <= 1000:
    print(mesh_missing)

mesh terms in CT.gov but not in MeSH official list: 1219


In [19]:
# Dictionaries created so far:
# - nct_to_mesh_term : Look up mesh terms with NCT ID's
# - mesh_term_to_id : Look up mesh Unique Identifiers (ui) with the mesh term

## 3. LOAD UMLS CUIS
- Loading the Concepts and Sources file MRCONSO.RRF from the NLM

In [20]:
# Since the file is big, we read it in streaming mode and drop on the fly
# the CUIs that are not in the embeddings


def load_conso(file_name, vocab_keys):
    rows = []
    cnt = 0  # number of lines read (stays 0 for an empty file)
    with open(file_name) as fp:
        for cnt, line in enumerate(fp, start=1):
            cols = line.strip().split('|')
            cols[3] = cols[3].lower()  # lowercase the STR column
            if cols[0] in vocab_keys:
                rows.append(cols)

    df = pd.DataFrame(rows, columns=['CUI','SAB','CODE','STR'])
    print("loaded '%s', %d rows (%s)" % (file_name, len(df), df_mem(df)))
    print('processed rows: %d' % cnt)
    return df # Returns dataframe with the 4 columns mentioned above.
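
The streaming filter can be tried on a few in-memory lines before running it on the full file. This is a minimal sketch, assuming the reduced four-column layout `CUI|SAB|CODE|STR`; the generator `stream_filter` is a hypothetical stand-in for `load_conso`, the first two rows reuse identifiers shown in this notebook, and the third row is invented.

```python
import io

def stream_filter(fp, keep_cuis):
    """Yield [CUI, SAB, CODE, STR] rows whose CUI is in keep_cuis, lowercasing STR."""
    for line in fp:
        cols = line.rstrip("\n").split("|")
        if cols[0] in keep_cuis:
            cols[3] = cols[3].lower()
            yield cols

# Pipe-separated rows; the last one is hypothetical and must be filtered out.
sample = io.StringIO(
    "C0043250|MSH|D014947|Wounds\n"
    "C0043251|MSH|D014947|Trauma\n"
    "C9999999|MSH|D999999|Not In Embeddings\n"
)

kept = list(stream_filter(sample, {"C0043250", "C0043251"}))
print(kept)
```

Because rows are tested one at a time, memory use grows with the number of kept rows rather than with the size of the file.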

In [21]:
df_c = load_conso("/Users/sanya/Downloads/MRCONSO_reduced.RRF", vocab_keys)

# df_c : Dataframe with CUI, SAB, CODE AND STR INFO FROM MRCONSO.RRF

loaded '/Users/sanya/Downloads/MRCONSO_reduced.RRF', 2159141 rows (574.7 Mb)
processed rows: 16857345


In [22]:
# Test Case
df_c[df_c.CODE=='D014947']

Unnamed: 0,CUI,SAB,CODE,STR
967572,C0043250,MSHCZE,D014947,rány
967583,C0043250,MSH,D014947,wounds
967586,C0043250,MSH,D014947,wound
967613,C0043250,MSHFRE,D014947,plaies
967623,C0043250,MSHITA,D014947,ferite
967636,C0043250,MSHNOR,D014947,sår
967637,C0043250,MSHPOR,D014947,ferimentos
967638,C0043250,MSHPOR,D014947,feridas
967639,C0043250,MSHPOR,D014947,ferimento
967646,C0043250,MSHPOR,D014947,ferida


In [23]:
# Test Case
df_c[df_c.STR=='trauma']

Unnamed: 0,CUI,SAB,CODE,STR
967662,C0043251,MSHCZE,D014947,trauma
967663,C0043251,MSHDUT,D014947,trauma
967687,C0043251,MSH,D014947,trauma
967701,C0043251,MSHFRE,D014947,trauma
967703,C0043251,MSHGER,D014947,trauma
967705,C0043251,MSHITA,D014947,trauma
967714,C0043251,MSHPOR,D014947,trauma
967730,C0043251,MSHSPA,D014947,trauma


### 3.1 MAP MESH CODES TO CUIS
- Create a dictionary that maps mesh codes to CUIs using the dataframe "df_c".

In [24]:
# Dictionary mapping mesh codes to CUI's


from collections import defaultdict
mesh_code_to_cui = defaultdict(set) # used to link a study to CUIs (through mesh codes)

for row in df_c[df_c.SAB=='MSH'][['CODE','CUI']].itertuples():
    code, cui = row[1], row[2]
    # df_c only holds CUIs present in the embeddings, so no extra vocabulary check is needed
    mesh_code_to_cui[code].add(cui)

In [25]:
#TEST CASE
mesh_code_to_cui['D014947']

{'C0043250', 'C0043251'}

### 3.2 MAP STUDIES TO CUIS
- Create a dictionary of study identifiers to concept CUIs.

In [26]:
std_to_cuis = defaultdict(set)
imperfect_studies=set()
for idx,(std,terms) in enumerate(nct_to_mesh_term.items()):
    for term in terms:
        mesh_id = mesh_term_to_id.get(term)
        if mesh_id is None:
            #print('mesh term "' + term + \
                            #'" is in CT.gov but not in the official MeSH terms. Ignore term for study %s' % std)
            imperfect_studies.add(std)
            break
        for cui in mesh_code_to_cui[mesh_id]:
            std_to_cuis[std].add(cui)
                
print('removing (imperfect) studies containing at least 1 mesh term not in the official mesh list: %d' % len(imperfect_studies))
for imperfect in imperfect_studies:
    if std_to_cuis.get(imperfect) is not None:
        del std_to_cuis[imperfect]

removing (imperfect) studies containing at least 1 mesh term not in the official mesh list: 2048


Notes: 
    
    After removing the imperfect studies, 20,672 studies remain.

In [27]:
# Test Case
std_to_cuis['NCT02335671'] # breast neoplasms

{'C0006142', 'C0678222', 'C1458155'}
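
The whole lookup chain (study → MeSH term → MeSH code → CUIs) can be sketched with toy dictionaries. The study identifiers and term names below are hypothetical; only the `D014947` entry mirrors values shown in this notebook. The helper `map_studies` is an illustrative rewrite of the cell above, not a replacement for it.

```python
from collections import defaultdict

# Toy lookups; study IDs and term names are hypothetical.
toy_nct_to_terms = {
    "NCT-A": {"toy term"},
    "NCT-B": {"toy term", "term missing from mesh"},
}
toy_term_to_id = {"toy term": "D014947"}
toy_code_to_cui = {"D014947": {"C0043250", "C0043251"}}

def map_studies(nct_to_terms, term_to_id, code_to_cui):
    """Return (study -> CUIs, imperfect studies), dropping studies with an unknown term."""
    mapped, imperfect = defaultdict(set), set()
    for study, terms in nct_to_terms.items():
        ids = [term_to_id.get(term) for term in terms]
        if any(mesh_id is None for mesh_id in ids):
            imperfect.add(study)  # one unknown term discards the whole study
            continue
        for mesh_id in ids:
            for cui in code_to_cui.get(mesh_id, set()):
                mapped[study].add(cui)
    return dict(mapped), imperfect

mapped, imperfect = map_studies(toy_nct_to_terms, toy_term_to_id, toy_code_to_cui)
print(mapped, imperfect)
```

Checking all terms of a study before adding any CUI avoids the separate clean-up loop, at the cost of one extra pass over the terms.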

## 4. MAP CONCEPT CUIS TO STRINGS
- Create dictionary mapping CUI's to the STR column that contains a set of descriptions for each CUI.
- This will be used later to associate therapeutic area (from their strings) to CUIs

In [28]:
# Creating a dictionary mapping the CUI's to the string column in "df_c"

from collections import defaultdict

cui_to_strings = defaultdict(set)  # a set of descriptions (lowercased) for each CUI (obtained from the STR column)
                                   # used later to associate therapeutic area (from their strings) to CUIs
for row in df_c[['CUI','STR']].itertuples():
    cui, term = row[1], row[2]
    cui_to_strings[cui].add(term)

In [29]:
# Test Case
#cui_to_strings['C1458155']
# breast neoplasms

### 4.1 GET STUDY TERMS FROM THE STUDY ID

In [30]:
# Creating function to retrieve the CUI terms in a study from the dictionary "std_to_cuis"

def get_study_terms(std):
    #  mesh terms -> CUIs (UMLS) -> 
    # -> CUI terms from MRCONSO.RRF (all strings for the CUI)
    cuis = std_to_cuis.get(std)
    terms = set()
    if cuis is None:
        return terms
    for cui in cuis:
        terms.update(cui_to_strings[cui])
    return terms


In [31]:
# test case
get_study_terms('NCT02347995')

{'acc cerebrovascular/apoplejia',
 'accident - cerebrovascular',
 'accident cerebrovasculair',
 'accident cerebrovascular',
 'accident cérébro-vasculaire',
 'accident cérébrovasculaire',
 'accident cérébrovasculaire sai',
 'accident ischémique cérébral',
 'accident vasculaire cerebral',
 'accident vasculaire cérébral',
 'accident, cerebrovasculair',
 'accident, cérébrovasculaire',
 'accident; cerebraal',
 'accident; cerebral',
 'accident; cerebrovasculair',
 'accident; cerebrovascular',
 'accidente cerebral vascular',
 'accidente cerebrovascolare',
 'accidente cerebrovascolare nas',
 'accidente cerebrovascular',
 'accidente cerebrovascular (concepto no activo)',
 'accidente cerebrovascular (trastorno)',
 'accidente cerebrovascular neom',
 'accidente cerebrovascular, no especificado',
 'accidente cerebrovascular, no especificado (trastorno)',
 'accidente cerebrovascular, sai',
 'accidente cerebrovascular, sai (trastorno)',
 'accidente vascular cerebral',
 'accidente vascular del cerebro

## 5. Manually Associate Therapeutic Areas to UMLS Concepts (CUIs)

### 5.1 CREATING FUNCTIONS
- Create a function (any_term_in_strings) that returns True if any of the input terms is a substring of any of the input strings.
- Create a function (find_cuis_for_terms) that returns the set of CUIs whose strings match the term lists we pass in.

In [32]:
def any_term_in_strings(term_list, strings):
    """
    Returns True if any term of term_list is a substring of any of the strings.
    Args:
    - term_list: list of strings
    - strings: other list (or set) of strings
    """
    for term in term_list:
        for string in strings:
            if term in string:
                #print('  term "%s" ok! in string: %s' % (term,string))
                return True
    return False

if False: # test
    any_term_in_strings(['virus','viral'], {'aa viral', 'bb'})

def find_cuis_for_terms(term_lists, exclude_terms=None):
    """
    Finds the set of CUIs whose terms match a list of terms.
    Args:
    - term_lists: list of lists of terms: for each term_list in term_lists, 
      at least one term must be in at least one of the strings. Thus it's an OR 
      inside a term_list, but an AND between the term lists.
    - exclude_terms: the CUI is rejected if any of its strings contains one of these terms.
    Returns:
    - a set of CUIs as a set of strings.
    """
    cuis=set()
    cnt=0
    for cui,strings in cui_to_strings.items():
        # if any excluded term is in any strings, reject this cui
        if exclude_terms is not None:
            excluded = False
            for ex_term in exclude_terms:
                for string in strings:
                    if ex_term in string:
                        excluded = True
                        break
                if excluded:
                    break
            if excluded:
                #print('cui %s excluded. Strings: %s' % (cui, str(strings)))
                continue
                
        # process the term_lists
        cui_ok = True
        for term_list in term_lists:
            #print('searching term_list "%s"' % (term_list))
            if not any_term_in_strings(term_list, strings):
                #print('term list "%s" does not match strings "%s"' % (term_list, strings))
                cui_ok = False
                break
            #else:
            #    #print('term list "%s" matches strings "%s"' % (term_list, strings))
            #    cnt+=1
            #    if cnt == 100:
            #        return
                
        if cui_ok:
            # THIS IS THE LINE TO UNCOMMENT TO DEBUG CUIs
            #print('accepted cui "%s" that corresponds to strings "%s"\n' % (cui, str(strings)))
            cuis.add(cui)
    return cuis     
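
The "OR inside a term list, AND between term lists" rule is easier to see on a toy mapping. In the sketch below the CUI identifiers and strings are invented, and `match_cuis` is a hypothetical compact equivalent of `find_cuis_for_terms` that takes the mapping as an argument.

```python
# Toy CUI -> strings mapping; identifiers and strings are invented for the example.
toy_cui_to_strings = {
    "CUI_1": {"genetic disease", "inherited disorder"},
    "CUI_2": {"genetic testing"},
    "CUI_3": {"lung diseases"},
    "CUI_4": {"non-genetic disease"},
}

def match_cuis(term_lists, cui_strings, exclude_terms=()):
    """Keep CUIs where every term list has at least one term found in at least one string."""
    matched = set()
    for cui, strings in cui_strings.items():
        # reject the CUI if any excluded term occurs in any of its strings
        if any(ex in s for ex in exclude_terms for s in strings):
            continue
        # AND across term lists, OR across the terms of one list
        if all(any(t in s for t in term_list for s in strings) for term_list in term_lists):
            matched.add(cui)
    return matched

both = match_cuis([["geneti"], ["diseas"]], toy_cui_to_strings)
filtered = match_cuis([["geneti"], ["diseas"]], toy_cui_to_strings, exclude_terms=["non-"])
print(both, filtered)
```

`CUI_2` and `CUI_3` each satisfy only one of the two term lists and are dropped, while `exclude_terms` removes `CUI_4` even though it matches both lists.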

### 5.2 MANUALLY ASSOCIATE CUI'S TO THERAPEUTIC AREA

- Call the function "find_cuis_for_terms" to retrieve CUIs for various terms.
- Manually associate each result of "find_cuis_for_terms" with a Therapeutic Area.

In [33]:



cardiology_cuis = find_cuis_for_terms([['cardiolog', 'cardiovascul']])
#dental_cuis = find_cuis_for_terms([['dental', ]], exclude_terms=['accidental', 'incidental', 'occidental', 'osteodental'])
#dental_cuis = find_cuis_for_terms([['caries', 'cavity', 'cavities', 'orthodon', 'endodon',]]) # worse!
dental_cuis = find_cuis_for_terms([['tooth cavit', 'caries', 'cavities']], 
                                  exclude_terms=['peccaries', 'cotyloid cavities', 'nasal cavities', 'cavities, glenoid', 'cavities paranasal',
                                                 'pleural cavities', 'cavities, pleural', 'pericardial cavities', 'cavities pelvic',
                                                 'cavities uterine', 'abdominal cavities', 'cavities, tympanic', 'body cavities'])
dermatology_cuis = find_cuis_for_terms([['dermatol', ]]) # 'skin' adds too many matches in other contexts
device_cuis = find_cuis_for_terms([['device', ]])
environ_cuis = find_cuis_for_terms([['environmental', 'environments', 'pollut']])
endocrinology_cuis = find_cuis_for_terms([['endocrinol', ]])
family_med_cuis = find_cuis_for_terms([['family medicine', ]])
gastro_cuis = find_cuis_for_terms([['gastroentero', ]])
genetic_cuis = find_cuis_for_terms([['geneti', ], ['diseas']])
volunteer_cuis = find_cuis_for_terms([['volunteer', ]]) # difficult. Other contexts than health
hematology_cuis = find_cuis_for_terms([['hematol', ]], exclude_terms=['non-hemato'])
#hepatology_cuis = find_cuis_for_terms([['hepatol', 'hepatic']])
#hepatology_cuis = find_cuis_for_terms([['liver']])
hepatology_cuis = find_cuis_for_terms([['hepatitis']]) # needs to be refined. Might be too restrictive, but liver/hepatol match too much
immunology_cuis = find_cuis_for_terms([['immunolog']])
infect_cuis=find_cuis_for_terms([['infectious', 'infected', 'infection'], ['disease']])
intern_cuis=find_cuis_for_terms([['intern'],['medicin']])
muskuloskel_cuis=find_cuis_for_terms([['musculoskelet']])
nephrology_cuis=find_cuis_for_terms([['nephrolog']])
neurology_cuis=find_cuis_for_terms([['neurolog']])
nutrition_cuis=find_cuis_for_terms([['nutrition', 'body weight', 'weight reduc', 'weight gain', 'overweight']])
obstetrics_cuis=find_cuis_for_terms([['obstetri', 'gynecol']])
oncology_cuis = find_cuis_for_terms([['oncolog', 'cancer']])
occupdisease_cuis = find_cuis_for_terms([['occupational disease']])
ophtalmo_cuis = find_cuis_for_terms([['ophthalmol', 'eye']])
orthopedics_cuis = find_cuis_for_terms([['orthopedi']])
otorino_cuis = find_cuis_for_terms([['otolaryngol']])
pediatrics_cuis = find_cuis_for_terms([['pediatr', 'neonat']])
parasitic_cuis = find_cuis_for_terms([['parasit'],['disease']])
pharmacol_toxicol_cuis = find_cuis_for_terms([['pharmacol', 'toxicol']])
#podiatrics_cuis = find_cuis_for_terms([['podiat', 'foot diseases', 'foot injur']]) # podiatry not found in embeddings, although in UMLS
podiatrics_cuis = find_cuis_for_terms([['podiat']]) # including foot diseases/injur make it match dental studies
psy_cuis = find_cuis_for_terms([['psychiatr', 'psycholog']])
pulmon_cuis = find_cuis_for_terms([['pulmonar', 'respirat'], ['diseas']])
rare_cuis = find_cuis_for_terms([['orphan drug', ]])# C0178786 orphan disease/drug
                                                    # C0178604 drug design/synthesis/production
                                                    # C0013232 Drugs, Orphan        --> THE ONLY ONE PRESENT IN EMBEDDINGS :(
                                                    # C0920627 Orphan Diseases
                                                    # C0029308 Orphan Drug Production
                                                    # C0599036 unprofitable drug development
                                                    # C0678236 Rare Diseases
rheumatology_cuis = find_cuis_for_terms([['rheumat', ]]) # C0035452 (Rheumatology specialty) badly missing + should we limit to 'rheumatolog' ?
sleep_cuis = find_cuis_for_terms([['sleep', ]])
symptoms_cuis = find_cuis_for_terms([['general manifestation of disorders', ]]) # matches only C1457887: symptoms
traume_cuis = find_cuis_for_terms([['traumas', ]]) # seems more focused than 'trauma' and avoids 'non-trauma' and 'nontrauma' strings
urology_cuis = find_cuis_for_terms([['urology', ]])
vaccine_cuis = find_cuis_for_terms([['vaccine', ]])

# Dictionary associating each Therapeutic Area with the CUIs retrieved above

areas = {
    'Cardiology/Vascular Diseases' :       cardiology_cuis,
    'Dental and Oral Health' :             dental_cuis,
    'Dermatology' :                        dermatology_cuis,
    'Devices' :                            device_cuis,
    'Disorders of Environmental Origin' :  environ_cuis,
    'Endocrinology' :                      endocrinology_cuis,
    'Family Medicine' :                    family_med_cuis,
    'Gastroenterology' :                   gastro_cuis,
    'Genetic Disease' :                    genetic_cuis,
    'Healthy Volunteers' :                 volunteer_cuis,
    'Hematology' :                         hematology_cuis,
    'Hepatology' :                         hepatology_cuis,
    'Immunology' :                         immunology_cuis,
    'Infections and Infectious Diseases' : infect_cuis,
    'Internal Medicine' :                  intern_cuis,
    'Musculoskeletal' :                    muskuloskel_cuis,
    'Nephrology' :                         nephrology_cuis,
    'Neurology' :                          neurology_cuis,
    'Nutrition and Weight Loss' :          nutrition_cuis,
    'Obstetrics/Gynecology' :              obstetrics_cuis,
    'Oncology' :                           oncology_cuis,
    'Occupational Diseases' :              occupdisease_cuis,
    'Ophthalmology' :                      ophtalmo_cuis,
    'Orthopedics/Orthopedic Surgery' :     orthopedics_cuis,
    'Otolaryngology' :                     otorino_cuis,
    'Pediatrics/Neonatology' :             pediatrics_cuis,
    'Parasitic Diseases' :                 parasitic_cuis,
    'Pharmacology/Toxicology' :            pharmacol_toxicol_cuis,
    'Podiatry' :                           podiatrics_cuis,
    'Psychiatry/Psychology' :              psy_cuis,
    'Pulmonary/Respiratory Diseases' :     pulmon_cuis,
    'Rare Diseases and Disorders' :        rare_cuis,
    'Rheumatology' :                       rheumatology_cuis,
    'Sleep' :                              sleep_cuis,
    'Symptoms and General Pathology' :     symptoms_cuis,
    'Trauma' :                             traume_cuis,
    'Urology' :                            urology_cuis,
    'Vaccines' :                           vaccine_cuis,
}

## 6. Match Studies with Therapeutic Areas using Concept Embeddings
- Create function to find the therapeutic areas having the closest concepts to those of a given study.
- We will be using the dictionaries created earlier for our functions.

In [36]:
# Creating function to find the therapeutic areas having the closest concepts to those of a given study

def find_best_areas(std):
    """
    Finds the therapeutic areas having the closest concepts to those of a given study
    Args:
    - std: study identifier
    Returns:
    - the study terms for debugging purposes
    - a list of up to 5 (area, similarity) tuples, in decreasing order of similarity
    """
    std_cuis = std_to_cuis.get(std)  # .get avoids inserting an empty entry into the defaultdict
    if not std_cuis:
        raise Exception('no cuis for std %s' % std)
    sims, area_list = [], []
    for area, cuis in areas.items():
        # similarity between the study CUIs and the CUIs of this area
        sims.append(w2v.n_similarity(std_cuis, cuis))
        area_list.append(area)
    indices = np.argsort(sims)[::-1]  # indices ordered from most to least similar
    best_areas = np.array(area_list)[indices][:5]
    best_sims = np.array(sims)[indices][:5]
    return get_study_terms(std), list(zip(best_areas, best_sims))
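
For intuition about the score, the sketch below ranks areas by the cosine similarity between the mean vector of the study's concepts and the mean vector of each area's concepts, which is, to my understanding, what `n_similarity` computes. The two-dimensional embeddings, the area names and the helper `rank_areas` are invented for illustration.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Invented 2-d embeddings: s* are study concepts, a*/b* are area concepts.
toy_emb = {
    "s1": [1.0, 0.0], "s2": [0.8, 0.2],
    "a1": [0.9, 0.1],
    "b1": [0.0, 1.0], "b2": [0.1, 0.9],
}
toy_areas = {"Area A": {"a1"}, "Area B": {"b1", "b2"}}

def rank_areas(study_cuis, areas, emb):
    """Hypothetical helper: (area, similarity) pairs in decreasing order of similarity."""
    study_mean = mean_vector([emb[c] for c in study_cuis])
    scores = [
        (area, cosine(study_mean, mean_vector([emb[c] for c in cuis])))
        for area, cuis in areas.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank_areas({"s1", "s2"}, toy_areas, toy_emb)
print(ranking)
```

Averaging before comparing means that an area defined by many loosely related CUIs gets a blurred mean vector, which is one reason the term lists in section 5.2 are kept narrow.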

In [43]:
find_best_areas('NCT02347995')

({'acc cerebrovascular/apoplejia',
  'accident - cerebrovascular',
  'accident cerebrovasculair',
  'accident cerebrovascular',
  'accident cérébro-vasculaire',
  'accident cérébrovasculaire',
  'accident cérébrovasculaire sai',
  'accident ischémique cérébral',
  'accident vasculaire cerebral',
  'accident vasculaire cérébral',
  'accident, cerebrovasculair',
  'accident, cérébrovasculaire',
  'accident; cerebraal',
  'accident; cerebral',
  'accident; cerebrovasculair',
  'accident; cerebrovascular',
  'accidente cerebral vascular',
  'accidente cerebrovascolare',
  'accidente cerebrovascolare nas',
  'accidente cerebrovascular',
  'accidente cerebrovascular (concepto no activo)',
  'accidente cerebrovascular (trastorno)',
  'accidente cerebrovascular neom',
  'accidente cerebrovascular, no especificado',
  'accidente cerebrovascular, no especificado (trastorno)',
  'accidente cerebrovascular, sai',
  'accidente cerebrovascular, sai (trastorno)',
  'accidente vascular cerebral',
  'a

In [51]:
summ, al = [],[]
cuis1 = std_to_cuis['NCT02347995']
for area, cuis in areas.items():
    #print( w2v.n_similarity(cuis1,cuis))
    summ.append(w2v.n_similarity(cuis1, cuis))
    al.append(area)
    #print(al)
    indices1 = list(reversed(np.argsort(summ)))
    print(indices1)

[0]
[0, 1]
[0, 1, 2]
[0, 1, 3, 2]
[0, 4, 1, 3, 2]
[0, 4, 1, 5, 3, 2]
[0, 4, 1, 5, 3, 6, 2]
[0, 4, 1, 5, 7, 3, 6, 2]
[0, 4, 1, 8, 5, 7, 3, 6, 2]
[0, 4, 1, 8, 5, 7, 3, 6, 2, 9]
[0, 4, 1, 8, 10, 5, 7, 3, 6, 2, 9]
[0, 4, 1, 8, 10, 5, 7, 3, 6, 2, 9, 11]
[0, 4, 1, 8, 10, 5, 7, 3, 6, 2, 9, 11, 12]
[0, 4, 1, 8, 10, 13, 5, 7, 3, 6, 2, 9, 11, 12]
[0, 4, 1, 8, 14, 10, 13, 5, 7, 3, 6, 2, 9, 11, 12]
[0, 15, 4, 1, 8, 14, 10, 13, 5, 7, 3, 6, 2, 9, 11, 12]
[0, 16, 15, 4, 1, 8, 14, 10, 13, 5, 7, 3, 6, 2, 9, 11, 12]
[17, 0, 16, 15, 4, 1, 8, 14, 10, 13, 5, 7, 3, 6, 2, 9, 11, 12]
[17, 0, 16, 15, 4, 1, 8, 14, 10, 13, 5, 7, 18, 3, 6, 2, 9, 11, 12]
[17, 0, 16, 15, 4, 1, 8, 19, 14, 10, 13, 5, 7, 18, 3, 6, 2, 9, 11, 12]
[17, 0, 16, 15, 4, 1, 8, 19, 14, 10, 13, 5, 7, 18, 3, 20, 6, 2, 9, 11, 12]
[17, 0, 16, 15, 4, 21, 1, 8, 19, 14, 10, 13, 5, 7, 18, 3, 20, 6, 2, 9, 11, 12]
[17, 0, 16, 15, 4, 21, 1, 8, 22, 19, 14, 10, 13, 5, 7, 18, 3, 20, 6, 2, 9, 11, 12]
[17, 0, 16, 15, 4, 21, 1, 8, 22, 19, 14, 10, 13, 5, 7, 18,

In [43]:
# Test Case
#find_best_areas('NCT02347995')

In [58]:
def find_therapeutic_area(std):
    """
    Finds the therapeutic area whose concepts are closest to those of a given study
    Args:
    - std: study identifier
    Returns:
    - the name of the most similar area, or 'N/A' if the study has no CUIs
    """
    std_cuis = std_to_cuis.get(std)  # .get avoids inserting an empty entry into the defaultdict
    if not std_cuis:
        return 'N/A'
    sims, area_list = [], []
    for area, cuis in areas.items():
        sims.append(w2v.n_similarity(std_cuis, cuis))
        area_list.append(area)
    indices = np.argsort(sims)[::-1]  # indices ordered from most to least similar
    best_areas = np.array(area_list)[indices]
    return best_areas[0]

Notes: 
    
    This function returns 'N/A' for the studies that do not have a CUI.
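
As a rough illustration of what the ranking step does, the sketch below reproduces it on toy data with plain NumPy. It assumes that `n_similarity` amounts to the cosine similarity between the mean vectors of the two CUI sets. The vectors, keys and area names (`toy_vectors`, `toy_areas`) are made up for the example and are not taken from the pretrained embeddings.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; the real vectors are much longer.
toy_vectors = {
    "cui_a": np.array([1.0, 0.0, 0.0]),
    "cui_b": np.array([0.9, 0.1, 0.0]),
    "cui_c": np.array([0.0, 1.0, 0.0]),
    "cui_d": np.array([0.0, 0.0, 1.0]),
}

# Hypothetical therapeutic areas, each described by a set of CUIs.
toy_areas = {
    "Area X": {"cui_a", "cui_b"},
    "Area Y": {"cui_c"},
    "Area Z": {"cui_d"},
}

def mean_cosine(keys1, keys2, vectors):
    """Cosine similarity between the mean vectors of two key sets."""
    m1 = np.mean([vectors[k] for k in keys1], axis=0)
    m2 = np.mean([vectors[k] for k in keys2], axis=0)
    return float(np.dot(m1, m2) / (np.linalg.norm(m1) * np.linalg.norm(m2)))

def rank_areas(study_keys, areas, vectors):
    """Return (area, similarity) pairs sorted by decreasing similarity."""
    if not study_keys:
        return []  # nothing to compare, mirrors the 'N/A' case
    names = list(areas)
    sims = np.array([mean_cosine(study_keys, areas[n], vectors) for n in names])
    order = np.argsort(sims)[::-1]  # indices from highest to lowest similarity
    return [(names[i], float(sims[i])) for i in order]

ranking = rank_areas({"cui_a"}, toy_areas, toy_vectors)
print(ranking[0][0])
```

Only the first entry of the ranking is used when a single therapeutic area is assigned to a study; the remaining entries are useful for inspecting near misses.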

In [100]:
# Test Case
#find_therapeutic_area('NCT02347995')

In [45]:
def classify_studies(nb_studies):
    """
    Finds therapeutic areas for a given number of studies. Prints a basic summary 
    of the results
    Args:
    - nb_studies: number of studies to consider.
    Returns:
    - a dictionary indexed by area whose value is a list of 3-tuples containing:
      - study identifier
      - study strings (see get_study_terms()) for evaluation
      - the list of closest areas with their similarity scores, in decreasing
        similarity order, the first one being the corresponding dictionary key.
    """
    from collections import defaultdict
    # stds_by_area: key: area, value: list( (study_id, list(study_term), list( (area,similarity) )) )
    stds_by_area = defaultdict(list) 
    
    for cnt,std in enumerate(list(std_to_cuis.keys())[:nb_studies]):
        std_terms, areas_sim = find_best_areas(std)
        area = areas_sim[0][0]
        if area not in stds_by_area.keys():
            print("study found for area '%s' after analyzing %d studies" % (area, cnt+1))
        stds_by_area[area].append((std, std_terms, areas_sim))

    # sort results by number of studies in each area, just to print a summary of results
    res_areas, res_lens = [], []
    for area,ranks_list in stds_by_area.items():
        res_areas.append(area)
        res_lens.append(len(ranks_list))
    res_areas, res_lens = np.array(res_areas), np.array(res_lens)
    indices = list(reversed(np.argsort(res_lens)))
    sorted_areas = np.array(res_areas)[indices]
    sorted_lens = np.array(res_lens)[indices]
    areas_results = list(zip(sorted_areas,sorted_lens))
    print('%d studies in %d areas' % (nb_studies, len(areas_results)))
    for res in areas_results:
        print('%s: %d' %(res[0], res[1]))
        
    return stds_by_area
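
The sort-and-print summary at the end of `classify_studies` can also be written with `collections.Counter`. This is a minimal sketch on made-up data: `toy_stds_by_area` is a hypothetical stand-in for the `stds_by_area` dictionary built by the function, and the study identifiers are placeholders.

```python
from collections import Counter

# Hypothetical stand-in for stds_by_area: area -> list of (study_id, terms, ranks).
toy_stds_by_area = {
    "Area X": [("S1", [], []), ("S2", [], []), ("S3", [], [])],
    "Area Y": [("S4", [], [])],
    "Area Z": [("S5", [], []), ("S6", [], [])],
}

# Count the studies assigned to each area.
area_counts = Counter({area: len(results) for area, results in toy_stds_by_area.items()})

# most_common() yields (area, count) pairs in decreasing count order.
summary = area_counts.most_common()
print('%d studies in %d areas' % (sum(area_counts.values()), len(summary)))
for area, count in summary:
    print('%s: %d' % (area, count))
```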

## 7. RESULTS

In [46]:
# Classifying all studies available
# Uncomment to see results

# studies_by_area = classify_studies(len(std_to_cuis))


In [47]:

# Uncomment to print results for the first study of each found therapeutic area
#for area,results_list in studies_by_area.items():
#    print('======= %s ==========' % area)
#    for results in results_list[:1]:
#        std, terms, ranks = results
        
#        print('study: %s' % std)
#        print('terms: %s' % str(terms))
#        print('ranks: %s' % str(ranks))
#        print()

### 7.1 SAVING RESULTS IN A DATAFRAME

In [60]:
result_df = data_final

### 7.2 ASSOCIATING THERAPEUTIC AREAS TO STUDIES

In [61]:
result_df = result_df.assign(Therapeutic_Area = result_df.get('NCTId').apply(find_therapeutic_area))

In [62]:
result_df = result_df[result_df.get('Therapeutic_Area')!= 'N/A']
result_df = result_df.reset_index()
result_df = result_df.get(['NCTId','Therapeutic_Area','ConditionMeshTerm','InterventionName','LocationCity',
                          'LocationState','LocationCountry'])
result_df

Unnamed: 0,NCTId,Therapeutic_Area,ConditionMeshTerm,InterventionName,LocationCity,LocationState,LocationCountry
0,NCT02335671,Oncology,breast neoplasms,[Intra-operative Magnetic Resonance Imaging (M...,"[Boston, Boston]","[Massachusetts, Massachusetts]","[United States, United States]"
1,NCT02348749,Oncology,neuroendocrine tumors,"[18F-MFBG (meta-fluoro benzylguanidine), Posit...",[New York],[New York],[United States]
2,NCT02347995,Neurology,stroke,"[Protein, Placebo]",[Baltimore],[Maryland],[United States]
3,NCT02346435,Oncology,kidney neoplasms,[],[Baltimore],[Maryland],[United States]
4,NCT02344485,Occupational Diseases,parkinson disease,[OMM treatment],[Old Westbury],[New York],[United States]
...,...,...,...,...,...,...,...
18884,NCT05419375,Oncology,neoplasms,[Screening platform],"[Tucson, Longmont, Austin, San Antonio, Tyler,...","[Arizona, Colorado, Texas, Texas, Texas, Virgi...","[United States, United States, United States, ..."
18885,NCT05419076,Oncology,lung neoplasms,"[Stereotactic Radiosurgery, Cerebrospinal flui...","[Basking Ridge, Middletown, Montvale, Commack,...","[New Jersey, New Jersey, New Jersey, New York,...","[United States, United States, United States, ..."
18886,NCT05424822,Oncology,lymphoma,[JNJ-80948543],"[Duarte, New York, New York, Nashville, Housto...","[California, New York, New York, Tennessee, Te...","[United States, United States, United States, ..."
18887,NCT05425719,Nephrology,kidney diseases,"[MB-102, MediBeacon Transdermal Glomerular Fil...","[Edgewater, Chicago, Saint Paul, Raleigh, San ...","[Florida, Illinois, Minnesota, North Carolina,...","[United States, United States, United States, ..."
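
The assign-then-filter pattern used in the cells above can be tried on a toy frame. The sketch below uses a hypothetical lookup function (`toy_lookup`) and made-up study identifiers in place of `find_therapeutic_area` and the real data. It also passes `drop=True` to `reset_index` so the old index is not kept as an extra column, which is a small variation on the cell above.

```python
import pandas as pd

# Made-up studies; only the column names follow the notebook.
toy_df = pd.DataFrame({
    "NCTId": ["S1", "S2", "S3"],
    "ConditionMeshTerm": ["term a", "term b", "term c"],
})

# Hypothetical mapping; studies missing from it play the role of studies without CUIs.
toy_area_by_id = {"S1": "Area X", "S3": "Area Y"}

def toy_lookup(study_id):
    """Return the area for a study, or 'N/A' when none is known."""
    return toy_area_by_id.get(study_id, "N/A")

# Add the area column, drop unclassified studies, and renumber the rows.
toy_result = toy_df.assign(Therapeutic_Area=toy_df["NCTId"].apply(toy_lookup))
toy_result = toy_result[toy_result["Therapeutic_Area"] != "N/A"].reset_index(drop=True)
toy_result = toy_result[["NCTId", "Therapeutic_Area", "ConditionMeshTerm"]]
print(toy_result)
```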


In [64]:
result_df[result_df.get('Therapeutic_Area')!='Oncology'][:5]

Unnamed: 0,NCTId,Therapeutic_Area,ConditionMeshTerm,InterventionName,LocationCity,LocationState,LocationCountry
2,NCT02347995,Neurology,stroke,"[Protein, Placebo]",[Baltimore],[Maryland],[United States]
4,NCT02344485,Occupational Diseases,parkinson disease,[OMM treatment],[Old Westbury],[New York],[United States]
6,NCT02342444,Cardiology/Vascular Diseases,thrombosis,"[Enoxaparin Sodium Injection 30 mg BID, Enoxap...",[Portland],[Oregon],[United States]
8,NCT02337634,Psychiatry/Psychology,gambling,"[Placebo, Milk Thistle]",[Chicago],[Illinois],[United States]
16,NCT02332369,Ophthalmology,cataract,[Capsular Tension Ring],"[Venice, Washington, Lake Jackson]","[Florida, Missouri, Texas]","[United States, United States, United States]"
