## Featurizing the Conditions of the Studies

How can we determine whether two studies are studying the same condition?
- number of tagged conditions (num_matched, num_unmatched)
- MeSH terms (Jaccard distance) - min, max, mean
- MeSH tree location (tree distance) - min, max, mean


- matching conditions (based on model):
    - remove stop words (maybe)
    - condition names (Levenshtein distance using the fuzzywuzzy full ratio)
    - nouns only (Levenshtein distance using the fuzzywuzzy full ratio)
    - condition Bing results (bag of words) - normalized Wasserstein distance on the top X words - min, max, mean (on each condition pair)
    - adjective/verb descriptor distance (such as "chronic") - min, max, mean (on each condition pair)
    - type, grade, stage, AJCC (type1, type2) etc. - min, max, mean (on each condition pair)
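The name-matching idea in the list above can be sketched as follows. This is a minimal pure-Python stand-in, not the fuzzywuzzy scoring itself: `levenshtein`, `name_similarity`, and `pairwise_summary` are hypothetical helper names, and the similarity is simply the edit distance rescaled by the longer string's length. It shows how a per-pair score could be reduced to the min, max, and mean over every cross-study condition pair.

```python
from itertools import product
from statistics import mean


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical after lowercasing."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def pairwise_summary(conds1, conds2):
    """Return (min, max, mean) similarity over every cross-study condition pair.

    Returns None when either study has no condition names (assumed convention).
    """
    scores = [name_similarity(c1, c2) for c1, c2 in product(conds1, conds2)]
    if not scores:
        return None
    return min(scores), max(scores), mean(scores)
```

Calling `pairwise_summary(["chronic asthma"], ["asthma", "Asthma"])` would give one triple per study pair, which can then be used directly as three feature columns.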

In [15]:
import pandas as pd
import numpy as np
from tqdm import tqdm_notebook
from tqdm import tqdm
import datetime as dt
import pickle
from collections import Counter
from importlib import reload

import nltk

import pdaactconn as pc
from trialexplorer.mesh_terms import MeSHCatalog
from trialexplorer import AACTStudySet
from trialexplorer import studysimilarity as ssim

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
tqdm.pandas()

In [5]:
# selecting all interventional studies
conn = pc.AACTConnection(source=pc.AACTConnection.LOCAL)
ss = AACTStudySet.AACTStudySet(conn=conn, tqdm_handler=tqdm_notebook)
ss.add_constraint("study_type = 'Interventional'")
ss.load_studies()

250890 studies loaded!


In [6]:
# loading all dimensional data
ss.add_dimensions('browse_conditions')
ss.add_dimensions('conditions')
ss.refresh_dim_data()

Successfuly added these 1 dimensions: ['browse_conditions']
Failed to add these 0 dimensions: []
Successfuly added these 1 dimensions: ['conditions']
Failed to add these 0 dimensions: []


Syncing the temp table temp_cur_studies in 502 chunks x 500 records each

Creating index on the temp table
 - Loading dimension browse_conditions
 -- Loading raw data
 -- Sorting index
 - Loading dimension conditions
 -- Loading raw data
 -- Sorting index


# 1. Mesh Terms

In [7]:
# initializing the MeSH catalog object
mc = MeSHCatalog()

Parsing MeSH xml: xml/desc2020.xml ...
Parse Complete! (parsed ElementTree root can be found in the .root attribute)


In [8]:
bc = ss.dimensions['browse_conditions']
c = ss.dimensions['conditions']

In [9]:
bc.data.head()

| nct_id | id | mesh_term | downcase_mesh_term |
|---|---|---|---|
| NCT00000102 | 9144384 | Adrenal Hyperplasia, Congenital | adrenal hyperplasia, congenital |
| NCT00000102 | 9144385 | Adrenogenital Syndrome | adrenogenital syndrome |
| NCT00000102 | 9144386 | Adrenocortical Hyperfunction | adrenocortical hyperfunction |
| NCT00000102 | 9144387 | Hyperplasia | hyperplasia |
| NCT00000106 | 9143121 | Rheumatic Diseases | rheumatic diseases |


In [10]:
len(bc.data['mesh_term'].unique())

3738

In [11]:
s_mesh = bc.data.groupby('mesh_term').size().sort_values(ascending=False)

In [17]:
ssim.mesh_jaccard_sim('NCT00000102', 'NCT03323658', bc.data)

0.09090909090909091
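For reference, a Jaccard similarity over two sets of MeSH terms is the size of their intersection divided by the size of their union. The sketch below is an illustration under that definition only; it is not the code of `ssim.mesh_jaccard_sim`, and `jaccard_similarity` and the second study's term list are hypothetical. The first term list is the one shown for NCT00000102 in the table above.

```python
def jaccard_similarity(terms1, terms2):
    """|A ∩ B| / |A ∪ B| for two collections of MeSH terms."""
    s1, s2 = set(terms1), set(terms2)
    union = s1 | s2
    if not union:
        # Assumed convention: two studies with no terms have similarity 0.
        return 0.0
    return len(s1 & s2) / len(union)


# Terms tagged on NCT00000102 (from bc.data.head() above)
study_a = [
    "Adrenal Hyperplasia, Congenital",
    "Adrenogenital Syndrome",
    "Adrenocortical Hyperfunction",
    "Hyperplasia",
]
# Hypothetical second study, for illustration only
study_b = ["Hyperplasia", "Rheumatic Diseases"]

print(jaccard_similarity(study_a, study_b))  # 1 shared term / 5 distinct terms = 0.2
```

The value 0.0909... returned above equals 1/11, which is what this definition would give for, say, one shared term out of eleven distinct terms across the two studies.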

### Computing the Jaccard similarity of 1 study vs. all the others takes approx. 3.5 min

In [None]:
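# Jaccard score of the reference study against every study that has MeSH terms.
# Uses ssim.mesh_jaccard_sim (the function called above), since no
# mesh_jaccard_dist is defined in this notebook.
jaccard_dist = {}
for cur_nct in tqdm(list(bc.data.index.unique())):
    jaccard_dist[cur_nct] = ssim.mesh_jaccard_sim('NCT00000102', cur_nct, bc.data)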

In [None]:
dfjac = pd.DataFrame(jaccard_dist, index=['jdist']).T

In [None]:
dfjac[dfjac['jdist'] > 0].sort_values('jdist', ascending=False)
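If most of the runtime of the loop above comes from looking up each study's terms in the DataFrame on every call (an assumption, not something measured here), one option is to build the term set of every study once and reuse it. A minimal sketch with hypothetical helper names `build_term_sets` and `jaccard_one_vs_all`, working on plain `(nct_id, mesh_term)` pairs:

```python
from collections import defaultdict


def build_term_sets(rows):
    """Group (nct_id, mesh_term) pairs into one set of terms per study."""
    term_sets = defaultdict(set)
    for nct_id, term in rows:
        term_sets[nct_id].add(term)
    return dict(term_sets)


def jaccard_one_vs_all(ref_id, term_sets):
    """Jaccard similarity of one reference study against every study."""
    ref = term_sets[ref_id]
    scores = {}
    for nct_id, terms in term_sets.items():
        union = ref | terms
        scores[nct_id] = len(ref & terms) / len(union) if union else 0.0
    return scores


# Toy rows with made-up study ids, for illustration only
rows = [
    ("S1", "Hyperplasia"), ("S1", "Adrenogenital Syndrome"),
    ("S2", "Hyperplasia"),
    ("S3", "Rheumatic Diseases"),
]
scores = jaccard_one_vs_all("S1", build_term_sets(rows))
print(scores)
```

With the notebook's data, the rows could be produced with `zip(bc.data.index, bc.data['mesh_term'])`, assuming `nct_id` is the index as the `head()` output suggests.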

## Computing the min, max, and mean tree distance between the tagged MeSH terms

In [None]:
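def mesh_tree_dist(nctid1, nctid2, data, mc):
    """Compute all pairwise MeSH tree distances between the terms of two studies.

    Returns a (min, max, mean) tuple, or (nan, nan, nan) when either study
    has no tagged MeSH terms.
    """
    # get_mesh_terms is not defined in this notebook; it is assumed to return
    # the MeSH terms of each study (it may live in studysimilarity).
    s1terms, s2terms = get_mesh_terms(nctid1, nctid2, data)

    # tree distance for every cross-study pair of terms
    all_dist = [mc.shortest_mesh_dist(t1, t2) for t1 in s1terms for t2 in s2terms]

    if not all_dist:  # min() and max() raise ValueError on an empty sequence
        return np.nan, np.nan, np.nan

    return min(all_dist), max(all_dist), np.mean(all_dist)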

In [None]:
mesh_tree_dist('NCT00000102', 'NCT03323658', bc.data, mc)
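How `mc.shortest_mesh_dist` measures distance is not shown in this notebook. One common way to define a tree distance on MeSH is to count the edges between two tree numbers through their deepest shared ancestor. The sketch below assumes that definition; `tree_number_dist` and `shortest_term_dist` are hypothetical names, the tree numbers are illustrative strings in the MeSH dotted format, and tree numbers with no shared prefix are treated as joined by a virtual root.

```python
def tree_number_dist(tn1: str, tn2: str) -> int:
    """Edges between two dotted tree numbers via their deepest common ancestor."""
    p1, p2 = tn1.split("."), tn2.split(".")
    shared = 0
    for a, b in zip(p1, p2):
        if a != b:
            break
        shared += 1
    # steps up from each node to the shared ancestor (or to a virtual root)
    return (len(p1) - shared) + (len(p2) - shared)


def shortest_term_dist(tree_numbers1, tree_numbers2) -> int:
    """A term can sit at several tree locations; take the closest pair."""
    return min(tree_number_dist(a, b)
               for a in tree_numbers1 for b in tree_numbers2)


print(tree_number_dist("C19.053.440", "C19.053.800"))  # siblings: 2
print(tree_number_dist("C19.053", "C19.053.440"))      # parent and child: 1
```

Taking the minimum over all tree-number pairs mirrors the "shortest" in the method name, but that correspondence is a guess rather than something the source confirms.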