##### COVID-19 DREAM challenge - Mapping OMOP to disease and phenotype terms

The goal of this notebook is to map OMOP terms to disease and phenotype terms from the Mondo disease ontology and human phenotype ontology.  

###### Mapping to Mondo and HPO
- Mondo stores OMOP codes as COHD iris  
- HPO stores SnomedCT mappings, which is often the underlying vocab in OMOP


Composed disease phenotype terms:  
OMOP includes terms that capture both phenotypes and disorders; 

"Ascites due to alcoholic cirrhosis"  
Should be mapped to  
- Ascites: HP:0001541  
- Alcoholic liver cirrhosis: MONDO:0006644  

Multiple Phenotypes, eg:  
"Macular edema and retinopathy due to type 2 diabetes mellitus"  
- Macular edema: HP:0040049   
- Retinopathy: HP:0000488   
- Type 2 Diabetes Mellitus: MONDO:0005148   

Disease with phenotype,  
"Gastro-esophageal reflux disease with esophagitis"  


Sometimes the phenotype is generic, like "complication", such as "Complication due to Crohn's disease"

Other modifiers on diseases/phenotypes:  
- Uncomplicated
- moderate
- persistent
- Acute exacerbation of
- Acute/Chronic

Although in many cases these are also part of HPO/Mondo terms:
- Moderate proteinuria
- Moderate hypermetropia
- Persistant truncus arteriosus 
- chronic kidney disease

But sometimes these are missing in our ontologies, eg "Uncomplicated asthma"


###### Types of condition concepts
condition_type_concept_id:  
```cut -f8 -d ',' condition_occurrence.csv | sort | uniq -c```

EHR billing diagnosis (32019): 1920741  
EHR encounter diagnosis (32020): 802911  
EHR Chief Complaint (42894222): 9767  
Observation recorded from EHR (43542353): 623689  


Dependencies:
Running make in the resources dir to get ontology mappings

Useful resources: 
COHD API: http://cohd.smart-api.info/#/OMOP/concepts  
Monarch API: https://api.monarchinitiative.org/  


##### Phenotype(s) + Disease - lexical matching?

Since solr does not do p-values or [percent scores](https://cwiki.apache.org/confluence/display/lucene/ScoresAsPercentages), we should be fairly strict in the search params and potentially hand curate the results.


#https://solr.monarchinitiative.org/solr/search/select/?q=%22crohn%27s%20disease%22&defType=edismax&wt=json&qf=label_kw^4&qf=synonym_kw^2




In [1]:
import pandas as pd
from pathlib import Path


# Files
data_dir = Path('/media/kshefchek/data/synthetic_data')

test_results_fp = data_dir / 'goldstandard.csv'
condition_era_fp = data_dir / 'condition_era.csv'
data_dictionary_fp = data_dir / 'data_dictionary.csv'
person_fp = data_dir / 'person.csv'

condition_era = pd.read_csv(condition_era_fp)
test_results = pd.read_csv(test_results_fp)
data_dictionary = pd.read_csv(data_dictionary_fp)
person = pd.read_csv(person_fp)


with_labels = pd.merge(
    condition_era, data_dictionary, left_on="condition_concept_id", right_on='concept_id', how='left'
)

with_conditions = pd.merge(with_labels, test_results, on='person_id')

conditions = pd.merge(with_conditions, person, on='person_id')

conditions[conditions['concept_name'] == "Nausea"].to_csv('foo.tsv', sep='\t')

# How synthetic is the data?
conditions[(conditions['person_id'] == '500') & (conditions['gender_source_value'] == 'F')]

Unnamed: 0,condition_era_id,person_id,condition_concept_id,condition_era_start_date,condition_era_end_date,condition_occurrence_count,concept_id,concept_name,table,status,...,location_id,provider_id,care_site_id,person_source_value,gender_source_value,gender_source_concept_id,race_source_value,race_source_concept_id,ethnicity_source_value,ethnicity_source_concept_id
135955,135956,500,24134,2015-12-31,2016-01-10,1,24134,Neck pain,condition_occurrence,0.0,...,2296,,,,F,0,,10,,29
135956,135957,500,24818,2017-03-29,2017-04-02,1,24818,Injury of neck,condition_occurrence,0.0,...,2296,,,,F,0,,10,,29
135957,135958,500,29735,2011-08-31,2012-01-29,2,29735,Candidiasis of mouth,condition_occurrence,0.0,...,2296,,,,F,0,,10,,29
135958,135959,500,30437,2020-02-25,2020-03-02,1,30437,Gastro-esophageal reflux disease with esophagitis,condition_occurrence,0.0,...,2296,,,,F,0,,10,,29
135959,135960,500,30753,2020-02-27,2020-02-29,1,30753,Esophagitis,condition_occurrence,0.0,...,2296,,,,F,0,,10,,29
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
136235,136236,500,43530691,2011-12-30,2012-01-04,1,43530691,Sequela of intracranial injury,condition_occurrence,0.0,...,2296,,,,F,0,,10,,29
136236,136237,500,43531003,2016-01-26,2016-01-31,1,43531003,Essential tremor,condition_occurrence,0.0,...,2296,,,,F,0,,10,,29
136237,136238,500,45768812,2014-10-12,2014-10-18,1,45768812,Anemia in chronic kidney disease,condition_occurrence,0.0,...,2296,,,,F,0,,10,,29
136238,136239,500,46269757,2011-11-08,2011-11-10,1,46269757,Chronic ulcer of skin of lower leg,condition_occurrence,0.0,...,2296,,,,F,0,,10,,29
