# FEMR Ontology support

FEMR provides support for querying ontologies using the OMOP Vocabulary. 

This enables easier definition of labeling functions as well as better feature generation.

# Downloading the OMOP Vocabulary

The OMOP Vocabulary can be downloaded for free from the [OHDSI ATHENA website](https://athena.ohdsi.org/).

# Processing the OMOP Vocabulary

`femr.ontology.Ontology` processes the OMOP Vocabulary and lets you query it, optionally combining it with [code metadata from MEDS](https://github.com/Medical-Event-Data-Standard/meds/blob/e93f63a2f9642123c49a31ecffcdb84d877dc54a/src/meds/__init__.py#L94).

```python
import femr.ontology
ontology = femr.ontology.Ontology(path_to_athena, code_metadata)
```

# Working with an Ontology object

The following code samples illustrate the main ways to use an `Ontology` object.

```python
import json
import os
import pickle

import datasets
import femr.ontology

# data_path = 'mimic-iv-demo-meds'
data_path = '/share/pi/nigam/projects/zphuo/data/PE/inspect/timelines_smallfiles_meds'

# Load the MEDS event data from the parquet files
dataset = datasets.Dataset.from_parquet(os.path.join(data_path, 'data/*'))

# Load the MEDS metadata, which holds the code metadata
# with open(os.path.join(data_path, 'metadata.json')) as f:
#     metadata = json.load(f)
with open(os.path.join('/omop_extract_PHI/som-nero-phi-nigam-starr.shahlab_omop_cdm5_subset_2023_03_05_meds', 'metadata.json')) as f:
    metadata = json.load(f)

# Set to False to reuse a previously pickled ontology instead of rebuilding it
if True:
    ontology = femr.ontology.Ontology('athena', metadata['code_metadata'])
    with open('ontology.pkl', 'wb') as f:
        pickle.dump(ontology, f)
else:
    with open('ontology.pkl', 'rb') as f:
        ontology = pickle.load(f)

# Restrict the ontology to the codes that appear in this dataset
print("Starting to process ontology")
ontology.prune_to_dataset(
    dataset,
    num_proc=10,
    prune_all_descriptions=True,
    remove_ontologies={'SPL'}
)
print("Done processing ontology")

# Save the pruned ontology for later use
with open('inspect_ontology.pkl', 'wb') as f:
    pickle.dump(ontology, f)
```
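
The cell above pickles the ontology so that later runs can load it rather than rebuild it. A minimal sketch of the same idea that decides automatically based on whether the cache file exists is shown here. `load_or_build` and `build_stand_in` are hypothetical names, not part of FEMR, and a plain dict stands in for the ontology so that the snippet runs without any Athena files.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build_fn):
    """Load a pickled object from cache_path, or build it and cache it.

    Hypothetical helper for illustration; not part of FEMR.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f)
    return obj


# Stand-in for the expensive ontology construction step.
build_calls = []


def build_stand_in():
    build_calls.append(1)
    return {"ATC/A02B": "stand-in description"}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "ontology.pkl")
    first = load_or_build(cache_path, build_stand_in)   # builds and writes the cache
    second = load_or_build(cache_path, build_stand_in)  # loads from the cache
```

With FEMR, `build_fn` could be something like `lambda: femr.ontology.Ontology('athena', metadata['code_metadata'])`, mirroring the call used above.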

```python
# JSON-Schema-style description of a single code metadata entry
code_metadata_entry = {
    "type": "object",
    "properties": {
        "description": {"type": "string"},
        "parent_codes": {"type": "array", "items": {"type": "string"}},
    },
}

# Code metadata maps each code to one such entry
code_metadata = {
    "type": "object",
    "additionalProperties": code_metadata_entry,
}
```
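
To make the schema above concrete, here is a minimal sketch of a dict that follows it, together with a small hand-written shape check. The `CUSTOM/...` codes, their descriptions, and the `matches_entry_schema` helper are all invented for illustration; they do not come from MEDS or FEMR.

```python
# Hypothetical site-specific codes; names and descriptions are invented.
example_code_metadata = {
    "CUSTOM/antacid_any": {
        "description": "Any antacid order (local code)",
        "parent_codes": ["ATC/A02B"],
    },
    "CUSTOM/unmapped_lab": {
        "description": "Local lab code with no standard parent",
        "parent_codes": [],
    },
}


def matches_entry_schema(entry):
    """Check one entry against the shape described by the schema above."""
    if not isinstance(entry, dict):
        return False
    # Both properties are optional, but must have the right type if present.
    if "description" in entry and not isinstance(entry["description"], str):
        return False
    parent_codes = entry.get("parent_codes", [])
    return isinstance(parent_codes, list) and all(
        isinstance(parent, str) for parent in parent_codes
    )


all_valid = all(
    isinstance(code, str) and matches_entry_schema(entry)
    for code, entry in example_code_metadata.items()
)
print(all_valid)
```

Here the first entry attaches a local code underneath an existing ATC class through `parent_codes`, which is the kind of link that lets custom codes take part in parent and child queries.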

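```python
import femr.ontology

athena_download = '/share/pi/nigam/projects/zphuo/data/omop_extract_deid/athena_download'
# Assign to a name that does not shadow the imported module.
# Note: code_metadata is presumably meant to be the actual code-to-entry mapping
# (such as metadata['code_metadata']), not the schema dict that describes its shape.
ontology = femr.ontology.Ontology(athena_path=athena_download, code_metadata=code_metadata)
```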

Ontologies downloaded from Athena tend to be very large because they contain many codes, including several that are no longer used. FEMR therefore provides a function to prune an ontology to a particular dataset of interest. This makes the ontology object much cheaper to store and use, in terms of both disk space and RAM.

```python
import os

# Set the Hugging Face datasets cache location before importing datasets
os.environ["HF_DATASETS_CACHE"] = '/share/pi/nigam/projects/zphuo/.cache'

import datasets

# dataset = datasets.Dataset.from_parquet("input/meds/data/*")
parquet_folder = '/share/pi/nigam/projects/zphuo/data/PE/inspect/timelines_smallfiles_meds/data/*'
dataset = datasets.Dataset.from_parquet(parquet_folder)
ontology.prune_to_dataset(dataset)
```
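
As a rough illustration of what pruning to a dataset means, the sketch below keeps only the codes that occur in the data plus every ancestor reachable from them, and drops the rest. This is a toy model under that assumption, not FEMR's implementation; `prune_parent_map` and the `TOY/...` codes are hypothetical.

```python
def prune_parent_map(parent_map, used_codes):
    """Keep the used codes and all of their ancestors; drop everything else.

    Toy model for illustration; FEMR's prune_to_dataset may differ in detail.
    """
    keep = set()
    stack = list(used_codes)
    while stack:
        code = stack.pop()
        if code in keep:
            continue
        keep.add(code)
        # Walk upwards so that ancestors of used codes survive pruning.
        stack.extend(parent_map.get(code, []))
    return {code: list(parent_map.get(code, [])) for code in keep}


# Invented hierarchy: two drug classes under one root, one drug in each class.
toy_parent_map = {
    "TOY/root": [],
    "TOY/class_a": ["TOY/root"],
    "TOY/class_b": ["TOY/root"],
    "TOY/drug_1": ["TOY/class_a"],
    "TOY/drug_2": ["TOY/class_b"],
}

# Pretend the dataset only ever mentions TOY/drug_1.
pruned = prune_parent_map(toy_parent_map, {"TOY/drug_1"})
print(sorted(pruned))
```

In this toy case the unused branch (`TOY/class_b` and `TOY/drug_2`) disappears, which is where the savings in disk space and RAM would come from on a full Athena download.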

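```python
import os

# os.system('export ...') runs in a child shell and cannot change this process's
# environment, so set the variable through os.environ instead.
os.environ["HF_HOME"] = "/share/pi/nigam/projects/zphuo/.cache"
print(os.environ.get("HF_HOME"))
print(os.environ.get("TRANSFORMERS_CACHE"))
```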

```python
# First, we can query the description for a particular code
print("Description", ontology.get_description("ATC/A02B"))

# Second, we can search for the parents of a particular code
print("Parents", ontology.get_parents("ATC/A02B"))

# Finally, we can search for the children of a particular code
print("Children", ontology.get_children("ATC/A02B"))

# For convenience, we also support recursive versions of the parent and child queries
print("All children", ontology.get_all_children("ATC/A02B"))
print("All parents", ontology.get_all_parents("ATC/A02B"))
```

```
Description DRUGS FOR PEPTIC ULCER AND GASTRO-OESOPHAGEAL REFLUX DISEASE (GORD)
Parents ['ATC/A02']
Children ['ATC/A02BX']
All children {'ATC/A02BX77', 'RxNorm/4501', 'RxNorm/2403', 'ATC/A02B', 'RxNorm/38574', 'ATC/A02BX', 'RxNorm/6852', 'RxNorm/8730', 'RxNorm/8704', 'RxNorm/7815', 'ATC/A02BX71', 'RxNorm/2018', 'RxNorm/2620', 'RxNorm/7019', 'RxNorm/2017', 'RxNorm/8705', 'RxNorm/2353', 'RxNorm/2344'}
All parents {'ATC/A02B', 'ATC/A02', 'ATC/A'}
```
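
Note that in the output above the recursive queries include the starting code itself: `ATC/A02B` appears in both sets. The sketch below reproduces that behaviour on a toy parent map built only from the `ATC` edges visible in the output, and then uses the result for a labeling-style membership check. `closure`, `all_parents`, and `all_children` are standalone hypothetical functions, not FEMR's implementation.

```python
from collections import defaultdict

# Parent edges read off the example output: A02BX -> A02B -> A02 -> A.
parents = {
    "ATC/A02BX": ["ATC/A02B"],
    "ATC/A02B": ["ATC/A02"],
    "ATC/A02": ["ATC/A"],
    "ATC/A": [],
}

# Invert the parent map to get the child map.
children = defaultdict(list)
for child, parent_list in parents.items():
    for parent in parent_list:
        children[parent].append(child)


def closure(start, neighbors):
    """Return start plus every code reachable by following neighbors."""
    seen = set()
    stack = [start]
    while stack:
        code = stack.pop()
        if code not in seen:
            seen.add(code)
            stack.extend(neighbors.get(code, []))
    return seen


def all_parents(code):
    return closure(code, parents)


def all_children(code):
    return closure(code, children)


# Labeling-style check: does an event code fall under the ATC/A02B class?
target_codes = all_children("ATC/A02B")
event_codes = ["ATC/A02BX", "ATC/A"]
flags = [code in target_codes for code in event_codes]
print(flags)
```

Computing the descendant set once and then testing membership is a common way to define a labeling function or feature over a whole class of codes rather than a single code.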
