# OMOP Common Data Model

Electronic health data is acquired and stored in many different ways. A common, widely applicable data model makes such data easier to analyze and enables the joint analysis of multiple datasets.

This is the purpose of the OMOP Common Data Model (CDM). In the words of [The Book of OHDSI, Chapter 4: The Common Data Model](https://ohdsi.github.io/TheBookOfOhdsi/CommonDataModel.html):

> Why do we need a Common Data Model for observational healthcare data?

> Depending on their primary needs none of the observational databases capture all clinical events equally well. Therefore, research results must be drawn from many disparate data sources and compared and contrasted to understand the effect of potential capture bias. In addition, in order to draw conclusions with statistical power we need large numbers of observed patients. That explains the need for assessing and analyzing multiple data sources concurrently. In order to do that, data need to be harmonized into a common data standard. In addition, patient data require a high level of protection. To extract data for analysis purposes as it is done traditionally requires strict data use agreements and complex access control. A common data standard can alleviate this need by omitting the extraction step and allowing a standardized analytic to be executed on the data in it’s native environment - the analytic comes to the data instead of the data to the analytic.

The CDM contains 16 Clinical Event tables, 10 Vocabulary tables, 2 metadata tables, 4 health system data tables, 2 health economics data tables, 3 standardized derived elements, and 2 Results schema tables.

Here, we walk through the key aspects of the data model with some examples.

In particular, we will use a [publicly available demo subsample of the MIMIC-IV dataset](https://physionet.org/content/mimic-iv-demo-omop/0.9/) containing 100 patients, converted to the OMOP CDM.

See ATHENA for how to map source codes to standard concepts.

Maybe use this to explain what is supported, what is not, and how. For measurements, for example, mention: "You can convert different units to a consistent one using `ed.omop.extract_measurements(adata, measurement_units="SI")` or `extract_measurements(adata, measurement_units={concept_id_for_blood_pressure:"mmHg"})` etc. In unclear cases, e.g. counts per nL, mL, or mm^3, it will raise an issue and ask you to specify the unit."

"Payment details are not supported but feel free to open a PR doing that if you're an expert on this"

In [2]:
import pandas as pd

data_path = "/Users/eljas.roellin/Documents/ehrapy_workspace/mimic-iv-demo-data-in-the-omop-common-data-model-0.9/1_omop_data_csv"

## Introduction to Vocabularies


We introduce the vocabularies using the first three vocabulary tables of the OMOP CDM.

### 0.1 Concept

Purpose: Clinical events in OMOP are expressed as concepts, the fundamental building blocks of data records. OMOP gathers concepts from many existing vocabularies, such as [ICD10](https://www.icd-code.de/) from the WHO and [SNOMED](https://www.snomed.org/). The OMOP CDM contains a very large number of concepts; the concepts actually used in a specific dataset are listed in this table of the database.

- Example:

In [10]:
pd.read_csv(data_path + "/2b_concept.csv").head(2)

Unnamed: 0,concept_id,concept_name,domain_id,vocabulary_id,concept_class_id,standard_concept,concept_code,valid_start_DATE,valid_end_DATE,invalid_reason
0,2000011360,Caffeine Citrate 1 Syringe,Drug,mimiciv_drug_ndc,Prescription Drug,,Caffeine Citrate 1 Syringe,1970-01-01,2099-12-31,
1,2000010536,Acetaminophen,Drug,mimiciv_drug_ndc,Prescription Drug,,ACETAMINOPHEN (RECTAL),1970-01-01,2099-12-31,



In more detail, consider the example entry

| **CONCEPT_ID**       | 313217           |
|----------------------|------------------|
| **CONCEPT_NAME**     | Atrial fibrillation |
| **DOMAIN_ID**        | Condition         |
| **VOCABULARY_ID**    | SNOMED            |
| **CONCEPT_CLASS_ID** | Clinical Finding  |
| **STANDARD_CONCEPT** | S                |
| **CONCEPT_CODE**     | 49436004          |
| **VALID_START_DATE** | 01-Jan-1970       |
| **VALID_END_DATE**   | 31-Dec-2099       |
| **INVALID_REASON**   |                  |

- `concept_id`: the unique ID of a concept within the OMOP CDM.
- `concept_name`: a descriptive name (from the source vocabulary).
- `domain_id`: each concept is assigned to a "Domain" in the OMOP CDM, such as “Condition,” “Drug,” “Procedure,” “Visit,” “Device,” or “Specimen”. There are about 44 domains.
- `vocabulary_id`: an identifier of the vocabulary this concept stems from.
- `concept_class_id`: the class of the concept within its vocabulary.
- `standard_concept`: "S" if this is a standard concept, empty if it is not. Within OMOP, each clinical meaning is represented by exactly one standard concept, selected from one vocabulary. Several vocabularies may contain a corresponding concept, but only one is designated as the standard by the OMOP CDM. For example, SNOMED code 49436004 and ICD9CM code 427.31 both define "Atrial fibrillation", but only the SNOMED concept is the standard. Only standard concepts should be used in CDM fields ending with "_concept_id". Source data often uses codes from other vocabularies; the process of converting these to standard concept IDs is called mapping.
- `concept_code`: the code of this concept in its source vocabulary.
- `valid_start_date`: date when this concept was added to its source vocabulary (set to 1970-01-01 if unknown).
- `valid_end_date`: date when this concept was deprecated in its source vocabulary. This can happen as vocabularies are reviewed on a regular basis, e.g. for discovered ambiguity or duplicates. Concepts that are still valid carry the far-future default 2099-12-31, as in the example above.
- `invalid_reason`: (Optional) reason why this concept was deemed invalid.
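
As a minimal sketch of how the `standard_concept` flag can be used, the snippet below rebuilds three concept rows from the examples above by hand (rather than reading the CSV) and separates standard from non-standard concepts. The selection of columns is ours and only for illustration.

```python
import pandas as pd

# Three rows copied from the examples above: two MIMIC-specific source
# concepts and the standard SNOMED concept for atrial fibrillation.
concept = pd.DataFrame(
    {
        "concept_id": [2000011360, 2000010536, 313217],
        "concept_name": [
            "Caffeine Citrate 1 Syringe",
            "Acetaminophen",
            "Atrial fibrillation",
        ],
        "domain_id": ["Drug", "Drug", "Condition"],
        "vocabulary_id": ["mimiciv_drug_ndc", "mimiciv_drug_ndc", "SNOMED"],
        "standard_concept": [None, None, "S"],
    }
)

# Standard concepts carry the flag "S"; non-standard ones have an empty value.
standard = concept[concept["standard_concept"] == "S"]
non_standard = concept[concept["standard_concept"].isna()]

print(standard[["concept_id", "concept_name", "vocabulary_id"]])
```

The non-standard rows are the ones that need to be mapped before they can be used in fields ending with "_concept_id".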

OHDSI has an online tool for finding the concept you're looking for, called [ATHENA](https://athena.ohdsi.org/search-terms/start). There, you can also download the vocabularies as files.

For more information, see the [Vocabulary Chapter in The Book of OHDSI](https://ohdsi.github.io/TheBookOfOhdsi/StandardizedVocabularies.html).

### 0.2 Concept Relationship
Any two concepts can have a relationship with each other. The two most common relationships are "Maps to" and its reverse, "Mapped from", which link a non-standard concept from the source database to a standard concept in the CDM.

In [11]:
pd.read_csv(data_path + "/2b_concept_relationship.csv").head(2)

Unnamed: 0,concept_id_1,concept_id_2,relationship_id,valid_start_DATE,valid_end_DATE,invalid_reason
0,2000003069,4022792,Maps to,1970-01-01,2099-12-31,
1,2000010663,40164921,Maps to,1970-01-01,2099-12-31,
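
To illustrate what a "Maps to" relationship is used for, here is a small sketch that translates source concept IDs into standard concept IDs with a join. The two relationship rows are copied from the output above; the `source_records` table and its column name are hypothetical and stand in for any table that still carries source concept IDs.

```python
import pandas as pd

# The two relationship rows shown above.
concept_relationship = pd.DataFrame(
    {
        "concept_id_1": [2000003069, 2000010663],
        "concept_id_2": [4022792, 40164921],
        "relationship_id": ["Maps to", "Maps to"],
    }
)

# Hypothetical records that reference source (non-standard) concepts.
source_records = pd.DataFrame(
    {"source_concept_id": [2000010663, 2000003069, 2000010663]}
)

# Keep only "Maps to" rows, then look up the standard concept for each record.
maps_to = concept_relationship[concept_relationship["relationship_id"] == "Maps to"]
mapped = source_records.merge(
    maps_to[["concept_id_1", "concept_id_2"]],
    left_on="source_concept_id",
    right_on="concept_id_1",
    how="left",
).rename(columns={"concept_id_2": "standard_concept_id"})

print(mapped[["source_concept_id", "standard_concept_id"]])
```

A left join keeps records without a mapping; for those, `standard_concept_id` would be missing and would need to be handled explicitly.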


The vocabulary chapter of The Book of OHDSI gives an example of a mapping that loses some detail:

> “Equivalent concepts” means it carries the same meaning, and, importantly, the hierarchical descendants cover the same semantic space. If an equivalent concept is not available and the concept is not Standard, it is still mapped, but to a slightly broader concept (so-called “up-hill mappings”). For example, ICD10CM W61.51 “Bitten by goose” has no equivalent in the SNOMED vocabulary, which is generally used for standard condition concepts. Instead, it is mapped to SNOMED 217716004 “Peck by bird,” losing the context of the bird being a goose. Up-hill mappings are only used if the loss of information is considered irrelevant to standard research use cases.

### 0.3 Concept Ancestry
(The CONCEPT_ANCESTOR table is built automatically from the hierarchical relationships in the concept relationship table. Not sure whether this should be included.)

### Internal Reference Tables
The tables DOMAIN, VOCABULARY, CONCEPT_CLASS, and RELATIONSHIP duplicate fields already present in CONCEPT and CONCEPT_RELATIONSHIP, and can provide more information through an additional `*_NAME` field.

We omit them here, as they can be created at any stage from the latter two tables.
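
As a rough sketch of that idea, the distinct identifiers of such a reference table can be recovered from CONCEPT with a simple de-duplication. The rows are copied from the concept examples above; note that this only recovers the ID column, so we assume any descriptive name column would have to be filled in separately.

```python
import pandas as pd

# Concept rows copied from the examples above (only the columns needed here).
concept = pd.DataFrame(
    {
        "concept_id": [2000011360, 2000010536, 313217],
        "domain_id": ["Drug", "Drug", "Condition"],
        "vocabulary_id": ["mimiciv_drug_ndc", "mimiciv_drug_ndc", "SNOMED"],
    }
)

# One row per distinct domain / vocabulary used in the dataset.
domain = (
    concept[["domain_id"]]
    .drop_duplicates()
    .sort_values("domain_id")
    .reset_index(drop=True)
)
vocabulary = (
    concept[["vocabulary_id"]]
    .drop_duplicates()
    .sort_values("vocabulary_id")
    .reset_index(drop=True)
)

print(domain)
print(vocabulary)
```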

### 1. Person

- Purpose: Contains demographic information about each patient.
- Key Fields: person_id, gender_concept_id, year_of_birth, race_concept_id, ethnicity_concept_id
- Example Row:

In [5]:
pd.read_csv(data_path + "/person.csv").head(2)

Unnamed: 0,person_id,gender_concept_id,year_of_birth,month_of_birth,day_of_birth,birth_datetime,race_concept_id,ethnicity_concept_id,location_id,provider_id,care_site_id,person_source_value,gender_source_value,gender_source_concept_id,race_source_value,race_source_concept_id,ethnicity_source_value,ethnicity_source_concept_id
0,3589912774911670296,8507,2095,,,,0,38003563,,,,10009628,M,0,,0,HISPANIC/LATINO,2000001408
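
The demographic fields are stored as concept IDs, so they need to be decoded to become readable. The sketch below does this for the row above with a small hand-written lookup. The concept names are given as we recall them from ATHENA and should be treated as an assumption; in practice, the lookup would come from the CONCEPT table.

```python
import pandas as pd

# The person row shown above, reduced to the demographic concept fields.
person = pd.DataFrame(
    {
        "person_id": [3589912774911670296],
        "gender_concept_id": [8507],
        "race_concept_id": [0],
        "ethnicity_concept_id": [38003563],
    }
)

# Hand-written lookup (assumed names; normally taken from the CONCEPT table).
concept_names = pd.Series(
    {0: "No matching concept", 8507: "MALE", 38003563: "Hispanic or Latino"}
)

# Add one readable column per concept field, e.g. gender_concept_id -> gender.
for column in ["gender_concept_id", "race_concept_id", "ethnicity_concept_id"]:
    person[column.replace("_concept_id", "")] = person[column].map(concept_names)

print(person[["person_id", "gender", "race", "ethnicity"]])
```

The decoded values agree with the source values in the row above (`M` and `HISPANIC/LATINO`); the race concept ID of 0 indicates that no standard concept was assigned.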


### 2. Observation Period
Purpose: Defines periods of time during which the patient’s data is considered reliable and available.

OMOP CDM: "This table contains records which define spans of time during which two conditions are expected to hold: (i) Clinical Events that happened to the Person are recorded in the Event tables, and (ii) absence of records indicate such Events did not occur during this span of time."

"Each Person needs to have at least one OBSERVATION_PERIOD record, which should represent time intervals with a high capture rate of Clinical Events. Some source data have very similar concepts, such as enrollment periods in insurance claims data. In other source data such as most EHR systems these time spans need to be inferred under a set of assumptions. It is the discretion of the ETL developer to define these assumptions."

- Key Fields: observation_period_id, person_id, observation_period_start_date, observation_period_end_date
- Example Row:

In [6]:
pd.read_csv(data_path + "/observation_period.csv").head(2)

Unnamed: 0,observation_period_id,person_id,observation_period_start_date,observation_period_end_date,period_type_concept_id
0,-422211212329812262,-7391666713304457659,2110-11-30,2110-12-10,32828
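
A typical use of this table is to check whether an event falls inside a person's observation period. The following sketch computes the length of the period shown above and tests two event dates against it; the event dates are made up for illustration.

```python
import pandas as pd

# The observation period row shown above.
observation_period = pd.DataFrame(
    {
        "person_id": [-7391666713304457659],
        "observation_period_start_date": pd.to_datetime(["2110-11-30"]),
        "observation_period_end_date": pd.to_datetime(["2110-12-10"]),
    }
)

# Length of the period in days.
observation_period["length_days"] = (
    observation_period["observation_period_end_date"]
    - observation_period["observation_period_start_date"]
).dt.days

# Hypothetical event dates: one inside the period, one after it.
event_dates = pd.Series(pd.to_datetime(["2110-12-01", "2111-01-15"]))
period = observation_period.iloc[0]
covered = event_dates.between(
    period["observation_period_start_date"],
    period["observation_period_end_date"],
)

print(observation_period["length_days"].tolist(), covered.tolist())
```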


### 3. Visit Occurrence

"This table contains Events where Persons engage with the healthcare system for a duration of time. They are often also called “Encounters”. Visits are defined by a configuration of circumstances under which they occur, such as (i) whether the patient comes to a healthcare institution, the other way around, or the interaction is remote, (ii) whether and what kind of trained medical staff is delivering the service during the Visit, and (iii) whether the Visit is transient or for a longer period involving a stay in bed."

"The Visit duration, or ‘length of stay’, is defined as VISIT_END_DATE - VISIT_START_DATE. For all Visits this is <1 day, except Inpatient Visits and Non-hospital institution Visits."

- Purpose: Captures information about healthcare encounters or visits.
- Key Fields: visit_occurrence_id, person_id, visit_concept_id, visit_start_date, visit_end_date
- Example Row:

In [7]:
pd.read_csv(data_path + "/visit_occurrence.csv").head(2)

Unnamed: 0,visit_occurrence_id,person_id,visit_concept_id,visit_start_date,visit_start_datetime,visit_end_date,visit_end_datetime,visit_type_concept_id,provider_id,care_site_id,visit_source_value,visit_source_concept_id,admitting_source_concept_id,admitting_source_value,discharge_to_concept_id,discharge_to_source_value,preceding_visit_occurrence_id
0,-4406053801395356975,4783904755296699562,38004207,2112-11-06,2112-11-06 11:05:00,2112-11-06,2112-11-06 11:05:00,32817,,,10035631|2112-11-06,2000001801,,,,,-3.100296e+18
1,2636026522589494723,-6225647829918357531,38004207,2153-10-17,2153-10-17 14:23:00,2153-10-17,2153-10-17 14:23:00,32817,,,10019003|2153-10-17,2000001801,,,,,-2.238366e+18
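
The length-of-stay definition quoted above can be computed directly from the two date columns. This sketch uses the two visits shown in the output, both of which start and end on the same day.

```python
import pandas as pd

# The two visits shown above, reduced to the columns needed here.
visit_occurrence = pd.DataFrame(
    {
        "visit_occurrence_id": [-4406053801395356975, 2636026522589494723],
        "visit_start_date": pd.to_datetime(["2112-11-06", "2153-10-17"]),
        "visit_end_date": pd.to_datetime(["2112-11-06", "2153-10-17"]),
    }
)

# Length of stay = VISIT_END_DATE - VISIT_START_DATE, in whole days.
visit_occurrence["length_of_stay_days"] = (
    visit_occurrence["visit_end_date"] - visit_occurrence["visit_start_date"]
).dt.days

print(visit_occurrence[["visit_occurrence_id", "length_of_stay_days"]])
```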


### 4. Visit Detail (OPTIONAL)
- Purpose: Provides more detail on a visit, such as movements between units during an inpatient stay. Each entry in visit_occurrence can have zero or more entries in visit_detail.
- Key Fields: visit_detail_id, person_id, visit_detail_concept_id, visit_detail_start_date, visit_detail_end_date


In [8]:
pd.read_csv(data_path + "/visit_detail.csv").head(2)

Unnamed: 0,visit_detail_id,person_id,visit_detail_concept_id,visit_detail_start_date,visit_detail_start_datetime,visit_detail_end_date,visit_detail_end_datetime,visit_detail_type_concept_id,provider_id,care_site_id,admitting_source_concept_id,discharge_to_concept_id,preceding_visit_detail_id,visit_detail_source_value,visit_detail_source_concept_id,admitting_source_value,discharge_to_source_value,visit_detail_parent_id,visit_occurrence_id
0,-1757828362327778468,3129727379702505063,8870,2197-04-17,2197-04-17 09:48:00,2197-04-17,2197-04-17 11:44:19,32817,,-3.63344e+18,8870.0,,,10002930|25282382|38481760,2000001903,EMERGENCY ROOM,,,-9127810274408915712
1,-4357165027259445573,3129727379702505063,8870,2197-04-16,2197-04-16 22:57:00,2197-04-17,2197-04-17 09:48:00,32817,,6.888076e+18,8870.0,,,10002930|25282382|35169671,2000001901,EMERGENCY ROOM,,,-9127810274408915712
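
Because several visit details can belong to one visit, a common step is to group them by `visit_occurrence_id` and order them in time. The sketch below does this for the two rows shown above, which belong to the same visit.

```python
import pandas as pd

# The two visit detail rows shown above, reduced to the columns needed here.
visit_detail = pd.DataFrame(
    {
        "visit_detail_id": [-1757828362327778468, -4357165027259445573],
        "visit_occurrence_id": [-9127810274408915712, -9127810274408915712],
        "visit_detail_start_datetime": pd.to_datetime(
            ["2197-04-17 09:48:00", "2197-04-16 22:57:00"]
        ),
        "visit_detail_end_datetime": pd.to_datetime(
            ["2197-04-17 11:44:19", "2197-04-17 09:48:00"]
        ),
    }
)

# Chronological order of the details within the visit.
ordered = visit_detail.sort_values("visit_detail_start_datetime").reset_index(drop=True)

# Number of visit details per visit.
details_per_visit = visit_detail.groupby("visit_occurrence_id").size()

print(ordered[["visit_detail_id", "visit_detail_start_datetime"]])
print(details_per_visit)
```

In this example, the earlier detail ends at the same time the later one starts, which is consistent with a patient moving from one unit to the next.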


### 5. Condition Occurrence

"This table contains records of Events of a Person suggesting the presence of a disease or medical condition stated as a diagnosis, a sign, or a symptom, which is either observed by a Provider or reported by the patient."

"Conditions are defined by Concepts from the Condition domain, which form a complex hierarchy. As a result, the same Person with the same disease may have multiple Condition records, which belong to the same hierarchical family. Most Condition records are mapped from diagnostic codes, but recorded signs, symptoms and summary descriptions also contribute to this table."

"Conditions span a time interval from start to end, but are typically recorded as single snapshot records with no end date. The reason is twofold: (i) At the time of the recording the duration is not known and later not recorded, and (ii) the Persons typically cease interacting with the healthcare system when they feel better, which leads to incomplete capture of resolved Conditions."

- Purpose: Stores information about medical conditions diagnosed or observed during visits.
- Key Fields: condition_occurrence_id, person_id, condition_concept_id, condition_start_date

In [9]:
pd.read_csv(data_path + "/condition_occurrence.csv").head(2)

Unnamed: 0,condition_occurrence_id,person_id,condition_concept_id,condition_start_date,condition_start_datetime,condition_end_date,condition_end_datetime,condition_type_concept_id,stop_reason,provider_id,visit_occurrence_id,visit_detail_id,condition_source_value,condition_source_concept_id,condition_status_source_value,condition_status_concept_id
0,7000818053728441484,1741351032930224901,196523,2179-07-24,2179-07-24 18:21:00,2179-07-28,2179-07-28 15:54:00,32821,,,-5779522865065417426,,78791,44824628,,
1,-3514320024333679102,1741351032930224901,436659,2179-07-24,2179-07-24 18:21:00,2179-07-28,2179-07-28 15:54:00,32821,,,-5779522865065417426,,2809,44828816,,


### 6. Drug Exposure

"The purpose of records in this table is to indicate an exposure to a certain drug as best as possible. In this context a drug is defined as an active ingredient. Drug Exposures are defined by Concepts from the Drug domain, which form a complex hierarchy. As a result, one DRUG_SOURCE_CONCEPT_ID may map to multiple standard concept ids if it is a combination product."

There are conventions for how to estimate the end date of an exposure.
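
As a simplified sketch of such a convention, the snippet below falls back from a recorded end date to an end date derived from the days of supply, and finally to the start date. This ordering reflects our reading of the CDM documentation and is an assumption, not a complete implementation; the three example rows are made up, although `verbatim_end_date` and `days_supply` are fields of the DRUG_EXPOSURE table.

```python
import pandas as pd

# Hypothetical exposures: one with days_supply, one with a recorded end date,
# and one with neither.
drug_exposure = pd.DataFrame(
    {
        "drug_exposure_start_date": pd.to_datetime(
            ["2177-07-16", "2177-07-17", "2177-07-18"]
        ),
        "verbatim_end_date": pd.to_datetime([None, "2177-07-20", None]),
        "days_supply": [7, None, None],
    }
)

start = drug_exposure["drug_exposure_start_date"]

# Assumed convention: end date = start date + days_supply - 1 day.
from_supply = start + pd.to_timedelta(drug_exposure["days_supply"] - 1, unit="D")

# Prefer the recorded end date, then the supply-based estimate, then the start date.
estimated_end = drug_exposure["verbatim_end_date"].fillna(from_supply).fillna(start)

print(estimated_end)
```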

- Purpose: Tracks medications prescribed and administered to patients.
- Key Fields: drug_exposure_id, person_id, drug_concept_id, drug_exposure_start_date, drug_exposure_end_date


In [12]:
pd.read_csv(data_path + "/drug_exposure.csv").head(2)

Unnamed: 0,drug_exposure_id,person_id,drug_concept_id,drug_exposure_start_date,drug_exposure_start_datetime,drug_exposure_end_date,drug_exposure_end_datetime,verbatim_end_date,drug_type_concept_id,stop_reason,...,sig,route_concept_id,lot_number,provider_id,visit_occurrence_id,visit_detail_id,drug_source_value,drug_source_concept_id,route_source_value,dose_unit_source_value
0,294884377115777655,1741351032930224901,40166274,2177-07-16,2177-07-16 22:00:00,2177-07-17,2177-07-17 21:00:00,,32838,,...,,4142048,,,3736965967695233281,,2751001,45144375,SC,VIAL
1,-3609243742606366340,1741351032930224901,40166274,2177-07-17,2177-07-17 19:00:00,2177-07-18,2177-07-18 18:00:00,,32838,,...,,4142048,,,3736965967695233281,,2751001,45144375,SC,VIAL


### 7. Procedure Occurrence

"This table contains records of activities or processes ordered by, or carried out by, a healthcare provider on the patient with a diagnostic or therapeutic purpose."

"Lab tests are not a procedure, if something is observed with an expected resulting amount and unit then it should be a measurement."
- Purpose: Records procedures performed on patients.
- Key Fields: procedure_occurrence_id, person_id, procedure_concept_id, procedure_date

In [13]:
pd.read_csv(data_path + "/procedure_occurrence.csv").head(2)

Unnamed: 0,procedure_occurrence_id,person_id,procedure_concept_id,procedure_date,procedure_datetime,procedure_type_concept_id,modifier_concept_id,quantity,provider_id,visit_occurrence_id,visit_detail_id,procedure_source_value,procedure_source_concept_id,modifier_source_value
0,-6348795981381799385,4783904755296699562,2102720,2113-07-18,2113-07-18 14:55:00,32821,0,1.0,,-433474223361412760,,19301,2102720,
1,7881544392229438243,7918537411740862407,2102732,2129-10-30,2129-10-30 13:20:00,32821,0,1.0,,7730200099818586525,,19303,2102732,


### 8. Device Exposure
"The Device domain captures information about a person’s exposure to a foreign physical object or instrument which is used for diagnostic or therapeutic purposes through a mechanism beyond chemical action. Devices include implantable objects (e.g. pacemakers, stents, artificial joints), medical equipment and supplies (e.g. bandages, crutches, syringes), other instruments used in medical procedures (e.g. sutures, defibrillators) and material used in clinical care (e.g. adhesives, body material, dental material, surgical material)."

- Key Fields: device_exposure_id, person_id, device_concept_id, device_exposure_start_date, device_type_concept_id

In [14]:
pd.read_csv(data_path + "/device_exposure.csv").head(2)

Unnamed: 0,device_exposure_id,person_id,device_concept_id,device_exposure_start_date,device_exposure_start_datetime,device_exposure_end_date,device_exposure_end_datetime,device_type_concept_id,unique_device_id,quantity,provider_id,visit_occurrence_id,visit_detail_id,device_source_value,device_source_concept_id
0,-6080797302205697410,-626229666378242477,45768171,2171-11-14,2171-11-14 13:00:00,2171-11-14,2171-11-14 13:00:00,32817,,,,-1486256937339039377,,224087,2000030021
1,-5973025877998015677,-626229666378242477,45768171,2171-11-14,2171-11-14 20:00:00,2171-11-14,2171-11-14 20:00:00,32817,,,,-1486256937339039377,,224087,2000030021


### 9. Measurement

- Purpose: Captures clinical measurements or laboratory test results.
- Key Fields: measurement_id, person_id, measurement_concept_id, measurement_date, value_as_number

In [15]:
pd.read_csv(data_path + "/measurement.csv").head(2)


Unnamed: 0,measurement_id,person_id,measurement_concept_id,measurement_date,measurement_datetime,measurement_time,measurement_type_concept_id,operator_concept_id,value_as_number,value_as_concept_id,unit_concept_id,range_low,range_high,provider_id,visit_occurrence_id,visit_detail_id,measurement_source_value,measurement_source_concept_id,unit_source_value,value_source_value
0,7620661609057829801,-7437341330444582833,3007913,2113-09-14,2113-09-14 10:41:00,,32856,,586.0,,8876.0,,,,3697313480337443666,,50801,2000001000.0,mm Hg,586
1,-6166868866082303206,-2312013739856114142,3012501,2116-07-05,2116-07-05 05:51:00,,32856,,-4.0,,9557.0,,,,-5005846256467230136,,50802,2000001000.0,mEq/L,-4
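
Measurement values are only comparable once they share a unit. The sketch below converts the `mm Hg` value from the output above to kPa using a hand-written conversion table, and leaves values with an unknown unit as missing. The table and the `value_kpa` column are our own illustration and not part of the OMOP CDM or of any library API.

```python
import pandas as pd

# The two measurement rows shown above, reduced to the columns needed here.
measurement = pd.DataFrame(
    {
        "measurement_id": [7620661609057829801, -6166868866082303206],
        "value_as_number": [586.0, -4.0],
        "unit_source_value": ["mm Hg", "mEq/L"],
    }
)

# Hypothetical conversion factors to kPa (1 mm Hg = 0.133322 kPa).
TO_KPA = {"mm Hg": 0.133322}

# Units without a known factor map to NaN, so their converted value is missing.
factor = measurement["unit_source_value"].map(TO_KPA)
measurement["value_kpa"] = measurement["value_as_number"] * factor

print(measurement[["value_as_number", "unit_source_value", "value_kpa"]])
```

Leaving unconvertible values missing, rather than guessing, makes it visible which units still need an explicit decision.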


### 10. Observation

"Observations differ from Measurements in that they do not require a standardized test or some other activity to generate clinical fact. Typical observations are medical history, family history, the stated need for certain treatment, social circumstances, lifestyle choices, healthcare utilization patterns, etc. If the generation clinical facts requires a standardized testing such as lab testing or imaging and leads to a standardized result, the data item is recorded in the MEASUREMENT table."

- Purpose: Stores observations that do not fit into other tables, such as social history or patient-reported outcomes.
- Key Fields: observation_id, person_id, observation_concept_id, observation_date, value_as_string

In [17]:
pd.read_csv(data_path + "/observation.csv").head(2)

Unnamed: 0,observation_id,person_id,observation_concept_id,observation_date,observation_datetime,observation_type_concept_id,value_as_number,value_as_string,value_as_concept_id,qualifier_concept_id,unit_concept_id,provider_id,visit_occurrence_id,visit_detail_id,observation_source_value,observation_source_concept_id,unit_source_value,qualifier_source_value
0,-7221215846302395378,7131048714591189903,4215685,2189-05-22,2189-05-22 23:18:00,32821,,,,,,,1832418602869683763,,V4972,44827368,,
1,-9113461341871004757,7131048714591189903,440922,2189-05-22,2189-05-22 23:18:00,32821,,,,,,,1832418602869683763,,V5867,44820462,,


### 11. Death
- Purpose: Captures information related to patient death.
- Key Fields: person_id, death_date, death_type_concept_id, cause_concept_id

In [18]:
pd.read_csv(data_path + "/death.csv").head(2)

Unnamed: 0,person_id,death_date,death_datetime,death_type_concept_id,cause_concept_id,cause_source_value,cause_source_concept_id
0,-2312013739856114142,2116-07-05,2116-07-05 08:05:00,32817,0,,0
1,-7671795861352464589,2115-10-12,2115-10-12 00:00:00,32817,0,,0
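
Since every event table shares `person_id`, the death table can be joined to other tables to relate events to the time of death. As a small sketch, the snippet below joins the second measurement row shown earlier with the death records above and computes how long before death the measurement was taken.

```python
import pandas as pd

# One measurement row from the Measurement section, reduced to two columns.
measurement = pd.DataFrame(
    {
        "person_id": [-2312013739856114142],
        "measurement_datetime": pd.to_datetime(["2116-07-05 05:51:00"]),
    }
)

# The two death rows shown above.
death = pd.DataFrame(
    {
        "person_id": [-2312013739856114142, -7671795861352464589],
        "death_datetime": pd.to_datetime(
            ["2116-07-05 08:05:00", "2115-10-12 00:00:00"]
        ),
    }
)

# Attach the death time to each measurement of the same person.
merged = measurement.merge(death, on="person_id", how="left")
merged["minutes_before_death"] = (
    merged["death_datetime"] - merged["measurement_datetime"]
).dt.total_seconds() / 60

print(merged[["person_id", "minutes_before_death"]])
```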


### 12. Note
- Purpose: Contains unstructured clinical notes.
- Key Fields: note_id, person_id, note_date, note_text

In [19]:
pd.read_csv(data_path + "/note.csv").head(2)

Unnamed: 0,note_id,person_id,note_date,note_datetime,note_type_concept_id,note_class_concept_id,note_title,note_text,encoding_concept_id,language_concept_id,provider_id,visit_occurrence_id,visit_detail_id,note_source_value


### 13. Note_NLP
- Purpose: Encodes all output of NLP on clinical notes. Each row represents a single extracted term from a note.
- Key Fields: note_nlp_id, note_id, lexical_variant, note_nlp_concept_id

In [20]:
pd.read_csv(data_path + "/note_nlp.csv").head(2)

Unnamed: 0,note_nlp_id,note_id,section_concept_id,snippet,offset,lexical_variant,note_nlp_concept_id,note_nlp_source_concept_id,nlp_system,nlp_date,nlp_datetime,term_exists,term_temporal,term_modifiers


### 14. Specimen
- Purpose: Contains the records identifying biological samples from a person.
- Key Fields: specimen_id, person_id, specimen_concept_id, specimen_date

In [21]:
pd.read_csv(data_path + "/specimen.csv").head(2)

Unnamed: 0,specimen_id,person_id,specimen_concept_id,specimen_type_concept_id,specimen_date,specimen_datetime,quantity,unit_concept_id,anatomic_site_concept_id,disease_status_concept_id,specimen_source_id,specimen_source_value,unit_source_value,anatomic_site_source_value,disease_status_source_value
0,-5102033398575528989,4668337230155062633,4001183,32856,2117-07-16,2117-07-16 10:00:00,,,0,0,"{""subject_id"":10021487,""hadm_id"":20429160,""mic...",70003,,,
1,5035924384215166531,2288881942133868955,4001183,32856,2157-11-20,2157-11-20 12:20:00,,,0,0,"{""subject_id"":10001217,""hadm_id"":24597018,""mic...",70003,,,


TODO continue