### Defining a cohort

A cohort is a table whose rows correspond to unique combinations of `person_id` and `index_date`, where each combination is mapped to a unique `row_id`.
For downstream feature extraction and modeling, the cohort table should also contain additional columns that hold outcome labels and group categories.

Here, we call a set of pre-defined transformations to define a cohort of hospital admissions and extract relevant labels. For details of how this cohort is defined, refer to the source code. In practice, a cohort can be defined arbitrarily, as long as it meets the specification described above and is stored in a table in the database.
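
The specification above can be checked mechanically once a cohort has been loaded into a DataFrame. The following is a minimal sketch, not part of `prediction_utils`: the helper name `check_cohort_spec` and its parameters are hypothetical, and it only tests the two uniqueness conditions stated in the text.

```python
import pandas as pd


def check_cohort_spec(df, row_id_col="row_id", person_col="person_id", date_col="index_date"):
    """Return a list of problems; an empty list means both uniqueness checks pass.

    Hypothetical helper for illustration only, not a prediction_utils function.
    """
    missing = [c for c in (row_id_col, person_col, date_col) if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    # Each (person_id, index_date) combination may appear only once
    if df.duplicated(subset=[person_col, date_col]).any():
        problems.append("duplicate (person_id, index_date) combinations")
    # Each row must carry its own identifier
    if df[row_id_col].duplicated().any():
        problems.append("row_id is not unique")
    return problems


# Toy cohort: one patient with two index dates, another with one
example = pd.DataFrame({
    "row_id": [1, 2, 3],
    "person_id": [10, 10, 11],
    "index_date": pd.to_datetime(["2019-01-01", "2019-06-01", "2019-01-01"]),
})
problems = check_cohort_spec(example)
```

For the cohort built in this vignette, the same check could presumably be run with `row_id_col="prediction_id"` and `date_col="admit_date"`, since those are the column names that appear in the filtered table.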

In [2]:
import os
from prediction_utils.cohorts.admissions.cohort import (
    BQAdmissionRollupCohort, BQAdmissionOutcomeCohort, BQFilterInpatientCohort
)
from prediction_utils.util import patient_split

In [3]:
# Configuration for the extraction
config_dict = {
    'gcloud_project': 'som-nero-phi-nigam-starr',
    'dataset_project': 'som-rit-phi-starr-prod',
    'rs_dataset_project': 'som-nero-phi-nigam-starr',
    'dataset': 'starr_omop_cdm5_deid_1pcent_lite_latest',
    'rs_dataset': 'temp_dataset',
    'cohort_name': 'vignette_cohort',
    'cohort_name_labeled': 'vignette_cohort_labeled',
    'cohort_name_filtered': 'vignette_cohort_filtered',
    'has_birth_datetime': True
}

In [4]:
cohort = BQAdmissionRollupCohort(**config_dict)



In [5]:
# Create the cohort table
cohort.create_cohort_table()

In [6]:
# Let's inspect the cohort
cohort_df = cohort.db.read_sql_query(
    query="SELECT * FROM {rs_dataset_project}.{rs_dataset}.{cohort_name}".format(**config_dict)
)

Downloading: 100%|██████████| 4093/4093 [00:01<00:00, 2054.80rows/s]
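
To make the query above easier to read, the short example below shows what the template expands to. It uses only the three configuration keys that the template references, copied from `config_dict`; `str.format` ignores unused keyword arguments, which is why passing the full dictionary also works.

```python
# Subset of config_dict containing only the keys used by the query template
table_config = {
    "rs_dataset_project": "som-nero-phi-nigam-starr",
    "rs_dataset": "temp_dataset",
    "cohort_name": "vignette_cohort",
}

# Same template as in the cell above
query = "SELECT * FROM {rs_dataset_project}.{rs_dataset}.{cohort_name}".format(**table_config)
print(query)
# SELECT * FROM som-nero-phi-nigam-starr.temp_dataset.vignette_cohort
```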


In [7]:
cohort_df.head()

Unnamed: 0,person_id,admit_date,discharge_date
0,29923656,2019-07-22,2019-07-23
1,29927078,2016-08-15,2016-08-19
2,29927087,2019-08-08,2019-08-11
3,29927561,2015-07-24,2015-07-26
4,29927632,2011-04-17,2011-04-19


In [8]:
# Now let's add some labels
cohort_labeled = BQAdmissionOutcomeCohort(**config_dict)
cohort_labeled.create_cohort_table()



In [9]:
cohort_df_labeled = cohort_labeled.db.read_sql_query(
    query="SELECT * FROM {rs_dataset_project}.{rs_dataset}.{cohort_name_labeled}".format(**config_dict)
)

Downloading: 100%|██████████| 4093/4093 [00:01<00:00, 2301.22rows/s]


In [10]:
cohort_df_labeled.head()

Unnamed: 0,person_id,admit_date,discharge_date,hospital_mortality,month_mortality,LOS_days,LOS_7,readmission_30,age_in_years,age_group,race_eth,gender_concept_name
0,30718792,2018-05-07,2018-05-28,0,0,21,1,0,0,<18,Hispanic or Latino,FEMALE
1,30867687,2017-07-11,2017-08-11,0,0,31,1,1,0,<18,Hispanic or Latino,FEMALE
2,31829531,2015-07-16,2015-08-07,0,0,22,1,0,0,<18,Hispanic or Latino,MALE
3,31942455,2018-11-02,2018-11-26,0,0,24,1,0,0,<18,White,FEMALE
4,31432321,2015-04-21,2015-05-31,0,0,40,1,1,0,<18,White,MALE
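
As a rough illustration of what some of these label columns represent, the sketch below derives `LOS_days`, `LOS_7`, and `readmission_30` from admission and discharge dates with pandas. This is not the library's implementation (the labels are computed in the database; see the source code). The `LOS_days` rule matches the rows shown above, but the `>= 7` threshold and the readmission rule are assumptions made here for illustration, and the data is made up.

```python
import pandas as pd

# Toy admissions: patient 1 is admitted twice, patient 2 once
admissions = pd.DataFrame({
    "person_id": [1, 1, 2],
    "admit_date": pd.to_datetime(["2018-05-07", "2018-06-10", "2016-08-15"]),
    "discharge_date": pd.to_datetime(["2018-05-28", "2018-06-12", "2016-08-19"]),
})
admissions = admissions.sort_values(["person_id", "admit_date"]).reset_index(drop=True)

# Length of stay in whole days between admission and discharge
admissions["LOS_days"] = (admissions["discharge_date"] - admissions["admit_date"]).dt.days

# ASSUMPTION: a "long" stay is 7 days or more; the exact threshold rule is not
# stated in the text
admissions["LOS_7"] = (admissions["LOS_days"] >= 7).astype(int)

# ASSUMPTION: a readmission is the same patient's next admission starting
# within 30 days of this discharge
next_admit = admissions.groupby("person_id")["admit_date"].shift(-1)
days_to_next = (next_admit - admissions["discharge_date"]).dt.days
# Rows without a later admission have NaN here, which compares as False
admissions["readmission_30"] = (days_to_next <= 30).astype(int)
```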


In [11]:
# Now let's filter down to one prediction per patient and add a row_id column called `prediction_id`
cohort_filtered = BQFilterInpatientCohort(**config_dict)
cohort_filtered.create_cohort_table()



In [12]:
# Get the filtered cohort; set_index/reset_index moves `prediction_id` to the first column
cohort_df_filtered = cohort_filtered.db.read_sql_query(
    query = """
    SELECT *
    FROM {rs_dataset_project}.{rs_dataset}.{cohort_name_filtered}
    """.format(**config_dict)
).set_index('prediction_id').reset_index()

Downloading: 100%|██████████| 2059/2059 [00:01<00:00, 1106.98rows/s]


In [13]:
cohort_df_filtered.head()

Unnamed: 0,prediction_id,person_id,admit_date,discharge_date,hospital_mortality,month_mortality,LOS_days,LOS_7,readmission_30,age_in_years,age_group,race_eth,gender_concept_name
0,-2634508823241925258,29927078,2016-08-15,2016-08-19,0,0,4,0,0,38,[30-45),Asian,FEMALE
1,1780316215979524235,29927087,2019-08-08,2019-08-11,0,0,3,0,0,31,[30-45),Asian,FEMALE
2,4456189537238262902,29927561,2015-07-24,2015-07-26,0,0,2,0,0,45,[45-55),Asian,FEMALE
3,922196558384555124,29927632,2011-04-17,2011-04-19,0,0,2,0,0,49,[45-55),Asian,FEMALE
4,6633284178958689091,29928326,2014-09-02,2014-09-04,0,0,2,0,0,23,[18-30),Hispanic or Latino,FEMALE
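
The filtering step can be pictured in pandas as follows. This is only a sketch under stated assumptions: it picks one admission per patient at random and derives a signed 64-bit identifier from a hash of the row's key. How `BQFilterInpatientCohort` actually selects the admission and computes `prediction_id` is defined in the source code; the helper `make_prediction_id` and the toy data are hypothetical.

```python
import hashlib

import pandas as pd


def make_prediction_id(person_id, admit_date):
    """Hypothetical row identifier: first 8 bytes of a SHA-256 hash as a signed integer."""
    key = f"{person_id}|{admit_date}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


# Toy admissions with several rows for some patients
admissions = pd.DataFrame({
    "person_id": [1, 1, 2, 3, 3, 3],
    "admit_date": ["2015-01-01", "2016-02-02", "2017-03-03",
                   "2018-04-04", "2019-05-05", "2020-06-06"],
})

# ASSUMPTION: keep one randomly chosen admission per patient
one_per_patient = (
    admissions.groupby("person_id", group_keys=False)
    .sample(n=1, random_state=0)
    .reset_index(drop=True)
)

# Attach a deterministic identifier derived from (person_id, admit_date)
one_per_patient["prediction_id"] = [
    make_prediction_id(p, d)
    for p, d in zip(one_per_patient["person_id"], one_per_patient["admit_date"])
]
```

Deriving the identifier from the row's key rather than from its position keeps it stable if the table is rebuilt in a different order.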


In [14]:
# Partition the dataset into folds for later
cohort_df_final = patient_split(cohort_df_filtered)

In [15]:
cohort_df_final.head()

Unnamed: 0,prediction_id,person_id,admit_date,discharge_date,hospital_mortality,month_mortality,LOS_days,LOS_7,readmission_30,age_in_years,age_group,race_eth,gender_concept_name,fold_id
0,-2634508823241925258,29927078,2016-08-15,2016-08-19,0,0,4,0,0,38,[30-45),Asian,FEMALE,test
1,1780316215979524235,29927087,2019-08-08,2019-08-11,0,0,3,0,0,31,[30-45),Asian,FEMALE,7
2,4456189537238262902,29927561,2015-07-24,2015-07-26,0,0,2,0,0,45,[45-55),Asian,FEMALE,7
3,922196558384555124,29927632,2011-04-17,2011-04-19,0,0,2,0,0,49,[45-55),Asian,FEMALE,10
4,6633284178958689091,29928326,2014-09-02,2014-09-04,0,0,2,0,0,23,[18-30),Hispanic or Latino,FEMALE,10
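
The point of splitting at the patient level is that every row belonging to a given patient ends up in the same partition, so no patient contributes to both training and evaluation. The sketch below shows one way to do this; it is not the implementation of `patient_split`, and the function name `split_by_patient` and the parameters `test_frac`, `nfold`, and `seed` are hypothetical.

```python
import numpy as np
import pandas as pd


def split_by_patient(df, test_frac=0.1, nfold=10, seed=0, person_col="person_id"):
    """Assign a fold_id per patient: 'test' or a fold label '1'..'nfold'.

    Hypothetical sketch, not the prediction_utils implementation.
    """
    rng = np.random.default_rng(seed)
    # Shuffle the unique patients, not the rows
    patients = rng.permutation(df[person_col].unique())
    n_test = int(round(test_frac * len(patients)))
    fold_of = {p: "test" for p in patients[:n_test]}
    # Spread the remaining patients evenly over the folds
    for i, p in enumerate(patients[n_test:]):
        fold_of[p] = str(i % nfold + 1)
    out = df.copy()
    out["fold_id"] = out[person_col].map(fold_of)
    return out


# Toy data: 20 patients with two rows each
toy = pd.DataFrame({"person_id": np.repeat(np.arange(20), 2), "value": np.arange(40)})
toy_split = split_by_patient(toy, test_frac=0.1, nfold=3, seed=0)
```

In the filtered cohort above there is one row per patient, so a row-level split would give the same guarantee; the patient-level logic matters for cohorts with repeated admissions.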


In [16]:
# Write the result to disk
cohort_path = '/share/pi/nigam/projects/prediction_utils/scratch/cohort'
os.makedirs(cohort_path, exist_ok=True)
cohort_df_final.to_parquet(
    os.path.join(cohort_path, "cohort.parquet"), engine="pyarrow", index=False
)