### Defining a cohort

A cohort is a table whose rows correspond to unique combinations of `person_id` and `index_date`, where each combination is mapped to a unique `row_id`.
For downstream feature extraction and modeling, the cohort table should also contain columns for outcome labels and group categories.

Here, we will call a set of pre-defined transformations to define a cohort of hospital admissions and extract relevant labels. For details of how this cohort is defined, refer to the source code. In practice, a cohort can be defined arbitrarily, as long as it meets the specification described above and is stored in a table in the database.
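
If you define your own cohort, it can help to check it against the specification above before moving on. The following is a minimal sketch, not part of `prediction_utils`: the helper `check_cohort_spec` and the toy data are hypothetical, and it assumes that in this vignette `admit_date` plays the role of `index_date` and `prediction_id` plays the role of `row_id`.

```python
import pandas as pd


def check_cohort_spec(df, person_col="person_id", index_date_col="index_date", row_id_col="row_id"):
    """Return a list of violations of the cohort specification (hypothetical helper)."""
    problems = []
    for col in (person_col, index_date_col, row_id_col):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    if problems:
        return problems
    # Each (person, index date) combination may appear only once
    if df.duplicated(subset=[person_col, index_date_col]).any():
        problems.append("duplicate (person, index date) combinations")
    # Each row must carry its own identifier
    if not df[row_id_col].is_unique:
        problems.append("row ids are not unique")
    if df[[person_col, index_date_col, row_id_col]].isna().any().any():
        problems.append("null values in key columns")
    return problems


# Toy data for illustration only
toy = pd.DataFrame({
    "person_id": [1, 1, 2],
    "admit_date": pd.to_datetime(["2019-01-01", "2019-06-01", "2019-01-01"]),
    "prediction_id": [10, 11, 12],
})
problems = check_cohort_spec(toy, index_date_col="admit_date", row_id_col="prediction_id")
print(problems)
```

An empty list means the table satisfies the checks. The same function could be applied to a cohort table after reading it into a DataFrame.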

In [1]:
import os
from prediction_utils.cohorts.admissions.cohort import (
    BQAdmissionRollupCohort, BQAdmissionOutcomeCohort, BQFilterInpatientCohort
)
from prediction_utils.util import patient_split

In [2]:
# Configuration for the extraction
config_dict = {
    'gcloud_project': 'som-nero-phi-nigam-starr',
    'dataset_project': 'som-rit-phi-starr-prod',
    'rs_dataset_project': 'som-nero-phi-nigam-starr',
    'dataset': 'starr_omop_cdm5_deid_1pcent_lite_latest',
    'rs_dataset': 'temp_dataset',
    'cohort_name': 'vignette_cohort',
    'cohort_name_labeled': 'vignette_cohort_labeled',
    'cohort_name_filtered': 'vignette_cohort_filtered',
    'has_birth_datetime': True
}

In [3]:
cohort = BQAdmissionRollupCohort(**config_dict)



In [4]:
# Create the cohort table
cohort.create_cohort_table()

In [5]:
# Let's inspect the cohort
cohort_df = cohort.db.read_sql_query(
    query="SELECT * FROM {rs_dataset_project}.{rs_dataset}.{cohort_name}".format(**config_dict)
)

Downloading: 100%|██████████| 6361/6361 [00:01<00:00, 3629.01rows/s]


In [6]:
cohort_df.head()

Unnamed: 0,person_id,admit_date,discharge_date
0,29923244,2019-01-31,2019-02-01
1,29926951,2015-11-17,2015-11-19
2,29927222,2012-07-31,2012-08-01
3,29927700,2011-06-20,2011-06-21
4,29927700,2014-11-04,2014-11-09


In [7]:
# Now let's add some labels
cohort_labeled = BQAdmissionOutcomeCohort(**config_dict)
cohort_labeled.create_cohort_table()



In [8]:
cohort_df_labeled = cohort_labeled.db.read_sql_query(
    query="SELECT * FROM {rs_dataset_project}.{rs_dataset}.{cohort_name_labeled}".format(**config_dict)
)

Downloading: 100%|██████████| 6361/6361 [00:01<00:00, 4972.26rows/s]


In [9]:
cohort_df_labeled.head()

Unnamed: 0,person_id,admit_date,discharge_date,hospital_mortality,month_mortality,LOS_days,LOS_7,readmission_30,age_in_years,age_group,race_eth,gender_concept_name
0,31943657,2018-12-03,2018-12-22,0,0,19,1,0,0,<18,Hispanic or Latino,FEMALE
1,32502316,2019-07-08,2020-05-26,0,0,323,1,1,0,<18,White,MALE
2,29980256,2016-08-01,2016-08-21,0,0,20,1,1,0,<18,White,FEMALE
3,32496707,2019-05-25,2019-08-16,0,0,83,1,0,0,<18,Hispanic or Latino,MALE
4,31276528,2017-10-03,2017-10-27,0,0,24,1,0,0,<18,White,FEMALE
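
With labels and group columns in place, a quick sanity check is to look at outcome prevalence within each group. The sketch below uses a small toy DataFrame (made-up values, not rows from the cohort) with a few of the column names shown above; on real data you would pass `cohort_df_labeled` instead.

```python
import pandas as pd

# Toy data for illustration only; column names follow the labeled cohort above
toy_labeled = pd.DataFrame({
    "age_group": ["<18", "<18", "[30-45)", "[30-45)"],
    "LOS_7": [1, 1, 0, 1],
    "readmission_30": [0, 1, 0, 0],
})

# Mean of a 0/1 label within a group is the prevalence of that outcome
prevalence = toy_labeled.groupby("age_group")[["LOS_7", "readmission_30"]].mean()
print(prevalence)
```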


In [10]:
# Now let's filter down to one prediction per patient and add a row_id column called `prediction_id`
cohort_filtered = BQFilterInpatientCohort(**config_dict)
cohort_filtered.create_cohort_table()



In [11]:
# Get the filtered cohort
cohort_df_filtered = cohort_filtered.db.read_sql_query(
    query="""
    SELECT *
    FROM {rs_dataset_project}.{rs_dataset}.{cohort_name_filtered}
    """.format(**config_dict)
# set_index followed by reset_index moves `prediction_id` to the first column
).set_index('prediction_id').reset_index()

Downloading: 100%|██████████| 2754/2754 [00:01<00:00, 1948.58rows/s]


In [12]:
cohort_df_filtered.head()

Unnamed: 0,prediction_id,person_id,admit_date,discharge_date,hospital_mortality,month_mortality,LOS_days,LOS_7,readmission_30,age_in_years,age_group,race_eth,gender_concept_name
0,-1398580073358422910,29927700,2011-06-20,2011-06-21,0,0,1,0,0,37,[30-45),Asian,FEMALE
1,1589314921634653616,29927707,2018-06-11,2018-06-13,0,0,2,0,0,33,[30-45),Asian,FEMALE
2,6821490609013120526,29928422,2015-08-07,2015-08-09,0,0,2,0,0,34,[30-45),Other,FEMALE
3,-4393416961359616376,29929013,2010-08-09,2010-08-10,0,0,1,0,0,59,[55-65),White,FEMALE
4,-2677891202090663005,29929519,2018-06-07,2018-06-09,0,0,2,0,0,37,[30-45),Asian,FEMALE
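
The vignette does not show how `BQFilterInpatientCohort` chooses which admission to keep or how it generates `prediction_id`; refer to the source code for that. To illustrate the idea only, here is a hypothetical pandas sketch that randomly keeps one admission per patient and derives a deterministic 64-bit row id by hashing the person and date. The function names, the random selection policy, and the hashing scheme are all assumptions made for this example.

```python
import hashlib

import pandas as pd


def one_admission_per_patient(df, seed=0):
    """Randomly keep a single admission per person_id (illustrative policy only)."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    return (
        shuffled.drop_duplicates(subset="person_id")
        .sort_values(["person_id", "admit_date"])
        .reset_index(drop=True)
    )


def make_row_id(person_id, admit_date):
    """Hash the (person, date) pair into a signed 64-bit integer (assumed scheme)."""
    key = f"{person_id}|{admit_date}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


# Toy data for illustration only
admissions = pd.DataFrame({
    "person_id": [1, 1, 2, 3, 3, 3],
    "admit_date": ["2019-01-01", "2019-06-01", "2018-03-05",
                   "2017-01-01", "2017-02-01", "2017-03-01"],
})
filtered = one_admission_per_patient(admissions)
filtered["prediction_id"] = [
    make_row_id(p, d) for p, d in zip(filtered["person_id"], filtered["admit_date"])
]
print(filtered)
```

Because the id is a function of the row's key columns, rerunning the pipeline on the same data yields the same ids.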


In [13]:
# Partition the dataset into folds for later
cohort_df_final = patient_split(cohort_df_filtered)

In [14]:
cohort_df_final.head()

Unnamed: 0,prediction_id,person_id,admit_date,discharge_date,hospital_mortality,month_mortality,LOS_days,LOS_7,readmission_30,age_in_years,age_group,race_eth,gender_concept_name,fold_id
0,-1398580073358422910,29927700,2011-06-20,2011-06-21,0,0,1,0,0,37,[30-45),Asian,FEMALE,2
1,1589314921634653616,29927707,2018-06-11,2018-06-13,0,0,2,0,0,33,[30-45),Asian,FEMALE,8
2,6821490609013120526,29928422,2015-08-07,2015-08-09,0,0,2,0,0,34,[30-45),Other,FEMALE,1
3,-4393416961359616376,29929013,2010-08-09,2010-08-10,0,0,1,0,0,59,[55-65),White,FEMALE,2
4,-2677891202090663005,29929519,2018-06-07,2018-06-09,0,0,2,0,0,37,[30-45),Asian,FEMALE,7
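
Splitting by patient means that all rows belonging to one `person_id` land in the same fold, which avoids leaking a patient's data across folds when a cohort has several rows per patient. The sketch below shows one way this could be done; it is not the implementation of `patient_split`, and the function name, the number of folds, and the seed are assumptions.

```python
import numpy as np
import pandas as pd


def assign_folds_by_patient(df, nfold=10, seed=0, person_col="person_id"):
    """Assign a random fold to each patient and broadcast it to that patient's rows."""
    rng = np.random.default_rng(seed)
    persons = df[person_col].unique()
    # One fold per unique patient, indexed by the patient identifier
    fold_of_person = pd.Series(rng.integers(0, nfold, size=len(persons)), index=persons)
    out = df.copy()
    out["fold_id"] = out[person_col].map(fold_of_person).to_numpy()
    return out


# Toy data for illustration only
toy_cohort = pd.DataFrame({
    "person_id": [1, 1, 2, 3, 3, 4],
    "admit_date": ["2019-01-01", "2019-06-01", "2018-03-05",
                   "2017-01-01", "2017-02-01", "2016-09-09"],
})
toy_folds = assign_folds_by_patient(toy_cohort, nfold=5)
print(toy_folds)
```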


In [15]:
# Write the result to disk
cohort_path = '/share/pi/nigam/projects/prediction_utils/scratch/cohort'
os.makedirs(cohort_path, exist_ok=True)
cohort_df_final.to_parquet(
    os.path.join(cohort_path, "cohort.parquet"), engine="pyarrow", index=False
)