Cohort Extraction
The cohort extraction module is the primary tool for building patient-level analytic files. It identifies patients matching diagnosis/procedure criteria, applies inclusion/exclusion filters, and exports the resulting claim files.
from medicaid_utils.filters.patients.cohort_extraction import extract_cohort
extract_cohort(
    state="AL",
    lst_year=[2016, 2017, 2018],
    dct_diag_proc_codes=dct_codes,
    dct_filters=dct_filters,
    lst_types_to_export=["ip", "ot", "ps"],
    dct_data_paths=dct_paths,
    cms_format="TAF",
)

Use ICD-9 and/or ICD-10 codes with inclusion and exclusion logic. Codes are matched by prefix: "250" matches "2500", "25000", "25002", etc.
dct_codes = {
    "diag_codes": {
        "diabetes_t2": {
            "incl": {
                9: ["250"],   # ICD-9 prefix
                10: ["E11"],  # ICD-10 prefix
            },
            "excl": {
                9: ["25001", "25003", "25011", "25013"],  # Odd 5th digits = Type 1
                10: ["E10"],  # Exclude Type 1
            },
        },
    },
    "proc_codes": {},
}

Procedure codes are keyed by procedure coding system:
dct_codes = {
    "diag_codes": {},
    "proc_codes": {
        "methadone": {
            7: [  # ICD-10-PCS (system code 7)
                "HZ81ZZZ", "HZ84ZZZ", "HZ85ZZZ", "HZ86ZZZ",
            ],
        },
    },
}

Common procedure system codes:
- 1: CPT/HCPCS
- 6: ICD-9-CM procedure
- 7: ICD-10-PCS
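The prefix matching and inclusion/exclusion logic described above can be illustrated with a minimal sketch. The helper `matches_condition` is hypothetical and not part of medicaid_utils. It assumes that a code counts as a match when it starts with at least one inclusion prefix and with no exclusion prefix for the same ICD version; the library's actual matching may differ in detail.

```python
def matches_condition(code, icd_version, dct_condition):
    """Return True if `code` starts with an inclusion prefix and no exclusion prefix.

    Hypothetical helper for illustration only; not the medicaid_utils implementation.
    """
    incl = dct_condition.get("incl", {}).get(icd_version, [])
    excl = dct_condition.get("excl", {}).get(icd_version, [])
    # str.startswith accepts a tuple of prefixes; an empty tuple never matches
    is_included = code.startswith(tuple(incl))
    is_excluded = code.startswith(tuple(excl))
    return is_included and not is_excluded


# Same structure as the "diabetes_t2" entry shown earlier
diabetes_t2 = {
    "incl": {9: ["250"], 10: ["E11"]},
    "excl": {9: ["25001", "25003", "25011", "25013"], 10: ["E10"]},
}

print(matches_condition("25000", 9, diabetes_t2))  # True: starts with "250", not excluded
print(matches_condition("25001", 9, diabetes_t2))  # False: excluded Type 1 code
print(matches_condition("E119", 10, diabetes_t2))  # True: starts with "E11"
```

Because exclusion prefixes are checked in addition to inclusion prefixes, a broad inclusion prefix such as "250" can be combined with a short list of specific exclusions.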
Filters control which claims and patients are included:
dct_filters = {
    "cohort": {
        "ip": {
            "missing_dob": 0,  # Exclude missing DOB
            "range_numeric_age_prncpl_proc": (18, 64),  # Age 18-64
        },
        "ot": {
            "missing_dob": 0,
            "range_numeric_age_srvc_bgn": (18, 64),
        },
    },
    "export": {},
}

| Type | Example | Description |
|---|---|---|
| Column value | "missing_dob": 0 | Keep rows where column equals value |
| Numeric range | "range_numeric_age_srvc_bgn": (18, 64) | Keep rows where column is within range (inclusive) |
| Date range | "range_date_srvc_bgn_date": ("20160101", "20181231") | Keep rows where date is within range |
| Exclusion | "excl_female": 1 | Exclude patients with positive exclusion flag |
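To make the first three filter types concrete, here is a minimal sketch of how such keys could be interpreted row by row. The function `row_passes` and the toy claim rows are hypothetical, not part of medicaid_utils. The sketch assumes that the text after the `range_numeric_` or `range_date_` prefix is the column name, that dates are `YYYYMMDD` strings, and that date bounds are inclusive. The library's own column names may differ (for example, by carrying an `excl_` prefix), and the exclusion type is left out.

```python
from datetime import datetime


def parse_yyyymmdd(value):
    """Parse a 'YYYYMMDD' string into a date."""
    return datetime.strptime(value, "%Y%m%d").date()


def row_passes(row, dct_filter):
    """Return True if a claim row (a dict) satisfies every filter entry.

    Hypothetical helper for illustration only.
    """
    for key, value in dct_filter.items():
        if key.startswith("range_numeric_"):
            column = key[len("range_numeric_"):]
            low, high = value
            if not (low <= row[column] <= high):  # inclusive on both ends
                return False
        elif key.startswith("range_date_"):
            column = key[len("range_date_"):]
            start, end = parse_yyyymmdd(value[0]), parse_yyyymmdd(value[1])
            if not (start <= parse_yyyymmdd(row[column]) <= end):  # assumed inclusive
                return False
        elif row[key] != value:  # plain column-value filter
            return False
    return True


# Toy rows, made up for illustration
claims = [
    {"missing_dob": 0, "age_srvc_bgn": 34, "srvc_bgn_date": "20170315"},
    {"missing_dob": 0, "age_srvc_bgn": 70, "srvc_bgn_date": "20170315"},
    {"missing_dob": 1, "age_srvc_bgn": 40, "srvc_bgn_date": "20160101"},
    {"missing_dob": 0, "age_srvc_bgn": 18, "srvc_bgn_date": "20190101"},
]
ot_filter = {
    "missing_dob": 0,
    "range_numeric_age_srvc_bgn": (18, 64),
    "range_date_srvc_bgn_date": ("20160101", "20181231"),
}
kept = [row for row in claims if row_passes(row, ot_filter)]
print(len(kept))  # 1
```

Only the first toy row survives: the others fail the age range, the missing-DOB check, and the date range respectively.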
After extraction, the export folder contains:
- cohort_{STATE}.csv: patient-level file with condition flags, inclusion indicator, and date of birth
- cohort_{STATE}_{YEAR}.csv: year-specific patient file
- cohort_exclusions_{TYPE}_{STATE}_{YEAR}.parquet: filter statistics
- Exported claim files in the requested format (CSV or Parquet)
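When checking an export folder after a run, it can help to list the file names these patterns produce. The sketch below only builds the names; `expected_cohort_files` is a hypothetical helper, not a medicaid_utils function. It assumes that `{TYPE}` takes claim type names such as `ip` and `ot`, which the text above does not confirm.

```python
from pathlib import PurePosixPath


def expected_cohort_files(export_folder, state, lst_year, lst_types):
    """List the cohort file paths implied by the naming patterns above.

    Hypothetical helper; the values taken by {TYPE} are an assumption.
    """
    folder = PurePosixPath(export_folder)
    files = [folder / f"cohort_{state}.csv"]
    for year in lst_year:
        files.append(folder / f"cohort_{state}_{year}.csv")
        for claim_type in lst_types:
            files.append(
                folder / f"cohort_exclusions_{claim_type}_{state}_{year}.parquet"
            )
    return files


for path in expected_cohort_files("/output/cohort/AL/", "AL", [2016, 2017], ["ip", "ot"]):
    print(path)
```

The exported claim files are not included because their names are not specified here.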
The cohort_{STATE}_{YEAR}.csv file contains these columns (indexed by BENE_MSIS):
| Column Pattern | Description |
|---|---|
| include | 1 if patient is included in the final cohort, 0 if excluded |
| YEAR | Claim year |
| STATE_CD | State code |
| birth_date | Date of birth (merged from PS file) |
| {type}_diag_{condition} | 1 if condition found in that claim type (e.g., ip_diag_diabetes_t2) |
| {type}_diag_{condition}_date | Date of first occurrence of the condition |
| {type}_proc_{procedure} | 1 if procedure found in that claim type (e.g., ip_proc_methadone) |
| {type}_proc_{procedure}_date | Date of first occurrence of the procedure |
| {type}_diag_condn | 1 if ANY diagnosis condition matched in that claim type |
| {type}_diag_condn_date | Date of first ANY diagnosis condition |
| {type}_proc_condn | 1 if ANY procedure condition matched in that claim type |
| {type}_proc_condn_date | Date of first ANY procedure condition |
Here {type} is the claim type (ip, ot, ot_line), {condition} is a key under "diag_codes" in dct_diag_proc_codes (e.g., diabetes_t2), and {procedure} is a key under "proc_codes" (e.g., methadone).
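As a minimal sketch of how these columns might be used, the example below selects included patients who have the `diabetes_t2` flag in inpatient claims. The rows are made up for illustration and read from an in-memory string rather than a real export. The date format and the exact set of columns in an actual cohort_{STATE}_{YEAR}.csv file are assumptions.

```python
import csv
import io

# Toy data, made up for illustration; a real file would be opened from the export folder
toy_csv = io.StringIO(
    "BENE_MSIS,include,YEAR,STATE_CD,birth_date,"
    "ip_diag_diabetes_t2,ip_diag_diabetes_t2_date\n"
    "A001,1,2016,AL,1970-05-02,1,2016-03-14\n"
    "A002,0,2016,AL,1950-01-20,1,2016-07-01\n"
    "A003,1,2016,AL,1985-11-30,0,\n"
)

rows = list(csv.DictReader(toy_csv))

# csv.DictReader yields strings, so the flags are compared against "1"
selected = [
    row["BENE_MSIS"]
    for row in rows
    if row["include"] == "1" and row["ip_diag_diabetes_t2"] == "1"
]
print(selected)  # ['A001']
```

Patient A002 has the condition flag but is dropped because `include` is 0, which shows why both columns need to be checked.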
During preprocessing, claims are flagged with exclusion columns (prefixed excl_). Filter keys in dct_filters omit the excl_ prefix:
IP claims: excl_missing_dob, excl_missing_admsn_date, excl_encounter_claim, excl_capitation_claim, excl_ffs_claim, excl_delivery, excl_female, excl_duplicated
OT claims: excl_missing_dob, excl_missing_srvc_bgn_date, excl_encounter_claim, excl_capitation_claim, excl_ffs_claim, excl_female, excl_duplicated
PS claims: excl_duplicated_bene_id
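The naming rule above (filter keys drop the `excl_` prefix) can be sketched as a small conversion. `flags_to_filter` is a hypothetical helper, not part of medicaid_utils. It assumes that setting a key to 0 keeps only claims without that exclusion flag, as in the `"missing_dob": 0` example earlier.

```python
def flags_to_filter(lst_flags, value=0):
    """Turn exclusion column names into filter keys by dropping the 'excl_' prefix.

    Hypothetical helper for illustration only.
    """
    prefix = "excl_"
    return {
        (flag[len(prefix):] if flag.startswith(prefix) else flag): value
        for flag in lst_flags
    }


ip_filter = flags_to_filter(["excl_missing_dob", "excl_duplicated"])
print(ip_filter)  # {'missing_dob': 0, 'duplicated': 0}
```

The resulting dictionary has the same shape as an entry under "cohort" in dct_filters, for example "ip" or "ot".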
for state in ["AL", "IL", "CA", "NY", "TX"]:
    extract_cohort(
        state=state,
        lst_year=[2016, 2017, 2018],
        dct_diag_proc_codes=dct_codes,
        dct_filters=dct_filters,
        lst_types_to_export=["ip", "ot", "ps"],
        dct_data_paths={
            "source_root": "/data/cms/",
            "export_folder": f"/output/cohort/{state}/",
        },
        cms_format="TAF",
    )

- Common Recipes: More code examples
- Risk Adjustment Algorithms: Apply after cohort extraction
medicaid-utils | Documentation | PyPI | GitHub | MIT License | Research Computing Group, Biostatistics Laboratory, The University of Chicago