Cohort Extraction

Manu Murugesan edited this page Mar 13, 2026 · 2 revisions

The cohort extraction module is the primary tool for building patient-level analytic files. It identifies patients matching diagnosis/procedure criteria, applies inclusion/exclusion filters, and exports the resulting claim files.

Basic Usage

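from medicaid_utils.filters.patients.cohort_extraction import extract_cohort

# dct_codes and dct_filters are defined in the sections below.
# dct_paths uses the keys shown under Multiple States.
dct_paths = {
    "source_root": "/data/cms/",
    "export_folder": "/output/cohort/AL/",
}

extract_cohort(
    state="AL",
    lst_year=[2016, 2017, 2018],
    dct_diag_proc_codes=dct_codes,
    dct_filters=dct_filters,
    lst_types_to_export=["ip", "ot", "ps"],
    dct_data_paths=dct_paths,
    cms_format="TAF",
)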

Defining Diagnosis Codes

Use ICD-9 and/or ICD-10 codes with inclusion and exclusion logic. Codes are matched by prefix: "250" matches "2500", "25000", "25002", etc.

dct_codes = {
    "diag_codes": {
        "diabetes_t2": {
            "incl": {
                9: ["250"],       # ICD-9 prefix
                10: ["E11"],      # ICD-10 prefix
            },
            "excl": {
                9: ["25001", "25003", "25011", "25013"],  # Odd 5th digit = Type 1 (only 2500x and 2501x listed here)
                10: ["E10"],      # Exclude Type 1
            },
        },
    },
    "proc_codes": {},
}
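
To make the prefix rule concrete, here is a minimal sketch of how a single code could be tested against the inclusion and exclusion lists. The helper `matches_condition` is hypothetical and is not part of medicaid_utils; it also assumes that an exclusion match overrides an inclusion match for the same code, which may differ from how the library combines the two.

```python
def matches_condition(code, incl_prefixes, excl_prefixes):
    """Hypothetical helper: True if `code` starts with any inclusion
    prefix and with none of the exclusion prefixes."""
    code = code.strip().upper()
    included = any(code.startswith(prefix) for prefix in incl_prefixes)
    excluded = any(code.startswith(prefix) for prefix in excl_prefixes)
    # Assumption: exclusion wins when both lists match the same code.
    return included and not excluded


# ICD-9 lists taken from the diabetes_t2 definition above.
incl_icd9 = ["250"]
excl_icd9 = ["25001", "25003", "25011", "25013"]

print(matches_condition("25000", incl_icd9, excl_icd9))  # True: prefix "250", not excluded
print(matches_condition("25001", incl_icd9, excl_icd9))  # False: matches an exclusion prefix
print(matches_condition("4019", incl_icd9, excl_icd9))   # False: no inclusion prefix matches
```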

Defining Procedure Codes

Procedure codes are keyed by procedure coding system:

dct_codes = {
    "diag_codes": {},
    "proc_codes": {
        "methadone": {
            7: [  # ICD-10-PCS (system code 7)
                "HZ81ZZZ", "HZ84ZZZ", "HZ85ZZZ", "HZ86ZZZ",
            ],
        },
    },
}

Common procedure system codes:

  • 1 — CPT/HCPCS
  • 6 — ICD-9-CM procedure
  • 7 — ICD-10-PCS
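
Because the system keys are bare integers, it can help to review a code dictionary before running an extraction. The sketch below is an illustration only: `PROC_SYSTEMS` and `summarize_proc_codes` are hypothetical names, and the mapping covers just the three systems listed above, so a key flagged as unrecognized is not necessarily invalid.

```python
# Labels for the procedure system codes listed above (not exhaustive).
PROC_SYSTEMS = {1: "CPT/HCPCS", 6: "ICD-9-CM procedure", 7: "ICD-10-PCS"}


def summarize_proc_codes(dct_codes):
    """Hypothetical helper: list (condition, system key, label, code count)
    for every procedure code list in a dct_codes-style dictionary."""
    summary = []
    for condition, by_system in dct_codes.get("proc_codes", {}).items():
        for system, codes in by_system.items():
            label = PROC_SYSTEMS.get(system, "unrecognized system")
            summary.append((condition, system, label, len(codes)))
    return summary


example_codes = {
    "diag_codes": {},
    "proc_codes": {
        "methadone": {7: ["HZ81ZZZ", "HZ84ZZZ", "HZ85ZZZ", "HZ86ZZZ"]},
    },
}

for entry in summarize_proc_codes(example_codes):
    print(entry)
```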

Defining Filters

Filters control which claims and patients are included:

dct_filters = {
    "cohort": {
        "ip": {
            "missing_dob": 0,                          # Exclude missing DOB
            "range_numeric_age_prncpl_proc": (18, 64), # Age 18-64
        },
        "ot": {
            "missing_dob": 0,
            "range_numeric_age_srvc_bgn": (18, 64),
        },
    },
    "export": {},
}

Filter Types

| Type | Example | Description |
| --- | --- | --- |
| Column value | "missing_dob": 0 | Keep rows where column equals value |
| Numeric range | "range_numeric_age_srvc_bgn": (18, 64) | Keep rows where column is within range (inclusive) |
| Date range | "range_date_srvc_bgn_date": ("20160101", "20181231") | Keep rows where date is within range |
| Exclusion | "excl_female": 1 | Exclude patients with positive exclusion flag |
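
The filter keys above follow a naming pattern, which the sketch below interprets on plain dictionaries. It is not the library's implementation: `row_passes` is a hypothetical helper, and it assumes that the text after the `range_numeric_` or `range_date_` prefix is the column name, that dates are stored as YYYYMMDD strings, that date ranges are inclusive, and that an `excl_` filter set to 1 drops rows whose flag equals 1.

```python
def row_passes(row, filters):
    """Hypothetical helper: True if `row` (a dict) satisfies every filter."""
    for key, value in filters.items():
        if key.startswith("range_numeric_"):
            column = key[len("range_numeric_"):]
            low, high = value
            if not (low <= row[column] <= high):
                return False
        elif key.startswith("range_date_"):
            column = key[len("range_date_"):]
            start, end = value
            # YYYYMMDD strings sort chronologically, so string comparison works.
            if not (start <= row[column] <= end):
                return False
        elif key.startswith("excl_"):
            # Assumption: value 1 means "drop rows whose flag is set".
            if value == 1 and row[key] == 1:
                return False
        elif row[key] != value:
            # Plain column-value filter: keep only exact matches.
            return False
    return True


rows = [
    {"missing_dob": 0, "age_srvc_bgn": 34, "srvc_bgn_date": "20170315"},
    {"missing_dob": 0, "age_srvc_bgn": 70, "srvc_bgn_date": "20170315"},
    {"missing_dob": 1, "age_srvc_bgn": 40, "srvc_bgn_date": "20160601"},
    {"missing_dob": 0, "age_srvc_bgn": 18, "srvc_bgn_date": "20190101"},
]
filters = {
    "missing_dob": 0,
    "range_numeric_age_srvc_bgn": (18, 64),
    "range_date_srvc_bgn_date": ("20160101", "20181231"),
}

kept = [row for row in rows if row_passes(row, filters)]
print(len(kept))  # 1: only the first row passes all three filters
```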

Output Files

After extraction, the export folder contains:

  • cohort_{STATE}.csv — patient-level file with condition flags, inclusion indicator, and date of birth
  • cohort_{STATE}_{YEAR}.csv — year-specific patient file
  • cohort_exclusions_{TYPE}_{STATE}_{YEAR}.parquet — filter statistics
  • Exported claim files in the requested format (CSV or Parquet)
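
To check that a run produced what was expected, the file name patterns above can be expanded into concrete paths. This is a sketch under assumptions: `expected_outputs` is a hypothetical helper, the files are assumed to sit directly in the export folder, and `{TYPE}` is assumed to be the claim type exactly as passed in `lst_types_to_export` (for example "ip").

```python
from pathlib import Path


def expected_outputs(export_folder, state, years, claim_types):
    """Hypothetical helper: build the output paths implied by the
    file name patterns listed above."""
    folder = Path(export_folder)
    paths = [folder / f"cohort_{state}.csv"]
    for year in years:
        paths.append(folder / f"cohort_{state}_{year}.csv")
        for claim_type in claim_types:
            # Assumption: {TYPE} is the claim type string, unchanged.
            paths.append(
                folder / f"cohort_exclusions_{claim_type}_{state}_{year}.parquet"
            )
    return paths


paths = expected_outputs("/output/cohort/AL/", "AL", [2016, 2017], ["ip", "ot"])
for path in paths:
    # Path.exists() shows which of the expected files are actually present.
    print(path.name, path.exists())
```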

Multiple States

for state in ["AL", "IL", "CA", "NY", "TX"]:
    extract_cohort(
        state=state,
        lst_year=[2016, 2017, 2018],
        dct_diag_proc_codes=dct_codes,
        dct_filters=dct_filters,
        lst_types_to_export=["ip", "ot", "ps"],
        dct_data_paths={
            "source_root": "/data/cms/",
            "export_folder": f"/output/cohort/{state}/",
        },
        cms_format="TAF",
    )
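
With the plain loop above, an error in one state stops the remaining states from running. One possible variation, sketched here with a hypothetical `run_for_states` helper and a stand-in function in place of `extract_cohort`, records failures and continues:

```python
def run_for_states(states, run_one):
    """Hypothetical helper: call run_one(state) for each state and
    collect exceptions instead of stopping at the first failure."""
    failures = {}
    for state in states:
        try:
            run_one(state)
        except Exception as exc:  # broad on purpose: record and move on
            failures[state] = exc
    return failures


def fake_extract(state):
    """Stand-in for an extract_cohort call; fails for one state."""
    if state == "IL":
        raise FileNotFoundError(f"no source files for {state}")


failures = run_for_states(["AL", "IL", "CA"], fake_extract)
for state, exc in failures.items():
    print(f"{state}: {exc!r}")
```

In real use, `run_one` would wrap the `extract_cohort` call from the loop above, taking the state as its only argument.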
