Cohort Extraction

Manu Murugesan edited this page Mar 14, 2026 · 2 revisions

The cohort extraction module is the primary tool for building patient-level analytic files. It identifies patients matching diagnosis/procedure criteria, applies inclusion/exclusion filters, and exports the resulting claim files.

Basic Usage

from medicaid_utils.filters.patients.cohort_extraction import extract_cohort

extract_cohort(
    state="AL",
    lst_year=[2016, 2017, 2018],
    dct_diag_proc_codes=dct_codes,
    dct_filters=dct_filters,
    lst_types_to_export=["ip", "ot", "ps"],
    dct_data_paths=dct_paths,
    cms_format="TAF",
)
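The call above expects dct_codes, dct_filters, and dct_paths to exist already; the first two are described in the sections that follow. As a minimal sketch, dct_paths can be built with the two keys that appear in the Multiple States example (source_root and export_folder). The directory values are placeholders, the comments on what each key is for are an assumption based on the key names, and this page does not say whether other keys are accepted.

```python
# Placeholder locations; replace with directories on your system.
dct_paths = {
    "source_root": "/data/cms/",            # assumed: where the raw CMS files are read from
    "export_folder": "/output/cohort/AL/",  # assumed: where cohort and claim files are written
}
```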

Defining Diagnosis Codes

Use ICD-9 and/or ICD-10 codes with inclusion and exclusion logic. The keys 9 and 10 select the ICD version. Codes are matched by prefix: "250" matches "2500", "25000", "25002", etc.

dct_codes = {
    "diag_codes": {
        "diabetes_t2": {
            "incl": {
                9: ["250"],       # ICD-9 prefix
                10: ["E11"],      # ICD-10 prefix
            },
            "excl": {
                9: ["25001", "25003", "25011", "25013"],  # Odd 5th digits = Type 1
                10: ["E10"],      # Exclude Type 1
            },
        },
    },
    "proc_codes": {},
}
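To make the prefix and inclusion/exclusion logic concrete, here is a small standalone sketch. It is not the library's implementation: code_matches is a hypothetical helper, and it assumes that an exclusion prefix overrides an inclusion prefix for a single code. The module may apply exclusions at the claim or patient level instead.

```python
def code_matches(code, dct_rule, icd_version):
    """Return True if `code` starts with an inclusion prefix and no exclusion prefix."""
    incl = dct_rule.get("incl", {}).get(icd_version, [])
    excl = dct_rule.get("excl", {}).get(icd_version, [])
    code = code.strip().upper()
    included = any(code.startswith(prefix) for prefix in incl)
    excluded = any(code.startswith(prefix) for prefix in excl)
    return included and not excluded


# Same rule as the diabetes_t2 entry above.
rule = {
    "incl": {9: ["250"], 10: ["E11"]},
    "excl": {9: ["25001", "25003", "25011", "25013"], 10: ["E10"]},
}

print(code_matches("25000", rule, 9))   # True
print(code_matches("25001", rule, 9))   # False
print(code_matches("E119", rule, 10))   # True
print(code_matches("E109", rule, 10))   # False
```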

Defining Procedure Codes

Procedure codes are keyed by procedure coding system:

dct_codes = {
    "diag_codes": {},
    "proc_codes": {
        "methadone": {
            7: [  # ICD-10-PCS (system code 7)
                "HZ81ZZZ", "HZ84ZZZ", "HZ85ZZZ", "HZ86ZZZ",
            ],
        },
    },
}

Common procedure system codes:

  • 1 — CPT/HCPCS
  • 6 — ICD-9-CM procedure
  • 7 — ICD-10-PCS
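Bare integers such as 7 are easy to mistype, so one option is to name the system codes once and reuse them. The constant names below are hypothetical and not part of medicaid_utils; the values are the ones listed above.

```python
# System codes as listed on this page; the constant names are illustrative only.
PROC_SYS_CPT_HCPCS = 1
PROC_SYS_ICD9 = 6
PROC_SYS_ICD10_PCS = 7

dct_codes = {
    "diag_codes": {},
    "proc_codes": {
        "methadone": {
            PROC_SYS_ICD10_PCS: ["HZ81ZZZ", "HZ84ZZZ", "HZ85ZZZ", "HZ86ZZZ"],
        },
    },
}
```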

Defining Filters

Filters control which claims and patients are included:

dct_filters = {
    "cohort": {
        "ip": {
            "missing_dob": 0,                          # Exclude missing DOB
            "range_numeric_age_prncpl_proc": (18, 64), # Age 18-64
        },
        "ot": {
            "missing_dob": 0,
            "range_numeric_age_srvc_bgn": (18, 64),
        },
    },
    "export": {},
}

Filter Types

| Type | Example | Description |
| --- | --- | --- |
| Column value | "missing_dob": 0 | Keep rows where column equals value |
| Numeric range | "range_numeric_age_srvc_bgn": (18, 64) | Keep rows where column is within range (inclusive) |
| Date range | "range_date_srvc_bgn_date": ("20160101", "20181231") | Keep rows where date is within range |
| Exclusion | "excl_female": 1 | Exclude patients with positive exclusion flag |
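The following sketch shows one way the first three filter types could be evaluated against a single claim row. It is an illustration in plain Python, not the module's code: row_passes is a hypothetical helper, rows are plain dicts, the date range is assumed to be inclusive, and a plain key such as missing_dob is assumed to refer to the excl_missing_dob flag column.

```python
from datetime import date, datetime


def row_passes(row, dct_type_filters):
    """Return True if `row` (a dict of column -> value) satisfies every filter."""
    for key, value in dct_type_filters.items():
        if key.startswith("range_numeric_"):
            col = key[len("range_numeric_"):]
            low, high = value
            if row.get(col) is None or not (low <= row[col] <= high):
                return False
        elif key.startswith("range_date_"):
            col = key[len("range_date_"):]
            start = datetime.strptime(value[0], "%Y%m%d").date()
            end = datetime.strptime(value[1], "%Y%m%d").date()
            if row.get(col) is None or not (start <= row[col] <= end):
                return False
        else:
            # Assumption: a plain key refers to the excl_-prefixed flag column.
            col = key if key in row else f"excl_{key}"
            if row.get(col) != value:
                return False
    return True


filters = {
    "missing_dob": 0,
    "range_numeric_age_srvc_bgn": (18, 64),
    "range_date_srvc_bgn_date": ("20160101", "20181231"),
}
rows = [
    {"excl_missing_dob": 0, "age_srvc_bgn": 30, "srvc_bgn_date": date(2017, 5, 1)},
    {"excl_missing_dob": 1, "age_srvc_bgn": 30, "srvc_bgn_date": date(2017, 5, 1)},
    {"excl_missing_dob": 0, "age_srvc_bgn": 70, "srvc_bgn_date": date(2017, 5, 1)},
    {"excl_missing_dob": 0, "age_srvc_bgn": 64, "srvc_bgn_date": date(2019, 1, 1)},
]
print([row_passes(r, filters) for r in rows])  # [True, False, False, False]
```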

Output Files

After extraction, the export folder contains:

  • cohort_{STATE}.csv — patient-level file with condition flags, inclusion indicator, and date of birth
  • cohort_{STATE}_{YEAR}.csv — year-specific patient file
  • cohort_exclusions_{TYPE}_{STATE}_{YEAR}.parquet — filter statistics
  • Exported claim files in the requested format (CSV or Parquet)

Cohort File Columns

The cohort_{STATE}_{YEAR}.csv file contains these columns (indexed by BENE_MSIS):

| Column Pattern | Description |
| --- | --- |
| include | 1 if patient is included in the final cohort, 0 if excluded |
| YEAR | Claim year |
| STATE_CD | State code |
| birth_date | Date of birth (merged from PS file) |
| {type}_diag_{condition} | 1 if condition found in that claim type (e.g., ip_diag_diabetes_t2) |
| {type}_diag_{condition}_date | Date of first occurrence of the condition |
| {type}_proc_{procedure} | 1 if procedure found in that claim type (e.g., ip_proc_methadone) |
| {type}_proc_{procedure}_date | Date of first occurrence of the procedure |
| {type}_diag_condn | 1 if ANY diagnosis condition matched in that claim type |
| {type}_diag_condn_date | Date of first ANY diagnosis condition |
| {type}_proc_condn | 1 if ANY procedure condition matched in that claim type |
| {type}_proc_condn_date | Date of first ANY procedure condition |

Where {type} is the claim type (ip, ot, ot_line), {condition} comes from the keys of dct_diag_proc_codes["diag_codes"], and {procedure} comes from the keys of dct_diag_proc_codes["proc_codes"].
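As a sketch of how these columns might be consumed, the snippet below selects included patients with a given condition flag using only the standard library. The sample rows are made up for illustration, and the exact header spelling, date format, and flag encoding (for example "1" versus "1.0") in real output are assumptions you should check against your own files. included_with_flag is a hypothetical helper.

```python
import csv
import io

# Made-up rows for illustration only; real files come from extract_cohort.
sample_csv = """BENE_MSIS,include,YEAR,STATE_CD,birth_date,ip_diag_diabetes_t2,ip_diag_diabetes_t2_date
A001,1,2016,AL,1980-02-01,1,2016-03-15
A002,0,2016,AL,1975-07-20,1,2016-06-02
A003,1,2016,AL,1990-11-11,0,
"""


def included_with_flag(csv_file, flag_column):
    """Return BENE_MSIS ids of included patients whose `flag_column` is 1."""
    reader = csv.DictReader(csv_file)
    return [
        row["BENE_MSIS"]
        for row in reader
        if row["include"] == "1" and row[flag_column] == "1"
    ]


ids = included_with_flag(io.StringIO(sample_csv), "ip_diag_diabetes_t2")
print(ids)  # ['A001']
```

To read a real file, pass an open file handle for cohort_{STATE}_{YEAR}.csv instead of the io.StringIO object.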

Exclusion Columns on Claim Files

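During preprocessing, claims are flagged with exclusion columns (prefixed excl_). Filter keys in dct_filters omit the excl_ prefix, so the key missing_dob refers to the column excl_missing_dob. The exclusion columns available for each claim type are: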

IP claims: excl_missing_dob, excl_missing_admsn_date, excl_encounter_claim, excl_capitation_claim, excl_ffs_claim, excl_delivery, excl_female, excl_duplicated

OT claims: excl_missing_dob, excl_missing_srvc_bgn_date, excl_encounter_claim, excl_capitation_claim, excl_ffs_claim, excl_female, excl_duplicated

PS claims: excl_duplicated_bene_id
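Because a misspelled filter key is easy to miss, a quick check against the column lists above can help before a long extraction run. This is a hedged sketch: unknown_filter_keys and EXCL_COLUMNS are hypothetical names, and the check only knows about the exclusion flags listed on this page plus the range_numeric_ and range_date_ prefixes, so a key that targets some other valid column would be reported too.

```python
# Exclusion columns as listed on this page, keyed by claim type.
EXCL_COLUMNS = {
    "ip": [
        "excl_missing_dob", "excl_missing_admsn_date", "excl_encounter_claim",
        "excl_capitation_claim", "excl_ffs_claim", "excl_delivery",
        "excl_female", "excl_duplicated",
    ],
    "ot": [
        "excl_missing_dob", "excl_missing_srvc_bgn_date", "excl_encounter_claim",
        "excl_capitation_claim", "excl_ffs_claim", "excl_female", "excl_duplicated",
    ],
    "ps": ["excl_duplicated_bene_id"],
}
RANGE_PREFIXES = ("range_numeric_", "range_date_")


def unknown_filter_keys(claim_type, dct_type_filters):
    """Return filter keys that are neither range filters nor a listed exclusion flag."""
    known = {col[len("excl_"):] for col in EXCL_COLUMNS[claim_type]}
    return [
        key
        for key in dct_type_filters
        if not key.startswith(RANGE_PREFIXES) and key not in known
    ]


print(unknown_filter_keys("ip", {"missing_dob": 0, "range_numeric_age_prncpl_proc": (18, 64)}))  # []
print(unknown_filter_keys("ot", {"delivery": 0}))  # ['delivery']
```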

Multiple States

for state in ["AL", "IL", "CA", "NY", "TX"]:
    extract_cohort(
        state=state,
        lst_year=[2016, 2017, 2018],
        dct_diag_proc_codes=dct_codes,
        dct_filters=dct_filters,
        lst_types_to_export=["ip", "ot", "ps"],
        dct_data_paths={
            "source_root": "/data/cms/",
            "export_folder": f"/output/cohort/{state}/",
        },
        cms_format="TAF",
    )
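In the loop above, an exception for one state stops the remaining states. If that is not what you want, a small wrapper can record failures and continue. This is a sketch only: run_for_states is a hypothetical helper, and fake_extract stands in for a function of your own that calls extract_cohort with the arguments shown above.

```python
def run_for_states(lst_states, run_one):
    """Call run_one(state) for each state; return {state: error message} for failures."""
    dct_failures = {}
    for state in lst_states:
        try:
            run_one(state)
        except Exception as exc:  # keep going so one bad state does not stop the rest
            dct_failures[state] = f"{type(exc).__name__}: {exc}"
    return dct_failures


# Stand-in for a function that wraps extract_cohort(state=state, ...).
def fake_extract(state):
    if state == "IL":
        raise FileNotFoundError("no source files")


failures = run_for_states(["AL", "IL", "CA"], fake_extract)
print(failures)  # {'IL': 'FileNotFoundError: no source files'}
```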
