Cohort Extraction

The cohort extraction module is the primary tool for building patient-level analytic files. It identifies patients matching diagnosis/procedure criteria, applies inclusion/exclusion filters, and exports the resulting claim files.

Basic Usage

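from medicaid_utils.filters.patients.cohort_extraction import extract_cohort

# dct_codes and dct_filters are defined in the sections below.
# dct_paths holds "source_root" and "export_folder" (see Multiple States).
extract_cohort(
    state="AL",
    lst_year=[2016, 2017, 2018],
    dct_diag_proc_codes=dct_codes,
    dct_filters=dct_filters,
    lst_types_to_export=["ip", "ot", "ps"],
    dct_data_paths=dct_paths,
    cms_format="TAF",
)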

Defining Diagnosis Codes

Each condition maps "incl" and "excl" to lists of ICD-9 (key 9) and/or ICD-10 (key 10) codes. Codes are matched by prefix: "250" matches "2500", "25000", "25002", and so on.

dct_codes = {
    "diag_codes": {
        "diabetes_t2": {
            "incl": {
                9: ["250"],       # ICD-9 prefix
                10: ["E11"],      # ICD-10 prefix
            },
            "excl": {
                9: ["25001", "25003", "25011", "25013"],  # 5th digit 1 or 3 = Type 1 (250.0x and 250.1x only)
                10: ["E10"],      # Exclude Type 1
            },
        },
    },
    "proc_codes": {},
}
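
The inclusion and exclusion lists above can be read as a two-step test: a code qualifies if it starts with any inclusion prefix and with no exclusion prefix. The following is a minimal sketch of that reading, not the library's implementation. `code_matches` is a hypothetical helper, and the assumption that exclusions take precedence over inclusions is inferred from the example rather than stated in the source.

```python
def code_matches(code, incl, excl=()):
    """Return True if `code` starts with an inclusion prefix and no exclusion prefix.

    Hypothetical helper for illustration only; not part of medicaid_utils.
    """
    code = code.strip().upper()
    included = any(code.startswith(prefix) for prefix in incl)
    excluded = any(code.startswith(prefix) for prefix in excl)
    return included and not excluded


icd9_incl = ["250"]
icd9_excl = ["25001", "25003", "25011", "25013"]

type2_code = code_matches("25000", icd9_incl, icd9_excl)  # starts with "250", not excluded
type1_code = code_matches("25001", icd9_incl, icd9_excl)  # starts with an exclusion prefix
unrelated = code_matches("4019", icd9_incl, icd9_excl)    # starts with no inclusion prefix
```

Because matching is by prefix, a short exclusion prefix removes every longer code beneath it, so exclusion lists are best kept as specific as the clinical definition allows.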

Defining Procedure Codes

Procedure codes are keyed by procedure coding system:

dct_codes = {
    "diag_codes": {},
    "proc_codes": {
        "methadone": {
            7: [  # ICD-10-PCS (system code 7)
                "HZ81ZZZ", "HZ84ZZZ", "HZ85ZZZ", "HZ86ZZZ",
            ],
        },
    },
}

Common procedure system codes:

1 — CPT/HCPCS
6 — ICD-9-CM procedure
7 — ICD-10-PCS
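
The two examples above each fill only one half of the dictionary. Assuming both halves can be populated together (the structure suggests this, but the source does not show it), a combined definition might look like the sketch below. The variable names `conditions` and `procedures` are illustrative only.

```python
# Combined diagnosis and procedure definitions, reusing the values shown above
dct_codes = {
    "diag_codes": {
        "diabetes_t2": {
            "incl": {9: ["250"], 10: ["E11"]},
            "excl": {9: ["25001", "25003", "25011", "25013"], 10: ["E10"]},
        },
    },
    "proc_codes": {
        "methadone": {
            7: ["HZ81ZZZ", "HZ84ZZZ", "HZ85ZZZ", "HZ86ZZZ"],  # ICD-10-PCS
        },
    },
}

# The top-level keys under each half name the conditions and procedures
conditions = sorted(dct_codes["diag_codes"])
procedures = sorted(dct_codes["proc_codes"])
```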

Defining Filters

Filters control which claims and patients are included:

dct_filters = {
    "cohort": {
        "ip": {
            "missing_dob": 0,                          # Exclude missing DOB
            "range_numeric_age_prncpl_proc": (18, 64), # Age 18-64
        },
        "ot": {
            "missing_dob": 0,
            "range_numeric_age_srvc_bgn": (18, 64),
        },
    },
    "export": {},
}

Filter Types

Type	Example	Description
Column value	`"missing_dob": 0`	Keep rows where column equals value
Numeric range	`"range_numeric_age_srvc_bgn": (18, 64)`	Keep rows where column is within range (inclusive)
Date range	`"range_date_srvc_bgn_date": ("20160101", "20181231")`	Keep rows where date is within range
Exclusion	`"excl_female": 1`	Exclude patients with positive exclusion flag
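
To make the four filter types concrete, here is a rough sketch of how each one could be evaluated against a single row held as a plain dict. `row_passes` is a hypothetical function, not the library's code. The column lookup is simplified (keys are used as column names directly), and treating date bounds as inclusive is an assumption, since the table only states inclusivity for numeric ranges.

```python
from datetime import date, datetime


def row_passes(row, filters):
    """Apply the four filter types from the table to one row (a dict).

    Illustrative only: it does not reproduce how medicaid_utils
    resolves filter keys to columns.
    """
    for key, value in filters.items():
        if key.startswith("range_numeric_"):
            low, high = value
            if not low <= row[key[len("range_numeric_"):]] <= high:
                return False
        elif key.startswith("range_date_"):
            start, end = (datetime.strptime(v, "%Y%m%d").date() for v in value)
            if not start <= row[key[len("range_date_"):]] <= end:
                return False
        elif key.startswith("excl_"):
            # A filter value of 1 drops rows whose exclusion flag is set
            if value == 1 and row[key] == 1:
                return False
        elif row[key] != value:
            # Plain column-value filter: keep only rows equal to the value
            return False
    return True


filters = {
    "missing_dob": 0,
    "range_numeric_age_srvc_bgn": (18, 64),
    "range_date_srvc_bgn_date": ("20160101", "20181231"),
    "excl_female": 1,
}
row = {
    "missing_dob": 0,
    "age_srvc_bgn": 64,
    "srvc_bgn_date": date(2017, 5, 1),
    "excl_female": 0,
}
kept = row_passes(row, filters)
dropped = row_passes({**row, "age_srvc_bgn": 65}, filters)
```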

Output Files

After extraction, the export folder contains:

cohort_{STATE}.csv — patient-level file with condition flags, inclusion indicator, and date of birth
cohort_{STATE}_{YEAR}.csv — year-specific patient file
cohort_exclusions_{TYPE}_{STATE}_{YEAR}.parquet — filter statistics
Exported claim files in the requested format (CSV or Parquet)
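
For scripting checks after a run, the naming patterns above can be expanded into concrete paths. This is a sketch under assumptions: `expected_outputs` is a hypothetical helper, the letter case of {TYPE} and the set of claim types that receive an exclusions file are not specified in the source, and exported claim files are left out because their names are not listed here.

```python
from pathlib import PurePosixPath


def expected_outputs(export_folder, state, years, claim_types):
    """List output paths that follow the naming patterns above (hypothetical helper)."""
    folder = PurePosixPath(export_folder)
    paths = [folder / f"cohort_{state}.csv"]
    for year in years:
        paths.append(folder / f"cohort_{state}_{year}.csv")
        for claim_type in claim_types:
            paths.append(
                folder / f"cohort_exclusions_{claim_type}_{state}_{year}.parquet"
            )
    return [str(path) for path in paths]


outputs = expected_outputs("/output/cohort/AL", "AL", [2016, 2017], ["ip", "ot"])
```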

Cohort File Columns

The cohort_{STATE}_{YEAR}.csv file contains these columns (indexed by BENE_MSIS):

Column Pattern	Description
`include`	1 if patient is included in the final cohort, 0 if excluded
`YEAR`	Claim year
`STATE_CD`	State code
`birth_date`	Date of birth (merged from PS file)
`{type}_diag_{condition}`	1 if condition found in that claim type (e.g., `ip_diag_diabetes_t2`)
`{type}_diag_{condition}_date`	Date of first occurrence of the condition
`{type}_proc_{procedure}`	1 if procedure found in that claim type (e.g., `ip_proc_methadone`)
`{type}_proc_{procedure}_date`	Date of first occurrence of the procedure
`{type}_diag_condn`	1 if ANY diagnosis condition matched in that claim type
`{type}_diag_condn_date`	Date of first ANY diagnosis condition
`{type}_proc_condn`	1 if ANY procedure condition matched in that claim type
`{type}_proc_condn_date`	Date of first ANY procedure condition

Here {type} is the claim type (ip, ot, ot_line), {condition} is a key of dct_codes["diag_codes"], and {procedure} is a key of dct_codes["proc_codes"].
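
The column patterns in the table can be expanded mechanically from the code dictionary. The sketch below shows one way to do that. `expected_flag_columns` is a hypothetical helper, and whether the library emits the any-match (`condn`) columns when no codes of that kind are defined is not stated in the source, so this version always lists them.

```python
def expected_flag_columns(dct_codes, claim_types):
    """Build flag and date column names following the patterns in the table.

    Hypothetical helper for illustration; not part of medicaid_utils.
    """
    columns = []
    for claim_type in claim_types:
        for kind, key in (("diag", "diag_codes"), ("proc", "proc_codes")):
            # "condn" is the any-match column for this claim type and kind
            names = list(dct_codes.get(key, {})) + ["condn"]
            for name in names:
                columns.append(f"{claim_type}_{kind}_{name}")
                columns.append(f"{claim_type}_{kind}_{name}_date")
    return columns


example_codes = {
    "diag_codes": {"diabetes_t2": {}},
    "proc_codes": {"methadone": {}},
}
cols = expected_flag_columns(example_codes, ["ip", "ot"])
```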

Exclusion Columns on Claim Files

During preprocessing, claims are flagged with exclusion columns (prefixed excl_), listed below by claim type. Filter keys in dct_filters omit the excl_ prefix, so "missing_dob": 0 refers to the excl_missing_dob column.

IP claims: excl_missing_dob, excl_missing_admsn_date, excl_encounter_claim, excl_capitation_claim, excl_ffs_claim, excl_delivery, excl_female, excl_duplicated

OT claims: excl_missing_dob, excl_missing_srvc_bgn_date, excl_encounter_claim, excl_capitation_claim, excl_ffs_claim, excl_female, excl_duplicated

PS claims: excl_duplicated_bene_id
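
Since a filter key differs from its column name only by the excl_ prefix, a small check can catch misspelled keys before a long extraction run. This is a sketch, not a library feature: `unknown_filter_keys` and `EXCL_COLUMNS` are hypothetical names, the column sets are copied from the lists above, and keys are accepted both with and without the prefix because the Filter Types table shows a prefixed example.

```python
EXCL_COLUMNS = {
    "ip": {
        "excl_missing_dob", "excl_missing_admsn_date", "excl_encounter_claim",
        "excl_capitation_claim", "excl_ffs_claim", "excl_delivery",
        "excl_female", "excl_duplicated",
    },
    "ot": {
        "excl_missing_dob", "excl_missing_srvc_bgn_date", "excl_encounter_claim",
        "excl_capitation_claim", "excl_ffs_claim", "excl_female",
        "excl_duplicated",
    },
    "ps": {"excl_duplicated_bene_id"},
}


def unknown_filter_keys(claim_type, filters):
    """Return filter keys that match no listed exclusion column (hypothetical helper)."""
    known = EXCL_COLUMNS.get(claim_type, set())
    unknown = []
    for key in filters:
        if key.startswith("range_"):
            continue  # range filters target data columns, not exclusion flags
        column = key if key.startswith("excl_") else f"excl_{key}"
        if column not in known:
            unknown.append(key)
    return unknown


suspect = unknown_filter_keys(
    "ip",
    {"missing_dob": 0, "range_numeric_age_prncpl_proc": (18, 64), "delivry": 0},
)
```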

Multiple States

for state in ["AL", "IL", "CA", "NY", "TX"]:
    extract_cohort(
        state=state,
        lst_year=[2016, 2017, 2018],
        dct_diag_proc_codes=dct_codes,
        dct_filters=dct_filters,
        lst_types_to_export=["ip", "ot", "ps"],
        dct_data_paths={
            "source_root": "/data/cms/",
            "export_folder": f"/output/cohort/{state}/",
        },
        cms_format="TAF",
    )
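
With many states, one failure (for example, missing source files) would stop the loop above. A possible variation records failures and continues. This is only a sketch: `run_for_states` and `build_paths` are hypothetical names, and `fake_extract` stands in for `extract_cohort` so the example runs without CMS data.

```python
def run_for_states(states, extract_fn, build_paths, **common):
    """Call extract_fn once per state, collecting failures instead of stopping.

    Hypothetical wrapper; in practice extract_fn would be extract_cohort.
    """
    failures = {}
    for state in states:
        try:
            extract_fn(state=state, dct_data_paths=build_paths(state), **common)
        except Exception as exc:  # broad on purpose: record the error and move on
            failures[state] = repr(exc)
    return failures


def build_paths(state):
    """Per-state paths, mirroring the dictionary used in the loop above."""
    return {
        "source_root": "/data/cms/",
        "export_folder": f"/output/cohort/{state}/",
    }


def fake_extract(state, dct_data_paths, **kwargs):
    """Stand-in for extract_cohort that fails for one state."""
    if state == "CA":
        raise FileNotFoundError(f"no source files for {state}")


failures = run_for_states(
    ["AL", "IL", "CA"], fake_extract, build_paths, lst_year=[2016, 2017, 2018]
)
```

The returned dictionary maps each failed state to its error message, which makes it straightforward to rerun only those states.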
