
MAX vs TAF

Manu Murugesan edited this page Mar 14, 2026 · 6 revisions

MAX vs TAF: Understanding the Two CMS File Formats

medicaid-utils supports both Medicaid file formats published by CMS. Understanding their differences is essential for working with Medicaid claims data.

Overview

| Feature | MAX (Medicaid Analytic eXtract) | TAF (T-MSIS Analytic Files) |
| --- | --- | --- |
| Years available | 1999–2015 | 2014–present |
| Diagnosis coding | Primarily ICD-9-CM | Primarily ICD-10-CM |
| File structure | Single flat file per claim type | Multiple sub-files per claim type |
| Beneficiary ID | BENE_MSIS, BENE_ID, or MSIS_ID | BENE_MSIS, BENE_ID, or MSIS_ID |
| Raw CMS claim types | IP, OT, RX, PS, CC | IP, OT, LT, RX, DE (person summary) |
| Supported in medicaid-utils | IP, OT, PS, CC | IP, OT, LT, RX, PS |
| Diagnosis columns | DIAG_CD_1–DIAG_CD_9 | DGNS_CD_1–DGNS_CD_12 |
| Procedure columns | PRCDR_CD_1–PRCDR_CD_6 | PRCDR_CD_1–PRCDR_CD_6, LINE_PRCDR_CD |
| Date columns | SRVC_BGN_DT, ADMSN_DT | SRVC_BGN_DT, ADMSN_DT |

Key Differences in Code

Accessing DataFrames

MAX — Single DataFrame accessible via .df:

from medicaid_utils.preprocessing import max_ip

ip = max_ip.MAXIP(year=2012, state="WY", data_root="/data/cms")
df = ip.df  # Single Dask DataFrame

TAF — Multiple sub-file DataFrames in .dct_files:

from medicaid_utils.preprocessing import taf_ip

ip = taf_ip.TAFIP(year=2019, state="AL", data_root="/data/cms")
df_base = ip.dct_files["base"]             # Header/base records
df_line = ip.dct_files["line"]             # Line-level detail
df_dx   = ip.dct_files["base_diag_codes"]  # Diagnosis codes
df_ndc  = ip.dct_files["line_ndc_codes"]   # NDC codes
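
Code that has to handle both formats can branch on which attribute is present. The following is a minimal sketch that relies only on the `.df` and `.dct_files` attributes shown above. `base_frame` is a hypothetical helper, not part of medicaid-utils, and it assumes MAX objects do not also expose `dct_files`. The stand-in objects exist only so the snippet runs without CMS data.

```python
from types import SimpleNamespace


def base_frame(claim):
    """Return the header-level table for a MAX or TAF claim object.

    Hypothetical helper: relies only on `.df` (MAX) and
    `.dct_files["base"]` (TAF) as described above.
    """
    dct_files = getattr(claim, "dct_files", None)
    if dct_files is not None:
        # TAF: header/base records live under the "base" key
        return dct_files["base"]
    # MAX: a single flat DataFrame
    return claim.df


# Stand-ins for MAXIP / TAFIP objects; strings take the place of real DataFrames
max_like = SimpleNamespace(df="max frame")
taf_like = SimpleNamespace(dct_files={"base": "taf base frame", "line": "taf line frame"})

print(base_frame(max_like))
print(base_frame(taf_like))
```

With real objects, the same call would return a Dask DataFrame in both cases, so downstream code can stay format-agnostic at the header level.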

Specifying Format

Most functions accept a cms_format parameter, set to "MAX" or "TAF":

# MAX (after constructing LST_DIAG_CD from DIAG_CD_* columns)
score(ip.df, lst_diag_col_name="LST_DIAG_CD", cms_format="MAX")

# TAF (after calling ip.gather_bene_level_diag_ndc_codes())
score(ip.dct_files["base_diag_codes"], lst_diag_col_name="LST_DIAG_CD", cms_format="TAF")
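
The MAX comment above mentions building LST_DIAG_CD from the DIAG_CD_* columns. One way this could look for a single claim record is sketched below. The helper name `collect_diag_codes`, the comma-separated output, and the de-duplication are all assumptions for illustration; check the representation that the scoring function you call actually expects.

```python
def collect_diag_codes(row, n_cols=9):
    """Gather non-empty DIAG_CD_1..DIAG_CD_n values from one claim record.

    Hypothetical helper: `row` is a plain dict here; with a DataFrame the
    same logic would be applied row-wise.
    """
    codes = []
    for i in range(1, n_cols + 1):
        value = row.get(f"DIAG_CD_{i}")
        # Skip missing values; `value != value` is True only for NaN
        if value is None or value != value:
            continue
        code = str(value).strip()
        # Keep first occurrence only, preserving column order
        if code and code not in codes:
            codes.append(code)
    return ",".join(codes)


claim = {"DIAG_CD_1": "4019", "DIAG_CD_2": "25000", "DIAG_CD_3": "", "DIAG_CD_4": "4019"}
print(collect_diag_codes(claim))
```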

Cohort Extraction

The extract_cohort function handles format differences internally:

# Just change cms_format — the rest of the API is the same
extract_cohort(state="WY", lst_year=[2012], cms_format="MAX", ...)
extract_cohort(state="AL", lst_year=[2019], cms_format="TAF", ...)

TAF Sub-File Types

Each TAF claim type is split into sub-files:

| Suffix | Description | Dict Key |
| --- | --- | --- |
| h (e.g., iph) | Header/base | "base" |
| l (e.g., ipl) | Line-level detail | "line" |
| occr (e.g., ipoccr) | Occurrence codes | "occurrence_code" |
| dx (e.g., ipdx) | Diagnosis codes | "base_diag_codes" |
| ndc (e.g., ipndc) | NDC codes | "line_ndc_codes" |
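
The suffix-to-key mapping in the table above can be written as a small lookup. This is a sketch only: `SUBFILE_KEYS` and `subfile_key` are hypothetical names, not medicaid-utils API, and the examples stick to the IP tags listed in the table, since other claim types may not carry every sub-file.

```python
# Suffix -> dct_files key, copied from the table above
SUBFILE_KEYS = {
    "h": "base",
    "l": "line",
    "occr": "occurrence_code",
    "dx": "base_diag_codes",
    "ndc": "line_ndc_codes",
}


def subfile_key(file_tag, claim_type="ip"):
    """Map a sub-file tag such as "ipdx" to its dct_files key."""
    tag = file_tag.lower()
    if not tag.startswith(claim_type):
        raise ValueError(f"{file_tag!r} does not start with {claim_type!r}")
    # Whatever follows the claim-type prefix is the sub-file suffix
    suffix = tag[len(claim_type):]
    if suffix not in SUBFILE_KEYS:
        raise ValueError(f"unknown sub-file suffix {suffix!r}")
    return SUBFILE_KEYS[suffix]


for tag in ("iph", "ipl", "ipoccr", "ipdx", "ipndc"):
    print(tag, "->", subfile_key(tag))
```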

Which Format Should I Use?

  • ICD-9 studies (before October 2015): Use MAX data
  • ICD-10 studies (October 2015 onward): Use TAF data
  • Cross-era studies: Use both, and supply both ICD-9 and ICD-10 code mappings in your dct_diag_proc_codes
  • Pharmacy studies: Use TAF. medicaid-utils implements TAF RX preprocessing via TAFRX; MAX RX data exists at CMS but is not yet supported by the package
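
For cross-era work it can help to compute which formats cover a given year. The sketch below uses only the year ranges from the overview table (MAX 1999–2015, TAF 2014 onward). `available_formats` is a hypothetical helper, and actual availability for a particular state and year may differ from these nominal ranges.

```python
# Year ranges taken from the overview table; TAF is open-ended ("2014–present")
MAX_YEARS = range(1999, 2016)  # 1999 through 2015 inclusive
TAF_FIRST_YEAR = 2014


def available_formats(year):
    """Return the cms_format values whose nominal range covers `year`."""
    formats = []
    if year in MAX_YEARS:
        formats.append("MAX")
    if year >= TAF_FIRST_YEAR:
        formats.append("TAF")
    return formats


for year in (2012, 2014, 2019):
    print(year, available_formats(year))
```

Years 2014 and 2015 fall in both ranges, so a study spanning the transition has to decide which format to prefer for the overlap.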

Beneficiary ID (BENE_MSIS)

BENE_MSIS is a composite identifier constructed by medicaid-utils (not a raw CMS column). It applies to both MAX and TAF:

BENE_MSIS = STATE_CD + "-" + HAS_BENE + "-" + (BENE_ID or MSIS_ID)
  • BENE_ID — CMS-assigned, intended to be unique across states and years
  • MSIS_ID — State-assigned, unique only within a state and year
  • HAS_BENE — 1 if BENE_ID exists, 0 otherwise (falls back to MSIS_ID)

Example: "AL-1-123456789" (Alabama, has BENE_ID, ID is 123456789)
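
The composition rule above can be sketched for a single record as follows. `make_bene_msis` is a hypothetical function, not the package's implementation, and treating `None` or a blank string as "no BENE_ID" is an assumption about how missing values appear.

```python
def make_bene_msis(state_cd, bene_id, msis_id):
    """Build STATE_CD-HAS_BENE-ID, falling back to MSIS_ID without a BENE_ID."""
    # Assumption: a missing BENE_ID shows up as None or a blank string
    has_bene = bene_id is not None and str(bene_id).strip() != ""
    identifier = bene_id if has_bene else msis_id
    return f"{state_cd}-{int(has_bene)}-{identifier}"


print(make_bene_msis("AL", "123456789", "A001"))  # BENE_ID present
print(make_bene_msis("AL", None, "A001"))         # falls back to MSIS_ID
```

The first call reproduces the example above; the second shows the fallback, where HAS_BENE is 0 and the state-assigned MSIS_ID is used.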

The index_col parameter on all claim classes accepts any of the three IDs: "BENE_MSIS", "BENE_ID", or "MSIS_ID". The default is "BENE_MSIS".

Column Name Reference

See Glossary for the complete column name mapping between MAX and TAF.
