# 00 - Process Single Year CVD Mortality Data

## Motivation

In this project, we are predicting county-level cardiovascular disease (CVD) mortality rates in the United States.

We use **age-standardized mortality rates** to ensure fair comparisons across counties with different age distributions.

This notebook extracts the CVD mortality data for each individual year (2012-2019), which will later be combined into a single dataset.

In [None]:
import pandas as pd
import os

## Processing Function

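The function below extracts CVD mortality data for a single year. It:
- Filters for `race_name == 'Total'` (all races combined)
- Filters for `age_name == 'Age-standardized'` (age-standardized mortality rate)
- Drops rows that contain missing values
- Keeps only county-level data (`fips > 60`), which excludes state-level rows (state FIPS codes run from 1 to 56)
- Removes unnecessary columns and renames `val` to `cvd_mortality_rate`
- Saves the result as a CSV file and returns the processed DataFrame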
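
To make the row filters concrete, here is a minimal sketch on a small synthetic table. The rows and values are invented for illustration and are not IHME data; the helper name `filter_county_rows` is hypothetical and only mirrors the filtering and renaming steps listed above.

```python
import pandas as pd

# Synthetic rows for illustration only; the values are made up.
toy = pd.DataFrame({
    "location_name": ["Alabama", "Autauga County", "Autauga County", "Baldwin County"],
    "fips": [1, 1001, 1001, 1003],
    "race_name": ["Total", "Total", "Total", "Latino"],
    "age_name": ["Age-standardized", "Age-standardized", "All Ages", "Age-standardized"],
    "val": [0.0031, 0.0030, 0.0041, 0.0012],
})


def filter_county_rows(df):
    """Keep county-level, all-race, age-standardized rows and rename the target."""
    mask = (
        (df["race_name"] == "Total")
        & (df["age_name"] == "Age-standardized")
        & (df["fips"] > 60)  # state FIPS codes are at most 56
    )
    return df.loc[mask].rename(columns={"val": "cvd_mortality_rate"})


result = filter_county_rows(toy)
print(result)
```

Only the Autauga County row with `race_name == 'Total'` and `age_name == 'Age-standardized'` survives: the state row fails the FIPS check, and the other two rows fail the age or race filter.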

In [None]:
def process_cvd_mortality_data(year):
    """
    Processes CVD mortality data for a specific year.
    
    Parameters:
    year (int): The year for which the data needs to be processed (e.g., 2012).
    
    Returns:
    pandas.DataFrame: The processed data, which is also saved as a CSV file in the output directory.
    """
    # Construct the input file path dynamically
    input_file_path = f"../data_cvd/raw/IHME_USA_COD_COUNTY_RACE_ETHN_2000_2019_MX_CVD_BOTH/IHME_USA_COD_COUNTY_RACE_ETHN_2000_2019_MX_{year}_CVD_BOTH_Y2023M06D12.CSV"
    
    # Read the raw data
    df = pd.read_csv(input_file_path)
    print(f"Loaded {len(df)} rows for year {year}")
    
    # Filter for Total race and Age-standardized mortality rate
    df_filtered = df[(df['race_name'] == 'Total') & (df['age_name'] == 'Age-standardized')]
    print(f"After filtering for Total race and Age-standardized: {len(df_filtered)} rows")
    
    # Drop rows that contain any missing values
    df_clean = df_filtered.dropna()
    print(f"After dropping NaN: {len(df_clean)} rows")
    
    # Keep only county-level data (fips > 60)
    df_counties = df_clean[df_clean['fips'] > 60]
    print(f"After filtering for county-level (fips > 60): {len(df_counties)} rows")
    
    # Drop unnecessary columns
    columns_to_drop = ['measure_id', 'location_id', 'fips', 'measure_name', 'race_id',
                       'race_name', 'sex_id', 'sex_name', 'age_group_id', 'age_name',
                       'cause_id', 'cause_name', 'metric_id', 'metric_name', 'upper', 'lower']
    df_final = df_counties.drop(columns=columns_to_drop)
    
    # Rename the 'val' column to 'cvd_mortality_rate'
    df_final = df_final.rename(columns={'val': 'cvd_mortality_rate'})
    
    # Output file path for the processed CSV
    output_file_path = f"../data_cvd/processed/cvd_single_year/cvd_mortality_{year}.csv"
    
    # Ensure output directory exists
    os.makedirs(os.path.dirname(output_file_path), exist_ok=True)
    
    # Save the processed data to a CSV file
    df_final.to_csv(output_file_path, index=False)
    
    print(f"Processed CVD mortality data for {year} saved to {output_file_path}")
    print(f"Final dataset shape: {df_final.shape}")
    print("---")
    
    return df_final

## Test with a Single Year

In [None]:
# Test with 2012 first
df_2012 = process_cvd_mortality_data(2012)
df_2012.head()

In [None]:
# Check statistics of the CVD mortality rate
print("CVD Mortality Rate Statistics (2012):")
print(df_2012['cvd_mortality_rate'].describe())

## Process All Years (2012-2019)

We process years 2012-2019 to match the time period of our ACS and atmospheric data.

In [None]:
# Process all years from 2012 to 2019
for year in range(2012, 2020):
    process_cvd_mortality_data(year)

## Verify Output Files

In [None]:
# List all generated files with their row counts (os and pandas are imported in the first cell)
output_dir = "../data_cvd/processed/cvd_single_year/"
files = sorted(os.listdir(output_dir))
print("Generated files:")
for f in files:
    if f.endswith('.csv'):
        filepath = os.path.join(output_dir, f)
        df = pd.read_csv(filepath)
        print(f"  {f}: {len(df)} rows")
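
Since the per-year files are meant to be combined into a single dataset later, the following is a rough sketch of how that step might look. The helper name `combine_cvd_years` is hypothetical, and the sketch assumes only the `cvd_mortality_{year}.csv` naming used above. Whether the processed files already carry a `year` column is not confirmed here, so the helper adds one only when it is absent.

```python
import os

import pandas as pd


def combine_cvd_years(input_dir, years):
    """Concatenate per-year CSVs into one DataFrame and report missing years."""
    frames, missing = [], []
    for year in years:
        path = os.path.join(input_dir, f"cvd_mortality_{year}.csv")
        if not os.path.exists(path):
            missing.append(year)
            continue
        df_year = pd.read_csv(path)
        # Assumption: add a year column only if the file does not have one.
        if "year" not in df_year.columns:
            df_year["year"] = year
        frames.append(df_year)
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, missing
```

Calling `combine_cvd_years("../data_cvd/processed/cvd_single_year/", range(2012, 2020))` would return the stacked data together with a list of any years whose file was not found, which doubles as a check that all eight years were written.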