# 03 - Combine Feature Files by Year for CVD Mortality

## Purpose
This notebook combines individual feature CSV files for each year (2012–2019) into consolidated datasets. It performs two main operations:

1. **Horizontal Merge**: Combines all 19 feature files for each year into a single dataset per year
2. **Vertical Stack**: Combines all yearly datasets into one master dataset containing all years
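
The two operations can be illustrated on a toy example. This is a minimal sketch, not the notebook's actual pipeline: the frame names (`poverty_2012`, `income_2012`) are hypothetical, the values are borrowed from the sample rows shown later in this notebook, and the merge uses a single `Fips` key for brevity rather than the full identifier set.

```python
import pandas as pd

# Hypothetical toy inputs: two feature tables for one year, keyed by county FIPS
poverty_2012 = pd.DataFrame({"Fips": [1001, 1003], "Poverty Rate": [11.6, 13.3]})
income_2012 = pd.DataFrame(
    {"Fips": [1001, 1003], "Median Household Income": [53773.0, 50706.0]}
)

# Horizontal merge: same rows (counties), more columns (features)
year_2012 = pd.merge(poverty_2012, income_2012, on="Fips", how="inner")
year_2012["Year"] = 2012

# A second "year" that reuses the same values, purely for illustration
year_2013 = year_2012.assign(Year=2013)

# Vertical stack: same columns, more rows (county-year observations)
all_years = pd.concat([year_2012, year_2013], axis=0, ignore_index=True)

print(year_2012.shape)  # (2, 4)
print(all_years.shape)  # (4, 4)
```

The merge widens the table and the concat lengthens it, which is the same pattern the two functions below apply to the full set of files.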

## Input
- 152 CSV files from notebook 02: 19 variables × 8 years (2012–2019)
- Each file contains: County, State, State_FIPS, County_FIPS, Fips, cvd_mortality_rate, year, and one ACS variable
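
Before merging, it may be worth confirming that every year really has the expected 19 files. The sketch below assumes the `*_<year>.csv` naming pattern used by the glob later in this notebook; the helper names `count_feature_files` and `find_incomplete_years` are hypothetical and not part of the original code.

```python
from pathlib import Path


def count_feature_files(directory, years):
    """Return {year: number of files matching *_<year>.csv in directory}."""
    directory = Path(directory)
    return {year: len(list(directory.glob(f"*_{year}.csv"))) for year in years}


def find_incomplete_years(counts, expected=19):
    """Return only the years whose file count differs from the expected count."""
    return {year: n for year, n in counts.items() if n != expected}


# Example usage (path taken from this notebook's setup cell):
# counts = count_feature_files(
#     "../data_cvd/processed/acs_individual_variables", range(2012, 2020)
# )
# print(find_incomplete_years(counts))
```

If every year has all 19 files, `find_incomplete_years` returns an empty dictionary.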

## Output
- **8 yearly files**: `combined_features_2012.csv` through `combined_features_2019.csv` 
  - Location: `data_cvd/processed/combined_by_year/`
- **1 master file**: `combined_all_years.csv` containing all years stacked vertically
  - Location: `data_cvd/processed/combined_by_year/`

## 1. Import Libraries

In [1]:
import pandas as pd
from pathlib import Path

## 2. Setup Directories

In [2]:
# Directory containing individual feature CSV files from notebook 02
directory = Path("../data_cvd/processed/acs_individual_variables")

# Output directory for combined datasets
output_directory = Path("../data_cvd/processed/combined_by_year") 

# Create output directory if it doesn't exist
output_directory.mkdir(parents=True, exist_ok=True)
print(f"Output directory ready: {output_directory}")

Output directory ready: ../data_cvd/processed/combined_by_year


## 3. Function 1: Combine Features for a Single Year

This function performs horizontal merging of all feature files for a specific year.

In [3]:
def combine_features_for_year(year):
    """
    Combines all feature files for a single year by merging them horizontally.
    
    Parameters:
    -----------
    year : int
        The year to process (e.g., 2012, 2013, ...)
    
    Returns:
    --------
    pd.DataFrame
        Combined DataFrame with all features for the given year
    """
    # Get all CSV files for the given year
    files = list(directory.glob(f"*_{year}.csv"))
    
    if len(files) == 0:
        raise ValueError(f"No files found for year {year} in {directory}")
    
    print(f"\nProcessing {year}: Found {len(files)} feature files")
    
    # Read all files for the year
    dataframes = [pd.read_csv(file) for file in files]
    
    # Start with the first DataFrame
    combined_year_df = dataframes[0]
    print(f"  Starting with {combined_year_df.shape[0]} counties")
    
    # Merge all subsequent DataFrames horizontally.
    # The join keys are the county identifier columns only; the target
    # (cvd_mortality_rate) and year columns are deliberately left out of the keys.
    # An 'inner' join keeps only counties present in every feature file.
    for i, df in enumerate(dataframes[1:], start=2):
        combined_year_df = pd.merge(
            combined_year_df, 
            df,
            on=["County", "State", "State_FIPS", "County_FIPS", "Fips"],
            how="inner",  # Only keep counties with data in ALL features
            suffixes=('', '_drop')  # Handle duplicate columns if any
        )
        
        # Drop the repeated non-key columns (e.g., cvd_mortality_rate, year) that
        # received the '_drop' suffix, keeping the copy from the first file
        cols_to_drop = [col for col in combined_year_df.columns if col.endswith('_drop')]
        if cols_to_drop:
            combined_year_df = combined_year_df.drop(columns=cols_to_drop)
    
    print(f"  After merging: {combined_year_df.shape[0]} counties with {combined_year_df.shape[1]} columns")
    
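    # Add a capitalized Year column for temporal identification.
    # Note: the input files already carry a lowercase 'year' column,
    # so both 'year' and 'Year' appear in the output.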
    combined_year_df["Year"] = year
    
    # Save the combined file for the year
    combined_year_filepath = output_directory / f"combined_features_{year}.csv"
    combined_year_df.to_csv(combined_year_filepath, index=False)
    print(f"  Saved: {combined_year_filepath}")
    
    return combined_year_df
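
Because the merge above is an inner join, a county missing from any one feature file is silently dropped. One way to see which counties would be lost is an outer merge on the key columns with `indicator=True`. This is a minimal sketch under stated assumptions: the helper `report_unmatched` and the toy frames are hypothetical, and only the key column names come from the function above.

```python
import pandas as pd

# Key columns, as used in combine_features_for_year
KEYS = ["County", "State", "State_FIPS", "County_FIPS", "Fips"]


def report_unmatched(left, right, keys=KEYS):
    """Return the key rows that appear in only one of the two frames."""
    merged = pd.merge(left[keys], right[keys], on=keys, how="outer", indicator=True)
    return merged[merged["_merge"] != "both"]


# Hypothetical toy frames: the second one is missing Baldwin County
a = pd.DataFrame(
    {
        "County": ["Autauga County", "Baldwin County"],
        "State": ["Alabama", "Alabama"],
        "State_FIPS": [1, 1],
        "County_FIPS": [1, 3],
        "Fips": [1001, 1003],
    }
)
b = a.iloc[[0]]

unmatched = report_unmatched(a, b)
print(unmatched[["Fips", "_merge"]])
```

Here the report contains one row (Fips 1003, marked `left_only`). `pd.merge` also accepts `validate="one_to_one"`, which raises an error if the keys are duplicated on either side; that could be a useful guard against accidental row multiplication.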

## 4. Function 2: Combine All Years

This function performs vertical stacking of all yearly combined datasets.

In [4]:
def combine_all_years(start_year, end_year):
    """
    Combines all years by stacking each year's horizontally combined features vertically.
    
    Parameters:
    -----------
    start_year : int
        First year to include (e.g., 2012)
    end_year : int
        Last year to include (e.g., 2019)
    
    Returns:
    --------
    pd.DataFrame
        Combined DataFrame with all years stacked vertically
    """
    all_years_dataframes = []
    
    print("=" * 70)
    print(f"COMBINING FEATURES FOR YEARS {start_year} TO {end_year}")
    print("=" * 70)
    
    # Process each year
    for year in range(start_year, end_year + 1):
        yearly_df = combine_features_for_year(year)
        all_years_dataframes.append(yearly_df)
    
    print("\n" + "=" * 70)
    print("STACKING ALL YEARS VERTICALLY")
    print("=" * 70)
    
    # Combine all years vertically (stack rows)
    combined_all_years_df = pd.concat(all_years_dataframes, axis=0, ignore_index=True)
    
    print(f"\nFinal dataset shape: {combined_all_years_df.shape}")
    print(f"  - Total rows (county-year observations): {combined_all_years_df.shape[0]}")
    print(f"  - Total columns (features + identifiers): {combined_all_years_df.shape[1]}")
    
    # Save the final combined file
    combined_all_years_filepath = output_directory / "combined_all_years.csv"
    combined_all_years_df.to_csv(combined_all_years_filepath, index=False)
    print(f"\nSaved master file: {combined_all_years_filepath}")
    
    return combined_all_years_df
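
`pd.concat` aligns frames by column name, so a column that is spelled differently in one year is not an error: it becomes a separate column padded with NaN. A quick consistency check before stacking can catch this. The sketch below is illustrative only; `check_same_columns` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd


def check_same_columns(frames_by_year):
    """Compare each year's columns with the earliest year's columns.

    Returns {year: (missing_columns, extra_columns)} for years that differ,
    or an empty dict when every year has the same column set.
    """
    years = sorted(frames_by_year)
    reference = set(frames_by_year[years[0]].columns)
    problems = {}
    for year in years[1:]:
        cols = set(frames_by_year[year].columns)
        if cols != reference:
            problems[year] = (sorted(reference - cols), sorted(cols - reference))
    return problems


# Hypothetical toy frames: 2013 spells the feature name with a lowercase "r"
frames = {
    2012: pd.DataFrame({"Fips": [1001], "Poverty Rate": [10.0]}),
    2013: pd.DataFrame({"Fips": [1001], "Poverty rate": [10.5]}),
}

problems = check_same_columns(frames)
print(problems)  # {2013: (['Poverty Rate'], ['Poverty rate'])}

# Stacking anyway yields three columns and two NaN cells
stacked = pd.concat(list(frames.values()), axis=0, ignore_index=True)
print(stacked.isnull().sum().sum())  # 2
```

An empty result from `check_same_columns` suggests the vertical stack will not introduce missing values through column misalignment.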

## 5. Execute: Combine All Features for 2012-2019

In [5]:
# Combine all files from 2012 to 2019
final_combined_df = combine_all_years(2012, 2019)

# Display summary information
print("\n" + "=" * 70)
print("DATASET SUMMARY")
print("=" * 70)
print("\nColumns in final dataset:")
print(final_combined_df.columns.tolist())
print("\nData types:")
print(final_combined_df.dtypes)
print("\nMissing values per column:")
print(final_combined_df.isnull().sum())

COMBINING FEATURES FOR YEARS 2012 TO 2019

Processing 2012: Found 19 feature files
  Starting with 3111 counties
  After merging: 3111 counties with 26 columns
  Saved: ../data_cvd/processed/combined_by_year/combined_features_2012.csv

Processing 2013: Found 19 feature files
  Starting with 3111 counties
  After merging: 3111 counties with 26 columns
  Saved: ../data_cvd/processed/combined_by_year/combined_features_2013.csv

Processing 2014: Found 19 feature files
  Starting with 3111 counties
  After merging: 3111 counties with 26 columns
  Saved: ../data_cvd/processed/combined_by_year/combined_features_2014.csv

Processing 2015: Found 19 feature files
  Starting with 3111 counties
  After merging: 3111 counties with 26 columns
  Saved: ../data_cvd/processed/combined_by_year/combined_features_2015.csv

Processing 2016: Found 19 feature files
  Starting with 3111 counties
  After merging: 3111 counties with 26 columns
  Saved: ../data_cvd/processed/combined_by_year/combined_features_20

## 6. Verification

In [6]:
# Verify the distribution of observations per year
print("Observations per year:")
print(final_combined_df['Year'].value_counts().sort_index())

# Check for any missing values in key columns
print("\nMissing values in key columns:")
key_cols = ['County', 'State', 'State_FIPS', 'County_FIPS', 'Fips', 'cvd_mortality_rate', 'Year']
available_key_cols = [col for col in key_cols if col in final_combined_df.columns]
print(final_combined_df[available_key_cols].isnull().sum())

# Summary statistics for CVD mortality rate
print("\nCVD Mortality Rate Statistics:")
print(final_combined_df['cvd_mortality_rate'].describe())

Observations per year:
Year
2012    3111
2013    3111
2014    3111
2015    3111
2016    3111
2017    3111
2018    3111
2019    3111
Name: count, dtype: int64

Missing values in key columns:
County                0
State                 0
State_FIPS            0
County_FIPS           0
Fips                  0
cvd_mortality_rate    0
Year                  0
dtype: int64

CVD Mortality Rate Statistics:
count    24888.000000
mean         0.002703
std          0.000551
min          0.000731
25%          0.002303
50%          0.002647
75%          0.003055
max          0.006389
Name: cvd_mortality_rate, dtype: float64
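
The counts above are consistent with a balanced panel: 3111 counties × 8 years = 24,888 rows. Equal per-year counts alone do not rule out duplicated county-year pairs, so a further check might look like the sketch below. The helper `panel_summary` is hypothetical; the column names `Fips` and `Year` come from this notebook.

```python
import pandas as pd


def panel_summary(df, id_col="Fips", time_col="Year"):
    """Summarize whether df has exactly one row per (id, time) pair."""
    n_units = df[id_col].nunique()
    n_periods = df[time_col].nunique()
    n_duplicates = int(df.duplicated(subset=[id_col, time_col]).sum())
    is_balanced = n_duplicates == 0 and len(df) == n_units * n_periods
    return {
        "units": n_units,
        "periods": n_periods,
        "duplicates": n_duplicates,
        "balanced": is_balanced,
    }


# Hypothetical toy panel: two counties observed in two years
toy = pd.DataFrame(
    {"Fips": [1001, 1003, 1001, 1003], "Year": [2012, 2012, 2013, 2013]}
)
print(panel_summary(toy))
# {'units': 2, 'periods': 2, 'duplicates': 0, 'balanced': True}
```

Calling `panel_summary(final_combined_df)` on the master dataset would be expected to report 3111 units, 8 periods, and no duplicates if the panel is balanced.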


In [7]:
# Show first few rows
print("\nFirst few rows:")
final_combined_df.head()


First few rows:


| | County | State | cvd_mortality_rate | year | State_FIPS | County_FIPS | Fips | Poverty Rate | High School Degree or Higher (%) | Black Population | ... | Total Population | Rent Burden Count (+50%) | Hispanic Population | No Vehicle (Owner) | Median Household Income | Total Families (Single Mother) | Total Occupied Households | Unemployment Rate | No Vehicle (Renter) | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Autauga County | Alabama | 0.003293 | 2012 | 1 | 1 | 1001 | 11.6 | 85.0 | 9880.0 | ... | 54590.0 | 906.0 | 1310.0 | 444.0 | 53773.0 | 1562.0 | 19934.0 | 8.6 | 580.0 | 2012 |
| 1 | Baldwin County | Alabama | 0.002951 | 2012 | 1 | 3 | 1003 | 13.3 | 87.0 | 17016.0 | ... | 183226.0 | 3468.0 | 7915.0 | 1021.0 | 50706.0 | 4626.0 | 72751.0 | 8.5 | 1221.0 | 2012 |
| 2 | Barbour County | Alabama | 0.003027 | 2012 | 1 | 5 | 1005 | 26.1 | 70.2 | 12645.0 | ... | 27469.0 | 786.0 | 1365.0 | 317.0 | 31889.0 | 1045.0 | 9423.0 | 13.5 | 623.0 | 2012 |
| 3 | Bibb County | Alabama | 0.003566 | 2012 | 1 | 7 | 1007 | 16.5 | 71.5 | 4953.0 | ... | 22769.0 | 226.0 | 419.0 | 251.0 | 36824.0 | 536.0 | 7386.0 | 10.5 | 127.0 | 2012 |
| 4 | Blount County | Alabama | 0.003056 | 2012 | 1 | 9 | 1009 | 14.7 | 73.9 | 754.0 | ... | 57466.0 | 651.0 | 4646.0 | 335.0 | 45192.0 | 1069.0 | 21031.0 | 10.0 | 471.0 | 2012 |


## 7. Summary

**Outputs created:**
1. Individual yearly datasets: `combined_features_2012.csv` through `combined_features_2019.csv`
2. Master dataset: `combined_all_years.csv` (all years stacked)

**Next steps:**
- Proceed to notebook 04 for data cleaning and handling missing values
- Then notebook 05 for combining with weather and livestock data
- Finally notebooks 06+ for feature analysis and machine learning