# Combine Feature Files by Year

## Purpose
This notebook combines individual feature CSV files for each year (2012–2019) into consolidated datasets. It performs two main operations:

1. **Horizontal Merge**: Combines all 19 feature files for each year into a single dataset per year
2. **Vertical Stack**: Combines all yearly datasets into one master dataset containing all years

## Input
- 152 CSV files from notebook 02: 19 variables × 8 years (2012-2019)
- Each file contains: County, State, State_FIPS, County_FIPS, mean_life_expectancy, and one ACS variable
- Variables: 14 B-tables + 5 S-tables
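
Before merging, it can help to confirm that every year really has the expected 19 files. The sketch below is one possible check, not part of the original pipeline. It assumes the `<variable>_<year>.csv` naming implied by the glob pattern used later in this notebook. The file names in the demo (`B01003_2012.csv` and so on) and the helper `count_files_per_year` are hypothetical.

```python
from collections import Counter
from pathlib import Path
import re
import tempfile


def count_files_per_year(folder):
    """Count CSV files per year, assuming names end in '_<YYYY>.csv'."""
    pattern = re.compile(r"_(\d{4})\.csv$")
    counts = Counter()
    for path in Path(folder).glob("*.csv"):
        match = pattern.search(path.name)
        if match:
            counts[int(match.group(1))] += 1
    return dict(sorted(counts.items()))


# Demo on a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["B01003_2012.csv", "S1701_2012.csv", "B01003_2013.csv"]:
        (Path(tmp) / name).touch()
    counts = count_files_per_year(tmp)

print(counts)
```

On the real input directory, each year from 2012 to 2019 would be expected to map to 19.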

## Output
- **8 yearly files**: `combined_features_2012.csv` through `combined_features_2019.csv` 
  - Location: `data_cleaned/processed/combined_by_year/`
- **1 master file**: `combined_all_years.csv` containing all years stacked vertically
  - Location: `data_cleaned/processed/combined_by_year/`

## Key Improvements
- Uses an **inner join** to keep only counties that appear in every feature file
- Merges on the **county identifier columns only** (`County`, `State`, `State_FIPS`, `County_FIPS`), leaving the target variable out of the merge keys
- Preserves life expectancy as the target variable
- Adds Year column for temporal identification
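
To illustrate why the target variable is kept out of the merge keys, here is a minimal sketch on made-up data. The feature names (`median_income`, `poverty_rate`) and all values are hypothetical, and the keys are reduced to the two FIPS columns for brevity. If the target column differs even slightly between two files (for example through rounding), including it in the keys silently drops the county.

```python
import pandas as pd

keys = ["State_FIPS", "County_FIPS"]

# Two toy feature files covering the same two counties (values are made up)
income = pd.DataFrame({
    "State_FIPS": [1, 1],
    "County_FIPS": [1, 3],
    "mean_life_expectancy": [75.2, 78.1],
    "median_income": [41000, 52000],
})
poverty = pd.DataFrame({
    "State_FIPS": [1, 1],
    "County_FIPS": [1, 3],
    "mean_life_expectancy": [75.2, 78.1000001],  # tiny rounding difference
    "poverty_rate": [18.5, 12.0],
})

# Target included in the keys: the second county no longer matches
with_target = pd.merge(
    income, poverty, on=keys + ["mean_life_expectancy"], how="inner"
)

# Identifiers only: both counties survive; the duplicate target column
# from the right-hand frame gets the "_drop" suffix and is removed
ids_only = pd.merge(income, poverty, on=keys, how="inner", suffixes=("", "_drop"))
ids_only = ids_only.drop(columns=[c for c in ids_only.columns if c.endswith("_drop")])

print(len(with_target), len(ids_only))
```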

## 1. Import Libraries

In [1]:
import pandas as pd
from pathlib import Path

## 2. Setup Directories

Configure input and output paths for data processing.

In [2]:
# Directory containing individual feature CSV files from notebook 02
directory = Path("../data_cleaned/processed/acs_individual_variables")

# Output directory for combined datasets
output_directory = Path("../data_cleaned/processed/combined_by_year") 

# Create output directory if it doesn't exist
output_directory.mkdir(parents=True, exist_ok=True)
print(f"Output directory ready: {output_directory}")

Output directory ready: ../data_cleaned/processed/combined_by_year


## 3. Function 1: Combine Features for a Single Year

This function performs horizontal merging of all feature files for a specific year.

**Logic:**
1. Find all CSV files for the given year (e.g., all files ending with `_2012.csv`)
2. Load each file into a pandas DataFrame
3. Merge all DataFrames horizontally using the county identifier columns (`County`, `State`, `State_FIPS`, `County_FIPS`) as keys
4. Use an **inner join** to keep only counties that appear in ALL feature files
5. Add a Year column for tracking
6. Save the combined dataset

In [3]:
def combine_features_for_year(year):
    """
    Combines all feature files for a single year by merging them horizontally.
    
    Parameters:
    -----------
    year : int
        The year to process (e.g., 2012, 2013, ...)
    
    Returns:
    --------
    pd.DataFrame
        Combined DataFrame with all features for the given year
    """
    # Get all CSV files for the given year (sorted so the column order is reproducible)
    files = sorted(directory.glob(f"*_{year}.csv"))
    
    if len(files) == 0:
        raise ValueError(f"No files found for year {year} in {directory}")
    
    print(f"\nProcessing {year}: Found {len(files)} feature files")
    
    # Read all files for the year
    dataframes = [pd.read_csv(file) for file in files]
    
    # Start with the first DataFrame
    combined_year_df = dataframes[0]
    print(f"  Starting with {combined_year_df.shape[0]} counties")
    
    # Merge all subsequent DataFrames horizontally
    # Key improvement: merge on the county identifier columns only (NOT the target variable)
    # Using 'inner' join to keep only counties present in every feature file
    for df in dataframes[1:]:
        combined_year_df = pd.merge(
            combined_year_df, 
            df,
            on=["County", "State", "State_FIPS", "County_FIPS"],
            how="inner",  # Only keep counties with data in ALL features
            suffixes=('', '_drop')  # Handle duplicate columns if any
        )
        
        # Drop duplicate target variable columns if they exist (keeping the first)
        cols_to_drop = [col for col in combined_year_df.columns if col.endswith('_drop')]
        if cols_to_drop:
            combined_year_df = combined_year_df.drop(columns=cols_to_drop)
    
    print(f"  After merging: {combined_year_df.shape[0]} counties with {combined_year_df.shape[1]} columns")
    
    # Add Year column for temporal identification
    combined_year_df["Year"] = year
    
    # Save the combined file for the year
    combined_year_filepath = output_directory / f"combined_features_{year}.csv"
    combined_year_df.to_csv(combined_year_filepath, index=False)
    print(f"   Saved: {combined_year_filepath}")
    
    return combined_year_df
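
Because the inner join discards counties without reporting which ones, it may be useful to audit a merge step. The following is a sketch on toy data, not the notebook's own method. It uses the `indicator=True` option of `pd.merge` with an outer join to list the rows that an inner join would drop. The column names `feature_a` and `feature_b` are hypothetical, and the keys are reduced to the FIPS columns for brevity.

```python
import pandas as pd

keys = ["State_FIPS", "County_FIPS"]

# Toy stand-ins for two feature files; the second lacks county (1, 5)
left = pd.DataFrame({
    "State_FIPS": [1, 1, 1],
    "County_FIPS": [1, 3, 5],
    "feature_a": [0.1, 0.2, 0.3],
})
right = pd.DataFrame({
    "State_FIPS": [1, 1],
    "County_FIPS": [1, 3],
    "feature_b": [10, 20],
})

# What the pipeline does: keep only counties present in both frames
inner = pd.merge(left, right, on=keys, how="inner")

# Audit: an outer merge with indicator=True adds a "_merge" column that
# records whether each row came from the left frame, the right, or both
audit = pd.merge(left, right, on=keys, how="outer", indicator=True)
dropped = audit.loc[audit["_merge"] != "both", keys + ["_merge"]]

print(f"inner join keeps {len(inner)} of {len(left)} counties")
print(dropped)
```

If the logged county count falls between "Starting with" and "After merging", a check like this on the relevant pair of files would show which counties were lost.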

## 4. Function 2: Combine All Years

This function performs vertical stacking of all yearly combined datasets.

**Logic:**
1. Loop through each year in the range
2. Call `combine_features_for_year()` for each year
3. Stack all yearly DataFrames vertically (concatenate along rows)
4. Save the master dataset containing all years

In [4]:
def combine_all_years(start_year, end_year):
    """
    Combines all years by stacking each year's horizontally combined features vertically.
    
    Parameters:
    -----------
    start_year : int
        First year to include (e.g., 2012)
    end_year : int
        Last year to include (e.g., 2019)
    
    Returns:
    --------
    pd.DataFrame
        Combined DataFrame with all years stacked vertically
    """
    all_years_dataframes = []
    
    print("=" * 70)
    print(f"COMBINING FEATURES FOR YEARS {start_year} TO {end_year}")
    print("=" * 70)
    
    # Process each year
    for year in range(start_year, end_year + 1):
        yearly_df = combine_features_for_year(year)
        all_years_dataframes.append(yearly_df)
    
    print("\n" + "=" * 70)
    print("STACKING ALL YEARS VERTICALLY")
    print("=" * 70)
    
    # Combine all years vertically (stack rows)
    combined_all_years_df = pd.concat(all_years_dataframes, axis=0, ignore_index=True)
    
    print(f"\nFinal dataset shape: {combined_all_years_df.shape}")
    print(f"  - Total rows (county-year observations): {combined_all_years_df.shape[0]}")
    print(f"  - Total columns (features + identifiers): {combined_all_years_df.shape[1]}")
    
    # Save the final combined file
    combined_all_years_filepath = output_directory / "combined_all_years.csv"
    combined_all_years_df.to_csv(combined_all_years_filepath, index=False)
    print(f"\n Saved master file: {combined_all_years_filepath}")
    
    return combined_all_years_df
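
One property of the vertical stack worth keeping in mind: `pd.concat` aligns frames by column name, so a column missing from one year is filled with `NaN` rather than raising an error. The sketch below shows this on toy frames. The column names and values are made up for illustration.

```python
import pandas as pd

# Toy yearly frames; 2013 has an extra column that 2012 lacks
y2012 = pd.DataFrame({"County_FIPS": [1, 3], "feature_a": [0.1, 0.2], "Year": 2012})
y2013 = pd.DataFrame({
    "County_FIPS": [1, 3],
    "feature_a": [0.3, 0.4],
    "feature_b": [7, 8],
    "Year": 2013,
})

# Stack rows; ignore_index=True renumbers the index from 0
stacked = pd.concat([y2012, y2013], axis=0, ignore_index=True)

print(stacked.shape)
print(stacked["feature_b"].isna().sum())  # rows from 2012 have no feature_b
```

Comparing the column lists of the yearly frames before stacking is one way to catch such mismatches early.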

## 5. Execute: Combine All Features for 2012-2019

Run the combining process for all years and create the master dataset.

In [5]:
# Combine all files from 2012 to 2019
final_combined_df = combine_all_years(2012, 2019)

# Display summary information
print("\n" + "=" * 70)
print("DATASET SUMMARY")
print("=" * 70)
print("\nColumns in final dataset:")
print(final_combined_df.columns.tolist())
print("\nData types:")
print(final_combined_df.dtypes)
print("\nMissing values per column:")
print(final_combined_df.isnull().sum())
print("\nFirst few rows:")
print(final_combined_df.head())

COMBINING FEATURES FOR YEARS 2012 TO 2019

Processing 2012: Found 19 feature files
  Starting with 3111 counties
  After merging: 3111 counties with 24 columns
   Saved: ../data_cleaned/processed/combined_by_year/combined_features_2012.csv

Processing 2013: Found 19 feature files
  Starting with 3111 counties
  After merging: 3111 counties with 24 columns
   Saved: ../data_cleaned/processed/combined_by_year/combined_features_2013.csv

Processing 2014: Found 19 feature files
  Starting with 3111 counties
  After merging: 3111 counties with 24 columns
   Saved: ../data_cleaned/processed/combined_by_year/combined_features_2014.csv

Processing 2015: Found 19 feature files
  Starting with 3111 counties
  After merging: 3111 counties with 24 columns
   Saved: ../data_cleaned/processed/combined_by_year/combined_features_2015.csv

Processing 2016: Found 19 feature files
  Starting with 3111 counties
  After merging: 3111 counties with 24 columns
   Saved: ../data_cleaned/processed/combined_by_

## 6. Verification

Check data quality and consistency.

In [None]:
# Verify the distribution of observations per year
print("Observations per year:")
print(final_combined_df['Year'].value_counts().sort_index())

# Check for any missing values in key columns
print("\nMissing values in key columns:")
key_cols = ['County', 'State', 'State_FIPS', 'County_FIPS', 'mean_life_expectancy', 'Year']
print(final_combined_df[key_cols].isnull().sum())

# Summary statistics for life expectancy
print("\nLife Expectancy Statistics:")
print(final_combined_df['mean_life_expectancy'].describe())
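
An additional check that could be added here is whether any county appears more than once within a year, since duplicated keys in an input file would multiply rows during the merge. This is a sketch under the assumption that (`State_FIPS`, `County_FIPS`, `Year`) should uniquely identify a row. The helper name `count_duplicate_county_years` is hypothetical and the demo data is made up.

```python
import pandas as pd


def count_duplicate_county_years(df, keys=("State_FIPS", "County_FIPS", "Year")):
    """Return how many rows repeat an earlier (state, county, year) key."""
    return int(df.duplicated(subset=list(keys)).sum())


# Toy frame in which county (1, 1) appears twice for 2012
toy = pd.DataFrame({
    "State_FIPS": [1, 1, 1],
    "County_FIPS": [1, 1, 3],
    "Year": [2012, 2012, 2012],
})

n_duplicates = count_duplicate_county_years(toy)
print(n_duplicates)
```

Applied to `final_combined_df`, a result of 0 would indicate one row per county-year.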

## 7. Summary

**Outputs created:**
1. Individual yearly datasets: `combined_features_2012.csv` through `combined_features_2019.csv`
2. Master dataset: `combined_all_years.csv` (all years stacked)

**Next steps:**
- Proceed to notebook 04 for data cleaning and handling missing values
- Then notebook 05 for exploratory data analysis (EDA) and feature engineering
- Finally notebooks 06+ for machine learning model development