# Create Final Reduced Dataset

## Purpose
This notebook creates the final reduced dataset by:
1. **Dropping redundant demographics variables** (6 variables identified in notebook 06)
2. **Dropping irrelevant weather variables** (59 variables from notebook 07)
3. **Dropping redundant weather variables** (15 variables from correlation analysis in notebook 07)
4. **Formatting weather variable names** to Title Case for publication
5. **Verifying data quality** and final feature set
6. **Saving final reduced dataset** ready for modeling

## Input
- `data_cleaned/combined_final/final_combined_all_variables.csv` (from notebook 05)

## Output
- `data_cleaned/combined_final/final_combined_all_variables_reduced.csv`
- Publication-ready dataset with selected features only

## Expected Changes
- **Demographics**: 15 → 9 variables (drop 6)
- **Weather**: 104 → 30 variables (drop 74)
- **Livestock**: 8 variables (keep all)
- **Total**: 131 → 51 columns (drop 80), including 3 identifiers and the target

## 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully")
print(f"pandas version: {pd.__version__}")

Libraries imported successfully
pandas version: 2.3.3


## 2. Load Combined Dataset

Load the full combined dataset from notebook 05 (before any feature reduction).

In [2]:
# Load combined dataset
df = pd.read_csv('../data_cleaned/combined_final/final_combined_all_variables.csv')

print("=" * 70)
print("ORIGINAL DATASET LOADED")
print("=" * 70)
print(f"Shape: {df.shape}")
print(f"  - Rows: {df.shape[0]:,}")
print(f"  - Columns: {df.shape[1]}")
print(f"  - Years: {sorted(df['Year'].unique())}")

# Store original shape for comparison
original_rows = df.shape[0]
original_cols = df.shape[1]

ORIGINAL DATASET LOADED
Shape: (24487, 131)
  - Rows: 24,487
  - Columns: 131
  - Years: [2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]


## 3. Define Variables to Drop

### 3.1 Redundant Demographics Variables

Based on correlation analysis in notebook 06, drop 6 demographics variables.

In [3]:
# Demographics variables to drop (from notebook 06 analysis)
demographics_to_drop = [
    'Rent Burden (+50% of HI)',
    'Median Household Income',
    'Gini Index',
    'White Population (%)',
    'High School Degree or Higher (%)',
    'Median Age'
]

print("Demographics variables to drop:")
print(f"  Total: {len(demographics_to_drop)}")
for i, var in enumerate(demographics_to_drop, 1):
    print(f"    {i}. {var}")

Demographics variables to drop:
  Total: 6
    1. Rent Burden (+50% of HI)
    2. Median Household Income
    3. Gini Index
    4. White Population (%)
    5. High School Degree or Higher (%)
    6. Median Age
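
Notebook 06 is not reproduced here, so the exact selection rule behind this list is not shown. As a rough illustration of how a correlation screen of this kind might flag candidates, the sketch below lists column pairs whose absolute Pearson correlation exceeds a cutoff. The helper name `find_correlated_pairs`, the 0.8 threshold, and the toy data are assumptions made for illustration, not the project's actual method.

```python
import numpy as np
import pandas as pd

def find_correlated_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| above threshold."""
    corr = frame.corr(numeric_only=True)
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Toy data (hypothetical values, not taken from the project dataset)
rng = np.random.default_rng(0)
income = rng.normal(50_000, 10_000, 200)
toy = pd.DataFrame({
    "income": income,
    "income_copy": income * 1.1 + rng.normal(0, 100, 200),  # nearly collinear
    "noise": rng.normal(0, 1, 200),                          # unrelated column
})

pairs = find_correlated_pairs(toy)
print(pairs)
```

Which member of a flagged pair to drop is a separate judgment call that a screen like this does not make.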


### 3.2 Irrelevant Weather Variables

These 59 weather variables are known to be redundant or irrelevant (from notebook 07 step 4.3).

In [4]:
# Irrelevant weather variables (from notebook 07 step 4.3)
irrelevant_weather = [
    'PM$_{10}$',
    'FoT PM$_{10}$ above EPA threshold', 
    'FoT PM$_{10}$ above75ᵗʰ percentile',
    'FoT PM$_{1}$ above75ᵗʰ percentile', 
    'PM$_1$', 
    'Specific humidity.1', 
    'Total column water',
    '2m dew point temperature', 
    'Wet bulb temperature', 
    '2m temperature',
    'PM$_{2.5}$', 
    'FoT Temperature above 90 $\\mathrm{^o F}$', 
    'FoT Temperature below 0 $\\mathrm{^o C}$',
    'FoT PM$_{2.5}$ above EPA threshold', 
    'Temperature', 
    'Surface geopotential',
    'Dust aerosol (0.03-0.55 µm) mixing ratio', 
    'Nitrogen dioxide', 
    'Peroxyacetyl nitrate', 
    'Formaldehyde',
    'Sulphur dioxide', 
    'Nitric acid', 
    'Propane',
    'Nitrogen monoxide',
    'Ozone', 
    'Hydrogen peroxide', 
    'Specific humidity',
    
    # Total AOD measurements (redundant aerosol measurements)
    'Total AOD at 469 nm', 
    'Total AOD at 550 nm', 
    'Total AOD at 670 nm', 
    'Total AOD at 865 nm',
    'Total AOD at 1240 nm',
    
    # Total column gas measurements (redundant with other air quality measures)
    'Total column carbon monoxide', 
    'Total column ethane', 
    'Total column formaldehyde',
    'Total column hydrogen peroxide', 
    'Total column hydroxyl radical', 
    'Total column isoprene', 
    'Total column methane', 
    'Total column nitric acid', 
    'Total column nitrogen dioxide', 
    'Total column ozone', 
    'Total column peroxyacetyl nitrate',
    'Total column propane', 
    'Total column sulphur dioxide', 
    'Total column water vapour',
    'Total column nitrogen monoxide',
    
    # Vertically integrated aerosol masses (redundant)
    'Vertically integrated mass of dust aerosol (0.03-0.55 µm)',
    'Vertically integrated mass of dust aerosol (0.55-9 µm)', 
    'Vertically integrated mass of dust aerosol (9-20 µm)',
    'Vertically integrated mass of hydrophilic black carbon aerosol', 
    'Vertically integrated mass of hydrophilic organic matter aerosol',
    'Vertically integrated mass of hydrophobic organic matter aerosol',
    'Vertically integrated mass of hydrophobic black carbon aerosol', 
    'Vertically integrated mass of sea salt aerosol (0.03-0.5 µm)',
    'Vertically integrated mass of sea salt aerosol (0.5-5 µm)', 
    'Vertically integrated mass of sea salt aerosol (5-20 µm)',
    'Vertically integrated mass of sulphate aerosol', 
    'Vertically integrated mass of sulphur dioxide'
]

print(f"Irrelevant weather variables to drop: {len(irrelevant_weather)}")

Irrelevant weather variables to drop: 59


### 3.3 Redundant Weather Variables from Analysis

These 15 weather variables were identified as redundant by the correlation analysis in notebook 07.

In [5]:
# Redundant weather variables (from notebook 07 correlation analysis)
redundant_weather = [
    'Dust aerosol (0.9-20 µm) mixing ratio',
    'Hydrophobic black carbon aerosol mixing ratio',
    'Sea salt aerosol (5-20 µm) mixing ratio',
    'Dust aerosol (0.55-0.9 µm) mixing ratio',
    'Sea salt aerosol (0.5-5 µm) mixing ratio',
    'Hydrophilic black carbon aerosol mixing ratio',
    'Sulphate aerosol mixing ratio',
    'Hydroxyl radical',
    'FoT Temperature below25ᵗʰ percentile',
    'FoT Temperature above75ᵗʰ percentile',
    'Sea salt aerosol (0.03-0.5 µm) mixing ratio',
    'Hydrophobic organic matter aerosol mixing ratio',
    'Carbon monoxide',
    'Ethane',
    'Hydrophilic organic matter aerosol mixing ratio'
]

print("Redundant weather variables to drop (from analysis):")
print(f"  Total: {len(redundant_weather)}")
for i, var in enumerate(redundant_weather, 1):
    print(f"    {i:2}. {var}")

Redundant weather variables to drop (from analysis):
  Total: 15
     1. Dust aerosol (0.9-20 µm) mixing ratio
     2. Hydrophobic black carbon aerosol mixing ratio
     3. Sea salt aerosol (5-20 µm) mixing ratio
     4. Dust aerosol (0.55-0.9 µm) mixing ratio
     5. Sea salt aerosol (0.5-5 µm) mixing ratio
     6. Hydrophilic black carbon aerosol mixing ratio
     7. Sulphate aerosol mixing ratio
     8. Hydroxyl radical
     9. FoT Temperature below25ᵗʰ percentile
    10. FoT Temperature above75ᵗʰ percentile
    11. Sea salt aerosol (0.03-0.5 µm) mixing ratio
    12. Hydrophobic organic matter aerosol mixing ratio
    13. Carbon monoxide
    14. Ethane
    15. Hydrophilic organic matter aerosol mixing ratio


### 3.4 Combine All Variables to Drop

In [6]:
# Combine all variables to drop
all_vars_to_drop = demographics_to_drop + irrelevant_weather + redundant_weather

print("=" * 70)
print("SUMMARY: VARIABLES TO DROP")
print("=" * 70)
print(f"  Demographics: {len(demographics_to_drop)}")
print(f"  Irrelevant weather: {len(irrelevant_weather)}")
print(f"  Redundant weather: {len(redundant_weather)}")
print(f"  TOTAL: {len(all_vars_to_drop)}")

SUMMARY: VARIABLES TO DROP
  Demographics: 6
  Irrelevant weather: 59
  Redundant weather: 15
  TOTAL: 80
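
Concatenating lists keeps any name that appears in more than one of them, which would inflate the total without dropping more columns. A minimal sketch of a duplicate check, using a hypothetical helper `find_duplicates` and made-up lists:

```python
from collections import Counter

def find_duplicates(*lists):
    """Return names that occur more than once across the given lists."""
    counts = Counter(name for names in lists for name in names)
    return sorted(name for name, n in counts.items() if n > 1)

# Hypothetical lists for illustration
group_a = ["Ozone", "Ethane"]
group_b = ["Propane", "Ozone"]
combined = group_a + group_b

print(len(combined))                      # 4 entries
print(len(set(combined)))                 # 3 distinct names
print(find_duplicates(group_a, group_b))  # ['Ozone']
```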


## 4. Check Which Variables Exist

Filter to only drop columns that actually exist in the dataset.

In [7]:
# Filter to only columns that exist in the dataset
cols_to_drop = [col for col in all_vars_to_drop if col in df.columns]

print("=" * 70)
print("COLUMNS TO DROP (THAT EXIST IN DATASET)")
print("=" * 70)
print(f"Specified: {len(all_vars_to_drop)}")
print(f"Found in dataset: {len(cols_to_drop)}")
print(f"Not found: {len(all_vars_to_drop) - len(cols_to_drop)}")

# Check which columns don't exist (for debugging)
cols_not_found = [col for col in all_vars_to_drop if col not in df.columns]
if cols_not_found:
    print(f"\n Columns NOT found in dataset ({len(cols_not_found)}):")
    for i, col in enumerate(cols_not_found[:10], 1):
        print(f"    {i}. {col}")
    if len(cols_not_found) > 10:
        print(f"    ... and {len(cols_not_found) - 10} more")
else:
    print("\n All specified columns exist in the dataset")

COLUMNS TO DROP (THAT EXIST IN DATASET)
Specified: 80
Found in dataset: 80
Not found: 0

 All specified columns exist in the dataset
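
The same existence check can be written with set arithmetic, and `DataFrame.drop` accepts `errors="ignore"` to skip labels that are absent. The sketch below uses a toy frame and a hypothetical drop list to show both.

```python
import pandas as pd

# Toy frame and hypothetical drop list for illustration
toy = pd.DataFrame({"Year": [2012, 2013], "Ozone": [0.1, 0.2], "Ethane": [1.0, 2.0]})
requested = ["Ozone", "Ethane", "Not A Column"]

present = [c for c in requested if c in toy.columns]
missing = sorted(set(requested) - set(toy.columns))

# errors="ignore" skips absent labels instead of raising KeyError
reduced = toy.drop(columns=requested, errors="ignore")

print(present)                # ['Ozone', 'Ethane']
print(missing)                # ['Not A Column']
print(list(reduced.columns))  # ['Year']
```

Reporting the missing names explicitly, as the cell above does, is still useful: a silently ignored label is often a typo in the drop list.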


## 5. Drop Redundant Columns

Remove all identified redundant variables from the dataset.

In [8]:
print("=" * 70)
print("DROPPING REDUNDANT COLUMNS")
print("=" * 70)

print(f"\nOriginal shape: {df.shape}")
print(f"  - Rows: {original_rows:,}")
print(f"  - Columns: {original_cols}")

# Drop columns
df_reduced = df.drop(columns=cols_to_drop)

# New shape
reduced_rows = df_reduced.shape[0]
reduced_cols = df_reduced.shape[1]
cols_dropped = original_cols - reduced_cols

print(f"\nReduced shape: {df_reduced.shape}")
print(f"  - Rows: {reduced_rows:,} (no change ✓)")
print(f"  - Columns: {reduced_cols}")
print(f"  - Columns dropped: {cols_dropped}")
print(f"  - Retention: {(reduced_cols / original_cols) * 100:.1f}%")

DROPPING REDUNDANT COLUMNS

Original shape: (24487, 131)
  - Rows: 24,487
  - Columns: 131

Reduced shape: (24487, 51)
  - Rows: 24,487 (no change ✓)
  - Columns: 51
  - Columns dropped: 80
  - Retention: 38.9%
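
The "no change" check on the row count is printed rather than enforced. One way to make it fail loudly is to wrap the drop in assertions; the helper `drop_and_check` below is a hypothetical sketch on toy data.

```python
import pandas as pd

def drop_and_check(frame, columns):
    """Drop columns and assert that only the column count changed."""
    reduced = frame.drop(columns=columns)
    assert len(reduced) == len(frame), "row count changed"
    expected_cols = frame.shape[1] - len(set(columns))
    assert reduced.shape[1] == expected_cols, "unexpected column count"
    return reduced

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
reduced = drop_and_check(toy, ["b", "c"])
print(reduced.shape)  # (3, 1)
```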


## 6. Identify Remaining Variables

Categorize remaining columns by type for verification.

In [10]:
print("=" * 70)
print(f"REMAINING COLUMNS ({len(df_reduced.columns)})")
print("=" * 70)

# Define identifiers and target
identifiers = ['County', 'State', 'Year']
target = ['Mean Life Expectancy']

# Define demographics variables (original list minus dropped ones)
all_demographics = [
    'Median Household Income',  # dropped
    'Total Population',
    'Gini Index',  # dropped
    'Median Age',  # dropped
    'Poverty Rate',
    'Unemployment Rate',
    'Disability Rate',
    "Bachelor's Degree or Higher (%)",
    "High School Degree or Higher (%)",  # dropped
    'White Population (%)',  # dropped
    'Hispanic Population (%)',
    'Black Population (%)',
    'Households with No Vehicle (%)',
    'Rent Burden (+50% of HI)',  # dropped
    'Single Mother Families (%)'
]
demographics_remaining = [col for col in all_demographics if col in df_reduced.columns]

# Define livestock variables
livestock = ['Buffalo', 'Cattle', 'Chicken', 'Duck', 'Goat', 'Horse', 'Pig', 'Sheep']
livestock_remaining = [col for col in livestock if col in df_reduced.columns]

# Weather variables are everything else; any column not listed above (e.g. 'fips') also lands here
known_cols = identifiers + target + demographics_remaining + livestock_remaining
weather_remaining = [col for col in df_reduced.columns if col not in known_cols]

# Display categorization
print("\n1. Identifiers (3):")
for col in identifiers:
    if col in df_reduced.columns:
        print(f"      {col}")

print("\n2. Target Variable (1):")
for col in target:
    if col in df_reduced.columns:
        print(f"      {col}")

print(f"\n3. Demographics Variables ({len(demographics_remaining)}):")
for i, col in enumerate(demographics_remaining, 1):
    print(f"     {i}. {col}")

print(f"\n4. Livestock Variables ({len(livestock_remaining)}):")
for i, col in enumerate(livestock_remaining, 1):
    print(f"     {i}. {col}")

print(f"\n5. Weather Variables ({len(weather_remaining)}):")
for i, col in enumerate(sorted(weather_remaining), 1):
    print(f"     {i:2}. {col}")

print(f"\n{'=' * 70}")
print("TOTAL BREAKDOWN:")
print(f"  Identifiers: {len([c for c in identifiers if c in df_reduced.columns])}")
print(f"  Target: {len([c for c in target if c in df_reduced.columns])}")
print(f"  Demographics: {len(demographics_remaining)}")
print(f"  Livestock: {len(livestock_remaining)}")
print(f"  Weather: {len(weather_remaining)}")
print(f"  TOTAL: {df_reduced.shape[1]}")

REMAINING COLUMNS (51)

1. Identifiers (3):
      County
      State
      Year

2. Target Variable (1):
      Mean Life Expectancy

3. Demographics Variables (9):
     1. Total Population
     2. Poverty Rate
     3. Unemployment Rate
     4. Disability Rate
     5. Bachelor's Degree or Higher (%)
     6. Hispanic Population (%)
     7. Black Population (%)
     8. Households with No Vehicle (%)
     9. Single Mother Families (%)

4. Livestock Variables (8):
     1. Buffalo
     2. Cattle
     3. Chicken
     4. Duck
     5. Goat
     6. Horse
     7. Pig
     8. Sheep

5. Weather Variables (30):
      1. 10m wind speed
      2. Black carbon AOD at 550 nm
      3. Dust AOD at 550 nm
      4. FoT Carbonmonoxide above75ᵗʰ percentile
      5. FoT Ethane above75ᵗʰ percentile
      6. FoT Formaldehyde above75ᵗʰ percentile
      7. FoT Hydrogen peroxide above75ᵗʰ percentile
      8. FoT Hydroxyl radical above75ᵗʰ percentile
      9. FoT Isoprene above75ᵗʰ percentile
     10. FoT Nitric acid

## 7. Format Weather Variable Names to Title Case

### 7.1 Create Title Case Mapping

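Capitalize the first letter of each space-separated word in the weather variable names.
Only the first character of a word changes, so LaTeX notation, superscripts, and acronyms such as AOD are kept as-is; lowercase units and short words are capitalized too (`nm` becomes `Nm`, `at` becomes `At`).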

In [11]:
print("=" * 70)
print("CREATING TITLE CASE MAPPING FOR WEATHER VARIABLES")
print("=" * 70)

# Capitalize the first letter of each space-separated word
def to_title_case_preserve_special(name):
    """
    Capitalize the first letter of each space-separated word.

    Only the first character of a word is changed, so these survive:
    - LaTeX notation: $_{2.5}$, $\\mathrm{^o F}$
    - Superscripts and acronyms: ᵗʰ, AOD
    - Words starting with a digit or parenthesis: 550, (0.5-5

    Lowercase units and short words are capitalized as well
    ('nm' -> 'Nm', 'at' -> 'At'), and str.upper() maps a leading
    'µ' to the Greek capital 'Μ'. Empty strings produced by
    repeated spaces are dropped.
    """
    titled_words = []

    for word in name.split(' '):
        if not word:  # Skip empty strings from repeated spaces
            continue
        if word[0].isalpha():
            # Uppercase the first character only; leave the rest untouched
            titled_words.append(word[0].upper() + word[1:])
        else:
            titled_words.append(word)

    return ' '.join(titled_words)

# Create mapping for weather variables only
weather_name_mapping = {}

for col in weather_remaining:
    # Convert to Title Case
    new_name = to_title_case_preserve_special(col)
    weather_name_mapping[col] = new_name

print(f"\nCreated Title Case mapping for {len(weather_name_mapping)} weather variables")
print("\nSample conversions:")
for i, (old, new) in enumerate(list(weather_name_mapping.items())[:10], 1):
    print(f"  {i:2}. {old}")
    print(f"       {new}")

CREATING TITLE CASE MAPPING FOR WEATHER VARIABLES

Created Title Case mapping for 30 weather variables

Sample conversions:
   1. fips
       Fips
   2. Black carbon AOD at 550 nm
       Black Carbon AOD At 550 Nm
   3. Dust AOD at 550 nm
       Dust AOD At 550 Nm
   4. Land-sea mask
       Land-sea Mask
   5. Mean sea level pressure
       Mean Sea Level Pressure
   6. Organic matter AOD at 550 nm
       Organic Matter AOD At 550 Nm
   7. Sea salt AOD at 550 nm
       Sea Salt AOD At 550 Nm
   8. Sulphate AOD at 550 nm
       Sulphate AOD At 550 Nm
   9. Surface pressure
       Surface Pressure
  10. Isoprene
       Isoprene
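
As the sample conversions show, first-letter capitalization also turns `nm` into `Nm` and `at` into `At`. If that is not wanted, one option is to skip a small set of tokens. The sketch below is a variant under that assumption; the function name `title_case_keep_units` and the contents of `KEEP_AS_IS` are illustrative choices, not part of the notebook.

```python
KEEP_AS_IS = {"nm", "µm", "at", "of"}  # assumed stop list; adjust as needed

def title_case_keep_units(name, keep=KEEP_AS_IS):
    """Capitalize the first letter of each word except tokens listed in keep."""
    out = []
    for word in name.split(" "):
        core = word.strip("(),")  # compare without surrounding punctuation
        if not word or core in keep or not word[0].isalpha():
            out.append(word)
        else:
            out.append(word[0].upper() + word[1:])
    return " ".join(out)

print(title_case_keep_units("Black carbon AOD at 550 nm"))
# Black Carbon AOD at 550 nm
print(title_case_keep_units("10m wind speed"))
# 10m Wind Speed
```

Applying a variant like this would change the column names written to the output file, so downstream notebooks that reference the current names would need the same mapping.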


### 7.2 Apply Title Case Formatting

Rename weather columns to Title Case format.

In [12]:
print("=" * 70)
print("APPLYING TITLE CASE FORMATTING")
print("=" * 70)

# Apply renaming (only weather variables)
df_reduced = df_reduced.rename(columns=weather_name_mapping)

print(f"\n Renamed {len(weather_name_mapping)} weather variables to Title Case")
print("\nNote: Demographics variables already in correct format (no changes)")
print("      Livestock variables already in correct format (no changes)")

APPLYING TITLE CASE FORMATTING

 Renamed 30 weather variables to Title Case

Note: Demographics variables already in correct format (no changes)
      Livestock variables already in correct format (no changes)
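
A rename can, in principle, map two distinct names onto the same new name or onto an existing column. A minimal sketch of a collision check, with a hypothetical helper `find_rename_collisions` and made-up mappings:

```python
from collections import Counter

def find_rename_collisions(mapping, untouched=()):
    """Return new names that would appear more than once after renaming."""
    counts = Counter(list(mapping.values()) + list(untouched))
    return sorted(name for name, n in counts.items() if n > 1)

# Hypothetical mappings for illustration
clashing = {"surface pressure": "Surface Pressure", "Surface pressure": "Surface Pressure"}
clean = {"ozone": "Ozone"}

print(find_rename_collisions(clashing))                  # ['Surface Pressure']
print(find_rename_collisions(clean, untouched=["Year"]))  # []
```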


## 8. Verification

### 8.1 Check for Missing Values

In [13]:
print("=" * 70)
print("DATA QUALITY CHECK")
print("=" * 70)

# Check for missing values
missing_values = df_reduced.isnull().sum()
total_missing = missing_values.sum()

print(f"\nMissing values: {total_missing}")
if total_missing > 0:
    print("\n Columns with missing values:")
    for col in df_reduced.columns:
        if missing_values[col] > 0:
            pct = (missing_values[col] / len(df_reduced)) * 100
            print(f"  {col}: {missing_values[col]:,} ({pct:.2f}%)")
else:
    print("   No missing values")

DATA QUALITY CHECK

Missing values: 0
   No missing values
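
The per-column loop can also be expressed as a single summary table. The sketch below is one way to do that; `missing_summary` is a hypothetical helper and the frame is toy data.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Per-column missing counts and percentages, for columns with any NaN."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame for illustration
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1, 2, 3, 4],
    "c": [np.nan, 1.0, 2.0, 3.0],
})

summary = missing_summary(toy)
print(summary)
```

An empty result means no column has missing values.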


### 8.2 Verify Year Distribution

In [14]:
# Check year distribution
print("\nObservations per year:")
year_counts = df_reduced['Year'].value_counts().sort_index()
for year, count in year_counts.items():
    print(f"  {year}: {count:,} counties")

# Note: this reports the first year's count only; per-year counts can differ
print(f"\nTotal counties per year: {year_counts.values[0]:,}")


Observations per year:
  2012: 3,061 counties
  2013: 3,061 counties
  2014: 3,061 counties
  2015: 3,061 counties
  2016: 3,061 counties
  2017: 3,061 counties
  2018: 3,060 counties
  2019: 3,061 counties

Total counties per year: 3,061
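
The listing shows 3,060 counties for 2018 against 3,061 in every other year. To find which county-year combination is absent, one approach is to build the full county-by-year grid and keep the rows that have no match. The sketch below assumes that `State`, `County`, and `Year` together identify a row; the helper `missing_county_years` and the data are hypothetical.

```python
import pandas as pd

def missing_county_years(frame, keys=("State", "County"), year_col="Year"):
    """List (State, County, Year) combinations absent from the panel."""
    key_cols = list(keys)
    units = frame[key_cols].drop_duplicates()
    years = pd.DataFrame({year_col: sorted(frame[year_col].unique())})
    full = units.merge(years, how="cross")  # every county paired with every year
    merged = full.merge(frame[key_cols + [year_col]], how="left", indicator=True)
    gaps = merged.loc[merged["_merge"] == "left_only", key_cols + [year_col]]
    return gaps.reset_index(drop=True)

# Toy panel: county Y has no 2018 row
toy = pd.DataFrame({
    "State": ["A", "A", "A", "A", "A"],
    "County": ["X", "X", "X", "Y", "Y"],
    "Year": [2017, 2018, 2019, 2017, 2019],
})

gaps = missing_county_years(toy)
print(gaps)
```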


### 8.3 Verify Target Variable

In [15]:
# Target variable statistics
print("\nTarget variable statistics (Mean Life Expectancy):")
print(f"  Count: {df_reduced['Mean Life Expectancy'].count():,}")
print(f"  Min: {df_reduced['Mean Life Expectancy'].min():.2f} years")
print(f"  Max: {df_reduced['Mean Life Expectancy'].max():.2f} years")
print(f"  Mean: {df_reduced['Mean Life Expectancy'].mean():.2f} years")
print(f"  Median: {df_reduced['Mean Life Expectancy'].median():.2f} years")
print(f"  Std Dev: {df_reduced['Mean Life Expectancy'].std():.2f} years")


Target variable statistics (Mean Life Expectancy):
  Count: 24,487
  Min: 67.09 years
  Max: 92.25 years
  Mean: 77.35 years
  Median: 77.47 years
  Std Dev: 2.54 years
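
The same statistics can be gathered in one call with `Series.agg`. A short sketch on toy values (not the project data):

```python
import pandas as pd

# Toy values for illustration
s = pd.Series([70.0, 75.0, 80.0, 85.0], name="Mean Life Expectancy")

stats = s.agg(["count", "min", "max", "mean", "median", "std"])
print(stats.round(2))
```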


### 8.4 Display Final Column List

In [16]:
print("=" * 70)
print(f"FINAL COLUMN LIST ({len(df_reduced.columns)} columns)")
print("=" * 70)
print("\nAll columns in final dataset:")
for i, col in enumerate(df_reduced.columns, 1):
    print(f"  {i:2}. {col}")

FINAL COLUMN LIST (51 columns)

All columns in final dataset:
   1. County
   2. State
   3. Mean Life Expectancy
   4. Poverty Rate
   5. Bachelor's Degree or Higher (%)
   6. Disability Rate
   7. Total Population
   8. Unemployment Rate
   9. Year
  10. Hispanic Population (%)
  11. Black Population (%)
  12. Households with No Vehicle (%)
  13. Single Mother Families (%)
  14. Fips
  15. Black Carbon AOD At 550 Nm
  16. Dust AOD At 550 Nm
  17. Land-sea Mask
  18. Mean Sea Level Pressure
  19. Organic Matter AOD At 550 Nm
  20. Sea Salt AOD At 550 Nm
  21. Sulphate AOD At 550 Nm
  22. Surface Pressure
  23. Isoprene
  24. Leaf Area Index, High Vegetation
  25. Leaf Area Index, Low Vegetation
  26. Snow Albedo
  27. Snow Depth
  28. Relative Humidity
  29. 10m Wind Speed
  30. FoT Carbonmonoxide Above75ᵗʰ Percentile
  31. FoT Ethane Above75ᵗʰ Percentile
  32. FoT Formaldehyde Above75ᵗʰ Percentile
  33. FoT Hydroxyl Radical Above75ᵗʰ Percentile
  34. FoT Isoprene Above75ᵗʰ Percentile

### 8.5 Display Sample Data

In [17]:
# Display first few rows
print("\nSample of final reduced dataset:")
df_reduced.head()


Sample of final reduced dataset:


| | County | State | Mean Life Expectancy | Poverty Rate | Bachelor's Degree or Higher (%) | Disability Rate | Total Population | Unemployment Rate | Year | Hispanic Population (%) | ... | FoT Propane Above75ᵗʰ Percentile | FoT Sulphur Dioxide Above75ᵗʰ Percentile | Buffalo | Cattle | Chicken | Duck | Goat | Horse | Pig | Sheep |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Autauga County | Alabama | 75.729244 | 11.6 | 24.4 | 15.0 | 54590.0 | 8.6 | 2012 | 2.399707 | ... | 24.795082 | 56.523224 | 104.304193 | 765.477347 | 1460.733157 | 6.200551 | 47.55684 | 54.780249 | 3.199114 | 13.548855 |
| 1 | Baldwin County | Alabama | 77.927178 | 13.3 | 29.3 | 14.0 | 183226.0 | 8.5 | 2012 | 4.319802 | ... | 14.310109 | 30.327869 | 0.0 | 356.486268 | 20.764893 | 2.339871 | 20.358615 | 74.033112 | 3.03827 | 4.505194 |
| 2 | Barbour County | Alabama | 75.726346 | 26.1 | 13.0 | 20.5 | 27469.0 | 13.5 | 2012 | 4.969238 | ... | 20.252732 | 11.47541 | 74.777062 | 595.600612 | 150608.306932 | 6.572719 | 11.807808 | 24.807758 | 8.24866 | 6.213882 |
| 3 | Bibb County | Alabama | 73.854904 | 16.5 | 8.2 | 16.2 | 22769.0 | 10.5 | 2012 | 1.840221 | ... | 27.117486 | 59.187158 | 0.0 | 288.78229 | 516.739571 | 3.657299 | 15.416166 | 42.510005 | 20.036976 | 2.450206 |
| 4 | Blount County | Alabama | 75.703162 | 14.7 | 12.0 | 17.3 | 57466.0 | 10.0 | 2012 | 8.084781 | ... | 32.411202 | 62.773224 | 0.0 | 1692.635794 | 549287.154544 | 16.943384 | 42.096943 | 228.072444 | 9.920152 | 22.714804 |


In [18]:
# Display data types
print("\nData types:")
print(df_reduced.dtypes.value_counts())


Data types:
float64    47
object      2
int64       2
Name: count, dtype: int64


## 9. Save Final Reduced Dataset

Export the final reduced dataset with selected features only.

In [19]:
# Create output directory if needed
output_dir = Path('../data_cleaned/combined_final')
output_dir.mkdir(parents=True, exist_ok=True)

# Save reduced dataset
output_path = output_dir / 'final_combined_all_variables_reduced.csv'
df_reduced.to_csv(output_path, index=False)

print("=" * 70)
print("FINAL REDUCED DATASET SAVED")
print("=" * 70)
print(f" File: {output_path}")
print(f"  - Shape: {df_reduced.shape}")
print(f"  - Rows: {df_reduced.shape[0]:,}")
print(f"  - Columns: {df_reduced.shape[1]}")
print(f"  - File size: {output_path.stat().st_size / (1024*1024):.2f} MB")

FINAL REDUCED DATASET SAVED
 File: ../data_cleaned/combined_final/final_combined_all_variables_reduced.csv
  - Shape: (24487, 51)
  - Rows: 24,487
  - Columns: 51
  - File size: 18.66 MB
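
A quick way to confirm that a saved file matches what was in memory is to read it back and compare shape and column names. The sketch below does this on a toy frame in a temporary directory; the file name `toy_reduced.csv` is made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame for illustration
toy = pd.DataFrame({
    "County": ["X", "Y"],
    "Year": [2012, 2013],
    "Surface Pressure": [1.5, 2.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_reduced.csv"
    # index=False keeps the row index out of the file,
    # so no 'Unnamed: 0' column appears on reload
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(toy.columns)
same_shape = reloaded.shape == toy.shape
print(same_columns, same_shape)  # True True
```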


## 10. Summary Report

Complete summary of feature reduction process.

In [20]:
print("=" * 70)
print("FEATURE REDUCTION SUMMARY")
print("=" * 70)

print("\n1. CHANGES MADE:")
print(f"    Dropped {cols_dropped} redundant variables")
print(f"     - Demographics: {len([c for c in demographics_to_drop if c in df.columns])} variables")
print(f"     - Irrelevant weather: {len([c for c in irrelevant_weather if c in df.columns])} variables")
print(f"     - Redundant weather: {len([c for c in redundant_weather if c in df.columns])} variables")
print(f"    Formatted {len(weather_name_mapping)} weather variables to Title Case")
print(f"    Kept all {len(livestock_remaining)} livestock variables")

print("\n2. DATASET COMPARISON:")
print(f"   Original: {original_cols} columns, {original_rows:,} rows")
print(f"   Reduced:  {reduced_cols} columns, {reduced_rows:,} rows")
print(f"   Reduction: {cols_dropped} columns ({(cols_dropped/original_cols)*100:.1f}%)")

print("\n3. FINAL FEATURE BREAKDOWN:")
print(f"   - Identifiers: {len([c for c in identifiers if c in df_reduced.columns])} (County, State, Year)")
print(f"   - Target: {len([c for c in target if c in df_reduced.columns])} (Mean Life Expectancy)")
print(f"   - Demographics: {len(demographics_remaining)} variables")
print(f"   - Weather: {len(weather_remaining)} variables (Title Case formatted)")
print(f"   - Livestock: {len(livestock_remaining)} variables")
print(f"   - TOTAL: {df_reduced.shape[1]} columns")

print("\n4. DATA QUALITY:")
print(f"    No missing values: {total_missing == 0}")
print(f"    Years covered: {df_reduced['Year'].min()} - {df_reduced['Year'].max()}")
print(f"    Counties per year: ~{df_reduced.groupby('Year').size().mean():.0f}")
print(f"    Target variable range: {df_reduced['Mean Life Expectancy'].min():.2f} - {df_reduced['Mean Life Expectancy'].max():.2f} years")

print("\n5. OUTPUT:")
print(f"    Saved to: {output_path}")
print(f"    File size: {output_path.stat().st_size / (1024*1024):.2f} MB")

print("\n" + "=" * 70)
print(" FEATURE REDUCTION COMPLETE")
print(" Dataset ready for machine learning modeling (notebooks 09+)")
print("=" * 70)

FEATURE REDUCTION SUMMARY

1. CHANGES MADE:
    Dropped 80 redundant variables
     - Demographics: 6 variables
     - Irrelevant weather: 59 variables
     - Redundant weather: 15 variables
    Formatted 30 weather variables to Title Case
    Kept all 8 livestock variables

2. DATASET COMPARISON:
   Original: 131 columns, 24,487 rows
   Reduced:  51 columns, 24,487 rows
   Reduction: 80 columns (61.1%)

3. FINAL FEATURE BREAKDOWN:
   - Identifiers: 3 (County, State, Year)
   - Target: 1 (Mean Life Expectancy)
   - Demographics: 9 variables
   - Weather: 30 variables (Title Case formatted)
   - Livestock: 8 variables
   - TOTAL: 51 columns

4. DATA QUALITY:
    No missing values: True
    Years covered: 2012 - 2019
    Counties per year: ~3061
    Target variable range: 67.09 - 92.25 years

5. OUTPUT:
    Saved to: ../data_cleaned/combined_final/final_combined_all_variables_reduced.csv
    File size: 18.66 MB

 FEATURE REDUCTION COMPLETE
 Dataset ready for machine learning modeling (no