# Data Cleaning

## Purpose
This notebook performs data quality checks and cleaning on the combined CVD mortality dataset, then derives percentage features from the raw counts. Because notebook 03 combined the sources with an **inner join**, most data quality issues should already be filtered out. This notebook focuses on:

1. Identifying and handling missing values (NaN/empty fields)
2. Detecting and removing Census API error codes (`-666666666`)
3. Verifying data integrity and consistency
4. Engineering percentage features from raw counts and dropping the counts
5. Saving the cleaned, feature-engineered dataset

## Input
- `data_cvd/processed/combined_by_year/combined_all_years.csv`
- Expected: County-year observations with 19 ACS variables + CVD mortality rate
- Years: 2012-2019

## Output
- `data_cvd/demographics_final/combined_all_years_cleaned_final.csv`
- Cleaned dataset with engineered percentage features (23 columns)

## 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

## 2. Load Data

Read the combined dataset from notebook 03.

In [2]:
# Load the combined dataset
df = pd.read_csv('../data_cvd/processed/combined_by_year/combined_all_years.csv')

print(f"Dataset shape: {df.shape}")
print(f"  - Rows (county-year observations): {df.shape[0]:,}")
print(f"  - Columns (features + identifiers): {df.shape[1]}")

Dataset shape: (24888, 27)
  - Rows (county-year observations): 24,888
  - Columns (features + identifiers): 27


## 3. Initial Data Inspection

Review the structure, data types, and first few rows of the dataset.

In [3]:
# Display first few rows
print("First 5 rows of the dataset:")
df.head()

First 5 rows of the dataset:


Unnamed: 0,County,State,cvd_mortality_rate,year,State_FIPS,County_FIPS,Fips,Poverty Rate,High School Degree or Higher (%),Black Population,...,Total Population,Rent Burden Count (+50%),Hispanic Population,No Vehicle (Owner),Median Household Income,Total Families (Single Mother),Total Occupied Households,Unemployment Rate,No Vehicle (Renter),Year
0,Autauga County,Alabama,0.003293,2012,1,1,1001,11.6,85.0,9880.0,...,54590.0,906.0,1310.0,444.0,53773.0,1562.0,19934.0,8.6,580.0,2012
1,Baldwin County,Alabama,0.002951,2012,1,3,1003,13.3,87.0,17016.0,...,183226.0,3468.0,7915.0,1021.0,50706.0,4626.0,72751.0,8.5,1221.0,2012
2,Barbour County,Alabama,0.003027,2012,1,5,1005,26.1,70.2,12645.0,...,27469.0,786.0,1365.0,317.0,31889.0,1045.0,9423.0,13.5,623.0,2012
3,Bibb County,Alabama,0.003566,2012,1,7,1007,16.5,71.5,4953.0,...,22769.0,226.0,419.0,251.0,36824.0,536.0,7386.0,10.5,127.0,2012
4,Blount County,Alabama,0.003056,2012,1,9,1009,14.7,73.9,754.0,...,57466.0,651.0,4646.0,335.0,45192.0,1069.0,21031.0,10.0,471.0,2012


In [4]:
# Display dataset info
print("Dataset information:")
df.info()

Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24888 entries, 0 to 24887
Data columns (total 27 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   County                            24888 non-null  object 
 1   State                             24888 non-null  object 
 2   cvd_mortality_rate                24888 non-null  float64
 3   year                              24888 non-null  int64  
 4   State_FIPS                        24888 non-null  int64  
 5   County_FIPS                       24888 non-null  int64  
 6   Fips                              24888 non-null  int64  
 7   Poverty Rate                      24871 non-null  float64
 8   High School Degree or Higher (%)  24872 non-null  float64
 9   Black Population                  24872 non-null  float64
 10  Total Families                    24872 non-null  float64
 11  Bachelor's Degree or Higher (%)   24872 non-nu

In [5]:
# Display all column names
print(f"Total columns: {len(df.columns)}\n")
print("Column names:")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2}. {col}")

Total columns: 27

Column names:
 1. County
 2. State
 3. cvd_mortality_rate
 4. year
 5. State_FIPS
 6. County_FIPS
 7. Fips
 8. Poverty Rate
 9. High School Degree or Higher (%)
10. Black Population
11. Total Families
12. Bachelor's Degree or Higher (%)
13. Median Age
14. Gini Index
15. Rent Denominator
16. Disability Rate
17. White Population
18. Total Population
19. Rent Burden Count (+50%)
20. Hispanic Population
21. No Vehicle (Owner)
22. Median Household Income
23. Total Families (Single Mother)
24. Total Occupied Households
25. Unemployment Rate
26. No Vehicle (Renter)
27. Year
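
The listing shows both a `year` and a `Year` column. A minimal sketch, on a toy frame rather than the real data, of how one might confirm that two such columns agree before dropping one of them. Keeping `Year` is an arbitrary choice made for illustration, not a decision this notebook takes.

```python
import pandas as pd

# Toy stand-in for the combined dataset (values are illustrative only)
toy = pd.DataFrame({
    "Fips": [1001, 1003, 1005],
    "year": [2012, 2012, 2013],
    "Year": [2012, 2012, 2013],
})

# True only if every row holds the same value in both columns
years_match = bool((toy["year"] == toy["Year"]).all())
print(f"year and Year identical: {years_match}")

# If they match, one copy is redundant; keeping 'Year' is an assumption
deduped = toy.drop(columns=["year"]) if years_match else toy
print(list(deduped.columns))
```

If the columns disagree, the mismatching rows are worth inspecting before either column is removed.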


## 4. Check for Missing Values

Identify any missing values (NaN or empty fields) in the dataset.

In [6]:
# Count missing values per column
missing_values = df.isnull().sum()

print("Missing values per column:")
print("=" * 60)
for col in df.columns:
    if missing_values[col] > 0:
        pct = (missing_values[col] / len(df)) * 100
        print(f"{col:40s}: {missing_values[col]:5} ({pct:.2f}%)")

total_missing = missing_values.sum()
print("=" * 60)
print(f"Total missing values: {total_missing}")

if total_missing == 0:
    print("\n No missing values detected!")

Missing values per column:
Poverty Rate                            :    17 (0.07%)
High School Degree or Higher (%)        :    16 (0.06%)
Black Population                        :    16 (0.06%)
Total Families                          :    16 (0.06%)
Bachelor's Degree or Higher (%)         :    16 (0.06%)
Median Age                              :    16 (0.06%)
Gini Index                              :    17 (0.07%)
Rent Denominator                        :    17 (0.07%)
Disability Rate                         :    16 (0.06%)
White Population                        :    16 (0.06%)
Total Population                        :    16 (0.06%)
Rent Burden Count (+50%)                :    17 (0.07%)
Hispanic Population                     :    16 (0.06%)
No Vehicle (Owner)                      :    16 (0.06%)
Median Household Income                 :    17 (0.07%)
Total Families (Single Mother)          :    16 (0.06%)
Total Occupied Households               :    16 (0.06%)
Unemployment Rate    

In [7]:
# If there are missing values, examine the affected rows
if df.isnull().any().any():
    print("\nRows with missing values:")
    print("=" * 60)
    rows_with_missing = df[df.isnull().any(axis=1)]
    print(f"Total rows affected: {len(rows_with_missing)}")
    print("\nSample of affected rows:")
    print(rows_with_missing.head(10))
else:
    print("\n No rows with missing values!")


Rows with missing values:
Total rows affected: 17

Sample of affected rows:
                         County         State  cvd_mortality_rate  year  \
9137               Bedford City      Virginia            0.002471  2014   
9414   Wade Hampton Census Area        Alaska            0.003672  2015   
11725            Shannon County  South Dakota            0.003367  2015   
12248              Bedford City      Virginia            0.002511  2015   
12525  Wade Hampton Census Area        Alaska            0.003549  2016   
14836            Shannon County  South Dakota            0.003396  2016   
15359              Bedford City      Virginia            0.002472  2016   
15636  Wade Hampton Census Area        Alaska            0.003504  2017   
17947            Shannon County  South Dakota            0.003373  2017   
18470              Bedford City      Virginia            0.002441  2017   

       State_FIPS  County_FIPS   Fips  Poverty Rate  \
9137           51          515  51515     
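
The sample suggests that the affected rows concentrate in a handful of counties. A sketch, using a toy frame with made-up values, of how the incomplete rows could be tallied per county; the column names mirror the notebook, while `missing_by_county` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy rows: three incomplete observations and one complete one
toy = pd.DataFrame({
    "County": ["Bedford City", "Bedford City", "Shannon County", "Autauga County"],
    "State": ["Virginia", "Virginia", "South Dakota", "Alabama"],
    "Year": [2014, 2015, 2015, 2015],
    "Poverty Rate": [np.nan, np.nan, np.nan, 11.6],
})

rows_with_missing = toy[toy.isnull().any(axis=1)]

# Count incomplete rows per county and record the years they span
missing_by_county = (
    rows_with_missing
    .groupby(["State", "County"])["Year"]
    .agg(n_rows="count", first_year="min", last_year="max")
    .reset_index()
)
print(missing_by_county)
```

The three counties in the sample above are geographies that were renamed or dissolved during the study period (Bedford City was absorbed into Bedford County in 2013; Wade Hampton Census Area and Shannon County were renamed Kusilvak Census Area and Oglala Lakota County in 2015), which may explain why their ACS values are absent in later years.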

## 5. Check for Census Error Codes

The Census API returns `-666666666` in place of an estimate that could not be computed, typically because too few sample observations were available. Identify any occurrences of this error code.

In [8]:
# Check for -666666666 error codes
error_code_mask = (df == -666666666).any(axis=1)
rows_with_errors = df[error_code_mask]

print("Census API Error Code Check:")
print("=" * 60)
print(f"Rows with -666666666 error codes: {len(rows_with_errors)}")

if len(rows_with_errors) > 0:
    print(f"Percentage of dataset: {(len(rows_with_errors) / len(df)) * 100:.2f}%")
    print("\nColumns affected:")
    for col in df.columns:
        count = (df[col] == -666666666).sum()
        if count > 0:
            print(f"  {col}: {count} occurrences")
    
    print("\nSample of affected rows:")
    print(rows_with_errors.head(10))
else:
    print(" No Census error codes found!")

Census API Error Code Check:
Rows with -666666666 error codes: 0
 No Census error codes found!
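
The check above compares every column, including the string columns, against a single code. A sketch of a variant that looks only at numeric columns and covers several sentinel values. Only `-666666666` comes from this notebook; the other entries in `SENTINELS` are assumptions based on the Census Bureau's published annotation values and should be verified against the ACS documentation before use.

```python
import pandas as pd

# -666666666 is the code checked in this notebook; the rest are assumed
SENTINELS = [-666666666, -999999999, -888888888, -222222222, -333333333, -555555555]

# Toy data with two planted sentinel values
toy = pd.DataFrame({
    "County": ["A", "B", "C"],
    "Median Household Income": [53773.0, -666666666.0, 31889.0],
    "Gini Index": [0.41, 0.44, -888888888.0],
})

# Restrict the comparison to numeric columns
numeric = toy.select_dtypes(include="number")
sentinel_mask = numeric.isin(SENTINELS)

per_column = sentinel_mask.sum()          # occurrences per column
rows_flagged = sentinel_mask.any(axis=1)  # rows containing any sentinel

print(per_column[per_column > 0])
print(f"Rows flagged: {int(rows_flagged.sum())}")
```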


## 6. Remove Problematic Rows

Remove every row that contains a missing value or a Census error code. Because notebook 03 used an inner join, few rows should be affected.

In [9]:
# Create cleaned dataset by removing problematic rows
print("Cleaning dataset...")
print("=" * 60)

# Starting size
print(f"Original dataset: {len(df):,} rows")

# Step 1: Remove rows with missing values
df_cleaned = df.dropna()
print(f"After removing missing values: {len(df_cleaned):,} rows ({len(df) - len(df_cleaned)} removed)")

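# Step 2: Remove rows with Census error codes
# .copy() gives an independent frame, avoiding SettingWithCopyWarning on later column assignments
df_cleaned = df_cleaned[(df_cleaned != -666666666).all(axis=1)].copy()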
print(f"After removing error codes: {len(df_cleaned):,} rows ({len(df) - len(df_cleaned)} total removed)")

print("=" * 60)
print(f"Final cleaned dataset: {len(df_cleaned):,} rows")
print(f"Rows removed: {len(df) - len(df_cleaned)} ({((len(df) - len(df_cleaned)) / len(df)) * 100:.2f}%)")

Cleaning dataset...
Original dataset: 24,888 rows
After removing missing values: 24,871 rows (17 removed)
After removing error codes: 24,871 rows (17 total removed)
Final cleaned dataset: 24,871 rows
Rows removed: 17 (0.07%)
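
An alternative to the two-step removal, sketched here on toy data, converts the error code to NaN first so that a single `dropna` call handles both cases. This is one possible arrangement, not the approach the notebook takes.

```python
import numpy as np
import pandas as pd

ERROR_CODE = -666666666

# Toy data: row B has a NaN, row C has the error code
toy = pd.DataFrame({
    "County": ["A", "B", "C", "D"],
    "Poverty Rate": [11.6, np.nan, 26.1, 16.5],
    "Median Household Income": [53773.0, 50706.0, ERROR_CODE, 36824.0],
})

# Treat the error code as missing, then drop every incomplete row in one step
toy_cleaned = toy.replace(ERROR_CODE, np.nan).dropna().reset_index(drop=True)

n_removed = len(toy) - len(toy_cleaned)
print(f"Removed {n_removed} of {len(toy)} rows")
```

The trade-off is that the per-cause counts (missing values versus error codes) are no longer reported separately unless they are computed before the replacement.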


## 7. Verify Cleaned Dataset

Perform final checks to ensure the cleaned dataset has no remaining issues.

In [10]:
# Verification checks
print("Cleaned Dataset Verification:")
print("=" * 60)

# Check for missing values
missing_after = df_cleaned.isnull().sum().sum()
print(f"Missing values: {missing_after}")

# Check for error codes
error_codes_after = (df_cleaned == -666666666).any().any()
print(f"Error codes remaining: {error_codes_after}")

# Check data integrity
print(f"\nShape: {df_cleaned.shape}")
print(f"Years present: {sorted(df_cleaned['Year'].unique())}")
print(f"Counties per year:")
print(df_cleaned.groupby('Year').size())

print("\n" + "=" * 60)
if missing_after == 0 and not error_codes_after:
    print(" Dataset is clean and ready for feature engineering!")
else:
    print(" Warning: Issues remain in the dataset")

Cleaned Dataset Verification:
Missing values: 0
Error codes remaining: False

Shape: (24871, 27)
Years present: [2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
Counties per year:
Year
2012    3111
2013    3111
2014    3110
2015    3108
2016    3108
2017    3108
2018    3107
2019    3108
dtype: int64

 Dataset is clean and ready for feature engineering!
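
Two further integrity checks that fit this step, sketched on a toy frame: whether any county appears more than once in the same year, and how much the yearly county count varies. The key columns `Fips` and `Year` follow the notebook; the variable names are hypothetical.

```python
import pandas as pd

# Toy county-year table: two counties in 2012, three in 2013
toy = pd.DataFrame({
    "Fips": [1001, 1003, 1001, 1003, 1005],
    "Year": [2012, 2012, 2013, 2013, 2013],
})

# Each county should appear at most once per year
n_duplicates = int(toy.duplicated(subset=["Fips", "Year"]).sum())

# Spread between the largest and smallest yearly county count
per_year = toy.groupby("Year").size()
count_spread = int(per_year.max() - per_year.min())

print(f"Duplicate county-year rows: {n_duplicates}")
print(f"County count spread across years: {count_spread}")
```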


In [11]:
# Summary statistics for the target variable
print("\nSummary Statistics for CVD Mortality Rate:")
print("=" * 60)
print(df_cleaned['cvd_mortality_rate'].describe())


Summary Statistics for CVD Mortality Rate:
count    24871.000000
mean         0.002703
std          0.000551
min          0.000731
25%          0.002303
50%          0.002647
75%          0.003055
max          0.006389
Name: cvd_mortality_rate, dtype: float64


In [12]:
# Display first few rows of cleaned dataset
print("Sample of cleaned dataset:")
df_cleaned.head(10)

Sample of cleaned dataset:


Unnamed: 0,County,State,cvd_mortality_rate,year,State_FIPS,County_FIPS,Fips,Poverty Rate,High School Degree or Higher (%),Black Population,...,Total Population,Rent Burden Count (+50%),Hispanic Population,No Vehicle (Owner),Median Household Income,Total Families (Single Mother),Total Occupied Households,Unemployment Rate,No Vehicle (Renter),Year
0,Autauga County,Alabama,0.003293,2012,1,1,1001,11.6,85.0,9880.0,...,54590.0,906.0,1310.0,444.0,53773.0,1562.0,19934.0,8.6,580.0,2012
1,Baldwin County,Alabama,0.002951,2012,1,3,1003,13.3,87.0,17016.0,...,183226.0,3468.0,7915.0,1021.0,50706.0,4626.0,72751.0,8.5,1221.0,2012
2,Barbour County,Alabama,0.003027,2012,1,5,1005,26.1,70.2,12645.0,...,27469.0,786.0,1365.0,317.0,31889.0,1045.0,9423.0,13.5,623.0,2012
3,Bibb County,Alabama,0.003566,2012,1,7,1007,16.5,71.5,4953.0,...,22769.0,226.0,419.0,251.0,36824.0,536.0,7386.0,10.5,127.0,2012
4,Blount County,Alabama,0.003056,2012,1,9,1009,14.7,73.9,754.0,...,57466.0,651.0,4646.0,335.0,45192.0,1069.0,21031.0,10.0,471.0,2012
5,Bullock County,Alabama,0.003317,2012,1,11,1011,22.2,70.0,7609.0,...,10779.0,213.0,623.0,300.0,34500.0,420.0,3740.0,14.3,299.0,2012
6,Butler County,Alabama,0.003426,2012,1,13,1013,25.7,73.8,8961.0,...,20730.0,547.0,203.0,281.0,30752.0,925.0,8149.0,13.8,545.0,2012
7,Calhoun County,Alabama,0.003727,2012,1,15,1015,21.4,78.5,24421.0,...,117834.0,3232.0,3838.0,1009.0,40093.0,3345.0,45764.0,13.3,1794.0,2012
8,Chambers County,Alabama,0.00323,2012,1,17,1017,23.3,70.0,13409.0,...,34228.0,1051.0,555.0,390.0,32181.0,1328.0,13634.0,16.0,829.0,2012
9,Cherokee County,Alabama,0.003113,2012,1,19,1019,19.5,77.5,1253.0,...,25917.0,353.0,333.0,169.0,36241.0,687.0,11758.0,10.5,299.0,2012


## 8. Feature Engineering: Race Percentages

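Calculate race and ethnicity percentages from population counts. Hispanic origin is an ethnicity that overlaps with race, so the three shares need not sum to 100 (for Blount County in the output of Section 12, they sum to about 104).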

In [13]:
print("Calculating race percentages...")
print("=" * 60)

# White Population (%)
df_cleaned['White Population (%)'] = (df_cleaned['White Population'] / df_cleaned['Total Population']) * 100

# Hispanic Population (%)
df_cleaned['Hispanic Population (%)'] = (df_cleaned['Hispanic Population'] / df_cleaned['Total Population']) * 100

# Black Population (%)
df_cleaned['Black Population (%)'] = (df_cleaned['Black Population'] / df_cleaned['Total Population']) * 100

print(" Race percentages calculated")
print("\nNew columns created:")
print("  - White Population (%)")
print("  - Hispanic Population (%)")
print("  - Black Population (%)")

# Display summary statistics
print("\nSummary statistics:")
print("\nWhite Population (%):")
print(df_cleaned['White Population (%)'].describe())
print("\nHispanic Population (%):")
print(df_cleaned['Hispanic Population (%)'].describe())
print("\nBlack Population (%):")
print(df_cleaned['Black Population (%)'].describe())

Calculating race percentages...
 Race percentages calculated

New columns created:
  - White Population (%)
  - Hispanic Population (%)
  - Black Population (%)

Summary statistics:

White Population (%):
count    24871.000000
mean        83.367542
std         16.688267
min          3.110111
25%         76.865451
50%         89.872513
75%         95.369517
max        100.000000
Name: White Population (%), dtype: float64

Hispanic Population (%):
count    24871.000000
mean         8.822994
std         13.466728
min          0.000000
25%          1.910721
50%          3.744418
75%          8.975933
max         99.184800
Name: Hispanic Population (%), dtype: float64

Black Population (%):
count    24871.000000
mean         9.092531
std         14.543398
min          0.000000
25%          0.626508
50%          2.263721
75%         10.400362
max         87.412281
Name: Black Population (%), dtype: float64
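
The divisions above assume that the denominator is never zero. A minimal sketch of a helper that returns NaN instead of `inf` when it is; `pct_of` is a hypothetical name, and the toy counts are illustrative apart from the first row, which reuses the Autauga County figures shown earlier.

```python
import numpy as np
import pandas as pd

def pct_of(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Return 100 * numerator / denominator, with NaN where the denominator is 0."""
    safe_denominator = denominator.replace(0, np.nan)
    return numerator / safe_denominator * 100

toy = pd.DataFrame({
    "Black Population": [9880.0, 0.0, 5.0],
    "Total Population": [54590.0, 1000.0, 0.0],  # last row has a zero denominator
})

toy["Black Population (%)"] = pct_of(toy["Black Population"], toy["Total Population"])
print(toy["Black Population (%)"].round(2).tolist())
```

Rows that come back as NaN can then be reviewed or dropped explicitly rather than silently carrying infinite values into later steps.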


## 9. Feature Engineering: Housing and Family Percentages

Calculate percentages for housing and family characteristics. "HI" in the rent burden column name stands for household income.

In [14]:
print("Calculating housing and family percentages...")
print("=" * 60)

# Households with No Vehicle (%)
df_cleaned['Households with No Vehicle (%)'] = (
    (df_cleaned['No Vehicle (Owner)'] + df_cleaned['No Vehicle (Renter)']) / 
    df_cleaned['Total Occupied Households']
) * 100

# Rent Burden (+50% of HI)
df_cleaned['Rent Burden (+50% of HI)'] = (
    df_cleaned['Rent Burden Count (+50%)'] / 
    df_cleaned['Rent Denominator']
) * 100

# Single Mother Families (%)
df_cleaned['Single Mother Families (%)'] = (
    df_cleaned['Total Families (Single Mother)'] / 
    df_cleaned['Total Families']
) * 100

print(" Housing and family percentages calculated")
print("\nNew columns created:")
print("  - Households with No Vehicle (%)")
print("  - Rent Burden (+50% of HI)")
print("  - Single Mother Families (%)")

# Display summary statistics
print("\nSummary statistics:")
print("\nHouseholds with No Vehicle (%):")
print(df_cleaned['Households with No Vehicle (%)'].describe())
print("\nRent Burden (+50% of HI):")
print(df_cleaned['Rent Burden (+50% of HI)'].describe())
print("\nSingle Mother Families (%):")
print(df_cleaned['Single Mother Families (%)'].describe())

Calculating housing and family percentages...
 Housing and family percentages calculated

New columns created:
  - Households with No Vehicle (%)
  - Rent Burden (+50% of HI)
  - Single Mother Families (%)

Summary statistics:

Households with No Vehicle (%):
count    24871.000000
mean         6.488079
std          4.317458
min          0.000000
25%          4.371020
50%          5.804968
75%          7.646126
max         83.978873
Name: Households with No Vehicle (%), dtype: float64

Rent Burden (+50% of HI):
count    24871.000000
mean        18.721220
std          6.335326
min          0.000000
25%         14.627110
50%         18.854013
75%         22.875408
max         50.156703
Name: Rent Burden (+50% of HI), dtype: float64

Single Mother Families (%):
count    24871.000000
mean         9.357959
std          3.594417
min          0.000000
25%          6.997596
50%          8.892219
75%         11.153632
max         30.714648
Name: Single Mother Families (%), dtype: float64
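
Since every engineered column is a percentage, a range check is a cheap way to catch mismatched numerators and denominators. A sketch on toy values, one of which is deliberately invalid. Selecting columns by the `(%)` suffix is an assumption made for brevity: it would miss `Rent Burden (+50% of HI)`, which would have to be listed explicitly.

```python
import pandas as pd

toy = pd.DataFrame({
    "Households with No Vehicle (%)": [5.14, 83.98, 3.08],
    "Single Mother Families (%)": [10.96, 8.88, 104.2],  # last value is deliberately invalid
})

# Select percentage columns by name suffix (an assumption for this sketch)
pct_columns = [c for c in toy.columns if c.endswith("(%)")]

# True wherever a value falls outside [0, 100]; NaN would also be flagged
out_of_range = ~toy[pct_columns].apply(lambda s: s.between(0, 100))
violations = out_of_range.sum()

print(violations[violations > 0])
```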


## 10. Drop Raw Count Columns

Remove the raw count columns that were used to create percentages, keeping only Total Population.

In [15]:
print("Dropping raw count columns...")
print("=" * 60)

# Columns to drop (all used for percentage calculations except Total Population)
columns_to_drop = [
    'White Population',
    'Hispanic Population',
    'Black Population',
    'No Vehicle (Owner)',
    'No Vehicle (Renter)',
    'Total Occupied Households',
    'Rent Burden Count (+50%)',
    'Rent Denominator',
    'Total Families (Single Mother)',
    'Total Families'
]

df_cleaned = df_cleaned.drop(columns=columns_to_drop)

print(f" Dropped {len(columns_to_drop)} columns")
print("\nColumns dropped:")
for col in columns_to_drop:
    print(f"  - {col}")

print(f"\nDataset shape after dropping: {df_cleaned.shape}")
print(f"  - Rows: {df_cleaned.shape[0]:,}")
print(f"  - Columns: {df_cleaned.shape[1]}")

Dropping raw count columns...
 Dropped 10 columns

Columns dropped:
  - White Population
  - Hispanic Population
  - Black Population
  - No Vehicle (Owner)
  - No Vehicle (Renter)
  - Total Occupied Households
  - Rent Burden Count (+50%)
  - Rent Denominator
  - Total Families (Single Mother)
  - Total Families

Dataset shape after dropping: (24871, 23)
  - Rows: 24,871
  - Columns: 23
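
`DataFrame.drop` raises a `KeyError` by default when a listed column is absent, so re-running the cell above on an already trimmed frame fails. A sketch of a more tolerant variant that reports which columns were skipped; the toy frame and the `present`/`absent` names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Total Population": [54590.0],
    "White Population": [43084.0],
    "Black Population": [9880.0],
})

columns_to_drop = ["White Population", "Black Population", "Hispanic Population"]

# Split the request into columns that exist and columns that do not
present = [c for c in columns_to_drop if c in toy.columns]
absent = [c for c in columns_to_drop if c not in toy.columns]

toy = toy.drop(columns=present)

print(f"Dropped: {present}")
print(f"Not found (skipped): {absent}")
```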


## 11. Rename Target Variable

Rename the target variable for consistency.

In [16]:
print("Renaming target variable...")
print("=" * 60)

# Rename cvd_mortality_rate to CVD Mortality Rate
df_cleaned = df_cleaned.rename(columns={'cvd_mortality_rate': 'CVD Mortality Rate'})

print(" Renamed 'cvd_mortality_rate' to 'CVD Mortality Rate'")
print(f"\nTarget variable summary:")
print(df_cleaned['CVD Mortality Rate'].describe())

Renaming target variable...
 Renamed 'cvd_mortality_rate' to 'CVD Mortality Rate'

Target variable summary:
count    24871.000000
mean         0.002703
std          0.000551
min          0.000731
25%          0.002303
50%          0.002647
75%          0.003055
max          0.006389
Name: CVD Mortality Rate, dtype: float64


## 12. Final Dataset Verification

Review the final dataset structure with all engineered features.

In [17]:
print("Final Dataset Structure:")
print("=" * 60)
print(f"Shape: {df_cleaned.shape}")
print(f"  - Rows: {df_cleaned.shape[0]:,}")
print(f"  - Columns: {df_cleaned.shape[1]}")

print("\nAll column names:")
for i, col in enumerate(df_cleaned.columns, 1):
    print(f"{i:2}. {col}")

print("\nSample of final dataset:")
df_cleaned.head(10)

Final Dataset Structure:
Shape: (24871, 23)
  - Rows: 24,871
  - Columns: 23

All column names:
 1. County
 2. State
 3. CVD Mortality Rate
 4. year
 5. State_FIPS
 6. County_FIPS
 7. Fips
 8. Poverty Rate
 9. High School Degree or Higher (%)
10. Bachelor's Degree or Higher (%)
11. Median Age
12. Gini Index
13. Disability Rate
14. Total Population
15. Median Household Income
16. Unemployment Rate
17. Year
18. White Population (%)
19. Hispanic Population (%)
20. Black Population (%)
21. Households with No Vehicle (%)
22. Rent Burden (+50% of HI)
23. Single Mother Families (%)

Sample of final dataset:


Unnamed: 0,County,State,CVD Mortality Rate,year,State_FIPS,County_FIPS,Fips,Poverty Rate,High School Degree or Higher (%),Bachelor's Degree or Higher (%),...,Total Population,Median Household Income,Unemployment Rate,Year,White Population (%),Hispanic Population (%),Black Population (%),Households with No Vehicle (%),Rent Burden (+50% of HI),Single Mother Families (%)
0,Autauga County,Alabama,0.003293,2012,1,1,1001,11.6,85.0,24.4,...,54590.0,53773.0,8.6,2012,78.92288,2.399707,18.098553,5.136952,20.377868,10.955253
1,Baldwin County,Alabama,0.002951,2012,1,3,1003,13.3,87.0,29.3,...,183226.0,50706.0,8.5,2012,86.373659,4.319802,9.286892,3.081745,18.386173,8.876182
2,Barbour County,Alabama,0.003027,2012,1,5,1005,26.1,70.2,13.0,...,27469.0,31889.0,13.5,2012,48.957006,4.969238,46.033711,9.975592,25.047801,16.343447
3,Bibb County,Alabama,0.003566,2012,1,7,1007,16.5,71.5,8.2,...,22769.0,36824.0,10.5,2012,76.678818,1.840221,21.753261,5.11779,16.062544,9.766764
4,Blount County,Alabama,0.003056,2012,1,9,1009,14.7,73.9,12.0,...,57466.0,45192.0,10.0,2012,94.850868,8.084781,1.31208,3.832438,16.38148,6.85696
5,Bullock County,Alabama,0.003317,2012,1,11,1011,22.2,70.0,9.3,...,10779.0,34500.0,14.3,2012,28.09166,5.779757,70.590964,16.016043,25.088339,17.398509
6,Butler County,Alabama,0.003426,2012,1,13,1013,25.7,73.8,12.2,...,20730.0,30752.0,13.8,2012,54.896286,0.979257,43.227207,10.136213,22.668877,17.193309
7,Calhoun County,Alabama,0.003727,2012,1,15,1015,21.4,78.5,16.3,...,117834.0,40093.0,13.3,2012,75.025035,3.257124,20.724918,6.124902,23.394861,10.984139
8,Chambers County,Alabama,0.00323,2012,1,17,1017,23.3,70.0,11.7,...,34228.0,32181.0,16.0,2012,58.931284,1.621479,39.175529,8.940883,25.053635,14.752277
9,Cherokee County,Alabama,0.003113,2012,1,19,1019,19.5,77.5,11.9,...,25917.0,36241.0,10.5,2012,92.888837,1.284871,4.834665,3.980269,11.72368,8.03133


## 13. Save Final Dataset

Export the cleaned and engineered dataset for machine learning models.

In [18]:
# Create output directory
output_dir = Path('../data_cvd/demographics_final')
output_dir.mkdir(parents=True, exist_ok=True)

# Save final cleaned and engineered dataset
output_path = output_dir / 'combined_all_years_cleaned_final.csv'
df_cleaned.to_csv(output_path, index=False)

print("=" * 60)
print("DATASET SAVED")
print("=" * 60)
print(f"✓ Final dataset saved to: {output_path}")
print(f"\n  - Shape: {df_cleaned.shape}")
print(f"  - Rows: {df_cleaned.shape[0]:,}")
print(f"  - Columns: {df_cleaned.shape[1]}")
print(f"  - File size: {output_path.stat().st_size / (1024*1024):.2f} MB")
print(f"\n  - Engineered features: 6")
print(f"  - Ready for machine learning!")

DATASET SAVED
✓ Final dataset saved to: ../data_cvd/demographics_final/combined_all_years_cleaned_final.csv

  - Shape: (24871, 23)
  - Rows: 24,871
  - Columns: 23
  - File size: 5.37 MB

  - Engineered features: 6
  - Ready for machine learning!
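
A quick way to confirm that the export is readable is to load the file back and compare its shape and column names with the in-memory frame. The sketch below uses a temporary directory and a toy frame so that it does not touch the real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "County": ["Autauga County", "Baldwin County"],
    "Fips": [1001, 1003],
    "CVD Mortality Rate": [0.003293, 0.002951],
})

# Write to a throwaway location and read the file back
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_cleaned.csv"
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

round_trip_ok = (
    reloaded.shape == toy.shape
    and list(reloaded.columns) == list(toy.columns)
)
print(f"Round trip preserved shape and columns: {round_trip_ok}")
```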


## 14. Summary

**Data Cleaning Results:**
- Original observations: 24,888
- Rows removed: 17 (0.07%), all due to missing values
- Final clean observations: 24,871

**Quality Checks Passed:**
- No missing values (NaN)
- No Census error codes (-666666666)
- All years present (2012-2019)
- County counts nearly constant across years (3,107 to 3,111 per year)

**Feature Engineering Completed:**
- **Race Percentages (3 features):**
  - White Population (%) = (White Population / Total Population) × 100
  - Hispanic Population (%) = (Hispanic Population / Total Population) × 100
  - Black Population (%) = (Black Population / Total Population) × 100

- **Housing & Family Percentages (3 features):**
  - Households with No Vehicle (%) = [(No Vehicle Owner + Renter) / Total Households] × 100
  - Rent Burden (+50% of HI) = (Rent Burden Count / Rent Denominator) × 100
  - Single Mother Families (%) = (Single Mother Families / Total Families) × 100

**Columns Dropped:**
- Raw count columns used for percentages (10 columns dropped)
- Kept: Total Population (useful as a feature)

**Target Variable:**
- Renamed `cvd_mortality_rate` → `CVD Mortality Rate`

**Final Dataset:**
- **Location:** `data_cvd/demographics_final/combined_all_years_cleaned_final.csv`
- **Features:** 15 variables (6 engineered + 9 original) + 7 identifier columns + target, 23 columns in total
- **Ready for:** Exploratory Data Analysis (EDA) and Machine Learning models

**Next Steps:**
- Proceed to notebook 05 to merge with weather and livestock data
- Then notebooks 06+ for feature analysis and machine learning model development