# 🚗 EPA Fuel Economy - Regression EDA

**Author:** Reza Mirzaeifard
**Date:** December 2025
**Dataset:** EPA Fuel Economy (Real-world vehicle data)

---

## 1. Problem Statement

**Goal**: Predict **combined fuel economy** (MPG, column `comb08`) from vehicle characteristics.

### Why EPA Fuel Economy for Regression?

| Criterion | UAH Scores | EPA Dataset |
|-----------|------------|-------------|
| **Target type** | Aggregate scores (0-100) | True continuous (MPG) |
| **Sample size** | ~40 trips | 40,000+ vehicles |
| **Outliers** | Few | ~10% (well suited to robust regression) |

### Challenges to Demonstrate
1. **Outliers** (~10%): Use Huber Regressor (robust)
2. **High-cardinality categoricals**: Use target encoding
3. **Multicollinearity**: Use Ridge/Lasso regularization

---


In [None]:
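# Clear stale imports so edits to the local `src` package are picked up on re-run
import sys
for mod in list(sys.modules):
    # Match `src` and its submodules only, not unrelated modules that merely start with "src"
    if mod == 'src' or mod.startswith('src.'):
        del sys.modules[mod]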


In [None]:
import sys
from pathlib import Path

project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

from src.data import load_epa_fuel_economy
from src.features import (
    analyze_outliers_dataframe,
    print_outlier_summary,
    find_high_correlation_pairs,
    get_feature_columns,
    get_correlations_with_target,
)
from src.visualization import (
    setup_style,
    plot_target_distribution_regression,
    plot_target_vs_numerical_features,
    plot_target_vs_categorical_features,
    plot_correlation_matrix,
    plot_categorical_distributions,
)
from src.utils import (
    print_dataset_info,
    print_target_statistics,
    print_skewness_check,
    print_feature_types,
    print_categorical_cardinality,
    print_high_correlation_pairs,
    print_save_confirmation,
    print_success,
)

setup_style()
print_success('Setup complete')


## 2. Load Data


In [None]:
import pandas as pd

dataset = load_epa_fuel_economy(year_min=2015, year_max=2024, sample_size=5000, random_state=42)
print_dataset_info(dataset.info)

# Combine features and target (comb08 = combined MPG) into one frame for EDA
df = pd.DataFrame(dataset.X, columns=dataset.feature_names)
df['comb08'] = dataset.y.values
df.head(10)


## 3. Target Distribution


In [None]:
print_target_statistics(df['comb08'].values, "Combined MPG")
print_skewness_check(df['comb08'].skew())

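# savefig does not create missing directories, so make sure the output folder exists
figures_dir = project_root / 'results' / 'figures'
figures_dir.mkdir(parents=True, exist_ok=True)

fig = plot_target_distribution_regression(df['comb08'].values, target_name='Combined MPG')
fig.savefig(figures_dir / 'target_distribution_regression.png', dpi=300, bbox_inches='tight')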
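
The skewness check above matters because a strongly right-skewed target often benefits from a log transform before fitting linear models. The sketch below illustrates the idea on synthetic log-normal values, not on the EPA data; the variable names and distribution parameters are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for an MPG-like target (NOT the EPA data).
rng = np.random.default_rng(0)
target = pd.Series(rng.lognormal(mean=3.2, sigma=0.4, size=2000), name="comb08")

# Series.skew() returns the bias-adjusted sample skewness; 0 means symmetric.
raw_skew = target.skew()
log_skew = np.log1p(target).skew()

print(f"skew (raw):   {raw_skew:.2f}")
print(f"skew (log1p): {log_skew:.2f}")
```

If the transformed target is noticeably more symmetric, fitting on `log1p(y)` and inverting predictions with `expm1` is one option to evaluate in the modeling notebook.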


## 4. Outlier Detection (MAD method - robust)


In [None]:
outlier_results = analyze_outliers_dataframe(df, columns=['comb08'], method='mad', mad_threshold=3.5)
print_outlier_summary(outlier_results, method="MAD")
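
The internals of `analyze_outliers_dataframe` are not shown in this notebook. A common formulation of the MAD method, sketched here under that caveat, scores each value by its modified z-score, 0.6745 * (x - median) / MAD, and flags values whose absolute score exceeds a threshold such as 3.5. The helper name `mad_outlier_mask` and the toy MPG values are hypothetical.

```python
import numpy as np

def mad_outlier_mask(values, threshold=3.5):
    """Return a boolean mask of outliers based on the modified z-score.

    Hypothetical helper for illustration; not the project's implementation.
    """
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))  # median absolute deviation
    if mad == 0:
        # At least half the values equal the median, so the score is undefined.
        # This sketch flags nothing in that case.
        return np.zeros(x.shape, dtype=bool)
    # 0.6745 rescales the MAD to be comparable to a standard deviation
    # under normality.
    modified_z = 0.6745 * (x - median) / mad
    return np.abs(modified_z) > threshold

# Toy MPG values with one extreme entry (made up for this example).
mpg = np.array([22, 24, 25, 26, 27, 28, 30, 31, 120])
mask = mad_outlier_mask(mpg)
print(mpg[mask])  # -> [120]
```

Because both the median and the MAD ignore the magnitude of extreme values, the single entry of 120 does not inflate the scale estimate the way it would inflate a standard deviation.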


## 5. Feature Types


In [None]:
numerical_cols, categorical_cols = get_feature_columns(df, target_col='comb08')
print_feature_types(numerical_cols, categorical_cols)


## 6. Categorical Features


In [None]:
key_cats = print_categorical_cardinality(df, categorical_cols, high_cardinality_threshold=50)

fig = plot_categorical_distributions(df, columns=key_cats, top_n=10, n_cols=3,
    save_path=str(project_root / 'results' / 'figures' / 'categorical_distributions_regression.png'))
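
For the high-cardinality columns flagged above, target encoding replaces each category with a (smoothed) mean of the target. The sketch below shows one simple smoothing scheme; the helper `smoothed_target_encode`, the `smoothing` parameter, and the toy `make` column are assumptions for illustration, not part of the project's `src` package.

```python
import pandas as pd

def smoothed_target_encode(df, cat_col, target_col, smoothing=10.0):
    """Encode a categorical column as a shrunken per-category target mean.

    Hypothetical helper: rare categories are pulled toward the global mean.
    """
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    # Weight approaches 1 for frequent categories and 0 for rare ones.
    weight = stats["count"] / (stats["count"] + smoothing)
    encoding = weight * stats["mean"] + (1 - weight) * global_mean
    return df[cat_col].map(encoding)

# Toy data (made up): category "B" has a single, extreme observation.
toy = pd.DataFrame({
    "make": ["A", "A", "A", "B"],
    "comb08": [20.0, 22.0, 24.0, 40.0],
})
toy["make_encoded"] = smoothed_target_encode(toy, "make", "comb08", smoothing=1.0)
print(toy)
```

In a real pipeline the encoding should be fitted on training folds only and then applied to validation data; computing it on the full frame, as in this toy example, leaks the target.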


## 7. Target vs Features


In [None]:
fig = plot_target_vs_numerical_features(df, target_col='comb08', feature_cols=numerical_cols[:6], n_cols=3,
    save_path=str(project_root / 'results' / 'figures' / 'target_vs_features_regression.png'))

fig = plot_target_vs_categorical_features(df, target_col='comb08', feature_cols=key_cats[:3],
    save_path=str(project_root / 'results' / 'figures' / 'target_by_categories_regression.png'))


## 8. Correlation Analysis


In [None]:
correlations = get_correlations_with_target(df, target_col='comb08', feature_cols=numerical_cols)
print("\n📊 Correlations with Target:")
print(correlations)

fig = plot_correlation_matrix(df, columns=numerical_cols + ['comb08'])
fig.savefig(project_root / 'results' / 'figures' / 'correlation_matrix_regression.png', dpi=300, bbox_inches='tight')

corr_matrix = df[numerical_cols].corr()
high_corr_pairs = find_high_correlation_pairs(corr_matrix, numerical_cols, threshold=0.8)
print_high_correlation_pairs(high_corr_pairs)
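
The implementation of `find_high_correlation_pairs` lives in `src.features` and is not shown here. As a rough sketch of the idea, the code below scans the upper triangle of a correlation matrix and keeps pairs whose absolute Pearson correlation exceeds the threshold. The function name `high_correlation_pairs` and the toy columns are hypothetical.

```python
import pandas as pd

def high_correlation_pairs(frame, threshold=0.8):
    """List (col_a, col_b, r) for column pairs with |r| above the threshold.

    Hypothetical helper for illustration only.
    """
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        # Start at i + 1 so each pair is reported once and the diagonal is skipped.
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Toy data (made up): engine displacement and cylinder count move together.
toy = pd.DataFrame({
    "displ": [1.5, 2.0, 2.5, 3.0, 3.5, 5.0],
    "cylinders": [4, 4, 4, 6, 6, 8],
    "year": [2015, 2020, 2016, 2019, 2017, 2018],
})
pairs = high_correlation_pairs(toy, threshold=0.8)
print(pairs)
```

Pairs reported this way are candidates for dropping one feature or for regularization, since nearly collinear predictors make ordinary least squares coefficients unstable.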


## 9. Save Processed Data


In [None]:
output_path = project_root / 'data' / 'processed' / 'epa_fuel_economy.csv'
output_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(output_path, index=False)

print_save_confirmation(str(output_path), df.shape, "comb08 (MPG)", len(numerical_cols), len(categorical_cols))


## 10. Key Takeaways

### 🎯 Challenges Identified
1. **Outliers** (~10%): Use Huber Regressor
2. **High-cardinality categoricals**: Use target encoding
3. **Multicollinearity**: Use Ridge/Lasso

### 📊 Modeling Strategy
1. **Baseline**: Linear Regression (OLS)
2. **Robust**: Huber Regressor (handles outliers)
3. **Regularized**: Ridge (L2) for multicollinearity
4. **Ensemble**: Random Forest (expected best performer, to be confirmed in `04_regression.ipynb`)
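
The four-model strategy above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, assuming default or arbitrary hyperparameters; the data-generating process (one nearly duplicated feature, roughly 10% shifted targets) is invented to mimic the challenges listed, and none of the numbers it prints say anything about the EPA dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import HuberRegressor, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (NOT the EPA data): 3 features, one nearly collinear pair.
rng = np.random.default_rng(42)
n = 600
X = rng.normal(size=(n, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=n)  # near-duplicate of feature 0
y = 30 - 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Contaminate about 10% of the targets with large positive shifts.
outliers = rng.choice(n, size=n // 10, replace=False)
y[outliers] += rng.normal(loc=40, scale=5, size=outliers.size)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Hyperparameters are placeholders, not tuned values.
models = {
    "ols": LinearRegression(),
    "huber": HuberRegressor(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, mae in scores.items():
    print(f"{name:>14}: MAE = {mae:.2f}")
```

No ranking should be read into this toy run; the comparison on the actual EPA features belongs in `04_regression.ipynb`.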

---

**✅ EDA complete → Ready for `04_regression.ipynb`**
