# Housing Price Prediction

**Goal:** Predict house prices (`SalePrice`) from property features using a leakage-safe preprocessing + modeling pipeline, **cross-validated** model selection, and interpretability + error analysis.

## Research question

This project investigates **which property features most strongly drive house prices** and how accurately we can predict the target (`SalePrice`) using supervised learning.

Key questions:
- Which features contribute the most to predictive performance?
- How do linear vs. non-linear models compare under cross-validation?
- What are the main error modes on a held-out test set?
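The linear vs. non-linear comparison can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual results: the column names (`area`, `quality`, `zone`), the helper `make_pipeline`, and the model settings are assumptions made for the example. Because imputation, scaling, and encoding live inside the `Pipeline`, they are refit on the training part of each fold, which is what keeps the comparison leakage-safe.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; these column names are hypothetical,
# not taken from the real dataset.
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "area": rng.normal(1500, 300, n),
    "quality": rng.integers(1, 10, n).astype(float),
    "zone": rng.choice(["A", "B", "C"], n),
})
demo.loc[::25, "area"] = np.nan  # inject a few missing values
target = (100 * demo["area"].fillna(1500)
          + 10_000 * demo["quality"]
          + rng.normal(0, 5_000, n))

numeric = ["area", "quality"]
categorical = ["zone"]


def make_pipeline(model):
    """Bundle preprocessing and model so both are fit per CV fold."""
    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
         categorical),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])


cv = KFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "ridge": Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Mean cross-validated RMSE per model (lower is better).
cv_rmse = {}
for name, model in models.items():
    scores = cross_validate(make_pipeline(model), demo, target, cv=cv,
                            scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores["test_score"].mean()

print(cv_rmse)
```

Both models see the same folds, so the RMSE values are directly comparable.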

🎯 This project uses the full `ML_Houses_dataset.csv` dataset consisting of 84 features.
The dataset can be accessed [here](https://d32aokrjazspmn.cloudfront.net/materials/ML_Houses_dataset.csv).
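A minimal loading sketch, assuming the file is a plain comma-separated CSV that `pandas.read_csv` can read directly from the URL. The helper name `load_houses` and the tiny in-memory sample (including its `area` column and values) are hypothetical and only there so the example runs offline.

```python
import io

import pandas as pd

DATA_URL = "https://d32aokrjazspmn.cloudfront.net/materials/ML_Houses_dataset.csv"


def load_houses(source=DATA_URL):
    """Read the CSV from a URL, path, or file-like object and report its shape."""
    df = pd.read_csv(source)
    print(f"{df.shape[0]} rows x {df.shape[1]} columns")
    return df


# Offline illustration with made-up values; call load_houses() with no
# argument to fetch the real file instead.
sample = io.StringIO("SalePrice,area\n200000,1500\n150000,1100\n")
sample_df = load_houses(sample)
```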

Within the scope of this project, the following steps will be performed:

- Exploring and understanding all features
- Applying appropriate preprocessing and encoding techniques
- Performing feature engineering to create new meaningful variables
- Incorporating engineered features into the model
- Applying feature selection methods to evaluate and improve model performance
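The feature engineering and feature selection steps might look roughly like this. It is a sketch on synthetic data under stated assumptions: the columns `first_floor`, `second_floor`, and `noise`, the engineered `total_area`, and the "keep features with positive importance" rule are all hypothetical choices for illustration, not the method used later in the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical column names.
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "first_floor": rng.uniform(500, 1500, n),
    "second_floor": rng.uniform(0, 1000, n),
    "noise": rng.normal(size=n),  # deliberately uninformative
})
y = 120 * (toy["first_floor"] + toy["second_floor"]) + rng.normal(0, 5_000, n)

# Feature engineering: a row-wise combination, so no information
# crosses the train/test boundary.
toy["total_area"] = toy["first_floor"] + toy["second_floor"]

X_train, X_test, y_train, y_test = train_test_split(
    toy, y, test_size=0.25, random_state=42
)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Feature selection signal: how much the held-out score drops
# when each column is shuffled.
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=42
)
ranking = pd.Series(result.importances_mean, index=toy.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)

# Hypothetical selection rule: keep features with positive mean importance.
selected = ranking[ranking > 0].index.tolist()
```

Measuring importance on the held-out split rather than the training split keeps the ranking from rewarding features the model has merely memorized.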

## 0. Setup

- Reproducibility with a fixed random seed
- Core libraries: pandas, numpy, scikit-learn, matplotlib, seaborn

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, KFold, cross_validate
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
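
Seeding NumPy's global generator covers code that draws from `np.random` directly, but passing the seed explicitly to each scikit-learn splitter and estimator tends to be more robust. A small sketch of that idea, using made-up arrays (`X_demo`, `y_demo`) rather than the housing data:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

RANDOM_STATE = 42

# Made-up arrays, only to demonstrate repeatable splitting.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state produce identical splits.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=RANDOM_STATE)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=RANDOM_STATE)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# A shuffled KFold with a fixed seed yields the same folds on every run.
cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
folds = [test_idx.tolist() for _, test_idx in cv.split(X_demo)]

print(same_split)
print(folds)
```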