# 🏠 End-to-End Machine Learning Project
**Source: Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (Chapter 2)**

This notebook demonstrates a complete ML workflow to predict housing prices in California using the `housing.csv` dataset. It includes:
- Data download and loading
- Exploratory data analysis
- Data preprocessing (handling missing values, feature engineering, encoding)
- Train/test split with stratification
- Pipeline building using Scikit-Learn

Let's begin!

In [None]:
import os
import tarfile
import urllib.request
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
%matplotlib inline

**Theory**: Reproducibility matters in Machine Learning projects. Automating data retrieval and loading means every run starts from the same data, which keeps experiments consistent and repeatable. Here, we define functions to download and load the California housing dataset.

## 📥 Download and Load Data

In [None]:
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

fetch_housing_data()
housing = load_housing_data()
housing.head()

**Theory**: Initial exploration helps us understand the data’s shape, type, and quality. We use `info()`, `describe()`, and histograms to detect missing values, data types, value ranges, and possible outliers.

## 🔎 Explore the Data

In [None]:
housing.info()

In [None]:
housing.describe()

In [None]:
housing["ocean_proximity"].value_counts()

In [None]:
housing.hist(bins=50, figsize=(20, 15))
plt.show()
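
`info()` reports non-null counts, but it can be easier to read missing values directly. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from `housing.csv`); the same two calls should work unchanged on `housing`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up for illustration).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Fraction of rows that are missing in each column.
missing_fraction = toy.isna().mean()

print(missing_counts)
print(missing_fraction)
```

Columns with a non-zero count are the ones the imputation step later in the notebook has to handle.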

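**Theory**: A stratified split keeps the train and test sets representative of the overall dataset with respect to an important feature, here `median_income`. We bin the income into categories and sample from each category in proportion to its size. This reduces sampling bias and makes the test-set error a more trustworthy estimate of how well the model generalizes.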

## ✂️ Create Stratified Train/Test Split

In [None]:
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
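
To see what stratification buys, one can compare category proportions in a stratified test set and a purely random one. The sketch below uses synthetic, skewed income values (an assumption made so the example runs without the download); the bin edges mirror the ones used above, and the helper `proportions` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic, right-skewed "income" values (assumption: lognormal, not real data).
rng = np.random.default_rng(42)
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=2000)})
toy["income_cat"] = pd.cut(toy["median_income"],
                           bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                           labels=[1, 2, 3, 4, 5]).astype(int)

# Stratified test set: sampled per income category.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["income_cat"]))
strat_test = toy.iloc[test_idx]

# Purely random test set of the same size.
_, rand_test = train_test_split(toy, test_size=0.2, random_state=42)

def proportions(frame):
    """Share of rows falling in each income category."""
    return frame["income_cat"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "overall": proportions(toy),
    "stratified": proportions(strat_test),
    "random": proportions(rand_test),
})
comparison["strat_abs_err"] = (comparison["stratified"] - comparison["overall"]).abs()
comparison["rand_abs_err"] = (comparison["random"] - comparison["overall"]).abs()
print(comparison)
```

The stratified column should track the overall proportions almost exactly, while the random column is free to drift, especially for the smaller categories.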

**Theory**: Visualizing data geospatially (e.g., plotting longitude vs. latitude) allows us to spot regional patterns that may not be obvious in tabular form. Coloring by target variable and scaling by population size makes spatial trends more visible.

## 🧭 Geographical Visualization

In [None]:
housing = strat_train_set.copy()
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population", figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()
plt.show()

**Theory**: The Pearson correlation coefficient measures the strength of the linear relationship between two numerical features. It ranges from -1 to 1 and can miss nonlinear relationships entirely. Feature engineering creates new features (e.g., `rooms_per_household`) that may capture the underlying structure of the data better than the raw columns.

## 📊 Correlation and Feature Engineering

In [None]:
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

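# numeric_only=True skips the text column ocean_proximity, which recent
# pandas versions refuse to correlate.
corr_matrix = housing.corr(numeric_only=True)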
corr_matrix["median_house_value"].sort_values(ascending=False)
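
Because the Pearson coefficient only captures linear relationships, a low value does not mean two features are unrelated. A small sketch on synthetic data (the column names `linear` and `quadratic` are made up for illustration):

```python
import numpy as np
import pandas as pd

# x is symmetric around zero, so x and x**2 have no linear relationship.
x = np.linspace(-1.0, 1.0, 201)
toy = pd.DataFrame({
    "x": x,
    "linear": 3.0 * x + 1.0,  # perfectly linear in x
    "quadratic": x ** 2,      # fully determined by x, but not linearly
})

corr = toy.corr()
print(corr["x"])
```

Here `linear` correlates perfectly with `x`, while `quadratic` has a coefficient of essentially zero even though it is a deterministic function of `x`. Scatter plots are a useful complement to the correlation matrix for this reason.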

**Theory**: A preprocessing pipeline automates and structures the transformation of raw data. Scikit-Learn pipelines chain steps such as imputation, feature scaling, and encoding, so exactly the same transformations are applied at training and prediction time. Preprocessing options can then be tuned like any other hyperparameter.

## 🧹 Data Preprocessing Pipeline

In [None]:
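# Start again from the clean training set so the exploratory ratio columns
# added above are not duplicated by CombinedAttributesAdder.
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()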

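# Column positions in the numeric array; they assume the housing.csv column order.
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Append ratio features to a NumPy array of the numeric columns."""
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing to learn from the data
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]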

In [None]:
housing_num = housing.drop("ocean_proximity", axis=1)

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared.shape
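
Consistency between training and prediction comes from fitting the pipeline once on the training data and then calling only `transform()` on new data. The sketch below illustrates this on a tiny made-up dataset (the column names `rooms` and `proximity` are placeholders). It also sets `handle_unknown="ignore"` and `sparse_threshold=0`, which are choices made for this illustration and are not used in the notebook's pipeline above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up train and test sets.
train = pd.DataFrame({
    "rooms": [4.0, 6.0, np.nan, 8.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})
test = pd.DataFrame({
    "rooms": [np.nan, 10.0],
    "proximity": ["INLAND", "ISLAND"],  # "ISLAND" never appears in training
})

num_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
prep = ColumnTransformer(
    [
        ("num", num_steps, ["rooms"]),
        # Unknown categories become an all-zero row instead of raising an error.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["proximity"]),
    ],
    sparse_threshold=0,  # always return a dense array, for easy inspection
)

# Learn median, mean, standard deviation, and categories from the training set only.
train_prepared = prep.fit_transform(train)

# Reuse those learned statistics on the test set; nothing is refit here.
test_prepared = prep.transform(test)
print(test_prepared)
```

The missing `rooms` value in the test set is filled with the training median, so it scales to zero, and the unseen category is encoded as all zeros. Calling `fit_transform()` on the test set instead would silently recompute these statistics from test data.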