# 2. Data Preprocessing (California Housing)

This notebook follows the class notes on data preprocessing and applies them to the **California Housing** dataset. We will:
- Inspect data format
- Identify and correct erroneous values
- Detect and treat outliers
- Normalize numerical attributes
- Disaggregate categorical variables (after creating a categorical feature)
- Categorize numerical variables

All preprocessing steps are explicit, justified, and documented.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

In [None]:
# Load California housing dataset
cal = fetch_california_housing(as_frame=True)
cal_data = cal.frame.copy()
cal_data.head()

## Create a Dirty Version of the Dataset

To practice preprocessing, we will intentionally introduce **erroneous values** (e.g., "NA") and **outliers** into a copy of the dataset. We keep the original data intact.

In [None]:
# Make a dirty copy (deterministic for reproducibility)
cal_dirty = cal_data.copy()

rng = np.random.default_rng(42)

# Identify candidate columns
num_cols = cal_dirty.select_dtypes(include='number').columns.tolist()

# Erroneous value tokens (common in real datasets)
err_tokens = ["NA", "N/A", "?", "", " ", "nan", "NaN", "None", "null", "NULL", "-"]

# 1) Introduce erroneous values across multiple numeric columns
if num_cols:
    n_err = max(1, int(0.02 * len(cal_dirty)))  # 2% of rows
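    for col in num_cols[:3]:  # affect first 3 numeric columns
        rows = rng.choice(cal_dirty.index, size=n_err, replace=False)
        # Cast to object first: newer pandas versions warn or raise when
        # strings are written into a float column
        cal_dirty[col] = cal_dirty[col].astype(object)
        cal_dirty.loc[rows, col] = rng.choice(err_tokens, size=n_err)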

# 2) Introduce missing values (np.nan) in more numeric columns
if num_cols:
    n_miss = max(1, int(0.015 * len(cal_dirty)))  # 1.5% of rows
    for col in num_cols[3:6]:  # next 3 numeric columns (if any)
        rows = rng.choice(cal_dirty.index, size=n_miss, replace=False)
        cal_dirty.loc[rows, col] = np.nan

# 3) Introduce outliers in multiple numeric columns
if num_cols:
    n_out = max(1, int(0.005 * len(cal_dirty)))  # 0.5% of rows
    for col in num_cols[:4]:  # outliers in first 4 numeric columns
        rows = rng.choice(cal_dirty.index, size=n_out, replace=False)
        numeric_series = pd.to_numeric(cal_dirty[col], errors='coerce')
        max_val = numeric_series.max()
        cal_dirty.loc[rows, col] = max_val * rng.integers(20, 60, size=n_out)

# Quick check (string tokens such as "NA" are not counted by isna() until they are cleaned)
print("Original missing values:")
print(cal_data.isna().sum())
print("Dirty missing values (after injection):")
print(cal_dirty.isna().sum())

cal_dirty.head()

## 1. Data Preprocessing

Data preprocessing is the set of transformations required to make data suitable for machine learning. It is necessary but must be done carefully to avoid introducing bias or errors.

Key principle: **all transformations must be explicit, justified, and documented**.

## 2. Format

A dataset is a collection of data points (rows) described by attributes (columns). Attributes may be:
- Categorical (finite set of values)
- Numerical (discrete or continuous values)
- Text (strings)
- Other (images, sounds, dates, etc.)

We first inspect the dataset format and types.

In [None]:
# Basic inspection
print(cal_data.shape)
display(cal_data.head())

# Data types and missing values overview
cal_data.info()

In [None]:
# Separate columns by type
num_cols = cal_data.select_dtypes(include='number').columns.tolist()
cat_cols = cal_data.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()

print("Numerical columns:", num_cols)
print("Categorical/Text columns:", cat_cols)

### 2.2 Data Format considerations

We need to consider:
- Character encoding (e.g., special characters like ñ)
- Numbers (decimal/thousand separators, units)
- Categorical values (consistent spelling/casing)
- Erroneous values (tokens that represent missing values)
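
As a rough illustration of these considerations, the sketch below uses small made-up values (not taken from the California Housing data) to show a mis-decoded character, numbers written with a decimal comma, and inconsistent category casing. All variable names and values are illustrative assumptions.

```python
import pandas as pd

# Character encoding: UTF-8 bytes read as Latin-1 produce garbled text (mojibake)
garbled = "ñ".encode("utf-8").decode("latin-1")
repaired = garbled.encode("latin-1").decode("utf-8")  # reverse the wrong decoding

# Numbers: toy values that use "." for thousands and "," for decimals
raw_numbers = pd.Series(["1.234,5", "30,0", "7"])
parsed = pd.to_numeric(
    raw_numbers.str.replace(".", "", regex=False).str.replace(",", ".", regex=False)
)

# Categorical values: unify whitespace and casing before comparing categories
raw_cities = pd.Series([" Madrid", "madrid", "MADRID "])
cities = raw_cities.str.strip().str.lower()

print(repr(garbled), repr(repaired))
print(parsed.tolist())
print(cities.unique().tolist())
```

The separator handling above assumes every value in the column follows the same convention; mixed conventions would need to be resolved per value.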

## 3. Erroneous Values

Erroneous values are values that are invalid for an attribute; they are not the same as outliers. Typical causes:
- Missing values
- Incorrect format (e.g., "30,0" instead of "30.0")
- Measurement errors
- Encoding errors
- Nonexistent categories

We first detect common missing-value tokens and format issues.

In [None]:
# Common missing-value tokens observed in real datasets
missing_tokens = ["NA", "N/A", "?", "", " ", "nan", "NaN", "None", "null", "NULL", "-"]

# Replace tokens with NaN for consistent handling
clean = cal_dirty.replace(missing_tokens, np.nan)

# Coerce numeric columns to numeric (invalid strings become NaN)
for col in num_cols:
    clean[col] = pd.to_numeric(clean[col], errors='coerce')

missing_counts = clean.isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0].head(15)

### 3.5 Correction strategies

Typical strategies:
- Remove rows or columns with erroneous values
- Correct values when the fix is unambiguous
- Impute missing values (mean, median, mode, fixed value, or a model)

Below we show a simple imputation strategy.
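
To make the imputation option concrete, here is a minimal sketch on a made-up series (the values are assumptions chosen for illustration). It compares mean and median imputation when an extreme value is present.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and one extreme value (illustrative data)
s = pd.Series([1.0, 2.0, 3.0, np.nan, 100.0])

mean_filled = s.fillna(s.mean())      # mean of the observed values
median_filled = s.fillna(s.median())  # median of the observed values

print(mean_filled.tolist())
print(median_filled.tolist())
```

The median is pulled far less by the extreme value than the mean, which is one reason it is often preferred when outliers have not been treated yet.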

In [None]:
# Simple imputation example
imputed = clean.copy()

# Numerical: median (robust to outliers)
for col in num_cols:
    if imputed[col].isna().any():
        # Your code here (use the median() and fillna() methods)
        imputed[col] = ...

imputed.isna().sum().sort_values(ascending=False).head(10)

## 4. Outliers

Outliers are values that deviate significantly from the distribution. They can strongly affect models.

Detection is subjective and context-dependent. Common rules:
- More than N standard deviations from the mean (e.g., N=3)
- Above percentile P or below percentile 100 - P (e.g., P=95)
- Low probability under the distribution (e.g., p < 0.01)
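
The first two rules can be sketched on toy data as follows. The series is made up for illustration, and the thresholds (N=3, P=95) simply reuse the example values from the list above.

```python
import numpy as np
import pandas as pd

# Toy data: 99 evenly spaced values in [-1, 1] plus one extreme value
x = pd.Series(np.append(np.linspace(-1, 1, 99), 50.0))

# Rule 1: more than N standard deviations from the mean (N = 3)
z = (x - x.mean()) / x.std()
flag_z = x[z.abs() > 3]

# Rule 2: above percentile P or below percentile 100 - P (P = 95)
low, high = x.quantile(0.05), x.quantile(0.95)
flag_pct = x[(x < low) | (x > high)]

print(len(flag_z), len(flag_pct))
```

In this sketch the standard-deviation rule flags only the extreme value, while the percentile rule flags a fixed share of the rows by construction, whether or not they are unusual. This is one reason the choice of rule is context-dependent.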

In [None]:
# Choose a numerical column to illustrate outlier detection
num_col = num_cols[0]

# Z-score and IQR detection for the chosen column
x = imputed[num_col]

# Z-score method (standardize and identify outliers as samples with |z| > 3)
outliers_z = ...

# IQR method
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers_iqr = imputed[(x < lower) | (x > upper)]

print(f"Column: {num_col}")
print("Outliers (|z| > 3):", len(outliers_z))
print("Outliers (IQR rule):", len(outliers_iqr))

outliers_iqr[[num_col]].head()

In [None]:
# Optional visualization with two numerical features
plt.figure(figsize=(6,4))
plt.scatter(imputed[num_cols[0]], imputed[num_cols[1]], alpha=0.5)
plt.xlabel(num_cols[0])
plt.ylabel(num_cols[1])
plt.title("Scatter plot (possible outliers)")
plt.show()

### 4.3 Treatment

Outlier treatment is often similar to the treatment of erroneous values:
- Remove
- Correct (if clearly wrong)
- Impute or cap (winsorization)
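
As a minimal sketch of the capping option, the example below clips a made-up column at the IQR fences with `Series.clip`. Using the IQR fences as limits is an assumption made for illustration; percentile-based limits are another common choice.

```python
import pandas as pd

# Toy column with one extreme value (illustrative data)
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])

# IQR fences: 1.5 * IQR below Q1 and above Q3
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values outside the fences instead of dropping the rows
capped = s.clip(lower=lower, upper=upper)

print(upper, capped.max())
```

Capping keeps every row, which matters when the other attributes of the affected rows are still informative.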

## 5. Normalization

Normalization transforms numerical data to satisfy specific properties:
- Range (e.g., [0, 1])
- Unit (consistent measurement units)
- Scale (attributes comparable for distance-based models)

We use two common techniques: **standardization** (zero mean, unit standard deviation) and min-max **scaling** (values mapped to [0, 1]).
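
Both techniques can be sketched on a single made-up column (the values and variable names are illustrative assumptions):

```python
import pandas as pd

x = pd.Series([2.0, 4.0, 6.0, 8.0])  # illustrative values

# Standardization: subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()

# Min-max scaling: map the minimum to 0 and the maximum to 1
scaled_x = (x - x.min()) / (x.max() - x.min())

print(standardized.round(3).tolist())
print(scaled_x.round(3).tolist())
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, whereas scikit-learn's `StandardScaler` uses the population standard deviation (`ddof=0`), so the two can give slightly different results on small datasets.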

In [None]:
# Standardization (mean 0, std 1) and Min-Max scaling [0,1]
# HINT: Use the formula (x - mean) / std for standardization and (x - min) / (max - min) for scaling
normalized = imputed.copy()

# Standardization
normalized[num_cols] = ...

# Scaling
scaled = imputed.copy()
scaled[num_cols] = ...

normalized[num_cols].head()

## 6. Disaggregation

Disaggregation converts categorical attributes into numerical features so they can be used by ML algorithms.

The California Housing dataset is fully numeric, so we will create a categorical feature for demonstration (e.g., binning `MedInc`).

In [None]:
# Create a categorical feature from a numeric one
cal_cat = imputed.copy()
cal_cat["MedInc_bin"] = pd.qcut(cal_cat["MedInc"], q=4, labels=["low", "mid", "high", "very_high"])

# Label encoding (pd.qcut returns an ordered categorical, so codes follow low < mid < high < very_high)
cal_cat["MedInc_label"] = cal_cat["MedInc_bin"].cat.codes
cal_cat[["MedInc_bin", "MedInc_label"]].head()

In [None]:
# One-hot encoding
medinc_onehot = pd.get_dummies(cal_cat["MedInc_bin"], prefix="MedInc")
medinc_onehot.head()

## 7. Categorization

Categorization converts numerical attributes into categorical ones. This is subjective and context-dependent.

Example: assigning `HouseAge` values to bins using the [pd.cut()](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) function.
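
A minimal sketch of `pd.cut` on made-up ages is shown below. The bin edges and labels are hypothetical and chosen only to illustrate the call; suitable cut points depend on the context.

```python
import pandas as pd

ages = pd.Series([3, 15, 27, 52])  # illustrative house ages in years

# Hypothetical bin edges and labels; intervals are right-inclusive by default:
# (0, 10], (10, 30], (30, 60]
age_groups = pd.cut(ages, bins=[0, 10, 30, 60], labels=["new", "mid", "old"])

print(age_groups.tolist())
```

Values that fall outside the outermost bin edges become `NaN`, so the edges should cover the full range of the column.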

In [None]:
# Categorization example
categorized = imputed.copy()

categorized["HouseAge_group"] = pd.cut(...)

categorized[["HouseAge", "HouseAge_group"]].head()

## 8. Conclusions

- The dataset is a fundamental part of the ML process.
- Preprocessing is required and must be explicit, justified, and documented.
- Error correction includes removing, correcting, or imputing values.
- Outlier treatment consists of detecting outliers and then removing, correcting, imputing, or capping them.
- Normalization uses standardization or min-max scaling.
- Disaggregation uses label encoding or one-hot encoding.
- Categorization converts numerical values into categories.