# Data Preprocessing & Feature Engineering

**Objective:** Transform raw housing data into a clean, reproducible format for machine learning. This notebook implements the core preprocessing pipeline including missing value imputation, outlier handling, and categorical encoding.

#### 1. Modular Imports and Data Loading

We start by importing our custom logic from the `src` directory. Keeping the cleaning functions in a separate Python module makes the pipeline reproducible and reusable outside this notebook.

In [2]:
import sys; sys.path.append("..")
import pandas as pd
from src.data_processing import preprocess_data, handle_outliers

df = pd.read_csv('../data/raw/housing.csv')

#### 2. The Transformation Pipeline

We apply the `preprocess_data` function to standardize the dataset.
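* **Target Standardization:** `median_house_value` is renamed to `Price`.
* **Missing Values:** Missing values in numerical columns are imputed using the chosen `fill_strategy` (the median here).
* **Feature Engineering:** Two ratio features are created: `Rooms_Per_Household` and `Bedrooms_Per_Room`.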

**Justification for Decision:**
We used the median strategy to fill missing values in `total_bedrooms`. In housing data, total bedroom counts are often skewed by extremely large apartment complexes or blocks. The median is more robust to these outliers than the mean, so it gives a more representative "typical" value for a district.
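To make the robustness argument concrete, here is a small illustration on made-up numbers (the values are hypothetical and not taken from `housing.csv`). A single very large district pulls the mean far away from the bulk of the data, while the median barely moves.

```python
import pandas as pd

# Hypothetical bedroom counts for six districts; the last value stands in
# for a district dominated by a very large apartment complex.
total_bedrooms = pd.Series([280, 310, 350, 400, 435, 6200])

mean_value = total_bedrooms.mean()      # pulled upward by the single extreme value
median_value = total_bedrooms.median()  # midpoint of the two central values

print(f"mean:   {mean_value:.1f}")
print(f"median: {median_value:.1f}")
```

In this toy series the mean is larger than five of the six observations, whereas the median sits between the two central values, which is closer to what one would want as a fill value.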

In [3]:
# Transformation Pipeline
df_clean = preprocess_data(df, fill_strategy='median')
# Justification: Median was used for total_bedrooms because the
# distribution is skewed; mean would be affected by extreme values.
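
The body of `preprocess_data` lives in `src/data_processing.py` and is not shown in this notebook. As a rough sketch of the three steps listed above, the function could look something like the following. The name `preprocess_sketch`, the handling of `fill_strategy`, and the formula for `Bedrooms_Per_Room` (bedrooms divided by rooms) are assumptions made for illustration; the toy rows reuse a few values from the output shown later, with one bedroom count blanked out on purpose.

```python
import numpy as np
import pandas as pd


def preprocess_sketch(df: pd.DataFrame, fill_strategy: str = "median") -> pd.DataFrame:
    """Hypothetical stand-in for src.data_processing.preprocess_data."""
    out = df.copy()

    # 1. Rename the target column.
    out = out.rename(columns={"median_house_value": "Price"})

    # 2. Impute missing values in numerical columns only.
    numeric_cols = out.select_dtypes(include="number").columns
    if fill_strategy == "median":
        fill_values = out[numeric_cols].median()
    elif fill_strategy == "mean":
        fill_values = out[numeric_cols].mean()
    else:
        raise ValueError(f"Unsupported fill_strategy: {fill_strategy!r}")
    out[numeric_cols] = out[numeric_cols].fillna(fill_values)

    # 3. Ratio features (the Bedrooms_Per_Room formula is an assumption).
    out["Rooms_Per_Household"] = out["total_rooms"] / out["households"]
    out["Bedrooms_Per_Room"] = out["total_bedrooms"] / out["total_rooms"]
    return out


# Toy frame: one bedroom count is deliberately set to NaN for illustration.
toy = pd.DataFrame({
    "total_rooms": [1467.0, 1274.0, 1627.0],
    "total_bedrooms": [190.0, np.nan, 280.0],
    "households": [177.0, 219.0, 259.0],
    "median_house_value": [352100.0, 341300.0, 342200.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "NEAR BAY"],
})
result = preprocess_sketch(toy)
print(result[["Price", "total_bedrooms", "Rooms_Per_Household", "Bedrooms_Per_Room"]])
```

Imputing before building the ratio features matters in this sketch: a missing `total_bedrooms` value would otherwise propagate into `Bedrooms_Per_Room` as `NaN`.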

#### 3. Outlier Detection and Removal

Using the **Interquartile Range (IQR)** method, we address outliers in key numerical features. We target features that show high variance: `Price`, `median_income`, `total_rooms`, `total_bedrooms`, `population`, and `households`.

Removing values that lie more than $1.5 \times \text{IQR}$ below the first quartile or above the third quartile helps ensure that our linear models are not skewed by extreme geographical anomalies or data entry errors.

In [4]:
df_no_outliers = handle_outliers(df_clean, ['Price', 'median_income', 'total_rooms', 'total_bedrooms', 'population', 'households'])
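
The implementation of `handle_outliers` is not shown here. A minimal sketch of IQR-based row removal, assuming that bounds are computed once per column on the full frame and that a row is dropped if it falls outside the bounds in any listed column, might look like this. The function name `remove_outliers_iqr` and the parameter `k` are hypothetical.

```python
import pandas as pd


def remove_outliers_iqr(df: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.DataFrame:
    """Hypothetical sketch: keep rows inside [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        # between() is inclusive on both ends; NaN values evaluate to False,
        # so rows with missing values in `col` would also be dropped.
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


# Toy data: 15.0 is far above the rest and should be filtered out.
toy = pd.DataFrame({"median_income": [2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 15.0]})
filtered = remove_outliers_iqr(toy, ["median_income"])
print(filtered)
```

A filter of this kind keeps the original index labels of the surviving rows, so gaps in the index of the result indicate which rows were removed.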

#### 4. Categorical Encoding and Data Export

Finally, we convert the categorical `ocean_proximity` feature into numerical format using **One-Hot Encoding**. We use `drop_first=True` to avoid the "dummy variable trap" (perfect multicollinearity between the full set of dummy columns and the intercept), which is essential for stable **Linear Regression** coefficient estimates.

The processed, immutable data is then saved to the processed data folder.

In [5]:
# Type Conversion: One-Hot Encoding
df_final = pd.get_dummies(df_no_outliers, columns=['ocean_proximity'], drop_first=True)

df_final.to_csv('../data/processed/cleaned_housing.csv', index=False)
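
To illustrate what `drop_first=True` changes, the toy example below encodes a small hypothetical column with and without it. The category labels are placeholders chosen for the example; `dtype=int` is used only to make the printed output easier to read.

```python
import pandas as pd

# Hypothetical category values for illustration.
toy = pd.DataFrame({"ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR BAY", "INLAND"]})

full = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=int)
reduced = pd.get_dummies(toy, columns=["ocean_proximity"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())

# With every dummy kept, each row sums to exactly 1, which duplicates the
# information carried by a regression intercept.
print(full.sum(axis=1).tolist())
```

With the first level dropped, that level becomes the baseline: a row with zeros in all remaining dummy columns belongs to it, and each remaining coefficient is read relative to that baseline.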

#### 5. Final Preprocessing Verification

We check the state of the final dataframe to ensure all transformations were applied correctly and that no null values remain.
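
These checks can also be written as assertions, so that a rerun fails loudly instead of relying on someone reading printed output. The helper below is a hypothetical sketch, not part of `src`; the toy frame uses placeholder values for `Bedrooms_Per_Room`.

```python
import pandas as pd


def verify_processed(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise AssertionError if the processed frame looks wrong."""
    expected = {"Price", "Rooms_Per_Household", "Bedrooms_Per_Room"}
    missing = expected - set(df.columns)
    assert not missing, f"missing columns: {sorted(missing)}"
    assert "median_house_value" not in df.columns, "target column was not renamed"

    null_counts = df.isnull().sum()
    assert null_counts.sum() == 0, f"nulls remain:\n{null_counts[null_counts > 0]}"


# Toy frame with placeholder values; in the notebook this would be df_final.
toy = pd.DataFrame({
    "Price": [352100.0, 341300.0],
    "Rooms_Per_Household": [8.288136, 5.817352],
    "Bedrooms_Per_Room": [0.13, 0.18],
})
verify_processed(toy)  # passes silently when all checks hold
```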

In [6]:
# Display the first and last five rows
print("----------- PROCESSED HEAD -----------\n")
print(df_final.head(), "\n")
print("----------- PROCESSED TAIL -----------\n")
print(df_final.tail(), "\n")

# Check the new columns specifically
print("----------- NEW COLUMNS CREATED -----------\n")
print(df_final[['Price', 'Rooms_Per_Household', 'Bedrooms_Per_Room']].head())

# Check for missing values
print("----------- MISSING VALUES -----------\n")
print(df_final.isnull().sum(), "\n")

----------- PROCESSED HEAD -----------

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   
5    -122.25     37.85                52.0        919.0           213.0   
6    -122.25     37.84                52.0       2535.0           489.0   

   population  households  median_income     Price  Rooms_Per_Household  \
2       496.0       177.0         7.2574  352100.0             8.288136   
3       558.0       219.0         5.6431  341300.0             5.817352   
4       565.0       259.0         3.8462  342200.0             6.281853   
5       413.0       193.0         4.0368  269700.0             4.761658   
6      1094.0       514.0         3.6591  299200.0             4.931907   

   Bedrooms_Per_Room  ocean_proximity_INLAND  ocean_proxim