# Preprocessing and Baseline Modeling

This notebook implements a reproducible preprocessing pipeline and establishes a baseline model
using insights derived from the EDA phase.

## Data Loading

In [1]:
import pandas as pd
from pathlib import Path

DATA_DIR = Path("../data/raw")

train_df = pd.read_csv(DATA_DIR / "train.csv")
test_df = pd.read_csv(DATA_DIR / "test.csv")

train_df.shape, test_df.shape

((1460, 81), (1459, 80))

## Problem Setup

- Task: Supervised regression
- Target variable: SalePrice
- Evaluation metric: RMSE
- Validation strategy: Hold-out split (80/20)
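
For reference, RMSE is the square root of the mean squared residual, so it is expressed in the same units as the target. A minimal sketch on made-up numbers follows; the helper name `rmse_score` and the toy values are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def rmse_score(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, not taken from the dataset: residuals are -10, 10, -20.
example_rmse = rmse_score([100.0, 200.0, 300.0], [110.0, 190.0, 320.0])
```

Because the residuals are squared before averaging, a few large misses on expensive houses can dominate the score.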

In [2]:
TARGET = "SalePrice"
X = train_df.drop(columns=[TARGET])
y = train_df[TARGET]

X.shape, y.shape

((1460, 80), (1460,))

## Train–Validation Split

We split the training data into train and validation sets to evaluate models before final training.

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_val.shape, y_train.shape, y_val.shape

((1168, 80), (292, 80), (1168,), (292,))

## Feature Groups

Separate numerical and categorical features for preprocessing.

In [4]:
numerical_features = X_train.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X_train.select_dtypes(include=["object"]).columns

len(numerical_features), len(categorical_features)

(37, 43)
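
Selecting by explicit dtype names can silently skip columns stored under other dtypes (for example `int32` or `bool`). The sketch below, on a hypothetical toy frame, shows one way to check that every column lands in exactly one group; the column values and the `uncovered` variable are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for X_train.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, None, 68.0],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Split on "numeric or not" so that no dtype falls between the two groups.
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Columns that belong to neither group; should be empty.
uncovered = set(toy.columns) - set(num_cols) - set(cat_cols)
```

In this notebook the two counts already sum to the 80 feature columns, so the check would pass.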

## Numerical Feature Preprocessing

- Median imputation for missing values
- `log1p` transformation to reduce skew (the pipeline below applies it to every numerical column, not only the skewed ones)
- Standard scaling

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, StandardScaler
import numpy as np

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p, feature_names_out="one-to-one")),
    ("scaler", StandardScaler())
])
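
If the log transform should be limited to skewed columns, one option is to pick them by sample skewness first. The following is a rough sketch on synthetic data, not the notebook's method; the `SKEW_THRESHOLD` value of 0.75 and the toy column names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the numerical training columns.
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=50.0, scale=5.0, size=500),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=500),
})

SKEW_THRESHOLD = 0.75  # assumed cutoff; tune on the training split only

# Sample skewness per column; keep only clearly right-skewed ones.
skewness = toy.skew()
skewed_cols = skewness[skewness > SKEW_THRESHOLD].index.tolist()

# log1p is only defined for values > -1, so check the selected columns.
nonnegative = bool((toy[skewed_cols] >= 0).all().all())
```

The selected list could then be passed to its own `ColumnTransformer` entry, with the remaining numerical columns going through imputation and scaling only.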

## Categorical Feature Preprocessing

- Constant imputation for missing values, filled with the string "None"
- Ordinal encoding for categorical features; categories not seen during fitting are encoded as -1

In [7]:
from sklearn.preprocessing import OrdinalEncoder

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="None")),
    ("encoder", OrdinalEncoder(
        handle_unknown="use_encoded_value",
        unknown_value=-1
    ))
])

## Full Preprocessing Pipeline

Combine numerical and categorical preprocessing using ColumnTransformer.

In [8]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

## Baseline Model

Train a Ridge Regression model using the full preprocessing pipeline.

In [9]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

In [10]:
baseline_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", Ridge(alpha=1.0))
])

In [12]:
baseline_model.fit(X_train, y_train)
y_val_pred = baseline_model.predict(X_val)

rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
rmse

np.float64(37899.36205174425)

## Baseline Model with Log-Transformed Target

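Train the same pipeline on log1p(SalePrice) to reduce the effect of the skewed target, then map predictions back with expm1 so that RMSE is still measured in the original price units.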

In [13]:
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)

baseline_model.fit(X_train, y_train_log)

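# Predict on the log scale, then map back to prices with expm1.
y_val_pred_log = baseline_model.predict(X_val)
y_val_pred = np.expm1(y_val_pred_log)

# Despite its name, rmse_log is measured in original price units.
rmse_log = np.sqrt(mean_squared_error(y_val, y_val_pred))
rmse_log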

np.float64(29434.02185491127)
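
The manual transform and back-transform can also be delegated to scikit-learn's `TransformedTargetRegressor`, which applies `func` to the target before fitting and `inverse_func` to predictions. This is an alternative sketch on synthetic data, not what the notebook does; the toy features, coefficients, and noise level are assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic data with a multiplicative (log-linear) target.
X_toy = rng.normal(size=(200, 3))
log_y = 10.0 + X_toy @ np.array([0.3, -0.2, 0.1]) + rng.normal(scale=0.05, size=200)
y_toy = np.exp(log_y)

# Ridge is fitted on log1p(y); predict() returns values on the original scale.
log_target_model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_target_model.fit(X_toy, y_toy)
preds = log_target_model.predict(X_toy)
```

In the notebook, the `regressor` argument could be the full `baseline_model` pipeline, which would remove the need to keep separate `y_train_log` and `y_val_log` variables.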

## Baseline Summary

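- Ridge regression on the raw target gave a validation RMSE of ≈ 37.9k
- Fitting on log1p(SalePrice) and back-transforming the predictions reduced validation RMSE to ≈ 29.4k, measured in the same price units
- The improvement is consistent with a right-skewed SalePrice distribution, though it rests on a single 80/20 hold-out split
- All future models will be trained on log1p(SalePrice)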