# Preprocessing Pipeline for Housing Price Prediction

This notebook performs complete preprocessing of the training and test datasets, including:

- Loading raw data  
- Separating numerical and categorical features  
- Handling missing values  
- Standardizing numerical variables  
- One-hot encoding categorical variables  
- Saving processed data for modeling  

The final outputs are:
- `X_processed.npz` – preprocessed feature matrix (training)
- `X_test_processed.npz` – preprocessed feature matrix (test)
- `y.csv` – target variable


### Import Dependencies

We first import the required Python libraries: pandas and NumPy for data manipulation, scikit-learn for the scaling and encoding transformers, and SciPy for saving sparse matrices to disk.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

from scipy import sparse

### Load Datasets

We load the training and test datasets from the `../data/` directory. Inspecting the shape and first few rows confirms that the files are correctly formatted and that the expected number of columns is present. It also gives a first look at missing values and data types before preprocessing.

In [2]:
# Load training and test datasets

train_df = pd.read_csv("../data/train.csv")
test_df = pd.read_csv("../data/test.csv")

print("Training data shape:", train_df.shape)
train_df.head()

Training data shape: (1460, 81)


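| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |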
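
Beyond eyeballing the first rows, missing values can be counted per column. The snippet below is a minimal sketch on a made-up three-column frame (`toy_df` and its values are illustrative, not taken from `train.csv`); the same calls should work on `train_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column and keep only columns that have any.
missing_counts = toy_df.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)

# Pair each count with the column dtype for a compact overview.
summary = pd.DataFrame({
    "missing": missing_counts,
    "dtype": toy_df.dtypes[missing_counts.index].astype(str),
})
print(summary)
```

Sorting by count puts the sparsest columns first, which makes it easier to decide whether a column should be imputed or dropped.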


### Identify Numeric & Categorical Features

Numeric and categorical variables need different preprocessing. Numeric features are usually scaled to comparable ranges, while categorical features must be encoded as numbers. Detecting feature types from the column dtypes lets the pipeline adapt to the dataset structure automatically. Note that dtype-based detection also classifies `Id` and integer-coded categories such as `MSSubClass` as numeric, so they are scaled rather than encoded.

In [3]:
# Identify numeric and categorical features

numeric_features = train_df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = train_df.select_dtypes(include=['object']).columns.tolist()

# Remove target variable from numeric list

numeric_features.remove("SalePrice")  # target column

print("Numeric features:", len(numeric_features))
print("Categorical features:", len(categorical_features))

Numeric features: 37
Categorical features: 43
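
As a variation on the cell above, the split can be written without listing specific dtypes. This is a sketch on a made-up frame (`toy_df` and the `target` variable are illustrative names); it assumes that every non-numeric column should be treated as categorical.

```python
import pandas as pd

# Toy frame standing in for train_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, 68.0],
    "MSZoning": ["RL", "RL", "RM"],
    "SalePrice": [208500, 181500, 223500],
})

target = "SalePrice"

# "number" matches every numeric dtype (int32, int64, float32, float64, ...).
numeric_cols = [
    col for col in toy_df.select_dtypes(include="number").columns
    if col != target
]

# Everything that is not numeric is treated as categorical here.
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

Filtering the target out in the list comprehension avoids the `ValueError` that `list.remove` raises when the target is absent, for example when the same code runs on a test frame.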


### Handle Missing Values

Missing values can disrupt model training or introduce bias.  
- Numeric features are imputed using the median because it is robust to outliers.  
- Categorical features are imputed using the mode to preserve the most common category.

Both the training and test sets are filled with statistics computed on the training set only. This keeps the two sets consistent and prevents information from the test set leaking into preprocessing.

In [4]:
# Compute fill values from the training set only

numeric_medians = train_df[numeric_features].median()
categorical_modes = train_df[categorical_features].mode().iloc[0]

# Fill missing values in numeric features using the training median

train_df[numeric_features] = train_df[numeric_features].fillna(numeric_medians)
test_df[numeric_features] = test_df[numeric_features].fillna(numeric_medians)

# Fill missing values in categorical features using the training mode

train_df[categorical_features] = train_df[categorical_features].fillna(categorical_modes)
test_df[categorical_features] = test_df[categorical_features].fillna(categorical_modes)

print("Missing value handling completed.")

Missing value handling completed.
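
The same train-only imputation can also be expressed with scikit-learn's `SimpleImputer`, which stores the fitted statistics on the object. This is an alternative sketch, not what the notebook runs; the toy frames and their values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric data; the values are illustrative only.
toy_train_num = pd.DataFrame({"LotFrontage": [60.0, 70.0, np.nan, 80.0]})
toy_test_num = pd.DataFrame({"LotFrontage": [np.nan, 200.0]})

# Fit on training data only; the median of [60, 70, 80] is 70.
num_imputer = SimpleImputer(strategy="median")
num_imputer.fit(toy_train_num)
test_num_filled = num_imputer.transform(toy_test_num)

# Toy categorical data, kept as object dtype so strings and NaN can mix.
toy_train_cat = pd.DataFrame({"MSZoning": ["RL", "RL", "RM", np.nan]}, dtype=object)
toy_test_cat = pd.DataFrame({"MSZoning": [np.nan, "RM"]}, dtype=object)

# "most_frequent" fills with the training mode, here "RL".
cat_imputer = SimpleImputer(strategy="most_frequent")
cat_imputer.fit(toy_train_cat)
test_cat_filled = cat_imputer.transform(toy_test_cat)

print(test_num_filled.ravel())
print(test_cat_filled.ravel())
```

Because the imputer is an estimator, it can be placed inside a scikit-learn pipeline, so the fill values travel with the fitted preprocessing object instead of living in notebook variables.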


### Split Features and Target

Supervised models learn to predict a target variable, here `SalePrice`. We separate the dataset into:
- `X`: input features  
- `y`: target output  

This separation allows preprocessing to be applied consistently to inputs while preserving the target for supervised learning.

In [5]:
# Separate features and target variable

X = train_df.drop("SalePrice", axis=1)
y = train_df["SalePrice"]

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)

Feature matrix shape: (1460, 80)
Target vector shape: (1460,)


### Build Preprocessing Pipeline

A `ColumnTransformer` applies different transformations to numeric and categorical columns:
- **StandardScaler** standardizes each numeric column to zero mean and unit variance, which helps gradient-based and distance-based models.
- **OneHotEncoder** converts categorical variables into binary indicator columns. With `handle_unknown="ignore"`, a category that appears only in the test set is encoded as all zeros instead of raising an error.

The transformer is fitted on the training data only and then applied unchanged to the test data, so both sets share the same scaling parameters and the same encoded columns. The fitted transformer can also be reused at inference time.

In [6]:
# Create a preprocessing pipeline with scaling and encoding

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

# Fit-transform training data and transform test data

X_processed = preprocessor.fit_transform(X)
X_test_processed = preprocessor.transform(test_df)

print("Feature preprocessing completed.")
print("Processed training feature matrix shape:", X_processed.shape)
print("Processed test feature matrix shape:", X_test_processed.shape)

Feature preprocessing completed.
Processed training feature matrix shape: (1460, 289)
Processed test feature matrix shape: (1459, 289)


### Save Processed Outputs

The processed feature matrices are saved in compressed sparse `.npz` format to reduce storage space: one-hot encoding expands the 80 input columns to 289, and most of the indicator entries are zero. The target vector `y` is saved as a CSV file. The model training notebooks load these files directly, which keeps preprocessing and modeling separate.

In [7]:
# Save transformed matrices as compressed sparse files

sparse.save_npz("../data/X_processed.npz", X_processed)
sparse.save_npz("../data/X_test_processed.npz", X_test_processed)

# Save target variable

y.to_csv("../data/y.csv", index=False)

print("Preprocessing completed & files saved successfully!")

Preprocessing completed & files saved successfully!
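
A modeling notebook would read these files back with `sparse.load_npz` and `pd.read_csv`. The sketch below does a round trip on a made-up matrix in a temporary directory (the file names `X_toy.npz` and `y_toy.csv` are hypothetical). It also guards against a dense input, since `sparse.save_npz` only accepts SciPy sparse matrices.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins for X_processed and y; the values are illustrative only.
toy_matrix = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 2.5]])
toy_target = pd.Series([208500, 181500], name="SalePrice")

# save_npz needs a sparse matrix, so convert if the input is dense.
if not sparse.issparse(toy_matrix):
    toy_matrix = sparse.csr_matrix(toy_matrix)

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, "X_toy.npz")
    y_path = os.path.join(tmp_dir, "y_toy.csv")

    # Save in the same formats the notebook uses.
    sparse.save_npz(x_path, toy_matrix)
    toy_target.to_csv(y_path, index=False)

    # Load back, as a modeling notebook would.
    X_loaded = sparse.load_npz(x_path)
    y_loaded = pd.read_csv(y_path)["SalePrice"]

print(X_loaded.shape, y_loaded.shape)
```

Since `index=False` drops the row index from the CSV, the rows of `y` and the feature matrix are matched purely by position, so neither should be reordered between saving and loading.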
