# Phase 4 — Preprocessing Pipelines

This phase introduces preprocessing pipelines that handle numerical and categorical features in a structured, leakage-safe way.

The objectives of this phase are:
- incorporate additional relevant features
- apply appropriate preprocessing (imputation, encoding, scaling)
- ensure transformations are learned only from training data
- prepare data for more expressive machine learning models
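The leakage-safe objective above can be illustrated without any library: the scaling statistics are computed from the training values only and then reused unchanged on the validation values. This is a minimal sketch with made-up numbers; `fit_scaler` and `apply_scaler` are hypothetical helper names, not part of the notebook.

```python
from statistics import fmean, pstdev

def fit_scaler(train_values):
    """Learn the mean and population standard deviation from training data only."""
    return fmean(train_values), pstdev(train_values)

def apply_scaler(values, mean, std):
    """Standardize values using previously learned statistics."""
    return [(v - mean) / std for v in values]

# Made-up hour values, for illustration only.
train_hours = [0.0, 6.0, 12.0, 18.0]
valid_hours = [23.0]

mean, std = fit_scaler(train_hours)                   # statistics come from train only
train_scaled = apply_scaler(train_hours, mean, std)
valid_scaled = apply_scaler(valid_hours, mean, std)   # reused, never refit on validation
```

Refitting the statistics on the validation split (or on the full dataset) would let information from held-out rows influence the transformation, which is the leakage the pipelines in this phase are meant to prevent.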


In [24]:
# Define feature groups
TARGET = "Production"

NUMERIC_FEATURES = [
    "Start_Hour",
    "Day_of_Year"
]

CATEGORICAL_FEATURES = [
    "Season",
    "Source",
    "Day_Name"
]

print("Numeric features:", NUMERIC_FEATURES)
print("Categorical features:", CATEGORICAL_FEATURES)


Numeric features: ['Start_Hour', 'Day_of_Year']
Categorical features: ['Season', 'Source', 'Day_Name']
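Because each feature group is routed to a different pipeline, it can help to confirm that the groups do not overlap and that the target is not listed as a feature. The check below is an optional sketch; `check_feature_groups` is a hypothetical helper, not something defined in the notebook.

```python
TARGET = "Production"
NUMERIC_FEATURES = ["Start_Hour", "Day_of_Year"]
CATEGORICAL_FEATURES = ["Season", "Source", "Day_Name"]

def check_feature_groups(numeric, categorical, target):
    """Return the combined feature list, raising ValueError on overlap or target leakage."""
    overlap = set(numeric) & set(categorical)
    if overlap:
        raise ValueError(f"Features listed in both groups: {sorted(overlap)}")
    if target in set(numeric) | set(categorical):
        raise ValueError(f"Target {target!r} is listed as a feature")
    return list(numeric) + list(categorical)

all_features = check_feature_groups(NUMERIC_FEATURES, CATEGORICAL_FEATURES, TARGET)
```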


#### **Build preprocessing pipelines**

#### Numerical pipeline

In [25]:
# Impute missing numeric values with the median, then standardize each column.
numeric_pipline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
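To see what these two steps do, the sketch below runs an equivalent pipeline on a tiny made-up column with one missing value. The array values and the name `demo_numeric` are illustrative assumptions; only the step configuration mirrors the cell above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Made-up data: one numeric column with a missing entry in each split.
X_train = np.array([[1.0], [np.nan], [3.0], [5.0]])
X_valid = np.array([[np.nan], [7.0]])

train_out = demo_numeric.fit_transform(X_train)   # learns median, mean and std from train
valid_out = demo_numeric.transform(X_valid)       # reuses those learned values

# The median learned from the observed training values 1, 3 and 5.
learned_median = demo_numeric.named_steps["imputer"].statistics_[0]
```

The missing validation entry is filled with the training median, so it maps to the same scaled value as the imputed training row.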

#### Categorical pipeline

In [26]:
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
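The effect of `handle_unknown="ignore"` is easiest to see on a category that never appears during fitting. The sketch below uses made-up category labels (`"Wind"`, `"Solar"`, `"Hydro"` are assumptions, not values taken from the dataset) and a hypothetical pipeline name `demo_categorical`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

demo_categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Made-up labels; the training column has one missing entry.
train_source = np.array([["Wind"], ["Wind"], ["Solar"], [np.nan]], dtype=object)
valid_source = np.array([["Solar"], ["Hydro"]], dtype=object)   # "Hydro" is unseen

demo_categorical.fit(train_source)
categories = list(demo_categorical.named_steps["encoder"].categories_[0])

# The encoder returns a sparse matrix by default; densify it for inspection.
valid_encoded = demo_categorical.transform(valid_source).toarray()
```

The unseen label is encoded as a row of zeros instead of raising an error, so prediction on new data does not fail when a category was absent from the training split.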

#### **Combine pipelines**

In [27]:
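# Route each feature group through its own pipeline.
# Columns not listed here are dropped (remainder="drop" is the default).
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_pipline, NUMERIC_FEATURES),
    ("cat", categorical_pipeline, CATEGORICAL_FEATURES),
])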


#### **Model + Pipeline (Ridge)**

In [28]:
ridge_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("ridge", Ridge(alpha=0.6)),
])
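Ridge regression minimizes the squared error plus `alpha` times the squared L2 norm of the coefficients, so a larger `alpha` shrinks the coefficients more strongly. The sketch below shows this on synthetic data; the data, the comparison value `alpha=100.0`, and the variable names are assumptions for illustration and say nothing about how `alpha=0.6` was chosen in the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem with known coefficients plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

weak = Ridge(alpha=0.6).fit(X, y)       # light regularization
strong = Ridge(alpha=100.0).fit(X, y)   # heavy regularization

# Coefficient norms: the heavily regularized model is pulled toward zero.
weak_norm = float(np.linalg.norm(weak.coef_))
strong_norm = float(np.linalg.norm(strong.coef_))
```

Because the penalty acts on coefficient size, it is sensitive to feature scale, which is one reason the numeric features are standardized before the model step.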

In [29]:
FEATURES = NUMERIC_FEATURES + CATEGORICAL_FEATURES

# Fit the preprocessing steps and the model on the training split only.
ridge_pipeline.fit(train_df[FEATURES], y_train)

# Validation rows are transformed with the statistics learned during fit.
ridge_y_val_pred = ridge_pipeline.predict(valid_df[FEATURES])

rmse_ridge_pipeline = root_mean_squared_error(y_valid, ridge_y_val_pred)
print("Pipeline Ridge RMSE:", rmse_ridge_pipeline)

Pipeline Ridge RMSE: 4351.64040680396
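For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as the target. The function below is a plain-Python sketch of that definition with made-up numbers; `rmse` is a hypothetical helper and is not the scikit-learn implementation.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("Inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Errors of 1 and -7 give squared errors 1 and 49, mean 25, root 5.
example = rmse([10.0, 20.0], [9.0, 27.0])
print(example)  # 5.0
```

Squaring weights large errors more heavily than small ones, so a few badly predicted rows can dominate the score.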


**Observation:**  
Adding the categorical features through the preprocessing pipeline slightly improves validation RMSE over the Phase 3 baseline (about 4351.64 here). The contextual features therefore carry some predictive signal, but the gain is small; larger improvements are expected from the time-dependent feature engineering in later phases.


## Phase 4 Summary — Preprocessing Pipelines

In this phase, preprocessing pipelines were introduced to systematically handle numerical and categorical features.

Key outcomes:
- Numerical and categorical features were processed using dedicated pipelines.
- Missing values were handled via imputation strategies appropriate to feature type.
- Categorical variables were encoded using one-hot encoding with safe handling of unseen categories.
- A Ridge Regression model was trained using the full preprocessing pipeline.
- The pipeline-based model provides a more realistic and extensible baseline for subsequent feature engineering and advanced models.

This phase establishes a robust foundation for introducing lag features, rolling statistics, and non-linear models in later phases.


---