# Modeling and Evaluation

This notebook trains and evaluates regression models to predict `SalesAmount` using the processed, feature-engineered dataset prepared in the previous step.

The goal is to:
- Establish a baseline performance
- Compare multiple regression models
- Evaluate generalization performance
- Identify key drivers of sales


In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge

# Define processed data directory
processed_dir = os.path.join("..", "data", "processed")

# Load preprocessed datasets
X_train = pd.read_csv(os.path.join(processed_dir, "X_train.csv"))
X_test = pd.read_csv(os.path.join(processed_dir, "X_test.csv"))
y_train = pd.read_csv(os.path.join(processed_dir, "y_train.csv")).squeeze()
y_test = pd.read_csv(os.path.join(processed_dir, "y_test.csv")).squeeze()


## Baseline Model

Before training more complex models, we establish a baseline performance using a simple regression strategy.

The baseline model predicts the mean of the target variable regardless of input features. While simplistic, it provides a minimum performance threshold that all subsequent models should outperform.

This ensures that any added model complexity is justified by measurable performance gains.
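As a minimal sketch of what the mean-predicting baseline does, the example below fits a `DummyRegressor` on hypothetical toy arrays (not the project data) and shows that the prediction is the training mean no matter what the input features are.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Hypothetical toy data, used only for illustration
toy_X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_y_train = np.array([10.0, 20.0, 30.0, 60.0])  # mean is 30.0

# Inputs far outside the training range, to show that features are ignored
toy_X_new = np.array([[-100.0], [0.0], [999.0]])

toy_baseline = DummyRegressor(strategy="mean").fit(toy_X_train, toy_y_train)
toy_baseline_pred = toy_baseline.predict(toy_X_new)

print(toy_baseline_pred)
```

Every entry of `toy_baseline_pred` equals the training mean, which is why the baseline sets a floor rather than a competitive score.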


In [2]:
# Initialize baseline regressor (predicts mean of y_train)
baseline_model = DummyRegressor(strategy="mean")

# Fit baseline model
baseline_model.fit(X_train, y_train)

# Generate predictions
y_pred_baseline = baseline_model.predict(X_test)


In [3]:
# Evaluate baseline performance
baseline_mae = mean_absolute_error(y_test, y_pred_baseline)
baseline_rmse = np.sqrt(mean_squared_error(y_test, y_pred_baseline))
baseline_r2 = r2_score(y_test, y_pred_baseline)

baseline_mae, baseline_rmse, baseline_r2


(694.7562612966226, np.float64(931.0226674689202), -2.8972993177500683e-05)

## Baseline Performance Summary

The baseline model provides a reference point for model evaluation.

Key takeaways:

- MAE and RMSE reflect error magnitude when predicting the average sale value
- R² is close to 0 (about -2.9e-05 here), indicating no explanatory power; it is slightly negative because the training mean differs slightly from the test mean
- Any predictive model must significantly outperform this baseline to be considered useful

The baseline results will be used as a benchmark when evaluating trained regression models.
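The three metrics can be written out from their definitions. The sketch below uses hypothetical toy targets (not the project data) and a helper name, `regression_metrics`, introduced only for illustration. It also shows why a mean predictor scored on held-out data can land slightly below zero on R².

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variation around the test mean
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Hypothetical toy targets
toy_train = np.array([100.0, 200.0, 300.0, 400.0])   # training mean is 250
toy_test = np.array([150.0, 250.0, 500.0])           # test mean is 300

# A mean predictor outputs the training mean for every test row
constant_pred = np.full_like(toy_test, toy_train.mean())
toy_mae, toy_rmse, toy_r2 = regression_metrics(toy_test, constant_pred)

print(toy_mae, toy_rmse, toy_r2)
```

Because the constant prediction (250) is not the test mean (300), `ss_res` exceeds `ss_tot` and R² comes out negative. The closer the two means are, the closer R² is to zero.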


## Linear Regression

We now train a Linear Regression model to establish a first meaningful predictive benchmark beyond the baseline model.

Linear Regression learns a coefficient (weight) for each feature and combines them linearly to predict the target variable (`SalesAmount`).

This step allows us to:
- Verify that the engineered features contain predictive signal
- Quantify improvement over the baseline
- Establish a reference for more advanced models

If Linear Regression fails to outperform the baseline, that points to a problem in feature engineering or preprocessing. Conversely, near-perfect performance is a warning sign of target leakage.
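To make "a coefficient for each feature, combined linearly" concrete, the sketch below fits `LinearRegression` on synthetic, noise-free data (an assumption made so the true weights are recoverable) and rebuilds the predictions by hand from `intercept_` and `coef_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known weights; not the project data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
true_weights = np.array([2.0, -1.0, 0.5])
y_toy = 5.0 + X_toy @ true_weights  # noise-free linear target

toy_model = LinearRegression().fit(X_toy, y_toy)

# A prediction is the intercept plus the weighted sum of the features
manual_pred = toy_model.intercept_ + X_toy @ toy_model.coef_

print(toy_model.coef_, toy_model.intercept_)
```

The manual combination matches `toy_model.predict(X_toy)`, and the learned coefficients recover the weights used to generate the target.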


In [4]:
# Initialize Linear Regression model
lin_reg_model = LinearRegression()

# Train model on training data
lin_reg_model.fit(X_train, y_train)

# Predict on test data
y_pred_lr = lin_reg_model.predict(X_test)


In [5]:
# Evaluate Linear Regression performance
lr_mae = mean_absolute_error(y_test, y_pred_lr)
lr_rmse = np.sqrt(mean_squared_error(y_test, y_pred_lr))
lr_r2 = r2_score(y_test, y_pred_lr)

lr_mae, lr_rmse, lr_r2

(0.016980434102387105, np.float64(0.06982823170490929), 0.9999999943745902)

## Linear Regression Performance Summary

The Linear Regression model provides the first true test of whether the engineered feature set can explain variation in `SalesAmount`.

Evaluation metrics are interpreted as follows:

- Mean Absolute Error (MAE) indicates average prediction error in the units of `SalesAmount`
- Root Mean Squared Error (RMSE) penalizes larger prediction errors
- R² measures the proportion of variance explained by the model

These results are compared directly against the baseline model.
A successful model must demonstrate:
- Lower MAE than baseline
- Lower RMSE than baseline
- Positive R² value

The Linear Regression results will guide whether regularization or more complex models are required.


## Leakage Diagnosis

The near-perfect Linear Regression performance (R² ≈ 0.99999999 and MAE ≈ 0.017, against a baseline MAE of about 695) strongly suggests target leakage.

Some input features directly determine `SalesAmount`, which allows the model to effectively reproduce the target rather than learn general predictive patterns.

To ensure a fair and realistic evaluation, features that are deterministically linked to the target or only known post-transaction must be removed before modeling.
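One simple way to screen for such features is to flag columns whose correlation with the target is suspiciously close to 1. The sketch below is an illustration on synthetic data, not the procedure used in this notebook. The helper `flag_suspect_features`, the column names, and the 0.95 threshold are all hypothetical. A correlation screen only catches linear, single-feature leaks; leakage spread across several columns needs a closer look.

```python
import numpy as np
import pandas as pd

def flag_suspect_features(X, y, threshold=0.95):
    """Return features whose absolute Pearson correlation with y exceeds threshold."""
    corr = X.corrwith(y).abs()
    return corr[corr > threshold].sort_values(ascending=False)

# Synthetic example: one column nearly determines the target, one is pure noise
rng = np.random.default_rng(0)
n_rows = 200
toy_X = pd.DataFrame({
    "ListPriceLike": rng.uniform(10, 100, n_rows),  # hypothetical leaky column
    "NoiseFeature": rng.normal(size=n_rows),        # hypothetical harmless column
})
toy_y = 2.0 * toy_X["ListPriceLike"] + rng.normal(scale=0.01, size=n_rows)

suspects = flag_suspect_features(toy_X, toy_y)
print(suspects)
```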


In [6]:
# Explicitly remove target leakage features
leakage_features = [
    "UnitPrice",
    "UnitPriceDiscountPct",
    "DiscountAmount",
    "OrderQuantity"
]

X_train_clean = X_train.drop(columns=leakage_features)
X_test_clean = X_test.drop(columns=leakage_features)

X_train_clean.shape, X_test_clean.shape


((48318, 460), (12080, 460))

## Linear Regression (Leakage-Free Features)

After removing leakage-prone features, the Linear Regression model is retrained using only information that would be available at prediction time.

This provides a realistic estimate of model performance and ensures comparability with future models.


In [7]:
# Re-train Linear Regression without leakage
lin_reg_clean = LinearRegression()
lin_reg_clean.fit(X_train_clean, y_train)

y_pred_lr_clean = lin_reg_clean.predict(X_test_clean)

In [8]:
# Evaluate the retrained model on the test set
lr_clean_mae = mean_absolute_error(y_test, y_pred_lr_clean)
lr_clean_rmse = np.sqrt(mean_squared_error(y_test, y_pred_lr_clean))
lr_clean_r2 = r2_score(y_test, y_pred_lr_clean)

lr_clean_mae, lr_clean_rmse, lr_clean_r2

(0.016972559631497124, np.float64(0.06979550105867736), 0.9999999943798626)

## Extended Leakage Mitigation

Further inspection revealed additional features that indirectly encode pricing or transaction-level information closely tied to `SalesAmount`.

To ensure a realistic prediction setting, all price-related and post-transaction attributes are removed. The remaining feature set reflects information that would be available prior to or at order time.
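Price-related columns can also be located by name instead of being listed by hand. The sketch below assumes, purely for illustration, that such columns contain `Price` or `Cost` in their names. The toy frame and the regex are hypothetical, and a name-based filter will miss leaky columns that are named differently (such as `Freight`).

```python
import pandas as pd

# Hypothetical toy feature frame; column names mimic the style used in this notebook
toy_features = pd.DataFrame({
    "ListPrice": [10.0, 20.0],
    "StandardCost": [6.0, 11.0],
    "Freight": [1.0, 2.0],
    "Color_Red": [1, 0],
})

# Select column names matching the assumed naming pattern
price_like = toy_features.filter(regex=r"Price|Cost").columns.tolist()

# Drop the matched columns; the remaining ones still need manual review
toy_reduced = toy_features.drop(columns=price_like)

print(price_like)
print(toy_reduced.columns.tolist())
```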


In [9]:
# Remove additional leakage-prone features
additional_leakage_features = [
    "DealerPrice",
    "ListPrice",
    "StandardCost",
    "ProductStandardCost",
    "Freight",
    "RevisionNumber",
    "ProductSubcategoryKey"
]

X_train_final = X_train_clean.drop(columns=additional_leakage_features)
X_test_final = X_test_clean.drop(columns=additional_leakage_features)

X_train_final.shape, X_test_final.shape


((48318, 453), (12080, 453))

## Linear Regression (Fully Leakage-Free)

After removing all identified leakage sources, Linear Regression is trained using only non-transactional, non-price-related features.

This model provides a realistic baseline for predictive performance and serves as a reference for regularized and nonlinear models.


In [10]:
# Train Linear Regression on the final feature set
lin_reg_final = LinearRegression()
lin_reg_final.fit(X_train_final, y_train)

# Predict on the test set
y_pred_lr_final = lin_reg_final.predict(X_test_final)

# Evaluate on the test set
lr_final_mae = mean_absolute_error(y_test, y_pred_lr_final)
lr_final_rmse = np.sqrt(mean_squared_error(y_test, y_pred_lr_final))
lr_final_r2 = r2_score(y_test, y_pred_lr_final)

lr_final_mae, lr_final_rmse, lr_final_r2

(2.8318416250645018, np.float64(9.234553568611803), 0.9999016161342982)

## Interpretation of Final Model Performance

Despite extensive leakage mitigation, the Linear Regression model still achieves a very high R² score (about 0.9999, with an MAE of about 2.83).

This performance is most plausibly explained by the structure of the dataset rather than residual target leakage:

- The prediction task operates at the transaction / line-item level
- High-cardinality categorical features (e.g. product identifiers, customer attributes) allow the model to learn stable pricing patterns
- Sales amounts are largely deterministic within this dataset once product and customer context is known

As a result, the model effectively learns strong associations rather than forecasting uncertain outcomes.

This behavior is acceptable for explaining relationships within the data, but it should be interpreted with caution if used for future predictions.
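The claim that sales amounts are largely determined by product context can be checked with a per-group mean. The sketch below uses synthetic data in which each product has a fixed price plus small noise. The column `ProductKey` and the price levels are hypothetical, chosen only to illustrate the effect.

```python
import numpy as np
import pandas as pd

# Synthetic line items: five hypothetical products with very different price levels
rng = np.random.default_rng(0)
n_rows = 300
product_keys = rng.integers(0, 5, size=n_rows)
base_price = np.array([20.0, 50.0, 120.0, 700.0, 2300.0])

toy_sales = pd.DataFrame({
    "ProductKey": product_keys,
    "SalesAmount": base_price[product_keys] + rng.normal(scale=1.0, size=n_rows),
})

# Predict each row with the mean amount of its product
group_mean = toy_sales.groupby("ProductKey")["SalesAmount"].transform("mean")

# Share of variance explained by product identity alone
ss_res = ((toy_sales["SalesAmount"] - group_mean) ** 2).sum()
ss_tot = ((toy_sales["SalesAmount"] - toy_sales["SalesAmount"].mean()) ** 2).sum()
share_explained = 1.0 - ss_res / ss_tot

print(share_explained)
```

When a lookup by product explains almost all of the variance, a linear model with one-hot encoded product features will score similarly, without having learned anything that transfers to new products or prices.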


## Ridge Regression — Conceptual Overview

While Linear Regression fits coefficients by minimizing prediction error, it can become unstable when the number of features is large or when features are highly correlated.

Ridge Regression extends Linear Regression by adding **L2 regularization**, which penalizes large coefficient values during training.

Conceptually, Ridge Regression:

- Preserves the linear relationship between features and target
- Shrinks coefficients toward zero (but does not eliminate them)
- Reduces model sensitivity to correlated and high-dimensional features
- Improves numerical stability and generalization

This makes Ridge Regression well-suited for datasets created through extensive one-hot encoding, as is the case in this project.
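The shrinkage effect can be seen from the closed-form ridge solution, `w = (XᵀX + αI)⁻¹ Xᵀy`, on centered data. The sketch below is a minimal illustration on synthetic data with two nearly duplicate columns. The helper `ridge_closed_form` is hypothetical and is compared against scikit-learn's `Ridge` only as a sanity check.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data; the last column is almost a copy of the third one
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
X_toy[:, 3] = X_toy[:, 2] + rng.normal(scale=0.01, size=100)
y_toy = X_toy @ np.array([1.0, -2.0, 3.0, 0.0]) + rng.normal(scale=0.1, size=100)

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y on centered data (the intercept is handled by centering)."""
    X_c = X - X.mean(axis=0)
    y_c = y - y.mean()
    penalty = alpha * np.eye(X.shape[1])
    return np.linalg.solve(X_c.T @ X_c + penalty, X_c.T @ y_c)

w_weak = ridge_closed_form(X_toy, y_toy, alpha=0.01)     # close to ordinary least squares
w_strong = ridge_closed_form(X_toy, y_toy, alpha=100.0)  # heavily shrunk

# Sanity check against scikit-learn's implementation
sk_coef = Ridge(alpha=100.0).fit(X_toy, y_toy).coef_

print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

The coefficient norm drops as `alpha` grows, and none of the coefficients is set exactly to zero, which matches the behavior listed above.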


## Ridge Regression — Model Training

We now train a Ridge Regression model using the fully leakage-free feature set.

A fixed regularization strength (`alpha`) is used initially to assess whether coefficient shrinkage improves robustness relative to standard Linear Regression.

The `alpha` parameter controls regularization strength:

- Small alpha → behaves similarly to Linear Regression
- Large alpha → stronger penalty, more coefficient shrinkage
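If the fixed `alpha` turns out to matter, one possible follow-up is to let cross-validation pick it from a grid. The sketch below uses `RidgeCV` on synthetic data. The grid and the data are assumptions for illustration and are not part of this notebook's workflow.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic data: only the first of ten features carries signal
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=200)

# Hypothetical grid spanning weak to strong regularization
alpha_grid = np.logspace(-3, 3, 7)

# RidgeCV evaluates each alpha by cross-validation and keeps the best one
ridge_cv = RidgeCV(alphas=alpha_grid).fit(X_toy, y_toy)
best_alpha = ridge_cv.alpha_

print(best_alpha)
```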


In [11]:
# Initialize Ridge Regression model
# (random_state only affects the stochastic "sag" and "saga" solvers)
ridge_model = Ridge(alpha=1.0, random_state=42)

# Train Ridge Regression model
ridge_model.fit(X_train_final, y_train)

# Generate predictions
y_pred_ridge = ridge_model.predict(X_test_final)

## Ridge Regression — Evaluation

Ridge Regression is evaluated using the same metrics as previous models to ensure direct comparability.

This evaluation helps determine whether regularization:

- Maintains predictive accuracy
- Improves robustness
- Provides a safer alternative to unregularized Linear Regression


In [12]:
# Evaluate Ridge Regression performance
ridge_mae = mean_absolute_error(y_test, y_pred_ridge)
ridge_rmse = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
ridge_r2 = r2_score(y_test, y_pred_ridge)

ridge_mae, ridge_rmse, ridge_r2

(2.96459229842796, np.float64(9.271238811862148), 0.999900832900966)

## Linear vs Ridge Regression — Model Comparison

Linear Regression and Ridge Regression are compared using identical training data and evaluation metrics.

Key considerations:

- Whether Ridge maintains comparable predictive performance
- Whether regularization provides additional robustness
- Whether coefficient shrinkage is desirable given the dataset structure

Even if Ridge does not outperform Linear Regression numerically, its regularization can offer more stable coefficients, which may make it preferable in high-dimensional settings with correlated features.


## Model Performance Comparison

To summarize model performance, the baseline, Linear Regression, and Ridge Regression models are compared using consistent evaluation metrics.

This comparison highlights the magnitude of improvement over the baseline and assesses whether regularization affects predictive performance in a high-dimensional setting.


In [13]:
# Create comparison table
model_comparison = pd.DataFrame({
    "Model": [
        "Baseline (Mean Predictor)",
        "Linear Regression (Final)",
        "Ridge Regression"
    ],
    "MAE": [
        baseline_mae,
        lr_final_mae,
        ridge_mae
    ],
    "RMSE": [
        baseline_rmse,
        lr_final_rmse,
        ridge_rmse
    ],
    "R2": [
        baseline_r2,
        lr_final_r2,
        ridge_r2
    ]
})

model_comparison

| Model | MAE | RMSE | R2 |
|---|---|---|---|
| Baseline (Mean Predictor) | 694.756261 | 931.022667 | -2.9e-05 |
| Linear Regression (Final) | 2.831842 | 9.234554 | 0.999902 |
| Ridge Regression | 2.964592 | 9.271239 | 0.999901 |
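The same three metric calls are repeated for every model in this notebook. One way to avoid that repetition is a small helper that builds the comparison table from a dictionary of predictions. The sketch below is an illustration on toy arrays; `summarize_predictions` is a hypothetical name, not a function defined elsewhere in the project.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def summarize_predictions(predictions, y_true):
    """Build a metrics table from a {model name: predictions} mapping."""
    rows = []
    for name, y_pred in predictions.items():
        rows.append({
            "Model": name,
            "MAE": mean_absolute_error(y_true, y_pred),
            "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
            "R2": r2_score(y_true, y_pred),
        })
    return pd.DataFrame(rows).set_index("Model")

# Toy targets and two toy "models" for illustration
toy_true = np.array([10.0, 20.0, 30.0, 40.0])
toy_predictions = {
    "Mean predictor": np.full(4, toy_true.mean()),
    "Perfect predictor": toy_true.copy(),
}

toy_summary = summarize_predictions(toy_predictions, toy_true)
print(toy_summary)
```

With the notebook's variables, the mapping would hold `y_pred_baseline`, `y_pred_lr_final`, and `y_pred_ridge`, scored against `y_test`.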


## Final Model Selection

Both Linear Regression and Ridge Regression outperform the baseline model by a wide margin on every evaluation metric (MAE of about 2.8 to 3.0 versus about 695).

Key observations:

- Linear Regression achieves the lowest error and highest R²
- Ridge Regression produces nearly identical performance
- Regularization at `alpha=1.0` does not improve accuracy for this dataset (MAE of about 2.96 versus 2.83)

Given these results, **Linear Regression** is selected as the final model due to its simplicity, interpretability, and marginally better accuracy.

Ridge Regression remains a strong alternative in scenarios where model stability or coefficient shrinkage is prioritized.


## Limitations and Considerations

While model performance is strong, several limitations should be noted:

- The prediction task operates at a transactional level, where outcomes are largely deterministic once context is known
- High-cardinality categorical features can lead the model to rely heavily on patterns specific to this dataset
- Performance may not generalize to future pricing, promotions, or unseen products

As a result, the model is best suited for **explanatory analysis** rather than long-horizon forecasting.

Future work could include:

- Temporal validation strategies
- Product-level aggregation
- Tuned regularization or tree-based models for robustness testing
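As a starting point for temporal validation, the sketch below shows how `TimeSeriesSplit` produces folds in which every training row precedes every test row. It assumes the rows are already sorted by order date, which this notebook does not guarantee, and it uses a toy array in place of the real feature matrix.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in for a feature matrix whose rows are sorted chronologically (assumption)
n_rows = 12
toy_features = np.arange(n_rows).reshape(-1, 1)

# Each successive fold trains on a longer history and tests on the next block
splitter = TimeSeriesSplit(n_splits=3)
folds = [(train_idx, test_idx) for train_idx, test_idx in splitter.split(toy_features)]

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Scoring the models on such folds would show how much of the current performance survives when the test period lies strictly after the training period.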
