In this step, we apply RobustScaler to both transformed and non-transformed versions of our dataset.
RobustScaler is well suited to features that contain outliers: it centers each feature on its median and scales it by the interquartile range (IQR), rather than using the mean and standard deviation, which outliers can distort.
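
To make the median/IQR scaling concrete, here is a minimal sketch on a toy column with one outlier. The array `x` and the variable names are made up for illustration and are not part of our dataset; the sketch assumes RobustScaler's default settings (centering on, 25th to 75th percentile range).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy feature with a single large outlier (not from our data).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler()  # defaults: center on the median, scale by the IQR
scaled = scaler.fit_transform(x)

# Manual version of the same computation: (x - median) / (Q3 - Q1).
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

print(scaler.center_, scaler.scale_)  # learned median and IQR
print(np.allclose(scaled, manual))
```

The outlier still maps to a large value, but it does not influence the center or the scale applied to the other rows.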

We fit only on the training set to avoid data leakage.

The fitted scaler is then used to transform the validation and test sets, so all three splits are scaled with the training-set statistics.
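
As a small illustration of what "fit on train, transform the rest" means in practice, the sketch below uses two hypothetical toy arrays (`toy_train`, `toy_val`) rather than our real splits. The validation rows are scaled with the median and IQR learned from the training rows, so they are not re-centered on their own median.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy splits, for illustration only.
toy_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
toy_val = np.array([[10.0], [20.0]])

toy_scaler = RobustScaler()
toy_scaler.fit(toy_train)  # statistics come from the training rows only

# Validation rows reuse the training median (3.0) and IQR (2.0).
toy_val_scaled = toy_scaler.transform(toy_val)
print(toy_val_scaled.ravel())
```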

This setup allows us to later compare the effect of feature transformation (log1p) versus simple outlier-resistant scaling on model performance.

In [1]:
from sklearn.preprocessing import RobustScaler
import pandas as pd

In [2]:
# Load data
X_train_transform = pd.read_csv("../../data/processed/transformed/X_train_transform.csv")
X_val_transform = pd.read_csv("../../data/processed/transformed/X_val_transform.csv")
X_test_transform = pd.read_csv("../../data/processed/transformed/X_test_transform.csv")

X_train_no_transform = pd.read_csv("../../data/processed/no_transformed/X_train_no_transform.csv")
X_val_no_transform = pd.read_csv("../../data/processed/no_transformed/X_val_no_transform.csv")
X_test_no_transform = pd.read_csv("../../data/processed/no_transformed/X_test_no_transform.csv")

In [3]:
# --- Setup ---
scaler_transform = RobustScaler()
scaler_no_transform = RobustScaler()

# --- 1. Fit on the training sets only, then transform all splits ---
X_train_transform_scaled = scaler_transform.fit_transform(X_train_transform)
X_val_transform_scaled = scaler_transform.transform(X_val_transform)
X_test_transform_scaled = scaler_transform.transform(X_test_transform)

X_train_no_transform_scaled = scaler_no_transform.fit_transform(X_train_no_transform)
X_val_no_transform_scaled = scaler_no_transform.transform(X_val_no_transform)
X_test_no_transform_scaled = scaler_no_transform.transform(X_test_no_transform)

# --- 2. Wrap back as DataFrames (optional, for better column tracking) ---
X_train_transform_scaled = pd.DataFrame(X_train_transform_scaled, columns=X_train_transform.columns, index=X_train_transform.index)
X_val_transform_scaled = pd.DataFrame(X_val_transform_scaled, columns=X_val_transform.columns, index=X_val_transform.index)
X_test_transform_scaled = pd.DataFrame(X_test_transform_scaled, columns=X_test_transform.columns, index=X_test_transform.index)

X_train_no_transform_scaled = pd.DataFrame(X_train_no_transform_scaled, columns=X_train_no_transform.columns, index=X_train_no_transform.index)
X_val_no_transform_scaled = pd.DataFrame(X_val_no_transform_scaled, columns=X_val_no_transform.columns, index=X_val_no_transform.index)
X_test_no_transform_scaled = pd.DataFrame(X_test_no_transform_scaled, columns=X_test_no_transform.columns, index=X_test_no_transform.index)


In [4]:
X_train_transform_scaled.to_csv("../../data/processed/transformed/X_train_transform_scaled.csv", index=False)
X_val_transform_scaled.to_csv("../../data/processed/transformed/X_val_transform_scaled.csv", index=False)
X_test_transform_scaled.to_csv("../../data/processed/transformed/X_test_transform_scaled.csv", index=False)

X_train_no_transform_scaled.to_csv("../../data/processed/no_transformed/X_train_no_transform_scaled.csv", index=False)
X_val_no_transform_scaled.to_csv("../../data/processed/no_transformed/X_val_no_transform_scaled.csv", index=False)
X_test_no_transform_scaled.to_csv("../../data/processed/no_transformed/X_test_no_transform_scaled.csv", index=False)

print("Train/Validation/Test datasets with scaling saved.")

Train/Validation/Test datasets with scaling saved.


After fitting each RobustScaler on the training data, we save the scaler objects with joblib.
Reloading them at test or prediction time applies the same training-set statistics, which avoids refitting on new data and prevents scaling mismatches.
Separate scalers are stored for the log-transformed and non-transformed datasets.
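
For reference, a saved scaler could later be restored with `joblib.load` and applied to new rows. The sketch below is self-contained and uses a temporary directory, a hypothetical file name (`robust_scaler_demo.pkl`), and toy data, so it does not touch the real files in our models folder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy training data and a scaler fitted on it.
demo_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
fitted = RobustScaler().fit(demo_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this demonstration.
    demo_path = os.path.join(tmp_dir, "robust_scaler_demo.pkl")
    joblib.dump(fitted, demo_path)
    restored = joblib.load(demo_path)

# The restored scaler applies the same median and IQR as the original.
new_rows = np.array([[5.0], [7.0]])
same = np.allclose(restored.transform(new_rows), fitted.transform(new_rows))
print(same)
```

Note that pickled scikit-learn objects are generally meant to be loaded with the same library version that saved them.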

In [5]:
import joblib
import os

SCALER_DIR = "../../models/"
os.makedirs(SCALER_DIR, exist_ok=True)

# Save the scaler fitted on the log-transformed data
joblib.dump(scaler_transform, os.path.join(SCALER_DIR, "robust_scaler_transform.pkl"))

# Save the scaler fitted on the non-transformed data (no log1p)
joblib.dump(scaler_no_transform, os.path.join(SCALER_DIR, "robust_scaler_no_transform.pkl"))

print("Scalers saved.")


Scalers saved.
