## 1️⃣ Load LOG-TRANSFORMED Dataset (Locked Pipeline)

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 200)

DATA_PATH = "ames_housing_log_v1.csv"
df = pd.read_csv(DATA_PATH)

df.shape

(2930, 82)

## 2️⃣ Separate Features and Target

In [2]:
TARGET_COL = "SalePrice"

X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL]

X.shape, y.shape

((2930, 81), (2930,))

## 3️⃣ Train–Test Split (NO SCALING YET)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

X_train.shape, X_test.shape


((2344, 81), (586, 81))

Interpretation

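- The model must not see test data, either directly or through preprocessing

- Scaling learns statistics (per-feature mean and standard deviation) → these must be learned from training data only

- This is why `.fit()` is called on `X_train` alone, while `X_test` only ever goes through `.transform()`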
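
The point above can be checked on a small synthetic array. This is only a sketch: `X_demo`, `X_tr`, `X_te` and `demo_scaler` are illustrative names, and the random data stands in for the Ames features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a numeric feature matrix (assumption: 200 rows, 3 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# Split first, exactly as in the cell above
X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

# Fit on the training rows only
demo_scaler = StandardScaler().fit(X_tr)

# The learned means match the training-set means, not the full-data means
matches_train = np.allclose(demo_scaler.mean_, X_tr.mean(axis=0))
print("mean_ equals training mean:", matches_train)
print("training mean :", X_tr.mean(axis=0))
print("full-data mean:", X_demo.mean(axis=0))
```

The two printed mean vectors differ slightly, and that difference is exactly the information a scaler fitted on the full data would have picked up from the test rows.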

## 4️⃣ Identify Numeric Features for Scaling

In [4]:
numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
len(numeric_features)


38

## 5️⃣ Standard Scaling — Correct Way

In [5]:
scaler = StandardScaler()

# Fit ONLY on training data
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[numeric_features] = scaler.fit_transform(
    X_train[numeric_features]
)

# Apply learned parameters to test data
X_test_scaled[numeric_features] = scaler.transform(
    X_test[numeric_features]
)


## 6️⃣ What Did `.fit()` Learn?

In [6]:
scaler.mean_[:5], scaler.scale_[:5]


(array([1.46429863e+03, 7.14363168e+08, 3.82109570e+00, 4.20715046e+00,
        9.09177289e+00]),
 array([8.42802427e+02, 1.88650423e+08, 6.95662442e-01, 3.24837590e-01,
        5.06102309e-01]))

Interpretation

- .mean_ → feature-wise means (training set)

- .scale_ → feature-wise standard deviation (training set, population form with `ddof=0`)

- These values must NEVER be learned from test data
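
To make the role of these two attributes concrete, the sketch below reproduces `.transform()` by hand on a tiny made-up array. The arrays and the name `demo_scaler` are illustrative and not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up training and test arrays (2 features)
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_te = np.array([[4.0, 30.0]])

demo_scaler = StandardScaler().fit(X_tr)

# .transform() is (x - mean_) / scale_, using the training statistics
manual = (X_te - demo_scaler.mean_) / demo_scaler.scale_
auto = demo_scaler.transform(X_te)

print("manual:", manual)
print("auto  :", auto)
print("identical:", np.allclose(manual, auto))
```

Nothing about the test row changes `mean_` or `scale_`; it is only mapped through the values learned from the training rows.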

## 7️⃣ Sanity Check — Scaled Training Data

In [7]:
X_train_scaled[numeric_features].mean().round(3).head()


Order          -0.0
PID            -0.0
MS SubClass    -0.0
Lot Frontage   -0.0
Lot Area       -0.0
dtype: float64

In [8]:
X_train_scaled[numeric_features].std().round(3).head()

Order           1.0
PID             1.0
MS SubClass     1.0
Lot Frontage    1.0
Lot Area        1.0
dtype: float64

Expected:

- Mean ≈ 0

- Std ≈ 1 (pandas `.std()` defaults to `ddof=1`, so the unrounded value sits slightly above 1)
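
The sketch below, on synthetic data with illustrative names (`train`, `test`, `sc`), shows why the check is "≈" rather than "=": the scaler divides by the population standard deviation, pandas reports the sample one by default, and the scaled test set is not expected to have mean 0 and std 1 at all.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 100 training rows, 25 test rows)
rng = np.random.default_rng(1)
train = pd.DataFrame(rng.normal(5.0, 2.0, size=(100, 2)), columns=["a", "b"])
test = pd.DataFrame(rng.normal(5.0, 2.0, size=(25, 2)), columns=["a", "b"])

sc = StandardScaler().fit(train)
train_s = pd.DataFrame(sc.transform(train), columns=train.columns)
test_s = pd.DataFrame(sc.transform(test), columns=test.columns)

# Population std (ddof=0) is 1 up to floating-point error
print(train_s.std(ddof=0))
# pandas default (ddof=1) is sqrt(n / (n - 1)), slightly above 1
print(train_s.std())
# Test-set statistics are close to 0 and 1, but not equal to them
print(test_s.mean())
print(test_s.std())
```

Small deviations in the scaled test set are normal and are not a sign that the scaler was applied incorrectly.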

## 8️⃣ Save Scaled Datasets (Pipeline Discipline)

In [10]:
X_train_scaled.to_csv("X_train_scaled_v1.csv", index=False)
X_test_scaled.to_csv("X_test_scaled_v1.csv", index=False)
y_train.to_csv("y_train_v1.csv", index=False)
y_test.to_csv("y_test_v1.csv", index=False)

print("Saved:")
print("→ X_train_scaled_v1.csv")
print("→ X_test_scaled_v1.csv")
print("→ y_train_v1.csv")
print("→ y_test_v1.csv")

Saved:
→ X_train_scaled_v1.csv
→ X_test_scaled_v1.csv
→ y_train_v1.csv
→ y_test_v1.csv


## 9️⃣ What Happens If We Do It WRONG?

In [11]:
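# WRONG: fitting on the full data (train + test) leaks test statistics.
# Left commented out on purpose so it never runs as part of the pipeline.
# wrong_scaler = StandardScaler()
# wrong_scaler.fit(X[numeric_features])

# wrong_scaler.mean_[:5]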


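Key points:

- Test-set statistics leak into the training features through the scaler's mean and standard deviation

- Test-set performance becomes over-optimistic, because the test data is no longer truly unseen

- The model can then underperform on genuinely new data, which is how this mistake hurts real-world models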
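
A runnable version of the comparison, on synthetic data rather than the Ames features, might look like the sketch below. The skewed log-normal feature and the names `right` and `wrong` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# One skewed synthetic feature, so a few large values move the statistics
rng = np.random.default_rng(42)
X_demo = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))
X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

right = StandardScaler().fit(X_tr)    # training rows only
wrong = StandardScaler().fit(X_demo)  # leakage: test rows influence mean_ and scale_

print("train-only mean/std:", right.mean_[0], right.scale_[0])
print("full-data  mean/std:", wrong.mean_[0], wrong.scale_[0])

# The same test rows receive different scaled values under the two scalers
max_gap = np.abs(right.transform(X_te) - wrong.transform(X_te)).max()
print("largest difference on the test set:", max_gap)
```

How large the gap is depends on the data; the issue is that any nonzero gap comes from information the model should not have had.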

## 🔟 Notebook A5 Summary

We learned:

- Why split happens before scaling

- What `.fit()` really means

- Why `.transform()` must reuse the statistics that `.fit()` learned from the training set

- How to persist clean pipeline stages
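
The notebook persists the scaled CSVs. If later stages need to scale new rows the same way, one option is to persist the fitted scaler as well. The sketch below uses `joblib` for this; the file name `scaler_v1.joblib`, the temporary directory and the toy arrays are hypothetical and not part of the original pipeline.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training data standing in for X_train[numeric_features]
X_tr = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
fitted = StandardScaler().fit(X_tr)

# Hypothetical output location; a real pipeline would use a versioned path
out_dir = tempfile.mkdtemp()
scaler_path = os.path.join(out_dir, "scaler_v1.joblib")
joblib.dump(fitted, scaler_path)

# Later, or in another notebook: reload and transform without refitting
restored = joblib.load(scaler_path)
X_new = np.array([[4.0, 400.0]])
same = np.allclose(restored.transform(X_new), fitted.transform(X_new))
print("restored scaler reproduces the original transform:", same)
```

Reloading the scaler instead of refitting it keeps the training-set statistics fixed across every later stage.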