# Housing Price Modeling (Session 1)

This notebook follows a deliberately simple, top-to-bottom workflow so that it can later be refactored into a production script. It trains a regression model on `data/housing.csv`, evaluates it on a held-out 20% test split, saves the fitted pipeline, and shows how to predict with it.

- Dataset: `data/housing.csv`
- Target: `Price`
- Model: StandardScaler + SGDRegressor
- Metrics: RMSE, MAE, R²
- Artifacts: `scripts/session_1/housing_linear.joblib`

> Next steps (outside this notebook): move logic into `scripts/session_1/train.py` with CLI args and proper logging.
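
As a rough idea of what that refactor could look like, here is a minimal sketch of the argument parsing and logging setup for `train.py`. The flag names (`--data-path`, `--model-path`, `--test-size`, `--random-state`, `--log-level`) and the default paths relative to the project root are assumptions made for illustration; only the default values themselves are taken from this notebook.

```python
import argparse
import logging
from pathlib import Path


def parse_args(argv=None):
    """Parse command-line options for a hypothetical train.py."""
    parser = argparse.ArgumentParser(description="Train the housing price model.")
    # Flag names are illustrative; defaults mirror the values used in this notebook.
    parser.add_argument("--data-path", type=Path, default=Path("data/housing.csv"))
    parser.add_argument(
        "--model-path",
        type=Path,
        default=Path("scripts/session_1/housing_linear.joblib"),
    )
    parser.add_argument("--test-size", type=float, default=0.2)
    parser.add_argument("--random-state", type=int, default=42)
    parser.add_argument(
        "--log-level",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
    )
    return parser.parse_args(argv)


def main(argv=None):
    """Configure logging from the parsed options and return them."""
    args = parse_args(argv)
    logging.basicConfig(
        level=getattr(logging, args.log_level),
        format="%(asctime)s %(levelname)s %(name)s - %(message)s",
    )
    logger = logging.getLogger("housing")
    logger.info("Data path: %s", args.data_path)
    logger.info("Model path: %s", args.model_path)
    # Loading, training, evaluation, and saving would follow here.
    return args
```

In a real script, `main()` would be called under the usual `if __name__ == "__main__":` guard. Accepting `argv` as a parameter keeps the parser easy to exercise from a test or from this notebook, for example `parse_args(["--test-size", "0.3"])`.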


In [12]:
import logging
from pathlib import Path

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("housing")

# Paths
PROJECT_ROOT = Path("..")
DATA_PATH = PROJECT_ROOT / "data" / "housing.csv"
ARTIFACT_DIR = PROJECT_ROOT / "scripts" / "session_1"
ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)
MODEL_PATH = ARTIFACT_DIR / "housing_linear.joblib"

logger.info(f"Data path: {DATA_PATH}")
logger.info(f"Artifact dir: {ARTIFACT_DIR}")


2025-10-15 22:48:41,764 INFO housing - Data path: ../data/housing.csv
2025-10-15 22:48:41,764 INFO housing - Artifact dir: ../scripts/session_1


In [13]:
import pandas as pd

logger.info("Loading dataset...")
df = pd.read_csv(DATA_PATH)
logger.info(f"Loaded {len(df)} rows and {len(df.columns)} columns")

df.head()


2025-10-15 22:48:42,430 INFO housing - Loading dataset...
2025-10-15 22:48:42,437 INFO housing - Loaded 5000 rows and 7 columns


,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
0,79545.45857,5.682861,7.009188,4.09,23086.8005,1059034.0,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701..."
1,79248.64245,6.0029,6.730821,3.09,40173.07217,1505891.0,"188 Johnson Views Suite 079\nLake Kathleen, CA..."
2,61287.06718,5.86589,8.512727,5.13,36882.1594,1058988.0,"9127 Elizabeth Stravenue\nDanieltown, WI 06482..."
3,63345.24005,7.188236,5.586729,3.26,34310.24283,1260617.0,USS Barnett\nFPO AP 44820
4,59982.19723,5.040555,7.839388,4.23,26354.10947,630943.5,USNS Raymond\nFPO AE 09386
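
Before modeling, it can help to confirm that the columns seen in the preview are present, numeric, and free of missing values. The helper below is a sketch only: `check_columns` and `EXPECTED_NUMERIC` are hypothetical names, and the column list is copied from the preview above rather than from any documented schema.

```python
import pandas as pd

# Column names copied from the preview above; 'Address' is free text and is left out.
EXPECTED_NUMERIC = [
    "Avg. Area Income",
    "Avg. Area House Age",
    "Avg. Area Number of Rooms",
    "Avg. Area Number of Bedrooms",
    "Area Population",
    "Price",
]


def check_columns(frame: pd.DataFrame) -> list[str]:
    """Return a list of problems; an empty list means the frame looks usable."""
    problems = []
    for col in EXPECTED_NUMERIC:
        if col not in frame.columns:
            problems.append(f"missing column: {col}")
        elif not pd.api.types.is_numeric_dtype(frame[col]):
            problems.append(f"non-numeric column: {col}")
        elif frame[col].isna().any():
            problems.append(f"missing values in: {col}")
    return problems
```

Calling `check_columns(df)` right after loading would surface a renamed or malformed column early, instead of as a `KeyError` during feature selection.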


In [14]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

logger.info("Preparing features and target...")
# Identify target and basic features from the CSV header
TARGET = "Price"
ALL_COLUMNS = df.columns.tolist()
NUM_FEATURES = [
    "Avg. Area Income",
    "Avg. Area House Age",
    "Avg. Area Number of Rooms",
    "Avg. Area Number of Bedrooms",
    "Area Population",
]
CAT_FEATURES = [
    # 'Address' exists but is high-cardinality; we'll drop it for a simple baseline
]

X = df[NUM_FEATURES]
y = df[TARGET]

logger.info("Splitting train/test...")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

logger.info("Building pipeline...")
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), NUM_FEATURES),
        # ("cat", OneHotEncoder(handle_unknown="ignore"), CAT_FEATURES),
    ],
    remainder="drop",
)

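model = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("regressor", SGDRegressor(
            max_iter=5000,            # upper bound on the number of epochs
            tol=1e-4,                 # early-stopping tolerance on the training loss
            learning_rate="optimal",
            random_state=42,          # reproducible shuffling
            verbose=1,                # print per-epoch progress
        )),
    ]
)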

logger.info("Training model...")
model.fit(X_train, y_train)

logger.info("Evaluating model...")
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
logger.info(f"RMSE: {rmse:.2f} | MAE: {mae:.2f} | R2: {r2:.4f}")

rmse, mae, r2


2025-10-15 22:48:43,063 INFO housing - Preparing features and target...
2025-10-15 22:48:43,064 INFO housing - Splitting train/test...
2025-10-15 22:48:43,066 INFO housing - Building pipeline...
2025-10-15 22:48:43,066 INFO housing - Training model...
2025-10-15 22:48:43,077 INFO housing - Evaluating model...
2025-10-15 22:48:43,080 INFO housing - RMSE: 102872.04 | MAE: 82975.05 | R2: 0.9140
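
For reference, the three reported metrics can be written out directly. The sketch below uses a tiny made-up example (not the housing data) and plain NumPy; the function names are illustrative.

```python
import numpy as np


def rmse_score(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))


def mae_score(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(residuals)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Toy values chosen so the arithmetic is easy to follow by hand.
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.0, 5.0, 8.0, 11.0]

print(rmse_score(toy_true, toy_pred))  # sqrt(1.5), about 1.2247
print(mae_score(toy_true, toy_pred))   # 1.0
print(r2(toy_true, toy_pred))          # about 0.7, i.e. 1 - 6/20
```

RMSE and MAE are in the same units as `Price`, so an RMSE above the MAE, as in the log line above, indicates that some errors are noticeably larger than the typical one.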


-- Epoch 1
Norm: 9157128538600.54, NNZs: 5, Bias: -5594879402809.260742, T: 4000, Avg. loss: 139357686162922607499804672.000000
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 3069854213904.52, NNZs: 5, Bias: 496285158656.350342, T: 8000, Avg. loss: 14060167604749197148946432.000000
Total training time: 0.00 seconds.
-- Epoch 3
Norm: 2260655455408.22, NNZs: 5, Bias: -85984075351.807861, T: 12000, Avg. loss: 4611039422602054098485248.000000
Total training time: 0.00 seconds.
-- Epoch 4
Norm: 791937154521.12, NNZs: 5, Bias: -177484299868.038452, T: 16000, Avg. loss: 2112325263972229971968000.000000
Total training time: 0.00 seconds.
-- Epoch 5
Norm: 865589764406.31, NNZs: 5, Bias: -92012516843.681519, T: 20000, Avg. loss: 1092838592790266485145600.000000
Total training time: 0.00 seconds.
-- Epoch 6
Norm: 776061050879.56, NNZs: 5, Bias: -247924243400.909973, T: 24000, Avg. loss: 617870853586077834280960.000000
Total training time: 0.00 seconds.
-- Epoch 7
Norm: 624539093499.02, NNZs:

(np.float64(102872.04304935805), 82975.04669811542, 0.913984831459622)
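
The epoch log above starts with an average loss on the order of 1e26 and very large weight norms. One plausible contributor is that the target (`Price`, around 1e6) is not scaled, while the features are. A possible variation, sketched below on synthetic stand-in data rather than the housing dataset, wraps the pipeline in scikit-learn's `TransformedTargetRegressor` so that the target is standardized during fitting and predictions come back in the original units. The synthetic coefficients are invented for illustration, the learning-rate schedule is left at scikit-learn's default rather than `"optimal"`, and no claim is made about how this would change the metrics reported above.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the housing dataset): a linear target on the
# order of 1e6, to mimic the scale of Price.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 5))
demo_coef = np.array([2.0, 1.5, 1.0, 0.5, 1.2]) * 1e5
y_demo = 1.2e6 + X_demo @ demo_coef + rng.normal(scale=1e5, size=2000)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

inner = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("regressor", SGDRegressor(max_iter=5000, tol=1e-4, random_state=42)),
    ]
)

# Standardize y for fitting; predictions are mapped back to the original units.
scaled_target_model = TransformedTargetRegressor(
    regressor=inner, transformer=StandardScaler()
)
scaled_target_model.fit(Xd_train, yd_train)

demo_pred = scaled_target_model.predict(Xd_test)
demo_r2 = r2_score(yd_test, demo_pred)
print(f"R2 on synthetic data: {demo_r2:.4f}")
```

Because the wrapper exposes the same `fit` and `predict` interface, it could be saved and reloaded with `joblib` in the same way as the pipeline in this notebook.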

In [15]:
import joblib

logger.info(f"Saving model to {MODEL_PATH} ...")
joblib.dump(model, MODEL_PATH)
logger.info("Model saved.")

MODEL_PATH


2025-10-15 22:48:44,140 INFO housing - Saving model to ../scripts/session_1/housing_linear.joblib ...
2025-10-15 22:48:44,143 INFO housing - Model saved.


PosixPath('../scripts/session_1/housing_linear.joblib')

In [16]:
# Predict with the in-memory pipeline, then again after reloading it from disk

# Create a small batch from X_test
sample = X_test.iloc[:5]
logger.info("Predicting with in-memory model...")
preds_in_memory = model.predict(sample)

logger.info("Reloading model from disk and predicting...")
loaded = joblib.load(MODEL_PATH)
preds_loaded = loaded.predict(sample)

logger.info("Comparing predictions (should match closely):")
comparison = pd.DataFrame({
    "pred_in_memory": preds_in_memory,
    "pred_loaded": preds_loaded,
})
comparison


2025-10-15 22:48:44,834 INFO housing - Predicting with in-memory model...
2025-10-15 22:48:44,836 INFO housing - Reloading model from disk and predicting...
2025-10-15 22:48:44,838 INFO housing - Comparing predictions (should match closely):


,pred_in_memory,pred_loaded
0,1323582.0,1323582.0
1,1256339.0,1256339.0
2,1301581.0,1301581.0
3,1251606.0,1251606.0
4,1069425.0,1069425.0
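
The visual comparison above could also be turned into an automated check. The sketch below is self-contained and uses synthetic data plus a temporary directory, both of which are assumptions for illustration; in this notebook the same comparison would be applied to `preds_in_memory` and `preds_loaded`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic regression problem (not the housing data).
rng = np.random.default_rng(0)
X_rt = rng.normal(size=(200, 5))
y_rt = X_rt @ np.array([1.0, 2.0, 3.0, 4.0, 5.0]) + rng.normal(scale=0.1, size=200)

pipeline = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("regressor", SGDRegressor(random_state=42)),
    ]
).fit(X_rt, y_rt)

# Save to a temporary location and load the artifact back.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "model.joblib"
    joblib.dump(pipeline, artifact)
    reloaded = joblib.load(artifact)

batch = X_rt[:5]
predictions_match = np.array_equal(pipeline.predict(batch), reloaded.predict(batch))
print(predictions_match)
```

When the artifact is reloaded in the same environment that produced it, the predictions are expected to be identical rather than merely close, so an exact comparison is a reasonable assertion for a test. Loading under a different scikit-learn version is a separate concern that this check does not cover.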
