## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [7]:
import pandas as pd

# Features
X_train = pd.read_csv("../processed/X_train_scaled.csv")
X_test  = pd.read_csv("../processed/X_test_scaled.csv")

# Targets (single‐column, no header)
y_train = pd.read_csv("../processed/y_train.csv", header=None).iloc[:, 0]
y_test  = pd.read_csv("../processed/y_test.csv",  header=None).iloc[:, 0]

print(f"  X_train: {X_train.shape},  y_train: {y_train.shape}")
print(f"  X_test : {X_test.shape},  y_test :  {y_test.shape}")


  X_train: (5229, 12),  y_train: (5229,)
  X_test : (1308, 12),  y_test :  (1308,)


In [13]:
# impute any remaining NaNs
from sklearn.impute import SimpleImputer
import pandas as pd

# set up a median imputer
imputer = SimpleImputer(strategy="median")

X_train = pd.DataFrame(
    imputer.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)
X_test = pd.DataFrame(
    imputer.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)

# check
print("any missing in X_train?", X_train.isna().any().any())
print("any missing in X_test? ", X_test.isna().any().any())


any missing in X_train? False
any missing in X_test?  False


In [9]:

import pandas as pd
from sklearn.linear_model                 import LinearRegression
from sklearn.svm                          import SVR
from sklearn.ensemble                     import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.neighbors                    import KNeighborsRegressor

# define five baseline models (default hyperparameters, fixed seeds where available)
models = {
    "Linear Regression":               LinearRegression(),
    "Support Vector Regressor":        SVR(kernel="rbf"),
    "Random Forest":                   RandomForestRegressor(random_state=42, n_estimators=100),
    "Gradient‐Boosted Trees":          HistGradientBoostingRegressor(random_state=42),
    "K‐Nearest Neighbors Regressor":    KNeighborsRegressor(n_neighbors=5),
}
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)

# collect each model's test-set predictions into one DataFrame (one column per model)
preds_df = pd.DataFrame(predictions, index=X_test.index)
preds_df.head()

|   | Linear Regression | Support Vector Regressor | Random Forest | Gradient‐Boosted Trees | K‐Nearest Neighbors Regressor |
|---|---|---|---|---|---|
| 0 | 13.460148 | 13.961422 | 14.060679 | 14.010961 | 13.731231 |
| 1 | 12.32261 | 12.633278 | 12.600734 | 12.581851 | 12.611345 |
| 2 | 11.938019 | 11.917272 | 12.013707 | 11.978375 | 12.036589 |
| 3 | 12.184556 | 11.95082 | 11.938771 | 11.992038 | 11.9794 |
| 4 | 12.23293 | 12.690964 | 12.784129 | 12.786156 | 12.777481 |


Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use

In [12]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

metrics_log = {}
for name, preds_log in preds_df.items():
    rmse_log = np.sqrt(mean_squared_error(y_test, preds_log))
    mae_log  = mean_absolute_error(y_test, preds_log)
    r2_log   = r2_score(y_test, preds_log)
    metrics_log[name] = {
        "RMSE (log1p price)": round(rmse_log, 3),
        "MAE  (log1p price)": round(mae_log, 3),
        "R² (log1p price)"  : round(r2_log,  3),
    }

import pandas as pd
pd.DataFrame(metrics_log).T


| Model | RMSE (log1p price) | MAE (log1p price) | R² (log1p price) |
|---|---|---|---|
| Linear Regression | 0.537 | 0.325 | 0.686 |
| Support Vector Regressor | 0.292 | 0.125 | 0.907 |
| Random Forest | 0.022 | 0.007 | 0.999 |
| Gradient‐Boosted Trees | 0.087 | 0.039 | 0.992 |
| K‐Nearest Neighbors Regressor | 0.206 | 0.121 | 0.954 |


To compare our five baseline models fairly, we evaluated them on three complementary metrics applied to the log1p-transformed sale price:

Root Mean Squared Error (RMSE)
RMSE reports error in the same units as our transformed target and penalizes large mistakes more heavily. This is important because a single large misprediction on an expensive home can skew overall performance.
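Because the RMSE above is in log1p units, it can be hard to read as a dollar amount. The sketch below shows one way to undo the transform before scoring, assuming the target was created with `np.log1p`. The helper name `rmse_in_dollars` and the toy prices are hypothetical and are not taken from this notebook's data.

```python
import numpy as np

def rmse_in_dollars(y_true_log, y_pred_log):
    """RMSE computed after inverting a log1p transform on targets and predictions."""
    y_true = np.expm1(np.asarray(y_true_log, dtype=float))  # back to dollars
    y_pred = np.expm1(np.asarray(y_pred_log, dtype=float))  # back to dollars
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy prices (hypothetical), pushed through log1p to mimic the transformed target
toy_true_log = np.log1p([200_000, 350_000, 500_000])
toy_pred_log = np.log1p([210_000, 340_000, 450_000])

# Dollar errors are 10k, 10k and 50k, so the result is about 30,000
print(rmse_in_dollars(toy_true_log, toy_pred_log))
```

In this notebook the same helper could be applied to `y_test` and a column of `preds_df`. Note that dollar-scale RMSE is dominated by expensive homes, whereas log-scale RMSE behaves more like a relative error.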

Mean Absolute Error (MAE)
MAE also reports in log-price units but treats every error equally. It gives a clear sense of the “typical” mistake without letting outliers dominate.
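To make the difference in outlier sensitivity concrete, here is a small illustration on made-up error values (not the notebook's residuals). The helper names `rmse` and `mae` are defined only for this sketch.

```python
import numpy as np

def rmse(errors):
    """Root mean squared error of a vector of residuals."""
    return float(np.sqrt(np.mean(np.square(errors))))

def mae(errors):
    """Mean absolute error of a vector of residuals."""
    return float(np.mean(np.abs(errors)))

typical = np.full(10, 0.1)       # ten residuals of 0.1 log units each
with_outlier = typical.copy()
with_outlier[-1] = 2.0           # replace one residual with a large miss

print("no outlier  : RMSE", rmse(typical), "MAE", mae(typical))
print("one outlier : RMSE", rmse(with_outlier), "MAE", mae(with_outlier))
```

With identical residuals the two metrics coincide. After a single large miss, RMSE rises much more than MAE, which is why reporting both gives a fuller picture.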

R² (Coefficient of Determination)
R² measures the proportion of variance in the log1p-price that our features explain. A higher R² means our model captures more of the underlying patterns in home values.
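The prompt above also asks about adjusted R², which is not computed in this notebook. A minimal sketch using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1) follows. The function name `adjusted_r2` is hypothetical; the inputs are the Linear Regression R² (0.686), the test-set size (1308), and the feature count (12) reported earlier.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R²: penalizes R² for the number of predictors used."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

# Linear Regression on the test set: R² = 0.686, n = 1308 rows, p = 12 features
adj = adjusted_r2(0.686, n_samples=1308, n_features=12)
print(round(adj, 3))
```

With over a thousand rows and only 12 features the adjustment is small, so plain R² is likely adequate for comparing these baselines.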



## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
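Forward selection is listed above but not attempted in this notebook. One way to try it is scikit-learn's `SequentialFeatureSelector`, sketched here on synthetic data from `make_regression` rather than the housing features. The variable names (`X_demo`, `y_demo`, `sfs`) and the choice of three features are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: 8 features, only 3 of which carry signal
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=42
)

# Greedily add one feature at a time, scoring each candidate with 5-fold CV
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="forward",   # use "backward" for backward elimination
    cv=5,
)
sfs.fit(X_demo, y_demo)

selected = sfs.get_support(indices=True)  # column indices of the kept features
print(selected)
```

On the real data, `X_train` and `y_train` would replace the synthetic arrays, and `X_train.columns[sfs.get_support()]` would give the selected column names.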



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)

In [None]:
# 1) pick features with LassoCV 
from sklearn.linear_model import LassoCV
lasso = LassoCV(cv=5, random_state=42).fit(X_train, y_train)

# which coefficients survived?
import pandas as pd
coef = pd.Series(lasso.coef_, index=X_train.columns)
lasso_feats = coef[coef.abs() > 1e-6].index.tolist()
print(f"Lasso picked {len(lasso_feats)} features out of {X_train.shape[1]}")

# 2) pick top 10 with RFE + RandomForest
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rfe = RFE(rf, n_features_to_select=10).fit(X_train, y_train)
rfe_feats = X_train.columns[rfe.support_].tolist()
print(f"RFE picked {len(rfe_feats)} features: {rfe_feats}")


Lasso picked 12 features out of 12
RFE picked 10 features: ['list_price', 'description.year_built', 'description.lot_sqft', 'description.sqft', 'description.baths', 'location.address.coordinate.lon', 'location.address.coordinate.lat', 'city_mean_price', 'state_mean_price', 'total_rooms']


In [None]:
# 3) build reduced train/test sets 
X_train_lasso = X_train[lasso_feats]
X_test_lasso  = X_test [lasso_feats]

X_train_rfe   = X_train[rfe_feats]
X_test_rfe    = X_test [rfe_feats]


In [None]:
# 4) refit our 5 models and compare metrics 
import numpy as np
from sklearn.linear_model            import LinearRegression
from sklearn.svm                     import SVR
from sklearn.ensemble                import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.neighbors               import KNeighborsRegressor
from sklearn.metrics                 import mean_squared_error, mean_absolute_error, r2_score

models = {
    "Linear Regression":   LinearRegression(),
    "SVR (RBF)":           SVR(kernel="rbf"),
    "Random Forest":       RandomForestRegressor(n_estimators=100, random_state=42),
    "HistGB Regressor":    HistGradientBoostingRegressor(random_state=42),
    "KNN Regressor":       KNeighborsRegressor(n_neighbors=5),
}

def eval_set(X_tr, X_te, label):
    out = {}
    for name, m in models.items():
        m.fit(X_tr, y_train)
        p = m.predict(X_te)
        mse = mean_squared_error(y_test, p)
        out[name] = {
            f"{label} RMSE": np.sqrt(mse),
            f"{label} MAE":  mean_absolute_error(y_test, p),
            f"{label} R²":   r2_score(y_test, p)
        }
    return pd.DataFrame(out).T

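# Use distinct names so the fitted LassoCV (`lasso`) and RFE (`rfe`) objects are not overwritten
full_metrics  = eval_set(X_train,       X_test,       "Full")
lasso_metrics = eval_set(X_train_lasso, X_test_lasso, "Lasso")
rfe_metrics   = eval_set(X_train_rfe,   X_test_rfe,   "RFE")

comparison = pd.concat([full_metrics, lasso_metrics, rfe_metrics], axis=1).round(3)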
print(comparison)


                   Full RMSE  Full MAE  Full R²  Lasso RMSE  Lasso MAE  \
Linear Regression      0.537     0.325    0.686       0.537      0.325   
SVR (RBF)              0.292     0.125    0.907       0.292      0.125   
Random Forest          0.022     0.007    0.999       0.022      0.007   
HistGB Regressor       0.087     0.039    0.992       0.087      0.039   
KNN Regressor          0.206     0.121    0.954       0.206      0.121   

                   Lasso R²  RFE RMSE  RFE MAE  RFE R²  
Linear Regression     0.686     0.544    0.330   0.678  
SVR (RBF)             0.907     0.283    0.122   0.913  
Random Forest         0.999     0.022    0.007   0.999  
HistGB Regressor      0.992     0.087    0.040   0.992  
KNN Regressor         0.954     0.211    0.122   0.952  


Lasso kept all 12 features, so the Lasso columns match the full-feature results exactly. RFE reduced the set from 12 to 10 features, and performance for Random Forest and HistGradientBoosting was essentially unchanged (RMSE of 0.022 and 0.087 with both feature sets). SVR improved slightly (R² from 0.907 to 0.913), while Linear Regression and KNN degraded slightly (R² from 0.686 to 0.678 and from 0.954 to 0.952). RFE therefore gives a somewhat leaner feature set without a meaningful loss of accuracy, which makes it an attractive choice when we want faster training and a simpler, more maintainable pipeline.
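If RFE is carried into the final pipeline, one option is to wrap imputation, selection, and the model in a single scikit-learn `Pipeline`, so that every step is fit on training data only. The sketch below uses synthetic data and smaller forests to keep it quick. The step names (`impute`, `select`, `model`) and all variable names are illustrative assumptions, not the notebook's final design.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the 12 scaled housing features
X_demo, y_demo = make_regression(
    n_samples=300, n_features=12, n_informative=6, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),          # same strategy as above
    ("select", RFE(RandomForestRegressor(n_estimators=20, random_state=42),
                   n_features_to_select=10)),              # keep 10 of 12 features
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])
pipe.fit(X_tr, y_tr)

print("features kept:", int(pipe.named_steps["select"].support_.sum()))
print("test R²:", pipe.score(X_te, y_te))
```

Keeping selection inside the pipeline also means it is refit within each fold if cross-validation is added later.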