## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [2]:
# impute any remaining NaNs
from sklearn.impute import SimpleImputer
import pandas as pd

# set up a median imputer
imputer = SimpleImputer(strategy="median")

X_train = pd.DataFrame(
    imputer.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)
X_test = pd.DataFrame(
    imputer.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)

# check
print("Any missing in X_train?", X_train.isna().any().any())
print("Any missing in X_test? ", X_test.isna().any().any())


Any missing in X_train? False
Any missing in X_test?  False


In [None]:

# load the pre-processed train/test splits
import pandas as pd

X_train = pd.read_csv("../processed/X_train_scaled.csv")
X_test  = pd.read_csv("../processed/X_test_scaled.csv")

# the target files have no header row, so read them as a single unnamed column
y_train = pd.read_csv("../processed/y_train.csv", header=None).iloc[:, 0]
y_test  = pd.read_csv("../processed/y_test.csv",  header=None).iloc[:, 0]

print("✔️  Loaded shapes:")
print(f"  X_train: {X_train.shape},  y_train: {y_train.shape}")
print(f"  X_test : {X_test.shape},  y_test :  {y_test.shape}")



✔️  Loaded shapes:
  X_train: (6327, 13),  y_train: (6327,)
  X_test : (1582, 13),  y_test :  (1582,)


In [21]:
# load pre‐processed train/test splits
import pandas as pd


X_train = pd.read_csv("../processed/X_train_scaled.csv")
X_test  = pd.read_csv("../processed/X_test_scaled.csv")


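# The target CSVs have no header row, so reading them with the default
# header=0 turns the first price into the column name, and looking up
# "description.sold_price" raises a KeyError. Name the column on load instead.
target = "description.sold_price"
y_train = pd.read_csv("../processed/y_train.csv", header=None, names=[target])[target]
y_test  = pd.read_csv("../processed/y_test.csv",  header=None, names=[target])[target]

print(f"  X_train: {X_train.shape},  y_train: {y_train.shape}")
print(f"  X_test : {X_test.shape},  y_test : {y_test.shape}")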

In [None]:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor

# define five baseline models, all with default (untuned) hyperparameters
models = {
    "Linear Regression":               LinearRegression(),
    "Support Vector Regressor":        SVR(kernel="rbf"),
    "Random Forest":                   RandomForestRegressor(random_state=42, n_estimators=100),
    "Gradient‐Boosted Trees":          HistGradientBoostingRegressor(random_state=42),
    "K‐Nearest Neighbors Regressor":    KNeighborsRegressor(n_neighbors=5),
}

# fit each model on the training split and predict on the test split
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)

# collect the predictions into a DataFrame, one column per model
preds_df = pd.DataFrame(predictions, index=X_test.index)
preds_df.head()


   Linear Regression  Support Vector Regressor  Random Forest  Gradient‐Boosted Trees  K‐Nearest Neighbors Regressor
0          1320801.0             312247.477953      1278030.0               1326687.0                      1207400.0
1           302883.7              311853.00324       296980.0                287932.2                       300000.0
2           153330.0             311700.932187       165000.0                156006.0                       169000.0
3           137711.0             311779.172336       153090.0                160788.3                       160000.0
4           253668.7              311908.00485       356885.0                350339.8                       354200.0
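
The Support Vector Regressor predictions above are nearly constant (about $312,000 for every row). A likely cause, though we have not verified it on this dataset, is that SVR's defaults (`C=1.0`, `epsilon=0.1`) are sized for targets of roughly unit scale, not for prices in dollars. The sketch below uses synthetic data (not the housing data) to show one possible remedy: wrapping the regressor in scikit-learn's `TransformedTargetRegressor` so the target is standardized during fitting and converted back to dollars at prediction time. The names `X_demo`, `y_demo`, `raw_svr`, and `scaled_svr` are illustrative.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for scaled features and a dollar-scale target
# (an assumption for illustration only, not the real dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 3))
y_demo = (
    300_000
    + 100_000 * X_demo[:, 0]
    + 50_000 * X_demo[:, 1]
    + rng.normal(scale=10_000, size=400)
)
X_fit, X_eval = X_demo[:300], X_demo[300:]
y_fit, y_eval = y_demo[:300], y_demo[300:]

# SVR fit directly on the raw dollar target
raw_svr = SVR(kernel="rbf").fit(X_fit, y_fit)

# Same SVR, but the target is standardized before fitting and
# inverse-transformed back to dollars when predicting
scaled_svr = TransformedTargetRegressor(
    regressor=SVR(kernel="rbf"),
    transformer=StandardScaler(),
).fit(X_fit, y_fit)

r2_raw = r2_score(y_eval, raw_svr.predict(X_eval))
r2_scaled = r2_score(y_eval, scaled_svr.predict(X_eval))
print(f"R² with raw target:    {r2_raw:.3f}")
print(f"R² with scaled target: {r2_scaled:.3f}")
```

If the same pattern holds on the real splits, the SVR row in the metrics comparison would reflect the missing target scaling more than the model family itself.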


Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
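
To make the outlier question concrete, here is a minimal sketch on made-up prices (not the notebook's data). Two sets of predictions have the same MAE, but one spreads the error evenly and the other concentrates it in a single large miss; RMSE only rises in the second case.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical sale prices, for illustration only
y_true = np.array([200_000, 250_000, 300_000, 350_000, 2_000_000], dtype=float)

# Case A: every prediction is off by $10,000
pred_even = y_true + 10_000

# Case B: four predictions are exact, one is off by $50,000
pred_outlier = y_true.copy()
pred_outlier[-1] -= 50_000

mae_even = mean_absolute_error(y_true, pred_even)
mae_outlier = mean_absolute_error(y_true, pred_outlier)
rmse_even = np.sqrt(mean_squared_error(y_true, pred_even))
rmse_outlier = np.sqrt(mean_squared_error(y_true, pred_outlier))

print(f"Even errors:   MAE={mae_even:,.0f}  RMSE={rmse_even:,.0f}")
print(f"One big miss:  MAE={mae_outlier:,.0f}  RMSE={rmse_outlier:,.0f}")
```

Both cases average $10,000 of absolute error, so MAE cannot tell them apart. RMSE squares each error before averaging, so the single $50,000 miss pushes it well above the even case.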

In [None]:
# compute evaluation metrics for each model and compare the results
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

metrics = {}
for name, preds in preds_df.items():
    mse  = mean_squared_error(y_test, preds)
    rmse = np.sqrt(mse)                   # square root puts the error back in dollars
    mae  = mean_absolute_error(y_test, preds)
    r2   = r2_score(y_test, preds)
    metrics[name] = {
        "RMSE ($)": int(rmse),
        "MAE ($)": int(mae),
        "R²":       round(r2, 3)
    }

results = pd.DataFrame(metrics).T
print(results)




                               RMSE ($)   MAE ($)     R²
Linear Regression               79489.0   41457.0  0.985
Support Vector Regressor       666130.0  237451.0 -0.028
Random Forest                   29599.0    4388.0  0.998
Gradient‐Boosted Trees         186738.0   27068.0  0.919
K‐Nearest Neighbors Regressor  129996.0   46709.0  0.961


We chose RMSE, MAE, and R² because each one answers a different question about prediction error, and together they give a fuller picture than any single metric:

Root Mean Squared Error (RMSE) expresses error in dollars and penalizes large deviations more heavily than small ones. That matters here because a single misestimate on a high-value property can amount to tens of thousands of dollars.

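Mean Absolute Error (MAE) also reports in dollars but weights every dollar of error equally, so it shows what a “typical” prediction error looks like without letting a few extreme outliers dominate. Comparing the two is informative: the Random Forest's RMSE ($29,599) is far above its MAE ($4,388), which suggests that most of its predictions are close and a small number miss by a wide margin.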

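R² (Coefficient of Determination) reports the proportion of variance in test-set sale prices that the model's predictions explain. The Linear Regression's R² of 0.985, for example, means it accounts for about 98.5% of that variance. R² is unitless, so it does not by itself say how large the errors are in dollars; that is what RMSE and MAE are for.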
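
The prompt also asks about adjusted R², which the cell above does not compute. A minimal sketch follows, using the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1) with the test-set shape reported earlier (n = 1582 rows, p = 13 features) and the R² values from the results table. The helper name `adjusted_r2` is our own, not a scikit-learn function.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjust R² for the number of features used by the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Test-set shape and R² values copied from the outputs above
n_test, n_feats = 1582, 13
reported_r2 = {
    "Linear Regression": 0.985,
    "Support Vector Regressor": -0.028,
    "Random Forest": 0.998,
    "Gradient-Boosted Trees": 0.919,
    "K-Nearest Neighbors Regressor": 0.961,
}

for name, r2 in reported_r2.items():
    print(f"{name:<32} R²={r2:.3f}  adjusted R²={adjusted_r2(r2, n_test, n_feats):.3f}")
```

With 1,582 test rows and only 13 features the adjustment is small, so plain R² is a reasonable choice here. The adjustment would matter more if the feature count were large relative to the number of rows.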



## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)
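
One way the placeholder steps above could be filled in is sketched below, under stated assumptions: it uses Lasso-based selection via scikit-learn's `SelectFromModel`, runs on synthetic data from `make_regression` (not the housing data), and refits only a Random Forest for the comparison. The names `X_tr`, `X_te`, `selector`, and `evaluate` are illustrative. On the real data, the notebook's own `X_train`, `X_test`, `y_train`, and `y_test` would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 13 features, only 5 of which carry signal
X, y = make_regression(
    n_samples=500, n_features=13, n_informative=5, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Perform feature selection: keep features whose Lasso coefficient is non-zero.
# The selector is fit on the training split only, to avoid leaking test data.
selector = SelectFromModel(LassoCV(cv=5, random_state=42)).fit(X_tr, y_tr)
mask = selector.get_support()
X_tr_sel = selector.transform(X_tr)
X_te_sel = selector.transform(X_te)
print(f"Kept {mask.sum()} of {mask.size} features")


def evaluate(train_features, test_features):
    """Refit a Random Forest and return (R², MAE) on the test split."""
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(train_features, y_tr)
    preds = model.predict(test_features)
    return r2_score(y_te, preds), mean_absolute_error(y_te, preds)


# Refit and compare against the full feature set
r2_full, mae_full = evaluate(X_tr, X_te)
r2_sel, mae_sel = evaluate(X_tr_sel, X_te_sel)
print(f"Full features:     R²={r2_full:.3f}  MAE={mae_full:.1f}")
print(f"Selected features: R²={r2_sel:.3f}  MAE={mae_sel:.1f}")
```

If the reduced set scores about the same as the full set on the chosen metrics, that would support keeping the selection step in the final pipeline for the sake of a simpler model.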