## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross-validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
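
Of the ideas listed above, support vector machines are the most sensitive to feature scale, so they are usually wrapped in a pipeline with a scaler. The block below is a minimal sketch under stated assumptions, not the notebook's own code: `make_regression` generates synthetic data as a stand-in for the processed features, and the names `X_demo`, `y_demo`, and `svr_model` are hypothetical. The kernel and `C` value are arbitrary starting points, not tuned choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the processed data (assumption, not the real dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Scale features first: SVR depends on distances between samples
svr_model = make_pipeline(StandardScaler(), SVR(kernel="linear", C=100.0))
svr_model.fit(X_tr, y_tr)

svr_preds = svr_model.predict(X_te)
svr_r2 = svr_model.score(X_te, y_te)   # R^2 on the held-out split
print(f"SVR test R2: {svr_r2:.3f}")
```

Because the scaler lives inside the pipeline, it is fitted on the training split only, and the same object could be added to a dictionary of candidate models alongside the others.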

In [7]:
import pandas as pd

# 1. Load the processed train/test splits from Part 1
X_train = pd.read_csv("../data/processed/X_train.csv")
X_test  = pd.read_csv("../data/processed/X_test.csv")
# squeeze() turns each single-column frame into a Series
y_train = pd.read_csv("../data/processed/y_train.csv").squeeze()
y_test  = pd.read_csv("../data/processed/y_test.csv").squeeze()


In [8]:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# 1) Select only numeric features
num_cols    = X_train.select_dtypes(include=[np.number]).columns
X_train_num = X_train[num_cols].copy()
X_test_num  = X_test[num_cols].copy()

# 2) Impute any missing values (median strategy).
#    The imputer is fitted on the training set only, then applied to the test set.
imp = SimpleImputer(strategy="median")
X_train_imp = imp.fit_transform(X_train_num)
X_test_imp  = imp.transform(X_test_num)

# 3) Define candidate models
models = {
    "LinearRegression": LinearRegression(),
    "Ridge"           : Ridge(alpha=1.0),
    "DecisionTree"    : DecisionTreeRegressor(random_state=42),
    "RandomForest"    : RandomForestRegressor(n_estimators=100, random_state=42)
}

# 4) Train them
for name, model in models.items():
    model.fit(X_train_imp, y_train)
    print(f"✓ {name} trained")


✓ LinearRegression trained
✓ Ridge trained
✓ DecisionTree trained
✓ RandomForest trained


Consider what metrics you want to use to evaluate success.
- Mean squared error is expressed in squared units. Can you actually relate that number to the size of a typical error?
- Try root mean squared error so that the error is back in the original units (dollars).
- How does RMSE treat outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
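
To make the questions above concrete, here is a small sketch comparing RMSE and MAE on two hand-made sets of predictions with the same total absolute error, plus the usual adjusted R^2 formula. The prices and the helper names `rmse`, `mae`, and `adjusted_r2` are illustrative assumptions, not values or functions from the dataset or notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(diff)))

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2: penalizes R^2 for the number of features used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical sale prices in dollars
y_true = np.array([200_000.0, 250_000.0, 300_000.0, 350_000.0, 400_000.0])

# Case A: every prediction is off by $10,000
y_pred_even = y_true - 10_000.0

# Case B: four perfect predictions and one that is off by $50,000
y_pred_outlier = y_true.copy()
y_pred_outlier[-1] -= 50_000.0

print("even   :", rmse(y_true, y_pred_even), mae(y_true, y_pred_even))
print("outlier:", rmse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier))
print("adjusted R2:", adjusted_r2(0.90, n_samples=101, n_features=10))
```

Both cases have an MAE of $10,000, but the RMSE of the outlier case is more than twice that of the even case, because squaring weights the single large miss heavily. Which behavior is preferable depends on how costly large misses are for the problem.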

In [9]:
# gather evaluation metrics and compare results

# 1) Import metrics (numpy and pandas are already imported in the cells above)
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# 2) Define a small helper to compute all metrics
def evaluate(name, y_true, y_pred):
    mse  = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae  = mean_absolute_error(y_true, y_pred)
    r2   = r2_score(y_true, y_pred)
    return {
        "model": name,
        "RMSE": rmse,
        "MAE":  mae,
        "R2":   r2
    }

# 3) Loop through the trained models and collect results
results = []
for name, model in models.items():
    preds = model.predict(X_test_imp)     # must be the same feature matrix layout the models were trained on
    results.append(evaluate(name, y_test, preds))

# 4) Build a DataFrame and sort it
results_df = pd.DataFrame(results).sort_values("RMSE").reset_index(drop=True)
results_df

|   | model | RMSE | MAE | R2 |
|---|---|---|---|---|
| 0 | DecisionTree | 24677.880629 | 2343.112825 | 0.995102 |
| 1 | RandomForest | 42299.420281 | 12209.684555 | 0.98561 |
| 2 | Ridge | 294775.896769 | 181173.49216 | 0.301157 |
| 3 | LinearRegression | 294787.04493 | 181217.430887 | 0.301104 |
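
The results above are test-set scores only. One possible sanity check, which is not part of the original notebook, is to compute the same metric on the training set too: a large gap suggests overfitting, while near-perfect scores on both splits can hint that a feature leaks the target. The sketch below uses synthetic data from `make_regression` as a stand-in, and `train_test_gap` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def train_test_gap(model, X_tr, y_tr, X_te, y_te):
    """Return (train RMSE, test RMSE) for an already-fitted model."""
    rmse_tr = float(np.sqrt(mean_squared_error(y_tr, model.predict(X_tr))))
    rmse_te = float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))
    return rmse_tr, rmse_te

# Synthetic stand-in for the processed data (assumption, not the real dataset)
X_demo, y_demo = make_regression(
    n_samples=400, n_features=10, noise=20.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# An unpruned tree can memorize the training rows
tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
rmse_tr, rmse_te = train_test_gap(tree, X_tr, y_tr, X_te, y_te)
print(f"train RMSE: {rmse_tr:.2f}   test RMSE: {rmse_te:.2f}")
```

In the notebook itself, the same helper could be called inside the loop over `models` with `X_train_imp`, `y_train`, `X_test_imp`, and `y_test`.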


## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended that you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, forward/backward selection):
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain
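
Two of the approaches named above, Lasso and RFE, can be sketched with scikit-learn's `SelectFromModel` and `RFE`. This is a minimal illustration on synthetic data (`make_regression` with 5 informative features out of 20), not a result from the housing data. The variable names and the choice of keeping 5 features in RFE are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 20 features, only 5 of which drive the target
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=5.0, random_state=0
)

# Lasso penalizes coefficient size, so features should share a common scale.
# (In a real pipeline, fit the scaler on the training split only.)
X_scaled = StandardScaler().fit_transform(X_demo)

# Lasso: keep the features whose coefficients are not shrunk to (near) zero
lasso_selector = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X_scaled, y_demo)
lasso_mask = lasso_selector.get_support()

# RFE: repeatedly refit and drop the weakest feature until 5 remain
rfe_selector = RFE(LinearRegression(), n_features_to_select=5).fit(X_scaled, y_demo)
rfe_mask = rfe_selector.support_

print("Lasso kept columns:", np.flatnonzero(lasso_mask))
print("RFE kept columns:  ", np.flatnonzero(rfe_mask))
```

With a DataFrame, the boolean masks can be used to index the column names, for example `num_cols[rfe_mask]` in this notebook's naming.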

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)
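
One way the three steps in the cell above might look is sketched here with forward selection via scikit-learn's `SequentialFeatureSelector`. It runs on synthetic data rather than the processed housing features, and the names `X_demo`, `selector`, `r2_full`, and `r2_reduced`, along with the choice of 5 features, are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 15 features, only 5 of which drive the target
X_demo, y_demo = make_regression(
    n_samples=300, n_features=15, n_informative=5, noise=5.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# 1) Perform feature selection on the training split only
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward", cv=5
)
selector.fit(X_tr, y_tr)
X_tr_sel = selector.transform(X_tr)
X_te_sel = selector.transform(X_te)

# 2) Refit the model on the full and on the reduced feature sets
full_model = LinearRegression().fit(X_tr, y_tr)
reduced_model = LinearRegression().fit(X_tr_sel, y_tr)

# 3) Compare the chosen metric on the held-out split
r2_full = r2_score(y_te, full_model.predict(X_te))
r2_reduced = r2_score(y_te, reduced_model.predict(X_te_sel))
print(f"R2 with 15 features: {r2_full:.4f}")
print(f"R2 with  5 features: {r2_reduced:.4f}")
```

If the reduced model scores about the same as the full one, that supports keeping the selection step in the final pipeline. Setting `direction="backward"` gives backward elimination with the same interface.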