## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross-validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
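
A minimal sketch of what a quick baseline pass over the ideas above could look like. It runs on synthetic data from `make_regression` (an assumption, swap in your own `X` and `y`), and the names `baseline_models` and `baseline_r2` are made up for illustration. xgboost is left out so the example only needs scikit-learn; its `XGBRegressor` follows the same `fit`/`predict` pattern and could be added to the dictionary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): replace with the real features and target.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Default settings everywhere: no tuning at this stage.
baseline_models = {
    "linear regression": LinearRegression(),
    # SVR is sensitive to feature scale, so it is wrapped with a scaler.
    "support vector machine": make_pipeline(StandardScaler(), SVR(kernel="linear")),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model and record its R^2 on the held-out rows.
baseline_r2 = {}
for name, estimator in baseline_models.items():
    estimator.fit(X_tr, y_tr)
    baseline_r2[name] = r2_score(y_te, estimator.predict(X_te))

for name, score in baseline_r2.items():
    print(f"{name}: test R^2 = {score:.3f}")
```

Keeping the models in a dictionary makes it cheap to add or remove a candidate without touching the fitting loop.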

In [20]:
# import models and fit - start with linear regression
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# import the csv from 1 - EDA.ipynb

data = pd.read_csv("/Users/thomasdoherty/Desktop/tdsf-midterm/df_linear_model.csv")



In [21]:
data.head(5)

Unnamed: 0,photos,price_reduced_amount,description.sold_price,description.sqft,description.baths,description.garage,description.stories,description.beds,products.brand_name,description.type_townhomes,days_listed,n_amenities,n_high_amenities,age_yrs,n_rooms
0,0,0.0,129900.0,1478.0,2.0,2.0,1.0,3.0,1,0,28,10.0,4.0,26,5.0
1,1,3000.0,88500.0,1389.0,2.0,1.0,2.0,4.0,1,0,67,1.0,1.0,79,6.0
2,0,0.0,145000.0,2058.0,2.0,0.0,1.0,3.0,1,0,28,4.0,1.0,55,5.0
3,1,9000.0,65000.0,1432.0,2.0,0.0,1.0,3.0,1,0,195,3.0,1.0,69,5.0
4,1,5000.0,169000.0,1804.0,2.0,0.0,1.0,3.0,2,0,75,3.0,1.0,40,5.0


In [22]:
# import statsmodels
import statsmodels.api as sm

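# X is every column except the target, description.sold_price.
# This keeps 'Unnamed: 0', which looks like the row index saved with the CSV.
X = data.drop('description.sold_price', axis=1)

# statsmodels' OLS does not fit an intercept unless X has a constant column
X = sm.add_constant(X)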

In [23]:
# initialize y as description.sold_price
y = data['description.sold_price']

In [26]:
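# Split into train and test sets.
# Note: train_size=0.2 trains on the first 20% of rows and tests on the remaining 80%;
# with shuffle=False the rows stay in file order and random_state is ignored.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=False, train_size=0.2, random_state=42)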
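
To see what these split arguments do, here is a small sketch on stand-in arrays with 1434 rows (the observation count reported in the OLS summary in this notebook). The second call is only an assumption about what may have been intended, a shuffled split with 20% held out; it is not what the cell above runs. All `*_demo`, `*_a` and `*_b` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays; each value is just its own row number.
n_rows = 1434
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows)

# Same arguments as the notebook cell: the first 20% of rows train, the rest test.
X_tr_a, X_te_a, _, _ = train_test_split(
    X_demo, y_demo, shuffle=False, train_size=0.2
)

# Alternative (assumption about intent): shuffled split with 20% held out.
X_tr_b, X_te_b, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("unshuffled, train_size=0.2:", len(X_tr_a), "train /", len(X_te_a), "test")
print("shuffled, test_size=0.2:   ", len(X_tr_b), "train /", len(X_te_b), "test")
```

If the CSV is sorted in any meaningful way (by date, price, or neighbourhood), an unshuffled split can leave the train and test sets looking quite different from each other.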

In [27]:
# Train our model
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)

In [29]:
# Check performance on train and test set
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_train_pred = reg.predict(X_train)
y_test_pred = reg.predict(X_test)

r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

mae_train = mean_absolute_error(y_train, y_train_pred)
mae_test = mean_absolute_error(y_test, y_test_pred)

mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)

rmse_train = np.sqrt(mse_train)
rmse_test = np.sqrt(mse_test)


print(f'Train R^2: {r2_train}\nTest R^2:  {r2_test}\n')
print(f'Train MAE: {mae_train}\nTest MAE:  {mae_test}\n')
print(f'Train MSE: {mse_train}\nTest MSE:  {mse_test}\n')
print(f'Train RMSE: {rmse_train}\nTest RMSE:  {rmse_test}')

Train R^2: 0.49750643766193303
Test R^2:  0.3039116111857322

Train MAE: 148452.04313448473
Test MAE:  193380.93010931715

Train MSE: 46819317199.72055
Test MSE:  213541658648.7287

Train RMSE: 216377.71881531738
Test RMSE:  462105.67909162154


**Separate statistical model results (below) from ML (above)**

In [24]:
# create a statsmodels OLS (ordinary least squares) model object
lin_reg = sm.OLS(y,X)

In [25]:
# fit the model
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                              OLS Regression Results                              
Dep. Variable:     description.sold_price   R-squared:                       0.379
Model:                                OLS   Adj. R-squared:                  0.373
Method:                     Least Squares   F-statistic:                     66.58
Date:                    Mon, 16 Sep 2024   Prob (F-statistic):          1.44e-136
Time:                            21:57:40   Log-Likelihood:                -20551.
No. Observations:                    1434   AIC:                         4.113e+04
Df Residuals:                        1420   BIC:                         4.120e+04
Df Model:                              13                                         
Covariance Type:                nonrobust                                         
                                 coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------

Consider what metrics you want to use to evaluate success.
- Mean squared error is measured in squared dollars. Can you actually relate to an error of that size?
- Try root mean squared error so that the error is back in the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
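
A small worked sketch for the questions above, using made-up prices (an assumption, not data from this project). It shows how one large miss moves RMSE much more than MAE, and how adjusted R^2 is derived from R^2. `adjusted_r2` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2: R^2 penalized for the number of predictors used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical sale prices in dollars; every prediction is off by $10,000.
y_true = np.array([100_000, 150_000, 200_000, 250_000, 300_000], dtype=float)
y_pred = y_true + 10_000

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# Turn one prediction into a large miss ($210,000 off) to mimic an outlier.
y_pred_outlier = y_pred.copy()
y_pred_outlier[-1] += 200_000

mae_outlier = mean_absolute_error(y_true, y_pred_outlier)
rmse_outlier = np.sqrt(mean_squared_error(y_true, y_pred_outlier))

print(f"no outlier:   MAE = {mae:,.0f}  RMSE = {rmse:,.0f}")
print(f"with outlier: MAE = {mae_outlier:,.0f}  RMSE = {rmse_outlier:,.0f}")

# Using the figures from the OLS summary in this notebook:
# R^2 = 0.379, 1434 observations, 13 model degrees of freedom.
print(f"adjusted R^2 = {adjusted_r2(0.379, 1434, 13):.3f}")
```

Because RMSE squares each error before averaging, a handful of badly mispriced homes can dominate it, while MAE weights every dollar of error equally.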

In [None]:
# gather evaluation metrics and compare results
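
One possible way to gather the metrics is a single table with one row per model. This is a sketch with made-up targets and predictions (assumptions, only there to show the layout); `summarize_metrics`, `model_a` and `model_b` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def summarize_metrics(model_name, y_true, y_pred):
    """Return one row of evaluation metrics for a set of predictions."""
    return {
        "model": model_name,
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
    }


# Made-up targets and predictions; replace with y_test and each model's output.
y_demo = np.array([100.0, 200.0, 300.0, 400.0])
demo_predictions = {
    "model_a": np.array([110.0, 190.0, 310.0, 390.0]),
    "model_b": np.array([150.0, 150.0, 350.0, 350.0]),
}

# One row per model, sorted so the lowest RMSE comes first.
results = (
    pd.DataFrame(
        [summarize_metrics(name, y_demo, pred) for name, pred in demo_predictions.items()]
    )
    .set_index("model")
    .sort_values("rmse")
)
print(results)
```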

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
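
As a rough illustration of Lasso-based selection, the sketch below fits `LassoCV` on synthetic data where only 3 of 10 features drive the target (an assumption, not this project's data) and keeps the features whose coefficients are not shrunk to zero. The variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 10 features, only 3 of them informative.
X_demo, y_demo, true_coef = make_regression(
    n_samples=300,
    n_features=10,
    n_informative=3,
    noise=5.0,
    coef=True,
    random_state=0,
)

# Scale first so the L1 penalty treats all features comparably.
X_scaled = StandardScaler().fit_transform(X_demo)

# LassoCV picks the penalty strength by cross-validation.
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y_demo)

# Features with a non-zero coefficient survive the selection.
kept_features = np.flatnonzero(lasso.coef_ != 0)
print("features kept by Lasso:", kept_features)
print("truly informative:     ", np.flatnonzero(true_coef))
```

On real data there is no ground truth to compare against, so the kept set is judged by refitting and checking your chosen metrics.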



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)
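
A minimal sketch of the select, refit, compare loop described in the comments above, using RFE with linear regression on synthetic data (an assumption). The choice of 3 features to keep is arbitrary for the demo, and all variable names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 10 features, only 3 of them informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Baseline: all features.
full_model = LinearRegression().fit(X_tr, y_tr)
r2_full = r2_score(y_te, full_model.predict(X_te))

# Recursive feature elimination, fitted on the training rows only.
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X_tr, y_tr)

# Refit on the reduced feature set and score on the same test rows.
reduced_model = LinearRegression().fit(rfe.transform(X_tr), y_tr)
r2_reduced = r2_score(y_te, reduced_model.predict(rfe.transform(X_te)))

print(f"full feature set:    test R^2 = {r2_full:.3f}")
print(f"reduced feature set: test R^2 = {r2_reduced:.3f}")
```

Fitting the selector on the training split only keeps the test rows out of the selection step, so the comparison between the two scores stays fair.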