This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
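
Before comparing the models listed above, it can help to have a trivial reference point. The following is a minimal sketch, assuming synthetic stand-in data rather than the housing dataset, that scores scikit-learn's `DummyRegressor` (which always predicts the training mean) next to a linear regression. The variable names are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in data (assumption: NOT the housing dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

# A mean-predicting baseline: any useful model should beat this.
dummy = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

dummy_r2 = r2_score(y_te, dummy.predict(X_te))
linear_r2 = r2_score(y_te, linear.predict(X_te))
print(f"Dummy R^2:  {dummy_r2:.3f}")
print(f"Linear R^2: {linear_r2:.3f}")
```

A model that scores close to the dummy baseline is probably not learning much from the features.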

In [80]:
import pandas as pd

# Load CSV files

x_train_df = pd.read_csv('../data/processed/x_train.csv')
x_test_df = pd.read_csv('../data/processed/x_test.csv')
y_train_df = pd.read_csv('../data/processed/y_train.csv')
y_test_df = pd.read_csv('../data/processed/y_test.csv')


In [85]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score

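# Use the DataFrames loaded above; squeeze the single-column targets into Series
x_train, x_test = x_train_df, x_test_df
y_train = y_train_df.squeeze()
y_test = y_test_df.squeeze()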

# Display all columns and nested data

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.expand_frame_repr', False)

# Display the first few rows of each dataset

print("x_train:")
print(x_train.head())
print("x_test:")
print(x_test.head())
print("y_train:")
print(y_train.head())
print("y_test:")
print(y_test.head())

def evaluate_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"Model: {model.__class__.__name__}")
    print(f"Mean Squared Error: {mse}")
    print(f"R-squared: {r2}")
    print("-" * 30)
    return y_pred, mse, r2

# Initialize models

models = [
    LinearRegression(),
    SVR(),
    RandomForestRegressor(random_state=42),
    xgb.XGBRegressor(random_state=42)
]

# Evaluate each model once, storing its predictions and keeping a handle on XGBoost

predictions = {}
xgboost_model = None
for model in models:
    # evaluate_model fits the model and prints its metrics
    y_pred, mse, r2 = evaluate_model(model, x_train, y_train, x_test, y_test)
    predictions[model.__class__.__name__] = y_pred
    if isinstance(model, xgb.XGBRegressor):
        xgboost_model = model

# After the loop, y_pred, mse and r2 hold the results of the last model (XGBoost)

# Save the fitted XGBoost model explicitly rather than relying on the loop variable
xgboost_model.save_model('../models/xgboost_model.model')

%store xgboost_model


x_train:
   last_update_date  list_date  open_houses  property_id  community  listing_id  price_reduced_amount  matterport  primary_photo.href  source.plan_id  source.agents  source.spec_id  source.type  description.year_built  description.baths_3qtr  description.sold_date  description.sold_price  description.baths_full  description.name  description.baths_half  description.lot_sqft  description.sqft  description.baths  description.sub_type  description.baths_1qtr  description.garage  description.stories  description.beds  description.type  lead_attributes.show_contact_an_agent  flags.is_new_construction  flags.is_for_rent  flags.is_subdivision  flags.is_contingent  flags.is_price_reduced  flags.is_pending  flags.is_foreclosure  flags.is_plan  flags.is_coming_soon  flags.is_new_listing  products.brand_name  other_listings.rdc  location.address.postal_code  location.address.coordinate.lon  location.address.coordinate.lat  location.address.state_code  location.address.line  location.stre



Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
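
To make the RMSE versus MAE question concrete, here is a small sketch using made-up prices (hypothetical values, not taken from the dataset). It compares both metrics when every prediction is off by the same amount and when a single prediction is badly wrong.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical sale prices in dollars (made-up values for illustration).
y_true = np.array([200_000, 250_000, 300_000, 350_000, 400_000], dtype=float)
y_pred_clean = y_true + 10_000              # every prediction off by $10k
y_pred_outlier = y_pred_clean.copy()
y_pred_outlier[-1] = y_true[-1] + 200_000   # one prediction off by $200k


def rmse_and_mae(actual, predicted):
    """Return (RMSE, MAE) for a pair of arrays."""
    rmse = np.sqrt(mean_squared_error(actual, predicted))
    mae = mean_absolute_error(actual, predicted)
    return rmse, mae


rmse_clean, mae_clean = rmse_and_mae(y_true, y_pred_clean)
rmse_out, mae_out = rmse_and_mae(y_true, y_pred_outlier)

print(f"No outlier:  RMSE={rmse_clean:,.0f}  MAE={mae_clean:,.0f}")
print(f"One outlier: RMSE={rmse_out:,.0f}  MAE={mae_out:,.0f}")
```

When all errors are equal the two metrics agree. With one large miss, RMSE rises considerably more than MAE because the error is squared before averaging.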

In [82]:
# gather evaluation metrics and compare results

# MSE is in squared units of the target, so it is hard to relate to the size of a typical error. Taking its square root (RMSE) puts the error back in the target's own units.

# Root mean squared error:

from sklearn.metrics import mean_squared_error
import numpy as np

# mse comes from the last model evaluated in the loop above (XGBoost)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error: {rmse}")


# RMSE is more sensitive to outliers than MAE: because errors are squared before averaging, a few large errors dominate the score.

# MAE is easy to interpret (the average absolute error, in the target's units) and is less sensitive to outliers, so it is a reasonable metric for this problem.

# R-squared measures overall fit. Adjusted R-squared additionally penalizes each added predictor, so it does not reward a model just for using more features. It does not prevent overfitting by itself, and the context of the problem still matters.

from sklearn.metrics import r2_score

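# Calculate R-squared (y_pred holds the predictions of the last model evaluated, XGBoost)

r_squared = r2_score(y_test, y_pred)

# Calculate adjusted R-squared
# Note: the formula is only meaningful when n > p + 1. If the test set has no more rows than features, the denominator is negative and the result can exceed 1.

n = len(x_test)  # Number of samples in the test set
p = x_test.shape[1]  # Number of features
adjusted_r_squared = 1 - (1 - r_squared) * ((n - 1) / (n - p - 1))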

print("R-squared:", r_squared)
print("Adjusted R-squared:", adjusted_r_squared)

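# I would go with adjusted R-squared, as it penalizes model complexity by accounting for the number of predictors used.
# Caveat: an adjusted R-squared above 1 is not a valid value and suggests the test set has no more rows than features. A near-perfect R-squared is also worth checking for target leakage (description.sold_price appears among the x_train columns printed above).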

Root Mean Squared Error: 0.36573971454558235
R-squared: 0.9999727886305444
Adjusted R-squared: 1.014231546225281
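
The adjusted R-squared printed above is greater than 1, which is consistent with the test set having no more rows than features. As a hedged sketch, a small helper (the name `adjusted_r2` is hypothetical, not part of scikit-learn) can guard against that case by returning NaN when the degrees of freedom are not positive.

```python
import math


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2, or NaN when there are too few samples for it to be defined."""
    dof = n_samples - n_features - 1  # residual degrees of freedom
    if dof <= 0:
        return math.nan
    return 1 - (1 - r2) * (n_samples - 1) / dof


# Enough samples: the adjustment lowers R^2 slightly.
print(adjusted_r2(0.90, n_samples=100, n_features=5))

# Fewer samples than features: the statistic is undefined, so NaN is returned.
print(adjusted_r2(0.90, n_samples=10, n_features=20))
```

Returning NaN is one design choice among several; raising an error or computing the statistic on the training set instead would also be defensible.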


**STRETCH**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
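
As a starting point, the sketch below shows two of the approaches mentioned above (Lasso via `SelectFromModel`, and `RFE`) on synthetic data from `make_regression`. The dataset, `alpha=1.0`, the choice of 5 features, and the helper `holdout_r2` are all assumptions for illustration and would need adapting to the real feature set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 20 features, only 5 of which carry signal.
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Scale features so Lasso penalizes all coefficients comparably.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Lasso-based selection: keep features whose coefficients are not shrunk to zero.
lasso_mask = SelectFromModel(Lasso(alpha=1.0)).fit(X_tr_s, y_tr).get_support()

# Recursive feature elimination: drop the weakest feature until 5 remain.
rfe_mask = RFE(LinearRegression(), n_features_to_select=5).fit(X_tr_s, y_tr).get_support()


def holdout_r2(mask):
    """Refit a linear model on the selected columns and score it on the test split."""
    model = LinearRegression().fit(X_tr_s[:, mask], y_tr)
    return r2_score(y_te, model.predict(X_te_s[:, mask]))


full_r2 = holdout_r2(np.ones(X.shape[1], dtype=bool))
lasso_r2 = holdout_r2(lasso_mask)
rfe_r2 = holdout_r2(rfe_mask)

print(f"All features ({X.shape[1]}): R^2={full_r2:.3f}")
print(f"Lasso ({lasso_mask.sum()} kept):  R^2={lasso_r2:.3f}")
print(f"RFE ({rfe_mask.sum()} kept):    R^2={rfe_r2:.3f}")
```

If the reduced feature sets score about the same as the full set, that supports keeping the simpler model in the final pipeline.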



In [54]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)