This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [41]:
# import models and fit

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score

data_dir = '../data/processed/'

# Load the datasets
x_train = pd.read_csv(f'{data_dir}x_train.csv')
x_test = pd.read_csv(f'{data_dir}x_test.csv')
y_train = pd.read_csv(f'{data_dir}y_train.csv')
y_test = pd.read_csv(f'{data_dir}y_test.csv')

y_train = y_train.squeeze()
y_test = y_test.squeeze()

# Display all columns and nested data
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.expand_frame_repr', False)

# Display the first few rows of each dataset
print("x_train:")
print(x_train.head())
print("x_test:")
print(x_test.head())
print("y_train:")
print(y_train.head())
print("y_test:")
print(y_test.head())

# Function to evaluate models
def evaluate_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"Model: {model.__class__.__name__}")
    print(f"Mean Squared Error: {mse}")
    print(f"R-squared: {r2}")
    print("-" * 30)

# Initialize models
models = [
    LinearRegression(),
    SVR(),
    RandomForestRegressor(random_state=42),
    xgb.XGBRegressor(random_state=42)
]

# Evaluate each model
for model in models:
    evaluate_model(model, x_train, y_train, x_test, y_test)




x_train:
   last_update_date  list_date  open_houses  list_price  property_id  community  virtual_tours  listing_id  price_reduced_amount  matterport  primary_photo.href  source.plan_id  source.agents  source.spec_id  source.type  description.year_built  description.baths_3qtr  description.sold_date  description.baths_full  description.name  description.baths_half  description.lot_sqft  description.sqft  description.baths  description.sub_type  description.baths_1qtr  description.garage  description.stories  description.beds  description.type  lead_attributes.show_contact_an_agent  flags.is_new_construction  flags.is_for_rent  flags.is_subdivision  flags.is_contingent  flags.is_price_reduced  flags.is_pending  flags.is_foreclosure  flags.is_plan  flags.is_coming_soon  flags.is_new_listing  products.brand_name

ValueError: could not convert string to float: "[{'type': None, 'href': 'https://listings.nextdoorphotos.com/4724allpointsviewway/?mls'}]"
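
The `ValueError` above is raised because scikit-learn estimators try to cast every column to float, and the loaded frame still contains string columns such as `primary_photo.href`. One possible workaround for a quick baseline is to fit on numeric columns only. The sketch below is not the notebook's intended preprocessing: the helper name `numeric_features`, the median fill for missing values, and the toy data are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def numeric_features(train, test):
    """Keep numeric columns only and fill gaps with the training medians.

    Hypothetical helper: dropping every non-numeric column is a blunt
    baseline choice, not a substitute for proper encoding.
    """
    cols = train.select_dtypes(include="number").columns
    medians = train[cols].median()
    return train[cols].fillna(medians), test[cols].fillna(medians)


# Toy stand-ins for x_train / x_test / y_train (values are made up).
toy_train = pd.DataFrame({
    "description.sqft": [1200.0, 1500.0, None, 2000.0],
    "description.beds": [2, 3, 3, 4],
    "primary_photo.href": ["[{'href': 'a'}]", "[{'href': 'b'}]",
                           "[{'href': 'c'}]", "[{'href': 'd'}]"],
})
toy_test = pd.DataFrame({
    "description.sqft": [1800.0],
    "description.beds": [3],
    "primary_photo.href": ["[{'href': 'e'}]"],
})
toy_y = pd.Series([200_000, 250_000, 260_000, 330_000])

X_tr, X_te = numeric_features(toy_train, toy_test)

# The string column is gone, so the fit no longer raises ValueError.
model = LinearRegression().fit(X_tr, toy_y)
pred = model.predict(X_te)
print(list(X_tr.columns), pred)
```

Note that `include="number"` does not match boolean columns, so any `flags.*` columns stored as `bool` would need `include=["number", "bool"]` to be kept.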

Consider what metrics you want to use to evaluate success.
- If you think about mean squared error (MSE), can we actually relate to the amount of error? Its units are dollars squared.
- Try root mean squared error (RMSE) so that the error is expressed in the original units (dollars).
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
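
To make the questions above concrete, here is a minimal sketch that computes RMSE, MAE, R^2 and adjusted R^2, then compares two sets of predictions with the same MAE but different outlier behaviour. The helper name `regression_metrics`, the `n_features` argument, and the price values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred, n_features):
    """Return RMSE, MAE, R^2 and adjusted R^2 for one set of predictions."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    # Adjusted R^2 penalises R^2 for the number of predictors used.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"RMSE": rmse, "MAE": mae, "R2": r2, "Adj_R2": adj_r2}


# Made-up sale prices in dollars.
y_true = np.array([200_000, 250_000, 300_000, 350_000, 400_000], dtype=float)

# Case 1: every prediction is off by $10,000.
y_even = y_true + 10_000

# Case 2: four predictions are exact, one is off by $50,000.
y_outlier = y_true.copy()
y_outlier[-1] += 50_000

m_even = regression_metrics(y_true, y_even, n_features=2)
m_outlier = regression_metrics(y_true, y_outlier, n_features=2)
print(m_even)
print(m_outlier)
```

Both cases have an MAE of $10,000, but RMSE rises from $10,000 to roughly $22,361 in the second case, because squaring weights the single large miss more heavily.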

In [None]:
# gather evaluation metrics and compare results
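
One way to fill in this cell is to collect each model's metrics into a single DataFrame so they can be compared side by side. The sketch below assumes a hypothetical helper `compare_models` and uses synthetic data from `make_regression` in place of the housing data. XGBoost is left out only to keep the example dependent on scikit-learn alone.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR


def compare_models(models, X_train, y_train, X_test, y_test):
    """Fit each model and return one row of test metrics per model."""
    rows = []
    for model in models:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        rows.append({
            "model": model.__class__.__name__,
            "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
            "MAE": mean_absolute_error(y_test, y_pred),
            "R2": r2_score(y_test, y_pred),
        })
    # Lowest RMSE first, so the strongest baseline is the top row.
    return pd.DataFrame(rows).sort_values("RMSE").reset_index(drop=True)


# Synthetic stand-in data; replace with the real train/test frames.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

results = compare_models(
    [LinearRegression(), SVR(), RandomForestRegressor(n_estimators=50, random_state=42)],
    X_tr, y_tr, X_te, y_te,
)
print(results)
```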

**STRETCH**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection):
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
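
As a rough illustration of two of the approaches named above, the sketch below runs Lasso-based selection (`SelectFromModel` wrapped around `LassoCV`) and recursive feature elimination (`RFE`) on synthetic data, then compares test R^2 on the full and reduced feature sets. The data, the choice of three features for RFE, and the variable names are assumptions, not part of the project data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, of which only 3 actually drive the target.
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=5.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Scale using training statistics only, so Lasso penalises features evenly.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Lasso: keep features whose coefficients are not shrunk to (near) zero.
lasso_selector = SelectFromModel(LassoCV(cv=5, random_state=42)).fit(X_tr_s, y_tr)
lasso_mask = lasso_selector.get_support()

# RFE: repeatedly drop the weakest feature until 3 remain.
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X_tr_s, y_tr)
rfe_mask = rfe.get_support()

# Refit on the full and the RFE-reduced feature sets and compare test R^2.
full_r2 = r2_score(
    y_te, LinearRegression().fit(X_tr_s, y_tr).predict(X_te_s))
reduced_r2 = r2_score(
    y_te, LinearRegression().fit(X_tr_s[:, rfe_mask], y_tr).predict(X_te_s[:, rfe_mask]))

print("Lasso kept:", lasso_mask.sum(), "features")
print("RFE kept:", rfe_mask.sum(), "features")
print(f"R2 full: {full_r2:.3f}  R2 reduced: {reduced_r2:.3f}")
```

If the reduced set scores about the same as the full set on your chosen metrics, that supports keeping feature selection in the final pipeline. Forward/backward selection could be tried in the same way with scikit-learn's `SequentialFeatureSelector`.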



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)