# Project Milestone Two: Modeling and Feature Engineering

### Due: Midnight on August 3 (with 2-hour grace period) and worth 50 points

### Overview

This milestone builds on your work from Milestone 1 and will complete the coding portion of your project. You will:

1. Pick 3 modeling algorithms from those we have studied.
2. Evaluate baseline models using default settings.
3. Engineer new features and re-evaluate models.
4. Use feature selection techniques and re-evaluate.
5. Fine-tune for optimal performance.
6. Select your best model and report on your results. 

You must do all work in this notebook and upload to your team leader's account in Gradescope. There is no
Individual Assessment for this Milestone. 


In [1]:
# ===================================
# Useful Imports: Add more as needed
# ===================================

# Standard Libraries
import os
import time
import math
import io
import zipfile
import requests
from urllib.parse import urlparse
from itertools import chain, combinations

# Data Science Libraries
import numpy as np
import pandas as pd
import seaborn as sns

# Visualization
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.ticker as mticker  # Optional: Format y-axis labels as dollars

# Scikit-learn (Machine Learning)
from sklearn.model_selection import (
    train_test_split, 
    cross_val_score, 
    GridSearchCV, 
    RandomizedSearchCV, 
    RepeatedKFold
)
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SequentialFeatureSelector, f_regression, SelectKBest
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

# Progress Tracking

from tqdm import tqdm

# =============================
# Global Variables
# =============================
random_state = 42

# =============================
# Utility Functions
# =============================

# Format y-axis labels as dollars with commas (optional)
def dollar_format(x, pos):
    return f'${x:,.0f}'

# Convert seconds to HH:MM:SS format
def format_hms(seconds):
    return time.strftime("%H:%M:%S", time.gmtime(seconds))



In [2]:
url = "https://www.cs.bu.edu/fac/snyder/cs505/Data/zillow_dataset.csv"

filename = os.path.basename(urlparse(url).path)

if not os.path.exists(filename):
    try:
        print("Downloading the file...")
        response = requests.get(url)
        response.raise_for_status()  # Raise an error for bad status codes
        with open(filename, "wb") as f:
            f.write(response.content)
        print("File downloaded successfully.")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading the file: {e}")
else:
    print("File already exists. Skipping download.")

df = pd.read_csv(filename)

File already exists. Skipping download.


In [3]:
df_clean = df.drop(columns=['parcelid', 'latitude', 'longitude', 'rawcensustractandblock', 'censustractandblock', 'regionidzip', 'regionidcity', 'regionidcounty', 'assessmentyear'])

In [4]:
df_clean = df_clean.drop(columns=['buildingclasstypeid', 'finishedsquarefeet13','finishedsquarefeet15',
                                    'finishedsquarefeet50','basementsqft', 'storytypeid',
                                  'yardbuildingsqft26','yardbuildingsqft17','pooltypeid10','pooltypeid2',
                                  'poolsizesum','fireplaceflag', 'architecturalstyletypeid',
                                  'typeconstructiontypeid','finishedsquarefeet6','decktypeid','hashottuborspa',
                                  'taxdelinquencyyear', 'taxdelinquencyflag','finishedfloor1squarefeet',
                                  'fireplacecnt', 'threequarterbathnbr','poolcnt','pooltypeid7','airconditioningtypeid',
                                  'numberofstories', 'garagecarcnt','garagetotalsqft','regionidneighborhood',
                                  'propertyzoningdesc', 'propertycountylandusecode'])

In [5]:
q1 = df_clean['taxvaluedollarcnt'].quantile(0.25)
q3 = df_clean['taxvaluedollarcnt'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

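# Keep rows inside the 1.5*IQR fences; .copy() makes df an independent frame,
# so later column assignments do not act on a slice of df_clean
df = df_clean[(df_clean['taxvaluedollarcnt'] >= lower_bound) & (df_clean['taxvaluedollarcnt'] <= upper_bound)].copy()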

In [6]:
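# Assign the filled column back instead of calling fillna(..., inplace=True) on a column selection
df['heatingorsystemtypeid'] = df['heatingorsystemtypeid'].fillna(-1)
df['buildingqualitytypeid'] = df['buildingqualitytypeid'].fillna(df['buildingqualitytypeid'].median())
df['unitcnt'] = df['unitcnt'].fillna(1)
df['lotsizesquarefeet'] = df['lotsizesquarefeet'].fillna(df['lotsizesquarefeet'].median())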


In [7]:
df = df.dropna()

### Prelude: Load your Preprocessed Dataset from Milestone 1

In Milestone 1, you handled missing values, encoded categorical features, and explored your data. Before you begin this milestone, you’ll need to load that cleaned dataset and prepare it for modeling. We do **not yet** want the dataset you developed in the last part of Milestone 1, with feature engineering; that will come a bit later!

Here’s what to do:

1. Return to your Milestone 1 notebook and rerun your code through Part 3, where your dataset was fully cleaned (assume it’s called `df_cleaned`).

2. **Save** the cleaned dataset to a file by running:

>   df_cleaned.to_csv("zillow_cleaned.csv", index=False)

3. Switch to this notebook and **load** the saved data:

>   df = pd.read_csv("zillow_cleaned.csv")

4. Create a **train/test split** using `train_test_split`.  
   
5. **Standardize** the features (but not the target!) using **only the training data.** This ensures consistency across models without introducing data leakage from the test set:

>   scaler = StandardScaler()   
>   X_train_scaled = scaler.fit_transform(X_train)    
  
**Notes:** 

- You will have to redo the scaling step if you introduce new features (which have to be scaled as well).


In [8]:
df.head()

|   | bathroomcnt | bedroomcnt | buildingqualitytypeid | calculatedbathnbr | calculatedfinishedsquarefeet | finishedsquarefeet12 | fips | fullbathcnt | heatingorsystemtypeid | lotsizesquarefeet | propertylandusetypeid | roomcnt | unitcnt | yearbuilt | taxvaluedollarcnt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3.5 | 4.0 | 6.0 | 3.5 | 3100.0 | 3100.0 | 6059.0 | 3.0 | -1.0 | 4506.0 | 261.0 | 0.0 | 1.0 | 1998.0 | 1023282.0 |
| 1 | 1.0 | 2.0 | 6.0 | 1.0 | 1465.0 | 1465.0 | 6111.0 | 1.0 | -1.0 | 12647.0 | 261.0 | 5.0 | 1.0 | 1967.0 | 464000.0 |
| 2 | 2.0 | 3.0 | 6.0 | 2.0 | 1243.0 | 1243.0 | 6059.0 | 2.0 | -1.0 | 8432.0 | 261.0 | 6.0 | 1.0 | 1962.0 | 564778.0 |
| 3 | 3.0 | 4.0 | 8.0 | 3.0 | 2376.0 | 2376.0 | 6037.0 | 3.0 | 2.0 | 13038.0 | 261.0 | 0.0 | 1.0 | 1970.0 | 145143.0 |
| 4 | 3.0 | 3.0 | 8.0 | 3.0 | 1312.0 | 1312.0 | 6037.0 | 3.0 | 2.0 | 278581.0 | 266.0 | 0.0 | 1.0 | 1964.0 | 119407.0 |


In [9]:
X = df.drop('taxvaluedollarcnt', axis=1)
y = df['taxvaluedollarcnt']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)

In [11]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [12]:
# Add as many cells as you need


### Part 1: Picking Three Models and Establishing Baselines [6 pts]

Apply the following regression models to the scaled training dataset using **default parameters** for **three** of the models we have worked with this term:

- Linear Regression
- Ridge Regression
- Lasso Regression
- Decision Tree Regression
- Bagging
- Random Forest
- Gradient Boosting Trees

For each of the three models:
- Use **repeated cross-validation** (e.g., 5 folds, 5 repeats).
- Report the **mean and standard deviation of CV MAE Score**. 
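
One way to organize the repeated cross-validation described above is a small helper that wraps each model in a pipeline, so the scaler is re-fit inside every training fold. The sketch below is only an illustration: it runs on synthetic data from `make_regression`, and the helper name `cv_mae` and the variables `X_demo`, `y_demo`, and `results` are assumptions, not part of the assignment.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data (assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)


def cv_mae(model, X, y, n_splits=5, n_repeats=5, random_state=42):
    """Return (mean, std) of MAE over repeated K-fold cross-validation."""
    cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=random_state)
    # Scaling lives inside the pipeline, so each fold fits its own scaler
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, scoring="neg_mean_absolute_error", cv=cv)
    mae = -scores  # scikit-learn reports negated MAE, so flip the sign
    return mae.mean(), mae.std()


candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}
results = {name: cv_mae(model, X_demo, y_demo) for name, model in candidates.items()}
for name, (mean_mae, std_mae) in results.items():
    print(f"{name}: MAE = {mean_mae:.2f} ± {std_mae:.2f}")
```

Taking the standard deviation of the negated scores gives the same value as taking it of the raw scores, so either form can be reported.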


In [13]:
# Add as many cells as you need
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)

Linear Regression Model:

In [14]:
LR = LinearRegression()
LR_Score = cross_val_score(LR, X_train_scaled, y_train, scoring='neg_mean_absolute_error', cv=cv)
LR_Mean = -LR_Score.mean()
LR_Std = LR_Score.std()

Decision Tree Model:

In [15]:
DTR = DecisionTreeRegressor(random_state=random_state)
DTR_Score = cross_val_score(DTR, X_train_scaled, y_train, scoring='neg_mean_absolute_error', cv=cv)
DTR_Mean = -DTR_Score.mean()
DTR_Std = DTR_Score.std()

Random Forest Model:

In [16]:
RF = RandomForestRegressor(random_state=random_state)
RF_Score = cross_val_score(RF, X_train_scaled, y_train, scoring='neg_mean_absolute_error', cv=cv)


In [17]:
RF_Mean = -RF_Score.mean()
RF_Std = RF_Score.std()

Mean and Standard Deviation For All Models (Linear Regression, Decision Tree, Random Forest).

In [18]:
print(f"Linear Regression Mean = {LR_Mean:.2f}")
print(f"Linear Regression Standard Deviation: = {LR_Std:.2F}")
print("-")
print(f"Decision Tree Regression Mean = {DTR_Mean:.2f}")
print(f"Decision Tree Regression Standard Deviation: = {DTR_Std:.2F}")
print("-")
print(f"Random Forest Regression Mean = {RF_Mean:.2f}")
print(f"Random Forest Regression Standard Deviation: = {RF_Std:.2F}")

Linear Regression Mean = 158828.61
Linear Regression Standard Deviation: = 1152.54
-
Decision Tree Regression Mean = 191043.02
Decision Tree Regression Standard Deviation: = 1347.86
-
Random Forest Regression Mean = 147088.30
Random Forest Regression Standard Deviation: = 1236.79


### Part 1: Discussion [3 pts]

In a paragraph or well-organized set of bullet points, briefly compare and discuss:

  - Which model performed best overall?
  - Which was most stable (lowest std)?
  - Any signs of overfitting or underfitting?

Overall, Random Forest performed best, with the lowest mean CV MAE of the three models: Linear Regression = 158,828.61, Decision Tree = 191,043.02, Random Forest = 147,088.30.
The most stable model was Linear Regression, which had the lowest standard deviation (1,152.54), although all three standard deviations are within about 200 of each other. As for signs of overfitting or underfitting, the Decision Tree has both the highest mean MAE and the highest variability (1,347.86), which is consistent with an unpruned tree overfitting the training folds. Linear Regression is the most stable but has a higher mean MAE than Random Forest, which could suggest some underfitting. Random Forest strikes the best balance and is the best baseline model overall.
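
A common way to look for overfitting more directly is to compare the error on the training folds with the error on the validation folds. The sketch below uses `cross_validate` with `return_train_score=True` on synthetic data; the data and the variable names are illustrative assumptions, and the numbers it prints say nothing about the Zillow results above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=42)

cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
res = cross_validate(
    DecisionTreeRegressor(random_state=42),
    X_demo,
    y_demo,
    scoring="neg_mean_absolute_error",
    cv=cv,
    return_train_score=True,  # also score each model on its own training fold
)

train_mae = -res["train_score"].mean()
val_mae = -res["test_score"].mean()
print(f"Train MAE: {train_mae:.2f}")
print(f"Validation MAE: {val_mae:.2f}")
```

A training error far below the validation error points toward overfitting, while two similarly high errors point toward underfitting.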


### Part 2: Feature Engineering [6 pts]

Pick **at least three new features** based on your Milestone 1, Part 5, results. You may pick new ones or
use the same ones you chose for Milestone 1. 

Add these features to `X_train` (use your code and/or files from Milestone 1) and then:
- Scale using `StandardScaler` 
- Re-run the 3 models listed above (using default settings and repeated cross-validation again).
- Report the **mean and standard deviation of CV MAE Scores**.  
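
Adding features to `X_train` means the original columns stay in the matrix alongside the new ones. A minimal sketch of that step, assuming the column names shown by `df.head()` above; the helper name `add_engineered_features` and the three-row toy frame are illustrative only:

```python
import numpy as np
import pandas as pd


def add_engineered_features(X):
    """Return a copy of X with three engineered columns appended."""
    X = X.copy()  # leave the caller's frame untouched
    X["log_sqft"] = np.log1p(X["calculatedfinishedsquarefeet"])
    X["bed_bath_product"] = X["bedroomcnt"] * X["bathroomcnt"]
    # Small constant avoids division by zero when roomcnt is 0
    X["bed_to_room_ratio"] = X["bedroomcnt"] / (X["roomcnt"] + 1e-5)
    return X


# Toy frame built from a few values in the df.head() output (illustration only)
toy = pd.DataFrame({
    "bedroomcnt": [4.0, 2.0, 3.0],
    "bathroomcnt": [3.5, 1.0, 2.0],
    "roomcnt": [0.0, 5.0, 6.0],
    "calculatedfinishedsquarefeet": [3100.0, 1465.0, 1243.0],
})
X_fe = add_engineered_features(toy)
print(X_fe.columns.tolist())
```

Because every new column is computed row by row, the same function can be applied to `X_train` and `X_test` after the split; the scaler then has to be re-fit on the augmented `X_train`.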


In [19]:
# Add as many cells as you need
# Step 1: Engineer three new features
df['log_sqft'] = np.log1p(df['calculatedfinishedsquarefeet'])
df['bed_bath_product'] = df['bedroomcnt'] * df['bathroomcnt']
# Small constant avoids division by zero when roomcnt is 0
df['bed_to_room_ratio'] = df['bedroomcnt'] / (df['roomcnt'] + 1e-5)

# Step 2: Select features and target
# Note: X holds only the three engineered features here, not the original columns
features = ['log_sqft', 'bed_bath_product', 'bed_to_room_ratio']
X = df[features]
y = df['taxvaluedollarcnt']

# Step 3: Split data (train_test_split was imported in the first cell)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)

# Step 4: Scale, fitting the scaler on the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 5: Define CV
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=random_state)

In [20]:
# Step 6: Cross-validate each model with default settings
# Linear Regression
LR = LinearRegression()
LR_Score = cross_val_score(LR, X_train_scaled, y_train, scoring='neg_mean_absolute_error', cv=cv)
LR_Mean = -LR_Score.mean()
LR_Std = LR_Score.std()

# Decision Tree
DTR = DecisionTreeRegressor(random_state=42)
DTR_Score = cross_val_score(DTR, X_train_scaled, y_train, scoring='neg_mean_absolute_error', cv=cv)
DTR_Mean = -DTR_Score.mean()
DTR_Std = DTR_Score.std()

# Random Forest
RF = RandomForestRegressor(random_state=42)
RF_Score = cross_val_score(RF, X_train_scaled, y_train, scoring='neg_mean_absolute_error', cv=cv)
RF_Mean = -RF_Score.mean()
RF_Std = RF_Score.std()

# Step 7: Print Results
print(f"Linear Regression MAE: {LR_Mean:.2f} ± {LR_Std:.2f}")
print(f"Decision Tree MAE:     {DTR_Mean:.2f} ± {DTR_Std:.2f}")
print(f"Random Forest MAE:     {RF_Mean:.2f} ± {RF_Std:.2f}")

Linear Regression MAE: 164065.75 ± 1285.98
Decision Tree MAE:     184382.81 ± 1349.57
Random Forest MAE:     169101.30 ± 1105.44


### Part 2: Discussion [3 pts]

Reflect on the impact of your new features:

- Did any models show notable improvement in performance?

- Which new features seemed to help — and in which models?

- Do you have any hypotheses about why a particular feature helped (or didn’t)?




After introducing three engineered features (log_sqft, bed_bath_product, and bed_to_room_ratio), we re-ran the three models. Note that the models in this part were trained on the three engineered features alone rather than on the original columns plus the new ones, so the comparison with Part 1 reflects both the new features and the loss of the original ones.

Model Improvements:
Only the Decision Tree improved: its mean CV MAE dropped from 191,043.02 to 184,382.81.
Linear Regression got slightly worse (158,828.61 to 164,065.75).
Random Forest got notably worse (147,088.30 to 169,101.30) and lost its lead; Linear Regression is now the best of the three.

Helpful Features:
We did not evaluate the features one at a time, so we cannot attribute the changes to individual features.
Linear Regression stays within about 3% of its baseline using only three inputs, which suggests that these features, most plausibly log_sqft, carry much of the signal found in the original feature set.

Hypotheses:
Log-transforming square footage is a reasonable choice because home sizes tend to be right-skewed; the transformation compresses the largest values, which matters most for a linear model.
The bed_bath_product captures an interaction between room types, which may signal property luxury or utility, while the bed_to_room_ratio may indicate spatial efficiency. However, roomcnt is 0 for several rows (visible in df.head()), and dividing by roomcnt + 1e-5 produces very large ratios for those rows, which may make this feature noisy.
Random Forest most likely got worse because it lost informative original columns such as yearbuilt, lotsizesquarefeet, and fips, not because the engineered features hurt it.

In summary, on their own the engineered features did not improve on the Part 1 baselines except for the Decision Tree. A fairer test is to add them to the original features, as the Part 2 instructions describe.
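
The hypothesis that a log transform reduces the skew of square footage can be checked with `Series.skew()`. The sketch below uses a synthetic right-skewed sample as a stand-in; the lognormal parameters and variable names are assumptions chosen only for illustration, and on the real data the same two calls would be applied to `df['calculatedfinishedsquarefeet']`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "square footage" sample (assumption, not the Zillow data)
rng = np.random.default_rng(42)
sqft = pd.Series(rng.lognormal(mean=7.4, sigma=0.5, size=5000))

skew_raw = sqft.skew()            # skewness of the raw values
skew_log = np.log1p(sqft).skew()  # skewness after the log1p transform

print(f"Skewness before log1p: {skew_raw:.2f}")
print(f"Skewness after log1p:  {skew_log:.2f}")
```

A skewness close to 0 after the transform indicates a roughly symmetric distribution, which is the property the log feature is meant to provide.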

### Part 3: Feature Selection [6 pts]

Using the full set of features (original + engineered):
- Apply **feature selection** methods to investigate whether you can improve performance.
  - You may use forward selection, backward selection, or feature importance from tree-based models.
- For each model, identify the **best-performing subset of features**.
- Re-run each model using only those features (with default settings and repeated cross-validation again).
- Report the **mean and standard deviation of CV MAE Scores**.  
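
Forward selection, one of the options listed above, can be sketched with scikit-learn's `SequentialFeatureSelector`. The example below runs on synthetic data; the choice of three features to keep, the data, and the variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data with 3 informative features out of 6 (assumption for illustration)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, n_informative=3, noise=5.0, random_state=42
)

sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="forward",  # use "backward" for backward elimination
    scoring="neg_mean_absolute_error",
    cv=5,
)
sfs.fit(X_demo, y_demo)

selected = sfs.get_support(indices=True)  # column indices of the kept features
print("Selected feature indices:", selected.tolist())
```

Unlike a univariate filter such as `SelectKBest`, this wrapper scores candidate subsets with the model itself, so each model can end up with a different subset.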


In [21]:
# Add as many cells as you need
from sklearn.pipeline import Pipeline
df['log_sqft'] = np.log1p(df['calculatedfinishedsquarefeet'])
df['bed_bath_product'] = df['bedroomcnt'] * df['bathroomcnt']
df['bed_to_room_ratio'] = df['bedroomcnt'] / (df['roomcnt'] + 1e-5)

# Define all features (original + engineered)
all_features = [
    'bedroomcnt',
    'bathroomcnt',
    'roomcnt',
    'calculatedfinishedsquarefeet',
    'log_sqft',
    'bed_bath_product',
    'bed_to_room_ratio'
]

X = df[all_features]
y = df['taxvaluedollarcnt']

cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42)
}

# Try selecting top k features for each model and evaluate
for name, model in models.items():
    print(f"\n{name}:")
    for k in range(2, len(all_features) + 1):
        pipeline = Pipeline([
            ('scale', StandardScaler()),
            ('select', SelectKBest(score_func=f_regression, k=k)),
            ('model', model)
        ])
        scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv)
        print(f"  Top {k} features: MAE = {-scores.mean():.2f} ± {scores.std():.2f}")



Linear Regression:
  Top 2 features: MAE = 162237.48 ± 719.51
  Top 3 features: MAE = 161547.92 ± 700.79
  Top 4 features: MAE = 160213.42 ± 703.21
  Top 5 features: MAE = 159714.07 ± 722.76
  Top 6 features: MAE = 159612.31 ± 715.22
  Top 7 features: MAE = 159537.35 ± 718.87

Decision Tree:
  Top 2 features: MAE = 165971.41 ± 947.72
  Top 3 features: MAE = 170541.20 ± 979.00
  Top 4 features: MAE = 175955.53 ± 1220.55
  Top 5 features: MAE = 175994.30 ± 1267.17
  Top 6 features: MAE = 180317.74 ± 1400.75
  Top 7 features: MAE = 180140.93 ± 1304.93

Random Forest:
  Top 2 features: MAE = 165223.28 ± 885.99
  Top 3 features: MAE = 166706.51 ± 878.24
  Top 4 features: MAE = 167255.68 ± 1179.86
  Top 5 features: MAE = 167255.33 ± 1219.85
  Top 6 features: MAE = 166821.05 ± 1161.37
  Top 7 features: MAE = 166782.13 ± 1180.74


### Part 3: Discussion [3 pts]

Analyze the effect of feature selection on your models:

- Did performance improve for any models after reducing the number of features?

- Which features were consistently retained across models?

- Were any of your newly engineered features selected as important?


> Your text here
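
To answer which features were retained, the fitted selector can be inspected with `get_support()`. The sketch below uses a small synthetic frame with hypothetical column names (`signal_a`, `signal_b`, `noise_c`); for a fitted pipeline like the ones in Part 3, the selector would be reached through `pipeline.named_steps['select']`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic frame with hypothetical column names (illustration only)
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise_c": rng.normal(size=n),
})
y_demo = 3.0 * X_demo["signal_a"] + 2.0 * X_demo["signal_b"] + rng.normal(scale=0.1, size=n)

selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# Boolean mask over the input columns: True for the k features that were kept
kept = X_demo.columns[selector.get_support()].tolist()
# Univariate F-scores, highest first, to see how the features were ranked
f_scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)

print("Kept features:", kept)
print(f_scores)
```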

### Part 4: Fine-Tuning Your Three Models [6 pts]

In this final phase of Milestone 2, you’ll select and refine your **three most promising models and their corresponding data pipelines** based on everything you've done so far, and pick a winner!

1. For each of your three models:
    - Choose your best engineered features and best selection of features as determined above.
    - Perform hyperparameter tuning using `sweep_parameters`, `GridSearchCV`, `RandomizedSearchCV`, `Optuna`, etc. as you have practiced in previous homeworks.
2. Decide on the best hyperparameters for each model, then run each tuned model with repeated CV and record the final results:
    - Report the **mean and standard deviation of CV MAE Score**.
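
Because the feature selector and the model can sit in one pipeline, the number of selected features can be tuned together with the model's hyperparameters instead of being fixed in advance. A minimal sketch with `GridSearchCV` on synthetic data; the grid values, data, and variable names are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption for illustration only)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, n_informative=3, noise=10.0, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression)),
    ("model", DecisionTreeRegressor(random_state=42)),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {
    "select__k": [2, 4, 6],
    "model__max_depth": [3, 5, None],
}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

best_mae = -search.best_score_
print("Best params:", search.best_params_)
print(f"Best CV MAE: {best_mae:.2f}")
```

Fitting the search on the training split only, and keeping the test split for the final check, avoids choosing hyperparameters with help from the held-out data.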

Decision Tree

In [25]:
dt_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_regression, k=2)),
    ('model', DecisionTreeRegressor(random_state=42))
])

dt_params = {
    'model__max_depth': [3, 5, 10, None],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4]
}

dt_search = RandomizedSearchCV(dt_pipeline, dt_params, n_iter=10, cv=cv,
                               scoring='neg_mean_absolute_error', n_jobs=-1, random_state=42)
dt_search.fit(X, y)

best_dt = dt_search.best_estimator_
dt_scores = -cross_val_score(best_dt, X, y, scoring='neg_mean_absolute_error', cv=cv)
print("Decision Tree (Best Params):", dt_search.best_params_)
print(f"  Mean MAE: {np.mean(dt_scores):.2f}")
print(f"  Std Dev MAE: {np.std(dt_scores):.2f}")

Decision Tree (Best Params): {'model__min_samples_split': 5, 'model__min_samples_leaf': 2, 'model__max_depth': 5}
  Mean MAE: 162112.21
  Std Dev MAE: 701.85


Random Forest

In [27]:
rf_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_regression, k=2)),
    ('model', RandomForestRegressor(random_state=42))
])

In [None]:
rf_params = {
    'model__n_estimators': [100, 150, 200],
    'model__max_depth': [None, 10, 20, 30],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4]
}

rf_search = RandomizedSearchCV(rf_pipeline, rf_params, n_iter=10, cv=cv,
                               scoring='neg_mean_absolute_error', n_jobs=-1, random_state=42)
rf_search.fit(X, y)

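best_rf = rf_search.best_estimator_
rf_scores = -cross_val_score(best_rf, X, y, scoring='neg_mean_absolute_error', cv=cv)
print("Random Forest (Best Params):", rf_search.best_params_)
print(f"  Mean MAE: {np.mean(rf_scores):.2f}")
print(f"  Std Dev MAE: {np.std(rf_scores):.2f}")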

Random Forest (Best Params): {'model__n_estimators': 100, 'model__min_samples_split': 2, 'model__min_samples_leaf': 4, 'model__max_depth': 10}
  Mean MAE: 161922.38
  Std Dev MAE: 723.03


In [29]:
print(f"  Mean MAE: {np.mean(rf_scores):.2f}")
print(f"  Std Dev MAE: {np.std(rf_scores):.2f}")

  Mean MAE: 161922.38
  Std Dev MAE: 723.03


Linear Regression

In [30]:
lr_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_regression, k=7)),
    ('model', LinearRegression())
])

lr_scores = cross_val_score(lr_pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv)
lr_mae = -lr_scores
print("Linear Regression:")
print(f"  Mean MAE: {np.mean(lr_mae):.2f}")
print(f"  Std Dev MAE: {np.std(lr_mae):.2f}")

Linear Regression:
  Mean MAE: 159537.35
  Std Dev MAE: 718.87


In [32]:
print("Linear Regression:")
print(f"  Mean MAE: {np.mean(lr_mae):.2f}")
print(f"  Std Dev MAE: {np.std(lr_mae):.2f}")
print("-")
print("Random Forest Regression:")
print(f"  Mean MAE: {np.mean(rf_scores):.2f}")
print(f"  Std Dev MAE: {np.std(rf_scores):.2f}")
print("-")
print("Decision Tree:")
print(f"  Mean MAE: {np.mean(dt_scores):.2f}")
print(f"  Std Dev MAE: {np.std(dt_scores):.2f}")


Linear Regression:
  Mean MAE: 159537.35
  Std Dev MAE: 718.87
-
Random Forest Regression:
  Mean MAE: 161922.38
  Std Dev MAE: 723.03
-
Decision Tree:
  Mean MAE: 162112.21
  Std Dev MAE: 701.85



### Part 4: Discussion [3 pts]

Reflect on your tuning process and final results:

- What was your tuning strategy for each model? Why did you choose those hyperparameters?
- Did you find that certain types of preprocessing or feature engineering worked better with specific models?


For Linear Regression, I didn't perform hyperparameter tuning, since the base model has no regularization hyperparameters to tune. Instead, I kept the best k from Part 3, which was 7. For the Decision Tree, I used RandomizedSearchCV to search over max_depth, min_samples_split, and min_samples_leaf. These were chosen because they directly control the model's complexity and its tendency to overfit. For Random Forest, I again used RandomizedSearchCV, since the model has a larger hyperparameter space and a randomized search over 10 sampled combinations is cheaper than an exhaustive grid search. I tuned n_estimators, max_depth, min_samples_split, and min_samples_leaf, which control the number of trees, tree complexity, and node splitting. Both tree-based pipelines used k=2, their best value from Part 3. Tuning helped both of them: the Decision Tree went from 165,971.41 (k=2, default settings) to 162,112.21, and Random Forest from 165,223.28 to 161,922.38.
I did not run the models without StandardScaler, so I can't measure its effect directly; in principle, scaling does not change the predictions of ordinary least squares, and tree-based models are insensitive to monotonic rescaling of individual features. Feature selection mattered more for the tree-based models, which did best with only the top 2 features. Linear Regression did best with k=7, which keeps all seven candidate features, so SelectKBest did not actually filter anything for it.
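
A quick way to check how much `StandardScaler` matters for a given model is to compare its predictions with and without the scaling step. For ordinary least squares with an intercept, the two sets of predictions agree up to floating-point error, which the sketch below illustrates on synthetic data; the data, column scales, and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data whose columns are put on very different scales (illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0])

# Same model, fit once on raw features and once behind a StandardScaler
pred_raw = LinearRegression().fit(X_demo, y_demo).predict(X_demo)
pred_scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo).predict(X_demo)

max_diff = np.max(np.abs(pred_raw - pred_scaled))
print(f"Largest difference between predictions: {max_diff:.2e}")
```

Scaling does matter for penalized models such as Ridge and Lasso, where the penalty depends on the size of the coefficients.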

### Part 5: Final Model and Design Reassessment [6 pts]

In this part, you will finalize your best-performing model. You’ll also consolidate and present the key code used to run your model on the preprocessed dataset.

**Requirements:**

- Decide on your final model among the three contestants.

- Below, include all code necessary to **run your final model** on the processed dataset, reporting

    - Mean and standard deviation of CV MAE Score.
    
    - Test score on held-out test set. 




We'll use Linear Regression: of the three tuned pipelines in Part 4, it had the lowest mean CV MAE (159,537.35, versus 161,922.38 for Random Forest and 162,112.21 for Decision Tree).

In [33]:
# Add as many cells as you need
X = df[['bedroomcnt', 'bathroomcnt', 'roomcnt', 'calculatedfinishedsquarefeet',
        'log_sqft', 'bed_bath_product', 'bed_to_room_ratio']]
y = df['taxvaluedollarcnt']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# k=7 keeps all seven features, so the selection step passes everything through
final_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_regression, k=7)),
    ('model', LinearRegression())
])

In [34]:
from sklearn.metrics import mean_absolute_error

In [35]:
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
cv_scores = cross_val_score(final_pipeline, X_train, y_train,scoring='neg_mean_absolute_error', cv=cv)
cv_mae = -cv_scores
print("Final Linear Regression CV Results:")
print(f"  Mean MAE: {np.mean(cv_mae):.2f}")
print(f"  Std Dev MAE: {np.std(cv_mae):.2f}")

Final Linear Regression CV Results:
  Mean MAE: 159947.98
  Std Dev MAE: 1121.68


In [36]:
final_pipeline.fit(X_train, y_train)
y_pred = final_pipeline.predict(X_test)
test_mae = mean_absolute_error(y_test, y_pred)

print(f"\nTest Set MAE: {test_mae:.2f}")


Test Set MAE: 157921.19



### Part 5: Discussion [8 pts]

In this final step, your goal is to synthesize your entire modeling process and assess how your earlier decisions influenced the outcome. Please address the following:

1. Model Selection:
- Clearly state which model you selected as your final model and why.

- What metrics or observations led you to this decision?

- Were there trade-offs (e.g., interpretability vs. performance) that influenced your choice?

2. Revisiting an Early Decision

- Identify one specific preprocessing or feature engineering decision from Milestone 1 (e.g., how you handled missing values, how you scaled or encoded a variable, or whether you created interaction or polynomial terms).

- Explain the rationale for that decision at the time: What were you hoping it would achieve?

- Now that you've seen the full modeling pipeline and final results, reflect on whether this step helped or hindered performance. Did you keep it, modify it, or remove it?

- Justify your final decision with evidence—such as validation scores, visualizations, or model diagnostics.

3. Lessons Learned

- What insights did you gain about your dataset or your modeling process through this end-to-end workflow?

- If you had more time or data, what would you explore next?

> Your text here