# Modeling
---

To begin, I will create a baseline model that predicts the mean `tax_value` for every property. I will then build three models to compare against that baseline:

- Linear Regression (Ordinary Least Squares)
- LassoLars
- Polynomial Regression

My evaluation metric is RMSE (root mean squared error). I chose RMSE because it expresses error in the target variable's units (dollars, in this case), so each model's typical error is easy to interpret rather than abstract.
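
As a minimal sketch of what RMSE measures, the snippet below computes it by hand with the standard library only. The home values and predictions are hypothetical numbers chosen for illustration, not rows from the Zillow data, and the `rmse` helper is an assumed name rather than something defined in this notebook.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# hypothetical home values and predictions, in dollars (not from the Zillow data)
actual = [300_000, 450_000, 250_000]
predicted = [310_000, 430_000, 260_000]

# the result is in dollars, the same unit as the target
print(f"RMSE: ${rmse(actual, predicted):,.2f}")
```

This should give the same number as the `mean_squared_error(...)**0.5` pattern used in the cells that follow, since both take the square root of the mean squared error.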

In [4]:
# import modules
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import itertools
import wrangle
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, LassoLars
from sklearn.preprocessing import PolynomialFeatures
# turn off pink warnings
import warnings
warnings.filterwarnings('ignore')

# list columns for outlier removal
out_cols = ['beds', 'baths', 'sq_ft', 'tax_value']
# list columns for scaling
scaled_cols = ['beds', 'baths', 'sq_ft']
# wrangle data
train, validate, test = wrangle.wrangle_zillow(out_cols, 1.5, scaled_cols)
# preview train data
train.head()

| | beds | baths | sq_ft | tax_value | fips | beds_scaled | baths_scaled | sq_ft_scaled |
|---|---|---|---|---|---|---|---|---|
| 17227 | 2.0 | 1.0 | 877.0 | 148732.0 | 6037.0 | -1.536937 | -1.517808 | -1.348418 |
| 36170 | 3.0 | 2.0 | 1386.0 | 319465.0 | 6059.0 | -0.268813 | -0.13987 | -0.518473 |
| 16538 | 4.0 | 2.5 | 2064.0 | 810703.0 | 6059.0 | 0.999311 | 0.549099 | 0.587033 |
| 29765 | 3.0 | 2.0 | 1323.0 | 393000.0 | 6037.0 | -0.268813 | -0.13987 | -0.621197 |
| 22836 | 4.0 | 3.0 | 2605.0 | 202872.0 | 6037.0 | 0.999311 | 1.238067 | 1.469155 |


In [5]:
# establish baseline
baseline = train.tax_value.mean()
baseline

374001.18450097844

In [6]:
# separate samples into x and y
scaled_cols = ['beds_scaled', 'baths_scaled', 'sq_ft_scaled']

x_train = train[scaled_cols]
y_train = train.tax_value

x_validate = validate[scaled_cols]
y_validate = validate.tax_value

x_test = test[scaled_cols]
y_test = test.tax_value

In [8]:
# evaluate baseline
y_train = pd.DataFrame(y_train)
y_validate = pd.DataFrame(y_validate)

y_train['baseline'] = baseline
y_validate['baseline'] = baseline

rmse_train = mean_squared_error(y_train.tax_value, y_train.baseline)**0.5
rmse_validate = mean_squared_error(y_validate.tax_value, y_validate.baseline)**0.5

print('Baseline(mean `tax_value`) RMSE')
print(f'Train: {rmse_train}')
print(f'Validate: {rmse_validate}')

Baseline(mean `tax_value`) RMSE
Train: 244969.44100949066
Validate: 244786.9506494297
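
One way to read the baseline numbers: when every prediction is the training mean, the train RMSE is the population standard deviation of the target. The sketch below checks this on a handful of hypothetical dollar values (not the Zillow data). On validate the match is only approximate, because the mean comes from train.

```python
import math
import statistics

# hypothetical target values in dollars (not the Zillow data)
y = [150_000, 320_000, 810_000, 393_000, 203_000]

# baseline prediction: the mean of the target
baseline = statistics.mean(y)

# RMSE when every prediction is the mean
rmse_baseline = math.sqrt(sum((v - baseline) ** 2 for v in y) / len(y))

# population standard deviation divides by n, matching the RMSE formula
print(math.isclose(rmse_baseline, statistics.pstdev(y)))
```

Seen this way, any model that beats the baseline is explaining some of the spread in `tax_value` that the mean alone cannot.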


## Model 1: OLS

In [10]:
# create object (the `normalize` argument was removed in scikit-learn 1.2; the features are already scaled)
lm = LinearRegression()
# fit model to train
lm.fit(x_train, y_train.tax_value)
# train predictions
y_train['ols_pred'] = lm.predict(x_train)
# evaluate model on train
rmse_train_ols = mean_squared_error(y_train.tax_value, y_train.ols_pred)**0.5
# validate predictions
y_validate['ols_pred'] = lm.predict(x_validate)
# evaluate model on validate
rmse_validate_ols = mean_squared_error(y_validate.tax_value, y_validate.ols_pred)**0.5

# print results
print(f'OLS RMSE Train: {rmse_train_ols}')
print(f'OLS RMSE Validate: {rmse_validate_ols}')

OLS RMSE Train: 217613.08303855872
OLS RMSE Validate: 216699.00485652036


## Model 2: LassoLars

In [11]:
# create object
lars = LassoLars()
# fit model to train
lars.fit(x_train, y_train.tax_value)
# make predictions on train
y_train['ll_pred'] = lars.predict(x_train)
# evaluate model on train
rmse_train_ll = mean_squared_error(y_train.tax_value, y_train.ll_pred)**0.5
# validate predictions
y_validate['ll_pred'] = lars.predict(x_validate)
# evaluate model on validate
rmse_validate_ll = mean_squared_error(y_validate.tax_value, y_validate.ll_pred)**0.5

# print results
print(f'LassoLars RMSE Train: {rmse_train_ll}')
print(f'LassoLars RMSE Validate: {rmse_validate_ll}')

LassoLars RMSE Train: 217613.41467660828
LassoLars RMSE Validate: 216689.23021125476


## Model 3: Polynomial Regression

In [13]:
# create polynomial features object
poly_feat = PolynomialFeatures()
# fit/transform object on train
x_train_poly = poly_feat.fit_transform(x_train)
# transform on validate and test
x_validate_poly = poly_feat.transform(x_validate)
x_test_poly = poly_feat.transform(x_test)
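
To make the transformation above more concrete: with its default settings, `PolynomialFeatures()` uses degree 2 and includes a bias column, so three input features expand to ten columns. The sketch below lists those terms using only the standard library. The `degree_two_terms` helper is a hypothetical name for illustration, and the term ordering is an assumption about how scikit-learn arranges its output rather than something this notebook verifies.

```python
from itertools import combinations_with_replacement

def degree_two_terms(names):
    """List the bias, linear, and degree-2 terms for the given feature names."""
    terms = ["1"]  # bias column
    for degree in (1, 2):
        # every product of `degree` features, allowing repeats (squares)
        for combo in combinations_with_replacement(names, degree):
            terms.append(" * ".join(combo))
    return terms

terms = degree_two_terms(["beds_scaled", "baths_scaled", "sq_ft_scaled"])
for term in terms:
    print(term)
print(len(terms))
```

The linear model fit in the next cell therefore learns one coefficient per expanded term, including the squared and interaction terms.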

In [14]:
# create model object (the `normalize` argument was removed in scikit-learn 1.2)
plm = LinearRegression()
# fit model to train
plm.fit(x_train_poly, y_train.tax_value)
# make predictions on train
y_train['poly_pred'] = plm.predict(x_train_poly)
# evaluate model on train
rmse_train_poly = mean_squared_error(y_train.tax_value, y_train.poly_pred)**0.5
# validate predictions
y_validate['poly_pred'] = plm.predict(x_validate_poly)
# evaluate model on validate
rmse_validate_poly = mean_squared_error(y_validate.tax_value, y_validate.poly_pred)**0.5

# print results
print(f'Polynomial Regression RMSE Train: {rmse_train_poly}')
print(f'Polynomial Regression RMSE Validate: {rmse_validate_poly}')

Polynomial Regression RMSE Train: 217533.63421490567
Polynomial Regression RMSE Validate: 216584.59703183424


In [18]:
# view results as dataframe
rmse = pd.DataFrame({'Linear Regression':[rmse_train_ols, rmse_validate_ols, (rmse_train_ols-rmse_validate_ols)],
                    'LassoLars':[rmse_train_ll, rmse_validate_ll, (rmse_train_ll-rmse_validate_ll)],
                    'Polynomial':[rmse_train_poly, rmse_validate_poly, (rmse_train_poly-rmse_validate_poly)],
                    'Baseline':[rmse_train, rmse_validate, (rmse_train-rmse_validate)]},
                    index=['train', 'validate', 'difference'])
rmse

| | Linear Regression | LassoLars | Polynomial | Baseline |
|---|---|---|---|---|
| train | 217613.083039 | 217613.414677 | 217533.634215 | 244969.441009 |
| validate | 216699.004857 | 216689.230211 | 216584.597032 | 244786.950649 |
| difference | 914.078182 | 924.184465 | 949.037183 | 182.49036 |
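
To put the results table in relative terms, the sketch below ranks the models by validate RMSE and reports each one's percentage reduction against the baseline. The values are copied from the table; the variable names are illustrative and do not appear elsewhere in the notebook.

```python
# validate RMSE values copied from the results table (dollars)
validate_rmse = {
    "Linear Regression": 216699.004857,
    "LassoLars": 216689.230211,
    "Polynomial": 216584.597032,
}
baseline_rmse = 244786.950649

# model with the lowest validate RMSE
best = min(validate_rmse, key=validate_rmse.get)

# report each model's reduction relative to the baseline, best first
for name, value in sorted(validate_rmse.items(), key=lambda kv: kv[1]):
    pct = (baseline_rmse - value) / baseline_rmse * 100
    print(f"{name}: {pct:.2f}% lower RMSE than baseline")
print(f"Lowest validate RMSE: {best}")
```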


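The Polynomial Regression model has the lowest validate RMSE (216,584.60), although its margin over OLS and LassoLars is only a little over $100. All three models beat the baseline by roughly $28,000.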

Knowing this, I can now evaluate this model on the test dataset.

In [20]:
# make y_test into dataframe
y_test = pd.DataFrame(y_test)
# evaluate baseline on test
y_test['baseline'] = baseline
rmse_test = mean_squared_error(y_test.tax_value, y_test.baseline)**0.5


# test predictions
y_test['poly_pred'] = plm.predict(x_test_poly)
# evaluate model on test
rmse_test_poly = mean_squared_error(y_test.tax_value, y_test.poly_pred)**0.5
# print results
print(f'Baseline RMSE Test: {rmse_test}')
print(f'Polynomial Regression RMSE Test: {rmse_test_poly}')

Baseline RMSE Test: 243476.17293529224
Polynomial Regression RMSE Test: 217301.69629665604


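My top model was the Polynomial Regression model.

On the test set, it produced an RMSE of about $217,301.70, compared with the baseline's $243,476.17. That is a reduction of roughly $26,174, or about 11%.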