<a href="https://colab.research.google.com/github/smccracken13/Zestimate-Project/blob/main/Zestimate_Modeling_Random_Forest_Tuning_(McCracken).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The goals of this notebook are to:

1. Create dummies for categorical data
2. Split the data into train and test sets
3. Tune a Random Forest with RandomizedSearchCV

In [1]:
from google.colab import files
# load zillow_clean.csv
files.upload()

Saving zillow_clean.csv to zillow_clean.csv


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [3]:
df = pd.read_csv('zillow_clean.csv', low_memory=False, index_col = 'Unnamed: 0')

In [4]:
# Remove the fips and absolute log error columns
df.drop(columns=['fips', 'abs_log_error'], inplace = True)

In [5]:
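# Preview the data with parcelid as the index.
# set_index returns a new DataFrame, so df itself is unchanged here
# and parcelid remains a regular column in the steps below.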
df.set_index('parcelid')

parcelid,logerror,transaction_month,transaction_day,transaction_quarter,aircon,architecture,basementsqft,bathroomcnt,bedroomcnt,framing,...,numberofstories,fireplaceflag,tav_built,tax_assessed_value,assessmentyear,tav_land,property_tax,taxdelinquencyflag,taxdelinquencyyear,age
11016594,0.0276,Jan,1,1st,Central,not given,not given,2.0,3.0,not given,...,1.0,not given,122754.0,360170.0,2015.0,237416.0,6735.88,0.0,not applicable,57.0
14366692,-0.1684,Jan,1,1st,not given,not given,not given,3.5,4.0,not given,...,1.0,not given,346458.0,585529.0,2015.0,239071.0,10153.02,0.0,not applicable,2.0
12098116,-0.0040,Jan,1,1st,Central,not given,not given,3.0,2.0,not given,...,1.0,not given,61994.0,119906.0,2015.0,57912.0,11484.48,0.0,not applicable,76.0
12643413,0.0218,Jan,2,1st,Central,not given,not given,2.0,2.0,not given,...,1.0,not given,171518.0,244880.0,2015.0,73362.0,3048.74,0.0,not applicable,29.0
14432541,-0.0050,Jan,2,1st,not given,not given,not given,2.5,4.0,not given,...,2.0,not given,169574.0,434551.0,2015.0,264977.0,5488.96,0.0,not applicable,35.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10774160,-0.0356,Dec,30,4th,Central,not given,not given,1.0,1.0,not given,...,1.0,not given,43800.0,191000.0,2015.0,147200.0,2495.24,0.0,not applicable,37.0
12046695,0.0070,Dec,30,4th,not given,not given,not given,3.0,3.0,not given,...,1.0,not given,117893.0,161111.0,2015.0,43218.0,1886.54,0.0,not applicable,51.0
12995401,-0.2679,Dec,30,4th,not given,not given,not given,2.0,4.0,not given,...,1.0,not given,22008.0,38096.0,2015.0,16088.0,1925.70,1.0,2014.0,92.0
11402105,0.0602,Dec,30,4th,not given,not given,not given,2.0,2.0,not given,...,1.0,not given,132991.0,165869.0,2015.0,32878.0,2285.57,0.0,not applicable,35.0
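
As a side note on the cell above: `DataFrame.set_index` returns a new DataFrame by default, so calling it without assigning the result only produces a preview. The sketch below illustrates this on a small made-up frame (`toy` and its values are hypothetical, not taken from zillow_clean.csv).

```python
import pandas as pd

# Hypothetical toy frame standing in for the Zillow data
toy = pd.DataFrame({"parcelid": [101, 102, 103],
                    "logerror": [0.01, -0.02, 0.03]})

preview = toy.set_index("parcelid")  # new DataFrame; toy is not modified
print(toy.columns.tolist())          # parcelid is still a column of toy
print(preview.index.name)            # the preview is indexed by parcelid

# Reassigning is one way to make the change persist
toy = toy.set_index("parcelid")
print(toy.columns.tolist())          # only logerror remains as a column
```

Whether `parcelid` should stay in the feature matrix is a modeling decision; the sketch only shows the mechanics.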


# One-hot encoding

In [6]:
# get list of categorical columns
cat_cols = ['transaction_month', 'transaction_day','transaction_quarter','aircon',
            'architecture', 'basementsqft', 'framing', 'deck', 'heating',
            'poolsizesum', 'county_land_use_code', 'land_use_code','zoning_code',
            'city', 'county', 'neighborhood','zipcode', 'storytypeid', 'material',
            'patio_sqft', 'shed_sqft','assessmentyear', 'taxdelinquencyyear','has_spa',
            'pool_with_spa', 'pool_without_spa', 'fireplaceflag']

prefix_list = ['tm', 'td', 'tq', 'air', 'arch', 'bsqft', 'fram', 'deck', 'heat',
               'poolsize', 'county_lu_code', 'lu_code', 'zoning', 'city',
               'county', 'neigh', 'zip', 'storyid', 'material', 'patiosqft', 'shedsqft',
               'assessyear', 'taxdelyear', 'has_spa', 'pool_with_spa', 'pool_without_spa', 'fireplaceflag']

prefix_dict = dict(zip(cat_cols, prefix_list))

In [7]:
# Identify columns that contain 'not given' to make sure they get one-hot encoded
not_given_cols = df.columns[df.isin(['not given']).any()]
print(not_given_cols)

Index(['aircon', 'architecture', 'basementsqft', 'framing', 'deck', 'has_spa',
       'heating', 'poolsizesum', 'pool_with_spa', 'pool_without_spa',
       'zoning_code', 'city', 'neighborhood', 'storytypeid', 'material',
       'patio_sqft', 'shed_sqft', 'fireplaceflag'],
      dtype='object')


In [8]:
# one-hot encode cat cols
df = pd.get_dummies(df, columns = cat_cols, prefix= prefix_dict, drop_first=True)
print(len(df.columns))

984
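
To make the effect of `prefix` and `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data and `toy_prefix` mapping are illustrative assumptions, not part of the Zillow data). With `drop_first=True`, the first category of each encoded column, in sorted order, is dropped and acts as the reference level.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    "aircon": ["Central", "not given", "Central"],
    "county": ["LA", "Orange", "Ventura"],
    "bedroomcnt": [3.0, 4.0, 2.0],
})

# Map each categorical column to a short prefix, as prefix_dict does above
toy_prefix = {"aircon": "air", "county": "county"}

encoded = pd.get_dummies(toy, columns=["aircon", "county"],
                         prefix=toy_prefix, drop_first=True)
print(encoded.columns.tolist())
# ['bedroomcnt', 'air_not given', 'county_Orange', 'county_Ventura']
```

Here "Central" and "LA" are the dropped reference levels, so a row with zeros in every `air_` or `county_` column belongs to them.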


# Train and Test Split

In [9]:
# Create train_test_split
X = df.loc[:, df.columns != 'logerror']
y = df['logerror']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
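
As a quick sanity check on what the split does, the following sketch runs `train_test_split` on a tiny synthetic array (the `X_toy` and `y_toy` names and values are assumptions for illustration). It shows the 80/20 proportions and that a fixed `random_state` gives the same split on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.2, random_state=1)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Repeating the call with the same seed reproduces the same test rows
_, X_te2, _, _ = train_test_split(X_toy, y_toy,
                                  test_size=0.2, random_state=1)
print(np.array_equal(X_te, X_te2))  # True
```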

# Modeling

1. Random Forest Hyperparameter Tuning

In [10]:
# load modeling packages
from sklearn.ensemble import RandomForestRegressor

# load regression metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error

Hyperparameter tuning for Random Forest Regressor

In [11]:
from sklearn.model_selection import RandomizedSearchCV

# Maximum number of levels in tree
max_depth = [3, 5, 7, 9]
max_depth.append(None)

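# Create the random grid
# Note: 1.0 (use all features) replaces 'auto', which was deprecated in
# scikit-learn 1.1 and removed in 1.3; for regressors 'auto' meant all features.
random_grid = {'n_estimators': [50, 100, 200, 500],
               'max_features': [1.0, 'sqrt'],
               'max_depth': max_depth}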

# Run the RandomizedSearchCV
rfc_tuned = RandomizedSearchCV(estimator = RandomForestRegressor(),
                               param_distributions = random_grid,
                               n_iter = 15, #start with 15, could increase later
                               cv = 3,
                               verbose=2,
                               random_state=1,
                               n_jobs = -1)

rfc_tuned.fit(X_train, y_train)

# print best results from randomized search
print(rfc_tuned.best_params_)

Fitting 3 folds for each of 15 candidates, totalling 45 fits
{'n_estimators': 200, 'max_features': 'sqrt', 'max_depth': 9}
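
Beyond `best_params_`, a fitted `RandomizedSearchCV` also exposes `cv_results_` and, with the default `refit=True`, a `best_estimator_` already refit on the full training data. The sketch below shows this on a small synthetic regression problem; the data, the reduced grid, and the choice of `scoring="neg_mean_absolute_error"` are assumptions for illustration (the search above uses the default scorer, which is R² for regressors).

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical synthetic data, small enough to run in seconds
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=0.5, random_state=1)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_distributions={"n_estimators": [10, 20],
                         "max_depth": [3, None],
                         "max_features": [1.0, "sqrt"]},
    n_iter=4,                            # sample 4 of the 8 combinations
    scoring="neg_mean_absolute_error",   # higher (closer to 0) is better
    cv=3,
    random_state=1,
)
search.fit(X_demo, y_demo)

# One row per sampled candidate, ranked by mean cross-validated score
results = pd.DataFrame(search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]]
      .sort_values("rank_test_score"))

# The best candidate, already refit on all of X_demo
best_model = search.best_estimator_
print(search.best_params_, -search.best_score_)
```

Inspecting the ranked table can indicate whether the top candidates are close together, in which case a cheaper configuration may be a reasonable choice.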


In [17]:
# Refit a Random Forest Regressor with the best parameters from RandomizedSearchCV:
# {'n_estimators': 200, 'max_features': 'sqrt', 'max_depth': 9}

# fit and predict
# (equivalently: RandomForestRegressor(**rfc_tuned.best_params_))
tuned_rfc_reg = RandomForestRegressor(n_estimators=200, max_features='sqrt', max_depth=9)
tuned_rfc_reg.fit(X_train, y_train)
y_pred = tuned_rfc_reg.predict(X_test)

# model evaluation
print('Tuned Random Forest Regression Model')
print('MSE :', mean_squared_error(y_test, y_pred))
print('RMSE :', np.sqrt(mean_squared_error(y_test, y_pred)))
print('MAE :', mean_absolute_error(y_test, y_pred))

Tuned Random Forest Regression Model
MSE : 0.023855882680175067
RMSE : 0.15445349682080708
MAE : 0.06706073937055983
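
For reference, the three metrics printed above are related in a simple way, which the NumPy sketch below works through on four made-up values (`y_true` and `y_hat` are hypothetical). RMSE is the square root of MSE, and because squaring weights large errors more heavily, RMSE is never smaller than MAE; a wide gap between the two, as in the output above, is consistent with a few large errors dominating.

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only
y_true = np.array([0.03, -0.17, 0.00, 0.02])
y_hat = np.array([0.01, -0.02, 0.01, 0.00])

errors = y_true - y_hat
mse = np.mean(errors ** 2)       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error
mae = np.mean(np.abs(errors))    # mean absolute error
print(mse, rmse, mae)

# The reported RMSE is the square root of the reported MSE
reported_mse = 0.023855882680175067
reported_rmse = 0.15445349682080708
print(np.isclose(np.sqrt(reported_mse), reported_rmse))
```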


This tuned Random Forest Regressor had the lowest MAE and the lowest RMSE of all our models. It performed similarly to the Linear Regression model that used only numeric data (that is, none of the one-hot-encoded categorical features). The Linear Regression model takes less computing power than the Random Forest Regressor, especially once the cost of tuning is included.