# Evaluation of Machine Learning Models for Zillow/AirBNB Datasets

There are two parts to this document:
1. Comparing models
2. Using selected model to make predictions

# Section 1 - Comparing Models

### Regression vs Classification Models

If conducting supervised learning on these datasets, we will need to use a regression model. The target we are predicting (price) is continuous, which requires regression modeling. Logistic regression predicts discrete classes and would not be applicable for this dataset.

### Comparing Regression Models
We will test:
* Random Forest Regression
* SVM Regression
* Multivariate Linear Regression

Since our dataset contains both continuous and categorical variables, a random forest model may outperform a linear regression model: trees split on thresholds and do not assume that an encoded category has a linear effect on price.

We will test both a random forest model and a linear regression model.
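To illustrate why a tree-based model can do better on mixed data, here is a minimal sketch on synthetic data (not the Zillow or Airbnb tables). The column names `area` and `zone`, the premiums, and the noise level are all assumptions made up for the example. The category is label encoded, so the linear model is forced to treat the codes as an ordered numeric scale.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in features: one continuous, one label-encoded category
area = rng.uniform(500, 3500, n)
zone = rng.integers(0, 5, n)
# Each category shifts the price by an amount unrelated to its integer code
zone_premium = np.array([0, 150_000, -50_000, 300_000, 20_000])[zone]
price = 200 * area + zone_premium + rng.normal(0, 10_000, n)

X = pd.DataFrame({"area": area, "zone": zone})
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R-squared on held-out rows

print(scores)
```

On real data the gap depends on how the categories actually relate to price, so both models still need to be tested.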

## Prepare the data for the models

In [70]:
# import dependencies
import pandas as pd
import numpy as np
from datetime import datetime
import seaborn as sb

import sqlite3
# Python SQL toolkit and Object Relational Mapper
import sqlalchemy
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session
from sqlalchemy import create_engine, inspect, func

#modeling dependencies
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [2]:
# Load the data
con_zlw = sqlite3.connect("../finalprojectdatabase.db")
zlw = pd.read_sql_query("SELECT * from zillow_google", con_zlw)

con_airbnb = sqlite3.connect("../finalprojectdatabase.db")
airbnb = pd.read_sql_query("SELECT * from arbnb_google", con_airbnb)

In [3]:
zlw.head()

Unnamed: 0,Address,City,Zipcode,bathrooms,bedrooms,daysOnZillow,homeType,latitude,livingArea,longitude,...,taxAssessedYear,url,yearBuilt,house_number,street_name,google_zip,google_neighborhood,lat_lng,zipcode_length,google_address
0,1121 SW 10th Dr,Gresham,97080,3.0,4.0,11.0,SINGLE_FAMILY,45.488228,2724.0,-122.44442,...,2020-01-01,https://www.zillow.com/homedetails/1121-SW-10t...,1982-01-01,1131,Southwest Florence Drive,97080,Gresham Butte,"45.48822784, -122.4444199",5,1131 Southwest Florence Drive
1,19309 NE Glisan St,Portland,97230,1.0,3.0,11.0,SINGLE_FAMILY,45.526634,1217.0,-122.464088,...,2017-01-01,https://www.zillow.com/homedetails/19309-NE-Gl...,1953-01-01,19309,Northeast Glisan Street,97230,North Gresham,"45.52663422, -122.4640884",5,19309 Northeast Glisan Street
2,1518 SE 12th St,Gresham,97080,2.0,3.0,14.0,SINGLE_FAMILY,45.487991,1150.0,-122.416184,...,2020-01-01,https://www.zillow.com/homedetails/1518-SE-12t...,1967-01-01,1518,Southeast 12th Street,97080,Asert,"45.48799133, -122.4161835",5,1518 Southeast 12th Street
3,110 NW Willowbrook Ct,Gresham,97030,2.0,3.0,25.0,SINGLE_FAMILY,45.498184,2036.0,-122.451332,...,2020-01-01,https://www.zillow.com/homedetails/110-NW-Will...,1978-01-01,110,Northwest Willowbrook Court,97030,Northwest,"45.4981842, -122.4513321",5,110 Northwest Willowbrook Court
4,3569 SW Mckinley St,Gresham,97080,3.0,3.0,41.0,SINGLE_FAMILY,45.475353,2209.0,-122.468307,...,2020-01-01,https://www.zillow.com/homedetails/3569-SW-Mck...,2017-01-01,3569,Southwest McKinley Street,97080,Pleasant Valley,"45.47535324, -122.4683075",5,3569 Southwest McKinley Street


In [4]:
# set address as index
zlw = zlw.set_index('google_address', drop=True)
print(zlw.columns.to_list())
zlw.head()

['Address', 'City', 'Zipcode', 'bathrooms', 'bedrooms', 'daysOnZillow', 'homeType', 'latitude', 'livingArea', 'longitude', 'lotSize', 'date_sold', 'price', 'pricePerSquareFoot', 'rentZestimate', 'taxAssessedValue', 'taxAssessedYear', 'url', 'yearBuilt', 'house_number', 'street_name', 'google_zip', 'google_neighborhood', 'lat_lng', 'zipcode_length']


google_address,Address,City,Zipcode,bathrooms,bedrooms,daysOnZillow,homeType,latitude,livingArea,longitude,...,taxAssessedValue,taxAssessedYear,url,yearBuilt,house_number,street_name,google_zip,google_neighborhood,lat_lng,zipcode_length
1131 Southwest Florence Drive,1121 SW 10th Dr,Gresham,97080,3.0,4.0,11.0,SINGLE_FAMILY,45.488228,2724.0,-122.44442,...,397560.0,2020-01-01,https://www.zillow.com/homedetails/1121-SW-10t...,1982-01-01,1131,Southwest Florence Drive,97080,Gresham Butte,"45.48822784, -122.4444199",5
19309 Northeast Glisan Street,19309 NE Glisan St,Portland,97230,1.0,3.0,11.0,SINGLE_FAMILY,45.526634,1217.0,-122.464088,...,269520.0,2017-01-01,https://www.zillow.com/homedetails/19309-NE-Gl...,1953-01-01,19309,Northeast Glisan Street,97230,North Gresham,"45.52663422, -122.4640884",5
1518 Southeast 12th Street,1518 SE 12th St,Gresham,97080,2.0,3.0,14.0,SINGLE_FAMILY,45.487991,1150.0,-122.416184,...,309260.0,2020-01-01,https://www.zillow.com/homedetails/1518-SE-12t...,1967-01-01,1518,Southeast 12th Street,97080,Asert,"45.48799133, -122.4161835",5
110 Northwest Willowbrook Court,110 NW Willowbrook Ct,Gresham,97030,2.0,3.0,25.0,SINGLE_FAMILY,45.498184,2036.0,-122.451332,...,373030.0,2020-01-01,https://www.zillow.com/homedetails/110-NW-Will...,1978-01-01,110,Northwest Willowbrook Court,97030,Northwest,"45.4981842, -122.4513321",5
3569 Southwest McKinley Street,3569 SW Mckinley St,Gresham,97080,3.0,3.0,41.0,SINGLE_FAMILY,45.475353,2209.0,-122.468307,...,453610.0,2020-01-01,https://www.zillow.com/homedetails/3569-SW-Mck...,2017-01-01,3569,Southwest McKinley Street,97080,Pleasant Valley,"45.47535324, -122.4683075",5


In [5]:
# drop columns not needed for machine learning
zlw = zlw.drop(['Address',
                'City',
                'Zipcode',
                #'bathrooms',
                #'bedrooms',
                #'daysOnZillow',
                #'homeType',
                'latitude',
                #'livingArea',
                'longitude',
                #'lotSize',
                'date_sold',
                #'price',
                #'pricePerSquareFoot',
                #'rentZestimate',
                #'taxAssessedValue',
                #'taxAssessedYear',
                'url',
                #'yearBuilt',
                'house_number',
                'street_name',
                #'google_zip',
                #'google_neighborhood',
                'lat_lng',
                'zipcode_length'], axis=1)
zlw.dtypes

bathrooms              float64
bedrooms               float64
daysOnZillow           float64
homeType                object
livingArea             float64
lotSize                float64
price                    int64
pricePerSquareFoot     float64
rentZestimate          float64
taxAssessedValue       float64
taxAssessedYear         object
yearBuilt               object
google_zip               int64
google_neighborhood     object
dtype: object
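The dtypes above show `taxAssessedYear` and `yearBuilt` stored as `object`, so they are picked up as categorical columns by the encoders later on. One alternative, sketched here on a toy frame that only mirrors the date-string format seen in `zlw.head()`, is to parse them into numeric years. This is an assumption about a possible preprocessing step, not something the notebook does.

```python
import pandas as pd

# Toy frame mirroring the string format seen in zlw.head(); not the real data
sample = pd.DataFrame({
    "yearBuilt": ["1982-01-01", "1953-01-01", "2017-01-01"],
    "taxAssessedYear": ["2020-01-01", "2017-01-01", "2020-01-01"],
})

for col in ["yearBuilt", "taxAssessedYear"]:
    # errors="coerce" turns unparseable values into NaT instead of raising
    sample[col] = pd.to_datetime(sample[col], errors="coerce").dt.year

print(sample.dtypes)
print(sample)
```

With numeric years, the gap between 1953 and 1982 keeps its meaning, whereas a label code only preserves the sort order.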

In [6]:
airbnb.head()

Unnamed: 0,listing_url,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_identity_verified,neighbourhood_cleansed,latitude,longitude,room_type,...,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month,house_number,street_name,google_zip,google_neighborhood,lat_lng,google_address
0,https://www.airbnb.com/rooms/12899,within an hour,100.0,100.0,1,0,Concordia,45.56488,-122.63418,Entire home/apt,...,4.99,4.93,4.94,4.08,5827,Northeast 31st Avenue,97211,Concordia,"45.56488, -122.63418",5827 Northeast 31st Avenue
1,https://www.airbnb.com/rooms/37676,within a day,100.0,55.0,1,1,Pearl,45.52564,-122.68273,Entire home/apt,...,4.77,4.94,4.66,0.91,1110,Northwest Flanders Street,97209,Northwest Portland,"45.52564, -122.68273",1110 Northwest Flanders Street
2,https://www.airbnb.com/rooms/41601,within an hour,100.0,100.0,1,1,Roseway,45.54804,-122.58541,Entire home/apt,...,4.92,4.67,4.83,1.76,7510,Northeast Fremont Street,97213,Roseway,"45.54804, -122.58541",7510 Northeast Fremont Street
3,https://www.airbnb.com/rooms/61893,within an hour,100.0,73.0,1,1,Goose Hollow,45.52258,-122.69955,Entire home/apt,...,5.0,5.0,4.93,0.24,2334,Southwest Cactus Drive,97205,Goose Hollow,"45.52258, -122.69955",2334 Southwest Cactus Drive
4,https://www.airbnb.com/rooms/80357,within an hour,100.0,52.0,1,1,Sullivan's Gulch,45.53364,-122.63895,Entire home/apt,...,5.0,5.0,5.0,0.02,2608,Northeast Halsey Street,97232,Sullivan's Gulch,"45.53364, -122.63895",2608 Northeast Halsey Street


In [7]:
# set address as index
airbnb = airbnb.set_index('google_address', drop=True)
print(airbnb.columns.to_list())
airbnb.head()

['listing_url', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_identity_verified', 'neighbourhood_cleansed', 'latitude', 'longitude', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'price', 'minimum_nights', 'maximum_nights', 'number_of_reviews_l30d', 'last_review', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'reviews_per_month', 'house_number', 'street_name', 'google_zip', 'google_neighborhood', 'lat_lng']


google_address,listing_url,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_identity_verified,neighbourhood_cleansed,latitude,longitude,room_type,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month,house_number,street_name,google_zip,google_neighborhood,lat_lng
5827 Northeast 31st Avenue,https://www.airbnb.com/rooms/12899,within an hour,100.0,100.0,1,0,Concordia,45.56488,-122.63418,Entire home/apt,...,4.99,4.99,4.93,4.94,4.08,5827,Northeast 31st Avenue,97211,Concordia,"45.56488, -122.63418"
1110 Northwest Flanders Street,https://www.airbnb.com/rooms/37676,within a day,100.0,55.0,1,1,Pearl,45.52564,-122.68273,Entire home/apt,...,4.83,4.77,4.94,4.66,0.91,1110,Northwest Flanders Street,97209,Northwest Portland,"45.52564, -122.68273"
7510 Northeast Fremont Street,https://www.airbnb.com/rooms/41601,within an hour,100.0,100.0,1,1,Roseway,45.54804,-122.58541,Entire home/apt,...,4.95,4.92,4.67,4.83,1.76,7510,Northeast Fremont Street,97213,Roseway,"45.54804, -122.58541"
2334 Southwest Cactus Drive,https://www.airbnb.com/rooms/61893,within an hour,100.0,73.0,1,1,Goose Hollow,45.52258,-122.69955,Entire home/apt,...,4.93,5.0,5.0,4.93,0.24,2334,Southwest Cactus Drive,97205,Goose Hollow,"45.52258, -122.69955"
2608 Northeast Halsey Street,https://www.airbnb.com/rooms/80357,within an hour,100.0,52.0,1,1,Sullivan's Gulch,45.53364,-122.63895,Entire home/apt,...,5.0,5.0,5.0,5.0,0.02,2608,Northeast Halsey Street,97232,Sullivan's Gulch,"45.53364, -122.63895"


In [8]:
# drop columns not needed for machine learning
airbnb = airbnb.drop(['listing_url',
                       #'host_response_time',
                       #'host_response_rate',
                       #'host_acceptance_rate',
                       #'host_is_superhost',
                       #'host_identity_verified',
                       'neighbourhood_cleansed',
                       'latitude',
                       'longitude',
                       #'room_type',
                       #'accommodates',
                       #'bathrooms',
                       #'bedrooms',
                       #'beds',
                       #'price',
                       #'minimum_nights',
                       #'maximum_nights',
                       #'number_of_reviews_l30d',
                       'last_review',
                       #'review_scores_rating',
                       #'review_scores_accuracy',
                       #'review_scores_cleanliness',
                       #'review_scores_checkin',
                       #'review_scores_communication',
                       #'review_scores_location',
                       #'review_scores_value',
                       #'reviews_per_month',
                       'house_number',
                       'street_name',
                       #'google_zip',
                       #'google_neighborhood',
                       'lat_lng'], axis=1)
airbnb.dtypes

host_response_time              object
host_response_rate             float64
host_acceptance_rate           float64
host_is_superhost                int64
host_identity_verified           int64
room_type                       object
accommodates                     int64
bathrooms                      float64
bedrooms                       float64
beds                           float64
price                          float64
minimum_nights                   int64
maximum_nights                   int64
number_of_reviews_l30d           int64
review_scores_rating           float64
review_scores_accuracy         float64
review_scores_cleanliness      float64
review_scores_checkin          float64
review_scores_communication    float64
review_scores_location         float64
review_scores_value            float64
reviews_per_month              float64
google_zip                       int64
google_neighborhood             object
dtype: object

### Test OneHotEncoder vs Label Encoding

In [9]:
le = LabelEncoder()
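# note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer versions need OneHotEncoder(sparse_output=False)
ohe = OneHotEncoder(sparse=False)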

In [10]:
# set columns to check to list

In [11]:
zlw_le = zlw.copy()
zlw_ohe = zlw.copy()
airbnb_le = airbnb.copy()
airbnb_ohe = airbnb.copy()

In [12]:
zlw_cat_col = zlw.select_dtypes(include='object').columns
airbnb_cat_col = airbnb.select_dtypes(include=['object']).columns

In [13]:
# Label Encoder: refit on each categorical column and replace it with integer codes
for col in zlw_cat_col:
    zlw_le[col] = le.fit_transform(zlw_le[col].astype(str))

for col in airbnb_cat_col:
    airbnb_le[col] = le.fit_transform(airbnb_le[col].astype(str))

In [14]:
zlw_le.head()

google_address,bathrooms,bedrooms,daysOnZillow,homeType,livingArea,lotSize,price,pricePerSquareFoot,rentZestimate,taxAssessedValue,taxAssessedYear,yearBuilt,google_zip,google_neighborhood
1131 Southwest Florence Drive,3.0,4.0,11.0,3,2724.0,9583.0,512000,188.0,1995.0,397560.0,4,102,97080,51
19309 Northeast Glisan Street,1.0,3.0,11.0,3,1217.0,13939.0,348000,286.0,1695.0,269520.0,1,73,97230,99
1518 Southeast 12th Street,2.0,3.0,14.0,3,1150.0,7000.0,404200,351.0,1800.0,309260.0,4,87,97080,6
110 Northwest Willowbrook Court,2.0,3.0,25.0,3,2036.0,6969.0,478200,235.0,2250.0,373030.0,4,98,97030,107
3569 Southwest McKinley Street,3.0,3.0,41.0,3,2209.0,5227.0,550000,249.0,2300.0,453610.0,4,137,97080,120


In [15]:
airbnb_le.head()

google_address,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_identity_verified,room_type,accommodates,bathrooms,bedrooms,beds,...,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month,google_zip,google_neighborhood
5827 Northeast 31st Avenue,3,100.0,100.0,1,0,0,3,1.0,2.0,2.0,...,4.93,4.94,4.98,4.99,4.99,4.93,4.94,4.08,97211,8
1110 Northwest Flanders Street,1,100.0,55.0,1,1,0,3,1.0,1.0,1.0,...,4.88,4.86,4.86,4.83,4.77,4.94,4.66,0.91,97209,37
7510 Northeast Fremont Street,3,100.0,100.0,1,1,0,2,1.0,1.0,1.0,...,4.84,4.9,4.9,4.95,4.92,4.67,4.83,1.76,97213,45
2334 Southwest Cactus Drive,3,100.0,73.0,1,1,0,2,1.0,1.0,1.0,...,5.0,5.0,5.0,4.93,5.0,5.0,4.93,0.24,97205,16
2608 Northeast Halsey Street,3,100.0,52.0,1,1,0,2,1.0,1.0,1.0,...,5.0,5.0,5.0,5.0,5.0,5.0,5.0,0.02,97232,51


In [16]:
#OneHotEncoder - get dummies
zlw_dum = pd.get_dummies(zlw_ohe, columns = zlw_cat_col)
airbnb_dum = pd.get_dummies(airbnb_ohe, columns = airbnb_cat_col)

In [17]:
zlw_dum.head()

google_address,bathrooms,bedrooms,daysOnZillow,livingArea,lotSize,price,pricePerSquareFoot,rentZestimate,taxAssessedValue,google_zip,...,google_neighborhood_West Mount Scott,google_neighborhood_West Portland Park,google_neighborhood_Westlake,google_neighborhood_Westridge,google_neighborhood_Wilkes,google_neighborhood_Wilkes East,google_neighborhood_Willamette,google_neighborhood_Witch Hazel,google_neighborhood_Woodland Park,google_neighborhood_Woodlawn
1131 Southwest Florence Drive,3.0,4.0,11.0,2724.0,9583.0,512000,188.0,1995.0,397560.0,97080,...,0,0,0,0,0,0,0,0,0,0
19309 Northeast Glisan Street,1.0,3.0,11.0,1217.0,13939.0,348000,286.0,1695.0,269520.0,97230,...,0,0,0,0,0,0,0,0,0,0
1518 Southeast 12th Street,2.0,3.0,14.0,1150.0,7000.0,404200,351.0,1800.0,309260.0,97080,...,0,0,0,0,0,0,0,0,0,0
110 Northwest Willowbrook Court,2.0,3.0,25.0,2036.0,6969.0,478200,235.0,2250.0,373030.0,97030,...,0,0,0,0,0,0,0,0,0,0
3569 Southwest McKinley Street,3.0,3.0,41.0,2209.0,5227.0,550000,249.0,2300.0,453610.0,97080,...,0,0,0,0,0,0,0,0,0,0


In [18]:
airbnb_dum.head()

google_address,host_response_rate,host_acceptance_rate,host_is_superhost,host_identity_verified,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,...,google_neighborhood_Southwest Hills,google_neighborhood_Southwest Portland,google_neighborhood_Sullivan's Gulch,google_neighborhood_Sumner,google_neighborhood_Sunderland,google_neighborhood_Sunnyside,google_neighborhood_Vernon,google_neighborhood_West Portland Park,google_neighborhood_Wilkes,google_neighborhood_Woodlawn
5827 Northeast 31st Avenue,100.0,100.0,1,0,3,1.0,2.0,2.0,89.0,3,...,0,0,0,0,0,0,0,0,0,0
1110 Northwest Flanders Street,100.0,55.0,1,1,3,1.0,1.0,1.0,14.0,30,...,0,0,0,0,0,0,0,0,0,0
7510 Northeast Fremont Street,100.0,100.0,1,1,2,1.0,1.0,1.0,129.0,3,...,0,0,0,0,0,0,0,0,0,0
2334 Southwest Cactus Drive,100.0,73.0,1,1,2,1.0,1.0,1.0,104.0,30,...,0,0,0,0,0,0,0,0,0,0
2608 Northeast Halsey Street,100.0,52.0,1,1,2,1.0,1.0,1.0,9.0,90,...,0,0,1,0,0,0,0,0,0,0
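The difference between the two encodings is easier to see on a tiny example. The sketch below uses a toy `homeType` column; `SINGLE_FAMILY` and `APARTMENT` appear in the data above, while `CONDO` is a hypothetical third value added for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column; CONDO is a made-up category for the example
toy = pd.DataFrame(
    {"homeType": ["SINGLE_FAMILY", "APARTMENT", "CONDO", "APARTMENT"]}
)

# Label encoding: a single integer column, codes assigned in sorted label order
label_codes = LabelEncoder().fit_transform(toy["homeType"])

# One-hot encoding: one indicator column per category
dummies = pd.get_dummies(toy, columns=["homeType"])

print(label_codes)
print(dummies.columns.to_list())
```

Label encoding keeps the table narrow (13 Zillow feature columns) but implies an ordering between categories. One-hot encoding avoids the ordering at the cost of many sparse columns (336 for Zillow).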


## Random Forest Model

### Zillow Random Forest

In [74]:
zlw_rf = zlw_le

In [75]:
# Set the targets
y_rf = zlw_rf['price']
X_rf = zlw_rf.drop('price', axis=1)

In [76]:
# Split data
X_zlw_rf_train, X_zlw_rf_test, y_zlw_rf_train, y_zlw_rf_test = train_test_split(X_rf, y_rf, test_size=0.2, random_state=573)

In [77]:
# train the model
model_zlw_rf = RandomForestRegressor(n_estimators = 128, random_state = 573)
model_zlw_rf.fit(X_zlw_rf_train, y_zlw_rf_train)

RandomForestRegressor(n_estimators=128, random_state=573)

In [86]:
# predict results
y_pred_zlw_rf = model_zlw_rf.predict(X_zlw_rf_test)
y_pred_zlw_rf

array([778833.59375  , 565717.1875   , 449790.0390625, ...,
       665672.625    , 714108.0234375, 838323.8828125])

In [87]:
print(y_pred_zlw_rf.shape)
print(y_zlw_rf_test.shape)

(3071,)
(3071,)


In [88]:
R2_rf = metrics.r2_score(y_zlw_rf_test, y_pred_zlw_rf)
R2_rf

0.945129342070966
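For reference, the R-squared value returned by `metrics.r2_score` is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on a few made-up prices; the numbers are illustrative and are not taken from the model above.

```python
import numpy as np
from sklearn import metrics

# Toy values for illustration only; not taken from the notebook
y_true = np.array([300_000.0, 450_000.0, 520_000.0, 610_000.0])
y_pred = np.array([310_000.0, 440_000.0, 500_000.0, 650_000.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = metrics.r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```

Because R-squared is unitless, it is worth reporting MAE or RMSE next to it to see the error in dollars.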


## Linear Regression Model
See the [Linear Regression Model](./ML_Model_Final.ipynb) for the linear regression.

#### Set variables & perform scaling

In [19]:
std_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()

In [20]:
#Standard Scaler Variables
# LE
y_zlw_le_std = zlw_le.price
X_zlw_le_std = zlw_le.drop('price', axis=1)

y_airbnb_le_std = airbnb_le.price
X_airbnb_le_std = airbnb_le.drop('price', axis=1)

# OHE
y_zlw_ohe_std = zlw_dum.price
X_zlw_ohe_std = zlw_dum.drop('price', axis=1)

y_airbnb_ohe_std = airbnb_dum.price
X_airbnb_ohe_std = airbnb_dum.drop('price', axis=1)

In [21]:
# MinMax Scaler Variables
# LE
y_zlw_le_minmax = zlw_le.price
X_zlw_le_minmax = zlw_le.drop('price', axis=1)

y_airbnb_le_minmax = airbnb_le.price
X_airbnb_le_minmax = airbnb_le.drop('price', axis=1)

# OHE
y_zlw_ohe_minmax = zlw_dum.price
X_zlw_ohe_minmax = zlw_dum.drop('price', axis=1)

y_airbnb_ohe_minmax = airbnb_dum.price
X_airbnb_ohe_minmax = airbnb_dum.drop('price', axis=1)

In [22]:
# fit and transform the X data - StandardScaler

# LE
X_zlw_le_std[X_zlw_le_std.columns] = std_scaler.fit_transform(X_zlw_le_std[X_zlw_le_std.columns].values)
X_airbnb_le_std[X_airbnb_le_std.columns] = std_scaler.fit_transform(X_airbnb_le_std[X_airbnb_le_std.columns].values)

# OHE
X_zlw_ohe_std[X_zlw_ohe_std.columns] = std_scaler.fit_transform(X_zlw_ohe_std[X_zlw_ohe_std.columns].values)
X_airbnb_ohe_std[X_airbnb_ohe_std.columns] = std_scaler.fit_transform(X_airbnb_ohe_std[X_airbnb_ohe_std.columns].values)

In [23]:
# fit and transform the X data - MinMaxScaler
# LE
X_zlw_le_minmax[X_zlw_le_minmax.columns] = minmax_scaler.fit_transform(X_zlw_le_minmax[X_zlw_le_minmax.columns].values)
X_airbnb_le_minmax[X_airbnb_le_minmax.columns] = minmax_scaler.fit_transform(X_airbnb_le_minmax[X_airbnb_le_minmax.columns].values)

# OHE
X_zlw_ohe_minmax[X_zlw_ohe_minmax.columns] = minmax_scaler.fit_transform(X_zlw_ohe_minmax[X_zlw_ohe_minmax.columns].values)
X_airbnb_ohe_minmax[X_airbnb_ohe_minmax.columns] = minmax_scaler.fit_transform(X_airbnb_ohe_minmax[X_airbnb_ohe_minmax.columns].values)

In [24]:
# check the mean (~0) and STD (~1) of standard scaler
X_zlw_ohe_std.describe()

Unnamed: 0,bathrooms,bedrooms,daysOnZillow,livingArea,lotSize,pricePerSquareFoot,rentZestimate,taxAssessedValue,google_zip,homeType_APARTMENT,...,google_neighborhood_West Mount Scott,google_neighborhood_West Portland Park,google_neighborhood_Westlake,google_neighborhood_Westridge,google_neighborhood_Wilkes,google_neighborhood_Wilkes East,google_neighborhood_Willamette,google_neighborhood_Witch Hazel,google_neighborhood_Woodland Park,google_neighborhood_Woodlawn
count,15352.0,15352.0,15352.0,15352.0,15352.0,15352.0,15352.0,15352.0,15352.0,15352.0,...,15352.0,15352.0,15352.0,15352.0,15352.0,15352.0,15352.0,15352.0,15352.0,15352.0
mean,1.629176e-16,7.405344000000001e-17,-2.9621380000000005e-17,1.703229e-16,1.258908e-16,1.888363e-16,5.1837410000000006e-17,4.4432060000000007e-17,4.543919e-14,1.1108020000000001e-17,...,0.0,2.9621380000000005e-17,-1.8513360000000003e-17,-1.2959350000000002e-17,2.2216030000000003e-17,3.702672e-18,-7.405344e-18,1.4810690000000003e-17,2.777004e-18,9.25668e-18
std,1.000033,1.000033,1.000033,1.000033,1.000033,1.000033,1.000033,1.000033,1.000033,1.000033,...,1.000033,1.000033,1.000033,1.000033,1.000033,1.000033,1.000033,1.000033,1.000033,1.000033
min,-1.751025,-1.749102,-1.649546,-2.027389,-2.007945,-3.176845,-1.814716,-2.442766,-1.777561,-0.08611147,...,-0.064192,-0.05600388,-0.03873531,-0.02677747,-0.07099941,-0.05716246,-0.08687587,-0.06959573,-0.00807108,-0.07237641
25%,-0.5426399,-0.4675569,-0.941142,-0.7637637,-0.6204474,-0.5806214,-0.6239734,-0.642151,-1.021562,-0.08611147,...,-0.064192,-0.05600388,-0.03873531,-0.02677747,-0.07099941,-0.05716246,-0.08687587,-0.06959573,-0.00807108,-0.07237641
50%,0.6657453,-0.4675569,0.03291342,-0.1121618,-0.2054682,-0.1479175,-0.2060686,-0.1923489,0.6416375,-0.08611147,...,-0.064192,-0.05600388,-0.03873531,-0.02677747,-0.07099941,-0.05716246,-0.08687587,-0.06959573,-0.00807108,-0.07237641
75%,0.6657453,0.813988,0.9361285,0.6359738,0.5343187,0.4214297,0.2908373,0.4111476,0.7695759,-0.08611147,...,-0.064192,-0.05600388,-0.03873531,-0.02677747,-0.07099941,-0.05716246,-0.08687587,-0.06959573,-0.00807108,-0.07237641
max,2.478323,2.095533,1.564837,3.136516,3.422028,19.8134,13.1188,12.06552,1.29296,11.61286,...,15.578271,17.8559,25.81624,37.34483,14.08462,17.494,11.51068,14.3687,123.8992,13.81666
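The `std` row reads 1.000033 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The two differ by a factor of sqrt(n / (n - 1)). A minimal sketch on random data with the same row count:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 15352  # row count reported by describe() above

# Random stand-in column; only the row count matches the real data
scaled = StandardScaler().fit_transform(rng.normal(size=(n, 1)))
col = pd.Series(scaled[:, 0])

pop_std = col.std(ddof=0)        # what StandardScaler normalises to 1
sample_std = col.std()           # pandas default (ddof=1), shown by describe()
ratio = np.sqrt(n / (n - 1))     # expected gap between the two

print(pop_std, sample_std, ratio)
```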


In [25]:
print(X_zlw_le_std.shape)
print(X_airbnb_le_std.shape)
print(X_zlw_ohe_std.shape)
print(X_airbnb_ohe_std.shape)

(15352, 13)
(2037, 23)
(15352, 336)
(2037, 84)


In [26]:
print(X_zlw_le_minmax.shape)
print(X_airbnb_le_minmax.shape)
print(X_zlw_ohe_minmax.shape)
print(X_airbnb_ohe_minmax.shape)

(15352, 13)
(2037, 23)
(15352, 336)
(2037, 84)


#### Split into Testing & Training Data

In [27]:
# StandardScaler Split
X_zlw_le_std_train, X_zlw_le_std_test, y_zlw_le_std_train, y_zlw_le_std_test = train_test_split(X_zlw_le_std, y_zlw_le_std, test_size=0.2, random_state=573)
X_airbnb_le_std_train, X_airbnb_le_std_test, y_airbnb_le_std_train, y_airbnb_le_std_test = train_test_split(X_airbnb_le_std, y_airbnb_le_std, test_size=0.2, random_state=573)

X_zlw_ohe_std_train, X_zlw_ohe_std_test, y_zlw_ohe_std_train, y_zlw_ohe_std_test = train_test_split(X_zlw_ohe_std, y_zlw_ohe_std, test_size=0.2, random_state=573)
X_airbnb_ohe_std_train, X_airbnb_ohe_std_test, y_airbnb_ohe_std_train, y_airbnb_ohe_std_test = train_test_split(X_airbnb_ohe_std, y_airbnb_ohe_std, test_size=0.2, random_state=573)

print(X_zlw_le_std_train.shape)
print(X_airbnb_le_std_train.shape)
print(X_zlw_ohe_std_train.shape)
print(X_airbnb_ohe_std_train.shape)

(12281, 13)
(1629, 23)
(12281, 336)
(1629, 84)


In [28]:
# MinMaxScaler Split
X_zlw_le_minmax_train, X_zlw_le_minmax_test, y_zlw_le_minmax_train, y_zlw_le_minmax_test = train_test_split(X_zlw_le_minmax, y_zlw_le_minmax, test_size=0.2, random_state=573)
X_airbnb_le_minmax_train, X_airbnb_le_minmax_test, y_airbnb_le_minmax_train, y_airbnb_le_minmax_test = train_test_split(X_airbnb_le_minmax, y_airbnb_le_minmax, test_size=0.2, random_state=573)

X_zlw_ohe_minmax_train, X_zlw_ohe_minmax_test, y_zlw_ohe_minmax_train, y_zlw_ohe_minmax_test = train_test_split(X_zlw_ohe_minmax, y_zlw_ohe_minmax, test_size=0.2, random_state=573)
X_airbnb_ohe_minmax_train, X_airbnb_ohe_minmax_test, y_airbnb_ohe_minmax_train, y_airbnb_ohe_minmax_test = train_test_split(X_airbnb_ohe_minmax, y_airbnb_ohe_minmax, test_size=0.2, random_state=573)

print(X_zlw_le_minmax_train.shape)
print(X_airbnb_le_minmax_train.shape)
print(X_zlw_ohe_minmax_train.shape)
print(X_airbnb_ohe_minmax_train.shape)

(12281, 13)
(1629, 23)
(12281, 336)
(1629, 84)
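Note that the scalers above were fit on the full feature tables before the split, so the test rows contributed to the scaling statistics. A common alternative is to split first and let a `Pipeline` fit the scaler on the training rows only. The sketch below shows that pattern on synthetic data; the feature matrix and coefficients are assumptions for the example, not the notebook's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(573)

# Synthetic features and target standing in for the real tables
X = rng.normal(size=(500, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Split first, so the test rows never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=573
)

# fit() learns the scaling statistics from X_train only;
# score() reuses those statistics to transform X_test
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)
print(test_r2)
```

For ordinary least squares the effect on the scores is usually small, but the pattern matters more for models that are sensitive to feature scale.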


##### Zillow Label Encoded with StandardScaler

In [29]:
# set the model type
# zlw label encoded with standard scaler
model_zlw_le_std_lr = LinearRegression()

In [30]:
# fit the model to the training data and calculate scores for the training and testing data
model_zlw_le_std_lr.fit(X_zlw_le_std_train, y_zlw_le_std_train)
training_score_zlw_le_std_lr = model_zlw_le_std_lr.score(X_zlw_le_std_train, y_zlw_le_std_train)
testing_score_std_zlw_le_std_lr = model_zlw_le_std_lr.score(X_zlw_le_std_test, y_zlw_le_std_test)

print(f"Training Score: {training_score_zlw_le_std_lr}")
print(f"Testing Score: {testing_score_std_zlw_le_std_lr}")

Training Score: 0.9364205636135193
Testing Score: 0.9228911807991071


In [31]:
# set the predictions
y_pred_zlw_le_std_lr = model_zlw_le_std_lr.predict(X_zlw_le_std_test)

# compare predicted vs actual
results_zlw_le_std_lr = pd.DataFrame({"Actual": y_zlw_le_std_test, "Predicted": y_pred_zlw_le_std_lr, "Absolute Difference": abs(y_pred_zlw_le_std_lr-y_zlw_le_std_test)})

# calculate statistical metrics
MAE_zlw_le_std_lr = metrics.mean_absolute_error(y_zlw_le_std_test, y_pred_zlw_le_std_lr)
MSE_zlw_le_std_lr = metrics.mean_squared_error(y_zlw_le_std_test, y_pred_zlw_le_std_lr)
RMSE_zlw_le_std_lr = np.sqrt(MSE_zlw_le_std_lr)
R2_zlw_le_std_lr = metrics.r2_score(y_zlw_le_std_test,y_pred_zlw_le_std_lr)
# aR2: https://www.statology.org/adjusted-r-squared-in-python/
aR2_zlw_le_std_lr = 1-(((1-R2_zlw_le_std_lr)*(len(y_zlw_le_std_test)-1))/(len(y_zlw_le_std_test)-X_zlw_le_std_test.shape[1]-1))

print(f'The maximum difference between predicted and actual price is ${(results_zlw_le_std_lr["Absolute Difference"].max()):,.2f}')
print(f'The minimum difference between predicted and actual price is ${(results_zlw_le_std_lr["Absolute Difference"].min()):,.2f}')
print(f'The median difference between predicted and actual price is ${(results_zlw_le_std_lr["Absolute Difference"].median()):,.2f}')
print(f'The average difference between predicted and actual price is ${(results_zlw_le_std_lr["Absolute Difference"].mean()):,.2f}\n')

print(f'The MAE is: ${(MAE_zlw_le_std_lr):,.2f}\nThe MSE is {(MSE_zlw_le_std_lr):.2f}\nThe RMSE is ${(RMSE_zlw_le_std_lr):,.2f}')
print(f'The R-squared value is {(R2_zlw_le_std_lr):.6f}\nThe adjusted R-squared value is {(aR2_zlw_le_std_lr):.6f}')
results_zlw_le_std_lr.head()

The maximum difference between predicted and actual price is $1,161,876.80
The minimum difference between predicted and actual price is $34.49
The median difference between predicted and actual price is $25,357.57
The average difference between predicted and actual price is $38,882.02

The MAE is: $38,882.02
The MSE is 4821157681.17
The RMSE is $69,434.56
The R-squared value is 0.922891
The adjusted R-squared value is 0.922563
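The adjusted R-squared expression in the cell above is repeated for every model. It could be factored into a small helper; `adjusted_r2` is a hypothetical name that does not appear in the notebook. Plugging in the rounded values printed above (R-squared 0.922891, 3071 test rows, 13 features) reproduces the reported figure.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R-squared: penalises R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Values taken from the printed output above
result = adjusted_r2(0.922891, n_samples=3071, n_features=13)
print(f"{result:.6f}")
```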


google_address,Actual,Predicted,Absolute Difference
350 6th Street,769000,766032.945759,2967.054241
914 Northeast Portland Boulevard Court,570000,592515.29748,22515.29748
20855 Southwest 90th Avenue,450000,465556.432584,15556.432584
8827 Southeast Knapp Street,375000,328118.029069,46881.970931
1117-1119 Northeast 60th Avenue,265000,224193.753975,40806.246025
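
For reference, the adjusted R-squared line in the cell above appears to implement the standard formula below. Here $n$ stands for the number of test rows (`len(y_zlw_le_std_test)`) and $p$ for the number of features (`X_zlw_le_std_test.shape[1]`); the symbols $n$ and $p$ are notation introduced here, not names used in the notebook.

```latex
\bar{R}^2 = 1 - \frac{(1 - R^2)\,(n - 1)}{n - p - 1},
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```

Because $p \ge 1$, the adjusted value is never larger than the plain R-squared, which matches the printed 0.922563 versus 0.922891. The printed RMSE of $69,434.56 is likewise the square root of the printed MSE of 4821157681.17, up to rounding.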


##### Zillow Label Encoded with MinMaxScaler

In [32]:
# set the model type
# zlw label encoded with minmax scaler
model_zlw_le_minmax_lr = LinearRegression()

In [33]:
# fit the model to the training data and calculate scores for the training and testing data
model_zlw_le_minmax_lr.fit(X_zlw_le_minmax_train, y_zlw_le_minmax_train)
training_score_zlw_le_minmax_lr = model_zlw_le_minmax_lr.score(X_zlw_le_minmax_train, y_zlw_le_minmax_train)
testing_score_std_zlw_le_minmax_lr = model_zlw_le_minmax_lr.score(X_zlw_le_minmax_test, y_zlw_le_minmax_test)

print(f"Training Score: {training_score_zlw_le_minmax_lr}")
print(f"Testing Score: {testing_score_std_zlw_le_minmax_lr}")

Training Score: 0.9364205636135193
Testing Score: 0.922891180799107


In [34]:
# set the predictions
y_pred_zlw_le_minmax_lr = model_zlw_le_minmax_lr.predict(X_zlw_le_minmax_test)

# compare predicted vs actual
results_zlw_le_minmax_lr = pd.DataFrame({"Actual": y_zlw_le_minmax_test, "Predicted": y_pred_zlw_le_minmax_lr, "Absolute Difference": abs(y_pred_zlw_le_minmax_lr-y_zlw_le_minmax_test)})

# calculate statistical metrics
MAE_zlw_le_minmax_lr = metrics.mean_absolute_error(y_zlw_le_minmax_test, y_pred_zlw_le_minmax_lr)
MSE_zlw_le_minmax_lr = metrics.mean_squared_error(y_zlw_le_minmax_test, y_pred_zlw_le_minmax_lr)
RMSE_zlw_le_minmax_lr = np.sqrt(MSE_zlw_le_minmax_lr)
R2_zlw_le_minmax_lr = metrics.r2_score(y_zlw_le_minmax_test,y_pred_zlw_le_minmax_lr)
# aR2: https://www.statology.org/adjusted-r-squared-in-python/
aR2_zlw_le_minmax_lr = 1-(((1-R2_zlw_le_minmax_lr)*(len(y_zlw_le_minmax_test)-1))/(len(y_zlw_le_minmax_test)-X_zlw_le_minmax_test.shape[1]-1))

print(f'The maximum difference between predicted and actual price is ${(results_zlw_le_minmax_lr["Absolute Difference"].max()):,.2f}')
print(f'The minimum difference between predicted and actual price is ${(results_zlw_le_minmax_lr["Absolute Difference"].min()):,.2f}')
print(f'The median difference between predicted and actual price is ${(results_zlw_le_minmax_lr["Absolute Difference"].median()):,.2f}')
print(f'The average difference between predicted and actual price is ${(results_zlw_le_minmax_lr["Absolute Difference"].mean()):,.2f}\n')

print(f'The MAE is: ${(MAE_zlw_le_minmax_lr):,.2f}\nThe MSE is {(MSE_zlw_le_minmax_lr):.2f}\nThe RMSE is ${(RMSE_zlw_le_minmax_lr):,.2f}')
print(f'The R-squared value is {(R2_zlw_le_minmax_lr):.6f}\nThe adjusted R-squared value is {(aR2_zlw_le_minmax_lr):.6f}')
results_zlw_le_minmax_lr.head()

The maximum difference between predicted and actual price is $1,161,876.80
The minimum difference between predicted and actual price is $34.49
The median difference between predicted and actual price is $25,357.57
The average difference between predicted and actual price is $38,882.02

The MAE is: $38,882.02
The MSE is 4821157681.17
The RMSE is $69,434.56
The R-squared value is 0.922891
The adjusted R-squared value is 0.922563


google_address,Actual,Predicted,Absolute Difference
350 6th Street,769000,766032.945759,2967.054241
914 Northeast Portland Boulevard Court,570000,592515.29748,22515.29748
20855 Southwest 90th Avenue,450000,465556.432584,15556.432584
8827 Southeast Knapp Street,375000,328118.029069,46881.970931
1117-1119 Northeast 60th Avenue,265000,224193.753975,40806.246025


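The Standard and MinMax scalers give identical results for Linear Regression on the Zillow dataset: the testing score (0.922891), MAE, MSE, and RMSE all match. This is expected, because ordinary least squares with an intercept is unaffected by a linear rescaling of the features. Either scaler can be used.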
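
The invariance can be checked on synthetic data. The sketch below is only an illustration under stated assumptions: the data is randomly generated (it is not the Zillow data), and the names `fit_predict`, `pred_std`, `pred_minmax`, and `max_gap` are hypothetical helpers introduced here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Synthetic stand-in data (assumption): three features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1.0, 50.0, 1000.0]
y = X @ np.array([3.0, -0.2, 0.01]) + rng.normal(scale=0.5, size=200)

# Simple ordered split, just for the demonstration
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]


def fit_predict(scaler):
    """Scale with the given scaler, fit LinearRegression, predict on the test rows."""
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    model = LinearRegression().fit(X_train_scaled, y_train)
    return model.predict(X_test_scaled)


pred_std = fit_predict(StandardScaler())
pred_minmax = fit_predict(MinMaxScaler())

# Largest disagreement between the two sets of predictions
max_gap = float(np.max(np.abs(pred_std - pred_minmax)))
print(f"Largest prediction gap between scalers: {max_gap:.2e}")
```

The gap should be at the level of floating-point noise. The fitted coefficients differ between the two scalers, but the predictions, and therefore every metric derived from them, do not.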

##### Zillow OneHot Encoded with StandardScaler

In [43]:
# # set the model type
# # zlw onehot encoded with standard scaler
# model_zlw_ohe_std_lr = LinearRegression()

In [44]:
# # fit the model to the training data and calculate scores for the training and testing data
# model_zlw_ohe_std_lr.fit(X_zlw_ohe_std_train, y_zlw_ohe_std_train)
# training_score_zlw_ohe_std_lr = model_zlw_ohe_std_lr.score(X_zlw_ohe_std_train, y_zlw_ohe_std_train)
# testing_score_zlw_ohe_std_lr = model_zlw_ohe_std_lr.score(X_zlw_ohe_std_test, y_zlw_ohe_std_test)

# print(f"Training Score: {training_score_zlw_ohe_std_lr}")
# print(f"Testing Score: {testing_score_zlw_ohe_std_lr}")

In [45]:
# # set the predictions
# y_pred_zlw_ohe_std_lr = model_zlw_ohe_std_lr.predict(X_zlw_ohe_std_test)

# # compare predicted vs actual
# results_zlw_ohe_std_lr = pd.DataFrame({"Actual": y_zlw_ohe_std_test, "Predicted": y_pred_zlw_ohe_std_lr, "Absolute Difference": abs(y_pred_zlw_ohe_std_lr-y_zlw_ohe_std_test)})

# # calculate statistical metrics
# MAE_zlw_ohe_std_lr = metrics.mean_absolute_error(y_zlw_ohe_std_test, y_pred_zlw_ohe_std_lr)
# MSE_zlw_ohe_std_lr = metrics.mean_squared_error(y_zlw_ohe_std_test, y_pred_zlw_ohe_std_lr)
# RMSE_zlw_ohe_std_lr = np.sqrt(MSE_zlw_ohe_std_lr)
# R2_zlw_ohe_std_lr = metrics.r2_score(y_zlw_ohe_std_test,y_pred_zlw_ohe_std_lr)
# # aR2: https://www.statology.org/adjusted-r-squared-in-python/
# aR2_zlw_ohe_std_lr = 1-(((1-R2_zlw_ohe_std_lr)*(len(y_zlw_ohe_std_test)-1))/(len(y_zlw_ohe_std_test)-X_zlw_ohe_std_test.shape[1]-1))

# print(f'The maximum difference between predicted and actual price is ${(results_zlw_ohe_std_lr["Absolute Difference"].max()):,.2f}')
# print(f'The minimum difference between predicted and actual price is ${(results_zlw_ohe_std_lr["Absolute Difference"].min()):,.2f}')
# print(f'The median difference between predicted and actual price is ${(results_zlw_ohe_std_lr["Absolute Difference"].median()):,.2f}')
# print(f'The average difference between predicted and actual price is ${(results_zlw_ohe_std_lr["Absolute Difference"].mean()):,.2f}\n')

# print(f'The MAE is: ${(MAE_zlw_ohe_std_lr):,.2f}\nThe MSE is {(MSE_zlw_ohe_std_lr):.2f}\nThe RMSE is ${(RMSE_zlw_ohe_std_lr):,.2f}')
# print(f'The R-squared value is {(R2_zlw_ohe_std_lr):.6f}\nThe adjusted R-squared value is {(aR2_zlw_ohe_std_lr):.6f}')
# results_zlw_ohe_std_lr.head()

##### Zillow OneHot Encoded with MinMaxScaler

In [46]:
# # set the model type
# # zlw onehot encoded with minmax scaler
# model_zlw_ohe_minmax_lr = LinearRegression()

In [47]:
# # fit the model to the training data and calculate scores for the training and testing data
# model_zlw_ohe_minmax_lr.fit(X_zlw_ohe_minmax_train, y_zlw_ohe_minmax_train)
# training_score_zlw_ohe_minmax_lr = model_zlw_ohe_minmax_lr.score(X_zlw_ohe_minmax_train, y_zlw_ohe_minmax_train)
# testing_score_zlw_ohe_minmax_lr = model_zlw_ohe_minmax_lr.score(X_zlw_ohe_minmax_test, y_zlw_ohe_minmax_test)

# print(f"Training Score: {training_score_zlw_ohe_minmax_lr}")
# print(f"Testing Score: {testing_score_zlw_ohe_minmax_lr}")

In [48]:
# # set the predictions
# y_pred_zlw_ohe_minmax_lr = model_zlw_ohe_minmax_lr.predict(X_zlw_ohe_minmax_test)

# # compare predicted vs actual
# results_zlw_ohe_minmax_lr = pd.DataFrame({"Actual": y_zlw_ohe_minmax_test, "Predicted": y_pred_zlw_ohe_minmax_lr, "Absolute Difference": abs(y_pred_zlw_ohe_minmax_lr-y_zlw_ohe_minmax_test)})

# # calculate statistical metrics
# MAE_zlw_ohe_minmax_lr = metrics.mean_absolute_error(y_zlw_ohe_minmax_test, y_pred_zlw_ohe_minmax_lr)
# MSE_zlw_ohe_minmax_lr = metrics.mean_squared_error(y_zlw_ohe_minmax_test, y_pred_zlw_ohe_minmax_lr)
# RMSE_zlw_ohe_minmax_lr = np.sqrt(MSE_zlw_ohe_minmax_lr)
# R2_zlw_ohe_minmax_lr = metrics.r2_score(y_zlw_ohe_minmax_test,y_pred_zlw_ohe_minmax_lr)
# # aR2: https://www.statology.org/adjusted-r-squared-in-python/
# aR2_zlw_ohe_minmax_lr = 1-(((1-R2_zlw_ohe_minmax_lr)*(len(y_zlw_ohe_minmax_test)-1))/(len(y_zlw_ohe_minmax_test)-X_zlw_ohe_minmax_test.shape[1]-1))

# print(f'The maximum difference between predicted and actual price is ${(results_zlw_ohe_minmax_lr["Absolute Difference"].max()):,.2f}')
# print(f'The minimum difference between predicted and actual price is ${(results_zlw_ohe_minmax_lr["Absolute Difference"].min()):,.2f}')
# print(f'The median difference between predicted and actual price is ${(results_zlw_ohe_minmax_lr["Absolute Difference"].median()):,.2f}')
# print(f'The average difference between predicted and actual price is ${(results_zlw_ohe_minmax_lr["Absolute Difference"].mean()):,.2f}\n')

# print(f'The MAE is: ${(MAE_zlw_ohe_minmax_lr):,.2f}\nThe MSE is {(MSE_zlw_ohe_minmax_lr):.2f}\nThe RMSE is ${(RMSE_zlw_ohe_minmax_lr):,.2f}')
# print(f'The R-squared value is {(R2_zlw_ohe_minmax_lr):.6f}\nThe adjusted R-squared value is {(aR2_zlw_ohe_minmax_lr):.6f}')
# results_zlw_ohe_minmax_lr.head()

Based on the results from both the Standard and MinMax scalers, OneHot encoding (OHE) is not recommended for Linear Regression on the Zillow dataset; the OHE cells above are left commented out.
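
The notebook does not state why OHE performed poorly. One possible factor, offered here only as an assumption, is that one-hot encoding a column with many distinct values adds one feature per value, while label encoding keeps a single column. The sketch below shows the difference in shape on a tiny made-up frame; the rows and the neighborhood labels `"A"`, `"B"`, `"C"` are placeholders, not values from the dataset.

```python
import pandas as pd

# Toy frame (assumption): placeholder values, not rows from the Zillow table
df = pd.DataFrame({
    "bedrooms": [2, 3, 4, 3],
    "google_neighborhood": ["A", "B", "C", "A"],
})

# Label encoding: the categorical column becomes one integer column
label_encoded = df.assign(
    google_neighborhood=df["google_neighborhood"].astype("category").cat.codes
)

# One-hot encoding: the categorical column becomes one column per distinct value
one_hot = pd.get_dummies(df, columns=["google_neighborhood"])

print("label encoded shape:", label_encoded.shape)
print("one-hot encoded shape:", one_hot.shape)
```

With hundreds of distinct categories the one-hot frame grows by hundreds of columns, which also lowers the adjusted R-squared for a given R-squared because the feature count enters its denominator.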

##### AirBNB Label Encoded with StandardScaler

In [53]:
# set the model type
# airbnb label encoded with standard scaler
model_airbnb_le_std_lr = LinearRegression()

In [54]:
# fit the model to the training data and calculate scores for the training and testing data
model_airbnb_le_std_lr.fit(X_airbnb_le_std_train, y_airbnb_le_std_train)
training_score_airbnb_le_std_lr = model_airbnb_le_std_lr.score(X_airbnb_le_std_train, y_airbnb_le_std_train)
testing_score_std_airbnb_le_std_lr = model_airbnb_le_std_lr.score(X_airbnb_le_std_test, y_airbnb_le_std_test)

print(f"Training Score: {training_score_airbnb_le_std_lr}")
print(f"Testing Score: {testing_score_std_airbnb_le_std_lr}")

Training Score: 0.37783794964074047
Testing Score: 0.3455694717384111


In [55]:
# set the predictions
y_pred_airbnb_le_std_lr = model_airbnb_le_std_lr.predict(X_airbnb_le_std_test)

# compare predicted vs actual
results_airbnb_le_std_lr = pd.DataFrame({"Actual": y_airbnb_le_std_test, "Predicted": y_pred_airbnb_le_std_lr, "Absolute Difference": abs(y_pred_airbnb_le_std_lr-y_airbnb_le_std_test)})

# calculate statistical metrics
MAE_airbnb_le_std_lr = metrics.mean_absolute_error(y_airbnb_le_std_test, y_pred_airbnb_le_std_lr)
MSE_airbnb_le_std_lr = metrics.mean_squared_error(y_airbnb_le_std_test, y_pred_airbnb_le_std_lr)
RMSE_airbnb_le_std_lr = np.sqrt(MSE_airbnb_le_std_lr)
R2_airbnb_le_std_lr = metrics.r2_score(y_airbnb_le_std_test,y_pred_airbnb_le_std_lr)
# aR2: https://www.statology.org/adjusted-r-squared-in-python/
aR2_airbnb_le_std_lr = 1-(((1-R2_airbnb_le_std_lr)*(len(y_airbnb_le_std_test)-1))/(len(y_airbnb_le_std_test)-X_airbnb_le_std_test.shape[1]-1))

print(f'The maximum difference between predicted and actual price is ${(results_airbnb_le_std_lr["Absolute Difference"].max()):,.2f}')
print(f'The minimum difference between predicted and actual price is ${(results_airbnb_le_std_lr["Absolute Difference"].min()):,.2f}')
print(f'The median difference between predicted and actual price is ${(results_airbnb_le_std_lr["Absolute Difference"].median()):,.2f}')
print(f'The average difference between predicted and actual price is ${(results_airbnb_le_std_lr["Absolute Difference"].mean()):,.2f}\n')

print(f'The MAE is: ${(MAE_airbnb_le_std_lr):,.2f}\nThe MSE is {(MSE_airbnb_le_std_lr):.2f}\nThe RMSE is ${(RMSE_airbnb_le_std_lr):,.2f}')
print(f'The R-squared value is {(R2_airbnb_le_std_lr):.6f}\nThe adjusted R-squared value is {(aR2_airbnb_le_std_lr):.6f}')
results_airbnb_le_std_lr.head()

The maximum difference between predicted and actual price is $891.37
The minimum difference between predicted and actual price is $0.15
The median difference between predicted and actual price is $32.29
The average difference between predicted and actual price is $56.52

The MAE is: $56.52
The MSE is 9101.08
The RMSE is $95.40
The R-squared value is 0.345569
The adjusted R-squared value is 0.306372


google_address,Actual,Predicted,Absolute Difference
1313 Southeast 26th Avenue,68.0,56.217265,11.782735
5757 Northeast Sumner Street,6.0,92.909678,86.909678
4057 Northeast 10th Avenue,93.0,104.550238,11.550238
4541 Northeast 35th Avenue,85.0,125.87628,40.87628
117 South Whitaker Street,18.0,112.576847,94.576847
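
The metric calculations are repeated almost verbatim for every dataset, encoder, and scaler combination. As a hedged sketch, they could be wrapped in one helper; `regression_report` and its `n_features` parameter are hypothetical names introduced here, and the small arrays in the usage lines are made-up numbers rather than notebook data.

```python
import numpy as np
from sklearn import metrics


def regression_report(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R-squared and adjusted R-squared as a dict.

    n_features plays the role of X_test.shape[1] in the cells above.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n_rows = len(y_true)

    mae = metrics.mean_absolute_error(y_true, y_pred)
    mse = metrics.mean_squared_error(y_true, y_pred)
    r2 = metrics.r2_score(y_true, y_pred)
    # Same adjusted R-squared expression as used in the notebook cells
    adj_r2 = 1 - ((1 - r2) * (n_rows - 1)) / (n_rows - n_features - 1)

    return {"MAE": mae, "MSE": mse, "RMSE": float(np.sqrt(mse)), "R2": r2, "aR2": adj_r2}


# Usage with made-up values (assumption): five observations, one feature
report = regression_report([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 6.0], n_features=1)
for name, value in report.items():
    print(f"{name}: {value:.6f}")
```

In the notebook this would be called as, for example, `regression_report(y_airbnb_le_std_test, y_pred_airbnb_le_std_lr, X_airbnb_le_std_test.shape[1])`, which keeps the variants comparable and avoids copy-paste slips between cells.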


##### AirBNB Label Encoded with MinMaxScaler

In [49]:
# set the model type
# airbnb label encoded with minmax scaler
model_airbnb_le_minmax_lr = LinearRegression()

In [51]:
# fit the model to the training data and calculate scores for the training and testing data
model_airbnb_le_minmax_lr.fit(X_airbnb_le_minmax_train, y_airbnb_le_minmax_train)
training_score_airbnb_le_minmax_lr = model_airbnb_le_minmax_lr.score(X_airbnb_le_minmax_train, y_airbnb_le_minmax_train)
testing_score_std_airbnb_le_minmax_lr = model_airbnb_le_minmax_lr.score(X_airbnb_le_minmax_test, y_airbnb_le_minmax_test)

print(f"Training Score: {training_score_airbnb_le_minmax_lr}")
print(f"Testing Score: {testing_score_std_airbnb_le_minmax_lr}")

Training Score: 0.37783794964074047
Testing Score: 0.34556947173841035


In [52]:
# set the predictions
y_pred_airbnb_le_minmax_lr = model_airbnb_le_minmax_lr.predict(X_airbnb_le_minmax_test)

# compare predicted vs actual
results_airbnb_le_minmax_lr = pd.DataFrame({"Actual": y_airbnb_le_minmax_test, "Predicted": y_pred_airbnb_le_minmax_lr, "Absolute Difference": abs(y_pred_airbnb_le_minmax_lr-y_airbnb_le_minmax_test)})

# calculate statistical metrics
MAE_airbnb_le_minmax_lr = metrics.mean_absolute_error(y_airbnb_le_minmax_test, y_pred_airbnb_le_minmax_lr)
MSE_airbnb_le_minmax_lr = metrics.mean_squared_error(y_airbnb_le_minmax_test, y_pred_airbnb_le_minmax_lr)
RMSE_airbnb_le_minmax_lr = np.sqrt(MSE_airbnb_le_minmax_lr)
R2_airbnb_le_minmax_lr = metrics.r2_score(y_airbnb_le_minmax_test,y_pred_airbnb_le_minmax_lr)
# aR2: https://www.statology.org/adjusted-r-squared-in-python/
aR2_airbnb_le_minmax_lr = 1-(((1-R2_airbnb_le_minmax_lr)*(len(y_airbnb_le_minmax_test)-1))/(len(y_airbnb_le_minmax_test)-X_airbnb_le_minmax_test.shape[1]-1))

print(f'The maximum difference between predicted and actual price is ${(results_airbnb_le_minmax_lr["Absolute Difference"].max()):,.2f}')
print(f'The minimum difference between predicted and actual price is ${(results_airbnb_le_minmax_lr["Absolute Difference"].min()):,.2f}')
print(f'The median difference between predicted and actual price is ${(results_airbnb_le_minmax_lr["Absolute Difference"].median()):,.2f}')
print(f'The average difference between predicted and actual price is ${(results_airbnb_le_minmax_lr["Absolute Difference"].mean()):,.2f}\n')

print(f'The MAE is: ${(MAE_airbnb_le_minmax_lr):,.2f}\nThe MSE is {(MSE_airbnb_le_minmax_lr):.2f}\nThe RMSE is ${(RMSE_airbnb_le_minmax_lr):,.2f}')
print(f'The R-squared value is {(R2_airbnb_le_minmax_lr):.6f}\nThe adjusted R-squared value is {(aR2_airbnb_le_minmax_lr):.6f}')
results_airbnb_le_minmax_lr.head()

The maximum difference between predicted and actual price is $891.37
The minimum difference between predicted and actual price is $0.15
The median difference between predicted and actual price is $32.29
The average difference between predicted and actual price is $56.52

The MAE is: $56.52
The MSE is 9101.08
The RMSE is $95.40
The R-squared value is 0.345569
The adjusted R-squared value is 0.306372


google_address,Actual,Predicted,Absolute Difference
1313 Southeast 26th Avenue,68.0,56.217265,11.782735
5757 Northeast Sumner Street,6.0,92.909678,86.909678
4057 Northeast 10th Avenue,93.0,104.550238,11.550238
4541 Northeast 35th Avenue,85.0,125.87628,40.87628
117 South Whitaker Street,18.0,112.576847,94.576847


There is no difference between the Standard and MinMax scalers on the AirBNB dataset when performing Linear Regression (the testing R-squared is 0.345569 in both cases). Either can be used.