# Evaluation of Machine Learning Models for Zillow/AirBNB Datasets

This document has two parts:
1. Comparing models
2. Using the selected model to make predictions

# Section 1 - Comparing Models

### Regression vs Classification Models

Supervised learning on these datasets calls for a regression model. The target we want to predict (price) is continuous, which requires regression modeling. Logistic regression predicts discrete classes, so it is not applicable to these datasets.

### Linear Regression vs Random Forest

Since our datasets contain a mix of continuous and categorical variables, a random forest model may outperform a linear regression model: it does not assume a linear relationship between the features and the target.

We will test both a random forest model and linear regression.

## Prepare the data for the models

In [None]:
# import dependencies
import pandas as pd
import numpy as np
from datetime import datetime
import seaborn as sb

import sqlite3
# Python SQL toolkit and Object Relational Mapper
import sqlalchemy
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session
from sqlalchemy import create_engine, inspect, func

# modeling dependencies
from sklearn.linear_model import LinearRegression
# price is a continuous target, so use the regressor rather than the classifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
# regression metrics (confusion matrix and accuracy apply only to classification)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
# Load the data
con_zlw = sqlite3.connect("../finalprojectdatabase.db")
zlw = pd.read_sql_query("SELECT * from zillow_google", con_zlw)

con_airbnb = sqlite3.connect("../finalprojectdatabase.db")
airbnb = pd.read_sql_query("SELECT * from arbnb_google", con_airbnb)

In [None]:
zlw.head()

In [None]:
# set address as index
zlw = zlw.set_index('google_address', drop=True)
print(zlw.columns.to_list())
zlw.head()

In [None]:
# drop columns not needed for machine learning
zlw = zlw.drop(['Address',
                'City',
                'Zipcode',
                #'bathrooms',
                #'bedrooms',
                #'daysOnZillow',
                #'homeType',
                'latitude',
                #'livingArea',
                'longitude',
                #'lotSize',
                'date_sold',
                #'price',
                #'pricePerSquareFoot',
                #'rentZestimate',
                #'taxAssessedValue',
                #'taxAssessedYear',
                'url',
                #'yearBuilt',
                'house_number',
                'street_name',
                #'google_zip',
                #'google_neighborhood',
                'lat_lng',
                'zipcode_length'], axis=1)
zlw.dtypes

In [None]:
airbnb.head()

In [None]:
# set address as index
airbnb = airbnb.set_index('google_address', drop=True)
print(airbnb.columns.to_list())
airbnb.head()

In [None]:
# drop columns not needed for machine learning
airbnb = airbnb.drop(['listing_url',
                       #'host_response_time',
                       #'host_response_rate',
                       #'host_acceptance_rate',
                       #'host_is_superhost',
                       #'host_identity_verified',
                       'neighbourhood_cleansed',
                       'latitude',
                       'longitude',
                       #'room_type',
                       #'accommodates',
                       #'bathrooms',
                       #'bedrooms',
                       #'beds',
                       #'price',
                       #'minimum_nights',
                       #'maximum_nights',
                       #'number_of_reviews_l30d',
                       'last_review',
                       #'review_scores_rating',
                       #'review_scores_accuracy',
                       #'review_scores_cleanliness',
                       #'review_scores_checkin',
                       #'review_scores_communication',
                       #'review_scores_location',
                       #'review_scores_value',
                       #'reviews_per_month',
                       'house_number',
                       'street_name',
                       #'google_zip',
                       #'google_neighborhood',
                       'lat_lng'], axis=1)
airbnb.dtypes

### Test OneHotEncoder vs Label Encoding

In [None]:
le = LabelEncoder()
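# `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4;
# on older versions use OneHotEncoder(sparse=False) instead
ohe = OneHotEncoder(sparse_output=False)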

In [None]:
# set columns to check to list

In [None]:
zlw_le = zlw.copy()
zlw_ohe = zlw.copy()
airbnb_le = airbnb.copy()
airbnb_ohe = airbnb.copy()

In [None]:
zlw_cat_col = zlw.select_dtypes(include='object').columns
airbnb_cat_col = airbnb.select_dtypes(include=['object']).columns

In [None]:
# Label Encoder: replace each categorical column with integer codes
for col in zlw_cat_col:
    zlw_le[col] = le.fit_transform(zlw_le[col].astype(str))

for col in airbnb_cat_col:
    airbnb_le[col] = le.fit_transform(airbnb_le[col].astype(str))

In [None]:
zlw_le.head()

In [None]:
airbnb_le.head()

In [None]:
# One-hot encoding via pd.get_dummies (one indicator column per category)
zlw_dum = pd.get_dummies(zlw_ohe, columns = zlw_cat_col)
airbnb_dum = pd.get_dummies(airbnb_ohe, columns = airbnb_cat_col)

In [None]:
zlw_dum.head()

In [None]:
airbnb_dum.head()
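
To make the difference between the two encodings concrete, here is a minimal sketch on a toy frame. The column name `homeType` mirrors a column kept in the Zillow data, but the values are made up for illustration and `toy`, `toy_le`, and `toy_dum` are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; values are invented for illustration only
toy = pd.DataFrame({
    "homeType": ["CONDO", "SINGLE_FAMILY", "TOWNHOUSE", "CONDO"],
    "bedrooms": [1, 3, 2, 2],
})

# Label encoding: a single integer column, codes follow sorted class order
toy_le = toy.copy()
toy_le["homeType"] = LabelEncoder().fit_transform(toy_le["homeType"])

# One-hot encoding: one indicator column per category
toy_dum = pd.get_dummies(toy, columns=["homeType"])

print(toy_le)
print(toy_dum.columns.to_list())
```

Label encoding keeps the frame narrow but implies an ordering between categories that may not exist, which can matter for linear regression. One-hot encoding avoids that at the cost of more columns.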

#### Set variables & perform scaling

In [None]:
std_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()

In [None]:

# LE
y_zlw_le = zlw_le.price
X_zlw_le = zlw_le.drop('price', axis=1)

y_airbnb_le = airbnb_le.price
X_airbnb_le = airbnb_le.drop('price', axis=1)

# OHE
y_zlw_ohe = zlw_dum.price
X_zlw_ohe = zlw_dum.drop('price', axis=1)

y_airbnb_ohe = airbnb_dum.price
X_airbnb_ohe = airbnb_dum.drop('price', axis=1)

In [None]:
# fit and transform the X data - StandardScaler

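# LE (assign back into the DataFrame so it keeps its columns and index, as in the OHE case)
X_zlw_le[X_zlw_le.columns] = std_scaler.fit_transform(X_zlw_le[X_zlw_le.columns].values)
X_airbnb_le[X_airbnb_le.columns] = std_scaler.fit_transform(X_airbnb_le[X_airbnb_le.columns].values)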

# OHE
X_zlw_ohe[X_zlw_ohe.columns] = std_scaler.fit_transform(X_zlw_ohe[X_zlw_ohe.columns].values)
X_airbnb_ohe[X_airbnb_ohe.columns] = std_scaler.fit_transform(X_airbnb_ohe[X_airbnb_ohe.columns].values)

In [None]:
# # fit and transform the X data - MinMaxScaler
# # LE
# X_zlw_le[X_zlw_le.columns] = minmax_scaler.fit_transform(X_zlw_le[X_zlw_le.columns].values)
# X_airbnb_le[X_airbnb_le.columns] = minmax_scaler.fit_transform(X_airbnb_le[X_airbnb_le.columns].values)

# # OHE
# X_zlw_ohe[X_zlw_ohe.columns] = minmax_scaler.fit_transform(X_zlw_ohe[X_zlw_ohe.columns].values)
# X_airbnb_ohe[X_airbnb_ohe.columns] = minmax_scaler.fit_transform(X_airbnb_ohe[X_airbnb_ohe.columns].values)

In [None]:
# check the mean (~0) and STD (~1) of standard scaler
X_airbnb_ohe.describe()

In [None]:
X_airbnb_ohe.shape
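
The "~1" in the check above can be illustrated with a small sketch. `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`), so the reported value is close to 1 but not exactly 1. The frame `demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features; values are invented for illustration only
demo = pd.DataFrame({
    "livingArea": [850.0, 1200.0, 1500.0, 2400.0],
    "bedrooms": [1.0, 2.0, 3.0, 4.0],
})

scaled = pd.DataFrame(
    StandardScaler().fit_transform(demo),
    columns=demo.columns,
    index=demo.index,
)

print(scaled.mean().round(6))        # column means after scaling
print(scaled.std(ddof=0).round(6))   # population std, the one StandardScaler normalizes
print(scaled.std().round(6))         # sample std (ddof=1), the one describe() reports
```

With many rows the two standard deviations are nearly identical, which is why the check on the full dataset shows values close to 1.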

#### Split into Testing & Training Data
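
A minimal sketch of the split, using synthetic data in place of the encoded feature matrices. The names `X_demo` and `y_demo`, the 25% test size, and the `random_state` value are assumptions for illustration, not settings taken from this project. The sketch also fits the scaler on the training rows only, which is a common way to keep information from the test set out of training; the cells above scale the full dataset before splitting.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an encoded feature matrix and a price target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(rng.normal(size=100), name="price")

# Hold out 25% of the rows; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```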

## Linear Regression Model
See the [Linear Regression Model](./ML_Model_Final.ipynb) for the linear regression.

## Random Forest Model
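
As a rough sketch of how the random forest could be fit and compared against linear regression, the example below uses `RandomForestRegressor` on synthetic data with a deliberately nonlinear target. The data, the hyperparameters (`n_estimators=100`, `random_state=42`), and the choice of R² as the metric are assumptions for illustration; the result says nothing about how the two models perform on the Zillow or Airbnb data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a step in the first feature plus a quadratic in the second
rng = np.random.default_rng(1)
X_demo = rng.uniform(-3, 3, size=(400, 2))
y_demo = (
    np.where(X_demo[:, 0] > 0, 10.0, -10.0)
    + X_demo[:, 1] ** 2
    + rng.normal(scale=0.5, size=400)
)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Fit both models on the same training rows
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
lr = LinearRegression().fit(X_train, y_train)

# Score both on the held-out rows
rf_r2 = r2_score(y_test, rf.predict(X_test))
lr_r2 = r2_score(y_test, lr.predict(X_test))
print(f"Random forest R^2:     {rf_r2:.3f}")
print(f"Linear regression R^2: {lr_r2:.3f}")
```

On data like this the forest is expected to score higher, because a straight line cannot follow the step or the curve. On the real datasets the comparison has to be made empirically.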