# House Sales in King County, USA

### Description
This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

It's a great dataset for evaluating simple regression models.

# Business Understanding
##### Problem: 
Our client wants a machine learning model that can accurately predict house prices.

##### Clear Questions: 
- How can we predict house prices accurately?

##### Analytic Approach: 
Regression (linear and polynomial)

##### Data Requirements / Features: 
- Rely on feature selection algorithms, followed by EDA

<img src="./feature-info.jpg">

In [None]:
!pip install autoviz geopy

# Data Understanding

In [None]:
# import libraries
import numpy as np
import pandas as pd 
pd.set_option('display.max_columns', 500)
import joblib # export model
from datetime import datetime # measure processing time

import category_encoders as ce # binary encoding

# machine learning
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# find location
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="agent")

#automate EDA
from autoviz.AutoViz_Class import AutoViz_Class

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# load the data
df = pd.read_csv('../input/housesalesprediction/kc_house_data.csv')
df.head()

In [None]:
# check tail
df.tail()

In [None]:
# check view
df['view'].unique() 

In [None]:
# check available columns
df.columns

In [None]:
# check shape
df.shape

In [None]:
# describe
df.describe()

In [None]:
# check column and its dtype
df.info()

In [None]:
# check missing value
df.isnull().sum()

In [None]:
# check unique values
df.nunique()


# Data Preparation

In [None]:
# slice date into year, month, day
df["date_year"] = df["date"].str.slice(0,4).astype(int)
df["date_month"] = df["date"].str.slice(4,6).astype(int)
df["date_day"] = df["date"].str.slice(6,8).astype(int)

df.head()
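
The same split can also be done by parsing the column as a timestamp. This is a minimal sketch, assuming the `date` strings look like `20141013T000000`; the toy frame `demo` is hypothetical and only stands in for `df`.

```python
import pandas as pd

# hypothetical sample in the assumed format YYYYMMDDTHHMMSS
demo = pd.DataFrame({"date": ["20141013T000000", "20150225T000000"]})

# parse once, then read the parts from the .dt accessor
parsed = pd.to_datetime(demo["date"], format="%Y%m%dT%H%M%S")
demo["date_year"] = parsed.dt.year
demo["date_month"] = parsed.dt.month
demo["date_day"] = parsed.dt.day

print(demo)
```

Parsing fails loudly on a malformed date, whereas string slicing would silently produce wrong numbers.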

In [None]:
# check column uniqueness
print('bedrooms:', df["bedrooms"].unique())
print('bathrooms:', df["bathrooms"].unique())
print('floors:', df["floors"].unique())
print('waterfront:', df["waterfront"].unique())
print('view:', df["view"].unique())
print('condition:', df["condition"].unique())
print('grade:', df["grade"].unique())
print('year_built:', df["yr_built"].unique())
print('year_renovated:', df["yr_renovated"].unique())
print('zipcode:', df["zipcode"].unique())
print('date_year:', df["date_year"].unique())
print('date_month:', df["date_month"].unique())
print('date_day:', df["date_day"].unique())

In [None]:
# generate new column house_age
df["house_age"] = df["date_year"] - df["yr_built"] 
df.head()

In [None]:
# a house_age of -1 is still possible: the house was sold the year before it was built.
df[df["house_age"]==-1]

In [None]:
# convert lat long to city using geopy
# based on research on https://www.kingcounty.gov/depts/health/codes/cities.aspx there are roughly 39 cities and towns in King County.
citytown = []

start = datetime.now()

for i in range(len(df)):
    try:
        # query Nominatim once per row, then fall back from 'city' to 'town'
        address = geolocator.reverse(f"{df['lat'][i]}, {df['long'][i]}").raw['address']
        c = address.get('city') or address.get('town') or 'none'
    except Exception:
        # the lookup failed or returned no result
        c = 'none'

    citytown.append(c)
    if(i%100 == 0):
        print(i)

end = datetime.now()

print('process time: ', end - start)
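
Reverse geocoding every row is slow, and many houses share almost the same coordinates. Below is a minimal sketch of caching lookups on rounded coordinates. `reverse_fn` stands in for a call such as `geolocator.reverse`; `cached_city`, `FakeLocation`, `fake_reverse` and the rounding precision are assumptions made for illustration, not part of geopy.

```python
def cached_city(reverse_fn, lat, lon, cache, ndigits=3):
    """Return the city/town for a coordinate, reusing earlier lookups."""
    key = (round(lat, ndigits), round(lon, ndigits))
    if key not in cache:
        try:
            address = reverse_fn(f"{key[0]}, {key[1]}").raw["address"]
            cache[key] = address.get("city") or address.get("town") or "none"
        except Exception:
            cache[key] = "none"
    return cache[key]


# hypothetical stand-in for geolocator.reverse, used only to show the call pattern
class FakeLocation:
    def __init__(self, address):
        self.raw = {"address": address}

calls = []
def fake_reverse(query):
    calls.append(query)
    return FakeLocation({"city": "Seattle"})

cache = {}
first = cached_city(fake_reverse, 47.51123, -122.25701, cache)
second = cached_city(fake_reverse, 47.51134, -122.25712, cache)  # same rounded key
print(first, second, len(calls))
```

Rounding trades a little location accuracy for far fewer requests, so the precision should be chosen with that in mind.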

In [None]:
#generate city_town column
df['city_town'] = citytown

In [None]:
# generate is_renovated
df["is_renovated"] = np.where(df["yr_renovated"] > 0, 1, df["yr_renovated"])
df.head()

In [None]:
# generate rnv_age
# assumed renovated house = full renovated = like new house.
df["rnv_age"] = np.where(df['yr_renovated'] > 0, df['date_year'] - df['yr_renovated'], df['yr_renovated'])
df["rnv_age"] = np.where(df['yr_renovated'] == 0, df['house_age'], df['rnv_age'])
df.head()
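
The two `np.where` calls above measure age from the renovation year when there is one, and from the construction year otherwise. Here is a small sketch of the same logic in one step, on a hypothetical three-row frame (`demo` and `last_work` are illustrative names):

```python
import numpy as np
import pandas as pd

# hypothetical rows: never renovated, renovated in 2000, sold before completion
demo = pd.DataFrame({
    "date_year": [2014, 2015, 2014],
    "yr_built": [1955, 1960, 2015],
    "yr_renovated": [0, 2000, 0],
})

demo["house_age"] = demo["date_year"] - demo["yr_built"]
demo["is_renovated"] = (demo["yr_renovated"] > 0).astype(int)

# use the renovation year when there is one, otherwise the construction year
last_work = np.where(demo["yr_renovated"] > 0, demo["yr_renovated"], demo["yr_built"])
demo["rnv_age"] = demo["date_year"] - last_work

print(demo)
```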

In [None]:
# check point 1 <- still not well done yet (medium rare)
df.to_csv('kc_data_house-mid.csv', encoding='utf-8', index=False)

In [None]:
# load check point 1
df_mid = pd.read_csv('kc_data_house-mid.csv')
df_mid.head()

In [None]:
# check house_age statistic
df_mid["house_age"].describe()

In [None]:
# check rnv_age statistic
df_mid["rnv_age"].describe()

In [None]:
# check yr_renovated unique values
df_mid["yr_renovated"].unique()

# Binning
Group into 5 different categories based on house_age and rnv_age

In [None]:
# house age
df_mid["house_age"] = np.where((df_mid["house_age"] >= -1) & (df_mid["house_age"] <= 27), 1, df_mid["house_age"]) #29 steps
df_mid["house_age"] = np.where((df_mid["house_age"] >= 28) & (df_mid["house_age"] <= 51), 2, df_mid["house_age"])
df_mid["house_age"] = np.where((df_mid["house_age"] >= 52) & (df_mid["house_age"] <= 75), 3, df_mid["house_age"]) 
df_mid["house_age"] = np.where((df_mid["house_age"] >= 76) & (df_mid["house_age"] <= 98), 4, df_mid["house_age"])
df_mid["house_age"] = np.where((df_mid["house_age"] >= 99) & (df_mid["house_age"] <= 122), 5, df_mid["house_age"])

df_mid["house_age"].head(20)

In [None]:
# rnv age
df_mid["rnv_age"] = np.where((df_mid["rnv_age"] >= -1) & (df_mid["rnv_age"] <= 27), 1, df_mid["rnv_age"]) #29 steps
df_mid["rnv_age"] = np.where((df_mid["rnv_age"] >= 28) & (df_mid["rnv_age"] <= 51), 2, df_mid["rnv_age"])
df_mid["rnv_age"] = np.where((df_mid["rnv_age"] >= 52) & (df_mid["rnv_age"] <= 75), 3, df_mid["rnv_age"]) 
df_mid["rnv_age"] = np.where((df_mid["rnv_age"] >= 76) & (df_mid["rnv_age"] <= 98), 4, df_mid["rnv_age"])
df_mid["rnv_age"] = np.where((df_mid["rnv_age"] >= 99) & (df_mid["rnv_age"] <= 122), 5, df_mid["rnv_age"])

df_mid["rnv_age"].head(20)
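
The same five bins can be expressed in a single call with `pd.cut`. This is a minimal sketch assuming integer ages between -1 and 122; the sample `ages` series is made up for illustration.

```python
import pandas as pd

ages = pd.Series([-1, 0, 27, 28, 51, 52, 75, 76, 98, 99, 115])

# right-inclusive edges: (-2, 27], (27, 51], (51, 75], (75, 98], (98, 122]
edges = [-2, 27, 51, 75, 98, 122]
binned = pd.cut(ages, bins=edges, labels=[1, 2, 3, 4, 5]).astype(int)

print(binned.tolist())
```

Keeping the edges in one list makes it harder for a bin boundary to be mistyped in one of several chained conditions.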

In [None]:
# check binned values
df_mid.head()

In [None]:
# binary Encoding -> zipcode, city_town
encoder = ce.BinaryEncoder(cols=['zipcode', 'city_town'], return_df=True)
df_mid = encoder.fit_transform(df_mid)
df_mid.head()
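
Binary encoding replaces each category with the bits of an integer code, so a handful of columns can cover many categories. The sketch below is a hand-rolled illustration of that idea, not `category_encoders` itself; the library's exact code assignment and number of columns may differ, and `binary_encode` is a hypothetical helper.

```python
import pandas as pd

def binary_encode(series, prefix):
    """Illustrative binary encoding: ordinal code (starting at 1) -> bit columns."""
    codes = pd.factorize(series)[0] + 1          # 1, 2, 3, ... in order of appearance
    n_bits = int(codes.max()).bit_length()       # bits needed for the largest code
    bits = {
        f"{prefix}_{i}": (codes >> (n_bits - 1 - i)) & 1   # most significant bit first
        for i in range(n_bits)
    }
    return pd.DataFrame(bits, index=series.index)

zipcodes = pd.Series([98178, 98125, 98028, 98178, 98136, 98074])
encoded = binary_encode(zipcodes, "zipcode")
print(encoded)
```

Five distinct zip codes fit in three bit columns here, where one-hot encoding would need five.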

In [None]:
# one hot encoding -> is_renovated
is_renovated_df = pd.get_dummies(df_mid.is_renovated, prefix='is_renovated')
df_mid = pd.concat([df_mid, is_renovated_df], axis=1)
df_mid.head()

In [None]:
# Selection
# check columns
df_mid.columns

In [None]:
# drop unintended column
y = df_mid["price"]
X = df_mid.drop(['id','date','lat','long','yr_built','yr_renovated','is_renovated','price'], axis=1)

In [None]:
X_columns = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'zipcode_0', 'zipcode_1', 'zipcode_2', 'zipcode_3',
       'zipcode_4', 'zipcode_5', 'zipcode_6', 'zipcode_7', 'sqft_living15',
       'sqft_lot15', 'date_year', 'date_month', 'date_day', 'house_age',
       'city_town_0', 'city_town_1', 'city_town_2', 'city_town_3',
       'city_town_4', 'city_town_5', 'city_town_6', 'rnv_age',
       'is_renovated_0', 'is_renovated_1']

In [None]:
# Min Max scaler -> X
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
X = pd.DataFrame(X)

X.columns = X_columns
X.head()
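
Min-max scaling maps each column to the range [0, 1] with x' = (x - min) / (max - min). Below is a minimal NumPy sketch of that formula. Taking the minimum and maximum from training rows only and reusing them on new rows is an assumption about workflow, not what the cell above does; all array names are illustrative.

```python
import numpy as np

X_train_demo = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_new_demo = np.array([[2.0, 60.0]])

# column-wise minimum and range, learned from the training rows
col_min = X_train_demo.min(axis=0)
col_range = X_train_demo.max(axis=0) - col_min

train_scaled = (X_train_demo - col_min) / col_range
new_scaled = (X_new_demo - col_min) / col_range   # may fall outside [0, 1]

print(train_scaled)
print(new_scaled)
```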

In [None]:
# concat X and y
df_pos = pd.concat([X, y], axis=1)
df_pos.head()

In [None]:
# check point 2 <- well done
df_pos.to_csv('kc_data_house-pos.csv', encoding='utf-8', index=False)

# Exploratory Data Analysis

In [None]:
# load truly meaningful data
df_pos = pd.read_csv('kc_data_house-pos.csv')
df_pos.head()

In [None]:
# check shape
df_pos.shape

In [None]:
# check correlation
df_pos.corr()

In [None]:
# check why zipcode_0 and city_town_0 = NaN
print('zipcode_0: ', df_pos['zipcode_0'].unique())
print('city_town_0: ', df_pos['city_town_0'].unique())

In [None]:
# manual EDA -> sqft_lot (scatterplot)
plt.scatter(df_pos['sqft_lot'], df_pos['price'])
plt.xlabel("sqft_lot")
plt.ylabel("Price")
plt.show()

In [None]:
# manual EDA -> bathrooms
plt.scatter(df_pos['bathrooms'], df_pos['price'])
plt.xlabel("bathrooms")
plt.ylabel("Price")
plt.show()

In [None]:
# manual EDA -> sqft_living
plt.scatter(df_pos['sqft_living'], df_pos['price'])
plt.xlabel("Sqft Living")
plt.ylabel("Price")
plt.show()

In [None]:
# manual EDA -> grade
plt.scatter(df_pos['grade'], df_pos['price'])
plt.xlabel("Grade")
plt.ylabel("Price")
plt.show()

In [None]:
# auto EDA with autoviz library
AV = AutoViz_Class()
AV.AutoViz('kc_data_house-pos.csv', depVar='price')

# Feature Selection Algorithm

In [None]:
# initialize num_feats
num_feats = 15

In [None]:
# split dependent and independent variables
y = df_pos['price']
X = df_pos.drop('price', axis=1)

In [None]:
# 1. PEARSON CORRELATION (filter methods)
def cor_selector(X, y,num_feats):
    cor_list = []
    feature_name = X.columns.tolist()
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
    # replace NaN with 0
    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    # feature name
    cor_feature = X.iloc[:,np.argsort(np.abs(cor_list))[-num_feats:]].columns.tolist()
    # feature selection? 0 for not select, 1 for select
    cor_support = [True if i in cor_feature else False for i in feature_name]
    return cor_support, cor_feature
cor_support, cor_feature = cor_selector(X, y,num_feats)
print(str(len(cor_feature)), 'selected features')

print("pearson correlation")
print(cor_feature)

In [None]:
# 2. CHI SQUARE FEATURES (filter methods)
# note: chi2 is a classification score, so each distinct price is treated as a class label
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
X_norm = MinMaxScaler().fit_transform(X)

chi_selector = SelectKBest(chi2, k=num_feats)
chi_selector.fit(X_norm, y)
chi_support = chi_selector.get_support()
chi_feature = X.loc[:,chi_support].columns.tolist()
print(str(len(chi_feature)), 'selected features')

print("chi feature")
print(chi_feature)

In [None]:
# 3. RECURSIVE FEATURE ELIMINATION (wrapper methods)
# note: LogisticRegression is a classifier, so each distinct price is treated as a class label
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe_selector = RFE(estimator=LogisticRegression(solver='lbfgs'), n_features_to_select=num_feats, step=10, verbose=5)
rfe_selector.fit(X_norm, y)

rfe_support = rfe_selector.get_support()
rfe_feature = X.loc[:,rfe_support].columns.tolist()
print(str(len(rfe_feature)), 'selected features')

In [None]:
# 4. LASSO-STYLE (L1-PENALIZED LOGISTIC REGRESSION): SELECT FROM MODEL (embedded methods)
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import MinMaxScaler
X_norm = MinMaxScaler().fit_transform(X)

embeded_lr_selector = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear'), max_features=num_feats)
embeded_lr_selector.fit(X_norm, y)

embeded_lr_support = embeded_lr_selector.get_support()
embeded_lr_feature = X.loc[:,embeded_lr_support].columns.tolist()
print(str(len(embeded_lr_feature)), 'selected features')

print("lasso model")
print(embeded_lr_feature)

In [None]:
# 5. TREE BASED SELECT FROM MODEL
# note: RandomForestClassifier also treats each distinct price as a class label
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=10, max_depth=6), max_features=num_feats)
embeded_rf_selector.fit(X, y)

embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = X.loc[:,embeded_rf_support].columns.tolist()
print(str(len(embeded_rf_feature)), 'selected features')


print("random forest")
print(embeded_rf_feature)

In [None]:
# OVERALL
pd.set_option('display.max_rows', None)
feature_name = X.columns.tolist()
# put all selections together
feature_selection_df = pd.DataFrame({'Feature':feature_name, 'Pearson':cor_support, 'Chi-2':chi_support, 'RFE':rfe_support, 'LASSO':embeded_lr_support,
                                    'Random Forest':embeded_rf_support})
# count how many methods selected each feature (sum only the boolean columns)
feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)
# display the top num_feats features
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
print(feature_selection_df.head(num_feats))
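
Since `price` is a continuous target, regression-oriented scorers are another option alongside the classifier-based selectors above. This is a hedged sketch on synthetic data. The data, the seed and `k=2` are assumptions for illustration, and this is not the selection used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# synthetic data: only f0 and f1 drive the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 6)), columns=[f"f{i}" for i in range(6)])
y_demo = 3.0 * X_demo["f0"] - 2.0 * X_demo["f1"] + rng.normal(scale=0.1, size=500)

# filter method: univariate F-test for regression
f_selector = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)
f_features = X_demo.columns[f_selector.get_support()].tolist()

# wrapper method: recursive elimination around a linear regressor
rfe_demo = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)
rfe_features = X_demo.columns[rfe_demo.get_support()].tolist()

print(f_features, rfe_features)
```

Both selectors expose `get_support()`, so their masks could be added to a comparison table in the same way as the columns above.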

# Train Test Split

In [None]:
# keep only the selected features
X_filtered = X[['zipcode_2', 'view', 'sqft_living15','grade','zipcode_7','zipcode_6','zipcode_5','zipcode_3','city_town_6','city_town_5', 'city_town_4', 'city_town_3', 'city_town_2','zipcode_4','waterfront']]

In [None]:
# split into train and test dataset
# train and test ratio => 80:20
X_train, X_test, y_train, y_test = train_test_split(X_filtered, y, test_size=0.2, random_state=0)

In [None]:
print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)

# Modeling

In [None]:
#1. Multiple Linear Regression
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

model_mlp = lr.fit(X_train, y_train)
y_hat_lr = lr.predict(X_test)
y_hat_lr

In [None]:
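#2. Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_train)
# use a separate estimator so the linear model fitted above is not overwritten
lr_poly = LinearRegression()
model_poly = lr_poly.fit(X_poly, y_train)

# reuse the fitted transformer on the test set (transform only, no refit)
X_poly_test = poly.transform(X_test)
y_hat_poly = model_poly.predict(X_poly_test)
y_hat_poly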
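
A degree-2 expansion adds squares and pairwise products to the original features, plus a bias column. The sketch below counts the resulting columns with the standard formula C(n + d, d) for n inputs and degree d; `poly_feature_count` is a hypothetical helper, not part of scikit-learn.

```python
from math import comb

def poly_feature_count(n_features, degree):
    """Number of monomials of total degree <= degree, including the bias term."""
    return comb(n_features + degree, degree)

# 15 selected features expanded to degree 2
print(poly_feature_count(15, 2))  # 136
```

The count grows quickly with the degree, which is one reason to keep the number of selected features small before expanding.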

# Evaluation

In [None]:
# evaluate model with MSE and R2 score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [None]:
# function for plotting
def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
    width = 12
    height = 10
    plt.figure(figsize=(width, height))

    # kdeplot replaces the deprecated distplot(hist=False)
    ax1 = sns.kdeplot(RedFunction, color="r", label=RedName)
    ax2 = sns.kdeplot(BlueFunction, color="b", label=BlueName, ax=ax1)

    plt.title(Title)
    plt.legend()  # show the labels passed above

    plt.show()
    plt.close()

In [None]:
Title = 'Distribution Plot of Multiple Linear Regression'
DistributionPlot(y_test, y_hat_lr, "Actual Values (Test)", "Predicted Values (Test)", Title)

print("MSE for multiple linear regression: ", mean_squared_error(y_test, y_hat_lr))
print("r2 score for multiple linear regression: ", r2_score(y_test, y_hat_lr))

In [None]:
Title = 'Distribution Plot of Polynomial Regression with degree 2'
DistributionPlot(y_test, y_hat_poly, "Actual Values (Test)", "Predicted Values (Test)", Title)

print("MSE for polynomial regression (deg=2): ", mean_squared_error(y_test, y_hat_poly))
print("r2 score for polynomial regression (deg=2): ", r2_score(y_test, y_hat_poly))
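
To make the two metrics concrete, here is a small sketch that computes MSE, RMSE and R2 by hand on made-up values. The arrays are illustrative and unrelated to the house data.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)                              # same unit as the target
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(mse, rmse, r2)
```

RMSE is often easier to read than MSE for prices because it is expressed in the same unit as the target.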

# Export Model

In [None]:
# export model to pkl format
joblib.dump(model_poly, 'house_price_model.pkl')

In [None]:
# prediction demo with new data
model_clone = joblib.load('house_price_model.pkl')

new_data = pd.DataFrame([[0.0,0.0,0.161934,0.5,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0]])

poly = PolynomialFeatures(degree=2)
X_new = poly.fit_transform(new_data)

model_clone.predict(X_new)
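
One way to avoid rebuilding `PolynomialFeatures` at prediction time is to save the expansion and the regression together. This is a minimal sketch using a scikit-learn pipeline on synthetic data; the data, the temporary path and the file name `demo_pipeline.pkl` are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# synthetic training data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1] ** 2

# the pipeline applies the polynomial expansion before the regression
pipe = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
pipe.fit(X_demo, y_demo)

path = os.path.join(tempfile.mkdtemp(), "demo_pipeline.pkl")  # hypothetical file name
joblib.dump(pipe, path)

# the loaded object accepts raw features; no separate fit_transform is needed
loaded = joblib.load(path)
prediction = loaded.predict(np.array([[0.5, 0.5, 0.5]]))
print(prediction)
```

With this layout the saved file carries every preprocessing step it contains, so new data only needs the same scaling that was applied before training.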

# Conclusion

As the evaluation shows, <b>2nd-degree polynomial regression</b> performs better than <b>multiple linear regression</b> on the test set, so we export the polynomial model. The notebook also includes a small demonstration of how the saved model takes new input data and produces a predicted price as the final output.

# References

> House Sales Data
- https://www.kaggle.com/harlfoxem/housesalesprediction

> Burhan's Notebook
- https://www.kaggle.com/burhanykiyakoglu/predicting-house-prices

> Feature Information
- https://www.slideshare.net/PawanShivhare1/predicting-king-county-house-prices