# King County Housing
#### House Price Estimate

**Authors:** Hatice Kastan, Czarina Luna, Ross McKim, Weston Shuken

##### January 2022

***

![image](Images/daria-nepriakhina-LZkbXfzJK4M-unsplash.jpg)

## Overview

    Overview of our project.

## Business Problem

    Stakeholder is a real estate company.
    Business Problem is predicting price and building a house price calculator.

## Data Understanding
    Describe the data being used for this project.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
raw_data = pd.read_csv('Data/kc_house_data.csv')

In [5]:
raw_data.head(2)

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,10/13/2014,221900.0,3,1.0,1180,5650,1.0,,NONE,...,7 Average,1180,0.0,1955,0.0,98178,47.5112,-122.257,1340,5650
1,6414100192,12/9/2014,538000.0,3,2.25,2570,7242,2.0,NO,NONE,...,7 Average,2170,400.0,1951,1991.0,98125,47.721,-122.319,1690,7639


## Data Cleaning
    Describe and justify the process for preparing the data for analysis.

In [6]:
# Data prep and cleaning

# Change to datetime and add month column
raw_data['date'] = pd.to_datetime(raw_data['date'])
raw_data['month'] = pd.DatetimeIndex(raw_data['date']).month

# Change waterfront missing value to No
raw_data.loc[raw_data.waterfront.isnull(), 'waterfront'] = "NO"

# Change view missing value to None
raw_data.loc[raw_data.view.isnull(), 'view'] = "NONE"

# Change condition to numerical value
cond_dict = {'Poor':0, 'Fair':1, 'Average':2, 'Good':3, 'Very Good':4}
raw_data['condition'].replace(cond_dict, inplace=True)

# Change grade to numerical value
raw_data['grade'] = raw_data['grade'].map(lambda x: int(x.split(' ')[0]))

# Add has_basement column
raw_data['basement'] = raw_data['sqft_basement'].apply(lambda x: 0 if x == 0 else 1)

# Change some yr_renovated missing value to 0 and add renovated column
raw_data.loc[raw_data.yr_renovated.isnull(), 'yr_renovated'] = 0
raw_data['renovated'] = raw_data['yr_renovated'].apply(lambda x: 0 if x == 0 else 1)

# Add house_age column
raw_data['age'] = raw_data['date'].dt.year - raw_data['yr_built']

In [7]:
def corr_check(df, threshold):
    '''
    Enter dataframe and threshold for correlation
    Returns table of the highly correlated pairs
    '''
    corr_df = df.corr().abs().stack().reset_index().sort_values(0, ascending=False)
    corr_df['pairs'] = list(zip(corr_df.level_0, corr_df.level_1))
    corr_df.set_index(['pairs'], inplace = True)
    corr_df.drop(columns=['level_1', 'level_0'], inplace = True)
    corr_df.columns = ['cc']
    corr_df = corr_df.drop_duplicates()
    corr_df = corr_df[(corr_df['cc'] > threshold) & (corr_df['cc'] < 1)]
    return corr_df

corr_check(raw_data, .7)

Unnamed: 0_level_0,cc
pairs,Unnamed: 1_level_1
"(renovated, yr_renovated)",0.999968
"(age, yr_built)",0.999873
"(sqft_living, sqft_above)",0.876448
"(sqft_living, grade)",0.762779
"(sqft_living, sqft_living15)",0.756402
"(sqft_above, grade)",0.756073
"(sqft_living, bathrooms)",0.755758
"(sqft_living15, sqft_above)",0.731767
"(sqft_lot, sqft_lot15)",0.718204
"(grade, sqft_living15)",0.713867


In [8]:
# Drop columns
raw_data.drop(columns=['id', 'yr_renovated', 'sqft_above', 'sqft_basement',
                      'yr_built', 'yr_renovated'], inplace=True)

In [9]:
raw_data.head(2)

Unnamed: 0,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,zipcode,lat,long,sqft_living15,sqft_lot15,month,basement,renovated,age
0,2014-10-13,221900.0,3,1.0,1180,5650,1.0,NO,NONE,2,7,98178,47.5112,-122.257,1340,5650,10,1,0,59
1,2014-12-09,538000.0,3,2.25,2570,7242,2.0,NO,NONE,2,7,98125,47.721,-122.319,1690,7639,12,1,1,63


## Data Exploration
    Generate insights and visualizations about price and its relationships with variables.

In [10]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           21597 non-null  datetime64[ns]
 1   price          21597 non-null  float64       
 2   bedrooms       21597 non-null  int64         
 3   bathrooms      21597 non-null  float64       
 4   sqft_living    21597 non-null  int64         
 5   sqft_lot       21597 non-null  int64         
 6   floors         21597 non-null  float64       
 7   waterfront     21597 non-null  object        
 8   view           21597 non-null  object        
 9   condition      21597 non-null  int64         
 10  grade          21597 non-null  int64         
 11  zipcode        21597 non-null  int64         
 12  lat            21597 non-null  float64       
 13  long           21597 non-null  float64       
 14  sqft_living15  21597 non-null  int64         
 15  sqft_lot15     2159

In [11]:
raw_data.describe()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,condition,grade,zipcode,lat,long,sqft_living15,sqft_lot15,month,basement,renovated,age
count,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0
mean,540296.6,3.3732,2.115826,2080.32185,15099.41,1.494096,2.409825,7.657915,98077.951845,47.560093,-122.213982,1986.620318,12758.283512,6.573969,1.0,0.034449,43.323286
std,367368.1,0.926299,0.768984,918.106125,41412.64,0.539683,0.650546,1.1732,53.513072,0.138552,0.140724,685.230472,27274.44195,3.115061,0.0,0.182384,29.377285
min,78000.0,1.0,0.5,370.0,520.0,1.0,0.0,3.0,98001.0,47.1559,-122.519,399.0,651.0,1.0,1.0,0.0,-1.0
25%,322000.0,3.0,1.75,1430.0,5040.0,1.0,2.0,7.0,98033.0,47.4711,-122.328,1490.0,5100.0,4.0,1.0,0.0,18.0
50%,450000.0,3.0,2.25,1910.0,7618.0,1.5,2.0,7.0,98065.0,47.5718,-122.231,1840.0,7620.0,6.0,1.0,0.0,40.0
75%,645000.0,4.0,2.5,2550.0,10685.0,2.0,3.0,8.0,98118.0,47.678,-122.125,2360.0,10083.0,9.0,1.0,0.0,63.0
max,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,4.0,13.0,98199.0,47.7776,-121.315,6210.0,871200.0,12.0,1.0,1.0,115.0


In [13]:
import statsmodels.api as sm

### Baseline Model
    Run simple linear regression on feature highest correlated with price.

In [14]:
# ols
mod1 = sm.formula.ols(formula= "price ~ sqft_living", data = raw_data)
mod1

<statsmodels.regression.linear_model.OLS at 0x146b3c074f0>

In [15]:
mod1_result = mod1.fit()
mod1_result

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x146b5d2edf0>

In [16]:
mod1_summ = mod1_result.summary()
mod1_summ
#trying linear regression with one variable

0,1,2,3
Dep. Variable:,price,R-squared:,0.493
Model:,OLS,Adj. R-squared:,0.493
Method:,Least Squares,F-statistic:,20970.0
Date:,"Mon, 03 Jan 2022",Prob (F-statistic):,0.0
Time:,16:43:51,Log-Likelihood:,-300060.0
No. Observations:,21597,AIC:,600100.0
Df Residuals:,21595,BIC:,600100.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-4.399e+04,4410.023,-9.975,0.000,-5.26e+04,-3.53e+04
sqft_living,280.8630,1.939,144.819,0.000,277.062,284.664

0,1,2,3
Omnibus:,14801.942,Durbin-Watson:,1.982
Prob(Omnibus):,0.0,Jarque-Bera (JB):,542662.604
Skew:,2.82,Prob(JB):,0.0
Kurtosis:,26.901,Cond. No.,5630.0


In [23]:
all_mod_summ = sm.formula.ols(formula='price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + waterfront + view + condition+ grade + month + basement + renovated + age', data= raw_data)
all_mod_summ

<statsmodels.regression.linear_model.OLS at 0x146b2779700>

In [24]:
all_mod_result = all_mod_summ.fit()
all_mod_result

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x146b5d8d700>

In [25]:
all_mod_summ = all_mod_result.summary()
all_mod_summ

0,1,2,3
Dep. Variable:,price,R-squared:,0.655
Model:,OLS,Adj. R-squared:,0.655
Method:,Least Squares,F-statistic:,2736.0
Date:,"Mon, 03 Jan 2022",Prob (F-statistic):,0.0
Time:,16:54:24,Log-Likelihood:,-295890.0
No. Observations:,21597,AIC:,591800.0
Df Residuals:,21581,BIC:,591900.0
Df Model:,15,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-4.617e+05,9320.737,-49.534,0.000,-4.8e+05,-4.43e+05
waterfront[T.YES],5.224e+05,2.19e+04,23.906,0.000,4.8e+05,5.65e+05
view[T.EXCELLENT],2.37e+05,1.63e+04,14.519,0.000,2.05e+05,2.69e+05
view[T.FAIR],5.802e+04,1.38e+04,4.209,0.000,3.1e+04,8.5e+04
view[T.GOOD],6.086e+04,1.19e+04,5.123,0.000,3.76e+04,8.41e+04
view[T.NONE],-5.249e+04,7301.375,-7.188,0.000,-6.68e+04,-3.82e+04
bedrooms,-3.886e+04,2030.829,-19.135,0.000,-4.28e+04,-3.49e+04
bathrooms,4.659e+04,3438.300,13.549,0.000,3.98e+04,5.33e+04
sqft_living,169.2426,3.280,51.604,0.000,162.814,175.671

0,1,2,3
Omnibus:,16031.403,Durbin-Watson:,1.975
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1118846.061
Skew:,2.959,Prob(JB):,0.0
Kurtosis:,37.761,Cond. No.,6.16e+21


##### Model Metrics Table
    Create table of metrics we care about, and update with every additional model after.

In [11]:
# metric_df

## Feature Engineering
    Create new variables to predict the price.

In [12]:
# code

### Feature Scaling
    Perform log transformation and standardization.

In [13]:
# StandardScaler and PowerTransformer

### Feature Selection
    Feature ranking with recursive feature elimination.

In [14]:
# RFE

## Data Modeling
    Describe and justify the process for modeling the data.
    Run multiple linear regression on top ranking features.

In [15]:
# OLS

#### Check Assumptions of Linear Regression
    Linearity, independence, normality, homoescadicity

In [16]:
# code