# King County Housing Price Prediction From Linear Regression

## Overview
This notebook applies the linear regression model built in `housing_price_prediction.ipynb`. It imports the model and a test data set with unknown prices, repeats the feature engineering from `housing_price_prediction.ipynb`, and outputs a list of predicted prices. The predicted prices are then merged into the test data and exported as a separate file in the results folder.

### Libraries Import

In [1]:
import os
import pandas as pd
import numpy as np
import pickle 
pd.options.display.max_columns = 500
pd.options.display.max_rows = 500

### Test Data Import

In [2]:
kc_import_df = pd.read_csv("data/kc_house_data_test_features.csv", index_col=0)
kc_test_df = kc_import_df.copy()  # work on a copy so the imported frame is not modified
kc_test_df.head()

Unnamed: 0,id,date,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,1974300020,20140827T000000,4,2.5,2270,11500,1.0,0,0,3,8,1540,730,1967,0,98034,47.7089,-122.241,2020,10918
1,1974300020,20150218T000000,4,2.5,2270,11500,1.0,0,0,3,8,1540,730,1967,0,98034,47.7089,-122.241,2020,10918
2,3630020380,20141107T000000,3,2.5,1470,1779,2.0,0,0,3,8,1160,310,2005,0,98029,47.5472,-121.998,1470,1576
3,1771000290,20141203T000000,3,1.75,1280,16200,1.0,0,0,3,8,1030,250,1976,0,98077,47.7427,-122.071,1160,10565
4,5126310470,20150115T000000,4,2.75,2830,8126,2.0,0,0,3,8,2830,0,2005,0,98059,47.4863,-122.14,2830,7916


### Zipcode Dummy Series Import
The zipcode dummy variables must be imported from the original notebook so that the test data set has the same dimensionality as the data the linear model was trained on.

In [3]:
ziplist = pd.read_csv("data/zipcod_dummy.csv", index_col=0)
ziplist = ziplist.zipcode
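
The merge in the feature engineering step joins these dummy columns on the row index, so they describe the test rows only if `zipcod_dummy.csv` lists the same rows in the same order. As a minimal sketch of an alternative, assuming the zip codes seen in training are available as a list (the hypothetical `train_zipcodes` below, with made-up rows), the dummies could be built from the test frame's own `zipcode` column and reindexed to the training columns:

```python
import pandas as pd

# Hypothetical stand-ins: zip codes seen in training and a tiny test frame.
train_zipcodes = [98001, 98029, 98034]
test_df = pd.DataFrame({"zipcode": [98034, 98029, 98199]})

# Encode the test rows' own zip codes, then force the training column set.
# A zip code unseen in training (98199 here) ends up as an all-zero row.
zip_dummies = (
    pd.get_dummies(test_df["zipcode"])
    .reindex(columns=train_zipcodes, fill_value=0)
    .astype(int)
)

test_with_dummies = test_df.join(zip_dummies)
print(test_with_dummies)
```

With this approach each row's indicator follows its own `zipcode` value, and the column count stays fixed regardless of which zip codes appear in the test file.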

## Feature Engineering
This section repeats the feature engineering from `housing_price_prediction.ipynb` so that the test data has the same columns as the training data.

In [4]:
#renovation
kc_test_df["renovated"] = kc_test_df.yr_renovated.apply(lambda x: 1 if x > 0 else 0)
kc_test_df["renovation_age"] = kc_test_df.yr_renovated.apply(lambda x: 2020-x if x > 0 else 0)

#basement
kc_test_df["basement"] = kc_test_df.sqft_basement.apply(lambda x: 1 if x != 0 else 0)

#master bathroom
kc_test_df["master_bathroom"] = kc_test_df.bathrooms.apply(lambda x: 1 if x > 2 else 0)

#family house
kc_test_df["family_house"] = kc_test_df.bedrooms.apply(lambda x: 1 if x > 2 else 0)

#sold year and quarter
kc_test_df["sale_year"] = kc_test_df.date.apply(lambda x: int(x[:4]))
kc_test_df["sale_quarter"] = kc_test_df.date.apply(lambda x: int(x[4:6])//3.1 + 1)

#zipcode dummy variables
kc_test_df = kc_test_df.merge(pd.get_dummies(ziplist), left_index=True, right_index=True)

#squared bedrooms and bathrooms
kc_test_df["bedroom_squared"] = kc_test_df["bedrooms"] ** 2
kc_test_df["bathroom_squared"] = kc_test_df["bathrooms"] ** 2

# uncomment to check the data set
# kc_test_df.head()

features = [col for col in kc_test_df.columns if col not in ["id", "date"]]  # drop columns the model does not use
kc_test_df_features = kc_test_df[features]  # test features passed to the model
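
The sale year and quarter above are derived by slicing the date string. A possible alternative, shown here only as a sketch on a few dates copied from the data preview, is to parse the column with `pd.to_datetime` and read the year and quarter from the `.dt` accessor. The loop at the end checks that the notebook's `// 3.1 + 1` expression gives the same quarter as the usual `(month - 1) // 3 + 1` formula for every month.

```python
import pandas as pd

# Sample values copied from the test data preview.
dates = pd.Series(["20140827T000000", "20150218T000000", "20141107T000000"])

# Parse the fixed-width timestamp format used in the date column.
parsed = pd.to_datetime(dates, format="%Y%m%dT%H%M%S")
sale_year = parsed.dt.year
sale_quarter = parsed.dt.quarter  # integers 1 to 4

# The string-slicing expression maps every month to the same quarter.
for month in range(1, 13):
    assert month // 3.1 + 1 == (month - 1) // 3 + 1

print(pd.DataFrame({"sale_year": sale_year, "sale_quarter": sale_quarter}))
```

Note that `.dt.quarter` returns integers, whereas the notebook's expression returns floats such as `3.0`; the values are numerically equal.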

In [5]:
kc_test_df_features.head()

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,renovated,renovation_age,basement,master_bathroom,family_house,sale_year,sale_quarter,98001,98002,98003,98004,98005,98006,98007,98008,98010,98011,98014,98019,98022,98023,98024,98027,98028,98029,98030,98031,98032,98033,98034,98038,98039,98040,98042,98045,98052,98053,98055,98056,98058,98059,98065,98070,98072,98074,98075,98077,98092,98102,98103,98105,98106,98107,98108,98109,98112,98115,98116,98117,98118,98119,98122,98125,98126,98133,98136,98144,98146,98148,98155,98166,98168,98177,98178,98188,98198,98199,bedroom_squared,bathroom_squared
0,4,2.5,2270,11500,1.0,0,0,3,8,1540,730,1967,0,98034,47.7089,-122.241,2020,10918,0,0,1,1,1,2014,3.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,16,6.25
1,4,2.5,2270,11500,1.0,0,0,3,8,1540,730,1967,0,98034,47.7089,-122.241,2020,10918,0,0,1,1,1,2015,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,16,6.25
2,3,2.5,1470,1779,2.0,0,0,3,8,1160,310,2005,0,98029,47.5472,-121.998,1470,1576,0,0,1,1,1,2014,4.0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,6.25
3,3,1.75,1280,16200,1.0,0,0,3,8,1030,250,1976,0,98077,47.7427,-122.071,1160,10565,0,0,1,0,1,2014,4.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,3.0625
4,4,2.75,2830,8126,2.0,0,0,3,8,2830,0,2005,0,98059,47.4863,-122.14,2830,7916,0,0,0,1,1,2015,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,16,7.5625


## Load Model and Predict Price

In [6]:
with open("models/regression_model_rfe.pickle", "rb") as model:
    lr_model_rfe = pickle.load(model)

with open("models/transform_rfe.pickle", "rb") as transform:
    rfe_transform = pickle.load(transform)

In [7]:
# transform features according to RFECV
rfe_features = rfe_transform.transform(kc_test_df_features)
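
Assuming `rfe_transform` is a fitted scikit-learn `RFECV` selector, as the comment suggests, its `transform` keeps the columns flagged in the selector's boolean `support_` mask. The following NumPy sketch mimics that selection with a made-up feature matrix and mask; it does not use the pickled object.

```python
import numpy as np

# Hypothetical 3-row, 4-feature matrix and selection mask.
X = np.array([[4, 2.5, 2270, 0],
              [3, 2.5, 1470, 0],
              [3, 1.75, 1280, 1]])
support = np.array([True, False, True, True])

# The input needs one column per mask entry, which is why the test
# features must have the same columns, in the same order, as training.
assert X.shape[1] == support.size

# Selection keeps only the flagged columns, in their original order.
X_selected = X[:, support]
print(X_selected.shape)
```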

In [8]:
kc_price_predict_rfe = lr_model_rfe.predict(rfe_features)

In [9]:
price_prediction_rfe = pd.DataFrame({"price":kc_price_predict_rfe})

In [10]:
price_prediction_rfe.describe()

Unnamed: 0,price
count,4323.0
mean,617598.3
std,353849.6
min,-410709.1
25%,399417.5
50%,555210.7
75%,737905.8
max,2608024.0
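
The summary reports a minimum of -410709.1, so at least one predicted price is negative; the output of a linear regression is unbounded, so this can happen. A small sketch for flagging such rows before export, using illustrative values in place of `price_prediction_rfe`:

```python
import pandas as pd

# Illustrative values standing in for price_prediction_rfe.
predictions = pd.DataFrame({"price": [555210.7, -410709.1, 737905.8]})

# Boolean mask of rows whose predicted price is not positive.
non_positive = predictions["price"] <= 0

print(f"{non_positive.sum()} of {len(predictions)} predictions are <= 0")
print(predictions[non_positive])
```

How to treat the flagged rows is a modeling decision that this notebook does not address.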


## Prediction Result Merge and Export

In [11]:
kc_import_df = kc_import_df.merge(price_prediction_rfe, left_index=True, right_index=True)
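
`merge` with `left_index=True, right_index=True` performs an inner join on index labels by default, so the step above keeps every row only when the index read from the CSV matches the default `0..n-1` index of `price_prediction_rfe`. The sketch below uses a hypothetical frame whose index does not start at 0 to show the difference between joining by label and attaching predictions by row position.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index does not start at 0.
features = pd.DataFrame({"id": [11, 22, 33]}, index=[5, 6, 7])
predicted = np.array([100.0, 200.0, 300.0])  # same row order as `features`

# An index-based inner merge against a default RangeIndex finds no matches.
by_label = features.merge(
    pd.DataFrame({"price": predicted}), left_index=True, right_index=True
)
print(len(by_label))

# Assigning the raw array attaches predictions by row position instead.
by_position = features.assign(price=predicted)
print(by_position)
```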

In [12]:
#reset columns for export
kc_import_df = kc_import_df[['id', 'price', 'date', 'bedrooms', 'bathrooms', 'sqft_living',
                        'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
                        'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated','zipcode',
                        'lat', 'long','sqft_living15', 'sqft_lot15']]

In [13]:
kc_import_df.to_csv("results/kc_house_price_prediction.csv")

In [17]:
price_prediction_rfe.to_csv("results/kc_house_price_prediction_no_features.csv", index=False)