# King County House Sales Regression Analysis
## Data Modeling

* Student name: Spencer Hadel
* Student pace: Flex
* Scheduled project review date/time: 6/5/2022, 11:00am EST
* Instructor name: Claude Fried

### Import Modules

In [29]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


### Import Prepared Data from kc_preprocessing_exploring.ipynb

We have already preprocessed our data in the kc_preprocessing_exploring notebook:

[Preprocessing Notebook](./kc_preprocessing_exploring.ipynb)

In [2]:
df = pd.read_csv('./data/preprocessed.csv', index_col = 0)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21534 entries, 0 to 21596
Data columns (total 68 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   price                21534 non-null  float64
 1   sqft_living          21534 non-null  float64
 2   sqft_lot             21534 non-null  float64
 3   yr_built             21534 non-null  float64
 4   bedrooms_10          21534 non-null  int64  
 5   bedrooms_11          21534 non-null  int64  
 6   bedrooms_2           21534 non-null  int64  
 7   bedrooms_3           21534 non-null  int64  
 8   bedrooms_33          21534 non-null  int64  
 9   bedrooms_4           21534 non-null  int64  
 10  bedrooms_5           21534 non-null  int64  
 11  bedrooms_6           21534 non-null  int64  
 12  bedrooms_7           21534 non-null  int64  
 13  bedrooms_8           21534 non-null  int64  
 14  bedrooms_9           21534 non-null  int64  
 15  bathrooms_0.75       21534 non-null 

In [4]:
subs = [(' ', '_'),('.','_'),("'",""),('™', ''), ('®',''),
        ('+','plus'), ('½','half'), ('-','_')
       ]
def col_formatting(col):
    for old, new in subs:
        col = col.replace(old,new)
    return col

df.columns = [col_formatting(col) for col in df.columns]

list(df.columns)

['price',
 'sqft_living',
 'sqft_lot',
 'yr_built',
 'bedrooms_10',
 'bedrooms_11',
 'bedrooms_2',
 'bedrooms_3',
 'bedrooms_33',
 'bedrooms_4',
 'bedrooms_5',
 'bedrooms_6',
 'bedrooms_7',
 'bedrooms_8',
 'bedrooms_9',
 'bathrooms_0_75',
 'bathrooms_1_0',
 'bathrooms_1_25',
 'bathrooms_1_5',
 'bathrooms_1_75',
 'bathrooms_2_0',
 'bathrooms_2_25',
 'bathrooms_2_5',
 'bathrooms_2_75',
 'bathrooms_3_0',
 'bathrooms_3_25',
 'bathrooms_3_5',
 'bathrooms_3_75',
 'bathrooms_4_0',
 'bathrooms_4_25',
 'bathrooms_4_5',
 'bathrooms_4_75',
 'bathrooms_5_0',
 'bathrooms_5_25',
 'bathrooms_5_5',
 'bathrooms_5_75',
 'bathrooms_6_0',
 'bathrooms_6_25',
 'bathrooms_6_5',
 'bathrooms_6_75',
 'bathrooms_7_5',
 'bathrooms_7_75',
 'bathrooms_8_0',
 'floors_1_5',
 'floors_2_0',
 'floors_2_5',
 'floors_3_0',
 'floors_3_5',
 'renovated_2000_1',
 'grade_11_Excellent',
 'grade_12_Luxury',
 'grade_13_Mansion',
 'grade_3_Poor',
 'grade_4_Low',
 'grade_5_Fair',
 'grade_6_Low_Average',
 'grade_7_Average',
 'grade_

## Split Train and Test Data

Now that we have a complete preprocessed dataset, we split it into training and test sets and separate the target variable we are predicting, price, from the predictors.

In [19]:
predictors = df.drop('price', axis=1)
target = df['price']

X_train, X_test, y_train, y_test = train_test_split(predictors, target)

# Check the size of each split
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((16150, 67), (5384, 67), (16150,), (5384,))
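
The split above relies on the defaults of train_test_split, so unless NumPy's global seed is set, different rows land in each set on every run. As a minimal sketch, assuming a small synthetic DataFrame (`demo_df`, a hypothetical stand-in and not the King County data) and an arbitrary seed of 42, fixing `random_state` should make the split repeatable:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data (hypothetical values)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "price": rng.normal(size=100),
    "sqft_living": rng.normal(size=100),
    "sqft_lot": rng.normal(size=100),
})

demo_predictors = demo_df.drop("price", axis=1)
demo_target = demo_df["price"]

# random_state=42 is an arbitrary seed chosen for illustration;
# test_size=0.25 matches the library default used in the cell above
split_a = train_test_split(demo_predictors, demo_target,
                           test_size=0.25, random_state=42)
split_b = train_test_split(demo_predictors, demo_target,
                           test_size=0.25, random_state=42)

X_tr, X_te, y_tr, y_te = split_a

# Both calls select the same training rows because the seed is fixed
same_rows = X_tr.index.equals(split_b[0].index)
print(X_tr.shape, X_te.shape, same_rows)
```

The demo names are prefixed or shortened so they do not overwrite X_train, X_test, y_train, or y_test from the notebook. Reusing a seed in the real split would change which rows are sampled, so the model results shown below would differ slightly.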

In [20]:
X = X_train
y = y_train

model_1 = sm.OLS(y, sm.add_constant(X)).fit()
model_1.summary()

Dep. Variable:      price               R-squared:            0.661
Model:              OLS                 Adj. R-squared:       0.66
Method:             Least Squares       F-statistic:          475.9
Date:               Tue, 03 May 2022    Prob (F-statistic):   0.0
Time:               12:24:35            Log-Likelihood:       -14150.0
No. Observations:   16150               AIC:                  28430.0
Df Residuals:       16083               BIC:                  28950.0
Df Model:           66
Covariance Type:    nonrobust

                  coef   std err         t     P>|t|    [0.025    0.975]
const           0.5928     0.343     1.731     0.084    -0.079     1.264
sqft_living     0.3784     0.010    36.616     0.000     0.358     0.399
sqft_lot       -0.0667     0.006   -11.978     0.000    -0.078    -0.056
yr_built       -0.2920     0.007   -40.631     0.000    -0.306    -0.278
bedrooms_10    -0.4304     0.586    -0.735     0.463    -1.579     0.718
bedrooms_11    -0.7931     0.586    -1.354     0.176    -1.941     0.355
bedrooms_2     -0.0385     0.053    -0.730     0.466    -0.142     0.065
bedrooms_3     -0.2464     0.053    -4.641     0.000    -0.351    -0.142
bedrooms_33     0.4841     0.585     0.827     0.408    -0.663     1.631

Omnibus:          75.554    Durbin-Watson:       1.987
Prob(Omnibus):    0.0       Jarque-Bera (JB):    93.616
Skew:             -0.089    Prob(JB):            4.69e-21
Kurtosis:         3.327     Cond. No.            1.26e+16


To reduce the number of features, we will use recursive feature elimination with cross-validation (RFECV) from scikit-learn's feature_selection submodule to keep only the most important features.

In [42]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate, ShuffleSplit

# Importances are based on coefficient magnitude, so
# we need to scale the data to normalize the coefficients
X_train_for_RFECV = StandardScaler().fit_transform(X)

# Instantiate and fit the selector
selector = RFECV(LinearRegression(), cv=ShuffleSplit(n_splits=3, test_size=0.25, random_state=0))
selector.fit(X_train_for_RFECV, y_train)

# Relevant Features:
for index, col in enumerate(X.columns):
    print(f"{col}: {selector.support_[index]}")

sqft_living: True
sqft_lot: True
yr_built: True
bedrooms_10: False
bedrooms_11: False
bedrooms_2: False
bedrooms_3: True
bedrooms_33: False
bedrooms_4: True
bedrooms_5: True
bedrooms_6: True
bedrooms_7: False
bedrooms_8: False
bedrooms_9: False
bathrooms_0_75: False
bathrooms_1_0: True
bathrooms_1_25: False
bathrooms_1_5: True
bathrooms_1_75: False
bathrooms_2_0: False
bathrooms_2_25: False
bathrooms_2_5: False
bathrooms_2_75: False
bathrooms_3_0: False
bathrooms_3_25: True
bathrooms_3_5: True
bathrooms_3_75: True
bathrooms_4_0: False
bathrooms_4_25: False
bathrooms_4_5: False
bathrooms_4_75: False
bathrooms_5_0: False
bathrooms_5_25: False
bathrooms_5_5: False
bathrooms_5_75: False
bathrooms_6_0: False
bathrooms_6_25: False
bathrooms_6_5: False
bathrooms_6_75: False
bathrooms_7_5: False
bathrooms_7_75: False
bathrooms_8_0: False
floors_1_5: False
floors_2_0: False
floors_2_5: False
floors_3_0: True
floors_3_5: False
renovated_2000_1: False
grade_11_Excellent: True
grade_12_Luxury: Tru