# Analysis of Real Estate Prices and Features in King County, Washington

## Overview

## Business Problem

A real estate agency wants to advise homeowners on how to increase the value of their homes. We aim to identify several variables that can **predict** a home's sale price.

**Alternatively:** If a customer wants to purchase land without a house already on the lot, our recommendations can help home builders maximize their profits by identifying which predictor variables to focus on.

## Importing Packages and Libraries

In [2]:
# baseline and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# scikit-learn
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics

# from random import gauss
# from mpl_toolkits.mplot3d import Axes3D

# statsmodels
from statsmodels.formula.api import ols
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

%matplotlib inline

...and some formatting options.

In [3]:
# Shows *all* columns in dataframe, i.e. does not truncate horizontally
pd.set_option('display.max_columns', None)

# Converts from scientific notation to standard form (applied to every df in
# this notebook) and rounds to two decimal places
pd.set_option('display.float_format', lambda x: '%.2f' % x)

Below, we read in the data and check out some of its basic features: column names, null values, data types, etc.

In [1]:
# Reading in .csv file
df = pd.read_csv('../data/kc_house_data.csv')

In [6]:
# Checking out descriptive statistics for numerical columns
df.describe()

| | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | sqft_above | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| count | 21597.0 | 21597.0 | 21597.0 | 21597.0 | 21597.0 | 21597.0 | 21597.0 | 21597.0 | 21597.0 | 17755.0 | 21597.0 | 21597.0 | 21597.0 | 21597.0 | 21597.0 |
| mean | 4580474287.77 | 540296.57 | 3.37 | 2.12 | 2080.32 | 15099.41 | 1.49 | 1788.6 | 1971.0 | 83.64 | 98077.95 | 47.56 | -122.21 | 1986.62 | 12758.28 |
| std | 2876735715.75 | 367368.14 | 0.93 | 0.77 | 918.11 | 41412.64 | 0.54 | 827.76 | 29.38 | 399.95 | 53.51 | 0.14 | 0.14 | 685.23 | 27274.44 |
| min | 1000102.0 | 78000.0 | 1.0 | 0.5 | 370.0 | 520.0 | 1.0 | 370.0 | 1900.0 | 0.0 | 98001.0 | 47.16 | -122.52 | 399.0 | 651.0 |
| 25% | 2123049175.0 | 322000.0 | 3.0 | 1.75 | 1430.0 | 5040.0 | 1.0 | 1190.0 | 1951.0 | 0.0 | 98033.0 | 47.47 | -122.33 | 1490.0 | 5100.0 |
| 50% | 3904930410.0 | 450000.0 | 3.0 | 2.25 | 1910.0 | 7618.0 | 1.5 | 1560.0 | 1975.0 | 0.0 | 98065.0 | 47.57 | -122.23 | 1840.0 | 7620.0 |
| 75% | 7308900490.0 | 645000.0 | 4.0 | 2.5 | 2550.0 | 10685.0 | 2.0 | 2210.0 | 1997.0 | 0.0 | 98118.0 | 47.68 | -122.12 | 2360.0 | 10083.0 |
| max | 9900000190.0 | 7700000.0 | 33.0 | 8.0 | 13540.0 | 1651359.0 | 3.5 | 9410.0 | 2015.0 | 2015.0 | 98199.0 | 47.78 | -121.31 | 6210.0 | 871200.0 |
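
`describe()` only summarizes numeric columns, so it hints at missing data (the `yr_renovated` count is 17755 rather than 21597) without listing nulls or data types directly. A minimal sketch of one way to check both is shown here on a small toy frame. The `toy` DataFrame and its values are made up for illustration and are not rows from `kc_house_data.csv`; the same calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; all values are invented for illustration.
toy = pd.DataFrame({
    "price": [450000.0, 322000.0, 645000.0, 540000.0],
    "waterfront": ["NO", None, "YES", "NO"],
    "yr_renovated": [0.0, np.nan, 1991.0, np.nan],
})

# Number of missing values in each column
null_counts = toy.isna().sum()

# Fraction of missing values in each column (mean of a boolean mask)
null_share = toy.isna().mean()

# Combine data types and null information into one summary table
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_null": null_counts,
    "pct_null": null_share * 100,
})
print(summary)
```

Running the same three lines against `df` (for example `df.isna().sum()`) would give the per-column null counts for the real data.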


## Splitting data into train and test sets

Our **target** variable, or `y`, is going to be `price`, i.e. the sale price of a given home. The remaining columns form the predictor DataFrame, `X`.

In [7]:
# Creating target variable and predictor dataframe
y = df['price']
X = df.drop(labels = 'price',
            axis = 1)

In [8]:
# Initiating train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

In [9]:
print(f"X_train is a DataFrame with {X_train.shape[0]} rows and {X_train.shape[1]} columns.")
print(f"y_train is a Series with {y_train.shape[0]} values.")

assert X_train.shape[0] == y_train.shape[0]

X_train is a DataFrame with 14469 rows and 20 columns.
y_train is a Series with 14469 values.
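
As a sanity check on the row counts above, the split sizes can be worked out by hand. This sketch assumes that, for a fractional `test_size`, scikit-learn rounds the test count up and assigns the remaining rows to the training set; that rounding rule is stated here as an assumption rather than taken from this notebook. The variable names are illustrative.

```python
import math

n_rows = 21597     # total rows, from the count row of df.describe()
test_size = 0.33   # fraction passed to train_test_split

# Assumption: the test count is rounded up, and the training set
# receives whatever rows are left over.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
# prints "train rows: 14469, test rows: 7128"
```

The training count agrees with the 14469 rows reported for `X_train` and `y_train`.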


## Initial Cleaning

In [10]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14469 entries, 19709 to 15795
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             14469 non-null  int64  
 1   date           14469 non-null  object 
 2   bedrooms       14469 non-null  int64  
 3   bathrooms      14469 non-null  float64
 4   sqft_living    14469 non-null  int64  
 5   sqft_lot       14469 non-null  int64  
 6   floors         14469 non-null  float64
 7   waterfront     12913 non-null  object 
 8   view           14427 non-null  object 
 9   condition      14469 non-null  object 
 10  grade          14469 non-null  object 
 11  sqft_above     14469 non-null  int64  
 12  sqft_basement  14469 non-null  object 
 13  yr_built       14469 non-null  int64  
 14  yr_renovated   11883 non-null  float64
 15  zipcode        14469 non-null  int64  
 16  lat            14469 non-null  float64
 17  long           14469 non-null  float64
 18  sq