# Inferential Statistics

Are there variables that are particularly significant in terms of explaining the answer to your project question?

Yes. The variables most relevant to the purchase price of a home are the number of bedrooms and bathrooms, the square footage, and the location (zipcode) of the property. By exploring how these variables affect price, we can better answer the project question and help homebuyers make more confident offers. 

Are there strong correlations between pairs of independent variables or between an independent and a dependent variable?

There appears to be a strong relationship between the average price and the location of a property. Here the location is the independent variable and the price of the home is the dependent variable. From our data visualizations, it appears that the closer a property is to the Seattle metropolitan area, the higher its value. This suggests that homebuyers are willing to pay a premium for convenient access to the larger city for work, tourism, restaurants, etc. 
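
Zipcode is a label rather than a quantity, so one simple way to examine this relationship is to compare the average price per zipcode. The sketch below assumes a small made-up DataFrame named `toy` with the same `zipcode` and `price` column names as the dataset; the prices are illustrative and are not taken from the King County data.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "zipcode": [98039, 98039, 98103, 98103, 98042, 98042],
    "price": [2000000.0, 1800000.0, 600000.0, 560000.0, 300000.0, 320000.0],
})

# Average price per zipcode, most expensive first.
mean_by_zip = (
    toy.groupby("zipcode")["price"]
       .mean()
       .sort_values(ascending=False)
)
print(mean_by_zip)
```

Applied to the real data, the same `groupby` on `df` would rank zipcodes by mean price.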



In [1]:
# Looking at filtering of cheap houses vs. expensive houses
# From there, filter down further. Cheap houses? What zipcodes are most popular? 
# Most expensive zipcodes?
# 98039 

In [2]:
# Importing the libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns 
import geopandas as gpd

In [3]:
# Importing the dataframe
df = pd.read_csv('C:/Users/jwhoj/Desktop/Capstone_1/KC_house_data.csv')
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [4]:
count_row = df.shape[0]  # number of rows
count_col = df.shape[1]  # number of columns
print(count_row)
print(count_col)

21613
21


In [5]:
# How many houses are in each zipcode?
df['zipcode'].value_counts().head(10)

98103    602
98038    590
98115    583
98052    574
98117    553
98042    548
98034    545
98118    508
98023    499
98006    498
Name: zipcode, dtype: int64

In [6]:
# Calculate mean of prices in King County
np.mean(df['price'])

540182.1587933188

In [7]:
# Create dataframe of more affordable housing (0 < price < 500k)
cheap = df[(df['price'] < 500000) & (df['price'] > 0)]

In [8]:
# Mean of more affordable houses 
np.mean(cheap['price'])

338387.4663926499

In [9]:
cheap['zipcode'].value_counts().head(10)

98042    524
98038    518
98023    477
98133    432
98058    418
98118    393
98034    378
98155    364
98001    347
98092    325
Name: zipcode, dtype: int64

In [10]:
# Create dataframe of more expensive housing (price > 500k)
# Note: houses priced at exactly 500000 fall in neither `cheap` nor `expensive`
expensive = df[(df['price'] > 500000)]

In [11]:
np.mean(expensive['price'])

817435.6914834861

In [12]:
expensive.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
5,7237550310,20140512T000000,1230000.0,4,4.5,5420,101930,1.0,0,0,...,11,3890,1530,2001,0,98053,47.6561,-122.005,4760,101930
10,1736800520,20150403T000000,662500.0,3,2.5,3560,9796,1.0,0,0,...,8,1860,1700,1965,0,98007,47.6007,-122.145,2210,8925


In [13]:
price = (df['price'])

# One-Way ANOVA

We will use the Analysis of Variance (ANOVA) test. A One-Way ANOVA tests whether the mean price is equal 
across multiple zipcodes, using a single categorical factor. A Two-Way ANOVA, by contrast, examines how two 
categorical factors, and their interaction, relate to the variable in question. 

Null hypothesis: The mean price is the same across zipcodes. 

Alternative hypothesis: The mean price differs for at least one zipcode. 
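
For reference, the one-way ANOVA F-statistic is the ratio of the between-group mean square to the within-group mean square. The following is a minimal sketch in plain Python on made-up numbers; the function name `one_way_f` and the sample groups are hypothetical, and in practice a library routine such as `scipy.stats.f_oneway` would normally be used instead.

```python
def one_way_f(groups):
    """Return the one-way ANOVA F-statistic for a list of numeric groups."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of observations around their group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)    # k - 1 degrees of freedom
    ms_within = ss_within / (n - k)      # n - k degrees of freedom
    return ms_between / ms_within

# Made-up groups standing in for prices in three zipcodes.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_f(groups)
print(f_stat)
```

A large value indicates that the group means differ by more than the variation within groups would explain.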

In [14]:
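# plotly.plotly moved to the separate chart_studio package in plotly 4 and is not used below
import plotly.graph_objs as go
import plotly.figure_factory as FF  # replaces the deprecated plotly.tools.FigureFactory

# numpy and pandas were already imported above as np and pd
import scipy

import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols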

In [15]:
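# Note: without C(zipcode), zipcode enters the model as a single numeric predictor
# (Df Model: 1 in the output), so this fits a straight line in the zipcode number
m = ols('price ~ zipcode', df).fit()
print(m.summary())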

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     61.26
Date:                Tue, 23 Apr 2019   Prob (F-statistic):           5.22e-15
Time:                        16:19:20   Log-Likelihood:            -3.0759e+05
No. Observations:               21613   AIC:                         6.152e+05
Df Residuals:                   21611   BIC:                         6.152e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3.634e+07   4.57e+06      7.945      0.0

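# Interpreting coefficients
There is a lot of information in this output, but we will focus on the coefficient table in the middle section. 
The zipcode coefficient is -365.0496, and its p-value under P>|t| is reported as 0.000, so the coefficient is 
statistically distinguishable from zero. Because the formula `price ~ zipcode` treats zipcode as a number 
(the output shows Df Model: 1), the coefficient is a slope: the predicted price changes by about -$365 for each 
one-unit increase in the zipcode number, with a 95% confidence interval of roughly (-$456, -$274). It does not 
mean that zipcodes differ in price by $365,049. Since zipcode numbers are labels rather than quantities, this 
slope has little practical meaning, and the R-squared of 0.003 shows that the linear fit explains only about 
0.3% of the variation in price. 

The computed F-statistic is 61.26, with Prob (F-statistic) = 5.22e-15. A high F-value with a small p-value means 
that the data do not support the null hypothesis well, so we reject the null hypothesis of no relationship between 
the price of a property and its zipcode. Strictly speaking, this F-test checks for a linear trend in the zipcode 
number; testing whether mean prices are equal across zipcodes requires treating zipcode as a categorical variable, 
for example with `C(zipcode)` in the formula. 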
