<h2> Modeling - Construct models to predict and forcast</h2>

# Explore for initial models
It stands to reason that size of the home should predict price but what else could improve our prediction and are there any other features that might improve our prediction. Let's explore the following:
* What single feature is the best predictor of home price?
* Are there any groups of features that increase this accuracy of this prediction?
* Are there any other interesting associations?

In [1]:
# Import statements
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as scs
from scipy.stats import norm
import statsmodels.api as sm
import statsmodels.stats.api as sms
from statsmodels.formula.api import ols
plt.style.use('seaborn-colorblind')
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

In [2]:
# Read in cleaned Kings County file 
df = pd.read_csv("cleaned_kings.csv")
df.head()

## Searching for the Best Single Predictor
What single feature predicts price the best? 

In [3]:
# Interesting features
features = df[['price','bedrooms','bathrooms','sqft_living','sqft_lot','floors','view','condition','grade',
            'sqft_above','sqft_basement','yr_built','yr_renovated','sqft_living15','sqft_lot15']]
features.head(2)

### Split Features into Categorical and Continuous

In [4]:
# Our initial target will be price
price = df[['price']]
print(price.head())

# Continuous features will remain features
features = df[['bedrooms','bathrooms','sqft_living','sqft_lot', 'sqft_above',
               'sqft_basement','sqft_living15','sqft_lot15']]

# Categorical features
cat_features = df[['floors','view','condition','grade','yr_built','yr_renovated']]

### Price vs Each Feature

In [5]:
# Remove 1st two columns
df = df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'])
df.head()

In [6]:
# Create the OLS model by looping through each feature and create a summary table
summaries = []
for feature in features:
    d = {}
    formula_str = 'price ~ ' + str(feature)
    res = ols(formula=formula_str, data=features).fit()
    # Make a summary chart
    d['feature'] = feature
    d['jb'],d['p'], d['sk'], d['kurt'] = sms.jarque_bera(res.resid)
    d['r2'] = res.rsquared
    summaries.append(d)
    res.summary()

### Summary Chart

In [7]:
summaries_df = pd.DataFrame(summaries)
summaries_df.head()

#### Regression Observations
* sqft_living has the best rsquared value when 



In [8]:
# regression for categorical variabes
for feature in cat_features:
    formula_str = 'price ~ ' + str(feature)
    mod = ols(formula=formula_str, data=cat_features)
    res = mod.fit()
     # Make a summary chart
    d['feature'] = feature
    d['jb'],d['p'], d['sk'], d['kurt'] = sms.jarque_bera(res.resid)
    d['r2'] = res.rsquared
    summaries.append(d)
    res.summary()

## Price versus sqft_living  without outliers

In [9]:
summaries = pd.DataFrame(summaries)
summaries

### Sqft Living: Find and remove outliers for better predictive values

In [10]:
# Calculate the sqft_living limit for outliers using 1.5 * IQR + Q3
Q1 = features['sqft_living'].quantile(0.25)
Q3 = features['sqft_living'].quantile(0.75)
IQR = Q3 - Q1
limit = 1.5 * IQR + Q3
limit

In [11]:
# Remove upper outliers from sqft_living and create histogram again
norm_sqft_living = features[features['sqft_living'] < limit]
norm_sqft_living.head(2)

# Histogram (distplot)
sns.distplot(norm_sqft_living['sqft_living'], color='crimson', label=norm_sqft_living, fit=norm, kde=False)
plt.title('Sqft living (without upper outliers)')
plt.show()
print(len(norm_sqft_living), len(features))

In [12]:
# Remove upper outliers from sqft_lot and create histogram again
norm_price_sqft_living = norm_sqft_living[norm_sqft_living['price'] < limit]

# Histogram (distplot)
sns.distplot(norm_price_sqft_living['price'], color='darkorchid', label=norm_sqft_living, fit=norm, kde=False)
plt.title('Price using Sqft living (without price outliers)')
plt.show()
print(len(norm_price_sqft_living), len(features))

### Model Price to Sqft Living

#### Create the prediction regression model

In [13]:
# Regression Model
X = norm_sqft_living["price"] 
y = norm_sqft_living['sqft_living']

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

In [14]:
# plot the observed data and the least squares line
norm_sqft_living.plot(kind='scatter', x='price', y='sqft_living', figsize=(8,5), label='values', color='pink') 
plt.plot(X, predictions, c='red', linewidth=2, label='prediction') 
plt.title('Prediction of Price by Sqft Living')
plt.legend()
plt.show()

#### Visual Relationship of Price to Normed Sqft Living

### Price Outliers Removed

### Price vs sqft_living without outliers observations

Comparison with and without outliers
* p-values still 0
* Skew improved from 2.8 to 2
* Kurtosis improved from 26.97 to 13.8
* JB improved from 1146871.98 to 541541.2
* R2 improved from .492 to .393

Still doesn't seem to be a great predictor


In [15]:
mod = ols(formula='price ~ sqft_living', data=norm_sqft_living)
res = mod.fit()
res.summary()

#### The Price and Sqft_Living data sets are normal

In [16]:
# The shapiro test will confirm the data sets are normal
scs.shapiro(norm_price_sqft_living.price), scs.shapiro(norm_price_sqft_living.sqft_living)

#### Notice: Condition warning 
* The Condition number is large indicating numerical problems.
* I will use this study as a base model and continue to add features or change the model to improve its predictive properties.

#### Predictive regression model  with outliers removed for both price and sqft_living

In [17]:
# Regression Model
X = norm_price_sqft_living["price"] 
y = norm_price_sqft_living['sqft_living']

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

In [18]:
mod = ols(formula='price ~ sqft_living', data=norm_features)
res = mod.fit()
res.summary()

In [19]:
# plot the observed data and the least squares line
norm_sqft_living.plot(kind='scatter', x='price', y='sqft_living', figsize=(8,5), label='values', color='mediumpurple') 
plt.plot(X, predictions, c='red', linewidth=2, label='prediction') 
plt.title('Prediction of Price by Sqft Living')
plt.legend()
plt.show()

### Observations of Price vs Squarefoot Living 
* This model is only a fair predictor as it is cone shaped and loses its predictive value as the price of the homes increases.
* A search should continue to see if a better model can be created.

In [20]:
JB, pval, skew, kurtosis = sms.jarque_bera(res.resid)
rsquared = res.rsquared

summary_stats = "price to sqft_living gives JB: {:.2f}, pval: {:.2f}, skew: {:.2f}, kurtosis: {:.2f}, rsquared: {:.2f}".format(JB,pval,skew,kurtosis,rsquared)
summary_stats        

### price to sqft_living statistics:
JB: 923.08, pval: 0.00, skew: 0.52, kurtosis: 3.10, rsquared: 0.36

### Outliers Removed - norm_features
Since we had a better predictive model with outliers removed for price and square feet, I will now make it the cleaned version for all future models.

In [21]:
# norm_price_sqft_living will become norm_features
norm_features = norm_price_sqft_living

## Price predicted by grade and sqft_living
Square foot of the living space is always an inportant value when determining the price of a home but will other features help to make a better predictive model? I would like to investigate grade since it had a high correlation initially to price.

In [22]:
##

In [23]:
mod = ols(formula='price ~ sqft_living + grade', data=norm_features)
res = mod.fit()
res.summary()

In [24]:
JB, pval, skew, kurtosis = sms.jarque_bera(res.resid)
rsquared = res.rsquared

summary_stats = "price to sqft_living gives JB: {:.2f}, pval: {:.2f}, skew: {:.2f}, kurtosis: {:.2f}, rsquared: {:.2f}".format(JB,pval,skew,kurtosis,rsquared)
summary_stats 

### Indications are there may be some multicollinearity going on even though p-values look good
Will remove values to see if anything improves

In [25]:
VIF = 1 / (1 - 0.507)
VIF

The VIF score indicates the selected features are moderately correlated.
There is slight skew and high kurtosis.
The Jarque-Bera value is high indicating there may be some multicollinearity>