# Zillow Data

### Mission

* Discover the key drivers of property value for single family properties.

* Use features to develop a machine learning model to predict the property value for single family properties.

## Imports

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import os
import wrangle_zillow as wz
import explore_zillow as ez
import model_zillow as mz

import matplotlib.pyplot as plt
import seaborn as sns
from pydataset import data
from math import sqrt
from scipy import stats
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, QuantileTransformer
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, RFE, f_regression, SequentialFeatureSelector
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.model_selection import train_test_split

### Acquire Data

* Data acquired from CodeUp database

* Data set contained 52441 rows and 7 columns after cleaning
* Each row represents an individual parcel
* Each column represents a feature of that parcel

### Prepare Data

#### Actions:

* Removed columns that did not contain useful information
* Renamed columns to promote readability
* Removed nulls in the data
* Converted current datatype to appropriate datatype
* Split data into train, validate and test (approx. 60/25/15)
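
The body of `wz.wrangle_zillow()` is not shown in this notebook, so the following is only a minimal sketch of how a three-way split with roughly these proportions could be done with `train_test_split`. The function name `split_data`, the seed, and the synthetic frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(df, test_size=0.15, validate_size=0.25, seed=123):
    """Split df into train, validate and test; sizes are fractions of the full frame."""
    # Carve off the test set first.
    train_validate, test = train_test_split(df, test_size=test_size, random_state=seed)
    # validate_size is a share of the whole frame, so rescale it for the remainder.
    relative_validate = validate_size / (1 - test_size)
    train, validate = train_test_split(
        train_validate, test_size=relative_validate, random_state=seed
    )
    return train, validate, test


# Synthetic stand-in for the cleaned parcel data (hypothetical values).
demo = pd.DataFrame(
    {"finished_sqft": np.arange(1000), "taxvaluedollarcnt": np.arange(1000) * 100}
)
train_demo, validate_demo, test_demo = split_data(demo)
print(len(train_demo), len(validate_demo), len(test_demo))
```

Splitting off the test set before the validate set keeps the test rows untouched by any later modeling decision.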

## Data Dictionary


In [2]:
# acquiring, preparing, and splitting the zillow data
train, validate, test = wz.wrangle_zillow()

## A brief look at the data

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21969 entries, 40174 to 40856
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   bed_rooms          21969 non-null  int64  
 1   bath_rooms         21969 non-null  float64
 2   finished_sqft      21969 non-null  float64
 3   taxvaluedollarcnt  21969 non-null  float64
 4   year_built         21969 non-null  int64  
 5   regionidcounty     21969 non-null  int64  
 6   fips               21969 non-null  int64  
dtypes: float64(3), int64(4)
memory usage: 1.3 MB


## Data Summary

In [4]:
train.describe()

| | bed_rooms | bath_rooms | finished_sqft | taxvaluedollarcnt | year_built | regionidcounty | fips |
|---|---|---|---|---|---|---|---|
| count | 21969.0 | 21969.0 | 21969.0 | 21969.0 | 21969.0 | 21969.0 | 21969.0 |
| mean | 3.158405 | 2.026059 | 1623.302381 | 290381.744276 | 1960.570304 | 2590.490874 | 6048.232191 |
| std | 0.859079 | 0.742293 | 611.688231 | 163977.895379 | 21.33703 | 774.441636 | 20.907472 |
| min | 0.0 | 0.0 | 152.0 | 1000.0 | 1878.0 | 1286.0 | 6037.0 |
| 25% | 3.0 | 2.0 | 1200.0 | 144941.0 | 1950.0 | 2061.0 | 6037.0 |
| 50% | 3.0 | 2.0 | 1496.0 | 281384.0 | 1958.0 | 3101.0 | 6037.0 |
| 75% | 4.0 | 2.5 | 1922.0 | 423359.0 | 1974.0 | 3101.0 | 6059.0 |
| max | 9.0 | 8.0 | 7970.0 | 618256.0 | 2016.0 | 3101.0 | 6111.0 |


### Explore Data

## What is the average value of a single family home?

In [5]:
# compute the mean of the target, taxvaluedollarcnt
# the mean serves as the baseline prediction that every model must beat
 
baseline = train.taxvaluedollarcnt.mean()
print(f'The average value is ${baseline:.2f}')

The average value is $290381.74
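
To make the role of the baseline concrete, here is a small sketch of how the error of a "predict the mean for every home" baseline can be measured as RMSE. The helper name `baseline_rmse` and the three example values are hypothetical, not taken from `model_zillow`.

```python
import numpy as np
import pandas as pd


def baseline_rmse(y):
    """RMSE of predicting the mean of y for every observation."""
    baseline = y.mean()
    residuals = y - baseline
    return float(np.sqrt((residuals ** 2).mean()))


# Three made-up home values, for illustration only.
y_demo = pd.Series([100_000.0, 200_000.0, 300_000.0])
print(f"Baseline RMSE: ${baseline_rmse(y_demo):.2f}")
```

A model is only useful if its RMSE on the same data is lower than this number.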


## Viz

## What features should we investigate first?

In [6]:
x_train, y_train, x_validate, y_validate, x_test, y_test = mz.model_sets(train, validate, test)

In [7]:
ez.select_kbest(x_train, y_train)

| | p | f |
|---|---|---|
| bed_rooms | 4.087223e-126 | 578.123431 |
| bath_rooms | 0.0 | 1705.982503 |
| finished_sqft | 0.0 | 1800.012997 |
| year_built | 2.029789e-228 | 1066.254339 |
| regionidcounty | 6.291911e-146 | 671.810016 |
| fips | 5.074378e-122 | 558.816026 |


### Ranked by F-score, the strongest features are finished square feet, bathrooms, and year built. Bedrooms is investigated as well.
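
The implementation of `ez.select_kbest` is not included here. As a rough sketch, a table of p-values and F-scores like the one above can be produced with scikit-learn's `SelectKBest` and `f_regression`. The helper name `kbest_scores`, the value of `k`, and the synthetic columns are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression


def kbest_scores(x, y, k=2):
    """Return each feature's p-value and F-score, sorted from strongest to weakest."""
    selector = SelectKBest(score_func=f_regression, k=k)
    selector.fit(x, y)
    scores = pd.DataFrame(
        {"p": selector.pvalues_, "f": selector.scores_}, index=x.columns
    )
    return scores.sort_values("f", ascending=False)


# Synthetic data: one informative feature and one pure-noise feature.
rng = np.random.default_rng(0)
x_demo = pd.DataFrame({"strong": rng.normal(size=200), "noise": rng.normal(size=200)})
y_demo = 3 * x_demo["strong"] + rng.normal(size=200)

ranked = kbest_scores(x_demo, y_demo)
print(ranked)
```

Sorting by `f` makes the ranking explicit instead of leaving it to be read off an unsorted table.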

## Viz

**I will now use a Pearson's r statistical test to investigate whether the finished square feet of the home and tax value are related**

* I will use a confidence level of 95%
* The resulting alpha is .05

${H_0}$: There is no **linear** relationship between finished square feet of the home and tax value.

${H_a}$: There is a **linear** relationship between finished square feet of the home and tax value.

In [8]:
ez.get_stats_sqft(train)

We reject the null hypothesis
pearsonsr test = 0.2752
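
Since `ez.get_stats_sqft` is not shown, the following is a minimal sketch of a Pearson correlation test using `scipy.stats.pearsonr`. The helper name `pearson_test` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from scipy import stats


def pearson_test(x, y, alpha=0.05):
    """Run Pearson's r test and report the decision at the given alpha."""
    r, p = stats.pearsonr(x, y)
    if p < alpha:
        print("We reject the null hypothesis")
    else:
        print("We fail to reject the null hypothesis")
    print(f"Pearson's r = {r:.4f}")
    print(f"p           = {p:.4f}")
    return r, p


# Synthetic data with a built-in linear relationship.
rng = np.random.default_rng(7)
x_demo = rng.normal(size=200)
y_demo = 2 * x_demo + rng.normal(size=200)
r_demo, p_demo = pearson_test(x_demo, y_demo)
```

Reporting the p-value next to r shows both whether the relationship is statistically significant and how strong it is.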


## Viz

**I will now use a chi-squared statistical test to investigate whether the number of bathrooms in the home and tax value are related**

* I will use a confidence level of 95%
* The resulting alpha is .05

${H_0}$: The number of bathrooms in the home and tax value are independent.

${H_a}$: There is an association between the number of bathrooms in the home and tax value.

In [9]:
#chi squared
ez.get_chi_bath(train)

We reject the null hypothesis
chi^2 = 282669.3795
p     = 0.0000
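
The chi-squared test of independence works on a contingency table of categorical variables. `ez.get_chi_bath` is not shown, so the sketch below is one possible approach: it bins the continuous target into quartiles before cross-tabulating. The binning step, the helper name `chi2_test`, and the synthetic data are all assumptions, not the notebook's confirmed method.

```python
import numpy as np
import pandas as pd
from scipy import stats


def chi2_test(feature, target, bins=4, alpha=0.05):
    """Chi-squared test of independence between a discrete feature and a binned target."""
    # Assumption: bin the continuous target so both variables are categorical.
    target_bins = pd.qcut(target, q=bins, duplicates="drop")
    observed = pd.crosstab(feature, target_bins)
    chi2, p, dof, expected = stats.chi2_contingency(observed)
    if p < alpha:
        print("We reject the null hypothesis")
    else:
        print("We fail to reject the null hypothesis")
    print(f"chi^2 = {chi2:.4f}")
    print(f"p     = {p:.4f}")
    return chi2, p


# Synthetic data in which value rises with the bathroom count.
rng = np.random.default_rng(1)
baths = pd.Series(rng.integers(1, 4, 500), name="bath_rooms")
value = pd.Series(
    baths * 100_000 + rng.normal(0, 20_000, 500), name="taxvaluedollarcnt"
)
chi2_demo, p_demo = chi2_test(baths, value)
```

Binning keeps the expected cell counts reasonably large, which the chi-squared approximation relies on.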


## Viz

**I will now use a Pearson's r statistical test to investigate whether the year the home was built and tax value are related**

* I will use a confidence level of 95%
* The resulting alpha is .05

${H_0}$: There is no **linear** relationship between the year the home was built and tax value.

${H_a}$: There is a **linear** relationship between the year the home was built and tax value.

In [10]:
ez.get_stats_built(train)

We reject the null hypothesis
pearsonsr test = 0.2152


## Viz

**I will now use a chi-squared statistical test to investigate whether the number of bedrooms in the home and tax value are related**

* I will use a confidence level of 95%
* The resulting alpha is .05

${H_0}$: The number of bedrooms in the home and tax value are independent.

${H_a}$: There is an association between the number of bedrooms in the home and tax value.

In [11]:
#chi squared
ez.get_chi_bed(train)

We reject the null hypothesis
chi^2 = 174454.4700
p     = 0.0000


## Exploration Summary

* SelectKBest F-scores ranked the features to investigate in this order:
    * finished square ft
    * bathrooms
    * year home was built
    * bedrooms

* Pearson's r and chi-squared tests found a statistically significant relationship (alpha = .05) between each of the 4 features and the target variable. The linear correlations were weak, however (r = 0.2752 for finished square feet and r = 0.2152 for year built).

## Creating predictive models

### Features included: 

* "bathrooms", "finished sqft", "year built" - These features had the strongest relationships to the target variable, so they are the most likely to add predictive power.

### Features not included:

* "bedrooms" - Of the four features investigated, this one had the lowest SelectKBest F-score, and its chi-squared statistic was smaller than that of bathrooms.

## MODEL

In [12]:
# the Actual column holds the target, taxvaluedollarcnt
mz.predictions(train)

| | Actual |
|---|---|
| 40174 | 169587.0 |
| 28703 | 206900.0 |
| 6134 | 97875.0 |
| 17001 | 379756.0 |
| 30568 | 527700.0 |


## Simple Linear Regression Model

In [13]:
# prints the fitted model as a formula of the form y = mx + b
mz.simple_lm_model(train,validate)

Home Value = 43.395 * finished sqft year built & bathrooms + -1230105.818
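
The printed formula above shows only one coefficient for three features. As a hedged sketch of what a fit on all three features could look like, the code below fits `LinearRegression` and prints one coefficient per feature. The helper names and the synthetic data (with made-up true coefficients) are assumptions; `mz.simple_lm_model` itself is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def fit_linear_model(x_train, y_train):
    """Fit ordinary least squares and return the model plus labeled coefficients."""
    lm = LinearRegression().fit(x_train, y_train)
    coefs = pd.Series(lm.coef_, index=x_train.columns)
    return lm, coefs


def as_formula(coefs, intercept):
    """Render the fitted model as a readable equation, one term per feature."""
    terms = " + ".join(f"{value:.3f} * {name}" for name, value in coefs.items())
    return f"Home Value = {terms} + {intercept:.3f}"


# Synthetic, noise-free data with known (made-up) coefficients.
rng = np.random.default_rng(42)
x_demo = pd.DataFrame(
    {
        "finished_sqft": rng.uniform(800, 3000, 300),
        "year_built": rng.integers(1900, 2016, 300),
        "bath_rooms": rng.integers(1, 5, 300),
    }
)
y_demo = (
    50 * x_demo["finished_sqft"]
    + 700 * x_demo["year_built"]
    + 20_000 * x_demo["bath_rooms"]
    - 1_200_000
)

lm_demo, coefs_demo = fit_linear_model(x_demo, y_demo)
print(as_formula(coefs_demo, lm_demo.intercept_))
```

Labeling each coefficient with its column name avoids the ambiguity of a single slope attached to several features.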


## Viz

In [14]:
validate.head()

| | bed_rooms | bath_rooms | finished_sqft | taxvaluedollarcnt | year_built | regionidcounty | fips | lm_predictions |
|---|---|---|---|---|---|---|---|---|
| 225 | 3 | 2.0 | 1395.0 | 388006.0 | 1951 | 3101 | 6037 | 273066.633511 |
| 7625 | 3 | 3.0 | 2332.0 | 139666.0 | 1952 | 3101 | 6037 | 334839.469942 |
| 36238 | 3 | 3.0 | 1294.0 | 576204.0 | 1939 | 3101 | 6037 | 280454.819812 |
| 33265 | 3 | 1.0 | 825.0 | 443825.0 | 1954 | 3101 | 6037 | 230093.794376 |
| 9581 | 5 | 5.0 | 4227.0 | 522148.0 | 2007 | 3101 | 6037 | 497378.466059 |


## Lasso Lars Regression Model

In [15]:
mz.lasso_model(train,validate)

finished_sqft       43.268612
year_built         714.146945
bath_rooms       20342.313084
dtype: float64
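
For readers unfamiliar with the estimator, here is a minimal sketch of fitting scikit-learn's `LassoLars` and labeling its coefficients in the same way as the output above. The `alpha` value, the helper name `fit_lasso_lars`, and the synthetic data are assumptions; the settings used inside `mz.lasso_model` are not shown.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoLars


def fit_lasso_lars(x_train, y_train, alpha=1.0):
    """Fit an L1-regularized linear model with the LARS algorithm."""
    lars = LassoLars(alpha=alpha).fit(x_train, y_train)
    coefs = pd.Series(lars.coef_, index=x_train.columns)
    return lars, coefs


# Synthetic data with made-up coefficients and some noise.
rng = np.random.default_rng(3)
x_demo = pd.DataFrame(
    {
        "finished_sqft": rng.uniform(800, 3000, 300),
        "year_built": rng.integers(1900, 2016, 300),
        "bath_rooms": rng.integers(1, 5, 300),
    }
)
y_demo = (
    50 * x_demo["finished_sqft"]
    + 700 * x_demo["year_built"]
    + 20_000 * x_demo["bath_rooms"]
    + rng.normal(0, 10_000, 300)
)

lars_demo, lars_coefs = fit_lasso_lars(x_demo, y_demo)
predictions = lars_demo.predict(x_demo)
print(lars_coefs)
```

A larger `alpha` shrinks coefficients harder and can set some of them exactly to zero, which is how the lasso performs feature selection.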


## Viz

In [16]:
validate.head()

| | bed_rooms | bath_rooms | finished_sqft | taxvaluedollarcnt | year_built | regionidcounty | fips | lm_predictions | lars_predictions |
|---|---|---|---|---|---|---|---|---|---|
| 225 | 3 | 2.0 | 1395.0 | 388006.0 | 1951 | 3101 | 6037 | 273066.633511 | 273138.704668 |
| 7625 | 3 | 3.0 | 2332.0 | 139666.0 | 1952 | 3101 | 6037 | 334839.469942 | 334737.853955 |
| 36238 | 3 | 3.0 | 1294.0 | 576204.0 | 1939 | 3101 | 6037 | 280454.819812 | 280541.124615 |
| 33265 | 3 | 1.0 | 825.0 | 443825.0 | 1954 | 3101 | 6037 | 230093.794376 | 230275.723694 |
| 9581 | 5 | 5.0 | 4227.0 | 522148.0 | 2007 | 3101 | 6037 | 497378.466059 | 496694.581484 |


## Generalized Linear Regression Model

In [17]:
mz.glm_model(train,validate)

finished_sqft      57.339299
year_built        858.899942
bath_rooms       3300.462751
dtype: float64
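
A generalized linear model in scikit-learn is typically fit with `TweedieRegressor`. The sketch below assumes `power=0` (normal distribution, identity link) and no regularization; the actual settings in `mz.glm_model` are not shown and may differ. The feature names `a`, `b`, `c` and the data are synthetic and kept on a small scale so the solver converges quickly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import TweedieRegressor


def fit_glm(x_train, y_train, power=0, alpha=0.0):
    """Fit a Tweedie GLM; power=0 corresponds to a normal distribution."""
    glm = TweedieRegressor(power=power, alpha=alpha).fit(x_train, y_train)
    coefs = pd.Series(glm.coef_, index=x_train.columns)
    return glm, coefs


# Synthetic data with made-up true coefficients 3, 2 and 1.
rng = np.random.default_rng(5)
x_demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
y_demo = (
    3 * x_demo["a"] + 2 * x_demo["b"] + 1 * x_demo["c"] + 10
    + rng.normal(0, 0.5, 500)
)

glm_demo, glm_coefs = fit_glm(x_demo, y_demo)
print(glm_coefs)
```

With `power=0` and `alpha=0` the GLM should land close to ordinary least squares, so a large gap between the two sets of coefficients usually points to regularization or a solver that stopped early.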


## Viz

In [18]:
validate.head()

| | bed_rooms | bath_rooms | finished_sqft | taxvaluedollarcnt | year_built | regionidcounty | fips | lm_predictions | lars_predictions | glm_predictions |
|---|---|---|---|---|---|---|---|---|---|---|
| 225 | 3 | 2.0 | 1395.0 | 388006.0 | 1951 | 3101 | 6037 | 273066.633511 | 273138.704668 | 268985.104202 |
| 7625 | 3 | 3.0 | 2332.0 | 139666.0 | 1952 | 3101 | 6037 | 334839.469942 | 334737.853955 | 326871.3904 |
| 36238 | 3 | 3.0 | 1294.0 | 576204.0 | 1939 | 3101 | 6037 | 280454.819812 | 280541.124615 | 256187.498411 |
| 33265 | 3 | 1.0 | 825.0 | 443825.0 | 1954 | 3101 | 6037 | 230093.794376 | 230275.723694 | 235577.94064 |
| 9581 | 5 | 5.0 | 4227.0 | 522148.0 | 2007 | 3101 | 6037 | 497378.466059 | 496694.581484 | 489369.785021 |


## Evaluate on Train

In [19]:
mz.lm_errors(train)

(537856500143301.56,
 52835563653966.75,
 590692063797268.2,
 24482520831.32148,
 156468.91330651427)
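
The tuple above is unlabeled, but its values are consistent with (SSE, ESS, TSS, MSE, RMSE): the first two values sum to the third, the first divided by the 21,969 training rows gives the fourth, and the square root of the fourth is the fifth. A minimal sketch of a helper that returns the same five quantities follows. The name `regression_errors` and the tiny example are assumptions, since the body of `mz.lm_errors` is not shown.

```python
import numpy as np
import pandas as pd


def regression_errors(y, yhat):
    """Return (SSE, ESS, TSS, MSE, RMSE) for actual values y and predictions yhat."""
    sse = float(((y - yhat) ** 2).sum())        # sum of squared errors
    ess = float(((yhat - y.mean()) ** 2).sum()) # explained sum of squares
    tss = sse + ess                             # total, taken here as SSE + ESS
    mse = sse / len(y)                          # mean squared error
    rmse = float(np.sqrt(mse))                  # root mean squared error
    return sse, ess, tss, mse, rmse


# Tiny made-up example that can be checked by hand.
y_demo = pd.Series([1.0, 2.0, 3.0, 4.0])
yhat_demo = pd.Series([1.5, 1.5, 3.5, 3.5])
print(regression_errors(y_demo, yhat_demo))
```

RMSE is the easiest of the five to interpret because it is in the same units as the target, here dollars of tax value.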

In [20]:
mz.baseline_mean_errors(train)

(590692063797269.4, 26887526232.294113, 163974.16330719335)

In [21]:
mz.lm_vs_baseline(train)

The Simple Linear Regression model performs better than the baseline.


In [22]:
mz.lars_errors(train)

(537857179361946.4,
 52464954708453.01,
 590322134070399.4,
 24482551748.461304,
 156469.01210291227)

In [23]:
mz.baseline_mean_errors(train)

(590692063797269.4, 26887526232.294113, 163974.16330719335)

In [24]:
mz.lars_vs_baseline(train)

The Lasso Lars Regression model performs better than the baseline.


In [25]:
mz.glm_errors(train)

(539093206706790.56,
 51087681437770.6,
 590180888144561.1,
 24538814088.342236,
 156648.69641443633)

In [26]:
mz.baseline_mean_errors(train)

(590692063797269.4, 26887526232.294113, 163974.16330719335)

In [27]:
mz.glm_vs_baseline(train)

The Generalized Linear Regression model performs better than the baseline.


## Evaluate on Validate

In [28]:
mz.lm_errors(validate)

(224641246361031.3,
 22146195761761.594,
 246787442122792.9,
 23857396597.39075,
 154458.39762664493)

In [29]:
mz.baseline_mean_errors(validate)

(246164954232602.16, 26143261919.350273, 161688.7810559232)

In [30]:
mz.lm_vs_baseline(validate)

The Simple Linear Regression model performs better than the baseline.


In [31]:
mz.lars_errors(validate)

(224640852651076.6,
 21991297315948.836,
 246632149967025.44,
 23857354784.523853,
 154458.26227341758)

In [32]:
mz.baseline_mean_errors(validate)

(246164954232602.16, 26143261919.350273, 161688.7810559232)

In [33]:
mz.lars_vs_baseline(validate)

The Lasso Lars Regression model performs better than the baseline.


In [34]:
mz.glm_errors(validate)

(224945396077587.7,
 21294726822306.92,
 246240122899894.62,
 23889697969.15757,
 154562.92559717406)

In [35]:
mz.baseline_mean_errors(validate)

(246164954232602.16, 26143261919.350273, 161688.7810559232)

In [36]:
mz.glm_vs_baseline(validate)

The Generalized Linear Regression model performs better than the baseline.
