In [1]:
# standard DS imports
import pandas as pd
import numpy as np

# viz and stats
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import plotly.express as px
from scipy import stats

# for feature selection verification and evaluation 
from sklearn.metrics import r2_score, mean_squared_error, explained_variance_score
from sklearn.feature_selection import SelectKBest, RFE, f_regression, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.cluster import KMeans

# local functions
import wrangle

# Zillow Project with Clustering

----

## Executive Summary:

Goals:
- Identify factors evaluated in home value
- Build a model to best predict home value
- Minimize Root Mean Squared Error (RMSE) in modeling

Key Findings:


## -------------->>>>  BELOW NEEDS EDITS   <<<<------------
- Location data and the use of home square footage are the most impactful factors for predicting home value.
- Adding a bathroom increases home value more than adding a bedroom.
- All models (LassoLars, Quadratic Regression, Cubic Regression) predicted home value better than the baseline.

Takeaways:
 - My best model, Quadratic Regression, reduced the baseline error by only \\$35,000, or 13% of total baseline error.
 - More in-home features and/or location-based quality-of-life data would likely improve the model.

Recommendations:
- Evaluate data that tax value assessors use in their assessment. They have a policy and procedure they must follow, and being able to use their process in predicting home value would be essential to building better models moving forward.

----

## 1. Planning

 - Create deliverables:
     - README
     - final_report.ipynb
     - working_report.ipynb
 - Build functional wrangle.py, explore.py, and model.py files
 - Acquire the data from the CodeUp database via the wrangle.acquire functions
 - Prepare and split the data via the wrangle.prepare functions
 - Explore the data utilizing clustering and define hypotheses. Run the appropriate statistical tests in order to reject or fail to reject each null hypothesis. Document findings and takeaways.
 - Create a baseline model for predicting home value and document the RMSE.
 - Fit and train three (3) regression models to predict home value on the train dataset.
 - Evaluate the models by comparing the train and validation data.
 - Select the best model and evaluate it on the test data.
 - Develop and document all findings, takeaways, recommendations and next steps. 
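
The baseline step in the plan above could be sketched as follows. This is a minimal illustration, assuming the baseline predicts the mean training value for every home; the `baseline_rmse` helper and the toy dollar amounts are hypothetical and are not taken from the project files.

```python
import numpy as np

def baseline_rmse(y_train, y_eval):
    """RMSE of a baseline that always predicts the mean of y_train."""
    y_train = np.asarray(y_train, dtype=float)
    y_eval = np.asarray(y_eval, dtype=float)
    prediction = y_train.mean()  # one constant prediction for every home
    return float(np.sqrt(np.mean((y_eval - prediction) ** 2)))

# Toy home values in dollars (made up for illustration)
train_values = [200_000, 300_000, 400_000]
validate_values = [250_000, 350_000]

print(baseline_rmse(train_values, train_values))
print(baseline_rmse(train_values, validate_values))
```

Any regression model is then judged by how far its RMSE falls below this number on the same split.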

-----

## 2. Data Acquisition
In this step, I called my acquire_zillow function from wrangle.py. This function:
- grabs the data from the CodeUp database via a SQL query
- creates a local CSV of the table, if not already saved locally
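
The caching behavior described above might look roughly like the sketch below. It is not the code in wrangle.py: the `acquire_cached` name and the `fetch` parameter are hypothetical, and `fetch` stands in for whatever function runs the SQL query and returns a DataFrame.

```python
import os
import tempfile

import pandas as pd

def acquire_cached(csv_path, fetch):
    """Read csv_path if it exists; otherwise call fetch(), save the result, and return it."""
    if os.path.isfile(csv_path):
        return pd.read_csv(csv_path)
    df = fetch()                      # e.g. a wrapper around the SQL query
    df.to_csv(csv_path, index=False)  # cache locally for the next run
    return df

# Stand-in for the database query, counting how often it is called
calls = []
def fake_fetch():
    calls.append(1)
    return pd.DataFrame({"area": [1500.0, 2100.0], "value": [250_000, 410_000]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "zillow.csv")
    first = acquire_cached(path, fake_fetch)   # runs the "query" and writes the CSV
    second = acquire_cached(path, fake_fetch)  # reads the CSV instead

print(len(calls))
```

Passing the query in as a callable keeps the caching logic testable without a database connection.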

#### Data Dictionary

| Target | Type | Description |
| ---- | ---- | ---- |
| value | int | The assessed tax value amount of the home |


| Feature Name | Type | Description |
| ---- | ---- | ---- |
| area | float | Sum of square feet in the home |
| baths | float | Count of bathrooms in the home |
| beds | float | Count of bedrooms in the home |
| decade | int | The decade the home was built in |
| extras | float | Sum of the home's bathrooms, bedrooms, stories, pool, and if it has a garage |
| garage | int | Sum of square feet in the garage |
| half_bath | int | 1 if the home has a half bath, 0 if not |
| lat | float | The home's geographical latitude |
| lat_long | float | The home's latitude divided by its longitude |
| living_space | float | The home area in sqft minus 132 sqft per bedroom and 40 sqft per bathroom (average sqft per respective room) |
| location | object | The human-readable county name the home is in |
| long | float | The home's geographical longitude |
| los_angeles | int | 1 if the home is in Los Angeles County, 0 if not | 
| lot_size | float | Sum of square feet of the piece of land the home is on |
| orange | int | 1 if the home is in Orange County, 0 if not |
| pool | int | 1 if the home has a pool, 0 if not |
| stories | int | Count of how many levels or stories the home has |
| ventura | int | 1 if the home is in Ventura County, 0 if not|
| yard_size | float | The lot size minus the home area in sqft |
| year_built | float | The year the home was built |
| zipcode | float | The US postal service 5-digit code for the home's location |
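
Several of the features in the table are derived from other columns. The sketch below shows one way those definitions could be computed with pandas. It is an assumption-laden illustration: the `add_engineered_features` helper is hypothetical, `decade` is assumed to be the build year floored to a multiple of ten, and "has a garage" is assumed to mean a garage area above zero.

```python
import pandas as pd

def add_engineered_features(df):
    """Return a copy of df with the derived columns from the data dictionary."""
    out = df.copy()
    out["living_space"] = out["area"] - 132 * out["beds"] - 40 * out["baths"]
    out["yard_size"] = out["lot_size"] - out["area"]
    out["lat_long"] = out["lat"] / out["long"]
    # Assumption: 1987 -> 1980
    out["decade"] = (out["year_built"] // 10 * 10).astype(int)
    # Assumption: a home "has a garage" when its garage area is above zero
    has_garage = (out["garage"] > 0).astype(int)
    out["extras"] = out["baths"] + out["beds"] + out["stories"] + out["pool"] + has_garage
    return out

# One made-up home, not a real Zillow record
homes = pd.DataFrame({
    "area": [1500.0], "beds": [3.0], "baths": [2.0], "lot_size": [6000.0],
    "lat": [34.0], "long": [-118.0], "year_built": [1987.0],
    "stories": [1], "pool": [0], "garage": [400],
})
features = add_engineered_features(homes)
print(features[["living_space", "yard_size", "decade", "extras"]])
```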

### Acquisition Takeaways
- The Zillow data brings in 10 columns of data covering 52,441 homes.
- These homes had transactions during 2017 and are tagged as Single Family Residences.

----

## 3. Data Preparation

#### Preparation Takeaways
- stuff and things

-----

## 4. Data Exploration

### Explore Takeaways:
- Finding 1
- Finding 2
- Finding 3

----

## 5. Data Modeling

#### My goal is to minimize RMSE while maintaining a healthy R<sup>2</sup>, so that prediction error stays low and the model still explains a meaningful share of the variance in home value.
Some features were dropped to improve model fit based on the exploration findings. Location metrics clearly play a key role in determining assessed home value. The location feature dropped below was first split into the los_angeles, orange, and ventura features in order to prevent skew in modeling.

| Features Kept | Features Dropped |
| ---- | ---- |
| baths | location |
| beds | decade |
| area | yard_size |
| lot_size |  |
| zipcode |  |
| lat |  |
| long |  |
| lat_long |  |
| los_angeles |  |
| orange |  |
| ventura |  |
| living_space |  |
| half_bath |  |
| pool |  |
| stories |  |
| garage |  |
| extras |  |
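
Splitting the location column into the los_angeles, orange, and ventura indicator features can be sketched with `pd.get_dummies`. This is only an illustration: the county strings below are assumed values, and the actual encoding in wrangle.py may differ.

```python
import pandas as pd

# Assumed county labels; the real strings in the location column may differ
homes = pd.DataFrame({"location": ["Los Angeles", "Orange", "Ventura", "Los Angeles"]})

# One 0/1 column per county
dummies = pd.get_dummies(homes["location"]).astype(int)
dummies.columns = [name.lower().replace(" ", "_") for name in dummies.columns]

# Keep the indicator columns and drop the original text column
encoded = pd.concat([homes, dummies], axis=1).drop(columns="location")
print(encoded)
```

Each row has exactly one indicator set to 1, so the text column carries no extra information once the three indicators exist.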

### Modeling Takeaways:
- The quadratic regression model performed best with \\$196,592 RMSE and a 0.305 R<sup>2</sup> value.
- The cubic regression model appears to be overfit.
- All three models outperformed the baseline by at least \\$30,000.
- Despite beating baseline, none of these models are able to predict home value with a high degree of certainty.
- Going forward I would look into other machine learning methods to create a better-fitting model.
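
The comparison behind these takeaways can be sketched as below, using the same scikit-learn classes imported at the top of the notebook. The data is synthetic (a single made-up area feature with a made-up value formula), so the printed numbers say nothing about the Zillow results; the `rmse` helper and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

rng = np.random.default_rng(0)

def synthetic_value(X):
    # Made-up relationship between area and value, plus noise
    area = X[:, 0]
    return 50_000 + 100 * area + 0.02 * area ** 2 + rng.normal(0, 20_000, len(area))

X_train = rng.uniform(500, 4000, size=(200, 1))
X_validate = rng.uniform(500, 4000, size=(80, 1))
y_train, y_validate = synthetic_value(X_train), synthetic_value(X_validate)

# Baseline: predict the mean training value for every home
baseline = rmse(y_validate, np.full(len(y_validate), y_train.mean()))

results = {}
for name, degree in [("quadratic", 2), ("cubic", 3)]:
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    results[name] = {
        "train": rmse(y_train, model.predict(poly.transform(X_train))),
        "validate": rmse(y_validate, model.predict(poly.transform(X_validate))),
    }

print(baseline, results)
```

A validate RMSE that is much worse than the train RMSE for the same model is the usual sign of overfitting, which is the pattern noted for the cubic model.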

----

# Conclusion

## -------------->>>>  BELOW NEEDS EDITS   <<<<------------


Home value is assessed through a myriad of metrics taken about the home. Location- and area-based information has proven the most valuable, but there is still room for improvement. My best model only reduced the root mean squared error by \\$35,000 from the baseline results, a 13% reduction in error.

### Recommendations:
- Evaluate data that tax value assessors use in their assessment. They have a policy and procedure they must follow, and being able to use their process in predicting home value would be essential to building better models moving forward.
- Add data on, or begin tracking, school rankings and crime rates for each neighborhood. I predict that homes in areas with higher school ratings and lower crime rates will be valued higher than homes in areas with low school ratings or high crime rates.

### Next Steps:
- Feature engineer more detailed depictions of how the area inside the home is used. Specifically, determine the kitchen vs. living area sections of the home and see how this affects the model.
- Develop a model using different machine learning techniques focused on geographical distance. Home value is often geographically clustered, as depicted in our findings.
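
One possible starting point for the geographic next step is to cluster homes on latitude and longitude with `KMeans` and feed the cluster label to a model as a feature. The sketch below uses two synthetic neighborhoods rather than Zillow rows, and the cluster count is an arbitrary assumption. It also treats degrees as flat Euclidean coordinates, which is only an approximation of real distance.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Two synthetic neighborhoods as (latitude, longitude) pairs; not real data
north = rng.normal(loc=[34.4, -119.0], scale=0.02, size=(50, 2))
south = rng.normal(loc=[33.7, -117.8], scale=0.02, size=(50, 2))
coords = np.vstack([north, south])

# n_clusters=2 matches the toy data; a real analysis would need to tune it
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
geo_cluster = kmeans.fit_predict(coords)

print(np.bincount(geo_cluster))
```

The resulting `geo_cluster` array could then be one-hot encoded and added to the feature set alongside lat and long.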

----