# Final Report
## Clustering Analysis (with Regression modeling)
### Brandyn Waterman, 4/8/2022, Innis Cohort
Hello and welcome! Let's begin with the needed imports:

In [6]:
# Dataframe manipulations
import pandas as pd
import numpy as np

# Modules needed to perform necessary functions
import wrangle_zillow as w
import explore as e

# Turning off warnings
import warnings
warnings.filterwarnings('ignore')

## Overview:
The purpose of this project is to assist in the prediction of log error for Zillow's Zestimate house value predictions. This will be done by:
- Identifying some of the key drivers behind the log error
- Applying these insights to regression models that can help predict the log error
- Sharing learned insights to provide recommendations and solutions moving forward 

### Planning:
Before interacting with the data, we lay out our initial questions:
1. Do primary house attributes impact log error? (bedrooms, bathrooms, age, squarefeet)
2. Do secondary house attributes impact log error? (num_fireplace, threequarter_baths, hottub_or_spa, has_pool)
3. Does geography impact log error? (latitude, longitude, regionidzip, fips)
4. Can we successfully use any of our features to cluster for log error predictions?
    - Geographic clustering
        - Latitude/Longitude
    - Continuous feature clustering
5. Does log error being positive or negative arise from any of the features?

Some of the hypotheses to be explored:
1. Is there a linear relationship between log error and our continuous features? (Pearson's r test)
2. ...
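
As a rough illustration of how the first hypothesis could be tested, the sketch below runs `scipy.stats.pearsonr` on each continuous feature against the target and compares the p-value to alpha. The data is synthetic, and the column names and the `pearson_report` helper are hypothetical stand-ins rather than the project's actual code.

```python
import numpy as np
import pandas as pd
from scipy import stats

alpha = 0.05

# Synthetic stand-in data (assumption): not the Zillow dataset
rng = np.random.default_rng(42)
n = 500
pearson_demo = pd.DataFrame({
    "square_feet": rng.normal(1800, 400, n),
    "age": rng.normal(40, 15, n),
})
# Build a target that depends linearly on square_feet plus noise
pearson_demo["logerror"] = 0.0001 * pearson_demo["square_feet"] + rng.normal(0, 0.05, n)


def pearson_report(df, target, features, alpha=0.05):
    """Run a Pearson's r test of each feature against the target (hypothetical helper)."""
    rows = []
    for col in features:
        r, p = stats.pearsonr(df[col], df[target])
        rows.append({
            "feature": col,
            "r": float(r),
            "p_value": float(p),
            # Reject the null of no linear relationship when p < alpha
            "reject_null": bool(p < alpha),
        })
    return pd.DataFrame(rows)


report = pearson_report(pearson_demo, "logerror", ["square_feet", "age"], alpha)
print(report)
```

A significant result only indicates a linear association; it says nothing about how much of the log error the feature explains.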

### Acquire:
The wrangle_zillow.py module contains the functions used to acquire our data. The get_db_url() function helps us access the SQL server; the acquire_zillow() function then runs a query against it and stores the result in a dataframe. The initial dataframe contains many columns, which will be narrowed down during preparation and exploration of the data.

In [7]:
# From our wrangle_zillow module we use the acquire_zillow() function to gather the Zillow data from the SQL server
zillow = w.acquire_zillow()

Using cached csv
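
The "Using cached csv" message above suggests that the acquire step caches its query result locally. A minimal sketch of that pattern follows, assuming a hypothetical `acquire_cached` helper and a fake fetch function in place of the real SQL query. The actual `acquire_zillow()` implementation may differ.

```python
import os
import tempfile

import pandas as pd


def acquire_cached(csv_path, fetch):
    """Read csv_path if it exists; otherwise call fetch() and cache the result.

    Hypothetical helper: `fetch` stands in for the real SQL query.
    """
    if os.path.isfile(csv_path):
        print("Using cached csv")
        return pd.read_csv(csv_path)
    print("Acquiring data from source")
    df = fetch()
    df.to_csv(csv_path, index=False)
    return df


def fake_fetch():
    # Stand-in (assumption) for something like pd.read_sql(query, url)
    return pd.DataFrame({"parcelid": [1, 2, 3], "logerror": [0.01, -0.02, 0.03]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "zillow.csv")
    first = acquire_cached(path, fake_fetch)   # fetches and writes the csv
    second = acquire_cached(path, fake_fetch)  # reads the cached csv
```

Caching this way avoids repeating a slow database query on every run, at the cost of having to delete the csv whenever the query changes.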


### Prepare:
After acquiring our data, we need to modify and clean it to make it useful for our purposes. The following steps were taken:
1. Ensuring we are only working with single unit properties, utilizing identifiers from the SQL server
2. Identifying columns where nulls reflect a lack of proper data input, and filling those nulls with more meaningful values
    - Columns this was mainly applied to: fireplace, hottub/spa, pool, three quarter bath, tax delinquency
3. Dropping leftover null values, and unwanted data (based on unusable or incorrect data inputs)
    - This is done with all-encompassing methods (dropna()), minimum non-null proportion requirements by row or column, and removal of faulty data inputs (e.g. 0 bedrooms)
4. Feature engineering age from yearbuilt data
5. Ensuring columns are the correct data type
6. Removal of outliers to make our outcomes as generally usable as possible
7. Encoding our currently recognized categorical columns
    - fips, hottub_or_spa, has_pool, tax_delinquency
8. Renaming our columns for easier use
9. Splitting our original dataframe into train, validate, and test dataframes
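
A few of the steps above (filling nulls, dropping faulty rows, engineering age, and removing outliers) can be sketched on a toy dataframe. This is only an illustration: the `clean_sketch` helper, the fill value of 0, the `reference_year` of 2017, and the 1.5 x IQR outlier rule are all assumptions, not the choices made inside `prepare_zillow()`.

```python
import numpy as np
import pandas as pd


def clean_sketch(df, reference_year=2017, k=1.5):
    """Toy version of a few prepare steps (hypothetical helper)."""
    df = df.copy()
    # Assumption: a null in these columns means the feature is absent
    for col in ["num_fireplace", "has_pool"]:
        df[col] = df[col].fillna(0)
    # Drop rows with remaining nulls, then faulty inputs such as 0 bedrooms
    df = df.dropna()
    df = df[df["bedrooms"] > 0]
    # Engineer age from yearbuilt (reference_year is an assumption)
    df["age"] = reference_year - df["yearbuilt"]
    # Remove square_feet outliers outside k * IQR of the quartiles
    q1, q3 = df["square_feet"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["square_feet"].between(q1 - k * iqr, q3 + k * iqr)]
    # Ensure count-like columns have an integer dtype
    return df.astype({"bedrooms": int, "num_fireplace": int, "has_pool": int, "age": int})


raw = pd.DataFrame({
    "bedrooms": [3, 0, 4, 2, 3, 3],
    "yearbuilt": [1990, 1985, 2001, np.nan, 1975, 1960],
    "square_feet": [1500, 1200, 2100, 900, 1700, 25000],
    "num_fireplace": [1, np.nan, np.nan, 2, np.nan, 1],
    "has_pool": [np.nan, 1, np.nan, np.nan, 1, np.nan],
})
clean = clean_sketch(raw)
print(clean)
```

In this toy input, one row is dropped for a missing yearbuilt, one for 0 bedrooms, and one as a square_feet outlier, leaving three rows.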

In [9]:
# From our wrangle_zillow module we use the prepare_zillow() function on our acquired dataframe
# We clean, prepare, and split the dataframe to produce train, validate, and test dataframes
train, validate, test = w.prepare_zillow(zillow)
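
For readers unfamiliar with the three-way split, here is one possible sketch using a seeded shuffle with `DataFrame.sample` and positional slicing. The 60/20/20 proportions, the seed, and the `split_sketch` helper are assumptions for illustration; `prepare_zillow()` may split differently.

```python
import numpy as np
import pandas as pd


def split_sketch(df, validate_frac=0.2, test_frac=0.2, seed=123):
    """Shuffle once, then slice into train, validate, and test (hypothetical helper)."""
    shuffled = df.sample(frac=1, random_state=seed)
    n = len(shuffled)
    n_test = round(n * test_frac)
    n_validate = round(n * validate_frac)
    test = shuffled.iloc[:n_test]
    validate = shuffled.iloc[n_test:n_test + n_validate]
    train = shuffled.iloc[n_test + n_validate:]
    return train, validate, test


# Toy dataframe standing in for the prepared Zillow data
split_demo = pd.DataFrame({
    "parcelid": np.arange(100),
    "logerror": np.linspace(-0.5, 0.5, 100),
})
train_demo, validate_demo, test_demo = split_sketch(split_demo)
print(train_demo.shape, validate_demo.shape, test_demo.shape)
```

Exploration and model fitting would then use only the train portion, with validate reserved for comparing models and test held back for a final check.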

### Explore:
...

In [5]:
# We will set our alpha for all of our statistical testing
alpha = .05

#### Question 1: ...

#### Question 2: ...

#### Question 3: ...

#### Question 4: ...

#### Question 5: ...

### Scaling:
...

### Clustering:
...

### Exploration Summary:
...