---

Project & Final Report Created By: Rachel Robbins-Mayhill 2022-04-08

---

In [1]:
![Logo](zillow-com-logo.png "Zillow")


## PROJECT DESCRIPTION

Zillow is the leading real estate and rental marketplace dedicated to empowering consumers with data, inspiration and knowledge around the place they call home, and connecting them with the best local professionals who can help. According to the National Association of Realtors, there are over 119 million homes in the United States, over 5 million of which are sold each year. 80% of these homes have been viewed on Zillow regardless of their market status.

Zillow serves the full lifecycle of owning and living in a home: buying, selling, renting, financing, remodeling and more. It starts with Zillow's living database of more than 110 million U.S. homes - including homes for sale, homes for rent and homes not currently on the market, as well as Zestimate home values, Rent Zestimates and other home-related information.

The Zestimate is a key element driving web traffic to Zillow, where sellers, buyers, agents, and curiosity-seekers gain knowledge of a home's value. In fact, over the years, Zillow has built a solid reputation around the Zestimate. The Zestimate takes in layers of data regarding a home's features and location and presents buyers and sellers with an estimated value for the home. Zillow publishes Zestimates for 104 million homes, updating them weekly.

Although Zillow has a model to assist in predicting a home's value, they are looking to fine-tune the model and improve upon it. This project has been requested by the Zillow Data Science Team.

### PROJECT GOAL

The goal of this project is to find key drivers of property value for Single Family Properties and to construct an improved Machine Learning Regression Model to predict property tax assessed values for these properties using the features of the properties. The improved model will help Zillow develop more accurate, dependable, and trustworthy Zestimates, thus sustaining and bolstering their loyal customer base. 

Upon completion of the model, the project will make recommendations on what does and doesn't impact property values and deliver the recommendations in a report to the Data Science team at Zillow, so they can understand the process that developed the conclusion and have the information available to replicate the findings. 


### INITIAL QUESTIONS

1. Is the square footage of a property a driver of property value when controlling for location?
2. Are the number of bedrooms and bathrooms drivers of the value of a property when controlling for square footage?
3. Is the square footage a driver of the value of a property when controlling for **bedrooms**?
4. Is the square footage a driver of the value of a property when controlling for **bathrooms**?

---

Imports used for this project can be viewed in the imports.py file located in the Clustering Project Repository.

In [2]:
from imports import *

### I. Acquire the Data

The data for this report was acquired by accessing 'zillow' from the Codeup SQL database. The following query was used to acquire the data:

In [3]:
'''
The query below is used to join 9 tables from the zillow dataset in the Codeup SQL Cloud Database.  
The tables joined are: properties_2017, predictions_2017, airconditioningtype, architecturalstyletype, 
buildingclasstype, heatingorsystemtype, propertylandusetype, storytype, typeconstructiontype. 
The data is filtered to only include the observations with non-null latitude and longitude and with a 
transaction date occurring in 2017. 
'''

sql = """
SELECT prop.*, 
       pred.logerror, 
       pred.transactiondate, 
       air.airconditioningdesc, 
       arch.architecturalstyledesc, 
       build.buildingclassdesc, 
       heat.heatingorsystemdesc, 
       landuse.propertylandusedesc, 
       story.storydesc, 
       construct.typeconstructiondesc 
FROM   properties_2017 prop  
       INNER JOIN (SELECT parcelid,
                          logerror,
                          Max(transactiondate) transactiondate 
                   FROM   predictions_2017 
                   GROUP  BY parcelid, logerror) pred
               USING (parcelid) 
       LEFT JOIN airconditioningtype air USING (airconditioningtypeid) 
       LEFT JOIN architecturalstyletype arch USING (architecturalstyletypeid) 
       LEFT JOIN buildingclasstype build USING (buildingclasstypeid) 
       LEFT JOIN heatingorsystemtype heat USING (heatingorsystemtypeid) 
       LEFT JOIN propertylandusetype landuse USING (propertylandusetypeid) 
       LEFT JOIN storytype story USING (storytypeid) 
       LEFT JOIN typeconstructiontype construct USING (typeconstructiontypeid) 
WHERE  prop.latitude IS NOT NULL 
       AND prop.longitude IS NOT NULL AND transactiondate <= '2017-12-31' 
"""
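
The grouping inside the `pred` subquery can be explored on a toy table. The sketch below uses Python's built-in `sqlite3` module and made-up rows as a stand-in for `predictions_2017`; it is not the Codeup database, and the variant grouped on `parcelid` alone is shown only for comparison, not as the query used in this report.

```python
import sqlite3

# Toy stand-in for predictions_2017; the values are made up for illustration.
rows = [
    (1, 0.10, "2017-01-15"),
    (1, 0.25, "2017-06-30"),  # same parcel, later sale, different logerror
    (2, 0.05, "2017-03-01"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions_2017 (parcelid INTEGER, logerror REAL, transactiondate TEXT)"
)
conn.executemany("INSERT INTO predictions_2017 VALUES (?, ?, ?)", rows)

# Same shape as the subquery in the report: grouped on parcelid AND logerror.
as_written = conn.execute(
    """
    SELECT parcelid, logerror, MAX(transactiondate) AS transactiondate
    FROM predictions_2017
    GROUP BY parcelid, logerror
    ORDER BY parcelid, transactiondate
    """
).fetchall()

# Comparison variant grouped on parcelid only: one row per parcel with its latest date.
latest_only = conn.execute(
    """
    SELECT parcelid, MAX(transactiondate) AS transactiondate
    FROM predictions_2017
    GROUP BY parcelid
    ORDER BY parcelid
    """
).fetchall()
conn.close()

print(as_written)
print(latest_only)
```

Grouping on both `parcelid` and `logerror` keeps one row per distinct pair, so a parcel that sold twice with different log errors still appears twice after the join.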

In [4]:
# Acquire data from SQL using module found in wrangle.py
df = wrangle.get_zillow()
# Obtain number of rows and columns for original dataframe
df.shape

Reading from csv file...


(77381, 67)

- Once acquired, a new DataFrame containing all necessary data was created. 
    - Original DF -> 77,381 rows and 67 columns.
    - Prepared DF -> 66,937 rows and 33 columns.
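
The "Reading from csv file..." message suggests that `wrangle.get_zillow()` reads a local CSV when one exists. The body of that function is not shown in this report, so the following is only a minimal sketch of such a cache-or-load pattern. The names `get_cached` and `fake_database_pull` are hypothetical, and the fake pull stands in for a real `pd.read_sql` call against the database.

```python
import os
import tempfile

import pandas as pd


def get_cached(csv_path, load_fresh):
    """Return the cached CSV if present; otherwise call load_fresh() and cache it."""
    if os.path.isfile(csv_path):
        print("Reading from csv file...")
        return pd.read_csv(csv_path)
    df = load_fresh()
    df.to_csv(csv_path, index=False)
    return df


def fake_database_pull():
    # Hypothetical stand-in for pd.read_sql(sql, connection) against the real database.
    return pd.DataFrame({"parcelid": [1, 2, 3], "logerror": [0.1, -0.2, 0.05]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "zillow.csv")
    first = get_cached(path, fake_database_pull)   # pulls "fresh" and writes the cache
    second = get_cached(path, fake_database_pull)  # served from the CSV
```

Caching this way avoids re-running a nine-table join on every notebook restart, at the cost of having to delete the CSV whenever the query changes.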

### II. Prepare the Data

The acquired table was then analyzed and cleaned to eliminate data errors, clear up ambiguities, and encode non-numeric data as more useful numeric types. 

#### Tasks for preparing data
Some of the data cleaning and engineering strategies that were employed were:

1. Dropping null values (126 rows in total)
    - due to the low number of null values, the decision was made to drop them
2. Converting datatypes ('state_county_code': object, 'year_built': int)
3. Dropping the 'county_id' column
4. Clarifying the FIPS/state_county_code definition, identifying the State and Counties the codes belong to.
5. Creating a separate column identifying the county for each property in string format for readability -> county_code_bin
6. Creating dummy columns for county codes.
7. Creating categorical columns to better visualize and compare data for:
    - 7a. square feet -> home_sizes (small, medium, large, extra large)
    - 7b. total rooms -> total_rooms (bedrooms + bathrooms)
    - 7c. bedroom bins -> small, medium, large, extra large
    - 7d. bathroom bins -> small, medium, large, extra large
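
Several of the tasks above can be sketched in pandas on a handful of made-up rows. This is an illustration only, not the code in wrangle.py: the `square_feet`, `bedrooms`, and `bathrooms` column names, the bin cut points, and the assumption that the data covers FIPS codes 6037, 6059, and 6111 (Los Angeles, Orange, and Ventura counties in California) are all assumptions made for the example.

```python
import pandas as pd

# Made-up rows; column names follow the task list, values are illustrative only.
df = pd.DataFrame({
    "state_county_code": [6037.0, 6059.0, 6111.0, 6037.0, None],
    "year_built": [1955.0, 1987.0, 2003.0, 1972.0, 1990.0],
    "square_feet": [900.0, 1800.0, 2600.0, 4200.0, 1500.0],
    "bedrooms": [2.0, 3.0, 4.0, 5.0, 3.0],
    "bathrooms": [1.0, 2.0, 2.5, 4.0, 2.0],
    "county_id": [3101.0, 1286.0, 2061.0, 3101.0, 3101.0],
})

# 1. Drop rows containing nulls.
df = df.dropna()

# 2. Convert datatypes: FIPS code as a string label, year as an integer.
df = df.astype({"state_county_code": int, "year_built": int})
df["state_county_code"] = df["state_county_code"].astype(str)

# 3. Drop the county_id column.
df = df.drop(columns=["county_id"])

# 4-5. Map FIPS codes to readable county names (assumed set of counties).
county_names = {
    "6037": "Los Angeles County",
    "6059": "Orange County",
    "6111": "Ventura County",
}
df["county_code_bin"] = df["state_county_code"].map(county_names)

# 6. One-hot encode the county labels.
df = pd.concat([df, pd.get_dummies(df["county_code_bin"])], axis=1)

# 7a. Bin square footage into size categories (cut points are hypothetical).
df["home_sizes"] = pd.cut(
    df["square_feet"],
    bins=[0, 1000, 2000, 3000, float("inf")],
    labels=["small", "medium", "large", "extra large"],
)

# 7b. Total rooms as bedrooms plus bathrooms.
df["total_rooms"] = df["bedrooms"] + df["bathrooms"]
```

The bedroom and bathroom bins in 7c and 7d could follow the same `pd.cut` pattern as 7a with their own cut points.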

#### Results of Preparing the Data

In [5]:
df = wrangle.remove_columns(df, cols_to_remove=['censustractandblock', 'finishedsquarefeet12',
                                                'buildingqualitytypeid', 'heatingorsystemtypeid',
                                                'propertyzoningdesc', 'heatingorsystemdesc', 'unitcnt'])

In [6]:
df = wrangle.data_prep(df, prop_required_column=.5, prop_required_row=.5)

(66937, 33)
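
The `prop_required_column` and `prop_required_row` arguments read like minimum shares of non-null values. The body of `wrangle.data_prep` is not shown here, and it likely does more than this given the drop in row count, so the sketch below only illustrates one way such thresholds could work with `DataFrame.dropna`. The function name `drop_sparse` and the toy data are hypothetical.

```python
import math

import pandas as pd


def drop_sparse(df, prop_required_column=0.5, prop_required_row=0.5):
    """Drop columns, then rows, whose share of non-null values is below the thresholds."""
    # Minimum non-null count a column needs, given the number of rows.
    col_thresh = math.ceil(prop_required_column * len(df.index))
    df = df.dropna(axis=1, thresh=col_thresh)
    # Minimum non-null count a row needs, given the remaining columns.
    row_thresh = math.ceil(prop_required_row * len(df.columns))
    return df.dropna(axis=0, thresh=row_thresh)


# Made-up frame: column "c" is 75% null, and the last row is null in most remaining columns.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, None],
    "b": [1.0, None, 3.0, None],
    "c": [None, None, None, 4.0],
    "d": [1.0, 2.0, 3.0, 4.0],
})
cleaned = drop_sparse(toy)
print(cleaned.shape)
```

Filtering columns before rows matters: a row is judged only against the columns that survive, so sparse columns do not cause otherwise complete rows to be dropped.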


In [8]:
# Shape of the first 5 rows of the cleaned table
df.head().shape

(5, 33)