King County Real Estate Model

Flatiron Data Science Project - Phase 2
Prepared and Presented by: Sarah Zoeller and Melody Peterson
Presentation PDF

Business Problem

King County Real Estate is a luxury real estate company serving sellers and buyers in the high income earning areas of King County, Washington. The company wants to understand which features translate to higher housing prices in these areas, as well as develop a model to predict price based on housing features.

Data

This project uses the King County House Sales dataset, which can be found in kc_house_data.csv in the data folder of this repo. Descriptions of the columns can be found in column_names.md in the same folder. To narrow the scope of the data to suit our business problem, we also obtained census data of individual income tax returns by zip code for the state of Washington. An edited version of this data can be found in agi_zip_code.xlsx in the data folder of this repo. The cleaning and selection of relevant data from this dataset can be seen in the Additional_Data notebook in the repo.
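The zip-code narrowing described above might look roughly like the following. This is a minimal sketch on small in-memory tables rather than the actual files: the income column name `avg_agi`, the cutoff `top_n`, and both helper functions are assumptions made for illustration, and the real selection logic lives in the Additional_Data notebook.

```python
import pandas as pd

# Toy stand-in for kc_house_data.csv (only the columns needed here).
houses = pd.DataFrame({
    "zipcode": [98004, 98004, 98039, 98002, 98168],
    "price": [1_200_000, 950_000, 2_100_000, 240_000, 260_000],
})

# Toy stand-in for the income table; `avg_agi` is a hypothetical column name.
income = pd.DataFrame({
    "zipcode": [98004, 98039, 98002, 98168],
    "avg_agi": [310_000, 620_000, 48_000, 45_000],
})


def top_income_zipcodes(income_df, top_n):
    """Return the set of zip codes with the highest average income."""
    ranked = income_df.nlargest(top_n, "avg_agi")
    return set(ranked["zipcode"])


def subset_to_zipcodes(houses_df, zipcodes):
    """Keep only the sales located in the given zip codes."""
    mask = houses_df["zipcode"].isin(zipcodes)
    return houses_df.loc[mask].reset_index(drop=True)


top_zips = top_income_zipcodes(income, top_n=2)
luxury = subset_to_zipcodes(houses, top_zips)
```

With the real files, the two toy frames would be replaced by the loaded CSV and spreadsheet, and `top_n` would be chosen to match the business definition of a high-income area.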

Modeling Process

Following the OSEMN (Obtain, Scrub, Explore, Model, Interpret) data science framework, we began with an understanding of our business problem and the acquisition of data. We then followed an iterative process of cleaning and exploring the data, checking for issues with modeling assumptions, creating and testing a model, interpreting the model, and reevaluating the data.

In the initial data exploration, after subsetting the data to the top zip codes, we checked whether the independent variables were normally distributed. Although the variables are not required to be normally distributed, approximately normal distributions can result in better models and predictions.

As part of the data cleaning/scrubbing phase, we checked for duplicates and treated placeholder values and missing values in ways that retained as much data as possible while preserving the integrity of the data. We also checked for multicollinearity among the independent variables and found several pairs of variables with high correlations, including: sqft living/sqft above, sqft living/grade, sqft living 15/sqft living, grade/sqft above, bathrooms/sqft living.
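One common way to run the multicollinearity check described above is to scan the correlation matrix for pairs above a cutoff. The sketch below uses synthetic data; the 0.75 threshold and the helper name `high_correlation_pairs` are assumptions, not values taken from the project notebook.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, threshold=0.75):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once
    # and the diagonal (self-correlation of 1.0) is ignored.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs > threshold].sort_values(ascending=False)
    return [(a, b, round(float(r), 3)) for (a, b), r in pairs.items()]


# Synthetic data: sqft_above is built to track sqft_living closely,
# while floors is independent noise.
rng = np.random.default_rng(0)
sqft_living = rng.normal(3000, 800, size=500)
features = pd.DataFrame({
    "sqft_living": sqft_living,
    "sqft_above": 0.8 * sqft_living + rng.normal(0, 100, size=500),
    "floors": rng.integers(1, 4, size=500).astype(float),
})

flagged = high_correlation_pairs(features)
```

Each flagged pair is a candidate for dropping or combining one of the two variables before modeling.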

Once the data had been cleaned, we explored it further by plotting the variables to look for linear relationships, normal distributions, and skew caused by outliers. Many of the variables appeared to be skewed by abnormally high outliers. We used the interquartile range (IQR) to remove price outliers from the dataset before our train-test split.
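A minimal sketch of IQR-based outlier removal followed by a split is shown below. The 1.5 multiplier is the conventional choice and is an assumption here, as are the 80/20 split, the seed, and both helper names; the project may well have used a library splitter instead of the hand-rolled one shown.

```python
import pandas as pd


def remove_iqr_outliers(df, column, k=1.5):
    """Drop rows whose value lies more than k * IQR outside the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]


def train_test_split_frame(df, test_size=0.2, seed=42):
    """Shuffle the rows, then hold out the first test_size fraction as test data."""
    shuffled = df.sample(frac=1.0, random_state=seed)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled.iloc[n_test:], shuffled.iloc[:n_test]


# Toy prices with one extreme value at the top.
sales = pd.DataFrame({
    "price": [500_000, 550_000, 600_000, 650_000, 700_000,
              750_000, 800_000, 850_000, 900_000, 7_000_000],
})

trimmed = remove_iqr_outliers(sales, "price")
train, test = train_test_split_frame(trimmed)
```

Removing outliers before the split, as described above, means both the training and test sets cover the same trimmed price range.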

After creating an initial baseline model, several of the continuous variables were log transformed and scaled to make them more normally distributed and comparable to each other.
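The transformation step could be sketched as a natural log followed by standardization. Fitting the mean and standard deviation on the training rows only is a design choice assumed here to avoid leaking test information; the helper name and toy values are also illustrative, and the log requires strictly positive inputs.

```python
import numpy as np
import pandas as pd


def log_and_standardize(train, test, columns):
    """Log-transform the given columns, then scale them with training statistics."""
    train_out, test_out = train.copy(), test.copy()
    for col in columns:
        log_train = np.log(train[col])
        log_test = np.log(test[col])
        # Statistics come from the training data only.
        mean, std = log_train.mean(), log_train.std()
        train_out[col] = (log_train - mean) / std
        test_out[col] = (log_test - mean) / std
    return train_out, test_out


train_df = pd.DataFrame({"sqft_living": [1000, 2000, 4000, 8000]})
test_df = pd.DataFrame({"sqft_living": [2000, 16000]})

train_scaled, test_scaled = log_and_standardize(train_df, test_df, ["sqft_living"])
```

After this step each transformed training column has mean 0 and standard deviation 1, which puts coefficients for different variables on a comparable scale.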

We then iterated through the modeling process, interpreting our results after each model and making adjustments based on the statistical significance of the variables. For our final model, this graph shows how our predictions of sale price compare with the actual values, both for the data on which we trained the model and for the test data.
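Comparing fit on the training data with fit on held-out data can be sketched as follows. This uses plain NumPy least squares on synthetic data as a stand-in for the project's model; it does not compute the p-values used for variable selection, and the coefficients, sample sizes, and function names are all assumptions.

```python
import numpy as np


def fit_ols(X, y):
    """Return least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply the fitted intercept and slopes to new rows."""
    return coef[0] + X @ coef[1:]


def r_squared(y_true, y_pred):
    """Share of variance in y_true explained by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic features and target with a known linear relationship plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = 13.5 + 0.4 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 0.05, size=400)

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

coef = fit_ols(X_train, y_train)
train_r2 = r_squared(y_train, predict(coef, X_train))
test_r2 = r_squared(y_test, predict(coef, X_test))
```

A test score close to the training score suggests the model generalizes; a large gap would point to overfitting.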

By holding all variables except one constant at their mean, we can visualize the relationship between sale price and any given variable as predicted by our model.
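For a linear model, this kind of one-variable view can be computed by building rows where every feature sits at its mean except the one being varied. The sketch below assumes a hypothetical coefficient vector (intercept first) and synthetic features; `effect_curve` is an illustrative name, not a function from the project.

```python
import numpy as np


def effect_curve(coef, X, feature_idx, n_points=50):
    """Predict over a grid of one feature while the others stay at their means."""
    column = X[:, feature_idx]
    grid = np.linspace(column.min(), column.max(), n_points)
    # Start from the mean row, repeated once per grid point.
    rows = np.tile(X.mean(axis=0), (n_points, 1))
    rows[:, feature_idx] = grid
    preds = coef[0] + rows @ coef[1:]
    return grid, preds


# Hypothetical fitted coefficients: intercept, then one slope per feature.
coef = np.array([13.5, 0.4, 0.1])
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))

grid, preds = effect_curve(coef, X, feature_idx=0)
slope = (preds[-1] - preds[0]) / (grid[-1] - grid[0])
```

Plotting `preds` against `grid` gives the predicted relationship for that variable; in a purely linear model the line's slope equals the variable's coefficient.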

Conclusions

Significant features in luxury homes include waterfront property, location (zip codes, longitude), and square footage above ground
Having more floors or bedrooms does not necessarily imply higher sale price
Bottom Line: location and square footage are the most important features in determining sale price

Next Steps / Future Work

Refine dataset (expand and cut certain zip codes)
Subset model for different price ranges
Investigate polynomial relationships and interactions between variables in greater detail

