
Used past sales prices, along with 15+ home attributes/features, to help home sellers understand the value of their home prior to deciding whether to sell.


rgpihlstrom/Multiple-Linear-Regression


Phase 2 Project


King County Realty - Increasing Revenue From Seller Commissions

Author: Russell Pihlstrom

Overview

This project uses multiple linear regression to predict potential future prices of homes sold in King County, Washington, and potentially surrounding areas. By analyzing past sale prices against the features of each home, I developed a predictive model that explains 82% (R^2 = .82) of the variability in price. The model can be used to estimate the value each feature contributes to the final sales price of a home, and to predict potential future sales prices for homes where the data required to run the model are available. The following features had the greatest weights within the model: Previous Year Appraisals Values, School District Rank, Number of Fortune 500 Companies within 10 miles, Grade Of Home, and Quantity of Finished Above Square Footage.

Models

The model was developed using actual sales transactions that took place between May 2014 – May 2015 in King County, Washington. It was built to help a hypothetical real estate firm called King County Realty increase revenue from seller commissions by addressing two areas the CEO of King County Realty believes are gaps in current business processes:

  1. Helping agents make more valuable recommendations to potential home sellers about which features to improve in order to maximize sales price. Based on the model, I advised agents to steer potential sellers toward the following areas: Above Square Footage, Grade of Home, Improving School District, Inviting Fortune 500 expansion/addition.

  2. Setting "better", more informed sales prices. Beyond analyzing the impact features have on predicted sales prices, my research suggested adding Assessor Appraisal Value as a feature to make more accurate predictions. I found no mention of this feature as a valuable predictor in my research before it emerged during final model creation.

Ultimately, the model can be used as a standalone tool to help agents set prices that reflect true market value, as well as an educational tool to improve agent acumen. This will give King County Realty the sophistication it needs to be more successful as it faces the challenges associated with rapidly rising predicted home prices along with higher levels of inventory in the form of new construction.

Business Problem

The King County Realty company is looking to increase its revenue from seller commissions. Housing market experts are predicting a sharp increase in home prices along with increases in housing inventory due to new construction. The CEO wants to ensure the firm is prepared to meet the challenges of this new housing market and believes the firm needs to be more data driven in its approach to helping customers buy and sell homes. The CEO has hired me to help leverage the firm's current data sources, and to find new ones, to help agents in two primary areas:

  1. Better understand and predict the relationship between home improvement projects/features and their impact on final sales price
  2. Understand if there are other data sources that agents can use to help set smarter prices

Historically, prices have been set using several sources of data that are presented to agents in the form of a Comparative Market Analysis (CMA) report. The CEO is wondering if there are other sources of data to help triangulate the prices set for newly listed homes.

Overall, pressure is mounting as housing market dynamics shift and as customer expectations and competition from other boutique realty firms increase.

Data

  1. Provided Data: The initial source of data entailed approximately 21k sales transactions that took place between May 2014 – May 2015 in the county of King in the state of Washington. In addition to providing the actual sales price reached for each home, the provided data also included metadata for each home sold such as: Square Footage of Lot, Square Footage Above, Square Footage of Basement, Grade of Home, Condition of Home, Views Available, Other misc features.

  2. Scraped Data: King County Tax Assessor Data: This data contains several years of past appraisal information used to justify taxes levied per house. The data from the Tax Assessor Office also include additional physical attributes as well as school district associated with each Parcel ID. Example: https://blue.kingcounty.com/Assessor/eRealProperty/Dashboard.aspx?ParcelNbr=7338200240

  3. Scraped Data: King County School District & Rank: This data is used to get the ranking of the school district associated with each of the homes in the initial dataset discussed above. Example: https://backgroundchecks.org/top-school-districts-in-washington-2018.html

  4. Downloaded Data: I downloaded a variety of other data, such as "Hot Zip Code" data and Attraction data, and created custom bins/groupings of data that were used in supporting model development. Example: https://www.realtor.com/research/reports/hottest-markets/

Methods

This project uses the CRISP-DM methodology to generate and optimize the final multiple linear regression model. The model uses a home's features (physical and/or location-based) together with other information scraped from the assessor's office, including the home's tax appraisal data. The features with the heaviest weights (coefficients) in predicting a home's price were: Previous Year Appraisals Values, School District Rank, Number of Fortune 500 Companies within 10 miles, Grade Of Home, and Above Square Footage. In addition to those listed, several other features were used to explain variability in home prices.

As prescribed by the CRISP-DM methodology, model development was highly iterative. I began with secondary research into the basic business drivers of the real estate industry; more specifically, I researched which features are most attractive to buyers and sellers, along with the methods real estate agents use to set home prices. Early in the iterative research/modeling process it became clear that the model was missing key elements for determining the attractiveness of a home's location. I addressed this gap by scraping several sources that would allow me to ascertain the richness of a home's location (School District & Ranking, Proximity to Attractions & Corporate Headquarters of the 12 Fortune 500 companies located in King County, Zip Code Hotness Scores, and other misc information). Unfortunately, only a handful of these sources were present in the final model, as the VIF and correlation tests resulted in many of them being excluded.

Adhering to Linear Regression Assumptions: The assumptions associated with using multiple linear regression to make predictions were checked during model development, which added to the number of iterations required to find the optimum combination of data & features. The assumptions include: linearity, absence of multicollinearity, homoscedasticity, and normality of errors.
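The VIF screening mentioned above can be sketched as follows. This is a minimal illustration, not the code from the project notebooks: the function names, the threshold of 5 (a common rule of thumb), and the synthetic columns are all assumptions made for the example.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows = homes, columns = features)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Intercept column so the auxiliary regression is not forced through zero.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(float("inf") if r_squared >= 1.0 else 1.0 / (1.0 - r_squared))
    return vifs


def drop_high_vif(X, names, threshold=5.0):
    """Drop the feature with the largest VIF until every VIF is <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vifs = variance_inflation_factors(X)
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return names


# Synthetic data for illustration only (not the King County dataset).
rng = np.random.default_rng(0)
sqft_above = rng.normal(2000, 500, 200)
sqft_living = sqft_above + rng.normal(0, 20, 200)  # nearly duplicates sqft_above
school_rank = rng.normal(50, 10, 200)
X = np.column_stack([sqft_above, sqft_living, school_rank])

kept = drop_high_vif(X, ["sqft_above", "sqft_living", "school_rank"])
print(kept)
```

In this toy case one of the two nearly identical square-footage columns is dropped, which mirrors how location features that overlapped with each other could fall out of the final model.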

Results

The final model achieved a predictive power (Adj R^2) of .82.
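For readers unfamiliar with the adjusted statistic, the sketch below shows the standard formula that penalizes R^2 for the number of features. The sample size (about 21k rows, from the Data section) and the feature count of 15 (from the repository description's "15+") are illustrative assumptions; the exact values used in the final model are not stated here.

```python
def adjusted_r_squared(r_squared, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than features plus one")
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / dof


# Illustrative inputs only: n and p are assumptions, not the model's actual values.
adj = adjusted_r_squared(r_squared=0.82, n_samples=21_000, n_features=15)
print(round(adj, 4))
```

With this many rows relative to the number of features, the adjustment is tiny, which is consistent with the Overview and Results both reporting .82.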

The features that had the most influence/ weights include:

- 1.2 x (Previous Year Appraisals Values)

- .34 x (Above Square Footage)

- .22 x (School District Rank)

- .15 x (Number of Fortune 500 Companies)

- Other x (See Power Point Presentation or Jupyter Notebook “Models” for details)
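As a quick worked check of how the listed weights compare, the snippet below computes how much larger the top weight is than each of the others. It uses only the four weights listed above; the variable names are illustrative.

```python
# Weights as listed in the Results section.
weights = {
    "Previous Year Appraisals Values": 1.2,
    "Above Square Footage": 0.34,
    "School District Rank": 0.22,
    "Number of Fortune 500 Companies": 0.15,
}

top_name = "Previous Year Appraisals Values"
top_weight = weights[top_name]

# Percentage by which the top weight exceeds each other weight.
pct_larger = {
    name: (top_weight - w) / w * 100
    for name, w in weights.items()
    if name != top_name
}

for name, pct in pct_larger.items():
    print(f"{top_name} is {pct:.0f}% larger than {name}")
```

The appraisal weight is roughly 3.5 times the "Above Square Footage" weight, which is where the "greater than 200%" difference cited later comes from.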


Greatest Insight: The weight that "Assessor Appraisal Value" had, relative to the second-heaviest feature "Above Square Footage", was very surprising and insightful, because this variable was not referenced in any of my research on determining a home's sales price. Research suggested that agents primarily use CMA (Comps) to set prices, and while I did not have a feature related to CMA in my model, I would hypothesize it would be stronger than "Assessor Appraisal Value" in predictive power. That being said, I hypothesize that "Assessor Appraisal Value" would be the second-heaviest feature in a model that contained both "Comp Prices" and "Assessor Appraisal Value". Given this hypothesis, I would recommend it be added as a potential attribute to all home price prediction models. Most research suggests the more common features "Location, Location, Location" and "Above Square Footage" as the most helpful. I believe "Assessor Appraisal Value" is currently being overlooked.

One caveat to this recommendation applies when significant improvements have been made to a home since the last onsite visit of a tax assessor. Per the King County website, assessors make onsite assessments once every 6 years, so older assessments become less accurate if a home has been significantly improved since the last visit. The same is true if significant declines in value have occurred since the last onsite appraisal visit.

Extra Credit

Given the greater than 200% difference in weight between the "Assessor Appraisal Value" feature and the next most impactful feature, "Above Square Footage", I thought it would be interesting to see how well our model would predict "Assessor Appraisal Value". More specifically, I used the original model and changed the target from "Home Sales Price" to "Assessor Appraisal Value". Interestingly, my model predicted/explained "Assessor Appraisal Value" with greater strength than "Home Sales Price". Keep in mind my second model did not include "Assessor Appraisal Value" as a feature. See below for a comparison of the same model predicting both "Home Sales Price" and "Assessor Appraisal Value".

[Figure: comparison of the same model predicting "Home Sales Price" and "Assessor Appraisal Value"]

INSIGHT: What this suggests is that the factors assessors use to create appraisals are very similar to those that were provided in our initial dataset and used to build our early models. The fact that our model explained more variability in "Assessor Appraisal Value" suggests that assessors rely more on tangible features than on intangible ones when creating appraisals. This conclusion can be drawn given the heavy presence of tangible features in the dataset. More specifically, assessors rely more heavily on tangible features (Above Sqr Ft, Lot Size, Grade, etc.) when developing appraisals than buyers and sellers do when arriving at final prices. My hypothesis is that buyers and sellers rely more heavily on intangible features when estimating the value of a home. The greater unexplained variance when our original model predicts home sales price, compared with assessor appraisal value, suggests the influence of features that cannot be quantified, such as subjective preferences.
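The comparison described above, the same features fitted against two different targets, can be sketched as follows. Everything here is synthetic and hypothetical: the data, the coefficient values, and the noise levels are invented purely to show the mechanics, and the numbers do not reproduce the project's results.

```python
import numpy as np


def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R^2."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins for tangible features (e.g. square footage, lot size, grade).
rng = np.random.default_rng(42)
n = 500
tangible = rng.normal(size=(n, 3))
signal = tangible @ np.array([1.0, 0.5, 0.25])

# Assumption for illustration: appraisals track tangible features closely,
# while sale prices carry extra noise from unmeasured, intangible factors.
appraisal = signal + rng.normal(scale=0.3, size=n)
sale_price = signal + rng.normal(scale=1.0, size=n)

r2_appraisal = r_squared(tangible, appraisal)
r2_sale = r_squared(tangible, sale_price)
print(f"R^2 appraisal: {r2_appraisal:.2f}, R^2 sale price: {r2_sale:.2f}")
```

When a target depends mostly on the measured features, the same feature set explains more of its variance, which is the pattern the insight above attributes to assessor appraisals.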

Conclusions/ Recommendations

This analysis leads to three recommendations that would help King County Realty achieve higher seller commissions:

1. Physical "Controllable" Features: As real estate agents coach prospective sellers on which features to improve in order to maximize sales price, they should focus on "Above Square Footage" and on increasing "Grade", which is related to features such as custom cabinets and other customizations that make homes less "basic" and more designer/custom.


2. Location "Influenceable" Features: While the above "controllable" features allow sellers to make short-term efforts and realize immediate benefits, location-based features, such as the ranking of the school district in which a home resides or the number of attractions within close proximity to a home, are very difficult to impact directly. However, over the course of several years a homeowner can invest effort and money in improving school performance and/or support political decisions that bring more attractions to the area in which the home resides.

3. Setting "Better" More Informed Prices: The greater than 200% delta between the weight of "Assessor Appraisal Value" and the next biggest feature, "Above Square Footage", suggests that adding "Assessor Appraisal Value" to home price estimate models could be very instrumental in augmenting the processes real estate agents currently use to set prices. This could be especially true in situations where CMA data is either not available or not as applicable as desired.

Next Steps

Further analyses could yield additional insights to help King County Realty increase revenue from sales commissions:

  • Augment the Current Model: Look to improve prediction accuracy by studying the homes that showed the largest difference between predicted and actual sale prices. Additionally, interaction features could be explored, as many factors that would seem to be impactful were eliminated due to VIF.
  • Creating Additional Models: The model was optimized for the most frequent "common/mainstream" homes, as defined by price, square footage, number of bathrooms/rooms, lot size, and several other related features (prices between ~$300k and ~$900k, 1-4 bedrooms, lot size 4,000 – 12,000, other). Additional models could be created focused on homes <$200k, >$900k, Waterfront, large Lot Size, etc.
  • Deployment: Once we have optimized our models and/or generated enough models to account for the wide variety of homes present in King County, I would look to automate and deploy the models via a web-based interface and make them available to the larger public for impromptu consumption, and as a potential marketing tool to generate web traffic and brand awareness.
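The segmentation idea under "Creating Additional Models" can be sketched as a simple filter that separates the "mainstream" homes from those that would need their own models. The `Sale` class, its field names, and the sample rows are hypothetical; the bounds are the approximate ranges quoted above, with lot size assumed to be in square feet.

```python
from dataclasses import dataclass


@dataclass
class Sale:
    price: float     # sale price in dollars
    bedrooms: int
    sqft_lot: float  # lot size, assumed to be in square feet


def is_mainstream(sale: Sale) -> bool:
    """True when a sale falls inside the approximate range the model was tuned for."""
    return (
        300_000 <= sale.price <= 900_000
        and 1 <= sale.bedrooms <= 4
        and 4_000 <= sale.sqft_lot <= 12_000
    )


def split_segments(sales):
    """Split sales into (mainstream, other) lists."""
    mainstream = [s for s in sales if is_mainstream(s)]
    other = [s for s in sales if not is_mainstream(s)]
    return mainstream, other


# Hypothetical rows for illustration only.
sales = [
    Sale(price=450_000, bedrooms=3, sqft_lot=6_000),
    Sale(price=1_250_000, bedrooms=5, sqft_lot=20_000),
    Sale(price=180_000, bedrooms=2, sqft_lot=5_000),
]

mainstream, other = split_segments(sales)
print(len(mainstream), "mainstream;", len(other), "candidates for separate models")
```

Homes in the `other` list could then be grouped further (for example by price band or waterfront status) before fitting segment-specific models.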

For More Information

See the full analysis in the Jupyter Notebooks or review our Presentation.

For additional info, contact me here: Russell Pihlstrom
