Predict 'Days Active on RDC' using Survival Analysis

Purpose

Given a new property listing on RDC, we want to predict how long it will be active on the market (i.e. days until it becomes pending).

Some Assumptions / Hypotheses

Homes with more pageviews (especially in the first week) sell more quickly.
Homes with more pictures sell more quickly (but beyond some point, more pictures are correlated with higher-priced homes, which slows down the sale).
In most cities, high-end and low-end homes sell the slowest; mid-tier homes sell the fastest.
The listing price z-score (measured against the median of the area) should be an important property-characteristic-based predictor for estimating days on market.
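The price z-score in the last hypothesis could be computed along these lines. This is only a sketch under assumptions: the function name `price_zscore`, the choice to centre on the area median and scale by the area's population standard deviation, and the sample prices are all illustrative and not taken from the notebooks.

```python
from statistics import median, pstdev

def price_zscore(price, area_prices):
    """Distance of a listing price from the area median, in standard deviations.

    Assumption: centre on the median, scale by the population standard deviation.
    """
    center = median(area_prices)
    spread = pstdev(area_prices)
    if spread == 0:
        return 0.0  # all area prices identical, so no meaningful spread
    return (price - center) / spread

# Hypothetical area prices, for illustration only
area_prices = [400_000, 500_000, 600_000]
z_mid = price_zscore(500_000, area_prices)
z_high = price_zscore(600_000, area_prices)
z_low = price_zscore(400_000, area_prices)
```

A listing priced at the area median gets a score of 0; listings priced above or below it get positive or negative scores of matching size.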

Predicting the number of days until a new listing becomes pending is a survival-analysis problem. Instead of patients who survive, we have homes that have not yet sold. And instead of patients who are still alive when the medical study ends (“right-censored” observations), we have homes that have not sold by the end of the analysis timeframe.
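To make the censoring idea concrete, here is a minimal sketch of how a listing might be turned into a (duration, event) pair. The function name `to_survival_record` and all dates are hypothetical; the actual column names and analysis window in the data files may differ.

```python
from datetime import date

def to_survival_record(list_date, pending_date, study_end):
    """Return (duration_days, event_observed) for one listing.

    event_observed is 1 if the listing went pending within the window,
    0 if it was still active at study_end (right-censored).
    """
    if pending_date is not None and pending_date <= study_end:
        return (pending_date - list_date).days, 1
    return (study_end - list_date).days, 0

# Hypothetical analysis window end and listings
study_end = date(2016, 11, 1)
sold = to_survival_record(date(2016, 10, 1), date(2016, 10, 15), study_end)
still_active = to_survival_record(date(2016, 10, 20), None, study_end)
```

The censored listing still carries information: it was on the market for at least that many days, which is exactly what survival methods are built to use.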

"Two Models" Approach

A model for listings with little or no engagement history (brand new listings).
A model for listings with engagement history.

The reason for a two-pronged approach is that, for listings with some engagement history, pageviews (especially pageviews during the first week of the listing) would most likely be the dominant predictor of the number of days a listing remains active on RDC. For brand new listings with no engagement history at all, we can only predict from property attributes. It therefore makes sense to estimate days on market with a separate model for each case.

Survival Analysis:

Use the Kaplan-Meier survival model for data exploration/visualization and for discovering important features for modeling. For example, below is a Kaplan-Meier survival curve of listings in the Bay Area:

Below is a K-M curve for homes with >10 photos in the listing versus homes with <10 photos in the listing:
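For reference, the Kaplan-Meier estimate behind curves like these can be sketched in a few lines of plain Python. This is an illustrative implementation on made-up data, not the code from the notebooks; the name `kaplan_meier` and the toy durations are assumptions.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time with at least one observed event."""
    sold_at = Counter(t for t, e in zip(durations, events) if e)
    leaving_at = Counter(durations)  # sold and censored listings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving_at):
        d = sold_at.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk listings that did not sell at t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving_at[t]
    return curve

# Hypothetical toy data: days on market, and 1 = went pending, 0 = censored
durations = [5, 8, 8, 12, 20, 20]
events = [1, 1, 0, 1, 0, 0]
curve = kaplan_meier(durations, events)
```

Comparing two groups, such as listings with more or fewer than 10 photos, amounts to calling the function once per group and plotting both step curves.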

Cox regression for prediction: To build multivariate models (i.e. using property attributes or pageviews) that predict the survival of new listings, I used Cox proportional-hazards regression.
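As a rough illustration of what a Cox model optimizes, the sketch below evaluates the negative log partial likelihood for a given coefficient vector, using the Breslow convention for ties. It does not reproduce the fitting code used in this repository; in practice a survival library would do the fitting, and the function name, covariate, and data here are hypothetical.

```python
import math

def cox_neg_log_partial_likelihood(beta, X, durations, events):
    """Negative log partial likelihood of a Cox model (Breslow handling of ties)."""
    scores = [sum(b * x for b, x in zip(beta, row)) for row in X]
    nll = 0.0
    for i, (t_i, e_i) in enumerate(zip(durations, events)):
        if not e_i:
            continue  # censored listings contribute only through the risk sets
        # Risk set: every listing still active just before time t_i
        risk_sum = sum(math.exp(s) for s, t in zip(scores, durations) if t >= t_i)
        nll -= scores[i] - math.log(risk_sum)
    return nll

# Hypothetical data: one covariate (e.g. scaled first-week pageviews)
X = [[1.0], [0.5], [0.0]]
durations = [5, 8, 12]
events = [1, 1, 0]
nll_zero = cox_neg_log_partial_likelihood([0.0], X, durations, events)
nll_one = cox_neg_log_partial_likelihood([1.0], X, durations, events)
```

In this toy data the listings with the larger covariate value go pending sooner, so a positive coefficient fits better (lower value) than a coefficient of zero.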
