# Household Location Choice Model (HLCM) for Single and Multi Family Housing  


Arezoo Besharati, Paul Waddell, UrbanSim, July 2018 

This notebook demonstrates the use of the LargeMultinomialLogit model template to construct, estimate, and evaluate a Household Location Choice Model for the San Francisco Bay Area.

In the process of developing the model, we also demonstrate some data checking and transformations to improve the model.

The model structure and specification are informed and limited by the available data, which is based on the data used by the Metropolitan Transportation Commission for their operational model.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preliminaries" data-toc-modified-id="Preliminaries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preliminaries</a></span><ul class="toc-item"><li><span><a href="#Load-data" data-toc-modified-id="Load-data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Load data</a></span></li></ul></li><li><span><a href="#Create-Additional-Household-and-Building-Variables" data-toc-modified-id="Create-Additional-Household-and-Building-Variables-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Create Additional Household and Building Variables</a></span></li><li><span><a href="#Model-Estimation" data-toc-modified-id="Model-Estimation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Model Estimation</a></span><ul class="toc-item"><li><span><a href="#Initialize-rent_sqft-by-Running-the-Hedonic-Regression" data-toc-modified-id="Initialize-rent_sqft-by-Running-the-Hedonic-Regression-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Initialize rent_sqft by Running the Hedonic Regression</a></span></li><li><span><a href="#Large-Choice-Set-Single-Family" data-toc-modified-id="Large-Choice-Set-Single-Family-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Large Choice Set Single-Family</a></span></li><li><span><a href="#Large-Choice-Set-Multi-Family" data-toc-modified-id="Large-Choice-Set-Multi-Family-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Large Choice Set Multi-Family</a></span></li></ul></li><li><span><a href="#Create-a-Chooser-Filter-and-Tag-Their-Buildings" data-toc-modified-id="Create-a-Chooser-Filter-and-Tag-Their-Buildings-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Create a Chooser Filter and Tag Their Buildings</a></span></li><li><span><a href="#Add-Flag-to-Buildings-Table-Identifying-Chosen-Buildings" data-toc-modified-id="Add-Flag-to-Buildings-Table-Identifying-Chosen-Buildings-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Add Flag to Buildings Table Identifying Chosen 
Buildings</a></span><ul class="toc-item"><li><span><a href="#Constrained-Choice-Set-Single_Family" data-toc-modified-id="Constrained-Choice-Set-Single_Family-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Constrained Choice Set Single_Family</a></span></li><li><span><a href="#Constrained-Choice-Set-Multi_Family" data-toc-modified-id="Constrained-Choice-Set-Multi_Family-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Constrained Choice Set Multi_Family</a></span></li></ul></li><li><span><a href="#Model-Prediction" data-toc-modified-id="Model-Prediction-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Model Prediction</a></span></li></ul></div>

## Preliminaries

In [1]:
import os; os.chdir('../../')
import numpy as np, pandas as pd 
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore')
%load_ext autoreload
%autoreload 2
from urbansim_templates import modelmanager as mm
from urbansim_templates.models import LargeMultinomialLogitStep
import orca
import seaborn as sns
%matplotlib notebook
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [2]:
mm.initialize()

Loading model step 'hlcm_constrained_sf'
Loading model step 'hedonic_price_sqft_single_family'
Loading model step 'hedonic_price_sqft_multi_family'
Loading model step 'hedonic_rent_sqft'


### Load data

First we import `datasources`, which contains the references to all the actual data sources. Then we import `models`, which contains the code to initialize networks and do network aggregations.

In [3]:
# Load any script-based Orca registrations
from scripts import datasources
from scripts import models

Orca uses 'lazy evaluation', which means it won't load data until we do something that requires it.

In [4]:
orca.list_tables()

['parcels',
 'buildings',
 'craigslist',
 'rentals',
 'nodessmall',
 'nodeswalk',
 'units',
 'households',
 'persons',
 'jobs']

Orca also uses 'broadcasts' to explicitly indicate the relationships used when merging tables.

In [5]:
orca.list_broadcasts()

[('parcels', 'buildings'),
 ('buildings', 'units'),
 ('units', 'households'),
 ('households', 'persons'),
 ('nodeswalk', 'rentals'),
 ('nodeswalk', 'parcels'),
 ('nodessmall', 'rentals'),
 ('nodessmall', 'parcels')]
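A broadcast pair such as `('parcels', 'buildings')` says that parcel attributes can be copied onto building rows. A minimal pandas sketch of the equivalent merge follows. It assumes that buildings carry a `parcel_id` column pointing at the parcels index, and the miniature tables are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature tables; the real column names may differ.
parcels = pd.DataFrame(
    {'acres': [0.25, 1.5]},
    index=pd.Index([100, 101], name='parcel_id'))
buildings = pd.DataFrame(
    {'parcel_id': [100, 101, 101], 'residential_units': [1, 12, 40]},
    index=pd.Index([10, 11, 12], name='building_id'))

# Broadcasting parcels onto buildings: each building row picks up the
# attributes of the parcel that its parcel_id points to.
merged = buildings.merge(parcels, left_on='parcel_id', right_index=True, how='left')
print(merged)
```

Because the relationship is many-to-one, the merged table keeps one row per building, and parcel values repeat for buildings that share a parcel.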

## Create Additional Household and Building Variables

In [16]:
# scale income and create race dummies
# (bracket assignment creates real columns; attribute assignment does not)
hh = orca.get_table('households').to_frame()
hh['income_k'] = hh.income / 1000
hh['white'] = (hh.race_of_head == 1).astype(int)
hh['black'] = (hh.race_of_head == 2).astype(int)
hh['asian'] = (hh.race_of_head == 6).astype(int)
hh['hisp'] = (hh.hispanic_head == 'yes').astype(int)
hh['single'] = (hh.persons == 1).astype(int)
hh['elderly'] = (hh.age_of_head > 65).astype(int)
hh['rich'] = (hh.income > 150000).astype(int)
hh['poor'] = (hh.income < 40000).astype(int)
hh['has_children'] = (hh.children > 0).astype(int)
  
# building_type dummies
bld = orca.get_table('buildings').to_frame()
bld['single_family'] = (bld.building_type_id == 1).astype(int)
bld['multi_family'] = (bld.building_type_id == 3).astype(int)
bld['mixed_use'] = (bld.building_type_id > 3).astype(int)
bld['two_four_stories'] = ((bld.stories > 1) & (bld.stories < 5)).astype(int)
# stories > 4 so that five-story buildings are included (> 5 matched only six stories)
bld['five_six_stories'] = ((bld.stories > 4) & (bld.stories < 7)).astype(int)
bld['sevenplus_stories'] = (bld.stories > 6).astype(int)
bld['yrblt_2000'] = (bld.year_built > 2000).astype(int)
bld['two_four_new'] = (bld.yrblt_2000 * bld.two_four_stories).astype(int)
bld['five_six_new'] = (bld.yrblt_2000 * bld.five_six_stories).astype(int)
bld['sevenplus_new'] = (bld.yrblt_2000 * bld.sevenplus_stories).astype(int)
bld['three_plus_stories'] = (bld.stories > 2).astype(int)

# add the columns

orca.add_column('households', 'income_k', hh.income_k)
orca.add_column('households', 'white', hh.white)
orca.add_column('households', 'black', hh.black)
orca.add_column('households', 'asian', hh.asian)
orca.add_column('households', 'hispanic', hh.hisp)
orca.add_column('households', 'elderly', hh.elderly)
orca.add_column('households', 'rich', hh.rich)
orca.add_column('households', 'poor', hh.poor)
orca.add_column('households', 'has_children', hh.has_children)
orca.add_column('households', 'single', hh.single)

orca.add_column('buildings', 'single_family', bld.single_family)
orca.add_column('buildings', 'multi_family', bld.multi_family)
orca.add_column('buildings', 'mixed_use', bld.mixed_use)
orca.add_column('buildings', 'two_four_stories', bld.two_four_stories)
orca.add_column('buildings', 'five_six_stories', bld.five_six_stories)
orca.add_column('buildings', 'sevenplus_stories', bld.sevenplus_stories)
orca.add_column('buildings', 'yrblt_2000', bld.yrblt_2000)
orca.add_column('buildings', 'two_four_new', bld.two_four_new)
orca.add_column('buildings', 'five_six_new', bld.five_six_new)
orca.add_column('buildings', 'sevenplus_new', bld.sevenplus_new)
orca.add_column('buildings', 'three_plus_stories', bld.three_plus_stories)

<orca.orca._SeriesWrapper at 0x116997978>

In [20]:
# Assign a random number to each household (with a fixed seed) to be able to select random samples
np.random.seed(12345)
hh['hh_random'] = np.random.uniform(0,1,len(hh))
orca.add_column('households', 'hh_random', hh.hh_random)

<orca.orca._SeriesWrapper at 0x116998cc0>
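The fixed seed matters because the filters used later (for example `hh_random < .5`) should pick the same households on every run. The sketch below shows that re-seeding reproduces the draws and that a threshold of 0.5 keeps roughly half of the rows. The table size of 1000 is a hypothetical stand-in for `len(hh)`.

```python
import numpy as np

n_households = 1000  # hypothetical table size, stands in for len(hh)

np.random.seed(12345)
first = np.random.uniform(0, 1, n_households)
np.random.seed(12345)
second = np.random.uniform(0, 1, n_households)

same_draws = np.array_equal(first, second)   # re-seeding reproduces the draws
share_selected = (first < 0.5).mean()        # fraction kept by a filter like hh_random < .5
print(same_draws, round(share_selected, 3))
```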

## Model Estimation

### Initialize rent_sqft by Running the Hedonic Regression

Note below the use of modelmanager to get the regression model step we registered earlier from YAML. We then update some attributes of the model to switch from estimating the model on rental listings to using the fitted model to predict rent_sqft on buildings.

In [9]:
mrent = mm.get_step('hedonic_rent_sqft')
mrent.tables = ['buildings', 'parcels', 'nodessmall', 'nodeswalk']
mrent.out_filters = ['residential_units > 0']
mrent.out_column = 'rent_sqft'
mrent.tables

['buildings', 'parcels', 'nodessmall', 'nodeswalk']

In [10]:
bld = orca.get_table('buildings').to_frame()
bld['rent_sqft'] = 0
orca.add_table('buildings', bld)

<orca.orca.DataFrameWrapper at 0x10ae638d0>

In [11]:
%%time
mrent.run()

CPU times: user 17.3 s, sys: 3.02 s, total: 20.3 s
Wall time: 12.1 s


### Large Choice Set Single-Family 

In [18]:
%%time
m1 = LargeMultinomialLogitStep()
m1.choosers = ['households']
m1.alternatives = ['buildings','parcels','nodeswalk','nodessmall']
m1.choice_column = 'building_id'
m1.alt_sample_size = 50

#Filters on choosers
m1.chooser_filters = ['building_type == 2 & recent_mover == 1 & 0 <income < 1000000']

#Filters on alternatives
m1.alt_filters = ['residential_units == 1',
                 '0 < avg_income_500_walk < 500000',
                 'sqft_per_unit > 0']


m1.model_expression = ' \
np.log1p(rent_sqft) + \
np.log(income):np.log1p(rent_sqft) + \
persons:np.log(res_sqft_per_unit) + \
np.log1p(acres) + \
pop_jobs_ratio_25000 + \
persons:avg_hhs_500_walk + \
rich:prop_rich_500_walk + \
poor:prop_poor_500_walk + \
single:prop_singles_500_walk + \
elderly:prop_elderly_500_walk + \
white:prop_white_500_walk + \
black:prop_black_500_walk + \
asian:prop_asian_500_walk + \
hispanic:prop_hisp_500_walk\
- 1'

m1.name = 'hlcm'
m1.tags = ['single_family', 'test']
m1.fit()

                  CHOICEMODELS ESTIMATION RESULTS                  
Dep. Var.:                chosen   No. Observations:         10,484
Model:         Multinomial Logit   Df Residuals:             10,470
Method:       Maximum Likelihood   Df Model:                     14
Date:                 2018-07-25   Pseudo R-squ.:             0.137
Time:                      14:32   Pseudo R-bar-squ.:         0.136
AIC:                  70,848.632   Log-Likelihood:      -35,410.316
BIC:                  70,950.238   LL-Null:             -41,013.649
                                        coef   std err         z     P>|z|   Conf. Int.
---------------------------------------------------------------------------------------
np.log1p(rent_sqft)                  -4.5043     0.209   -21.538     0.000             
np.log(income):np.log1p(rent_sqft)    0.3770     0.018    21.121     0.000             
persons:np.log(res_sqft_per_unit)     0.0009     0.003     0.337     0.736             
np.log1p(acres) 
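One way to read the first two coefficients above is to combine them: within this specification, the marginal utility of `np.log1p(rent_sqft)` is the main effect plus the interaction coefficient times `np.log(income)`. The sketch below only rearranges the printed coefficients, holding every other term fixed. The function name and the example incomes are chosen for illustration.

```python
import math

# Coefficients copied from the estimation results above.
b_rent = -4.5043          # np.log1p(rent_sqft)
b_income_rent = 0.3770    # np.log(income):np.log1p(rent_sqft)


def rent_sensitivity(income):
    """Marginal utility of log1p(rent_sqft) for a household with this income."""
    return b_rent + b_income_rent * math.log(income)


# Income at which the combined effect changes sign.
break_even_income = math.exp(-b_rent / b_income_rent)

for income in (40000, 100000, 150000):
    print(income, round(rent_sensitivity(income), 3))
print(round(break_even_income))
```

Under these two terms alone, sensitivity to rent is negative at typical incomes and shrinks toward zero as income rises. This is consistent with higher-income households being less deterred by price.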

### Large Choice Set Multi-Family 

In [21]:
%%time
m2 = LargeMultinomialLogitStep()
m2.choosers = ['households']
m2.alternatives = ['buildings','parcels','nodeswalk','nodessmall']
m2.choice_column = 'building_id'
m2.alt_sample_size = 50

#Filters on choosers
m2.chooser_filters = ['building_type > 2 &\
                      recent_mover == 1 &\
                      hh_random < .5 & \
                      persons < 8 & \
                      workers < 4 & \
                      0 <income < 500000']

#Filters on alternatives
m2.alt_filters = ['residential_units > 1',
                 '0 < avg_income_500_walk < 500000',
                 '0 < rent_sqft < 1000',
                 'pop_1500_walk > 0',
                 'res_price_per_sqft < 1500',
                 'res_sqft_per_unit < 6000',
                 'residential_units < 1000',
                 'sqft_per_unit > 0']


m2.model_expression = ' np.log(residential_units) + \
yrblt_2000:np.log(residential_units) + \
year_built + \
np.log1p(rent_sqft) + \
np.log(income):np.log1p(rent_sqft) + \
np.log1p(income):np.log1p(res_sqft_per_unit) + \
np.log1p(units_500_walk) + \
np.log1p(jobs_25000) + \
rich:prop_rich_500_walk + \
poor:prop_poor_500_walk + \
single:prop_singles_500_walk + \
elderly:prop_elderly_500_walk + \
white:prop_white_500_walk + \
black:prop_black_500_walk + \
asian:prop_asian_500_walk + \
hispanic:prop_hisp_500_walk\
- 1'

m2.name = 'hlcm'
m2.tags = ['multi_family','test']
m2.fit()

                  CHOICEMODELS ESTIMATION RESULTS                  
Dep. Var.:                chosen   No. Observations:         20,907
Model:         Multinomial Logit   Df Residuals:             20,891
Method:       Maximum Likelihood   Df Model:                     16
Date:                 2018-07-25   Pseudo R-squ.:             0.409
Time:                      14:36   Pseudo R-bar-squ.:         0.409
AIC:                  96,739.599   Log-Likelihood:      -48,353.799
BIC:                  96,866.764   LL-Null:             -81,788.665
                                                  coef   std err         z     P>|z|   Conf. Int.
-------------------------------------------------------------------------------------------------
np.log(residential_units)                       1.1660     0.006   180.839     0.000             
yrblt_2000:np.log(residential_units)           -0.0627     0.007    -9.003     0.000             
year_built                                      0.0071     0.000
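The summary statistics in the header can be checked by hand from the two log-likelihoods. The sketch below assumes the standard definitions (McFadden's pseudo R-squared, its parameter-adjusted variant, AIC, and BIC). These appear to reproduce the printed values up to rounding, but the exact formulas used by the library are an assumption here.

```python
import math

# Values copied from the multi-family estimation results above.
n_obs = 20907
n_params = 16
ll_model = -48353.799
ll_null = -81788.665

pseudo_r2 = 1 - ll_model / ll_null                      # McFadden pseudo R-squared
pseudo_r2_adj = 1 - (ll_model - n_params) / ll_null     # penalized for parameter count
aic = 2 * n_params - 2 * ll_model
bic = n_params * math.log(n_obs) - 2 * ll_model

print(round(pseudo_r2, 3), round(pseudo_r2_adj, 3))
print(round(aic, 3), round(bic, 3))
```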

## Create a Chooser Filter and Tag Their Buildings

Here we create a flag on the buildings table to identify the buildings in which our filtered choosers are located. We will use this flag later to restrict the universal choice set for model estimation to these buildings. The restriction serves two purposes. It approximates the choice set that these recent movers actually had available in the market when they were searching, and it approximates the way the simulation will work, with only vacant units used to create the universal choice set. Note that this restriction has not generally been used in practice, but our testing suggests that it provides more realistic results.

In [None]:
hh = orca.get_table('households').to_frame()
hh.columns

In [None]:
# Apply the filtering we will use when we estimate the single family model
hh_sf = hh[(hh['building_type'] == 2)  & (hh['recent_mover'] == 1) \
         & (hh['income'] > 0) & (hh['income'] < 1000000)]
len(hh_sf)

In [None]:
sf_tmp = pd.DataFrame(hh_sf.building_id.unique(), columns=['building_id'])
sf_tmp['sf_choice_set'] = 1
sf_tmp = sf_tmp.set_index('building_id')

In [None]:
# Apply the filtering we will use when we estimate the multi family model
hh_mf = hh[(hh['building_type'] > 2) & (hh['hh_random'] < .5) & (hh['recent_mover'] == 1)
           & (hh['persons'] < 8) & (hh['workers'] < 4)
           & (hh['income'] > 0) & (hh['income'] < 500000)]
len(hh_mf)

In [None]:
mf_tmp = pd.DataFrame(hh_mf.building_id.unique(), columns=['building_id'])
mf_tmp['mf_choice_set'] = 1
mf_tmp = mf_tmp.set_index('building_id')

## Add Flag to Buildings Table Identifying Chosen Buildings

In [None]:
bld = orca.get_table('buildings').to_frame()
# Left-join both flags on the building index; buildings with no filtered chooser get 0
bld = bld.merge(sf_tmp, how='left', left_index=True, right_index=True)
bld = bld.merge(mf_tmp, how='left', left_index=True, right_index=True)
bld['sf_choice_set'] = bld['sf_choice_set'].fillna(0)
bld['mf_choice_set'] = bld['mf_choice_set'].fillna(0)
orca.add_column('buildings', 'sf_choice_set', bld.sf_choice_set)
orca.add_column('buildings', 'mf_choice_set', bld.mf_choice_set)
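The effect of these flags on estimation can be sketched with plain Python: alternatives are sampled only from the flagged buildings rather than from every building. The sketch assumes that the chosen building counts toward the sample size, and `sample_choice_set` and the building ids are hypothetical.

```python
import random


def sample_choice_set(chosen, universe, sample_size, rng):
    """Return the chosen alternative plus sample_size - 1 others, without replacement."""
    others = [alt for alt in universe if alt != chosen]
    return [chosen] + rng.sample(others, sample_size - 1)


all_buildings = list(range(1, 1001))           # hypothetical building ids
flagged_buildings = list(range(1, 1001, 10))   # hypothetical: buildings whose flag == 1
print(len(all_buildings), len(flagged_buildings))

rng = random.Random(12345)
chosen_building = 11                           # the chooser's own building, itself flagged
choice_set = sample_choice_set(chosen_building, flagged_buildings, 5, rng)
print(choice_set)
```

In the models below the same restriction is expressed declaratively, by adding `sf_choice_set == 1` or `mf_choice_set == 1` to `alt_filters`.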

### Constrained Choice Set Single_Family

In [None]:
%%time
m3 = LargeMultinomialLogitStep()
m3.choosers = ['households']
m3.alternatives = ['buildings','parcels','nodeswalk','nodessmall']
m3.choice_column = 'building_id'
m3.alt_sample_size = 50

#Filters on choosers
m3.chooser_filters = ['building_type == 2 & recent_mover == 1 & 0 <income < 1000000']

m3.alt_filters = ['sf_choice_set == 1 & sqft_per_unit > 0']

# np.log(residential_units) +
# np.log(income):np.log(avg_income_500_walk) + \
# np.log1p(income):np.log1p(rich_1500_walk) + \
# np.log1p(income):np.log1p(poor_1500_walk) + \
# np.log1p(persons):np.log1p(sqft_per_unit) + \

#np.log1p(res_price_per_sqft) + \
#np.log1p(income):np.log1p(res_price_per_sqft) + \


m3.model_expression = ' \
np.log1p(rent_sqft) + \
np.log(income):np.log1p(rent_sqft) + \
np.log1p(income):np.log(res_sqft_per_unit) + \
np.log1p(income):np.log(acres) + \
pop_jobs_ratio_25000 + \
np.log(jobs_25000+1) + \
persons:avg_hhs_500_walk + \
rich:prop_rich_500_walk + \
poor:prop_poor_500_walk + \
single:prop_singles_500_walk + \
elderly:prop_elderly_500_walk + \
white:prop_white_500_walk + \
black:prop_black_500_walk + \
asian:prop_asian_500_walk + \
hispanic:prop_hisp_500_walk\
- 1'

m3.name = 'hlcm_constrained_sf'
m3.tags = ['single_family', 'constrained']
m3.fit()

In [None]:
# register the model
m3.register()

In [None]:
len(m3._get_df(tables=m3.choosers, filters=m3.chooser_filters))

### Constrained Choice Set Multi_Family

In [None]:
%%time
m4 = LargeMultinomialLogitStep()
m4.choosers = ['households']
m4.alternatives = ['buildings','parcels','nodeswalk','nodessmall']
m4.choice_column = 'building_id'
m4.alt_sample_size = 50

#Filters on choosers
m4.chooser_filters = ['building_type > 2 & \
                      hh_random < .5 & \
                      recent_mover == 1 & \
                      persons < 8 & \
                      workers < 4 & \
                      0 <income < 500000']

#Filters on alternatives
m4.alt_filters = ['residential_units > 1',
                 '0 < avg_income_500_walk < 500000',
                 '0 < rent_sqft < 1000',
                 'res_sqft_per_unit < 6000',
                  'sqft_per_unit > 0',
                  'residential_units < 1000',
                 'mf_choice_set == 1']

m4.model_expression = ' np.log(residential_units) + \
yrblt_2000:np.log(residential_units) + \
year_built + \
np.log1p(rent_sqft) + \
np.log(income):np.log1p(rent_sqft) + \
np.log1p(income):np.log1p(res_sqft_per_unit) + \
np.log1p(units_500_walk) + \
np.log1p(jobs_25000) + \
rich:prop_rich_500_walk + \
poor:prop_poor_500_walk + \
single:prop_singles_500_walk + \
elderly:prop_elderly_500_walk + \
white:prop_white_500_walk + \
black:prop_black_500_walk + \
asian:prop_asian_500_walk + \
hispanic:prop_hisp_500_walk\
- 1'

m4.name = 'hlcm_constrained_mf'
m4.tags = ['multi_family','constrained', 'hlcm']
m4.fit()

In [None]:
# register the model
m4.register()

In [None]:
tmp = m4.mergedchoicetable.to_frame()

In [None]:
tmp.head()

In [None]:
tmp.shape

In [None]:
tmp.shape[0]/50

In [None]:
len(m4._get_df(tables=m4.choosers, filters=m4.chooser_filters))

In [None]:
chosen = tmp[tmp['chosen']==1]

In [None]:
tmp_d = chosen.describe().transpose()

In [None]:
# number of choosers/agents/households/observations
len(m4._get_df(tables=m4.choosers, filters=m4.chooser_filters))

In [None]:
m4.fitted_parameters
# or, once the step is registered:
# mm.get_step('hlcm_constrained_mf').fitted_parameters

## Model Prediction

In [None]:
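# Predict with the constrained multi-family model
m_mf = m4

m_mf.out_chooser_filters = ['building_type > 2 &\
                            hh_random < .2 &\
                            recent_mover == 1 &\
                            0 < income < 500000']

# Match the multi-family estimation filter on residential_units
m_mf.out_alt_filters = ['residential_units > 1',
                        '0 < avg_income_500_walk < 500000',
                        'sqft_per_unit > 0']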

In [None]:
%%time
m_mf.run()

In [None]:
print(m_mf.probabilities.shape)
m_mf.probabilities.head()

In [None]:
### number of observations/choosers
print(len(m_mf.probabilities.observation_id.unique()))
### or 
#len(m_mf.choices)

### number of unique alternatives
print(len(m_mf.probabilities.building_id.unique()))

### number of alternatives
print(len(m_mf.probabilities.building_id))

In [None]:
# summed probability 

predict_df=m_mf.probabilities.groupby('building_id')['probability'].sum().to_frame()
predict_df.head()

In [None]:
plt.hist(predict_df['probability'],bins= 100);


In [None]:
# Check that choices are plausible
choices = pd.DataFrame(m_mf.choices)
df = pd.merge(m_mf.probabilities, choices, left_on='observation_id', right_index=True)
df['chosen'] = 0
df.loc[df.building_id == df.choice, 'chosen'] = 1
print(df.head())

In [None]:
print(np.corrcoef(df.probability, df.chosen))

In [None]:
### join predicted df and df 
#hh_f = hh[(hh['building_type'] > 2) & (hh['hh_random'] < .2) & (hh['recent_mover'] == 1)\
#        & (hh['income'] > 0) & (hh['income'] < 500000)]
             
#df = orca.merge_tables(target = 'buildings', tables = ['buildings','parcels','nodeswalk','nodessmall'])
  
#hh_f_data = hh_f.merge(df, left_on='building_id', right_index=True)
#hh_f_data.columns.tolist()

#predict= pd.merge(predict_df,hh_f_data, left_index=True,right_on='building_id',how='left', sort=False)
#predict[['probability','building_id']].head()

#predict_2= pd.merge(predict_df,df, left_index=True,right_index=True,how='left', sort=False)
#predict_2.head()