# Urbansim template - experiment

The objective of this experiment is to pinpoint possible issues with the urbansim template, especially the Small Multinomial Logit step:
https://urbansim-templates.readthedocs.io/en/latest/model-steps.html#small-multinomial-logit

### Summary

I created a table with 1000 observations. For each observation, I randomly assign some variables (age and income) and a chosen alternative out of a set of four alternatives. I then estimate a multinomial logit model with the same specification in three different Python libraries: 1) Urbansim templates, 2) Statsmodels MNLogit, and 3) Pylogit. Statsmodels and Pylogit give the same results. The results from the Urbansim template differ significantly.

In [29]:
# Importing libraries
import os
import random
from collections import OrderedDict

import numpy as np
import orca
import pandas as pd

os.chdir('../')  # work from the parent directory

from urbansim_templates import modelmanager as mm
from urbansim_templates.models import SmallMultinomialLogitStep

from statsmodels.discrete.discrete_model import MNLogit


import pylogit as pl

## Create table 

This is a randomly generated table of data that is used as the input for every estimation below.

In [30]:
X = pd.DataFrame({'intercept': 1,
                  'age': [random.randint(1,20) for x in range(1000)] })

x_1 = pd.concat([X, pd.get_dummies([random.randint(1,5) for x in range(1000)], prefix='income', prefix_sep='_')], axis=1)
# x_1.drop('income_5', axis = 1, inplace = True)
x_1['y'] = [random.randint(1,4) for x in range(1000)]

x_1.head()

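|   | intercept | age | income_1 | income_2 | income_3 | income_4 | income_5 | y |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9 | 0 | 1 | 0 | 0 | 0 | 3 |
| 1 | 1 | 12 | 0 | 1 | 0 | 0 | 0 | 4 |
| 2 | 1 | 10 | 1 | 0 | 0 | 0 | 0 | 2 |
| 3 | 1 | 11 | 0 | 0 | 1 | 0 | 0 | 3 |
| 4 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2 |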
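
The table is drawn with `random` and no seed, so re-running the notebook produces a different table each time, and outputs from different sessions are not directly comparable. A minimal sketch of a seeded version follows. The helper name `make_table` and the seed value 42 are assumptions for illustration and are not part of the original notebook.

```python
import random

import pandas as pd


def make_table(n_obs=1000, seed=42):
    """Rebuild the random table with a fixed seed (hypothetical helper)."""
    rng = random.Random(seed)  # local generator, leaves the global state alone
    df = pd.DataFrame({
        'intercept': 1,
        'age': [rng.randint(1, 20) for _ in range(n_obs)],
    })
    # One dummy column per income category, as in the cell above
    income = pd.get_dummies(
        [rng.randint(1, 5) for _ in range(n_obs)],
        prefix='income', prefix_sep='_', dtype=int,
    )
    df = pd.concat([df, income], axis=1)
    df['y'] = [rng.randint(1, 4) for _ in range(n_obs)]
    return df


table_a = make_table()
table_b = make_table()
print(table_a.equals(table_b))  # both calls use the same seed
```

With a fixed seed, all three libraries can be re-run on exactly the same rows, even across kernel restarts.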


In [31]:
#Checking alternatives
x_1.y.unique()

array([3, 4, 2, 1])

## Urbansim Template

In [14]:
# Adding the table to orca
orca.add_table('school_trip', x_1)

<orca.orca.DataFrameWrapper at 0x7ff2480bceb8>

In [15]:
# Creating the model specification
example_specification = OrderedDict()
example_names = OrderedDict()

example_specification["intercept"] = [2, 3, 4]
example_names["intercept"] = ['ASC 2', 'ASC 3', 'ASC 4' ]

example_specification["age"] = [2, 3, 4]
example_names["age"] = ['age 2', 'age 3', 'age 4' ]

example_specification["income_1"] = [2, 3, 4]
example_names["income_1"] = ['income_1_2', 'income_1_3', 'income_1_4' ]

example_specification["income_2"] = [2, 3, 4]
example_names["income_2"] = ['income_2_2', 'income_2_3', 'income_2_4' ]

example_specification["income_3"] = [2, 3, 4]
example_names["income_3"] = ['income_3_2', 'income_3_3', 'income_3_4' ]

example_specification["income_4"] = [2, 3, 4]
example_names["income_4"] = ['income_4_2', 'income_4_3', 'income_4_4' ]
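
One way to read this specification, assuming the standard multinomial logit form and that alternative 1 acts as the base because it receives no coefficients (the symbols $V_{ij}$, $\beta$ and $d_{ik}$ are notation introduced here, with $d_{ik}$ standing for the `income_k` dummy of observation $i$):

$$
V_{i1} = 0, \qquad
V_{ij} = \beta_{0j} + \beta_{\text{age},j}\,\text{age}_i + \sum_{k=1}^{4} \beta_{kj}\, d_{ik}, \quad j \in \{2, 3, 4\}
$$

$$
P(y_i = j) = \frac{\exp(V_{ij})}{\sum_{m=1}^{4} \exp(V_{im})}
$$

Under this reading there are 6 variables times 3 non-base alternatives, i.e. 18 coefficients, which matches the `np.zeros(18)` starting vector passed to Pylogit further down.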

In [21]:
#Model template
m = SmallMultinomialLogitStep()
m.name = 'STOD_choice'
m.tables = ['school_trip']
m.choice_column = 'y'
m.model_expression = example_specification
m.fit()

Log-likelihood at zero: -837.1426
Initial Log-likelihood: -837.1426
Estimation Time for Point Estimation: 0.04 seconds.
Final log-likelihood: -837.1426




                     Multinomial Logit Model Regression Results                    
Dep. Variable:                     _chosen   No. Observations:                1,000
Model:             Multinomial Logit Model   Df Residuals:                      982
Method:                                MLE   Df Model:                           18
Date:                     Tue, 27 Aug 2019   Pseudo R-squ.:                   0.000
Time:                             19:16:18   Pseudo R-bar-squ.:              -0.022
AIC:                             1,710.285   Log-Likelihood:               -837.143
BIC:                             1,798.625   LL-Null:                      -837.143
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
intercept_2          0   5.93e+06          0      1.000   -1.16e+07    1.16e+07
intercept_3          0   5.93e+06          0      1.000   -1.16e+07    1.16e+07
intercep

  self._store_inferential_results(np.sqrt(np.diag(self.cov)),
  return (self.a < x) & (x < self.b)
  return (self.a < x) & (x < self.b)
  cond2 = cond0 & (x <= self.a)
  self._store_inferential_results(np.sqrt(np.diag(self.robust_cov)),
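
As a sanity check on the "Log-likelihood at zero" line, the expected value can be computed by hand. This is a sketch that assumes all four alternatives are available to every one of the 1000 observations, so that with all coefficients at zero each alternative has probability 1/4.

```python
import math

n_obs = 1000          # rows in the table above
n_alternatives = 4    # y takes the values 1, 2, 3, 4

# With all coefficients at zero every alternative has probability 1/J,
# so each observation contributes ln(1/J) to the log-likelihood.
ll_at_zero = n_obs * math.log(1.0 / n_alternatives)
print(round(ll_at_zero, 4))  # -1386.2944
```

The urbansim template output above reports -837.1426 instead, which may be a useful starting point when looking for the source of the discrepancy.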


## Statsmodels MNLogit

In [23]:
# Fit the same specification; income_5 is dropped as the reference income category
MNLogit(x_1.y, x_1.drop(['y', 'income_5'], axis=1)).fit().summary()

Optimization terminated successfully.
         Current function value: 1.377704
         Iterations 4


                          MNLogit Regression Results                          
Dep. Variable:                      y   No. Observations:                 1000
Model:                        MNLogit   Df Residuals:                      982
Method:                           MLE   Df Model:                           15
Date:                Tue, 27 Aug 2019   Pseudo R-squ.:                0.004099
Time:                        19:17:24   Log-Likelihood:                -1377.7
converged:                       True   LL-Null:                       -1383.4
                                        LLR p-value:                    0.7281
       y=2       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept     -0.4860      0.276     -1.763      0.078      -1.026       0.054
age            0.0289      0.017      1.7

## Pylogit

Pylogit estimation requires the data to be in long format: each row is an observation-alternative pair.

In [32]:
# Getting the table in long format: one block of four rows per observation
dfs = []
for x in range(len(x_1)):
    df = pd.DataFrame({'obs_id': x + 1,
                       'alt_id':[1, 2, 3, 4],
                       'age': x_1.loc[x,'age'],
                       'income_1':x_1.loc[x,'income_1'],
                       'income_2':x_1.loc[x,'income_2'],
                       'income_3':x_1.loc[x,'income_3'],
                       'income_4':x_1.loc[x,'income_4'],
                       'y':x_1.loc[x,'y'],})
    dfs.append(df)

x_3 = pd.concat(dfs)
x_3['chosen'] = (x_3.alt_id == x_3.y).astype(int)
x_3.head()

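|   | obs_id | alt_id | age | income_1 | income_2 | income_3 | income_4 | y | chosen |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 9 | 0 | 1 | 0 | 0 | 3 | 0 |
| 1 | 1 | 2 | 9 | 0 | 1 | 0 | 0 | 3 | 0 |
| 2 | 1 | 3 | 9 | 0 | 1 | 0 | 0 | 3 | 1 |
| 3 | 1 | 4 | 9 | 0 | 1 | 0 | 0 | 3 | 0 |
| 0 | 2 | 1 | 12 | 0 | 1 | 0 | 0 | 4 | 0 |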
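
The same wide-to-long conversion can also be done without a Python loop. The sketch below uses a small stand-in table (`wide`, with illustrative values rather than the notebook's data) and repeats each observation once per alternative. The variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Small stand-in for x_1 (values are illustrative, not the notebook's data)
wide = pd.DataFrame({
    'age': [9, 12, 10],
    'income_1': [0, 0, 1],
    'income_2': [1, 1, 0],
    'y': [3, 4, 2],
})
alternatives = [1, 2, 3, 4]
n_alt = len(alternatives)

# Repeat every observation once per alternative
long = wide.loc[wide.index.repeat(n_alt)].reset_index(drop=True)
long.insert(0, 'obs_id', np.repeat(np.arange(1, len(wide) + 1), n_alt))
long.insert(1, 'alt_id', np.tile(alternatives, len(wide)))

# Flag the row whose alternative matches the observed choice
long['chosen'] = (long['alt_id'] == long['y']).astype(int)
print(long.head(4))
```

Unlike the loop above, this version resets the index, so every row of the long table gets a unique index value.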


In [33]:
example_mnl = pl.create_choice_model(data=x_3,
                                     alt_id_col='alt_id',
                                     obs_id_col='obs_id',
                                     choice_col='chosen',
                                     specification=example_specification,
                                     model_type="MNL",
                                     names=example_names)

example_mnl.fit_mle(np.zeros(18))
example_mnl.get_statsmodels_summary()

Log-likelihood at zero: -1,386.2944
Initial Log-likelihood: -1,386.2944
Estimation Time for Point Estimation: 0.05 seconds.
Final log-likelihood: -1,381.6223




|   |   |   |   |
|---|---|---|---|
| Dep. Variable: | chosen | No. Observations: | 1000.0 |
| Model: | Multinomial Logit Model | Df Residuals: | 982.0 |
| Method: | MLE | Df Model: | 18.0 |
| Date: | Tue, 27 Aug 2019 | Pseudo R-squ.: | 0.003 |
| Time: | 19:39:13 | Pseudo R-bar-squ.: | -0.01 |
| AIC: | 2799.245 | Log-Likelihood: | -1381.622 |
| BIC: | 2887.584 | LL-Null: | -1386.294 |

|   | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ASC 2 | 0.1629 | 0.276 | 0.589 | 0.556 | -0.379 | 0.705 |
| ASC 3 | 0.3574 | 0.272 | 1.312 | 0.190 | -0.177 | 0.891 |
| ASC 4 | 0.0364 | 0.280 | 0.130 | 0.897 | -0.513 | 0.585 |
| age 2 | -0.0005 | 0.016 | -0.030 | 0.976 | -0.032 | 0.031 |
| age 3 | -0.0109 | 0.016 | -0.685 | 0.493 | -0.042 | 0.020 |
| age 4 | -0.0011 | 0.016 | -0.071 | 0.944 | -0.032 | 0.029 |
| income_1_2 | -0.2160 | 0.290 | -0.745 | 0.456 | -0.784 | 0.352 |
| income_1_3 | -0.2591 | 0.286 | -0.906 | 0.365 | -0.820 | 0.301 |
| income_1_4 | 0.1325 | 0.288 | 0.459 | 0.646 | -0.433 | 0.698 |
