The purpose of this notebook is two-fold. In it, I aim to:
1. Reproduce the MNL model used in "Brownstone, David and Train, Kenneth (1999). 'Forecasting new product penetration with flexible substitution patterns'. Journal of Econometrics 89: 109-129." (p. 121).
2. 'Check' the MNL model for lack-of-fit between observable features of the data and predictions from the model.

In [1]:
import sys
from collections import OrderedDict

import scipy.stats
import pandas as pd
import numpy as np
import pylogit as pl

sys.path.insert(0, '../src/')
from visualization import predictive_viz as viz

%matplotlib inline



# Load the car data

In [2]:
car_df = pd.read_csv("../data/interim/car_long_format.csv")

# Create the necessary variables

In [3]:
car_df.head().T

Unnamed: 0,0,1,2,3,4
obs_id,1,1,1,1,1
alt_id,1,2,3,4,5
choice,1,0,0,0,0
college,0,0,0,0,0
hsg2,0,0,0,0,0
coml5,0,0,0,0,0
vehicle_size,3,3,2,2,3
acceleration,4,4,6,6,2.5
price_over_log_income,4.17534,4.17534,4.81771,4.81771,5.13889
top_speed,95,95,110,110,140


In [4]:
# Create the 'big_enough' variable
car_df['big_enough'] =\
    (car_df['hsg2'] & (car_df['vehicle_size'] == 3)).astype(int)

# Determine the type of car
car_df['sports_utility_vehicle'] =\
    (car_df['body_type'] == 'sportuv').astype(int)

car_df['sports_car'] =\
    (car_df['body_type'] == 'sportcar').astype(int)
    
car_df['station_wagon'] =\
    (car_df['body_type'] == 'stwagon').astype(int)

car_df['truck'] =\
    (car_df['body_type'] == 'truck').astype(int)

car_df['van'] =\
    (car_df['body_type'] == 'van').astype(int)

# Determine the car's fuel usage
car_df['electric'] =\
    (car_df['fuel_type'] == 'electric').astype(int)

car_df['compressed_natural_gas'] =\
    (car_df['fuel_type'] == 'cng').astype(int)

car_df['methanol'] =\
    (car_df['fuel_type'] == 'methanol').astype(int)

# Determine if this is an electric vehicle with a small commute
car_df['electric_commute_lte_5mi'] =\
    (car_df['electric'] & car_df['coml5']).astype(int)

# See if this is an electric vehicle for a college educated person
car_df['electric_and_college'] =\
    (car_df['electric'] & car_df['college']).astype(int)

# See if this is a methanol vehicle for a college educated person
car_df['methanol_and_college'] =\
    (car_df['methanol'] & car_df['college']).astype(int)
    
# Scale the range, acceleration, top speed, vehicle size, and operating cost variables
car_df['range_over_100'] = car_df['range'] / 100.0
car_df['acceleration_over_10'] = car_df['acceleration'] / 10.0
car_df['top_speed_over_100'] = car_df['top_speed'] / 100.0
car_df['vehicle_size_over_10'] = car_df['vehicle_size'] / 10.0
car_df['tens_of_cents_per_mile'] = car_df['cents_per_mile'] / 10.0

In [5]:
car_df.loc[car_df.choice == 1, 'fuel_type'].value_counts()

electric    1491
gasoline    1310
cng         1062
methanol     791
Name: fuel_type, dtype: int64

# Create the utility specification

In [6]:
car_mnl_spec, car_mnl_names = OrderedDict(), OrderedDict()

cols_and_display_names =\
    [('price_over_log_income', 'Price over log(income)'),
     ('range_over_100', 'Range (units: 100mi)'),
     ('acceleration_over_10', 'Acceleration (units: 0.1sec)'),
     ('top_speed_over_100', 'Top speed (units: 0.01mph)'),
     ('pollution', 'Pollution'),
     ('vehicle_size_over_10', 'Size'),
     ('big_enough', 'Big enough'),
     ('luggage_space', 'Luggage space'),
     ('tens_of_cents_per_mile', 'Operation cost'),
     ('station_availability', 'Station availability'),
     ('sports_utility_vehicle', 'Sports utility vehicle'),
     ('sports_car', 'Sports car'),
     ('station_wagon', 'Station wagon'),
     ('truck', 'Truck'),
     ('van', 'Van'),
     ('electric', 'EV'),
     ('electric_commute_lte_5mi', 'Commute < 5 & EV'),
     ('electric_and_college', 'College & EV'),
     ('compressed_natural_gas', 'CNG'),
     ('methanol', 'Methanol'),
     ('methanol_and_college', 'College & Methanol')]
    
for col, display_name in cols_and_display_names:
    car_mnl_spec[col] = 'all_same'
    car_mnl_names[col] = display_name


# Estimate the MNL model

In [7]:
# Initialize the mnl model
car_mnl = pl.create_choice_model(data=car_df,
                                 alt_id_col='alt_id',
                                 obs_id_col='obs_id',
                                 choice_col='choice',
                                 specification=car_mnl_spec,
                                 model_type='MNL',
                                 names=car_mnl_names)

# Create the initial parameter values for model estimation (one zero per coefficient)
num_vars = len(car_mnl_names)
initial_vals = np.zeros(num_vars)

# Estimate the mnl model
car_mnl.fit_mle(initial_vals, method='BFGS')

# Look at the estimation results
car_mnl.get_statsmodels_summary()

Log-likelihood at zero: -8,338.8486
Initial Log-likelihood: -8,338.8486
Estimation Time: 0.15 seconds.
Final log-likelihood: -7,394.6247




Dep. Variable:,choice,No. Observations:,4654.0
Model:,Multinomial Logit Model,Df Residuals:,4633.0
Method:,MLE,Df Model:,21.0
Date:,"Sat, 09 Jun 2018",Pseudo R-squ.:,0.113
Time:,16:07:44,Pseudo R-bar-squ.:,0.111
converged:,False,Log-Likelihood:,-7394.625
,,LL-Null:,-8338.849

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Price over log(income),-0.1855,0.027,-6.801,0.000,-0.239 -0.132
Range (units: 100mi),0.3503,0.027,13.060,0.000,0.298 0.403
Acceleration (units: 0.1sec),-0.7187,0.111,-6.489,0.000,-0.936 -0.502
Top speed (units: 0.01mph),0.2626,0.081,3.245,0.001,0.104 0.421
Pollution,-0.4441,0.102,-4.366,0.000,-0.644 -0.245
Size,0.9307,0.317,2.937,0.003,0.310 1.552
Big enough,0.1397,0.077,1.809,0.070,-0.012 0.291
Luggage space,0.4916,0.191,2.575,0.010,0.117 0.866
Operation cost,-0.7663,0.076,-10.111,0.000,-0.915 -0.618
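
As a sanity check on the estimation output above, the log-likelihood at zero and both pseudo R-squared values can be recomputed by hand. The sketch below assumes that every respondent faced six alternatives (an assumption: the data preview only shows the first five rows for the first respondent) and uses the standard McFadden formulas. Whether pylogit computes these quantities in exactly this way is also an assumption, and the variable names are illustrative.

```python
import math

# Values copied from the summary output above
n_obs = 4654            # "No. Observations"
n_params = 21           # "Df Model"
final_ll = -7394.6247   # "Final log-likelihood"

# ASSUMPTION: each choice set contains six alternatives
n_alts = 6

# With all coefficients at zero, every alternative has probability 1 / n_alts
null_ll = n_obs * math.log(1.0 / n_alts)

# McFadden's pseudo R-squared and its parameter-adjusted version
rho_sq = 1.0 - final_ll / null_ll
rho_bar_sq = 1.0 - (final_ll - n_params) / null_ll

print(round(null_ll, 4), round(rho_sq, 3), round(rho_bar_sq, 3))
```

Under these assumptions, the three values agree with the reported -8,338.8486, 0.113, and 0.111.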


# Replication Results

The original modeling results cannot be replicated. Using the same model specification as the original authors, my coefficient estimates differ from those reported in the original study.

The major differences seem to be with the various fuel type variables and their interactions. I am not sure why.

Plugging the paper's coefficient estimates into my model does not resolve the discrepancy either: as computed below, they yield a log-likelihood of -7,458.09 on this data, which is lower than the -7,394.62 obtained from my own estimates.

My suspicion is that my variables are not defined the same way as in the paper.
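
To see where the two sets of coefficients agree, the estimates can be placed next to the paper's values. The sketch below hard-codes only the nine coefficients visible in the summary output above and pairs them with the first nine entries of `paper_vals` from cell In [8] below, on the assumption that the paper's values are listed in the same order as the specification. The variable names are illustrative.

```python
# (display name, my estimate from the summary table, paper value from In [8])
# ASSUMPTION: paper values follow the same order as the model specification
rows = [
    ("Price over log(income)", -0.1855, -0.185),
    ("Range (units: 100mi)", 0.3503, 0.350),
    ("Acceleration (units: 0.1sec)", -0.7187, -0.716),
    ("Top speed (units: 0.01mph)", 0.2626, 0.261),
    ("Pollution", -0.4441, -0.444),
    ("Size", 0.9307, 0.935),
    ("Big enough", 0.1397, 0.143),
    ("Luggage space", 0.4916, 0.501),
    ("Operation cost", -0.7663, -0.768),
]

# Difference between my estimate and the paper's value, per coefficient
diffs = {name: round(mine - paper, 4) for name, mine, paper in rows}

# Coefficient with the largest absolute disagreement among those shown
largest = max(diffs, key=lambda name: abs(diffs[name]))

for name, diff in diffs.items():
    print(f"{name:<30s} {diff:+.4f}")
print("Largest gap:", largest)
```

For these nine attributes, every difference is below 0.01 in absolute value, which is consistent with the disagreement lying among the remaining terms.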

### See if paper results can be replicated:

In [8]:
paper_vals =\
    np.array([-0.185,
               0.350,
              -0.716,
               0.261,
              -0.444,
               0.935,
               0.143,
               0.501,
              -0.768,
               0.413,
               0.820,
               0.637,
              -1.437,
              -1.017,
              -0.799,
              -0.179,
               0.198,
               0.443,
               0.345,
               0.313,
               0.228])
    
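# Log-likelihood of the observed choices, evaluated at the paper's coefficients.
# With return_long_probs=False and choice_col given, predict() returns only
# the probabilities of the chosen alternatives.
np.log(car_mnl.predict(car_df,
                       param_list=[paper_vals, None, None, None],
                       return_long_probs=False,
                       choice_col='choice')).sum()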

-7458.0897811913037

The answer appears to be no.

The results from "Brownstone, David and Train, Kenneth (1999). 'Forecasting new product penetration with flexible substitution patterns'. Journal of Econometrics 89: 109-129." cannot be directly reproduced using the data in `car_long_format.csv`.
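
For reference, the quantity computed in cell In [8] is the MNL log-likelihood: the sum, over respondents, of the log-probability of the chosen alternative, where the probabilities are a softmax of the linear utilities within each choice set. Below is a minimal pure-Python sketch on made-up data. The function name `mnl_log_likelihood` and the toy rows are hypothetical; they are not part of pylogit or the car data.

```python
import math
from collections import defaultdict


def mnl_log_likelihood(rows, beta):
    """Log-likelihood of an MNL model on long-format data.

    rows: iterable of (obs_id, chosen, features), one row per alternative.
    beta: list of coefficients shared by all alternatives.
    """
    utilities = defaultdict(list)   # obs_id -> utilities of its alternatives
    chosen_utility = {}             # obs_id -> utility of the chosen one
    for obs_id, chosen, features in rows:
        v = sum(b * x for b, x in zip(beta, features))
        utilities[obs_id].append(v)
        if chosen:
            chosen_utility[obs_id] = v

    log_lik = 0.0
    for obs_id, vs in utilities.items():
        # Log-sum-exp with the maximum subtracted for numerical stability
        v_max = max(vs)
        log_denom = v_max + math.log(sum(math.exp(v - v_max) for v in vs))
        log_lik += chosen_utility[obs_id] - log_denom
    return log_lik


# Hypothetical toy data: two respondents, three alternatives each
toy_rows = [
    (1, 1, [1.0, 0.5]),
    (1, 0, [2.0, 0.1]),
    (1, 0, [3.0, 0.9]),
    (2, 0, [1.5, 0.2]),
    (2, 1, [2.5, 0.4]),
    (2, 0, [0.5, 0.8]),
]

ll_zero = mnl_log_likelihood(toy_rows, [0.0, 0.0])
ll_beta = mnl_log_likelihood(toy_rows, [-0.5, 1.0])
print(ll_zero, ll_beta)
```

With all coefficients at zero, each of the three alternatives is equally likely, so the toy log-likelihood equals 2 * log(1/3).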