# Unordered Multinomial Choice:
---
In the previous set of slides, we looked at choices where we had a multinomial outcome but the choice was clearly vertical, with a clear ordering. In many situations though we are interested in an unordered set of choices:
* No Purchase
* Brand A
* Brand B
* Brand C

In [1]:
sqrt(1000)

While it may be possible to say that *Brand A* is on average more desirable than *Brand B*, we want the data to pick this out for us, rather than impose it. Here we'll estimate probabilities for a particular individual making some choice, but we can think about these as being market shares.

Again, the techniques come in two main flavors (though there are others, and sub-flavors that we'll touch on a bit below ):
* Multinomial Probit
* Multinomial Logit

More so than in the binary case, multinomial *logits* are the focal technique, as they are much easier to estimate in practice, however, the way to think about the modeled process is fairly similar.

Suppose we know the error distribution $\epsilon$, we will again think of there being a series of hidden latent variable representing the utility (where these methods are sometimes referred to as random utility models). For person  $i$ we will assume that they derive some anticipated happiness from option $m$ given by $U^i(m)$. If given a choice between options 1 to 4 they would then compare the amounts  $U^i(1)$, $U^i(2)$, $U^i(3)$ and $U^i(4)$ and make the choice that gives them the greatest expected outcome.

In this way, we can think of the latent variables $U^i(m)$ as being the Utility person $i$ gets from choosing item $m$, and that they are maximizing their own outcomes. We will model this utility as having a linear form in the explanatory variables (though this also can be relaxed):
$$ U^i(m) = x_{im}\beta+z_i\gamma_m+\epsilon_{im} $$

While we don't observe the utilities, we instead observe a choice for product $m$ only if $U_i(m)$ is greater than all of the other choices for person $i$

$$ U^i(m) = x_{im}\beta+z_i\gamma_m+\epsilon_{im} $$
The data here is composed of:
* Choice attributes $x_{im}$ that can vary across both choices and decision makers (but don't have to)
* Individual attributes $z_i$ that are held constant across choices but that can be weighted differently across choices

The parameters we will try and estimate are:
* $\beta$ which tells us how different characteristics of the choice feed into utility, and are constant across both decision makers and choices
* $\gamma_m$ for each option $m$ telling us how specific individual characteristics feed into how much a product is liked/disliked

## Example
Let's look at an example to make all of this a bit clearer. We'll look at the travel choices for 3880 travellers between [Montreal and Toronto](https://www.google.com/maps/dir/Montreal,+QC,+Canada/Toronto,+ON,+Canada), which appears in the [Vignettes](https://cran.r-project.org/web/packages/mlogit/vignettes/c3.rum.html) accompanying the multinomial package [documentation](https://cran.r-project.org/web/packages/mlogit/mlogit.pdf)

The transporation options were:
* Car
* Bus
* Train
* Air

In [2]:
library(mlogit)
data("ModeCanada",package = "mlogit") # this loads it in from the Mlogit package
tail(ModeCanada,10)  

Unnamed: 0_level_0,case,alt,choice,dist,cost,ivt,ovt,freq,income,urban,noalt
Unnamed: 0_level_1,<int>,<fct>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>
15511,4321,air,0,388,155.3,46,115,16,45,0,4
15512,4321,bus,0,388,27.96,186,140,24,45,0,4
15513,4321,car,1,388,73.72,297,0,0,45,0,4
15514,4322,train,0,347,60.6,193,155,3,35,0,3
15515,4322,air,0,347,155.3,46,120,16,35,0,3
15516,4322,car,1,347,65.93,267,0,0,35,0,3
15517,4323,train,0,323,60.6,193,200,3,45,0,2
15518,4323,car,1,323,61.37,278,0,0,45,0,2
15519,4324,train,0,150,28.5,63,105,1,70,0,2
15520,4324,car,1,150,28.5,134,0,0,70,0,2


For each individual we have a normalized measure of:
* Income (`income`)
* Whether they live in an urban area (`urban`)

For each choice option we have a measure of:
* The cost (`cost`)
* The in vechile time (`ivt`)
* The out of vechile time (`ovt`)

The dependent variable here is `choice` where this set of options available to each decision maker is show for each of the `case` ids (where `noalt` summarizes the number of choices considered)

We want to only consider the decision makers who consider all four choices  and we need to tell the `mlogit` estimation how things are organized. The entries at the individual level are then just repeated in each row with the different `cases` where the choice options are given in the `alt` column. To let the mlogit command know this organization we use the `dfidx` command (here I subset too to select choices with four options)

In [3]:
MC <- dfidx(ModeCanada, subset =  noalt == 4, idx = c("case", "alt"))

We can then specify the estimation equation which by default is estimated using a multinomial logit:

In [4]:
mlogit.est <- mlogit( choice ~ cost + freq + ovt | income | ivt , MC)  

The formula object above is made up of three parts:
1. Terms which can vary across individuals $i$ and choices $m$  but that have a constant effect in each utility
    * $\beta_\text{cost}$, $\beta_\text{freq}$, $\beta_\text{ovt}$
2. Terms which are individual specific but which will be estimated with a separate effect for each choice
    * $\gamma^\text{air}_\text{income}$, $\gamma^\text{bus}_\text{income}$ , $\gamma^\text{car}_\text{income}$,  $\gamma^\text{train}_\text{income}$
3. Terms which vary across $i$ and $j$, but where the variable has a choice-specific effect
    * $\delta^\text{air}_{\text{ivt}}$, $\delta^\text{bus}_{\text{ivt}}$, $\delta^\text{car}_{\text{ivt}}$, $\delta^\text{train}_{\text{ivt}}$

(A fourth part can be used for terms affecting the variance!)

When we allow for choice specific terms, this estimation is often referred to as *conditional logit*.

Let's look at the estimated parameters:

In [16]:
0.1906138/0.003532941	

In [17]:
summary(mlogit.est)


Call:
mlogit(formula = choice ~ cost + freq + ovt | income | ivt, data = MC, 
    method = "nr")

Frequencies of alternatives:choice
    train       air       bus       car 
0.1666067 0.3738755 0.0035984 0.4559194 

nr method
9 iterations, 0h:0m:0s 
g'(-H)^-1g = 0.00014 
successive function values within tolerance limits 

Coefficients :
                  Estimate Std. Error  z-value  Pr(>|z|)    
(Intercept):air -3.2741952  0.6244152  -5.2436 1.575e-07 ***
(Intercept):bus -2.5758571  1.0845227  -2.3751 0.0175439 *  
(Intercept):car -1.4300823  0.3013764  -4.7452 2.083e-06 ***
cost            -0.0333389  0.0070955  -4.6986 2.620e-06 ***
freq             0.0925297  0.0050976  18.1517 < 2.2e-16 ***
ovt             -0.0430036  0.0032247 -13.3356 < 2.2e-16 ***
income:air       0.0381466  0.0040831   9.3426 < 2.2e-16 ***
income:bus      -0.0509401  0.0181702  -2.8035 0.0050553 ** 
income:car       0.0101536  0.0031648   3.2083 0.0013353 ** 
ivt:train       -0.0014504  0.0011875  -1.2214 0.

Under the multinomial logit model the errors are assumed to have what is called an *extreme-value distribution*. This is a little bit funkier than a logisitic, but the assumption is made to ensure a similar representation for the odds ratio as before. If the latent variable for option $j$ from the $M$ possible options is $U_i(j)$, then the probability that someone chooses option $j$ is given by:
$$ \Pr\left\{Y_i=j\right\}=\frac{\exp\left\{U_i(j)\right\}}{\sum_{m=1}^M \exp\left\{U_i(m) \right\}}$$

So because the denominator is a constant in all of the choices, the odds ratio for any two options (here 1 and 2) is given by:
 $$\frac{ \Pr\left\{Y_i=1\right\}}{ \Pr\left\{Y_i=2\right\}}=\frac{\exp\left\{U_i(1)\right\}}{\exp\left\{U_i(2)\right\}} $$

And the log-odds-ratio is just the difference in the latent variables/utilities:
 $$\log\left(\frac{ \Pr\left\{Y_i=1\right\}}{ \Pr\left\{Y_i=2\right\}} \right)=U_i(1)-U_i(2)$$

## Independence of Irrelevant Alternatives (IIA)
This property of multinomial logit is called *Independence of Irrelevant Alternatives*. That is, that the odds-ratio comparing *any* two outcomes is purely a function of the characteristics for those two choices, and does not respond to other features of the overall set of choices. 

While this is mostly a nice feature that was designed into the approach, you can show that it is absurd in some settings. (In particular, if *choice 3* is a direct substitute for product 1 you would imagine some of it's features (its pricing) could affect the relative chances between choices 1 and 2.

There are some ways to remove the IIA assumption across pre-specified Nests, where this technique is called *Nested Logit*. This allows for a degree of correlation in the error terms within the nests choices, but independence across the nests.

In particular if you look at the [vignettes](https://cran.r-project.org/web/packages/mlogit/vignettes/c4.relaxiid.html) there is a nice example for Japanese firms choosing regions for investment within the EU, where the nests are chosen at the country level. However, the process here is a bit more involved, using the parameters from the more-standard country-wide multinomial logits as instruments for the second model.  

## Multinomial Probits
Another option that allows for a degree of correlation across the errors is to use a multinomial Probit. Here, the assumption on the error primitives is a bit easier to understand, where we let the error terms have a multivariate Normal distribution $\boldsymbol{\epsilon}\sim \mathcal{N}(0,\Sigma)$ (though scale is not fully identified here, so you should think of the variance matrix as being more of a correlation matrix). 

In contrast to the logit, in the probit we can allow for the errors to be correlated across the choice:
* So the fact that I have a high idiosyncratic shock to my personal utility on a *Mercedes* can be correlated with my also having a high idiosyncratic shock on an *Audi*, and a negative shock to say a *GM*.

Running the probit as opposed to the logit model is relatively easy, where I just add the `probit=TRUE` flag to the model estimation command from before.

In [6]:
mprobit.est <- mlogit( choice ~ cost + freq +ovt | income | ivt, MC,probit=TRUE ) 

In [7]:
AIC(mprobit.est)
AIC(mlogit.est)

In [8]:
summary(mprobit.est)


Call:
mlogit(formula = choice ~ cost + freq + ovt | income | ivt, data = MC, 
    probit = TRUE)

Frequencies of alternatives:choice
    train       air       bus       car 
0.1666067 0.3738755 0.0035984 0.4559194 

bfgs method
23 iterations, 0h:1m:44s 
g'(-H)^-1g = 4.74E-07 
gradient close to zero 

Coefficients :
                   Estimate  Std. Error z-value  Pr(>|z|)    
(Intercept):air -2.81296211  0.30921547 -9.0971 < 2.2e-16 ***
(Intercept):bus -2.00242481  0.94359556 -2.1221 0.0338275 *  
(Intercept):car -0.48301166  0.11583068 -4.1700 3.046e-05 ***
cost            -0.01076199  0.00312332 -3.4457 0.0005696 ***
freq             0.04570382  0.00300708 15.1987 < 2.2e-16 ***
ovt             -0.01322724  0.00157690 -8.3881 < 2.2e-16 ***
income:air       0.01814586  0.00195387  9.2871 < 2.2e-16 ***
income:bus      -0.02406673  0.01460376 -1.6480 0.0993564 .  
income:car       0.00398359  0.00101304  3.9323 8.413e-05 ***
ivt:train       -0.00060007  0.00050683 -1.1840 0.2364265    


However, the model now takes a lot longer to run and solve. The reason for this is that we do not generally have closed form expressions for the likelihood that a one outcome from a multivariate normal is larger than all of the others! As such when the model is trying to figure out the likelihood for a particular parameter mix it is using *Simulated Maximum Likelihood*.

That is, it randomly draws a random sample from the relevant multivariate normal and uses this to compute the probability that the utility to option $j$ is larger than for the other $M-1$ options.

This simulation approach to maximum likelihood becomes more common as the models get more involved, and can mean that estimation can take a substantial amount of time... which means that bootstrapping standard errors can take a **very** long time

## Mixed Logit

Finally, another alternative to induce correlations is to allow for the **coefficients** to be random. That is suppose our prior model was just:
$$U_i(j)=\beta x_{ij} +\epsilon_{ij}$$

In a random coefficients model person $i$ would have their own particular value for $\beta_i$, reflecting idiosyncratic tastes for $x_{ij}$ say. As such their utility would be given by:
$$U_i(j)=\beta_i x_{ij} +\epsilon_{ij}$$
where we would make a parametric assumption on the distribution such as $\beta_i\sim\mathcal{N}(\beta,\sigma^2_\beta)$.

Given the value of $\beta_i$ the probability of making choice $j$ is given by:
$$ \Pr\left\{Y_i=j\right\}=\frac{\exp\left\{\beta_i x_{ij} +\epsilon_{ij} \right\}}{ \sum_{m=1}^M  \exp\left\{\beta_i x_{im} +\epsilon_{im} \right\}}$$
So people with a very high value of $\beta_i$ might be more inclined to make some choices over others, depending on the relevant values for the $x_{im}$ terms.

In practical terms, it's very hard to assess the analytical expectation of the probabilities over the random coefficients, and we instead switch  to numerical estimates across all the possible values of $\beta_i$ under the assumed distribution using simulation. 

In making this assumption, we can practically decompose an outcome into it's average effect, and the idiosyncratic effect from variable $x_{ij}$ $(\beta_i-\beta)$, reflecting how the individual's $x_{ij}$ variable is weighted both in the considered choice probability:
$$U_i(j)=\beta x_{ij} +(\beta_i-\beta)x_{ij}+\epsilon_{ij},$$
but also in the other choice probabilities via the $x_{im}$ terms for the other choices.

By assuming a distribution for $\beta_i$ we will use aggregation across multiple individuals to extract the average value $\beta_i$ and the scale of the individual effects (a standard deviation). The R `mlogit` command lets us do that by adding details on the variable we are allowing to be random, and providing its distribution via the `rpar` option. Here I set the cost and outside the vehicle time parameters to be random with a normal assumption.

In [None]:
mixedlogit.est <- mlogit( choice ~ cost + freq +ovt | income | ivt, MC, rpar=c(cost="u")   )

It is possible to specify other distributions too, such as uniform (`u`), triangular (`t`) and log-normal (`ln`)

In [None]:
c("Logit"= AIC(mlogit.est), "Mprobit"=AIC(mprobit.est),"Mixed.logit"= AIC(mixedlogit.est));

In [32]:
head(mixedlogit.est$probabilities)

train,air,bus,car
0.3758118,0.21544206,0.0017846306,0.4069615
0.2967972,0.31237687,0.0005162383,0.3903097
0.4894438,0.08887799,0.0051784771,0.4164998
0.2805015,0.34677221,0.0004688045,0.3722575
0.2846652,0.33187927,0.0004613962,0.3829941
0.3645871,0.22295496,0.0012659196,0.411192


In [None]:
summary(mixedlogit.est)

## Prediction and Counterfactuals

In [None]:
# Current predicted shares by transpotation mode in each model
rbind(  
"logit.est"=apply(fitted(mprobit.est, outcome = FALSE), 2, mean),
"probit.est"=apply(fitted(mlogit.est, outcome = FALSE), 2, mean),
"mix.logitest"=apply(fitted(mixedlogit.est, outcome = FALSE), 2, mean)
)

So let's see what happens in our estimated model as we change things (the whole point of having a model!). Here we'll double the cost of driving a car.

In [None]:
colnames(MC3)

In [None]:
MC2 <- MC
#MC2$cost <- ifelse(MC2$alt=="car",MC2$cost*2, MC2$cost)
MC2$ivt <- ifelse(MC2$alt=="train",(MC2$ivt)/2.5, MC2$ivt)
MC2$cost <- ifelse(MC2$alt=="train",(MC2$cost)*3, MC2$cost)

#MC2$cost <- ifelse(MC2$alt=="train",MC2$cost*1.2, MC2$cost)

In [49]:
logit.share <- apply(fitted(mlogit.est, outcome = FALSE), 2, mean)
logit.cf.share <- apply(predict(mlogit.est, newdata=MC2), 2, mean)
data.frame(cbind(
"original"= logit.share, 
"counterfactual"=  logit.cf.share,
"difference (%)"= round(100*(logit.cf.share-logit.share),2)
))

Unnamed: 0_level_0,original,counterfactual,difference....
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>
train,0.166606693,0.381540183,21.49
air,0.373875495,0.428948344,5.51
bus,0.003598417,0.006806765,0.32
car,0.455919395,0.182704708,-27.32


In [50]:
probit.share <- apply(fitted(mprobit.est, outcome = FALSE), 2, mean)
probit.cf.share <- apply(predict(mprobit.est, newdata = MC2 ), 2, mean)
data.frame( cbind( 
"original"=probit.share,
"counterfactual"=  probit.cf.share,
"difference (%)"= round(100*(probit.cf.share-probit.share),2)  
)) 

Unnamed: 0_level_0,original,counterfactual,difference....
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>
train,0.165444484,0.454617432,28.92
air,0.369058777,0.401491182,3.24
bus,0.003659158,0.005080722,0.14
car,0.462836532,0.139143783,-32.37


In [51]:
mxlogit.share <- apply(fitted(mixedlogit.est, outcome = FALSE), 2, mean)
mxlogit.cf.share <- apply(predict(mixedlogit.est, newdata = MC2), 2, mean)
data.frame( cbind(
"original"=mxlogit.share,
"counterfactual"=  mxlogit.cf.share,
"difference (%)"= round(100*(mxlogit.cf.share-mxlogit.share),2)  
))

Unnamed: 0_level_0,original,counterfactual,difference....
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>
train,0.167325567,0.417971088,25.06
air,0.371126992,0.400377534,2.93
bus,0.003608577,0.007824095,0.42
car,0.457938863,0.173827283,-28.41


### Conclusion
All three models agree on the conclusion that the main winner from an increase in the cost of cars would be the Train system. So if you were forecasting substantial increases to the costs of running an automobile, you might want to invest in your rail network! 

Depending on how much time we have, we will maybe try to talk about how these techniques with multinomial logits can be incorporated into models of an entire industry. These models use game theoretic models of an oligopoly (similar to the things you looked at with Richard) to understand how prices and product characteristics affect outcomes. After estimation, these structural models (called BLP models after the authors of [this article](https://doi.org/10.2307/2171802), Berry, Levinsohn and Pakes) can be used to make predictions about what would happen if you raised/lowered prices. (See this [Nevo paper](https://doi.org/10.1111/j.1430-9134.2000.00513.x) on a guide for practioners!)