# Maximum Simulated Likelihood Estimation Using SFM as Examples


$$\require{color}
\definecolor{purple}{RGB}{114,0,172}
\definecolor{green}{RGB}{45,177,93}
\definecolor{red}{RGB}{251,0,29}
\definecolor{blue}{RGB}{18,110,213}
\definecolor{orange}{RGB}{217,86,16}
\definecolor{pink}{RGB}{203,23,206}
$$


## What Is It and Why?
 

- MSLE is an extension of MLE. It is often used in lieu of MLE when the likelihood function of the model cannot be derived analytically. In many cases, it happens when the model's density function involves an integral that cannot be expressed in closed form. We may call such a model _analytically intractable_. In this case, the likelihood function is approximated through simulation methods.


- Why might we care about analytically intractable models? Why not settle for the tractable ones?

  - Analytically tractable models are easier to estimate, but the simplicity is often achieved through rigid behavioral assumptions.

  - For instance, stochastic frontier models are tractable if the distributional assumptions are \{normal, truncated normal\}, \{normal, exponential\}, etc. However, the models become intractable for many other distribution combinations.

  - As another example, the multinomial logit model is tractable and easy to estimate, but it needs to impose the independence of irrelevant alternatives (IIA) assumption, which may not be easy to justify. 
><p style="line-height:110%; font-size:92%;margin-left:10%"> The IIA assumption states that the probability of choosing one alternative over another does not depend on the presence or absence of any other alternatives in the choice set. In other words, the relative preference between any two alternatives is independent of the attributes or existence of any other alternatives in the choice set. </p>
  - Moreover, models with unobserved heterogeneity and/or random effects often contain integrals that cannot be solved analytically.

  - Therefore, if we want to relax these assumptions, we have to use MSLE to deal with the intractable models.



- The basic idea of MSLE is to simulate data (generate a series of random draws, known as Monte Carlo draws, from predetermined distributions that represent the characteristics of the model) for the model based on a given set of parameters. The data are then used to construct and approximate the likelihood function (i.e., the _simulated likelihood_) in the estimation. When the parameters change in the optimization process, a different set of data is generated and the simulated likelihood is evaluated again. This continues until the optimization process converges.


- In a nutshell, MSLE replaces the integral with a numerically approximated integral.

  - As we have learned, there are different methods of numerical integration, including the quadrature method, the Monte Carlo (MC) method, and the quasi-Monte Carlo (QMC) method. The term "MSLE" is usually reserved for cases using the MC and QMC methods.
  
  - Nevertheless, the quadrature method can also be used to deal with intractable likelihoods. At least for one-dimensional integration problems, the quadrature method has superior performance.




- MSLE's asymptotic distribution is the same as MLE's if $S \rightarrow \infty$, $N \rightarrow \infty$, and $\sqrt{N}/S \rightarrow 0$ where $S$ is the number of simulation draws.




## Stochastic Frontier Models: Examples


A typical parametric stochastic frontier (SF) model may be represented
by 

\begin{aligned}
  y_i & = x_i \beta + \varepsilon_i, \\
  \varepsilon_i & = v_i - u_i, \\
  v_i & \sim N(0,\sigma_v^2), \quad \mbox{(most commonly used)} \\
  u_i & \sim  \mbox{some distribution such that $u_i \ge 0$},
\end{aligned}

where $v_i$ and $u_i$ are independent of each other. Likelihood-based estimation requires deriving the PDF of the model. Given the independence assumption, the PDF of the $i$th observation is

\begin{align}
f(\varepsilon_i \mid \theta)  = \int_0^\infty f_{v,u}(\underbrace{\varepsilon_i + u_i}_{\color{red} =v_i}, u_i \mid \theta)\, d u_i 
    \underbrace{=}_{indep} \int_0^\infty f_v(\varepsilon_i + u_i \mid \theta) f_u(u_i \mid \theta)\, d u_i,\label{eq:e_density}
\end{align}


   
where $f_{v,u}(\cdot)$ is the joint density of $v$ and $u$ and
$\theta = (\beta, \sigma_v^2, \ldots)$ is the vector of parameters of this model. Given the
independence of $v_i$ and $u_i$, $f_{v,u}(\varepsilon_i+ 
u_i, u_i \mid \theta)$ is the product of the two random variables'
density functions.

## Conventional MLE Approach 

If we can derive the PDF of $\varepsilon_i$ in closed form, we may estimate the model's parameters via MLE. Consider the case of $u_i \sim N^+(\mu, \sigma_u^2)$, where $N^+(\cdot)$ denotes the truncation from below at zero of the underlying normal distribution. The component densities, their product, and the resulting PDF of $\varepsilon_i$ are:


\begin{aligned}
f_v(v|\theta) & = \frac{1}{\sqrt{2 \pi} \sigma_v}
\exp\left[ -\frac{1}{2} \left( \frac{v}{\sigma_v}  \right)^2 \right],\\
f_u(u|\theta) &  = \frac{1}{\sqrt{2 \pi} \sigma_u} \frac{1}{\Phi\left(\frac{\mu}{\sigma_u}\right)}
\exp\left[ -\frac{1}{2} \left( \frac{u-\mu}{\sigma_u}  \right)^2 \right],
\end{aligned}



\begin{aligned}
\mbox{ } & \\
f_{v,u}(\varepsilon_i + u_i, u_i \mid \theta)  & =
f_v(\varepsilon_i + u_i \mid \theta) f_u(u_i \mid \theta) \\
\mbox{ } & \\
& = \frac{\exp\left\{ -\frac{1}{2} \left[ \left(\frac{u_i+\varepsilon_i}{\sigma_v}  \right)^2 + \left( \frac{u_i-\mu}{\sigma_u} \right)^2 \right] \right\}}{2 \pi \sigma_v \sigma_u \Phi\left(\frac{\mu}{\sigma_u} \right)},
\end{aligned}



\begin{align}
\mbox{ } \notag \\
f(\varepsilon_i\mid \theta) = \int_0^\infty f_{v,u}(\varepsilon_i + u_i, u_i \mid \theta)\, d u_i = \frac{\exp\left[ -\frac{1}{2}\left(\frac{ \mu + \varepsilon_i}{\sqrt{\sigma_v^2 + \sigma_u^2}}\right)^2
    \right]}{\sqrt{2\pi}\sqrt{\sigma_v^2+ \sigma_u^2}\left[ \frac{\Phi\left(\frac{\mu}{\sigma_u} \right)}{\Phi\left(\frac{\mu_*}{\sigma_*} \right)} \right]  },\label{eq:LL_mle}
\end{align}
    
where 

\begin{align}
\mu_*  = \frac{ \mu \sigma_v^2-\varepsilon_i \sigma_u^2}{\sigma_v^2+ \sigma_u^2},\quad
\sigma_*^2  = \frac{\sigma_v^2\sigma_u^2}{\sigma_v^2+ \sigma_u^2}.\label{eq:sigs}
\end{align}
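
The closed form above can be checked numerically. The following is a minimal sketch, assuming the normal/truncated-normal pair of this section; the function names `sf_density` and `sf_density_grid` and the parameter values are illustrative and not part of the original notes. It evaluates \eqref{eq:LL_mle} directly and compares it with a brute-force midpoint sum of $f_v(\varepsilon_i + u_i) f_u(u_i)$ over a fine grid of $u_i$.

```julia
using Distributions

# Closed-form density of ε = v - u with v ~ N(0, σv²) and u ~ N⁺(μ, σu²).
function sf_density(ε, μ, σv, σu)
    σ2 = σv^2 + σu^2
    μs = (μ * σv^2 - ε * σu^2) / σ2        # μ_*
    σs = sqrt(σv^2 * σu^2 / σ2)            # σ_*
    z  = Normal()                          # standard normal, for φ and Φ
    return pdf(z, (μ + ε) / sqrt(σ2)) / sqrt(σ2) *
           cdf(z, μs / σs) / cdf(z, μ / σu)
end

# Brute-force check: midpoint sum of f_v(ε + u) f_u(u) over u in (0, upper).
function sf_density_grid(ε, μ, σv, σu; upper = 20.0, n = 200_000)
    h  = upper / n
    du = truncated(Normal(μ, σu), 0, Inf)  # density of u
    u  = ((1:n) .- 0.5) .* h               # grid midpoints
    return sum(pdf.(Normal(0, σv), ε .+ u) .* pdf.(du, u)) * h
end

# Illustrative parameter values; the two numbers should agree closely.
sf_density(-0.3, 0.5, 1.0, 0.8), sf_density_grid(-0.3, 0.5, 1.0, 0.8)
```

The brute-force version needs only the two PDFs, which is the point made later for the numerical integration approach; the closed form is faster but exists only for particular distribution pairs.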


Let $l_i(\theta \mid \varepsilon_i)$ be the likelihood function of the
$i$th observation. Given that $l_i(\theta \mid \varepsilon_i) 
\propto f(\varepsilon_i \mid \theta)$, the model's parameters may be
estimated by maximizing the following log-likelihood function of the
model: 

$$\begin{aligned}
  \ln L = \sum_{i=1}^N \ln f(\varepsilon_i \mid \theta ).
\end{aligned}$$

A challenge of the MLE approach is deriving the density function of $\varepsilon_i = v_i - u_i$ in closed form, e.g., \eqref{eq:LL_mle}. For some distribution combinations of $v_i$ and $u_i$, the closed form either does not exist or is very difficult to derive. This constraint has, in some sense, limited the development of the SF literature.

## Monte Carlo Simulation Approach (MCSA) 


To understand MSLE, we re-write \eqref{eq:e_density} using the conditional density:

\begin{align}
   f(\varepsilon_i \mid \theta)  & = \int_0^\infty f_{v,u}(\varepsilon_i + u_i, u_i \mid \theta)\, d u_i \notag \\
     & = \int_0^\infty f_{v|u}(\varepsilon_i + u_i \mid \theta, u_i) f_u(u_i \mid \theta)\, d u_i \notag \\
     & = \mbox{E}_{f_u(u \mid \theta)} \left[ f_{v|u}(\varepsilon_i + u_i \mid \theta, u_i)  \right ],\label{eq:sum_msle}
\end{align}
     
where the expectation is over $u_i$ which has the density $f_u(u_i \mid \theta)$. Thus, the log-likelihood of the model may be represented by the sum of the log of \eqref{eq:sum_msle} over $i$: 

\begin{align}
\ln L & = \sum_{i=1}^N \ln \mbox{E}_{f_u(u \mid \theta)} \left[ f_{v|u}(\varepsilon_i + u_i \mid \theta, u_i)  \right ] \label{eq:LL_MSLE}.
\end{align}
   
   
In the estimation, the expected value is approximated by its empirical
counterpart:

\begin{align}
\mbox{E}_{f_u(u \mid \theta)} \left[ f_{v|u}(\varepsilon_i + u_i \mid \theta, u_i)  \right] \approx
  \frac{1}{S}\sum_{s=1}^S  f_{v|u}(\varepsilon_i + u_i^s \mid \theta, u_i^s),\label{eq:LL_msle}
\end{align}  
  
  
where $u_i^s$ is the $s$th element of the random sample $\mathbf{u}_i^S = (u_i^1, u_i^2, \ldots, u_i^S)$ drawn from the distribution of $u_i$. Given $\mathbf{u}_i^S$, \eqref{eq:LL_msle} is easy to calculate since $f_{v|u}(\cdot)$ is the PDF of $v_i$, which is often assumed to follow a normal distribution. Maximizing the log-likelihood in \eqref{eq:LL_MSLE}, with the expectation replaced by the approximation in \eqref{eq:LL_msle}, yields the MSLE estimates of the model.


>> The above estimation method may look complicated. However, it could be understood in a rather intuitive way. Note that the model is
>>
>> \begin{aligned}
  y_i = x_i' \beta + \underbrace{v_i - u_i}_{=\varepsilon_i}.
 \end{aligned}
>> 
>> Though $u_i$ is unobservable, we could _simulate_ its characteristics by drawing many values from its distribution. Let $u_i^s$ be the $s$th drawn value. Since $u_i^s$ is a drawn value and no longer random, we re-write the model as:
>> 
>> \begin{align}
 y_i  & = (x_i' \beta - u_i^s) + v_i,\label{eq:eq1} \\
 y_i - x_i' \beta + u_i^s & = v_i,\\
 \varepsilon_i + u_i^s & = v_i.
 \end{align}
>> 
>> Here, \eqref{eq:eq1} only has one random variable in the model, $v_i$, which is often assumed to be normally distributed. So, we end up with a model that **_looks like_ one that has a normally distributed random variable**.





<div class="alert alert-block alert-info" style="max-width: 100%;">

**Formally, the estimation may be described by the following steps. (Assume that $v_i \sim N(0, \sigma_v^2)$.)**    
    
    
1. Calculate the likelihood value of a given $i$ (i.e., \eqref{eq:LL_msle}):

  - Note that $\varepsilon_i + u_i = v_i$, which has the assumed normal distribution.
      
  - Compute $\hat{\varepsilon}_i + u_i^s$:
        
    - $\hat{\varepsilon}_i = y_i - x_i \hat{\beta}$;
        
    - $u_i^s$ is the $s$th draw from the distribution of $u_i$;
    
  - Compute $f_{v|u}(\hat{\varepsilon}_i + u_i^s)$, where $f_{v|u}(\cdot)$ is the density function of $v_i$'s distribution (because $\varepsilon_i + u_i = v_i$). We have assumed $v_i$ to be normally distributed.
    
    - For instance, if $\hat{\varepsilon}_i = 0.5$ and $u_i^s = 0.1$, then the value is calculated by the following code.
    
    ```julia
    using Distributions
    sigma_v = 1.0                                # illustrative value of σ_v
    f(e, sigma_v) = pdf(Normal(0, sigma_v), e)   # density of v_i ~ N(0, σ_v²)
    f(0.5 + 0.1, sigma_v)                        # f_v(ε̂_i + u_i^s): this!
    ```    
  - Compute $\frac{1}{S}\sum_{s=1}^S f_{v|u}(\hat{\varepsilon}_i + u_i^s)$ which is the simulated likelihood value of the $i$th observation.
    
    - For instance, if $\hat{\varepsilon}_i = 0.5$, then the value may be calculated by the following code.
    
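    ```julia
    # draw `u_list` as an S-element vector from the dist of u_i (half-normal here, for illustration only)
    S = 100
    u_list = rand(truncated(Normal(0, 1.0), 0, Inf), S)
    sum(f.(0.5 .+ u_list, sigma_v))/S
    ```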
    
2. Calculate the log-likelihood value of a given $i$.

  - Take log on the value calculated from Step 1.   
    
    ```julia
    log(sum(f.(0.5 .+ u_list, sigma_v))/S)
    ```      
   
3. Calculate the log-likelihood value of the model (i.e., \eqref{eq:LL_MSLE}).

  - Repeat Step 1 & 2 for all the observations ($i=1,\ldots,N$) and add them up.
    
    ```julia    
    # per-observation simulated log-likelihood; `ϵ` holds the residuals y - xβ
    logLike = [log(sum(f.(ϵ[i,1] .+ u_list, sigma_v))/S) for i in 1:size(y,1)]
    sum(logLike)  # log-likelihood of the model
    ```        
   
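  - You should use the same set of $u_i^s$ for a given observation $i$ throughout the iterations. This avoids chatter (simulation noise that makes the objective function jump between evaluations).
  - On the other hand, you could use different draws for different observations. That would improve efficiency.
  - In short: use the same set of $u_i^s$ for the same observation; different observations may use different sets of $u_i^s$.
  - The "same random vector" mentioned in class refers to the vector at the lowest level, i.e., the underlying random number sequence. Take inverse transform sampling as an example: before generating the final random draws of $u$, we first need a $U(0,1)$ vector, and this is the underlying sequence. For a given $i$, this sequence should preferably stay unchanged in every iteration. Once this sequence is fed into inverse transform sampling, the resulting draws of $u$ will of course differ as $\sigma_u^2$ changes, so the two statements do not contradict each other.

  - This also explains why some students' programs that failed to converge did converge after adding `Xoshiro(..)` as suggested: adding this RNG is what truly ensures that every evaluation uses the same random vector. Without it, the $U(0,1)$ vector differs each time it is generated, which makes convergence difficult.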


  
</div>
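
The steps above can be put together in one function. The following is a minimal sketch, not a definitive implementation: it assumes a half-normal $u_i \sim N^+(0, \sigma_u^2)$, and the function name `sim_loglike`, the simulated data, and all parameter values are hypothetical. The $U(0,1)$ matrix `m` is drawn once with a seeded `Xoshiro` RNG and reused at every evaluation, while the draws of $u$ are rebuilt from it by inverse transform sampling, so they change with $\sigma_u$ but the underlying random sequence does not.

```julia
using Distributions, Random

# Simulated log-likelihood for y = xβ + v - u, with v ~ N(0, σv²) and
# u ~ N⁺(0, σu²) (half-normal; an illustrative assumption).
# `m` is an N×S matrix of U(0,1) draws that is generated once and held fixed.
function sim_loglike(β, σv, σu, y, x, m)
    ε  = y .- x * β                          # residuals ε̂_i (length N)
    du = truncated(Normal(0, σu), 0, Inf)    # distribution of u_i
    u  = quantile.(du, m)                    # inverse transform: u_i^s = Ψ⁻¹(m_is)
    fv = pdf.(Normal(0, σv), ε .+ u)         # f_v(ε̂_i + u_i^s), an N×S matrix
    return sum(log.(sum(fv, dims = 2) ./ size(m, 2)))   # Σ_i log[(1/S) Σ_s f_v]
end

# Hypothetical simulated data, for illustration only.
rng  = Xoshiro(1234)
N, S = 200, 500
x = [ones(N) randn(rng, N)]
y = x * [1.0, 0.5] .+ 0.3 .* randn(rng, N) .- abs.(0.5 .* randn(rng, N))
m = rand(rng, N, S)                          # drawn once, reused at every evaluation

ll1 = sim_loglike([1.0, 0.5], 0.3, 0.5, y, x, m)
ll2 = sim_loglike([1.0, 0.5], 0.3, 0.5, y, x, m)   # same parameters, same value
ll3 = sim_loglike([1.0, 0.5], 0.3, 0.8, y, x, m)   # the draws of u rescale with σu
```

Because `m` is fixed, the function is a deterministic, smooth function of the parameters, and it can be passed to a numerical optimizer after wrapping the parameters in a single vector (with $\sigma_v$ and $\sigma_u$ reparameterized to stay positive).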

In terms of estimation, the key step here is to obtain $\mathbf{u}_i^S$.
Econometric software usually provides random number generators (RNGs)
for some distributions. For instance, RNGs for uniform and normal
random variables are standard in most software. However, it is
rarely a good idea to rely on these RNGs to draw $\mathbf{u}_i^S$ for
MSLE, for two reasons. First, RNGs for some distributions are not
available in many software packages. Second, and more importantly, software RNGs
(such as Stata's `runiform()` and `rnormal()`) usually provide
pseudo-random numbers (as opposed to quasi-random numbers), which are
inefficient in simulations: many more draws are needed
to achieve a given level of precision.

A better and more efficient method is to use quasi-random numbers. This can be accomplished using the inverse transform sampling method coupled with low-discrepancy sequences. To illustrate, let $\Psi(u_i) \in (0,1)$ be the CDF of $u_i$. Then, we can draw a value of $u_i$ by $u_i^s = \Psi^{-1}(m)$, where $m \sim \mbox{Uniform}(0,1)$ and $\Psi^{-1}(\cdot)$ is the inverse CDF (also called the quantile function) of $u_i$. The Halton sequence, which is one type of low-discrepancy sequence, can be used for $m$. Because the sequence is generated in a deterministic and strategic manner (thus *quasi*-random), it covers $(0,1)$ more evenly for a given $S$, and the simulation can be performed more efficiently. The limitation is that the inverse CDF has to be available in an appropriate form, so that we can plug the Halton sequence in for $m$ and obtain the quasi-random sample $\mathbf{u}_i^S$. Unfortunately, this requirement is not easily met for some distributions in many software packages.
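
As a minimal sketch of this idea, the code below builds the base-2 Halton sequence with the radical-inverse formula and pushes it through the quantile function of $u_i$. The helper `halton` and the choice $u_i \sim N^+(0, 1)$ are illustrative assumptions; `quantile` and `truncated` are from Distributions.jl.

```julia
using Distributions

# s-th element of the base-b Halton (van der Corput) sequence,
# computed with the radical-inverse formula.
function halton(s::Integer, b::Integer = 2)
    h, f = 0.0, 1.0 / b
    while s > 0
        h += f * (s % b)     # next digit of s in base b, mirrored around the radix point
        s ÷= b
        f /= b
    end
    return h
end

S  = 8
m  = halton.(1:S)                           # 0.5, 0.25, 0.75, 0.125, ...
du = truncated(Normal(0.0, 1.0), 0, Inf)    # illustrative choice: u ~ N⁺(0, 1)
u  = quantile.(du, m)                       # quasi-random draws u^s = Ψ⁻¹(m^s)
```

The same `m` can be reused for any parameter value; only the quantile function changes, which keeps the simulated likelihood free of chatter.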

### Numerical Integration Approach (NIA) 

This approach is intuitive in that it simply evaluates the integral in \eqref{eq:e_density} numerically to obtain the density of $\varepsilon_i$. Let $f^*(\varepsilon_i \mid \theta)$ be such a density. Then, the model's log-likelihood function may be represented by 

\begin{aligned}
  \ln L = \sum_{i=1}^N \ln f^*(\varepsilon_i \mid \theta).
\end{aligned}

Compared to the conventional MLE, there is no need to derive the
model's density in closed form as shown
in \eqref{eq:LL_mle} to \eqref{eq:sigs}. Without this burden, any distributional
assumptions for $v_i$ and $u_i$ ($\geq 0$) can be used to construct
$\varepsilon_i = v_i - u_i$; all we need are the PDFs of $v_i$ and $u_i$.
The numerical integration procedure itself does not depend on the choice of
distributions.

Compared to MCSA, there is no need for the inverse CDF of
$u_i$: because $u_i$ is integrated out directly, no random sample of $u_i$ has to be drawn. This makes NIA widely applicable to different distributional choices.


There are several methods of numerical integration that
can be adapted for SF model estimation. We introduce two of them
here: the Gaussian quadrature method and the (quasi) Monte Carlo
integration method.

#### The Gaussian Quadrature Method (NIA-GQ)

The quadrature method uses a set of chosen points (*nodes*) and the
corresponding *weights* to evaluate the function and to approximate the integral.
The quadrature-rule-based approximation takes the following
general form:

\begin{align}
I = \int_\Omega p(u)f(u)\, du \approx \sum_{j=1}^n \omega_j f(\xi_j),\label{eq:quad}
\end{align}


where the $\xi_j$ are nodes and the $\omega_j$ are the corresponding weights.
Many integration methods fall into this category, including the
trapezoidal rule, Simpson's rule, etc. Here, we consider the
*Gaussian* quadrature rules, which are designed such that $n$
quadrature nodes integrate polynomials of degree up to $2n-1$ exactly.

With Gaussian quadrature, integrals with different domains ($\Omega$)
and/or different weight functions $p(u)$ require different sets of
node-weight combinations (called *rules*). Commonly used rules
include the Gauss-Legendre, Gauss-Hermite, and Gauss-Laguerre rules.
For SF analysis, where $u_i \geq 0$, we have
$\Omega = (0,\, \infty)$, and the appropriate quadrature rule is
the Gauss-Laguerre rule, for which the weight function is $p(u) = e^{-u}$. The
density of an SF model can then be approximated by (see
also \eqref{eq:e_density}): 

\begin{align}
  f(\varepsilon_i \mid \theta) & = \int_0^\infty f_v(\varepsilon_i + u_i\mid \theta) f_u(u_i\mid \theta)\, d u_i \notag \\
           & = \int_0^\infty e^{-u_i} \tilde{f}_{v,u}(\varepsilon_i, u_i \mid \theta)\, d u_i 
            \approx \sum_{j=1}^n \omega_j \tilde{f}_{v,u}(\varepsilon_i, \xi_j \mid \theta),\label{eq:GL}
\end{align}
            
where

\begin{aligned}
\tilde{f}_{v,u}(\varepsilon_i, u_i\mid \theta)  = e^{u_i} f_v(\varepsilon_i + u_i \mid \theta) f_u(u_i \mid \theta),
\end{aligned}

and $(\omega_j, \xi_j)$ are the Gauss-Laguerre weights and nodes. The model's log-likelihood function may be represented by the sum of the log of \eqref{eq:GL} over the observations.
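
A minimal sketch of \eqref{eq:GL} follows. To stay self-contained, it computes the Gauss-Laguerre nodes and weights from the eigen-decomposition of the associated Jacobi matrix (the Golub-Welsch approach) using only the `LinearAlgebra` standard library; a dedicated quadrature package could be used instead. The function names, the choice $u_i \sim N^+(\mu, \sigma_u^2)$, and the default $n = 32$ are illustrative assumptions.

```julia
using Distributions, LinearAlgebra

# n-point Gauss-Laguerre rule for the weight function e^{-u} on (0, ∞).
# Nodes are the eigenvalues of the Jacobi matrix; weights are the squared
# first components of the normalized eigenvectors.
function gauss_laguerre(n)
    J = SymTridiagonal(collect(1.0:2.0:(2n - 1)), collect(1.0:(n - 1)))
    E = eigen(J)
    return E.values, E.vectors[1, :] .^ 2      # nodes ξ_j, weights ω_j
end

# Approximate f(ε) = ∫₀^∞ f_v(ε + u) f_u(u) du, with u ~ N⁺(μ, σu²).
function sf_density_gl(ε, μ, σv, σu; n = 32)
    ξ, ω = gauss_laguerre(n)
    du = truncated(Normal(μ, σu), 0, Inf)
    ftilde = exp.(ξ) .* pdf.(Normal(0, σv), ε .+ ξ) .* pdf.(du, ξ)   # e^u f_v f_u at the nodes
    return sum(ω .* ftilde)
end

sf_density_gl(-0.3, 0.5, 1.0, 0.8)             # illustrative parameter values
```

Only the PDFs of $v_i$ and $u_i$ enter the computation, so swapping in another distribution for $u_i$ changes a single line.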

The quadrature method is fast and accurate, making it ideal at least for one-dimensional problems. The downside of the method is the curse of dimensionality: for multi-dimensional problems, the number of nodes needed grows exponentially with the dimension, so the method's performance quickly deteriorates.

#### The Monte Carlo Integration Method (NIA-MCI)

An alternative to the quadrature method is Monte Carlo integration,
which is based on the simple approximation: 

\begin{aligned}
I =  \int_0^1 f(t)\, dt  =  \mbox{E}[f(t)]  
       \approx \frac{1}{S}\sum_{s=1}^S f(t^s),
\end{aligned}
       
where $t^s$ is the $s$th element of the sample $\mathbf{t}^S 
= (t^1, t^2, \ldots, t^S)$ drawn from a uniform distribution on
$(0,\,1)$. In this example $t$ has the domain $(0,\,1)$, so the integral
translates directly into an expected value. To apply the method
to an SF model, we use a change of variables to map the
variable of integration from $u \in (0,\,\infty)$ to $t \in (0,\,1)$. To wit, 

\begin{align}
f(\varepsilon_i \mid \theta) & = \int_0^\infty f_{v,u}(\varepsilon_i + u_i, u_i \mid \theta)\, du_i 
           = \int_0^1 g(\varepsilon_i, t_i \mid \theta)\, dt_i \notag \\
         & =  \mbox{E}[g(\varepsilon_i, t_i \mid \theta)]  
       \approx \frac{1}{S}\sum_{s=1}^S g(\varepsilon_i, t^s \mid \theta),\label{eq:LL_mci}
\end{align}
       
       
where $t_i = h(u_i) \in (0,\,1)$, $t^s \in (0,\,1)$ is the $s$th element of the sample $\mathbf{t}^S = (t^1, t^2, \ldots, t^S)$, and 

\begin{aligned}
g(\varepsilon_i, t_i\mid \theta) = f_{v,u}\left(\varepsilon_i + h^{-1}(t_i),\, h^{-1}(t_i)\mid \theta\right)\times J(t_i),
\end{aligned}

where $J(\cdot)$ is the Jacobian of the transformation. The log-likelihood of the model may be represented by the sum of the log of \eqref{eq:LL_mci} over observations. There are several choices of the transformation function $h(\cdot)$ (or its inverse function $h^{-1}(\cdot)$) for the SF model. For instance, $u = h^{-1}(t) = t/(1-t)$, for which the Jacobian is $J(t) = 1/(1-t)^2$.
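
To spell out this example, take $t = h(u) = u/(1+u)$, so that $u = h^{-1}(t) = t/(1-t)$, with $t \to 0$ as $u \to 0$ and $t \to 1$ as $u \to \infty$. Differentiating gives $du = (1-t)^{-2}\, dt$, and under the independence assumption the integral becomes

\begin{aligned}
\int_0^\infty f_v(\varepsilon_i + u_i \mid \theta) f_u(u_i \mid \theta)\, du_i
  = \int_0^1 f_v\!\left(\varepsilon_i + \frac{t_i}{1-t_i} \,\Big|\, \theta\right)
             f_u\!\left(\frac{t_i}{1-t_i} \,\Big|\, \theta\right) \frac{1}{(1-t_i)^2}\, dt_i .
\end{aligned}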

Regarding $\mathbf{t}^S$, we could use pseudo-random numbers from a
uniform distribution in $(0,\,1)$, and the result is typically referred
to as the Monte Carlo integration. An alternative and more efficient way
is to use quasi-random numbers from a low discrepancy sequence (such as
the Halton sequence). The result is a quasi-Monte Carlo integration. In
fact, as we will demonstrate formally in the research, equally-spaced
grids in $(0,1)$ could serve the purpose well at least for
one-dimensional problems (cross-sectional models).
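
The following is a minimal sketch of \eqref{eq:LL_mci} with the transformation $u = t/(1-t)$, evaluated once on an equally spaced grid and once on pseudo-random uniform draws. The names `g` and `mci_density`, the choice $u_i \sim N^+(\mu, \sigma_u^2)$, the grid of midpoints, and the parameter values are illustrative assumptions.

```julia
using Distributions, Random

# g(ε, t) = f_v(ε + u) f_u(u) J(t), with u = t/(1-t) and J(t) = 1/(1-t)².
# u ~ N⁺(μ, σu²) is an illustrative choice.
function g(ε, t, μ, σv, σu)
    u = t / (1 - t)
    return pdf(Normal(0, σv), ε + u) *
           pdf(truncated(Normal(μ, σu), 0, Inf), u) / (1 - t)^2
end

# Average of g over the points in tS: the approximation to f(ε).
mci_density(ε, tS, μ, σv, σu) = sum(g.(ε, tS, μ, σv, σu)) / length(tS)

S      = 1_000
t_grid = ((1:S) .- 0.5) ./ S          # equally spaced grid in (0, 1)
t_rand = rand(Xoshiro(1), S)          # pseudo-random U(0,1) draws

mci_density(-0.3, t_grid, 0.5, 1.0, 0.8), mci_density(-0.3, t_rand, 0.5, 1.0, 0.8)
```

With the same $S$, the grid version is typically much closer to the closed-form value in \eqref{eq:LL_mle} than the pseudo-random version, which illustrates the efficiency argument made above.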

The above NIA-MCI method has a few advantages over the popular MCSA:

-  intuitive: It directly integrates out $u_i$ from the density
    function.

-  simple: The Halton sequence is widely available. For cross-sectional
    models, the Halton sequence may not even be needed; a simple grid
    sequence would suffice. In this way, there is no need for specific
    software or routines.

-  applicable to all distributions: It does not require sampling from a
    distribution, and thus does not need inverse CDFs. Only the PDFs of
    $v_i$ and $u_i$ are needed.

To sum up: Both NIA-GQ and NIA-MCI have important advantages over
the more popular MCSA. Yet, they are under-utilized in simulation-based
estimation, particularly in the field of SF analysis. When used
properly, they have the potential to lift the distribution
restrictions on SF models, an appealing feature that is not
easily shared by MCSA.