# Stochastic Frontier Models: A Mini Introduction

$$\require{color}
\definecolor{purple}{RGB}{114,0,172}
\definecolor{green}{RGB}{45,177,93}
\definecolor{red}{RGB}{251,0,29}
\definecolor{blue}{RGB}{18,110,213}
\definecolor{orange}{RGB}{217,86,16}
\definecolor{pink}{RGB}{203,23,206}
$$

## Stochastic Production Frontiers: The Idea


A conventional production function assumes that firms, on average, realize the full potential of the technology, and that deviations from the potential are purely random.

However, evidence shows that firms may fall short of the technology's potential in a systematic way.
- Causes include poor management, untrained workers, regulations, etc.
- This **technical inefficiency** causes the actual output to fall below the potential output.
- The output loss due to technical inefficiency should be accounted for in the econometric estimation.


### The relationship between the potential output ($y_i^*$), the actual output ($y_i$), and the effect of technical inefficiency ($u_i$)


$$\begin{aligned}
  y_i  = {\color{pink} y_i^*}\times e^{-u_i};\quad  u_i \geq 0,\ \quad 0 < e^{-u_i} \leq 1.
\end{aligned}$$

Take logs on both sides:

$$\begin{aligned}
 \ln y_i = {\color{pink} \ln y_i^*} - u_i.
\end{aligned}$$

In the econometric model, ${\color{pink} \ln y_i^*}$ is the **frontier**: the observed quantity $\ln y_i$ is bounded above by ${\color{pink} \ln y_i^*}$. We may choose a functional form for ${\color{pink}\ln y_i^*}$ and add a statistical error ($v_i$) in the measurement of the frontier:

\begin{aligned}
{\color{pink} \ln y_i^*} = \ln f(x; \beta) + v_i,
\end{aligned}

where $f(\cdot)$ is a non-stochastic function. Thus, ${\color{pink} \ln y_i^*}$ becomes the _**stochastic frontier**_. 


- If $f(x; \beta)$ is Cobb-Douglas, $\ln f(\cdot)$ is a linear function of the logs of the elements of $x$.

- If $f(x; \beta)$ is translog, $\ln f(\cdot)$ is a linear function of the logs of the elements of $x$, their squares, and the interaction terms between them.
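
Written out for $J$ inputs, the two functional forms are commonly specified as follows. The input index $j$, the notation $x_{ij}$ for firm $i$'s $j$-th input, and the coefficient labels $\beta_j$ and $\beta_{jk}$ are assumptions made here for illustration; they do not appear in the text above.

$$\begin{aligned}
\mbox{Cobb-Douglas:}\quad \ln f(x_i; \beta) & = \beta_0 + \sum_{j=1}^{J} \beta_j \ln x_{ij},\\
\mbox{translog:}\quad \ln f(x_i; \beta) & = \beta_0 + \sum_{j=1}^{J} \beta_j \ln x_{ij} + \frac{1}{2}\sum_{j=1}^{J}\sum_{k=1}^{J} \beta_{jk} \ln x_{ij} \ln x_{ik}, \qquad \beta_{jk} = \beta_{kj}.
\end{aligned}$$

Both forms are linear in the parameters, and the Cobb-Douglas case is the translog with all $\beta_{jk} = 0$.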


Put everything together,

\begin{align}
 \ln y_i  & = {\color{pink} \ln y_i^*} - u_i,\qquad u_i \geq 0,\\
 {\color{pink} \ln y_i^*} & = \ln f(x; \beta) + v_i,\\
 \Longrightarrow \quad \ln y_i & = \ln f(x; \beta) +v_i - u_i.
\end{align} 

A different way to express the setup:

\begin{align}
 \ln y_i & = \ln f(x; \beta) + \epsilon_i,\\
 \epsilon_i & = v_i - u_i.
\end{align} 


Essentially, it is a standard regression model augmented by $-u_i$ where $u_i \geq 0$ accounts for output loss due to technical inefficiency.
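
To make the composed error concrete, the following is a minimal simulation sketch of the model with a single input and a Cobb-Douglas frontier. All numerical values (`β0`, `β1`, `σv`, `σu`, the sample size) are hypothetical, and the half-normal choice for $u_i$ is just one possible assumption.

```julia
using Distributions, Random, Statistics

Random.seed!(1234)

n      = 10_000
β0, β1 = 1.0, 0.6            # hypothetical frontier coefficients
σv, σu = 0.2, 0.4            # hypothetical scale parameters

lnx = rand(Normal(0, 1), n)                       # log of the single input
v   = rand(Normal(0, σv), n)                      # symmetric statistical noise
u   = rand(truncated(Normal(0, σu), 0, Inf), n)   # one-sided inefficiency, u ≥ 0

ln_frontier = β0 .+ β1 .* lnx .+ v                # stochastic frontier ln y*
lny         = ln_frontier .- u                    # observed log output
ϵ           = v .- u                              # composed error

# Sample skewness of the composed error, computed by hand
skew = mean((ϵ .- mean(ϵ)) .^ 3) / std(ϵ)^3

println("all u ≥ 0:                 ", all(u .>= 0))
println("all ln y ≤ ln y*:          ", all(lny .<= ln_frontier))
println("mean of ϵ:                 ", round(mean(ϵ), digits=3))
println("half-normal -σu*sqrt(2/π): ", round(-σu * sqrt(2 / π), digits=3))
println("skewness of ϵ:             ", round(skew, digits=3))
```

Because $u_i \geq 0$ is subtracted, the composed error has a negative mean and is skewed to the left, which is what distinguishes it from the symmetric error of a standard regression.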

### Empirical Applications

Although the SFM was originally developed in the production function context, it can be applied to other cases where $u_i$ is not necessarily interpreted as _technical inefficiency_. Essentially, if there exists a _boundary_ in the data's representation, it could potentially be modeled by the SFM.


- actual wage rate vs. the potential wage offer ($u_i$: the shortfall of wage due to imperfect job match): Imperfect information entails search costs in the labor market, and a worker may accept a lower-wage job if he or she cannot keep searching for the best offer.

- actual capital investment vs. frictionless level of investment ($u_i$: the lost investment due to financing constraint): Imperfect information in the capital market leads to financing constraints on capital investment. Firms may not be able to borrow enough to finance the investment.

- GDP now vs. GDP in the steady state ($u_i$: the GDP yet to be realized before reaching the steady state): When a poor country is catching up to its steady state, the steady-state level becomes the frontier of its GDP.

## Estimation Strategy

- distribution-free approach 
  - will skip

- parametric approach 
  - Make distribution assumptions on $v_i$ and $u_i$.



## The (In)efficiency Index



>> ![twographs.PNG](attachment:twographs.PNG)


How do we estimate the extent of efficiency or inefficiency? Recall that $y = y^* e^{-u}$, or, $\ln y = \ln y^* - u$, $u \geq 0$.


-   **efficiency index** (Battese and Coelli 1988):
$$e^{-u} = y/y^*.$$
  - the percentage of realized potential; 
  - Between 0 and 1 (more efficient if close to 1); 
  - Estimated by $E[e^{-u} | \epsilon ]$.



-  **inefficiency index** (Jondrow et al. 1982): 
$$u = \ln\left(\frac{y^*}{y}\right) = 
          \ln\left(\frac{y + \Delta y}{y}\right) = \ln\left(1 + \frac{\Delta 
          y}{y}\right) \approx \frac{\Delta y}{y}.$$
  - the foregone output, $\Delta y = y^* - y$, as a percentage of current output because of inefficiency (the approximation is accurate when $u$ is small);
  - the value is in the range  $[0, \infty)$;
  - Estimated by $E[u|\epsilon]$.
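
A small numerical illustration of the two indices, using a hypothetical inefficiency level $u = 0.2$ (the value is chosen only for the example):

```julia
u = 0.2                          # hypothetical inefficiency level

efficiency  = exp(-u)            # y / y*: share of potential output realized
exact_loss  = exp(u) - 1         # Δy / y = (y* - y) / y, exact
approx_loss = u                  # first-order approximation ln(1 + Δy/y) ≈ Δy/y

println("efficiency index exp(-u): ", round(efficiency, digits=4))
println("exact Δy/y:               ", round(exact_loss, digits=4))
println("approximation u:          ", approx_loss)
```

With $u = 0.2$ the firm realizes about 82% of its potential output, and the exact foregone output is about 22% of current output, slightly above the approximation of 20%. The gap widens as $u$ grows.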



## Purposes of a Stochastic Frontier Estimation

1.  estimate the production technology (i.e., the $\beta$
    coefficients);

2.  understand whether inefficiency is a serious phenomenon in the
    sample (i.e., $E(u_i)$ or $E(e^{-u_i})$);

3.  understand what leads to inefficiency (i.e., what determines $u_i$);

  - May parameterize the distribution parameters of $u_i$ as functions of exogenous variables $z_i$. E.g., $\mu_i = \gamma_0 + \gamma_1 z_i$ and/or $\sigma_{u,i}^2 = \exp(\rho_0 + \rho_1 z_i)$.


4.  compare efficiency levels across producers (i.e., $E(u_i)$ vs.
    $E(u_j)$);

5.  make efficiency rankings among producers.

## Parametric Estimation of the Stochastic Frontier Models


\begin{aligned}
 \ln y_i =  \ln f(x_i; \beta) + v_i  - u_i, 
\end{aligned}


-  impose distributional assumptions on $v_i$ and $u_i$,

  - derive the log-likelihood function and obtain estimates of the
        coefficients by MLE (or by a method-of-moments estimator);

  - obtain *observation-specific* measures of technical inefficiency
        (tricky!).


### Parametric Approach: u is Half-Normal

$$\begin{aligned}
 \ln y_i  = & \ln y^*_i - u_i,\\
 \ln y^*_i  = & x'\beta + v_i, \quad (\mbox{the efficient stochastic frontier}),\\
 u_i  \sim &  N^+(0,\,\sigma_u^2), \qquad  v_i  \sim  N(0,\, \sigma_v^2),
\end{aligned}$$

Or $$\begin{aligned}
 \ln y_i  =  x'\beta + \epsilon_i;\quad \epsilon_i  = & v_i - u_i,
\end{aligned}$$

The half-normal distribution $N^+(0, \sigma_u^2)$ is a normal distribution truncated from below at zero, which ensures that $u_i \geq 0$.

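```julia
using Distributions, StatsPlots, Interact

# Half-normal: a Normal(0, σ) truncated from below at 0
d1(σ) = truncated(Normal(0, σ), 0, Inf)

# Slider over σ to see how the shape of the density changes
@manipulate for σ in 0.1:0.1:2
    plot(d1(σ), xlims=(0.0, 4.0), ylims=(0.0, 1.6))
end
```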

##### Log-Likelihood Function of the Normal - Half Normal Model 

$$L_i = - \ln \left(\frac{1}{2}\right) -\frac{1}{2}\ln (\sigma_v^2 + \sigma_u^2) + \ln
\phi\left(\frac{\epsilon_i}{\sqrt{\sigma_v^2 + \sigma_u^2}} \right) +
\ln \Phi\left(\frac{\mu_{*i}}{\sigma_*} \right),$$

where



\begin{aligned}
 \mu_{*i}  = \frac{-\sigma_u^2 \epsilon_i}{\sigma_v^2 + \sigma_u^2},\qquad
 \sigma_*^2  = \frac{\sigma_v^2  \sigma_u^2}{\sigma_v^2 + \sigma_u^2}. 
\end{aligned}
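
The log-likelihood above can be coded term by term. The sketch below is one possible implementation; the function name `loglik_halfnormal` and the parameter values are hypothetical. As a sanity check, it integrates $\exp(L_i)$ numerically over $\epsilon_i$, which should give a value close to 1 if the expression is a proper density.

```julia
using Distributions

# Per-observation log-likelihood of the normal / half-normal model,
# written term by term as in the formula above.
function loglik_halfnormal(ϵ, σv, σu)
    σ2 = σv^2 + σu^2
    μs = -σu^2 * ϵ / σ2                 # μ_{*i}
    σs = sqrt(σv^2 * σu^2 / σ2)         # σ_*
    return -log(1 / 2) - 0.5 * log(σ2) +
           logpdf(Normal(), ϵ / sqrt(σ2)) +
           logcdf(Normal(), μs / σs)
end

σv, σu = 0.2, 0.4                       # hypothetical parameter values

# Riemann-sum check that exp(L_i) integrates to (approximately) one
grid = -6.0:0.001:6.0
area = sum(exp.(loglik_halfnormal.(grid, σv, σu))) * step(grid)
println("integral of the density ≈ ", round(area, digits=4))

# Sample log-likelihood for a hypothetical vector of composed errors
ϵ = [-0.5, -0.1, 0.0, 0.2]
println("sample log-likelihood = ", sum(loglik_halfnormal.(ϵ, σv, σu)))
```

In an actual estimation, $\epsilon_i$ would be replaced by $\ln y_i - \ln f(x_i; \beta)$ and the summed log-likelihood maximized over $\beta$, $\sigma_v$, and $\sigma_u$ with a numerical optimizer.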



##### Exogenous Determinants (may skip)

1.  the two-step approach: After obtaining $E(u_i|\epsilon_i)$, run a Logit
    regression of $E(u_i|\epsilon_i)$ on exogenous determinants
    ${z}_i$. But this approach has serious problems:

    1.  the iid assumption of $u_i$ is violated;

    2.  Wang and Schmidt (2002) show that the bias is significant.

2.  the one-step approach: (Ford et al. 1993) $u \sim N^+(0,
      \sigma_u^2=g({Z}_i{\delta})) \equiv N^+(0, \exp({Z}_i{\delta}))$

    1.  be careful with the interpretation: $\delta$ is not the marginal
        effect, although its sign is the same as the marginal effect's
        sign.

    2.  "So, this model assumes that ${Z}_i$ only affects the *variance* of
        inefficiency"? No! $\sigma_u^2$ is the variance of the
        *pre-truncation* distribution of $u$. After the truncation,
        $E(u) = q(\sigma_u^2)$ and $V(u)=h(\sigma_u^2)$, so ${Z}_i$ affects both the mean and the variance of $u$.
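
For the half-normal case, $q(\cdot)$ and $h(\cdot)$ have simple closed forms, given by the standard moments of the half-normal distribution. Writing $\sigma_{u,i}^2 = \exp({Z}_i{\delta})$, and using $Z_{ik}$ and $\delta_k$ for the $k$-th elements of ${Z}_i$ and ${\delta}$ (an indexing assumed here for illustration):

$$\begin{aligned}
E(u_i) & = \sqrt{\frac{2}{\pi}}\,\sigma_{u,i} = \sqrt{\frac{2}{\pi}}\,\exp\left(\frac{{Z}_i{\delta}}{2}\right),\\
V(u_i) & = \left(1 - \frac{2}{\pi}\right)\sigma_{u,i}^2 = \left(1 - \frac{2}{\pi}\right)\exp({Z}_i{\delta}),\\
\frac{\partial E(u_i)}{\partial Z_{ik}} & = \frac{\delta_k}{2}\sqrt{\frac{2}{\pi}}\,\exp\left(\frac{{Z}_i{\delta}}{2}\right).
\end{aligned}$$

The marginal effect therefore has the same sign as $\delta_k$, but its magnitude depends on ${Z}_i$ and differs from $\delta_k$ itself.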

### Parametric Approach: u is Truncated Normal

$$\begin{aligned}
 \ln y_i & = x'\beta + \epsilon_i,\\
 \epsilon_i & =  v_i - u_i,\\
 u_i & \sim  N^+(\mu,\,\sigma_u^2),\\
 v_i & \sim  N(0,\, \sigma_v^2).
\end{aligned}$$

-  The functional form is more flexible than the half-normal
    distribution.

-  The mode of the distribution is not necessarily 0.

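```julia
using Distributions, StatsPlots, Interact

# Normal(μ, σ) truncated from below at 0; μ = 0 gives the half-normal
d2(μ, σ) = truncated(Normal(μ, σ), 0, Inf)

# Sliders over μ and σ: the mode moves away from 0 when μ > 0
@manipulate for μ in 0:0.1:1, σ in 0.1:0.1:2
    plot(d2(μ, σ), xlims=(0.0, 3.0), ylims=(0.0, 1.6))
end
```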




##### Log-likelihood Function of the Normal - Truncated Normal Model



$$L_i = -\frac{1}{2}\ln (\sigma_v^2 + \sigma_u^2) + \ln
\phi\left(\frac{\mu + \epsilon_i}{\sqrt{\sigma_v^2 + \sigma_u^2}}
\right) + \ln \Phi\left(\frac{\mu_{*i}}{\sigma_*} \right) - \ln
\Phi\left(\frac{\mu}{\sigma_u} \right),$$

where

$$\begin{aligned}
\mu_{*i}  = \frac{\sigma_v^2 \mu - \sigma_u^2 \epsilon_i}{\sigma_v^2
 + \sigma_u^2}, \qquad
\sigma_*^2 = \frac{\sigma_v^2 \sigma_u^2}{\sigma_v^2 +
\sigma_u^2}.
\end{aligned}$$
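
The sketch below codes this log-likelihood and, as an illustration of the observation-specific indices, the conditional expectations $E[u_i|\epsilon_i]$ and $E[e^{-u_i}|\epsilon_i]$. The closed forms used for the two indices are not derived in these notes; they are the standard results from Jondrow et al. (1982) and Battese and Coelli (1988), which follow from $u_i|\epsilon_i \sim N^+(\mu_{*i}, \sigma_*^2)$. All function names and parameter values are hypothetical.

```julia
using Distributions

# μ_{*i} and σ_* as defined above (μ = 0 gives the half-normal case)
mu_star(ϵ, μ, σv, σu) = (σv^2 * μ - σu^2 * ϵ) / (σv^2 + σu^2)
sigma_star(σv, σu)    = sqrt(σv^2 * σu^2 / (σv^2 + σu^2))

# Per-observation log-likelihood of the normal / truncated-normal model
function loglik_truncnormal(ϵ, μ, σv, σu)
    σ2 = σv^2 + σu^2
    return -0.5 * log(σ2) +
           logpdf(Normal(), (μ + ϵ) / sqrt(σ2)) +
           logcdf(Normal(), mu_star(ϵ, μ, σv, σu) / sigma_star(σv, σu)) -
           logcdf(Normal(), μ / σu)
end

# Inefficiency index E[u | ϵ] (Jondrow et al. 1982)
function jlms(ϵ, μ, σv, σu)
    μs, σs = mu_star(ϵ, μ, σv, σu), sigma_star(σv, σu)
    z = μs / σs
    return μs + σs * pdf(Normal(), z) / cdf(Normal(), z)
end

# Efficiency index E[exp(-u) | ϵ] (Battese and Coelli 1988)
function bc(ϵ, μ, σv, σu)
    μs, σs = mu_star(ϵ, μ, σv, σu), sigma_star(σv, σu)
    z = μs / σs
    return exp(-μs + 0.5 * σs^2) * cdf(Normal(), z - σs) / cdf(Normal(), z)
end

μ, σv, σu = 0.3, 0.2, 0.4               # hypothetical parameter values
for ϵ in (-0.6, -0.2, 0.0, 0.2)
    println("ϵ = ", ϵ,
            "  L_i = ", round(loglik_truncnormal(ϵ, μ, σv, σu), digits=3),
            "  E[u|ϵ] = ", round(jlms(ϵ, μ, σv, σu), digits=3),
            "  E[exp(-u)|ϵ] = ", round(bc(ϵ, μ, σv, σu), digits=3))
end
```

Setting `μ = 0` reproduces the half-normal log-likelihood, since $-\ln \Phi(0) = -\ln(1/2)$. More negative values of $\epsilon_i$ yield a larger estimated inefficiency and a lower estimated efficiency. For very negative $\mu_{*i}/\sigma_*$ the ratio `pdf / cdf` can become numerically unstable, so a production implementation would likely work in logs.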



### Other distributional assumptions used in the literature

-  exponential distribution:

    -  Has only 1 parameter, easy to estimate, very similar to the
        half-normal distribution.

-  Gamma distribution (Greene):

    -  Difficult to estimate (flat likelihood surface).

###### Why aren't there more distribution combinations? Mainly because a closed-form likelihood function is not available.