# Monte Carlo Integration and Variance Reduction {#mcint}
## Monte Carlo Integration
Consider the problem of integrating an integrable function $g(x)$. We want to compute
$$\theta \equiv \int_a^b g(x) dx$$
<!-- For instance, $g(x) = e^{x^2}$ -->
<!-- ```{example, mcex} -->
<!-- $$\int_0^1 e^{x^2} dx$$ -->
<!-- ``` -->
<!-- It seems tricky to compute the integral \@ref(exm:mcex) analytically even though possible. So we implement *simulation* concept here, based on the following theorems. -->
For instance, consider the standard normal cdf.
```{example, mcex, name = "Standard normal cdf"}
Compute values for
$$\Phi(x) = \int_{-\infty}^x \frac{1}{\sqrt{2\pi}}\exp\bigg(-\frac{t^2}{2}\bigg)dt$$
```
It is essentially impossible to compute this integral by hand, so we rely on the *simulation* concept here, based on the following theorems.
```{theorem, wlaw, name = "Weak Law of Large Numbers"}
Suppose that $X_1, \ldots, X_n \iid (\mu, \sigma^2 < \infty)$. Then
$$\frac{1}{n}\sum_{i = 1}^n X_i \stackrel{p}{\rightarrow} \mu$$
Let $g$ be a measurable function with $E\lvert g(X_1) \rvert < \infty$. Then
$$\frac{1}{n}\sum_{i = 1}^n g(X_i) \stackrel{p}{\rightarrow} E\big[g(X_1)\big]$$
```
```{theorem, slaw, name = "Strong Law of Large Numbers"}
Suppose that $X_1, \ldots, X_n \iid (\mu, \sigma^2 < \infty)$. Then
$$\frac{1}{n}\sum_{i = 1}^n X_i \stackrel{a.s.}{\rightarrow} \mu$$
Let $g$ be a measurable function with $E\lvert g(X_1) \rvert < \infty$. Then
$$\frac{1}{n}\sum_{i = 1}^n g(X_i) \stackrel{a.s.}{\rightarrow} E\big[g(X_1)\big]$$
```
### Simple Monte Carlo estimator
```{theorem, mcint, name = "Monte Carlo Integration"}
Consider the integral in Equation \@ref(eq:muint). It can be approximated, for an appropriate pdf $f(x)$ and a sample $X_1, \ldots, X_N \iid f$, by
$$\hat\theta = \frac{1}{N} \sum_{i = 1}^N g(X_i)$$
```
Suppose that the support of $g$ is $spt\, g = (a, b)$ and that we use a *uniform distribution* as $f$. After a change of variables we can work with $U \sim unif(0, 1)$:
\begin{equation}
\begin{split}
\theta & \equiv \int_{spt g} g(x) dx \\
& = \int_a^b g(x) dx \\
& = \int_0^1 g(a + (b - a)t)(b - a) dt \\
& \equiv \int_0^1 h(t) dt \\
& = \int_0^1 h(t) I_{(a, b)}(t) dt \\
& = E[h(U)] \qquad U \sim unif(0, 1)
\end{split}
(\#eq:muint)
\end{equation}
By *the Strong Law of Large Numbers* (Theorem \@ref(thm:slaw)),
$$\frac{1}{n}\sum_{i = 1}^n h(U_i) \stackrel{a.s.}{\rightarrow} E\Big[h(U)\Big] = \theta$$
where $U_i \iid unif(0, 1)$. Thus, only two things remain to be done (see the sketch after this list):
1. represent $g$ as $h$,
2. generate many $U_i$.
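A minimal sketch of this recipe in base R (the function `mc_integrate` and the test integrand are illustrative, not from the text):
```{r}
# Sketch: simple MC integration of g over (a, b) via unif(0, 1)
mc_integrate <- function(g, a, b, N = 10000) {
  u <- runif(N)                       # U_i ~ unif(0, 1)
  h <- (b - a) * g(a + (b - a) * u)   # h(U_i) = (b - a) g(a + (b - a) U_i)
  mean(h)                             # (1 / N) * sum h(U_i)
}
# e.g. integrate x^2 over (0, 3); the true value is 9
mc_integrate(function(x) x^2, a = 0, b = 3)
```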
<!-- Go back to Example \@ref(exm:mcex). -->
<!-- ```{solution} -->
<!-- \begin{equation*} -->
<!-- \begin{split} -->
<!-- \theta & \equiv \int_0^1 e^{x^2} dx \\ -->
<!-- & = \int_0^1 \frac{e^{x^2}}{f(x)}f(x) dx \qquad f(x) = \frac{e^x}{e - 1} : pdf \\ -->
<!-- & = \int_0^1 (e - 1)\exp(x^2 - x)f(x)dx \\ -->
<!-- & \approx \frac{1}{M} \sum_{m = 1}^M (e - 1)\exp(X_m^2 - X_m) -->
<!-- \end{split} -->
<!-- \end{equation*} -->
<!-- Then generate $X_1, \ldots, X_M \sim f(x)$. -->
<!-- Let $F(X_1), \ldots, F(X_M) \iid unif(0, 1)$ where -->
<!-- $$F(x) = \int_0^x f(t)dt = \frac{e^x - 1}{e - 1}$$ -->
<!-- i.e. $U_1 = \frac{e^{X_1} - 1}{e - 1}, \ldots, U_M = \frac{e^{X_M} - 1}{e - 1} \iid unif(0, 1)$. Hence, -->
<!-- $$X_m = \ln (1 + (e - 1) U_m)$$ -->
<!-- ``` -->
<!-- i.e. -->
<!-- 1. $u_1, \ldots, u_M \iid unif(0,1)$ -->
<!-- 2. $x_i = \ln (1 + (e - 1) u_i)$ -->
<!-- ```{r} -->
<!-- x <- log(1 + (exp(1) - 1) * runif(10000)) -->
<!-- mean((exp(1) - 1) * exp(x^2 - x)) -->
<!-- ``` -->
<!-- This method is also helpful solving high-dimensional problem. -->
<!-- ```{example, mcex2, name = "Higher dimensional problem"} -->
<!-- $$\int_0^1 \int_0^1 e^{(x_1 + x_2)^2} dx_1 dx_2$$ -->
<!-- ``` -->
<!-- ```{solution} -->
<!-- \begin{equation*} -->
<!-- \begin{split} -->
<!-- I & \equiv \int_0^1 \int_0^1 e^{(x_1 + x_2)^2} dx_1 dx_2 \\ -->
<!-- & = \int_0^1\int_0^1 \frac{e^{(x_1 + x_2)^2}}{f(x_1, x_2)}f(x_1, x_2) dx_1dx_2 \qquad f(x) = \frac{e^{(x_1 + x_2)}}{(e - 1)^2} = \frac{e^{x_1}}{e - 1} + \frac{e^{x_2}}{e - 1} \\ -->
<!-- & = \int_0^1\int_0^1 (e - 1)^2\exp((x_1 + x_2)^2 - x_1 - x_2)f(x_1, x_2)dx_1dx_2 \\ -->
<!-- & \approx \frac{1}{M} \sum_{m = 1}^M (e - 1)^2\exp((X_{1m} + X_{2m})^2 - X_{1m} - X_{2m}) -->
<!-- \end{split} -->
<!-- \end{equation*} -->
<!-- ``` -->
<!-- Hence, -->
<!-- 1. $u_{1m}, u_{2m} \sim unif(0,1), \quad m = 1, \ldots, M$ -->
<!-- 2. $x_{jm} = \ln(1 + (e - 1)u_{jm}), \quad j = 1, 2, \quad m = 1, \ldots, M$ -->
<!-- ```{r} -->
<!-- tibble( -->
<!-- x1 = log(1 + (exp(1) - 1) * runif(10000)), -->
<!-- x2 = log(1 + (exp(1) - 1) * runif(10000)) -->
<!-- ) %>% -->
<!-- summarise(int = mean((exp(1) - 1)^2 * exp((x1 + x2)^2 - x1 - x2))) -->
<!-- ``` -->
Go back to Example \@ref(exm:mcex).
```{solution}
Case 1: $x > 0$
Since the standard normal density is symmetric about $0$,
$$\Phi(0) = \frac{1}{2}$$
Fix $x > 0$.
\begin{equation*}
\begin{split}
\int_0^x \exp\bigg(-\frac{t^2}{2}\bigg) dt & = \int_0^x x\exp\bigg(-\frac{t^2}{2}\bigg)\frac{I_{(0, x)}(t)}{x} dt \\
& \approx \frac{1}{N} \sum_{i = 1}^N x\exp\bigg(-\frac{U_i^2}{2}\bigg)
\end{split}
\end{equation*}
with $U_1, \ldots, U_N \iid unif(0, x)$.
Case 2: $x \le 0$
Recall that, by symmetry, $\Phi(x) = 1 - \Phi(-x)$.
Hence,
$$
\hat\Phi(x) = \begin{cases}
\frac{1}{\sqrt{2\pi}} \frac{1}{N} \sum_{i = 1}^N x\exp\bigg(-\frac{U_i^2}{2}\bigg) + \frac{1}{2} \equiv \hat\theta(x) & x \ge 0 \\
1 - \hat\theta(-x) & x < 0
\end{cases}
$$
```
```{r}
# x: uniform sample U_1, ..., U_N ~ unif(0, |y|); y: the point at which Phi is evaluated
phihat <- function(x, y) {
  yi <- abs(y)
  # hat-theta(|y|) = mean(|y| * exp(-U_i^2 / 2)) / sqrt(2 * pi) + 1/2
  theta <- mean(yi * exp(-x^2 / 2)) / sqrt(2 * pi) + .5
  # symmetry: Phi(y) = 1 - Phi(-y) for y < 0
  ifelse(y >= 0, theta, 1 - theta)
}
```
Then compute $\hat\Phi(x)$ for various $x$ values.
```{r}
phi_simul <- foreach(y = seq(.1, 2.5, length.out = 10), .combine = bind_rows) %do% {
tibble(
x = y,
phi = pnorm(y),
Phihat =
tibble(x = runif(10000, max = y)) %>%
summarise(cdf = phihat(x, y = y)) %>%
pull()
)
}
```
```{r, echo=FALSE}
knitr::kable(phi_simul, col.names = c("x", "pnorm", "mc"), caption = "Simple MC estimates of Normal cdf for each x", longtable = TRUE)
```
### Hit-or-Miss Monte Carlo
The hit-or-miss approach is another way to evaluate integrals.
```{example, estpi, name = "Estimation of $\\pi$"}
Consider the unit circle in the plane,
$$x^2 + y^2 = 1$$
Since the upper half satisfies $y = \sqrt{1 - x^2}$,
\begin{equation}
\int_0^1 \sqrt{1 - t^2} dt = \frac{\pi}{4}
(\#eq:mcpi)
\end{equation}
```
By estimating Equation \@ref(eq:mcpi), we can estimate $\pi$, i.e.
$$\pi = 4 \int_0^1 \sqrt{1 - t^2} dt$$
Simple MC integration can also be used.
\begin{equation*}
\begin{split}
\int_0^1 \sqrt{1 - t^2} dt & = \int_0^1 \sqrt{1 - t^2} I_{(0,1)}(t) dt \\
& \approx \frac{1}{N} \sum_{i = 1}^N \sqrt{1 - U_i^2}
\end{split}
\end{equation*}
```{r}
circ <- function(x) {
4 * sqrt(1 - x^2)
}
```
```{r}
tibble(x = runif(10000)) %>%
summarise(mc_pi = mean(circ(x)))
```
On the other hand, the hit-or-miss MC method uses geometric probability.
```{r hmmc, echo=FALSE, fig.cap="Hit-or-Miss"}
tibble(x = seq(0, 1, by = .01)) %>%
mutate(y = sqrt(1 - x^2)) %>%
ggplot(aes(x, y)) +
geom_path() +
geom_ribbon(aes(ymin = 0, ymax = 1), alpha = .3) +
geom_area(fill = gg_hcl(1), alpha = .5)
```
See Figure \@ref(fig:hmmc). For each point, independently generate
- $X_i \iid unif(0,1)$
- $Y_i \iid unif(0,1)$
Then the proportion of points with $Y_i \le \sqrt{1 - X_i^2}$ estimates $\frac{\pi}{4}$.
```{r}
tibble(x = runif(10000), y = runif(10000)) %>%
summarise(hitormiss = mean(y <= sqrt(1 - x^2)) * 4)
```
## Variance and Efficiency
We have seen two approaches to the same task. Now we want to *evaluate them*. Note that simple Monte Carlo integration (Theorem \@ref(thm:mcint)) estimates the *expected value of some random variable*; a proportion, which approximates a probability, is the expected value of an indicator function.
A common criterion for comparing estimators of an expected value is their variance.
### Variance
Note that the variance of the sample mean is $Var(\overline{g(X)}) = \frac{Var(g(X))}{N}$. One way to estimate the variance of $\hat\theta$ uses this property:
\begin{equation}
\widehat{Var}(\hat\theta) = \frac{1}{N}\bigg( \frac{1}{N} \sum_{i = 1}^N \big(g(X_i) - \overline{g(X)}\big)^2 \bigg) = \frac{1}{N^2} \sum_{i = 1}^N \big(g(X_i) - \overline{g(X)}\big)^2
(\#eq:mcsamvar)
\end{equation}
For example,
```{r}
tibble(x = runif(10000)) %>%
summarise(mc_pi = var(circ(x)) / 10000)
```
However, this *variance of the sample mean* approach is what we use when the sample size is limited. We do not have to stick to it here: we can generate as many samples as we want. So we try another approach, a *parametric bootstrap*.
```{r mcintvar, echo=FALSE, fig.cap="Empircal distribution of $\\hat\\theta$"}
knitr::include_graphics("images/mcint.png")
```
See Figure \@ref(fig:mcintvar). Estimating $E\big[g(U)\big]$ with $U \sim unif(a, b)$ gives $\theta$. Generate $M$ samples $\{ U_1^{(j)}, \ldots, U_N^{(j)} \}$, $j = 1, \ldots, M$, from $unif(a, b)$. From each sample, calculate the MC estimate $\hat\theta^{(j)}$. Now we have $M$ MC estimates of $\theta$, which give an empirical distribution of $\hat\theta$. By *drawing a histogram*, we can see its shape.
\begin{algorithm}[H] \label{alg:algmcint}
\SetAlgoLined
\SetKwInOut{Input}{input}
\SetKwInOut{Output}{output}
\Input{$\theta = \int_a^b g(x) dx$}
\For{$m \leftarrow 1$ \KwTo $M$}{
Generate $U_1^{(m)}, \ldots, U_N^{(m)} \iid unif(a, b)$ \;
Compute $\hat\theta^{(m)} = \frac{(b - a)}{N} \sum_i g(U_i^{(m)})$\;
}
$\bar{\hat\theta} = \frac{1}{M} \sum_m \hat\theta^{(m)}$\;
$\widehat{Var}(\hat\theta) = \frac{1}{M - 1} \sum_m (\hat\theta^{(m)} - \bar{\hat\theta})^2$\;
\Output{$\widehat{Var}(\hat\theta)$}
\caption{Variance of $\hat\theta$}
\end{algorithm}
Since we have to generate a large amount of data, the `data.table` package will be used.
```{r, eval=FALSE}
library(data.table)
```
Grouped operations can be used: an additional column (`sam`) indicates the group, and the MC computation is carried out within each group. The following function generates the `data.table` before the grouped operation.
```{r}
mc_data <- function(rand, N = 10000, M = 1000, char = "s", ...) {
  # N draws per Monte Carlo sample, M samples; `sam` labels the sample (group)
  data.table(
    u = rand(n = N * M, ...),
    sam = gl(M, N, labels = paste0(char, 1:M))
  )
}
```
```{r}
pi_mc <-
mc_data(runif)[,
.(mc_pi = mean(circ(u))),
keyby = sam]
```
```{r smchis, fig.cap="Empirical distribution of $\\hat\\pi$ by simple MC"}
pi_mc %>%
ggplot(aes(x = mc_pi)) +
geom_histogram(bins = 30, col = gg_hcl(1), alpha = .7) +
xlab(expression(pi)) +
geom_vline(xintercept = pi, col = gg_hcl(2)[2])
```
As in Algorithm $\ref{alg:algmcint}$, we can compute the variance as below.
```{r}
(mc_var <-
pi_mc[,
.(mc_variance = var(mc_pi))])
```
On the other hand, we need to generate two sets of random numbers for hit-or-miss MC.
```{r}
pi_hit <-
mc_data(runif)[
, u2 := runif(10000 * 1000)
][,
.(hitormiss = mean(u2 <= sqrt(1 - u^2)) * 4),
keyby = sam]
```
```{r simhit, fig.cap="Simple MC and Hit-or-Miss MC"}
pi_mc[pi_hit] %>%
melt(id.vars = "sam", variable.name = "hat") %>%
ggplot(aes(x = value, fill = hat)) +
geom_histogram(bins = 30, alpha = .5, position = "identity") +
xlab(expression(pi)) +
geom_vline(xintercept = pi, col = I("red")) +
scale_fill_discrete(
name = "MC",
labels = c("Simple", "Hit-or-Miss")
)
```
```{r}
(hit_var <-
pi_hit[,
.(hit_variance = var(hitormiss))])
```
### Efficiency
See Figure \@ref(fig:simhit). It is obvious that the hit-or-miss estimate produces a larger variance than the simple MC estimate.
```{definition, eff, name = "Efficiency"}
Let $\hat\theta_1$ and $\hat\theta_2$ be two estimators for $\theta$. Then $\hat\theta_1$ is more efficient than $\hat\theta_2$ if
$$\frac{Var(\hat\theta_1)}{Var(\hat\theta_2)} < 1$$
```
In other words, if $\hat\theta_1$ has a smaller variance than $\hat\theta_2$, then $\hat\theta_1$ is said to be more efficient, which is preferable.
```{r, echo=FALSE}
mc_var %>%
bind_cols(hit_var) %>%
mutate(mc_efficient = (mc_variance / hit_variance < 1)) %>%
knitr::kable(
col.names = c("SimpleMC", "Hit-or-Miss", "SimpleMCefficiency"),
caption = "Simple MC versus Hit-or-Miss",
longtable = TRUE
)
```
## Variance Reduction
Consider Equation \@ref(eq:mcsamvar), based on $Var(\hat\theta) = \frac{\sigma^2}{N}$. This variance can always be reduced by increasing $N$, but we want to reduce it with less computation.
### Antithetic Variables
Consider correlated random variables $U_1$ and $U_2$. Then we have
$$Var\bigg( \frac{U_1 + U_2}{2} \bigg) = \frac{1}{4}\Big( Var(U_1) + Var(U_2) + 2Cov(U_1, U_2)\Big)$$
Look at the last term, $Cov(U_1, U_2)$. If we generate $U_1$ and $U_2$ negatively correlated, the variance is smaller than with an independent (i.i.d.) pair, for which
$$Var\bigg( \frac{U_1 + U_2}{2} \bigg) = \frac{1}{4}\Big( Var(U_1) + Var(U_2)\Big)$$
```{lemma, antiunif}
$U$ and $1 - U$ are identically distributed, but *negatively correlated*.
\begin{enumerate}
\item $U \sim unif(0,1) \Leftrightarrow 1 - U \sim unif(0,1)$
\item $Corr(U, 1 - U) = -1$
\end{enumerate}
```
This is a well-known property of the uniform distribution. Instead of generating $N$ uniform numbers, generate $\frac{N}{2}$ values $U_i$ and form the corresponding $\frac{N}{2}$ values $1 - U_i$. The resulting sequence is negatively correlated, so we can reduce the variance as mentioned.
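A quick numerical check of the lemma (a sketch; the integrand $e^u$ is an illustrative monotone choice, not from the text):
```{r}
# U and 1 - U are perfectly negatively correlated, so averaging h(U) and h(1 - U)
# beats averaging two independent draws for monotone h
u <- runif(10000)
cor(u, 1 - u)                          # exactly -1
var((exp(u) + exp(1 - u)) / 2)         # antithetic pair average
var((exp(u) + exp(runif(10000))) / 2)  # independent pair average (larger)
```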
When can we replace the previous numbers with these *antithetic variables*? We usually plug the numbers into some function $h$ to compute the Monte Carlo integral. The point is that our target is $h$, not $U$: $h(U)$ and $h(1 - U)$ should *still be negatively correlated*. Hence, $h$ should be a *monotone function*. A small counter-example follows below.
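To see why monotonicity matters, here is an illustrative non-monotone counter-example (not from the text): for $h(u) = (u - 1/2)^2$ we have $h(U) = h(1 - U)$, so the antithetic pair is perfectly *positively* correlated and nothing is gained.
```{r}
# Non-monotone counter-example: h(u) = (u - 1/2)^2
u <- runif(10000)
cor((u - 1/2)^2, ((1 - u) - 1/2)^2)  # +1: antithetic pairing does not help here
```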
```{corollary, antifun}
If $g = g(X_1, \ldots, X_n)$ is monotone, then
$$Y = g(F_X^{-1}(U_1), \ldots, F_X^{-1}(U_n))$$
and
$$Y^{\prime} = g(F_X^{-1}(1 - U_1), \ldots, F_X^{-1}(1 - U_n))$$
are negatively correlated.
```
\begin{algorithm}[H] \label{alg:alganti}
\SetAlgoLined
\SetKwInOut{Input}{input}
\SetKwInOut{Output}{output}
\Input{$h: monotonic$}
\For{$m \leftarrow 1$ \KwTo $M$}{
Generate $U_{1,1}^{(m)}, \ldots, U_{\frac{N}{2},1}^{(m)} \iid unif(0, 1)$\;
Set $U_{i,2}^{(m)} := 1 - U_{i,1}^{(m)} \iid unif(0, 1)$\;
$\{U_{i}^{(m)}\}_1^N = \{ U_{1,1}^{(m)}, \ldots, U_{{\frac{N}{2}},2}^{(m)} \}$\;
$\hat\theta^{(m)} = \frac{1}{N} \sum_i h(U_i^{(m)})$\;
}
$\bar{\hat\theta} = \frac{1}{M} \sum_m \hat\theta^{(m)}$\;
$\widehat{Var}(\hat\theta) = \frac{1}{M - 1} \sum_m (\hat\theta^{(m)} - \bar{\hat\theta})^2$\;
\Output{$\widehat{Var}(\hat\theta)$}
\caption{Variance of $\hat\theta$ using antithetic variables}
\end{algorithm}
Check Example \@ref(exm:mcex) again. We have tried to calculate
$$\Phi(x) = \int_{-\infty}^x \frac{1}{\sqrt{2\pi}}\exp\bigg(-\frac{t^2}{2}\bigg)dt$$
using simple monte carlo. To make the support $(0, 1)$, let $y = \frac{t}{x}$ be a change of variable. Then
\begin{equation*}
\begin{split}
\int_0^x \exp\bigg(-\frac{t^2}{2}\bigg) dt & = \int_0^1 x\exp\bigg(-\frac{(xy)^2}{2}\bigg) dy \\
& \approx \frac{1}{N} \sum_{i = 1}^N x\exp\bigg(-\frac{(xU_i)^2}{2}\bigg)
\end{split}
\end{equation*}
```{r}
# x: uniform sample U_1, ..., U_N ~ unif(0, 1); y: the point at which Phi is evaluated
phiunif <- function(x, y) {
  yi <- abs(y)
  # hat-theta(|y|) = mean(|y| * exp(-(|y| U_i)^2 / 2)) / sqrt(2 * pi) + 1/2
  theta <- mean(yi * exp(-(yi * x)^2 / 2)) / sqrt(2 * pi) + .5
  # symmetry for negative y
  ifelse(y >= 0, theta, 1 - theta)
}
```
Consider $\Phi(2)$.
```{r}
phi2 <-
mc_data(runif)[,
.(p2 = phiunif(u, y = 2)),
keyby = sam]
```
Now apply antithetic variables.
```{r}
phi2_anti <-
mc_data(runif, N = 10000 / 2)[,
u2 := 1 - u] %>%
melt(id.vars = "sam", value.name = "U") %>%
.[,
.(anti_p2 = phiunif(U, y = 2)),
keyby = sam]
```
```{r svsanti, fig.cap="Use of antithetic variables"}
phi2[phi2_anti] %>%
melt(id.vars = "sam", variable.name = "hat") %>%
ggplot(aes(x = value, fill = hat)) +
geom_histogram(bins = 30, alpha = .5, position = "identity") +
xlab(expression(Phi(2))) +
geom_vline(xintercept = pnorm(2), col = I("red")) +
scale_fill_discrete(
name = "MC",
labels = c("Simple", "Antithetic")
)
```
Clearly, the variance has been reduced.
```{r}
phi2[phi2_anti] %>%
melt(id.vars = "sam", variable.name = "hat") %>%
.[,
.(variance = var(value)),
by = hat]
```
### Control Variates
Recall that in MC integration we are trying to estimate $\theta = Eg(X)$. Consider another output random variable $f(X)$ whose mean $\mu_f \equiv Ef(X)$ is known. It is obvious that
\begin{equation}
\hat\theta_c = g(X) + c\Big(f(X) - \mu_f\Big)
(\#eq:controlest)
\end{equation}
is an unbiased estimator for $\theta$ for any $c \in \R$. Then we have
\begin{equation}
Var\hat\theta_c = Varg(X) + c^2 Varf(X) + 2cCov(g(X), f(X))
(\#eq:controlvar)
\end{equation}
Recall that our goal is to minimize $Var\hat\theta_c$. Which value of $c$ should we choose? Note that Equation \@ref(eq:controlvar) is a quadratic function of $c$.
\begin{equation}
\begin{split}
Var\hat\theta_c & = Varf(X) c^2 + 2cCov(g(X), f(X)) + Varg(X) \\
& = Var f(X) \bigg( c + \frac{Cov(g(X), f(X))}{Var f(X)} \bigg)^2 + Var g(X) - \frac{Cov(g(X), f(X))^2}{Var f(X)}
\end{split}
(\#eq:controlc)
\end{equation}
From Equation \@ref(eq:controlc), the variance is minimized at
\begin{equation}
c^{\ast} = - \frac{Cov(g(X), f(X))}{Var f(X)}
(\#eq:controlsol)
\end{equation}
with minimum variance
$$Var\hat\theta_{c^{\ast}} = Var g(X) - \frac{Cov(g(X), f(X))^2}{Var f(X)}$$
In this way we reduce the variance of the estimate as much as possible (given $f(X)$). Here, $f(X)$ is called a *control variate* for $g(X)$.
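Equivalently, writing $\rho = Corr(g(X), f(X))$, the minimum variance can be expressed as
$$Var\hat\theta_{c^{\ast}} = (1 - \rho^2)\, Var g(X)$$
so the reduction is large exactly when the control variate is highly correlated with $g(X)$.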
\begin{algorithm}[H] \label{alg:algcontrol}
\SetAlgoLined
\SetKwInOut{Input}{input}
\SetKwInOut{Output}{output}
\Input{$g$, control variate $f$ with mean $\mu_f$}
\For{$m \leftarrow 1$ \KwTo $M$}{
Generate $U_1^{(m)}, \ldots, U_N^{(m)} \iid unif$\;
Set $g = g(U_i)$ and $f = f(U_i)$\;
Compute $\hat{c}^{\ast(m)} = - \frac{\widehat{Cov}(g, f)}{\widehat{Var}(f)}$\;
$\hat\theta_{c^{\ast}}^{(m)} = g + \hat{c}^{\ast(m)}(f - \mu_f)$\;
}
$\bar{\hat\theta} = \frac{1}{M} \sum_m \hat\theta_{c^{\ast}}^{(m)}$\;
$\widehat{Var}(\hat\theta) = \frac{1}{M - 1} \sum_m (\hat\theta_{c^{\ast}}^{(m)} - \bar{\hat\theta})^2$\;
\Output{$\widehat{Var}(\hat\theta)$}
\caption{Variance of $\hat\theta$ using control variates}
\end{algorithm}
```{example, intex, name = "Variance reduction by control variate"}
Apply simple MC, the antithetic variate, and the control variate to
$$\int_0^1 e^x dx$$
```
```{r}
N <- 100
M <- 1000
```
Note that the true value is
$$\int_0^1 e^x dx = e - 1 = `r exp(1) - 1`$$
We will compare each estimate to this value.
```{solution, name = "Simple MC"}
We only need $U \sim unif(0, 1)$.
\begin{equation}
\begin{split}
\theta & = \int_0^1 e^x dx \\
& = \int_0^1 e^x I_{(0,1)}(x)dx \\
& \approx \frac{1}{N} \sum_{i = 1}^N e^{u_i}, \qquad u_i \iid unif(0, 1)
\end{split}
(\#eq:mcsol)
\end{equation}
```
```{r}
theta_sim <-
mc_data(runif, N = N, M = M)[,
.(mc = mean(exp(u))),
keyby = sam]
```
```{r, echo=FALSE}
theta_sim
```
```{solution, name = "Antithetic variate"}
For $N^{\prime} = \frac{N}{2}$, consider $u_1, \ldots, u_{N^{\prime}} \iid unif(0,1)$ and the corresponding $1 - u_{1}, \ldots, 1 - u_{N^{\prime}} \iid unif(0,1)$.
See Equation \@ref(eq:mcsol). Then we can compute antithetic estimator by
\begin{equation*}
\begin{split}
\hat\theta_A & = \frac{1}{N} \sum_{i = 1}^{N / 2} \Big( e^{u_i} + e^{1 - u_i} \Big) \\
& = \frac{1}{N / 2} \sum_{i = 1}^{N / 2} \bigg( \frac{e^{u_i} + e^{1 - u_i}}{2} \bigg) \\
& = \text{sample mean}
\end{split}
\end{equation*}
```
```{r}
theta_anti <-
mc_data(runif, N = N / 2, M = M)[,
u2 := 1 - u][,
.(anti = mean((exp(u) + exp(u2)) / 2)),
keyby = sam]
```
Now look at the results of the two.
```{r compantihat, fig.cap="Antithetic variate estimator"}
theta_sim[theta_anti] %>%
melt(id.vars = "sam", variable.name = "simul", value.name = "integral") %>%
ggplot(aes(x = integral, fill = simul)) +
geom_histogram(bins = 30, position = "identity", alpha = .5) +
xlab(expression(theta)) +
geom_vline(xintercept = exp(1) - 1, col = I("red")) +
scale_fill_discrete(
name = "MC",
labels = c("Simple", "Antithetic")
)
```
It is clear that the antithetic variate has reduced the variance.
```{solution, name = "Control variate"}
Consider
$$g(U) = e^U$$
and
$$f(U) = U$$
with $U \sim unif(0, 1)$.
Note that
$$E(U) = \frac{1}{2}$$
Then
$$\hat\theta_C = e^U + c\bigg(U - \frac{1}{2}\bigg)$$
is an unbiased estimator of $\theta = \int_0^1 e^xdx$.
To reduce variance, we need to set $c$ to be
$$c^{\ast} = -\frac{Cov(e^U, U)}{Var(U)}$$
```
In general we do not know this value exactly, so we estimate it from each Monte Carlo sample.
```{r}
theta_con <-
mc_data(runif, N = N, M = M)[,
chat := - cov(exp(u), u) / var(u),
by = sam][,
.(con = mean(exp(u) + chat * (u - 1 / 2))),
keyby = sam]
```
```{r}
thetahat <- theta_sim[theta_anti][theta_con]
```
```{r compcontrol, fig.cap="Use of Control variable"}
thetahat %>%
melt(id.vars = "sam", variable.name = "simul", value.name = "integral") %>%
ggplot(aes(x = integral, fill = simul)) +
geom_histogram(bins = 30, position = "identity", alpha = .5) +
xlab(expression(theta)) +
geom_vline(xintercept = exp(1) - 1, col = I("red")) +
scale_fill_discrete(
name = "MC",
labels = c("Simple", "Antithetic", "Control")
)
```
The control variate appears to have a smaller variance, but what is more important is that both methods have successfully reduced it.
```{r}
thetahat[,
         lapply(.SD, sd),
         .SDcols = -"sam"]
```
### Antithetic variate as control variate
Both the antithetic variate and the control variate reduce variance by exploiting the covariance between two random variables. In fact, *the antithetic variate is a special case of the control variate*. See Equation \@ref(eq:controlest).
```{lemma}
The control variate estimator is a linear combination of unbiased estimators of $\theta$.
```
Consider any two unbiased estimators $\hat\theta_1$ and $\hat\theta_2$ for $\theta = Eg(X)$. Build a control variate estimator as follows.
$$\hat\theta_c = c\hat\theta_1 + (1 - c)\hat\theta_2$$
It is obvious that $\hat\theta_c$ is also unbiased for $\theta$ for every $c \in \R$.
\begin{equation}
Var(\hat\theta_c) = Var(\hat\theta_2) + c^2 Var(\hat\theta_1 - \hat\theta_2) + 2c Cov(\hat\theta_2, \hat\theta_1 - \hat\theta_2)
(\#eq:controlvar2)
\end{equation}
Let $\hat\theta_1$ and $\hat\theta_2$ be the antithetic choice. Recall that the *antithetic variate* construction gives identically distributed $\hat\theta_1$ and $\hat\theta_2$ with
$$Corr(\hat\theta_1, \hat\theta_2) = -1$$
It follows that
$$Cov(\hat\theta_1, \hat\theta_2) = - Var(\hat\theta_1)$$
and that
$$Var(\hat\theta_c) = (4c^2 - 4c + 1)Var(\hat\theta_1)$$
Hence, the variance is minimized at $c^{\ast} = \frac{1}{2}$, giving
$$\hat\theta_{c^{\ast}} = \frac{\hat\theta_1 + \hat\theta_2}{2}$$
which is exactly the estimator we have been using with antithetic variates.
### Several control variates
To summarize, the control variate approach reduces variance by combining unbiased estimators of the target parameter. So far we have used one variate $f(X)$. It is possible to extend this to multiple variates, say $f_1(X), \ldots, f_k(X)$. Thanks to the linearity of expectation,
$$\hat\theta_c = g(X) + \sum_{i = 1}^k c_i \Big(f_i (X) - \mu_i\Big)$$
is also an unbiased estimator, where $\mu_i = E f_i(X)$. How do we get each $c_i^{\ast}$? Rather than using variances and covariances directly, we can *fit a linear regression*.
### Control variates and regression
See Equation \@ref(eq:controlest) and Equation \@ref(eq:controlsol). Notice that we were in effect estimating a linear regression coefficient by least squares.
```{lemma, lse, name = "Least squares estimator"}
Consider $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$. Then
$$\hat\beta_1 = \frac{\sum(X_i - \overline{X})(Y_i - \overline{Y})}{\sum (X_i - \overline{X})^2} = \frac{\widehat{Cov}(X, Y)}{\widehat{Var}(X)}$$
```
The control variate estimator $\hat\theta_c = g(X) + c\Big(f(X) - \mu_f\Big)$ can be expressed through the regression model $g(X) = \beta_0 + \beta_1 f(X) + \epsilon$, so that
$$Eg(X) = \beta_0 + \beta_1 E f(X)$$
Then
\begin{equation}
\hat\beta_1 = \text{LSE of}\: g(X) \:\text{on}\: f(X) = \frac{\widehat{Cov}(g(X), f(X))}{\widehat{Var}(f(X))} = -\hat{c}^{\ast}
(\#eq:controlbeta)
\end{equation}
Note that
$$\hat\beta_0 = \overline{g(X)} + \hat{c}^{\ast} \overline{f(X)}$$
This matches $\hat\theta_{c^{\ast}}$ from the previous section:
\begin{equation}
\hat\beta_0 + \hat\beta_1 \mu_f = \overline{g(X)} + \hat{c}^{\ast} (\overline{f(X)} - \mu_f) = \hat\theta_{c^{\ast}}
(\#eq:controlreg)
\end{equation}
Also, we can get the error variance estimate
$$\hat\sigma^2 = \widehat{Var}\big(g(X) + \hat{c}^{\ast}f(X)\big) = MSE$$
and
$$\widehat{Var}(\hat\theta_{c^{\ast}}) = \frac{\hat\sigma^2}{N}$$
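In R this is directly available from the fitted model. A short sketch, using $g(U) = e^U$ and $f(U) = U$ as in the example (the object name `fit` is illustrative):
```{r}
# Error-variance estimate of the control variate estimator from the regression fit
u <- runif(10000)
fit <- lm(exp(u) ~ u)               # regress g(U) on the control variate f(U) = U
summary(fit)$sigma^2 / length(u)    # sigma-hat^2 / N estimates Var(theta-hat_c*)
```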
From Example \@ref(exm:intex), we can change the code computing $c^{\ast}$ from `- cov(exp(u), u) / var(u)` to `- lm(exp(u) ~ u)$coef[2]`, since $\hat\beta_1 = -\hat{c}^{\ast}$ by Equation \@ref(eq:controlbeta).
```{r}
mc_data(runif, N = N, M = M) %>%
.[,
chat := - lm(exp(u) ~ u)$coef[2],
by = sam] %>%
.[,
.(con = mean(exp(u) + chat * (u - 1 / 2))),
by = sam]
```
In fact, we can use Equation \@ref(eq:controlreg) directly by predicting the fitted regression at the known mean $\mu_f = \frac{1}{2}$: `predict(lm(exp(u) ~ u), newdata = data.frame(u = 1 / 2))`.
\begin{algorithm}[H] \label{alg:algconreg}
\SetAlgoLined
\SetKwInOut{Input}{input}
\SetKwInOut{Output}{output}
\Input{$g$, control variate $f$ with mean $\mu_f$}
\For{$m \leftarrow 1$ \KwTo $M$}{
Generate $U_1^{(m)}, \ldots, U_N^{(m)} \iid unif$\;
Set $g = g(U_i)$ and $f = f(U_i)$\;
Regression $g \sim f$\;
Predict the regression at $\mu_f$. This gives $\hat\theta_{c^{\ast}}^{(m)}$\;
}
$\bar{\hat\theta} = \frac{1}{M} \sum_m \hat\theta_{c^{\ast}}^{(m)}$\;
$\widehat{Var}(\hat\theta) = \frac{1}{M - 1} \sum_m (\hat\theta_{c^{\ast}}^{(m)} - \bar{\hat\theta})^2$\;
\Output{$\widehat{Var}(\hat\theta)$}
\caption{Control variables and regression}
\end{algorithm}
```{r}
mc_data(runif, N = N, M = M)[,
.(con = predict(lm(exp(u) ~ u, data = .SD),
newdata = data.table(u = 1 / 2))),
by = sam]
```
Now, how do we deal with multiple control variates? Consider the model
$$g(X) = \beta_0 + \sum_{i = 1}^k \beta_i f_i(X) + \epsilon$$
Using a *multiple linear regression* model, we can choose the optimal $c_i^{\ast}$ and compute the control variate estimate.
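As a sketch of the multiple-variate case (the control variates $f_1(U) = U$ and $f_2(U) = U^2$, with known means $\frac{1}{2}$ and $\frac{1}{3}$, are an illustrative choice for $\int_0^1 e^x dx$, not from the text):
```{r}
# Two control variates, fitted by multiple linear regression
u  <- runif(10000)
f1 <- u      # E[f1(U)] = 1/2
f2 <- u^2    # E[f2(U)] = 1/3
fit <- lm(exp(u) ~ f1 + f2)
# control variate estimate: predict the regression at the known means
predict(fit, newdata = data.frame(f1 = 1 / 2, f2 = 1 / 3))
```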
## Importance Sampling
Simple MC computes
$$\int_A g(x) f(x) dx = E g(X) = \theta$$
for some density function $f$. This method uses random number from $f$ itself so that
$$\int_A g(x) f(x) dx \approx \frac{1}{N} \sum_{i = 1}^N g(X_i)$$
where $X_1, \ldots, X_N \iid f$. This is why MC integration is called *direct sampling*. Sometimes, however, we face a distribution that is hard to handle: generating from $f$ directly is not easy, and even when we can, it can be inefficient. The solution is an *indirect method*: draw a sample from another pdf and reweight. This is called **importance sampling**.
### Importance sampling
Consider MC integration as before.
$$\int_A g(x) f(x) dx = E g(X) = \theta$$
Why not use uniform random numbers with simple MC as before? Uniform random numbers do not apply to unbounded intervals, and when the integrand is far from flat, generating numbers uniformly can be inefficient.
\begin{equation}
\begin{split}
E_f g(X) & = \int_A g(x) f(x) dx \\
& = \int_A g(x) \frac{f(x)}{\phi(x)}\phi(x) dx, \qquad \phi: \text{density on}\: A \\
& = E_{\phi}\frac{g(X)f(X)}{\phi(X)} \\
& \approx \frac{1}{N} \sum_{i = 1}^N \frac{g(X_i)f(X_i)}{\phi(X_i)}, \qquad X_i \iid \phi
\end{split}
(\#eq:impsample)
\end{equation}
Here, $\phi$ is called the *envelope* or the *importance sampling function*. The identity above is simple arithmetic, so any density $\phi$ positive on $A$ could be used; however, we should choose a good one. Typically, one should select $\phi$ so that
\begin{equation}
\phi(x) \approx \lvert g(x) \rvert f(x) \quad \text{on}\: A
(\#eq:goodphi)
\end{equation}
with finite variance.
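As a minimal sketch of Equation \@ref(eq:impsample) (the target $g$, density $f$, and envelope $\phi$ below are illustrative choices, not those of the example that follows):
```{r}
# Importance sampling sketch: estimate E_f[g(X)] with f = N(0, 1) and g(x) = x^2,
# drawing X_i from a heavier-tailed envelope phi = standard Cauchy
x <- rcauchy(10000)           # X_i ~ phi
w <- dnorm(x) / dcauchy(x)    # weights f(X_i) / phi(X_i)
mean(x^2 * w)                 # approximates E_f[X^2] = 1
```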
```{example, whatphi, name = "Choice of importance function"}
Obtain MC estimate of
$$\int_0^1 \frac{e^{-x}}{1 + x^2} dx$$
by importance sampling.
```
```{r}
g_target <- function(x) {
exp(-x - log(1 + x^2)) * (x > 0) * (x < 1)
}
```
Consider the candidate envelopes
$$
\begin{cases}
\phi_0(x) = 1, & 0 < x < 1 \\
\phi_1(x) = e^{-x}, & 0 < x < \infty \\
\phi_2(x) = \frac{1}{\pi(1 + x^2)}, & x \in \R \\
\phi_3(x) = \frac{e^{-x}}{1 - e^{-1}}, & 0 < x < 1 \\
\phi_4(x) = \frac{4}{\pi(1 + x^2)} & 0 < x < 1
\end{cases}
$$
```{r}
f0 <- function(x) {
(x > 0) * (x < 1)
}
#------------------
f1 <- function(x) {
exp(-x) * (x > 0)
}
#------------------
f2 <- function(x) {
1 / (pi * (1 + x^2))
}
#------------------
f3 <- function(x) {
exp(-x) / (1 - exp(-1)) * (x > 0) * (x < 1)
}
#------------------
f4 <- function(x) {
4 / (pi * (1 + x^2)) * (x > 0) * (x < 1)
}
```
```{r candimp, fig.cap="Importance functions $\\phi_0, \\ldots, \\phi_4$"}
tibble(x = seq(.01, .99, by = .01)) %>%
mutate_all(.funs = list(~g_target(.), ~f0(.), ~f1(.), ~f2(.), ~f3(.), ~f4(.))) %>%
gather(-x, key = "funs", value = "value") %>%
ggplot(aes(x = x, y = value, colour = funs)) +
geom_path()
```
Each importance function is drawn in Figure \@ref(fig:candimp). $f_1$ shows a pattern similar to $g$.
### Variance in importance sampling
From Equation \@ref(eq:impsample),
$$\theta = \int_A g(x)dx = \int_A \frac{g(x)}{\phi(x)}\phi(x) dx = E\bigg[ \frac{g(X)}{\phi(X)} \bigg] \approx \frac{1}{N} \sum \frac{g(X_i)}{\phi(X_i)}$$
where $X_1, \ldots, X_N \iid \phi$. Then, writing $\hat\theta$ for a single term $\frac{g(X_1)}{\phi(X_1)}$ (the variance of the average is this divided by $N$),
\begin{equation*}
\begin{split}
Var\hat\theta & = E\hat\theta^2 - (E\hat\theta)^2 \\
& = \int_A \bigg(\frac{g(x)}{\phi(x)}\bigg)^2 \phi(x) dx - \theta^2 \\
& = \int_A \frac{g(x)^2}{\phi(x)} dx - \theta^2
\end{split}
\end{equation*}
Hence, the minimum variance
$$\bigg( \int_A \lvert g(x) \rvert dx \bigg)^2 - \theta^2$$
is obtained when
$$\phi(x) = \frac{\lvert g(x) \rvert}{\int_A \lvert g(x) \rvert dx}$$
But we do not know the value of the denominator. It may be hard to obtain the exact function giving the minimum variance, but choosing $\phi$ close to the shape of $\lvert g \rvert$ produces good results. To check our criterion \@ref(eq:goodphi) more clearly, compute $\frac{g}{\phi_i}$.
```{r candimp2, fig.cap="Ratio $\\frac{g}{\\phi_i}$"}
tibble(x = seq(.01, .99, by = .01)) %>%
mutate_all(.funs = list(~g_target(.), ~f0(.), ~f1(.), ~f2(.), ~f3(.), ~f4(.))) %>%
gather(-x, -g_target, key = "funs", value = "value") %>%
mutate(value = g_target / value) %>%
ggplot(aes(x = x, y = value, colour = funs)) +
geom_path()
```
Which ratio is the closest to $1$? $f_1$, of course. Would this function produce the best result, i.e. the smallest variance?
```{r}
theta_imp0 <-
mc_data(runif, N = 100)[,