## 21. Nonparametric Curve Estimation

In this chapter we discuss the nonparametric estimation of probability density functions and regression functions, which we refer to as **curve estimation**.

In Chapter 8 we saw it is possible to consistently estimate a cumulative distribution function $F$ without making any assumptions about $F$.  If we want to estimate a probability density function $f(x)$ or a regression function $r(x) = \mathbb{E}(Y | X = x)$ the situation is different.  We cannot estimate these functions consistently without making some smoothness assumptions.  Correspondingly, we will perform some sort of smoothing operation with the data.

A simple example of a density estimator is a **histogram**.  To form a histogram estimator of a density $f$, we divide the real line into disjoint sets called **bins**.  The histogram estimator is a piecewise constant function where the height of the function is proportional to the number of observations in each bin.  The number of bins is an example of a **smoothing parameter**.  If we smooth too much (large bins) we get a highly biased estimator, while if we smooth too little (small bins) we get a highly variable estimator.  Much of curve estimation is concerned with trying to optimally balance variance and bias.

### 21.1 The Bias-Variance Tradeoff

Let $g$ denote an unknown function and let $\hat{g}_n$ denote an estimator of $g$.  Bear in mind that $\hat{g}_n(x)$ is a random function evaluated at a point $x$; $\hat{g}_n$ is random because it depends on the data.  Indeed, we could be more explicit and write $\hat{g}_n(x) = h_x(X_1, \dots, X_n)$ to show that $\hat{g}_n(x)$ is a function of the data $X_1, \dots, X_n$ and that the function could be different for each $x$.

As a loss function, we will use the **integrated square error (ISE)**:

$$ L(g, \hat{g}_n) = \int (g(u) - \hat{g}_n(u))^2 du$$

The **risk** or **mean integrated square error (MISE)** is:

$$ R(g, \hat{g}_n) = \mathbb{E}\left(L(g, \hat{g}_n) \right) $$

**Lemma 21.1**.  The risk can be written as

$$ R(g, \hat{g}_n) = \int b^2(x) dx + \int v(x) dx $$

where

$$ b(x) = \mathbb{E}(\hat{g}_n(x)) - g(x) $$

is the bias of $\hat{g}_n(x)$ at a fixed $x$ and

$$ v(x) = \mathbb{V}(\hat{g}_n(x)) = \mathbb{E}\left( \left( \hat{g}_n(x) - \mathbb{E}(\hat{g}_n(x)) \right)^2 \right) $$

is the variance of $\hat{g}_n(x)$ at a fixed $x$.
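
A short sketch of why the lemma holds, assuming expectation and integration can be interchanged: at a fixed $x$, add and subtract $\mathbb{E}(\hat{g}_n(x))$ inside the square. The cross term vanishes because $\hat{g}_n(x) - \mathbb{E}(\hat{g}_n(x))$ has mean zero, which leaves

$$
\begin{align}
\mathbb{E}\left( (\hat{g}_n(x) - g(x))^2 \right) &= \mathbb{E}\left( \left( \hat{g}_n(x) - \mathbb{E}(\hat{g}_n(x)) \right)^2 \right) + \left( \mathbb{E}(\hat{g}_n(x)) - g(x) \right)^2 \\
&= v(x) + b^2(x)
\end{align}
$$

Integrating both sides over $x$ gives the stated decomposition of the risk.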

In summary,

$$ \text{RISK} = \text{BIAS}^2 + \text{VARIANCE} $$

When the data are over-smoothed, the bias term is large and the variance is small.  When the data are under-smoothed the opposite is true.  This is called the **bias-variance trade-off**.  Minimizing risk corresponds to balancing bias and variance.

### 21.2 Histograms

Let $X_1, \dots, X_n$ be IID on $[0, 1]$ with density $f$.  The restriction to $[0, 1]$ is not crucial; we can always rescale the data to be on this interval.  Let $m$ be an integer and define bins

$$ B_1 = \left[0, \frac{1}{m} \right), B_2 = \left[\frac{1}{m}, \frac{2}{m} \right), \dots, B_m = \left[\frac{m - 1}{m}, 1 \right] $$

Define the **binwidth** $h = 1 / m$, let $v_j$ be the number of observations in $B_j$, let $\hat{p}_j = v_j / n$ and let $p_j = \int_{B_j} f(u) du$.

The **histogram estimator** is defined by 

$$
\hat{f}_n(x) = \begin{cases}
\hat{p}_1 / h & x \in B_1 \\
\hat{p}_2 / h & x \in B_2 \\
\vdots & \vdots\\
\hat{p}_m / h & x \in B_m
\end{cases}
$$

which we can write more succinctly as

$$ \hat{f}_n(x) = \sum_{j=1}^m \frac{\hat{p}_j}{h} I(x \in B_j) $$

To understand the motivation for this estimator, recall that $p_j = \int_{B_j} f(u) du$ and note that, for $x \in B_j$ and $h$ small,

$$ \hat{f}_n(x) = \frac{\hat{p}_j}{h} \approx \frac{p_j}{h} = \frac{\int_{B_j} f(u) du}{h} \approx \frac{f(x) h}{h} = f(x) $$
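
The definition translates almost directly into code. The following is a minimal sketch, assuming NumPy and data already rescaled to $[0, 1]$; the function names `histogram_estimator` and `evaluate` and the Beta-distributed sample are illustrative choices, not part of the text.

```python
import numpy as np

def histogram_estimator(data, m):
    """Return the bin heights p_hat_j / h for m equal-width bins on [0, 1]."""
    data = np.asarray(data, dtype=float)
    n = data.size
    h = 1.0 / m
    # np.histogram uses half-open bins [a, b) except the last, which is closed,
    # matching B_1, ..., B_m in the text.
    counts, _ = np.histogram(data, bins=m, range=(0.0, 1.0))
    p_hat = counts / n
    return p_hat / h

def evaluate(heights, x):
    """Evaluate the piecewise-constant estimate at points x in [0, 1]."""
    m = heights.size
    x = np.asarray(x, dtype=float)
    # Bin index of each x; x = 1 is assigned to the last bin.
    j = np.minimum((x * m).astype(int), m - 1)
    return heights[j]

rng = np.random.default_rng(0)
sample = rng.beta(2.0, 5.0, size=500)   # illustrative data on [0, 1]
heights = histogram_estimator(sample, m=10)
print(evaluate(heights, [0.05, 0.25, 0.95]))
print(heights.sum() * (1.0 / 10))       # total area under the estimate
```

The area under the estimate is $\sum_j \hat{p}_j$, so it should come out as 1 up to rounding whenever all observations lie in $[0, 1]$.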

The mean and the variance of $\hat{f}_n(x)$ are given in the following Theorem.

**Theorem 21.3**.  Consider fixed $x$ and fixed $m$, and let $B_j$ be the bin containing $x$.  Then,

$$ 
\mathbb{E}(\hat{f}_n(x)) = \frac{p_j}{h} 
\quad \text{and} \quad
\mathbb{V}(\hat{f}_n(x)) = \frac{p_j (1 - p_j)}{nh^2}
$$
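
The theorem follows because $v_j$, the count in $B_j$, has a Binomial$(n, p_j)$ distribution. A quick Monte Carlo check can make this concrete. The sketch below assumes NumPy and uses the density $f(x) = 2x$ on $[0, 1]$ purely as an example; the sample size, number of bins, and number of replications are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, reps = 200, 5, 20000
h = 1.0 / m
j = 3                                   # bin B_3 = [0.4, 0.6), which contains x = 0.5

# Density f(x) = 2x on [0, 1]: if U is uniform then sqrt(U) has this density.
samples = np.sqrt(rng.random((reps, n)))

# Histogram estimate at x = 0.5 in each replication: (count in B_j / n) / h.
lo, hi = (j - 1) * h, j * h
f_hat = ((samples >= lo) & (samples < hi)).mean(axis=1) / h

p_j = hi**2 - lo**2                     # exact bin probability for f(x) = 2x
mean_theory = p_j / h
var_theory = p_j * (1 - p_j) / (n * h**2)

print(f_hat.mean(), mean_theory)        # empirical vs. theoretical mean
print(f_hat.var(), var_theory)          # empirical vs. theoretical variance
```

Note that the mean is $p_j / h$, not $f(0.5)$; the two agree here only because $f$ is linear and $x = 0.5$ is the center of the bin.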

Let's take a closer look at the bias-variance tradeoff.  Consider some $x \in B_j$.  For any other $u \in B_j$,

$$ f(u) \approx f(x) + (u - x) f'(x) $$

and so

$$ 
\begin{align}
p_j = \int_{B_j} f(u) du &\approx \int_{B_j} (f(x) + (u - x) f'(x)) du \\
&= f(x) h + h f'(x) \left(h \left(j - \frac{1}{2} \right) - x \right)
\end{align}
$$

Therefore, the bias $b(x)$ is

$$
\begin{align}
b(x) &= \mathbb{E}(\hat{f}_n(x)) - f(x) = \frac{p_j}{h} - f(x) \\
&\approx \frac{f(x) h + h f'(x) \left(h \left(j - \frac{1}{2} \right) - x \right)}{h} - f(x) \\
&= f'(x) \left(h \left(j - \frac{1}{2} \right) - x \right)
\end{align}
$$

If $\overline{x}_j$ is the center of the bin, then

$$
\begin{align}
\int_{B_j} b^2(x) dx &= \int_{B_j} (f'(x))^2 \left(h \left(j - \frac{1}{2} \right) - x \right)^2 dx \\
&\approx (f'(\overline{x}_j))^2 \int_{B_j} \left(h \left(j - \frac{1}{2} \right) - x \right)^2 dx \\
&= (f'(\overline{x}_j))^2 \frac{h^3}{12}
\end{align}
$$

Therefore,

$$
\begin{align}
\int_0^1 b^2(x) dx &= \sum_{j=1}^m \int_{B_j} b^2(x) dx \approx \sum_{j=1}^m (f'(\overline{x}_j))^2 \frac{h^3}{12} \\
&= \frac{h^2}{12} \sum_{j=1}^m h(f'(\overline{x}_j))^2 \approx \frac{h^2}{12} \int_0^1 (f'(x))^2 dx
\end{align}
$$

Note that this increases as a function of $h$. 

Now consider the variance.  For $h$ small, $1 - p_j \approx 1$, so

$$
\begin{align}
v(x) &\approx \frac{p_j}{nh^2}\\
&\approx \frac{f(x)h + h f'(x)\left(h \left(j - \frac{1}{2} \right) - x \right)}{nh^2} \\
&\approx \frac{f(x)}{nh}
\end{align}
$$

when we keep the dominant term.  Since $\int_0^1 f(x) dx = 1$, integrating gives

$$
\int_0^1 v(x) dx \approx \frac{1}{nh}
$$

Note that this decreases with $h$.  Putting this all together, we get:

**Theorem 21.4**.  Suppose that $\int (f'(u))^2 du < \infty$.  Then

$$ R(\hat{f}_n, f) \approx \frac{h^2}{12} \int (f'(u))^2 du + \frac{1}{nh}$$

The value $h^*$ that minimizes this expression is

$$ h^* = \frac{1}{n^{1/3}} \left( \frac{6}{\int (f'(u))^2 du} \right)^{1/3}$$

With this choice of binwidth,

$$ R(\hat{f}_n, f) \approx \frac{(3/4)^{2/3} \left( \int (f'(u))^2 du \right)^{1/3}}{n^{2/3}} $$
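
A sketch of the minimization behind these two formulas, writing $A = \int (f'(u))^2 du$ as a shorthand introduced here. Setting the derivative of the approximate risk to zero gives

$$ \frac{d}{dh}\left( \frac{h^2}{12} A + \frac{1}{nh} \right) = \frac{hA}{6} - \frac{1}{nh^2} = 0 \quad \Longrightarrow \quad (h^*)^3 = \frac{6}{nA} $$

Substituting $h^*$ back, both terms are multiples of $A^{1/3} n^{-2/3}$:

$$ \frac{(h^*)^2}{12} A + \frac{1}{nh^*} = \frac{A^{1/3}}{n^{2/3}} \left( \frac{6^{2/3}}{12} + \frac{1}{6^{1/3}} \right) = \frac{A^{1/3}}{n^{2/3}} \cdot \frac{3}{2 \cdot 6^{1/3}} = \frac{(3/4)^{2/3} A^{1/3}}{n^{2/3}} $$

The last step uses $\left( \frac{3}{2} \right)^3 / 6 = \frac{9}{16} = \left( \frac{3}{4} \right)^2$.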

Theorem 21.4 is quite revealing.  We see that with an optimally chosen binwidth, the MISE decreases to 0 at rate $n^{-2/3}$.  By comparison, most parametric estimators converge at rate $n^{-1}$.  The formula for the optimal binwidth $h^*$ is of theoretical interest but it is not useful in practice since it depends on the unknown function $f$.
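
To see the rate numerically, one can evaluate the approximate risk of Theorem 21.4 for a density where $\int (f'(u))^2 du$ is known. The sketch below assumes NumPy and again uses $f(x) = 2x$, for which the integral equals 4; the helper names and the grid of candidate binwidths are illustrative.

```python
import numpy as np

def approx_risk(h, n, A):
    """Approximate MISE from Theorem 21.4, with A the integral of f'(u)^2."""
    return h**2 * A / 12.0 + 1.0 / (n * h)

def optimal_binwidth(n, A):
    """Closed-form minimizer h* of the approximate risk."""
    return (6.0 / (n * A)) ** (1.0 / 3.0)

A = 4.0   # f(x) = 2x on [0, 1] has f'(x) = 2, so the integral of f'^2 is 4
grid = np.linspace(0.005, 0.5, 2000)    # candidate binwidths for a grid search
for n in (100, 1000, 10000):
    h_star = optimal_binwidth(n, A)
    h_grid = grid[np.argmin(approx_risk(grid, n, A))]
    # The grid minimizer should sit next to the closed-form h*.
    print(n, h_star, h_grid, approx_risk(h_star, n, A))
```

Each tenfold increase in $n$ should shrink the printed risk by a factor of about $10^{2/3} \approx 4.6$, rather than the factor of 10 a parametric rate would give.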

A practical way of choosing the binwidth is to estimate the risk function and minimize over $h$.  Recall that the loss function, which we now write as a function of $h$, is

$$
\begin{align}
L(h) &= \int \left( \hat{f}_n(x) - f(x) \right)^2 dx \\
&= \int \hat{f}_n^2(x) dx - 2 \int \hat{f}_n(x) f(x) dx + \int f^2(x) dx
\end{align}
$$

The last term does not depend on the binwidth $h$ so minimizing the risk is equivalent to minimizing the expected value of

$$ J(h) = \int \hat{f}_n^2(x) dx - 2 \int \hat{f}_n(x) f(x) dx $$

We shall refer to $\mathbb{E}(J(h))$ as the risk, although it differs from the true risk by the constant term $\int f^2(x) dx$.

The **cross-validation estimator of risk** is

$$ \hat{J}(h) = \int \left( \hat{f}_n(x) \right)^2 dx - \frac{2}{n} \sum_{i=1}^n \hat{f}_{(-i)}(X_i)$$

where $\hat{f}_{(-i)}$ is the histogram estimator obtained after removing the $i$-th observation.  We refer to $\hat{J}(h)$ as the cross-validation score or estimated risk.
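
Read literally, the definition asks for one refit per observation. The following brute-force sketch does exactly that, assuming NumPy and data on $[0, 1]$; the function name `cv_score_loo` is a hypothetical choice.

```python
import numpy as np

def cv_score_loo(data, m):
    """Cross-validation score J_hat(h) for m bins on [0, 1], by brute force."""
    data = np.asarray(data, dtype=float)
    n = data.size
    h = 1.0 / m
    # Bin index of each observation; x = 1 is assigned to the last bin.
    idx = np.minimum((data * m).astype(int), m - 1)
    counts = np.bincount(idx, minlength=m)

    # First term: the integral of f_hat^2 is sum_j (p_hat_j / h)^2 * h.
    p_hat = counts / n
    integral_sq = np.sum(p_hat**2) / h

    # Second term: refit the histogram without X_i, then evaluate it at X_i.
    loo_sum = 0.0
    for i in range(n):
        counts_minus_i = np.bincount(np.delete(idx, i), minlength=m)
        f_minus_i = counts_minus_i / ((n - 1) * h)
        loo_sum += f_minus_i[idx[i]]
    return integral_sq - 2.0 * loo_sum / n

rng = np.random.default_rng(0)
sample = rng.beta(2.0, 5.0, size=300)   # illustrative data on [0, 1]
for m in (2, 5, 10, 20, 40):
    print(m, cv_score_loo(sample, m))
```

Each call refits the histogram $n$ times, so scanning many values of $h$ this way becomes slow for large samples.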

**Theorem 21.6**.  The cross-validation estimator is nearly unbiased:

$$ \mathbb{E}(\hat{J}(h)) \approx \mathbb{E}(J(h)) $$

In principle, we need to recompute the histogram $n$ times to compute $\hat{J}(h)$.  Moreover, this has to be done for all values of $h$.  Fortunately, there is a shortcut formula.

**Theorem 21.7**.  The following identity holds:

$$ \hat{J}(h) = \frac{2}{(n - 1)h} - \frac{n+1}{(n-1)h} \sum_{j=1}^m \hat{p}_j^2 $$
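
Substituting the bin counts into the definition of $\hat{J}(h)$ gives the closed form $\frac{2}{(n-1)h} - \frac{n+1}{(n-1)h} \sum_{j=1}^m \hat{p}_j^2$, which needs only one pass over the data per value of $h$. The sketch below evaluates that expression and picks the number of bins with the smallest score. It assumes NumPy and data on $[0, 1]$; `cv_score`, `select_num_bins`, and the `max_bins` cap are hypothetical names and choices.

```python
import numpy as np

def cv_score(data, m):
    """Closed-form cross-validation score for m equal-width bins on [0, 1]."""
    data = np.asarray(data, dtype=float)
    n = data.size
    h = 1.0 / m
    counts, _ = np.histogram(data, bins=m, range=(0.0, 1.0))
    p_hat = counts / n
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * np.sum(p_hat**2)

def select_num_bins(data, max_bins=50):
    """Return the m in 1..max_bins with the smallest score, plus all scores."""
    candidates = np.arange(1, max_bins + 1)
    scores = np.array([cv_score(data, m) for m in candidates])
    return int(candidates[np.argmin(scores)]), scores

rng = np.random.default_rng(0)
sample = rng.beta(2.0, 5.0, size=1000)  # illustrative data on [0, 1]
m_best, scores = select_num_bins(sample)
print(m_best, 1.0 / m_best)             # chosen number of bins and binwidth
```

A useful sanity check is $m = 1$: then $\hat{p}_1 = 1$ and the score reduces to $\frac{2 - (n + 1)}{n - 1} = -1$ for any sample.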