## 21. Nonparametric Curve Estimation

In this chapter we discuss the nonparametric estimation of probability density functions and regression functions, which we refer to as **curve estimation**.

In Chapter 8 we saw it is possible to consistently estimate a cumulative distribution function $F$ without making any assumptions about $F$.  If we want to estimate a probability density function $f(x)$ or a regression function $r(x) = \mathbb{E}(Y | X = x)$ the situation is different.  We cannot estimate these functions consistently without making some smoothness assumptions.  Correspondingly, we will perform some sort of smoothing operation with the data.

A simple example of a density estimator is a **histogram**.  To form a histogram estimator of a density $f$, we divide the real line into disjoint sets called **bins**.  The histogram estimator is a piecewise constant function whose height over each bin is proportional to the number of observations in that bin.  The number of bins is an example of a **smoothing parameter**.  If we smooth too much (large bins) we get a highly biased estimator, while if we smooth too little (small bins) we get a highly variable estimator.  Much of curve estimation is concerned with trying to optimally balance variance and bias.
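To make the role of the smoothing parameter concrete, here is a minimal sketch of a histogram density estimator with equal-width bins on a bounded interval. The function name `histogram_density`, the interval $[0, 1]$, and the Beta(2, 5) sample are illustrative assumptions, not part of the text above.

```python
import numpy as np

def histogram_density(data, n_bins, low, high):
    """Histogram density estimate on [low, high] with n_bins equal-width bins.

    Hypothetical helper for illustration; returns (bin edges, bin heights).
    """
    edges = np.linspace(low, high, n_bins + 1)
    counts, _ = np.histogram(data, bins=edges)
    width = (high - low) / n_bins
    # Height = (fraction of observations in the bin) / (bin width), so the
    # piecewise constant function integrates to 1 when all observations
    # lie in [low, high].
    heights = counts / (len(data) * width)
    return edges, heights

rng = np.random.default_rng(0)
sample = rng.beta(2, 5, size=500)  # illustrative data supported on [0, 1]

# Few bins = heavy smoothing; many bins = little smoothing.
for n_bins in (4, 20, 200):
    edges, heights = histogram_density(sample, n_bins, 0.0, 1.0)
    area = np.sum(heights * np.diff(edges))
    print(f"bins={n_bins:3d}  area={area:.3f}  max height={heights.max():.2f}")
```

With few bins the estimate is a coarse step function that cannot follow the shape of the density; with many bins the heights jump erratically from one bin to the next.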

### 21.1 The Bias-Variance Tradeoff

Let $g$ denote an unknown function and let $\hat{g}_n$ denote an estimator of $g$.  Bear in mind that $\hat{g}_n(x)$ is a random function evaluated at a point $x$; $\hat{g}_n$ is random because it depends on the data.  Indeed, we could be more explicit and write $\hat{g}_n(x) = h_x(X_1, \dots, X_n)$ to show that $\hat{g}_n(x)$ is a function of the data $X_1, \dots, X_n$ and that the function could be different for each $x$.

As a loss function, we will use the **integrated square error (ISE)**:

$$ L(g, \hat{g}_n) = \int (g(u) - \hat{g}_n(u))^2 du$$

The **risk** or **mean integrated square error (MISE)** is:

$$ R(g, \hat{g}_n) = \mathbb{E}\left(L(g, \hat{g}_n) \right) $$

**Lemma 21.1**.  The risk can be written as 

$$ R(g, \hat{g}_n) = \int b^2(x) dx + \int v(x) dx $$

where

$$ b(x) = \mathbb{E}(\hat{g}_n(x)) - g(x) $$

is the bias of $\hat{g}_n(x)$ at a fixed $x$ and

$$ v(x) = \mathbb{V}(\hat{g}_n(x)) = \mathbb{E}\left( \left( \hat{g}_n(x) - \mathbb{E}(\hat{g}_n(x)) \right)^2 \right) $$

is the variance of $\hat{g}_n(x)$ at a fixed $x$.
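A brief sketch of why the lemma holds, assuming the expectation and the integral can be interchanged. At a fixed $x$, add and subtract $\mathbb{E}(\hat{g}_n(x))$ inside the square:

$$
\begin{aligned}
\mathbb{E}\left( (\hat{g}_n(x) - g(x))^2 \right)
&= \mathbb{E}\left( \left( \hat{g}_n(x) - \mathbb{E}(\hat{g}_n(x)) + b(x) \right)^2 \right) \\
&= v(x) + 2 b(x) \, \mathbb{E}\left( \hat{g}_n(x) - \mathbb{E}(\hat{g}_n(x)) \right) + b^2(x) \\
&= v(x) + b^2(x)
\end{aligned}
$$

The cross term vanishes because $b(x)$ is not random and $\hat{g}_n(x) - \mathbb{E}(\hat{g}_n(x))$ has mean zero. Integrating both sides over $x$ gives

$$ \mathbb{E}\left( \int (\hat{g}_n(x) - g(x))^2 dx \right) = \int b^2(x) dx + \int v(x) dx $$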

In summary,

$$ \text{RISK} = \text{BIAS}^2 + \text{VARIANCE} $$

When the data are over-smoothed, the bias term is large and the variance is small.  When the data are under-smoothed the opposite is true.  This is called the **bias-variance tradeoff**.  Minimizing risk corresponds to balancing bias and variance.
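The tradeoff can be checked numerically. The following is a rough Monte Carlo sketch, not a definitive experiment: it assumes a Beta(2, 5) true density on $[0, 1]$ (density $30x(1-x)^4$), uses an equal-width histogram as the estimator, and approximates the integrals by sums over a fine grid. The sample size, number of replications, bin counts, and all function names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_reps = 200, 500                      # sample size and Monte Carlo replications
n_grid = 1000
grid = (np.arange(n_grid) + 0.5) / n_grid  # midpoints of a fine grid on [0, 1]
dx = 1.0 / n_grid
true_density = 30.0 * grid * (1.0 - grid) ** 4  # Beta(2, 5) density

def hist_on_grid(data, n_bins):
    """Evaluate an equal-width histogram density estimate at every grid point."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    counts, _ = np.histogram(data, bins=edges)
    heights = counts * n_bins / len(data)  # count / (n * bin width)
    bin_index = np.minimum((grid * n_bins).astype(int), n_bins - 1)
    return heights[bin_index]

def risk_components(n_bins):
    """Monte Carlo estimates of integrated squared bias, integrated variance, and MISE."""
    estimates = np.array(
        [hist_on_grid(rng.beta(2, 5, size=n), n_bins) for _ in range(n_reps)]
    )
    mean_estimate = estimates.mean(axis=0)
    bias_sq = np.sum((mean_estimate - true_density) ** 2) * dx
    # Population-style variance (ddof=0), so the decomposition below holds
    # exactly for the Monte Carlo averages, not only in the limit.
    variance = np.sum(estimates.var(axis=0)) * dx
    mise = np.mean(np.sum((estimates - true_density) ** 2, axis=1)) * dx
    return bias_sq, variance, mise

results = {m: risk_components(m) for m in (2, 10, 100)}
for m, (bias_sq, variance, mise) in results.items():
    print(f"bins={m:3d}  bias^2={bias_sq:.4f}  variance={variance:.4f}  MISE={mise:.4f}")
```

In this setup the two-bin histogram (over-smoothed) should show the largest squared bias and the smallest variance, while the 100-bin histogram (under-smoothed) should show the reverse; the estimated MISE equals the sum of the two components in each row.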