# ESLII

# 1 Supervised Learning

## 1.1 Terminology
* X: input variable
* Y: quantitative output variable
* G: qualitative output variable
* $X_j$: the jth component of input
* $x_i$: the ith observed value of X, a p-vector for the ith observation
* $\mathbf{x}_j$: an N-vector consisting of all the observations on variable $X_j$
> In general, vectors will not be bold, except when they have N components; this convention distinguishes a p-vector of inputs $x_i$ for the ith observation from the N -vector $\mathbf{x}_j$ consisting of all the observations on variable $X_j$. Since all vectors are assumed to be column vectors, the ith row of $\mathbf{X}$ is $x^T_i$ , the vector transpose of $x_i$.
* $\mathbf{X}$: matrix
* $\mathbb{R}$: set of real numbers
* $\mathcal{G}$: set of categories
* $\hat{Y}$: prediction of the output Y
* $\hat{G}$: prediction of the output G
* training data
* scatterplot
* decision boundary
* low variance, high variance
* low bias, high bias
* knn, 1-nearest-neighbor
* loss function
* EPE: expected (squared) prediction error 
    
## 1.2 Variable

* **input** variable
    * also referred to as
        * **predictor**, in statistical literature
        * **independent variables**, more classical literature
        * **feature**, in pattern recognition literature
    * denoted by the symbol $X$
        * if X is a vector, its components can be accessed by subscripts $X_j$
        * **Observed values are written in lowercase: the ith observed value of X is $x_i$**
        
* **output** variable 
    * also referred to as
        * **responses**
        * **dependent variable**
    * denoted by 
        * $Y$, for quantitative output
        * $G$, for qualitative output
    
* variable type
    * **quantitative** 
        * **ordering** and **metric notion**
    * **qualitative**, finite set $\mathcal{G}$, for example $\mathcal{G} = \lbrace 0,1,\ldots,9 \rbrace$
        * no explicit ordering
        * AKA **categorical**, **discrete**, **factor**
        * typically represented numerically by **code**
            * “success” or “failure,” “survived” or “died.”
            * represented by a single binary digit or bit as 0 or 1, or else by −1 and 1
            * the numeric codes are sometimes referred to as **targets**
        * **dummy variables** 
            * When there are more than two categories
            * A K-level qualitative variable is represented by a vector of K binary variables or bits
            * only one of which is “on” at a time
    * **ordered categorical**
        * such as small, medium and large
        * there is an ordering between the values, but no metric notion

* prediction tasks
    * **regression**, when we predict quantitative outputs, 
    * **classification**, when we predict qualitative outputs
    
* Matrices are represented by bold uppercase letters $\mathbf{X}$
    * all vectors are assumed to be column vectors
    * A set of N input p-vectors $x_i$, $i = 1, \ldots, N$, would be represented by the N × p matrix $\mathbf{X}$

> Learning Definition: we can loosely state the learning task as follows: given the value of an input vector X, make a good prediction of the output Y, denoted by $\hat{Y}$ (pronounced “y-hat”). If Y takes values in $\mathbb{R}$ then so should $\hat{Y}$; likewise for categorical outputs, $\hat{G}$ should take values in the same set $\mathcal{G}$ associated with G.

> For a two-class G, one approach is to denote the binary coded target as Y , and then treat it as a quantitative output. The predictions $\hat{Y}$ will typically lie in [0, 1], and we can assign to $\hat{G}$ the class label according to whether $\hat{y} > 0.5$. This approach generalizes to K-level qualitative outputs as well.
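
The coding idea above can be sketched for K classes as follows. This is only an illustration: the labels `g` and the fitted values `Y_hat` are made-up numbers, and taking the largest fitted component is one common way (assumed here) to extend the 0.5 threshold to K levels.

```python
import numpy as np

# Hypothetical class labels for six observations (K = 3 classes).
g = np.array([0, 2, 1, 2, 0, 1])
K = 3

# Dummy (indicator) coding: row i has a single 1 in column g[i].
Y = np.eye(K)[g]

# Hypothetical fitted values, e.g. from a regression on the indicator columns.
Y_hat = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
    [0.6, 0.1, 0.3],
    [0.4, 0.5, 0.1],
])

# Assign each observation to the class with the largest fitted value.
G_hat = Y_hat.argmax(axis=1)
```

With K = 2 and fitted values that sum to one, picking the larger component is the same as thresholding at 0.5.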

* **training data**: We need data to construct prediction rules, often a lot of it. We thus suppose we have available a set of measurements $(x_i , y_i )$ or $(x_i , g_i)$, i = 1, . . . , N, known as the training data, with which to construct our prediction rule.



## 1.3 Linear Models and Least Squares
**Linear Model**
* Given a vector of inputs $X^T = (X_1, X_2, \ldots, X_p)$, we predict the output Y via the model
$$ \hat{Y} = X^T \hat{\beta}$$
* In general $\hat{Y}$ can be a K–vector, in which case β would be a p × K matrix of coefficients.

**Least Squares**
* In this approach, we pick the coefficients β to minimize the residual sum of squares
$$RSS(\beta) = (y - X\beta)^T (y - X\beta)$$
* where X is an N × p matrix with each row an input vector, and y is an N-vector of the outputs in the training set.
* Differentiating w.r.t. β we get the **normal equations**
$$X^T (y - X\beta) = 0$$
* If $X^T X$ is nonsingular, then the unique solution is given by
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
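
A minimal NumPy sketch of the least squares fit, under stated assumptions: the data are synthetic, and the names `beta_true`, `beta_hat` and `residual` are illustrative. It solves the normal equations as a linear system rather than forming the inverse explicitly, which is usually the more stable choice numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (an assumption for illustration): N = 50, p = 3.
N, p = 50, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# At the solution the residual is orthogonal to every column of X,
# which is exactly the statement X^T (y - X beta) = 0.
residual = y - X @ beta_hat
```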

<img width=600 src="images/esl-linear-regression.png" />

## 1.4 Nearest-Neighbor Methods
* Nearest-neighbor methods use those observations in the training set $\mathcal{T}$ closest in input space to x to form $\hat{Y}$. Specifically, the k-nearest neighbor fit for $\hat{Y}$ is defined as follows:
$$ \hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i $$
* where $N_k (x)$ is the neighborhood of x defined by the k closest points $x_i$ in the training sample. Closeness implies a metric, which for the moment we assume is Euclidean distance. So, in words, we find the k observations with $x_i$ closest to x in input space, and average their responses.

* 1-nearest-neighbor classification: $\hat{Y}$ is assigned the value $y_l$ of the closest point $x_l$ to x in the training data.

* effective number of parameters of k-nearest neighbors is N/k
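
The k-nearest-neighbor average can be sketched in a few lines of NumPy. The function name `knn_predict` and the toy data are assumptions made for illustration; the distance is Euclidean, as in the text.

```python
import numpy as np

def knn_predict(x, X_train, y_train, k):
    """Average the responses of the k training points closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(dists)[:k]               # indices of the points in N_k(x)
    return y_train[nearest].mean()

# Toy one-dimensional training set with y = x^2.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0, 100.0])

x0 = np.array([1.6])
pred_1nn = knn_predict(x0, X_train, y_train, k=1)   # response of the single closest point
pred_3nn = knn_predict(x0, X_train, y_train, k=3)   # mean of the three closest responses
```

Here the closest point to 1.6 is x = 2, so the 1-nearest-neighbor fit returns 4, while k = 3 averages the responses at x = 1, 2, 3.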

<img width=600 src="images/esl-1-nn.png" />

## 1.5 From Least Squares to Nearest Neighbors
* The linear model makes huge assumptions about **structure** and yields **stable** but possibly **inaccurate** predictions. The linear decision boundary from least squares is very smooth, and apparently stable to fit. It does appear to rely heavily on the assumption that a linear decision boundary is appropriate. In language we will develop shortly, it has **low variance** and potentially **high bias**.

* The method of k-nearest neighbors makes very mild structural assumptions: its predictions are often **accurate** but can be **unstable**. They do not appear to rely on any stringent assumptions about the underlying data, and can adapt to any situation. However, any particular subregion of the decision boundary depends on a handful of input points and their particular positions, and is thus wiggly and unstable—**high variance** and **low bias**.

* A large subset of the most popular techniques in use today are variants of these two simple procedures. In fact 1-nearest-neighbor, the simplest of all, captures a large percentage of the market for low-dimensional problems. 

## 1.6 Statistical Decision Theory

Let $X \in \mathbb{R}^p$ denote a real valued random input vector, and $Y \in \mathbb{R}$ a real valued random output variable, with joint distribution $Pr(X, Y )$. We seek a function f (X) for predicting Y given values of the input X. This theory requires a **loss function** 
$$\tag{loss function} L(Y, f(X))$$
for penalizing errors in prediction, and by far the most common and convenient is **squared error loss**:
$$\tag{squared error loss} L(Y, f(X)) = (Y - f(X))^2.$$
This leads us to a criterion for choosing f,
$$ \tag{expected prediction error} \begin{aligned}
EPE(f) &= E(Y - f(X))^2 \\
&= \int [y-f(x)]^2 Pr(dx, dy),
\end{aligned} $$
the **expected (squared) prediction error**. 

### Regression
By conditioning on X, we can write EPE as
$$ EPE(f) = E_X E_{Y \mid X}([Y-f(X)]^2 \mid X)$$
and we see that it suffices to minimize EPE pointwise:
$$ f(x) = \underset{c}{argmin} E_{Y \mid X}([Y-c]^2 \mid X=x)$$
The solution is
$$ \tag{conditional expectation} f(x) = E(Y \mid X=x)$$ 
the **conditional expectation**, also known as the **regression function**. <font color=red>Thus the best prediction of Y at any point X = x is the conditional mean, when best is measured by average squared error.</font>
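
To see why the conditional mean is the minimizer, write $\mu(x) = E(Y \mid X = x)$ (a shorthand introduced only for this step) and expand the square around it. The cross term vanishes because $E(Y - \mu(x) \mid X = x) = 0$, which leaves

$$ E([Y-c]^2 \mid X=x) = E([Y-\mu(x)]^2 \mid X=x) + [\mu(x)-c]^2 $$

The first term does not depend on c, and the second is nonnegative and equals zero only at $c = \mu(x)$.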

**Nearest-neighbor**

The nearest-neighbor methods attempt to directly implement this recipe using the training data. At each point x, we might ask for the average of all those $y_i$ with input $x_i = x$. Since there is typically at most one observation at any point x, we settle for
$$\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x)),$$
where “Ave” denotes average, and $N_k(x)$ is the neighborhood containing the k points in $\mathcal{T}$ closest to x. Two approximations are happening here:
* expectation is approximated by averaging over sample data;
* conditioning at a point is relaxed to conditioning on some region “close” to the target point.

For large training sample size N, the points in the neighborhood are likely to be close to x, and as k gets large the average will get more stable. In fact, under mild regularity conditions on the joint probability distribution Pr(X, Y), one can show that as $N, k \to \infty$ such that $k/N \to 0$, $\hat{f}(x) \to E(Y \mid X = x)$.

**Linear regression**

How does linear regression fit into this framework? The simplest explanation is that one assumes that the regression function f (x) is approximately linear in its arguments:
$$f (x) \approx x^T \beta $$
This is a **model-based approach**—we specify a model for the regression function. Plugging this linear model for f (x) into $EPE(f) = E(Y - f(X))^2$ and differentiating we can solve for β theoretically:
$$\tag{Linear}\beta = [E(XX^T)]^{-1} E(XY)$$

Note we have not conditioned on X; rather we have used our knowledge of the functional relationship to pool over values of X. The least squares solution $\hat{\beta} = (X^T X)^{-1} X^T y$ amounts to **replacing the expectation in (Linear) by averages over the training data**.
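
The claim that least squares replaces expectations by training averages can be checked numerically. The sketch below uses synthetic data, and the names `E_xxT`, `E_xy`, `beta_moments` and `beta_ls` are illustrative. It estimates $E(XX^T)$ and $E(XY)$ by sample means and compares the result with the usual least squares solution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (an assumption for illustration): N = 200 observations, p = 2 inputs.
N, p = 200, 2
X = rng.normal(size=(N, p))
y = X @ np.array([2.0, -1.0]) + rng.normal(size=N)

# Sample analogues of E(X X^T) and E(X Y): averages over the N observations.
E_xxT = (X.T @ X) / N
E_xy = (X.T @ y) / N
beta_moments = np.linalg.solve(E_xxT, E_xy)

# Ordinary least squares on the same data.
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
```

The two estimates agree because the 1/N factors cancel.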

So both k-nearest neighbors and least squares end up approximating conditional expectations by averages. But they differ dramatically in terms of model assumptions:
* Least squares assumes f (x) is well approximated by a **globally linear function**.
* k-nearest neighbors assumes f (x) is well approximated by a **locally constant function**.

### Classification
* An estimate $\hat{G}$ will assume values in $\mathcal{G}$
* Our loss function can be represented by a K × K matrix L, where $K = card(\mathcal{G})$. L will be zero on the diagonal and nonnegative elsewhere, where L(k, l) is the price paid for classifying an observation belonging to class $\mathcal{G}_k$ as $\mathcal{G}_l$ . Most often we use the zero–one loss function, where all misclassifications are charged a single unit.

The expected prediction error is
$$EPE = E[L(G, \hat{G}(X))]$$
where again the expectation is taken with respect to the **joint distribution Pr(G, X)**. Again we condition, and can write EPE as
$$\tag{Expected Prediction Error} EPE = E_X \sum^K_{k=1} L[\mathcal{G}_k, \hat{G}(X)] Pr(\mathcal{G}_k \mid X)$$
and again it suffices to minimize EPE pointwise:
$$ \hat{G}(x) = \underset{g \in \mathcal{G}}{argmin} \sum^K_{k=1} L(\mathcal{G}_k, g) Pr(\mathcal{G}_k \mid X=x)$$
With the 0–1 loss function this simplifies to
$$ \hat{G}(x) = \underset{g \in \mathcal{G}}{argmin} [1 - Pr(g \mid X = x)]$$
or simply
$$ \tag{Bayes classifier} \hat{G}(x) = \mathcal{G}_k\ if\ Pr(\mathcal{G}_k \mid X = x) = \underset {g \in \mathcal{G}}{max} Pr(g|X=x) $$
This reasonable solution is known as the **Bayes classifier**, and says that we classify to the most probable class, using the conditional (discrete) distribution Pr(G|X). The error rate of the Bayes classifier is called the **Bayes rate**.

Again we see that the k-nearest neighbor classifier directly approximates this solution—a majority vote in a nearest neighborhood amounts to exactly this, except that conditional probability at a point is relaxed to conditional probability within a neighborhood of a point, and probabilities are estimated by training-sample proportions.
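
The pointwise minimization can be sketched directly, assuming the posterior probabilities at a point x are known. In practice they are not; the values below are made up, and `bayes_classify` is an illustrative name. The loss matrix is indexed as `loss[k, g]`, the price of predicting class g when the true class is k.

```python
import numpy as np

def bayes_classify(posterior, loss):
    """Return the class index g minimizing sum_k loss[k, g] * posterior[k]."""
    expected_loss = posterior @ loss   # entry g is the expected loss of predicting g
    return int(np.argmin(expected_loss))

# Hypothetical posterior Pr(G_k | X = x) over K = 3 classes at some point x.
posterior = np.array([0.2, 0.5, 0.3])

# Zero-one loss: 0 on the diagonal, 1 elsewhere.
zero_one = 1.0 - np.eye(3)
g_hat = bayes_classify(posterior, zero_one)

# An asymmetric loss: missing true class 2 costs five units.
asymmetric = np.array([
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [5.0, 5.0, 0.0],
])
g_hat_asym = bayes_classify(posterior, asymmetric)
```

Under zero-one loss the expected loss of predicting g is 1 − Pr(g | X = x), so the rule picks the most probable class (index 1 here). With the asymmetric loss the decision shifts to class 2, even though it is not the most probable.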



## 1.7 Local Methods in High Dimensions
We have examined two learning techniques for prediction so far: the **stable but biased linear model** and the **less stable but apparently less biased class of k-nearest-neighbor estimates**. It would seem that with a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging, since we should be able to find a fairly large neighborhood of observations close to any x and average them. This approach and our intuition break down in high dimensions, and the phenomenon is commonly referred to as the **curse of dimensionality**.

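**no longer local**
* Consider inputs uniformly distributed in a p-dimensional unit hypercube. A hypercubical neighborhood that captures a fraction r of the observations has expected edge length $e_p(r) = r^{1/p}$. In ten dimensions $e_{10}(0.01) = 0.63$ and $e_{10}(0.1) = 0.80$, so to capture 1% or 10% of the data to form a local average, we must cover 63% or 80% of the range of each input variable. Such neighborhoods are no longer “local.”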
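
The percentages quoted above can be reproduced from the edge-length formula $r^{1/p}$, which gives the side of a sub-cube of the unit hypercube that captures a fraction r of uniformly distributed data in p dimensions (p = 10 here). A minimal sketch; the function name `edge_length` is illustrative.

```python
def edge_length(r, p):
    """Edge length of a sub-cube capturing fraction r of uniform data in p dimensions."""
    return r ** (1.0 / p)

e_1pct = edge_length(0.01, 10)    # about 0.63
e_10pct = edge_length(0.10, 10)   # about 0.79, quoted as 80% in the text
```

For comparison, with p = 1 the same fractions need only 1% and 10% of the range.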

<img width=600 src="images/esl-curse-of-dimensionality.png" />

**sparse sampling density**
* Sampling density is proportional to $N^{1/p}$, where p is the dimension of the input space and N is the sample size. If $N_1 = 100$ represents a dense sample for a single-input problem, then $N_{10} = 100^{10}$ is the sample size required for the same sampling density with 10 inputs. Thus in high dimensions all feasible training samples sparsely populate the input space.

**all sample points are close to an edge of the sample**

Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin. The median distance from the origin to the closest data point is given by the expression
$$ d(p, N) = \left(1 - \left(\frac{1}{2}\right)^{1/N}\right)^{1/p} $$
For N = 500, p = 10 , d(p, N ) ≈ 0.52, more than halfway to the boundary. Hence most data points are closer to the boundary of the sample space than to any other data point. The reason that this presents a problem is that prediction is much more difficult near the edges of the training sample. One must extrapolate from neighboring sample points rather than interpolate between them.
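
The median-distance formula can be evaluated directly. It follows from requiring that all N points fall outside radius d with probability one half, that is $(1 - d^p)^N = 1/2$, since a uniform point in the unit ball lies within radius d with probability $d^p$. A minimal sketch; the function name `median_nn_distance` is illustrative.

```python
def median_nn_distance(p, N):
    """Median distance from the origin to the nearest of N uniform points in the unit p-ball."""
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

d = median_nn_distance(p=10, N=500)   # about 0.52
```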
