# Elements of Statistical Learning

### Chapter 1

This book is about learning from data. 

In a typical **supervised** scenario, we have an outcome (also called the dependent, output, or response variable), usually quantitative or categorical, that we wish to predict from a set of features (also called independent variables, inputs, or predictors). We have a training set of data in which we observe both the outcome and the features. From it we build a prediction model, or learner, which enables us to predict the outcome for new, unseen objects.

In the **unsupervised** learning problem, we observe only the features and have no measurements of the outcome. Our task is to describe how the data are organized or clustered.

Examples:
 - Email spam: classification of messages as spam or not
 - Prostate cancer: regression to predict the log of prostate-specific antigen
 - Handwritten digit recognition: classification
 - DNA expression microarrays: unsupervised learning

### Chapter 2

In supervised learning we use the term `regression` when the output is quantitative and `classification` when the output is qualitative (a.k.a. categorical or discrete). A third variable type is ordered categorical, such as small, medium and large, where the values are ordered but there is no meaningful metric between them.

Given an input X, the goal is to make a good prediction of the output Y, denoted by $\hat{Y}$.

#### Linear Models and least squares

We model the relationship between X and Y as a linear function of X, including a constant 1 in the X vector to account for the intercept $\hat{\beta}_0$. Here X is a single input vector; in general it can be a matrix in which each row is a sample and each column is a feature, in which case Y is a vector of outcomes.
$$
\begin{aligned}
\hat{Y} &= \hat{\beta}_0 + \sum_{j=1}^p X_j\hat{\beta}_j \\
&= X^{T}\hat{\beta}
\end{aligned}
$$

To fit the model, the most common approach is to minimize the residual sum of squares (RSS):
$$
\begin{aligned}
RSS(\beta) &= \sum_{i=1}^N (y_i - x_i^T\beta)^2 \\
&= (y - X\beta)^T(y - X\beta)
\end{aligned}
$$
If $X^TX$ is nonsingular, minimising this gives the unique least squares estimate of the coefficients:
$$\hat{\beta} = (X^TX)^{-1}X^Ty$$
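
As a minimal sketch of the fit above, the following uses NumPy on made-up toy data. The function names `fit_least_squares` and `predict` are hypothetical, and `np.linalg.lstsq` is used in place of forming $(X^TX)^{-1}$ explicitly, which gives the same estimate when $X^TX$ is nonsingular but is numerically safer.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return beta_hat minimising RSS(beta) = ||y - X beta||^2.

    X is an (N, p) feature matrix WITHOUT the intercept column; a column of
    ones is prepended here, so beta_hat[0] is the intercept.
    """
    X1 = np.column_stack([np.ones(len(X)), X])
    # lstsq solves the least squares problem without inverting X^T X
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta_hat

def predict(X, beta_hat):
    """Compute Y_hat = X^T beta_hat for each row of X (intercept included)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ beta_hat

# Toy data (made up for illustration): y = 1 + 2*x1 - 3*x2 with no noise
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_toy = 1.0 + 2.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1]
beta_toy = fit_least_squares(X_toy, y_toy)
```

Because the toy outcome is an exact linear function of the features, the fitted coefficients should recover the intercept and slopes used to generate it.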

#### Nearest-Neighbor Methods
To predict the outcome for a new point x, we average the outcomes of its k nearest neighbors in the training set:
$$\hat{Y}(x) = \frac{1}{k}\sum_{x_i \in N_k(x)}y_i$$
where $N_k(x)$ is the set of k points in the training set closest to x, with closeness typically measured by Euclidean distance.
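
The averaging rule can be sketched in a few lines of NumPy. This is an illustrative brute-force version, assuming Euclidean distance; the function name `knn_predict` and the demo arrays are made up for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Average the outcomes of the k training points closest to x."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    # Indices of the k smallest distances, i.e. the neighbourhood N_k(x)
    nearest = np.argsort(dists, kind="stable")[:k]
    return y_train[nearest].mean()

# Made-up one-dimensional training set
X_demo = np.array([[0.0], [1.0], [2.0], [10.0]])
y_demo = np.array([0.0, 1.0, 2.0, 10.0])

y_hat_k1 = knn_predict(X_demo, y_demo, [0.9], k=1)  # uses only the point at 1.0
y_hat_k2 = knn_predict(X_demo, y_demo, [0.9], k=2)  # averages the points at 1.0 and 0.0
```

With k = 1 the prediction copies the single closest outcome; larger k averages over a wider neighbourhood and gives a smoother prediction.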

#### Decision Theory

In general we seek a function $f(X)$ for predicting Y. We define a loss function $L(Y, f(X))$ that measures the cost of predicting $f(X)$ when the true value is Y. The most common choice is the squared error loss
$$L(Y, f(X)) = (Y - f(X))^2$$
Under this loss, the function that minimizes the expected prediction error $E(Y - f(X))^2$ is the conditional expectation of Y given X, $f(x) = E(Y|X=x)$, also known as the regression function.
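
A short sketch of why the conditional expectation is the minimizer, assuming the relevant expectations exist. Writing EPE for the expected prediction error and conditioning on X:
$$
EPE(f) = E(Y - f(X))^2 = E_X E_{Y|X}\left([Y - f(X)]^2 \mid X\right)
$$
It is then enough to minimize pointwise in x. For any constant c,
$$
E\left([Y - c]^2 \mid X = x\right) = Var(Y \mid X = x) + \left(E(Y \mid X = x) - c\right)^2
$$
The first term does not depend on c and the second is zero exactly when $c = E(Y \mid X = x)$, so
$$
f(x) = \arg\min_c E\left([Y - c]^2 \mid X = x\right) = E(Y \mid X = x)
$$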


#### Curse of dimensionality

Consider the nearest neighbours method. As the number of dimensions increases, the volume of the space grows so quickly that, for a fixed number of training points, the nearest neighbours of a point are no longer close to it. This is the curse of dimensionality, and it increases the bias of the nearest neighbour method because the average is no longer taken over a local neighbourhood.
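
A standard way to see this, assuming inputs spread uniformly over a p-dimensional unit hypercube: a sub-cube that captures a fraction r of the data must have edge length $r^{1/p}$. The sketch below (the function name `edge_length` is made up here) tabulates that edge length for a few dimensions.

```python
def edge_length(fraction, p):
    """Edge length of a sub-cube holding `fraction` of a unit hypercube's volume in p dimensions."""
    return fraction ** (1.0 / p)

# Edge length needed to capture 1% and 10% of uniformly distributed data
for p in (1, 2, 10, 100):
    print(p, round(edge_length(0.01, p), 2), round(edge_length(0.10, p), 2))
```

In 10 dimensions, capturing just 1% of the data already requires covering roughly 63% of the range of each input, so such a neighbourhood can hardly be called local.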

The book shows that by leveraging the structure of the problem, such as linearity, we can reduce the impact of the curse of dimensionality and so reduce both the bias and the variance of the model, provided the assumed structure is close to the truth.

#### Supervised Learning and Function Approximation



Often we can reframe the supervised problem as the additive error model
$$
Y = f(X) + \epsilon
$$
where the random error $\epsilon$ has $E(\epsilon) = 0$ and is independent of X.

We want to estimate $f(X)$ using the training data. We can view this as a `learning problem` where we iteratively improve our estimate of $f(X)$ or as a `function approximation problem`.

To find the optimal function we often minimize the squared error loss, but a more general approach is to maximise the likelihood of the data under an assumed model. For a linear model with additive, normally distributed errors, maximising the likelihood is equivalent to minimizing the squared error loss.
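
The equivalence can be sketched as follows, under the assumption that $Y = f_\theta(X) + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$, where $f_\theta$ is notation introduced here for a model with parameters $\theta$. The log-likelihood of N independent training observations is
$$
L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^N\left(y_i - f_\theta(x_i)\right)^2
$$
Only the last term involves $\theta$, and it is the residual sum of squares multiplied by a negative constant, so maximising $L(\theta)$ over $\theta$ gives the same solution as minimising the sum of squares.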

Besides linear models and nearest neighbours, other classes of methods include:
 - `Roughness penalty methods`: these control model complexity by adding a penalty on the roughness of f to the loss function. This is also known as `regularization`.
 - `Kernel methods`: similar to nearest neighbours, but using a weighted average of the neighbours, with weights that decrease with distance from the target point.
 - `Basis functions`: these assume f has the form $\sum_{j=1}^M\theta_jh_j(x)$, where each $h_j(x)$ is a function of x. This family includes linear models, splines, single-layer neural networks and radial basis functions.
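
To illustrate the basis function form, here is a minimal sketch that fits $\theta$ by least squares for a fixed set of basis functions. The polynomial basis, the function names and the noiseless target are all assumptions chosen for the example, not something prescribed by the notes.

```python
import numpy as np

def design_matrix(x, basis):
    """Stack h_1(x), ..., h_M(x) as the columns of an (N, M) matrix."""
    return np.column_stack([h(x) for h in basis])

def fit_basis_expansion(x, y, basis):
    """Least squares estimate of theta in f(x) = sum_j theta_j h_j(x)."""
    H = design_matrix(x, basis)
    theta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return theta

# Hypothetical polynomial basis: h_1(x) = 1, h_2(x) = x, h_3(x) = x^2
poly_basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x ** 2]

# Made-up noiseless target that lies exactly in the span of the basis
x_quad = np.linspace(-1.0, 1.0, 9)
y_quad = 0.5 - x_quad + 2.0 * x_quad ** 2
theta_quad = fit_basis_expansion(x_quad, y_quad, poly_basis)
```

The model is nonlinear in x but still linear in $\theta$, which is why the ordinary least squares machinery applies unchanged once the basis is fixed.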

#### Bias-variance tradeoff

Many models have a smoothing parameter that controls the complexity of the model. The choice of this parameter is a tradeoff between bias and variance. A model with high bias will underfit the data, while a model with high variance will overfit the data.

The expected prediction error can be decomposed into the irreducible error, the squared bias and the variance of the model.
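
The decomposition can be checked numerically with a small simulation. The sketch below is illustrative only: it assumes a fixed one-dimensional design, a sine curve as the true f, Gaussian noise with standard deviation `sigma`, and k-nearest-neighbour averaging as the model, with k acting as the smoothing parameter. All names are made up for the example.

```python
import numpy as np

def knn_fit_at(x_train, y_train, x0, k):
    """k-nearest-neighbour estimate of f(x0) for one-dimensional inputs."""
    nearest = np.argsort(np.abs(x_train - x0), kind="stable")[:k]
    return y_train[nearest].mean()

def bias_variance_at(f, x0, k, sigma=0.5, n_train=50, n_reps=500, seed=0):
    """Estimate squared bias, variance and MSE of the k-NN fit at x0
    by redrawing the training outcomes n_reps times."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)  # fixed design points
    fits = np.empty(n_reps)
    for r in range(n_reps):
        y_train = f(x_train) + rng.normal(0.0, sigma, size=n_train)
        fits[r] = knn_fit_at(x_train, y_train, x0, k)
    bias_sq = (fits.mean() - f(x0)) ** 2
    variance = fits.var()
    mse = np.mean((fits - f(x0)) ** 2)  # equals bias_sq + variance
    return bias_sq, variance, mse

def f_true(x):
    """Assumed true regression function for the simulation."""
    return np.sin(2 * np.pi * x)

# Small k: low bias, high variance. Large k: higher bias, low variance.
results = {k: bias_variance_at(f_true, x0=0.25, k=k) for k in (1, 5, 25)}
```

Adding the irreducible error `sigma ** 2` to the returned `mse` gives the expected prediction error at `x0`. Comparing the entries of `results` across k shows the tradeoff: averaging over more neighbours lowers the variance but pulls the fit away from the peak of the sine curve, which raises the bias.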