# Chapter 2: Statistical Learning

To understand what statistical learning is, let's look at an example. Our clients would like to know *how to improve sales of a particular product*. Their advertising dataset consists of sales of that product in 200 different markets, along with the advertising budgets in each market for three different media: TV, radio, and newspaper. The data are displayed as follows:

![](../images/figure_2.1.png)

It is not possible to directly increase the sales of the product. However, the clients can control the advertising expenditure in each of the three media. So, if we can determine that there is an association between advertising and sales, then we can instruct them to adjust advertising budgets, thereby increasing sales. Our goal, therefore, is to develop an accurate model that can be used to predict sales on the basis of the three media budgets. 

In this case, the advertising budgets are *input variables* while sales is an *output variable*. The inputs are denoted by X and are also called *independent variables* or *predictors*, while the output is denoted by Y and is also called the *dependent variable* or *response*.

More generally, suppose we observe a quantitative response Y and p different predictors, X = (X$_1$, X$_2$, ..., X$_p$). We assume there is some relationship between Y and X, which can be written as:

\begin{equation}
Y = f(X) + \epsilon
\end{equation}

where f is some fixed but unknown function of X$_1$, ..., X$_p$ and $\epsilon$ is a random *error term* that is independent of X and has mean zero. In this formulation, f represents the *systematic* information that X provides about Y. In essence, statistical learning refers to a set of approaches for estimating f. We now look at some of the key theoretical concepts that arise in estimating f, as well as tools for evaluating the estimates obtained.
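To make the formulation concrete, here is a minimal simulation sketch. The function `f`, the noise level `sigma`, and the helper `simulate` are all hypothetical choices made for illustration; in a real problem f is unknown and only the pairs (X, Y) are observed.

```python
import random
import statistics

def f(x):
    # Hypothetical "true" function, chosen only for illustration.
    return 2.0 + 3.0 * x

def simulate(n, sigma=1.0, seed=0):
    """Draw n observations from Y = f(X) + eps with mean-zero noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # eps is drawn without reference to x, so it is independent of X.
    eps = [rng.gauss(0.0, sigma) for _ in range(n)]
    ys = [f(x) + e for x, e in zip(xs, eps)]
    return xs, ys, eps

xs, ys, eps = simulate(10_000)
mean_eps = statistics.fmean(eps)
print(f"sample mean of eps: {mean_eps:.3f}")  # should be close to zero
```

Because the errors average out to roughly zero, the systematic part f(X) is what remains to be learned from the data.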

There are two reasons why we wish to estimate f: *prediction* and *inference*. 

### Prediction

In many situations the inputs X are readily available but the output Y cannot easily be obtained. Because the error term $\epsilon$ averages to zero, we can predict Y using:

\begin{equation}
\hat{Y} = \hat{f}(X)
\end{equation}

where $\hat{f}$ represents an estimate of f, and $\hat{Y}$ represents the resulting prediction for Y. In this setting we treat $\hat{f}$ as a *black box*, in the sense that we are not concerned with the exact form of $\hat{f}$ as long as it yields accurate predictions for Y.

The accuracy of $\hat{Y}$ as a prediction of Y depends on two quantities, which we call the *reducible error* and the *irreducible error*. The reducible error arises because $\hat{f}$ is not a perfect estimate of f; we can shrink it by using a more appropriate statistical learning technique to estimate f. The irreducible error arises because Y also depends on $\epsilon$, which by definition cannot be predicted from X, so we have no control over it: even a perfect estimate of f would leave this error in place. Treating $\hat{f}$ and X as fixed, the expected squared prediction error splits into these two parts:

\begin{equation}
E(Y - \hat{Y})^2 = [f(X) - \hat{f}(X)]^2 + Var(\epsilon) 
\end{equation}

where the first term on the right is the reducible error and the second is the irreducible error. The focus of this book is on techniques for estimating f with the aim of minimizing the reducible error.
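The decomposition can be checked numerically. The sketch below assumes a hypothetical true function `f`, a deliberately imperfect estimate `f_hat`, a fixed input `x0`, and Gaussian noise with standard deviation `sigma`; none of these come from the advertising data.

```python
import random
import statistics

def f(x):
    # Hypothetical true function (unknown in practice).
    return 2.0 + 3.0 * x

def f_hat(x):
    # Hypothetical, deliberately imperfect estimate of f.
    return 3.0 + 2.5 * x

x0 = 4.0       # fixed input at which we predict
sigma = 1.5    # standard deviation of the error term
rng = random.Random(42)

# Simulate many responses at x0 and record the squared prediction error.
sq_errors = []
for _ in range(200_000):
    y = f(x0) + rng.gauss(0.0, sigma)
    sq_errors.append((y - f_hat(x0)) ** 2)

mse = statistics.fmean(sq_errors)          # estimates E(Y - Y_hat)^2
reducible = (f(x0) - f_hat(x0)) ** 2       # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2                   # Var(eps)

print(f"simulated MSE            : {mse:.3f}")
print(f"reducible + irreducible  : {reducible + irreducible:.3f}")
```

Improving `f_hat` can drive the first term toward zero, but the simulated error would still not fall below `sigma ** 2`.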

> The irreducible error will always provide an upper bound on the accuracy of our predictions for Y

### Inference

We are often interested in understanding the way that Y is affected as X$_1$, ..., X$_p$ change. In this case, we wish to estimate f, but our goal is not necessarily to make predictions for Y. More specifically, we wish to understand how Y changes as a function of X. In this situation $\hat{f}$ cannot be treated as a black box because we need to know its exact form. In this setting we may be interested in answering the following questions:

* Which predictors are associated with the response?
* What is the relationship between the response and each predictor? 
* Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated? 

Most real-world problems fall into the prediction setting, the inference setting, or a combination of the two. Our example at the beginning of the chapter falls under the inference setting. There we might ask: (1) Which media contribute to sales? (2) Which media generate the biggest boost in sales? (3) How much increase in sales is associated with a given increase in TV advertising?

Depending on whether our ultimate goal is prediction, inference, or a combination of the two, different methods for estimating f may be appropriate. For example, *linear models* allow for relatively simple and interpretable inference, but may not yield predictions as accurate as those of other approaches. In contrast, some highly non-linear models may yield very accurate predictions but are difficult to interpret.

So, how do we estimate f? Broadly speaking, most statistical learning methods can be characterised as either *parametric* or *non-parametric*.

* Parametric methods involve a two-step model-based approach:
    1. We make an assumption about the functional form of f. A common and simple assumption is that f is linear in X, in which case it has the following form:
    \begin{equation}
    f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p
    \end{equation}
    In this case we only have to estimate the p + 1 coefficients $\beta_0, \beta_1, ..., \beta_p$ rather than an entirely arbitrary function f.
    2. After the model is selected, we use the training data to *fit* the model, that is, to estimate the p + 1 coefficients. The most common approach to fitting such a model is *ordinary least squares*. Once we have found the coefficients, we use the test data to evaluate the model.
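The two steps can be sketched for the simplest case of a single predictor (p = 1), where the least-squares coefficients have a closed form. Everything in this example is an assumption for illustration: the data are synthetic rather than the advertising dataset, and the names `fit_ols`, `mean_squared_error`, `budget`, and `sales` are hypothetical.

```python
import random
import statistics

def fit_ols(xs, ys):
    """Least-squares estimates (beta0, beta1) for the model y = beta0 + beta1 * x."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

def mean_squared_error(beta0, beta1, xs, ys):
    """Average squared gap between observed y and the fitted line."""
    return statistics.fmean([(y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys)])

# Synthetic data (NOT the advertising dataset): 200 "markets" with a
# made-up linear truth, sales = 7 + 0.05 * budget + noise.
rng = random.Random(1)
budget = [rng.uniform(0.0, 100.0) for _ in range(200)]
sales = [7.0 + 0.05 * b + rng.gauss(0.0, 1.0) for b in budget]

# Step 1 was choosing the linear form. Step 2: fit on the training data,
# then evaluate on the held-out test data.
train_x, test_x = budget[:150], budget[150:]
train_y, test_y = sales[:150], sales[150:]
beta0_hat, beta1_hat = fit_ols(train_x, train_y)
test_mse = mean_squared_error(beta0_hat, beta1_hat, test_x, test_y)

print(f"beta0_hat = {beta0_hat:.3f}, beta1_hat = {beta1_hat:.4f}")
print(f"test MSE  = {test_mse:.3f}")
```

With more than one predictor the same idea applies, but the coefficients are usually obtained with a linear algebra routine rather than the one-variable formulas used here.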

A potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of f. If the chosen model is too far from the true f, our estimate will be poor. This problem can be addressed by choosing *flexible* models that can fit many different possible functional forms for f. However, a more flexible model requires estimating more parameters and runs the risk of *overfitting*, which means that the model follows the noise in the training data too closely.

> Parametric forms are less flexible but easy to interpret

* Non-parametric methods work as follows:
    1. They make no assumption about the functional form of f. Instead they seek an estimate of f that gets as close to the data points as possible without overfitting. 
    2. Once the model is selected we use the training and test datasets to fit the model and evaluate it, respectively. 
    
Non-parametric methods can at times outperform parametric ones because they make no assumption about the functional form of f, so they can accurately fit a much wider range of possible shapes for f. However, because they do not reduce the problem of estimating f to a small number of parameters, non-parametric methods need far more observations than parametric methods to estimate f accurately.
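As a rough illustration of the contrast, the sketch below compares a straight-line fit with k-nearest-neighbors averaging, used here as one simple example of a non-parametric method. The sine-shaped truth, the noise level, the choice k = 10, and the helper names `knn_predict`, `fit_line`, and `draw` are all assumptions made for the example.

```python
import math
import random
import statistics

def knn_predict(x0, xs, ys, k):
    """Non-parametric estimate of f(x0): average y over the k nearest training x."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return statistics.fmean([y for _, y in nearest])

def fit_line(xs, ys):
    """Parametric baseline: least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1

def draw(n, rng):
    # Hypothetical non-linear truth: f(x) = sin(x), plus mean-zero noise.
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.2) for x in xs]
    return xs, ys

rng = random.Random(7)
train_x, train_y = draw(300, rng)
test_x, test_y = draw(100, rng)

b0, b1 = fit_line(train_x, train_y)
linear_mse = statistics.fmean(
    [(y - (b0 + b1 * x)) ** 2 for x, y in zip(test_x, test_y)]
)
knn_mse = statistics.fmean(
    [(y - knn_predict(x, train_x, train_y, k=10)) ** 2 for x, y in zip(test_x, test_y)]
)

print(f"linear test MSE: {linear_mse:.3f}")
print(f"k-NN test MSE  : {knn_mse:.3f}")
```

Here the linear form is a poor match for the true f, so the non-parametric fit does better on the test data. The value of k controls flexibility: a very small k tracks individual noisy points and risks overfitting, while a very large k smooths away real structure.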

> Non-parametric models are more flexible but hard to interpret