# What is Statistical Learning?

* Statistical learning refers to a vast set of tools for understanding data.
* These can be classified as supervised or unsupervised.
* If we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets.

$$Y = f(X) + \epsilon$$

* $f$ is some fixed but unknown function of $X_1, \dots, X_p$.
* $f$ represents the systematic information that $X$ provides about $Y$.
* $\epsilon$ is a random error term.
* f connects the input variable to the output

In essence, statistical learning refers to a set of approaches for estimating $f$.
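A minimal sketch of the model $Y = f(X) + \epsilon$, assuming a made-up linear $f$ and normally distributed errors (both are illustrative choices, since in practice $f$ is unknown):

```python
import random

def f(x):
    # Hypothetical "true" function, chosen only for illustration.
    return 2.0 + 3.0 * x

def simulate(n, noise_sd=1.0, seed=0):
    """Draw n observations (x, y) with y = f(x) + eps, eps ~ N(0, noise_sd^2)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        eps = rng.gauss(0.0, noise_sd)  # random error: independent of x, mean zero
        data.append((x, f(x) + eps))
    return data

data = simulate(1000)
# The realized errors y - f(x) should average out close to zero.
mean_error = sum(y - f(x) for x, y in data) / len(data)
print(round(mean_error, 3))
```

Statistical learning works in the opposite direction: given only `data`, recover an estimate of `f`.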

## Why Estimate f?

There are two main reasons that we may wish to estimate $f$: prediction and inference.

### Prediction

Since the error term averages to zero, we can predict Y using:
$$\hat Y = \hat f(X),$$

where $\hat f$ represents our estimate for $f$, and $\hat Y$ represents the resulting prediction of $Y$.

* The accuracy of $\hat Y$ as a prediction of $Y$ depends on two quantities: the reducible error and the irreducible error.
* The first error is reducible because we can potentially improve the accuracy of $\hat f$ by using the most appropriate statistical learning technique to estimate $f$.
* The second error is irreducible because $Y$ is also a function of $\epsilon$, which, by definition, cannot be predicted using $X$.

Why is the irreducible error larger than zero? The quantity ε may contain unmeasured variables that are useful in predicting $Y$. The quantity ε may also contain unmeasurable variation.

![image.png](attachment:81628747-fb74-4254-b891-8bf871ff25f0.png)


Here $E(Y - \hat Y)^2$ represents the average, or *expected value*, of the squared difference between the predicted and actual value of $Y$, and $\text{Var}(\epsilon)$ represents the variance associated with the error term $\epsilon$.

The irreducible error will always provide an upper bound on the accuracy of our prediction for $Y$.
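The split into reducible and irreducible error can be checked by simulation. The sketch below assumes a made-up true `f`, a deliberately imperfect estimate `f_hat`, and a fixed input `x`; none of these come from the text.

```python
import random

def f(x):
    # Hypothetical true function.
    return 2.0 + 3.0 * x

def f_hat(x):
    # Hypothetical imperfect estimate of f (intercept is off by 0.5).
    return 2.5 + 3.0 * x

rng = random.Random(1)
x = 4.0
noise_sd = 1.0
n_draws = 100_000

# Average squared prediction error over many draws of the error term.
sq_errors = []
for _ in range(n_draws):
    y = f(x) + rng.gauss(0.0, noise_sd)
    sq_errors.append((y - f_hat(x)) ** 2)

expected_sq_error = sum(sq_errors) / n_draws
reducible = (f(x) - f_hat(x)) ** 2   # shrinks as f_hat approaches f
irreducible = noise_sd ** 2          # Var(eps); stays no matter how good f_hat is

# The simulated average should be close to reducible + irreducible = 1.25.
print(round(expected_sq_error, 3), reducible + irreducible)
```

Even with `f_hat` equal to `f`, the average squared error would not drop below `irreducible`.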

Example (prediction): the company is not interested in obtaining a deep understanding of the relationships between each individual predictor and the response; instead, the company simply wants to accurately predict the response using the predictors.

### Inference Paradigm

We are often interested in understanding the association between $Y$ and $X_1, \dots, X_p$. In this situation we wish to estimate $f$, but our goal is not necessarily to make predictions for $Y$. Now $\hat f$ cannot be treated as a black box, because we need to know its exact form. In this setting one might be interested in answering the following questions:

* Which predictors are associated with the response?
* What is the relationship between the response and each predictor?
* Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

For the advertising data, one may be interested in answering questions such as:

* Which media are associated with sales?
* Which media generate the biggest boost in sales? or
* How large of an increase in sales is associated with a given increase in TV advertising?

------
* Linear models allow for relatively simple and interpretable inference, but may not yield as accurate predictions as some other approaches.
* In contrast, some of the highly non-linear approaches potentially provide accurate predictions for $Y$, but this comes at the expense of a less interpretable model for which inference is more challenging.

-----

## How Do We Estimate f?

* We will always assume that we have observed a set of $n$ different data points.
* These observations are called the *training data*.
* They are used to train, or teach, our method how to estimate $f$.
* In other words, we want to find a function $\hat f$ such that $Y \approx \hat f(X)$ for any observation $(X, Y)$.

* Most statistical learning methods for this task can be characterized as either parametric or non-parametric.

### Parametric Methods

Parametric methods involve a two-step model-based approach:

1. We make an assumption about the functional form, or shape, of f. For example f is linear in X:
$$f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p ~~~~ \text{      (2.4)}$$
    * So we only need to estimate the $p+1$ coefficients $\beta_0, \beta_1, \dots, \beta_p$

2. After a model has been selected, we need a procedure that uses the training data to *fit* or *train* the model:
$$Y \approx  \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p$$
    * The most common approach to fitting the model above is referred to as *least squares*.

It is generally much easier to estimate a set of parameters, such as $\beta_0,\beta_1,\dots,\beta_p$ in the linear model (2.4), than it is to fit an entirely arbitrary function $f$.
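The two-step parametric approach can be sketched for the simplest case of a single predictor ($p = 1$), where least squares has a closed-form solution. The data values below are made up for illustration.

```python
def fit_least_squares(xs, ys):
    """Return (beta0, beta1) minimizing sum((y - beta0 - beta1 * x)^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sample covariance of x and y divided by sample variance of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    # Intercept: forces the fitted line through (x_bar, y_bar).
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Step 1: assume f is linear in x. Step 2: fit the two coefficients.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
beta0, beta1 = fit_least_squares(xs, ys)
print(round(beta0, 2), round(beta1, 2))  # 1.09 1.97
```

The whole problem of estimating $f$ has been reduced to estimating two numbers.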

<div class="alert alert-block alert-warning">
More complex models can lead to a phenomenon known as overfitting the data, which essentially means they follow the errors, or noise, too closely.
</div>

### Non-Parametric Methods

* They do not make explicit assumptions about the functional form of f.
* They seek an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly.
* By avoiding the assumption of a particular functional form for $f$, they have the potential to accurately fit a wider range of possible shapes for $f$.
* Disadvantage: a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for $f$.
* Example: a thin-plate spline can be used to estimate $f$.
* In order to fit a thin-plate spline, the data analyst must select a level of smoothness.
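To illustrate the role of a smoothness level without implementing a thin-plate spline, here is a sketch of a different and much simpler non-parametric method: a Gaussian kernel smoother (Nadaraya-Watson). The `bandwidth` parameter and the training values are assumptions made for this example; `bandwidth` plays a role loosely comparable to the spline's smoothness setting.

```python
import math

def kernel_smoother(train, x0, bandwidth):
    """Estimate f(x0) as a weighted average of training responses.

    Weights are Gaussian in the distance from x0, so no functional
    form for f is assumed. Small bandwidth: rough, wiggly fit.
    Large bandwidth: very smooth fit.
    """
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x, _ in train]
    total = sum(weights)
    return sum(w * y for w, (_, y) in zip(weights, train)) / total

train = [(0.0, 0.1), (1.0, 0.8), (2.0, 0.9), (3.0, 0.2), (4.0, -0.7)]

rough = kernel_smoother(train, 2.0, bandwidth=0.1)      # nearly reproduces y at x = 2
smooth = kernel_smoother(train, 2.0, bandwidth=100.0)   # nearly the overall mean of y
print(round(rough, 3), round(smooth, 3))
```

With a tiny bandwidth the estimate chases individual points; with a huge one it flattens toward the mean response.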

## The Trade-Off Between Prediction Accuracy and Model Interpretability

![image.png](attachment:559ebe55-40ee-4fd4-9aa1-9ada4eca3065.png)

* Linear regression is a relatively inflexible approach, because it can only generate linear functions
* The thin plate splines are considerably more flexible because they can generate a much wider range of possible shapes to estimate f.

Which approach is preferable depends on your goal: inference or prediction.

* If we seek to develop an algorithm to predict the price of a stock, our sole requirement for the algorithm is that it predict accurately; interpretability is not a concern.

## Supervised vs Unsupervised Learning

![image.png](attachment:229be7a6-cf1c-4115-bc17-3c8d77ccfec5.png)

<div class="alert alert-block alert-warning">
Sometimes the question of whether an analysis should be considered supervised or unsupervised is less clear-cut, for instance when response measurements are available for only some of the observations. We refer to this setting as a semi-supervised learning problem.</div>

![image.png](attachment:255f78be-c302-4c96-a60a-54373a22baca.png)

## Regression vs. Classification Problems

* Variables can be characterized as either quantitative or qualitative (also known as categorical).
* Quantitative variables take on numerical values. Examples include a person’s age, height, or income ...
* In contrast, qualitative variables take on values in one of $K$ different classes, or categories. Examples include a person’s marital status (married or not), the brand of product purchased (brand A, B, or C), whether a person defaults on a debt (yes or no), or a cancer diagnosis (Acute Myelogenous Leukemia, Acute Lymphoblastic Leukemia, or No Leukemia).

<div class="alert alert-block alert-info">
We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems.</div>

<div class="alert alert-block alert-warning">
Despite its name, logistic regression is typically used as a classification method.
</div>

* Some statistical methods, such as K-nearest neighbors and boosting, can be used in the case of either quantitative or qualitative responses.

# Assessing Model Accuracy

* *There is no free lunch in statistics*: no one method dominates all others over all possible data sets.
* Hence it is an important task to decide for any given set of data which method produces the best results.

## Measuring the Quality of Fit

* We need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation.
* In the regression setting, the most commonly-used measure is the *mean squared error* (MSE), given by
$$MSE = \frac{1}{n}\sum_{i=1}^n(y_i - \hat f(x_i))^2, ~~~~~\text{  (2.5)}$$
    * where $\hat f(x_i)$ is the prediction that $\hat f$ gives for the $i$th observation.
* MSE will be small if the predicted responses are very close to the true responses
* MSE will be large if for some observations, the predicted and true responses differ substantially
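Equation (2.5) translates directly into a few lines of code. The function name `mse` and the sample values are chosen for this sketch.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted responses."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]

# Squared differences are 0.25, 0.0 and 1.0, so the MSE is 1.25 / 3.
print(mse(y_true, y_pred))
```

A single badly predicted observation (here the third) dominates the result, because the differences are squared.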

<mark>The MSE in (2.5) is computed using the training data that was used to
fit the model, and so should more accurately be referred to as the training
MSE.</mark>


* We are really not interested in whether $\hat f(x_i) \approx y_i$; instead, we want to know whether $\hat f(x_0)$ is approximately equal to $y_0$, where ($x_0, y_0$) is a previously *unseen test observation* not used to train the statistical learning method. 

<div class="alert alert-block alert-warning">
We want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE.</div>

Case to avoid: the training set MSE can be quite small, but the test MSE is often much larger.

![image.png](attachment:39e61625-7034-447b-8d38-601b8b63eae5.png)

When a given method yields a small training MSE but a large test MSE, we are said to be *overfitting* the data. This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function $f$.
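Overfitting is easy to reproduce on simulated data. The sketch below uses K-nearest-neighbors regression (a stand-in chosen for brevity, not the method in the figure), with a made-up true function and noise level. Smaller `k` means a more flexible fit.

```python
import math
import random

def f(x):
    # Hypothetical true function.
    return math.sin(x)

def make_data(n, rng, noise_sd=0.5):
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    return [(x, f(x) + rng.gauss(0.0, noise_sd)) for x in xs]

def knn_predict(train, x0, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def mse(data, train, k):
    """MSE on `data` of the KNN fit built from `train`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(100, rng)
test = make_data(500, rng)   # unseen observations, not used for fitting

results = {}
for k in (1, 10, 50):
    results[k] = (mse(train, train, k), mse(test, train, k))
    print(f"k={k:2d}  training MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With `k=1` every training point is its own nearest neighbor, so the training MSE is exactly zero, yet the test MSE is not: the fit has memorized the noise.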

## The Bias-Variance Trade-Off

* The U-shape observed in the test MSE curve above is the result of two competing properties of statistical learning methods.
* The expected test MSE, for a given value $x_0$, can always be decomposed into the sum of three fundamental quantities: the variance of $\hat f(x_0)$, the squared bias of $\hat f(x_0)$, and the variance of the error terms $\epsilon$.

![image.png](attachment:ed2c0dda-bde6-4233-b38f-97b525b4957a.png)

* The expected test MSE $E(y_0 - \hat f(x_0))^2$ is the average test MSE that would be obtained by repeatedly estimating $f$ using a large number of training sets and testing each at $x_0$.

<div class="alert alert-block alert-info">
Equation 2.7 tells us that in order to minimize the expected test error,
we need to select a statistical learning method that simultaneously achieves low variance and low bias.</div>



*Variance* refers to the amount by which $\hat f$ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different $\hat f$.
* Ideally the estimate for $f$ should not vary too much between training sets.
* If a method has high variance then small changes in the training data can result in large changes in $\hat f$.
* In general, more flexible statistical methods have higher variance.
* The flexible green curve in Figure 2.9 follows the observations very closely, so it has high variance.
* The orange least squares line is relatively inflexible and has low variance.


*Bias* refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

* As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease.

* The relationship between bias, variance, and test set MSE given in Equation 2.7 is referred to as the *bias-variance trade-off*.

This is referred to as a trade-off because it is easy to obtain a method with extremely low bias but high variance (for instance, by drawing a curve that passes through every single training observation) or a method with very low variance but high bias (by fitting a horizontal line to the data). The challenge lies in finding a method for which both the variance and the squared bias are low.
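The two extremes mentioned above can be simulated. The sketch below repeatedly draws training sets, fits (a) a 1-nearest-neighbor rule, which passes through every training observation, and (b) a horizontal line at the training mean, and then estimates variance, squared bias, and expected test MSE at a single point. The true function, noise level, and evaluation point `X0` are assumptions made for this example.

```python
import math
import random
import statistics

NOISE_SD = 0.5   # assumed standard deviation of the error term
X0 = 1.5         # assumed test point

def f(x):
    # Hypothetical true function.
    return math.sin(x)

def make_train(n, rng):
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    return [(x, f(x) + rng.gauss(0.0, NOISE_SD)) for x in xs]

def one_nn(train, x0):
    """Very flexible: return the response of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x0))[1]

def horizontal_line(train, x0):
    """Very inflexible: ignore x0 and return the mean training response."""
    return statistics.fmean(y for _, y in train)

def decompose(method, n_sets=5000, n=30, seed=0):
    """Estimate (variance, squared bias, expected test MSE) of method at X0."""
    rng = random.Random(seed)
    preds, sq_errors = [], []
    for _ in range(n_sets):
        pred = method(make_train(n, rng), X0)       # refit on a fresh training set
        y0 = f(X0) + rng.gauss(0.0, NOISE_SD)       # fresh test response at X0
        preds.append(pred)
        sq_errors.append((y0 - pred) ** 2)
    variance = statistics.pvariance(preds)
    bias_sq = (statistics.fmean(preds) - f(X0)) ** 2
    return variance, bias_sq, statistics.fmean(sq_errors)

var_nn, bias_nn, mse_nn = decompose(one_nn)
var_line, bias_line, mse_line = decompose(horizontal_line)
print("1-NN:", round(var_nn, 3), round(bias_nn, 3), round(mse_nn, 3))
print("line:", round(var_line, 3), round(bias_line, 3), round(mse_line, 3))
```

In each case the estimated test MSE should be close to variance + squared bias + `NOISE_SD ** 2`, which is the decomposition in Equation 2.7.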

<div class="alert alert-block alert-warning">
To take an extreme example, suppose that the true f is linear. In this situation linear regression will have no bias, making it very hard for a more flexible method to compete. In contrast, if the true f is highly non-linear and we have an ample number of training observations, then we may do better using a highly flexible approach.
</div>

## The Classification Setting

Suppose that we seek to estimate $f$ on the basis of training observations $\{(x_1, y_1), \dots, (x_n, y_n)\}$, where now $y_1, \dots, y_n$ are qualitative. The most common approach for quantifying the accuracy of our estimate $\hat f$ is the *training error rate*, the proportion of mistakes that are made if we apply $\hat f$ to the training observations:

![image.png](attachment:8e854adc-b546-4808-b6ca-1fbe5471b7b1.png)

* Here $\hat y_i$ is the predicted class label for the ith observation using $\hat f$. 
* $I(y_i \neq \hat y_i)$ is an indicator variable that equals 1 if $y_i \neq \hat y_i$ and 0 if $y_i = \hat y_i$. A value of 0 means that the $i$th observation was classified correctly.
* The test error rate associated with a set of test observations of the form $(x_0, y_0)$ is given by 

![image.png](attachment:21a5c78b-44b1-4ae1-8602-592965dd6516.png)

* A good classifier is one for which the test error rate is smallest.
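The error rate is the average of the indicator $I(y_i \neq \hat y_i)$, which can be sketched as follows. The function name and labels are made up for the example; the same function gives the training or the test error rate depending on which observations are passed in.

```python
def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true one."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mistakes = sum(1 for y, p in zip(y_true, y_pred) if y != p)  # sum of I(y != y_hat)
    return mistakes / len(y_true)

y_true = ["blue", "orange", "blue", "blue"]
y_pred = ["blue", "blue", "blue", "orange"]

# Observations 2 and 4 are misclassified: 2 mistakes out of 4.
print(error_rate(y_true, y_pred))  # 0.5
```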


### The Bayes Classifier

![image.png](attachment:3562c822-404e-4cf1-b77f-bee040e9daa1.png)

The orange shaded region reflects the set of points for which Pr(Y = orange|X) is greater than 50%, while the blue shaded region indicates the set of points for which the probability is below 50%.

* The line along which the probability is exactly 50% is called the *Bayes decision boundary*.

In general, the overall Bayes error rate is given by
 
 ![image.png](attachment:abc9cb32-6c4b-471e-81e2-049952f72501.png)
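When the conditional probabilities are known, the Bayes error rate can be computed directly as one minus the average of the largest class probability at each $X$. The sketch below assumes a toy setting in which $X$ takes only three values; all probabilities are invented for illustration.

```python
# Assumed distribution of X over three possible values.
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}

# Assumed conditional class probabilities Pr(Y = class | X = x).
p_y_given_x = {
    "a": {"orange": 0.9, "blue": 0.1},
    "b": {"orange": 0.4, "blue": 0.6},
    "c": {"orange": 0.2, "blue": 0.8},
}

def bayes_classify(x):
    """Assign x to the class with the largest conditional probability."""
    probs = p_y_given_x[x]
    return max(probs, key=probs.get)

def bayes_error_rate():
    """1 - E[max_j Pr(Y = j | X)], with the expectation taken over X."""
    expected_max = sum(p_x[x] * max(p_y_given_x[x].values()) for x in p_x)
    return 1.0 - expected_max

# Expected max probability: 0.5 * 0.9 + 0.3 * 0.6 + 0.2 * 0.8 = 0.79.
print(bayes_classify("b"), round(bayes_error_rate(), 2))  # blue 0.21
```

No classifier can do better than this rate on average, since it is the error that remains even when the true conditional probabilities are used.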

### K-Nearest Neighbors

In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of $Y$ given $X$, and so computing the Bayes classifier is impossible.

<div class="alert alert-block alert-info">
The Bayes classifier serves as an unattainable gold standard against which to compare other methods.</div>

KNN: Given a positive integer $K$ and a test observation $x_0$, the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$, represented by $N_0$. It then estimates the conditional probability for class $j$ as the fraction of points in $N_0$ whose response values equal $j$:

![image.png](attachment:d35c9a14-922d-4396-89d8-730067a49d02.png)

Finally, KNN classifies the test observation $x_0$ to the class with the largest probability from (2.12).

![image.png](attachment:b2c96d83-cd92-4656-abf0-244f22a9dbcd.png)

* With $K = 3$, the three training points closest to the black cross consist of two blue points and one orange point, resulting in estimated probabilities of 2/3 for the blue class and 1/3 for the orange class. Hence KNN will predict that the black cross belongs to the blue class.
* In the right-hand panel of Figure 2.14 we have applied the KNN approach with K = 3 at all of the possible values for X1 and X2, and have drawn in the corresponding KNN decision boundary.
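The KNN procedure can be sketched in a few lines. The coordinates below are invented so that the three nearest neighbors of the test point are two blue points and one orange point, mirroring the example above; Euclidean distance is an assumption of this sketch.

```python
import math
from collections import Counter

def knn_probabilities(train, x0, k):
    """Estimate Pr(Y = j | X = x0) as the class fractions among the k nearest points."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], x0))[:k]  # the set N_0
    counts = Counter(label for _, label in nearest)
    return {label: count / k for label, count in counts.items()}

def knn_classify(train, x0, k):
    """Assign x0 to the class with the largest estimated probability."""
    probs = knn_probabilities(train, x0, k)
    return max(probs, key=probs.get)

# Hypothetical training data: ((x1, x2), class label).
train = [
    ((1.0, 1.0), "blue"), ((1.5, 2.0), "blue"), ((2.0, 1.0), "orange"),
    ((4.0, 4.0), "orange"), ((5.0, 3.5), "orange"), ((0.0, 4.0), "blue"),
]
x0 = (1.5, 1.2)   # plays the role of the black cross

probs = knn_probabilities(train, x0, k=3)
print(probs, knn_classify(train, x0, k=3))
```

Here the estimated probabilities are 2/3 for blue and 1/3 for orange, so the point is classified as blue.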

<div class="alert alert-block alert-info">
Despite the fact that it is a very simple approach, KNN can often produce classifiers that are surprisingly close to the optimal Bayes classifier.</div>

![image.png](attachment:5b901217-8128-451e-9eed-8bdef2329d65.png)

<div class="alert alert-block alert-info">
As K grows, the method becomes less flexible and produces a decision boundary that is close to linear. This corresponds to a low-variance but high-bias classifier.
</div>

<div class="alert alert-block alert-warning">
In both the regression and classification settings, choosing the correct level of flexibility is critical to the success of any statistical learning method. The bias-variance tradeoff, and the resulting U-shape in the test error, can make this a difficult task. 
</div>