# Chapter 2

## 2.1 Introduction

- We are normally given inputs and outputs. Our goal is to use the inputs to predict the outputs.

- Inputs are more commonly referred to as predictors, features, or independent variables.

- Outputs are commonly referred to as responses or dependent variables.

- This setup is called supervised learning because the observed outputs guide the learning process.

## 2.2 Variable Type and Terminology

- Quantitative variables take numeric values.

- Qualitative (categorical) variables take values in a set of categories without a specific ordering.

- Regression is when we predict a quantitative output.

- Classification is when we predict a qualitative output.

- The last type is ordered categorical, where there is an ordering to the categories, like small, medium, and large. Small is closer to medium than to large, so the ordering matters.

- Dealing with categorical inputs: if there are only two categories, we can use a binary coding such as $(0,1)$. If there are $K$ categories, we can use dummy variables: a vector of $K$ binary values in which exactly one entry is 1 (the one for the observed category) and the rest are 0.

- $X$ normally denotes an input, $Y$ a quantitative output, and $G$ a qualitative output.

- Observed values are written in lower case, so the $i$th observed value of $X$ is $x_i$ (a scalar or a $p$-vector). The $N$ observations of a single variable $X_j$ form a column of **X**, and its length is $N$.

- Matrices are represented by bold capital letters like **X**. Their dimensions are written rows $\times$ columns, so a set of $N$ input $p$-vectors $x_i$, $i = 1, \dots, N$, is an $N \times p$ matrix.

- All vectors are assumed to be column vectors. Hence the $i$th row of **X** is $x_i^T$, the transpose of the $i$th observation.

- We can now make predictions from our input data that approximate the output data. The predicted output is written $\hat{Y}$, or $\hat{G}$ for categorical outputs.

- If the data has two output classes, we can code them as a binary target (0 or 1), treat that target as a quantitative output, and pick a threshold to assign the class: for example, $G_1$ if $\hat{y} < 0.5$ and $G_2$ otherwise.
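
The dummy-variable coding and the thresholding rule from the bullets above can be sketched in a few lines of NumPy. The function names, the colour categories, and the fitted values are made up for illustration, and the direction of the cutoff (class 1 when the fitted value exceeds 0.5) is an assumption.

```python
import numpy as np

def dummy_encode(labels, categories):
    """Return an N x K matrix of 0/1 indicators, one column per category."""
    index = {c: j for j, c in enumerate(categories)}
    out = np.zeros((len(labels), len(categories)), dtype=int)
    for i, label in enumerate(labels):
        out[i, index[label]] = 1   # exactly one 1 per row
    return out

def threshold_classify(y_hat, cutoff=0.5):
    """Assign class 1 when the fitted value exceeds the cutoff, else class 0."""
    return (np.asarray(y_hat) > cutoff).astype(int)

colors = ["red", "blue", "green", "red"]            # made-up qualitative input
encoded = dummy_encode(colors, ["red", "green", "blue"])
classes = threshold_classify([0.1, 0.7, 0.5, 0.93])  # made-up fitted values
```

Each row of `encoded` sums to 1, which is what distinguishes dummy coding from an arbitrary numeric coding of the categories.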

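## 2.3 Two Simple Approaches to Prediction: Least Squares and K Nearest Neighbors

- Least Squares: makes strong assumptions about structure; predictions are stable but can be inaccurate.

- K Nearest Neighbors: makes mild assumptions about structure; predictions can be extremely accurate but are unstable.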

### 2.3.1 Linear Model and Least Squares

- The term $B_0$ is known as the intercept; in machine learning it is called the bias. If it is not included, the fitted line is forced through the origin.

- The linear model can be written as: $$ \widehat{Y} = \widehat{B}_0 + \sum_{j=1}^p X_j \widehat{B}_j$$

- If we include the constant 1 in $X$ and $\hat{B}_0$ in the coefficient vector $\hat{B}$, we can write it as an inner product $$ \hat{Y} = X^T \hat{B}$$

- Viewed as a function $f(X) = X^T B$, the gradient is $f'(X) = B$, a vector in input space that points in the steepest uphill direction.



#### Least Squares

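- The goal is to minimize the RSS (Residual Sum of Squares). $$\text{RSS}(B) = \sum_{i=1}^{N}(y_i - x_i^TB)^2 $$ This is a quadratic function of $B$, so a minimum always exists, though it may not be unique.

- Another way to write it is in matrix notation, where $X$ is the $N \times p$ input matrix and $y$ is the $N$-vector of outputs. $$\text{RSS}(B) = (y -XB)^T(y-XB)$$

- Differentiating with respect to $B$ and setting the result to zero gives the normal equations $$ X^T(y - XB) = 0$$

- If $X^TX$ is nonsingular, the unique solution is $$\hat{B} = (X^TX)^{-1}X^Ty$$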
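
As a minimal sketch of the least squares fit, the normal equations $X^TX B = X^Ty$ can be solved directly with NumPy. The helper names and the small noiseless data set are made up for illustration; a column of ones is prepended so the first coefficient plays the role of the intercept.

```python
import numpy as np

def least_squares(X, y):
    """Fit B_hat by solving the normal equations X^T X B = X^T y."""
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend the constant 1
    # solve() avoids forming the inverse explicitly; it raises LinAlgError
    # if X^T X is exactly singular.
    return np.linalg.solve(X1.T @ X1, X1.T @ y)

def predict(X, B_hat):
    """Return the fitted values X^T B_hat for each row of X."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ B_hat

# Made-up noiseless data generated from y = 1 + 2*x1 - 3*x2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
B_hat = least_squares(X, y)
```

Because the data are noiseless and exactly linear, `B_hat` recovers the generating coefficients up to rounding. With noisy or nearly collinear inputs, `np.linalg.lstsq` is usually a more robust choice than solving the normal equations.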

#### K Nearest Neighbors Methods

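- $$ \hat{Y}(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i$$ where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample.

- So we use Euclidean distance to find the $k$ training points closest to $x$, then average their responses.

- Least squares has $p$ parameters, while k nearest neighbors appears to have a single parameter, the number of neighbors $k$.

- $k = 1$ always gives the lowest error on the training data, since each training point is its own nearest neighbor, so training error cannot be used to choose $k$. The effective number of parameters of k nearest neighbors is $N/k$, which is generally larger than $p$ and decreases as $k$ increases. Cross-validation is a good way to choose $k$.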
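
A minimal sketch of the k nearest neighbor prediction described above: compute Euclidean distances to the query point, keep the $k$ closest training points, and average their responses. The function name and the toy data are made up for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k):
    """Average the responses of the k training points closest to x0."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances to x0
    nearest = np.argsort(dists)[:k]                 # indices of the k closest points
    return y_train[nearest].mean()

# Made-up one-dimensional training data
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0, 100.0])

pred_k1 = knn_predict(X_train, y_train, np.array([1.0]), k=1)
pred_k3 = knn_predict(X_train, y_train, np.array([1.0]), k=3)
```

With `k=1` the query at a training point returns that point's own response, which is why training error is always smallest at $k = 1$. With `k=3` the prediction is the mean of the responses at 0, 1, and 2.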

### 2.3.2 From Least Squares to K Nearest Neighbors

- Least squares has a smooth, stable decision boundary that depends on all the points. It therefore has low variance but potentially high bias, because it relies on the assumption that the boundary is linear.

- The K Nearest Neighbor decision boundary at any location relies on only a handful of nearby points. It therefore has high variance but low bias, and it is much more unstable. It makes no strong assumption about the data, and the boundary can change considerably with $k$.



#### Alternative to KNN

- More complex methods build on K Nearest Neighbors, such as kernel methods, which use weights that decrease smoothly to zero with distance from the target point rather than the 0/1 weights used by KNN.

- In high-dimensional spaces the kernels are modified to emphasize some variables more than others.

#### Alternative to Least Squares

- Local regression fits linear models by locally weighted least squares instead of fitting constants locally.

- Linear models can be fit to a basis expansion of the original variables.

- Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.

## 2.4 Statistical Decision Theory

- K Nearest Neighbors can approximate the conditional expectation very well if $N$ is extremely large. However, we usually don't have enough data for this and settle for a more structured approach such as linear regression.

- Also, if $p$ is too large, KNN fails as well, because the nearest neighbors are no longer close to the target point.

- KNN assumes the regression function is well approximated by an average over local points, while linear regression assumes it is well approximated by a globally linear function.

- The best possible method for classification is the Bayes classifier, whose decision boundary is the Bayes decision boundary. It uses the conditional probability $\Pr(G \mid X = x)$ and assigns the point to the most probable class. KNN is a relaxed version of this: it estimates the conditional probability from the classes of the points around $x$.
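
To make the Bayes classifier concrete, here is a minimal sketch that assumes the class densities are known, which is never the case in practice. The two one-dimensional Gaussian classes, their means, standard deviations, and equal priors are all made up for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def bayes_classify(x, means, sds, priors):
    """Return the index of the class with the largest posterior probability at x."""
    # The posterior is proportional to prior * class-conditional density.
    scores = [p * gaussian_pdf(x, m, s) for m, s, p in zip(means, sds, priors)]
    return int(np.argmax(scores))

# Hypothetical known class densities: N(0, 1) and N(2, 1) with equal priors
means, sds, priors = [0.0, 2.0], [1.0, 1.0], [0.5, 0.5]
label_left = bayes_classify(0.9, means, sds, priors)
label_right = bayes_classify(1.1, means, sds, priors)
```

With equal variances and equal priors the Bayes decision boundary sits at the midpoint of the two means, here $x = 1$, so the two query points land on opposite sides of it.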

## 2.5 Local Methods in High Dimensions

- The two algorithms learned so far break down in high dimensions, that is, when $p$ is large compared to the amount of data.

- The problem is that in this many dimensions a neighborhood must span a large part of the input space to capture a fixed fraction of the data, so it is no longer local. We can shrink the neighborhood by reducing $k$, but then the variance of the fit rises sharply.

- Another problem is that most points fall closer to the edge of the sample than to any other data point. Predicting near the edge is harder than predicting around the center, because we have to extrapolate rather than interpolate.

- Also, as the dimension grows it becomes extremely hard to keep the same sampling density, because the required sample size grows exponentially in $p$: if $N$ points are dense enough in one dimension, $N^p$ points are needed in $p$ dimensions.

- We derived MSE = Variance + $\text{Bias}^2$ by:
  1. Adding and subtracting $E(\hat{y})$ inside the squared error.
  2. Expanding the square.
  3. Eliminating the cross term by linearity of expectation.
  4. Reducing what remains to Variance and $\text{Bias}^2$.
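
The four steps above can be written out as follows, for a fixed target value $f(x_0)$ and an estimate $\hat{y}_0$ (notation chosen here for illustration):

$$
\begin{aligned}
\text{MSE}(x_0) &= E\left[(\hat{y}_0 - f(x_0))^2\right] \\
&= E\left[(\hat{y}_0 - E(\hat{y}_0) + E(\hat{y}_0) - f(x_0))^2\right] \\
&= E\left[(\hat{y}_0 - E(\hat{y}_0))^2\right] + 2\,(E(\hat{y}_0) - f(x_0))\,E\left[\hat{y}_0 - E(\hat{y}_0)\right] + (E(\hat{y}_0) - f(x_0))^2 \\
&= \text{Var}(\hat{y}_0) + \text{Bias}^2(\hat{y}_0)
\end{aligned}
$$

The cross term vanishes because $E[\hat{y}_0 - E(\hat{y}_0)] = 0$, and $E(\hat{y}_0) - f(x_0)$ is a constant that can be pulled out of the expectation.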

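- The key to curing the curse of dimensionality is an extremely large sample size $N$ or an extremely small noise variance $\sigma^2$. This is because, when the linear model is correct, the expected prediction error of least squares is approximately $$\sigma^2 \frac{p}{N} + \sigma^2$$ so the part that grows with $p$ is negligible when $N$ is large or $\sigma^2$ is small.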
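
The claims in this section are easy to check numerically. The sketch below uses two standard formulas that these notes do not state, so treat them as assumptions: a sub-cube holding a fraction $r$ of a unit cube's volume has side length $r^{1/p}$, and the median distance from the origin to the closest of $N$ points uniform in the unit $p$-ball is $(1 - (1/2)^{1/N})^{1/p}$. The last function computes the average of $x_i^T(X^TX)^{-1}x_i$ over the training rows, which equals $p/N$; this is an in-sample analogue of the $p/N$ factor in the prediction error, not the out-of-sample expectation itself.

```python
import numpy as np

def edge_length(r, p):
    """Side of a sub-cube holding a fraction r of the unit cube's volume in p dimensions."""
    return r ** (1.0 / p)

def median_nearest_distance(p, N):
    """Median distance from the origin to the closest of N uniform points in the unit p-ball."""
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

def mean_leverage(X):
    """Average of x_i^T (X^T X)^{-1} x_i over the rows of a full-rank N x p matrix X."""
    leverages = np.einsum("ij,ij->i", X @ np.linalg.inv(X.T @ X), X)
    return leverages.mean()

side_10d = edge_length(0.01, 10)                 # side needed to capture 1% of the data
median_dist = median_nearest_distance(10, 500)   # p = 10, N = 500

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                     # made-up inputs, N = 50, p = 5
avg_leverage = mean_leverage(X)
```

In ten dimensions, capturing 1% of uniformly distributed data takes a cube covering about 63% of the range of each input, and with 500 points the nearest neighbor of the origin is typically more than halfway to the boundary.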



## 2.6 Statistical Models, Supervised Learning, and Function Approximations

* We can use other classes of functions if they are known to describe the data well. This matters especially in high dimensions.

### 2.6.1 A Statistical Model for the Joint Distribution $\Pr(X,Y)$

* The additive error model assumes that the random error $e$ has $E(e) = 0$ and is independent of $X$.

$$ Y = f(X) + e$$

### 2.6.2 Supervised Learning

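* The algorithm learns $f$ by example, for instance by least squares.

* It learns from the input/output examples in the training set, adjusting its predictions in response to the differences between the observed and predicted outputs.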


### 2.6.3 Function Approximation

* Linear basis expansion, where we use multiple functions of the input to describe our model

$$ f_{\theta}(x) = \sum_{k=1}^K h_k(x) \theta_k$$

* Here $h_k$ can be any function of the input; one common choice is the sigmoid transformation.

$$ h_k(x) = \frac {1}{1 + \exp(-x^TB_k)}$$

* We can estimate $\theta$ by minimizing the RSS, summing the squared error of the fitted function over every training point.

$$\text{RSS}(\theta) = \sum_{i=1}^N (y_i - f_{\theta}(x_i))^2$$

* In some cases RSS doesn't make sense, for example in a classification or density estimation problem. In these cases we use maximum likelihood estimation, choosing $\theta$ to maximize the log-likelihood.

$$L(\theta) = \sum_{i=1}^N \log \Pr_{\theta}(y_i) $$
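
A minimal sketch of fitting a linear basis expansion by minimizing RSS. It assumes the directions $B_k$ inside the sigmoids are fixed in advance (the values below are hypothetical), so the model is linear in $\theta$ and ordinary least squares applies. The responses are built noiselessly from the basis, so the fit should recover the generating coefficients.

```python
import numpy as np

def sigmoid_basis(X, B):
    """Column k holds h_k(x_i) = 1 / (1 + exp(-x_i^T B_k)) for column B_k of B."""
    return 1.0 / (1.0 + np.exp(-(X @ B)))

def fit_theta(H, y):
    """Minimize RSS(theta) = sum_i (y_i - sum_k h_k(x_i) theta_k)^2."""
    theta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return theta

def rss(H, y, theta):
    """Residual sum of squares of the fitted expansion."""
    resid = y - H @ theta
    return float(resid @ resid)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                        # made-up inputs
B = np.array([[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]])   # hypothetical fixed directions B_k
H = sigmoid_basis(X, B)
theta_true = np.array([1.0, -2.0, 0.5])
y = H @ theta_true                                  # noiseless responses
theta_hat = fit_theta(H, y)
```

If the $B_k$ were also unknown, the problem would no longer be linear in the parameters and would need an iterative optimizer instead of a single least squares solve.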

## 2.7 Structured Regression Models

* We might not want an extremely flexible method if a more rigid, structured method can be more accurate.

* This applies especially in high dimensions, and sometimes in low dimensions too.

### 2.7.1 Difficulty of the Problem

* We sometimes want to restrict the problem to make the solution more accurate. Adding, removing, or changing a constraint does not remove the ambiguity, however; it moves it to the choice of constraint.

* To fix this, we develop models that rely on neighborhoods that are dense, with a large number of close points.

* This works well, even in high dimensions, if the variance is small enough and the sample size is large enough.

* Our goal in this book is to control the fit using these different constraints and parameters to make accurate models. However, neighborhoods are only useful when they are tight and dense, with many points.

## 2.8 Classes of Restricted Estimators

* These classes are not mutually exclusive.

* Each class restricts the model in a different way, and each has one or more parameters that control the effective size of the local neighborhood.

### 2.8.1 Roughness Penalty and Bayesian Methods

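* We penalize the RSS with a roughness penalty $J(f)$. The rougher the curve, the larger the penalty, and $\lambda \ge 0$ controls how much the penalty counts.

$$ \text{PRSS}(f ; \lambda) = \text{RSS}(f) + \lambda J(f) $$

* For a one-dimensional input, penalizing the squared second derivative gives the cubic smoothing spline criterion.

$$ \text{PRSS}(f ; \lambda) = \sum_{i=1}^N (y_i - f(x_i))^2 + \lambda \int [f^{''}(x)]^2 dx $$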
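
A minimal sketch of a roughness penalty in action. This is a discrete stand-in rather than a true smoothing spline: it assumes equally spaced inputs with unit spacing and replaces the integral of $[f''(x)]^2$ with the sum of squared second differences of the fitted values. Setting the gradient of the penalized criterion to zero gives the linear system $(I + \lambda D^TD) f = y$, where $D$ is the second-difference matrix.

```python
import numpy as np

def penalized_fit(y, lam):
    """Minimize sum_i (y_i - f_i)^2 + lam * sum_i (f_i - 2 f_{i+1} + f_{i+2})^2."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)      # (n-2) x n second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

def roughness(f):
    """Sum of squared second differences, the discrete analogue of J(f)."""
    return float(np.sum(np.diff(f, n=2) ** 2))

y_line = np.arange(6.0)                               # exactly linear data
y_zigzag = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])   # made-up rough data

fit_line = penalized_fit(y_line, lam=10.0)
fit_zigzag = penalized_fit(y_zigzag, lam=5.0)
```

A straight line has zero second differences, so it passes through unchanged for any $\lambda$, while the zigzag is pulled toward a smoother curve as $\lambda$ grows.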

### 2.8.2 Kernel methods and Local Regression

* The local neighborhood is defined by a kernel function $K_{\lambda}(x_0, x)$, which assigns weights to points $x$ in a region around $x_0$. A common choice is the Gaussian kernel.

$$ K_{\lambda}(x_0, x) = \frac{1}{\lambda} \exp\left[-\frac{||x - x_0||^2}{2 \lambda}\right]$$

* The simplest form of kernel estimate is the Nadaraya-Watson weighted average

$$\hat{f}(x_0) =  \frac{\sum_{i=1}^NK_{\lambda}(x_0, x_i) y_i}{\sum_{i=1}^NK_{\lambda}(x_0, x_i)} $$

* Define a local regression estimate of $f(x_0)$ as $f_{\hat{\theta}}(x_0)$, where $\hat{\theta}$ minimizes

$$\text{RSS}(f_\theta, x_0) = \sum_{i=1}^N K_{\lambda}(x_0, x_i)(y_i - f_{\theta}(x_i))^2 $$

* With $f_{\theta}(x) = \theta_0$, a constant, we get the Nadaraya-Watson estimate.

* With $f_{\theta}(x) = \theta_0 + \theta_1 x$ we get local linear regression.
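
The two local fits above can be sketched as follows, using the Gaussian kernel from this section. The function names and toy data are made up for illustration, and a single bandwidth `lam` is assumed for all points. The constant fit is the kernel-weighted average; the linear fit solves a weighted least squares problem centered on the query point.

```python
import numpy as np

def gaussian_kernel(x0, X, lam):
    """K_lambda(x0, x_i) = (1/lam) * exp(-||x_i - x0||^2 / (2 * lam)) for each row x_i."""
    sq_dist = np.sum((X - x0) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * lam)) / lam

def nadaraya_watson(X_train, y_train, x0, lam):
    """Kernel-weighted average of the responses (local constant fit)."""
    w = gaussian_kernel(x0, X_train, lam)
    return np.sum(w * y_train) / np.sum(w)

def local_linear(X_train, y_train, x0, lam):
    """Kernel-weighted least squares fit of theta_0 + theta_1^T x, evaluated at x0."""
    w = gaussian_kernel(x0, X_train, lam)
    X1 = np.column_stack([np.ones(len(X_train)), X_train])
    W = np.diag(w)
    theta = np.linalg.solve(X1.T @ W @ X1, X1.T @ W @ y_train)
    return np.concatenate([[1.0], x0]) @ theta

# Made-up data: a symmetric pair for the average, an exact line for the linear fit
nw_value = nadaraya_watson(np.array([[-1.0], [1.0]]), np.array([0.0, 2.0]),
                           np.array([0.0]), lam=1.0)
X_line = np.array([[0.0], [1.0], [2.0], [3.0]])
y_line = 2.0 * X_line[:, 0] + 1.0
ll_value = local_linear(X_line, y_line, np.array([1.5]), lam=1.0)
```

The two training points in the first example are equidistant from the query, so they get equal weight and the estimate is their mean. The local linear fit reproduces exactly linear data for any bandwidth.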


### 2.8.3 Basis Function and Dictionary Methods

* Use polynomials and a variety of other flexible basis functions.

$$ f_{\theta}(x) = \sum_{k=1}^K h_k(x) \theta_k$$

* These methods are called linear basis expansions. They are linear because the model is linear in the parameters $\theta$, even though each $h_k$ can be a nonlinear function of $x$.

* Splines fit this form using piecewise polynomial basis functions, with the pieces joined at knots.

* Radial basis functions are symmetric $p$-dimensional kernels located at particular centroids $\mu_m$.

$$ f_{\theta}(x) = \sum_{m=1}^M K_{\lambda_m}(\mu_m, x) \theta_m$$

* The Gaussian kernel is a popular choice, but alternative kernels can be used here.

* Lastly, a single-layer neural network can be seen as an adaptive basis function method, in which the basis functions are learned from the data rather than fixed in advance.

$$f_{\theta}(x) = \sum_{m=1}^M B_m \sigma (\alpha_m^Tx + b_m) $$

* Here $\sigma$ is the sigmoid activation function and each $\alpha_m$ is a direction onto which $x$ is projected.
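
A minimal sketch of a radial basis function expansion. It makes two simplifying assumptions not in the notes: the centroids $\mu_m$ are fixed in advance (the values below are hypothetical) and all kernels share one width `lam` instead of a separate $\lambda_m$ each. Under those assumptions the model is linear in $\theta$ and can be fit by least squares.

```python
import numpy as np

def rbf_features(X, centers, lam):
    """Column m holds K_lam(mu_m, x_i) = (1/lam) * exp(-||x_i - mu_m||^2 / (2 * lam))."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * lam)) / lam

X = np.linspace(0.0, 1.0, 21).reshape(-1, 1)    # made-up inputs on a grid
centers = np.array([[0.0], [0.5], [1.0]])       # hypothetical fixed centroids
Phi = rbf_features(X, centers, lam=0.1)

theta_true = np.array([0.3, -0.2, 0.5])
y = Phi @ theta_true                             # noiseless responses
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

Treating the centroids and widths as parameters to be estimated as well turns this into a nonlinear optimization problem, which is the sense in which such bases are adaptive.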



## 2.9 Model Selection and Bias-Variance Tradeoff

* It is difficult to pick the right tuning parameters for a model, such as the $\lambda$ multiplying the penalty term, the width of the kernel, or the number of basis functions.

* It is also difficult to decide how closely to fit the model to the training set. This is where the bias-variance tradeoff comes in.

* The more closely we fit the training data, the closer the average of our predictions gets to the true values, thus lowering bias.

* However, individual predictions then spread further from their average, so the variance rises.

* We must find the balance of bias and variance that minimizes the error on new data, not on the training data.
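
The tradeoff can be made concrete for k nearest neighbors. The sketch below assumes the training inputs are fixed, the noise has variance $\sigma^2$, and the true regression function is known (here a made-up $f(x) = x^2$). Under those assumptions the squared bias at a point $x_0$ is $[f(x_0) - \frac{1}{k}\sum_{\ell=1}^k f(x_{(\ell)})]^2$, where $x_{(\ell)}$ are the $k$ nearest training inputs, and the variance is $\sigma^2 / k$.

```python
import numpy as np

def knn_bias_variance(x_train, f, x0, k, sigma2):
    """Squared bias and variance of the k-NN estimate at x0 for fixed training inputs."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]        # k closest training inputs
    bias2 = (f(x0) - f(x_train[nearest]).mean()) ** 2     # averaging blurs f near x0
    variance = sigma2 / k                                 # averaging k noisy responses
    return bias2, variance

def f(x):
    """Hypothetical true regression function."""
    return x ** 2

x_train = np.linspace(0.0, 1.0, 11)
results = {k: knn_bias_variance(x_train, f, 0.5, k, sigma2=1.0) for k in (1, 3, 11)}
```

As $k$ grows, the variance falls like $1/k$ while the squared bias rises, because the average reaches further from $x_0$. The best $k$ balances the two.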