# Chapter 2

## 2.1 Introduction

- We are normally given inputs and outputs. Our goal is to use the inputs to predict the outputs.

- Inputs are more commonly referred to as the predictors or independent variables.

- The outputs are commonly referred to as the responses or dependent variables.

- This setup is called supervised learning since the observed outputs guide the learning process.

## 2.2 Variable Type and Terminology

- Quantitative variables are numeric measurements.

- Qualitative variables are categories without a specific ordering.

- Regression is when we predict quantitative variables for the output

- Classification is when we predict qualitative variables for the output

- The last type is ordered categorical, where the categories have an ordering, like small, medium, and large. The ordering matters (small is closer to medium than to large), but no metric on the differences is implied.

- Dealing with categorical inputs: if there are only two options, use a binary coding such as $(0, 1)$. If there are $K$ options, use dummy variables: a vector of $K$ binary entries in which exactly one entry is 1, marking the observed category.

- $X$ normally represents the input, $Y$ a quantitative output, and $G$ a qualitative output.

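- Observed values are written in lower case, such as $x_i$ for the $i$th observation of $X$. With $p$ predictors, $x_i$ is a vector of length $p$. The $N$ observations of a single predictor $X_j$ form a vector of length $N$, which is a column of **X**.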

- Matrices are represented by bold capital letters like **X**, with dimensions given as rows $\times$ columns. So a set of $N$ input $p$-vectors $x_i$, $i = 1, \dots, N$, is represented by the $N \times p$ matrix **X**.

- All vectors are assumed to be column vectors. Hence the $i$th row of **X** is $x_i^T$, the transpose of the observation $x_i$.

- Thus we can now make predictions from our input data to approximate our output data. The predicted output is written $\hat{Y}$, or $\hat{G}$ for categorical outputs.

- If the data has two output classes, we can code them as a binary $Y \in \{0, 1\}$, fit $\hat{Y}$ as a quantitative output, and pick a threshold to assign the class: for example, $\hat{G} = G_2$ if $\hat{y} > 0.5$ and $G_1$ otherwise.
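
The dummy coding and the threshold rule from this section can be sketched in a few lines of plain Python. The function names `dummy_code` and `classify`, the level names, and the convention that the class coded 1 is called `G2` are all assumptions made for illustration, not notation from the chapter.

```python
def dummy_code(value, levels):
    """Return a K-element 0/1 list with a single 1 at the position of `value`."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if level == value else 0 for level in levels]


def classify(y_hat, threshold=0.5, labels=("G1", "G2")):
    """Map a fitted value for a 0/1-coded response to a class label."""
    # Assumption: fitted values above the threshold go to the class coded 1.
    return labels[1] if y_hat > threshold else labels[0]


levels = ["small", "medium", "large"]  # hypothetical K = 3 categories
print(dummy_code("medium", levels))    # [0, 1, 0]
print(classify(0.73))                  # G2
print(classify(0.21))                  # G1
```

Note that dummy coding deliberately ignores any ordering among the levels; an ordered categorical input would need a coding that preserves it.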

## 2.3 Two Simple Approaches to Prediction: Least Squares and K Nearest Neighbors

- Least Squares: Stable but can be inaccurate

- K Nearest Neighbors: Often accurate but can be unstable

### 2.3.1 Linear Model and Least Squares

- The term $B_0$ is known as the intercept; in machine learning it is called the bias. If it is not included, the fitted line is forced through the origin.

- Can write it as: $$ \widehat{Y} = \widehat{B_{0}} + \sum_{j=1}^p X_j \widehat{B_j}$$

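- If we include a constant 1 in $X$ and $\hat{B}_0$ in the coefficient vector $\hat{B}$, we can write the model as an inner product: $$ \hat{Y} = X^T \hat{B}$$

- Viewed as a function of the input, $f(X) = X^T B$ is linear, so its gradient $f'(X) = B$ is a vector that points in the steepest uphill direction.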
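
As a small illustration of the inner-product form, the sketch below prepends the constant 1 to an input vector and takes the dot product with the coefficients. The function name `predict` and the coefficient values are hypothetical.

```python
def predict(x, beta):
    """Evaluate the linear model X^T B with a constant 1 prepended to x.

    beta[0] is the intercept B_0; beta[1:] are the coefficients B_1..B_p.
    """
    augmented = [1.0] + list(x)
    if len(augmented) != len(beta):
        raise ValueError("beta must have one more entry than x")
    return sum(xj * bj for xj, bj in zip(augmented, beta))


beta = [2.0, 0.5, -1.0]            # hypothetical values for B_0, B_1, B_2
print(predict([4.0, 1.0], beta))   # 2 + 0.5*4 - 1*1 = 3.0
# Increasing the first input by one unit changes the prediction by B_1.
print(predict([5.0, 1.0], beta) - predict([4.0, 1.0], beta))  # 0.5
```

The second print shows the gradient interpretation: each coefficient is the change in the prediction per unit change in its input.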



#### Least Squares

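- Goal is to minimize the RSS (residual sum of squares): $$\text{RSS}(B) = \sum_{i=1}^{N}(y_i - x_i^TB)^2 $$ This is a quadratic function of $B$, so a minimum always exists, though it may not be unique.

- Another way to write it is in matrix notation, with $X$ the $N \times p$ input matrix and $y$ the $N$-vector of outputs: $$\text{RSS}(B) = (y - XB)^T(y - XB)$$

- Differentiate with respect to $B$ and set the result to zero to get the normal equations: $$ X^T(y - XB) = 0$$

- If $X^TX$ is nonsingular, the unique solution is $$\hat{B} = (X^TX)^{-1}X^Ty$$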
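
The following is a minimal sketch of solving the normal equations $X^TXB = X^Ty$, written in plain Python so that it has no dependencies. The helper names `solve` and `least_squares` and the toy data are assumptions for illustration; in practice a numerical library routine such as `numpy.linalg.lstsq` would be the usual choice.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; the solution is not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(rows, y):
    """Return the coefficients that solve the normal equations X^T X B = X^T y.

    Each element of `rows` is one observation; a constant 1 is prepended
    so that the first coefficient is the intercept.
    """
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(xi[j] * xi[k] for xi in X) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(X, y)) for j in range(p)]
    return solve(xtx, xty)


# Hypothetical data generated exactly from y = 1 + 2x, so the fit should recover (1, 2).
rows = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = least_squares(rows, y)
print(round(b0, 6), round(b1, 6))  # 1.0 2.0
```

The `ValueError` branch corresponds to the singular case: when $X^TX$ cannot be inverted, the minimizer of the RSS still exists but is no longer unique.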

#### K Nearest Neighbors Methods

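- $$ \hat{Y}(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i$$ where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample.

- So we use Euclidean distance to find the $k$ training points closest to $x$, then average their responses.

- Least squares fits $p$ parameters, while k nearest neighbors appears to have a single parameter, the number of neighbors $k$.

- $k = 1$ always fits the training data perfectly, since each training point is its own nearest neighbor, so training error cannot be used to choose $k$. The effective number of parameters of k nearest neighbors is $N/k$, which is generally larger than $p$ and decreases as $k$ grows. Cross validation is a good way to choose $k$.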
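
A minimal sketch of k nearest neighbor regression, assuming Euclidean distance and a small hypothetical training set (the name `knn_predict` is illustrative). It also checks the claim about $k = 1$ on the training data.

```python
import math


def knn_predict(x, train_x, train_y, k):
    """Average the responses of the k training points closest to x (Euclidean distance)."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    order = sorted(range(len(train_x)), key=lambda i: math.dist(x, train_x[i]))
    neighbors = order[:k]
    return sum(train_y[i] for i in neighbors) / k


# Hypothetical one-dimensional training set.
train_x = [[0.0], [1.0], [2.0], [3.0], [4.0]]
train_y = [0.0, 1.2, 1.9, 3.2, 3.9]

print(knn_predict([2.1], train_x, train_y, k=1))  # 1.9, the response of the closest point
print(knn_predict([2.1], train_x, train_y, k=3))  # mean of 1.2, 1.9 and 3.2

# With k = 1 every training point is its own nearest neighbor,
# so the squared error on the training set is exactly zero.
train_error = sum((knn_predict(x, train_x, train_y, 1) - y) ** 2
                  for x, y in zip(train_x, train_y))
print(train_error)  # 0.0
```

Sorting all training points is fine for a toy example; for large $N$ a spatial index or partial sort would be more appropriate.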

### 2.3.2 From Least Squares to K Nearest Neighbors

- Least squares has a smooth, stable decision boundary that is based on all the points. Therefore it has low variance but potentially high bias, since it relies heavily on the assumption that the decision boundary is linear.

- The k nearest neighbor decision boundary depends locally on only a handful of points. Therefore it has high variance but low bias, and it is much more unstable. It makes no stringent assumptions about the data, so the boundary can adapt to any shape, and it changes with $k$.



#### Alternative to KNN

- More complex methods build on k nearest neighbors, such as kernel methods, which use weights that decrease smoothly to zero with distance from the target point rather than the 0/1 weights used by KNN.

- In high-dimensional spaces the kernels are modified to emphasize some variables more than others.
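
To make the contrast with 0/1 weights concrete, here is a sketch of a kernel-weighted average for scalar inputs. The Gaussian kernel, the `bandwidth` parameter, and the name `kernel_smooth` are assumptions chosen for illustration; the notes do not name a specific kernel.

```python
import math


def kernel_smooth(x, train_x, train_y, bandwidth=1.0):
    """Kernel-weighted average of the responses, using a Gaussian kernel."""
    # Each weight decays smoothly with distance from x instead of
    # dropping from 1 to 0 at the edge of a k-point neighborhood.
    weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in train_x]
    total = sum(weights)
    return sum(w * yi for w, yi in zip(weights, train_y)) / total


# Hypothetical training set.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.0, 1.2, 1.9, 3.2, 3.9]
print(kernel_smooth(2.1, train_x, train_y, bandwidth=0.5))
print(kernel_smooth(2.1, train_x, train_y, bandwidth=2.0))
```

A small bandwidth behaves like a small $k$ (local, wiggly), and a large bandwidth like a large $k$ (smooth, closer to a global average). With a very small bandwidth and a query far from all training points, the weights can underflow to zero, which this sketch does not guard against.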

#### Alternative to Least Squares

- Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.

- Linear models can be fit to a basis expansion of the original variables, which allows arbitrarily complex models.

- Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.
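
The difference between a locally constant fit and a local regression can be sketched for a single scalar input. The Gaussian weights, the function names, and the toy data are assumptions for illustration; the closed form used in `local_linear` is the standard weighted least squares solution for a straight line.

```python
import math


def gaussian_weights(x0, xs, bandwidth):
    """Weights that decay smoothly with distance from the target point x0."""
    return [math.exp(-0.5 * ((x0 - xi) / bandwidth) ** 2) for xi in xs]


def local_constant(x0, xs, ys, bandwidth=1.0):
    """Kernel-weighted average: the locally constant fit at x0."""
    w = gaussian_weights(x0, xs, bandwidth)
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)


def local_linear(x0, xs, ys, bandwidth=1.0):
    """Fit a line by weighted least squares around x0 and evaluate it at x0."""
    w = gaussian_weights(x0, xs, bandwidth)
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, xs)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, xs))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept + slope * x0


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]
print(local_constant(0.0, xs, ys))  # above 1: every neighbor lies to the right of x0
print(local_linear(0.0, xs, ys))    # approximately 1.0, the true value at x0 = 0
```

At the left edge of the data the locally constant fit is pulled upward by its one-sided neighborhood, while the local line recovers the trend.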

## 2.4 Statistical Decision Theory

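- K nearest neighbors can provide an excellent approximation of the conditional mean if $N$ is extremely large. However, we usually don't have enough data for this and settle for a more structured model such as linear regression.

- Also, if $p$ is too large, KNN fails: as the dimension grows, the $k$ nearest neighbors are no longer close to the target point, so the neighborhood stops being local.

- KNN assumes the regression function is well approximated by a locally constant function (a local average), while linear regression assumes it is well approximated by a globally linear function.

- The best possible classifier is the Bayes classifier, which uses the conditional probability $\Pr(G \mid X = x)$ to assign each point to its most probable class; its decision boundary is the Bayes decision boundary. KNN does a relaxed version of this by taking a majority vote among the classes of the points around $x$.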
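
The Bayes rule can be sketched for a toy problem in which the class densities are known, which is an assumption that rarely holds in practice. Here each class is a one-dimensional Gaussian with a stated prior; the names `gaussian_pdf` and `bayes_classify` and all parameter values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bayes_classify(x, classes):
    """Return the label with the largest posterior probability at x.

    `classes` maps each label to (prior, mean, sd). The posterior is
    proportional to prior * density, so the shared normalizing constant
    can be ignored when taking the maximum.
    """
    return max(classes, key=lambda g: classes[g][0] * gaussian_pdf(x, *classes[g][1:]))


# Hypothetical two-class problem with known generating densities.
classes = {"G1": (0.5, 0.0, 1.0), "G2": (0.5, 2.0, 1.0)}
print(bayes_classify(0.4, classes))  # G1
print(bayes_classify(1.7, classes))  # G2
```

With equal priors and equal spreads the Bayes decision boundary sits midway between the two means, at $x = 1$. KNN replaces the known posterior with the class proportions among the nearest training points.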

## 2.5 Local Methods in High Dimensions
