# Regression Models

We write a simple model with response $Y$, features $X = (X_1, X_2,...,X_p)$ and error $\epsilon$ as

$$ Y = f(X) + \epsilon $$

The ideal choice is $f(x) = E(Y|X=x)$, the expected value of $Y$ given $X = x$. This function is known as the <i>regression function</i>.

We then say that $f(x)$ is the <i>ideal</i> predictor of $Y$ with respect to mean-squared prediction error. In other words, $f(x) = E(Y|X=x)$ is the function that minimizes $E[(Y-g(X))^2|X=x]$ over all functions $g$ at all points $X=x$.

Our uncontrolled error $\epsilon = Y - f(x)$ is known as the <i> irreducible error </i>.

Thus, for any estimate $\hat{f}(x)$ of $f(x)$, we have

$$ E[(Y - \hat{f}(X))^2|X=x] = [f(x) - \hat{f}(x)]^2 + Var(\epsilon)$$
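
A short derivation of this decomposition may help. It is a sketch that treats $\hat{f}$ as fixed and assumes $E(\epsilon|X=x) = 0$ (which follows from $f(x) = E(Y|X=x)$) and that the variance of $\epsilon$ does not depend on $x$:

$$
\begin{aligned}
E[(Y - \hat{f}(X))^2|X=x] &= E[(f(x) + \epsilon - \hat{f}(x))^2|X=x] \\
&= [f(x) - \hat{f}(x)]^2 + 2[f(x) - \hat{f}(x)]E(\epsilon|X=x) + E(\epsilon^2|X=x) \\
&= [f(x) - \hat{f}(x)]^2 + Var(\epsilon)
\end{aligned}
$$

The first term can be reduced by choosing a better $\hat{f}$. The second term cannot, which is why $\epsilon$ is called irreducible.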

### But how should we estimate $f$?

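With finite data there are typically few, if any, observations with $X$ exactly equal to $x$, so we cannot compute $E(Y|X=x)$ directly. We can relax the definition and let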

$$ \hat{f}(x) = \text{Avg}(Y|X \in \mathcal{N}(x)) $$

where $\mathcal{N}(x)$ is some <i>neighborhood</i> of $x$, i.e. a set of points close to $x$.
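
The neighborhood average can be sketched in a few lines for a single feature. This is a minimal illustration, not a definitive implementation: the function name `neighborhood_average`, the fixed-width interval used as the neighborhood, and the synthetic data (a sine curve plus Gaussian noise) are all assumptions made for the example.

```python
import numpy as np

def neighborhood_average(x0, x_train, y_train, width):
    """Estimate f(x0) by averaging y_i over all x_i with |x_i - x0| <= width."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    in_neighborhood = np.abs(x_train - x0) <= width
    if not in_neighborhood.any():
        raise ValueError("no training points fall in the neighborhood of x0")
    return y_train[in_neighborhood].mean()

# Synthetic data (an assumption for illustration): f(x) = sin(x), noise sd 0.3
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 2.0 * np.pi, size=2000)
y_train = np.sin(x_train) + rng.normal(0.0, 0.3, size=x_train.size)

# Estimate f at x0 = pi/2, where the true value is sin(pi/2) = 1
f_hat = neighborhood_average(np.pi / 2, x_train, y_train, width=0.2)
print(f_hat)
```

A wider neighborhood averages more points and so lowers the variance of the estimate, but it also pulls in points whose $f$ values differ more from $f(x)$.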

# Dimensionality and Structured Models

Nearest-neighbor averaging can be pretty good for a small number of features, i.e. $p \leq 4$, and large $N$. When $p$ is large, nearest-neighbor methods become less useful due to the <i>curse of dimensionality</i>: nearest neighbors tend to be far away in high dimensions.

We need a reasonable fraction of the $N$ values of $y_i$ to average in order to lower the variance, e.g. $10\%$. In high dimensions, a neighborhood that captures $10\%$ of the data is no longer local, so we lose the spirit of estimating $E(Y|X=x)$ by local averaging.
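
A quick calculation makes the loss of locality concrete. The sketch below assumes the features are uniformly distributed on the unit hypercube $[0,1]^p$ and that the neighborhood is itself a cube; under those assumptions, a cube holding a fraction $r$ of the data needs edge length $r^{1/p}$. The helper name `edge_length` is hypothetical.

```python
def edge_length(fraction, p):
    """Edge length of a sub-cube of [0, 1]^p that holds `fraction` of uniform data."""
    return fraction ** (1.0 / p)

# Edge length needed to capture 10% of the data as p grows
for p in (1, 2, 5, 10):
    print(p, round(edge_length(0.10, p), 3))
```

With $p = 10$ the required edge length is about $0.79$, so the "neighborhood" spans most of the range of every feature.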

## Parametrics and structured models

The <i>linear model</i> is the most basic parametric model. It is specified by $p + 1$ parameters $\beta_0, \beta_1, ..., \beta_p$:

$$ f_L(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p$$
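
The notes do not say how the parameters are estimated. One common choice, assumed here, is ordinary least squares. The sketch below fits $\beta_0, ..., \beta_p$ with `numpy.linalg.lstsq` on synthetic data; the coefficient values in `beta_true` and the helper `f_L` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3

# Synthetic data from a known linear model (hypothetical coefficients)
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 1.0, -0.5, 0.25])  # (beta_0, beta_1, ..., beta_p)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0.0, 0.1, size=n)

# Prepend a column of ones so the intercept beta_0 is estimated with the slopes
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

def f_L(X_new, beta):
    """Evaluate beta_0 + beta_1 * X_1 + ... + beta_p * X_p for each row of X_new."""
    X_new = np.atleast_2d(X_new)
    return beta[0] + X_new @ beta[1:]

print(beta_hat)
print(f_L([1.0, 0.0, 0.0], beta_hat))
```

Only $p + 1$ numbers are estimated regardless of $N$, which is what lets the linear model sidestep the need for a dense local neighborhood.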

## Assessing Model Accuracy

Suppose we fit a model $\hat{f}(x)$ to some training data and we wish to see how well it performs. We can compute the average squared prediction error over our training data $(Tr)$:

$$ \text{MSE}_{Tr} = \text{Avg}_{i \in Tr} [y_i - \hat{f}(x_i)]^2 $$

This estimate is optimistic and favors more overfit models, because the model was fit to these same observations. Instead, we should compute our MSE over a test set $(Te)$ that was not used for fitting:

$$ \text{MSE}_{Te} = \text{Avg}_{i \in Te} [y_i - \hat{f}(x_i)]^2 $$

Fitting typically drives down the mean-squared error on the training data $Tr$, but the quantity we actually want to be small is the mean-squared error on the test data $Te$.
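
The gap between the two quantities can be seen in a small experiment. This is a sketch under assumptions: the true function $f(x) = \sin(2x)$, the noise level, the sample sizes, and the use of polynomial fits via `numpy.polyfit` are all choices made for illustration, and `mse` is a hypothetical helper.

```python
import numpy as np

def mse(y, y_hat):
    """Average squared prediction error."""
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

def f(x):
    return np.sin(2.0 * x)  # assumed true regression function

rng = np.random.default_rng(2)
x_tr = rng.uniform(-1.0, 1.0, size=30)
y_tr = f(x_tr) + rng.normal(0.0, 0.3, size=x_tr.size)
x_te = rng.uniform(-1.0, 1.0, size=1000)
y_te = f(x_te) + rng.normal(0.0, 0.3, size=x_te.size)

results = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_tr, y_tr, deg=degree)      # fit on training data only
    mse_tr = mse(y_tr, np.polyval(coefs, x_tr))     # MSE_Tr
    mse_te = mse(y_te, np.polyval(coefs, x_te))     # MSE_Te
    results[degree] = (mse_tr, mse_te)
    print(f"degree={degree}  MSE_Tr={mse_tr:.4f}  MSE_Te={mse_te:.4f}")
```

Because each higher-degree polynomial contains the lower-degree ones as special cases, the training MSE can only go down as the degree increases. The test MSE carries no such guarantee.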



# Bias-Variance Tradeoff

Suppose we fit a model $\hat{f}(x)$ to a set of training data $Tr$, and let $(x_0, y_0)$ be a test observation drawn from the population. If the true model is $Y = f(X) + \epsilon$ with $f(x) = E(Y|X=x)$, then

$$ \text{E}(y_0 - \hat{f}(x_0))^2 = \text{Var}(\hat{f}(x_0)) + [\text{Bias}(\hat{f}(x_0))]^2 + \text{Var}(\epsilon) $$

where $$ \text{Bias}(\hat{f}(x_0)) = \text{E}(\hat{f}(x_0)) - f(x_0) $$

In other words, the bias is the difference between the prediction at $x_0$, averaged over training data sets, and the true value $f(x_0)$.
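
The decomposition can be checked numerically by simulating many training sets. The sketch below is an illustration under assumptions: the true function $f(x) = \sin(2x)$, the noise standard deviation `sigma`, the test point `x0`, and the choice of a straight-line fit are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.3            # sd of epsilon (assumed)
x0 = 0.8               # test point (assumed)
n_sets, n = 2000, 30   # number of training sets, size of each
degree = 1             # fit a straight line, which is too rigid for sin(2x)

def f(x):
    return np.sin(2.0 * x)  # assumed true regression function

# Refit the model on a fresh training set each time and record f_hat(x0)
preds = np.empty(n_sets)
for s in range(n_sets):
    x_tr = rng.uniform(-1.0, 1.0, size=n)
    y_tr = f(x_tr) + rng.normal(0.0, sigma, size=n)
    coefs = np.polyfit(x_tr, y_tr, deg=degree)
    preds[s] = np.polyval(coefs, x0)

variance = preds.var()            # Var(f_hat(x0))
bias = preds.mean() - f(x0)       # E(f_hat(x0)) - f(x0)

# Left-hand side: squared error against fresh test responses y0 at x0
y0 = f(x0) + rng.normal(0.0, sigma, size=n_sets)
expected_sq_error = np.mean((y0 - preds) ** 2)
decomposition = variance + bias ** 2 + sigma ** 2

print(f"Var={variance:.4f}  Bias^2={bias ** 2:.4f}  Var(eps)={sigma ** 2:.4f}")
print(f"E(y0 - f_hat(x0))^2 ~ {expected_sq_error:.4f}  vs  sum = {decomposition:.4f}")
```

The two printed totals should agree up to simulation noise. Raising `degree` would generally shrink the bias term and inflate the variance term.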

# Classification Problems

In a classification problem, our response $Y$ is qualitative as opposed to quantitative.

$$ Y = \mathcal{C}(X) $$

Suppose $Y$ takes values in a set $\mathcal{C}$ of $K$ classes, numbered $1,2,...,K$. Let the <i>conditional class probability</i> at $x$ be

$$ p_k(x) = P(Y=k|X=x) \; \; \; k=1,2,...,K $$

The <i>Bayes optimal classifier</i> at $x$ is 

$$ \mathcal{C}(x) = j \; \; \; \text{if} \; \; \; p_j(x) = \text{max}\{p_1(x),p_2(x),...,p_K(x)\} $$
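
As a small sketch, the rule amounts to taking an argmax over the conditional class probabilities. The example assumes those probabilities are already known and supplied as an array, which is rarely the case in practice; the function name `bayes_classifier` and the numbers in `probs` are hypothetical.

```python
import numpy as np

def bayes_classifier(class_probs):
    """Return the most probable class (labelled 1..K) for each row.

    class_probs has shape (n, K); row i holds p_1(x_i), ..., p_K(x_i).
    """
    class_probs = np.atleast_2d(np.asarray(class_probs, dtype=float))
    return np.argmax(class_probs, axis=1) + 1  # +1 because classes start at 1

# Conditional class probabilities at two points, K = 3 (made-up values)
probs = np.array([[0.2, 0.5, 0.3],
                  [0.7, 0.1, 0.2]])
labels = bayes_classifier(probs)
print(labels)  # class 2 for the first point, class 1 for the second
```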

We can measure the performance of $\hat{\mathcal{C}}(x)$ using the misclassification error rate:

$$ \text{Err}_{Te} = \text{Avg}_{i \in Te} \text{I}[y_i \neq \hat{\mathcal{C}}(x_i)] $$
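
The error rate is the fraction of test observations whose predicted class differs from the true class. A minimal sketch follows; the helper name `misclassification_rate` and the toy labels are assumptions for illustration.

```python
import numpy as np

def misclassification_rate(y_true, y_pred):
    """Average of the indicator I[y_i != C_hat(x_i)] over the given observations."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(y_true != y_pred))

# Toy test set: true classes and predicted classes (made-up values)
y_te = [1, 2, 2, 3, 1]
c_hat = [1, 2, 3, 3, 2]
err_te = misclassification_rate(y_te, c_hat)
print(err_te)  # 2 of the 5 predictions are wrong, so 0.4
```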

The Bayes classifier has the smallest error in the population. Structured models for $\mathcal{C}(x)$ include support vector machines, logistic regression, generalized additive models, and others.