# Intro To Statistical Learning

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**<font color = "blue">2.1 What is Statistical Learning?</font>**

* Demonstration:
  - Look at <font color = "brown">Advertising</font> data set that consists of <font color = "brown">sales</font> in 200 different markets
  - Look at advertising budget for <font color = "brown">TV, radio, newspaper</font>.
  - Can't directly increase sales → look at the correlation between advertising and sales
  - Want to develop an accurate model that predicts sales from the media budgets
  - input - ad budgets
    - aka predictors, independent variables, features
  - output - <font color = "brown">sales</font>
    - aka *response* or *dependent variable*


Notation:


*   Y - response
*   X - features, predictors, input
*   n - # of distinct data points

  \begin{align}
    X = \begin{pmatrix}
          x_1 \\
          x_2 \\
          x_3
        \end{pmatrix}  
  \end{align}

  Write model as:

  \begin{align}
    Y = f(X) + ϵ
  \end{align}

  where ϵ is a random error term (the discrepancy between $Y$ and $f(X)$)

**<font color="blue">What is f(x) good for?</font>**

*    make predictions of y
  - $\hat Y = \hat f(X)$
      - accuracy of $\hat Y$ depends on the *reducible error* and *irreducible error*

\begin{align}
  E(Y-\hat Y)^2 &= E[f(X) + \epsilon - \hat f(X)]^2 \\
  &= [f(X) - \hat f(X)]^2 + Var(ϵ)
\end{align}  

*    understand which components of $X=(X_1,X_2,...,X_p)$ are important in explaining $Y$

* Inference
  - Which predictors are associated with response?
  - What is the relationship between the response and each predictor?
  - Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
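
The split of the expected squared error into a reducible part and an irreducible part can be checked numerically. The sketch below is only an illustration: the "true" function `f`, the imperfect estimate `f_hat`, the noise level `sigma`, and the input `x0` are all made-up assumptions, not quantities from the Advertising data.

```python
import random

random.seed(0)

def f(x):
    # Hypothetical "true" function, chosen only for illustration.
    return 2.0 + 3.0 * x

def f_hat(x):
    # Hypothetical imperfect estimate of f.
    return 2.5 + 2.5 * x

sigma = 1.0        # standard deviation of the noise, so Var(eps) = sigma**2
x0 = 2.0           # a fixed input value
n_draws = 200_000  # number of simulated responses at x0

# Draw many responses Y = f(x0) + eps and average the squared prediction error.
sq_errors = []
for _ in range(n_draws):
    y = f(x0) + random.gauss(0.0, sigma)
    sq_errors.append((y - f_hat(x0)) ** 2)

expected_sq_error = sum(sq_errors) / n_draws
reducible = (f(x0) - f_hat(x0)) ** 2   # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2               # Var(eps)

print(f"simulated E(Y - Y_hat)^2 : {expected_sq_error:.3f}")
print(f"reducible + irreducible  : {reducible + irreducible:.3f}")
```

Improving `f_hat` can shrink only the reducible term; the simulated error should stay close to `sigma ** 2` even if `f_hat` equals `f`.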

**<font color="blue">Is there an ideal f(x)?</font>**
*    The ideal $f(x) = E(Y|X = x)$ is called the <font color = "green">regression function</font>
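
With finite data there are rarely many observations at exactly $X = x$, so one common way to approximate $E(Y|X = x)$ is to average the responses of the training points nearest to $x$. The following is a minimal sketch on simulated data; the function `f`, the noise level, and the helper name `neighbor_average` are assumptions made for illustration.

```python
import random

random.seed(1)

def f(x):
    # Hypothetical true regression function, for illustration only.
    return x ** 2

# Simulated training data: X uniform on [0, 1], Y = f(X) + noise.
xs = [random.uniform(0.0, 1.0) for _ in range(5000)]
ys = [f(x) + random.gauss(0.0, 0.1) for x in xs]

def neighbor_average(x0, xs, ys, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

estimate = neighbor_average(0.5, xs, ys, k=50)
print(f"estimate of E(Y | X = 0.5): {estimate:.3f}  (true value {f(0.5):.3f})")
```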

**<font color="red">Dimensionality and Structured models</font>**

* Nearest Neighbor averaging good for small $p$
  - bad for large p due to <font color = "green">curse of dimensionality </font>
  - nearest neighbors are further away in higher dimensions
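
The claim that nearest neighbors drift further away as the dimension grows can be seen in a small simulation. This sketch assumes points drawn uniformly from the unit hypercube $[0, 1]^p$ and a fixed sample size of 200; both choices are arbitrary.

```python
import math
import random

random.seed(2)

def mean_nearest_distance(n_points, p):
    """Average distance from each point to its nearest neighbor in [0, 1]^p."""
    points = [[random.random() for _ in range(p)] for _ in range(n_points)]
    total = 0.0
    for i, a in enumerate(points):
        total += min(math.dist(a, b) for j, b in enumerate(points) if j != i)
    return total / n_points

# Same number of points, increasing number of dimensions p.
distances = {p: mean_nearest_distance(200, p) for p in (1, 2, 10, 50)}
for p, d in distances.items():
    print(f"p = {p:>2}: mean nearest-neighbor distance = {d:.3f}")
```

With the sample size held fixed, the "nearest" points in high dimensions are no longer local, so averaging over them is a poor estimate of $E(Y|X = x)$.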




**<font color="blue">Parametric and structured models</font>**
* a workaround for the nearest-neighbor problem is a <font color="green">linear model</font> (an example of a parametric model)

\begin{align}
  f_{L}(X) = β_0 + β_1X_1 + β_2X_2 + ... + β_pX_p
\end{align}

↪ specified in terms of $p+1$ parameters $β_0, β_1, ... ,β_p$
*    estimate parameters by fitting model to training data
  
  ↪ almost never exactly correct → the linear model serves as an approximation to the unknown true function $f(X)$
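
To make the $p+1$ parameters concrete, here is a small sketch that evaluates $f_L$ for given coefficients. The coefficient values and budgets are hypothetical placeholders; they are not estimated from the Advertising data.

```python
def linear_model(beta, x):
    """Evaluate f_L(x) = beta[0] + beta[1]*x[0] + ... + beta[p]*x[p-1]."""
    if len(beta) != len(x) + 1:
        raise ValueError("need p + 1 coefficients for p predictors")
    return beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))

# Hypothetical coefficients (intercept, TV, radio, newspaper), for illustration only.
beta = [3.0, 0.05, 0.2, 0.0]
budgets = [100.0, 20.0, 10.0]   # made-up TV, radio, newspaper budgets

predicted_sales = linear_model(beta, budgets)
print(f"predicted sales: {predicted_sales:.2f}")
```

Fitting the model means choosing `beta` from training data rather than typing it in by hand.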

**<font color="blue">Trade-offs</font>**
*    Prediction accuracy vs interpretability
*    Good fit vs underfit or overfit
*    Parsimony vs black box

* Why choose a more restrictive method over a more flexible one?
  - restrictive models are more interpretable

**<font color = "red">Model Selection and Bias Variance Tradeoff</font>**

**<font color = blue>Assessing Model Accuracy</font>**

*    suppose we fit a model $\hat f(x)$ to some training data $Tr = \{x_i, y_i\}^{N}_{1}$

    ↪ see how well it performs by computing the mean squared error (MSE) over the training data

\begin{align}
  MSE_{Tr} = Ave_{i \in Tr}[y_i-\hat f(x_i)]^2
\end{align}

*    training MSE may be biased toward more overfit models → compute the same quantity on test data (<font color = "green">fresh data</font>)

\begin{align}
  MSE_{Te} = Ave_{i \in Te}[y_i-\hat f(x_i)]^2
\end{align}
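
The gap between training and test MSE shows up clearly with a very flexible fit. The sketch below uses nearest-neighbor averaging on simulated data as the fitted model; the true function, the noise level, and the helper names are assumptions chosen for illustration.

```python
import random

random.seed(3)

def f(x):
    # Hypothetical true function, for illustration only.
    return x ** 2

def simulate(n):
    """Draw n observations with Y = f(X) + noise."""
    xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [f(x) + random.gauss(0.0, 0.3) for x in xs]
    return xs, ys

def knn_predict(x0, xs, ys, k):
    """Predict at x0 by averaging the k nearest training responses."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def mse(xs_eval, ys_eval, xs_train, ys_train, k):
    """Average squared error of the k-NN fit on the evaluation points."""
    errors = [(y - knn_predict(x, xs_train, ys_train, k)) ** 2
              for x, y in zip(xs_eval, ys_eval)]
    return sum(errors) / len(errors)

x_tr, y_tr = simulate(200)   # training data Tr
x_te, y_te = simulate(200)   # fresh test data Te

results = {}
for k in (1, 10, 50):        # smaller k means a more flexible fit
    results[k] = (mse(x_tr, y_tr, x_tr, y_tr, k), mse(x_te, y_te, x_tr, y_tr, k))
    print(f"k={k:>2}  MSE_Tr={results[k][0]:.3f}  MSE_Te={results[k][1]:.3f}")
```

With `k=1` every training point is its own nearest neighbor, so the training MSE is zero even though the test MSE is not.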

**<font color = "blue">Bias-Variance Trade-Off</font>**
*    Suppose we fit a model $\hat f(x)$ to some training data $Tr$
  -   Let $(x_0, y_0)$ be a test observation drawn from the population
      -   if true model is $Y = f(x)+ ϵ$ with $f(x) = E(Y|X=x)$ then:

      \begin{align}
        E(y_0 - \hat f(x_0))^2 = Var(\hat f(x_0)) + [Bias(\hat f(x_0))]^2 + Var(ϵ)
      \end{align}
*    Note: $Bias(\hat f(x_0)) = E[\hat f(x_0)] - f(x_0)$
  -    typically as the flexibility of $\hat f$ increases → variance ⇑, bias ↓

  ↪ choosing the flexibility based on average test error amounts to a <font color = "green"> bias-variance tradeoff</font>
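
The decomposition can be approximated by refitting a model on many simulated training sets and looking at its predictions at one fixed $x_0$. This is a sketch under stated assumptions: the true function, the noise level `SIGMA`, the test point `X0`, and the use of nearest-neighbor averaging as $\hat f$ are all illustrative choices.

```python
import random
import statistics

random.seed(4)
SIGMA = 0.3   # noise standard deviation, so Var(eps) = SIGMA**2
X0 = 0.8      # fixed test input x_0

def f(x):
    # Hypothetical true function, for illustration only.
    return x ** 2

def fit_and_predict(k, n=100):
    """Draw a fresh training set and return the k-NN prediction at X0."""
    xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [f(x) + random.gauss(0.0, SIGMA) for x in xs]
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - X0))[:k]
    return sum(y for _, y in nearest) / k

def decompose(k, n_repeats=2000):
    """Return (variance, squared bias, simulated test error) at X0."""
    preds = [fit_and_predict(k) for _ in range(n_repeats)]
    variance = statistics.pvariance(preds)
    bias_sq = (statistics.fmean(preds) - f(X0)) ** 2
    # Direct estimate of E(y0 - f_hat(x0))^2 using fresh responses at X0.
    test_err = statistics.fmean(
        (f(X0) + random.gauss(0.0, SIGMA) - p) ** 2 for p in preds
    )
    return variance, bias_sq, test_err

parts = {k: decompose(k) for k in (1, 50)}
for k, (variance, bias_sq, test_err) in parts.items():
    total = variance + bias_sq + SIGMA ** 2
    print(f"k={k:>2}  Var={variance:.3f}  Bias^2={bias_sq:.3f}  "
          f"sum+Var(eps)={total:.3f}  simulated={test_err:.3f}")
```

The flexible fit (`k=1`) should show high variance and little bias, while the rigid fit (`k=50`) should show the reverse.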

**<font color="red">Classification</font>**

**<font color="blue"> Classification Problems</font>**
  
*     spam or ham
    
  - Y in this case is <font color = "green">qualitative</font>
  - e.g. email class 𝐶 = (spam, ham), or digit class 𝐶 = (0,1,2, ... , 9)

    ↪ build a classifier 𝐶(x) that assigns a class label to a future observation of X

*    measure performance of $\hat C(x)$ by using the misclassification error rate

\begin{align}
  Err_{Te} = Ave_{i \in Te}I[y_i \neq \hat C(x_i)]
\end{align}
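
The error rate is just the fraction of test observations whose predicted label differs from the true label. A minimal sketch, with made-up labels and a hypothetical helper name `error_rate`:

```python
def error_rate(y_true, y_pred):
    """Fraction of observations where the predicted class differs from the truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # The indicator I[y_i != C_hat(x_i)] is True/False; summing counts the mistakes.
    return sum(yt != yp for yt, yp in zip(y_true, y_pred)) / len(y_true)

# Made-up test labels and classifier outputs, for illustration only.
y_te = ["spam", "ham", "ham", "spam", "ham"]
c_hat = ["spam", "ham", "spam", "spam", "spam"]

err_te = error_rate(y_te, c_hat)
print(f"Err_Te = {err_te:.2f}")
```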

**Some notes on K nearest neighbors**
*   Too many neighbors (large K) lead to underfitting, and too few neighbors (small K) lead to overfitting
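
For K nearest neighbors, the number of neighbors K controls how flexible the classifier is. The sketch below compares a very small, a moderate, and a very large K on simulated one-dimensional data; the data-generating rule and the helper names are assumptions made for illustration.

```python
import random
from collections import Counter

random.seed(5)

def simulate(n):
    """Two noisy classes on a line: class 1 is more likely when x > 0."""
    data = []
    for _ in range(n):
        x = random.uniform(-1.0, 1.0)
        p1 = 0.9 if x > 0 else 0.1
        data.append((x, 1 if random.random() < p1 else 0))
    return data

def knn_classify(x0, train, k):
    """Majority vote among the k training points closest to x0."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x0))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def error_rate(data, train, k):
    """Misclassification rate of the k-NN classifier on data."""
    return sum(label != knn_classify(x, train, k) for x, label in data) / len(data)

train, test = simulate(201), simulate(500)

# Odd values of K avoid tied votes; K = 201 uses every training point.
errors = {k: (error_rate(train, train, k), error_rate(test, train, k))
          for k in (1, 15, 201)}
for k, (err_tr, err_te) in errors.items():
    print(f"K={k:>3}  train error={err_tr:.3f}  test error={err_te:.3f}")
```

`K=1` memorizes the training set, while `K=201` predicts the same majority class everywhere; a moderate K usually does better on test data than either extreme.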

# Linear Regression

**<font color="blue">Linear Regression Using a single predictor X</font>**

-    Assume a model

\begin{align}
  Y = β_0 + \beta_1X + ϵ
\end{align}

  where $β_0$ and $β_1$ are two unknown constants that represent the <font color = green>intercept</font> and <font color = green>slope</font>

  ↪ aka <font color = green>coefficients</font> or <font color = green>parameters</font>

-    Given some estimates $\hat β_0$ and $\hat β_1$ → predict future sales using

\begin{align}
  \hat y = \hat β_0 + \hat β_1x
\end{align}

**<font color="blue">Estimation of the parameters by least squares</font>**

-    let $\hat y_i = \hat β_0 + \hat β_1x_i$ be the prediction for $Y$ based on the $i$th value of $X$

     ↪ then $e_i = y_i - \hat y_i$ represents the $i$th <font color = "green">residual</font>
-    <font color = "green">residual sum of squares</font> (RSS)

\begin{align}
  RSS  &= e^2_1 + e^2_2 + ... + e^2_n \\
  &=(y_1 - \hat β_0 - \hat β_1x_1)^2 + (y_2- \hat β_0 - \hat β_1x_2)^2 + ... + (y_n- \hat β_0 - \hat β_1x_n)^2
\end{align}
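
Least squares picks $\hat β_0$ and $\hat β_1$ to make the RSS as small as possible. The notes above do not state the minimizers; the sketch below uses the standard closed-form solution for a single predictor, $\hat β_1 = \sum (x_i - \bar x)(y_i - \bar y) / \sum (x_i - \bar x)^2$ and $\hat β_0 = \bar y - \hat β_1 \bar x$, on made-up toy data.

```python
def least_squares(xs, ys):
    """Closed-form least squares estimates (intercept, slope) for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def rss(xs, ys, b0, b1):
    """Residual sum of squares for the line y = b0 + b1 * x."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

# Made-up toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0_hat, b1_hat = least_squares(xs, ys)
print(f"intercept = {b0_hat:.3f}, slope = {b1_hat:.3f}")
print(f"RSS at the estimates  : {rss(xs, ys, b0_hat, b1_hat):.4f}")
print(f"RSS at a nearby line  : {rss(xs, ys, b0_hat + 0.1, b1_hat):.4f}")
```

Any other choice of intercept or slope gives a larger RSS on the same data, which is what the last two printed lines compare.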

**<font color="blue">Assessing the accuracy of coefficient estimates</font>**
-   
