Figures taken from <https://online.stat.psu.edu/stat501/lesson/1/1.1>

# Linear Models

## 2023-10-10

# What is linear regression?

A statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables:

  - One variable, denoted $x$, is regarded as the **predictor**, explanatory, or independent variable.
  - The other variable, denoted $y$, is regarded as the **label**, outcome, or dependent variable.

# Types of relationships

![celsius / Fahrenheit graph](assets/temperature.jpeg)

Although **deterministic relationships** can often be linear, there's not much point in modelling them statistically. We know that $\text{Fahr} = \frac{9}{5}\text{Cels}+32$ exactly.

# Statistical relationships

![latitude / skin cancer incidence graph](assets/cancer.png)

The label, $y$, is the mortality due to skin cancer (number of deaths per 10 million people) and the predictor, $x$, is the latitude (degrees North) at the center of each of the lower 48 states in the United States ([U.S. Skin Cancer data](https://online.stat.psu.edu/onlinecourses/sites/stat501/files/data/skincancer.txt)).

# Line of best fit

Given that:
  - $y_{i}$ denotes the observed label for input $i$
  - $x_{i}$ denotes the observed predictor for input $i$
  - $\hat{y}_{i}$ denotes the predicted label for input $i$
  
We can define the equation for a best fitting line as:
$$
\hat{y}_{i} = b_{0}+b_{1}x_{i}
$$

# Observation vs prediction

Because we're measuring **statistical** and not **deterministic** relationships, there will always be some error in our predictions (a **residual**). This can be quantified as

$$
e_{i} = y_{i} - \hat{y}_{i}
$$
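As a minimal sketch of the two definitions above, the snippet below computes $\hat{y}_{i}$ and $e_{i}$ for a candidate line. The data values and the coefficients `b0` and `b1` are made up for illustration; they are not fitted and do not come from the skin cancer data.

```python
# Hypothetical toy data; values are made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.5, 6.5, 9.0]

# Assumed (not fitted) intercept and slope of a candidate line.
b0, b1 = 1.0, 2.0


def predict(x_i, b0, b1):
    """Predicted label for one predictor value: y_hat_i = b0 + b1 * x_i."""
    return b0 + b1 * x_i


def residuals(x, y, b0, b1):
    """Residuals e_i = y_i - y_hat_i for every observation."""
    return [y_i - predict(x_i, b0, b1) for x_i, y_i in zip(x, y)]


e = residuals(x, y, b0, b1)
print(e)  # [0.0, 0.5, -0.5, 0.0]
```

A positive residual means the line under-predicts that observation; a negative residual means it over-predicts.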

# Least squares criterion

One way to minimize the prediction error is to minimize the sum of the squared prediction errors:

$$
Q = \sum^{n}_{i=1}(y_{i}-\hat{y}_{i})^{2}
$$

Applied to a linear regression, where $\hat{y}_{i} = b_{0}+b_{1}x_{i}$, we minimize:

$$
Q = \sum^{n}_{i=1}(y_{i}-(b_{0}+b_{1}x_{i}))^{2}
$$

to get the **least squares estimates** for $b_{0}$ and $b_{1}$:

$$
\begin{aligned}
b_{0} &= \bar{y}-b_{1}\bar{x} \\
b_{1} &= \frac{\sum^{n}_{i=1}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum^{n}_{i=1}(x_{i}-\bar{x})^{2}}
\end{aligned}
$$
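The closed-form estimates above translate almost directly into code. The following is a sketch in plain Python, assuming a small made-up data set; the function names `least_squares_fit` and `sum_squared_errors` are illustrative and not part of any library.

```python
def least_squares_fit(x, y):
    """Return (b0, b1) that minimize the sum of squared prediction errors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope estimate b1.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_errors(x, y, b0, b1):
    """Q = sum over i of (y_i - (b0 + b1 * x_i))^2."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data; values are made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = least_squares_fit(x, y)
q = sum_squared_errors(x, y, b0, b1)
```

For this toy data the slope is 0.6 and the intercept is 2.2 (up to floating-point rounding). Any other choice of `b0` and `b1` gives a larger value of `sum_squared_errors`.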

# Simple linear regression

A simple linear regression model relies on four conditions for its predictions and inferences to be valid:
  - **Linear function**: The mean of the response, $E(Y_{i})$, at each value of $x_{i}$ is a linear function of $x_{i}$
  - **Independent**: The errors, $e_{i}$, are independent
  - **Normally distributed**: The errors, $e_{i}$, at each value of $x_{i}$ follow a Normal distribution
  - **Equal variances**: The errors, $e_{i}$, at each value of $x_{i}$ have equal variances, denoted $\sigma^{2}$

# Residuals vs fit

![residual vs fit plot](assets/residual.png)

![residual patterns](assets/residual-patterns.png)

# Multiple regression

A multiple linear regression that relates a $y$-variable to $p-1$ $x$-variables is written as:

$$
y_{i} = \beta_{0} + \beta_{1}x_{i,1}+\beta_{2}x_{i,2} + \ldots + \beta_{p-1}x_{i,p-1} + \epsilon_{i}
$$

## Matrix notation

$$
\underbrace{
  \vphantom{
    \begin{bmatrix}
      1 & x_1 \\
      1 & x_2 \\
      \vdots & \vdots \\
      1 & x_n
    \end{bmatrix}
  }
  \begin{bmatrix}
    y_1 \\
    y_2 \\
    \vdots \\
    y_n
  \end{bmatrix}
}_{
  \begin{gathered}
    Y
  \end{gathered}
} = \underbrace{
  \begin{bmatrix}
    1 & x_1 \\
    1 & x_2 \\
    \vdots &\vdots \\
    1&x_n
  \end{bmatrix}
}_{
  \begin{gathered}
    =X
  \end{gathered}
}
\underbrace{
  \vphantom{
    \begin{bmatrix}
      1 & x_1 \\
      1 & x_2 \\
      \vdots & \vdots \\
      1 & x_n
    \end{bmatrix}
  }
  \begin{bmatrix}
    \beta_0 \\
    \beta_1 \\
  \end{bmatrix}
}_{
  \begin{gathered}
    \beta
  \end{gathered}
} + \underbrace{
  \vphantom{
    \begin{bmatrix}
      1 & x_1 \\
      1 & x_2 \\
      \vdots & \vdots \\
      1&x_n
    \end{bmatrix}
  }
  \begin{bmatrix}
    \epsilon_1 \\
    \epsilon_2 \\
    \vdots \\
    \epsilon_n
  \end{bmatrix}
}_{
  \begin{gathered}
    + \epsilon
  \end{gathered}
}
$$

or just $Y=X\beta+\epsilon$
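To make the matrix form concrete, the sketch below builds the $n \times 2$ design matrix $X$ shown above and estimates $\beta$ by solving the normal equations $(X^{T}X)\beta = X^{T}Y$. The normal equations are not stated in these notes; they are the standard least squares solution and are introduced here as an assumption. The helper names and the toy data are illustrative, and in practice a routine such as `numpy.linalg.lstsq` would usually be preferred.

```python
def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def solve_2x2(A, b):
    """Solve A @ beta = b for a 2x2 matrix A and a 2x1 column b (Cramer's rule)."""
    (a11, a12), (a21, a22) = A
    (b1,), (b2,) = b
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("X^T X is singular; beta is not uniquely determined")
    return [[(b1 * a22 - a12 * b2) / det],
            [(a11 * b2 - a21 * b1) / det]]


# Hypothetical toy data; values are made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

X = [[1.0, xi] for xi in x]   # column of ones (intercept) plus the predictor
Y = [[yi] for yi in y]        # n x 1 column of observed labels

Xt = transpose(X)
beta = solve_2x2(matmul(Xt, X), matmul(Xt, Y))   # [[beta_0], [beta_1]]

fitted = matmul(X, beta)                          # X @ beta
eps = [[yi[0] - fi[0]] for yi, fi in zip(Y, fitted)]  # estimated errors Y - X @ beta
```

With a single predictor this gives the same intercept and slope as the closed-form least squares estimates. The 2x2 solver is specific to this case; more predictors would need a general linear solver.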

# Generalized linear models

There are 3 components to any GLM:

  - **Random component**: specifies the probability distribution of the response variable. E.g., the normal distribution for $Y$ in a classical regression model
  - **Systematic component**: specifies the linear combination of explanatory variables. E.g., $\beta_{0} + \beta_{1}x_{1}+\beta_{2}x_{2}$ in a linear regression
  - **Link function**: denoted $\eta$ or $g(\mu)$. Specifies the link between the random and systematic components. E.g., $\eta = g(E(Y_{i}))=E(Y_{i})$ for a classical regression

The classical linear model is the special case $y_{i} \sim N(x_{i}^{T}\beta, \sigma^{2})$ where

  - $x_{i}$ contains known covariates, and
  - $\beta$ contains the coefficients to be estimated

More generally, in a GLM $y_{i}$ is assumed to follow an exponential family distribution with mean $\mu_{i}$, which is assumed to be a function of $x_{i}^{T}\beta$ through the link: $g(\mu_{i}) = x_{i}^{T}\beta$

# Common GLMs

  - Binary logistic regression
    - Models the odds of "success" for a binary response variable with a logit link function
    - Distribution is assumed to be binomial with a single trial and success probability $E(Y) = P$
    - $\text{logit}(P_{i}) = \ln\left(\frac{P_{i}}{1-P_{i}}\right)$
  - Poisson regression
    - Models how the mean of a discrete (i.e., a count) response variable $Y$ depends on a set of explanatory variables
    - Distribution is Poisson with mean $\lambda$
    - Log link function is used, $\ln(\lambda_{i})$
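The two link functions above, and their inverses, can be sketched in a few lines. This is only an illustration of how a linear predictor $\eta$ maps to a mean under each link; the coefficient values in `beta` and the function names are hypothetical, and no model fitting is performed.

```python
import math


def logit(p):
    """Logit link: log-odds of a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Map a linear predictor back to a probability (numerically stable form)."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def log_link(lam):
    """Log link: natural log of a strictly positive Poisson mean."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return math.log(lam)


def inverse_log_link(eta):
    """Map a linear predictor back to a Poisson mean."""
    return math.exp(eta)


def linear_predictor(x, beta):
    """eta = beta_0 + beta_1 * x_1 + beta_2 * x_2 + ..."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))


# Hypothetical coefficients and covariate; not estimated from any data.
beta = [-1.0, 0.5]
x_i = [2.0]

eta = linear_predictor(x_i, beta)   # -1.0 + 0.5 * 2.0 = 0.0
p = inverse_logit(eta)              # success probability under the logit link
lam = inverse_log_link(eta)         # Poisson mean under the log link
```

Here $\eta = 0$, so the logistic model gives $P = 0.5$ and the Poisson model gives $\lambda = 1$. The same linear predictor yields different means depending on the link.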