<div style="display: flex; align-items: center; gap: 2px;">
  
  <div style="text-align: left; padding: 0;">
   <h2 style="font-size: 1.8em; margin-bottom: 0;"><b>Moving beyond Linearity...</b></h2>
   <br>
   <h3 style=" font-size: 1.2em;margin-bottom: 0;">Logistic Regression</h3>
   <h3 style="font-size: 1.2em; margin-bottom: 0; color: blue;"><i>Dr. Satadisha Saha Bhowmick</i></h3>
  </div>

  <div style="margin-right: 5px; padding: 0;">
    <img src="images/intro-pic.png" align="right" alt="intro-pic" style="width: 70%;">
    <!-- TEXT NEXT TO IMAGE -->
      <div style="font-size: 0.5em;">
        <p>Woman teaching geometry, from a fourteenth-century edition of Euclid's geometry book.</p>
      </div>
  </div>

</div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Learning Outcomes
- Generalized Linear Models
- Logistic Regression

### Motivation

We have seen that models built on assumptions of linearity are broadly applicable:

- They produce versatile models that work with numerous predictors.
- They accommodate nonlinear transformations of the predictors, as long as the transformed terms enter the model linearly.

However, they are not well suited to modeling probabilities or response variables that are not normally distributed!<br>($\color{blue}{\textbf{Recall linear regression}}$)

### Ordinary Least Squares (OLS) Assumptions

Ordinary least squares (OLS) regression relies on several key assumptions:

1. **The response variable is continuous and unbounded**
   
   $$
   Y \in (-\infty, \infty)
   $$

2. **Errors are normally distributed**
   
   $$
   Y = X\beta + \varepsilon,
   \qquad
   \varepsilon \sim \mathcal{N}(0, \sigma^2)
   $$

3. **Constant variance (homoscedasticity)**
   
   $$
   \mathrm{Var}(Y \mid X) = \sigma^2
   $$

These assumptions are often violated for many real-world outcomes, such as binary responses, count data, and proportions.
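
To make the violation concrete, the following is a minimal sketch on simulated data. The seed, sample size, and data-generating probabilities are assumptions chosen only for illustration. It fits OLS to a binary response and inspects the fitted values at the edges of the predictor range, where a straight line is not constrained to stay inside $[0, 1]$.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical data: one predictor and a binary response whose
# success probability increases with x.
x = rng.uniform(-4, 4, size=200)
prob = 1 / (1 + np.exp(-2 * x))
y = rng.binomial(1, prob)

# Ordinary least squares fit of y on [1, x].
design = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

# Fitted "probabilities" at the two ends of the predictor range.
x_edge = np.array([-4.0, 4.0])
fitted_edge = beta_hat[0] + beta_hat[1] * x_edge
print("fitted values at x = -4 and x = 4:", fitted_edge)
```

In this setup the fitted line falls below 0 at one end and rises above 1 at the other, so its values cannot be read as probabilities.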

### What do we need?

A unified, consistent framework to model specific assumptions of linearity for diverse data types, where linear regression is too restrictive.

Enter $\color{blue}{\textbf{Generalized Linear Models}}$.

### Generalized Linear Models

A generalized linear model (GLM) is a model in which: 
- a response variable is drawn from a distribution 
- its expected value $\mathbb{E}(Y|X)$ is related to the explanatory variables through a regression equation

### Generalized Linear Models

There are three components to a Generalized Linear Model.
- <b>Random Component</b>: specifies the probability distribution of the response variable.
    - Normal distribution for $Y$ in the classical regression model
    - Bernoulli distribution for $Y$ in the binary logistic regression model
- <b>Systematic Component</b>: specifies the explanatory variables $(X_1, X_2, \dots, X_k)$ in the model and their linear combination
    - Linear and Logistic Regression: $\eta = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k$
- <b>Link Function</b>: specifies the link between the random and the systematic component. The link function $g(\cdot)$ indicates how the expected value of the response variable $\mathbb{E}(Y|X)$ relates to the linear combination of explanatory variables.
    - $g(\mathbb{E}(Y\mid X_1,...,X_k)) = \eta$
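
The three components can be sketched as a small simulation. This is only an illustration under stated assumptions: the coefficient values, sample size, and seed are made up, and the link is taken to be the logit link that the Link Function section introduces for logistic regression.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Systematic component: eta = beta_0 + beta_1 * X_1 + beta_2 * X_2
n = 1000
X = rng.normal(size=(n, 2))
beta = np.array([0.5, 1.0, -1.0])  # illustrative values: intercept, then slopes
eta = beta[0] + X @ beta[1:]

# Link function: with the logit link g(mu) = log(mu / (1 - mu)),
# the conditional mean is recovered through the inverse link.
mu = 1 / (1 + np.exp(-eta))

# Random component: Y | X ~ Bernoulli(mu)
Y = rng.binomial(1, mu)

print("share of ones:", Y.mean(), " average conditional mean:", mu.mean())
```

The two printed numbers should be close, since each $Y_i$ has expected value $\mu_i$.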
    

### Generalized Linear Model Assumptions

- The data $Y_1, Y_2, \dots, Y_n$ are independently distributed, i.e., cases are independent.
- The dependent variable $Y_i$ does <b>NOT</b> need to be normally distributed.
- $Y$ comes from a distribution in the exponential family (Poisson, Binomial, Normal, $\chi^2$, Categorical, etc.).
- Mean-Variance Relationship: In GLM, the variance depends only on the expected value.
    - $\textrm{Var}(Y\mid X_1,...,X_k) = f(\mathbb{E}(Y \mid X_1,...,X_k))$
- A GLM does <b>NOT</b> assume a linear relationship between the response variable and the explanatory variables.
- A GLM <b>does</b> assume a linear relationship between the <b><i>transformed expected response</i></b> in terms of the link function and the explanatory variables.


### Consider the code below. 

Is $Y$ a generalized linear model of $X_1$ and $X_2$? What about $Z$?

In [None]:
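# Two predictors and standard normal noise, 100 observations each
X1 = np.random.normal(0,1,size=100)
X2 = np.random.uniform(-1,1,size=100)
errors = np.random.normal(0,1,size=100)
# Mean term exp(X1 + X2 + 7) plus noise scaled by X1
Y = np.exp(X1+X2+7)+X1*errors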

In [None]:
# Same setup, but the noise is now scaled by the linear predictor X1 + X2 + 7
X1 = np.random.normal(0,1,size=100)
X2 = np.random.uniform(-1,1,size=100)
errors = np.random.normal(0,1,size=100)
Z = np.exp(X1+X2+7)+(X1+X2+7)*errors

### OLS Regression

A special case of Generalized Linear Models.
- Random component: A continuous target variable $Y$ that follows a normal distribution with mean $\mu$ and constant variance $\sigma^2$.
- Systematic component: linear combinations of continuous or discrete explanatory variables.
- Link function: Identity function, $\eta = g(\mathbb{E}(Y)) = \mathbb{E}(Y)$

### OLS Regression

In ordinary least squares regression, we typically assume that the errors
$\varepsilon$ are independent of $X$ and have constant variance,
$$
\varepsilon \perp X,
\qquad
\mathrm{Var}(\varepsilon \mid X) = \sigma^2.
$$

In generalized linear models, we relax the assumption of constant variance.
Instead, we assume that the conditional variance of the response depends on
the conditional mean, which is a function of $X$:
$$
\mathrm{Var}(Y \mid X) = f(\mathbb{E}[Y \mid X])
$$


### Binary Classification

In Binary Classification, we have a binary response variable $Y \in \{0,1\}$ and predictors $X = (X_1, \dots, X_k)$. If a given data point has $y=1$, we say that it is in the positive class; otherwise it is in the negative class.

### Bernoulli Trials

If $Y \in \{0,1\}$ represents success or failure for a single independent trial, then
$$
\boxed{Y \sim \text{Bernoulli}(\pi)}
$$
where
$
\pi = \mathbb{P}(Y = 1).
$

The Bernoulli distribution has:
$$
\mathbb{E}[Y] = \pi,
\qquad
\mathrm{Var}(Y) = \pi(1-\pi)
$$
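
As a quick numerical check of these two formulas, here is a short simulation sketch. The value of $\pi$, the number of draws, and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
pi = 0.3                        # illustrative success probability

# A Bernoulli(pi) draw is a Binomial draw with a single trial.
draws = rng.binomial(1, pi, size=100_000)

print("sample mean:    ", draws.mean(), " theory:", pi)
print("sample variance:", draws.var(), " theory:", pi * (1 - pi))
```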


Hence, for binary classification with independent trials, the response variable $Y$ follows a Bernoulli distribution (for a single trial).

### Binary Classification

For given values of $X$, we have that $\mathbb{E}(Y \mid X)$ is just the probability $\pi$ that $Y$ is $1$.

Note also that, for independent trials, the conditional distribution $Y \mid X$ is Bernoulli with success probability $\pi$, so its variance is $\pi(1-\pi)$. Hence, <i>the variance depends only on $\mathbb{E}(Y \mid X)$</i>.

Hence, we automatically satisfy $2$ out of the $3$ assumptions for a Generalized Linear Model for any binary classification with independent trials.

### Link Function
A binary classification problem (with independent trials) can then be modeled as a Generalized Linear Model as long as there is a link function $g$ such that:
$$g(\mathbb{E}(Y \mid X)) = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k$$

This is exactly the setting of $\color{blue}{\textbf{Logistic Regression}}$, which models the
<i>Bernoulli mean</i> $\pi$ as a function of predictors via a link function.
- <b>Log-Odds</b>: Here we use the log-odds, so our link function is $g(\pi) = \log\left(\frac{\pi}{1-\pi}\right)$. (Common in Statistics/Data Science)

### Logistic Regression
To sum up the GLM specification of Logistic Regression:

- <b>Random component</b>: Binary response variable that follows a Bernoulli distribution with probability of success $\pi$
    - Or, equivalently, a Binomial random variable with a single trial and success probability $\pi$.
- <b>Systematic component</b>: Linear combinations of continuous or discrete explanatory variables, $\eta = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k$.
- <b>Link function</b>: log-odds or <b>logit function</b>, $\eta = g(\pi) = \log\left(\frac{\pi}{1-\pi}\right)$
    - $\pi = g^{-1}(\eta)$



### Likelihood

$\color{blue}{\textbf{Which model makes the data we actually observed most plausible?}}$

<i>This is a very general procedure to get an estimator for nearly any model parameter we might be interested in.</i>

- Probability asks <i>how likely is this outcome?</i>
    - $\mathbb{P}(\text{data} \mid \theta)$
    - The probability of observing the data given fixed model parameters $\theta$.
- Likelihood asks <i>which parameter values make this observed outcome most likely?</i>
    - Reverse the perspective: treat the data as fixed and the parameters as variable.
    - Relabel the same expression as a function of the parameters: $$\boxed{L(\theta) = \mathbb{P}(\text{observed data} \mid \theta)}$$

### Coin Flip Example

Suppose we observe $7$ heads and $3$ tails from $10$ independent coin flips.
If $p$ denotes the probability of heads, then the likelihood is:
$$
L(p) = p^{7}(1-p)^{3}
$$

Here, the data ($7$ heads, $3$ tails) are fixed, and $p$ is the quantity we vary to check at what value the likelihood is maximized.<br>
The likelihood is maximized at $\hat{p} = \frac{7}{10} = 0.7$.
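
The maximization can be checked numerically with a simple grid search. This is a sketch only: the grid resolution is an arbitrary choice, and a grid search is used here purely for illustration rather than as a recommended estimation method.

```python
import numpy as np

heads, tails = 7, 3

# Candidate values of p, excluding the endpoints 0 and 1.
p_grid = np.linspace(0.001, 0.999, 999)

# Likelihood L(p) = p^7 (1 - p)^3 evaluated on the grid.
likelihood = p_grid**heads * (1 - p_grid)**tails

p_hat = p_grid[np.argmax(likelihood)]
print("grid maximizer:", p_hat)
```

The grid maximizer lands at (approximately) $0.7$, matching the closed-form answer $\hat{p} = 7/10$.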


### Connection to Ordinary Least Squares

In linear regression, we assume:
$$
Y_i = x_i^\top\beta + \varepsilon_i,
\qquad
\varepsilon_i \sim \mathcal{N}(0,\sigma^2).
$$

Under this assumption, the likelihood of the data is proportional to:
$$
\exp\!\left(
-\frac{1}{2\sigma^2}
\sum_{i=1}^n (y_i - x_i^\top\beta)^2
\right)
$$

If we maximize this likelihood with respect to $\beta$ it is equivalent to minimizing the sum of squared errors $\sum_{i=1}^n (y_i - x_i^\top\beta)^2$.  
<i>Ordinary least squares is a special case of maximum likelihood estimation</i>.
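
The equivalence can be illustrated numerically. The sketch below uses simulated data with made-up coefficients, noise level, and seed, and treats $\sigma$ as known. It computes the OLS estimate and checks that moving away from it in random directions only lowers the Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept plus one predictor
beta_true = np.array([1.0, 2.0])                       # illustrative coefficients
sigma = 0.5                                            # noise level, treated as known
y = X @ beta_true + rng.normal(0, sigma, size=n)

def gaussian_log_likelihood(beta):
    """Log-likelihood of y under Y_i = x_i' beta + N(0, sigma^2) noise."""
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - resid @ resid / (2 * sigma**2)

# OLS estimate: minimizes the sum of squared errors.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Evaluate the log-likelihood at the OLS estimate and at random perturbations of it.
perturbations = rng.normal(scale=0.1, size=(20, 2))
ll_ols = gaussian_log_likelihood(beta_ols)
ll_perturbed = np.array([gaussian_log_likelihood(beta_ols + d) for d in perturbations])

print("log-likelihood at OLS estimate:     ", ll_ols)
print("best log-likelihood after perturbing:", ll_perturbed.max())
```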

### Estimating Parameters for GLM via MLE

For non-Gaussian data, such as binary responses, where $Y_i \sim \text{Bernoulli}(\pi_i)$, the squared-error criterion is no longer appropriate.

Instead, we specify the likelihood (objective function) and choose $\beta$ that maximizes it:
$$
\mathcal{L}(\beta)
=
\prod_{i=1}^n
\pi_i^{y_i}(1-\pi_i)^{1-y_i},
\qquad
\pi_i = g^{-1}(x_i^\top\beta)
$$

### MLE for Binary Classification

Because the likelihood is a product of many probabilities, which can cause numerical underflow, we typically work with the $\color{blue}{\textbf{log-likelihood}}$.
$$
\ell(\beta) = \sum_{i=1}^n \left[ y_i \log(\pi_i) + (1-y_i) \log(1-\pi_i) \right]
$$
This has the same maximizer as $\mathcal{L}(\beta)$ but is easier to compute and differentiate. Here, $\pi_i = g^{-1}(x_i^\top\beta)$ comes from the <i>logistic model</i>. Its negative, $-\ell(\beta)$, is the binary cross-entropy (log loss), which is minimized in practice.
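
As an illustration, the Bernoulli log-likelihood under the logistic model can be maximized with plain gradient ascent. This is a minimal sketch, not how libraries necessarily fit the model: the simulated data, step size, and iteration count are assumptions picked for this toy example.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept plus one predictor
beta_true = np.array([-0.5, 1.5])                      # illustrative coefficients
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def log_likelihood(beta):
    """Sum of y*log(pi) + (1-y)*log(1-pi) with pi = sigmoid(x' beta)."""
    eta = X @ beta
    # Algebraically, y*log(pi) + (1-y)*log(1-pi) = y*eta - log(1 + exp(eta)).
    return np.sum(y * eta - np.logaddexp(0, eta))

beta = np.zeros(2)
step = 0.5  # assumed step size for the averaged gradient
for _ in range(2000):
    pi = 1 / (1 + np.exp(-X @ beta))
    gradient = X.T @ (y - pi) / n  # gradient of the average log-likelihood
    beta = beta + step * gradient  # move uphill

print("estimated beta:", beta)
print("log-likelihood at zero vs. at estimate:",
      log_likelihood(np.zeros(2)), log_likelihood(beta))
```

Working with $y_i \eta_i - \log(1 + e^{\eta_i})$ through `np.logaddexp` avoids taking the logarithm of probabilities that are numerically 0 or 1.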

### Logit Link Function

<i>How can we make sense of the logit link function?</i>

$$
\textit{logit}(\pi) = \eta; \qquad
\eta = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k
$$

Also, $\textit{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right)$.

With some algebraic manipulation we can derive $\pi$ using the $\color{blue}{\textbf{Sigmoid}}$ function.
$$
\pi = \frac{e^{\eta}}{1+e^{\eta}} = \frac{1}{1+e^{-\eta}} = \text{sigmoid}(\eta)
$$
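
The inverse relationship between the logit and the sigmoid can be verified numerically. In the sketch below, the function names `logit` and `sigmoid` and the grid of $\eta$ values are choices made for this illustration.

```python
import numpy as np

def logit(pi):
    """Log-odds: maps a probability in (0, 1) to the real line."""
    return np.log(pi / (1 - pi))

def sigmoid(eta):
    """Inverse of the logit: maps any real number to (0, 1)."""
    return 1 / (1 + np.exp(-eta))

eta = np.linspace(-5, 5, 11)
pi = sigmoid(eta)

print(np.allclose(logit(pi), eta))  # True: logit undoes the sigmoid
print(sigmoid(0.0))                 # 0.5: log-odds of zero means even odds
```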

### Working with Generalized Linear Models.

Much of the power of Generalized Linear Models comes from the fact that we can reuse many of the tools we have for linear models, for example:

- Cross Validation: Replace MSE or RSS with an appropriate loss function.
- AIC/BIC: For model selection, again using the log-likelihood formula above.
- Regularization: $L^1$ or $L^2$ regularization can be used (and in fact is used more frequently, because it can improve convergence).
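
Two of these ideas can be sketched in NumPy for logistic regression. Everything here is an illustrative assumption: the helper `fit_logistic` (a basic Newton iteration), the simulated data, and the penalty strength are made up for this example, and for simplicity the $L^2$ penalty is also applied to the intercept. AIC and BIC use the standard definitions $2k - 2\ell$ and $k \log n - 2\ell$, where $k$ is the number of estimated coefficients.

```python
import numpy as np

def fit_logistic(X, y, l2=0.0, n_iter=50):
    """Newton's method for logistic regression with an optional L2 penalty (sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1 / (1 + np.exp(-X @ beta))
        gradient = X.T @ (y - pi) - l2 * beta
        hessian = X.T @ (X * (pi * (1 - pi))[:, None]) + l2 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def log_likelihood(X, y, beta):
    """Bernoulli log-likelihood under the logistic model."""
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0, eta))

rng = np.random.default_rng(5)  # arbitrary seed
n = 400
x1, x2 = rng.normal(size=n), rng.normal(size=n)  # x2 is unrelated to y by construction
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 1.2 * x1))))

X_small = np.column_stack([np.ones(n), x1])
X_large = np.column_stack([np.ones(n), x1, x2])

# Model selection: compare AIC and BIC of the two nested models.
results = {}
for name, X in [("small", X_small), ("large", X_large)]:
    beta_hat = fit_logistic(X, y)
    ll = log_likelihood(X, y, beta_hat)
    k = X.shape[1]  # number of estimated coefficients
    results[name] = {"beta": beta_hat, "loglik": ll,
                     "AIC": 2 * k - 2 * ll, "BIC": k * np.log(n) - 2 * ll}
    print(name, "AIC:", results[name]["AIC"], "BIC:", results[name]["BIC"])

# Regularization: the L2 penalty shrinks the coefficient vector toward zero.
beta_ridge = fit_logistic(X_large, y, l2=10.0)
print("norm without penalty:", np.linalg.norm(results["large"]["beta"]))
print("norm with penalty:   ", np.linalg.norm(beta_ridge))
```

The larger model always attains at least the log-likelihood of the smaller one, so AIC and BIC add a penalty for the extra coefficient before the two are compared.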

### That's all, folks!

<h4><b>For now...</b></h4>

<div style="text-align: center;">
    <img src="images/end-slide.jpg" alt="GLM The End" scale="0.01;" style="width: 40%;">
</div>