## Logistic Regression

In **logistic regression**, we predict **probabilities** for binary classification:

$$
P(y_i = 1 \mid X_i) = \hat{y}_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}, \quad \text{where } z_i = X_i^\top \beta
$$

This gives:

* $\hat{y}_i$ = predicted probability that $y_i = 1$
* So naturally, $1 - \hat{y}_i$ = predicted probability that $y_i = 0$

---

### Line 1: Defining the probability model

$$
P(y_i = 1 \mid X_i) = \hat{y}_i = \sigma(z_i)
$$

* You're modeling the probability that the label $y_i = 1$ given the input $X_i$
* Logistic regression outputs $\hat{y}_i$, a number strictly between 0 and 1 (a probability)
* That’s done using the **sigmoid function** $\sigma(z_i)$, which squashes the linear score $z_i = X_i^\top \beta$ (any real number) into the interval $(0, 1)$
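
As a minimal sketch of this step, the snippet below computes the linear scores and passes them through the sigmoid. The toy feature matrix `X` and the coefficients `beta` are made-up values for illustration, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: 3 samples, 2 features each (row i is X_i).
X = np.array([[0.5, 1.0],
              [-1.0, 2.0],
              [0.0, 0.0]])
beta = np.array([2.0, -1.0])  # assumed coefficients, chosen for illustration

z = X @ beta          # linear scores z_i = X_i^T beta
y_hat = sigmoid(z)    # predicted P(y_i = 1 | X_i) for each sample
```

A score of $z_i = 0$ maps to a probability of exactly 0.5, and large negative scores map to probabilities close to 0.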

---

### Line 2: Complement probability

$$
P(y_i = 0 \mid X_i) = 1 - \hat{y}_i
$$

* Since $y_i$ is binary (0 or 1), the only two possibilities are:

  * $y_i = 1$ with probability $\hat{y}_i$
  * $y_i = 0$ with probability $1 - \hat{y}_i$
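
For reference, the complement can also be written directly in terms of the sigmoid. This short derivation uses only the definition $\sigma(z_i) = \frac{1}{1 + e^{-z_i}}$ from above:

$$
1 - \hat{y}_i = 1 - \frac{1}{1 + e^{-z_i}} = \frac{e^{-z_i}}{1 + e^{-z_i}} = \frac{1}{1 + e^{z_i}} = \sigma(-z_i)
$$

So flipping the sign of the score $z_i$ swaps the two class probabilities.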

---

### Line 3: General form for both 0 and 1

$$
P(y_i \mid X_i) = \hat{y}_i^{y_i} (1 - \hat{y}_i)^{(1 - y_i)}
$$

This is the **core trick**.

Why does this formula work?

| Case    | $y_i$ | Formula reduces to                              | Interpretation                      |
| ------- | ----- | ----------------------------------------------- | ----------------------------------- |
| Class 1 | 1     | $\hat{y}_i^1 (1 - \hat{y}_i)^0 = \hat{y}_i$     | matches $P(y_i = 1 \mid X_i)$       |
| Class 0 | 0     | $\hat{y}_i^0 (1 - \hat{y}_i)^1 = 1 - \hat{y}_i$ | matches $P(y_i = 0 \mid X_i)$       |

So this single expression **works for both classes**.

It says:

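> "The probability of seeing label $y_i$ is $\hat{y}_i$ when the label is 1 and $1 - \hat{y}_i$ when the label is 0. The exponents $y_i$ and $1 - y_i$ act as switches: they keep the factor that matches the actual label and turn the other factor into 1."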
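
The switching behaviour can be checked numerically. The following is a small sketch; the helper name `bernoulli_prob` and the value `y_hat = 0.8` are assumptions made for this example.

```python
import numpy as np

def bernoulli_prob(y, y_hat):
    """P(y | X) = y_hat**y * (1 - y_hat)**(1 - y) for labels y in {0, 1}."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return y_hat**y * (1.0 - y_hat)**(1.0 - y)

y_hat = 0.8  # assumed predicted probability of class 1

p_if_one = bernoulli_prob(1, y_hat)   # exponent on (1 - y_hat) is 0, leaving y_hat
p_if_zero = bernoulli_prob(0, y_hat)  # exponent on y_hat is 0, leaving 1 - y_hat
```

With these inputs, `p_if_one` equals `y_hat` and `p_if_zero` equals `1 - y_hat` (up to floating-point rounding), so the two cases sum to 1.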

---

### Line 4: Likelihood of the whole dataset

$$
L(\beta) = \prod_{i=1}^{m} \hat{y}_i^{y_i} (1 - \hat{y}_i)^{(1 - y_i)}
$$

* Assuming the $m$ samples are **independent**, the probability of observing all of the labels given their inputs is the **product** of the individual probabilities $P(y_i \mid X_i)$.
* This is called the **likelihood function**.
* It’s written as a function of the parameters $\beta$, since we’re trying to choose the best $\beta$ to make the data most likely.
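
To make the dependence on $\beta$ concrete, here is a minimal sketch that evaluates $L(\beta)$ for two candidate parameter vectors. The function name `likelihood`, the toy data, and both candidate vectors are illustrative assumptions, and the first column of `X` is assumed to be a constant 1 acting as an intercept term.

```python
import numpy as np

def likelihood(beta, X, y):
    """L(beta) = prod_i y_hat_i**y_i * (1 - y_hat_i)**(1 - y_i)."""
    z = X @ beta                          # linear scores z_i = X_i^T beta
    y_hat = 1.0 / (1.0 + np.exp(-z))      # sigmoid gives P(y_i = 1 | X_i)
    per_sample = y_hat**y * (1.0 - y_hat)**(1.0 - y)
    return np.prod(per_sample)            # product over independent samples

# Hypothetical toy dataset: column 0 is an assumed intercept column of ones.
X = np.array([[1.0, 2.0],
              [1.0, -1.0],
              [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])

L_zero = likelihood(np.zeros(2), X, y)             # every y_hat is 0.5, so L = 0.5**3
L_better = likelihood(np.array([0.0, 2.0]), X, y)  # scores agree in sign with the labels
```

On this toy data, `L_better` is larger than `L_zero`, which is the sense in which one $\beta$ makes the observed labels "more likely" than another.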

