# Logistic Regression & Numerical Optimization

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from ipywidgets import interact, fixed

# Logistic Regression

### Idea

Get the probability of a sample belonging to a class.

- regression task used for classification: 
    - the output is the probability of a sample belonging to a class.
    - the output is transformed to a binary value using a threshold.
- useful if we want to prioritize the most likely class.
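
The thresholding and prioritizing steps listed above can be sketched as follows. The probabilities, the helper name `to_labels`, and the threshold values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# Hypothetical predicted probabilities P(y = 1 | x) for five samples.
probs = np.array([0.05, 0.40, 0.50, 0.75, 0.99])

def to_labels(probs, threshold=0.5):
    """Map probabilities to binary labels: 1 if prob >= threshold, else 0."""
    return (probs >= threshold).astype(int)

labels = to_labels(probs)        # default threshold of 0.5
strict = to_labels(probs, 0.9)   # a stricter threshold yields fewer positives

# Rank samples by probability, most likely positive first.
ranking = np.argsort(-probs)
```

Keeping the probabilities around (rather than only the labels) is what makes the ranking in the last line possible.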

### Hypothesis Class = Logistic Function

Transforms an observation ($x$) into a probability using the function: 

$$
    H(x) = \frac{1}{1 + e^{-\langle \purple{w}, x \rangle}}
$$

- the "Logistic Function"
- Parameterized by $w$.
- The probability that we label $x$ as $1$

**Note:** 
The function is **monotonic** (always increasing). So, the probability of a sample belonging to a class increases as the dot product of the weights and the sample increases.


#### As a stochastic model

$$
    H(X) = P(Y=1 | X,\purple{w})
$$


<div id="container" style="display: flex">
    <img src="img/IMG_DF2CD5C31F78-1.jpeg" style="width: 25%; margin-right:2%"></img>
    <div>
        As you can see, using a linear model (i.e. linear regression) is not appropriate.
    </div>
</div>

In [None]:
# Plot a sigmoid function:

def sigmoid(x, a, b):
    return 1 / (1 + np.exp(-a * (x - b)))

x = np.linspace(-10, 10, 100)
y = sigmoid(x, 1, 0)

plt.plot(x, y)
plt.yticks([0,0.5, 1])
plt.xticks([])
plt.xlabel('<w,x>')
plt.ylabel('h(x)')

In [13]:
## Logistic Function

def getLogisticFunction(w: np.ndarray):
    # The exponent is negated, matching H(x) = 1 / (1 + exp(-<w, x>)).
    return lambda x: 1 / (1 + np.exp(-np.dot(w, x)))

## Finding Parameters, $w$

### IDEA: View Likelihood Function in terms of a Bernoulli Variable

<div id="container" style="display: flex">
    <img src="img/img2.jpeg" style="width: 25%; margin-right:2%"></img>
    <div>
        <ul>
            <li>The probability of a sample belonging to a class is modeled as a Bernoulli variable.</li>
            <li>p changes depending on where the sample is in the feature space (i.e. p = H(x))</li>
        </ul>
    </div>
</div>

### Derivation

#### 1. Using the Bernoulli RV

We model each outcome $y_i$ as a Bernoulli RV with success probability $\pink{p}$:

$$
    \begin{aligned}
        L(\pink{p} | y) 
        &= 
        \prod_{i=1}^n \pink{p}^{y_i} (1 - \pink{p})^{1-y_i}
        \\
        &= 
        \prod_{i=1}^n
        \begin{cases} 
            \pink{p} & \text{for } y_i = 1 \\
            1 - \pink{p} & \text{for } y_i = 0
        \end{cases}
    \end{aligned}
$$

#### 2. Log Likelihood

$$
    \begin{aligned}
        \log L(\pink{p} | y) 
        &= 
        \sum_{i=1}^n \log \left( \pink{p}^{y_i} (1 - \pink{p})^{1 - y_i} \right)
        \\
        &=
        \sum_{i=1}^n y_i \log \pink{p} + (1 - y_i) \log (1 - \pink{p})
        \\
        &=
        \sum_{i=1}^n
        \begin{cases} 
            \log \pink{p}
            & \text{for } y_i = 1 
            \\
            \log (1 - \pink{p})
            & \text{for } y_i = 0
        \end{cases}
    \end{aligned}
$$

#### 3. Using our Logistic Function 

i.e. $p = H(x_i)=P[Y=1 | x_i,w]$

$$
        \log L(\pink{p} | y) 
        = 
        \sum_{i=1}^n
        \begin{cases} 
                \log(\pink{\frac{1}{1+e^{-\langle \purple{w}, x_i \rangle}}})
                & \text{for } y_i = 1 
                \\
                \log (1 - \pink{\frac{1}{1+e^{-\langle \purple{w}, x_i \rangle}}})
                & \text{for } y_i = 0
        \end{cases}
$$

$$
        \log L(\purple{w} | x,y)
        =
        \sum_{i=1}^n
        \begin{cases} 
                - \log(1 + e^{- \langle \purple{w}, x_i \rangle})
                & \text{for } y_i = 1 
                \\
                - \log (1 + e^{\langle \purple{w}, x_i \rangle})
                & \text{for } y_i = 0
        \end{cases}
$$

#### 4. Use function for $y$'s sign

Let 


$$
\tilde{y}_i = 
    \begin{cases}
        1  & \text{for } y_i = 1 \\
        -1 & \text{for } y_i = 0
    \end{cases}
$$

Then: 

$$
        \log L(\purple{w} | x,y)
        =
        \sum_{i=1}^n - \log(1 + e^{-\tilde{y}_i \langle \purple{w}, x_i \rangle})
$$
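
As a sanity check on the derivation, the sketch below compares the Bernoulli log-likelihood with the signed-label form on a small random dataset. The data, the seed, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 6 samples, 3 features, labels in {0, 1}.
X = rng.normal(size=(6, 3))
y = np.array([1, 0, 1, 1, 0, 0])
w = rng.normal(size=3)

z = X @ w                       # <w, x_i> for every sample
p = 1.0 / (1.0 + np.exp(-z))    # p_i = H(x_i)

# Bernoulli form: sum_i y_i log p_i + (1 - y_i) log(1 - p_i)
ll_bernoulli = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Signed-label form: sum_i -log(1 + exp(-y~_i <w, x_i>)), with y~_i in {-1, +1}
y_tilde = 2 * y - 1
ll_signed = np.sum(-np.log1p(np.exp(-y_tilde * z)))
```

The two values should agree up to floating-point error, which is what the substitution $\tilde{y}_i = 2 y_i - 1$ is meant to achieve.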

### Find $w$ that maximizes the likelihood

$$
    \begin{aligned}
        \hat{\purple{w}}
        &= 
        \text{argmax}_{\purple{w}} \sum_{i=1}^n - \log(1 + e^{- \tilde{y}_i \langle \purple{w}, x_i \rangle})
        \\
        &=
        \text{argmin}_{\purple{w}} \sum_{i=1}^n \log(1 + e^{- \tilde{y}_i \langle \purple{w}, x_i \rangle})
        \\
        &=
        \text{argmin}_{\purple{w}} \frac{1}{n} \sum_{i=1}^n \log(1 + e^{- \tilde{y}_i \langle \purple{w}, x_i \rangle})
        
    \end{aligned}
$$

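Dividing by $n$ does not change the minimizer. It turns the sum into the empirical mean of the per-sample loss, which estimates its expected value: 

$$
    E[l_{\purple{w}}(x, \tilde{y})] \approx \frac{1}{n} \sum_{i=1}^n \log(1 + e^{- \tilde{y}_i \langle \purple{w}, x_i \rangle})
$$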

**Cross Entropy Loss:**

"Cross entropy" means that we are comparing the predicted probability $H(x)$ with the true label ($\tilde{y}$).

This is what we are trying to minimize.

$$
    l_{\purple{w}}(x, \tilde{y}) = \log(1 + e^{- \tilde{y} \langle \purple{w}, x \rangle})
$$
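
A minimal sketch of this loss in NumPy, averaged over a batch. The function name `cross_entropy_loss` and the toy data are assumptions. `np.logaddexp(0, -m)` is used because it evaluates $\log(1 + e^{-m})$ without overflowing for large negative margins.

```python
import numpy as np

def cross_entropy_loss(w, X, y_tilde):
    """Mean of log(1 + exp(-y~_i <w, x_i>)) over the rows of X."""
    margins = y_tilde * (X @ w)
    # logaddexp(0, -m) = log(exp(0) + exp(-m)) = log(1 + exp(-m)), computed stably
    return np.mean(np.logaddexp(0.0, -margins))

# Hypothetical toy data with labels in {-1, +1}.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y_tilde = np.array([1, -1, 1])

# At w = 0 every sample contributes log(2).
loss_at_zero = cross_entropy_loss(np.zeros(2), X, y_tilde)
# A w whose sign agrees with every label gives a smaller loss.
loss_good = cross_entropy_loss(np.array([0.0, 3.0]), X, y_tilde)
```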

#### Summary: 

<img src="img/img3.jpeg" style="width:50%" ></img>


### Minimizing Cross Entropy Loss

#### Approach #1: Gradient Descent

**Recall:** *Gradient* is the direction of the steepest ascent (vector of partial derivatives). 

**Idea:** Move in the opposite direction of the gradient to minimize the loss. 

Let $\eta =$ step size, then: 

$$
    w_t = w_{t-1} - \eta \nabla f(w_{t-1})
$$

For the loss function, this is computed over a batch of $n$ samples: 

$$
    \nabla l_{w_{t-1}}(x, \tilde{y}) = \frac{1}{n} \sum_{i=1}^n \nabla l_{w_{t-1}}(x_i, \tilde{y}_i)
$$
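
The update rule can be sketched as below. It relies on the per-sample gradient $\nabla l_w(x, \tilde{y}) = -\tilde{y}\, x / (1 + e^{\tilde{y} \langle w, x \rangle})$, obtained by differentiating the loss above. The step size `eta=0.1`, the number of steps, the function names, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def loss(w, X, y_tilde):
    """Mean cross entropy loss, labels in {-1, +1}."""
    return np.mean(np.logaddexp(0.0, -y_tilde * (X @ w)))

def gradient(w, X, y_tilde):
    """Mean gradient: (1/n) * sum_i -y~_i x_i / (1 + exp(y~_i <w, x_i>))."""
    margins = y_tilde * (X @ w)
    coeff = -y_tilde / (1.0 + np.exp(margins))
    return X.T @ coeff / len(y_tilde)

def gradient_descent(X, y_tilde, eta=0.1, steps=500):
    """Repeat w_t = w_{t-1} - eta * gradient(w_{t-1}), starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - eta * gradient(w, X, y_tilde)
    return w

# Hypothetical toy data: a bias column plus one feature, with overlapping classes.
X = np.array([[1, -2.0], [1, -1.0], [1, 0.5], [1, -0.5], [1, 1.0], [1, 2.0]])
y_tilde = np.array([-1, -1, -1, 1, 1, 1])

w_hat = gradient_descent(X, y_tilde)
```

The classes overlap on purpose: with perfectly separable data the loss has no finite minimizer and the weights keep growing.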

#### Approach #2: Newton's Method

**Idea:** Use the second derivative to get a better approximation of the minimum.

Newton's Method finds the roots of $f(w)$ by iteratively updating $w$ using the following formula:

<img src="img/img4.jpeg" style="width: 40%"/>

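In parameter estimation, we want the roots of $l'_w(x)$, i.e. the stationary points of the loss. For a vector $w$, dividing by the second derivative becomes multiplying by the inverse of the Hessian $\nabla^2 l_w(x)$:

$$
    \begin{aligned}
        w_1 & = w_0-\frac{l'_{w_0}(x)}{l''_{w_0}(x)}
        \\ &= 
        w_0 - \left[\nabla^2 l_{w_0}(x)\right]^{-1} \nabla l_{w_0}(x)
    \end{aligned}
$$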
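
A minimal sketch of the Newton update for the cross entropy loss, assuming the mean Hessian $\frac{1}{n} \sum_i s_i (1 - s_i)\, x_i x_i^\top$ with $s_i = 1 / (1 + e^{-\langle w, x_i \rangle})$. The function names, the fixed number of steps, and the toy data are assumptions. There is no step-size control here, so this plain version is not guaranteed to converge on every dataset.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(w, X, y_tilde):
    """Mean gradient of log(1 + exp(-y~_i <w, x_i>))."""
    margins = y_tilde * (X @ w)
    return X.T @ (-y_tilde * sigmoid(-margins)) / len(y_tilde)

def hessian(w, X):
    """Mean Hessian: (1/n) * sum_i s_i (1 - s_i) x_i x_i^T."""
    s = sigmoid(X @ w)
    return (X * (s * (1.0 - s))[:, None]).T @ X / X.shape[0]

def newton(X, y_tilde, steps=10):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Solve H d = g rather than forming the inverse Hessian explicitly.
        w = w - np.linalg.solve(hessian(w, X), gradient(w, X, y_tilde))
    return w

# Hypothetical toy data: a bias column plus one feature, with overlapping classes.
X = np.array([[1, -2.0], [1, -1.0], [1, 0.5], [1, -0.5], [1, 1.0], [1, 2.0]])
y_tilde = np.array([-1, -1, -1, 1, 1, 1])

w_hat = newton(X, y_tilde)
```

On this small problem a handful of Newton steps drives the gradient close to zero, whereas gradient descent typically needs many more iterations; the price is building and solving a linear system with the Hessian at every step.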

## Multinomial Regression

#### *(Generalizing to Multi Class)*


The **Softmax** Function. For $K$ classes, the probability of class $k$ is:

$$
    Pr[y=k | x,w] = \frac
        {e^{\langle w_k, x\rangle}}
        {\sum_{j=1}^K e^{\langle w_j, x\rangle}}
$$
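
The softmax formula can be sketched in NumPy as follows. The function name `softmax_probs`, the weight matrix `W` (one row per class), and the sample `x` are assumptions for illustration. Subtracting the largest score before exponentiating is a common numerical safeguard that does not change the result.

```python
import numpy as np

def softmax_probs(W, x):
    """Class probabilities for one sample x; row j of W is the weight vector w_j."""
    scores = W @ x
    # Subtracting the max score leaves the ratios unchanged and avoids overflow.
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / np.sum(exp_scores)

# Hypothetical weights for 3 classes and 2 features.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
x = np.array([2.0, 1.0])

probs = softmax_probs(W, x)              # one probability per class
predicted_class = int(np.argmax(probs))  # index of the most likely class
```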

<div style="display: flex">
    <div>
        <h3>Binary</h3>
        <img src="./img/img5.png" style="max-width: 50%; object-fit: contain;"/>
    </div>
    <div>
        <h3>Multiclass</h3>
        <img src="./img/img6.png" style="max-width: 50%; object-fit: contain;"/>
    </div>
</div>