# Logistic Regression

The goal of logistic regression is to train a classifier that can decide which class a new input observation belongs to.


## Binary

We begin with the sigmoid function $\sigma(a) = \frac{1}{1 + e^{-a}}$. This function maps any real input to an output in the open interval
$(0, 1)$, so its output can be read as a probability.
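As a quick illustration, the sigmoid can be sketched in NumPy as follows. The function name `sigmoid` and the choice of NumPy are assumptions made for this example.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a)), applied element-wise."""
    a = np.asarray(a, dtype=float)
    return 1.0 / (1.0 + np.exp(-a))

values = sigmoid([-5.0, 0.0, 5.0])
print(values)  # every entry lies strictly between 0 and 1
```

For inputs that are very negative, `np.exp(-a)` can overflow and emit a warning, so a more careful version might branch on the sign of `a`.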

The classifier has two possible outputs, $c_1$ and $c_2$. We want the probability that a sample belongs to each class, that is,
we want to know $p(C=c_1 | X=x_i) \text{ and }p(C=c_2 | X=x_i)$.

Let $p(C=c_1 | X=x_i) = \sigma(w^Tx_i) = \frac{1}{1 + e^{-w^Tx_i}}$, then $p(C=c_2 | X=x_i) = 1 - p(C=c_1 | X=x_i) = 1 - \sigma(w^Tx_i)
= 1 - \frac{1}{1 + e^{-w^Tx_i}} = \frac{e^{-w^Tx_i}}{1 + e^{-w^Tx_i}}$,

where $w = [b, w^1, \dots, w^d]^T$ and $x_i = [1, x_i^1, \dots, x_i^d]^T$, so the bias $b$ is absorbed into $w$ by prepending a constant 1 to every input.
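The two class probabilities and the bias-absorbing trick can be sketched as below. The helper names `add_bias_column` and `class_probabilities` are hypothetical, and the sketch assumes the inputs are stacked row-wise in a NumPy array `X` of shape `(N, d)`.

```python
import numpy as np

def add_bias_column(X):
    """Prepend a column of ones so that w[0] plays the role of the bias b."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def class_probabilities(w, X_aug):
    """Return p(C=c_1 | x_i) and p(C=c_2 | x_i) for every row x_i of X_aug."""
    a = X_aug @ w                       # w^T x_i for each sample
    p_c1 = 1.0 / (1.0 + np.exp(-a))     # sigma(w^T x_i)
    p_c2 = 1.0 - p_c1                   # e^{-a} / (1 + e^{-a})
    return p_c1, p_c2

X = np.array([[0.5, -1.0], [2.0, 0.0]])
w = np.array([0.1, 1.0, -0.5])          # [b, w^1, w^2], chosen arbitrarily
p_c1, p_c2 = class_probabilities(w, add_bias_column(X))
print(p_c1, p_c2)
```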

Now suppose we have $N$ samples $\{x_i, t_i\}$ and want to use maximum likelihood estimation (MLE) to estimate the model parameter $w$.
Let $y_i = 0$ for $t_i = c_1$ and $y_i = 1$ for $t_i = c_2$.

The likelihood $L(w)$:

$L(w) = \prod^{N}_{i=1} p(C=t_i | X=x_i; w)$

$\implies l(w) = \sum_{i=1}^{N}\log p(C=t_i | X=x_i; w) = \sum_{i=1}^{N} \left[ y_i \log p(C=c_2 | X=x_i; w) + (1 - y_i) \log p(C=c_1 | X=x_i; w) \right]$

Now, instead of maximizing the log-likelihood, we can minimize $-l(w)$:

$E_{ce}(w) = -l(w) = - \sum_{i=1}^{N}\log p(C=t_i | X=x_i; w) = -\sum_{i=1}^{N} \left[ y_i \log p(C=c_2 | X=x_i; w) + (1 - y_i) \log p(C=c_1 | X=x_i; w) \right]$ (this $-l(w)$ is called the **cross entropy**)
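A minimal sketch of evaluating the negative log-likelihood summed over all samples, following the labeling convention of this section ($y_i = 1$ marks class $c_2$). The function name `binary_cross_entropy` is hypothetical, and `np.logaddexp` is used here only as one way to avoid overflow when computing the logarithms.

```python
import numpy as np

def binary_cross_entropy(w, X, y):
    """Negative log-likelihood for labels y in {0, 1}, where y = 1 marks c_2.

    X has shape (N, d + 1) and already contains the leading column of ones.
    """
    a = X @ w                            # w^T x_i for every sample
    log_p_c1 = -np.logaddexp(0.0, -a)    # log(1 / (1 + e^{-a}))
    log_p_c2 = -np.logaddexp(0.0, a)     # log(e^{-a} / (1 + e^{-a}))
    return -np.sum(y * log_p_c2 + (1.0 - y) * log_p_c1)

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
print(binary_cross_entropy(np.zeros(2), X, y))  # equals 3 * log(2) when w = 0
```

With $w = 0$ both classes get probability $\frac{1}{2}$ for every sample, so the value is $N \log 2$, which makes a convenient sanity check.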

\begin{aligned}
\implies \frac{\partial E_{ce}(w)}{\partial w} &= -\frac{\partial}{\partial w} \sum^{N}_{i=1} \left[ y_i \log\left(\frac{e^{-w^Tx_i}}{1 + e^{-w^Tx_i}}\right) + (1 - y_i) \log\left(\frac{1}{1 + e^{-w^Tx_i}}\right) \right]\\
&= -\frac{\partial}{\partial w} \sum^{N}_{i=1} \left[ y_i(-w^Tx_i) - y_i \log(1 + e^{-w^T x_i}) - (1 - y_i) \log(1 + e^{-w^T x_i}) \right]\\
&= -\sum^{N}_{i=1} \left[ y_i(-x_i) + y_i x_i \frac{e^{-w^T x_i}}{1 + e^{-w^Tx_i}} + (1 - y_i) x_i \frac{e^{-w^T x_i}}{1 + e^{-w^Tx_i}} \right]\\
&= \sum^{N}_{i=1} \left[ y_i x_i - x_i \frac{e^{-w^T x_i}}{1 + e^{-w^Tx_i}} \right]\\
&= \sum^{N}_{i=1} x_i \left( y_i - \frac{e^{-w^T x_i}}{1 + e^{-w^Tx_i}} \right)\\
&= \sum^{N}_{i=1} x_i \left( y_i - p(C=c_2 | X=x_i) \right)
\end{aligned}
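One way to gain confidence in a derived gradient such as $\sum_i x_i (y_i - p(C=c_2 | X=x_i))$ is to compare it with central finite differences of the cross entropy. The sketch below does this on random data. All function names are hypothetical, and the data is synthetic.

```python
import numpy as np

def cross_entropy(w, X, y):
    """Cross entropy with y = 1 marking class c_2 and p(c_1 | x) = sigma(w^T x)."""
    a = X @ w
    return -np.sum(-y * np.logaddexp(0.0, a) - (1.0 - y) * np.logaddexp(0.0, -a))

def analytic_gradient(w, X, y):
    """sum_i x_i * (y_i - p(C=c_2 | x_i)), with p(c_2 | x) = 1 / (1 + e^{w^T x})."""
    p_c2 = 1.0 / (1.0 + np.exp(X @ w))
    return X.T @ (y - p_c2)

def numerical_gradient(w, X, y, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (cross_entropy(w + step, X, y)
                   - cross_entropy(w - step, X, y)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])  # bias column + 3 features
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=4)
print(np.max(np.abs(analytic_gradient(w, X, y) - numerical_gradient(w, X, y))))
```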

Since this gradient is a non-linear function of $w$, setting it to zero has no closed-form solution for $w$, but we can use an iterative method such as gradient descent to update $w$.
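As one possible iterative method, a plain gradient descent loop might look like the sketch below. The function name `fit_binary`, the step size `lr`, the iteration count `max_iter`, and the toy data are all assumptions for illustration. The update uses the gradient $\sum_i x_i (y_i - p(C=c_2 | X=x_i))$ of the cross entropy.

```python
import numpy as np

def fit_binary(X, y, lr=0.1, max_iter=1000):
    """Minimize the cross entropy by gradient descent.

    X: (N, d + 1) inputs with a leading column of ones.
    y: (N,) labels in {0, 1}, where y = 1 marks class c_2.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p_c2 = 1.0 / (1.0 + np.exp(X @ w))   # p(C=c_2 | x_i) = e^{-a} / (1 + e^{-a})
        grad = X.T @ (y - p_c2)              # gradient of the cross entropy
        w -= lr * grad                       # step against the gradient
    return w

# Toy 1-D data: negative inputs belong to c_2 (y = 1), positive inputs to c_1 (y = 0).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = fit_binary(X, y)
p_c2 = 1.0 / (1.0 + np.exp(X @ w))
print(w, p_c2)
```

Because the toy data is linearly separable, the weights keep growing with more iterations. A fixed iteration budget or a regularization term is the usual way to keep them bounded.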

## Multi-class

Now, instead of two classes, we have $K$ classes into which we want to classify a sample $x_n$; that is, we want to know
$p(C=c_1 | X=x_n)$, $p(C=c_2 | X=x_n)$, ... , $p(C=c_K | X=x_n)$:

Define $p(C=c_i | X=x_n) = \frac{e^{w_i^Tx_n}}{\sum_{j=1}^{K} e^{w_j^T x_n}}$ (the softmax function). Since the denominator is the sum of the numerators over all classes,
$\sum_{i=1}^{K}p(C=c_i | X=x_n) = 1$.
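A small NumPy sketch of the softmax follows. It assumes the parameter vectors $w_1, \dots, w_K$ are stacked as the columns of a matrix `W`, which is a layout chosen for this example. Subtracting the row-wise maximum before exponentiating leaves the result unchanged and is a common guard against overflow.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; scores[n, i] holds w_i^T x_n."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # does not change the result
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

X = np.array([[1.0, 0.5, -1.0], [1.0, 2.0, 0.3]])   # two samples, bias column first
W = np.array([[0.0, 0.1, -0.2, 0.3],
              [1.0, -1.0, 0.5, 0.0],
              [0.2, 0.4, -0.6, 0.8]])               # shape (d + 1, K) with K = 4
P = softmax(X @ W)                                  # P[n, i] = p(C=c_i | X=x_n)
print(P.sum(axis=1))                                # each row sums to 1
```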

We first one-hot encode the target class: class $c_j$ is represented as $[t_1, t_2, \dots, t_K]^T$ with $t_i\in\{0, 1\}$ and only $t_j = 1$
(e.g. $c_3 = [0, 0, 1, 0, \dots, 0]^T$).
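One-hot encoding can be sketched as below. The helper name `one_hot` is hypothetical, and the sketch assumes the labels are 0-based integers, so class $c_3$ corresponds to index 2.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (N, K) matrix with a single 1 per row at the label's index."""
    labels = np.asarray(labels)
    T = np.zeros((labels.size, num_classes))
    T[np.arange(labels.size), labels] = 1.0
    return T

T = one_hot([2, 0, 4], num_classes=5)
print(T[0])  # the row for class c_3 (index 2)
```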

Then, we want to use MLE to find the corresponding parameters $w_1, w_2, w_3, \dots, w_K$ for our probability distribution given $N$
sample pairs $\{x_n, g_n\}$, where $g_n$ is the class of sample $x_n$.

Thus,

$L(w_1, \dots, w_K) = \prod_{n=1}^{N} p(C=g_n | X=x_n; w_1, \dots, w_K)$

\begin{aligned}
\implies -l(w_1, \dots, w_K) &= -\sum_{n=1}^{N} \log p(C=g_n | X=x_n; w_1, \dots, w_K)\\
&= -\sum_{n=1}^{N} \left[ t_{n1}\log p(C=c_1 | X=x_n; w_1, \dots, w_K) + \dots + t_{nK}\log p(C=c_K | X=x_n; w_1, \dots, w_K) \right]\\
&= -\sum^{N}_{n=1} \sum^{K}_{i=1} t_{ni} \log p(C=c_i | X=x_n; w_1, \dots, w_K)
\end{aligned}

Here $t_{ni}$ is the $i$-th entry of the one-hot encoding of $g_n$. The final expression is called the multi-class cross entropy.
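The multi-class cross entropy, $-\sum_n \sum_i t_{ni} \log p(C=c_i | X=x_n)$, can be evaluated as in the sketch below. The function name `multiclass_cross_entropy` and the column-stacked weight matrix `W` are assumptions for this example. The log-probabilities are computed directly from the scores to avoid taking the log of a very small number.

```python
import numpy as np

def multiclass_cross_entropy(W, X, T):
    """-sum_n sum_i T[n, i] * log p(C=c_i | X=x_n) under the softmax model.

    W: (d + 1, K) weights, X: (N, d + 1) inputs, T: (N, K) one-hot targets.
    """
    scores = X @ W
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilizing shift
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.sum(T * log_probs)

X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, 0.0]])
T = np.eye(3)[[0, 1, 2, 1]]                    # one-hot rows for classes 0, 1, 2, 1
print(multiclass_cross_entropy(np.zeros((2, 3)), X, T))  # N * log(K) when W = 0
```

With all weights at zero, every class gets probability $\frac{1}{K}$, so the loss is $N \log K$.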

Taking the derivative with respect to each parameter vector $w_i$, we have:

$\frac{\partial -l(w_1, ..., w_K)}{\partial w_i} = \sum_{n=1}^{N} [p(C=c_i | X=x_n; w_1, ..., w_K) - t_{ni}]x_n$

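With L2 regularization, that is, adding the penalty $\frac{\lambda}{2}\sum_{j=1}^{K} \|w_j\|^2$ to $-l$, the gradient gains an extra term $\lambda w_i$:

$\frac{\partial}{\partial w_i}\left[-l(w_1, \dots, w_K) + \frac{\lambda}{2}\sum_{j=1}^{K} \|w_j\|^2\right] = \sum_{n=1}^{N} [p(C=c_i | X=x_n; w_1, \dots, w_K) - t_{ni}]x_n + \lambda w_i$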
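In matrix form, the gradients for all $w_i$ can be computed at once as `X.T @ (P - T) + lam * W`, where column $i$ holds the gradient with respect to $w_i$. The sketch below uses that in a plain gradient descent loop. It assumes the penalty has the form $\frac{\lambda}{2}\sum_j \|w_j\|^2$, and the names `regularized_gradient`, `fit_softmax`, and `lam`, the hyperparameter values, and the toy data are all hypothetical.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def regularized_loss(W, X, T, lam):
    """Multi-class cross entropy plus (lam / 2) * sum of squared weights."""
    P = softmax(X @ W)
    return -np.sum(T * np.log(P)) + 0.5 * lam * np.sum(W ** 2)

def regularized_gradient(W, X, T, lam):
    """Column i is sum_n (p(C=c_i | x_n) - t_ni) * x_n + lam * w_i."""
    P = softmax(X @ W)
    return X.T @ (P - T) + lam * W

def fit_softmax(X, T, lam=0.1, lr=0.05, max_iter=500):
    """Gradient descent on the regularized loss, starting from zero weights."""
    W = np.zeros((X.shape[1], T.shape[1]))
    for _ in range(max_iter):
        W -= lr * regularized_gradient(W, X, T, lam)
    return W

# Toy 1-D data with a bias column and three classes.
X = np.array([[1.0, -2.0], [1.0, -1.5], [1.0, 0.0],
              [1.0, 0.2], [1.0, 2.0], [1.0, 1.8]])
T = np.eye(3)[[0, 0, 1, 1, 2, 2]]
W = fit_softmax(X, T)
print(regularized_loss(np.zeros((2, 3)), X, T, 0.1), regularized_loss(W, X, T, 0.1))
```

The second printed value should be smaller than the first, since each step moves against the gradient with a small step size.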



class LogisticRegression:

    def __init__(self, c, max_iter, lr):

        self.lr = lr
        self.c = c
        self.max_iter = max_iter

    def fit(self, x_train, y_train):

        n, k = y_train.shape


