# Logistic Regression

The goal of logistic regression is to train a classifier that can decide which class a new input observation belongs to.


## Binary

We begin with the sigmoid function $\sigma(a) = \frac{1}{1 + e^{-a}}$. This function maps any real input to an output in the interval
$(0, 1)$, which is exactly the range in which a probability lies.
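
As a small illustration of this mapping, the sketch below evaluates the sigmoid with NumPy. The helper name `sigmoid` and the sample inputs are assumptions made for this example; the sign-dependent form is one common way to avoid overflow in `exp` and is not taken from the notebook's own code.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a)), evaluated without overflow."""
    a = np.asarray(a, dtype=float)
    # exp(-|a|) never overflows; choose the algebraically equivalent form per sign.
    e = np.exp(-np.abs(a))
    return np.where(a >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

# Illustrative inputs, including two extreme values
values = sigmoid(np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0]))
print(values)
```

In floating point the extreme inputs round to exactly 0 and 1, although mathematically the sigmoid never reaches either endpoint.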

## Why $\sigma(a)$?

We want to model the log odds as a linear function of the input:

$\log\left(\frac{p(C=c_1 | X=x_i)}{p(C=c_2 | X=x_i)}\right) = \log\left(\frac{p(C=c_1 | X=x_i)}{1 - p(C=c_1 | X=x_i)}\right) = w^Tx_i = a$

$\implies \frac{p(C=c_1 | X=x_i)}{1 - p(C=c_1 | X=x_i)} = e^a \implies p(C=c_1 | X=x_i) = \frac{e^a}{1 + e^a} = \frac{1}{1 + e^{-a}}$
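
As a quick numeric check of this derivation (the value 0.8 is chosen only for illustration), start from a probability, compute its log odds, and map it back through the sigmoid:

$p = 0.8 \implies \frac{p}{1 - p} = 4 \implies a = \log 4 \approx 1.386 \implies \sigma(a) = \frac{1}{1 + e^{-\log 4}} = \frac{1}{1 + 0.25} = 0.8$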

## Model and maximum likelihood estimation

The classifier outputs one of two classes, $c_1$ or $c_2$. We want the probability that a sample belongs to each class, that is,
we want to know $p(C=c_1 | X=x_i)$ and $p(C=c_2 | X=x_i)$.

Let $p(C=c_1 | X=x_i) = \sigma(w^Tx_i) = \frac{1}{1 + e^{-w^Tx_i}}$, then $p(C=c_2 | X=x_i) = 1 - p(C=c_1 | X=x_i) = 1 - \sigma(w^Tx_i)
= 1 - \frac{1}{1 + e^{-w^Tx_i}} = \frac{e^{-w^Tx_i}}{1 + e^{-w^Tx_i}}$

where $w = [b, w^1, \dots, w^d]^T$ and $x = [1, x^1, \dots, x^d]^T$, so the bias $b$ is absorbed into $w$.
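
The augmented vectors can be sketched in NumPy as follows. All numeric values and variable names here are illustrative assumptions; the point is only that prepending a 1 to $x$ folds the bias into the dot product, and that the two class probabilities sum to 1.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

b = 0.5                                  # bias (illustrative value)
w_features = np.array([2.0, -1.0])       # w^1, w^2 (illustrative values)
x_features = np.array([1.5, 0.5])        # x^1, x^2 (illustrative values)

# Augment: w = [b, w^1, ..., w^d]^T and x = [1, x^1, ..., x^d]^T
w = np.concatenate([[b], w_features])
x = np.concatenate([[1.0], x_features])

a = w @ x                                # equals b + w_features @ x_features
p_c1 = sigmoid(a)
p_c2 = np.exp(-a) / (1.0 + np.exp(-a))   # closed form of 1 - sigmoid(a)
print(a, p_c1, p_c2)
```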

Now suppose we have $N$ samples $\{x_i, t_i\}$ and want to use maximum likelihood estimation (MLE) to estimate the model parameter $w$.
Let $y_i = 0$ for $t_i = c_1$ and $y_i = 1$ for $t_i = c_2$.

The likelihood $L(w)$:

$L(w) = \prod^{N}_{i=1} p(C=t_i | X=x_i; w)$

$\implies l(w) = \sum_{i=1}^{N}\log p(C=t_i | X=x_i; w) = \sum_{i=1}^{N} \left[ y_i \log p(C=c_2 | X=x_i; w) + (1 - y_i) \log p(C=c_1 | X=x_i; w) \right]$

Now, instead of maximizing the log-likelihood, we can minimize $-l(w)$:

$E_{ce}(w) = -l(w) = - \sum_{i=1}^{N}\log p(C=t_i | X=x_i; w) = -\sum_{i=1}^{N} \left[ y_i \log p(C=c_2 | X=x_i; w) + (1 - y_i) \log p(C=c_1 | X=x_i; w) \right]$ (this $-l(w)$ is called the **cross entropy**)

\begin{aligned}
\implies \frac{\partial E_{ce}(w)}{\partial w} &= -\frac{\partial}{\partial w} \sum^{N}_{i=1} \left[ y_i \log\left(\frac{e^{-w^Tx_i}}{1 + e^{-w^Tx_i}}\right) + (1 - y_i) \log\left(\frac{1}{1 + e^{-w^Tx_i}}\right) \right]\\
&= -\frac{\partial}{\partial w} \sum^{N}_{i=1} \left[ -y_i w^Tx_i - y_i \log(1 + e^{-w^T x_i}) - (1 - y_i) \log(1 + e^{-w^T x_i}) \right]\\
&= -\sum^{N}_{i=1} \left[ -y_i x_i + y_i x_i \frac{e^{-w^T x_i}}{1 + e^{-w^Tx_i}} + (1 - y_i) x_i \frac{e^{-w^T x_i}}{1 + e^{-w^Tx_i}} \right]\\
&= \sum^{N}_{i=1} \left[ y_i x_i - x_i \frac{e^{-w^T x_i}}{1 + e^{-w^Tx_i}} \right]\\
&= \sum^{N}_{i=1} x_i \left( y_i - \frac{e^{-w^T x_i}}{1 + e^{-w^Tx_i}} \right)\\
&= \sum^{N}_{i=1} x_i \left( y_i - p(C=c_2 | X=x_i) \right)
\end{aligned}
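
One way to sanity-check the final expression is to compare it against a finite-difference approximation of $E_{ce}(w)$. The sketch below does this on random data; the function names, the data, and the step size `eps` are assumptions made for the example, and the labels follow the convention $y_i = 1$ for $c_2$ used in this derivation.

```python
import numpy as np

def p_c2(w, X):
    """p(C=c_2 | x) = e^{-w^T x} / (1 + e^{-w^T x}) for each row of X."""
    a = X @ w
    return np.exp(-a) / (1.0 + np.exp(-a))

def cross_entropy(w, X, y):
    """E_ce(w) with y_i = 1 for c_2 and y_i = 0 for c_1."""
    p2 = p_c2(w, X)
    return -np.sum(y * np.log(p2) + (1 - y) * np.log(1 - p2))

def cross_entropy_grad(w, X, y):
    """Analytic gradient: sum_i x_i (y_i - p(C=c_2 | x_i))."""
    return X.T @ (y - p_c2(w, X))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])  # leading 1 for the bias
y = np.array([0, 1, 0, 1, 1, 0], dtype=float)
w = rng.normal(size=3)

# Central finite differences along each coordinate direction
eps = 1e-6
numeric = np.array([
    (cross_entropy(w + eps * e, X, y) - cross_entropy(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
analytic = cross_entropy_grad(w, X, y)
print(np.max(np.abs(numeric - analytic)))
```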

Setting this gradient to zero gives an equation that is non-linear in $w$, so there is no closed-form solution for $w$. Instead, we update $w$ with an iterative method such as gradient descent.
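
A minimal sketch of such an iterative update is plain gradient descent on a toy one-dimensional problem. The data, the learning rate `lr`, the iteration count, and the choice to average the gradient over the samples are all assumptions made for illustration, not the notebook's own optimizer.

```python
import numpy as np

def p_c2(w, X):
    a = X @ w
    return np.exp(-a) / (1.0 + np.exp(-a))

def loss(w, X, y):
    """Mean cross entropy with y_i = 1 for c_2 and y_i = 0 for c_1."""
    p2 = p_c2(w, X)
    return -np.mean(y * np.log(p2) + (1 - y) * np.log(1 - p2))

rng = np.random.default_rng(0)
x0 = rng.normal(loc=-1.0, scale=1.0, size=50)   # samples of class c_1 (y = 0)
x1 = rng.normal(loc=+1.0, scale=1.0, size=50)   # samples of class c_2 (y = 1)
X = np.column_stack([np.ones(100), np.concatenate([x0, x1])])
y = np.concatenate([np.zeros(50), np.ones(50)])

w = np.zeros(2)
lr = 0.1                                        # illustrative step size
initial_loss = loss(w, X, y)
for _ in range(500):
    grad = X.T @ (y - p_c2(w, X)) / len(y)      # gradient derived above, averaged
    w = w - lr * grad                           # step against the gradient
final_loss = loss(w, X, y)
print(initial_loss, final_loss, w)
```

Because $p(C=c_2 | X=x_i) = 1 - \sigma(w^Tx_i)$ in this parameterization, the learned slope comes out negative when class $c_2$ sits at larger $x$.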

## Multi-class

Now, instead of two classes, we have $K$ classes into which we want to classify a sample $x_n$; that is, we want to know
$p(C=c_1 | X=x_n)$, $p(C=c_2 | X=x_n)$, ... , $p(C=c_K | X=x_n)$.

Define $p(C=c_i | X=x_n) = \frac{e^{w_i^Tx_n}}{\sum_{j=1}^{K} e^{w_j^T x_n}}$ (the softmax function). Since every class shares the same denominator, we can clearly see that
$\sum_{i=1}^{K}p(C=c_i | X=x_n) = 1$.
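
The softmax definition can be sketched in NumPy as below. The helper name `softmax`, the random weights, and the dimensions are assumptions made for the example. Subtracting the maximum score before exponentiating is a common numerical safeguard; it cancels between numerator and denominator, so the probabilities are unchanged.

```python
import numpy as np

def softmax(scores):
    """Softmax over the last axis; the max shift does not change the result."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))      # rows are w_1, w_2, w_3 (K = 3, d = 4, illustrative)
x_n = rng.normal(size=4)

probs = softmax(W @ x_n)         # entry i is p(C=c_i | X=x_n)
print(probs, probs.sum())

# Large scores would overflow a naive exp, but not the shifted version
big = softmax(np.array([1000.0, 1001.0, 1002.0]))
print(big)
```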

We first one-hot encode the target class: class $c_j$ is represented by $[t_1, t_2, \dots, t_K]^T$ with $t_i\in\{0, 1\}$ and only $t_j = 1$
(e.g. $c_3 = [0, 0, 1, 0, \dots, 0]^T$).
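
A minimal sketch of this encoding, assuming integer labels numbered from 0 (so class $c_3$ corresponds to label 2); the helper name `one_hot` is hypothetical:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Row n is the one-hot vector for integer label labels[n] in {0, ..., K-1}."""
    labels = np.asarray(labels)
    # Indexing the identity matrix by label picks out the matching unit vector
    return np.eye(num_classes, dtype=int)[labels]

T = one_hot([2, 0, 1, 2], num_classes=4)
print(T)
```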

Then we use MLE to find the parameters $w_1, w_2, w_3, \dots, w_K$ of this probability distribution, given $N$
sample pairs $\{x_n, g_n\}$, where $g_n$ is the class of $x_n$ and $t_{ni}$ denotes the $i$-th entry of its one-hot encoding.

Thus,

$L(w_1, ..., w_K) = \prod_{n=1}^{N} p(C=g_n | X=x_n; w_1, ..., w_K)$

\begin{aligned}
\implies -l(w_1, ..., w_K) & = -\sum_{n=1}^{N} \log p(C=g_n | X=x_n; w_1, ..., w_K)\\
&= -\sum_{n=1}^{N} \left[ t_{n1}\log p(C=c_1 | X=x_n;w_1, ..., w_K) + \dots + t_{nK}\log p(C=c_K | X=x_n;w_1, ..., w_K) \right]\\
&= -\sum^{N}_{n=1} \sum^{K}_{i=1} t_{ni} \log p(C=c_i | X=x_n; w_1, ..., w_K)
\end{aligned}

The final expression is called the multi-class cross entropy.

Taking the derivative with respect to each parameter vector $w_i$, we have:

$\frac{\partial -l(w_1, ..., w_K)}{\partial w_i} = \sum_{n=1}^{N} [p(C=c_i | X=x_n; w_1, ..., w_K) - t_{ni}]x_n$
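
In matrix form this gradient is $X^T(P - T)$, where row $n$ of $P$ holds the softmax probabilities for $x_n$ and $T$ holds the one-hot targets. The sketch below checks that form against finite differences; the names, shapes, and random data are assumptions made for the example, with $w_i$ stored as column $i$ of `W`.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def neg_log_likelihood(W, X, T):
    """W has shape (d, K) with column i equal to w_i; T is one-hot, shape (N, K)."""
    P = softmax_rows(X @ W)
    return -np.sum(T * np.log(P))

def grad(W, X, T):
    """Column i is sum_n [p(C=c_i | x_n) - t_ni] x_n."""
    return X.T @ (softmax_rows(X @ W) - T)

rng = np.random.default_rng(0)
N, d, K = 8, 3, 4
X = rng.normal(size=(N, d))
T = np.eye(K)[rng.integers(0, K, size=N)]
W = rng.normal(size=(d, K))

# Central finite differences, one weight entry at a time
eps = 1e-6
numeric = np.zeros_like(W)
for r in range(d):
    for c in range(K):
        step = np.zeros_like(W)
        step[r, c] = eps
        numeric[r, c] = (neg_log_likelihood(W + step, X, T)
                         - neg_log_likelihood(W - step, X, T)) / (2 * eps)
analytic = grad(W, X, T)
print(np.max(np.abs(numeric - analytic)))
```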

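Also, with L2 regularization, where the penalty $\frac{\lambda}{2}\sum_{i=1}^{K} \|w_i\|^2$ is added to the loss, the gradient becomes: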

$\frac{\partial -l(w_1, ..., w_K)}{\partial w_i} = \sum_{n=1}^{N} [p(C=c_i | X=x_n; w_1, ..., w_K) - t_{ni}]x_n + \lambda w_i$



In [19]:
import import_ipynb
from optimizer import SGD, Momentum
import numpy as np
from tqdm import tqdm

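class LogisticRegression:
    """Multi-class (softmax) logistic regression trained by gradient descent.

    c is the L2 penalty strength, tor is the stopping tolerance on the change
    in weights, and verbose is the logging interval in iterations (False
    disables logging).
    """
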
    def __init__(self, optimizer=SGD, c=1, max_iter=2000, lr=0.01, tor=0.0000001, verbose=False):

        self.lr = lr
        self.c = c
        self.max_iter = max_iter
        self.optimizer = optimizer
        self.tor = tor
        self.verbose = verbose

        self.weights = None
        self.k = None
        self.d = None

    def fit(self, x_train, y_train):

        n_y, self.k = y_train.shape
        x_train = np.column_stack([np.ones(n_y), x_train])
        n_x, self.d = x_train.shape
        i = 0

        # `not ndarray` raises ValueError once weights exist, so test for None
        if self.weights is None:
            self.weights = np.random.randn(self.d, self.k)

        opt = self.optimizer(lr=self.lr, model_vars=[self.weights])

        prev_matrix = 0
        dif = np.linalg.norm(self.weights - prev_matrix)

        if self.k > 1:

            while (i <= self.max_iter) and (dif >= self.tor):

                # Copy so an in-place optimizer update cannot alias prev_matrix
                prev_matrix = self.weights.copy()

                if self.verbose and (i % self.verbose == 0):
                    print(f'iteration: {i}, loss: {self._cal_train_loss(x_train, y_train)}')

                self.weights = opt([self._mc_ce_grad(x_train, y_train)])[0]
                dif = np.linalg.norm(self.weights - prev_matrix)
                i += 1

    def predict(self, x_test):

        x_test = np.column_stack([np.ones(x_test.shape[0]), x_test])
        pred = [np.argmax(self._cal_softmax(x_i)) for x_i in x_test]
            
        return pred


    def _mc_ce_grad(self, x, labels):

        n = x.shape[0]
        output_g = np.zeros((self.d, self.k))

        for i, x_i in enumerate(x):

            # Per-sample gradient: outer product of x_i with (softmax - one-hot target)
            soft_max = self._cal_softmax(x_i)
            output_g += np.outer(x_i, soft_max - labels[i])

        # Add the L2 penalty gradient (strength c) and average over the n samples
        return (output_g + self.c * self.weights) / n

    def _cal_softmax(self, x):

        # Scores w_k^T x for all k classes; shifting by the max avoids overflow
        # in exp and leaves the softmax value unchanged
        scores = self.weights.T @ x
        exp_scores = np.exp(scores - np.max(scores))

        return exp_scores / np.sum(exp_scores)

    @staticmethod
    def cross_entropy(y_pred, y_true):

        loss = 0

        for i, v in enumerate(y_true):

            y_pred_i = np.log(y_pred[i])
            loss += np.dot(y_pred_i, v.T)

        return -loss / len(y_true)

    def _cal_train_loss(self, x_train, y_train):
        
        # Stack the predicted class probabilities for every training sample
        y_pred = np.vstack([self._cal_softmax(x_i) for x_i in x_train])

            
        return self.cross_entropy(y_pred, y_train)
        

In [20]:
import numpy as np
from sklearn.linear_model import LogisticRegression as LR
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder

x, y = load_iris(return_X_y=True)

In [21]:
y_train = OneHotEncoder().fit_transform(y.reshape(-1,1)).toarray()

In [22]:
lr = LogisticRegression(max_iter=1000, optimizer=Momentum, verbose=100)
lr_sgd = LogisticRegression(max_iter=1000, optimizer=SGD, verbose=100)

In [23]:
lr.fit(x, y_train)

iteration: 0, loss: 5.098582316688093
iteration: 100, loss: 0.3517027932892621
iteration: 200, loss: 0.28276999963179617
iteration: 300, loss: 0.2456013513841422
iteration: 400, loss: 0.22242954025713682
iteration: 500, loss: 0.20679155779461852
iteration: 600, loss: 0.1956724764944553
iteration: 700, loss: 0.18747024379886304
iteration: 800, loss: 0.181253620758808
iteration: 900, loss: 0.17644388709793463
iteration: 1000, loss: 0.17266229070230382


In [24]:
lr_sgd.fit(x, y_train)

iteration: 0, loss: 11.220214124414223
iteration: 100, loss: 1.2584502937847382
iteration: 200, loss: 0.7440648809633511
iteration: 300, loss: 0.5916334654839972
iteration: 400, loss: 0.5207964125813188
iteration: 500, loss: 0.47726521907148084
iteration: 600, loss: 0.4461981568121418
iteration: 700, loss: 0.422029801982726
iteration: 800, loss: 0.402202417434294
iteration: 900, loss: 0.38536475871017123
iteration: 1000, loss: 0.3707261901775692


In [25]:
pred = lr.predict(x)
pred_sgd = lr_sgd.predict(x)

In [26]:
sklearn_lr = LR()
sklearn_lr.fit(x, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

In [27]:
pred_sklearn = sklearn_lr.predict(x)


In [28]:
accuracy_score(y, pred)

0.98

In [29]:
accuracy_score(y, pred_sgd)

0.9733333333333334

In [30]:
accuracy_score(y, pred_sklearn)

0.9733333333333334