Logistic regression is a common model for classification problems. It is applicabale to both binary and multiclass classification, however for starters it's necessary to understand the logic behind the use of logistic regression in binary classification problem. 
<br/><br/>
The basis of logistic regression is **sigmoid** function: $\sigma(x)=\frac{1}{1+e^{-x}}$ It's highly applicable in classification problemnts, since it is close to either 0 or 1 across most of its domain:

In [1]:
import numpy as np
import plotly.express as px

sigmoid = lambda x: 1 / (1+np.exp(-x))
X = np.arange(-20,20)
px.line(x=X, y=sigmoid(X), title=r'$\sigma(x)$')

Another important term is **logit**, which is a logarithm of odds. **Odds** is a ratio between probabilities $\frac{p}{1-p}$. Since both odds and logits are monotonically increasing functions, we can apply log-transformation to odds.
<br/><br/>
It is important to explain the intuition behind odds. **Odds** is a quantitive measure of a success. Suppose $p=0,8$ and $q=1-p=0,2$. Therefore the odds of a success will be $\frac{0,8}{0,2}=4$. This can be best interpreted as "we encounter an unfavorable outcome once in every 4+1 events". Therefore, the higher is the $p$, the higher are the odds.
<br/><br/>
As to **logits**, the function approaches $-\infty$ for the lowest odds and approaches $+\infty$ for the highest odds. Logits, that are based on odds, allow us to shift from restricted domain of $[0, 1]$ set by probabilities distribution to unrestricted domain $[-\infty; +\infty]$

In [29]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

with np.errstate(divide='ignore'): # locally ignore zero-division warnings
    odds = lambda p: p / (1-p)
    P = np.linspace(0,1,50)

    fig = make_subplots(rows=1, cols=2, subplot_titles=["Odds(p)",'Log[Odds(p)]'])
    fig.add_trace(
        go.Scatter(x=P, y=odds(P), mode="lines"),
        row=1, col=1
    )
    fig.add_trace(
        go.Scatter(x=odds(P), y=np.log(odds(P)), mode="lines"),
        row=1, col=2, 
    )
    fig.update_layout(height=500, width=800)
    fig.show()

Having explained logits, the problem of logistic regression is centered around finding $\beta$ that best decribe $log[odds(p)]=\beta\times X$. Now, to simplify it even further, we can reduce the optimization problem to solving for $p$ (which is also $P(Y=1)$ since we are looking at the binary classification) instead of logit:
$$
\begin{align*}
    \beta X = log(\frac{p}{1-p}) \\
    e^{-\beta X} = \frac{1-p}{p} \\
    e^{-\beta X} = \frac{1}{p} - 1 \\
    e^{-\beta X} + 1 = \frac{1}{p} \\
    p = \frac{1}{e^{-\beta X} + 1}
\end{align*}
$$
From this derivation we can clearly see the **relation** between **logit function** and **sigmoid**. It's also important to note that logit function, being able to monotonically map $[0, 1]$ onto $(-\infty, +\infty)$, is actually doing an inverse operation of sigmoid, which, on the contrary, maps $(-\infty, +\infty)$ onto $[0, 1]$.

Estimation of the best $\hat{\beta}$ is done via Maximum Likelihood Estimation (**MLE**). The logic behind MLE is as follows:
* We take samples labeled as 1 and 0 respectively; 
* For samples with label "1" we estimate such $\hat{\beta}$ that the entropy amongst samples is the lowest ($P(Y=1)\rightarrow 1$): $\prod_{\{y_i=1\}}p(x_i)\rightarrow max$. We do a product because we presume that $x_i$ are independent variables;
* Same is done for samples with label "0": $\prod_{\{y_i=0\}}[1-p(x_i)]\rightarrow max$;
* Combining two problems into one we get a definition of $\text{Likelihood}(\beta)=\prod_{\{y_i=1\}}p(x_i)\times \prod_{\{y_i=0\}}[1-p(x_i)]\rightarrow max$

To get rid of the product we then further simplify the formula:
$$
\begin{aligned}
    L(\beta) = \prod_{\{y_i=1\}}p(x_i)\times \prod_{\{y_i=0\}}[1-p(x_i)]= \\
    =\prod_{X}p(x_i)^y_i\times[1-p(x_i)]^(1-y_i)= \\
    = \{\text{we proceed to take ln of both parts of the equation for the Likelihood}\} = \\
    = \sum_X y_ilog[p(x_i)]+(1-y_i)log[1-p(x_i)] = \\
    = \{\text{we then replace p with previously derived sigmoid}\}= \\
    = \sum_X y_ilog(\frac{1}{1+e^{-\beta X}}) + (1-y_i)[log(\frac{e^{-\beta X}}{1+e^{-\beta X}})]= \\
    = \sum_X y_i[log(\frac{1}{1+e^{-\beta X}})-log(\frac{e^{-\beta X}}{1+e^{-\beta X}})] + log(\frac{e^{-\beta X}}{1+e^{-\beta X}})= \\
    = \sum_X y_i[log(\frac{1}{e^{-\beta X}})] + log(\frac{e^{-\beta X}}{1+e^{-\beta X}}\times\frac{e^{\beta X}}{e^{\beta X}})= \\
    = \sum_X y_i[log(\frac{1}{e^{-\beta X}})] + log(\frac{1}{1+e^{\beta X}})= \\
    = \sum_X y_i\times\beta X - log(1+e^{\beta X})
\end{aligned}
$$
The last formula is called Log-likelihood Function (**LLF**). We can reduce the problem to maximizing LLF instead of original $\text{Likelihood}(\beta)$, since it is yet again a monotonic transformation. In the end, the optimization problem can be formulated as follows: $\hat{\beta}=argmax_\beta[Log(L(\beta))]=argmax_\beta[\sum_X y_i\times\beta X - log(1+e^{\beta X})]$.

This concludes the prerequisite math behind the logistic regression for **binary classification**. Now we need to set up a training loop in order to calculate best $\beta$ by adjusting them by the gradient on every $t-\text{ith}$ epoch: $\beta_{t+1}=\beta_t-\eta \nabla L(f(x; \beta), y)$. 
<br/><br/>
Now we need to calculate gradients with respect to $\beta_i (i\in N)$ and intercept $\beta_0$ (where $y$ is a real probability and $\hat{y}$ is a sigmoid (predicted) probability):
$$
\begin{aligned}
& \frac{\partial L_{\mathrm{CE}}(\hat{y}, y)}{\partial \mathbf{\beta}}=\frac{1}{m}(\hat{\mathbf{y}}-\mathbf{y}) \mathbf{x}_i^T \\
& \frac{\partial L_{\mathrm{CE}}(\hat{y}, y)}{\partial \beta_0}=\frac{1}{m}(\hat{\mathbf{y}}-\mathbf{y})
\end{aligned}
$$

In [52]:
from sklearn.metrics import accuracy_score

class LogisticRegression:

    def __init__(self):
        self.losses = []
        self.accuracies = []
        self.Beta = []
        self.llfs = []
        self.intercept = []
        self.m = 0

    def sigmoid(self, yHat):
        # find probability based by passing
        # logit into a sigmoid function
        return 1 / (1 + np.exp(-yHat))
    
    def llf(self, y, yHat, logHat):
        # log likelihood function
        return (y*logHat-np.log(1+yHat)).mean()

    def bce(self, y, yHat):
        # binary cross entropy
        # special case of log likelihood where we
        # have a bernoulli distribution of only 2
        # unique labels
        return (-y * np.log(yHat) - (1-y) * np.log(1-yHat)).mean()

    def predict(self, X, predict_new=False):
        if predict_new:
            newIntercept = np.ones((X.shape[0],1))
            X = np.concatenate((newIntercept,X),axis=1)
        p = self.sigmoid(np.dot(X,self.Beta))
        logit = np.log(p / (1-p))
        return p, logit
    
    def fit(self, X, y, epochs=1000, lr=.001, verbose_level=50) -> None:
        """
            X: np.ndarray
                - feature matrix
            y: np.ndarray
                - target labels
            epochs: int
                - N of training epochs
            lr: float
                - learning rate
            verbose_level: int
                - output loss and accuracy every verbose_level
        """
        self.intercept = np.ones((X.shape[0], 1)) 
        X = np.concatenate((self.intercept, X), axis=1)
        self.m = X.shape[0]
        self.Beta = np.zeros(X.shape[1]).reshape(-1,1)

        for epoch in range(epochs+1):
            yHat, logitHat = self.predict(X)

            # gradient with respect to Beta
            self.Beta -= lr * 1/self.m * np.dot(X.T, (yHat - y))

            # gradient with respect to intercept
            self.intercept -= lr * 1/self.m * (yHat - y)

            self.llfs.append(self.llf(y, yHat, logitHat))
            self.losses.append(self.bce(y, yHat))

            self.accuracies.append(accuracy_score(y,yHat.round()))
            if epoch % verbose_level == 0:
                print(f'LLF {self.llfs[-1]} :: BCE {self.losses[-1]} :: Accuracy {self.accuracies[-1]}')

Next step is training logistic regression on some binary classification problem. For the sake of demonstration, $75\%$ sample of the popular "breast cancer" dataset has been chosen. 

In [29]:
from sklearn import datasets
bc = datasets.load_breast_cancer()

In [30]:
from sklearn.model_selection import train_test_split

X, y = bc.data, bc.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=42)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((426, 30), (143, 30), (426, 1), (143, 1))

In [69]:
logReg = LogisticRegression()
with np.errstate(divide='ignore', over='ignore', invalid='ignore'):
    logReg.fit(X_train,y_train,lr=.001,verbose_level=100)

LLF -0.40546510810816455 :: BCE 0.6931471805599453 :: Accuracy 0.37089201877934275
LLF nan :: BCE nan :: Accuracy 0.37089201877934275
LLF nan :: BCE nan :: Accuracy 0.7347417840375586
LLF nan :: BCE nan :: Accuracy 0.8333333333333334
LLF nan :: BCE nan :: Accuracy 0.8169014084507042
LLF nan :: BCE nan :: Accuracy 0.8802816901408451
LLF nan :: BCE nan :: Accuracy 0.8708920187793427
LLF nan :: BCE nan :: Accuracy 0.4413145539906103
LLF nan :: BCE nan :: Accuracy 0.8802816901408451
LLF nan :: BCE nan :: Accuracy 0.8896713615023474
LLF nan :: BCE nan :: Accuracy 0.8896713615023474


Now we apply previously trained logistic regression model on test sample that is $25\%$ of the dataset

In [67]:
from sklearn.metrics import classification_report

with np.errstate(divide='ignore',over='ignore'):
    p, logits = logReg.predict(X_test, predict_new=True)
print(
    classification_report(y_test.reshape(-1,1), p.round().reshape(-1,1))
)

              precision    recall  f1-score   support

           0       0.93      0.93      0.93        54
           1       0.96      0.96      0.96        89

    accuracy                           0.94       143
   macro avg       0.94      0.94      0.94       143
weighted avg       0.94      0.94      0.94       143



We can now compare metrics from our model to scikit-learn implementation. As we can see current implementation has pretty close to `scikit-learn`'s model.

In [70]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(X_train, y_train)
yPred = clf.predict(X_test)
print(
    classification_report(y_test.reshape(-1,1), yPred.reshape(-1,1))
)

              precision    recall  f1-score   support

           0       0.96      0.94      0.95        54
           1       0.97      0.98      0.97        89

    accuracy                           0.97       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.97      0.97      0.96       143




A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

