[![Saint Martin's Beach, Bangladesh](https://i.postimg.cc/BZCw6y5z/Group-1log-reg.png)](https://postimg.cc/Z0CLswMP)
Saint Martin's Island, Bangladesh. Image by Riyadh Razzaq

**NOTE**: Most of what I'm writing below is from *The Elements of Statistical Learning* by Hastie, Tibshirani, Friedman.

Logistic regression is called regression only because it shares part of its structure with linear regression. It models the posterior probabilities of the $K$ classes via a linear function of $x$, the same linear function used in linear regression. The model equation is,

$$
\log \frac{\Pr(G=K-1 \mid X=x)}{\Pr(G=K \mid X=x)} = \beta_{0} + \beta^{T}X
$$

This model is specified in terms of $K-1$ log-odds, one for each class against the reference class $K$, which reflects the constraint that the $K$ class probabilities sum to 1. Logistic regression is most commonly used for two-class problems but can be extended to multi-class problems too. Here I've only shown the two-class problem, $K=2$. After a few calculations on the above equation,

$$
\Pr(Y=K_1 \mid X=x) = \frac{\exp \left(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n \right)}{1+\exp \left(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n \right)}  \label{eq1}\tag{1}
$$

and,

$$
\Pr(Y=K_2 \mid X=x) = 1 - \Pr(Y=K_1 \mid X=x)
$$
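
For reference, the "few calculations" can be sketched as follows for $K=2$. The shorthand $p = \Pr(Y=K_1 \mid X=x)$ and $z = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$ is mine, introduced only for this derivation:

$$
\log \frac{p}{1-p} = z
\;\Longrightarrow\; \frac{p}{1-p} = e^{z}
\;\Longrightarrow\; p = e^{z}(1-p)
\;\Longrightarrow\; p = \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}}
$$

The last form, $1/(1+e^{-z})$, is algebraically the same function and is often the more convenient one to compute.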

The right-hand side of $( \ref{eq1} )$ is called the sigmoid function, $\sigma$. The function $f(X,\beta) =\beta_0 + \beta_1 X_1 + ... + \beta_n X_n$ produces unbounded output, i.e., its value can be any real number. The $\sigma$ function maps that output into $(0,1)$ so it can be used as a probability. The log-likelihood can be written as,

$$
\ell(\beta) = \sum_{i=1}^{N} \left[ y_i \log p(x_i; \beta) + (1-y_i) \log (1-p(x_i;\beta)) \right] \label{eq2}\tag{2}
$$

Here, $p(x_i;\beta)$ is the probability $\Pr(Y=K_1 \mid X=x_i)$ of a single sample given the parameters $\beta$, and $y_i$ is 1 when sample $i$ belongs to class $K_1$ and 0 otherwise. The negative of $\ell(\beta)$ serves as the cost function. Maximizing $\ell(\beta)$ has no closed-form solution, but we can use an iterative optimizer such as Newton-Raphson or gradient descent. I'll be using gradient descent, first on random mini-batches and then on the full batch. To optimize for $\beta$, we need the gradient of $\ell(\beta)$.

$$
\frac{\partial \ell(\beta)}{\partial \beta} = \sum_{i=1}^{N} (y_i - p(x_i; \beta))x_i \label{eq3}\tag{3}
$$
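
As a sanity check on $(3)$, the analytic gradient can be compared against a finite-difference approximation of $(2)$. This is only a sketch: the data is random and made up, and the helper names (`sigmoid`, `log_likelihood`, `gradient`) are mine, not from the book.

```python
import numpy as np

def sigmoid(z):
    """Logistic function from equation (1); maps real scores into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    """Equation (2). X carries a leading column of ones, y holds 0/1 labels."""
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(beta, X, y):
    """Equation (3): sum over samples of (y_i - p_i) * x_i."""
    return X.T @ (y - sigmoid(X @ beta))

# Made-up data: 20 samples, intercept column plus 2 random features
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = rng.integers(0, 2, size=20)
beta = rng.normal(size=3)

# Central finite differences, one coordinate of beta at a time
eps = 1e-6
numeric = np.zeros_like(beta)
for j in range(beta.size):
    step = np.zeros_like(beta)
    step[j] = eps
    numeric[j] = (log_likelihood(beta + step, X, y)
                  - log_likelihood(beta - step, X, y)) / (2 * eps)

analytic = gradient(beta, X, y)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

If the two gradients disagree by more than rounding error, the derivative or the likelihood has a sign or indexing mistake somewhere.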

Finally, since gradient descent minimizes the cost $-\ell(\beta)$, the update rule is,

$$
\beta^{new} = \beta^{old} - learning\_rate * (- \frac{\partial \ell(\beta)}{\partial \beta} )
$$
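
To see the update rule at work, here is a minimal sketch on a tiny made-up dataset. The data, the learning rate of 0.5, and the choice to average the gradient over the samples are all assumptions for illustration. Each step should push the log-likelihood up, which is the same as pushing the cost down.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data (made up): a column of ones for the intercept plus one feature
x = np.array([0.5, -1.0, 2.0, -0.3, 0.2, -0.1])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1, 0, 1, 0, 0, 1])

learning_rate = 0.5          # assumed value, small enough for this toy problem
beta = np.zeros(2)
history = [log_likelihood(beta, X, y)]

for _ in range(50):
    # Gradient of the log-likelihood, averaged over the samples
    grad = X.T @ (y - sigmoid(X @ beta)) / len(y)
    # beta_new = beta_old - learning_rate * (-gradient)
    beta = beta - learning_rate * (-grad)
    history.append(log_likelihood(beta, X, y))

print("log-likelihood before:", history[0])
print("log-likelihood after: ", history[-1])
```

With a learning rate that is too large the log-likelihood can oscillate instead of increasing, so the step size usually needs some tuning per dataset.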

These equations describe a single iteration. The implementation is simpler and faster in matrix notation. The variables we need for gradient descent are,

|Variable|Dimension|Note|
|-|-|-|
|$X$|n_samples, n_features + 1|Input matrix. An extra column of 1s is inserted at position zero; it multiplies the intercept|
|$y$|n_samples|Output vector|
|$\beta$|n_features + 1|Parameters. $\beta_0$ is the intercept|
|$learning\_rate$|float|Step size in each iteration|
|$cost$|float|Cost, i.e., the negative log-likelihood averaged over the samples|
|$error$|n_samples|Residual, $y - \hat{y}$|
|$gradient$|n_features + 1|Gradient of the averaged log-likelihood (the negative of the cost gradient)|

and finally, in each iteration,

$$
\begin{aligned}
&\textbf{Forward Propagation} \\
&\hat{y} = \sigma(X \cdot \beta) \\
&cost = \left(\frac{-1}{n\_samples}\right) \left(y \cdot \log(\hat{y}) + (1-y) \cdot \log(1-\hat{y})\right) \\
&\textbf{Backward Propagation} \\
&error = y - \hat{y} \\
&gradient = \left(\frac{1}{n\_samples}\right) \left( X^T \cdot error \right) \\
&\beta = \beta - learning\_rate * \left( - gradient \right)
\end{aligned}
$$
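
As a quick check that the matrix form agrees with the per-sample sum in $(3)$, the sketch below computes the averaged gradient both ways, using the residual $y - \hat{y}$ as the code cells do. The data is random and made up, and the sizes (8 samples, 3 features) are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
n_samples, n_features = 8, 3

# Made-up data; a column of ones is inserted at position zero for the intercept
X = np.insert(rng.normal(size=(n_samples, n_features)), 0, 1, axis=1)
y = rng.integers(0, 2, size=n_samples)
beta = rng.normal(size=n_features + 1)

# Matrix form: one matrix-vector product for predictions, one for the gradient
y_hat = sigmoid(X @ beta)
gradient_matrix = X.T @ (y - y_hat) / n_samples

# Per-sample form: equation (3) accumulated in a loop, then averaged
gradient_loop = np.zeros(n_features + 1)
for i in range(n_samples):
    p_i = sigmoid(X[i] @ beta)
    gradient_loop += (y[i] - p_i) * X[i]
gradient_loop /= n_samples

print(X.shape, y.shape, beta.shape, gradient_matrix.shape)
print(np.allclose(gradient_matrix, gradient_loop))
```

The shapes printed on the first line should match the dimensions listed in the table.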


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values
# Min-max scale each feature (column) separately into [0, 1]
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_train, X_test, y_train, y_test = train_test_split(X,y)

In [None]:
def pipeline(X,y,X_test, y_test, alpha, max_iter, bs):
    """
    Sklearn Sanity Check
    """
    print("-"*20,'Sklearn',"-"*20)
    # penalty=None disables regularization (scikit-learn >= 1.2; older versions used penalty='none')
    sk = LogisticRegression(penalty=None,max_iter=max_iter)
    sk.fit(X,y)
    sk_y = sk.predict(X_test)
    print('Accuracy ',accuracy_score(y_test, sk_y))
    
    print("-"*20,'Custom',"-"*20)
    me = LogisticReg(learning_rate=alpha,max_iteration=max_iter,batch_size=bs)
    me.fit(X,y)
    yhat = me.predict(X_test)
    me_score = accuracy_score(y_test, yhat)
    print('Accuracy ',me_score)
    me.plot()

In [None]:
class LogisticReg:
    
    def __init__(self, learning_rate=0.01,max_iteration=100, batch_size=10):
        self.learning_rate = learning_rate
        self.max_iter = max_iteration
        self.batch_size = batch_size
        
    def sigmoid(self,x):
        # Vectorized logistic function; works for scalars and arrays alike.
        # This form avoids the inf/inf = nan that exp(x) / (1 + exp(x)) gives for large x.
        return 1 / (1 + np.exp(-x))
    
    def sgd(self, X, y):
        n_samples, n_features = X.shape
        self.betas = np.zeros(n_features) # column vector, parameters
        costs = []
        for it in range(self.max_iter):
            # Draw a random mini-batch of row indices (sampled with replacement)
            indices = np.random.randint(0,X.shape[0],size=self.batch_size)

#             ----------------------
#                 Forward Pass
#             ----------------------
            prediction = self.sigmoid(np.dot(X[indices, :],self.betas))
            
            error = y[indices] - prediction
#             ----------------------
#                 Backward Pass
#             ----------------------
            cost = (-1 / indices.shape[0]) * (y[indices] @ np.log(prediction) + (1 - y[indices]) @ np.log(1-prediction) )
            gradient = (1 / indices.shape[0]) * (X[indices, :].T @ error)
        
            self.betas = self.betas - (self.learning_rate * -gradient)
            costs.append(cost)
            
            # Report progress roughly ten times over the run
            if it % max(1, self.max_iter // 10) == 0:
                accuracy = accuracy_score(y[indices],np.round(prediction))
                print(f"iteration: {it}, Cost: {cost}, Accuracy: {accuracy}")
            
        self.history = costs
            
        
    def plot(self):
        fig, ax = plt.subplots(1,1,figsize=(20,10),facecolor='white')
        ax.plot(range(self.max_iter),self.history)
        plt.show()
        
    def fit(self, X, y):
        """
        Fit logistic model using Stochastic Gradient Descent
        """
        print(X.shape)
        X = np.insert(X,0,1,axis=1) # add 1s for matrix multiplication
        
        self.sgd(X,y)
    
    def predict(self, X):
        X = np.insert(X,0,1,axis=1)
        yhat = np.dot(X,self.betas)
        yhat = self.sigmoid(yhat)
        return np.round(yhat)
    
    def score(self, X,y):
        yhat = self.predict(X)
        return accuracy_score(y,yhat)

In [None]:
pipeline(X_train, y_train, X_test, y_test, alpha=2,max_iter=1000, bs=250)

# Also, full-batch Gradient Descent

In [None]:
class LRGD:
    
    def __init__(self, learning_rate=0.01,max_iteration=100, batch_size=10):
        self.learning_rate = learning_rate
        self.max_iter = max_iteration
        self.batch_size = batch_size
        
    def sigmoid(self,x):
        # Vectorized logistic function; works for scalars and arrays alike.
        # This form avoids the inf/inf = nan that exp(x) / (1 + exp(x)) gives for large x.
        return 1 / (1 + np.exp(-x))
    
    def gd(self, X, y):
        n_samples, n_features = X.shape
        self.betas = np.zeros(n_features) # column vector, parameters
        costs = []
        for it in range(self.max_iter):
            
            # Full batch: every sample is used in each iteration
            indices = np.arange(n_samples)

#             ----------------------
#                 Forward Pass
#             ----------------------
            prediction = self.sigmoid(np.dot(X[indices, :],self.betas))
            
            error = y[indices] - prediction
#             ----------------------
#                 Backward Pass
#             ----------------------
            cost = (-1 / indices.shape[0]) * (y[indices] @ np.log(prediction) + (1 - y[indices]) @ np.log(1-prediction) )
            gradient = (1 / indices.shape[0]) * (X[indices, :].T @ error)
        
            self.betas = self.betas - (self.learning_rate * -gradient)
            costs.append(cost)
            
            # Report progress roughly ten times over the run
            if it % max(1, self.max_iter // 10) == 0:
                accuracy = accuracy_score(y[indices],np.round(prediction))
                print(f"iteration: {it}, Cost: {cost}, Accuracy: {accuracy}")
            
        self.history = costs
            
        
    def plot(self):
        fig, ax = plt.subplots(1,1,figsize=(20,10),facecolor='white')
        ax.plot(range(self.max_iter),self.history)
        plt.show()
        
    def fit(self, X, y):
        """
        Fit logistic model using full-batch Gradient Descent
        """
        X = np.insert(X,0,1,axis=1) # add 1s for matrix multiplication
        self.gd(X,y)
    
    def predict(self, X):
        X = np.insert(X,0,1,axis=1)
        yhat = np.dot(X,self.betas)
        yhat = self.sigmoid(yhat)
        return np.round(yhat)
    
    def score(self, X,y):
        yhat = self.predict(X)
        return accuracy_score(y,yhat)

In [None]:
wgd = LRGD(learning_rate=2, max_iteration=5000, batch_size=50)
wgd.fit(X_train,y_train)
print(wgd.score(X_test, y_test))
wgd.plot()

In [None]:
sk = LogisticRegression(max_iter=1000)
sk.fit(X_train, y_train)
sk.score(X_test,y_test)