<a href="https://colab.research.google.com/github/sytrinh/machine-learning-from-scratch/blob/main/machine_learning_algorithms/Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

Despite its name, logistic regression is a classification method rather than a regression method. Specifically, it addresses binary classification, in which the outcome takes only two values (usually 0 and 1).

In linear regression, we model the outcome directly as a linear function of the features:

$$\hat{y} = \mathbf{w}^T \mathbf{x},$$
where $\mathbf{x}$ and $\mathbf{w}$ are column vectors. This is not appropriate for a binary classification problem for two reasons:
- First, linear regression predicts continuous values, whereas classification problems require discrete labels.
- Second, when $\mathbf{w}^T \mathbf{x}$ is large, the predicted value $\hat{y}$ can be much larger than $1$; and when $\mathbf{w}^T \mathbf{x}$ is small, the predicted value $\hat{y}$ can be negative. Neither case makes sense for a binary classification problem.
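
The second point can be illustrated with a small sketch. The weights `w` and the rows of `X` below are made-up values chosen only for illustration; they do not come from any dataset.

```python
import numpy as np

# Hypothetical weights and inputs, chosen only for illustration.
w = np.array([0.5, -0.25])
X = np.array([
    [10.0, 2.0],   # gives a large positive score
    [1.0, 1.0],    # gives a score between 0 and 1
    [-4.0, 8.0],   # gives a negative score
])

# A linear model outputs the raw score w^T x for each row of X.
y_hat = X @ w
print(y_hat)

# Flag the predictions that cannot be interpreted as probabilities.
outside = (y_hat < 0) | (y_hat > 1)
print(outside)
```

Only the middle row produces a value that could be read as a probability, which is why the raw linear output is passed through a squashing function in the next section.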

## Logistic regression model

### One data point

In logistic regression, instead of predicting the outcome directly, we model the probability that the outcome equals 1 or 0:

$$P(y_i=1 | \mathbf{x}_i; \mathbf{w}) = f(\mathbf{w}^T \mathbf{x}_i)$$

$$P(y_i=0 | \mathbf{x}_i; \mathbf{w}) = 1 - f(\mathbf{w}^T \mathbf{x}_i),$$

where $P(y_i=1 | \mathbf{x}_i; \mathbf{w})$ is the probability that $y_i=1$ given the model parameters $\mathbf{w}$ and the data $\mathbf{x}_i$, and $P(y_i=0 | \mathbf{x}_i; \mathbf{w})$ is the corresponding probability that $y_i=0$. Note that $\mathbf{x}_i$ and $y_i$ are the random variables of a single data point: $\mathbf{x}_i$ contains the values of all features of that data point, and $y_i$ is its outcome.

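These two probabilities can be combined into a single formula:

$$P(y_i | \mathbf{x}_i; \mathbf{w}) = z_i^{y_i} (1-z_i)^{1-y_i},$$
where $z_i = f(\mathbf{w}^T \mathbf{x}_i)$. When $y_i = 1$ the second factor equals 1 and the expression reduces to $z_i$; when $y_i = 0$ the first factor equals 1 and it reduces to $1 - z_i$.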
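
As a quick check of the combined formula, the sketch below evaluates it for both labels. The helper name `bernoulli_prob` and the value `z = 0.8` are assumptions made for illustration.

```python
def bernoulli_prob(y, z):
    """Return z**y * (1 - z)**(1 - y) for a label y in {0, 1}."""
    return z**y * (1 - z)**(1 - y)

z = 0.8  # assumed model output, i.e. P(y=1 | x; w)

p_one = bernoulli_prob(1, z)   # reduces to z
p_zero = bernoulli_prob(0, z)  # reduces to 1 - z
print(p_one, p_zero)
```

The two results sum to 1 (up to floating-point rounding), as expected for a distribution over two outcomes.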

### Whole training set

- $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, ... , \mathbf{x}_M]^T$: an $M \times N$ matrix containing all feature values, with one data point per row.
- $\mathbf{y} = [y_1, y_2, ... , y_M]^T$: an $M$-dimensional vector containing all outcome values.

The probability of $\mathbf{y}$ given $\mathbf{X}$ is:

$$P (\mathbf{y} | \mathbf{X}; \mathbf{w})$$

For the observed outcomes $\mathbf{y}$ to be likely under the model, this probability must be large. Therefore, we need to find the $\mathbf{w}$ that maximizes this probability.

$$\mathbf{w} = \arg\max_{\mathbf{w}} P(\mathbf{y}|\mathbf{X}; \mathbf{w})$$

This problem, which finds the model parameters under which the observed data are most probable, is called **maximum likelihood estimation** (MLE).

### Assumption

To solve this problem, we assume that the data points are independent of each other. This means that $y_i$ is independent of $y_j$ for $i \neq j$, and that $y_i$ is also independent of $\mathbf{x}_j$ for $i \neq j$. Then,

$$P (\mathbf{y} | \mathbf{X}; \mathbf{w}) = \prod_{i=1}^M P(y_i| \mathbf{x}_i; \mathbf{w}) = \prod_{i=1}^M z_i^{y_i}(1 - z_i)^{1- y_i}$$

Because the logarithm is monotonically increasing, maximizing this probability is equivalent to minimizing the following function:

$$J(\mathbf{w}) = - \frac{1}{M} \log P(\mathbf{y}|\mathbf{X}; \mathbf{w}) = -\frac{1}{M} \sum_{i=1}^M(y_i \log {z}_i + (1-y_i) \log (1 - {z}_i)),$$

where $J(\mathbf{w})$ is the cost function, and the expression on the right-hand side is called **cross entropy**, which is often used to measure the distance between two distributions. We can vectorize it as follows:

$$J(\mathbf{w}) = - \frac{1}{M} \log P(\mathbf{y}|\mathbf{X}; \mathbf{w}) = - \frac{1}{M} \Big(\mathbf{y}^T \log \mathbf{z} + (1-\mathbf{y})^T \log (1-\mathbf{z}) \Big)$$

where $\mathbf{z} = f(\mathbf{X} \mathbf{w})$, with $f$ and $\log$ applied elementwise.
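
The relationship between the likelihood, the summed cost, and the vectorized cost can be checked numerically. This is a minimal sketch: the probabilities `z` and labels `y` are randomly generated stand-ins, not outputs of a fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 6
z = rng.uniform(0.05, 0.95, size=M)  # assumed predicted probabilities
y = rng.integers(0, 2, size=M)       # assumed binary labels

# Likelihood as a product over independent data points.
likelihood = np.prod(z**y * (1 - z)**(1 - y))

# Cost written as an average over data points.
J_sum = -np.mean(y * np.log(z) + (1 - y) * np.log(1 - z))

# Vectorized cost using inner products.
J_vec = -(y @ np.log(z) + (1 - y) @ np.log(1 - z)) / M

# Cost recovered from the likelihood itself.
J_from_likelihood = -np.log(likelihood) / M

print(J_sum, J_vec, J_from_likelihood)
```

All three values agree up to rounding. In practice the log form is preferred because the product of many probabilities quickly underflows to zero for large $M$.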

### Sigmoid function

Logistic regression also assumes that the function $f$ is the **sigmoid** function:

$$f(s) = \frac{1}{1 + e^{-s}} \triangleq \sigma(s)$$

This function is monotonically increasing and takes values in $(0, 1)$, so its output can be interpreted as a probability, which makes it suitable for the binary classification problem.
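
When the formula is evaluated directly as `1/(1 + np.exp(-s))`, `np.exp` overflows for large negative `s` and NumPy emits a warning, even though the result still rounds to 0. One common workaround, sketched below, picks an algebraically equivalent form depending on the sign of `s`. The name `stable_sigmoid` is an assumption for illustration, and the function expects array input.

```python
import numpy as np

def stable_sigmoid(s):
    """Sigmoid that avoids overflow by never exponentiating a large positive number."""
    s = np.asarray(s, dtype=float)
    out = np.empty_like(s)
    pos = s >= 0
    # For s >= 0, exp(-s) <= 1, so the usual formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-s[pos]))
    # For s < 0, use the equivalent form exp(s) / (1 + exp(s)).
    exp_s = np.exp(s[~pos])
    out[~pos] = exp_s / (1.0 + exp_s)
    return out

s = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
sig = stable_sigmoid(s)
print(sig)
```

Mathematically the sigmoid never reaches 0 or 1, but in floating point the extreme inputs above saturate to exactly 0 and 1.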

### Minimizing the loss function

To minimize the cost function $J(\mathbf{w})$, we use the gradient descent algorithm.

Using the sigmoid function, we have:

$$\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \frac{1}{M} \mathbf{X}^T (\mathbf{z}-\mathbf{y})$$

and the update step, with learning rate $\alpha$, is:

$$\mathbf{w} := \mathbf{w} - \alpha \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$$
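
The gradient formula above can be verified term by term. The derivation below is a sketch that uses the sigmoid identity $\sigma'(s) = \sigma(s)(1-\sigma(s))$ and writes $J_i$ (a symbol introduced here for illustration) for the contribution of data point $i$ to the cost, without the factor $\frac{1}{M}$:

$$J_i(\mathbf{w}) = -\big(y_i \log z_i + (1-y_i)\log(1-z_i)\big), \quad z_i = \sigma(\mathbf{w}^T\mathbf{x}_i)$$

$$\frac{\partial z_i}{\partial \mathbf{w}} = z_i(1-z_i)\,\mathbf{x}_i$$

$$\frac{\partial J_i}{\partial \mathbf{w}} = -\Big(\frac{y_i}{z_i} - \frac{1-y_i}{1-z_i}\Big) z_i (1-z_i)\,\mathbf{x}_i = (z_i - y_i)\,\mathbf{x}_i$$

$$\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \frac{1}{M}\sum_{i=1}^M (z_i - y_i)\,\mathbf{x}_i = \frac{1}{M}\mathbf{X}^T(\mathbf{z}-\mathbf{y})$$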

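Note that when a bias term $b$ is included, we have $\mathbf{z} = f(\mathbf{X} \mathbf{w} + b)$. The gradient with respect to $b$ is $\frac{\partial J}{\partial b} = \frac{1}{M} \sum_{i=1}^M (z_i - y_i)$, and $b$ is updated in the same way: $b := b - \alpha \frac{\partial J}{\partial b}$.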
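
A finite-difference check is a simple way to gain confidence in analytic gradients. The sketch below compares the gradients with respect to $\mathbf{w}$ and $b$ against central differences on random data. The helper names (`cost`, `gradients`) and the data are assumptions for illustration, and the bias gradient is taken to be the mean of $z_i - y_i$, matching `db` in the implementation further down.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def cost(X, y, w, b):
    """Cross-entropy cost J for weights w and bias b."""
    z = sigmoid(X @ w + b)
    return -np.mean(y * np.log(z) + (1 - y) * np.log(1 - z))

def gradients(X, y, w, b):
    """Analytic gradients of J with respect to w and b."""
    M = X.shape[0]
    z = sigmoid(X @ w + b)
    dw = X.T @ (z - y) / M
    db = np.sum(z - y) / M
    return dw, db

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)
b = 0.1

dw, db = gradients(X, y, w, b)

# Central differences: perturb one coordinate at a time.
eps = 1e-6
num_dw = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    num_dw[j] = (cost(X, y, w + step, b) - cost(X, y, w - step, b)) / (2 * eps)
num_db = (cost(X, y, w, b + eps) - cost(X, y, w, b - eps)) / (2 * eps)

print(np.max(np.abs(dw - num_dw)), abs(db - num_db))
```

Both printed differences should be tiny compared with the gradient magnitudes. A large gap usually points to a sign or scaling mistake in the analytic formula.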


## Implementation

In [154]:
import numpy as np 
import pandas as pd 
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split 

dataset = load_breast_cancer(as_frame=True)

In [155]:
# View the data
dataset.data.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [156]:
dataset.target

0      0
1      0
2      0
3      0
4      0
      ..
564    0
565    0
566    0
567    0
568    1
Name: target, Length: 569, dtype: int64

In [157]:
dataset.data.shape

(569, 30)

In [165]:
data = dataset.data.to_numpy()
target = dataset.target.to_numpy()

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data, target, train_size=0.8, shuffle=True, random_state=1) 

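# Standardize features using the training-set mean and std only,
# so that no information from the test set leaks into training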
mean = np.mean(X_train, axis=0)
std = np.std(X_train, axis=0)
X_train = (X_train - mean)/std
X_test = (X_test - mean)/std

In [166]:
class MyLogR:

  def __init__(self, lr=0.01, n_iters=5000):
    self.lr = lr
    self.n_iters = n_iters
    self.params = {
        'W': None,
        'b': None
    }

  def sigmoid(self, s):
    return 1/(1 + np.exp(-s))
  
  def init_params(self, n_features):
    self.params['W'] = np.random.randn(n_features, 1)
    self.params['b'] = 0

  def calculate_probs(self, X):
    # Predicted probabilities z = sigmoid(XW + b), shape (M, 1)
    W, b = self.params['W'], self.params['b']
    z = self.sigmoid(X @ W + b)
    return z

  def gradient_descent(self, X, z, y):
    M = X.shape[0]
    W, b = self.params['W'], self.params['b']
    dW = 1/M * X.T @ (z-y)
    db = 1/M * np.sum(z-y)
    W = W - self.lr*dW
    b = b - self.lr*db
    self.params['W'] = W
    self.params['b'] = b

  def train(self, X_train, y_train):
    X = np.asarray(X_train).copy()
    y = np.asarray(y_train).reshape((-1,1))
    assert X.shape[0] == y.shape[0]
    
    M, N = X.shape
    self.init_params(N)
    for i in range(self.n_iters):
      z = self.calculate_probs(X)
      self.gradient_descent(X, z, y)

  def predict(self, X_test):
    W, b = self.params['W'], self.params['b']
    probs = self.sigmoid(X_test @ W + b)
    return probs.flatten()


mylogr = MyLogR()
mylogr.train(X_train, y_train)

# Predict
y_pred = (mylogr.predict(X_test) > 0.5)*1
mylogr_accuracy = np.sum(y_pred == y_test)/len(y_test)
mylogr_accuracy

0.9736842105263158
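
The training loop above does not record the cost, so convergence is not directly visible. The following self-contained sketch runs the same update rule on synthetic data and tracks the cross-entropy at each iteration. The data, `true_w`, the learning rate, and the iteration count are all assumptions for illustration and are unrelated to the breast cancer dataset.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy(y, z, eps=1e-12):
    # Clip to avoid log(0) if a probability saturates.
    z = np.clip(z, eps, 1 - eps)
    return -np.mean(y * np.log(z) + (1 - y) * np.log(1 - z))

# Synthetic binary labels drawn from a known (hypothetical) weight vector.
rng = np.random.default_rng(0)
M, N = 200, 4
X = rng.normal(size=(M, N))
true_w = np.array([1.5, -2.0, 0.5, 0.0])
y = (sigmoid(X @ true_w) > rng.uniform(size=M)).astype(float)

w = np.zeros(N)
b = 0.0
lr = 0.1
costs = []
for _ in range(500):
    z = sigmoid(X @ w + b)
    costs.append(cross_entropy(y, z))
    # Gradient descent step for weights and bias.
    w -= lr * X.T @ (z - y) / M
    b -= lr * np.sum(z - y) / M

print(costs[0], costs[-1])
```

With zero initialization every prediction starts at 0.5, so the first recorded cost equals $\log 2$, and the cost then decreases as training proceeds. Recording a similar list inside `MyLogR.train` would be one way to monitor the model above.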

In [167]:
# Using sklearn

from sklearn.linear_model import LogisticRegression 
logR = LogisticRegression()
logR.fit(X_train, y_train)

# Predict
sk_y_pred = logR.predict(X_test)
sk_accuracy = np.sum(sk_y_pred == y_test)/len(y_test)
sk_accuracy

0.9736842105263158

In [169]:
print(f"My Implementation: {mylogr_accuracy}\nSklearn Implementation: {sk_accuracy}")

My Implementation: 0.9736842105263158
Sklearn Implementation: 0.9736842105263158
