# Generalized Linear Model with Gradient Descent

1. [gradient descent](#gradient-descent)
1. [implement gradient descent for linear regression from scratch](#implement-gradient-descent-for-linear-regression-from-scratch)
1. [implement gradient descent for logistic regression from scratch](#implement-gradient-descent-for-logistic-regression-from-scratch)
1. [resources](#resources)

## Gradient Descent

The OLS notebook discussed the closed-form solution of linear regression:

\begin{align}
\min_{\beta} L & = \mathbf{(y-X\beta)^T(y-X\beta)} \\ 
\frac{dL}{d\mathbf{\beta}} & = \mathbf{-2X^T(y-X\beta)} = 0\\
& \mathbf{ \hat{\beta}=(X^TX)^{-1}X^Ty}
\end{align}


When the dataset has many features, or is too large to fit in memory, it is not practical to find the closed-form solution: the computational complexity of inverting the $n \times n$ matrix $X^TX$ ($X$ has $n$ features) is about $O(n^3)$ <sup>[1]</sup>.
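For a small problem, the closed-form solution can be evaluated directly. The sketch below is illustrative only: the data is synthetic, the sizes and seed are arbitrary, and it follows the convention of the code cells in this notebook ($m$ observations, $n$ features). It solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, chosen only for reproducibility
m, n = 200, 3                       # m observations, n features (illustrative sizes)
X = rng.normal(size=(m, n))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=m)

# normal equations: (X^T X) beta = X^T y
# solving the linear system avoids computing the explicit inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# least-squares solver applied to X directly, as a cross-check
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_solve)
print(beta_lstsq)
```

Both estimates should agree to numerical precision and land close to `beta_true`.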

Instead, we iteratively update the weights using

\begin{align}
\beta & \rightarrow \beta - \eta \nabla_{\beta}L(\beta) \\
\beta & \rightarrow \beta - \eta \mathbf{(-2X^T(y-X\beta))} 
\end{align}

where $\eta$ is the learning rate, and $\nabla_{\beta}L(\beta)$ is the gradient vector, which contains the partial derivatives of the cost function with respect to each model parameter.

There are 3 types of gradient descent algorithms:
- batch gradient descent: at each update step, the entire training set is used to calculate the gradient vector.
- stochastic gradient descent: at each update step, the gradient is calculated on a single random observation.
- mini-batch gradient descent: at each update step, the gradient is calculated on a small random set of observations.
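The three variants differ only in how many observations feed each gradient evaluation. As a minimal sketch (the helper name `sample_batch` and the batch size of 16 are hypothetical, picked for illustration), a single sampling routine covers all of them:

```python
import numpy as np


def sample_batch(X, y, batch_size, rng):
    """Draw a random batch without replacement (hypothetical helper)."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    return X[idx], y[idx]


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = rng.normal(size=100)

X_full, y_full = sample_batch(X, y, len(y), rng)  # batch gradient descent
X_one, y_one = sample_batch(X, y, 1, rng)         # stochastic gradient descent
X_mini, y_mini = sample_batch(X, y, 16, rng)      # mini-batch gradient descent

print(X_full.shape, X_one.shape, X_mini.shape)
```

With `batch_size=len(y)` the draw is just a permutation of the full training set, so the gradient equals the full-batch gradient.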

In the next section, we'll implement gradient descent from scratch.


## Implement Gradient Descent for Linear Regression from Scratch

In [1]:
import numpy as np


class LinearRegressionMBGD:
    def __init__(self, batch_size: int=32, lr: float=0.01, max_iters: int=100):
        """Linear regression using mini-batch gradient descent

        Args:
            batch_size (int, optional): number of observations to train on at each step. Defaults to 32.
            lr (float, optional): learning rate. Defaults to 0.01.
            max_iters (int, optional): maximum number of training steps. Defaults to 100.
        """
        self.batch_size = batch_size
        self.lr = lr
        self.max_iters = max_iters
    
    def gradient(self, X, y, beta):
        return - 2 * X.T @ (y - X @ beta)
    
    
    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        # initialize the weight with 0s
        n = X.shape[1]
        beta = np.zeros(n)

        for _ in range(self.max_iters):
            # randomly select a batch of observations
            idx = np.random.choice(len(y), self.batch_size, replace=False)
            X_mini_batch, y_mini_batch = X[idx, :], y[idx]
            # update to \beta - \eta * \nabla_{\beta} L(\beta)
            beta -= self.lr * self.gradient(X_mini_batch, y_mini_batch, beta)

        self.beta = beta

    def predict(self, X: np.ndarray) -> np.ndarray:
        return X @ self.beta

In [2]:
n = 4
m = 1000 
X = np.random.normal(size=(m, n))
b_true = np.random.normal(size=n)
e = np.random.normal(size=m)
y = X @ b_true + e
print(f"True parameter: {b_true}")

True parameter: [-0.00230347  1.02378018 -0.32812042 -0.49768963]


In [3]:
lr_mbgd = LinearRegressionMBGD(lr=0.001, max_iters=2000)
lr_mbgd.fit(X, y)

print(f"{'beta_hat (mbgd):':<21} {lr_mbgd.beta}")
print(f"{'beta_hat (analytical):':<21}{np.linalg.inv(X.T @ X) @ X.T @ y}")

beta_hat (mbgd):      [ 0.01535948  0.97083694 -0.3764715  -0.55437832]
beta_hat (analytical):[-0.01534097  1.00746112 -0.33289216 -0.55708937]


## Implement Gradient Descent for Logistic Regression from Scratch

Logistic regression can be written as

\begin{align}
\mathbf{s} & = \mathbf{X\beta} \\
\mathbf{y} & = \mathbf{\sigma(s)} \\
\end{align}

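where $\mathbf{\sigma(\cdot)}$ is the [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function). Replacing it with the standard normal CDF, which can be written in terms of the [error function](https://en.wikipedia.org/wiki/Error_function), gives the probit model instead.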

In this section we'll use the sigmoid function $\sigma(x)=\frac{1}{1+e^{-x}}$, which has an interesting property: $\frac{d}{dx}\sigma(x)=\sigma(x)(1-\sigma(x))$.
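The derivative identity is easy to sanity-check numerically. The sketch below compares $\sigma(x)(1-\sigma(x))$ with a central finite difference; the grid of points and the step size `h` are arbitrary choices for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


x = np.linspace(-5.0, 5.0, 11)
h = 1e-6  # finite-difference step (illustrative choice)

# central difference approximation of d/dx sigmoid(x)
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
# closed-form derivative
analytic = sigmoid(x) * (1.0 - sigmoid(x))

max_abs_diff = np.max(np.abs(numeric - analytic))
print(max_abs_diff)
```

The difference should be tiny, limited only by finite-difference and floating point error.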

Maximizing the log likelihood of observing $\mathbf{y}$ given the feature set $\mathbf{X}$ and modeling hypothesis $\mathbf{h}$ 

\begin{align}
\mathbf{1_n^T \log prob(y | X; h)}
\end{align}

is equivalent to minimizing the binary cross entropy loss <sup>[2]</sup>

\begin{align}
L = \mathbf{-y^T \log p - (1_n - y)^T \log (1_n - p)} \\
\mathbf{p = \sigma(\hat{s}) = \sigma(X\hat{\beta}) = \frac{1}{1+e^{-X\hat{\beta}}}} 
\end{align}
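
As a brief sketch of why the two are equivalent, assume the labels are independent Bernoulli draws with success probabilities $p_i$ (the index $i$ over the $n$ observations is introduced here for illustration). The log likelihood is then the negative of $L$:

\begin{align}
prob(y_i | \mathbf{x}_i; \mathbf{h}) & = p_i^{y_i} (1 - p_i)^{1 - y_i} \\
\mathbf{1_n^T \log prob(y | X; h)} & = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] \\
& = \mathbf{y^T \log p + (1_n - y)^T \log (1_n - p)} = -L
\end{align}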

To get the gradient of the loss function with respect to $\beta$, we need to apply the chain rule

\begin{align}
\frac{dL}{d\beta} & = \frac{dL}{d\mathbf{p}} \frac{d\mathbf{p}}{d\mathbf{s}} \frac{d\mathbf{s}}{d\beta} \\
\frac{dL}{d\mathbf{p}} & = \mathbf{-y \oslash p - (-1)* (1_n - y) \oslash (1_n - p)}\\
\frac{d\mathbf{p}}{d\mathbf{s}} & = \mathbf{p \odot (1_n - p)}\\
\frac{d\mathbf{s}}{d\beta} & = \mathbf{X} \\
\frac{dL}{d\beta} & = \mathbf{X^T (-y \odot (1_n - p) + (1_n - y) \odot p)} = \mathbf{X^T (p-y)} 
\end{align}

where $\oslash$ is element-wise division and $\odot$ is element-wise product.
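
A finite-difference check is a quick way to gain confidence in the derived gradient $\mathbf{X^T(p-y)}$. The sketch below uses synthetic data, and the helper name `bce_loss`, the sizes, and the step size are illustrative assumptions. The loss is summed rather than averaged so that it matches $L$ as written above.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def bce_loss(beta, X, y):
    # summed (not averaged) binary cross entropy, matching L in the derivation
    p = sigmoid(X @ beta)
    return -y @ np.log(p) - (1 - y) @ np.log(1 - p)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)
beta = rng.normal(size=3)

# analytical gradient from the chain rule: X^T (p - y)
analytic = X.T @ (sigmoid(X @ beta) - y)

# numerical gradient: central difference, one coordinate at a time
h = 1e-6
numeric = np.zeros_like(beta)
for j in range(len(beta)):
    step = np.zeros_like(beta)
    step[j] = h
    numeric[j] = (bce_loss(beta + step, X, y) - bce_loss(beta - step, X, y)) / (2 * h)

max_abs_diff = np.max(np.abs(numeric - analytic))
print(max_abs_diff)
```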


The update rule can be simplified to

\begin{align}
\beta & \rightarrow \beta - \eta \nabla_{\beta} L \\
\beta & \rightarrow \beta - \eta\mathbf{X^T (p-y)}
\end{align}






In [4]:
from typing import Union
import numpy as np


def sigmoid(x: Union[int, float, np.ndarray]) -> Union[float, np.ndarray]:
    # clip the input so that np.exp(-x) stays within the range of 64-bit floats
    x = np.clip(x, -709.78, 709.78)
    return 1 / (1 + np.exp(-x))


def cross_entropy_loss(y: np.ndarray, p: np.ndarray) -> float:
    """average cross entropy loss

    Args:
        y (np.ndarray): label
        p (np.ndarray): predicted probability

    Returns:
        float: cross entropy loss averaged over the observations
    """
    l = - y.T @ np.log(p) - (1 - y).T @ np.log(1 - p)
    n = len(y)
    return l / n


class LogisticRegressionGD:
    def __init__(self, batch_size: int=32, lr: float=0.01, max_iters: int=100):
        """Logistic regression using gradient descent

        Args:
            batch_size (int, optional): number of observations to train on at each step. Defaults to 32.
            lr (float, optional): learning rate. Defaults to 0.01.
            max_iters (int, optional): maximum number of training steps. Defaults to 100.
        """
        self.batch_size = batch_size
        self.lr = lr
        self.max_iters = max_iters
    
    def gradient(self, X, y, beta):
        p = sigmoid(X @ beta)
        return X.T @ (p - y)
    
    
    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        # initialize the weight with 0s
        n = X.shape[1]
        beta = np.zeros(n)

        self.losses = []

        for _ in range(self.max_iters):
            # randomly select a batch of observations
            idx = np.random.choice(len(y), self.batch_size, replace=False)
            X_mini_batch, y_mini_batch = X[idx, :], y[idx]
            # update to \beta - \eta * \nabla_{\beta} L(\beta)
            beta -= self.lr * self.gradient(X_mini_batch, y_mini_batch, beta)
            # record the cross entropy loss
            p_mini_batch = self._predict_proba(X_mini_batch, beta)
            self.losses.append(cross_entropy_loss(y_mini_batch, p_mini_batch))

        self.beta = beta

    def _predict_proba(self, X, beta, eps=1e-15):
        # clipping to avoid a runtime warning from np.log(0)
        return np.clip(sigmoid(X @ beta), eps, 1-eps)
    
    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        return self._predict_proba(X, self.beta, eps=0)
    
    def predict(self, X: np.ndarray, threshold=0.5) -> np.ndarray:
        return (self.predict_proba(X) > threshold).astype(int)

Now we're ready to run some data through the above implementation. We'll use the breast cancer binary classification dataset from sklearn, and predict the diagnosis using mean radius, mean texture, and mean symmetry.

In [5]:
from sklearn.datasets import load_breast_cancer


data = load_breast_cancer()
print(data.keys())

X = data.data
y = data.target
print(data.feature_names)
print(data.target_names)
print(np.unique(y, return_counts=True))

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
['malignant' 'benign']
(array([0, 1]), array([212, 357]))


In [6]:
feature_names = ["mean radius", "mean texture", "mean symmetry"]
mask = np.isin(data.feature_names, feature_names)
feature_idx = np.argwhere(mask).ravel()
X = X[:, feature_idx]

print(X[:3])
print(np.corrcoef(X.T))

[[17.99   10.38    0.2419]
 [20.57   17.77    0.1812]
 [19.69   21.25    0.2069]]
[[1.         0.32378189 0.14774124]
 [0.32378189 1.         0.07140098]
 [0.14774124 0.07140098 1.        ]]


Looking at `X[:3]`, it's clear that the features are on different scales. 

Since the same learning rate is applied to all model parameters in our algorithm, unscaled features can make the updates unstable or slow to converge. In this case we should standardize the feature set.
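
Standardization subtracts each column's mean and divides by its standard deviation. A minimal NumPy sketch on the three rows printed above follows. Because the statistics here come from those three rows only, the values will differ from a scaler fit on the full dataset. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# the three rows shown in X[:3]
X_raw = np.array([[17.99, 10.38, 0.2419],
                  [20.57, 17.77, 0.1812],
                  [19.69, 21.25, 0.2069]])

mu = X_raw.mean(axis=0)     # per-feature mean
sigma = X_raw.std(axis=0)   # per-feature population std (ddof=0)
X_scaled = (X_raw - mu) / sigma

print(X_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_scaled.std(axis=0))   # approximately 1 for every feature
```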

In [7]:
from sklearn.preprocessing import StandardScaler


X = StandardScaler().fit_transform(X)
X[:3]

array([[ 1.09706398e+00, -2.07333501e+00,  2.21751501e+00],
       [ 1.82982061e+00, -3.53632408e-01,  1.39236330e-03],
       [ 1.57988811e+00,  4.56186952e-01,  9.39684817e-01]])

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
# mini-batch gradient descent: 1 < batch_size << len(y) 
# stochastic gradient descent: batch_size = 1 
# batch gradient descent: batch_size = len(y)

m = LogisticRegressionGD(lr=0.01, batch_size=32, max_iters=1000)
m.fit(X_train, y_train)
m.beta

array([-4.31417424, -1.12064318, -1.36766058])

Run the training data through the scikit-learn and statsmodels implementations of logistic regression and compare the results on the testing dataset.

In [10]:
from sklearn.linear_model import LogisticRegression

# disable regularization and intercept to match sm.Logit() and LogisticRegressionGD()
m_sl = LogisticRegression(max_iter=1000, fit_intercept=False, penalty=None)
m_sl.fit(X_train, y_train)
b_sl = np.hstack(m_sl.coef_)


import statsmodels.api as sm


m_sm = sm.Logit(y_train, X_train).fit()
b_sm = m_sm.params

Optimization terminated successfully.
         Current function value: 0.225719
         Iterations 9


In [11]:
from sklearn.metrics import roc_auc_score, precision_score, recall_score


for model_name, y_pred, b in zip(
    ["mbgd", "sklearn", "statsmodel"], 
    [m.predict(X_test), m_sl.predict(X_test), (m_sm.predict(X_test) > 0.5).astype(int)],
    [m.beta, b_sl, b_sm]
    ):
    roc_auc = roc_auc_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    print(f"{model_name:<10} roc_auc: {roc_auc:.2f}, precision: {precision:.2f}, recall: {recall:.2f}, weights{b}")

mbgd       roc_auc: 0.94, precision: 0.96, recall: 0.94, weights[-4.31417424 -1.12064318 -1.36766058]
sklearn    roc_auc: 0.94, precision: 0.96, recall: 0.96, weights[-4.34156853 -1.11208314 -1.42974143]
statsmodel roc_auc: 0.94, precision: 0.96, recall: 0.96, weights[-4.34539123 -1.11329859 -1.43053815]
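
One caveat on the comparison above: `roc_auc_score` is given thresholded 0/1 predictions, while AUC is a ranking metric and is usually computed from the predicted probabilities. The sketch below illustrates the difference with a small pairwise AUC written in NumPy. The function name `auc_from_scores` and the toy labels and scores are made up for illustration and are not taken from the dataset.

```python
import numpy as np


def auc_from_scores(y_true, scores):
    """Pairwise AUC: P(score_pos > score_neg) + 0.5 * P(tie). Hypothetical helper."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


# toy labels and predicted probabilities (made up for illustration)
y_true = np.array([0, 0, 1, 1, 1])
proba = np.array([0.10, 0.60, 0.55, 0.80, 0.90])

auc_proba = auc_from_scores(y_true, proba)
auc_labels = auc_from_scores(y_true, (proba > 0.5).astype(float))

print(f"AUC from probabilities: {auc_proba:.3f}")
print(f"AUC from hard labels:   {auc_labels:.3f}")
```

Thresholding collapses the scores to two values and creates ties, so the two numbers generally differ. With the classes above, passing `m.predict_proba(X_test)` instead of `m.predict(X_test)` would give the probability-based figure.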


In [12]:
import plotly.express as px
import pandas as pd


px.line(pd.DataFrame({"loss": m.losses}).rename_axis("iter"), title="Mini Batch Gradient Descent Cross Entropy Loss")

## Resources

1. [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)
2. [Logistic Regression, Wikipedia](https://en.wikipedia.org/wiki/Logistic_regression#Fit)
3. [LaTeX math symbols](https://www.math.uci.edu/~xiangwen/pdf/LaTeX-Math-Symbols.pdf)
4. [Logistic Regression implementation by Ethen Liu](https://nbviewer.org/github/ethen8181/machine-learning/blob/master/text_classification/logistic.ipynb)