# Logistic Regression

## Motivation: Predicting Whether a Person Has Diabetes
* For classification, the only valid outputs are $y = 0$ (no diabetes) and $y = 1$ (diabetes).
* Linear regression is not suitable for this kind of task, because it produces continuous, unbounded output.
* We want a model whose output is bounded: $0 \le \sigma(x) \le 1$ (where $\sigma$ is our model).

## The Sigmoid function

$$ 
\Large \text{$sigmoid(z)= \frac{1}{1+e^{-z}}$}
$$
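As a quick illustration of the formula above, the following minimal sketch evaluates the sigmoid at a few points. The standalone function `sigmoid` and the sample values in `z` are chosen here for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), applied element-wise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs: large negative, zero, large positive
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
values = sigmoid(z)
print(np.round(values, 4))
```

The output is strictly between 0 and 1, increases with $z$, and equals exactly 0.5 at $z = 0$.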


## Logistic Regression Model

Standard Linear Regression model: $h_{w}(x)=w^{T}x$

A subtle change introduces non-linearity: the linear output is passed through the sigmoid function.

$$ 
\Large \text{$\sigma_{w}(x)= \frac{1}{1+e^{-h_{w}(x)}} = \frac{1}{1+e^{-w^{T}x}}$}
$$
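To make the model concrete, here is a small sketch that computes $h_{w}(x)$ and $\sigma_{w}(x)$ for a single input. The weights `w` and the input `x` are hypothetical values picked for illustration, not taken from any dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and input, chosen only for illustration
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, 0.5])

h = w @ x           # linear model: 0.5*1.0 - 1.0*3.0 + 2.0*0.5 = -1.5
sigma = sigmoid(h)  # squashed into (0, 1), roughly 0.18
print(h, sigma)
```

The linear output $-1.5$ is unbounded in general, while the sigmoid maps it to a value that can be read as a probability.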

## Interpretation of Model Output

$$ 
\Large \text{$\sigma_{w}(x) \approx$ estimated probability that $y = 1$}
$$

More formally, we work with a __hypothesis__:

$$ 
\Large \text{$\sigma_{w}(x) = P(y = 1 | x; w )$}
$$


This is the probability that $y = 1$, given the input $x$, for a model parameterized by $w$.

The actual labels are still discrete ($y=0$ or $y=1$), so the two class probabilities must add up to one:

$$ 
\Large \text{$P(y = 0 | x; w ) + P(y = 1 | x ; w) = 1$}
$$
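As a short worked step (following directly from the definition $\sigma_{w}(x) = \frac{1}{1+e^{-w^{T}x}}$), the probability of the negative class can be written out explicitly:

$$
P(y = 0 | x; w) = 1 - \sigma_{w}(x) = \frac{e^{-w^{T}x}}{1+e^{-w^{T}x}} = \frac{1}{1+e^{w^{T}x}}
$$

In other words, the probability of $y = 0$ is the sigmoid evaluated at $-w^{T}x$.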

## Loss Function of Logistic Regression

Proposed loss function (specifically adapted to logistic regression):

$$ 
\Large \text{$J(\sigma_{w}(x_{i}), y_{i}) = -y_{i}\log(\sigma_{w}(x_{i}))-(1-y_{i})\log(1-\sigma_{w}(x_{i}))$}
$$

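* $J$ is convex in $w$, but its minimizer has no closed-form (analytical) solution, so it has to be found iteratively, for example with gradient descent.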
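To get a feeling for this loss, the sketch below evaluates the per-sample negative log-likelihood $-y\log(p) - (1-y)\log(1-p)$ for a few predicted probabilities $p$. The helper name `log_loss` and the example values are assumptions made for illustration.

```python
import numpy as np

def log_loss(p, y, eps=1e-15):
    """Per-sample loss -y*log(p) - (1-y)*log(1-p) for predicted probability p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

# Illustrative labels and predicted probabilities
y = np.array([1, 1, 0, 0])
p = np.array([0.9, 0.1, 0.1, 0.9])
losses = log_loss(p, y)
print(np.round(losses, 4))
```

Confident correct predictions (first and third sample) give a loss of about 0.105, while confident wrong predictions (second and fourth sample) give about 2.303, so the loss penalizes confident mistakes heavily.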

## Loss to minimize

$$ 
\Large \text{$\arg\min_{w}J(w) = \arg\min_{w} \sum_{i=1}^{N} \left[ -y_{i}\log(\sigma_{w}(x_{i}))-(1-y_{i})\log(1-\sigma_{w}(x_{i})) \right]$}
$$

## Gradient for Logistic Regression
$$
\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}} \left[ \sum_{i=1}^{N} -y_i \log(\sigma_{\mathbf{w}}(\mathbf{x}_i)) - (1 - y_i) \log(1 - \sigma_{\mathbf{w}}(\mathbf{x}_i)) \right]
$$

Applying the chain rule together with the sigmoid derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, this simplifies to:

$$
\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \sum_{i=1}^{N} - (y_i - \sigma_{\mathbf{w}}(\mathbf{x}_i)) \mathbf{x}_i
$$
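One way to sanity-check the gradient formula is to compare it with a finite-difference approximation of the summed loss. The following is a minimal sketch under stated assumptions: the random data, the helper names (`loss`, `analytic_grad`, `numeric_grad`) and the step size `h` are all chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    """Summed negative log-likelihood over all samples."""
    p = sigmoid(X @ w)
    return np.sum(-y * np.log(p) - (1 - y) * np.log(1 - p))

def analytic_grad(w, X, y):
    """Gradient from the closed form: sum_i -(y_i - sigma_w(x_i)) * x_i."""
    return -(y - sigmoid(X @ w)) @ X

def numeric_grad(w, X, y, h=1e-6):
    """Central finite differences, one coordinate of w at a time."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = h
        grad[j] = (loss(w + e, X, y) - loss(w - e, X, y)) / (2 * h)
    return grad

# Small synthetic problem (not the diabetes data)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

max_diff = np.max(np.abs(analytic_grad(w, X, y) - numeric_grad(w, X, y)))
print(f"max abs difference: {max_diff:.2e}")
```

If the two gradients agree up to numerical error, the closed-form expression is consistent with the loss it was derived from.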

In [1]:
# Import dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Set a random seed
np.random.seed(42)

In [3]:
# Load example data (note: we don't care much about data cleaning here, just about how LR works)
df = pd.read_csv('diabetes.csv')
df.head()

| | Number of times pregnant | Plasma glucose concentration a 2 hours in an oral glucose tolerance test | Diastolic blood pressure (mm Hg) | Triceps skin fold thickness (mm) | 2-Hour serum insulin (mu U/ml) | Body mass index (weight in kg/(height in m)^2) | Diabetes pedigree function | Age (years) | Class variable (0 or 1) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |


In [4]:
# Extract data
X = df.drop(columns=['Class variable (0 or 1)'])
y = df['Class variable (0 or 1)']

print(f'Shape of X: {X.shape}')
print(f'Shape of y: {y.shape}')

Shape of X: (768, 8)
Shape of y: (768,)


In [5]:
from sklearn.model_selection import train_test_split
# Split for training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(f'Shape of X_train: {X_train.shape}')
print(f'Shape of y_train: {y_train.shape}')
print(f'Shape of X_test: {X_test.shape}')
print(f'Shape of y_test: {y_test.shape}')

Shape of X_train: (614, 8)
Shape of y_train: (614,)
Shape of X_test: (154, 8)
Shape of y_test: (154,)


In [6]:
class LogisticRegression:
    def __init__(self, n_features: int) -> None:
        # One weight per feature plus one for the bias, drawn uniformly from [0, 1)
        self.w = np.random.rand(n_features + 1)

    def add_bias(self, X: np.ndarray) -> np.ndarray:
        # Prepend a column of ones so that w[0] acts as the bias term
        bias_column = np.ones((X.shape[0], 1))
        return np.hstack((bias_column, X))

    def sigmoid(self, z: np.ndarray) -> np.ndarray:
        z = np.clip(z, -500, 500)  # Clip values to avoid overflow in np.exp
        return 1.0 / (1.0 + np.exp(-z))

    def step(self, p: np.ndarray) -> np.ndarray:
        # Threshold the predicted probabilities at 0.5 to obtain class labels
        return (np.asarray(p) >= 0.5).astype(int)

    def train(self, X: np.ndarray, y: np.ndarray, learning_rate: float, max_iter: int, verbose: bool = False) -> None:
        X_b = self.add_bias(X)
        for i in range(max_iter):
            h_w = np.dot(X_b, self.w)                           # (N x n_in+1) * (n_in+1,) = (N,)
            sigma_w = self.sigmoid(h_w)                         # (N,)

            # Gradient of the loss, averaged over the N samples
            grad = -np.dot((y - sigma_w), X_b) / len(X_b)       # (N,) * (N x n_in+1) = (n_in+1,)
            self.w = self.w - learning_rate * grad              # (n_in+1,)

            if verbose and i % 1000 == 0:
                # Mean loss, computed from the predictions made before this update
                epsilon = 1e-15
                sigma_w = np.clip(sigma_w, epsilon, 1 - epsilon)
                loss = -np.mean(y * np.log(sigma_w) + (1 - y) * np.log(1 - sigma_w))
                print(f'Iteration: {i}, Loss: {loss}')

    def predict(self, x: np.ndarray) -> np.ndarray:
        x_b = self.add_bias(x)
        z = np.dot(x_b, self.w)
        sigma_w = self.sigmoid(z)
        return self.step(sigma_w)

In [7]:
# Create a model
model = LogisticRegression(8)

# Start training the logistic regression model
model.train(X_train, y_train, learning_rate=0.0001, max_iter=10000, verbose=True)

Iteration: 0, Loss: 22.107578770341963
Iteration: 1000, Loss: 1.2732021185892322
Iteration: 2000, Loss: 0.9358336732037364
Iteration: 3000, Loss: 0.8566963458223563
Iteration: 4000, Loss: 0.8177441404443307
Iteration: 5000, Loss: 0.7874535741290457
Iteration: 6000, Loss: 0.7613340421304262
Iteration: 7000, Loss: 0.7385024975987664
Iteration: 8000, Loss: 0.71871574197769
Iteration: 9000, Loss: 0.7018212621348132


In [8]:
# Calculate accuracy (fraction of correct predictions) using the trained global `model`
def calculate_accuracy(X_test: np.ndarray, y_test: np.ndarray) -> float:
    y_pred = model.predict(X_test)
    return np.mean(y_test == y_pred)

In [None]:
# Test the model
model_acc = calculate_accuracy(X_test, y_test)
print(f'Accuracy: {model_acc}')

Accuracy: 0.7402597402597403
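An accuracy value is easier to judge next to a trivial reference point. The sketch below computes the accuracy of always predicting the most frequent training label. The helper `majority_baseline_accuracy` is a hypothetical name introduced here, and the toy labels are for illustration only, not the diabetes data.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent label seen in y_train."""
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)
    labels, counts = np.unique(y_train, return_counts=True)
    majority_label = labels[np.argmax(counts)]
    return float(np.mean(y_test == majority_label))

# Toy labels for illustration only
toy_train = np.array([0, 0, 0, 1, 1])
toy_test = np.array([0, 1, 0, 0])
toy_baseline = majority_baseline_accuracy(toy_train, toy_test)
print(f'Baseline accuracy: {toy_baseline}')  # 0.75 for these toy labels
```

Calling the same function with `y_train` and `y_test` from the split above would give the baseline that the model's accuracy can be compared against.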
