# Statistical parameters

[Machine Learning Interpretability course](https://www.trainindata.com/p/machine-learning-interpretability)

In this notebook, we calculate the functions and statistics that are relevant for logistic regression: the logistic function, the cost function, the log-loss, the log-likelihood, the deviance, the p-value and the R-statistic.

In [1]:
import numpy as np

## Logistic function

The value of z is the output of the linear model. The logistic function maps z to a probability between 0 and 1. Change the values of z to see how the output of the logistic function responds.

In [2]:
# z is the output of the linear equation

z = np.array([-3, -1, 0, 2, 3])

# the logistic function

def sigmoid(z):
    
    s = 1/(1+np.exp(-z))
    
    return np.round(s, 2)

sigmoid(z)

array([0.05, 0.27, 0.5 , 0.88, 0.95])
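
To make the link between the linear model and z concrete, here is a minimal sketch that computes z from an intercept and coefficients and then applies the logistic function. The values of `intercept`, `coefs` and `X` are made up for illustration and do not come from a fitted model.

```python
import numpy as np

# Hypothetical intercept and coefficients, chosen only for illustration
intercept = -1.0
coefs = np.array([0.5, 2.0])

# Three observations with two features each (made-up values)
X = np.array([
    [0.0, 0.0],
    [2.0, 0.0],
    [2.0, 1.0],
])

# Linear predictor: z = intercept + X @ coefs
z = intercept + X @ coefs

# The logistic function maps z to a probability between 0 and 1
p = 1 / (1 + np.exp(-z))

print(z)
print(np.round(p, 2))
```

With these values z is -1, 0 and 2, so the probabilities match the corresponding entries of the output above.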

## Cost function

The cost function is what we minimize when finding the coefficients of the logistic regression. It penalizes predicted probabilities that are far from the observed class.

The cost function takes one form when y is 1 and a different form when y is 0.

In [3]:
# when y = 1

p = np.array([0.3, 0.5, 0.8])

-np.log(p)

array([1.2039728 , 0.69314718, 0.22314355])

In [4]:
# when y = 0

p = np.array([0.7, 0.4, 0.2])

-np.log(1-p)

array([1.2039728 , 0.51082562, 0.22314355])
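
The two forms can also be wrapped in a single function that picks the right expression for each observation. The sketch below assumes y contains only 0s and 1s and p lies strictly between 0 and 1. The names `cost`, `y_example` and `p_example` are illustrative.

```python
import numpy as np

def cost(y, p):
    """Per-observation cost: -log(p) when y is 1, -log(1 - p) when y is 0."""
    y = np.asarray(y)
    p = np.asarray(p, dtype=float)
    return np.where(y == 1, -np.log(p), -np.log(1 - p))

# Made-up targets and predicted probabilities
y_example = np.array([1, 1, 0, 0])
p_example = np.array([0.8, 0.3, 0.2, 0.7])

costs = cost(y_example, p_example)
print(costs)
```

Confident predictions of the correct class (0.8 when y is 1, 0.2 when y is 0) get a small cost, and confident predictions of the wrong class get a large one.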

## Log-loss

The log-loss is another expression of the cost function, where the two components are put together: for each observation we take y\*log(p) + (1-y)\*log(1-p), sum over all observations, and multiply by -1/m, where m is the number of observations.

In [5]:
# target variable
y = np.array([0, 0, 1, 1, 1])

# Baseline: predict the mean of y (0.6) for every observation
p_b = np.array([0.6, 0.6, 0.6, 0.6, 0.6])

# Model
p_l = np.array([0.2, 0.3, 0.6, 0.7, 0.8])

In [6]:
from sklearn.metrics import log_loss

In [7]:
log_loss(y, p_b)

0.6730116670092563

In [8]:
log_loss(y, p_l)

0.334092522854375
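
As a cross-check, the log-loss can be computed directly from its definition with NumPy. This is a minimal sketch that assumes probabilities strictly between 0 and 1. The helper name `manual_log_loss` is illustrative.

```python
import numpy as np

y = np.array([0, 0, 1, 1, 1])
p_b = np.array([0.6, 0.6, 0.6, 0.6, 0.6])
p_l = np.array([0.2, 0.3, 0.6, 0.7, 0.8])

def manual_log_loss(y, p):
    """Average negative log-probability of the observed classes."""
    m = len(y)
    return -(1 / m) * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

loss_baseline = manual_log_loss(y, p_b)
loss_model = manual_log_loss(y, p_l)

print(loss_baseline, loss_model)
```

The two values should agree with the sklearn results above up to floating point precision.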

## Log-likelihood

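The log-likelihood is the sum, over all observations, of the log of the probability that the model assigns to the observed class. It equals the log-loss multiplied by -m.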

In [9]:
def ll(p):
    # log-likelihood of the target y (defined above) given the probabilities p
    return np.sum(y*np.log(p) + (1-y)*np.log(1-p))

In [10]:
ll(p_b)

-3.3650583350462817

In [11]:
ll(p_l)

-1.670462614271875

In [12]:
# or with sklearn

- log_loss(y, p_b, normalize=False)

-3.3650583350462817

In [13]:
# or with sklearn

- log_loss(y, p_l, normalize=False)

-1.670462614271875
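
The name comes from the likelihood itself, which is the product of the probabilities assigned to the observed classes. Taking the log turns that product into a sum. The sketch below illustrates this for the model predictions. The variable `p_observed` is an illustrative name.

```python
import numpy as np

y = np.array([0, 0, 1, 1, 1])
p_l = np.array([0.2, 0.3, 0.6, 0.7, 0.8])

# Probability that the model assigns to the class that was actually observed
p_observed = np.where(y == 1, p_l, 1 - p_l)

# Likelihood: product of those probabilities
likelihood = np.prod(p_observed)

# Log-likelihood: sum of their logs, which equals the log of the product
log_likelihood = np.sum(np.log(p_observed))

print(likelihood, log_likelihood)
```

Working with the sum of logs avoids the very small numbers that the product produces when there are many observations.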

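## Deviance

The deviance is -2 times the log-likelihood. Since the unnormalized log-loss is the negative log-likelihood, the deviance is also 2 times the unnormalized log-loss. Lower values indicate a better fit.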

In [14]:
def deviance(y_true, y_pred):
    return 2 * log_loss(y_true, y_pred, normalize=False)

In [15]:
deviance(y, p_b)

6.730116670092563

In [16]:
deviance(y, p_l)

3.34092522854375
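
The deviance can also be obtained from the quantities computed in the previous sections. This minimal sketch shows two equivalent routes for the model predictions, one through the log-likelihood and one through the average log-loss. The variable names are illustrative.

```python
import numpy as np

y = np.array([0, 0, 1, 1, 1])
p_l = np.array([0.2, 0.3, 0.6, 0.7, 0.8])
m = len(y)

# Log-likelihood: sum of the log-probabilities of the observed classes
log_likelihood = np.sum(y * np.log(p_l) + (1 - y) * np.log(1 - p_l))

# Route 1: deviance is -2 times the log-likelihood
deviance_from_ll = -2 * log_likelihood

# Route 2: deviance is 2 * m times the average (normalized) log-loss
mean_log_loss = -log_likelihood / m
deviance_from_log_loss = 2 * m * mean_log_loss

print(deviance_from_ll, deviance_from_log_loss)
```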

## p-value

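The difference between the deviance of the baseline and the deviance of the model approximately follows a chi-square distribution, with degrees of freedom equal to the difference in the number of parameters between the two models. The p-value is the probability of observing a difference at least this large if the model were no better than the baseline.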

In [17]:
from scipy import stats

x2 = deviance(y, p_b) - deviance(y, p_l)

# degrees of freedom: number of extra parameters in the model vs. the baseline
df = 1

1 - stats.chi2.cdf(x2, df)

0.06562512323458003
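
The same test can be packaged as a small function. This is a sketch rather than a definitive implementation: the names `binomial_deviance` and `likelihood_ratio_test` are illustrative, and `df=1` is taken from the cell above. It uses `stats.chi2.sf`, the survival function, which is equivalent to `1 - cdf`.

```python
import numpy as np
from scipy import stats

def binomial_deviance(y, p):
    """Deviance: -2 times the log-likelihood."""
    return -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def likelihood_ratio_test(y, p_baseline, p_model, df):
    """Compare a model against a baseline with a chi-square test."""
    # Test statistic: reduction in deviance achieved by the model
    statistic = binomial_deviance(y, p_baseline) - binomial_deviance(y, p_model)
    # Upper tail probability of the chi-square distribution
    p_value = stats.chi2.sf(statistic, df)
    return statistic, p_value

y = np.array([0, 0, 1, 1, 1])
p_b = np.array([0.6, 0.6, 0.6, 0.6, 0.6])
p_l = np.array([0.2, 0.3, 0.6, 0.7, 0.8])

statistic, p_value = likelihood_ratio_test(y, p_b, p_l, df=1)
print(statistic, p_value)
```

At the conventional 0.05 threshold, a p-value of about 0.066 would not count as a significant improvement over the baseline, which is unsurprising with only five observations.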

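## R-statistic

The R-statistic computed here is the proportional reduction in deviance achieved by the model relative to the baseline. It is a pseudo R-squared: 0 means no improvement over the baseline, and values closer to 1 mean a larger improvement.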

In [18]:
R = (deviance(y, p_b) - deviance(y, p_l) )/ deviance(y, p_b)

R

0.5035858377626312
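
Because the deviance is -2 times the log-likelihood, the same ratio can be written as 1 minus the ratio of the two log-likelihoods, a form commonly known as McFadden's pseudo R-squared. The sketch below checks the equivalence numerically. The names `log_likelihood` and `pseudo_r2` are illustrative.

```python
import numpy as np

y = np.array([0, 0, 1, 1, 1])
p_b = np.array([0.6, 0.6, 0.6, 0.6, 0.6])
p_l = np.array([0.2, 0.3, 0.6, 0.7, 0.8])

def log_likelihood(y, p):
    """Sum of the log-probabilities of the observed classes."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

ll_baseline = log_likelihood(y, p_b)
ll_model = log_likelihood(y, p_l)

# 1 - LL(model) / LL(baseline) equals (D_baseline - D_model) / D_baseline
pseudo_r2 = 1 - ll_model / ll_baseline

print(pseudo_r2)
```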