<font size="6"><b>GENERALIZED LINEAR MODELS AND LOGISTIC REGRESSION: BASICS</b></font>

<font size="5"><b>Serhat Çevikel</b></font>

In [None]:
library(data.table)
library(tidyverse)
library(plotly)
library(broom) # for extracting coefficients
library(caret) # for confusion matrix
library(pROC) # for roc and auc

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

In [None]:
datapath <- "~/databa"

![xkcd](../imagesba/odds_ratio.png)

(https://xkcd.com/2599/)

# Simulating and Modeling Bernoulli Responses

We know that the expected value of a random variable is its mean:


For discrete variables with finite number of possible values:

${\displaystyle \operatorname {E} [X]=\sum _{i=1}^{n}x_{i}\,p_{i},}$

And for continuous variables with infinite number of possible values:

${\displaystyle \operatorname {E} [X]=\int _{-\infty }^{\infty }xf(x)\,dx.}$

(https://en.wikipedia.org/wiki/Expected_value#Random_variables_with_density)

A linear regression model basically models the **conditional mean** ($\boldsymbol \mu$) of a response variable given the values of a set of predictors:

${\displaystyle \operatorname {E} (\mathbf {Y} \mid \mathbf {X} )={\boldsymbol {\mu }}=\mathbf {X} {\boldsymbol {\beta }}}$

While this formulation is for the conditional mean of the response variable, the model also includes an error term to account for the variability around this conditional mean:

${\displaystyle \mathbf {y} =\mathbf {X} {\boldsymbol {\beta }}+{\boldsymbol {\varepsilon }}}$

(https://en.wikipedia.org/wiki/Linear_regression)

One of the assumptions of the ordinary least squares estimator in linear regression is equal variance, or homoscedasticity:

$\operatorname E( \epsilon_i^2 | X) = \sigma^2$

So the variance of the errors conditional on the values of the predictors is the same for all predictor values.

(https://en.wikipedia.org/wiki/Ordinary_least_squares#Assumptions)

In linear regression models with OLS, we worked with predictors and response variables that can take any value between $-\infty$ and $\infty$, and for which the variance is a parameter separate from the expected value and is therefore set separately. This is the case with normally distributed variables. Hence:

- All fitted values in the continuum can be meaningful since response can take any value
- And with some transformations equal variance assumption could be met to a certain degree

We know that, in normal distribution, variance is a free parameter that we can set for the distribution:

${\displaystyle {\mathcal {N}}(\mu ,\sigma ^{2})}$

(https://en.wikipedia.org/wiki/Normal_distribution)

The mean sets the location of the distribution while variance sets the scale.

However, in some distribution types the variance is not a free parameter: the distribution is defined only by location and shape parameters, and the variance is determined by those values.

The examples are:

- Binomial distribution: ${\displaystyle \operatorname {Var} (X)=npq=np(1-p)}$
- Poisson distribution: ${\displaystyle \operatorname {Var} (X)=\lambda}$
- Gamma distribution: ${\displaystyle \operatorname {Var} (X)=\alpha /\lambda ^{2}}$

Furthermore, the variables drawn from these distributions are not supported on the whole continuum. Poisson and Gamma distributions have support on non-negative values, and the Binomial distribution has support between 0 and $n$.

For example, the variance of a Bernoulli distribution, $p(1-p)$, changes with the value of p as follows:

In [None]:
px <- seq(0, 1, 1e-2)
plot(px, px*(1-px), type = "l")
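The relationship $\operatorname{Var}(X) = p(1-p)$ can also be checked empirically. The sketch below is only illustrative: the probability grid `p_grid`, the number of draws `n_draw` and the seed are arbitrary choices and are not part of the analysis in this notebook.

```r
# Empirical check of Var(X) = p * (1 - p) for Bernoulli draws
set.seed(1)
p_grid <- c(0.1, 0.3, 0.5, 0.7, 0.9)  # arbitrary probabilities (assumption)
n_draw <- 1e5                         # draws per probability (assumption)

# sample variance of 0/1 draws for each probability
emp_var  <- sapply(p_grid, function(p) var(rbinom(n_draw, 1, p)))
theo_var <- p_grid * (1 - p_grid)

round(cbind(p = p_grid, empirical = emp_var, theoretical = theo_var), 4)
```

The two columns should agree up to sampling noise, with the largest variance at $p = 0.5$.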

Now let's make a new simulation in which the response variable can only take the values 0 and 1, drawn from the Bernoulli distribution, which is a special case of the Binomial distribution with $n=1$:

In [None]:
nobs <- 1e4

In [None]:
set.seed(5)
xterm <- rnorm(nobs, 2, 1)
#eterm <- normalize(rnorm(nobs, 0, 2))
beta0 <- -4
beta1 <- 2
yterm_raw <- beta0 + beta1 * xterm
data1 <- data.table(yterm_raw, xterm)

In [None]:
head(data1)

Note that we did not include an error term but the uncertainty around the expected mean of the response variable will be simulated in a different way:

- First we will compress the raw y values such that they are always between 0 and 1
- Each compressed value will be treated as the p of a Bernoulli trial and for each p, a 0 or 1 value will be drawn from the respective Bernoulli distribution, like a coin toss.

The function to compress values on the continuum into the (0, 1) range will be the logistic function. We will come to the importance and the origin of this function shortly. Let's see how values are mapped with the logistic function:

In [None]:
logistic <- function(x) 1/(1 + exp(-x))

In [None]:
yx <- seq(-10, 10, 1e-2)
plot(yx, logistic(yx), type = "l")

The S shape is known as sigmoid.

Now let's convert the raw y values into probability values of Bernoulli trials, and then sample a single value for each p:

In [None]:
data1[, px := logistic(yterm_raw)]

The distribution of the raw y values:

In [None]:
hist(data1$yterm_raw)

And the distribution of transformed p values:

In [None]:
hist(data1$px)

Now let's toss a single coin for each p value:

In [None]:
set.seed(10)
data1[, yterm := rbinom(.N, 1, px)]

Let's see the distribution of binomial responses across p values, with some added jitter (random noise) so that the dots are separated from each other:

In [None]:
set.seed(20)
data1 %>%
ggplot(aes(x = px, y = yterm)) +
geom_jitter()

Now let's see the variance of binomial responses across different p values. For this we create 20 bins of p values with a width of 5% each:

In [None]:
data1[, pxc := cut(px, breaks = seq(0, 1, 0.05), labels = seq(0, 1 - 0.05, 0.05))]

In [None]:
data1 %>%
group_by(pxc) %>%
summarise(vary = var(yterm)) %>%
ggplot(aes(x = pxc, y = vary, group = 1)) +
geom_line()

We see that variance is not the same across p values, in line with binomial distribution.

Now let's try to model the binomial responses with the raw x values as the sole predictor using ordinary least squares:

## Modeling Bernoulli Responses with OLS

In [None]:
model_ls1 <- lm(yterm ~ xterm, data1)

By just looking at test diagnostics, we can say we have a significant model and coefficients and we could explain around 36% of the total variation in the binomial response:

In [None]:
summary(model_ls1)

However, the diagnostic plots show that we have neither constant variance nor normally distributed residuals:

In [None]:
plot(model_ls1)

Now let's get the predictions:

In [None]:
pred_ls1 <- predict(model_ls1)

And see their distribution:

In [None]:
hist(pred_ls1)

In [None]:
summary(pred_ls1)

For the binomial response of 0 and 1 values we could have predictions as low as -0.7 and as high as 1.57, not meaningful!

## Logit Transformation

The predictions can be thought of as the p values, where each p value is the probability to get 1 in the response variable.

Since the probability can only be between 0 and 1, while the linear model's prediction can have any values between $-\infty$ and $\infty$, let's make some transformations.

First let's transform the probabilities to odds:

${\displaystyle {\frac {p}{1-p}}}$

In [None]:
odds <- function(x) x / (1-x)

In [None]:
yx <- seq(0.05, 0.95, 1e-2)
plot(yx, odds(yx), type = "l")

Now the odds can take any value between 0 and $\infty$. This is much better, but we still cannot have negative values.

Let's do a second transformation and get the natural logarithm of the odds values:

In [None]:
yx <- seq(0.05, 0.95, 1e-2)
yxod <- odds(yx)
plot(yxod, log(yxod), type = "l")


Now we can have both negative and positive values in the whole continuum.

If we combine the two steps we get the transformation of log odds:

${\displaystyle \operatorname {logit} p=\ln {\frac {p}{1-p}}\quad {\text{for}}\quad p\in (0,1).}$

This function is known as the **logit** function.

(https://en.wikipedia.org/wiki/Logit)

In [None]:
logit <- function(x) log(x / (1-x))

In [None]:
yx <- seq(0.01, 0.99, 1e-2)
plot(yx, logit(yx), type = "l")

And what's the relationship between the logistic and logit functions?

In [None]:
yx <- seq(0.01, 0.99, 1e-2)
plot(yx, logistic(logit(yx)), type = "l")

They are inverses of each other. So:

- `logit` function takes the log odds of a probability value between 0 and 1 and maps into any value in the continuum:

  ${\displaystyle \operatorname {logit} p=\ln {\frac {p}{1-p}}\quad {\text{for}}\quad p\in (0,1).}$

- `logistic` function is the inverse of `logit`: takes a value in the continuum and maps into the (0,1) interval:

${\displaystyle f(x)={\frac {1}{1+e^{-x}}}}$

(https://en.wikipedia.org/wiki/Logistic_function)
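The inverse relationship can also be confirmed numerically. The sketch below redefines the two helper functions so that it runs on its own, and compares them with `plogis()` and `qlogis()`, the logistic distribution function and quantile function from base R's `stats` package, which compute the same transformations. The test grids are arbitrary.

```r
# Hand-written versions, as defined earlier in the notebook
logistic <- function(x) 1 / (1 + exp(-x))
logit    <- function(p) log(p / (1 - p))

p_grid <- seq(0.01, 0.99, 0.01)  # arbitrary probabilities
x_grid <- seq(-5, 5, 0.5)        # arbitrary values on the real line

# Applying one function after the other returns the input
all.equal(logistic(logit(p_grid)), p_grid)
all.equal(logit(logistic(x_grid)), x_grid)

# Base R equivalents
all.equal(logit(p_grid), qlogis(p_grid))
all.equal(logistic(x_grid), plogis(x_grid))
```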

# Generalized Linear Models and Logistic Regression Model

Now let's come to the definition of *generalized linear models*:

> The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.


The GLM consists of three elements:

- A particular distribution for modeling ${\displaystyle Y}$ from among those which are considered exponential families of probability distributions,
- A linear predictor ${\displaystyle \eta =X\beta }$ and
- A link function ${\displaystyle g}$ such that ${\displaystyle \operatorname {E} (Y\mid X)=\mu =g^{-1}(\eta )}$

(https://en.wikipedia.org/wiki/Generalized_linear_model)

In our simple example, 

- Y is supposed to be drawn from a Bernoulli distribution, which is a special case of Binomial distribution
- The link function $g$ that maps the predicted responses $\hat y$ in the $(0, 1)$ interval to the output of the linear predictor $\eta$ in the $(-\infty, \infty)$ interval is the `logit` function.
- The inverse link function $g^{-1}$ that maps the output of the linear predictor $\eta$ in the $(-\infty, \infty)$ interval onto the predicted probability values in the $(0, 1)$ interval is the `logistic` function.

Hence our model will be called *logistic regression*.
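These three elements can be inspected in the family object that R uses for GLMs. The following is a small sketch: `binomial()` and its `linkfun`, `linkinv` and `variance` components are part of base R, while the probability vector `p_vals` is an arbitrary illustration.

```r
# The binomial family object bundles the link, its inverse and the variance function
fam <- binomial(link = "logit")

p_vals <- c(0.1, 0.5, 0.9)        # arbitrary probabilities (assumption)

eta <- fam$linkfun(p_vals)        # link g: probabilities -> log-odds
eta
fam$linkinv(eta)                  # inverse link: log-odds -> probabilities
fam$variance(p_vals)              # variance as a function of the mean: p * (1 - p)
```

Note how the variance is not a separate parameter here but a function of the mean, as discussed for the Binomial distribution above.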

However we cannot use the OLS method to estimate the model parameters as we saw above. We should follow a different approach.

## Maximum Likelihood Estimation

Let's now calculate the likelihood values $\displaystyle {\mathcal {L}}$ of the binomial responses given the predicted probabilities made by estimated parameters.

Note that for any 0 and 1 value, for a given p, the probability mass function of Bernoulli distribution is:

${\displaystyle f(k;p)={\begin{cases}p&{\text{if }}k=1,\\q=1-p&{\text{if }}k=0.\end{cases}}}$

or

${\displaystyle f(k;p)=p^{k}(1-p)^{1-k}\quad {\text{for }}k\in \{0,1\}}$

(https://en.wikipedia.org/wiki/Bernoulli_distribution)

Since we don't know the true model, we try to estimate the model parameters. Let's start with some arbitrary parameter values:

In [None]:
b00 <- -8
b10 <- 4

Let's calculate the fitted probabilities (p values):

In [None]:
yfit1 <- logistic(b00 + b10 * data1$xterm)

And calculate the Binomial likelihood of the values:

In [None]:
lh1 <- dbinom(data1$yterm, 1, yfit1)

In [None]:
hist(yfit1)

That's basically, taking the fitted p values for observations that correspond to 1's, and taking $1 - p$ for observations that correspond to 0's:

In [None]:
lh1b <- ifelse(data1$yterm, yfit1, 1 - yfit1)

In [None]:
identical(lh1, lh1b)

However in order to get the joint probability of individual likelihoods, we must take the product of these values, in order to calculate the likelihood of the estimated model:

In [None]:
lhm1 <- prod(lh1)

For 10,000 values, the product of these probabilities is smaller than the smallest number that can be represented in floating point (it underflows), so we practically get 0:

In [None]:
lhm1

In order to solve this we get the logs of each value and sum them up using the product rule of logarithms:

${\textstyle \log _{b}(xy)=\log _{b}x+\log _{b}y}$

In [None]:
llh1 <- sum(log(lh1))
llh1

Since logarithm of a value in the (0, 1) interval is negative, the log-likelihood of the estimated model is also negative.
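The underflow and its remedy can be reproduced on a small self-contained example. In this sketch the fitted probabilities `p_fit` and responses `y_sim` are simulated placeholders, not the notebook's data. The `log = TRUE` argument of `dbinom()` returns log-probabilities directly, which avoids taking the log of very small numbers.

```r
set.seed(2)
p_fit <- runif(1e4, 0.2, 0.8)    # hypothetical fitted probabilities
y_sim <- rbinom(1e4, 1, p_fit)   # simulated 0/1 responses

lik <- dbinom(y_sim, 1, p_fit)   # individual likelihoods

prod(lik)                        # the product underflows to 0
sum(log(lik))                    # the sum of logs stays finite
sum(dbinom(y_sim, 1, p_fit, log = TRUE))  # same value, computed on the log scale
```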

As we did with the SSE values in OLS, let's write a function to calculate the log-likelihood of the model with a set of parameter estimates:

In [None]:
llhx <- function(b0x, b1x, datax)
{
    datax <- copy(datax)
    datax[, y_fit := logistic(b0x + b1x * xterm)]
    datax[, sum(log(ifelse(yterm, y_fit, 1 - y_fit)))]
}

In [None]:
b0s <- seq(-6, 2, 0.5)
b1s <- seq(-1, 4, 0.5)
param_dt <- crossing(b0 = b0s, b1 = b1s)
setDT(param_dt)
param_dt[, ind := .I]
param_dt[, llh := llhx(b0, b1, data1), by = ind]
param_dt[, maxllh := max(llh)]

In [None]:
plot_ly() %>% 
      add_trace(data = param_dt,  x = ~b0, y = ~b1, z = ~llh, type="mesh3d") %>%
      add_trace(data = param_dt,  x = ~b0, y = ~b1, z = ~maxllh, type="mesh3d") %>%
        layout(autosize = F, width = 800, height = 800)

Here, since log-likelihood function is concave, we will try to estimate the parameter set that maximizes the log-likelihood such that we get the value on the blue surface where it touches the orange plane.

The maximum of the log-likelihood function has no closed-form solution, so we cannot calculate it directly; however, we can follow an iterative approach.

The method to be used is *iteratively reweighted least squares* (IRLS), which is equivalent to maximizing the log-likelihood of a Bernoulli distributed process using Newton's method.

(https://en.wikipedia.org/wiki/Logistic_regression)

So what is the Newton-Raphson method?

At an extremum, the slope of a curve becomes 0:

![maxima](../imagesba/maxima.svg)

(https://www.mathsisfun.com/calculus/maxima-minima.html)

So the main task is to find the root (the x value where y becomes 0) of the slope, that is, of the first derivative or gradient of the log-likelihood function.

We can start at an arbitrary value at the curve of the first derivative:

- Draw a tangent line to the derivative curve,
- Get the point where the tangent intersects the x axis
- And get to the y point of that x coordinate on the derivative curve
- Repeat the above 3 steps until the y value on the curve converges to 0.
- The final x value is the root of the derivative curve - where it becomes 0 - and also the point where the original curve reaches its extrema

![newton-raphson](../imagesba/newton_raphson.gif)

(https://medium.com/@ruhayel/an-intuitive-and-physical-approach-to-newtons-method-86a0bd812ec3)

The general equation for the iterative method to find the root is:

${\displaystyle x_{1}=x_{0}-{\frac {f(x_{0})}{f'(x_{0})}}}$

That is, the next x value is obtained by subtracting the ratio of the function value to its slope from the previous x value.

In our case the function whose root we are looking for is the gradient of the log-likelihood $l$, so the update becomes:

$\displaystyle \beta^{t+1} = \beta^{t} - \frac {\nabla_{\beta}l(\beta^t)}{\nabla_{\beta\beta}l(\beta^t)}$

where:

- $\beta^{t+1}$ is the value of parameters in the next iteration
- $\beta^{t}$ is the value of parameters in the last iteration
- $\nabla_{\beta}l(\beta^t)$ is the gradient (first order derivative) of the log-likelihood function w.r.t. $\beta$, evaluated at $\beta^t$, also known as the *score function*
- $\nabla_{\beta\beta}l(\beta^t)$ is the Hessian (second order derivative) matrix of the log-likelihood function w.r.t. $\beta$, evaluated at $\beta^t$, negative of which is also known as the *observed information matrix*. With more than one parameter, dividing by the Hessian means multiplying by its inverse.

(https://arunaddagatla.medium.com/maximum-likelihood-estimation-in-logistic-regression-f86ff1627b67)

(https://en.wikipedia.org/wiki/Generalized_linear_model#Maximum_likelihood)
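As a one-parameter illustration of the Newton-Raphson update, consider a logistic model with only an intercept $b$. This is a sketch under stated assumptions: the data `y_sim`, the starting value and the tolerance are made up for the example. For this model the gradient of the log-likelihood is $\sum (y_i - p)$ and the second derivative is $-n\,p(1-p)$ with $p = 1/(1+e^{-b})$, and the maximum is known in closed form as the logit of the sample mean, which lets us check the result.

```r
set.seed(3)
y_sim <- rbinom(500, 1, 0.3)   # simulated 0/1 responses (assumption)

b <- 0                         # arbitrary starting value
for (i in 1:25) {
  p     <- 1 / (1 + exp(-b))               # current fitted probability
  grad  <- sum(y_sim - p)                  # first derivative of the log-likelihood
  hess  <- -length(y_sim) * p * (1 - p)    # second derivative of the log-likelihood
  b_new <- b - grad / hess                 # Newton-Raphson update
  converged <- abs(b_new - b) < 1e-10
  b <- b_new
  if (converged) break
}

# Newton-Raphson estimate vs. the closed-form solution logit(mean(y))
c(newton = b, closed_form = qlogis(mean(y_sim)))
```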

So after the initial guess of parameter values at the first iteration, each next iteration is calculated by:

$\displaystyle \hat \beta_{t+1} = \hat \beta_{t} + (X^TW_{t}X)^{-1}X^T(y - \hat y_{t})$

where:

- $X$ is the matrix of predictor values and $X^T$ its transpose
- $y$ is the vector of actual response values
- $\hat y_{t}$ is the vector of predicted values which are basically ${\displaystyle \mathbf {\hat y_t} =\sigma(\mathbf {X} {\boldsymbol {\hat \beta_t }}})$ and $\sigma$ is the logistic function.
- $W_{t}$ is the diagonal matrix in which the entries are the $\hat y_t(1 - \hat y_t)$ values.

(https://www.statlect.com/fundamentals-of-statistics/logistic-model-maximum-likelihood)

Let's implement one step of this *iteratively reweighted least squares* method in a function:

In [None]:
irls <- function(bt, y, x)
{
    yt <- logistic(x %*% bt)
    wt <- diag(as.vector(yt * (1 - yt)))
    bt + solve(t(x) %*% wt %*% x) %*% t(x) %*% (y - yt)
}
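The function above builds an $n \times n$ diagonal weight matrix, which becomes large for 10,000 observations. A lighter variant is sketched below. The name `irls_light`, the simulated data and the stopping rule are assumptions made for this illustration. It uses the fact that multiplying the rows of `x` by the weights and calling `crossprod()` gives the same $X^TWX$ product.

```r
logistic <- function(x) 1 / (1 + exp(-x))

# One IRLS step without building the n x n weight matrix
irls_light <- function(bt, y, x) {
  yt <- as.vector(logistic(x %*% bt))  # fitted probabilities
  wt <- yt * (1 - yt)                  # diagonal entries of W
  # crossprod(x * wt, x) equals t(x) %*% diag(wt) %*% x
  bt + solve(crossprod(x * wt, x), crossprod(x, y - yt))
}

# Simulated data with the same structure as in the notebook (smaller n)
set.seed(4)
x_sim <- rnorm(2000, 2, 1)
y_sim <- rbinom(2000, 1, logistic(-4 + 2 * x_sim))
x_mat <- cbind(1, x_sim)

b <- matrix(0, 2, 1)                   # arbitrary starting values
for (i in 1:25) {
  b_new <- irls_light(b, y_sim, x_mat)
  done  <- sum(abs(b_new - b)) < 1e-8
  b <- b_new
  if (done) break
}

# Compare with the estimates from glm()
glm_coef <- unname(coef(glm(y_sim ~ x_sim, family = binomial)))
cbind(irls = as.vector(b), glm = glm_coef)
```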

Combine the initial parameters into a matrix:

In [None]:
binit <- as.matrix(c(b00, b10))
binit

And create matrices for x and y terms:

In [None]:
yx <- as.matrix(data1$yterm)

In [None]:
xmat <- cbind(1, data1$xterm)

Let's create an empty list to collect parameter values in each iteration:

In [None]:
param_l <- list()

Let's set a precision value so that the algorithm stops when difference between the parameter values in each subsequent iteration is below that precision and hence our algorithm converged:

In [None]:
prec <- 1e-6

Let's initiate the iterations

In [None]:
iter <- 1

Assign the initial parameter values into the parameter value of the iteration:

In [None]:
biter <- binit

Initiate a matrix of arbitrarily large parameter values so that in the initial iteration we ensure that we haven't converged:

In [None]:
bprev <- as.matrix(c(Inf, Inf))

Let's save the initial parameters as the first item of the list:

In [None]:
param_l[[iter]] <- biter

Let's run the iterations:

In [None]:
while (sum(abs(biter - bprev)) > prec) # as long as not converged
{
    iter <- iter + 1 # increment iterations
    bprev <- biter # save the last parameters
    biter <- irls(bprev, yx, xmat) # calculate next parameters
    param_l[[iter]] <- biter # save the next parameters in the list
}

We have done only eight iterations for the convergence:

In [None]:
iter

We see that we have converged sufficiently to the true parameter values of -4 and 2:

In [None]:
t(do.call(cbind, param_l))

Now for each iteration, let's calculate the predictions:

In [None]:
pred_l <- lapply(param_l, function(x) logistic(xmat %*% x))

And calculate the log-likelihood of the estimated model in each iteration:

In [None]:
llh_l <- sapply(pred_l, function(x) sum(log(ifelse(data1$yterm, x, 1 - x))))

In [None]:
llh_l

As the parameter values converged, the log-likelihood also converged to its maximum value.

Now let's combine actual y values, predicted y values - the probabilities - and the iteration counts into a single table. Note that in order to make the interactive plots lighter, only the first 500 data points are kept:

In [None]:
pred_dt <- mapply(function(a, x, y, z) data.table(xterm = a[1:5e2], ypred = as.vector(x)[1:5e2], yterm = y[1:5e2], iter = z), list(data1$xterm), pred_l, list(data1$yterm), seq_along(pred_l), SIMPLIFY = F) %>% rbindlist

In [None]:
head(pred_dt)

Let's see how the model predictions, shown by the S curves, converge to the true model over the iterations. A random jitter is added to the actual values:

In [None]:
set.seed(30)
p1 <- pred_dt %>%
mutate_at("iter", as.factor) %>%
filter(iter == 1) %>%
ggplot(aes(x = xterm, y = yterm)) +
geom_jitter(height = 0.1) +
geom_line(data = pred_dt %>% mutate_at("iter", as.factor),
          aes(x = xterm, y = ypred, color = iter))

In [None]:
if (T) ggplotly(p1)

Or animate the iterations with random jitters again:

In [None]:
set.seed(40)
if (T)
{
    pred_dt %>%
    mutate(ytermjit = jitter(yterm, amount = 0.15)) %>%
    arrange(iter, xterm) %>%
    plot_ly(x = ~xterm) %>%
    add_trace(y = ~ytermjit, type = "scatter", mode = "markers") %>%
    add_trace(y = ~ypred, frame = ~iter, type = 'scatter', mode = "lines") %>%
    animation_opts(
        frame = 500, redraw = T, easing = "linear", mode = "next"
    )
}

In [None]:
gc()

## glm() function

We don't have to go through these iterative steps manually: we can use the built-in `glm()` function for generalized linear models. For logistic regression we pass *binomial* as the model family:

In [None]:
model_glm <- glm(yterm ~ xterm, data1, family = "binomial")

Let's see the model results:

In [None]:
model_glm_sum <- summary(model_glm)

In [None]:
model_glm_sum

### Deviances and Likelihood Ratio Test (LRT)

First of all let's check whether our model is doing significantly better than the null model - with only intercept - given the number of parameters added.

First let's calculate the *null deviance* value which is -2 times the log-likelihood of the null model. So we will just calculate the mean of the response variable as the probability of success and calculate the log-likelihood:

In [None]:
nulld1 <- -2 * sum(log(ifelse(data1$yterm == 1, mean(data1$yterm), 1 - mean(data1$yterm))))
nulld1

And compare with the null deviance from model output:

In [None]:
nulld2 <- model_glm_sum$null.deviance
nulld2

They are the same!

The degrees of freedom is $n - 1$ since we only calculate the mean.

In [None]:
data1[, .N] - 1

In [None]:
dfnull <- model_glm_sum$df.null
dfnull

And now, let's calculate the residual deviance, -2 times the log-likelihood of the estimated model. So we extract the fitted values:

In [None]:
residd1 <- -2 * sum(log(ifelse(data1$yterm == 1, model_glm$fitted, 1 - model_glm$fitted)))
residd1

In [None]:
residd2 <- model_glm_sum$deviance
residd2

Again the same.

df is n - 2 since we estimate two parameters b0 and b1:

In [None]:
data1[, .N] - 2

In [None]:
dfmodel <- model_glm_sum$df.residual
dfmodel

Now we will conduct a likelihood ratio test. The test statistic is the difference in deviance values and is $\chi^2$ distributed, with degrees of freedom equal to the difference in df values:

In [None]:
pchisq(nulld1 - residd1, dfnull - dfmodel, lower.tail = F)

We can also do the same using `anova` function. Note that simpler model is compared to the complex model, so the first parameter is passed as the null model or intercept only model:

In [None]:
anova(update(model_glm, . ~ 1), model_glm, test = "LRT")

So our model does better than the null model.
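The same test statistic can be written directly in terms of log-likelihoods. This is a self-contained sketch on simulated data (the variables `x_sim`, `y_sim` and the model names are assumptions for the illustration), using `logLik()` from base R:

```r
set.seed(5)
x_sim <- rnorm(1000, 2, 1)
y_sim <- rbinom(1000, 1, plogis(-4 + 2 * x_sim))

m_null <- glm(y_sim ~ 1, family = binomial)      # intercept-only model
m_full <- glm(y_sim ~ x_sim, family = binomial)  # model with one predictor

# LR statistic: twice the difference in log-likelihoods
lr_stat <- as.numeric(2 * (logLik(m_full) - logLik(m_null)))

# It equals the difference between null and residual deviance
c(lr_stat = lr_stat, deviance_diff = m_full$null.deviance - m_full$deviance)

# One extra parameter in the full model, hence 1 degree of freedom
lr_pval <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
lr_pval
```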

### Akaike Information Criterion (AIC)

Akaike information criterion can be used in order to compare models with different number of parameters:

${\displaystyle \mathrm {AIC} \,=\,2k-2\ln({\hat {L}})}$

(https://en.wikipedia.org/wiki/Akaike_information_criterion)

Since we have two parameters in our model:

In [None]:
2 * nrow(model_glm_sum$coefficients) + residd1

In [None]:
model_glm_sum$aic

On its own, an AIC value has little meaning; it is useful for comparing models fitted to the same data, where a lower value is preferred. For comparing nested models, the likelihood ratio test can also be used.
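A comparison of two candidate models with AIC could look like the sketch below. The data are simulated and the extra predictor `z_sim` is pure noise added only for illustration; `AIC()` and `logLik()` are base R functions.

```r
set.seed(6)
x_sim <- rnorm(1000, 2, 1)
z_sim <- rnorm(1000)                              # noise, unrelated to the response
y_sim <- rbinom(1000, 1, plogis(-4 + 2 * x_sim))

m_small <- glm(y_sim ~ x_sim, family = binomial)
m_large <- glm(y_sim ~ x_sim + z_sim, family = binomial)

# AIC = 2k - 2 * log-likelihood, with k the number of estimated parameters
aic_manual <- 2 * length(coef(m_small)) - 2 * as.numeric(logLik(m_small))
c(manual = aic_manual, builtin = AIC(m_small))

# Side-by-side comparison of the two models
AIC(m_small, m_large)
```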

### Significance of Coefficients

Now let's come to the coefficients:

In [None]:
model_glm_coef <- tidy(model_glm)
setDT(model_glm_coef)

In [None]:
model_glm_coef

To get the standard errors, the inverse of the observed information matrix - negative of the Hessian matrix or second derivative of log-likelihood w.r.t $\beta$ - is calculated: 

In [None]:
solve(t(xmat) %*% diag(model_glm$fitted * (1 - model_glm$fitted)) %*% xmat)

This is the variance covariance matrix of the parameters:

In [None]:
vcov(model_glm)

So basically we take the diagonal of the variance-covariance matrix and get its square root:

In [None]:
sqrt(diag(vcov(model_glm)))

These are the same values reported in the coefficients table.

And test statistics are calculated by dividing the coefficients with their standard errors:

In [None]:
model_glm_coef[, estimate / std.error]

To assess their significance a Wald test is conducted and for logistic model, the estimated parameter values are assumed to follow a normal distribution in maximum likelihood estimation. So basically we conduct a z-test using normal distribution instead of t-test using Student's t-distribution:

In [None]:
2 * pmin(pnorm(model_glm_coef$statistic), 1 - pnorm(model_glm_coef$statistic))

Basically we have a highly significant p-value, practically zero.
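The Wald calculations can be collected in one place and extended to confidence intervals. The following sketch uses simulated data (an assumption, so that it runs on its own). `confint.default()` from base R returns intervals based on the same normal approximation, and the upper-tail form of `pnorm()` avoids the rounding that `1 - pnorm(z)` suffers for large statistics.

```r
set.seed(7)
x_sim <- rnorm(1000, 2, 1)
y_sim <- rbinom(1000, 1, plogis(-4 + 2 * x_sim))
m_sim <- glm(y_sim ~ x_sim, family = binomial)

est <- coef(m_sim)
se  <- sqrt(diag(vcov(m_sim)))   # standard errors from the variance-covariance matrix
z   <- est / se                  # Wald z statistics

# Two-sided p-values from the standard normal distribution
p_wald <- 2 * pnorm(abs(z), lower.tail = FALSE)

# 95% Wald confidence intervals: estimate +/- 1.96 * standard error
ci_wald <- cbind(lower = est - qnorm(0.975) * se,
                 upper = est + qnorm(0.975) * se)

cbind(estimate = est, std.error = se, z = z, p = p_wald, ci_wald)
confint.default(m_sim)           # same intervals from base R
```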

### Interpretation of Coefficients

Contrary to multiple linear regression, in which the effect of coefficients on the response variable can be interpreted easily in their original scales, the interpretation of the coefficients in a logistic regression model is trickier:

In [None]:
model_glm_coef

Since the link function is logit (and the inverse link function is logistic), the output of the linear combination is interpreted as log-odds.

In terms of log-odds we can say that:

- The intercept term is -3.95, hence when predictor values are set to zero, the log-odds takes a value of -3.95
- For every 1 increase in xterm, the log-odds increase by 1.96

We can also interpret the coefficients in terms of odds:

In [None]:
exp(model_glm_coef$estimate)

- When predictor values are set to zero, the odds take a value of 0.019
- For every 1 increase in xterm, the odds are multiplied by 7.096, that is, they increase by 609.6% ((7.096 - 1) × 100)

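Since the relationship between the coefficients and the probability values is not linear, the effect of a coefficient on the predicted probability is not constant: it depends on the starting values of the predictors, so it cannot be summarized by a single number.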
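A short numerical sketch makes this concrete. It uses the true parameter values of the simulation (-4 and 2) rather than the estimates, and the starting points `x_vals` are arbitrary. A unit increase in the predictor changes the probability by a different amount at each starting point, while the odds ratio stays the same.

```r
b0 <- -4                      # true intercept used in the simulation
b1 <- 2                       # true slope used in the simulation
x_vals <- c(0, 1, 2, 3)       # arbitrary starting values of the predictor

odds <- function(p) p / (1 - p)

p_before <- plogis(b0 + b1 * x_vals)        # probability at x
p_after  <- plogis(b0 + b1 * (x_vals + 1))  # probability at x + 1

data.frame(
  x          = x_vals,
  p_before   = p_before,
  p_after    = p_after,
  p_change   = p_after - p_before,              # varies with x
  odds_ratio = odds(p_after) / odds(p_before)   # constant, equal to exp(b1)
)
```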

# Classification Performance

In the synthetic data we created, the response variable can take only one of the two values: 0 and 1. So our model is basically trying to predict one of these two values for each observation, a task known as *classification*.

The performance of a classification model can be assessed on how well the classes are predicted.

Two basic methods are conducted to assess the classification performance:

- Confusion matrix
- Receiver Operating Characteristic Curve (ROC)

## Confusion Matrix

In order to understand a confusion matrix, let's first define a positive and a negative class:

- A *negative* class shows the absence of an effect, so basically we can liken it to a null hypothesis. Negative case is usually the majority of the observations. For example in the context of a medical test, negative cases are the ones where a condition is not found, in the case of spam detection a negative case is when the mail is NOT spam.

- A *positive* class shows an effect and usually comprises a minority of the total observations. That is the class of interest. The observations where a medical condition is found, a spam mail or a defaulted loan are examples of positive classes.


We can make either a true or a false prediction for each case:
- A *true* prediction is made when the actual and predicted classes are the same
- A *false* prediction is made when the actual and predicted classes are different

A confusion matrix of a variable with two classes has four parts:


<table class="wikitable" style="border:none; background:inherit;color:inherit; text-align:center;" align="center">
<tbody><tr>
<td rowspan="2" style="border:none;">
</td>
<td style="border:none;">
</td>
<td colspan="2" style="background:#4ad2d260;color:inherit;"><b>Predicted condition</b>
</td></tr>
<tr>
<td style="background:#c9c9c950;color:inherit;"><a href="/wiki/Statistical_population" title="Statistical population">Total population</a> <br><span style="white-space:nowrap;">= P + N</span>
</td>
<td style="background:#78ffff60;color:inherit;"><b>Positive (PP)</b>
</td>
<td style="background:#1da5a560;color:inherit;"><b>Negative (PN)</b>
</td></tr>
<tr>
<td rowspan="2" class="nowrap ts-vertical-header is-valign-middle" style="background:#d2d23d60;color:inherit;"><div style=""><b>Actual condition</b></div>
</td>
<td style="background:#ffff7860;color:inherit;"><b>Positive (P)</b>
</td>
<td style="background:#78ff7860;color:inherit;"><b><a href="/wiki/True_positive" class="mw-redirect" title="True positive">True positive</a> (TP) <br></b>
</td>
<td style="background:#ffa5a560;color:inherit;"><b><a href="/wiki/False_negative" class="mw-redirect" title="False negative">False negative</a> (FN) <br></b>
</td></tr>
<tr>
<td style="background:#a5a51d60;color:inherit;"><b>Negative (N)</b>
</td>
<td style="background:#ff787860;color:inherit;"><b><a href="/wiki/False_positive" class="mw-redirect" title="False positive">False positive</a> (FP) <br></b>
</td>
<td style="background:#3dd23d60;color:inherit;"><b><a href="/wiki/True_negative" class="mw-redirect" title="True negative">True negative</a> (TN) <br></b>
</td></tr>
</tbody></table>

(https://en.wikipedia.org/wiki/Confusion_matrix)

The meaning of these parts are:

- True Positive (TP): Correctly classified as the class of interest
- True Negative (TN): Correctly classified as not the class of interest
- False Positive (FP): Incorrectly classified as the class of interest
- False Negative (FN): Incorrectly classified as not the class of interest

(Lantz 2015, Machine Learning with R, Ch 10, p.318)
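The four parts can be counted on a toy example with base R alone. The vectors `toy_actual` and `toy_pred` below are invented for this illustration and are not related to the notebook's data:

```r
# Ten hypothetical observations: actual classes and predicted classes
toy_actual <- c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0)
toy_pred   <- c(1, 0, 1, 0, 0, 1, 0, 1, 0, 0)

# Rows: predicted class, columns: actual class
toy_tab <- table(predicted = toy_pred, actual = toy_actual)
toy_tab

toy_TP <- toy_tab["1", "1"]  # predicted 1, actually 1
toy_FP <- toy_tab["1", "0"]  # predicted 1, actually 0
toy_TN <- toy_tab["0", "0"]  # predicted 0, actually 0
toy_FN <- toy_tab["0", "1"]  # predicted 0, actually 1

c(TP = toy_TP, FP = toy_FP, TN = toy_TN, FN = toy_FN)
```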

Now let's calculate the confusion matrix for our predictions.

We should first get the actual values:

In [None]:
actual1 <- data1$yterm

And then the fitted values can be extracted by:

In [None]:
fit1 <- model_glm$fitted

Or:

In [None]:
fit2 <- predict(model_glm, type = "response")

In [None]:
identical(fit1, fit2)

The second method can also be used to get predictions for a test set.

Note that when the *response* value is not passed to *type* parameter, the linear predictions are returned, which should be converted into probabilities with the `logistic` function:

In [None]:
plot(fit2, logistic(predict(model_glm)))

However these values are continuous between 0 and 1. In order to transform them into fitted classes, a cut point should be taken. The usual cut point is 0.5:

In [None]:
fit_class <- ifelse(fit1 > 0.5, 1, 0)

To get the confusion matrix correctly we first get the contingency table of the fitted and actual values (in this order) and pass it to `confusionMatrix` function from the `caret` package. The positive class is defined as "1":

In [None]:
confmat <- table(fitted = fit_class, actual = actual1) %>% confusionMatrix(positive = "1")

In [None]:
confmat

We can extract metrics from the confusion matrix:

In [None]:
confmat$overall

In [None]:
confmat$byClass

In order to replicate the calculations and interpret their meanings, first assign the four values into their names:

In [None]:
conttab <- table(fitted = fit_class, actual = actual1)

In [None]:
TP <- conttab["1", "1"]
FP <- conttab["1", "0"]
TN <- conttab["0", "0"]
FN <- conttab["0", "1"]

In [None]:
TP
FP
TN
FN

Note that actual negative values are classified either as:

- True Negatives (TN) or
- False Positives (FP)

In [None]:
AN <- TN + FP
AN

And actual positive values are classified either as:

- True Positives (TP) or
- False Negatives (FN)

In [None]:
AP <- TP + FN
AP

The accuracy is the ratio of correctly predicted classes:

In [None]:
(TP + TN) / (AP + AN)

In [None]:
accx <- confmat$overall["Accuracy"]
accx

We could have chosen the majority class and predicted all values as that majority class value, which sets the no-information rate or the null value of accuracy:

In [None]:
max(AP, AN) / (AP + AN)

In [None]:
accn <- confmat$overall["AccuracyNull"]
accn

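The 95% confidence interval of the accuracy rate is based on the binomial distribution. With 10000 observations and an accuracy of 0.7744, the binomial quantiles closely approximate the bounds reported by `confusionMatrix`: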

In [None]:
qbinom(0.025, 10000, 0.7744) / 10000

In [None]:
confmat$overall["AccuracyLower"]

In [None]:
qbinom(1-0.025, 10000, 0.7744) / 10000

In [None]:
confmat$overall["AccuracyUpper"]

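We see that the null accuracy value (the no-information rate) lies outside this interval, so we did significantly better than always selecting the majority class. This can also be confirmed with the p-value, i.e. the probability of getting at least as many correct predictions if the true accuracy were equal to the no-information rate:

In [None]:
# upper tail: P(X >= number of correct predictions) under the no-information rate
pbinom(TP + TN - 1, AP + AN, max(AP, AN) / (AP + AN), lower.tail = FALSE)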

In [None]:
confmat$overall["AccuracyPValue"]
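The interval and the p-value can also be obtained in one step with the exact binomial test in base R. The sketch below assumes 7744 correct predictions out of 10000 and a no-information rate of 0.5047, as implied by the hardcoded numbers above. `confusionMatrix` appears to rely on the same exact test, but treat that as an assumption rather than a documented guarantee:

```r
# Values implied by the hardcoded numbers above (assumed, not read from confmat)
n_total   <- 10000   # number of observations
n_correct <- 7744    # 0.7744 * 10000 correct predictions
nir       <- 0.5047  # no-information rate

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy
ci_exact <- binom.test(n_correct, n_total)$conf.int

# One-sided test: is the accuracy greater than the no-information rate?
p_exact <- binom.test(n_correct, n_total, p = nir,
                      alternative = "greater")$p.value

# The same p-value written as an upper binomial tail, P(X >= n_correct)
p_tail <- pbinom(n_correct - 1, n_total, nir, lower.tail = FALSE)

ci_exact
c(p_exact = p_exact, p_tail = p_tail)
```

Note the `n_correct - 1`: `pbinom` with `lower.tail = FALSE` returns P(X > q), so subtracting one turns it into the "at least as many correct predictions" probability.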

The kappa statistic (labeled Kappa in the previous output) adjusts accuracy by accounting for the possibility of a correct prediction by chance alone.

This is especially important for datasets with a severe class imbalance, because a classifier can obtain high accuracy simply by always guessing the most frequent class.

The kappa statistic will only reward the classifier if it is correct more often than this simplistic strategy.

Kappa values typically range from 0 to a maximum of 1, which indicates perfect agreement between the model's predictions and the true values (negative values are possible when agreement is worse than chance). Values less than one indicate imperfect agreement. Depending on how a model is to be used, the interpretation of the kappa statistic might vary. One common interpretation is shown as follows:
- Poor agreement = less than 0.20
- Fair agreement = 0.20 to 0.40
- Moderate agreement = 0.40 to 0.60
- Good agreement = 0.60 to 0.80
- Very good agreement = 0.80 to 1.00

(Lantz 2015, Machine Learning with R, Ch 10, p.323)

The formula is:

${\displaystyle \kappa ={\frac {2\times (TP\times TN-FN\times FP)}{(TP+FP)\times (FP+TN)+(TP+FN)\times (FN+TN)}}}$

(https://en.wikipedia.org/wiki/Cohen%27s_kappa)
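The formula can be checked against the general definition of kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement (the accuracy) and $p_e$ is the agreement expected by chance. A minimal sketch with hypothetical counts (the `_h` variables are made up and are not the counts of our model):

```r
# Hypothetical cell counts, chosen only for illustration
TP_h <- 40; FP_h <- 10; FN_h <- 20; TN_h <- 30
n_h  <- TP_h + FP_h + FN_h + TN_h

# Kappa from the closed-form binary formula
kappa_formula <- 2 * (TP_h * TN_h - FN_h * FP_h) /
  ((TP_h + FP_h) * (FP_h + TN_h) + (TP_h + FN_h) * (FN_h + TN_h))

# Kappa from the definition: observed vs. chance agreement
p_obs <- (TP_h + TN_h) / n_h
p_exp <- ((TP_h + FP_h) * (TP_h + FN_h) +   # both say "positive" by chance
          (FN_h + TN_h) * (FP_h + TN_h)) /  # both say "negative" by chance
  n_h^2
kappa_def <- (p_obs - p_exp) / (1 - p_exp)

c(formula = kappa_formula, definition = kappa_def)
```

Both routes give the same value, which shows that the closed-form expression is the chance-corrected accuracy written in terms of the four cells.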

In [None]:
confmat$overall["Kappa"]

So we have a moderate agreement.

The **sensitivity** of a model (also called the true positive rate) measures the proportion of actual positive examples that were correctly classified. Therefore, it is calculated as the number of true positives divided by the total number of positives, both correctly classified (the true positives) as well as incorrectly classified (the false negatives). **Recall** measure is calculated the same way, although the interpretation of the metric is different: A model with a high recall captures a large portion of the positive examples, meaning that it has wide breadth. For example, a search engine with a high recall returns a large number of documents pertinent to the search query. Similarly, the SMS spam filter has a high recall if the majority of spam messages are correctly identified.

$\displaystyle \text {Sensitivity} = \frac {TP}{TP + FN}$

In [None]:
TP / (TP + FN)

In [None]:
confmat$byClass["Sensitivity"]

So our model correctly classified 76.7% of actual positive cases.

The **specificity** of a model (also called the true negative rate) measures the proportion of actual negative examples that were correctly classified. As with sensitivity, this is computed as the number of true negatives, divided by the total number of negatives—the true negatives plus the false positives:

$\displaystyle \text {Specificity} = \frac {TN}{TN + FP}$

In [None]:
TN / (TN + FP)

In [None]:
confmat$byClass["Specificity"]

So our model correctly classified 78.2% of actual negative cases.

**Precision** or **Positive predictive value** is defined as the proportion of predicted positive examples that are truly positive; in other words, when a model predicts the positive class, how often is it correct? A precise model will only predict the positive class in cases that are very likely to be positive. It will be very trustworthy.

$\displaystyle \text {Precision} = \frac {TP}{TP + FP}$

In [None]:
TP / (TP + FP)

In [None]:
confmat$byClass["Precision"]

So in our model, 77.5% of our positive predictions are true positives.

**Negative predictive value** is defined as the proportion of predicted negative examples that are truly negative; in other words, when a model predicts the negative class, how often is it correct?

$\displaystyle \text {Negative predictive value} = \frac {TN}{TN + FN}$

In [None]:
TN / (TN + FN)

In [None]:
confmat$byClass["Neg Pred Value"]

So in our model, 77.3% of our negative predictions are true negatives.

Another important metric is the *lift* value: how much better we are at identifying the positive class than random guessing.

Baseline precision is:

In [None]:
bprec <- AP / (AP + AN)
bprec

The precision, or positive predictive value, of our model is:

In [None]:
confmat$byClass["Precision"]

Lift is:

In [None]:
confmat$byClass["Precision"] / bprec

So our model is 1.56 times better at identifying positive classes than random guessing.
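The class-level metrics above all derive from the same four counts, so they can be collected in one small helper. This is only a sketch: `class_metrics` is a hypothetical function (not part of `caret`) and the counts passed to it are made up:

```r
# Hypothetical helper: computes the rates discussed above from the four cells
class_metrics <- function(tp, fp, tn, fn) {
  n         <- tp + fp + tn + fn
  precision <- tp / (tp + fp)
  base_rate <- (tp + fn) / n   # share of actual positives (baseline precision)
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = precision,
    npv         = tn / (tn + fn),
    lift        = precision / base_rate)
}

# Made-up counts, not the counts of our model
m <- class_metrics(tp = 40, fp = 10, tn = 30, fn = 20)
round(m, 3)
```

With these counts 60% of all cases are positive while 80% of the predicted positives are, hence a lift of about 1.33.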

## Receiver Operating Characteristic Curve (ROC)

While the confusion matrix only considers predicted classes, a ROC curve takes into account the predicted (or fitted) probabilities of being in the positive class.

The curve is constructed such that:

- The observations are sorted from the largest predicted probability to the smallest
- Starting from the origin, the curve steps up for each actual positive observation
- For each actual negative observation the curve steps right

The sensitivity is plotted on the y axis and the false positive rate (FPR) on the x axis. Note that the false positive rate is 1 - specificity, or:

$\text {FPR} = \frac{FP}{FP + TN}$
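One common way to trace the curve by hand is to sort the observations from the largest to the smallest predicted probability and accumulate the true and false positive rates as the cut point is lowered. A minimal base R sketch on made-up data (`roc_actual` and `roc_prob` are hypothetical, with no tied probabilities):

```r
# Hypothetical toy data: 3 positives and 3 negatives
roc_actual <- c(1, 1, 0, 1, 0, 0)
roc_prob   <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)

# Sort from the largest predicted probability to the smallest
ord        <- order(roc_prob, decreasing = TRUE)
act_sorted <- roc_actual[ord]

# Each actual positive raises the sensitivity (step up),
# each actual negative raises the false positive rate (step right)
tpr <- c(0, cumsum(act_sorted == 1) / sum(act_sorted == 1))
fpr <- c(0, cumsum(act_sorted == 0) / sum(act_sorted == 0))

plot(fpr, tpr, type = "l", xlab = "False positive rate", ylab = "Sensitivity")
abline(0, 1, lty = 2) # chance diagonal

cbind(fpr, tpr)
```

Each row of the printed matrix corresponds to one cut point: the curve starts at (0, 0), where nothing is classified as positive, and ends at (1, 1), where everything is.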

![ROC](../imagesba/roc.png)

Four ROC curves with different values of the area under the ROC curve (interpreted for a medical study):
- A perfect test (A) has an area under the ROC curve of 1.
- The chance diagonal (D, the line segment from 0, 0 to 1, 1) has an area under the ROC curve of 0.5.
- ROC curves of tests with some ability to distinguish between those subjects with and those without a disease (B, C) lie between these two extremes.
- Test B with the higher area under the ROC curve has a better overall diagnostic performance than test C.

(https://www.researchgate.net/figure/Four-ROC-curves-with-different-values-of-the-area-under-the-ROC-curve-A-perfect-test-A_fig2_8636163)

To draw the ROC curve:

In [None]:
plot.roc(actual1, fit1, legacy.axes = TRUE)

We see that classification performance can be considered high, while not very close to perfect: It is in between the perfect performance and the totally random performance case.

The area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative'). In other words, when given one randomly selected positive instance and one randomly selected negative instance, AUC is the probability that the classifier will be able to tell which one is which.

AUC varies between 0 and 1 — with an uninformative classifier yielding 0.5.

(https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
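This probabilistic interpretation can be verified directly by comparing every positive observation with every negative one. The sketch below uses made-up data (`auc_actual` and `auc_prob` are hypothetical) and cross-checks the result against the area under the hand-built step curve:

```r
# Hypothetical toy data: 3 positives and 3 negatives
auc_actual <- c(1, 1, 0, 1, 0, 0)
auc_prob   <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)

pos <- auc_prob[auc_actual == 1]
neg <- auc_prob[auc_actual == 0]

# Share of (positive, negative) pairs ranked correctly; ties count as one half
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# Cross-check: trapezoidal area under the ROC curve
ord <- order(auc_prob, decreasing = TRUE)
tpr <- c(0, cumsum(auc_actual[ord] == 1) / length(pos))
fpr <- c(0, cumsum(auc_actual[ord] == 0) / length(neg))
auc_area <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

c(pairs = auc_pairs, area = auc_area)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so both calculations give 8/9.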

The possible values can be interpreted as follows:

- A: Outstanding = 0.9 to 1.0
- B: Excellent/good = 0.8 to 0.9
- C: Acceptable/fair = 0.7 to 0.8
- D: Poor = 0.6 to 0.7
- E: No discrimination = 0.5 to 0.6

(Lantz 2015, Machine Learning with R, Ch 10, p.333)

We can calculate the AUC value of the predictions of our model:

In [None]:
auc(actual1, fit1)

So we can consider the AUC value as excellent/good.

# Resources on GLM, Logistic Regression and Classification Assessment

- Agresti and Kateri 2021, Foundations of Statistics for Data Scientists, Ch.7
- James et al. 2023, An Introduction to Statistical Learning, Ch.4
- Kaplan and Pruim, Statistical Modeling: A Fresh Approach, Ch.17
- Roback and Legler 2021, Beyond Multiple Linear Regression, Ch.5,6
- Wood 2017, Generalized Additive Models, Ch.3
- McElreath 2020, Statistical Rethinking, Ch.10 (for Bayesian point of view)
- Lantz 2015, Machine Learning with R, Ch.10 (for classification evaluation)