# Logistic regression

## Introduction

Logistic regression is used when the outcome (dependent variable) has two possible values or categories. Simple linear regression is ill-suited to a binary outcome: its random term $\epsilon$ follows a Gaussian distribution, so the modeled value can lie anywhere in $(-\infty, +\infty)$, whereas a probability must stay between 0 and 1.

Because it is almost always used with two or more independent variables, it should be called _multiple logistic regression_.

| Type of regression   | Type of dependent variable (Y) | Examples                                                      |
| :------------------- | :----------------------------- | :------------------------------------------------------------ |
| Linear               | Continuous (interval, ratio)   | Enzyme activity, renal function, weight etc.                  |
| Logistic             | Binary (dichotomous)           | Death during surgery, graduation, recurrence of cancer etc.   |
| Proportional hazards | Elapsed time to event          | Months until death, quarters in school before graduation etc. |

The odds of an event occurring equal the probability that the event will occur divided by the probability that it will not occur. Every probability can be expressed as odds, and every odds can be expressed as a probability.
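The conversion between probability and odds can be sketched in a few lines. The function names `prob_to_odds` and `odds_to_prob` are illustrative helpers, not part of any library used in this notebook.

```python
def prob_to_odds(p):
    """Convert a probability (0 <= p < 1) to odds."""
    return p / (1 - p)


def odds_to_prob(odds):
    """Convert odds (>= 0) back to a probability."""
    return odds / (1 + odds)


p = 0.2
odds = prob_to_odds(p)       # 0.2 / 0.8, i.e. odds of 1:4
p_back = odds_to_prob(odds)  # recovers the original probability
print(f"p = {p}, odds = {odds:.2f}, back to p = {p_back:.2f}")
```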

The logistic regression model computes the odds from baseline odds and from odds ratios computed for each independent variable:
$$ \text{Odds} = (\text{Baseline Odds}) \times \text{OR}_1 \times \text{OR}_2 \times \dots \times \text{OR}_n $$
The baseline odds answers this question:
> if every independent variable X equaled 0, what are the odds of a particular category?

X=0 is not always a meaningful value, for example for a variable such as age. To make the baseline odds meaningful, we can encode that variable as `Age - 20` so that X=0 corresponds to people who are 20.

For each continuous variable such as the age, the corresponding odds ratio answers the question:
> for each additional year of age, by how much do the odds increase or decrease?

* If the OR associated with age equals 1.0, then age is not related to the outcome
* If OR > 1 then the odds increase by a set percentage for each additional year of age
* If OR < 1 then the odds decrease by a set percentage for each additional year of age

For instance, an OR of 1.03 would mean that the odds of Y increase by 3% as a person grows older by one year. In another example with OR = 0.99, the odds of `suicide.hr` decline by ca. 1% with every additional year. The OR for a 40-year-old compared to a 25-year-old is $1 \times 0.99 \times 0.99 \times \dots \times 0.99 = 0.99^{15} \approx 0.86$, where the exponent equals $40-25=15$. That means a 40-year-old person has about 14% lower odds of being `suicide.hr` than a 25-year-old.
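The compounding of a per-year odds ratio over several years can be written as a small helper. This is only a sketch: `odds_ratio_for_difference` is a hypothetical name, and the value 0.99 is the illustrative OR from the paragraph above, not a fitted estimate.

```python
def odds_ratio_for_difference(or_per_unit, delta):
    """Odds ratio implied by a difference of `delta` units of the predictor."""
    return or_per_unit ** delta


or_40_vs_25 = odds_ratio_for_difference(0.99, 40 - 25)
print(f"OR, 40 vs 25 years: {or_40_vs_25:.2f}")            # about 0.86
print(f"change in odds: {(or_40_vs_25 - 1) * 100:+.0f}%")  # about -14%
```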

The same assumptions and precautions as for multiple regression apply to logistic regression (e.g. at least 5-10 events per variable, not too many independent variables in the model, etc.).

## The equation

$$ Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + ... + \beta_n X_{i,n} $$

with $Y_i$ the **natural log (ln)**, not the common log ($\log_{10}$), of the odds for a particular participant, $\beta_0$ the natural log of the baseline odds, $\beta_1$ the natural log of the odds ratio for the first independent variable, etc. For example:

$$ \ln \frac{\text{p}_\text{suicide.hr=1}}{1-\text{p}_\text{suicide.hr=1}} = \beta_0 + \beta_1 \times \text{duree} + \beta_2 \times \text{abus} $$

In fact, the log-odds are given by the logit function, which maps the probability $p$ of the response variable being `1` from $(0,1)$ to $(-\infty, +\infty)$, with $ \text{logit}(p) = \ln \frac{p}{1-p} = \beta_0 + \beta X $.

The $Y$ value is the natural log of the odds, which can be transformed to a probability. Since a probability implicitly embodies uncertainty, there is no need to explicitly add a random term to the model. Because there is no Gaussian error term, the method of least squares is not used; instead, logistic regression finds the values of the coefficients (and thus of the odds ratios) using _maximum likelihood estimation_ (MLE).
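To make the idea of maximum likelihood more concrete, here is a minimal sketch that maximizes the Bernoulli log-likelihood with Newton-Raphson steps. It is not how `statsmodels` is implemented internally. The data are synthetic, and `true_beta`, the sample size and the seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption for illustration): one binary predictor
n = 2000
x = rng.integers(0, 2, size=n)
X = np.column_stack([np.ones(n), x])   # design matrix with an intercept column
true_beta = np.array([-1.5, 0.8])      # hypothetical "true" coefficients
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))


def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of the logistic model."""
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))


beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))                # current fitted probabilities
    gradient = X.T @ (y - p)                       # score vector
    hessian = -(X * (p * (1 - p))[:, None]).T @ X  # second derivatives
    step = np.linalg.solve(hessian, gradient)
    beta = beta - step                             # Newton-Raphson update
    if np.max(np.abs(step)) < 1e-10:
        break

print("estimated coefficients:", beta)
print("log-likelihood:", log_likelihood(beta, X, y))
```

At the maximum the gradient is zero, which for a single binary predictor means the fitted probabilities equal the observed proportions in each group.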

The odds of the response variable being `1` can be obtained by exponentiating the log-odds $ \frac{p}{1-p} = e^{\beta_0 + \beta X}$, and the probability of the response variable being `1` is given by the logistic function $ p = \frac{1}{1 + e^{-(\beta_0 + \beta X)}} $. The first coefficient $\beta_0$ is always the constant term (intercept) of the model.
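The logit and logistic functions are inverses of each other, which a short NumPy sketch can illustrate. The helper names `logit` and `logistic` are defined here for the example.

```python
import numpy as np


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1 - p))


def logistic(z):
    """Probability corresponding to the log-odds z."""
    return 1 / (1 + np.exp(-z))


p = np.array([0.1, 0.5, 0.9])
z = logit(p)
print(z)            # log-odds; a probability of 0.5 maps to 0
print(logistic(z))  # back to the original probabilities
```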

## Fitting a simple model

We use the dataset from the MOOC 'Introduction à la statistique avec R', which deals with multiple regression.

In [29]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

In [102]:
data = pd.read_csv(
    'https://raw.githubusercontent.com/sbwiecko/intro_statistique_R/master/data/smp2.tsv',
    delimiter='\t',
)
data.head()

Unnamed: 0,age,prof,duree,discip,n.enfant,n.fratrie,ecole,separation,juge.enfant,place,...,subst.cons,scz.cons,char,rs,ed,dr,suicide.s,suicide.hr,suicide.past,dur.interv
0,31.0,autre,4.0,0.0,2.0,4,1.0,0.0,0.0,0.0,...,0,0,1.0,2.0,1.0,1.0,0.0,0.0,0.0,
1,49.0,,,0.0,7.0,3,2.0,1.0,0.0,0.0,...,0,0,1.0,2.0,2.0,1.0,0.0,0.0,0.0,70.0
2,50.0,prof.intermediaire,5.0,0.0,2.0,2,2.0,0.0,0.0,0.0,...,0,0,1.0,2.0,3.0,2.0,0.0,0.0,0.0,
3,47.0,ouvrier,,0.0,0.0,6,1.0,1.0,0.0,1.0,...,0,0,1.0,2.0,2.0,2.0,1.0,0.0,0.0,105.0
4,23.0,sans emploi,4.0,1.0,1.0,6,1.0,1.0,,1.0,...,0,0,1.0,2.0,2.0,2.0,0.0,0.0,1.0,


In [103]:
data_ = data.dropna(subset=['suicide.hr', 'abus'], how='any')
y = data_['suicide.hr']
X = data_['abus']
X = sm.add_constant(X)
model = sm.Logit(y, X) # same as sm.GLM(y, X, family=sm.families.Binomial())
result = model.fit()

Optimization terminated successfully.
         Current function value: 0.494196
         Iterations 5




In [104]:
print(result.summary())

                           Logit Regression Results                           
Dep. Variable:             suicide.hr   No. Observations:                  753
Model:                          Logit   Df Residuals:                      751
Method:                           MLE   Df Model:                            1
Date:                Thu, 12 Aug 2021   Pseudo R-squ.:                 0.02098
Time:                        14:58:16   Log-Likelihood:                -372.13
converged:                       True   LL-Null:                       -380.11
Covariance Type:            nonrobust   LLR p-value:                 6.494e-05
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.6161      0.115    -14.003      0.000      -1.842      -1.390
abus           0.7688      0.190      4.052      0.000       0.397       1.141


In [105]:
result.params

const   -1.616082
abus     0.768785
dtype: float64

In [106]:
result.conf_int()

Unnamed: 0,0,1
const,-1.842275,-1.38989
abus,0.396937,1.140632


In [107]:
OR1 = np.exp(result.params['abus'])  # label-based access; positional [1] is deprecated in recent pandas
CI_OR1 = np.round(np.exp(result.conf_int().loc['abus']), 2)
print(f"Odds ratio for 'abus': {OR1:3.3f}")
print(f"CI of the OR for 'abus': {CI_OR1[0]} - {CI_OR1[1]}")

Odds ratio for 'abus': 2.157
CI of the OR for 'abus': 1.49 - 3.13


In [108]:
data_['abus'].unique()

array([0., 1.])

The independent variable has two possible values. The corresponding odds ratio is 1.0 for `abus=0` (the reference level) and 2.16 for `abus=1`, meaning that a participant with `abus=1`, **but sharing the other attributes**, has a bit more than twice the odds of being `suicide.hr`, with a 95% CI ranging from 1.49 to 3.13. If the null hypothesis (H0) were true, the OR would equal 1.0; since the CI above does not include 1.0, the corresponding P value must be less than 0.05.

In the summary table P < 0.0001, therefore the association between `abus` and `suicide.hr` is statistically significant.
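The z statistic, P value and confidence interval of the odds ratio can be approximately reproduced from the coefficient and its standard error (Wald method). The sketch below uses the rounded values printed in the summary table, so the results may differ slightly in the last digits. It assumes `scipy` is available.

```python
import numpy as np
from scipy.stats import norm

# Rounded values for `abus` copied from the summary table above
coef, se = 0.7688, 0.190

z = coef / se                  # Wald z statistic
p_value = 2 * norm.sf(abs(z))  # two-sided P value
z_crit = norm.ppf(0.975)       # about 1.96 for a 95% CI

ci_low = np.exp(coef - z_crit * se)
ci_high = np.exp(coef + z_crit * se)

print(f"z = {z:.2f}, P = {p_value:.1e}")
print(f"OR = {np.exp(coef):.2f}, 95% CI {ci_low:.2f} - {ci_high:.2f}")
```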

### Analysis of the corresponding contingency table

In [109]:
table = pd.crosstab(data_['abus'], data_['suicide.hr'])
sm.stats.Table2x2(table).summary()

0,1,2,3,4,5
,Estimate,SE,LCB,UCB,p-value
Odds ratio,2.157,,1.487,3.129,0.000
Log odds ratio,0.769,0.190,0.397,1.141,0.000
Risk ratio,1.192,,1.083,1.312,0.000
Log risk ratio,0.175,0.049,0.079,0.272,0.000


The Odds Ratio obtained with the analysis of the contingency table equals the Odds Ratio computed using the logistic regression.
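The equivalence can be checked by hand on any 2x2 table: with a single binary predictor, the intercept is the log-odds in the reference row and the slope is the log of the odds ratio. The counts below are hypothetical and are **not** the counts of this dataset.

```python
import numpy as np

# Hypothetical counts (NOT the real data):
# rows = abus (0, 1), columns = suicide.hr (0, 1)
counts = np.array([[400, 80],
                   [140, 60]])

odds_abus0 = counts[0, 1] / counts[0, 0]  # odds of suicide.hr=1 when abus=0
odds_abus1 = counts[1, 1] / counts[1, 0]  # odds of suicide.hr=1 when abus=1
odds_ratio = odds_abus1 / odds_abus0

intercept = np.log(odds_abus0)  # what a logistic regression would report as const
slope = np.log(odds_ratio)      # what it would report as the abus coefficient

print(f"odds ratio = {odds_ratio:.3f}")
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```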

### Analysis using pingouin

In [110]:
import pingouin as pg

lom = pg.logistic_regression(
    X=X,
    y=y
)

lom.round(3)

Unnamed: 0,names,coef,se,z,pval,CI[2.5%],CI[97.5%]
0,Intercept,-1.616,0.115,-14.003,0.0,-1.842,-1.39
1,abus,0.769,0.19,4.052,0.0,0.397,1.141


### Interpretation of the intercept coefficient

In [111]:
intercept = lom.loc[0, 'coef']

print("Coefficients for intercept (`abus=0`):")
print("--------------------------------------")
print(f"Log-odds\t {intercept:.3f}")
print(f"Odds    \t {np.exp(intercept):.2f}")
print(f"Ratio   \t 1:{1/np.exp(intercept):.0f}")
print(f"Proba   \t {1/(1+np.exp(-(intercept))):.3f}")

Coefficients for intercept (`abus=0`):
--------------------------------------
Log-odds	 -1.616
Odds    	 0.20
Ratio   	 1:5
Proba   	 0.166


In [112]:
coeff_abus = lom.loc[1, 'coef']
np.exp(coeff_abus)

2.1571428116306226

In [113]:
print("Coefficients for `abus=1`:")
print("---------------------------")
print(f"Log-odds \t {intercept + coeff_abus:.3f}")
print(f"Odds     \t {np.exp(intercept + coeff_abus):.2f}")
print(f"Ratio ca.\t 1:{1/np.exp(intercept + coeff_abus):.0f}")
print(f"Proba    \t {1/(1+np.exp(-(intercept + coeff_abus))):.3f}")

Coefficients for `abus=1`:
---------------------------
Log-odds 	 -0.847
Odds     	 0.43
Ratio ca.	 1:2
Proba    	 0.300


In the previous example, the _intercept_ or _constant_ coefficient (-1.616) is the log-odds of `suicide.hr=1` when `abus=0`. Thus, the odds of _suicide when no abus_ are `0.20:1` (about 1:5), i.e. a probability of ca. 16.6%.

The _abus_ coefficient (0.769) means that for each additional unit of _abus_ (although this binary variable has only 2 possible values), the log-odds of _suicide.hr_ increase by 0.769; therefore the odds are multiplied by $e^{0.769}$, i.e. by ca. 2.16. The table below shows the probability of `suicide.hr` for both values of `abus`:

| abus | Log-odds | Odds | Ratio | Proba |
| ---- | -------- | ---- | ----- | ----- |
|  0   |  -1.616  | 0.20 |  1:5  | 0.166 |
|  1   |  -0.847  | 0.43 |  1:2  | 0.300 |

## Fitting a more complex model

In [116]:
data2_ = data.dropna(
    subset=['suicide.hr', 'abus', 'discip', 'duree'], how='any')

In [118]:
y2 = data2_['suicide.hr']
X2 = data2_[['abus', 'discip', 'duree']]
# duree is graduated from 1 to 5
X2 = sm.add_constant(X2)
model2 = sm.Logit(y2, X2)
result2 = model2.fit()
print(result2.summary2())

Optimization terminated successfully.
         Current function value: 0.484782
         Iterations 6
                         Results: Logit
Model:              Logit            Pseudo R-squared: 0.041     
Dependent Variable: suicide.hr       AIC:              541.2607  
Date:               2021-08-12 15:01 BIC:              558.5004  
No. Observations:   550              Log-Likelihood:   -266.63   
Df Model:           3                LL-Null:          -277.97   
Df Residuals:       546              LLR p-value:      4.7043e-05
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     6.0000                                       
-------------------------------------------------------------------
           Coef.    Std.Err.      z      P>|z|     [0.025    0.975]
-------------------------------------------------------------------
const     -0.0246     0.4963   -0.0496   0.9604   -0.9974    0.9482
abus       0.6229     0.2276    2.7363   0.0062    0.1767 



We can already interpret the sign of the log-odds coefficients: a negative coefficient is associated with a decrease in risk, as for example with the explanatory variable `duree`. Measuring the amplitude of the effect requires further manipulation, as shown below.

In [119]:
OR2 = np.exp(result2.params)
OR2

const     0.975680
abus      1.864315
discip    1.695687
duree     0.671249
dtype: float64
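An odds ratio can also be read as a percentage change in the odds per unit increase of the predictor, $(\text{OR} - 1) \times 100$. The sketch below reuses the odds ratios printed above, copied by hand; the `pct_change` dictionary is an illustrative name.

```python
# Odds ratios copied from the output above
odds_ratios = {"abus": 1.864315, "discip": 1.695687, "duree": 0.671249}

# Percentage change in the odds for a one-unit increase of each predictor
pct_change = {name: (value - 1) * 100 for name, value in odds_ratios.items()}

for name, pct in pct_change.items():
    print(f"{name:<7} OR = {odds_ratios[name]:.3f}  change in odds = {pct:+.1f}%")
```

A negative percentage, as for `duree`, corresponds to a negative log-odds coefficient.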

In [63]:
.99**15

0.8600583546412884

### Table of the probability of `suicide.hr=1` for the different `duree` values

In [125]:
lom2 = pg.logistic_regression(
    X=X2,
    y=y2
)

lom2.round(4)

Unnamed: 0,names,coef,se,z,pval,CI[2.5%],CI[97.5%]
0,Intercept,-0.0246,0.4963,-0.0496,0.9604,-0.9974,0.9482
1,abus,0.6229,0.2276,2.7363,0.0062,0.1767,1.0691
2,discip,0.5281,0.2377,2.2219,0.0263,0.0623,0.9939
3,duree,-0.3986,0.1172,-3.4004,0.0007,-0.6284,-0.1689


In [149]:
intercept2  = lom2.loc[0, 'coef']
coeff_duree = lom2.loc[3, 'coef']

# the other predictors are held at their reference value (abus=0, discip=0)
print("`duree`\tLogodds\tOdds\tRatio\tProba")
print("-------------------------------------")

for duree in range(6):  # for values 0 to 5
    log_odds = intercept2 + duree*coeff_duree  # use the intercept of the second model
    print(duree, end='\t')
    print(f"{log_odds:.3f}", end='\t')
    print(f"{np.exp(log_odds):.3f}", end='\t')
    print(f"1:{1/np.exp(log_odds):.0f}", end='\t')
    print(f"{1/(1+np.exp(-log_odds)):.3f}")

`duree`	Logodds	Odds	Ratio	Proba
-------------------------------------
0	-0.025	0.976	1:1	0.494
1	-0.423	0.655	1:2	0.396
2	-0.822	0.440	1:2	0.305
3	-1.220	0.295	1:3	0.228
4	-1.619	0.198	1:5	0.165
5	-2.018	0.133	1:8	0.117
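The same calculation extends to any combination of predictor values. The sketch below uses the rounded coefficients reported by pingouin for the three-predictor model, copied by hand. The helper `predict_proba` and the example profile are hypothetical, chosen only for illustration.

```python
import numpy as np

# Rounded coefficients copied from the pingouin table of the three-predictor model
coefs = {"Intercept": -0.0246, "abus": 0.6229, "discip": 0.5281, "duree": -0.3986}


def predict_proba(abus, discip, duree):
    """Predicted probability of suicide.hr=1 for one profile."""
    log_odds = (coefs["Intercept"]
                + coefs["abus"] * abus
                + coefs["discip"] * discip
                + coefs["duree"] * duree)
    return 1 / (1 + np.exp(-log_odds))


# Hypothetical profile: abus=1, discip=0, duree=3
proba = predict_proba(abus=1, discip=0, duree=3)
print(f"Predicted probability: {proba:.3f}")
```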


## Interactions (synergies) - R-style formula

In [73]:
import statsmodels.formula.api as smf
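
R-style formulas distinguish `:` (interaction term only) from `*` (main effects plus interaction). A quick way to see the difference is to inspect the design matrix columns built by `patsy`, the formula engine that statsmodels has traditionally relied on. The sketch assumes `patsy` is installed, and the toy columns `a` and `b` are made up for illustration.

```python
import pandas as pd
from patsy import dmatrix

# Toy data with hypothetical columns `a` and `b`
toy = pd.DataFrame({"a": [0, 1, 0, 1], "b": [1.0, 2.0, 3.0, 4.0]})

cols_star = dmatrix("a*b", toy).design_info.column_names
cols_long = dmatrix("a + b + a:b", toy).design_info.column_names
cols_colon = dmatrix("a:b", toy).design_info.column_names

print("a*b         ->", cols_star)
print("a + b + a:b ->", cols_long)
print("a:b         ->", cols_colon)
```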

In [123]:
model3 = smf.logit(
    formula="Q('suicide.hr') ~ abus + discip*duree", 
    data=data_
)
# Q() is a way to 'quote' variable names, especially ones that do not otherwise 
# meet Python's variable name rules, such as with a dot in the variable name

# ":" adds a new column to the design matrix with the product of the other two columns
# "*" will also include the individual columns that were multiplied together
# thus "Q('suicide.hr') ~ abus + discip*duree" is equivalent to "Q('suicide.hr') ~ abus + discip + duree + discip:duree"

print(model3.fit().summary2())

Optimization terminated successfully.
         Current function value: 0.484782
         Iterations 6
                         Results: Logit
Model:              Logit            Pseudo R-squared: 0.041     
Dependent Variable: Q('suicide.hr')  AIC:              543.2599  
Date:               2021-08-12 15:03 BIC:              564.8095  
No. Observations:   550              Log-Likelihood:   -266.63   
Df Model:           4                LL-Null:          -277.97   
Df Residuals:       545              LLR p-value:      0.00014651
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     6.0000                                       
------------------------------------------------------------------
               Coef.   Std.Err.     z     P>|z|    [0.025   0.975]
------------------------------------------------------------------
Intercept     -0.0315    0.5512  -0.0571  0.9544  -1.1119   1.0489
abus           0.6229    0.2276   2.7362  0.0062   0.1767   1.