# Testing the assumptions of logistic regression

## Load the data

In [1]:
penalty <- read.delim("data/penalty.dat", header = TRUE, stringsAsFactors = TRUE)

head(penalty)

| | PSWQ | Anxious | Previous | Scored |
|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<fct>` |
| 1 | 18 | 21 | 56 | Scored Penalty |
| 2 | 17 | 32 | 35 | Scored Penalty |
| 3 | 16 | 34 | 35 | Scored Penalty |
| 4 | 14 | 40 | 15 | Scored Penalty |
| 5 | 5 | 24 | 47 | Scored Penalty |
| 6 | 1 | 15 | 67 | Scored Penalty |


## Fit a model

In [2]:
penaltyModel <- glm(Scored ~ Previous + PSWQ + Anxious, data = penalty, family = binomial())

summary(penaltyModel)


Call:
glm(formula = Scored ~ Previous + PSWQ + Anxious, family = binomial(), 
    data = penalty)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.31374  -0.35996   0.08334   0.53860   1.61380  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -11.49256   11.80175  -0.974  0.33016   
Previous      0.20261    0.12932   1.567  0.11719   
PSWQ         -0.25137    0.08401  -2.992  0.00277 **
Anxious       0.27585    0.25259   1.092  0.27480   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 103.638  on 74  degrees of freedom
Residual deviance:  47.416  on 71  degrees of freedom
AIC: 55.416

Number of Fisher Scoring iterations: 6
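
The null and residual deviances reported above can be combined into an overall model chi-square. The sketch below copies the four numbers from the summary output by hand; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary output above.
nullDev  <- 103.638; nullDf  <- 74
residDev <- 47.416;  residDf <- 71

modelChi <- nullDev - residDev   # improvement in fit over the intercept-only model
chiDf    <- nullDf - residDf     # number of predictors added to the model
pValue   <- pchisq(modelChi, df = chiDf, lower.tail = FALSE)

print(modelChi)
print(chiDf)
print(pValue)
```

A small p-value here would indicate that the three predictors together fit the data better than the intercept-only model.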


## Testing for multicollinearity

In [18]:
library(car)
vif(penaltyModel)

"package 'car' was built under R version 4.0.2"
Loading required package: carData



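**Previous** and **Anxious** have VIFs over 10, which indicates serious multicollinearity: these predictors are so strongly related to the other predictors that their coefficient estimates are unstable and their standard errors are inflated.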
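
To make the VIF less of a black box, here is a minimal sketch that computes it by hand as 1 / (1 - R²), where R² comes from regressing one predictor on all the others. It uses simulated data rather than the penalty data, and the names `sim` and `manual_vif` are hypothetical. For a `glm`, the value reported by `vif()` from `car` may differ somewhat from this predictor-only calculation.

```r
# Minimal sketch: VIF by hand on simulated data (not the penalty data).
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                  # generated independently of x1 and x2
sim <- data.frame(x1, x2, x3)

# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j is from
# regressing predictor j on all remaining predictors.
manual_vif <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- manual_vif(sim)
print(vifs)       # x1 and x2 should be large, x3 close to 1
print(1 / vifs)   # tolerance, the reciprocal of the VIF
```

The tolerance (1 / VIF) carries the same information on a 0 to 1 scale, with values near 0 signalling a problem.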

## Testing for linearity of the logit

Logistic regression assumes that each continuous predictor is linearly related to the log odds (the logit) of the outcome variable (**Scored**).
To test this assumption, we rerun the logistic regression and add, for each predictor, the interaction between that predictor and its own natural log.
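
Written out for a single predictor $x$, the test model looks roughly like this (the symbols $b_0$, $b_1$, $b_2$ are notation introduced here, not taken from the analysis):

$$
\ln\!\left(\frac{P(\text{Scored})}{1 - P(\text{Scored})}\right) = b_0 + b_1 x + b_2 \, x \ln(x)
$$

If the logit really is linear in $x$, the extra term adds nothing and $b_2$ should not differ significantly from zero.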

### Create the interaction terms

In [19]:
penalty$logPSWQInt <- log(penalty$PSWQ) * penalty$PSWQ
penalty$logAnxInt <- log(penalty$Anxious) * penalty$Anxious
penalty$logPrevInt <- log(penalty$Previous) * penalty$Previous

head(penalty)

| | PSWQ | Anxious | Previous | Scored | logPSWQInt | logAnxInt | logPrevInt |
|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 18 | 21 | 56 | Scored Penalty | 52.02669 | 63.93497 | 225.41969 |
| 2 | 17 | 32 | 35 | Scored Penalty | 48.16463 | 110.90355 | 124.43718 |
| 3 | 16 | 34 | 35 | Scored Penalty | 44.36142 | 119.89626 | 124.43718 |
| 4 | 14 | 40 | 15 | Scored Penalty | 36.9468 | 147.55518 | 40.62075 |
| 5 | 5 | 24 | 47 | Scored Penalty | 8.04719 | 76.27329 | 180.95694 |
| 6 | 1 | 15 | 67 | Scored Penalty | 0.0 | 40.62075 | 281.71441 |


### Fit a model and check the significance of the interaction terms

Any interaction that is **significant** indicates that the main effect has **violated** the assumption of linearity of the logit.

In [22]:
testModel <- glm(Scored ~ PSWQ + Anxious + Previous + logPSWQInt + logAnxInt + logPrevInt, data=penalty, family = binomial())
summary(testModel)


Call:
glm(formula = Scored ~ PSWQ + Anxious + Previous + logPSWQInt + 
    logAnxInt + logPrevInt, family = binomial(), data = penalty)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0685  -0.3846   0.1116   0.5460   1.8272  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.87885   14.92410  -0.260    0.795
PSWQ        -0.42233    1.10267  -0.383    0.702
Anxious     -2.64485    2.79702  -0.946    0.344
Previous     1.66601    1.48202   1.124    0.261
logPSWQInt   0.04393    0.29675   0.148    0.882
logAnxInt    0.68077    0.65277   1.043    0.297
logPrevInt  -0.31855    0.31731  -1.004    0.315

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 97.283  on 70  degrees of freedom
Residual deviance: 45.909  on 64  degrees of freedom
  (4 observations deleted due to missingness)
AIC: 59.909

Number of Fisher Scoring iterations: 7


None of the interaction terms is significant (all p > .05), so there is no evidence that the assumption of linearity of the logit has been violated.
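
The summary of the test model reports 4 observations deleted due to missingness, although the original model used every row. A likely cause, assumed here rather than checked against the data, is a predictor value of 0: `log(0)` is `-Inf`, and `-Inf * 0` evaluates to `NaN`, which `glm()` treats as missing. The sketch below shows the effect on a hypothetical vector `x`.

```r
# Minimal sketch: why x * log(x) can produce missing values.
x <- c(0, 1, 5, 18)          # hypothetical predictor values, including a zero
logInt <- log(x) * x         # log(0) is -Inf, and -Inf * 0 is NaN

print(logInt)
print(sum(is.na(logInt)))    # is.na() is TRUE for NaN as well as NA
```

On the real data, `colSums(is.na(penalty[, c("logPSWQInt", "logAnxInt", "logPrevInt")]))` would show which interaction terms contain missing values.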