# Logistic Regression

In [1]:
library(ISLR) # load the ISLR package, which contains the "Default" dataset
?Default

Default {ISLR}    R Documentation


In [2]:
summary(Default) #always look at your data first!

 default    student       balance           income     
 No :9667   No :7056   Min.   :   0.0   Min.   :  772  
 Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
                       Median : 823.6   Median :34553  
                       Mean   : 835.4   Mean   :33517  
                       3rd Qu.:1166.3   3rd Qu.:43808  
                       Max.   :2654.3   Max.   :73554  

What do you notice about default rates? Number of students?

## Logistic regression on Continuous (Numerical) Variable

In [3]:
glm.fit <- glm(default~balance,family="binomial",data=Default)
summary(glm.fit)


Call:
glm(formula = default ~ balance, family = "binomial", data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2697  -0.1465  -0.0589  -0.0221   3.7589  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
balance      5.499e-03  2.204e-04   24.95   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1596.5  on 9998  degrees of freedom
AIC: 1600.5

Number of Fisher Scoring iterations: 8


## Deviances, Log-likelihoods, and Likelihood Ratio Tests
For a binary (0/1) response such as this one, the "residual deviance" is negative 2 times the maximized log-likelihood: $-2l(\hat\beta)$. This gives a measure of fit for the model, where lower is better.

The "null deviance" is negative 2 times the maximized log-likelihood for the model with only an intercept.
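As a check on this definition, the null deviance can be reproduced by hand. The sketch below assumes only the counts from the summary(Default) output above (333 defaults out of 10,000 observations) and relies on the fact that an intercept-only logistic model assigns the overall sample proportion to every observation. The variable names are illustrative.

```r
# Counts taken from the summary(Default) output above
n_yes <- 333
n_no  <- 9667

# An intercept-only model predicts the sample default rate for everyone
p_hat <- n_yes / (n_yes + n_no)

# Maximized log-likelihood of the intercept-only (null) model
loglik_null <- n_yes * log(p_hat) + n_no * log(1 - p_hat)

# Null deviance = -2 * maximized log-likelihood
null_deviance <- -2 * loglik_null
print(round(null_deviance, 1))
```

The result should agree with the null deviance of 2920.6 printed in the glm summary.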

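Comparing these two numbers via a *likelihood ratio test* tells us whether the model fits significantly better than the intercept-only model.

$LR$ = Null Deviance - Residual Deviance

Under the null hypothesis that the added coefficients are all zero, $LR$ is approximately chi-squared distributed with $r$ degrees of freedom, where $r$ is the number of predictors added to the intercept-only model:

$LR\sim \chi^2(r)$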
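To make the test concrete, here is a minimal sketch that plugs in the deviances printed by summary(glm.fit) for the balance model. It takes $r$ to be the difference in residual degrees of freedom (9999 - 9998 = 1), and the 5% significance level is an illustrative choice, not something fixed by the text.

```r
# Deviances and residual degrees of freedom from summary(glm.fit) above
null_dev  <- 2920.6; null_df  <- 9999
resid_dev <- 1596.5; resid_df <- 9998

LR <- null_dev - resid_dev   # likelihood ratio statistic
r  <- null_df - resid_df     # one extra parameter (the balance slope)

# 5% critical value of the chi-squared distribution with r degrees of freedom
crit <- qchisq(0.95, df = r)

print(c(LR = LR, r = r, critical_value = crit))
```

The statistic is far larger than the critical value, so the intercept-only model is rejected at the 5% level.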

In [4]:
names(summary(glm.fit))

In [5]:
# use the full component name rather than relying on partial matching of $null
summary(glm.fit)$null.deviance - summary(glm.fit)$deviance

To run a likelihood ratio test directly, you can use lrtest() from the lmtest package.

In [6]:
library(lmtest)
lrtest(glm.fit)

Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric



  #Df      LogLik   Df     Chisq     Pr(>Chisq)
    2   -798.2258
    1  -1460.3249   -1  1324.198  6.232869e-290


The first line reports the "degrees of freedom" of the fitted model (how many parameters it has) and its log-likelihood.

The second line reports the "degrees of freedom" of the null model, its log-likelihood, the LR test statistic, and the p-value for this test.

Note that the LR test statistic is what we just calculated above.

Since the p-value is so low, we can reject the null model (i.e. balance has explanatory power).
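The numbers in this table can be reproduced with base R alone. The sketch below copies the two log-likelihoods from the lrtest() output and assumes one degree of freedom (the single added parameter, balance); it is a hand check, not a replacement for lrtest().

```r
# Log-likelihoods copied from the lrtest() output above
loglik_model <- -798.2258    # default ~ balance
loglik_null  <- -1460.3249   # intercept only

# LR statistic: twice the gain in log-likelihood
LR <- 2 * (loglik_model - loglik_null)

# Upper-tail chi-squared probability with 1 degree of freedom
p_value <- pchisq(LR, df = 1, lower.tail = FALSE)

print(LR)
print(p_value)
```

Both values should match the Chisq and Pr(>Chisq) columns up to rounding of the copied log-likelihoods.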

### Predicted probability

In [7]:
newdata <- data.frame(balance=1000)
predict(glm.fit,type="response",newdata)
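To see what predict() is doing with type="response", the same probability can be computed by hand from the coefficient table. This sketch uses the rounded estimates printed above for the balance-only model, so it may differ from predict() in the last few digits.

```r
# Rounded estimates from the coefficient table of default ~ balance
b0 <- -1.065e+01   # intercept
b1 <-  5.499e-03   # balance

# Linear predictor (log-odds) at a balance of 1000
log_odds <- b0 + b1 * 1000

# Inverse logit, written out and via the built-in plogis()
p_manual <- 1 / (1 + exp(-log_odds))
p_plogis <- plogis(log_odds)

print(c(log_odds = log_odds, p_manual = p_manual, p_plogis = p_plogis))
```

The predicted default probability at a balance of 1000 comes out below 1%.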

## Logistic Regression on Categorical (Dummy) Variable

In [8]:
glm.fit <- glm(default~student,family="binomial",data=Default)
summary(glm.fit)


Call:
glm(formula = default ~ student, family = "binomial", data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.2970  -0.2970  -0.2434  -0.2434   2.6585  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.50413    0.07071  -49.55  < 2e-16 ***
studentYes   0.40489    0.11502    3.52 0.000431 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 2908.7  on 9998  degrees of freedom
AIC: 2912.7

Number of Fisher Scoring iterations: 6


In [11]:
newdata <- data.frame(student="Yes")
predict(glm.fit,type="response",newdata) #return the predicted probability
predict(glm.fit,newdata) #return the predicted log-odds (the default type is "link")

In [10]:
newdata <- data.frame(student="No")
predict(glm.fit,type="response",newdata)
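With a single dummy variable, both predictions follow directly from the two coefficients. The sketch below uses the estimates printed in the student-only summary; the odds ratio line is an added illustration of how studentYes can be read, not something computed in the notebook.

```r
# Estimates from the coefficient table of default ~ student
b0        <- -3.50413   # intercept (log-odds of default for non-students)
b_student <-  0.40489   # studentYes (shift in log-odds for students)

p_student    <- plogis(b0 + b_student)   # P(default | student = "Yes")
p_nonstudent <- plogis(b0)               # P(default | student = "No")

# exp(coefficient) is the odds ratio of students relative to non-students
odds_ratio <- exp(b_student)

print(c(student = p_student, nonstudent = p_nonstudent, odds_ratio = odds_ratio))
```

Students have a predicted default probability of roughly 4.3% against roughly 2.9% for non-students, consistent with the positive studentYes coefficient.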

## Multiple Logistic Regression

In [8]:
glm.fit <- glm(default~balance+income+student,family="binomial",data=Default)
summary(glm.fit)


Call:
glm(formula = default ~ balance + income + student, family = "binomial", 
    data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4691  -0.1418  -0.0557  -0.0203   3.7383  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
income       3.033e-06  8.203e-06   0.370  0.71152    
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1571.5  on 9996  degrees of freedom
AIC: 1579.5

Number of Fisher Scoring iterations: 8
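Note that the studentYes coefficient is negative here, whereas it was positive in the student-only model. The sketch below shows what that means for predictions by comparing a student and a non-student with the same balance and income. The balance of 1500 and income of 40000 are hypothetical values chosen for illustration, predict_default is a hypothetical helper, and the coefficients are the rounded estimates printed above.

```r
# Rounded estimates from the multiple logistic regression summary above
coefs <- c(intercept  = -1.087e+01,
           balance    =  5.737e-03,
           income     =  3.033e-06,
           studentYes = -6.468e-01)

# Hypothetical helper: predicted default probability from the printed coefficients
predict_default <- function(balance, income, student) {
  log_odds <- coefs["intercept"] +
    coefs["balance"] * balance +
    coefs["income"] * income +
    coefs["studentYes"] * as.numeric(student)
  unname(plogis(log_odds))
}

# Same (hypothetical) balance and income, differing only in student status
p_student    <- predict_default(1500, 40000, student = TRUE)
p_nonstudent <- predict_default(1500, 40000, student = FALSE)

print(c(student = p_student, nonstudent = p_nonstudent))
```

Holding balance and income fixed, the student has the lower predicted default probability, the opposite of the ordering in the student-only model.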
