# Logistic Regression

In [1]:
using GLM
using StatsBase
using RDatasets
using MLDataUtils

## Load data

In [2]:
data = RDatasets.dataset("ISLR", "Default")
first(data, 6)

| Row | Default | Student | Balance | Income |
|---|---|---|---|---|
| | Categorical… | Categorical… | Float64 | Float64 |
| 1 | No | No | 729.526 | 44361.6 |
| 2 | No | Yes | 817.18 | 12106.1 |
| 3 | No | No | 1073.55 | 31767.1 |
| 4 | No | No | 529.251 | 35704.5 |
| 5 | No | No | 785.656 | 38463.5 |
| 6 | No | Yes | 919.589 | 7491.56 |


## Preprocessing

In [3]:
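# Encode the categorical "Yes"/"No" values as 1.0/0.0
isyes(x) = x == "Yes" ? 1.0 : 0.0

# Property syntax replaces the deprecated data[:col] column indexing
data.DefaultNum = isyes.(data.Default)
data.StudentNum = isyes.(data.Student)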
first(data, 6)

| Row | Default | Student | Balance | Income | DefaultNum | StudentNum |
|---|---|---|---|---|---|---|
| | Categorical… | Categorical… | Float64 | Float64 | Float64 | Float64 |
| 1 | No | No | 729.526 | 44361.6 | 0.0 | 0.0 |
| 2 | No | Yes | 817.18 | 12106.1 | 0.0 | 1.0 |
| 3 | No | No | 1073.55 | 31767.1 | 0.0 | 0.0 |
| 4 | No | No | 529.251 | 35704.5 | 0.0 | 0.0 |
| 5 | No | No | 785.656 | 38463.5 | 0.0 | 0.0 |
| 6 | No | Yes | 919.589 | 7491.56 | 0.0 | 1.0 |


## Training/Testing set

In [4]:
# Shuffle the row indices, then split them 80% / 20% into training and test sets
indices = MLDataUtils.shuffleobs(collect(1:nrow(data)))
train_ind, test_ind = MLDataUtils.splitobs(indices, at=0.8);

In [5]:
train = data[train_ind, :]
test = data[test_ind, :]

| Row | Default | Student | Balance | Income | DefaultNum | StudentNum |
|---|---|---|---|---|---|---|
| | Categorical… | Categorical… | Float64 | Float64 | Float64 | Float64 |
| 1 | No | No | 866.028 | 38363.4 | 0.0 | 0.0 |
| 2 | No | No | 521.8 | 22209.1 | 0.0 | 0.0 |
| 3 | No | Yes | 495.136 | 15519.5 | 0.0 | 1.0 |
| 4 | No | No | 1186.8 | 52689.3 | 0.0 | 0.0 |
| 5 | No | No | 1078.62 | 46060.1 | 0.0 | 0.0 |
| 6 | No | Yes | 441.626 | 15261.7 | 0.0 | 1.0 |
| 7 | No | Yes | 1060.22 | 22795.9 | 0.0 | 1.0 |
| 8 | No | No | 419.441 | 34332.7 | 0.0 | 0.0 |
| 9 | No | No | 801.803 | 46182.3 | 0.0 | 0.0 |
| 10 | No | Yes | 498.506 | 16967.6 | 0.0 | 1.0 |


## Model

In [6]:
logreg = glm(@formula(DefaultNum ~ Balance + Income), train, Binomial(), LogitLink())

StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Binomial{Float64},LogitLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: DefaultNum ~ 1 + Balance + Income

Coefficients:
               Estimate   Std.Error  z value Pr(>|z|)
(Intercept)    -11.3446    0.481304 -23.5705   <1e-99
Balance      0.00550756 0.000248322  22.1791   <1e-99
Income       2.05294e-5  5.66543e-6  3.62362   0.0003
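
The coefficients above feed a linear predictor that the logistic function turns into a probability. The following is a minimal sketch of that calculation, assuming the rounded coefficient values printed in the table and, as an example input, the Balance and Income of the first row shown in the data preview. The names `sigmoid` and `default_probability` are hypothetical helpers introduced here, not part of GLM.jl.

```julia
# Logistic (inverse logit) function: maps any real number into (0, 1).
sigmoid(z) = 1 / (1 + exp(-z))

# Coefficients copied (rounded) from the printed model summary.
b0        = -11.3446      # (Intercept)
b_balance = 0.00550756    # Balance
b_income  = 2.05294e-5    # Income

# Probability of default for a given balance and income (hypothetical helper).
default_probability(balance, income) =
    sigmoid(b0 + b_balance * balance + b_income * income)

# Example input: Balance and Income from the first row of the data preview.
p = default_probability(729.526, 44361.6)
println(p)
```

For rows of the test set, this should agree with `predict` up to the rounding of the copied coefficients.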


## Prediction

In [7]:
pred = predict(logreg, test)

2000-element Array{Union{Missing, Float64},1}:
 0.00305693731166153   
 0.0003304325548684028 
 0.0002487124704046108 
 0.023510168775308222  
 0.011448235383525436  
 0.00018426230439239313
 0.006449073484437817  
 0.00024120088874050055
 0.0025212115801270464 
 0.00026101402565653875
 0.38399861174419103   
 0.0001913084006543652 
 0.015755093076617276  
 ⋮                     
 0.0004224977315033718 
 0.31616134955331027   
 2.2217153247511253e-5 
 0.0014623015840897306 
 0.005167187677641112  
 0.18838419540018783   
 0.00030477650049113203
 3.3209337362472214e-5 
 3.079043963597167e-5  
 0.0005502430481336342 
 0.0004980362341448185 
 0.09943929367601344   

## Validation

In [8]:
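# True when the prediction, thresholded at 0.5, matches the label.
# Named `iscorrect` because it tests correctness and to avoid shadowing Base.error.
iscorrect(x, y) = ((x > 0.5) ? 1.0 : 0.0) == y
accuracy(xs, ys) = sum(iscorrect.(xs, ys)) / length(xs)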

accuracy (generic function with 1 method)

In [9]:
accuracy(pred, test.DefaultNum)

0.9745
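
Accuracy alone can hide how a classifier treats the rarer class, and every row in the previews above is a non-default. One way to look closer is to count true and false positives and negatives separately. The sketch below uses a hypothetical helper, `confusion_counts`, and toy vectors made up for illustration; none of the numbers are results from this notebook.

```julia
# Count confusion-matrix cells for probabilities thresholded at `threshold`.
# Assumes `probs` contains no missing values and `labels` holds 1.0 / 0.0.
function confusion_counts(probs, labels; threshold=0.5)
    predicted = probs .> threshold     # predicted positive
    actual    = labels .== 1.0         # actually positive
    not_predicted = .!predicted
    not_actual    = .!actual
    tp = sum(predicted .& actual)            # true positives
    fp = sum(predicted .& not_actual)        # false positives
    fn = sum(not_predicted .& actual)        # false negatives
    tn = sum(not_predicted .& not_actual)    # true negatives
    return (tp=tp, fp=fp, fn=fn, tn=tn)
end

# Toy data for illustration only.
toy_probs  = [0.9, 0.2, 0.7, 0.1, 0.4]
toy_labels = [1.0, 0.0, 0.0, 0.0, 1.0]

counts = confusion_counts(toy_probs, toy_labels)
println(counts)
```

Applied to the notebook's `pred` vector and the `DefaultNum` column of `test`, the same function would show how many actual defaults the 0.5 threshold catches. Comparing the accuracy with the share of non-defaults in `test` gives a simple baseline.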