# Logistic Regression

## Using GLM

In [1]:
using GLM
using Statistics
using RDatasets
using MLDataUtils

### Load data

In [2]:
data = RDatasets.dataset("ISLR", "Default")
first(data, 6)

| Row | Default | Student | Balance | Income |
|-----|---------|---------|---------|--------|
|     | Cat…    | Cat…    | Float64 | Float64 |
| 1 | No | No | 729.526 | 44361.6 |
| 2 | No | Yes | 817.18 | 12106.1 |
| 3 | No | No | 1073.55 | 31767.1 |
| 4 | No | No | 529.251 | 35704.5 |
| 5 | No | No | 785.656 | 38463.5 |
| 6 | No | Yes | 919.589 | 7491.56 |


### Preprocessing

In [3]:
isyes(x) = x == "Yes" ? 1.0 : 0.0

data[!, :DefaultNum] = isyes.(data[!, :Default])
data[!, :StudentNum] = isyes.(data[!, :Student])
first(data, 6)

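| Row | Default | Student | Balance | Income | DefaultNum | StudentNum |
|-----|---------|---------|---------|--------|------------|------------|
|     | Cat…    | Cat…    | Float64 | Float64 | Float64 | Float64 |
| 1 | No | No | 729.526 | 44361.6 | 0.0 | 0.0 |
| 2 | No | Yes | 817.18 | 12106.1 | 0.0 | 1.0 |
| 3 | No | No | 1073.55 | 31767.1 | 0.0 | 0.0 |
| 4 | No | No | 529.251 | 35704.5 | 0.0 | 0.0 |
| 5 | No | No | 785.656 | 38463.5 | 0.0 | 0.0 |
| 6 | No | Yes | 919.589 | 7491.56 | 0.0 | 1.0 |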


### Training/Testing set

In [4]:
indices = MLDataUtils.shuffleobs(collect(1:nrow(data)))
train_ind, test_ind = MLDataUtils.splitobs(indices, at=0.8);
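
The shuffle-then-split step above can be approximated in plain Julia. This is only a rough sketch using the `Random` standard library, not MLDataUtils' actual implementation. The row count `n`, the seed, and the rounding rule for the cut point are assumptions chosen for illustration.

```julia
using Random

# Hypothetical stand-in for nrow(data); no dataset is loaded in this sketch.
n = 10

# Fixed seed (an arbitrary choice) so the split is reproducible.
rng = MersenneTwister(42)

# Random permutation of the row indices.
idx = shuffle(rng, collect(1:n))

# First 80% of the shuffled indices for training, the rest for testing.
cut = round(Int, 0.8 * n)
train_idx = idx[1:cut]
test_idx  = idx[cut+1:end]
```

Every index lands in exactly one of the two sets, so no row is shared between training and testing.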

In [5]:
train = data[train_ind, :]
test = data[test_ind, :]

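| Row | Default | Student | Balance | Income | DefaultNum | StudentNum |
|-----|---------|---------|---------|--------|------------|------------|
|     | Cat…    | Cat…    | Float64 | Float64 | Float64 | Float64 |
| 1 | No | Yes | 1752.74 | 15596.9 | 0.0 | 1.0 |
| 2 | No | No | 51.186 | 39385.8 | 0.0 | 0.0 |
| 3 | No | No | 979.851 | 44869.0 | 0.0 | 0.0 |
| 4 | No | No | 1228.31 | 37408.5 | 0.0 | 0.0 |
| 5 | No | Yes | 1464.39 | 13968.5 | 0.0 | 1.0 |
| 6 | No | Yes | 1423.94 | 22634.5 | 0.0 | 1.0 |
| 7 | No | No | 796.991 | 25159.6 | 0.0 | 0.0 |
| 8 | Yes | No | 1610.48 | 35589.7 | 1.0 | 0.0 |
| 9 | No | No | 0.0 | 39893.3 | 0.0 | 0.0 |
| 10 | No | Yes | 220.556 | 16872.9 | 0.0 | 1.0 |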


### Model

In [6]:
logreg = glm(@formula(DefaultNum ~ Balance + Income), train, Binomial(), LogitLink())

StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, LogitLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

DefaultNum ~ 1 + Balance + Income

Coefficients:
────────────────────────────────────────────────────────────────────────────────────
                    Coef.   Std. Error       z  Pr(>|z|)     Lower 95%     Upper 95%
────────────────────────────────────────────────────────────────────────────────────
(Intercept)  -11.5226      0.480836     -23.96    <1e-99  -12.465       -10.5802
Balance        0.00563487  0.000251878   22.37    <1e-99    0.0051412     0.00612854
Income         2.18582e-5  5.45605e-6     4.01    <1e-04    1.11646e-5    3.25519e-5
────────────────────────────────────────────────────────────────────────────────────
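
To see how the fitted coefficients turn into a probability, the sketch below applies the logistic function by hand to the first test row shown earlier (Balance = 1752.74, Income = 15596.9). The coefficients are copied from the rounded table above, so the result is approximate. The names `β0`, `β_balance`, `β_income` and `sigmoid` are illustrative and not part of GLM.

```julia
# Coefficients copied (rounded) from the coefficient table above.
β0, β_balance, β_income = -11.5226, 0.00563487, 2.18582e-5

# Logistic (inverse logit) function.
sigmoid(z) = 1 / (1 + exp(-z))

# Linear predictor for the first test row: Balance = 1752.74, Income = 15596.9.
z = β0 + β_balance * 1752.74 + β_income * 15596.9

# Estimated probability of default for that row.
p = sigmoid(z)
```

The value of `p` comes out close to 0.213, in line with the first entry returned by `predict` in the next section.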

### Prediction

In [7]:
pred = predict(logreg, test)

2000-element Vector{Union{Missing, Float64}}:
 0.21328625454571212
 3.125612838928029e-5
 0.006557933523560017
 0.022236137055162565
 0.04900485991424943
 0.04723963605465983
 0.001528854259701545
 0.15845169642308257
 2.3686135678251725e-5
 4.962441404369916e-5
 0.00017711989399314524
 0.0018276337069826445
 0.0003823405425518227
 ⋮
 0.0029645402339915877
 8.933402494237131e-5
 0.004992202023964724
 0.0009649886232806776
 0.007878355817466555
 2.024910853402094e-5
 0.0009898826715552314
 0.0001505145966882297
 0.980421178398864
 0.09454087048209255
 0.04264618321679343
 0.005030685046548807

### Validation

In [8]:
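# true when the prediction thresholded at 0.5 matches the label
iscorrect(x, y) = ((x > 0.5) ? 1.0 : 0.0) == y
# fraction of correctly classified observations
accuracy(xs, ys) = mean(iscorrect.(xs, ys))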

accuracy (generic function with 1 method)

In [9]:
accuracy(pred, test[!, :DefaultNum])

0.978
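
The same threshold-and-compare rule can be written in vectorized form. The sketch below uses made-up probabilities and labels (not values from the Default dataset) just to show the arithmetic.

```julia
using Statistics

# Hypothetical predicted probabilities and true labels (made up for illustration).
probs  = [0.9, 0.2, 0.6, 0.1]
labels = [1.0, 0.0, 0.0, 0.0]

# Threshold at 0.5 to get hard 0/1 predictions.
predicted = Float64.(probs .> 0.5)

# Accuracy is the fraction of predictions that match the labels.
acc = mean(predicted .== labels)
```

Here three of the four thresholded predictions match their labels, so `acc` is 0.75.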

## Using MLJ

In [10]:
using MLJ


### Load data

In [11]:
smarket = dataset("ISLR", "Smarket")
first(smarket, 6)

| Row | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
|-----|------|------|------|------|------|------|--------|-------|-----------|
|     | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Cat… |
| 1 | 2001.0 | 0.381 | -0.192 | -2.624 | -1.055 | 5.01 | 1.1913 | 0.959 | Up |
| 2 | 2001.0 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.2965 | 1.032 | Up |
| 3 | 2001.0 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.4112 | -0.623 | Down |
| 4 | 2001.0 | -0.623 | 1.032 | 0.959 | 0.381 | -0.192 | 1.276 | 0.614 | Up |
| 5 | 2001.0 | 0.614 | -0.623 | 1.032 | 0.959 | 0.381 | 1.2057 | 0.213 | Up |
| 6 | 2001.0 | 0.213 | 0.614 | -0.623 | 1.032 | 0.959 | 1.3491 | 1.392 | Up |


### Casting scientific types

In [12]:
y, X = unpack(smarket, ==(:Direction), colname -> true);
X = select(X, Not([:Year, :Today]))
first(X, 6)

| Row | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume |
|-----|------|------|------|------|------|--------|
|     | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.381 | -0.192 | -2.624 | -1.055 | 5.01 | 1.1913 |
| 2 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.2965 |
| 3 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.4112 |
| 4 | -0.623 | 1.032 | 0.959 | 0.381 | -0.192 | 1.276 |
| 5 | 0.614 | -0.623 | 1.032 | 0.959 | 0.381 | 1.2057 |
| 6 | 0.213 | 0.614 | -0.623 | 1.032 | 0.959 | 1.3491 |


In [13]:
y = coerce(y, OrderedFactor)
classes(y[1])

2-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "Down"
 "Up"

### Training/testing set

In [14]:
train, test = partition(eachindex(y), 0.7, shuffle=true)

([829, 1, 409, 417, 820, 617, 1013, 142, 610, 238  …  170, 988, 695, 522, 159, 16, 749, 340, 1117, 1063], [428, 329, 41, 520, 334, 100, 39, 496, 868, 1046  …  893, 1041, 1233, 566, 690, 422, 620, 862, 1099, 253])

### Model

In [15]:
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels

import MLJLinearModels ✔


┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /home/yuehhua/.julia/packages/MLJModels/lDzCR/src/loading.jl:168


MLJLinearModels.LogisticClassifier

In [16]:
logreg = machine(LogisticClassifier(), X, y)

Machine trained 0 times; caches data
  model: LogisticClassifier(lambda = 1.0, …)
  args: 
    1:	Source @632 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @470 ⏎ `AbstractVector{OrderedFactor{2}}`


### Training

In [17]:
fit!(logreg, rows=train)

┌ Info: Training machine(LogisticClassifier(lambda = 1.0, …), …).
└ @ MLJBase /home/yuehhua/.julia/packages/MLJBase/rQDaq/src/machines.jl:487
┌ Info: Solver: MLJLinearModels.LBFGS()
└ @ MLJLinearModels /home/yuehhua/.julia/packages/MLJLinearModels/2qDvV/src/mlj/interface.jl:76


Machine trained 1 time; caches data
  model: LogisticClassifier(lambda = 1.0, …)
  args: 
    1:	Source @632 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @470 ⏎ `AbstractVector{OrderedFactor{2}}`


### Predict

In [18]:
ŷ = predict_mode(logreg, rows=test)

375-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 ⋮
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"

### Evaluation

In [19]:
ŷ = MLJ.predict(logreg, rows=test)

375-element CategoricalDistributions.UnivariateFiniteVector{OrderedFactor{2}, String, UInt8, Float64}:
 UnivariateFinite{OrderedFactor{2}}(Down=>0.49, Up=>0.51)
 UnivariateFinite{OrderedFactor{2}}(Down=>0.472, Up=>0.528)
 UnivariateFinite{OrderedFactor{2}}(Down=>0.479, Up=>0.521)
 UnivariateFinite{OrderedFactor{2}}(Down=>0.474, Up=>0.526)
 UnivariateFinite{OrderedFactor{2}}(Down=>0.483, Up=>0.517)
 UnivariateFinite{OrderedFactor{2}}(Down=>0.482, Up=>0.518)
 UnivariateFinite{OrderedFactor{2}}(Down=>0.483, Up=>0.517)
 UnivariateFinite{OrderedFactor{2}}(Down=>0.489, Up=>0.511)
 UnivariateFinite{OrderedFactor{2}}(Down=>0.476, Up=>0.524)
 UnivariateFinite{OrderedFactor{2}}(Down=>0.477, Up=>0.523)
 UnivariateFinite{OrderedFactor{2}}(Down=>0.478, Up=>0.522)
 UnivariateFinite{OrderedFactor{2}}(Down=>0.479, Up=>0.521)
 UnivariateFinite{OrderedFactor{2}}(Down=>0.478, Up=>0.522)
 ⋮
 UnivariateFinite{OrderedFactor{2}}(Down=>0.478, Up=>0.522)
 UnivariateFinite{OrderedFactor{2}}(Down=>0.466, Up=>0.5
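
`predict` returns a probability distribution per observation, while `predict_mode` reports only the most probable class. A minimal plain-Julia sketch of that reduction follows. The probability vectors are made up for illustration and the name `class_labels` is hypothetical; MLJ's own distribution objects are not used here.

```julia
# Class labels in the same order as the entries of each probability vector.
class_labels = ["Down", "Up"]

# Hypothetical class probabilities for three observations (made up for illustration).
probs = [
    [0.49, 0.51],
    [0.472, 0.528],
    [0.60, 0.40],
]

# The mode of each distribution is the label with the largest probability.
modes = [class_labels[argmax(p)] for p in probs]
```

When every distribution leans even slightly toward the same class, as in the output above, the mode is that class for every observation.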

In [20]:
cross_entropy(ŷ, y[test]) |> mean

0.6930377928406966
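
For context on this number: the cross-entropy of one observation is the negative log of the probability assigned to its true class, and a classifier that always outputs 0.5 for two classes scores log(2) ≈ 0.6931. The sketch below illustrates both with made-up probabilities; `p_true` is a hypothetical vector, not taken from the model output.

```julia
using Statistics

# Hypothetical probabilities assigned to the true class of four observations.
p_true = [0.51, 0.528, 0.479, 0.474]

# Per-observation cross-entropy is the negative log of that probability.
losses = -log.(p_true)
mean_loss = mean(losses)

# Reference point: always predicting 0.5 for each of two classes gives log(2).
baseline = log(2)
```

The mean loss reported above is almost identical to `baseline`, which suggests the fitted model does barely better than a coin flip on the test rows.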

### Evaluation methods

In [21]:
ŷ = predict_mode(logreg, rows=test)

375-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 ⋮
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"

In [22]:
misclassification_rate(ŷ, y[test])

0.49066666666666664

In [23]:
cm = confusion_matrix(ŷ, y[test])

              ┌───────────────────────────┐
              │       Ground Truth        │
┌─────────────┼─────────────┬─────────────┤
│  Predicted  │    Down     │     Up      │
├─────────────┼─────────────┼─────────────┤
│    Down     │      0      │      0      │
├─────────────┼─────────────┼─────────────┤
│     Up      │     184     │     191     │
└─────────────┴─────────────┴─────────────┘


In [24]:
false_positive(cm)

184

In [25]:
MLJ.accuracy(cm)

0.5093333333333333

In [26]:
MLJ.accuracy(ŷ, y[test])

0.5093333333333334

In [27]:
precision(ŷ, y[test])

0.5093333333333334

In [28]:
recall(ŷ, y[test])

1.0

In [29]:
f1score(ŷ, y[test])

0.6749116607773852
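
The metrics above can be recomputed by hand from the confusion matrix counts. This sketch assumes "Up" is treated as the positive class, which is consistent with `false_positive(cm)` returning 184. The variable names are illustrative.

```julia
# Counts read off the confusion matrix above, assuming "Up" is the positive class.
tp = 191   # predicted Up,   truly Up
fp = 184   # predicted Up,   truly Down
fn = 0     # predicted Down, truly Up
tn = 0     # predicted Down, truly Down

acc      = (tp + tn) / (tp + fp + fn + tn)   # accuracy
prec     = tp / (tp + fp)                    # precision
rec      = tp / (tp + fn)                    # recall
f1       = 2 * prec * rec / (prec + rec)     # F1 score
misclass = 1 - acc                           # misclassification rate
```

Because the model predicts "Up" for every test row, recall is 1.0 while accuracy and precision both equal the share of "Up" rows in the test set.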