# Logistic Regression

## Using GLM

In [1]:
using GLM
using Statistics
using RDatasets
using MLDataUtils

### Load data

In [2]:
data = RDatasets.dataset("ISLR", "Default")
first(data, 6)

Row,Default,Student,Balance,Income
Type,Categorical…,Categorical…,Float64,Float64
1,No,No,729.526,44361.6
2,No,Yes,817.18,12106.1
3,No,No,1073.55,31767.1
4,No,No,529.251,35704.5
5,No,No,785.656,38463.5
6,No,Yes,919.589,7491.56


### Preprocessing

In [3]:
isyes(x) = x == "Yes" ? 1.0 : 0.0

data[!, :DefaultNum] = isyes.(data[!, :Default])
data[!, :StudentNum] = isyes.(data[!, :Student])
first(data, 6)

Row,Default,Student,Balance,Income,DefaultNum,StudentNum
Type,Categorical…,Categorical…,Float64,Float64,Float64,Float64
1,No,No,729.526,44361.6,0.0,0.0
2,No,Yes,817.18,12106.1,0.0,1.0
3,No,No,1073.55,31767.1,0.0,0.0
4,No,No,529.251,35704.5,0.0,0.0
5,No,No,785.656,38463.5,0.0,0.0
6,No,Yes,919.589,7491.56,0.0,1.0


### Training/Testing set

In [4]:
indices = MLDataUtils.shuffleobs(collect(1:nrow(data)))
train_ind, test_ind = MLDataUtils.splitobs(indices, at=0.8);

In [5]:
train = data[train_ind, :]
test = data[test_ind, :]

Row,Default,Student,Balance,Income,DefaultNum,StudentNum
Type,Categorical…,Categorical…,Float64,Float64,Float64,Float64
1,No,No,241.336,40122.4,0.0,0.0
2,No,No,946.138,40038.7,0.0,0.0
3,No,No,752.222,48764.0,0.0,0.0
4,No,No,1032.94,31152.1,0.0,0.0
5,No,No,850.548,44501.9,0.0,0.0
6,No,No,124.699,65211.1,0.0,0.0
7,No,No,265.729,60182.7,0.0,0.0
8,No,No,0.0,46306.9,0.0,0.0
9,No,Yes,1091.14,19990.8,0.0,1.0
10,No,Yes,1218.65,19206.1,0.0,1.0
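
The shuffle-and-split step above can also be sketched with only the `Random` standard library. This is a minimal illustration, not what `MLDataUtils` does internally; the helper name `shuffle_split` and the seed are assumptions made for the example.

```julia
using Random

# Hypothetical helper: shuffle the row indices 1:n and cut them at fraction `at`.
function shuffle_split(n; at = 0.8, rng = Random.default_rng())
    idx = shuffle(rng, 1:n)          # random permutation of the indices
    ntrain = floor(Int, at * n)      # number of rows kept for training
    return idx[1:ntrain], idx[ntrain+1:end]
end

# Small example with 10 rows and a fixed seed for reproducibility.
train_idx, test_idx = shuffle_split(10; at = 0.8, rng = MersenneTwister(1))
println(length(train_idx), " ", length(test_idx))
```

With the notebook's data, the two index vectors would be used as in `data[train_idx, :]` and `data[test_idx, :]`.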


### Model

In [6]:
logreg = glm(@formula(DefaultNum ~ Balance + Income), train, Binomial(), LogitLink())

StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Binomial{Float64},LogitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

DefaultNum ~ 1 + Balance + Income

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────
                 Estimate   Std. Error    z value  Pr(>|z|)     Lower 95%     Upper 95%
───────────────────────────────────────────────────────────────────────────────────────
(Intercept)  -11.634       0.482006     -24.1366     <1e-99  -12.5787      -10.6893
Balance        0.00564643  0.000251109   22.486      <1e-99    0.00515427    0.00613859
Income         2.42322e-5  5.54692e-6     4.36858    <1e-4     1.33604e-5    3.51039e-5
───────────────────────────────────────────────────────────────────────────────────────
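
To read the coefficient table, it can help to evaluate the fitted model by hand. The sketch below hard-codes the rounded estimates printed above and applies the inverse logit. The helper name `default_probability` and the example inputs (balances of 1000 and 2000, income of 40000) are illustrative assumptions, not part of the original notebook.

```julia
# Inverse of the logit link: maps a linear predictor to a probability.
logistic(z) = 1 / (1 + exp(-z))

# Rounded estimates copied from the coefficient table above.
coef = (intercept = -11.634, balance = 0.00564643, income = 2.42322e-5)

# Hypothetical helper: predicted probability of default for one customer.
default_probability(balance, income; β = coef) =
    logistic(β.intercept + β.balance * balance + β.income * income)

p_low  = default_probability(1000.0, 40000.0)  # moderate balance
p_high = default_probability(2000.0, 40000.0)  # high balance, same income
println((p_low, p_high))
```

Since the `Balance` coefficient is positive, a higher balance raises the predicted probability, which is why the second value is much larger than the first.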

### Prediction

In [7]:
pred = predict(logreg, test)

2000-element Array{Union{Missing, Float64},1}:
 9.150220336679368e-5
 0.004861800412552353
 0.0020152992066220475
 0.006389529620440955
 0.0031630256332805685
 8.698519971359189e-5
 0.0001707364240762617
 2.7210563647774178e-5
 0.00676952980488337
 0.01355204597103904
 6.873337234012609e-5
 8.122426366054097e-5
 0.002222270417418081
 ⋮
 0.00906038766208714
 0.00035610053239268076
 0.00014655597818013374
 0.21638148426472423
 0.1329844724882828
 0.0031090945400260336
 0.021201777133772914
 0.0038254800015384374
 0.00160979539197448
 0.00018422939233096847
 0.00017890290104524042
 0.2718299195933858

### Validation

In [8]:
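# `iscorrect` thresholds the predicted probability at 0.5 and compares it with the true label.
iscorrect(p, y) = ((p > 0.5) ? 1.0 : 0.0) == y
accuracy(ps, ys) = mean(iscorrect.(ps, ys))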

accuracy (generic function with 1 method)

In [9]:
accuracy(pred, test[!, :DefaultNum])

0.974
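
Accuracy weights every correct prediction equally, so when one class dominates (as the preview rows, all `No`, suggest may be the case here) it is worth comparing against a classifier that always predicts the majority class. The sketch below is one way to compute that baseline for 0/1 labels. The function name `majority_baseline` and the label vector are made up for illustration.

```julia
using Statistics

# Accuracy obtained by always predicting the most frequent class of 0/1 labels.
majority_baseline(ys) = max(mean(ys), 1 - mean(ys))

# Made-up labels for illustration only: 3 defaults among 100 customers.
ys = vcat(ones(3), zeros(97))
println(majority_baseline(ys))
```

In the notebook, the corresponding call would be `majority_baseline(test[!, :DefaultNum])`, to be compared with the accuracy reported above.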

## Using MLJ

In [10]:
using MLJ



### Load data

In [11]:
smarket = dataset("ISLR", "Smarket")
first(smarket, 6)

Row,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
Type,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Categorical…
1,2001.0,0.381,-0.192,-2.624,-1.055,5.01,1.1913,0.959,Up
2,2001.0,0.959,0.381,-0.192,-2.624,-1.055,1.2965,1.032,Up
3,2001.0,1.032,0.959,0.381,-0.192,-2.624,1.4112,-0.623,Down
4,2001.0,-0.623,1.032,0.959,0.381,-0.192,1.276,0.614,Up
5,2001.0,0.614,-0.623,1.032,0.959,0.381,1.2057,0.213,Up
6,2001.0,0.213,0.614,-0.623,1.032,0.959,1.3491,1.392,Up


### Casting scientific types

In [12]:
y, X = unpack(smarket, ==(:Direction), colname -> true);
X = select(X, Not([:Year, :Today]))
first(X, 6)

Row,Lag1,Lag2,Lag3,Lag4,Lag5,Volume
Type,Float64,Float64,Float64,Float64,Float64,Float64
1,0.381,-0.192,-2.624,-1.055,5.01,1.1913
2,0.959,0.381,-0.192,-2.624,-1.055,1.2965
3,1.032,0.959,0.381,-0.192,-2.624,1.4112
4,-0.623,1.032,0.959,0.381,-0.192,1.276
5,0.614,-0.623,1.032,0.959,0.381,1.2057
6,0.213,0.614,-0.623,1.032,0.959,1.3491


In [13]:
y = coerce(y, OrderedFactor)
classes(y[1])

2-element CategoricalArray{String,1,UInt8}:
 "Down"
 "Up"

### Training/testing set

In [14]:
train, test = partition(eachindex(y), 0.7, shuffle=true)

([909, 1071, 252, 1138, 459, 381, 429, 713, 757, 449  …  327, 150, 34, 260, 106, 764, 1235, 1242, 802, 97], [221, 956, 442, 879, 1103, 1130, 162, 101, 365, 814  …  111, 1175, 344, 957, 632, 295, 717, 342, 435, 362])

### Model

In [15]:
model = @load LogisticClassifier pkg=MLJLinearModels

LogisticClassifier(
    lambda = 1.0,
    gamma = 0.0,
    penalty = :l2,
    fit_intercept = true,
    penalize_intercept = false,
    solver = nothing,
    multi_class = false) @ 1…96

In [16]:
match = machine(model, X, y)

Machine{LogisticClassifier} @ 4…61


### Training

In [17]:
fit!(match, rows=train)

┌ Info: Training Machine{LogisticClassifier} @ 4…61.
└ @ MLJBase C:\Users\a504082002\.julia\packages\MLJBase\qJs1o\src\machines.jl:182


Machine{LogisticClassifier} @ 4…61


### Predict

In [18]:
ŷ = predict_mode(match, rows=test)

375-element CategoricalArray{String,1,UInt8}:
 "Up"
 "Up"
 "Down"
 "Down"
 "Up"
 "Down"
 "Up"
 "Down"
 "Up"
 "Up"
 "Down"
 "Down"
 "Down"
 ⋮
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Down"
 "Up"
 "Down"
 "Up"
 "Up"

### Evaluation

In [19]:
ŷ = MLJ.predict(match, rows=test)

375-element Array{UnivariateFinite{String,UInt8,Float64},1}:
 UnivariateFinite(Down=>0.412, Up=>0.588)
 UnivariateFinite(Down=>0.45, Up=>0.55)
 UnivariateFinite(Down=>0.509, Up=>0.491)
 UnivariateFinite(Down=>0.506, Up=>0.494)
 UnivariateFinite(Down=>0.464, Up=>0.536)
 UnivariateFinite(Down=>0.504, Up=>0.496)
 UnivariateFinite(Down=>0.402, Up=>0.598)
 UnivariateFinite(Down=>0.551, Up=>0.449)
 UnivariateFinite(Down=>0.497, Up=>0.503)
 UnivariateFinite(Down=>0.441, Up=>0.559)
 UnivariateFinite(Down=>0.578, Up=>0.422)
 UnivariateFinite(Down=>0.505, Up=>0.495)
 UnivariateFinite(Down=>0.543, Up=>0.457)
 ⋮
 UnivariateFinite(Down=>0.498, Up=>0.502)
 UnivariateFinite(Down=>0.437, Up=>0.563)
 UnivariateFinite(Down=>0.486, Up=>0.514)
 UnivariateFinite(Down=>0.458, Up=>0.542)
 UnivariateFinite(Down=>0.459, Up=>0.541)
 UnivariateFinite(Down=>0.439, Up=>0.561)
 UnivariateFinite(Down=>0.492, Up=>0.508)
 UnivariateFinite(Down=>0.514, Up=>0.486)
 UnivariateFinite(Down=>0.429, Up=>0.571)
 UnivariateFin

In [20]:
cross_entropy(ŷ, y[test]) |> mean

0.7031074056572498
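
The mean above averages a per-observation loss, which for cross-entropy is the negative natural log of the probability assigned to the class that actually occurred. The sketch below works through three cases by hand. The `Up` probabilities are copied from the first three predictions shown above, while the `truth` vector is hypothetical, since the true labels of those rows are not printed in the notebook.

```julia
using Statistics

# Cross-entropy of one prediction: negative log of the probability
# given to the class that actually occurred.
cross_entropy_one(p_true) = -log(p_true)

# Probabilities of "Up" copied from the first three predictions above.
p_up = [0.588, 0.55, 0.491]
# Hypothetical ground truth, for illustration only.
truth = ["Up", "Down", "Up"]

# Pick the probability of the true class for each observation.
p_true = [t == "Up" ? p : 1 - p for (p, t) in zip(p_up, truth)]
losses = cross_entropy_one.(p_true)

println(mean(losses))
println(log(2))  # loss of an uninformative 50/50 prediction
```

A useful reference point is `log(2) ≈ 0.693`, the loss of always predicting 50/50. The mean reported above is slightly higher than that, so on this test split the fitted probabilities are no better than an uninformative guess.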

### Evaluation methods

In [21]:
ŷ = predict_mode(match, rows=test)

375-element CategoricalArray{String,1,UInt8}:
 "Up"
 "Up"
 "Down"
 "Down"
 "Up"
 "Down"
 "Up"
 "Down"
 "Up"
 "Up"
 "Down"
 "Down"
 "Down"
 ⋮
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Down"
 "Up"
 "Down"
 "Up"
 "Up"

In [22]:
misclassification_rate(ŷ, y[test])

0.5173333333333333

In [23]:
cm = confusion_matrix(ŷ, y[test])

              ┌───────────────────────────┐
              │       Ground Truth        │
┌─────────────┼─────────────┬─────────────┤
│  Predicted  │    Down     │     Up      │
├─────────────┼─────────────┼─────────────┤
│    Down     │     34      │     41      │
├─────────────┼─────────────┼─────────────┤
│     Up      │     153     │     147     │
└─────────────┴─────────────┴─────────────┘


In [24]:
false_positive(cm)

153

In [25]:
MLJ.accuracy(cm)

0.4826666666666667

In [26]:
MLJ.accuracy(ŷ, y[test])

0.4826666666666667

In [27]:
precision(ŷ, y[test])

0.49

In [28]:
recall(ŷ, y[test])

0.7819148936170213

In [29]:
f1score(ŷ, y[test])

0.6024590163934427
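
The metrics above can all be recomputed by hand from the four counts in the confusion matrix. The sketch below treats `Up` as the positive class, which is consistent with `false_positive(cm)` returning 153. The variable names are chosen for the example.

```julia
# Counts read off the confusion matrix above, with "Up" as the positive class.
tp = 147  # predicted Up,   truth Up
fp = 153  # predicted Up,   truth Down
fn = 41   # predicted Down, truth Up
tn = 34   # predicted Down, truth Down

acc  = (tp + tn) / (tp + fp + fn + tn)   # share of correct predictions
prec = tp / (tp + fp)                    # how many predicted Up were Up
rec  = tp / (tp + fn)                    # how many true Up were found
f1   = 2 * prec * rec / (prec + rec)     # harmonic mean of the two

println((acc, prec, rec, f1))
```

The high recall together with a precision near 0.5 reflects that the model predicts `Up` for 300 of the 375 test rows.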