# Logistic Regression

## Using GLM

In [1]:
using GLM
using Statistics
using RDatasets
using MLDataUtils

### Load data

In [2]:
data = RDatasets.dataset("ISLR", "Default")
first(data, 6)

Row,Default,Student,Balance,Income
,Categorical…,Categorical…,Float64,Float64
1,No,No,729.526,44361.6
2,No,Yes,817.18,12106.1
3,No,No,1073.55,31767.1
4,No,No,529.251,35704.5
5,No,No,785.656,38463.5
6,No,Yes,919.589,7491.56


### Preprocessing

In [3]:
isyes(x) = x == "Yes" ? 1.0 : 0.0

data[!, :DefaultNum] = isyes.(data[!, :Default])
data[!, :StudentNum] = isyes.(data[!, :Student])
first(data, 6)

Row,Default,Student,Balance,Income,DefaultNum,StudentNum
,Categorical…,Categorical…,Float64,Float64,Float64,Float64
1,No,No,729.526,44361.6,0.0,0.0
2,No,Yes,817.18,12106.1,0.0,1.0
3,No,No,1073.55,31767.1,0.0,0.0
4,No,No,529.251,35704.5,0.0,0.0
5,No,No,785.656,38463.5,0.0,0.0
6,No,Yes,919.589,7491.56,0.0,1.0


### Training/Testing set

In [4]:
indices = MLDataUtils.shuffleobs(collect(1:nrow(data)))
train_ind, test_ind = MLDataUtils.splitobs(indices, at=0.8);
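
The split above relies on MLDataUtils. As a rough sketch of the same idea using only the Random standard library, the permutation can be shuffled with a seeded generator and cut at the requested fraction. The helper name `split_indices`, the seed, and the toy size of 10 are illustrative assumptions, not part of the notebook.

```julia
using Random

# Hypothetical helper (not an MLDataUtils function): shuffle 1:n with a
# seeded RNG and cut the permutation at the requested fraction.
function split_indices(n::Integer; at::Real = 0.8, seed::Integer = 42)
    perm = shuffle(MersenneTwister(seed), 1:n)   # shuffled copy of 1:n
    cut = round(Int, at * n)                     # number of training rows
    return perm[1:cut], perm[cut+1:end]
end

# Toy example with 10 rows: 8 go to training, 2 to testing.
train_idx, test_idx = split_indices(10; at = 0.8)
```

Fixing the seed makes the split repeatable between runs, which the unseeded `shuffleobs` call does not guarantee.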

In [5]:
train = data[train_ind, :]
test = data[test_ind, :]

Row,Default,Student,Balance,Income,DefaultNum,StudentNum
,Categorical…,Categorical…,Float64,Float64,Float64,Float64
1,No,No,1698.07,48595.7,0.0,0.0
2,No,Yes,242.463,23413.4,0.0,1.0
3,No,No,1164.51,29874.5,0.0,0.0
4,No,Yes,743.415,19610.2,0.0,1.0
5,No,No,430.651,38372.0,0.0,0.0
6,No,No,274.957,24102.4,0.0,0.0
7,No,Yes,1017.36,21702.2,0.0,1.0
8,No,Yes,1115.11,19169.1,0.0,1.0
9,No,No,379.1,31350.9,0.0,0.0
10,No,Yes,688.748,26662.4,0.0,1.0


### Model

In [6]:
logreg = glm(@formula(DefaultNum ~ Balance + Income), train, Binomial(), LogitLink())

StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Binomial{Float64},LogitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

DefaultNum ~ 1 + Balance + Income

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────
                 Estimate   Std. Error    z value  Pr(>|z|)     Lower 95%     Upper 95%
───────────────────────────────────────────────────────────────────────────────────────
(Intercept)  -11.5229      0.481352     -23.9386     <1e-99  -12.4663      -10.5795
Balance        0.00561236  0.000250602   22.3955     <1e-99    0.00512119    0.00610353
Income         2.26011e-5  5.53166e-6     4.08576    <1e-4     1.17592e-5    3.34429e-5
───────────────────────────────────────────────────────────────────────────────────────

### Prediction

In [7]:
pred = predict(logreg, test)

2000-element Array{Union{Missing, Float64},1}:
 0.2902000935151865
 6.552844637396135e-5
 0.013228644051809268
 0.0009993643744540825
 0.000264157767550555
 7.987082349190347e-5
 0.0048562096819437985
 0.007913064828352651
 0.00016878683175409925
 0.0008624935188341505
 0.0028663715480199115
 0.008382779196333909
 0.007460248098994489
 ⋮
 0.0009132297169298889
 0.0002537232429431653
 0.020315487844089864
 0.00012577658442680816
 0.00842067591367221
 7.704417770134006e-5
 0.1606254333649271
 0.0011134564986545641
 8.253681488091501e-5
 0.0006940973924595191
 0.0008464402079546009
 0.08663579139007269
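
To make the link between the coefficient table and these probabilities concrete, the first prediction can be recomputed by hand. This is a minimal sketch that copies the rounded coefficients from the model output and the first test row (Balance = 1698.07, Income = 48595.7) shown earlier; the variable names are illustrative.

```julia
# Logistic (sigmoid) function: maps a linear predictor to a probability.
sigmoid(z) = 1 / (1 + exp(-z))

# Coefficients copied (rounded) from the fitted model's coefficient table.
b_intercept, b_balance, b_income = -11.5229, 0.00561236, 2.26011e-5

# First row of the test set printed above.
balance, income = 1698.07, 48595.7

# Linear predictor, then the probability of default.
p_default = sigmoid(b_intercept + b_balance * balance + b_income * income)
```

Because the coefficients are rounded, `p_default` should only agree with the first entry of `pred` (about 0.29) to a few decimal places.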

### Validation

In [8]:
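# True when the prediction, thresholded at 0.5, matches the observed label.
# Named `iscorrect` rather than `error` to avoid shadowing Base.error.
iscorrect(x, y) = ((x > 0.5) ? 1.0 : 0.0) == y
accuracy(xs, ys) = mean(iscorrect.(xs, ys))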

accuracy (generic function with 1 method)

In [9]:
accuracy(pred, test[!, :DefaultNum])

0.974
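
An accuracy figure is easier to judge next to a trivial baseline. Every row visible in the tables above is a "No", which suggests defaults may be rare, so always predicting the majority class could already score highly. The sketch below uses a hypothetical helper `majority_baseline` and made-up toy labels, not the notebook's data.

```julia
using Statistics

# Accuracy obtained by always predicting the most common label.
# Assumes labels are encoded as 0.0 / 1.0, as in the DefaultNum column.
function majority_baseline(ys)
    share_positive = mean(ys .== 1.0)
    return max(share_positive, 1 - share_positive)
end

# Toy labels for illustration only: 1 positive among 20 observations.
toy_labels = vcat(ones(1), zeros(19))
baseline = majority_baseline(toy_labels)
```

On the notebook's data the equivalent call would be `majority_baseline(test[!, :DefaultNum])`; the model is only informative to the extent that its accuracy exceeds that value.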

## Using MLJ

In [10]:
using MLJ



### Load data

In [11]:
smarket = dataset("ISLR", "Smarket")
first(smarket, 6)

Row,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Categorical…
1,2001.0,0.381,-0.192,-2.624,-1.055,5.01,1.1913,0.959,Up
2,2001.0,0.959,0.381,-0.192,-2.624,-1.055,1.2965,1.032,Up
3,2001.0,1.032,0.959,0.381,-0.192,-2.624,1.4112,-0.623,Down
4,2001.0,-0.623,1.032,0.959,0.381,-0.192,1.276,0.614,Up
5,2001.0,0.614,-0.623,1.032,0.959,0.381,1.2057,0.213,Up
6,2001.0,0.213,0.614,-0.623,1.032,0.959,1.3491,1.392,Up


### Casting scientific types

In [12]:
y, X = unpack(smarket, ==(:Direction), colname -> true);
X = select(X, Not([:Year, :Today]))
first(X, 6)

Row,Lag1,Lag2,Lag3,Lag4,Lag5,Volume
,Float64,Float64,Float64,Float64,Float64,Float64
1,0.381,-0.192,-2.624,-1.055,5.01,1.1913
2,0.959,0.381,-0.192,-2.624,-1.055,1.2965
3,1.032,0.959,0.381,-0.192,-2.624,1.4112
4,-0.623,1.032,0.959,0.381,-0.192,1.276
5,0.614,-0.623,1.032,0.959,0.381,1.2057
6,0.213,0.614,-0.623,1.032,0.959,1.3491


In [13]:
y = coerce(y, OrderedFactor)
classes(y[1])

2-element CategoricalArray{String,1,UInt8}:
 "Down"
 "Up"

### Training/testing set

In [14]:
train, test = partition(eachindex(y), 0.7, shuffle=true)

([1169, 1222, 230, 287, 624, 808, 1068, 125, 820, 1072  …  429, 877, 209, 314, 745, 530, 501, 271, 785, 37], [183, 541, 1122, 1015, 1114, 50, 841, 458, 470, 307  …  720, 1020, 270, 448, 985, 315, 921, 1183, 988, 562])

### Model

In [15]:
model = @load LogisticClassifier pkg=MLJLinearModels

LogisticClassifier(
    lambda = 1.0,
    gamma = 0.0,
    penalty = :l2,
    fit_intercept = true,
    penalize_intercept = false,
    solver = nothing,
    multi_class = false,
    nclasses = 2) @ 1…04

In [16]:
match = machine(model, X, y)

Machine{LogisticClassifier} @ 1…32


### Training

In [17]:
fit!(match, rows=train)

┌ Info: Training Machine{LogisticClassifier} @ 1…32.
└ @ MLJBase /home/yuehhua/.julia/packages/MLJBase/O5b6j/src/machines.jl:187


Machine{LogisticClassifier} @ 1…32


### Predict

In [18]:
ŷ = predict_mode(match, rows=test)

375-element CategoricalArray{String,1,UInt8}:
 "Up"
 "Down"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Down"
 "Down"
 "Up"
 "Up"
 "Up"
 "Up"
 ⋮
 "Down"
 "Up"
 "Up"
 "Up"
 "Down"
 "Up"
 "Up"
 "Up"
 "Down"
 "Up"
 "Up"
 "Up"

### Evaluation

In [19]:
ŷ = MLJ.predict(match, rows=test)

375-element Array{UnivariateFinite{String,UInt8,Float64},1}:
 UnivariateFinite(Down=>0.482, Up=>0.518)
 UnivariateFinite(Down=>0.513, Up=>0.487)
 UnivariateFinite(Down=>0.45, Up=>0.55)
 UnivariateFinite(Down=>0.465, Up=>0.535)
 UnivariateFinite(Down=>0.473, Up=>0.527)
 UnivariateFinite(Down=>0.418, Up=>0.582)
 UnivariateFinite(Down=>0.469, Up=>0.531)
 UnivariateFinite(Down=>0.511, Up=>0.489)
 UnivariateFinite(Down=>0.514, Up=>0.486)
 UnivariateFinite(Down=>0.465, Up=>0.535)
 UnivariateFinite(Down=>0.492, Up=>0.508)
 UnivariateFinite(Down=>0.483, Up=>0.517)
 UnivariateFinite(Down=>0.492, Up=>0.508)
 ⋮
 UnivariateFinite(Down=>0.506, Up=>0.494)
 UnivariateFinite(Down=>0.461, Up=>0.539)
 UnivariateFinite(Down=>0.491, Up=>0.509)
 UnivariateFinite(Down=>0.478, Up=>0.522)
 UnivariateFinite(Down=>0.5, Up=>0.5)
 UnivariateFinite(Down=>0.449, Up=>0.551)
 UnivariateFinite(Down=>0.477, Up=>0.523)
 UnivariateFinite(Down=>0.446, Up=>0.554)
 UnivariateFinite(Down=>0.503, Up=>0.497)
 UnivariateFinite(

In [20]:
cross_entropy(ŷ, y[test]) |> mean

0.6978154594195113
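
For context, the mean cross-entropy is the average of the negative log of the probability assigned to the observed class. A constant 50/50 prediction scores log(2) ≈ 0.693, so the value above is slightly worse than a coin flip by this measure. The following is a minimal sketch with a hypothetical helper `mean_log_loss` and toy inputs; it is not MLJ's implementation.

```julia
using Statistics

# Mean cross-entropy for a binary problem.
# `p_up`  : predicted probability of "Up" for each observation
# `is_up` : true where the observed label is "Up"
function mean_log_loss(p_up, is_up)
    p_true = ifelse.(is_up, p_up, 1 .- p_up)   # probability given to the observed class
    return mean(-log.(p_true))
end

# A model that always answers 50/50 scores log(2), whatever the labels are.
coin_flip = mean_log_loss(fill(0.5, 4), [true, false, true, false])
```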

### Evaluation methods

In [21]:
ŷ = predict_mode(match, rows=test)

375-element CategoricalArray{String,1,UInt8}:
 "Up"
 "Down"
 "Up"
 "Up"
 "Up"
 "Up"
 "Up"
 "Down"
 "Down"
 "Up"
 "Up"
 "Up"
 "Up"
 ⋮
 "Down"
 "Up"
 "Up"
 "Up"
 "Down"
 "Up"
 "Up"
 "Up"
 "Down"
 "Up"
 "Up"
 "Up"

In [22]:
misclassification_rate(ŷ, y[test])

0.4826666666666667

In [23]:
cm = confusion_matrix(ŷ, y[test])

              ┌───────────────────────────┐
              │       Ground Truth        │
┌─────────────┼─────────────┬─────────────┤
│  Predicted  │    Down     │     Up      │
├─────────────┼─────────────┼─────────────┤
│    Down     │     40      │     37      │
├─────────────┼─────────────┼─────────────┤
│     Up      │     144     │     154     │
└─────────────┴─────────────┴─────────────┘


In [24]:
false_positive(cm)

144

In [25]:
MLJ.accuracy(cm)

0.5173333333333333

In [26]:
MLJ.accuracy(ŷ, y[test])

0.5173333333333333

In [27]:
precision(ŷ, y[test])

0.5167785234899329

In [28]:
recall(ŷ, y[test])

0.806282722513089

In [29]:
f1score(ŷ, y[test])

0.6298568507157464
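
The scores above can be cross-checked against the confusion matrix. The sketch below copies the four counts from that matrix and treats "Up" as the positive class, which is consistent with `false_positive(cm)` returning 144; the variable names are illustrative.

```julia
# Counts copied from the confusion matrix, with "Up" as the positive class.
tn, fn = 40, 37      # predicted Down: truly Down, truly Up
fp, tp = 144, 154    # predicted Up:   truly Down, truly Up

acc  = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
prec = tp / (tp + fp)                    # correct among predicted "Up"
rec  = tp / (tp + fn)                    # found among truly "Up"
f1   = 2 * prec * rec / (prec + rec)     # harmonic mean of precision and recall
```

These should reproduce the accuracy, precision, recall and F1 values reported by MLJ. The high recall next to near-chance precision reflects a model that predicts "Up" for most rows.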