## Classification

This Jupyter notebook provides the code and the results for least-squares classification.

In [19]:
# Randomly split `data` into a training set (fraction `at` of the rows) and a test set (the rest)
function partitionTrainTest(data, at = 0.7)
    n = nrow(data)
    idx = shuffle(1:n)
    train_idx = view(idx, 1:floor(Int, at*n))
    test_idx = view(idx, (floor(Int, at*n)+1):n)
    data[train_idx,:], data[test_idx,:]
end

partitionTrainTest (generic function with 2 methods)

The code loads the iris dataset, which Ronald Fisher introduced in his 1936 paper. The data contain 3 classes, and each sample has 4 features.

In [20]:
using Flux, LinearAlgebra, RDatasets, Random
using Flux: onehotbatch

# set iris dataset
iris = dataset("datasets", "iris")

Row,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
,Float64,Float64,Float64,Float64,Cat…
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5.0,3.6,1.4,0.2,setosa
6,5.4,3.9,1.7,0.4,setosa
7,4.6,3.4,1.4,0.3,setosa
8,5.0,3.4,1.5,0.2,setosa
9,4.4,2.9,1.4,0.2,setosa
10,4.9,3.1,1.5,0.1,setosa


Create numeric labels. The label of "setosa" is 1, "versicolor" is 2, and "virginica" is 3.

In [21]:
# set labels
labels = []

for i in 1:length(iris[:, :Species])
    if iris[i,5] == "setosa"
        append!(labels, [1])
    end
    
    if iris[i,5] == "versicolor"
        append!(labels, [2])
    end
    
    if iris[i,5] == "virginica"
        append!(labels, [3])
    end
end
labels =Int.(labels)
iris = [iris labels]

Row,SepalLength,SepalWidth,PetalLength,PetalWidth,Species,x1
,Float64,Float64,Float64,Float64,Cat…,Int64
1,5.1,3.5,1.4,0.2,setosa,1
2,4.9,3.0,1.4,0.2,setosa,1
3,4.7,3.2,1.3,0.2,setosa,1
4,4.6,3.1,1.5,0.2,setosa,1
5,5.0,3.6,1.4,0.2,setosa,1
6,5.4,3.9,1.7,0.4,setosa,1
7,4.6,3.4,1.4,0.3,setosa,1
8,5.0,3.4,1.5,0.2,setosa,1
9,4.4,2.9,1.4,0.2,setosa,1
10,4.9,3.1,1.5,0.1,setosa,1
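
The same mapping can be written as a dictionary lookup instead of three `if` blocks. The following is a minimal sketch, assuming the species names are available as plain strings (a categorical column may first need converting, for example with `String.(...)`). The names `speciesToLabel` and `species` are illustrative and do not appear in the notebook.

```julia
# Hypothetical lookup table from species name to numeric label.
speciesToLabel = Dict("setosa" => 1, "versicolor" => 2, "virginica" => 3)

# A few example species names (made up for illustration).
species = ["setosa", "virginica", "versicolor", "setosa"]

# Map every name to its label in one pass; the result is a Vector{Int}.
labels = [speciesToLabel[s] for s in species]
println(labels)
```

A lookup like this raises a `KeyError` on an unexpected species name, whereas the loop would silently skip that row.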


Create the training and test data sets. The training set is a randomly selected 70% of the original dataset; the remaining 30% forms the test set.

In [22]:
#create train and test data
train, test = partitionTrainTest(iris, 0.7) # 70% data is used for train
trainData = convert(Matrix,train[:,[:SepalLength,:SepalWidth,:PetalLength,:PetalWidth]])
trainLabels = vec(convert(Matrix, train[:,[:x1]]))
testData = convert(Matrix,test[:,[:SepalLength,:SepalWidth,:PetalLength,:PetalWidth]])
testLabels = vec(convert(Matrix, test[:,[:x1]]))
nTrain = length(trainLabels)
nTest = length(testLabels);
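
Because the split is random, each run of the notebook produces a different training set and therefore a different accuracy. The following sketch shows one way to make the split repeatable by seeding the random number generator. The helper `splitIndices` and its `seed` keyword are hypothetical and not part of the notebook; the sketch works on row indices only, so it does not need the data itself.

```julia
using Random

# Hypothetical helper: split the indices 1:n into a training part (fraction `at`)
# and a test part, using a seeded generator so the split is repeatable.
function splitIndices(n, at = 0.7; seed = 0)
    rng = MersenneTwister(seed)
    idx = shuffle(rng, 1:n)          # random permutation of the row indices
    cut = floor(Int, at * n)         # number of training rows
    return idx[1:cut], idx[cut+1:end]
end

trainIdx, testIdx = splitIndices(150, 0.7)
println(length(trainIdx), " training rows, ", length(testIdx), " test rows")
```

Calling `splitIndices` again with the same `seed` returns the same two index vectors, which can then be used to index the rows of the data.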

105 observations are randomly selected from the original dataset as the training data.

In [23]:
trainData

105×4 Array{Float64,2}:
 5.5  2.6  4.4  1.2
 5.4  3.7  1.5  0.2
 6.3  2.5  4.9  1.5
 4.8  3.0  1.4  0.3
 4.4  3.2  1.3  0.2
 5.4  3.4  1.7  0.2
 6.7  3.1  5.6  2.4
 5.5  2.4  3.8  1.1
 5.0  3.4  1.5  0.2
 5.2  3.4  1.4  0.2
 6.1  2.9  4.7  1.4
 4.8  3.4  1.6  0.2
 7.7  2.6  6.9  2.3
 ⋮              
 7.1  3.0  5.9  2.1
 6.2  2.2  4.5  1.5
 4.8  3.0  1.4  0.1
 4.4  2.9  1.4  0.2
 6.4  3.1  5.5  1.8
 5.6  2.5  3.9  1.1
 6.5  3.0  5.5  1.8
 6.4  2.9  4.3  1.3
 5.7  2.8  4.1  1.3
 4.9  3.1  1.5  0.2
 5.1  3.7  1.5  0.4
 4.6  3.1  1.5  0.2

The remaining 45 observations form the test data.

In [24]:
testData

45×4 Array{Float64,2}:
 7.2  3.6  6.1  2.5
 6.1  2.8  4.0  1.3
 5.0  2.0  3.5  1.0
 7.2  3.0  5.8  1.6
 6.3  2.5  5.0  1.9
 5.5  3.5  1.3  0.2
 5.8  2.7  3.9  1.2
 6.9  3.1  5.4  2.1
 6.3  2.3  4.4  1.3
 6.7  3.3  5.7  2.5
 5.6  2.8  4.9  2.0
 4.9  3.0  1.4  0.2
 6.0  2.2  4.0  1.0
 ⋮              
 6.4  3.2  4.5  1.5
 6.3  3.3  6.0  2.5
 6.3  3.4  5.6  2.4
 6.0  2.2  5.0  1.5
 6.0  3.0  4.8  1.8
 4.9  3.1  1.5  0.1
 4.8  3.1  1.6  0.2
 5.1  3.8  1.9  0.4
 7.3  2.9  6.3  1.8
 4.7  3.2  1.6  0.2
 5.1  3.5  1.4  0.2
 5.9  3.0  5.1  1.8

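Set up the design matrix $A$, where $m$ is the number of training samples and $n$ is the number of features. The leading column of ones accounts for the intercept.

$A =\begin{bmatrix}
    1 & x_{1,1} & \dots & x_{1,n} \\
    \vdots & \vdots & \ddots & \vdots \\
    1 & x_{m,1} & \dots & x_{m,n}
    \end{bmatrix}$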

In [25]:
A = [ones(nTrain) trainData]

105×5 Array{Float64,2}:
 1.0  5.5  2.6  4.4  1.2
 1.0  5.4  3.7  1.5  0.2
 1.0  6.3  2.5  4.9  1.5
 1.0  4.8  3.0  1.4  0.3
 1.0  4.4  3.2  1.3  0.2
 1.0  5.4  3.4  1.7  0.2
 1.0  6.7  3.1  5.6  2.4
 1.0  5.5  2.4  3.8  1.1
 1.0  5.0  3.4  1.5  0.2
 1.0  5.2  3.4  1.4  0.2
 1.0  6.1  2.9  4.7  1.4
 1.0  4.8  3.4  1.6  0.2
 1.0  7.7  2.6  6.9  2.3
 ⋮                   
 1.0  7.1  3.0  5.9  2.1
 1.0  6.2  2.2  4.5  1.5
 1.0  4.8  3.0  1.4  0.1
 1.0  4.4  2.9  1.4  0.2
 1.0  6.4  3.1  5.5  1.8
 1.0  5.6  2.5  3.9  1.1
 1.0  6.5  3.0  5.5  1.8
 1.0  6.4  2.9  4.3  1.3
 1.0  5.7  2.8  4.1  1.3
 1.0  4.9  3.1  1.5  0.2
 1.0  5.1  3.7  1.5  0.4
 1.0  4.6  3.1  1.5  0.2

Compute the pseudo-inverse of $A$, which equals $A^{\dagger}=(A^TA)^{-1}A^T$ when $A$ has full column rank.

In [26]:
Adag = pinv(A)

5×105 Array{Float64,2}:
  0.156989   -0.108147    0.0503434    0.0721648  …  -0.0545528    0.103429
 -0.0548856   0.0212511   0.0198314    0.0235398      0.00279506  -0.0212085
  0.0126652   0.0147667  -0.0426759   -0.0360498      0.0259901    0.00306035
  0.0685796  -0.0099469   0.00177102  -0.0345982     -0.0184733    0.0166305
 -0.103734   -0.0114755  -0.0274498    0.0339607      0.0318316   -0.0350109
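
The pseudo-inverse gives the least-squares solution directly: the coefficients that minimize $\|A\beta - y\|^2$ are $\beta = A^{\dagger}y$. The following sketch checks this on a small made-up matrix (the values of `A` and `y` are assumptions for illustration). It compares `pinv` with the closed form and with Julia's backslash solver.

```julia
using LinearAlgebra

# Toy design matrix: a column of ones plus one feature (values are made up).
A = [1.0 2.0;
     1.0 3.0;
     1.0 5.0]
y = [1.0, 2.0, 4.0]

Adag = pinv(A)                   # SVD-based pseudo-inverse
AdagNormal = inv(A' * A) * A'    # closed form, requires full column rank

println(Adag ≈ AdagNormal)       # both routes give the same matrix
println(Adag * y ≈ A \ y)        # same least-squares coefficients as the backslash solver
```

In practice `pinv(A)` or `A \ y` is usually preferred over forming `inv(A' * A)` explicitly, since the explicit inverse is more sensitive to rounding when the columns of `A` are nearly dependent.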

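Build the classifier. One least-squares model is fitted per class, with target $+1$ for samples in that class and $-1$ for all others. An input is assigned to the class whose model gives the largest score.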

In [27]:
tfPM(x) = x ? +1 : -1
yDat(k) = tfPM.(onehotbatch(trainLabels,1:3)'[:,k])
bets = [Adag*yDat(k) for k in 1:3]
classify(input) = findmax([([1 ; input])'*bets[k] for k in 1:3])[2]

classify (generic function with 1 method)
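
The one-model-per-class idea can be seen on a tiny example. The following sketch uses three made-up points, one per class, and builds the ±1 targets with a comparison instead of `onehotbatch`. The names `X`, `Y`, `B`, and `classifyToy` are illustrative and not part of the notebook.

```julia
using LinearAlgebra

# Toy data: three points, one per class (values are made up).
X = [0.0 0.0;
     4.0 0.0;
     0.0 4.0]
labels = [1, 2, 3]

A = [ones(size(X, 1)) X]                              # design matrix with intercept column
Y = [l == k ? 1.0 : -1.0 for l in labels, k in 1:3]   # ±1 targets, one column per class
B = pinv(A) * Y                                       # one coefficient column per class

# Score the input against every class and return the index of the largest score.
classifyToy(x) = argmax(B' * [1.0; x])

println(classifyToy([3.5, 0.5]))   # a point close to the class-2 example
```

Stacking the three target vectors as columns of `Y` computes all three coefficient vectors with one matrix product, which is equivalent to the three separate products in the notebook.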

Perform prediction with the least-squares classifier.

In [28]:
predictions = [classify(testData[k,:]) for k in 1:nTest]
confusionMatrix = [sum((predictions .== i) .& (testLabels .== j)) for i in 1:3, j in 1:3]
accuracy = sum(diag(confusionMatrix))/nTest
println("Accuracy: ", accuracy, "\nConfusion Matrix:")
show(stdout, "text/plain", confusionMatrix)

Accuracy: 0.8444444444444444
Confusion Matrix:
3×3 Array{Int64,2}:
 16   0   0
  0  10   4
  0   3  12
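
In this confusion matrix, rows are predicted classes and columns are true classes, so the off-diagonal entries show that all errors are confusions between class 2 (versicolor) and class 3 (virginica). The following sketch derives per-class recall and precision from the matrix printed above. The names `recallPerClass` and `precisionPerClass` are illustrative and not part of the notebook.

```julia
using LinearAlgebra

# Confusion matrix from the run above: rows are predicted classes, columns are true classes.
C = [16  0  0;
      0 10  4;
      0  3 12]

accuracy = sum(diag(C)) / sum(C)

# Recall: correctly classified samples divided by the number of true samples per class.
recallPerClass = diag(C) ./ vec(sum(C, dims = 1))

# Precision: correctly classified samples divided by the number of predictions per class.
precisionPerClass = diag(C) ./ vec(sum(C, dims = 2))

println("Accuracy: ", accuracy)
println("Recall: ", recallPerClass)
println("Precision: ", precisionPerClass)
```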

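The whole pipeline is repeated below in a single cell. Because the train/test split is random, the accuracy differs from the previous run.

In [42]: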
using Flux, LinearAlgebra, RDatasets, Random
using Flux: onehotbatch

# set iris dataset
iris = dataset("datasets", "iris")

# set labels
labels = []

for i in 1:length(iris[:, :Species])
    if iris[i,5] == "setosa"
        append!(labels, [1])
    end
    
    if iris[i,5] == "versicolor"
        append!(labels, [2])
    end
    
    if iris[i,5] == "virginica"
        append!(labels, [3])
    end
end
labels =Int.(labels)
iris = [iris labels]


#create train and test data
train, test = partitionTrainTest(iris, 0.7) # 70% data is used for train
trainData = convert(Matrix,train[:,[:SepalLength,:SepalWidth,:PetalLength,:PetalWidth]])
trainLabels = vec(convert(Matrix, train[:,[:x1]]))
testData = convert(Matrix,test[:,[:SepalLength,:SepalWidth,:PetalLength,:PetalWidth]])
testLabels = vec(convert(Matrix, test[:,[:x1]]))
nTrain = length(trainLabels)
nTest = length(testLabels);

A = [ones(nTrain) trainData]
Adag = pinv(A)


tfPM(x) = x ? +1 : -1
yDat(k) = tfPM.(onehotbatch(trainLabels,1:3)'[:,k])
bets = [Adag*yDat(k) for k in 1:3]
classify(input) = findmax([([1 ; input])'*bets[k] for k in 1:3])[2]

predictions = [classify(testData[k,:]) for k in 1:nTest]
confusionMatrix = [sum((predictions .== i) .& (testLabels .== j)) for i in 1:3, j in 1:3]
accuracy = sum(diag(confusionMatrix))/nTest
println("Accuracy: ", accuracy, "\nConfusion Matrix:")
show(stdout, "text/plain", confusionMatrix)

Accuracy: 0.8222222222222222
Confusion Matrix:
3×3 Array{Int64,2}:
 16   0  0
  0  12  5
  0   3  9