# Machine Learning packages

### Yueh-Hua Tu

## Linear Regression

In [1]:
using GLM
using StatsBase
using RDatasets
using MLDataUtils

### Load data

In [2]:
data = RDatasets.dataset("datasets", "mtcars");
first(data, 6)

| | Model | MPG | Cyl | Disp | HP | DRat | WT | QSec | VS | AM | Gear | Carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String⍰ | Float64⍰ | Int64⍰ | Float64⍰ | Int64⍰ | Float64⍰ | Float64⍰ | Float64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ |
| 1 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 2 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 3 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 4 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 5 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |
| 6 | Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |


### Training/Testing set

In [3]:
indices = MLDataUtils.shuffleobs(collect(1:nrow(data)))
train_ind, test_ind = MLDataUtils.splitobs(indices, at = 0.8);
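
The same shuffle-and-split idea can be sketched with only the `Random` standard library, which also makes it easy to fix a seed for reproducibility. The seed, the hard-coded row count, and the rounding rule for the cut point are assumptions made for illustration; `splitobs` may choose the cut differently.

```julia
using Random

# Hypothetical stand-in for nrow(data); mtcars has 32 rows.
n = 32

# Fixing the seed makes the split reproducible across runs.
rng = MersenneTwister(42)
shuffled = shuffle(rng, collect(1:n))

# The first 80% of the shuffled indices go to training, the rest to testing.
cut = round(Int, 0.8 * n)
train_idx = shuffled[1:cut]
test_idx = shuffled[cut+1:end]

println(length(train_idx), " training rows, ", length(test_idx), " test rows")
```

With 32 rows this yields 26 training rows and 6 test rows.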

In [4]:
train = data[train_ind, :]
test = data[test_ind, :]

| | Model | MPG | Cyl | Disp | HP | DRat | WT | QSec | VS | AM | Gear | Carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String⍰ | Float64⍰ | Int64⍰ | Float64⍰ | Int64⍰ | Float64⍰ | Float64⍰ | Float64⍰ | Int64⍰ | Int64⍰ | Int64⍰ | Int64⍰ |
| 1 | Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.17 | 14.5 | 0 | 1 | 5 | 4 |
| 2 | Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.84 | 15.41 | 0 | 0 | 3 | 4 |
| 3 | Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.78 | 18.6 | 1 | 1 | 4 | 2 |
| 4 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 5 | Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
| 6 | Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.7 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |


### Model

In [5]:
ols = GLM.lm(@formula(MPG ~ Cyl + Disp + HP + DRat + WT + QSec + VS + AM + Gear + Carb), train)



StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Array{Float64,1}},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: MPG ~ 1 + Cyl + Disp + HP + DRat + WT + QSec + VS + AM + Gear + Carb

Coefficients:
                Estimate Std.Error   t value Pr(>|t|)
(Intercept)      14.6607    24.549  0.597201   0.5593
Cyl            -0.174004   1.41193 -0.123239   0.9036
Disp         -0.00537269 0.0157801 -0.340472   0.7382
HP             0.0189388 0.0232905  0.813156   0.4288
DRat             2.32819   1.68151   1.38459   0.1864
WT              -1.49635   1.93572 -0.773022   0.4515
QSec            0.126962  0.707465   0.17946   0.8600
VS               1.92495   1.78979   1.07552   0.2991
AM               3.64723   1.93173   1.88806   0.0785
Gear            0.759327   2.12377  0.357537   0.7257
Carb            -1.85788  0.893982  -2.07821   0.0553


### Prediction

In [6]:
predict(ols, test)



6-element Array{Union{Missing, Float64},1}:
 23.31752756476571 
 15.769348961336535
 28.043149474588315
 29.75211257491646 
 11.840815642125223
 24.96785130133965 

### Validation

In [7]:
GLM.r²(ols)

0.9397740339791928
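
`GLM.r²(ols)` is computed from the rows the model was fitted on, so it says little about how well the model generalizes. A common complement is the root-mean-square error on held-out rows. The sketch below uses placeholder vectors; `rmse`, `y_true`, and `y_pred` are hypothetical names, and in the notebook the inputs would be the `MPG` column of `test` and the output of `predict(ols, test)`.

```julia
using Statistics

# Root-mean-square error between observed and predicted values.
rmse(y, yhat) = sqrt(mean((y .- yhat) .^ 2))

# Placeholder numbers, not results from the notebook.
y_true = [21.0, 18.5, 30.2]
y_pred = [20.0, 19.5, 29.2]

println("RMSE: ", rmse(y_true, y_pred))
```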

## Logistic Regression

### Load data

In [8]:
data = RDatasets.dataset("ISLR", "Default");
first(data, 6)

| | Default | Student | Balance | Income |
|---|---|---|---|---|
| | Categorical… | Categorical… | Float64 | Float64 |
| 1 | No | No | 729.526 | 44361.6 |
| 2 | No | Yes | 817.18 | 12106.1 |
| 3 | No | No | 1073.55 | 31767.1 |
| 4 | No | No | 529.251 | 35704.5 |
| 5 | No | No | 785.656 | 38463.5 |
| 6 | No | Yes | 919.589 | 7491.56 |


### Preprocessing

In [9]:
isyes(x) = x == "Yes" ? 1.0 : 0.0

data[:DefaultNum] = isyes.(data[:Default])
data[:StudentNum] = isyes.(data[:Student])
first(data, 6)

| | Default | Student | Balance | Income | DefaultNum | StudentNum |
|---|---|---|---|---|---|---|
| | Categorical… | Categorical… | Float64 | Float64 | Float64 | Float64 |
| 1 | No | No | 729.526 | 44361.6 | 0.0 | 0.0 |
| 2 | No | Yes | 817.18 | 12106.1 | 0.0 | 1.0 |
| 3 | No | No | 1073.55 | 31767.1 | 0.0 | 0.0 |
| 4 | No | No | 529.251 | 35704.5 | 0.0 | 0.0 |
| 5 | No | No | 785.656 | 38463.5 | 0.0 | 0.0 |
| 6 | No | Yes | 919.589 | 7491.56 | 0.0 | 1.0 |


### Training/Testing set

In [10]:
indices = MLDataUtils.shuffleobs(collect(1:nrow(data)))
train_ind, test_ind = MLDataUtils.splitobs(indices, at=0.8);

In [11]:
train = data[train_ind, :]
test = data[test_ind, :]

| | Default | Student | Balance | Income | DefaultNum | StudentNum |
|---|---|---|---|---|---|---|
| | Categorical… | Categorical… | Float64 | Float64 | Float64 | Float64 |
| 1 | No | Yes | 561.392 | 21747.3 | 0.0 | 1.0 |
| 2 | No | No | 476.15 | 58837.7 | 0.0 | 0.0 |
| 3 | No | No | 306.19 | 30587.0 | 0.0 | 0.0 |
| 4 | No | Yes | 880.282 | 20177.4 | 0.0 | 1.0 |
| 5 | Yes | No | 1278.41 | 36675.6 | 1.0 | 0.0 |
| 6 | No | Yes | 1036.32 | 14108.5 | 0.0 | 1.0 |
| 7 | No | No | 0.0 | 50231.4 | 0.0 | 0.0 |
| 8 | No | No | 913.587 | 46907.2 | 0.0 | 0.0 |
| 9 | No | No | 380.95 | 36943.4 | 0.0 | 0.0 |
| 10 | No | No | 1351.85 | 40178.0 | 0.0 | 0.0 |


### Model

In [12]:
logreg = glm(@formula(DefaultNum ~ Balance + Income), train, Binomial(), LogitLink())

StatsModels.DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Binomial{Float64},LogitLink},DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: DefaultNum ~ 1 + Balance + Income

Coefficients:
               Estimate   Std.Error  z value Pr(>|z|)
(Intercept)     -11.618    0.496266 -23.4109   <1e-99
Balance      0.00568553 0.000259296  21.9268   <1e-99
Income       2.02646e-5  5.66888e-6  3.57472   0.0004


### Prediction

In [13]:
pred = predict(logreg, test)

2000-element Array{Union{Missing, Float64},1}:
 0.00034024396710588675
 0.0004443174293703553 
 9.54004846127362e-5   
 0.0020167221292853167 
 0.02643235506192927   
 0.0043204448555003525 
 2.4913076173515666e-5 
 0.004180092664907543  
 0.0001659814546915922 
 0.04237785420866679   
 0.039262745252264315  
 0.006382969565056107  
 0.02349885745801304   
 ⋮                     
 0.00012742497366052106
 0.0004536984361026593 
 0.002103658535463094  
 0.137446797248739     
 0.00018146944715101502
 3.993538240581394e-5  
 0.005705433158083141  
 0.0008590887889198012 
 0.04190369964691215   
 0.09246229319645737   
 2.5941133788426722e-5 
 0.6394205785962401    
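
Each predicted value is the logistic function applied to the linear combination of the fitted coefficients and the predictors. As a rough check, the sketch below plugs the first test row (Balance 561.392, Income 21747.3) into the rounded coefficients from the table; the helper name `sigmoid` and the variable names are illustrative. The result is about 3.4e-4, in line with the first element of `pred` up to rounding of the coefficients.

```julia
# Logistic (sigmoid) function: maps a linear score to a probability in (0, 1).
sigmoid(z) = 1 / (1 + exp(-z))

# Coefficients copied (rounded) from the fitted model's coefficient table.
intercept = -11.618
b_balance = 0.00568553
b_income = 2.02646e-5

# First row of the test set shown earlier.
balance, income = 561.392, 21747.3

z = intercept + b_balance * balance + b_income * income
p = sigmoid(z)
println("linear score = ", z, ", P(default) = ", p)
```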

### Validation

In [14]:
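# A prediction counts as correct when thresholding the probability at 0.5 reproduces the label.
iscorrect(p, y) = ((p > 0.5) ? 1.0 : 0.0) == y
accuracy(ps, ys) = sum(iscorrect.(ps, ys)) / length(ps)
accuracy(pred, test[:DefaultNum])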

0.969
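
Accuracy alone can flatter a classifier when one class is rare, and defaults are rare here (most rows shown above are `No`): a model that always answers "No" would also score high. Counting the four outcome types separately gives a clearer picture. The sketch below uses placeholder vectors; `confusion_counts` is a hypothetical helper, not part of GLM.

```julia
# Count true/false positives and negatives at a given probability threshold.
function confusion_counts(probs, truth; threshold = 0.5)
    predicted = probs .> threshold
    actual = truth .== 1.0
    tp = count(predicted .& actual)
    fp = count(predicted .& .!actual)
    fn = count(.!predicted .& actual)
    tn = count(.!predicted .& .!actual)
    return (tp = tp, fp = fp, fn = fn, tn = tn)
end

# Placeholder data, not results from the notebook.
probs = [0.9, 0.2, 0.7, 0.1, 0.4]
truth = [1.0, 0.0, 0.0, 0.0, 1.0]

c = confusion_counts(probs, truth)
prec = c.tp / (c.tp + c.fp)   # share of predicted defaults that were real
rec = c.tp / (c.tp + c.fn)    # share of real defaults that were caught
println(c, "  precision = ", prec, "  recall = ", rec)
```

In the notebook the inputs would be `pred` and the `DefaultNum` column of `test`.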

## Naive Bayes

In [15]:
using NaiveBayes



### Load data

In [16]:
iris = dataset("datasets", "iris")
first(iris, 6)

| | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Categorical… |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |


### Training/Testing set

In [17]:
indices = MLDataUtils.shuffleobs(collect(1:nrow(iris)))
train_ind, test_ind = MLDataUtils.splitobs(indices, at=0.8);

In [18]:
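train = iris[train_ind, :]
test = iris[test_ind, :]
# Transpose so that rows are features and columns are samples;
# the trailing [:, :] copies the lazy transpose into a plain Matrix.
train_X = Matrix(train[:, 1:4])'[:, :]
train_y = Vector(train[:Species])
test_X = Matrix(test[:, 1:4])'[:, :]
test_y = Vector(test[:Species]);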

### Model

In [19]:
model = GaussianNB(unique(train[:Species]), 4)
fit(model, train_X, train_y)

GaussianNB(Dict("virginica"=>36,"versicolor"=>42,"setosa"=>42))
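
To make the idea behind a Gaussian naive Bayes classifier concrete, here is a minimal from-scratch sketch of the textbook formulation: one mean and variance per class and feature, with features treated as independent. It is not a description of how NaiveBayes.jl is implemented; the function names `fit_gnb`, `log_score`, and `predict_gnb` and the tiny data set are made up for illustration.

```julia
using Statistics

# Estimate per-class mean, variance, and log-prior.
# X is features × samples (the layout of train_X); y has one label per column.
function fit_gnb(X, y)
    params = Dict{eltype(y), Any}()
    for c in unique(y)
        Xc = X[:, y .== c]
        params[c] = (mu = vec(mean(Xc, dims = 2)),
                     sigma2 = vec(var(Xc, dims = 2)) .+ 1e-9,  # small floor avoids division by zero
                     logprior = log(size(Xc, 2) / length(y)))
    end
    return params
end

# Log-prior plus the sum of per-feature Gaussian log-densities.
function log_score(p, x)
    return p.logprior + sum(@. -0.5 * log(2pi * p.sigma2) - (x - p.mu)^2 / (2 * p.sigma2))
end

# Pick the class with the highest score for one sample x.
function predict_gnb(params, x)
    best_class, best_score = nothing, -Inf
    for (c, p) in params
        s = log_score(p, x)
        if s > best_score
            best_class, best_score = c, s
        end
    end
    return best_class
end

# Tiny synthetic data set: 2 features, 6 samples, 2 classes.
X = [1.0 1.2 0.8 5.0 5.2 4.8;
     2.0 1.9 2.1 7.0 7.1 6.9]
y = ["a", "a", "a", "b", "b", "b"]

params = fit_gnb(X, y)
println(predict_gnb(params, [1.1, 2.0]), " ", predict_gnb(params, [5.1, 7.0]))
```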

### Validation

In [20]:
acc = sum(predict(model, test_X) .== test_y) / length(test_y)
println("Accuracy: $acc")

Accuracy: 1.0


## Decision Tree and Random Forest

In [21]:
using DecisionTree



### Load data

In [22]:
features, labels = DecisionTree.load_data("iris");

### Casting

In [23]:
features = float.(features)
labels = string.(labels)

150-element Array{String,1}:
 "Iris-setosa"   
 "Iris-setosa"   
 "Iris-setosa"   
 "Iris-setosa"   
 "Iris-setosa"   
 "Iris-setosa"   
 "Iris-setosa"   
 "Iris-setosa"   
 "Iris-setosa"   
 "Iris-setosa"   
 "Iris-setosa"   
 "Iris-setosa"   
 "Iris-setosa"   
 ⋮               
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"

### Model

In [24]:
model = DecisionTree.DecisionTreeClassifier(max_depth=2)

DecisionTreeClassifier
max_depth:                2
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  nothing
root:                     nothing

Available models:

* `DecisionTreeClassifier`
* `DecisionTreeRegressor`
* `RandomForestClassifier`
* `RandomForestRegressor`
* `AdaBoostStumpClassifier`
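
The other models in this list are used the same way as `DecisionTreeClassifier`. As a sketch, a random forest could be trained like this; the variable names `forest` and `sample` and the choice of `n_trees = 10` are illustrative, and the block assumes DecisionTree.jl is installed.

```julia
using DecisionTree

# Same data preparation as in the notebook cells above.
features, labels = DecisionTree.load_data("iris")
features = float.(features)
labels = string.(labels)

# n_trees is set explicitly; 10 is an arbitrary choice for illustration.
forest = DecisionTree.RandomForestClassifier(n_trees = 10)
DecisionTree.fit!(forest, features, labels)

# Predict the class of a single made-up flower.
sample = [5.9, 3.0, 5.1, 1.9]
println(DecisionTree.predict(forest, sample))
```

Because each tree is trained on a random subsample, the fitted forest can differ from run to run.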

In [25]:
?DecisionTreeClassifier

search: DecisionTreeClassifier



```
DecisionTreeClassifier(; pruning_purity_threshold=0.0,
                       max_depth::Int=-1,
                       min_samples_leaf::Int=1,
                       min_samples_split::Int=2,
                       min_purity_increase::Float=0.0,
                       n_subfeatures::Int=0,
                       rng=Random.GLOBAL_RNG)
```

Decision tree classifier. See [DecisionTree.jl's documentation](https://github.com/bensadeghi/DecisionTree.jl)

Hyperparameters:

  * `pruning_purity_threshold`: (post-pruning) merge leaves having `>=thresh` combined purity (default: no pruning)
  * `max_depth`: maximum depth of the decision tree (default: no maximum)
  * `min_samples_leaf`: the minimum number of samples each leaf needs to have (default: 1)
  * `min_samples_split`: the minimum number of samples needed for a split (default: 2)
  * `min_purity_increase`: minimum purity needed for a split (default: 0.0)
  * `n_subfeatures`: number of features to select at random (default: keep all)
  * `rng`: the random number generator to use. Can be an `Int`, which will be used to seed and create a new random number generator.

Implements `fit!`, `predict`, `predict_proba`, `get_classes`


### Training

In [26]:
DecisionTree.fit!(model, features, labels)

DecisionTreeClassifier
max_depth:                2
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
root:                     Decision Tree
Leaves: 3
Depth:  2

#### Pretty-print the tree, to a depth of 5 nodes

In [27]:
DecisionTree.print_tree(model, 5)

Feature 3, Threshold 2.45
L-> Iris-setosa : 50/50
R-> Feature 4, Threshold 1.75
    L-> Iris-versicolor : 49/54
    R-> Iris-virginica : 45/46


### Prediction

In [28]:
new_iris = [5.9, 3.0, 5.1, 1.9]
DecisionTree.predict(model, new_iris)

"Iris-virginica"

In [29]:
DecisionTree.predict_proba(model, new_iris)

3-element Array{Float64,1}:
 0.0                 
 0.021739130434782608
 0.9782608695652174  
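
These probabilities can be traced back to the printed tree. The sketch below writes the tree out by hand, assuming a sample goes left when its feature value is below the threshold and that the class order is setosa, versicolor, virginica. Each leaf returns the class proportions implied by the counts in the printout (45/46 means 45 of the 46 training samples in that leaf are virginica). The function name `leaf_proportions` is hypothetical.

```julia
# Assumed class order of the probability vector.
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Hand-written version of the printed tree.
# Feature 3 is petal length, feature 4 is petal width.
function leaf_proportions(x)
    if x[3] < 2.45
        return [50 / 50, 0.0, 0.0]      # L-> Iris-setosa : 50/50
    elseif x[4] < 1.75
        return [0.0, 49 / 54, 5 / 54]   # L-> Iris-versicolor : 49/54
    else
        return [0.0, 1 / 46, 45 / 46]   # R-> Iris-virginica : 45/46
    end
end

sample = [5.9, 3.0, 5.1, 1.9]
probs = leaf_proportions(sample)
println(classes[argmax(probs)], " ", probs)
```

For this sample both comparisons fail (5.1 is not below 2.45 and 1.9 is not below 1.75), so it lands in the right-most leaf, giving 1/46 ≈ 0.0217 for versicolor and 45/46 ≈ 0.978 for virginica.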

#### Class ordering of the columns in `predict_proba`'s output

In [30]:
DecisionTree.get_classes(model)

3-element Array{String,1}:
 "Iris-setosa"    
 "Iris-versicolor"
 "Iris-virginica" 

### Save model

In [31]:
using JLD2

In [32]:
@save "models/decision-tree.jld2" model

# Packages

## Distributions

* Distributions.jl

## Regression

* Lasso.jl
    * Ridge regression
    * LASSO regression
    * ElasticNet
* LARS.jl
    * Least angle regression
    * L1-regularized linear regression
* Isotonic.jl
    * Linear PAVA (fastest)
    * Pooled PAVA (slower)
    * Active Set (slowest)

## Clustering

* Clustering.jl
    * K-means
    * K-medoids
    * Affinity Propagation
    * Density-based spatial clustering of applications with noise (DBSCAN)
    * Markov Clustering Algorithm (MCL)
    * Fuzzy C-Means Clustering
    * Hierarchical Clustering
        * Single Linkage
        * Average Linkage
        * Complete Linkage
        * Ward's Linkage

## Dimensionality Reduction

* MultivariateStats.jl
    * Data Whitening
    * Principal Components Analysis (PCA)
    * Canonical Correlation Analysis (CCA)
    * Classical Multidimensional Scaling (MDS)
    * Linear Discriminant Analysis (LDA)
    * Multiclass LDA
    * Independent Component Analysis (ICA), FastICA
    * Probabilistic PCA
    * Factor Analysis
    * Kernel PCA

## Dimensionality Reduction (continued)

* NMF.jl
    * Lee & Seung's Multiplicative Update (for both MSE & Divergence objectives)
    * (Naive) Projected Alternating Least Squares
    * ALS Projected Gradient Methods
    * Random Initialization
    * NNDSVD Initialization

## Kernel Density Estimation

* KernelDensity.jl

## Time Series Analysis

* TimeSeries.jl

# Go to deep learning!