# Machine Learning in Julia: Decision Trees with DecisionTree

This example uses the DecisionTree and ScikitLearn packages; install them before running the code below.

```
] add DecisionTree
] add ScikitLearn
```

In [2]:
import Pkg
Pkg.add(["DecisionTree", "ScikitLearn"])

[32m[1m  Resolving[22m[39m package versions...
[32m[1m  Installed[22m[39m OpenBLAS_jll ──────── v0.3.9+4
[32m[1m  Installed[22m[39m ProgressMeter ─────── v1.2.0
[32m[1m  Installed[22m[39m ArrayLayouts ──────── v0.2.6
[32m[1m  Installed[22m[39m ElasticArrays ─────── v1.1.0
[32m[1m  Installed[22m[39m JLD2 ──────────────── v0.1.13
[32m[1m  Installed[22m[39m DecisionTree ──────── v0.10.1
[32m[1m  Installed[22m[39m Documenter ────────── v0.24.11
[32m[1m  Installed[22m[39m ScikitLearn ───────── v0.5.1
[32m[1m  Installed[22m[39m ZygoteRules ───────── v0.2.0
[32m[1m  Installed[22m[39m PyPlot ────────────── v2.9.0
[32m[1m  Installed[22m[39m NBInclude ─────────── v2.2.0
[32m[1m  Installed[22m[39m ElasticPDMats ─────── v0.2.1
[32m[1m  Installed[22m[39m IRTools ───────────── v0.3.2
[32m[1m  Installed[22m[39m LaTeXStrings ──────── v1.1.0
[32m[1m  Installed[22m[39m GaussianMixtures ──── v0.3.1
[32m[1m  Installed[22m[39m FastGaussQua

In [3]:
using DecisionTree
using ScikitLearn.CrossValidation: cross_val_score

┌ Info: Precompiling DecisionTree [7806a523-6efd-50cb-b5f6-3fa6f1930dbb]
└ @ Base loading.jl:1260
┌ Info: Precompiling ScikitLearn [3646fa90-6ef7-5e7e-9f22-8aca16db6324]
└ @ Base loading.jl:1260


## ScikitLearn.jl API
DecisionTree.jl supports the ScikitLearn.jl interface and algorithms (cross-validation, hyperparameter tuning, pipelines, etc.).

Available models: `DecisionTreeClassifier`, `DecisionTreeRegressor`, `RandomForestClassifier`, `RandomForestRegressor`, `AdaBoostStumpClassifier`. See each model's help (e.g., `?DecisionTreeRegressor` at the REPL) for more information.

## Loading the Data

In [4]:
features, labels = DecisionTree.load_data("iris");

In [5]:
typeof(features)

Array{Any,2}

In [6]:
typeof(labels)

Array{Any,1}

## Casting

In [7]:
# the data loaded are of type Array{Any}
# cast them to concrete types for better performance
features = float.(features)
labels = string.(labels)

150-element Array{String,1}:
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 ⋮
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"

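To make the effect of the cast concrete, here is a minimal sketch on a tiny hand-made array. The names `raw_features`, `raw_labels`, `typed_features`, and `typed_labels` and the values are illustrative assumptions, not the data returned by `load_data`.

```julia
# Hypothetical stand-in for freshly loaded data: the element type is Any.
raw_features = Any[5.1 3.5; 4.9 3.0]
raw_labels = Any["Iris-setosa", "Iris-setosa"]

# Broadcasting float and string over the arrays yields concrete element types.
typed_features = float.(raw_features)
typed_labels = string.(raw_labels)

println(eltype(raw_features), " -> ", eltype(typed_features))  # Any -> Float64
println(eltype(raw_labels), " -> ", eltype(typed_labels))      # Any -> String
```
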
## Decision Tree Model

In [8]:
# create a depth-truncated classifier
model = DecisionTree.DecisionTreeClassifier(max_depth=2)

DecisionTreeClassifier
max_depth:                2
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  nothing
root:                     nothing

Available models:

* `DecisionTreeClassifier`
* `DecisionTreeRegressor`
* `RandomForestClassifier`
* `RandomForestRegressor`
* `AdaBoostStumpClassifier`

In [9]:
typeof(model)

DecisionTreeClassifier

In [12]:
dump(model)

DecisionTreeClassifier
  pruning_purity_threshold: Float64 1.0
  max_depth: Int64 2
  min_samples_leaf: Int64 1
  min_samples_split: Int64 2
  min_purity_increase: Float64 0.0
  n_subfeatures: Int64 0
  rng: Random._GLOBAL_RNG Random._GLOBAL_RNG()
  root: Nothing nothing
  classes: Nothing nothing


In [13]:
Base.show_supertypes(DecisionTreeClassifier)

DecisionTreeClassifier <: ScikitLearnBase.BaseClassifier <: ScikitLearnBase.BaseEstimator <: Any

## Training

In [14]:
DecisionTree.fit!(model, features, labels)

DecisionTreeClassifier
max_depth:                2
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
root:                     Decision Tree
Leaves: 3
Depth:  2

## Printing the Decision Tree

In [15]:
# pretty print of the tree, to a depth of 5 nodes (optional)
DecisionTree.print_tree(model, 5)

Feature 3, Threshold 2.45
L-> Iris-setosa : 50/50
R-> Feature 4, Threshold 1.75
    L-> Iris-versicolor : 49/54
    R-> Iris-virginica : 45/46
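
One way to read the printed tree is to write its rules out by hand. The sketch below assumes that the left branch (`L->`) is taken when the feature value is strictly less than the threshold, which is my understanding of DecisionTree.jl's behaviour rather than something the output states. The function name `classify_by_hand` is hypothetical.

```julia
# Hand-written version of the depth-2 tree printed above.
# Assumption: the left branch is taken when the feature value < threshold.
function classify_by_hand(x::AbstractVector{<:Real})
    if x[3] < 2.45          # Feature 3, Threshold 2.45
        return "Iris-setosa"
    elseif x[4] < 1.75      # Feature 4, Threshold 1.75
        return "Iris-versicolor"
    else
        return "Iris-virginica"
    end
end

println(classify_by_hand([5.9, 3.0, 5.1, 1.9]))
```

For the sample `[5.9, 3.0, 5.1, 1.9]`, feature 3 is 5.1 (not below 2.45) and feature 4 is 1.9 (not below 1.75), so the rules end in the `Iris-virginica` leaf.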


## Prediction

In [16]:
# apply learned model
new_iris = [5.9, 3.0, 5.1, 1.9]
DecisionTree.predict(model, new_iris)

"Iris-virginica"

In [17]:
# get the probability of each label
DecisionTree.predict_proba(model, new_iris)

3-element Array{Float64,1}:
 0.0
 0.021739130434782608
 0.9782608695652174
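
These probabilities appear to be the class proportions in the leaf the sample falls into. The printed tree reports that leaf as `Iris-virginica : 45/46`. The sketch below redoes that arithmetic, assuming the one remaining sample in the leaf is `Iris-versicolor`, which is what the reported probabilities imply. The variable names are illustrative.

```julia
# Leaf reached by the sample: 46 training samples, 45 of them Iris-virginica.
leaf_total = 46
virginica = 45
# Assumption: the single remaining sample in the leaf is Iris-versicolor.
versicolor = leaf_total - virginica

# Order follows the class list: setosa, versicolor, virginica.
proba = [0, versicolor, virginica] ./ leaf_total
println(proba)
```
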

## Classes Corresponding to `predict_proba`

In [18]:
# returns the ordering of the columns in predict_proba's output
println(get_classes(model))
DecisionTree.get_classes(model)

["Iris-setosa", "Iris-versicolor", "Iris-virginica"]


3-element Array{String,1}:
 "Iris-setosa"
 "Iris-versicolor"
 "Iris-virginica"

## Cross-Validation

In [19]:
# run 3-fold cross validation
# See ScikitLearn.jl for installation instructions
# using ScikitLearn.CrossValidation: cross_val_score
accuracy = cross_val_score(model, features, labels, cv=3)

3-element Array{Float64,1}:
 0.9607843137254902
 0.9019607843137255
 0.9791666666666666
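
The per-fold scores are usually summarised as a mean and a spread. A minimal sketch using only the `Statistics` standard library, with the fold accuracies copied from the output above (the variable name `fold_accuracy` is illustrative):

```julia
using Statistics

# Fold accuracies copied from the cross_val_score output above.
fold_accuracy = [0.9607843137254902, 0.9019607843137255, 0.9791666666666666]

println("mean accuracy:    ", round(mean(fold_accuracy), digits=3))
println("std across folds: ", round(std(fold_accuracy), digits=3))
```
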

## Random Forest Model

In [20]:
model = DecisionTree.RandomForestClassifier(n_trees=50, max_depth=2)

RandomForestClassifier
n_trees:             50
n_subfeatures:       -1
partial_sampling:    0.7
max_depth:           2
min_samples_leaf:    1
min_samples_split:   2
min_purity_increase: 0.0
classes:             nothing
ensemble:            nothing

## Training

In [21]:
DecisionTree.fit!(model, features, labels)

RandomForestClassifier
n_trees:             50
n_subfeatures:       -1
partial_sampling:    0.7
max_depth:           2
min_samples_leaf:    1
min_samples_split:   2
min_purity_increase: 0.0
classes:             ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
ensemble:            Ensemble of Decision Trees
Trees:      50
Avg Leaves: 3.16
Avg Depth:  2.0

## Prediction

In [22]:
new_iris = [5.9, 3.0, 5.1, 1.9]
DecisionTree.predict(model, new_iris)

"Iris-virginica"

## Cross-Validation

In [23]:
accuracy = cross_val_score(model, features, labels, cv=5)

5-element Array{Float64,1}:
 0.9333333333333333
 0.9666666666666667
 0.9
 0.9333333333333333
 1.0

## Native API

### Classification Example
#### Decision Tree Classifier

In [24]:
# train full-tree classifier
model = build_tree(labels, features)

# prune tree: merge leaves having >= 90% combined purity (default: 100%)
model = prune_tree(model, 0.9)

# pretty print of the tree, to a depth of 5 nodes (optional)
print_tree(model, 5)

# apply learned model
apply_tree(model, [5.9,3.0,5.1,1.9])

# get the probability of each label
apply_tree_proba(model, [5.9,3.0,5.1,1.9], ["Iris-setosa", "Iris-versicolor", "Iris-virginica"])

# run 3-fold cross validation (pruning_purity is left at its default of 1.0, so the trees are not pruned)
n_folds=3
accuracy = nfoldCV_tree(labels, features, n_folds)

Feature 4, Threshold 0.8
L-> Iris-setosa : 50/50
R-> Feature 4, Threshold 1.75
    L-> Feature 3, Threshold 4.95
        L-> Iris-versicolor : 47/48
        R-> Feature 4, Threshold 1.55
            L-> Iris-virginica : 3/3
            R-> Feature 1, Threshold 6.95
                L-> Iris-versicolor : 2/2
                R-> Iris-virginica : 1/1
    R-> Feature 3, Threshold 4.85
        L-> Feature 2, Threshold 3.1
            L-> Iris-virginica : 2/2
            R-> Iris-versicolor : 1/1
        R-> Iris-virginica : 43/43

3×3 Array{Int64,2}:
 17   0   0
  0  18   0
  0   0  15

3×3 Array{Int64,2}:
 13   0   0
  0  17   0
  0   0  20

3×3 Array{Int64,2}:
 20   0   0
  0  15   0
  0   0  15


Fold 1
Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 1.0
Kappa:    1.0

Fold 2
Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 1.0
Kappa:    1.0

Fold 3
Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 1.0
Kappa:    1.0

Mean Accuracy: 1.0


3-element Array{Float64,1}:
 1.0
 1.0
 1.0

In [25]:
# set of classification parameters and respective default values
# pruning_purity: purity threshold used for post-pruning (default: 1.0, no pruning)
# max_depth: maximum depth of the decision tree (default: -1, no maximum)
# min_samples_leaf: the minimum number of samples each leaf needs to have (default: 1)
# min_samples_split: the minimum number of samples needed for a split (default: 2)
# min_purity_increase: minimum purity needed for a split (default: 0.0)
# n_subfeatures: number of features to select at random (default: 0, keep all)
n_subfeatures = 0;
max_depth = -1;
min_samples_leaf = 1;
min_samples_split = 2;
min_purity_increase = 0.0;
pruning_purity = 1.0

model    =   build_tree(labels, features,
                        n_subfeatures,
                        max_depth,
                        min_samples_leaf,
                        min_samples_split,
                        min_purity_increase)

accuracy = nfoldCV_tree(labels, features,
                        n_folds,
                        pruning_purity,
                        max_depth,
                        min_samples_leaf,
                        min_samples_split,
                        min_purity_increase)

3×3 Array{Int64,2}:
 11   0   0
  0  20   0
  0   0  19

3×3 Array{Int64,2}:
 13   0   0
  0  17   0
  0   0  20

3×3 Array{Int64,2}:
 26   0   0
  0  13   0
  0   0  11


Fold 1
Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 1.0
Kappa:    1.0

Fold 2
Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 1.0
Kappa:    1.0

Fold 3
Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 1.0
Kappa:    1.0

Mean Accuracy: 1.0


3-element Array{Float64,1}:
 1.0
 1.0
 1.0

#### Random Forest Classifier

In [26]:
# train random forest classifier
# using 2 random features, 10 trees, 0.5 portion of samples per tree, and a maximum tree depth of 6
model = build_forest(labels, features, 2, 10, 0.5, 6)

# apply learned model
apply_forest(model, [5.9,3.0,5.1,1.9])

# get the probability of each label
apply_forest_proba(model, [5.9,3.0,5.1,1.9], ["Iris-setosa", "Iris-versicolor", "Iris-virginica"])

# run 3-fold cross validation for forests, using 2 random features per split
n_folds=3;
n_subfeatures=2;
accuracy = nfoldCV_forest(labels, features, n_folds, n_subfeatures)

3×3 Array{Int64,2}:
 16   0   0
  0  14   1
  0   0  19

3×3 Array{Int64,2}:
 18   0   0
  0  15   0
  0   1  16

3×3 Array{Int64,2}:
 16   0   0
  0  20   0
  0   1  13


Fold 1
Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 0.98
Kappa:    0.9697702539298669

Fold 2
Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 0.98
Kappa:    0.969951923076923

Fold 3
Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 0.98
Kappa:    0.9695493300852619

Mean Accuracy: 0.98


3-element Array{Float64,1}:
 0.98
 0.98
 0.98
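
The accuracy and kappa reported for each fold can be recomputed from the confusion matrices. The sketch below uses the standard definition of Cohen's kappa and assumes that the first matrix printed above belongs to Fold 1, since the notebook shows the matrices before the fold summaries. The function name `cohen_kappa` is hypothetical and not part of DecisionTree.jl.

```julia
using LinearAlgebra

# Cohen's kappa for a square confusion matrix of counts.
function cohen_kappa(cm::AbstractMatrix{<:Integer})
    n = sum(cm)
    observed = tr(cm) / n                          # fraction of correct predictions
    row_totals = vec(sum(cm, dims=2))
    col_totals = vec(sum(cm, dims=1))
    expected = dot(row_totals, col_totals) / n^2   # agreement expected by chance
    return (observed - expected) / (1 - expected)
end

# First confusion matrix from the output above (assumed to be Fold 1).
fold1 = [16 0 0; 0 14 1; 0 0 19]
println("accuracy: ", tr(fold1) / sum(fold1))
println("kappa:    ", cohen_kappa(fold1))
```

Under that assumption the result agrees with the `Accuracy: 0.98` and `Kappa: 0.9697...` lines reported for Fold 1.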

In [27]:
# set of classification parameters and respective default values
# n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
# n_trees: number of trees to train (default: 10)
# partial_sampling: fraction of samples to train each tree on (default: 0.7)
# max_depth: maximum depth of the decision trees (default: no maximum)
# min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
# min_samples_split: the minimum number of samples needed for a split (default: 2)
# min_purity_increase: minimum purity needed for a split (default: 0.0)
n_subfeatures = -1;
n_trees = 10;
partial_sampling = 0.7;
max_depth = -1;
min_samples_leaf = 5;
min_samples_split = 2;
min_purity_increase = 0.0;

model    =   build_forest(labels, features,
                          n_subfeatures,
                          n_trees,
                          partial_sampling,
                          max_depth,
                          min_samples_leaf,
                          min_samples_split,
                          min_purity_increase)

accuracy = nfoldCV_forest(labels, features,
                          n_folds,
                          n_subfeatures,
                          n_trees,
                          partial_sampling,
                          max_depth,
                          min_samples_leaf,
                          min_samples_split,
                          min_purity_increase)

3×3 Array{Int64,2}:
 18   0   0
  0  15   2
  0   0  15

3×3 Array{Int64,2}:
 16   0   0
  0  11   0
  0   2  21

3×3 Array{Int64,2}:
 16   0   0
  0  19   3
  0   0  12


Fold 1
Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 0.96
Kappa:    0.9399759903961584

Fold 2
Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 0.96
Kappa:    0.938195302843016

Fold 3
Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 0.94
Kappa:    0.9088699878493316

Mean Accuracy: 0.9533333333333333


3-element Array{Float64,1}:
 0.96
 0.96
 0.94

#### Adaptive-Boosted Decision Stumps Classifier

In [28]:
# train adaptive-boosted stumps, using 7 iterations
model, coeffs = build_adaboost_stumps(labels, features, 7);

# apply learned model
apply_adaboost_stumps(model, coeffs, [5.9,3.0,5.1,1.9])

# get the probability of each label
apply_adaboost_stumps_proba(model, coeffs, [5.9,3.0,5.1,1.9], ["Iris-setosa", "Iris-versicolor", "Iris-virginica"])

# run 3-fold cross validation for boosted stumps, using 7 iterations
n_iterations=7;
n_folds=3
accuracy = nfoldCV_stumps(labels, features,
                          n_folds,
                          n_iterations)

3×3 Array{Int64,2}:
 18   0   0
  0  10   3
  0   1  18

3×3 Array{Int64,2}:
 13   0   0
  0  22   1
  0   2  12

3×3 Array{Int64,2}:
 19   0   0
  0  14   0
  0   3  14


Fold 1
Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 0.92
Kappa:    0.8776009791921667

Fold 2
Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 0.94
Kappa:    0.9060738885410143

Fold 3
Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 0.94
Kappa:    0.909801563439567

Mean Accuracy: 0.9333333333333332


3-element Array{Float64,1}:
 0.92
 0.94
 0.94

### Regression Example

In [29]:
n, m = 10^3, 5
features = randn(n, m)
weights = rand(-2:2, m)
labels = features * weights

1000-element Array{Float64,1}:
 -2.3940256977680154
 -4.535785063986726
 -3.971110993337532
  0.5265605327422795
  1.0311315746395038
  5.600370918786503
  2.4528376945533896
 -2.447322497472193
  1.805591436389902
  2.722688592534743
  0.3388062179053337
 -9.818812566894554
 -5.089166858411414
  ⋮
  1.686803642122198
  1.4950737070560025
 -1.271427251847347
  2.8710074291145715
 -0.7968302719081265
  0.06826435959750815
 -2.266895243073727
 -3.0240479712785917
  4.105311969174998
 -3.5789904048676626
 -5.284592717156557
  4.094758312294372

#### Regression Tree

In [30]:
# train regression tree
model = build_tree(labels, features)

# apply learned model
apply_tree(model, [-0.9,3.0,5.1,1.9,0.0])

# run 3-fold cross validation, returns array of coefficients of determination (R^2)
n_folds = 3
r2 = nfoldCV_tree(labels, features, n_folds)


Fold 1
Mean Squared Error:     0.6441944163339781
Correlation Coeff:      0.9699778491090388
Coeff of Determination: 0.9404479695269051

Fold 2
Mean Squared Error:     0.7253914321857626
Correlation Coeff:      0.9658514809419773
Coeff of Determination: 0.9324125701676048

Fold 3
Mean Squared Error:     0.7999629568075756
Correlation Coeff:      0.9636917041932589
Coeff of Determination: 0.928261798446073

Mean Coeff of Determination: 0.933707446046861


3-element Array{Float64,1}:
 0.9404479695269051
 0.9324125701676048
 0.928261798446073
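
For reference, the two main figures in the fold summaries can be computed by hand. The sketch below uses the textbook definitions of mean squared error and the coefficient of determination. I assume these match what the library reports, and the helper names `mse` and `r_squared` as well as the sample values are hypothetical.

```julia
using Statistics

# Mean squared error between true values and predictions.
mse(actual, predicted) = mean((actual .- predicted) .^ 2)

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
r_squared(actual, predicted) =
    1 - sum((actual .- predicted) .^ 2) / sum((actual .- mean(actual)) .^ 2)

# Hypothetical values, chosen only to keep the arithmetic easy to follow.
actual    = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

println("MSE: ", mse(actual, predicted))
println("R^2: ", r_squared(actual, predicted))
```

Here the residuals are -0.5, 0, 0.5, and 0, so the MSE is 0.125; the total sum of squares around the mean 2.5 is 5.0, which gives an R² of 1 - 0.5/5.0 = 0.9.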

In [31]:
# set of regression parameters and respective default values
# pruning_purity: purity threshold used for post-pruning (default: 1.0, no pruning)
# max_depth: maximum depth of the decision tree (default: -1, no maximum)
# min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
# min_samples_split: the minimum number of samples needed for a split (default: 2)
# min_purity_increase: minimum purity needed for a split (default: 0.0)
# n_subfeatures: number of features to select at random (default: 0, keep all)
n_subfeatures = 0;
max_depth = -1;
min_samples_leaf = 5;
min_samples_split = 2;
min_purity_increase = 0.0;
pruning_purity = 1.0

model = build_tree(labels, features,
                   n_subfeatures,
                   max_depth,
                   min_samples_leaf,
                   min_samples_split,
                   min_purity_increase)

r2 =  nfoldCV_tree(labels, features,
                   n_folds,
                   pruning_purity,
                   max_depth,
                   min_samples_leaf,
                   min_samples_split,
                   min_purity_increase)


Fold 1
Mean Squared Error:     0.7385604014875171
Correlation Coeff:      0.9638652920570201
Coeff of Determination: 0.9287663505443762

Fold 2
Mean Squared Error:     0.7451621629078825
Correlation Coeff:      0.9626734042963007
Coeff of Determination: 0.9253999239971102

Fold 3
Mean Squared Error:     0.6831476590817647
Correlation Coeff:      0.9721502195510533
Coeff of Determination: 0.944765437405164

Mean Coeff of Determination: 0.9329772373155502


3-element Array{Float64,1}:
 0.9287663505443762
 0.9253999239971102
 0.944765437405164

#### Regression Random Forest

In [32]:
# train regression forest, using 2 random features, 10 trees,
# 0.7 portion of samples per tree, and a maximum tree depth of 5
model = build_forest(labels, features, 2, 10, 0.7, 5)

# apply learned model
apply_forest(model, [-0.9,3.0,5.1,1.9,0.0])

# run 3-fold cross validation on regression forest, using 2 random features per split
n_subfeatures=2;
n_folds=3;
r2 = nfoldCV_forest(labels, features, n_folds, n_subfeatures)


Fold 1
Mean Squared Error:     0.8059708239371419
Correlation Coeff:      0.9736564432531492
Coeff of Determination: 0.9297075445233678

Fold 2
Mean Squared Error:     0.711891006710938
Correlation Coeff:      0.9739435917200108
Coeff of Determination: 0.9305895157465476

Fold 3
Mean Squared Error:     0.8100033506368693
Correlation Coeff:      0.9696441162160488
Coeff of Determination: 0.9259403706933763

Mean Coeff of Determination: 0.9287458103210972


3-element Array{Float64,1}:
 0.9297075445233678
 0.9305895157465476
 0.9259403706933763

In [33]:
# set of regression build_forest() parameters and respective default values
# n_subfeatures: number of features to consider at random per split (default: -1, sqrt(# features))
# n_trees: number of trees to train (default: 10)
# partial_sampling: fraction of samples to train each tree on (default: 0.7)
# max_depth: maximum depth of the decision trees (default: no maximum)
# min_samples_leaf: the minimum number of samples each leaf needs to have (default: 5)
# min_samples_split: the minimum number of samples needed for a split (default: 2)
# min_purity_increase: minimum purity needed for a split (default: 0.0)
n_subfeatures = -1;
n_trees = 10;
partial_sampling = 0.7;
max_depth=-1;
min_samples_leaf = 5;
min_samples_split = 2;
min_purity_increase = 0.0;

model = build_forest(labels, features,
                     n_subfeatures,
                     n_trees,
                     partial_sampling,
                     max_depth,
                     min_samples_leaf,
                     min_samples_split,
                     min_purity_increase)

r2 =  nfoldCV_forest(labels, features,
                     n_folds,
                     n_subfeatures,
                     n_trees,
                     partial_sampling,
                     max_depth,
                     min_samples_leaf,
                     min_samples_split,
                     min_purity_increase)


Fold 1
Mean Squared Error:     0.797472585598311
Correlation Coeff:      0.9655158673504437
Coeff of Determination: 0.9180006428603329

Fold 2
Mean Squared Error:     0.8835021203457081
Correlation Coeff:      0.9711007836797497
Coeff of Determination: 0.9214695443029738

Fold 3
Mean Squared Error:     0.832325609247293
Correlation Coeff:      0.970444581313212
Coeff of Determination: 0.9268171831112857

Mean Coeff of Determination: 0.9220957900915309


3-element Array{Float64,1}:
 0.9180006428603329
 0.9214695443029738
 0.9268171831112857

### Saving Models
Models can be saved to disk and loaded back using the JLD2.jl package.

In [36]:
Pkg.add("JLD2")

[32m[1m  Resolving[22m[39m package versions...
[32m[1m   Updating[22m[39m `C:\Users\kai\.julia\environments\v1.4\Project.toml`
 [90m [033835bb][39m[92m + JLD2 v0.1.13[39m
[32m[1m   Updating[22m[39m `C:\Users\kai\.julia\environments\v1.4\Manifest.toml`
[90m [no changes][39m


In [37]:
using JLD2
@save "model_file.jld2" model

┌ Info: Precompiling JLD2 [033835bb-8acc-5ee8-8aae-3f567f8a3819]
└ @ Base loading.jl:1260
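
Loading is the mirror image of saving. The sketch below is a round trip with JLD2's `@save` and `@load` macros, assuming JLD2 is installed. To stay self-contained it saves a plain `Dict` as a stand-in for a trained model and writes to a temporary directory; the stand-in value and the `path` variable are illustrative assumptions.

```julia
using JLD2

# Stand-in for a trained model; a fitted tree or forest would be saved the same way.
model = Dict("max_depth" => 2, "classes" => ["Iris-setosa", "Iris-versicolor"])

# Write to a temporary directory so the example leaves no files behind in the project.
path = joinpath(mktempdir(), "model_file.jld2")
@save path model

# Clear the variable, then read it back under the name it was saved with.
model = nothing
@load path model
println(model["max_depth"])
```

When loading a real DecisionTree model in a fresh session, `using DecisionTree` should presumably come before `@load` so that the saved types can be reconstructed.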


Note that even though features and labels of type `Array{Any}` are supported, it is highly recommended that data be cast to explicit types (i.e., with `float.()`, `string.()`, etc.). This significantly improves model training and prediction execution times, and also drastically reduces the size of saved models.