# Julia Machine Learning: Decision Trees with DecisionTree

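## Assignment 030: Breast Cancer Prediction Dataset

Use a random forest model to build a classifier that predicts whether each tumor in the breast cancer dataset is benign or malignant.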

In [1]:
using DecisionTree, RDatasets, DataFrames, MLDataUtils, Statistics

## Loading the Data

In [2]:
biopsy = dataset("MASS", "biopsy")

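| Row | ID | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | String | Int32 | Int32 | Int32 | Int32 | Int32 | Int32? | Int32 | Int32 | Int32 | Cat… |
| 1 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | benign |
| 2 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | benign |
| 3 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | benign |
| 4 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | benign |
| 5 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | benign |
| 6 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | malignant |
| 7 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | benign |
| 8 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | benign |
| 9 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | benign |
| 10 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | benign |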


## Casting

In [48]:
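# Columns 2-10 hold the nine features (V1-V9); column 11 is the Class label.
features, labels = biopsy[!, 2:10], biopsy[!, 11]
# Convert to a floating-point matrix; missing entries (V6, the only Int32? column) stay `missing`.
features = Array(float.(features))
labels = string.(labels);
# Impute each missing value with the mean of the observed values in its column.
for i in 1:9
    avg = mean(skipmissing(features[:, i]))
    features[:, i] = coalesce.(features[:, i], avg)
end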
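
The mean-imputation loop above can be tried in isolation on a toy matrix. This is a minimal sketch, not part of the original notebook: the matrix `X` and its values are made up for illustration, and the final narrowing to `Matrix{Float64}` is an optional extra step that the notebook itself does not perform.

```julia
using Statistics

# Toy 3x2 matrix with one missing entry in column 2 (illustrative values only).
X = Union{Missing, Float64}[1.0 2.0; 3.0 missing; 5.0 6.0]

# Same pattern as the notebook: replace each missing value with its column mean.
for j in axes(X, 2)
    col_mean = mean(skipmissing(X[:, j]))
    X[:, j] = coalesce.(X[:, j], col_mean)
end

# Once no missing values remain, the element type can be narrowed to Float64.
X_clean = Matrix{Float64}(X)
```

In this toy case the observed values in column 2 are 2.0 and 6.0, so the missing entry is filled with their mean, 4.0.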

## Decision Tree Model

In [49]:
model = DecisionTree.DecisionTreeClassifier(max_depth = 2)

DecisionTreeClassifier
max_depth:                2
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  nothing
root:                     nothing

## Training

In [50]:
DecisionTree.fit!(model, features, labels)

DecisionTreeClassifier
max_depth:                2
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  ["benign", "malignant"]
root:                     Decision Tree
Leaves: 4
Depth:  2
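
The assignment statement asks for a random forest, while the cells above train a single decision tree. A minimal sketch of the random forest variant follows, assuming the `RandomForestClassifier` type and its `n_trees` and `max_depth` keywords from DecisionTree. To stay self-contained it uses a synthetic stand-in for the biopsy features: `X`, `y`, and the labeling rule are hypothetical and are not the real dataset.

```julia
using DecisionTree, Random, Statistics

Random.seed!(1)

# Synthetic stand-in for the biopsy data: 200 samples, 9 features scored 1-10.
X = Float64.(rand(1:10, 200, 9))
# Hypothetical labeling rule for illustration only: feature 2 alone decides the class.
y = [row[2] > 5 ? "malignant" : "benign" for row in eachrow(X)]

# An ensemble of 50 shallow trees instead of a single tree.
forest = RandomForestClassifier(n_trees = 50, max_depth = 4)
DecisionTree.fit!(forest, X, y)

# Accuracy on the training data (optimistic; a held-out set would be a fairer measure).
pred = DecisionTree.predict(forest, X)
train_accuracy = mean(pred .== y)
```

With the real data, `features` and `labels` from the casting step would take the place of `X` and `y`.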

## Printing the Decision Tree

In [51]:
DecisionTree.print_tree(model, 5)

Feature 2, Threshold 2.5
L-> Feature 6, Threshold 3.772327964860908
    L-> benign : 404/406
    R-> benign : 13/23
R-> Feature 2, Threshold 4.5
    L-> malignant : 56/92
    R-> malignant : 173/178
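
To make the printed tree easier to read, it can be transcribed by hand into nested conditionals. The sketch below is illustrative, not DecisionTree's implementation: `trace_tree` is a hypothetical helper, the thresholds and counts are copied from the output above, and it assumes that a sample goes to the left branch when its feature value is below the threshold and that each leaf is printed as majority class followed by majority count over leaf size.

```julia
# Hand-written copy of the depth-2 tree printed above.
# Assumption: go left when the feature value is below the threshold.
function trace_tree(x::AbstractVector{<:Real})
    if x[2] < 2.5
        if x[6] < 3.772327964860908
            return (label = "benign", majority = 404, total = 406)
        else
            return (label = "benign", majority = 13, total = 23)
        end
    else
        if x[2] < 4.5
            return (label = "malignant", majority = 56, total = 92)
        else
            return (label = "malignant", majority = 173, total = 178)
        end
    end
end

# Rows 1 and 6 of the dataset as shown in the table (features V1-V9).
row1 = [5.0, 1.0, 1.0, 1.0, 2.0, 1.0, 3.0, 1.0, 1.0]
row6 = [8.0, 10.0, 10.0, 8.0, 7.0, 10.0, 9.0, 7.0, 1.0]
leaf1 = trace_tree(row1)
leaf6 = trace_tree(row6)
```

Under these assumptions, row 1 lands in the 404/406 benign leaf and row 6 in the 173/178 malignant leaf, which agrees with their `Class` values. The four leaf sizes (406 + 23 + 92 + 178) add up to 699, the number of rows in the dataset.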


In [55]:
features

699×9 Array{Union{Missing, Float64},2}:
 5.0   1.0   1.0  1.0  2.0   1.0   3.0   1.0  1.0
 5.0   4.0   4.0  5.0  7.0  10.0   3.0   2.0  1.0
 3.0   1.0   1.0  1.0  2.0   2.0   3.0   1.0  1.0
 6.0   8.0   8.0  1.0  3.0   4.0   3.0   7.0  1.0
 4.0   1.0   1.0  3.0  2.0   1.0   3.0   1.0  1.0
 8.0  10.0  10.0  8.0  7.0  10.0   9.0   7.0  1.0
 1.0   1.0   1.0  1.0  2.0  10.0   3.0   1.0  1.0
 2.0   1.0   2.0  1.0  2.0   1.0   3.0   1.0  1.0
 2.0   1.0   1.0  1.0  2.0   1.0   1.0   1.0  5.0
 4.0   2.0   1.0  1.0  2.0   1.0   2.0   1.0  1.0
 1.0   1.0   1.0  1.0  1.0   1.0   3.0   1.0  1.0
 2.0   1.0   1.0  1.0  2.0   1.0   2.0   1.0  1.0
 5.0   3.0   3.0  3.0  2.0   3.0   4.0   4.0  1.0
 ⋮                           ⋮                
 3.0   1.0   1.0  1.0  2.0   1.0   2.0   3.0  1.0
 4.0   1.0   1.0  1.0  2.0   1.0   1.0   1.0  1.0
 1.0   1.0   1.0  1.0  2.0   1.0   1.0   1.0  8.0
 1.0   1.0   1.0  3.0  2.0   1.0   1.0   1.0  1.0
 5.0  10.0  10.0  5.0  4.0   5.0   4.0   4.0  1.0
 3.0   1.0   

## Prediction

In [64]:
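# Feature vector for a new sample; feature 6 is initially unknown.
new_biopsy = [2.0, 1.1, 4.5, 6.0, 1.0, missing, 6.6, 1.2, 2.3]
# Fill in the missing value by hand before predicting.
new_biopsy[6] = 5.6
DecisionTree.predict(model, new_biopsy)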

"benign"

In [65]:
DecisionTree.predict_proba(model, new_biopsy)

2-element Array{Float64,1}:
 0.5652173913043478
 0.43478260869565216
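
These probabilities can be traced back to the printed tree. For `new_biopsy`, feature 2 is 1.1 (below 2.5) and feature 6 is 5.6 (not below 3.77), so the sample appears to reach the leaf printed as `benign : 13/23`. The sketch below recomputes the class fractions from those counts. It assumes that the leaf holds only the two classes, so that the remaining 10 samples are malignant, and the variable names are illustrative.

```julia
# Counts read off the printed tree for the leaf "benign : 13/23".
benign_count = 13
leaf_total = 23
# Assumption: with only two classes, the rest of the leaf is malignant.
malignant_count = leaf_total - benign_count

p_benign = benign_count / leaf_total
p_malignant = malignant_count / leaf_total
```

The two fractions agree with the `predict_proba` output, in the order given by `classes` (`"benign"`, then `"malignant"`).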