# Decision Tree

In [1]:
using DecisionTree

## Load data

In [2]:
features, labels = DecisionTree.load_data("iris");

## Casting

In [3]:
features = float.(features)
labels = string.(labels)

150-element Array{String,1}:
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 ⋮
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"

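The casts matter because `load_data` appears to return untyped (`Any`) arrays; that is an assumption here, so check `eltype(features)` on your own install. A minimal sketch with a hand-made stand-in (the `raw_*` and `typed_*` names are illustrative, not part of the notebook) shows what the two broadcasts do:

```julia
# Stand-in for the untyped arrays: two illustrative rows of measurements.
raw_features = Any[5.1 3.5 1.4 0.2; 4.9 3.0 1.4 0.2]
raw_labels = Any["Iris-setosa", "Iris-setosa"]

typed_features = float.(raw_features)   # apply `float` to every element
typed_labels = string.(raw_labels)      # apply `string` to every element

# Compare the element types before and after the cast.
println(eltype(raw_features), " -> ", eltype(typed_features))
println(eltype(raw_labels), " -> ", eltype(typed_labels))
```

Concrete element types let Julia generate specialised code, which is generally faster than working on `Any` arrays.
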
## Model

In [4]:
model = DecisionTree.DecisionTreeClassifier(max_depth=2)

DecisionTreeClassifier
max_depth:                2
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  nothing
root:                     nothing

Available models:

* `DecisionTreeClassifier`
* `DecisionTreeRegressor`
* `RandomForestClassifier`
* `RandomForestRegressor`
* `AdaBoostStumpClassifier`

In [5]:
?DecisionTreeClassifier

search: DecisionTreeClassifier

```
DecisionTreeClassifier(; pruning_purity_threshold=0.0,
                       max_depth::Int=-1,
                       min_samples_leaf::Int=1,
                       min_samples_split::Int=2,
                       min_purity_increase::Float=0.0,
                       n_subfeatures::Int=0,
                       rng=Random.GLOBAL_RNG)
```

Decision tree classifier. See [DecisionTree.jl's documentation](https://github.com/bensadeghi/DecisionTree.jl)

Hyperparameters:

  * `pruning_purity_threshold`: (post-pruning) merge leaves having `>=thresh` combined purity (default: no pruning)
  * `max_depth`: maximum depth of the decision tree (default: no maximum)
  * `min_samples_leaf`: the minimum number of samples each leaf needs to have (default: 1)
  * `min_samples_split`: the minimum number of samples needed for a split (default: 2)
  * `min_purity_increase`: minimum purity needed for a split (default: 0.0)
  * `n_subfeatures`: number of features to select at random (default: keep all)
  * `rng`: the random number generator to use. Can be an `Int`, which will be used to seed and create a new random number generator.

Implements `fit!`, `predict`, `predict_proba`, `get_classes`
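
As a rough illustration of the hyperparameters listed above, the sketch below builds a more constrained tree. The values are arbitrary examples rather than recommendations, and the name `regularized` is hypothetical:

```julia
using DecisionTree

# Illustrative values only; tune them for your own data.
regularized = DecisionTreeClassifier(max_depth=3,          # cap the tree depth
                                     min_samples_leaf=5,   # each leaf keeps at least 5 samples
                                     min_samples_split=10, # split only nodes with 10 or more samples
                                     rng=42)               # Int seed, as the docstring allows
```

Such a model can then be passed to `fit!` and `predict` in the same way as the `max_depth=2` model in this notebook.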


## Training

In [6]:
DecisionTree.fit!(model, features, labels)

DecisionTreeClassifier
max_depth:                2
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
root:                     Decision Tree
Leaves: 3
Depth:  2

### Pretty-print the tree to a depth of 5 nodes

In [7]:
DecisionTree.print_tree(model, 5)

Feature 3, Threshold 2.45
L-> Iris-setosa : 50/50
R-> Feature 4, Threshold 1.75
    L-> Iris-versicolor : 49/54
    R-> Iris-virginica : 45/46
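
The printout can be read as nested threshold tests, with each leaf showing how many of its training samples belong to the majority class. The sketch below transcribes it by hand. It assumes a sample goes left when its feature value is strictly below the threshold, and `classify_by_hand` is a hypothetical helper, not part of DecisionTree.jl:

```julia
# Hand-written copy of the printed tree (not generated by the library).
function classify_by_hand(x::AbstractVector{<:Real})
    if x[3] < 2.45              # Feature 3, Threshold 2.45
        return "Iris-setosa"
    elseif x[4] < 1.75          # Feature 4, Threshold 1.75
        return "Iris-versicolor"
    else
        return "Iris-virginica"
    end
end

# Majority-class counts at the three leaves, read from the printout,
# divided by the 150 training samples.
training_accuracy = (50 + 49 + 45) / 150

println(classify_by_hand([5.9, 3.0, 5.1, 1.9]))
println(training_accuracy)
```

Read this way, the depth-2 tree classifies 144 of the 150 training samples correctly.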


## Prediction

In [8]:
new_iris = [5.9, 3.0, 5.1, 1.9]
DecisionTree.predict(model, new_iris)

"Iris-virginica"

In [9]:
DecisionTree.predict_proba(model, new_iris)

3-element Array{Float64,1}:
 0.0
 0.021739130434782608
 0.9782608695652174

### Ordering of the entries in `predict_proba`'s output

In [10]:
DecisionTree.get_classes(model)

3-element Array{String,1}:
 "Iris-setosa"
 "Iris-versicolor"
 "Iris-virginica"

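The probabilities above are consistent with class frequencies in the leaf that the sample reaches. As a sketch, assuming the right-most leaf (printed as `45/46`) holds 45 virginica samples and 1 versicolor sample, the same numbers can be reproduced and paired with their class names. All variable names here are illustrative:

```julia
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Assumed contents of the leaf printed as 45/46.
leaf_counts = [0, 1, 45]
leaf_probs = leaf_counts ./ sum(leaf_counts)

# Pair each class with its probability, in `get_classes` order.
for (class, p) in zip(classes, leaf_probs)
    println(rpad(class, 16), round(p, digits=4))
end
```
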
## Using MLJ

In [1]:
using MLJ
using RDatasets

In [2]:
smarket = dataset("ISLR", "Smarket")
first(smarket, 6)

| Row | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
|---|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Categorical… |
| 1 | 2001.0 | 0.381 | -0.192 | -2.624 | -1.055 | 5.01 | 1.1913 | 0.959 | Up |
| 2 | 2001.0 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.2965 | 1.032 | Up |
| 3 | 2001.0 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.4112 | -0.623 | Down |
| 4 | 2001.0 | -0.623 | 1.032 | 0.959 | 0.381 | -0.192 | 1.276 | 0.614 | Up |
| 5 | 2001.0 | 0.614 | -0.623 | 1.032 | 0.959 | 0.381 | 1.2057 | 0.213 | Up |
| 6 | 2001.0 | 0.213 | 0.614 | -0.623 | 1.032 | 0.959 | 1.3491 | 1.392 | Up |


In [3]:
y, X = unpack(smarket, ==(:Direction), colname -> true);
X = select(X, Not([:Year, :Today]))
first(X, 6)

| Row | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume |
|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.381 | -0.192 | -2.624 | -1.055 | 5.01 | 1.1913 |
| 2 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.2965 |
| 3 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.4112 |
| 4 | -0.623 | 1.032 | 0.959 | 0.381 | -0.192 | 1.276 |
| 5 | 0.614 | -0.623 | 1.032 | 0.959 | 0.381 | 1.2057 |
| 6 | 0.213 | 0.614 | -0.623 | 1.032 | 0.959 | 1.3491 |


In [4]:
y = coerce(y, OrderedFactor)
classes(y[1])

2-element CategoricalArray{String,1,UInt8}:
 "Down"
 "Up"

### Training/testing set

In [5]:
train, test = partition(eachindex(y), 0.7, shuffle=true)

([842, 376, 460, 518, 77, 190, 442, 79, 986, 58  …  82, 72, 109, 167, 294, 651, 339, 240, 1130, 70], [658, 1052, 205, 457, 66, 1009, 659, 282, 598, 887  …  911, 177, 565, 508, 121, 53, 536, 833, 575, 1163])
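
To make the shuffled 70/30 split concrete, here is a minimal pure-Julia sketch of the idea. `shuffled_split` is a hypothetical helper that only mimics the behaviour and is not MLJ's implementation; in particular, the rounding of the training-set size is an assumption. The range `1:1250` stands in for the row indices of the Smarket data.

```julia
using Random

# Hypothetical helper: shuffle the indices, then cut them at the given fraction.
function shuffled_split(indices, fraction; rng=Random.GLOBAL_RNG)
    shuffled = shuffle(rng, collect(indices))
    n_train = round(Int, fraction * length(shuffled))
    return shuffled[1:n_train], shuffled[n_train+1:end]
end

# A seeded generator makes the split reproducible within one Julia version.
train_rows, test_rows = shuffled_split(1:1250, 0.7; rng=MersenneTwister(1))

println(length(train_rows), " training rows, ", length(test_rows), " test rows")
```

Because the notebook's call passes no seed, the indices shown above will differ from run to run; `partition` should also accept an `rng` keyword if a repeatable split is needed.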

### Model

In [6]:
model = @load DecisionTreeClassifier pkg=DecisionTree

DecisionTreeClassifier(
    max_depth = -1,
    min_samples_leaf = 1,
    min_samples_split = 2,
    min_purity_increase = 0.0,
    n_subfeatures = 0,
    post_prune = false,
    merge_purity_threshold = 1.0,
    pdf_smoothing = 0.0,
    display_depth = 5) @ 8…68

In [7]:
match = machine(model, X, y)

Machine{DecisionTreeClassifier} @ 1…71


### Training

In [8]:
fit!(match, rows=train)

┌ Info: Training Machine{DecisionTreeClassifier} @ 1…71.
└ @ MLJBase /home/yuehhua/.julia/packages/MLJBase/O5b6j/src/machines.jl:187


Machine{DecisionTreeClassifier} @ 1…71

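A natural next step, not shown in the notebook, is to score the fitted machine on the held-out rows. The sketch below assumes that `predict_mode` accepts a `rows` keyword and that `accuracy` is available as a measure in your MLJ version. `holdout_accuracy` is a hypothetical helper, not an MLJ function:

```julia
using MLJ

# Hypothetical helper: accuracy of a fitted machine on held-out rows.
function holdout_accuracy(fitted_machine, targets, rows)
    # Most probable class for each requested row.
    predicted = predict_mode(fitted_machine, rows=rows)
    return accuracy(predicted, targets[rows])
end
```

With the objects defined above, the call would be `holdout_accuracy(match, y, test)`.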