# Getting started with MLJ
If you are new to MLJ but are familiar with Julia and with Machine Learning, we recommend you start by going through the short Getting started examples in order:

* How to choose a model

* How to fit, predict and transform

* How to tune models

* How to ensemble models

* How to ensemble models (2)

* More on ensembles

* How to compose models

* How to build a learning network

* How to create models from learning networks

* An extended tutorial on stacking

For more detail on any of these topics, refer to the MLJ documentation.

## How to choose a model

In [1]:
using RDatasets
using MLJ

In [54]:
iris = dataset("datasets","iris")
first(iris,3) |> pretty

┌─────────────┬────────────┬─────────────┬────────────┬─────────────────────────────────┐
│ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species                         │
│ Float64     │ Float64    │ Float64     │ Float64    │ CategoricalValue{String, UInt8} │
│ Continuous  │ Continuous │ Continuous  │ Continuous │ Multiclass{3}                   │
├─────────────┼────────────┼─────────────┼────────────┼─────────────────────────────────┤
│ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa                          │
│ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa                          │
│ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa                          │
└─────────────┴────────────┴─────────────┴────────────┴─────────────────────────────────┘


In [3]:
schema(iris)

┌─────────────┬─────────────────────────────────┬───────────────┐
│ _.names     │ _.types                         │ _.scitypes    │
├─────────────┼─────────────────────────────────┼───────────────┤
│ SepalLength │ Float64                         │ Continuous    │
│ SepalWidth  │ Float64                         │ Continuous    │
│ PetalLength │ Float64                         │ Continuous    │
│ PetalWidth  │ Float64                         │ Continuous    │
│ Species     │ CategoricalValue{String, UInt8} │ Multiclass{3} │
└─────────────┴─────────────────────────────────┴───────────────┘
_.nrows = 150


In [5]:
iris2 = coerce(iris, :PetalWidth => OrderedFactor)
schema(iris2)

┌─────────────┬───────────────────────────────────┬───────────────────┐
│ _.names     │ _.types                           │ _.scitypes        │
├─────────────┼───────────────────────────────────┼───────────────────┤
│ SepalLength │ Float64                           │ Continuous        │
│ SepalWidth  │ Float64                           │ Continuous        │
│ PetalLength │ Float64                           │ Continuous        │
│ PetalWidth  │ CategoricalValue{Float64, UInt32} │ OrderedFactor{22} │
│ Species     │ CategoricalValue{String, UInt8}   │ Multiclass{3}     │
└─────────────┴───────────────────────────────────┴───────────────────┘
_.nrows = 150
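The cell above reinterprets the continuous `PetalWidth` column as an ordered factor with one level per distinct value. The same idea on a toy example may be easier to follow. This is a minimal sketch: the vector `v` and the table `t` are made-up illustrations, not part of the notebook.

```julia
using MLJ

# Integers are read as counts by default.
v = [1, 2, 2, 3]
scitype(v)                       # AbstractVector{Count}

# Reinterpret the same values as ordered categories.
w = coerce(v, OrderedFactor)
scitype(w)                       # AbstractVector{OrderedFactor{3}}

# Coerce a single column of a (hypothetical) column table.
t = (a = [1, 2, 2, 3], b = [0.1, 0.2, 0.3, 0.4])
t2 = coerce(t, :a => OrderedFactor)
schema(t2).scitypes              # (OrderedFactor{3}, Continuous)
```

The scientific type, not the machine type, is what MLJ uses to decide which models apply to the data.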


In [6]:
# Unpacking data

In [55]:
y, X = unpack(iris, ==(:Species), colname->true)
first(X,1) |> pretty

┌─────────────┬────────────┬─────────────┬────────────┐
│ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │
│ Float64     │ Float64    │ Float64     │ Float64    │
│ Continuous  │ Continuous │ Continuous  │ Continuous │
├─────────────┼────────────┼─────────────┼────────────┤
│ 5.1         │ 3.5        │ 1.4         │ 0.2        │
└─────────────┴────────────┴─────────────┴────────────┘


In [8]:
# iris can also be loaded directly as features and target with
# X, y = @load_iris

((sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9  …  6.7, 6.9, 5.8, 6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9], sepal_width = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1  …  3.1, 3.1, 2.7, 3.2, 3.3, 3.0, 2.5, 3.0, 3.4, 3.0], petal_length = [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5  …  5.6, 5.1, 5.1, 5.9, 5.7, 5.2, 5.0, 5.2, 5.4, 5.1], petal_width = [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1  …  2.4, 2.3, 1.9, 2.3, 2.5, 2.3, 1.9, 2.0, 2.3, 1.8]), CategoricalArrays.CategoricalValue{String, UInt32}["setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa"  …  "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica"])

In [47]:
schema(X)

┌─────────────┬─────────┬────────────┐
│ _.names     │ _.types │ _.scitypes │
├─────────────┼─────────┼────────────┤
│ SepalLength │ Float64 │ Continuous │
│ SepalWidth  │ Float64 │ Continuous │
│ PetalLength │ Float64 │ Continuous │
│ PetalWidth  │ Float64 │ Continuous │
└─────────────┴─────────┴────────────┘
_.nrows = 150


### Choosing a model

In [56]:
for m in models(matching(X,y))
    println(rpad(m.name,30),"($(m.package_name))")
end

AdaBoostClassifier            (ScikitLearn)
AdaBoostStumpClassifier       (DecisionTree)
BaggingClassifier             (ScikitLearn)
BayesianLDA                   (MultivariateStats)
BayesianLDA                   (ScikitLearn)
BayesianQDA                   (ScikitLearn)
BayesianSubspaceLDA           (MultivariateStats)
ConstantClassifier            (MLJModels)
DecisionTreeClassifier        (BetaML)
DecisionTreeClassifier        (DecisionTree)
DeterministicConstantClassifier(MLJModels)
DummyClassifier               (ScikitLearn)
EvoTreeClassifier             (EvoTrees)
ExtraTreesClassifier          (ScikitLearn)
GaussianNBClassifier          (NaiveBayes)
GaussianNBClassifier          (ScikitLearn)
GaussianProcessClassifier     (ScikitLearn)
GradientBoostingClassifier    (ScikitLearn)
KNNClassifier                 (NearestNeighborModels)
KNeighborsClassifier          (ScikitLearn)
KernelPerceptronClassifier    (BetaML)
LDA                           (MultivariateStats)
LGBMClassifier     
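The list of matching models is long, so it can help to narrow it with a predicate on the model metadata. A minimal sketch, assuming the built-in iris data from `@load_iris` and the metadata fields `prediction_type` and `is_pure_julia`; the variable names are illustrative only.

```julia
using MLJ

X, y = @load_iris    # small built-in dataset; y is categorical

# Every registered model whose input and target scitypes fit (X, y).
all_matches = models(matching(X, y))

# Keep only probabilistic classifiers implemented in pure Julia.
pure_julia_probabilistic = models() do m
    matching(m, X, y) &&
        m.prediction_type == :probabilistic &&
        m.is_pure_julia
end

for m in pure_julia_probabilistic
    println(rpad(m.name, 32), "(", m.package_name, ")")
end
```

The exact list printed depends on the version of the model registry installed.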

### Loading a model

In [13]:
knc = @load KNeighborsClassifier

┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /home/sandhya/.julia/packages/MLJModels/E8BbE/src/loading.jl:168


import MLJScikitLearnInterface ✔


MLJScikitLearnInterface.KNeighborsClassifier

In [14]:
linreg = @load LinearRegressor pkg = "GLM"

import MLJGLMInterface

┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /home/sandhya/.julia/packages/MLJModels/E8BbE/src/loading.jl:168
┌ Info: Precompiling MLJGLMInterface [caf8df21-4939-456d-ac9c-5fefbfb04c0c]
└ @ Base loading.jl:1317


 ✔


MLJGLMInterface.LinearRegressor

In [15]:
using DecisionTree

### Fit, predict and transform

In [16]:
using MLJ
using Statistics
using PrettyPrinting
using StableRNGs

In [49]:
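#X, y = @load_iris;

# Caution: String.(y) yields a plain Vector{String}, which MLJ reads as Textual.
# Classifiers expect a categorical (Finite) target, so keep y as unpacked above
# or coerce it back to Multiclass before building a machine.
X = Float64.(X)
y = String.(y)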

150-element Vector{String}:
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 ⋮
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"

In [57]:
Tree = @load DecisionTreeClassifier pkg="DecisionTree"
import MLJDecisionTreeInterface

import MLJDecisionTreeInterface ✔


┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /home/sandhya/.julia/packages/MLJModels/E8BbE/src/loading.jl:168


In [58]:
tree = Tree()

DecisionTreeClassifier(
    max_depth = -1,
    min_samples_leaf = 1,
    min_samples_split = 2,
    min_purity_increase = 0.0,
    n_subfeatures = 0,
    post_prune = false,
    merge_purity_threshold = 1.0,
    pdf_smoothing = 0.0,
    display_depth = 5) @060

In [59]:
evaluate(tree, X, y,
         resampling=CV(shuffle=true), measure=log_loss, verbosity=0)

┌───────────────────────────────────┬───────────────┬───────────────────────────
│ _.measure                         │ _.measurement │ _.per_fold               ⋯
├───────────────────────────────────┼───────────────┼───────────────────────────
│ LogLoss{Float64} @633             │ 1.68          │ [2.22e-16, 2.22e-16, 2.2 ⋯
└───────────────────────────────────┴───────────────┴───────────────────────────
                                                                1 column omitted
_.per_observation = [[[2.22e-16, 2.22e-16, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 2.22e-16], [2.22e-16, 36.0, ..., 2.22e-16]]]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]


In [27]:
?evaluate

search: evaluate evaluate! IntervalSurrogate



some meta-models may choose to implement the `evaluate` operations

---

```
evaluate(model, data...; cache=true, kw_options...)
```

Equivalent to `evaluate!(machine(model, data..., cache=cache); kw_options...)`. See the machine version `evaluate!` for the complete list of options.


In [60]:
evaluate(tree, X, y,
         resampling=CV(shuffle=true), measure=accuracy,
         operation=predict_mode, verbosity=0)

┌───────────────┬───────────────┬─────────────────────────────────────┐
│ _.measure     │ _.measurement │ _.per_fold                          │
├───────────────┼───────────────┼─────────────────────────────────────┤
│ Accuracy @560 │ 0.947         │ [1.0, 0.96, 0.92, 0.88, 0.96, 0.96] │
└───────────────┴───────────────┴─────────────────────────────────────┘
_.per_observation = [missing]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
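With `CV(shuffle=true)` the folds are redrawn on every call, so the numbers above will differ between runs. A minimal sketch of a repeatable evaluation, assuming the `rng` keyword of `CV` and a `StableRNG` seeded with the arbitrary value 42; the helper `cv()` is hypothetical and only exists so that each call gets a fresh generator.

```julia
using MLJ
using StableRNGs

X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg="DecisionTree" verbosity=0
tree = Tree()

# A fresh generator with the same seed produces the same fold assignment.
cv() = CV(nfolds=6, shuffle=true, rng=StableRNG(42))

e1 = evaluate(tree, X, y, resampling=cv(), measure=accuracy,
              operation=predict_mode, verbosity=0)
e2 = evaluate(tree, X, y, resampling=cv(), measure=accuracy,
              operation=predict_mode, verbosity=0)

# Aggregate accuracy and the six per-fold accuracies of each run.
e1.measurement[1], e1.per_fold[1]
e2.measurement[1], e2.per_fold[1]
```

The fold assignment repeats between the two calls; the scores should then agree as long as the model itself trains deterministically.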


In [29]:
info("DecisionTreeClassifier", pkg="DecisionTree").target_scitype

AbstractVector{_s40} where _s40<:Finite (alias for AbstractArray{_s40, 1} where _s40<:Finite)

In [30]:
# Fit and predict

In [61]:
mach = machine(tree, X, y)

Machine{DecisionTreeClassifier,…} @814 trained 0 times; caches data
  args: 
    1:	Source @540 ⏎ `Table{AbstractVector{Continuous}}`
    2:	Source @781 ⏎ `AbstractVector{Multiclass{3}}`


In [62]:
train, test = partition(eachindex(y), 0.7, shuffle=true); # 70:30 split

In [63]:
train

105-element Vector{Int64}:
 112
 146
  46
  32
  21
  58
  39
  31
  82
 136
  33
 115
 133
   ⋮
  94
  55
 140
 145
  45
 147
  91
 113
 142
  24
  29
 138
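The split above changes on every run because the shuffle draws from the global random number generator. A minimal sketch of a repeatable split, assuming the `rng` keyword of `partition` and a `StableRNG` seeded with the arbitrary value 123:

```julia
using MLJ
using StableRNGs

rng = StableRNG(123)                     # arbitrary seed, chosen for illustration
train, test = partition(1:150, 0.7, shuffle=true, rng=rng)

length(train), length(test)              # (105, 45)

# The two index sets are disjoint and together cover every row.
isempty(intersect(train, test))
sort(vcat(train, test)) == 1:150
```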

In [64]:
MLJ.fit!(mach, rows=train);

┌ Info: Training Machine{DecisionTreeClassifier,…} @814.
└ @ MLJBase /home/sandhya/.julia/packages/MLJBase/pCCd7/src/machines.jl:342


In [65]:
yhat = MLJ.predict(mach, X[test,:])

45-element MLJBase.UnivariateFiniteVector{Multiclass{3}, String, UInt8, Float64}:
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.0, virginica=>1.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.0, virginica=>1.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>0.0, virginica=>1.0)
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, ver

In [66]:
yhat[3:5]

3-element MLJBase.UnivariateFiniteVector{Multiclass{3}, String, UInt8, Float64}:
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>1.0, versicolor=>0.0, virginica=>0.0)
 UnivariateFinite{Multiclass{3}}(setosa=>0.0, versicolor=>1.0, virginica=>0.0)

In [45]:
typeof(X)

NamedTuple{(:sepal_length, :sepal_width, :petal_length, :petal_width), NTuple{4, Vector{Float64}}}

In [70]:
log_loss(yhat, y[test]) |> mean

4.805820451882287

In [75]:
fitted_params(mach) |> pprint

(tree = Decision Tree
Leaves: 5
Depth:  4,
 encoding =
     Dict(CategoricalArrays.CategoricalValue{String, UInt8} "virginica" =>
              0x03,
          CategoricalArrays.CategoricalValue{String, UInt8} "setosa" => 0x01,
          CategoricalArrays.CategoricalValue{String, UInt8} "versicolor" =>
              0x02))

In [81]:
y_pred = predict_mode(mach, rows=test)

45-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "setosa"
 "virginica"
 "versicolor"
 "setosa"
 "versicolor"
 "versicolor"
 "versicolor"
 "versicolor"
 "virginica"
 "versicolor"
 "virginica"
 "setosa"
 "versicolor"
 ⋮
 "setosa"
 "setosa"
 "virginica"
 "versicolor"
 "setosa"
 "virginica"
 "virginica"
 "versicolor"
 "setosa"
 "setosa"
 "setosa"
 "versicolor"

In [86]:
DataFrame("Pred" => y_pred, "Actual" => y[test])

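┌────┬────────────┬────────────┐
│    │ Pred       │ Actual     │
│    │ Cat…       │ Cat…       │
├────┼────────────┼────────────┤
│ 1  │ setosa     │ setosa     │
│ 2  │ virginica  │ virginica  │
│ 3  │ versicolor │ virginica  │
│ 4  │ setosa     │ setosa     │
│ 5  │ versicolor │ versicolor │
│ 6  │ versicolor │ virginica  │
│ 7  │ versicolor │ versicolor │
│ 8  │ versicolor │ versicolor │
│ 9  │ virginica  │ virginica  │
│ 10 │ versicolor │ versicolor │
└────┴────────────┴────────────┘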


In [102]:
mce = MLJ.cross_entropy(yhat, y[test]) |> mean
round(mce, digits=4)

4.8058

In [103]:
MLJ.cross_entropy(yhat, y[test]) 

45-element Vector{Float64}:
  2.2204460492503136e-16
  2.2204460492503136e-16
 36.04365338911715
  2.2204460492503136e-16
  2.2204460492503136e-16
 36.04365338911715
  2.2204460492503136e-16
  2.2204460492503136e-16
  2.2204460492503136e-16
  2.2204460492503136e-16
  2.2204460492503136e-16
  2.2204460492503136e-16
  2.2204460492503136e-16
  ⋮
  2.2204460492503136e-16
  2.2204460492503136e-16
  2.2204460492503136e-16
  2.2204460492503136e-16
  2.2204460492503136e-16
  2.2204460492503136e-16
  2.2204460492503136e-16
 36.04365338911715
  2.2204460492503136e-16
  2.2204460492503136e-16
  2.2204460492503136e-16
  2.2204460492503136e-16
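Only two values appear in the vector above, because an unpruned tree predicts every class with probability 0 or 1. The numbers are consistent with the probability of the true class being clamped to `[eps(), 1 - eps()]` before the logarithm is taken. The sketch below reproduces them with the standard library only; the clamping rule and the helper `clamped_log_loss` are assumptions made for illustration, not MLJ's implementation.

```julia
using Statistics

# Assumed rule: clamp the probability given to the true class, then take -log.
clamped_log_loss(p) = -log(clamp(p, eps(), 1 - eps()))

loss_right = clamped_log_loss(1.0)   # true class predicted with probability 1
loss_wrong = clamped_log_loss(0.0)   # true class predicted with probability 0

# A mean of about 4.8058 over 45 test rows is what six confident
# misclassifications would produce.
losses = vcat(fill(loss_wrong, 6), fill(loss_right, 39))
mean(losses)
```

Because a single confident mistake costs about 36, the mean log loss of such a model is dominated by its error count.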

## Unsupervised models

In [105]:
v = [1, 2, 3, 4]
stand_model = UnivariateStandardizer()
stand = machine(stand_model, v)

Machine{UnivariateStandardizer,…} @319 trained 0 times; caches data
  args: 
    1:	Source @487 ⏎ `AbstractVector{Count}`


In [107]:
MLJ.fit!(stand)

┌ Info: Training Machine{UnivariateStandardizer,…} @319.
└ @ MLJBase /home/sandhya/.julia/packages/MLJBase/pCCd7/src/machines.jl:342


Machine{UnivariateStandardizer,…} @319 trained 1 time; caches data
  args: 
    1:	Source @487 ⏎ `AbstractVector{Count}`


In [109]:
w = MLJ.transform(stand, v)
@show round.(w, digits=2)
@show mean(w)
@show std(w)

round.(w, digits = 2) = [-1.16, -0.39, 0.39, 1.16]
mean(w) = 0.0
std(w) = 1.0


1.0

In [110]:
vv = inverse_transform(stand, w)
sum(abs.(vv .- v))

0.0
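The fit, transform and inverse transform steps above can be written out by hand. This is a minimal sketch using only the standard library, assuming the standardizer subtracts the mean and divides by the sample standard deviation, which is what the printed `mean(w)` and `std(w)` suggest. The names `mu`, `sigma` and `v_back` are illustrative.

```julia
using Statistics

v = [1, 2, 3, 4]

# "Fit": learn the two parameters from the data.
mu, sigma = mean(v), std(v)      # std uses the n - 1 denominator

# "Transform": centre and rescale.
w = (v .- mu) ./ sigma
round.(w, digits=2)              # [-1.16, -0.39, 0.39, 1.16]

# "Inverse transform": undo the rescaling and centring.
v_back = w .* sigma .+ mu
maximum(abs.(v_back .- v))       # zero up to floating-point rounding
```

Keeping the learned parameters inside a machine is what allows the same transformation to be applied later to new data.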