# MLJ Task Comparison

[MLJ Docs](https://alan-turing-institute.github.io/MLJ.jl/dev/)


## Installation
*This method of installing Julia on Colab is adapted from [this template](https://colab.research.google.com/github/ageron/julia_notebooks/blob/master/Julia_Colab_Notebook_Template.ipynb).*

1. If you need a GPU: _Runtime_ > _Change runtime type_ > _Hardware accelerator_ = _GPU_.
2. Execute the following cell (click on it and press Ctrl+Enter) to install Julia, IJulia and other packages (if needed, update `JULIA_VERSION` and the other parameters). This takes a couple of minutes.
3. Reload this page (press Ctrl+R, or ⌘+R, or the F5 key) and continue to the next section.

_Notes_:
* If your Colab Runtime gets reset (e.g., due to inactivity), repeat all steps.
* After installation, if you want to change the Julia version or activate/deactivate the GPU, you will need to reset the Runtime: _Runtime_ > _Factory reset runtime_ and repeat steps 2 and 3.

In [None]:
%%shell
set -e

#---------------------------------------------------#
JULIA_VERSION="1.8.2" # any version ≥ 0.7.0
JULIA_PACKAGES="IJulia BenchmarkTools"
JULIA_PACKAGES_IF_GPU="CUDA" # or CuArrays for older Julia versions
JULIA_NUM_THREADS=2
#---------------------------------------------------#

if [ -z `which julia` ]; then
  # Install Julia
  JULIA_VER=`cut -d '.' -f -2 <<< "$JULIA_VERSION"`
  echo "Installing Julia $JULIA_VERSION on the current Colab Runtime..."
  BASE_URL="https://julialang-s3.julialang.org/bin/linux/x64"
  URL="$BASE_URL/$JULIA_VER/julia-$JULIA_VERSION-linux-x86_64.tar.gz"
  wget -nv $URL -O /tmp/julia.tar.gz # -nv means "not verbose"
  tar -x -f /tmp/julia.tar.gz -C /usr/local --strip-components 1
  rm /tmp/julia.tar.gz

  # Install Packages
  nvidia-smi -L &> /dev/null && export GPU=1 || export GPU=0
  if [ $GPU -eq 1 ]; then
    JULIA_PACKAGES="$JULIA_PACKAGES $JULIA_PACKAGES_IF_GPU"
  fi
  for PKG in `echo $JULIA_PACKAGES`; do
    echo "Installing Julia package $PKG..."
    julia -e 'using Pkg; pkg"add '$PKG'; precompile;"' &> /dev/null
  done

  # Install kernel and rename it to "julia"
  echo "Installing IJulia kernel..."
  julia -e 'using IJulia; IJulia.installkernel("julia", env=Dict(
      "JULIA_NUM_THREADS"=>"'"$JULIA_NUM_THREADS"'"))'
  KERNEL_DIR=`julia -e "using IJulia; print(IJulia.kerneldir())"`
  KERNEL_NAME=`ls -d "$KERNEL_DIR"/julia*`
  mv -f $KERNEL_NAME "$KERNEL_DIR"/julia  

  echo ''
  echo "Successfully installed `julia -v`!"
  echo "Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then"
  echo "jump to the 'Checking the Installation' section."
fi

Installing Julia 1.8.2 on the current Colab Runtime...
2023-05-02 16:49:01 URL:https://storage.googleapis.com/julialang2/bin/linux/x64/1.8/julia-1.8.2-linux-x86_64.tar.gz [135859273/135859273] -> "/tmp/julia.tar.gz" [1]
Installing Julia package IJulia...
Installing Julia package BenchmarkTools...
Installing Julia package CUDA...
Installing IJulia kernel...
[ Info: Installing julia kernelspec in /root/.local/share/jupyter/kernels/julia-1.8

Successfully installed julia version 1.8.2!
Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then
jump to the 'Checking the Installation' section.




Run the following cell to install all of the packages this notebook needs:

In [1]:
import Pkg
Pkg.add("MLJ")
Pkg.add("DecisionTree")
Pkg.add("PalmerPenguins")
Pkg.add("MLJModels")
Pkg.add("MLJDecisionTreeInterface")
Pkg.add("DataFrames")
Pkg.add("CSV")

    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed Calculus ──────────────────── v0.5.1
   Installed HypergeometricFunctions ───── v0.3.15
   Installed CategoricalDistributions ──── v0.1.10
   Installed StatsFuns ─────────────────── v1.3.0
   Installed StatisticalTraits ─────────── v3.2.0
   Installed LoggingExtras ─────────────── v1.0.0
   Installed RelocatableFolders ────────── v1.0.0
   Installed PDMats ────────────────────── v0.11.17
   Installed MarchingCubes ─────────────── v0.1.8
   Installed ConcurrentUtilities ───────── v2.1.1
   Installed Contour ───────────────────── v0.6.2
   Installed ProgressMeter ─────────────── v1.7.2
   Installed Forma

### Checking the Installation
The `versioninfo()` function should print your Julia version and some other info about the system:

In [2]:
versioninfo()

Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 2 × Intel(R) Xeon(R) CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 2 on 2 virtual cores
Environment:
  LD_LIBRARY_PATH = /usr/lib64-nvidia
  JULIA_NUM_THREADS = 2


## Loading a Dataset

Let's start by loading in a dataset to work with. We'll be working with the Palmer Penguins dataset.

`PalmerPenguins.load()` returns the penguins as a `CSV.File` object, which supports the Tables.jl interface. We'll convert it into a `DataFrame` for easier preprocessing later by passing it to the `DataFrame` constructor.

Run the following code snippet to load up your penguins:

(You may need to confirm the download of the dataset by entering `y` in stdin.)

In [3]:
using MLJ
using CSV
using PalmerPenguins
using DataFrames

penguin_csv = PalmerPenguins.load()
penguins = DataFrame(penguin_csv)

This program has requested access to the data dependency PalmerPenguins.
which is not currently installed. It can be installed automatically, and you will not see this message again.

Dataset: The Palmer penguins dataset
Authors: Allison Horst, Alison Hill, Kristen Gorman
Website: https://allisonhorst.github.io/palmerpenguins/index.html

The Palmer penguins dataset is a dataset for data exploration & visualization, as an
alternative to the Iris dataset.

The dataset contains data for 344 penguins. There are 3 different species of penguins in
this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica.

Data were collected and made available by
[Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php)
and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/), a member of the
[Long Term Ecological Research Network](https://lternet.edu/).

Data are available by
[CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) l

┌ Info: Downloading
│   source = "https://cdn.jsdelivr.net/gh/allisonhorst/palmerpenguins@433439c8b013eff3d36c847bb7a27fa0d7e353d8/inst/extdata/penguins.csv"
│   dest = "/root/.julia/datadeps/PalmerPenguins/penguins.csv"
│   progress = 1.0
│   time_taken = "0.05 s"
│   time_remaining = "0.0 s"
│   average_speed = "280.834 KiB/s"
│   downloaded = "13.199 KiB"
│   remaining = "0 bytes"
└   total = "13.199 KiB"
┌ Info: Downloading
│   source = "https://cdn.jsdelivr.net/gh/allisonhorst/palmerpenguins@433439c8b013eff3d36c847bb7a27fa0d7e353d8/inst/extdata/penguins_raw.csv"
│   dest = "/root/.julia/datadeps/PalmerPenguins/penguins_raw.csv"
│   progress = 1.0
│   time_taken = 

| Row | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|-----|---------|--------|----------------|---------------|-------------------|-------------|-----|
|     | String15 | String15 | Float64? | Float64? | Int64? | Int64? | String7? |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | missing | missing | missing | missing | missing |
| 5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 7 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
| 8 | Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male |
| 9 | Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | missing |
| 10 | Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | missing |


## Getting to Know Your Dataset + Simple Data Visualizations

First, let's get to know our dataset and walk through some basic functionality of examining and visualizing data in MLJ. 

We can use `selectrows(dataset, first:last)`, where `first:last` is the range of rows to select, to look at the first few rows of our data. `|> pretty` displays the result as a formatted table that also lists each column's type and scientific type. This gives a quick snapshot of the data, but the first few rows may not be representative of the dataset as a whole.

In [4]:
selectrows(penguins, 1:10)  |> pretty

┌──────────┬───────────┬────────────────────────────┬────────────────────────────┬───────────────────────┬───────────────────────┬─────────────────────────┐
│ species  │ island    │ bill_length_mm             │ bill_depth_mm              │ flipper_length_mm     │ body_mass_g           │ sex                     │
│ String15 │ String15  │ Union{Missing, Float64}    │ Union{Missing, Float64}    │ Union{Missing, Int64} │ Union{Missing, Int64} │ Union{Missing, String7} │
│ Textual  │ Textual   │ Union{Missing, Continuous} │ Union{Missing, Continuous} │ Union{Missing, Count} │ Union{Missing, Count} │ Union{Missing, Textual} │
├──────────┼───────────┼────────────────────────────┼────────────────────────────┼───────────────────────┼───────────────────────┼─────────────────────────┤
│ Adelie   │ Torgersen │ 39.1    

Another important way to examine data is `schema()`, which returns the name, scientific type (scitype), and machine type of each column.

In [None]:
schema(penguins)

┌───────────────────┬────────────────────────────┬─────────────────────────┐
│ names             │ scitypes                   │ types                   │
├───────────────────┼────────────────────────────┼─────────────────────────┤
│ species           │ Textual                    │ String15                │
│ island            │ Textual                    │ String15                │
│ bill_length_mm    │ Union{Missing, Continuous} │ Union{Missing, Float64} │
│ bill_depth_mm     │ Union{Missing, Continuous} │ Union{Missing, Float64} │
│ flipper_length_mm │ Union{Missing, Count}      │ Union{Missing, Int64}   │
│ body_mass_g       │ Union{Missing, Count}      │ Union{Missing, Int64}   │
│ sex               │ Union{Missing, Textual}    │ Union{Missing, String7} │
└───────────────────┴────────────────────────────┴─────────────────────────┘


## Classification Model - Decision Trees

This example will demonstrate how the MLJ library can be used to create decision trees. We'll be training and testing a model to predict the species of penguins.

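The `@iload` macro loads the code for MLJ's `DecisionTreeClassifier` from the DecisionTree package and returns the model type, which we'll call `dtree`. Calling `dtree()` then creates a model instance with the default hyperparameters.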

In [5]:
import MLJModels
import DecisionTree

dtree = @iload DecisionTreeClassifier pkg = "DecisionTree"
pengs_classifier = dtree()

[ Info: For silent loading, specify `verbosity=0`.


import MLJDecisionTreeInterface ✔


DecisionTreeClassifier(
  max_depth = -1, 
  min_samples_leaf = 1, 
  min_samples_split = 2, 
  min_purity_increase = 0.0, 
  n_subfeatures = 0, 
  post_prune = false, 
  merge_purity_threshold = 1.0, 
  display_depth = 5, 
  feature_importance = :impurity, 
  rng = Random._GLOBAL_RNG())

### Train/Test Data Sets

Now that we have our classifier, we can begin to split the data into train/test sets. 

It should be noted that MLJ models are strict about the scientific type (scitype) of each column, and the raw `Textual` columns shown above cannot be used as they are. As a result, we will have to do some preprocessing of non-numeric data before feeding it to our model.

You can use `coerce` to convert textual data into `Multiclass` (categorical) data. We'll apply this function to `species`, `island`, and `sex`.

Afterwards, we'll also use `dropmissing()` to drop every row that contains a missing value. Note the new schema printed at the end of this cell, as well as the new data: the `Union{Missing, ...}` types have been removed.

In [6]:
penguins = coerce(penguins, :species => Multiclass,
                            :island => Multiclass,
                            :sex => Multiclass)

penguins = dropmissing(penguins)

selectrows(penguins, 1:10)  |> pretty

┌ Info: Trying to coerce from `Union{Missing, String7}` to `Multiclass`.
└ Coerced to `Union{Missing,Multiclass}` instead.


┌────────────────────────────────────┬────────────────────────────────────┬────────────────┬───────────────┬───────────────────┬─────────────┬───────────────────────────────────┐
│ species                            │ island                             │ bill_length_mm │ bill_depth_mm │ flipper_length_mm │ body_mass_g │ sex                               │
│ CategoricalValue{String15, UInt32} │ CategoricalValue{String15, UInt32} │ Float64        │ Float64       │ Int64             │ Int64       │ CategoricalValue{String7, UInt32} │
│ Multiclass{3}                      │ Multiclass{3}                      │ Continuous     │ Continuous    │ Count             │ Count       │ Multiclass{2}                     │
├────────────────────────────────────┼────────────────────────────────────┼────────────────┼──────────
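
To make the effect of `coerce` concrete, here is a small sketch on a toy vector rather than the penguins table. It assumes `scitype` is available after `using MLJ`, as it is for the `schema` call earlier in this notebook; the vector `raw` is made up for illustration.

```julia
using MLJ

# Plain Julia values are interpreted according to their machine type.
println(scitype(39.1))      # a Float64 is read as Continuous
println(scitype(181))       # an Int is read as Count
println(scitype("Adelie"))  # a String is read as Textual

# Illustrative data only: three labels drawn from two classes.
raw = ["male", "female", "male"]
sex = coerce(raw, Multiclass)

# After coercion the vector holds categorical (Multiclass) values.
println(scitype(sex))
```

The same idea applies column by column when `coerce` is called on a whole table with `:column => Multiclass` pairs.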

This notebook selects features and builds train/test sets with the `unpack` and `partition` functions.

> Our X values will be the train/test values for our features (that is, all the numeric measurements of the penguins).<br>
> Our Y values will be the train/test values for our target (that is, the species).

To select our features and leave out columns the model cannot process, we can use `unpack` to separate our data into x and y values. `unpack` makes column selections based on the predicates specified, in order, and each column is assigned to at most one output.

A predicate is any object `f` such that `f(name)` is true or false for each column `name::Symbol` of a `table`.

> The predicate `==(:species)` means that we are unpacking the `species` column into `y`.

> The predicate `name -> name != :sex && name != :island` means that we are unpacking all remaining columns except `sex` and `island` into `x`.

In [7]:
# Keep only the numeric columns as features: the categorical `sex` and
# `island` columns are left out because the decision tree is not fed
# unordered categorical features in this notebook.
y, x = unpack(penguins, ==(:species),
              name -> name != :sex && name != :island)

(CategoricalArrays.CategoricalValue{String15, UInt32}[String15("Adelie"), String15("Adelie"), String15("Adelie"), String15("Adelie"), String15("Adelie"), String15("Adelie"), String15("Adelie"), String15("Adelie"), String15("Adelie"), String15("Adelie")  …  String15("Chinstrap"), String15("Chinstrap"), String15("Chinstrap"), String15("Chinstrap"), String15("Chinstrap"), String15("Chinstrap"), String15("Chinstrap"), String15("Chinstrap"), String15("Chinstrap"), String15("Chinstrap")], 333×4 DataFrame
 Row │ bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g 
     │ Float64         Float64        Int64              Int64       
─────┼───────────────────────────────────────────────────────────────
   1 │           39.1           18.7                181         3750
   2 │           39.5           17.4                186         3800
   3 │           40.3           18.0                195         
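
To see how predicates passed to `unpack` pick columns, here is a minimal sketch in plain Julia that makes no MLJ calls. It applies a target predicate and a feature predicate, in order, to the penguins column names. The helper `assign_columns` is hypothetical: it only mimics the "each column goes to the first predicate it satisfies" behaviour and is not how `unpack` is implemented.

```julia
# Column names of the penguins table, in order.
cols = [:species, :island, :bill_length_mm, :bill_depth_mm,
        :flipper_length_mm, :body_mass_g, :sex]

# `==(:species)` is a one-argument function that is true only for :species.
is_target = ==(:species)
# True for every column that is neither :sex nor :island.
is_feature = name -> name != :sex && name != :island

# Hypothetical helper: give each column to the first predicate it satisfies;
# columns that satisfy no predicate are dropped.
function assign_columns(names, predicates...)
    groups = [Symbol[] for _ in predicates]
    for name in names
        i = findfirst(p -> p(name), predicates)
        i === nothing || push!(groups[i], name)
    end
    return groups
end

y_cols, x_cols = assign_columns(cols, is_target, is_feature)
println(y_cols)
println(x_cols)
```

Note that `is_feature(:species)` is also true, but `species` has already been claimed by the first predicate, which is why the feature table ends up with only the four numeric columns.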

After unpacking, we can use `partition` to split our dataset into train/test sets. We'll be training on an 80/20 split, where 80% of our data will be the train set, and 20% of our data will be the test set.

> We specify `shuffle` to be true in order to shuffle this data before partitioning it.

> `multi` is set to true because we pass a tuple, `(x, y)`, of objects sharing a common length. Each object is partitioned separately using the same fractions and the same row shuffling, so the features and targets stay aligned.

> Our `rng` is 4, an arbitrary seed for the random number generator that makes the shuffle reproducible.

In [8]:
(x_train, x_test), (y_train, y_test) = partition((x, y), 0.8, shuffle = true, multi = true, rng = 4)

((266×4 DataFrame
 Row │ bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g 
     │ Float64         Float64        Int64              Int64       
─────┼───────────────────────────────────────────────────────────────
   1 │           40.2           17.1                193         3400
   2 │           35.0           17.9                190         3450
   3 │           52.1           17.0                230         5550
   4 │           34.6           21.1                198         4400
   5 │           45.7           13.9                214         4400
   6 │           40.6           17.2                187         3475
   7 │           49.1           14.8                220         5150
   8 │           35.7           18.0                202         3550
   9 │           39.2           19.6                195         4675
  10 │           36.5           16.6                181         285
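
As a rough illustration of what a shuffled 80/20 split does, here is a plain-Julia sketch on row indices. It is not MLJ's implementation: the rounding rule is an assumption, and the particular rows chosen will differ from those `partition` selects even with the same seed.

```julia
using Random

n = 333                        # rows left after dropping missing values
rng = MersenneTwister(4)       # seeded generator, so the shuffle is repeatable
rows = shuffle(rng, 1:n)       # random permutation of the row indices

n_train = round(Int, 0.8 * n)  # assumption: round to the nearest whole row
train_rows = rows[1:n_train]
test_rows = rows[n_train+1:end]

println(length(train_rows), " training rows, ", length(test_rows), " test rows")
```

Indexing both the features and the target with the same `train_rows` and `test_rows` is what keeps them aligned, which is the role `multi = true` plays in the `partition` call.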

### Building a DecisionTreeClassifier

Great! Our data has been processed and is ready to build our model.

Now, we can train the `pengs_classifier` we made previously. 

We can do this by connecting the model (`pengs_classifier`) and training data (`x_train, y_train`) through a `machine()`. Machines bind models to data in MLJ. 

After we bind the data and the model, we then train the machine by calling `fit!` on the machine as demonstrated below.

We can then generate predictions using `predict()`, which accepts a machine and some new data as its parameters. For this classifier, each prediction is a probability distribution over the three species.

In [9]:
peng_machine = machine(pengs_classifier, x_train, y_train)
fit!(peng_machine)
prediction = predict(peng_machine, x_test)

[ Info: Training machine(DecisionTreeClassifier(max_depth = -1, …), …).


67-element CategoricalDistributions.UnivariateFiniteVector{Multiclass{3}, String15, UInt32, Float64}:
 UnivariateFinite{Multiclass{3}}(Adelie=>1.0, Chinstrap=>0.0, Gentoo=>0.0)
 UnivariateFinite{Multiclass{3}}(Adelie=>1.0, Chinstrap=>0.0, Gentoo=>0.0)
 UnivariateFinite{Multiclass{3}}(Adelie=>1.0, Chinstrap=>0.0, Gentoo=>0.0)
 UnivariateFinite{Multiclass{3}}(Adelie=>1.0, Chinstrap=>0.0, Gentoo=>0.0)
 UnivariateFinite{Multiclass{3}}(Adelie=>1.0, Chinstrap=>0.0, Gentoo=>0.0)
 UnivariateFinite{Multiclass{3}}(Adelie=>0.0, Chinstrap=>0.0, Gentoo=>1.0)
 UnivariateFinite{Multiclass{3}}(Adelie=>0.0, Chinstrap=>0.0, Gentoo=>1.0)
 UnivariateFinite{Multiclass{3}}(Adelie=>0.0, Chinstrap=>0.0, Gentoo=>1.0)
 UnivariateFinite{Multiclass{3}}(Adelie=>1.0, Chinstrap=>0.0, Gentoo=>0.0)
 UnivariateFinite{Multiclass{3}}(Adelie=>0.0, Chinstrap=>0.0, Gentoo=>1.0)
 UnivariateFinite{Multiclass{3}}(Adelie=>0.0, Chinstrap=>0.0, Gentoo=>1.0)
 UnivariateFinite{Multiclass{3}}(Adelie=>1.0, Chinstrap=>0.0, Gentoo=>0.0

### Evaluating Performance

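We can evaluate the model's performance by calculating its accuracy using `accuracy`, which returns the fraction of correct predictions: 0.0 means every prediction was wrong, and 1.0 means every prediction was right.

`accuracy` compares class labels, but `predict` returns a probability distribution for each row. Passing those distributions straight to `accuracy` is most likely why this cell originally printed `0.0`. We therefore take the most probable class of each distribution with `mode` before comparing against the actual y values of the test data. (`predict_mode(peng_machine, x_test)` produces the same point predictions in one step.)

In [10]:
print(accuracy(mode.(prediction), y_test))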
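
For intuition, accuracy is simply the fraction of predicted labels that equal the true labels. A minimal plain-Julia sketch follows; the two label vectors are made up for illustration and are not results from the penguin model.

```julia
using Statistics

# Illustrative labels only; these are not results from the penguin model.
y_true = ["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo"]
y_pred = ["Adelie", "Gentoo", "Gentoo", "Chinstrap", "Gentoo"]

# Compare element by element, then average the true/false values.
acc = mean(y_pred .== y_true)
println(acc)
```

Four of the five made-up predictions match, so `acc` is 0.8.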

### Resampling

Below are examples of popular resampling methods and how they are implemented in MLJ.

MLJ's `evaluate!()` function runs a resampling procedure on a machine (a model already bound to data). It accepts the machine, a resampling strategy (`resampling`), and one or more performance measures (`measure`).

`evaluate()` (no exclamation point) takes a model and data directly, so no machine has to be constructed first, e.g. `evaluate(model, X, y, resampling = cv, measure = accuracy, verbosity = 0)`

#### Holdout/Train-Test Split

> We're already familiar with Holdout resampling. We simulated this earlier using `partition` on the dataset.

> Holdout resampling splits the data once into a train set and a test set; the model is trained on the train set and tested on the test set. A common choice is an 80% train / 20% test split.

> It accepts the parameters:
- `fraction_train` - the fraction of the data (between 0 and 1) to put into the training set.
- `shuffle` - true/false, set to true to shuffle the data before splitting.
- `rng` - a random number generator or integer seed that makes the shuffle reproducible.

In [None]:
Holdout(fraction_train = 0.8, shuffle = true, rng = nothing)

#### K-fold Cross Validation

> Here, the data is split into `nfolds` smaller sets, called folds. The performance measure reported by k-fold cross-validation is the average of the values computed in a loop over the folds. For each fold:

> - A model is trained using the other `nfolds - 1` folds as training data.
> - The resulting model is validated on the remaining fold (i.e., it is used as a test set to compute a performance measure such as accuracy).

> It accepts the parameters:
- `nfolds` - the number of folds the data is split into. E.g. `nfolds = 4` splits the data into 4 folds, each holding 25% of the dataset, and trains and tests 4 times.
- `shuffle` - true/false, set to true to shuffle the data before splitting.
- `rng` - a random number generator or integer seed that makes the shuffle reproducible.

In [None]:
CV(nfolds = 6,  shuffle = nothing, rng = nothing)
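
To illustrate the fold bookkeeping behind k-fold cross-validation, here is a plain-Julia sketch on row indices. It is a simplification, not MLJ's implementation: it assumes the number of rows is divisible by the number of folds, does no shuffling, and uses illustrative values for `n` and `nfolds`.

```julia
n = 12        # number of rows (illustrative)
nfolds = 4    # number of folds

# Split 1:n into nfolds contiguous blocks of equal size.
fold_size = n ÷ nfolds   # assumes n is divisible by nfolds
folds = [((k - 1) * fold_size + 1):(k * fold_size) for k in 1:nfolds]

for (k, test_rows) in enumerate(folds)
    # Train on the other nfolds - 1 folds, test on the held-out fold.
    train_rows = setdiff(1:n, test_rows)
    println("fold $k: train on $(length(train_rows)) rows, test on $(length(test_rows)) rows")
end
```

Every row is used for testing exactly once, and each of the 4 folds holds a quarter of the rows.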

MLJ does not have dedicated built-in strategies for Leave-One-Out (LOO) or Bootstrap resampling. Repeated cross-validation is available through the `repeats` keyword of `evaluate!()`.

It is possible to implement custom resampling methods in MLJ. That methodology is documented [here](https://alan-turing-institute.github.io/MLJ.jl/dev/evaluating_model_performance/#Custom-resampling-strategies).

For the most up-to-date information about resampling methods, read more [here](https://alan-turing-institute.github.io/MLJ.jl/dev/evaluating_model_performance/).

### Using Resampling to Train a Model

Let's use K-fold cross-validation to train and evaluate a model! We'll be using 10 folds.

We'll be using the `evaluate!()` function we discussed above. It accepts the parameters:

- the machine being evaluated
- `resampling` - the resampling method. In this case, we're using `CV`.
- `measure` - the performance measure to evaluate based on
- `verbosity` - how much is being logged.

First, we'll create an instance of `CV`, then evaluate. We want to set shuffle to true (to ensure our data is not ordered anymore), and use our seed of 4.

We'll be using accuracy as our measure. Each performance measure and the strings associated with accessing them can be found [here](https://alan-turing-institute.github.io/MLJ.jl/dev/performance_measures/#List-of-measures).

It's also possible to evaluate with multiple measures by passing `measure` as an array of measures.

In [None]:
pengs_cv = CV(nfolds = 10, shuffle = true, rng = 4)

evaluate!(peng_machine, resampling = pengs_cv, measure = accuracy, verbosity = 0)

### Evaluating Model Performance Using Confusion Matrices

We can also use a confusion matrix to get a better idea of how our model has performed. It counts how many data points fell into each combination of predicted class and actual class.

The diagonal values are the numbers of correct predictions; every off-diagonal value counts a misclassification.

A confusion matrix is built from class labels, so we first convert the probabilistic predictions returned by `predict` into point predictions with `mode`.

In [None]:
ConfusionMatrix()(mode.(prediction), y_test)
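
The counting behind a confusion matrix can be sketched in plain Julia as follows. The label vectors are made up for illustration, and putting predictions on the rows and true labels on the columns is a choice made for this sketch.

```julia
using LinearAlgebra

# Illustrative labels only; these are not results from the penguin model.
classes = ["Adelie", "Chinstrap", "Gentoo"]
y_true = ["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Chinstrap"]
y_pred = ["Adelie", "Gentoo", "Gentoo", "Chinstrap", "Gentoo", "Adelie"]

# cm[i, j] counts rows predicted as classes[i] whose true label is classes[j].
index = Dict(c => i for (i, c) in enumerate(classes))
cm = zeros(Int, length(classes), length(classes))
for (p, t) in zip(y_pred, y_true)
    cm[index[p], index[t]] += 1
end

correct = tr(cm)   # sum of the diagonal = number of correct predictions
println(cm)
println(correct, " of ", sum(cm), " correct")
```

Dividing the diagonal sum by the total count recovers the accuracy, which ties the two evaluation views together.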

## Benchmarking
Benchmarking is the comparison of different learners on a single task or on multiple tasks. The end goal of benchmarking is to identify the best-performing learner for a given problem or task. <br>
MLJ does not have a built-in method for benchmarking. We do this by performing the following steps:

*   storing datasets in a list 
*   storing learners in a list
*   selecting the resampling method we're going to use 
*   iterating over the datasets and the learners, storing the score in each iteration

In [25]:
datasets = [DataFrame(PalmerPenguins.load()), DataFrame(load_iris()), DataFrame(load_crabs())]
GaussianNBClassifier = @load GaussianNBClassifier pkg=NaiveBayes
dtree = @iload DecisionTreeClassifier pkg = "DecisionTree"
classifiers = [dtree(),GaussianNBClassifier()]

resultsForPolting = []
for dataset in datasets
  

3-element Vector{DataFrame}:
 344×7 DataFrame
 Row │ species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  ⋯
     │ String15   String15   Float64?        Float64?       Int64?             ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Adelie     Torgersen            39.1           18.7                181  ⋯
   2 │ Adelie     Torgersen            39.5           17.4                186
   3 │ Adelie     Torgersen            40.3           18.0                195
   4 │ Adelie     Torgersen       missing        missing              missing
   5 │ Adelie     Torgersen            36.7           19.3                193  ⋯
   6 │ Adelie     Torgersen            39.3           20.6                190
   7 │ Adelie     Torgersen            38.9           17.8                181
   8
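
The benchmarking steps listed in this section can be sketched end to end as follows. This is a minimal sketch under assumptions, not the notebook's original cell: it uses the iris data plus a synthetic dataset from `make_blobs` as stand-ins, compares MLJ's built-in `ConstantClassifier` baseline with the decision tree, and assumes the MLJDecisionTreeInterface package installed earlier is available. The names `datasets`, `learners`, and `results` are illustrative.

```julia
using MLJ

# Step 1: datasets, each already split into features X and target y.
X_iris, y_iris = @load_iris
X_blobs, y_blobs = make_blobs(150, 4; centers = 3, rng = 4)  # synthetic stand-in
datasets = ["iris" => (X_iris, y_iris), "blobs" => (X_blobs, y_blobs)]

# Step 2: learners to compare.
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
learners = ["constant" => ConstantClassifier(), "tree" => Tree()]

# Step 3: one resampling strategy shared by every run.
resampling = CV(nfolds = 5, shuffle = true, rng = 4)

# Step 4: evaluate every learner on every dataset and store the score.
results = []
for (data_name, (X, y)) in datasets, (learner_name, learner) in learners
    e = evaluate(learner, X, y; resampling = resampling,
                 measure = accuracy, verbosity = 0)
    push!(results, (dataset = data_name, learner = learner_name,
                    accuracy = e.measurement[1]))
end

foreach(println, results)
```

Using the same `resampling` object for every run keeps the comparison fair, since each learner is scored on the same kind of split.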

#todo: regression