In [1]:
using MLJ, RDatasets
using Random:seed!
seed!(1234)

MersenneTwister(UInt32[0x000004d2])

## MLJ Basics

In [2]:
boston = dataset("MASS", "Boston")
first(boston, 3)

Row,Crim,Zn,Indus,Chas,NOx,Rm,Age,Dis,Rad,Tax
,Float64,Float64,Float64,Int64,Float64,Float64,Float64,Float64,Int64,Int64
1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296
2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242
3,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242


### Machine types and Scitypes

MLJ distinguishes between the machine type (how the data is originally encoded) and the scientific type, or scitype (how the user believes the data should be interpreted).
For instance, an "Age" feature may be encoded as an integer but should be interpreted as a continuous feature.

Each scitype corresponds to a single (abstract) machine type, e.g. `Continuous` corresponds to `AbstractFloat`.
The converse does not hold: one machine type can correspond to several scientific types (e.g. in categorical cases).

When given data with no information about the scientific types, MLJ tries to guess them. If a guess is incorrect, the user should specify the intended interpretation.

**Current context**: most features are floating-point numbers and so will be interpreted as `Continuous`. However, "Rad", for instance, will be interpreted as a `Count`, whereas we would like to interpret it as `Continuous`.

In [3]:
scitypes(boston)

(Crim = MLJBase.Continuous,
 Zn = MLJBase.Continuous,
 Indus = MLJBase.Continuous,
 Chas = MLJBase.Count,
 NOx = MLJBase.Continuous,
 Rm = MLJBase.Continuous,
 Age = MLJBase.Continuous,
 Dis = MLJBase.Continuous,
 Rad = MLJBase.Count,
 Tax = MLJBase.Count,
 PTRatio = MLJBase.Continuous,
 Black = MLJBase.Continuous,
 LStat = MLJBase.Continuous,
 MedV = MLJBase.Continuous,)
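The gap between machine type and intended interpretation can be illustrated in plain Julia. The sketch below deliberately avoids MLJ's own scitype machinery: the column values are made up for illustration, and converting with `float.` is just one assumed way to make an integer column behave as a continuous one.

```julia
# Illustrative values only; not taken from the Boston data.
rad = [1, 2, 2, 3, 24]            # machine type: Int, read by default as a count
rad_continuous = float.(rad)      # same values, now encoded as Float64

println(eltype(rad), " -> ", eltype(rad_continuous))

# One machine type, several possible interpretations:
chas = [0, 1, 0, 0, 1]            # also Int
chas_as_count = chas              # read as a count
chas_as_flag  = chas .== 1        # read as a two-class label
```

The values are unchanged by the conversion; only the encoding, and hence the default interpretation, differs.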

### Tasks

Tasks wrap

1. the data
1. the interpretation of the data (see scitypes above)
1. the learning objectives (supervised, unsupervised, etc.)

Here we want to specify the data, ignore the `Chas` column, change the interpretation of `Rad` and `Tax`, and request a supervised model with probabilistic output whose response is the `MedV` variable.

In [4]:
task = supervised(
        data   = boston,
        target = :MedV,
        ignore = :Chas,
        types  = Dict(:Rad=>Continuous, :Tax=>Continuous),
        is_probabilistic = true)
# In case the data is arranged in a specific order, we may want to shuffle it so that this doesn't impact the training.
shuffle!(task)

┌ Info: 
│ is_probabilistic = true
│ input_scitype_union = MLJBase.Continuous 
│ target_scitype_union = MLJBase.Continuous
└ @ MLJBase /Users/tlienart/.julia/dev/MLJBase/src/tasks.jl:104


SupervisedTask @ 1…85


### Models

What models are available to us for the given task?

In [5]:
models(task)

Dict{Any,Any} with 2 entries:
  "MLJ" => Any["MLJ.Constant.ConstantRegressor"]
  "GLM" => Any["OLSRegressor"]

If we just want a deterministic model, a few more models are available:

In [6]:
task.is_probabilistic = false
models(task)

Dict{Any,Any} with 6 entries:
  "MultivariateStats" => Any["RidgeRegressor"]
  "MLJ"               => Any["MLJ.Constant.DeterministicConstantRegressor", "ML…
  "DecisionTree"      => Any["DecisionTreeRegressor"]
  "ScikitLearn"       => Any["SVMLRegressor", "ElasticNet", "ElasticNetCV", "SV…
  "LIBSVM"            => Any["EpsilonSVR", "NuSVR"]
  "XGBoost"           => Any["XGBoostRegressor"]

### Binding a task to a model

Let's say we want to use a simple ridge regression to start with.
We first need to load the model, set its hyperparameter (here `lambda`), and then wrap the model and the task in a `machine`.

In [7]:
@load RidgeRegressor
ridge  = RidgeRegressor(lambda=0.1)

import MLJModels ✔
import MultivariateStats ✔
import MLJModels.MultivariateStats_.RidgeRegressor ✔


MLJModels.MultivariateStats_.RidgeRegressor(lambda = 0.1,) @ 3…04

In [8]:
mach = machine(ridge, task)

Machine{RidgeRegressor} @ 9…34


### Training and evaluation of performance

The step-by-step approach involves:

* creating a train/test split
* fitting the machine
* predicting on the test set
* checking the performance

In [9]:
train, test = partition(1:nrows(task), 0.7)
fit!(mach, rows=train)
yhat = predict(mach, rows=test)

┌ Info: Training Machine{RidgeRegressor} @ 9…34.
└ @ MLJ /Users/tlienart/.julia/dev/MLJ/src/machines.jl:140


152-element Array{Float64,1}:
 27.875406760269595
 22.539700844045903
 20.687610162632115
 16.055185145971578
 27.035480363584426
 22.08399548853992 
 25.68178700626379 
 16.197770168117202
 28.660543916730674
 19.938590782495897
  ⋮                
  6.411957395028047
 25.220138235162814
 20.73029337806124 
 20.503424619204196
 14.247863747123866
 25.634903856481294
 23.898588020215485
 19.052170414464587
 28.0721791686803  

In [10]:
rms(yhat, task.y[test])

5.189945057488582
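For readers who want to see what the split and the metric amount to, here is a minimal plain-Julia sketch. The helpers `split_rows` and `rmse` are hypothetical names introduced for illustration, not MLJ functions, and rounding the training size down is an assumption about how a 0.7 fraction is turned into a row count.

```julia
using Statistics: mean

# Hypothetical helper: the leading block of rows is used for training,
# the remainder for testing (no shuffling is done here).
function split_rows(rows::AbstractVector{<:Integer}, fraction_train::Real)
    n_train = floor(Int, fraction_train * length(rows))
    return rows[1:n_train], rows[n_train+1:end]
end

# Root-mean-square error between predictions and observed targets.
rmse(yhat, y) = sqrt(mean(abs2, yhat .- y))

train_rows, test_rows = split_rows(1:506, 0.7)
println(length(train_rows), " training rows, ", length(test_rows), " test rows")
```

With 506 rows this gives 354 training rows and 152 test rows, which matches the length of the prediction vector shown above.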

This can be done "all in one" using the `evaluate!` function, which returns the metric.

In [11]:
evaluate!(mach,
          resampling = Holdout(fraction_train = 0.7),
          measure    = rms)
# Either way, you can extract the parameters with `fitted_params` applied to the machine:
fp = fitted_params(mach);
fp.coefficients

┌ Info: Evaluating using a holdout set. 
│ fraction_train=0.7 
│ shuffle=false 
│ measure=rms 
│ operation=predict 
│ Resampling from all rows. 
└ @ MLJ /Users/tlienart/.julia/dev/MLJ/src/resampling.jl:92


12-element Array{Float64,1}:
  -0.09729297300527442 
   0.036081041218516664
   0.014619300443044603
 -16.442630582387928   
   3.6064512838027496  
   0.018083869485879565
  -1.2542562200220786  
   0.33786937633742553 
  -0.013689025192817809
  -0.9578864713233827  
   0.009176603960309929
  -0.5803250413745314  

In [12]:
fp.bias

36.41191172003011
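To make the roles of `coefficients` and `bias` concrete, the sketch below fits a ridge regression in closed form using only the standard library. It assumes the usual formulation in which the bias is not penalized; whether this matches the wrapped `RidgeRegressor` term for term is an assumption, and `ridge_fit` and `ridge_predict` are hypothetical helper names.

```julia
using LinearAlgebra, Statistics

# Closed-form ridge fit with an unpenalized bias (assumed formulation):
# center X and y, solve (Xc'Xc + lambda*I) a = Xc'yc, then recover the bias.
function ridge_fit(X::AbstractMatrix, y::AbstractVector, lambda::Real)
    xbar = vec(mean(X, dims=1))     # column means of the features
    ybar = mean(y)
    Xc = X .- xbar'                 # centered features
    yc = y .- ybar                  # centered target
    coefficients = (Xc' * Xc + lambda * I) \ (Xc' * yc)
    bias = ybar - dot(xbar, coefficients)
    return (coefficients = coefficients, bias = bias)
end

# Predictions are a linear map of the features plus the bias.
ridge_predict(fp, X) = X * fp.coefficients .+ fp.bias

# Toy data lying exactly on y = 2x + 1.
X = reshape([0.0, 1.0, 2.0, 3.0], :, 1)
y = 2 .* X[:, 1] .+ 1
fp_toy = ridge_fit(X, y, 0.0)
```

With `lambda = 0.0` the toy fit recovers the slope and intercept of the line; increasing `lambda` shrinks the coefficient toward zero while the bias adjusts so that predictions still pass through the mean of the data.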

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*