# Composing models

## Generating dummy data

In [1]:
using MLJ
using PrettyPrinting

@load KNNRegressor


┌ Info: For silent loading, specify `verbosity=0`.
└ @ Main /home/sandhya/.julia/packages/MLJModels/E8BbE/src/loading.jl:168

import NearestNeighborModels ✔

NearestNeighborModels.KNNRegressor

In [2]:
# input
X = (age    = [23, 45, 34, 25, 67],
     gender = categorical(['m', 'm', 'f', 'm', 'f']))
# target
height = [178, 194, 165, 173, 168];

In [3]:
scitype(X.age)

AbstractVector{Count} (alias for AbstractArray{Count, 1})

## Declaring a pipeline

A typical workflow for such data is to one-hot encode the categorical features and then fit a regression model to the result. Suppose we want to apply the following steps:

- standardize the target variable (`:height`)
- one-hot encode the categorical data
- train a KNN regression model
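To make the first two steps concrete, here is a minimal sketch in plain Julia, using only the `Statistics` standard library rather than MLJ's `OneHotEncoder` and `UnivariateStandardizer`. The helper names (`gender_levels`, `onehot`, `z`, `restored`) are illustrative assumptions, and the column layout need not match what MLJ produces.

```julia
using Statistics

# Same toy values as above, as plain vectors (no MLJ types involved).
gender = ['m', 'm', 'f', 'm', 'f']
height = [178, 194, 165, 173, 168]

# One-hot encoding: one 0/1 indicator column per level of the feature.
gender_levels = sort(unique(gender))                       # ['f', 'm']
onehot = [Int(g == l) for g in gender, l in gender_levels] # 5×2 matrix

# Standardizing the target: subtract the mean, divide by the standard deviation.
μ, σ = mean(height), std(height)
z = (height .- μ) ./ σ

# Predictions made on the standardized scale are mapped back with the inverse.
restored = z .* σ .+ μ

# Statistics of the first four heights only. These agree with the
# `mean_and_std` shown in the evaluation output, which is consistent with
# (but does not prove) those rows being the training set of the holdout split.
μ4, σ4 = mean(height[1:4]), std(height[1:4])
```

In a real pipeline the standardization statistics are learned from the training rows only, which is why the last two values differ from `μ` and `σ` computed on all five observations.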

The `@pipeline` macro lets you define such a simple (non-branching) pipeline, whose steps are applied in order:

In [7]:
using NearestNeighborModels

In [8]:
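# :age has scitype Count (see above), so it is coerced to Continuous before the KNN step;
# `target` names the transformation applied to the target variable
pipe = @pipeline(X -> coerce(X, :age=>Continuous),
                 OneHotEncoder(),
                 KNNRegressor(K=3),
                 target = UnivariateStandardizer());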

┌ Info: Treating pipeline as a `Deterministic` predictor.
│ To override, specify `prediction_type=...` (options: :deterministic, :probabilistic, :interval). 
└ @ MLJBase /home/sandhya/.julia/packages/MLJBase/pCCd7/src/composition/models/pipelines.jl:372
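
As a rough illustration of what the last pipeline step computes, the following plain-Julia sketch predicts a target as the unweighted mean over the `k` nearest training rows. It is not the NearestNeighborModels implementation (which, as the evaluation output shows, builds a KD-tree); `knn_predict` and the data are hypothetical, and uniform weighting is an assumption.

```julia
using Statistics

# Predict the target at `x` as the mean target of the k nearest
# training rows, using Euclidean distance and a brute-force search.
function knn_predict(Xtrain::Matrix{Float64}, ytrain::Vector{Float64},
                     x::Vector{Float64}, k::Int)
    dists = [sqrt(sum(abs2, Xtrain[i, :] .- x)) for i in 1:size(Xtrain, 1)]
    nearest = partialsortperm(dists, 1:k)   # indices of the k smallest distances
    return mean(ytrain[nearest])
end

# Hypothetical training data, for illustration only.
Xtrain = [0.0 0.0; 1.0 0.0; 5.0 5.0; 6.0 5.0]
ytrain = [1.0, 3.0, 10.0, 12.0]

near_origin = knn_predict(Xtrain, ytrain, [0.4, 0.0], 2)  # averages rows 1 and 2
near_far    = knn_predict(Xtrain, ytrain, [5.5, 5.0], 2)  # averages rows 3 and 4
```

Because the prediction depends on distances between rows, the feature scale and type matter, which is one reason the pipeline coerces `:age` and encodes `gender` numerically before this step.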


In [10]:
# hyperparameters of each pipeline component can be read and set with dot syntax
pipe.knn_regressor.K = 2
pipe.one_hot_encoder.drop_last = true

true
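
The evaluation that follows reports the `rms` measure, i.e. the root-mean-squared error. A minimal plain-Julia sketch of that quantity is given here; the function name `rmse` and the numbers are hypothetical and are not produced by the pipeline.

```julia
using Statistics

# Root-mean-squared error between predictions `yhat` and ground truth `y`.
rmse(yhat, y) = sqrt(mean(abs2, yhat .- y))

# Hypothetical values, for illustration only.
y    = [168.0, 173.0]
yhat = [170.0, 169.0]
err  = rmse(yhat, y)          # sqrt((2^2 + 4^2) / 2) = sqrt(10)

# With a single held-out observation the measure reduces to the absolute error.
single = rmse([179.5], [168.0])
```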

In [11]:
evaluate(pipe, X, height, resampling=Holdout(),
         measure=rms) |> pprint

(measure = [RootMeanSquaredError @344],
 measurement = [11.5],
 per_fold = [[11.5]],
 per_observation = [missing],
 fitted_params_per_fold =
     [(one_hot_encoder = (fitresult = OneHotEncoderResult @925,),
       knn_regressor =
           (tree =
                NearestNeighbors.KDTree{StaticArrays.SVector{2, Float64}, Euclidean, Float64}
  Number of points: 4
  Dimensions: 2
  Metric: Euclidean(0.0)
  Reordered: true,),
       target = (mean_and_std = (177.5, 12.233832869001711),),
       machines = [Machine{OneHotEncoder,…} @608,
                   Machine{KNNRegressor,…} @086,
                   Machine{UnivariateStandardizer,…} @722],
       fitted_params_given_machine =
           OrderedCollections.LittleDict{Any, Any, Vector{Any}, Vector{Any}}(Machine{OneHotEncoder,…} @608 => (fitresult = OneHotEncoderResult @925,), Machine{KNNRegressor,…} @086 => (tree = NearestNeighbors.KDTree{StaticArrays.SVector{2, Float