
Operator Based Machine Learning Pipeline Construction

Lars Kotthoff edited this page Mar 13, 2017 · 5 revisions

Background

mlr is a powerful package for general-purpose machine learning in R. Many machine learning applications require extensive preprocessing of the data to reach state-of-the-art performance. mlr already provides many preprocessing methods, which are described on our tutorial page. Many of these procedures have to be fused with a learner to create a wrapped learner, which ensures that the same preprocessing is used in training and prediction.

The current code makes defining and applying such processing pipelines harder than it needs to be:

  • Preprocessing operations, which often only ensure very basic properties of the data, have to be configured through a combination of options and different wrappers.
  • Adding custom preprocessing methods is unnecessarily complex.
  • The resulting learner objects are very complex, and the getters, setters and print methods for their parameters are awkward to define.
  • Using the same preprocessing methods on multiple learners often requires a lot of redundant code.

It would be much more natural if we could write something like the following:

preproc = remove.constants %>% filter.features %>% pca 
lrn = fuseLearnerWithPreprocessing(lrn, preproc)

This creates a preprocessing chain which can be added to any learner with a single command to construct a configurable pipeline.
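To make the idea of such a chain concrete, here is a minimal base-R sketch. It is not mlr code: the helper `makeStep`, the operator `%>>%` (named differently from magrittr's `%>%` to avoid a clash) and the `center` step are hypothetical and only illustrate how a chain can learn its state on training data and then replay the same preprocessing on new data.

```r
# Hypothetical sketch, not part of mlr: a step learns some state from the
# training data (`fit`) and applies it to any data set (`apply`).
makeStep = function(fit, apply) {
  function(data) {
    state = fit(data)
    list(
      data = apply(data, state),
      # closure that replays the learned preprocessing on new data
      predict = function(newdata) apply(newdata, state)
    )
  }
}

# Chain two steps: the second step is trained on the output of the first.
`%>>%` = function(lhs, rhs) {
  function(data) {
    a = lhs(data)
    b = rhs(a$data)
    list(
      data = b$data,
      predict = function(newdata) b$predict(a$predict(newdata))
    )
  }
}

# Drop columns that are constant in the training data.
remove.constants = makeStep(
  fit = function(d) {
    names(d)[vapply(d, function(x) length(unique(x)) > 1, logical(1))]
  },
  apply = function(d, keep) d[, keep, drop = FALSE]
)

# Subtract the training column means.
center = makeStep(
  fit = function(d) colMeans(d),
  apply = function(d, mu) as.data.frame(sweep(as.matrix(d), 2, mu))
)

preproc = remove.constants %>>% center

train = data.frame(a = c(1, 2, 3), b = c(5, 5, 5))
fitted = preproc(train)

# New data is processed with the state learned on `train`:
# column b is dropped and the training mean of a (2) is subtracted.
new = data.frame(a = c(4, 6), b = c(7, 8))
result = fitted$predict(new)
```

The important design point is that the state (which columns to keep, which means to subtract) is computed once during training and reused unchanged during prediction.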

Related work

The R programming language already offers a general-purpose package for piping function output into other functions, magrittr. It is heavily used by the dplyr package for data manipulation, but dplyr mostly provides basic low-level functionality, e.g., mutation, aggregation, selection and filtering. The aim of this project is to provide a similar syntax that is much more focused on machine learning and that makes use of the large number of preprocessing methods available in other R packages.

Another project that one could draw ideas from is pipelearner.

For comparison, the Python machine learning toolbox scikit-learn offers the Pipeline and FeatureUnion classes to chain multiple transformers and a final estimator into one object. As stated in their documentation, the main advantages are that only one call to fit and one call to predict are required for the complete pipeline, and that it is possible to optimize jointly over the hyperparameters of all steps.

Commercial software like RapidMiner and KNIME uses similar approaches in its graphical user interfaces, in which analysis pipelines can be created in a flowchart-like manner. These tools use the same basic principle of piping output from one step to the next.

Details of your coding project

  • Define a clean API for this (partially already exists and mentors will help a lot with this, but you need to make it better).

We have an internal branch, BRANCH, which contains a rough API idea for this (files TaskTransform.R and test_tt.R). It can be used as a starting point. The remaining tasks are:

  • Implement the abstract base class TaskTransform (TT).
  • Make TTs “joinable”, so they form a pipeline (for training and prediction).
  • Allow applying a TT to a task to transform it (during training and on new data during prediction).
  • Allow adding a TT to a learner through a wrapper, so it becomes the learner’s "first preprocessing step".
  • Integrate the package vtreat, which covers many preprocessing tasks for which we currently have custom code.
  • Convert all existing mlr preprocessing operations (there are not too many) into TTs; some can be removed once vtreat is integrated.
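The first four tasks can be sketched in plain base R as follows. This is only an illustration under stated assumptions: all names (`makeTT`, `trainTT`, `applyTT`, `joinTT`, `wrapWithTT`, `dropConstants`, `scale01`) are hypothetical, the sketch works on data frames instead of mlr tasks, and it does not reflect the API in the branch mentioned above.

```r
# Hypothetical S3 sketch of a TaskTransform (TT); not mlr's or the branch's API.
makeTT = function(name, train, transform) {
  structure(list(name = name, train = train, transform = transform),
            class = "TaskTransform")
}

# Learn the TT's state on training data; returns a trained TT.
trainTT = function(tt, data) {
  structure(list(tt = tt, state = tt$train(data)), class = "TrainedTT")
}

# Apply a trained TT to training data or to new data at prediction time.
applyTT = function(trained, data) trained$tt$transform(data, trained$state)

# Join two TTs into one; the second is trained on the output of the first.
joinTT = function(first, second) {
  makeTT(
    name = paste(first$name, second$name, sep = " >> "),
    train = function(data) {
      t1 = trainTT(first, data)
      t2 = trainTT(second, applyTT(t1, data))
      list(t1, t2)
    },
    transform = function(data, state) {
      applyTT(state[[2]], applyTT(state[[1]], data))
    }
  )
}

# Fuse a TT with a learner so the TT becomes the first preprocessing step.
# `fit` is assumed to be a function(formula, data) whose result supports
# predict(), e.g. stats::lm.
wrapWithTT = function(tt, fit, target) {
  function(data) {
    feats = setdiff(names(data), target)
    trained = trainTT(tt, data[, feats, drop = FALSE])
    prep = function(d) applyTT(trained, d[, feats, drop = FALSE])
    train.data = cbind(prep(data), data[, target, drop = FALSE])
    model = fit(as.formula(paste(target, "~ .")), train.data)
    list(model = model,
         predict = function(newdata) predict(model, prep(newdata)))
  }
}

# Two example TTs: drop constant columns, then rescale to the training range.
dropConstants = makeTT("dropConstants",
  train = function(d) {
    names(d)[vapply(d, function(x) length(unique(x)) > 1, logical(1))]
  },
  transform = function(d, keep) d[, keep, drop = FALSE])

scale01 = makeTT("scale01",
  train = function(d) {
    list(lo = vapply(d, min, numeric(1)), hi = vapply(d, max, numeric(1)))
  },
  transform = function(d, s) {
    as.data.frame(Map(function(x, lo, hi) (x - lo) / (hi - lo), d, s$lo, s$hi))
  })

pipeline = joinTT(dropConstants, scale01)

train = data.frame(x = c(0, 5, 10), k = c(1, 1, 1), y = c(1, 2, 3))
fused = wrapWithTT(pipeline, lm, target = "y")(train)

# New data passes through the same trained pipeline before prediction.
pred = fused$predict(data.frame(x = 20, k = 9))
```

Because a joined TT is itself a TT, pipelines of any length can be built by repeated joining, and the wrapper does not need to know how many steps it contains.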

Milestones

  • Month 1: Definition of API with proof of concept
  • Month 2: Implementation mostly finished, along with comprehensive tests
  • Month 3: Implementation finished, along with large-scale evaluation

Expected impact

This is the first step towards defining a domain-specific language for machine learning. It will enable users to implement machine learning pipelines much faster and with less code, and make machine learning in R much more accessible.

Mentors

Tests

Required: Good knowledge of machine learning, R package development and R in general. This is NOT a simple project. You need to feel comfortable with larger projects, designing software and OO design. This project is not for people beginning to learn R.

Go to the mlr issue tracker and state there which issue you want to solve. In particular, look for the issues tagged "effort-simplefix". Show us your skills by creating a good pull request.

Then create a document and write down a few ideas on how you would approach the project and/or extend the linked branch.

Solutions of tests

Students, please post a link to your test results here.
