Add import / export features using the PMML format for decision tree based models #1596

Closed
ogrisel opened this Issue Jan 19, 2013 · 40 comments

Owner

ogrisel commented Jan 19, 2013

PMML (Predictive Model Markup Language) is a standard interchange format for trained predictive models (and more).

Here is a list of industry players and open source projects supporting PMML:

http://www.dmg.org/products.html

In particular, BigML supports export / import of PMML for decision tree models (and maybe ensembles of tree models too; this needs checking).

See the discussion in the comments of this thread:

http://www.quora.com/Machine-Learning/What-are-the-pros-and-cons-of-using-PMML-as-an-interchange-format-for-predictive-analytics-models/answer/Francisco-J-Martin

Note: BigML also supports a simpler, more compact and human readable JSON version of PMML named JSON PML as documented on this gist:

https://gist.github.com/4565563

Supporting PMML and/or JSON PML export of scikit-learn decision trees would make it possible to use the BigML web user interface to introspect the trees and run them on datasets loaded on the platform or publish them using the BigML features.

Using an external format for the persistence of some scikit-learn models (maybe not all) would also provide a partial solution to the issue of persistence / loading of models trained using a prior, class incompatible version of scikit-learn.

PMML export would also make it easier to replicate scientific experiments (and help with reproduction too) if scientists using decision tree models published the PMML export of their ensemble of trees as a technical annex to a paper, for instance. R/Rattle and Weka already support PMML exports and imports.

Open questions:

  • how big (in bytes) would a PMML export of a realistic ensemble of trees be, such as the ones used in computer vision?
  • the PMML format looks very verbose; we could probably devise a more compact model export format using msgpack, BSON, Avro, Protocol Buffers or Parquet as a more generic solution for the persistence of scikit-learn models that does not rely upon the Python class structure.
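
One rough way to get a feel for the size question is to serialize the same tree both ways and count bytes. The sketch below is only an illustration: `toy_tree`, `add_node` and `tree_to_pmml_fragment` are hypothetical names, the tree is hand-written rather than taken from scikit-learn, and the output is a simplified PMML-style `TreeModel` fragment (no `DataDictionary` or `MiningSchema`), not a schema-validated document.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical toy tree: internal nodes test `feature <= threshold`,
# leaves carry the predicted class in `score`.
toy_tree = {
    "feature": "petal_length", "threshold": 2.45,
    "left": {"score": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.75,
        "left": {"score": "versicolor"},
        "right": {"score": "virginica"},
    },
}


def add_node(parent, node, predicate=None):
    """Append a PMML-style <Node> describing `node` under `parent`."""
    elem = ET.SubElement(parent, "Node")
    if predicate is None:
        ET.SubElement(elem, "True")  # the root node matches every record
    else:
        field, operator, value = predicate
        ET.SubElement(elem, "SimplePredicate",
                      field=field, operator=operator, value=repr(value))
    if "score" in node:
        elem.set("score", node["score"])  # leaf: store the prediction
    else:
        field, threshold = node["feature"], node["threshold"]
        add_node(elem, node["left"], (field, "lessOrEqual", threshold))
        add_node(elem, node["right"], (field, "greaterThan", threshold))
    return elem


def tree_to_pmml_fragment(tree):
    """Serialize the nested-dict tree to a simplified TreeModel fragment."""
    model = ET.Element("TreeModel", functionName="classification")
    add_node(model, tree)
    return ET.tostring(model, encoding="utf-8")


pmml_bytes = tree_to_pmml_fragment(toy_tree)
json_bytes = json.dumps(toy_tree, separators=(",", ":")).encode("utf-8")
print("PMML-style fragment:", len(pmml_bytes), "bytes")
print("compact JSON:", len(json_bytes), "bytes")
```

Running the same comparison on a real forest would mean walking the fitted trees instead of a hand-written dict; the toy numbers only hint at the per-node overhead of each encoding.
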
Owner

amueller commented Jan 19, 2013

That indeed looks very interesting.

Owner

larsmans commented Jan 20, 2013

-1 on anything related to XML in the main package. Handling it properly, in my experience, requires a lot of work and expertise. It's also terribly verbose and therefore slow to produce and parse, making the tests even slower than they already are.

Owner

GaelVaroquaux commented Jan 20, 2013

-1 on anything related to XML in the main package. Handling it properly, in my
experience, requires a lot of work and expertise. It's also terribly verbose
and therefore slow to produce and parse, making the tests even slower than they
already are.

I tend to agree :)

Owner

amueller commented Jan 23, 2013

having some alternative way of storing models has also turned up in the user survey.
As did the problem of incompatible pickles.

Is there a good non-xml format? We could at least try to define some interface that makes it possible to read/write models.

Owner

larsmans commented Jan 23, 2013

Ideally, we would decouple the formats from the estimators. That way, a separate PMML package could dump and load them, and similarly for other formats -- N+M instead of N×M.
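
A minimal sketch of what that decoupling could look like, assuming a neutral intermediate description (a plain dict) sits between estimators and formats. `ToyLinearModel`, `DESCRIBERS` and `WRITERS` are hypothetical names for illustration, not scikit-learn APIs: each estimator class gets one "describe" function and each format gets one writer, so the cost is N + M pieces rather than N×M.

```python
import json
import pickle


class ToyLinearModel:
    """Hypothetical stand-in for an estimator class."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.coef_ = None


def describe_toy_linear(model):
    """N side: turn one estimator type into a neutral description."""
    return {
        "kind": "linear",
        "hyperparams": {"alpha": model.alpha},
        "learned": {"coef": list(model.coef_)},
    }


# One entry per estimator class (N entries).
DESCRIBERS = {ToyLinearModel: describe_toy_linear}

# One entry per output format (M entries); writers only see the neutral dict.
WRITERS = {
    "json": lambda description: json.dumps(description, sort_keys=True).encode("utf-8"),
    "pickle": lambda description: pickle.dumps(description),
}


def dump(model, fmt):
    """Describe the model, then hand the description to the chosen writer."""
    description = DESCRIBERS[type(model)](model)
    return WRITERS[fmt](description)


model = ToyLinearModel(alpha=0.5)
model.coef_ = [1.0, -2.0]
print(dump(model, "json").decode("utf-8"))
```

A separate PMML package would then only need to register one more writer that understands the neutral description.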

Owner

amueller commented Jan 23, 2013

But then the estimator still has to know what needs to be stored somehow, right?

Owner

larsmans commented Jan 23, 2013

Yes, so probably a large part of the N×M matrix would be missing. The alternative is to have a format per estimator, but then that format should be self-identifying. I have to admit, that's something PMML would solve, although mapping our estimator classes to its categories would be painful. Would HDF5 help here?

Owner

GaelVaroquaux commented Jan 24, 2013

On Wed, Jan 23, 2013 at 02:26:54PM -0800, Andreas Mueller wrote:

having some alternative way of storing models has also turned up in the user
survey.
As did the problem of incompatible pickles.

Is there a good non-xml format? We could at least try to define some interface
that makes it possible to read/write models.

Honestly, Prabhu and I lost a lot of time trying to do this with Mayavi
and never could do it robustly. Having a persistence format that is both
rich and compatible across versions takes a lot of developer time. This is
typically the kind of feature that I refuse to implement nowadays. If
the users think enough/learn enough about their problem, they can
implement the persistence. It is not clear that the packages have the
resources to deal with these problems.

G

Owner

GaelVaroquaux commented Jan 24, 2013

The alternative is to have a format per estimator, but then that format
should be self-identifying. I have to admit, that's something PMML
would solve, although mapping our estimator classes to its categories
would be painful.

Yes, basically we are reinventing a pickle-like mechanism, but with more
control.

Would HDF5 help here?

Not really, I believe. The problem is to define a data model and adaptor
paths to go from versions to versions. Where this data model is stored
doesn't really matter, as it is not the difficulty.

Owner

ogrisel commented Jan 24, 2013

The alternative is to have a format per estimator, but then that format
should be self-identifying. I have to admit, that's something PMML
would solve, although mapping our estimator classes to its categories
would be painful.

Yes, basically we are reinventing a pickle-like mechanism, but with more
control.

PMML cannot be a full replacement for pickling: the unpickled models can only predict (and maybe transform for some of them) but will not allow us to cover all the scikit-learn specific parameters and features not

Would HDF5 help here?

Not really, I believe. The problem is to define a data model and adaptor
paths to go from versions to versions. Where this data model is stored
doesn't really matter, as it is not the difficulty.

We could have a JSON / BSON representation specific to scikit-learn plus adaptors to guarantee an upgrade path from version to version. JSON is simple enough, well supported by the standard library, and would make things easier for people wanting to productionize scikit-learn trained models in cross-platform environments (for instance with the Python ecosystem for data prototyping and Java-based technologies on the production servers).

But that would not provide interop with external products and services that already support PMML (R libraries, Weka, BigML, Google Prediction API...).

The PMML support could be done outside of the main sklearn repo (at least as long as people fear that it would impose an unnecessary maintenance burden).

Using an external scikit-learn-pmml project would allow us to add the lxml dependency to that project, which makes XML support much cleaner (esp. w.r.t. proper support of namespaces) than just using the standard library.
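
To make the "JSON representation plus adaptors" idea above a bit more concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption for illustration: the `format_version` field, the document layout and the two schema changes are invented, and no such format exists in scikit-learn. The point is only that each adaptor upgrades a document by one version, so old files are brought up to date by chaining adaptors.

```python
import json

CURRENT_VERSION = 3


def _v1_to_v2(doc):
    # Hypothetical schema change: "coef" was renamed to "coef_".
    doc = dict(doc)
    doc["coef_"] = doc.pop("coef")
    doc["format_version"] = 2
    return doc


def _v2_to_v3(doc):
    # Hypothetical schema change: an intercept field was added.
    doc = dict(doc)
    doc.setdefault("intercept_", 0.0)
    doc["format_version"] = 3
    return doc


# Each adaptor upgrades a document from version N to version N + 1.
ADAPTORS = {1: _v1_to_v2, 2: _v2_to_v3}


def load_model_document(text):
    """Parse a JSON model document and upgrade it to CURRENT_VERSION."""
    doc = json.loads(text)
    while doc["format_version"] < CURRENT_VERSION:
        doc = ADAPTORS[doc["format_version"]](doc)
    if doc["format_version"] != CURRENT_VERSION:
        raise ValueError("document is newer than this reader understands")
    return doc


old_text = '{"format_version": 1, "estimator": "toy_linear", "coef": [1.0, 2.0]}'
print(load_model_document(old_text))
```

The maintenance cost discussed earlier in the thread shows up here as one adaptor per released format version, each of which needs its own tests.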

Owner

larsmans commented Jan 24, 2013

We can also reuse pickle but not pickle the instances, but some neutralized representation of them, e.g.

(str(type(self)), public_hyperparams, learned_params)

Pro: pickle is an efficient special-purpose compressor for Python objects.
Con: the usual security issues.
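
A minimal sketch of that idea, under a few assumptions: `ToyCentroidClassifier` and `REGISTRY` are hypothetical, the class name is taken from `type(est).__name__` rather than `str(type(self))`, and learned parameters are recognised by the trailing-underscore naming convention. Restoring goes through an explicit registry of known classes, but `pickle.loads` on untrusted data remains unsafe, so the security caveat still applies.

```python
import pickle


class ToyCentroidClassifier:
    """Hypothetical stand-in for an estimator."""

    def __init__(self, metric="euclidean"):
        self.metric = metric

    def fit(self, centroids):
        self.centroids_ = centroids  # learned state ends with "_"
        return self


# Only classes listed here can be rebuilt from a neutralized tuple.
REGISTRY = {"ToyCentroidClassifier": ToyCentroidClassifier}


def neutralize(est):
    """Return (class name, public hyperparams, learned params)."""
    state = vars(est)
    hyperparams = {k: v for k, v in state.items() if not k.endswith("_")}
    learned = {k: v for k, v in state.items() if k.endswith("_")}
    return (type(est).__name__, hyperparams, learned)


def dumps(est):
    # Pickle the plain tuple instead of the estimator instance itself.
    return pickle.dumps(neutralize(est))


def loads(data):
    name, hyperparams, learned = pickle.loads(data)
    est = REGISTRY[name](**hyperparams)
    for key, value in learned.items():
        setattr(est, key, value)
    return est


est = ToyCentroidClassifier(metric="manhattan").fit([[0.0, 1.0]])
restored = loads(dumps(est))
print(neutralize(restored))
```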

Owner

larsmans commented Jan 24, 2013

@ogrisel LXML is a bit cleaner than vanilla etree, but its full XPath 1.0 support comes at a price. It eats memory like a hog and sometimes it's actually slower, as I found out when hacking on this (note the dirty namespace hack, which wouldn't have been much cleaner with LXML :)

Owner

ogrisel commented Jun 17, 2014

To update this discussion with a new data point: some of our users such as AirBnB have developed their own in-house PMML exporters for the subset of sklearn models they use: http://nerds.airbnb.com/architecting-machine-learning-system-risk/

Owner

mblondel commented Jun 17, 2014

I think an external sklearn-pmml project is a good idea. This will also make it possible to check that popular 3rd party PMML implementations evaluate exported models correctly.

Contributor

vruusmann commented Jun 21, 2014

Zementis appears to be working on a similar project:
https://support.zementis.com/entries/37092748-Introducing-Py2PMML

bmabey commented Jun 22, 2014

The Zementis project is closed-source and only available if you purchase a license to their other software (I checked before we started our own internal project).

Hi,

Any plans to work with, or build on, the AirBnB project to produce PMML out of scikit-learn? (Is their code available? I did not find it on GitHub.) Did the discussion go further?

Thanks for your work.

Has anyone made progress on a PMML exporter? I have built a Naive Bayes model and would like to export it to PMML, hoping not to build my own parser if I can help it.

hey guys, any news about this one? is there a timeline? is it planned for the SVM models as well, or are you focusing on decision trees first?
Thanks for all your efforts!

Owner

GaelVaroquaux commented Nov 30, 2014

It would be useful indeed. So someone needs to volunteer to do it. I
don't believe that anybody is working on it right now, and it is a
significant effort.

It seems that AirBNB has an internal PMML exporter:
http://nerds.airbnb.com/architecting-machine-learning-system-risk/
If anybody has contacts at AirBNB, we could try to retrieve their code.
It would still probably need significant work for inclusion.

Anyhow, such work should be done in an external repo, and polished before
thinking of submitting a PR to scikit-learn.

Hey all, I'm an engineer at Yelp (where we use several scikit-learn models) and am interested in improving the persistence and sharing of scikit-learn models. I would be happy to volunteer to begin this project.

Off topic, but we worked around our initial problem. We wanted to export our models from scikit-learn to Java so we could classify in a Hadoop Pig UDF. Apparently, it's easy to run scikit-learn classification using Hadoop streaming. Here are the details, HTH someone: http://ihadanny.wordpress.com/2014/12/01/python-virtualenv-with-pig-streaming/

Owner

ogrisel commented Dec 14, 2014

@iandewancker please go ahead, start a new github repo and link it here to make interested people aware of it. It would be interesting to write a generic test framework that checks that predictions made by sklearn models are in line with the results of loading a PMML export into jpmml-evaluator for instance.

To automate such tests you might want to use sklearn.utils.testing.all_estimators to iterate over all the model classes available in sklearn. This is a utility we use in the test_common.py test suite of scikit-learn.
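
The comparison step of such a test framework could look roughly like the sketch below. It is only an outline under stated assumptions: `find_prediction_mismatches` is a hypothetical helper, both predictors are plain callables returning one float per row, and the "exported" predictor here is a toy function standing in for whatever would actually evaluate the PMML file (for example a wrapper around jpmml-evaluator, which is not invoked here).

```python
import math


def find_prediction_mismatches(reference_predict, candidate_predict, rows,
                               rel_tol=1e-9, abs_tol=1e-12):
    """Return (row index, expected, actual) for every disagreeing row."""
    mismatches = []
    for index, row in enumerate(rows):
        expected = reference_predict(row)
        actual = candidate_predict(row)
        if not math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((index, expected, actual))
    return mismatches


def reference_model(row):
    # Stand-in for the original model's prediction on one row.
    return 0.5 * row[0] + 2.0 * row[1]


def faithful_export(row):
    # Stand-in for an export that round-trips values through text.
    return float(repr(reference_model(row)))


def lossy_export(row):
    # Stand-in for an export that distorts predictions slightly.
    return reference_model(row) * 1.001


rows = [(1.0, 2.0), (3.0, 4.0)]
print(find_prediction_mismatches(reference_model, faithful_export, rows))
print(find_prediction_mismatches(reference_model, lossy_export, rows))
```

In a real suite the outer loop would iterate over estimator classes and datasets, and the tolerance would need to be chosen per model family.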

Owner

jnothman commented Dec 14, 2014

To automate such tests you might want to use sklearn.utils.testing.all_estimators to iterate over all the model classes available in sklearn.

I think PMML export even coming close to all estimators is very ambitious, particularly given the differences in how scikit-learn and PMML describe data transforms and pipelines. Rather, trees, forests, and other models that can't easily be described by a matrix of coefficients should have priority in building exporters.

lesn-v commented Dec 21, 2014

@iandewancker, have you started working on the project? Please let me know, I'd like to take part!

ngould commented Apr 1, 2015

@iandewancker @jnothman Seconded. Data Scientist at Lucid Design Group here. Currently banging my head debugging our custom persistence for sklearn models. I'd LOVE to contribute to something like this.

Owner

amueller commented Apr 1, 2015

@ngould @lesn-v just start a repo and get going ;)

Hey @lesn-v, @ngould, @iandewancker. I've been contributing to https://github.com/alex-pirozhenko/sklearn-pmml. It has support for DecisionTree, GradientBoosted, RandomForest. Another guy is looking at adding support for SVM etc., if you guys have any interest in contributing. It has automated integration tests with the BSD-licensed version of JPMML.

Owner

jnothman commented Aug 20, 2015

My concern is translating our transformation pipeline model into PMML, if anyone would like to take on that task.

I'm also concerned that a library doing this has a way to check the consistency of the exported model (using a pre-existing PMML predictor) with scikit-learn's predictions. That testing framework is another subproject that would be great to see. (A first step might be a scikit-learn-compatible wrapper for a PMML prediction tool.)

Owner

jnothman commented Aug 20, 2015

But of course it's great to see you making progress on this, @NeverNude. (Let's just hope it's the right choice of technology!)

@jnothman what transformation pipeline model are you referring to specifically? Are you referring to pipeline http://scikit-learn.org/stable/modules/pipeline.html or something else?

Owner

jnothman commented Aug 20, 2015

I'm referring to that Pipeline, but iirc, PMML doesn't have an idea of
"pipeline" so much as a declarative description of how each feature was
compiled. So it depends on supporting the translation of individual
transformers (e.g. scalers, PCA, selectors) to PMML in a feature-wise
manner.
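
As a rough illustration of what a feature-wise translation might look like for a standard scaler, the sketch below emits one PMML-style `DerivedField` per input field, each expressing `(x - mean) / scale` with nested `Apply` elements. The function name, the `scaled(...)` naming scheme and the example numbers are assumptions for illustration, and the fragment has not been validated against the PMML schema.

```python
import xml.etree.ElementTree as ET


def scaler_to_derived_fields(field_names, means, scales):
    """Describe (x - mean) / scale per field as PMML-style DerivedFields."""
    derived_fields = []
    for name, mean, scale in zip(field_names, means, scales):
        derived = ET.Element("DerivedField", name="scaled(%s)" % name,
                             optype="continuous", dataType="double")
        # Outer expression: (...) / scale
        divide = ET.SubElement(derived, "Apply", function="/")
        # Inner expression: x - mean
        subtract = ET.SubElement(divide, "Apply", function="-")
        ET.SubElement(subtract, "FieldRef", field=name)
        ET.SubElement(subtract, "Constant").text = repr(mean)
        ET.SubElement(divide, "Constant").text = repr(scale)
        derived_fields.append(derived)
    return derived_fields


fields = scaler_to_derived_fields(
    ["age", "income"], means=[40.0, 50000.0], scales=[10.0, 20000.0])
for field in fields:
    print(ET.tostring(field, encoding="unicode"))
```

Transformers that mix features, such as PCA, would need one expression per output component referencing every input field, which is where the translation gets more laborious.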

On 21 August 2015 at 04:17, Evan notifications@github.com wrote:

@jnothman https://github.com/jnothman what transformation pipeline
model are you referring to specifically? Are you referring to pipeline
http://scikit-learn.org/stable/modules/pipeline.html or something else?


ngould commented Aug 21, 2015

@NeverNude @jnothman Sweet, thanks for bringing it to my attention. I am still interested in helping out. I'll take a closer look over the weekend.

Owner

amueller commented Aug 24, 2015

@jnothman what do you think about adding this to the related project list? It seems of interest to a number of people.

Owner

jnothman commented Aug 24, 2015

that's definitely the right place for it

On 25 August 2015 at 06:41, Andreas Mueller notifications@github.com
wrote:

@jnothman https://github.com/jnothman what do you think about adding
this to the related project list? It seems of interest to a number of
people.


Owner

amueller commented Aug 24, 2015

See #5151. Should we close this one? PMML is unlikely to go to master. ping @ogrisel

Owner

jnothman commented Aug 25, 2015

I think that the title of this issue is fulfilled as an external contribution, so yes, we should close it, and specific issues for sklearn-pmml should move to that project.

@jnothman jnothman closed this Aug 25, 2015

Owner

jnothman commented Aug 25, 2015

Except that it currently only supports (single-output?) classification, without regression etc.

Contributor

vruusmann commented Nov 10, 2015

I would like to inform all participants about my sklearn2pmml project, which also tackles the export of Scikit-Learn ML workflows to Predictive Model Markup Language (PMML) documents.

The conversion logic is actually provided by the JPMML-SkLearn library. Python objects are transported from Python platform to Java platform in pickle data format. The unpickling is handled by the Pyrolite library.

I'm actually surprised about how easy the translation of Scikit-Learn's Estimator and Transformer classes to PMML concepts has turned out to be. The feature engineering aspect of ML workflows is addressed by pairing each Estimator object with a sklearn_pandas.DataFrameMapper object that provides "recipes" for calculating individual feature values. I like this approach better than Scikit-Learn's standard pipeline.Pipeline approach, because the latter does not permit adding descriptive metadata (e.g. field names!) to the mix. If you're interested, the project's README file contains a worked-out example for multi-class classification, where the data pre-processing is performed using decomposition.PCA.

Also, all converters are covered with tests. So far, the relative error between Scikit-Learn and the JPMML-Evaluator library predictions and predicted probabilities is less than 1 part per billion parts (1e-10).

Owner

amueller commented Dec 11, 2015

@vruusmann we actually added that to the related projects, though the dev doc build is currently stalled: https://github.com/scikit-learn/scikit-learn/blob/master/doc/related_projects.rst
