
ONNX conversion #181

Open
adrinjalali opened this issue Oct 10, 2022 · 5 comments
Labels
persistence Secure persistence feature

Comments

@adrinjalali
Member

We should have convenience methods to convert scikit-learn fitted models to ONNX, and have easy ways to check if a model can be converted to ONNX.

The https://github.com/onnx/sklearn-onnx project, as well as https://github.com/sdpython/mlprodict/, is a good place to start to see how to implement these features.

Once this is in place, we can think of ways to automatically do the conversion, if possible, on the hub side.

@BenjaminBossan
Collaborator

convenience methods to convert scikit-learn fitted models to ONNX, and have easy ways to check if a model can be converted to ONNX

I went through most of the sklearn-onnx docs (which I often found confusing) and have to say this looks more difficult than I initially expected. There seem to be many missing pieces, pitfalls, and model- or pipeline-specific adjustments, many of which are not trivially automated.

As an example, if I understand correctly, the onnx runtime doesn't support sparse matrices yet (https://onnx.ai/sklearn-onnx/auto_tutorial/plot_usparse_xgboost.html#tfidf-and-sparse-matrices), so most of text classification/regression will only work with small data, since converting to dense would be too expensive otherwise. And as soon as users have custom estimators, automatic conversion is basically impossible.

Maybe we can first start with supporting inference of ONNX models but require users to do the conversion themselves?

What would be the main benefits of ONNX anyway? Is it for efficiency or for security?

@adrinjalali
Member Author

We can start with a utility that tries to convert models, accept that it might fail on complex cases, and then improve it over time.

I suspect a good part of that might later end up in the sklearn-onnx lib itself.

I don't think we have to worry much about using ONNX at inference time yet; the value here is more about people not having to load pickles than about us not doing that. We're also not really focusing on the performance of the inference API at the moment, so that's not an issue yet.

People might like it for efficiency or for security, but either way, they like it.

@adrinjalali adrinjalali added the persistence Secure persistence feature label Jan 24, 2023
@omar-araboghli
Contributor

Regardless of whether ONNX is efficient and whether conversion to its format is within the scope of skops, I have had a use case where I had to convert many ML models to an intermediate format that can be used from one Python environment. Specifically, I converted sklearn models to ONNX and torch.Module instances to torch.ScriptModule to make that work.

Now let's imagine the following use-case:

  • I have a REST API in the project my-app that depends on sklearn version x.1.x.
  • I dumped sklearn.KMeans version x.1.x with skops and uploaded it to the hub.
  • I dumped sklearn.KMeans version x.2.x with skops and uploaded it to the hub.
  • my-app tries to download the models from the hub and load them in its environment.

This is a question about my understanding of skops rather than a statement: would my-app break? If yes, what's the benefit of introducing the skops format, apart from the ability to store metadata on the hub to make loading and saving easier?

Also, introducing a model Intermediate Representation (IR) from scratch in skops would mean reinventing the wheel; thus, providing a conversion utility to ONNX, even if it's not very mature yet, seems like a good path to take.

@BenjaminBossan
Collaborator

Would my-app break ? If yes, then what's the benefit of introducing the skops format apart from the ability of storing the metadata on the hub to make loading and saving easier ?

Your app would most likely not break because sklearn rarely makes backwards incompatible changes. However, sklearn doesn't go so far as to guarantee it will never do that, which is why you would generally get a warning if you load an sklearn model using a different sklearn version.

To give a hypothetical example, in version 1.2.0, sklearn could add a new attribute foo_ to KMeans that is set during fitting and which is required for inference. If you load a model from version 1.1.0, the attribute is missing, and thus the model could not be used. There are possibly ways of avoiding this breakage, but they would add a lot of extra maintenance burden. Also, there can be many edge cases when it comes to what is considered backwards-compatibility breaking and what isn't.
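
This failure mode can be simulated with today's sklearn by deleting an existing fitted attribute before calling predict; `cluster_centers_` here merely stands in for the hypothetical missing foo_:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(20, 2)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Simulate a model persisted by an older sklearn that never set a
# fitted attribute the current version relies on for inference.
del kmeans.cluster_centers_

try:
    kmeans.predict(X)
except Exception as exc:
    print(f"inference broke: {type(exc).__name__}")
```

The model object still loads fine; it only breaks once inference actually touches the missing attribute, which is what makes such version skew hard to detect up front.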

Coming to the second part of your question, the main objective of the skops persistence format is to present a secure alternative to pickle for the sklearn ecosystem. If this is not clear from our docs, please let us know and we can clarify.

Regarding the specific comparison to ONNX, the goals are quite different. For skops, we have:

  • Security is the number one concern (runtime performance is not a goal)
  • Stick very close to sklearn, with all the pros and cons
  • Support as much of the sklearn ecosystem as possible
  • Loading a skops object results in the original sklearn object, therefore it's possible to train models loaded with the skops format

For ONNX, we have:

  • Generate a common IR, which abstracts away many implementation details of sklearn and has the potential to run much faster (ONNX is secure too, but that's not the main concern)
  • Therefore, there can be some deviations from the sklearn implementation
  • Many parts of sklearn are not supported yet
  • Only used for inference, the original sklearn object cannot be reconstructed (thus no further training possible)

There are more differences of course; this is just off the top of my head.

@omar-araboghli
Contributor

Many thanks for clarifying the boundaries! Indeed, users can then use both skops and sklearn-onnx based on their specific needs.
