Towards registering models with MLJ #20

Open
ablaom opened this issue May 6, 2022 · 19 comments
@ablaom

ablaom commented May 6, 2022

Further to JuliaAI/MLJModels.jl#454, I have been asked to provide feedback on the implementation of the MLJ model interface package for models provided by RAPIDS.jl, with a view to registering the models in MLJ's model registry.

First, it's clear that a lot of effort has been put into understanding and implementing the interface, which is really awesome. I'm pretty optimistic we can close the necessary gaps to get this across the finish line.

I have not had an opportunity for a thorough review but wanted to get the ball rolling with a few comments. Only the first is more serious.

  1. This scitype declaration for a classifier is not appropriate for an MLJ model, as users provide targets for classification in the form of categorical vectors (element scitype Finite, which includes OrderedFactor - for ordered categoricals - and Multiclass - for unordered ones). So, for classification, the target_scitype you typically have is AbstractVector{<:Finite} unless it's binary only, which would be AbstractVector{<:Finite{2}}. Furthermore, one needs to take into account all the classes in the pool of the target, not just those manifest as entries seen in the training vector. This is so that the target you predict includes all classes. (See this discussion). This generally means re-encoding the data using the categorical reference integers for passing to the core model, but also somehow recording the full pool and a method to invert the conversion in the prediction phase. This is discussed in the manual but there are also lots of existing implementations to adapt, for example this DecisionTreeClassifier for probabilistic predictors, and this SVM model for deterministic (point) predictions. A sketch of this pattern is given just after this list.

  2. If your clustering models have a predict, then they should predict categorical vectors. I didn't check this.

  3. The same scitype declaration referenced in 1. also allows a Table type and I wonder if that was intentional. Tabular targets are only used in multitarget models (one column per target). Is this actually supported in the current case?

  4. A lot of model_metadata entries in this pkg indicate that you can use tables and matrices for input but the examples in the doc-strings only use matrices. It's fine if your model indeed can handle both (is that the case?) but generally MLJ users are used to inputs being tabular (wrapping matrices as tables if necessary) so probably the doc-strings should include tabular examples. See the DecisionTreeClassifier docstring for examples.

  5. This is probably an issue you want to address last, but MLJ now has a hard requirement for model docstrings, which must comply with this standard.
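
Here is a minimal sketch of the pattern described in 1. It is not code from this repository: the model name, the hyperparameter, and the `some_core_fit`/`some_core_predict` calls are hypothetical stand-ins for whatever the wrapped cuML routine actually does. The point is the shape of fit/predict: encode the target with the reference integers of its full categorical pool, keep a decoder in the fitresult, and apply it to predictions, together with the target_scitype declaration over Finite.

```julia
import MLJModelInterface as MMI

# Hypothetical point-prediction classifier; `some_core_fit`/`some_core_predict`
# stand in for the wrapped training and prediction routines.
MMI.@mlj_model mutable struct MyClassifier <: MMI.Deterministic
    n_iter::Int = 100::(_ > 0)
end

function MMI.fit(model::MyClassifier, verbosity, X, y)
    Xmatrix = MMI.matrix(X)       # Tables.jl table -> plain matrix
    yint    = MMI.int(y)          # reference integers w.r.t. the *full* pool of y
    decode  = MMI.decoder(y[1])   # callable inverting `MMI.int`; remembers the pool
    core    = some_core_fit(Xmatrix, yint; n_iter=model.n_iter)  # hypothetical
    fitresult = (core, decode)
    return fitresult, nothing, NamedTuple()
end

function MMI.predict(::MyClassifier, fitresult, Xnew)
    core, decode = fitresult
    yint = some_core_predict(core, MMI.matrix(Xnew))  # hypothetical
    return decode.(yint)          # categorical vector carrying the original pool
end

MMI.metadata_model(MyClassifier,
    input_scitype  = MMI.Table(MMI.Continuous),
    target_scitype = AbstractVector{<:MMI.Finite})
```

Because `decode` remembers the full pool of `y`, the returned predictions can represent every class the target could take, not just those observed in the training rows.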

At this early stage, you may just want to comment on these. If you do want to work on new commits to address them, can you please put them in an unmerged PR, so I can more easily comment on any changes?

Happy to take a call if you think that would be efficient at this early stage of the review.

@tylerjthomas9
Owner

Thank you for taking the time to review the repo. I will make all the changes to bring it to the MLJ model standard. I thought that the scitypes were incorrect, so thank you for the information on that too.

@ablaom
Author

ablaom commented May 18, 2022

You might find this new tool useful. You can use it to test your interfaces. Eventually you might want to include it in CI but for now it remains experimental.

@ablaom
Author

ablaom commented May 18, 2022

You should use the dev branch.

@tylerjthomas9
Owner

Thank you for sharing. This looks like a great addition.

@tylerjthomas9
Owner

@ablaom Thank you again for reviewing the library. Here is a PR with some of the improvements that you suggested: #25

  1. This scitype declaration for a classifier is not appropriate for an MLJ model, as users provide targets for classification in the form of categorical vectors (element scitype Finite, which includes OrderedFactor - for ordered categoricals - and Multiclass - for unordered ones). So, for classification, the target_scitype you typically have is AbstractVector{<:Finite} unless it's binary only, which would be AbstractVector{<:Finite{2}}. Furthermore, one needs to take into account all the classes in the pool of the target, not just those manifest as entries seen in the training vector. This is so that the target you predict includes all classes. (See this discussion). This generally means re-encoding the data using the categorical reference integers for passing to the core model, but also somehow recording the full pool and a method to invert the conversion in the prediction phase. This is discussed in the manual but there are also lots of existing implementations to adapt, for example this DecisionTreeClassifier for probabilistic predictors, and this SVM model for deterministic (point) predictions.

Unless I am missing something, all the classification methods should be able to handle multi-class classification.

I am a little confused about implementing the encoding of categorical variables. Because we pass the y vector to Python, the model will be able to predict any class that is present in the training data.
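
For what it's worth, here is a tiny sketch, independent of this repo, of the distinction the review is drawing: the classes a model should be able to predict come from the pool of the categorical target, which can be larger than the set of values observed in a particular training vector (after filtering or resampling, for example).

```julia
using CategoricalArrays

# The pool has three classes, but only two appear in this particular vector,
# e.g. because of train/test resampling:
y = categorical(["versicolor", "setosa", "versicolor"],
                levels=["setosa", "versicolor", "virginica"])

unique(y)   # only the 2 observed classes
levels(y)   # all 3 classes in the pool: "setosa", "versicolor", "virginica"
```

Forwarding only the observed values to Python preserves what `unique(y)` knows about, so the pool (and the ordering of `levels`) has to be carried separately, as in the fit/predict sketch earlier in the thread.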

  2. If your clustering models have a predict, then they should predict categorical vectors. I didn't check this.

This should now be the case. I added a cast to MMI.categorical, because we get a plain array back from Python.
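
As a hedged aside, that cast amounts to wrapping the raw label array; as far as I understand, MMI.categorical delegates to CategoricalArrays.categorical once the full MLJ interface is loaded. The label vector below is just a made-up stand-in for the array coming back from Python:

```julia
using CategoricalArrays

raw_labels = [0, 0, 1, 2, 1]        # stand-in for cluster labels returned from Python
labels = categorical(raw_labels)    # CategoricalVector, as MLJ expects from predict
```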

  3. The same scitype declaration referenced in 1. also allows a Table type and I wonder if that was intentional. Tabular targets are only used in multitarget models (one column per target). Is this actually supported in the current case?

This was incorrect. I have fixed the target scitype.

  4. A lot of model_metadata entries in this pkg indicate that you can use tables and matrices for input but the examples in the doc-strings only use matrices. It's fine if your model indeed can handle both (is that the case?) but generally MLJ users are used to inputs being tabular (wrapping matrices as tables if necessary) so probably the doc-strings should include tabular examples. See the DecisionTreeClassifier docstring for examples.

The machines should work with both table and matrix inputs for the features. I will add some examples of this.
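
A repo-independent sketch of one way to normalise both input forms to the matrix the Python side expects (the `to_matrix` helper name is made up):

```julia
using Tables

# Accept either a Tables.jl table or a plain matrix for the features:
to_matrix(X) = X isa AbstractMatrix ? X : Tables.matrix(X)

X_table = (x1 = rand(5), x2 = rand(5))   # a NamedTuple of vectors is a table
X_mat   = rand(5, 2)

to_matrix(X_table)   # 5×2 Matrix{Float64}
to_matrix(X_mat)     # returned unchanged
```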

  5. This is probably an issue you want to address last, but MLJ now has a hard requirement for model docstrings, which must comply with this standard.

I updated the docstrings for the classifiers (LogisticRegression,...). Let me know what you think.

A side note: if you do try to install it, there is currently a problem with micromamba installing RAPIDS. The workaround is to use mamba as the solver, by setting the following environment variable: JULIA_CONDAPKG_EXE=mamba.

JuliaPy/CondaPkg.jl#38
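
For anyone hitting this, the workaround amounts to setting that variable before the Conda environment is resolved, e.g. from within Julia (a shell `export JULIA_CONDAPKG_EXE=mamba` before launching Julia works equally well):

```julia
# Must be set before CondaPkg resolves the environment for the first time:
ENV["JULIA_CONDAPKG_EXE"] = "mamba"

using RAPIDS
```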

@ablaom
Author

ablaom commented Jul 7, 2022

@tylerjthomas9 Thanks indeed for this update. Currently swamped with reviews and JuliaCon prep, but I will return to this in due course.

@ablaom
Author

ablaom commented Jul 25, 2022

Having trouble getting using RAPIDS to work. I'm using mamba as suggested above.

Any ideas?

WARNING: Method definition load_path(Type{var"#s446"} where var"#s446"<:RAPIDS.HDBSCAN) in module RAPIDS at /home/ubuntu/Julia/RAPIDS/src/cuml/clustering.jl:135 overwritten at /home/ubuntu/Julia/RAPIDS/src/cuml/dimensionality_reduction.jl:227.
  ** incremental compilation may be fatally broken for this module **

ERROR: InitError: Python: ImportError: /opt/julia-1.7.3/bin/../lib/julia/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/ubuntu/for_extra_software/julia_depot/environments/rapids/.CondaPkg/env/lib/python3.9/site-packages/pyarrow/../../../libarrow.so.700)
Python stacktrace:
 [1] <module>
   @ ~/for_extra_software/julia_depot/environments/rapids/.CondaPkg/env/lib/python3.9/site-packages/pyarrow/__init__.py:65
 [2] <module>
   @ ~/for_extra_software/julia_depot/environments/rapids/.CondaPkg/env/lib/python3.9/site-packages/cudf/core/dtypes.py:10
 [3] <module>
   @ ~/for_extra_software/julia_depot/environments/rapids/.CondaPkg/env/lib/python3.9/site-packages/cudf/api/types.py:18
 [4] <module>
   @ ~/for_extra_software/julia_depot/environments/rapids/.CondaPkg/env/lib/python3.9/site-packages/cudf/api/__init__.py:3
 [5] <module>
   @ ~/for_extra_software/julia_depot/environments/rapids/.CondaPkg/env/lib/python3.9/site-packages/cudf/__init__.py:12
Stacktrace:
 [1] pythrow()
   @ PythonCall ~/for_extra_software/julia_depot/packages/PythonCall/XgP8G/src/err.jl:94
 [2] errcheck
   @ ~/for_extra_software/julia_depot/packages/PythonCall/XgP8G/src/err.jl:10 [inlined]
 [3] pyimport(m::String)
   @ PythonCall ~/for_extra_software/julia_depot/packages/PythonCall/XgP8G/src/concrete/import.jl:11
 [4] __init__()
   @ RAPIDS ~/Julia/RAPIDS/src/RAPIDS.jl:37
 [5] _include_from_serialized(path::String, depmods::Vector{Any})
   @ Base ./loading.jl:768
 [6] _require_from_serialized(path::String)
   @ Base ./loading.jl:821
 [7] _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1130
 [8] require(uuidkey::Base.PkgId)
   @ Base ./loading.jl:1013
 [9] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:997
during initialization of module RAPIDS

@ablaom
Author

ablaom commented Jul 25, 2022

I'm using the version on main branch.

@tylerjthomas9
Owner

I'm using the version on main branch.

I am trying to narrow down this issue. I can build it without issues on one of my computers, but I get the same error as you on a different computer. I'll let you know as soon as I can get the builds working everywhere again. It seems that something broke right after I published v0.1.0, and building the environment has become less consistent.

@ablaom
Author

ablaom commented Aug 2, 2022

Yeah, I can appreciate this kind of multi-language GPU installation is very tricky. I will wait for you to sort this out before proceeding with a review.

@tylerjthomas9
Owner

The nightly builds of Julia should contain the GCC updates required for the most recent versions of RAPIDS (JuliaLang/julia#45582 (comment)). Additionally, the micromamba build bug has been fixed. I am going to start development on this library again now that these roadblocks are clearing up. I will let you know when it's ready for review.

@tylerjthomas9
Owner

@ablaom I think that the package is in a good place for review. You will need to use a nightly Julia build to run it. Let me know if you have any questions.

@ablaom
Author

ablaom commented Nov 9, 2022

Awesome. Will 1.8.3 (currently release-1.8) suffice? If so I might wait for that.

@tylerjthomas9
Owner

Awesome. Will 1.8.3 (currently release-1.8) suffice? If so I might wait for that.

I think the backport is scheduled to come with 1.8.4. We can circle back when that comes out.

@ablaom
Author

ablaom commented Nov 10, 2022

Yes. I know this is frustrating, but I prefer to prioritise reviewing interfaces that will work with current Julia. Hang in there.

@tylerjthomas9
Owner

Yes. I know this is frustrating, but I prefer to prioritise reviewing interfaces that will work with current Julia. Hang in there.

Sounds good. I'll ping you when it's ready.

@tylerjthomas9
Owner

Hey @ablaom, I put a little time into getting this library to a similar level as CatBoost.jl. I think that it is in a good spot for a review when you have the capacity. There is absolutely no rush.

I want to eventually move away from Python and wrap the C++/CUDA code directly, but I won't have the capacity to tackle this transition for a while. Because of this eventual transition at some point in the future, I think that this would be a good addition to the third-party MLJ models.

@ablaom
Author

ablaom commented May 18, 2023

Okay, I'm having trouble because of this error:

"ERROR: LoadError: InitError: IOError: write: no space left on device (ENOSPC)"

I've checked my disk and there is enough space. I understand this may be tricky to diagnose. Two facts lead me to kick the can on this one: my old GPU is to be replaced, and after next week I am between contracts and I'm going to lose access. So I'll try again when the dust settles. Thanks for your patience.

@tylerjthomas9
Owner

I have never seen that error before. If it is a pre-Pascal GPU, then it might not work. There is absolutely no rush.
