Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Path to adopt 32-bit implementations for {KD, Ball}Tree #26963

Open
Micky774 opened this issue Aug 1, 2023 · 14 comments
Open

Path to adopt 32-bit implementations for {KD, Ball}Tree #26963

Micky774 opened this issue Aug 1, 2023 · 14 comments
Labels

Comments

@Micky774
Copy link
Contributor

Micky774 commented Aug 1, 2023

Motivation

Having and using 32-bit implementations of {KD, Ball}Tree allows for better preservation of dtype, lower memory footprint, and more consistent Cython code.

Strategy

Work has already been started in #25914 to add the code for the new 32-bit implementations. These will not be directly used yet, and the PR instead focuses on the actual creation of the new classes. Consequently, {KD, Ball}Tree are bound to {KD, Ball}Tree64 for consistency and backwards compatibility.

Following this PR, we will need to begin an API deprecation to move users away from constructing trees directly, and instead using a factory method (similar to DistanceMetric.get_metric). Then, later, we can separate {KD, Ball}Tree from {KD, Ball}Tree64 so that we have a singular dispatcher {KD, Ball}Tree and two type-specialized implementations.

The deprecation process involves:

  1. Introducing get_tree(...) to {KD, Ball}Tree64 with the same signature of BinaryTree.__init__, which is in charge of constructing the type-specialized trees directly.
  2. Adding a FutureWarning to BinaryTree.__init__ to begin deprecation, and suppressing the warning using a context manager in {KD, Ball}Tree64.get_tree to enforce it as the "correct" way to construct the trees.
  3. Complete the deprecation by introducing a separate {KD, Ball}Tree into the hierarchy which comes with the same get_tree method and fully remove the FutureWarning from {KD, Ball}Tree{32, 64}.__init__ since they should no longer be directly constructed, as well as getting rid of the {KD, Ball}Tree{32, 64}.get_tree method since it should only exist in {KD, Ball}Tree.

At the end, the expected API is:

from sklearn.neighbors import KDTree

# Data
X = ...

# tree is thus an instance of KDTree{32, 64} based on X.dtype
tree = KDTree.get_tree(X, ...)

cc: @thomasjpfan @jjerphan @OmarManzoor

@Micky774 Micky774 added the API label Aug 1, 2023
@OmarManzoor
Copy link
Contributor

@Micky774 Thanks a lot for creating this issue.

@jjerphan
Copy link
Member

jjerphan commented Aug 1, 2023

Thank you for creating the PR, @Micky774.

I think there are alternatives which would allow not introducing any new API while supporting both implementations independently from the existing API.

I do not have time to write them now, but I should be able to soon.

@jjerphan
Copy link
Member

jjerphan commented Aug 4, 2023

I think that introducing a public {KD,Ball}Tree.get_tree class method would be inconsistent with the current and overall design estimator creation. To me, we must stick to only one interface to create a {KD,Ball}Tree: their constructor.

Would it be possible to have something similar to what exists for DecisionTree? That is:

  • keep the current {KD,Ball}Tree public API unchanged
  • have a {KD,Ball}Tree{32,64} implementation be a private _tree attribute for {KD,Ball}Tree when it is being constructed based on the type of X
  • forward calls on methods of {KD,Ball}Tree to their _tree attribute's

I think this would allow keeping the UX of {KD,Ball}Tree unchanged while supporting float32 data.

Yet, I am not against having a private class method if we need to have typing for {KD,Ball}Tree{32,64} in Cython code without instantiating instances of the public {KD,Ball}Tree classes.

What do you think?

@Micky774
Copy link
Contributor Author

Micky774 commented Aug 4, 2023

I still prefer to avoid the extra indirection of having a _tree attribute if possible.

As an alternative, what about defining {KD, Ball}Tree.__new__() so that it directly produces the type-specialized implementation? This may result in changed models though since it means that tree=KDTree(X_32) now produces a tree where tree.data is of type float32_t rather than float64_t. Each solution we consider would have that downside as well, the only difference is that since this one has a backwards compatible API, it would actually be a model change rather than behavior change over two releases.

e.g.

import numpy as np

X_64 = np.random.rand(20, 20).astype(np.float64)
X_32 = X_64.astype(np.float32)

class Tree:
    def __new__(cls, data):
        if data.dtype==np.float32:
            return object.__new__(Tree32)
        elif data.dtype==np.float64:
            return object.__new__(Tree64)

class Tree32(Tree):
    def __init__(self, data) -> None:
        print("Tree32 is initialized!")
        assert data.dtype == np.float32

class Tree64(Tree):
    def __init__(self, data) -> None:
        print("Tree64 is initialized!")
        assert data.dtype == np.float64

tree64 = Tree(X_64)
# Tree64 is initialized!

tree32 = Tree(X_32)
# Tree32 is initialized!

assert isinstance(tree64, Tree64)
assert isinstance(tree32, Tree32)

@jjerphan
Copy link
Member

jjerphan commented Aug 4, 2023

I still prefer to avoid the extra indirection of having a _tree attribute if possible.

What are the reasons to avoid the extra indirection? If the reason is boiler-plate code, we could define __getattr__ to redirect all calls to the component (this would remove inspectability and notably docstring, which we do not want). If the reason is dispatch cost, I think this would be negligible for most calls, but a cost would be present.

I considered this solution acceptable since some estimators and transformers in scikit-learn already forward calls to components (for instance, this is the case for Ridge{Classifier,}{CV,}).

What about defining {KD, Ball}Tree.__new__() so that it directly produces the type-specialized implementation?

I like the use of __new__, but it returns instances that leak implementations details; as you have shown:

>>> assert isinstance(tree64, Tree64)
>>> assert isinstance(tree32, Tree32)
>>> tree64
<__main__.Tree64 at 0x7fc3b1f54a90>
>>> tree32
<__main__.Tree32 at 0x7fc3b1f54df0>

I think we must prevent public API from leaking such implementation details to users.

I would wait for a few other maintainers to give their points of view before starting an implementation. What do you think?

@Micky774
Copy link
Contributor Author

Micky774 commented Aug 4, 2023

I like the use of new, but it returns instances that leak implementations details; as you have shown: ...
I think we must prevent public API from leaking such implementation details to users.

I don't personally consider this a negative aspect of the solution. What situations does this risk leading to that we would prefer avoiding? I'm open to revising my opinion on this, but as a developer and user, I personally enjoy having this information available as it makes debugging easier and more direct.

I would wait for a few other maintainers to give their points of view before starting an implementation. What do you think?

Agreed. I think this is opinionated enough that I don't want to haphazardly move forwards without considering more perspectives.

@OmarManzoor
Copy link
Contributor

@scikit-learn/core-devs We need a few additional perspectives here before we can continue.

@thomasjpfan
Copy link
Member

thomasjpfan commented Aug 9, 2023

From the discussion, I see two options:

  1. Public KDTree and BallTree and keep the 32bit and 64bit implementations private.
  2. Introduce KDTree.get_tree and BallTree.get_tree and make the 32bit and 64bit implementations public.

In DistanceMetric, we decided to go with option 2 with get_metric, because of performance reasons. {KD, Ball}Tree does not have the same performance constraints so we can consider option 1. I prefer not to have two patterns for handling 32/64 Cython classes, so I am leaning toward option 2 for {KD, Ball}Tree to be consistent with DistanceMetric{32,64}.

Edit: I meant option 2.

@Micky774
Copy link
Contributor Author

Micky774 commented Aug 9, 2023

I prefer not to have two patterns for handling 32/64 Cython classes, so I am leaning toward option 1 for {KD, Ball}Tree to be consistent with DistanceMetric{32,64}.

Do you mean for consistency you prefer option two? Or do you indeed mean option one here?

@thomasjpfan
Copy link
Member

Do you mean for consistency you prefer option two?

Yes, I meant option 2 and I corrected my typo with option 2.

@OmarManzoor
Copy link
Contributor

@jjerphan Should we progress with option 2 and start with introducing the get_tree method while deprecating the current way to create trees?

@jjerphan
Copy link
Member

jjerphan commented Sep 5, 2023

Hi @OmarManzoor,

I do not have time to think of it in details for now, but I do want to block.

You might want to start an implementation, but it might be further discussed (for instance I do not think we want to deprecate the current way we create {KD,Ball}Tree because we need to maintain backward compatibility).

@OmarManzoor
Copy link
Contributor

@jjerphan I think in that case we can wait before the discussion regarding this is finalized. Thanks for mentioning.

@jjerphan
Copy link
Member

jjerphan commented Oct 4, 2023

Thinking again, I do not want to have the decision be pending for an eternity since we might not be able to discuss it in the coming weeks (at least I might not be and in this case I do not want to cookie lick it).

@OmarManzoor: you can implement the proposal of @thomasjpfan I think.

@lorentzenchr lorentzenchr added Needs Decision Requires decision and removed Needs Decision - API labels Mar 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants