Path to adopt 32-bit implementations for {KD, Ball}Tree
#26963
Comments
@Micky774 Thanks a lot for creating this issue. |
Thank you for creating the PR, @Micky774. I think there are alternatives which would allow not introducing any new API while supporting both implementations independently from the existing API. I do not have time to write them now, but I should be able to soon. |
I think that introducing a public Would it be possible to have something similar to what exists for
I think this would allow keeping the UX of Yet, I am not against having a private class method if we need to have typing for What do you think? |
I still prefer to avoid the extra indirection of having a As an alternative, what about defining e.g.
import numpy as np

X_64 = np.random.rand(20, 20).astype(np.float64)
X_32 = X_64.astype(np.float32)

class Tree:
    def __new__(cls, data):
        if data.dtype == np.float32:
            return object.__new__(Tree32)
        elif data.dtype == np.float64:
            return object.__new__(Tree64)

class Tree32(Tree):
    def __init__(self, data) -> None:
        print("Tree32 is initialized!")
        assert data.dtype == np.float32

class Tree64(Tree):
    def __init__(self, data) -> None:
        print("Tree64 is initialized!")
        assert data.dtype == np.float64

tree64 = Tree(X_64)
# Tree64 is initialized!
tree32 = Tree(X_32)
# Tree32 is initialized!
assert isinstance(tree64, Tree64)
assert isinstance(tree32, Tree32) |
What are the reasons to avoid the extra indirection? If the reason is boiler-plate code, we could define I considered this solution acceptable since some estimators and transformers in scikit-learn already forward calls to components (for instance, this is the case for
I like the use of
>>> assert isinstance(tree64, Tree64)
>>> assert isinstance(tree32, Tree32)
>>> tree64
<__main__.Tree64 at 0x7fc3b1f54a90>
>>> tree32
<__main__.Tree32 at 0x7fc3b1f54df0>
I think we must prevent public API from leaking such implementation details to users. I would wait for a few other maintainers to give their points of view before starting an implementation. What do you think? |
I don't personally consider this a negative aspect of the solution. What situations does this risk leading to that we would prefer avoiding? I'm open to revising my opinion on this, but as a developer and user, I personally enjoy having this information available as it makes debugging easier and more direct.
Agreed. I think this is opinionated enough that I don't want to haphazardly move forwards without considering more perspectives. |
@scikit-learn/core-devs We need a few additional perspectives here before we can continue. |
From the discussion, I see two options:
In Edit: I meant option 2. |
Do you mean for consistency you prefer option two? Or do you indeed mean option one here? |
Yes, I meant option 2 and I corrected my typo with option 2. |
@jjerphan Should we progress with option 2 and start with introducing the get_tree method while deprecating the current way to create trees? |
Hi @OmarManzoor, I do not have time to think about it in detail for now, but I do not want to block. You might want to start an implementation, but it might be further discussed (for instance I do not think we want to deprecate the current way we create |
@jjerphan I think in that case we can wait until the discussion regarding this is finalized. Thanks for mentioning it. |
Thinking again, I do not want to have the decision be pending for an eternity since we might not be able to discuss it in the coming weeks (at least I might not be and in this case I do not want to cookie lick it). @OmarManzoor: you can implement the proposal of @thomasjpfan I think. |
Motivation
Having and using 32-bit implementations of {KD, Ball}Tree allows for better preservation of dtype, lower memory footprint, and more consistent Cython code.
Strategy
Work has already been started in #25914 to add the code for the new 32-bit implementations. These will not be directly used yet, and the PR instead focuses on the actual creation of the new classes. Consequently, {KD, Ball}Tree are bound to {KD, Ball}Tree64 for consistency and backwards compatibility.
Following this PR, we will need to begin an API deprecation to move users away from constructing trees directly and toward using a factory method instead (similar to DistanceMetric.get_metric). Then, later, we can separate {KD, Ball}Tree from {KD, Ball}Tree64 so that we have a singular dispatcher {KD, Ball}Tree and two type-specialized implementations.
The deprecation process involves:
1. Adding get_tree(...) to {KD, Ball}Tree64 with the same signature as BinaryTree.__init__, which is in charge of constructing the type-specialized trees directly.
2. Adding a FutureWarning to BinaryTree.__init__ to begin deprecation, and suppressing the warning using a context manager in {KD, Ball}Tree64.get_tree to enforce it as the "correct" way to construct the trees.
3. Introducing {KD, Ball}Tree into the hierarchy, which comes with the same get_tree method, and fully removing the FutureWarning from {KD, Ball}Tree{32, 64}.__init__ since they should no longer be directly constructed, as well as getting rid of the {KD, Ball}Tree{32, 64}.get_tree method since it should only exist in {KD, Ball}Tree.
At the end, the expected API is:
cc: @thomasjpfan @jjerphan @OmarManzoor
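For illustration only, the first two deprecation steps described above could look roughly like the following pure-Python sketch. The class name Tree64, the reduced (data, leaf_size=40) signature, and the warning message are hypothetical stand-ins, not the actual Cython BinaryTree code; the only point is to show a FutureWarning raised on direct construction and silenced inside get_tree with a context manager.

```python
import warnings


class Tree64:
    """Hypothetical stand-in for a type-specialized tree (not the real class)."""

    def __init__(self, data, leaf_size=40):
        # Direct construction still works but warns (step 2 of the process).
        warnings.warn(
            "Constructing trees directly is deprecated; use get_tree instead.",
            FutureWarning,
            stacklevel=2,
        )
        self.data = data
        self.leaf_size = leaf_size

    @classmethod
    def get_tree(cls, data, leaf_size=40):
        # Factory with the same signature as __init__ (step 1). The warning
        # raised by __init__ is suppressed only within this context manager.
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", FutureWarning)
            return cls(data, leaf_size=leaf_size)


# Usage: the factory is silent, direct construction emits a FutureWarning.
tree = Tree64.get_tree([[0.0, 1.0], [2.0, 3.0]])
```

In step 3 of the process, the warning and the per-class get_tree would be removed again, leaving a single dispatching get_tree on the parent class.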