Skip to content

FIX Fix inconsistent naming convention for algorithm selection of HDBSCAN #26744

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

Shreesha3112
Copy link
Contributor

@Shreesha3112 Shreesha3112 commented Jul 1, 2023

Reference Issues/PRs

Fixes #26732

What does this implement/fix? Explain your changes.

  • Fixes inconsistent naming convention for algorithm selection in HDBSCAN for "kd_tree" and "ball_tree"
  • Maintains backward compatibility for "kdtree" and "balltree" and will be removed in 1.6?

current naming conventions

Other estimators(K-NN, DBSCAN, etc.) : algorithms : {"kd_tree", "bal_tree"}

HDBSCAN: algorithm : {"kdtree", "balltree"}

updated naming convention for HDBSCAN

algorithm : {"kd_tree", "ball_tree"}

@github-actions
Copy link

github-actions bot commented Jul 1, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ddd14f7. Link to the linter CI: here

mst_func = _hdbscan_prims
kwargs["algo"] = "kd_tree"
kwargs["leaf_size"] = self.leaf_size
elif self.algorithm == "balltree":
elif self.algorithm == "ball_tree":
Copy link
Contributor Author

@Shreesha3112 Shreesha3112 Jul 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patch code coverage failed with partial hit for this line 800 due to missing else block at elif == "ball_tree". Probably not added since parameter validation is completed by validate_parameter_constraints.

@Shreesha3112 Shreesha3112 marked this pull request as ready for review July 3, 2023 06:56
@ogrisel
Copy link
Member

ogrisel commented Jul 3, 2023

Thanks for the PR!

Since scikit-learn 1.3.0 has been released with those names, our policy would mandate a deprecation cycle for the inconsistent names:

Though since HDBSCAN was just introduced in that release we could maybe do an exception here? Personally I feel we should follow our backward compat policy.

Any opinion @glemaitre and others?

@glemaitre
Copy link
Member

I should have pushed for the change before the release :)

When speaking about this issue with @jeremiedbb, we thought to put it as a fix and include it in the next bug fixes release for which HDSCAN would have existed for about 1 month and where there is little chance to break production code.

However, I don't want that we introduce bad practices.

@Shreesha3112
Copy link
Contributor Author

I can make sure the kd_tree and ball_tree in algorithm parameter value are backward compatible if required. I have found a similar use case that I can use as a reference, which is for depreciating None value for penalty parameter in LogisticRegression which was made in release 1.2

@Micky774
Copy link
Contributor

Micky774 commented Jul 4, 2023

So, for backstory purposes, the reason that the algorithm names here are inconsistent is because originally they were going to also encode the MST algorithm in their names when we had both prims and boruvka, so something of the sort {prims, boruvka}_{kd, ball}tree just for the sake of readability.

Still, I agree that after trimming boruvka (even if only temporarily) there's no reason for this inconsistency. It totally slipped past me🤦

I'd personally be okay considering it a backport fix, since most likely very few people are working with this new scikit-learn implementation of HDBSCAN in any serious or critical capacity. I think I prefer it, only because there's a lot of work to be done on HDBSCAN in the next few releases, and adding that deprecation is going to make following work potentially more complicated (e.g. if we reintroduce boruvka and decide to encode it in the very same keyword).

@thomasjpfan
Copy link
Member

However, I don't want that we introduce bad practices.

Although this estimator is new, I prefer deprecating the parameter. The default is "auto", so it should not impact many users.

Comment on lines 25 to 35
# TODO(1.6): Remove
filterwarnings_kdtree = pytest.mark.filterwarnings(
"ignore:`algorithm='kdtree'`has been deprecated in 1.4 and will be renamed"
" to'kd_tree'`in 1.6. To keep the past behaviour, set `algorithm=kd_tree`."
)
# TODO(1.6): Remove
filterwarnings_balltree = pytest.mark.filterwarnings(
"ignore:`algorithm='balltree'`has been deprecated in 1.4 and will be renamed"
" to'ball_tree'`in 1.6. To keep the past behaviour, set `algorithm=ball_tree`."
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than filterwarnings everywhere, we can go ahead and use the new algorithm names. We need only test that the deprecation works as expected in one place and move on to future behavior everywhere else.

Copy link
Contributor

@Micky774 Micky774 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, failing CI is unrelated. Thanks!

@Micky774 Micky774 added the Waiting for Second Reviewer First reviewer is done, need a second one! label Jul 27, 2023
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thomasjpfan thomasjpfan changed the title Fix inconsistent naming convention for algorithm selection of HDBSCAN FIX Fix inconsistent naming convention for algorithm selection of HDBSCAN Aug 1, 2023
@thomasjpfan thomasjpfan enabled auto-merge (squash) August 1, 2023 16:31
@thomasjpfan thomasjpfan merged commit fdd3941 into scikit-learn:main Aug 1, 2023
@Shreesha3112 Shreesha3112 deleted the fix/HDBSCAN_inconsistent_kdtree_balltree branch August 1, 2023 17:35
@Micky774
Copy link
Contributor

Micky774 commented Aug 1, 2023

Thank you for work @Shreesha3112 😄

9Y5 pushed a commit to 9Y5/scikit-learn that referenced this pull request Aug 2, 2023
…SCAN (scikit-learn#26744)

Co-authored-by: shreesha3112 <shreesha3112.com>
Co-authored-by: Meekail Zain <34613774+Micky774@users.noreply.github.com>
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
…SCAN (scikit-learn#26744)

Co-authored-by: shreesha3112 <shreesha3112.com>
Co-authored-by: Meekail Zain <34613774+Micky774@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:cluster Waiting for Second Reviewer First reviewer is done, need a second one!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

In HDBSCAN, algorithm="kdtree" is not consistent with other estimator
5 participants