FEA Add missing-value support for ExtaTreeClassifier and ExtaTreeRegressor#27966
Merged
Conversation
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
adam2392
commented
Dec 18, 2023
Signed-off-by: Adam Li <adam2392@gmail.com>
… into extratreenan
Signed-off-by: Adam Li <adam2392@gmail.com>
… into extratreenan
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
… into extratreenan
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
OmarManzoor
reviewed
Jul 8, 2024
Contributor
OmarManzoor
left a comment
There was a problem hiding this comment.
Minor comments otherwise this looks good now
Signed-off-by: Adam Li <adam2392@gmail.com>
Member
Author
|
Perhaps @glemaitre and @thomasjpfan can take one last look then? I think this has changed a bit since @glemaitre approved changes, and I believe I've addressed most of the core issues raised. Thanks everyone for reviewing! |
thomasjpfan
approved these changes
Jul 8, 2024
OmarManzoor
approved these changes
Jul 9, 2024
Contributor
OmarManzoor
left a comment
There was a problem hiding this comment.
LGTM! Thanks a lot @adam2392.
Member
|
Sorry for the delay, I was a bit under the water. LGTM and I see the auto-merge was activated. |
glemaitre
approved these changes
Jul 9, 2024
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reference Issues/PRs
Towards: #27931
Follow-up to: #26391 and #23595
What does this implement/fix? Explain your changes.
RandomSplitterin _splitter.pyxDecisionTree*andExtraTree*more numerically robust by increasing the tolerance for a GLOBAL_RANDOM_SEED check, and using cross-validation scores rather than a single scoreAny other comments?
Compared to BestSplitter, there can be an expected cost to doing splits on missing-values, as we can either:
The push of missing values down the tree can be done randomly (i.e. first option), OR the second option can actually be evaluated. There is a computational cost to doing so, but more importantly there is an interpretation tradeoff. The tradeoff imo comes from the assumption of the missing-values:
However, I think the difference at a tree level is not super important. E.g. in the ExtraTree forest, #28268 demonstrates that the ExtraTrees when combined as a forest are resilient and predictive for missing-values.
Benchmarks demonstrating no significant runtime performance degradation
There is some complexity involved in checking if there are missing values. However, this only occurs at the Python level as shown by the following benchmark. In terms of the Cython code, there is no regression.
Benchmarks with and without Python Check
Also ran this benchmark for ExtraTrees, which demonstrates that this check is negligible at the forest level, since it only occurs once. See #28268, which has the short code to enable it for ExtraTrees.
Benchmarks on ExtraTrees