FIX Fix ExtraTreeRegressor missing data handling (#30318)

adam2392 merged 5 commits into scikit-learn:main
Conversation
`test_regression_extra_tree_missing_values_toy`
```python
assert all(impurity >= 0), impurity.min()  # MSE should always be positive

# Note: the impurity matches after the first split only on greedy trees
if Tree is DecisionTreeRegressor:
```
With the fix, impurities match even for `ExtraTreeRegressor`, so I removed the `if` clause.
Uhm, this is actually something that @adam2392 and I wondered about in his PR originally. So it might have been this hidden bug.
> support missing-values in the data matrix ``X``. Missing-values are handled by
> randomly moving all of the samples to the left, or right child node as the tree is
> traversed.
> By :user:`Adam Li <adam2392>` and :user:`Loïc Estève <lesteve>`
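The routing described in this changelog entry can be sketched in plain Python: when a split is drawn, one child is chosen at random for all missing values at that node. The `route` helper and the way `missing_go_to_left` is drawn below are illustrative, not the scikit-learn implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def route(x, threshold, missing_go_to_left):
    """Return 'left' or 'right' for one sample at one split node.

    Samples with a missing value in the split feature all go to the
    same child, chosen at random when the split was drawn.
    """
    if np.isnan(x):
        return "left" if missing_go_to_left else "right"
    return "left" if x <= threshold else "right"

# Drawn once per split node, not once per sample.
missing_go_to_left = bool(rng.integers(2))
sides = [route(v, 2.5, missing_go_to_left) for v in [1.0, 4.0, np.nan, np.nan]]
print(sides)  # both NaN samples land in the same child
```

Note that both NaN samples always end up on the same side; only which side they share is random.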
Since the feature has not been released, and assuming we want the fix for 1.6, I used the same content so that towncrier merges them and mentions both PRs. I added the "No Changelog needed" label to make the changelog check green (another instance of #30222).
To be honest I would be fine not adding a changelog as well, let me know what you think ...
I don't mind. Maybe this is a nice implementation to have actually.
Tagging as blocker since we need to backport this one.

Interesting. So is my interpretation correct: there was essentially a hidden bug where the partitioning moved samples left or right, including the slots holding missing values, when it shouldn't have?
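That reading can be illustrated in plain Python (this is a sketch mirroring the names in `sklearn/tree/_partitioner.pyx`, not the Cython code; the "garbage" tail values are made up). Scanning over the full `[start, end)` range compares the uninitialised trailing slots, which shifts the split position:

```python
def partition_samples(feature_values, samples, threshold, start, end,
                      n_missing, include_missing_tail):
    """Hoare-style partition around threshold; returns the split position.

    With the bug the scan runs over the full [start, end) range, so the
    n_missing uninitialised trailing slots take part in the comparison.
    The fix stops at end - n_missing.
    """
    p = start
    partition_end = end if include_missing_tail else end - n_missing
    while p < partition_end:
        if feature_values[p] <= threshold:
            p += 1
        else:
            partition_end -= 1
            feature_values[p], feature_values[partition_end] = (
                feature_values[partition_end], feature_values[p])
            samples[p], samples[partition_end] = (
                samples[partition_end], samples[p])
    return partition_end

# Four real values, plus two trailing slots whose content is leftover
# garbage from a previous feature (the missing samples' slots).
vals = [1.0, 2.0, 3.0, 4.0, -7.0, 0.5]
idx = list(range(6))
buggy = partition_samples(vals[:], idx[:], threshold=2.5, start=0, end=6,
                          n_missing=2, include_missing_tail=True)
fixed = partition_samples(vals[:], idx[:], threshold=2.5, start=0, end=6,
                          n_missing=2, include_missing_tail=False)
print(buggy, fixed)  # the buggy scan counts the garbage values, the fixed one does not
```

With these garbage values the buggy split position is 4 instead of the correct 2, so the node statistics no longer match the samples actually in each child.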
**Summary**

My understanding was that the split position was wrong because it was taking into account uninitialised values at the end of `feature_values`.

**Debugging session**

With this diff removing the fix and adding debug code:

```diff
diff --git a/sklearn/tree/_partitioner.pyx b/sklearn/tree/_partitioner.pyx
index 195b7e2caf..d3dddcbebd 100644
--- a/sklearn/tree/_partitioner.pyx
+++ b/sklearn/tree/_partitioner.pyx
@@ -194,10 +194,17 @@ cdef class DensePartitioner:
         """Partition samples for feature_values at the current_threshold."""
         cdef:
             intp_t p = self.start
-            intp_t partition_end = self.end - self.n_missing
+            intp_t partition_end = self.end  # - self.n_missing
             intp_t[::1] samples = self.samples
             float32_t[::1] feature_values = self.feature_values
+        with gil:
+            print('current_threshold', current_threshold)
+            print('before partition feature_values', np.asarray(feature_values))
+            print('partition_end', partition_end)
+            print('partition_end - self.n_missing', partition_end - self.n_missing)
+
+
         while p < partition_end:
             if feature_values[p] <= current_threshold:
                 p += 1
@@ -209,6 +216,10 @@ cdef class DensePartitioner:
                 )
                 samples[p], samples[partition_end] = samples[partition_end], samples[p]
 
+        with gil:
+            print('after partition feature_values', np.asarray(feature_values))
+            print('split_pos', partition_end)
+
         return partition_end
 
     cdef inline void partition_samples_final(
```

and this Python snippet:

```python
import numpy as np
from sklearn.tree import ExtraTreeRegressor, DecisionTreeRegressor
from sklearn.tree import export_text

for i in range(2):
    print("-" * 80)
    print(f"{i}")
    print("-" * 80)
    X = np.array([1, 2, 3, 4, np.nan, np.nan])
    criterion = "squared_error"
    Tree = ExtraTreeRegressor
    X = X.reshape(-1, 1)
    y = np.arange(6)
    tree = Tree(criterion=criterion, random_state=0, max_depth=2).fit(X, y)
    print(export_text(tree))
    n_node_samples = tree.tree_.n_node_samples
    missing_go_to_left = tree.tree_.missing_go_to_left
    print(f"{n_node_samples=}")
    print(f"{missing_go_to_left=}")
    impurity = tree.tree_.impurity
    print("impurity", impurity)
    assert all(impurity >= 0), impurity  # MSE should always be positive
```

Output:

**Debugging Analysis**

On iteration 0 you can tell that there is …

On iteration 1, the uninitialised values are different and somehow this ends up creating negative impurities and the …

For completeness, here is the output with the fix:
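To see why the `impurity >= 0` assertion catches this, recall that for `squared_error` the node impurity is the variance of `y` in the node, computed from running statistics. The two-sums formula below is an assumption about how such a criterion is typically computed, not a quote of the Cython code: once the recorded node count no longer matches the samples that actually landed in the node, the "variance" can go negative, which a true variance never can.

```python
import numpy as np

def mse_impurity(sum_y, sum_y_sq, n):
    # Variance from running statistics: E[y^2] - E[y]^2
    return sum_y_sq / n - (sum_y / n) ** 2

y = np.arange(6, dtype=float)

# Consistent statistics: the result is exactly the node variance, >= 0.
consistent = mse_impurity(y.sum(), (y ** 2).sum(), len(y))
assert np.isclose(consistent, y.var())

# Inconsistent statistics, e.g. sums accumulated over 6 samples but a
# node count of 2 (as if the split position were wrong): negative result.
inconsistent = mse_impurity(y.sum(), (y ** 2).sum(), 2)
print(consistent, inconsistent)  # inconsistent < 0
```

So a negative entry in `tree_.impurity` is a reliable smoke signal that the partitioning and the recorded statistics disagree.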
Thanks for the fix. This bug was probably the cause of one of the failures I observed when randomizing the tests last summer: #29584 (comment)
Originally started from #30258, but it appears this is not free-threaded specific. This fails consistently on my machine on iteration 1 on `main`. Locally the added test fails for 38 out of the 100 seeds.
I made sure that the test was passing on all the random seeds in the CI on 54f22cf.
Fixes #30258.
Debugged with @glemaitre, cc @adam2392 for his take on the fix.