[MRG+2] Tree MAE fix to ensure sample_weights are used during impurity calculation #11464
@@ -18,6 +18,7 @@
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error

from sklearn.utils.testing import assert_allclose
from sklearn.utils.testing import assert_array_equal
from sklearn.utils.testing import assert_array_almost_equal
from sklearn.utils.testing import assert_almost_equal
@@ -1693,19 +1694,101 @@ def test_no_sparse_y_support(name):


def test_mae():
    # check MAE criterion produces correct results
    # on small toy dataset
    """Check MAE criterion produces correct results on small toy dataset:

    ------------------
    | X | y | weight |
    ------------------
    | 3 | 3 |  0.1   |
    | 5 | 3 |  0.3   |
    | 8 | 4 |  1.0   |
    | 3 | 6 |  0.6   |
    | 5 | 7 |  0.3   |
    ------------------
    |sum wt:|  2.3   |
    ------------------

    Because we are dealing with sample weights, we cannot find the median by
    simply choosing/averaging the centre value(s). Instead, we take the
    median to be the y value at which 50% of the cumulative weight is reached
    (in a y-sorted data set). For this test data, the cumulative weight
    first reaches >= 50% at y = 4. Therefore:
        Median = 4

    For all the samples, we can get the total error by summing:
        Absolute(Median - y) * weight

    I.e., total error = (Absolute(4 - 3) * 0.1)
                      + (Absolute(4 - 3) * 0.3)
                      + (Absolute(4 - 4) * 1.0)
                      + (Absolute(4 - 6) * 0.6)
                      + (Absolute(4 - 7) * 0.3)
                      = 2.5

    Impurity = Total error / total weight
             = 2.5 / 2.3
             = 1.08695652173913
    ------------------

    From this root node, the next best split is between X values of 3 and 5.
    Thus, we have left and right child nodes:

    LEFT                    RIGHT
    ------------------      ------------------
    | X | y | weight |      | X | y | weight |
    ------------------      ------------------
    | 3 | 3 |  0.1   |      | 5 | 3 |  0.3   |
    | 3 | 6 |  0.6   |      | 8 | 4 |  1.0   |
    ------------------      | 5 | 7 |  0.3   |
    |sum wt:|  0.7   |      ------------------
    ------------------      |sum wt:|  1.6   |
                            ------------------

    Impurity is found in the same way:
    Left node Median = 6
    Total error = (Absolute(6 - 3) * 0.1)
                + (Absolute(6 - 6) * 0.6)
                = 0.3

    Left Impurity = Total error / total weight
            = 0.3 / 0.7
            = 0.428571428571429
    -------------------

    Likewise for Right node:
    Right node Median = 4
    Total error = (Absolute(4 - 3) * 0.3)
                + (Absolute(4 - 4) * 1.0)
                + (Absolute(4 - 7) * 0.3)
                = 1.2

    Right Impurity = Total error / total weight
            = 1.2 / 1.6
            = 0.75
    ------"""

    dt_mae = DecisionTreeRegressor(random_state=0, criterion="mae",
                                   max_leaf_nodes=2)
    dt_mae.fit([[3], [5], [3], [8], [5]], [6, 7, 3, 4, 3])
    assert_array_equal(dt_mae.tree_.impurity, [1.4, 1.5, 4.0/3.0])
    assert_array_equal(dt_mae.tree_.value.flat, [4, 4.5, 4.0])

    dt_mae.fit([[3], [5], [3], [8], [5]], [6, 7, 3, 4, 3],
               [0.6, 0.3, 0.1, 1.0, 0.3])
    assert_array_equal(dt_mae.tree_.impurity, [7.0/2.3, 3.0/0.7, 4.0/1.6])
    # Test MAE where sample weights are non-uniform (as illustrated above):
    dt_mae.fit(X=[[3], [5], [3], [8], [5]], y=[6, 7, 3, 4, 3],
               sample_weight=[0.6, 0.3, 0.1, 1.0, 0.3])
    assert_allclose(dt_mae.tree_.impurity, [2.5 / 2.3, 0.3 / 0.7, 1.2 / 1.6])
    assert_array_equal(dt_mae.tree_.value.flat, [4.0, 6.0, 4.0])

    # Test MAE where all sample weights are uniform:
    dt_mae.fit(X=[[3], [5], [3], [8], [5]], y=[6, 7, 3, 4, 3],
               sample_weight=np.ones(5))
    assert_array_equal(dt_mae.tree_.impurity, [1.4, 1.5, 4.0 / 3.0])
    assert_array_equal(dt_mae.tree_.value.flat, [4, 4.5, 4.0])
How do we know that these are the right values? It would be useful to explain it in a comment.
I explained in my 2nd post above... Perhaps I could put a link to this pull in the comments? Otherwise I can try to summarise?
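One way to check the expected values independently of the tree code is to recompute them by hand in plain Python. The following is only a minimal sketch: `weighted_median`, `weighted_mae` and `node_impurities` are hypothetical helpers written for this illustration, not scikit-learn functions. Two assumptions are made: the split threshold is taken as 4 (any value between X = 3 and X = 5 gives the same partition), and when the cumulative weight lands exactly on half the total, the two neighbouring y values are averaged. That tie rule is chosen because it reproduces the uniform-weight values asserted in the test (for example 4.5 for the left node); it may not match the library's internal implementation in every case.

```python
import math


def weighted_median(y, w):
    """Return the y value at which half of the total weight is reached.

    Assumption (not taken from the PR): if the cumulative weight lands
    exactly on half the total, average this value with the next one.
    """
    pairs = sorted(zip(y, w))
    half = sum(w) / 2.0
    cum = 0.0
    for i, (yi, wi) in enumerate(pairs):
        cum += wi
        if math.isclose(cum, half) and i + 1 < len(pairs):
            return (yi + pairs[i + 1][0]) / 2.0
        if cum > half:
            return yi
    return pairs[-1][0]


def weighted_mae(y, w):
    """Weighted mean absolute error around the weighted median."""
    m = weighted_median(y, w)
    return sum(abs(yi - m) * wi for yi, wi in zip(y, w)) / sum(w)


def node_impurities(X, y, w, threshold=4):
    """Impurity of the root, left child (X <= threshold) and right child."""
    root = list(zip(y, w))
    left = [(yi, wi) for xi, yi, wi in zip(X, y, w) if xi <= threshold]
    right = [(yi, wi) for xi, yi, wi in zip(X, y, w) if xi > threshold]
    return [weighted_mae(*zip(*node)) for node in (root, left, right)]


# Toy data from test_mae.
X = [3, 5, 3, 8, 5]
y = [6, 7, 3, 4, 3]
w = [0.6, 0.3, 0.1, 1.0, 0.3]

print(node_impurities(X, y, w))          # non-uniform weights
print(node_impurities(X, y, [1.0] * 5))  # uniform weights
```

The first call should agree with `[2.5 / 2.3, 0.3 / 0.7, 1.2 / 1.6]` and the second with `[1.4, 1.5, 4.0 / 3.0]`, up to floating-point rounding.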

    # Test MAE where a `sample_weight` is not explicitly provided.
    # This is equivalent to providing uniform sample weights, though
    # the internal logic is different:
    dt_mae.fit(X=[[3], [5], [3], [8], [5]], y=[6, 7, 3, 4, 3])
    assert_array_equal(dt_mae.tree_.impurity, [1.4, 1.5, 4.0 / 3.0])
    assert_array_equal(dt_mae.tree_.value.flat, [4, 4.5, 4.0])


def test_criterion_copy():
    # Let's check whether copy of our criterion has the same type
You can remove the two lines below then
Good call 👍