
[MRG+2] Tree MAE fix to ensure sample_weights are used during impurity calculation #11464

Merged
merged 27 commits into from Jul 17, 2018

Conversation

4 participants
@JohnStott
Contributor

JohnStott commented Jul 10, 2018

Tree MAE is not considering sample_weights when calculating impurity!

In the proposed fix, you will see I have multiplied by the sample weight after applying the absolute value to the difference (not before). This is in line with the consensus in the discussion of negative sample weights in #3774, and also consistent with initialisation, where self.weighted_n_node_samples (the divisor in the impurity calculation) is a plain summation of the sample weights with no absolute value applied.
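To make that ordering concrete, here is a minimal plain-Python sketch of the weighted impurity (the function name is mine; the real implementation is Cython in `sklearn/tree/_criterion.pyx`):

```python
def weighted_mae_impurity(y, weights, median):
    """Take abs(y - median) FIRST, then multiply by the (possibly
    negative) sample weight; divide by the plain sum of weights."""
    total_error = sum(abs(yi - median) * wi for yi, wi in zip(y, weights))
    return total_error / sum(weights)
```

With the toy data discussed below (y = [3, 3, 4, 6, 7], weights [0.1, 0.3, 1.0, 0.6, 0.3], weighted median 4) this gives 2.5 / 2.3 rather than the unweighted 7 / 2.3.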

Reference Issues/PRs

Fixes: #11460 (Tree MAE is not considering sample_weights when calculating impurity!)

JohnStott added some commits Jul 10, 2018

Merge pull request #1 from JohnStott/mae_sample_wts
Tree MAE fix to ensure sample_weights are used during impurity calculation

@jnothman jnothman added the Bug label Jul 10, 2018

@jnothman jnothman added this to the 0.20 milestone Jul 10, 2018

@jnothman

Member

jnothman commented Jul 10, 2018

I think this looks good, but I am no expert on the calculation of impurities. Could you give some calculation to show why the values in the test are those we should expect?

Please add an entry to the change log under Bug Fixes at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

@JohnStott

Contributor

JohnStott commented Jul 11, 2018

Thanks @jnothman ,

I will make those changes as soon as possible. For now, here is an example; I hope it is sufficient...

It is perhaps important to note that, because we are dealing with sample_weights, we cannot find the median by simply choosing/averaging the centre value(s). Instead, the median is the point where 50% of the cumulative weight is reached in a sorted data set:

Data sorted by y:

| X | y | wt  | cuml_wt | note |
|---|---|-----|---------|------|
| 3 | 3 | 0.1 | 0.1     |      |
| 5 | 3 | 0.3 | 0.4     |      |
| 8 | 4 | 1.0 | 1.4     | this is the median (1.4 >= 1.15) |
| 3 | 6 | 0.6 | 2.0     |      |
| 5 | 7 | 0.3 | 2.3     |      |

Total wt: 2.3
50% of total wt: 1.15

From the above we can see that the sample with a y of 4 is the median, because it is the first sample whose cumulative weight (1.4) reaches 50% of the total weight (1.15).
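That rule can be sketched in a few lines of plain Python (a hypothetical helper, not the library code, which maintains the weighted median incrementally in Cython):

```python
def weighted_median(y, weights):
    """Return the first y (in sorted order) whose cumulative weight
    reaches 50% of the total weight."""
    half = 0.5 * sum(weights)
    cumulative = 0.0
    for yi, wi in sorted(zip(y, weights)):
        cumulative += wi
        if cumulative >= half:
            return yi
```

`weighted_median([6, 7, 3, 4, 3], [0.6, 0.3, 0.1, 1.0, 0.3])` returns 4, matching the table above.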

Using the y value of 4 as our median we can now calculate the absolute error:

| y | y - median | Abs(y - median) |
|---|------------|-----------------|
| 3 | -1         | 1               |
| 3 | -1         | 1               |
| 4 | 0          | 0               |
| 6 | 2          | 2               |
| 7 | 3          | 3               |

Total error: 7

This was the original calculation (before this fix), but it does not consider the sample weights: even though the sample with X = 3, y = 3 has a weight of only 0.1, it is currently given an effective importance of 1, which is wrong (imo). I therefore propose multiplying by the sample weight to get the weighted error:

| y | Abs(median - y) | wt  | Abs(median - y) * wt |
|---|-----------------|-----|----------------------|
| 3 | 1               | 0.1 | 0.1                  |
| 3 | 1               | 0.3 | 0.3                  |
| 4 | 0               | 1.0 | 0                    |
| 6 | 2               | 0.6 | 1.2                  |
| 7 | 3               | 0.3 | 0.9                  |

Total error: 2.5

To calculate the impurity we divide the Total Error by the total wt (i.e., 2.5 / 2.3):

Impurity: 1.08695652173913
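The arithmetic above can be checked with a few lines of plain Python (a sketch of the calculation only, not the tree code):

```python
# Root node of the toy example: targets, weights, and the weighted median
y = [3, 3, 4, 6, 7]
w = [0.1, 0.3, 1.0, 0.6, 0.3]
median = 4

total_error = sum(abs(yi - median) * wi for yi, wi in zip(y, w))
impurity = total_error / sum(w)
print(total_error, impurity)  # approximately 2.5 and 1.0869565...
```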

The DecisionTreeRegressor then does its next cycle: it goes through every value of X to find the optimal split. I won't go through each candidate, just the optimal one. It finds that the optimal split is between the X values of 3 and 5, giving two new nodes as follows:

Left node:

| X | y | wt  | cuml_wt | note |
|---|---|-----|---------|------|
| 3 | 3 | 0.1 | 0.1     |      |
| 3 | 6 | 0.6 | 0.7     | this is the median (0.7 >= 0.35) |

Total wt: 0.7
50% of total wt: 0.35

Thus, with a median of 6:

| y | Abs(median - y) | wt  | Abs(median - y) * wt |
|---|-----------------|-----|----------------------|
| 3 | 3               | 0.1 | 0.3                  |
| 6 | 0               | 0.6 | 0                    |

Total error: 0.3

To calculate the impurity we divide the Total Error by the total wt (i.e., 0.3 / 0.7):

Left Node Impurity: 0.428571428571429

Right node:

| X | y | wt  | cuml_wt | note |
|---|---|-----|---------|------|
| 5 | 3 | 0.3 | 0.3     |      |
| 8 | 4 | 1.0 | 1.3     | this is the median (1.3 >= 0.8) |
| 5 | 7 | 0.3 | 1.6     |      |

Total wt: 1.6
50% of total wt: 0.8

Thus, with a median of 4:

| y | Abs(median - y) | wt  | Abs(median - y) * wt |
|---|-----------------|-----|----------------------|
| 3 | 1               | 0.3 | 0.3                  |
| 4 | 0               | 1.0 | 0                    |
| 7 | 3               | 0.3 | 0.9                  |

Total error: 1.2

To calculate the impurity we divide the Total Error by the total wt (i.e., 1.2 / 1.6):

Right Node Impurity: 0.75
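The same sketch verifies both children (node membership and medians taken from the tables above):

```python
def weighted_impurity(y, w, median):
    # weighted sum of absolute deviations, divided by total weight
    return sum(abs(yi - median) * wi for yi, wi in zip(y, w)) / sum(w)

# Left node (X == 3), weighted median 6; right node (X in {5, 8}), weighted median 4
left = weighted_impurity([3, 6], [0.1, 0.6], 6)           # 0.3 / 0.7
right = weighted_impurity([3, 4, 7], [0.3, 1.0, 0.3], 4)  # 1.2 / 1.6
print(left, right)  # approximately 0.4285714 and 0.75
```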
@JohnStott

Contributor

JohnStott commented Jul 11, 2018

@jnothman
if you can hold off merging this for now: I think it prudent to check that the information gain is still okay, and perhaps to try this out on some sample datasets (comparing to the original method). Will get back to you within the next few days, thanks.

@JohnStott

Contributor

JohnStott commented Jul 11, 2018

@jnothman I managed to do some extra checks and I'm confident it is working as expected. I pushed the Boston data set through a few RandomForestRegressors and obtained very similar results (error values) to the original implementation. The only downside to this fix is that it takes a little longer to run; this is to be expected, as there is now an additional lookup (the sample weight) and an extra multiplication for every impurity calculation.

@JohnStott

Contributor

JohnStott commented Jul 11, 2018

In case it is of interest:

```python
import time

from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

dataset = load_boston()
X_full, y_full = dataset.data, dataset.target

start = time.time()
noOfTrees = 1000
estimator = RandomForestRegressor(random_state=0, n_estimators=noOfTrees,
                                  criterion="mae")
estimator.fit(X_full, y_full)
prediction = estimator.predict(X_full)

mae = mean_absolute_error(y_full, prediction)

print("\nSeconds elapsed: " + str(time.time() - start))
print("Total MAE: " + str(mae))
```

Results:

| n_estimators | version | Seconds elapsed | MAE |
|---|---|---|---|
| 1000  | original       | 23.1293253898621 | 0.834333695652168 |
| 1000  | fixed impurity | 25.9046275615692 | 0.821276482213434 |
| 10000 | original       | 231.590049028397 | 0.834080750988118 |
| 10000 | fixed impurity | 261.451124191284 | 0.823872351778635 |
@JohnStott

Contributor

JohnStott commented Jul 12, 2018

Interestingly... I wasn't impressed with all the datatype conversions going on, so I trialled a version in which everything shares the same type. The results are much better. I'll do a few more tests and upload the changes/results shortly...

@JohnStott

Contributor

JohnStott commented Jul 12, 2018

Results of the datatype-consistency fix:

| n_estimators | version | Seconds elapsed | MAE |
|---|---|---|---|
| 1000  | original                 | 23.1293253898621 | 0.834333695652168 |
| 1000  | fixed impurity/datatypes | 22.7475745677948 | 0.821276482213434 |
| 10000 | original                 | 231.590049028397 | 0.834080750988118 |
| 10000 | fixed impurity/datatypes | 227.267272949219 | 0.823872351778635 |

Much better!

```cython
impurity += fabs(y_ik - self.node_medians[k]) * w
return <double>(impurity / (self.weighted_n_node_samples *
```

@glemaitre

glemaitre Jul 13, 2018

Contributor

I don't think there is any need to cast here.

There is an implicit conversion of n_outputs to double, since all the other variables are double.

@JohnStott

JohnStott Jul 13, 2018

Contributor

Note that impurity has been changed to type DOUBLE_t so that the calculation within the loop does not require any implicit conversion, since y_ik, self.node_medians and w are all now DOUBLE_t.

I think all the other variables would be converted to DOUBLE_t (not double), as this is likely the "bigger" datatype...? I'd agree with you on the whole, though: after doing this final calculation with all variables implicitly DOUBLE_t, the compiler will do one last implicit conversion to double, so the explicit cast is not required. I'll remove it.

```cython
if sample_weight != NULL:
    w = sample_weight[i]
imp_left += fabs(y_ik - median) * w
```

@glemaitre

glemaitre Jul 13, 2018

Contributor

Why not leave the previous code here? I don't see why we should not make the operation in place.

I think that we can strip the casting, though.

@JohnStott

JohnStott Jul 13, 2018

Contributor

impurity_left and impurity_right point to arrays of type double. It would be very expensive to cast (implicitly) from DOUBLE_t to double within the loop on each iteration, hence the introduction of the holders imp_left and imp_right. The cast is done once at the end, outside the loop, where it is much less expensive and will not suffer any loss of precision.

@glemaitre

glemaitre Jul 14, 2018

Contributor

DOUBLE_t is np.npy_float64, which is actually a double, so there is no casting cost because they are the same type. I am not really sure why we have all these casts around.

The only cast we could make explicit is in front of n_outputs, which is an integer.

@JohnStott

JohnStott Jul 15, 2018

Contributor

It is confusing why DOUBLE_t is used throughout instead of double, in that case? Maybe for some legacy reason (Python 2? etc.).

I'm not sure how intelligent the compiler is in determining that DOUBLE_t is np.npy_float64 and that an implicit conversion is therefore not required, so I ran some benchmarks comparing updating impurity_left/right[0] directly versus using imp_left/right. The results are interesting (note I removed all explicit casts). I ran the timings five times for each:

imp_left/right:

```
Seconds elapsed: 22.688602447509766
Total MAE: 0.8212764822134341

Seconds elapsed: 22.552382230758667
Total MAE: 0.8212764822134341

Seconds elapsed: 22.5827374458313
Total MAE: 0.8212764822134341

Seconds elapsed: 22.657567262649536
Total MAE: 0.8212764822134341

Seconds elapsed: 22.656784534454346
Total MAE: 0.8212764822134341
```

impurity_left/right[0]:

```
Seconds elapsed: 25.91646981239319
Total MAE: 0.8212764822134341

Seconds elapsed: 26.169488668441772
Total MAE: 0.8212764822134341

Seconds elapsed: 25.957823753356934
Total MAE: 0.8212764822134341

Seconds elapsed: 26.088392734527588
Total MAE: 0.8212764822134341

Seconds elapsed: 25.923059225082397
Total MAE: 0.8212764822134341
```

There is a notable difference. This may not actually be caused by the implicit casts (if any); it could be the lookup of item 0 in the arrays impurity_left[0] / impurity_right[0] on each iteration.

I think I can test this by changing my imp_left/right solution to declare imp_left/right as double instead of DOUBLE_t. I'll try that and see what timings we get...

@JohnStott

JohnStott Jul 15, 2018

Contributor

Here are the results:

```
Seconds elapsed: 22.72613835334778
Total MAE: 0.8212764822134341

Seconds elapsed: 22.563627243041992
Total MAE: 0.8212764822134341

Seconds elapsed: 22.58454728126526
Total MAE: 0.8212764822134341

Seconds elapsed: 22.605653047561646
Total MAE: 0.8212764822134341

Seconds elapsed: 22.564542770385742
Total MAE: 0.8212764822134341
```

From this, I think we can safely conclude that there are no issues with implicit casts; the time is lost looking up item 0 in the arrays on each iteration. I therefore propose we keep imp_left / imp_right.
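The effect can be mimicked in pure Python (a rough analogue only; Python list indexing overhead differs from a Cython pointer dereference): accumulating in a local variable and writing the result out once avoids repeated `out[0]` work inside the loop.

```python
import timeit

data = list(range(100_000))

def via_local(out):
    # accumulate in a local variable; write through "out" once at the end
    acc = 0.0
    for v in data:
        acc += v
    out[0] = acc

def via_index(out):
    # update out[0] on every iteration (extra indexing on each pass)
    out[0] = 0.0
    for v in data:
        out[0] += v

out = [0.0]
print(timeit.timeit(lambda: via_local(out), number=50))
print(timeit.timeit(lambda: via_index(out), number=50))
```

On CPython, via_local is usually noticeably faster, mirroring the imp_left/imp_right result above.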

@JohnStott

JohnStott Jul 15, 2018

Contributor

(As an aside, I decided to run one last benchmark, explicitly casting self.n_outputs to type double, and got the following:)

```
Seconds elapsed: 22.798115253448486
Seconds elapsed: 22.661853551864624
Seconds elapsed: 22.832958459854126
Seconds elapsed: 22.838365077972412
Seconds elapsed: 22.705886602401733
```

(so not overly conclusive, and thus no benefit in introducing an explicit cast here)

@glemaitre

glemaitre Jul 15, 2018

Contributor

DOUBLE_t and double are the same at the end of the day. However, your point regarding the dereferencing is actually interesting, and it then makes sense to avoid doing it in the loop and just make a single assignment at the end.

@JohnStott

JohnStott Jul 15, 2018

Contributor

If DOUBLE_t and double are exactly the same, should we not replace all DOUBLE_t with double (and likewise SIZE_t, if it is equivalent to another native datatype)? Otherwise it will just cause further confusion (as above) and untrue assumptions for other developers. Unless anyone knows of some other good reason for these?

@glemaitre

glemaitre Jul 15, 2018

Contributor

Basically, in C++ double_t is an alias to double as well: http://www.cplusplus.com/reference/cmath/double_t/.

So I don't think it is worth making so much change. I don't know exactly why the code was originally written that way, but it could have been imposed by Cython in the past.

```cython
if sample_weight != NULL:
    w = sample_weight[i]
imp_right += fabs(y_ik - median) * w
```

@glemaitre

glemaitre Jul 13, 2018

Contributor

Same comment as before.

@JohnStott

JohnStott Jul 13, 2018

Contributor

see above

```diff
@@ -1703,7 +1703,8 @@ def test_mae():
     dt_mae.fit([[3], [5], [3], [8], [5]], [6, 7, 3, 4, 3],
                [0.6, 0.3, 0.1, 1.0, 0.3])
-    assert_array_equal(dt_mae.tree_.impurity, [7.0/2.3, 3.0/0.7, 4.0/1.6])
+    assert_array_almost_equal(dt_mae.tree_.impurity,
+                              [2.5/2.3, 0.3/0.7, 1.2/1.6])
```

@glemaitre

glemaitre Jul 13, 2018

Contributor

Put a space after the / sign in this case; it will be more readable.

@JohnStott

JohnStott Jul 13, 2018

Contributor

Did you mean like this: `[2.5 / 2.3, 0.3 / 0.7, 1.2 / 1.6]`?

JohnStott added some commits Jul 13, 2018

```diff
@@ -1269,6 +1272,9 @@ cdef class MAE(RegressionCriterion):
         cdef SIZE_t i, p, k
         cdef DOUBLE_t y_ik
         cdef DOUBLE_t median
+        cdef DOUBLE_t w = 1.0
+        cdef DOUBLE_t imp_left = 0.0
+        cdef DOUBLE_t imp_right = 0.0
         cdef void** left_child = <void**> self.left_child.data
         cdef void** right_child = <void**> self.right_child.data
```

@glemaitre

glemaitre Jul 15, 2018

Contributor

You can remove the two lines below then

@JohnStott

JohnStott Jul 16, 2018

Contributor

Good call 👍

```diff
     assert_array_equal(dt_mae.tree_.value.flat, [4, 4.5, 4.0])
     dt_mae.fit([[3], [5], [3], [8], [5]], [6, 7, 3, 4, 3],
                [0.6, 0.3, 0.1, 1.0, 0.3])
-    assert_array_equal(dt_mae.tree_.impurity, [7.0/2.3, 3.0/0.7, 4.0/1.6])
+    assert_array_almost_equal(dt_mae.tree_.impurity,
```

@glemaitre

glemaitre Jul 15, 2018

Contributor

We should probably also add a case with uniform sample_weight (an array of ones), which should give the previous results as well.

@JohnStott

JohnStott Jul 16, 2018

Contributor

Good point, added (with additional comments too)

```diff
@@ -1269,6 +1272,9 @@ cdef class MAE(RegressionCriterion):
         cdef SIZE_t i, p, k
         cdef DOUBLE_t y_ik
         cdef DOUBLE_t median
+        cdef DOUBLE_t w = 1.0
+        cdef DOUBLE_t imp_left = 0.0
```

@glemaitre

glemaitre Jul 15, 2018

Contributor

Can you rename imp_left to impurity_left, and impurity_left to p_impurity_left, to differentiate which one is the pointer? That way both names will be explicit.
The same should be done for the right side.

@JohnStott

JohnStott Jul 16, 2018

Contributor

Done 👍

```diff
@@ -513,6 +513,12 @@ Classifiers and regressors
   per-tree basis. The previous behavior over-weighted the Gini importance of
   features that appear in later stages. This issue only affected feature
   importances. :issue:`11176` by :user:`Gil Forsyth <gforsyth>`.
+- Fixed a bug in :class:`tree.MAE` to ensure sample_weights are being used
```

@glemaitre

glemaitre Jul 15, 2018

Contributor

use backticks around sample_weights

@JohnStott

JohnStott Jul 16, 2018

Contributor

English-wise, "sample_weights" (with the plurality) didn't quite look right, so I changed it to plain English (no parameter reference).

@jorisvandenbossche jorisvandenbossche added this to PRs tagged in scikit-learn 0.20 Jul 16, 2018

```python
    # on small toy dataset
    dt_mae = DecisionTreeRegressor(random_state=0, criterion="mae",
                                   max_leaf_nodes=2)
    # Test MAE where all sample weights are uniform:
    dt_mae.fit([[3], [5], [3], [8], [5]], [6, 7, 3, 4, 3], np.ones(5))
```

@GaelVaroquaux

GaelVaroquaux Jul 16, 2018

Member

Cosmit: could you be explicit with the named kwargs, so that it is easy to see that you are setting sample weights?

@JohnStott

JohnStott Jul 16, 2018

Contributor

Yes, I'll do that later; I have simply followed the pre-existing style.

```python
    # Test MAE where all sample weights are uniform:
    dt_mae.fit([[3], [5], [3], [8], [5]], [6, 7, 3, 4, 3], np.ones(5))
    assert_array_equal(dt_mae.tree_.impurity, [1.4, 1.5, 4.0 / 3.0])
    assert_array_equal(dt_mae.tree_.value.flat, [4, 4.5, 4.0])
```

@GaelVaroquaux

GaelVaroquaux Jul 16, 2018

Member

How do we know that these are the right values? It would be useful to explain it in a comment.

@JohnStott

JohnStott Jul 16, 2018

Contributor

I explained in my second post above... Perhaps I could put a link to this pull request in the comments? Otherwise I can try to summarise.

```diff
     dt_mae.fit([[3], [5], [3], [8], [5]], [6, 7, 3, 4, 3],
                [0.6, 0.3, 0.1, 1.0, 0.3])
-    assert_array_equal(dt_mae.tree_.impurity, [7.0/2.3, 3.0/0.7, 4.0/1.6])
+    assert_array_almost_equal(dt_mae.tree_.impurity,
+                              [2.5 / 2.3, 0.3 / 0.7, 1.2 / 1.6])
```

@GaelVaroquaux

GaelVaroquaux Jul 16, 2018

Member

This illustrates well why I don't love hard-coded numerical values: without being an expert in this part of the code, it is hard to see why the new version is correct and not the old one.

@JohnStott

JohnStott Jul 16, 2018

Contributor

I agree; unfortunately this file has many other tests with hard-coded values and no explanation, and I was simply following suit. On a side note, I have a fix for another issue in this domain, where the medians are not being calculated correctly: see #10725. I will perhaps need more rigorous testing there too, rather than a single test that happens to work. It is tricky for me to know the best way to integrate it, as the values can be verified via code at runtime (not as a one-off test) if desired (as explained in that linked issue).

Perhaps you know of unit-test code I could be redirected to that shows how it should really be done properly, so that I can follow that instead?

@JohnStott

JohnStott Jul 16, 2018

Contributor

Apologies for cross-contaminating threads; I only mention this other issue because in all likelihood I will be extending this exact test function! Please ignore this, I will investigate via the other thread.

JohnStott added some commits Jul 16, 2018

```
@@ -1693,19 +1693,104 @@ def test_no_sparse_y_support(name):
 def test_mae():
     # check MAE criterion produces correct results
     # on small toy dataset
     '''
```

@glemaitre

glemaitre Jul 16, 2018

Contributor

Replace ''' by """ and put the title on the first line.

The example is verbose, but at least it explains the rationale behind the numbers.

```python
    # Test MAE where sample weights are non-uniform (as illustrated above):
    dt_mae.fit(X=[[3], [5], [3], [8], [5]], y=[6, 7, 3, 4, 3],
               sample_weight=[0.6, 0.3, 0.1, 1.0, 0.3])
    assert_array_almost_equal(dt_mae.tree_.impurity,
```

@glemaitre

glemaitre Jul 16, 2018

Contributor

use assert_allclose instead of assert_array_almost_equal

@GaelVaroquaux

This is great! Thank you so much!!

Merging when CI is green.

@GaelVaroquaux GaelVaroquaux changed the title from Tree MAE fix to ensure sample_weights are used during impurity calculation to [MRG+1] Tree MAE fix to ensure sample_weights are used during impurity calculation Jul 17, 2018

@GaelVaroquaux GaelVaroquaux changed the title from [MRG+1] Tree MAE fix to ensure sample_weights are used during impurity calculation to [MRG+2] Tree MAE fix to ensure sample_weights are used during impurity calculation Jul 17, 2018

@JohnStott

Contributor

JohnStott commented Jul 17, 2018

Trying to figure out what I'm doing wrong with this whitespace issue, but I'll keep at it until it's fixed...

In the meantime, many thanks @glemaitre, @GaelVaroquaux and @jnothman, really appreciate your help, time, patience and direction on this, has been a great learning process for me 👍.

@GaelVaroquaux

Member

GaelVaroquaux commented Jul 17, 2018

@JohnStott : you're doing good!

The trailing-whitespace error on Travis means that you have whitespace at the end of certain lines.

I personally work by setting up my editor to show whitespace at the end of lines (you can underline or highlight it).

@JohnStott

Contributor

JohnStott commented Jul 17, 2018

@GaelVaroquaux, thanks, I have just switched that on as you suggest... I can now see several lines with trailing whitespace... hopefully it will go through now, cheers.

@GaelVaroquaux

Member

GaelVaroquaux commented Jul 17, 2018

You still have one trailing whitespace. Just one!

```
sklearn/tree/tests/test_tree.py:1780:63: W291 trailing whitespace
    dt_mae.fit(X=[[3], [5], [3], [8], [5]], y=[6, 7, 3, 4, 3], 
```
@GaelVaroquaux

Member

GaelVaroquaux commented Jul 17, 2018

@JohnStott: I figured you were probably sleeping, so I just made this change myself, in the interest of merging this PR.

@GaelVaroquaux

Member

GaelVaroquaux commented Jul 17, 2018

Yey! PEP8 now works.

Merging! Thank you!

@GaelVaroquaux GaelVaroquaux reopened this Jul 17, 2018

@GaelVaroquaux GaelVaroquaux merged commit 6ade12a into scikit-learn:master Jul 17, 2018

3 of 5 checks passed

- `continuous-integration/appveyor/pr`: Waiting for AppVeyor build to complete
- `continuous-integration/travis-ci/pr`: The Travis CI build is in progress
- `ci/circleci: deploy`: Your tests passed on CircleCI!
- `ci/circleci: python2`: Your tests passed on CircleCI!
- `ci/circleci: python3`: Your tests passed on CircleCI!

scikit-learn 0.20 automation moved this from PRs tagged to Done Jul 17, 2018

@JohnStott

Contributor

JohnStott commented Jul 17, 2018

Brilliant, many thanks @GaelVaroquaux you saved me pulling out my hair :) 👍
