[MRG+2] Tree MAE fix to ensure sample_weights are used during impurity calculation #11464
Tree MAE is not considering sample_weights when calculating impurity!
In the proposed fix, I multiply by the sample weight after taking the absolute value of the difference (not before). This is in line with the consensus in the discussion in #3774, where negative sample weights are considered. It is also consistent with initialisation, where self.weighted_n_node_samples is a sum of the sample weights with no absolute value applied (this sum is the divisor in the impurity calculation).
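To make the ordering concrete, here is a minimal sketch of the two options for a single sample. The function names and the numbers are hypothetical, chosen only for illustration; this is not the Cython code from the PR. With a positive weight both orderings agree, but with a negative weight only "multiply after the absolute value" keeps the sign of the weight, which matches how the sum of weights is accumulated:

```python
def error_weight_after_abs(y_i, median, w_i):
    # Ordering described in this PR: take |y_i - median| first, then apply the weight.
    return w_i * abs(y_i - median)


def error_weight_before_abs(y_i, median, w_i):
    # Alternative ordering: apply the weight first, then take the absolute value.
    return abs(w_i * (y_i - median))


# Hypothetical sample: y = 7, median = 4
positive_after = error_weight_after_abs(7, 4, 0.5)
positive_before = error_weight_before_abs(7, 4, 0.5)
negative_after = error_weight_after_abs(7, 4, -0.5)
negative_before = error_weight_before_abs(7, 4, -0.5)

print(positive_after, positive_before)  # identical for a positive weight
print(negative_after, negative_before)  # opposite signs for a negative weight
```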
Fixes: #11460 (Tree MAE is not considering sample_weights when calculating impurity!)
I think this looks good, but I am no expert on the calculation of impurities. Could you give some calculation to show why the values in the test are those we should expect?
Please add an entry to the change log under Bug Fixes at
Thanks @jnothman,
I will make those changes as soon as possible. For now, here is an example; I hope it is sufficient...
It is worth remembering that, because we are dealing with sample_weights, we cannot find the median by simply choosing/averaging the centre value(s). Instead, the median is the value at which 50% of the cumulative weight is reached in the sorted data set:
Data sorted by y:
From the above we can see that the sample with a y of 4 is the median, because it is the first sample whose cumulative weight (1.4) is bigger than 50% of the total weight.
Using the y value of 4 as our median we can now calculate the absolute error:
This was the original calculation (before this fix). However, it does not consider the sample weight: even though the sample with x = 3, y = 4 has a weight of 0.1, it is currently being given a weight of 1 as its importance, which is wrong (imo). I therefore propose that we multiply by the sample weight to get the relative error:
To calculate the impurity we divide the Total Error by the total wt (i.e., 2.5 / 2.3):
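The steps above (weighted median, weighted absolute error, division by the total weight) can be sketched in plain Python. The y values and weights below are assumed for illustration, chosen so that they reproduce the figures quoted in this thread (median 4, cumulative weight 1.4, 2.5 / 2.3). The helper names are hypothetical, ties at exactly 50% are not treated specially, and this is not the Cython implementation in the PR:

```python
def weighted_median(y, w):
    """Return the first y (in sorted order) whose cumulative weight
    is bigger than 50% of the total weight."""
    pairs = sorted(zip(y, w))
    half = sum(w) / 2.0
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative > half:
            return value
    return pairs[-1][0]


def weighted_mae_impurity(y, w):
    """Sum of weight * |y - median|, divided by the total weight."""
    median = weighted_median(y, w)
    total_error = sum(wi * abs(yi - median) for yi, wi in zip(y, w))
    return total_error / sum(w)


# Assumed data (not taken verbatim from the thread).
y = [6, 7, 3, 4, 3]
w = [0.6, 0.3, 0.1, 1.0, 0.3]

median = weighted_median(y, w)
impurity = weighted_mae_impurity(y, w)
print(median, impurity)  # impurity should be close to 2.5 / 2.3
```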
The DecisionTreeRegressor then does its next cycle: it goes through every value of X to find the optimal split. I won’t go through each one, just the optimal one… It finds that the optimal split is between the X values of 3 and 5. Therefore we have 2 new nodes as follows:
Thus, with a median of 6:
To calculate the impurity we divide the Total Error by the total wt (i.e., 0.3 / 0.7):
Thus, with a median of 4:
To calculate the impurity we divide the Total Error by the total wt (i.e., 1.2 / 1.6):
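The two child-node calculations can be checked the same way. In this sketch the per-node y values and weights are assumed for illustration, chosen to be consistent with the medians (6 and 4) and ratios (0.3 / 0.7 and 1.2 / 1.6) quoted above. The helper name is hypothetical and the code is not taken from the PR:

```python
def weighted_mae_impurity(y, w):
    """Return (median, impurity) for one node.

    The median is the first sorted y whose cumulative weight is bigger
    than 50% of the node's total weight; the impurity is the weighted
    absolute error around that median divided by the total weight.
    """
    total_weight = sum(w)
    cumulative = 0.0
    median = None
    for value, weight in sorted(zip(y, w)):
        cumulative += weight
        if cumulative > total_weight / 2.0:
            median = value
            break
    total_error = sum(wi * abs(yi - median) for yi, wi in zip(y, w))
    return median, total_error / total_weight


# Assumed child nodes after the split (illustrative values).
left_median, left_impurity = weighted_mae_impurity([6, 3], [0.6, 0.1])
right_median, right_impurity = weighted_mae_impurity([7, 3, 4], [0.3, 0.3, 1.0])

print(left_median, left_impurity)    # impurity close to 0.3 / 0.7
print(right_median, right_impurity)  # impurity close to 1.2 / 1.6
```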
@jnothman I managed to do some extra checks and I'm confident it is working as expected. I pushed the Boston data set through a few RandomForestRegressors and obtained very similar results (error values) to the original implementation. The only downside to this fix is that it takes a little longer to run. This is to be expected, as there is now an additional lookup (sample weight) and an extra multiplication for every impurity calculation!
In case of interest (?):
Results of datatype consistency fix:
changed the title from "Tree MAE fix to ensure sample_weights are used during impurity calculation" to "[MRG+1] Tree MAE fix to ensure sample_weights are used during impurity calculation"
Jul 17, 2018
@JohnStott : you're doing good!
The trailing whitespace error on travis means that you have whitespace at the end of certain lines.
I personally work by setting up my editor to show me the whitespace at the end of lines (you can underline them, or highlight them).