[MRG+3] Add mean absolute error splitting criterion to DecisionTreeRegressor #6667

Merged 75 commits on Jul 25, 2016

Commits (75)
783f433
feature: add initial node_value method
nelson-liu Apr 15, 2016
68ae519
testing code for node_impurity and node_value
nelson-liu Apr 17, 2016
c7b640a
fix: node_value now correctly calculating weighted median for sorted …
nelson-liu Apr 18, 2016
2fb7651
fix: node_value now correctly calculates median regardless of initial…
nelson-liu Apr 19, 2016
a3f2f76
fix: correct bug in calculating median when taking midpoint is necessary
nelson-liu Apr 19, 2016
c40a54b
feature: add initial version of children_impurity
nelson-liu Apr 19, 2016
19e811d
feature: refactor median calculation into one function
nelson-liu Apr 19, 2016
31f04b4
fix: fix use of DOUBLE_t vs double
nelson-liu Apr 21, 2016
ffff616
feature: move helper functions to _utils.pyx, fix mismatched pointer …
nelson-liu May 12, 2016
bfde38d
fix: fix some bugs in children_impurity method
nelson-liu May 14, 2016
8b77de0
push a debug version to try to solve segfault
nelson-liu May 19, 2016
adb244d
push latest changes, segfault probably happening bc of something in _…
nelson-liu May 19, 2016
ca5149a
fix: fix segfault in median calculation and remove excessive logging
nelson-liu May 19, 2016
1e5a969
chore: revert some misc spacing changes I accidentally made
nelson-liu May 19, 2016
99132ac
chore: one last spacing fix in _splitter.pyx
nelson-liu May 19, 2016
9655fb0
feature: don't calculate weighted median if no weights are passed in
nelson-liu May 19, 2016
e0476b9
remove extraneous logging statement
nelson-liu May 20, 2016
04dfc7e
fix: fix children impurity calculation
nelson-liu May 20, 2016
a61782f
fix: fix bug with children impurity not being initally set to 0
nelson-liu May 20, 2016
33af0fb
fix: hacky fix for a float accuracy error
nelson-liu May 21, 2016
5844d81
fix: incorrect type cast in median array generation for node_impurity
nelson-liu May 22, 2016
134eb92
slightly tweak node_impurity function
nelson-liu May 22, 2016
115df19
fix: be more explicit with casts
nelson-liu May 22, 2016
6fa918f
feature: revert cosmetic changes and free temporary arrays
nelson-liu May 27, 2016
8d00594
fix: only free weight array in median calculation if it was created
nelson-liu May 27, 2016
d51e091
style: remove extraneous newline / trigger CI build
nelson-liu May 28, 2016
a9ccf18
style: remove extraneous 0 from range
nelson-liu May 28, 2016
f46b3c2
feature: save sorts within a node to speed it up
nelson-liu May 29, 2016
5635c97
fix: move parts of dealloc to regression criterion
nelson-liu May 29, 2016
97d44e3
chore: add comment to splitter to try to force recythonizing
nelson-liu May 29, 2016
58949f7
chore: add comment to _tree.pyx to try to force recythonizing
nelson-liu May 29, 2016
0be994f
chore: add empty comment to gradient boosting to force recythonizing
nelson-liu May 29, 2016
492ea7d
fix: fix bug in weighted median
nelson-liu May 29, 2016
00fbe6e
try moving sorted values to a class variable
nelson-liu May 29, 2016
f03cf38
feature: refactor criterion to sort once initially, then draw all sam…
nelson-liu Jun 18, 2016
2fdb56d
style: remove extraneous parens from if condition
nelson-liu Jun 18, 2016
b9aef43
implement median-heap method for calculating impurity
nelson-liu Jun 22, 2016
39e693c
style: remove extra line
nelson-liu Jun 22, 2016
20d6107
style: fix inadvertent cosmetic changes; i'll address some of these i…
nelson-liu Jun 23, 2016
f73ac8e
feature: change minmaxheap to internally use sorted arrays
nelson-liu Jul 8, 2016
5b8d665
refactored MAE and push to share work
nelson-liu Jul 16, 2016
9920cfc
fix errors wrt median insertion case
nelson-liu Jul 17, 2016
53207d4
spurious comment to force recythonization
nelson-liu Jul 17, 2016
6907227
general code cleanup
nelson-liu Jul 17, 2016
8d55097
fix typo in _tree.pyx
nelson-liu Jul 17, 2016
b465abc
removed some extraneous comments
nelson-liu Jul 17, 2016
df9e64a
[ci skip] remove earlier microchanges
nelson-liu Jul 17, 2016
32c1fef
[ci skip] remove change to priorityheap
nelson-liu Jul 17, 2016
5e2cd1a
[ci skip] fix indentation
nelson-liu Jul 17, 2016
9f1b5fd
[ci skip] fix class-specific issues with heaps
nelson-liu Jul 17, 2016
802e1fd
[ci skip] restore a newline
nelson-liu Jul 17, 2016
c0401a5
[ci skip] remove microchange to refactor later
nelson-liu Jul 17, 2016
0bfc2c3
reword a comment
nelson-liu Jul 17, 2016
702bb6b
remove heapify methods from queue class
nelson-liu Jul 17, 2016
327ea19
doc: update docstrings for dt, rf, and et regressors
nelson-liu Jul 18, 2016
469274d
doc: revert incorrect spacing to shorten diff
nelson-liu Jul 18, 2016
560f6fa
convert get_median to return value directly
nelson-liu Jul 18, 2016
87b0180
[ci skip] remove accidental whitespace
nelson-liu Jul 18, 2016
ecae675
remove extraneous unpacking of values
nelson-liu Jul 19, 2016
6c28358
style: misc changes to identifiers
nelson-liu Jul 19, 2016
0db9965
add docstrings and more informative variable identifiers
nelson-liu Jul 19, 2016
e373416
[ci skip] add trivial comments to recythonize
nelson-liu Jul 19, 2016
448bb6e
remove trivial comments for recythonizing
nelson-liu Jul 19, 2016
c44f327
force recythonization for real this time
nelson-liu Jul 19, 2016
8d442cf
remove trivial comments for recythonization
nelson-liu Jul 19, 2016
a008538
rfc: harmonize arg. names and remove unnecessary checks
nelson-liu Jul 20, 2016
929153c
convert allocations to safe_realloc
nelson-liu Jul 20, 2016
f383c94
fix bug in weighted case and add tests for MAE
nelson-liu Jul 20, 2016
6a1f3d4
change all medians to DOUBLE_t
nelson-liu Jul 20, 2016
e25a52c
add logic to allocate median calculators once, and reset otherwise
nelson-liu Jul 20, 2016
bd0c71d
misc style fixes
nelson-liu Jul 21, 2016
d3245ae
modify cinit of regressioncriterion to take n_samples
nelson-liu Jul 21, 2016
dbaa57b
add MAE formula and force rebuild bc. travis was down
nelson-liu Jul 21, 2016
f668ab9
add criterion parameter to gradient boosting and add forest tests
nelson-liu Jul 21, 2016
04d3b8a
add entries to what's new
nelson-liu Jul 21, 2016
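Several commits above (b9aef43, "implement median-heap method for calculating impurity", later reworked in f73ac8e) build the median-maintenance machinery that makes MAE splitting tractable: recomputing a median from scratch at every candidate split is too slow, so the criterion keeps a running median as samples move between children. The PR's actual structure is a Cython MinMaxHeap (internally backed by sorted arrays after f73ac8e); the sketch below is only the textbook two-heap variant of the same idea, in plain Python, to show why the median stays available in O(1) per query. The class name and API here are illustrative, not the PR's.

```python
import heapq

class RunningMedian:
    """Textbook two-heap running median (sketch only).

    `lo` holds the smaller half as a max-heap (values negated),
    `hi` holds the larger half as a min-heap. Removal, which the
    real criterion also needs when samples leave a child, is
    omitted for brevity.
    """

    def __init__(self):
        self.lo, self.hi = [], []

    def push(self, x):
        # Route x to the correct half, then rebalance so the two
        # halves never differ in size by more than one element.
        if self.lo and x > -self.lo[0]:
            heapq.heappush(self.hi, x)
        else:
            heapq.heappush(self.lo, -x)
        if len(self.lo) > len(self.hi) + 1:
            heapq.heappush(self.hi, -heapq.heappop(self.lo))
        elif len(self.hi) > len(self.lo) + 1:
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        if len(self.hi) > len(self.lo):
            return self.hi[0]
        return (-self.lo[0] + self.hi[0]) / 2.0

rm = RunningMedian()
for v in [5.0, 1.0, 3.0, 8.0]:
    rm.push(v)
print(rm.median())  # 4.0, the midpoint of 3.0 and 5.0
```

Each insertion costs O(log n) and the median never has to be recomputed from scratch, which is the property the criterion relies on when sweeping split positions.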
15 changes: 15 additions & 0 deletions doc/whats_new.rst
@@ -117,6 +117,14 @@ New features
and Harabaz score to evaluate the resulting clustering of a set of points.
By `Arnaud Fouchet`_ and `Thierry Guillemot`_.

- Added a new splitting criterion for :class:`tree.DecisionTreeRegressor`,
the mean absolute error. This criterion can also be used in
:class:`ensemble.ExtraTreesRegressor`,
:class:`ensemble.RandomForestRegressor`, and the gradient boosting
estimators. (`#6667
<https://github.com/scikit-learn/scikit-learn/pull/6667>`_) by `Nelson
Liu`_.
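For context when reading the entry above, a minimal usage sketch of the new option (toy data, public estimator API only, assuming the scikit-learn release containing this PR):

```python
from sklearn.tree import DecisionTreeRegressor

# "mae" splits to minimize mean absolute error (L1; a node predicts
# its median), whereas the default "mse" minimizes squared error
# (L2; a node predicts its mean).
reg = DecisionTreeRegressor(criterion="mae", random_state=0)
reg.fit([[0], [1], [2], [3]], [0.0, 0.1, 1.9, 2.0])
print(reg.predict([[1.5]]))
```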

Enhancements
............

@@ -146,6 +154,11 @@ Enhancements
provided as a percentage of the training samples. By
`yelite`_ and `Arnaud Joly`_.

- Gradient boosting estimators accept the parameter ``criterion`` to specify
the splitting criterion used when building decision trees. (`#6667
<https://github.com/scikit-learn/scikit-learn/pull/6667>`_) by `Nelson
Liu`_.
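A matching sketch for the gradient boosting side of this change (the parameter name and accepted values are as shown in the diff below; the rest is a hedged illustration):

```python
from sklearn.ensemble import GradientBoostingRegressor

# Before this PR the stage-wise trees were hard-coded to
# "friedman_mse"; now "friedman_mse", "mse", or "mae" can be
# requested and is forwarded to each internal DecisionTreeRegressor.
gbr = GradientBoostingRegressor(criterion="mae", n_estimators=50,
                                random_state=0)
```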

- Codebase does not contain C/C++ cython generated files: they are
generated during build. Distribution packages will still contain generated
C/C++ files. By `Arthur Mensch`_.
@@ -4280,3 +4293,5 @@ David Huard, Dave Morrill, Ed Schofield, Travis Oliphant, Pearu Peterson.
.. _Sebastian Säger: https://github.com/ssaeger

.. _YenChen Lin: https://github.com/yenchenlin

.. _Nelson Liu: https://github.com/nelson-liu
12 changes: 8 additions & 4 deletions sklearn/ensemble/forest.py
@@ -947,8 +947,10 @@ class RandomForestRegressor(ForestRegressor):
The number of trees in the forest.

criterion : string, optional (default="mse")
The function to measure the quality of a split. The only supported
criterion is "mse" for the mean squared error.
The function to measure the quality of a split. Supported criteria
are "mse" for the mean squared error, which is equal to variance
reduction as feature selection criterion, and "mae" for the mean
absolute error.

Review comment (Member): I think this should have versionadded for mae

max_features : int, float, string or None, optional (default="auto")
The number of features to consider when looking for the best split:
@@ -1299,8 +1301,10 @@ class ExtraTreesRegressor(ForestRegressor):
The number of trees in the forest.

criterion : string, optional (default="mse")
The function to measure the quality of a split. The only supported
criterion is "mse" for the mean squared error.
The function to measure the quality of a split. Supported criteria
are "mse" for the mean squared error, which is equal to variance
reduction as feature selection criterion, and "mae" for the mean
absolute error.

Review comment (Member): I think this should have versionadded for mae

max_features : int, float, string or None, optional (default="auto")
The number of features to consider when looking for the best split:
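To make the "variance reduction" vs. "mean absolute error" wording in these docstrings concrete, here is a rough NumPy rendering of the two unweighted node impurities. This is illustrative only (the helper names are mine); the real computation lives in Cython and uses a weighted median when sample weights are present.

```python
import numpy as np

def mse_impurity(y):
    # Node variance: mean squared deviation from the node mean.
    return np.mean((y - np.mean(y)) ** 2)

def mae_impurity(y):
    # Mean absolute deviation from the node median (this PR).
    return np.mean(np.abs(y - np.median(y)))

y = np.array([0.0, 0.1, 1.9, 2.0, 10.0])  # one outlier
print(mse_impurity(y), mae_impurity(y))   # MAE is far less outlier-driven
```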
31 changes: 24 additions & 7 deletions sklearn/ensemble/gradient_boosting.py
@@ -720,15 +720,16 @@ class BaseGradientBoosting(six.with_metaclass(ABCMeta, BaseEnsemble,
"""Abstract base class for Gradient Boosting. """

@abstractmethod
def __init__(self, loss, learning_rate, n_estimators, min_samples_split,
min_samples_leaf, min_weight_fraction_leaf,
def __init__(self, loss, learning_rate, n_estimators, criterion,
min_samples_split, min_samples_leaf, min_weight_fraction_leaf,
max_depth, init, subsample, max_features,
random_state, alpha=0.9, verbose=0, max_leaf_nodes=None,
warm_start=False, presort='auto'):

self.n_estimators = n_estimators
self.learning_rate = learning_rate
self.loss = loss
self.criterion = criterion
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
self.min_weight_fraction_leaf = min_weight_fraction_leaf
@@ -762,7 +763,7 @@ def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,

# induce regression tree on residuals
tree = DecisionTreeRegressor(
criterion='friedman_mse',
criterion=self.criterion,
splitter='best',
max_depth=self.max_depth,
min_samples_split=self.min_samples_split,
@@ -1296,6 +1297,14 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
of the input variables.
Ignored if ``max_leaf_nodes`` is not None.

criterion : string, optional (default="friedman_mse")
Review thread:
- Member: Sorry for being late to the party, but this should have a versionadded, right?
- Author: yes it should, i'll add that.
- Member: thanks :)

The function to measure the quality of a split. Supported criteria
are "friedman_mse" for the mean squared error with improvement
score by Friedman, "mse" for mean squared error, and "mae" for
the mean absolute error. The default value of "friedman_mse" is
generally the best as it can provide a better approximation in
some cases.
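One point worth spelling out next to this docstring: ``criterion`` controls the split search inside each stage's regression tree, while ``loss`` remains the boosting objective; the two are independent knobs. A hedged sketch (``loss="deviance"`` was the default at the time of this PR):

```python
from sklearn.ensemble import GradientBoostingClassifier

# "criterion" shapes how each tree fits the stage's pseudo-residuals;
# "loss" ("deviance" here) defines those residuals in the first place.
clf = GradientBoostingClassifier(loss="deviance", criterion="mae")
```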

min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:

@@ -1426,7 +1435,7 @@ class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):
_SUPPORTED_LOSS = ('deviance', 'exponential')

def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
subsample=1.0, min_samples_split=2,
subsample=1.0, criterion='friedman_mse', min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.,
max_depth=3, init=None, random_state=None,
max_features=None, verbose=0,
@@ -1435,7 +1444,7 @@ def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,

super(GradientBoostingClassifier, self).__init__(
loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
min_samples_split=min_samples_split,
criterion=criterion, min_samples_split=min_samples_split,
min_samples_leaf=min_samples_leaf,
min_weight_fraction_leaf=min_weight_fraction_leaf,
max_depth=max_depth, init=init, subsample=subsample,
@@ -1643,6 +1652,14 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
of the input variables.
Ignored if ``max_leaf_nodes`` is not None.

criterion : string, optional (default="friedman_mse")
Review comment (Member): versionadded

The function to measure the quality of a split. Supported criteria
are "friedman_mse" for the mean squared error with improvement
score by Friedman, "mse" for mean squared error, and "mae" for
the mean absolute error. The default value of "friedman_mse" is
generally the best as it can provide a better approximation in
some cases.

min_samples_split : int, float, optional (default=2)
The minimum number of samples required to split an internal node:

@@ -1772,15 +1789,15 @@ class GradientBoostingRegressor(BaseGradientBoosting, RegressorMixin):
_SUPPORTED_LOSS = ('ls', 'lad', 'huber', 'quantile')

def __init__(self, loss='ls', learning_rate=0.1, n_estimators=100,
subsample=1.0, min_samples_split=2,
subsample=1.0, criterion='friedman_mse', min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.,
max_depth=3, init=None, random_state=None,
max_features=None, alpha=0.9, verbose=0, max_leaf_nodes=None,
warm_start=False, presort='auto'):

super(GradientBoostingRegressor, self).__init__(
loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
min_samples_split=min_samples_split,
criterion=criterion, min_samples_split=min_samples_split,
min_samples_leaf=min_samples_leaf,
min_weight_fraction_leaf=min_weight_fraction_leaf,
max_depth=max_depth, init=init, subsample=subsample,
4 changes: 2 additions & 2 deletions sklearn/ensemble/tests/test_forest.py
@@ -159,7 +159,7 @@ def check_boston_criterion(name, criterion):


def test_boston():
for name, criterion in product(FOREST_REGRESSORS, ("mse", )):
for name, criterion in product(FOREST_REGRESSORS, ("mse", "mae", "friedman_mse")):
yield check_boston_criterion, name, criterion


@@ -244,7 +244,7 @@ def test_importances():
for name, criterion in product(FOREST_CLASSIFIERS, ["gini", "entropy"]):
yield check_importances, name, criterion, X, y

for name, criterion in product(FOREST_REGRESSORS, ["mse", "friedman_mse"]):
for name, criterion in product(FOREST_REGRESSORS, ["mse", "friedman_mse", "mae"]):
yield check_importances, name, criterion, X, y

