[MRG] ENH add Poisson splitting criterion for single trees #17386

Merged
merged 33 commits into from Nov 2, 2020
Changes from 15 commits
Commits
33 commits
1cb9716
ENH add poisson splitting criterion for trees
May 28, 2020
97fb35e
TST add tests for poisson splitting in trees
May 29, 2020
3bf7fce
BUG fix and test fobidden zero node split for Poisson
May 29, 2020
e5ab4dc
TST include poisson criterion in test_diabetes_underfit
Jun 2, 2020
463b858
address review comments
Jun 3, 2020
736a872
DOC add Poisson deviance as impurity in user guide
Jun 4, 2020
0c8e8ec
DOC improve tree section of user guide
Jun 4, 2020
68126db
DOC miiiiiinor text improvement
Jun 4, 2020
2f35789
Merge branch 'master' into poisson_tree
Jul 4, 2020
5653962
address review comments
Jul 6, 2020
d412cf6
Check positive child sums before calculating means
Jul 17, 2020
57a3cc7
trigger doc build
lorentzenchr Aug 19, 2020
062865d
Merge branch 'master' into poisson_tree
lorentzenchr Aug 19, 2020
a87f8d7
address review comments
lorentzenchr Aug 19, 2020
3438e45
TST compare poisson split vs mse
lorentzenchr Aug 19, 2020
eda5b14
MNT reduce redundant code with helper poisson_loss
lorentzenchr Aug 20, 2020
93e2b12
TST make test_poisson_vs_mse pass
lorentzenchr Aug 21, 2020
a5d84bb
TST improve test_diabetes_underfit
lorentzenchr Aug 21, 2020
d07efd2
DOC consistent formulae
lorentzenchr Aug 21, 2020
cc69c14
Merge branch 'master' into poisson_tree
lorentzenchr Aug 23, 2020
6571d52
DOC add hint as to when to use Poisson criterion
lorentzenchr Aug 23, 2020
d2ebc06
DOX fix fomula super- and subscripts
lorentzenchr Sep 14, 2020
71ff59c
address review comments
lorentzenchr Oct 4, 2020
521cf99
TST add further test for forbidden zero nodes
lorentzenchr Oct 4, 2020
82e3586
BUG check for y_mean <= 0 in poisson_loss
lorentzenchr Oct 4, 2020
e63f8ad
CLN remove FIXME tag
lorentzenchr Oct 4, 2020
d741303
CLN address review comments
lorentzenchr Oct 7, 2020
2a7c836
TST add tests for sample_weight consistency
lorentzenchr Oct 10, 2020
167f166
DOC add what_new entry
lorentzenchr Oct 10, 2020
da25541
DOC mention poisson and mae fit slower than mse
lorentzenchr Oct 10, 2020
2fd61aa
TST add comment about not touching original data
lorentzenchr Nov 2, 2020
8800269
CLN nicer code formating
lorentzenchr Nov 2, 2020
e329a76
TST parametrize metric as well
lorentzenchr Nov 2, 2020
55 changes: 36 additions & 19 deletions doc/modules/tree.rst
@@ -10,7 +10,7 @@ Decision Trees
for :ref:`classification <tree_classification>` and :ref:`regression
<tree_regression>`. The goal is to create a model that predicts the value of a
target variable by learning simple decision rules inferred from the data
features.
features. A tree can be seen as a piecewise constant approximation.

For instance, in the example below, decision trees learn from data to
approximate a sine curve with a set of if-then-else decision rules. The deeper
@@ -33,8 +33,8 @@ Some advantages of decision trees are:
- The cost of using the tree (i.e., predicting data) is logarithmic in the
number of data points used to train the tree.

- Able to handle both numerical and categorical data. However scikit-learn
implementation does not support categorical variables for now. Other
- Able to handle both numerical and categorical data. However scikit-learn
implementation does not support categorical variables for now. Other
techniques are usually specialised in analysing datasets that have only one type
of variable. See :ref:`algorithms <tree_algorithms>` for more
information.
@@ -66,6 +66,10 @@ The disadvantages of decision trees include:
This problem is mitigated by using decision trees within an
ensemble.

- Predictions of decision trees are neither smooth nor continuous, but
piecewise constant approximations as seen in the above figure. Therefore,
they are not good at extrapolation.

- The problem of learning an optimal decision tree is known to be
NP-complete under several aspects of optimality and even for simple
concepts. Consequently, practical decision-tree learning algorithms
@@ -112,7 +116,7 @@ probability, the classifier will predict the class with the lowest index
amongst those classes.

As an alternative to outputting a specific class, the probability of each class
can be predicted, which is the fraction of training samples of the class in a
can be predicted, which is the fraction of training samples of the class in a
leaf::

>>> clf.predict_proba([[2., 2.]])
@@ -429,8 +433,9 @@ Mathematical formulation
========================

Given training vectors :math:`x_i \in R^n`, i=1,..., l and a label vector
:math:`y \in R^l`, a decision tree recursively partitions the space such
that the samples with the same labels are grouped together.
:math:`y \in R^l`, a decision tree recursively partitions the feature space
such that the samples with the same labels or similar target values are grouped
together.

Let the data at node :math:`m` be represented by :math:`Q`. For
each candidate split :math:`\theta = (j, t_m)` consisting of a
@@ -443,9 +448,9 @@ feature :math:`j` and threshold :math:`t_m`, partition the data into

Q_{right}(\theta) = Q \setminus Q_{left}(\theta)
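
As a rough illustration (not part of this patch), a candidate split simply
partitions the samples of a node by comparing one feature against a threshold;
the arrays `X`, `y` and the values `j`, `t_m` below are hypothetical:

    import numpy as np

    # Hypothetical node data: features X, targets y, feature index j, threshold t_m.
    X = np.array([[0.5], [1.5], [2.5], [3.5]])
    y = np.array([0.0, 1.0, 2.0, 4.0])
    j, t_m = 0, 2.0

    mask = X[:, j] <= t_m                  # Q_left(theta): samples with x_ij <= t_m
    y_left, y_right = y[mask], y[~mask]    # Q_right(theta) = Q \ Q_left(theta)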

The impurity at :math:`m` is computed using an impurity function
:math:`H()`, the choice of which depends on the task being solved
(classification or regression)
The quality of a candidate split of node :math:`m` is then computed using an
impurity function or loss function :math:`H()`, the choice of which depends on
the task being solved (classification or regression)

.. math::

@@ -473,37 +478,40 @@ observations, let

p_{mk} = 1/ N_m \sum_{x_i \in R_m} I(y_i = k)

be the proportion of class k observations in node :math:`m`
be the proportion of class k observations in node :math:`m`. If :math:`m` is a
terminal node, `predict_proba` for this region is set to :math:`p_{mk}`.
Common measures of impurity are the following.

Common measures of impurity are Gini
Gini:

.. math::

H(X_m) = \sum_k p_{mk} (1 - p_{mk})

Entropy
Entropy:

.. math::

H(X_m) = - \sum_k p_{mk} \log(p_{mk})

and Misclassification
Misclassification:

.. math::

H(X_m) = 1 - \max(p_{mk})

where :math:`X_m` is the training data in node :math:`m`
Here, :math:`X_m` is the training data in node :math:`m`.
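
For illustration only (not part of the patch), these three impurities can be
evaluated directly from a node's class proportions; the proportions below are
made up:

    import numpy as np

    p = np.array([0.7, 0.2, 0.1])       # hypothetical p_mk for a single node m

    gini = np.sum(p * (1 - p))          # H(X_m) = sum_k p_mk (1 - p_mk)
    entropy = -np.sum(p * np.log(p))    # H(X_m) = -sum_k p_mk log(p_mk)
    misclassification = 1 - np.max(p)   # H(X_m) = 1 - max_k(p_mk)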

Regression criteria
-------------------

If the target is a continuous value, then for node :math:`m`,
representing a region :math:`R_m` with :math:`N_m` observations, common
criteria to minimise as for determining locations for future
splits are Mean Squared Error, which minimizes the L2 error
using mean values at terminal nodes, and Mean Absolute Error, which
minimizes the L1 error using median values at terminal nodes.
criteria to minimize when determining locations for future splits are Mean
Squared Error (MSE or L2 error), Poisson deviance, and Mean Absolute
Error (MAE or L1 error). MSE and Poisson deviance both set the predicted value
of terminal nodes to the learned mean value of the node, whereas MAE sets
the predicted value of terminal nodes to the median.

Mean Squared Error:

@@ -513,6 +521,15 @@

H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} (y_i - \bar{y}_m)^2

Half Poisson deviance:

.. math::

\bar{y}_m = \frac{1}{N_m} \sum_{i \in N_m} y_i

H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} (y_i \log\frac{y_i}{\bar{y}_m}
- y_i + \bar{y}_m)

Mean Absolute Error:

.. math::
@@ -521,7 +538,7 @@ Mean Absolute Error:

H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} |y_i - median(y)_m|

where :math:`X_m` is the training data in node :math:`m`
Again, :math:`X_m` is the training data in node :math:`m`.
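
As a minimal sketch of these three node criteria (illustrative only: the target
values are made up, and the convention that :math:`y_i \log(y_i/\bar{y}_m)` is
taken as 0 when :math:`y_i = 0` is assumed here):

    import numpy as np

    y = np.array([0.0, 1.0, 2.0, 4.0])   # hypothetical non-negative targets in node m
    y_mean, y_median = y.mean(), np.median(y)

    mse = np.mean((y - y_mean) ** 2)
    mae = np.mean(np.abs(y - y_median))
    # Half Poisson deviance, treating y_i * log(y_i / y_mean) as 0 when y_i == 0.
    log_term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / y_mean), 0.0)
    half_poisson_deviance = np.mean(log_term - y + y_mean)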


.. _minimal_cost_complexity_pruning:
36 changes: 27 additions & 9 deletions sklearn/tree/_classes.py
@@ -60,9 +60,12 @@
DTYPE = _tree.DTYPE
DOUBLE = _tree.DOUBLE

CRITERIA_CLF = {"gini": _criterion.Gini, "entropy": _criterion.Entropy}
CRITERIA_REG = {"mse": _criterion.MSE, "friedman_mse": _criterion.FriedmanMSE,
"mae": _criterion.MAE}
CRITERIA_CLF = {"gini": _criterion.Gini,
"entropy": _criterion.Entropy}
CRITERIA_REG = {"mse": _criterion.MSE,
"friedman_mse": _criterion.FriedmanMSE,
"mae": _criterion.MAE,
"poisson": _criterion.Poisson}

DENSE_SPLITTERS = {"best": _splitter.BestSplitter,
"random": _splitter.RandomSplitter}
@@ -161,6 +164,14 @@ def fit(self, X, y, sample_weight=None, check_input=True,
raise ValueError("No support for np.int64 index based "
"sparse matrices")

if self.criterion == "poisson":
if np.any(y < 0):
raise ValueError("Some value(s) of y are negative which is"
" not allowed for Poisson regression.")
if np.sum(y) <= 0:
raise ValueError("Sum of y is not positive which is "
"necessary for Poisson regression.")

# Determine output settings
n_samples, self.n_features_ = X.shape
is_classification = is_classifier(self)
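
A hedged sketch of how these new checks surface through the public API
(assuming a build that includes this branch; the error messages are the ones
added above):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    X = np.array([[0.0], [1.0], [2.0]])
    reg = DecisionTreeRegressor(criterion="poisson")

    try:
        reg.fit(X, np.array([1.0, -1.0, 2.0]))   # a negative target value
    except ValueError as exc:
        print(exc)  # Some value(s) of y are negative which is not allowed for Poisson regression.

    try:
        reg.fit(X, np.array([0.0, 0.0, 0.0]))    # sum of y is not positive
    except ValueError as exc:
        print(exc)  # Sum of y is not positive which is necessary for Poisson regression.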
@@ -973,18 +984,22 @@ class DecisionTreeRegressor(RegressorMixin, BaseDecisionTree):

Parameters
----------
criterion : {"mse", "friedman_mse", "mae"}, default="mse"
criterion : {"mse", "friedman_mse", "mae", "poisson"}, default="mse"
The function to measure the quality of a split. Supported criteria
are "mse" for the mean squared error, which is equal to variance
reduction as feature selection criterion and minimizes the L2 loss
using the mean of each terminal node, "friedman_mse", which uses mean
squared error with Friedman's improvement score for potential splits,
and "mae" for the mean absolute error, which minimizes the L1 loss
using the median of each terminal node.
"mae" for the mean absolute error, which minimizes the L1 loss using
the median of each terminal node, and "poisson" which uses reduction in
Poisson deviance to find splits.

.. versionadded:: 0.18
Mean Absolute Error (MAE) criterion.

.. versionadded:: 0.24
Poisson deviance criterion.

splitter : {"best", "random"}, default="best"
The strategy used to choose the split at each node. Supported
strategies are "best" to choose the best split and "random" to choose
@@ -1521,15 +1536,18 @@ class ExtraTreeRegressor(DecisionTreeRegressor):

Parameters
----------
criterion : {"mse", "friedman_mse", "mae"}, default="mse"
criterion : {"mse", "friedman_mse", "mae", "poisson"}, default="mse"
The function to measure the quality of a split. Supported criteria
are "mse" for the mean squared error, which is equal to variance
reduction as feature selection criterion, and "mae" for the mean
absolute error.
reduction as feature selection criterion, "mae" for the mean absolute
error and "poisson" for the Poisson deviance.

.. versionadded:: 0.18
Mean Absolute Error (MAE) criterion.

.. versionadded:: 0.24
Poisson deviance criterion.

splitter : {"random", "best"}, default="random"
The strategy used to choose the split at each node. Supported
strategies are "best" to choose the best split and "random" to choose