[MRG] ENH add Poisson splitting criterion for single trees #17386

Merged
merged 33 commits into from Nov 2, 2020
Changes from 15 commits
Commits
33 commits
1cb9716
ENH add poisson splitting criterion for trees
May 28, 2020
97fb35e
TST add tests for poisson splitting in trees
May 29, 2020
3bf7fce
BUG fix and test fobidden zero node split for Poisson
May 29, 2020
e5ab4dc
TST include poisson criterion in test_diabetes_underfit
Jun 2, 2020
463b858
address review comments
Jun 3, 2020
736a872
DOC add Poisson deviance as impurity in user guide
Jun 4, 2020
0c8e8ec
DOC improve tree section of user guide
Jun 4, 2020
68126db
DOC miiiiiinor text improvement
Jun 4, 2020
2f35789
Merge branch 'master' into poisson_tree
Jul 4, 2020
5653962
address review comments
Jul 6, 2020
d412cf6
Check positive child sums before calculating means
Jul 17, 2020
57a3cc7
trigger doc build
lorentzenchr Aug 19, 2020
062865d
Merge branch 'master' into poisson_tree
lorentzenchr Aug 19, 2020
a87f8d7
address review comments
lorentzenchr Aug 19, 2020
3438e45
TST compare poisson split vs mse
lorentzenchr Aug 19, 2020
eda5b14
MNT reduce redundant code with helper poisson_loss
lorentzenchr Aug 20, 2020
93e2b12
TST make test_poisson_vs_mse pass
lorentzenchr Aug 21, 2020
a5d84bb
TST improve test_diabetes_underfit
lorentzenchr Aug 21, 2020
d07efd2
DOC consistent formulae
lorentzenchr Aug 21, 2020
cc69c14
Merge branch 'master' into poisson_tree
lorentzenchr Aug 23, 2020
6571d52
DOC add hint as to when to use Poisson criterion
lorentzenchr Aug 23, 2020
d2ebc06
DOX fix fomula super- and subscripts
lorentzenchr Sep 14, 2020
71ff59c
address review comments
lorentzenchr Oct 4, 2020
521cf99
TST add further test for forbidden zero nodes
lorentzenchr Oct 4, 2020
82e3586
BUG check for y_mean <= 0 in poisson_loss
lorentzenchr Oct 4, 2020
e63f8ad
CLN remove FIXME tag
lorentzenchr Oct 4, 2020
d741303
CLN address review comments
lorentzenchr Oct 7, 2020
2a7c836
TST add tests for sample_weight consistency
lorentzenchr Oct 10, 2020
167f166
DOC add what_new entry
lorentzenchr Oct 10, 2020
da25541
DOC mention poisson and mae fit slower than mse
lorentzenchr Oct 10, 2020
2fd61aa
TST add comment about not touching original data
lorentzenchr Nov 2, 2020
8800269
CLN nicer code formating
lorentzenchr Nov 2, 2020
e329a76
TST parametrize metric as well
lorentzenchr Nov 2, 2020
55 changes: 36 additions & 19 deletions doc/modules/tree.rst
@@ -10,7 +10,7 @@ Decision Trees
for :ref:`classification <tree_classification>` and :ref:`regression
<tree_regression>`. The goal is to create a model that predicts the value of a
target variable by learning simple decision rules inferred from the data
features.
features. A tree can be seen as a piecewise constant approximation.

For instance, in the example below, decision trees learn from data to
approximate a sine curve with a set of if-then-else decision rules. The deeper
@@ -33,8 +33,8 @@ Some advantages of decision trees are:
- The cost of using the tree (i.e., predicting data) is logarithmic in the
number of data points used to train the tree.

- Able to handle both numerical and categorical data. However scikit-learn
implementation does not support categorical variables for now. Other
- Able to handle both numerical and categorical data. However scikit-learn
implementation does not support categorical variables for now. Other
techniques are usually specialised in analysing datasets that have only one type
of variable. See :ref:`algorithms <tree_algorithms>` for more
information.
@@ -66,6 +66,10 @@ The disadvantages of decision trees include:
This problem is mitigated by using decision trees within an
ensemble.

- Predictions of decision trees are neither smooth nor continuous, but
piecewise constant approximations as seen in the above figure. Therefore,
they are not good at extrapolation.

- The problem of learning an optimal decision tree is known to be
NP-complete under several aspects of optimality and even for simple
concepts. Consequently, practical decision-tree learning algorithms
@@ -112,7 +116,7 @@ probability, the classifier will predict the class with the lowest index
amongst those classes.

As an alternative to outputting a specific class, the probability of each class
can be predicted, which is the fraction of training samples of the class in a
can be predicted, which is the fraction of training samples of the class in a
leaf::

>>> clf.predict_proba([[2., 2.]])
@@ -429,8 +433,9 @@ Mathematical formulation
========================

Given training vectors :math:`x_i \in R^n`, i=1,..., l and a label vector
:math:`y \in R^l`, a decision tree recursively partitions the space such
that the samples with the same labels are grouped together.
:math:`y \in R^l`, a decision tree recursively partitions the feature space
such that the samples with the same labels or similar target values are grouped
together.

Let the data at node :math:`m` be represented by :math:`Q`. For
each candidate split :math:`\theta = (j, t_m)` consisting of a
@@ -443,9 +448,9 @@ feature :math:`j` and threshold :math:`t_m`, partition the data into

Q_{right}(\theta) = Q \setminus Q_{left}(\theta)
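
As a rough illustration (not part of this patch), a candidate split simply
partitions the samples of a node by comparing one feature against a threshold;
the arrays `X`, `y` and the values `j`, `t_m` below are hypothetical:

    import numpy as np

    # Hypothetical node data: features X, targets y, feature index j, threshold t_m.
    X = np.array([[0.5], [1.5], [2.5], [3.5]])
    y = np.array([0.0, 1.0, 2.0, 4.0])
    j, t_m = 0, 2.0

    mask = X[:, j] <= t_m                  # Q_left(theta): samples with x_ij <= t_m
    y_left, y_right = y[mask], y[~mask]    # Q_right(theta) = Q \ Q_left(theta)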

The impurity at :math:`m` is computed using an impurity function
:math:`H()`, the choice of which depends on the task being solved
(classification or regression)
The quality of a candidate split of node :math:`m` is then computed using an
impurity function or loss function :math:`H()`, the choice of which depends on
the task being solved (classification or regression)

.. math::

@@ -473,37 +478,40 @@ observations, let

p_{mk} = 1/ N_m \sum_{x_i \in R_m} I(y_i = k)

be the proportion of class k observations in node :math:`m`
be the proportion of class k observations in node :math:`m`. If :math:`m` is a
terminal node, `predict_proba` for this region is set to :math:`p_{mk}`.
Common measures of impurity are the following.

Common measures of impurity are Gini
Gini:

.. math::

H(X_m) = \sum_k p_{mk} (1 - p_{mk})

Entropy
Entropy:

.. math::

H(X_m) = - \sum_k p_{mk} \log(p_{mk})

and Misclassification
Misclassification:

.. math::

H(X_m) = 1 - \max(p_{mk})

where :math:`X_m` is the training data in node :math:`m`
Here, :math:`X_m` is the training data in node :math:`m`.
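
For illustration only (not part of the patch), these three impurities can be
evaluated directly from a node's class proportions; the proportions below are
made up:

    import numpy as np

    p = np.array([0.7, 0.2, 0.1])       # hypothetical p_mk for a single node m

    gini = np.sum(p * (1 - p))          # H(X_m) = sum_k p_mk (1 - p_mk)
    entropy = -np.sum(p * np.log(p))    # H(X_m) = -sum_k p_mk log(p_mk)
    misclassification = 1 - np.max(p)   # H(X_m) = 1 - max_k(p_mk)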

Regression criteria
-------------------

If the target is a continuous value, then for node :math:`m`,
representing a region :math:`R_m` with :math:`N_m` observations, common
criteria to minimise as for determining locations for future
splits are Mean Squared Error, which minimizes the L2 error
using mean values at terminal nodes, and Mean Absolute Error, which
minimizes the L1 error using median values at terminal nodes.
criteria to minimize when determining locations for future splits are Mean
Squared Error (MSE or L2 error), Poisson deviance, and Mean Absolute
Error (MAE or L1 error). MSE and Poisson deviance both set the predicted value
of terminal nodes to the learned mean value of the node, whereas MAE sets
the predicted value of terminal nodes to the median.

Mean Squared Error:

@@ -513,6 +521,15 @@

H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} (y_i - \bar{y}_m)^2

Half Poisson deviance:

.. math::

\bar{y}_m = \frac{1}{N_m} \sum_{i \in N_m} y_i

H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} (y_i \log\frac{y_i}{\bar{y}_m}
- y_i + \bar{y}_m)

Mean Absolute Error:

.. math::
@@ -521,7 +538,7 @@ Mean Absolute Error:

H(X_m) = \frac{1}{N_m} \sum_{i \in N_m} |y_i - median(y)_m|

where :math:`X_m` is the training data in node :math:`m`
Again, :math:`X_m` is the training data in node :math:`m`.
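
As a minimal sketch of these three node criteria (illustrative only: the target
values are made up, and the convention that :math:`y_i \log(y_i/\bar{y}_m)` is
taken as 0 when :math:`y_i = 0` is assumed here):

    import numpy as np

    y = np.array([0.0, 1.0, 2.0, 4.0])   # hypothetical non-negative targets in node m
    y_mean, y_median = y.mean(), np.median(y)

    mse = np.mean((y - y_mean) ** 2)
    mae = np.mean(np.abs(y - y_median))
    # Half Poisson deviance, treating y_i * log(y_i / y_mean) as 0 when y_i == 0.
    log_term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / y_mean), 0.0)
    half_poisson_deviance = np.mean(log_term - y + y_mean)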


.. _minimal_cost_complexity_pruning:
36 changes: 27 additions & 9 deletions sklearn/tree/_classes.py
@@ -60,9 +60,12 @@
DTYPE = _tree.DTYPE
DOUBLE = _tree.DOUBLE

CRITERIA_CLF = {"gini": _criterion.Gini, "entropy": _criterion.Entropy}
CRITERIA_REG = {"mse": _criterion.MSE, "friedman_mse": _criterion.FriedmanMSE,
"mae": _criterion.MAE}
CRITERIA_CLF = {"gini": _criterion.Gini,
"entropy": _criterion.Entropy}
CRITERIA_REG = {"mse": _criterion.MSE,
"friedman_mse": _criterion.FriedmanMSE,
"mae": _criterion.MAE,
"poisson": _criterion.Poisson}

DENSE_SPLITTERS = {"best": _splitter.BestSplitter,
"random": _splitter.RandomSplitter}
@@ -161,6 +164,14 @@ def fit(self, X, y, sample_weight=None, check_input=True,
raise ValueError("No support for np.int64 index based "
"sparse matrices")

if self.criterion == "poisson":
if np.any(y < 0):
raise ValueError("Some value(s) of y are negative which is"
" not allowed for Poisson regression.")
if np.sum(y) <= 0:
raise ValueError("Sum of y is not positive which is "
"necessary for Poisson regression.")

# Determine output settings
n_samples, self.n_features_ = X.shape
is_classification = is_classifier(self)
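
A hedged sketch of how these new checks surface through the public API
(assuming a build that includes this branch; the error messages are the ones
added above):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    X = np.array([[0.0], [1.0], [2.0]])
    reg = DecisionTreeRegressor(criterion="poisson")

    try:
        reg.fit(X, np.array([1.0, -1.0, 2.0]))   # a negative target value
    except ValueError as exc:
        print(exc)  # Some value(s) of y are negative which is not allowed for Poisson regression.

    try:
        reg.fit(X, np.array([0.0, 0.0, 0.0]))    # sum of y is not positive
    except ValueError as exc:
        print(exc)  # Sum of y is not positive which is necessary for Poisson regression.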
@@ -973,18 +984,22 @@ class DecisionTreeRegressor(RegressorMixin, BaseDecisionTree):

Parameters
----------
criterion : {"mse", "friedman_mse", "mae"}, default="mse"
criterion : {"mse", "friedman_mse", "mae", "poisson"}, default="mse"
The function to measure the quality of a split. Supported criteria
are "mse" for the mean squared error, which is equal to variance
reduction as feature selection criterion and minimizes the L2 loss
using the mean of each terminal node, "friedman_mse", which uses mean
squared error with Friedman's improvement score for potential splits,
and "mae" for the mean absolute error, which minimizes the L1 loss
using the median of each terminal node.
"mae" for the mean absolute error, which minimizes the L1 loss using
the median of each terminal node, and "poisson" which uses reduction in
Poisson deviance to find splits.

.. versionadded:: 0.18
Mean Absolute Error (MAE) criterion.

.. versionadded:: 0.24
Poisson deviance criterion.

splitter : {"best", "random"}, default="best"
The strategy used to choose the split at each node. Supported
strategies are "best" to choose the best split and "random" to choose
@@ -1521,15 +1536,18 @@ class ExtraTreeRegressor(DecisionTreeRegressor):

Parameters
----------
criterion : {"mse", "friedman_mse", "mae"}, default="mse"
criterion : {"mse", "friedman_mse", "mae", "poisson"}, default="mse"
The function to measure the quality of a split. Supported criteria
are "mse" for the mean squared error, which is equal to variance
reduction as feature selection criterion, and "mae" for the mean
absolute error.
reduction as feature selection criterion, "mae" for the mean absolute
error and "poisson" for the Poisson deviance.

.. versionadded:: 0.18
Mean Absolute Error (MAE) criterion.

.. versionadded:: 0.24
Poisson deviance criterion.

splitter : {"random", "best"}, default="random"
The strategy used to choose the split at each node. Supported
strategies are "best" to choose the best split and "random" to choose