DOC Expand Brier score, fix docstring #18051

Merged 7 commits on Aug 13, 2020
27 changes: 24 additions & 3 deletions doc/modules/model_evaluation.rst
@@ -1505,9 +1505,9 @@ for binary classes. Quoting Wikipedia:
This function returns a score of the mean square difference between the actual
outcome and the predicted probability of the possible outcome. The actual
outcome has to be 1 or 0 (true or false), while the predicted probability of
-the actual outcome can be a value between 0 and 1.
+the actual outcome can be a value between 0 and 1 [Brier1950]_.

-The brier score loss is also between 0 to 1 and the lower the score (the mean
+The Brier score loss is also between 0 and 1, and the lower the score (the mean
square difference is smaller), the more accurate the prediction is. It can be
thought of as a measure of the "calibration" of a set of probabilistic
predictions.
@@ -1536,6 +1536,16 @@ Here is a small example of usage of this function::
>>> brier_score_loss(y_true, y_prob > 0.5)
0.0
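
As a concrete check of the definition above, here is a minimal sketch (the
inputs are invented for illustration; only ``brier_score_loss`` itself comes
from scikit-learn) showing that the score is just the mean squared difference
computed by hand::

    >>> import numpy as np
    >>> from sklearn.metrics import brier_score_loss
    >>> y_true = np.array([0, 1, 1, 0])          # actual outcomes
    >>> y_prob = np.array([0.1, 0.9, 0.8, 0.3])  # predicted P(y == 1)
    >>> np.mean((y_prob - y_true) ** 2)          # mean squared difference, by hand
    0.037...
    >>> brier_score_loss(y_true, y_prob)
    0.037...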

+The Brier score can be used to assess how well a classifier is calibrated.
+However, a lower Brier score does not always mean better calibration. This is
+because the Brier score can be decomposed as the sum of calibration loss and
+refinement loss [Bella2012]_. Calibration loss is defined as the mean squared
+deviation from empirical probabilities derived from the slope of ROC segments.
+Refinement loss can be defined as the expected optimal loss as measured by the
+area under the optimal cost curve. Refinement loss can change independently
+of calibration loss; thus a lower Brier score does not necessarily mean a
+better-calibrated model. "Only when refinement loss remains the same does a
+lower Brier score always mean better calibration" [Bella2012]_, [Flach2008]_.
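
A toy illustration of this caveat (the data and probabilities below are
invented for this sketch): predictor ``B`` attains a much lower Brier score
than predictor ``A``, yet ``A`` is the one that is perfectly calibrated, since
it predicts the true base rate of 0.5 everywhere, while ``B``'s predictions of
0.25 and 0.75 correspond to empirical frequencies of 0 and 1::

    >>> import numpy as np
    >>> from sklearn.metrics import brier_score_loss
    >>> y_true = np.array([0, 1, 0, 1])
    >>> prob_a = np.array([0.5, 0.5, 0.5, 0.5])      # calibrated, unrefined
    >>> prob_b = np.array([0.25, 0.75, 0.25, 0.75])  # refined, underconfident
    >>> brier_score_loss(y_true, prob_a)
    0.25
    >>> brier_score_loss(y_true, prob_b)
    0.0625

Here ``B``'s advantage comes entirely from lower refinement loss; its
calibration is worse, which is exactly why a lower Brier score alone cannot
establish better calibration.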

.. topic:: Example:

@@ -1545,10 +1555,21 @@ Here is a small example of usage of this function::

.. topic:: References:

-.. * G. Brier, `Verification of forecasts expressed in terms of probability
+.. [Brier1950] G. Brier, `Verification of forecasts expressed in terms of
+   probability
<ftp://ftp.library.noaa.gov/docs.lib/htdocs/rescue/mwr/078/mwr-078-01-0001.pdf>`_,
Monthly weather review 78.1 (1950)

+.. [Bella2012] Bella, Ferri, Hernández-Orallo, and Ramírez-Quintana
+   `"Calibration of Machine Learning Models"
+   <http://dmip.webs.upv.es/papers/BFHRHandbook2010.pdf>`_
+   in Khosrow-Pour, M. "Machine learning: concepts, methodologies, tools
+   and applications." Hershey, PA: Information Science Reference (2012).

+.. [Flach2008] Flach, Peter, and Edson Matsubara. `"On classification, ranking,
+   and probability estimation." <https://drops.dagstuhl.de/opus/volltexte/2008/1382/>`_
+   Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2008).

.. _multilabel_ranking_metrics:

Multilabel ranking metrics
17 changes: 9 additions & 8 deletions sklearn/metrics/_classification.py
@@ -2382,23 +2382,24 @@ def brier_score_loss(y_true, y_prob, *, sample_weight=None, pos_label=None):
"""Compute the Brier score.

The smaller the Brier score, the better, hence the naming with "loss".
-Across all items in a set N predictions, the Brier score measures the
-mean squared difference between (1) the predicted probability assigned
-to the possible outcomes for item i, and (2) the actual outcome.
-Therefore, the lower the Brier score is for a set of predictions, the
-better the predictions are calibrated. Note that the Brier score always
+The Brier score measures the mean squared difference between the predicted
+probability and the actual outcome. The Brier score always
takes on a value between zero and one, since this is the largest
possible difference between a predicted probability (which must be
between zero and one) and the actual outcome (which can take on values
-of only 0 and 1). The Brier loss is composed of refinement loss and
+of only 0 and 1). It can be decomposed as the sum of refinement loss and
calibration loss.

The Brier score is appropriate for binary and categorical outcomes that
can be structured as true or false, but is inappropriate for ordinal
variables which can take on three or more values (this is because the
Brier score assumes that all possible outcomes are equivalently
"distant" from one another). Which label is considered to be the positive
-label is controlled via the parameter pos_label, which defaults to 1.
-Read more in the :ref:`User Guide <calibration>`.
+label is controlled via the parameter `pos_label`, which defaults to
+the greater label unless `y_true` is all 0 or all -1, in which case
+`pos_label` defaults to 1.
+
+Read more in the :ref:`User Guide <brier_score_loss>`.

Parameters
----------
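
To illustrate the ``pos_label`` behaviour described in the updated docstring,
a short sketch (the string labels and probabilities are invented for
illustration)::

    >>> from sklearn.metrics import brier_score_loss
    >>> y_true = ["spam", "ham", "ham", "spam"]
    >>> y_prob = [0.1, 0.9, 0.8, 0.3]  # predicted probability of "ham"
    >>> brier_score_loss(y_true, y_prob, pos_label="ham")
    0.037...

Left at its default, ``pos_label`` would be the greater of the two labels
("spam"), so it is passed explicitly here because ``y_prob`` holds
probabilities for "ham".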