[MRG + 1] DOC Fix Sphinx errors (#9420)
* Fix Rouseeuw1984 broken link

* Change label vbgmm to bgmm
Previously modified with PR #6651

* Change tag name
The old name now refers to the new tag added with PR #7388

* Remove prefix underscore to match tag

* Realign to fit 80 chars

* Link to metrics.rst.
Pairwise metrics are yet to be documented

* Remove tag as LSHForest is deprecated

* Remove all references to randomized_l1 and sphx_glr_auto_examples_linear_model_plot_sparse_recovery.py.
Both are deprecated.

* Fix a few Sphinx warnings

* Realign to 80 chars

* Changes based on PR review

* Remove unused ref in calibration

* Fix link ref in covariance.rst

* Fix linking issues

* Differentiate Rouseeuw1999 tag within file.

* Change all duplicate Rouseeuw1999 tags

* Remove numbers from tag Rousseeuw
Balakumaran Manoharan authored and jnothman committed Jul 30, 2017
1 parent f6c7080 commit 04be1a9
Showing 22 changed files with 100 additions and 102 deletions.
24 changes: 12 additions & 12 deletions doc/modules/calibration.rst
@@ -44,7 +44,7 @@ with different biases per method:
* :class:`RandomForestClassifier` shows the opposite behavior: the histograms
show peaks at approximately 0.2 and 0.9 probability, while probabilities close to
0 or 1 are very rare. An explanation for this is given by Niculescu-Mizil
and Caruana [4]: "Methods such as bagging and random forests that average
and Caruana [4]_: "Methods such as bagging and random forests that average
predictions from a base set of models can have difficulty making predictions
near 0 and 1 because variance in the underlying base models will bias
predictions that should be near zero or one away from these values. Because
@@ -57,15 +57,15 @@ with different biases per method:
ensemble away from 0. We observe this effect most strongly with random
forests because the base-level trees trained with random forests have
relatively high variance due to feature subseting." As a result, the
calibration curve also referred to as the reliability diagram (Wilks 1995[5]) shows a
calibration curve also referred to as the reliability diagram (Wilks 1995 [5]_) shows a
characteristic sigmoid shape, indicating that the classifier could trust its
"intuition" more and return probabilties closer to 0 or 1 typically.

.. currentmodule:: sklearn.svm

* Linear Support Vector Classification (:class:`LinearSVC`) shows an even more sigmoid curve
than the RandomForestClassifier, which is typical for maximum-margin methods
(compare Niculescu-Mizil and Caruana [4]), which focus on hard samples
(compare Niculescu-Mizil and Caruana [4]_), which focus on hard samples
that are close to the decision boundary (the support vectors).

.. currentmodule:: sklearn.calibration
@@ -190,18 +190,18 @@ a similar decrease in log-loss.

.. topic:: References:

.. [1] Obtaining calibrated probability estimates from decision trees
and naive Bayesian classifiers, B. Zadrozny & C. Elkan, ICML 2001
* Obtaining calibrated probability estimates from decision trees
and naive Bayesian classifiers, B. Zadrozny & C. Elkan, ICML 2001

.. [2] Transforming Classifier Scores into Accurate Multiclass
Probability Estimates, B. Zadrozny & C. Elkan, (KDD 2002)
* Transforming Classifier Scores into Accurate Multiclass
Probability Estimates, B. Zadrozny & C. Elkan, (KDD 2002)

.. [3] Probabilistic Outputs for Support Vector Machines and Comparisons to
Regularized Likelihood Methods, J. Platt, (1999)
* Probabilistic Outputs for Support Vector Machines and Comparisons to
Regularized Likelihood Methods, J. Platt, (1999)

.. [4] Predicting Good Probabilities with Supervised Learning,
A. Niculescu-Mizil & R. Caruana, ICML 2005
A. Niculescu-Mizil & R. Caruana, ICML 2005
.. [5] On the combination of forecast probabilities for
consecutive precipitation periods. Wea. Forecasting, 5, 640–
650., Wilks, D. S., 1990a
consecutive precipitation periods. Wea. Forecasting, 5, 640–650.,
Wilks, D. S., 1990a
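
The reliability diagrams this file discusses can be reproduced with a short sketch; ``sklearn.calibration.calibration_curve`` is assumed for the binning, and the dataset, classifier settings, and number of bins below are arbitrary illustrative choices, not taken from the documentation::

    # Illustrative sketch: a reliability diagram for a RandomForestClassifier.
    from sklearn.calibration import calibration_curve
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    prob_pos = clf.predict_proba(X_test)[:, 1]

    # A perfectly calibrated classifier would put these points on the diagonal.
    fraction_of_positives, mean_predicted_value = calibration_curve(
        y_test, prob_pos, n_bins=10)
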
2 changes: 1 addition & 1 deletion doc/modules/clustering.rst
@@ -1343,7 +1343,7 @@ mean of homogeneity and completeness**:

.. topic:: References

.. [RH2007] `V-Measure: A conditional entropy-based external cluster evaluation
* `V-Measure: A conditional entropy-based external cluster evaluation
measure <http://aclweb.org/anthology/D/D07/D07-1043.pdf>`_
Andrew Rosenberg and Julia Hirschberg, 2007
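
As a quick sketch of the metric cited above (``v_measure_score`` and the component scores are assumptions here, not named in the hunk; the label assignments are made up for illustration), the V-measure is the harmonic mean of homogeneity and completeness::

    from sklearn import metrics

    labels_true = [0, 0, 0, 1, 1, 1]
    labels_pred = [0, 0, 1, 1, 2, 2]

    h = metrics.homogeneity_score(labels_true, labels_pred)
    c = metrics.completeness_score(labels_true, labels_pred)
    v = metrics.v_measure_score(labels_true, labels_pred)
    assert abs(v - 2 * h * c / (h + c)) < 1e-10  # harmonic mean of h and c
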

35 changes: 20 additions & 15 deletions doc/modules/covariance.rst
@@ -95,7 +95,7 @@ bias/variance trade-off, and is discussed below.
Ledoit-Wolf shrinkage
---------------------

In their 2004 paper [1], O. Ledoit and M. Wolf propose a formula so as
In their 2004 paper [1]_, O. Ledoit and M. Wolf propose a formula so as
to compute the optimal shrinkage coefficient :math:`\alpha` that
minimizes the Mean Squared Error between the estimated and the real
covariance matrix.
@@ -112,18 +112,19 @@ fitting a :class:`LedoitWolf` object to the same sample.
for visualizing the performances of the Ledoit-Wolf estimator in
terms of likelihood.

.. topic:: References:

[1] O. Ledoit and M. Wolf, "A Well-Conditioned Estimator for Large-Dimensional
Covariance Matrices", Journal of Multivariate Analysis, Volume 88, Issue 2,
February 2004, pages 365-411.
.. [1] O. Ledoit and M. Wolf, "A Well-Conditioned Estimator for Large-Dimensional
Covariance Matrices", Journal of Multivariate Analysis, Volume 88, Issue 2,
February 2004, pages 365-411.
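
A minimal sketch of the estimator described above, on toy Gaussian data (the sample size and dimension are arbitrary choices, not taken from the documentation)::

    import numpy as np
    from sklearn.covariance import EmpiricalCovariance, LedoitWolf

    rng = np.random.RandomState(0)
    X = rng.multivariate_normal(mean=np.zeros(5), cov=np.eye(5), size=40)

    lw = LedoitWolf().fit(X)
    emp = EmpiricalCovariance().fit(X)

    # The fitted shrinkage coefficient and the resulting shrunk covariance.
    print(lw.shrinkage_)
    print(lw.covariance_)
    # Compare with the unshrunk maximum-likelihood estimate.
    print(emp.covariance_)
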
.. _oracle_approximating_shrinkage:

Oracle Approximating Shrinkage
------------------------------

Under the assumption that the data are Gaussian distributed, Chen et
al. [2] derived a formula aimed at choosing a shrinkage coefficient that
al. [2]_ derived a formula aimed at choosing a shrinkage coefficient that
yields a smaller Mean Squared Error than the one given by Ledoit and
Wolf's formula. The resulting estimator is known as the Oracle
Shrinkage Approximating estimator of the covariance.
@@ -141,8 +142,10 @@ object to the same sample.
Bias-variance trade-off when setting the shrinkage: comparing the
choices of Ledoit-Wolf and OAS estimators

[2] Chen et al., "Shrinkage Algorithms for MMSE Covariance Estimation",
IEEE Trans. on Sign. Proc., Volume 58, Issue 10, October 2010.
.. topic:: References:

.. [2] Chen et al., "Shrinkage Algorithms for MMSE Covariance Estimation",
IEEE Trans. on Sign. Proc., Volume 58, Issue 10, October 2010.
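
A short sketch comparing the two shrinkage formulas on Gaussian toy data (the ``OAS`` class and the ``shrinkage_`` attribute are assumptions; the data and sizes are arbitrary)::

    import numpy as np
    from sklearn.covariance import OAS, LedoitWolf

    rng = np.random.RandomState(0)
    X = rng.multivariate_normal(mean=np.zeros(5), cov=np.eye(5), size=30)

    # Under the Gaussian assumption, OAS targets a smaller MSE than Ledoit-Wolf.
    print("OAS shrinkage:        ", OAS().fit(X).shrinkage_)
    print("Ledoit-Wolf shrinkage:", LedoitWolf().fit(X).shrinkage_)
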
.. topic:: Examples:

@@ -266,14 +269,14 @@ perform outlier detection and discard/downweight some observations
according to further processing of the data.

The ``sklearn.covariance`` package implements a robust estimator of covariance,
the Minimum Covariance Determinant [3].
the Minimum Covariance Determinant [3]_.


Minimum Covariance Determinant
------------------------------

The Minimum Covariance Determinant estimator is a robust estimator of
a data set's covariance introduced by P.J. Rousseeuw in [3]. The idea
a data set's covariance introduced by P.J. Rousseeuw in [3]_. The idea
is to find a given proportion (h) of "good" observations which are not
outliers and compute their empirical covariance matrix. This
empirical covariance matrix is then rescaled to compensate the
@@ -283,7 +286,7 @@ weights to observations according to their Mahalanobis distance,
leading to a reweighted estimate of the covariance matrix of the data
set ("reweighting step").

Rousseeuw and Van Driessen [4] developed the FastMCD algorithm in order
Rousseeuw and Van Driessen [4]_ developed the FastMCD algorithm in order
to compute the Minimum Covariance Determinant. This algorithm is used
in scikit-learn when fitting an MCD object to data. The FastMCD
algorithm also computes a robust estimate of the data set location at
@@ -292,11 +295,13 @@ the same time.
Raw estimates can be accessed as ``raw_location_`` and ``raw_covariance_``
attributes of a :class:`MinCovDet` robust covariance estimator object.

[3] P. J. Rousseeuw. Least median of squares regression.
J. Am Stat Ass, 79:871, 1984.
[4] A Fast Algorithm for the Minimum Covariance Determinant Estimator,
1999, American Statistical Association and the American Society
for Quality, TECHNOMETRICS.
.. topic:: References:

.. [3] P. J. Rousseeuw. Least median of squares regression.
J. Am Stat Ass, 79:871, 1984.
.. [4] A Fast Algorithm for the Minimum Covariance Determinant Estimator,
1999, American Statistical Association and the American Society
for Quality, TECHNOMETRICS.
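
A minimal sketch of the MCD estimator discussed above, on toy data with a few planted outliers (the contamination pattern is arbitrary; only :class:`MinCovDet` and the attributes named in this section are taken from the documentation)::

    import numpy as np
    from sklearn.covariance import MinCovDet

    rng = np.random.RandomState(0)
    X = rng.randn(100, 3)
    X[:10] += 5.0  # a handful of gross outliers

    mcd = MinCovDet(random_state=0).fit(X)
    # Reweighted (final) estimates versus the raw FastMCD estimates.
    print(mcd.location_, mcd.covariance_)
    print(mcd.raw_location_, mcd.raw_covariance_)
    # Robust Mahalanobis distances flag the contaminated rows.
    dist = mcd.mahalanobis(X)
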
.. topic:: Examples:

2 changes: 1 addition & 1 deletion doc/modules/ensemble.rst
@@ -246,7 +246,7 @@ amount of time (e.g., on large datasets).
.. [B1998] L. Breiman, "Arcing Classifiers", Annals of Statistics 1998.
.. [GEW2006] P. Geurts, D. Ernst., and L. Wehenkel, "Extremely randomized
* P. Geurts, D. Ernst., and L. Wehenkel, "Extremely randomized
trees", Machine Learning, 63(1), 3-42, 2006.

.. _random_forest_feature_importance:
2 changes: 1 addition & 1 deletion doc/modules/linear_model.rst
@@ -1141,7 +1141,7 @@ in the following ways.

.. topic:: References:

.. [#f1] Peter J. Huber, Elvezio M. Ronchetti: Robust Statistics, Concomitant scale estimates, pg 172
* Peter J. Huber, Elvezio M. Ronchetti: Robust Statistics, Concomitant scale estimates, pg 172

Also, this estimator is different from the R implementation of Robust Regression
(http://www.ats.ucla.edu/stat/r/dae/rreg.htm) because the R implementation does a weighted least
18 changes: 9 additions & 9 deletions doc/modules/multiclass.rst
@@ -251,8 +251,8 @@ Below is an example of multiclass learning using OvO::

.. topic:: References:

.. [1] "Pattern Recognition and Machine Learning. Springer",
Christopher M. Bishop, page 183, (First Edition)
* "Pattern Recognition and Machine Learning. Springer",
Christopher M. Bishop, page 183, (First Edition)

.. _ecoc:

@@ -315,19 +315,19 @@ Below is an example of multiclass learning using Output-Codes::

.. topic:: References:

.. [2] "Solving multiclass learning problems via error-correcting output codes",
Dietterich T., Bakiri G.,
Journal of Artificial Intelligence Research 2,
1995.
* "Solving multiclass learning problems via error-correcting output codes",
Dietterich T., Bakiri G.,
Journal of Artificial Intelligence Research 2,
1995.

.. [3] "The error coding method and PICTs",
James G., Hastie T.,
Journal of Computational and Graphical statistics 7,
1998.
.. [4] "The Elements of Statistical Learning",
Hastie T., Tibshirani R., Friedman J., page 606 (second-edition)
2008.
* "The Elements of Statistical Learning",
Hastie T., Tibshirani R., Friedman J., page 606 (second-edition)
2008.
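
As a sketch of the error-correcting output-code approach these references cover (``OutputCodeClassifier`` and ``code_size`` are assumptions, not named in the hunk; the iris data and base estimator are arbitrary choices)::

    from sklearn.datasets import load_iris
    from sklearn.multiclass import OutputCodeClassifier
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)
    # code_size sets how many binary problems are solved per class.
    ecoc = OutputCodeClassifier(LinearSVC(random_state=0),
                                code_size=2, random_state=0)
    ecoc.fit(X, y)
    print(ecoc.predict(X[:5]))
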

Multioutput regression
======================
20 changes: 10 additions & 10 deletions doc/modules/outlier_detection.rst
@@ -126,8 +126,8 @@ This strategy is illustrated below.

.. topic:: References:

.. [RD1999] Rousseeuw, P.J., Van Driessen, K. "A fast algorithm for the minimum
covariance determinant estimator" Technometrics 41(3), 212 (1999)
* Rousseeuw, P.J., Van Driessen, K. "A fast algorithm for the minimum
covariance determinant estimator" Technometrics 41(3), 212 (1999)

.. _isolation_forest:

@@ -172,8 +172,8 @@ This strategy is illustrated below.

.. topic:: References:

.. [LTZ2008] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation forest."
Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on.
* Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation forest."
Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on.


Local Outlier Factor
@@ -228,7 +228,7 @@ This strategy is illustrated below.

.. topic:: References:

.. [BKNS2000] Breunig, Kriegel, Ng, and Sander (2000)
* Breunig, Kriegel, Ng, and Sander (2000)
`LOF: identifying density-based local outliers.
<http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf>`_
Proc. ACM SIGMOD
@@ -272,16 +272,16 @@ multiple modes and :class:`ensemble.IsolationForest` and
opposite, the decision rule based on fitting an
:class:`covariance.EllipticEnvelope` learns an ellipse, which
fits well the inlier distribution. The :class:`ensemble.IsolationForest`
and :class:`neighbors.LocalOutlierFactor` perform as well.
and :class:`neighbors.LocalOutlierFactor` perform as well.
- |outlier1|

*
- As the inlier distribution becomes bimodal, the
:class:`covariance.EllipticEnvelope` does not fit well the
inliers. However, we can see that :class:`ensemble.IsolationForest`,
:class:`svm.OneClassSVM` and :class:`neighbors.LocalOutlierFactor`
have difficulties to detect the two modes,
and that the :class:`svm.OneClassSVM`
:class:`svm.OneClassSVM` and :class:`neighbors.LocalOutlierFactor`
have difficulties to detect the two modes,
and that the :class:`svm.OneClassSVM`
tends to overfit: because it has no model of inliers, it
interprets a region where, by chance some outliers are
clustered, as inliers.
@@ -292,7 +292,7 @@ multiple modes and :class:`ensemble.IsolationForest` and
:class:`svm.OneClassSVM` is able to recover a reasonable
approximation as well as :class:`ensemble.IsolationForest`
and :class:`neighbors.LocalOutlierFactor`,
whereas the :class:`covariance.EllipticEnvelope` completely fails.
whereas the :class:`covariance.EllipticEnvelope` completely fails.
- |outlier3|
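
A compact sketch of the comparison above, fitting the four estimators on the same toy blobs (the dataset, ``contamination`` value and kernel settings are arbitrary illustrative choices)::

    from sklearn.covariance import EllipticEnvelope
    from sklearn.datasets import make_blobs
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.svm import OneClassSVM

    X, _ = make_blobs(n_samples=300, centers=2, random_state=0)

    estimators = {
        "EllipticEnvelope": EllipticEnvelope(contamination=0.1),
        "IsolationForest": IsolationForest(contamination=0.1, random_state=0),
        "OneClassSVM": OneClassSVM(nu=0.1, gamma=0.1),
    }
    for name, est in estimators.items():
        pred = est.fit(X).predict(X)  # +1 for inliers, -1 for outliers
        print(name, (pred == -1).sum(), "points flagged")

    # LocalOutlierFactor is used through fit_predict for this kind of check.
    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
    print("LocalOutlierFactor", (lof.fit_predict(X) == -1).sum(), "points flagged")
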

.. topic:: Examples:
2 changes: 1 addition & 1 deletion doc/tutorial/statistical_inference/putting_together.rst
@@ -17,7 +17,7 @@ can predict variables. We can also create combined estimators:
:align: right

.. literalinclude:: ../../auto_examples/plot_digits_pipe.py
:lines: 26-66
:lines: 23-63
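
In the spirit of the combined estimator this example includes (a PCA step chained with a logistic regression), a minimal sketch; the component parameters below are arbitrary and not taken from ``plot_digits_pipe.py``::

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    X, y = load_digits(return_X_y=True)
    pipe = Pipeline([('pca', PCA(n_components=20)),
                     ('logistic', LogisticRegression())])
    pipe.fit(X, y)
    print("training accuracy:", pipe.score(X, y))
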



10 changes: 5 additions & 5 deletions examples/ensemble/plot_adaboost_hastie_10_2.py
@@ -3,11 +3,11 @@
Discrete versus Real AdaBoost
=============================
This example is based on Figure 10.2 from Hastie et al 2009 [1] and illustrates
the difference in performance between the discrete SAMME [2] boosting
algorithm and real SAMME.R boosting algorithm. Both algorithms are evaluated
on a binary classification task where the target Y is a non-linear function
of 10 input features.
This example is based on Figure 10.2 from Hastie et al 2009 [1]_ and
illustrates the difference in performance between the discrete SAMME [2]_
boosting algorithm and real SAMME.R boosting algorithm. Both algorithms are
evaluated on a binary classification task where the target Y is a non-linear
function of 10 input features.
Discrete SAMME AdaBoost adapts based on errors in predicted class labels
whereas real SAMME.R uses the predicted class probabilities.
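
A minimal sketch of the comparison the docstring describes (the ``algorithm`` switch on :class:`AdaBoostClassifier` and the estimator count are assumptions; the data generator follows the Hastie et al. setting)::

    from sklearn.datasets import make_hastie_10_2
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_hastie_10_2(n_samples=4000, random_state=0)
    X_train, y_train, X_test, y_test = X[:2000], y[:2000], X[2000:], y[2000:]

    for algorithm in ("SAMME", "SAMME.R"):  # discrete vs. real boosting
        clf = AdaBoostClassifier(n_estimators=200, algorithm=algorithm,
                                 random_state=0)
        clf.fit(X_train, y_train)
        print(algorithm, "test accuracy:", clf.score(X_test, y_test))
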
4 changes: 2 additions & 2 deletions examples/ensemble/plot_adaboost_multiclass.py
@@ -3,14 +3,14 @@
Multi-class AdaBoosted Decision Trees
=====================================
This example reproduces Figure 1 of Zhu et al [1] and shows how boosting can
This example reproduces Figure 1 of Zhu et al [1]_ and shows how boosting can
improve prediction accuracy on a multi-class problem. The classification
dataset is constructed by taking a ten-dimensional standard normal distribution
and defining three classes separated by nested concentric ten-dimensional
spheres such that roughly equal numbers of samples are in each class (quantiles
of the :math:`\chi^2` distribution).
The performance of the SAMME and SAMME.R [1] algorithms are compared. SAMME.R
The performance of the SAMME and SAMME.R [1]_ algorithms are compared. SAMME.R
uses the probability estimates to update the additive model, while SAMME uses
the classifications only. As the example illustrates, the SAMME.R algorithm
typically converges faster than SAMME, achieving a lower test error with fewer
2 changes: 1 addition & 1 deletion examples/ensemble/plot_adaboost_regression.py
@@ -3,7 +3,7 @@
Decision Tree Regression with AdaBoost
======================================
A decision tree is boosted using the AdaBoost.R2 [1] algorithm on a 1D
A decision tree is boosted using the AdaBoost.R2 [1]_ algorithm on a 1D
sinusoidal dataset with a small amount of Gaussian noise.
299 boosts (300 decision trees) is compared with a single decision tree
regressor. As the number of boosts is increased the regressor can fit more
2 changes: 1 addition & 1 deletion examples/ensemble/plot_ensemble_oob.py
@@ -8,7 +8,7 @@
:math:`z_i = (x_i, y_i)`. The *out-of-bag* (OOB) error is the average error for
each :math:`z_i` calculated using predictions from the trees that do not
contain :math:`z_i` in their respective bootstrap sample. This allows the
``RandomForestClassifier`` to be fit and validated whilst being trained [1].
``RandomForestClassifier`` to be fit and validated whilst being trained [1]_.
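
A minimal sketch of the OOB bookkeeping described here (``oob_score=True`` and the ``oob_score_`` attribute are assumptions not spelled out in this docstring; the dataset and forest size are arbitrary)::

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=25, random_state=0)
    forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                    bootstrap=True, random_state=0)
    forest.fit(X, y)
    # Each sample is scored only by trees whose bootstrap sample excluded it.
    print("OOB error rate:", 1 - forest.oob_score_)
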
The example below demonstrates how the OOB error can be measured at the
addition of each new tree during training. The resulting plot allows a
2 changes: 1 addition & 1 deletion examples/ensemble/plot_gradient_boosting_regularization.py
@@ -4,7 +4,7 @@
================================
Illustration of the effect of different regularization strategies
for Gradient Boosting. The example is taken from Hastie et al 2009.
for Gradient Boosting. The example is taken from Hastie et al 2009 [1]_.
The loss function used is binomial deviance. Regularization via
shrinkage (``learning_rate < 1.0``) improves performance considerably.
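
A compact sketch of the regularization strategies the example varies (:class:`GradientBoostingClassifier` and the specific parameter values are assumptions; only shrinkage via ``learning_rate`` is named in the visible text)::

    from sklearn.datasets import make_hastie_10_2
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_hastie_10_2(n_samples=4000, random_state=0)
    X_train, y_train, X_test, y_test = X[:2000], y[:2000], X[2000:], y[2000:]

    settings = [
        {"learning_rate": 1.0, "subsample": 1.0},  # no shrinkage
        {"learning_rate": 0.1, "subsample": 1.0},  # shrinkage only
        {"learning_rate": 0.1, "subsample": 0.5},  # shrinkage + subsampling
    ]
    for params in settings:
        clf = GradientBoostingClassifier(n_estimators=200, random_state=0,
                                         **params)
        clf.fit(X_train, y_train)
        print(params, "test accuracy:", clf.score(X_test, y_test))
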
