[MRG + 1] DOC Fix Sphinx errors (#9420)
* Fix Rouseeuw1984 broken link

* Change label vbgmm to bgmm
Previously modified with PR #6651

* Change tag name
The old name now refers to the new tag added with PR #7388

* Remove prefix underscore to match tag

* Realign to fit 80 chars

* Link to metrics.rst.
Pairwise metrics are yet to be documented

* Remove tag as LSHForest is deprecated

* Remove all references to randomized_l1 and sphx_glr_auto_examples_linear_model_plot_sparse_recovery.py.
Both are deprecated.

* Fix a few Sphinx warnings

* Realign to 80 chars

* Changes based on PR review

* Remove unused ref in calibration

* Fix link ref in covariance.rst

* Fix linking issues

* Differentiate Rouseeuw1999 tag within file.

* Change all duplicate Rouseeuw1999 tags

* Remove numbers from tag Rousseeuw
Balakumaran Manoharan authored and jnothman committed Jul 30, 2017
1 parent f6c7080 commit 04be1a9
Showing 22 changed files with 100 additions and 102 deletions.
24 changes: 12 additions & 12 deletions doc/modules/calibration.rst
@@ -44,7 +44,7 @@ with different biases per method:
* :class:`RandomForestClassifier` shows the opposite behavior: the histograms
show peaks at approximately 0.2 and 0.9 probability, while probabilities close to
0 or 1 are very rare. An explanation for this is given by Niculescu-Mizil
and Caruana [4]: "Methods such as bagging and random forests that average
and Caruana [4]_: "Methods such as bagging and random forests that average
predictions from a base set of models can have difficulty making predictions
near 0 and 1 because variance in the underlying base models will bias
predictions that should be near zero or one away from these values. Because
@@ -57,15 +57,15 @@ with different biases per method:
ensemble away from 0. We observe this effect most strongly with random
forests because the base-level trees trained with random forests have
relatively high variance due to feature subseting." As a result, the
calibration curve also referred to as the reliability diagram (Wilks 1995[5]) shows a
calibration curve also referred to as the reliability diagram (Wilks 1995 [5]_) shows a
characteristic sigmoid shape, indicating that the classifier could trust its
"intuition" more and return probabilties closer to 0 or 1 typically.

.. currentmodule:: sklearn.svm

* Linear Support Vector Classification (:class:`LinearSVC`) shows an even more sigmoid curve
than the RandomForestClassifier, which is typical for maximum-margin methods
(compare Niculescu-Mizil and Caruana [4]), which focus on hard samples
(compare Niculescu-Mizil and Caruana [4]_), which focus on hard samples
that are close to the decision boundary (the support vectors).

.. currentmodule:: sklearn.calibration
@@ -190,18 +190,18 @@ a similar decrease in log-loss.

.. topic:: References:

.. [1] Obtaining calibrated probability estimates from decision trees
and naive Bayesian classifiers, B. Zadrozny & C. Elkan, ICML 2001
* Obtaining calibrated probability estimates from decision trees
and naive Bayesian classifiers, B. Zadrozny & C. Elkan, ICML 2001

.. [2] Transforming Classifier Scores into Accurate Multiclass
Probability Estimates, B. Zadrozny & C. Elkan, (KDD 2002)
* Transforming Classifier Scores into Accurate Multiclass
Probability Estimates, B. Zadrozny & C. Elkan, (KDD 2002)

.. [3] Probabilistic Outputs for Support Vector Machines and Comparisons to
Regularized Likelihood Methods, J. Platt, (1999)
* Probabilistic Outputs for Support Vector Machines and Comparisons to
Regularized Likelihood Methods, J. Platt, (1999)

.. [4] Predicting Good Probabilities with Supervised Learning,
A. Niculescu-Mizil & R. Caruana, ICML 2005
A. Niculescu-Mizil & R. Caruana, ICML 2005
.. [5] On the combination of forecast probabilities for
consecutive precipitation periods. Wea. Forecasting, 5, 640–
650., Wilks, D. S., 1990a
consecutive precipitation periods. Wea. Forecasting, 5, 640–650.,
Wilks, D. S., 1990a
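
The reliability diagrams this file discusses can be reproduced with a short sketch; ``sklearn.calibration.calibration_curve`` is assumed for the binning, and the dataset, classifier settings, and number of bins below are arbitrary illustrative choices, not taken from the documentation::

    # Illustrative sketch: a reliability diagram for a RandomForestClassifier.
    from sklearn.calibration import calibration_curve
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    prob_pos = clf.predict_proba(X_test)[:, 1]

    # A perfectly calibrated classifier would put these points on the diagonal.
    fraction_of_positives, mean_predicted_value = calibration_curve(
        y_test, prob_pos, n_bins=10)
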
2 changes: 1 addition & 1 deletion doc/modules/clustering.rst
@@ -1343,7 +1343,7 @@ mean of homogeneity and completeness**:

.. topic:: References

.. [RH2007] `V-Measure: A conditional entropy-based external cluster evaluation
* `V-Measure: A conditional entropy-based external cluster evaluation
measure <http://aclweb.org/anthology/D/D07/D07-1043.pdf>`_
Andrew Rosenberg and Julia Hirschberg, 2007
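
As a quick sketch of the metric cited above (``v_measure_score`` and the component scores are assumptions here, not named in the hunk; the label assignments are made up for illustration), the V-measure is the harmonic mean of homogeneity and completeness::

    from sklearn import metrics

    labels_true = [0, 0, 0, 1, 1, 1]
    labels_pred = [0, 0, 1, 1, 2, 2]

    h = metrics.homogeneity_score(labels_true, labels_pred)
    c = metrics.completeness_score(labels_true, labels_pred)
    v = metrics.v_measure_score(labels_true, labels_pred)
    assert abs(v - 2 * h * c / (h + c)) < 1e-10  # harmonic mean of h and c
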

35 changes: 20 additions & 15 deletions doc/modules/covariance.rst
@@ -95,7 +95,7 @@ bias/variance trade-off, and is discussed below.
Ledoit-Wolf shrinkage
---------------------

In their 2004 paper [1], O. Ledoit and M. Wolf propose a formula so as
In their 2004 paper [1]_, O. Ledoit and M. Wolf propose a formula so as
to compute the optimal shrinkage coefficient :math:`\alpha` that
minimizes the Mean Squared Error between the estimated and the real
covariance matrix.
@@ -112,18 +112,19 @@ fitting a :class:`LedoitWolf` object to the same sample.
for visualizing the performances of the Ledoit-Wolf estimator in
terms of likelihood.

.. topic:: References:

[1] O. Ledoit and M. Wolf, "A Well-Conditioned Estimator for Large-Dimensional
Covariance Matrices", Journal of Multivariate Analysis, Volume 88, Issue 2,
February 2004, pages 365-411.
.. [1] O. Ledoit and M. Wolf, "A Well-Conditioned Estimator for Large-Dimensional
Covariance Matrices", Journal of Multivariate Analysis, Volume 88, Issue 2,
February 2004, pages 365-411.
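
A minimal sketch of the estimator described above, on toy Gaussian data (the sample size and dimension are arbitrary choices, not taken from the documentation)::

    import numpy as np
    from sklearn.covariance import EmpiricalCovariance, LedoitWolf

    rng = np.random.RandomState(0)
    X = rng.multivariate_normal(mean=np.zeros(5), cov=np.eye(5), size=40)

    lw = LedoitWolf().fit(X)
    emp = EmpiricalCovariance().fit(X)

    # The fitted shrinkage coefficient and the resulting shrunk covariance.
    print(lw.shrinkage_)
    print(lw.covariance_)
    # Compare with the unshrunk maximum-likelihood estimate.
    print(emp.covariance_)
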
.. _oracle_approximating_shrinkage:

Oracle Approximating Shrinkage
------------------------------

Under the assumption that the data are Gaussian distributed, Chen et
al. [2] derived a formula aimed at choosing a shrinkage coefficient that
al. [2]_ derived a formula aimed at choosing a shrinkage coefficient that
yields a smaller Mean Squared Error than the one given by Ledoit and
Wolf's formula. The resulting estimator is known as the Oracle
Shrinkage Approximating estimator of the covariance.
@@ -141,8 +142,10 @@ object to the same sample.
Bias-variance trade-off when setting the shrinkage: comparing the
choices of Ledoit-Wolf and OAS estimators

[2] Chen et al., "Shrinkage Algorithms for MMSE Covariance Estimation",
IEEE Trans. on Sign. Proc., Volume 58, Issue 10, October 2010.
.. topic:: References:

.. [2] Chen et al., "Shrinkage Algorithms for MMSE Covariance Estimation",
IEEE Trans. on Sign. Proc., Volume 58, Issue 10, October 2010.
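
A short sketch comparing the two shrinkage formulas on Gaussian toy data (the ``OAS`` class and the ``shrinkage_`` attribute are assumptions; the data and sizes are arbitrary)::

    import numpy as np
    from sklearn.covariance import OAS, LedoitWolf

    rng = np.random.RandomState(0)
    X = rng.multivariate_normal(mean=np.zeros(5), cov=np.eye(5), size=30)

    # Under the Gaussian assumption, OAS targets a smaller MSE than Ledoit-Wolf.
    print("OAS shrinkage:        ", OAS().fit(X).shrinkage_)
    print("Ledoit-Wolf shrinkage:", LedoitWolf().fit(X).shrinkage_)
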
.. topic:: Examples:

@@ -266,14 +269,14 @@ perform outlier detection and discard/downweight some observations
according to further processing of the data.

The ``sklearn.covariance`` package implements a robust estimator of covariance,
the Minimum Covariance Determinant [3].
the Minimum Covariance Determinant [3]_.


Minimum Covariance Determinant
------------------------------

The Minimum Covariance Determinant estimator is a robust estimator of
a data set's covariance introduced by P.J. Rousseeuw in [3]. The idea
a data set's covariance introduced by P.J. Rousseeuw in [3]_. The idea
is to find a given proportion (h) of "good" observations which are not
outliers and compute their empirical covariance matrix. This
empirical covariance matrix is then rescaled to compensate the
@@ -283,7 +286,7 @@ weights to observations according to their Mahalanobis distance,
leading to a reweighted estimate of the covariance matrix of the data
set ("reweighting step").

Rousseeuw and Van Driessen [4] developed the FastMCD algorithm in order
Rousseeuw and Van Driessen [4]_ developed the FastMCD algorithm in order
to compute the Minimum Covariance Determinant. This algorithm is used
in scikit-learn when fitting an MCD object to data. The FastMCD
algorithm also computes a robust estimate of the data set location at
@@ -292,11 +295,13 @@ the same time.
Raw estimates can be accessed as ``raw_location_`` and ``raw_covariance_``
attributes of a :class:`MinCovDet` robust covariance estimator object.

[3] P. J. Rousseeuw. Least median of squares regression.
J. Am Stat Ass, 79:871, 1984.
[4] A Fast Algorithm for the Minimum Covariance Determinant Estimator,
1999, American Statistical Association and the American Society
for Quality, TECHNOMETRICS.
.. topic:: References:

.. [3] P. J. Rousseeuw. Least median of squares regression.
J. Am Stat Ass, 79:871, 1984.
.. [4] A Fast Algorithm for the Minimum Covariance Determinant Estimator,
1999, American Statistical Association and the American Society
for Quality, TECHNOMETRICS.
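
A minimal sketch of the MCD estimator discussed above, on toy data with a few planted outliers (the contamination pattern is arbitrary; only :class:`MinCovDet` and the attributes named in this section are taken from the documentation)::

    import numpy as np
    from sklearn.covariance import MinCovDet

    rng = np.random.RandomState(0)
    X = rng.randn(100, 3)
    X[:10] += 5.0  # a handful of gross outliers

    mcd = MinCovDet(random_state=0).fit(X)
    # Reweighted (final) estimates versus the raw FastMCD estimates.
    print(mcd.location_, mcd.covariance_)
    print(mcd.raw_location_, mcd.raw_covariance_)
    # Robust Mahalanobis distances flag the contaminated rows.
    dist = mcd.mahalanobis(X)
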
.. topic:: Examples:

2 changes: 1 addition & 1 deletion doc/modules/ensemble.rst
@@ -246,7 +246,7 @@ amount of time (e.g., on large datasets).
.. [B1998] L. Breiman, "Arcing Classifiers", Annals of Statistics 1998.
.. [GEW2006] P. Geurts, D. Ernst., and L. Wehenkel, "Extremely randomized
* P. Geurts, D. Ernst., and L. Wehenkel, "Extremely randomized
trees", Machine Learning, 63(1), 3-42, 2006.

.. _random_forest_feature_importance:
2 changes: 1 addition & 1 deletion doc/modules/linear_model.rst
@@ -1141,7 +1141,7 @@ in the following ways.

.. topic:: References:

.. [#f1] Peter J. Huber, Elvezio M. Ronchetti: Robust Statistics, Concomitant scale estimates, pg 172
* Peter J. Huber, Elvezio M. Ronchetti: Robust Statistics, Concomitant scale estimates, pg 172

Also, this estimator is different from the R implementation of Robust Regression
(http://www.ats.ucla.edu/stat/r/dae/rreg.htm) because the R implementation does a weighted least
18 changes: 9 additions & 9 deletions doc/modules/multiclass.rst
@@ -251,8 +251,8 @@ Below is an example of multiclass learning using OvO::

.. topic:: References:

.. [1] "Pattern Recognition and Machine Learning. Springer",
Christopher M. Bishop, page 183, (First Edition)
* "Pattern Recognition and Machine Learning. Springer",
Christopher M. Bishop, page 183, (First Edition)

.. _ecoc:

@@ -315,19 +315,19 @@ Below is an example of multiclass learning using Output-Codes::

.. topic:: References:

.. [2] "Solving multiclass learning problems via error-correcting output codes",
Dietterich T., Bakiri G.,
Journal of Artificial Intelligence Research 2,
1995.
* "Solving multiclass learning problems via error-correcting output codes",
Dietterich T., Bakiri G.,
Journal of Artificial Intelligence Research 2,
1995.

.. [3] "The error coding method and PICTs",
James G., Hastie T.,
Journal of Computational and Graphical statistics 7,
1998.
.. [4] "The Elements of Statistical Learning",
Hastie T., Tibshirani R., Friedman J., page 606 (second-edition)
2008.
* "The Elements of Statistical Learning",
Hastie T., Tibshirani R., Friedman J., page 606 (second-edition)
2008.
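
As a sketch of the error-correcting output-code approach these references cover (``OutputCodeClassifier`` and ``code_size`` are assumptions, not named in the hunk; the iris data and base estimator are arbitrary choices)::

    from sklearn.datasets import load_iris
    from sklearn.multiclass import OutputCodeClassifier
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)
    # code_size sets how many binary problems are solved per class.
    ecoc = OutputCodeClassifier(LinearSVC(random_state=0),
                                code_size=2, random_state=0)
    ecoc.fit(X, y)
    print(ecoc.predict(X[:5]))
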

Multioutput regression
======================
20 changes: 10 additions & 10 deletions doc/modules/outlier_detection.rst
@@ -126,8 +126,8 @@ This strategy is illustrated below.

.. topic:: References:

.. [RD1999] Rousseeuw, P.J., Van Driessen, K. "A fast algorithm for the minimum
covariance determinant estimator" Technometrics 41(3), 212 (1999)
* Rousseeuw, P.J., Van Driessen, K. "A fast algorithm for the minimum
covariance determinant estimator" Technometrics 41(3), 212 (1999)

.. _isolation_forest:

@@ -172,8 +172,8 @@ This strategy is illustrated below.

.. topic:: References:

.. [LTZ2008] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation forest."
Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on.
* Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation forest."
Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on.


Local Outlier Factor
@@ -228,7 +228,7 @@ This strategy is illustrated below.

.. topic:: References:

.. [BKNS2000] Breunig, Kriegel, Ng, and Sander (2000)
* Breunig, Kriegel, Ng, and Sander (2000)
`LOF: identifying density-based local outliers.
<http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf>`_
Proc. ACM SIGMOD
@@ -272,16 +272,16 @@ multiple modes and :class:`ensemble.IsolationForest` and
opposite, the decision rule based on fitting an
:class:`covariance.EllipticEnvelope` learns an ellipse, which
fits well the inlier distribution. The :class:`ensemble.IsolationForest`
and :class:`neighbors.LocalOutlierFactor` perform as well.
and :class:`neighbors.LocalOutlierFactor` perform as well.
- |outlier1|

*
- As the inlier distribution becomes bimodal, the
:class:`covariance.EllipticEnvelope` does not fit well the
inliers. However, we can see that :class:`ensemble.IsolationForest`,
:class:`svm.OneClassSVM` and :class:`neighbors.LocalOutlierFactor`
have difficulties to detect the two modes,
and that the :class:`svm.OneClassSVM`
:class:`svm.OneClassSVM` and :class:`neighbors.LocalOutlierFactor`
have difficulties to detect the two modes,
and that the :class:`svm.OneClassSVM`
tends to overfit: because it has no model of inliers, it
interprets a region where, by chance some outliers are
clustered, as inliers.
@@ -292,7 +292,7 @@ multiple modes and :class:`ensemble.IsolationForest` and
:class:`svm.OneClassSVM` is able to recover a reasonable
approximation as well as :class:`ensemble.IsolationForest`
and :class:`neighbors.LocalOutlierFactor`,
whereas the :class:`covariance.EllipticEnvelope` completely fails.
whereas the :class:`covariance.EllipticEnvelope` completely fails.
- |outlier3|
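
A compact sketch of the comparison above, fitting the four estimators on the same toy blobs (the dataset, ``contamination`` value and kernel settings are arbitrary illustrative choices)::

    from sklearn.covariance import EllipticEnvelope
    from sklearn.datasets import make_blobs
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.svm import OneClassSVM

    X, _ = make_blobs(n_samples=300, centers=2, random_state=0)

    estimators = {
        "EllipticEnvelope": EllipticEnvelope(contamination=0.1),
        "IsolationForest": IsolationForest(contamination=0.1, random_state=0),
        "OneClassSVM": OneClassSVM(nu=0.1, gamma=0.1),
    }
    for name, est in estimators.items():
        pred = est.fit(X).predict(X)  # +1 for inliers, -1 for outliers
        print(name, (pred == -1).sum(), "points flagged")

    # LocalOutlierFactor is used through fit_predict for this kind of check.
    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
    print("LocalOutlierFactor", (lof.fit_predict(X) == -1).sum(), "points flagged")
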

.. topic:: Examples:
2 changes: 1 addition & 1 deletion doc/tutorial/statistical_inference/putting_together.rst
@@ -17,7 +17,7 @@ can predict variables. We can also create combined estimators:
:align: right

.. literalinclude:: ../../auto_examples/plot_digits_pipe.py
:lines: 26-66
:lines: 23-63
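
In the spirit of the combined estimator this example includes (a PCA step chained with a logistic regression), a minimal sketch; the component parameters below are arbitrary and not taken from ``plot_digits_pipe.py``::

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    X, y = load_digits(return_X_y=True)
    pipe = Pipeline([('pca', PCA(n_components=20)),
                     ('logistic', LogisticRegression())])
    pipe.fit(X, y)
    print("training accuracy:", pipe.score(X, y))
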



10 changes: 5 additions & 5 deletions examples/ensemble/plot_adaboost_hastie_10_2.py
@@ -3,11 +3,11 @@
Discrete versus Real AdaBoost
=============================
This example is based on Figure 10.2 from Hastie et al 2009 [1] and illustrates
the difference in performance between the discrete SAMME [2] boosting
algorithm and real SAMME.R boosting algorithm. Both algorithms are evaluated
on a binary classification task where the target Y is a non-linear function
of 10 input features.
This example is based on Figure 10.2 from Hastie et al 2009 [1]_ and
illustrates the difference in performance between the discrete SAMME [2]_
boosting algorithm and real SAMME.R boosting algorithm. Both algorithms are
evaluated on a binary classification task where the target Y is a non-linear
function of 10 input features.
Discrete SAMME AdaBoost adapts based on errors in predicted class labels
whereas real SAMME.R uses the predicted class probabilities.
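
A minimal sketch of the comparison the docstring describes (the ``algorithm`` switch on :class:`AdaBoostClassifier` and the estimator count are assumptions; the data generator follows the Hastie et al. setting)::

    from sklearn.datasets import make_hastie_10_2
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_hastie_10_2(n_samples=4000, random_state=0)
    X_train, y_train, X_test, y_test = X[:2000], y[:2000], X[2000:], y[2000:]

    for algorithm in ("SAMME", "SAMME.R"):  # discrete vs. real boosting
        clf = AdaBoostClassifier(n_estimators=200, algorithm=algorithm,
                                 random_state=0)
        clf.fit(X_train, y_train)
        print(algorithm, "test accuracy:", clf.score(X_test, y_test))
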
4 changes: 2 additions & 2 deletions examples/ensemble/plot_adaboost_multiclass.py
@@ -3,14 +3,14 @@
Multi-class AdaBoosted Decision Trees
=====================================
This example reproduces Figure 1 of Zhu et al [1] and shows how boosting can
This example reproduces Figure 1 of Zhu et al [1]_ and shows how boosting can
improve prediction accuracy on a multi-class problem. The classification
dataset is constructed by taking a ten-dimensional standard normal distribution
and defining three classes separated by nested concentric ten-dimensional
spheres such that roughly equal numbers of samples are in each class (quantiles
of the :math:`\chi^2` distribution).
The performance of the SAMME and SAMME.R [1] algorithms are compared. SAMME.R
The performance of the SAMME and SAMME.R [1]_ algorithms are compared. SAMME.R
uses the probability estimates to update the additive model, while SAMME uses
the classifications only. As the example illustrates, the SAMME.R algorithm
typically converges faster than SAMME, achieving a lower test error with fewer
2 changes: 1 addition & 1 deletion examples/ensemble/plot_adaboost_regression.py
@@ -3,7 +3,7 @@
Decision Tree Regression with AdaBoost
======================================
A decision tree is boosted using the AdaBoost.R2 [1] algorithm on a 1D
A decision tree is boosted using the AdaBoost.R2 [1]_ algorithm on a 1D
sinusoidal dataset with a small amount of Gaussian noise.
299 boosts (300 decision trees) is compared with a single decision tree
regressor. As the number of boosts is increased the regressor can fit more
2 changes: 1 addition & 1 deletion examples/ensemble/plot_ensemble_oob.py
@@ -8,7 +8,7 @@
:math:`z_i = (x_i, y_i)`. The *out-of-bag* (OOB) error is the average error for
each :math:`z_i` calculated using predictions from the trees that do not
contain :math:`z_i` in their respective bootstrap sample. This allows the
``RandomForestClassifier`` to be fit and validated whilst being trained [1].
``RandomForestClassifier`` to be fit and validated whilst being trained [1]_.
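
A minimal sketch of the OOB bookkeeping described here (``oob_score=True`` and the ``oob_score_`` attribute are assumptions not spelled out in this docstring; the dataset and forest size are arbitrary)::

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=25, random_state=0)
    forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                    bootstrap=True, random_state=0)
    forest.fit(X, y)
    # Each sample is scored only by trees whose bootstrap sample excluded it.
    print("OOB error rate:", 1 - forest.oob_score_)
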
The example below demonstrates how the OOB error can be measured at the
addition of each new tree during training. The resulting plot allows a
2 changes: 1 addition & 1 deletion examples/ensemble/plot_gradient_boosting_regularization.py
@@ -4,7 +4,7 @@
================================
Illustration of the effect of different regularization strategies
for Gradient Boosting. The example is taken from Hastie et al 2009.
for Gradient Boosting. The example is taken from Hastie et al 2009 [1]_.
The loss function used is binomial deviance. Regularization via
shrinkage (``learning_rate < 1.0``) improves performance considerably.
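
A compact sketch of the regularization strategies the example varies (:class:`GradientBoostingClassifier` and the specific parameter values are assumptions; only shrinkage via ``learning_rate`` is named in the visible text)::

    from sklearn.datasets import make_hastie_10_2
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_hastie_10_2(n_samples=4000, random_state=0)
    X_train, y_train, X_test, y_test = X[:2000], y[:2000], X[2000:], y[2000:]

    settings = [
        {"learning_rate": 1.0, "subsample": 1.0},  # no shrinkage
        {"learning_rate": 0.1, "subsample": 1.0},  # shrinkage only
        {"learning_rate": 0.1, "subsample": 0.5},  # shrinkage + subsampling
    ]
    for params in settings:
        clf = GradientBoostingClassifier(n_estimators=200, random_state=0,
                                         **params)
        clf.fit(X_train, y_train)
        print(params, "test accuracy:", clf.score(X_test, y_test))
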
