diff --git a/doc/modules/calibration.rst b/doc/modules/calibration.rst index 0c0af594398a0..9762414ac8cc0 100644 --- a/doc/modules/calibration.rst +++ b/doc/modules/calibration.rst @@ -44,7 +44,7 @@ with different biases per method: * :class:`RandomForestClassifier` shows the opposite behavior: the histograms show peaks at approximately 0.2 and 0.9 probability, while probabilities close to 0 or 1 are very rare. An explanation for this is given by Niculescu-Mizil - and Caruana [4]: "Methods such as bagging and random forests that average + and Caruana [4]_: "Methods such as bagging and random forests that average predictions from a base set of models can have difficulty making predictions near 0 and 1 because variance in the underlying base models will bias predictions that should be near zero or one away from these values. Because @@ -57,7 +57,7 @@ with different biases per method: ensemble away from 0. We observe this effect most strongly with random forests because the base-level trees trained with random forests have relatively high variance due to feature subseting." As a result, the - calibration curve also referred to as the reliability diagram (Wilks 1995[5]) shows a + calibration curve also referred to as the reliability diagram (Wilks 1995 [5]_) shows a characteristic sigmoid shape, indicating that the classifier could trust its "intuition" more and return probabilties closer to 0 or 1 typically. @@ -65,7 +65,7 @@ with different biases per method: * Linear Support Vector Classification (:class:`LinearSVC`) shows an even more sigmoid curve as the RandomForestClassifier, which is typical for maximum-margin methods - (compare Niculescu-Mizil and Caruana [4]), which focus on hard samples + (compare Niculescu-Mizil and Caruana [4]_), which focus on hard samples that are close to the decision boundary (the support vectors). .. currentmodule:: sklearn.calibration @@ -190,18 +190,18 @@ a similar decrease in log-loss. .. topic:: References: - .. [1] Obtaining calibrated probability estimates from decision trees - and naive Bayesian classifiers, B. Zadrozny & C. Elkan, ICML 2001 + * Obtaining calibrated probability estimates from decision trees + and naive Bayesian classifiers, B. Zadrozny & C. Elkan, ICML 2001 - .. [2] Transforming Classifier Scores into Accurate Multiclass - Probability Estimates, B. Zadrozny & C. Elkan, (KDD 2002) + * Transforming Classifier Scores into Accurate Multiclass + Probability Estimates, B. Zadrozny & C. Elkan, (KDD 2002) - .. [3] Probabilistic Outputs for Support Vector Machines and Comparisons to - Regularized Likelihood Methods, J. Platt, (1999) + * Probabilistic Outputs for Support Vector Machines and Comparisons to + Regularized Likelihood Methods, J. Platt, (1999) .. [4] Predicting Good Probabilities with Supervised Learning, - A. Niculescu-Mizil & R. Caruana, ICML 2005 + A. Niculescu-Mizil & R. Caruana, ICML 2005 .. [5] On the combination of forecast probabilities for - consecutive precipitation periods. Wea. Forecasting, 5, 640– - 650., Wilks, D. S., 1990a + consecutive precipitation periods. Wea. Forecasting, 5, 640–650., + Wilks, D. S., 1990a diff --git a/doc/modules/clustering.rst b/doc/modules/clustering.rst index 7189474752005..b18cb3a6adcf7 100644 --- a/doc/modules/clustering.rst +++ b/doc/modules/clustering.rst @@ -1343,7 +1343,7 @@ mean of homogeneity and completeness**: .. topic:: References - .. 
[RH2007] `V-Measure: A conditional entropy-based external cluster evaluation + * `V-Measure: A conditional entropy-based external cluster evaluation measure `_ Andrew Rosenberg and Julia Hirschberg, 2007 diff --git a/doc/modules/covariance.rst b/doc/modules/covariance.rst index 88f40f3896190..2f95051ac9ea3 100644 --- a/doc/modules/covariance.rst +++ b/doc/modules/covariance.rst @@ -95,7 +95,7 @@ bias/variance trade-off, and is discussed below. Ledoit-Wolf shrinkage --------------------- -In their 2004 paper [1], O. Ledoit and M. Wolf propose a formula so as +In their 2004 paper [1]_, O. Ledoit and M. Wolf propose a formula so as to compute the optimal shrinkage coefficient :math:`\alpha` that minimizes the Mean Squared Error between the estimated and the real covariance matrix. @@ -112,10 +112,11 @@ fitting a :class:`LedoitWolf` object to the same sample. for visualizing the performances of the Ledoit-Wolf estimator in terms of likelihood. +.. topic:: References: -[1] O. Ledoit and M. Wolf, "A Well-Conditioned Estimator for Large-Dimensional - Covariance Matrices", Journal of Multivariate Analysis, Volume 88, Issue 2, - February 2004, pages 365-411. + .. [1] O. Ledoit and M. Wolf, "A Well-Conditioned Estimator for Large-Dimensional + Covariance Matrices", Journal of Multivariate Analysis, Volume 88, Issue 2, + February 2004, pages 365-411. .. _oracle_approximating_shrinkage: @@ -123,7 +124,7 @@ Oracle Approximating Shrinkage ------------------------------ Under the assumption that the data are Gaussian distributed, Chen et -al. [2] derived a formula aimed at choosing a shrinkage coefficient that +al. [2]_ derived a formula aimed at choosing a shrinkage coefficient that yields a smaller Mean Squared Error than the one given by Ledoit and Wolf's formula. The resulting estimator is known as the Oracle Shrinkage Approximating estimator of the covariance. @@ -141,8 +142,10 @@ object to the same sample. Bias-variance trade-off when setting the shrinkage: comparing the choices of Ledoit-Wolf and OAS estimators -[2] Chen et al., "Shrinkage Algorithms for MMSE Covariance Estimation", - IEEE Trans. on Sign. Proc., Volume 58, Issue 10, October 2010. +.. topic:: References: + + .. [2] Chen et al., "Shrinkage Algorithms for MMSE Covariance Estimation", + IEEE Trans. on Sign. Proc., Volume 58, Issue 10, October 2010. .. topic:: Examples: @@ -266,14 +269,14 @@ perform outlier detection and discard/downweight some observations according to further processing of the data. The ``sklearn.covariance`` package implements a robust estimator of covariance, -the Minimum Covariance Determinant [3]. +the Minimum Covariance Determinant [3]_. Minimum Covariance Determinant ------------------------------ The Minimum Covariance Determinant estimator is a robust estimator of -a data set's covariance introduced by P.J. Rousseeuw in [3]. The idea +a data set's covariance introduced by P.J. Rousseeuw in [3]_. The idea is to find a given proportion (h) of "good" observations which are not outliers and compute their empirical covariance matrix. This empirical covariance matrix is then rescaled to compensate the @@ -283,7 +286,7 @@ weights to observations according to their Mahalanobis distance, leading to a reweighted estimate of the covariance matrix of the data set ("reweighting step"). -Rousseeuw and Van Driessen [4] developed the FastMCD algorithm in order +Rousseeuw and Van Driessen [4]_ developed the FastMCD algorithm in order to compute the Minimum Covariance Determinant. 
This algorithm is used in scikit-learn when fitting an MCD object to data. The FastMCD algorithm also computes a robust estimate of the data set location at @@ -292,11 +295,13 @@ the same time. Raw estimates can be accessed as ``raw_location_`` and ``raw_covariance_`` attributes of a :class:`MinCovDet` robust covariance estimator object. -[3] P. J. Rousseeuw. Least median of squares regression. - J. Am Stat Ass, 79:871, 1984. -[4] A Fast Algorithm for the Minimum Covariance Determinant Estimator, - 1999, American Statistical Association and the American Society - for Quality, TECHNOMETRICS. +.. topic:: References: + + .. [3] P. J. Rousseeuw. Least median of squares regression. + J. Am Stat Ass, 79:871, 1984. + .. [4] A Fast Algorithm for the Minimum Covariance Determinant Estimator, + 1999, American Statistical Association and the American Society + for Quality, TECHNOMETRICS. .. topic:: Examples: diff --git a/doc/modules/ensemble.rst b/doc/modules/ensemble.rst index 12a0ff6a74ba0..40a3e834e22c9 100644 --- a/doc/modules/ensemble.rst +++ b/doc/modules/ensemble.rst @@ -246,7 +246,7 @@ amount of time (e.g., on large datasets). .. [B1998] L. Breiman, "Arcing Classifiers", Annals of Statistics 1998. - .. [GEW2006] P. Geurts, D. Ernst., and L. Wehenkel, "Extremely randomized + * P. Geurts, D. Ernst., and L. Wehenkel, "Extremely randomized trees", Machine Learning, 63(1), 3-42, 2006. .. _random_forest_feature_importance: diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst index e6d0ea882f6d3..018ff884c4ae2 100644 --- a/doc/modules/linear_model.rst +++ b/doc/modules/linear_model.rst @@ -1141,7 +1141,7 @@ in the following ways. .. topic:: References: - .. [#f1] Peter J. Huber, Elvezio M. Ronchetti: Robust Statistics, Concomitant scale estimates, pg 172 + * Peter J. Huber, Elvezio M. Ronchetti: Robust Statistics, Concomitant scale estimates, pg 172 Also, this estimator is different from the R implementation of Robust Regression (http://www.ats.ucla.edu/stat/r/dae/rreg.htm) because the R implementation does a weighted least diff --git a/doc/modules/multiclass.rst b/doc/modules/multiclass.rst index 5ae785400782d..2eec94f76b1c2 100644 --- a/doc/modules/multiclass.rst +++ b/doc/modules/multiclass.rst @@ -251,8 +251,8 @@ Below is an example of multiclass learning using OvO:: .. topic:: References: - .. [1] "Pattern Recognition and Machine Learning. Springer", - Christopher M. Bishop, page 183, (First Edition) + * "Pattern Recognition and Machine Learning. Springer", + Christopher M. Bishop, page 183, (First Edition) .. _ecoc: @@ -315,19 +315,19 @@ Below is an example of multiclass learning using Output-Codes:: .. topic:: References: - .. [2] "Solving multiclass learning problems via error-correcting output codes", - Dietterich T., Bakiri G., - Journal of Artificial Intelligence Research 2, - 1995. + * "Solving multiclass learning problems via error-correcting output codes", + Dietterich T., Bakiri G., + Journal of Artificial Intelligence Research 2, + 1995. .. [3] "The error coding method and PICTs", James G., Hastie T., Journal of Computational and Graphical statistics 7, 1998. - .. [4] "The Elements of Statistical Learning", - Hastie T., Tibshirani R., Friedman J., page 606 (second-edition) - 2008. + * "The Elements of Statistical Learning", + Hastie T., Tibshirani R., Friedman J., page 606 (second-edition) + 2008. 
Multioutput regression ====================== diff --git a/doc/modules/outlier_detection.rst b/doc/modules/outlier_detection.rst index 011bb6ea07889..db130403f9023 100644 --- a/doc/modules/outlier_detection.rst +++ b/doc/modules/outlier_detection.rst @@ -126,8 +126,8 @@ This strategy is illustrated below. .. topic:: References: - .. [RD1999] Rousseeuw, P.J., Van Driessen, K. "A fast algorithm for the minimum - covariance determinant estimator" Technometrics 41(3), 212 (1999) + * Rousseeuw, P.J., Van Driessen, K. "A fast algorithm for the minimum + covariance determinant estimator" Technometrics 41(3), 212 (1999) .. _isolation_forest: @@ -172,8 +172,8 @@ This strategy is illustrated below. .. topic:: References: - .. [LTZ2008] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation forest." - Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. + * Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation forest." + Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. Local Outlier Factor @@ -228,7 +228,7 @@ This strategy is illustrated below. .. topic:: References: - .. [BKNS2000] Breunig, Kriegel, Ng, and Sander (2000) + * Breunig, Kriegel, Ng, and Sander (2000) `LOF: identifying density-based local outliers. `_ Proc. ACM SIGMOD @@ -272,16 +272,16 @@ multiple modes and :class:`ensemble.IsolationForest` and opposite, the decision rule based on fitting an :class:`covariance.EllipticEnvelope` learns an ellipse, which fits well the inlier distribution. The :class:`ensemble.IsolationForest` - and :class:`neighbors.LocalOutlierFactor` perform as well. + and :class:`neighbors.LocalOutlierFactor` perform as well. - |outlier1| * - As the inlier distribution becomes bimodal, the :class:`covariance.EllipticEnvelope` does not fit well the inliers. However, we can see that :class:`ensemble.IsolationForest`, - :class:`svm.OneClassSVM` and :class:`neighbors.LocalOutlierFactor` - have difficulties to detect the two modes, - and that the :class:`svm.OneClassSVM` + :class:`svm.OneClassSVM` and :class:`neighbors.LocalOutlierFactor` + have difficulties to detect the two modes, + and that the :class:`svm.OneClassSVM` tends to overfit: because it has no model of inliers, it interprets a region where, by chance some outliers are clustered, as inliers. @@ -292,7 +292,7 @@ multiple modes and :class:`ensemble.IsolationForest` and :class:`svm.OneClassSVM` is able to recover a reasonable approximation as well as :class:`ensemble.IsolationForest` and :class:`neighbors.LocalOutlierFactor`, - whereas the :class:`covariance.EllipticEnvelope` completely fails. + whereas the :class:`covariance.EllipticEnvelope` completely fails. - |outlier3| .. topic:: Examples: diff --git a/doc/tutorial/statistical_inference/putting_together.rst b/doc/tutorial/statistical_inference/putting_together.rst index acac7c03d1d06..556b6b8df0894 100644 --- a/doc/tutorial/statistical_inference/putting_together.rst +++ b/doc/tutorial/statistical_inference/putting_together.rst @@ -17,7 +17,7 @@ can predict variables. We can also create combined estimators: :align: right .. 
literalinclude:: ../../auto_examples/plot_digits_pipe.py - :lines: 26-66 + :lines: 23-63 diff --git a/examples/ensemble/plot_adaboost_hastie_10_2.py b/examples/ensemble/plot_adaboost_hastie_10_2.py index b27636956ef26..4d48d13dd24f2 100644 --- a/examples/ensemble/plot_adaboost_hastie_10_2.py +++ b/examples/ensemble/plot_adaboost_hastie_10_2.py @@ -3,11 +3,11 @@ Discrete versus Real AdaBoost ============================= -This example is based on Figure 10.2 from Hastie et al 2009 [1] and illustrates -the difference in performance between the discrete SAMME [2] boosting -algorithm and real SAMME.R boosting algorithm. Both algorithms are evaluated -on a binary classification task where the target Y is a non-linear function -of 10 input features. +This example is based on Figure 10.2 from Hastie et al 2009 [1]_ and +illustrates the difference in performance between the discrete SAMME [2]_ +boosting algorithm and real SAMME.R boosting algorithm. Both algorithms are +evaluated on a binary classification task where the target Y is a non-linear +function of 10 input features. Discrete SAMME AdaBoost adapts based on errors in predicted class labels whereas real SAMME.R uses the predicted class probabilities. diff --git a/examples/ensemble/plot_adaboost_multiclass.py b/examples/ensemble/plot_adaboost_multiclass.py index 39e7cdcb8ef4d..906df85ccf645 100644 --- a/examples/ensemble/plot_adaboost_multiclass.py +++ b/examples/ensemble/plot_adaboost_multiclass.py @@ -3,14 +3,14 @@ Multi-class AdaBoosted Decision Trees ===================================== -This example reproduces Figure 1 of Zhu et al [1] and shows how boosting can +This example reproduces Figure 1 of Zhu et al [1]_ and shows how boosting can improve prediction accuracy on a multi-class problem. The classification dataset is constructed by taking a ten-dimensional standard normal distribution and defining three classes separated by nested concentric ten-dimensional spheres such that roughly equal numbers of samples are in each class (quantiles of the :math:`\chi^2` distribution). -The performance of the SAMME and SAMME.R [1] algorithms are compared. SAMME.R +The performance of the SAMME and SAMME.R [1]_ algorithms are compared. SAMME.R uses the probability estimates to update the additive model, while SAMME uses the classifications only. As the example illustrates, the SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer diff --git a/examples/ensemble/plot_adaboost_regression.py b/examples/ensemble/plot_adaboost_regression.py index b5b98d140da1b..0c76ac6af3ae9 100644 --- a/examples/ensemble/plot_adaboost_regression.py +++ b/examples/ensemble/plot_adaboost_regression.py @@ -3,7 +3,7 @@ Decision Tree Regression with AdaBoost ====================================== -A decision tree is boosted using the AdaBoost.R2 [1] algorithm on a 1D +A decision tree is boosted using the AdaBoost.R2 [1]_ algorithm on a 1D sinusoidal dataset with a small amount of Gaussian noise. 299 boosts (300 decision trees) is compared with a single decision tree regressor. As the number of boosts is increased the regressor can fit more diff --git a/examples/ensemble/plot_ensemble_oob.py b/examples/ensemble/plot_ensemble_oob.py index 811cec13b24be..19b01772d5c24 100644 --- a/examples/ensemble/plot_ensemble_oob.py +++ b/examples/ensemble/plot_ensemble_oob.py @@ -8,7 +8,7 @@ :math:`z_i = (x_i, y_i)`. 
The *out-of-bag* (OOB) error is the average error for each :math:`z_i` calculated using predictions from the trees that do not contain :math:`z_i` in their respective bootstrap sample. This allows the -``RandomForestClassifier`` to be fit and validated whilst being trained [1]. +``RandomForestClassifier`` to be fit and validated whilst being trained [1]_. The example below demonstrates how the OOB error can be measured at the addition of each new tree during training. The resulting plot allows a diff --git a/examples/ensemble/plot_gradient_boosting_regularization.py b/examples/ensemble/plot_gradient_boosting_regularization.py index e5a01240ccdb0..592dd40ca47cb 100644 --- a/examples/ensemble/plot_gradient_boosting_regularization.py +++ b/examples/ensemble/plot_gradient_boosting_regularization.py @@ -4,7 +4,7 @@ ================================ Illustration of the effect of different regularization strategies -for Gradient Boosting. The example is taken from Hastie et al 2009. +for Gradient Boosting. The example is taken from Hastie et al 2009 [1]_. The loss function used is binomial deviance. Regularization via shrinkage (``learning_rate < 1.0``) improves performance considerably. diff --git a/sklearn/covariance/robust_covariance.py b/sklearn/covariance/robust_covariance.py index 985dda92f990c..de5ee308764bb 100644 --- a/sklearn/covariance/robust_covariance.py +++ b/sklearn/covariance/robust_covariance.py @@ -190,7 +190,7 @@ def select_candidates(X, n_support, n_trials, select=1, n_iter=30, Starting from a random support, the pure data set is found by the c_step procedure introduced by Rousseeuw and Van Driessen in - [Rouseeuw1999]_. + [RV]_. Parameters ---------- @@ -250,7 +250,7 @@ def select_candidates(X, n_support, n_trials, select=1, n_iter=30, References ---------- - .. [Rouseeuw1999] A Fast Algorithm for the Minimum Covariance Determinant + .. [RV] A Fast Algorithm for the Minimum Covariance Determinant Estimator, 1999, American Statistical Association and the American Society for Quality, TECHNOMETRICS @@ -339,13 +339,13 @@ def fast_mcd(X, support_fraction=None, such computation levels. Note that only raw estimates are returned. If one is interested in - the correction and reweighting steps described in [Rouseeuw1999]_, + the correction and reweighting steps described in [RouseeuwVan]_, see the MinCovDet object. References ---------- - .. [Rouseeuw1999] A Fast Algorithm for the Minimum Covariance + .. [RouseeuwVan] A Fast Algorithm for the Minimum Covariance Determinant Estimator, 1999, American Statistical Association and the American Society for Quality, TECHNOMETRICS @@ -580,10 +580,10 @@ class MinCovDet(EmpiricalCovariance): .. [Rouseeuw1984] `P. J. Rousseeuw. Least median of squares regression. J. Am Stat Ass, 79:871, 1984.` - .. [Rouseeuw1999] `A Fast Algorithm for the Minimum Covariance Determinant + .. [Rousseeuw] `A Fast Algorithm for the Minimum Covariance Determinant Estimator, 1999, American Statistical Association and the American Society for Quality, TECHNOMETRICS` - .. [Butler1993] `R. W. Butler, P. L. Davies and M. Jhun, + .. [ButlerDavies] `R. W. Butler, P. L. Davies and M. Jhun, Asymptotics For The Minimum Covariance Determinant Estimator, The Annals of Statistics, 1993, Vol. 21, No. 3, 1385-1400` @@ -650,7 +650,7 @@ def correct_covariance(self, data): """Apply a correction to raw Minimum Covariance Determinant estimates. Correction using the empirical correction factor suggested - by Rousseeuw and Van Driessen in [Rouseeuw1984]_. 
+ by Rousseeuw and Van Driessen in [RVD]_. Parameters ---------- @@ -659,6 +659,13 @@ def correct_covariance(self, data): The data set must be the one which was used to compute the raw estimates. + References + ---------- + + .. [RVD] `A Fast Algorithm for the Minimum Covariance + Determinant Estimator, 1999, American Statistical Association + and the American Society for Quality, TECHNOMETRICS` + Returns ------- covariance_corrected : array-like, shape (n_features, n_features) @@ -675,7 +682,8 @@ def reweight_covariance(self, data): Re-weight observations using Rousseeuw's method (equivalent to deleting outlying observations from the data set before - computing location and covariance estimates). [Rouseeuw1984]_ + computing location and covariance estimates) described + in [RVDriessen]_. Parameters ---------- @@ -684,6 +692,13 @@ def reweight_covariance(self, data): The data set must be the one which was used to compute the raw estimates. + References + ---------- + + .. [RVDriessen] `A Fast Algorithm for the Minimum Covariance + Determinant Estimator, 1999, American Statistical Association + and the American Society for Quality, TECHNOMETRICS` + Returns ------- location_reweighted : array-like, shape (n_features, ) diff --git a/sklearn/datasets/lfw.py b/sklearn/datasets/lfw.py index 50834f7705ef6..4d188f00bcffa 100644 --- a/sklearn/datasets/lfw.py +++ b/sklearn/datasets/lfw.py @@ -68,6 +68,7 @@ def scale_face(face): def check_fetch_lfw(data_home=None, funneled=True, download_if_missing=True): """Helper function to download any missing LFW data""" + data_home = get_data_home(data_home=data_home) lfw_home = join(data_home, "lfw_home") diff --git a/sklearn/linear_model/randomized_l1.py b/sklearn/linear_model/randomized_l1.py index a84558823146e..8f3692dc8675b 100644 --- a/sklearn/linear_model/randomized_l1.py +++ b/sklearn/linear_model/randomized_l1.py @@ -195,8 +195,6 @@ class RandomizedLasso(BaseRandomizedLinearModel): is known as stability selection. In short, features selected more often are considered good features. - Read more in the :ref:`User Guide `. - Parameters ---------- alpha : float, 'aic', or 'bic', optional @@ -206,7 +204,7 @@ class RandomizedLasso(BaseRandomizedLinearModel): scaling : float, optional The s parameter used to randomly scale the penalty of different - features (See :ref:`User Guide ` for details ). + features. Should be between 0 and 1. sample_fraction : float, optional @@ -300,11 +298,6 @@ class RandomizedLasso(BaseRandomizedLinearModel): >>> from sklearn.linear_model import RandomizedLasso >>> randomized_lasso = RandomizedLasso() - Notes - ----- - For an example, see :ref:`examples/linear_model/plot_sparse_recovery.py - `. - References ---------- Stability selection @@ -407,8 +400,6 @@ class RandomizedLogisticRegression(BaseRandomizedLinearModel): randomizations. This is known as stability selection. In short, features selected more often are considered good features. - Read more in the :ref:`User Guide `. - Parameters ---------- C : float or array-like of shape [n_reg_parameter], optional, default=1 @@ -420,7 +411,7 @@ class RandomizedLogisticRegression(BaseRandomizedLinearModel): scaling : float, optional, default=0.5 The s parameter used to randomly scale the penalty of different - features (See :ref:`User Guide ` for details ). + features. Should be between 0 and 1. 
sample_fraction : float, optional, default=0.75 @@ -501,11 +492,6 @@ class RandomizedLogisticRegression(BaseRandomizedLinearModel): >>> from sklearn.linear_model import RandomizedLogisticRegression >>> randomized_logistic = RandomizedLogisticRegression() - Notes - ----- - For an example, see :ref:`examples/linear_model/plot_sparse_recovery.py - `. - References ---------- Stability selection @@ -590,8 +576,6 @@ def lasso_stability_path(X, y, scaling=0.5, random_state=None, verbose=False): """Stability path based on randomized Lasso estimates - Read more in the :ref:`User Guide `. - Parameters ---------- X : array-like, shape = [n_samples, n_features] @@ -638,11 +622,6 @@ def lasso_stability_path(X, y, scaling=0.5, random_state=None, scores_path : array, shape = [n_features, n_grid] The scores for each feature along the path. - - Notes - ----- - For an example, see :ref:`examples/linear_model/plot_sparse_recovery.py - `. """ X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo']) rng = check_random_state(random_state) diff --git a/sklearn/metrics/scorer.py b/sklearn/metrics/scorer.py index 7d213ae39aaed..f13068d477b09 100644 --- a/sklearn/metrics/scorer.py +++ b/sklearn/metrics/scorer.py @@ -320,7 +320,7 @@ def _check_multimetric_scoring(estimator, scoring=None): value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each. - See :ref:`multivalued_scorer_wrapping` for an example. + See :ref:`multimetric_grid_search` for an example. If None the estimator's default scorer (if available) is used. The return value in that case will be ``{'score': }``. diff --git a/sklearn/mixture/dpgmm.py b/sklearn/mixture/dpgmm.py index 3d1858c513b2a..75b0b88e9b4cf 100644 --- a/sklearn/mixture/dpgmm.py +++ b/sklearn/mixture/dpgmm.py @@ -672,7 +672,7 @@ class VBGMM(_DPGMMBase): Initialization is with normally-distributed means and identity covariance, for proper convergence. - Read more in the :ref:`User Guide `. + Read more in the :ref:`User Guide `. Parameters ---------- diff --git a/sklearn/model_selection/_search.py b/sklearn/model_selection/_search.py index db41c19218fa7..ebfa1e9bd3e18 100644 --- a/sklearn/model_selection/_search.py +++ b/sklearn/model_selection/_search.py @@ -801,7 +801,7 @@ class GridSearchCV(BaseSearchCV): value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each. - See :ref:`multivalued_scorer_wrapping` for an example. + See :ref:`multimetric_grid_search` for an example. If None, the estimator's default scorer (if available) is used. @@ -1111,7 +1111,7 @@ class RandomizedSearchCV(BaseSearchCV): value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each. - See :ref:`multivalued_scorer_wrapping` for an example. + See :ref:`multimetric_grid_search` for an example. If None, the estimator's default scorer (if available) is used. diff --git a/sklearn/model_selection/_validation.py b/sklearn/model_selection/_validation.py index 1e5ea29740c00..147d741b500b9 100644 --- a/sklearn/model_selection/_validation.py +++ b/sklearn/model_selection/_validation.py @@ -69,7 +69,7 @@ def cross_validate(estimator, X, y=None, groups=None, scoring=None, cv=None, value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each. - See :ref:`multivalued_scorer_wrapping` for an example. + See :ref:`multimetric_grid_search` for an example. 
If None, the estimator's default scorer (if available) is used. @@ -803,8 +803,8 @@ def permutation_test_score(estimator, X, y, groups=None, cv=None, the dataset into train/test set. scoring : string, callable or None, optional, default: None - A single string (see :ref:`_scoring_parameter`) or a callable - (see :ref:`_scoring`) to evaluate the predictions on the test set. + A single string (see :ref:`scoring_parameter`) or a callable + (see :ref:`scoring`) to evaluate the predictions on the test set. If None the estimator's default scorer, if available, is used. diff --git a/sklearn/neighbors/approximate.py b/sklearn/neighbors/approximate.py index 2f297ce68cc56..907b379731a2f 100644 --- a/sklearn/neighbors/approximate.py +++ b/sklearn/neighbors/approximate.py @@ -122,8 +122,6 @@ class LSHForest(BaseEstimator, KNeighborsMixin, RadiusNeighborsMixin): points. Its value does not depend on the norm of the vector points but only on their relative angles. - Read more in the :ref:`User Guide `. - Parameters ---------- diff --git a/sklearn/neighbors/lof.py b/sklearn/neighbors/lof.py index 3559d76cf898a..b3686d69d771b 100644 --- a/sklearn/neighbors/lof.py +++ b/sklearn/neighbors/lof.py @@ -85,8 +85,8 @@ class LocalOutlierFactor(NeighborsBase, KNeighborsMixin, UnsupervisedMixin): p : integer, optional (default=2) Parameter for the Minkowski metric from - :ref:`sklearn.metrics.pairwise.pairwise_distances`. When p = 1, this is - equivalent to using manhattan_distance (l1), and euclidean_distance + :func:`sklearn.metrics.pairwise.pairwise_distances`. When p = 1, this + is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. metric_params : dict, optional (default=None)
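The opening hunks of this patch only rework the citations in the calibration narrative (the sigmoid-shaped reliability diagram of random forests, Niculescu-Mizil & Caruana, Wilks). Purely as an illustration of the behaviour that text describes — this sketch is not part of the patch — the reliability diagram can be drawn with the public ``calibration_curve`` and ``CalibratedClassifierCV`` APIs; the synthetic dataset and every parameter value below are arbitrary choices for the example.

```python
# Illustrative sketch only (not part of the patch above):
# reliability diagram for a raw vs. sigmoid-calibrated random forest.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method='sigmoid', cv=3).fit(X_train, y_train)

for name, clf in [('raw forest', raw), ('sigmoid-calibrated', calibrated)]:
    prob_pos = clf.predict_proba(X_test)[:, 1]          # P(class 1)
    frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=10)
    plt.plot(mean_pred, frac_pos, 's-', label=name)

plt.plot([0, 1], [0, 1], 'k:', label='perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.legend()
plt.show()
```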
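Several covariance.rst hunks move the Ledoit-Wolf, OAS and Minimum Covariance Determinant citations into ``.. topic:: References:`` blocks without touching the estimators themselves. The following sketch — again illustrative, not part of the patch — compares those estimators on a small contaminated sample, so the shrinkage/robustness trade-off the text cites can be seen numerically; the toy covariance matrix and contamination scheme are made up for the example.

```python
# Illustrative sketch only: empirical vs. shrunk vs. robust covariance
# estimators referenced in doc/modules/covariance.rst.
import numpy as np
from sklearn.covariance import EmpiricalCovariance, LedoitWolf, OAS, MinCovDet

rng = np.random.RandomState(0)
real_cov = np.array([[0.8, 0.3],
                     [0.3, 0.4]])
X = rng.multivariate_normal(mean=[0, 0], cov=real_cov, size=500)
X[:25] += 5  # shift a few observations to act as outliers

for Estimator in (EmpiricalCovariance, LedoitWolf, OAS, MinCovDet):
    est = Estimator().fit(X)
    err = np.linalg.norm(est.covariance_ - real_cov)
    print("%-20s Frobenius error vs. true covariance: %.3f"
          % (Estimator.__name__, err))
```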
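``plot_ensemble_oob.py`` is touched only to turn ``[1]`` into a proper ``[1]_`` citation. For readers unfamiliar with the quantity being cited, a hypothetical snippet (dataset and sizes arbitrary) showing how the out-of-bag error is read off a fitted ``RandomForestClassifier``:

```python
# Illustrative sketch only: out-of-bag error of a random forest,
# the quantity discussed in examples/ensemble/plot_ensemble_oob.py.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)
print("OOB error: %.3f" % (1.0 - clf.oob_score_))
```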