Revert "FEA OPTICS: add extract_xi method (scikit-learn#12077)"

This reverts commit 14724c7.
xhluca · Apr 28, 2019 · e89571f
1 parent 7415190
Showing 9 changed files with 90 additions and 815 deletions.
diff --git a/doc/modules/classes.rst b/doc/modules/classes.rst
@@ -114,7 +114,6 @@ Functions
 
    cluster.affinity_propagation
    cluster.cluster_optics_dbscan
-   cluster.cluster_optics_xi
    cluster.compute_optics_graph
    cluster.dbscan
    cluster.estimate_bandwidth

diff --git a/doc/modules/clustering.rst b/doc/modules/clustering.rst
@@ -91,12 +91,6 @@ Overview of clustering methods
      - Non-flat geometry, uneven cluster sizes
      - Distances between nearest points
 
-   * - :ref:`OPTICS <optics>`
-     - minimum cluster membership
-     - Very large ``n_samples``, large ``n_clusters``
-     - Non-flat geometry, uneven cluster sizes, variable cluster density
-     - Distances between points
-
    * - :ref:`Gaussian mixtures <mixture>`
      - many
      - Not scalable
@@ -812,11 +806,6 @@ by black points below.
     be used (e.g., with sparse matrices). This matrix will consume n^2 floats.
     A couple of mechanisms for getting around this are:
 
-    - Use :ref:`OPTICS <optics>` clustering in conjunction with the
-      `extract_dbscan` method. OPTICS clustering also calculates the full
-      pairwise matrix, but only keeps one row in memory at a time (memory
-      complexity n).
-
     - A sparse radius neighborhood graph (where missing entries are presumed to
       be out of eps) can be precomputed in a memory-efficient way and dbscan
       can be run over this with ``metric='precomputed'``.  See
@@ -839,92 +828,6 @@ by black points below.
    Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
    In ACM Transactions on Database Systems (TODS), 42(3), 19.
 
-.. _optics:
-
-OPTICS
-======
-
-The :class:`OPTICS` algorithm shares many similarities with the :class:`DBSCAN`
-algorithm, and can be considered a generalization of DBSCAN that relaxes the
-``eps`` requirement from a single value to a value range. The key difference
-between DBSCAN and OPTICS is that the OPTICS algorithm builds a *reachability*
-graph, which assigns each sample both a ``reachability_`` distance, and a spot
-within the cluster ``ordering_`` attribute; these two attributes are assigned
-when the model is fitted, and are used to determine cluster membership. If
-OPTICS is run with the default value of *inf* set for ``max_eps``, then DBSCAN
-style cluster extraction can be performed repeatedly in linear time for any
-given ``eps`` value using the ``cluster_optics_dbscan`` method. Setting
-``max_eps`` to a lower value will result in shorter run times, and can be
-thought of as the maximum neighborhood radius from each point to find other
-potential reachable points.
-
-.. |optics_results| image:: ../auto_examples/cluster/images/sphx_glr_plot_optics_001.png
-        :target: ../auto_examples/cluster/plot_optics.html
-        :scale: 50
-
-.. centered:: |optics_results|
-
-The *reachability* distances generated by OPTICS allow for variable density
-extraction of clusters within a single data set. As shown in the above plot,
-combining *reachability* distances and data set ``ordering_`` produces a
-*reachability plot*, where point density is represented on the Y-axis, and
-points are ordered such that nearby points are adjacent. 'Cutting' the
-reachability plot at a single value produces DBSCAN like results; all points
-above the 'cut' are classified as noise, and each time that there is a break
-when reading from left to right signifies a new cluster. The default cluster
-extraction with OPTICS looks at the steep slopes within the graph to find
-clusters, and the user can define what counts as a steep slope using the
-parameter ``xi``. There are also other possibilities for analysis on the graph
-itself, such as generating hierarchical representations of the data through
-reachability-plot dendrograms, and the hierarchy of clusters detected by the
-algorithm can be accessed through the ``cluster_hierarchy_`` parameter. The
-plot above has been color-coded so that cluster colors in planar space match
-the linear segment clusters of the reachability plot. Note that the blue and
-red clusters are adjacent in the reachability plot, and can be hierarchically
-represented as children of a larger parent cluster.
-
-.. topic:: Examples:
-
-     * :ref:`sphx_glr_auto_examples_cluster_plot_optics.py`
-
-
-.. topic:: Comparison with DBSCAN
-
-    The results from OPTICS ``cluster_optics_dbscan`` method and DBSCAN are
-    very similar, but not always identical; specifically, labeling of periphery
-    and noise points. This is in part because the first samples of each dense
-    area processed by OPTICS have a large reachability value while being close
-    to other points in their area, and will thus sometimes be marked as noise
-    rather than periphery. This affects adjacent points when they are
-    considered as candidates for being marked as either periphery or noise.
-
-    Note that for any single value of ``eps``, DBSCAN will tend to have a
-    shorter run time than OPTICS; however, for repeated runs at varying ``eps``
-    values, a single run of OPTICS may require less cumulative runtime than
-    DBSCAN. It is also important to note that OPTICS' output is close to
-    DBSCAN's only if ``eps`` and ``max_eps`` are close.
-
-.. topic:: Computational Complexity
-
-    Spatial indexing trees are used to avoid calculating the full distance
-    matrix, and allow for efficient memory usage on large sets of samples.
-    Different distance metrics can be supplied via the ``metric`` keyword.
-
-    For large datasets, similar (but not identical) results can be obtained via
-    `HDBSCAN <https://hdbscan.readthedocs.io>`_. The HDBSCAN implementation is
-    multithreaded, and has better algorithmic runtime complexity than OPTICS,
-    at the cost of worse memory scaling. For extremely large datasets that
-    exhaust system memory using HDBSCAN, OPTICS will maintain *n* (as opposed
-    to *n^2*) memory scaling; however, tuning of the ``max_eps`` parameter
-    will likely need to be used to give a solution in a reasonable amount of
-    wall time.
-
-.. topic:: References:
-
- *  "OPTICS: ordering points to identify the clustering structure."
-    Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander.
-    In ACM Sigmod Record, vol. 28, no. 2, pp. 49-60. ACM, 1999.
-
 .. _birch:
 
 Birch

diff --git a/doc/whats_new/v0.21.rst b/doc/whats_new/v0.21.rst
@@ -89,8 +89,7 @@ Support for Python 3.4 and below has been officially dropped.
 - |MajorFeature| A new clustering algorithm: :class:`cluster.OPTICS`: an
   algoritm related to :class:`cluster.DBSCAN`, that has hyperparameters easier
   to set and that scales better, by :user:`Shane <espg>`,
-  `Adrin Jalali`_, :user:`Erich Schubert <kno10>`, `Hanmin Qin`_, and
-  :user:`Assia Benbihi <assiaben>`.
+  :user:`Adrin Jalali <adrinjalali>`, and :user:`Erich Schubert <kno10>`.
 
 - |Fix| Fixed a bug where :class:`cluster.Birch` could occasionally raise an
   AttributeError. :pr:`13651` by `Joel Nothman`_.

diff --git a/examples/cluster/plot_cluster_comparison.py b/examples/cluster/plot_cluster_comparison.py
@@ -74,20 +74,14 @@
                 'damping': .9,
                 'preference': -200,
                 'n_neighbors': 10,
-                'n_clusters': 3,
-                'min_samples': 20,
-                'xi': 0.05,
-                'min_cluster_size': 0.1}
+                'n_clusters': 3}
 
 datasets = [
     (noisy_circles, {'damping': .77, 'preference': -240,
-                     'quantile': .2, 'n_clusters': 2,
-                     'min_samples': 20, 'xi': 0.25}),
+                     'quantile': .2, 'n_clusters': 2}),
     (noisy_moons, {'damping': .75, 'preference': -220, 'n_clusters': 2}),
-    (varied, {'eps': .18, 'n_neighbors': 2,
-              'min_samples': 5, 'xi': 0.035, 'min_cluster_size': .2}),
-    (aniso, {'eps': .15, 'n_neighbors': 2,
-             'min_samples': 20, 'xi': 0.1, 'min_cluster_size': .2}),
+    (varied, {'eps': .18, 'n_neighbors': 2}),
+    (aniso, {'eps': .15, 'n_neighbors': 2}),
     (blobs, {}),
     (no_structure, {})]
 
@@ -122,9 +116,6 @@
         n_clusters=params['n_clusters'], eigen_solver='arpack',
         affinity="nearest_neighbors")
     dbscan = cluster.DBSCAN(eps=params['eps'])
-    optics = cluster.OPTICS(min_samples=params['min_samples'],
-                            xi=params['xi'],
-                            min_cluster_size=params['min_cluster_size'])
     affinity_propagation = cluster.AffinityPropagation(
         damping=params['damping'], preference=params['preference'])
     average_linkage = cluster.AgglomerativeClustering(
@@ -142,7 +133,6 @@
         ('Ward', ward),
         ('AgglomerativeClustering', average_linkage),
         ('DBSCAN', dbscan),
-        ('OPTICS', optics),
         ('Birch', birch),
         ('GaussianMixture', gmm)
     )

diff --git a/examples/cluster/plot_optics.py b/examples/cluster/plot_optics.py
diff --git a/sklearn/cluster/__init__.py b/sklearn/cluster/__init__.py
@@ -11,8 +11,7 @@
                            FeatureAgglomeration)
 from .k_means_ import k_means, KMeans, MiniBatchKMeans
 from .dbscan_ import dbscan, DBSCAN
-from .optics_ import (OPTICS, cluster_optics_dbscan, compute_optics_graph,
-                      cluster_optics_xi)
+from .optics_ import OPTICS, cluster_optics_dbscan, compute_optics_graph
 from .bicluster import SpectralBiclustering, SpectralCoclustering
 from .birch import Birch
 
@@ -22,7 +21,6 @@
            'DBSCAN',
            'OPTICS',
            'cluster_optics_dbscan',
-           'cluster_optics_xi',
            'compute_optics_graph',
            'KMeans',
            'FeatureAgglomeration',