Revert "FEA OPTICS: add extract_xi method (scikit-learn#12077)"
This reverts commit 14724c7.
Xing committed Apr 28, 2019
1 parent 7415190 commit e89571f
Showing 9 changed files with 90 additions and 815 deletions.
1 change: 0 additions & 1 deletion doc/modules/classes.rst
@@ -114,7 +114,6 @@ Functions

cluster.affinity_propagation
cluster.cluster_optics_dbscan
cluster.cluster_optics_xi
cluster.compute_optics_graph
cluster.dbscan
cluster.estimate_bandwidth
97 changes: 0 additions & 97 deletions doc/modules/clustering.rst
@@ -91,12 +91,6 @@ Overview of clustering methods
- Non-flat geometry, uneven cluster sizes
- Distances between nearest points

* - :ref:`OPTICS <optics>`
- minimum cluster membership
- Very large ``n_samples``, large ``n_clusters``
- Non-flat geometry, uneven cluster sizes, variable cluster density
- Distances between points

* - :ref:`Gaussian mixtures <mixture>`
- many
- Not scalable
@@ -812,11 +806,6 @@ by black points below.
be used (e.g., with sparse matrices). This matrix will consume n^2 floats.
A couple of mechanisms for getting around this are:

- Use :ref:`OPTICS <optics>` clustering in conjunction with the
`extract_dbscan` method. OPTICS clustering also calculates the full
pairwise matrix, but only keeps one row in memory at a time (memory
complexity n).

- A sparse radius neighborhood graph (where missing entries are presumed to
be out of eps) can be precomputed in a memory-efficient way and dbscan
can be run over this with ``metric='precomputed'``. See
@@ -839,92 +828,6 @@ by black points below.
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
In ACM Transactions on Database Systems (TODS), 42(3), 19.

.. _optics:

OPTICS
======

The :class:`OPTICS` algorithm shares many similarities with the :class:`DBSCAN`
algorithm, and can be considered a generalization of DBSCAN that relaxes the
``eps`` requirement from a single value to a value range. The key difference
between DBSCAN and OPTICS is that the OPTICS algorithm builds a *reachability*
graph, which assigns each sample both a ``reachability_`` distance, and a spot
within the cluster ``ordering_`` attribute; these two attributes are assigned
when the model is fitted, and are used to determine cluster membership. If
OPTICS is run with the default value of *inf* set for ``max_eps``, then DBSCAN
style cluster extraction can be performed repeatedly in linear time for any
given ``eps`` value using the ``cluster_optics_dbscan`` method. Setting
``max_eps`` to a lower value will result in shorter run times, and can be
thought of as the maximum neighborhood radius from each point to find other
potential reachable points.
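
The repeated extraction described above can be sketched as follows. This is a minimal illustration, not part of the reverted documentation: it assumes a released scikit-learn (0.21 or later) in which ``OPTICS`` and ``cluster_optics_dbscan`` are importable from ``sklearn.cluster``, and the two-blob data set and the ``eps`` values are made up for the example.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Hypothetical toy data: two tight, well-separated Gaussian blobs.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.2, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.2, size=(50, 2)),
])

# Fit once; max_eps keeps its default of inf, so any eps can be extracted later.
optics = OPTICS(min_samples=5).fit(X)

# Reuse the fitted reachability graph for several eps values
# without recomputing any distances.
labels_by_eps = {}
for eps in (0.3, 1.0):
    labels_by_eps[eps] = cluster_optics_dbscan(
        reachability=optics.reachability_,
        core_distances=optics.core_distances_,
        ordering=optics.ordering_,
        eps=eps,
    )
```

Each entry of ``labels_by_eps`` is a label array in the original sample order, with ``-1`` marking noise, so the extractions at different ``eps`` values can be compared directly.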

.. |optics_results| image:: ../auto_examples/cluster/images/sphx_glr_plot_optics_001.png
:target: ../auto_examples/cluster/plot_optics.html
:scale: 50

.. centered:: |optics_results|

The *reachability* distances generated by OPTICS allow for variable density
extraction of clusters within a single data set. As shown in the above plot,
combining *reachability* distances and data set ``ordering_`` produces a
*reachability plot*, where point density is represented on the Y-axis, and
points are ordered such that nearby points are adjacent. 'Cutting' the
reachability plot at a single value produces DBSCAN like results; all points
above the 'cut' are classified as noise, and each time that there is a break
when reading from left to right signifies a new cluster. The default cluster
extraction with OPTICS looks at the steep slopes within the graph to find
clusters, and the user can define what counts as a steep slope using the
parameter ``xi``. There are also other possibilities for analysis on the graph
itself, such as generating hierarchical representations of the data through
reachability-plot dendrograms, and the hierarchy of clusters detected by the
algorithm can be accessed through the ``cluster_hierarchy_`` parameter. The
plot above has been color-coded so that cluster colors in planar space match
the linear segment clusters of the reachability plot. Note that the blue and
red clusters are adjacent in the reachability plot, and can be hierarchically
represented as children of a larger parent cluster.
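
The 'cut' described above can be sketched in plain Python. This is an illustrative toy, not scikit-learn's implementation: the function name ``cut_reachability`` and the hand-written reachability and core-distance values are hypothetical, and the inputs are assumed to be listed already in OPTICS processing order.

```python
import math


def cut_reachability(reachability, core_distances, eps):
    """Label points (given in OPTICS order) by cutting the plot at eps.

    A point whose reachability exceeds eps starts a new cluster if it is
    itself a core point at eps; otherwise it is noise (-1). A point at or
    below the cut joins the cluster that is currently open.
    """
    labels = []
    current = -1  # index of the cluster currently being filled
    for reach, core in zip(reachability, core_distances):
        if reach > eps:
            if core <= eps:
                current += 1  # break in the plot: a new cluster begins
                labels.append(current)
            else:
                labels.append(-1)  # above the cut and not core: noise
        else:
            labels.append(current)
    return labels


# Hypothetical reachability plot: two valleys followed by one isolated point.
reachability = [math.inf, 0.2, 0.25, 0.3, 1.5, 0.2, 0.22, 2.0]
core_distances = [0.2, 0.2, 0.25, 0.3, 0.2, 0.2, 0.22, 2.5]

low_cut = cut_reachability(reachability, core_distances, eps=0.5)
# low_cut == [0, 0, 0, 0, 1, 1, 1, -1]: two clusters plus one noise point

high_cut = cut_reachability(reachability, core_distances, eps=1.8)
# high_cut == [0, 0, 0, 0, 0, 0, 0, -1]: the valleys merge into one cluster
```

Raising the cut merges the two valleys into a single cluster, which is the sense in which one reachability plot encodes DBSCAN-like results for a whole range of ``eps`` values.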

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_cluster_plot_optics.py`


.. topic:: Comparison with DBSCAN

The results from OPTICS ``cluster_optics_dbscan`` method and DBSCAN are
very similar, but not always identical; specifically, labeling of periphery
and noise points. This is in part because the first samples of each dense
area processed by OPTICS have a large reachability value while being close
to other points in their area, and will thus sometimes be marked as noise
rather than periphery. This affects adjacent points when they are
considered as candidates for being marked as either periphery or noise.

Note that for any single value of ``eps``, DBSCAN will tend to have a
shorter run time than OPTICS; however, for repeated runs at varying ``eps``
values, a single run of OPTICS may require less cumulative runtime than
DBSCAN. It is also important to note that OPTICS' output is close to
DBSCAN's only if ``eps`` and ``max_eps`` are close.

.. topic:: Computational Complexity

Spatial indexing trees are used to avoid calculating the full distance
matrix, and allow for efficient memory usage on large sets of samples.
Different distance metrics can be supplied via the ``metric`` keyword.

For large datasets, similar (but not identical) results can be obtained via
`HDBSCAN <https://hdbscan.readthedocs.io>`_. The HDBSCAN implementation is
multithreaded, and has better algorithmic runtime complexity than OPTICS,
at the cost of worse memory scaling. For extremely large datasets that
exhaust system memory using HDBSCAN, OPTICS will maintain *n* (as opposed
to *n^2*) memory scaling; however, tuning of the ``max_eps`` parameter
will likely need to be used to give a solution in a reasonable amount of
wall time.

.. topic:: References:

* "OPTICS: ordering points to identify the clustering structure."
Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander.
In ACM Sigmod Record, vol. 28, no. 2, pp. 49-60. ACM, 1999.

.. _birch:

Birch
3 changes: 1 addition & 2 deletions doc/whats_new/v0.21.rst
@@ -89,8 +89,7 @@ Support for Python 3.4 and below has been officially dropped.
- |MajorFeature| A new clustering algorithm: :class:`cluster.OPTICS`: an
algorithm related to :class:`cluster.DBSCAN`, that has hyperparameters easier
to set and that scales better, by :user:`Shane <espg>`,
`Adrin Jalali`_, :user:`Erich Schubert <kno10>`, `Hanmin Qin`_, and
:user:`Assia Benbihi <assiaben>`.
:user:`Adrin Jalali <adrinjalali>`, and :user:`Erich Schubert <kno10>`.

- |Fix| Fixed a bug where :class:`cluster.Birch` could occasionally raise an
AttributeError. :pr:`13651` by `Joel Nothman`_.
18 changes: 4 additions & 14 deletions examples/cluster/plot_cluster_comparison.py
@@ -74,20 +74,14 @@
'damping': .9,
'preference': -200,
'n_neighbors': 10,
'n_clusters': 3,
'min_samples': 20,
'xi': 0.05,
'min_cluster_size': 0.1}
'n_clusters': 3}

datasets = [
(noisy_circles, {'damping': .77, 'preference': -240,
'quantile': .2, 'n_clusters': 2,
'min_samples': 20, 'xi': 0.25}),
'quantile': .2, 'n_clusters': 2}),
(noisy_moons, {'damping': .75, 'preference': -220, 'n_clusters': 2}),
(varied, {'eps': .18, 'n_neighbors': 2,
'min_samples': 5, 'xi': 0.035, 'min_cluster_size': .2}),
(aniso, {'eps': .15, 'n_neighbors': 2,
'min_samples': 20, 'xi': 0.1, 'min_cluster_size': .2}),
(varied, {'eps': .18, 'n_neighbors': 2}),
(aniso, {'eps': .15, 'n_neighbors': 2}),
(blobs, {}),
(no_structure, {})]

@@ -122,9 +116,6 @@
n_clusters=params['n_clusters'], eigen_solver='arpack',
affinity="nearest_neighbors")
dbscan = cluster.DBSCAN(eps=params['eps'])
optics = cluster.OPTICS(min_samples=params['min_samples'],
xi=params['xi'],
min_cluster_size=params['min_cluster_size'])
affinity_propagation = cluster.AffinityPropagation(
damping=params['damping'], preference=params['preference'])
average_linkage = cluster.AgglomerativeClustering(
@@ -142,7 +133,6 @@
('Ward', ward),
('AgglomerativeClustering', average_linkage),
('DBSCAN', dbscan),
('OPTICS', optics),
('Birch', birch),
('GaussianMixture', gmm)
)
98 changes: 0 additions & 98 deletions examples/cluster/plot_optics.py

This file was deleted.

4 changes: 1 addition & 3 deletions sklearn/cluster/__init__.py
@@ -11,8 +11,7 @@
FeatureAgglomeration)
from .k_means_ import k_means, KMeans, MiniBatchKMeans
from .dbscan_ import dbscan, DBSCAN
from .optics_ import (OPTICS, cluster_optics_dbscan, compute_optics_graph,
cluster_optics_xi)
from .optics_ import OPTICS, cluster_optics_dbscan, compute_optics_graph
from .bicluster import SpectralBiclustering, SpectralCoclustering
from .birch import Birch

@@ -22,7 +21,6 @@
'DBSCAN',
'OPTICS',
'cluster_optics_dbscan',
'cluster_optics_xi',
'compute_optics_graph',
'KMeans',
'FeatureAgglomeration',
