[MRG] Implementation of OPTICS #1984

Closed
wants to merge 137 commits into from
147323d
OPTICS clustering algorithm
espg May 21, 2013
d18e827
Create plot_optics
espg May 21, 2013
70daac1
pep8 fixes
espg May 22, 2013
4fea7ce
fixed conditional to be pep8
espg May 22, 2013
7cc7e12
updated to match sklearn API
espg Nov 5, 2014
24dac57
removed extra files
espg Nov 5, 2014
7d23d01
plotting example updated, small changes
espg Nov 5, 2014
15e452e
updated OPTICS.labels to OPTICS.labels_
espg Nov 5, 2014
9bed2e6
additional labels_ changes
espg Nov 5, 2014
9d163e3
added stability warning
espg Jan 22, 2015
b18da24
Noise fix; updated example plot
espg Jan 23, 2015
c0e24e5
Changed to match Sklearn API
espg Jul 15, 2015
90e21ac
Forcing 2 parameters
espg Jul 15, 2015
b06c76d
Conforming to API
espg Jul 24, 2015
abbdbf2
Fixing plot example
espg Jul 24, 2015
2c54cdc
Fixed issue with sparse matrices
freemansw1 Jul 24, 2015
58d530a
Another attempt at fixing the sparse matrix error
freemansw1 Jul 24, 2015
6034139
Better checking of sparse arrays
espg Jul 24, 2015
fd5c65f
General cleanup
espg Jul 25, 2015
39660a5
Merge remote-tracking branch 'upstream/master'
espg Jul 25, 2015
6ae8c94
Added unit tests for extract function
freemansw1 Jul 28, 2015
bc9cc1c
Attempting for near 100% coverage
freemansw1 Jul 28, 2015
34b9a6d
Fixed error in unit tests
freemansw1 Jul 28, 2015
8d5ac08
Trimmed extraneous 'if-else' check
espg Jul 28, 2015
286f90e
forcing to check for a warning.
freemansw1 Jul 28, 2015
ae2bddc
Updates to doc strings
espg Jul 30, 2015
55365c5
Style / pep8 changes
espg Jul 31, 2015
e1053b0
Added Narrative Documentation
espg Aug 2, 2015
965225e
Vectorized nneighbors lookups
espg Aug 2, 2015
1bda3b1
fixing init build error
espg Aug 2, 2015
1827abf
reverting init
espg Aug 2, 2015
08389fc
All code now vectorized
espg Aug 4, 2015
d9c2bb1
Style changes
espg Aug 4, 2015
659a697
Changing parameter style
espg Aug 7, 2015
bec154f
Extraction change; Authors update
espg Sep 21, 2015
5891945
Changed eps scaling to 5x instead of 10x
espg Sep 21, 2015
9427f1e
Fixing unit test
espg Sep 21, 2015
6462b5e
Actually fixing unit test
espg Sep 21, 2015
2785864
Making ordering_ and other attributes public
espg Sep 21, 2015
8c4572b
Updates for Documentation
espg Sep 21, 2015
3d5e0fc
Pep8 cleanup
espg Sep 21, 2015
02bb8c0
updating plot example to match new attribute name
espg Sep 22, 2015
3dd2d29
CamelCase fixes
espg Oct 19, 2015
e52cc1e
adding hierarchical extraction function
espg Oct 19, 2015
fcd5f70
added hierarchical switch to extract
espg Oct 19, 2015
4d7e3be
cluster order bug fix
espg Oct 20, 2015
b31ae1a
removed hierarchical cluster extraction
espg Oct 20, 2015
92ba90d
initial import of automatic cluster extraction code
espg Oct 20, 2015
768f7c6
wrapper for 'auto' extraction
espg Oct 21, 2015
f4d88de
test and example updates
espg Oct 21, 2015
b1aeb52
fixing unit coverage; pruning unused functions
espg Oct 21, 2015
8350711
Added 'filter' fuction
espg Nov 9, 2015
31df5eb
Vectorizing auto_cluster
espg Nov 9, 2015
428d7d4
updated filter function
espg Nov 18, 2015
517a72e
fixing test error
espg Aug 9, 2016
b7daad4
removing exception handling in favor of conditional check
Aug 10, 2016
a08543b
Merge pull request #2 from Broham/exception-handling
espg Aug 10, 2016
2019886
Updated unit tests
espg Aug 27, 2016
6adb41f
Additional unit test
espg Aug 27, 2016
78be6e4
Fix unit test bug / python 3 compat
espg Aug 27, 2016
da1ed0a
Fixed annoying deprecation warning for 1d array in BallTree
kflansburg Oct 6, 2016
e86f14a
More PEP8 and remove print statements fromt est
kflansburg Oct 6, 2016
b5db832
Merge pull request #3 from kflansburg/master
espg Oct 15, 2016
0b4cbdd
70 to 80% faster, fixed distance metrics
espg Oct 15, 2016
da58058
Fixing unit test failure
espg Oct 16, 2016
625aaa5
Exposed auto_cluster parameters as public
espg Oct 17, 2016
1aa913d
Fixed def with missing ':'
espg Oct 17, 2016
950fbb5
Fix bugs from api change...
espg Oct 17, 2016
9a04d0f
pep8 / pyflakes changes
espg Oct 18, 2016
5876f15
Updated example / plot
espg Oct 18, 2016
38784fd
Tuning plot example / pep8 change
espg Oct 18, 2016
11b13c7
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn
espg Oct 28, 2016
0f2aff0
Bug fix for commit 0b4cbdd (enforce stable sort)
espg Nov 1, 2016
ff59289
Code review fixes (style)
espg Nov 6, 2016
b126ebb
fixing new unit test
espg Nov 6, 2016
3998a85
refactored min_heap to c extension
espg Nov 25, 2016
b9ffa6d
minor fixes..
espg Nov 25, 2016
4c96df3
small cython optimizations
espg Nov 26, 2016
f921d0e
cython fixes
espg Dec 6, 2016
9424090
Add foo.txt
espg Dec 6, 2016
b1c88f6
Remove foo.txt
espg Dec 6, 2016
f6ee95f
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn
espg Dec 6, 2016
ea18017
fix compilation error
espg Dec 6, 2016
5b3a054
last optimizations cython/numpy
espg Dec 7, 2016
886bacc
fix pyflakes errors; change default eps value
espg Dec 7, 2016
00c50c6
_
espg Dec 7, 2016
8ce31ba
API changes from agramfort
espg Jul 15, 2017
c4711df
fixed core samples bug
espg Jul 15, 2017
88334e2
added fit_predict
espg Jul 15, 2017
5b4c7ac
updates to variable names; update plot
espg Jul 16, 2017
9d01ced
refactor to remove balltree specific code
espg Jul 16, 2017
b1a0e5f
major refactor
espg Jul 16, 2017
678330f
fixed bugs; test all pass again
espg Jul 16, 2017
e80a088
fixed weird cython bug
espg Jul 17, 2017
356609b
major refactor
espg Jul 20, 2017
9d57d09
Updated Documentation!
espg Sep 7, 2017
5add240
Merge branch 'heads/scikit-learn/master'
espg Sep 8, 2017
000c20f
fix flake8 error
espg Sep 8, 2017
af28770
added optics to cluster comparison
espg Sep 12, 2017
14ce7e8
Updated comparison plot to transpose
espg Sep 12, 2017
49ccd11
small fix
espg Sep 12, 2017
c198a93
flake8 error
espg Sep 12, 2017
050d2f3
reverting transpose of cluster comparison (seperate PR #9739)
espg Sep 20, 2017
610747e
fixes from agramfort's review
espg Oct 10, 2017
d5e2904
fix for error message
espg Oct 24, 2017
b41c71b
force cluster_id's to start at 0
espg Oct 24, 2017
e95d5d7
Fix sp. error and flake8 warning(s)
espg Oct 28, 2017
f48151a
Updated documentation
espg Nov 14, 2017
a8dff69
Removed extraneous files
espg Nov 14, 2017
61d049c
fixing lgtm alert
espg Nov 19, 2017
267129e
changes from jnothman
espg Dec 7, 2017
c9dff06
Fixes from jnothman's review
espg Feb 13, 2018
92e6824
Fixing flake8 error
espg Feb 13, 2018
547b255
Removed neighbors / balltree inheritance
espg Feb 14, 2018
98d76b7
Made nbrs private and moved initiation to fit()
espg Feb 15, 2018
2eac9fb
fixed non-standard characters
espg May 18, 2018
0b5fa48
Response to TomDLT review
espg May 23, 2018
085e6ae
Fixed labeling bug
espg May 23, 2018
26e9425
update unit test
espg May 23, 2018
fef8ccd
Simple fixes per jnothman
espg Jun 1, 2018
116e0fe
Auto-cluster tests
espg Jun 13, 2018
a98b37d
fixing test error
espg Jun 13, 2018
4932a16
removed python loop
espg Jun 13, 2018
8292495
Fixing test error
espg Jun 13, 2018
9dc2ca5
Fix typo in unit test
espg Jun 13, 2018
51f7009
documentation updates
espg Jun 14, 2018
04f7337
Merge remote-tracking branch 'remotes/upstream/master'
espg Jun 14, 2018
28d8ac4
Post-merge doctest fix
espg Jun 14, 2018
f83d38e
DBSCAN / OPTICS invariant test
espg Jul 5, 2018
338019a
Update _auto_cluster docstring
espg Jul 5, 2018
c4354c5
changes fro jnothman
espg Jul 5, 2018
c9a94e9
fix spelling error in tests
espg Jul 5, 2018
33d74d7
contingency_matrix test
espg Jul 11, 2018
b32a9c1
small unit test updates per jnothman
espg Jul 11, 2018
20823be
unit test typo fix
espg Jul 11, 2018
0b4c677
extract dbscan updates
espg Jul 12, 2018
369e4b5
updated documentation comparing OPTICS/DBSCAN
espg Jul 12, 2018
+1,406 −0
@@ -98,6 +98,7 @@ Classes
cluster.AgglomerativeClustering
cluster.Birch
cluster.DBSCAN
cluster.OPTICS
cluster.FeatureAgglomeration
cluster.KMeans
cluster.MiniBatchKMeans
@@ -112,6 +113,7 @@ Functions

cluster.affinity_propagation
cluster.dbscan
cluster.optics
cluster.estimate_bandwidth
cluster.k_means
cluster.mean_shift
@@ -91,6 +91,12 @@ Overview of clustering methods
     - Non-flat geometry, uneven cluster sizes
     - Distances between nearest points

   * - :ref:`OPTICS <optics>`
     - minimum cluster membership
     - Very large ``n_samples``, large ``n_clusters``
     - Non-flat geometry, uneven cluster sizes, variable cluster density
     - Distances between points

   * - :ref:`Gaussian mixtures <mixture>`
     - many
     - Not scalable
@@ -796,6 +802,10 @@ by black points below.
be used (e.g. with sparse matrices). This matrix will consume n^2 floats.
A couple of mechanisms for getting around this are:

- Use OPTICS clustering in conjunction with the `extract_dbscan` method. OPTICS
  clustering also calculates the full pairwise matrix, but only keeps one row in
  memory at a time (memory complexity n).

- A sparse radius neighborhood graph (where missing entries are presumed to
  be out of eps) can be precomputed in a memory-efficient way and dbscan
  can be run over this with ``metric='precomputed'``. See
@@ -814,6 +824,105 @@ by black points below.
In Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996

.. _optics:

OPTICS
======

The :class:`OPTICS` algorithm shares many similarities with the
:class:`DBSCAN` algorithm, and can be considered a generalization of DBSCAN
that relaxes the ``eps`` requirement from a single value to a value range.
The key difference between DBSCAN and OPTICS is that the OPTICS algorithm
builds a *reachability* graph, which assigns each sample both a
``reachability_`` distance and a position within the cluster ``ordering_``
attribute; these two attributes are assigned when the model is fitted, and
are used to determine cluster membership. If OPTICS is run with the default
value of *inf* set for ``max_bound``, then DBSCAN-style cluster extraction
can be performed in linear time for any given ``eps`` value using the
``extract_dbscan`` method. Setting ``max_bound`` to a lower value will result
in shorter run times, and can be thought of as the maximum cluster object
size (in diameter) that OPTICS will be able to extract.
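
As a rough illustration of how the ordering and reachability distances described above come about, here is a minimal brute-force sketch in plain Python. It is not the code from this pull request: the function name ``optics_order`` is hypothetical, the spatial index is replaced by an O(n^2) scan, and the convention that a point counts as its own first neighbour is an assumption made for the sketch.

```python
import math


def optics_order(points, min_samples, max_bound=math.inf):
    """Naive O(n^2) OPTICS pass: returns (ordering, reachability, core_dist).

    Illustrative only; `max_bound` mirrors the parameter named in the text,
    but its handling here is an assumption of this sketch.
    """
    n = len(points)

    def dist(i, j):
        return math.dist(points[i], points[j])

    # Core distance: distance to the min_samples-th nearest neighbour,
    # counting the point itself; undefined (inf) if beyond max_bound.
    core_dist = []
    for i in range(n):
        row = sorted(dist(i, j) for j in range(n))
        cd = row[min_samples - 1]
        core_dist.append(cd if cd <= max_bound else math.inf)

    reachability = [math.inf] * n
    processed = [False] * n
    ordering = []

    for _ in range(n):
        # Next point: the unprocessed one with the smallest reachability.
        # min() keeps the first minimum, so ties go to the lowest index.
        current = min((i for i in range(n) if not processed[i]),
                      key=lambda i: reachability[i])
        processed[current] = True
        ordering.append(current)
        if core_dist[current] == math.inf:
            continue  # not a core point: it cannot lower any reachability
        for j in range(n):
            if processed[j]:
                continue
            d = dist(current, j)
            if d > max_bound:
                continue
            # Reachability of j from current: never below current's core distance.
            new_reach = max(core_dist[current], d)
            if new_reach < reachability[j]:
                reachability[j] = new_reach

    return ordering, reachability, core_dist


# Two tight groups of three points, far apart from each other.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
ordering, reachability, core_dist = optics_order(points, min_samples=2)
```

In this toy data set the first point processed keeps a reachability of ``inf``, and the jump between the two groups shows up as a single large reachability value at the point where the second group is entered.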

.. |optics_results| image:: ../auto_examples/cluster/images/sphx_glr_plot_optics_001.png
        :target: ../auto_examples/cluster/plot_optics.html
        :scale: 50

.. centered:: |optics_results|

.. topic:: Examples:

     * :ref:`sphx_glr_auto_examples_cluster_plot_optics.py`

The *reachability* distances generated by OPTICS allow for variable density
extraction of clusters within a single data set. As shown in the above plot,
combining *reachability* distances and data set ``ordering_`` produces a
*reachability plot*, where reachability distance is shown on the Y-axis (low
values indicate dense regions), and points are ordered such that nearby points
are adjacent. 'Cutting' the reachability plot at a single value produces
DBSCAN-like results; all points above the 'cut' are classified as noise, and
each break encountered when reading from left to right signifies a new
cluster. The default cluster extraction with OPTICS looks at changes in slope
within the graph to guess at natural clusters. There are also other
possibilities for analysis on the graph itself, such as generating
hierarchical representations of the data through reachability-plot
dendrograms. The plot above has been color-coded so that cluster colors in
planar space match the linear segment clusters of the reachability plot. Note
that the blue and red clusters are adjacent in the reachability plot, and can
be hierarchically represented as children of a larger parent cluster.
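
The 'cut' described above can be sketched in a few lines of plain Python. This follows the ExtractDBSCAN rule from the Ankerst et al. paper cited in the references, which also consults the core distance of points above the cut; it is not necessarily identical to the ``extract_dbscan`` method in this pull request. The function name ``cut_reachability`` and the sample values are hypothetical.

```python
def cut_reachability(ordering, reachability, core_dist, eps):
    """Label points DBSCAN-style by cutting the reachability plot at eps.

    All three sequences are indexed by sample; `ordering` gives the order
    in which samples appear on the reachability plot. Noise is labelled -1.
    """
    labels = [-1] * len(ordering)
    cluster_id = -1
    for point in ordering:  # read the plot from left to right
        if reachability[point] > eps:
            # A bar above the cut: any previous cluster ends here.
            if core_dist[point] <= eps:
                cluster_id += 1  # dense enough to seed a new cluster
                labels[point] = cluster_id
            # otherwise the point stays labelled as noise
        else:
            labels[point] = cluster_id
    return labels


# Hand-written toy values: two dense runs separated by one large jump.
inf = float("inf")
ordering = [0, 1, 2, 3, 4, 5]
reachability = [inf, 1.0, 1.0, 13.5, 1.0, 1.0]
core_dist = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

labels_wide = cut_reachability(ordering, reachability, core_dist, eps=2.0)
labels_tight = cut_reachability(ordering, reachability, core_dist, eps=0.5)
```

With the wide cut the two runs become two clusters; with the tight cut every bar lies above the threshold and no point is dense enough to seed a cluster, so everything is noise. Each extraction is a single pass over the ordering, which is why repeated cuts at different ``eps`` values are cheap once the model has been fitted.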

.. topic:: Comparison with DBSCAN

    The results from the OPTICS ``extract_dbscan`` method and DBSCAN are not
    quite identical. Specifically, while *core_samples* returned from both
    OPTICS and DBSCAN are guaranteed to be identical, labeling of periphery
    and noise points is not. This is in part because the first sample
    processed by OPTICS will always have a reachability distance that is set
    to ``inf``, and will thus generally be marked as noise rather than
    periphery. This affects adjacent points when they are considered as
    candidates for being marked as either periphery or noise. While this
    effect is quite local to the starting point of the dataset and is unlikely
    to be noticed on even moderately large datasets, it is also worth noting
    that non-core boundary points may switch cluster labels on the rare
    occasion that they are equidistant to a competing cluster, due to how the
    graph is read from left to right when assigning labels.

    Note that for any single value of ``eps``, DBSCAN will tend to have a
    shorter run time than OPTICS; however, for repeated runs at varying
    ``eps`` values, a single run of OPTICS may require less cumulative runtime
    than DBSCAN. It is also important to note that OPTICS output can be
    unstable at ``eps`` values very close to the initial ``max_bound`` value.
    OPTICS seems to produce near identical results to DBSCAN provided that
    the ``eps`` passed to ``extract_dbscan`` is half an order of magnitude
    less than the initial ``max_bound`` that was used to fit; using a value
    close to ``max_bound`` will raise a warning, and using a larger value
    will result in an exception.

.. topic:: Computational Complexity

    Spatial indexing trees are used to avoid calculating the full distance
    matrix, and allow for efficient memory usage on large sets of samples.
    Different distance metrics can be supplied via the ``metric`` keyword.

    For large datasets, similar (but not identical) results can be obtained via
    `HDBSCAN <https://hdbscan.readthedocs.io>`_. The HDBSCAN implementation is
    multithreaded, and has better algorithmic runtime complexity than OPTICS,
    at the cost of worse memory scaling. For extremely large datasets that
    exhaust system memory using HDBSCAN, OPTICS will maintain *n* (as opposed
    to *n^2*) memory scaling; however, tuning of the ``max_bound`` parameter
    will likely need to be used to give a solution in a reasonable amount of
    wall time.
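
To make the *n* versus *n^2* memory point concrete, the following sketch computes every core distance while holding no more than one row of pairwise distances at a time. It is a brute-force illustration in plain Python, not the tree-based lookup used by the implementation; the name ``core_distances`` and the convention that a point is its own first neighbour are assumptions.

```python
import heapq
import math


def core_distances(points, min_samples):
    """Core distance of every point without storing the full distance matrix."""
    result = []
    for p in points:
        # One row of the pairwise matrix, produced lazily and discarded
        # once the min_samples smallest entries have been selected.
        row = (math.dist(p, q) for q in points)
        result.append(heapq.nsmallest(min_samples, row)[-1])
    return result


points = [(0, 0), (3, 4), (6, 8)]
core_2 = core_distances(points, min_samples=2)
core_3 = core_distances(points, min_samples=3)
```

The running time is still quadratic in the number of samples; only the peak memory is reduced, which is the trade-off the paragraph above describes.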

.. topic:: References:

 * "OPTICS: ordering points to identify the clustering structure."
   Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander.
   In ACM Sigmod Record, vol. 28, no. 2, pp. 49-60. ACM, 1999.

 * "Automatic extraction of clusters from hierarchical clustering
   representations."
   Sander, Jörg, Xuejie Qin, Zhiyong Lu, Nan Niu, and Alex Kovarsky.
   In Advances in knowledge discovery and data mining,
   pp. 75-87. Springer Berlin Heidelberg, 2003.

.. _birch:

Birch
@@ -116,6 +116,8 @@
n_clusters=params['n_clusters'], eigen_solver='arpack',
affinity="nearest_neighbors")
dbscan = cluster.DBSCAN(eps=params['eps'])
optics = cluster.OPTICS(min_samples=30, maxima_ratio=.8,
                        rejection_ratio=.4)
affinity_propagation = cluster.AffinityPropagation(
damping=params['damping'], preference=params['preference'])
average_linkage = cluster.AgglomerativeClustering(
@@ -133,6 +135,7 @@
('Ward', ward),
('AgglomerativeClustering', average_linkage),
('DBSCAN', dbscan),
('OPTICS', optics),
('Birch', birch),
('GaussianMixture', gmm)
)
@@ -0,0 +1,93 @@
"""
===================================
Demo of OPTICS clustering algorithm
===================================
Finds core samples of high density and expands clusters from them.
This example uses data that is generated so that the clusters have
different densities.
"""

# Authors: Shane Grigsby <refuge@rocktalus.com>
# Amy X. Zhang <axz@mit.edu>
# License: BSD 3 clause


import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np

from sklearn.cluster import OPTICS

# Generate sample data

np.random.seed(0)
n_points_per_cluster = 250

C1 = [-5, -2] + .8 * np.random.randn(n_points_per_cluster, 2)
C2 = [4, -1] + .1 * np.random.randn(n_points_per_cluster, 2)
C3 = [1, -2] + .2 * np.random.randn(n_points_per_cluster, 2)
C4 = [-2, 3] + .3 * np.random.randn(n_points_per_cluster, 2)
C5 = [3, -2] + 1.6 * np.random.randn(n_points_per_cluster, 2)
C6 = [5, 6] + 2 * np.random.randn(n_points_per_cluster, 2)
X = np.vstack((C1, C2, C3, C4, C5, C6))

clust = OPTICS(min_samples=9, rejection_ratio=0.5)

# Run the fit
clust.fit(X)

_, labels_025 = clust.extract_dbscan(0.25)
_, labels_075 = clust.extract_dbscan(0.75)

space = np.arange(len(X))
reachability = clust.reachability_[clust.ordering_]
labels = clust.labels_[clust.ordering_]

plt.figure(figsize=(10, 7))
G = gridspec.GridSpec(2, 3)
ax1 = plt.subplot(G[0, :])
ax1.set_ylabel('Reachability (epsilon distance)')
ax1.set_title('Reachability Plot')
ax2 = plt.subplot(G[1, 0])
ax2.set_title('Automatic Clustering')
ax3 = plt.subplot(G[1, 1])
ax3.set_title('Clustering at 0.25 epsilon cut')
ax4 = plt.subplot(G[1, 2])
ax4.set_title('Clustering at 0.75 epsilon cut')

# Reachability plot
color = ['g.', 'r.', 'b.', 'y.', 'c.']
for k, c in zip(range(0, 5), color):
    Xk = space[labels == k]
    Rk = reachability[labels == k]
    ax1.plot(Xk, Rk, c, alpha=0.3)
ax1.plot(space[labels == -1], reachability[labels == -1], 'k.', alpha=0.3)
ax1.plot(space, np.ones_like(space) * 0.75, 'k-', alpha=0.5)
ax1.plot(space, np.ones_like(space) * 0.25, 'k-.', alpha=0.5)

# OPTICS
color = ['g.', 'r.', 'b.', 'y.', 'c.']
for k, c in zip(range(0, 5), color):
    Xk = X[clust.labels_ == k]
    ax2.plot(Xk[:, 0], Xk[:, 1], c, alpha=0.3)
ax2.plot(X[clust.labels_ == -1, 0], X[clust.labels_ == -1, 1], 'k+', alpha=0.1)

# DBSCAN at 0.25
color = ['g', 'greenyellow', 'olive', 'r', 'b', 'c']
for k, c in zip(range(0, 6), color):
    Xk = X[labels_025 == k]
    ax3.plot(Xk[:, 0], Xk[:, 1], c, alpha=0.3, marker='.')
ax3.plot(X[labels_025 == -1, 0], X[labels_025 == -1, 1], 'k+', alpha=0.1)

# DBSCAN at 0.75
color = ['g.', 'm.', 'y.', 'c.']
for k, c in zip(range(0, 4), color):
    Xk = X[labels_075 == k]
    ax4.plot(Xk[:, 0], Xk[:, 1], c, alpha=0.3)
ax4.plot(X[labels_075 == -1, 0], X[labels_075 == -1, 1], 'k+', alpha=0.1)

plt.tight_layout()
plt.show()
@@ -11,13 +11,15 @@
FeatureAgglomeration)
from .k_means_ import k_means, KMeans, MiniBatchKMeans
from .dbscan_ import dbscan, DBSCAN
from .optics_ import OPTICS
from .bicluster import SpectralBiclustering, SpectralCoclustering
from .birch import Birch

__all__ = ['AffinityPropagation',
'AgglomerativeClustering',
'Birch',
'DBSCAN',
'OPTICS',
'KMeans',
'FeatureAgglomeration',
'MeanShift',
@@ -0,0 +1,31 @@
cimport numpy as np
import numpy as np
cimport cython

ctypedef np.float64_t DTYPE_t
ctypedef np.int_t DTYPE

@cython.boundscheck(False)
@cython.wraparound(False)
# Checks for smallest reachability distance
# In case of tie, preserves order and returns first instance
# as sorted by distance
cpdef quick_scan(double[:] rdists, double[:] dists):
    cdef Py_ssize_t n
    cdef int idx
    cdef int i
    cdef double rdist
    cdef double dist
    rdist = np.inf
    dist = np.inf
    # Default to the first entry so idx is always defined on return
    idx = 0
    n = len(rdists)
    for i in range(n):
        if rdists[i] < rdist:
            rdist = rdists[i]
            dist = dists[i]
            idx = i
        if rdists[i] == rdist:
            if dists[i] < dist:
                dist = dists[i]
                idx = i
    return idx
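
For readers who want to check the tie-breaking behaviour without compiling the extension, the scan can be mirrored in plain Python. This is a sketch written for illustration, not part of the pull request; falling back to index 0 when nothing compares below ``inf`` is an assumption of the sketch.

```python
import math


def quick_scan(rdists, dists):
    """Index of the smallest reachability; ties go to the smaller distance.

    If both values tie, the earliest index wins.
    """
    best_rdist = math.inf
    best_dist = math.inf
    idx = 0  # assumption: fall back to index 0 if no entry beats inf
    for i, (r, d) in enumerate(zip(rdists, dists)):
        if r < best_rdist:
            best_rdist, best_dist, idx = r, d, i
        elif r == best_rdist and d < best_dist:
            best_dist, idx = d, i
    return idx


tie_on_reachability = quick_scan([3.0, 1.0, 1.0, 2.0], [0.5, 0.9, 0.4, 0.1])
full_tie = quick_scan([1.0, 1.0], [0.3, 0.3])
all_unreached = quick_scan([math.inf, math.inf], [2.0, 1.0])
```

The first call has two candidates with the same reachability and picks the one with the smaller plain distance; the second shows that a full tie keeps the earliest entry; the third shows that when every reachability is still ``inf`` the plain distance alone decides.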