
WIP: OPTICS clustering #2043

Closed
wants to merge 8 commits into scikit-learn:master from FredrikAppelros:optics

Conversation

FredrikAppelros

Added an implementation of the OPTICS clustering algorithm. OPTICS does not by itself produce a set of labels for the samples, so we have also implemented a hierarchical cluster extraction algorithm.

As of now, both implementations are located in the optics_.py file, but the extraction algorithm should probably be refactored out. OPTICS can produce labels from its reachability distances and ordering in several different ways, so a modular approach to the extraction methods would be preferred.
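
For reference, a minimal sketch of the DBSCAN-style extraction described in the original OPTICS paper (the names are illustrative; this is not the API used in this PR):

    import numpy as np

    def extract_dbscan_like(ordering, reachability, core_distances, eps):
        """Assign DBSCAN-style labels from an OPTICS ordering at a fixed eps."""
        labels = np.full(len(ordering), -1, dtype=int)  # -1 marks noise
        cluster = -1
        for point in ordering:
            if reachability[point] > eps:
                if core_distances[point] <= eps:
                    cluster += 1               # point starts a new cluster
                    labels[point] = cluster
                # otherwise the point stays labelled as noise
            else:
                labels[point] = cluster        # reachable from the current cluster
        return labels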

We currently use scipy to calculate distances between samples instead of the built-in pairwise_distances in order to reduce memory consumption.

// Fredrik Appelros & Carl Ekerot

@GaelVaroquaux
Member

How does this compare to the pull request #1984 ?

@FredrikAppelros
Author

This implementation is the one we talked about some 3 months ago on the mailing list. I believe it conforms more closely to the scikit-learn standard (it inherits from BaseEstimator and ClusterMixin), and it is also fairly well tested. We have been using it ourselves for 3 months on real data, and tests are included.

Now, implementation-wise, we use cdist from scipy to calculate the distances from the current point to every other point in order to keep memory utilization down to O(n). This is if we are not fed precomputed distances, of course.

We haven't had the time to fully examine the implementation used in #1984 but it seems to use a ball tree structure instead. It would be interesting to compare the performance of the two implementations and possibly integrate the use of a ball tree into this one if it leads to superior performance.

@robertlayton
Member

Thanks for the PR! I was looking to add OPTICS at some point in the future :)

The branch builds, and the code looks quite good. I believe there is still the narrative documentation to go (in doc/modules/clustering.rst).

As a naive question, what does scipy's distance do that sklearn's doesn't? It would be better to have a common interface, and if there is a problem with sklearn.metrics.pairwise_distances, then it would be better to fix that than to have some classes using it and others not.

@FredrikAppelros
Author

Ah, we missed the part about narrative documentation. We'll add that ASAP.

What's lacking in sklearn's metrics module is a way to calculate distances individually rather than pairwise for the whole dataset, as sklearn.metrics.pairwise_distances does. In OPTICS there is no need to know the distance between each point and every other point during the execution of the algorithm; the only knowledge needed is the distance from the current point to every other point. By using this we only ever need to store O(n) distances in memory instead of the O(n^2) distances a full similarity matrix would require. This causes some duplicate calculations, as each distance needs to be calculated twice, but the reduced memory utilization is usually worth it.
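
To make the memory point concrete, here is a rough sketch of the two patterns (X is a placeholder for the samples array and i for the current point; this is not the PR's actual loop):

    import numpy as np
    from scipy.spatial.distance import cdist
    from sklearn.metrics import pairwise_distances

    # O(n^2) memory: the full distance matrix is materialized up front
    D = pairwise_distances(X)              # shape (n_samples, n_samples)

    # O(n) memory: only the current point's distances are held at a time
    d_i = cdist(X[i:i + 1], X)[0]          # shape (n_samples,)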

@robertlayton
Member

Right, I didn't think of that. The reason we have pairwise_distances at all is to allow us to override scipy's metrics, while using the ones we want, and ensuring a consistent interface that can be used with classifiers/clusterers easily. Do you think that a cdist-style function with a similar interface to pairwise_distances would help?

@GaelVaroquaux
Member

> Do you think that a cdist-style function with a similar interface to pairwise_distances would help?

I don't really get the point; pairwise_distances can be used quite similarly to cdist (just give it a 'Y' argument).

The big drawback I see of pairwise_distances over cdist is the overhead when using small arrays. This makes me think that it would be more profitable not to provide an overlay and to use cdist directly.
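
For reference, a quick sketch of the two call patterns being compared (X and i are placeholders; this is not code from the PR):

    from scipy.spatial.distance import cdist
    from sklearn.metrics import pairwise_distances

    # both return the distances from point i to every sample, shape (1, n_samples)
    d_scipy = cdist(X[i:i + 1], X, metric='euclidean')
    d_sklearn = pairwise_distances(X[i:i + 1], X, metric='euclidean')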

@CarlEkerot

Using a structure that optimizes spatial queries, we might be able to avoid using cdist altogether. Also note that the current DBSCAN implementation uses pairwise_distances in exactly the way we've tried to avoid. Perhaps both OPTICS and DBSCAN could be implemented with some data structure for querying neighbors, so that memory utilization won't restrict the number of data points the implementations are able to process.
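
A sketch of the kind of neighbor-query structure being suggested, using the existing sklearn.neighbors.BallTree (variable names are placeholders; this is not code from either PR):

    from sklearn.neighbors import BallTree

    tree = BallTree(X)  # built once
    # distances from point i to its neighbors within eps only,
    # instead of to every other point in the dataset
    ind, dist = tree.query_radius(X[i:i + 1], r=eps, return_distance=True)
    neighbors, distances = ind[0], dist[0]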

@robertlayton
Member

Would it still support non-Euclidean metrics?

@espg
Contributor

espg commented Jun 18, 2013

I did some testing, and it looks like pull request #1984 (my PR) runs a couple of orders of magnitude faster; more details here, but basically the cdist-based approach runs in O(n**2) time. Since sklearn.neighbors.BallTree implements non-Euclidean distance calculations, adding them to OPTICS should be trivial.

I think that #2043 and #1984 should be merged. It's probably easier to bring #1984 up to match the scikit-learn API; its main OPTICS loop and supporting prep functions work, and any spatial index added here would likely repeat much of the code in #1984. The DBSCAN extraction algorithms from an ordered OPTICS list are better here, and could be merged into #1984 painlessly, as could the documentation.

@robertlayton
Member

What is the memory usage like in #1984?

@jnothman
Member

Apparently benchmarked at 250MB compared to 150MB here. See the mentioned post.

@robertlayton
Member

Interesting. So it's memory or speed?

@jnothman
Member

Yes, if you can compare 1.5x memory and 60x speed, assuming the benchmarks are accurate and general.


@FredrikAppelros
Author

I agree with @espg that the two implementations should be merged. We didn't add a spatial indexing structure to our implementation to begin with because we didn't know that sklearn already had implementations for that. I finally found some time to look through both #1984 and the current implementation of BallTree, so now I'm back in the game.

What I'm wondering, though, is where the 20 different distance metrics for BallTree come from. Are you referring to #1732? In the current implementation I just see the p parameter for Minkowski distances. This might be a stupid question, but aren't there distance metrics that can't be represented as Minkowski distances? The problem is that this implementation will be less general until #1732 has been merged into master.
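
To illustrate the concern: the p parameter only covers the Minkowski family, d_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p), and metrics such as cosine distance fall outside it. A small sketch with scipy (illustrative, not code from this PR):

    from scipy.spatial.distance import cosine, minkowski

    x, y = [1.0, 0.0], [0.0, 1.0]
    d_manhattan = minkowski(x, y, 1)   # p=1, city block
    d_euclidean = minkowski(x, y, 2)   # p=2, Euclidean
    d_cosine = cosine(x, y)            # not expressible as a Minkowski distance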

Also, I have to disagree that it would be easier to bring #1984 up to speed, as I just implemented spatial indexing in this one by changing basically 4 lines. I'll push it here as soon as I have done some more testing.

What I would love to see, however, is that benchmark, if you still have it lying around, so that we could compare the new implementation in this PR against the two others (the original implementation here and the one in #1984).

A quick note: I haven't extracted the calculation of core distances into a preparation function as in #1984, as I don't know how sklearn approaches parallelization. It's a very interesting thought, though, and might be worthwhile!

@FredrikAppelros
Author

Okay, we finally had some time to test the spatial indexing implementation, and as far as we can see it should be working now. @espg: as I mentioned earlier, it would be great if you could run your benchmark again, or just publish it so we could try it out ourselves. We're excited to see how much faster the algorithm is now.

@mblondel
Member

mblondel commented Jan 8, 2014

@FredrikAppelros Since this PR has stalled, I think it would be a good idea to extract the source into a github repo or gist and add it to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets

@agramfort
Member

+1

@coveralls

Coverage Status

Changes Unknown when pulling 04aba0b on FredrikAppelros:optics into scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling d16a684 on FredrikAppelros:optics into 0ccdc3a on scikit-learn:master.

si = seeds[ind[0]]
neighbors = ind[0][si]
ds = dist[0][si]
cds = core_dist * np.ones(len(ds))
Member

This is unnecessary. np.maximum(core_dist, ds) should suffice due to broadcasting rules.
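
For reference, a quick illustration of the broadcasting rule being relied on (values are made up):

    import numpy as np

    core_dist = 0.5
    ds = np.array([0.2, 0.7, 0.4])
    # the scalar is broadcast against the array; no np.ones is needed
    np.maximum(core_dist, ds)          # -> array([0.5, 0.7, 0.5])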

@jnothman
Member

jnothman commented Jan 5, 2015

I'm not yet sure about hierarchical_extraction, but I like the optics ordering extraction implementation (although perhaps we should check that a priority queue isn't actually much faster than the masking approach to seeds).
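
For concreteness, a minimal sketch of the two seed-selection strategies being compared (names are illustrative, not from this PR): the masking approach scans the still-unprocessed points on each iteration, while a lazy priority queue pushes an entry whenever a reachability value improves and discards stale entries on pop.

    import heapq
    import numpy as np

    # masking approach: argmin over the still-unprocessed points
    # (`unprocessed` is an index array of points not yet processed)
    def next_seed_masked(reachability, unprocessed):
        return unprocessed[np.argmin(reachability[unprocessed])]

    # lazy priority queue: push (reachability, point) whenever a point's
    # reachability improves; stale entries are skipped when popped
    def next_seed_heap(heap, reachability, processed):
        while heap:
            r, p = heapq.heappop(heap)
            if not processed[p] and r == reachability[p]:
                return p
        return None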

I would, however, do the radius_neighbors query en masse, and use its output alone (rather than a second BallTree query) to pull back the core distances (taking the min_samples-th returned value from the list of radius neighbors), along the following lines:

    # TODO: check boundary cases are correct
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors_graph(X, mode='distance')
    # number of neighbors within eps for each sample
    n_neighbors = np.diff(neighborhoods.indptr)
    core_samples = np.flatnonzero(n_neighbors > min_samples)
    # start of each core sample's row in the flat data array, then take the
    # min_samples-th distance (assumes distances within a row are sorted)
    core_sample_indices = neighborhoods.indptr[core_samples]
    core_distance = neighborhoods.data[core_sample_indices + min_samples]

The OPTICS hierarchical clustering algorithm has been implemented as
well as a heuristic method for extracting clusters from it.
Delayed the pairwise distance calculations until they are
required. This resulted in a reduced memory footprint.
Memory complexity should have gone from O(n^2) to O(n).
* Allowed split point to be included in the right child.

* Added additional checks so that small children near the edges
  are not wrongfully inserted into noise.

* Added min_reach_ratio which is a lower limit for the local maxima
  expressed in terms of the largest reachability distance in the set.
Initial implementation using a BallTree for spatial indexing. Due to
BallTree's lack of a callable metric function, a test case had to be
disabled.
@chebee7i
Contributor

Any further progress on this and/or #1984?

espg added a commit to espg/scikit-learn that referenced this pull request Oct 19, 2015
Added hierarchical extraction from scikit-learn#2043

scikit-learn#2043 is BSD licensed and hasn’t had any activity for 10 months, so
this seems pretty kosher; authors are cited in the code
@phdowling

Is this still being worked on by anyone? If not, can I expect it to work if I merge it locally? It would be nice to try it out for a project I'm working on.

@amueller
Member

amueller commented Oct 7, 2016

Adding the "Need Contributor" tag, either to finish this or to put it into scikit-learn-contrib.

@raghavrv
Member

raghavrv commented Dec 6, 2016

Removing "Need Contributor" in favor of the now-active #1984

GaelVaroquaux added a commit that referenced this pull request Jul 16, 2018
* OPTICS clustering algorithm

Equivalent results to DBSCAN, but allows execution on arbitrarily large datasets. After initial construction, allows multiple 'scans' to quickly extract DBSCAN clusters at variable epsilon distances

* Create plot_optics

Shows example usage of OPTICS to extract clustering structure

* pep8 fixes

Mainly issues in long lines, long comments

* fixed conditional to be pep8

* updated to match sklearn API

OPTICS object, with fit method. Documentation updates.
extract method

* removed extra files

* plotting example updated, small changes

new plot example that matches the updated API
updated n_cluster attribute with reruns of extract
removed scaling factor on first ‘fit’ run

* updated OPTICS.labels to OPTICS.labels_

should pass unit test now?

* additional labels_ changes

* added stability warning

Scales eps to give stable results from first input distance. Extraction
above scaled eps is not allowed; extraction below scaled eps but
greater than input distance prints stability warning but still will run
clustering extraction. All distances below initial input distance are
stable; no warning printed

* Noise fix; updated example plot

Prevented noise points from being initialized as type ‘core point’
Fixed initialization for first ‘fit’ call
Decoupled eps and eps_prime (deep copy)

Matched plot example to same random state as dbscan
Added second plot to show ‘extract’ usage

* Changed to match Sklearn API

eps is not modified in init; kwargs fix

* Forcing 2 parameters

Why do I have to do this?

* Conforming to API

Fit now returns self; test fix for unit test fail in ball tree (asarray
problem)

* Fixing plot example

labels to labels_

* Fixed issue with sparse matrices

* Another attempt at fixing the sparse matrix error

Temporary fix until balltree can be updated to deal with sparse
matrices.

* Better checking of sparse arrays

Using ‘check_array’

* General cleanup

Removed extraneous lines and comments (old commented out code has been
removed)

*  Added unit tests for extract function

Added the unit tests in test_optics for fit, extract, and declaration.
Should bring coverage to ~100%. Additionally, fixed a small bug that
cropped up in the extract function during testing.

* Attempting for near 100% coverage

Removed unused imports, added to get warning.

* Fixed error in unit tests

* Trimmed extraneous 'if-else' check

see title

* forcing to check for a warning.

Result should be 100% coverage

* Updates to doc strings

All public methods now have doc strings; forcing a rebuild of OPTICS so
that build tests pass (last round failed due to external module)

* Style / pep8 changes

99% pep8 now… line 138 isn’t, but reads better with the long variable
names

* Added Narrative Documentation

Includes general description, discussion of algorithm output, and
comparison with DBSCAN. References and implementation notes are included

* Vectorized nneighbors lookups

Following a suggestion from jnothman for doing nneighbors queries en masse.
Added OPTICS to cluster init

* fixing init build error

* reverting init

* All code now vectorized

…at least all code that can be ;)

Some general pruning and cleanup as well

* Style changes

Now 100% pep8

* Changing parameter style

matching DBSCAN

* Extraction change; Authors update

—initialize all points as ‘not core’ (fixes bug when plotting at
epsPrime larger than eps)
—added Sean Freeman to authors list

* Changed eps scaling to 5x instead of 10x

10x scaling is too conservative…eps scaling at 5x is perfectly stable,
and much faster as well.

* Fixing unit test

Should be ‘None’ for this; initialization previously at 1 for ‘is_core’
was incorrect

* Actually fixing unit test

Null comparison doesn’t work, using size

* Making ordering_ and other attributes public

renamed core_samples to core_sample_indices_ (as in DBSCAN). Used
attribute naming conventions (trailing _ character), and made
ordering_, reachability_, and core_dists_ public.

* Updates for Documentation

includes attribute strings for now public attributes

* Pep8 cleanup

Minor pep8 and pyflakes fixes

* updating plot example to match new attribute name

* CamelCase fixes

conforming to sklearn API on CamelCase

* adding hierarchical extraction function

Added hierarchical extraction from #2043

#2043 is BSD licensed and hasn’t had any activity for 10 months, so
this seems pretty kosher; authors are cited in the code

* added hierarchical switch to extract

Additional style and documentation changes as well

* cluster order bug fix

ensured that ordered list is always same length as input data

* removed hierarchical cluster extraction

Code from FredrikAppelros is totally unable to handle noise; as currently written, every point is assigned to a cluster, except the first point of that cluster. May include it later as a third method

* initial import of automatic cluster extraction code

Adding the excellent (and working) automatic cluster extraction code
from Amy X. Zhang. Some minor formatting changes on import to conform
to pep8

* wrapper for 'auto' extraction

additional fixes to style (camelCase, etc.), comments; made all helper
functions private

* test and example updates

Much better example data to showcase ‘auto’ extraction. Unit tests now
test both extraction methods. Set ‘auto’ as default, as it doesn’t
require any parameters and gives a better result. Pruned references to
hierarchical clustering

* fixing unit coverage; pruning unused functions

probably could still get a better unit test for ‘auto’ clustering…

* Added 'filter' function

Allows density-based filtering. Useful for cases where only a
‘noise’/‘no noise’ classification is desired. Function does not require
fitting prior to running, although it can be run after a fit if desired

* Vectorizing auto_cluster

generalizes input to multiple dimensions as well…

* updated filter function

* fixing test error

setting minPts to < data size returns None (with error message)

* removing exception handling in favor of conditional check

* Updated unit tests

Coverage to 90%. PEP8 fixes

* Additional unit test

Now at 94%; auto extract method is very hard to test… this new unit
test adds a more robust dataset to trigger more branches for testing.
Some of the remaining conditionals are pretty rare … :-/

* Fix unit test bug / python 3 compat

None type comparison problem…

* Fixed annoying deprecation warning for 1d array in BallTree

PEP8 Fixes

* More PEP8 and removed print statements from test

* 70 to 80% faster, fixed distance metrics

Modified to remove extraneous sorts, nneighbors query, and reduced
pairwise distance calculations to the upper triangle of a distance
matrix (instead of the full matrix). It appears that in ‘full’ scans
(i.e., when epsilon is set to inf, or the width of the data set),
OPTICS now actually outperforms DBSCAN… also, should be easy to run
distance calculations in parallel now for large datasets, with proper
heuristics.

* Fixing unit test failure

* Exposed auto_cluster parameters as public

documentation and API update so that users can tweak the auto method

* Fixed def with missing ':'

* Fix bugs from api change...

make sure that arguments are being called correctly with the new
extract_auto() method

* pep8 / pyflakes changes

* Updated example / plot

Correctly generates figure with subplots for DBSCAN / OPTICS comparison

* Tuning plot example / pep8 change

* Bug fix for commit 0b4cbdd (enforce stable sort)

the returned index from “sp.argmin(setofobjects.reachability_[n_pr])”
assumes that entries are stably sorted by distance from the query point
(i.e., that ties in the argmin return the closest point to the input
point).

Still overall faster compared to the pre 0b4cbdd version, since we’re
filtering processed points before sorting by distance… and also only
calculating distances for non-processed points, instead of all within
the epsilon query.

* Code review fixes (style)

Fixes coding style (test comments, camel case, author list, relative
imports, etc). Included a new unit test and .npy file to explicitly
test reachability distances (test coverage at 99%).

* fixing new unit test

can’t import reach_values.npy in current directory….just placed testing
values in script directly (~200 lines for 1500 testing values)

* refactored min_heap to c extension

reduced optics file by about 100 lines of code… new c extension for
speedup (needs further optimization…6-10X speedup still possible)

* minor fixes..

* small cython optimizations

* cython fixes

* Add foo.txt

* Remove foo.txt

* fix compilation error

* last optimizations cython/numpy

MinHeap is only called once…so it’s faster to do a simple linear scan
here. The np.argmin() function does *almost* the exact same thing, but
the custom quick scan function is needed for cases where reachability
distances are tied (and the next point is selected based on which of
the points tied in reachability are closest to the querying point).
OPTICS is now faster than DBSCAN for medium-to-large number of input
points, and has better worst case run time (eps=inf).

* fix pyflakes errors; change default eps value

* _

* API changes from agramfort

Removed ‘filter method’, changed print statements to exceptions and warnings as needed, removed the ‘processed’ flag and replaced it with a fit_check method, changed array copies to parameters, updated unit tests, changed inline documentation to match proper doc string formatting, made the private class private. Probably some other changes too…

* fixed core samples bug

* added fit_predict

conforming to scipy api

* updates to variable names; update plot

* refactor to remove balltree specific code

* major refactor

finished decoupling balltree; lots of changes

* fixed bugs; test all pass again

:-)

* fixed weird cython bug

…not at all sure why pairwise_distances doesn’t automatically return np.float arrays. Makes no sense to me. I could understand cases in which the metric call returns ints (e.g., city block)… but why it would return float32 instead of float64 seems super odd.

* major refactor

deleted extract and auto_extract methods; added optics function; added
extract_dbscan method; added extract_dbscan and extract_optics
functions; updated unit tests; renamed `eps` to `max_bounds`; flake8
corrections; added types: int —> labels, bool —> is_core, enforced X to
be ‘float’ (fixes cython type errors)

* Updated Documentation!

Updated plot with reachability plot, as well as documentation :)

* fix flake8 error

* added optics to cluster comparison

Don't like the figure since it doesn't use black for noise, but added OPTICS for consistency

* Updated comparison plot to transpose

...kinda kludgy fix for the transpose

* small fix

* flake8 error

* reverting transpose of cluster comparison (separate PR #9739)

* fixes from agramfort's review

public/private changes, numpydocstring fixes, a few pep8 fixes that flake8 didn't flag for some reason...

* fix for error message

unit test should pass now

* force cluster_id's to start at 0

* Fix sp. error and flake8 warning(s)

* Updated documentation

responses to reviews

* Removed extraneous files

also fixed small typo

* fixing lgtm alert

* changes from jnothman

small fixes for docs, plots, and tests

* Fixes from jnothman's review

Reordered parameters, updated documentation, removed an unneeded else statement, changes to variable names.

* Fixing flake8 error

* Removed neighbors / balltree inheritance

Also decouples n_jobs -- can set n_jobs for just the kneighbors lookup, while keeping pairwise lookups to single job

* Made nbrs private and moved initiation to fit()

also renamed core_dists to core_distances.

* fixed non-standard characters

* Response to TomDLT review

narrative changes to tests. Normalized reachability distances for the significant_min parameter. Cleaned up plots; small changes to documentation within optics_.py

* Fixed labeling bug

also minor documentation updates

* update unit test

since labels are 0 indexed, the max of labels is (total number of clusters - 1). We can't take len(set(clusters)) because of noise (it would be 4, not 3).

* Simple fixes per jnothman

Fixed float division, condensed variable names to be shorter, renamed bools to True and False, removed un-needed code block. Still need to add tests for cluster tree extraction :-(

* Auto-cluster tests

coverage should be complete now; fixed minor bug; removed un-needed check.

* fixing test error

* removed python loop

also fixed documentation link

* Fixing test error

* Fix typo in unit test

entry was supposed to be '1.0' not '10'; the test is supposed to posit 3 clusters, 2 of which are too small and are merged. Old version posited 4 clusters, two of which were merged as intended, and two of which were discarded (cluster merging requires one of the clusters to be large enough to be an independent cluster; with 4 instead of three, this case did not happen for either of the first two clusters).

* documentation updates

* Post-merge doctest fix

merge conflict in clustering.rst in previous commit; this push updates the doctest values to current correct values, and resolves the conflict

* DBSCAN / OPTICS invariant test

Restructured documentation. Small unit test fixes. Added test to ensure clustering metrics between OPTICS dbscan_extract and DBSCAN are within 2% of each other.

* Update _auto_cluster docstring

renamed reachability_ordering --> ordering for consistency

* changes from jnothman

changed unit tests to check for a specific error message (instead of just any error); minor updates to documentation. This also fixes a bug in the extract_dbscan function whereby some core points were erroneously marked as periphery... this is fixed by reverting to a previous extraction code block that initializes all points as core and then demotes noise and periphery points during the extract scan. Parameterized unit test.

* fix spelling error in tests

* contingency_matrix test

New invariant test between OPTICS and DBSCAN

* small unit test updates per jnothman

* unit test typo fix

* extract dbscan updates

Vectorized the extract dbscan function to see if it would improve performance of periphery point labeling; it did not, but the function is now vectorized. Changed the unit test with min_samples=1 to min_samples=3, as at min_samples=1 the test isn't meaningful (no noise is possible, all points are marked core). Parameterized the parity test. Changed the parity test to assert ~5% or better mismatch, instead of 5 points (this is needed for larger clusters, as the starting point mismatch effect scales with cluster size).

* updated documentation comparing OPTICS/DBSCAN

* DOC: phrasing and whats_new

* MISC: small mem footprint in OPTICS
@qinhanmin2014
Member

Resolved in #1984, thanks @FredrikAppelros
