Implementation of OPTICS #1984

Open
wants to merge 60 commits into
from

Projects

None yet
@espg

Equivalent results to DBSCAN, but allows execution on arbitrarily large datasets. After initial construction, allows multiple 'scans' to quickly extract DBSCAN clusters at variable epsilon distances

Algorithm is tested and validated, but not yet optimized. Tested on 70K points (successful); currently testing on 2 x 10^9 LiDAR points (pending)

First commit; would appreciate any advice on changes to help code conform to the scikit-learn API

Example usage in examples folder

espg added some commits May 21, 2013
@espg espg OPTICS clustering algorithm
Equivalent results to DBSCAN, but allows execution on arbitrarily large datasets. After initial construction, allows multiple 'scans' to quickly extract DBSCAN clusters at variable epsilon distances
147323d
@espg espg Create plot_optics
Shows example usage of OPTICS to extract clustering structure
d18e827
@jaquesgrobler jaquesgrobler commented on an outdated diff May 22, 2013
examples/cluster/plot_optics
+===================================
+Demo of OPTICS clustering algorithm
+===================================
+
+Finds core samples of high density and expands clusters from them.
+"""
+from sklearn.datasets.samples_generator import make_blobs
+from sklearn.preprocessing import StandardScaler
+from sklearn.cluster import optics as op
+
+##############################################################################
+# Generate sample data
+centers = [[1, 1], [-1, -1], [1, -1]]
+X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4)
+
+##############################################################################
@jaquesgrobler
jaquesgrobler May 22, 2013

I tend to prefer not having these separation lines.. A blank line is fine :)

@NelleV
scikit-learn member

Your code isn't pep8, and the documention doesn't follow numpydoc's conventions.

@NelleV NelleV commented on an outdated diff May 22, 2013
sklearn/cluster/optics.py
+###################################
+
+
+## Imports ##
+
+import sys
+import scipy
+
+from sklearn.neighbors import BallTree
+
+from sklearn.datasets.samples_generator import make_blobs
+from sklearn.preprocessing import StandardScaler
+
+## Main Class ##
+
+class setOfObjects(BallTree):
@NelleV
NelleV May 22, 2013

Make sure to follow pep8's convention for class names.

@jaquesgrobler jaquesgrobler commented on an outdated diff May 22, 2013
examples/cluster/plot_optics
@@ -0,0 +1,107 @@
+"""
+===================================
+Demo of OPTICS clustering algorithm
+===================================
+
+Finds core samples of high density and expands clusters from them.
+"""
+from sklearn.datasets.samples_generator import make_blobs
+from sklearn.preprocessing import StandardScaler
+from sklearn.cluster import optics as op
@jaquesgrobler
jaquesgrobler May 22, 2013

although it's normally useful importing as.. it maybe makes the examples a little bit
less readable. Perhaps using optics. instead of op. would be better for the sake of the
example

@jaquesgrobler jaquesgrobler commented on an outdated diff May 22, 2013
examples/cluster/plot_optics
+
+##############################################################################
+# Generate sample data
+centers = [[1, 1], [-1, -1], [1, -1]]
+X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4)
+
+##############################################################################
+
+##############################################################################
+# Compute OPTICS
+
+testtree = op.setOfObjects(X)
+
+# Run the top-level optics algorithm
+
+op.prep_optics(testtree,30,10)
@jaquesgrobler
jaquesgrobler May 22, 2013

pep8 needed here

op.prep_optics(testtree, 30, 10)

@jaquesgrobler
scikit-learn member

Thanks for the PR. I'm not that familiar with OPTICS myself, but I had a look through.
Before getting too in depth, coding style wise you should make you code pep8. Few minor things I mentioned from first glance are also mentioned in comments..

The other thing I'll mention, is that the scikit-learn generally follows this API style..
It'd be good if you can make your API match that of the scikit, like making
prep_optics(...), build_optics(...), ExtractDBSCAN(...) ect. match the 'fit(..)' ect. API. Other help functions you can keep as is.

Hope that helps for now. Thanks again 👍

espg added some commits May 22, 2013
@espg espg pep8 fixes
Mainly issues in long lines, long comments
70daac1
@espg espg fixed conditional to be pep8 4fea7ce
@mblondel
scikit-learn member

@espg Since this PR has stalled, I think it would be a good idea to extract the source into a github repo or gist and add it to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets

@espg

@mblondel Do you mind if I keep this PR open until the end of the month to make some changes that bring it more in line with the scikit-learn API?

I was hiking for 2 months on the pacific crest trail and had shoulder surgery after I got back, so the PR had fallen off my radar-- but I can fix the remaining issues in the next 2 weeks if there's still interest in having an OPTICS implementation in scikits-learn.

@kevindavenport

This is great work!

@arnabkd

@espg : I am very interested in an OPTICS implementation in scikits-learn :)

@michaelaye

What's the status of this?

@amueller
scikit-learn member

Stalled ;) Feel free to pick it up, I'd say. Needs tests, documentation, benchmarks, I'd think.

@kevindavenport

I want to jump in and take this over if it's stalled. I can start this weekend :)

@amueller
scikit-learn member

I'm pretty sure no-one would mind if it provides a good improvement over DBSCAN.

@espg espg updated to match sklearn API
OPTICS object, with fit method. Documentation updates.
extract method
7cc7e12
@espg

@michaelaye : Good question. This had fallen off my radar a bit, but I just pushed a fairly large update, the main purpose of which was to fit in with the scikit-learn API; documentation has been increased as well. I believe that fit within the sklearn API was the previous main issue.

The code is pep 8 with the exception of a single long line, which is hard to re-write without losing readability. The algorithm has been tested on points in three dimensions up to order 10^8 with no ill effects (~40 hours on my 4 year old laptop), and memory use is small. I know scikit-learn doesn’t have an OPTICS implementation, so I’d like to see it included and am happy to work with the maintainers if there are any outstanding issues.

While the API is as close as possible to sklearn, there doesn’t seem to be a natural way to ‘rescan’ for a different eps distance. The sklearn methods fit, predict, and transform all expect data as an input, and the whole point of OPTICS is that you don’t need to recompute anything for a lower eps after the first run…so there isn’t really a sklearn-ish way to implement that— I’ve included an ‘extract’ method that takes a new eps distance and nothing else to do this.

I think that there are additional things that could be done to enhance the module, specifically a save and restore function for dealing with large datasets where you want persistence beyond a single python session, and also some additional performance tweaks. That said, this algorithm doesn’t currently exist in sklearn, and it works and matches the sklearn API fairly well. It is also, due to aggressive pruning, several orders of magnitude faster than other implementations that I’ve seen.

@espg

@amueller : I hadn’t refreshed the page, so I didn’t see your message till just now. I ran some benchmarks a while back, I’m not sure if those work or if you have a specific methodology that you expect people to follow. Let me know about the documentation as it stands.

As for having other people jump in— this was my first attempt at contributing to any software package, and I get the distinct feeling that I’m pretty green at it. I’d love any help with making this accepted to the master branch— it does no good to anyone just sitting around :/

@amueller
scikit-learn member

@espg For persistence, that is basically automatic with pickling. The estimator should be pickle-able (there is actually a test for that afaik), and if not that should be fixed.

For the benchmarks: The results are supposed to be comparable to DBSCAN, right? So I think a performance comparison against DBSCAN, showing how it is more scalable, would be great (or showing that with the current dbscan impl the memory would blow up).

For rescanning: Maybe @GaelVaroquaux can say a word about how he solve a similar problem with agglomerative approaches. My first idea would be to have some attribute that stores the whole hierarchical structure which could be accessed by the user. That is obviously not super convenient.

There are also tests missing, for clustering usually some sanity tests on synthetic data and maybe some that are specific to the algorithm. I can't really say much about that as I am not super familiar with the algorithm. Usually we want near-complete line-coverage, though.

For the Documentation, we also want narrative documentation (at doc/modules/clustering.rst).
That could describe what is similar or different from other clustering approaches in scikit-learn, roughly the intuition of the parameters, the idea of the algorithm, and when it is particularly usefull (like this is a memory-limited variant of dbscan that produces a hierarchical clustering or something).

Maybe also add it to the clustering comparison example.

@amueller
scikit-learn member

I know this is a lot we ask for but we have to uphold our standards ;) Let me know if you need more info or any help. I'll try to see if I can review the code in more detail (or maybe someone who is more familiar with the algorithm can).

@amueller
scikit-learn member

It might also be interesting to compare to #3802.

@MechCoder
scikit-learn member

I can also try reviewing, once you feel that it is in a reviewable state.

@GaelVaroquaux
scikit-learn member
@espg espg plotting example updated, small changes
new plot example that matches the updated API
updated n_cluster attribute with reruns of extract
removed scaling factor on first ‘fit’ run
7d23d01
@espg

@amueller : For rescanning-- I implemented the class so that it (similar to your suggestion) saves the clustering order and hierarchy as an attribute already, but I'll check @GaelVaroquaux suggestion of using joblib.Memory as a cleaner way to do this.

@espg

@amueller : Thank you for the directions on where this needs to go. The tests are a bit of hangup, really I need to write a different example/test that is specific to OPTICS that is appropriate for showing multiple hierarchies of clusters for extracting at different eps values. Right now, I've just been using the same example that DBSCAN uses; however, the automated unit test for clustering fail for the following reason:

The OPTICS algorithm struggles with cases where the extracted clustering structure is close to the initial eps value (really, close to eps_max). In otherwords, running OPTICS(eps=.3,min_samples=10).fit(X) gives different (and incorrect) clustering compared to:

clust = OPTICS(eps=3.0,min_samples=10)
clust.fit(X)
clust.extract(0.3)

...which matches the results given from DBSCAN.

This is because the cluster seeds, which are expanded, are sensitive to starting location for low eps values. Using an eps value of inf ensures that you always get the same results as DBSCAN, but for large point sets gives a pretty substantial slow down; it is worth having the parameter user specified, but it needs to be larger than the final extraction-- somewhere between 5 and 10 times the largest eps value that you plan to cluster at. This could be scaled automatically in the init and fit options, but I think that approach is contrary to the API style for contributing; so I'm guessing that the appropriate use of the parameter needs to be explained in a narrative documentation page?

@amueller
scikit-learn member

Sorry, which automated unit test is that?
The one that is failing on travis seems to be a missing labels_ attribute (which is part of the clustering api apparently).

And yes, discussing the parameter choices in the docs makes sense. Still, the default parameters should give sensible results in so far as that is possible.

@jnothman
scikit-learn member

Hi @espg. I think the main issue with this PR is still coding style, etc. You've chosen to split things up into many functions and classes (and even sub-classed BallTree) when much of it would be easier to read if it were implemented in a single function with a clearly-named helper or two, and then wrapped in a class, just as dbscan/DBSCAN are. See also the alternative implementation in #2043, which has other issues, but whose control structure and data model is much clearer to follow. Also, please avoid CamelCase where not used in class names.

Your code is also not flat because it fails to exploit numpy's vectorised operations, such as here, which not only make the code flatter and easier to read (although it looks a bit less like the equivalent pseudocode) but make it much faster.

As for having other people jump in— this was my first attempt at contributing to any software package, and I get the distinct feeling that I’m pretty green at it.

The rule for contributing to existing software is to start small! Enhance something that's already there, and once you are confident about your work, the project and how to contribute, and the team receiving your contributions is confident about your work, paths open up to more substantial contributions.

IMO, the version of OPTICS that will eventually, I expect, be merged into scikit-learn will look more like #2043.

@espg

hi @jnothman— thanks for the feedback, I appreciate it. I think that a major difference between 2043 and 1984 is that 1984 implements search pruning, which is (part of) why the code is more difficult to follow, and is why ‘balltree’ is subclassed. Without pruning searches, the OPTICS algorithms perform, for N data points, N^2 distance lookups. Pruning searches is targeted at the specific, but common, case where you have a dense or semi-dense data set with a mismatch of several orders of magnitude between the extent of the dataset and the size of the objects that are being clustered. By way of example, think clustering objects of average width 5, in a dataset that is distributed over order 100,000 units across per dimension; in these types of cases, there’s no need to perform distance lookups between points that are far apart—however, what counts as ‘far’ changes depending on the density of the dataset, so determining what to prune is done within the OPTICS loop. I don’t know of anyway to prune without keeping track of both per-point pruning distance, and whether a given point has been processed, which is why the balltree data structure is subclassed in the code. Without pruning, you either pay N^2 memory (like DBSCAN does/did) or N^2 cpu time (like 2043)…the other way to solve this problem would be to split the dataset up into overlapping tiles and merge the clustered objects, but the tiling scheme isn’t obvious and the merge is nontrivial.

I think that one of the the main problems with 1984 at this point is that it doesn’t have complete unit coverage, and I haven’t had experience with unit tests so I don’t really know how to increase the coverage. I can fix camel-case and work on narrative documentation and some of the other issues, but the coverage has been a stumbling point, so if you have any resources to point me toward, I’d be grateful!

I’m not really sure about the diff that you referenced as an example of better flat code—this is a diff from this (1984) pull request, so perhaps you meant to point to a different link; or just that this update was better? In general, I try to call either numpy or scipy for almost all operation as these functions tend to be vectorized.

The rule for contributing to existing software is to start small!

Yeah, I know (now) that I'm doing this backwards for a first time contributor-- algorithm first, then modifications, unit tests, documentation... I'm still learning all of it, but definitely in the reverse order; DBSCAN didn't scale for my data, and 2043 hadn't been opened yet, so I started 1984

@espg espg closed this Jan 5, 2015
@espg espg reopened this Jan 5, 2015
@jnothman
scikit-learn member

Yeah, I know (now) that I'm doing this backwards for a first time contributor-- algorithm first, then modifications, unit tests, documentation... I'm still learning all of it, but definitely in the reverse order.

It's not really the point; this way everyone (you and other devs) will be too tired of working on a big chunk by the time it's nearly ready to merge. You're much better off having something small that garners confidence for all, learn the ropes, and work your way up.

implements search pruning, which is (part of) why the code is more difficult to follow, and is why ‘balltree’ is subclassed

I'm not sure what you mean here by "search pruning". I don't see anything in your code that immediately looks like pruning, nor any reference to that term in the OPTICS paper. But your extensions to BallTree are entirely orthogonal to the function of BallTree, so the use of class inheritance seems ill-motivated.

I'm not really sure about the diff that you referenced as an example of better flat code

It was an example of something that was written as a Python for loop but could have been written vectorised, so yes, it was from your own PR. (And it is additionally in a function that only serves to perform two could-be-vectorised operations that really does not justify a separate function.) For comparison, I have suggested elsewhere a succinct and time-efficient way to identify epsilon neighbors, determine core samples and their distances.

@jnothman
scikit-learn member

it doesn’t have complete unit coverage

If not every line is tested, you need to create data scenarios that will test those untested lines, where possible. Is that the issue you're concerned with?

espg added some commits Jan 22, 2015
@espg espg added stability warning
Scales eps to give stable results from first input distance. Extraction
above scaled eps is not allowed; extraction below scaled eps but
greater than input distance prints stability warning but still will run
clustering extraction. All distances below initial input distance are
stable; no warning printed
9d163e3
@espg espg Noise fix; updated example plot
Fixed noise points from being initialized as type ‘core point’
Fixed initialization for first ‘fit’ call
Decoupled eps and eps_prime (deep copy)

Matched plot example to same random state as dbscan
Added second plot to show ‘extract’ usage
b18da24
@espg

@chebee7i I still need to finish unit tests and documentation, but the implementation works if you would like to use it for something. I'm leaving for field work in Greenland next week and will be gone for 2 months, but I'd like to have this finished for a merge by sometime in July after I get back.

espg and others added some commits Jul 15, 2015
@espg espg Changed to match Sklearn API
eps is not modified in init; kwargs fix
c0e24e5
@espg espg Forcing 2 parameters
Why do I have to do this?
90e21ac
@espg espg Conforming to API
Fit now returns self; test fix for unit test fail in ball tree (asarray
problem)
b06c76d
@espg espg Fixing plot example
labels to labels_
abbdbf2
@freemansw1 freemansw1 Fixed issue with sparse matrices
2c54cdc
@freemansw1 freemansw1 Another attempt at fixing the sparse matrix error
Temporary fix until balltree can be updated to deal with sparse
matrices.
58d530a
@espg espg Better checking of sparse arrays
Using ‘check_array’
6034139
@espg espg General cleanup
Removed extraneous lines and comments (old commented out code has been
removed)
fd5c65f
@espg espg Merge remote-tracking branch 'upstream/master'
Checking module compatibility with current sklearn build
39660a5
@freemansw1 freemansw1 Added unit tests for extract function
Added the unit tests in test_optics for fit, extract, and declaration.
Should bring coverage to ~100%. Additionally, fixed a small bug that
cropped up in the extract function during testing.
6ae8c94
@freemansw1 freemansw1 Attempting for near 100% coverage
Removed unused imports, added to get warning.
bc9cc1c
@freemansw1 freemansw1 Fixed error in unit tests
34b9a6d
@espg espg Trimmed extraneous 'if-else' check
see title
8d5ac08
@freemansw1 freemansw1 forcing to check for a warning.
Result should be 100% coverage
286f90e
@espg

@MechCoder If you're still interested in doing a code audit, I'd really appreciate your feedback. We've updated the code a fair bit-- coverage is complete, we pass the automated builds now. As best as I can tell, we also match the sklearn API now.

@espg espg Updates to doc strings
All public methods now have doc strings; forcing a rebuild of OPTICS so
that build tests pass (last round failed due to external module)
ae2bddc
@amueller
scikit-learn member

Thanks for the update. I think there are still some style issues with the code, but mostly it would be great if you could provide a comparison with DBSCAN and Birch.

espg added some commits Jul 31, 2015
@espg espg Style / pep8 changes
99% pep8 now… line 138 isn’t, but reads better with the long variable
names
55365c5
@espg espg Added Narrative Documentation
Includes general description, discussion of algorithm output, and
comparison with DBSCAN. References and implementation notes are included
e1053b0
@espg

I've uploaded narrative documentation into the clustering.rst file that discusses the similarities and differences of OPTICS and DBSCAN-- @amueller is this the sort of thing that you're looking for in terms of a comparison, or are you referring to benchmarks? Also, I wasn't sure how to get this figure into the documentation
image. It looks like DBSCAN is getting it's images into the narrative documentation from 'auto-examples'; if I modify my plot example to plot the attached image, can I pull it into the narrative documentation automatically?
Thanks

espg added some commits Aug 2, 2015
@espg espg Vectorized nneighbors lookups
Following suggestion from jnothman for doing nneighbors queries enmass.
Added OPTICS to cluster init
965225e
@espg espg fixing init build error
1bda3b1
@espg espg reverting init
1827abf
@amueller
scikit-learn member

Both documentation and benchmarking would be great ;) So the documentation part is already good.
You can indeed embed images from the examples into the documentation. You already did that here, right? https://github.com/scikit-learn/scikit-learn/pull/1984/files#diff-de1b5b01d8a5c07a265e4606fcbd28b0R805

@espg espg All code now vectorized
…at least all code that can be ;)

Some general pruning and cleanup as well
08389fc
@espg

@amueller I've run a 'first cut' benchmark, you can see the output here: http://pastebin.com/2Ns0MYSu
I think the performance looks pretty good at this point-- on a dataset that size, for any more than two separate (highish) epsilon values, it's faster to run OPTICS and extract the output than to rerun DBSCAN repeatedly.

espg added some commits Aug 4, 2015
@espg espg Style changes
Now 100% pep8
d9c2bb1
@espg espg Changing parameter style
matching DBSCAN
659a697
@espg

@amueller comprehensive benchmarks are done for OPTICS and DBSCAN— the green line is OPTICS and the blue line is DBSCAN. These took longer than I expected— in part because I had to move, and in part because the run time on the larger datasets starts to approach over a dozen hours. I ran comparisons by varying the epsilon parameter over different sizes of datasets, and did the same with varying min_samples over different size dataset— min_samples didn’t change run time appreciably for either either OPTICS or DBSCAN, so I’ll just go over the impact of epsilon values and dataset size in the summary below using the three canonical figures:

image

The figure above is for ~4k points, and shows the same general trend that the next two figures will show— OPTICS is slower than DBSCAN by a little over an order of magnitude… OPTICS is expected to be slower than DBSCAN, so no surprises here. You can see some leveling off with both algorithms at higher epsilons; this is because the physical bounds of the data set is 5 by 50 meters, so epsilons over 50 converge to distance calculations between all points. OPTICS converges first, because OPTICS scales the input epsilon by 10 internally, so an input epsilon of 5 is run at 50. The scaling is to avoid boundary cases, and is pretty conservative—scaling by 5 would probably be fine, and would move all the green OPTICS lines to the right and close the performance gap with DBSCAN.

image

This figure, generated using ~40k points, shows basically the same thing, but with the performance converging for the two algorithms at higher epsilons. The flattening for the OPTICS algorithm is again due to the physical extent of the testing data: 20 by 100 meters, so by an epsilon of 10 (which is scaled to 100), performance is basically level for OPTICS. DBSCAN also levels off, just an order of epsilon values later (as expected)…interestingly, for an exhaustive clustering at high epsilon values, the two algorithms take essentially the same amount of time, which beats expectations of performance for OPTICS!

image

The final benchmark above, ran on ~400k points (500 by 100 meters) is the real kicker. Pattern basically remains the same— the same order of magnitude difference between the algorithms. There is a bit of widening on the performance gap at the right hand side, but remember that OPTICS is lagged because of the epsilon scaling…based on the previous graphs, we would expect OPTICS to level out around 50, as scaling would put distance queries at 500 meters, the largest extent of the dataset. Of course, we don’t see that in this benchmark graph since it doesn’t go as far as the others on the x-axis— we only cluster at 20 epsilon and lower, instead of going to 250 epsilon. Why? Well, it turns out that DBSCAN doesn’t really work on large point datasets that also have large epsilon distances—running DBSCAN in these conditions uses all of my machine’s 16GB of memory, and then swaps to disk until I run out of free disk space, and crashes my system around the time that I have 80 to 100 GB of virtual memory dedicated to the benchmark. Running the same using OPTICS may take 6 hours, but the process is kind enough to stay under a gig and a half of memory while it does it—and it completes.

I first started OPTICS because, in 2013, the then implementation of DBSCAN couldn’t really cluster any datasets above 40k points. My impression was that the addition of the balltree and kdtree algorithms into DBSCAN had effectively changed that— and it has, but only for low epsilon values. The current implementation of DBSCAN effectively still calculates an N-by-N distance matrix when using epsilon values that are equal to the extent of the dataset; it just does it using a tree structure to execute the queries. This is a problem even when not at an epsilon equal to the dataset extent; we’re crashing here at epsilons in the 10 to 20 range on 100 by 500 meter dataset when using DBSCAN. For a large dataset, a ‘medium’ epsilon value will yield an N-by-P matrix, which if ‘P’ is less than ’N’ but still large, will crash the algorithm due to memory constraints. In contrast, in the worse case scenario OPTICS will calculate N-by-1 distances, one at a time for N times, such that even if runtime gets quite long, memory will stay at N-by-1. The general case calculation for OPTICS is P-by-1 distances calculated N times, where P is less than N.

The OPTICS algorithm does perform slower than the DBSCAN algorithm, but there are three points to be made on this. First, it is expected to run slower. Second, the OPTICS algorithm effectively clusters for all epsilon values below the one which it is ran at— so a fair comparison is the run time of OPTICS compared to the integral of the DBSCAN benchmark up to a given epsilon distance. Third, the internal scaling means that we should really shift the OPTICS line right by an order to a half order of magnitude in comparing performance.

It’s also worth noting that OPTICS has different (i.e., more varied) extraction options for clustering than DBSCAN; so there are many cases in which users will want to run OPTICS instead of DBSCAN regardless of the performance difference. Of course, the ability to (almost instantly) rescan for an arbitrary epsilon is also quite appealing if the operation is going to take a while. And to restate the biggest take away from these benchmarks— for large point sets with large epsilons, OPTICS is the only game in town.

@jnothman
scikit-learn member

Am I right in thinking that this OPTICS implementation only readily allows for clusterings that are more-or-less equivalent to DBSCAN, but with the ability to try different eps? Are there DBSCAN variants that adapt eps to different regions of the space?

Thinking about how this compares to other estimators in scikit-learn, it's functionally a bit like regularisation path tools, or PCA where the entire decomposition can be memoised so that the user can play with number of components, or a similar feature in hierarchical clustering and pruning. The difference is that here there is a substantially higher up-front cost.

From an API perspective, it may be sensible to build this as an option in DBSCAN, like memory in HierarchicalAgglomerativeClustering.

@espg

I'm not sure if I understand your question, but I'll answer as best I can-- OPTICS allows for you to adapt eps to different regions of the space, DBSCAN does not...and I don't know of any variant of DBSCAN that would allow you to.

This specific OPTICS implementation (i.e. #1984 ) currently only has one extraction function built in, but the algorithm (and this implementation) produces a cluster ordering and core + reachability distances. These three attributes allow for other types of 'non-DBSCAN like' extraction functions. Here are few examples, both 'abstract' (theoretical/described in a paper), and 'concrete' (already implemented elsewhere on github):

  1. Maximum compactness clusters: form clusters at the local minimum eps that forms a cluster object

  2. Maximum extent clusters: form a cluster and 'grow' with greater eps. Locally stop eps when two or more adjacent cluster objects merge. (i.e., grow cluster by adding less proximal 'noise' points, but do not incorporate merging of clusters.)

  3. Arbitrary compactness clusters: between the first two methods discussed above; see here. See also this 'automatic clustering' code from amyxzhang.

  4. Reachability plots / Dendrograms: see here (again) for an example of just the extraction feature for these dendrograms (along with the extraction function mentioned above) written in python. Here is a graphical example of the reachability plot, along with the 'non-DBSCAN-like' OPTICS cluster extraction and data points that matches the referenced reachability plot.

  5. Recursive clustering: provide an cluster with all valid sub-clusters that appear at lower eps (similar to above reachability plots, but not graphical).

  6. Identical clustering to DBSCAN, but for all lower eps values: This is what I have implemented as an extraction feature. Note that the extraction function is effectively instant once the clustering has been done and stored.

I currently only have the one method of extraction (number 6 above) in this pull request. I would like to make sure that the core algorithm meets sklearn's standards and is ready for a merge before I add any extra functionality.

Extraction functions are fairly modular and can be added easily in the future-- I think that it would make sense to add the extraction function from #2043 to this PR as a second extract function and add both the authors of that PR to authors list of this OPTICS implementation. Along the same lines, I'm not opposed to reaching out to amyxzhang to see if she'd like to contribute the reachability plot code, or the 'automatic' extraction function.

If you did want a unified API, my suggestion would be to make a top level class like 'density clustering' under which both OPTICS and DBSCAN live. Honestly though, I think that people are familiar with DBSCAN and OPTICS in general as far as algorithms, so it makes most sense to have them listed as their own algorithms with their own methods.

Note that all of amyxzhang's code assumes that you get the cluster order reachability distances elsewhere-- and that the 'hierarchical_extraction' code from FredrikAppelros similarly does not care where it gets it's cluster order and reachability distances. The point of this PR is to implement the OPTICS algorithm in a way that is scalable in terms of memory, and reasonably efficient in terms of run time.

espg added some commits Sep 21, 2015
@espg espg Extraction change; Authors update
—initialize all points as ‘not core’ (fixes bug when plotting at
epsPrime larger than eps)
—added Sean Freeman to authors list
bec154f
@espg espg Changed eps scaling to 5x instead of 10x
10x scaling is too conservative…eps scaling at 5x is perfectly stable,
and much faster as well.
5891945
@espg espg Fixing unit test
Should be ‘None’ for this; initialization previously at 1 for ‘is_core’
was incorrect
9427f1e
@espg espg Actually fixing unit test
Null comparison doesn’t work, using size
6462b5e
@espg espg Making ordering_ and other attributes public
renamed core_samples to core_sample_indices_ (as in DBSCAN). Used
attribute naming conventions (trailing _ character), and made
ordering_, reachability_, and core_dists_ public.
2785864
@espg

@jnothman it looks like there is a new variant of DBSCAN called HDBSCAN that may adapt eps locally; it was published in 2013, and has 16 citations. Is OPTICS, published 1999 with 2135 citations to date, something that sklearn is interested in potentially including at some point? It seemed to have interest from @arnabkd, @kevindavenport, and other community members--and also fits this question from the sklearn FAQ as far as I can tell.

espg added some commits Sep 21, 2015
@espg espg Updates for Documentation
includes attribute strings for now public attributes
8c4572b
@espg espg Pep8 cleanup
Minor pep8 and pyflakes fixes
3d5e0fc
@espg espg updating plot example to match new attribute name
02bb8c0
@espg

Is there any way to get this PR merged in time for the 0.17 sklearn release? I’ve worked on this PR heavily though July, August, and September and I honestly am at a loss as to what else needs to be done at this point. I’ve followed all of the published conventions and API styles that I could find on the contributors guide and on the wiki. Here is a list of requests from the sklearn maintainers that I have addressed:

@NelleV code is pep8, documentation follows numpydoc conventions

@jaquesgrobler code follows sklearn API— has constructor, calls a ‘fit’ method. All of the input parameters are defined in the docstrings, as are the attributes. All public methods have docstrings (most of the private ones do too), and all of the attributes are named according to the sklearn convention, and match the other clustering algorithms that have those attributes.

@amueller narrative documentation is provided, as are benchmarks. Unit tests are included

@jnothman coding style has been completely overhauled since June. All code is flat and vectorized. Unit tests show 98% coverage. Neighborhood size determination is done en mass according to your suggestion.

OPTICS seems to fit the guidelines that are provided in the FAQ for algorithm additions. It is a newer algorithm than DBSCAN, but older than 3 years, and has over 10 times the suggested 200 citations. Computational complexity of this particular implementation matches what is generally expected for the algorithm: performance is dominated by the need to make one eps neighbors query per point in the input data set.

In terms of if OPTICS should be a submodule to DBSCAN… both OPTICS and DBSCAN have their own, separate, wikipedia pages. Both algorithms are popular enough that they have their own variants: DBSCAN has GDBSCAN, HDBSCAN, and is used in PreDeCon and SUBCLU; OPTICS has OPTICS-OF, DeLi-Clu, HiSC, HiCO, DiSH, and FOPTICS. To me it makes sense to have OPTICS as a separate clustering algorithm from DBSCAN, but I will of course defer to the maintainers judgement on what makes the most sense for fitting in with the sklearn API.

I’m happy to make additional changes if needed, but I’m not clear on what else is expected in order to have this ready for a merge.

Thank you in advance for your feedback,
espg

@jnothman
scikit-learn member

Is there any way to get this PR merged in time for the 0.17 sklearn release?

Unfortunately, no, though we thank you for all your work. 0.17 should be released very soon, though no core developer has completed a review of this PR and given their support that it is ready for merging, and we require support from two before merge.

code follows sklearn API

The coding conventions are much closer than they were, but we still need to work out whether extract is the best way to make predictions with alternative parameters, while minor tweaks such as avoiding camel case except in class names will also be necessary.

@espg

@jnothman What do you think about using transform instead of extract? The cluster.FeatureAgglomeration class passes parameters to transform, so perhaps epsilon_prime could be passed through transform. Is the transform API flexible enough to consider X an optional argument (i.e., the way that fit takes (X,[y]), with y being ignored/optional for unsupervised learning)?

Alternatively, is there anything that's halfway between transform and predict? predict and fit_predict both return labels, but requires new data as an input. transform works on a fitted data set, which matches this application, but doesn't return labels. Perhaps a new call transform_predict makes sense for a general solution (i.e., transform the data and then use the transformed result to predict classes, returning labels)?

I used extract because neither transform or predict quite seemed to match. I understand wanting to keep a consistent API so that things work in the sklearn pipelines... is transform or transform_predict a way forward to keep compatibility and match the sklearn API?

@jnothman
scikit-learn member

I don't think transform is an option here. One option is to use fit_predict, and somehow use memoization. One option is to use predict and allow the parameter to change after fitting but we have an issue if X differs from the fit X. Sorry I don't immediately have answers...

@espg

@jnothman Thanks for the input on fit_predict and predict, it's very helpful. Here is a proposal for conforming to the sklearn API along the lines that you have suggested:

  • fit(X, y=None): return self

  • fit_predict(X, y=None, epsilon_prime=eps, clustering='dbscan'): return core_sample_indices_, labels_
    'epsilon_prime' defaults to self.eps, but can be specified to something else. 'clustering' defaults to dbscan type extraction, but allows for extention to other types of extraction (e.g., hierarchal).
    This method runs a simple check: if 'X' is the same as what fit(X) was called with, run extract. If 'X' does not equal 'X' from fit(X), or if fit(X) has not been called, run fit(X) on the new 'X', and then call extract.

  • predict(X, y=None, epsilon_prime=eps, clustering='dbscan'): return core_sample_indices_, labels_
    Check: If 'X' is the same as what fit(X) was called with, run extract with 'epsilon_prime'. If neither fit or fit_predict has been run, throw an error. If 'X' does not equal 'X' from fit(X), but a fit method has been called, infer cluster membership of the new data-- this can be done cheaply with a single en mass k=1 nearest-neighbor lookup of each new input point in 'X', combined with a run of extract on the previous input 'X' array...specifically, if the distance from new input point 'p' to the k=1 nearest-neighbor in the orginal 'X' array is less then the 'core_dist' attribute of that neighbor (and that neighbor is a 'core' point at the given espsilon_prime), then we know that point 'p' belongs to it's k=1 nearest-neighbor cluster. Likewise, if the nearest-neighbor for input point 'p' is marked as noise from extract, then 'p' is noise. This gives a simple way to predict cluster membership of new input points, using the fit data set as training.

  • extract(self, epsilon_prime, clustering='dbscan'): return core_sample_indices_, labels_
    Keep this method public as a convenience function.

I think that this allows for easy extension and addition of new extraction methods, conforms reasonably to the existing API in terms of input and output, and brings a new predict capability for use in cases where users are training the estimator. Thoughts or suggestions on the above proposed scheme?

@GaelVaroquaux
scikit-learn member
espg added some commits Oct 19, 2015
@espg espg CamelCase fixes
conforming to sklearn API on CamelCase
3dd2d29
@espg espg adding hierarchical extraction function
Added hierarchical extraction from #2043

#2043 is BSD licensed and hasn’t had any activity for 10 months, so
this seems pretty kosher; authors are cited in the code
e52cc1e
@espg espg added hierarchical switch to extract
Additional style and documentation changes as well
fcd5f70
@espg espg cluster order bug fix
ensured that ordered list is always same length as input data
4d7e3be
@espg espg removed hierarchical cluster extraction
Code from FredrikAppelros is totally unable to handle noise— as
currently written every point is assigned to a cluster, except the
first point of that cluster. May include later as a third method
b31ae1a
@espg espg initial import of automatic cluster extraction code
Adding the excellent (and working) automatic cluster extraction code
from Amy X. Zhang. Some minor formatting changes on import to conform
to pep8
92ba90d
@espg espg wrapper for 'auto' extraction
additional fixes to style (camelCase, etc.), comments; made all helper
functions private
768f7c6
@espg espg test and example updates
Much better example data to showcase ‘auto’ extraction. Unit tests now
test both extraction methods. Set ‘auto’ as default, as it doesn’t
require any parameters and gives a better result. Pruned references to
hierarchical clustering
f4d88de
@espg espg fixing unit coverage; pruning unused functions
probably could still get a better unit test for ‘auto’ clustering…
b1aeb52
@espg espg Added 'filter' fuction
Allows density-based filtering. Useful for cases where only a
‘noise’/‘no noise’ classification is desired. Function does not require
fitting prior to running, although it can be run after a fit if desired
8350711
@espg espg Vectorizing auto_cluster
generalizes input to multiple dimensions as well…
31df5eb
@espg espg updated filter function
428d7d4
@phdowling

Is this still being worked on? If not, can I expect it to work if I merge it locally? Would be nice to try out for a project I'm working on.

@espg

@phdowling Still being worked on-- I'll push an update to expose a few more extraction parameters on the 'auto' extract method next week after I get back from the American Geophysical Union. It should work fine if you merge it locally, but let me know if you run into any problems.

@Gudui

Is this still being developed?

@Broham

What still needs to be done on this PR to get it ready to merge? Is it just the comment that @GaelVaroquaux left that needs to be addressed?

@espg

Main thing is new unit tests since code changes and additions have made the old ones insufficient. There are some API changes (mainly exposing and documenting hidden parameters to be public for the auto extract function) that I can push as well.

@amueller
scikit-learn member

the test failure seems related...

@Broham

@espg - ok sounds good. If I wanted to work on the tests what would be the proper way for me to do that? Should I clone your repo (https://github.com/espg/scikit-learn.git) and then make PRs there? Sorry - I haven't contributed to something like this before.

@espg

@Broham - yeah, I think cloning and then making a PR is the standard way to do it. I'd certainly appreciate any help with the unit tests, so let me know if you have any questions looking through the code

@Broham

@espg @amueller looking at the tests using nosetests it looks like there is 75% unit test coverage (vs the recommended 80%) and then the circleci, apveyor and travis-ci failures.

When you mention fixing tests/unit tests are you thinking that getting coverage up to 80% and getting the CI checks to pass are what needs to be completed or does it go deeper than that?

@espg espg fixing test error
setting minPts to < data size returns None (with error message)
517a72e
@espg espg and 2 others commented on an outdated diff Aug 9, 2016
sklearn/cluster/optics.py
+ # ignore this split and continue (reject both child nodes)
+ _cluster_tree(node, parentNode, localMaximaPoints,
+ RPlot, RPoints, min_cluster_size)
+ return
+
+ # remove clusters that are too small
+ if len(Node1.points) < min_cluster_size:
+ # cluster 1 is too small"
+ try:
+ Nodelist.remove((Node1, LocalMax1))
+ except Exception:
+ sys.exc_clear()
+ if len(Node2.points) < min_cluster_size:
+ # cluster 2 is too small
+ try:
+ Nodelist.remove((Node2, LocalMax2))
@espg
espg Aug 9, 2016

@Broham it looks like there's a problem with how the 'auto' cluster method prunes the tree that it builds. _cluster_tree calls Nodelist.remove((Node2, LocalMax2)) , but doesn't check to see if the value is already removed or in fact exists in the list. The author of the 'auto' method, @amyxzhang , uses sys.exc_clear to clear the exception; however, sys.exc_clear doesn't exist in python3.x, so the module fails with an error during the automated tests. I think that just wrapping sys.exc_clear in a try block would probably resolve the error, but really the cleaner thing to do would be to sanitize the list removal input or check the size of the list with a conditional...I'm just not clear on the best way to do it. Open to suggestions.

@Broham
Broham Aug 10, 2016

@espg yeah that makes things interesting. My thoughts would be to change the conditionals to confirm the item exists before trying to remove it like so:

if len(Node1.points) < min_cluster_size and Nodelist.count((Node1, LocalMax1)) > 0:

Then maybe just drop the try/except that is in that statement - if there are exceptions we could let them bubble up?

Also I'm not crazy about this method for exception handling because it seems like they are just eating them which can make things confusing if something is going wrong. Maybe i'm just not seeing the big picture though - does @amyxzhang have any input?

@amyxzhang
amyxzhang Aug 10, 2016

@broham your suggestion sounds fine to me.

Broham and others added some commits Aug 10, 2016
@Broham Broham removing exception handling in favor of conditional check b7daad4
@espg espg Merge pull request #2 from Broham/exception-handling
removing exception handling in favor of conditional check
a08543b
@espg

@Broham thanks, seems to pass now.

@espg

I looked over the codebase for this PR, and I think that there's three main things that need to happen for this to merge:

1.) Update the example (a little out of date with the auto method; doesn't plot correctly when auto-examples are built)
2.) Update documentation to include/mention auto cluster parameters; fix image (related to item 1 above not plotting)
3.) Expose hidden parameters in 'auto' cluster method as public, so that people can tune the parameters.
4.) Additional unit tests (not sure if we need close to 100%, or if 80% will suffice)

I'm working on items 1 and 2; @Broham do you have any interest in helping out with item 3? I can give some more details on modifying the extract functions for this if you'd like...

@Broham

@espg definitely! Might not be able to hit it this weekend, but next week I should be able to give it some attention.

@amueller
scikit-learn member

test coverage close to 100 would be appreciate. 80% seems pretty low when the project has 95% (and most of these are backports of arpack). Where does the 80% come from?

This discussion has entirely paged out of my memory, and I would need to read up on what the benefits of optics are again.

espg added some commits Aug 27, 2016
@espg espg Updated unit tests
Coverage to 90%. PEP8 fixes
2019886
@espg espg Additional unit test
Now at 94%; auto extract method is very hard to test… this new unit
test adds a more robust dataset to trigger more branches for testing.
Some of the remaining conditionals are pretty rare … :-/
6adb41f
@espg espg Fix unit test bug / python 3 compat
None type comparison problem…
78be6e4
@espg

@amueller I went through and updated the unit tests so that we're at 94% coverage now. The remaining few percent maybe hard to cover, since the branches that get activated are dependent on the input dataset (i.e., lines 525 to 530), although I'm pretty sure that I can force a test for lines 473-477 to bump past 96%. Would the 96-97% range suffice for coverage?

There's several benefits for the OPTICS algorithm; the most obvious improvement for users is cluster identification when observation density is variable (i.e., the top clusters found from the bottom data in the figure below).
optics_auto_extract
DBSCAN will keep a constant density, and either miss the dense clusters around (2,-2), or will only get the three densest clusters and miss (-2, -6) and (6, 6).

@jnothman
scikit-learn member

@espg I know this has been open forever with a cast of thousands involved , but you should probably ping in a couple of weeks once the next release is out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment