WIP: OPTICS clustering #2043
Conversation
How does this compare to pull request #1984?
This implementation is the one we talked about some 3 months ago on the mailing list. I believe it complies more closely with the scikit-learn standard (it inherits from BaseEstimator and ClusterMixin), and it is fairly well tested: we have been using it ourselves for 3 months on real data, and tests are included. Implementation-wise, we use cdist from scipy to calculate the distances from the current point to every other point in order to keep memory utilization down to O(n); this is if we are not fed precomputed distances, of course. We haven't had time to fully examine the implementation used in #1984, but it seems to use a ball tree structure instead. It would be interesting to compare the performance of the two implementations and possibly integrate a ball tree into this one if it leads to superior performance.
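The memory argument above can be sketched concretely. The following is an editor's illustration, not the PR's code: computing distances one query point at a time with scipy's cdist needs only an O(n) row per step, instead of the O(n^2) full pairwise matrix (the data, eps value, and helper name are made up for the example).

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X = rng.rand(100, 3)  # 100 random points in the unit cube

def neighbors_within_eps(X, i, eps):
    # Distances from point i to every point: a (1, n) row, O(n) memory.
    d = cdist(X[i:i + 1], X)[0]
    mask = d <= eps
    mask[i] = False  # exclude the query point itself
    return np.flatnonzero(mask), d[mask]

idx, dists = neighbors_within_eps(X, 0, eps=0.3)
```

The same neighborhoods could of course be read off a precomputed n-by-n matrix; the point is that nothing larger than one row is ever materialized.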
Thanks for the PR! I was looking to add OPTICS at some point in the future :) The branch builds, and the code looks quite good. I believe there is still the narrative documentation to go (in doc/modules/clustering.rst). As a naive question, what does scipy's distance do that sklearn's doesn't? It would be better to have a common interface, and if there is a problem with …
Ah, we missed the part about narrative documentation. We'll add that ASAP. What's lacking in sklearn's metric module is a way to calculate distances individually instead of pairwise for the whole dataset, as in …
Right, I didn't think of that. The reason we have …
I don't really get the point; pairwise_distances can be used quite … The big drawback I see of pairwise_distances over cdist is the overhead …
Using a structure that optimizes spatial queries, we might be able to avoid using …
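The idea of a spatial-query structure can be sketched as follows. This is an editor's illustration using scipy's cKDTree as a stand-in (the thread's implementations use sklearn's BallTree, which additionally supports several non-Euclidean metrics, whereas a k-d tree does not): a radius query returns all points within eps without touching every pairwise distance.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.RandomState(42)
X = rng.rand(200, 2)  # 200 random points in the unit square

tree = cKDTree(X)  # build the spatial index once, O(n log n)

# Indices of all points within radius eps of point 0 (including itself,
# since its distance to itself is 0).
eps = 0.1
neighbors = tree.query_ball_point(X[0], r=eps)
```

Each such query is sublinear on average, which is where the ball-tree-based #1984 gets its speedup over repeated full-row distance computations.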
Would it still support non-Euclidean metrics?
I did some testing, and it looks like pull request #1984 (my PR) runs a couple of orders of magnitude faster; more details in the linked post, but basically the cdist-based approach runs in O(n**2) time. Since sklearn.neighbors.BallTree implements non-Euclidean distance calculations, adding them to OPTICS should be trivial. I think that #2043 and #1984 should be merged. It's probably easier to bring #1984 up to match the scikit-learn API; its main OPTICS loop and supporting prep functions already work, and any spatial index added here would likely repeat much of the coding in #1984. The DBSCAN extraction algorithms from an ordered OPTICS list are better here, and could be merged into #1984 painlessly, as could the documentation.
What is the memory usage like in #1984?
Apparently benchmarked at 250 MB, compared to 150 MB here. See the mentioned post.
Interesting. So it's a trade-off between memory and speed?
Yes, if you can compare 1.5x memory and 60x speed, assuming the benchmarks hold.
I agree with @espg that the two different implementations should be merged. We didn't add a spatial indexing structure to our implementation to begin with because we didn't know that sklearn already had implementations for that. I finally got some time to look through both #1984 and the current implementation of …

What I'm wondering, though, is where the 20 different distance metrics for …

Also, I have to disagree on it being easier to bring #1984 up to speed, as I just implemented spatial indexing in this one by changing basically 4 lines. I'll push it here as soon as I have done some more testing on it. What I would love to see, however, is that benchmark, if you still have it lying around, so that we could compare the new implementation in this PR against the two others (the original implementation here and the one in #1984).

A quick note: I haven't extracted the calculation of core distances into a preparation function as in #1984, as I don't know how sklearn approaches parallelization. It's a very interesting thought though and might be worthwhile!
Okay, we finally had some time to test the spatial indexing implementation, and as far as we can see it should be working now. @espg: As I mentioned earlier, it would be great if you could run your benchmark again, or just publish it so we could try it out ourselves. We're excited to see how much faster the algorithm is now.
@FredrikAppelros Since this PR has stalled, I think it would be a good idea to extract the source into a GitHub repo or gist and add it to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets
+1
si = seeds[ind[0]]
neighbors = ind[0][si]
ds = dist[0][si]
cds = core_dist * np.ones(len(ds))
This is unnecessary; np.maximum(core_dist, ds) should suffice due to broadcasting rules.
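The broadcasting point can be checked directly. A tiny self-contained illustration (the values are made up; NumPy compares a scalar elementwise against an array without any explicit tiling):

```python
import numpy as np

core_dist = 0.5
ds = np.array([0.2, 0.7, 0.4])

# Original pattern: materialize a constant array, then take the maximum.
cds_old = np.maximum(core_dist * np.ones(len(ds)), ds)

# Broadcasting makes the intermediate array unnecessary.
cds_new = np.maximum(core_dist, ds)

assert np.array_equal(cds_old, cds_new)
```

This is exactly how reachability distances are defined in OPTICS: the elementwise maximum of the core distance and each actual distance.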
I'm not yet sure about … I would, however, do the following:

# TODO: check boundary cases are correct
nn = NearestNeighbors(radius=eps).fit(X)
neighborhoods = nn.radius_neighbors_graph(X, mode='distance')
n_neighbors = np.diff(neighborhoods.indptr)
core_samples = np.flatnonzero(n_neighbors > min_samples)
core_sample_indices = neighborhoods.indptr[core_samples]
core_distance = neighborhoods.data[core_sample_indices + min_samples]
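A runnable variant of the reviewer's idea (an editor's sketch, not code from either PR; the data, eps, and min_samples values are made up): the core distance of a point is the distance to its min_samples-th nearest neighbor, which a single batched kneighbors query yields directly, avoiding the need to index into the sparse-graph data array.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(60, 2)
min_samples = 5
eps = 0.3

# Distance to the min_samples-th neighbor. Each query point is its own
# nearest neighbor at distance 0, hence the min_samples + 1 columns.
nn = NearestNeighbors(n_neighbors=min_samples + 1).fit(X)
knn_dists, _ = nn.kneighbors(X)
core_distances = knn_dists[:, -1]

# Core samples: points whose min_samples-th neighbor lies within eps.
core_samples = np.flatnonzero(core_distances <= eps)
```

One caveat with the radius_neighbors_graph version above: the per-row data in the sparse graph is not guaranteed to be sorted by distance, so reading the min_samples-th entry of a row needs a per-row sort first; the kneighbors form sidesteps that.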
The OPTICS hierarchical clustering algorithm has been implemented, along with a heuristic method for extracting clusters from it.
Delayed the pairwise distance calculations until they are required. This resulted in a reduced memory footprint. Memory complexity should have gone from O(n^2) to O(n).
* Allowed the split point to be included in the right child.
* Added additional checks so that small children near the edges are not wrongfully inserted into noise.
* Added min_reach_ratio, a lower limit for the local maxima expressed in terms of the largest reachability distance in the set.
Initial implementation using a BallTree for spatial indexing. Due to BallTree's lack of a callable metric function, a test case had to be disabled.
Any further progress on this and/or #1984?
Added hierarchical extraction from scikit-learn#2043. #2043 is BSD licensed and hasn't had any activity for 10 months, so this seems pretty kosher; authors are cited in the code.
Is this still being worked on by anyone? If not, can I expect it to work if I merge it locally? Would be nice to try out for a project I'm working on.
Adding the "Need Contributor" tag, either to finish this or to move it into scikit-learn-contrib.
Removing the "Need Contributor" tag in favor of the now-active #1984.
* OPTICS clustering algorithm. Equivalent results to DBSCAN, but allows execution on arbitrarily large datasets. After initial construction, allows multiple 'scans' to quickly extract DBSCAN clusters at variable epsilon distances
* Create plot_optics. Shows example usage of OPTICS to extract clustering structure
* pep8 fixes. Mainly issues in long lines, long comments
* fixed conditional to be pep8
* updated to match sklearn API. OPTICS object, with fit method. Documentation updates. extract method
* removed extra files
* plotting example updated, small changes. New plot example that matches the updated API; updated n_cluster attribute with reruns of extract; removed scaling factor on first 'fit' run
* updated OPTICS.labels to OPTICS.labels_. Should pass unit test now?
* additional labels_ changes
* added stability warning. Scales eps to give stable results from first input distance. Extraction above scaled eps is not allowed; extraction below scaled eps but greater than input distance prints a stability warning but still runs clustering extraction. All distances below the initial input distance are stable; no warning printed
* Noise fix; updated example plot. Fixed noise points being initialized as type 'core point'. Fixed initialization for first 'fit' call. Decoupled eps and eps_prime (deep copy). Matched plot example to same random state as dbscan. Added second plot to show 'extract' usage
* Changed to match Sklearn API. eps is not modified in init; kwargs fix
* Forcing 2 parameters. Why do I have to do this?
* Conforming to API. Fit now returns self; test fix for unit test failure in ball tree (asarray problem)
* Fixing plot example labels to labels_
* Fixed issue with sparse matrices
* Another attempt at fixing the sparse matrix error. Temporary fix until balltree can be updated to deal with sparse matrices.
* Better checking of sparse arrays. Using 'check_array'
* General cleanup. Removed extraneous lines and comments (old commented-out code has been removed)
* Added unit tests for extract function. Added the unit tests in test_optics for fit, extract, and declaration. Should bring coverage to ~100%. Additionally, fixed a small bug that cropped up in the extract function during testing.
* Attempting for near 100% coverage. Removed unused imports, added to get warning.
* Fixed error in unit tests
* Trimmed extraneous 'if-else' check. See title
* forcing to check for a warning. Result should be 100% coverage
* Updates to doc strings. All public methods now have doc strings; forcing a rebuild of OPTICS so that build tests pass (last round failed due to an external module)
* Style / pep8 changes. 99% pep8 now… line 138 isn't, but reads better with the long variable names
* Added Narrative Documentation. Includes general description, discussion of algorithm output, and comparison with DBSCAN. References and implementation notes are included
* Vectorized nneighbors lookups. Following suggestion from jnothman for doing nneighbors queries en masse. Added OPTICS to cluster init
* fixing init build error
* reverting init
* All code now vectorized. …at least all code that can be ;) Some general pruning and cleanup as well
* Style changes. Now 100% pep8
* Changing parameter style, matching DBSCAN
* Extraction change; Authors update. Initialize all points as 'not core' (fixes bug when plotting at epsPrime larger than eps); added Sean Freeman to authors list
* Changed eps scaling to 5x instead of 10x. 10x scaling is too conservative… eps scaling at 5x is perfectly stable, and much faster as well.
* Fixing unit test. Should be 'None' for this; initialization previously at 1 for 'is_core' was incorrect
* Actually fixing unit test. Null comparison doesn't work; using size
* Making ordering_ and other attributes public. Renamed core_samples to core_sample_indices_ (as in DBSCAN). Used attribute naming conventions (trailing _ character), and made ordering_, reachability_, and core_dists_ public.
* Updates for Documentation. Includes attribute strings for now-public attributes
* Pep8 cleanup. Minor pep8 and pyflakes fixes
* updating plot example to match new attribute name
* CamelCase fixes. Conforming to sklearn API on CamelCase
* adding hierarchical extraction function. Added hierarchical extraction from #2043. #2043 is BSD licensed and hasn't had any activity for 10 months, so this seems pretty kosher; authors are cited in the code
* added hierarchical switch to extract. Additional style and documentation changes as well
* cluster order bug fix. Ensured that the ordered list is always the same length as the input data
* removed hierarchical cluster extraction. Code from FredrikAppelros is totally unable to handle noise: as currently written, every point is assigned to a cluster, except the first point of that cluster. May include later as a third method
* initial import of automatic cluster extraction code. Adding the excellent (and working) automatic cluster extraction code from Amy X. Zhang. Some minor formatting changes on import to conform to pep8
* wrapper for 'auto' extraction. Additional fixes to style (camelCase, etc.), comments; made all helper functions private
* test and example updates. Much better example data to showcase 'auto' extraction. Unit tests now test both extraction methods. Set 'auto' as default, as it doesn't require any parameters and gives a better result. Pruned references to hierarchical clustering
* fixing unit coverage; pruning unused functions. Probably could still get a better unit test for 'auto' clustering…
* Added 'filter' function. Allows density-based filtering. Useful for cases where only a 'noise'/'no noise' classification is desired. Function does not require fitting prior to running, although it can be run after a fit if desired
* Vectorizing auto_cluster. Generalizes input to multiple dimensions as well…
* updated filter function
* fixing test error. Setting minPts to < data size returns None (with error message)
* removing exception handling in favor of conditional check
* Updated unit tests. Coverage to 90%. PEP8 fixes
* Additional unit test. Now at 94%; the auto extract method is very hard to test… this new unit test adds a more robust dataset to trigger more branches for testing. Some of the remaining conditionals are pretty rare … :-/
* Fix unit test bug / python 3 compat. None type comparison problem…
* Fixed annoying deprecation warning for 1d array in BallTree. PEP8 fixes
* More PEP8 and remove print statements from test
* 70 to 80% faster, fixed distance metrics. Modified to remove extraneous sorts and nneighbors query, and reduced pairwise distance calculations to the upper triangle of a distance matrix (instead of the full matrix). It appears that in 'full' scans (i.e., when epsilon is set to inf, or the width of the data set), OPTICS now actually outperforms DBSCAN… also, should be easy to run distance calculations in parallel now for large datasets, with proper heuristics.
* Fixing unit test failure
* Exposed auto_cluster parameters as public. Documentation and API update so that users can tweak the auto method
* Fixed def with missing ':'
* Fix bugs from api change... Make sure that arguments are being called correctly with the new extract_auto() method
* pep8 / pyflakes changes
* Updated example / plot. Correctly generates figure with subplots for DBSCAN / OPTICS comparison
* Tuning plot example / pep8 change
* Bug fix for commit 0b4cbdd (enforce stable sort). The returned index from "sp.argmin(setofobjects.reachability_[n_pr])" assumes that entries are stably sorted by distance from the query point (i.e., that ties in the argmin return the closest point to the input point). Still overall faster compared to the pre-0b4cbdd version, since we're filtering processed points before sorting by distance… and also only calculating distances for non-processed points, instead of all within the epsilon query.
* Code review fixes (style). Fixes coding style (test comments, camel case, author list, relative imports, etc). Included a new unit test and .npy file to explicitly test reachability distances (test coverage at 99%).
* fixing new unit test. Can't import reach_values.npy in current directory… just placed testing values in the script directly (~200 lines for 1500 testing values)
* refactored min_heap to c extension. Reduced the optics file by about 100 lines of code… new c extension for speedup (needs further optimization… 6-10X speedup still possible)
* minor fixes..
* small cython optimizations
* cython fixes
* Add foo.txt
* Remove foo.txt
* fix compilation error
* last optimizations. The cython/numpy MinHeap is only called once… so it's faster to do a simple linear scan here. The np.argmin() function does *almost* the exact same thing, but the custom quick-scan function is needed for cases where reachability distances are tied (and the next point is selected based on which of the points tied in reachability is closest to the querying point). OPTICS is now faster than DBSCAN for medium-to-large numbers of input points, and has better worst-case run time (eps=inf).
* fix pyflakes errors; change default eps value
* _
* API changes from agramfort. Removed 'filter' method, changed print statements to exceptions and warnings as needed, removed 'processed' flag and replaced it with a fit_check method, changed array copies to parameters, updated unit tests, changed inline documentation to match proper doc string formatting, made private class private. Probably some other changes too…
* fixed core samples bug
* added fit_predict, conforming to scipy api
* updates to variable names; update plot
* refactor to remove balltree specific code
* major refactor. Finished decoupling balltree; lots of changes
* fixed bugs; tests all pass again :-)
* fixed weird cython bug. …not at all sure why pairwise_distances doesn't automatically return np.float arrays. Makes no sense to me. I could understand cases in which the metric call returns ints (i.e., city block)… but why it would return float32 instead of float64 seems super odd.
* major refactor. Deleted extract and auto_extract methods; added optics function; added extract_dbscan method; added extract_dbscan and extract_optics functions; updated unit tests; renamed `eps` to `max_bounds`; flake8 corrections; added types: int -> labels, bool -> is_core; enforced X to be 'float' (fixes cython type errors)
* Updated Documentation! Updated plot with reachability plot, as well as documentation :)
* fix flake8 error
* added optics to cluster comparison. Don't like the figure since it doesn't use black for noise, but added OPTICS for consistency
* Updated comparison plot to transpose. ...kinda kludgy fix for the transpose
* small fix
* flake8 error
* reverting transpose of cluster comparison (separate PR #9739)
* fixes from agramfort's review. Public/private changes, numpydoc docstring fixes, a few pep8 fixes that flake8 didn't flag for some reason...
* fix for error message. Unit test should pass now
* force cluster_ids to start at 0
* Fix sp. error and flake8 warning(s)
* Updated documentation. Responses to reviews
* Removed extraneous files. Also fixed a small typo
* fixing lgtm alert
* changes from jnothman. Small fixes for docs, plots, and tests
* Fixes from jnothman's review. Reordered parameters, updated documentation, removed unneeded else statement, changes to variable names.
* Fixing flake8 error
* Removed neighbors / balltree inheritance. Also decouples n_jobs -- can set n_jobs for just the kneighbors lookup, while keeping pairwise lookups to a single job
* Made nbrs private and moved initiation to fit(). Also renamed core_dists to core_distances.
* fixed non-standard characters
* Response to TomDLT review. Narrative changes to tests. Normalized reachability distances for the significant_min parameter. Cleaned up plots; small changes to documentation with optics_.py
* Fixed labeling bug. Also minor documentation updates
* update unit test. Since labels are 0-indexed, the max of labels is (total number of clusters - 1). We can't take len(set(clusters)) because of noise (it would be 4, not 3).
* Simple fixes per jnothman. Fixed float division, condensed variable names to be shorter, renamed bools to True and False, removed un-needed code block. Still need to add tests for cluster tree extraction :-(
* Auto-cluster tests. Coverage should be complete now; fixed a minor bug; removed an un-needed check.
* fixing test error
* removed python loop. Also fixed documentation link
* Fixing test error
* Fix typo in unit test. Entry was supposed to be '1.0' not '10'; the test is supposed to posit 3 clusters, 2 of which are too small and are merged. The old version posited 4 clusters, two of which were merged as intended, and two of which were discarded (cluster merging requires one of the clusters to be large enough to be an independent cluster; with 4 instead of 3, this case did not happen for either of the first two clusters).
* documentation updates
* Post-merge doctest fix. Merge conflict in clustering.rst in the previous commit; this push updates the doctest values to the current correct values, and resolves the conflict
* DBSCAN / OPTICS invariant test. Restructured documentation. Small unit test fixes. Added test to ensure clustering metrics between OPTICS dbscan_extract and DBSCAN are within 2% of each other.
* Update _auto_cluster docstring. Renamed reachability_ordering --> ordering for consistency
* changes from jnothman. Changes unit tests to check for a specific error message (instead of 'an' error); minor updates to documentation. This also fixes a bug in the extract_dbscan function whereby some core points were erroneously marked as periphery... this is fixed by reverting to a previous extraction code block that initializes all points as core and then demotes noise and periphery points during the extract scan. Parameterized unit test.
* fix spelling error in tests
* contingency_matrix test. New invariant test between optics and dbscan
* small unit test updates per jnothman
* unit test typo fix
* extract dbscan updates. Vectorized the extract dbscan function to see if it would improve performance of periphery point labeling; it did not, but the function is vectorized. Changed unit test with min_samples=1 to min_samples=3, as at min_samples=1 the test isn't meaningful (no noise is possible; all points are marked core). Parameterized parity test. Changed parity test to assert ~5% or better mismatch, instead of 5 points (this is needed for larger clusters, as the starting-point mismatch effect scales with cluster size).
* updated documentation comparing OPTICS/DBSCAN
* DOC: phrasing and whats_new
* MISC: small mem footprint in OPTICS
Resolved in #1984, thanks @FredrikAppelros
Added an implementation of the OPTICS clustering algorithm. OPTICS does not by itself produce a set of labels for the samples, so we have also implemented a hierarchical cluster extraction algorithm.

As of now, both implementations are located in the optics_.py file, but the extraction algorithm should probably be refactored out. OPTICS can produce labels from its set of reachability distances and its ordering in several different ways, so a modular approach to the extraction methods would be preferred. We currently use scipy to calculate distances between samples instead of the built-in pairwise_distances in order to reduce memory consumption.

// Fredrik Appelros & Carl Ekerot
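For reference, the OPTICS API that eventually shipped in scikit-learn (via #1984, around version 0.21) separates fitting from label extraction in exactly this modular way. A minimal usage sketch, with made-up example data:

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

rng = np.random.RandomState(0)
# Two well-separated Gaussian blobs of 50 points each.
X = np.vstack([rng.randn(50, 2) + [10, 10], rng.randn(50, 2) - [10, 10]])

opt = OPTICS(min_samples=5).fit(X)

# labels_ come from the default (xi-based) extraction...
labels_xi = opt.labels_

# ...but the stored reachability/ordering can be re-extracted at any
# epsilon, DBSCAN-style, without refitting:
labels_eps = cluster_optics_dbscan(
    reachability=opt.reachability_,
    core_distances=opt.core_distances_,
    ordering=opt.ordering_,
    eps=2.0,
)
```

This matches the design discussed in the thread: one expensive ordering computation, then cheap repeated extractions at different density thresholds.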