
[MRG] Implementation of OPTICS #1984

Closed · wants to merge 137 commits

Conversation

@espg

@espg espg commented May 21, 2013

Produces results equivalent to DBSCAN, but allows execution on arbitrarily large datasets. After the initial construction, multiple 'scans' can quickly extract DBSCAN clusters at variable epsilon distances.

The algorithm is tested and validated, but not yet optimized. Tested on 70K points (successful); currently testing on 2 x 10^9 LiDAR points (pending).

First commit; I would appreciate any advice on changes to help the code conform to the scikit-learn API.

Example usage is in the examples folder.

espg added 2 commits May 21, 2013
Equivalent results to DBSCAN, but allows execution on arbitrarily large datasets. After initial construction, allows multiple 'scans' to quickly extract DBSCAN clusters at variable epsilon distances
Shows example usage of OPTICS to extract clustering structure
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4)

##############################################################################
Member

@jaquesgrobler jaquesgrobler May 22, 2013

I tend to prefer not having these separation lines.. A blank line is fine :)

@NelleV
Member

@NelleV NelleV commented May 22, 2013

Your code isn't PEP 8 compliant, and the documentation doesn't follow numpydoc's conventions.



## Main Class ##

class setOfObjects(BallTree):
Member

@NelleV NelleV May 22, 2013

Make sure to follow PEP 8's convention for class names.

@jaquesgrobler
Member

@jaquesgrobler jaquesgrobler commented May 22, 2013

Thanks for the PR. I'm not that familiar with OPTICS myself, but I had a look through.
Before getting too in depth, style-wise you should make your code PEP 8 compliant. A few minor things I noticed at first glance are also mentioned in comments.

The other thing I'll mention is that scikit-learn generally follows this API style.
It'd be good if you could make your API match that of the scikit, like making
prep_optics(...), build_optics(...), ExtractDBSCAN(...) etc. match the fit(...) etc. API. Other helper functions you can keep as is.

Hope that helps for now. Thanks again 👍


espg added 2 commits May 22, 2013
Mainly issues in long lines, long comments
@mblondel
Member

@mblondel mblondel commented Jan 8, 2014

@espg Since this PR has stalled, I think it would be a good idea to extract the source into a github repo or gist and add it to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets

@espg
Author

@espg espg commented Jan 8, 2014

@mblondel Do you mind if I keep this PR open until the end of the month to make some changes that bring it more in line with the scikit-learn API?

I was hiking for 2 months on the Pacific Crest Trail and had shoulder surgery after I got back, so the PR had fallen off my radar -- but I can fix the remaining issues in the next 2 weeks if there's still interest in having an OPTICS implementation in scikit-learn.

@kevindavenport

@kevindavenport kevindavenport commented Mar 28, 2014

This is great work!

@arnabkd

@arnabkd arnabkd commented Apr 19, 2014

@espg : I am very interested in an OPTICS implementation in scikits-learn :)

@michaelaye

@michaelaye michaelaye commented Nov 4, 2014

What's the status of this?

@amueller
Member

@amueller amueller commented Nov 5, 2014

Stalled ;) Feel free to pick it up, I'd say. Needs tests, documentation, benchmarks, I'd think.

@kevindavenport

@kevindavenport kevindavenport commented Nov 5, 2014

I want to jump in and take this over if it's stalled. I can start this weekend :)

@amueller
Member

@amueller amueller commented Nov 5, 2014

I'm pretty sure no-one would mind if it provides a good improvement over DBSCAN.


OPTICS object, with fit method. Documentation updates.
extract method
@espg
Author

@espg espg commented Nov 5, 2014

@michaelaye : Good question. This had fallen off my radar a bit, but I just pushed a fairly large update, the main purpose of which was to fit in with the scikit-learn API; documentation has been expanded as well. I believe that fitting within the sklearn API was the main outstanding issue.

The code is PEP 8 compliant with the exception of a single long line, which is hard to rewrite without losing readability. The algorithm has been tested on points in three dimensions up to order 10^8 with no ill effects (~40 hours on my 4-year-old laptop), and memory use is small. I know scikit-learn doesn't have an OPTICS implementation, so I'd like to see it included and am happy to work with the maintainers if there are any outstanding issues.

While the API is as close as possible to sklearn, there doesn't seem to be a natural way to 'rescan' for a different eps distance. The sklearn methods fit, predict, and transform all expect data as an input, and the whole point of OPTICS is that you don't need to recompute anything for a lower eps after the first run, so there isn't really a sklearn-ish way to implement that. I've included an 'extract' method that takes a new eps distance and nothing else to do this.

I think that there are additional things that could be done to enhance the module, specifically a save and restore function for dealing with large datasets where you want persistence beyond a single python session, and also some additional performance tweaks. That said, this algorithm doesn't currently exist in sklearn, and it works and matches the sklearn API fairly well. It is also, due to aggressive pruning, several orders of magnitude faster than other implementations that I've seen.
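[Editor's note: a minimal sketch of the fit-once, extract-many workflow described above. It is written against the OPTICS API that eventually shipped in scikit-learn, where this PR's `extract` method became the `cluster_optics_dbscan` helper; the names below follow that released API, not this branch.]

```python
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.4, random_state=0)

# One expensive pass builds the reachability ordering.
opt = OPTICS(min_samples=10).fit(X)

# Cheap re-extractions at different eps values; nothing is recomputed.
labels_tight = cluster_optics_dbscan(
    reachability=opt.reachability_, core_distances=opt.core_distances_,
    ordering=opt.ordering_, eps=0.3)
labels_loose = cluster_optics_dbscan(
    reachability=opt.reachability_, core_distances=opt.core_distances_,
    ordering=opt.ordering_, eps=0.5)
```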

@espg
Author

@espg espg commented Nov 5, 2014

@amueller : I hadn’t refreshed the page, so I didn’t see your message till just now. I ran some benchmarks a while back; I’m not sure if those still work, or if you have a specific methodology that you expect people to follow. Let me know about the documentation as it stands.

As for having other people jump in— this was my first attempt at contributing to any software package, and I get the distinct feeling that I’m pretty green at it. I’d love any help with making this accepted to the master branch— it does no good to anyone just sitting around :/

@amueller
Member

@amueller amueller commented Nov 5, 2014

@espg For persistence, that is basically automatic with pickling. The estimator should be pickle-able (there is actually a test for that afaik), and if not that should be fixed.
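[Editor's note: a minimal sketch of the pickle-based persistence being described, using DBSCAN as a stand-in estimator since this PR's OPTICS class is not importable from a released scikit-learn.]

```python
import pickle

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
model = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Persist the fitted estimator across sessions via pickling.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The restored estimator carries its fitted state.
assert (restored.labels_ == model.labels_).all()
```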

For the benchmarks: The results are supposed to be comparable to DBSCAN, right? So I think a performance comparison against DBSCAN, showing how it is more scalable, would be great (or showing that with the current dbscan impl the memory would blow up).

For rescanning: Maybe @GaelVaroquaux can say a word about how he solved a similar problem with agglomerative approaches. My first idea would be to have some attribute that stores the whole hierarchical structure, which could be accessed by the user. That is obviously not super convenient.

There are also tests missing, for clustering usually some sanity tests on synthetic data and maybe some that are specific to the algorithm. I can't really say much about that as I am not super familiar with the algorithm. Usually we want near-complete line-coverage, though.

For the documentation, we also want narrative documentation (at doc/modules/clustering.rst).
That could describe what is similar or different from other clustering approaches in scikit-learn, roughly the intuition of the parameters, the idea of the algorithm, and when it is particularly useful (e.g. this is a memory-limited variant of DBSCAN that produces a hierarchical clustering).

Maybe also add it to the clustering comparison example.

@amueller
Member

@amueller amueller commented Nov 5, 2014

I know we're asking for a lot, but we have to uphold our standards ;) Let me know if you need more info or any help. I'll try to see if I can review the code in more detail (or maybe someone who is more familiar with the algorithm can).

@amueller
Member

@amueller amueller commented Nov 5, 2014

It might also be interesting to compare to #3802.

@MechCoder
Member

@MechCoder MechCoder commented Nov 5, 2014

I can also try reviewing, once you feel that it is in a reviewable state.

@GaelVaroquaux
Member

@GaelVaroquaux GaelVaroquaux commented Nov 5, 2014

For rescanning: Maybe @GaelVaroquaux can say a word about how he solve a
similar problem with agglomerative approaches. My first idea would be to have
some attribute that stores the whole hierarchical structure which could be
accessed by the user. That is obviously not super convenient.

joblib.Memory (look for it in FeatureAgglomeration).
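[Editor's note: a small sketch of the joblib.Memory caching pattern being suggested; `slow_square` is a made-up stand-in for an expensive computation, not code from this branch.]

```python
import tempfile

from joblib import Memory

# Cache results on disk so repeated calls skip recomputation.
memory = Memory(location=tempfile.mkdtemp(), verbose=0)
calls = []  # records each time the function body actually runs

@memory.cache
def slow_square(x):
    calls.append(x)
    return x * x

assert slow_square(3) == 9   # computed and stored on disk
assert slow_square(3) == 9   # served from the on-disk cache
assert len(calls) == 1       # the body ran only once
```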

@jnothman
Member

@jnothman jnothman commented Jul 11, 2018

So I've tried:

import numpy as np
import pytest
from numpy.testing import assert_array_equal
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.datasets import make_blobs
from sklearn.metrics.cluster import contingency_matrix


@pytest.mark.parametrize('eps', [0.1, .3, .5])
@pytest.mark.parametrize('min_samples', [1, 10, 20])
def test_dbscan_optics_parity(eps, min_samples):
    # Test that OPTICS clustering labels differ from DBSCAN's by < 5 points

    centers = [[1, 1], [-1, -1], [1, -1]]
    X, labels_true = make_blobs(n_samples=750, centers=centers,
                                cluster_std=0.4, random_state=0)

    # calculate OPTICS with DBSCAN extraction at the given epsilon
    op = OPTICS(min_samples=min_samples).fit(X)
    core_optics, labels_optics = op.extract_dbscan(eps)

    # calculate dbscan labels
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)

    contingency = contingency_matrix(db.labels_, labels_optics)
    # agreement is at best the sum of the max value on either axis
    agree = min(np.sum(np.max(contingency, axis=0)),
                np.sum(np.max(contingency, axis=1)))
    disagree = X.shape[0] - agree

    # verify core_labels match
    assert_array_equal(core_optics, db.core_sample_indices_)

    # verify label mismatch is < 5 labels
    assert disagree < 5

It fails for:

  • min_samples=10, eps=0.1 with disagree == 38.
  • min_samples=20, eps=0.3 with disagree == 9.

Otherwise it passes. Is this what we should expect?

@jnothman
Member

@jnothman jnothman commented Jul 11, 2018

(Note that I changed the calculation of agree. I think it's correct...)

@espg
Author

@espg espg commented Jul 12, 2018

The mismatch should probably be checked as a function of the number of periphery points. The core points are always the same, but depending on how much noise is present, we would expect the disagreement between what is/isn't noise to stay proportionally the same based on cluster size, or on how well the algorithm is extracting clusters in general. Testing that there is no more than 6% disagreement between the non-core points works, but I may see if there's a way to write the test that passes at 5%, since it feels slightly less arbitrary.
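[Editor's note: the proportional check proposed here could be sketched as follows; `periphery_mismatch_fraction` is a hypothetical helper written for illustration, not code from this branch.]

```python
import numpy as np

def periphery_mismatch_fraction(labels_db, labels_op, core_mask):
    """Fraction of non-core (periphery) points whose labels disagree.

    Core points are assumed to match exactly, so only periphery
    points contribute to the disagreement.
    """
    periphery = ~core_mask
    n_periphery = int(np.sum(periphery))
    if n_periphery == 0:
        return 0.0
    disagree = int(np.sum(labels_db[periphery] != labels_op[periphery]))
    return disagree / n_periphery

# Toy example: one of the two periphery points disagrees -> 0.5
db = np.array([0, 1, -1, -1])
op = np.array([0, 1, -1, 0])
core = np.array([True, True, False, False])
assert periphery_mismatch_fraction(db, op, core) == 0.5
```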


Vectorized extract dbscan function to see if would improve performance of periphery point labeling; it did not, but the function is vectorized. Changed unit test with min_samples=1 to min_samples=3, as at min_samples=1 the test isn't meaningful (no noise is possible, all points are marked core). Parameterized parity test. Changed parity test to assert ~5% or better mismatch, instead of 5 points (this is needed for larger clusters, as the starting point mismatch effect scales with cluster size).
@espg
Copy link
Author

@espg espg commented Jul 12, 2018

I looked at the original OPTICS paper to see if they claim exact correspondence with DBSCAN; here's the relevant passage:

The clustering created from a cluster-ordered data set by ExtractDBSCAN-Clustering is nearly indistinguishable from a clustering created by DBSCAN. Only some border objects may be missed when extracted by the algorithm ExtractDBSCAN-Clustering if they were processed by the algorithm OPTICS before a core object of the corresponding cluster had been found. However, the fraction of such border objects is so small that we can omit a postprocessing (i.e. reassign those objects to a cluster) without much loss of information.

I think the takeaway is that core points will be exactly identical (and the unit tests check for this), but the periphery point classification will be "nearly indistinguishable". In practice, from the unit testing, "nearly indistinguishable" seems to be the same labeling within ~5% of the portion of the data that isn't processed as core.

Member

@jnothman jnothman left a comment

Thanks for that!

I'm a little concerned that we've still missed something, but we've certainly improved a lot over the last months. I'm going to be bold and approve.

I think @agramfort's approval here is a bit stale. It would be good for someone to have another quick look over tests and/or implementation.

Thanks @espg!!


assert_array_equal(core_optics, db.core_sample_indices_)

non_core_count = len(labels_optics) - len(core_optics)
percent_mismatch = np.round((disagree - 1) / non_core_count, 2)
Member

@jnothman jnothman Jul 15, 2018

I prefer fraction when we're not multiplying by 100

Member

@jnothman jnothman Jul 15, 2018


Why do we need to round?

@GaelVaroquaux
Member

@GaelVaroquaux GaelVaroquaux commented Jul 16, 2018

I'll try to review and hopefully merge this guy.

@GaelVaroquaux
Member

@GaelVaroquaux GaelVaroquaux commented Jul 16, 2018

I am pushing cosmetic changes to this branch, rather than asking for them (with the goal of merging this ASAP).

@GaelVaroquaux
Member

@GaelVaroquaux GaelVaroquaux commented Jul 16, 2018

I am not able to push to your fork (maybe a permission thing), so I've created a branch on my fork: https://github.com/GaelVaroquaux/scikit-learn/tree/pr_1984

I might merge from it.


nbrs.fit(X)
self.core_distances_[:] = nbrs.kneighbors(X,
                                          self.min_samples)[0][:, -1]
self._nbrs = nbrs
Member

@GaelVaroquaux GaelVaroquaux Jul 16, 2018

I do not think that it is necessary to store this object. It is used only in the fit. Storing it adds weight to the memory and to the pickle.


@GaelVaroquaux GaelVaroquaux mentioned this pull request Jul 16, 2018
@GaelVaroquaux
Member

@GaelVaroquaux GaelVaroquaux commented Jul 16, 2018

Merged via #11547. Hurray!! Thank you!

@jnothman
Member

@jnothman jnothman commented Jul 16, 2018

Wow. That's been a long time coming!!

@LucasHMS

@LucasHMS LucasHMS commented Jul 16, 2018

Is it planned to go out with the next release or is it already available?



@espg
Author

@espg espg commented Jul 16, 2018

Hooray! Thanks @jnothman and @GaelVaroquaux for reviewing to get this over the finish line :-)


@lmcinnes
Contributor

@lmcinnes lmcinnes commented Jul 16, 2018

Congrats @espg! Glad this finally made it!

@espg
Author

@espg espg commented Jul 16, 2018

It was my first PR, definitely learned a lot about the process...


@czhao028

@czhao028 czhao028 commented Jul 19, 2018

In the meantime, how can we install the GitHub version of sklearn so we can access OPTICS? @GaelVaroquaux I tried cloning from GitHub and running setup.py (still could not import it), as well as running pip3 install git+https://github.com/scikit-learn/scikit-learn.git; is there something I'm missing? How should we install it?

Edit: nevermind, just got it hahaha. from sklearn.cluster.optics_ import OPTICS

