
[MRG] Implementation of OPTICS #1984

Closed · wants to merge 137 commits into base: master
Conversation

@espg commented May 21, 2013

Equivalent results to DBSCAN, but allows execution on arbitrarily large datasets. After initial construction, allows multiple 'scans' to quickly extract DBSCAN clusters at variable epsilon distances

Algorithm is tested and validated, but not yet optimized. Tested on 70K points (successful); currently testing on 2 x 10^9 LiDAR points (pending)

First commit; would appreciate any advice on changes to help code conform to the scikit-learn API

Example usage in examples folder

espg added some commits May 21, 2013

OPTICS clustering algorithm
Equivalent results to DBSCAN, but allows execution on arbitrarily large datasets. After initial construction, allows multiple 'scans' to quickly extract DBSCAN clusters at variable epsilon distances
Create plot_optics
Shows example usage of OPTICS to extract clustering structure
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4)

##############################################################################

@jaquesgrobler (Member) commented on the diff, May 22, 2013

I tend to prefer not having these separation lines; a blank line is fine :)

@NelleV (Member) commented May 22, 2013

Your code isn't PEP 8 compliant, and the documentation doesn't follow numpydoc's conventions.
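For reference, a numpydoc-style docstring looks like this — a minimal sketch, written for the extract_dbscan method that appears later in this PR:

def extract_dbscan(self, eps):
    """Perform a DBSCAN-equivalent extraction at a given epsilon.

    Parameters
    ----------
    eps : float
        Maximum reachability distance for two samples to fall in
        the same cluster.

    Returns
    -------
    core_samples : array, shape (n_core_samples,)
        Indices of the core samples.
    labels : array, shape (n_samples,)
        Cluster labels for each point; noisy samples are labeled -1.
    """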


## Main Class ##

class setOfObjects(BallTree):

@NelleV (Member) commented on the diff, May 22, 2013

Make sure to follow PEP 8's convention for class names.
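Under PEP 8, class names use the CapWords convention, so the class above would become something like:

from sklearn.neighbors import BallTree

class SetOfObjects(BallTree):   # CapWords rather than setOfObjects
    ...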

"""
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import optics as op

@jaquesgrobler (Member) commented on the diff, May 22, 2013

Although importing with an alias is normally useful, it makes the example a little less readable. Perhaps using optics. instead of op. would be better for the sake of the example.
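The suggestion would read like this (a sketch built from the example's existing call; testtree is the tree object constructed earlier in the example):

from sklearn.cluster import optics

# spelling out the module name reads more clearly than the 'op' alias
optics.prep_optics(testtree, 30, 10)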


# Run the top-level optics algorithm

op.prep_optics(testtree,30,10)

@jaquesgrobler (Member) commented on the diff, May 22, 2013

PEP 8 needed here:

op.prep_optics(testtree, 30, 10)

@jaquesgrobler (Member) commented May 22, 2013

Thanks for the PR. I'm not that familiar with OPTICS myself, but I had a look through. Before getting too in depth, style-wise you should make your code PEP 8 compliant. A few minor things I noticed at first glance are also mentioned in the inline comments.

The other thing I'll mention is that scikit-learn generally follows a common API style. It'd be good if you could make your API match that of the scikit, like making prep_optics(...), build_optics(...), ExtractDBSCAN(...) etc. match the fit(...) etc. API. Other helper functions you can keep as is.

Hope that helps for now. Thanks again 👍
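A minimal sketch of the estimator-style API being suggested; the class skeleton below is illustrative, not the final implementation:

import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin


class OPTICS(BaseEstimator, ClusterMixin):
    def __init__(self, min_samples=5, max_eps=np.inf):
        # parameters are stored unchanged, per the scikit-learn convention
        self.min_samples = min_samples
        self.max_eps = max_eps

    def fit(self, X, y=None):
        # build the reachability ordering here and set attributes such as
        # labels_, then return self so calls can be chained
        ...
        return self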

espg added some commits May 22, 2013

pep8 fixes
Mainly issues in long lines, long comments
@mblondel (Member) commented Jan 8, 2014

@espg Since this PR has stalled, I think it would be a good idea to extract the source into a GitHub repo or gist and add it to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets

@espg (Author) commented Jan 8, 2014

@mblondel Do you mind if I keep this PR open until the end of the month to make some changes that bring it more in line with the scikit-learn API?

I was hiking the Pacific Crest Trail for two months and had shoulder surgery after I got back, so the PR had fallen off my radar, but I can fix the remaining issues in the next two weeks if there's still interest in having an OPTICS implementation in scikit-learn.

@kevindavenport commented Mar 28, 2014

This is great work!

@arnabkd commented Apr 19, 2014

@espg: I am very interested in an OPTICS implementation in scikit-learn :)

@larsmans force-pushed the scikit-learn:master branch from 58a55ad to 4b82379 on Aug 25, 2014

@MechCoder force-pushed the scikit-learn:master branch from 6deaea0 to 3f49cee on Nov 3, 2014

@michaelaye commented Nov 4, 2014

What's the status of this?

@amueller (Member) commented Nov 5, 2014

Stalled ;) Feel free to pick it up, I'd say. Needs tests, documentation, and benchmarks, I'd think.

@kevindavenport commented Nov 5, 2014

I want to jump in and take this over if it's stalled. I can start this weekend :)

@amueller (Member) commented Nov 5, 2014

I'm pretty sure no one would mind if it provides a good improvement over DBSCAN.

updated to match sklearn API
OPTICS object, with fit method. Documentation updates.
extract method
@espg (Author) commented Nov 5, 2014

@michaelaye: Good question. This had fallen off my radar a bit, but I just pushed a fairly large update, the main purpose of which was to fit in with the scikit-learn API; the documentation has been expanded as well. I believe fitting within the sklearn API was the main outstanding issue.

The code is PEP 8 compliant except for a single long line, which is hard to rewrite without losing readability. The algorithm has been tested on points in three dimensions up to order 10^8 with no ill effects (~40 hours on my four-year-old laptop), and memory use is small. I know scikit-learn doesn't have an OPTICS implementation, so I'd like to see it included and am happy to work with the maintainers if there are any outstanding issues.

While the API is as close as possible to sklearn's, there doesn't seem to be a natural way to 'rescan' for a different eps distance. The sklearn methods fit, predict, and transform all expect data as an input, and the whole point of OPTICS is that you don't need to recompute anything for a lower eps after the first run, so there isn't really a sklearn-ish way to implement that. I've included an 'extract' method that takes a new eps distance and nothing else to do this.

I think there are additional things that could be done to enhance the module, specifically a save and restore function for dealing with large datasets where you want persistence beyond a single Python session, and some additional performance tweaks. That said, this algorithm doesn't currently exist in sklearn, it works, and it matches the sklearn API fairly well. It is also, due to aggressive pruning, several orders of magnitude faster than other implementations that I've seen.
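The rescan workflow described above might look like this (a sketch; the extract method name follows this PR, and the exact signature changed before merge):

from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=750, random_state=0)

clust = OPTICS(min_samples=10).fit(X)  # expensive: builds the ordering once
labels_a = clust.extract(0.5)          # cheap: DBSCAN-like labels at eps=0.5
labels_b = clust.extract(0.3)          # re-extract at a lower eps, no refit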

@espg (Author) commented Nov 5, 2014

@amueller: I hadn't refreshed the page, so I didn't see your message till just now. I ran some benchmarks a while back; I'm not sure if those work or if you have a specific methodology that you expect people to follow. Let me know about the documentation as it stands.

As for having other people jump in: this was my first attempt at contributing to any software package, and I get the distinct feeling that I'm pretty green at it. I'd love any help getting this accepted to the master branch; it does no good to anyone just sitting around :/

@amueller (Member) commented Nov 5, 2014

@espg For persistence, that is basically automatic with pickling. The estimator should be pickle-able (there is actually a test for that, afaik), and if it isn't, that should be fixed.
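The pickling pattern being referred to, as a standard sketch (file name and parameters are illustrative):

import pickle
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=750, random_state=0)
clust = OPTICS(min_samples=10).fit(X)

with open('optics_model.pkl', 'wb') as f:
    pickle.dump(clust, f)          # persist the fitted estimator
with open('optics_model.pkl', 'rb') as f:
    clust = pickle.load(f)         # restore in a later session, no refit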

For the benchmarks: The results are supposed to be comparable to DBSCAN, right? So I think a performance comparison against DBSCAN, showing how it is more scalable, would be great (or showing that with the current dbscan impl the memory would blow up).
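Such a comparison could be sketched like this (illustrative only; it uses the eventual OPTICS class name and a synthetic blob dataset):

import time
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100000, centers=5, random_state=0)

for name, model in [('DBSCAN', DBSCAN(eps=0.5, min_samples=10)),
                    ('OPTICS', OPTICS(min_samples=10))]:
    start = time.time()
    model.fit(X)
    print('%s: %.1f s' % (name, time.time() - start))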

For rescanning: Maybe @GaelVaroquaux can say a word about how he solved a similar problem with agglomerative approaches. My first idea would be to have some attribute that stores the whole hierarchical structure, which could be accessed by the user. That is obviously not super convenient.

There are also tests missing; for clustering these are usually some sanity tests on synthetic data and maybe some that are specific to the algorithm. I can't really say much about that as I am not super familiar with the algorithm. Usually we want near-complete line coverage, though.

For the documentation, we also want narrative documentation (at doc/modules/clustering.rst). That could describe what is similar or different from other clustering approaches in scikit-learn, roughly the intuition of the parameters, the idea of the algorithm, and when it is particularly useful (e.g., this is a memory-limited variant of DBSCAN that produces a hierarchical clustering, or something along those lines).

Maybe also add it to the clustering comparison example.

@amueller (Member) commented Nov 5, 2014

I know this is a lot we ask for but we have to uphold our standards ;) Let me know if you need more info or any help. I'll try to see if I can review the code in more detail (or maybe someone who is more familiar with the algorithm can).

@amueller (Member) commented Nov 5, 2014

It might also be interesting to compare to #3802.

@MechCoder (Member) commented Nov 5, 2014

I can also try reviewing, once you feel that it is in a reviewable state.

@GaelVaroquaux (Member) commented Nov 5, 2014

> For rescanning: Maybe @GaelVaroquaux can say a word about how he solved a similar problem with agglomerative approaches. My first idea would be to have some attribute that stores the whole hierarchical structure which could be accessed by the user. That is obviously not super convenient.

joblib.Memory (look for it in FeatureAgglomeration).
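A sketch of the joblib.Memory pattern Gael points to (using the modern joblib spelling; FeatureAgglomeration exposes the same kind of caching through its memory parameter):

from joblib import Memory
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

memory = Memory(location='/tmp/optics_cache', verbose=0)

@memory.cache
def fit_optics(X, min_samples):
    return OPTICS(min_samples=min_samples).fit(X)

X, _ = make_blobs(n_samples=750, random_state=0)
clust = fit_optics(X, 10)   # computed and cached on the first call
clust = fit_optics(X, 10)   # later calls load the cached result instead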

@jnothman (Member) commented Jul 11, 2018

So I've tried:

import numpy as np
import pytest
from numpy.testing import assert_array_equal

from sklearn.cluster import DBSCAN, OPTICS
from sklearn.datasets import make_blobs
from sklearn.metrics.cluster import contingency_matrix


@pytest.mark.parametrize('eps', [0.1, .3, .5])
@pytest.mark.parametrize('min_samples', [1, 10, 20])
def test_dbscan_optics_parity(eps, min_samples):
    # Test that OPTICS clustering labels differ from DBSCAN's on < 5 points

    centers = [[1, 1], [-1, -1], [1, -1]]
    X, labels_true = make_blobs(n_samples=750, centers=centers,
                                cluster_std=0.4, random_state=0)

    # run OPTICS, then extract DBSCAN-equivalent labels at the given eps
    op = OPTICS(min_samples=min_samples).fit(X)
    core_optics, labels_optics = op.extract_dbscan(eps)

    # run DBSCAN directly for comparison
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)

    contingency = contingency_matrix(db.labels_, labels_optics)
    # agreement is at best the sum of the max value on either axis
    agree = min(np.sum(np.max(contingency, axis=0)),
                np.sum(np.max(contingency, axis=1)))
    disagree = X.shape[0] - agree

    # verify core samples match
    assert_array_equal(core_optics, db.core_sample_indices_)

    # verify label mismatch is < 5 points
    assert disagree < 5

It fails for:

  • min_samples=10, eps=0.1 with disagree == 38.
  • min_samples=20, eps=0.3 with disagree == 9.

Otherwise it passes. Is this what we should expect?

@jnothman (Member) commented Jul 11, 2018

(Note that I changed the calculation of agree. I think it's correct...)

@espg (Author) commented Jul 12, 2018

The mismatch should probably be checked as a function of the number of periphery points. The core points are always the same, but depending on how much noise is present, we would expect the disagreement over what is and isn't noise to stay proportionally the same, based on cluster size or on how well the algorithm is extracting clusters in general. Testing that there is no more than 6% disagreement between the non-core points works, but I may see if there's a way to write the test that passes at 5%, since that feels slightly less arbitrary.
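In the parity test above, that fraction-based check might read like this (a sketch meant to replace the final assertion; 0.05 is the threshold settled on below):

# non-core points are everything the extraction did not mark as core
non_core_count = len(labels_optics) - len(core_optics)
fraction_mismatch = disagree / max(non_core_count, 1)
assert fraction_mismatch <= 0.05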

@jnothman (Member) commented Jul 12, 2018

extract dbscan updates
Vectorized the extract dbscan function to see if it would improve performance of periphery-point labeling; it did not, but the function is now vectorized. Changed the unit test with min_samples=1 to min_samples=3, since at min_samples=1 the test isn't meaningful (no noise is possible; all points are marked core). Parameterized the parity test. Changed the parity test to assert ~5% or better mismatch instead of 5 points (this is needed for larger clusters, as the starting-point mismatch effect scales with cluster size).
@espg (Author) commented Jul 12, 2018

I looked at the original OPTICS paper to see if they claim exact correspondence with DBSCAN; here's the relevant passage:

The clustering created from a cluster-ordered data set by ExtractDBSCAN-Clustering is nearly indistinguishable from a clustering created by DBSCAN. Only some border objects may be missed when extracted by the algorithm ExtractDBSCAN-Clustering if they were processed by the algorithm OPTICS before a core object of the corresponding cluster had been found. However, the fraction of such border objects is so small that we can omit a postprocessing (i.e. reassign those objects to a cluster) without much loss of information.

I think the takeaway is that core points will be exactly identical (and the unit tests check for this), but the periphery-point classification will be "nearly indistinguishable". In practice, from the unit testing, "nearly indistinguishable" seems to mean the same labeling to within ~5% of the portion of the data that isn't processed as core.

@jnothman (Member) left a comment

Thanks for that!

I'm a little concerned that we've still missed something, but we've certainly improved a lot over the last months. I'm going to be bold and approve.

I think @agramfort's approval here is a bit stale. It would be good for someone to have another quick look over tests and/or implementation.

Thanks @espg!!

assert_array_equal(core_optics, db.core_sample_indices_)

non_core_count = len(labels_optics) - len(core_optics)
percent_mismatch = np.round((disagree - 1) / non_core_count, 2)

@jnothman (Member) commented on the diff, Jul 15, 2018

I prefer fraction when we're not multiplying by 100

@jnothman (Member) commented on the diff, Jul 15, 2018

Why do we need to round?

@GaelVaroquaux (Member) commented Jul 16, 2018

I'll try to review and hopefully merge this guy.

@GaelVaroquaux (Member) commented Jul 16, 2018

I am pushing cosmetic changes to this branch, rather than asking for them (with the goal of merging this ASAP).

@GaelVaroquaux (Member) commented Jul 16, 2018

I am not able to push to your fork (maybe a permission thing), so I've created a branch on my fork: https://github.com/GaelVaroquaux/scikit-learn/tree/pr_1984

I might merge from it.

nbrs.fit(X)
self.core_distances_[:] = nbrs.kneighbors(X,
                                          self.min_samples)[0][:, -1]
self._nbrs = nbrs

@GaelVaroquaux (Member) commented on the diff, Jul 16, 2018

I do not think that it is necessary to store this object. It is used only in the fit. Storing it adds weight to the memory and to the pickle.
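Following that suggestion would mean keeping the neighbors model local to fit, along these lines (a sketch of the fit body; the construction of nbrs is assumed from surrounding context):

from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=self.min_samples)
nbrs.fit(X)
self.core_distances_[:] = nbrs.kneighbors(X, self.min_samples)[0][:, -1]
# no self._nbrs assignment: the object goes out of scope when fit returns,
# so it adds nothing to memory or to the pickle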

@GaelVaroquaux referenced this pull request on Jul 16, 2018: OPTICS #11547 (Merged)

@GaelVaroquaux (Member) commented Jul 16, 2018

Merged via #11547. Hurray!! Thank you

@jnothman (Member) commented Jul 16, 2018

Wow. That's been a long time coming!!

@LucasHMS commented Jul 16, 2018

Is it planned to go out with the next release or is it already available?

@jnothman (Member) commented Jul 16, 2018

@GaelVaroquaux (Member) commented Jul 16, 2018

@espg (Author) commented Jul 16, 2018

Hooray! Thanks @jnothman and @GaelVaroquaux for reviewing to get this over the finish line :-)

@GaelVaroquaux (Member) commented Jul 16, 2018

@lmcinnes (Contributor) commented Jul 16, 2018

Congrats @espg! Glad this finally made it!

@espg (Author) commented Jul 16, 2018

It was my first PR, definitely learned a lot about the process...

@GaelVaroquaux (Member) commented Jul 16, 2018

@czhao028 commented Jul 19, 2018

In the meantime, how can we install the GitHub version of sklearn so we can access OPTICS? @GaelVaroquaux I tried cloning from GitHub and running setup.py (I still could not import it), as well as running pip3 install git+https://github.com/scikit-learn/scikit-learn.git; is there something I'm missing? How should we install it?

Edit: never mind, just got it hahaha. from sklearn.cluster.optics_ import OPTICS

@jnothman (Member) commented Jul 20, 2018
