[MRG] Implementation of OPTICS #1984

Closed

espg wants to merge 137 commits

Conversation

espg commented May 21, 2013

Produces results equivalent to DBSCAN, but allows execution on arbitrarily large datasets. After the initial construction, multiple 'scans' can quickly extract DBSCAN clusters at variable epsilon distances.

The algorithm is tested and validated, but not yet optimized. It has been tested on 70K points (successful); a test on 2 x 10^9 LiDAR points is currently pending.

This is a first commit; I would appreciate any advice on changes to help the code conform to the scikit-learn API.

Example usage is in the examples folder.

espg added some commits May 21, 2013

OPTICS clustering algorithm
Equivalent results to DBSCAN, but allows execution on arbitrarily large datasets. After initial construction, allows multiple 'scans' to quickly extract DBSCAN clusters at variable epsilon distances
Create plot_optics
Shows example usage of OPTICS to extract clustering structure
examples/cluster/plot_optics
+centers = [[1, 1], [-1, -1], [1, -1]]
+X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4)
+
+##############################################################################

jaquesgrobler (Member) May 22, 2013

I tend to prefer not having these separation lines.. A blank line is fine :)

NelleV (Member) commented May 22, 2013

Your code isn't pep8, and the documentation doesn't follow numpydoc's conventions.

sklearn/cluster/optics.py
+
+## Main Class ##
+
+class setOfObjects(BallTree):

NelleV (Member) May 22, 2013

Make sure to follow pep8's convention for class names.

examples/cluster/plot_optics
+"""
+from sklearn.datasets.samples_generator import make_blobs
+from sklearn.preprocessing import StandardScaler
+from sklearn.cluster import optics as op

jaquesgrobler (Member) May 22, 2013

Although importing as is normally useful, it maybe makes the example a little bit less readable. Perhaps using optics. instead of op. would be better for the sake of the example.

examples/cluster/plot_optics
+
+# Run the top-level optics algorithm
+
+op.prep_optics(testtree,30,10)

jaquesgrobler (Member) May 22, 2013

pep8 needed here:

op.prep_optics(testtree, 30, 10)

jaquesgrobler (Member) commented May 22, 2013

Thanks for the PR. I'm not that familiar with OPTICS myself, but I had a look through. Before getting too in depth, coding-style-wise you should make your code pep8. A few minor things I noticed at first glance are also mentioned in the comments.

The other thing I'll mention is that scikit-learn generally follows this API style. It'd be good if you could make your API match that of the scikit, like making prep_optics(...), build_optics(...), ExtractDBSCAN(...) etc. match the 'fit(...)' etc. API. Other helper functions you can keep as is.

Hope that helps for now. Thanks again 👍

espg added some commits May 22, 2013

pep8 fixes
Mainly issues in long lines, long comments

mblondel (Member) commented Jan 8, 2014

@espg Since this PR has stalled, I think it would be a good idea to extract the source into a github repo or gist and add it to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets

espg commented Jan 8, 2014

@mblondel Do you mind if I keep this PR open until the end of the month to make some changes that bring it more in line with the scikit-learn API?

I was hiking for 2 months on the Pacific Crest Trail and had shoulder surgery after I got back, so the PR had fallen off my radar, but I can fix the remaining issues in the next 2 weeks if there's still interest in having an OPTICS implementation in scikit-learn.

kevindavenport commented Mar 28, 2014

This is great work!

arnabkd commented Apr 19, 2014

@espg: I am very interested in an OPTICS implementation in scikit-learn :)

michaelaye commented Nov 4, 2014

What's the status of this?

amueller (Member) commented Nov 5, 2014

Stalled ;) Feel free to pick it up, I'd say. Needs tests, documentation, benchmarks, I'd think.

kevindavenport commented Nov 5, 2014

I want to jump in and take this over if it's stalled. I can start this weekend :)

amueller (Member) commented Nov 5, 2014

I'm pretty sure no-one would mind if it provides a good improvement over DBSCAN.

updated to match sklearn API
OPTICS object, with fit method. Documentation updates.
extract method

espg commented Nov 5, 2014

@michaelaye : Good question. This had fallen off my radar a bit, but I just pushed a fairly large update, the main purpose of which was to fit in with the scikit-learn API; documentation has been increased as well. I believe that fit within the sklearn API was the previous main issue.

The code is pep 8 with the exception of a single long line, which is hard to re-write without losing readability. The algorithm has been tested on points in three dimensions up to order 10^8 with no ill effects (~40 hours on my 4 year old laptop), and memory use is small. I know scikit-learn doesn’t have an OPTICS implementation, so I’d like to see it included and am happy to work with the maintainers if there are any outstanding issues.

While the API is as close as possible to sklearn, there doesn’t seem to be a natural way to ‘rescan’ for a different eps distance. The sklearn methods fit, predict, and transform all expect data as an input, and the whole point of OPTICS is that you don’t need to recompute anything for a lower eps after the first run…so there isn’t really a sklearn-ish way to implement that— I’ve included an ‘extract’ method that takes a new eps distance and nothing else to do this.

I think that there are additional things that could be done to enhance the module, specifically a save and restore function for dealing with large datasets where you want persistence beyond a single python session, and also some additional performance tweaks. That said, this algorithm doesn’t currently exist in sklearn, and it works and matches the sklearn API fairly well. It is also, due to aggressive pruning, several orders of magnitude faster than other implementations that I’ve seen.

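A minimal sketch of the fit-once / extract-many usage described above (a hedged example: the OPTICS class, extract method, and labels_ attribute are named as discussed in this thread, and may differ from the finally merged API):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import OPTICS

centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                  random_state=0)

# Fit once with a generous eps; the cluster ordering is computed here.
clust = OPTICS(eps=3.0, min_samples=10).fit(X)

# Re-extract DBSCAN-style clusterings at smaller eps without recomputing.
for eps in (0.5, 0.3, 0.1):
    clust.extract(eps)
    n_clusters = len(np.unique(clust.labels_[clust.labels_ != -1]))
    print(eps, n_clusters)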

espg commented Nov 5, 2014

@amueller : I hadn’t refreshed the page, so I didn’t see your message till just now. I ran some benchmarks a while back, I’m not sure if those work or if you have a specific methodology that you expect people to follow. Let me know about the documentation as it stands.

As for having other people jump in— this was my first attempt at contributing to any software package, and I get the distinct feeling that I’m pretty green at it. I’d love any help with making this accepted to the master branch— it does no good to anyone just sitting around :/


amueller (Member) commented Nov 5, 2014

@espg For persistence, that is basically automatic with pickling. The estimator should be pickle-able (there is actually a test for that afaik), and if not, that should be fixed.

For the benchmarks: the results are supposed to be comparable to DBSCAN, right? So I think a performance comparison against DBSCAN, showing how it is more scalable, would be great (or showing that with the current dbscan implementation the memory would blow up).

For rescanning: maybe @GaelVaroquaux can say a word about how he solved a similar problem with agglomerative approaches. My first idea would be to have some attribute that stores the whole hierarchical structure which could be accessed by the user. That is obviously not super convenient.

There are also tests missing; for clustering, usually some sanity tests on synthetic data and maybe some that are specific to the algorithm. I can't really say much about that as I am not super familiar with the algorithm. Usually we want near-complete line coverage, though.

For the documentation, we also want narrative documentation (at doc/modules/clustering.rst). That could describe what is similar or different from other clustering approaches in scikit-learn, roughly the intuition of the parameters, the idea of the algorithm, and when it is particularly useful (e.g. this is a memory-limited variant of dbscan that produces a hierarchical clustering, or something).

Maybe also add it to the clustering comparison example.

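As a rough illustration of the pickling point (a sketch, with DBSCAN standing in for any fitted estimator):

import pickle
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, random_state=0)
est = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Persist the fitted estimator to disk...
with open("estimator.pkl", "wb") as f:
    pickle.dump(est, f)

# ...and restore it later without refitting.
with open("estimator.pkl", "rb") as f:
    restored = pickle.load(f)

print((restored.labels_ == est.labels_).all())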

amueller (Member) commented Nov 5, 2014

I know this is a lot we ask for but we have to uphold our standards ;) Let me know if you need more info or any help. I'll try to see if I can review the code in more detail (or maybe someone who is more familiar with the algorithm can).

amueller (Member) commented Nov 5, 2014

It might also be interesting to compare to #3802.

MechCoder (Member) commented Nov 5, 2014

I can also try reviewing, once you feel that it is in a reviewable state.

GaelVaroquaux (Member) commented Nov 5, 2014

> For rescanning: Maybe @GaelVaroquaux can say a word about how he solve a similar problem with agglomerative approaches. My first idea would be to have some attribute that stores the whole hierarchical structure which could be accessed by the user. That is obviously not super convenient.

joblib.Memory (look for it in FeatureAgglomeration).
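A minimal sketch of the joblib.Memory pattern being pointed to (the cached helper and its arguments are illustrative, not the actual OPTICS internals; the joblib API shown is the current one):

from joblib import Memory
from sklearn.neighbors import NearestNeighbors

memory = Memory(location='/tmp/optics_cache', verbose=0)

@memory.cache
def neighborhoods(X, max_eps):
    # Expensive step cached to disk: recomputed only when X or max_eps
    # change, so later 'rescans' at smaller eps can reuse the result.
    nbrs = NearestNeighbors(radius=max_eps).fit(X)
    return nbrs.radius_neighbors(X, return_distance=True)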

plotting example updated, small changes
new plot example that matches the updated API
updated n_cluster attribute with reruns of extract
removed scaling factor on first ‘fit’ run

espg commented Nov 5, 2014

@amueller: For rescanning, I implemented the class so that it (similar to your suggestion) already saves the clustering order and hierarchy as an attribute, but I'll check @GaelVaroquaux's suggestion of using joblib.Memory as a cleaner way to do this.

espg commented Nov 5, 2014

@amueller: Thank you for the directions on where this needs to go. The tests are a bit of a hangup; really I need to write a different example/test that is specific to OPTICS and appropriate for showing multiple hierarchies of clusters extracted at different eps values. Right now, I've just been using the same example that DBSCAN uses; however, the automated unit tests for clustering fail for the following reason:

The OPTICS algorithm struggles with cases where the extracted clustering structure is close to the initial eps value (really, close to eps_max). In other words, running OPTICS(eps=0.3, min_samples=10).fit(X) gives different (and incorrect) clustering compared to:

clust = OPTICS(eps=3.0, min_samples=10)
clust.fit(X)
clust.extract(0.3)

...which matches the results given by DBSCAN.

This is because the cluster seeds, which are expanded, are sensitive to starting location for low eps values. Using an eps value of inf ensures that you always get the same results as DBSCAN, but for large point sets it gives a pretty substantial slowdown; it is worth having the parameter user-specified, but it needs to be larger than the final extraction, somewhere between 5 and 10 times the largest eps value that you plan to cluster at. This could be scaled automatically in the init and fit options, but I think that approach is contrary to the API style for contributing, so I'm guessing that the appropriate use of the parameter needs to be explained in a narrative documentation page?


amueller (Member) commented Nov 5, 2014

Sorry, which automated unit test is that? The one that is failing on travis seems to be a missing labels_ attribute (which is part of the clustering API, apparently).

And yes, discussing the parameter choices in the docs makes sense. Still, the default parameters should give sensible results in so far as that is possible.

jnothman (Member) commented Jan 5, 2015

Hi @espg. I think the main issue with this PR is still coding style, etc. You've chosen to split things up into many functions and classes (and even sub-classed BallTree) when much of it would be easier to read if it were implemented in a single function with a clearly-named helper or two, and then wrapped in a class, just as dbscan/DBSCAN are. See also the alternative implementation in #2043, which has other issues, but whose control structure and data model is much clearer to follow. Also, please avoid CamelCase where not used in class names.

Your code is also not flat because it fails to exploit numpy's vectorised operations, such as here, which not only make the code flatter and easier to read (although it looks a bit less like the equivalent pseudocode) but make it much faster.

> As for having other people jump in— this was my first attempt at contributing to any software package, and I get the distinct feeling that I’m pretty green at it.

The rule for contributing to existing software is to start small! Enhance something that's already there, and once you are confident about your work, the project and how to contribute, and the team receiving your contributions is confident about your work, paths open up to more substantial contributions.

IMO, the version of OPTICS that will eventually, I expect, be merged into scikit-learn will look more like #2043.

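To illustrate the kind of vectorisation being suggested (a toy example, not code from this PR):

import numpy as np

rng = np.random.RandomState(0)
distances = rng.rand(10000)
eps = 0.5

# Loop version: builds a Python list element by element.
within_eps_loop = []
for d in distances:
    within_eps_loop.append(d <= eps)

# Vectorised version: one numpy comparison over the whole array.
within_eps = distances <= eps

assert np.array_equal(np.array(within_eps_loop), within_eps)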

espg commented Jan 5, 2015

Hi @jnothman, thanks for the feedback, I appreciate it. I think that a major difference between 2043 and 1984 is that 1984 implements search pruning, which is (part of) why the code is more difficult to follow, and is why 'balltree' is subclassed. Without pruning searches, the OPTICS algorithms perform, for N data points, N^2 distance lookups. Pruning searches is targeted at the specific, but common, case where you have a dense or semi-dense data set with a mismatch of several orders of magnitude between the extent of the dataset and the size of the objects that are being clustered. By way of example, think of clustering objects of average width 5 in a dataset that spans on the order of 100,000 units per dimension; in these types of cases, there's no need to perform distance lookups between points that are far apart. However, what counts as 'far' changes depending on the density of the dataset, so determining what to prune is done within the OPTICS loop. I don't know of any way to prune without keeping track of both the per-point pruning distance and whether a given point has been processed, which is why the balltree data structure is subclassed in the code. Without pruning, you either pay N^2 memory (like DBSCAN does/did) or N^2 cpu time (like 2043). The other way to solve this problem would be to split the dataset up into overlapping tiles and merge the clustered objects, but the tiling scheme isn't obvious and the merge is nontrivial.

I think that one of the main problems with 1984 at this point is that it doesn't have complete unit coverage, and I haven't had experience with unit tests so I don't really know how to increase the coverage. I can fix camel-case and work on narrative documentation and some of the other issues, but the coverage has been a stumbling point, so if you have any resources to point me toward, I'd be grateful!

I'm not really sure about the diff that you referenced as an example of better flat code: this is a diff from this (1984) pull request, so perhaps you meant to point to a different link, or just that this update was better? In general, I try to call either numpy or scipy for almost all operations, as these functions tend to be vectorized.

> The rule for contributing to existing software is to start small!

Yeah, I know (now) that I'm doing this backwards for a first-time contributor: algorithm first, then modifications, unit tests, documentation... I'm still learning all of it, but definitely in the reverse order; DBSCAN didn't scale for my data, and 2043 hadn't been opened yet, so I started 1984.


espg closed this Jan 5, 2015

espg reopened this Jan 5, 2015

jnothman (Member) commented Jan 5, 2015

> Yeah, I know (now) that I'm doing this backwards for a first time contributor-- algorithm first, then modifications, unit tests, documentation... I'm still learning all of it, but definitely in the reverse order.

It's not really the point; this way everyone (you and other devs) will be too tired of working on a big chunk by the time it's nearly ready to merge. You're much better off having something small that garners confidence for all, learning the ropes, and working your way up.

> implements search pruning, which is (part of) why the code is more difficult to follow, and is why 'balltree' is subclassed

I'm not sure what you mean here by "search pruning". I don't see anything in your code that immediately looks like pruning, nor any reference to that term in the OPTICS paper. But your extensions to BallTree are entirely orthogonal to the function of BallTree, so the use of class inheritance seems ill-motivated.

> I'm not really sure about the diff that you referenced as an example of better flat code

It was an example of something that was written as a Python for loop but could have been written vectorised, so yes, it was from your own PR. (And it is additionally in a function that only serves to perform two could-be-vectorised operations, which really does not justify a separate function.) For comparison, I have suggested elsewhere a succinct and time-efficient way to identify epsilon neighbors and determine core samples and their distances.


jnothman (Member) commented Jan 5, 2015

> it doesn’t have complete unit coverage

If not every line is tested, you need to create data scenarios that will test those untested lines, where possible. Is that the issue you're concerned with?


espg added some commits Jul 5, 2018

DBSCAN / OPTICS invariant test
Restructured documentation. Small unit test fixes. Added test to ensure clustering metrics between OPTICS dbscan_extract and DBSCAN are within 2% of each other.
Update _auto_cluster docstring
renamed reachability_ordering --> ordering for consistency
changes from jnothman
changes unit tests to check for specific error message (instead of 'a' error); minor updates to documentation. This also fixes a bug in the extract_dbscan function whereby some core points were erroneously marked as periphery... this is fixed by reverting to a previous extraction code block that initalizes all points as core and then demotes noise and periphery points during the extract scan. Parameterized unit test.

jnothman (Member) commented Jul 8, 2018

It's good (bittersweet?) to see that writing tests is causing us to catch bugs! :) I'll take a look at the tests themselves.

sklearn/cluster/tests/test_optics.py
+ # calculate dbscan labels and cluster metrics
+ db = DBSCAN(eps=0.3, min_samples=10).fit(X)
+
+ db_h = metrics.homogeneity_score(labels_true, db.labels_)

jnothman (Member) Jul 8, 2018

Wouldn't we be better off comparing db.labels_ and op.labels_ directly, rather than testing that they are equally dissimilar from the ground truth? Probably best to just use one highly interpretable metric, or directly check a contingency_matrix. For contingency_matrix, we'd want the number of nonzero cells in each column or row to be almost zero (or the entropy in each row/column to be almost zero).

I'd also like to see in particular that they generate a near-identical set of outlier points.

espg Jul 11, 2018

Yes, I definitely agree with you. The new unit test checks that OPTICS and DBSCAN agree on all core points, and that the mismatch for outlier point class labels is < 5 points (the test data is 750 points, ~10% of which are classified as either noise or outlier).

jnothman (Member) Jul 11, 2018

Thanks!

contingency_matrix test
New invariant test between optics and dbscan
sklearn/cluster/tests/test_optics.py
- db_V = metrics.v_measure_score(labels_true, db.labels_)
- db_m = metrics.adjusted_mutual_info_score(labels_true, db.labels_)
- db_s = metrics.silhouette_score(X, db.labels_)
+ confusion_m = contingency_matrix(db.labels_, labels_optics)

jnothman (Member) Jul 11, 2018

I'd rather contingency to reduce confusion: a confusion matrix relies on having a fixed identity of categories, while contingency is a more general idea that can account for arbitrary cluster IDs.

espg Jul 11, 2018

ok, fixed.

sklearn/cluster/tests/test_optics.py
- db_s = metrics.silhouette_score(X, db.labels_)
+ confusion_m = contingency_matrix(db.labels_, labels_optics)
+ agree = np.max(confusion_m, axis=1)
+ disagree = np.sum(confusion_m) - np.sum(agree)

jnothman (Member) Jul 11, 2018

The number of agreements can be over-stated using this calculation.

[[5],
 [5],
 [5],
 [5]]

will be regarded as a perfect match to

[[20]]

You can fix this by using the max-sum assignment... or just by checking it is valid for axis=1 as well as axis=0. (Or agree = min((... axis=1), (... axis=0)))

espg Jul 11, 2018

I'm not sure if I understand... np.sum(confusion_m) is just a stand-in for len(labels), which I can use instead if you prefer. I don't know what you mean by max-sum assignment, but axis=0 and axis=1 for the max call do give identical results; taking the min of both is perhaps a bit repetitive, but doesn't hurt, so I've changed it.

sklearn/cluster/tests/test_optics.py
- # compare metrics
- assert_allclose(db_metrics, op_metrics, atol=0.02)
+ # verify label mismatch is < 5 labels
+ assert_less(disagree, 5)

jnothman (Member) Jul 11, 2018

Cool! Btw, these days, with pytest, we'd prefer assert disagree < 5. (I've not mentioned this given that this PR came into our world long before pytest did.)

espg Jul 11, 2018

ok, thanks-- fixed.

espg commented Jul 11, 2018

I think this is looking in pretty good shape... but given that the SciPy sprints start in a few days, are there any omissions or issues that need to be highlighted for people to sprint on?

jnothman (Member) commented Jul 11, 2018

By max-sum assignment, I mean identifying the one-to-one mapping between true and predicted clusters that maximises the sum of their scores (as solved by the Kuhn-Munkres algorithm).
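For reference, a short sketch of that idea using scipy's Hungarian (Kuhn-Munkres) solver on a contingency matrix (illustrative only; the test in this PR ended up using the simpler per-axis max instead):

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix

labels_a = np.array([0, 0, 1, 1, 2, 2, -1])
labels_b = np.array([1, 1, 0, 0, 2, 2, -1])

C = contingency_matrix(labels_a, labels_b)
# linear_sum_assignment minimises cost, so negate to maximise agreement.
row_ind, col_ind = linear_sum_assignment(-C)
agree = C[row_ind, col_ind].sum()
disagree = C.sum() - agree
print(agree, disagree)  # 7 0 for this toy example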

jnothman (Member) commented Jul 11, 2018

So I've tried:

@pytest.mark.parametrize('eps', [0.1, .3, .5])
@pytest.mark.parametrize('min_samples', [1, 10, 20])
def test_dbscan_optics_parity(eps, min_samples):
    # Test that OPTICS clustering labels are < 5 difference of DBSCAN

    centers = [[1, 1], [-1, -1], [1, -1]]
    X, labels_true = make_blobs(n_samples=750, centers=centers,
                                cluster_std=0.4, random_state=0)

    # calculate optics with dbscan extract at 0.3 epsilon
    op = OPTICS(min_samples=min_samples).fit(X)
    core_optics, labels_optics = op.extract_dbscan(eps)

    # calculate dbscan labels
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)

    contingency = contingency_matrix(db.labels_, labels_optics)
    # agreement is at best the sum of the max value on either axis
    agree = min(np.sum(np.max(contingency, axis=0)),
                np.sum(np.max(contingency, axis=1)))
    disagree = X.shape[0] - agree

    # verify core_labels match
    assert_array_equal(core_optics, db.core_sample_indices_)

    # verify label mismatch is < 5 labels
    assert disagree < 5

It fails for:

  • min_samples=10, eps=0.1 with disagree == 38.
  • min_samples=20, eps=0.3 with disagree == 9.

Otherwise it passes. Is this what we should expect?


jnothman (Member) commented Jul 11, 2018

(Note that I changed the calculation of agree. I think it's correct...)

espg commented Jul 12, 2018

The mismatch should probably be checked as a function of the number of periphery points. The core points are always the same, but depending on how much noise is present, we would expect the disagreement between what is/isn't noise to stay proportionally the same based on cluster size, or on how well the algorithm is extracting clusters in general. Testing that there is no more than 6% disagreement between the non-core points works, but I may see if there's a way to write the test so that it passes at 5%, since it feels slightly less arbitrary.


extract dbscan updates
Vectorized extract dbscan function to see if would improve performance of periphery point labeling; it did not, but the function is vectorized. Changed unit test with min_samples=1 to min_samples=3, as at min_samples=1 the test isn't meaningful (no noise is possible, all points are marked core). Parameterized parity test. Changed parity test to assert ~5% or better mismatch, instead of 5 points (this is needed for larger clusters, as the starting point mismatch effect scales with cluster size).

espg commented Jul 12, 2018

I looked at the original OPTICS paper to see if they claim exact correspondence with DBSCAN; here's the relevant passage:

> The clustering created from a cluster-ordered data set by ExtractDBSCAN-Clustering is nearly indistinguishable from a clustering created by DBSCAN. Only some border objects may be missed when extracted by the algorithm ExtractDBSCAN-Clustering if they were processed by the algorithm OPTICS before a core object of the corresponding cluster had been found. However, the fraction of such border objects is so small that we can omit a postprocessing (i.e. reassign those objects to a cluster) without much loss of information.

I think the take away is that core points will be exactly identical (and the unit tests check for this), but the periphery point classification will be "nearly indistinguishable". In practice from the unit testing, "nearly indistinguishable" seems to be the same labeling within ~5% of the portion of the data that isn't processed as core.


@jnothman

Thanks for that!

I'm a little concerned that we've still missed something, but we've certainly improved a lot over the last months. I'm going to be bold and approve.

I think @agramfort's approval here is a bit stale. It would be good for someone to have another quick look over tests and/or implementation.

Thanks @espg!!

+ assert_array_equal(core_optics, db.core_sample_indices_)
+
+ non_core_count = len(labels_optics) - len(core_optics)
+ percent_mismatch = np.round((disagree - 1) / non_core_count, 2)

jnothman (Member) Jul 15, 2018

I prefer fraction when we're not multiplying by 100.

jnothman (Member) Jul 15, 2018

Why do we need to round?

GaelVaroquaux (Member) commented Jul 16, 2018

I'll try to review and hopefully merge this guy.

GaelVaroquaux (Member) commented Jul 16, 2018

I am pushing cosmetic changes to this branch, rather than asking for them (with the goal of merging this ASAP).

GaelVaroquaux (Member) commented Jul 16, 2018

I am not succeeding in pushing to your fork (maybe a permission thing), so I've created a branch on my fork: https://github.com/GaelVaroquaux/scikit-learn/tree/pr_1984

I might merge from it.

+ nbrs.fit(X)
+ self.core_distances_[:] = nbrs.kneighbors(X,
+ self.min_samples)[0][:, -1]
+ self._nbrs = nbrs

GaelVaroquaux (Member) Jul 16, 2018

I do not think that it is necessary to store this object. It is used only in the fit. Storing it adds weight to the memory and to the pickle.
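A sketch of the change being suggested (hypothetical code shaped after the diff above, not the exact fix that was applied): keep the NearestNeighbors object local so it is not carried around in memory or in the pickle.

from sklearn.neighbors import NearestNeighbors

def compute_core_distances(X, min_samples):
    # Hypothetical helper: the NearestNeighbors object is discarded after
    # use instead of being stored as an attribute on the estimator.
    nbrs = NearestNeighbors(n_neighbors=min_samples).fit(X)
    distances, _ = nbrs.kneighbors(X, n_neighbors=min_samples)
    return distances[:, -1]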

GaelVaroquaux referenced this pull request Jul 16, 2018

Merged

OPTICS #11547

GaelVaroquaux (Member) commented Jul 16, 2018

Merged via #11547. Hurray!! Thank you

jnothman (Member) commented Jul 16, 2018

Wow. That's been a long time coming!!

LucasHMS commented Jul 16, 2018

Is it planned to go out with the next release or is it already available?



espg commented Jul 16, 2018

Hooray! Thanks @jnothman and @GaelVaroquaux for reviewing to get this over the finish line :-)


lmcinnes (Contributor) commented Jul 16, 2018

Congrats @espg! Glad this finally made it!

espg commented Jul 16, 2018

It was my first PR, definitely learned a lot about the process...


czhao028 commented Jul 19, 2018

In the meantime, how can we install GitHub's version of sklearn so we can access OPTICS? @GaelVaroquaux I tried cloning from GitHub & running setup.py (still could not import it), as well as running pip3 install git+https://github.com/scikit-learn/scikit-learn.git; is there something I'm missing? How should we be able to install it?

Edit: nevermind, just got it hahaha. from sklearn.cluster.optics_ import OPTICS
