[MRG] Implementation of OPTICS #1984
Conversation
espg
added some commits
May 21, 2013
jaquesgrobler
reviewed
May 22, 2013
+centers = [[1, 1], [-1, -1], [1, -1]]
+X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4)
+
+##############################################################################
jaquesgrobler
May 22, 2013
Member
I tend to prefer not having these separation lines.. A blank line is fine :)
NelleV
May 22, 2013
Member
Your code isn't PEP 8 compliant, and the documentation doesn't follow numpydoc's conventions.
NelleV
reviewed
May 22, 2013
+
+## Main Class ##
+
+class setOfObjects(BallTree):
jaquesgrobler
reviewed
May 22, 2013
| +""" | ||
| +from sklearn.datasets.samples_generator import make_blobs | ||
| +from sklearn.preprocessing import StandardScaler | ||
| +from sklearn.cluster import optics as op |
jaquesgrobler
May 22, 2013
Member
Although 'import ... as ...' is normally useful, it makes the example a little less readable here. Perhaps using optics. instead of op. would be better for the sake of the example.
jaquesgrobler
reviewed
May 22, 2013
+
+# Run the top-level optics algorithm
+
+op.prep_optics(testtree,30,10)
jaquesgrobler
May 22, 2013
Member
Thanks for the PR. I'm not that familiar with OPTICS myself, but I had a look through.
Before getting too in depth: coding-style-wise, you should make your code PEP 8 compliant. A few minor things I noticed at first glance are also mentioned in the inline comments.
The other thing I'll mention is that scikit-learn generally follows this API style.
It'd be good if you can make your API match that of the scikit, e.g. making prep_optics(...), build_optics(...), ExtractDBSCAN(...) etc. match the 'fit(..)' etc. API. Other helper functions you can keep as is.
Hope that helps for now. Thanks again
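For reference, the estimator shape being asked for looks roughly like this (a minimal sketch; the class name and method body are placeholders, not the PR's actual implementation):

    from sklearn.base import BaseEstimator, ClusterMixin

    class OPTICS(BaseEstimator, ClusterMixin):
        def __init__(self, eps=0.5, min_samples=5):
            # hyperparameters are stored verbatim in __init__...
            self.eps = eps
            self.min_samples = min_samples

        def fit(self, X, y=None):
            # ...and all computation happens in fit, which sets
            # trailing-underscore attributes such as labels_
            self.labels_ = None  # placeholder for the cluster labels
            return self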
espg
added some commits
May 22, 2013
mblondel
Jan 8, 2014
Member
@espg Since this PR has stalled, I think it would be a good idea to extract the source into a github repo or gist and add it to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets
espg
Jan 8, 2014
@mblondel Do you mind if I keep this PR open until the end of the month to make some changes that bring it more in line with the scikit-learn API?
I was hiking for 2 months on the Pacific Crest Trail and had shoulder surgery after I got back, so the PR had fallen off my radar-- but I can fix the remaining issues in the next 2 weeks if there's still interest in having an OPTICS implementation in scikit-learn.
kevindavenport
commented
Mar 28, 2014
This is great work!
arnabkd
commented
Apr 19, 2014
@espg : I am very interested in an OPTICS implementation in scikit-learn :)
michaelaye
commented
Nov 4, 2014
What's the status of this?
amueller
Nov 5, 2014
Member
Stalled ;) Feel free to pick it up, I'd say. Needs tests, documentation, benchmarks, I'd think.
kevindavenport
Nov 5, 2014
I want to jump in and take this over if it's stalled. I can start this weekend :)
amueller
Nov 5, 2014
Member
I'm pretty sure no-one would mind if it provides a good improvement over DBSCAN.
espg
Nov 5, 2014
@michaelaye : Good question. This had fallen off my radar a bit, but I just pushed a fairly large update, the main purpose of which was to fit in with the scikit-learn API; documentation has been expanded as well. I believe that fit with the sklearn API was the main outstanding issue.
The code is PEP 8 compliant with the exception of a single long line, which is hard to rewrite without losing readability. The algorithm has been tested on points in three dimensions up to order 10^8 with no ill effects (~40 hours on my 4-year-old laptop), and memory use is small. I know scikit-learn doesn’t have an OPTICS implementation, so I’d like to see it included and am happy to work with the maintainers if there are any outstanding issues.
While the API is as close as possible to sklearn, there doesn’t seem to be a natural way to ‘rescan’ for a different eps distance. The sklearn methods fit, predict, and transform all expect data as an input, and the whole point of OPTICS is that you don’t need to recompute anything for a lower eps after the first run…so there isn’t really a sklearn-ish way to implement that— I’ve included an ‘extract’ method that takes a new eps distance and nothing else to do this.
I think that there are additional things that could be done to enhance the module, specifically a save and restore function for dealing with large datasets where you want persistence beyond a single python session, and also some additional performance tweaks. That said, this algorithm doesn’t currently exist in sklearn, and it works and matches the sklearn API fairly well. It is also, due to aggressive pruning, several orders of magnitude faster than other implementations that I’ve seen.
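A minimal sketch of the fit-once, extract-many workflow described above (method names as used in this PR at the time, and X assumed to be an (n_samples, n_features) array):

    from sklearn.cluster import OPTICS  # as proposed in this PR

    # fit once with a generous eps; neighborhoods are computed only here
    clust = OPTICS(eps=3.0, min_samples=10)
    clust.fit(X)

    # re-extract DBSCAN-equivalent clusterings at smaller eps values
    # without recomputing anything
    clust.extract(0.3)
    clust.extract(0.15)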
espg
Nov 5, 2014
@amueller : I hadn’t refreshed the page, so I didn’t see your message till just now. I ran some benchmarks a while back; I’m not sure if those work or if you have a specific methodology that you expect people to follow. Let me know about the documentation as it stands.
As for having other people jump in— this was my first attempt at contributing to any software package, and I get the distinct feeling that I’m pretty green at it. I’d love any help with getting this accepted to the master branch— it does no good to anyone just sitting around :/
amueller
Nov 5, 2014
Member
@espg For persistence, that is basically automatic with pickling. The estimator should be pickle-able (there is actually a test for that afaik), and if not that should be fixed.
For the benchmarks: The results are supposed to be comparable to DBSCAN, right? So I think a performance comparison against DBSCAN, showing how it is more scalable, would be great (or showing that with the current dbscan impl the memory would blow up).
For rescanning: Maybe @GaelVaroquaux can say a word about how he solved a similar problem with agglomerative approaches. My first idea would be to have some attribute that stores the whole hierarchical structure which could be accessed by the user. That is obviously not super convenient.
There are also tests missing; for clustering, usually some sanity tests on synthetic data and maybe some that are specific to the algorithm. I can't really say much about that as I am not super familiar with the algorithm. Usually we want near-complete line coverage, though.
For the documentation, we also want narrative documentation (at doc/modules/clustering.rst). That could describe what is similar or different from other clustering approaches in scikit-learn, roughly the intuition of the parameters, the idea of the algorithm, and when it is particularly useful (like: this is a memory-limited variant of dbscan that produces a hierarchical clustering, or something).
Maybe also add it to the clustering comparison example.
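A minimal sketch of the pickle round-trip that the common tests check (DBSCAN stands in here, since OPTICS was not merged at this point):

    import pickle
    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.RandomState(0).rand(100, 2)
    est = DBSCAN(eps=0.3, min_samples=10).fit(X)

    # a fitted estimator should survive a pickle round-trip unchanged
    restored = pickle.loads(pickle.dumps(est))
    np.testing.assert_array_equal(est.labels_, restored.labels_)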
amueller
Nov 5, 2014
Member
I know this is a lot we ask for but we have to uphold our standards ;) Let me know if you need more info or any help. I'll try to see if I can review the code in more detail (or maybe someone who is more familiar with the algorithm can).
It might also be interesting to compare to #3802.
MechCoder
Nov 5, 2014
Member
I can also try reviewing, once you feel that it is in a reviewable state.
GaelVaroquaux
Nov 5, 2014
Member
For rescanning: Maybe @GaelVaroquaux can say a word about how he solved a similar problem with agglomerative approaches. My first idea would be to have some attribute that stores the whole hierarchical structure which could be accessed by the user. That is obviously not super convenient.
joblib.Memory (look for it in FeatureAgglomeration).
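A minimal sketch of that pattern (FeatureAgglomeration accepts a memory argument; the cache directory here is made up, and the joblib import path has varied across versions):

    from tempfile import mkdtemp
    from joblib import Memory
    from sklearn.cluster import FeatureAgglomeration

    # cache the expensive merge-tree computation on disk; refitting with
    # a different n_clusters reuses the cached tree instead of recomputing
    memory = Memory(mkdtemp(), verbose=0)
    agglo = FeatureAgglomeration(n_clusters=5, memory=memory)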
espg
Nov 5, 2014
@amueller : For rescanning-- the class already saves the clustering order and hierarchy as an attribute (similar to your suggestion), but I'll check @GaelVaroquaux's suggestion of using joblib.Memory as a cleaner way to do this.
espg
Nov 5, 2014
@amueller : Thank you for the directions on where this needs to go. The tests are a bit of a hangup; really I need to write a different example/test that is specific to OPTICS and appropriate for showing multiple hierarchies of clusters extracted at different eps values. Right now, I've just been using the same example that DBSCAN uses; however, the automated unit test for clustering fails for the following reason:
The OPTICS algorithm struggles with cases where the extracted clustering structure is close to the initial eps value (really, close to eps_max). In other words, running OPTICS(eps=0.3, min_samples=10).fit(X) gives different (and incorrect) clustering compared to:

    clust = OPTICS(eps=3.0, min_samples=10)
    clust.fit(X)
    clust.extract(0.3)

...which matches the results given by DBSCAN.
This is because the cluster seeds, which are expanded, are sensitive to starting location for low eps values. Using an eps value of inf ensures that you always get the same results as DBSCAN, but for large point sets gives a pretty substantial slowdown; it is worth having the parameter user-specified, but it needs to be larger than the final extraction-- somewhere between 5 and 10 times the largest eps value that you plan to cluster at. This could be scaled automatically in the init and fit options, but I think that approach is contrary to the API style for contributing, so I'm guessing that the appropriate use of the parameter needs to be explained in a narrative documentation page?
amueller
Nov 5, 2014
Member
Sorry, which automated unit test is that?
The one that is failing on travis seems to be a missing labels_ attribute (which is part of the clustering api apparently).
And yes, discussing the parameter choices in the docs makes sense. Still, the default parameters should give sensible results in so far as that is possible.
espg
added some commits
Nov 5, 2014
jnothman
Jan 5, 2015
Member
Hi @espg. I think the main issue with this PR is still coding style, etc. You've chosen to split things up into many functions and classes (and even sub-classed BallTree) when much of it would be easier to read if it were implemented in a single function with a clearly-named helper or two, and then wrapped in a class, just as dbscan/DBSCAN are. See also the alternative implementation in #2043, which has other issues, but whose control structure and data model is much clearer to follow. Also, please avoid CamelCase where not used in class names.
Your code is also not flat because it fails to exploit numpy's vectorised operations, such as here, which not only make the code flatter and easier to read (although it looks a bit less like the equivalent pseudocode) but make it much faster.
As for having other people jump in— this was my first attempt at contributing to any software package, and I get the distinct feeling that I’m pretty green at it.
The rule for contributing to existing software is to start small! Enhance something that's already there, and once you are confident about your work, the project and how to contribute, and the team receiving your contributions is confident about your work, paths open up to more substantial contributions.
IMO, the version of OPTICS that will eventually, I expect, be merged into scikit-learn will look more like #2043.
espg
Jan 5, 2015
Hi @jnothman— thanks for the feedback, I appreciate it. I think that a major difference between 2043 and 1984 is that 1984 implements search pruning, which is (part of) why the code is more difficult to follow, and is why ‘balltree’ is subclassed. Without pruning searches, the OPTICS algorithms perform, for N data points, N^2 distance lookups. Pruning searches is targeted at the specific, but common, case where you have a dense or semi-dense dataset with a mismatch of several orders of magnitude between the extent of the dataset and the size of the objects being clustered. By way of example, think of clustering objects of average width 5 in a dataset that spans on the order of 100,000 units per dimension; in these types of cases, there’s no need to perform distance lookups between points that are far apart—however, what counts as ‘far’ changes depending on the density of the dataset, so determining what to prune is done within the OPTICS loop. I don’t know of any way to prune without keeping track of both per-point pruning distance and whether a given point has been processed, which is why the balltree data structure is subclassed in the code. Without pruning, you either pay N^2 memory (like DBSCAN does/did) or N^2 cpu time (like 2043)…the other way to solve this problem would be to split the dataset up into overlapping tiles and merge the clustered objects, but the tiling scheme isn’t obvious and the merge is nontrivial.
I think that one of the main problems with 1984 at this point is that it doesn’t have complete unit coverage, and I haven’t had experience with unit tests so I don’t really know how to increase the coverage. I can fix camel-case and work on narrative documentation and some of the other issues, but the coverage has been a stumbling point, so if you have any resources to point me toward, I’d be grateful!
I’m not really sure about the diff that you referenced as an example of better flat code—this is a diff from this (1984) pull request, so perhaps you meant to point to a different link; or just that this update was better? In general, I try to call either numpy or scipy for almost all operations, as these functions tend to be vectorized.
The rule for contributing to existing software is to start small!
Yeah, I know (now) that I'm doing this backwards for a first-time contributor-- algorithm first, then modifications, unit tests, documentation... I'm still learning all of it, but definitely in the reverse order; DBSCAN didn't scale for my data, and 2043 hadn't been opened yet, so I started 1984.
espg
closed this
Jan 5, 2015
espg
reopened this
Jan 5, 2015
jnothman
Jan 5, 2015
Member
Yeah, I know (now) that I'm doing this backwards for a first time contributor-- algorithm first, then modifications, unit tests, documentation... I'm still learning all of it, but definitely in the reverse order.
It's not really the point; this way everyone (you and other devs) will be too tired of working on a big chunk by the time it's nearly ready to merge. You're much better off having something small that garners confidence for all, learn the ropes, and work your way up.
implements search pruning, which is (part of) why the code is more difficult to follow, and is why ‘balltree’ is subclassed
I'm not sure what you mean here by "search pruning". I don't see anything in your code that immediately looks like pruning, nor any reference to that term in the OPTICS paper. But your extensions to BallTree are entirely orthogonal to the function of BallTree, so the use of class inheritance seems ill-motivated.
I'm not really sure about the diff that you referenced as an example of better flat code
It was an example of something that was written as a Python for loop but could have been written vectorised, so yes, it was from your own PR. (And it is additionally in a function that only serves to perform two could-be-vectorised operations, which really does not justify a separate function.) For comparison, I have suggested elsewhere a succinct and time-efficient way to identify epsilon neighbors and determine core samples and their distances.
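For illustration, a minimal sketch of that kind of vectorised computation (made-up data and parameters, not the linked suggestion itself):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    X = np.random.RandomState(0).rand(100, 2)
    eps, min_samples = 0.2, 5

    nbrs = NearestNeighbors().fit(X)
    # core distance: distance to the min_samples-th nearest neighbour,
    # computed for all points in one vectorised call
    core_distances = nbrs.kneighbors(X, n_neighbors=min_samples)[0][:, -1]
    # core samples are the points whose core distance is within eps
    core_samples = np.flatnonzero(core_distances <= eps)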
jnothman
Jan 5, 2015
Member
it doesn’t have complete unit coverage
If not every line is tested, you need to create data scenarios that will test those untested lines, where possible. Is that the issue you're concerned with?
jnothman
Jul 3, 2018
Member
It doesn't look like we're going to release before SciPy sprints, so we'll probably expect many merges from then. But @espg seems to have it covered.
espg
added some commits
Jul 5, 2018
jnothman
Jul 8, 2018
Member
It's good (bittersweet?) to see that writing tests is causing us to catch bugs! :) I'll take a look at the tests themselves.
+    # calculate dbscan labels and cluster metrics
+    db = DBSCAN(eps=0.3, min_samples=10).fit(X)
+
+    db_h = metrics.homogeneity_score(labels_true, db.labels_)
jnothman
Jul 8, 2018
Member
Wouldn't we be better off comparing db.labels_ and op.labels_ directly, rather than testing that they are equally dissimilar from the ground truth? Probably best to just use one highly interpretable metric, or directly check a contingency_matrix. For contingency_matrix, we'd want each row and column to have (almost) a single nonzero cell (or the entropy in each row/column to be almost zero).
I'd also like to see in particular that they generate a near-identical set of outlier points.
espg
Jul 11, 2018
Yes, I definitely agree with you. The new unit test checks that OPTICS and DBSCAN agree on all core points, and that the mismatch for outlier point class labels is < 5 points (the test data is 750 points, ~10% of which are classified as either noise or outlier).
-    db_V = metrics.v_measure_score(labels_true, db.labels_)
-    db_m = metrics.adjusted_mutual_info_score(labels_true, db.labels_)
-    db_s = metrics.silhouette_score(X, db.labels_)
+    confusion_m = contingency_matrix(db.labels_, labels_optics)
jnothman
Jul 11, 2018
Member
I'd rather 'contingency' to reduce confusion: a confusion matrix relies on having a fixed identity of categories, while contingency is a more general idea that can account for arbitrary cluster IDs.
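A minimal sketch of the distinction (the labelings here are made up):

    from sklearn.metrics.cluster import contingency_matrix

    # identical partitions under swapped cluster IDs: the contingency
    # matrix exposes the one-to-one correspondence regardless of the IDs
    contingency_matrix([0, 0, 1, 1], [1, 1, 0, 0])
    # -> array([[0, 2],
    #           [2, 0]])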
-    db_s = metrics.silhouette_score(X, db.labels_)
+    confusion_m = contingency_matrix(db.labels_, labels_optics)
+    agree = np.max(confusion_m, axis=1)
+    disagree = np.sum(confusion_m) - np.sum(agree)
jnothman
Jul 11, 2018
Member
The number of agreements can be over-stated using this calculation:

    [[5],
     [5],
     [5],
     [5]]

will be regarded as a perfect match to

    [[20]]

You can fix this by using the max-sum assignment... or just by checking it is valid for axis=1 as well as axis=0. (Or agree = min((... axis=1), (... axis=0)))
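A quick numeric check of that failure mode (a minimal sketch; confusion_m follows the variable name in the diff above):

    import numpy as np

    # 4 DBSCAN clusters of 5 points each, all merged into 1 OPTICS cluster
    confusion_m = np.array([[5], [5], [5], [5]])

    # axis=1 alone: every row's max is 5, so all 20 points "agree"
    np.sum(np.max(confusion_m, axis=1))          # 20 -> disagree == 0

    # taking the min over both axes exposes the mismatch
    min(np.sum(np.max(confusion_m, axis=0)),
        np.sum(np.max(confusion_m, axis=1)))     # 5 -> disagree == 15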
espg
Jul 11, 2018
I'm not sure I understand... np.sum(confusion_m) is just a stand-in for len(labels), which I can use instead if you prefer. I don't know what you mean by max-sum assignment, but axis=0 and axis=1 for the max call do give identical results; taking the min of both is perhaps a bit repetitive, but doesn't hurt, so I've changed it.
-    # compare metrics
-    assert_allclose(db_metrics, op_metrics, atol=0.02)
+    # verify label mismatch is < 5 labels
+    assert_less(disagree, 5)
jnothman
Jul 11, 2018
Member
Cool! Btw, these days, with pytest, we'd prefer assert disagree < 5. (I've not mentioned this given that this PR came into our world long before pytest did.)
espg
Jul 11, 2018
I think this is looking in pretty good shape... but given that the SciPy sprints start in a few days, are there any omissions or issues that need to be highlighted for people to sprint on?
jnothman
Jul 11, 2018
Member
By max-sum assignment, I mean identifying the one-to-one mapping between true and predicted clusters that maximises the sum of their scores (as solved by the Kuhn-Munkres algorithm).
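A minimal sketch of that assignment using SciPy's Hungarian-algorithm solver (the contingency matrix here is made up):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    cont = np.array([[5, 0, 1],
                     [2, 4, 0]])
    # negate to turn the cost-minimising solver into a score maximiser
    row_ind, col_ind = linear_sum_assignment(-cont)
    agree = cont[row_ind, col_ind].sum()   # 9: best one-to-one matching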
jnothman
Jul 11, 2018
Member
So I've tried:

    @pytest.mark.parametrize('eps', [0.1, .3, .5])
    @pytest.mark.parametrize('min_samples', [1, 10, 20])
    def test_dbscan_optics_parity(eps, min_samples):
        # Test that OPTICS clustering labels are < 5 difference of DBSCAN
        centers = [[1, 1], [-1, -1], [1, -1]]
        X, labels_true = make_blobs(n_samples=750, centers=centers,
                                    cluster_std=0.4, random_state=0)
        # calculate optics with dbscan extract at 0.3 epsilon
        op = OPTICS(min_samples=min_samples).fit(X)
        core_optics, labels_optics = op.extract_dbscan(eps)
        # calculate dbscan labels
        db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
        contingency = contingency_matrix(db.labels_, labels_optics)
        # agreement is at best the sum of the max value on either axis
        agree = min(np.sum(np.max(contingency, axis=0)),
                    np.sum(np.max(contingency, axis=1)))
        disagree = X.shape[0] - agree
        # verify core_labels match
        assert_array_equal(core_optics, db.core_sample_indices_)
        # verify label mismatch is < 5 labels
        assert disagree < 5

It fails for:

- min_samples=10, eps=0.1 with disagree == 38.
- min_samples=20, eps=0.3 with disagree == 9.
Otherwise it passes. Is this what we should expect?
jnothman
Jul 11, 2018
Member
(Note that I changed the calculation of agree. I think it's correct...)
espg
Jul 12, 2018
The mismatch should probably be checked as a function of the number of periphery points. The core points are always the same, but depending on how much noise is present, we would expect the disagreement over what is/isn't noise to stay proportionally the same based on cluster size, or on how well the algorithm is extracting clusters in general. Testing that there is no more than 6% disagreement between the non-core points works, but I may see if there's a way to write the test so that it passes at 5%, since that feels slightly less arbitrary.
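A minimal sketch of that proportional check (variable names follow the test shown earlier in the thread):

    # measure disagreement against the periphery (non-core) points only;
    # core assignments are asserted to match exactly elsewhere
    non_core_count = len(labels_optics) - len(core_optics)
    assert disagree / non_core_count <= 0.05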
jnothman
Jul 12, 2018
Member
Well, we can explicitly check for how many points are not reachable from core points, etc. Or should we be using a count of dbscan.labels_ == -1 as a basis for assessing discrepancy?
espg
Jul 12, 2018
I looked at the original OPTICS paper to see if they claim exact correspondence with DBSCAN; here's the relevant passage:
The clustering created from a cluster-ordered data set by ExtractDBSCAN-Clustering is nearly indistinguishable from a clustering created by DBSCAN. Only some border objects may be missed when extracted by the algorithm ExtractDBSCAN-Clustering if they were processed by the algorithm OPTICS before a core object of the corresponding cluster had been found. However, the fraction of such border objects is so small that we can omit a postprocessing (i.e. reassign those objects to a cluster) without much loss of information.
I think the takeaway is that core points will be exactly identical (and the unit tests check for this), but the periphery point classification will be "nearly indistinguishable". In practice, from the unit testing, "nearly indistinguishable" seems to mean the same labeling to within ~5% of the portion of the data that isn't processed as core.
jnothman
approved these changes
Jul 15, 2018
Thanks for that!
I'm a little concerned that we've still missed something, but we've certainly improved a lot over the last months. I'm going to be bold and approve.
I think @agramfort's approval here is a bit stale. It would be good for someone to have another quick look over tests and/or implementation.
Thanks @espg!!
+    assert_array_equal(core_optics, db.core_sample_indices_)
+
+    non_core_count = len(labels_optics) - len(core_optics)
+    percent_mismatch = np.round((disagree - 1) / non_core_count, 2)
I'll try to review and hopefully merge this guy.
GaelVaroquaux
Jul 16, 2018
Member
I am pushing cosmetic changes to this branch, rather than asking for them (with the goal of merging this ASAP).
GaelVaroquaux
Jul 16, 2018
Member
I haven't managed to push to your fork (maybe a permission thing), so I've created a branch on my fork: https://github.com/GaelVaroquaux/scikit-learn/tree/pr_1984
I might merge from it.
+        nbrs.fit(X)
+        self.core_distances_[:] = nbrs.kneighbors(X,
+                                                   self.min_samples)[0][:, -1]
+        self._nbrs = nbrs
GaelVaroquaux
Jul 16, 2018
Member
I do not think that it is necessary to store this object. It is used only in the fit. Storing it adds weight to the memory and to the pickle.
Merged via #11547. Hurray!! Thank you
GaelVaroquaux
closed this
Jul 16, 2018
Wow. That's been a long time coming!!
LucasHMS
commented
Jul 16, 2018
Is it planned to go out with the next release, or is it already available?
jnothman
Jul 16, 2018
Member
Already available in master, and will be in a versioned release in the coming weeks.
GaelVaroquaux
Jul 16, 2018
Member
Is it planned to go out with the next release or is it already available?
It's available in master. It will be in the next release.
espg
Jul 16, 2018
Hooray! Thanks @jnothman and @GaelVaroquaux for reviewing to get this over the finish line :-)
Thanks a lot for sticking around!!! This is an awesome PR!
Congrats @espg! Glad this finally made it!
espg
commented
Jul 16, 2018
It was my first PR, definitely learned a lot about the process...
GaelVaroquaux
Jul 16, 2018
Member
And you stuck around. Congratulations. We know it's not easy, and we are grateful.
So are the users!
amueller
referenced this pull request
Jul 16, 2018
Merged
[MRG] import optics function in cluster.__init__ #11567
jorisvandenbossche
referenced this pull request
Jul 16, 2018
Open
TST: optics function is not tested #11568
czhao028
Jul 19, 2018
In the meantime, how can we install GitHub's version of sklearn so we can access OPTICS? @GaelVaroquaux I tried cloning from GitHub & running setup.py (still could not import it), as well as running pip3 install git+https://github.com/scikit-learn/scikit-learn.git; is there something I'm missing? How should we be able to install it?
Edit: never mind, just got it hahaha. from sklearn.cluster.optics_ import OPTICS
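For anyone else landing here, a minimal sketch of usage against the just-merged master (import path and extract_dbscan signature as shown earlier in this thread; the module path may change before the release):

    from sklearn.cluster.optics_ import OPTICS
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=750, centers=[[1, 1], [-1, -1], [1, -1]],
                      cluster_std=0.4, random_state=0)

    clust = OPTICS(min_samples=10).fit(X)
    # extract a DBSCAN-equivalent clustering at eps=0.3 without refitting
    core, labels = clust.extract_dbscan(0.3)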
espg commented May 21, 2013
Equivalent results to DBSCAN, but allows execution on arbitrarily large datasets. After initial construction, allows multiple 'scans' to quickly extract DBSCAN clusters at variable epsilon distances.
The algorithm is tested and validated, but not yet optimized. Tested on 70K points (successful); currently testing on 2 x 10^9 LiDAR points (pending).
First commit; would appreciate any advice on changes to help the code conform to the scikit-learn API.
Example usage is in the examples folder.