Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] OPTICS fit uses the selected `extract_method` parameter #12087

Merged

Conversation

9 participants
@adrinjalali
Copy link
Member

commented Sep 15, 2018

Fixes #12036, related to #12044
Fixes #12375
Closes #12049
Closes #11857

Add extract_method parameter to OPTICS init, and make fit use the appropriate method.

@jnothman
Copy link
Member

left a comment

Tests, please

Show resolved Hide resolved sklearn/cluster/optics_.py Outdated
Show resolved Hide resolved sklearn/cluster/optics_.py Outdated

@adrinjalali adrinjalali force-pushed the adrinjalali:optics/choose_extractor branch from 1e8e963 to 58913d8 Sep 16, 2018

@adrinjalali

This comment has been minimized.

Copy link
Member Author

commented Sep 16, 2018

I've added two tests, but we could add more edge case tests if you wish.

I feel like, either there should be a public extract_ method for all extract methods, or all should be private. I'm happy to have them all public with good docstrings to better explain the parameters and their relation to the extract methods.

@jnothman
Copy link
Member

left a comment

sqlnk is not referenced by that name anywhere in the documentation.

I suppose this is a reasonable design. Perhaps let's make the extract_dbscan method into a public function extract_optics_dbscan and similar for sqlnk, xi, ...? That way they are simplified to not need the if blah is None: blah = self.blah logic and to explicitly account for ordering and reachability as the sufficient statistics.

@adrinjalali

This comment has been minimized.

Copy link
Member Author

commented Sep 17, 2018

I guess we could have all those functions public as you say, and let the user either pass whatever's in the instance or explicitly set the values. My intention with the construct (i.e. having the default as None and not our default values) was to make it convenient for users to set only one or few parameters and the rest taken from the instance, which could be set beforehand and something other than the default values. But I guess it's not necessary.

@adrinjalali

This comment has been minimized.

Copy link
Member Author

commented Sep 17, 2018

Oh, extract_dbscan actually checks weather the given eps has a reasonable compared to max_eps, which we won't be giving to extract_optics_dbscan if made a separate public method. Should we then keep them in the class?

@adrinjalali adrinjalali changed the title [WIP] OPTICS fit uses the selected `extract_method` parameter [MRG] OPTICS fit uses the selected `extract_method` parameter Sep 20, 2018

Show resolved Hide resolved sklearn/cluster/optics_.py Outdated
@jnothman

This comment has been minimized.

Copy link
Member

commented Sep 22, 2018

@adrinjalali

This comment has been minimized.

Copy link
Member Author

commented Oct 8, 2018

@jnothman, gentle review ping :)

@jnothman
Copy link
Member

left a comment

I hope to look at this properly soon...

extract_method : string, optional (default='sqlnk')
The extraction method used to extract clusters using the calculated
reachability and ordering. Possible values are "dbscan"
and "sqlnk".

This comment has been minimized.

Copy link
@jnothman

jnothman Oct 15, 2018

Member

sqlnk is not mentioned in the repo. The user has little way to identify its meaning.

Show resolved Hide resolved sklearn/cluster/optics_.py Outdated
@adrinjalali

This comment has been minimized.

Copy link
Member Author

commented Oct 16, 2018

We'd need to improve the user guides once we're done with the implementations anyway, but for now I added a reference to the class's description for the method.

@qinhanmin2014

This comment has been minimized.

Copy link
Member

commented Oct 24, 2018

FYI @adrinjalali All the known bugs of CD/RD/order/predecessor have been fixed. You can try Joel's solution here. Meanwhile, we need to re-evaluate the OPTICS-related issues you raised and figure out whether these bugs still exist.

@adrinjalali adrinjalali changed the title [MRG] OPTICS fit uses the selected `extract_method` parameter [MRG+1] OPTICS fit uses the selected `extract_method` parameter Mar 1, 2019

@jnothman

This comment has been minimized.

Copy link
Member

commented Mar 1, 2019

@adrinjalali

This comment has been minimized.

Copy link
Member Author

commented Mar 1, 2019

We can tackle that on the separate yet_to_come warm_start PR.

@GaelVaroquaux
Copy link
Member

left a comment

LGTM from a cursory look: I haven't check the mathematical validity of all lines.

@@ -116,8 +116,6 @@
n_clusters=params['n_clusters'], eigen_solver='arpack',
affinity="nearest_neighbors")
dbscan = cluster.DBSCAN(eps=params['eps'])
optics = cluster.OPTICS(min_samples=30, maxima_ratio=.8,
rejection_ratio=.4)

This comment has been minimized.

Copy link
@GaelVaroquaux

GaelVaroquaux Mar 1, 2019

Member

Why is optics being removed from this example?

I find this example very useful to have a big picture view of clustering in scikit-learn.

This comment has been minimized.

Copy link
@adrinjalali

adrinjalali Mar 2, 2019

Author Member

The example will be added when we add the xi method (the method from the original paper) in #12077. This moves the OPTICS to an intermediate stage for us to focus on the other PR, and then better examples, and putting OPTICS back in examples.


if self.cluster_method not in ['dbscan']:
raise ValueError("cluster_method should be one of"
" 'dbscan', but is %s" %

This comment has been minimized.

Copy link
@banilo

banilo Mar 2, 2019

Contributor

Sounds a little bit like multiple-personality disorder. If there is only one option, then the phrasing my be disturbing to newcomers.

This comment has been minimized.

Copy link
@adrinjalali

adrinjalali Mar 2, 2019

Author Member

This PR is not putting OPTICS in a stable and usable state, it's applying most decisions we made durint the sprint, documented also here: #12044 (comment)

return_distance=False)[0]

# Getting indices of neighbors that have not been processed
unproc = np.compress((~np.take(processed, indices)).ravel(),

This comment has been minimized.

Copy link
@banilo

banilo Mar 2, 2019

Contributor

Pretty crowded set of operations. Consider splitting into 2 lines for better readability

This comment has been minimized.

Copy link
@adrinjalali

adrinjalali Mar 2, 2019

Author Member

This PR is not touching those lines. These parts are from other PRs and I don't want to touch them here. They're only in the changelog cause they've been moved around.

@GaelVaroquaux

This comment has been minimized.

Copy link
Member

commented Mar 2, 2019

@adrinjalali

This comment has been minimized.

Copy link
Member Author

commented Mar 3, 2019

@GaelVaroquaux BTW, the math behind calculating the reachability and core distances were checked thoroughly before in other PRs, this PR is just moving them around. So I'd say that part is safe.

It'd be nice to either merge this or have more reviews on it.

@GaelVaroquaux

This comment has been minimized.

Copy link
Member

commented Mar 3, 2019

Point taken.

It's hard to have big picture on this PR. This next one will help.

Merging!

@GaelVaroquaux GaelVaroquaux merged commit 40441e8 into scikit-learn:master Mar 3, 2019

16 checks passed

LGTM analysis: C/C++ No code changes detected
Details
LGTM analysis: JavaScript No code changes detected
Details
LGTM analysis: Python No new or fixed alerts
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: doc Your tests passed on CircleCI!
Details
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
Details
ci/circleci: lint Your tests passed on CircleCI!
Details
codecov/patch 100% of diff hit (target 92.51%)
Details
codecov/project Absolute coverage decreased by -0.15% but relative coverage increased by +7.48% compared to 0682d92
Details
scikit-learn.scikit-learn Build #20190301.70 succeeded
Details
scikit-learn.scikit-learn (Linux py35_conda_openblas) Linux py35_conda_openblas succeeded
Details
scikit-learn.scikit-learn (Linux py35_np_atlas) Linux py35_np_atlas succeeded
Details
scikit-learn.scikit-learn (Linux pylatest_conda) Linux pylatest_conda succeeded
Details
scikit-learn.scikit-learn (Windows py35_32) Windows py35_32 succeeded
Details
scikit-learn.scikit-learn (Windows py37_64) Windows py37_64 succeeded
Details
scikit-learn.scikit-learn (macOS pylatest_conda) macOS pylatest_conda succeeded
Details

Sprint Paris 2019 automation moved this from Needs review to Done Mar 3, 2019

@adrinjalali adrinjalali deleted the adrinjalali:optics/choose_extractor branch Mar 4, 2019

@xluccianox

This comment has been minimized.

Copy link

commented Mar 15, 2019

Hi, I'm not sure this is the correct place to ask this, but I guess that this PR has something to do with my issues.
I'm using this OPTICS implementation and I noticed a considerable change in the results after this PR was merged to master. Any summary on what changed? and why the results with same input data and clustering parameters differ so much?

Thanks,
Luciano.

@adrinjalali

This comment has been minimized.

Copy link
Member Author

commented Mar 15, 2019

@xluccianox yes, after this PR, OPTICS on master is almost merely a fancy DBSCAN. #12077 is the one implementing the rest of OPTICS!

@jnothman

This comment has been minimized.

Copy link
Member

commented Mar 16, 2019

@qinhanmin2014

This comment has been minimized.

Copy link
Member

commented Mar 18, 2019

Distance matrices are not supported.

ping @adrinjalali we support precomputed distance matrix? Am I wrong?

@adrinjalali

This comment has been minimized.

Copy link
Member Author

commented Mar 18, 2019

Yes @qinhanmin2014 , we added it here: #12028

@qinhanmin2014

This comment has been minimized.

Copy link
Member

commented Mar 18, 2019

Yes @qinhanmin2014 , we added it here: #12028

So the doc is wrong? @adrinjalali

@adrinjalali

This comment has been minimized.

Copy link
Member Author

commented Mar 18, 2019

Yes it is, seems like we forgot to add 'precomputed' to the docstring in that PR.

@vegasmagician

This comment has been minimized.

Copy link

commented Apr 9, 2019

Hi everybody, first thanks for your work ! I have a little question about this topic.

If I take into account the fact that extract_method='xi' is not yet implemented and so I can't use min_cluster_size parameter (that's right ?). I don't understand why when I run :

OPTICS(eps=5, max_eps=10, min_samples=5, algorithm='auto',metric='precomputed').fit(dist)

there are clusters with less than min_samples points. If you have less than min_samples in a cluster, why this points are not considered as noise points ?

@adrinjalali

This comment has been minimized.

Copy link
Member Author

commented Apr 9, 2019

The min_samples in this context is just to find core samples, and doesn't have too much to do with the cluster size really. Here there's a simple threshold on the reachability plot is applied (5 in the case of your example), and then the clusters are reported.

@vegasmagician

This comment has been minimized.

Copy link

commented Apr 10, 2019

@adrinjalali I think so that I don't understand well the role of min_samples. When you say "just to find core samples" these samples are not precisely the base of our clusters ? So if I have to define a min_size for my clusters (as in DB_scan when I define Min_Pts, I have not cluster with less than Min_Pts) it is possible ?

@kno10

This comment has been minimized.

Copy link
Contributor

commented Apr 12, 2019

@vegasmagician it has the same meaning as minPts in DBSCAN:

The number of samples in a neighborhood for a point to be considered as a core point.
But given large enough max_eps every point eventually becomes a core point. So instead of a "core point property", you have a "core distance" for every point, the distance when it becomes a core point.

DBSCAN can also produce clusters with fewer than minPts points, but that is a rare corner case.

In OPTICS this is common, because of the hierarchical nature. A cluster can contain two subclusters and a single additional point connecting the two clusters. A naive approach using labels will see this as a cluster with a single point only.

@vegasmagician

This comment has been minimized.

Copy link

commented Apr 15, 2019

Thanks for your answer, I understand better.
So using your module, how can I show the this hierarchy of clusters ?

@adrinjalali

This comment has been minimized.

Copy link
Member Author

commented Apr 15, 2019

@vegasmagician you would be able to get the hierarchy after #12077 is merged.

xhlulu added a commit to xhlulu/scikit-learn that referenced this pull request Apr 28, 2019

[MRG+1] OPTICS fit uses the selected `extract_method` parameter (scik…
…it-learn#12087)

* add extract method parameter

* change fancy to sqlnk

* test for invalid extract_method value

* test easy dbscan

* pep8

* add extract_sqlnk

* mention which parameter is used for which extact method

* add tests for extract methods with no params given

* Add the reference for the SQLNK method.

* make extract_optics_dbscan and extract_optics_sqlnk public

* fix references, sync docstrings

* fix more morege conflicts

* reorganize the code and get it closer to what we want

* pep8

* remove sqlnk, make compute_optics_graph public

* pep8

* remove unused reference

* make tests pass

* fix docstrings

* fix removal of removed parameter

* add back min_cluster size, xi needs it

* pep8

* fix docstring issue

* address Hanmin's comments

* remove core_sample_indices_

* fix comment

* pep8

* apply comments, remove example

* add public functions to classes.rst and __init__

* address Joel's comments

xhlulu added a commit to xhlulu/scikit-learn that referenced this pull request Apr 28, 2019

xhlulu added a commit to xhlulu/scikit-learn that referenced this pull request Apr 28, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.