[MRG] Adding K-Medoids clustering algorithm revival #7694

Closed
wants to merge 55 commits into master from Kornel:tk-kmedoids

Conversation

@Kornel (Contributor) commented Oct 18, 2016

This is a revival of #5085
I've rebased the code and fixed the tests.

TomDLT changed the title from "Tk kmedoids" to "Adding K-Medoids clustering algorithm revival" on Oct 21, 2016

@TomDLT (Member) commented Oct 21, 2016

Thanks for taking over this work, it is much appreciated.
Please address our comments in #5085; note that some tests are still failing.

raghavrv added this to the 0.19 milestone on Oct 23, 2016

@ekerazha commented Oct 24, 2016

@Kornel
Can a MiniBatchKMedoids class be added (similarly to the MiniBatchKMeans class)?

P.S.
The MSMBuilder package contains a mini-batch k-medoids implementation: https://github.com/msmbuilder/msmbuilder/blob/master/msmbuilder/cluster/minibatchkmedoids.py

@Kornel (Contributor) commented Oct 24, 2016

@ekerazha I'll take a look at this.

@TomDLT No problem. One question: how can I test this locally? Should a simple make (which runs clean and the tests, as far as I can tell) suffice?

@TomDLT (Member) commented Oct 24, 2016

Yes, running make (or make tests, or more specifically nosetests sklearn/cluster/tests/test_k_medoids.py) will run the tests locally with your configuration, which should help you fix most of the test errors.
You can alternatively push your commits to GitHub to run the tests on several configurations (Windows/Linux, Python2/Python3, 64bits/32bits, ...), and look at the results on Travis/AppVeyor logs. But the tests may be queued for a while.

@Kornel (Contributor) commented Oct 24, 2016

@TomDLT OK, thanks. Looking at the AppVeyor results, it looks like the build is failing with Python 3. I'll take a closer look at that. Otherwise, "It works on my machine" (tm) :)

Kornel force-pushed the Kornel:tk-kmedoids branch 2 times, most recently from 8393bb6 to 567a3e2, on Oct 24, 2016

raghavrv changed the title from "Adding K-Medoids clustering algorithm revival" to "[MRG] Adding K-Medoids clustering algorithm revival" on Oct 24, 2016

raghavrv changed the title from "[MRG] Adding K-Medoids clustering algorithm revival" to "Adding K-Medoids clustering algorithm revival" on Oct 24, 2016

@raghavrv (Member) commented Oct 24, 2016

Once the tests pass and all the comments are addressed, please prefix the PR title with [MRG] and ping us again :)

raghavrv removed the Needs Review label on Oct 24, 2016

Kornel force-pushed the Kornel:tk-kmedoids branch from 567a3e2 to b357b78 on Oct 24, 2016

@jnothman (Member) commented Apr 15, 2018

@Kornel would you like someone to help take this on?

@jnothman (Member) commented Apr 15, 2018 (comment minimized)

@zdog234 commented Apr 15, 2018

Also, I don't know if this has been explored for KMedoids, but it could be useful to implement something like the kmeans++ algorithm for initial medoid selection, if it hasn't been done already.
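
For illustration, a minimal sketch of what such a k-means++-style medoid initialization could look like, assuming a precomputed square distance matrix D; the helper name kmedoidspp_init and its signature are hypothetical and not part of this PR:

import numpy as np

def kmedoidspp_init(D, n_clusters, random_state=0):
    # Pick initial medoid indices from a precomputed distance matrix D,
    # k-means++ style: each new medoid is sampled with probability
    # proportional to the squared distance to the nearest medoid so far.
    rng = np.random.RandomState(random_state)
    n_samples = D.shape[0]
    medoids = [rng.randint(n_samples)]        # first medoid chosen uniformly
    for _ in range(1, n_clusters):
        closest = D[:, medoids].min(axis=1)   # distance to nearest chosen medoid
        probs = closest ** 2
        probs /= probs.sum()
        medoids.append(rng.choice(n_samples, p=probs))
    return np.asarray(medoids)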

@Kornel (Contributor) commented Apr 15, 2018 (comment minimized)

@zdog234 commented Apr 16, 2018

When is the next release? I'd love to help get this over the finish line for selfish reasons.

@zdog234 commented Apr 16, 2018

Also, maybe this should be a separate pull request after this is released, but I've found it really useful computationally to be able to specify a degeneracy for each sample (imagine having 100,000 samples but only 1,000 unique ones).

@jnothman (Member) commented Apr 16, 2018

Sorry, terminology gap: what do you mean by a degeneracy, @zdog234?

@jnothman (Member) commented Apr 16, 2018

Next release? I'm guessing around July.

@zdog234 commented Apr 17, 2018

In this situation, by degeneracy, I mean having duplicate samples. I’m pretty sure KMedoids can be done with just the unique values and the count of each unique value, potentially cutting the size of the distance matrix by multiple orders of magnitude.
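
As a rough sketch of that idea (illustrative only, not code from this PR): collapse the duplicate rows with np.unique and use the counts as weights when scoring a candidate medoid, so the distance matrix only covers the unique rows.

import numpy as np
from sklearn.metrics import pairwise_distances

# Toy data with many duplicate rows.
X = np.array([[0.0], [0.0], [1.0], [1.0], [1.0], [5.0]])

# Unique rows and how often each one occurs.
X_unique, counts = np.unique(X, axis=0, return_counts=True)

# (n_unique, n_unique) distance matrix instead of (n_samples, n_samples).
D = pairwise_distances(X_unique)

# Weighted cost of choosing row i as the (single) medoid: each unique point j
# contributes counts[j] * D[i, j], exactly as if the duplicates were present.
costs = D @ counts
best_medoid = X_unique[np.argmin(costs)]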

@Kornel (Contributor) commented Apr 17, 2018 (comment minimized)

@jnothman (Member) commented Apr 17, 2018 (comment minimized)

@zdog234 commented Apr 18, 2018

@Kornel I used KMedoids recently to cluster a single string-valued column. There were ~30,000 samples but only 1000 unique values. In this case, I want to be able to pass in just the unique values, along with counts for each of them, to save on processing time/memory.
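
A hedged sketch of that use case: cluster only the unique values, then broadcast the resulting labels back to all samples via the inverse index from np.unique. The labels_unique array below is a placeholder for whatever a weighted KMedoids fit on the unique values would return.

import numpy as np

# Many raw string samples, few unique values.
raw = np.array(["red", "blue", "red", "green", "blue", "red"])

values, inverse, counts = np.unique(raw, return_inverse=True, return_counts=True)
# values = ['blue', 'green', 'red'], counts = [2, 1, 3]

# Placeholder: labels that a (weighted) KMedoids fit on `values` might produce.
labels_unique = np.array([0, 1, 0])

# Map the per-unique-value labels back onto every original sample.
labels_full = labels_unique[inverse]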

@zdog234 commented Apr 18, 2018

Also, I'm going through all of the review comments and Kornel's most recent commit to find change requests that haven't been made yet, making the changes when they're simple and documenting them when they're not. I should have that process done this weekend.

@jnothman (Member) commented Apr 18, 2018 (comment minimized)

@zdog234 commented Apr 20, 2018

Some changes are needed for "precomputed" to work. Would it be okay for cluster_centers_ to be a 1d array of the medoid indices if metric == 'precomputed'?

@jnothman (Member) commented Apr 21, 2018 (comment minimized)

@zdog234 commented Apr 21, 2018

Currently, the medoid indices are stored as a local variable rather than an attribute. I could totally add an attribute for it though. I like that idea.

@zdog234 commented Apr 21, 2018

One more question: could transform(X) and predict(X) take indices for X when metric == "precomputed", or should they raise an Exception when metric == "precomputed"?

@zdog234 commented Apr 21, 2018

Although, now that I'm working through making transform(X) work for "precomputed", we'd need to save the distance matrix in an instance variable, which could hold on to a very large amount of memory.

@jnothman (Member) commented Apr 22, 2018 (comment minimized)

@zdog234 commented Apr 22, 2018

I'm a little confused. Should X be the distance matrix again? I suppose that would work. Predict wouldn't need the distance matrix again though. I'll push what I think you mean and you can let me know if I've got the wrong idea.

@jnothman (Member) commented Apr 22, 2018

I mean

D_train = pairwise_distances(X_train)
D_test = pairwise_distances(X_test, X_train)

This is what we work with in NearestNeighbors given metric='precomputed'.

Predict relies on checking which is the nearest centroid, yeah? It doesn't need the complete distance matrix in general, but surely it needs the relative distances to each centroid.

Users who want to be efficient will have to be clever to pass:

D_test = np.zeros((len(X_test), len(X_train))) + np.inf
D_test[:, kmedoids.centroid_idx_] = pairwise_distances(X_test, X_train[kmedoids.centroid_idx_])

So we can document in predict/transform that only the columns corresponding to centroid_idx_ need to be finite when metric='precomputed'.
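
Putting that together, a hedged sketch of the end-to-end flow as discussed above. KMedoids, its metric='precomputed' support, and the centroid_idx_ attribute are the names proposed in this thread rather than a released scikit-learn API, and X_train / X_test are assumed to exist:

import numpy as np
from sklearn.metrics import pairwise_distances

D_train = pairwise_distances(X_train)
kmedoids = KMedoids(n_clusters=3, metric='precomputed').fit(D_train)

# Straightforward: pass the full test-vs-train distance matrix.
D_test = pairwise_distances(X_test, X_train)
labels = kmedoids.predict(D_test)

# Memory-conscious variant: only the columns for the fitted medoids need to be
# finite, so every other entry can stay at np.inf.
D_test = np.zeros((len(X_test), len(X_train))) + np.inf
D_test[:, kmedoids.centroid_idx_] = pairwise_distances(
    X_test, X_train[kmedoids.centroid_idx_])
labels = kmedoids.predict(D_test)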

@zdog234 commented Apr 23, 2018

Ah that works. I wasn't sure if there would be an established process for this situation, so I'm glad to learn that there's already a thought-out solution.

@zdog234 commented May 11, 2018

I've almost got all of the remaining requested changes in there, but now I've got a very noobie question: Now that I have my fork of Kornel's branch, how do I add to this PR? Should I make a pull request on @Kornel's fork?

@jnothman (Member) commented May 13, 2018 (comment minimized)

@qinhanmin2014 (Member) commented Jul 8, 2018

Work continues in #11099. Moving the 0.20 milestone to #11099.

@qinhanmin2014 (Member) commented Jul 8, 2018

Thanks @Kornel for your great work.
