Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding K-Medoids clustering algorithm #5085

Closed
wants to merge 31 commits into from

Conversation

terkkila
Copy link

@terkkila terkkila commented Aug 4, 2015

Added K-Medoids clustering algorithm

  • Introducing new KMedoids class, which is derived from the BaseEstimator
  • Unit test coverage 91%
  • Unit tests passing: nosetests -v sklearn/cluster/tests/test_k_medoids.py
  • Code is documented
  • Example script: python examples/cluster/plot_kmedoids_digits.py

…ded tests for KMedoids::fit() and KMedoids::fit_predict()
@terkkila
Copy link
Author

terkkila commented Aug 4, 2015

Additional notes:

  • PEP8 passing without complaints: pep8 sklearn/cluster/k_medoids_.py
  • Pyflakes passing without complaints: pyflakes sklearn/cluster/k_medoids_.py

@amueller
Copy link
Member

amueller commented Aug 4, 2015

Travis is unhappy.

@terkkila
Copy link
Author

terkkila commented Aug 6, 2015

Travis is happy now. Is there something I can do to the AppVeyor error?

@TomDLT
Copy link
Member

TomDLT commented Aug 6, 2015

Nice work !
Appveyor is down


# Check n_clusters
if (n_clusters is None or
n_clusters <= 0 or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trailing whitespace lines 52 and 53

return labels

def inertia(self, X):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docstring

@TomDLT
Copy link
Member

TomDLT commented Aug 18, 2016

This work looks good to me, thanks @terkkila !
Address my comments and I give my +1.


Parameters
----------
n_clusters : int, optional, default: 8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it the same default as kmeans?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is.

@agramfort
Copy link
Member

all CIs are not happy

before considering estimators for inclusion, we ask for arguments why this does better/more than what we already have. I see that you motivate KMedoids by the fact that any metric can be employed. ok but in terms of use case do you see any compelling argument? Said differently for what problem should I use KMedoids vs KMeans and why?

@amueller
Copy link
Member

amueller commented Aug 26, 2016

I think the main advantage is that you can specify custom metrics for which computing the mean is not easy / feasible / known. An example illustrating that would be nice. Maybe L1? (though I have no idea how that looks actually) Doing robustness to outliers would also be interesting (also see http://stackoverflow.com/questions/21619794/what-makes-the-distance-measure-in-k-medoid-better-than-k-means)

@terkkila
Copy link
Author

Hi,

The most compelling use cases that I have faced is when working with time
series data, say demand of a products or consumption of electricity of
households, where comparing one time series along the original time axis is
not optimal due to shifts by some offsets. In such cases, Dynamic Time
Warping can be used, but is just one among many options, that can be
plugged into K-Medoids. I have worked with this DTW & KMedoids combination
using different datasets, and it seems to be robust and well-performing
choice.

Cheers,
Timo

On Fri, Aug 26, 2016 at 10:54 PM, Andreas Mueller notifications@github.com
wrote:

I think the main advantage is that you can specify custom metrics for
which computing the mean is not easy / feasible / known. An example
illustrating that would be nice.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#5085 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABKkLWM7hudHY3j7lezKjgkALTL5h1y0ks5qj0RZgaJpZM4FljPk
.

@amueller
Copy link
Member

there's no dtw in scipy, right?

@bmcfee you're gonna contribute upstream, IIRC ;)

It sounds like an awesome example, but without the metric being in scipy, no dice.
EMD would also be cool, but same problem.

@bmcfee
Copy link
Contributor

bmcfee commented Aug 26, 2016

We recently merged a numba-accelerated dtw in librosa. No plans yet to push
upstream, but it might be worth looking into. I'm not all that keen on
reimplementing it in cython though.

On Fri, Aug 26, 2016, 16:39 Andreas Mueller notifications@github.com
wrote:

there's no dtw in scipy, right?

@bmcfee https://github.com/bmcfee you're gonna contribute upstream,
IIRC ;)

It sounds like an awesome example, but without the metric being in scipy,
no dice.
EMD would also be cool, but same problem.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#5085 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABIqjGWIXTyXeCCAjm2uCmaERg_U8S0eks5qj07fgaJpZM4FljPk
.

@jnothman
Copy link
Member

I guess the questions are:

  1. Is dtw able to be written simple enough that it can be included directly
    in an example
  2. Is there another distance measure used frequently that could motivate
    KMedoids ?
  3. Are we able to show the benefit using an external library to get this
    merged, but merely mention it in documentation without motivating example
    (and/or point to an external library where the example is shown)?

On 27 August 2016 at 07:33, Brian McFee notifications@github.com wrote:

We recently merged a numba-accelerated dtw in librosa. No plans yet to push
upstream, but it might be worth looking into. I'm not all that keen on
reimplementing it in cython though.

On Fri, Aug 26, 2016, 16:39 Andreas Mueller notifications@github.com
wrote:

there's no dtw in scipy, right?

@bmcfee https://github.com/bmcfee you're gonna contribute upstream,
IIRC ;)

It sounds like an awesome example, but without the metric being in scipy,
no dice.
EMD would also be cool, but same problem.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#5085
issuecomment-242845204>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/
ABIqjGWIXTyXeCCAjm2uCmaERg_U8S0eks5qj07fgaJpZM4FljPk>

.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#5085 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz61EXkIf7JUQJ82rIgGdLVkpfXciEks5qj1uvgaJpZM4FljPk
.

@Kornel
Copy link

Kornel commented Oct 15, 2016

Hi, what's the status on this pull request? As far as I can follow this discussion there is no agreement weather this should be merged or not? In my opinion, it would be worth to include this to scikit. Recently I tried to find clusters within a space where the distance was calculated in a unusual way and this copy pasted code from the PR helped me to gain some insights into my data.

@amueller
Copy link
Member

I think we do want this. It would have been nice to have a compelling example, but I don't want this to be the blocker, either. The tests are failing, though, and the PR needs additional reviews.

@Kornel
Copy link

Kornel commented Oct 17, 2016

@terkkila
I did a rebase on top of scikit-learn/master and created a PR to @terkkila's repository here: terkkila#1 , it was an easy exercise as there was only a small conflict in the doc. Please take a look if it makes sense.

@amueller
However, I can't compile the rebased version, I get a lot of errors like these:

ERROR: Failure: ImportError (dlopen(/Users/kornel/workspace/github/scikit-learn/sklearn/metrics/pairwise_fast.so, 2): Library not loaded: libmkl_intel_lp64.dylib
  Referenced from: /Users/kornel.kielczewski/workspace/github/scikit-learn/sklearn/metrics/pairwise_fast.so
  Reason: image not found)

I tried setting

LD_LIBRARY_PATH=/Users/kornel/anaconda/envs/pymc1/lib make

as this is where libmkl_intel_lp64.dylib is located but that did not change anything

@amueller
Copy link
Member

@Kornel sorry, I'm not sure why that is. Possibly issues with which numpy you're using? what does ldd say?

@Kornel
Copy link

Kornel commented Oct 18, 2016

@amueller an update to mlk fixed this issue. Now make succeeds!

I've fixed the tests and did a rebase and created a new pull request here: #7694

WDYT?

@terkkila
Copy link
Author

Thanks a lot Kornel for taking this forward, much appreciated!

Cheers,
Timo

On Tue, Oct 18, 2016 at 11:51 AM, Kornel Kiełczewski <
notifications@github.com> wrote:

@amueller https://github.com/amueller an update to mlk fixed this
issue. I've fixed the tests and did a rebase and created a new pull request
here: #7694 #7694

WDYT?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#5085 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABKkLR6aAM2_qj4inkGOxEBx-Brdnn4jks5q1IiPgaJpZM4FljPk
.

@qinhanmin2014
Copy link
Member

Work continued in #11099

@qinhanmin2014
Copy link
Member

Thanks @terkkila for your great work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.