Can't run k-means until full convergence #3374

Closed
EdwardRaff opened this Issue Jul 14, 2014 · 6 comments

@EdwardRaff

The k-means implementation in scikit-learn does not expose a setting that runs k-means until full convergence (i.e., until no cluster assignments change). Setting tol=0.0 should produce this behavior, but instead it results in what is effectively an infinite loop (MNIST with k=10; I killed the run after more than an hour).
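(Editor's note: for concreteness, a minimal sketch of the reported setup. The data loader and exact parameter values below are assumptions for illustration; the report only states MNIST, k=10, and tol=0.0.)

```python
# Hypothetical reproduction sketch, not taken from the original report.
# fetch_openml("mnist_784") is used here as a stand-in loader for MNIST.
from sklearn.datasets import fetch_openml
from sklearn.cluster import KMeans

X, _ = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# tol=0.0 is intended to mean "iterate until nothing changes"; per the report,
# the center-shift stopping test never fires at zero tolerance, so the run
# only ends when max_iter is exhausted.
km = KMeans(n_clusters=10, tol=0.0, max_iter=1_000_000_000)
km.fit(X)
```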

@agramfort


Owner

agramfort commented Jul 14, 2014

You cannot guarantee that k-means stops changing assignments; it can oscillate in some cases. Maybe that is what is happening here.

@EdwardRaff


EdwardRaff commented Jul 14, 2014

k-means will eventually reach a point where the cluster means no longer move because no data points change assignment; this is covered in most intro courses (see example here).

Yes, the solution can sometimes oscillate, appearing to converge with fewer data points changing ownership and then increasing again, but it will eventually reach a point where no points change cluster ownership, which means the means can no longer change either.

Further, if Lloyd's k-means clustering did not guarantee convergence to a solution, this result lower bounding the needed number of iterations would not be possible. Note that it also provides the trivially deducible upper bound.

So yes, k-means does reach a solution (though there is no guarantee on the quality of that solution). The scikit-learn implementation should allow running until full convergence.

From a quick glance at the code, it doesn't seem to track the number of data points that changed cluster, so it wouldn't even know when to stop at that point.

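(Editor's note: as an illustration of the stopping rule being argued for above, a minimal sketch of Lloyd's iteration that runs until no point changes cluster. This is not the scikit-learn implementation; the function and variable names are made up for the example.)

```python
import numpy as np

def lloyd_kmeans(X, k, rng=None, max_iter=10_000):
    """Plain Lloyd's k-means that stops when no assignment changes (sketch)."""
    rng = np.random.default_rng(0) if rng is None else rng
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):  # max_iter is only a safety net here
        # Assignment step: nearest center for every point.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # full convergence: no point changed ownership
        labels = new_labels
        # Update step: each center becomes the mean of its assigned points
        # (empty clusters keep their previous center in this sketch).
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, labels
```

Because the assignment and update steps never increase the within-cluster sum of squares (with consistent tie-breaking), and there are only finitely many ways to assign n points to k clusters, the `np.array_equal` break above is guaranteed to trigger after finitely many iterations.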

@agramfort


Owner

agramfort commented Jul 14, 2014

If you feel limited by the current implementation, you're very welcome to send us a PR. The standard implementation (non mini-batch) is pretty straightforward.

@EdwardRaff


EdwardRaff commented Jul 14, 2014

I have other items to work on; I'm filing this as a bug report since it seems like an obvious bug to me. I have no hard dependence on scikit-learn, so I'll simply stop using it for this work.

If you have no intention of fixing it, it would be good to at least note in the documentation that the code cannot be run until absolute convergence.


@GaelVaroquaux


Owner

GaelVaroquaux commented Jul 15, 2014

> I have other items to work on,

So have we.

> I'm filing this as a bug report since it seems like an obvious bug to me.

It's not that clear. K-means can oscillate between different solutions (it's mentioned, for instance, in the following lecture by A. Ng: http://cs229.stanford.edu/notes/cs229-notes7a.pdf).

> I have no hard dependence on scikit-learn, so I'll simply stop using it for this work.

Until whichever package you are using annoys you for another good or bad reason, in which case you are going to rant on their bug report forms and move to another one, or back to us?

> If you have no intention of fixing it, it would be good to at least note in the documentation that the code cannot be run until absolute convergence.

Well, I am still not convinced that we (you and us) have fully understood what's going on with your problem. First data point: the implementation has a max_iter, so an infinite loop is impossible. You should use the verbose option (and, if needed, add print statements in the code) to understand what is going on.

I am closing this bug report, as it is not useful: I am not sure what we can do about it in its current form.
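(Editor's note: a minimal sketch of the inspection being suggested here, using the public API. The exact verbose output format and the presence of the n_iter_ attribute vary by scikit-learn version, so both are assumptions.)

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=10000, centers=10, random_state=0)

# verbose=1 prints per-iteration progress, which shows whether the centers
# are still moving; max_iter bounds the run regardless of tol.
km = KMeans(n_clusters=10, tol=0.0, max_iter=300, verbose=1, random_state=0)
km.fit(X)

# How many iterations the best run actually took (attribute availability
# depends on the installed version).
print(getattr(km, "n_iter_", "n_iter_ not available in this version"))
```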

@EdwardRaff


EdwardRaff commented Jul 15, 2014

> It's not that clear.

I gave you a paper that links to several upper bounds on the number of iterations until convergence. That seems very clear to me; the Ng note simply says "in theory".

> Until whichever package you are using annoys you for another good or bad reason, in which case you are going to rant on their bug report forms and move to another one, or back to us?

I use the right tool for the job. It's not scikit-learn in this case because it can't get me the exact solution.

I don't know what you see as ranting. I explained the bug. agramfort wasn't sure, so I provided evidence that the behavior should be expected, and since it doesn't exist, it should be considered a bug.

agramfort asked me to fix it; I indicated that I'm too busy to do so and that there is no urgency for me, and I offered an alternative option if there is no one to fix the bug at this time.

> Well, I am still not convinced that we (you and us) have fully understood what's going on with your problem. First data point: the implementation has a max_iter, so an infinite loop is impossible. You should use the verbose option (and, if needed, add print statements in the code) to understand what is going on.

I set max_iter to a billion to avoid that. From a quick glance, the code does not track the number of data points that change ownership, which precludes it from stopping correctly. The tolerance check is < instead of <=, so with a tolerance of zero it is forced to iterate until max_iter (which might as well be an infinite loop). Using a tolerance of one ulp above zero won't always work due to potential floating-point issues (which also means relying on an exactly zero movement when the cluster means are re-averaged might be problematic too, depending on implementation details). So this leaves the scikit-learn k-means unable to converge to an exact solution.

I don't understand what isn't clear.

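(Editor's note: a small sketch of the two stopping tests being contrasted above, plus the assignment-count test; this is illustrative code for the discussion, not the actual scikit-learn source.)

```python
def should_stop(center_shift_total, tol, n_labels_changed):
    """Compare the stopping rules discussed in this thread (sketch)."""
    strict = center_shift_total < tol        # never True when shift == tol == 0.0
    inclusive = center_shift_total <= tol    # True once the centers stop moving
    by_labels = n_labels_changed == 0        # literal "full convergence" test
    return strict, inclusive, by_labels

# At exact convergence the total center shift is 0.0 and no labels change:
print(should_stop(0.0, 0.0, 0))  # -> (False, True, True); only the strict test keeps looping
```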

GaelVaroquaux added a commit that referenced this issue Jul 15, 2014

GaelVaroquaux added a commit that referenced this issue Jul 16, 2014
