Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX Elkan k-means does not stop if tol=0 #16075

Merged
merged 3 commits into from
Feb 5, 2020

Conversation

kno10
Copy link
Contributor

@kno10 kno10 commented Jan 9, 2020

K-means convergence is still different between "full" k-means and "elkan" k-means.

The fix in #15831 is incomplete. Compare:

center_shift_total = squared_norm(centers_old - centers)
if center_shift_total <= tol:

and

center_shift = np.sqrt(np.sum((centers_ - new_centers) ** 2, axis=1))

center_shift_total = np.sum(center_shift ** 2)
if center_shift_total < tol:

it should be noted that in the second version, it would likely make sense to first use squared_norm, and then separate the square root, rather than taking the square of the rooted values below.
But in this PR I'm just pointing to a single character. One tests <= and the other tests <.
With tol=0 this means that "full" may stop when the clusters stop moving, while with elkan it never stops then, but always takes all iterations.

I do not think this is the best stopping criterion. If a numerical issue arises in computing the center shifts, this may cause the algorithm to always take the maximum number of iterations. The classic termination criterion for k-means is different: stop if no object is relabeled. That is more reliable.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Please add a what's new entry.

@kno10 kno10 force-pushed the patch-kmeans-convergence branch from 981e933 to 7c22fdf Compare January 14, 2020 12:20
@kno10
Copy link
Contributor Author

kno10 commented Jan 14, 2020

Extended a unit test that failed before my change (300 vs. 7 iterations; simply by adding tol=0 to the existing test case), and a changelog entry.

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add an empty commit to re-trigger CI (git commit --allow-empty)?
The existing failing build fails to load so I don't know if it could be related. Otherwise LGTM.

@kno10
Copy link
Contributor Author

kno10 commented Feb 5, 2020

Re-triggered CI, checks passed.

@rth
Copy link
Member

rth commented Feb 5, 2020

Thanks @kno10 !

@rth rth changed the title FIX: Elkan k-means does not stop if tol=0 ("full" does) FIX Elkan k-means does not stop if tol=0 Feb 5, 2020
@rth rth merged commit 91261c2 into scikit-learn:master Feb 5, 2020
@kno10 kno10 deleted the patch-kmeans-convergence branch February 6, 2020 00:03
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 22, 2020
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants