MiniBatchKmeans crashes #2611

Closed
douwekiela opened this Issue Nov 24, 2013 · 4 comments


The MiniBatchKMeans implementation in sklearn/cluster/k_means_.py crashes ungracefully on line 860 with the following traceback:

Init 1/3 with method: k-means++
/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py:1146: RuntimeWarning: init_size=300 should be larger than k=886. Setting it to 3*k
  init_size=init_size)
Inertia for init 1/3: 4950187.500000
Init 2/3 with method: k-means++
Inertia for init 2/3: 4464646.283333
Init 3/3 with method: k-means++
Inertia for init 3/3: 4941442.166667
Minibatch iteration 1/786200: mean batch inertia: 95868.474986, ewa inertia: 95868.474986 
Minibatch iteration 2/786200: mean batch inertia: 99673.433750, ewa inertia: 95869.442923 
Minibatch iteration 3/786200: mean batch inertia: 99292.147983, ewa inertia: 95870.313618 
Minibatch iteration 4/786200: mean batch inertia: 97593.331241, ewa inertia: 95870.751934 
Minibatch iteration 5/786200: mean batch inertia: 97558.089367, ewa inertia: 95871.181173 
Minibatch iteration 6/786200: mean batch inertia: 95642.533019, ewa inertia: 95871.123007 
Minibatch iteration 7/786200: mean batch inertia: 93952.687664, ewa inertia: 95870.634980 
Minibatch iteration 8/786200: mean batch inertia: 92603.303128, ewa inertia: 95869.803809 
Minibatch iteration 9/786200: mean batch inertia: 94457.049382, ewa inertia: 95869.444421 
[MiniBatchKMeans] Reassigning 100 cluster centers.
Traceback (most recent call last):
  File "sift.py", line 140, in <module>
    mbk.fit(random_subset)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 1190, in fit
    verbose=self.verbose)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 860, in _mini_batch_step
    centers[to_reassign] = X[new_centers]
ValueError: array is not broadcastable to correct shape

This is sklearn version 0.15-git and Python 2.7.3.

The crash appears only when I combine a large number of datapoints with a relatively large K. I usually set K=sqrt(# datapoints), but it also happens for e.g. K=500; with K=100, everything works fine. In this case I have around 40k datapoints. Here is a reproduction case:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

K = 500
data = np.random.randn(42924, 128) # this is what my data looks like
mbk = MiniBatchKMeans(n_clusters=K, batch_size=100, verbose=1)
mbk.fit(data)
centroids = mbk.cluster_centers_
Owner

jaquesgrobler commented Nov 25, 2013

I can reproduce this (latest build)

Owner

larsmans commented Nov 25, 2013

I added

            print(to_reassign.shape)
            print(new_centers.shape)

just before line 860, which results in

[MiniBatchKMeans] Reassigning 100 cluster centers.
(500,)
(100,)

It looks like we should check whether the batch size is >=K, because otherwise the reassignment heuristic can't do its work.
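
To illustrate the mismatch outside scikit-learn, here is a minimal NumPy sketch. It assumes `to_reassign` is a boolean mask over the K centers and `new_centers` holds sample indices drawn from a single mini-batch, which is what the printed shapes suggest. The helper `cap_reassignments` is hypothetical: it shows one possible guard, not the library's code or the eventual fix.

```python
import numpy as np

rng = np.random.RandomState(0)
n_clusters, batch_size, n_features = 500, 100, 128

centers = rng.randn(n_clusters, n_features)
X = rng.randn(batch_size, n_features)  # one mini-batch of samples

# Assumption: every center is flagged for reassignment (mask of shape (500,)).
to_reassign = np.ones(n_clusters, dtype=bool)
# Only batch_size distinct samples can serve as replacements (shape (100,)).
new_centers = rng.choice(batch_size,
                         size=min(to_reassign.sum(), batch_size),
                         replace=False)

try:
    # 500 rows selected on the left, only 100 rows supplied on the right.
    centers[to_reassign] = X[new_centers]
    failed = False
except ValueError:
    failed = True


def cap_reassignments(mask, max_reassign, rng):
    """Hypothetical guard: keep at most max_reassign flagged centers."""
    mask = mask.copy()
    flagged = np.flatnonzero(mask)
    if flagged.size > max_reassign:
        # Randomly unflag the surplus so both sides have the same row count.
        drop = rng.choice(flagged, size=flagged.size - max_reassign,
                          replace=False)
        mask[drop] = False
    return mask


capped = cap_reassignments(to_reassign, batch_size, rng)
picks = rng.choice(batch_size, size=int(capped.sum()), replace=False)
centers[capped] = X[picks]  # shapes now agree: (100, 128) on both sides
```

Capping the mask is only one option; raising an error or a warning when the batch size is smaller than K would be another way to handle the same condition.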

Owner

larsmans commented Nov 25, 2013

czxcjx added a commit to czxcjx/scikit-learn that referenced this issue Dec 5, 2013

Issue #2611: MiniBatchKmeans crashes 5cf66de

czxcjx added a commit to czxcjx/scikit-learn that referenced this issue Dec 5, 2013

Issue #2611: MiniBatchKmeans crashes 20dda7d
Owner

GaelVaroquaux commented Jul 14, 2014

Fixed by #3376
