
Fix crash on kmeans2 #5486

Closed · 3 commits


vbarrielle commented Nov 10, 2015

scipy.cluster.vq.kmeans2 would crash during random initialization when its data argument contains fewer data points than dimensions.

This pull request adds a test, test_kmeans2_high_dim, which exposes the issue, and fixes the random initialization procedure to use PCA instead of a Cholesky factorization of the covariance matrix when the number of data points is smaller than the dimension.

vbarrielle added some commits Nov 9, 2015

add test revealing kmeans2 failure on high-dim data
kmeans2 with random initialization fails whenever the number of data
points is lower than the number of dimensions.
BUG: fix kmeans2 on inputs with rank-deficient covariance matrix
Previously, kmeans2 would crash with LinalgError on inputs with more
dimensions than data points. The crash was caused by the computation of
the Cholesky factorization of the covariance matrix in the random
initialization. In such cases, this patch will instead compute the PCA
of the covariance matrix.

@ev-br ev-br added this to the 0.17.0 milestone Nov 10, 2015

ev-br commented Nov 10, 2015

Looks reasonable. I confirm both the LinAlgError and that this PR fixes it.
I'll keep it open for a while in case someone has comments. Otherwise, it's a bugfix I'm going to merge soon.

mu = np.mean(data, axis=0)
_, s, vh = np.linalg.svd(data - mu, full_matrices=False)
x = np.random.randn(k, s.size)
sVh = s[:, None] * vh

pv Nov 15, 2015


I think you need to divide by sqrt(n-1)

vbarrielle Nov 15, 2015


Indeed, I had forgotten about it.

# of input points
data = np.fromfile(DATAFILE1, sep=", ")
data = data.reshape((20, 20))[:10]
kmeans2(data, 2)

pv Nov 15, 2015


There should also be a test directly checking that _krandinit produces samples with the correct covariance.
IOW, use a large k, set np.random.seed(1234), and compare np.cov(x, rowvar=0) to np.cov(data, rowvar=0)

@pv pv added the needs-work label Nov 15, 2015

fix normalization of krandinit on rank-deficient data
Also add a test to ensure the generated data has the proper covariance.
vbarrielle commented Nov 15, 2015

I've fixed the missing division and added a test to check the covariance of the generated data. The comparison of the covariances uses a relatively loose tolerance, but I can't get better precision without using a much larger k, which uses too much memory IMO.

@pv pv removed the needs-work label Nov 15, 2015


ev-br commented Dec 4, 2015

np.fromfile usage needs to be removed from tests, now that #5573 is merged. Here is the branch where I did this, squashed the commits and rebased on master
https://github.com/ev-br/scipy/commits/pr/5486

An unpleasant thing is that this line, x = np.dot(x, sVh) + mu, fails with a MemoryError in a wine VM (which, admittedly, is rather tight on memory).


vbarrielle commented Dec 4, 2015

An unpleasant thing is that this line, x = np.dot(x, sVh) + mu, fails with a MemoryError in a wine VM (which, admittedly, is rather tight on memory).

The MemoryError surely comes from the covariance test. It needs quite a lot of data to get a good approximation of the covariance matrix. Maybe the test ought to be disabled on platforms without enough memory?

np.fromfile usage needs to be removed from tests, now that #5573 is merged. Here is the branch where I did this, squashed the commits and rebased on master

Is there something I should do about this? I'm unfamiliar with the process, should I update my pull request using your branch, or are you going to merge your branch?


ev-br commented Dec 4, 2015

If you have an idea of how to trigger the problem for smaller problem size, it'd be helpful (eg, do you really need k=1e6?). If you can do it, please push a commit to your branch, I'll cherry-pick it. Otherwise, I'll just add a skip condition and merge the rebased one.


vbarrielle commented Dec 4, 2015

If you have an idea of how to trigger the problem for smaller problem size,
it'd be helpful (eg, do you really need k=1e6?). If you can do it, please
push a commit to your branch, I'll cherry-pick it. Otherwise, I'll just add
a skip condition and merge the rebased one.

The k=1e6 is only required to test that _krandinit actually produces the
correct covariance in the high-dimensional case. With lower values it's harder
to tell whether the obtained covariance matches the initial covariance.

However, this test is only here to check the correctness of the fix, and as
such it is only required to pass on platforms with enough memory.

The test_kmeans2_high_dim, on the other hand, actually highlights the bug,
and should pass on any platform.

So I guess adding a skip condition is the way to go.

ev-br added a commit that referenced this pull request Dec 8, 2015

Merge branch 'pr/5486'
Reviewed at #5486
ev-br commented Dec 8, 2015

Rebased, squashed and merged with the skip condition on the test in db14b78. Thanks @vbarrielle.

@ev-br ev-br closed this Dec 8, 2015
