[MRG+2] Fixed n**2 memory blowup in _labels_inertia_precompute_dense #7721
Conversation
I think it would be great if you could do some benchmarks showing this doesn't impact performance. Otherwise it looks good :)
```python
mindist = np.minimum(dist, mindist)

# Breakup nearest neighbor distance computation into batches to prevent
# memory blowup
```
amueller (Member) commented on Oct 24, 2016:
Maybe add "in the case of many clusters"?
Erotemic (Author, Contributor) replied on Oct 24, 2016:
Fixed. I said "in the case of a large number of samples and clusters" because the issue is primarily due to assigning labels to every point in a large dataset.
```python
# Breakup nearest neighbor distance computation into batches to prevent
# memory blowup
metric_kwargs = dict(squared=True)
# Should use check_inputs=False when speedup_kmpp is merged
```
amueller (Member) commented on Oct 24, 2016:
I would rather not mention the other PR here, if you want to do so, please use the PR number. There's no way for someone else to look up the issue based on your branch name (as far as I know)
Erotemic (Author, Contributor) replied on Oct 24, 2016:
I changed it to a TODO. This code is called every iteration in the mini-batch main loop and for the same reasons as #7383 it would be preferable not to repeat nan checks every time, so I want to make sure that this change is not forgotten.
```python
# Should use check_inputs=False when speedup_kmpp is merged
# metric_kwargs = dict(squared=True, check_inputs=False)
labels, mindist = pairwise_distances_argmin_min(
    X=X, Y=centers, metric='euclidean', metric_kwargs=metric_kwargs)
```
amueller (Member) commented on Oct 24, 2016:
Do we want to use a larger batch size? Is there any impact on the overall runtime using this (also, say, for a low number of clusters)?
Erotemic (Author, Contributor) replied on Oct 24, 2016:
The default batch size for pairwise_distances_argmin_min is 500. On average it seems that pairwise_distances_argmin_min is faster than the original implementation. When I add batch_size to the varied parameters, using 500 performs best.
(Ignore this post; these results are in error.) As requested, here are some benchmarks. Overall it seems that this change improves speed. I'm testing the changed code in a context independent of KMeans/MiniBatchKMeans, running each function 10 times and taking the average duration. My first test is over the basis:
Here is my benchmark script, and here is its output.
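A minimal sketch of this kind of benchmark (not the actual script from the PR; the parameter grid and the simplified stand-in for the removed code are illustrative, and the squared-distance `metric_kwargs` follows the 0.18-era API used in this patch):

```python
# Illustrative benchmark: time the old full-matrix approach against the
# batched pairwise_distances_argmin_min over a small parameter grid.
import timeit

import numpy as np
from sklearn.metrics import euclidean_distances, pairwise_distances_argmin_min


def old_version(X, centers):
    # Simplified stand-in for the removed code: materializes the full
    # (n_samples, n_clusters) distance matrix.
    distances = euclidean_distances(X, centers, squared=True)
    labels = distances.argmin(axis=1)
    mindist = distances[np.arange(X.shape[0]), labels]
    return labels, mindist


def new_version(X, centers):
    # The patched code path: distances are computed in batches internally.
    return pairwise_distances_argmin_min(
        X=X, Y=centers, metric='euclidean', metric_kwargs={'squared': True})


rng = np.random.RandomState(0)
for n_samples in [1000, 10000]:
    for n_clusters in [10, 100, 1000]:
        X = rng.rand(n_samples, 128)
        centers = rng.rand(n_clusters, 128)
        for name, func in [('old', old_version), ('new', new_version)]:
            # Average over 10 runs, matching the methodology above.
            total = timeit.timeit(lambda: func(X, centers), number=10)
            print('%s: n_samples=%d, n_clusters=%d: %.4f s'
                  % (name, n_samples, n_clusters, total / 10))
```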
I also ran another configuration to show the difference for a small number of clusters. The parameter grid basis is:
The results are:
Again, in most cases the speed is improved. In the cases where it is not improved, the change is not very significant.
(Ignore this post; these results are in error.) Here is also a set of results when adding batch_size (for the new code only) to the varied parameters:
Thanks for the extensive benchmarks.
Yup, that's a mistake. Thanks for catching that. This version runs all benchmarks end to end.
The fixed tests show that:
Ok, that is a bit more the outcome I was expecting. Can you check if it has any impact in the context of k-means? I would expect it doesn't.
Here are the results using MiniBatchKMeans:
And here are the results for regular KMeans:
The script I used to generate these is here: The change seems to be minimal for KMeans, except for the one case of 100 clusters and 100 samples. Running the script again, it seems this timing was an outlier, maybe due to some hiccup on my computer. Running that single test again I get:
This makes sense because this function is not used extensively in the KMeans algorithm. If this is a problem, the underlying implementation of pairwise_distances_argmin_min could be changed to not use batching if batch_size=None. Then in _labels_inertia_precompute_dense we could pass batch_size=None if the number of clusters is less than 5 (a sketch of that idea is below). However, this seems like it might be more work than it's worth. A large number of clusters with a large number of data points is where the efficiency is really needed.
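For illustration, a sketch of that idea with the dispatch done in the caller (the helper name `assign_labels` is hypothetical; the threshold of 5 clusters is the one suggested above, and a batch_size=None fast path inside pairwise_distances_argmin_min itself is left out since it does not exist):

```python
import numpy as np
from sklearn.metrics import euclidean_distances, pairwise_distances_argmin_min


def assign_labels(X, centers, small_k_threshold=5):
    # Hypothetical dispatch: skip batching when there are very few
    # clusters, where the full distance matrix is small and the batching
    # overhead dominates.
    if centers.shape[0] < small_k_threshold:
        # Unbatched path: the (n_samples, n_clusters) matrix is cheap here.
        distances = euclidean_distances(X, centers, squared=True)
        labels = distances.argmin(axis=1)
        mindist = distances[np.arange(X.shape[0]), labels]
    else:
        # Batched path: peak memory is roughly (batch_size, n_clusters).
        labels, mindist = pairwise_distances_argmin_min(
            X=X, Y=centers, metric='euclidean',
            metric_kwargs={'squared': True})
    return labels, mindist
```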
I think your current patch makes a reasonable tradeoff.
Otherwise LGTM.
```python
# TODO: Once the functionality (PR #7383) is merged use check_inputs=False.
# metric_kwargs = dict(squared=True, check_inputs=False)
labels, mindist = pairwise_distances_argmin_min(
    X=X, Y=centers, metric='euclidean', metric_kwargs=metric_kwargs)
```
jnothman (Member) commented on Oct 26, 2016:
Please inline metric_kwargs as:

```python
    X=X, Y=centers, metric='euclidean', metric_kwargs={'squared': True})
```
LGTM. Please add a what's new / enhancements entry.
Thanks!
What does this implement/fix? Explain your changes.
In k-means and minibatch k-means, the `_labels_inertia_precompute_dense` function computes the nearest neighbors of a sample of test data points with the current cluster centers by computing the full distance matrix from every test point to every cluster center. When the number of cluster centers is large and the number of test data points is large, this leads to a huge memory blowup. I ran out of memory quite fast on my machine with 64GB of memory when computing a 65K visual vocabulary using over one million 128-dimensional SIFT descriptors, even when using a modest batch size.

Initially I was going to write a helper function that would batch this operation into a few chunks to avoid having the entire distance matrix represented in memory, but it turns out the function already exists in the sklearn codebase. Yay for reusable code!
I removed the old explicit `euclidean_distances` call followed by `argmin` and plugged in `pairwise_distances_argmin_min`, which does batching internally. This change dramatically reduces the amount of memory used in the computation; a before/after sketch is below.
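To make the change concrete, here is a minimal before/after sketch (the "before" code paraphrases the removed implementation rather than quoting it; the `metric_kwargs` usage follows the 0.18-era API in this patch, and the memory arithmetic assumes float64):

```python
import numpy as np
from sklearn.metrics import euclidean_distances, pairwise_distances_argmin_min

rng = np.random.RandomState(0)
X = rng.rand(10000, 128)       # test points (e.g. SIFT descriptors)
centers = rng.rand(2000, 128)  # cluster centers

# Before (paraphrased): the full (n_samples, n_clusters) distance matrix
# is materialized. At 1e6 samples and 65536 clusters that is
# 1e6 * 65536 * 8 bytes, roughly 500 GB of float64, which is what
# exhausted a 64GB machine.
distances = euclidean_distances(X, centers, squared=True)
labels_old = distances.argmin(axis=1)
mindist_old = distances[np.arange(X.shape[0]), labels_old]

# After: pairwise_distances_argmin_min chunks X internally, so peak memory
# is on the order of (batch_size, n_clusters) rather than
# (n_samples, n_clusters).
labels, mindist = pairwise_distances_argmin_min(
    X=X, Y=centers, metric='euclidean', metric_kwargs={'squared': True})

# Both paths produce the same labels and (squared) minimum distances.
assert np.array_equal(labels, labels_old)
np.testing.assert_allclose(mindist, mindist_old)
```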
Any other comments?

Initially I was going to wait to submit this until #7383 was merged and finished, but it's such an important change that I wanted to make sure it was in the pipeline. I do want to note that if/when #7383 is merged, `check_inputs` should be set to False when calling `pairwise_distances_argmin_min`.