
[MRG+2] ENH use pairwise_distances_chunked in silhouette_score #11135

Merged
merged 7 commits into scikit-learn:master on Jun 2, 2018

Conversation

@jnothman
Member

jnothman commented May 25, 2018

Used to be part of #10280 and elsewhere

Closes #7175
Closes #10279
Closes #7177

@jnothman jnothman changed the title from ENH use pairwise_distances_chunked in silhouette_score to [MRG] ENH use pairwise_distances_chunked in silhouette_score May 25, 2018

@jnothman


Member

jnothman commented May 25, 2018

Note that I believe no tests are needed: the functionality of silhouette_score and silhouette_samples is tested, and the memory consumption of pairwise_distances_chunked is tested elsewhere.

jnothman added some commits May 25, 2018

@rth

It will be great to close all those issues!

Not a full review, just a couple of minor comments:

@@ -255,6 +255,10 @@ Metrics
  False and both inputs are sparse, will return a sparse matrix.
  :issue:`10999` by :user:`Taylor G Smith <tgsmith61591>`.
- :func:`metrics.cluster.silhouette_score` and
  :func:`metrics.cluster.silhouette_samples` are more memory efficient,
  causing them to run faster. :issue:`11135` by `Joel Nothman`_.


@rth

rth May 27, 2018

Member

As previously discussed, it would only run faster assuming that swapping occurs, i.e. the full distance matrix is larger than the available RAM but smaller than RAM + swap (typical sizes here), which is IMO too strong an assumption.


@jnothman

jnothman May 27, 2018

Member

Given the reported freezes...?


@rth

rth Jun 1, 2018

Member

It's definitely significantly faster in light of the benchmarks below, but I think that is because of a better implementation, not just due to chunking.

Maybe "are more memory efficient, and faster"? Or "are significantly more ..."?

# accumulate distances from each sample to each cluster
clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                       dtype=D_chunk.dtype)
np.add.at(clust_dists.T, labels, D_chunk.T)


@rth

rth May 27, 2018

Member

There are some reports of np.add.at being slow (numpy/numpy#5922, numpy/numpy#11156), and it is currently hardly used in the code base; do you have a different experience with it?

I guess there aren't many alternatives, since

clust_dists.T[labels] += D_chunk.T

and np.bincount (as suggested in the first issue) wouldn't work here, as far as I understand?


@jnothman

jnothman May 27, 2018

Member

I'm unaware of those reports. I'll have a quick look.


@jnothman

jnothman May 28, 2018

Member

clust_dists.T[labels] += D_chunk.T will certainly not work if labels has duplicates, which it does. No, bincount won't work either, given that the output is 2d.

We could do this in a for loop. Is it worth trying that, or should we assume it's not the bottleneck?
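
To make the failure mode concrete, here is a minimal NumPy demo (my illustration, not code from the PR) contrasting buffered fancy-index assignment with the unbuffered np.add.at:

import numpy as np

D_chunk = np.array([[1., 2., 3.],
                    [4., 5., 6.]])      # two rows of a distance chunk
labels = np.array([0, 0, 1])            # columns 0 and 1 belong to cluster 0

buffered = np.zeros((2, 2))
buffered.T[labels] += D_chunk.T         # buffered: for the duplicate index 0, the last write wins,
print(buffered)                         # so column 0's contribution is lost -> [[2. 3.], [5. 6.]]

unbuffered = np.zeros((2, 2))
np.add.at(unbuffered.T, labels, D_chunk.T)   # unbuffered: both columns with label 0 accumulate
print(unbuffered)                            # -> [[3. 3.], [9. 6.]]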


@jnothman

jnothman May 28, 2018

Member

Yes, I'm getting better mileage with a for loop (assuming I've implemented it correctly):

import numpy as np

n_samples = 10000
n_samples_chunk = 100
n_clusters = 10

# one horizontal chunk of the pairwise distance matrix, plus random labels
D_chunk = np.random.rand(n_samples_chunk, n_samples)
labels = np.random.randint(n_clusters, size=n_samples)
label_freqs = np.bincount(labels)

# accumulate per-cluster distance sums with np.add.at
clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                       dtype=D_chunk.dtype)
%timeit np.add.at(clust_dists.T, labels, D_chunk.T)

# accumulate per-cluster distance sums with a per-row bincount loop
clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                       dtype=D_chunk.dtype)
%timeit for i in range(len(D_chunk)): clust_dists[i] += np.bincount(labels, weights=D_chunk[i], minlength=len(label_freqs))
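
For completeness, a quick sanity check (not part of the timing snippet above) that the two accumulation strategies produce the same sums on the same inputs:

clust_dists_a = np.zeros((len(D_chunk), len(label_freqs)), dtype=D_chunk.dtype)
np.add.at(clust_dists_a.T, labels, D_chunk.T)

clust_dists_b = np.zeros((len(D_chunk), len(label_freqs)), dtype=D_chunk.dtype)
for i in range(len(D_chunk)):
    clust_dists_b[i] += np.bincount(labels, weights=D_chunk[i],
                                    minlength=len(label_freqs))

np.testing.assert_allclose(clust_dists_a, clust_dists_b)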

jnothman added some commits May 27, 2018

@TomDLT

TomDLT approved these changes May 28, 2018

LGTM

I double-checked that the results are identical on a few examples.
A quick benchmark also showed that computations now take about 30% less time, even without swapping.
Great work!

@jnothman


Member

jnothman commented May 28, 2018

I made sure that there is a test in which data from the original publication is replicated perfectly.

Thanks for the review!

@jnothman jnothman changed the title from [MRG] ENH use pairwise_distances_chunked in silhouette_score to [MRG+1] ENH use pairwise_distances_chunked in silhouette_score May 29, 2018

@rth


Member

rth commented Jun 1, 2018

Adapting earlier benchmarks from #7177 (comment), with n_features=100:

import numpy as np
from sklearn.metrics.cluster import silhouette_score

from neurtu import delayed, Benchmark

n_features = 100

def runs():
    for n_samples in [100, 1000, 5000, 10000, 20000, 40000]:
        for n_labels in [4, 10, 100]:
            tags = {'n_samples': n_samples, 'n_labels': n_labels}
            
            rng = np.random.RandomState(42)
            X = rng.rand(n_samples, n_features)
            y = rng.randint(n_labels, size=n_samples)
            yield delayed(silhouette_score, tags=tags)(X, y)

df = Benchmark(wall_time=True, peak_memory={'interval': 0.0001}, repeat=1)(runs())
df.set_index(['n_labels', 'n_samples'], inplace=True)

df = df.sort_index()
df = df.rename(columns={'wall_time_mean': 'wall_time', 'peak_memory_mean': 'peak_memory'})

df = df[['wall_time', 'peak_memory']]
df['peak_memory'] = df['peak_memory'].round(2)
df['wall_time'] = df['wall_time'].round(4)
print(df)

I get:

                   wall_time (seconds)      peak_memory (MB)
                      master PR #11135      master PR #11135
n_labels n_samples
4        100          0.0010    0.0010        0.00      0.00
         1000         0.0149    0.0179        0.31      0.62
         5000         0.7804    0.4628      287.30    191.32
         10000        3.4167    1.7711     1144.00    763.28
         20000       14.4864    4.9563     4576.68   1023.87
         40000       59.7319   19.5900    18329.18   1023.87
10       100          0.0037    0.0010        0.00      0.00
         1000         0.0190    0.0182        0.00      0.00
         5000         0.6339    0.4578      247.71    190.74
         10000        3.1061    1.7667      919.54    762.94
         20000       13.3901    4.9475     3676.61   1023.87
         40000       58.4792   19.5537    14704.77   1023.87
100      100          0.1148    0.0010        0.00      0.00
         1000         0.3631    0.0180        0.00      0.00
         5000         1.0177    0.4606      190.74    190.74
         10000        2.8273    1.7763      762.94    762.94
         20000       11.7602    4.9412     3173.65   1023.87
         40000       48.2780   19.5229    12465.71   1023.87

which corresponds to a 2.5x speedup for large n_samples and with memory capped at 1GB (instead of 18GB+ in this example). This is awesome!

(Ignore cells with zero peak_memory - it's a measurement artifact).
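
Side note: the ~1024 MB plateau above presumably corresponds to scikit-learn's default working_memory of 1024 MB used by pairwise_distances_chunked. Assuming silhouette_score now goes through that code path, the cap should be tunable via the global config, roughly like this:

import numpy as np
import sklearn
from sklearn.metrics.cluster import silhouette_score

rng = np.random.RandomState(42)
X = rng.rand(20000, 100)
y = rng.randint(10, size=20000)

# lower the per-chunk distance-matrix budget from the default 1024 MB to 256 MB;
# peak memory should drop further, at the cost of more (smaller) chunks
with sklearn.config_context(working_memory=256):
    print(silhouette_score(X, y))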

@rth

rth approved these changes Jun 1, 2018

LGTM, really great work!

P.S.: could you please partially revert the what's new wording changes from #11135 (comment)? Sorry about that.

@rth


Member

rth commented Jun 1, 2018

BTW, if I read #7177 (comment) right, the benchmarks above suggest this PR should result in faster calculations than #7177 too.

@rth rth changed the title from [MRG+1] ENH use pairwise_distances_chunked in silhouette_score to [MRG+2] ENH use pairwise_distances_chunked in silhouette_score Jun 1, 2018

@jnothman


Member

jnothman commented Jun 2, 2018

The speed improvement since #7177 might be due to larger working memory, or due to np.add.at being removed...
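
For readers following along, the overall idea (a sketch of the approach, not the merged code verbatim; the function name here is illustrative) is to pass pairwise_distances_chunked a reduce_func that turns each row chunk of the distance matrix into per-sample intra- and inter-cluster statistics, so the full n_samples x n_samples matrix is never held in memory at once:

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

def chunked_silhouette_samples(X, labels, metric='euclidean'):
    # assumes labels are integers 0..n_clusters-1 with every cluster non-empty
    label_freqs = np.bincount(labels)

    def reduce_func(D_chunk, start):
        # summed distance from each sample in the chunk to each cluster
        clust_dists = np.zeros((len(D_chunk), len(label_freqs)),
                               dtype=D_chunk.dtype)
        for i in range(len(D_chunk)):
            clust_dists[i] = np.bincount(labels, weights=D_chunk[i],
                                         minlength=len(label_freqs))
        rows = np.arange(len(D_chunk))
        own = labels[start:start + len(D_chunk)]
        intra = clust_dists[rows, own]        # distance sum to own cluster (self included, dist 0)
        clust_dists[rows, own] = np.inf       # exclude own cluster from the minimum below
        inter = (clust_dists / label_freqs).min(axis=1)   # mean distance to nearest other cluster
        return intra, inter

    chunks = pairwise_distances_chunked(X, reduce_func=reduce_func, metric=metric)
    intra, inter = map(np.concatenate, zip(*chunks))
    with np.errstate(divide='ignore', invalid='ignore'):
        intra /= label_freqs[labels] - 1      # mean over the other members of the cluster
        sil = (inter - intra) / np.maximum(intra, inter)
    return np.nan_to_num(sil)                 # singleton clusters get 0 by convention

Peak memory here is roughly chunk_rows * n_samples * 8 bytes rather than n_samples**2 * 8 bytes, which is consistent with the ~1 GB plateau in the table above.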

@jnothman jnothman merged commit 86ab5a4 into scikit-learn:master Jun 2, 2018

@jnothman


Member

jnothman commented Jun 2, 2018

Thanks for the reviews!
