[MRG] Blockwise parallel silhouette computation #1976

Closed · wants to merge 9 commits
8 participants
Contributor

AlexandreAbraham commented May 21, 2013

This pull request introduces a blockwise computation of the silhouette, which uses less memory but is slightly slower.

First, the vocabulary can be debated. I have chosen the word "global" for the original silhouette strategy (as the global distance matrix is computed; I could not come up with a better word) and "blockwise" for my method.

The other point is that I have 3 implementations of the silhouette:

  • the original one
  • a blockwise single-threaded version
  • a blockwise multi-threaded version (which uses a bit more memory than the single-threaded version, even if n_jobs=1)

I have decided to leave the original version untouched, as its code is far more readable than mine. I have also not kept the most memory-efficient blockwise single-threaded version, because the memory gain is not worth the added code complexity (I think).

It is in fact possible to have "one implementation to rule them all". This could be done by keeping the same code skeleton and using a "precomputed" distance matrix for the original version. But I think that the code would become too opaque.

sklearn/metrics/cluster/unsupervised.py
+ The method used to compute distance matrix between samples. Default is
+ ``global`` which means that the full distance matrix is computed
+ yielding in fast computation but high memory consumption. ``blockwise``
+ option compute clusterwise distance matrices, dividing approximately
@jaquesgrobler

jaquesgrobler May 21, 2013

Owner

compute -> computes :)

and maybe add a "the" before "blockwise"

@jaquesgrobler

jaquesgrobler May 21, 2013

Owner

"approximately" should maybe go between "by" and "the squared" in the next line,
so it reads: ...dividing memory consumption by approximately the squared number of clusters.
Does that make sense? Currently it's a bit hard to read.

@AlexandreAbraham

AlexandreAbraham May 21, 2013

Contributor

I agree. Plus, it is a bit superfluous. Do you think I should remove this part and just say that it lowers memory consumption, without giving an order of magnitude?

@jaquesgrobler

jaquesgrobler May 21, 2013

Owner

I think it's okay as is now 👍

sklearn/metrics/cluster/unsupervised.py
+ method: string
+ The method used to compute distance matrix between samples. Default is
+ ``global`` which means that the full distance matrix is computed
+ yielding in fast computation but high memory consumption. ``blockwise``
@jaquesgrobler

jaquesgrobler May 21, 2013

Owner

Same comments as in the other docstring.

sklearn/metrics/cluster/unsupervised.py
+ B[indices_b] = np.minimum(values_b, B[indices_b])
+ del indices_b
+ else:
+ raise ValueError('Unknow method: %s' % method)
@jaquesgrobler

jaquesgrobler May 21, 2013

Owner

typo: Unknown

Owner

jaquesgrobler commented May 21, 2013

I had a quick look. Thanks for the add-on, @AlexandreAbraham. This seems very useful considering the conversation on the mailing list here

sklearn/metrics/cluster/unsupervised.py
@@ -43,10 +48,24 @@ def silhouette_score(X, labels, metric='euclidean', sample_size=None,
<sklearn.metrics.pairwise.pairwise_distances>`. If X is the distance
array itself, use ``metric="precomputed"``.
+ method: string
@GaelVaroquaux

GaelVaroquaux May 21, 2013

Owner

The docstring should be {'global', 'blockwise'}, and not 'string'.

sklearn/metrics/cluster/unsupervised.py
@@ -118,6 +142,20 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):
allowed by :func:`sklearn.metrics.pairwise.pairwise_distances`. If X is
the distance array itself, use "precomputed" as the metric.
+ method: string
@GaelVaroquaux

GaelVaroquaux May 21, 2013

Owner

Same remarks here.

Owner

GaelVaroquaux commented May 21, 2013

Looks good to me.

+def _intra_cluster_distances_block(X_cluster, metric, **kwds):
+ ''' Calculate the mean intra-cluster distance for a given cluster
+
+ Parameters
@mblondel

mblondel May 21, 2013

Owner

I think I would remove the parameter documentation and leave only the short description above. Those two functions are private and short. The documentation takes more space than the actual implementation.

@GaelVaroquaux

GaelVaroquaux May 22, 2013

Owner

I think I would remove the parameter documentation and leave only the
short description above. Those two functions are private and short. The
documentation takes more space than the actual implementation.

I was actually thinking the same. So +1 on this remark

@AlexandreAbraham

AlexandreAbraham May 22, 2013

Contributor

Done. I have also removed the parameter documentation for the other private functions (original method).

Owner

jaquesgrobler commented May 22, 2013

+1 for merge

Owner

mblondel commented Jan 5, 2014

Hmm, this PR has two +1s for merge and is still open?!

sklearn/metrics/cluster/unsupervised.py
@@ -179,25 +238,24 @@ def _intra_cluster_distance(distances_row, labels, i):
def _nearest_cluster_distance(distances_row, labels, i):
@amueller

amueller Jan 5, 2014

Owner

why remove the docstrings?

@amueller amueller referenced this pull request Dec 2, 2014

Merged

[MRG+1] LDA refactoring #3523

Contributor

AlexandreAbraham commented Oct 19, 2015

I fixed the only remaining remark (docstring removed for no reason) and rebased. This should be good to go.

@AlexandreAbraham AlexandreAbraham changed the title from Blockwise parallel silhouette computation to [MRG] Blockwise parallel silhouette computation Oct 19, 2015

sklearn/metrics/cluster/unsupervised.py
+ ``blockwise`` option computes clusterwise distance matrices, dividing
+ memory consumption by approximately the squared number of clusters.
+ The ``blockwise`` method allows parallelization through ``n_jobs``
+ parameter.
@GaelVaroquaux

GaelVaroquaux Oct 19, 2015

Owner

I suggest that we rename this argument to 'blockwise', and have as options False, True, and 'auto', where 'auto' is the default and you do some clever check on the size of the input to decide whether blocking is good or not.
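A minimal sketch of that dispatch (the function name and the 'auto' threshold here are placeholders; the PR later settled on a speed-based heuristic from benchmarks):

```python
import numpy as np

def resolve_blockwise(blockwise, X):
    """Turn blockwise={'auto', True, False} into a concrete bool.

    Illustrative only: this 'auto' switches blocking on when the full
    n x n float64 distance matrix would exceed ~1 GiB.
    """
    if blockwise == 'auto':
        n = X.shape[0]
        return n * n * 8 > 2 ** 30   # bytes of a dense float64 matrix
    if isinstance(blockwise, bool):
        return blockwise
    raise ValueError("blockwise must be 'auto', True or False, got %r"
                     % (blockwise,))
```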

sklearn/metrics/cluster/unsupervised.py
-def silhouette_samples(X, labels, metric='euclidean', **kwds):
+def silhouette_samples(X, labels, metric='euclidean', method='global',
@GaelVaroquaux

GaelVaroquaux Oct 19, 2015

Owner

Same remark on naming and 'auto' behavior.

Contributor

AlexandreAbraham commented Oct 19, 2015

I used this code to bench the silhouette computation:

import time

from sklearn import datasets
from sklearn.metrics.cluster.unsupervised import silhouette_score


def test_silhouette(n_samples, n_clusters):
    X, y = datasets.make_blobs(n_samples=n_samples, n_features=500,
                               centers=n_clusters)
    t0 = time.time()
    for i in range(20):
        silhouette_score(X, y, metric='euclidean',
                         blockwise=False)
    t_global = time.time() - t0
    t0 = time.time()
    for i in range(20):
        silhouette_score(X, y, metric='euclidean',
                         blockwise=True)
    t_block = time.time() - t0

    return t_global, t_block

for n_clusters in [50, 100, 200, 300]:
    for n_samples in [200, 300, 400, 500, 1000]:
        if n_clusters >= n_samples - 1:
            continue
        t_g, t_b = test_silhouette(n_samples, n_clusters)
        print('%d,%d,%f,%f' % (n_clusters, n_samples, t_g, t_b))

On my machine, blockwise is always faster below 50 clusters. Above 50 clusters, the global version is faster if there are fewer than 5 * n_labels samples. I set up a heuristic in the code accordingly.
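Encoded as code, that crossover rule would look roughly like this (hypothetical helper name; the exact thresholds in the PR may differ):

```python
def use_blockwise(n_samples, n_clusters):
    """'auto' heuristic from the benchmark above (illustrative sketch).

    Blockwise was always faster below 50 clusters; above that, the
    global method won when there were fewer than 5 * n_clusters samples.
    """
    if n_clusters < 50:
        return True
    return n_samples >= 5 * n_clusters
```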

+ # Test with the block method
+ silhouette_metric_block = silhouette_score(X, y, metric='euclidean',
+ blockwise=True)
+ assert_almost_equal(silhouette, silhouette_metric_block)
@GaelVaroquaux

GaelVaroquaux Oct 19, 2015

Owner

We need the auto logic covered too (just to explore all codepaths).

@AlexandreAbraham

AlexandreAbraham Oct 19, 2015

Contributor

The 'auto' case is covered below. But I can add an extra verification here if you want.

@GaelVaroquaux

GaelVaroquaux via email Oct 19, 2015

Owner
sklearn/metrics/cluster/unsupervised.py
If ``sample_size is None``, no sampling is used.
+ n_jobs : integer, optional
+ The number of CPUs to use to do the computation. -1 means
+ 'all CPUs'. This option can only be used with ``method="blockwise"``.
@GaelVaroquaux

GaelVaroquaux Oct 19, 2015

Owner

This docstring hasn't been updated to reflect the change of naming in the arguments.

Contributor

AlexandreAbraham commented Oct 19, 2015

Yes, given the last comments, I'm working on the docs.

Owner

GaelVaroquaux commented Oct 21, 2015

Your tests are failing under Windows because of the added parallelism. You need to disable it in the test (check AppVeyor to see the problem).

@@ -133,6 +160,21 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):
allowed by :func:`sklearn.metrics.pairwise.pairwise_distances`. If X is
the distance array itself, use "precomputed" as the metric.
+ blockwise: {'auto', True, False}
@amueller

amueller Oct 21, 2015

Owner

Can you add ``versionadded`` to both new parameters?

@AlexandreAbraham

AlexandreAbraham Oct 21, 2015

Contributor

How do you add the flag to a parameter? I didn't manage to do it :-/

Owner

ogrisel commented Oct 21, 2015

The failure under Windows might be random and not necessarily related to this specific PR. We had seemingly similar random failures in the past on completely unrelated PRs, and they disappeared without us doing anything.

Apparently here we first get a WindowsError: [Error 5] Access is denied and then a series of failures that reference "Attempting to do parallel computing without protecting your import", but which are probably actually triggered by a problem in the runtime environment on the CI server.

Let me relaunch AppVeyor manually on this PR.

+ approximately the squared number of clusters. The latter allows
+ parallelization through ``n_jobs`` parameter.
+ Default is 'auto' that chooses the fastest option without
+ consideration for memory consumption.
@amueller

amueller Oct 21, 2015

Owner

Here, just after this line, add a ``.. versionadded:: 0.17``. Not sure if it needs a newline before it, though.
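For reference, the numpydoc convention is to put the directive inside the parameter's description, preceded by a blank line; a sketch of how the docstring could look (parameter wording abbreviated):

```python
def silhouette_samples(X, labels, metric='euclidean', blockwise='auto',
                       n_jobs=1, **kwds):
    """Compute the Silhouette Coefficient for each sample.

    Parameters
    ----------
    blockwise : {'auto', True, False}
        Whether to compute clusterwise distance matrices.

        .. versionadded:: 0.17

    n_jobs : integer, optional
        The number of CPUs to use. -1 means 'all CPUs'.

        .. versionadded:: 0.17
    """
```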

@krishnateja614

krishnateja614 May 24, 2016

@ogrisel, I am getting the WindowsError 5 "Access is denied" inconsistently (about 8 times out of 10) on an Azure VM, but not on my PC. It occurs after GridSearchCV fits all the folds, but before it can calculate the best parameters and best scores. I raised an issue with all the details here: #6820. The VM is Windows Server 2012 R2. Is it something related to the VM environment, or is there something we can do?

Contributor

AlexandreAbraham commented Oct 27, 2015

The CircleCI failure is not due to the PR. Should I amend and re-launch the tests?

Owner

GaelVaroquaux commented Oct 27, 2015

Contributor

AlexandreAbraham commented Oct 27, 2015

Done.

Contributor

AlexandreAbraham commented Nov 8, 2015

I'll rebase again but I don't get why CircleCI is failing.

Owner

jnothman commented Aug 11, 2016

#7175 suggests this would still be of interest to users.

However, I have my doubts about how much time we save by doing this block computation cluster-wise, where clusters are very unevenly sized. If we simply processed b samples at a time, we could accumulate as follows:

def process_block(start, intra_cluster):
    """compute minimal inter-cluster distance for points in block and add to intra-cluster totals"""
    block_dists = pairwise_distances(X[start:start+b], X)
    cluster_dists = np.zeros((b, n_clusters))
    # Following is equivalent to cluster_dists = np.vstack([np.bincount(y, row, n_clusters) for row in block_dists])
    np.add.at(cluster_dists, (np.repeat(np.arange(b), len(y)), np.tile(y, b)), block_dists.ravel())
    np.add.at(intra_cluster, y[start:start+b], cluster_dists[np.arange(b), y[start:start+b]])
    cluster_dists[np.arange(b), y[start:start+b]] = np.inf
    return (cluster_dists / np.bincount(y)).min(axis=1)

This would also much more easily benefit from parallelism (obviously without sharing intra_cluster accumulators), I think.

Owner

jnothman commented Aug 11, 2016

That code is wrong (I forgot the definition of intra-cluster distance), but the idea still applies.

Owner

jnothman commented Jun 6, 2017

FWIW, I think this is still in the pipeline to fix, via #7979
