Added custom kernels to SpectralClustering #1792

Closed

@rolisz
rolisz commented Mar 19, 2013

Added custom kernels as per #1791.

@larsmans larsmans commented on an outdated diff Mar 19, 2013
sklearn/cluster/spectral.py
@@ -302,7 +302,9 @@ class SpectralClustering(BaseEstimator, ClusterMixin):
n_clusters : integer, optional
The dimension of the projection subspace.
- affinity: string, 'nearest_neighbors', 'rbf' or 'precomputed'
+ affinity: string, 'nearest_neighbors', 'precomputed', or a kernel:
@larsmans
larsmans Mar 19, 2013 scikit-learn member

If a callable may be passed, that should be documented. Also, the type should be

affinity : string, array-like or callable

in one line.

(Note the space before the colon, which the rest of this module is missing.)

The text below can then summarize the meanings of the options. In fact, sklearn.metrics.pairwise_kernels supports quite a few more kernels, so you might want to refer there for the complete list and only explicitly mention "rbf" from the string options.
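A hypothetical sketch of how that combined entry might read in the numpydoc style used elsewhere in the module (the wording is illustrative, not the committed docstring):

affinity : string, array-like or callable, default 'rbf'
    How to construct the affinity matrix. 'nearest_neighbors' builds a
    graph of nearest neighbors, 'precomputed' interprets X as a
    precomputed affinity matrix, and any kernel name supported by
    sklearn.metrics.pairwise_kernels (e.g. 'rbf') or a callable is used
    to compute the affinity between samples.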

@larsmans larsmans and 1 other commented on an outdated diff Mar 19, 2013
sklearn/cluster/tests/test_spectral.py
@@ -157,7 +159,7 @@ def test_affinities():
# a dataset that yields a stable eigen decomposition both when built
# on OSX and Linux
X, y = make_blobs(n_samples=40, random_state=2, centers=[[1, 1], [-1, -1]],
- cluster_std=0.4)
+ cluster_std=0.4)
@larsmans
larsmans Mar 19, 2013 scikit-learn member

This is probably a mistake...

@rolisz
rolisz Mar 20, 2013

The pep8 style guide checker complained about it; that's why I changed the indentation.

@larsmans
larsmans Mar 20, 2013 scikit-learn member

Strange, I get no warning for it (pep8 1.2). What if you write

X, y = make_blobs(n_samples=40, random_state=2,
                  centers=[[1, 1], [-1, -1]], cluster_std=0.4)

?

@larsmans larsmans commented on an outdated diff Mar 19, 2013
sklearn/cluster/spectral.py
@@ -457,3 +463,11 @@ def mode(self):
" 0.15.")
def k(self):
return self.n_clusters
+
+ def _get_kernel_params(self):
+ params = self.kernel_params
+ if params is None:
+ params = {}
+ if not callable(self.kernel):
+ params['gamma'] = self.gamma
@larsmans
larsmans Mar 19, 2013 scikit-learn member

Now there's no way to specify degree or coef0 for the poly kernel.
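A minimal sketch of how degree and coef0 could be forwarded as well, assuming they become constructor parameters (hypothetical, not the committed code):

def _get_kernel_params(self):
    params = self.kernel_params
    if params is None:
        params = {}
    if not callable(self.kernel):
        # forward the generic kernel parameters for string kernels
        params['gamma'] = self.gamma
        params['degree'] = self.degree
        params['coef0'] = self.coef0
    return params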

@larsmans larsmans commented on an outdated diff Mar 19, 2013
sklearn/cluster/spectral.py
@@ -381,7 +387,8 @@ class SpectralClustering(BaseEstimator, ClusterMixin):
def __init__(self, n_clusters=8, eigen_solver=None, random_state=None,
n_init=10, gamma=1., affinity='rbf', n_neighbors=10, k=None,
- eigen_tol=0.0, assign_labels='kmeans', mode=None):
+ eigen_tol=0.0, assign_labels='kmeans', mode=None, kernel='rbf',
+ kernel_params=None):
@larsmans
larsmans Mar 19, 2013 scikit-learn member

This adds a second parameter kernel. When affinity != kernel, this gives inconsistent results.

@larsmans larsmans commented on an outdated diff Mar 19, 2013
sklearn/cluster/tests/test_spectral.py
@@ -168,6 +170,11 @@ def test_affinities():
labels = sp.fit(X).labels_
assert_equal(adjusted_rand_score(y, labels), 1)
+ kernels_available = kernel_metrics()
+ for kern in kernels_available:
+ sp = SpectralClustering(n_clusters=2, kernel=kern, random_state=0)
+ labels = sp.fit(X).labels_
+ assert_equal(adjusted_rand_score(y, labels), 1)
@larsmans
larsmans Mar 19, 2013 scikit-learn member

I'd like a test for a callable as well. Maybe something trivial like

def its_all_the_same(x, y):
    return 1
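A hypothetical sketch of such a test, assuming the callable ends up being passed through the affinity parameter (see the discussion above about merging kernel into affinity); it only checks that fitting succeeds and that the labels have the right shape:

def histogram(x, y):
    # min-intersection ("histogram") kernel between two samples
    return np.minimum(x, y).sum()

sp = SpectralClustering(n_clusters=2, affinity=histogram, random_state=0)
labels = sp.fit(X).labels_
assert_equal(labels.shape, (X.shape[0],))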
@rolisz

By changing the centers I managed to get rid of the negative value errors I got from most of the kernels, but the additive_chi2 still complains about ValueError: Array contains NaN or infinity.

@rolisz

The predicted labels were all over the place, so I changed the test to be more like the ones in Nystroem and KernelPCA, by testing shapes. Is that okay?

@larsmans
Member

@rolisz said:

@larsmans I fixed the things you mentioned about the commit, but I have a problem with some of the tests. Most of the kernels (all except poly, rbf and linear) complained about negative values in the arrays, so I moved the cluster centers away from the negative region, and I also test for equality of shapes instead of the adjusted_rand_score. Is this okay?

If you're only testing whether they are called, that should be ok. Some kernels are just not applicable to some data (chi² kernels in particular want non-negative input).

And additive_chi2 still gives an error about ValueError: Array contains NaN or infinity. What should I do with it?

Try adding some print calls to see where the NaNs/infs get introduced. I'm not too familiar with chi² innards.

@larsmans
Member

Testing shapes (or just the absence of exceptions) should be fine; we don't need to test every single affinity metric. Maybe one test using, say, the poly kernel is in order, since then we can also test its parameters.

@rolisz
rolisz commented Mar 20, 2013

I think the problem is that pairwise_kernels returns an affinity matrix with mostly negative values. This then causes problems when calculating the Laplacian matrix of the graph (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/graph.py#L170), because it takes square roots of negative numbers, which gives NaN. But from what I've read in the documentation, that's what the chi^2 kernel is supposed to return.

There is a test for the rbf that checks that the values are correct.
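A minimal sketch reproducing that failure mode; additive_chi2_kernel is the real function in sklearn.metrics.pairwise, the rest is purely illustrative:

import numpy as np
from sklearn.metrics.pairwise import additive_chi2_kernel

rng = np.random.RandomState(0)
X = rng.rand(5, 3)             # non-negative input, as the chi^2 kernels require
K = additive_chi2_kernel(X)    # k(x, y) = -sum((x - y)**2 / (x + y)), so K <= 0
print(K.max())                 # 0.0 on the diagonal, negative everywhere else
print(np.sqrt(K.sum(axis=1)))  # degree-style normalization takes square roots
                               # of negative row sums -> NaN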

@larsmans
Member

We can just skip the tests for the chi² kernels; they won't make sense anyway. Also, it may be easier to use a different dataset for testing, e.g. random histograms. Put a note in the docs explaining that only kernels that are non-negative similarity measures make sense.

@rolisz
rolisz commented Mar 26, 2013

I removed the chi2 test. What do you mean by random histograms?

@larsmans
Member

Random matrices of positive numbers. Those should work with the histogram kernel, rbf, poly, linear and cosine.

@larsmans larsmans commented on an outdated diff Mar 30, 2013
sklearn/cluster/tests/test_spectral.py
@@ -168,6 +170,32 @@ def test_affinities():
labels = sp.fit(X).labels_
assert_equal(adjusted_rand_score(y, labels), 1)
+ X = np.random.rand(10, 5) * 10
@larsmans
larsmans Mar 30, 2013 scikit-learn member

Never use np.random directly, that leads to non-reproducible tests. sklearn.utils contains a function check_random_state that turns an integer seed into a random number generator.
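For example, a sketch of the suggested pattern combined with the random-histogram idea above:

from sklearn.utils import check_random_state

rng = check_random_state(0)   # fixed seed -> reproducible test
X = rng.rand(10, 5) * 10      # random non-negative "histograms"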

@larsmans larsmans commented on an outdated diff Mar 30, 2013
sklearn/cluster/spectral.py
@@ -302,7 +302,11 @@ class SpectralClustering(BaseEstimator, ClusterMixin):
n_clusters : integer, optional
The dimension of the projection subspace.
- affinity: string, 'nearest_neighbors', 'rbf' or 'precomputed'
+ affinity : string, array-like or callable
+ Default : 'rbf'
+ If it's a string, it can be one of 'nearest_neighbors', 'precomputed',
+ 'rbf' or one of the kernels supported by `sklearn.metrics.pairwise_kernels`.
+ Only kernels that have a non-negative similarity should be used.
@larsmans
larsmans Mar 30, 2013 scikit-learn member

A kernel does not have similarity... how about

Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm.

@larsmans
Member

Apart from some minor things, I think this is pretty much complete. @GaelVaroquaux, care to review?

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Mar 30, 2013
sklearn/cluster/spectral.py
@@ -338,6 +342,10 @@ class SpectralClustering(BaseEstimator, ClusterMixin):
also be sensitive to initialization. Discretization is another approach
which is less sensitive to random initialization.
+ kernel_params : mapping of string to any, optional
@GaelVaroquaux
GaelVaroquaux Mar 30, 2013 scikit-learn member

I don't understand the type of this parameter. Do you mean, 'dictionary with string keys'?

@larsmans
larsmans Mar 30, 2013 scikit-learn member

Mappings are a generalization of dicts, according to the Python glossary. In practice, one would pass a dict, but any replacement with the same API should work.

@GaelVaroquaux
GaelVaroquaux Mar 30, 2013 scikit-learn member

Mappings are a generalization of dicts, according to the Python glossary.

I know, but the term is used so seldom that I wasn't sure this was what
was meant. I would still vote for using the word 'dictionary', to make it
easy to understand by a majority of people.

@larsmans
larsmans Mar 30, 2013 scikit-learn member

I guess in this case (and in KernelPCA) we could say dict, as custom mapping support is a YAGNI issue.

@GaelVaroquaux GaelVaroquaux commented on the diff Mar 30, 2013
sklearn/cluster/spectral.py
@@ -381,7 +389,8 @@ class SpectralClustering(BaseEstimator, ClusterMixin):
def __init__(self, n_clusters=8, eigen_solver=None, random_state=None,
n_init=10, gamma=1., affinity='rbf', n_neighbors=10, k=None,
- eigen_tol=0.0, assign_labels='kmeans', mode=None):
+ eigen_tol=0.0, assign_labels='kmeans', mode=None,
+ degree=3, coef0=1, kernel_params=None):
@GaelVaroquaux
GaelVaroquaux Mar 30, 2013 scikit-learn member

Have all the new parameters been described in the docstring? I may be missing something, but I have the impression that I can only see kernel_params.

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Mar 30, 2013
sklearn/cluster/spectral.py
@@ -457,3 +466,13 @@ def mode(self):
" 0.15.")
def k(self):
return self.n_clusters
+
+ def _get_kernel_params(self):
@GaelVaroquaux
GaelVaroquaux Mar 30, 2013 scikit-learn member

Is there a reason that this is in a method? From a style perspective I would prefer this to be embedded where it is used, as it is used only in one location.
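A hypothetical sketch of what the inlined version might look like inside fit, assuming the affinity matrix is built with pairwise_kernels(..., filter_params=True) (the exact call in this branch is not shown here):

params = self.kernel_params
if params is None:
    params = {}
if not callable(self.affinity):
    params['gamma'] = self.gamma
    params['degree'] = self.degree
    params['coef0'] = self.coef0
self.affinity_matrix_ = pairwise_kernels(X, metric=self.affinity,
                                         filter_params=True, **params)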

@GaelVaroquaux
Member

Apart from @larsmans' minor comments and mine, this is good to go, I believe. Thanks!

@rolisz
rolisz commented Mar 30, 2013

I pushed the fixes for your suggestions.

@larsmans
Member

Made a few minor fixes to the docs and pushed as bc67dbf. Thanks!

@larsmans larsmans closed this Mar 30, 2013