[WIP] Add k-means to Nystroem Approximation #2591

Open
wants to merge 3 commits into scikit-learn:master from rootlesstree:master

7 participants

rootlesstree Jake Vanderplas Roland Szabo Coveralls Andreas Mueller Dougal J. Sutherland Lars
rootlesstree

Hi,
My name is Patrick, and I have been using scikit-learn for a while, but this is my first time contributing. I have added k-means sampling to the Nystroem method, which is listed in the issues. Thanks!

Best,
Patrick

sklearn/kernel_approximation.py
@@ -410,7 +410,7 @@ class Nystroem(BaseEstimator, TransformerMixin):
sklearn.metric.pairwise.kernel_metrics : List of built-in kernels.
"""
def __init__(self, kernel="rbf", gamma=None, coef0=1, degree=3,
- kernel_params=None, n_components=100, random_state=None):
+ kernel_params=None, n_components=100, random_state=None,opts="k_means"):
Jake Vanderplas Collaborator
jakevdp added a note

The default should probably be "random", so that behavior is backward-compatible.

I personally don't think it will do any harm, as it's just an approximation of the kernel. A more exact approximation should give a better result, and according to the papers mentioned in issue #1568, this approximation is more accurate than random sampling.

Jake Vanderplas Collaborator
jakevdp added a note

OK - in that case, we should note this in the change log. The bigger issue is in cases where there may be different results between this and older versions. That being said, if K Means is the best approach, it may make sense to change it.

Jake Vanderplas
Collaborator

Hi - welcome and thanks for the submission!

I have a couple of questions about what's going on here, but first a few general things before this can be considered for merge:

  • The new parameter needs to be documented. This would happen in the RBFSampler class-level doc string.
  • You should run your code through the PEP8 utility: some of it might seem a bit excessive, but it really helps maintain a consistent and readable style throughout the package.
  • The new parameter should also be tested (take a look at sklearn/tests/ for examples of how these tests are written)

Now for some specific questions:

  1. I'm not extremely familiar with this algorithm -- could you say a couple words about what this enhancement is intended to do?
  2. The component_indices_ feature is set in a strange way for the K means option. Is this compatible with the old meaning of component_indices_?

If you could link to the issue where this is mentioned, that would also help (just put # followed by the issue number in a comment, and the link will be created automatically by GitHub).

Again, thanks for contributing!

rootlesstree

Hi,
Thanks for your comments! The issue is #1568, and I will try to answer each point:

  1. I'm not extremely familiar with this algorithm -- could you say a couple words about what this enhancement is intended to do?

A: This enhancement selects better columns for the Nystroem approximation. The reference paper suggests that using k-means centers outperforms random sampling.

  2. The component_indices_ feature is set in a strange way for the K means option. Is this compatible with the old meaning of component_indices_?

A: component_indices_ records the indices of the selected columns used in the approximation; however, the k-means method uses cluster centers for the approximation, and those points cannot be found in the original input array, so component_indices_ is not defined for k-means (see the sketch after this list).

  3. The new parameter needs to be documented. This would happen in the RBFSampler class-level doc string.

A: Sure, I will do this later.

  4. You should run your code through the PEP8 utility: some of it might seem a bit excessive, but it really helps maintain a consistent and readable style throughout the package.

A: Sorry, I forgot this one; let me check it later too.

  5. The new parameter should also be tested (take a look at sklearn/tests/ for examples of how these tests are written)

A: I have already run the tests in tests/test_kernel_approximation.py and tests/test_common.py, and both passed, so I assume the functionality is correct.
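
To illustrate the difference, here is a minimal sketch (basis_method and the -1 sentinel come from this PR, not from a released scikit-learn API):

import numpy as np
from sklearn.kernel_approximation import Nystroem

X = np.random.RandomState(0).uniform(size=(100, 4))

# "random" selects actual rows of X, so their indices can be recorded.
nys_random = Nystroem(n_components=10, basis_method="random",
                      random_state=0).fit(X)
print(nys_random.component_indices_[:10])  # indices into X

# "k_means" uses cluster centers, which are generally not rows of X,
# so there are no indices to record (this PR stores -1 instead).
nys_km = Nystroem(n_components=10, basis_method="k_means").fit(X)
print(nys_km.component_indices_)  # -1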

Jake Vanderplas
Collaborator

Great! Thanks for the responses.

As far as testing, it would be good to have a test that uses either value of the opts argument. Also, for easier code readability, I think it would be good to use a name like basis_method rather than opts. When you do the documentation, be sure to add a reference to the paper mentioned in the issue.

This is going to be a nice contribution!
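
For example, a test along these lines would exercise both values (a sketch, assuming the basis_method name suggested above):

import numpy as np
from numpy.testing import assert_equal
from sklearn.kernel_approximation import Nystroem

def test_nystroem_basis_methods():
    # Both basis methods should produce a transform of the right shape.
    X = np.random.RandomState(0).uniform(size=(10, 4))
    for method in ("random", "k_means"):
        trans = Nystroem(n_components=2, basis_method=method,
                         random_state=0)
        X_transformed = trans.fit(X).transform(X)
        assert_equal(X_transformed.shape, (X.shape[0], 2))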

rootlesstree

Sorry, I have a rookie question :P If I update the code, do I need to open a new pull request, or can I change the code somewhere? Thanks!!

And I have a few more technical questions:

  1. You mentioned "As far as testing, it would be good to have a test that uses either value of the opts argument."

    I have made k_means the default now, which means all the Nystroem tests in the test file use k_means. Do I still need to call it with an explicit argument?

  2. You also mentioned "The new parameter needs to be documented. This would happen in the RBFSampler class-level doc string."

    I thought this was about the basis within Nystroem, not RBFSampler, so should I add the documentation to the Nystroem class-level doc string instead?

Thanks !!!

Roland Szabo

To commit the new changes you made to your code, you can simply push to the same branch on your GitHub repo and it will show up here.

rootlesstree

Thanks for the explanation!

Coveralls


Coverage remained the same when pulling dedb8ed on rootlesstree:master into a8089b0 on scikit-learn:master.

sklearn/kernel_approximation.py
((11 lines not shown))
+ rnd = check_random_state(self.random_state)
+ inds = rnd.permutation(n_samples)
+ basis_inds = inds[:n_components]
+ basis = X[basis_inds]
+ self.component_indices_ = inds
+
+ elif (self.basis_method == "k_means"):
+
+ km = KMeans(n_clusters=n_components, init='k-means++',
+ max_iter=1000, n_init=5)
+ km.fit(X)
+ centers = km.cluster_centers_
+ basis = centers[centers[:, 0].argsort()]
+ #If we are using k_means centers as input, cannot record basis_inds
+
+ self.component_indices_ = -1
Jake Vanderplas Collaborator
jakevdp added a note

We need to raise a ValueError here if the basis method is invalid. Currently it just passes through without setting the basis: it would lead to a runtime error.

Understood! Thanks for the notification!
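
A minimal sketch of the validation being requested (names taken from this PR's diff):

if self.basis_method == "random":
    ...  # random column sampling
elif self.basis_method == "k_means":
    ...  # k-means centers
else:
    raise ValueError("Invalid basis_method: %r" % self.basis_method)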

Jake Vanderplas
Collaborator

You mentioned "As far as testing, it would be good to have a test that uses either value of the opts argument. "
I have made k_means the default now, which means all the Nystroem tests in the test file use k_means. Do I still need to call it with an explicit argument?

Yes, but "random" is still an option, so we need to test it (for example, what if someone inadvertently breaks the code in the "random" section? The tests need to tell us that happened). Regarding the default, I still think we need some discussion about what the default should be. We can't just change what the code does without letting the user know: someone's results might silently change between versions, and that's not a good thing.

If you think switching to K Means by default is the right way to go, we should open up that discussion on the mailing list once this PR is mostly ready to merge.

You also mentioned "The new parameter needs to be documented. This would happen in the RBFSampler class-level doc string."
I thought this was about the basis within Nystroem, not RBFSampler, so should I add the documentation to the Nystroem class-level doc string instead?

Yep, you're right. Sorry about that.

sklearn/kernel_approximation.py
((11 lines not shown))
+ rnd = check_random_state(self.random_state)
+ inds = rnd.permutation(n_samples)
+ basis_inds = inds[:n_components]
+ basis = X[basis_inds]
+ self.component_indices_ = inds
+
+ elif (self.basis_method == "k_means"):
+
+ km = KMeans(n_clusters=n_components, init='k-means++',
+ max_iter=1000, n_init=5)
+ km.fit(X)
+ centers = km.cluster_centers_
+ basis = centers[centers[:, 0].argsort()]
+ #If we are using k_means centers as input, cannot record basis_inds
+
+ self.component_indices_ = -1
if False:
basis_kernel = self.kernel(basis, basis)
Jake Vanderplas Collaborator
jakevdp added a note

I know you didn't add this @rootlesstree, but this is weird. @larsmans - the git blame says you committed this if False block. Is there a reason to leave these two lines in?

Jake Vanderplas Collaborator
jakevdp added a note

Also another similar block later in the file...

Lars Owner

Gone in af43c88.

rootlesstree

Thanks again for the comments. I will update the testing part ASAP. As for the discussion, I agree with your suggestion, and I will write down my arguments more rigorously then.

Jake Vanderplas
Collaborator

It looks like @larsmans changed some relevant code in commit af43c88.

To make this easier to review within that context, could you merge recent changes to master into this branch?

rootlesstree2 added commit "2nd Update" (1ff1018)
rootlesstree

I have pushed the new changes. Is it okay to discuss the default basis_method now?

Jake Vanderplas
Collaborator

Hi - it looks like you duplicated the changes made to master rather than updating from master. That means there are conflicts, and this PR can't be merged without a rebase.

If you could pull in the changes from master and fix the conflicts, that would be helpful in reviewing this.

Andreas Mueller
Owner

Just a quick comment on the PR: I played with this and didn't find a setting where using KMeans made the algorithm better or faster. Can you provide an example where it actually helps?

Sorry, I'm drowning in work currently.
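
One way to answer this empirically is to compare the relative kernel reconstruction error of both options (a sketch on synthetic data; basis_method is the parameter proposed in this PR):

import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.RandomState(0).normal(size=(500, 10))
K = rbf_kernel(X)  # exact kernel matrix, for reference

for method in ("random", "k_means"):
    X_t = Nystroem(n_components=50, basis_method=method,
                   random_state=0).fit_transform(X)
    K_approx = np.dot(X_t, X_t.T)
    rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
    print("%s: relative error %.4f" % (method, rel_err))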

Andreas Mueller
Owner

Any news on this?

Dougal J. Sutherland

I haven't tried this code in particular, but k-means often gives way better results than random selection; see e.g. Table 2 in Kumar, Mohri and Talwalkar, Sampling Methods for the Nyström Method, JMLR 2012.

I might recommend changing the value for basis_method from "random" to "uniform" or something, since there are also many non-uniform random methods.

Also, my hunch is that 1000 iterations and 5 inits on the k-means is probably unnecessary, especially if you're initializing with kmeans++; it just needs a reasonable solution, not the best possible one.
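
Concretely, something like this inside fit would likely suffice (a sketch replacing the PR's current call; n_components as defined there, and the exact numbers are a judgment call):

from sklearn.cluster import KMeans

# With k-means++ initialization, one init and a modest iteration cap
# usually reach a reasonable (not optimal) solution, which is all the
# Nystroem basis needs.
km = KMeans(n_clusters=n_components, init='k-means++',
            max_iter=100, n_init=1)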

While I'm looking at the Nyström code anyway: normalization_ is computed with np.dot(U * 1. / np.sqrt(S), V), which is problematic if your kernel matrix is singular or near-singular. I think it's more typical to use a pseudoinverse, in which case it could be something like (for a value of rcond like 1e-15, which np.linalg.pinv uses):

good_eigs = S > rcond * S.max()  # relative cutoff, as np.linalg.pinv uses
S[~good_eigs] = 0
S[good_eigs] = 1 / np.sqrt(S[good_eigs])

...or some faster way to do the same thing; I can't think of one. NumPy does it with an explicit Python loop, but that seems wrong.
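
Put together, a self-contained version of that normalization might look like this (a sketch; the function name is hypothetical, and the relative cutoff mirrors np.linalg.pinv's behavior):

import numpy as np
from scipy.linalg import svd

def nystroem_normalization(basis_kernel, rcond=1e-15):
    # Compute K_mm^{-1/2}: zero out singular values below a relative
    # cutoff instead of dividing by near-zero values.
    U, S, V = svd(basis_kernel)
    good = S > rcond * S.max()
    inv_sqrt_S = np.zeros_like(S)
    inv_sqrt_S[good] = 1.0 / np.sqrt(S[good])
    return np.dot(U * inv_sqrt_S, V)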

Commits on Nov 14, 2013
  1. Add k-means to Nystroem Approximation

    rootlesstree2 authored
Commits on Nov 18, 2013
  1. Finish Improvement

    rootlesstree2 authored
Commits on Nov 19, 2013
  1. 2nd Update

    rootlesstree2 authored
56 sklearn/kernel_approximation.py
@@ -19,6 +19,7 @@
as_float_array)
from .utils.extmath import safe_sparse_dot
from .metrics.pairwise import pairwise_kernels
+from .cluster import KMeans
class RBFSampler(BaseEstimator, TransformerMixin):
@@ -376,6 +377,9 @@ class Nystroem(BaseEstimator, TransformerMixin):
If int, random_state is the seed used by the random number generator;
if RandomState instance, random_state is the random number generator.
+ basis_method : string "k_means" or "random"
+ Use k_means centers or random sampled columns to construct Nystrom
+ Approximation
Attributes
----------
@@ -401,6 +405,9 @@ class Nystroem(BaseEstimator, TransformerMixin):
Comparison",
Advances in Neural Information Processing Systems 2012
+ * K. Zhang, Ivor W. Tsang, James T. Kwok
+ "Improved Nystrom Low-Rank Approximation and Error Analysis"
+ International Conference of Machine Learning (ICML) 2008
See also
--------
@@ -410,7 +417,8 @@ class Nystroem(BaseEstimator, TransformerMixin):
sklearn.metric.pairwise.kernel_metrics : List of built-in kernels.
"""
def __init__(self, kernel="rbf", gamma=None, coef0=1, degree=3,
- kernel_params=None, n_components=100, random_state=None):
+ kernel_params=None, n_components=100, random_state=None,
+ basis_method="k_means"):
self.kernel = kernel
self.gamma = gamma
self.coef0 = coef0
@@ -418,6 +426,7 @@ def __init__(self, kernel="rbf", gamma=None, coef0=1, degree=3,
self.kernel_params = kernel_params
self.n_components = n_components
self.random_state = random_state
+ self.basis_method = basis_method
def fit(self, X, y=None):
"""Fit estimator to data.
@@ -431,7 +440,6 @@ def fit(self, X, y=None):
Training data.
"""
- rnd = check_random_state(self.random_state)
n_samples = X.shape[0]
# get basis vectors
@@ -444,23 +452,38 @@ def fit(self, X, y=None):
else:
n_components = self.n_components
+
n_components = min(n_samples, n_components)
- inds = rnd.permutation(n_samples)
- basis_inds = inds[:n_components]
- basis = X[basis_inds]
- if False:
- basis_kernel = self.kernel(basis, basis)
+ if (self.basis_method == "random"):
+ rnd = check_random_state(self.random_state)
+ inds = rnd.permutation(n_samples)
+ basis_inds = inds[:n_components]
+ basis = X[basis_inds]
+ self.component_indices_ = inds
+
+ elif (self.basis_method == "k_means"):
+
+ km = KMeans(n_clusters=n_components, init='k-means++',
+ max_iter=1000, n_init=5)
+ km.fit(X)
+ centers = km.cluster_centers_
+ basis = centers[centers[:, 0].argsort()]
+ #If we are using k_means centers as input, cannot record basis_inds
+
+ self.component_indices_ = -1
+
else:
- basis_kernel = pairwise_kernels(basis, metric=self.kernel,
- filter_params=True,
- **self._get_kernel_params())
+ raise ValueError("Unknown basis_method: %s" % self.basis_method)
+
+ basis_kernel = pairwise_kernels(basis, metric=self.kernel,
+ filter_params=True,
+ **self._get_kernel_params())
# sqrt of kernel matrix on basis vectors
U, S, V = svd(basis_kernel)
self.normalization_ = np.dot(U * 1. / np.sqrt(S), V)
self.components_ = basis
- self.component_indices_ = inds
return self
def transform(self, X):
@@ -479,14 +502,11 @@ def transform(self, X):
X_transformed : array, shape=(n_samples, n_components)
Transformed data.
"""
+ embedded = pairwise_kernels(X, self.components_,
+ metric=self.kernel,
+ filter_params=True,
+ **self._get_kernel_params())
- if False:
- embedded = self.kernel(X, self.components_)
- else:
- embedded = pairwise_kernels(X, self.components_,
- metric=self.kernel,
- filter_params=True,
- **self._get_kernel_params())
return np.dot(embedded, self.normalization_.T)
def _get_kernel_params(self):
59 sklearn/tests/test_kernel_approximation.py
@@ -127,22 +127,37 @@ def test_nystroem_approximation():
K = rbf_kernel(X)
assert_array_almost_equal(np.dot(X_transformed, X_transformed.T), K)
- trans = Nystroem(n_components=2, random_state=rnd)
- X_transformed = trans.fit(X).transform(X)
- assert_equal(X_transformed.shape, (X.shape[0], 2))
+ trans_k_means = Nystroem(n_components=2, random_state=rnd,
+ basis_method="k_means")
+ trans_random = Nystroem(n_components=2, random_state=rnd,
+ basis_method="random")
+ X_transformed_k_means = trans_k_means.fit(X).transform(X)
+ X_transformed_random = trans_random.fit(X).transform(X)
+ assert_equal(X_transformed_k_means.shape, (X.shape[0], 2))
+ assert_equal(X_transformed_random.shape, (X.shape[0], 2))
# test callable kernel
linear_kernel = lambda X, Y: np.dot(X, Y.T)
- trans = Nystroem(n_components=2, kernel=linear_kernel, random_state=rnd)
- X_transformed = trans.fit(X).transform(X)
- assert_equal(X_transformed.shape, (X.shape[0], 2))
+ trans_k_means = Nystroem(n_components=2, kernel=linear_kernel,
+ random_state=rnd, basis_method="k_means")
+ trans_random = Nystroem(n_components=2, kernel=linear_kernel,
+ random_state=rnd, basis_method="random")
+ X_transformed_k_means = trans_k_means.fit(X).transform(X)
+ X_transformed_random = trans_random.fit(X).transform(X)
+ assert_equal(X_transformed_k_means.shape, (X.shape[0], 2))
+ assert_equal(X_transformed_random.shape, (X.shape[0], 2))
# test that available kernels fit and transform
kernels_available = kernel_metrics()
for kern in kernels_available:
- trans = Nystroem(n_components=2, kernel=kern, random_state=rnd)
- X_transformed = trans.fit(X).transform(X)
- assert_equal(X_transformed.shape, (X.shape[0], 2))
+ trans_k_means = Nystroem(n_components=2, kernel=kern,
+ random_state=rnd, basis_method="k_means")
+ trans_random = Nystroem(n_components=2, kernel=kern,
+ random_state=rnd, basis_method="random")
+ X_transformed_k_means = trans_k_means.fit(X).transform(X)
+ X_transformed_random = trans_random.fit(X).transform(X)
+ assert_equal(X_transformed_k_means.shape, (X.shape[0], 2))
+ assert_equal(X_transformed_random.shape, (X.shape[0], 2))
def test_nystroem_poly_kernel_params():
@@ -151,10 +166,18 @@ def test_nystroem_poly_kernel_params():
X = rnd.uniform(size=(10, 4))
K = polynomial_kernel(X, degree=3.1, coef0=.1)
- nystroem = Nystroem(kernel="polynomial", n_components=X.shape[0],
- degree=3.1, coef0=.1)
- X_transformed = nystroem.fit_transform(X)
- assert_array_almost_equal(np.dot(X_transformed, X_transformed.T), K)
+ nystroem_random = Nystroem(kernel="polynomial", n_components=X.shape[0],
+ degree=3.1, coef0=.1, basis_method="random")
+ nystroem_k_means = Nystroem(kernel="polynomial", n_components=X.shape[0],
+ degree=3.1, coef0=.1, basis_method="k_means")
+
+ X_transformed_k_means = nystroem_k_means.fit_transform(X)
+ X_transformed_random = nystroem_random.fit_transform(X)
+
+ assert_array_almost_equal(np.dot(X_transformed_k_means,
+ X_transformed_k_means.T), K)
+ assert_array_almost_equal(np.dot(X_transformed_random,
+ X_transformed_random.T), K)
def test_nystroem_callable():
@@ -171,5 +194,13 @@ def logging_histogram_kernel(x, y, log):
kernel_log = []
Nystroem(kernel=logging_histogram_kernel,
n_components=(n_samples - 1),
- kernel_params={'log': kernel_log}).fit(X)
+ kernel_params={'log': kernel_log}, basis_method="k_means").fit(X)
+
+ assert_equal(len(kernel_log), n_samples * (n_samples - 1) / 2)
+
+ kernel_log = []
+ Nystroem(kernel=logging_histogram_kernel,
+ n_components=(n_samples - 1),
+ kernel_params={'log': kernel_log}, basis_method="random").fit(X)
+
assert_equal(len(kernel_log), n_samples * (n_samples - 1) / 2)