[WIP] Add k-means to Nystroem Approximation #2591

Open
wants to merge 3 commits into
from

7 participants

@rootlesstree

Hi,
My name is Patrick and I have been using scikit-learn for a while, but this is my first time contributing. I have added k-means sampling to the Nystroem method, which is listed in the issues. Thanks!

Best,
Patrick

@jakevdp jakevdp and 1 other commented on an outdated diff Nov 14, 2013
sklearn/kernel_approximation.py
@@ -410,7 +410,7 @@ class Nystroem(BaseEstimator, TransformerMixin):
sklearn.metric.pairwise.kernel_metrics : List of built-in kernels.
"""
def __init__(self, kernel="rbf", gamma=None, coef0=1, degree=3,
- kernel_params=None, n_components=100, random_state=None):
+ kernel_params=None, n_components=100, random_state=None,opts="k_means"):
@jakevdp
scikit-learn member
jakevdp added a line comment Nov 14, 2013

The default should probably be "random", so that behavior is backward-compatible.

@rootlesstree
rootlesstree added a line comment Nov 15, 2013

I personally don't think it will do any harm, since it's just an approximation of the kernel. A more exact approximation should give better results, and according to the papers mentioned in issue #1568, this approximation is more accurate than random sampling.

@jakevdp
scikit-learn member
jakevdp added a line comment Nov 15, 2013

OK - in that case, we should note this in the change log. The bigger issue is cases where this may give different results than older versions. That being said, if k-means is the best approach, it may make sense to change the default.
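For concreteness, here is a minimal sketch of a backward-compatible signature, combining the default suggested here with the basis_method name proposed later in this thread (illustrative only, not the code in this PR):

def __init__(self, kernel="rbf", gamma=None, coef0=1, degree=3,
             kernel_params=None, n_components=100, random_state=None,
             basis_method="random"):
    # "random" keeps the pre-existing behaviour; "k_means" opts in to the
    # new sampling strategy.
    ...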

@jakevdp
scikit-learn member

Hi - welcome and thanks for the submission!

I have a couple of questions about what's going on here, but first a few general-level things that need to happen before this can be considered for merge:

  • The new parameter needs to be documented. This would happen in the RBFSampler class-level doc string.
  • You should run your code through the PEP8 utility: some of it might seem a bit excessive, but it really helps maintain a consistent and readable style throughout the package.
  • The new parameter should also be tested (take a look at sklearn/tests/ for examples of how these tests are written)

Now for some specific questions:

  1. I'm not extremely familiar with this algorithm -- could you say a couple words about what this enhancement is intended to do?
  2. The component_indices_ feature is set in a strange way for the K means option. Is this compatible with the old meaning of component_indices_?

If you could link to the issue where this is mentioned, that would also help (just put # followed by the issue number in a comment, and this link will be created automatically by github.)

Again, thanks for contributing!

@rootlesstree

Hi,
Thanks for your comments!! The issue is #1568, and I will try to answer your questions:

  1. I'm not extremely familiar with this algorithm -- could you say a couple words about what this enhancement is intended to do?

A: This enhancement selects better columns for use in the Nystroem approximation. The reference paper suggests that using k-means centers outperforms random sampling.

  2. The component_indices_ feature is set in a strange way for the K means option. Is this compatible with the old meaning of component_indices_?

A: component_indices_ records the indices of the selected columns used in the approximation; however, the k-means method uses cluster centers for the approximation, and these points cannot be found in the original input array, so component_indices_ is not defined for k-means (see the small illustration after this list).

  3. The new parameter needs to be documented. This would happen in the RBFSampler class-level doc string.

A: Sure, I will do this later.

  4. You should run your code through the PEP8 utility: some of it might seem a bit excessive, but it really helps maintain a consistent and readable style throughout the package.

A: Sorry, I forgot this one; let me check it later too.

  5. The new parameter should also be tested (take a look at sklearn/tests/ for examples of how these tests are written).

A: I have already run the tests in tests/test_kernel_approximation.py and tests/test_common.py, and both of them pass, so I think the functionality is correct.
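As a small illustration of the component_indices_ point above (made-up data, not code from this PR):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).randn(200, 5)
centers = KMeans(n_clusters=10, random_state=0).fit(X).cluster_centers_

# A cluster center is a mean of several samples, so in general it is not a
# row of X and therefore has no index into the original array.
is_a_row_of_X = (X == centers[0]).all(axis=1).any()
print(is_a_row_of_X)   # almost surely False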

@jakevdp
scikit-learn member

Great! Thanks for the responses.

As far as testing, it would be good to have a test that uses either value of the opts argument. Also, for easier code readability, I think it would be good to use a name like basis_method rather than opts. When you do the documentation, be sure to add a reference to the paper mentioned in the issue.
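A sketch of what the class-level docstring entry could look like once the parameter is renamed (the wording below is illustrative, not taken from the PR):

    basis_method : string, "random" or "k_means", optional (default="random")
        How the basis (landmark) points are chosen: by uniform random
        sampling of training rows, or by running k-means on the training
        data and using the cluster centers, as in Kumar, Mohri and
        Talwalkar (JMLR 2012).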

This is going to be a nice contribution!

@rootlesstree

Sorry, I have a rookie question :P. If I have updated the code, do I need to open a new pull request, or can I change the code somewhere? Thanks!!

And I have a few more technical questions:

  1. You mentioned "As far as testing, it would be good to have a test that uses either value of the opts argument."

     I have made k_means the default now, which means all the Nystroem tests in the test file use k_means; do I still need to pass the argument explicitly?

  2. Another point: "The new parameter needs to be documented. This would happen in the RBFSampler class-level doc string."

     I thought this was about the basis within Nystroem, not RBFSampler, so I should add the documentation to the Nystroem class-level doc string, right?

Thanks !!!

@rolisz

To commit the new changes you made to your code, you can simply push to the same branch on your GitHub repo and it will show up here.

@rootlesstree

Thanks for the explanation!

@coveralls

Coverage Status

Coverage remained the same when pulling dedb8ed on rootlesstree:master into a8089b0 on scikit-learn:master.

@jakevdp jakevdp and 1 other commented on an outdated diff Nov 18, 2013
sklearn/kernel_approximation.py
+            rnd = check_random_state(self.random_state)
+            inds = rnd.permutation(n_samples)
+            basis_inds = inds[:n_components]
+            basis = X[basis_inds]
+            self.component_indices_ = inds
+
+        elif (self.basis_method == "k_means"):
+
+            km = KMeans(n_clusters=n_components, init='k-means++',
+                        max_iter=1000, n_init=5)
+            km.fit(X)
+            centers = km.cluster_centers_
+            basis = centers[centers[:, 0].argsort()]
+            # If we are using k_means centers as input, cannot record basis_inds
+
+            self.component_indices_ = -1
@jakevdp
scikit-learn member
jakevdp added a line comment Nov 18, 2013

We need to raise a ValueError here if the basis method is invalid. Currently it just passes through without setting the basis, which would lead to a runtime error.
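A minimal sketch of the validation being suggested, assuming the PR's basis_method parameter and its two values (hypothetical helper, not the final code):

def _check_basis_method(basis_method):
    # Fail early with a clear message instead of hitting an undefined
    # `basis` further down in fit().
    allowed = ("random", "k_means")
    if basis_method not in allowed:
        raise ValueError("basis_method should be one of %r, got %r"
                         % (allowed, basis_method))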

@rootlesstree
rootlesstree added a line comment Nov 18, 2013

Understood! Thanks for pointing that out!

@jakevdp
scikit-learn member

You mentioned "As far as testing, it would be good to have a test that uses either value of the opts argument."
I have made k_means the default now, which means all the Nystroem tests in the test file use k_means; do I still need to pass the argument explicitly?

Yes, but "random" is still an option, so we need to test it (for example, what if someone inadvertently breaks the code in the "random" section? The tests need to tell us that happened). Regarding the default, I still think we need some discussion about what the default should be. We can't just change what the code does without letting the user know: someone's results might silently change between versions, and that's not a good thing.
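A hypothetical test along these lines, assuming the PR's basis_method keyword on Nystroem (this is a sketch, not an existing test):

import numpy as np
from sklearn.kernel_approximation import Nystroem

def test_nystroem_basis_methods():
    rng = np.random.RandomState(0)
    X = rng.uniform(size=(40, 4))
    for basis_method in ("random", "k_means"):
        # Both code paths should produce a transform of the expected shape.
        model = Nystroem(n_components=10, basis_method=basis_method,
                         random_state=0)
        X_transformed = model.fit_transform(X)
        assert X_transformed.shape == (40, 10)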

If you think switching to K Means by default is the right way to go, we should open up that discussion on the mailing list once this PR is mostly ready to merge.

Another point: "The new parameter needs to be documented. This would happen in the RBFSampler class-level doc string."
I thought this was about the basis within Nystroem, not RBFSampler, so I should add the documentation to the Nystroem class-level doc string, right?

Yep, you're right. Sorry about that.

@jakevdp jakevdp and 1 other commented on an outdated diff Nov 18, 2013
sklearn/kernel_approximation.py
+            rnd = check_random_state(self.random_state)
+            inds = rnd.permutation(n_samples)
+            basis_inds = inds[:n_components]
+            basis = X[basis_inds]
+            self.component_indices_ = inds
+
+        elif (self.basis_method == "k_means"):
+
+            km = KMeans(n_clusters=n_components, init='k-means++',
+                        max_iter=1000, n_init=5)
+            km.fit(X)
+            centers = km.cluster_centers_
+            basis = centers[centers[:, 0].argsort()]
+            # If we are using k_means centers as input, cannot record basis_inds
+
+            self.component_indices_ = -1
         if False:
             basis_kernel = self.kernel(basis, basis)
@jakevdp
scikit-learn member
jakevdp added a line comment Nov 18, 2013

I know you didn't add this @rootlesstree, but this is weird. @larsmans - the git blame says you committed this if False block. Is there a reason to leave these two lines in?

@jakevdp
scikit-learn member
jakevdp added a line comment Nov 18, 2013

Also another similar block later in the file...

@larsmans
scikit-learn member
larsmans added a line comment Nov 18, 2013

Gone in af43c88.

@rootlesstree

Thanks for the comments again. I will update the testing part as soon as possible. As for the discussion, I agree with your suggestion, and I will write up my arguments more rigorously then.

@jakevdp
scikit-learn member

It looks like @larsmans changed some relevant code in commit af43c88

To make this easier to review within that context, could you merge recent changes to master into this branch?

@rootlesstree

I have updated the new changes, is it okay to discuss the default basis_method now?

@jakevdp
scikit-learn member

Hi - it looks like you duplicated the changes made to master, rather than updating from master. That means that there are conflicts, and now this PR can't be merged without rebase.

If you could pull in the changes from master and fix the conflicts, that would be helpful in reviewing this.

@amueller
scikit-learn member

Just a quick comment on the PR: I played with this and didn't find a setting where using KMeans made the algorithm better or faster. Can you provide an example where it actually helps?

Sorry, I'm drowning in work currently.
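One way to check this question empirically, without relying on the PR's API, is to compare the kernel reconstruction error of a uniformly sampled basis against k-means centers (the data and settings below are arbitrary, chosen only for illustration):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)
gamma, n_components = 0.1, 50
K = rbf_kernel(X, gamma=gamma)

def reconstruction_error(basis):
    # Standard Nystroem reconstruction: K ~= C W^+ C^T
    C = rbf_kernel(X, basis, gamma=gamma)
    W = rbf_kernel(basis, gamma=gamma)
    return np.linalg.norm(K - C.dot(np.linalg.pinv(W)).dot(C.T), 'fro')

uniform_basis = X[rng.permutation(len(X))[:n_components]]
kmeans_basis = KMeans(n_clusters=n_components, random_state=0).fit(X).cluster_centers_
print("uniform:", reconstruction_error(uniform_basis))
print("k-means:", reconstruction_error(kmeans_basis))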

@amueller
scikit-learn member

Any news on this?

@dougalsutherland

I haven't tried this code in particular, but k-means often gives way better results than random selection; see e.g. Table 2 in Kumar, Mohri and Talwalkar, Sampling Methods for the Nyström Method, JMLR 2012.

I might recommend changing the value for basis_method from "random" to "uniform" or something, since there are also many non-uniform random methods.

Also, my hunch is that 1000 iterations and 5 inits on the k-means is probably unnecessary, especially if you're initializing with kmeans++; it just needs a reasonable solution, not the best possible one.

While I'm looking at the Nyström code anyway: normalization_ is computed with np.dot(U * 1. / np.sqrt(S), V), which is problematic if your kernel matrix is singular or near-singular. I think it's more typical to use a pseudoinverse, in which case it could be something like (for a value of rcond like 1e-15, which np.linalg.pinv uses):

good_eigs = S > rcond
S[~good_eigs] = 0
S[good_eigs] = 1 / np.sqrt(S[good_eigs])

...or some faster way to do the same thing, though I can't think of one. NumPy does it with an explicit Python loop, but that seems wrong.
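For concreteness, a sketch of that pseudoinverse-style normalization applied to the SVD of the basis kernel, using a relative cutoff as np.linalg.pinv does (this is a suggestion, not the current scikit-learn code):

import numpy as np

def safe_normalization(basis_kernel, rcond=1e-15):
    U, S, V = np.linalg.svd(basis_kernel)
    good = S > rcond * S.max()
    inv_sqrt = np.zeros_like(S)
    inv_sqrt[good] = 1.0 / np.sqrt(S[good])
    # Replaces np.dot(U * 1. / np.sqrt(S), V); near-zero eigenvalues are
    # dropped instead of blowing up.
    return np.dot(U * inv_sqrt, V)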
