Updated K-means clustering for Nystroem #3126

nateyoder · 2014-05-02T00:27:44Z

I wanted to try K-means clustering as the basis for the Nystroem approximation, and pull request #2591 appeared to be stalled, so I created a slightly modified version. I also tried to address @amueller's comment about the effectiveness of the method by including it in the plot_kernel_approximation example, and @dougalsutherland's comment about the possible singularity of the sub-sampled kernel matrix by using the same approach scipy uses in pinv2.

Since this is my first commit to the project (hopefully the first of many), any feedback or suggestions would be appreciated.
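For readers following along, here is a minimal NumPy sketch of the idea, not the code in this PR. It assumes an RBF kernel, a handful of Lloyd iterations from a random start for the k-means basis, and a relative cutoff (`rcond`, a hypothetical parameter name) on small singular values so a singular basis kernel does not blow up. The helper names `kmeans_centers` and `nystroem_features` are made up for illustration.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """RBF kernel matrix exp(-gamma * ||a - b||^2) between rows of A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def kmeans_centers(X, k, n_iter=5, seed=0):
    """A few Lloyd iterations from a random start; returns k centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = (X ** 2).sum(1)[:, None] + (centers ** 2).sum(1)[None, :] - 2.0 * X @ centers.T
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous center
                centers[j] = members.mean(axis=0)
    return centers


def nystroem_features(X, basis, gamma, rcond=1e-10):
    """Map X to features Z so that Z @ Z.T approximates the full kernel.

    Singular values of the basis kernel below rcond * max are dropped
    (their inverse is set to zero), as in a truncated pseudo-inverse.
    """
    K_bb = rbf_kernel(basis, basis, gamma)
    U, S, Vt = np.linalg.svd(K_bb)
    keep = S > rcond * S.max()
    inv_sqrt = np.zeros_like(S)
    inv_sqrt[keep] = 1.0 / np.sqrt(S[keep])
    normalization = (U * inv_sqrt) @ Vt  # pseudo-inverse square root of K_bb
    return rbf_kernel(X, basis, gamma) @ normalization.T


# Compare a random basis with a k-means basis on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
gamma = 0.1
K = rbf_kernel(X, X, gamma)
random_basis = X[rng.choice(len(X), size=20, replace=False)]
kmeans_basis = kmeans_centers(X, k=20)
for name, basis in [("random", random_basis), ("kmeans", kmeans_basis)]:
    Z = nystroem_features(X, basis, gamma)
    rel_err = np.linalg.norm(K - Z @ Z.T) / np.linalg.norm(K)
    print(name, rel_err)
```

The synthetic data and `gamma` value are arbitrary, so the printed errors say nothing about real datasets; the point is only that the basis can be any set of points, sampled or clustered.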


coveralls · 2014-05-02T00:38:25Z

Coverage remained the same when pulling 0b139b4 on nateyoder:kmeans-nystroem into 48e2b13 on scikit-learn:master.

amueller · 2014-05-03T15:07:56Z

Hi @nateyoder.
Thanks for tackling this. Could you maybe post the plot from the example?
Have you experimented with some datasets and seen an improvement?

Cheers,
Andy

amueller · 2014-05-03T15:11:15Z

doc/modules/kernel_approximation.rst

@@ -35,9 +35,15 @@ Nystroem Method for Kernel Approximation
 The Nystroem method, as implemented in :class:`Nystroem` is a general method
 for low-rank approximations of kernels. It achieves this by essentially subsampling
 the data on which the kernel is evaluated.
+The subsampling methodology used to generate the approximate kernel is specified by
+the parameter ``basis_method`` which can either be ``random`` or ``clustered``.


I would call it kmeans instead of clustered, to be more specific.

Maybe also basis_sampling or basis_selection?

Great suggestions. They are incorporated in the new version.

nateyoder · 2014-05-03T21:47:47Z

As for performance, it seems to help a bit, but not quite as much as I had hoped. I think the difference would be larger if the random selection method happened to select an outlier as part of the basis sampling set, but I didn't try different random seeds to make that occur.

coveralls · 2014-05-03T21:52:18Z

Coverage remained the same when pulling 5f313f8 on nateyoder:kmeans-nystroem into 48e2b13 on scikit-learn:master.

nateyoder · 2014-05-12T18:22:18Z

Sorry, I accidentally deleted the branch, and I think doing so closed the issue. Sorry!!

amueller · 2014-05-12T20:18:51Z

Have you tried it on a different dataset? This above is digits, right? Maybe try MNIST? Or is there some other dataset where RBF works well?

amueller · 2014-05-12T20:19:46Z

I think this should help but I also think we should make sure that it actually does ;)

ogrisel · 2014-05-13T14:29:15Z

> Have you tried it on a different dataset? This above is digits, right? Maybe try MNIST? Or is there some other dataset where RBF works well?

You could also try on Olivetti faces with RandomizedPCA preprocessing: http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html

To try on a bigger dataset you can use LFW instead of Olivetti.

nateyoder · 2014-05-13T20:08:32Z

Sounds great guys, thanks for the suggestions. I'll give them a shot this week and post the results.

Also, I noticed my build failed, but it failed because of errors in OrthogonalMatchingPursuitCV. Do you guys know if this is an intermittent test or something I should look into?

ogrisel · 2014-05-15T08:13:34Z

The travis failure is unrelated, you can ignore it.

nateyoder · 2014-07-16T18:24:30Z

Sorry for the long layoff guys.

Finally got a chance to run amueller's MNIST example with k-means and random. As the graph shows, k-means gives some minor improvement, but nothing big. However, since it almost always seems to be a little better in the examples I tried, it might still be worth adding?

I briefly tried Olivetti, but I think because of the limited number of faces I saw a lot of variance in the output and didn't get anything useful, other than that k-means definitely isn't a silver bullet. I didn't have time to look into LFW.

kastnerkyle · 2014-07-17T09:50:19Z

It seems consistent from the little I have seen thus far - I will try to run some tests as well. Looks pretty nice!

ogrisel · 2014-07-18T22:02:16Z

Thanks for the bench on mnist. It would be great to run the same on lfw and
covertype.

djsutherland · 2014-07-19T19:45:37Z

At first these results seemed at odds to me with the MNIST line in Table 2 of Kumar, Mohri and Talwalkar, Sampling Methods for the Nyström Method, JMLR 2012. But actually, that table is showing the kernel reconstruction "accuracy" $\|K - K_k\|_F / \|K - \tilde{K}_k\|_F \times 100$, where $K_k$ is the optimal rank-$k$ reconstruction (the truncated SVD), and $\tilde{K}_k$ is the rank-$k$ Nyström approximation. I guess the kernel isn't as well-approximated by the uniform reconstruction, but it's still good enough to do classification with. Might be good to make sure that's the case.
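To make that ratio concrete, here is a rough NumPy sketch of how it could be computed. It assumes an RBF kernel on synthetic data and uniform column sampling for the Nyström part; the function names (`best_rank_k`, `nystroem_rank_k`, `reconstruction_accuracy`) and all sizes are made up for illustration and are not taken from the paper or from this PR.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """RBF kernel matrix exp(-gamma * ||a - b||^2) between rows of A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def top_k_eig(M, k):
    """Largest k eigenpairs of a symmetric matrix, in descending order."""
    w, V = np.linalg.eigh(M)
    order = np.argsort(w)[::-1][:k]
    return w[order], V[:, order]


def best_rank_k(K, k):
    """Optimal rank-k approximation K_k of a symmetric PSD matrix."""
    w, V = top_k_eig(K, k)
    return (V * w) @ V.T


def nystroem_rank_k(C, W, k, rcond=1e-12):
    """Rank-k Nystroem approximation C @ pinv(W_k) @ C.T.

    C holds the sampled columns of K and W the matching square block.
    """
    w, V = top_k_eig(W, k)
    keep = w > rcond * w.max()  # skip numerically zero eigenvalues
    W_k_pinv = (V[:, keep] / w[keep]) @ V[:, keep].T
    return C @ W_k_pinv @ C.T


def reconstruction_accuracy(K, K_k, K_tilde_k):
    """||K - K_k||_F / ||K - K_tilde_k||_F * 100 (at most 100 for rank-k inputs)."""
    return 100.0 * np.linalg.norm(K - K_k, "fro") / np.linalg.norm(K - K_tilde_k, "fro")


rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
K = rbf_kernel(X, X, gamma=0.2)
idx = rng.choice(len(X), size=30, replace=False)  # uniform column sampling
C, W = K[:, idx], K[np.ix_(idx, idx)]
k = 10
acc = reconstruction_accuracy(K, best_rank_k(K, k), nystroem_rank_k(C, W, k))
print(f"relative accuracy: {acc:.1f}")
```

Because the truncated eigendecomposition is the best rank-$k$ approximation in Frobenius norm, the ratio cannot exceed 100; values near 100 mean the Nyström approximation is close to optimal for that rank.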

Also, it might be better to use kmeans++ initialization rather than random; did you try that?
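For reference, the basic k-means++ seeding rule can be sketched in a few lines of NumPy. This is the plain textbook version (each new seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far); library implementations may add refinements, and the function name here is hypothetical. It assumes the data contain at least `k` distinct points.

```python
import numpy as np


def kmeans_plus_plus_init(X, k, seed=0):
    """Pick k rows of X as initial centers using D^2-weighted sampling."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = np.empty((k, X.shape[1]))
    centers[0] = X[rng.integers(n)]  # first seed: uniform at random
    closest_sq = ((X - centers[0]) ** 2).sum(axis=1)
    for j in range(1, k):
        # Points far from every existing seed are more likely to be picked;
        # already-chosen points have distance 0 and so probability 0.
        probs = closest_sq / closest_sq.sum()
        centers[j] = X[rng.choice(n, p=probs)]
        new_sq = ((X - centers[j]) ** 2).sum(axis=1)
        closest_sq = np.minimum(closest_sq, new_sq)
    return centers


rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
centers = kmeans_plus_plus_init(X, k=5)
print(centers)
```

The returned seeds would then be handed to the usual Lloyd iterations in place of a uniformly random start.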

nateyoder · 2014-07-29T17:25:42Z

Brief update. I ran MNIST again to compare "better" clustering with k-means [KMeans++ initialization, max_iter=300, and n_init=10] vs. k-means as suggested in the literature ['random' initialization, max_iter=5, n_init=1] vs. random Nystroem. As shown below, the much more time-intensive clustering has almost no impact on the classification performance while significantly increasing the time needed to train the model.

I also did the same on LFW, and the results are below. In this case k-means appears to offer little to no consistent improvement over random selection. If you are interested, I used the parameters found in http://nbviewer.ipython.org/github/jakevdp/sklearn_scipy2013/blob/master/rendered_notebooks/05.1_application_to_face_recognition.ipynb other than doing my own RBF grid search to find the optimal RBF parameters.

I'll try to do the covertype test later this week if I get time and you guys think it is still needed.

ogrisel · 2014-07-29T21:59:08Z

Can you please rebase your branch on master and try with MiniBatchKMeans? This might be faster to converge while giving good enough centroids.
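As an illustration of why mini-batch k-means can be cheaper, here is a from-scratch NumPy sketch of the mini-batch update rule (per-center counts with a 1/count step size). It is not scikit-learn's implementation, and the function name, `batch_size`, and `n_steps` values are assumptions chosen only for the example.

```python
import numpy as np


def minibatch_kmeans_centers(X, k, batch_size=64, n_steps=100, seed=0):
    """Approximate k-means centroids from small random batches of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    counts = np.zeros(k)
    batch_size = min(batch_size, len(X))
    for _ in range(n_steps):
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign every batch point to its nearest current center.
        d = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for x, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center step size shrinks over time
            centers[j] = (1.0 - eta) * centers[j] + eta * x
    return centers


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
centers = minibatch_kmeans_centers(X, k=10)
print(centers.shape)
```

Each step touches only `batch_size` points instead of the full dataset, which is where the time saving would come from; whether the resulting centroids are good enough as a Nystroem basis is exactly what would need benchmarking.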

mth4saurabh · 2015-12-14T08:51:03Z

EDIT - I will post plots and numbers soon.
@amueller @ogrisel, I extracted and used a portion of the MiniBatchKMeans class (on top of the work done in this PR). As expected, we can improve on time, but performance takes a hit for low dimensions.

amueller · 2016-09-14T20:20:26Z

hm... this actually looks good. @mth4saurabh any chance you are still interested in working on this?

mth4saurabh · 2016-09-15T19:55:12Z

@amueller : sure, would love to; will start on monday.

haiatn · 2023-08-25T21:37:11Z

Since #1568 was marked as completed with no evidence that K-means would be a better approach, I think we can close this. If someone finds a better method, they could open a new PR to solve issue #4982.

nateyoder added 2 commits May 1, 2014 17:06

Add k-means clustering to Nystroem kernel approximation method; and i…

e4aed09

…mplement it in plot_kernel_approximation example to show difference

Deal with kernel matrix singularity in Nystroem kernel approximation

0b139b4

Fix error message formatting issue on Python 2.6

e7bec1e

nateyoder changed the title ~~Implemented~~ Updated K-means clustering for Nystroem May 2, 2014

amueller reviewed May 3, 2014

change basis_metod to basis_sampling and clustered to kmeans

5f313f8

nateyoder closed this May 12, 2014

nateyoder deleted the kmeans-nystroem branch May 12, 2014 18:19

nateyoder restored the kmeans-nystroem branch May 12, 2014 18:20

nateyoder reopened this May 12, 2014

larsmans force-pushed the master branch from 58a55ad to 4b82379 Compare August 25, 2014 21:50

MechCoder force-pushed the master branch from 6deaea0 to 3f49cee Compare November 3, 2014 12:36

amueller mentioned this pull request Jul 20, 2015

Custom indices for Nystroem approximation and other kernel methods #4982

Open

amueller added the Needs Benchmarks label Aug 5, 2019

cmarmo added the help wanted label Sep 21, 2020

Base automatically changed from master to main January 22, 2021 10:48

thomasjpfan mentioned this pull request Apr 14, 2022

[WIP] Add k-means to Nystroem Approximation #2591

Closed

adrinjalali closed this Mar 6, 2024
