WIP: Power Iteration Clustering #138

Closed
wants to merge 164 commits into
from

Conversation

Projects
None yet
Owner

ogrisel commented Apr 25, 2011

Here is an early pull request, far from being ready for merging to master (lacks tests, docs, and a class that implements the estimator public API).

The goal is to let other developers know about some ongoing work to implement Power Iteration Clustering, a "variant" of Spectral Clustering that is supposed to be more scalable (according to the authors of the reference paper).

In practice I am not so sure about the tradeoff speed / robustness of the results. We need to implement better clustering metrics (as mentioned in the paper) so as to be able to do a principled comparison and evaluate the performance of the algorithm.

Edit: here is a direct link to the reference paper: http://www.cs.cmu.edu/~wcohen/postscript/icml2010-pic-final.pdf

Feedback greatly appreciated.

Status / TODO before merging

  • find a way to quantitatively evaluate the quality of the clusters(implemented v-measure and adjusted Rand index now in master)
  • investigate with the dependency on the value of the tol hyperparameter
  • write documentation
  • write tests
  • cleanup the convergence plot code (or find a non intrusive, generic API to deal with such convergence monitoring, maybe using callbacks)
  • reproduce convergence results of the paper on the 20 newsgroups dataset

ogrisel and others added some commits Apr 18, 2011

@ogrisel ogrisel started work on Power Iteration Clustering 469187d
@ogrisel ogrisel saving work in progress on PIC 581d88a
@ogrisel ogrisel Merge branch 'master' into power-iteration-clustering c785f68
@ogrisel ogrisel tentative example for spectral clustering d636f5f
@ogrisel ogrisel tentative improvement by changing the window 2bbc656
@ogrisel ogrisel tentative work with KNN connectivity graph as affinity matrix, does n…
…ot work either...
5f44548
@ogrisel ogrisel Merge branch 'spectral-clustering-example' into power-iteration-clust…
…ering
20a71cb
@agramfort agramfort FIX : fixing plot_clustering_toy_2D_circles.py 080c2f2
@ogrisel ogrisel more work on PIC 9a84598
@ogrisel ogrisel dataset cheat to make spectral clustering work 083b640
@ogrisel ogrisel Merge branch 'spectral-clustering-example' into power-iteration-clust…
…ering
e68fa64
@ogrisel ogrisel work in progress debugging the PIC convergence 5f6b085
@ogrisel ogrisel ignore debug images d67247b
@ogrisel ogrisel more explicit variable names 92e1b03
@ogrisel ogrisel compare more methods in the circles example c6d978d
@ogrisel ogrisel ENH: plot improvements stolen from Gael's bag of mpl tricks 8c85586
@ogrisel ogrisel Merge branch 'master' into spectral-clustering-example dc4e270
@ogrisel ogrisel Merge branch 'master' into spectral-clustering-example f0df252
@ogrisel ogrisel merge 3b08b34
@ogrisel ogrisel fixed the convergence criterion to follow the paper recommendation: n…
…ow it works
68ede30
@ogrisel ogrisel typo + TODO a4d839c
@ogrisel ogrisel factoring out row normalization 34e652b
@ogrisel ogrisel factorize out the row normalization logic to reuse in the cosine simi…
…larity metric
b58d9f2
Owner

GaelVaroquaux commented Apr 25, 2011

Looks cool. I don't have time to look at that right now. Could you also send an e-mail on the list: not everybody monitors the pull requests well enough, as we have seen recently.

Owner

ogrisel commented Apr 25, 2011

Done.

Owner

mblondel commented Apr 25, 2011

Power Iteration Clustering looks interesting! Need to read the paper. Two short remarks:

  • inplace_row_normalize overlaps with LengthNormalizer in the preprocessing module (the sparse version is coded in Cython but it may be possible to get rid of it thanks to the scipy.sparse.sparsetools which I didn't know when I wrote LengthNormalizer).
  • The cosine_similarity function uses a singular form but the euclidean_distances function uses a plural form. This needs to be normalized (I prefer singular form).
Owner

mblondel commented Apr 25, 2011

Another name for the cosine_similarity function could be normalized_linear_kernel.

Owner

ogrisel commented Apr 25, 2011

About inplace_row_normalize vs cython normalizers: I know about the redundancy: we need to benchmark b
oth and keep the best. sparsetools is completely undocumented. I found about it by chatting with David during PyCon.

About the plural inconsistency, ok for cosine_similarities. As for normalized_linear_kernel I mentioned the equivalence in the docstring, I think more people will find it under the cosine_similarities name because of the popularity of the vector space model and the geometric interpretation of the cosinus between the angles of the pair of vectors.

Owner

mblondel commented Apr 25, 2011

  • So I'm in favor of using LengthNormalizer in cosine_similarities. Then, depending on the benchmark results, we can either use the sparsetools and get rid of the Cython module, or, continue to use the Cython module. If the sparsetools are written in C, as I except them to be, I suspect that there won't be any major difference between them and the Cython module which means that we will be able to get rid of it.
  • What do you think of an alias normalized_linear_kernel = cosine_similarities? Aliases should usually be avoided but in that case, it makes it obvious that you can pass it as a parameter to objects that support callable kernels.
Owner

GaelVaroquaux commented Apr 25, 2011

What do you think of an alias normalized_linear_kernel = cosine_similarities?

-1. If we really need kernel functions, then let's have a kernel module. And to reply to mblondel's remarks on a kernel API a while ago, it seems to me that, for now, defining a kernel as a callable is good enough.

Owner

mblondel commented Apr 25, 2011

I agree that a kernel API is not a priority. Plus, if we are to use callables, I prefer to leave the kernels in the pairwise module. It's a good way to show the user that kernels are pairwise metrics.

Owner

ogrisel commented Apr 27, 2011

@mblondel: I fixed the normalizer to use the real l1 norm (with absolute values) and did some bench:

http://friendpaste.com/3TwKVt61Eag45X66fpHcpX

Your cython version is significantly faster (esp. if you discard the time of the array copy at the beginning of the bench): I plan to use it and drop the inplace_row_normalizer function but indeed it would make the preprocessing package simpler if we would get rid off the sparse sub package as discussed on the mailing list and merge the Dense And Sparse normalizer variants into a single class.

Also I think we could have a single RowNormalizer transformer class and select between l1 and l2 norm as a parameter of the constructor.

Do you want to work on this? Or should I do it? In any case this work should be done in a new branch forked from the master with a cherry pick of my 7eb8fca changset so that it can be merged to master before I finish the work on power iteration clustering.

Owner

mblondel commented Apr 28, 2011

@ogrisel: I'm +1 for the changes you propose and we can even combine Normalizer and LengthNormalizer in ColumnNormalizer if it's ok to break API. I cannot do it right now though, much busy. Can't we postpone to after your Power Iteration Branch is merged?

Owner

mblondel commented Apr 28, 2011

Or we could have an axis option, whcih is more numpy-ish.

Owner

ogrisel commented Apr 28, 2011

I am ok for breaking the API to make it much simpler.

This can and should be done independently of the Power Iteration Clustering branch which is far from ready yet: I need to implement clustering metrics to be able to check the quality / robustness of the results empirically + check that we can reproduce the results of the PIC papers. Whatever branch moves first is not important.

Owner

ogrisel commented Apr 28, 2011

@mblondel I just pushed the cherripicking of the l1 norm fix to the master to make the branches independent.

Owner

GaelVaroquaux commented Jul 15, 2011

What's the status of this pull request? Should we try merging it?

Owner

ogrisel commented Jul 15, 2011

Nope the status is explicit in the PR header. Feel free to further investigate but this is not ready for merge as long as we cannot reproduce the results of the paper on the 20 newsgroups dataset. Note that the recent fix on the TF-IDF normalization might help improve the results.

Owner

GaelVaroquaux commented Jul 15, 2011

Nope the status is explicit in the PR header.

OK, so we'll live it on standby.

Owner

ogrisel commented Jul 20, 2011

FYI I updated to master to get the "good" TF-IDF and I fixed a bug in the power iteration code but I still cannot reproduce the paper results so this will require more debugging before being able to merge.

ogrisel and others added some commits Jul 31, 2011

@ogrisel ogrisel Merge branch 'master' into power-iteration-clustering bb69231
@ogrisel ogrisel tweak toy example to make it work better (cheating) 159741c
@ogrisel ogrisel ENH: joblib + fixed random state in doc clustering example 2dc27a7
@ogrisel ogrisel fixed power iteration clustering by taking the transposed normalized …
…matrix
ac90d7a
@ogrisel ogrisel cosmit 92815ea
@ogrisel ogrisel track changes from master 56ded9e
@ogrisel ogrisel checkpoint some untested brain dump f2ddc81
@ogrisel ogrisel Merge branch 'master' into power-iteration-clustering dd4d3f7
@ogrisel ogrisel OPTIM: set zero elements of afinity matrix much faster 77ec757
@ogrisel ogrisel remove sparse_dense_dot_nocheck cython code as sparse * dense matrix …
…multiply was not the bad perf culprit
302b8b8
@ogrisel ogrisel be concise daceeec
@ogrisel ogrisel fix verbose mode bug 4b1f734
@ogrisel ogrisel merge changes from master c99517f
@ogrisel ogrisel Merge branch 'master' into power-iteration-clustering 17ee585
@ogrisel ogrisel merge master 2bc4e40
@jakevdp @ogrisel jakevdp add warning flag to balltree + tests 62a27a5
@jakevdp @ogrisel jakevdp warning_flag doc 79ef074
@jakevdp @ogrisel jakevdp add warning messages to KNeighbors a8b7489
@jakevdp @ogrisel jakevdp fixes for tests 78550d8
@ogrisel ogrisel OPTIM: inplace dense minibatch updates and better variable names 554a986
@ogrisel ogrisel cosmit e78bb42
@larsmans @ogrisel larsmans FIX Issue 379 and use the opportunity to refactor libsvm code
Sparse SVC's would try to use the dense decision_function, since
SparseBaseLibSVM did not override that method. Solution: factor out
non-common code to DenseBaseLibSVM. (@diogojc's test program still
crashes, but with an appropriate error message.)

Also, leverage superclass __init__ in SparseBaseLibSVM
027648f
@ogrisel ogrisel cosmit: better variable name in MiniBatchKMeans 5ac7b80
@ogrisel ogrisel ENH: make it possible to control the add variance caused by Randomize…
…d SVD
957a09e
@glouppe @ogrisel glouppe Added a numerical stability test to decision trees 8dba451
@ogrisel ogrisel ENH: document clustering example simplification ab9376e
@ogrisel ogrisel FIX broken doctests on buildbot + pep257 691ea6a
@GaelVaroquaux @ogrisel GaelVaroquaux PEP08 names in graph_shortest_path 8a27c3e
@jakevdp @ogrisel jakevdp attempt to address warnings catcher 61d31b3
@jakevdp @ogrisel jakevdp hack to fix warning test d5e78b7
@jakevdp @ogrisel jakevdp change warning message 28a8f7d
@jakevdp @ogrisel jakevdp simplify warning test; remove assert_warns from utils 46815f8
@GaelVaroquaux @ogrisel GaelVaroquaux COSMIT 799c245
@larsmans @ogrisel larsmans DOC copy-edit naive bayes doc, with an emphasis on the formulas
* follow scikit-learn naming convention (y for classes, x_i for features)
* don't assume text classification, even for multinomial NB
d181c0d
@larsmans @ogrisel larsmans COSMIT in chi² feature selection 8011269
@jmetzen @ogrisel jmetzen Fixed bug in updating structure matrix in ward_tree algorithm.
Change: Formerly, when merging two clusters, the merge was not connected to all other clusters the two childs had been connected to. This could result in raised Exceptions when connected components of the graph become unconnected. Now, the connections are propagated up through the ward tree during merging such that connected components remain always connected.
c911a9a
@jmetzen @ogrisel jmetzen Added test case that reproduces crashes in old version of ward_tree a…
…lgorithm.
4bb7581
@jmetzen @ogrisel jmetzen Performance tweaking in ward_tree.
Using a set instead of a list for lookup of already considered tree nodes during update of structure matrix.

Using a set instead of a list in
a5e74b0
@larsmans @ogrisel larsmans DOC ported latexpdf target from Sphinx 1.0.7-generated Makefile
Also, clean up doc/README
9694ef2
@vene @ogrisel vene DOC: added Gaussian Processes to class reference a873c00
@larsmans @ogrisel larsmans DOC typos in Ward tree docstring 2da4b40
@GaelVaroquaux @ogrisel GaelVaroquaux TEST: simplify test case f11b797
@GaelVaroquaux @ogrisel GaelVaroquaux SPEED tree: 2X in Gini criteria feaba6e
@fabianp @ogrisel fabianp Cosmetic changes in LARS f6c0c46
@fabianp @ogrisel fabianp FIX: Py3k compatibility.
Fixes #395
09a602a
@ogrisel ogrisel FIX: missing relative import marker 4552f29
@larsmans @ogrisel larsmans COSMIT little things in hierarchical.py ba36e19
@glouppe @ogrisel glouppe DOC: Added load_boston in classes.rst 1c076b3
@larsmans @ogrisel larsmans BUG NMF topic spotting example would output n_top_words-1 terms 48789ce
@amueller @ogrisel amueller DOCS: Image is aligned to the right... bf88bd1
@larsmans @ogrisel larsmans DOC explain multiclass behavior in LogisticRegression 756772f
@ogrisel ogrisel DOC: LogisticRegression is a wrapper for liblinear. 49992fd
@GaelVaroquaux @ogrisel GaelVaroquaux MISC: mk roc_curve work on lists fcabe0e
@vene @ogrisel vene FIX: keep track of index swapping in OMP 25b4540
@amueller @ogrisel amueller DOC Added documentation for important attributes of GridSearchCV de53d27
@amueller @ogrisel amueller specify dict type 5ee39dd
@larsmans @ogrisel larsmans COSMIT pep8 feature_extraction.text bcdb9f0
@larsmans @ogrisel larsmans DOC some stuff on input validation 7d3f052
@larsmans @ogrisel larsmans ENH Cython version of SVMlight loader a1b1dc7
@amueller @ogrisel amueller DOCS: Typo in url 569b78b
@vene @ogrisel vene Testing for swapped regressors in OMP d9fe42b
@vene @ogrisel vene PEP8 f3d3886
@larsmans @ogrisel larsmans ENH accept matrix input throughout
Simple solution: replace np.asanyarray with np.asarray throughout.
Also, use utils.as_float_array where appropriate.
a272afc
@larsmans @ogrisel larsmans COSMIT rename safe_asanyarray to safe_asarray to prevent confusion 8815064
@satra @ogrisel satra fix: permutation test score averages across folds 3eb83e1
@larsmans @ogrisel larsmans FIX replace np.atleast_2d with new utils.array2d
atleast_2d will pass through np.matrix
8693d52
@agramfort @ogrisel agramfort STY: pep8 + naming 0285579
@agramfort @ogrisel agramfort DOC: prettify plot_permutation_test_for_classification.py e608dd0
@agramfort @ogrisel agramfort DOC : adding permutation_test_score to changelog f63fb0e
@larsmans @ogrisel larsmans DOC correct Google URL 43f2ae0
@GaelVaroquaux @ogrisel GaelVaroquaux MISC: __version__ in scikits.learn
For backward compatibility, this is required.
a447637
@amueller @ogrisel amueller ENH: Adds more verbosity to grid_search. verbose > 2 gives scores whi…
…le running grid.
eedf518
@ogrisel ogrisel FIX #401: update tutorial doctests to reflect recent changes and add …
…them to
5f0a36b
@ogrisel ogrisel DOC: new scikit-learn.org URLs and mention license in README.md 3ebd396
@larsmans @ogrisel larsmans pep8 grid_search.py a4e5561
@fabianp @ogrisel fabianp Delete benchmarks/bench_svm.py
ml-benchmarks has a much more comprehensive benchmark.]
8a50942
@fabianp @ogrisel fabianp Delette benchmarks/bench_neighbors.py
ml-benchmarks should be used instead.
11f325a
@larsmans @ogrisel larsmans DOC correct and clean up empirical covariance docstrings 4485065
@larsmans @ogrisel larsmans ENH test input validation code on memmap arrays 2d934c4
@larsmans @ogrisel larsmans ENH sample_weight argument in discrete NB estimators
Also, renamed Naive Bayes' unique_y attribute to _classes; it was undocumented.
c1f36ff
@fabianp @ogrisel fabianp MISC: More meaningful names for lapack functions in least_angle. 1f6f145
@fabianp @ogrisel fabianp Removed unused parameters in least_angle c3cd74a
@larsmans @ogrisel larsmans BUG handle two-class multilabel case in LabelBinarizer
Would throw an exception due to special handling of binary case.
230dd3f
@larsmans @ogrisel larsmans TEST better test for binary multilabel case in LabelBinarizer 9a6712b
@larsmans @ogrisel larsmans FIX Python 2.5 compat in utils/tests 8af8a3e
@larsmans @ogrisel larsmans DOC expand Naive Bayes narrative doc (BernoulliNB formula) 5062f3d
@GaelVaroquaux @ogrisel GaelVaroquaux DOC: add IterGrid in reference 5491965
@larsmans @ogrisel larsmans COSMIT in naive_bayes ad2b46a
@larsmans @ogrisel larsmans ENH prevent copy in sparse.LogisticRegression 16974c0
@yarikoptic @ogrisel yarikoptic DOC: minor typo "precom[p]uted" 853dd09
@larsmans @ogrisel larsmans Revert "ENH prevent copy in sparse.LogisticRegression"
This reverts commit b453e1e.
3528250
@GaelVaroquaux @ogrisel GaelVaroquaux COSMIT: no import as
There is really no point to do an 'import as' to rename a function. It is
bad for readability.
8d1c42c
@ogrisel ogrisel ENH: more informative error messages when input has invalid shapes 2325661
@ogrisel ogrisel ENH: more informative error message when shape mismatch in TF IDF tra…
…nsformer
4538b7e
@ogrisel ogrisel ENH: make it possible to pass class_weight='auto' as constructor para…
…m for SGDClassifier
9c1c99b
@GaelVaroquaux @ogrisel GaelVaroquaux MISC: Warn for integers in scaling/normalize 11fc117
@GaelVaroquaux @ogrisel GaelVaroquaux MISC: better warning message 7ee9d17
@amueller @ogrisel amueller DOC: Document "cache size" argument of SVR 1878383
@amueller @ogrisel amueller COSMIT: remove unused error string. 4b256fc
@amueller @ogrisel amueller ENH: removed kernel cache from fit method of DenseLibSVM, added to __…
…init__ of BaseLibSVM
6c32a13
@amueller @ogrisel amueller Added kernel cache argument to init of all SVC and SVR classes. For t…
…he moment the conservative 100MB default.
95b84b4
@amueller @ogrisel amueller BUG cache_size instead of cache as paramter name e68061f
@amueller @ogrisel amueller BUG: cache_size also for sparse SVMs 5d145e4
@larsmans @ogrisel larsmans DOC typos and style in linear_model docs 16f9cc4
@pprett @ogrisel pprett cosmit: get rid of gcc warning (q_data_ptr was not initialized) d01b54f
@pprett @ogrisel pprett fix: overflow of `offset` variable if X.shape[0] * X.shape[1] > 250M 358beab
@GaelVaroquaux @ogrisel GaelVaroquaux COSMIT: never use np.linalg, but scipy.linalg c3cc180
@GaelVaroquaux @ogrisel GaelVaroquaux BUG: ProbabilisticPCA.score work with pipeline
score(X, y=None) is required to be used by generic objects that work on
supervised and unsupervised learners.
56227a1
@fabianp @ogrisel fabianp Convert to scipy doc convention + add missing options 0df170f
@amueller @ogrisel amueller ENH: SVM cache_size default value changed to 200 mb 445524e
@amueller @ogrisel amueller ENH Sparse SVM: removed cache_size parameter from fit method. Is now …
…part of constructur.
3f088da
@amueller @ogrisel amueller DOC fixed doctests for cache_size parameter dd6bd67
@amueller @ogrisel amueller DOC slight reformatting of kernel cache note in module docs. d78e381
@amueller @ogrisel amueller BUG: minor mistake in earlier commit. 43f48ba
@amueller @ogrisel amueller DOC: fogot doctests in python files. 5fda25e
@amueller @ogrisel amueller DOC: another doctest. cacb56b
@larsmans @ogrisel larsmans COSMIT cleanup sgd Cython code
Remove unused variables in dense version, use xrange (now compiled to
for loop by Cython), use unsigned where appropriate.
073ffc3
@amueller @ogrisel amueller ENH: in Scaler, warn if fit or transform called with integer data. f6c325b
@ogrisel ogrisel FIX: buggy usage of for / else for k-means n_init loop 1fa53c9
@ogrisel ogrisel DOC: update what's new cb3f1fe
@ogrisel ogrisel FIX: broken HMM tests caused by KMeans convergence in one step b8777b8
@amueller @ogrisel amueller ENH parameter "epsilon" in SVR error messages is given correct name. 84ce710
@ogrisel ogrisel ENH: use integer indexing instead of boolean masks by default for CV 1b3db16
@larsmans @ogrisel larsmans DOC update cross validation docstrings for default indices=True 00e2761
@GaelVaroquaux @ogrisel GaelVaroquaux MISC: remove links to sourceforge URL 5dc1e89
@larsmans @ogrisel larsmans BUG handle broken estimators in grid search by cloning them
Reported by Sami Liedes of Aalto University.
2a51d04
@GaelVaroquaux @ogrisel GaelVaroquaux DOC: fix links in mixture 5eb6710
@GaelVaroquaux @ogrisel GaelVaroquaux MISC: add citation information ed19e28
@GaelVaroquaux @ogrisel GaelVaroquaux BUG: vectorizer.inverse_transform on arrays
Matrices and arrays have different indexing rules: matrices are always 2D
objects, thus slicing a matrix returns a 2D object.
8bd7fb4
@larsmans @ogrisel larsmans ENH don't require numeric class labels in SGDClassifier 0540e00
@ogrisel ogrisel s/safe_asanyarray/safe_asanyarray/g 7b4917e
@ogrisel ogrisel checkpointing experiments / parameter tweaks ee694cb
Owner

ogrisel commented Jan 8, 2014

+1 but no time before the release on my side.

Just out of curiosity. Why there so many un merged PR. Some of them have a lot of work going into them and also very old without being closed of merged.

Owner

ogrisel commented Mar 29, 2014

Just because people loose interest or don't have the time to finish the work required to get them merged. This is natural I think.

auvipy commented Sep 27, 2015

This branch has conflicts that must be resolved

Owner

amueller commented Oct 25, 2016

I think this doesn't really fulfil relevance requirements. @ogrisel wanna close and maybe put into contrib?

Owner

amueller commented Oct 25, 2016

closing. If you think this should go into sklearn and not contrib, feel free to reopen.

amueller closed this Oct 25, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment