
WIP: Power Iteration Clustering #138

Open
wants to merge 164 commits
@ogrisel
Owner

Here is an early pull request, far from being ready for merging to master (lacks tests, docs, and a class that implements the estimator public API).

The goal is to let other developers know about some ongoing work to implement Power Iteration Clustering, a "variant" of Spectral Clustering that is supposed to be more scalable (according to the authors of the reference paper).

In practice I am not so sure about the speed / robustness tradeoff of the results. We need to implement better clustering metrics (as mentioned in the paper) so as to be able to do a principled comparison and evaluate the performance of the algorithm.

Edit: here is a direct link to the reference paper: http://www.cs.cmu.edu/~wcohen/postscript/icml2010-pic-final.pdf
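
For context, here is a minimal sketch of the PIC idea from that paper (illustrative only, not the code in this branch; the function name and defaults are made up):

```python
import numpy as np
from sklearn.cluster import KMeans  # modern import path

def power_iteration_clustering(affinity, n_clusters, tol=1e-5, max_iter=1000,
                               random_state=0):
    """Cluster samples given an (n, n) affinity matrix via truncated power iteration."""
    A = np.asarray(affinity, dtype=float)
    W = A / A.sum(axis=1, keepdims=True)       # row-normalize the affinity matrix
    v = A.sum(axis=1)
    v /= np.abs(v).sum()                       # initial vector, l1-normalized
    prev_delta = np.zeros_like(v)
    for _ in range(max_iter):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()
        delta = np.abs(v_new - v)
        # stop when the per-component rate of change stabilizes
        # (the "acceleration" criterion recommended in the paper)
        if np.abs(delta - prev_delta).max() < tol:
            v = v_new
            break
        v, prev_delta = v_new, delta
    # k-means on the resulting one-dimensional embedding gives the clusters
    km = KMeans(n_clusters=n_clusters, random_state=random_state)
    return km.fit_predict(v.reshape(-1, 1)), v
```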

Feedback greatly appreciated.

Status / TODO before merging

  • find a way to quantitatively evaluate the quality of the clusters (v-measure and adjusted Rand index are now implemented in master; see the usage sketch after this list)

  • investigate the dependency on the value of the tol hyperparameter

  • write documentation

  • write tests

  • clean up the convergence plot code (or find a non-intrusive, generic API to deal with such convergence monitoring, maybe using callbacks)

  • reproduce convergence results of the paper on the 20 newsgroups dataset
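
As a quick illustration of the evaluation mentioned in the first item (a usage sketch with today's sklearn.metrics names, not code from this branch):

```python
from sklearn.metrics import adjusted_rand_score, v_measure_score

labels_true = [0, 0, 1, 1, 2, 2]   # ground-truth cluster assignments
labels_pred = [0, 0, 1, 2, 2, 2]   # assignments produced by the algorithm

print("V-measure:          ", v_measure_score(labels_true, labels_pred))
print("Adjusted Rand index:", adjusted_rand_score(labels_true, labels_pred))
```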

ogrisel and others added some commits
@ogrisel ogrisel started work on Power Iteration Clustering 469187d
@ogrisel ogrisel saving work in progress on PIC 581d88a
@ogrisel ogrisel Merge branch 'master' into power-iteration-clustering c785f68
@ogrisel ogrisel tentative example for spectral clustering d636f5f
@ogrisel ogrisel tentative improvement by changing the window 2bbc656
@ogrisel ogrisel tentative work with KNN connectivity graph as affinity matrix, does not work either... 5f44548
@ogrisel ogrisel Merge branch 'spectral-clustering-example' into power-iteration-clustering 20a71cb
@agramfort agramfort FIX : fixing plot_clustering_toy_2D_circles.py 080c2f2
@ogrisel ogrisel more work on PIC 9a84598
@ogrisel ogrisel dataset cheat to make spectral clustering work 083b640
@ogrisel ogrisel Merge branch 'spectral-clustering-example' into power-iteration-clustering e68fa64
@ogrisel ogrisel work in progress debugging the PIC convergence 5f6b085
@ogrisel ogrisel ignore debug images d67247b
@ogrisel ogrisel more explicit variable names 92e1b03
@ogrisel ogrisel compare more methods in the circles example c6d978d
@ogrisel ogrisel ENH: plot improvements stolen from Gael's bag of mpl tricks 8c85586
@ogrisel ogrisel Merge branch 'master' into spectral-clustering-example dc4e270
@ogrisel ogrisel Merge branch 'master' into spectral-clustering-example f0df252
@ogrisel ogrisel merge 3b08b34
@ogrisel ogrisel fixed the convergence criterion to follow the paper recommendation: now it works 68ede30
@ogrisel ogrisel typo + TODO a4d839c
@ogrisel ogrisel factoring out row normalization 34e652b
@ogrisel ogrisel factorize out the row normalization logic to reuse in the cosine similarity metric b58d9f2
@GaelVaroquaux

Looks cool. I don't have time to look at that right now. Could you also send an e-mail to the list? Not everybody monitors the pull requests closely enough, as we have seen recently.

@ogrisel
Owner

Done.

@mblondel
Owner

Power Iteration Clustering looks interesting! I need to read the paper. Two short remarks:

  • inplace_row_normalize overlaps with LengthNormalizer in the preprocessing module (the sparse version is coded in Cython, but it may be possible to get rid of it thanks to scipy.sparse.sparsetools, which I didn't know about when I wrote LengthNormalizer); see the sketch after this list.

  • The cosine_similarity function uses a singular form but the euclidean_distances function uses a plural form. This needs to be made consistent (I prefer the singular form).
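
To make the overlap in the first point concrete, here is a minimal sketch of in-place l2 row normalization for a CSR matrix (illustrative only; the function name is made up and this is not the Cython code behind LengthNormalizer):

```python
import numpy as np
import scipy.sparse as sp

def inplace_csr_row_normalize_l2(X):
    """Scale each row of a CSR matrix to unit l2 norm, in place."""
    assert sp.isspmatrix_csr(X), "expects a CSR matrix"
    for i in range(X.shape[0]):
        start, end = X.indptr[i], X.indptr[i + 1]
        norm = np.sqrt(np.sum(X.data[start:end] ** 2))
        if norm > 0:
            X.data[start:end] /= norm

X = sp.csr_matrix(np.random.RandomState(0).rand(5, 3))
inplace_csr_row_normalize_l2(X)
```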

@mblondel
Owner

Another name for the cosine_similarity function could be normalized_linear_kernel.

@ogrisel
Owner

About inplace_row_normalize vs. the Cython normalizers: I know about the redundancy; we need to benchmark both and keep the best. sparsetools is completely undocumented; I found out about it by chatting with David during PyCon.

About the plural inconsistency, OK for cosine_similarities. As for normalized_linear_kernel, I mentioned the equivalence in the docstring; I think more people will find it under the cosine_similarities name because of the popularity of the vector space model and the geometric interpretation as the cosine of the angle between a pair of vectors.
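
The equivalence mentioned above, as a small sketch (illustrative, not the actual implementation in this branch): cosine similarity is just the linear kernel computed on l2-normalized rows.

```python
import numpy as np

def cosine_similarities(X, Y):
    """Pairwise cosine similarities, i.e. a normalized linear kernel."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T   # linear kernel on the normalized rows
```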

@mblondel
Owner
  • So I'm in favor of using LengthNormalizer in cosine_similarities. Then, depending on the benchmark results, we can either use sparsetools and get rid of the Cython module, or continue to use the Cython module. If sparsetools is written in C, as I expect it to be, I suspect that there won't be any major difference between it and the Cython module, which means that we will be able to get rid of the latter.

  • What do you think of an alias normalized_linear_kernel = cosine_similarities? Aliases should usually be avoided, but in this case it makes it obvious that you can pass it as a parameter to objects that support callable kernels (see the sketch below).
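
To illustrate the callable-kernel usage the alias would advertise (a sketch using the modern names; cosine_similarity lives in sklearn.metrics.pairwise today, and the alias itself is the hypothetical part):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVC

cosine_similarities = cosine_similarity          # name discussed in this thread
normalized_linear_kernel = cosine_similarities   # the proposed alias

rng = np.random.RandomState(0)
X = rng.rand(20, 5)
y = (X[:, 0] > 0.5).astype(int)

# SVC accepts a callable kernel taking (X, Y) and returning the Gram matrix
clf = SVC(kernel=normalized_linear_kernel).fit(X, y)
```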

@GaelVaroquaux

What do you think of an alias normalized_linear_kernel = cosine_similarities?

-1. If we really need kernel functions, then let's have a kernel module. And to reply to mblondel's remarks on a kernel API a while ago, it seems to me that, for now, defining a kernel as a callable is good enough.

@mblondel
Owner

I agree that a kernel API is not a priority. Plus, if we are to use callables, I prefer to leave the kernels in the pairwise module. It's a good way to show the user that kernels are pairwise metrics.

@ogrisel
Owner

@mblondel: I fixed the normalizer to use the real l1 norm (with absolute values) and ran a quick benchmark:

http://friendpaste.com/3TwKVt61Eag45X66fpHcpX

Your Cython version is significantly faster (especially if you discard the time of the array copy at the beginning of the benchmark): I plan to use it and drop the inplace_row_normalizer function. That said, it would make the preprocessing package simpler if we got rid of the sparse sub-package, as discussed on the mailing list, and merged the dense and sparse normalizer variants into a single class.

Also, I think we could have a single RowNormalizer transformer class and select between the l1 and l2 norms with a constructor parameter.

Do you want to work on this? Or should I do it? In any case this work should be done in a new branch forked from master with a cherry-pick of my 7eb8fca changeset, so that it can be merged to master before I finish the work on power iteration clustering.
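
A hypothetical sketch of the single RowNormalizer transformer suggested above, with the norm chosen in the constructor (dense-only here; this is not the class that was actually implemented):

```python
import numpy as np

class RowNormalizer:
    """Scale each row of a dense array to unit l1 or l2 norm."""

    def __init__(self, norm="l2"):
        assert norm in ("l1", "l2")
        self.norm = norm

    def fit(self, X, y=None):
        return self                               # stateless transformer

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        if self.norm == "l1":
            norms = np.abs(X).sum(axis=1)         # real l1 norm: absolute values
        else:
            norms = np.sqrt((X ** 2).sum(axis=1))
        norms[norms == 0.0] = 1.0                 # leave all-zero rows untouched
        return X / norms[:, np.newaxis]

X = np.array([[1.0, -3.0], [0.0, 0.0]])
print(RowNormalizer(norm="l1").fit(X).transform(X))
```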

@mblondel
Owner

@ogrisel: I'm +1 for the changes you propose, and we could even combine Normalizer and LengthNormalizer into ColumnNormalizer if it's OK to break the API. I cannot do it right now though, too busy. Can we postpone this until after your Power Iteration branch is merged?

@mblondel
Owner

Or we could have an axis option, which is more numpy-ish.

@ogrisel
Owner

I am ok for breaking the API to make it much simpler.

This can and should be done independently of the Power Iteration Clustering branch, which is far from ready yet: I need to implement clustering metrics to be able to check the quality / robustness of the results empirically, plus check that we can reproduce the results of the PIC paper. Which branch moves first is not important.

@ogrisel
Owner

@mblondel I just pushed the cherry-pick of the l1 norm fix to master to make the branches independent.

@GaelVaroquaux

What's the status of this pull request? Should we try merging it?

@ogrisel
Owner

Nope, the status is explicit in the PR header. Feel free to investigate further, but this is not ready for merge as long as we cannot reproduce the results of the paper on the 20 newsgroups dataset. Note that the recent fix to the TF-IDF normalization might help improve the results.

@GaelVaroquaux
@ogrisel
Owner

FYI I updated to master to get the "good" TF-IDF and fixed a bug in the power iteration code, but I still cannot reproduce the paper's results, so this will require more debugging before it can be merged.

ogrisel and others added some commits
@ogrisel ogrisel Merge branch 'master' into power-iteration-clustering bb69231
@ogrisel ogrisel tweak toy example to make it work better (cheating) 159741c
@ogrisel ogrisel ENH: joblib + fixed random state in doc clustering example 2dc27a7
@ogrisel ogrisel fixed power iteration clustering by taking the transposed normalized matrix ac90d7a
@ogrisel ogrisel cosmit 92815ea
@ogrisel ogrisel track changes from master 56ded9e
@ogrisel ogrisel checkpoint some untested brain dump f2ddc81
@ogrisel ogrisel Merge branch 'master' into power-iteration-clustering dd4d3f7
@ogrisel ogrisel OPTIM: set zero elements of afinity matrix much faster 77ec757
@ogrisel ogrisel remove sparse_dense_dot_nocheck cython code as sparse * dense matrix multiply was not the bad perf culprit 302b8b8
@ogrisel ogrisel be concise daceeec
@ogrisel ogrisel fix verbose mode bug 4b1f734
@ogrisel ogrisel merge changes from master c99517f
@ogrisel ogrisel Merge branch 'master' into power-iteration-clustering 17ee585
@ogrisel ogrisel merge master 2bc4e40
@jakevdp jakevdp add warning flag to balltree + tests 62a27a5
@jakevdp jakevdp warning_flag doc 79ef074
@jakevdp jakevdp add warning messages to KNeighbors a8b7489
@jakevdp jakevdp fixes for tests 78550d8
@ogrisel ogrisel OPTIM: inplace dense minibatch updates and better variable names 554a986
@ogrisel ogrisel cosmit e78bb42
@larsmans larsmans FIX Issue 379 and use the opportunity to refactor libsvm code
Sparse SVC's would try to use the dense decision_function, since
SparseBaseLibSVM did not override that method. Solution: factor out
non-common code to DenseBaseLibSVM. (@diogojc's test program still
crashes, but with an appropriate error message.)

Also, leverage superclass __init__ in SparseBaseLibSVM
027648f
@ogrisel ogrisel cosmit: better variable name in MiniBatchKMeans 5ac7b80
@ogrisel ogrisel ENH: make it possible to control the add variance caused by Randomized SVD 957a09e
@glouppe glouppe Added a numerical stability test to decision trees 8dba451
@ogrisel ogrisel ENH: document clustering example simplification ab9376e
@ogrisel ogrisel FIX broken doctests on buildbot + pep257 691ea6a
@GaelVaroquaux GaelVaroquaux PEP08 names in graph_shortest_path 8a27c3e
@jakevdp jakevdp attempt to address warnings catcher 61d31b3
@jakevdp jakevdp hack to fix warning test d5e78b7
@jakevdp jakevdp change warning message 28a8f7d
@jakevdp jakevdp simplify warning test; remove assert_warns from utils 46815f8
@GaelVaroquaux GaelVaroquaux COSMIT 799c245
@larsmans larsmans DOC copy-edit naive bayes doc, with an emphasis on the formulas
* follow scikit-learn naming convention (y for classes, x_i for features)
* don't assume text classification, even for multinomial NB
d181c0d
@larsmans larsmans COSMIT in chi² feature selection 8011269
@jmetzen jmetzen Fixed bug in updating structure matrix in ward_tree algorithm.
Change: Formerly, when merging two clusters, the merge was not connected to all other clusters the two childs had been connected to. This could result in raised Exceptions when connected components of the graph become unconnected. Now, the connections are propagated up through the ward tree during merging such that connected components remain always connected.
c911a9a
@jmetzen jmetzen Added test case that reproduces crashes in old version of ward_tree algorithm. 4bb7581
@jmetzen jmetzen Performance tweaking in ward_tree.
Using a set instead of a list for lookup of already considered tree nodes during update of structure matrix.
a5e74b0
@larsmans larsmans DOC ported latexpdf target from Sphinx 1.0.7-generated Makefile
Also, clean up doc/README
9694ef2
@vene vene DOC: added Gaussian Processes to class reference a873c00
@larsmans larsmans DOC typos in Ward tree docstring 2da4b40
@GaelVaroquaux GaelVaroquaux TEST: simplify test case f11b797
@GaelVaroquaux GaelVaroquaux SPEED tree: 2X in Gini criteria feaba6e
@fabianp fabianp Cosmetic changes in LARS f6c0c46
@fabianp fabianp FIX: Py3k compatibility. 09a602a
@ogrisel ogrisel FIX: missing relative import marker 4552f29
@larsmans larsmans COSMIT little things in hierarchical.py ba36e19
@glouppe glouppe DOC: Added load_boston in classes.rst 1c076b3
@larsmans larsmans BUG NMF topic spotting example would output n_top_words-1 terms 48789ce
@amueller amueller DOCS: Image is aligned to the right... bf88bd1
@larsmans larsmans DOC explain multiclass behavior in LogisticRegression 756772f
@ogrisel ogrisel DOC: LogisticRegression is a wrapper for liblinear. 49992fd
@GaelVaroquaux GaelVaroquaux MISC: mk roc_curve work on lists fcabe0e
@vene vene FIX: keep track of index swapping in OMP 25b4540
@amueller amueller DOC Added documentation for important attributes of GridSearchCV de53d27
@amueller amueller specify dict type 5ee39dd
@larsmans larsmans COSMIT pep8 feature_extraction.text bcdb9f0
@larsmans larsmans DOC some stuff on input validation 7d3f052
@larsmans larsmans ENH Cython version of SVMlight loader a1b1dc7
@amueller amueller DOCS: Typo in url 569b78b
@vene vene Testing for swapped regressors in OMP d9fe42b
@vene vene PEP8 f3d3886
@larsmans larsmans ENH accept matrix input throughout
Simple solution: replace np.asanyarray with np.asarray throughout.
Also, use utils.as_float_array where appropriate.
a272afc
@larsmans larsmans COSMIT rename safe_asanyarray to safe_asarray to prevent confusion 8815064
@satra satra fix: permutation test score averages across folds 3eb83e1
@larsmans larsmans FIX replace np.atleast_2d with new utils.array2d
atleast_2d will pass through np.matrix
8693d52
@agramfort agramfort STY: pep8 + naming 0285579
@agramfort agramfort DOC: prettify plot_permutation_test_for_classification.py e608dd0
@agramfort agramfort DOC : adding permutation_test_score to changelog f63fb0e
@larsmans larsmans DOC correct Google URL 43f2ae0
@GaelVaroquaux GaelVaroquaux MISC: __version__ in scikits.learn
For backward compatibility, this is required.
a447637
@amueller amueller ENH: Adds more verbosity to grid_search. verbose > 2 gives scores while running grid. eedf518
@ogrisel ogrisel FIX #401: update tutorial doctests to reflect recent changes and add them to 5f0a36b
@ogrisel ogrisel DOC: new scikit-learn.org URLs and mention license in README.md 3ebd396
@larsmans larsmans pep8 grid_search.py a4e5561
@fabianp fabianp Delete benchmarks/bench_svm.py
ml-benchmarks has a much more comprehensive benchmark.
8a50942
@fabianp fabianp Delette benchmarks/bench_neighbors.py
ml-benchmarks should be used instead.
11f325a
@larsmans larsmans DOC correct and clean up empirical covariance docstrings 4485065
@larsmans larsmans ENH test input validation code on memmap arrays 2d934c4
@larsmans larsmans ENH sample_weight argument in discrete NB estimators
Also, renamed Naive Bayes' unique_y attribute to _classes; it was undocumented.
c1f36ff
@fabianp fabianp MISC: More meaningful names for lapack functions in least_angle. 1f6f145
@fabianp fabianp Removed unused parameters in least_angle c3cd74a
@larsmans larsmans BUG handle two-class multilabel case in LabelBinarizer
Would throw an exception due to special handling of binary case.
230dd3f
@larsmans larsmans TEST better test for binary multilabel case in LabelBinarizer 9a6712b
@larsmans larsmans FIX Python 2.5 compat in utils/tests 8af8a3e
@larsmans larsmans DOC expand Naive Bayes narrative doc (BernoulliNB formula) 5062f3d
@GaelVaroquaux GaelVaroquaux DOC: add IterGrid in reference 5491965
@larsmans larsmans COSMIT in naive_bayes ad2b46a
@larsmans larsmans ENH prevent copy in sparse.LogisticRegression 16974c0
@yarikoptic yarikoptic DOC: minor typo "precom[p]uted" 853dd09
@larsmans larsmans Revert "ENH prevent copy in sparse.LogisticRegression"
This reverts commit b453e1e.
3528250
@GaelVaroquaux GaelVaroquaux COSMIT: no import as
There is really no point to do an 'import as' to rename a function. It is
bad for readability.
8d1c42c
@ogrisel ogrisel ENH: more informative error messages when input has invalid shapes 2325661
@ogrisel ogrisel ENH: more informative error message when shape mismatch in TF IDF transformer 4538b7e
@ogrisel ogrisel ENH: make it possible to pass class_weight='auto' as constructor param for SGDClassifier 9c1c99b
@GaelVaroquaux GaelVaroquaux MISC: Warn for integers in scaling/normalize 11fc117
@GaelVaroquaux GaelVaroquaux MISC: better warning message 7ee9d17
@amueller amueller DOC: Document "cache size" argument of SVR 1878383
@amueller amueller COSMIT: remove unused error string. 4b256fc
@amueller amueller ENH: removed kernel cache from fit method of DenseLibSVM, added to __init__ of BaseLibSVM 6c32a13
@amueller amueller Added kernel cache argument to init of all SVC and SVR classes. For the moment the conservative 100MB default. 95b84b4
@amueller amueller BUG cache_size instead of cache as paramter name e68061f
@amueller amueller BUG: cache_size also for sparse SVMs 5d145e4
@larsmans larsmans DOC typos and style in linear_model docs 16f9cc4
@pprett pprett cosmit: get rid of gcc warning (q_data_ptr was not initialized) d01b54f
@pprett pprett fix: overflow of `offset` variable if X.shape[0] * X.shape[1] > 250M 358beab
@GaelVaroquaux GaelVaroquaux COSMIT: never use np.linalg, but scipy.linalg c3cc180
@GaelVaroquaux GaelVaroquaux BUG: ProbabilisticPCA.score work with pipeline
score(X, y=None) is required to be used by generic objects that work on
supervised and unsupervised learners.
56227a1
@fabianp fabianp Convert to scipy doc convention + add missing options 0df170f
@amueller amueller ENH: SVM cache_size default value changed to 200 mb 445524e
@amueller amueller ENH Sparse SVM: removed cache_size parameter from fit method. Is now part of constructur. 3f088da
@amueller amueller DOC fixed doctests for cache_size parameter dd6bd67
@amueller amueller DOC slight reformatting of kernel cache note in module docs. d78e381
@amueller amueller BUG: minor mistake in earlier commit. 43f48ba
@amueller amueller DOC: fogot doctests in python files. 5fda25e
@amueller amueller DOC: another doctest. cacb56b
@larsmans larsmans COSMIT cleanup sgd Cython code
Remove unused variables in dense version, use xrange (now compiled to
for loop by Cython), use unsigned where appropriate.
073ffc3
@amueller amueller ENH: in Scaler, warn if fit or transform called with integer data. f6c325b
@ogrisel ogrisel FIX: buggy usage of for / else for k-means n_init loop 1fa53c9
@ogrisel ogrisel DOC: update what's new cb3f1fe
@ogrisel ogrisel FIX: broken HMM tests caused by KMeans convergence in one step b8777b8
@amueller amueller ENH parameter "epsilon" in SVR error messages is given correct name. 84ce710
@ogrisel ogrisel ENH: use integer indexing instead of boolean masks by default for CV 1b3db16
@larsmans larsmans DOC update cross validation docstrings for default indices=True 00e2761
@GaelVaroquaux GaelVaroquaux MISC: remove links to sourceforge URL 5dc1e89
@larsmans larsmans BUG handle broken estimators in grid search by cloning them
Reported by Sami Liedes of Aalto University.
2a51d04
@GaelVaroquaux GaelVaroquaux DOC: fix links in mixture 5eb6710
@GaelVaroquaux GaelVaroquaux MISC: add citation information ed19e28
@GaelVaroquaux GaelVaroquaux BUG: vectorizer.inverse_transform on arrays
Matrices and arrays have different indexing rules: matrices are always 2D
objects, thus slicing a matrix returns a 2D object.
8bd7fb4
@larsmans larsmans ENH don't require numeric class labels in SGDClassifier 0540e00
@ogrisel ogrisel s/safe_asanyarray/safe_asanyarray/g 7b4917e
@ogrisel ogrisel checkpointing experiments / parameter tweaks ee694cb
@ogrisel
Owner

+1 but no time before the release on my side.

@sirinath

Just out of curiosity: why are there so many unmerged PRs? Some of them have had a lot of work put into them and are also very old without being closed or merged.

@ogrisel
Owner

Just because people lose interest or don't have the time to finish the work required to get them merged. This is natural, I think.
