Here is an early pull request, far from ready to be merged into master (it lacks tests, docs, and a class that implements the public estimator API).

The goal is to let other developers know about some ongoing work to implement Power Iteration Clustering, a "variant" of Spectral Clustering that is supposed to be more scalable (according to the authors of the reference paper).

In practice I am not so sure about the speed / robustness tradeoff of the results. We need to implement better clustering metrics (as mentioned in the paper) to be able to do a principled comparison and evaluate the performance of the algorithm.
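For readers unfamiliar with the method, here is a minimal pure-Python sketch of the general idea as I understand it: row-normalize an affinity matrix, run a few power iterations, stop early, and cluster the resulting one-dimensional embedding. It is not the code in this pull request. The function names, the Gaussian toy affinity, the acceleration-based stopping rule, and the largest-gap split (a stand-in for the k-means step one would normally use) are all assumptions made for illustration.

```python
import math


def gaussian_affinity(points, sigma=1.0):
    """Toy affinity matrix for 1-D points (hypothetical helper)."""
    return [[math.exp(-((p - q) ** 2) / (2.0 * sigma ** 2)) for q in points]
            for p in points]


def power_iteration_embedding(affinity, tol=1e-5, max_iter=1000):
    """Return (embedding, n_iter) from an early-stopped power iteration."""
    n = len(affinity)
    degrees = [sum(row) for row in affinity]
    # Row-normalize the affinity matrix: W = D^-1 A.
    W = [[a / degrees[i] for a in affinity[i]] for i in range(n)]
    # Degree-based starting vector with unit L1 norm.
    total = sum(degrees)
    v = [d / total for d in degrees]
    prev_delta = None
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        new_v = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in new_v)
        new_v = [x / norm for x in new_v]
        delta = max(abs(a - b) for a, b in zip(new_v, v))
        v = new_v
        # Stop once the size of the update itself has stabilized,
        # i.e. before the vector collapses to a constant.
        if prev_delta is not None and abs(prev_delta - delta) < tol:
            break
        prev_delta = delta
    return v, n_iter


def split_on_largest_gap(values):
    """Two-way split of a 1-D embedding at its largest gap (not k-means)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    gaps = [values[order[k + 1]] - values[order[k]]
            for k in range(len(order) - 1)]
    cut = gaps.index(max(gaps))
    labels = [0] * len(values)
    for i in order[cut + 1:]:
        labels[i] = 1
    return labels


# Two well-separated groups of different sizes on a line.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 5.3]
embedding, n_iter = power_iteration_embedding(gaussian_affinity(points))
labels = split_on_largest_gap(embedding)
```

On this toy input the embedding becomes nearly constant within each group after a handful of iterations while the two groups stay apart, which is what makes the early stop useful. Note that the degree-based start would carry no information if every point had exactly the same degree; a random start would be one way around that.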

Edit: here is a direct link to the reference paper:

Feedback greatly appreciated.

Status / TODO before merging

  • find a way to quantitatively evaluate the quality of the clusters (V-measure and adjusted Rand index are now implemented in master)

  • investigate the dependency on the value of the tol hyperparameter

  • write documentation

  • write tests

  • clean up the convergence plot code (or find a non-intrusive, generic API to deal with such convergence monitoring, maybe using callbacks)

  • reproduce convergence results of the paper on the 20 newsgroups dataset
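
To make the callback idea from the TODO list concrete, here is a minimal sketch of what a non-intrusive convergence-monitoring hook could look like. Everything in it is hypothetical (`iterate_until_converged`, `ConvergenceRecorder`, and the `callback(n_iter, delta)` signature are names invented for illustration, not an existing API): the solver only reports each iteration to an optional callable, and any plotting happens outside of it.

```python
def iterate_until_converged(step, x0, tol=1e-6, max_iter=100, callback=None):
    """Apply `step` repeatedly until the largest coordinate change is < tol."""
    x = x0
    for n_iter in range(1, max_iter + 1):
        new_x = step(x)
        delta = max(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        # The solver knows nothing about plotting: it only reports progress.
        if callback is not None:
            callback(n_iter, delta)
        if delta < tol:
            break
    return x


class ConvergenceRecorder:
    """Collects (n_iter, delta) pairs so they can be plotted afterwards."""

    def __init__(self):
        self.history = []

    def __call__(self, n_iter, delta):
        self.history.append((n_iter, delta))


def halve(x):
    """Toy update whose change shrinks by a factor of 2 per iteration."""
    return [0.5 * v for v in x]


# Same toy problem with a tight and a loose tolerance.
tight = ConvergenceRecorder()
iterate_until_converged(halve, [1.0, -2.0], tol=1e-3, callback=tight)

loose = ConvergenceRecorder()
iterate_until_converged(halve, [1.0, -2.0], tol=1e-1, callback=loose)
```

Running the same problem with two recorders, as above, is also a cheap way to look at how the number of iterations depends on tol: here the tight tolerance needs 11 iterations and the loose one 5.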

+1 but no time before the release on my side.


Just out of curiosity: why are there so many unmerged PRs? Some of them have had a lot of work put into them and are very old, yet have been neither closed nor merged.


Just because people lose interest or don't have the time to finish the work required to get them merged. This is natural, I think.

