
WIP : added a module for "Label Propagation" #301

Closed
clayw wants to merge 94 commits

Conversation

@clayw (Contributor) commented Aug 8, 2011

Label Propagation algorithms are a powerful family of semi-supervised learning algorithms. I originally wrote a demo for a workshop at the Noisebridge machine learning group, but now I think it's ready to join the scikits.learn codebase.

I implemented two label propagation algorithms, "label spreading" and "label propagation", including one example that demonstrates performance versus an SVM with a small number of labels and another showing how the algorithm can learn a complex structure.

I also included a little documentation describing the algorithm.

@ogrisel (Member) commented Aug 8, 2011

Interesting contrib. Before diving into the actual review work, could you please run the pep8 script and fix all issues?

http://pypi.python.org/pypi/pep8


Review thread on this diff hunk:

    # generate ring with inner box
    num_points = 100
    outer_circ_xs = [np.cos(2*np.pi*x/num_points) for x in range(0,num_points)]
Member commented on the diff above:

Rather than a list comprehension, just use numpy:

    outer_circ_xs = np.cos(np.linspace(0, 2 * np.pi, num_points))

and so on. Also, following the scikit conventions, num_points should be renamed to something like n_samples_per_circle.

@ogrisel (Member) commented:

Actually you could just reuse the noisy circles code generation snippet from the following example in my power iteration clustering branch.

https://github.com/ogrisel/scikit-learn/blob/power-iteration-clustering/examples/cluster/plot_clustering_toy_2D_circles.py

This code could be factorized into the scikits.learn.datasets.samples_generator package.
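
For reference, a minimal sketch of what such a factored generator could look like (the function name and signature below are hypothetical, not an existing samples_generator API):

    import numpy as np

    def make_circle(n_samples=100, radius=1.0, noise=0.0, seed=0):
        # Hypothetical generator: n_samples points on a circle of the
        # given radius, with optional Gaussian noise added.
        rng = np.random.RandomState(seed)
        t = np.linspace(0, 2 * np.pi, n_samples)
        X = radius * np.column_stack([np.cos(t), np.sin(t)])
        return X + noise * rng.randn(n_samples, 2)

The ring-with-inner-box data would then be two calls with different radii.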

@mblondel (Member) commented Aug 9, 2011

I like the idea of encoding unlabeled examples with -1, as having X and y of the same size is a good consistency check. The only thing that I'm worried about is that such an API will force us to use views of the data (X[y == -1] and X[y != -1]) and thus may be suboptimal in terms of performance. Another possible API would be to pass X as a tuple.

X = (X_labeled, X_unlabeled)
LabelPropagation.fit(X, y)
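
To make the performance concern concrete: boolean-mask indexing in numpy returns a copy rather than a view, so splitting X this way duplicates the selected rows (a small self-contained check, with illustrative data):

    import numpy as np

    X = np.random.randn(10, 2)
    y = np.array([0, 1, -1, -1, 0, -1, 1, -1, -1, 0])

    # Boolean-mask ("fancy") indexing copies the selected rows:
    X_unlabeled = X[y == -1]
    print(np.may_share_memory(X, X_unlabeled))  # False: memory duplicated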

@ogrisel (Member) commented Aug 9, 2011

If we decide not to keep the -1 trick for whatever reason, I think I would rather go for:

LabelPropagation.fit(X_labeled, y, unlabeled=X_unlabeled)

instead of X being a tuple.

@clayw (Contributor, author) commented Aug 9, 2011

Either of those input methods is fine. I like the unlabeled trick that @GaelVaroquaux suggested, which is why I changed the implementation to it.

Keep in mind that the probability distributions over the true ground-truth labels can change after running the algorithm.

@ogrisel (Member) commented Sep 14, 2011

This indeed looks much clearer (esp. the variable / attribute names and docstring). I agree that the active learning example is neat. For the structure example, I think it would be worth using two separate pylab figures instead of subplots (have a look at other examples in the scikit, such as http://scikit-learn.sourceforge.net/dev/auto_examples/decomposition/plot_faces_decomposition.html). The sphinx doc will aggregate them all.

You could also do the same for the active learning steps: one figure per active learning round trip, to see the evolution of the uncertain samples. That will slow down the execution when run manually, though. There might be a mode to avoid the blocking behavior of the pl.show() call. Maybe @GaelVaroquaux or @agramfort know?

@GaelVaroquaux (Member) commented:

    There might be a mode to avoid the blocking behavior of the pl.show() call.

By definition no, as pl.show() is used to start the event loop.

@ogrisel (Member) commented Sep 14, 2011

Alright, then we can just defer the call to pl.show() to the end of the active learning loop.
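
A minimal sketch of that structure (dummy data standing in for the real active learning loop):

    import numpy as np
    import pylab as pl

    X = np.random.randn(50, 2)

    # One figure per active learning round; defer pl.show() (which
    # starts the blocking event loop) until after the loop so all
    # figures appear together.
    for round_idx in range(3):
        pl.figure(round_idx)
        pl.scatter(X[:, 0], X[:, 1])
        pl.title("Active learning round %d" % round_idx)

    pl.show()  # blocks here, once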

@agramfort (Member) commented:

Please pull from https://github.com/agramfort/scikit-learn/tree/clayw-master; there are 3 commits.

I cannot issue a pull request to your clone; I don't know why.

Quick remarks:

  • I feel there are too many examples and a lot of redundant code in them.
  • This code needs to be improved so that it scales to large datasets. It currently requires assembling an n_samples x n_samples matrix, which means it does not scale.
  • Please have a look at how to use np.random.RandomState, and use more broadcasting (see the sketch after this comment).

But I like the examples, especially the ones with the decision functions on iris :)
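
For readers unfamiliar with those idioms, a minimal sketch (illustrative data, unrelated to the PR's code):

    import numpy as np

    # Seeded generator instead of the global numpy random state,
    # so results are reproducible:
    rng = np.random.RandomState(42)
    X = rng.randn(100, 2)

    # Broadcasting instead of a Python loop: scale each feature
    # column by a per-feature factor in one vectorized expression.
    scales = np.array([2.0, 0.5])  # shape (2,) broadcasts over rows
    X_scaled = X * scales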

@GaelVaroquaux (Member) commented:

I have been glancing at this pull request, and it seems to me that it is based on diffusion maps (or spectral embedding). Looking at the code:

  • A graph Laplacian is built, and normalized on only one side.
  • A generalized eigenvalue problem is solved by a power iteration method.

I don't have access to the paper, so I cannot check the maths and am bound to do some reverse engineering from the code; this might be wrong. (As a side note, I am a bit uneasy about having an algorithm in the scikit with no downloadable source. Is there another reference besides the 'Semi-Supervised Learning' book?)

If I am right, the hand-coded power iteration method is probably not the right way to solve this. Using arpack or pyamg to do the power iteration would work much better; @emmanuelle and I have run this on graphs with millions of nodes. See the code in sklearn.cluster.spectral, or in null_space in manifold.local_linear_embedding.

Also, to make this algorithm scalable, it should be able to use a sparse affinity graph, for instance using a neighbors_graph and the graph_laplacian helper function (see the sketch after this comment).

I know that this remark adds a fair amount of work to the pull request, but in my eyes it is important to pay the cost upfront in order to reduce the long-term maintenance of the code and to keep numerical efficiency in the scikit.
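
A rough sketch of the sparse construction being suggested (using scipy.sparse.csgraph.laplacian to play the role of the graph_laplacian helper; data and parameters are illustrative):

    import numpy as np
    from scipy.sparse import csgraph
    from sklearn.neighbors import kneighbors_graph

    X = np.random.randn(1000, 2)

    # Sparse k-NN affinity graph: each row has only n_neighbors
    # nonzeros, so memory grows as O(n_samples * k) rather than
    # O(n_samples ** 2) for a dense kernel matrix.
    affinity = kneighbors_graph(X, n_neighbors=10, mode='connectivity')

    # Symmetrize (kneighbors_graph is not symmetric in general),
    # then build the normalized graph Laplacian.
    affinity = 0.5 * (affinity + affinity.T)
    laplacian = csgraph.laplacian(affinity, normed=True)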

@ogrisel (Member) commented Sep 15, 2011

Indeed, having the possibility to compute truncated, sparse affinity matrices (using k-NN as done by sklearn.neighbors.kneighbors_graph) would probably make it possible to scale to large problems. However, this changes the behavior of the original algorithms. Maybe we should add an affinity constructor param to be able to choose between a Gaussian kernel, cosine similarity (I have this implemented in another branch) and their sparse knn-truncated equivalents (e.g. using kneighbors_graph); see the sketch below.
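
A minimal sketch of such a dispatch (the function name, parameter names and defaults are hypothetical; cosine similarity is omitted for brevity):

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.neighbors import kneighbors_graph

    def build_affinity(X, affinity='rbf', gamma=20, n_neighbors=7):
        # Dense Gaussian (heat) kernel: O(n_samples ** 2) memory.
        if affinity == 'rbf':
            return rbf_kernel(X, gamma=gamma)
        # Sparse k-NN truncation: scales to much larger n_samples.
        elif affinity == 'knn':
            return kneighbors_graph(X, n_neighbors)
        raise ValueError("unknown affinity: %r" % affinity)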

@agramfort (Member) commented:

+1 for using sklearn.neighbors.kneighbors_graph to have a sparse affinity matrix so it scales to large problems.

@mblondel (Member) commented:

BTW, do we have an idea of the impact of the current API on memory consumption? I'm afraid that fancy indexing will create copies. I'm starting to think that Olivier's suggestion to use fit(X, unlabeled=X_u) could actually be nice. Recently, I was also thinking of an algorithm where having fit(X, validation=X_v) would be helpful, to automatically tune the hyperparameters against the validation set (if you have enough data, holding out some for validation purposes can be enough).

@ogrisel (Member) commented Sep 15, 2011

Interesting thoughts.

@larsmans (Member) commented Jan 6, 2012

Even though I still have to study this algorithm and probably won't have time to complete it myself, I found it a waste to just leave this pull request in its current state. I've rebased it onto master, squashed all of @clayw's commits into one, and cleaned it up a bit (moved to sklearn, mostly). The result is in my label_propagation branch.

@ogrisel (Member) commented Jan 6, 2012

Indeed, thanks @larsmans for reactivating this review. It had been on my back burner too for quite some time...

@clayw I had not noticed that you had implemented the sparse kNN approach (reviewers don't receive email notifications for pushes / commits, as opposed to PR comments). What is your impression w.r.t. the impact on scalability vs the dense heat kernel variant?

@jakevdp do you think this PR could benefit from some advanced arpack wizardry?

@ALL how should we proceed with the rest of this review? Shall @larsmans open a new PR from his branch, or would @clayw rather push -f the content of @larsmans' branch to his repo to update this PR?

BTW: it would be great to come up with a combined example that compares LabelSpreading with the semi-supervised naive Bayes models from @larsmans.

@larsmans (Member) commented Jan 6, 2012

I think a new PR is in order in any case; this one came from @clayw's master branch, which is not a very good idea.

@ogrisel (Member) commented Jan 6, 2012

Indeed. @clayw, do you want to start a new PR from a new branch of yours with the content of @larsmans' branch, or do you prefer to have @larsmans start the new PR himself?

Merge conflicts noted in the commit history:
	doc/modules/classes.rst
	doc/modules/label_propagation.rst
	sklearn/label_propagation.py
	sklearn/tests/test_label_propagation.py
@clayw (Contributor, author) commented Jan 10, 2012

Thank you guys for bringing this up and helping me out. I took a look at the KNN code and realized that it was fairly incomplete. I've now completed the KNN kernel, and on my laptop it scales to simulated data with 1 million points.

I also issued a new PR from my label-propagation branch: #547.

@larsmans it would be awesome to have a good benchmark of your NB code versus label propagation.

@clayw (Contributor, author) commented Jan 21, 2012

Closing this PR to remove confusion; #547 covers this work.

@clayw closed this Jan 21, 2012