
WIP : added a module for "Label Propagation" #301

Closed
clayw wants to merge 94 commits

Conversation

@clayw (Contributor) commented Aug 8, 2011

Label Propagation algorithms are a powerful family of semi-supervised learning algorithms. I originally wrote a demo for a workshop at the Noisebridge machine learning group, but now I think it's ready to join the scikits.learn codebase.

I implemented two label propagation algorithms, "label spreading" and "label propagation", including one example that demonstrates performance versus an SVM with a small number of labels and another showing how the algorithm can learn a complex structure.

I also included a little documentation describing the algorithm.

@ogrisel (Member) commented Aug 8, 2011

Interesting contrib. Before diving into the actual review work, could you please run the pep8 script and fix all issues?

http://pypi.python.org/pypi/pep8


Review thread on this diff hunk:

    # generate ring with inner box
    num_points = 100
    outer_circ_xs = [np.cos(2*np.pi*x/num_points) for x in range(0,num_points)]
Member commented on the diff above:

Rather than a list comprehension, just use numpy:

    outer_circ_xs = np.cos(np.linspace(0, 2 * np.pi, num_points))

and so on. Also, following the scikit conventions, num_points should be renamed to something like n_samples_per_circle.

@ogrisel (Member) commented:

Actually you could just reuse the noisy circles code generation snippet from the following example in my power iteration clustering branch.

https://github.com/ogrisel/scikit-learn/blob/power-iteration-clustering/examples/cluster/plot_clustering_toy_2D_circles.py

This code could be factorized into the scikits.learn.datasets.samples_generator package.
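
For reference, a minimal sketch of what such a factored generator could look like (the function name and signature below are hypothetical, not an existing samples_generator API):

    import numpy as np

    def make_circle(n_samples=100, radius=1.0, noise=0.0, seed=0):
        # Hypothetical generator: n_samples points on a circle of the
        # given radius, with optional Gaussian noise added.
        rng = np.random.RandomState(seed)
        t = np.linspace(0, 2 * np.pi, n_samples)
        X = radius * np.column_stack([np.cos(t), np.sin(t)])
        return X + noise * rng.randn(n_samples, 2)

The ring-with-inner-box data would then be two calls with different radii.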

@mblondel (Member) commented Aug 9, 2011

I like the idea of encoding unlabeled examples with -1, as having X and y of the same size is a good consistency check. The only thing that I'm worried about is that such an API will force us to use views of the data (X[y == -1] and X[y != -1]) and thus may be suboptimal in terms of performance. Another possible API would be to pass X as a tuple.

X = (X_labeled, X_unlabeled)
LabelPropagation.fit(X, y)
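
To make the performance concern concrete: boolean-mask indexing in numpy returns a copy rather than a view, so splitting X this way duplicates the selected rows (a small self-contained check, with illustrative data):

    import numpy as np

    X = np.random.randn(10, 2)
    y = np.array([0, 1, -1, -1, 0, -1, 1, -1, -1, 0])

    # Boolean-mask ("fancy") indexing copies the selected rows:
    X_unlabeled = X[y == -1]
    print(np.may_share_memory(X, X_unlabeled))  # False: memory duplicated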

@ogrisel (Member) commented Aug 9, 2011

If we decide not to keep the -1 trick for whatever reason, I think I would rather go for:

LabelPropagation.fit(X_labeled, y, unlabeled=X_unlabeled)

instead of X being a tuple.

@clayw (Contributor, author) commented Aug 9, 2011

Either of those input methods is fine. I like the unlabeled trick that @GaelVaroquaux suggested, which is why I changed the implementation to it.

Keep in mind that the probability distributions over the true ground-truth labels can change after running the algorithm.

@ogrisel (Member) commented Sep 14, 2011

This indeed looks much clearer (esp. the variable / attribute names and docstring). I agree that the active learning example is neat. For the structure example, I think it would be worth using two separate pylab figures instead of subplots (have a look at other examples in the scikit, such as http://scikit-learn.sourceforge.net/dev/auto_examples/decomposition/plot_faces_decomposition.html). The sphinx doc will aggregate them all.

You could also do the same for the active learning steps: one figure per active learning round trip, to see the evolution of the uncertain samples. That will slow down the execution when run manually, though. There might be a mode to avoid the blocking behavior of the pl.show() call. Maybe @GaelVaroquaux or @agramfort know?

@GaelVaroquaux (Member) commented:

    There might be a mode to avoid the blocking behavior of the pl.show() call.

By definition no, as pl.show() is used to start the event loop.

@ogrisel (Member) commented Sep 14, 2011

Alright, then we can just defer the call to pl.show() to the end of the active learning loop.
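
A minimal sketch of that structure (dummy data standing in for the real active learning loop):

    import numpy as np
    import pylab as pl

    X = np.random.randn(50, 2)

    # One figure per active learning round; defer pl.show() (which
    # starts the blocking event loop) until after the loop so all
    # figures appear together.
    for round_idx in range(3):
        pl.figure(round_idx)
        pl.scatter(X[:, 0], X[:, 1])
        pl.title("Active learning round %d" % round_idx)

    pl.show()  # blocks here, once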

@agramfort (Member) commented:

Please pull from https://github.com/agramfort/scikit-learn/tree/clayw-master; there are 3 commits.

I cannot issue a pull request to your clone; I don't know why.

Quick remarks:

  • I feel there are too many examples and a lot of redundant code in them.
  • This code needs to be improved so that it scales to large datasets. It currently requires assembling an n_samples x n_samples matrix, which means it does not scale.
  • Please have a look at how to use np.random.RandomState, and use more broadcasting (see the sketch after this comment).

But I like the examples, especially the ones with the decision functions on iris :)
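
For readers unfamiliar with those idioms, a minimal sketch (illustrative data, unrelated to the PR's code):

    import numpy as np

    # Seeded generator instead of the global numpy random state,
    # so results are reproducible:
    rng = np.random.RandomState(42)
    X = rng.randn(100, 2)

    # Broadcasting instead of a Python loop: scale each feature
    # column by a per-feature factor in one vectorized expression.
    scales = np.array([2.0, 0.5])  # shape (2,) broadcasts over rows
    X_scaled = X * scales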

@GaelVaroquaux (Member) commented:

I have been glancing at this pull request, and it seems to me that it is based on diffusion maps (or spectral embedding). Looking at the code:

  • A graph Laplacian is built, and normalized on only one side.
  • A generalized eigenvalue problem is solved by a power iteration method.

I don't have access to the paper, so I cannot check the maths and am bound to do some reverse engineering from the code; this might be wrong. (As a side note, I am a bit uneasy about having an algorithm in the scikit with no downloadable source. Is there another reference besides the 'Semi-Supervised Learning' book?)

If I am right, the hand-coded power iteration method is probably not the right way to solve this. Using arpack or pyamg to do the power iteration would work much better; @emmanuelle and I have run this on graphs with millions of nodes. See the code in sklearn.cluster.spectral, or in null_space in manifold.local_linear_embedding.

Also, to make this algorithm scalable, it should be able to use a sparse affinity graph, for instance using a neighbors_graph and the graph_laplacian helper function (see the sketch after this comment).

I know that this remark adds a fair amount of work to the pull request, but in my eyes it is important to pay the cost upfront in order to reduce the long-term maintenance of the code and to keep numerical efficiency in the scikit.
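
A rough sketch of the sparse construction being suggested (using scipy.sparse.csgraph.laplacian to play the role of the graph_laplacian helper; data and parameters are illustrative):

    import numpy as np
    from scipy.sparse import csgraph
    from sklearn.neighbors import kneighbors_graph

    X = np.random.randn(1000, 2)

    # Sparse k-NN affinity graph: each row has only n_neighbors
    # nonzeros, so memory grows as O(n_samples * k) rather than
    # O(n_samples ** 2) for a dense kernel matrix.
    affinity = kneighbors_graph(X, n_neighbors=10, mode='connectivity')

    # Symmetrize (kneighbors_graph is not symmetric in general),
    # then build the normalized graph Laplacian.
    affinity = 0.5 * (affinity + affinity.T)
    laplacian = csgraph.laplacian(affinity, normed=True)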

@ogrisel (Member) commented Sep 15, 2011

Indeed, having the possibility to compute truncated, sparse affinity matrices (using k-NN as done by sklearn.neighbors.kneighbors_graph) would probably make it possible to scale to large problems. However, this changes the behavior of the original algorithms. Maybe we should add an affinity constructor param to be able to choose between a Gaussian kernel, cosine similarity (I have this implemented in another branch) and their sparse knn-truncated equivalents (e.g. using kneighbors_graph); see the sketch below.
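
A minimal sketch of such a dispatch (the function name, parameter names and defaults are hypothetical; cosine similarity is omitted for brevity):

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.neighbors import kneighbors_graph

    def build_affinity(X, affinity='rbf', gamma=20, n_neighbors=7):
        # Dense Gaussian (heat) kernel: O(n_samples ** 2) memory.
        if affinity == 'rbf':
            return rbf_kernel(X, gamma=gamma)
        # Sparse k-NN truncation: scales to much larger n_samples.
        elif affinity == 'knn':
            return kneighbors_graph(X, n_neighbors)
        raise ValueError("unknown affinity: %r" % affinity)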

@agramfort (Member) commented:

+1 for using sklearn.neighbors.kneighbors_graph to have a sparse affinity matrix so it scales to large problems.

@mblondel (Member) commented:

BTW, do we have an idea of the impact of the current API on memory consumption? I'm afraid that fancy indexing will create copies. I'm starting to think that Olivier's suggestion to use fit(X, unlabeled=X_u) could actually be nice. Recently, I was also thinking of an algorithm where having fit(X, validation=X_v) would be helpful, to automatically tune the hyperparameters against the validation set (if you have enough data, holding out some for validation purposes can be enough).

@ogrisel (Member) commented Sep 15, 2011

Interesting thoughts.

@larsmans (Member) commented Jan 6, 2012

Even though I still have to study this algorithm and probably won't have time to complete it myself, I found it a waste to just leave this pull request in its current state. I've rebased it onto master, squashed all of @clayw's commits into one, and cleaned it up a bit (moved to sklearn, mostly). The result is in my label_propagation branch.

@ogrisel (Member) commented Jan 6, 2012

Indeed, thanks @larsmans for reactivating this review. It had been on my back burner too for quite some time...

@clayw I had not noticed that you had implemented the sparse kNN approach (reviewers don't receive email notifications for pushes / commits, as opposed to PR comments). What is your impression w.r.t. the impact on scalability vs the dense heat kernel variant?

@jakevdp do you think this PR could benefit from some advanced arpack wizardry?

@ALL how should we proceed with the rest of this review? Shall @larsmans open a new PR from his branch, or would @clayw rather push -f the content of @larsmans' branch to his repo to update this PR?

BTW: it would be great to come up with a combined example that compares LabelSpreading with the semi-supervised naive Bayes models from @larsmans.

@larsmans (Member) commented Jan 6, 2012

I think a new PR is in order in any case; this one came from @clayw's master branch, which is not a very good idea.

@ogrisel (Member) commented Jan 6, 2012

Indeed. @clayw, do you want to start a new PR from a new branch of yours with the content of @larsmans' branch, or do you prefer to have @larsmans start the new PR himself?

Merge conflicts noted in the commit history:
	doc/modules/classes.rst
	doc/modules/label_propagation.rst
	sklearn/label_propagation.py
	sklearn/tests/test_label_propagation.py
@clayw (Contributor, author) commented Jan 10, 2012

Thank you guys for bringing this up and helping me out. I took a look at the KNN code and realized that it was fairly incomplete. I've now completed the KNN kernel, and on my laptop it scales to simulated data with 1 million points.

I also issued a new PR from my label-propagation branch: #547.

@larsmans it would be awesome to have a good benchmark of your NB code versus label propagation.

@clayw (Contributor, author) commented Jan 21, 2012

Closing this PR to remove confusion; #547 covers this work.

@clayw closed this Jan 21, 2012