WIP : added a module for "Label Propagation" #301
Conversation
Interesting contrib. Before diving into the actual review work, could you please run the pep8 script and fix all issues?
# generate ring with inner box
num_points = 100
outer_circ_xs = [np.cos(2 * np.pi * x / num_points) for x in range(0, num_points)]
Rather than a list comprehension, just use numpy:

outer_circ_xs = np.cos(np.linspace(0, 2 * np.pi, num_points))

and so on. Also, following the scikit naming conventions, num_points should be renamed to something like n_samples_per_circle.
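To make the suggestion concrete, here is a minimal self-contained sketch contrasting the loop-based and vectorized versions. The name n_samples_per_circle follows the renaming proposed above; note that np.linspace includes the 2*pi endpoint by default, so the values differ slightly from the range-based loop.

```python
import numpy as np

# Name follows the scikit convention suggested in the review
n_samples_per_circle = 100

# Vectorized: one linspace call replaces the per-point list comprehension.
# linspace includes the endpoint 2*pi, unlike the range-based loop below.
theta = np.linspace(0, 2 * np.pi, n_samples_per_circle)
outer_circ_xs = np.cos(theta)
outer_circ_ys = np.sin(theta)

# Original loop-based version, for comparison
xs_loop = [np.cos(2 * np.pi * x / n_samples_per_circle)
           for x in range(n_samples_per_circle)]
```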
Actually you could just reuse the noisy circles code generation snippet from the following example in my power iteration clustering branch.
This code could be factorized into the scikits.learn.datasets.samples_generator
package.
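As a sketch of what that factorization could look like: a small generator function in the spirit of the samples_generator module. The function name, signature, and defaults below are made up for illustration only; they are not the actual snippet from the branch mentioned above.

```python
import numpy as np

# Hypothetical helper factored out of the example code; the name
# make_noisy_circles and its parameters are illustrative, not the
# real scikits.learn.datasets.samples_generator API.
def make_noisy_circles(n_samples=100, noise=0.05, factor=0.5, seed=0):
    rng = np.random.RandomState(seed)
    theta = np.linspace(0, 2 * np.pi, n_samples, endpoint=False)
    outer = np.column_stack([np.cos(theta), np.sin(theta)])
    inner = factor * outer                       # smaller concentric circle
    X = np.vstack([outer, inner])
    X += rng.normal(scale=noise, size=X.shape)   # Gaussian jitter
    y = np.hstack([np.zeros(n_samples, dtype=int),
                   np.ones(n_samples, dtype=int)])
    return X, y

X, y = make_noisy_circles()
```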
I like the idea of encoding unlabeled examples with X = (X_labeled, X_unlabeled) in LabelPropagation.fit(X, y).

If we decide to not keep that, the unlabeled examples could be flagged some other way instead of X being a tuple.
Either of those input methods is fine. I like the unlabeled trick that @GaelVaroquaux suggested, which is why I changed the implementation to it. Keep in mind that probability distributions over the true ground labels can change after running the algorithm.
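For clarity, here is a tiny sketch of the marker-based alternative being discussed: one X array with unlabeled rows flagged in y rather than X being a tuple. The -1 marker and the variable names are illustrative assumptions, not the final API of this PR.

```python
import numpy as np

# One design matrix; unlabeled rows are flagged with -1 in y
# (the -1 convention here is an illustrative assumption).
X = np.array([[0.0], [0.1], [1.0], [1.1], [0.5]])
y = np.array([0, 0, 1, 1, -1])    # -1 means "no label known"

unlabeled_mask = (y == -1)
X_labeled = X[~unlabeled_mask]    # rows with a known label
X_unlabeled = X[unlabeled_mask]   # rows the algorithm must label
```

This keeps fit(X, y) with a single 2-D X, so no tuple unpacking is needed inside the estimator.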
This indeed looks much clearer (esp. the variable / attribute names and docstring). I agree that the active learning example is neat.

For the structure example I think it would be worth using two separate pylab figures instead of subplots (have a look at other examples in the scikit, such as http://scikit-learn.sourceforge.net/dev/auto_examples/decomposition/plot_faces_decomposition.html for instance). The sphinx doc will aggregate them all. You could also do the same for the active learning steps: one figure per active learning round trip, to see the evolution of the uncertain samples. That will slow down the execution when run manually though. Is there a mode to avoid the blocking behavior of the pl.show() call?
By definition no, as pl.show() is used to start the event loop.

Alright, then we can just defer the call to pl.show to the end of the active learning loop.
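A minimal sketch of the deferred-show pattern being agreed on here: create one figure per active-learning round inside the loop and make the single blocking pl.show() call only at the very end. The Agg backend line and the round count are only there so the sketch runs headless; they are not part of the example under review.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this sketch runs anywhere
import matplotlib.pyplot as pl

n_rounds = 3  # illustrative; the real example would use its own count
for round_idx in range(n_rounds):
    pl.figure()  # one separate figure per active-learning round trip
    pl.title("Active learning round %d" % round_idx)
    # ... plot the currently uncertain samples for this round here ...

# Single blocking call, deferred until all figures exist
pl.show()
```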
Please pull from https://github.com/agramfort/scikit-learn/tree/clayw-master; there are 3 commits. I cannot issue a pull request to your clone, I don't know why. Quick remarks:

but I like the examples, especially with the decision functions on iris :)
I have been glancing at this pull request, and it seems to me that it is based on diffusion maps (or spectral embedding). Looking at the code, my impression is that:
I don't have access to the paper, and thus I cannot check the maths and am bound to do some reverse engineering from the code, so this might be wrong. (As a side note, I am a bit uneasy about having an algorithm in the scikit with no downloadable source. Is there another reference than the 'Semi-supervised learning' book?)

If I am right, the hand-coded power iteration method is probably not the right way to solve this. Using arpack or pyamg to do the power iteration would work much better. @emmanuelle and myself have run this on graphs with millions of nodes. See the code in sklearn.cluster.spectral, or in

Also, to make this algorithm scalable, it should be able to use a sparse affinity graph, for instance using a neighbors_graph, and the graph_laplacian helper function. I know that this remark adds a fair amount of work to the pull request, but it is important in my eyes to pay the cost upfront in order to reduce the long term maintenance of the code and also to keep numerical efficiency in the scikit.
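To illustrate the suggestion of delegating the eigenproblem to ARPACK rather than hand-coding power iteration: scipy's eigsh wrapper can extract the smallest eigenpairs of a sparse graph Laplacian directly. The tiny 4-node graph below is purely illustrative, and this is a sketch of the technique, not code from the branch under review.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import eigsh

# Illustrative 4-node graph: edges 0-1, 0-2, 1-2, 2-3
W = csr_matrix(np.array([[0., 1., 1., 0.],
                         [1., 0., 1., 0.],
                         [1., 1., 0., 1.],
                         [0., 0., 1., 0.]]))
d = np.asarray(W.sum(axis=1)).ravel()
L = diags(d) - W          # unnormalized graph Laplacian L = D - W

# Hand the problem to ARPACK: 2 smallest-magnitude eigenpairs.
# For a connected graph the smallest eigenvalue of L is 0.
vals, vecs = eigsh(L, k=2, which="SM")
```

The same call works unchanged when W is a large sparse k-NN affinity matrix, which is where it pays off over a dense hand-rolled iteration.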
Indeed, having the possibility to compute truncated, sparse affinity matrices (using k-NN as done by
+1 for using sklearn.neighbors.kneighbors_graph to have a sparse affinity matrix so it scales to large problems.
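A short sketch of that suggestion: kneighbors_graph returns a CSR sparse matrix with only n_samples * n_neighbors stored entries, instead of a dense n x n kernel. The sample data below is made up for illustration.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.RandomState(0)
X = rng.rand(200, 2)  # illustrative data

# Sparse connectivity graph: 5 neighbors per sample, stored in CSR format.
# mode="distance" would store distances instead of 0/1 connectivity.
A = kneighbors_graph(X, n_neighbors=5, mode="connectivity",
                     include_self=False)
```

Memory is O(n_samples * n_neighbors) rather than O(n_samples ** 2), which is what makes the million-point scale mentioned later in this thread feasible.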
BTW, do we have an idea of the impact on memory consumption of the current API? I'm afraid that fancy indexing will create copies. I was starting to think that Olivier's suggestion to use
Interesting thoughts.
With examples, including an active learning demo with label propagation.
Even though I still have to study this algorithm and probably won't have time to complete it myself, I found it a waste to just leave this pull request in its current state. I've rebased it onto master, squashed all of @clayw's commits into one and cleaned it up a bit (moved to
Indeed, thanks @larsmans for reactivating this review; it was on my backburner too for quite some time... @clayw I had not noticed that you had implemented the sparse kNN approach (reviewers don't receive email notifications for pushes / commits, as opposed to PR comments). What is your impression w.r.t. the impact on scalability vs the dense heat kernel variant?

@jakevdp do you think this PR could benefit from some advanced arpack wizardry?

@ALL how should we proceed for the rest of this review? Shall @larsmans open a new PR on his branch, or @clayw might want to go on a

BTW: it would be great to come up with a combined example that compares
I think a new PR is in order in any case; this one came from @clayw's
Conflicts:
    doc/modules/classes.rst
    doc/modules/label_propagation.rst
    sklearn/label_propagation.py
    sklearn/tests/test_label_propagation.py
Thank you guys for bringing this up and helping me out. I took a look at the KNN code and realized that it was fairly incomplete. I completed the KNN kernel, and it scales on my laptop to handle simulated data with 1 million points. I also issued a new PR from my label-propagation branch: #547. @larsmans it would be awesome to have a good benchmark of your NB code versus label propagation.
Closing this PR to remove confusion; #547 covers this work.
Label propagation algorithms are a powerful family of semi-supervised learning algorithms. I originally wrote a demo for a workshop at the Noisebridge machine learning group, but now I think it's ready to join the scikits.learn codebase.

I implemented two label propagation algorithms, "label spreading" and "label propagation", including one example that demonstrates performance versus an SVM with a small number of labels, and another showing how the algorithm can learn a complex structure.

I also included some documentation describing the algorithm.
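For readers unfamiliar with the family, here is a toy numpy sketch of the label-spreading fixed-point update Y <- alpha * S @ Y + (1 - alpha) * Y0, where S is the symmetrically normalized affinity matrix and Y0 holds the known labels. All data, parameter values, and the -1 unlabeled marker below are illustrative assumptions, not the PR's actual implementation (which also supports a sparse kNN kernel).

```python
import numpy as np

# Two 1-D clusters; only one point per cluster is labeled (-1 = unlabeled)
X = np.array([0.0, 0.1, 0.2, 1.0, 1.1, 1.2])[:, None]
y = np.array([0, -1, -1, 1, -1, -1])

# Dense RBF affinity (a sparse k-NN graph would replace this at scale)
W = np.exp(-((X - X.T) ** 2) / 0.1)
np.fill_diagonal(W, 0)
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))      # D^{-1/2} W D^{-1/2} normalization

# One-hot encoding of the known labels; unlabeled rows stay all-zero
Y0 = np.zeros((6, 2))
Y0[y == 0, 0] = 1
Y0[y == 1, 1] = 1

alpha, Y = 0.9, Y0.copy()
for _ in range(100):                 # fixed-point iteration
    Y = alpha * (S @ Y) + (1 - alpha) * Y0

labels = Y.argmax(axis=1)            # propagated labels for all points
```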