FEA Add Self Training Estimator #11682
Conversation
I think I failed to post an issue, but I've also recently seen work that mentions the effectiveness of tri-training. Do you know anything about the relative effectiveness of these models? I suppose tri-training is harder to define for regression...

For now we've only been testing the simpler, original self-training algorithm, which is in this PR, but implementing something similar to tri-training or co-training seems like a natural next step once this feature is done.

Tri-training looks promising in the paper (outperforming self-training); this might be something we could work on after finishing this PR.
Well, you'd be best pinging in a few weeks, as we are currently trying to push out a release.
Sure, that works for us. Is this generally a feature you would potentially include (once it is done)?

> Is this generally a feature you would potentially include?

I think it's an area where we should improve our current offerings, yes.
Co-authored-by: oliverrausch <oliverrausch99@gmail.com> Co-authored-by: pr0duktiv <patrice@5becker.de>
All the CI say no! :\

Yes, we will fix that as soon as possible. We have exams for the next two weeks and will get back to it then.
sklearn/tests/test_metaestimators.py

                       'predict']),
        DelegatorData('SelfTraining', lambda est: SelfTraining(est),
                      skip_methods=['transform', 'inverse_transform', 'score',
Why are these skipped?
I'm skipping transform and inverse_transform because, to my understanding, they aren't part of the estimator API (and therefore I don't expose them as methods). predict_proba is skipped because the SelfTrainingClassifier throws an error if it is missing. predict_log_proba and predict are no longer skipped.
Concerning score, I think there might be an inconsistency in the test. The score method I use comes straight from ClassifierMixin, with the following signature:

def score(self, X, y, sample_weight=None):

However, the test calls score with only X, which causes it to fail. I suspect the reason for this is that since Pipeline allows score to be called with y=None, the test uses the following score method for the estimator:

def score(self, X, *args, **kwargs):

This causes the test to fail on my classifier because ClassifierMixin expects a y argument.
In general, I'm a bit confused as to what the score method is supposed to do, since some classifiers (for example ClassifierMixin, RFE, EllipticEnvelope) return the accuracy score on the passed X and y. However, other estimators, like PCA and KernelDensity, use their own version of score that completely ignores y and only uses X.
Yes, score can be a little confusing. But PCA and KernelDensity are not supervised so their evaluation is not by default against a ground truth. They report model likelihoods instead.
Yes, it might be a mistake in the test to not try passing y to score.
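A minimal illustration of the two conventions described above (the models here are chosen purely for demonstration): a supervised estimator's score compares predictions against the ground-truth y, while an unsupervised density model's score ignores y and reports a log-likelihood.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] > 0).astype(int)

clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))  # ClassifierMixin.score: mean accuracy, requires y

kde = KernelDensity().fit(X)
print(kde.score(X))     # KernelDensity.score: total log-likelihood, y unused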
I've made a PR to fix the test. Once it is merged I will be able to remove score from the skipped methods.
@jnothman Thanks for the review, I'll be addressing those issues shortly. I have a question about checking if SelfTrainingClassifier is fitted. I know about check_is_fitted, but since this class (unlike other estimators) doesn't create any attributes during the fitting process that would signify that fitting is done, would it be acceptable to introduce a signifier variable? EDIT: or I could perhaps create an attribute instead.
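For reference, a minimal sketch of the convention under discussion (the class and attribute names below are purely illustrative): check_is_fitted treats an estimator as fitted once it exposes an attribute with a trailing underscore that was set during fit, so setting any such attribute is usually enough.

from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_is_fitted

class ToyEstimator(BaseEstimator):
    def fit(self, X, y=None):
        # Any attribute ending in an underscore set here marks the
        # estimator as fitted for check_is_fitted.
        self.n_features_in_ = len(X[0])
        return self

    def predict(self, X):
        check_is_fitted(self)  # raises NotFittedError before fit() is called
        return [0] * len(X)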
Can't you just delegate to the underlying estimator to check if fitted?

Yes, but I decided against this because it would erroneously report the estimator as fitted if a fitted estimator is passed to the class. This seems like incorrect behavior to me. I will be adding the predicted labels and the iteration in which they were labeled as attributes, so I'll check if the estimator is fitted based on those.
@orausch Thanks for addressing my comments. Following the conversation in this PR, several core developers seemed to welcome this as a good addition, but none approved. Currently, I can only assess the code, which is in good shape; only very little work is left. I cannot, however, judge the usefulness, robustness and safety of this self-training semi-supervised algorithm. A little advocate in my head is asking: how can self-training gain more information about the relation between label and features than is already contained in the labeled data? Therefore my question: shall we postpone and milestone v0.25 instead? This would give us time to find and evaluate a convincing example.
If this wasn't possible, semi-supervised learning wouldn't exist as a term. I think it might be good to ping some of the previous reviewers to get their thoughts on this: @jnothman @NicolasHug. Interesting side note: I just checked, and the linked issue is the 20th oldest in sklearn!
I come from a very supervised environment.
examples/semi_supervised/plot_self_training_varying_threshold.py
st50 = (SelfTrainingClassifier(base_classifier).fit(X, y_50),
        y_50, 'Self-training 50% data')

rbf_svc = (SVC(kernel='rbf', gamma=.5).fit(X, y), y, 'SVC with rbf kernel')
It would be nice to have one rbf_svc_30, fitted on y_30. Can you also print the achieved accuracy of all the models on the full (or, better, test) dataset?
Hmm, I'm not really sure how we could do that without making the layout messy.
Yes, I think finding a convincing example is the main blocker to merging this PR.
Indeed, semi-supervised learning can probably only really shine on medium-sized datasets with at least a few hundred labeled samples and at least a few thousand unlabeled samples. Maybe trying to demonstrate its usefulness on 20 newsgroups classification would work? This has 18k samples. I assume we could randomly extract 10 to 100 samples per class in the training set and consider all the remaining training set samples as unlabeled.

For the model we could use a pipeline of TfidfVectorizer with ngram_range=(1, 2), min_df=5 and max_df=0.8 + SGDClassifier with some amount of l2 or elasticnet regularization. Feel free to take inspiration from https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py to design the pipeline. Alternatively, one might consider the adult census dataset from openml.

For the result it would be interesting to compare the test accuracy of the self-training classifier trained on 10% of the training samples labeled (plus the remaining samples considered as unlabeled) vs the 100% supervised equivalent model.
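A rough sketch of the setup suggested above, just to make it concrete (the labeled fraction, the hyperparameter values, and the choice of a probabilistic loss for SGDClassifier are assumptions, not the final example):

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5, max_df=0.8)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# Keep roughly 10% of the training labels; mark the rest as unlabeled (-1).
rng = np.random.RandomState(42)
y_semi = np.copy(train.target)
y_semi[rng.rand(len(y_semi)) > 0.1] = -1

# SGDClassifier needs a probabilistic loss so that predict_proba is available
# ('log_loss' in recent scikit-learn versions, 'log' in older ones).
base = SGDClassifier(loss='log_loss', penalty='l2', alpha=1e-5)
self_training = SelfTrainingClassifier(base, threshold=0.8)
self_training.fit(X_train, y_semi)
print("semi-supervised test accuracy:",
      self_training.score(X_test, test.target))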
I removed the milestone to not block the release but if a convincing example is contributed before the final release we can still consider merging this PR in time. |
@ogrisel I wrote up a quick example for the newsgroups dataset: https://gist.github.com/orausch/acf62e3fbf0ea5176f768d3dd3340ae1 Would something like this be sufficient? |
This looks great indeed. Please
Thanks @orausch ! I did another pass on the code, generally it LGTM. The example looks quite good as well.
A few minor comments are below, and it would be good to mention somewhere (in the parameter docstring) that the value of the threshold is linked to the calibration of the classifier, ideally with a link to the calibration user-guide section.
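To illustrate that point (a minimal sketch; the specific models and the 0.9 threshold are only assumptions): the threshold is compared against predict_proba outputs, so a base classifier with poorly calibrated probabilities, or with no predict_proba at all, can be wrapped in CalibratedClassifierCV first.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import LinearSVC

# LinearSVC has no predict_proba; calibration adds one and makes the
# probability threshold meaningful.
calibrated_svc = CalibratedClassifierCV(LinearSVC())
self_training = SelfTrainingClassifier(calibrated_svc, threshold=0.9)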
print()

if __name__ == "__main__":
I think we can just run the example (rename it to plot_*) and remove the example output above? If the run time is an issue, as far as I can tell due to LabelSpreading, we can skip it in CI:

if 'CI' not in os.environ:
    # LabelSpreading takes too long to run in the online documentation
    # add code here

Without LabelSpreading it should run in less than 30s, I think? The 20 newsgroup dataset is already used in other examples so it should be in the cache.
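One possible way to flesh out that guard, shown on toy data only so it stays self-contained (the real example would use the 20 newsgroups features; all names and parameter values below are illustrative):

import os

import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
# Label every tenth sample; the rest are unlabeled (-1).
y = np.full(200, -1)
y[::10] = (X[::10, 0] > 0).astype(int)

if 'CI' not in os.environ:
    # LabelSpreading takes too long to run in the online documentation,
    # so only fit it when building the docs locally.
    label_spreading = LabelSpreading(gamma=0.25, max_iter=20).fit(X, y)
    print(label_spreading.score(X[::10], y[::10]))  # accuracy on labeled points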
One approval... almost there... :)
Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
The new example on the 20 newsgroup shows the added value of self-training, thanks!
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
CI failure in "Test Docs" seems unrelated. CI has been completely green before my last change. Therefore merging.
Reference Issues/PRs
Fixes #1243.
What does this implement/fix? Explain your changes.
Implements a meta classifier for semi-supervised learning based on the original Yarowsky self-training algorithm (refer to http://www.aclweb.org/anthology/P95-1026 for details).
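For orientation, here is a minimal sketch of the self-training loop this kind of estimator performs (simplified and hypothetical; the actual estimator in this PR adds further options and edge-case handling):

import numpy as np

def self_train(clf, X, y, threshold=0.75, max_iter=10):
    """Simplified self-training loop; y uses -1 to mark unlabeled samples."""
    y = np.copy(y)
    for _ in range(max_iter):
        labeled = y != -1
        if labeled.all():
            break
        clf.fit(X[labeled], y[labeled])
        proba = clf.predict_proba(X[~labeled])
        confident = proba.max(axis=1) > threshold
        if not confident.any():
            break  # no prediction is confident enough, stop early
        unlabeled_idx = np.flatnonzero(~labeled)
        new_labels = clf.classes_[proba.argmax(axis=1)]
        y[unlabeled_idx[confident]] = new_labels[confident]
    # final fit on the originally labeled plus pseudo-labeled samples
    clf.fit(X[y != -1], y[y != -1])
    return clf, y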

Here is a comparison graph of our implemented version on the IRIS dataset. You can find the code under
examples/semi_supervised/plot_self_training_performance.py
.Any other comments?
This PR was created in collaboration with @oliverrausch.
This PR is a work in progress, and we'll continue working on it if you would be willing to merge it once it is completed.