
FEA Add Self Training Estimator #11682

Merged (148 commits, Nov 20, 2020)

Conversation

@plbecker (Contributor) commented Jul 25, 2018

Reference Issues/PRs

Fixes #1243.

What does this implement/fix? Explain your changes.

Implements a meta-classifier for semi-supervised learning based on the original Yarowsky self-training algorithm (see http://www.aclweb.org/anthology/P95-1026 for details).
Here is a comparison graph of our implementation on the Iris dataset. You can find the code under examples/semi_supervised/plot_self_training_performance.py.
[Comparison graph]
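
For readers unfamiliar with the algorithm, the core loop can be sketched roughly as follows. This is a simplified illustration, not the exact code in this PR; the function name and the base_classifier, threshold and max_iter parameters are placeholder names:

import numpy as np
from sklearn.base import clone

def self_training_fit(base_classifier, X, y, threshold=0.75, max_iter=10):
    # Sketch of Yarowsky-style self-training: unlabeled samples are
    # marked with y == -1; each round fits on the labeled subset and
    # promotes predictions whose probability exceeds `threshold`.
    y = np.copy(y)
    clf = clone(base_classifier)
    for _ in range(max_iter):
        labeled = y != -1
        clf.fit(X[labeled], y[labeled])
        if labeled.all():
            break
        proba = clf.predict_proba(X[~labeled])
        confident = proba.max(axis=1) > threshold
        if not confident.any():
            break  # no confident predictions left; stop early
        new_idx = np.flatnonzero(~labeled)[confident]
        y[new_idx] = clf.classes_[proba.argmax(axis=1)[confident]]
    return clf, y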

Any other comments?

This PR was created in collaboration with @oliverrausch.
This PR is a work in progress; we would continue working on it if you would be willing to merge it once it is completed.

@orausch orausch force-pushed the self-training branch 3 times, most recently from 86574f8 to af65887 Compare July 26, 2018 01:11
@jnothman (Member)

I think I failed to post an issue, but I've also recently seen work that mentions the effectiveness of tri-training. Do you know anything about the relative effectiveness of these models? I suppose tri-training is harder to define for regression...

@orausch commented Jul 26, 2018

For now we've only been testing the simpler, original self-training algorithm, which is in this PR, but implementing something similar to tri-training or co-training seems like a natural next step once this feature is done.

@plbecker (Contributor, Author) commented Jul 26, 2018

Tri-training looks promising in the paper (it outperforms self-training); this might be something we could work on after finishing this PR.

@jnothman (Member) commented Jul 26, 2018 via email

@plbecker (Contributor, Author) commented Jul 26, 2018

Sure, that works for us. Is this generally a feature you would potentially include (once it is done)?

@jnothman (Member) commented Jul 26, 2018 via email

plbecker and others added 2 commits July 28, 2018 17:03
Co-authored-by: oliverrausch <oliverrausch99@gmail.com>
Co-authored-by: pr0duktiv <patrice@5becker.de>
@jnothman (Member) commented Aug 7, 2018

All the CI say no! :\

@plbecker (Contributor, Author) commented Aug 8, 2018

Yes, we will fix that as soon as possible. We have exams for the next two weeks and will get back to it then.

(five resolved review threads on sklearn/semi_supervised/self_training.py)
              'predict']),
DelegatorData('SelfTraining', lambda est: SelfTraining(est),
              skip_methods=['transform', 'inverse_transform', 'score',
(Member)

Why are these skipped?

(Author)

I'm skipping transform and inverse_transform because, to my understanding, they aren't part of the estimator API (and therefore I don't expose them as methods).

predict_proba is skipped because the SelfTrainingClassifier throws an error if it is missing.

predict_log_proba and predict are no longer skipped.

@orausch commented Sep 13, 2018

Concerning score, I think there might be an inconsistency in the test. The score method I use comes straight from ClassifierMixin, with the following signature:

def score(self, X, y, sample_weight=None):

However, the test calls score with only X, which causes it to fail. I suspect the reason is that, since Pipeline allows score to be called with y=None, the test uses the following score method for the estimator:

def score(self, X, *args, **kwargs):

This causes the test to fail on my classifier because ClassifierMixin expects a y argument.

(Author)

In general, I'm a bit confused as to what the score method is supposed to do, since some estimators (for example those based on ClassifierMixin, such as RFE and EllipticEnvelope) return the accuracy score on the passed X and y.

However, other estimators, like PCA and KernelDensity, use their own version of score that completely ignores y and only uses X.

(Member)

Yes, score can be a little confusing. But PCA and KernelDensity are not supervised so their evaluation is not by default against a ground truth. They report model likelihoods instead.

Yes, it might be a mistake in the test to not try passing y to score.

(Author)

I've made a PR to fix the test. Once it is merged I will be able to remove score from the skipped methods.

(two resolved review threads on sklearn/semi_supervised/self_training.py)
@orausch commented Sep 12, 2018

@jnothman Thanks for the review, I'll be addressing those issues shortly. I have a question about checking if SelfTrainingClassifier is fitted. I know about sklearn.utils.validation.check_is_fitted, but it only seems to verify that an attribute exists.

Since this class (unlike other estimators) doesn't create any attributes during the fitting process that would signify that fitting is done, would it be acceptable to introduce a signifier variable (something like is_fitted) during the fit method, such that check_is_fitted(self, "is_fitted") returns true iff the estimator is fitted?

EDIT: or I could perhaps create an attribute base_estimator_ that is cloned from base_estimator at the start of the fit method, and then fit that estimator.
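
A minimal sketch of that second option (attribute names are hypothetical; the trailing underscore follows scikit-learn's convention for attributes set during fit):

from sklearn.base import BaseEstimator, clone
from sklearn.utils.validation import check_is_fitted

class SelfTrainingClassifier(BaseEstimator):
    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    def fit(self, X, y):
        # Cloning here means a pre-fitted base_estimator does not make
        # check_is_fitted wrongly report this meta-estimator as fitted.
        self.base_estimator_ = clone(self.base_estimator)
        self.base_estimator_.fit(X, y)  # (self-training loop omitted)
        return self

    def predict(self, X):
        check_is_fitted(self, 'base_estimator_')
        return self.base_estimator_.predict(X)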

@jnothman (Member)

Can't you just delegate to the underlying estimator to check if fitted?

@orausch commented Sep 13, 2018

Yes, but I decided against this because this would erroneously report the estimator as fitted if a fitted estimator is passed to the class. This seems like incorrect behavior to me.

I will be adding the predicted labels and the iteration in which they were labeled as attributes, so I'll check whether the estimator is fitted based on those.

@lorentzenchr (Member)

@orausch Thanks for addressing my comments.

Following the conversation in this PR, several core developers seemed to welcome this as a good addition, but none approved. Currently, I can only assess the code, which is in good shape with very little work left. I cannot, however, judge the usefulness, robustness and safety of this self-training semi-supervised algorithm. A little advocate in my head is asking: How can self-training gain more information about the relation between label y and features X than there is in the labeled data alone? I'd like to see an example demonstrating exactly this: a performance comparison of a classifier fit on labeled data alone with a self-trained classifier using labeled + unlabeled data.

Therefore my question: shall we postpone and milestone v0.25 instead? This would give us time to find and evaluate such an example.

@orausch commented Nov 17, 2020

How can self-training gain more information about the relation between label y and features X than there is in the labeled data alone?

If this weren't possible, semi-supervised learning wouldn't exist as a term 😄. Joking aside, I would anecdotally say that self-training is one of the more established and popular approaches to semi-supervised learning. What kind of "example" do you have in mind? We have a toy example in this PR, but it's understandably not convincing.

I think it might be good to ping some of the previous reviewers to get their thoughts on this @jnothman @NicolasHug.

Interesting side note: I just checked and the linked issue is the 20th oldest in sklearn!

@lorentzenchr (Member) commented Nov 17, 2020

If this weren't possible, semi-supervised learning wouldn't exist as a term 😄.

I come from a very supervised environment 🤣 (inside joke)

st50 = (SelfTrainingClassifier(base_classifier).fit(X, y_50),
        y_50, 'Self-training 50% data')

rbf_svc = (SVC(kernel='rbf', gamma=.5).fit(X, y), y, 'SVC with rbf kernel')
(Member)

It would be nice to also have an rbf_svc_30, fitted on y_30.

Can you also print the achieved accuracy of all the models on the full (or better test) dataset?

(Author)

Hmm, I'm not really sure how we could do that without making the layout messy.
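
For reference, the printing half of the suggestion could look roughly like this, independent of the plot layout. This is a sketch that assumes the example's conventions: unlabeled points in y_30 are marked with -1, and st50/rbf_svc are the (classifier, labels, title) tuples shown above:

from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Hypothetical: an SVC fitted only on the 30%-labeled subset.
mask_30 = y_30 != -1
rbf_svc_30 = (SVC(kernel='rbf', gamma=.5).fit(X[mask_30], y_30[mask_30]),
              y_30, 'SVC (30% of labels)')

# Report the accuracy of each model on the fully labeled dataset.
for clf, _, title in (st50, rbf_svc_30, rbf_svc):
    print('%s: %.3f' % (title, accuracy_score(y, clf.predict(X))))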

@rth (Member) commented Nov 18, 2020

We have a toy example in this PR, but it's understandably not convincing.

Yes, I think finding a convincing example is the main blocker to merging this PR.
From feedback I heard while discussing this PR with other core devs a few months ago, label propagation (another semi-supervised model merged some time ago) isn't working very well in practical applications, and we would like to avoid that situation here.

@ogrisel (Member) commented Nov 18, 2020

Indeed, semi-supervised learning can probably only really shine on medium-sized datasets with at least a few hundred labeled samples and at least a few thousand unlabeled samples.

Maybe trying to demonstrate its usefulness on 20 newsgroups classification would work? That dataset has 18k samples. I assume we could randomly extract 10 to 100 samples per class from the training set and consider all the remaining training samples as unlabeled. For the model we could use a pipeline of TfidfVectorizer with ngram_range=(1, 2), min_df=5 and max_df=0.8 + SGDClassifier with some amount of l2 or elastic net regularization.

Feel free to take inspiration from https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py to design the pipeline.

Alternatively, one might consider the adult census dataset from OpenML.

For the results it would be interesting to compare the test accuracy of the self-training classifier trained on 10% of the training samples labeled (+ the remaining samples treated as unlabeled) vs the equivalent 100% supervised model.
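
A rough sketch of that suggestion, assuming the estimator from this PR is available as sklearn.semi_supervised.SelfTrainingClassifier. The vectorizer parameters are taken from the comment above; the SGDClassifier settings and the 10% label masking are illustrative. Self-training needs a base classifier with predict_proba, hence the probabilistic loss:

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.semi_supervised import SelfTrainingClassifier

data = fetch_20newsgroups(subset='train')
y = np.asarray(data.target)

# Keep roughly 10% of the labels; -1 marks a sample as unlabeled.
rng = np.random.RandomState(42)
y_masked = np.where(rng.rand(len(y)) < 0.1, y, -1)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), min_df=5, max_df=0.8)),
    # loss='log' provides the predict_proba that self-training relies on
    # (the loss was renamed to 'log_loss' in scikit-learn 1.1).
    ('clf', SelfTrainingClassifier(
        SGDClassifier(loss='log', penalty='l2', alpha=1e-4))),
])
pipeline.fit(data.data, y_masked)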

@ogrisel ogrisel removed this from the 0.24 milestone Nov 18, 2020
@ogrisel (Member) commented Nov 18, 2020

I removed the milestone so as not to block the release, but if a convincing example is contributed before the final release we can still consider merging this PR in time.

@orausch commented Nov 18, 2020

@ogrisel I wrote up a quick example for the newsgroups dataset: https://gist.github.com/orausch/acf62e3fbf0ea5176f768d3dd3340ae1

Would something like this be sufficient?

@ogrisel (Member) commented Nov 18, 2020

Would something like this be sufficient?

This looks great indeed. Please add it to the PR.

@rth (Member) left a comment

Thanks @orausch! I did another pass on the code; generally it LGTM. The example looks quite good as well.

A few minor comments are below, and it would be good to mention somewhere (in the parameter docstring) that the value of the threshold is linked to the calibration of the classifier, ideally with a link to the calibration user-guide section.

(resolved review thread on doc/modules/semi_supervised.rst)
print()


if __name__ == "__main__":
(Member)

I think we can just run the example (rename it to plot_*) and remove the example output above? If the run time is an issue (as far as I can tell due to LabelSpreading), we can skip it in CI:

if 'CI' not in os.environ:
    # LabelSpreading takes too long to run in the online documentation
    # add code here

Without LabelSpreading it should run in less than 30s, I think?

The 20 newsgroups dataset is already used in other examples, so it should be in the cache.

(resolved review thread on examples/semi_supervised/semisupervised_newsgroups.py)
@cmarmo (Member) left a comment

One approval... almost there... :)

(resolved review threads on doc/modules/semi_supervised.rst and sklearn/semi_supervised/_self_training.py)
Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
@lorentzenchr (Member) left a comment

The new example on the 20 newsgroups dataset shows the added value of self-training, thanks!

orausch and others added 2 commits November 20, 2020 13:00
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@lorentzenchr lorentzenchr changed the title from "[MRG] Implement Self Training Estimator [See #1243]" to "FEA Add Self Training Estimator" on Nov 20, 2020
@lorentzenchr (Member) left a comment

@orausch @plbecker Thanks for this new meta-estimator.
@orausch Thank you very much for your effort lately. Let's squeeze this into the upcoming release 🚀

@lorentzenchr (Member)

The CI failure in "Test Docs" seems unrelated; CI was completely green before my last change. Therefore merging.

=================================== FAILURES ===================================
_____________________ [doctest] loading_other_datasets.rst _____________________
093 data and experiments, that allows everybody to upload open datasets.
094 
095 The ``sklearn.datasets`` package is able to download datasets
096 from the repository using the function
097 :func:`sklearn.datasets.fetch_openml`.
098 
099 For example, to download a dataset of gene expressions in mice brains::
100 
101   >>> from sklearn.datasets import fetch_openml
102   >>> mice = fetch_openml(name='miceprotein', version=4)
UNEXPECTED EXCEPTION: OpenMLError('Dataset with data_id 40966 not found.')
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/testvenv/lib/python3.9/doctest.py", line 1336, in __run
    exec(compile(example.source, filename, "single",
  File "<doctest loading_other_datasets.rst[7]>", line 1, in <module>
  File "/home/vsts/work/1/s/sklearn/utils/validation.py", line 60, in inner_f
    return f(*args, **kwargs)
  File "/home/vsts/work/1/s/sklearn/datasets/_openml.py", line 873, in fetch_openml
    features_list = _get_data_features(data_id, data_home)
  File "/home/vsts/work/1/s/sklearn/datasets/_openml.py", line 450, in _get_data_features
    json_data = _get_json_content_from_openml_api(
  File "/home/vsts/work/1/s/sklearn/datasets/_openml.py", line 179, in _get_json_content_from_openml_api
    raise OpenMLError(error_message)
sklearn.datasets._openml.OpenMLError: Dataset with data_id 40966 not found.
/home/vsts/work/1/s/doc/datasets/loading_other_datasets.rst:102: UnexpectedException

@lorentzenchr lorentzenchr merged commit 2f6af71 into scikit-learn:master Nov 20, 2020
5 checks passed
Development

Successfully merging this pull request may close these issues.

Meta-estimator for semi-supervised learning