New feature RobustWeightedEstimator : code + examples #42

Merged


TimotheeMathieu
Contributor

Algorithms to do robust classification and regression.

Using this code produces the two attached plots (the code used to generate them is in the example folder). The principle is to have algorithms that are robust to outliers both in the features X and in the labels Y. Most scikit-learn algorithms allow robust loss functions that provide robustness only in Y (the RANSAC and Theil-Sen estimators are exceptions).

[Image: sklearn-extra_robust_classif]
[Image: sklearn-extra_robust_regression]
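
To make the idea of outlier robustness concrete, here is a minimal sketch of a median-of-means estimate, one common robust alternative to the empirical mean. It is an illustration under stated assumptions, not the code from this PR: the function name `median_of_means`, the block count, and the synthetic data are all hypothetical, and only the Python standard library is used.

```python
import random
import statistics


def median_of_means(values, n_blocks, seed=0):
    """Robust mean estimate: shuffle, split into blocks, take the median of block means.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    # Deal the samples round-robin into n_blocks blocks of near-equal size.
    blocks = [shuffled[i::n_blocks] for i in range(n_blocks)]
    return statistics.median(statistics.fmean(block) for block in blocks)


rng = random.Random(42)
clean = [rng.gauss(0.0, 1.0) for _ in range(200)]
# Three gross outliers (hypothetical corrupted samples).
data = clean + [1000.0, 1000.0, 1000.0]

plain_mean = statistics.fmean(data)
robust_mean = median_of_means(data, n_blocks=11)
print(f"plain mean: {plain_mean:.2f}, median-of-means: {robust_mean:.2f}")
```

With three outliers and eleven blocks, at most three block means can be contaminated, so the median of the block means stays close to the true mean of 0 while the plain mean is pulled far away.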

Contributor

@rth rth left a comment

Thanks for the PR @TimotheeMathieu ! A first partial review is below.

In major points,

  • it would be necessary to add a user manual section with a short description of the algorithm (like this one), and maybe mention the known advantages / limitations of it with respect to other outlier robust learners in scikit-learn.
  • another major point is that this needs tests. In particular,
  • ideally it would be good to have a benchmark script in benchmarks/ to see how well this works on larger datasets (and to have a way to evaluate the performance impact of future changes)
  • also, illustrating that it works on some real world datasets would be useful (including a comparison with other outlier robust learners) even if it's not part of the PR -- or it could be an additional example. There is some discussion about it in the 2018 paper (and in particular in figure 8), but that doesn't say, unless I missed it, how well this approach performs compared to other classical approaches one might try. If you need more datasets, have a look at https://www.openml.org/, whose datasets can be loaded with sklearn.datasets.fetch_openml.

Outdated, resolved review comments on:
examples/plot_RobustClassification.py
examples/plot_RobustRegression.py
sklearn_extra/robust/RobustWeightedEstimator.py
sklearn_extra/robust/mean_estimators.py
sklearn_extra/robust/tests/test_RobustWeightedEstimator.py
@rth
Contributor

rth commented Oct 27, 2019

The codecov CI failure can be fixed by adding tests to increase the coverage of the added code.

@chkoar
Member

chkoar commented Nov 6, 2019

I would use snake_case in filenames.

Outdated, resolved review comments on doc/modules/robust.rst
@TimotheeMathieu
Contributor Author

The problem comes from EigenPro, and as it is not my code I could not quickly find where the bug is.

@chkoar
Member

chkoar commented Mar 4, 2020

I saw this, that's why I asked.

@TimotheeMathieu
Contributor Author

The link you gave is for test 136.1; from what I gathered, the current Travis test is 136.4 (I am new to CI tests).

@chkoar
Member

chkoar commented Mar 4, 2020

Basically, it is one build with different configurations. The one you are referring to runs against the nightly version of scikit-learn. For instance, the first two builds run against version 0.21.2 of scikit-learn, where the robust estimator doc test fails due to changes to check_is_fitted.

@rth what's your opinion here? In the README we say that we support scikit-learn (>=0.21)

@rth
Contributor

rth commented Mar 4, 2020

what's your opinion here? In the README we say that we support scikit-learn (>=0.21)

+1 to supporting just the one latest version of scikit-learn. @glemaitre was also for it.

@TimotheeMathieu
Contributor Author

OK, it is fixed. The compatibility problem was with sklearn 0.21; in my version (sklearn 0.22) there was no problem, so I didn't see it. Thanks for explaining how CI works.

Contributor

@rth rth left a comment

Sorry this is taking a long time to merge. I was hoping that scikit-learn-extra would be faster there than scikit-learn, otherwise it defeats the purpose of this repo somewhat.

Generally, code- and documentation-wise, LGTM. I have not done a very thorough review. I think we should merge.

@rth
Contributor

rth commented Jun 17, 2020

@ngoix What's your opinion about the general approach outlined in the user guide? Also cc @agramfort, in case you have any opinions.

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 17, 2020 via email

@TimotheeMathieu
Contributor Author

@rth No worries. I am already thankful that you all helped me make it better and agreed to teach a rookie how to do a PR. We can't expect you to be much faster than scikit-learn when you in fact have less manpower.

@rth
Contributor

rth commented Jun 17, 2020

@TimotheeMathieu Could you please merge upstream/master into this branch? That should fix the failing CI.

@TimotheeMathieu
Contributor Author

I merged, but CI still fails. It fails only for Python 3.8 and seems to be caused by sklearn-extra/utils/_cyfht.pyx. I don't understand how pyx files are imported, so I can't help. I have Python 3.8 and the test also fails on master.

TimotheeMathieu and others added 3 commits June 18, 2020 11:14
Fix CI Python 3.8

Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
@TimotheeMathieu
Contributor Author

TimotheeMathieu commented Jun 18, 2020

LooseVersion(sklearn.__version__) does not work in Python 3.8 with the scikit-learn dev version: it cannot parse the version of the current sklearn master, '0.24.dev0', because it contains a string component. This means that your fix does not work. I don't know how to fix this cleanly, but we could write a custom parser like

if int(sklearn.__version__.split('.')[1]) >= 24:

Would this be acceptable? It is a little hackish.

@rth
Contributor

rth commented Jun 18, 2020

Yes, it looks like LooseVersion is actually not reliable. I pushed a commit to switch to pkg_resources.parse_version, where pkg_resources is installed as part of setuptools. It would mean making setuptools a run time dependency in addition to being a build dependency. There doesn't seem to be a much better solution for this; see scikit-learn/scikit-learn#7980 (comment)

Edit: The problem with a custom regex is that we may potentially have the same issue when comparing against versions of scipy/numpy, so using standard tools for version comparison is probably safer.
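
To illustrate why ad hoc version checks are fragile, here is a minimal standard-library sketch. It is not the fix that was pushed (that one relies on pkg_resources.parse_version, as stated above); the helper `release_tuple` is hypothetical and deliberately ignores suffixes such as 'dev0', which is exactly the kind of shortcut a standard version parser avoids.

```python
import re


def release_tuple(version):
    """Return the leading numeric components of a version string as a tuple of ints.

    Hypothetical helper for illustration; suffixes such as 'dev0' or 'rc1' are ignored.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


dev_version = "0.24.dev0"

# Comparing a split component to an integer without converting it
# compares a str with an int, which raises TypeError on Python 3.
try:
    dev_version.split(".")[1] >= 24
    naive_comparison_failed = False
except TypeError:
    naive_comparison_failed = True

print(naive_comparison_failed)
print(release_tuple(dev_version) >= (0, 24))
print(release_tuple("0.21.2") >= (0, 22))
```

Because the helper drops pre-release markers, it would treat '0.24.dev0' and '0.24' as equal, so it is only a rough guard and not a substitute for a real version parser.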

@rth rth merged commit c903687 into scikit-learn-contrib:master Jun 26, 2020
@rth
Contributor

rth commented Jun 26, 2020

Merged. Thank you @TimotheeMathieu !
