New feature RobustWeightedEstimator : code + examples #42

TimotheeMathieu · 2019-10-26T13:16:30Z

Algorithms to do robust classification and regression.

Running this code produces the two attached plots (the code used to generate the plots is in the example folder). The principle is to have algorithms that are robust to outliers both in the features X and in the labels Y. Most sklearn algorithms allow robust loss functions that provide robustness only in Y (the exceptions are the RANSAC and Theil-Sen estimators).
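To make the idea of outlier robustness concrete, here is a minimal sketch of one standard robust mean estimator, median-of-means. It is an illustration only: the PR's file list includes mean_estimators.py, but whether that file uses exactly this form is an assumption, and the function name and parameters below are hypothetical.

```python
import random
import statistics


def median_of_means(values, n_blocks, rng):
    """Shuffle the data, split it into n_blocks blocks, and return
    the median of the per-block means (hypothetical helper)."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    # Strided slicing yields n_blocks blocks of near-equal size.
    blocks = [shuffled[i::n_blocks] for i in range(n_blocks)]
    return statistics.median(statistics.fmean(block) for block in blocks)


rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(99)]
data = clean + [1000.0]  # a single gross outlier

plain_mean = statistics.fmean(data)
robust_mean = median_of_means(data, n_blocks=11, rng=rng)

print("plain mean:     ", plain_mean)
print("median of means:", robust_mean)
```

The single outlier shifts the plain mean by about 10, while it can corrupt only one of the 11 block means, so the median of the block means stays near 0.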

rth

Thanks for the PR @TimotheeMathieu ! A first partial review is below.

Major points:

- It would be necessary to add a user manual section with a short description of the algorithm (like this one), and maybe mention its known advantages and limitations with respect to other outlier-robust learners in scikit-learn.
- Another major point is that this needs tests. In particular,
  - for the various helper functions that are added;
  - for the known edge/special cases of the estimator itself;
  - finally, ideally it should pass the common tests, i.e. it would be necessary to add it here, which will run checks from sklearn.utils.estimator_checks. Related to "Add instance-level calls to estimator_checks for meta-estimators" scikit-learn/scikit-learn#9443. Because it's a meta-estimator, the common tests would need to determine at runtime whether it's a classifier or a regressor (depending on the base estimator), or maybe we would need to expose two meta-estimators, one for regressors and one for classifiers.
- Ideally it would be good to have a benchmark script in benchmarks/ to see how well this works on larger datasets (and to have a way to evaluate the performance impact of future changes).
- Also, illustrating that it works on some real-world datasets would be useful (including a comparison with other outlier-robust learners), even if it's not part of the PR, or it could be an additional example. There is some discussion about it in the 2018 paper (in particular in figure 8), but that doesn't say, unless I missed it, how well this approach performs compared to other classical approaches one might try. If you need more datasets, have a look at https://www.openml.org/; its datasets can be loaded with sklearn.datasets.fetch_openml.
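The runtime classifier/regressor dispatch mentioned in the review can be sketched as follows. This is a standalone illustration that assumes the `_estimator_type` attribute convention used by scikit-learn releases of that period (the attribute that `sklearn.base.is_classifier` inspected at the time). All class and function names below are hypothetical stand-ins, not code from the PR.

```python
class ToyClassifier:
    """Hypothetical base estimator following the classifier convention."""
    _estimator_type = "classifier"


class ToyRegressor:
    """Hypothetical base estimator following the regressor convention."""
    _estimator_type = "regressor"


def estimator_kind(base_estimator):
    """Return 'classifier' or 'regressor' for a base estimator."""
    kind = getattr(base_estimator, "_estimator_type", None)
    if kind not in ("classifier", "regressor"):
        raise ValueError(
            "base estimator must be a classifier or a regressor, "
            f"got _estimator_type={kind!r}"
        )
    return kind


class ToyMetaEstimator:
    """Hypothetical meta-estimator that mirrors the kind of its base."""

    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    @property
    def _estimator_type(self):
        # Resolved at access time, so swapping base_estimator is reflected.
        return estimator_kind(self.base_estimator)


print(ToyMetaEstimator(ToyClassifier())._estimator_type)
print(ToyMetaEstimator(ToyRegressor())._estimator_type)
```

The alternative raised in the review, exposing two separate meta-estimators, avoids this dispatch entirely at the cost of a larger public API.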

examples/plot_RobustClassification.py

examples/plot_RobustRegression.py

sklearn_extra/robust/RobustWeightedEstimator.py

sklearn_extra/robust/mean_estimators.py

sklearn_extra/robust/tests/test_RobustWeightedEstimator.py

rth · 2019-10-27T07:24:01Z

The codecov CI failure can be fixed by adding tests to increase the coverage of the added code.

chkoar · 2019-11-06T09:55:18Z

I would use snake_case in filenames.

doc/modules/robust.rst

examples/plot_RobustRegression_california_houses.py

examples/plot_robust_regression_toy.py

doc/modules/robust.rst

sklearn_extra/robust/robust_weighted_estimator.py

sklearn_extra/robust/tests/test_RobustWeightedEstimator.py

sklearn_extra/robust/robust_weighted_estimator.py

TimotheeMathieu · 2020-03-04T14:53:54Z

The problem comes from EigenPro, and as it is not my code I could not quickly find where the bug is.

chkoar · 2020-03-04T14:56:41Z

I saw this; that's why I asked.

TimotheeMathieu · 2020-03-04T15:01:52Z

The link you gave is for test 136.1; the current Travis test is 136.4, from what I gathered (I am new to CI tests).

chkoar · 2020-03-04T15:20:54Z

Basically it is one build with different configurations. The one you are referring to runs against the nightly version of scikit-learn. For instance, the first two builds run against scikit-learn 0.21.2, where the robust estimator doc test fails due to changes to check_is_fitted.

@rth what's your opinion here? In the README we say that we support scikit-learn (>=0.21).

rth · 2020-03-04T15:25:01Z

what's your opinion here? In the README we say that we support scikit-learn (>=0.21)

+1 to supporting only the latest version of scikit-learn. @glemaitre was also in favor.

TimotheeMathieu · 2020-03-05T08:53:27Z

OK, it is fixed. The compatibility problem was with sklearn 0.21; with my version (sklearn 0.22) there was no problem, so I didn't see it. Thanks for explaining how CI works.

rth

Sorry this is taking a long time to merge. I was hoping that scikit-learn-extra would be faster there than scikit-learn, otherwise it defeats the purpose of this repo somewhat.

Generally, code- and documentation-wise, LGTM. I have not done a very thorough review. I think we should merge.

rth · 2020-06-17T12:00:44Z

@ngoix What's your opinion about the general approach outlined in the user guide? Also cc @agramfort, in case you have any opinions.

GaelVaroquaux · 2020-06-17T14:59:34Z

Sorry this is taking a long time to merge. I was hoping that scikit-learn-extra would be faster there than scikit-learn, otherwise it defeats the purpose of this repo somewhat.

The bottleneck is the same: the reviewer time.

TimotheeMathieu · 2020-06-17T17:00:35Z

@rth No worries. I am already thankful that you all helped me make it better and agreed to teach a rookie how to do a PR. We can't expect you to be much faster than scikit-learn when you in fact have less manpower.

rth · 2020-06-17T20:19:23Z

@TimotheeMathieu Could you please merge upstream/master into this branch? That should fix the failing CI.

TimotheeMathieu · 2020-06-18T06:32:33Z

I merged, but CI still fails. It fails only for Python 3.8 and seems to be caused by sklearn-extra/utils/_cyfht.pyx. I don't understand how pyx files are imported, so I can't help. I have Python 3.8 and the test also fails on master.

sklearn_extra/robust/robust_weighted_estimator.py

Fix CI Python 3.8 Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>

TimotheeMathieu · 2020-06-18T13:41:50Z

LooseVersion(sklearn.__version__) does not work in Python 3.8 with the scikit-learn dev version: it cannot parse the version of current sklearn master, '0.24.dev0' (which contains a string component), so your fix does not work. I don't know how to fix this cleanly, but we could write a custom parser like

if int(sklearn.__version__.split('.')[1]) >= 24:

Would this be acceptable? It is a little hackish.
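For illustration, a custom check of this kind could be sketched as below. The helper name `minor_at_least` is hypothetical and this is not code from the PR. Note that `split('.')[1]` yields a string, so it has to be converted before it can be compared with an integer, and the sketch keeps only the leading digits to tolerate suffixes such as `24rc1`.

```python
import re


def minor_at_least(version, minimum_minor):
    """Return True if the second dotted component of `version`
    is >= minimum_minor (hypothetical helper, ignores the major part)."""
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError(f"no minor component in {version!r}")
    # Keep only the leading digits, e.g. "24rc1" -> "24".
    match = re.match(r"\d+", parts[1])
    if match is None:
        raise ValueError(f"cannot parse minor component of {version!r}")
    return int(match.group()) >= minimum_minor


print(minor_at_least("0.24.dev0", 24))
print(minor_at_least("0.22.1", 24))
print(minor_at_least("1.0.2", 24))
```

The last call shows why this is fragile: because the major component is ignored, a hypothetical version "1.0.2" is reported as older than 0.24.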

rth · 2020-06-18T17:35:52Z

Yes, it looks like LooseVersion is actually not reliable. I pushed a commit to switch to pkg_resources.parse_version, where pkg_resources is installed as part of setuptools. It would mean making setuptools a runtime dependency in addition to being a build dependency. There doesn't seem to be a much better solution for this; see scikit-learn/scikit-learn#7980 (comment).

Edit: The problem with a custom regex is that we may have the same issue when comparing against versions of scipy/numpy, so using standard tools for version comparison is probably safer.
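A minimal sketch of version comparison with a standard parser follows. As a substitution of my own, it prefers packaging.version.parse and falls back to the pkg_resources.parse_version discussed above when packaging is not installed; both follow PEP 440 ordering. The helper name `at_least` is hypothetical.

```python
try:
    # Substitution: standalone PEP 440 parser, if available.
    from packaging.version import parse as parse_version
except ImportError:
    # The parser mentioned in the thread, shipped with setuptools.
    from pkg_resources import parse_version


def at_least(installed, minimum):
    """Return True if `installed` is the same as or newer than `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


print(at_least("0.24.dev0", "0.23"))
print(at_least("0.24.dev0", "0.24"))
print(at_least("0.24.dev0", "0.24.dev0"))
```

Under PEP 440 a development release sorts before the corresponding final release, so '0.24.dev0' is not considered at least '0.24'. A check that should also match the dev version needs to compare against '0.24.dev0' or against the previous release.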

rth · 2020-06-26T13:01:55Z

Merged. Thank you @TimotheeMathieu !

RobustWeightedEstimator code + examples

ac8679e

rth mentioned this pull request Oct 26, 2019

MAINT Use disassembled estimator checks / fix CI #40

Merged

TimotheeMathieu added 3 commits October 26, 2019 15:30

black reformatted

c467fe7

update RobustWeightedEstimator

abe2457

Merge remote-tracking branch 'origin/master' into RobustWeightedEstim…

c982256

…ator

rth reviewed Oct 27, 2019

TimotheeMathieu added 3 commits October 27, 2019 13:44

Add random_state parameter, fix plots in examples.

956fae9

Doc added, real dataset examples added.

788ed35

Fix examples, added tests.

4038a93