[MRG+1] TheilSen robust linear regression #2949

Merged
Merged 99 commits into scikit-learn:master from FlorianWilhelm:theilsen on Nov 20, 2014

Contributor

FlorianWilhelm commented Mar 8, 2014

 A multiple linear Theil-Sen regression for the Scikit-Learn toolbox. The implementation is based on the algorithm from the paper "Theil-Sen Estimators in a Multiple Linear Regression Model" by Xin Dang, Hanxiang Peng, Xueqin Wang and Heping Zhang. It is parallelized with the help of joblib. On a personal note, I think the popular Theil-Sen regression would be a nice addition to Scikit-Learn. I am looking forward to your feedback. Florian
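For readers unfamiliar with the method, here is a toy sketch of the classical *univariate* Theil-Sen estimator (slope = median of all pairwise slopes). This is my own illustration of the underlying idea; the PR implements the multivariate generalization from the cited paper, not this simplified form:

```python
import itertools

import numpy as np

def theil_sen_1d(x, y):
    # classical univariate Theil-Sen: the slope is the median of all
    # pairwise slopes, the intercept the median of the residual offsets
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in itertools.combinations(range(len(x)), 2)
              if x[i] != x[j]]
    slope = np.median(slopes)
    intercept = np.median(y - slope * x)
    return slope, intercept

x = np.arange(10, dtype=float)
y = 3.0 * x + 2.0
y[9] += 100.0                      # one gross outlier
slope, intercept = theil_sen_1d(x, y)
# the median-based fit ignores the outlier: slope 3.0, intercept 2.0
```

Because the median of the pairwise slopes is dominated by the 36 clean pairs, a single outlier has no effect at all here, which is the robustness property the PR brings to the multivariate case.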

FlorianWilhelm added some commits Jan 14, 2014

``` Added multiple linear Theil-Sen regression ```
``` d4519d2 ```
``` Added an example and documentatin for Theil-Sen ```
``` b3e8a64 ```
``` Added subpopulation parameter to Theil-Sen ```
``` 7dea60d ```
``` Added parallelization support to TheilSen Estimator ```
``` 8d48d45 ```
``` Improved parallelization for Theilsen estimator ```
``` 86a7461 ```
``` Merge branch 'master' into theilsen ```
``` 601b82b ```
``` Cleanups and corrections for Theil-Sen regression. ```
``` c19d967 ```
``` Removed subpopulation=None option in TheilSen ```
``` 8653c20 ```
``` xrange fix for Python3 in TheilSen ```
``` 960c45b ```

jnothman reviewed Mar 8, 2014

```
+    res = np.zeros(y.shape)
+    T_nom = np.zeros(y.shape)
+    T_denom = 0.
+    for x in X.T:
```

jnothman Mar 8, 2014

Owner

I think it should be possible to vectorize this loop (i.e. not use a Python `for` loop), with something like:

```
diff = X.T - y
normdiff = norm(diff, axis=1)
eta = 1.
res = np.sum(diff / normdiff, axis=0)
T_denom = np.sum(1 / normdiff, axis=0)
T_nom = np.sum(X / normdiff, axis=1)
```

(warning: code untested)
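The vectorization idea can be illustrated with a small runnable sketch (my own toy example, not the PR's code): replace a per-point Python loop over residual directions with broadcasting. The shapes here, `X` as rows of points and `y` as the current center estimate, are assumptions for illustration:

```python
import numpy as np

X = np.array([[1., 2.], [3., 4.], [5., 0.]])  # three 2-d points
y = np.array([0., 0.])                        # current estimate

# loop version: accumulate unit vectors pointing from y to each point
res_loop = np.zeros_like(y)
for x in X:
    res_loop += (x - y) / np.linalg.norm(x - y)

# vectorized version: one norm per row, then one broadcasted sum
diff = X - y
normdiff = np.linalg.norm(diff, axis=1)
res_vec = (diff / normdiff[:, np.newaxis]).sum(axis=0)

assert np.allclose(res_loop, res_vec)
```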

FlorianWilhelm Mar 9, 2014

Contributor

Thanks for pointing this out, I fixed it in commit 3e8ca8c.

agramfort reviewed Mar 9, 2014

```
+_logger = logging.getLogger(__name__)
+
+
+def modweiszfeld_step(X, y):
```

agramfort Mar 9, 2014

Owner

The internally used functions should be private, hence: modweiszfeld_step → _modweiszfeld_step

agramfort Mar 9, 2014

Owner

this applies to the 3 functions below

FlorianWilhelm Mar 9, 2014

Contributor

My first idea was that a function like spatial_median could also be useful on its own, which is why I made it public. But until that is really the case, I have changed it to private for now in commit d133712.

agramfort reviewed Mar 9, 2014

```
+    # Check that Theil-Sen works
+    theilsen = TheilSen(n_jobs=-1).fit(X, y)
+    nptest.assert_array_almost_equal(theilsen.coef_, w, 1)
+    nptest.assert_array_almost_equal(theilsen.intercept_, c, 1)
```

agramfort Mar 9, 2014

Owner

we usually don't test with various numbers of n_jobs if the code path is the same.
We rely on joblib's internal testing. Here you should just check the splits

FlorianWilhelm Mar 9, 2014

Contributor

 Thank you for pointing this out. The unit tests now explicitly cover _split_indices and _get_n_jobs as of commit 1d75896.
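A minimal sketch of what such a split helper might look like (`split_indices` here is a hypothetical stand-in for the PR's private helper, not its actual code):

```python
import numpy as np

def split_indices(n_samples, n_jobs):
    # hypothetical stand-in: partition range(n_samples) into
    # n_jobs nearly equal contiguous chunks for parallel workers
    return np.array_split(np.arange(n_samples), n_jobs)

chunks = split_indices(10, 3)
sizes = [len(c) for c in chunks]   # first chunks get the remainder
```

Testing the splitter directly, as suggested, checks the parallel bookkeeping without ever spawning workers.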

agramfort reviewed Mar 9, 2014

```
+    X, y, w, c = gen_toy_problem_2d()
+    # Check that Least Squares fails
+    lstq = LinearRegression().fit(X, y)
+    assert np.linalg.norm(lstq.coef_ - w) > 1.0
```

agramfort Mar 9, 2014

Owner

we use linalg from scipy not numpy

FlorianWilhelm Mar 9, 2014

Contributor

Fixed in commit d1d221b where possible. Is it okay to use numpy.linalg.norm? It seems that scipy.linalg.norm does not provide the axis parameter that I need.
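For context, the `axis` behavior in question can be seen in this small sketch (using `numpy.linalg.norm`, which at the time supported `axis` while `scipy.linalg.norm` did not):

```python
import numpy as np

A = np.array([[3., 4.],
              [5., 12.]])

# one Euclidean norm per row, via the axis parameter
row_norms = np.linalg.norm(A, axis=1)

# without axis, norm() collapses the whole matrix into a single
# Frobenius norm instead of per-row values
frob = np.linalg.norm(A)
```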

Owner

agramfort commented Mar 9, 2014

 Can this method work with n_samples < n_features? How does it compare to RANSAC, which we already have? Thanks

FlorianWilhelm added some commits Mar 9, 2014

``` FIX Theil-Sen unittest for older Scipy Version ```
`Usage of fmin_bfgs instead of minimize.`
``` 0eda5bf ```
``` FIX that some functions in Theil-Sen were public ```
```- _spatial_median, _breakdown_point and _modweiszfeld_step are now
private as suggested by agramfort.
- moved _lse before TheilSen class.```
``` d133712 ```
``` FIX usage of linalg from Scipy in Theil-Sen ```
```Replaced usage of numpy.linalg by scipy.linalg as suggested by
agramfort.```
``` d1d221b ```
``` FIX: Let Theil-Sen handle n_samples < n_features case ```
```Theil-Sen will fall back to Least Squares like if number of samples is
smaller than the number of features.```
``` 61a5195 ```
``` FIX: Python 2.6 format syntax in Theil-Sen ```
```Fix of "zero length field name in format" ValueError due to Python 2.6
in TheilSen class.```
``` 13d9212 ```
``` Vectorization of theilsen._modweiszfeld_step ```
```Replaced the for-loop in linear_model.theilsen._modweiszfeld_step by
array operations as suggested by jnothman```
``` 3e8ca8c ```
``` FIX: Parallel unittests for Theil-Sen estimator ```
```Instead of testing the various codepaths with n_jobs of joblib, now the
functions _get_n_jobs and _split_indices of the theilsen class are
tested as pointed out by agramfort.```
``` 1d75896 ```
Contributor

FlorianWilhelm commented Mar 9, 2014

 For the n_samples < n_features case I made some modifications in commit 61a5195 to fall back to least squares, since this Theil-Sen implementation can be seen as a generalization of the ordinary least squares method anyway. So far I have not found a direct comparison between RANSAC and Theil-Sen, but I'm not that familiar with RANSAC, so I should have a deeper look into this subject. My impression so far is that RANSAC is a more heuristic approach coming from computer science, while Theil-Sen comes from robust statistics.
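The fallback rule can be sketched as follows (a hedged toy; `theil_sen_or_ols` is a made-up name and the actual PR code differs):

```python
import numpy as np

def theil_sen_or_ols(X, y):
    # sketch of the fallback described above: with fewer samples than
    # features, subsampling is pointless, so solve least squares directly
    if X.shape[0] < X.shape[1]:
        return np.linalg.lstsq(X, y, rcond=None)[0]
    raise NotImplementedError("full Theil-Sen path omitted in this sketch")

X = np.array([[1.0, 2.0, 3.0]])   # 1 sample, 3 features
y = np.array([14.0])
coef = theil_sen_or_ols(X, y)
# the minimum-norm least-squares solution reproduces y exactly
assert np.allclose(X @ coef, y)
```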

chrisjordansquire commented Mar 9, 2014

 Not to derail the conversation, but is this kind of estimator more appropriate for statsmodels? (My thinking is scikit-learn is focused on predicting outputs while statsmodels is focused on inferences about parameters.)
Contributor

FlorianWilhelm commented Mar 9, 2014

 @chrisjordansquire In order to predict anything you always have to train the parameters of your model and quite often your training samples are less than perfect (i.e. having outliers or if the error is not normally distributed). This is where non-parametric methods like Theil-Sen shine since they do not assume the errors to be normally distributed. The new RANSAC estimator also seems to fit into the class of robust estimators. From my understanding it takes an estimator like LinearRegression and presents a proper subset to it in order to remove the negative effects of outliers. In this way you could also say that all RANSAC does is to help LinearRegression to infer the right parameters.
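To make that description of RANSAC concrete, here is a numpy-only toy of the principle (random minimal fits, keep the one with the most inliers). This is a sketch of the idea, not sklearn's `RANSACRegressor`:

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.arange(20, dtype=float)
y = 2.0 * X + 1.0
y[::7] += 30.0                     # inject a few gross outliers

best_inliers, best_fit = 0, None
for _ in range(50):
    # fit a line through two random points and count agreeing points
    i, j = rng.choice(len(X), size=2, replace=False)
    slope = (y[j] - y[i]) / (X[j] - X[i])
    intercept = y[i] - slope * X[i]
    inliers = int((np.abs(y - (slope * X + intercept)) < 1.0).sum())
    if inliers > best_inliers:
        best_inliers, best_fit = inliers, (slope, intercept)
# with high probability some trial hits two clean points, and that
# exact fit collects all 17 inliers, recovering slope 2, intercept 1
```

Note the contrast with Theil-Sen: RANSAC discards suspected outliers via random consensus, while Theil-Sen aggregates over all (sub)samples with a median.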
``` FIX: TheilSen supports old Numpy versions ```
``` b374e7e ```

coveralls commented Mar 10, 2014

 Coverage remained the same when pulling b374e7e on FlorianWilhelm:theilsen into 64fd085 on scikit-learn:master.
Owner

GaelVaroquaux commented Mar 10, 2014

 > Not to derail the conversation, but is this kind of estimator more appropriate for statsmodels?

I am also hesitating on whether this estimator should go in scikit-learn, or whether we should push users to statsmodels. The reason that we have RANSAC is that it is widely used in computer vision, where people really only care about its ability to fit and predict, and not the corresponding p-values. I am +0 on including it in scikit-learn: robust models are sometimes really useful for prediction. However, I would indeed like to hear a little discussion about the use cases in mind.
Contributor

FlorianWilhelm commented Mar 11, 2014

 @GaelVaroquaux: I understand the hesitation to include yet another estimator, especially since RANSAC and Theil-Sen seem to open the gates for a new class of robust estimators, and we want scikit-learn to stay focused on its goals: an easy-to-use machine learning library with a consistent interface. But for exactly that reason we are heavily using scikit-learn at Blue Yonder (http://www.blue-yonder.com/en/) to provide predictive analysis products to our customers. At its core, everything we do is about making good predictions with the help of machine learning. For our own algorithms we have adopted the scikit-learn interface in order to extend it, and that works really well for us. In some projects we had to deal with suboptimal data (hard-to-determine outliers) from customers, where we applied Theil-Sen successfully. Since Theil-Sen is quite a popular and well-known algorithm, my idea was to include it directly in scikit-learn where others can use it too, and I got the okay from management to do so. I am convinced that Theil-Sen (although coming from statistics) is a robust machine learning algorithm that nicely complements Scikit-Learn whenever the data you get is suboptimal in various ways.

sebastianneubauer commented Mar 15, 2014

 I'm a bit surprised, because I never noticed that there is a clean separation between statsmodels and scikit-learn. Clearly this issue is the wrong place, but I would very much appreciate a discussion about this. At least for me, I wouldn't be surprised at all if I found a Theil-Sen implementation in scikit-learn (and another one in scipy and one in statsmodels ;-) ). On the contrary, I heavily use scikit-learn and would be happy to be able to try Theil-Sen (e.g. for preprocessing/regularisation) without introducing another dependency (I pretty much never need to introduce statsmodels as a dependency) and with the same interface as the other algorithms. Last point from me: if someone spends time and wants to contribute to scikit-learn, shouldn't the prior be +1 as long as there are no strong arguments against the contribution? Therefore +1 from me as long as there is no official "that's not scikit-learn" guideline/rule (and having one would be great!)
Owner

agramfort commented Mar 15, 2014

 To make your point I would compare performance and speed with the RANSAC we already have. Often computer scientists' hacks make things pretty efficient in practice... statisticians care much less about performance... in general... :)

sebastianneubauer commented Mar 15, 2014

 Shouldn't the user of scikit-learn have the right to decide whether he/she needs better results or higher throughput (no matter which algorithm is better or faster or both), just based on his/her needs? Why do we have to decide this in advance? Is it because of simplicity?
Owner

agramfort commented Mar 15, 2014

 > Shouldn't the user of scikit-learn have the right to decide whether he/she needs better results or higher throughput (no matter which algorithm is better or faster or both), just based on his/her needs? Why do we have to decide this in advance?

Let me ask the question differently: when should I use one or the other? Even if you already partially answered it, this type of reasoning should appear in the doc, ideally illustrated with an example. The vision of sklearn has never been to be a collection of algorithms. If we add a piece of code we will pay the price in maintenance, so we just want to be sure it's worth it.

sebastianneubauer commented Mar 16, 2014

 @agramfort: I cannot agree more! This is exactly the right question to ask, and the answer of course has to go into the documentation: what is it and what is it good for?
Contributor

FlorianWilhelm commented Mar 18, 2014

 @sebastianneubauer and @agramfort: Good points, I'll rework the documentation a bit and give a direct comparison of RANSAC and Theil-Sen regarding when to use which. It's going to take me a little while though, since so far I haven't found any paper or literature directly comparing the two. For now I can only say that Theil-Sen is a completely different approach towards the same goal, so I agree: I should point out to users when to use Theil-Sen over RANSAC and vice versa.
``` DOC: Comparison of Theil-Sen and RANSAC ```
```Added a simple case where Theil-Sen apparently outperforms RANSAC.
Some recommendations when to use one over another.```
``` 0b10f49 ```
Contributor

FlorianWilhelm commented Mar 22, 2014

 @sebastianneubauer and @agramfort: In commit 0b10f49 I extended the documentation with a small comparison of Theil-Sen and RANSAC as well as an example. I hope this clarifies that both methods have different advantages and that neither can be said to be superior to the other. While they produce nearly identical results in some cases, in others the results vary widely. Therefore, both methods complement each other and should co-exist.

agramfort reviewed Mar 22, 2014

```
+    return X, y, w, c
+
+
+def test__modweiszfeld_step_1d():
```

agramfort Mar 22, 2014

Owner

why 2 __ after test?

FlorianWilhelm Mar 22, 2014

Contributor

Since _modweiszfeld_step is a private function and my pattern is test_FUNCTIONNAME_EXTRAINFO.
I will change it.

agramfort reviewed Mar 22, 2014

```
+from os import devnull
+
+import numpy as np
+import numpy.testing as nptest
```

agramfort Mar 22, 2014

Owner

nptest is a non standard import name in sklearn

we usually import all assert_*

``` DOC: Fixed typo in Theil-Sen example. ```
``` ca9a275 ```

arjoly reviewed Oct 16, 2014

```
+    X, y, w, c = gen_toy_problem_2d()
+    # Check that Least Squares fails
+    lstq = LinearRegression().fit(X, y)
+    assert norm(lstq.coef_ - w) > 1.0
```

arjoly Oct 16, 2014

Owner

Can you use assert_greater here?

arjoly reviewed Oct 16, 2014

```
+
+def test_calc_breakdown_point():
+    bp = _breakdown_point(1e10, 2)
+    assert np.abs(bp - 1 + 1/(np.sqrt(2))) <= 1.e-6
```

arjoly Oct 16, 2014

Owner

Can you use assert_lower_equal here?

arjoly reviewed Oct 16, 2014

```
+    X, y, w, c = gen_toy_problem_2d()
+    # Check that Least Squares fails
+    lstq = LinearRegression().fit(X, y)
+    assert norm(lstq.coef_ - w) > 1.0
```

arjoly Oct 16, 2014

Owner

Can you use assert_greater here?

arjoly reviewed Oct 16, 2014

```
+    X, y, w, c = gen_toy_problem_1d()
+    # Check that Theil-Sen can be verbose
+    TheilSenRegressor(verbose=True).fit(X, y)
+    TheilSenRegressor(verbose=True, max_subpopulation=10).fit(X, y)
```

arjoly Oct 16, 2014

Owner

Can you avoid printing anything during the testing phase?
If you want to check verbosity, you can capture stdout and check that it corresponds to what you expect.
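A minimal sketch of that capture pattern (`chatty_fit` is a made-up stand-in for a verbose estimator method, and `contextlib.redirect_stdout` requires Python 3.4+):

```python
import io
from contextlib import redirect_stdout

def chatty_fit(verbose=False):
    # made-up stand-in for an estimator method that prints when verbose
    if verbose:
        print("Breakdown point: 0.28")
    return "fitted"

buf = io.StringIO()
with redirect_stdout(buf):
    chatty_fit(verbose=True)

# nothing leaked to the real console; the output is inspectable instead
assert "Breakdown point" in buf.getvalue()
```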

arjoly reviewed Oct 16, 2014

```
+def test_subsamples():
+    X, y, w, c = gen_toy_problem_4d()
+    theil_sen = TheilSenRegressor(n_subsamples=X.shape[0],
+                                  verbose=True).fit(X, y)
```

arjoly Oct 16, 2014

Owner

Can you set verbose to False if possible?

Owner

arjoly commented Oct 16, 2014

 The testing time of the theilsen estimator took me 3 seconds.

```
$ nosetests sklearn/linear_model/tests/test_theil_sen.py -s
.............Breakdown point: 6.9312316997e-09
Number of samples: 10000
Tolerable outliers: 0
Number of subpopulations: 1
[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished
.Breakdown point: 0.287035354437
Number of samples: 50
Tolerable outliers: 14
Number of subpopulations: 1225
[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished
Breakdown point: 0.287035354437
Number of samples: 50
Tolerable outliers: 14
Number of subpopulations: 10
[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished
...
----------------------------------------------------------------------
Ran 17 tests in 3.212s
```

Could you try to reduce the testing to something smaller (less than 1s) without any messages printed on the console?
Owner

arjoly commented Oct 16, 2014

 By the way, congratulations on the 100% coverage!
Owner

arjoly commented Oct 16, 2014

 What is the time required to run the new examples?
Owner

arjoly commented Oct 16, 2014

 At the moment, I don't have time to properly review the correctness of the algorithm. For the rest, I am +1 when the last tidbits are fixed.

arjoly reviewed Oct 16, 2014

```
@@ -0,0 +1,87 @@
+"""
+Demo robust fitting
```

arjoly Oct 16, 2014

Owner

What do you think of `Robust linear estimator fitting`

Owner

GaelVaroquaux commented Oct 16, 2014

 @arjoly: a bunch of very good catches (e.g. the random_state of the estimator in the tests, the naming suggestions, and the test time) that should be addressed.

Florian Wilhelm added some commits Oct 16, 2014

``` COSMIT: pep8 and renaming ```
``` 4fcd6b8 ```
``` COSMIT: replaced assert by assert_less/greater etc. ```
``` b8cd60b ```
``` TEST: No console output during unit tests ```
``` d5ec39b ```
``` ENH: Always set random_state in unit tests ```
``` f9ecbf7 ```
``` ENH: Speedup of unit tests ```
``` f92074a ```

coveralls commented Oct 17, 2014

 Coverage increased (+0.06%) when pulling f92074a on FlorianWilhelm:theilsen into 031a3fc on scikit-learn:master.
Contributor

FlorianWilhelm commented Oct 17, 2014

 @arjoly Regarding the new examples, I used the Unix `time` command and looked at the user time, since the script only finishes when I close the plotting windows. plot_theilsen: 1.865s, plot_robust_fit: 6.152s
``` COSMIT: Better consistency ```
``` c54c17e ```

arjoly reviewed Oct 17, 2014

```
+print(__doc__)
+
+estimators = [('OLS', LinearRegression()),
+              ('Theil-Sen', TheilSenRegressor()),
```

arjoly Oct 17, 2014

Owner

the random state is missing here

FlorianWilhelm Oct 17, 2014

Contributor

@arjoly sorry, that I forgot that one, my bad :-/

arjoly Oct 17, 2014

Owner

Thanks !

``` ENH: Added random_state in plot_theilsen.py ```
``` 2137d82 ```
Owner

arjoly commented Oct 17, 2014

 I won't be able to find much time for this before next week. @ogrisel and @GaelVaroquaux feel free to finish the review and merged before if you want.
Contributor

FlorianWilhelm commented Oct 27, 2014

 @arjoly @GaelVaroquaux What are the next steps? Do I need to ping someone like @larsmans in order to get this merged? The related PR #3764 has also [MRG+1].
Owner

MechCoder commented Nov 20, 2014

 Is this ready to go? @ogrisel and @GaelVaroquaux I see +1's from you, and another implicit +1 by @arjoly !
Owner

ogrisel commented Nov 20, 2014

 Both @GaelVaroquaux and I already gave +1, and the last batch of comments by @arjoly has seemingly been addressed; let's merge.

ogrisel added a commit that referenced this pull request Nov 20, 2014

``` Merge pull request #2949 from FlorianWilhelm/theilsen ```
` [MRG+1] TheilSen robust linear regression`
``` f0fe4af ```

ogrisel merged commit `f0fe4af` into scikit-learn:master Nov 20, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Owner

ogrisel commented Nov 20, 2014

 Thanks again @FlorianWilhelm for the contribution (especially the effort in the doc and examples)!
Owner

GaelVaroquaux commented Nov 20, 2014

 Thanks heaps. This is a quality contribution!
Owner

arjoly commented Nov 20, 2014

 Congratulations @FlorianWilhelm 👍
Contributor

FlorianWilhelm commented Nov 20, 2014

 Thank you all for helping me make my first Scikit-Learn contribution! It was a lot of work but I had tons of fun :-)
Owner

ogrisel commented Nov 20, 2014

 @FlorianWilhelm I forgot: can you please open a new PR to add an entry to the `doc/whats_new.rst` file? Please link your name either to your github account or a personal webpage of your choice at the end of the file.


Contributor

FlorianWilhelm commented Nov 21, 2014

 @ogrisel I updated `whats_new.rst` in PR #3870.

Owner

amueller commented Jan 8, 2015

 Stupid question: is there a simple way to make this run fast? It is quite slow in the common tests, even on trivial datasets. I guess setting `n_subsamples` is the trick, but is only possible if you know the number of samples, right?
Contributor

FlorianWilhelm commented Jan 8, 2015

 @amueller You could also use `max_subpopulation` to reduce the maximum number of samples that are considered. If you want to reduce the runtime with `n_subsamples` you need to know the number of samples.
Owner

jnothman commented Jan 8, 2015

 Would it be appropriate to allow `n_subsamples` as a float to be a proportion of the number of samples?
Contributor

FlorianWilhelm commented Jan 10, 2015

 No, the complexity is $\binom{n_{samples}}{n_{subsamples}}$. If you consider Pascal's triangle, the closer you get to the center of a row, the larger the number. So if you have `n_features` and fit a problem with `n_samples` including an intercept, the efficiency only starts to improve when `n_subsamples` is greater than or equal to `n_samples - n_features`. What we could do is specify it as a ratio of the number of features: with `ratio=1.0` (the default) you get the complexity $\binom{n_{samples}}{n_{subsamples}}$, otherwise we change `n_subsamples` to get the complexity $\binom{n_{samples}}{n_{samples} - ratio \cdot n_{features}}$.
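The combinatorics behind this argument can be checked directly (I use `math.comb`, available in Python 3.8+; at the time of this thread one would have reached for `scipy.special.comb` instead):

```python
from math import comb

n_samples, n_features = 100, 5

# the default n_subsamples = n_samples considers a single subset
assert comb(n_samples, n_samples) == 1

# near the middle of Pascal's row the subset count explodes
middle = comb(n_samples, n_samples // 2)   # astronomically large

# close to n_samples - n_features it stays tractable, by symmetry
near_edge = comb(n_samples, n_samples - n_features)
assert near_edge == comb(n_samples, n_features) == 75287520
```

This is exactly why a float proportion of `n_samples` would be dangerous: any value that lands near the middle of the row makes the exact estimator infeasible.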