[MRG+1] TheilSen robust linear regression #2949

Merged 99 commits into scikit-learn:master on Nov 20, 2014

Contributor

FlorianWilhelm commented Mar 8, 2014

A multiple linear Theil-Sen regression estimator for the Scikit-Learn toolbox. The implementation is based on the algorithm from the paper "Theil-Sen Estimators in a Multiple Linear Regression Model" by Xin Dang, Hanxiang Peng, Xueqin Wang and Heping Zhang, and it is parallelized with the help of joblib.
On a personal note, I think the popular Theil-Sen regression would be a nice addition to Scikit-Learn.
I am looking forward to your feedback.

Florian
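For readers arriving later: a minimal usage sketch of the estimator under its merged name, TheilSenRegressor (the toy data and coefficients here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import TheilSenRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
w = np.array([2.0, -1.0])
y = X.dot(w) + 3.0
y[:10] += 50.0  # corrupt 10% of the targets with gross outliers

# the median-based estimate stays close to (w, 3.0) despite the outliers
reg = TheilSenRegressor(random_state=0).fit(X, y)
print(reg.coef_, reg.intercept_)  # close to [2, -1] and 3.0
```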

sklearn/linear_model/theilsen.py
+ res = np.zeros(y.shape)
+ T_nom = np.zeros(y.shape)
+ T_denom = 0.
+ for x in X.T:
@jnothman

jnothman Mar 8, 2014

Owner

I think it should be possible to vectorize this loop (i.e. not use a Python for loop), with something like:

diff = X.T - y
normdiff = norm(diff, axis=1)
mask = normdiff >= 1e-6
if mask.sum() < X.shape[1]:
    eta = 1.
else:
    eta = 0.
diff = diff[mask, :]
normdiff = normdiff[mask]
res = np.sum(diff / normdiff[:, np.newaxis], axis=0)
T_denom = np.sum(1. / normdiff)
T_nom = np.sum(X[:, mask] / normdiff, axis=1)

(warning: code untested)
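For the record, a self-contained runnable version of this step (a sketch of the Vardi-Zhang modified Weiszfeld update with samples stored as rows; it illustrates the idea rather than reproducing the exact code that was merged):

```python
import numpy as np

def modweiszfeld_step(X, x_old, eps=1e-6):
    """One modified Weiszfeld step toward the spatial median of the rows of X."""
    diff = X - x_old
    normdiff = np.sqrt(np.sum(diff ** 2, axis=1))
    mask = normdiff >= eps
    # eta = 1 if the current estimate coincides with one of the data points
    eta = 1.0 if mask.sum() < X.shape[0] else 0.0
    diff, normdiff = diff[mask], normdiff[mask][:, np.newaxis]
    res = np.sum(diff / normdiff, axis=0)
    T = np.sum(X[mask] / normdiff, axis=0) / np.sum(1.0 / normdiff)
    r = np.linalg.norm(res)
    r_inv = 0.0 if r < eps else eta / r
    return max(0.0, 1.0 - r_inv) * T + min(1.0, r_inv) * x_old

# iterate from the mean; converges to the spatial median of the cloud
rng = np.random.RandomState(42)
X = rng.normal(size=(200, 2))
x = X.mean(axis=0)
for _ in range(50):
    x = modweiszfeld_step(X, x)
```

For a symmetric Gaussian cloud the iteration settles near the origin, as the spatial median should.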

@FlorianWilhelm

FlorianWilhelm Mar 9, 2014

Contributor

Thanks for pointing this out, I fixed it in commit 3e8ca8c.

sklearn/linear_model/theilsen.py
+_logger = logging.getLogger(__name__)
+
+
+def modweiszfeld_step(X, y):
@agramfort

agramfort Mar 9, 2014

Owner

the internally used functions should be private, hence: modweiszfeld_step → _modweiszfeld_step

@agramfort

agramfort Mar 9, 2014

Owner

this applies to the 3 functions below

@FlorianWilhelm

FlorianWilhelm Mar 9, 2014

Contributor

My first idea was that a function like spatial_median could also be useful on its own, which is why I made it public. But until that is really the case, I have changed it to private for now in commit d133712.

+ # Check that Theil-Sen works
+ theilsen = TheilSen(n_jobs=-1).fit(X, y)
+ nptest.assert_array_almost_equal(theilsen.coef_, w, 1)
+ nptest.assert_array_almost_equal(theilsen.intercept_, c, 1)
@agramfort

agramfort Mar 9, 2014

Owner

we usually don't test with various numbers of n_jobs if the code path is the same;
we rely on joblib's internal testing. Here you should just check the splits.

@FlorianWilhelm

FlorianWilhelm Mar 9, 2014

Contributor

Thank you for pointing this out. The unit tests now explicitly cover _split_indices and _get_n_jobs as of commit 1d75896.

+ X, y, w, c = gen_toy_problem_2d()
+ # Check that Least Squares fails
+ lstq = LinearRegression().fit(X, y)
+ assert np.linalg.norm(lstq.coef_ - w) > 1.0
@agramfort

agramfort Mar 9, 2014

Owner

we use linalg from scipy not numpy

@FlorianWilhelm

FlorianWilhelm Mar 9, 2014

Contributor

Fixed in commit d1d221b where possible. Is it okay to use numpy.linalg.norm? scipy.linalg.norm does not seem to provide the axis parameter that I need.
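For context, the axis argument in question (added to numpy.linalg.norm in NumPy 1.8; scipy.linalg.norm only gained an equivalent in later SciPy releases):

```python
import numpy as np

A = np.array([[3.0, 4.0],
              [6.0, 8.0]])

# Euclidean norm of each row in a single vectorized call
row_norms = np.linalg.norm(A, axis=1)
print(row_norms)  # rows have norms 5 and 10
```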

Owner

agramfort commented Mar 9, 2014

can this method work with n_samples < n_features?
how does it compare to RANSAC, which we already have?

thanks

FlorianWilhelm added some commits Mar 9, 2014

FIX Theil-Sen unittest for older Scipy Version
Usage of fmin_bfgs instead of minimize.
FIX that some functions in Theil-Sen were public
- _spatial_median, _breakdown_point and _modweiszfeld_step are now
private as suggested by agramfort.
- moved _lse before TheilSen class.
FIX usage of linalg from Scipy in Theil-Sen
Replaced usage of numpy.linalg by scipy.linalg as suggested by
agramfort.
FIX: Let Theil-Sen handle n_samples < n_features case
Theil-Sen will fall back to Least Squares if the number of samples is
smaller than the number of features.
FIX: Python 2.6 format syntax in Theil-Sen
Fix of "zero length field name in format" ValueError due to Python 2.6
in TheilSen class.
Vectorization of theilsen._modweiszfeld_step
Replaced the for-loop in linear_model.theilsen._modweiszfeld_step by
array operations as suggested by jnothman
FIX: Parallel unittests for Theil-Sen estimator
Instead of testing the various codepaths with n_jobs of joblib, now the
functions _get_n_jobs and _split_indices of the theilsen class are
tested as pointed out by agramfort.
Contributor

FlorianWilhelm commented Mar 9, 2014

For the n_samples < n_features case I made some modifications in commit 61a5195 to fall back to least squares, since this Theil-Sen implementation can be seen as a generalization of the ordinary least squares method anyway.
So far I have not found a direct comparison between RANSAC and Theil-Sen, and I am not that familiar with RANSAC, so I will have to take a deeper look into the subject. My impression so far is that RANSAC is a more heuristic approach coming from computer science, while Theil-Sen comes from robust statistics.

Not to derail the conversation, but is this kind of estimator more appropriate for statsmodels? (My thinking is scikit-learn is focused on predicting outputs while statsmodels is focused on inferences about parameters.)

Contributor

FlorianWilhelm commented Mar 9, 2014

@chrisjordansquire In order to predict anything you always have to train the parameters of your model, and quite often your training samples are less than perfect (i.e. they contain outliers, or the error is not normally distributed). This is where non-parametric methods like Theil-Sen shine, since they do not assume normally distributed errors.
The new RANSAC estimator also seems to fit into the class of robust estimators. From my understanding it takes an estimator like LinearRegression and presents a proper subset of the data to it in order to remove the negative effects of outliers. In that sense you could also say that all RANSAC does is help LinearRegression infer the right parameters.
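The subset-and-refit idea described here can be sketched with scikit-learn's RANSACRegressor wrapping a plain LinearRegression (illustrative data of my own; parameter spellings follow the current API):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X.ravel() + 1.0
y[:15] += 40.0  # 15% gross outliers in the y direction

# RANSAC fits the base estimator on random minimal subsets and keeps
# the candidate model with the largest set of consensus inliers
ransac = RANSACRegressor(LinearRegression(), random_state=0).fit(X, y)
print(ransac.estimator_.coef_)    # slope close to 3
print(ransac.inlier_mask_.sum())  # roughly the 85 clean samples
```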

Coverage Status

Coverage remained the same when pulling b374e7e on FlorianWilhelm:theilsen into 64fd085 on scikit-learn:master.

Owner

GaelVaroquaux commented Mar 10, 2014

Not to derail the conversation, but is this kind of estimator more appropriate
for statsmodels?

I am also hesitating on whether this estimator should go into scikit-learn,
or whether we should push users to statsmodels. The reason we have a
RANSAC is that it is widely used in computer vision, where people really
only care about its ability to fit and predict, not about the corresponding
p-values.

I am +0 on including it in scikit-learn: robust models are sometimes
really useful for prediction. However, I would indeed like to hear a
little discussion about the use cases people have in mind.

Contributor

FlorianWilhelm commented Mar 11, 2014

@GaelVaroquaux: I understand the hesitation to include yet another estimator especially since RANSAC and Theil-Sen seem to open the gates for a new class of robust estimators and we want scikit-learn to be focused on its goals: An easy-to-use machine learning library with a consistent interface.
But for exactly that reason we are heavily using scikit-learn at Blue Yonder (http://www.blue-yonder.com/en/) to provide predictive analysis products to our customers. At its core everything we do is about making good predictions with the help of machine learning. For our own algorithms we have adapted the scikit-learn interface in order to extend it and that works really well for us.
In some projects we had to deal with suboptimal data (hard-to-determine outliers) from customers, where we applied Theil-Sen successfully. Since Theil-Sen is quite a popular and well-known algorithm, my idea was to include it directly in scikit-learn where others can use it too, and I got the okay from management to do so. I am convinced that Theil-Sen (although coming from statistics) is a robust machine learning algorithm that nicely complements Scikit-Learn whenever the data you get is suboptimal in various ways.

I'm a bit surprised, because I never noticed that there is a clean separation between statsmodels and scikit-learn. Clearly this issue is the wrong place, but I would very much appreciate a discussion about this. At least for me, it wouldn't be surprising at all to find a Theil-Sen implementation in scikit-learn (and another one in scipy, and one in statsmodels ;-) )
On the contrary, I heavily use scikit-learn and would be happy to be able to try a Theil-Sen (e.g. for preprocessing/regularisation) without introducing another dependency (I pretty much never need statsmodels as a dependency) and with the same interface as the other algorithms.
Last point from me: if someone spends time and wants to contribute to scikit-learn, shouldn't the prior be +1 as long as there are no strong arguments against the contribution?
Therefore +1 from me, as long as there is no official "that's not scikit-learn" guideline/rule (and such a guideline would be great!)

Owner

agramfort commented Mar 15, 2014

to make your point I would compare performance and speed with the RANSAC we already have. Often computer scientists' hacks make things pretty efficient in practice... statisticians care much less about performance... in general... :)

Shouldn't the user of scikit-learn have the right to decide whether he/she needs better results or higher throughput (no matter which algorithm is better or faster or both), just based on his/her needs? Why do we have to decide this in advance? Is it because of simplicity?

Owner

agramfort commented Mar 15, 2014

Shouldn't the user of scikit-learn have the right to decide whether he/she needs better results or higher throughput (no matter which algorithm is better or faster or both), just based on his/her needs? Why do we have to decide this in advance?

let me ask the question differently: when should I use one or the other?
Even if you have already partially answered it, this kind of reasoning should
appear in the doc, ideally illustrated with an example.

the vision of sklearn has never been to be a collection of algorithms.
If we add a piece of code we will pay the price in maintenance, so we
just want to be sure it's worth it.

@agramfort: I could not agree more! This is exactly the right question to ask, and the answer of course has to go into the documentation: what is it, and what is it good for?

Contributor

FlorianWilhelm commented Mar 18, 2014

@sebastianneubauer and @agramfort: Good points, I'll rework the documentation a bit and give a direct comparison of RANSAC and Theil-Sen regarding when to use which. It is going to take me a little while though, since so far I haven't found any paper or literature directly comparing the two. For now I can only say that Theil-Sen is a completely different approach towards the same goal, so I agree that I should point out to users when to use Theil-Sen over RANSAC and vice versa.

DOC: Comparison of Theil-Sen and RANSAC
Added a simple case where Theil-Sen apparently outperforms RANSAC.
Some recommendations when to use one over another.
Contributor

FlorianWilhelm commented Mar 22, 2014

@sebastianneubauer and @agramfort: In commit 0b10f49 I extended the documentation with a small comparison of Theil-Sen and RANSAC as well as an example. I hope this clarifies that both methods have different advantages and that neither can be said to be superior to the other. While they give nearly identical results in some cases, in others the results vary largely. Therefore, both methods complement each other and should co-exist.
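A condensed version of that comparison (toy data of my own, not the example from the commit): with moderate y-direction outliers, both robust estimators keep the slope near the truth while plain least squares absorbs the outliers, mostly into its intercept:

```python
import numpy as np
from sklearn.linear_model import (LinearRegression, RANSACRegressor,
                                  TheilSenRegressor)

rng = np.random.RandomState(0)
X = rng.uniform(-1.0, 1.0, size=(60, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.1, size=60)
y[:12] += 8.0  # 20% of the targets shifted upward

ols = LinearRegression().fit(X, y)
theil = TheilSenRegressor(random_state=0).fit(X, y)
ransac = RANSACRegressor(random_state=0).fit(X, y)

# the robust fits stay near slope 2; OLS drags its intercept upward
print("OLS      ", ols.coef_, ols.intercept_)
print("Theil-Sen", theil.coef_, theil.intercept_)
print("RANSAC   ", ransac.estimator_.coef_, ransac.estimator_.intercept_)
```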

+ return X, y, w, c
+
+
+def test__modweiszfeld_step_1d():
@agramfort

agramfort Mar 22, 2014

Owner

why 2 __ after test?

@FlorianWilhelm

FlorianWilhelm Mar 22, 2014

Contributor

Because _modweiszfeld_step is a private function and my naming pattern is test_FUNCTIONNAME_EXTRAINFO.
I will change it.

+from os import devnull
+
+import numpy as np
+import numpy.testing as nptest
@agramfort

agramfort Mar 22, 2014

Owner

nptest is a non standard import name in sklearn

we usually import all assert_*

+ X, y, w, c = gen_toy_problem_2d()
+ # Check that Least Squares fails
+ lstq = LinearRegression().fit(X, y)
+ assert norm(lstq.coef_ - w) > 1.0
@arjoly

arjoly Oct 16, 2014

Owner

Can you use assert_greater here?

+
+def test_calc_breakdown_point():
+ bp = _breakdown_point(1e10, 2)
+ assert np.abs(bp - 1 + 1/(np.sqrt(2))) <= 1.e-6
@arjoly

arjoly Oct 16, 2014

Owner

Can you use assert_less_equal here?

+ X, y, w, c = gen_toy_problem_2d()
+ # Check that Least Squares fails
+ lstq = LinearRegression().fit(X, y)
+ assert norm(lstq.coef_ - w) > 1.0
@arjoly

arjoly Oct 16, 2014

Owner

Can you use assert_greater here?

+ X, y, w, c = gen_toy_problem_1d()
+ # Check that Theil-Sen can be verbose
+ TheilSenRegressor(verbose=True).fit(X, y)
+ TheilSenRegressor(verbose=True, max_subpopulation=10).fit(X, y)
@arjoly

arjoly Oct 16, 2014

Owner

Can you avoid printing anything during the testing phase?
If you want to check verbosity, you can capture stdout and check that it contains what you expect.
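One way to do that (a sketch using contextlib.redirect_stdout, available from Python 3.4; the merged tests seem to redirect output to os.devnull instead, judging by the imports above):

```python
import io
from contextlib import redirect_stdout  # Python 3.4+

def verbose_fit():
    # stand-in for an estimator that prints progress when verbose=True
    print("Breakdown point: 0.287")
    print("Number of subpopulations: 1225")

buf = io.StringIO()
with redirect_stdout(buf):  # nothing reaches the real console
    verbose_fit()

captured = buf.getvalue()
assert "Breakdown point" in captured  # verbosity checked silently
```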

+def test_subsamples():
+ X, y, w, c = gen_toy_problem_4d()
+ theil_sen = TheilSenRegressor(n_subsamples=X.shape[0],
+ verbose=True).fit(X, y)
@arjoly

arjoly Oct 16, 2014

Owner

Can you set verbose to False if possible?

Owner

arjoly commented Oct 16, 2014

Testing the theilsen estimator took 3 seconds on my machine.

$ nosetests sklearn/linear_model/tests/test_theil_sen.py  -s
.............Breakdown point: 6.9312316997e-09
Number of samples: 10000
Tolerable outliers: 0
Number of subpopulations: 1
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished
.Breakdown point: 0.287035354437
Number of samples: 50
Tolerable outliers: 14
Number of subpopulations: 1225
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished
Breakdown point: 0.287035354437
Number of samples: 50
Tolerable outliers: 14
Number of subpopulations: 10
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished
...
----------------------------------------------------------------------
Ran 17 tests in 3.212s

Could you try to reduce the testing to something smaller (less than 1s) without any messages printed to the console?

Owner

arjoly commented Oct 16, 2014

By the way, congratulations on the 100% coverage!

Owner

arjoly commented Oct 16, 2014

What is the time required to run the new examples?

Owner

arjoly commented Oct 16, 2014

At the moment, I don't have time to properly review the correctness of the algorithm. For the rest, I am +1 when the last tidbits are fixed.

examples/linear_model/plot_robust_fit.py
@@ -0,0 +1,87 @@
+"""
+Demo robust fitting
@arjoly

arjoly Oct 16, 2014

Owner

What do you think of "Robust linear estimator fitting"?

Owner

GaelVaroquaux commented Oct 16, 2014

@arjoly: a bunch of very good catches (e.g. the random_state of the estimator in the tests, the naming suggestions, and the test time) that should be addressed.

Coverage Status

Coverage increased (+0.06%) when pulling f92074a on FlorianWilhelm:theilsen into 031a3fc on scikit-learn:master.

Contributor

FlorianWilhelm commented Oct 17, 2014

@arjoly Regarding the new examples, I used the Unix time command and looked at the user time since the script finishes only when I close the plotting windows.

  • plot_theilsen: 1.865s
  • plot_robust_fit: 6.152s
examples/linear_model/plot_theilsen.py
+print(__doc__)
+
+estimators = [('OLS', LinearRegression()),
+ ('Theil-Sen', TheilSenRegressor()),
@arjoly

arjoly Oct 17, 2014

Owner

the random state is missing here

@FlorianWilhelm

FlorianWilhelm Oct 17, 2014

Contributor

@arjoly sorry that I forgot that one, my bad :-/

@arjoly

arjoly Oct 17, 2014

Owner

Thanks !

Owner

arjoly commented Oct 17, 2014

I won't be able to find much time for this before next week. @ogrisel and @GaelVaroquaux feel free to finish the review and merge before then if you want.

Contributor

FlorianWilhelm commented Oct 27, 2014

@arjoly @GaelVaroquaux What are the next steps? Do I need to ping someone like @larsmans in order to get this merged? The related PR #3764 has also [MRG+1].

Owner

larsmans commented Oct 27, 2014

I know nothing about this estimator, so no.

Owner

MechCoder commented Nov 20, 2014

Is this ready to go? @ogrisel and @GaelVaroquaux I see +1's from you, and another implicit +1 by @arjoly !

Owner

ogrisel commented Nov 20, 2014

Both @GaelVaroquaux and I already gave +1, and the last batch of comments by @arjoly has seemingly been addressed; let's merge.

ogrisel added a commit that referenced this pull request Nov 20, 2014

Merge pull request #2949 from FlorianWilhelm/theilsen
 [MRG+1] TheilSen robust linear regression

@ogrisel ogrisel merged commit f0fe4af into scikit-learn:master Nov 20, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Owner

ogrisel commented Nov 20, 2014

Thanks again @FlorianWilhelm for the contribution (especially the effort in the doc and examples)!

Owner

GaelVaroquaux commented Nov 20, 2014

Thanks heaps. This is a quality contribution!

Owner

arjoly commented Nov 20, 2014

Congratulations @FlorianWilhelm 👍

Contributor

FlorianWilhelm commented Nov 20, 2014

Thank you all for helping me make my first Scikit-Learn contribution! It was a lot of work but I had tons of fun :-)

Owner

ogrisel commented Nov 20, 2014

@FlorianWilhelm I forgot: can you please open a new PR to add an entry to the doc/whats_new.rst file? Please link your name either to your github account or a personal webpage of your choice at the end of the file.

Contributor

FlorianWilhelm commented Nov 21, 2014

@ogrisel I updated whats_new.rst in PR #3870.

@FlorianWilhelm FlorianWilhelm deleted the FlorianWilhelm:theilsen branch Nov 21, 2014

Owner

amueller commented Jan 8, 2015

Stupid question: is there a simple way to make this run fast? It is quite slow in the common tests, even on trivial datasets. I guess setting n_subsamples is the trick, but that is only possible if you know the number of samples, right?

Contributor

FlorianWilhelm commented Jan 8, 2015

@amueller You could also use max_subpopulation to cap the maximum number of subsets that are considered. If you want to reduce the runtime with n_subsamples, you need to know the number of samples.
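A sketch of that trade-off (made-up data; max_subpopulation caps how many subset combinations are actually solved, independent of n_samples):

```python
import numpy as np
from sklearn.linear_model import TheilSenRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))
y = X.dot([1.0, 2.0]) + 0.5 + rng.normal(scale=0.05, size=300)

# only 100 random subsets are solved instead of up to 10**4 (the
# default cap), so fitting is much faster on larger data
fast = TheilSenRegressor(max_subpopulation=100, random_state=0).fit(X, y)
print(fast.coef_, fast.intercept_)  # still close to [1, 2] and 0.5
```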

Owner

jnothman commented Jan 8, 2015

Would it be appropriate to allow n_subsamples as a float to be a proportion of the number of samples?

Contributor

FlorianWilhelm commented Jan 10, 2015

No, the complexity is $\binom{n_{samples}}{n_{subsamples}}$. If you look at Pascal's triangle, the closer you get to the center of a row, the larger the numbers become. So if you have n_features and fit a problem with n_samples including an intercept, the efficiency only starts to improve once n_subsamples is greater than or equal to n_samples - n_features.
What we could do is specify it as a ratio of the number of features. If you choose ratio=1.0 (the default) you get the complexity $\binom{n_{samples}}{n_{subsamples}}$. Otherwise we change n_subsamples in order to get the complexity $\binom{n_{samples}}{n_{samples} - ratio \cdot n_{features}}$.
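The symmetry argument can be checked numerically (math.comb needs Python 3.8+; the variable names mirror the discussion, not the estimator's parameters):

```python
from math import comb  # Python 3.8+

n_samples, n_features = 50, 3
n_min = n_features + 1  # smallest subset size when fitting an intercept

# exhaustive Theil-Sen with minimal subsets
print(comb(n_samples, n_min))                   # 230300 subsets
# binomial symmetry: the cost only drops again near the other end of the row
print(comb(n_samples, n_samples - n_features))  # 19600 subsets
print(comb(n_samples, n_samples // 2))          # worst case at the row's center
```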
