[MRG+1-1] Refactoring and expanding sklearn.preprocessing scaling #2514

Closed
wants to merge 7 commits into
from

Conversation

Contributor

untom commented Oct 11, 2013

Hi there!

This PR refactors the data-scaling code from sklearn.preprocessing to remove some duplicated code, and adds some new features. More specifically:

  • adds a RobustScaler class and robust_scale function that use robust estimates of the data's center/scale (median & interquartile range), which should work better in the presence of outliers.
  • Adds possibility to scale by sample instead of by feature, via an axis=1 parameter (the scale function could already do this, now the *Scaler classes can, too!)
  • adds minmax_scale function (requested by @ogrisel)
  • adds MaxAbsScaler, similar in functionality to MinMaxScaler, but it also works on sparse matrices, as proposed in the discussions of #1799.
  • Reuses the code common to StandardScaler, RobustScaler, MinMaxScaler and MaxAbsScaler by putting it in an abstract base class. Essentially the *Scaler classes are only responsible for estimating the necessary statistics in fit; the rest of the Transformer API (transform/inverse_transform, handling sparseness/different axis parameters) is implemented in a BaseScaler, as this code is common to all of the scalers.
    This caused some parameters and attributes to be renamed: with_centering and with_scaling are the parameters that control whether centering/scaling is performed, and the center_ and scale_ attributes store the centering/scaling values.
  • *_scale functions now simply reuse the *Scaler classes internally to avoid code duplication.
  • adds a lot of new tests for all the new functionality.
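The fit/transform split described above can be sketched roughly as follows. This is a minimal NumPy illustration of the proposed design, not the actual PR code; the `*Sketch` class names are hypothetical:

```python
import numpy as np


class BaseScalerSketch:
    """Shared transform logic; subclasses only estimate center_/scale_ in fit."""

    def __init__(self, with_centering=True, with_scaling=True):
        self.with_centering = with_centering
        self.with_scaling = with_scaling

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        if self.with_centering:
            X = X - self.center_
        if self.with_scaling:
            X = X / self.scale_
        return X

    def inverse_transform(self, X):
        X = np.asarray(X, dtype=float)
        if self.with_scaling:
            X = X * self.scale_
        if self.with_centering:
            X = X + self.center_
        return X


class RobustScalerSketch(BaseScalerSketch):
    """fit() estimates a robust center (median) and scale (IQR) per feature."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.center_ = np.median(X, axis=0)
        q75, q25 = np.percentile(X, [75, 25], axis=0)
        self.scale_ = q75 - q25  # interquartile range per feature
        return self
```

Because the median and IQR ignore extreme values, a few large outliers barely move center_ and scale_, unlike the mean and standard deviation used by StandardScaler.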

Notes and Caveats

  • StandardScaler had parameters with_mean and with_std which are renamed to with_centering and with_scaling to fall in line with the other Scalers. I wasn't sure how to handle deprecating the old parameter names in __init__ -- what's the protocol here?

  • RobustScaler cannot be fitted on sparse matrices:

    • centering doesn't make sense because it risks destroying sparsity (similar to what StandardScaler does)
    • Scaling doesn't work because there is no decent code to calculate the IQR of sparse matrices available in scipy
      As an alternative, we could advise people to use the MinMaxScaler instead to scale features to the same range::

    scaler = MinMaxScaler()
    scaler.with_centering=False
    scaler.fit_transform(X)

(MinMaxScaler doesn't support the with_centering parameter directly because I wasn't sure if this would lead to confusion).

Owner

ogrisel commented Oct 11, 2013

Please "git grep @deprecated" to find examples of estimators with deprecated init parameters. The old names should be preserved with a default value of None (or some other non-ambiguous default marker instance); if the value is not None, an informative deprecation warning should be raised and the new parameter should be set to the non-None old parameter value to preserve backward compat.
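The pattern ogrisel describes might look roughly like this, applied to the renaming in this PR. A hypothetical sketch, not the actual scikit-learn code; class name and warning text are illustrative:

```python
import warnings


class StandardScalerSketch:
    """Hypothetical sketch of deprecating renamed init parameters."""

    def __init__(self, with_centering=True, with_scaling=True,
                 with_mean=None, with_std=None):
        # Old names default to None; a non-None value means the caller is
        # still using the deprecated name, so warn and forward the value.
        if with_mean is not None:
            warnings.warn("with_mean was renamed to with_centering and will "
                          "be removed in a future release.",
                          DeprecationWarning)
            with_centering = with_mean
        if with_std is not None:
            warnings.warn("with_std was renamed to with_scaling and will "
                          "be removed in a future release.",
                          DeprecationWarning)
            with_scaling = with_std
        self.with_centering = with_centering
        self.with_scaling = with_scaling
```

Callers using the old names keep working but see a DeprecationWarning; new code uses the new names and triggers nothing.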

Owner

ogrisel commented Oct 11, 2013

For the sparse case I agree we should raise an informative exception advising the user to try scaler = MinMaxScaler(with_centering=False) instead of using RobustScaler.

Owner

amueller commented Oct 14, 2013

It would be awesome if we could also somehow include this: #1799

doc/modules/preprocessing.rst
+ of the data does sometimes not work very well. In these cases, you can use
+ :func:`robust_scale` and :class:`RobustScaler` as drop-in replacements
+ instead, which use more robust estimates for the center and range of your
+ data.
@arjoly

arjoly Oct 14, 2013

Owner

There is one more character at the beginning of the two previous lines.

@ogrisel

ogrisel Nov 21, 2013

Owner

This comment still needs to be addressed.

@untom

untom Nov 21, 2013

Contributor

d'oh =)

sklearn/preprocessing/data.py
+class Scaler(StandardScaler):
+ def __init__(self, copy=True, with_mean=True, with_std=True):
+ warnings.warn("Scaler was renamed to StandardScaler. The old name "
+ " will be removed in 0.15.", DeprecationWarning)
@arjoly

arjoly Oct 14, 2013

Owner

As the warning suggests, you can remove the old Scaler transformer

sklearn/preprocessing/data.py
+ if len(q.shape) == 1:
+ q = q.reshape(-1, 1)
+ self.scale_ = (q[1, :] - q[0, :]) / self.iqr_scale
+ #self.iqr_ = np.sqrt((q[1, :] - q[0, :]))
@arjoly

arjoly Oct 14, 2013

Owner

can be removed

sklearn/preprocessing/data.py
+
+ Parameters
+ ----------
+ iqr_scale : float, default = 1.34898.
@arjoly

arjoly Oct 14, 2013

Owner

What is the motivation for the default? Why not using an explicit name as interquartile_scale or interquartile?

@untom

untom Oct 14, 2013

Contributor

I'm glad you bring this up, because this is something I'm not too sure about myself. The motivation for it is that when the input data is Gaussian, this scaling would cause the result to have a variance/std of 1 (In other words, for a Normal distribution, the iqr = 1.349*sigma), which is a theoretical nicety -- assuming most data is somewhat Gaussian. But it's entirely optional and I'm not sure whether to leave it in or not.

It might also be misleading if users really do expect to have an std of 1 after scaling: even with data drawn from a proper Gaussian you need ~1k-10k datapoints until the robust_scale'd data reaches an std of 1.00 (= up to two significant digits).

Regarding the name you are probably right, interquartile_scale would be better (If the scaling stays in at all).
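The 1.349 constant can be checked directly: for a normal distribution the IQR equals 2·Φ⁻¹(0.75)·σ ≈ 1.34898·σ. A quick numerical check, assuming SciPy is available:

```python
from scipy import stats

# IQR of the standard normal: distance between its 75th and 25th percentiles.
# Dividing by this factor makes Gaussian data approach unit variance.
iqr_of_standard_normal = stats.norm.ppf(0.75) - stats.norm.ppf(0.25)
```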

@arjoly

arjoly Oct 14, 2013

Owner

An option would be to accept either a string or a float to set the interquartile_scale.
You would get something like

interquartile_scale : float or string in ["normal" (default)]
       The interquartile range is scaled by this factor. If ``interquartile_scale``
       is equal to ``"normal"``, the interquartile range will be
       set such that the data is scaled to unit variance. In order
       to converge to the unit variance, a sufficient number of samples is needed.

More choices for automatically setting the interquartile range could be provided. This also makes things less magic for the user.

It might also be misleading if users really do expect to have an std of 1 after scaling: even with data drawn from a proper Gaussian you need ~1k-10k datapoints until the robust_scale'd data reaches an std of 1.00 (= up to two significant digits).

This kind of remark is very good in the narrative user documentation.

@untom

untom Oct 14, 2013

Contributor

I'm not sure if "normal" is a helpful abstraction or not. On one hand, making string values available might make this more intuitive, on the other hand it might also be good if the user can see the numerical value. That way it's easier to understand how to adapt this for different input distributions.

@arjoly

arjoly Oct 14, 2013

Owner

On one hand, making string values available might make this more intuitive, on the other hand it might also be good if the user can see the numerical value.

You can remind the value in the doc.

@amueller

amueller Oct 22, 2013

Owner

I would use a string but put the float in the docstring (I think this is what @arjoly suggests). Having it only in the narrative would be somewhat annoying.

Contributor

untom commented Oct 14, 2013

Thanks for all the feedback. I will try to address this in a new commit in the upcoming days.

@amueller: If I understood it correctly, #1799 is actually something my implementation can already handle; it's just a matter of exposing an explicit with_centering=False parameter within MinMaxScaler. So including this feature should be no problem and is actually something I was already thinking about anyhow.

(Interestingly, #1799 also includes a per_feature parameter which has the same function as the axis parameter in my submission).

Owner

ogrisel commented Oct 21, 2013

@untom any news on this?

Contributor

untom commented Oct 21, 2013

Integrating #1799 wasn't as trivial as I previously thought, and I didn't have enough time to do it last week. I should have some time to get this done within the next few days.

Owner

ogrisel commented Oct 21, 2013

Thanks, no problem. I just wanted to make sure that the PR was not dying :)

Owner

amueller commented Oct 22, 2013

@untom what are the problems you found integrating with #1799? The problems I discussed with @temporaer on that one were mostly about the semantics. I think my last opinion was to disallow sparse input for min-max scaling if there was any offset. (maybe bail?)

Contributor

untom commented Oct 27, 2013

I've added a MaxAbsScaler as discussed in #1799. If this is to everyone's liking, I can start writing documentation for it, as well. Note that my implementations do not include a "global scaling" mode, simply because I'm not sure there's a usecase for it, but it can be easily added to all the scalers if required.
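The reason max-abs scaling is sparse-friendly can be sketched as follows. A hypothetical SciPy illustration, not the PR's implementation: dividing the stored nonzeros by a per-column factor never touches the implicit zeros, so sparsity is preserved (unlike centering).

```python
import numpy as np
from scipy import sparse


def max_abs_scale_sparse(X):
    """Hypothetical sketch: per-feature max-abs scaling of a sparse matrix."""
    X = sparse.csc_matrix(X, dtype=float, copy=True)
    # Per-column maximum absolute value (includes the implicit zeros).
    scale = abs(X).max(axis=0).toarray().ravel()
    scale[scale == 0] = 1.0  # leave all-zero columns untouched
    # CSC stores nonzeros column by column, so repeating each column's
    # factor by its nonzero count lines up with X.data exactly.
    X.data /= np.repeat(scale, np.diff(X.indptr))
    return X, scale
```

The matrix keeps the same sparsity pattern; only the stored values change.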

Owner

ogrisel commented Oct 28, 2013

Thanks @untom please feel free to go forward with the doc, it's interesting. Also please have a look at the travis failures.

Contributor

untom commented Oct 29, 2013

Thanks for looking over the code.... Your proposed test did indeed unearth a bug in my implementation. I'll write some documentation in the upcoming days, then :)

Owner

ogrisel commented Oct 30, 2013

Thanks @untom. Could you also please have a look at the broken tests reported by travis?

https://travis-ci.org/scikit-learn/scikit-learn/builds/13211844

Coverage Status

Coverage remained the same when pulling 99f5391 on untom:robust_scaling into d82cf06 on scikit-learn:master.

Coverage Status

Coverage remained the same when pulling 5ac05a9 on untom:robust_scaling into d82cf06 on scikit-learn:master.

Coverage Status

Coverage remained the same when pulling 8dfd0bf on untom:robust_scaling into d82cf06 on scikit-learn:master.

Coverage Status

Coverage remained the same when pulling 0a47b76 on untom:robust_scaling into d82cf06 on scikit-learn:master.

Contributor

untom commented Nov 7, 2013

Sorry for all the commits lately, but I stumbled on a late bug today. The good news is that thanks to the newly added tests, the test coverage for all the scaling code is now at 100% (except for 3 lines relating to printing deprecation warnings).

Coverage Status

Coverage remained the same when pulling 098a4ee on untom:robust_scaling into d82cf06 on scikit-learn:master.

Contributor

untom commented Nov 21, 2013

@ogrisel Any chance of this getting merged within the 0.15 window?

Coverage Status

Coverage remained the same when pulling 4df8cbb on untom:robust_scaling into d82cf06 on scikit-learn:master.

@ghost ghost assigned ogrisel Nov 21, 2013

Owner

ogrisel commented Nov 21, 2013

@untom this looks good. I'll try to have a deeper look soon; it should probably make it into 0.15.

Owner

ogrisel commented Nov 21, 2013

About the center_ name, I think renaming to shift_ would make more sense. I wonder what other people think.

sklearn/preprocessing/data.py
+ ----------
+ copy : boolean, optional, default is True
+ Set to False to perform inplace row normalization and avoid a
+ copy (if the input is already a numpy array).
@ogrisel

ogrisel Nov 22, 2013

Owner

I would not call that "row normalization" but rather just "scaling".

sklearn/preprocessing/data.py
+ axis : int (0 by default)
+ axis used to compute the scaling statistics along. If 0,
+ independently standardize each feature, otherwise (if 1) standardize
+ each sample.
@ogrisel

ogrisel Nov 22, 2013

Owner

I would not call that "standardize" but just "scale".

sklearn/preprocessing/data.py
+ The interquartile range is divided by this factor. If
+ `interquartile_scale` is "normal", the data is scaled so it
+ approximately reaches unit variance. This converge assumes Gaussian
+ input data and will need a largenumber of samples.
@ogrisel

ogrisel Nov 22, 2013

Owner

=> large number.

Owner

ogrisel commented Nov 22, 2013

I rebased this PR on top of the current master and got the following doctest failures when running the full test suite (I don't think the doctest failures are a consequence of the rebase, though).

$ nosetests doc/modules/preprocessing.rst
======================================================================
FAIL: Doctest: preprocessing.rst
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/doctest.py", line 2201, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for preprocessing.rst
  File "/Users/ogrisel/code/scikit-learn/doc/modules/preprocessing.rst", line 0

----------------------------------------------------------------------
File "/Users/ogrisel/code/scikit-learn/doc/modules/preprocessing.rst", line 75, in preprocessing.rst
Failed example:
    scaler
Expected:
    StandardScaler(copy=True, with_centering=True, with_scaling=True)
Got:
    StandardScaler(axis=0, copy=True, with_centering=True, with_mean=None,
            with_scaling=True, with_std=None)
----------------------------------------------------------------------
File "/Users/ogrisel/code/scikit-learn/doc/modules/preprocessing.rst", line 137, in preprocessing.rst
Failed example:
    min_max_scaler.scale_                             # doctest: +ELLIPSIS
Expected:
    array([ 0.5       ,  0.5       ,  0.33...])
Got:
    array([ 2.,  2.,  3.])
----------------------------------------------------------------------
File "/Users/ogrisel/code/scikit-learn/doc/modules/preprocessing.rst", line 140, in preprocessing.rst
Failed example:
    min_max_scaler.min_                               # doctest: +ELLIPSIS
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "<doctest preprocessing.rst[24]>", line 1, in <module>
        min_max_scaler.min_                               # doctest: +ELLIPSIS
    AttributeError: 'MinMaxScaler' object has no attribute 'min_'
----------------------------------------------------------------------
File "/Users/ogrisel/code/scikit-learn/doc/modules/preprocessing.rst", line 167, in preprocessing.rst
Failed example:
    X_train_maxabs
Expected:
    array([[ 0.5       , -1.        ,  1.  ],
           [ 1.        ,  0.        ,  0.  ],
           [ 0.        ,  1.        , -0.5 ]])
Got:
    array([[ 0.5, -1. ,  1. ],
           [ 1. ,  0. ,  0. ],
           [ 0. ,  1. , -0.5]])
----------------------------------------------------------------------
File "/Users/ogrisel/code/scikit-learn/doc/modules/preprocessing.rst", line 173, in preprocessing.rst
Failed example:
    X_test_maxabs
Expected:
    array([[-1.5       ,  0.       ,  2. ]])
Got:
    array([[-1.5, -1. ,  2. ]])
----------------------------------------------------------------------
File "/Users/ogrisel/code/scikit-learn/doc/modules/preprocessing.rst", line 175, in preprocessing.rst
Failed example:
    max_abs_scaler.scale_                             # doctest: +ELLIPSIS
Expected:
    array([ 2.        ,  1.        ,  2. ...])
Got:
    array([ 2.,  1.,  2.])
----------------------------------------------------------------------
File "/Users/ogrisel/code/scikit-learn/doc/modules/preprocessing.rst", line 177, in preprocessing.rst
Failed example:
    min_max_scaler.min_                               # doctest: +ELLIPSIS
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python/2.7.5/Frameworks/Python.framework/Versions/2.7/lib/python2.7/doctest.py", line 1289, in __run
        compileflags, 1) in test.globs
      File "<doctest preprocessing.rst[33]>", line 1, in <module>
        min_max_scaler.min_                               # doctest: +ELLIPSIS
    AttributeError: 'MinMaxScaler' object has no attribute 'min_'

>>  raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x10a39f3b0>.getvalue()))
----------------------------------------------------------------------
Ran 34 tests in 13.719s

Owner

ogrisel commented Nov 22, 2013

Also @untom what do you think of a robust variant of the MaxAbsScaler using iqr independently on the positive and negative sides to discard the outliers when estimating the scale factor?

Owner

ogrisel commented Nov 22, 2013

BTW most of the doctest failures can be fixed by adding #doctest +NORMALIZE_WHITESPACE where appropriate.
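For reference, the directive makes doctest treat any run of whitespace in expected and actual output as equivalent. A small self-contained illustration (not from the PR):

```python
import doctest


def normalized_example():
    """Example whose expected output has different spacing than the actual output.

    >>> print('a    b')  # doctest: +NORMALIZE_WHITESPACE
    a b
    """


# Run just this docstring; with the directive, the mismatched spacing passes.
runner = doctest.DocTestRunner()
for test in doctest.DocTestFinder().find(normalized_example, 'normalized_example'):
    runner.run(test)
```

Without the directive, the example above would fail, since doctest compares expected and actual output character for character by default.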

Contributor

untom commented Nov 23, 2013

Thanks for looking all of this over once more. Sorry for not having checked the doctests myself, I should've thought of that.

#doctest +NORMALIZE_WHITESPACE sometimes didn't work for me, even when whitespace was the only thing wrong. In those cases I directly fixed the whitespace itself. E.g.:

File "./scikit-learn/doc/modules/preprocessing.rst", line 171, in preprocessing.rst
Failed example:
    X_test_maxabs                               # doctest: +NORMALIZE_WHITESPACE
Expected:
    array([[-1.5      , -1.        ,  2. ]])
Got:
    array([[-1.5, -1. ,  2. ]])

In those cases I just fixed the expected output directly.

Owner

ogrisel commented Nov 24, 2013

Thanks. What do you think of the idea of a robust scaler that does not break sparsity such as a robust max abs scaler?

Contributor

untom commented Nov 24, 2013

I like your idea; it would of course be nice to have a robust scaler for sparse data. This functionality could easily be integrated into MaxAbsScaler by adding a quantile argument. So instead of always scaling to the maximal absolute value (which is what quantile=1 would do), it could also scale to e.g. the 3rd quartile (when quantile=0.75). This would also keep the overall API quite simple (I feel like adding a RobustMaxAbsScaler might lead to confusion, with many different, yet very similar scalers around).

The only drawback is that the name "MaxAbsScaler" might not be a perfectly good fit. The best I could come up with is AbsQuantileScaler, maybe you have a better idea?
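The proposal could be sketched like this (hypothetical `quantile` parameter, dense case only for brevity):

```python
import numpy as np


def abs_quantile_scale(X, quantile=1.0):
    """Hypothetical sketch: scale each feature by a quantile of its absolute values.

    quantile=1.0 reproduces max-abs scaling; quantile=0.75 divides by the
    3rd quartile of |X| per feature, so a few huge outliers are ignored.
    """
    X = np.asarray(X, dtype=float)
    scale = np.quantile(np.abs(X), quantile, axis=0)
    scale = np.where(scale == 0, 1.0, scale)  # leave all-zero columns alone
    return X / scale, scale
```

Since the scale factor is a plain per-feature divisor, the same idea would carry over to sparse input the way max-abs scaling does.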

Owner

ogrisel commented Nov 25, 2013

I agree: adding a quantile argument to the existing MaxAbsScaler sounds like the best way to go. About the name I think I am fine with both. @amueller @arjoly any opinion?

Coverage Status

Coverage remained the same when pulling 0a9bc01 on untom:robust_scaling into d82cf06 on scikit-learn:master.

sklearn/preprocessing/tests/test_data.py
+ assert_array_almost_equal(X_trans, X_expected_t, 4)
+ X_trans_inv = scaler.inverse_transform(X_trans)
+ assert_array_almost_equal(X_t, X_trans_inv)
+
@ogrisel

ogrisel Nov 27, 2013

Owner

PEP8 / Cosmetics: 2 blank lines between top level function or class definitions.

+ X = [[0., 1., +0.5],
+ [0., 1., -0.3],
+ [0., 1., +1.5],
+ [0., 0., +0.0]]
@jnothman

jnothman Jul 5, 2014

Owner

I just noticed this different line (although perhaps I've noticed it before). It might be worth leaving a comment on this difference from other zero_variance tests.

Owner

larsmans commented Jul 13, 2014

Ugh... this PR is huge and won't merge cleanly. I'm going to try and cherry-pick it (at least partially) into master piece by piece.

Owner

larsmans commented Jul 13, 2014

I've managed to drag out the multi-axis mean_variance_axis function, but I'm not pushing it to master since it serves no purpose on its own.

I'm -1 on merging this in its current state because it's too messy. The code needs a new rebase, and preferably it should be broken back up into individual commits. It's not just a refactor: it adds two new classes and deprecates existing functionality.

Also, was there ever a discussion about this deprecation? Introducing a base class can be done without renaming attributes.

@larsmans larsmans changed the title from [MRG+1] Refactoring and expanding sklearn.preprocessing scaling to [MRG+1-1] Refactoring and expanding sklearn.preprocessing scaling Jul 13, 2014

Owner

larsmans commented Jul 13, 2014

That's f08b8c8. Sorry for the noise.

Owner

jnothman commented Jul 13, 2014

There was a lot of discussion of this PR before I got to it, so in truth I have no idea if the deprecation was discussed; you might be right that the new naming is counter-intuitive.

Yes, this is a large PR. I found the testing particularly difficult to work through. I think there might still be a few small errors and untidiness in the tests. I've slowly attempted to rewrite it using test case inheritance to make the invariants and specialised tests a little clearer, even if this is not a testing approach I have seen in scikit-learn. The work in progress is at
https://github.com/jnothman/scikit-learn/compare/untom:robust_scaling...scaling?expand=1

Contributor

untom commented Jul 13, 2014

Parts of this PR were already merged in another commit, and I don't mind splitting it apart further (e.g. moving out the sparsefuncs stuff into its own commit). I will have time to do a proper rebase next week.

I am not aware of this deprecating existing functionality, except for renaming a parameter. This is done purely to maintain coherent parameter names across the scalers, which I think makes sense.

Owner

larsmans commented Jul 14, 2014

I meant exactly the parameter renaming, which I don't like very much because the standardizers are core functionality that should really, really be stable across releases.

Contributor

untom commented Jul 21, 2014

I rebased this PR onto the current master, and grouped the changes into several commits, as requested by @larsmans. This probably also makes them easier to review and/or cherry-pick (I tried my best to make each commit cherry-pickable and workable in isolation, but slight modifications might be needed).

As for the renaming: I agree that it is a nuisance. However, it makes it easier to make the different scalers interchangeable. Imagine someone changing their algorithm to use a RobustScaler, and suddenly errors popping up at a different end of the code because scaler.std_ no longer exists. Or someone trying out different scalers for a specific problem who has to riddle her code with if isinstance(scaler, StandardScaler) checks because the names aren't consistent. It also makes the names easier to remember, since they're the same across all Scalers/scaling-functions. Thus I think down the road it's better to rename the parameters.

However, since StandardScaler is probably a heavily-used class (more used than the others), it might make sense to make the deprecation period longer, so everyone can catch up (even though based on the recent schedule, a 2-version period still means everyone has 2 years to update their code).

Contributor

untom commented Jul 28, 2014

ping @larsmans @ogrisel @amueller

Any more thoughts on this?

Coverage Status

Coverage increased (+0.01%) when pulling b918e66 on untom:robust_scaling into 0a7bef6 on scikit-learn:master.

Contributor

untom commented Sep 1, 2014

This PR has been sitting idle for quite a while now. I still think it's a worthwhile addition to sklearn. If there's anything I can do to speed up the acceptance/review process, let me know.

Owner

arjoly commented Sep 2, 2014

@untom Thanks for your work and patience.

Something that could help to attract reviewers would be to break this pull request into small and independent ones.

Contributor

untom commented Sep 2, 2014

Thanks for the advice, I will do that :)

Owner

jnothman commented Sep 2, 2014

@arjoly one problem with that is that this PR puts a lot of effort into consistency and invariance tests.

Contributor

untom commented Sep 2, 2014

@jnothman: My current plan for splitting this is as follows:

  1. sparsefuncs improvements ( #3622 )
  2. BaseEstimator abstraction ( ==> those will also include all the invariance tests)
  3. RobustScaler
  4. MaxAbsScaler

PR number 2 will include all the consistency/invariance tests that are present here, but will only test them on StandardScaler/MinMaxScaler. But I will make sure that no tests are lost. (This might also give me an incentive to review the test cases to make sure they make sense / are thorough and non-redundant.)

PR 3/4 will then add the other scalers to the list of tested scalers (and of course include any tests specific to Robust- /MaxAbsScaler).

This way the changes are more modular, and e.g. RobustScaler can be included even if MaxAbsScaler is deemed not worthy of inclusion.

So all in all I think arjoly's proposal makes sense, and it should be doable without too much effort on either my side or the reviewers' side.

Owner

jnothman commented Sep 2, 2014

Okay. @arjoly, @untom, what do you think of the alternative construction of invariance tests in jnothman/scikit-learn@0e4d04c ? It uses test inheritance to make clear the common features and differences between different scalers. (Just to be annoying, that reconstruction isn't quite complete, but does do some cleaning up of the tests' content without documenting exactly what it fixes.)
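The test-inheritance idea can be illustrated with a self-contained sketch: a mixin holds the shared invariance test, and each scaler gets a small TestCase subclass. The tiny stand-in scalers here are placeholders, not the real estimators:

```python
import unittest

import numpy as np


class _MaxAbsStandIn:
    """Placeholder for MaxAbsScaler, just enough for the sketch."""

    def fit(self, X):
        self.scale_ = np.abs(X).max(axis=0)
        return self

    def transform(self, X):
        return X / self.scale_

    def inverse_transform(self, X):
        return X * self.scale_


class _MinMaxStandIn:
    """Placeholder for MinMaxScaler, just enough for the sketch."""

    def fit(self, X):
        self.min_ = X.min(axis=0)
        self.scale_ = X.max(axis=0) - self.min_
        return self

    def transform(self, X):
        return (X - self.min_) / self.scale_

    def inverse_transform(self, X):
        return X * self.scale_ + self.min_


class ScalerInvarianceTests:
    """Mixin of invariants every scaler must satisfy; subclasses pick the scaler."""

    def make_scaler(self):
        raise NotImplementedError

    def test_inverse_roundtrip(self):
        rng = np.random.RandomState(0)
        X = rng.randn(20, 3)
        scaler = self.make_scaler().fit(X)
        X_back = scaler.inverse_transform(scaler.transform(X))
        np.testing.assert_allclose(X_back, X)


class TestMaxAbs(ScalerInvarianceTests, unittest.TestCase):
    def make_scaler(self):
        return _MaxAbsStandIn()


class TestMinMax(ScalerInvarianceTests, unittest.TestCase):
    def make_scaler(self):
        return _MinMaxStandIn()
```

Test discovery then runs the shared invariance test once per concrete class, so adding a scaler means adding one small subclass rather than duplicating the test body.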

Contributor

untom commented Sep 2, 2014

I like it, looks elegant and a bit "cleaner" than iterating over lists.

Owner

jnothman commented Sep 2, 2014

The issue then is whether it's acceptable in a project where unittest classes are avoided...

Contributor

untom commented Sep 4, 2014

Are they being actively avoided, or was there just no use case for them until now?

Owner

larsmans commented Sep 4, 2014

The SGD code uses them, but they're a bit of a pain with nosetests.

Contributor

untom commented Sep 4, 2014

How so?

Owner

larsmans commented Sep 4, 2014

Because running a single test takes a lot of typing if it's in a class :)

And there's seldom a need for these classes. We usually just loop over things or call common functions.

Owner

amueller commented Aug 25, 2015

Closing as merged in #4828 and #4125.

@amueller amueller closed this Aug 25, 2015
