Truncating in MinMaxScaler #3342

Open · dougalsutherland wants to merge 2 commits into scikit-learn:master from dougalsutherland:truncating-minmax

4 participants

@dougalsutherland

The output of MinMaxScaler doesn't always lie within the passed feature_range: if the data you transform() has values outside the range of the data you fit() on, the transformed values fall outside it too. If you're only using the scaler to make the scale of the data nicer, this probably doesn't matter, but if your algorithm actually relies on the data lying in a certain range (example), this is no good.
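
To make the failure mode concrete, here is a minimal reproduction against the stock scaler (the printed values are just hand-computed from its linear map, not taken from the PR):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(np.array([[0.0], [10.0]]))  # learns the linear map x -> x / 10

# transforming data outside the training range escapes feature_range:
print(scaler.transform(np.array([[15.0]])))  # [[ 1.5]] -- above the range
print(scaler.transform(np.array([[-5.0]])))  # [[-0.5]] -- below the range
```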

So, this PR adds optional support for truncation, so that values that would be transformed outside of feature_range are clipped to its endpoints. It also adds a fit_feature_range parameter to make truncation less likely (e.g. if you need your data to lie in [0, 1], you can map your training data into [0.1, 0.9], so that test values have more room before they get clipped).
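
A rough sketch of the two proposed behaviours (the parameter names follow the PR description; the real implementation lives in the diff):

```python
import numpy as np

def minmax_transform(X, data_min, data_max, feature_range=(0, 1),
                     truncate=False, fit_feature_range=None):
    """Sketch of the proposed semantics, not the PR's actual code.

    fit_feature_range, if given, is the narrower range the training
    extremes are mapped onto; truncation still happens at the outer
    feature_range, so mildly out-of-range test points keep some slack.
    """
    lo, hi = fit_feature_range if fit_feature_range is not None else feature_range
    Xt = (X - data_min) / (data_max - data_min) * (hi - lo) + lo
    if truncate:
        Xt = np.clip(Xt, *feature_range)
    return Xt
```

With data fit on [0, 10], a test value of 11 maps to 1.1 and gets flattened to 1.0 under plain truncation, while fit_feature_range=(0.1, 0.9) sends it to 0.98 and avoids the clip altogether.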

Incidentally, I also add assert_array_{less_equal,greater,greater_equal} because my tests wanted them and it's silly that numpy only provides assert_array_less.
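
Such helpers are thin wrappers over elementwise comparisons; a minimal sketch of one (not the PR's exact code):

```python
import numpy as np

def assert_array_less_equal(x, y, err_msg=''):
    # elementwise x <= y, by analogy with numpy's assert_array_less
    x, y = np.asanyarray(x), np.asanyarray(y)
    if not np.all(x <= y):
        raise AssertionError(
            'Arrays are not less-or-equal ordered\nx: %r\ny: %r\n%s'
            % (x, y, err_msg))
```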

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 304b84b on dougalsutherland:truncating-minmax into 82611e8 on scikit-learn:master.

dougalsutherland added a commit to dougalsutherland/skl-groups that referenced this pull request on Jul 3, 2014: add truncating MinMaxScaler (36ab4fa)
@dougalsutherland

Noticed a small doc error in the testing utils, so fixed that.

I should say that I'm not totally satisfied with the fit_feature_range argument and would be happy to hear another way to handle that. (A "wiggle_room" parameter that shrinks the range by some portion?)

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling 97099d9 on dougalsutherland:truncating-minmax into 82611e8 on scikit-learn:master.

@untom
untom commented Jul 5, 2014

I think the truncation is a nice new feature, but it seems to me that fit_feature_range has a very narrow use case, so I'm not sure that parameter is worth the added complexity -- users who really need such behaviour would probably be better off running the data through a sigmoid transformation as a preprocessing step, instead of having a "hard" cut-off, no?
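
For the record, the squashing preprocessor untom has in mind might look something like this (a sketch; nothing like it is being proposed in this PR):

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic sigmoid

def sigmoid_squash(X, center, scale):
    # smooth map into (0, 1): extreme values saturate toward the
    # endpoints instead of hitting a hard cut-off
    return expit((X - center) / scale)

# e.g. with statistics taken from the training data:
# sigmoid_squash(X_test, X_train.mean(axis=0), X_train.std(axis=0))
```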

@dougalsutherland

@untom Yeah, a sigmoid transformation might make sense, depending on the use case. I agree that fit_feature_range is probably more complex than it's worth.

@jnothman
scikit-learn member
jnothman commented Aug 3, 2014

Another -1 here for fit_feature_range, but truncate might still be useful.

@dougalsutherland

Okay, here's a new version without fit_feature_range.
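
Usage of the trimmed-down version would presumably look like this (the truncate keyword name is assumed from the PR description, not confirmed against the diff):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical keyword from this PR, not the stock scaler
scaler = MinMaxScaler(feature_range=(0, 1), truncate=True)
scaler.fit(np.array([[0.0], [10.0]]))
scaler.transform(np.array([[15.0]]))  # would give [[1.0]] instead of [[1.5]]
```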

@coveralls

Coverage Status

Coverage increased (+0.01%) when pulling b8fbc74 on dougalsutherland:truncating-minmax into 0a7bef6 on scikit-learn:master.

