Global normalization and sparse matrix support for MinMaxScaler #1799

Closed
wants to merge 2 commits into
from

Conversation

6 participants
Contributor

temporaer commented Mar 21, 2013

tests working & doc in place, needs feedback and merge if OK

Owner

amueller commented Mar 21, 2013

awesome, thanks :)
Could it be that you forgot to add some files / modify the setup.py? Travis is complaining.

Owner

amueller commented Mar 21, 2013

You didn't add the updated .c file ;)

Contributor

temporaer commented Mar 21, 2013

yep, I didn't want to clutter the commit log at this stage. I'll add some tests first and then take care of this whiny Travis guy.

@mblondel mblondel commented on an outdated diff Mar 22, 2013

sklearn/utils/sparsefuncs.pyx
+ max: float array with shape (n_features,)
+ Feature-wise maxima
+ """
+ cdef unsigned int n_samples = X.shape[0]
+ cdef unsigned int n_features = X.shape[1]
+
+ cdef np.ndarray[DOUBLE, ndim=1] X_data = X.data
+ cdef np.ndarray[int, ndim=1] X_indices = X.indices
+ cdef np.ndarray[int, ndim=1] X_indptr = X.indptr
+
+ cdef unsigned int i
+ cdef unsigned int j
+ cdef unsigned int ind
+ cdef double v
+
+ # means[j] contains the minimum of feature j
@mblondel

mblondel Mar 22, 2013

Owner

wrong comment :)

Owner

ogrisel commented Mar 22, 2013

yep, I didn't want to clutter the commit log at this stage.

Please feel free to check in the .c file as well. The diff clutter can be controlled by:

https://github.com/scikit-learn/scikit-learn/blob/master/.gitattributes

Owner

ogrisel commented Mar 22, 2013

This looks like a great contrib @temporaer . Looking forward to the tests and some documentation updates in the preprocessing section.

@larsmans larsmans commented on an outdated diff Mar 27, 2013

sklearn/preprocessing.py
- data_min = np.min(X, axis=0)
- data_range = np.max(X, axis=0) - data_min
+ if sp.issparse(X):
+ warn_if_not_float(X, estimator=self)
+ if self.per_feature:
+ data_min, data_max = min_max_axis0(X)
+ else:
+ data_min, data_max, dense = np.min(X.data), np.max(X.data), X.data.size == np.prod(X.shape)
+ if not dense:
+ data_min = min(0, data_min)
+ data_max = max(0, data_max)
+
+ data_range = data_max - data_min
+ else:
+ # TODO why would anyone want to force a copy here? (this was here
+ # before, I believe)
@larsmans

larsmans Mar 27, 2013

Owner

I don't think the copy is necessary. It's only relevant in the transform method.

@larsmans larsmans and 2 others commented on an outdated diff Mar 27, 2013

sklearn/preprocessing.py
self.scale_ = (feature_range[1] - feature_range[0]) / data_range
self.min_ = feature_range[0] - data_min * self.scale_
+ if sp.issparse(X):
+ if np.any(self.min_ != 0.0):
+ raise ValueError("For sparse X, the data minimum must equal the range minimum, "
+ "otherwise we'd need to densify the matrix, which you probably "
+ "do not want to happen.")
@larsmans

larsmans Mar 27, 2013

Owner

This is brittle; sparse matrices are mostly used for frequency tables, and it just might happen that one of the folds in a k-fold cross-validation has non-zero minima. I'd rather have MinMaxScaler assume zero minima for sparse matrices. To make this less of a surprise, we might add a parameter to the constructor to control this.

@larsmans

larsmans Mar 27, 2013

Owner

Also, it just crossed my mind that the values produced by FeatureHasher are "signed frequencies", where the sign is pseudo-random. Should we handle that as well? (The non_negative option on FeatureHasher can take care of it, but destroys the built-in centering of the hashing trick.)

@temporaer

temporaer Mar 27, 2013

Contributor

That was my initial plan, simply assuming that every sparse matrix [column] has zeros in it (which could also be maxima, btw). The disadvantage is that the scaler behaves differently for a dense matrix X and its sparse conversion. I'd suggest keeping the current behavior as the default and adding an additional "assume_contains_zeros" option to the constructor for the case you mentioned?

Can you be more specific about "handling" the pseudorandom signs?

@larsmans

larsmans Mar 27, 2013

Owner

I use hashing for booleans and frequencies, and I think that's the main use case (it is the one envisioned by Weinberger et al.). So, negative elements are only produced by the hash function, not the actual data source. In that case, scaling should be done by setting the min to zero and the max to the maximum of absolute values in X.
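A minimal sketch of that kind of scaling, assuming hashed output stored in a CSR matrix and a global (whole-matrix) maximum of absolute values; the example data is made up:

    import numpy as np
    import scipy.sparse as sp

    # Signed frequencies as FeatureHasher might produce them: the underlying
    # counts are non-negative, but the hash function flips signs pseudo-randomly.
    X = sp.csr_matrix([[3., 0., -1.],
                       [0., -2., 4.],
                       [1., 0., 0.]])

    # Scale by the maximum absolute value; zeros stay zeros, so sparsity
    # is preserved and no additive offset is needed.
    scale = np.abs(X.data).max()
    X_scaled = X * (1.0 / scale)
    print(X_scaled.toarray())  # every value now lies in [-1, 1]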

@larsmans

larsmans Mar 27, 2013

Owner

If it's too much work, then don't bother -- StandardScaler(with_mean=False) works fine on hashing output.

@temporaer

temporaer Mar 27, 2013

Contributor

If I understand correctly, you want to have an option for doing data_max = max(data_max, -data_min)? And the same for data_min? I can see that this could come in handy at times...

@larsmans

larsmans Mar 27, 2013

Owner

Yes, except that data_min would just be zero.

@amueller

amueller Apr 2, 2013

Owner

Hmm, I don't completely understand. Basically what you want is to simply scale the data, not shift it, right?

You proposed to scale it by scaling=max(data_max, -data_min). How would this translate to the feature_range parameter of MinMaxScaler?

The feature range basically has two parameters, while pure scaling has only one. So no matter how you choose to parametrize, you won't get both X.min() == feature_range[0] and X.max() == feature_range[1] afterwards.

I agree that this would be a nice-to-have feature, but I don't see currently how to build an intuitive API with MinMaxScaler.
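To make the parameter-counting argument concrete, a small sketch in plain NumPy; the pure-scaling formula shown here follows the scaling=max(data_max, -data_min) proposal above and is not code from this branch:

    import numpy as np

    X = np.array([-2., 1., 3.])             # data range is [-2, 3]
    feature_range = (0., 1.)

    # Pure scaling: one free parameter, so only one endpoint can be pinned.
    s = feature_range[1] / max(X.max(), -X.min())
    print(X * s)                            # maximum maps to 1, but the minimum is not 0

    # Scale + offset: two free parameters, so both endpoints can be hit.
    scale = (feature_range[1] - feature_range[0]) / (X.max() - X.min())
    print((X - X.min()) * scale + feature_range[0])   # min -> 0, max -> 1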

@larsmans

larsmans Apr 2, 2013

Owner

Indeed, the API would be complicated. As I already pointed out, a StandardScaler can handle this kind of data without shifting, so don't bother if it's too hard. (Also, with the boolean features I'm using right now, scaling has hardly any effect if you use plenty of hash buckets.)

Owner

amueller commented Apr 2, 2013

@temporaer could you please squash the commits? (because that's something we do now ^^)

Contributor

temporaer commented Apr 2, 2013

uh... squash /all/ of them into one? Or just clean up a little?

Owner

larsmans commented Apr 2, 2013

Preferably everything. That keeps the history slim, makes reverting easy, prevents broken states in the history, and hides our mistakes from the users ;)

Owner

amueller commented Apr 2, 2013

Thanks. +1 for merge :)

@larsmans larsmans and 1 other commented on an outdated diff Apr 16, 2013

sklearn/preprocessing.py
@@ -171,6 +172,19 @@ class MinMaxScaler(BaseEstimator, TransformerMixin):
Set to False to perform inplace row normalization and avoid a
copy (if the input is already a numpy array).
+ per_feature : boolean, optional, default is True
+ Set to False to normalize using the minimum/maximum of all
+ features, not per column
+
+ assume_contains_zeros : boolean, optional, default is False
@larsmans

larsmans Apr 16, 2013

Owner

I'm sorry, I really don't understand this explanation.

@temporaer

temporaer Apr 17, 2013

Contributor

How about: If True, normalize all columns separately. Set to False to normalize using minimum and maximum of X. Maybe also rename to per_column?

@larsmans

larsmans Apr 17, 2013

Owner

I wasn't referring to per_feature, that's fine as it is. I don't understand the behavior of assume_contains_zeros.

@temporaer

temporaer Apr 17, 2013

Contributor

Hm OK, there was only one "explanation" in the excerpt that github selected :-)

How about:
MinMaxScaler does not work for sparse matrices when the additive component of the normalization is non-zero. When normalizing to the range of [0,1], this implies that the minimum in every column of X must be 0. Setting assume_contains_zeros=True enforces this constraint even if some columns of X are strictly positive.
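To illustrate why, a minimal sketch (plain NumPy/SciPy, not code from the branch) of what a non-zero additive component does to a sparse matrix:

    import numpy as np
    import scipy.sparse as sp

    X = sp.csr_matrix([[1., 0., 0.],
                       [0., 2., 0.],
                       [0., 0., 3.]])

    data_min, data_max = 1.0, 3.0         # min/max of the stored (non-zero) values
    scale = 1.0 / (data_max - data_min)   # map [1, 3] onto [0, 1]
    offset = -data_min * scale            # non-zero additive component

    # The offset maps every implicit zero to `offset`, so the result is dense.
    result = X.toarray() * scale + offset
    print(np.count_nonzero(result))       # 8 of 9 entries are now non-zero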

Owner

amueller commented May 5, 2013

@larsmans ok for merge?

Owner

larsmans commented May 5, 2013

The code looks alright, but I still don't understand the assume_contains_zeros option, and I want the API to be clear before merging this.

Owner

amueller commented May 5, 2013

I agree. I'm also not sure what it is we actually want in the sparse case.
I think the default option should keep sparsity structure, as this is how we did it in other cases afaik (RandomizedPCA for example), and the user needs to be explicit to destroy sparsity structure.

So maybe we want a with_offset option which is True for dense and False for sparse input.

If the input is strictly positive, between 0 and x, and the given range is [0, a], then the right thing will happen.
If the input is between -x and x and the range is [-a, a], the right thing will also happen.

In all other cases, something a bit weird would happen, though.
If the input is between -x and x and you ask for [0, a], you would get [0, a/2] I think... well...
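A small sketch of these cases, assuming a pure-scaling variant that reuses the scale factor MinMaxScaler would compute (the exact numbers in the last case depend on that convention, but either way the requested range is not honoured):

    import numpy as np

    def scale_only(X, feature_range):
        # hypothetical with_offset=False behaviour: scale, never shift
        lo, hi = feature_range
        return X * (hi - lo) / (X.max() - X.min())

    print(scale_only(np.array([0., 2., 4.]), (0., 1.)))    # [0.  0.5 1. ]    as expected
    print(scale_only(np.array([-4., 0., 4.]), (-1., 1.)))  # [-1.  0.  1.]    as expected
    print(scale_only(np.array([-4., 0., 4.]), (0., 1.)))   # [-0.5 0.  0.5]   not in [0, 1]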

Contributor

temporaer commented May 5, 2013

Making centering optional sounds good to me; it is somewhat more intuitive than "injecting" artificial zeros. We'd still need to explain why the user would want to change the default behavior, and there the complexity would be unchanged. I have no preference here and I'm happy to implement whatever you API guys decide :-)

Owner

amueller commented May 6, 2013

Any opinion @larsmans ?

Owner

larsmans commented May 6, 2013

I think center="auto" with a default value of issparse(X) and an exception for the combination of True and dense arrays would be fine.

Owner

amueller commented May 6, 2013

I wouldn't call it 'center'. And you probably mean "sparse" + True is an exception, right?

Owner

larsmans commented May 20, 2013

It's time to get this merged. @amueller, what name do you suggest?

Owner

amueller commented May 21, 2013

I would have called it with_offset (similar to with_mean).
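A hypothetical sketch of how the "auto" default discussed above could resolve, following the "sparse + True is an exception" reading; the helper name is made up:

    import scipy.sparse as sp

    def resolve_with_offset(with_offset, X):
        # hypothetical resolution of an "auto" default: offset only for
        # dense input, and refuse offset + sparse (it would densify X)
        if with_offset == "auto":
            return not sp.issparse(X)
        if with_offset and sp.issparse(X):
            raise ValueError("with_offset=True is not supported for sparse X; "
                             "the additive term would densify the matrix.")
        return with_offset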

Contributor

temporaer commented Jun 30, 2013

Hope it's all right now...

Owner

ogrisel commented Jul 9, 2013

Could you please rebase on top of master (or merge master to your branch) and fix any conflict so as to make github happy?

Contributor

temporaer commented Jul 10, 2013

@ogrisel rebase done, travis happy ;-)

@ogrisel ogrisel commented on the diff Jul 10, 2013

doc/modules/preprocessing.rst
@@ -145,6 +145,30 @@ full formula is::
X_scaled = X_std / (max - min) + min
+The :class:`MinMaxScaler` can also be applied to sparse matrices in
+CSR or CSC format, *as long as the value of the ``min_`` attribute
+amounts to zero*.
+
+Finally, features do not have to be rescaled separately. Different
+feature ranges are interpreted as a relative "importance" by many
+classifiers (consider the k-Nearest Neighbor classifier, for example,
+where differences between features with a small range will influence
+distances, and thus classification, less). You may want to keep this
+relative importance information and still rescale your features to a
+sensible range. This can be done by setting ``per_feature``::
@ogrisel

ogrisel Jul 10, 2013

Owner

=> "This can be done by setting per_feature to False:"

@ogrisel ogrisel commented on the diff Jul 10, 2013

doc/modules/preprocessing.rst
+distances, and thus classification, less). You may want to keep this
+relative importance information and still rescale your features to a
+sensible range. This can be done by setting ``per_feature``::
+
+ >>> X_train = np.array([[ 1., -1., 2.],
+ ... [ 2., 0., 0.],
+ ... [ 0., 1., -1.]])
+ ...
+ >>> min_max_scaler = preprocessing.MinMaxScaler(per_feature=False)
+ >>> X_train_minmax = min_max_scaler.fit_transform(X_train)
+ >>> X_train_minmax
+ array([[ 0.66666667, 0. , 1. ],
+ [ 1. , 0.33333333, 0.33333333],
+ [ 0.33333333, 0.66666667, 0. ]])
+
+
@ogrisel

ogrisel Jul 10, 2013

Owner

You should also document the motivation for the with_offset parameter on this example. I am not sure I understand it right now.

@ogrisel ogrisel commented on the diff Jul 10, 2013

sklearn/utils/sparsefuncs.pyx
@@ -207,7 +322,7 @@ def csc_mean_variance_axis0(X):
cdef np.ndarray[DOUBLE, ndim=1] variances = np.zeros_like(means)
# counts[j] contains the number of samples where feature j is non-zero
- counts = np.zeros_like(means)
+ cdef np.ndarray[np.int64_t, ndim=1] counts = np.zeros_like(means, dtype=np.int64)

@ogrisel ogrisel commented on the diff Jul 10, 2013

sklearn/preprocessing.py
@@ -176,6 +177,15 @@ class MinMaxScaler(BaseEstimator, TransformerMixin):
Set to False to perform inplace row normalization and avoid a
copy (if the input is already a numpy array).
+ per_feature : boolean, optional, default is True
+ Set to False to normalize using the minimum/maximum of all
+ features, not per column
+
+ with_offset : boolean, optional, default is True for dense,
+ and False for sparse X.
+ If False, no additive normalization is used, only scaling.
@ogrisel

ogrisel Jul 10, 2013

Owner

Could you rephrase this to make it more explicit? What is "additive normalization"?

Owner

ogrisel commented Jul 10, 2013

Apart from the lack of documentation for the with_offset option, it looks good. I am not sure about the fact that the behavior is not the same for sparse matrices and dense arrays. A naive user would likely expect the default behavior to be the same (as long as min_ == 0).

Contributor

temporaer commented Jul 10, 2013

@ogrisel we discussed the with_offset behavior above. Summary: my first implementation kept the behavior the same for dense/sparse. The complicating issue is that sparse matrices include implicit zeros, which need to be taken into account. You could then argue that sparse matrices are /meant/ to have zeros, so zero should always be part of the input range. A sparse matrix which does not have zeros would then behave differently from its dense counterpart. The original implementation had a parameter assume_contains_zeros, with default True for sparse matrices. That was too complicated to document and non-intuitive for naive users. I think the solution as it is now (by default, only use scaling for sparse) is the lesser of the two evils.

Owner

ogrisel commented Jul 10, 2013

Still, better documentation (with an example) is required, because right now I am not sure what it actually does.

Owner

ogrisel commented Oct 21, 2013

@temporaer do you plan to finish the doc work on this PR or would you like someone else to take over the PR from there?

Contributor

temporaer commented Oct 21, 2013

@ogrisel when I tried last, it turned out to be more than just doc work. I'll try to produce something discussable in the next few days.

Owner

ogrisel commented Oct 21, 2013

Thanks!

Contributor

untom commented Oct 21, 2013

Hi there!

I'm working on something similar over in #2514, and I think we might end up duplicating effort (I was in fact already looking into merging your PR into mine) :) Maybe we should hash something out together before we both waste time?

Where are you currently stuck/having problems?

Owner

amueller commented Oct 22, 2013

Tonight's favoured (by me) solution to the semantic problems: if the data is sparse and the offset is non-zero, raise a ValueError. That means that on the off chance that a sparse array doesn't contain a zero (and it is scaled between, say, 0 and 1), it will bail.

Pro:

  • when it computes something, it is consistent with the dense case
  • It never makes a matrix dense
  • If it computes something, the result is very natural

Con:

  • it might fail during a GridSearch.

@larsmans @ogrisel wdyt?

The second best option: make MinMaxScaler densify if it has to and make an additional estimator that only scales data.
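A rough sketch of that first option, using a hypothetical helper (not PR code) that mirrors the dense computation and bails out when the offset would be non-zero for sparse input:

    import scipy.sparse as sp

    def fit_scale_and_offset(X, feature_range=(0, 1)):
        # hypothetical helper: same formulas as the dense case, but refuse
        # a non-zero offset when X is sparse
        data_min, data_max = X.min(), X.max()   # implicit zeros count for sparse X
        scale = (feature_range[1] - feature_range[0]) / (data_max - data_min)
        offset = feature_range[0] - data_min * scale
        if sp.issparse(X) and offset != 0:
            raise ValueError("Scaling this sparse matrix to the requested range "
                             "needs a non-zero offset, which would densify it.")
        return scale, offset

    X = sp.csr_matrix([[0., 1.], [2., 0.]])
    print(fit_scale_and_offset(X, (0, 1)))   # works: the offset comes out as zero
    # fit_scale_and_offset(X, (1, 2))        # would raise the ValueError above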

Owner

amueller commented Oct 22, 2013

OK, an even better (though deprecation-cycle-inducing) idea: remove MinMaxScaler, as no one needs that functionality anyway. Instead, introduce MaxAbsScaler, which scales the data to have a given maximum absolute value and never adds anything to the data.

Contributor

untom commented Oct 22, 2013

I personally was always wondering why the option to add something to the data existed; I figured someone must've had a use case for it. What I came up with last night was making the behavior dependent on the feature_range argument. On dense input, keep the current behavior, while on sparse input:

  • throw an exception if the range is [+a, +b] or [-a, -b] (i.e., it does not cross zero, hence would destroy sparsity)
  • throw an exception if the range is [0, +a] but the data contains values < 0 (i.e., it would destroy zeros in at least one feature); same if the range is [-a, 0] and the data contains values > 0
  • if the range is [-a, +b], scale all nonzero values (i.e., zeros will be preserved); this behavior is different from what the dense path does, but it should be obvious why, and it is easy to document/grasp for users (see the sketch after this list)
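A sketch of those checks as a standalone helper (hypothetical code, names made up, not from either PR):

    import scipy.sparse as sp

    def check_feature_range_sparse(X, feature_range):
        # hypothetical validation of feature_range for sparse input
        lo, hi = feature_range
        if lo > 0 or hi < 0:
            # e.g. [+a, +b] or [-a, -b]: zero is outside the target range,
            # so every implicit zero would have to be shifted -> densification
            raise ValueError("feature_range must contain 0 for sparse input")
        if lo == 0 and X.data.size and X.data.min() < 0:
            raise ValueError("negative values cannot be mapped into [0, %r] "
                             "without destroying zeros" % hi)
        if hi == 0 and X.data.size and X.data.max() > 0:
            raise ValueError("positive values cannot be mapped into [%r, 0] "
                             "without destroying zeros" % lo)
        # range [-a, +b]: pure scaling of the stored values preserves zeros

    X = sp.csr_matrix([[0., -1.], [2., 0.]])
    check_feature_range_sparse(X, (-1., 1.))   # passes: zero lies inside the range
    # check_feature_range_sparse(X, (0., 1.))  # would raise: the data contains -1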
Contributor

temporaer commented Oct 22, 2013

When I first implemented this, I had something quite like Andy's solution, with one addition: I had a parameter assume_data_contains_zeros, which was True by default for sparse X. If the empirical data range did not include zero, it would be enlarged to contain it. I thought it was intuitive, since sparse data should be allowed to contain zeros anyway.

Pro:

  • more stable than Andy's solution (less likely to fail during grid search)

Con:

  • behavior on dense and sparse, but otherwise identical, X may differ if assume_data_contains_zeros is left at its default (False for dense, True for sparse)
Owner

larsmans commented Oct 22, 2013

Different behavior on sparse and dense has turned out to be very confusing for users. I've been trying to get rid of that everywhere I can.

Owner

ogrisel commented Oct 22, 2013

The MaxAbsScaler idea is interesting. We can keep MinMaxScaler as it is so as not to break backward compat, but recommend MaxAbsScaler in the docs, as it can support dense and sparse data in a consistent way by default.
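For reference, a minimal usage example of this behaviour, using the MaxAbsScaler API as added in #4828 (referenced at the end of this thread):

    import scipy.sparse as sp
    from sklearn.preprocessing import MaxAbsScaler

    X = sp.csr_matrix([[1., -2., 0.],
                       [0.,  4., 3.]])

    X_scaled = MaxAbsScaler().fit_transform(X)
    print(X_scaled.toarray())
    # [[ 1.  -0.5  0. ]
    #  [ 0.   1.   1. ]]
    print(X_scaled.nnz == X.nnz)   # True: the sparsity structure is untouched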

Owner

amueller commented Oct 22, 2013

Does anyone have a usecase that needs the current functionality of MinMaxScaler? I think I coded it this way, but I have no idea why.

@untom untom referenced this pull request Jun 7, 2015

Merged

[MRG+2] add MaxAbsScaler #4828

Owner

amueller commented Jun 11, 2015

Superseded by #4828.

@amueller amueller closed this Jun 11, 2015
