[MRG] Added code for sklearn.preprocessing.RankScaler #2176

Open
wants to merge 7 commits

9 participants

@turian

I wrote code for rank-scaling. This scaling technique is more robust to outliers than StandardScaler (zero mean, unit variance).

I believe that "scale" is the wrong term for this operation; it's really feature "normalization". That name conflicts with the existing "normalize" method, though.

I wrote documentation and tests. However, I was unable to get the doc-suite or test-suite to build for the current sklearn HEAD, so I couldn't double-check all my documentation and tests.
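
To make the robustness claim above concrete, here is a minimal sketch (not the PR's code; the variable names are illustrative) comparing zero-mean/unit-variance scaling with empirical-percentile ranking on a feature with one extreme outlier:

import numpy as np

x = np.array([1., 2., 3., 4., 1000.])  # one extreme outlier

# Standard scaling: the outlier inflates the variance and ends up
# far from the rest, while the inliers are squashed near -0.5.
z = (x - x.mean()) / x.std()

# Rank scaling: each value maps to its empirical percentile, so the
# outlier can never land further than 1.0 from the rest.
order = np.sort(x)
ranks = (np.searchsorted(order, x, side='left')
         + np.searchsorted(order, x, side='right')) / (2.0 * len(x))

print(z)      # roughly [-0.50, -0.50, -0.50, -0.50, 2.00]
print(ranks)  # [0.1, 0.3, 0.5, 0.7, 0.9]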

@larsmans
Owner

We use normalization to refer to normalized (unit) sample vectors. Standardizer would maybe be more apt.

sklearn/preprocessing.py
((47 lines not shown))
+
+ fit will take time O(n_features * n_samples * log(n_samples)),
+ and use memory O(n_samples * n_features).
+
+ Parameters
+ ----------
+ X : array-like or CSR matrix with shape [n_samples, n_features]
+ The data used to compute feature ranks.
+ """
+ X = check_arrays(X, copy=self.copy, sparse_format="csr")[0]
+ if sp.issparse(X):
+ raise ValueError("Cannot rank-standardize sparse matrices.")
+ if X.ndim != 2:
+ raise ValueError("Rank-standardization only tested on 2-D matrices.")
+ else:
+ self.sort_X_ = np.sort(X, axis=0)
@larsmans Owner

Shouldn't this be an argsort if indices are supposed to come out? Also, do you need to store the full training set, or could a summary statistic over axis 0 suffice?

@turian
turian added a note

This shouldn't be an argsort. In fit, I simply sort the feature values. In transform, I use np.searchsorted (which is like Python bisect) to find the index that a particular feature value would be inserted at.

It is possible that you do not need to store the full training set. The # of different feature values that you store will determine the granularity of the final transformed feature values. e.g. if you store only 100 values, the resolution of the transformed values should be 0.01.

However, I am not sure of the best way to implement this in practice. Things get tricky when there is a large number of repeated values. One naive way would be to define a resolution parameter (e.g. 100) and compute the feature value for 0/resolution through resolution/resolution. This truncated table would be stored instead of sort_X_. Taking advantage of many repeated values would require larger changes to the code.
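
A rough numpy sketch of this truncated-table idea (the resolution name and helper functions are illustrative, not the PR's final API): store only `resolution` order statistics per feature at fit time, then searchsorted against that compressed table at transform time.

import numpy as np

def fit_compressed(X, resolution=100):
    # keep only `resolution` evenly spaced quantiles per feature,
    # instead of the full sorted column
    qs = np.linspace(0, 100, resolution)
    return np.percentile(X, qs, axis=0)   # shape (resolution, n_features)

def transform_compressed(table, X):
    out = np.empty(X.shape, dtype=float)
    for j in range(X.shape[1]):
        lo = np.searchsorted(table[:, j], X[:, j], side='left')
        hi = np.searchsorted(table[:, j], X[:, j], side='right')
        out[:, j] = (lo + hi) / (2.0 * table.shape[0])
    return out

With resolution=100 the transformed values land on a grid of roughly 0.01, as described above; repeated values still need the left/right averaging to be handled sensibly.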

@turian
turian added a note

I have pushed a change that implements the summary statistic.

sklearn/preprocessing.py
((63 lines not shown))
+ return self
+
+ def transform(self, X):
+ """Perform rank-standardization.
+
+ transform will take O(n_samples * n_features * log(n_fit_samples)),
+ where `n_fit_samples` is the number of samples used during `fit`.
+
+ Parameters
+ ----------
+ X : array-like with shape [n_samples, n_features]
+ The data used to scale along the features axis.
+ """
+# copy = copy if copy is not None else self.copy
+# X = check_arrays(X, copy=copy, sparse_format="csr")[0]
+ X = check_arrays(X, copy=self.copy, sparse_format="csr")[0]
@agramfort Owner

why csr if below you say that you don't accept sparse data?

@turian
turian added a note

To be honest, this was based upon my lack of familiarity with sklearn type-checking.

My code is known to work ONLY on 2-d dense numpy arrays.

I appreciate any advice you have about how to improve the type-checking.

@turian
turian added a note

I have updated the type-checking to use array2d.

sklearn/preprocessing.py
((72 lines not shown))
+ ----------
+ X : array-like with shape [n_samples, n_features]
+ The data used to scale along the features axis.
+ """
+# copy = copy if copy is not None else self.copy
+# X = check_arrays(X, copy=copy, sparse_format="csr")[0]
+ X = check_arrays(X, copy=self.copy, sparse_format="csr")[0]
+ if sp.issparse(X):
+ raise ValueError("Cannot rank-standardize sparse matrices.")
+ if X.ndim != 2:
+ raise ValueError("Rank-standardization only tested on 2-D matrices.")
+ else:
+ warn_if_not_float(X, estimator=self)
+ newX = []
+ for j in range(X.shape[1]):
+ newX.append(1. * (np.searchsorted(self.sort_X_[:,j], X[:,j], side='left') + np.searchsorted(self.sort_X_[:,j], X[:,j], side='right')) / (2 * self.sort_X_.shape[0]))
@agramfort Owner

ppeeeppppp88888 :)

@turian
turian added a note

which pep?

@turian
turian added a note

ah, pep8. what in particular do you think my code violates?

@GaelVaroquaux Owner
@ogrisel Owner
ogrisel added a note

In general just run the pep8 linter from https://pypi.python.org/pypi/pep8 and fix the violations.

@turian
turian added a note

I have done this and pushed the changes.

@GaelVaroquaux

Travis build failed on this branch:
https://travis-ci.org/scikit-learn/scikit-learn/builds/9334424
That's improper handling of sparse matrices (you need to fix the sparse_format="csr" argument)

@ogrisel
Owner

Storing a (sorted) copy of the original data in memory seems wasteful to me. Wouldn't it make more sense to compute percentile bin boundaries at fit time, store only those boundary values as an attribute on the scaler object, and reuse them at transform time to do the actual scaling (with linear interpolation)?

Also we should find a way to not call searchsorted twice if possible.
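
A sketch of what that could look like (illustrative names; np.interp does the boundary lookup and the linear interpolation in a single pass, so the two searchsorted calls go away, assuming the fitted boundaries are strictly increasing — ties would need extra care):

import numpy as np

def fit_boundaries(X, n_bins=1000):
    # percentile bin boundaries, the only thing stored on the scaler
    qs = np.linspace(0, 100, n_bins + 1)
    return np.percentile(X, qs, axis=0)    # shape (n_bins + 1, n_features)

def transform_interp(boundaries, X):
    grid = np.linspace(0.0, 1.0, boundaries.shape[0])
    out = np.empty(X.shape, dtype=float)
    for j in range(X.shape[1]):
        # values below/above the fitted range clip to 0/1
        out[:, j] = np.interp(X[:, j], boundaries[:, j], grid)
    return out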

@turian

Points taken.

Could someone point me to documentation on how to do type-checking correctly?

I will think over how I can do the approximate fit correctly.

@ogrisel
Owner

Could someone point me to documentation on how to do type-checking correctly?

Unfortunately I don't think there is a good doc for this and the current code base is not very consistent. As you don't support sparse matrices (at least not in the current state of this PR) you should probably use X = array2d(X) that comes from sklearn.utils.array2d.
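
For reference, the dense-only validation pattern being suggested looks roughly like this (a sketch against the scikit-learn version under discussion; later releases use different validation helpers):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import array2d  # validation helper referenced above

class RankScaler(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        X = array2d(X)  # coerce to a dense 2-d ndarray (sparse input raises)
        self.sort_X_ = np.sort(X, axis=0)
        return self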

@turian
  1. I have fixed PEP8 violations.
  2. I have improved type-checking, as suggested by @ogrisel.
  3. I have written an approximate transform that is less memory-intensive (suggested by @ogrisel and @larsmans). By default, the resolution is 1000. I have also implemented tests to make sure that the approximation does not differ too much from the exact version.
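
The approximation check in point 3 can be sketched like this (it assumes the RankScaler and resolution names from this PR, which are not in a released scikit-learn):

import numpy as np
from sklearn.preprocessing import RankScaler  # only on this PR's branch

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)

exact = RankScaler(resolution=None).fit(X)
approx = RankScaler(resolution=100).fit(X)

X_new = rng.randn(500, 10)
diff = np.abs(exact.transform(X_new) - approx.transform(X_new))

# with 100 stored ranks the approximate output should stay within
# one rank step (0.01) of the exact output
assert np.all(diff < 1.0 / 100)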
sklearn/preprocessing.py
((8 lines not shown))
+ the percentile of the feature value.
+ A feature value that is smaller than observed during fitting
+ will scale to 0.
+ A feature value that is larger than observed during fitting
+ will scale to 1.
+ A feature value that is the median will scale to 0.5.
+
+ Standardization of a dataset is a common requirement for many
+ machine learning estimators. Rank-scaling is useful when
+ estimators perform badly on StandardScalar features. Rank-scaling
+ is more robust than StandardScaler, because outliers can't have
+ large values post scaling. It is an empirical question whether
+ you want outliers to be given high importance (StandardScaler)
+ or not (RankScaler).
+
+ TODO: min and max parameters?
@agramfort Owner

maybe you can keep it for later.

to keep a note add a commented line in the code e.g.

# XXX add min max parameters

@turian
turian added a note

As requested, I have added a commented line in the code. Will push soon.

sklearn/preprocessing.py
((28 lines not shown))
+ The number of different ranks possible.
+ i.e. The number of indices in the compressed ranking matrix
+ `sort_X_`.
+ This is an approximation, to save memory and transform
+ computation time.
+ e.g. if 1000, transformed values will have resolution 0.001.
+ If `None`, we store the full size matrix, comparable
+ in size to the initial fit `X`.
+
+ Attributes
+ ----------
+ `sort_X_` : array of ints with shape [n_samples, n_features]
+ The rank-index of every feature in the fit X.
+
+ `resolution_` : int
+ Desired resolution (number of ranks).
@agramfort Owner

is resolution_ different from the resolution init param, as in being data dependent?

@turian
turian added a note

No.

self.resolution_ = resolution in __init__

@turian
turian added a note

Oh, I understand. I have fixed and will push.

sklearn/preprocessing.py
((40 lines not shown))
+ The rank-index of every feature in the fit X.
+
+ `resolution_` : int
+ Desired resolution (number of ranks).
+
+ See also
+ --------
+ :class:`sklearn.preprocessing.StandardScaler` to perform standardization
+ that is faster, but less robust to outliers.
+ """
+
+ def __init__(self, resolution):
+ """
+ TODO: Add min and max parameters? Default = [0, 1]
+ """
+ self.copy = True # We don't have self.copy=False implemented
@agramfort Owner

either expose copy in init or remove it.

@turian
turian added a note

Removed it. Will push soon.

sklearn/preprocessing.py
((42 lines not shown))
+ `resolution_` : int
+ Desired resolution (number of ranks).
+
+ See also
+ --------
+ :class:`sklearn.preprocessing.StandardScaler` to perform standardization
+ that is faster, but less robust to outliers.
+ """
+
+ def __init__(self, resolution):
+ """
+ TODO: Add min and max parameters? Default = [0, 1]
+ """
+ self.copy = True # We don't have self.copy=False implemented
+ self.resolution_ = resolution
+ pass
@agramfort Owner

you don't need the pass

@turian
turian added a note

Removed it. Will push soon.

@turian
turian added a note

Actually I think I need it again. You'll see when I push.

sklearn/preprocessing.py
((54 lines not shown))
+ """
+ self.copy = True # We don't have self.copy=False implemented
+ self.resolution_ = resolution
+ pass
+
+ def fit(self, X, y=None):
+ """Compute the feature ranks for later scaling.
+
+ fit will take time O(n_features * n_samples * log(n_samples)),
+ because it must sort the entire matrix.
+
+ It uses memory O(n_features * resolution).
+
+ Parameters
+ ----------
+ X : array-like matrix with shape [n_samples, n_features]
@agramfort Owner

X : array-like, shape (n_samples, n_features)

@turian
turian added a note

Changed. Will push.

sklearn/preprocessing.py
((59 lines not shown))
+ def fit(self, X, y=None):
+ """Compute the feature ranks for later scaling.
+
+ fit will take time O(n_features * n_samples * log(n_samples)),
+ because it must sort the entire matrix.
+
+ It uses memory O(n_features * resolution).
+
+ Parameters
+ ----------
+ X : array-like matrix with shape [n_samples, n_features]
+ The data used to compute feature ranks.
+ """
+ X = array2d(X)
+ full_sort_X_ = np.sort(X, axis=0)
+ if not self.resolution_ or self.resolution_ >= X.shape[0]:
@agramfort Owner

you cannot use a data-dependent attribute self.resolution_ before entering fit. Such an attribute should be set by the fit method.

@agramfort Owner

it seems you don't need the self.resolution_ but just the self.resolution

remove the self.resolution_ attribute and just rely on self.resolution

@agramfort Owner

so

n_samples, n_features = X.shape
if self.resolution is None or self.resolution >= n_samples:

@turian
turian added a note
  1. MinMaxScaler has feature_range in __init__. StandardScaler has with_mean=True, with_std=True in __init__. So I believe RankScaler should have resolution in __init__.
  2. There is no self.resolution, only a self.resolution_.
  3. Great suggestion. Done. Will push.
sklearn/preprocessing.py
((108 lines not shown))
+ ----------
+ X : array-like with shape [n_samples, n_features]
+ The data used to scale along the features axis.
+ """
+ X = array2d(X)
+ warn_if_not_float(X, estimator=self)
+ newX = []
+ for j in range(X.shape[1]):
+ lidx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='left')
+ ridx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='right')
+ newX.append(1. * (lidx + ridx) / (2 * self.sort_X_.shape[0]))
+ X = np.vstack(newX).T
+ return X
+
+# def inverse_transform(self, X, copy=None):
+## Not implemented
@agramfort Owner

remove these comments before a merge.

@turian
turian added a note

Is it okay to leave them? To suggest that we would want to implement this method later?

sklearn/tests/test_preprocessing.py
((8 lines not shown))
+
+ rank_scaler = RankScaler()
+ rank_scaler.fit(X)
+ X_scaled = rank_scaler.transform(X)
+ assert_array_almost_equal(X_scaled, [[ 0.125, 0.25 , 0.25 , 0.125, 0.625],
+ [ 0.375, 0.625, 0.875, 0.5 , 0.625],
+ [ 0.75 , 0.875, 0.625, 0.5 , 0.125],
+ [ 0.75 , 0.25 , 0.25 , 0.875, 0.625]])
+
+ X2 = np.array([[0, 1.5, 0, 5, 10]])
+ X2_scaled = rank_scaler.transform(X2)
+ assert_array_almost_equal(X2_scaled, [[ 0. , 0.75, 0.25, 1. , 1. ]])
+
+
+ # Check RankScaler at different resolution
+ for ninstances in [10, 100, 1000]:
@agramfort Owner

n_samples

@turian
turian added a note

Done. Good suggestion. Will push.

sklearn/tests/test_preprocessing.py
((22 lines not shown))
+ # Check RankScaler at different resolution
+ for ninstances in [10, 100, 1000]:
+ for resolution in [ninstances+1, ninstances, ninstances-1, \
+ int(ninstances/2), int(ninstances/7), int(ninstances/10)]:
+ X = rng.randn(ninstances, 100)
+ rank_scaler1 = RankScaler(resolution=None)
+ rank_scaler2 = RankScaler(resolution=resolution)
+ rank_scaler1.fit(X)
+ rank_scaler2.fit(X)
+
+ X2 = rng.randn(1000, 100)
+ X21 = rank_scaler1.transform(X2)
+ X22 = rank_scaler2.transform(X2)
+
+ # In the approximate version X22, all values must
+ # be within 1/resolution of the exact value X21.
@agramfort Owner

indent

@turian
turian added a note

I fixed the indent in line 152. Will push.

sklearn/preprocessing.py
((24 lines not shown))
+
+ Parameters
+ ----------
+ resolution : int, 1000 by default
+ The number of different ranks possible.
+ i.e. The number of indices in the compressed ranking matrix
+ `sort_X_`.
+ This is an approximation, to save memory and transform
+ computation time.
+ e.g. if 1000, transformed values will have resolution 0.001.
+ If `None`, we store the full size matrix, comparable
+ in size to the initial fit `X`.
+
+ Attributes
+ ----------
+ `sort_X_` : array of ints with shape [n_samples, n_features]
@arjoly Owner
arjoly added a note

Maybe X_rank_index_ or X_rank_ is more explicit?

@turian
turian added a note

I like n_ranks, like you suggested below.

@agramfort Owner

same as below. Doc should be

sort_X_ : array of ints, shape (n_samples, n_features)

sklearn/preprocessing.py
((12 lines not shown))
+ will scale to 1.
+ A feature value that is the median will scale to 0.5.
+
+ Standardization of a dataset is a common requirement for many
+ machine learning estimators. Rank-scaling is useful when
+ estimators perform badly on StandardScalar features. Rank-scaling
+ is more robust than StandardScaler, because outliers can't have
+ large values post scaling. It is an empirical question whether
+ you want outliers to be given high importance (StandardScaler)
+ or not (RankScaler).
+
+ TODO: min and max parameters?
+
+ Parameters
+ ----------
+ resolution : int, 1000 by default
@arjoly Owner
arjoly added a note

Maybe n_ranks would be more explicit?

@turian
turian added a note

Great. Good suggestion. Will push.

@arjoly arjoly commented on the diff
sklearn/preprocessing.py
((33 lines not shown))
+ e.g. if 1000, transformed values will have resolution 0.001.
+ If `None`, we store the full size matrix, comparable
+ in size to the initial fit `X`.
+
+ Attributes
+ ----------
+ `sort_X_` : array of ints with shape [n_samples, n_features]
+ The rank-index of every feature in the fit X.
+
+ `resolution_` : int
+ Desired resolution (number of ranks).
+
+ See also
+ --------
+ :class:`sklearn.preprocessing.StandardScaler` to perform standardization
+ that is faster, but less robust to outliers.
@arjoly Owner
arjoly added a note

It's pretty cool :-)

Do you think that you can share this knowledge with an example?

@turian
turian added a note

What did you have in mind specifically?

I used this for a project, not sure if I can share the data. It was social network data.
There were a lot of outliers, so after using StandardScaler, some features still had values around +90!

This method was originally suggested to me by Yoshua Bengio. He spoke about it like it was a well-known thing, but I haven't really seen anyone else using it. Maybe it's one of those tricks that a bunch of neural networks people know, but no-one else.

sklearn/preprocessing.py
((41 lines not shown))
+
+ `resolution_` : int
+ Desired resolution (number of ranks).
+
+ See also
+ --------
+ :class:`sklearn.preprocessing.StandardScaler` to perform standardization
+ that is faster, but less robust to outliers.
+ """
+
+ def __init__(self, resolution):
+ """
+ TODO: Add min and max parameters? Default = [0, 1]
+ """
+ self.copy = True # We don't have self.copy=False implemented
+ self.resolution_ = resolution
@arjoly Owner
arjoly added a note

If you don't change the parameter value, store it in an attribute with the same name:

self.argument_name = argument_name

If there is further processing, for instance if you allow a string argument that is later turned into something more concrete, do it in the fit method:

self.argument_name_ = XXX  # where XXX is the result of some computation
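
In other words, the convention looks like this (a generic sketch, not RankScaler-specific):

from sklearn.base import BaseEstimator

class ExampleEstimator(BaseEstimator):
    def __init__(self, resolution=1000):
        # __init__ only stores constructor arguments, unchanged and
        # under the same name, so get_params/set_params keep working
        self.resolution = resolution

    def fit(self, X, y=None):
        # anything derived from the data or from processing an argument
        # gets a trailing underscore and is set here, in fit
        self.resolution_ = min(self.resolution, X.shape[0])
        return self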
@turian
turian added a note

Gotcha! Very cool. Will push.

sklearn/preprocessing.py
((58 lines not shown))
+
+ def fit(self, X, y=None):
+ """Compute the feature ranks for later scaling.
+
+ fit will take time O(n_features * n_samples * log(n_samples)),
+ because it must sort the entire matrix.
+
+ It uses memory O(n_features * resolution).
+
+ Parameters
+ ----------
+ X : array-like matrix with shape [n_samples, n_features]
+ The data used to compute feature ranks.
+ """
+ X = array2d(X)
+ full_sort_X_ = np.sort(X, axis=0)
@arjoly Owner
arjoly added a note

Maybe X_sorted?

@turian
turian added a note

I want to keep the name similar to self.sort_X_, so I prefer full_sort_X_

sklearn/tests/test_preprocessing.py
((9 lines not shown))
+ rank_scaler = RankScaler()
+ rank_scaler.fit(X)
+ X_scaled = rank_scaler.transform(X)
+ assert_array_almost_equal(X_scaled, [[ 0.125, 0.25 , 0.25 , 0.125, 0.625],
+ [ 0.375, 0.625, 0.875, 0.5 , 0.625],
+ [ 0.75 , 0.875, 0.625, 0.5 , 0.125],
+ [ 0.75 , 0.25 , 0.25 , 0.875, 0.625]])
+
+ X2 = np.array([[0, 1.5, 0, 5, 10]])
+ X2_scaled = rank_scaler.transform(X2)
+ assert_array_almost_equal(X2_scaled, [[ 0. , 0.75, 0.25, 1. , 1. ]])
+
+
+ # Check RankScaler at different resolution
+ for ninstances in [10, 100, 1000]:
+ for resolution in [ninstances+1, ninstances, ninstances-1, \
@ogrisel Owner
ogrisel added a note

You don't need the trailing \ since the line continuation is implicit inside the brackets.

@turian
turian added a note

Thanks. Will push.

sklearn/tests/test_preprocessing.py
((10 lines not shown))
+ rank_scaler.fit(X)
+ X_scaled = rank_scaler.transform(X)
+ assert_array_almost_equal(X_scaled, [[ 0.125, 0.25 , 0.25 , 0.125, 0.625],
+ [ 0.375, 0.625, 0.875, 0.5 , 0.625],
+ [ 0.75 , 0.875, 0.625, 0.5 , 0.125],
+ [ 0.75 , 0.25 , 0.25 , 0.875, 0.625]])
+
+ X2 = np.array([[0, 1.5, 0, 5, 10]])
+ X2_scaled = rank_scaler.transform(X2)
+ assert_array_almost_equal(X2_scaled, [[ 0. , 0.75, 0.25, 1. , 1. ]])
+
+
+ # Check RankScaler at different resolution
+ for ninstances in [10, 100, 1000]:
+ for resolution in [ninstances+1, ninstances, ninstances-1, \
+ int(ninstances/2), int(ninstances/7), int(ninstances/10)]:
@ogrisel Owner
ogrisel added a note

Please run pep8 ( https://pypi.python.org/pypi/pep8 ) to detect and fix the missing operator whitespaces.

@turian
turian added a note

Done. Will push.

@amueller
Owner

Sorry, we are in feature freeze already and decided not to merge anything more. It will probably be merged shortly after the release.

@turian

@amueller How long till the next release? i.e. how long until one can pip install this feature?

@turian

I have pushed changes, reflecting all above discussion.

sklearn/preprocessing.py
((86 lines not shown))
+ assert wlo >= 0 and wlo <= 1
+ assert whi >= 0 and whi <= 1
+ assert_almost_equal(wlo+whi, 1.)
+ self.sort_X_[i, j] = wlo * full_sort_X_[ioriglo, j] \
+ + whi * full_sort_X_[iorighi, j]
+ return self
+
+ def transform(self, X):
+ """Perform rank-standardization.
+
+ transform will take O(n_features * n_samples * log(n_ranks)),
+ where `n_ranks` is the number of ranks stored during `fit`.
+
+ Parameters
+ ----------
+ X : array-like with shape [n_samples, n_features]
@agramfort Owner

same here

X : array-like, shape (n_samples, n_features)

@turian
turian added a note

Fixed, will push.

sklearn/preprocessing.py
((91 lines not shown))
+ return self
+
+ def transform(self, X):
+ """Perform rank-standardization.
+
+ transform will take O(n_features * n_samples * log(n_ranks)),
+ where `n_ranks` is the number of ranks stored during `fit`.
+
+ Parameters
+ ----------
+ X : array-like with shape [n_samples, n_features]
+ The data used to scale along the features axis.
+ """
+ X = array2d(X)
+ warn_if_not_float(X, estimator=self)
+ newX = []
@agramfort Owner

can't you preallocate newX to avoid the list-to-array conversion?
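
Something along these lines would do it (a sketch of the suggestion, reusing the attributes already defined in this PR's transform):

X2 = np.empty(X.shape, dtype=np.float64)
for j in range(X.shape[1]):
    lidx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='left')
    ridx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='right')
    X2[:, j] = (lidx + ridx) / (2.0 * self.sort_X_.shape[0])
return X2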

@turian
turian added a note

Good idea. Will push.

sklearn/preprocessing.py
((101 lines not shown))
+ X : array-like with shape [n_samples, n_features]
+ The data used to scale along the features axis.
+ """
+ X = array2d(X)
+ warn_if_not_float(X, estimator=self)
+ newX = []
+ for j in range(X.shape[1]):
+ lidx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='left')
+ ridx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='right')
+ newX.append(1. * (lidx + ridx) / (2 * self.sort_X_.shape[0]))
+ X = np.vstack(newX).T
+ return X
+
+# def inverse_transform(self, X):
+## Not implemented, but I believe we could reuse the approximation
+## code in `fit`.
@agramfort Owner

don't leave these lines but add a note with

# XXX TODO : add inverse transform method
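
For what it's worth, an inverse_transform along those lines might look roughly like this (purely a sketch, not part of the PR; it maps percentiles back to feature values by interpolating the stored sort_X_ table):

def inverse_transform(self, X):
    X = array2d(X)
    n_ranks = self.sort_X_.shape[0]
    # percentile position of each stored rank, matching transform's output
    grid = (np.arange(n_ranks) + 0.5) / n_ranks
    X_orig = np.empty(X.shape, dtype=np.float64)
    for j in range(X.shape[1]):
        X_orig[:, j] = np.interp(X[:, j], grid, self.sort_X_[:, j])
    return X_orig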

@agramfort agramfort commented on the diff
sklearn/tests/test_preprocessing.py
@@ -63,6 +64,9 @@ def test_scaler_1d():
assert_array_almost_equal(X_scaled.mean(axis=0), 0.0)
assert_array_almost_equal(X_scaled.std(axis=0), 1.0)
+# rank_scaler = RankScaler()
+# X_rank_scaled = rank_scaler.fit(X).transform(X)
@agramfort Owner

can you remove these lines?

@turian
turian added a note

Done.

sklearn/tests/test_preprocessing.py
((10 lines not shown))
+ rank_scaler.fit(X)
+ X_scaled = rank_scaler.transform(X)
+ assert_array_almost_equal(X_scaled, [[0.125, 0.25, 0.25, 0.125, 0.625],
+ [0.375, 0.625, 0.875, 0.5, 0.625],
+ [0.75, 0.875, 0.625, 0.5, 0.125],
+ [0.75, 0.25, 0.25, 0.875, 0.625]])
+
+ X2 = np.array([[0, 1.5, 0, 5, 10]])
+ X2_scaled = rank_scaler.transform(X2)
+ assert_array_almost_equal(X2_scaled, [[0., 0.75, 0.25, 1., 1.]])
+
+ # Check RankScaler at different n_ranks
+ n_features = 100
+ for n_samples in [10, 100, 1000]:
+ for n_ranks in [n_samples+1, n_samples, n_samples-1,
+ int(n_samples/2), int(n_samples/7), int(n_samples/10)]:
@agramfort Owner

you still have pep8 violations in this file. Please run autopep8 on it. Thanks.

@turian
turian added a note

Done, will push.

@agramfort
Owner

We'll need this new estimator to be added to classes.rst, plus a paragraph in the documentation and an example that illustrates its usage and benefit on one of the datasets we ship or a sampled dataset. I don't think it's possible to have this merged before the end of the day for the release. It will be pip installable in the next release, I guess.

@GaelVaroquaux
@turian

We'll need this new estimator to be added to classes.rst, plus a paragraph in the documentation and an example that illustrates its usage and benefit on one of the datasets we ship or a sampled dataset.

@agramfort This is a more serious request. I don't know what datasets this will be better on. It's an empirical question, and it is hard to throw our weight behind recommending it, without more testing. All I know is that it was better on a private dataset of mine, where there are many outliers.

I would prefer that we ask people on the mailing list to try the method, and see if we can get a sense of how strongly to recommend this technique.

@arjoly
Owner

To start, you can try it on an existing example with outliers.

@turian

@arjoly Do you have a suggestion about a particular dataset?

@GaelVaroquaux
@vene
Owner
@turian

Let's say I run an SVM or SVR on a dataset with outliers, like housing.

If RankScaler has a better validation measure than StandardScaler, is that useful?
Perhaps it is just because the hyperparameters haven't been tuned.

@larsmans
Owner

You could make a script that grid searches (or random searches) a small set of hyperparams for both.
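
A rough version of such a script (illustrative: it assumes the RankScaler from this branch and a small SVR grid on the boston housing data; nothing here is tuned):

from sklearn.datasets import load_boston
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.grid_search import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RankScaler  # only on this PR's branch

boston = load_boston()
X, y = boston.data, boston.target

param_grid = {'svr__C': [0.1, 1, 10, 100],
              'svr__gamma': [1e-3, 1e-2, 1e-1]}

for scaler in [StandardScaler(), RankScaler()]:
    # same search for both scalers, so the comparison is not just
    # an artifact of untuned hyperparameters
    pipe = Pipeline([('scale', scaler), ('svr', SVR())])
    search = GridSearchCV(pipe, param_grid, cv=5)
    search.fit(X, y)
    print(scaler.__class__.__name__, search.best_score_, search.best_params_)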

@jnothman jnothman commented on the diff
sklearn/preprocessing.py
@@ -399,6 +404,120 @@ def __init__(self, copy=True, with_mean=True, with_std=True):
super(Scaler, self).__init__(copy, with_mean, with_std)
+class RankScaler(BaseEstimator, TransformerMixin):
+ """Rank-standardize features to a percentile, in the range [0, 1].
+
+ Rank-scaling happens independently on each feature, by determining
+ the percentile of the feature value.
+ A feature value that is smaller than observed during fitting
+ will scale to 0.
+ A feature value that is larger than observed during fitting
+ will scale to 1.
+ A feature value that is the median will scale to 0.5.
+
+ Standardization of a dataset is a common requirement for many
+ machine learning estimators. Rank-scaling is useful when
+ estimators perform badly on StandardScalar features. Rank-scaling
@jnothman Owner
jnothman added a note

StandardScalar should be StandardScaler

Commits on Jul 21, 2013
  1. @turian
Commits on Jul 27, 2013
  1. @turian
  2. @turian
  3. @turian

    PEP8 fixes

    turian authored
Commits on Jul 28, 2013
  1. @turian
Commits on Jul 29, 2013
  1. @turian

    Cosmetic changes

    turian authored
  2. @turian

    autopep8

    turian authored
Showing with 159 additions and 0 deletions.
  1. +119 −0 sklearn/preprocessing.py
  2. +40 −0 sklearn/tests/test_preprocessing.py
sklearn/preprocessing.py
@@ -10,6 +10,8 @@
import numpy as np
import scipy.sparse as sp
+from numpy.testing import assert_almost_equal
+
from .base import BaseEstimator, TransformerMixin
from .externals.six import string_types
from .utils import check_arrays
@@ -294,6 +296,9 @@ class StandardScaler(BaseEstimator, TransformerMixin):
:func:`sklearn.preprocessing.scale` to perform centering and
scaling without using the ``Transformer`` object oriented API
+ :class:`sklearn.preprocessing.RankScaler` to perform standardization
+ that is more robust to outliers, but slower and more memory-intensive.
+
:class:`sklearn.decomposition.RandomizedPCA` with `whiten=True`
to further remove the linear correlation across features.
"""
@@ -399,6 +404,120 @@ def __init__(self, copy=True, with_mean=True, with_std=True):
super(Scaler, self).__init__(copy, with_mean, with_std)
+class RankScaler(BaseEstimator, TransformerMixin):
+ """Rank-standardize features to a percentile, in the range [0, 1].
+
+ Rank-scaling happens independently on each feature, by determining
+ the percentile of the feature value.
+ A feature value that is smaller than observed during fitting
+ will scale to 0.
+ A feature value that is larger than observed during fitting
+ will scale to 1.
+ A feature value that is the median will scale to 0.5.
+
+ Standardization of a dataset is a common requirement for many
+ machine learning estimators. Rank-scaling is useful when
+ estimators perform badly on StandardScalar features. Rank-scaling
+ is more robust than StandardScaler, because outliers can't have
+ large values post scaling. It is an empirical question whether
+ you want outliers to be given high importance (StandardScaler)
+ or not (RankScaler).
+
+ Parameters
+ ----------
+ n_ranks : int, 1000 by default
+ The number of different ranks possible.
+ i.e. The number of indices in the compressed ranking matrix
+ `sort_X_`.
+ This is an approximation, to save memory and transform
+ computation time.
+ e.g. if 1000, transformed values will have resolution 0.001.
+ If `None`, we store the full size matrix, comparable
+ in size to the initial fit `X`.
+
+ Attributes
+ ----------
+ `sort_X_` : array of ints, shape (n_samples, n_features)
+ The rank-index of every feature in the fit X.
+
+ See also
+ --------
+ :class:`sklearn.preprocessing.StandardScaler` to perform standardization
+ that is faster, but less robust to outliers.
+ """
+
+ def __init__(self, n_ranks=1000):
+ # TODO: Add min and max parameters? Default = [0, 1]
+ self.n_ranks = n_ranks
+
+ def fit(self, X, y=None):
+ """Compute the feature ranks for later scaling.
+
+ fit will take time O(n_features * n_samples * log(n_samples)),
+ because it must sort the entire matrix.
+
+ It uses memory O(n_features * n_ranks).
+
+ Parameters
+ ----------
+ X : array-like, shape (n_samples, n_features)
+ The data used to compute feature ranks.
+ """
+ X = array2d(X)
+ n_samples, n_features = X.shape
+ full_sort_X_ = np.sort(X, axis=0)
+ if not self.n_ranks or self.n_ranks >= n_samples:
+ # Store the full matrix
+ self.sort_X_ = full_sort_X_
+ else:
+ # Approximate the stored sort_X_
+ self.sort_X_ = np.zeros((self.n_ranks, n_features))
+ for i in range(self.n_ranks):
+ for j in range(n_features):
+ # Find the corresponding i in the original ranking
+ iorig = i * 1. * n_samples / self.n_ranks
+ ioriglo = int(iorig)
+ iorighi = ioriglo + 1
+
+ if ioriglo == n_samples:
+ self.sort_X_[i, j] = full_sort_X_[ioriglo, j]
+ else:
+ # And use linear interpolation to combine the
+ # original values.
+ wlo = (1 - (iorig - ioriglo))
+ whi = (1 - (iorighi - iorig))
+ assert wlo >= 0 and wlo <= 1
+ assert whi >= 0 and whi <= 1
+ assert_almost_equal(wlo+whi, 1.)
+ self.sort_X_[i, j] = wlo * full_sort_X_[ioriglo, j] \
+ + whi * full_sort_X_[iorighi, j]
+ return self
+
+ def transform(self, X):
+ """Perform rank-standardization.
+
+ transform will take O(n_features * n_samples * log(n_ranks)),
+ where `n_ranks` is the number of ranks stored during `fit`.
+
+ Parameters
+ ----------
+ X : array-like, shape (n_samples, n_features)
+ The data used to scale along the features axis.
+ """
+ X = array2d(X)
+ warn_if_not_float(X, estimator=self)
+ # TODO: Can add a copy parameter, and simply overwrite X if copy=False
+ X2 = np.zeros(X.shape)
+ for j in range(X.shape[1]):
+ lidx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='left')
+ ridx = np.searchsorted(self.sort_X_[:, j], X[:, j], side='right')
+ v = 1. * (lidx + ridx) / (2 * self.sort_X_.shape[0])
+ X2[:, j] = v
+ return X2
+
+ # TODO : Add inverse_transform method.
+ # I believe we could reuse the approximation code in `fit`.
+
def normalize(X, norm='l2', axis=1, copy=True):
"""Normalize a dataset along any axis
sklearn/tests/test_preprocessing.py
@@ -23,6 +23,7 @@
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import scale
from sklearn.preprocessing import MinMaxScaler
+from sklearn.preprocessing import RankScaler
from sklearn.preprocessing import add_dummy_feature
from sklearn import datasets
@@ -63,6 +64,9 @@ def test_scaler_1d():
assert_array_almost_equal(X_scaled.mean(axis=0), 0.0)
assert_array_almost_equal(X_scaled.std(axis=0), 1.0)
+# rank_scaler = RankScaler()
+# X_rank_scaled = rank_scaler.fit(X).transform(X)
+
def test_scaler_2d_arrays():
"""Test scaling of 2d array along first axis"""
@@ -112,6 +116,42 @@ def test_scaler_2d_arrays():
# Check that X has not been copied
assert_true(X_scaled is not X)
+ X = np.array([[1, 0, 0, 0, 1],
+ [2, 1, 4, 1, 1],
+ [3, 2, 3, 1, 0],
+ [3, 0, 0, 4, 1]])
+
+ rank_scaler = RankScaler()
+ rank_scaler.fit(X)
+ X_scaled = rank_scaler.transform(X)
+ assert_array_almost_equal(X_scaled, [[0.125, 0.25, 0.25, 0.125, 0.625],
+ [0.375, 0.625, 0.875, 0.5, 0.625],
+ [0.75, 0.875, 0.625, 0.5, 0.125],
+ [0.75, 0.25, 0.25, 0.875, 0.625]])
+
+ X2 = np.array([[0, 1.5, 0, 5, 10]])
+ X2_scaled = rank_scaler.transform(X2)
+ assert_array_almost_equal(X2_scaled, [[0., 0.75, 0.25, 1., 1.]])
+
+ # Check RankScaler at different n_ranks
+ n_features = 100
+ for n_samples in [10, 100, 1000]:
+ for n_ranks in [n_samples + 1, n_samples, n_samples - 1,
+ int(n_samples / 2), int(n_samples / 7), int(n_samples / 10)]:
+ X = rng.randn(n_samples, n_features)
+ rank_scaler1 = RankScaler(n_ranks=None)
+ rank_scaler2 = RankScaler(n_ranks=n_ranks)
+ rank_scaler1.fit(X)
+ rank_scaler2.fit(X)
+
+ X2 = rng.randn(1000, n_features)
+ X21 = rank_scaler1.transform(X2)
+ X22 = rank_scaler2.transform(X2)
+
+ # In the approximate version X22, all values must
+ # be within 1./n_ranks of the exact value X21.
+ assert_true(np.all(np.fabs(X21 - X22) < 1. / n_ranks))
+
def test_min_max_scaler_iris():
X = iris.data