[MRG+1] QuantileTransformer #8363

Merged
merged 107 commits into scikit-learn:master on Jun 9, 2017

Conversation

@glemaitre
Contributor

glemaitre commented Feb 15, 2017

Reference Issue

Cont'd of #2176

What does this implement/fix? Explain your changes.

Implementation of quantile normalizer

Any other comments?


@jnothman jnothman referenced this pull request Feb 16, 2017

Closed

[MRG+2] Add fixed width discretization to scikit-learn #7668

8 of 8 tasks complete

@lesteve lesteve referenced this pull request in scikit-learn/scikit-learn.github.io Feb 16, 2017

Merged

[MRG] Better handling of Circle CI artifact URLs #6

sklearn/preprocessing/tests/test_data.py
+    X_trans = normalizer.fit_transform(X)
+    # FIXME: one of those will drive to precision error
+    # in the interpolation
+    # assert_array_almost_equal(np.min(X_trans, axis=0), 0.)

@tguillemot

tguillemot Feb 16, 2017

Contributor

I'm working on it.

@glemaitre

glemaitre Feb 16, 2017

Contributor

I checked yesterday for a while and there is nothing wrong with our code.
f(min(X)) of the interpolated function does not return 0.
The issue seems to come from numpy.interp.

This is working on the toy example :D
I will try to sort out the CI error, which I think comes from the different numpy versions.

This comment has been minimized.

@tguillemot

tguillemot Feb 16, 2017

Contributor

It's a problem of precision with numpy.interp indeed.

sklearn/preprocessing/data.py
+        self : object
+            Returns self
+        """
+        X = self._validate_X(X)

@tguillemot

tguillemot Feb 16, 2017

Contributor

Just a nitpick: is it necessary to create a specific function?
When there are only a few lines, I prefer not to create a function ;).

sklearn/preprocessing/tests/test_data.py
+    normalizer = QuantileNormalizer()
+    normalizer.fit(X)
+    X_trans = normalizer.fit_transform(X)
+    assert_array_almost_equal(np.min(X_trans, axis=0), 0.)

@ogrisel

ogrisel Feb 16, 2017

Member

You can use assert_almost_equal when you compare scalar values.
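
For illustration, a minimal sketch of the suggested change (the X_trans values here are a stand-in, not the PR's fixture):

import numpy as np
from sklearn.utils.testing import assert_almost_equal

X_trans = np.array([[0., 0.5], [1., 0.]])  # stand-in for the fitted output
# min/max over the whole array are scalars, so assert_almost_equal suffices
assert_almost_equal(X_trans.min(), 0.)
assert_almost_equal(X_trans.max(), 1.)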

sklearn/preprocessing/tests/test_data.py
+    X_trans = normalizer.fit_transform(X)
+    assert_array_almost_equal(np.min(X_trans, axis=0), 0.)
+    assert_array_almost_equal(np.max(X_trans, axis=0), 1.)
+

@ogrisel

ogrisel Feb 16, 2017

Member

Can you please add a check that extreme values are mapped to 0 or 1, e.g.

X_test = np.array([
    [ -1,  1,  0],
    [101, 11, 10],
])
expected = np.array([
    [0, 0, 0],
    [1, 1, 1],
])
assert_array_almost_equal(normalizer.transform(X_test), expected)

sklearn/preprocessing/data.py
+
+        for feat_idx, f in enumerate(func_transform):
+            Xt.data[Xt.indptr[feat_idx]:Xt.indptr[feat_idx + 1]] = f(
+                Xt.data[Xt.indptr[feat_idx]:Xt.indptr[feat_idx + 1]])

This comment has been minimized.

@ogrisel

ogrisel Feb 16, 2017

Member

Nice :)

@ogrisel

ogrisel Feb 16, 2017

Member

Actually you could factorize the slicing to make the code more readable:

column_slice = slice(Xt.indptr[feat_idx], Xt.indptr[feat_idx + 1])
Xt.data[column_slice] = f(Xt.data[column_slice])

sklearn/preprocessing/data.py
+        ----------
+        X : sparse matrix, shape (n_samples, n_features)
+            The data used to scale along the features axis. The sparse matrix
+            needs to be semi-positive.

This comment has been minimized.

@ogrisel

ogrisel Feb 16, 2017

Member

You should make it explicit that it only works for CSC sparse matrices (I know this is not public API but it makes it easier to understand how the code works).
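
For context, a standalone sketch (illustrative data, assuming scipy) of why this column-wise slicing requires the CSC layout:

from scipy import sparse

X = sparse.random(5, 3, density=0.5, format='csc', random_state=0)
# In CSC format the non-zero values of column j are stored contiguously,
# so X.indptr delimits each column's entries inside X.data:
j = 1
column_nnz = X.data[X.indptr[j]:X.indptr[j + 1]]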

sklearn/preprocessing/data.py
+    # we only accept positive sparse matrix
+    if sparse.issparse(X) and X.min() < 0:
+        raise ValueError('QuantileNormalizer only accepts semi-positive'
+                         ' sparse matrices')

This comment has been minimized.

@ogrisel

ogrisel Feb 16, 2017

Member

Not "semi-positive sparse matrix" but "sparse matrices with all non-negative entries".

sklearn/preprocessing/tests/test_data.py
+def test_quantile_normalizer_error_neg_sparse():
+    X = np.array([[0, 25, 50, 75, 100],
+                  [-2, 4, 6, 8, 10],
+                  [2.6, 4.1, 2.3, 9.5, 0.1]]).T

This comment has been minimized.

@ogrisel

ogrisel Feb 16, 2017

Member

You should insert more zero values in this matrix to make it sparser.

sklearn/preprocessing/tests/test_data.py
+
+    X = np.array([[0, 25, 50, 75, 100],
+                  [2, 4, 6, 8, 10],
+                  [2.6, 4.1, 2.3, 9.5, 0.1]]).T

This comment has been minimized.

@ogrisel

ogrisel Feb 16, 2017

Member

You should insert more zero values in this matrix to make it sparser.

sklearn/preprocessing/tests/test_data.py
+    qn_ser = pickle.dumps(qn, pickle.HIGHEST_PROTOCOL)
+    qn2 = pickle.loads(qn_ser)
+    assert_array_almost_equal(qn.transform(iris.data),
+                              qn2.transform(iris.data))

@ogrisel

ogrisel Feb 16, 2017

Member

You should also check that it can be pickled correctly before fitting (even though it should trivially work).
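
A minimal sketch of that extra check, assuming the QuantileNormalizer class from this PR:

import pickle

qn = QuantileNormalizer()  # not fitted yet
qn2 = pickle.loads(pickle.dumps(qn, pickle.HIGHEST_PROTOCOL))
assert qn2.get_params() == qn.get_params()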

sklearn/preprocessing/data.py
+
+    The normalization is applied on each feature independently.
+    The cumulative density function of a feature is used to project the
+    original values.

@dengemann

dengemann Feb 16, 2017

Contributor

Add something like:

Features of new/unseen data that fall below or above the fitted range will be mapped to 0 and 1, respectively.
Note that this transform is non-linear. It may remove correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.

sklearn/preprocessing/data.py
+    See also
+    --------
+    :class:`sklearn.preprocessing.StandardScaler` to perform standardization
+        that is faster, but less robust to outliers.

@dengemann

dengemann Feb 16, 2017

Contributor

Add maybe:

:class:`sklearn.preprocessing.RobustScaler` to perform robust standardization that removes the influence of outliers but does not put outliers and inliers on the same scale.

sklearn/preprocessing/data.py
+                              bounds_error=False,
+                              fill_value=(min(quantiles_feat),
+                                          max(quantiles_feat)))
+                     for quantiles_feat in self.quantiles_.T]

This comment has been minimized.

@dengemann

dengemann Feb 16, 2017

Contributor

Is there any reason for these guys being lists, hence mutable?

This comment has been minimized.

@glemaitre

glemaitre Feb 16, 2017

Contributor

Good point.

sklearn/preprocessing/data.py
+        self.references_ = np.linspace(0, 1, self.n_quantiles,
+                                       endpoint=True)
+        # FIXME: it does not take into account the zero in the computation
+        self.quantiles_ = np.array([np.percentile(

@glemaitre

glemaitre Feb 16, 2017

Contributor

@ogrisel @tguillemot Here I am not really sure what the right way should be.
Assuming that the sparse matrix has a lot of zeros for a given feature, they will have a bad influence on the normalization, won't they?
It could also be the case for dense input, in fact. That was the reason for including a quantile_range.

@tguillemot

tguillemot Feb 16, 2017

Contributor

Can we modify the reference values to take into account the number of zeros?
Not sure it's what we want.

@glemaitre

glemaitre Feb 16, 2017

Contributor

It is in np.percentile that we can do that. We know the size of X_col and we can know the number of non-zeros. Therefore, we can add the zeros to the data to compute the percentiles.

@ogrisel

ogrisel Feb 16, 2017

Member

Yes, we need to find a way to shift the percentile distribution efficiently. It is probably better to do the quantile computation ourselves: sort the subsampled column non-zero data, then consider the fraction of zeros that should be added at the beginning of that array (without actually materializing it), also taking the subsampling rate into account, and do the quantile lookups manually.

This comment has been minimized.

@ogrisel

ogrisel Feb 16, 2017

Member

But then we also need to handle the linear interpolation...

@ogrisel

ogrisel Feb 16, 2017

Member

Actually, no need to do that, let's do:

column_nnz_data = X.data[X.indptr[feat]:X.indptr[feat + 1]]
column_subsample = subsample * len(column_nnz_data) // X.shape[0]
column_data = np.zeros(shape=subsample, dtype=X.dtype)
column_data[:column_subsample] = rng.choice(column_nnz_data, column_subsample,
                                            replace=False)

and then proceed to extract the quantiles from column_data as usual.

This comment has been minimized.

@ogrisel

ogrisel Feb 16, 2017

Member

Because subsample is going to be smallish and independent of X.shape[0] this is good enough and easier to maintain.
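
A runnable sketch of this padding idea (sizes and names here are illustrative):

import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
X = sparse.random(1000, 1, density=0.1, format='csc', random_state=rng)
subsample, feat = 100, 0

column_nnz_data = X.data[X.indptr[feat]:X.indptr[feat + 1]]
# keep the column's ratio of implicit zeros in the subsample
column_subsample = subsample * len(column_nnz_data) // X.shape[0]
# the leading zeros stand in for the implicit zeros of the sparse column
column_data = np.zeros(shape=subsample, dtype=X.dtype)
column_data[:column_subsample] = rng.choice(column_nnz_data, column_subsample,
                                            replace=False)

references = np.linspace(0, 1, 10)
quantiles = np.percentile(column_data, references * 100)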

sklearn/preprocessing/data.py
+        # FIXME: it does not take into account the zero in the computation
+        self.quantiles_ = np.array([np.percentile(
+            X.data[X.indptr[feat]:X.indptr[feat + 1]], self.references_ * 100)
+            for feat in range(n_feat)]).T

This comment has been minimized.

@ogrisel

ogrisel Feb 16, 2017

Member

Cosmetics: please use n_features and feature_idx.

sklearn/preprocessing/tests/test_data.py
+    # assert_array_almost_equal(np.min(X_trans, axis=0), 0.)
+    # assert_array_almost_equal(np.max(X_trans, axis=0), 1.)
+    X_trans_inv = normalizer.inverse_transform(X_trans)
+    assert_array_almost_equal(X, X_trans_inv)

@glemaitre

glemaitre Feb 16, 2017

Contributor

Not directly related to this line, but to transform followed by inverse_transform: the result will not be equal to X if X has out-of-bounds values, which are clipped during transform and mapped to the minimum or maximum of the references_ during inverse transform.

@tguillemot

tguillemot Feb 16, 2017

Contributor

There is no problem in this case (and it's a way to be sure the normalizer works correctly).
But what you say is indeed true in the general case.

sklearn/preprocessing/data.py
if direction:
+    print(1)

@glemaitre

glemaitre Feb 16, 2017

Contributor

@tguillemot That looks like a leftover debugging statement.

sklearn/preprocessing/data.py
    func_transform = self.f_transform_
else:
+    print(2)

@glemaitre

glemaitre Feb 16, 2017

Contributor

@tguillemot That looks like a leftover debugging statement.

@tguillemot

tguillemot Feb 16, 2017

Contributor

Oops, indeed. Sorry.

@glemaitre
Contributor

glemaitre commented Feb 16, 2017

@ogrisel I was checking the User guide for preprocessing to see what to add.

I have second thoughts on the naming of the class. From the description in the user guide, QuantileScaler would be more appropriate.

What is the reason to stick with "normalizer"?

@ogrisel
Member

ogrisel commented Feb 16, 2017

The problem is that (feature-wise) scaling stands for dividing each feature by a scalar value. This is the case for StandardScaler and RobustScaler but not in our case. I prefer QuantileNormalizer or QuantileTransformer.

@ogrisel
Member

ogrisel commented Feb 16, 2017

https://research.google.com/pubs/pub45530.html uses "quantile normalization" in the body of the article to describe what we do in this class. +1 for QuantileNormalizer.

sklearn/preprocessing/data.py
+    The cumulative density function of a feature is used to project the
+    original values. Features values of new/unseen data that fall below
+    or above the fitted range will be mapped to 0 and 1, respectively.
+    Note that this transform is non-linear. It may remove correlations between

This comment has been minimized.

@ogrisel

ogrisel Feb 16, 2017

Member

"remove correlations" => "distort linear correlations"

sklearn/preprocessing/data.py
+    This Normalizer scales the features between 0 and 1, equalizing the
+    distribution of each feature to a uniform distribution. Therefore,
+    for a given feature, this normalization tends to spread out the most
+    frequent values.

This comment has been minimized.

@ogrisel

ogrisel Feb 16, 2017

Member

It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.

@glemaitre
Contributor

glemaitre commented Feb 16, 2017

> https://research.google.com/pubs/pub45530.html uses "quantile normalization" in the body of the article to describe what we do in this class. +1 for QuantileNormalizer.

Fair enough. The narration of the User guide needs to be changed to be coherent.

sklearn/preprocessing/data.py
+
+    f_inverse_transform_ : list of callable, shape (n_quantiles,)
+        The inverse of the cumulative density function used to project the
+        data.

@ogrisel

ogrisel Feb 16, 2017

Member

I think we should keep the f_transform_ and f_inverse_transform_ attributes private (with a leading underscore).

sklearn/preprocessing/data.py
+        # 0 and 100.
+        self.quantiles_ = [np.percentile(X[subsample_idx, feature_idx],
+                                         [x * 100 for x in self.references_])
+                           for feature_idx in range(n_features)]

This comment has been minimized.

@ogrisel

ogrisel Feb 16, 2017

Member

quantiles_ should be a 2D array, not a list of 1D arrays.
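
A minimal sketch of that suggestion (illustrative data; recent numpy accepts a sequence of percentiles):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
references = np.linspace(0, 1, 11)

# a single vectorized call returns a 2D array of shape (n_quantiles, n_features)
quantiles = np.percentile(X, references * 100, axis=0)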

sklearn/preprocessing/data.py
+    Attributes
+    ----------
+    references_ : ndarray, shape (n_quantiles,)
+        The quantiles of reference.

This comment has been minimized.

@ogrisel

ogrisel Feb 16, 2017

Member

I don't see the point of storing this as an attribute. It can be trivially recomputed from n_quantiles if needed.

@codecov

codecov bot commented Feb 16, 2017

Codecov Report

Merging #8363 into master will decrease coverage by <.01%.
The diff coverage is 99.62%.

@@            Coverage Diff            @@
##           master   #8363      +/-   ##
=========================================
- Coverage    95.5%   95.5%   -0.01%     
=========================================
  Files         342     342              
  Lines       61290   61301      +11     
=========================================
+ Hits        58538   58546       +8     
- Misses       2752    2755       +3

Impacted Files Coverage Δ
sklearn/preprocessing/tests/test_data.py 99.91% <100%> (+0.01%) ⬆️
sklearn/preprocessing/__init__.py 100% <100%> (ø) ⬆️
sklearn/preprocessing/data.py 99.24% <99.2%> (-0.02%) ⬇️
sklearn/linear_model/tests/test_bayes.py 83.01% <0%> (-2.47%) ⬇️
sklearn/datasets/california_housing.py 42.1% <0%> (-1.49%) ⬇️
sklearn/utils/optimize.py 85.29% <0%> (-1.48%) ⬇️
sklearn/datasets/species_distributions.py 29.57% <0%> (-0.98%) ⬇️
sklearn/utils/__init__.py 93.8% <0%> (-0.69%) ⬇️
sklearn/datasets/lfw.py 14.28% <0%> (-0.53%) ⬇️
sklearn/linear_model/sag.py 93.84% <0%> (-0.53%) ⬇️
... and 79 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8981b25...e084f5a.

@glemaitre
Contributor

glemaitre commented Feb 16, 2017

@ogrisel Would it be possible to kill all the pending AppVeyor checks for this PR except the last one? I kind of overloaded the CI.

@jnothman
Member

jnothman commented Feb 17, 2017

  1. QuantileNormalizer makes it sound like it's a sample-wise transformation, as Normalizer is. I would rather have QuantileScaler, QuantileTransformer, Quantiler or RankTransformer.

  2. Should sparse normalisation produce the same results as the equivalent dense input? Currently it doesn't, but assumes that the implicit zeros have no effect on the data distribution.

>>> X = sparse.csr_matrix((np.array([0]), np.array([0]), np.array([0, 1])), shape=(1, 2))
>>> QuantileNormalizer().fit_transform(X).A
array([[ 1.,  0.]])
>>> QuantileNormalizer().fit_transform(X.A)
array([[ 1.,  1.]])
>>> X[0, 0] = 1
>>> QuantileNormalizer().fit_transform(X).A
array([[ 1.,  0.]])

Is this the behaviour we want?

@glemaitre
Contributor

glemaitre commented Feb 17, 2017

> Should sparse normalisation produce the same results as the equivalent dense input? Currently it doesn't, but assumes that the implicit zeros have no effect on the data distribution.

The issue is from scipy's interp1d, which does crazy stuff at the bounds. Consider the following matrix:

array([[0, 0],
       [0, 0],
       [1, 0]])

The sparse output will be:

array([[ 0.,  0.],
       [ 0.,  0.],
       [ 1.,  0.]])

For the sparse case, we do not compute the mapping of the 0s with interp1d since they will be 0s anyway.

For the dense case, the result is:

array([[ 0.24724725,  1.        ],
       [ 0.24724725,  1.        ],
       [ 1.        ,  1.        ]])

which can be reproduced with the following snippet (note the * 100, since np.percentile expects percentiles in [0, 100]):

import numpy as np
from scipy.interpolate import interp1d

reference = np.linspace(0, 1, num=1000)
q = np.percentile(X.A, reference * 100, axis=0)
i0 = interp1d(q[:, 0], reference)
i1 = interp1d(q[:, 1], reference)
print(i0(0))
print(i1(0))

You get

0.24724725
1.

The hack to get the same values between dense and sparse is to avoid using interp1d for the lower and upper bounds and map them to 0. and 1.

@jnothman @ogrisel @tguillemot any thoughts?

PS: Just out of curiosity, why are we still supporting numpy as old as 1.6? The implementation could have been cleaner with 1.8 support :S

@lesteve
Member

lesteve commented Feb 17, 2017

> PS: Just out of curiosity, why are we still supporting numpy as old as 1.6? The implementation could have been cleaner with 1.8 support :S

You can wait until May 2017, when Ubuntu 12.04 will no longer be supported and we will probably upgrade the requirements. There is an issue about that: #7522.

@tguillemot
Contributor

tguillemot commented Feb 17, 2017

The bounds introduce a wrong computation:

>>> X = np.array([[0, 0,  1],
                  [1, 0.5, 0]]).T
>>> X1 = np.array([[0, 0, 1],
                   [0.1, 0.5, 0.1]]).T
>>> qn = QuantileNormalizer()
>>> qn.fit(X)
>>> qn.transform(X)
array([[ 0. ,  1. ],
       [ 0. ,  0.5],
       [ 1. ,  0. ]])
>>> qn.transform(X1)
array([[ 0.,  0.],
       [ 0.,  1.],
       [ 1.,  0.]])

instead of:

>>> qn.transform(X1)
array([[ 0.,  0.1],
       [ 0.,  0.5],
       [ 1.,  0.1]])

@ogrisel
Member

ogrisel commented Feb 17, 2017

I think we can decide to proactively raise the dependency to the numpy / scipy versions of Ubuntu 14.04 if it's too complex to support the old numpy.

@ogrisel
Member

ogrisel commented Feb 17, 2017

> Should sparse normalisation produce the same results as the equivalent dense input? Currently it doesn't, but assumes that the implicit zeros have no effect on the data distribution.

We probably need an option to switch between the 2 modes: consider non-materialized zeros as missing values to be ignored. I think the default should be to consider the missing zeros as true zeros and therefore behave the same as the corresponding model fit on X.toarray().

@glemaitre
Contributor

glemaitre commented Feb 17, 2017

> I think we can decide to proactively raise the dependency to the numpy / scipy versions of Ubuntu 14.04 if it's too complex to support the old numpy.

These are minor changes, such as a tolist() conversion or using permutation instead of choice. The code is just easier to read with those functions.

The issue with the bounds will still be there, whatever the numpy version.

sklearn/preprocessing/data.py
+        return n.fit_transform(X)
+    else:
+        return n.fit_transform(X.T).T

@ogrisel

ogrisel Feb 17, 2017

Member

What if axis=2? We should raise an exception when axis is not 0 or 1.
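
A minimal sketch of that validation (hypothetical wrapper; the PR's actual function may differ):

from sklearn.preprocessing import QuantileTransformer

def quantile_transform(X, axis=0):
    n = QuantileTransformer()
    if axis == 0:
        return n.fit_transform(X)
    elif axis == 1:
        return n.fit_transform(X.T).T
    # anything other than 0 or 1 is rejected up front
    raise ValueError("axis should be either equal to 0 or 1. Got axis={}"
                     .format(axis))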

This comment has been minimized.

@glemaitre

glemaitre Feb 17, 2017

Contributor

This is something redundant in the module. Shall we open another PR with a common test for all the functions?

@dengemann
Contributor

dengemann commented Jun 7, 2017

@ogrisel @jnothman I think adding smoothing_noise=True is the right move. You just don't want to bother with this as a user.

@dengemann
Contributor

dengemann commented Jun 7, 2017

@glemaitre now I'm happy enough with the example. For me this is good to go!

doc/modules/preprocessing.rst
+    >>> X, y = iris.data, iris.target
+    >>> X_train, X_test, y_train, y_test = train_test_split(X, y)
+    >>> quantile_transformer = preprocessing.QuantileTransformer(
+    ...     smoothing_noise=1e-12)

@ogrisel

ogrisel Jun 7, 2017

Member

This should just be:

quantile_transformer = preprocessing.QuantileTransformer()

+# Note that this non-parametetric transformer introduces saturation artifacts
+# for extreme values.
+
+make_plot(6)

@amueller

amueller Jun 7, 2017

Member

Could we do the examples as subplots? Currently it opens 8 windows and the plots are hard to compare.

@ogrisel

ogrisel Jun 8, 2017

Member

We did that previously, but that would generate a very large figure that is very slow to render and very tedious to lay out correctly in a way that works both for qt and png / sphinx.

@ogrisel

ogrisel Jun 8, 2017

Member

Also, and even more importantly, we want to use the cell-based layout of sphinx-gallery so that the explanations that match each plot are close to it when rendered in the example gallery.

doc/modules/preprocessing.rst
+setting ``output_distribution='normal'``::
+
+    >>> quantile_transformer = preprocessing.QuantileTransformer(
+    ...     smoothing_noise=True, output_distribution='normal')

@ogrisel

ogrisel Jun 7, 2017

Member

Let's not put smoothing_noise=True on doctest snippets that document another parameter. Here the following would be enough:

>>> quantile_transformer = preprocessing.QuantileTransformer(
...     output_distribution='normal')

doc/modules/preprocessing.rst
+corresponding to the 1e-7 and 1 - 1e-7 quantiles respectively --- do not
+become infinite under the transformation.
+
+:class:`QuantileTransformer` provides a ``smoothing_noise`` parameter to

This comment has been minimized.

@ogrisel

ogrisel Jun 7, 2017

Member

provides a smoothing_noise parameter (set to True by default) to ....

+
+
+def make_plot(item_idx):
+    title, X = distributions[item_idx]

@amueller

amueller Jun 7, 2017

Member

Can we get titles for the two subplots? One is the full data, the other a zoom-in, right? That's not clear from the figure.

sklearn/preprocessing/data.py
+
+    subsample : int, optional (default=1e5)
+        Maximum number of samples used to estimate the quantiles for
+        computational efficiency. Note that the subsamplong procedure may

This comment has been minimized.

@amueller

amueller Jun 7, 2017

Member

subsampling

doc/modules/preprocessing.rst
+    array([...])
+    >>> np.percentile(X_test_trans[:, 0], [0, 25, 50, 75, 100])
+    ... # doctest: +ELLIPSIS, +SKIP
+    array([...])

@ogrisel

ogrisel Jun 7, 2017

Member

I don't see the point of having doctests with "array([...])" as the output. I think we should display the actual percentiles up to 2 digits. We need to fix random_state=0 to ensure that the results are reproducible.

@glemaitre

glemaitre Jun 7, 2017

Contributor

It was due to the numpy < 1.8 support (it is skipped for the moment).

@ogrisel

ogrisel Jun 7, 2017

Member

I get non-deterministic results even if I fix random_state=0 in QuantileTransformer.

@amueller
Member

amueller commented Jun 7, 2017

Currently, with sparse matrices, if you don't ignore the implicit zeros you basically get a bunch of constant quantiles, right? And the rest is the same. When does that make sense? Basically you never want constant quantiles, do you?

Should they be cleaned up afterwards? Or can they be avoided?
https://gist.github.com/amueller/87d6eb4ca38d8f95bf823573923d8acb

@jnothman
Member

jnothman commented on be207c7 Jun 7, 2017

I'm not sure this constitutes a FIX commit! Perhaps DOC?

@amueller
Member

amueller commented Jun 8, 2017

Ok, fair enough, I still would like headers for the two subplots left and right ;)

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 8, 2017

The smoothing_noise is justified as a way to map repeated values to some middle quantile value. Wouldn't there be a strategy that does this mapping without relying on noise, assigning repeated values to the middle quantile in a deterministic way? This would be cleaner, IMHO.

sklearn/preprocessing/data.py
-    output_pdf : scipy.stats.rv_continuous, optional (default=uniform)
-        Probability density function of the normalized data. It should be a
-        subclass of ``scipy.stats.rv_continuous``.
+    output_pdf : str, optional (default='norm')

This comment has been minimized.

@GaelVaroquaux

GaelVaroquaux Jun 8, 2017

Member

"norm" is not a great name, IMHO. "normal" or "gaussian" would be better.

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 9, 2017

I can get rid of the smoothing_noise by doing the interpolation in one direction and then in the other:

[image]

I've pushed a first implementation of this idea for quick reviews. I still need to remove the smoothing option and update the docs and examples.
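
A standalone sketch of this two-way interpolation idea (a simplified reconstruction, not the exact PR code): interpolating forward, then on the negated and reversed arrays, and averaging the two maps a run of repeated values to its middle quantile deterministically.

import numpy as np

rng = np.random.RandomState(0)
x = np.sort(rng.rand(100))
x[10:60] = x[10]  # a heavily repeated value
references = np.linspace(0, 1, 100)
quantiles = np.percentile(x, references * 100)

def transform(v):
    # the forward pass lands on one end of the repeated run, the reversed
    # pass on the other end; their average is the middle quantile
    forward = np.interp(v, quantiles, references)
    backward = -np.interp(-v, -quantiles[::-1], -references[::-1])
    return 0.5 * (forward + backward)

print(transform(x[10]))  # approximately the middle quantile of the run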

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 9, 2017

I've removed the smoothing_noise.

@jnothman: this should give us your 👍, no?

There is a failing test that I will address soon.

ENH: 2-ways interpolation to avoid smoothing_noise
Simplifies also the code, examples, and documentation

@GaelVaroquaux GaelVaroquaux merged commit 26a1027 into scikit-learn:master Jun 9, 2017

2 of 3 checks passed

continuous-integration/appveyor/pr: Waiting for AppVeyor build to complete
ci/circleci: Your tests passed on CircleCI!
continuous-integration/travis-ci/pr: The Travis CI build passed
@GaelVaroquaux
Member

GaelVaroquaux commented Jun 9, 2017

Merged. Whoot!

This is based on a 4-year-old PR by Joseph Turian :)

@agramfort

Member

agramfort commented Jun 10, 2017

🍻

@tguillemot

Contributor

tguillemot commented Jun 10, 2017

👍

@raghavrv

Member

raghavrv commented Jun 10, 2017

Yohoo :D Thanks for the patience @glemaitre

@jnothman

Member

jnothman commented Jun 10, 2017

Nicely resolved, @GaelVaroquaux, and well done all!

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+1] QuantileTransformer (#8363)
* resurrect quantile scaler

* move the code in the pre-processing module

* first draft

* Add tests.

* Fix bug in QuantileNormalizer.

* Add quantile_normalizer.

* Implement pickling

* create a specific function for dense transform

* Create a fit function for the dense case

* Create a toy examples

* First draft with sparse matrices

* remove useless functions and non-negative sparse compatibility

* fix slice call

* Fix tests of QuantileNormalizer.

* Fix estimator compatibility

* List of functions became tuple of functions
* Check X consistency at transform and inverse transform time

* fix doc

* Add negative ValueError tests for QuantileNormalizer.

* Fix cosmetics

* Fix compatibility numpy <= 1.8

* Add n_features tests and correct ValueError.

* PEP8

* fix fill_value for early scipy compatibility

* simplify sampling

* Fix tests.

* removing last pring

* Change choice for permutation

* cosmetics

* fix remove remaining choice

* DOC

* Fix inconsistencies

* pep8

* Add checker for init parameters.

* hack bounds and make a test

* FIX/TST bounds are provided by the fitting and not X at transform

* PEP8

* FIX/TST axis should be <= 1

* PEP8

* ENH Add parameter ignore_implicit_zeros

* ENH match output distribution

* ENH clip the data to avoid infinity due to output PDF

* FIX ENH restraint to uniform and norm

* [MRG] ENH Add example comparing the distribution of all scaling preprocessor (#2)

* ENH Add example comparing the distribution of all scaling preprocessor

* Remove Jupyter notebook convert

* FIX/ENH Select feat before not after; Plot interquantile data range for all

* Add heatmap legend

* Remove comment maybe?

* Move doc from robust_scaling to plot_all_scaling; Need to update doc

* Update the doc

* Better aesthetics; Better spacing and plot colormap only at end

* Shameless author re-ordering ;P

* Use env python for she-bang

* TST Validity of output_pdf

* EXA Use OrderedDict; Make it easier to add more transformations

* FIX PEP8 and replace scipy.stats by str in example

* FIX remove useless import

* COSMET change variable names

* FIX change output_pdf occurence to output_distribution

* FIX partial fixies from comments

* COMIT change class name and code structure

* COSMIT change direction to inverse

* FIX factorize transform in _transform_col

* PEP8

* FIX change the magic 10

* FIX add interp1d to fixes

* FIX/TST allow negative entries when ignore_implicit_zeros is True

* FIX use np.interp instead of sp.interpolate.interp1d

* FIX/TST fix tests

* DOC start checking doc

* TST add test to check the behaviour of interp numpy

* TST/EHN Add the possibility to add noise to compute quantile

* FIX factorize quantile computation

* FIX fixes issues

* PEP8

* FIX/DOC correct doc

* TST/DOC improve doc and add random state

* EXA add examples to illustrate the use of smoothing_noise

* FIX/DOC fix some grammar

* DOC fix example

* DOC/EXA make plot titles more succint

* EXA improve explanation

* EXA improve the docstring

* DOC add a bit more documentation

* FIX advance review

* TST add subsampling test

* DOC/TST better example for the docstring

* DOC add ellipsis to docstring

* FIX address olivier comments

* FIX remove random_state in sparse.rand

* FIX spelling doc

* FIX cite example in user guide and docstring

* FIX olivier comments

* EHN improve the example comparing all the pre-processing methods

* FIX/DOC remove title

* FIX change the scaling of the figure

* FIX plotting layout

* FIX ratio w/h

* Reorder and reword the plot_all_scaling example

* Fix aspect ratio and better explanations in the plot_all_scaling.py example

* Fix broken link and remove useless sentence

* FIX fix couples of spelling

* FIX comments joel

* FIX/DOC address documentation comments

* FIX address comments joel

* FIX inline sparse and dense transform

* PEP8

* TST/DOC temporary skipping test

* FIX raise an error if n_quantiles > subsample

* FIX wording in smoothing_noise example

* EXA Denis comments

* FIX rephrasing

* FIX make smoothing_noise to be a boolearn and change doc

* FIX address comments

* FIX verbose the doc slightly more

* PEP8/DOC

* ENH: 2-ways interpolation to avoid smoothing_noise

Simplifies also the code, examples, and documentation

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

[MRG+1] QuantileTransformer (#8363)

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

[MRG+1] QuantileTransformer (#8363)

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG+1] QuantileTransformer (#8363)

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+1] QuantileTransformer (#8363)

AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

[MRG+1] QuantileTransformer (#8363)

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+1] QuantileTransformer (#8363)

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

[MRG+1] QuantileTransformer (#8363)
+ output_distribution = self.output_distribution
+ output_distribution = getattr(stats, output_distribution)
+
+ # older versions of scipy do not handle a tuple as fill_value

@lesteve

lesteve Feb 27, 2018

Member

@glemaitre I bumped into this, while trying to get rid of code related to old numpy/scipy versions that we don't support any more. Do you remember what this is about? I could not figure it out by just looking at the code and searching the PR comments ...

@glemaitre

glemaitre Feb 27, 2018

Contributor

I think that it should have been removed. At first, I implemented the interpolation using scipy.interpolate.interp1d, which takes a fill_value parameter. In older scipy versions, fill_value does not accept a tuple [min, max], which is what we need.

But right now we are using numpy.interp. We could move to the higher-level scipy interpolation function, but we would need to wait until the minimum supported scipy version accepts a tuple or array-like for fill_value. I am also not sure it is worth spending time on it :)
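
For reference, a minimal sketch of the two APIs discussed above, using hypothetical quantiles/references arrays; np.interp handles out-of-range values through its left/right arguments, while interp1d only accepts a (below, above) tuple for fill_value from scipy 0.17 onwards:

import numpy as np
from scipy.interpolate import interp1d

quantiles = np.array([0., 1., 2., 3.])
references = np.linspace(0, 1, num=quantiles.size)

# numpy.interp: out-of-bounds inputs map to the boundary references
y_np = np.interp([-1., 1.5, 10.], quantiles, references,
                 left=references[0], right=references[-1])

# scipy >= 0.17: bounds_error=False plus a (below, above) tuple as fill_value
f = interp1d(quantiles, references, bounds_error=False,
             fill_value=(references[0], references[-1]))
y_sp = f([-1., 1.5, 10.])

assert np.allclose(y_np, y_sp)                 # both clip to [0, 1]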

@lesteve

lesteve Feb 28, 2018

Member

OK thanks

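For context on the snippet quoted above: getattr(stats, output_distribution) turns the output_distribution string into a scipy.stats distribution, and its inverse CDF (ppf) maps the uniform quantiles onto the target distribution; clipping away from exactly 0 and 1 keeps ppf finite at the boundaries. A minimal sketch, assuming output_distribution is 'norm':

import numpy as np
from scipy import stats

output_distribution = getattr(stats, 'norm')   # as in the quoted snippet

# clip uniform quantiles to the open interval (0, 1) so that the inverse
# CDF does not return -inf/+inf at the boundaries, then map through ppf
eps = np.spacing(1)
uniform = np.clip(np.linspace(0., 1., num=5), eps, 1 - eps)
print(output_distribution.ppf(uniform))        # finite values, roughly [-8.13, -0.67, 0., 0.67, 8.13]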