[MRG+1] QuantileTransformer #8363

Merged
merged 107 commits into from Jun 9, 2017

Contributor

@glemaitre glemaitre commented Feb 15, 2017

Reference Issue

Cont'd of #2176

What does this implement/fix? Explain your changes.

Implementation of quantile normalizer

Any other comments?

Contributor Author

@glemaitre glemaitre commented Feb 15, 2017

X_trans = normalizer.fit_transform(X)
# FIXME: one of those will drive to precision error
# in the interpolation
# assert_array_almost_equal(np.min(X_trans, axis=0), 0.)

tguillemot Feb 16, 2017
Contributor

I'm working on it.

glemaitre Feb 16, 2017
Author Contributor

I checked yesterday for a while and there is nothing wrong with our code.
f(min(X)) of the interpolated function does not return exactly 0.
The issue seems to come from numpy.interp.

This is working on the toy example :D
I will try to sort out the CI error, which I think comes from a different numpy version.

tguillemot Feb 16, 2017
Contributor

It's a problem of precision with numpy.interp indeed.

self : object
Returns self
"""
X = self._validate_X(X)

tguillemot Feb 16, 2017
Contributor

Just a nitpick: is it necessary to create a specific function?
When there are only a few lines, I prefer not to create a function ;).

normalizer = QuantileNormalizer()
normalizer.fit(X)
X_trans = normalizer.fit_transform(X)
assert_array_almost_equal(np.min(X_trans, axis=0), 0.)

ogrisel Feb 16, 2017
Member

You can use assert_almost_equal when you compare scalar values.

X_trans = normalizer.fit_transform(X)
assert_array_almost_equal(np.min(X_trans, axis=0), 0.)
assert_array_almost_equal(np.max(X_trans, axis=0), 1.)

ogrisel Feb 16, 2017
Member

Can you please add a check that extreme values are mapped to 0 or 1, e.g.

X_test = np.array([
    [ -1,  1,  0],
    [101, 11, 10],
])
expected = np.array([
    [0, 0, 0],
    [1, 1, 1],
])
assert_array_almost_equal(normalizer.transform(X_test), expected)
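
The check requested above relies on out-of-range inputs being clamped to the ends of the fitted range. A minimal sketch of that behaviour using `np.interp` directly; the toy quantiles and the helper name `to_uniform` are illustrative assumptions, not the PR's code:

```python
import numpy as np

# Toy fitted state for a single feature: quantiles observed at fit time
# and the reference levels in [0, 1] they correspond to.
quantiles = np.array([0.0, 25.0, 50.0, 75.0, 100.0])
references = np.linspace(0, 1, len(quantiles))


def to_uniform(x):
    """Map values to [0, 1] by piecewise-linear interpolation (hypothetical helper)."""
    # np.interp returns references[0] / references[-1] for inputs that
    # fall below / above the fitted range, so extremes are clamped.
    return np.interp(x, quantiles, references)


mapped = to_uniform(np.array([-1.0, 0.0, 50.0, 100.0, 101.0]))
print(mapped)
```

Values below the smallest fitted quantile come out as 0 and values above the largest come out as 1, which is what the suggested `X_test` / `expected` pair asserts.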

for feat_idx, f in enumerate(func_transform):
    Xt.data[Xt.indptr[feat_idx]:Xt.indptr[feat_idx + 1]] = f(
        Xt.data[Xt.indptr[feat_idx]:Xt.indptr[feat_idx + 1]])

ogrisel Feb 16, 2017
Member

Nice :)

ogrisel Feb 16, 2017
Member

Actually you could factorize the slicing to make the code more readable:

column_slice = slice(Xt.indptr[feat_idx], Xt.indptr[feat_idx + 1])
Xt.data[column_slice] = f(Xt.data[column_slice])
----------
X : sparse matrix, shape (n_samples, n_features)
The data used to scale along the features axis. The sparse matrix
needs to be semi-positive.

ogrisel Feb 16, 2017
Member

You should make it explicit that it only works for CSC sparse matrices (I know this is not public API but it makes it easier to understand how the code works).
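
To illustrate why the CSC format matters here, the following sketch applies one function per column to the stored entries only. The toy matrix and the two lambda functions are assumptions standing in for the fitted per-feature transforms:

```python
import numpy as np
from scipy import sparse

X = sparse.csc_matrix(np.array([[0.0, 2.0],
                                [1.0, 0.0],
                                [3.0, 4.0]]))
Xt = X.copy()

# One illustrative function per feature (hypothetical stand-ins for the
# fitted per-feature transforms).
funcs = (lambda v: v * 10.0, lambda v: v + 1.0)

for feature_idx, f in enumerate(funcs):
    # In CSC format, indptr[j]:indptr[j + 1] delimits the stored
    # (non-zero) entries of column j inside the flat data array.
    column_slice = slice(Xt.indptr[feature_idx], Xt.indptr[feature_idx + 1])
    Xt.data[column_slice] = f(Xt.data[column_slice])

print(Xt.toarray())
```

With a CSR matrix the same `indptr` slices would select rows instead of columns, so the per-feature logic would silently operate on samples.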

# we only accept positive sparse matrix
if sparse.issparse(X) and X.min() < 0:
    raise ValueError('QuantileNormalizer only accepts semi-positive'
                     ' sparse matrices')

ogrisel Feb 16, 2017
Member

Not "semi-positive sparse matrix" but "sparse matrices with all non-negative entries".
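
A small sketch of such a check with the suggested wording. The helper name `check_non_negative_sparse` is hypothetical, and inspecting `X.data` (the stored entries) rather than calling `X.min()` is a choice made for this example, not what the PR does:

```python
import numpy as np
from scipy import sparse


def check_non_negative_sparse(X):
    """Raise if a sparse matrix stores any negative entry (hypothetical helper)."""
    # Only explicitly stored entries can be negative; implicit zeros never are.
    if sparse.issparse(X) and X.data.size > 0 and X.data.min() < 0:
        raise ValueError("QuantileNormalizer only accepts sparse matrices "
                         "with all non-negative entries.")
    return X


X_ok = sparse.csc_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
X_bad = sparse.csc_matrix(np.array([[0.0, -1.0], [2.0, 0.0]]))

check_non_negative_sparse(X_ok)  # passes silently
try:
    check_non_negative_sparse(X_bad)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```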

def test_quantile_normalizer_error_neg_sparse():
    X = np.array([[0, 25, 50, 75, 100],
                  [-2, 4, 6, 8, 10],
                  [2.6, 4.1, 2.3, 9.5, 0.1]]).T

ogrisel Feb 16, 2017
Member

You should insert more zero values in this matrix to make it sparser.


X = np.array([[0, 25, 50, 75, 100],
              [2, 4, 6, 8, 10],
              [2.6, 4.1, 2.3, 9.5, 0.1]]).T

ogrisel Feb 16, 2017
Member

You should insert more zero values in this matrix to make it sparser.

qn_ser = pickle.dumps(qn, pickle.HIGHEST_PROTOCOL)
qn2 = pickle.loads(qn_ser)
assert_array_almost_equal(qn.transform(iris.data),
                          qn2.transform(iris.data))

ogrisel Feb 16, 2017
Member

You should also check that it can be pickled correctly before fitting (even though it should trivially work).

The normalization is applied on each feature independently.
The cumulative density function of a feature is used to project the
original values.

dengemann Feb 16, 2017
Contributor

Add something like:

Features of new/unseen data that fall below or above the fitted range will be mapped to 0 and 1, respectively.
Note that this transform is non-linear. It may remove correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.

See also
--------
:class:`sklearn.preprocessing.StandardScaler` to perform standardization
that is faster, but less robust to outliers.

dengemann Feb 16, 2017
Contributor

Add maybe

:class:`sklearn.preprocessing.RobustScaler` to perform robust standardization that removes the influence of outliers but does not put outliers and inliers on the same scale.
     
bounds_error=False,
fill_value=(min(quantiles_feat),
max(quantiles_feat)))
for quantiles_feat in self.quantiles_.T]

dengemann Feb 16, 2017
Contributor

Is there any reason for these guys being lists, hence mutable?

glemaitre Feb 16, 2017
Author Contributor

Good point.

self.references_ = np.linspace(0, 1, self.n_quantiles,
endpoint=True)
# FIXME: it does not take into account the zero in the computation
self.quantiles_ = np.array([np.percentile(

glemaitre Feb 16, 2017
Author Contributor

@ogrisel @tguillemot Here I am not really sure what the right way should be.
Assuming that the sparse matrix has a lot of zeros for a given feature, they will have a bad influence on the normalisation, won't they?
In fact, this could also be the case for dense data. That was the reason for including a quantile_range.

tguillemot Feb 16, 2017
Contributor

Can we modify the reference values to take the number of zeros into account?
Not sure it's what we want.

glemaitre Feb 16, 2017
Author Contributor

It is in np.percentile that we can do that. We know the size of X_col and we can know the number of non-zeros. Therefore, we can add the zeros to the data to compute the percentiles.

ogrisel Feb 16, 2017
Member

Yes, we need to find a way to shift the percentile distribution efficiently. It is probably better to do the quantile computation ourselves: sort the subsampled non-zero data of the column, then consider the fraction of zeros that should be added at the beginning of that array (without actually materializing it), also taking the subsampling rate into account, and do the quantile lookups manually.

ogrisel Feb 16, 2017
Member

But then we also need to handle the linear interpolation...

ogrisel Feb 16, 2017
Member

Actually, no need to do that, let's do:

column_nnz_data = X.data[X.indptr[feat]:X.indptr[feat + 1]]
column_subsample = subsample * len(column_nnz_data) // X.shape[0]
column_data = np.zeros(shape=subsample, dtype=X.dtype)
column_data[:column_subsample] = rng.choice(column_nnz_data, column_subsample,
                                            replace=False)

and then proceed to extract the quantiles from column_data as usual.

ogrisel Feb 16, 2017
Member

Because subsample is going to be smallish and independent of X.shape[0] this is good enough and easier to maintain.
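
The suggestion above can be made runnable as follows. This is only a sketch under stated assumptions: the toy matrix, `subsample = 100`, and five reference levels are illustrative values, and the variable names follow the snippet in the comment rather than the merged code:

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)

# Toy column: 1000 samples, 100 of them stored as non-zero values in [1, 2).
dense = np.zeros((1000, 1))
nz_rows = rng.choice(1000, 100, replace=False)
dense[nz_rows, 0] = rng.uniform(1.0, 2.0, size=100)
X = sparse.csc_matrix(dense)

subsample = 100                      # illustrative subsample size
references = np.linspace(0, 1, 5)    # quantile levels in [0, 1]
feature_idx = 0

column_nnz_data = X.data[X.indptr[feature_idx]:X.indptr[feature_idx + 1]]
# Keep the non-zero fraction of the subsample equal to that of the column.
column_subsample = subsample * len(column_nnz_data) // X.shape[0]

# Zero-filled buffer: the slots left untouched stand for the implicit zeros.
column_data = np.zeros(shape=subsample, dtype=X.dtype)
column_data[:column_subsample] = rng.choice(
    column_nnz_data, column_subsample, replace=False)

# np.percentile sorts internally, so the position of the zeros is irrelevant.
quantiles = np.percentile(column_data, references * 100)
print(column_subsample, quantiles)
```

Since 90% of the column is implicit zeros, every quantile up to the 75th percentile is 0 in this toy case, which is exactly the influence of zeros discussed earlier in the thread.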

@tguillemot tguillemot force-pushed the glemaitre:quantile_scaler branch from 509882d to 2b89139 Feb 16, 2017
# FIXME: it does not take into account the zero in the computation
self.quantiles_ = np.array([np.percentile(
    X.data[X.indptr[feat]:X.indptr[feat + 1]], self.references_ * 100)
    for feat in range(n_feat)]).T

ogrisel Feb 16, 2017
Member

Cosmetics: please use n_features and feature_idx.

# assert_array_almost_equal(np.min(X_trans, axis=0), 0.)
# assert_array_almost_equal(np.max(X_trans, axis=0), 1.)
X_trans_inv = normalizer.inverse_transform(X_trans)
assert_array_almost_equal(X, X_trans_inv)

glemaitre Feb 16, 2017
Author Contributor

Not directly related to this line but to the transform / inverse_transform round trip: the result will not be equal to X if X has out-of-bounds values, which will be clipped during transform and mapped to the minimum or maximum of the references_ during inverse transform.

tguillemot Feb 16, 2017
Contributor

There is no problem in that case (and it's a way to be sure the normalizer works correctly).
But what you say is indeed true in the general case.
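
A short sketch of the round-trip behaviour being discussed, assuming the forward and inverse maps are both plain `np.interp` calls on toy quantiles (the function names are hypothetical, not the PR's API):

```python
import numpy as np

# Toy fitted state for one feature.
quantiles = np.array([0.0, 25.0, 50.0, 75.0, 100.0])
references = np.linspace(0, 1, len(quantiles))


def transform(x):
    # Forward map: data values -> [0, 1]; out-of-range inputs are clamped.
    return np.interp(x, quantiles, references)


def inverse_transform(u):
    # Inverse map: [0, 1] -> data values inside the fitted range.
    return np.interp(u, references, quantiles)


X_in_range = np.array([10.0, 50.0, 90.0])
X_out_of_range = np.array([-20.0, 150.0])

round_trip_in = inverse_transform(transform(X_in_range))
round_trip_out = inverse_transform(transform(X_out_of_range))
print(round_trip_in, round_trip_out)
```

Values inside the fitted range survive the round trip, while the out-of-range values come back as the ends of the fitted range, so an equality assertion only holds for data seen at fit time.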

if direction:
    print(1)

glemaitre Feb 16, 2017
Author Contributor

@tguillemot That looks like a debugging flag.

    func_transform = self.f_transform_
else:
    print(2)

glemaitre Feb 16, 2017
Author Contributor

@tguillemot That looks like a debugging flag.

tguillemot Feb 16, 2017
Contributor

Oops, indeed. Sorry.

@tguillemot tguillemot force-pushed the glemaitre:quantile_scaler branch from 211bd64 to a803339 Feb 16, 2017
Contributor Author

@glemaitre glemaitre commented Feb 16, 2017

@ogrisel I was checking the User guide for the preprocessing to see what to add.

I am having second thoughts about the naming of the class. From the description in the user guide, QuantileScaler would be more appropriate.

What is the reason to stick to normalizer?

Member

@ogrisel ogrisel commented Feb 16, 2017

The problem is that (feature-wise) scaling stands for dividing each feature by a scalar value. This is the case for StandardScaler and RobustScaler but not in our case. I prefer QuantileNormalizer or QuantileTransformer.

Member

@ogrisel ogrisel commented Feb 16, 2017

https://research.google.com/pubs/pub45530.html uses "quantile normalization" in the body of the article to describe what we do in this class. +1 for QuantileNormalizer.

The cumulative density function of a feature is used to project the
original values. Features values of new/unseen data that fall below
or above the fitted range will be mapped to 0 and 1, respectively.
Note that this transform is non-linear. It may remove correlations between

ogrisel Feb 16, 2017
Member

"remove correlations" => "distort linear correlations"

This Normalizer scales the features between 0 and 1, equalizing the
distribution of each feature to a uniform distribution. Therefore,
for a given feature, this normalization tends to spread out the most
frequent values.

ogrisel Feb 16, 2017
Member

It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.
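
To make the robustness point concrete, here is a toy comparison between a linear min-max scaling and a quantile mapping to a uniform distribution. The data and the use of the sorted sample as the fitted quantiles are assumptions for illustration only:

```python
import numpy as np

# Four inliers and one large marginal outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

# Min-max scaling is linear, so the outlier squeezes the inliers near 0.
min_max = (x - x.min()) / (x.max() - x.min())

# Quantile mapping: use the sorted sample as fitted quantiles and spread
# the values evenly over [0, 1] by rank.
quantiles = np.sort(x)
references = np.linspace(0, 1, len(quantiles))
uniform = np.interp(x, quantiles, references)

print(min_max)
print(uniform)
```

After min-max scaling the four inliers all sit below 0.01, whereas the quantile mapping keeps them evenly spaced and places the outlier at 1 without letting it dominate the scale.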

Contributor Author

@glemaitre glemaitre commented Feb 16, 2017

> https://research.google.com/pubs/pub45530.html uses "quantile normalization" in the body of the article to describe what we do in this class. +1 for QuantileNormalizer.

Fair enough. The narration of the User guide needs to be changed to be coherent.

f_inverse_transform_ : list of callable, shape (n_quantiles,)
The inverse of the cumulative density function used to project the
data.

ogrisel Feb 16, 2017
Member

I think we should keep the f_transform_ and f_inverse_transform_ attributes private (with a leading underscore).

Member

@GaelVaroquaux GaelVaroquaux commented Jun 9, 2017

I've removed the smoothing_noise

@jnothman : this should give us your 👍, no?

There is a failing test that I will address soon

@GaelVaroquaux GaelVaroquaux force-pushed the glemaitre:quantile_scaler branch 9 times, most recently from 05290f5 to 45a1548 Jun 9, 2017
Simplifies also the code, examples, and documentation
@GaelVaroquaux GaelVaroquaux force-pushed the glemaitre:quantile_scaler branch from 45a1548 to 7046a6d Jun 9, 2017
@GaelVaroquaux GaelVaroquaux merged commit 26a1027 into scikit-learn:master Jun 9, 2017
2 of 3 checks passed
continuous-integration/appveyor/pr: Waiting for AppVeyor build to complete
ci/circleci: Your tests passed on CircleCI!
continuous-integration/travis-ci/pr: The Travis CI build passed
Member

@GaelVaroquaux GaelVaroquaux commented Jun 9, 2017

Merged. Whoot!

This is based on a 4-year old PR by Joseph Turian :)

Contributor

@dengemann dengemann commented Jun 9, 2017

Member

@agramfort agramfort commented Jun 10, 2017

🍻

Contributor

@tguillemot tguillemot commented Jun 10, 2017

👍

Member

@raghavrv raghavrv commented Jun 10, 2017

Yohoo :D Thanks for the patience @glemaitre

Member

@jnothman jnothman commented Jun 10, 2017

Nicely resolved, @GaelVaroquaux, and well done all!

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
* resurrect quantile scaler

* move the code in the pre-processing module

* first draft

* Add tests.

* Fix bug in QuantileNormalizer.

* Add quantile_normalizer.

* Implement pickling

* create a specific function for dense transform

* Create a fit function for the dense case

* Create a toy examples

* First draft with sparse matrices

* remove useless functions and non-negative sparse compatibility

* fix slice call

* Fix tests of QuantileNormalizer.

* Fix estimator compatibility

* List of functions became tuple of functions
* Check X consistency at transform and inverse transform time

* fix doc

* Add negative ValueError tests for QuantileNormalizer.

* Fix cosmetics

* Fix compatibility numpy <= 1.8

* Add n_features tests and correct ValueError.

* PEP8

* fix fill_value for early scipy compatibility

* simplify sampling

* Fix tests.

* removing last pring

* Change choice for permutation

* cosmetics

* fix remove remaining choice

* DOC

* Fix inconsistencies

* pep8

* Add checker for init parameters.

* hack bounds and make a test

* FIX/TST bounds are provided by the fitting and not X at transform

* PEP8

* FIX/TST axis should be <= 1

* PEP8

* ENH Add parameter ignore_implicit_zeros

* ENH match output distribution

* ENH clip the data to avoid infinity due to output PDF

* FIX ENH restraint to uniform and norm

* [MRG] ENH Add example comparing the distribution of all scaling preprocessor (scikit-learn#2)

* ENH Add example comparing the distribution of all scaling preprocessor

* Remove Jupyter notebook convert

* FIX/ENH Select feat before not after; Plot interquantile data range for all

* Add heatmap legend

* Remove comment maybe?

* Move doc from robust_scaling to plot_all_scaling; Need to update doc

* Update the doc

* Better aesthetics; Better spacing and plot colormap only at end

* Shameless author re-ordering ;P

* Use env python for she-bang

* TST Validity of output_pdf

* EXA Use OrderedDict; Make it easier to add more transformations

* FIX PEP8 and replace scipy.stats by str in example

* FIX remove useless import

* COSMET change variable names

* FIX change output_pdf occurence to output_distribution

* FIX partial fixies from comments

* COMIT change class name and code structure

* COSMIT change direction to inverse

* FIX factorize transform in _transform_col

* PEP8

* FIX change the magic 10

* FIX add interp1d to fixes

* FIX/TST allow negative entries when ignore_implicit_zeros is True

* FIX use np.interp instead of sp.interpolate.interp1d

* FIX/TST fix tests

* DOC start checking doc

* TST add test to check the behaviour of interp numpy

* TST/EHN Add the possibility to add noise to compute quantile

* FIX factorize quantile computation

* FIX fixes issues

* PEP8

* FIX/DOC correct doc

* TST/DOC improve doc and add random state

* EXA add examples to illustrate the use of smoothing_noise

* FIX/DOC fix some grammar

* DOC fix example

* DOC/EXA make plot titles more succint

* EXA improve explanation

* EXA improve the docstring

* DOC add a bit more documentation

* FIX advance review

* TST add subsampling test

* DOC/TST better example for the docstring

* DOC add ellipsis to docstring

* FIX address olivier comments

* FIX remove random_state in sparse.rand

* FIX spelling doc

* FIX cite example in user guide and docstring

* FIX olivier comments

* EHN improve the example comparing all the pre-processing methods

* FIX/DOC remove title

* FIX change the scaling of the figure

* FIX plotting layout

* FIX ratio w/h

* Reorder and reword the plot_all_scaling example

* Fix aspect ratio and better explanations in the plot_all_scaling.py example

* Fix broken link and remove useless sentence

* FIX fix couples of spelling

* FIX comments joel

* FIX/DOC address documentation comments

* FIX address comments joel

* FIX inline sparse and dense transform

* PEP8

* TST/DOC temporary skipping test

* FIX raise an error if n_quantiles > subsample

* FIX wording in smoothing_noise example

* EXA Denis comments

* FIX rephrasing

* FIX make smoothing_noise to be a boolearn and change doc

* FIX address comments

* FIX verbose the doc slightly more

* PEP8/DOC

* ENH: 2-ways interpolation to avoid smoothing_noise

Simplifies also the code, examples, and documentation
dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
* resurrect quantile scaler

* move the code in the pre-processing module

* first draft

* Add tests.

* Fix bug in QuantileNormalizer.

* Add quantile_normalizer.

* Implement pickling

* create a specific function for dense transform

* Create a fit function for the dense case

* Create a toy examples

* First draft with sparse matrices

* remove useless functions and non-negative sparse compatibility

* fix slice call

* Fix tests of QuantileNormalizer.

* Fix estimator compatibility

* List of functions became tuple of functions
* Check X consistency at transform and inverse transform time

* fix doc

* Add negative ValueError tests for QuantileNormalizer.

* Fix cosmetics

* Fix compatibility numpy <= 1.8

* Add n_features tests and correct ValueError.

* PEP8

* fix fill_value for early scipy compatibility

* simplify sampling

* Fix tests.

* removing last print

* Change choice for permutation

* cosmetics

* fix remove remaining choice

* DOC

* Fix inconsistencies

* pep8

* Add checker for init parameters.

* hack bounds and make a test

* FIX/TST bounds are provided by the fitting and not X at transform

* PEP8

* FIX/TST axis should be <= 1

* PEP8

* ENH Add parameter ignore_implicit_zeros

* ENH match output distribution

* ENH clip the data to avoid infinity due to output PDF

* FIX ENH restrict to uniform and norm

* [MRG] ENH Add example comparing the distribution of all scaling preprocessor (scikit-learn#2)

* ENH Add example comparing the distribution of all scaling preprocessor

* Remove Jupyter notebook convert

* FIX/ENH Select feat before not after; Plot interquantile data range for all

* Add heatmap legend

* Remove comment maybe?

* Move doc from robust_scaling to plot_all_scaling; Need to update doc

* Update the doc

* Better aesthetics; Better spacing and plot colormap only at end

* Shameless author re-ordering ;P

* Use env python for she-bang

* TST Validity of output_pdf

* EXA Use OrderedDict; Make it easier to add more transformations

* FIX PEP8 and replace scipy.stats by str in example

* FIX remove useless import

* COSMIT change variable names

* FIX change output_pdf occurrence to output_distribution

* FIX partial fixes from comments

* COSMIT change class name and code structure

* COSMIT change direction to inverse

* FIX factorize transform in _transform_col

* PEP8

* FIX change the magic 10

* FIX add interp1d to fixes

* FIX/TST allow negative entries when ignore_implicit_zeros is True

* FIX use np.interp instead of sp.interpolate.interp1d

* FIX/TST fix tests

* DOC start checking doc

* TST add test to check the behaviour of interp numpy

* TST/ENH Add the possibility to add noise to compute quantile

* FIX factorize quantile computation

* FIX fixes issues

* PEP8

* FIX/DOC correct doc

* TST/DOC improve doc and add random state

* EXA add examples to illustrate the use of smoothing_noise

* FIX/DOC fix some grammar

* DOC fix example

* DOC/EXA make plot titles more succinct

* EXA improve explanation

* EXA improve the docstring

* DOC add a bit more documentation

* FIX advance review

* TST add subsampling test

* DOC/TST better example for the docstring

* DOC add ellipsis to docstring

* FIX address olivier comments

* FIX remove random_state in sparse.rand

* FIX spelling doc

* FIX cite example in user guide and docstring

* FIX olivier comments

* ENH improve the example comparing all the pre-processing methods

* FIX/DOC remove title

* FIX change the scaling of the figure

* FIX plotting layout

* FIX ratio w/h

* Reorder and reword the plot_all_scaling example

* Fix aspect ratio and better explanations in the plot_all_scaling.py example

* Fix broken link and remove useless sentence

* FIX fix a couple of spelling errors

* FIX comments joel

* FIX/DOC address documentation comments

* FIX address comments joel

* FIX inline sparse and dense transform

* PEP8

* TST/DOC temporary skipping test

* FIX raise an error if n_quantiles > subsample

* FIX wording in smoothing_noise example

* EXA Denis comments

* FIX rephrasing

* FIX make smoothing_noise a boolean and change doc

* FIX address comments

* FIX verbose the doc slightly more

* PEP8/DOC

* ENH: 2-ways interpolation to avoid smoothing_noise

Also simplifies the code, examples, and documentation
AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
output_distribution = self.output_distribution
output_distribution = getattr(stats, output_distribution)

# older version of scipy do not handle tuple as fill_value

lesteve Feb 27, 2018
Member

@glemaitre I bumped into this, while trying to get rid of code related to old numpy/scipy versions that we don't support any more. Do you remember what this is about? I could not figure it out by just looking at the code and searching the PR comments ...

glemaitre Feb 27, 2018
Author Contributor

I think that it should have been removed. At first, I implemented the interpolation using scipy.interpolate.interp1d, which takes a fill_value parameter. In older versions, fill_value does not accept a tuple [min, max], which is what we need.

But right now we are using numpy.interp. We could move to the higher-level scipy interpolation function, but we would need to wait until fill_value accepts a tuple or array-like. I am also not sure it is worth spending time on it :)

lesteve Feb 28, 2018
Member

OK thanks
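
For readers following the thread above, here is a minimal sketch of why numpy.interp can stand in for a (min, max) fill value: by default it clamps out-of-range inputs to the first and last reference values. The quantile values below are hypothetical and chosen only for illustration; this is not the code from the PR.

```python
import numpy as np

# Hypothetical fitted quantiles for a single feature (illustrative values,
# not taken from the PR).
quantiles = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
# Reference levels in [0, 1] that the quantiles are mapped to.
references = np.linspace(0.0, 1.0, num=len(quantiles))

# Values below, at the edge of, inside, and above the fitted range.
X_col = np.array([0.0, 1.0, 3.0, 16.0, 100.0])

# With the default left/right arguments, np.interp returns references[0]
# for inputs below quantiles[0] and references[-1] for inputs above
# quantiles[-1], so no separate fill value is needed.
transformed = np.interp(X_col, quantiles, references)
print(transformed)

# Mapping back swaps the roles of the two arrays; out-of-range inputs
# come back clamped to the fitted minimum and maximum.
recovered = np.interp(transformed, references, quantiles)
print(recovered)
```

The `left` and `right` arguments of numpy.interp can override the clamping values if a different bound is wanted.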
