[MRG+1] QuantileTransformer #8363
Changes from 102 commits
@@ -10,6 +10,13 @@ The ``sklearn.preprocessing`` package provides several common
utility functions and transformer classes to change raw feature vectors
into a representation that is more suitable for the downstream estimators.

In general, learning algorithms benefit from standardization of the data set. If
some outliers are present in the set, robust scalers or transformers are more
appropriate. The behaviors of the different scalers, transformers, and
normalizers on a dataset containing marginal outliers are highlighted in
:ref:`sphx_glr_auto_examples_preprocessing_plot_all_scaling.py`.

.. _preprocessing_scaler:

Standardization, or mean removal and variance scaling
@@ -39,10 +46,10 @@ operation on a single array-like dataset::

>>> from sklearn import preprocessing
>>> import numpy as np
- >>> X = np.array([[ 1., -1., 2.],
- ...               [ 2., 0., 0.],
- ...               [ 0., 1., -1.]])
- >>> X_scaled = preprocessing.scale(X)
+ >>> X_train = np.array([[ 1., -1., 2.],
+ ...                     [ 2., 0., 0.],
+ ...                     [ 0., 1., -1.]])
+ >>> X_scaled = preprocessing.scale(X_train)

>>> X_scaled # doctest: +ELLIPSIS
array([[ 0. ..., -1.22..., 1.33...],
@@ -71,7 +78,7 @@ able to later reapply the same transformation on the testing set.
This class is hence suitable for use in the early steps of a
:class:`sklearn.pipeline.Pipeline`::

- >>> scaler = preprocessing.StandardScaler().fit(X)
+ >>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)
@@ -81,7 +88,7 @@ This class is hence suitable for use in the early steps of a
>>> scaler.scale_ # doctest: +ELLIPSIS
array([ 0.81..., 0.81..., 1.24...])

- >>> scaler.transform(X) # doctest: +ELLIPSIS
+ >>> scaler.transform(X_train) # doctest: +ELLIPSIS
array([[ 0. ..., -1.22..., 1.33...],
       [ 1.22..., 0. ..., -0.26...],
       [-1.22..., 1.22..., -1.06...]])
@@ -90,7 +97,8 @@ This class is hence suitable for use in the early steps of a
The scaler instance can then be used on new data to transform it the
same way it did on the training set::

- >>> scaler.transform([[-1., 1., 0.]]) # doctest: +ELLIPSIS
+ >>> X_test = [[-1., 1., 0.]]
+ >>> scaler.transform(X_test) # doctest: +ELLIPSIS
array([[-2.44..., 1.22..., -0.26...]])

It is possible to disable either centering or scaling by either
@@ -248,6 +256,66 @@ a :class:`KernelCenterer` can transform the kernel matrix
so that it contains inner products in the feature space
defined by :math:`\phi` followed by removal of the mean in that space.

.. _preprocessing_transformer:

Non-linear transformation
=========================
Like scalers, :class:`QuantileTransformer` puts each feature into the same
range or distribution. However, by performing a rank transformation, it smooths
out unusual distributions and is less influenced by outliers than scaling
methods. It does, however, distort correlations and distances within and across
features.
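To make the robustness claim above concrete, here is a small sketch that is not part of the diff: it injects one extreme value into an otherwise well-behaved feature and compares standard scaling with the rank-based quantile transform. The QuantileTransformer API proposed in this PR and the ``n_quantiles`` parameter are assumed here.

import numpy as np
from sklearn.preprocessing import StandardScaler, QuantileTransformer

rng = np.random.RandomState(0)
X = rng.normal(loc=10., scale=2., size=(100, 1))
X[0, 0] = 1000.  # one extreme outlier

X_std = StandardScaler().fit_transform(X)
X_quant = QuantileTransformer(n_quantiles=100).fit_transform(X)

# The outlier inflates the mean and standard deviation, so the 99 regular
# samples are squeezed into a narrow band just below 0, while the rank-based
# transform still spreads them over most of [0, 1].
print(X_std[1:].min(), X_std[1:].max())      # narrow band below 0
print(X_quant[1:].min(), X_quant[1:].max())  # close to [0, 0.99]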
:class:`QuantileTransformer` and :func:`quantile_transform` provide a
non-parametric transformation based on the quantile function to map the data to
a uniform distribution with values between 0 and 1::
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
>>> quantile_transformer = preprocessing.QuantileTransformer(
...     smoothing_noise=1e-12)
>>> X_train_trans = quantile_transformer.fit_transform(X_train)
>>> X_test_trans = quantile_transformer.transform(X_test)
Review comment: This doesn't show anything but usage. Can we put some kind of assertion regarding the transformed data here, e.g. report
>>> np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])
... # doctest: +ELLIPSIS, +SKIP
array([...])
>>> np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])
... # doctest: +ELLIPSIS, +SKIP
array([...])
>>> np.percentile(X_test[:, 0], [0, 25, 50, 75, 100])
... # doctest: +ELLIPSIS, +SKIP
array([...])
>>> np.percentile(X_test_trans[:, 0], [0, 25, 50, 75, 100])
... # doctest: +ELLIPSIS, +SKIP
array([...])
Review comment: I don't see the point of having doctests with "array([...])" as the output. I think we should display the actual percentiles up to 2 digits. We need to fix the

Review comment: It was due to some numpy < 1.8 support (it was skipped for the moment)

Review comment: I get non deterministic results even if I fix the
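In the spirit of the review comments above, here is a sketch of the kind of assertion that could accompany the example; it is not part of the diff. The exact percentiles depend on the train_test_split seed, so only approximate properties of the uniform output are checked, and the QuantileTransformer defaults (including ``n_quantiles``) are assumed.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

qt = QuantileTransformer()
X_train_trans = qt.fit_transform(X_train)
X_test_trans = qt.transform(X_test)

# The training quartiles map close to the quartiles of a uniform
# distribution on [0, 1]; the tolerance is loose because the iris features
# contain many tied values.
train_quartiles = np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])
assert np.allclose(train_quartiles, [0., 0.25, 0.5, 0.75, 1.], atol=0.1)

# Test-set values are interpolated from the training quantiles, so they are
# only approximately uniform, but they always stay within [0, 1].
assert X_test_trans.min() >= 0. and X_test_trans.max() <= 1.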
It is also possible to map the transformed data to a normal distribution by
setting ``output_distribution='normal'``::
>>> quantile_transformer = preprocessing.QuantileTransformer(
...     smoothing_noise=True, output_distribution='normal')

Review comment: Let's not put
>>> quantile_transformer = preprocessing.QuantileTransformer(
...     output_distribution='normal')
>>> X_trans = quantile_transformer.fit_transform(X)
>>> quantile_transformer.quantiles_ # doctest: +ELLIPSIS
array([...])
Review comment: Say "Thus the median of the input becomes the mean of the output, centered at 0. The normal output is clipped so that the input's maximum and minimum do not become infinite under the transformation."
Thus the median of the input becomes the mean of the output, centered at 0. The
normal output is clipped so that the input's minimum and maximum
(corresponding to the 1e-7 and 1 - 1e-7 quantiles respectively) do not
become infinite under the transformation.
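A short sketch, not part of the diff, checking the two properties just described on the iris data; it assumes the PR's QuantileTransformer and leaves ``smoothing_noise`` at its default.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import QuantileTransformer

X, y = load_iris(return_X_y=True)
qt = QuantileTransformer(output_distribution='normal')
X_trans = qt.fit_transform(X)

# The per-feature medians of the input map close to 0, the centre of the
# standard normal distribution.
print(np.median(X_trans, axis=0))  # values near 0

# The extremes stay finite: the 0 and 1 quantiles are clipped to 1e-7 and
# 1 - 1e-7 before the inverse normal CDF is applied, which corresponds to
# roughly -5.2 and +5.2.
assert np.isfinite(X_trans).all()
print(X_trans.min(axis=0), X_trans.max(axis=0))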
:class:`QuantileTransformer` provides a ``smoothing_noise`` parameter to
make the interpretation more intuitive when inspecting the
transformation. This is particularly useful when feature values are
replicated identically many times in the training set (e.g. prices, ordinal
values such as user ratings, coarse-grained units of time, etc.). See
:ref:`sphx_glr_auto_examples_preprocessing_plot_smoothing_noise_quantile_transform.py`
for more details.
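The following sketch only mirrors the scenario this paragraph describes. The ``smoothing_noise`` parameter is specific to this PR (it may not exist in released scikit-learn versions), and the expected effect is taken from the PR's description rather than verified output.

import numpy as np
from sklearn.preprocessing import QuantileTransformer

# A "price"-like feature in which one value is replicated many times.
prices = np.array([1.0] * 5 + [2.0] * 90 + [3.0] * 5).reshape(-1, 1)

qt_plain = QuantileTransformer()
qt_noisy = QuantileTransformer(smoothing_noise=1e-12)  # as used in the diff above

# Without noise, the heavily tied value 2.0 is mapped to a single point
# somewhere inside the wide quantile block it occupies; with the tiny
# smoothing noise, the PR argues, the mapped value is easier to relate to
# the value's rank when inspecting the transformation.
print(qt_plain.fit_transform(prices)[5, 0])
print(qt_noisy.fit_transform(prices)[5, 0])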
.. _preprocessing_normalization:

Normalization
Review comment: This should just be: