Mean centring issues in PCA/PLS when used in a Pipeline #10605

Open
Gscorreia89 opened this issue Feb 8, 2018 · 11 comments

@Gscorreia89

Description

The scikit-learn implementations of PCA (sklearn.decomposition.PCA) and of the PLS algorithms (the sklearn.cross_decomposition._PLS objects, exemplified here with PLSRegression) automatically perform mean centring as part of their fit method.

This can lead to unnecessary, or even problematic, duplicate centring when these objects are used inside a Pipeline, especially with RobustScaler or any other scaler that does not centre on the arithmetic mean vector.

In summary, the centring applied by the scaler (its center_/mean_) is further adjusted by the .mean_/.x_mean_ computed and stored inside the PCA/PLS objects during .fit.

I guess having a center option in the __init__ or fit method (kept True by default, and always set to False when using a scaler in a pipeline) would fix this, as well as allow more flexible use of these algorithms (e.g., exploring double row/column centring).
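To make the interaction concrete, here is a minimal numpy sketch (illustrative only; the data values are made up) of what happens when a median-centred column is subsequently mean-centred:

import numpy as np

# A skewed column: its median and mean differ noticeably.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# RobustScaler-style centring on the median...
x_robust = x - np.median(x)      # median(x_robust) == 0

# ...followed by the implicit mean centring inside PCA/PLS .fit:
print(x_robust.mean())           # 19.0, far from zero
x_seen_by_model = x_robust - x_robust.mean()

# The model effectively works on mean-centred data again, undoing
# the robust centring requested by the user.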

Steps/Code to Reproduce

This can be reproduced simply by using a pipeline with PCA/PLS regression preceded by any scaler. I have quickly adapted an example on the wine dataset to showcase the double centring with RobustScaler.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline

RANDOM_STATE = 42

features, target = load_wine(return_X_y=True)

# Make a train/test split using 30% test size
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.30, random_state=RANDOM_STATE)

# Fit to data - PCA.
unscaled_clf = make_pipeline(PCA(n_components=2))
unscaled_clf.fit(X_train)

# Fit to data - PLSRegression
unscaled_pls = make_pipeline(PLSRegression(n_components=2, copy=False, scale=False))
unscaled_pls.fit(X_train, y_train)

# Fit to data - StandardScaler and PCA pipeline.
std_clf = make_pipeline(StandardScaler(), PCA(n_components=2))
std_clf.fit(X_train)

# Fit to data - StandardScaler and PLSRegression pipeline.
std_pls = make_pipeline(StandardScaler(), PLSRegression(n_components=2, copy=False, scale=False))
std_pls.fit(X_train, y_train)

# Fit to data - RobustScaler and PCA pipeline.
rob_clf = make_pipeline(RobustScaler(), PCA(n_components=2))
rob_clf.fit(X_train)

# Fit to data - RobustScaler and PLSRegression pipeline.
rob_pls = make_pipeline(RobustScaler(), PLSRegression(n_components=2, copy=False, scale=False))
rob_pls.fit(X_train, y_train)

# PCA 
# Mean center vector calculated when using .fit
print(unscaled_clf.named_steps['pca'].mean_)

# Pipeline StandardScaler + PCA
# Mean center vector from the scaler 
print(std_clf.named_steps['standardscaler'].mean_)
# Mean center vector inside PCA
print(std_clf.named_steps['pca'].mean_)

# Pipeline - RobustScaler + PCA
print(rob_clf.named_steps['robustscaler'].center_)
print(rob_clf.named_steps['pca'].mean_)

# PLSRegression
print(unscaled_pls.named_steps['plsregression'].x_mean_)

# Pipeline - StandardScaler + PLSRegression
print(std_pls.named_steps['standardscaler'].mean_)
print(std_pls.named_steps['plsregression'].x_mean_)

# Pipeline - RobustScaler + PLSRegression
print(rob_pls.named_steps['robustscaler'].center_)
print(rob_pls.named_steps['plsregression'].x_mean_)

Results

For the pipelines with StandardScaler, the effect is fairly benign, since the scaler and the estimator centre using the same criterion (the arithmetic mean); the differences are close to numerical tolerance. However, when using RobustScaler, the second, unexpected centring is no longer negligible, because the median used by RobustScaler and the mean generally differ.
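This is easy to verify on the fitted pipelines above: the mean stored inside PCA is essentially zero after StandardScaler, but clearly non-zero after RobustScaler (the exact numbers depend on the train split):

import numpy as np

# After StandardScaler the data already has zero mean, so the mean
# recomputed inside PCA is zero up to floating-point noise:
print(np.abs(std_clf.named_steps['pca'].mean_).max())   # ~1e-16

# After RobustScaler (median centring) the data still has a
# non-trivial mean, so PCA shifts it a second time:
print(np.abs(rob_clf.named_steps['pca'].mean_).max())   # clearly non-zero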

Versions

Linux-4.13.0-32-generic-x86_64-with-debian-stretch-sid
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0]
NumPy 1.14.0
SciPy 1.0.0
Scikit-Learn 0.19.1

@jnothman
Member

jnothman commented Feb 8, 2018

Yes, the automated centring on the mean assumes that the data is not already centred, and if it has been centred in a robust way, this is particularly bad. I agree this seems to be an issue, and having a center option would seem to be the simplest solution.
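For reference, a minimal sketch of what such an opt-out could look like (hypothetical, not scikit-learn API; when the data is already centred upstream, PCA without centring reduces to a plain truncated SVD of X):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PCANoCenter(BaseEstimator, TransformerMixin):
    """PCA-like transformer that trusts the data to be pre-centred.

    Hypothetical sketch of a PCA(center=False) behaviour, not an
    existing scikit-learn class.
    """

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X, y=None):
        # No mean subtraction here, unlike sklearn.decomposition.PCA.
        _, _, Vt = np.linalg.svd(np.asarray(X), full_matrices=False)
        self.components_ = Vt[:self.n_components]
        return self

    def transform(self, X):
        return np.asarray(X) @ self.components_.T

Used as make_pipeline(RobustScaler(), PCANoCenter(n_components=2)), this keeps the robust centring intact end to end.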

@jnothman jnothman added the Bug, Easy, help wanted, and good first issue labels Feb 8, 2018
@kayush2O6

Hi @jnothman, I am new to this community. Can I take this issue?

@jnothman
Member

jnothman commented Feb 8, 2018 via email

@agramfort
Member

agramfort commented Feb 8, 2018 via email

@jnothman
Member

jnothman commented Feb 8, 2018 via email

@jnothman jnothman removed the Bug, Easy, good first issue, and help wanted labels Feb 8, 2018
@Gscorreia89
Author

Outlier checking and removal is something we have to consider in these models, and robust covariance is also worth checking, but what I find particularly annoying here is that it breaks the behaviour of the Pipeline object.
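Concretely, the pipeline's PCA step applies its own shift on top of the scaler's output, which can be checked against a manual projection (a sketch reusing the rob_clf pipeline fitted above; with whiten=False, PCA.transform computes (X - mean_) @ components_.T):

import numpy as np

Xs = rob_clf.named_steps['robustscaler'].transform(X_test)
pca = rob_clf.named_steps['pca']
manual = (Xs - pca.mean_) @ pca.components_.T

# The pipeline output includes the extra mean shift applied by PCA:
assert np.allclose(manual, rob_clf.transform(X_test))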

@jnothman
Member

jnothman commented Feb 8, 2018

I get that. But I think we can only be more emphatic about the inappropriateness of using this when the data has outliers.

@jnothman
Member

jnothman commented Feb 8, 2018

Feel free to offer a PR improving the documentation if you think that would help a little.

@ozancaglayan

I just discovered this by looking at the source code of PCA. It is not documented at all that PCA's fit method does this. I think this should at least be documented.

If the implicit centring were ever removed, a warning could be issued to make people aware of the difference in behaviour.

@jnothman
Member

jnothman commented Oct 19, 2018 via email

@mhw-hermes

Is there any progress on this? I would really appreciate a handle to turn the centring off. R's implementation in prcomp() has center = TRUE as a default, which can be switched off; I don't see how this could be a bad thing.
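A workaround available today, assuming the scaler already centres the data: swap PCA for sklearn.decomposition.TruncatedSVD, which by design performs no centring (roughly prcomp(center = FALSE), up to sign conventions):

from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# TruncatedSVD does not centre its input, so the robust centring
# from RobustScaler is preserved end to end.
rob_svd = make_pipeline(RobustScaler(), TruncatedSVD(n_components=2))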
