Mean centring issues in PCA/PLS when used in a Pipeline #10605

Open
Gscorreia89 opened this issue Feb 8, 2018 · 11 comments

@Gscorreia89

Description

The scikit-learn implementations of PCA (sklearn.decomposition.PCA) and of the PLS algorithms (the sklearn.cross_decomposition._PLS objects, exemplified here with PLSRegression) automatically perform mean centring as part of their fit method.

This can lead to unnecessary, or even problematic, duplicate centring when these objects are used inside a Pipeline, especially with RobustScaler or any other scaler that does not centre on the arithmetic mean vector.

In summary, the centring applied by the scaler (its center_/mean_) is further adjusted by the .mean_/.x_mean_ computed and stored inside the PCA/PLS objects during .fit.

I guess having a center option in the __init__ or fit method (kept True by default, and always set to False when using a scaler in a pipeline) would fix this, as well as allow more flexible use of these algorithms (e.g., exploring double row/column centring).
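To make the interaction concrete, here is a minimal numpy sketch (illustrative only; the data values are made up) of what happens when a median-centred column is subsequently mean-centred:

import numpy as np

# A skewed column: its median and mean differ noticeably.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# RobustScaler-style centring on the median...
x_robust = x - np.median(x)      # median(x_robust) == 0

# ...followed by the implicit mean centring inside PCA/PLS .fit:
print(x_robust.mean())           # 19.0, far from zero
x_seen_by_model = x_robust - x_robust.mean()

# The model effectively works on mean-centred data again, undoing
# the robust centring requested by the user.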

Steps/Code to Reproduce

This can be reproduced simply by using a pipeline with PCA/PLS regression preceded by any scaler. I have quickly adapted an example on the wine dataset to showcase the double centring with RobustScaler.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline

RANDOM_STATE = 42

features, target = load_wine(return_X_y=True)

# Make a train/test split using 30% test size
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.30, random_state=RANDOM_STATE)

# Fit to data - PCA.
unscaled_clf = make_pipeline(PCA(n_components=2))
unscaled_clf.fit(X_train)

# Fit to data - PLSRegression
unscaled_pls = make_pipeline(PLSRegression(n_components=2, copy=False, scale=False))
unscaled_pls.fit(X_train, y_train)

# Fit to data - StandardScaler and PCA pipeline.
std_clf = make_pipeline(StandardScaler(), PCA(n_components=2))
std_clf.fit(X_train)

# Fit to data - StandardScaler and PLSRegression pipeline.
std_pls = make_pipeline(StandardScaler(), PLSRegression(n_components=2, copy=False, scale=False))
std_pls.fit(X_train, y_train)

# Fit to data - RobustScaler and PCA pipeline.
rob_clf = make_pipeline(RobustScaler(), PCA(n_components=2))
rob_clf.fit(X_train)

# Fit to data - RobustScaler and PLSRegression pipeline.
rob_pls = make_pipeline(RobustScaler(), PLSRegression(n_components=2, copy=False, scale=False))
rob_pls.fit(X_train, y_train)

# PCA 
# Mean center vector calculated when using .fit
print(unscaled_clf.named_steps['pca'].mean_)

# Pipeline StandardScaler + PCA
# Mean center vector from the scaler 
print(std_clf.named_steps['standardscaler'].mean_)
# Mean center vector inside PCA
print(std_clf.named_steps['pca'].mean_)

# Pipeline - RobustScaler + PCA
print(rob_clf.named_steps['robustscaler'].center_)
print(rob_clf.named_steps['pca'].mean_)

# PLSRegression
print(unscaled_pls.named_steps['plsregression'].x_mean_)

# Pipeline - StandardScaler + PLSRegression
print(std_pls.named_steps['standardscaler'].mean_)
print(std_pls.named_steps['plsregression'].x_mean_)

# Pipeline - RobustScaler + PLSRegression
print(rob_pls.named_steps['robustscaler'].center_)
print(rob_pls.named_steps['plsregression'].x_mean_)

Results

For the pipelines with StandardScaler, the effect is fairly benign, since the scaler and the estimator centre using the same criterion (the arithmetic mean); the differences are close to numerical tolerance. However, when using RobustScaler, the second, unexpected centring is no longer negligible, because the median used by RobustScaler and the mean generally differ.
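This is easy to verify on the fitted pipelines above: the mean stored inside PCA is essentially zero after StandardScaler, but clearly non-zero after RobustScaler (the exact numbers depend on the train split):

import numpy as np

# After StandardScaler the data already has zero mean, so the mean
# recomputed inside PCA is zero up to floating-point noise:
print(np.abs(std_clf.named_steps['pca'].mean_).max())   # ~1e-16

# After RobustScaler (median centring) the data still has a
# non-trivial mean, so PCA shifts it a second time:
print(np.abs(rob_clf.named_steps['pca'].mean_).max())   # clearly non-zero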

Versions

Linux-4.13.0-32-generic-x86_64-with-debian-stretch-sid
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0]
NumPy 1.14.0
SciPy 1.0.0
Scikit-Learn 0.19.1

@jnothman
Member

jnothman commented Feb 8, 2018

Yes, the automated centring on the mean assumes that the data is not already centred, and if it has been centred in a robust way, this is particularly bad. I agree this seems to be an issue, and having a center option would seem to be the simplest solution.
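For reference, a minimal sketch of what such an opt-out could look like (hypothetical, not scikit-learn API; when the data is already centred upstream, PCA without centring reduces to a plain truncated SVD of X):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PCANoCenter(BaseEstimator, TransformerMixin):
    """PCA-like transformer that trusts the data to be pre-centred.

    Hypothetical sketch of a PCA(center=False) behaviour, not an
    existing scikit-learn class.
    """

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X, y=None):
        # No mean subtraction here, unlike sklearn.decomposition.PCA.
        _, _, Vt = np.linalg.svd(np.asarray(X), full_matrices=False)
        self.components_ = Vt[:self.n_components]
        return self

    def transform(self, X):
        return np.asarray(X) @ self.components_.T

Used as make_pipeline(RobustScaler(), PCANoCenter(n_components=2)), this keeps the robust centring intact end to end.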

@jnothman jnothman added the Bug, Easy, help wanted, and good first issue labels Feb 8, 2018
@kayush2O6

Hi @jnothman, I am new to this community. Can I take this issue?

@jnothman
Member

jnothman commented Feb 8, 2018 via email

@agramfort
Member

agramfort commented Feb 8, 2018 via email

@jnothman
Member

jnothman commented Feb 8, 2018 via email

@jnothman jnothman removed the Bug, Easy, good first issue, and help wanted labels Feb 8, 2018
@Gscorreia89
Author

Outlier checking and removal is something we have to consider in these models, and robust covariance is also worth checking, but what I find particularly annoying here is that it breaks the behaviour of the Pipeline object.
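Concretely, the pipeline's PCA step applies its own shift on top of the scaler's output, which can be checked against a manual projection (a sketch reusing the rob_clf pipeline fitted above; with whiten=False, PCA.transform computes (X - mean_) @ components_.T):

import numpy as np

Xs = rob_clf.named_steps['robustscaler'].transform(X_test)
pca = rob_clf.named_steps['pca']
manual = (Xs - pca.mean_) @ pca.components_.T

# The pipeline output includes the extra mean shift applied by PCA:
assert np.allclose(manual, rob_clf.transform(X_test))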

@jnothman
Member

jnothman commented Feb 8, 2018

I get that. But I think we can only be more emphatic about the inappropriateness of using this when the data has outliers.

@jnothman
Member

jnothman commented Feb 8, 2018

Feel free to offer a PR improving the documentation if you think that would help a little.

@ozancaglayan

I just discovered this by looking at the source code of PCA. It is not documented at all that PCA's fit method does this. I think this should at least be documented.

If the implicit centring were ever removed, a warning could be issued to make people aware of the difference in behaviour.

@jnothman
Member

jnothman commented Oct 19, 2018 via email

@mhw-hermes

Is there any progress on this? I would really appreciate a handle to turn the centring off. R's implementation in prcomp() has center = TRUE as a default, which can be switched off; I don't see how this could be a bad thing.
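A workaround available today, assuming the scaler already centres the data: swap PCA for sklearn.decomposition.TruncatedSVD, which by design performs no centring (roughly prcomp(center = FALSE), up to sign conventions):

from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# TruncatedSVD does not centre its input, so the robust centring
# from RobustScaler is preserved end to end.
rob_svd = make_pipeline(RobustScaler(), TruncatedSVD(n_components=2))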
