Mean centring issues in PCA/PLS when used in a Pipeline #10605
Comments
Yes, the automated centring on the mean assumes that the data is not already centred, and if it has been centred in a robust way, this is particularly bad. I agree this seems to be an issue, and having a center option would seem to be the simplest solution.
Hi @jnothman, I am new to this community. Can I take this issue?
I think so, but it's a bit early, and it's possible that other core devs will have a different opinion on whether this is a real problem, or the right solution.
PLS and PCA will be sensitive to outliers. If you exclude samples from the mean, you should exclude them from the covariance, i.e. use a robust covariance, and you will end up with a kind of robust PCA/PLS.
I don't think we should change the behavior of PCA/PLS, but rather consider new robust versions.
Or encourage outlier removal by providing facilities for it?
Outlier checking and removal is something we have to consider in these models, and robust covariance is also worth checking, but what I find particularly annoying here is that this breaks the behaviour of the Pipeline object.
I get that. But I think we can only be more emphatic about the inappropriateness of using this when the data has outliers.
Feel free to offer a PR improving the documentation if you think that would help a little.
I just discovered this by looking at the source code of PCA. It isn't documented at all that PCA does this. If the automatic centering were removed, a warning could be issued to make people aware of the difference in behavior.
A PR is welcome to improve the documentation where you find it is needed.
Is there any progress on this? I would really appreciate a switch to turn the centering off. R's implementation in prcomp() has center=TRUE as a default, which can be switched off. I don't see how this could be a bad thing?
Description
The scikit-learn implementations of PCA (sklearn.decomposition.PCA) and the PLS algorithms (sklearn.cross_decomposition._PLS objects, exemplified here with PLSRegression) automatically perform mean centering as part of their fit method.
This can lead to unnecessary or even problematic duplicated centering when these objects are used inside Pipelines, especially with RobustScaler or other scalers that do not center using the arithmetic mean vector.
In summary, the center_/mean_ centering applied by the scaler object is further adjusted by the .mean_/.x_mean_ stored inside these objects during .fit.
I guess having a center option in the init or fit method (kept as True by default) would fix this (it would be set to False when a centering scaler is used in a pipeline), as well as allow more flexible use of these algorithms (e.g., exploring double row/column centring).
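A minimal sketch of the mechanism on synthetic data (my own illustration, not part of the original report): PCA always stores an arithmetic mean in `mean_` and subtracts it during fit, even when the input has already been median-centred by RobustScaler.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

rng = np.random.RandomState(0)
# Skewed data, so the mean and the median differ substantially
X = rng.lognormal(size=(200, 3))

X_robust = RobustScaler().fit_transform(X)  # centred on the median
pca = PCA(n_components=2).fit(X_robust)

# PCA has silently re-centred on the arithmetic mean:
print(pca.mean_)  # clearly nonzero for median-centred skewed data
```

Because `pca.mean_` is just the column-wise arithmetic mean of its input, the median-centred data gets shifted a second time before projection.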
Steps/Code to Reproduce
This can be reproduced by simply using a pipeline with PCA/PLS regression and any scaler placed before them. I have quickly adapted an example with the wine dataset to showcase the double centring with RobustScaler.
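The adapted example itself was not preserved in this copy of the issue; the following is a reconstruction along the lines described (the wine dataset and RobustScaler come from the report, the exact code is mine):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

X = load_wine().data

pipe = Pipeline([("scale", RobustScaler()), ("pca", PCA(n_components=2))])
pipe.fit(X)

# RobustScaler centres on the median; PCA then subtracts its own mean_
# from the already-centred data, i.e. a second, different centring:
second_centre = pipe.named_steps["pca"].mean_
print(second_centre)  # not ~0 for RobustScaler output
```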
Results
For the Pipelines with StandardScaler, the effect is quite "benign", as these scalers use the same centering criterion: the differences are close to numerical tolerance. However, when using RobustScaler, the second "unexpected" centering is no longer negligible.
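The contrast can be quantified with a short sketch (my own, not the original figures): compare the magnitude of the extra centring PCA applies on top of each scaler.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler, StandardScaler

X = load_wine().data

second_centre = {}
for scaler in (StandardScaler(), RobustScaler()):
    X_scaled = scaler.fit_transform(X)
    pca = PCA(n_components=2).fit(X_scaled)
    # largest component of the mean PCA subtracts on top of the scaler
    second_centre[type(scaler).__name__] = np.abs(pca.mean_).max()

print(second_centre)  # StandardScaler: ~machine epsilon; RobustScaler: clearly nonzero
```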
Versions
Please run the following snippet and paste the output below.
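The snippet itself was not preserved in this copy of the issue; judging from the pasted output, it was something along the lines of the scikit-learn issue template of that era (reconstruction, not the original):

```python
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import sklearn; print("Scikit-Learn", sklearn.__version__)
```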
Linux-4.13.0-32-generic-x86_64-with-debian-stretch-sid
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0]
NumPy 1.14.0
SciPy 1.0.0
Scikit-Learn 0.19.1