
Overfitting in PLSRegression #9460

Open
mortonjt opened this issue Jul 28, 2017 · 1 comment

Comments

@mortonjt

Description

It looks like PLSRegression fits noise very well.

I've put together a null dataset simulating the noise that I'd typically expect from my data.
(See here for notebook)

Specifically, I generated the following dataset

[Image: the simulated null dataset]

And I get the following PLS fit

[Image: the resulting PLS fit to the null data]

Note that I have tested this on a simulated dataset with a known signal, and PLSRegression was able to recover the real signal. However, it seems very prone to overfitting when fit to pure noise.

This is well known in the literature, so it's probably not a problem with the code. But it would be really nice to have good cross-validation statistics, or other methods that protect against overfitting. Is there any interest in including these sorts of methods / statistics? This paper lists a few of the statistics for measuring this sort of performance.

Versions

Darwin-15.6.0-x86_64-i386-64bit
Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar 6 2017, 12:15:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.13.0
SciPy 0.19.0
Scikit-Learn 0.18.1

@Gscorreia89

I believe this is a non-issue:
PLS regression is indeed very easy to overfit, and this is particularly bad for PLS-DA-like models.

Scikit-learn's interface already has many options to check for this, and they are easy to apply. In fact, I can tell you (having done it) that with the model-validation tools available in scikit-learn you can quickly implement more flexible and reliable validation procedures than those available in most PLS regression packages around.

So, in the same way that a Support Vector Machine object contains only the algorithm and it's up to the user to build a CV workflow from other scikit-learn objects, it should be the same with PLSRegression. This is particularly helpful for PLS-DA, where scikit-learn's toolkit lets you easily use validation metrics other than accuracy or AUC, which are sometimes more interesting in real-life problems.

For general interest in what is "wrong" with PLS, I will leave these here: https://onlinelibrary.wiley.com/doi/full/10.1002/cem.2602
and https://onlinelibrary.wiley.com/doi/10.1002/cem.3002
