
Overfitting in PLSRegression #9460

Open
mortonjt opened this issue Jul 28, 2017 · 1 comment

Comments

@mortonjt

Description

It looks like PLSRegression fits noise very well.

I've put together a null dataset simulating the noise that I'd typically expect from my data.
(See here for notebook)

Specifically, I generated the following dataset

[Image: the simulated null dataset]

And I get the following PLS fit

[Image: the resulting PLS fit to the null data]

Note that I have tested this on a simulated dataset with a known signal, and PLSRegression was able to recover the real signal. However, it seems very prone to overfitting when fit to pure noise.

This is well known in the literature, so it's probably not a problem with the code. But it would be really nice to have good cross-validation statistics, or other methods that protect against overfitting. Is there any interest in including these sorts of methods / statistics? This paper lists a few of the statistics for measuring this sort of performance.

Versions

Darwin-15.6.0-x86_64-i386-64bit
Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar 6 2017, 12:15:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.13.0
SciPy 0.19.0
Scikit-Learn 0.18.1

@Gscorreia89

I believe this is a non-issue:
PLS regression is indeed very easy to overfit, and this is particularly bad for PLS-DA-like models.

Scikit-learn's interface already has many options to check for this, and they are easy to apply. In fact, I can tell you (having done it) that with the model-validation tools available in scikit-learn you can quickly implement more flexible and reliable validation procedures than those available in most PLS regression packages around.

So, in the same way that a Support Vector Machine object contains only the algorithm and it's up to the user to build a CV workflow from other scikit-learn objects, it should be the same with PLSRegression. This is particularly helpful for PLS-DA, where scikit-learn's toolkit lets you easily use validation metrics other than accuracy or AUC, which are sometimes more interesting in real-life problems.

For general interest in what is "wrong" with PLS, I will leave these here: https://onlinelibrary.wiley.com/doi/full/10.1002/cem.2602
and https://onlinelibrary.wiley.com/doi/10.1002/cem.3002
