It looks like PLSRegression is fitting noise really well
I've put together a null dataset simulating the noise that I'd typically expect from my dataset.
(See here for notebook)
Specifically, I generated the following dataset
And I get the following PLS fit
Note that I have tested this on a simulated dataset with a known signal, and it was able to recover the real signal. However, PLSRegression seems very prone to overfitting when tested on pure noise.
This is well known in the literature, so it's probably not a problem with the code. But it would be really nice to have some good cross-validation statistics, or other methods that protect against overfitting. Is there any interest in including these sorts of methods / statistics? This paper lists a few of the statistics for measuring this sort of performance.
I believe this is a non-issue:
PLS regression is indeed very easy to overfit, and this is particularly bad for PLS-DA-like models.
Scikit-learn's interface has a lot of options to help check this, and they are easy to put to use. In fact, I can tell you (having done it) that with the model validation tools available in scikit-learn you can quickly implement more flexible and reliable validation procedures than those available in most PLS regression packages around.
So, in the same way that a Support Vector Machine object contains only the algorithm and it's up to the user to select a CV workflow using other scikit-learn objects, it should be the same with PLSRegression. This is particularly helpful for PLS-DA, where scikit-learn's toolkit makes it easy to use validation metrics other than accuracy or AUC, which are sometimes more interesting in real-life problems.
Versions
Darwin-15.6.0-x86_64-i386-64bit
Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar 6 2017, 12:15:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.13.0
SciPy 0.19.0
Scikit-Learn 0.18.1