Question about preProcOptions in trainControl #335
No leakage! =] All pre-processing is applied on the resampled version of the data (e.g. the 90% in 10-fold CV) and then those calculations are applied to the holdouts (the remaining 10%) with no re-calculation. For example, if you use 50 bootstrap samples, the pre-processing that you choose is applied 50 separate times (plus once more for the entire data set at the very end). Also, if any subsampling is done to deal with a class imbalance, that is also done within each resample. There is a programmatic choice to do the pre-processing before or after the subsampling.
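The fold-wise recipe described above can be sketched in plain NumPy (a hedged illustration of the idea, not caret's internal code; the data, fold layout, and centering/scaling step are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy data, 100 rows

n_folds = 10
folds = np.array_split(np.arange(len(X)), n_folds)

for holdout_idx in folds:
    train_idx = np.setdiff1d(np.arange(len(X)), holdout_idx)
    # Estimate the pre-processing (here: centering/scaling) on the
    # 90% analysis set only -- once per resample, as described above.
    mu = X[train_idx].mean(axis=0)
    sigma = X[train_idx].std(axis=0)
    # Apply those same statistics to the 10% holdout; nothing is
    # re-estimated from the holdout rows.
    X_holdout = (X[holdout_idx] - mu) / sigma
```

The key point is that `mu` and `sigma` are recomputed inside every resample, so the holdout rows never influence the transform that is applied to them.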
Thanks for the explanation! What if I want to use t-SNE for preprocessing? I didn't see that as an option.
It's not an option because none of the R packages for t-SNE can make predictions on new data (as far as I know). If you know how to do it, feel free to make a pull request to add it to preProcess.
It looks like they're adding Barnes-Hut t-SNE to scikit-learn, which can predict on new data. Someone should port this to R!
You are right. Is there any way to port it to R?
It turns out there's already an R implementation in the Rtsne package. I opened a feature request for a predict method. We'll see what happens! |
"There is a programmatic choice to do the pre-processing before or after the subsampling."
I wonder how preprocessing is handled in the training process (i.e., in the train function). If the method in trainControl is "cv", is preprocessing applied to the whole training data, or only to the analysis portion of each cross-validation split? I was doing preprocessing, such as PCA, on the whole training set before running train. I think that may lead to data leakage in the subsequent cross-validation step.
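The leakage concern in the last comment can be made concrete with a small NumPy sketch (an assumed toy setup, not the poster's actual data): when the transform's statistics are estimated on the full training set, the holdout rows influence the transform that is later applied to them.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X[45:] += 10.0                 # the rows that will serve as the CV holdout are shifted

train = np.arange(0, 45)

# Leaky: statistics estimated on ALL rows, including the future holdout.
leaky_mean = X.mean(axis=0)

# Leakage-free: statistics estimated on the analysis rows only,
# which mirrors what happens when the pre-processing is done
# inside each resample.
clean_mean = X[train].mean(axis=0)

# The shifted holdout rows have pulled the leaky mean away from the
# training-only mean -- holdout information has leaked into the transform.
print(np.abs(leaky_mean - clean_mean).max())
```

Passing preProcess to train() (rather than transforming the data beforehand) avoids exactly this: the transform is re-estimated within each resample, so the holdout never contributes to it.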