Question about preProcOptions in trainControl #335

Closed
ajing opened this issue Nov 30, 2015 · 7 comments

@ajing

ajing commented Nov 30, 2015

I wonder how preprocessing is handled during training (i.e., inside the train function). If the method in trainControl is "cv", is the preprocessing applied to the whole training set, or to each portion of the data after it has been split for cross-validation? I had been applying preprocessing, such as PCA, to the whole training set before calling train. I think that may lead to data leakage in the cross-validation step that follows.

@topepo
Owner

topepo commented Nov 30, 2015

No leakage! =]

All pre-processing is estimated on the resampled version of the data (e.g., the 90% used for fitting in 10-fold CV), and those calculations are then applied to the holdouts (the remaining 10%) with no re-calculation.

For example, if you use 50 bootstrap samples, the pre-processing that you choose is applied 50 separate times (plus once more for the entire data set at the very end).
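To make the counting concrete, here is a minimal sketch of the idea in plain Python. It is not caret's implementation: the helper names (`fit_center_scale`, `apply_center_scale`, `preprocess_within_cv`) are hypothetical, and centering and scaling stands in for whatever pre-processing you choose. The parameters are estimated on the training portion of each fold, reused unchanged on the holdout, and estimated once more on the full data set at the end.

```python
import random
import statistics

def fit_center_scale(rows):
    """Estimate per-column mean and standard deviation from these rows only."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.stdev(c) or 1.0 for c in cols]  # guard against zero spread
    return means, sds

def apply_center_scale(rows, params):
    """Apply previously estimated parameters; nothing is re-estimated here."""
    means, sds = params
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def k_fold_indices(n, k, seed=0):
    """Shuffle row indices and deal them into k holdout folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def preprocess_within_cv(rows, k=10):
    fits = 0
    folds = []
    for holdout_idx in k_fold_indices(len(rows), k):
        holdout_set = set(holdout_idx)
        train = [r for i, r in enumerate(rows) if i not in holdout_set]
        holdout = [rows[i] for i in holdout_idx]
        params = fit_center_scale(train)  # estimated on the training portion only
        fits += 1
        folds.append((apply_center_scale(train, params),
                      apply_center_scale(holdout, params)))
    final_params = fit_center_scale(rows)  # once more on the entire data set
    fits += 1
    return folds, final_params, fits

rng = random.Random(1)
data = [[rng.gauss(5, 2), rng.gauss(-3, 0.5)] for _ in range(100)]
folds, final_params, fits = preprocess_within_cv(data, k=10)
print(fits)  # 10 folds + 1 final fit = 11
```

With 50 bootstrap samples instead of 10 folds, the same loop would estimate the pre-processing 51 times. Because the holdout rows never contribute to `params`, they cannot leak into the transformation.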

Also, if any subsampling is done to deal with a class imbalance, that is also done within each resample. There is a programmatic choice to do the pre-processing before or after the subsampling.
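The order matters because it changes which rows the pre-processing is estimated from. The sketch below illustrates that in plain Python under stated assumptions: it is not caret code, `down_sample`, `prepare_resample`, and the `sample_first` flag are hypothetical names, and mean-centering stands in for the pre-processing. In caret itself, the subsampling documentation describes a `first` element in the list form of trainControl's `sampling` argument for this choice; check that against the current docs before relying on it.

```python
import random
import statistics

def down_sample(rows, labels, rng):
    """Randomly drop rows so every class has as many rows as the rarest class."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, n_min):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

def fit_center(rows):
    """Estimate per-column means from these rows only."""
    return [statistics.fmean(c) for c in zip(*rows)]

def apply_center(rows, means):
    return [[v - m for v, m in zip(row, means)] for row in rows]

def prepare_resample(train_rows, train_labels, holdout_rows, sample_first, seed=0):
    """Subsample and center the training portion of one resample.

    The holdout is never subsampled; it only receives the training means.
    """
    rng = random.Random(seed)
    if sample_first:
        train_rows, train_labels = down_sample(train_rows, train_labels, rng)
        means = fit_center(train_rows)  # estimated on the balanced rows
    else:
        means = fit_center(train_rows)  # estimated on the imbalanced rows
        train_rows, train_labels = down_sample(train_rows, train_labels, rng)
    return (apply_center(train_rows, means), train_labels,
            apply_center(holdout_rows, means), means)

# Toy training portion: 90 rows of class "a" at 0.0 and 10 rows of class "b" at 10.0.
rows = [[0.0]] * 90 + [[10.0]] * 10
labels = ["a"] * 90 + ["b"] * 10
holdout = [[0.0], [10.0]]

_, labels_first, _, means_first = prepare_resample(rows, labels, holdout, sample_first=True)
_, labels_after, _, means_after = prepare_resample(rows, labels, holdout, sample_first=False)
print(means_first, means_after)  # [5.0] [1.0]
```

Subsampling first centers the data on the balanced classes, while pre-processing first centers it on the original class mix. Either way the work happens inside the resample, so the holdout stays untouched.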

@ajing
Author

ajing commented Dec 2, 2015

Thanks for the explanation! What if I want to use t-SNE for preprocessing? I didn't see that as an option.

@zachmayer
Collaborator

It's not an option because none of the R packages for t-SNE can make predictions on new data (as far as I know). If you know how to do it, feel free to make a pull request to add it to preProcess.

@zachmayer
Collaborator

It looks like they're adding Barnes-Hut t-SNE to scikit-learn, which can predict on new data. Someone should port this to R!

@ajing
Author

ajing commented Dec 2, 2015

You are right. Is there any way to port it to R?

@zachmayer
Collaborator

It turns out there's already an R implementation in the Rtsne package.

I opened a feature request for a predict method. We'll see what happens!

topepo closed this as completed Dec 18, 2015

@howewenann

"There is a programmatic choice to do the pre-processing before or after the subsampling."
How do you do this? Can't seem to find it in the documentation.
