Setting index in trainControl invalidates repeats/numbers arguments. #584

Closed
hadjipantelis opened this issue Jan 31, 2017 · 2 comments

@hadjipantelis hadjipantelis commented Jan 31, 2017

Hello,

I am not sure this is truly a bug, but the behaviour is somewhat unexpected. If one sets the index argument of trainControl, the repeats and/or number arguments are ignored. This can be misleading if you want to run a short simulation. For example, in the following script train will run as many resamples as there are elements in the list of training indexes (35 here) instead of repeats * number (4 * 5 = 20).

set.seed(123)
library(caret)
N = 1000;
myDF <- data.frame(predVal =as.factor(
                     sapply( 1:N, function(u) paste0(collapse = '', 'target', sample(10,1)))),
                   x1 = runif(N), x2 = runif(N), x3 = runif(N), x4 = runif(N), x5 = runif(N), 
                   x6 = runif(N), x7 = runif(N), x8 = runif(N), x9 = runif(N), x10 = runif(N), 
                   userID = as.factor(
                     sapply(1:N, function(u) paste0(collapse = '', 'user', sample(100,1)))))
                   
Nusers = length(unique(myDF$userID))
userIDs = lapply(unique(myDF$userID), function(u) which(myDF$userID == u))

kOfFolds = 5; propForTest = 0.3

Ntest = round(propForTest * Nusers)

testUserIDs = unlist(userIDs[1:Ntest])
trainUserIDs = unlist(userIDs[(1+Ntest):Nusers])

indexForCT = sapply(simplify = FALSE, 1:35, function(u){
  set.seed(1+u); 
  unlist(userIDs[sample((1+Ntest):Nusers, round((1-(1/kOfFolds))*(Nusers-Ntest)))]) })

ctrl2 = trainControl(method='repeatedcv', verboseIter = TRUE, repeats = 4, number= 5,
                     classProbs = TRUE, index = indexForCT)

set.seed(1);
mod2 = train(myDF$predVal, x= myDF[,2:11], method= 'knn', trControl = ctrl2)

The behaviour is similar with plain cv. With adaptive_cv, train performs the adaptive search check but still ignores the repeats / number arguments.

ctrl2 = trainControl(method='adaptive_cv', verboseIter = TRUE, repeats = 4, number= 5,
                     adaptive = list(min=12, alpha=0.02, method='gls', complete=FALSE),
                     classProbs = TRUE, index = indexForCT)

This behaviour is the same on the CRAN binary and on the GitHub master branch. Maybe a warning should be issued when the repeats / number arguments are overridden? (Or at least some mention of this in trainControl's documentation?) Thank you for checking this!
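Until such a warning exists, a user-side check is one way to catch the mismatch before calling train. The following is only a minimal sketch in base R: check_resample_count is a hypothetical helper, not part of caret, and the toy index list merely mimics the 35 index sets from the script above.

```r
# Hypothetical helper (NOT part of caret): compare the number of resamples
# implied by number/repeats with the number actually defined by `index`.
check_resample_count <- function(index, number, repeats = 1) {
  expected <- number * repeats
  actual <- length(index)
  if (actual != expected) {
    warning(sprintf(
      "index defines %d resamples; number * repeats = %d is not used",
      actual, expected
    ), call. = FALSE)
  }
  invisible(actual == expected)
}

# Toy index list with 35 elements, mirroring the 35 index sets in the report
set.seed(1)
toy_index <- lapply(1:35, function(i) sample(100, 80))

# Capture the warning so the result can be inspected programmatically
matches <- tryCatch(
  check_resample_count(toy_index, number = 5, repeats = 4),
  warning = function(w) {
    message(conditionMessage(w))
    FALSE
  }
)
```

Calling the helper just before trainControl would flag cases like the one reported here, where 35 index sets were supplied alongside repeats = 4 and number = 5.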

All best,
Pantelis

>sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] caret_6.0-73    ggplot2_2.2.1   lattice_0.20-33

loaded via a namespace (and not attached):
[1] Rcpp_0.12.9        compiler_3.3.1     nloptr_1.0.4       git2r_0.15.0       plyr_1.8.4        
[6] class_7.3-14       iterators_1.0.8    tools_3.3.1        digest_0.6.10      lme4_1.1-12       
[11] memoise_1.0.0      tibble_1.1         nlme_3.1-128       gtable_0.2.0       mgcv_1.8-12       
[16] Matrix_1.2-6       foreach_1.4.3      curl_1.2           parallel_3.3.1     SparseM_1.7       
[21] e1071_1.6-7        withr_1.0.2        httr_1.2.1         stringr_1.1.0      MatrixModels_0.4-1
[26] devtools_1.12.0    stats4_3.3.1       grid_3.3.1         nnet_7.3-12        R6_2.2.0          
[31] minqa_1.2.4        reshape2_1.4.2     car_2.1-4          magrittr_1.5       scales_0.4.1      
[36] codetools_0.2-14   ModelMetrics_1.1.0 MASS_7.3-45        splines_3.3.1      assertthat_0.1    
[41] pbkrtest_0.4-6     colorspace_1.2-6   quantreg_5.26      stringi_1.1.1      lazyeval_0.2.0    
[46] munsell_0.4.3     

@topepo topepo commented Feb 1, 2017

You are correct that there is a disconnect between the stated resampling type (based on the control arguments) and the resamples defined by the index.

The issue is that there is no way to derive what type of resampling is being used based only on index. For a while, I thought about overriding method with something like "custom", but that was an issue for cases where people created 10-fold CV or bootstrap samples and passed them into index (the label would be just as wrong).

So right now, it is on the user to specify the type of resampling that they use when index is passed (even if it is just for labelling purposes).
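One way to keep the label and the resamples consistent is to build the index list so that it contains exactly number * repeats elements. The sketch below does this in base R for grouped data such as the per-user rows in the original report. The function name make_grouped_repeated_folds, its arguments, and the toy user_id vector are all illustrative assumptions, not caret API (caret's own createMultiFolds() covers the ungrouped case).

```r
# Sketch: build exactly number * repeats training index sets, holding out
# whole groups (e.g. users) in each fold. All names here are illustrative.
make_grouped_repeated_folds <- function(groups, number = 5, repeats = 4, seed = 1) {
  set.seed(seed)
  ids <- unique(groups)
  index <- list()
  for (r in seq_len(repeats)) {
    # Randomly assign each group to one of `number` folds for this repeat
    fold_of_id <- sample(rep(seq_len(number), length.out = length(ids)))
    for (k in seq_len(number)) {
      held_out <- ids[fold_of_id == k]
      # Training rows are all rows whose group is NOT held out in fold k
      index[[sprintf("Fold%d.Rep%d", k, r)]] <- which(!(groups %in% held_out))
    }
  }
  index
}

# Toy grouping variable: 200 rows spread over up to 20 users
set.seed(123)
user_id <- factor(paste0("user", sample(20, 200, replace = TRUE)))

idx <- make_grouped_repeated_folds(user_id, number = 5, repeats = 4)
n_resamples <- length(idx)  # one element per fold per repeat
```

A list built this way could then be passed as index to trainControl together with method = 'repeatedcv', number = 5 and repeats = 4, so that the stated resampling type matches the resamples actually run.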


@hadjipantelis hadjipantelis commented Feb 1, 2017

Thank you for your response. This is a perfectly reasonable explanation. Maybe it should be mentioned in a sentence in the documentation of trainControl's index field though. Otherwise there is some confusion; just saying something like: "index specification takes priority over the resampling type specified by number/repeats" should be adequate and clear.

Feel free to close this issue.

topepo added a commit that referenced this issue Mar 15, 2017
@topepo topepo closed this Mar 15, 2017