
Issue when using boot632 and multiClassSummary #382

Closed
ghost opened this issue Feb 26, 2016 · 2 comments
@ghost commented Feb 26, 2016

Hi,
I posted this on Stack Exchange a few days ago, but I'm not sure how active you are over there, so I thought I would cross-post here. There is a bug when using the 632 bootstrap resampling method with the multiClassSummary function.

Example:

library(caret) # load caret version 6.0-65
data(iris)
model632 <- train(Species ~ ., 
              data = iris,
              metric = "logLoss",
              tuneLength = 3,
              trControl = trainControl(
                  method = "boot632",
                  number = 10,
                  classProbs = TRUE,
                  summaryFunction = multiClassSummary
              )
              )

Results in

Something is wrong; all the logLoss metric values are missing:
    logLoss       Mean_ROC      Accuracy       Kappa     Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value
 Min.   : NA   Min.   : NA   Min.   : NA   Min.   : NA   Min.   : NA      Min.   : NA      Min.   : NA        
 1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA      1st Qu.: NA      1st Qu.: NA        
 Median : NA   Median : NA   Median : NA   Median : NA   Median : NA      Median : NA      Median : NA        
 Mean   :NaN   Mean   :NaN   Mean   :NaN   Mean   :NaN   Mean   :NaN      Mean   :NaN      Mean   :NaN        
 3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA      3rd Qu.: NA      3rd Qu.: NA        
 Max.   : NA   Max.   : NA   Max.   : NA   Max.   : NA   Max.   : NA      Max.   : NA      Max.   : NA      
...
...
...
Error in train.default(x, y, weights = w, ...) : Stopping  
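For context, a minimal sketch of what `method = "boot632"` has to compute on top of the plain bootstrap resamples — this is the textbook .632 estimator, not caret's actual code, and the function name and signature are hypothetical:

```python
# Hypothetical sketch (not caret's implementation) of the .632 bootstrap
# estimate that method = "boot632" layers on top of plain bootstrapping:
#
#   est_632 = 0.368 * apparent + 0.632 * mean(resampled)
#
# where "apparent" is the metric evaluated on the training set and
# "resampled" holds the metric from each bootstrap resample.

def boot632_estimate(apparent, resampled):
    """Blend the apparent (training-set) value of a metric with the
    mean of its bootstrap resample values using the .632 weights."""
    mean_resampled = sum(resampled) / len(resampled)
    return 0.368 * apparent + 0.632 * mean_resampled

# e.g. an apparent logLoss of 0.05 and ten resampled logLoss values
print(boot632_estimate(0.05, [0.12, 0.11, 0.13, 0.12, 0.10,
                              0.14, 0.12, 0.11, 0.13, 0.12]))
```

The point is that this blending step runs only for `boot632`, which is why the same `trainControl` settings work under `boot` and the CV methods below.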

However, if you use plain bootstrapping (method = "boot") or any of the CV methods, there are no errors.

modelboot <- train(Species ~ ., 
                   data = iris,
                   metric = "logLoss",
                   tuneLength = 3,
                   trControl = trainControl(
                       method = "boot",
                       number = 10,
                       classProbs = TRUE,
                       summaryFunction = multiClassSummary
                   )
                   )

Random Forest 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Bootstrapped (10 reps) 
Summary of sample sizes: 150, 150, 150, 150, 150, 150, ... 
Resampling results across tuning parameters:

  mtry  logLoss    Mean_ROC   Accuracy   Kappa      Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value
  2     0.1192646  0.9959070  0.9493702  0.9231340  0.9510912         0.9739756         0.9541812          
  3     0.1278133  0.9960823  0.9494363  0.9232620  0.9508519         0.9739756         0.9547698          
  4     0.1390321  0.9954000  0.9545795  0.9311739  0.9558890         0.9767709         0.9603660   

The problem seems to be limited to the multiClassSummary function, as there are no issues when using twoClassSummary or the default (Accuracy) summary.

model632Acc <- train(Species ~ ., 
                     data = iris,
                     metric = "Accuracy",
                     tuneLength = 3,
                     trControl = trainControl(
                         method = "boot632",
                         number = 10,
                         classProbs = TRUE
                      )
                     ) 
Random Forest 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Bootstrapped (10 reps) 
Summary of sample sizes: 150, 150, 150, 150, 150, 150, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD  
  2     0.9654315  0.9477460  0.01556936   0.02352288
  3     0.9676522  0.9511172  0.01767405   0.02663498
  4     0.9676522  0.9511172  0.01767405   0.02663498

twoClass <- twoClassSim(n = 1000)
twoClass632 <- train(Class ~ ., 
                     data = twoClass,
                     metric = "ROC",
                     tuneLength = 3,
                     trControl = trainControl(
                       method = "boot632",
                       number = 10,
                       classProbs = TRUE,
                       summaryFunction = twoClassSummary
                       )
                     )

twoClass632
Random Forest 

1000 samples
  15 predictor
   2 classes: 'Class1', 'Class2' 

No pre-processing
Resampling: Bootstrapped (10 reps) 
Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ... 
Resampling results across tuning parameters:

  mtry  ROC        Sens       Spec       ROC SD       Sens SD     Spec SD   
   2    0.9587114  0.8881064  0.9173559  0.007995325  0.03953783  0.03624957
   8    0.9470403  0.8883836  0.8937627  0.012619452  0.03054834  0.03026205
  15    0.9365844  0.8843317  0.8747735  0.017583286  0.03323702  0.02266483

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2. 

Session Info:

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)  #I'm running Windows 10, not sure why it says Windows 8

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] randomForest_4.6-12 caret_6.0-65        ggplot2_2.0.0       lattice_0.20-33    

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3        magrittr_1.5       splines_3.2.2      MASS_7.3-43        munsell_0.4.3      colorspace_1.2-6   foreach_1.4.3      minqa_1.2.4       
 [9] stringr_1.0.0      car_2.1-1          plyr_1.8.3         tools_3.2.2        nnet_7.3-10        pbkrtest_0.4-4     parallel_3.2.2     grid_3.2.2        
[17] gtable_0.1.2       nlme_3.1-121       mgcv_1.8-7         quantreg_5.21      e1071_1.6-7        class_7.3-13       MatrixModels_0.4-1 iterators_1.0.8   
[25] lme4_1.1-10        Matrix_1.2-2       nloptr_1.0.4       reshape2_1.4.1     codetools_0.2-14   stringi_1.0-1      compiler_3.2.2     pROC_1.8          
[33] scales_0.3.0       stats4_3.2.2       SparseM_1.7
@topepo (Owner) commented Mar 4, 2016

I'll check it out.

Thanks,

Max

topepo added a commit that referenced this issue Mar 12, 2016
@topepo (Owner) commented Mar 12, 2016

Fixed now. I was not correctly identifying which columns were the performance measures.
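To illustrate the class of bug described here (all names below are invented, not caret's internals): the .632 blend must be applied only to the columns that are actually performance measures, and the metric set must match what the summary function returned. multiClassSummary produces a much wider set of columns than twoClassSummary or the default summary, so a selection keyed to the wrong column names produces NA for every metric — matching the "all the logLoss metric values are missing" failure above.

```python
# Invented names, not caret's internals: a sketch of why a wrong column
# selection yields all-NA metrics. Identifier columns such as 'Resample'
# must be excluded, and the requested metric names must match what the
# summary function actually produced.

def blend_632(apparent, resamples, metric_cols):
    """Blend apparent and mean resampled values for each metric column.
    Returns None (standing in for R's NA) for any requested column the
    summaries do not actually contain."""
    out = {}
    for col in metric_cols:
        if col not in apparent or not all(col in r for r in resamples):
            out[col] = None  # wrong column name -> NA, as in the report
            continue
        mean_resampled = sum(r[col] for r in resamples) / len(resamples)
        out[col] = 0.368 * apparent[col] + 0.632 * mean_resampled
    return out

resamples = [{"Resample": "Boot01", "logLoss": 0.12, "Accuracy": 0.95},
             {"Resample": "Boot02", "logLoss": 0.11, "Accuracy": 0.96}]
apparent = {"logLoss": 0.05, "Accuracy": 0.99}

# Correct column set: real numbers come back.
print(blend_632(apparent, resamples, ["logLoss", "Accuracy"]))
# Column set that doesn't match the summary's output: NA across the board.
print(blend_632(apparent, resamples, ["ROC", "Sens", "Spec"]))
```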

@topepo closed this Mar 12, 2016