Error when training univariate recipes with sampling #875

Closed
wkdavis opened this issue May 1, 2018 · 2 comments

@wkdavis wkdavis commented May 1, 2018

Calling train() on a recipe object that has only one predictor variable, with a sampling method specified in trainControl(), throws an error:

Error in `[.data.frame`(training, , x$var_info$variable, drop = FALSE) : 
  undefined columns selected

Minimal dataset:

library(caret)
library(recipes)
set.seed(1)
dat <- twoClassSim(100)
a <- dat[,5]
y <- dat[["Class"]]
df <- data.frame(a,y)
rec <- recipe(y ~ .,data = df)

Minimal, runnable code:

rec <- recipe(y ~ .,data = df)

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     sampling = "smote",
                     verboseIter = TRUE)

fit <- train(rec,data=df,method = "glm",family = "binomial",trControl = ctrl)

>Error in `[.data.frame`(training, , x$var_info$variable, drop = FALSE) : 
  undefined columns selected

It looks like this is due to lines in the rec_model() function.

Error when sampling$first == TRUE

For cases where sub-sampling occurs before preprocessing, sampling$first == TRUE, the error occurs here:

other_dat <- dat[, other_cols]
tmp <- sampling$func(other_dat, y)
orig_dat <- dat
dat <- tmp$x

trained_rec <- prep(rec, training = dat, fresh = TRUE,
                        verbose = FALSE, stringsAsFactors = TRUE,
                        retain = TRUE)

>Error in `[.data.frame`(training, , x$var_info$variable, drop = FALSE) : 
  undefined columns selected

If other_cols has only one value (i.e. there is a single non-outcome column), dat[, other_cols] returns a vector, so other_dat is no longer a data.frame or tbl. When this occurs, sampling$func(other_dat, y) gives the single column of the tmp$x data frame a default name, x. When dat is later passed to prep(), its predictor name (now x) no longer matches the predictor name stored in rec (a in this example), so prep() cannot find the predictor column and throws the error.
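The dropping behaviour described above can be reproduced in base R without caret or recipes. This is only a minimal sketch on a made-up toy data frame (the values are illustrative, not taken from twoClassSim()); it shows that single-column indexing returns an unnamed vector unless drop = FALSE is given.

```r
# Toy data frame mirroring the issue: one predictor `a` and an outcome `y`.
# The values are made up for illustration.
df <- data.frame(
  a = c(0.1, 0.5, 0.9, 1.3),
  y = factor(c("Class1", "Class2", "Class1", "Class2"))
)
other_cols <- "a"  # the single non-outcome column

dropped <- df[, other_cols]                # default drop = TRUE: a bare numeric vector
kept    <- df[, other_cols, drop = FALSE]  # still a one-column data frame

is.data.frame(dropped)  # FALSE: the column name `a` is gone
is.data.frame(kept)     # TRUE
names(kept)             # "a", which is what the recipe expects
```

Any function that later rebuilds a data frame from `dropped` has to invent a column name, which is how the mismatch with the recipe arises.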

Error when sampling$first == FALSE

For cases where sub-sampling occurs after preprocessing, sampling$first == FALSE, the error occurs here:

if(!is.null(sampling) && !sampling$first) {
      tmp <- sampling$func(x, y)
      x <- tmp$x
      y <- tmp$y
      rm(tmp)
    }

>Error in matrix(if (is.null(value)) logical() else value, nrow = nr, dimnames = list(rn,  : 
  length of 'dimnames' [2] not equal to array extent
>Called from: matrix(if (is.null(value)) logical() else value, nrow = nr, dimnames = list(rn, 
    cn))

Here x is a tbl. When it has only one column, sampling$func treats it like a vector and throws the error above (I have tested the up, rose, and smote methods).

Example without error

Because of the if statement at the top of rec_model that checks for a sampling method, this problem only occurs if a sampling method is specified.

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     #sampling = "smote",
                     verboseIter = TRUE)

fit <- train(rec,data=df,method = "glm",family = "binomial",trControl = ctrl)

Session Info:

>sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] DMwR_0.4.1      recipes_0.1.2   broom_0.4.4     dplyr_0.7.4     caret_6.0-79    ggplot2_2.2.1   lattice_0.20-35

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16       lubridate_1.7.4    tidyr_0.8.0        gtools_3.5.0       class_7.3-14       zoo_1.8-1         
 [7] assertthat_0.2.0   ipred_0.9-6        psych_1.8.3.3      foreach_1.4.4      R6_2.2.2           plyr_1.8.4        
[13] magic_1.5-8        stats4_3.5.0       pillar_1.2.1       gplots_3.0.1       rlang_0.2.0        curl_3.2          
[19] lazyeval_0.2.1     gdata_2.18.0       TTR_0.23-3         kernlab_0.9-25     rpart_4.1-13       Matrix_1.2-14     
[25] splines_3.5.0      CVST_0.2-1         ddalpha_1.3.2      gower_0.1.2        stringr_1.3.0      foreign_0.8-70    
[31] munsell_0.4.3      compiler_3.5.0     pkgconfig_2.0.1    mnormt_1.5-5       dimRed_0.1.0       nnet_7.3-12       
[37] tidyselect_0.2.4   tibble_1.4.2       prodlim_2018.04.18 DRR_0.0.3          codetools_0.2-15   RcppRoll_0.2.2    
[43] withr_2.1.2        bitops_1.0-6       MASS_7.3-49        ModelMetrics_1.1.0 nlme_3.1-137       gtable_0.2.0      
[49] magrittr_1.5       scales_0.5.0       KernSmooth_2.23-15 quantmod_0.4-13    stringi_1.1.7      ROCR_1.0-7        
[55] reshape2_1.4.3     bindrcpp_0.2.2     timeDate_3043.102  robustbase_0.93-0  geometry_0.3-6     xts_0.10-2        
[61] lava_1.6.1         iterators_1.0.9    tools_3.5.0        glue_1.2.0         DEoptimR_1.0-8     purrr_0.2.4       
[67] sfsmisc_1.1-2      abind_1.4-5        parallel_3.5.0     survival_2.41-3    yaml_2.1.18        colorspace_1.3-2  
[73] caTools_1.17.1     bindr_0.1.1    

Possible fix

I believe the fix is to change other_dat <- dat[, other_cols] to other_dat <- dat[, other_cols, drop = FALSE] for the sampling$first == TRUE case, and to add x <- as.data.frame(x) before tmp <- sampling$func(x, y) for the sampling$first == FALSE case. I have tested both changes and they resolve the errors. I can submit a pull request with these changes.
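As a rough illustration of how the two proposed changes fit together, here is a base R sketch. toy_upsample() is a hypothetical stand-in for sampling$func, written only for this example; it is not caret's implementation, and the data values are made up.

```r
# Hypothetical stand-in for `sampling$func` (NOT caret's code): repeats rows of
# the smaller class until every class has as many rows as the largest one.
toy_upsample <- function(x, y) {
  x <- as.data.frame(x)  # second proposed change: coerce before sampling
  counts <- table(y)
  target <- max(counts)
  idx <- unlist(lapply(names(counts), function(lev) {
    rep_len(which(y == lev), target)  # recycle this class's row indices
  }))
  list(x = x[idx, , drop = FALSE], y = y[idx])
}

# Made-up data with a single predictor `a` and an imbalanced outcome `y`.
dat <- data.frame(
  a = c(1.2, 0.4, 2.2, 3.1, 0.7),
  y = factor(c("Class1", "Class1", "Class1", "Class2", "Class2"))
)
other_cols <- setdiff(names(dat), "y")

other_dat <- dat[, other_cols, drop = FALSE]  # first proposed change
tmp <- toy_upsample(other_dat, dat$y)

names(tmp$x)  # still "a", so it matches the predictor name in the recipe
table(tmp$y)  # both classes now have 3 rows
```

With drop = FALSE the predictor keeps its name through the sampling step, so a later prep() call would see the column it was defined on.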

@topepo topepo added the reproducible label May 3, 2018
@topepo topepo closed this in 23b5651 May 3, 2018
@topepo topepo (Owner) commented May 3, 2018

A bug. Please install the current GitHub version to make sure that it solves your issue.

@wkdavis wkdavis (Author) commented May 7, 2018

Thanks!
