Error calling gbm #145

alepulver · 2015-04-22T17:42:28Z

I've asked this question on Stack Overflow, but it may also be a bug.
http://stackoverflow.com/questions/29802216/caret-error-using-gbm-but-not-without-caret

I could run more tests or provide more information if it's helpful to fix this issue.

zachmayer · 2015-04-22T17:43:54Z

Please post a minimal, reproducible example here and we'll take a look at it.

alepulver · 2015-04-22T17:53:41Z

Sure, here is an rda file with the objects issueDataframe and issueResponse. It has only 20 rows, but all 14 variables.
https://www.sendspace.com/file/a636c8

To reproduce the error, run:
train(issueDataframe, issueResponse, method = "gbm", verbose=T)

BTW: with the latest github versions of caret and gbm, I had to define "gbm.fit = gbm". But it didn't change the error.

zachmayer · 2015-04-22T18:11:39Z

It's much more helpful to have an example fully encapsulated in a runnable R script. Try your script with the iris dataset (3 classes) and see if your script works.

alepulver · 2015-04-22T18:17:24Z

Ok, it throws the same error. The script is:

train(iris[,1:4], iris[,5], method = "gbm", verbose=T)

And the first warning message.

Warning messages:
1: In eval(expr, envir, enclos) :
  model fit failed for Resample01: shrinkage=0.1, interaction.depth=1, n.trees=150 Error in gbm.fit(x = structure(list(Sepal.Length = c(5.1, 5.1, 5.1, 4.9,  : 
  unused arguments (x = list(Sepal.Length = c(5.1, 5.1, 5.1, 4.9, 4.7, 4.6, 5, 5, 4.4, 4.9, 4.9, 4.8, 4.8, 4.3, 4.3, 5.8, 5.7, 5.7, 5.4, 5.1, 5.7, 5.1, 5.1, 4.6, 4.8, 4.8, 4.8, 5, 5, 5.2, 5.2, 5.4, 5.2, 5.5, 5, 4.4, 5.1, 5, 4.5, 4.4, 5, 5.1, 5.1, 5.1, 5.1, 7, 6.4, 6.9, 5.5, 6.5, 6.5, 6.5, 6.5, 5.7, 5.7, 5.7, 6.3, 4.9, 6.6, 6.6, 5.2, 5, 5, 6, 6.1, 6.1, 6.1, 6.1, 6.7, 5.6, 5.8, 5.6, 5.6, 5.9, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.1, 6.4, 6.6, 6.7, 6, 5.7, 5.8, 5.8, 5.8, 5.4, 6, 6, 6, 6.7, 6.3, 5.6, 5.5, 6.1, 5.8, 5.6, 
5.7, 5.7, 6.2, 6.2, 5.1, 5.1, 6.3, 6.3, 5.8, 6.7, 6.5, 6.5, 6.4, 6.4, 6.4, 6.8, 6.8, 6.8, 5.7, 7.7, 6, 5.6, 6.3, 6.3, 6.7, 6.7, 6.7, 6.2, 6.2, 6.1, 6.4, 7.2, 7.4, 7.9, 6.4, 6.3, 7.7, 7.7, 6.4, 6, 6, 6.9, 6.9, 5.8, 5.8, 6.7, 6.7, 6.3, 6.5, 5.9), Sepal.Width = c(3.5, 3.5, 3.5, 3, 3.2, 3.4, 3.4, 3.4, 2.9, 3.1, 3.1, 3.4, 3, 3,  [... truncated]

topepo · 2015-04-22T21:04:12Z

Can you send the results of sessionInfo()?

I'll look into this but I don't get an error with the iris data:

set.seed(1)
mod <- train(iris[,1:4], iris[,5], method="gbm", verbose = T)

Results:

> mod
Stochastic Gradient Boosting 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Bootstrapped (25 reps) 

Summary of sample sizes: 150, 150, 150, 150, 150, 150, ... 

Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD  
  1                   50      0.9475846  0.9209174  0.02685690   0.04059622
  1                  100      0.9414756  0.9117078  0.02519604   0.03799105
  1                  150      0.9416747  0.9120051  0.02650409   0.03992756
  2                   50      0.9443351  0.9160351  0.02477932   0.03722424
  2                  100      0.9452470  0.9174003  0.02329412   0.03496742
  2                  150      0.9430611  0.9141175  0.02458664   0.03693194
  3                   50      0.9391190  0.9081672  0.02988798   0.04504981
  3                  100      0.9398852  0.9092714  0.02582751   0.03892999
  3                  150      0.9387984  0.9076697  0.02762577   0.04154757

Tuning parameter 'shrinkage' was held constant at a value of 0.1
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were n.trees = 50, interaction.depth =
 1 and shrinkage = 0.1.

alepulver · 2015-04-22T21:11:03Z

I've just cleared my environment and it started to work for the iris dataset. I will check whether it also works as part of my RMarkdown document (where there are many imports, scripts and definitions).

alepulver · 2015-04-22T21:16:58Z

It still fails with the data file I mentioned before. It's just a portable small file and can be loaded with load(), with variables properly coded. Could you please try it?

zachmayer · 2015-04-22T21:24:47Z

I'm not going to load() your dataset and debug your entire R environment. You need to provide a minimal reproducible example.

Often the act of producing the minimal example will identify the source of the bug, but once you've created it, I'll be happy to help you solve the problem.

zachmayer · 2015-04-22T21:26:36Z

BTW, the key element of an MRE (minimal reproducible example) is:

Test-run your code in a new, empty R session to make sure the code is runnable. People should be able to just copy-paste your data and your code into the console and get exactly the same result as you.

alepulver · 2015-04-22T21:32:33Z

The "entire environment" is a 20x14 dataframe and a 20 length vector. I could upload a CSV if it makes you happy, but claiming that the package works only with "anything that's accessible by the standard R installation" doesn't fix the issue either.

Also, I could use an R command to load that CSV from the internet so you can run it in the console, but that won't change the fact that the code fails for that. I repeat, there are NO other packages or variables loaded other than that small dataframe.

zachmayer · 2015-04-22T21:37:45Z

Then please use dput() on those 2 objects to create a script I can copy/paste into a fresh R console to re-create the bug. Thanks!
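
A minimal sketch of what this request amounts to: dput() prints an R expression that rebuilds an object, so its output can be pasted straight into a script. The small_df and small_y objects below are hypothetical stand-ins, not the data from this report.

```r
# Hypothetical stand-ins for the two objects in the report.
small_df <- data.frame(a = c(1.5, 2.5, 3.5), b = c("x", "y", "x"),
                       stringsAsFactors = TRUE)
small_y <- factor(c("yes", "no", "yes"))

# dput() writes an R expression that recreates the object when evaluated.
dput(small_df)
dput(small_y)

# Round trip: capture the printed expression and rebuild the object from text.
df_code <- paste(capture.output(dput(small_df)), collapse = "\n")
rebuilt_df <- eval(parse(text = df_code))

# Compare the rebuilt copy with the original.
print(all.equal(small_df, rebuilt_df))
```

Pasting the dput() output after an assignment (for example `issueDataframe <- structure(...)`) gives a script that needs no external file.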

zachmayer · 2015-04-22T21:39:08Z

Furthermore, it might be interesting to investigate how many columns you can remove from the dataframe before the bug goes away.
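
One way to sketch that search is to drop one column at a time, refit, and record whether the error persists. Here fit_model is a hypothetical stand-in for the real train() call; it is written to fail whenever a column named "bad" is present, purely so the loop has something to find.

```r
# Hypothetical stand-in for the real model call; it errors whenever a
# column named "bad" is present so the search has something to find.
fit_model <- function(x, y) {
  if ("bad" %in% names(x)) stop("model fit failed")
  invisible(TRUE)
}

# TRUE if fitting on x raises an error, FALSE otherwise.
fit_fails <- function(x, y) {
  tryCatch({ fit_model(x, y); FALSE }, error = function(e) TRUE)
}

x <- data.frame(a = 1:4, bad = c(2, 4, 6, 8), c = 4:1)
y <- factor(c("u", "v", "u", "v"))

# Drop each column in turn and record whether the failure is still there.
still_fails <- vapply(
  names(x),
  function(col) fit_fails(x[, setdiff(names(x), col), drop = FALSE], y),
  logical(1)
)

# Columns whose removal makes the error disappear are the suspects.
suspects <- names(still_fails)[!still_fails]
print(suspects)
```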

alepulver · 2015-04-22T22:26:15Z

Here is the script, which loads the dataframe from a gist. I added more rows so that it doesn't fail because of not enough data. The problem seems to be related to libraries being imported before (if I remove them, it works).

library(dplyr)

library(devtools)
source_url('http://gist.githubusercontent.com/alepulver/496f2bac9fd9748c8298/raw/8a14503b786ad8435f7ee6216ae05f18e4549863/gbm_caret_issue.R')
library(caret)
fitRF = train(issueDataframe, issueResponse, method = "gbm")

Edit: removed unnecessary modules.

zachmayer · 2015-04-22T22:54:39Z

Interesting! How many libraries can you remove before it starts working?

alepulver · 2015-04-22T22:58:36Z

dplyr is causing the issue, even if I load plyr first, as the documentation suggests.

zachmayer · 2015-04-22T23:15:00Z

Interesting! Can you send the output of a sessionInfo()?

There might be a bug between caret and dplyr.

alepulver · 2015-04-22T23:18:18Z

R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plyr_1.8.1      gbm_2.1.1       survival_2.38-1 caret_6.0-41    ggplot2_1.0.1   lattice_0.20-30 devtools_1.7.0  dplyr_0.4.1    

loaded via a namespace (and not attached):
 [1] assertthat_0.1      bitops_1.0-6        BradleyTerry2_1.0-6 brglm_0.5-9         car_2.0-25          class_7.3-12        codetools_0.2-11   
 [8] colorspace_1.2-6    compiler_3.1.1      DBI_0.3.1           digest_0.6.8        e1071_1.6-4         foreach_1.4.2       grid_3.1.1         
[15] gtable_0.1.2        gtools_3.4.1        httr_0.6.1          iterators_1.0.7     lme4_1.1-7          magrittr_1.5        MASS_7.3-40        
[22] Matrix_1.1-5        mgcv_1.8-4          minqa_1.2.4         munsell_0.4.2       nlme_3.1-120        nloptr_1.0.4        nnet_7.3-9         
[29] pbkrtest_0.4-2      proto_0.3-10        quantreg_5.11       Rcpp_0.11.5         RCurl_1.95-4.5      reshape2_1.4.1      scales_0.2.4       
[36] SparseM_1.6         stringr_0.6.2       tools_3.1.1

zachmayer · 2015-04-23T00:35:50Z

All right, so the problem is that the gbm functions in caret don't work with dplyr's tbl_df:

> class(issueDataframe)
[1] "tbl_df"     "data.frame"

For now, coerce to a regular data.frame:

fitRF = train(as.data.frame(issueDataframe), issueResponse, method = "gbm")

We'll look into adding support for tbl_df, as dplyr is getting popular.
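
A minimal sketch of the check and the workaround, using base R only. The tbl_df class is attached by hand here, which is an assumption made so the example runs without dplyr installed; a real object would come from dplyr itself.

```r
# Mimic the object from the report by adding the extra class by hand
# (an assumption for illustration; real objects are created by dplyr).
tbl <- iris[, 1:4]
class(tbl) <- c("tbl_df", "data.frame")

# The extra class is what distinguishes it from a plain data frame.
print(class(tbl))
print(inherits(tbl, "tbl_df"))

# as.data.frame() drops the classes that precede "data.frame".
plain <- as.data.frame(tbl)
print(class(plain))
```

The columns and values are unchanged by the coercion; only the class attribute differs, which is why the workaround is safe to apply before calling train().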

zachmayer · 2015-04-23T00:38:04Z

It looks like other caret models work with tbl_df; the problem is specific to the functions for fitting gbms.

zachmayer · 2015-04-23T00:39:20Z

In general it helps to do train(as.data.frame(x), y, ...). This helps make sure that whatever dataset you're working with (a list, a data.table, a tbl_df, a matrix, etc.) is coerced to the proper class for a caret::train model.
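
As a rough illustration of that advice, the sketch below holds the same values in three different containers and brings each to a plain data frame. The object names are made up for the example.

```r
# The same values held in three different containers.
as_matrix <- cbind(a = c(1, 2, 3), b = c(4, 5, 6))
as_list   <- list(a = c(1, 2, 3), b = c(4, 5, 6))
as_frame  <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# as.data.frame() brings each of them to a plain data frame.
converted <- lapply(list(as_matrix, as_list, as_frame), as.data.frame)
classes <- vapply(converted, function(d) class(d)[1], character(1))
print(classes)
```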

alepulver · 2015-04-23T00:53:02Z

Thanks a lot!

zachmayer · 2015-04-23T00:56:48Z

No problem!

liori · 2015-05-11T22:32:41Z

Just got a very similar problem with my own custom caret model based on glm. I recently upgraded my R installation to 3.2.0. In my code it seems as if glm stopped accepting matrices (!) as input, or at least that accepting matrices exposes some weird problems… No time to investigate more now; I just added a data.frame cast in several places and it seems to work. Damn, how I hate debugging R code!

topepo · 2015-05-11T22:35:56Z

> I just added a data.frame cast

Check the class of the object. If you are using dplyr you might need to convert it to a plain old data frame prior to modeling.

abresler · 2016-02-04T19:28:43Z

Any plans to implement acceptance of data_frame & tbl_df objects? They are likely the future of R objects & are faster/cleaner to deal with.

salmamr · 2017-01-23T13:17:03Z

I faced the same problem when training with xgbTree too, not just gbm.

topepo · 2017-01-23T19:01:54Z

It is a fairly complex issue. Most models end up using matrices no matter what you start with. xgboost and glmnet are two exceptions and both use different representations of sparse matrices.

I used to automatically do an as.data.frame conversion at the beginning of train but I removed it when people wanted to pass in sparse matrices and other objects. If you use the non-formula method, it should preserve the object type (but 99% of functions want a matrix or data frame in the end).

The only way that I can see around this is to do something like

train.tbl_df <- function(x, y, ...) train.default(as.data.frame(x), y, ...)

but that is really no different than you using

train.default(as.data.frame(x), y, ...)

yourself (as @zachmayer suggests).
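
To make the dispatch idea concrete, here is a toy sketch with a made-up generic called fit standing in for caret's train; none of these functions are part of caret. It only shows how an S3 method for tbl_df would coerce the input and hand it to the default method.

```r
# Toy generic standing in for caret::train (hypothetical names throughout).
fit <- function(x, ...) UseMethod("fit")

# Default method: reports the class of the object it finally received.
fit.default <- function(x, ...) class(x)

# Method for tbl_df: coerce to a plain data frame, then hand off to the default.
fit.tbl_df <- function(x, ...) fit.default(as.data.frame(x), ...)

# Attach the class by hand so the sketch runs without dplyr (an assumption).
tbl <- iris[1:5, 1:4]
class(tbl) <- c("tbl_df", "data.frame")

# Dispatch picks fit.tbl_df, so the default method sees a plain data frame.
result <- fit(tbl)
print(result)
```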

In the near future, I strongly believe that there will be modeling workflows that can use other data types directly =]

AiDinho · 2017-09-19T21:48:24Z

I ran into this issue today while fitting a glm model. However, after I cleared the workspace I didn't encounter the error. All my data were in a data frame (not a tbl_df). This may help to narrow down the bug.

1st run:
Loaded caret, tried to fit a glm, and got this error:

Error in requireNamespaceQuietStop("pROC") : package pROC is required

Installed pROC, tried to fit the model again, and got this:

" Fold01: parameter=none
model fit failed for Fold01: parameter=none Error"

At this point I cleared the workspace but did not start a new session.

2nd run:
Ran the same script (without explicitly loading pROC) and it worked.

It may have something to do with pROC, but I am not sure. Let me know if you need any more detail.

topepo · 2017-09-19T22:04:42Z

Without a reproducible example and the results of sessionInfo (or preferably sessioninfo::session_info), there isn't much that I can do.

If you add these, please start another issue.

zachmayer closed this as completed Apr 23, 2015

This was referenced Mar 7, 2017

Does caret not play well with tibbles? #611

Closed

caret + tbl_df incompatibility amunategui/amunategui.github.io#3

Open
