Error calling gbm #145

Closed
alepulver opened this issue Apr 22, 2015 · 29 comments

Comments

@alepulver

I've asked this question on Stack Overflow, but it may also be a bug.
http://stackoverflow.com/questions/29802216/caret-error-using-gbm-but-not-without-caret

I could run more tests or provide more information if it's helpful to fix this issue.

@zachmayer
Collaborator

Please post a minimal, reproducible example here and we'll take a look at it.

@alepulver
Author

Sure, here is an rda file with the objects issueDataframe and issueResponse. It has only 20 rows, but all 14 variables.
https://www.sendspace.com/file/a636c8

To reproduce the error, run:
train(issueDataframe, issueResponse, method = "gbm", verbose=T)

BTW: with the latest github versions of caret and gbm, I had to define "gbm.fit = gbm". But it didn't change the error.

@zachmayer
Collaborator

It's much more helpful to have an example fully encapsulated in a runnable r script. Try your script with the iris dataset (3 classes) and see if your script works.

@alepulver
Author

Ok, it throws the same error. The script is:

train(iris[,1:4], iris[,5], method = "gbm", verbose=T)

And the first warning message.

Warning messages:
1: In eval(expr, envir, enclos) :
  model fit failed for Resample01: shrinkage=0.1, interaction.depth=1, n.trees=150 Error in gbm.fit(x = structure(list(Sepal.Length = c(5.1, 5.1, 5.1, 4.9,  : 
  unused arguments (x = list(Sepal.Length = c(5.1, 5.1, 5.1, 4.9, 4.7, 4.6, 5, 5, 4.4, 4.9, 4.9, 4.8, 4.8, 4.3, 4.3, 5.8, 5.7, 5.7, 5.4, 5.1, 5.7, 5.1, 5.1, 4.6, 4.8, 4.8, 4.8, 5, 5, 5.2, 5.2, 5.4, 5.2, 5.5, 5, 4.4, 5.1, 5, 4.5, 4.4, 5, 5.1, 5.1, 5.1, 5.1, 7, 6.4, 6.9, 5.5, 6.5, 6.5, 6.5, 6.5, 5.7, 5.7, 5.7, 6.3, 4.9, 6.6, 6.6, 5.2, 5, 5, 6, 6.1, 6.1, 6.1, 6.1, 6.7, 5.6, 5.8, 5.6, 5.6, 5.9, 5.9, 5.9, 6.1, 6.1, 6.1, 6.3, 6.1, 6.4, 6.6, 6.7, 6, 5.7, 5.8, 5.8, 5.8, 5.4, 6, 6, 6, 6.7, 6.3, 5.6, 5.5, 6.1, 5.8, 5.6, 
5.7, 5.7, 6.2, 6.2, 5.1, 5.1, 6.3, 6.3, 5.8, 6.7, 6.5, 6.5, 6.4, 6.4, 6.4, 6.8, 6.8, 6.8, 5.7, 7.7, 6, 5.6, 6.3, 6.3, 6.7, 6.7, 6.7, 6.2, 6.2, 6.1, 6.4, 7.2, 7.4, 7.9, 6.4, 6.3, 7.7, 7.7, 6.4, 6, 6, 6.9, 6.9, 5.8, 5.8, 6.7, 6.7, 6.3, 6.5, 5.9), Sepal.Width = c(3.5, 3.5, 3.5, 3, 3.2, 3.4, 3.4, 3.4, 2.9, 3.1, 3.1, 3.4, 3, 3,  [... truncated]

@topepo
Owner

topepo commented Apr 22, 2015

Can you send the results of sessionInfo()?

I'll look into this but I don't get an error with the iris data:

set.seed(1)
mod <- train(iris[,1:4], iris[,5], method="gbm", verbose = T)

Results:

> mod
Stochastic Gradient Boosting 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Bootstrapped (25 reps) 

Summary of sample sizes: 150, 150, 150, 150, 150, 150, ... 

Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD  
  1                   50      0.9475846  0.9209174  0.02685690   0.04059622
  1                  100      0.9414756  0.9117078  0.02519604   0.03799105
  1                  150      0.9416747  0.9120051  0.02650409   0.03992756
  2                   50      0.9443351  0.9160351  0.02477932   0.03722424
  2                  100      0.9452470  0.9174003  0.02329412   0.03496742
  2                  150      0.9430611  0.9141175  0.02458664   0.03693194
  3                   50      0.9391190  0.9081672  0.02988798   0.04504981
  3                  100      0.9398852  0.9092714  0.02582751   0.03892999
  3                  150      0.9387984  0.9076697  0.02762577   0.04154757

Tuning parameter 'shrinkage' was held constant at a value of 0.1
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were n.trees = 50, interaction.depth =
 1 and shrinkage = 0.1. 

@alepulver
Author

I've just cleared my environment and it started to work for the iris dataset. I will check whether it works as part of my RMarkdown document (where there are many imports, scripts and definitions).

@alepulver
Author

It still fails with the data file I mentioned before. It's just a portable small file and can be loaded with load(), with variables properly coded. Could you please try it?

@zachmayer
Collaborator

I'm not going to load() your dataset and debug your entire R environment. You need to provide a minimal reproducible example.

Often the act of producing the minimal example will identify the source of the bug, but once you've created it, I'll be happy to help you solve the problem.

@zachmayer
Collaborator

BTW, the key element of an MRE (minimal reproducible example) is:

Test-run your code in a new, empty R session to make sure the code is runnable. People should be able to just copy-paste your data and your code into the console and get exactly the same result as you.

@alepulver
Author

The "entire environment" is a 20x14 dataframe and a 20 length vector. I could upload a CSV if it makes you happy, but claiming that the package works only with "anything that's accessible by the standard R installation" doesn't fix the issue either.

Also, I could use an R command to load that CSV from the internet so you can run it in the console, but that won't change the fact that the code fails for that. I repeat, there are NO other packages or variables loaded other than that small dataframe.

@zachmayer
Collaborator

Then please use dput() on those 2 objects to create a script I can copy/paste into a fresh R console to re-create the bug. Thanks!
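
A minimal sketch of what such a dput()-based script could look like. The objects `x` and `y` are stand-ins built from iris (an assumption for illustration); in this thread they would be issueDataframe and issueResponse.

```r
# Stand-in objects (assumption): replace with your own data frame and response.
x <- head(iris[, 1:4], 5)
y <- head(iris[, 5], 5)

# dput() prints R code that rebuilds an object, so the output can be pasted
# straight into a script or a fresh console.
x_code <- paste(capture.output(dput(x)), collapse = "\n")
y_code <- paste(capture.output(dput(y)), collapse = "\n")
cat("x <- ", x_code, "\n", sep = "")
cat("y <- ", y_code, "\n", sep = "")

# Sanity check: evaluating the printed code recreates the objects.
x_rebuilt <- eval(parse(text = x_code))
y_rebuilt <- eval(parse(text = y_code))
```

Pasting the two printed assignments above the failing train() call gives a script that runs without any external file.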

@zachmayer
Collaborator

Furthermore, it might be interesting to investigate how many columns you can remove from the dataframe before the bug goes away.

@alepulver
Author

Here is the script, which loads the dataframe from a gist. I added more rows so that it doesn't fail because of not enough data. The problem seems to be related to libraries being imported before (if I remove them, it works).

library(dplyr)

library(devtools)
source_url('http://gist.githubusercontent.com/alepulver/496f2bac9fd9748c8298/raw/8a14503b786ad8435f7ee6216ae05f18e4549863/gbm_caret_issue.R')
library(caret)
fitRF = train(issueDataframe, issueResponse, method = "gbm")

Edit: removed unnecessary modules.

@zachmayer
Collaborator

Interesting! How many libraries can you remove before it starts working?


On Wed, Apr 22, 2015 at 6:26 PM, Alejandro Pulver
notifications@github.com wrote:

Here is the script, which loads the dataframe from a gist. I added more rows so that it doesn't fail because of not enough data. The problem seems to be related to libraries being imported before (if I remove them, it works).

library(plyr)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(doParallel)
library(foreach)
library(FactoMineR)
library(Rmisc)
library(devtools)
source_url('http://gist.githubusercontent.com/alepulver/496f2bac9fd9748c8298/raw/8a14503b786ad8435f7ee6216ae05f18e4549863/gbm_caret_issue.R')
library(caret)
fitRF = train(issueDataframe, issueResponse, method = "gbm")


@alepulver
Author

dplyr is causing the issue, even if I load plyr before as the documentation suggests

@zachmayer
Collaborator

Interesting!  Can you send the output of a sessionInfo()?

There might be a bug between Caret and dplyr.



@alepulver
Author

R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plyr_1.8.1      gbm_2.1.1       survival_2.38-1 caret_6.0-41    ggplot2_1.0.1   lattice_0.20-30 devtools_1.7.0  dplyr_0.4.1    

loaded via a namespace (and not attached):
 [1] assertthat_0.1      bitops_1.0-6        BradleyTerry2_1.0-6 brglm_0.5-9         car_2.0-25          class_7.3-12        codetools_0.2-11   
 [8] colorspace_1.2-6    compiler_3.1.1      DBI_0.3.1           digest_0.6.8        e1071_1.6-4         foreach_1.4.2       grid_3.1.1         
[15] gtable_0.1.2        gtools_3.4.1        httr_0.6.1          iterators_1.0.7     lme4_1.1-7          magrittr_1.5        MASS_7.3-40        
[22] Matrix_1.1-5        mgcv_1.8-4          minqa_1.2.4         munsell_0.4.2       nlme_3.1-120        nloptr_1.0.4        nnet_7.3-9         
[29] pbkrtest_0.4-2      proto_0.3-10        quantreg_5.11       Rcpp_0.11.5         RCurl_1.95-4.5      reshape2_1.4.1      scales_0.2.4       
[36] SparseM_1.6         stringr_0.6.2       tools_3.1.1     

@zachmayer
Collaborator

All right, so the problem is that the gbm functions in caret don't work with dplyr's tbl_df:

> class(issueDataframe)
[1] "tbl_df"     "data.frame"

For now, coerce to a regular data.frame:

fitRF = train(as.data.frame(issueDataframe), issueResponse, method = "gbm")

We'll look into adding support for tbl_df, as dplyr is getting popular.

@zachmayer
Collaborator

It looks like other caret models work with tbl_df, it's just something specific about the functions for fitting gbms.

@zachmayer
Collaborator

In general it helps to do train(as.data.frame(x), y, ...). This helps make sure that whatever dataset you're working with (a list, a data.table, a tbl_df, a matrix, etc.) is coerced to the proper class for a caret::train model.
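
A small sketch of that coercion, runnable without dplyr or caret. The class vector assigned below only imitates a dplyr tbl_df (an assumption for illustration); a real tbl_df would come from dplyr itself.

```r
# Imitate a tbl_df (assumption): a data frame with extra leading classes.
x <- iris[, 1:4]
class(x) <- c("tbl_df", "tbl", "data.frame")
y <- iris[, 5]

# Inspect the class before modeling, as done earlier in this thread.
class(x)

# as.data.frame() drops the classes in front of "data.frame",
# leaving a plain data frame with the same columns.
x_plain <- as.data.frame(x)
class(x_plain)

# With caret installed, the model call would then be:
# fit <- caret::train(x_plain, y, method = "gbm")
```

Checking class() first makes it obvious when an object that prints like a data frame is actually something else.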

@alepulver
Author

Thanks a lot!

@zachmayer
Collaborator

No problem!

@liori

liori commented May 11, 2015

Just got a very similar problem with my own custom caret model based on glm. I recently upgraded my R installation to 3.2.0. In my code it seems as if glm stopped accepting matrices (!) as input, or at least that accepting matrices exposes some weird problems… No time to investigate more now; I just added a data.frame cast in several places and it seems to work. Damn, how I hate debugging R code!

@topepo
Owner

topepo commented May 11, 2015

"I just added a data.frame cast"

Check the class of the object. If you are using dplyr you might need to convert it to a plain old data frame prior to modeling.

@abresler

abresler commented Feb 4, 2016

Any plans to implement acceptance of data_frame & tbl_df objects? They are likely the future of R objects & are faster/cleaner to deal with.

@salmamr

salmamr commented Jan 23, 2017

I faced the same problem when training with xgbTree too, not just gbm.

@topepo
Owner

topepo commented Jan 23, 2017

It is a fairly complex issue. Most models end up using matrices no matter what you start with. xgboost and glmnet are two exceptions and both use different representations of sparse matrices.

I used to automatically do an as.data.frame conversion at the beginning of train but I removed it when people wanted to pass in sparse matrices and other objects. If you use the non-formula method, it should preserve the object type (but 99% of functions want a matrix or data frame in the end).

The only way that I can see around this is to do something like

train.tbl_df <- function(x, y, ...) train.default(as.data.frame(x), y, ...)

but that is really no different than you using

train.default(as.data.frame(x), y, ...)

yourself (as @zachmayer suggests).
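
To show how such a method would dispatch, here is a toy sketch using a hypothetical generic called `fit_model` instead of caret's train, so that it runs without caret or dplyr. The tbl_df class is again imitated by hand (an assumption for illustration).

```r
# Hypothetical generic standing in for train(); not part of caret.
fit_model <- function(x, ...) UseMethod("fit_model")

# The default method just reports the class of the data it receives.
fit_model.default <- function(x, ...) class(x)

# The tbl_df method coerces to a plain data frame, then delegates.
fit_model.tbl_df <- function(x, ...) fit_model.default(as.data.frame(x), ...)

# Imitate a tbl_df (assumption): a data frame with extra leading classes.
x <- iris[, 1:4]
class(x) <- c("tbl_df", "tbl", "data.frame")

# Dispatch picks fit_model.tbl_df, so the default method sees a plain data frame.
seen_class <- fit_model(x)
```

The wrapper does nothing beyond the manual as.data.frame() call, which is the point made above.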

In the near future, I strongly believe that there will be modeling workflows that can use other data types directly =]

@AiDinho

AiDinho commented Sep 19, 2017

I faced this issue today while fitting a glm model. However, after I cleared the workspace I didn't encounter the error. All my data were in a data frame (not a tbl_df). This may help to narrow down the bug.

1st run:
Loaded caret, tried to fit a glm, and got this error:

Error in requireNamespaceQuietStop("pROC") : package pROC is required

Installed pROC, tried to fit the model again, and got this:

" Fold01: parameter=none
model fit failed for Fold01: parameter=none Error"

At this point I cleared the workspace but did not start a new session.

2nd run:
Ran the same script (without explicitly loading pROC) and it worked.

It may have something to do with pROC, but I am not sure. Let me know if you need any more detail.

@topepo
Owner

topepo commented Sep 19, 2017

Without a reproducible example and the results of sessionInfo (or preferably sessioninfo::session_info), there isn't much that I can do.

If you add these, please start another issue.
