Error in explain function with H2O GBM regression model - Error in if (r2 > max) { : missing value where TRUE/FALSE needed #46

Closed
andresrcs opened this issue Nov 7, 2017 · 4 comments

Comments

@andresrcs

I'm trying to use the package with an H2O gbm regression model, but I get this error:

explainer <- lime(as.data.frame(train_h2o), gbm_model, bin_continuous = FALSE)
explanation <- explain(as.data.frame(test_h2o[1:10,]), explainer, n_features = 5)
#> Error in if (r2 > max) { : missing value where TRUE/FALSE needed
#> In addition: Warning message:
#> In `[<-.factor`(`*tmp*`, iseq, value = c(1L, 1L, 1L, 1L, 1L, 1L,  :
#>  invalid factor level, NA generated

I have no NA values in my dataset, and its structure looks like this:

str(head(train))
'data.frame':	6 obs. of  25 variables:
 $ monto    : num  958 363 340 299 382 ...
 $ feriado  : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 1 1 1
 $ cumple   : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1
 $ index.num: num  1.48e+09 1.48e+09 1.48e+09 1.48e+09 1.48e+09 ...
 $ year     : num  2017 2017 2017 2017 2017 ...
 $ year.iso : num  2016 2017 2017 2017 2017 ...
 $ half     : num  1 1 1 1 1 1
 $ quarter  : num  1 1 1 1 1 1
 $ month    : num  1 1 1 1 1 1
 $ month.xts: num  0 0 0 0 0 0
 $ month.lbl: Factor w/ 8 levels "Abril","Agosto",..: 3 3 3 3 3 3
 $ day      : num  1 2 3 4 5 6
 $ wday     : num  1 2 3 4 5 6
 $ wday.xts : num  0 1 2 3 4 5
 $ wday.lbl : Factor w/ 7 levels "domingo","jueves",..: 1 3 4 5 2 7
 $ mday     : num  1 2 3 4 5 6
 $ qday     : num  1 2 3 4 5 6
 $ yday     : num  1 2 3 4 5 6
 $ mweek    : num  5 1 1 1 1 1
 $ week     : num  1 1 1 1 1 1
 $ week.iso : num  52 1 1 1 1 1
 $ week2    : num  1 1 1 1 1 1
 $ week3    : num  1 1 1 1 1 1
 $ week4    : num  1 1 1 1 1 1
 $ mday7    : num  1 1 1 1 1 1

I don't understand the error message. Any clue as to what is happening here?

@thomasp85
Owner

I would be happy to look into it, but I would need a reproducible example. The code you provided is fine, but to debug it I need access to the dataset, or to a similar one that produces the same behaviour.

@andresrcs
Author

Here it is. Is this reprex enough? Thanks for looking into it.

library(tidyverse)
#> -- Attaching packages ------------------------------------ tidyverse 1.2.0 --
#> v ggplot2 2.2.1     v purrr   0.2.4
#> v tibble  1.3.4     v dplyr   0.7.4
#> v tidyr   0.7.2     v stringr 1.2.0
#> v readr   1.1.1     v forcats 0.2.0
#> -- Conflicts --------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
library(h2o)
#> 
#> ----------------------------------------------------------------------
#> 
#> Your next step is to start H2O:
#>     > h2o.init()
#> 
#> For H2O package documentation, ask for help:
#>     > ??h2o
#> 
#> After starting H2O, you can use the Web UI at http://localhost:54321
#> For more information visit http://docs.h2o.ai
#> 
#> ----------------------------------------------------------------------
#> 
#> Attaching package: 'h2o'
#> The following objects are masked from 'package:stats':
#> 
#>     cor, sd, var
#> The following objects are masked from 'package:base':
#> 
#>     %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
#>     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
#>     log10, log1p, log2, round, signif, trunc
library(lime)
#> 
#> Attaching package: 'lime'
#> The following object is masked from 'package:dplyr':
#> 
#>     explain

dataset_url <- "https://www.dropbox.com/s/t3o1zvzq0t7emz4/sales.RDS?raw=1"
sales_aug <- readRDS(gzcon(url(dataset_url)))

train <- sales_aug %>% filter(month <= 8)
valid <- sales_aug %>% filter(month == 9)
test <- sales_aug %>% filter(month >= 10)

h2o.init()
#>  Connection successful!
#> 
#> R is connected to the H2O cluster: 
#>     H2O cluster uptime:         8 minutes 21 seconds 
#>     H2O cluster version:        3.14.0.7 
#>     H2O cluster version age:    19 days  
#>     H2O cluster name:           H2O_started_from_R_andre_crw711 
#>     H2O cluster total nodes:    1 
#>     H2O cluster total memory:   1.71 GB 
#>     H2O cluster total cores:    4 
#>     H2O cluster allowed cores:  4 
#>     H2O cluster healthy:        TRUE 
#>     H2O Connection ip:          localhost 
#>     H2O Connection port:        54321 
#>     H2O Connection proxy:       NA 
#>     H2O Internal Security:      FALSE 
#>     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
#>     R Version:                  R version 3.4.2 (2017-09-28)
h2o.no_progress()
train <- as.h2o(train)
valid <- as.h2o(valid)
test <- as.h2o(test)

y <- "amount"
x <- setdiff(names(train), y)

leaderboard <- h2o.automl(
  x, y,
  training_frame = train,
  validation_frame = valid,
  leaderboard_frame = test,
  max_runtime_secs = 30,
  stopping_metric = "MSE",
  seed = 12345
)
gbm_model <- leaderboard@leader

explainer <- lime(as.data.frame(train), gbm_model, bin_continuous = FALSE)
explanation <- explain(as.data.frame(test[1:5,]), explainer, n_features = 5)
#> Warning in `[<-.factor`(`*tmp*`, iseq, value = structure(c(1L, 1L, 1L,
#> 1L, : invalid factor level, NA generated
#> Error in if (r2 > max) {: missing value where TRUE/FALSE needed

@thomasp85
Owner

The problem is that your test data includes factor levels that are not present in your training data, specifically in the month.lbl column. This means the input cannot be permuted properly: the unseen levels turn into NAs, which trips up the model. Either make sure that your training data covers the full feature space (this is good practice anyway), or use regular strings instead of factors to indicate that the column might take any value.
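A minimal base R sketch of both suggestions follows. It uses toy data standing in for the reprex's train/test split, and the helper names `unseen_levels()` and `factors_to_character()` are hypothetical, not part of lime or h2o. The first helper lists the factor values in the test set that the training set never saw; the second converts factor columns to plain strings. Whether the downstream model accepts character columns is not checked here.

```r
# Toy data standing in for the train/test split in the reprex (values are made up).
train_df <- data.frame(
  amount    = c(958, 363, 340),
  month.lbl = factor(c("Enero", "Febrero", "Enero"))
)
test_df <- data.frame(
  amount    = c(299, 382),
  month.lbl = factor(c("Octubre", "Enero"))
)

# Hypothetical helper: for every factor column in `train`, return the values
# that occur in `test` but are not among the training levels.
unseen_levels <- function(train, test) {
  factor_cols <- names(train)[vapply(train, is.factor, logical(1))]
  out <- lapply(factor_cols, function(col) {
    setdiff(unique(as.character(test[[col]])), levels(train[[col]]))
  })
  names(out) <- factor_cols
  Filter(length, out)  # keep only columns that actually have unseen values
}

unseen <- unseen_levels(train_df, test_df)
print(unseen)

# Hypothetical helper: turn factor columns into character columns, leaving
# all other columns untouched.
factors_to_character <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  df[is_fac] <- lapply(df[is_fac], as.character)
  df
}

train_chr <- factors_to_character(train_df)
str(train_chr)
```

Running `unseen_levels()` on `as.data.frame(train)` and `as.data.frame(test)` before calling `lime()` should point at columns such as month.lbl, where a month-based split leaves the test set with months the training set never contains.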

@andresrcs
Author

Thank you very much. I suppose I'll have to settle for the h2o.varimp() function until I have enough data to cover the feature space.
