
Error in explain function with H2O GBM regression model - Error in if (r2 > max) { : missing value where TRUE/FALSE needed #47

andresrcs opened this issue Nov 10, 2017 · 4 comments


andresrcs commented Nov 10, 2017

Hi, can you please look into issue #46 again?
Just out of curiosity, I tried dropping the month.lbl variable. Now I don't get the warning message, but I still get the same error, even though my training data covers the full feature space.


library(dplyr)
library(h2o)
library(lime)

h2o.init()

dataset_url <- ""
sales_aug <- readRDS(gzcon(url(dataset_url)))

# Drop the factor variable whose levels don't cover the full feature range
sales_aug <- sales_aug %>% select(-month.lbl)

train <- sales_aug %>% filter(month <= 8)
valid <- sales_aug %>% filter(month == 9)
test  <- sales_aug %>% filter(month >= 10)

train <- as.h2o(train)
valid <- as.h2o(valid)
test  <- as.h2o(test)

y <- "amount"
x <- setdiff(names(train), y)

leaderboard <- h2o.automl(x, y,
                          training_frame = train,
                          validation_frame = valid,
                          leaderboard_frame = test,
                          max_runtime_secs = 30,
                          stopping_metric = "MSE",
                          seed = 12345)
gbm_model <- leaderboard@leader

explainer <- lime(as.data.frame(train), gbm_model, bin_continuous = FALSE)
explanation <- explain(as.data.frame(test[1:5, ]), explainer, n_features = 5)
#> Error in if (r2 > max) {: missing value where TRUE/FALSE needed
Can I get you to try it with the latest version of lime from GitHub?

The previous error message is gone, but now there is a new one:

explanation <- explain(as.data.frame(test[1:5, ]), explainer, n_features = 5)
#> Error in glmnet(x[, c(features, j), drop = FALSE], y, weights = weights,  : x should be a matrix with 2 or more columns

OK, so the reason for that error is quite specific to your dataset. You have a single column (index.num) whose range is so extreme that, because you do not bin continuous variables, it completely dominates the dataset when calculating the similarity of the permutations. Essentially all permutations get a weight of 0, resulting in errors in the model fit.

Based on the name and the values, I would drop that column unless you have very good reasons to keep it. If you really need it, then either tune the kernel_width parameter or use bin_continuous = TRUE (the latter will give more interpretable explanations anyway).
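To see why one huge-range column zeroes out every weight, here is a minimal sketch. It assumes an exponential similarity kernel over Euclidean distance, which is the spirit of what lime does when bin_continuous = FALSE; the column names and values below are made up for illustration:

# Hypothetical example: a column like index.num (epoch seconds, ~1e9)
# next to ordinary-scale features.
orig <- c(index.num = 1.5e9, month = 6, amount = 1200)
perm <- c(index.num = 1.2e9, month = 7, amount = 1150)  # one permutation

# The Euclidean distance is dominated entirely by index.num
d <- sqrt(sum((orig - perm)^2))

# Exponential kernel with a modest width
kernel_width <- 0.75
w <- exp(-d^2 / kernel_width^2)
w  # underflows to 0, so glmnet receives all-zero weights and fails

With weights that are all zero, the weighted ridge/lasso fit behind explain() has nothing to fit, which is the failure mode described above. Binning continuous variables (or rescaling/dropping the offending column) keeps the distances on a comparable scale.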

thomasp85 added a commit that referenced this issue Nov 14, 2017
I've added a meaningful error message for cases like yours, where the similarity of the permutations to the original observation is zero and a local model cannot be created.
