Error in explain function with H2O GBM regression model - Error in if (r2 > max) { : missing value where TRUE/FALSE needed #47

Closed
andresrcs opened this issue Nov 10, 2017 · 4 comments

andresrcs commented Nov 10, 2017

Hi, can you please look into issue #46 again?
Just out of curiosity, I tried dropping the month.lbl variable. Now I don't get the warning message, but I still get the same error message, even though my training data covers the full feature space.

library(tidyverse)
library(h2o)
library(lime)

dataset_url <- "https://www.dropbox.com/s/t3o1zvzq0t7emz4/sales.RDS?raw=1"
sales_aug <- readRDS(gzcon(url(dataset_url)))

sales_aug <- sales_aug %>% select(-month.lbl) # Drop the factor variable that does not cover its full feature range

train <- sales_aug %>% filter(month <= 8)
valid <- sales_aug %>% filter(month == 9)
test <- sales_aug %>% filter(month >= 10)

h2o.init()
h2o.no_progress()
train <- as.h2o(train)
valid <- as.h2o(valid)
test <- as.h2o(test)

y <- "amount"
x <- setdiff(names(train), y)

leaderboard <- h2o.automl(
  x, y,
  training_frame = train,
  validation_frame = valid,
  leaderboard_frame = test,
  max_runtime_secs = 30,
  stopping_metric = "MSE",
  seed = 12345
)
gbm_model <- leaderboard@leader

explainer <- lime(as.data.frame(train), gbm_model, bin_continuous = FALSE)
explanation <- explain(as.data.frame(test[1:5,]), explainer, n_features = 5)
#> Error in if (r2 > max) {: missing value where TRUE/FALSE needed

thomasp85 (Owner) commented

Can I get you to try it with the latest version of lime from GitHub?

andresrcs (Author) commented

The previous error message is gone, but now there is a new one:

explanation <- explain(as.data.frame(test[1:5,]), explainer, n_features = 5)
#> Error in glmnet(x[, c(features, j), drop = FALSE], y, weights = weights,  : x should be a matrix with 2 or more columns

thomasp85 (Owner) commented

Ok, so the reason for that error is quite specific to your dataset. You have a single column (index.num) whose range is so extreme that, because you do not bin continuous variables, it completely dominates the calculation of how similar each permutation is to the original observation. As a result, essentially all permutations get a weight of 0, which causes the model fit to fail.

Based on the name and the values, I would throw that column out unless you have very good reasons to keep it. If you really need it, then either play with the kernel_size parameter or use bin_continuous = TRUE (the latter will give more interpretable explanations anyway).
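
To make the mechanism concrete, here is a minimal base R sketch of how one extreme-range column can push every similarity weight to zero. It uses a hand-written exponential kernel and made-up data. The column names `small` and `huge`, the kernel form, and both width values are assumptions for illustration only, not lime's internal code or defaults.

```r
# Illustrative sketch only: a hand-written exponential kernel, not lime's internals.
n <- 1000

# Hypothetical permutations with two features: one on an ordinary scale and
# one with an extreme range (a stand-in for a column like index.num).
perms <- cbind(
  small = seq(-2, 2, length.out = n),
  huge  = 1.5e9 + seq(1e5, 1e7, length.out = n)
)
original <- c(small = 0, huge = 1.5e9)

# Euclidean distance from every permutation to the original observation.
d <- sqrt(rowSums(sweep(perms, 2, original)^2))

# Exponential similarity kernel; both widths below are assumed values.
exp_kernel <- function(d, width) exp(-(d^2) / (width^2))

w_narrow <- exp_kernel(d, width = 1.5)  # width on the scale of `small`
w_wide   <- exp_kernel(d, width = 1e7)  # width on the scale of `huge`

# Count the permutations that keep a nonzero weight under each width.
n_narrow <- sum(w_narrow > 0)  # 0: every weight underflows to zero
n_wide   <- sum(w_wide > 0)    # 1000: a much wider kernel keeps them all
```

With the narrow width, the distances are driven entirely by `huge`, so every weight underflows to zero and a weighted local fit has nothing to work with. In the script from this issue, the suggestions above would correspond to dropping index.num before building the explainer, or passing bin_continuous = TRUE to lime().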

thomasp85 added a commit that referenced this issue Nov 14, 2017

thomasp85 (Owner) commented

I've added a meaningful error message for cases like yours, where the similarity of the permutations to the original observation is zero and a local model cannot be created.
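
For readers wondering what such a check might look like, here is a rough base R sketch. The function name `check_similarity` and the message text are hypothetical and are not taken from the referenced commit. It only shows the general idea of failing early when no permutation carries any weight.

```r
# Hypothetical guard; the function name and message are illustrative, not lime's.
check_similarity <- function(weights) {
  if (all(weights == 0)) {
    stop(
      "All permutations have zero similarity to the original observation; ",
      "a local model cannot be fitted.",
      call. = FALSE
    )
  }
  invisible(weights)
}

# A vector of all-zero weights triggers the error, which is captured here.
msg <- tryCatch(
  check_similarity(rep(0, 5)),
  error = function(e) conditionMessage(e)
)

# Weights with at least one nonzero entry pass through unchanged.
ok <- check_similarity(c(0, 0.2, 0.7))
```

Failing at this point gives the user a message about the similarity weights rather than a downstream error from the model-fitting call.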
