Error during explain with H2O GBM/XGB models - "NA/NaN/Inf in 'x'" #45

Closed
dkincaid opened this issue Nov 1, 2017 · 7 comments


@dkincaid

dkincaid commented Nov 1, 2017

I'm trying to use the package with an H2O XGBoost model (I've also tried it with GBM and get the same thing). The error is:

Error in glm.fit(x = x_fit, y = y, weights = weights, family = gaussian()) : 
  NA/NaN/Inf in 'x'

Here is the code I'm running:

explainer <- lime::lime(as.data.frame(wellnessTrain), mdl)

explanation <- lime::explain(as.data.frame(wellnessTest),
                       explainer, n_labels = 1, n_features = 2)

This is caused by NA values in the data frame, although I thought that had already been fixed in issue #8. I verified the cause by removing the three columns that contain NA values, after which the error goes away. These NA values are meaningful: H2O's GBM and XGBoost handle them by creating a separate category for the missing value after binning the non-missing feature values. Is there an easy fix here?
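For context, the message itself comes from base R's glm.fit, which rejects non-finite values in the design matrix. A minimal sketch with made-up toy data (not the wellness data, and not lime's internal call) that should trigger the same kind of failure:

```r
# Toy design matrix with one missing value (made-up data for illustration only)
x <- cbind(intercept = 1, feature = c(0.5, NA, 1.5, 2.0))
y <- c(1.0, 2.0, 3.0, 4.0)

# Capture the error message instead of stopping the script
msg <- tryCatch(
  {
    glm.fit(x = x, y = y, family = gaussian())
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(msg)
```

So any NA that reaches the local linear fit unchanged is enough to stop the explanation, regardless of how the underlying H2O model treats missing values.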

@dkincaid
Author

dkincaid commented Nov 2, 2017

I have found that feature_select="tree" works, but none of the other possible feature selection types do.

@thomasp85
Owner

Where do you get the wellness data from, and how is your model built? I would need both to reproduce your example...

@dkincaid
Author

Unfortunately I can't share that data, but let me try to put together a reproducible example with some publicly available data. I was hoping that something I was doing would look wrong.

@thomasp85
Owner

Not obviously so - I’ll look into it if you can make a reprex.

@dkincaid
Author

dkincaid commented Nov 14, 2017

Here is a reproducible example using the Iris data. It first shows a successful run with the full iris data frame, then runs the same thing against a data frame where I randomly set some values to NA. Hopefully this gives you something to work with. I appreciate you taking a look at it.

# Create a data frame from the Iris data and randomly set some values to NA
# (each value in the four numeric columns is replaced with NA with probability 0.2)
myIris <- purrr::map_df(iris[, -5], function(x) {
  x[sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(x), replace = TRUE)]
})

myIris <- cbind(myIris, Species=iris$Species)

library(h2o)
h2o.init()

# First show that it's successful without any missing data
full_iris_frame <- as.h2o(iris)
full_mdl <- h2o.gbm(training_frame = full_iris_frame, y = "Species")

full_explainer <- lime::lime(dplyr::select(as.data.frame(full_iris_frame), -Species), full_mdl)

full_explanation <- lime::explain(dplyr::select(as.data.frame(full_iris_frame)[1:4,], -Species),
                             full_explainer, n_labels = 3 , n_features = 3)

# Now try to run it on the data that has some missing values
iris_frame <- as.h2o(myIris)
mdl <- h2o.gbm(training_frame = iris_frame,
               y = "Species")

explainer <- lime::lime(dplyr::select(as.data.frame(iris_frame), -Species), mdl)

explanation <- lime::explain(dplyr::select(as.data.frame(iris_frame)[1:4,], -Species),
                             explainer, n_labels = 3 , n_features = 3)

@thomasp85
Owner

So, the support for NA implemented earlier only considered NA values in the training data, not NA values in the new observations to explain. I've just pushed an update that ignores NA columns in new observations so that you don't get the error.
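The general idea of ignoring NA columns for a single observation can be sketched in base R. This is only an illustration of the concept, not the code that was pushed to lime; drop_na_columns is a hypothetical helper and the observation is made up.

```r
# Hypothetical helper (not part of lime): keep only the columns of a
# data frame that contain no NA values
drop_na_columns <- function(observation) {
  keep <- !vapply(observation, anyNA, logical(1))
  observation[, keep, drop = FALSE]
}

# Made-up single observation with one missing feature
obs <- data.frame(
  Sepal.Length = 5.1,
  Sepal.Width  = NA_real_,
  Petal.Length = 1.4,
  Petal.Width  = 0.2
)

kept <- drop_na_columns(obs)
print(names(kept))
```

With the missing feature dropped, a local linear fit for that observation only sees finite values, so the glm.fit check no longer fails.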

@dkincaid
Author

That is fantastic! Thanks for such a quick fix. I really love this package.
