Handling NAs #8

belariow opened this issue Jul 11, 2017 · 2 comments

Hi there,

Lime currently does not seem to support NAs in data. Here's an example:



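library(caret)
library(lime)

# 100 x 10 random predictors; as.data.frame() names the columns V1..V10
x = as.data.frame(matrix(rnorm(100 * 10), ncol = 10))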
x$V1 = ifelse(x$V2 > 0, NA, x$V1) # introduce random NAs in V1
y = round(runif(100))
y = as.factor(y)
levels(y) = c("no", "yes")
data = cbind(x, target = y)

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 1,
                           allowParallel = TRUE,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

XGBModel = train(target ~ ., 
                 data = data,
                 trControl = fitControl, 
                 method = "xgbTree",
                 search = "random", 
                 metric = "ROC",
                 na.action = na.pass) # force XGB to take NAs into account

prediction = predict.train(XGBModel, data, na.action = na.pass, type = "prob") # works fine

explain = lime(data, XGBModel, bin_continuous = TRUE, n_permutations = 1000) # error

# Error in quantile.default(x[[i]], seq(0, 1, length.out = n_bins + 1)) : 
#   missing values and NaN's not allowed if 'na.rm' is FALSE

explain = lime(na.omit(data), XGBModel, bin_continuous = TRUE, n_permutations = 1000) # works fine

It would be great if NAs could be handled like XGB does.

Thanks guys for your work!

It seems to be an easy fix. Just to be clear, you don't want to put any emphasis on the distribution of NAs within a feature? What I mean is that when permuting the data, lime draws from the distribution of the training data; that distribution should simply be calculated while ignoring NA values, right?
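
For illustration, the idea of ignoring NA values when estimating a feature's distribution could look roughly like the sketch below. It is not lime's actual code: `bin_edges` is a hypothetical helper, and the plain resampling step is a simplification that assumes permutations are drawn from the observed values only. The `quantile()` call mirrors the one in the error message above, with `na.rm = TRUE` added.

```r
# Hypothetical helper (not part of lime): quantile-based bin edges that skip NAs
bin_edges <- function(x, n_bins = 4) {
  probs <- seq(0, 1, length.out = n_bins + 1)
  # na.rm = TRUE avoids "missing values and NaN's not allowed if 'na.rm' is FALSE"
  unique(quantile(x, probs = probs, na.rm = TRUE, names = FALSE))
}

set.seed(1)
v <- rnorm(100)
v[sample(100, 30)] <- NA  # a feature with 30 missing values

edges <- bin_edges(v)

# Assumed simplification: draw permuted values from the non-missing observations
observed <- v[!is.na(v)]
permuted <- sample(observed, size = 1000, replace = TRUE)
```

With this approach the NAs carry no weight: neither the bin edges nor the permuted samples reflect how often or where a feature is missing.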

Fixed in 2d80643