Handling NAs #8

Closed
belariow opened this issue Jul 11, 2017 · 2 comments

@belariow

Hi there,

Lime currently does not seem to support NAs in data. Here's an example:

library(caret)
library(lime)

set.seed(123)

x = as.data.frame(matrix(rnorm(100*10), ncol=10))
x$V1 = ifelse(x$V2 > 0, NA, x$V1) # introduce random NAs in V1
y = round(runif(100))
y = as.factor(y)
levels(y) = c("no", "yes")
data = cbind(x, target = y)

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 1,
                           allowParallel = TRUE,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

XGBModel = train(target ~ ., 
                 data = data,
                 trControl = fitControl, 
                 method = "xgbTree",
                 search = "random", 
                 metric = "ROC",
                 na.action = na.pass) # force XGB to take NAs into account

prediction = predict.train(XGBModel, data, na.action = na.pass, type = "prob") # works fine

explain = lime(data, XGBModel, bin_continuous = TRUE, n_permutations = 1000) # error

# Error in quantile.default(x[[i]], seq(0, 1, length.out = n_bins + 1)) : 
#   missing values and NaN's not allowed if 'na.rm' is FALSE

explain = lime(na.omit(data), XGBModel, bin_continuous = TRUE, n_permutations = 1000) # works fine

It would be great if lime could handle NAs the way XGBoost does.

Thanks guys for your work!

@thomasp85
Owner

It seems to be an easy fix. Just to be clear, you don't want the distribution of NAs within a feature to carry any weight? What I mean is that when lime permutes the data, it draws from the distribution of the training data. That distribution should simply be calculated while ignoring NA values, right?
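
For illustration, here is a minimal base R sketch of what "ignoring NA values" could look like for a single numeric feature. It is not lime's actual code: `bin_edges` is a hypothetical helper, and the only thing taken from the thread is the `quantile()` call shown in the error message, here with `na.rm = TRUE` added.

```r
# Hypothetical helper (not part of lime's API): compute quantile bin edges
# for one numeric feature while skipping missing values.
bin_edges <- function(x, n_bins = 4) {
  quantile(x, probs = seq(0, 1, length.out = n_bins + 1), na.rm = TRUE)
}

set.seed(123)
v <- rnorm(100)
v[sample(100, 30)] <- NA   # introduce missing values

# quantile(v, seq(0, 1, length.out = 5)) would stop with
# "missing values and NaN's not allowed if 'na.rm' is FALSE".
edges <- bin_edges(v)      # succeeds because NAs are dropped first

# Drawing permutation values from the observed (non-NA) entries only,
# so the sampled distribution reflects the data that is actually present.
draws <- sample(v[!is.na(v)], size = 10, replace = TRUE)
```

With this approach the permuted samples never contain NA, so how often a feature is missing has no influence on the explanation.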

@thomasp85
Owner

Fixed in 2d80643
