
Using lime() on xgboost object #1

Closed
martinju opened this issue Apr 25, 2017 · 27 comments

@martinju
Contributor

Hi, and thank you for an excellent package!

I am trying to apply the lime package to a model fitted with xgboost (using the original xgboost package), but the lime function does not seem to accept the input format, even though the predict function works fine.

Example using both an xgb.DMatrix and a regular matrix:

library(xgboost)
library(lime)

x <- matrix(rnorm(100 * 10), ncol = 10)
y <- rnorm(100)

xgbDMatrix.obj <- xgb.DMatrix(data = x, label = y)

# Variant 1: using the xgb.DMatrix format as data input
mod <- xgb.train(data = xgbDMatrix.obj, nrounds = 100)
# predict(mod, x, type = "prob") # works fine
explain <- lime(x = x, model = mod) # Throws error

# Variant 2: using a regular matrix + vector as data input
mod <- xgboost(data = x, label = y, nrounds = 100)
# predict(mod, x, type = "prob") # works fine
explain <- lime(x = x, model = mod) # Throws error

In the readme you mention manually building a predict function.
If that is the solution here, could you please provide some guidelines on how to do that?

@thomasp85
Owner

Thanks for the report – I'll take a look. Keep in mind that lime is still in early development and any error could potentially be a bug on my part and not due to anything you have done. I'll be in touch

@thomasp85
Owner

Two things: currently lime only works with classifiers, not regressors. Also, on my system I get an error when running predict, complaining about type being an unknown argument.

@martinju
Contributor Author

Thanks for taking a look at this!

  1. OK, let's do a classification example instead:

     x <- matrix(rnorm(100 * 10), ncol = 10)
     y <- round(runif(100))

     xgbDMatrix.obj <- xgb.DMatrix(data = x, label = y)

     # Variant 1: using the xgb.DMatrix format as data input
     mod <- xgb.train(data = xgbDMatrix.obj, nrounds = 100, objective = "binary:logistic")
     predict(mod, xgbDMatrix.obj, type = "prob") # works fine
     explain <- lime(x = xgbDMatrix.obj, model = mod) # Throws error
     predict(mod, x, type = "prob") # works fine
     explain <- lime(x = x, model = mod) # Throws error

     # Variant 2: using a regular matrix + vector as data input
     mod <- xgboost(data = x, label = y, nrounds = 100, objective = "binary:logistic")
     predict(mod, x, type = "prob") # works fine
     explain <- lime(x = x, model = mod) # Throws error

  2. I do not get an error when using predict with type="prob" (I am using xgboost 0.6-4). In any case, the predict function gives probabilities by default for classification problems, so leaving out type="prob" gives the same result on my computer.

@thomasp85
Owner

Sorry for letting this hang - I was pulled away by higher priority stuff.

Several problems in your example:

lime takes data.frame input in order to ensure that variables are named. I might add a default coercion to data.frame, which would make matrices work, but I'm still unsure whether this is a good idea.

The predict function does not return a data.frame containing the probabilities of each class, which is the required predict output...

I would generally recommend using xgboost through the caret package, as this is the target API lime is coded up against (soon to be joined by mlr).
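
For example, a minimal sketch of the caret route (the data, model settings, and names here are illustrative assumptions, and the explainer is used as a function, as elsewhere in this thread):

library(caret)
library(lime)

df  <- as.data.frame(matrix(rnorm(100 * 10), ncol = 10))
lab <- factor(sample(c("no", "yes"), 100, replace = TRUE))

fit <- train(x = df, y = lab, method = "xgbTree",
             trControl = trainControl(classProbs = TRUE))

predict(fit, df, type = "prob") # a data.frame with one probability column per class
explainer <- lime(df, fit)      # the explainer function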

@martinju
Contributor Author

Thank you for getting back to me.

So I tried running this from caret instead, and I got something up and running. However, when moving to real data I ran into some problems:

1: it seems that lime(x=x,...) cannot handle the case where x contains missing data. It throws the error:

Error in quantile.default(x[[i]], seq(0, 1, length.out = n_bins + 1)) :
missing values and NaN's not allowed if 'na.rm' is FALSE

A workaround is to remove the missing data before passing it to lime, but since the missing data carries information that was used during model training, that is not an optimal solution.

2: even without missing data, I get an error when using bin_continuous=T:

Error in cut.default(x[[i]], unique(bin_cuts[[i]]), labels = FALSE, include.lowest = TRUE) :
invalid number of intervals

Everything works fine if I put bin_continuous=F and there is no missing data.

3: When re-running the explainer function returned by lime(), I get slightly different results every time. I guess this is due to the randomness in the permutations. Is it possible to control the number of permutations used somewhere? I guess that could resolve the issue, at the expense of an increase in running time.

4: Finally, if you could point out how I might create my own predict function based on native xgboost, that would be excellent (the caret wrapper does not support all of xgboost's features). Do I need to override the predict method of the xgboost class?

Thank you so much for your support so far!

thomasp85 reopened this Jun 23, 2017
@thomasp85
Owner

thomasp85 commented Jun 23, 2017

  1. I'm unsure how binning should work with missing values, so for now I'm ok with that failing (should prob have a better error message though)
  2. You may have stumbled upon a bug - can you make a minimal example?
  3. The permutations are random which is why you get different results. To "control" randomness you simply set the seed in R prior to running the explanation (e.g. set.seed(1))
  4. The only requirements for a model to be lime compliant are that the input data is a data.frame and that the model has a predict method that outputs a data frame of class probabilities (one class per column) when called with type='prob'. See the sketch below.
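
A minimal sketch of such a method (the class my_model and its internals are purely hypothetical; only the output shape matters):

# Hypothetical model class "my_model": when called with type = 'prob',
# predict must return a data.frame with one probability column per class
predict.my_model <- function(object, newdata, type = 'raw', ...) {
  p <- runif(nrow(newdata)) # stand-in for the model's real probabilities
  if (type == 'prob') {
    data.frame(yes = p, no = 1 - p)
  } else {
    factor(ifelse(p > 0.5, 'yes', 'no'), levels = c('no', 'yes'))
  }
}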

@martinju
Contributor Author

  1. OK. Maybe you could let the NAs become a separate category when binning (see the sketch after this list)? Or you could simply remove them when binning, although that throws away some information. Either way, that is better than having to delete every row containing at least one NA value; in an extreme situation that could throw away all your data, even though a model can easily be fitted to it.
  2. I just realized this error was caused by including an intercept (a constant column) in the model. Here is an example:

library(lime)
library(caret)

set.seed(123)
x <- as.data.frame(matrix(rnorm(100 * 10), ncol = 10))
x <- cbind(V0 = 1, x,
           V11 = sample(c(0, 1), size = 100, replace = TRUE),
           V12 = sample(c(0, 1), size = 100, replace = TRUE))
y <- round(runif(100))
y.factor <- as.factor(y)
levels(y.factor) <- c("no", "yes")

fitControl <- trainControl(method = "none",
                           verboseIter = TRUE,
                           classProbs = TRUE,
                           allowParallel = TRUE)

caretFit.glm <- train(y = y.factor,
                      x = x,
                      method = "glm",
                      trControl = fitControl)

explain <- lime(x = x, model = caretFit.glm, bin_continuous = TRUE) # Throws error

  3. Yes, I understand this. I would like to be able to manually set how many permutations/samples are used by the algorithm (more samples means more stable results across different seeds).
  4. Thank you, I will try to override xgboost's predict function then.
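
A quick base R sketch of the separate-NA-bin idea from point 1 (purely illustrative):

# Bin a numeric vector at its quartiles and give NAs their own level
v <- c(rnorm(20), NA, NA)
b <- cut(v, breaks = unique(quantile(v, na.rm = TRUE)), include.lowest = TRUE)
b <- addNA(b) # NA becomes an explicit factor level, i.e. its own bin
table(b)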

@thomasp85
Owner

  1. That might be a possible approach - I'll think about it a bit
  2. Thanks
  3. You always have control over the number of permutations using the n_permutations argument in the explainer (see the sketch after this list). And you can control randomness by setting the seed. Unsure what else you need?
  4. Let me know if there are any problems
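
For example (a sketch, where explainer stands for the function returned by lime() and new_obs for a data.frame of observations to explain):

set.seed(1) # fixes the randomness of the permutations
explanation <- explainer(new_obs, n_labels = 1, n_features = 3,
                         n_permutations = 10000) # more permutations, more stable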

@martinju
Contributor Author

Thank you! Regarding 3, I wasn't aware of the n_permutations parameter. That is exactly what I was looking for :) What is the default value?

@thomasp85
Owner

5000

I'll try to improve the docs on this in the future. In general R does not have good support for documenting functions created by other functions...

@martinju
Contributor Author

Thanks!

@negyn

negyn commented Jun 23, 2017

Hi, and thank you for the lime package, it is very interesting to me.
I have some questions: it seems that the current version of the lime package in R doesn't support text and image data, is that right? And apart from random forest, what other kinds of classifiers does it support?

@pommedeterresautee
Contributor

@negyn #6

@belariow

@martinju Hi, I have the same issues 1. and 2. with an XGB model.
Have you found a way around problem 1.?
About problem 2., you found an intercept issue. How do you deal with that in an XGB model, though?
Thanks a lot!

@pommedeterresautee
Contributor

@belariow xgb is handled in the PR. Read #7 for more info (xgb is handled in #6, but that is not a definitive solution).

What do you mean by the intercept issue?

@belariow

@pommedeterresautee The intercept issue is the one that throws the following error:

Error in cut.default(x[[i]], unique(bin_cuts[[i]]), labels = FALSE, include.lowest = TRUE) :
invalid number of intervals

I figured out, just like @martinju, that this is an intercept problem, meaning that it appears whenever one of the columns of your data is constant.

I also have an NA issue: the lime() function does not work if your data has NAs, which is problematic for real data. I don't know how I'll work around that; probably by creating as many "VarXIsNa" variables as needed (a sketch of that idea is below). Do you think lime will be able to handle NAs soon?
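
A rough sketch of that workaround (df stands for your data.frame with numeric columns; the median fill is just an illustrative choice):

# For each column with NAs: add a 0/1 "VarXIsNa" indicator, then fill the
# NAs so downstream binning sees complete data
for (v in names(df)) {
  if (anyNA(df[[v]])) {
    df[[paste0(v, "IsNa")]] <- as.integer(is.na(df[[v]]))
    df[[v]][is.na(df[[v]])] <- median(df[[v]], na.rm = TRUE) # simple impute
  }
}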

Thanks!

@thomasp85
Owner

@belariow Can you open a separate issue for the NA support, as it is hidden within this xgboost thread?

@kwartler

What about something like this within the lime.data.frame function, at line ~66?

case_res <- predict(model, case_perm, type = 'prob')
# Coerce to a data.frame, since caret makes the predictions a data.frame
# but other predict methods do not
if (!is.data.frame(case_res)) {
  case_res <- as.data.frame(case_res)
}

For example:

library(lime)

data("iris")
iris_test <- iris[1, 1:5]
iris_train <- iris[-1, 1:5]

fit <- randomForest::randomForest(Species ~ ., data = iris_train)

# check the class of the predictions to show it's a matrix, not a data.frame
pred.rf <- predict(fit, iris_train, type = 'prob')
class(pred.rf)

# lime
explanation <- lime(iris_train, fit)
explain <- explanation(iris_test, n_labels = 2, n_features = 2)

Just my $0.02, since I am new to this repo and the work. Thx for doing this though!

@pommedeterresautee
Contributor

For instance, for a multiclass XGBoost classification it wouldn't work, as you first need to coerce the predictions to a matrix (one column per class); in that case there are some parameters to add when you call predict.
This is managed in the source code after the merge of #6.
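
A sketch of that coercion (assuming a booster bst trained with objective = "multi:softprob" and num_class classes; the names are illustrative):

# multi:softprob returns a flat vector of length nrow * num_class, row-major
p <- predict(bst, as.matrix(newdata))
p <- matrix(p, ncol = num_class, byrow = TRUE) # one column per class
case_res <- as.data.frame(p)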

@thomasp85
Owner

Can I get you to confirm that the CRAN release works fine for you?

@martinju
Contributor Author

martinju commented Sep 20, 2017

Thanks for the work on this. I really think the new predict_model and model_type functions will be very useful. Reading the help files for these new functions (and the source code), it seems xgboost should work out of the box, as predict_model.xgb.Booster is already defined. However, I don't understand how I should use it. Passing anything other than a data.frame (or character vector) to the lime() function just throws an error. Examples:

library(xgboost)
install.packages("lime")
library(lime)

x <- matrix(rnorm(100 * 10), ncol = 10)
y <- round(runif(100))

# The preferred xgb.DMatrix way
xgbDMatrix.obj <- xgb.DMatrix(data = x, label = y)
mod1 <- xgb.train(data = xgbDMatrix.obj, nrounds = 10, objective = "binary:logistic") # Variant 1, using the xgb.DMatrix format as data input
explain <- lime(x = xgbDMatrix.obj, model = mod1) # Throws error
predict(mod1, xgbDMatrix.obj) # Works
predict(mod1, x) # Works

# The matrix way
mod2 <- xgboost(data = x, label = y, nrounds = 10, objective = "binary:logistic") # Variant 2, using a regular matrix + vector as data input
explain <- lime(x = x, model = mod2) # Throws error
predict(mod2, x) # Works
predict(mod2, xgbDMatrix.obj) # Works

Do you mind correcting my very simple example?

@thomasp85
Owner

You would convert the matrix to a data.frame and then (if needed) convert it back inside predict_model.
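
A rough sketch of that idea (my assumption, not the package's actual implementation; it overrides the methods lime dispatches on and reuses mod2 and x from the example above):

x_df <- as.data.frame(x) # lime wants named data.frame input

# Convert back to a matrix inside predict_model before calling xgboost
predict_model.xgb.Booster <- function(x, newdata, type, ...) {
  p <- predict(x, as.matrix(newdata))
  data.frame(yes = p, no = 1 - p)
}
model_type.xgb.Booster <- function(x, ...) "classification"

explainer <- lime(x_df, mod2)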

@thomasp85
Owner

We might add direct support for matrices, but most models expect data.frame input, so I'm unsure whether we can do it gracefully.

@pommedeterresautee
Contributor

XGBoost expects a matrix in its own format (xgb.DMatrix).
Plus, you may have your own transformations to apply to the permutations (e.g. centering, scaling).
It is up to the user to provide the function that converts to the expected format;
predict_model.xgb.Booster is only for predictions.
Please look at the demo for more information (it uses text data, but the approach works with any kind of data):
https://github.com/thomasp85/lime/blob/master/demo/text_classification_explanation.R

@martinju
Contributor Author

Ok, thank you for the pointers. I got the lime function to pass without error, but the explainer still does not work, as xgboost requires a matrix (or xgb.DMatrix) to perform predictions. I added a pull request to fix this (in addition to handling NAs in the training data). See #37

@thomasp85
Owner

Fixed in a74a2ac

@vikrant-sahu

Setting the seed value eliminates randomness, but if I run it with different seed values the results change. So the problem now is: which result should I trust? If you go by the R-squared, then most of the time the R-squared is nearly the same.
