
Using lime() on xgboost object #1

Closed
martinju opened this issue Apr 25, 2017 · 27 comments

@martinju
Contributor

Hi, and thank you for an excellent package!

I am trying to apply the lime package to a model fitted with xgboost (using the original xgboost package), but the lime function does not seem to accept the input format, even though the predict function works fine.

Example using both an xgb.DMatrix and a regular matrix:

library(xgboost)
library(lime)

x <- matrix(rnorm(100 * 10), ncol = 10)
y <- rnorm(100)

xgbDMatrix.obj <- xgb.DMatrix(data = x, label = y)

# Variant 1: using the xgb.DMatrix format as data input
mod <- xgb.train(data = xgbDMatrix.obj, nrounds = 100)
# predict(mod, x, type = "prob") # works fine
explain <- lime(x = x, model = mod) # Throws error

# Variant 2: using a regular matrix + vector as data input
mod <- xgboost(data = x, label = y, nrounds = 100)
# predict(mod, x, type = "prob") # works fine
explain <- lime(x = x, model = mod) # Throws error

In the readme you mention manually building a predict function.
If that is the solution here, could you please provide some guidelines on how to do that?

@thomasp85
Owner

Thanks for the report – I'll take a look. Keep in mind that lime is still in early development and any error could potentially be a bug on my part and not due to anything you have done. I'll be in touch

@thomasp85
Owner

Two things: currently lime only works with classifiers, not regressors. Also, on my system I get an error when running predict, complaining about type being an unknown argument.

@martinju
Contributor Author

Thanks for taking a look at this!

  1. OK, let's do a classification example instead:

     x <- matrix(rnorm(100 * 10), ncol = 10)
     y <- round(runif(100))

     xgbDMatrix.obj <- xgb.DMatrix(data = x, label = y)

     # Variant 1: using the xgb.DMatrix format as data input
     mod <- xgb.train(data = xgbDMatrix.obj, nrounds = 100, objective = "binary:logistic")
     predict(mod, xgbDMatrix.obj, type = "prob") # works fine
     explain <- lime(x = xgbDMatrix.obj, model = mod) # Throws error
     predict(mod, x, type = "prob") # works fine
     explain <- lime(x = x, model = mod) # Throws error

     # Variant 2: using a regular matrix + vector as data input
     mod <- xgboost(data = x, label = y, nrounds = 100, objective = "binary:logistic")
     predict(mod, x, type = "prob") # works fine
     explain <- lime(x = x, model = mod) # Throws error

  2. I do not get an error when using predict with type="prob" (I am using xgboost 0.6-4). In any case, the predict function gives probabilities by default for classification problems, so leaving out type="prob" gives the same result on my computer.

@thomasp85
Owner

Sorry for letting this hang - I was pulled away by higher priority stuff.

Several problems in your example:

lime takes data.frame input in order to ensure that variables are named. I might add a default coercion to data.frame, which would make matrices work, but I'm still unsure whether this is a good idea.

The predict function does not return a data.frame containing the probabilities of each class, which is the required predict output...

I would generally recommend using xgboost through the caret package, as this is the target API lime is coded up against (soon to be joined by mlr).
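
For example, a minimal sketch of the caret route (the data, model settings, and names here are illustrative assumptions, and the explainer is used as a function, as elsewhere in this thread):

library(caret)
library(lime)

df  <- as.data.frame(matrix(rnorm(100 * 10), ncol = 10))
lab <- factor(sample(c("no", "yes"), 100, replace = TRUE))

fit <- train(x = df, y = lab, method = "xgbTree",
             trControl = trainControl(classProbs = TRUE))

predict(fit, df, type = "prob") # a data.frame with one probability column per class
explainer <- lime(df, fit)      # the explainer function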

@martinju
Contributor Author

Thank you for getting back to me.

So I tried running this from caret instead, and I got something up and running. However, when moving to real data I ran into some problems:

1: it seems that lime(x=x,...) cannot handle the case where x contains missing data. It throws the error:

Error in quantile.default(x[[i]], seq(0, 1, length.out = n_bins + 1)) :
missing values and NaN's not allowed if 'na.rm' is FALSE

A workaround is to remove the missing data before passing it to lime, but since the missing data carries information that was used during model training, that is not an optimal solution.

2: even without missing data, I get an error when using bin_continuous=T:

Error in cut.default(x[[i]], unique(bin_cuts[[i]]), labels = FALSE, include.lowest = TRUE) :
invalid number of intervals

Everything works fine if I put bin_continuous=F and there is no missing data.

3: When re-running the explainer function returned by lime(), I get slightly different results every time. I guess this is due to the randomness in the permutations. Is it possible to control the number of permutations used somewhere? I guess that could resolve the issue, at the expense of an increase in running time.

4: Finally, if you could point out how I might create my own predict function based on native xgboost, that would be excellent (the caret wrapper does not support all of xgboost's features). Do I need to override the predict method of the xgboost class?

Thank you so much for your support so far!

thomasp85 reopened this Jun 23, 2017
@thomasp85
Owner

thomasp85 commented Jun 23, 2017

  1. I'm unsure how binning should work with missing values, so for now I'm ok with that failing (should prob have a better error message though)
  2. You may have stumbled upon a bug - can you make a minimal example?
  3. The permutations are random which is why you get different results. To "control" randomness you simply set the seed in R prior to running the explanation (e.g. set.seed(1))
  4. The only requirements for a model to be lime compliant are that the input data is a data.frame and that the model has a predict method that outputs a data frame of class probabilities (one class per column) when called with type='prob'. See the sketch below.
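
A minimal sketch of such a method (the class my_model and its internals are purely hypothetical; only the output shape matters):

# Hypothetical model class "my_model": when called with type = 'prob',
# predict must return a data.frame with one probability column per class
predict.my_model <- function(object, newdata, type = 'raw', ...) {
  p <- runif(nrow(newdata)) # stand-in for the model's real probabilities
  if (type == 'prob') {
    data.frame(yes = p, no = 1 - p)
  } else {
    factor(ifelse(p > 0.5, 'yes', 'no'), levels = c('no', 'yes'))
  }
}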

@martinju
Contributor Author

  1. OK. Maybe you could let the NAs become a separate category when binning (see the sketch after this list)? Or you could simply remove them when binning, although that throws away some information. Either way, that is better than having to delete every row containing at least one NA value; in an extreme situation that could throw away all your data, even though a model can easily be fitted to it.
  2. I just realized this error was caused by including an intercept (a constant column) in the model. Here is an example:

library(lime)
library(caret)

set.seed(123)
x <- as.data.frame(matrix(rnorm(100 * 10), ncol = 10))
x <- cbind(V0 = 1, x,
           V11 = sample(c(0, 1), size = 100, replace = TRUE),
           V12 = sample(c(0, 1), size = 100, replace = TRUE))
y <- round(runif(100))
y.factor <- as.factor(y)
levels(y.factor) <- c("no", "yes")

fitControl <- trainControl(method = "none",
                           verboseIter = TRUE,
                           classProbs = TRUE,
                           allowParallel = TRUE)

caretFit.glm <- train(y = y.factor,
                      x = x,
                      method = "glm",
                      trControl = fitControl)

explain <- lime(x = x, model = caretFit.glm, bin_continuous = TRUE) # Throws error

  3. Yes, I understand this. I would like to be able to manually set how many permutations/samples are used by the algorithm (more samples means more stable results across different seeds).
  4. Thank you, I will try to override xgboost's predict function then.
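
A quick base R sketch of the separate-NA-bin idea from point 1 (purely illustrative):

# Bin a numeric vector at its quartiles and give NAs their own level
v <- c(rnorm(20), NA, NA)
b <- cut(v, breaks = unique(quantile(v, na.rm = TRUE)), include.lowest = TRUE)
b <- addNA(b) # NA becomes an explicit factor level, i.e. its own bin
table(b)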

@thomasp85
Owner

  1. That might be a possible approach - I'll think about it a bit
  2. Thanks
  3. You always have control over the number of permutations using the n_permutations argument in the explainer (see the sketch after this list). And you can control randomness by setting the seed. Unsure what else you need?
  4. Let me know if there are any problems
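
For example (a sketch, where explainer stands for the function returned by lime() and new_obs for a data.frame of observations to explain):

set.seed(1) # fixes the randomness of the permutations
explanation <- explainer(new_obs, n_labels = 1, n_features = 3,
                         n_permutations = 10000) # more permutations, more stable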

@martinju
Contributor Author

Thank you! Regarding 3, I wasn't aware of the n_permutations parameter. That is exactly what I was looking for :) What is the default value?

@thomasp85
Owner

5000

I'll try to improve the docs on this in the future. In general R does not have good support for documenting functions created by other functions...

@martinju
Contributor Author

Thanks!

@negyn

negyn commented Jun 23, 2017

Hi, and thank you for the lime package, it is very interesting to me.
I have some questions: it seems that the current version of the lime package in R doesn't support text and image data, is that right? And apart from random forest, what other kinds of classifiers does it support?

@pommedeterresautee
Contributor

@negyn #6

@belariow

@martinju Hi, I have the same issues 1. and 2. with an XGB model.
Have you found a way around problem 1.?
About problem 2., you found an intercept issue. How do you deal with that in an XGB model, though?
Thanks a lot!

@pommedeterresautee
Contributor

@belariow xgb is handled in the PR. Read #7 for more info (xgb is handled in #6, but that is not a definitive solution).

What do you mean by the intercept issue?

@belariow

@pommedeterresautee The intercept issue is the one that throws the following error:

Error in cut.default(x[[i]], unique(bin_cuts[[i]]), labels = FALSE, include.lowest = TRUE) :
invalid number of intervals

I figured out, just like @martinju, that this is an intercept problem, meaning that it appears whenever one of the columns of your data is constant.

I also have an NA issue: the lime() function does not work if your data has NAs, which is problematic for real data. I don't know how I'll work around that; probably by creating as many "VarXIsNa" variables as needed (a sketch of that idea is below). Do you think lime will be able to handle NAs soon?
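
A rough sketch of that workaround (df stands for your data.frame with numeric columns; the median fill is just an illustrative choice):

# For each column with NAs: add a 0/1 "VarXIsNa" indicator, then fill the
# NAs so downstream binning sees complete data
for (v in names(df)) {
  if (anyNA(df[[v]])) {
    df[[paste0(v, "IsNa")]] <- as.integer(is.na(df[[v]]))
    df[[v]][is.na(df[[v]])] <- median(df[[v]], na.rm = TRUE) # simple impute
  }
}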

Thanks!

@thomasp85
Owner

@belariow Can you open a separate issue for the NA support, as it is hidden within this xgboost thread?

@kwartler

What about something like this within the lime.data.frame function, at line ~66?

case_res <- predict(model, case_perm, type = 'prob')
# Coerce to a data.frame, since caret makes the predictions a data.frame
# but other predict methods do not
if (!is.data.frame(case_res)) {
  case_res <- as.data.frame(case_res)
}

For example:

library(lime)

data("iris")
iris_test <- iris[1, 1:5]
iris_train <- iris[-1, 1:5]

fit <- randomForest::randomForest(Species ~ ., data = iris_train)

# check the class of the predictions to show it's a matrix, not a data.frame
pred.rf <- predict(fit, iris_train, type = 'prob')
class(pred.rf)

# lime
explanation <- lime(iris_train, fit)
explain <- explanation(iris_test, n_labels = 2, n_features = 2)

Just my $0.02, since I am new to this repo and the work. Thx for doing this though!

@pommedeterresautee
Contributor

For instance, for a multiclass XGBoost classification it wouldn't work, as you first need to coerce the predictions to a matrix (one column per class); in that case there are some parameters to add when you call predict.
This is managed in the source code after the merge of #6.
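
A sketch of that coercion (assuming a booster bst trained with objective = "multi:softprob" and num_class classes; the names are illustrative):

# multi:softprob returns a flat vector of length nrow * num_class, row-major
p <- predict(bst, as.matrix(newdata))
p <- matrix(p, ncol = num_class, byrow = TRUE) # one column per class
case_res <- as.data.frame(p)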

@thomasp85
Owner

Can I get you to confirm that the CRAN release works fine for you?

@martinju
Contributor Author

martinju commented Sep 20, 2017

Thanks for the work on this. I really think the new predict_model and model_type functions will be very useful. Reading the help files for these new functions (and the source code), it seems xgboost should work out of the box, as predict_model.xgb.Booster is already defined. However, I don't understand how I should use it. Passing anything other than a data.frame (or character vector) to the lime() function just throws an error. Examples:

library(xgboost)
install.packages("lime")
library(lime)

x <- matrix(rnorm(100 * 10), ncol = 10)
y <- round(runif(100))

# The preferred xgb.DMatrix way
xgbDMatrix.obj <- xgb.DMatrix(data = x, label = y)
mod1 <- xgb.train(data = xgbDMatrix.obj, nrounds = 10, objective = "binary:logistic") # Variant 1, using the xgb.DMatrix format as data input
explain <- lime(x = xgbDMatrix.obj, model = mod1) # Throws error
predict(mod1, xgbDMatrix.obj) # Works
predict(mod1, x) # Works

# The matrix way
mod2 <- xgboost(data = x, label = y, nrounds = 10, objective = "binary:logistic") # Variant 2, using a regular matrix + vector as data input
explain <- lime(x = x, model = mod2) # Throws error
predict(mod2, x) # Works
predict(mod2, xgbDMatrix.obj) # Works

Do you mind correcting my very simple example?

@thomasp85
Owner

You would convert the matrix to a data.frame and then (if needed) convert it back inside predict_model.
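
A rough sketch of that idea (my assumption, not the package's actual implementation; it overrides the methods lime dispatches on and reuses mod2 and x from the example above):

x_df <- as.data.frame(x) # lime wants named data.frame input

# Convert back to a matrix inside predict_model before calling xgboost
predict_model.xgb.Booster <- function(x, newdata, type, ...) {
  p <- predict(x, as.matrix(newdata))
  data.frame(yes = p, no = 1 - p)
}
model_type.xgb.Booster <- function(x, ...) "classification"

explainer <- lime(x_df, mod2)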

@thomasp85
Owner

We might add direct support for matrices, but most models expect data.frame input, so I'm unsure whether we can do it gracefully.

@pommedeterresautee
Contributor

XGBoost expects a matrix in its own format (xgb.DMatrix).
Plus, you may have your own transformations to apply to the permutations (e.g. centering, scaling).
It is up to the user to provide the function that converts to the expected format;
predict_model.xgb.Booster is only for predictions.
Please look at the demo for more information (it uses text data, but the approach works with any kind of data):
https://github.com/thomasp85/lime/blob/master/demo/text_classification_explanation.R

@martinju
Contributor Author

Ok, thank you for the pointers. I got the lime function to pass without error, but the explainer still does not work, as xgboost requires a matrix (or xgb.DMatrix) to perform predictions. I added a pull request to fix this (in addition to handling NAs in the training data). See #37

@thomasp85
Owner

Fixed in a74a2ac

@vikrant-sahu

Setting the seed value eliminates randomness, but if I run it with different seed values the results change. So the problem now is: which result should I trust? If you go by the R-squared, then most of the time the R-squared is nearly the same.
