
Error: All permutations have no similarity to the original observation. Try setting bin_continuous to TRUE and/or increase kernel_size #56

Closed
YonasK95 opened this issue Nov 30, 2017 · 11 comments

Comments

@YonasK95

YonasK95 commented Nov 30, 2017

Hi Thomas, there is an error at the final stage of this analysis. When running the explain() function on an h2o model, I get the following error:
Error: All permutations have no similarity to the original observation. Try setting bin_continuous to TRUE and/or increase kernel_size
I have tried both of the suggestions in the error. If I change bin_continuous to TRUE, lime() does not work, and other kernel sizes do not work either. Any thoughts on how to solve this so that I can get the results with the plot_features() function?
Thanks in advance!

library(dplyr)
library(readxl)
library(httr)
library(h2o)
library(lime)

GET("https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.xlsx",
    write_disk(tf <- tempfile(fileext = ".xls")))
hr_data_raw <- read_xlsx(tf)

hr_data <- hr_data_raw %>%
  mutate_if(is.character, as.factor) %>%
  select(Attrition, everything())

h2o.init()
h2o.no_progress()

hr_data_h2o <- as.h2o(hr_data)
split_h2o <- h2o.splitFrame(hr_data_h2o, c(0.7, 0.15), seed = 1234)
train_h2o <- h2o.assign(split_h2o[[1]], "train") # 70%
valid_h2o <- h2o.assign(split_h2o[[2]], "valid") # 15%
test_h2o  <- h2o.assign(split_h2o[[3]], "test")  # 15%

y <- "Attrition"
x <- setdiff(names(train_h2o), y)
automl_models_h2o <- h2o.automl(
  x = x,
  y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o,
  leaderboard_frame = test_h2o,
  max_runtime_secs = 30)

automl_leader <- automl_models_h2o@leader

explainer <- lime::lime(
  as.data.frame(train_h2o[, -1]),
  model = automl_leader,
  bin_continuous = FALSE)

explanation <- lime::explain(
  as.data.frame(test_h2o[1:10, -1]),
  explainer = explainer,
  n_labels = 1,
  n_features = 4)

Error: All permutations have no similarity to the original observation.
Try setting bin_continuous to TRUE and/or increase kernel_size

# Cannot continue
plot_features(explanation)

@mdancho84
Contributor

I also saw this on StackOverflow here.
https://stackoverflow.com/questions/47568866/combing-lime-and-h2o-in-r-error-when-running-explain

I successfully recreated the issue. I'm not 100% sure why it's occurring. I believe it has something to do with the model_permutations function returning an error due to all(weights[-1] == 0). I need to investigate further.

@thomasp85
Owner

Basically, the error states that all the generated permutations get a similarity (and thus a weight) of zero. If I did not make this check, the ridge regression would fail instead, but with a much more obscure error message. The check is thus fine, but I'll have to look into why your permutations are so strange...
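For context, here is a minimal base-R sketch (not lime's actual internals) of why the check is needed: a weighted regression in which every observation carries weight zero cannot be fitted at all, and the resulting failure message is far less helpful than lime's.

```r
# Sketch: a weighted regression where every permutation has
# similarity weight 0 cannot be fitted (base R, illustrative data).
set.seed(1)
x <- runif(10)
y <- 2 * x + rnorm(10, sd = 0.1)
w <- rep(0, 10)                      # all similarity weights are zero

fit <- try(lm(y ~ x, weights = w), silent = TRUE)
inherits(fit, "try-error")           # TRUE: lm drops zero-weight cases,
                                     # leaving no observations to fit
```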

@mdancho84
Contributor

That’s very odd because when it was working the R squared was pretty high which would lead me to believe the permutations had above zero similarity. I’ll look into it more as well.

@thomasp85
Owner

The problem, in essence, is that if you do not use bin_continuous and have numeric features with very large values, those features will dominate the distance measurement and thus result in very low (or zero) similarity. It's pretty difficult to choose a sensible kernel size in these circumstances, so it is better to bin them (or recode them).
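To illustrate the mechanism, here is a base-R sketch with made-up feature values and kernel width (not lime's exact kernel or defaults): an unscaled feature like monthly income makes even a "nearby" permutation an enormous Euclidean distance away, so an exponential kernel assigns it essentially zero similarity, while the same points scaled by their standard deviations get a sensible similarity.

```r
# Sketch: one large-scale feature dominates the distance and drives
# an exponential kernel similarity to zero. Values are illustrative.
kernel_similarity <- function(x, y, kernel_width = 0.75) {
  d2 <- sum((x - y)^2)          # squared Euclidean distance
  exp(-d2 / kernel_width^2)     # exponential (RBF-style) kernel
}

orig <- c(age = 35, monthly_income = 5000)
perm <- c(age = 36, monthly_income = 5800)   # a "nearby" permutation

kernel_similarity(orig, perm)                # effectively 0: income dominates

sds <- c(age = 9, monthly_income = 4700)     # hypothetical feature sds
kernel_similarity(orig / sds, perm / sds)    # well above 0 once standardized
```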

The problem with this particular example is that it contains numeric features with zero variance which makes the binning fail - I'm working on a fix for that...
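A quick base-R sketch of that failure mode: quantile-based binning needs distinct break points, and a constant column (StandardHours is 80 for every row in this dataset) produces none.

```r
# Sketch: quantile bins collapse for a zero-variance column.
x  <- rep(80, 100)                           # e.g. StandardHours is constant
qs <- quantile(x, probs = seq(0, 1, 0.25))   # every quartile equals 80
length(unique(qs))                           # 1: no usable bin boundaries

bins <- try(cut(x, breaks = qs), silent = TRUE)
inherits(bins, "try-error")                  # TRUE: 'breaks' are not unique
```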

@mdancho84
Contributor

OK, so the H2O algorithm actually performed better on unscaled data, which is why I left it alone. However, in the tutorials I'm working on now I will scale/center the data and remove the zero-variance features. Part of what I was trying to show was the minimal amount of effort that is needed, but I see that in the case of LIME this is not a good strategy. You live and you learn. Thanks for your help.

@thomasp85
Owner

The latest commit lets lime work with zero-variance columns but throws a warning.

@mdancho84
Contributor

This issue was the result of me not realizing that LIME depended on similarity, and therefore requires scaling/centering numeric predictors. LIME works perfectly when, prior to modeling with h2o.automl(), I (1) remove zero variance predictors and (2) scale/center numeric predictors. I think you can close this issue out.

@laresbernardo

I am having the same issue:
"Error: All permutations have no similarity to the original observation. Try setting bin_continuous to TRUE and/or increase kernel_size"
I tried increasing the kernel_width and setting bin_continuous to TRUE (which gives me an error too), and nothing. I also tried with samples of 10 and 100 rows, and still nothing. Quite strange.
I am using an h2o model as well.
BTW: what happened to the h2o.klime function in the most recent h2o versions?

@mdancho84
Contributor

You need to do these preprocessing steps prior to running h2o.automl():

  1. Remove EmployeeNumber and zero variance predictors.
  2. Scale and center the numeric features.

Once you do this, lime works fine. I have not had a chance to update the post.

Using recipes you can try this code:

# Before h2o.init()
library(recipes)

recipe_obj <- hr_data %>%
    recipe(formula = ~ .) %>%
    step_rm(EmployeeNumber) %>%
    step_zv(all_predictors()) %>%
    step_center(all_numeric()) %>%
    step_scale(all_numeric()) %>%
    prep(data = hr_data)
recipe_obj

hr_data <- bake(recipe_obj, newdata = hr_data)

# Then do the h2o.init()

@laresbernardo

Thanks for the super fast reply, Matt.
Actually, I wasn't using the attrition dataset but a personal one (about 9,924 rows and 32 columns). I trained an h2o model on that data and was trying to get the lime explanation for some of the predictions to understand the results better, but I get that error when running the explain function.
I do not have any zero-variance predictors (in my training and testing sets) or ID columns.
By scale and center, do you mean converting all numerical columns into 0-1 continuous (logarithmic if needed) numbers?

@mdancho84
Contributor

Yes, any numeric features with large value ranges will cause issues with the similarity measure. They need to be standardized to prevent the error and to have the lime analysis work properly.
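To be precise about what "standardized" means here (it is centering to mean 0 and scaling to sd 1, not a 0-1 rescaling), a minimal base-R equivalent of step_center() plus step_scale() on a toy frame:

```r
# Sketch: standardize numeric columns with base R's scale().
df <- data.frame(monthly_income = c(2000, 5000, 19000),
                 age            = c(25, 40, 58))
df_std <- as.data.frame(scale(df))   # subtract column means, divide by sds

round(colMeans(df_std), 10)          # each column now has mean 0
round(apply(df_std, 2, sd), 10)      # and standard deviation 1
```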
