Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add text support #6

Merged
merged 43 commits into from
Jul 11, 2017
Merged

Add text support #6

merged 43 commits into from
Jul 11, 2017

Conversation

pommedeterresautee
Copy link
Contributor

Add text support
Text data
Demo for text explanation

This was referenced Jun 26, 2017
refactoring
Copy link
Owner

@thomasp85 thomasp85 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a bunch. Overall it looks good!

Can I get you to address the comments as well as making sure it passes travis?

DESCRIPTION Outdated
hrbrthemes,
magrittr,
purrr,
e1071
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see where e1071 is used? Why move it from suggest?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was in relation with some code I have not pushed. But I am not ready at all, I will move it back to suggest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

R/character.R Outdated
labels = labels, n_labels = n_labels, n_features = number_features_explain,
feature_method = feature_selection_method)
}

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lime method should return a new function rather than do the explanation directly. It does not make much sense for text data but do for e.g. tabular data, and we need to keep the interface consistent. The arguments of the returned function should be the same across methods...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the consistency requirement.
Can you tell me why you do that for data.frame? Is it to memorize some pre treatments?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep - It needs to be trained on the training data in order to make sensible permutations. Also it makes it possible to do some one-off computations once and not every time an explanation is required...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not super comfortable with the factory pattern on R, much more a Java stuff :-)
Would it be a possible solution to use something like https://cran.r-project.org/web/packages/memoise/ ? Like you do the costly pre treatment, cache the result of the pre treatment, and return the final result. The user recall the function which requires the same pre treatment, this time, there is no costly computation, just retrieve stuff from cache? It is just a suggestion, I have not worked on data.frame part at all.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would work fine for the time consuming steps, but would not solve the need to pre-train the explainer.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super exited about that approach as it also adds changes to the API (instead of different calling conventions it is different calling progressions)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively would be to have an explicit create_explainer method that returns a new lime object that wraps both the model and any preprocessing. This object is then passed on to the lime method..? I could definitely see the benefits of this...

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best approach right now is to make the text version compliant with the current approach and then I'll try to come up with an alternative approach in a new branch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, I moved all the parameters in the main function, and as you can see in the demo, I use currying to avoid double call ! Will need to put stuff in documentation, but I think it's quite ok that way

R/character.R Outdated
predict(model, data, type = "prob")
)
}

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the infrastructure around the calling predict a bit, which makes this redundant, but lets merge it in and we can update the the text implementation to match tabular after that...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it an external function?
it s a way to make people able to use any algo and not just carret. (even if there is no predict function)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep (will push later today). Basically I export a predict_model generic that people can define for their model class (making it possible to support models with non-conforming predict methods. The default simply calls predict with the right type argument, but I've also included a method for mlr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTHING TO DO (FOR NOW)

set_names(d, nm = .)})

dict_size <- length(tokens)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personal preference is to avoid %<>% inside packages as it is a bit less clear whats going on (it is not a clear assignment and R itself doesn't understand it as such). Can we keep it to using the %>% pipe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep you are right, there is no big gain in doing that way. Will refactor that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

}

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you walk me through the new implementation of the permutation? Is it just a rewrite doing the same thing or is there any change in the fundamentals of the approach?

Copy link
Contributor Author

@pommedeterresautee pommedeterresautee Jun 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far it s a rewrite in a way I can easily manage for a future refactoring. When I try crazy number of permultations (500K) this part is the slowest by far (40s on my i7), I want to remake it using Rcpp, that's why I divided the code that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTHING TO DO (RIGHT NOW)

@pommedeterresautee
Copy link
Contributor Author

pommedeterresautee commented Jun 27, 2017

Regarding Travis, I will refactor until there is no more Notes / Warning. Nothing crazy anyway

Edit: DONE

@pommedeterresautee
Copy link
Contributor Author

FYI, I use lintr::lint_package() to clean the code style.

Copy link
Owner

@thomasp85 thomasp85 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give me a heads up when you feel like you're done with the PR, then I'll give it a proper review

R/character.R Outdated
expect_true(feature_selection_method %in% feature_selection_method())
expect_gte(number_features_explain, 1)
expect_gte(n_permutations, 1)
expect_gte(kernel_width, 1)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use assertthat for defensive programming. testthat is for unit tests only

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in the last commit

R/lime.R Outdated
@@ -46,7 +46,7 @@
#'
#' @export
lime <- function(x, model, ...) {
UseMethod('lime')
UseMethod("lime")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't change my coding style :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you are right, I followed the advises of linter...
Do you want me to put single quotes everywhere?

Replace testthat by assertthat
@pommedeterresautee
Copy link
Contributor Author

pommedeterresautee commented Jul 5, 2017

@thomasp85 I think it s ready for a review.

In details :
The 2 Rcpp functions brought huge perf improvement when working with >1000 tokens documents.
In particular, I have been surprised by the perf of stringdist which are not very good for large matrix (>1e3, >1e6).

Text explanations returned are less rich than those for data.frame. I will add info when perf optimizations will be finished. Even if light, they are usable IRL.

Finally, I realized that trees bring something interesting : a limit on the number of features to use without requiring multiple steps or choosing just the biggest weights => better perf for long text with many permutations! I have some code at the end of the demo file. Just waiting for the new API you told me you are going to introduce in the package.

@thomasp85
Copy link
Owner

If it is ok with you I'll merge it in and any improvements and changes will happen in a new branch - ok?

@pommedeterresautee
Copy link
Contributor Author

Yes I think it's good !

@thomasp85 thomasp85 merged commit b19b337 into thomasp85:master Jul 11, 2017
@thomasp85
Copy link
Owner

Thanks. Will begin to do some small refactoring to make it fit into my local branch and then push

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants