Add text support #6

pommedeterresautee · 2017-06-26T20:47:25Z

Add text support
Text data
Demo for text explanation

change data name

documentation of data

refactoring

thomasp85

Thanks a bunch. Overall it looks good!

Can I get you to address the comments as well as making sure it passes travis?

thomasp85 · 2017-06-27T09:30:30Z

DESCRIPTION

+    hrbrthemes,
+    magrittr,
+    purrr,
+    e1071


I can't see where e1071 is used. Why move it from Suggests?

It relates to some code I have not pushed yet. That code is not ready at all, so I will move it back to Suggests.

thomasp85 · 2017-06-27T09:32:46Z

R/character.R

+                            labels = labels, n_labels = n_labels, n_features = number_features_explain,
+                            feature_method = feature_selection_method)
+}
+


The lime method should return a new function rather than do the explanation directly. It does not make much sense for text data but does for e.g. tabular data, and we need to keep the interface consistent. The arguments of the returned function should be the same across methods...

I understand the consistency requirement.
Can you tell me why you do that for data.frame? Is it to remember some preprocessing?

Yep - It needs to be trained on the training data in order to make sensible permutations. Also it makes it possible to do some one-off computations once and not every time an explanation is required...
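The factory idea described in this thread can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the package's code: `make_explainer`, its arguments, and the signature of the returned function are hypothetical names chosen for the example. The point is that statistics computed once from the training data are captured in a closure and reused on every later call.

```r
# Hypothetical sketch of the factory pattern; all names are illustrative.
make_explainer <- function(train, model) {
  # One-off computation on the training data, done at creation time.
  col_means <- vapply(train, mean, numeric(1))
  col_sds   <- vapply(train, stats::sd, numeric(1))

  # The returned closure remembers col_means, col_sds and model.
  function(cases, n_permutations = 100) {
    lapply(seq_len(nrow(cases)), function(i) {
      # Draw permutations from the training distribution of each feature.
      perms <- mapply(
        function(m, s) stats::rnorm(n_permutations, mean = m, sd = s),
        col_means, col_sds
      )
      perms <- as.data.frame(perms)
      list(case = cases[i, , drop = FALSE],
           predictions = stats::predict(model, perms))
    })
  }
}

fit <- stats::lm(mpg ~ wt + hp, data = mtcars)
explain <- make_explainer(mtcars[, c("wt", "hp")], fit)
result <- explain(mtcars[1:2, c("wt", "hp")], n_permutations = 50)
```

Calling `explain()` again on other rows reuses the stored means and standard deviations without touching the training data.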

I am not super comfortable with the factory pattern in R, it feels much more like a Java thing :-)
Would it be a possible solution to use something like https://cran.r-project.org/web/packages/memoise/ ? You do the costly preprocessing, cache its result, and return the final result. When the user calls the function again and it requires the same preprocessing, there is no costly computation this time; the result is just retrieved from the cache. It is just a suggestion, I have not worked on the data.frame part at all.

That would work fine for the time consuming steps, but would not solve the need to pre-train the explainer.

I'm not super excited about that approach, as it also changes the API (instead of different calling conventions, it is different calling progressions).

An alternative would be to have an explicit create_explainer method that returns a new lime object wrapping both the model and any preprocessing. This object is then passed on to the lime method..? I could definitely see the benefits of this...

I think the best approach right now is to make the text version compliant with the current approach and then I'll try to come up with an alternative approach in a new branch

Finally, I moved all the parameters into the main function and, as you can see in the demo, I use currying to avoid the double call! I will need to put this in the documentation, but I think it's quite OK that way.
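A minimal sketch of the idea in the last comment, using base R only. What the comment calls currying is, strictly speaking, closer to partial application: some arguments are fixed up front and the rest are supplied later. `explain_text` and `partial` are hypothetical helpers written for this example; they are not functions from the package or the demo.

```r
# Hypothetical stand-in for a function that takes every parameter at once.
explain_text <- function(text, model, n_permutations, n_features) {
  sprintf("%s | model=%s | permutations=%d | features=%d",
          text, model, n_permutations, n_features)
}

# Minimal partial-application helper: fix some arguments now,
# supply the remaining ones when the returned function is called.
partial <- function(f, ...) {
  fixed <- list(...)
  function(...) do.call(f, c(list(...), fixed))
}

explainer <- partial(explain_text, model = "my_model",
                     n_permutations = 5000L, n_features = 3L)
out <- explainer("some document")
```

The caller ends up with a one-argument function, which matches the shape of a factory-produced explainer without a separate creation step.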

thomasp85 · 2017-06-27T09:34:23Z

R/character.R

+         predict(model, data, type = "prob")
+  )
+}
+


I've changed the infrastructure around calling predict a bit, which makes this redundant, but let's merge it in and we can update the text implementation to match tabular after that...

Is it an external function?
It's a way to let people use any algorithm and not just caret (even if there is no predict function).

Yep (will push later today). Basically, I export a predict_model generic that people can define for their model class (making it possible to support models with non-conforming predict methods). The default simply calls predict with the right type argument, but I've also included a method for mlr.
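The generic described here could look roughly like the S3 sketch below. The argument list of `predict_model` is an assumption made for the example (the generic had not been pushed when this comment was written), and `keyword_model` is a toy class invented to show a model that has no `predict` method at all.

```r
# Hypothetical sketch: the real generic's signature may differ.
predict_model <- function(x, newdata, type, ...) {
  UseMethod("predict_model")
}

# Default: defer to predict(), asking for the requested output type.
predict_model.default <- function(x, newdata, type, ...) {
  stats::predict(x, newdata = newdata, type = type, ...)
}

# A toy model class with no predict() method.
keyword_model <- structure(list(keyword = "good"), class = "keyword_model")

# Users supply their own method to make such a model usable.
predict_model.keyword_model <- function(x, newdata, type, ...) {
  hit <- grepl(x$keyword, newdata, fixed = TRUE)
  data.frame(positive = as.numeric(hit), negative = as.numeric(!hit))
}

probs <- predict_model(keyword_model,
                       c("a good film", "a dull film"),
                       type = "prob")
```

Any model with a conforming `predict` method falls through to the default, so only unusual model classes need a dedicated method.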

NOTHING TO DO (FOR NOW)

thomasp85 · 2017-06-27T09:36:25Z

R/permute_cases.R

+      set_names(d, nm = .)})
+
+  dict_size <- length(tokens)
+


Personal preference is to avoid %<>% inside packages, as it is a bit less clear what's going on (it is not a clear assignment and R itself doesn't understand it as such). Can we stick to the %>% pipe?

Yep, you are right, there is no big gain in doing it that way. Will refactor that.

thomasp85 · 2017-06-27T09:37:33Z

R/permute_cases.R

}
+

Can you walk me through the new implementation of the permutation? Is it just a rewrite doing the same thing or is there any change in the fundamentals of the approach?

So far it's a rewrite, done in a way I can easily manage for a future refactoring. When I try a crazy number of permutations (500K), this part is the slowest by far (40s on my i7). I want to redo it using Rcpp, which is why I divided the code that way.

NOTHING TO DO (RIGHT NOW)

pommedeterresautee · 2017-06-27T09:51:37Z

Regarding Travis, I will refactor until there are no more Notes / Warnings. Nothing crazy anyway.

Edit: DONE

fix warning + notes

pommedeterresautee · 2017-06-27T15:01:52Z

FYI, I use lintr::lint_package() to clean the code style.

Introduce RCPP

Add tests

improve BoW

Replace stringdist function by custom Rcpp one. Better horizontal (matrix col) scalability

update gitignore

thomasp85

Can you give me a heads up when you feel like you're done with the PR, then I'll give it a proper review

thomasp85 · 2017-07-05T08:40:39Z

R/character.R

+  expect_true(feature_selection_method %in% feature_selection_method())
+  expect_gte(number_features_explain, 1)
+  expect_gte(n_permutations, 1)
+  expect_gte(kernel_width, 1)


Please use assertthat for defensive programming. testthat is for unit tests only

Done in the last commit
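For illustration, runtime argument checks of the kind requested above might look like the following with assertthat. `check_args` is a hypothetical helper, not the function from the PR; the parameter names and thresholds mirror the diff, and `assert_that`, `is.count` and `is.number` are existing assertthat functions. The sketch assumes the assertthat package is installed.

```r
library(assertthat)

# Hypothetical argument checker; parameter names mirror the diff above.
check_args <- function(number_features_explain, n_permutations, kernel_width) {
  # is.count() accepts a single positive whole number.
  assert_that(is.count(number_features_explain))
  assert_that(is.count(n_permutations))
  # is.number() accepts a single numeric value.
  assert_that(is.number(kernel_width), kernel_width >= 1)
  invisible(TRUE)
}

check_args(number_features_explain = 5, n_permutations = 1000, kernel_width = 25)

# A bad value raises an ordinary R error that callers can catch.
msg <- tryCatch(
  check_args(number_features_explain = 0, n_permutations = 1000, kernel_width = 25),
  error = function(e) conditionMessage(e)
)
```

Unlike testthat expectations, these checks do not need a test reporter, so they work inside package functions at run time.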

thomasp85 · 2017-07-05T08:45:23Z

R/lime.R

@@ -46,7 +46,7 @@
 #'
 #' @export
 lime <- function(x, model, ...) {
-  UseMethod('lime')
+  UseMethod("lime")


Please don't change my coding style :-)

Sorry, you are right, I followed the advice of the linter...
Do you want me to put single quotes everywhere?

Replace testthat by assertthat

pommedeterresautee · 2017-07-05T16:31:55Z

@thomasp85 I think it's ready for a review.

In detail:
The 2 Rcpp functions brought a huge performance improvement when working with documents of >1000 tokens.
In particular, I was surprised by the performance of stringdist, which is not very good for large matrices (>1e3, >1e6).
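The commit list for this PR mentions fixing the distance to cosine. As a rough illustration of what such a distance computes, here is a base R sketch on a toy binary token matrix. It is not the Rcpp code from the PR: `cosine_to_first` and the data are made up for the example, and the assumption that the first row represents the original document is mine.

```r
# Rows are permuted documents, columns are tokens (1 = kept, 0 = removed).
# Row 1 plays the role of the original document. Toy data only.
perms <- rbind(
  c(1, 1, 1, 1),
  c(1, 1, 0, 0),
  c(0, 0, 1, 1),
  c(1, 1, 1, 0)
)

# Cosine distance from every row to the first row, computed with one
# matrix-vector product instead of a loop over pairs of rows.
# Note: a row containing only zeros would yield NaN (division by zero).
cosine_to_first <- function(m) {
  ref <- m[1, ]
  dots <- as.vector(m %*% ref)
  norms <- sqrt(rowSums(m^2))
  1 - dots / (norms * norms[1])
}

d <- cosine_to_first(perms)
```

Rows that keep more of the original tokens end up closer to the original, so row 4 has a smaller distance than row 2.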

The text explanations returned are less rich than those for data.frame. I will add more information once the performance optimizations are finished. Even if light, they are usable IRL.

Finally, I realized that trees bring something interesting: a limit on the number of features to use, without requiring multiple steps or choosing just the biggest weights => better performance for long texts with many permutations! I have some code at the end of the demo file. I am just waiting for the new API you told me you are going to introduce in the package.

thomasp85 · 2017-07-11T08:56:12Z

If it is ok with you I'll merge it in and any improvements and changes will happen in a new branch - ok?

pommedeterresautee · 2017-07-11T12:44:17Z

Yes, I think it's good!

thomasp85 · 2017-07-11T12:46:13Z

Thanks. Will begin to do some small refactoring to make it fit into my local branch and then push

pommedeterresautee added 13 commits June 26, 2017 20:29

add text support

e066761

update description

c40fe0c

cleaning

bad6b65

add data

967e983

delete

c44d08e

add demo

6bf603f

change data name

change name

cc0e3b2

update data

27bad6c

refactoring

32e965b

documentation of data

fix demo

18d894c

Merge branch 'master' of github.com:pommedeterresautee/lime

c20b7fe

Add more documentation

7e0bced

refactoring

ed6a84b

This was referenced Jun 26, 2017

scope of lime package in R #5

Closed

Using lime() on xgboost object #1

Closed

clean data

2a9681c

refactoring

thomasp85 reviewed Jun 27, 2017


pommedeterresautee added 6 commits June 27, 2017 15:19

remove warnings

dadad02

Refactoring

2253f27

fix warning + notes

remove %<>%

8a84b25

fix

e8898d2

documentation

3a47d74

documentation

faa486f

pommedeterresautee and others added 5 commits June 27, 2017 23:00

check parameters

fefb9b9

documentaton

22dcb6f

test parameters

985be44

improve keep unique words parameter

affd068

small refactoring

49a8f5b

pommedeterresautee added 15 commits June 30, 2017 17:28

Improve permutation perf by X30 !

bd218dc

Introduce RCPP

rename files

f80eb52

Change exposition

ed68317

Improve permutations

5bd25ad

Add tests

fix permutation order

3972941

fix dependencies

19390a4

improve BoW

fix

98101d4

refactoring

ca62e71

Distance perf X10

551905f

Replace stringdist function by custom Rcpp one. Better horizontal (matrix col) scalability

refactoring

9121227

Fix a note on Windows compilation

04dfe1b

dynamic loading

fc68679

Add description

4a99555

update gitignore

remove not needed parameter (fix distance to cosine)

210321a

refactoring

ce2d5c9

thomasp85 reviewed Jul 5, 2017


pommedeterresautee added 2 commits July 5, 2017 18:15

Fix test

13cf2b9

Replace testthat by assertthat

fix

a7a2286

testthat is just suggested

b615589

pommedeterresautee mentioned this pull request Jul 9, 2017

Phrase detection dselivanov/text2vec#99

Closed

thomasp85 merged commit b19b337 into thomasp85:master Jul 11, 2017
