Tokenizers package #33
Comments
Thanks for the submission! Looking for a reviewer.

Reviewers: @kevinushey
**General Comments**

Overall, everything is neat, tidy, and well put together. Nice! Most of my comments are stylistic, especially re: the C++ code included within. The near-100% test coverage is great to see indeed! I can't comment as well on the utility / scope of the package, since I'm not as familiar with NLP, but overall the package feels quite well focused.

**Tests**

I ran the tests.

**Comments**

Please add a vignette, for potential users like me who are less familiar with NLP in general. :)
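A coverage figure like the one mentioned above can be reproduced with the covr package; a minimal sketch, assuming the repository is checked out locally as the working directory:

```r
library(covr)

# Run the package's test suite and record which lines were executed.
cov <- package_coverage(".")
percent_coverage(cov)
```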
@kevinushey: Thanks very much for this detailed code review. As you doubtless surmised, I'm at the very beginning of learning C++ and Rcpp, so I appreciate the clear guidance about how to improve that code. I didn't know that testthat automatically sourced helper files. I agree with all the changes listed here, and I'll make them soon. (Unfortunately, probably not till the first week of May.) Then I'll ping this issue again.
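For reference, the testthat behavior mentioned here: any file under `tests/testthat/` whose name starts with `helper` is sourced automatically before the tests run, so shared fixtures can live there. A minimal sketch with a hypothetical file name and contents:

```r
# tests/testthat/helper-fixtures.R
# testthat sources this file before running the test files.
sample_doc <- "This is a sentence. So is this one."
```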
@kevinushey thank you for the review. I can say that I'm the author of the C++ code in question, including the part related to the unordered sets and maps.
@lmullen Anything we can help with?
Thanks for the ping, @sckott. I just need to get around to fixing @kevinushey's very straightforward list of changes (thanks again!). I'm a bit out of my depth with some of the C++. But let me take a stab at most of them, and I'll check in with @dselivanov on the C++ parts that he wrote.
Thanks, sounds good.
I've made fixes for most of the issues that @kevinushey pointed out. Here is a line-by-line accounting.

- Added a vignette: ropensci/tokenizers@4a2ce76
- Fixed: ropensci/tokenizers@023f801
- Fixed: ropensci/tokenizers@b16b146
- I don't think we would ever overflow an int. In skip grams, n and k (which becomes i) are usually in the single digits (see the sketch below).
- Fixed: ropensci/tokenizers@b7be1d4
- I didn't know testthat automatically sourced helper files.
- Fixed: ropensci/tokenizers@efa0605
- More descriptive title: ropensci/tokenizers@4c49a62
- @dselivanov has already responded to this issue.

That leaves these two issues. Since they deal with code that @dselivanov wrote, can you take a look at them please, Dmitriy?

Assuming that the fixes above and Dmitriy's responses to the unordered sets and maps questions are adequate, that leaves one stylistic question and one substantive question about long vectors. Is the package ready, or do you want to wait till Dmitriy looks at those two? I'm happy either way.
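To make the skip-gram point concrete, a minimal sketch using the package's exported API rather than the internal C++ (the example sentence is mine, not from the thread):

```r
library(tokenizers)

# n is the maximum number of words per n-gram and k is the maximum skip
# distance between words. Both are typically single digits in practice,
# which is why overflowing an int is not a realistic concern.
tokenize_skip_ngrams("the quick brown fox jumps over the lazy dog",
                     n = 3, k = 1)
```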
Thanks @lmullen. For those two outstanding issues, can you open issues for them in your repo? Then we can consider this approved. I think you know what to do from here :)
@sckott: I've added them as issues.

Thanks!
I transferred ownership to rOpenSci. Hope that's okay.
@lmullen I will have a look on Sunday or Monday.
Sounds good, thanks. Closing now, and thanks again for submitting! Would you be interested in writing a post about this for our blog?
Sure, sounds like a good idea. It will likely be pretty short. How about I send it to you by the end of next week, to give me time to get a new version on CRAN?
Sounds good. And feel free to say no.
It would be nice if more packages relied on tokenizers, so a little publicity couldn't hurt.
True, it would get more eyes on it, and hopefully uses in other packages.
**What does your package do?**

Provides tokenizers for natural language.

**URL for the package:**

https://github.com/lmullen/tokenizers

**What data source(s) does it work with?**

Natural language text

**Who is the target audience?**

Users of R packages for NLP
Virtually every R package for NLP implements a couple of tokenizers. The point of this package is to collect all the tokenizers that one could conceivably want to use in a single package, and to make sure that all of them have a consistent interface. The package also aims to provide fast and correct tokenizers implemented on top of stringi and Rcpp.
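To make the "consistent interface" claim concrete, a small illustration of my own (not taken from the submission): each tokenizer takes a character vector and returns a list of character vectors, one element per input document.

```r
library(tokenizers)

docs <- c("First document. It has two sentences.",
          "Second document.")

tokenize_words(docs)      # word tokens, one list element per document
tokenize_sentences(docs)  # sentences, one list element per document
```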
**Does the package have a vignette?**

The package does not contain a vignette, but it does contain an extensive README. Since all the tokenizers work in the same basic way, a vignette seems unnecessary. But if rOpenSci wants one, I can easily adapt the README into a standalone vignette.
This package is already on CRAN.
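Installation accordingly follows the usual pattern (a sketch; the GitHub path assumes the post-transfer ropensci/tokenizers repository mentioned above):

```r
# Release version from CRAN:
install.packages("tokenizers")

# Development version from GitHub:
devtools::install_github("ropensci/tokenizers")
```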