Tokenizers package #33
- What it does: Provides tokenizers for natural language.
- What data it works with: Natural language text.
- Target audience: Users of R packages for NLP.
Virtually every R package for NLP implements a couple of tokenizers. The point of this package is to collect all the tokenizers one could conceivably want into a single package and to ensure that they all share a consistent interface. The package also aims to provide fast and correct tokenizers implemented on top of stringi and Rcpp.
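To make the "consistent interface" point concrete, here is a minimal sketch using the package's exported `tokenize_*` functions (exact defaults may differ by version): each tokenizer takes a character vector and returns a list of character vectors, one element per input document.

```r
library(tokenizers)

docs <- c("The quick brown fox.", "It jumps over the lazy dog.")

# Every tokenizer shares the same shape: character vector in,
# list of character vectors out (one element per document).
tokenize_words(docs)            # word tokens, lowercased by default
tokenize_sentences(docs)        # sentence tokens
tokenize_ngrams(docs, n = 2)    # bigrams
```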
The package does not contain a vignette, but it does contain an extensive README. Since all the tokenizers work in the same basic way, a vignette seems unnecessary. But if rOpenSci wants one, I can easily adapt the README into a standalone vignette.
This package is already on CRAN.
Overall, everything is neat, tidy, and well put together. Nice! Most of my comments are stylistic, especially regarding the included C++ code. The near-100% test coverage is great to see!
I can't comment as well on the utility or scope of the package, since I'm not as familiar with NLP, but overall the package feels quite well focused.
Please add a vignette, for potential users like me who are less familiar with NLP in general. :)
@kevinushey: Thanks very much for this detailed code review. As you doubtless surmised, I'm at the very beginning of learning C++ and Rcpp, so I appreciate the clear guidance on how to improve that code. I didn't know that testthat automatically sourced helper files.
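(For readers who, like me, didn't know this: testthat sources any file in `tests/testthat/` whose name starts with `helper` before running the tests, so shared fixtures don't need explicit `source()` calls. A hedged sketch, with a hypothetical helper file and test:)

```r
# tests/testthat/helper-fixtures.R (hypothetical helper file):
# because the file name starts with "helper", testthat sources it
# automatically before any tests run.
sample_docs <- c("One sentence here.", "And another one here.")

# tests/testthat/test-words.R: the test can use sample_docs directly,
# with no explicit source() call.
test_that("tokenize_words returns one element per document", {
  out <- tokenize_words(sample_docs)
  expect_length(out, length(sample_docs))
})
```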
I agree with all the changes listed here, and I'll make these changes soon. (Unfortunately, probably not till the first week of May.) Then I'll ping this issue again.
@kevinushey thank you for the review. I should probably say that I'm the author of the C++ code in question.
One thing related to
I've made fixes for most of the issues that @kevinushey pointed out. Here is a line-by-line accounting:

- Added a vignette: ropensci/tokenizers@4a2ce76
- I don't think we would ever overflow an int: in skip-grams, n and k (which becomes i) are usually in the single digits (see the sketch after this list).
- I didn't know testthat automatically sourced helper files.
- More descriptive title: ropensci/tokenizers@4c49a62
- @dselivanov already responded to this issue.
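To make the skip-gram point concrete, a small sketch (parameter names match the exported `tokenize_skip_ngrams()` function; typical values really are single digits, so the index arithmetic stays far below the int limit):

```r
library(tokenizers)

# n = size of each gram, k = maximum number of tokens skipped between
# gram members; both are typically single digits in practice.
tokenize_skip_ngrams("the quick brown fox jumps over", n = 2, k = 1)
```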
That leaves these two issues. Since they deal with code that @dselivanov wrote, can you take a look at them please, Dmitriy?
Assuming that the fixes above and Dmitriy's response to the unordered sets and maps question are adequate, that leaves one stylistic question and one substantive question about long vectors.
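(Background on the long-vector question, as I understand it: modern R allows vectors longer than the 32-bit integer limit, so C++ code that indexes them with a plain int can overflow. A quick R illustration of the limit itself, not of the package's code:)

```r
# The 32-bit limit that a plain C++ int shares:
.Machine$integer.max        # 2147483647

# Integer arithmetic past the limit overflows to NA (with a warning):
.Machine$integer.max + 1L   # NA
```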
Is the package ready, or do you want to wait till Dmitriy looks at those two? I'm happy either way.
It would be nice if more packages relied on tokenizers, so a little