gutenbergr -- download and processing public domain works from Project Gutenberg #41
Thanks for your submission @dgrtwo - seeking reviewers now.
Reviewers: @masalmon
General comments

This package offers access to a playground of data for text mining, which is awesome, and the outputs are data.frames that one can use with dplyr. The package works as expected on my PC with Windows and provides tools for filtering works and then downloading them. The examples and vignette help to get a good overview of the functionality. The package is already very good, so I feel very picky writing comments!

R CHECK

When running R CMD check I get a note, "found 13591 marked UTF-8 strings", but it is already mentioned in cran-comments.md.

Spell check

Using

Continuous integration

Why not add AppVeyor CI too?

Unit tests
gutenbergr Coverage: 89.47%
R/gutenberg_download.R: 86.05%
R/gutenberg_works.R: 100.00%
R/utils.R: 100.00%

So why not add a badge to show this?
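For reference, a minimal sketch of checking coverage locally with covr, run from the package directory (the commented upload line is the kind of step that would back a badge on CI):

```r
# install.packages("covr")
library(covr)

# Compute test coverage for the package in the current working directory;
# printing the object shows overall and per-file percentages like those above
cov <- package_coverage()
cov

# On CI, uploading the results to a service such as Codecov backs a badge
# codecov(coverage = cov)
```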
data-raw folder, data preparation

In the readme of this folder you mention: "This set of scripts produces the three .rda files in the data directory. (...) There are no guarantees about how often the metadata will be updated in the package."
Functionality
Documentation
Vignette
Links to other packages

At the end of the vignette/readme you mention
I sincerely thank Maëlle both for the helpful comments and the kind words about the gutenbergr package: I've learned a lot from this advice! I'm also grateful it was received so quickly! 👍 👍

It's worth noting I made one major change not in response to feedback: switching the package license from MIT to GPL-2. (I realized the Project Gutenberg catalog is released under the GPL, so I should distribute it only under a GPL-compatible license.) All of this is in the NEWS.md file.

Continuous Integration
Unit tests
data-raw folder, data preparation
I don't know how often the metadata of old books is changed (I would guess quite rarely, and probably very rarely for classic books). I do know that books are added to the collection every day, along with their metadata. For example, as of this writing about 23 books were published yesterday, May 3. While there is a feed for books added yesterday, there is no such feed for additions over a longer period or for edits, nor does the daily feed contain the full metadata I would need for an update.
The problem is that the scripts take about 30 minutes to run (on my machine) and require downloading a ~60 MB file and unpacking it to ~750 MB. This isn't a small ask in terms of resources, so I can't think of a system I could use to automate it. The main advantage of regular updates would be including the new e-books released each day. The reason I don't see this as a priority is that I think the vast majority of analyses will be of a small fraction of very popular books, most of which were added long ago. Just as any dataset will be incomplete to some degree, I'm OK treating this one as "frozen" every few months. Of course, I'm open to suggestions for easy automation!
Great point about making the dataset explicit: I've added an attribute to each dataset, along the lines shown below.
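A sketch of what checking that attribute looks like, assuming it is named "date_updated" and is attached to each bundled dataset (the names are my understanding, not a quote from the thread):

```r
library(gutenbergr)

# The attribute records when the bundled catalog snapshot was last refreshed
attr(gutenberg_metadata, "date_updated")
attr(gutenberg_authors, "date_updated")
attr(gutenberg_subjects, "date_updated")
```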
This fact is also noted in the documentation. I'm not sure about mentioning it near the installation instructions, however, simply because I don't know how much more often the dataset will be updated on GitHub than on CRAN. I'm sure I'll update it before each CRAN release.

Functionality
This was a hard UI decision for me as well. The main issue is that I noticed I was doing the same kind of metadata filtering on almost every call, such that I allowed filtering conditions to be included. I considered having the behavior

To help with this ambiguity, I have made two adjustments:
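As background, a sketch of the repeated idiom and the resulting shortcut, assuming dplyr-style conditions are passed through to gutenberg_works() and the metadata columns are as described in the package (the Austen example is illustrative, not taken from the thread):

```r
library(gutenbergr)
library(dplyr)

# The idiom repeated on almost every call: filter the metadata by hand,
# remembering to keep only English works that actually have text
austen <- gutenberg_metadata %>%
  filter(author == "Austen, Jane", language == "en", has_text)

# The shortcut: pass the filtering condition directly, with the common
# defaults (language, text availability, deduplication) applied for you
austen_works <- gutenberg_works(author == "Austen, Jane")
```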
Great point. I have made two changes:
and request works in both English and French with:
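A minimal sketch of such calls, assuming `languages` and `all_languages` arguments in gutenberg_works() (the argument names are my assumption about the final interface):

```r
library(gutenbergr)

# Works available in either English or French
either_language <- gutenberg_works(languages = c("en", "fr"))

# Works available in both English and French
both_languages <- gutenberg_works(languages = c("en", "fr"), all_languages = TRUE)
```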
The examples and tests give the full details of the behavior. It's possible that there are corner cases (e.g. either English or French, but not both) that will require additional, minimal processing on the part of the user.
This is true but I'm trying to follow the "Good practice" section of
Documentation
I've removed all the periods at the end of these lines. I generally shy away from documenting the class of every argument, since I think it clutters the documentation, and anyone would look at the class before doing anything interesting with it anyway.
Fixed
I have added an example. The main reason I export it is to be especially transparent about how text stripping is done. It also helps users who want to pass strip = FALSE, check or preprocess some of the texts, and then use gutenberg_strip() on them afterwards (see the sketch below).
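A minimal sketch of that workflow, assuming the strip argument of gutenberg_download() and the exported gutenberg_strip() behave as described above (the book ID is an arbitrary choice):

```r
library(gutenbergr)

# Download a work without removing the Project Gutenberg header/footer,
# so the raw text can be inspected or preprocessed first
raw_book <- gutenberg_download(768, strip = FALSE)

# ...inspect or preprocess raw_book$text here...

# Then apply the same stripping the package would otherwise have done
clean_text <- gutenberg_strip(raw_book$text)
```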
I've now clarified this: "Although the Project Gutenberg raw data also includes metadata on contributors, editors, illustrators, etc., this dataset contains only people who have been the single author of at least one work."
Agreed, each time I mention tbl_df I have now changed it to
Done: added info and link to ISO-639.
I've clarified this to:
Done, added to Details in the gutenberg_subjects Rd.
Comment has been removed.

Vignette
Not in the examples I've seen; e.g. none of the dplyr vignettes, or the vignettes of rOpenSci packages I've looked at.
Absolutely, fixed
Done: added the "includes" section and switched the order; I agree that it makes sense. The exception is that I left the
I considered this, but some of the tables end up being so wide that they don't look good in HTML output anyway. Furthermore, almost every line would require knitr::kable, and if users aren't familiar with the kable function they may think it's necessary to use it as well, or at the very least get distracted from the package functions.
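For concreteness, this is roughly the pattern each chunk would need (the table shown is an arbitrary choice):

```r
library(gutenbergr)
library(knitr)

# Every printed table in the vignette would be wrapped like this
kable(head(gutenberg_authors))
```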
Links to other packages

I've added links to the CRAN view and the wikitrends/WikipediR packages to the README and the vignette. One issue with humanparser is that the columns are in Lastname, Firstname order, which, in my understanding, it can't necessarily handle. (I didn't include gender for a similar reason.)
I think this is a general text analysis question that falls outside the scope of an API package like this. What encoding issues they'd run into would depend on what they were doing :-)
Well, I've learnt a lot too! 👌 This all looks good to me. 👍 I have been watching your repo and find all the changes well done. The warning for = instead of == seems useful! Thanks for all your answers; I now agree with everything. ☺

As regards names, it's clearly not important, but I saw today that this parser can reverse names: https://github.com/Ironholds/humaniformat/blob/master/vignettes/Introduction.Rmd

For covr, I saw that Jim will have a look at it, so I'm confident that your badge will soon be green!
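A hedged sketch of the name-reversal idea linked above, assuming humaniformat exposes a reversal helper named format_reverse() (the example names are illustrative):

```r
# install.packages("humaniformat")
library(humaniformat)

# Convert "Lastname, Firstname" entries into "Firstname Lastname"
format_reverse(c("Austen, Jane", "Dickens, Charles"))
```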
Having a quick look over this now...
@dgrtwo Looks great. Just a minor thing:
Warning message:
In node_find_one(x$node, x$doc, xpath = xpath, nsMap = ns) :
101 matches for .//a: using first

Wonder if it's worth a
Great!
(Just needs Travis + AppVeyor activated for its new repo, I suppose.) Once it's settled, I'm planning to submit 0.1.1 to CRAN (it's currently on CRAN as 0.1), including the new URLs.
Nice, builds are on at Travis: https://travis-ci.org/ropenscilabs/gutenbergr/ - you can keep AppVeyor builds under your acct, or I can start one on mine; let me know.
@sckott Happy to transfer to you/ropensci! That's where the badge is currently pointing.
Updated the badge link; started an AppVeyor account with
https://github.com/dgrtwo/gutenbergr
Project Gutenberg
People interested in text mining and analysis. Especially interesting to those analyzing historical and literary works, but it can also serve as a source of example datasets for text mining problems.
No.
devtools
install instructions

Does devtools::check() produce any errors or warnings? If so, paste them below.