gutenbergr -- download and processing public domain works from Project Gutenberg #41
People interested in text mining and analysis, especially those analyzing historical and literary works; the package can also serve as a source of example datasets for text mining problems.
This package offers access to a playground of data for text mining, which is awesome, and the outputs are data frames that one can use with
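For instance, a minimal sketch of that workflow (the book ID and the dplyr verbs shown are illustrative; the download requires an internet connection):

```r
library(gutenbergr)
library(dplyr)

# Download a single work by its Project Gutenberg ID;
# the result is a data frame with one row per line of text
book <- gutenberg_download(768)

# Because the output is an ordinary data frame, dplyr verbs apply directly
book %>%
  filter(text != "") %>%
  head()
```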
The package works as expected on my Windows PC and provides tools for filtering works and then downloading them. The examples and vignette help one get a good overview of the functionality. The package is already very good, so I feel very picky writing comments!
When running R CMD check I get a note, "found 13591 marked UTF-8 strings", but it is already explained in cran-comments.md.
Why not add AppVeyor CI too?
```
gutenbergr coverage: 89.47%
R/gutenberg_download.R: 86.05%
R/gutenberg_works.R: 100.00%
R/utils.R: 100.00%
```
so why not add a badge to show this?
data-raw folder, data preparation
In the readme of this folder you mention "This set of scripts produces the three .rda files in the data directory. (...) There are no guarantees about how often the metadata will be updated in the package."
Links to other packages
At the end of the vignette/readme you mention
I sincerely thank Maëlle both for the helpful comments and the kind words about the gutenbergr package. I've learned a lot from this advice! I'm also grateful it was received so quickly!
It's worth noting I made one major change not in response to feedback: switching the package license from MIT to GPL-2. (I realized the Project Gutenberg catalog is released under the GPL, so I should distribute it only under a GPL-compatible license.) All of this is noted in the NEWS.md file.
data-raw folder, data preparation
I don't know how often the metadata of old books is changed (I would guess quite rarely, and probably very rarely for classic books). I do know that books and their metadata are added to the collection every day. For example, as of this writing, about 23 books were published yesterday, May 3. While there is a feed for books added yesterday, there is no such feed for additions over a longer period or for edits, nor does the feed contain the full metadata I would need for an update.
The problem is that the scripts take about 30 minutes to run (on my machine) and require downloading a ~60 MB file and unpacking it to ~750 MB. This isn't a small ask in terms of resources, so I can't think of a system I could use to automate it.
The main advantage of regular updates would be including new e-books released each day. The reason I don't see this as a priority is that I think the vast majority of analyses will be on a small fraction of very popular books, most of which were added long ago. I think just as any dataset will be incomplete to some degree, I'm OK treating the dataset as "frozen" every few months. Of course I'm open to suggestions for easy automation!
Great point about making the dataset explicit: I've added an attribute to each dataset like
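A sketch of what that might look like (the attribute name `date_updated` is my assumption, not necessarily the package's actual choice):

```r
library(gutenbergr)

# Each bundled dataset carries an attribute recording when the
# metadata snapshot it contains was last refreshed
attr(gutenberg_metadata, "date_updated")
```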
This fact is also noted in
I'm not sure about mentioning it near the installation instructions, however, simply because I don't know how much more often the dataset will be updated on GitHub than on CRAN. I'm sure I'll update it before each CRAN release.
This was a hard UI decision for me as well. The main issue is that I noticed I was doing
on almost every call, such that I allowed filtering conditions to be included.
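A sketch of the resulting interface (the specific filtering conditions are illustrative; `gutenberg_works()` does accept filter expressions directly):

```r
library(gutenbergr)
library(dplyr)

# Instead of filtering the metadata manually on every call...
gutenberg_metadata %>%
  filter(author == "Austen, Jane", language == "en")

# ...gutenberg_works() now accepts the filtering conditions directly:
gutenberg_works(author == "Austen, Jane")
```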
I considered having the behavior
To help with this ambiguity, I have made two adjustments:
Great point. I have made two changes:
and request works in both English and French with:
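Presumably along these lines (`languages` is an argument of `gutenberg_works()`, defaulting to `"en"`; the exact call here is my reconstruction):

```r
library(gutenbergr)

# Works whose language is English or French
gutenberg_works(languages = c("en", "fr"))
```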
The examples and tests give the full details of the behavior. It's possible that there are corner cases (e.g. either English or French, but not both) that will require additional, minimal processing on the part of the user.
This is true but I'm trying to follow the "Good practice" section of
I've removed all periods at the end of these lines. I generally shy away from stating the class of every argument, since I think it clutters documentation, and anyone would check the class before doing anything interesting.
I have added an example. The main reason I export it is to be especially transparent about how text stripping is done. It also helps users who want to use `strip = FALSE`, check or preprocess some of the texts, and then use
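A sketch of that workflow (the book ID is illustrative, and this assumes `gutenberg_strip()` is applied to the downloaded text column; the download requires an internet connection):

```r
library(gutenbergr)

# Download without stripping the Project Gutenberg header/footer
raw <- gutenberg_download(768, strip = FALSE)

# ...inspect or preprocess raw$text here...

# Then apply the same stripping logic explicitly
cleaned_text <- gutenberg_strip(raw$text)
```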
I've now clarified this: "Although the Project Gutenberg raw data also includes metadata on contributors, editors, illustrators, etc., this dataset contains only people who have been the single author of at least one work."
Agreed; each time I mentioned `tbl_df` I have now changed it to
Done: added info and a link to ISO-639.
I've clarified this to:
Done, added to Details in the gutenberg_subjects Rd.
Comment has been removed
Not in the examples I've seen; e.g. none of the dplyr vignettes, or the vignettes of rOpenSci packages I've looked at.
Done: added an includes section and switched the order; I agree that it makes sense. The exception is that I left the
I considered this, but some of the tables end up being so wide that they don't look good in HTML output anyway. Furthermore, almost every line would require `knitr::kable`, and users unfamiliar with the `kable` function may think they need to use it as well, or at the very least be distracted from the package functions.
Links to other packages
I've added links to the CRAN task view and the wikitrends/WikipediR packages to the README and the vignette. One issue with humanparser is that the columns are in "Lastname, Firstname" order, which, in my understanding, it can't necessarily handle. (I didn't include gender for a similar reason.)
I think this is a general text analysis question that falls outside the scope of an API package like this. What encoding issues they'd run into would depend on what they were doing :-)
Well I've learnt a lot too!
This all looks good to me.
As regards names, it's clearly not important but I saw today that this parser can reverse names https://github.com/Ironholds/humaniformat/blob/master/vignettes/Introduction.Rmd
For covr I saw that Jim will have a look at it so I'm confident that your badge will soon be green!
@dgrtwo Looks great. Just a minor thing:
```
Warning message:
In node_find_one(x$node, x$doc, xpath = xpath, nsMap = ns) :
  101 matches for .//a: using first
```
Wonder if it's worth a
Updated the badge link and started an AppVeyor account with