Submission: epubr #222
Extract, read, and parse EPUB format e-book archive metadata and book text into tidy data frames in preparation for subsequent text analysis.
Data extraction. This package focuses on importing EPUB file metadata and text into R in a useful format. It strips XML tags and formatting to focus on the readable, meaningful text. While future package versions will expand functionality around the edges, the core purpose will remain the data extraction described.
Data analysts or researchers performing text mining and other language analysis involving individual books or book collections in a typical, unrestricted EPUB file format.
I have not found other R packages that do this.
This package is already on CRAN (v0.4.0).
Confirm each of the following by checking the box. This package:
To the best of my knowledge, though it's quite possible I missed something small or have something not exactly as required.
I am technically not a "reviewer" but one of the best "connectors" in the R univeRse connected your pkg with a quick hack of mine from earlier this year (
First, congrats on the CRAN acceptance! You had to do the "roll up the sleeves" work to deal with file extraction to do that and I lazily relied on the
The only thing I really could suggest (besides a personal preference to remove some tidyverse dependencies) is to take a look at https://github.com/hrbrmstr/pubcrawl/blob/master/inst/xslt/justthetext.xslt as that contains a set of "cleaner" XSL template matchers garnered from spending way too much time dissecting various "readability" idioms. Yours has a good catch-all, and a "fun" (I have weird ideas of fun) exercise might be to extract tags from a large, or at least decently representative, random corpus of epubs to see which ones tend to be there and have a secondary template that explicitly targets them. This may not be necessary at all given that epubs aren't as "evil" as general web pages and a catch-all might be fine.
Again, great job! And, if there's anything (which is unlikely) you need from my pkg carve away (and no need for attribution since I just made mine due to a random request on Twitter). Also, if you need resources for the suggested corpus/xslt testing, def lemme know.
I will let the amazing rOpenSci folks go back to their regularly scheduled reviewing activities now :-)
Thank you both! Much appreciated.
@hrbrmstr I overlooked
About the justthetext.xslt
I sent you a couple of emails about it on May 27/28 that might have fallen through the cracks, though the second basically said to disregard the first because I'd made a mistake. Essentially, it turned out that almost every line in the cleaner template had zero impact on my (non-representative) sample of epubs. At first I was surprised that almost the whole cleaner template wasn't doing anything, but it made sense when I considered what many of the tags were named. I guess they don't really show up in epubs.
But it was not quite so simple as removing the lines that had no impact. A small number of lines in the cleaner template actually appeared to have a detrimental effect on what type of text was returned for some sections of an epub. I did the best I could with the personal sample I had (plus some Project Gutenberg, though that is only representative of Project Gutenberg as far as I can tell) to make a judicious selection for the filter that caused the fewest problems and returned the most usable text.
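For readers unfamiliar with the approach being discussed: the "virtually blank" catch-all style of text extraction might be sketched as an XSLT stylesheet like the following (a hypothetical illustration, not the actual justthetext.xslt or epubr's filter):

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- drop non-content containers entirely -->
  <xsl:template match="script|style"/>

  <!-- catch-all: emit the text of everything else -->
  <xsl:template match="text()">
    <xsl:value-of select="."/>
  </xsl:template>
</xsl:stylesheet>
```

A longer "cleaner" template would add more empty `match` rules for tags to strip; the finding above was that most such rules did nothing for epubs, and a few removed text you actually want.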
What I ended up with is what you see in my version in
Normally I'm partial to removing as many dependencies as possible but in this case I favor retaining minimal tidyverse package imports. I did make
I was driving relatively blind with the cleaner template, but considering the best results I got were when using a virtually blank template, it'd be great to know if there is some way to remove the
I like the idea of having a more representative sample of epubs from various publishers, though I don't have any ideas on where to get one. There's also a sampling issue: many of the epubs users may elect to read in may not have DRM, but they are still licensed books that can't be shared as part of a corpus. I don't have stats on this, but I wouldn't be surprised if an available corpus of books from a variety of sources still wouldn't represent the bulk of books most people encounter, since it would be missing all the books most people purchase these days.
Even with a better non-representative sample, it's a painstaking process. You still have to (at least minimally) check each book by eye to see if the filter did something unusual to it. Eventually I had to accept that, as much as I'd prefer to figure out all the gotchas first to minimize subsequent GitHub issues, I was going to need to release the package and wait to see what people report as strange.
Thanks so much again Bob and let me know if you have any more thoughts re: mine! :)
Email & I have been bitter, bitter enemies of late. So it's more of a "wall" between me & it vs bits slipping through cracks :-)
I had an inkling that the style sheet transformer might be super aggressive for an epub (and epubs have, er, "diverse" XML formats, too), so your straightforward approach is likely much, much safer.
I was going to suggest using
If it's OK with y'all I'll likely set the
Again, really nice work!
Thanks! :) I'll have to go back and look at the details when I have a chance, but I can probably remove
The diversity of formats is also why I could not be as aggressive as I would have liked with things like the optional attempts at chapter identification. To some degree that will have to be left to the user so that they can adjust things on a case by case basis without
In a future version I would like to include a function that loads a book in the browser as a simple e-reader using
Thanks again for your help!
Thanks again for your submission @leonawicz!
The README states "Subsequent versions will include a more robust set of features.". Before I start looking for reviewers, could you
We can put this submission on hold so that the reviewers look at a more finished product, which in my opinion would help you make the most of our onboarding process.
In any case, writing down what your ideas are will help lessen duplication of efforts, since reviewers often suggest enhancements.
@maelle Thanks so much!
I have removed that line from the readme. I do not have a clear concept at this time of specific enhancements I plan to include. I had tentatively thought about including a demo Shiny app that acts as a browser-based epub reader, but it may be quite a bit more complicated than I thought. It is not something I have considered in much detail yet.
I do not plan to add any specific functionality at this time. Issues and feature requests users submit later on via GitHub would probably give me a better sense of what kinds of additional functions might be helpful to other users. So far I have not received any user feedback, so I do not plan to make any changes.
@hrbrmstr thanks a lot for volunteering! You've however reviewed another package recently so I asked other reviewers before seeing your message, sorry about that and thanks again for offering your expertise!
@leonawicz you can now add a peer-review badge to the README of your package
@calebasaraba @suzanbaert thanks a lot for accepting to review this package! Your reviews are due on 2018-07-02. As a reminder here are links to the recently updated reviewing and packaging guides and to the review template.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.
The package includes all the following forms of documentation:
Final approval (post-review)
Estimated hours spent reviewing: 4
First of all, I love the idea of the package and the ability to pass a vector of epubs. I was also quite amazed at the speed; I would have thought it would take longer to turn a few quite sizeable books into an R object. The functions also work as part of a pipe and play very nicely with dplyr.
A few suggestions I wrote down along the way:
It took only a mysterious error and a bit more focus to realize my silly mistake, but I did wonder whether the README could just be a bit clearer. Perhaps something along these lines? Tiny change, but maybe a few frowns less from people first using the package?
Output of devtools checks:
The package includes all the following forms of documentation:
Final approval (post-review)
Estimated hours spent reviewing:
I think this is a solid idea for a package and it accomplishes its main goal well: to provide users with functions to transform
I found that the package was easy to use and performed well at these actions using R 3.5 on windows and mac systems. I wrote down some of my thoughts below:
Chapter separation functionality
I downloaded several epub files from Project Gutenberg and played around with loading them in with the
Documentation / Vignettes
On the topic of chapter separation / funky formatting, you include the following paragraph that refers to some sort-of hidden functionality:
I wonder if it is a necessary addition to the vignette, since without documentation or examples of test cases, it only serves to tantalize the reader! I noticed earlier in the submission issue you mentioned that you didn't have any concrete next steps for development, so perhaps some examples or explanation of these arguments could be one; alternatively, you could consider removing this reference from your documentation.
Thanks a lot for your review @calebasaraba!
@leonawicz now both reviews are in! As per our author's guide we ask that you respond to reviewers’ comments within 2 weeks of the last-submitted review, but you may make updates to your package or respond at any time.
Hi all! Thank you all so much for your feedback and hard work! It is much appreciated. :) Some changes have been implemented already. Those and other replies below.
Re: cryptic error message.
Agreed, thank you for catching that. I have made more informative error messages for missing files and non-epub files, for epub, epub_meta, and epub_unzip.
Re: required title field.
This was a big issue. Thank you again. I have made
Context: title is important because it acts as the most likely/dependable field to uniquely identify books when
Of course, multiple books could have the same title, such as multiple editions of one book. Title is not guaranteed to be unique. However, it is the best bet. As such, I have made
I have not encountered any epubs yet lacking a title field, but I did my best to test this out by passing the additional
Re: function naming.
I am open to renaming
I'm not inclined to lengthen
The example using
I agree I probably did more harm than good by indicating that there is developmental functionality potentially accessible via
Re: dependencies/import note
I'm confused regarding this comment, sorry. I used
Re: chapter separation
I'm unclear about "did not find the default item separation useful / meaningful". Because this is generally unpredictable, the default is to attempt no chapter identification.
The default is to simply dump each book section into its own row of the
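As a usage sketch of the regex-based chapter identification discussed here (the file path is hypothetical; `epub()` takes an optional `chapter_pattern` regular expression, and the default of attempting no chapter identification corresponds to leaving it unset):

```r
library(epubr)

# With no chapter_pattern, each book section simply becomes one row
# of the nested data frame, with no chapter identification attempted.
x <- epub("book.epub")

# Supplying a regex lets epub() flag sections whose names match it
# as chapters. "^ch\\d+" is an illustrative pattern, not a default;
# the right pattern varies book by book.
y <- epub("book.epub", chapter_pattern = "^ch\\d+")
y$data[[1]]
```

Because section naming varies so much across publishers, this kind of pattern has to be adjusted case by case, which is the unpredictability referred to above.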
Related suggestion and some context:
It is very tempting to use, but Project Gutenberg may be getting too much attention :)
I think we should not pay much attention to Project Gutenberg (PG) texts honestly. It is a convenient place to grab a public domain epub file for package testing and ability to share the content/examples.
I have looked at books from a variety of publishers. I don't consider this to be representative of "all epubs", of course, but I did notice that none had formatting like the PG books I've looked at (which I admit are far fewer). I have looked at a lot of licensed/copyrighted books, the type you have to purchase, from a range of publishers and years. I consider these much more representative of the types of e-books users will load with
Much of the public domain stuff can be obtained in multiple formats and pulled directly into R from the web.
With the goal of loading a novel's text into R to be used in a subsequent text analysis, when it comes to PG there is no reason at all to use
So overall, I think PG is not only not a good/gold standard for evaluation, but actually an exceptional edge case/not a representative use case for
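For completeness, a sketch of the "pulled directly into R from the web" route for public domain texts (assuming the gutenbergr package, which the thread does not name; 1342 is Pride and Prejudice's Project Gutenberg ID):

```r
library(gutenbergr)

# Download the plain text of a Project Gutenberg book directly,
# one line of text per row -- no epub file or parsing involved.
pp <- gutenberg_download(1342)
head(pp$text)
```

This is why PG texts make a poor benchmark for an epub parser: for them, the epub format is an unnecessary intermediary in the first place.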
Re: Devtools checks
test-lintr.R:4: warning: Package Style
I think this can be ignored. It is possible that if you pull the repo down to some systems, the symlink stops being seen as a symlink (Windows?). On my local Windows it is fine (no warnings). Recreating the symlink can fix the problem on a local system. But this passes on Windows, Mac and Linux (Travis-CI, I think with warnings as errors). This file just links to the real lintr file in the
test-lintr.R:4: error: Package Style
I'm not sure if this is more of the same. I am unable to reproduce these errors locally or remotely.
"warning regarding qpdf."
It's been a long time since I installed qpdf, so I don't remember the exact complaint, but if I didn't have it installed then local package builds threw some kind of message or warning about it not being available.
Status: 1 WARNING, 1 NOTE
I think this is normal, just need to have them on the system if you are building a package that uses them. I'm saying "I think" a lot, which is just to say, if anything I'm saying is not quite right, please feel free to jump in and set things on the right track.
Yeah, I think this is the warning I get rid of by making sure qpdf is installed.
Re: function commenting/code
This is one area where it may have been premature for me to submit my package for review. I can't speak to what other downloaders of
I published to CRAN because what was official/documented was ready to go. As for the parts still in the works, it's just too early to say exactly what they will look like, and trying to document them or give examples would be getting into the weeds, I think. Please let me know what you think. I can always resubmit in the future if the state of some of the internals is too problematic.
Thanks again so much! :) Please forgive me if I forgot to address some comments, or if I shouldn't have addressed both reviews in one reply; I saw some overlap between them. It took a while to compile this (probably not nearly as long as it took you to review the package code, and I am very grateful). Please let me know if I missed anything or confused anything.
It was a fun time taking a look at the
Re: dependencies/import note
Re: function naming
Re: chapter separation / additional context
I did notice that the default settings for
Re: function commenting/code
I tried to look for guidance regarding this in the rOpenSci Packaging Guide but couldn't find anything explicit. I think it's best practice to have pretty streamlined functions in the master branch and a development branch with all the in-progress code that's still evolving -- something to think about in the future, maybe. Ultimately, I don't think it's too big an issue, but it made looking through your code and trying to evaluate whether everything was working as intended a bit harder.
Thanks for your feedback :) I definitely should have used a separate branch for the more experimental features. It was difficult to anticipate, though, given the wide range of epub formatting, what would turn out to be common practices. That became clearer in retrospect. And once I had everything intermingled, the best option at the time seemed to be to at least move some features into being accessible only by
I know it's not ideal practice, but I have seen a lot of packages that make room for a
If you all think this approach will not suffice, please let me know. From my view it's the best option: it gives me time to make improvements when I can, without removing or substantially altering code between the current CRAN version and the next master-branch CRAN release in a way that wouldn't actually make the next release any better.
Thanks. I have updated the documentation to indicate that, with the exception of the new optional
I have also renamed
I've downloaded the latest version from GitHub, and tried with latest changes.
Re: Error messages.
Re: Title automatically added.
Re: types of epub
The devtools notes and warnings did not seem to have any impact on the package properly working, so I only passed those on in case it was useful to you.
Yeah, I also tried a number of Star Trek books (related to my
Also just for more insights, wanted to share, I did an interesting test.
Where one book had not only chapters, marked by
I read the book with
It's difficult to show these nice examples when books are licensed works that can't be distributed.
Also, one thing I shared with @maelle on Slack that I think might not have been mentioned elsewhere is that when you do perform chapter identification on books using a regex pattern,
Other text parsing packages, like the more general
@calebasaraba I see you've now checked the box "The author has responded to my review and made changes to my satisfaction. I recommend approving this package." in your review, perfect, thanks. @suzanbaert can you do that as well if you're happy with all changes?
I'll do last editor's checks next week at the latest. Thanks everyone for a very constructive process!
Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them
Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or long-form post with more narrative about its development. (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.
Thank you! @maelle I've added the footer and the related packages section. I used the
I added the
I'll do the transfer soon.
I'm not sure what the pull request to
@leonawicz this link will give you many examples of blog posts by authors of onboarded packages so you can get an idea of the style and length: https://ropensci.org/tags/review/. Technotes are here: https://ropensci.org/technotes/.
Here are some technical and editorial guidelines for contributing a blog post: https://github.com/ropensci/roweb2#contributing-a-blog-post.
Happy to answer any questions.