Submission: rtika #191
Following up on this, is there anything I can do to help, like suggest additional reviewers?
Editor checks:
Editor comments

Thanks for your submission @predict-r! And sorry for the slow reply; this fell through the cracks for a week. This is a good fit and I'll go ahead and find reviewers. Here are the results of our automated tests via goodpractice. I suspect you'll have trouble getting a >50MB package onto CRAN.
Reviewers: @juliasilge @davidgohel
Thank you @juliasilge and @davidgohel for agreeing to review! Reviews will be due 2018-03-05.
@predict-r You can add the review badge to your README now if you wish. It will auto-update with review status:
Thanks! I'll go ahead and add the badge then. I'll patch some of the goodpractice issues listed above. Thanks for pointing out the CRAN situation. I can create another repository with the .jar, and call it 'tikajar'. That package won't update often, so maybe CRAN will accept it. If not, the package can be left on GitHub and downloaded with devtools.
Okay, I removed the .jar file. It is in a new package on GitHub called 'tikajar' (see: https://github.com/predict-r/tikajar). Since it's a key part of 'rtika', I added the review badge to tikajar as well. Let me know if that's an issue.
Hi, here is my review:

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.
Documentation

The package includes all the following forms of documentation:
Functionality
davidgohel: I tested with 1204 documents (my admin folder). I got 20.354 sec. with 8 threads. I was expecting a faster execution with 8 threads (not 8 times faster, but at least noticeably faster than with fewer threads).
davidgohel: As the unit tests are there and cover almost all of the code, I ran them; some of them fail during R CMD check. The fun fact is that these are successful when launched via the RStudio test button... Maybe https://github.com/davidgohel/gdtools/blob/master/.travis.yml#L5 could help.

Final approval (post-review)
Estimated hours spent reviewing: 4

Review Comments

Congratulations for that package, it brings something new. It works well and is simple to use. A minor modification might be necessary: starting the JVM with an option that forces UTF-8 output. More a wish than a comment... I would have used rJava to call Tika rather than command-line invocations.
About the size of the jar file, I did not read anything that says it's an issue for CRAN. Hope you will find that review useful :) David
Thank you David for the very helpful comments. I've attempted to address your concerns in the 0.1.5 version, and hope they meet your approval. I'm happy that the batch speed was good, and have sped up the package a bit more this time.
Yes, I wish there were a better improvement in speed with that many threads. It seems like the speed per thread shows sharply diminishing returns, unfortunately (perhaps this is because file system read speed maxes out, but that is just a guess). Your experience has led me to increase the default number of threads to two. Tika's default is the number of processors minus one, according to the batch processor config file. If you think I should use that instead, I will.
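To make the comparison concrete, here is a hypothetical timing check along the lines of the test described above. It assumes tika_text() forwards a `threads` argument to the underlying batch call, which is my reading of the current package but should be verified; the folder path is a placeholder.

```r
files <- list.files("~/docs", full.names = TRUE)  # placeholder folder of mixed documents

# Wall-clock time with one thread versus eight, to see the per-thread returns.
system.time(text_1 <- tika_text(files, threads = 1))
system.time(text_8 <- tika_text(files, threads = 8))
```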
This was very helpful. Your comment led me to read the Tika source code more carefully. I am happy to report that Tika outputs UTF-8 now. It was not immediately apparent, because the original Tika used UTF-16. However, the newer batch processor's FSOutputStreamFactory defaults to UTF-8. There is no need to create a special config file or anything, so I removed the offending enc2utf8(). This will speed up the package quite a bit! Thanks.
Thank you for reminding me. I've added one.
I've updated the authorship to use 'person' functions, as you suggested.
The utils and stats packages are now declared in Imports.
I've updated the Java declaration to use the standard CRAN version style you mentioned. I've also included an .onLoad() function to verify that Java is available and is the appropriate version.
These are now skipped during checks using \donttest{} statements. I also made a change to the tests to speed them up, which you may or may not approve of. Instead of downloading files for the test each time, which took fifteen seconds or so, I moved a few different files into the test directory and now test on those.
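For illustration, this is the roxygen2 pattern being referred to; a minimal sketch, with a placeholder file name rather than anything from the package's actual documentation.

```r
#' @examples
#' \donttest{
#' # Shown in the help page, but not run by default during R CMD check,
#' # so checks do not depend on Java or on downloading sample files.
#' text <- tika_text("document.pdf")
#' }
```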
After several hours, this has been fixed on both Linux and OS X. On my OS X laptop, I was able to replicate the error by using the system2 command to invoke Java. The problem was that system2 stops Tika from finishing in some cases, terminating the process in the middle. The problem was intermittent, and it did affect the same tests as you reported. By making the 'sys' package a requirement instead of a suggestion, and using sys::exec_wait, the issue is fixed on my end. The .travis.yml is updated to test on both OS X and Linux now, and is passing. As a side effect, sys::exec_wait speeds up rtika by up to five seconds on my laptop.
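A rough sketch of that swap, assuming a local copy of the Tika app .jar and the Tika command line's `-t` (plain text) flag; the paths are placeholders and this is not the package's exact internal call.

```r
library(sys)

jar  <- path.expand("~/tika-app-1.17.jar")     # placeholder path to the Tika app
args <- c("-jar", jar, "-t", "document.pdf")

# Old approach: system2() sometimes terminated Tika before it finished writing.
# system2("java", args, stdout = "out.txt")

# New approach: exec_wait() blocks until the Java process exits and
# streams stdout to a file.
status <- exec_wait("java", args, std_out = "out.txt")
```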
Thank you for this, and it has been implemented.
Thank you for letting me know. I agree that the rJava package could help with certain security issues. It would definitely bring speed benefits over repeated command line invocations. However, it would take time to dive into the Tika codebase, implement it, and test it. Perhaps my main issue is that I've had some recurring frustration over the years installing rJava. While I'm not opposed to rJava, I would prefer not to add the extra complexity in this first version. If the rtika package is popular at all, I will add rJava.
Thank you, I will look into the .jar size issue next! I hope I've addressed your questions and comments to your satisfaction.
Awesome, thank you. There are 2 NOTE(s) left that could be suppressed: https://travis-ci.org/predict-r/rtika/jobs/344591725#L349
Thanks for your thorough review, @davidgohel!
David, I fixed the issue with the two NOTES.
Thank you Sasha
Hello! Here is my review. 🙂

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.
Documentation

The package includes all the following forms of documentation:
Functionality
Final approval (post-review)
Estimated hours spent reviewing: 3

Review Comments

I have had HORRIBLE EXPERIENCES with installation and usage for Java-dependent R packages 😱😱😱 so I was full of dread when I was asked to review this package. However, I am very happy to report that I was able to install and use this package without any trouble at all. Great job on this package, its test coverage, and usability!

```r
library(rtika)
library(tidyverse)
library(tidytext)
my_path <- "~/Documents/Very Ancient Stuff/Teaching & Astronomy/"
batch <- paste0(my_path, list.files(path = my_path))
text <-
  batch %>%
  tika_text()

tidy_df <-
  data_frame(file = batch,
             text = text) %>%
  unnest_tokens(word, text) %>%
  count(file, word, sort = TRUE) %>%
  anti_join(get_stopwords())

tidy_df

metadata <-
  data_frame(raw = batch %>%
               tika_json()) %>%
  filter(!is.na(raw)) %>%
  mutate(JSON = map(raw, jsonlite::fromJSON, flatten = TRUE)) %>%
  pull(JSON) %>%
  map_df(function(x) { x[1:4] %>% head(1) })

metadata
```
Issue with R CMD check

When I run R CMD check, it cannot find the tikajar dependency, since that package is only available on GitHub and not on CRAN.
I have not built a package that uses a dependency that is only on GitHub, so I am not sure what the right approach is here.

Code style

I have some suggestions around code style to place this package more squarely within R community norms as expressed, for example, in Google's R Style Guide or the tidyverse style guide. I don't know how annoying you will find this 😳 but the value here is that other contributors will be able to read and understand your code more easily. Here are some big picture thoughts:
This is especially important in the ...

Thoughts on the vignette

The vignette currently focuses on text mining of the text that we can get out of this package. Almost the entire vignette is text mining and exploratory data analysis (using base R and some data.table), which is not what this package is about; only the very first code chunk uses functions from rtika at all. The text mining community in R is active and healthy (I am part of it!) and there are lots of opportunities to learn about how to do text mining once you have text; I would suggest that a vignette for a package like rtika is not really the right spot for this kind of analysis. When I am learning how to use a new package, what I want from the vignette is how to use the functions in that package. I would love to see, instead, a more detailed exercise of how to use the actual rtika functions and what they can do. There is nothing in the vignette about handling metadata from multiple sources, for example. The current material that is in the vignette would be more appropriate for a blog post, or perhaps a secondary, supportive vignette.

I'd also like to address the tone of the vignette a little. I know it is so hard to write for audiences when you aren't quite sure where they are! 😃 I am 100% in favor of writing accessibly without jargon so that people can get started, but I want to make sure this vignette doesn't come off as condescending. For example, if a user has managed to install the correct version of Java and is attempting to walk through a vignette, they probably know what regular expressions are.
Thanks for your thoughtful review, @juliasilge! I especially concur with your comments on the vignette. Regarding the R CMD check issue ...
That's because tikajar is only on GitHub and not on CRAN, so the checks can't find it. Adding a remote source for tikajar in the DESCRIPTION should let installation and checks locate it.
@juliasilge Thank you for your helpful comments about making the package feel more welcoming. Based on the feedback here, we need a new vignette. It should demonstrate how to use the actual rtika functions and show what they can do. It is also an opportunity to show the breadth of uses. I'll merge your pull request, and go through each point in a coming response. In the meantime, this morning I'm brainstorming the vignette. If people know a good vignette to study, please let me know! I'm thinking of a survey of uses and data. It will have more than five examples, each brief, each focused on a distinct case, each using a cool data set, and each following the 'tidy tools manifesto' principles of using pipes and common formats. The more diverse the examples, the better. I'm thinking:
As for data, my data sources feel too similar to one another and I am open to suggestions, especially anything from the Internet Archive (https://archive.org/). My initial ideas:
More diverse data examples feel more welcoming, especially documents appealing to the humanities and arts. Thank you.
Around five example data sets would be plenty for a vignette, I think! Even three small case studies would be good. I would be thoughtful in what tooling you choose to use in the vignettes. For example, if you wanted to use the Stanford CoreNLP in the vignette, then that package would need to be in Suggests, and that package is a BEAR 👿 to get to install on Travis. I know this from experience. I'd try to stay quite lean and simple in what you use in example analyses in the vignettes and keep complex dependencies for demonstration blog posts. Two example packages that I think are good to look to for how to handle vignettes in this case are gutenbergr (downloads public domain texts from Project Gutenberg) and fulltext (searches and downloads full text for journal articles).
@juliasilge, I've implemented changes to address your helpful comments.
To avoid this issue, and make things simpler until CRAN complains, I've put the .jar file back into the 'rtika' package for now. So, the issue with importing 'tikajar' is gone on my end. @davidgohel said: "About the size of the jar file, I did not read anything that says it's an issue for CRAN." The size might not be an issue unless the package is updated frequently, and I hope to update it infrequently.
The 'styler' package came in handy. I've tried to make the README and vignette use a standard style throughout.
I've written an introductory vignette, and removed the old one. This one focuses on how to use the package functions and process the metadata. It is on the GitHub Pages site:

I tried to keep it light and easy to read, and hope it meets your approval for a package vignette. The site was built with pkgdown::build_site(). By the way, I noticed that the review badge was not loading here, so I had to remove it.
Thanks for your response, @predict-r! @juliasilge, let us know if these changes have addressed your comments and check that box off if so; let us know what else remains if not. I see @davidgohel has already checked off his box. FYI, we had a temporary snafu on the service serving badges. It should be back now.
Thank you.
Ah, I love the new vignette!!! 💕 It is full of so much useful, detailed information. This is going to be a great resource for users. I also am happy to see the other issues in the review addressed; I reviewed the package once more and ran R CMD check. The only significant thing remaining in the checks is the NOTE about the size of the package.
I do want this package to be a useful resource, and am so glad you liked the vignette!
I will keep trying, but CRAN never got back to me at all. Has anybody seen a package that downloads the required resources after package installation? There are Apache mirrors for Tika, such as http://mirrors.ocf.berkeley.edu/apache/tika/tika-app-1.17.jar.
Thanks all for great reviews and follow-up, @juliasilge, @davidgohel, and @predict-r. We can accept the package as-is from our perspective; the size/CRAN issue isn't a precondition on our front, but it makes sense to work out the approach first. I'll be happy to do a full editors' check if you make major changes on this front. Courtesy of great friend of rOpenSci Bob Rudis, here's the distribution of package sizes on CRAN. If CRAN won't take a package this large, I believe having a package automatically install resources on installation is a violation of CRAN policies in most cases. The few exceptions I know of relate to license issues or platform-specific installs of compiled binaries. But plenty of packages have functions to install system dependencies that users run after install.
I've slept on it, and am ready to give up on CRAN for now. The response from Uwe Ligges at CRAN is:
Since the Tika .jar is over 55.7MB (49.5MB compressed), there is no way to distribute the .jar on CRAN without violating that principle. So, once accepted, this package would be on both GitHub and rOpenSci! I'm happy with that. @noamross Thank you for showing a good way forward, and for offering to do a full editor's check. Uwe from CRAN had a similar suggestion:
The way you suggested, downloading the .jar with a separate function after installation, would take some rework on my end. Perhaps we can return to this issue, and I could request an editorial review in the future? Maybe there could be two versions of the package -- one without the .jar?
I advise against two packages. It will be confusing for users and make more effort for maintenance. I think an install function that downloads the .jar after the package is installed is the better approach. I'm OK with marking this as "accepted" now, but I would prefer that, if the plan is to change the user experience in this way soon, we do so before we move forward with blog posts and such.
Alright, I’ll make the install script. :)
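A minimal sketch of what such an install script could do, assuming the Apache mirror URL mentioned above and the rappdirs package for a per-user data directory; the real install_tika() may differ.

```r
# Download the Tika app .jar to a user-writable location after the R package
# itself is installed, keeping the package source small.
install_tika_sketch <- function(
  url = "http://mirrors.ocf.berkeley.edu/apache/tika/tika-app-1.17.jar") {
  dest_dir <- rappdirs::user_data_dir("rtika")
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(dest_dir, basename(url))
  utils::download.file(url, dest, mode = "wb")
  invisible(dest)
}
```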
I've implemented install_tika(). However, along the way, I ran into a big Apache Tika bug! It was in the batch processor. The good news is that I reported the bug, and it was immediately patched! See [1]. The bug involves file paths that contain spaces. A potential issue is that the next major release of Tika with the patch, Tika version 1.18, will be released soon, but I'm not sure exactly when... I've asked. In the meantime, what about having install_tika() download a patched development build? If people have thoughts on installing a development version of Tika, I'd like to hear them. [1] https://issues.apache.org/jira/browse/TIKA-2604?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16393363#comment-16393363 P.S. I'm kinda proud of finding the bug!
👏 Good work finding that, @predict-r! I had problems with another program crashing recently because of spaces in file paths, so I'm glad this bug is squashed. We'll see what the response is from the developer, but based on this email thread it seems that 1.18 is only a week or two off. If it were months away I would suggest a different approach rather than jumping ahead. I would say it's fine to use the patched version on GitHub now. I will take a careful look at the remaining changes to wrap this up later this weekend.
Thanks! install_tika() is temporarily set to download the development version. The package is passing on Travis.
Approved! Great work, and thanks a lot @juliasilge and @davidgohel for your reviews! The To-dos:
Welcome aboard! I for one am quite excited about rtika. We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). If you are interested in writing one, let us know and @stefaniebutland will be in touch about content and timing.
Writing a long-form post sounds good. It's an opportunity for me to introduce the package and reflect a bit on the process. Please express my interest and thanks for the opportunity to @stefaniebutland. Okay, regarding the logistics of the transfer, it's now at https://github.com/ropensci/rtika. That was super easy. The README links are also changed, yet I don't know how to change the GitHub repository title blurb at the very top to: R Interface to Apache Tika https://ropensci.github.io/rtika/. I guess I no longer control repository-level settings, which is okay, but that link should be changed.
I made that change. You should have repo settings control now. All set! I'm going to close this issue, but @stefaniebutland will respond here about blog timing. Here are our editorial and technical guidelines for posts: https://github.com/ropensci/roweb2#contributing-a-blog-posts, and some other posts in the onboarding series: https://ropensci.org/tags/review/
👍
Hi @predict-r. Great to hear that you are interested in contributing a blog post and thanks to Noam for providing that link to recommendations. Would you be able to submit a draft via pull request by March 27? That will give us an opportunity to review it, give you feedback and set a date for publication.
Greetings @stefaniebutland. That is reasonable; I'll submit the draft by the 27th. Based on https://github.com/ropensci/roweb2#contributing-a-blog-posts, I know what to try. If there are any examples of blog posts you particularly liked or want to recommend, let me know and I'll take a closer look. Thanks.
@predict-r This will give you a range of examples: https://ropensci.org/tags/review/. Browsing them will give you an idea of the length you prefer, and different angles on the narrative. I'm happy to give feedback on an early version or outline if that helps you.
@stefaniebutland, I posted a draft here:

It's a pull request on a fork... hope I did that correctly.
Thank you for doing this @predict-r.
From https://github.com/predict-r/roweb2 you should create a pull request from your branch. I have a few other blog posts to review first, so I will get you some feedback likely early next week.
Summary
Extract text or metadata from over a thousand file types. This is an R interface to Apache Tika, a content and metadata extraction tool. Tika is written in Java and has been maintained by the Apache Software Foundation since 2007.
https://github.com/predict-r/rtika
It fits into data extraction, especially text extraction. This package helps extract either a plain text or XHTML rendition of most common file types, without leaving R. It includes parsers for Microsoft Office (both the old and new formats), OpenOffice, iWork, WordPerfect, PDF, ePub, RTF, HTML, XML, and Mbox email.
The hours of labor spent on these parsers are unbelievable. I recall Tika was intended to prevent duplicate work at Apache, and it has since become a foundational library for Apache search engine projects.
Tika also detects metadata from over a thousand file formats. These include basic features like Content-Encoding and MIME type sniffing. Beyond that, there are specialized extractors for scientific formats (ISA-Tab, Matlab), geospatial formats (GDAL, ISO-19139), image Exif data from TIFF or JPEG, and many more.
The long list of formats is here: https://tika.apache.org/1.17/formats.html
Computational social scientists, digital humanities scholars, historians, or any person who works with large digital archives filled with possibly unpredictable file formats.
For example, consider a social scientist studying the California state legislature, as I am. The legislature publishes a database that includes, among other things, Microsoft Word documents created over the past decades. These are at http://leginfo.legislature.ca.gov/.
I've used Tika to automatically extract text from decades of Word documents, even if I don't know the particular version of Word beforehand. Further, with the XHTML output option, I can preserve tables inside each Word document.
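As an illustration of that workflow, here is a sketch of recovering a table from the XHTML rendition. It assumes an XHTML/HTML output function such as the package's tika_html(), plus the rvest package for table parsing; the file name is a placeholder.

```r
library(rtika)
library(rvest)   # read_html() and html_table()

bill <- "bill_analysis.doc"      # placeholder legacy Word file
xhtml <- tika_html(bill)         # XHTML rendition, with tables kept as <table> nodes

tables <- html_table(read_html(xhtml))
tables[[1]]                      # the first table, now an R data frame
```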
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
There are specialized parsers for several of the file types Tika processes, such as antiword for older Microsoft Word documents. Overlap is going to be inevitable. I have used antiword and it worked well, except that some of my files did not have a file type extension (e.g. .docx or .doc) and could be in either format, so it took some trial and error to get it working. It turned out antiword does not process the modern XML-based Word formats. It also does not render the Word document into XHTML, which could preserve tables from the Word document.
Listing all the document parsers that Tika overlaps with would take an enormous amount of space, since it processes over a thousand file types. In general, however, Tika aims to streamline text extraction within large archives. Archives often contain heterogeneous and unpredictable file formats. Tika handles most of what I throw at it.
The other package I found that interfaces with Apache Tika is tikaR, released in September of 2017 (see: https://github.com/kyusque/tikaR). It provides only XHTML renditions of documents, and uses the rJava package to interface with Tika.
I was a Java developer and implemented an rJava interface with Tika in 2012, when I was considering my first package for R (the project is empty, but I set up a repository on R-Forge back then: https://r-forge.r-project.org/projects/r-tika/). My rJava code worked, but I concluded at the time that the Apache Tika command line interface (CLI) was sufficient for myself and most people. I ended up just crafting some specialized system calls to the command line for my own work. The R-Forge repository remains empty because I didn't want to maintain a complex Java codebase.
Recently, the Tika project came out with a very nice batch processor in their command line that I found to be the most efficient method to handle tens of thousands of documents. It was too good to keep to myself.
Unlike the rJava functions used by tikaR, I essentially send Tika a file list to process, and then craft a specialized call to the command line to invoke the batch processor. Retrieving the converted results is pretty straightforward. The code is simple and easy to maintain.
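A rough sketch of this batch approach, using the input/output directory form of the Tika app's batch mode (-i and -o) in place of the package's file-list call; the .jar path and folders are placeholders, and sys::exec_wait() is used as discussed earlier in the thread.

```r
library(sys)

jar        <- path.expand("~/tika-app-1.17.jar")  # placeholder local Tika app
input_dir  <- path.expand("~/docs")               # folder of mixed file types
output_dir <- tempfile("tika-out-")
dir.create(output_dir)

# One Java process handles the whole folder; Tika writes one output file
# per input file into output_dir.
exec_wait("java", c("-jar", jar, "-t", "-i", input_dir, "-o", output_dir))

# Read the plain-text results back into R, one element per document.
out_files <- list.files(output_dir, full.names = TRUE, recursive = TRUE)
texts <- vapply(out_files, function(f) {
  paste(readLines(f, warn = FALSE), collapse = "\n")
}, character(1))
```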
Unlike tikaR, I've included a variety of tests and keep track of test coverage, and I anticipate that maintaining this package will be fairly simple, even without Java knowledge. There are also more features implemented in my package than in tikaR, such as the ability to use multiple threads.
Requirements
Confirm each of the following by checking the box. This package:
Publication options
paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.

Detail
Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
Jeroen Ooms has worked on text file parsers and I respect his work immensely.