
Submission: rtika #191

Closed

@goodmansasha

Description


Summary

  • What does this package do? (explain in 50 words or less):

Extract text or metadata from over a thousand file types. This is an R interface to Apache Tika, a content and metadata extraction tool. Tika is written in Java and has been maintained by the Apache Software Foundation since 2007.

  • Paste the full DESCRIPTION file inside a code block below:
```
Package: rtika
Type: Package
Title: R Interface to 'Apache Tika'
Version: 0.1.2
Author: Sasha Goodman <goodmansasha@gmail.com>[aut], The Apache Software Foundation [aut,cph]
Maintainer: Sasha Goodman <goodmansasha@gmail.com>
Suggests: 
    sys,
    jsonlite,
    xml2,
    data.table,
    curl,
    testthat,
    knitr,
    rmarkdown,
    covr,
    Matrix
License: Apache License 2.0 | file LICENSE
SystemRequirements: Java (>=7) | openjdk-7-jre (via apt) | java-1.7.0-openjdk (via yum) | openjdk-8-jre (via apt) | java-1.8.0-openjdk (via yum)
Description: Extract text or metadata from over a thousand file types. Get either plain text or structured XHTML. This R interface includes the Tika software. Its source is available at https://github.com/apache/tika.
Depends: R (>= 3.1.0)
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1
URL: http://github.com/predict-r/rtika
BugReports: http://github.com/predict-r/rtika/issues
VignetteBuilder: knitr
```

  • URL for the package (the development repository, not a stylized html page):

https://github.com/predict-r/rtika

  • Please indicate which category or categories from our package fit policies this package falls under *and why*? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):

It fits into data extraction, especially text extraction. This package helps extract either a plain text or XHTML rendition of most common file types, without leaving R. It includes parsers for Microsoft Office (both the old and new formats), OpenOffice, iWorks, WordPerfect, PDF, ePub, rtf, HTML, XML, and Mbox email.
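A minimal sketch of this kind of extraction (assumes rtika and a recent Java runtime are installed; the file names are purely illustrative):

```r
library(rtika)

# A mixed batch of documents in unpredictable formats.
# These paths are hypothetical examples.
files <- c("report.doc", "minutes.docx", "analysis.pdf")

# tika_text() returns a character vector with one element of
# plain text per input file, regardless of the source format.
text <- tika_text(files)
```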

The hours of labor spent on these parsers are hard to overstate. I recall that Tika was intended to prevent duplicated work at Apache, and it has since become a foundational library for Apache search engine projects.

Tika also detects metadata from over a thousand file formats. These include basic features like Content-Encoding and mime type sniffing. Beyond that, there are specialized extractors for scientific formats (ISA-Tab, Matlab), geospatial formats (GDAL, ISO-19139), image Exif data from tiff or jpeg, and many more.

The long list of formats is here: https://tika.apache.org/1.17/formats.html
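Metadata retrieval can be sketched as follows (an assumption-laden example: it supposes `tika_json()` returns one JSON string of metadata per input file, parsed here with jsonlite; the file name and field names are illustrative and depend on the input):

```r
library(rtika)
library(jsonlite)

# Hypothetical input file.
json <- tika_json("photo.jpg")

# Parse the metadata for the first file into a named list.
meta <- jsonlite::fromJSON(json[1])

# Fields such as the sniffed media type appear as named entries.
meta[["Content-Type"]]
```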

  • Who is the target audience and what are scientific applications of this package?

Computational social scientists, digital humanities scholars, historians, or anyone who works with large digital archives filled with possibly unpredictable file formats.

For example, consider a social scientist studying the California state legislature, as I am. The legislature publishes a database that includes, among other things, Microsoft Word documents created over the past decades. These are at http://leginfo.legislature.ca.gov/.

I've used Tika to automatically extract text from decades of Word documents, even if I don't know the particular version of Word beforehand. Further, with the XHTML output option, I can preserve tables inside each Word document.
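A sketch of that XHTML workflow, assuming rtika's `tika_html()` helper and the xml2 package (the file name is hypothetical):

```r
library(rtika)
library(xml2)

# tika_html() returns an XHTML rendition of the document,
# so tables survive as <table> nodes rather than flattened text.
html <- tika_html("legislative_analysis.doc")  # hypothetical file

doc    <- xml2::read_html(html[1])
tables <- xml2::xml_find_all(doc, ".//table")
```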

There are specialized parsers for several of the file types Tika processes, such as antiword for older Microsoft Word documents, so some overlap is inevitable. I have used antiword and it worked well, except that some of my files lacked a file extension (e.g. .docx or .doc) and could be in either format, so it took trial and error to get it working. It turned out antiword does not process the modern XML-based Word formats, nor does it render a Word document as XHTML, which would preserve its tables.

Listing all the document parsers that Tika overlaps with would take an enormous amount of space, since it processes over a thousand file types. In general, however, Tika aims to streamline text extraction from large archives, which often contain heterogeneous and unpredictable file formats. Tika handles most of what I throw at it.

The other package I found that interfaces with Apache Tika is tikaR, released in September 2017 (see: https://github.com/kyusque/tikaR). It provides only XHTML renditions of documents, and uses the rJava package to interface with Tika.

I was a Java developer and implemented an rJava interface to Tika in 2012, when I was considering my first package for R (the project is empty, but I set up a repository on r-forge back then: https://r-forge.r-project.org/projects/r-tika/). My rJava code worked, but I concluded at the time that the Apache Tika command line interface (CLI) was sufficient for myself and most people, and I ended up crafting some specialized system calls to the command line for my own work. The r-forge repository remains empty because I didn't want to maintain a complex Java codebase.

Recently, the Tika project came out with a very nice batch processor in their command line that I found to be the most efficient method to handle tens of thousands of documents. It was too good to keep to myself.

Unlike tikaR's rJava-based approach, rtika essentially sends Tika a file list to process, crafting a specialized command line call that invokes the batch processor. Retrieving the converted results is straightforward, and the code is simple and easy to maintain.

Unlike tikaR, I've included a variety of tests and keep track of test coverage, and I anticipate that maintaining this package will be fairly simple, even without Java knowledge. My package also implements more features than tikaR, such as the ability to use multiple threads.
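The batch approach described above can be sketched as one call over a whole directory (the directory path is hypothetical, and the `threads` argument name is an assumption about rtika's `tika()` interface):

```r
library(rtika)

# Hypothetical archive of mixed-format documents.
files <- list.files("archive", full.names = TRUE)

# One call hands the entire file list to Tika's batch processor;
# threads (assumed parameter) parallelizes the extraction work.
text <- tika(files, output = "text", threads = 4)
```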

  • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov.
  • I agree to abide by ROpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package has an obvious research application according to JOSS's definition.
    • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)
  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
    • The package is novel and will be of interest to the broad readership of the journal.
    • The manuscript describing the package is no longer than 3000 words.
    • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal.
    • (Please do not submit your package separately to Methods in Ecology and Evolution)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

Jeroen Ooms has worked on text file parsers and I respect his work immensely.
