rtika blog #166
@stefaniebutland Thanks, I added the yaml. Hmm, the "deploy/netlify" banner appeared for a second then disappeared along with its link to a preview. Am I still doing something wrong? |
Tentative date for your post is Tues Apr 17. I'll review your draft in detail next week. In the meantime,
Thank you so much for explaining it. I’ll try and fix the formatting after sleeping |
slightly reducing the amount of code in the blog post
@stefaniebutland Okay, the fish formatting is fixed! The references are in the order presented in the paper. If you need the reference text formatted in a certain format, I can change them once I know the style name. After seeing the blog post in context, I reduced the amount of R code since that was taking up a lot of visual space. Since you have other posts to edit before this, I am tempted to make a few changes in the next few days if that is not an issue. Just a thought. |
@predict-r since tentative date for your post is Tues Apr 17, you can make any changes up to Tues Apr 10th. Let me know when you're finished/happy with it and I'll review then. Cheers! |
@predict-r Your post is scheduled to publish in one week on Tues Apr 24. Please update the date in YAML and make any final edits so I can do a final review. Happy to answer any questions. |
reduced size of headings
updated date in yaml and made final edits. It is ready for final review.
It has smaller headings now and the date is updated. I'll be traveling until next Wed and will check emails for updates...off to the NYC text conference!! |
Fortunately, I remembered Apache Tika. Five years earlier, Tika helped parse the Internet Archive, and handled whatever format I threw at it. Back then, I put together a R package for myself and a few colleagues. It was outdated. | ||
I downloaded Tika and made a R script. Tika did its magic. It scanned the headers for the "Magic Bytes" [^4] and parsed the files appropriately: |
an R script?
#### Lessons Learned | ||
I never distributed a package before on repositories such as CRAN or Github, and the rOpenSci group was the right place to learn how. The reviewers used a transparent on-boarding process and taught about good documentation and coding style. They were helping create a maintainable package by following certain standards. If I stopped maintaining `rtika`, others could use their knowledge of the same standards to take over. The vast majority of time was spent on documenting the code, the introductory vignette, and continuous testing to integrate new code. |
onboarding, no hyphen
consider linking to http://onboarding.ropensci.org/ or https://github.com/ropensci/onboarding so others can see what it's about (I've asked in #onboarding channel which link is preferable from blog post)
In "The reviewers used a transparent on-boarding process and taught about good documentation and coding style." you might link to the open review thread: ropensci/software-review#191
confirmed: please link to https://github.com/ropensci/onboarding
This worked. R sends Tika a signal to execute code using an old-fashioned command line call, telling Tika to parse a particular batch of files. R waits. Eventually, Tika sends the signal of its completion, and R can then return with results as a character vector. Surprisingly, this process may be a good option for containerized applications running Docker. In the example later in this blog post, a similar technique is used to connect to a Docker container in a few lines of code. | ||
Communication with Tika went smoothly, but after one issue with `base::system2()` was identified. That was terminating Tika's long running process. Switching to `sys::exec_wait()` or `processx::run()` solved the issue. |
First two sentences here not clear - something missing?
##### The R User Interface | ||
Many in the R community make use of `magrittr` style pipelines, so those needed to work well. The Tidy Tools Manifesto makes piping a central tenant [^6], which makes code easier to read and maintain. |
"central tenant" should be "central tenet"
##### Responding to Reviewers | ||
During the review process, I appreciated David Gohel's [^7] attention to technical details, and that Julia Silge [^8] and Noam Ross [^9] pushed me to make the documentation more focused. I ended up writing about each of the major functions in a vignette, one by one, in a methodical manner. While writing, I learned to understand Tika better. |
This is awesome: "I ended up writing about each of the major functions in a vignette, one by one, in a methodical manner. While writing, I learned to understand Tika better."
#### Tika in Context: Parsing the Internet Archive | ||
The first archive I parsed with Tika was a website retrieved from the Wayback Machine [^10], a treasure trove of historical files. Maintained by the Internet Archive, their crawler downloads sites over decades. When a site is crawled consistently, longitudinal analyses are possible. For example, federal agency websites often change when an administrations change, so the Internet Archive group and academic partners have increased the consistency of crawling there. In 2016, they archived over 200 terabytes of documents to include, among other things, over 40 million `pdf` files [^11]. I consider these government documents to be in the public domain, even if an administration hides or removes them. |
"when an administrations change" needs a fix
@predict-r What a great post to read. Was clear and compelling from my (somewhat) less technical perspective.
I've made minor suggestions.
For tweets from rOpenSci about this post, do you have any lines from the post or specific things you would like to convey?
Thanks so much for doing this!
@predict-r Let me know when you've made the edits so I can publish |
Tomorrow I’ll be back from traveling and will review the changes on my laptop as opposed to this iPad. Thank you! |
Sounds good. If it will be ready after ~noon Pacific time on 24th then please change the date to 2018-04-25 for filename, YAML date and any images |
@predict-r Hope you're back safe and sound. Please make your final edits including date changes and tag me here so I can post asap today. We're publishing another post Thursday and I don't want them to detract attention from each other. After edits are done:
@stefaniebutland I'm back and trying to merge your changes and suggestions now. Should be done in an hour. |
merged in edits from @stefaniebutland, and changed date to today.
updated blog post date embedded in file name
removing this file with outdated file name, to be replaced with updated file name
fixed one link and added link to package at the very end of the article.
@stefaniebutland Your edits were merged into the file and committed. It's ready to go, as far as I'm concerned! Today's date was added to the file name and yaml. The first paragraph of 'Connecting to Tika' was cleaned up, and hopefully it is clearer now and without added typos. I added a link to the rtika github page at the very end of the article. Let me know if that should be put elsewhere. |
How about this: "It currently handles text or metadata extraction from over one thousand digital formats" |
@predict-r In the author spot, do you want to link to something other than the package? e.g. https://twitter.com/goodmansasha or a website? |
Yes! Please link to the twitter page: https://twitter.com/goodmansasha |
Thanks!! I see it is up on the site now. |
try 2