rtika blog #166

goodmansasha · 2018-03-28T06:10:11Z

try 2

goodmansasha · 2018-03-28T06:12:52Z

@stefaniebutland Thanks, I added the yaml.

Hmm, the "deploy/netlify" banner appeared for a second, then disappeared along with its link to a preview.

Am I still doing something wrong?

stefaniebutland · 2018-03-28T06:21:08Z

See "Show all checks" on the right in this screenshot image.

Click that to see something like the image below, with "deploy/netlify" and "Details" on the right.

Click "Details" to get to the preview of our site. Click "Blog" and you should see this, the preview with your post.

stefaniebutland · 2018-03-28T06:33:22Z

Tentative date for your post is Tues Apr 17. I'll review your draft in detail next week.

In the meantime:

- I notice the babel fish didn't come out right. I wonder if it's because line breaks don't render properly.
- You can format references as in this post: https://ropensci.org/blog/2017/12/05/rperseus/ (Markdown source: https://raw.githubusercontent.com/ropensci/roweb2/master/content/blog/2017-12-05-rperseus.md)

goodmansasha · 2018-03-28T07:27:42Z

Thank you so much for explaining it. I’ll try to fix the formatting after sleeping.

goodmansasha · 2018-03-28T17:43:14Z

@stefaniebutland Okay, the fish formatting is fixed!

The references are in the order presented in the paper. If you need the reference text formatted in a certain format, I can change them once I know the style name.

After seeing the blog post in context, I reduced the amount of R code since that was taking up a lot of visual space. Since you have other posts to edit before this, I am tempted to make a few changes in the next few days if that is not an issue. Just a thought.

stefaniebutland · 2018-03-28T20:54:03Z

@predict-r since tentative date for your post is Tues Apr 17, you can make any changes up to Tues Apr 10th. Let me know when you're finished/happy with it and I'll review then.

Cheers!

stefaniebutland · 2018-04-17T17:48:38Z

@predict-r Your post is scheduled to publish in one week on Tues Apr 24.
Here is the preview: https://deploy-preview-166--ropensci.netlify.com/blog/2018/04/17/rtika-introduction/

Please update the date in the YAML and make any final edits so I can do a final review.
Note that the headings still look pretty big so best to reduce them.

Happy to answer any questions.

goodmansasha · 2018-04-19T04:00:02Z

It has smaller headings now and the date is updated. I'll be traveling until next Wed and will check emails for updates...off to the NYC text conference!!

stefaniebutland · 2018-04-20T18:39:05Z

content/blog/2018-04-03-rtika-introduction.md

+
+Fortunately, I remembered Apache Tika. Five years earlier, Tika helped parse the Internet Archive, and handled whatever format I threw at it. Back then, I put together a R package for myself and a few colleagues. It was outdated.
+
+I downloaded Tika and made a R script. Tika did its magic. It scanned the headers for the "Magic Bytes" [^4] and parsed the files appropriately:


"an R script"?

stefaniebutland · 2018-04-20T18:40:46Z

content/blog/2018-04-03-rtika-introduction.md

+
+#### Lessons Learned
+
+I never distributed a package before on repositories such as CRAN or Github, and the rOpenSci group was the right place to learn how. The reviewers used a transparent on-boarding process and taught about good documentation and coding style. They were helping create a maintainable package by following certain standards. If I stopped maintaining `rtika`, others could use their knowledge of the same standards to take over. The vast majority of time was spent on documenting the code, the introductory vignette, and continuous testing to integrate new code.


onboarding, no hyphen

Consider linking to http://onboarding.ropensci.org/ or https://github.com/ropensci/onboarding so others can see what it's about (I've asked in the #onboarding channel which link is preferable from a blog post).

In "The reviewers used a transparent on-boarding process and taught about good documentation and coding style." you might link to the open review thread: ropensci/software-review#191

confirmed: please link to https://github.com/ropensci/onboarding

stefaniebutland · 2018-04-20T19:02:20Z

content/blog/2018-04-03-rtika-introduction.md

+
+This worked. R sends Tika a signal to execute code using an old-fashioned command line call, telling Tika to parse a particular batch of files. R waits. Eventually, Tika sends the signal of its completion, and R can then return with results as a character vector. Surprisingly, this process may be a good option for containerized applications running Docker. In the example later in this blog post, a similar technique is used to connect to a Docker container in a few lines of code.
+
+Communication with Tika went smoothly, but after one issue with `base::system2()` was identified. That was terminating Tika's long running process. Switching to `sys::exec_wait()` or `processx::run()` solved the issue.


First two sentences here not clear - something missing?

stefaniebutland · 2018-04-20T19:03:40Z

content/blog/2018-04-03-rtika-introduction.md

+
+##### The R User Interface
+
+Many in the R community make use of `magrittr` style pipelines, so those needed to work well. The Tidy Tools Manifesto makes piping a central tenant [^6], which makes code easier to read and maintain.


"central tenant" should be "central tenet"

stefaniebutland · 2018-04-20T19:05:10Z

content/blog/2018-04-03-rtika-introduction.md

+
+##### Responding to Reviewers
+
+During the review process, I appreciated David Gohel's [^7] attention to technical details, and that Julia Silge [^8] and Noam Ross [^9] pushed me to make the documentation more focused. I ended up writing about each of the major functions in a vignette, one by one, in a methodical manner. While writing, I learned to understand Tika better.


This is awesome: "I ended up writing about each of the major functions in a vignette, one by one, in a methodical manner. While writing, I learned to understand Tika better."

stefaniebutland · 2018-04-20T19:06:34Z

content/blog/2018-04-03-rtika-introduction.md

+
+#### Tika in Context: Parsing the Internet Archive
+
+The first archive I parsed with Tika was a website retrieved from the Wayback Machine [^10], a treasure trove of historical files. Maintained by the Internet Archive, their crawler downloads sites over decades. When a site is crawled consistently, longitudinal analyses are possible. For example, federal agency websites often change when an administrations change, so the Internet Archive group and academic partners have increased the consistency of crawling there. In 2016, they archived over 200 terabytes of documents to include, among other things, over 40 million `pdf` files [^11]. I consider these government documents to be in the public domain, even if an administration hides or removes them.


"when an administrations change" needs a fix

stefaniebutland

@predict-r What a great post to read. Was clear and compelling from my (somewhat) less technical perspective.

I've made minor suggestions.

For tweets from rOpenSci about this post, do you have any lines from the post or specific things you would like to convey?

Thanks so much for doing this!

stefaniebutland · 2018-04-23T15:26:03Z

@predict-r Let me know when you've made the edits so I can publish

goodmansasha · 2018-04-24T12:35:43Z

Tomorrow I’ll be back from traveling and will review the changes on my laptop as opposed to this iPad. Thank you!

stefaniebutland · 2018-04-24T15:50:17Z

Sounds good. If it will be ready after ~noon Pacific time on the 24th, then please change the date to 2018-04-25 for the filename, YAML date and any images.

stefaniebutland · 2018-04-25T15:32:31Z

@predict-r Hope you're back safe and sound.

Please make your final edits including date changes and tag me here so I can post asap today. We're publishing another post Thursday and I don't want them to detract attention from each other.

After edits are done:

For tweets from rOpenSci about this post, do you have any lines from the post or specific things you would like to convey?

goodmansasha · 2018-04-25T16:53:40Z

@stefaniebutland I'm back and trying to merge your changes and suggestions now. Should be done in an hour.

goodmansasha · 2018-04-25T17:31:27Z

@stefaniebutland Your edits were merged into the file and committed. It's ready to go, as far as I'm concerned!

Today's date was added to the file name and yaml.

The first paragraph of 'Connecting to Tika' was cleaned up, and hopefully it is clearer now and without added typos.

I added a link to the rtika github page at the very end of the article. Let me know if that should be put elsewhere.

goodmansasha · 2018-04-25T17:39:20Z

> For tweets from rOpenSci about this post, do you have any lines from the post or specific things you would like to convey?

How about this:

"It currently handles text or metadata extraction from over one thousand digital formats"

stefaniebutland · 2018-04-25T17:52:06Z

@predict-r In the author spot, do you want to link to something other than the package? e.g. https://twitter.com/goodmansasha or a website?

goodmansasha · 2018-04-25T18:03:13Z

Yes! Please link to the twitter page: https://twitter.com/goodmansasha

goodmansasha · 2018-04-25T18:24:57Z

Thanks!! I see it is up on the site now.

goodmansasha added 4 commits March 14, 2018 16:34

Create 2018-03-27-rtika-introduction.md

92affcb

creating blank file.

Rename 2018-03-27-rtika-introduction.md to 2018-04-03-rtika-introduct…

6fcb65a

…ion.md

Update 2018-04-03-rtika-introduction.md

8b6be25

Update 2018-04-03-rtika-introduction.md

29bf478

goodmansasha added 6 commits March 28, 2018 10:12

Update 2018-04-03-rtika-introduction.md

cb0bc12

Update 2018-04-03-rtika-introduction.md

9752b51

Update 2018-04-03-rtika-introduction.md

30f059d

Update 2018-04-03-rtika-introduction.md

e9d6922

Update 2018-04-03-rtika-introduction.md

959819f

Update 2018-04-03-rtika-introduction.md

079f8f2

slightly reducing the amount of code in the blog post

goodmansasha added 6 commits March 28, 2018 11:31

Update 2018-04-03-rtika-introduction.md

87fcf8d

Update 2018-04-03-rtika-introduction.md

7a983da

Update 2018-04-03-rtika-introduction.md

e2d07ee

Update 2018-04-03-rtika-introduction.md

3aa112c

Update 2018-04-03-rtika-introduction.md

f5d8c00

Update 2018-04-03-rtika-introduction.md

4a2c35f

goodmansasha added 3 commits April 18, 2018 20:14

Update 2018-04-03-rtika-introduction.md

1db7a74

reduced size of headings

Update 2018-04-03-rtika-introduction.md

0bb4ae7

updated date in yaml and made final edits. It is ready for final review.

Update 2018-04-03-rtika-introduction.md

b95fd57

stefaniebutland reviewed Apr 20, 2018

goodmansasha added 4 commits April 25, 2018 10:14

Update 2018-04-03-rtika-introduction.md

48ca6f4

merged in edits from @stefaniebutland, and changed date to today.

Create 2018-04-25-rtika-introduction.md

8c5df95

updated blog post date embedded in file name

Delete 2018-04-03-rtika-introduction.md

09e712b

removing this file with outdated file name, to be replaced with updated file name

Update 2018-04-25-rtika-introduction.md

ebb7167

fixed one link and added link to package at the very end of the article.

stefaniebutland merged commit ceea566 into ropensci-archive:master Apr 25, 2018
