This repository has been archived by the owner on May 10, 2022. It is now read-only.

rtika blog #166

Merged
merged 23 commits into from Apr 25, 2018

goodmansasha
Contributor

try 2

@goodmansasha
Contributor Author

@stefaniebutland Thanks, I added the yaml.

Hmm, the "deploy/netlify" banner appeared for a second, then disappeared along with its link to a preview.

Am I still doing something wrong?

@stefaniebutland
Collaborator

See "Show all checks" on the right in this screenshot.

screen shot 2018-03-27 at 11 13 33 pm

Click that to see something like the image below, with "deploy/netlify" and "Details" on the right.

screen shot 2018-03-27 at 11 14 36 pm

Click "Details" to get to the preview of our site. Click "Blog" and you should see this, the preview with your post:

screen shot 2018-03-27 at 11 17 46 pm

@stefaniebutland
Collaborator

Tentative date for your post is Tues Apr 17. I'll review your draft in detail next week.

In the meantime,

@goodmansasha
Contributor Author

Thank you so much for explaining it. I’ll try to fix the formatting after sleeping.

@goodmansasha
Contributor Author

@stefaniebutland Okay, the fish formatting is fixed!

The references are in the order presented in the paper. If you need the reference text formatted in a certain format, I can change them once I know the style name.

After seeing the blog post in context, I reduced the amount of R code since that was taking up a lot of visual space. Since you have other posts to edit before this, I am tempted to make a few changes in the next few days if that is not an issue. Just a thought.

@stefaniebutland
Collaborator

@predict-r Since the tentative date for your post is Tues Apr 17, you can make any changes up to Tues Apr 10th. Let me know when you're finished/happy with it and I'll review then.

Cheers!

@stefaniebutland
Collaborator

@predict-r Your post is scheduled to publish in one week on Tues Apr 24.
Here is the preview: https://deploy-preview-166--ropensci.netlify.com/blog/2018/04/17/rtika-introduction/

Please update the date in the YAML and make any final edits so I can do a final review.
Note that the headings still look pretty big, so it would be best to reduce them.

Happy to answer any questions.

updated date in yaml and made final edits. It is ready for final review.
@goodmansasha
Contributor Author

It has smaller headings now and the date is updated. I'll be traveling until next Wed and will check emails for updates...off to the NYC text conference!!


Fortunately, I remembered Apache Tika. Five years earlier, Tika helped parse the Internet Archive, and handled whatever format I threw at it. Back then, I put together a R package for myself and a few colleagues. It was outdated.

I downloaded Tika and made a R script. Tika did its magic. It scanned the headers for the "Magic Bytes" [^4] and parsed the files appropriately:
Collaborator

an R script ?


#### Lessons Learned

I never distributed a package before on repositories such as CRAN or Github, and the rOpenSci group was the right place to learn how. The reviewers used a transparent on-boarding process and taught about good documentation and coding style. They were helping create a maintainable package by following certain standards. If I stopped maintaining `rtika`, others could use their knowledge of the same standards to take over. The vast majority of time was spent on documenting the code, the introductory vignette, and continuous testing to integrate new code.
Collaborator

onboarding, no hyphen

Collaborator

Consider linking to http://onboarding.ropensci.org/ or https://github.com/ropensci/onboarding so others can see what it's about (I've asked in the #onboarding channel which link is preferable from a blog post).

Collaborator

In "The reviewers used a transparent on-boarding process and taught about good documentation and coding style." you might link to the open review thread: ropensci/software-review#191

Collaborator

confirmed: please link to https://github.com/ropensci/onboarding


This worked. R sends Tika a signal to execute code using an old-fashioned command line call, telling Tika to parse a particular batch of files. R waits. Eventually, Tika sends the signal of its completion, and R can then return with results as a character vector. Surprisingly, this process may be a good option for containerized applications running Docker. In the example later in this blog post, a similar technique is used to connect to a Docker container in a few lines of code.

Communication with Tika went smoothly, but after one issue with `base::system2()` was identified. That was terminating Tika's long running process. Switching to `sys::exec_wait()` or `processx::run()` solved the issue.
Collaborator

First two sentences here not clear - something missing?


##### The R User Interface

Many in the R community make use of `magrittr` style pipelines, so those needed to work well. The Tidy Tools Manifesto makes piping a central tenant [^6], which makes code easier to read and maintain.
Collaborator

"central tenant" should be "central tenet"


##### Responding to Reviewers

During the review process, I appreciated David Gohel's [^7] attention to technical details, and that Julia Silge [^8] and Noam Ross [^9] pushed me to make the documentation more focused. I ended up writing about each of the major functions in a vignette, one by one, in a methodical manner. While writing, I learned to understand Tika better.
Collaborator

This is awesome "I ended up writing about each of the major functions in a vignette, one by one, in a methodical manner. While writing, I learned to understand Tika better."


#### Tika in Context: Parsing the Internet Archive

The first archive I parsed with Tika was a website retrieved from the Wayback Machine [^10], a treasure trove of historical files. Maintained by the Internet Archive, their crawler downloads sites over decades. When a site is crawled consistently, longitudinal analyses are possible. For example, federal agency websites often change when an administrations change, so the Internet Archive group and academic partners have increased the consistency of crawling there. In 2016, they archived over 200 terabytes of documents to include, among other things, over 40 million `pdf` files [^11]. I consider these government documents to be in the public domain, even if an administration hides or removes them.
Collaborator

"when an administrations change" needs a fix

Collaborator

@stefaniebutland stefaniebutland left a comment

@predict-r What a great post to read. Was clear and compelling from my (somewhat) less technical perspective.

I've made minor suggestions.

For tweets from rOpenSci about this post, do you have any lines from the post or specific things you would like to convey?

Thanks so much for doing this!

@stefaniebutland
Collaborator

@predict-r Let me know when you've made the edits so I can publish

@goodmansasha
Contributor Author

Tomorrow I’ll be back from traveling and will review the changes on my laptop as opposed to this iPad. Thank you!

@stefaniebutland
Collaborator

stefaniebutland commented Apr 24, 2018

Sounds good. If it will be ready after ~noon Pacific time on the 24th, then please change the date to 2018-04-25 for the filename, YAML date and any images.

@stefaniebutland
Collaborator

@predict-r Hope you're back safe and sound.

Please make your final edits including date changes and tag me here so I can post asap today. We're publishing another post Thursday and I don't want them to detract attention from each other.

After edits are done:

For tweets from rOpenSci about this post, do you have any lines from the post or specific things you would like to convey?

@goodmansasha
Contributor Author

goodmansasha commented Apr 25, 2018

@stefaniebutland I'm back and trying to merge your changes and suggestions now. Should be done in an hour.

merged in edits from @stefaniebutland, and changed date to today.
updated blog post date embedded in file name
removing this file with outdated file name, to be replaced with updated file name
fixed one link and added link to package at the very end of the article.
@goodmansasha
Contributor Author

@stefaniebutland Your edits were merged into the file and committed. It's ready to go, as far as I'm concerned!

Today's date was added to the file name and yaml.

The first paragraph of 'Connecting to Tika' was cleaned up, and hopefully it is clearer now and without added typos.

I added a link to the rtika GitHub page at the very end of the article. Let me know if that should be put elsewhere.

@goodmansasha
Contributor Author

For tweets from rOpenSci about this post, do you have any lines from the post or specific things you would like to convey?

How about this:

"It currently handles text or metadata extraction from over one thousand digital formats"

@stefaniebutland
Collaborator

@predict-r In the author spot, do you want to link to something other than the package? e.g. https://twitter.com/goodmansasha or a website?

@stefaniebutland stefaniebutland merged commit ceea566 into ropensci-archive:master Apr 25, 2018
@goodmansasha
Contributor Author

Yes! Please link to the twitter page: https://twitter.com/goodmansasha

@goodmansasha
Contributor Author

Thanks!! I see it is up on the site now.

2 participants