
Submission: jstor #189

Closed
15 of 19 tasks
tklebel opened this issue Jan 26, 2018 · 48 comments
Comments

@tklebel
Member

tklebel commented Jan 26, 2018

Summary

  • What does this package do? (explain in 50 words or less):

The tool Data for Research (DfR) by JSTOR is a
valuable source for citation analysis and text mining. jstor
provides functions and suggests workflows for importing
datasets from DfR.

  • Paste the full DESCRIPTION file inside a code block below:
Package: jstor
Title: Read Data from JSTOR/DfR
Version: 0.2.6
Authors@R: person("Thomas", "Klebel", email = "thomas.klebel@uni-graz.at", 
                  role = c("aut", "cre"))
Description: Functions and helpers to import metadata and full-texts delivered
    by Data for Research (DfR) by JSTOR. 
Depends: R (>= 3.1)
License: GPL-3 | file LICENSE
Encoding: UTF-8
LazyData: true
Imports: 
    dplyr,
    purrr,
    xml2,
    magrittr,
    stringr,
    readr,
    tibble,
    rlang,
    foreach,
    doSNOW,
    snow
Suggests: testthat,
    covr,
    knitr,
    rmarkdown,
    tidyr
BugReports: https://github.com/tklebel/jstor/issues
URL: https://github.com/tklebel/jstor, https://tklebel.github.io/jstor/
RoxygenNote: 6.0.1
Roxygen: list(markdown = TRUE)
VignetteBuilder: knitr
  • URL for the package (the development repository, not a stylized html page): https://github.com/tklebel/jstor

  • Please indicate which category or categories from our package fit policies this package falls under and why (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):

The package should fit well into the data-extraction category, because
it extracts data from *.xml files for further use.

  •   Who is the target audience and what are scientific applications of this package?  

The target audience would be scientists leveraging JSTOR/DfR for textual research.
Currently there is no implementation in R to parse the metadata they deliver.
Although the data quality of references is not very consistent, one could
attempt citation analysis with much more control over the sample and the data
than Google Scholar or Web of Science offer.

No.

  •   If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Pre-submission was approved by @karthik: #186

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov.
  • I agree to abide by ROpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package has an obvious research application according to JOSS's definition.
    • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI: 10.5281/zenodo.1169862
    • (Do not submit your package separately to JOSS)
  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
    • The package is novel and will be of interest to the broad readership of the journal.
    • The manuscript describing the package is no longer than 3000 words.
    • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal.
    • (Please do not submit your package separately to Methods in Ecology and Evolution)

I would very much like to submit a paper to JOSS, however at the moment I don't
have a proper case-study yet. If possible, I would like to submit a
paper at a later stage.


Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

The only "problems" from goodpractice::gp() are a test coverage of 98% and
lengthy lines in the test-files.

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

If he would find the time to do it, a technical review by @jimhester could
probably help in resolving some performance issues, although he might urge me
to re-write the whole thing (again).

@karthik
Member

karthik commented Feb 2, 2018

👋 @tklebel

Thank you for this submission. During your pre-submission review I completely missed that we had a similar submission a year ago #86 by @benmarwick

Since that submission was stalled, we are good to proceed with yours. I checked in with Ben on this, and his suggestion was for you (and reviewers) to look through his work to see if there are ideas that might help your effort. So please take a look at https://github.com/benmarwick/JSTORr

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

You are already aware of the goodpractice results, but here they are anyway.

── GP jstor ────────────────────────────────────────────────────────────────────

It is good practice to

  ✖ write unit tests for all functions, and all package code
    in general. 97% of code lines are covered by test cases.

    R/batch_import_fun.R:21:NA
    R/batch_import_fun.R:22:NA
    R/batch_import_fun.R:23:NA
    R/example.R:15:NA
    R/example.R:16:NA
    ... and 3 more lines

  ✖ avoid long code lines, it is bad for readability. Also,
    many people prefer editor windows that are about 80 characters
    wide. Try make your lines shorter than 80 characters

    tests/testthat/test-article.R:75:1
    tests/testthat/test-author-import.R:35:1
    tests/testthat/test-author-import.R:36:1
    tests/testthat/test-author-import.R:40:1
    tests/testthat/test-author-import.R:41:1
    ... and 19 more lines

  ✖ fix this R CMD check WARNING: LaTeX errors when creating
    PDF version. This typically indicates Rd problems.
  ✖ fix this R CMD check ERROR: Re-running with no
    redirection of stdout/stderr. Hmm ... looks like a package You may
    want to clean up by 'rm -rf /tmp/Rtmpq9Jt37/Rd2pdf1b207a3eca4b'
────────────────────────────────────────────────────────────────────────────────

I would very much like to submit a paper to JOSS, however at the moment I don't
have a proper case-study yet. If possible, I would like to submit a
paper at a later stage.

Not a problem. There is no requirement that both be timed together. JOSS submission can be made independently at anytime.


Reviewer 1: @jsonbecker
Due date: Feb 23, 2018

Reviewer 2: @rileymsmith19
Due date: March 1

@tklebel
Member Author

tklebel commented Feb 6, 2018

Thanks @karthik!

I cannot reproduce your errors and warnings from R CMD check, neither on Travis, nor on my machine, nor on AppVeyor or win-builder. Could you help me debug them, or should we ignore them?

You put my submission into topic:data-publication, whereas I initially thought that topic:data-extraction would be more suitable. Did you change your opinion?

Before my pre-submission, I looked into @benmarwick's package and read the review by @kbenoit thoroughly. I am reluctant to add features similar to those currently implemented in JSTORr, because the same critique would apply.
In general, I think it would be good to keep jstor rather simple and, in a way, agnostic to what the user does with data from JSTOR (analysing n-grams, conducting topic modeling, etc.). Nevertheless, it would probably be beneficial to show more clearly what can be done with jstor. I am therefore working on a simple case study to illustrate a typical use case, which includes filtering of metadata in combination with the analysis of n-grams.

@karthik
Member

karthik commented Feb 6, 2018

Thanks for catching my mistake with the wrong tag. I have fixed it.

Re. the issues from R CMD CHECK, I'll look again when I have a bit of time, but let's proceed since the reviewers will also be able to see if it's a replicable issue.

For long lines, you can add # nolint at the end of those lines (see https://github.com/jimhester/lintr#project-configuration); this way lintr will not flag them.
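Such a marker looks like this (the line content is purely illustrative):

```r
# An over-long line exempted from lintr's line-length check by the
# trailing marker; the vector content here is made up for illustration
journal_titles <- c("Journal of Very Long Titles", "Another Journal With a Long Name", "A Third One") # nolint
```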

And lastly, I appreciate that you have looked into Ben's work and the relevant feedback on the package. I agree that it would be very helpful to include some use cases in your package.

@tklebel
Member Author

tklebel commented Feb 8, 2018

I added # nolint at the relevant lines, so lintr ignores them now properly.

In the meantime I wrote an initial version for a case-study which is live here: https://tklebel.github.io/jstor/articles/analysing-n-grams.html In order to host this case-study which cannot be included as a vignette, I went ahead and added pkgdown.

Finally, I somehow mistook the JOSS for the JSS. Since JOSS only requires a very short introductory paper, I included one in jstor and would like to submit to JOSS via the fast-track option, in case the review goes well.

@karthik
Member

karthik commented Feb 8, 2018

Both reviewers are now assigned. Thanks @rileymsmith19 and @jsonbecker 🙏!

@tklebel
Member Author

tklebel commented Feb 9, 2018

Since I made a few improvements over the last few days, I created a new version (0.2.6) and made a snapshot at Zenodo. I updated my initial post accordingly and will try to avoid any further changes until the reviews are in.

@maelle
Member

maelle commented Feb 11, 2018

I think the correct Github username for @rileymsmith19 is @EccRiley 😺

@EccRiley

EccRiley commented Feb 15, 2018

You are correct, @maelle, Thanks! Sorry for any confusion, @karthik !

@jsonbecker

jsonbecker commented Feb 21, 2018

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README

  • Installation instructions: for the development version of package and any non-standard dependencies in README

  • Vignette(s) demonstrating major functionality that runs successfully locally

    Personal preference here: in the ToC of the Introduction vignette, the use of backticks to style the section headers results in links that don't look like links. My preference would be to drop the fixed-width font for function names in the headers so that the ToC is more clearly displayed as links.

When reading the descriptions of find_article, find_authors, etc., I don't have confidence about which should return more than one row. Based on the pluralization, I think the expectation is that the XML always contains one article, which may have multiple authors, multiple references, and multiple footnotes. This becomes clearer after reading all of the examples, but if so, I think it would be worth stating up front whether the base XML will always contain one article or possibly more. Additionally, I find the kable output in the vignette appealing, and the switch from find_article and find_authors, which use that output, to just printing tables is a little jarring for me. I would probably use kable in all the examples.

  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.

  • [O] Functionality: Any functional claims of the software been confirmed.

    The functionality works with the supplied examples. Because the data set is not freely available though, it's hard to confirm with an authentic case. I have relatively high confidence this will not be a challenge based on the structured nature of the data set and clear code here.

  • [O] Performance: Any performance claims of the software been confirmed.

    The functions work well and as described with the minimal data examples. There was some note about a desire for a technical review for speeding things up. If such a task were undertaken, I might suggest exploring whether loading the data into a SQLite database would be preferable to working with many XML files. It would be a pretty different package, but I can imagine a scenario where the primary goal of this package is loading the data into SQLite, plus some convenience functions to access that data, rather than going from XML straight to tibbles. In my opinion, the fractured nature of the many XML files and the atomicity of their contents suggest that I would want this data in a database for repeated access, rather than having to crawl the file system. I wonder, though, whether the data is too large for SQLite under realistic loads. In that case it would be harder to justify a package around the database load process, although I'd probably still pursue that if I were working with this package and my own database.

  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.

    The test coverage is solid and the tests are clear. The author may want to consider doing more tests that look at whether the resulting tibble is as expected as opposed to the more extensive element-level checks in some tests, e.g. more like test-author-import.R and less like test-article.R.

  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

2–2.5 hours


Review Comments

  • I am not sure I would name the first column returned by the find_* functions basename. Since this is intended to be a file path or file name, I think it should be named file_name or similar to be more descriptive.
  • I am not sure that I would be using the basename_id as the primary matching agent. I guess since articles do not consistently have a doi, pub_id, and jcode, you are trying to solve the problem by relating back to the file name. In my opinion, if the article_doi, article_pub_id, and article_jcode are a set where almost always only one value exists, then I would probably structure my data to have an article_unique_id and article_id_type or similar, based on gathering those three values. I'd far rather match on an externally valid identifier across files and other data than rely on my file path to the data. In that scenario, I'd think the file path is then an optional field really.
  • I find it a little surprising that books don't contain some of the data available for articles, like footnotes and references. I somewhat expected to see equivalent functions across books and articles. I wonder if it would be valuable to have find_authors work for both books and articles. You already have functions to check if something is an article or a book, so why not have find_authors with a parameter for book or article. You can check for its presence, and if it's not user supplied, use your validate_* helpers to determine the type, then for book you can call find_chapters(authors = TRUE) internally and for article continue on the existing code path? Combined with my feedback above about the article identifiers, you could then return the same tibble structure for books or articles when calling find_authors, which seems convenient to me. This would also potentially avoid the unnest call by adding the author_number column to books like with articles. Finally, for things like references and footnotes, it'd be good for the validate_article failure to say that footnotes and references are not available for books, rather than the more generic "you're using an article function for a book" style message.
  • I see now, from reading through the analysing n-grams vignette, the motivation for having the file path available for grabbing the reduced ngram2 text from this data set. While I think this motivates making the file path an available piece of information, my earlier comment about not using it as the core identifier still stands to some degree. I did find this vignette helpful as someone not familiar with this data set.
  • Given the file directory structure that comes from DfR requests, does it make sense to mimic that for test data as well? Are there support functions or processes used with the ngram2 files that you would find helpful to add as additional functions with this package?
  • What parts of this data may make sense to load into a SQLite database? The choice to go from XML to CSV is a great lowest common denominator, but I wonder if there isn't a greater advantage in formalizing the data in SQLite. Maintaining a large number of files in a specific directory structure, even after transitioning from XML to CSV, poses some concern to me, especially with the file name being a major identifier for joins.
  • If CSV is the appropriate target for some of this work, it may be worthwhile to specify classes prior to write_csv to have the best likelihood of reloading that data without losing type information in the serialization/deserialization process. Similarly, should there be import functions that can pull the CSV output of jstor_import into R with some assertions/checks?
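Specifying classes on re-import could look roughly like this base-R sketch (readr users would pass col_types to read_csv instead; the column names are made up for illustration):

```r
# Write a small table to CSV and read it back with explicit column
# classes so type information survives the round trip
out_file <- tempfile(fileext = ".csv")
articles <- data.frame(
  file_name = c("item1", "item2"),   # illustrative identifier column
  pub_year  = c(1998L, 2004L),
  stringsAsFactors = FALSE
)
write.csv(articles, out_file, row.names = FALSE)

# Explicit colClasses make the re-import deterministic instead of
# relying on read.csv's type guessing
reloaded <- read.csv(out_file,
                     colClasses = c(file_name = "character",
                                    pub_year  = "integer"))
```

The same idea extends to assertions in a dedicated re-import helper: check that the expected columns are present and of the expected type before returning the data.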

@tklebel
Member Author

tklebel commented Feb 21, 2018

Thanks @jsonbecker for the thorough review! I'm eager to respond to the more substantial questions (csv vs sqlite and file_name vs other_identifier), but I think I will wait for the review by @EccRiley, so I can address all comments in one go.

@jsonbecker

jsonbecker commented Feb 21, 2018

@tklebel
Member Author

tklebel commented Mar 12, 2018

I'm still eager to work on/discuss some changes, but wanted to wait for the second review. Do you happen to know until when you will be able to finish your review, @EccRiley?

@sckott
Contributor

sckott commented Mar 20, 2018

@EccRiley are you still able to do the review?

@karthik
Member

karthik commented Apr 5, 2018

@tklebel Sorry for the delays. It looks like @EccRiley hasn't responded in a while. I am seeking another reviewer to replace her.

@tklebel
Member Author

tklebel commented Apr 5, 2018

@karthik no worries. @elinw seems to have been trying out the package, maybe she would be willing to do a review.

@elinw

elinw commented Apr 5, 2018

I have been, yes. I could do a review. I also have some data if @jsonbecker would like to try it.

@karthik
Member

karthik commented Apr 5, 2018

@elinw Fantastic! Thank you for agreeing to review.
Assigning @elinw as second reviewer

@karthik
Member

karthik commented May 3, 2018

@elinw Gentle ping checking in on your review.
🙏

@elinw

elinw commented May 4, 2018

@karthik On it, should finish it up this weekend.

@elinw

elinw commented May 7, 2018

This is still a draft

Package Review

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

JSTOR is a really important data source for the study of scholarly publication. It has wide coverage of many disciplines and also includes sources going back in time. As a data source, JSTOR has some challenges and features that are a bit different from others. First, there is no API in the form of a REST interface. Instead, researchers fill out a web form and request the results of particular queries. They are then emailed a zip file containing potentially thousands (or more) of individual files, including the XML data for each item and its n-gram data in separate files for unigrams, bigrams and trigrams (assuming all of these were requested). These are organized into a deeply nested file structure that ends with a folder for each item (article or book). Obtaining full-text data is a separate process involving a user agreement. While not fundamentally different from working with a web API that returns XML, it definitely feels different to the end user.

All of this is to say that it is extremely helpful to have a package for managing this. Specifically, the jstor package reads the data in from all of the files and writes it out to CSV files separated by type of content (articles, authors, references, footnotes). I think this package will encourage people to use this data source more.

The package itself is solid, test coverage is good, documentation is complete, and good practices are followed. It does exactly what it says and works as described. The code seemed fine, and in particular the way it makes batch size and number of cores easy to specify was helpful to me. I appreciated the single interface to all the data types.

The README is helpful in defining the structure and consistency of the data.

Comments

These are some additional comments for consideration.

I have a JSTOR basic file (not full text) from my research that I used in my testing. It contains 1678 items, 42051 references, and 12376335 bigrams. I did my testing mainly on RStudio Server (open source) installed on CentOS with 4 cores, although I also did some on my MacBook Air. I will say that I was happy to have a package that installed easily.

At first I was surprised that the data were written to CSV files. However, after some thought I feel it is very useful to have this as the main outcome, as it is then up to the user to decide what kind of storage to use for further processing. I found it useful to have the stored files to return to. For me, simply having a way to approach the mass of individual item files and get them into rectangles was a huge win, especially having the processing issues handled (otherwise I would probably have just created a much less smart script for reading the files recursively).

One thing I will mention is that for my data I could not use read_csv() without generating warnings, but read.csv() worked fine. It might be helpful to include a discussion of storage options for the rectangular data. I also don't think this package requires documentation of all aspects of a text analysis. What is more important is to focus on preparing the data for clean importing into other text analysis packages and the storage options they support. That said, I followed the instructions in the bigrams/tidytext vignette pretty exactly and found them incredibly helpful. The JSTOR data (basic files) is not really like the examples in the tidytext book or most other documentation for text analysis, since the full texts are not there (and the n-grams are already created). So I won't discourage doing more of this as you go forward with maintaining the package, especially for known quirks of JSTOR (see below). I also appreciated the focus on getting a simple set of descriptive graphs.

The vignettes are helpful and the new one in particular is a nice overview of the process of cleaning and analyzing the ngram data. However, I think it would be stronger if it discussed some of the issues
specific to the JSTOR data. For example, in working with the package I ended up creating a list of what I think of as my JSTOR stop words because academic writing is a specialized language.

For example for the references data I have:

stop_ref1 <- c("", "References", "Notes", "Note", "Ibid.", "Bibliography", 
               "Loc. cit.", "Id.", "Foonotes", "Ibid.", "Endnotes", 
               "[Bibliography]", "Notes and References", "Works Cited",
               "Footnotes","Ibid", "[References]", "Statutes Cited",
               "Literatur", "Case Cited", "Footnote", "Ibid", 
               "Cases Cited", "[Endnotes]", "References", "Reference"
               )

And for the bigrams I had this.

custom_stop_words <- c("ibid", "supra", "pp", "eg", "alia", "hoc", 
                       "e.g.")
                       

(It is very easy to add custom stop words; I could figure out how just from following the vignette and realizing I could add my vectors to those in tidytext.) It could be useful to provide a more complete version of those lists in some way.
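Filtering against such a custom list can be done with a plain subset; here is a minimal base-R sketch with made-up token counts (the vignette itself uses tidytext's stop_words with an anti_join):

```r
# Hypothetical token counts, as one might have after unnesting n-grams;
# the words and counts are invented for illustration
tokens <- data.frame(
  word = c("ibid", "model", "supra", "regression", "pp"),
  n    = c(40L, 12L, 9L, 7L, 30L),
  stringsAsFactors = FALSE
)

custom_stop_words <- c("ibid", "supra", "pp", "eg", "alia", "hoc", "e.g.")

# Drop rows whose word appears in the custom list; with tidytext one
# would anti_join() against stop_words extended by this vector
filtered <- tokens[!tokens$word %in% custom_stop_words, ]
```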

I also did not necessarily want to get rid of all numbers, but one kind of troublesome number was, e.g., 1998b. It would probably be easy to write a function for that, though I ended up just listing out the ones I found.
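Such a function might be as simple as a regular-expression check; the pattern below is an assumption about what those citation-year tokens look like:

```r
# Flag tokens that look like citation years such as "1998" or "1998b":
# exactly four digits followed by an optional lowercase letter
is_citation_year <- function(x) grepl("^[0-9]{4}[a-z]?$", x)

words <- c("1998b", "model", "2004", "p53", "citation")
words[!is_citation_year(words)]
# keeps "model", "p53" and "citation"
```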

Another example: with JSTOR, unless you specify otherwise, you get articles in multiple languages. The language names are not consistently coded, so I used dplyr::filter(language == "en" | language == "eng"). There are various things like that which you might want to mention.
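That filter, sketched against a hypothetical metadata table (base R shown for self-containment; the dplyr version above is equivalent):

```r
# Hypothetical metadata with inconsistently coded language values
meta <- data.frame(
  file_name = c("a1", "a2", "a3", "a4"),
  language  = c("en", "eng", "fre", "ger"),
  stringsAsFactors = FALSE
)

# Keep English items under both codings; equivalent to
# dplyr::filter(meta, language == "en" | language == "eng")
english <- meta[meta$language %in% c("en", "eng"), ]
```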

I do not have a problem with using the basename_id as the main ID. This potentially lets me work with other interfaces to JSTOR, is consistently produced over time, and is unique. If I have data from separate JSTOR queries, it lets me find overlap easily.

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 5

Review Comments

@jsonbecker

jsonbecker commented May 18, 2018

I would be happy with the renaming to jst_*, but I might still prefer "verbing" to suggest the functionality, e.g. jst_get_article. I wouldn't hold up a 👍 for that. I am satisfied given more context with the move toward file_name and using that as an identifier. I guess I can't wish JSTOR toward having a better data model ;).

I'm also good with database being out of scope.

@tklebel
Member Author

tklebel commented Jun 1, 2018

It took some time, but I think I am finished with the requested changes. Due to the long duration of the review process, there are many changes altogether, and only some are responses to the reviewers' comments.

In my post above I outlined five areas where I would make changes:

  • Improve formatting of vignette by using knitr::kable() and formatting headings in normal font.
  • Make error about references/footnotes not being available for books more specific.
  • Clarify the structure of data in the intro vignette: xml-files only contain one article, one book (with multiple chapters), etc.
  • Create helpers to re-import data after conversion with jstor_import.
  • Write first version of vignette about known quirks.

All of them are done as of now. The news file, containing all changes, is as follows (or in a prettier version online here: https://tklebel.github.io/jstor/news/index.html):

jstor 0.3.0

Breaking changes

jst_import and jst_import_zip now use futures as a backend for parallel
processing. This makes internals more compact and reduces dependencies.
Furthermore this reduces the number of arguments, since the argument cores
has been removed. By default, the functions run sequentially. If you want them
to execute in parallel, use futures:

library(future)
plan(multiprocess)

jst_import_zip("zip-archive.zip",
               import_spec = jst_define_import(article = jst_get_article),
               out_file = "outfile")

If you want to terminate the processes, at least on *nix systems you need to kill
them manually (once again).

  • All functions have been renamed to use a unified naming scheme: jst_*.
    The former group of find_* functions is now called jst_get_*, as in
    jst_get_article(). The previous functions have been deprecated and will be
    removed before submission to CRAN.
  • The unique identifier for matching across files has been renamed to
    file_name, and the corresponding helper to get this file name from
    get_basename to jst_get_file_name.

Importing data directly from zip-files

There is a new set of functions which lets you directly import files from
.zip-archives: jst_import_zip() and jst_define_import().

In the following example, we have a zip-archive from DfR and want to import
metadata on books and articles. For all articles we want to apply
jst_get_article() and jst_get_authors(), for books only jst_get_book(),
and we want to read unigrams (ngram1).

First we specify what we want, and then we apply it to our zip-archive:

# specify definition
import_spec <- jst_define_import(article = c(jst_get_article, jst_get_authors),
                                 book = jst_get_book,
                                 ngram1 = jst_get_ngram)

# apply definition to archive
jst_import_zip("zip_archive.zip",
               import_spec = import_spec,
               out_file = "out_path")

If the archive also contains research reports, pamphlets or other ngrams, they
will not be imported. We could, however, change our specification if we wanted
to import all kinds of ngrams (provided we originally requested them from
DfR):

# import multiple forms of ngrams
import_spec <- jst_define_import(article = c(jst_get_article, jst_get_authors),
                                 book = jst_get_book,
                                 ngram1 = jst_get_ngram,
                                 ngram2 = jst_get_ngram,
                                 ngram3 = jst_get_ngram)

Note, however, that for larger archives, importing all ngrams takes a very long
time. It is thus advisable to only import ngrams for the articles you want to
analyse, i.e. most likely a subset of the initial request. The new function
jst_subset_ngrams() helps you with this (see also the section on importing
bigrams in the case study).

Before importing all files from a zip-archive, you can get a quick overview with
jst_preview_zip().

New vignette

The new vignette("known-quirks") lists common problems with data from
JSTOR/DfR. Contributions with further cases are welcome!

New functions

  • New function jst_get_journal_overview() supplies a tibble with contextual
    information about the journals in JSTOR.
  • New function jst_combine_outputs() applies jst_re_import() to a whole
    directory and lets you combine all related files in one go. It uses the file
    structure that jst_import() and jst_import_zip() produce as a heuristic: a
    filename ending with a dash and one or more digits (filename-1.csv). All
    files with identical names (disregarding dash and digits) are combined into
    one file.
  • New function jst_re_import() lets you re-import a .csv file that
    jst_import() or jst_import_zip() exported. It tries to guess the type of
    content from the column names or, if column names are not available, from
    the number of columns, raising a warning and falling back to a generic
    import if guessing fails.
  • A new function jst_subset_ngrams() lets you create a subset of ngram files
    within a zip-file which you can import with jst_get_ngram().
  • A new set of convenience functions for common cleaning steps:
    jst_clean_page() tries to turn a character vector of pages into a numeric
    one, jst_unify_journal_id() merges different specifications of journals into
    one, jst_add_total_pages() adds a total count of pages per article, and
    jst_augment() calls all three to clean the data set in one go.
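The filename heuristic behind jst_combine_outputs() can be sketched in base R (a simplified illustration of the described pattern, not the package's actual code):

```r
# Simplified sketch of the grouping heuristic (not jstor's actual code):
# strip the trailing "-<digits>" before the extension, then group files
# that share the resulting base name.
files <- c("articles-1.csv", "articles-2.csv", "books-1.csv")
base_names <- sub("-[0-9]+(\\.csv)$", "\\1", files)
split(files, base_names)
# $`articles.csv`
# [1] "articles-1.csv" "articles-2.csv"
#
# $`books.csv`
# [1] "books-1.csv"
```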
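To give a feel for the kind of cleaning jst_clean_page() aims at, here is a hedged base-R sketch (the regex and behaviour are illustrative assumptions, not the package's actual implementation):

```r
# Hypothetical sketch of a page-cleaning step (not jstor's actual code):
# keep the first run of digits in each page string, convert to integer.
pages <- c("89", "pp. 89", "89-100")
as.integer(sub("^[^0-9]*([0-9]+).*$", "\\1", pages))
# [1] 89 89 89
```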

Minor changes

  • Improved documentation regarding endnotes (thanks @elinw)
  • jstor_import() has a new argument, n_batches, which lets you specify the
    number of batches directly.

I don't expect you to dive into all the changes, though. I hope that all your comments are addressed with the changes and that we can finish the onboarding soonish 😄

@elinw

elinw commented Jun 5, 2018

I'm sorry I'm slow on replying, but this seems good. I love the import from zip. I have some things to focus on right now, but by the end of the week I'm planning to update to the current development branch and try things out. I will also get you an example article.

@jsonbecker

jsonbecker commented Jun 6, 2018

I am very happy with these changes. I especially like the set and structure of the vignettes and the separation of the "case study". This now serves not only as a well-documented package but as solid documentation of the data it works with. This has my full 👍

@elinw

elinw commented Jun 6, 2018

25074331
basename_id journal_doi journal_jcode journal_pub_id article_doi article_pub_id article_jcode article_type
1 25074331 NA jbusiethi 25074331 research-article
article_title volume issue language pub_day
1 Ethics Education in the Workplace: An Effective Tool to Combat Employee Theft 26 2 eng 1
pub_month pub_year first_page last_page
1 7 2000 89 100
Has
21646 Greengard, Samuel: April 1993, 'Theft Control\nStarts with HR Strategies', Personnel Journal, p. 85.
21647 Greengard, Samuel: April 1993, 'Theft Control\nStarts with HR Strategies', Personnel Journal, p. 85.
21648 Greengard, Samuel: April 1993, 'Theft Control\nStarts with HR Strategies', Personnel Journal, p. 85.
21649 Greengard, Samuel: April 1993, 'Theft Control\nStarts with HR Strategies', Personnel Journal, p. 85.
21650 Greengard, Samuel: April 1993, 'Theft Control\nStarts with HR Strategies', Personnel Journal, p. 85.

I actually pulled the article and it's true: they cited the same page multiple times. It was striking because it made Greengard the most cited author, but I had never heard of him.

@tklebel
Member Author

tklebel commented Jun 7, 2018

Thanks @elinw for the example. I added a few sentences about issues with references in the vignette and linked the article as an example.

@elinw

elinw commented Jun 7, 2018

I actually don't think the article is "wrong"; it's just following the citation style of the journal it is published in. Other journals would probably have used Ibid. and op. cit. or similar.

@tklebel
Member Author

tklebel commented Jun 10, 2018

@elinw I changed my language a bit to remove the certainty of the judgement, though I would still consider it an artifact, since they use Ibid. too, but only for some references. However, there might be another reason for this that I am not aware of. I therefore rephrased it like this: "[the article cites ...] which is possibly an artifact."

@tklebel
Member Author

tklebel commented Jun 19, 2018

@karthik just pinging to ask if you could take a quick look at this issue. The current status seems to be:

  • We are in phase 5/awaiting-reviewer(s)-response.
  • @jsonbecker has approved the package.
  • It seems like @elinw wanted to take another look at the current dev branch, but I am not entirely sure.

How will this onboarding proceed?

@karthik
Member

karthik commented Jun 27, 2018

Hi @tklebel Apologies for the delay. I'm just waiting on @elinw to sign off and with that we can finish up.
Elin: Do you have any further unresolved issue with the master release of this effort? If not could you indicate sign off so we can proceed with acceptance? 🙏

@elinw

elinw commented Jun 27, 2018

Yes, I am good with these changes, the last of which are mainly just polishing documentation etc. This is going to be really useful and will encourage the use of DfR. I noticed the change from stop() to abort() and read about the differences; my suggestion is that you make that rlang::abort() explicit, which you'll need to do for CRAN anyway.

Great job!

@tklebel
Member Author

tklebel commented Jun 27, 2018

@elinw I'm actually not sure whether I really prefer abort over stop in general; I used it to remove the calls to .call = FALSE. But since I import it via importFrom(rlang, abort) in NAMESPACE, this should be fine for CRAN, right? If not, I will of course change it before submitting.

Apart from that, I am very happy for the thumbs up! This comes just in time before useR! 😃

@karthik
Member

karthik commented Jun 27, 2018

Congrats @tklebel , your submission has been approved! 🎉 Thank you for submitting and @elinw and @jsonbecker for thorough and timely reviews. To-dos:

  • Transfer the repo to the rOpenSci organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You'll be made admin once you do.

  • Complete final suggestions from elin, especially to get ready for CRAN

  • Add the rOpenSci footer to the bottom of your README

[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)

  • Fix any links in badges for CI and coverage to point to the ropensci URL. (We'll turn on the services on our end as needed)

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/technotes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.

@tklebel
Member Author

tklebel commented Jun 27, 2018

Great! Thank you @karthik for managing the application, and @jsonbecker and @elinw for your helpful comments!

I transferred the repo, added the rOpenSci footer and changed all relevant links in the README.

I'll see to making the package ready for CRAN and will complete the suggestion by @elinw while doing so.

I'd very much like to write a blog post with some narrative about the development of the package. @stefaniebutland, I should have time to write this post some time after July.

@tklebel
Member Author

tklebel commented Jul 8, 2018

@karthik I changed more or less all the links I found, and the CI systems (Travis and AppVeyor) are working. The only missing part, as far as I can see, is admin rights for the repo, so I can update a few remaining things (mainly the link to the pkgdown site). Then we could close the issue, I think.

@stefaniebutland
Member

stefaniebutland commented Jul 18, 2018

I'd very much like to write a blog post with some narrative about the development of the package. @stefaniebutland, I should have time to write this post some time after July.

Very glad to hear that @tklebel. This link will give you many examples of blog posts by authors of onboarded packages so you can get an idea of the style and length you prefer: https://ropensci.org/tags/review/.

Here are some technical and editorial guidelines for contributing a blog post: https://github.com/ropensci/roweb2#contributing-a-blog-post. I'll mark my calendar to follow up with you in mid-August so we can agree on a publication date. We ask that you submit your draft post via pull request a week before the planned publication date so we can provide feedback.

Happy to answer any questions as they come up.

@stefaniebutland
Member

stefaniebutland commented Aug 21, 2018

@tklebel I have blog post publication dates available Sept 18, Oct 2, 9, 16. If you're still interested, let me know your preferred date and mark your calendar to submit a draft a week in advance.

Please see my comment above for more details.

@tklebel
Member Author

tklebel commented Aug 22, 2018

@stefaniebutland I'm still interested in writing a blog post. Regarding publication dates, I would prefer October 9th, if that is ok for you.

@stefaniebutland
Member

stefaniebutland commented Aug 22, 2018

Glad to hear that @tklebel. Tuesday 2018-10-09 is yours. Please submit a draft via pull request by 2018-10-02 and don't hesitate to ask any questions here.

@karthik karthik closed this as completed Aug 26, 2018
@tklebel
Member Author

tklebel commented Aug 26, 2018

@karthik is it possible that I am still missing appropriate rights for the repo? I wanted to fix the link in the title of the repo, but I still couldn't do it (someone else fixed it for me in the meantime).

@karthik
Member

karthik commented Aug 26, 2018

@tklebel Can you please check now?

@tklebel
Member Author

tklebel commented Aug 26, 2018

@karthik perfect, thanks!

@stefaniebutland
Member

stefaniebutland commented Sep 25, 2018

Hi @tklebel. Checking to see if you're still on track to submit a draft blog post by 2018-10-02 for publication the following week.

@tklebel
Member Author

tklebel commented Sep 26, 2018

Hi @stefaniebutland. Is it ok if the post is more about why and how I wrote the package, including what I learned along the way, and less about how to use it? I have a lot of documentation, including a case study on the pkgdown site, so I don't think it would be helpful or interesting to write something similar again...

If you are fine with that, then my answer is that I am on track 😺

@stefaniebutland
Member

stefaniebutland commented Sep 26, 2018

Sounds great @tklebel. More people will likely relate to this kind of post. Please include a short section noting & linking to the docs you have.
