Think about whether hybrid data/API makes sense #20

perrystephenson · 2016-11-06T12:32:52Z

I was an individual submitted in this exercise (although not for a case study) from one of the listed institutions, and I was personally involved in managing the staff submitted from my institution, so my first impression was that this would be a really cool package to have for accessing this data. However, the more I experimented with the package, the more I wondered why the API approach was needed. The data is static, so the only reasons not to package the data are copyright and size. Most of the information (except some of the case studies - see http://impact.ref.ac.uk/CaseStudies/Terms.aspx) is governed by a CC license, and so could easily be packaged as data. The size objection applies only to the case studies themselves, and if the documentation or README.md said more about the motivation, I would have a better idea of just how large they are (and whether they are too large to simply provide as one large flattened data.frame or "tibble").

The following static tables from the API are CC licensed and could easily be packaged as built-in objects:

institutions: This table is a 155 x 5 data.frame of 20.8k in size
units_of_assessment: 36 x 3
tag_types: 13 x 2
values: This is much larger, but the entire table could be flattened in a way that links to tag_types, if we are willing to set aside the principles of relational data normalization (something most users may not know or care about).

Packaging these tables seems to gut the functions from the package, since it leaves only get_case_studies(), which might be appropriately handled through an API call. But here I suggest the package could add real value by providing data-handling functions that link the static data objects to the structure of what get_case_studies() returns, such as ways to flatten the lists that are elements of the objects returned by that function. For instance, the return object from get_case_studies(ID = c(27,29)) is a 2 x 19 tibble, but several of those columns (e.g. Continent) are variable-length lists. Many users who are not experts in dissecting R objects are going to have trouble with the nesting of lists within data.frames.
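
To make the flattening idea concrete, here is a minimal sketch in base R. It does not call the API: the `studies` object is a made-up stand-in for a get_case_studies() result, the `CaseStudyId` column name and the values are illustrative, and `flatten_list_column()` is a hypothetical helper, not a function in the package. It also assumes each list element is an atomic vector; nested data.frames would need a different approach.

```r
# Toy stand-in for a get_case_studies() result: one row per case study,
# with a list column whose elements vary in length. The values are made up.
studies <- data.frame(CaseStudyId = c(27, 29))
studies$Continent <- list(c("Europe", "Asia"), "Europe")

# Hypothetical helper: repeat each row once per element of the list column,
# so the result is a "long" data.frame with atomic columns only.
flatten_list_column <- function(df, column) {
  lens <- lengths(df[[column]])
  # Keep rows whose list element is empty by giving them a single NA.
  values <- lapply(df[[column]], function(x) if (length(x) == 0) NA else x)
  lens[lens == 0] <- 1L
  keep <- setdiff(names(df), column)
  out <- df[rep(seq_len(nrow(df)), times = lens), keep, drop = FALSE]
  out[[column]] <- unlist(values, use.names = FALSE)
  rownames(out) <- NULL
  out
}

flat <- flatten_list_column(studies, "Continent")
```

The result has one row per (case study, continent) pair, which is the shape most users expect when filtering or joining against the smaller lookup tables.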

In addition, by having the smaller objects as built-in data, the inputs to get_case_studies() can be checked for valid values, rather than relying on the API to reject a non-existent ID, for instance.
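
A rough sketch of what such a check could look like, again in base R. The `units_of_assessment` object below is a made-up stand-in for a bundled table, the `ID` column name is an assumption, and `check_ids()` is a hypothetical helper rather than anything the package provides.

```r
# Made-up stand-in for a bundled lookup table; real contents would differ.
units_of_assessment <- data.frame(ID = 1:36)

# Hypothetical check: fail early if any requested ID is not in the lookup,
# instead of waiting for the API to reject the request.
check_ids <- function(ids, lookup, id_column = "ID") {
  unknown <- setdiff(ids, lookup[[id_column]])
  if (length(unknown) > 0) {
    stop("Unknown ID(s): ", paste(unknown, collapse = ", "), call. = FALSE)
  }
  invisible(ids)
}

# Valid IDs pass silently; an invalid one raises an error before any request.
check_ids(c(27, 29), units_of_assessment)
result <- tryCatch(
  check_ids(c(5, 99), units_of_assessment),
  error = function(e) conditionMessage(e)
)
```

The trade-off is that a bundled lookup can go stale if the database changes, so a check like this is only as reliable as the packaged copy.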

perrystephenson · 2017-02-04T08:00:29Z

I had a good think about this, and I decided that it was safest to keep the package as a "pure" API wrapper rather than bundling the smaller tables in the package. This is to ensure that any changes to the database (as unlikely as that may be) are reflected for users of the package, and also to keep the package internally consistent. The case studies tables are many times larger than what would be acceptable in a CRAN package, which means that the API is required for at least that function.

The "flattened tables" idea is a good one, and I will have a look at creating some helper functions to help users navigate the dataset.

perrystephenson closed this as completed in 775d27b Feb 11, 2017

perrystephenson mentioned this issue Mar 26, 2017

refimpact submission ropensci/software-review#78