protect against missing cols from cr_works / missing fields from crossref API #183

maxheld83 · 2021-03-04T19:03:51Z

when a field is NA (?), cr_works() just drops the column:

rcrossref::cr_works("10.1038/s41598-020-57429-5")[["data"]]
#> # A tibble: 1 x 34
#>   alternative.id container.title    created    deposited  published.print
#>   <chr>          <chr>              <chr>      <chr>      <chr>          
#> 1 57429          Scientific Reports 2020-01-31 2021-01-30 2020-12        
#> # … with 29 more variables: published.online <chr>, doi <chr>, indexed <chr>,
#> #   issn <chr>, issue <chr>, issued <chr>, member <chr>, prefix <chr>,
#> #   publisher <chr>, score <chr>, source <chr>, reference.count <chr>,
#> #   references.count <chr>, is.referenced.by.count <chr>, subject <chr>,
#> #   title <chr>, type <chr>, update.policy <chr>, url <chr>, volume <chr>,
#> #   abstract <chr>, language <chr>, short.container.title <chr>,
#> #   assertion <list>, author <list>, funder <list>, link <list>,
#> #   license <list>, reference <list>
rcrossref::cr_works("10.1109/JLT.2019.2961931")[["data"]]
#> # A tibble: 1 x 27
#>   container.title  created  deposited published.print doi    indexed issn  issue
#>   <chr>            <chr>    <chr>     <chr>           <chr>  <chr>   <chr> <chr>
#> 1 Journal of Ligh… 2019-12… 2020-10-… 2020-07-01      10.11… 2021-0… 0733… 13   
#> # … with 19 more variables: issued <chr>, member <chr>, page <chr>,
#> #   prefix <chr>, publisher <chr>, score <chr>, source <chr>,
#> #   reference.count <chr>, references.count <chr>,
#> #   is.referenced.by.count <chr>, subject <chr>, title <chr>, type <chr>,
#> #   url <chr>, volume <chr>, short.container.title <chr>, author <list>,
#> #   funder <list>, link <list>

^{Created on 2021-03-04 by the reprex package (v1.0.0)}

This is pretty dangerous.
purrr::map_dfr() in #168 should cover many of these cases, but not deterministically.

We should either:

make it length stable in columns (requiring parsing the raw json)
~~always assert the fields we actually need, and only keep those.~~
(that might still cause an error)

The text was updated successfully, but these errors were encountered:

njahn82 · 2021-03-04T19:20:49Z

cr_works only parses fields that are returned by the Crossref API. E.g., http://api.crossref.org/v1/works/10.1109/JLT.2019.2961931 lacks license and reference nodes. Not sure how parsing the raw json would help.

njahn82 · 2021-03-04T19:27:27Z

https://github.com/ropensci/rcrossref/blob/03c3026aca49808c6d5088854ecdce11112d2d12/R/cr_works.R#L169

cr_works uses bind_rows.

maxheld83 · 2021-03-04T19:32:13Z

cr_works uses bind_rows

I know, similar to what purrr::map_dfr() will achieve.
It will probably protect us in most cases (when at least one of the dois has the necessary node), though not necessarily in all.

If we're unlucky, none of them do, and then x[[col]] won't exist at all, and we get an error which is hard to debug.
And I don't think we should work around this wherever x[[col]] is called; because that would perpetuate the instability up the call stack.

Anyway, it's not a priority, and depending on which fields we use it might be very unlikely.
But it could bite us, and then eat up a good couple of hours to debug, because it's quite unituitive.

maxheld83 · 2021-03-04T19:47:07Z

cr_works only parses fields that are returned by the Crossref API. E.g., http://api.crossref.org/v1/works/10.1109/JLT.2019.2961931 lacks license and reference nodes. Not sure how parsing the raw json would help.

Uh, that's a bummer, thanks for pointing that out.
I wrongly suspected rcrossref 😔.

crossref then! 😡

Just to show that I'm not making these expectations up; usual best standard for APIs is to always show all response fields:

If a field has no value, it shall be null.

This one is quite trivial, but is has a few nuances. Non existing numbers, strings, and booleans are > usually represented as null. But string fields without value should also be represented as null, not "".

Note: Empty arrays shall not be null. If a field is some kind of list and it is represented by an array, an array shall be returned, even if empty. This makes front-end developers’ work a lot easier.

Example:
{
   "id": 16784,
   "name": "Lorem Ipsum",
   "age": null,
   "relatives": [], // no relatives
   "address": null
}

So, bottom line: it would be possible, though inelegant / unreasonable expectation for rcrossref or any other client to work around this (they'd have to maintain a full list of all possible fields and NA them when not found ... bah 🤮).

As a backstop, we should make a list of fields which we actually use, and then test on ingest whether these fields are available for all rows and drop those rows.
And drop the fields which we don't use anyway.

In any case, we should be very defensive and fail/drop as early as possible.

maxheld83 · 2021-03-04T20:46:19Z

also raised this over at crossref: CrossRef/rest-api-doc#551

maxheld83 · 2021-03-05T06:56:15Z

it occurs to me, we've already run into a version of this bug in #117.

maxheld83 added a commit that referenced this issue Mar 4, 2021

enforce some length stability opens #183

73dc757

maxheld83 changed the title ~~cr_works is not length stable in columns~~ protect against missing cols from cr_works / missing fields from crossref API Mar 4, 2021

maxheld83 added this to the Wide Release milestone Mar 5, 2021

maxheld83 mentioned this issue Mar 5, 2021

container.title does not exist #117

Closed

njahn82 added a commit that referenced this issue Mar 5, 2021

metrics: hack fix missing funder and link columns #183

ab11f53

maxheld83 mentioned this issue Mar 15, 2021

isolate tests from live APIs #77

Open

maxheld83 added the api label Mar 19, 2021

maxheld83 mentioned this issue Mar 24, 2021

standardise order of columns #222

Open

maxheld83 modified the milestones: Wide & Intl Release, DevOps Migration Apr 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

protect against missing cols from cr_works / missing fields from crossref API #183

protect against missing cols from cr_works / missing fields from crossref API #183

maxheld83 commented Mar 4, 2021

njahn82 commented Mar 4, 2021 •

edited

njahn82 commented Mar 4, 2021 •

edited

maxheld83 commented Mar 4, 2021

maxheld83 commented Mar 4, 2021

maxheld83 commented Mar 4, 2021

maxheld83 commented Mar 5, 2021

protect against missing cols from cr_works / missing fields from crossref API #183

protect against missing cols from cr_works / missing fields from crossref API #183

Comments

maxheld83 commented Mar 4, 2021

njahn82 commented Mar 4, 2021 • edited

njahn82 commented Mar 4, 2021 • edited

maxheld83 commented Mar 4, 2021

maxheld83 commented Mar 4, 2021

maxheld83 commented Mar 4, 2021

maxheld83 commented Mar 5, 2021

njahn82 commented Mar 4, 2021 •

edited

njahn82 commented Mar 4, 2021 •

edited