Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

protect against missing cols from cr_works / missing fields from crossref API #183

Open
maxheld83 opened this issue Mar 4, 2021 · 6 comments
Labels
Milestone

Comments

@maxheld83
Copy link
Contributor

when a field is NA (?), cr_works() just drops the column:

rcrossref::cr_works("10.1038/s41598-020-57429-5")[["data"]]
#> # A tibble: 1 x 34
#>   alternative.id container.title    created    deposited  published.print
#>   <chr>          <chr>              <chr>      <chr>      <chr>          
#> 1 57429          Scientific Reports 2020-01-31 2021-01-30 2020-12        
#> # … with 29 more variables: published.online <chr>, doi <chr>, indexed <chr>,
#> #   issn <chr>, issue <chr>, issued <chr>, member <chr>, prefix <chr>,
#> #   publisher <chr>, score <chr>, source <chr>, reference.count <chr>,
#> #   references.count <chr>, is.referenced.by.count <chr>, subject <chr>,
#> #   title <chr>, type <chr>, update.policy <chr>, url <chr>, volume <chr>,
#> #   abstract <chr>, language <chr>, short.container.title <chr>,
#> #   assertion <list>, author <list>, funder <list>, link <list>,
#> #   license <list>, reference <list>
rcrossref::cr_works("10.1109/JLT.2019.2961931")[["data"]]
#> # A tibble: 1 x 27
#>   container.title  created  deposited published.print doi    indexed issn  issue
#>   <chr>            <chr>    <chr>     <chr>           <chr>  <chr>   <chr> <chr>
#> 1 Journal of Ligh… 2019-12… 2020-10-… 2020-07-01      10.11… 2021-0… 0733… 13   
#> # … with 19 more variables: issued <chr>, member <chr>, page <chr>,
#> #   prefix <chr>, publisher <chr>, score <chr>, source <chr>,
#> #   reference.count <chr>, references.count <chr>,
#> #   is.referenced.by.count <chr>, subject <chr>, title <chr>, type <chr>,
#> #   url <chr>, volume <chr>, short.container.title <chr>, author <list>,
#> #   funder <list>, link <list>

Created on 2021-03-04 by the reprex package (v1.0.0)

This is pretty dangerous.
purrr::map_dfr() in #168 should cover many of these cases, but not deterministically.

We should either:

  1. make it length stable in columns (requiring parsing the raw json)
  2. always assert the fields we actually need, and only keep those.
    (that might still cause an error)
@njahn82
Copy link
Collaborator

njahn82 commented Mar 4, 2021

cr_works only parses fields that are returned by the Crossref API. E.g., http://api.crossref.org/v1/works/10.1109/JLT.2019.2961931 lacks license and reference nodes. Not sure how parsing the raw json would help.

@njahn82
Copy link
Collaborator

njahn82 commented Mar 4, 2021

https://github.com/ropensci/rcrossref/blob/03c3026aca49808c6d5088854ecdce11112d2d12/R/cr_works.R#L169

cr_works uses bind_rows.

@maxheld83
Copy link
Contributor Author

cr_works uses bind_rows

I know, similar to what purrr::map_dfr() will achieve.
It will probably protect us in most cases (when at least one of the dois has the necessary node), though not necessarily in all.

If we're unlucky, none of them do, and then x[[col]] won't exist at all, and we get an error which is hard to debug.
And I don't think we should work around this wherever x[[col]] is called; because that would perpetuate the instability up the call stack.

Anyway, it's not a priority, and depending on which fields we use it might be very unlikely.
But it could bite us, and then eat up a good couple of hours to debug, because it's quite unituitive.

@maxheld83
Copy link
Contributor Author

cr_works only parses fields that are returned by the Crossref API. E.g., http://api.crossref.org/v1/works/10.1109/JLT.2019.2961931 lacks license and reference nodes. Not sure how parsing the raw json would help.

Uh, that's a bummer, thanks for pointing that out.
I wrongly suspected rcrossref 😔.

crossref then! 😡

Just to show that I'm not making these expectations up; usual best standard for APIs is to always show all response fields:

If a field has no value, it shall be null.

This one is quite trivial, but is has a few nuances. Non existing numbers, strings, and booleans are > usually represented as null. But string fields without value should also be represented as null, not "".

Note: Empty arrays shall not be null. If a field is some kind of list and it is represented by an array, an array shall be returned, even if empty. This makes front-end developers’ work a lot easier.

Example:

{
   "id": 16784,
   "name": "Lorem Ipsum",
   "age": null,
   "relatives": [], // no relatives
   "address": null
}

So, bottom line: it would be possible, though inelegant / unreasonable expectation for rcrossref or any other client to work around this (they'd have to maintain a full list of all possible fields and NA them when not found ... bah 🤮).

As a backstop, we should make a list of fields which we actually use, and then test on ingest whether these fields are available for all rows and drop those rows.
And drop the fields which we don't use anyway.

In any case, we should be very defensive and fail/drop as early as possible.

@maxheld83 maxheld83 changed the title cr_works is not length stable in columns protect against missing cols from cr_works / missing fields from crossref API Mar 4, 2021
@maxheld83
Copy link
Contributor Author

also raised this over at crossref: CrossRef/rest-api-doc#551

@maxheld83
Copy link
Contributor Author

it occurs to me, we've already run into a version of this bug in #117.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants