Functions handle NAs in input incorrectly #224

stitam · 2020-03-17T19:31:53Z

aw_query(NA) returns http error 404.
get_chebiid(NA) returns sodium containing compounds.
ci_query(NA) returns NA, but the output structure is different from a valid query.
get_csid(NA) and all other ChemSpider functions return client error: (400) bad request.

These are just the first few, I believe most functions in the package are affected. Since a lot of webchem functions are vectorised, I think it is important that these functions always return the same output structure even if the input is invalid. I think we have to find a robust solution and implement it systematically in all of our affected functions.

The text was updated successfully, but these errors were encountered:

Aariq · 2020-03-17T19:57:45Z

I had a feeling this one was more pervasive. Something we should keep in mind for future contributions as well.

stitam · 2020-03-17T20:58:07Z

Thanks for bringing this up in PR #223! The issue came up in an example for the paper where get_cid() and get_csid() were affected. I cannot think of a better alternative than to edit each function one by one. What do you think? I will edit the contributing guide to remind the contributors of this issue.

Aariq · 2020-03-17T21:01:05Z

Just need to edit each function to have something like:

if(is.na(query){
  return(NA)
}

And add tests to make sure it still returns the expected data type (i.e. a tibble).

stitam · 2020-03-18T13:54:21Z

Yes, something like this. However, if a function returns e.g. a list of 8 elements, then it has to return the same structure when the input is NA.

andschar · 2020-03-24T10:09:33Z

I agree that the same structure should be returned, but not necessarily with the same number of elements. So if a function returns a data.frame with 4 columns, it's enough to return data.frame(NA) because data.table::rbindlist() has the fill = TRUE argument and dplyr::bind_rows() fills values automatically. Except for the case of only NA results.

l = list(iris[1:10, ], data.frame(Species = NA))
dplyr::bind_rows(l)
data.table::rbindlist(l, fill = TRUE)

So, a structure like you suggest in PR #225 in the chebi.R file is in my opinion not necessary. What do you think?

stitam · 2020-03-24T15:34:19Z

Great point! In case of get_*() functions, bind_rows() will work well with data.frame(NA), unless the first query is NA, but that can be solved by data.frame(chebiid = NA). I'll update PR #225.

I am unsure about the query functions like chebi_comp_entity(). These functions currently return a list of lists and the lowest level list elements are not necessarily data frames, wouldn't this complicate things?

andschar · 2020-03-24T16:18:18Z

Yes, it's a rather complex data structure that is returned from ChEBI (i.e. a ontology - a data graph) which is too difficult, or better an overkill to squash into one data.frame in my opinion. So I think there's no reason to bind it in general (be it for the user or us webchem devs). chebi_comp_entity() mostly returns a list of data.frames, but yes some list elements are again lists, such as the chem_structure (which doesn't fit in a data.frame).
And if an erroneous ID is supplied chebi_comp_entity() returns a data.frame(NA) object, which can be removed from a larger list e.g. by doing l[ sapply(l, function(x) !is.na(x[[1]][1])) ]. Now that I look at it, it looks rather complicated :D. I think it would even be better to just return a NA here (in the case of lists), because then a user could just do l[ !is.na(l) ].

I thought the discussion in #193 whether to uniform the output only concerns the get_*() functions? I fear that data objects from different data resources are too different to have one harmonized data.frame for all of them.

l = chebi_comp_entity(c('CHEBI:27744', 'whatever', 'CHEBI:17790'))
l2 = l[ sapply(l, function(x) !is.na(x[[1]][1])) ]

Generally it's not the easiest topic. Hm, maybe your approach, to have the empty_chebi_comp_entity() is the better option.

Actually I never really bothered much about erroneous ID inputs because my workflow in ChEBI looks as such:

retrieve IDs with get_cebiid()
Clean the IDs
Fetch data with cleaned, non-erroneous IDs with chebi_comp_entity()

stitam · 2020-03-24T17:02:15Z

I expect that most querry functions like chebi_comp_entity() will return complex and unique data structures and I agree that it won't be feasible to coerce everything into a data frame. Removing results is fine, but I would prefer to have vectorised extractor functions like cas(). These functions will use something like sappy() and so I think they will need the same substructure within the lists. I agree that empty_chebi_comp_entity() doesn't look very nice. I'll think of a better way.

stitam · 2020-03-25T10:43:35Z

With the new PubChem PUG-View web service we can access a whole PubChem page as a JSON object. Depending on the compound, these objects might not have the same paragraphs, e.g. the acetic acid page contains pKa information but the sulfuric acid page doesn't, and so a simple pKa extractor function won't work if we don't handle the missing list element. With PubChem this will be very common, and this is completely natural. Also very similar to the issue with invalid inputs.

In our data flow we often have "input"->get()$id->query->extract->"output", and NA-s have to be handled on each arrow to allow vectorisation. In the get functions, we just have to add an NA line to the output data.frame for each invalid input and the query functions will work well. But for the query functions, enforcing a uniform output structure would be a huge job. Maybe missing elements should be handled within the extractor functions. So @Aariq, I think your suggestion earlier is great! If we add something like that to each extractor function, and customize it to comply with the function's output structure, then I think this will work everywhere.

Example:

query_results <- list(a=list(name="compound", info = data.frame(b=1,c=2)),d=list())

sapply(query_results, function(x){
    if (length(x) == 0) return(NA)
    x$info$b
})

andschar · 2020-03-25T14:11:40Z

Ok, that sound also like a good plan to me, to handle NAs in the extractor functions!
So, to conclude:

get_*() functions should always return a data.frame/tibble, also in the case of no or erroneous results?
"Data functions" can return lists of data.frame/tibbles or data.frame/tibbles. If errors occur, empty list()s should be returned?
- @stitam At least you updated the PR Alternatives for handling NA inputs #225 in commit 9f45297 like that.
- this might still be a point to discuss, since it would also be possible to return NA, which would maybe be easier to filter in lists, compared to list()...

## list() approach
l = list(iris[1:10, ],
         list(),
         iris[1:10, ])
# remove erroneous results
l[ !sapply(l, length) == 0 ]
l[ !lengths(l) == 0 ]

## NA approach
l2 = list(iris[1:10, ],
          NA,
          iris[1:10, ])
# remove erroneous results
l2[ !is.na(l2) ]

Sorry, to bring it up agian. What do you think?

Actually, l[ !lengths(l) == 0 ] is also a one liner. Yet not as concise as l2[ !is.na(l2) ] imho. However I'm fine with either.

This discussion definitely also belongs to #218

stitam · 2020-03-25T15:18:37Z

get_*(): good point. On one hand I think it is good practice to always return the same class, in this case data.frame/tibble. On the other hand, what to do if none of the queries can be resolved for some reason? Maybe this is more similar to an unavailable API, and they should return an error.

list(): Earlier in PR #225 I used an empty list but later I realised NA is used everywhere else in the package, and you are right, it is easier to handle. Let's stay with NA. I also added this to the contributors' guide. The question for get_*() applies here as well: if all queries return NA maybe the function should return an error instead.

andschar · 2020-03-25T15:45:54Z

get_*(): Generally I think this is a rare case. Hm, and I assume users would probably check the IDs before running the data queries, at which point they would recognize a data.frame/tibble full of NAs. So I would rather not implement it. If you think it's much handier, we can also implement it, since I think it's probably really rare. It would mean to add some extra code for every function.

list(): Oh, I didn't see it. Thanks for clarifying! Same opinion as above for no non-erroneous results.

Great conversation, btw!

Aariq · 2020-03-26T15:02:28Z

I see no reason for get_ to error if none of the queries have results. Just output a dataframe with the queries in one column and NAs in the ID column. Warn the user that some queries could not be resolved.

Aariq · 2020-05-12T01:28:37Z

The following functions still have an unexpected behavior when given NA:

cts_convert() (returns sodium)
get_etoxid (queries database and returns 0 x 3 tibble)
opsin_query (errors)
pc_synonyms (returns results for sodium)

Edit: I removed functions that errored when only given NA from this list.

stitam · 2020-05-12T19:12:09Z

Thanks @Aariq, just couldn't get there lately.

stitam · 2020-05-12T19:13:47Z

If you'd like to work on this issue, feel free to reassign it to yourself.

Closes #224.

stitam added the bug Unexpected problem or unintended behavior label Mar 17, 2020

stitam self-assigned this Mar 18, 2020

stitam mentioned this issue Mar 19, 2020

Alternatives for handling NA inputs #225

Merged

4 tasks

stitam mentioned this issue Mar 25, 2020

Empty Etox tables #227

Closed

stitam mentioned this issue Mar 26, 2020

fixed etox_basic #229

Merged

4 tasks

stitam mentioned this issue Mar 29, 2020

Update some ChemSpider functions to handle invalid input #231

Merged

4 tasks

Aariq self-assigned this May 13, 2020

Aariq mentioned this issue May 15, 2020

Fix tests that are currently skipped #255

Closed

3 tasks

stitam removed their assignment May 17, 2020

Aariq mentioned this issue May 24, 2020

handle NAs and other miscelaneous fixes #259

Merged

4 tasks

stitam closed this as completed in #259 May 25, 2020

stitam added a commit that referenced this issue May 25, 2020

Merge pull request #259 from Aariq/aariq-nas

e8b15f1

Closes #224.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Functions handle NAs in input incorrectly #224

Functions handle NAs in input incorrectly #224

stitam commented Mar 17, 2020 •

edited

Aariq commented Mar 17, 2020

stitam commented Mar 17, 2020

Aariq commented Mar 17, 2020

stitam commented Mar 18, 2020

andschar commented Mar 24, 2020

stitam commented Mar 24, 2020

andschar commented Mar 24, 2020 •

edited

stitam commented Mar 24, 2020

stitam commented Mar 25, 2020

andschar commented Mar 25, 2020 •

edited

stitam commented Mar 25, 2020 •

edited

andschar commented Mar 25, 2020

Aariq commented Mar 26, 2020

Aariq commented May 12, 2020 •

edited

stitam commented May 12, 2020 •

edited

stitam commented May 12, 2020

Functions handle NAs in input incorrectly #224

Functions handle NAs in input incorrectly #224

Comments

stitam commented Mar 17, 2020 • edited

Aariq commented Mar 17, 2020

stitam commented Mar 17, 2020

Aariq commented Mar 17, 2020

stitam commented Mar 18, 2020

andschar commented Mar 24, 2020

stitam commented Mar 24, 2020

andschar commented Mar 24, 2020 • edited

stitam commented Mar 24, 2020

stitam commented Mar 25, 2020

andschar commented Mar 25, 2020 • edited

stitam commented Mar 25, 2020 • edited

andschar commented Mar 25, 2020

Aariq commented Mar 26, 2020

Aariq commented May 12, 2020 • edited

stitam commented May 12, 2020 • edited

stitam commented May 12, 2020

stitam commented Mar 17, 2020 •

edited

andschar commented Mar 24, 2020 •

edited

andschar commented Mar 25, 2020 •

edited

stitam commented Mar 25, 2020 •

edited

Aariq commented May 12, 2020 •

edited

stitam commented May 12, 2020 •

edited