Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Functions handle NAs in input incorrectly #224

Closed
stitam opened this issue Mar 17, 2020 · 16 comments · Fixed by #259
Closed

Functions handle NAs in input incorrectly #224

stitam opened this issue Mar 17, 2020 · 16 comments · Fixed by #259
Assignees
Labels
bug Unexpected problem or unintended behavior

Comments

@stitam
Copy link
Contributor

stitam commented Mar 17, 2020

  • aw_query(NA) returns http error 404.
  • get_chebiid(NA) returns sodium containing compounds.
  • ci_query(NA) returns NA, but the output structure is different from a valid query.
  • get_csid(NA) and all other ChemSpider functions return client error: (400) bad request.

These are just the first few, I believe most functions in the package are affected. Since a lot of webchem functions are vectorised, I think it is important that these functions always return the same output structure even if the input is invalid. I think we have to find a robust solution and implement it systematically in all of our affected functions.

@stitam stitam added the bug Unexpected problem or unintended behavior label Mar 17, 2020
@Aariq
Copy link
Collaborator

Aariq commented Mar 17, 2020

I had a feeling this one was more pervasive. Something we should keep in mind for future contributions as well.

@stitam
Copy link
Contributor Author

stitam commented Mar 17, 2020

Thanks for bringing this up in PR #223! The issue came up in an example for the paper where get_cid() and get_csid() were affected. I cannot think of a better alternative than to edit each function one by one. What do you think? I will edit the contributing guide to remind the contributors of this issue.

@Aariq
Copy link
Collaborator

Aariq commented Mar 17, 2020

Just need to edit each function to have something like:

if(is.na(query){
  return(NA)
}

And add tests to make sure it still returns the expected data type (i.e. a tibble).

@stitam
Copy link
Contributor Author

stitam commented Mar 18, 2020

Yes, something like this. However, if a function returns e.g. a list of 8 elements, then it has to return the same structure when the input is NA.

@stitam stitam self-assigned this Mar 18, 2020
@andschar
Copy link
Contributor

I agree that the same structure should be returned, but not necessarily with the same number of elements. So if a function returns a data.frame with 4 columns, it's enough to return data.frame(NA) because data.table::rbindlist() has the fill = TRUE argument and dplyr::bind_rows() fills values automatically. Except for the case of only NA results.

l = list(iris[1:10, ], data.frame(Species = NA))
dplyr::bind_rows(l)
data.table::rbindlist(l, fill = TRUE)

So, a structure like you suggest in PR #225 in the chebi.R file is in my opinion not necessary. What do you think?

@stitam
Copy link
Contributor Author

stitam commented Mar 24, 2020

Great point! In case of get_*() functions, bind_rows() will work well with data.frame(NA), unless the first query is NA, but that can be solved by data.frame(chebiid = NA). I'll update PR #225.

I am unsure about the query functions like chebi_comp_entity(). These functions currently return a list of lists and the lowest level list elements are not necessarily data frames, wouldn't this complicate things?

@andschar
Copy link
Contributor

andschar commented Mar 24, 2020

Yes, it's a rather complex data structure that is returned from ChEBI (i.e. a ontology - a data graph) which is too difficult, or better an overkill to squash into one data.frame in my opinion. So I think there's no reason to bind it in general (be it for the user or us webchem devs). chebi_comp_entity() mostly returns a list of data.frames, but yes some list elements are again lists, such as the chem_structure (which doesn't fit in a data.frame).
And if an erroneous ID is supplied chebi_comp_entity() returns a data.frame(NA) object, which can be removed from a larger list e.g. by doing l[ sapply(l, function(x) !is.na(x[[1]][1])) ]. Now that I look at it, it looks rather complicated :D. I think it would even be better to just return a NA here (in the case of lists), because then a user could just do l[ !is.na(l) ].

I thought the discussion in #193 whether to uniform the output only concerns the get_*() functions? I fear that data objects from different data resources are too different to have one harmonized data.frame for all of them.

l = chebi_comp_entity(c('CHEBI:27744', 'whatever', 'CHEBI:17790'))
l2 = l[ sapply(l, function(x) !is.na(x[[1]][1])) ]

Generally it's not the easiest topic. Hm, maybe your approach, to have the empty_chebi_comp_entity() is the better option.

Actually I never really bothered much about erroneous ID inputs because my workflow in ChEBI looks as such:

  1. retrieve IDs with get_cebiid()
  2. Clean the IDs
  3. Fetch data with cleaned, non-erroneous IDs with chebi_comp_entity()

@stitam
Copy link
Contributor Author

stitam commented Mar 24, 2020

I expect that most querry functions like chebi_comp_entity() will return complex and unique data structures and I agree that it won't be feasible to coerce everything into a data frame. Removing results is fine, but I would prefer to have vectorised extractor functions like cas(). These functions will use something like sappy() and so I think they will need the same substructure within the lists. I agree that empty_chebi_comp_entity() doesn't look very nice. I'll think of a better way.

@stitam
Copy link
Contributor Author

stitam commented Mar 25, 2020

With the new PubChem PUG-View web service we can access a whole PubChem page as a JSON object. Depending on the compound, these objects might not have the same paragraphs, e.g. the acetic acid page contains pKa information but the sulfuric acid page doesn't, and so a simple pKa extractor function won't work if we don't handle the missing list element. With PubChem this will be very common, and this is completely natural. Also very similar to the issue with invalid inputs.

In our data flow we often have "input"->get()$id->query->extract->"output", and NA-s have to be handled on each arrow to allow vectorisation. In the get functions, we just have to add an NA line to the output data.frame for each invalid input and the query functions will work well. But for the query functions, enforcing a uniform output structure would be a huge job. Maybe missing elements should be handled within the extractor functions. So @Aariq, I think your suggestion earlier is great! If we add something like that to each extractor function, and customize it to comply with the function's output structure, then I think this will work everywhere.

Example:

query_results <- list(a=list(name="compound", info = data.frame(b=1,c=2)),d=list())

sapply(query_results, function(x){
    if (length(x) == 0) return(NA)
    x$info$b
})

@andschar
Copy link
Contributor

andschar commented Mar 25, 2020

Ok, that sound also like a good plan to me, to handle NAs in the extractor functions!
So, to conclude:

  • get_*() functions should always return a data.frame/tibble, also in the case of no or erroneous results?
  • "Data functions" can return lists of data.frame/tibbles or data.frame/tibbles. If errors occur, empty list()s should be returned?
## list() approach
l = list(iris[1:10, ],
         list(),
         iris[1:10, ])
# remove erroneous results
l[ !sapply(l, length) == 0 ]
l[ !lengths(l) == 0 ]

## NA approach
l2 = list(iris[1:10, ],
          NA,
          iris[1:10, ])
# remove erroneous results
l2[ !is.na(l2) ]

Sorry, to bring it up agian. What do you think?

Actually, l[ !lengths(l) == 0 ] is also a one liner. Yet not as concise as l2[ !is.na(l2) ] imho. However I'm fine with either.

This discussion definitely also belongs to #218

@stitam
Copy link
Contributor Author

stitam commented Mar 25, 2020

get_*(): good point. On one hand I think it is good practice to always return the same class, in this case data.frame/tibble. On the other hand, what to do if none of the queries can be resolved for some reason? Maybe this is more similar to an unavailable API, and they should return an error.

list(): Earlier in PR #225 I used an empty list but later I realised NA is used everywhere else in the package, and you are right, it is easier to handle. Let's stay with NA. I also added this to the contributors' guide. The question for get_*() applies here as well: if all queries return NA maybe the function should return an error instead.

@andschar
Copy link
Contributor

get_*(): Generally I think this is a rare case. Hm, and I assume users would probably check the IDs before running the data queries, at which point they would recognize a data.frame/tibble full of NAs. So I would rather not implement it. If you think it's much handier, we can also implement it, since I think it's probably really rare. It would mean to add some extra code for every function.

list(): Oh, I didn't see it. Thanks for clarifying! Same opinion as above for no non-erroneous results.

Great conversation, btw!

@stitam stitam mentioned this issue Mar 26, 2020
4 tasks
@Aariq
Copy link
Collaborator

Aariq commented Mar 26, 2020

I see no reason for get_ to error if none of the queries have results. Just output a dataframe with the queries in one column and NAs in the ID column. Warn the user that some queries could not be resolved.

@Aariq
Copy link
Collaborator

Aariq commented May 12, 2020

The following functions still have an unexpected behavior when given NA:

  • cts_convert() (returns sodium)
  • get_etoxid (queries database and returns 0 x 3 tibble)
  • opsin_query (errors)
  • pc_synonyms (returns results for sodium)

Edit: I removed functions that errored when only given NA from this list.

@stitam
Copy link
Contributor Author

stitam commented May 12, 2020

Thanks @Aariq, just couldn't get there lately.

@stitam
Copy link
Contributor Author

stitam commented May 12, 2020

If you'd like to work on this issue, feel free to reassign it to yourself.

@Aariq Aariq self-assigned this May 13, 2020
@stitam stitam removed their assignment May 17, 2020
stitam added a commit that referenced this issue May 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants