Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardizing what is returned by functions #218

Open
Aariq opened this issue Feb 26, 2020 · 17 comments
Open

Standardizing what is returned by functions #218

Aariq opened this issue Feb 26, 2020 · 17 comments
Labels
consistent api Uniformity across functions and data sources

Comments

@Aariq
Copy link
Collaborator

Aariq commented Feb 26, 2020

There is a huge amount of variation in what object type and structure is returned by get_* and database querying functions in webchem. Some of this variation might be necessary (e.g. because the data contained in a database is varied), but I think it would be ideal for functions to return data.frames whenever possible.

Here's a quick and dirty overview of what formats are currently returned by functions in webchem:

Get ID functions

  • get_chebiid: list of dataframes (one per query)
  • get_cid: list of vectors
  • get_csid: dataframe (one row per query?)
  • get_etoxid: dataframe with column for query.
  • get_wdid: with match = "all", list of a single vector with attribute. For match = "first", a data frame

Query database functions

  • aw_query: nested list (one list per query)
  • chebi_comp_entity: complex nested list (each compound has a nested list with some elements being dataframes and others being lists of vectors)
  • chebi_lite_entity: list of dataframes (even with a single query)
  • ci_query: complex nested list
  • cs_compinfo: a dataframe with one row per query
  • cts_compinfo: complex list
  • etox_basic: complex list
  • etox_targets, etox_tests: list with a dataframe and a source URL(character vector) nested under each query
  • fn_percept: named vector (one element per query)
  • nist_ri: dataframe with column for query
  • pan_query: nested list. For each query, the list seems like all length 1 vectors. If that's always true, there's no reason for this not to be a data frame.
  • pc_prop: data.frame
  • pc_synonyms: list of character vectors (one vector per query) unless choices != NULL, then a dataframe with queries as row names
  • srs_query: list of dataframes (one data frame per query) which includes list columns

Ideas

Something that might help us think through this is if someone came up with a "homework" problem that needs to be solved with multiple data sources and ID translations. Then we can each share our solutions (e.g. as vignettes in our own forks of webchem) and do some thinking on what would result in the least friction for the most users. The results of this could be a vignette for our package and/or supplemental for the webchem publication.

Whenever possible, I think we should:

  1. keep output consistent regardless of arguments or number of queries
  2. prioritize data frames tibbles whenever possible (except maybe functions that always return one element per query, which might be more usable as named vectors).
@Aariq Aariq added the consistent api Uniformity across functions and data sources label Feb 26, 2020
@stitam
Copy link
Contributor

stitam commented Feb 27, 2020

Thanks @Aariq for separating this topic! Prioritizing data frames is a good idea! Regarding the get_ functions, you convinced me @andschar , I agree with the list of data frames instead of a single data frame. Also I think it would be great to force the same columns and give the get_ outputs a common class. As for the query functions, each database returns different data so I presume we won't be able to define a rigid structure there, but again a list of data frames sounds easy to handle I think. And in case of multiple ids, a nested list of data frames.

@Aariq
Copy link
Collaborator Author

Aariq commented Feb 27, 2020

In cases when the data is just one dataframe per query, why not just one tidy dataframe with a query column? Wouldn't that be easier to work with?

@stitam
Copy link
Contributor

stitam commented Feb 27, 2020

So you are suggesting that e.g. the get_() functions should output a single data frame instead of a list of data frames? I don't have a strong opinion towards either. If a get_() function returns a list of data frames, it is somewhat easier to visualise if there are multiple hits for a query as @andchar mentioned, also post processing can be performed easily by lapply() and the list of data frames can be easily combined into a single data frame when needed. However, after get_() I usually use a database specific query function, where I need a vector of ids to search for. With a list fo data frames I would have to add an aditional line of code to coerce the list of data frames into a single data frame before I use the ids for the query, which is not very convenient. If get_() returns a single data frame, I can just reference the id column and I'm done. On the other hand, if I want to do post processing with the singe data frame get_() output, I might have to split the data frame, which is again another line of code.

@Aariq
Copy link
Collaborator Author

Aariq commented Feb 27, 2020

I don't think we should force users to learn lapply() or the purrr package to use webchem.

For get_* functions, I'd prefer named vectors, but I realize that some of them return additional information, so tidy data frames would be consistent and easy to convert to vectors like you mentioned.

I'd prefer single tidy dataframes for all query functions. I haven't delved into all of the query functions deep enough to know if that makes sense for all of them (e.g. the functions that return complex nested lists), but many of the structures listed above can be easily converted to tidy dataframes.

I think if you need *apply() functions to make the object returned usable, then webchem probably needs to be doing more so the user can do less. In those cases maybe there needs to be multiple functions, or arguments that specify what type of data to get, or additional helper extractor functions to deal with the complex lists. I think possibly even data frames with list-columns like what srs_query() returns might be easier for a beginner/intermediate R user to figure out than complex nested lists.

@stitam
Copy link
Contributor

stitam commented Feb 28, 2020

For the get_* functions I'd prefer tidy data frames over named vectors because I think the extra columns help us identify query issues, e.g when we don't get an exact match or we get multiple results. Regarding single data frame output vs list of data frames, a list of dfs can be combined easily e.g. with dplyr::bind_rows(), while a single df can be split easily with split() so I think we just have to pick one and stick to it. Maybe a single data frame is more user friendly than a list of data frames.

A related question is what we do with chooser(). In a previous discussion I suggested that the get_* functions could always return all results to be easy on the servers, and then the user could plug those results into a separate function (maybe choose_id() would be a better name for it in this workflow) to reduce the number of results to one per query. True, the user could do this with their own data manipulation tools, but I would personally prefer to have a convenience function with some standard filtering options like first, best, ask, na.

@andschar
Copy link
Contributor

Thanks @Aariq for the nice overview. I totally agree to keep the output of the get_*-functions consistent and I also prefer data.frame(s) over (named) vectors. Whether it should be one data.frame with all the results or a list of data.frame's I'm not completely sure anymore. As I mentioned in #193 a list of data.frame's could be easily bound, e.g. with data.table::rbindlist(l), dplyr::bind_rows(l) or base::do.call(rbind, l), but true, we could do this for the users as well. Hm, @stitam you are right, we should probably just pick one style. If we do so I'd like to highlight that it's important to also include query strings with no results in the data.frame.

@stitam
Copy link
Contributor

stitam commented Mar 1, 2020

I agree with min one row for each query string, even if a query string returns no results. I'd also prefer single tidy dataframe tibble instead of a list of data frames. Also, I think the get_* functions should return "all" results, and the results should be filtered by a separate function to return one result per query string. What do you think about this?

@andschar
Copy link
Contributor

andschar commented Mar 1, 2020

Yes I agree, to treat the APIs with care and it's probably better to query "all" results and refine them with a separate function locally. Moreover, such a function should be easy to write once all get_* functions return the same object? Hm, however, as you mentioned earlier @stitam we should think about whether the output should have at least some equal columns. E.g. there should be something like a 'score' column that looks the same for every API output to choose the "best" entry. Or, is it more clever to define not only a "get_*" S3 class, but also an API-specific S3 class according to the respective outputs. Something like class(out) <- c('get', 'chebi')?

@Aariq
Copy link
Collaborator Author

Aariq commented Mar 6, 2020

I think we agree that query functions should return tidy tibbles whenever easily possible.

So I think next steps for this issue are to identify functions that can easily return tidy tibbles that don't already, and create an issue or multiple issues that contributors can take on. For the functions that return more complex objects, we might want to put those on the back burner for now and hopefully gain some inspiration as we tidy up outputs from other functions.

Also, a lot of these things will be breaking changes for existing code if people update webchem, so we should take into consideration how to handle that. Honestly I'm not sure the best way to deal with that, but changes to data structure output by functions should at the very least be well documented in NEWS.md.

@stitam
Copy link
Contributor

stitam commented Mar 16, 2020

I think it is safe to start with get_ functions. Most of them already return data frames and tibbles are also data frames so there would be no CI issue. Once we gain inspiration for the rest, we can add an argument that controls whether to output the original structure or a tibble, then in the next release we can set that argument to default to tibble, and then in a subsequent release we can remove the old structure if we feel like it. What do you think?

@Aariq
Copy link
Collaborator Author

Aariq commented Mar 17, 2020

@stitam This sounds good to me. So any changes to the output structure of functions should be accompanied by a warning for now (I think dplyr is a good package to look for inspiration. It has changed a lot and done a good job at reminding users to update their code). Users can get the old behavior with something like output = c("tibble", "old").

@andschar
Copy link
Contributor

andschar commented Sep 29, 2020

To continue our discussion to make the package more consistent, some thoughts here on the initial list at the top by @Aariq:

Get ID functions

I think the get_*() functions are quite finished yet. They all return tibbles, no matter how many strings are queried and they all have a query column.

query = '1071-83-6'
webchem::get_chebiid(query) # OK
webchem::get_cid(query) # OK
webchem::get_csid(query, apikey = apikey) # (almost) OK
webchem::get_etoxid(query, from = 'cas') # OK
webchem::get_wdid(query) # OK

@stitam I have found that verbose = TRUE doesn't return a message for get_csid(). On purpose?

Query database functions

I think we haven't really agreed on something here. In my opinion it would be best to return a list of tibbles or lists (in case it's not a good idea to turn the data into a tibble. As described in #295, we should also discuss how to handle NA returns. Should we return an one-length NA vector or also a tibble (e.g. tibble(query = query)). I prefer the latter, because the data type would be consistent. So, we'd have only list of tibbles or (partly) lists and not lists of tibbles, NAs or (partly) lists.

Object Orientation

Some webchem functions, mostly older ones still have class() attributes assigned. Should we use this or discard OO completely? See also #295

PS: Once we have agreed on something, I would like to implement #289 and #295

@stitam
Copy link
Contributor

stitam commented Sep 29, 2020

"I have found that verbose = TRUE doesn't return a message for get_csid(). On purpose?" No, when I joined webchem I started with the chemspider functions and I didn't see what verbose is for and I may have accidentally removed it. But I suggest to keep it that way because I have a major edit for chemspider I just didn't want to open a PR before v1.1.

@stitam
Copy link
Contributor

stitam commented Sep 29, 2020

Regarding functions that actually return chemical data: I don't like the idea of returning a single tibble, but I do like the idea of returning a list of tibbles. The most atomised solution would be a tibble for each property. The value here could be that each property needs different columns, e.g. molecular weight may only have a value, dimension and reference, but e.g. nist_ri() returns a very different set of columns. If this is harmonised for each property, that would help us build powerful extractors later on as well.

@andschar
Copy link
Contributor

Regarding functions that actually return chemical data: I don't like the idea of returning a single tibble

I also meant that in my comment. Sorry for the confusion: A list of tibbles() or sub-lists (in case data isn't made for tables (e.g. MOL-file structure). As I said for no match, a simple tibble(query = query) is fine.

Not sure about a tibble for each property. chebi_comp_entity(1) has some properties clustered together, and others as single tibble/data.frame s. Thinking about extractor-functions, we could also use S3 methods again.

@Aariq
Copy link
Collaborator Author

Aariq commented Sep 30, 2020

A tibble for each property doesn't make sense---that might as well be a named vector. Whenever data is rectangular, I think a tibble should be returned as that is easier to work with than nested lists. If all the data is rectangular, I'm ambivalent about whether multiple queries should return a list of tibbles or a single tibble with a query column. Personally, I'd probably use a single tibble most often. On the other hand, not all query functions return rectangular data, so a list of tibbles is probably more consistent across the package. Plus bind_rows(out_list) is easy to do

@andschar
Copy link
Contributor

andschar commented Oct 1, 2020

I agree @Aariq , as long as one can do bind_rows() or rbindlist() it's fine.

Generally we don't have to update the functions immediately and maybe it's good to talk again before we change anything that big.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
consistent api Uniformity across functions and data sources
Projects
None yet
Development

No branches or pull requests

3 participants