
submission: healthdatacsv #358

Closed
9 of 29 tasks
iecastro opened this issue Jan 11, 2020 · 28 comments
Comments

@iecastro

iecastro commented Jan 11, 2020

Submitting Author Name: Ivan Castro
Submitting Author Github Handle: @iecastro
Repository: https://github.com/iecastro/healthdatacsv
Version submitted: 0.0.1
Editor: @karthik
Reviewers: @czeildi, @richfitz

Due date for @czeildi: 2020-03-19

Due date for @richfitz: 2020-08-25
Archive: TBD
Version accepted: TBD


  • Paste the full DESCRIPTION file inside a code block below:
Package: healthdatacsv
Title: Access data in the healthdata.gov catalog
Version: 0.0.1
Authors@R: 
    person(given = "Ivan",
           family = "Castro",
           role = c("aut", "cre"),
           email = "ivedcastr@gmail.com")
Description: This package will make an API call to the
    data.json endpoint of HealthData.gov and read data into R, if available in
    csv format. Some helper functions are available to help users query by
    listed Agency and/or keywords.
License: MIT + file LICENSE
URL: https://github.com/iecastro/healthdatacsv
BugReports: https://github.com/iecastro/healthdatacsv/issues
Imports: 
    dplyr,
    httr,
    jsonlite,
    magrittr,
    purrr,
    rlang,
    stringr,
    tibble,
    tidyr,
    vroom
Suggests: 
    readr,
    testthat (>= 2.1.0),
    covr
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.1.1

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • database access
    • data munging
    • data deposition
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • database software bindings
    • geospatial data
    • text analysis
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):
    healthdatacsv falls under data retrieval because it accesses and ingests data from an online source. This package focuses on health-related data made available from the HealthData.gov API.

  • Who is the target audience and what are scientific applications of this package?
    The target audience includes researchers, instructors, students, and analysts in health-related fields (e.g., Epidemiology, Health Economics, Health Services Research). The data accessed are collected by agencies of the US Dept. of Health and Human Services and are often nationally representative. Scientific applications include, among others, the investigation of disease patterns, trends in health behaviors, and deficits in health systems.

  • Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
    There is rHealthDataGov; however, this package has not been updated and does not function properly with the current HealthData.gov API.

  • If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
    presubmission: healthdatacsv #350

Technical checks

Confirm each of the following by checking the box. This package:

Publication options

JOSS Options
  • The package has an obvious research application according to JOSS's definition.
    • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)
MEE Options
  • The package is novel and will be of interest to the broad readership of the journal.
  • The manuscript describing the package is no longer than 3000 words.
  • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
  • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
  • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
  • (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

@iecastro
Author

Hi all: did a small update to include a COC, and a basic use vignette.

@lmullen
Member

lmullen commented Jan 24, 2020

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly

Editor comments

Thanks for your submission. I have requests out to reviewers, and will let you know the reviewers and the review date when enough have agreed.

In the meantime, could you please address these issues?

  • The repository uses Codecov, but automated code coverage through Travis CI is not currently working.
  • A few mis-spellings: healdata
  • Can you add CI for Windows with Appveyor? Try the usethis package.

Reviewers:
Due date:

@lmullen
Member

lmullen commented Jan 24, 2020

One other thing: while the package is undergoing peer review, could you please add the peer review badge? You can do this by running rodeo::use_review_badge(358).

@iecastro
Author

Thank you @lmullen - I've started addressing your comments.

@iecastro
Author

iecastro commented Feb 8, 2020

Editor comments

Thanks for your submission. I have requests out to reviewers, and will let you know the reviewers and the review date when enough have agreed.

In the meantime, could you please address these issues?

  • The repository uses codecov but the automated code coverage through Travis CI functionality is not currently working.
  • A few mis-spellings: healdata
    - I couldn't find these, but will keep in mind
  • Can you add CI for Windows with Appveyor? Try the usethis package.

@lmullen
Member

lmullen commented Feb 26, 2020

Thanks to our reviewers for agreeing to take on the peer review of this package.

Reviewer: @czeildi
Reviewer: @cksun-usc
Due date: 2020-03-19

You can find details about how to do the review in the reviewers' guide.

@czeildi

czeildi commented Mar 7, 2020

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
    • in fetch_csv(), prefer dplyr::slice() to slice() so it runs without library(dplyr)
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 6

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

In general, I found the package easy to learn and use, and the code is readable and simple. Below are my comments in three categories. Most of them are non-blocking.

usability

  • All function calls emit the message "No encoding supplied: defaulting to UTF-8." Consider muting this by passing encoding = "UTF-8" to httr::content(response, "text") in healthdata_api().
  • fetch_csv() throws a somewhat confusing error when called with a catalog where csv_available is FALSE. I would add an explicit test and error message for this, and also highlight it in the function documentation. (Also consider the case of multiple catalogs, where some of them have csv_available while others do not.)
  • The data_viewer argument in get_keywords() seems arbitrary to me: why here and not in other functions? It makes the code more complicated for little gain (the user has to provide an argument instead of, for example, piping with %T>% into View()). I would suggest removing this argument.
  • 3 different verbs are used for similar things: getting results from the API (get, fetch, list). Is there a logic that makes this easier to remember? I would use the same verb for keywords and agencies, as they are both helpers, so the user knows what arguments they can call fetch_catalog() with.
  • consider renaming fetch_csv(): it does not return csv files so it was a bit confusing for me
  • in list_agencies the namecheck argument is not really intuitive to me; possibly use name_match instead, or something suggesting regex matching?
  • Besides fetch_csv(), all functions (with any arguments) call healthdata_api("data.json"), which takes significant time. This is a cost for the user, and it also makes tests and vignette compilation quite slow. Consider caching the result of the API call, or reorganize the functions in a way that lets the user easily re-use the API result. I believe the memoise package can be used for this, but I do not have experience with it; it might not be worth it in this case.
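The caching idea in the last bullet could be sketched without any new dependency by storing the parsed response in a package-level environment. This is a minimal illustration, not the package's code: `healthdata_api` below is a stand-in stub (the real internal function performs the HTTP request).

```r
# Minimal sketch of caching the data.json response so repeated helper
# calls reuse one download. `healthdata_api` is a STAND-IN STUB here,
# not the package's implementation (which performs the real HTTP GET).
healthdata_api <- function(endpoint) {
  list(endpoint = endpoint, fetched_at = Sys.time())  # stub payload
}

.hd_cache <- new.env(parent = emptyenv())

get_data_json <- function() {
  if (is.null(.hd_cache$data_json)) {
    # the (slow) API call runs only on the first invocation
    .hd_cache$data_json <- healthdata_api("data.json")
  }
  .hd_cache$data_json
}

first  <- get_data_json()
second <- get_data_json()
identical(first, second)  # TRUE: the second call is served from the cache
```

With the memoise package this collapses to one line, `healthdata_api_cached <- memoise::memoise(healthdata_api)`, at the cost of an extra dependency.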

code style

  • Use consistent indentation and spacing (You can do this automatically with styler::style_pkg().)
  • consider renaming publisher.name to publisher_name to follow snake case in results as well
  • In healthdata_api, if there is an error you cannot necessarily parse the response, so I believe the error message should not rely on the parsed response in all cases (see healthdata_api("data.jsn") as a forced example)
  • add a new test for the case of error from the API
  • There are repeated parts in fetch_catalog(), list_agencies() and get_keywords() which could be simplified and thus easier to understand and maintain. E.g. in get_keywords():
keywords <- jsonlite::flatten(parsed_dataset) %>%
  as_tibble() %>%
  select(
    .data$publisher.name,
    .data$keyword
  ) %>%
  tidyr::unnest(cols = c(.data$keyword)) %>%
  distinct()
if (!is.null(agency)) {
  keywords <- keywords %>%
    filter(.data$publisher.name == agency)
}
  • it is good practice to give tests unique descriptions so that when there is a failure, it is easier to locate based on the description. (In this case there are only a few tests, so it is OK.) Nonetheless, consider a unique description for every test case (you can achieve this by changing the descriptions or by grouping similar expectations together in one test case)
  • test files: it is not necessary to include library() call at the top of the test files (I am not sure whether it is problematic, but it is not standard afaik)
  • there is no need for explicit return() calls, consider removing them
  • in example in fetch_csv(), prefer dplyr::slice() to slice() so it runs without library(dplyr)

documentation

  • add Roxygen: list(markdown = TRUE) to DESCRIPTION so that markdown formatting works in roxygen2 documentation
  • use @noRd instead of @keyword internal for documenting internal functions. Reference 1, Reference 2
  • consider adding codemetar: https://devguide.ropensci.org/building.html#package-name-and-metadata
  • The readme is a bit long for me relative to the number of functionalities. Consider using head() on results. (Note that I am also completely fine with the current readme)
  • Add a repostatus.org badge to readme as per guideline in https://devguide.ropensci.org/building.html#readme
  • Fix the link to tibble in the @return part of the function documentation: use [`tibble`][tibble::tibble()] or similar. (I checked in the pkgdown documentation but did not find a way to link to the tibble-package documentation without hard-coding URLs)
  • Separate titles and descriptions in the package documentation (insert a blank line or add a separate title) so that the title is short, especially in fetch_catalog()
  • use consistent code style (indentation) in readme, vignette as well, e.g.
fetch_catalog("Centers for Disease Control and Prevention",
                               keyword = "built environment")
  • There is a sentence in the readme: "healthdatacsv is pipe friendly, so it’s easy to integrate into a tidy workflow." It is a bit out of context for me: the claim is not really supported; the single-command pipeline with dplyr::pull is not a strong argument, and all functions in the package have different inputs and outputs. (This is not to say that the package is pipe-unfriendly!) The fetch_catalog(...) %>% fetch_csv() pipe supports this sentence better.
  • indicate all internal functions with function_name() in readme and vignette for nice linking in pkgdown website
  • in function documentations: note NULL, TRUE as code (by enclosing with ``), healthdata.gov as link, link to other functions by enclosing in ``

@iecastro
Author

iecastro commented Mar 9, 2020

Thanks @czeildi! I appreciate the feedback and will start addressing individual items soon.

@noamross
Contributor

⚠️⚠️⚠️⚠️⚠️

In the interest of reducing load on reviewers and editors as we manage the COVID-19 crisis, rOpenSci is temporarily pausing new submissions for software peer review for 30 days (and possibly longer). Please check back here again after 17 April for updates.

In this period new submissions will not be handled, nor new reviewers assigned. Reviews and responses to reviews will be handled on a 'best effort' basis, but no follow-up reminders will be sent.

Other rOpenSci community activities continue. We express our continued great appreciation for the work of our authors and reviewers. Stay healthy and take care of one another.

The rOpenSci Editorial Board

⚠️⚠️⚠️⚠️⚠️

@annakrystalli
Contributor

⚠️⚠️⚠️⚠️⚠️

In the interest of reducing load on reviewers and editors as we manage the COVID-19 crisis, rOpenSci new submissions for software peer review are paused.

In this period new submissions will not be handled, nor new reviewers assigned. Reviews and responses to reviews will be handled on a 'best effort' basis, but no follow-up reminders will be sent. Other rOpenSci community activities continue.

Please check back here again after 25 May when we will be announcing plans to slowly start back up.

We express our continued great appreciation for the work of our authors and reviewers. Stay healthy and take care of one another.

The rOpenSci Editorial Board

⚠️⚠️⚠️⚠️⚠️

@lmullen
Member

lmullen commented Jun 30, 2020

Hi, @cksun-usc. Obviously a lot happened in the world since you agreed to take on this review. Is this something that you would be able to pick up at some point? If you do not wish to go forward with the review, that's more than fine; just let me know.

@karthik karthik self-assigned this Jul 13, 2020
@karthik
Member

karthik commented Jul 13, 2020

👋 @iecastro I'll be taking over as editor for this submission and will follow up on what's needed to move this along. Thank you for your patience.

@iecastro
Author

Hi @czeildi - hope all is well! I've addressed your comments below. Thank you for your time.

Review Comments

In general, I found the package easy to learn and use, and the code is readable and simple. Below are my comments in three categories. Most of them are non-blocking.

usability

  • All function calls emit the message "No encoding supplied: defaulting to UTF-8." Consider muting this by passing encoding = "UTF-8" to httr::content(response, "text") in healthdata_api().
  • fetch_csv() throws a somewhat confusing error when called with a catalog where csv_available is FALSE. I would add an explicit test and error message for this, and also highlight it in the function documentation. (Also consider the case of multiple catalogs, where some of them have csv_available while others do not.)
  • The data_viewer argument in get_keywords() seems arbitrary to me: why here and not in other functions? It makes the code more complicated for little gain (the user has to provide an argument instead of, for example, piping with %T>% into View()). I would suggest removing this argument.
  • 3 different verbs are used for similar things: getting results from the API (get, fetch, list). Is there a logic that makes this easier to remember? I would use the same verb for keywords and agencies, as they are both helpers, so the user knows what arguments they can call fetch_catalog() with.
  • keywords and agencies now have the same verb get_
  • consider renaming fetch_csv(): it does not return csv files so it was a bit confusing for me
  • renamed fetch_data()
  • in list_agencies the namecheck argument is not really intuitive to me; possibly use name_match instead, or something suggesting regex matching?
  • argument renamed agency
  • Besides fetch_csv(), all functions (with any arguments) call healthdata_api("data.json"), which takes significant time. This is a cost for the user, and it also makes tests and vignette compilation quite slow. Consider caching the result of the API call, or reorganize the functions in a way that lets the user easily re-use the API result. I believe the memoise package can be used for this, but I do not have experience with it; it might not be worth it in this case.

code style

  • Use consistent indentation and spacing (You can do this automatically with styler::style_pkg().)
  • consider renaming publisher.name to publisher_name to follow snake case in results as well
  • renamed to publisher
  • In healthdata_api if there is an error, you cannot necessarily parse so I believe the error message should not rely on the parsed response in all cases (see healthdata_api("data.jsn") as a forced example)
  • healthdata_api is an internal function, so the user does not have a chance of misspelling the endpoint url. In theory, there will always be an API call made, and any errors would be parsed. But also, not much experience on this, on my part.
  • add a new test for the case of error from the API
  • There are repeated parts in fetch_catalog(), list_agencies() and get_keywords() which could be simplified and thus easier to understand and maintain. E.g. in get_keywords():
  • thank you!
keywords <- jsonlite::flatten(parsed_dataset) %>%
  as_tibble() %>%
  select(
    .data$publisher.name,
    .data$keyword
  ) %>%
  tidyr::unnest(cols = c(.data$keyword)) %>%
  distinct()
if (!is.null(agency)) {
  keywords <- keywords %>%
    filter(.data$publisher.name == agency)
}
  • it is good practice to give tests unique descriptions so that if there is a failure, it is easier to find based on description. (In this case there are only a few tests so it is ok). Nonetheless, consider a unique description for all test cases (you can achieve this by changing the descriptions or grouping similar expectations together in one test case)
  • test files: it is not necessary to include library() call at the top of the test files (I am not sure whether it is problematic, but it is not standard afaik)
  • there is no need for explicit return() calls, consider removing them
  • in example in fetch_csv(), prefer dplyr::slice() to slice() so it runs without library(dplyr)

documentation

  • Pending. Didn't realize I hadn't looked into this.
  • The readme is a bit long to me compared to the number of functionalities. Consider using head on results. (Note that I am also completely fine with the current readme)
  • Did not use head(), but I did revise the readme several times, cutting down the length (I think - it's been a while)
  • Add a repostatus.org badge to readme as per guideline in https://devguide.ropensci.org/building.html#readme
  • Fix the link to tibble in the @return part of the function documentation: use [`tibble`][tibble::tibble()] or similar. (I checked in the pkgdown documentation but did not find a way to link to the tibble-package documentation without hard-coding URLs)
  • Separate titles and descriptions in the package documentation (insert a blank line or add a separate title) so that the title is short, especially in fetch_catalog()
  • use consistent code style (indentation) in readme, vignette as well, e.g.
fetch_catalog("Centers for Disease Control and Prevention",
                               keyword = "built environment")
  • There is a sentence in the readme: "healthdatacsv is pipe friendly, so it’s easy to integrate into a tidy workflow." It is a bit out of context for me: the claim is not really supported; the single-command pipeline with dplyr::pull is not a strong argument, and all functions in the package have different inputs and outputs. (This is not to say that the package is pipe-unfriendly!) The fetch_catalog(...) %>% fetch_csv() pipe supports this sentence better.
  • indicate all internal functions with function_name() in readme and vignette for nice linking in pkgdown website
  • in function documentations: note NULL, TRUE as code (by enclosing with ``), healthdata.gov as link, link to other functions by enclosing in ``

@iecastro
Author

👋 @iecastro I'll be taking over as editor for this submission and will follow up on what's needed to move this along. Thank you for your patience.

No worries @karthik, thank you - there's no rush on my part

@czeildi

czeildi commented Jul 26, 2020

@iecastro I reviewed your changes, looks great! I happily recommend approving this package.

I noticed one minor thing: you should no longer import View as you no longer use it: #' @importFrom utils View line should be deleted from healthdatacsv-package.R

@karthik
Member

karthik commented Jul 27, 2020

Thanks very much for the review and acceptance recommendation @czeildi

The package looks to be in good shape to me as well. Since the second reviewer has gone missing, I'll try to get someone else to do a quick review and move this to acceptance.

@karthik
Member

karthik commented Jul 27, 2020

👋 @dbuijs Would you be able to help us review this package? One reviewer has already completed a thorough review and recommended acceptance. You'll need to do a sanity check and make sure everything looks ok to you. There is a checklist to work through. Thanks for considering. 🙏

@iecastro
Author

Thanks, all! @czeildi @karthik

@karthik
Member

karthik commented Jul 29, 2020

@iecastro I'll keep looking for a second reviewer but if you have any suggestions for potential reviewers who don't have a conflict of interest, I am all ears.

@karthik
Member

karthik commented Aug 4, 2020

Adding @richfitz as second reviewer. Thanks for volunteering Rich! Much appreciated.

@iecastro
Author

iecastro commented Aug 5, 2020

Thank you! @karthik @richfitz

@richfitz
Member

richfitz commented Aug 21, 2020

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • Briefly describe any working relationship you have (had) with the package authors.
  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally (failed due to API error)
  • Function Documentation: for all exported functions in R help (too minimal; see below)
  • Examples for all exported functions in R Help that run successfully locally (failed due to API error)
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • [NA] Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine. (see below)
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Looked at 56bfd39 (Mon Jul 27 07:47:25 2020 -0400).

I've not been able to complete the review as the API is down, but the comments below should be addressed regardless. The three big issues are:

  • The API is down - that's fine but the package should be able to cope with this gracefully.
  • The testing is incomplete.
  • The documentation is incomplete.

Note that none of my comments really concern the implementation, which looks fine (modulo comments about dependencies, but that's a matter of taste).

Problematic use of flaky API

> devtools::test()
Loading healthdatacsv
Testing healthdatacsv
✔ |  OK F W S | Context
✖ |   0 1     | fetch_catalog [0.5 s]
────────────────────────────────────────────────────────────────────────────────
test-fetch_catalog.R:3: error: output is tibble
SSL certificate problem: certificate has expired
Backtrace:
  1. testthat::expect_is(fetch_catalog(), "tbl_df") tests/testthat/test-fetch_catalog.R:3:2
  4. healthdatacsv::fetch_catalog()
  5. healthdatacsv::healthdata_api("data.json") R/healthdata-api.R:31:2
  6. httpcache::GET(data_url, user) R/healthdata-api.R:145:2
  7. httr::GET(url, ...)
  8. httr:::request_perform(req, hu$handle$handle)
 10. httr:::request_fetch.write_memory(req$output, req$url, handle)
 11. curl::curl_fetch_memory(url, handle = handle)

All tests that use web requests should be opt-in, especially if the package will go to CRAN. Add testthat::skip_on_cran() to these test blocks:

test-fetch_catalog.R:3
test-get_agencies.R:3
test-get_agencies.R:8
test-get_keywords.R:3
test-get_keywords.R:8
test-get_keywords.R:14
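Applied to the first of those blocks, the gating might look like the following sketch. `skip_if_offline()` is an optional extra (it requires testthat >= 2.2.0 with the curl package); the expectation itself is unchanged from the package's existing test.

```r
# Sketch: make a network-dependent test opt-in outside CRAN.
test_that("output is tibble", {
  testthat::skip_on_cran()     # never hit the network on CRAN
  testthat::skip_if_offline()  # skip gracefully when there is no connection
  expect_is(fetch_catalog(), "tbl_df")
})
```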

Similar, but harder, comments apply to the vignette if this is going to CRAN - I do not recommend trying to run a vignette there that uses the internet or an external service. All the alternative options are unpleasant though:

  • ask them not to run the vignette
  • precompile the vignette and include one that does not use the internet (numerous hacky ways of doing this)
  • include the vignette as a pkgdown article only (users then can't read it from R or offline)
  • mock out the api calls in the vignette (hard to maintain, ugly to write, you'll have to store the data in the package)

Testing issues

The testing is minimal to the point that it may not be worth having; for example, in test-fetch_catalog.R:

test_that("output is tibble", {
  expect_is(fetch_catalog(), "tbl_df")
})

However, looking at fetch_catalog(), there is significant logic there that is never tested: we only check that we get a tbl_df out, which is unavoidable if the function does not error. So this test could be replaced by checking that the function doesn't error (this is not the solution though!).

Instead, split these functions into network-using and network-free components. You're basically there already, and

fetch_catalog <- function(agency = NULL,
                          keyword = NULL) {

  api_call <- healthdata_api("data.json")
  fetch_catalogue_process(api_call, agency, keyword)
}

and focus the bulk of the testing on the second half. What happens if agency is not found in the data? What if the user provides a table rather than a string? Are multiple strings handled gracefully, or is an error thrown, etc.?

If you split the code at this point you don't even need to bother formally mocking the api call, you can just keep a "reasonable" data set in the package for testing with.
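Under that split, tests for the processing half could run entirely against packaged data. In this sketch, `fetch_catalog_process()` and `catalog_fixture` are hypothetical names for the pure function and the stored test data set, and returning an empty tibble for an unknown agency is an assumed design choice (erroring would be equally valid):

```r
# Sketch: test the network-free half against a data set shipped with the
# package. Names and the zero-row behavior are assumptions, not the
# package's actual API.
test_that("unknown agency yields an empty catalog", {
  out <- fetch_catalog_process(catalog_fixture,
                               agency  = "No Such Agency",
                               keyword = NULL)
  expect_s3_class(out, "tbl_df")  # still a tibble...
  expect_equal(nrow(out), 0)      # ...but with zero rows
})
```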

Similar comments apply throughout.

You may want to use mocking (I like the mockery package) to ensure that your error messages in places like healthdata_api are constructed properly. One thing to watch for: if parsed comes back without a message property, your message will be empty.

The other coverage hole (fetch_data) looks easy enough to fill, but this is secondary to replacing the tests that have been added with tests that validate more than the lack of errors.

Documentation is minimal

The documentation for fetch_data, which I believe is the main entry point of the package, reads, in full:

#' Fetch data from the healthdata.gov API
#'
#' This function will download data from
#' [healthdata.gov](https://healthdata.gov/),
#' if available in csv format

This supposes that the user already knows what sort of data might be found there, and it's not clear what the restriction to csv is for. The catalogue is not explained relative to the data.

(Other files have the same issues)

In the DESCRIPTION, the Description field should probably be updated to reflect what the package is for, rather than how it works.

Dependencies

This seems to me an excessive number of dependencies to pull in for what is a nicely streamlined package. Things like vroom, rlang, and stringr seem unneeded (is the speed difference of vroom at all noticeable relative to the network cost of downloading the data?). This is a purely personal view of mine; I'd not necessarily expect any changes here to accept this review. But if this were a package I, or someone on my team, wrote, we'd be stripping down this list.

On the other hand, httpcache looks like a nicely chosen dependency that streamlines the package. I've not been able to see it in action here though given the API is down.

Minor comment

Remove commented code (e.g., test-healthdata_api.R) - if part way through writing a test, use testthat::skip()

Please note I will be offline from 4-19 September. I can look again before then but will be unresponsive in those two weeks.

@karthik
Member

karthik commented Aug 22, 2020

Thanks so much Rich! This is super helpful. 💯
@iecastro Please let us know once you've had a chance to review and respond to these comments.

@iecastro
Author

Thank you, @richfitz @karthik - I appreciate everyone's time. I have reviewed the comments, and will start working on them soon.

@karthik
Member

karthik commented Oct 28, 2020

@iecastro Checking in to see if you need any assistance. Please let us know when you're ready. No rush. 🙏

@iecastro
Author

iecastro commented Nov 5, 2020

Hi @richfitz, @karthik - Some comments below, addressing the review items. I'm struggling with my Travis build and with expanding the test suite.

@richfitz - if you have the time, would you mind attempting the review again?

Thanks, all!

Review Comments

Looked at 56bfd39 (Mon Jul 27 07:47:25 2020 -0400).

I've not been able to complete the review as the API is down, but the comments below should be addressed regardless. The three big issues are

  • The API is down - that's fine but the package should be able to cope with this gracefully.
  • The testing is incomplete.
  • The documentation is incomplete.

Note that none of my comments really concern the implementation, which looks fine (modulo comments about dependencies, but that's a matter of taste).

Problematic use of flaky API

> devtools::test()
Loading healthdatacsv
Testing healthdatacsv
✔ |  OK F W S | Context
✖ |   0 1     | fetch_catalog [0.5 s]
────────────────────────────────────────────────────────────────────────────────
test-fetch_catalog.R:3: error: output is tibble
SSL certificate problem: certificate has expired
Backtrace:
  1. testthat::expect_is(fetch_catalog(), "tbl_df") tests/testthat/test-fetch_catalog.R:3:2
  4. healthdatacsv::fetch_catalog()
  5. healthdatacsv::healthdata_api("data.json") R/healthdata-api.R:31:2
  6. httpcache::GET(data_url, user) R/healthdata-api.R:145:2
  7. httr::GET(url, ...)
  8. httr:::request_perform(req, hu$handle$handle)
 10. httr:::request_fetch.write_memory(req$output, req$url, handle)
 11. curl::curl_fetch_memory(url, handle = handle)
  • Is this the API being down, or an SSL certificate issue? I see the same when comparing my Travis and AppVeyor builds - not sure how to address this. I think the error message is appropriate since it didn't get to make an API call; on the other hand, I haven't been able to find a way to mock a failed API call to test this.

All tests that use web requests should be opt-in, especially if the package will go to CRAN. Add testthat::skip_on_cran() to these test blocks.

  • Completed
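For reference, the opt-in pattern looks something like this (the test body is assumed, based on the existing fetch_catalog() test):

```r
test_that("catalog endpoint returns a tibble", {
  testthat::skip_on_cran()     # network tests run locally and on CI, never on CRAN
  testthat::skip_if_offline()  # optional: also skip when there is no connection
  expect_s3_class(fetch_catalog(), "tbl_df")
})
```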

Similar, but harder, comments apply to the vignette if this is going on CRAN - I do not recommend at all trying to run a vignette that uses the internet or an external service there. All the alternative options are unpleasant though:

  • ask them not to run the vignette
  • precompile the vignette and include one that does not use the internet (numerous hacky ways of doing this)
  • include the vignette as a pkgdown article only (users then can't read it from R or offline)
  • mock out the api calls in the vignette (hard to maintain, ugly to write, you'll have to store the data in the package)

  • Completed. I stripped the vignette to just a reference URL - will eventually replace with a pkgdown article

Testing issues

The testing is minimal to the point that it may not be worth having, for example in test-fetch_catalogue.R

test_that("output is tibble", {
  expect_is(fetch_catalog(), "tbl_df")
})

However, when looking at fetch_catalogue(), there's significant logic there that is never tested - just that we get a tbl_df out, which is unavoidable if the function does not error. So this test could be replaced by checking that the function doesn't error (this is not the solution though!).

Instead, split these functions into network-using and network free components. You're basically there already, and

  • I have split the function into network-using and network-free components, and expanded the testing logic; although maybe not fully to the extent you had in mind?
fetch_catalog <- function(agency = NULL,
                          keyword = NULL) {

  api_call <- healthdata_api("data.json")
  fetch_catalogue_process(api_call, agency, keyword)
}

and focus the bulk of the testing on the second half. What happens if the agency is not found in the data? What if the user provides a table rather than a string? Are multiple strings handled gracefully or is an error thrown, etc.?

If you split the code at this point you don't even need to bother formally mocking the api call; you can just keep a "reasonable" data set in the package for testing with.

Similar comments apply throughout.

You may want to use mocking (I like the mockery package) for ensuring that your error messages in places like healthdata_api are constructed properly. One thing to watch for is that if parsed comes back without a message property, your message will be empty.

The other coverage hole (fetch_data) looks easy enough to fill but this is secondary to making sure that the tests that have been added are replaced with tests that validate more than the lack of errors.

Documentation is minimal

The documentation for fetch_data, the main entry point of the package I believe, reads - in full

  • I have expanded documentation
#' Fetch data from the healthdata.gov API
#'
#' This function will download data from
#' [healthdata.gov](https://healthdata.gov/),
#' if available in csv format

This supposes that the user already knows what sort of data might be found there, and it's not clear what the restriction on csv is for. The catalogue is not explained relative to the data.

(Other files have the same issues)

In the DESCRIPTION, the Description field should probably be updated to reflect what the package is for, rather than how it works.

Dependencies

This seems to me an excessive number of dependencies to pull in for what is a nicely streamlined package. Things like vroom, rlang, and stringr seem unwanted (is the speed difference of vroom at all noticeable relative to the network cost of downloading the data?). This is a purely personal view of mine - I wouldn't necessarily expect any changes here to accept this review. But if this were a package I, or someone on my team, wrote, we'd be stripping down this list.

  • It's not just the speed difference that vroom provides, but also the indexing - it is especially helpful when dealing with datasets of ~500K rows
  • The ability to leverage dependencies is what allows me to build something like this. I didn't start with a package in mind; I was trying to find some data

On the other hand, httpcache looks like a nicely chosen dependency that streamlines the package. I've not been able to see it in action here though given the API is down.

Minor comment

Remove commented code (e.g., test-healthdata_api.R) - if part way through writing a test, use testthat::skip()

  • Done

Please note I will be offline from 4-19 September. I can look again before then but will be unresponsive in those two weeks.

@iecastro
Author

Hi @karthik, I suggest closing this issue. I have not made further updates in a while.

@karthik
Member

karthik commented Feb 21, 2023

ok thanks @iecastro
We can reopen this when you decide to proceed again.

@karthik karthik closed this as completed Feb 21, 2023