
submission: healthdatacsv #358

Closed
9 of 29 tasks
iecastro opened this issue Jan 11, 2020 · 28 comments
Comments

@iecastro

iecastro commented Jan 11, 2020

Submitting Author Name: Ivan Castro
Submitting Author Github Handle: @iecastro
Repository: https://github.com/iecastro/healthdatacsv
Version submitted: 0.0.1
Editor: @karthik
Reviewers: @czeildi, @richfitz

Due date for @czeildi: 2020-03-19

Due date for @richfitz: 2020-08-25
Archive: TBD
Version accepted: TBD


  • Paste the full DESCRIPTION file inside a code block below:
Package: healthdatacsv
Title: Access data in the healthdata.gov catalog
Version: 0.0.1
Authors@R: 
    person(given = "Ivan",
           family = "Castro",
           role = c("aut", "cre"),
           email = "ivedcastr@gmail.com")
Description: This package will make an API call to the
    data.json endpoint of HealthData.gov and read data into R, if available in
    csv format. Some helper functions are available to help users query by
    listed Agency and/or keywords.
License: MIT + file LICENSE
URL: https://github.com/iecastro/healthdatacsv
BugReports: https://github.com/iecastro/healthdatacsv/issues
Imports: 
    dplyr,
    httr,
    jsonlite,
    magrittr,
    purrr,
    rlang,
    stringr,
    tibble,
    tidyr,
    vroom
Suggests: 
    readr,
    testthat (>= 2.1.0),
    covr
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.1.1

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • database access
    • data munging
    • data deposition
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • database software bindings
    • geospatial data
    • text analysis
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):
    healthdatacsv falls under data retrieval because it accesses and ingests data from an online source. This package focuses on health-related data made available from the HealthData.gov API.

  • Who is the target audience and what are scientific applications of this package?
    The target audience includes researchers, instructors, students, and analysts in health-related fields (e.g., Epidemiology, Health Economics, Health Services Research). The data accessed are collected by agencies of the US Dept. of Health and Human Services and are often nationally representative. Scientific applications include, among others, the investigation of disease patterns, trends in health behaviors, and deficits in health systems.

  • Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
    There is rHealthDataGov; however, this package has not been updated and does not function properly with the current HealthData.gov API.

  • If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
    presubmission: healthdatacsv #350

Technical checks

Confirm each of the following by checking the box. This package:

Publication options

JOSS Options
  • The package has an obvious research application according to JOSS's definition.
    • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)
MEE Options
  • The package is novel and will be of interest to the broad readership of the journal.
  • The manuscript describing the package is no longer than 3000 words.
  • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
  • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
  • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
  • (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

@iecastro
Author

Hi all: did a small update to include a COC, and a basic use vignette.

@lmullen
Member

lmullen commented Jan 24, 2020

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly

Editor comments

Thanks for your submission. I have requests out to reviewers, and will let you know the reviewers and the review date when enough have agreed.

In the meantime, could you please address these issues?

  • The repository uses Codecov, but automated code coverage through Travis CI is not currently working.
  • A few mis-spellings: healdata
  • Can you add CI for Windows with Appveyor? Try the usethis package.

Reviewers:
Due date:

@lmullen
Member

lmullen commented Jan 24, 2020

One other thing: while the package is undergoing peer review, could you please add the peer review badge? You can do this by running rodeo::use_review_badge(358).

@iecastro
Author

Thank you @lmullen - I've started addressing your comments.

@iecastro
Author

iecastro commented Feb 8, 2020

Editor comments

Thanks for your submission. I have requests out to reviewers, and will let you know the reviewers and the review date when enough have agreed.

In the meantime, could you please address these issues?

  • The repository uses codecov but the automated code coverage through Travis CI functionality is not currently working.
  • A few mis-spellings: healdata
    - I couldn't find these, but will keep in mind
  • Can you add CI for Windows with Appveyor? Try the usethis package.

@lmullen
Member

lmullen commented Feb 26, 2020

Thanks to our reviewers for agreeing to take on the peer review of this package.

Reviewer: @czeildi
Reviewer: @cksun-usc
Due date: 2020-03-19

You can find details about how to do the review in the reviewers' guide.

@czeildi

czeildi commented Mar 7, 2020

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
    • in fetch_csv(), prefer dplyr::slice() to slice() so it runs without library(dplyr)
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 6

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

In general, I found the package easy to learn and use, and the code is readable and simple. Below are my comments in three categories. Most of them are non-blocking.

usability

  • All function calls emit the message "No encoding supplied: defaulting to UTF-8." Consider muting this by passing encoding = "UTF-8" to httr::content(response, "text") in healthdata_api().
  • fetch_csv() throws a somewhat confusing error when called with a catalog where csv_available is FALSE. I would add an explicit test and error message for this, and also highlight it in the function documentation. (Also consider the case of multiple catalogs, where some of them have csv_available while others do not.)
  • The data_viewer argument in get_keywords() seems arbitrary to me: why here and not in other functions? It makes the code more complicated for little gain (the user has to provide an argument instead of, for example, piping with %T>% into View()). I would suggest removing this argument.
  • 3 different verbs are used for similar things: getting results from the API (get, fetch, list). Is there a logic that makes this easier to remember? I would use the same verb for keywords and agencies, as they are both helpers, so the user knows what arguments they can call fetch_catalog() with.
  • consider renaming fetch_csv(): it does not return csv files so it was a bit confusing for me
  • in list_agencies the namecheck argument is not really intuitive to me; possibly use name_match instead, or something suggesting regex matching?
  • Besides fetch_csv(), all functions (with any arguments) call healthdata_api("data.json"), which takes significant time. This is a cost for the user, and it also makes tests and vignette compilation quite slow. Consider caching the result of the API call, or reorganize the functions in a way that lets the user easily re-use the API result. I believe the memoise package can be used for this, but I do not have experience with it; it might not be worth it in this case.
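The caching idea in the last bullet could be sketched without any new dependency by storing the parsed response in a package-level environment. This is a minimal illustration, not the package's code: `healthdata_api` below is a stand-in stub (the real internal function performs the HTTP request).

```r
# Minimal sketch of caching the data.json response so repeated helper
# calls reuse one download. `healthdata_api` is a STAND-IN STUB here,
# not the package's implementation (which performs the real HTTP GET).
healthdata_api <- function(endpoint) {
  list(endpoint = endpoint, fetched_at = Sys.time())  # stub payload
}

.hd_cache <- new.env(parent = emptyenv())

get_data_json <- function() {
  if (is.null(.hd_cache$data_json)) {
    # the (slow) API call runs only on the first invocation
    .hd_cache$data_json <- healthdata_api("data.json")
  }
  .hd_cache$data_json
}

first  <- get_data_json()
second <- get_data_json()
identical(first, second)  # TRUE: the second call is served from the cache
```

With the memoise package this collapses to one line, `healthdata_api_cached <- memoise::memoise(healthdata_api)`, at the cost of an extra dependency.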

code style

  • Use consistent indentation and spacing (You can do this automatically with styler::style_pkg().)
  • consider renaming publisher.name to publisher_name to follow snake case in results as well
  • In healthdata_api, if there is an error you cannot necessarily parse the response, so I believe the error message should not rely on the parsed response in all cases (see healthdata_api("data.jsn") as a forced example)
  • add a new test for the case of error from the API
  • There are repeated parts in fetch_catalog(), list_agencies() and get_keywords() which could be simplified and thus easier to understand and maintain. E.g. in get_keywords():
keywords <- jsonlite::flatten(parsed_dataset) %>%
  as_tibble() %>%
  select(
    .data$publisher.name,
    .data$keyword
  ) %>%
  tidyr::unnest(cols = c(.data$keyword)) %>%
  distinct()
if (!is.null(agency)) {
  keywords <- keywords %>%
    filter(.data$publisher.name == agency)
}
  • it is good practice to give tests unique descriptions so that when there is a failure, it is easier to locate based on the description. (In this case there are only a few tests, so it is OK.) Nonetheless, consider a unique description for every test case (you can achieve this by changing the descriptions or by grouping similar expectations together in one test case)
  • test files: it is not necessary to include library() call at the top of the test files (I am not sure whether it is problematic, but it is not standard afaik)
  • there is no need for explicit return() calls, consider removing them
  • in example in fetch_csv(), prefer dplyr::slice() to slice() so it runs without library(dplyr)

documentation

  • add Roxygen: list(markdown = TRUE) to DESCRIPTION so that markdown formatting works in roxygen2 documentation
  • use @noRd instead of @keyword internal for documenting internal functions. Reference 1, Reference 2
  • consider adding codemetar: https://devguide.ropensci.org/building.html#package-name-and-metadata
  • The readme is a bit long for me relative to the number of functionalities. Consider using head() on results. (Note that I am also completely fine with the current readme)
  • Add a repostatus.org badge to readme as per guideline in https://devguide.ropensci.org/building.html#readme
  • Fix the link to tibble in the @return part of the function documentation: use [`tibble`][tibble::tibble()] or similar. (I checked in the pkgdown documentation but did not find a way to link to the tibble-package documentation without hard-coding URLs)
  • Separate titles and descriptions in the package documentation (insert a blank line or add a separate title) so that the title is short, especially in fetch_catalog()
  • use consistent code style (indentation) in readme, vignette as well, e.g.
fetch_catalog("Centers for Disease Control and Prevention",
                               keyword = "built environment")
  • There is a sentence in the readme: "healthdatacsv is pipe friendly, so it’s easy to integrate into a tidy workflow." It is a bit out of context for me: the claim is not really supported; the single-command pipeline with dplyr::pull is not a strong argument, and all functions in the package have different inputs and outputs. (This is not to say that the package is pipe-unfriendly!) The fetch_catalog(...) %>% fetch_csv() pipe supports this sentence better.
  • indicate all internal functions with function_name() in readme and vignette for nice linking in pkgdown website
  • in function documentations: note NULL, TRUE as code (by enclosing with ``), healthdata.gov as link, link to other functions by enclosing in ``

@iecastro
Author

iecastro commented Mar 9, 2020

Thanks @czeildi! I appreciate the feedback and will start addressing individual items soon.

@noamross
Contributor

⚠️⚠️⚠️⚠️⚠️

In the interest of reducing load on reviewers and editors as we manage the COVID-19 crisis, rOpenSci is temporarily pausing new submissions for software peer review for 30 days (and possibly longer). Please check back here again after 17 April for updates.

In this period new submissions will not be handled, nor new reviewers assigned. Reviews and responses to reviews will be handled on a 'best effort' basis, but no follow-up reminders will be sent.

Other rOpenSci community activities continue. We express our continued great appreciation for the work of our authors and reviewers. Stay healthy and take care of one another.

The rOpenSci Editorial Board

⚠️⚠️⚠️⚠️⚠️

@annakrystalli
Contributor

⚠️⚠️⚠️⚠️⚠️

In the interest of reducing load on reviewers and editors as we manage the COVID-19 crisis, rOpenSci new submissions for software peer review are paused.

In this period new submissions will not be handled, nor new reviewers assigned. Reviews and responses to reviews will be handled on a 'best effort' basis, but no follow-up reminders will be sent. Other rOpenSci community activities continue.

Please check back here again after 25 May when we will be announcing plans to slowly start back up.

We express our continued great appreciation for the work of our authors and reviewers. Stay healthy and take care of one another.

The rOpenSci Editorial Board

⚠️⚠️⚠️⚠️⚠️

@lmullen
Member

lmullen commented Jun 30, 2020

Hi, @cksun-usc. Obviously a lot happened in the world since you agreed to take on this review. Is this something that you would be able to pick up at some point? If you do not wish to go forward with the review, that's more than fine; just let me know.

@karthik karthik self-assigned this Jul 13, 2020
@karthik
Member

karthik commented Jul 13, 2020

👋 @iecastro I'll be taking over as editor for this submission and will follow up on what's needed to move this along. Thank you for your patience.

@iecastro
Author

Hi @czeildi - hope all is well! I've addressed your comments below. Thank you for your time.

Review Comments

In general, I found the package easy to learn and use, and the code is readable and simple. Below are my comments in three categories. Most of them are non-blocking.

usability

  • All function calls emit the message "No encoding supplied: defaulting to UTF-8." Consider muting this by passing encoding = "UTF-8" to httr::content(response, "text") in healthdata_api().
  • fetch_csv() throws a somewhat confusing error when called with a catalog where csv_available is FALSE. I would add an explicit test and error message for this, and also highlight it in the function documentation. (Also consider the case of multiple catalogs, where some of them have csv_available while others do not.)
  • The data_viewer argument in get_keywords() seems arbitrary to me: why here and not in other functions? It makes the code more complicated for little gain (the user has to provide an argument instead of, for example, piping with %T>% into View()). I would suggest removing this argument.
  • 3 different verbs are used for similar things: getting results from the API (get, fetch, list). Is there a logic that makes this easier to remember? I would use the same verb for keywords and agencies, as they are both helpers, so the user knows what arguments they can call fetch_catalog() with.
  • keywords and agencies now have the same verb get_
  • consider renaming fetch_csv(): it does not return csv files so it was a bit confusing for me
  • renamed fetch_data()
  • in list_agencies the namecheck argument is not really intuitive to me; possibly use name_match instead, or something suggesting regex matching?
  • argument renamed agency
  • Besides fetch_csv(), all functions (with any arguments) call healthdata_api("data.json"), which takes significant time. This is a cost for the user, and it also makes tests and vignette compilation quite slow. Consider caching the result of the API call, or reorganize the functions in a way that lets the user easily re-use the API result. I believe the memoise package can be used for this, but I do not have experience with it; it might not be worth it in this case.

code style

  • Use consistent indentation and spacing (You can do this automatically with styler::style_pkg().)
  • consider renaming publisher.name to publisher_name to follow snake case in results as well
  • renamed to publisher
  • In healthdata_api if there is an error, you cannot necessarily parse so I believe the error message should not rely on the parsed response in all cases (see healthdata_api("data.jsn") as a forced example)
  • healthdata_api is an internal function, so the user does not have a chance of misspelling the endpoint url. In theory, there will always be an API call made, and any errors would be parsed. But also, not much experience on this, on my part.
  • add a new test for the case of error from the API
  • There are repeated parts in fetch_catalog(), list_agencies() and get_keywords() which could be simplified and thus easier to understand and maintain. E.g. in get_keywords():
  • thank you!
keywords <- jsonlite::flatten(parsed_dataset) %>%
  as_tibble() %>%
  select(
    .data$publisher.name,
    .data$keyword
  ) %>%
  tidyr::unnest(cols = c(.data$keyword)) %>%
  distinct()
if (!is.null(agency)) {
  keywords <- keywords %>%
    filter(.data$publisher.name == agency)
}
  • it is good practice to give tests unique descriptions so that if there is a failure, it is easier to find based on description. (In this case there are only a few tests so it is ok). Nonetheless, consider a unique description for all test cases (you can achieve this by changing the descriptions or grouping similar expectations together in one test case)
  • test files: it is not necessary to include library() call at the top of the test files (I am not sure whether it is problematic, but it is not standard afaik)
  • there is no need for explicit return() calls, consider removing them
  • in example in fetch_csv(), prefer dplyr::slice() to slice() so it runs without library(dplyr)

documentation

  • Pending. Didn't realize I hadn't looked into this.
  • The readme is a bit long to me compared to the number of functionalities. Consider using head on results. (Note that I am also completely fine with the current readme)
  • Did not use head(), but I did revise the readme several times, cutting down the length (I think - it's been a while)
  • Add a repostatus.org badge to readme as per guideline in https://devguide.ropensci.org/building.html#readme
  • Fix the link to tibble in the @return part of the function documentation: use [`tibble`][tibble::tibble()] or similar. (I checked in the pkgdown documentation but did not find a way to link to the tibble-package documentation without hard-coding URLs)
  • Separate titles and descriptions in the package documentation (insert a blank line or add a separate title) so that the title is short, especially in fetch_catalog()
  • use consistent code style (indentation) in readme, vignette as well, e.g.
fetch_catalog("Centers for Disease Control and Prevention",
                               keyword = "built environment")
  • There is a sentence in the readme: "healthdatacsv is pipe friendly, so it’s easy to integrate into a tidy workflow." It is a bit out of context for me: the claim is not really supported; the single-command pipeline with dplyr::pull is not a strong argument, and all functions in the package have different inputs and outputs. (This is not to say that the package is pipe-unfriendly!) The fetch_catalog(...) %>% fetch_csv() pipe supports this sentence better.
  • indicate all internal functions with function_name() in readme and vignette for nice linking in pkgdown website
  • in function documentations: note NULL, TRUE as code (by enclosing with ``), healthdata.gov as link, link to other functions by enclosing in ``

@iecastro
Author

👋 @iecastro I'll be taking over as editor for this submission and will follow up on what's needed to move this along. Thank you for your patience.

No worries @karthik, thank you - there's no rush on my part

@czeildi

czeildi commented Jul 26, 2020

@iecastro I reviewed your changes, looks great! I happily recommend approving this package.

I noticed one minor thing: you should no longer import View as you no longer use it: #' @importFrom utils View line should be deleted from healthdatacsv-package.R

@karthik
Member

karthik commented Jul 27, 2020

Thanks very much for the review and acceptance recommendation @czeildi

The package looks to be in good shape to me as well. Since the second reviewer has gone missing, I'll try to get someone else to do a quick review and move this to acceptance.

@karthik
Member

karthik commented Jul 27, 2020

👋 @dbuijs Would you be able to help us review this package? One reviewer has already completed a thorough review and recommended acceptance. You'll need to do a sanity check and make sure everything looks ok to you. There is a checklist to work through. Thanks for considering. 🙏

@iecastro
Author

Thanks, all! @czeildi @karthik

@karthik
Member

karthik commented Jul 29, 2020

@iecastro I'll keep looking for a second reviewer but if you have any suggestions for potential reviewers who don't have a conflict of interest, I am all ears.

@karthik
Member

karthik commented Aug 4, 2020

Adding @richfitz as second reviewer. Thanks for volunteering Rich! Much appreciated.

@iecastro
Author

iecastro commented Aug 5, 2020

Thank you! @karthik @richfitz

@richfitz
Member

richfitz commented Aug 21, 2020

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • Briefly describe any working relationship you have (had) with the package authors.
  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally (failed due to API error)
  • Function Documentation: for all exported functions in R help (too minimal; see below)
  • Examples for all exported functions in R Help that run successfully locally (failed due to API error)
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • [NA] Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine. (see below)
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Looked at 56bfd39 (Mon Jul 27 07:47:25 2020 -0400).

I've not been able to complete the review as the API is down, but the comments below should be addressed regardless. The three big issues are:

  • The API is down - that's fine but the package should be able to cope with this gracefully.
  • The testing is incomplete.
  • The documentation is incomplete.

Note that none of my comments really concern the implementation, which looks fine (modulo comments about dependencies, but that's a matter of taste).

Problematic use of flaky API

> devtools::test()
Loading healthdatacsv
Testing healthdatacsv
✔ |  OK F W S | Context
✖ |   0 1     | fetch_catalog [0.5 s]
────────────────────────────────────────────────────────────────────────────────
test-fetch_catalog.R:3: error: output is tibble
SSL certificate problem: certificate has expired
Backtrace:
  1. testthat::expect_is(fetch_catalog(), "tbl_df") tests/testthat/test-fetch_catalog.R:3:2
  4. healthdatacsv::fetch_catalog()
  5. healthdatacsv::healthdata_api("data.json") R/healthdata-api.R:31:2
  6. httpcache::GET(data_url, user) R/healthdata-api.R:145:2
  7. httr::GET(url, ...)
  8. httr:::request_perform(req, hu$handle$handle)
 10. httr:::request_fetch.write_memory(req$output, req$url, handle)
 11. curl::curl_fetch_memory(url, handle = handle)

All tests that use web requests should be opt-in, especially if the package will go to CRAN. Add testthat::skip_on_cran() to these test blocks:

test-fetch_catalog.R:3
test-get_agencies.R:3
test-get_agencies.R:8
test-get_keywords.R:3
test-get_keywords.R:8
test-get_keywords.R:14
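Applied to the first of those blocks, the gating might look like the following sketch. `skip_if_offline()` is an optional extra (it requires testthat >= 2.2.0 with the curl package); the expectation itself is unchanged from the package's existing test.

```r
# Sketch: make a network-dependent test opt-in outside CRAN.
test_that("output is tibble", {
  testthat::skip_on_cran()     # never hit the network on CRAN
  testthat::skip_if_offline()  # skip gracefully when there is no connection
  expect_is(fetch_catalog(), "tbl_df")
})
```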

Similar, but harder, comments apply to the vignette if this is going to CRAN - I do not recommend trying to run a vignette there that uses the internet or an external service. All the alternative options are unpleasant though:

  • ask them not to run the vignette
  • precompile the vignette and include one that does not use the internet (numerous hacky ways of doing this)
  • include the vignette as a pkgdown article only (users then can't read it from R or offline)
  • mock out the api calls in the vignette (hard to maintain, ugly to write, you'll have to store the data in the package)

Testing issues

The testing is minimal to the point that it may not be worth having; for example, in test-fetch_catalog.R:

test_that("output is tibble", {
  expect_is(fetch_catalog(), "tbl_df")
})

However, looking at fetch_catalog(), there is significant logic there that is never tested: we only check that we get a tbl_df out, which is unavoidable if the function does not error. So this test could be replaced by checking that the function doesn't error (this is not the solution though!).

Instead, split these functions into network-using and network-free components. You're basically there already, and

fetch_catalog <- function(agency = NULL,
                          keyword = NULL) {

  api_call <- healthdata_api("data.json")
  fetch_catalogue_process(api_call, agency, keyword)
}

and focus the bulk of the testing on the second half. What happens if agency is not found in the data? What if the user provides a table rather than a string? Are multiple strings handled gracefully, or is an error thrown, etc.?

If you split the code at this point you don't even need to bother formally mocking the api call, you can just keep a "reasonable" data set in the package for testing with.
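Under that split, tests for the processing half could run entirely against packaged data. In this sketch, `fetch_catalog_process()` and `catalog_fixture` are hypothetical names for the pure function and the stored test data set, and returning an empty tibble for an unknown agency is an assumed design choice (erroring would be equally valid):

```r
# Sketch: test the network-free half against a data set shipped with the
# package. Names and the zero-row behavior are assumptions, not the
# package's actual API.
test_that("unknown agency yields an empty catalog", {
  out <- fetch_catalog_process(catalog_fixture,
                               agency  = "No Such Agency",
                               keyword = NULL)
  expect_s3_class(out, "tbl_df")  # still a tibble...
  expect_equal(nrow(out), 0)      # ...but with zero rows
})
```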

Similar comments apply throughout.

You may want to use mocking (I like the mockery package) to ensure that your error messages in places like healthdata_api are constructed properly. One thing to watch for: if parsed comes back without a message property, your message will be empty.

The other coverage hole (fetch_data) looks easy enough to fill, but this is secondary to replacing the tests that have been added with tests that validate more than the lack of errors.

Documentation is minimal

The documentation for fetch_data, which I believe is the main entry point of the package, reads, in full:

#' Fetch data from the healthdata.gov API
#'
#' This function will download data from
#' [healthdata.gov](https://healthdata.gov/),
#' if available in csv format

This supposes that the user already knows what sort of data might be found there, and it's not clear what the restriction to csv is for. The catalogue is not explained relative to the data.

(Other files have the same issues)

In the DESCRIPTION, the Description field should probably be updated to reflect what the package is for, rather than how it works.

Dependencies

This seems to me an excessive number of dependencies to pull in for what is a nicely streamlined package. Things like vroom, rlang, and stringr seem unneeded (is the speed difference of vroom at all noticeable relative to the network cost of downloading the data?). This is a purely personal view of mine; I'd not necessarily expect any changes here to accept this review. But if this were a package I, or someone on my team, wrote, we'd be stripping down this list.

On the other hand, httpcache looks like a nicely chosen dependency that streamlines the package. I've not been able to see it in action here though given the API is down.

Minor comment

Remove commented code (e.g., test-healthdata_api.R) - if part way through writing a test, use testthat::skip()

Please note I will be offline from 4-19 September. I can look again before then but will be unresponsive in those two weeks.

@karthik
Member

karthik commented Aug 22, 2020

Thanks so much Rich! This is super helpful. 💯
@iecastro Please let us know once you've had a chance to review and respond to these comments.

@iecastro
Author

Thank you, @richfitz @karthik - I appreciate everyone's time. I have reviewed the comments, and will start working on them soon.

@karthik
Member

karthik commented Oct 28, 2020

@iecastro Checking in to see if you need any assistance. Please let us know when you're ready. No rush. 🙏

@iecastro
Author

iecastro commented Nov 5, 2020

Hi @richfitz, @karthik - Some comments below, addressing the review items. I'm struggling with my Travis build and with expanding the test suite.

@richfitz - if you have the time, would you mind attempting the review again?

Thanks, all!

Review Comments

Looked at 56bfd39 (Mon Jul 27 07:47:25 2020 -0400).

I've not been able to complete the review as the API is down, but the comments below should be addressed regardless. The three big issues are

  • The API is down - that's fine but the package should be able to cope with this gracefully.
  • The testing is incomplete.
  • The documentation is incomplete.

Note that none of my comments really concern the implementation, which looks fine (modulo comments about dependencies, but that's a matter of taste).

Problematic use of flaky API

> devtools::test()
Loading healthdatacsv
Testing healthdatacsv
✔ |  OK F W S | Context
✖ |   0 1     | fetch_catalog [0.5 s]
────────────────────────────────────────────────────────────────────────────────
test-fetch_catalog.R:3: error: output is tibble
SSL certificate problem: certificate has expired
Backtrace:
  1. testthat::expect_is(fetch_catalog(), "tbl_df") tests/testthat/test-fetch_catalog.R:3:2
  4. healthdatacsv::fetch_catalog()
  5. healthdatacsv::healthdata_api("data.json") R/healthdata-api.R:31:2
  6. httpcache::GET(data_url, user) R/healthdata-api.R:145:2
  7. httr::GET(url, ...)
  8. httr:::request_perform(req, hu$handle$handle)
 10. httr:::request_fetch.write_memory(req$output, req$url, handle)
 11. curl::curl_fetch_memory(url, handle = handle)
  • Is this the API being down, or an SSL certificate issue? I see the same when comparing my Travis and AppVeyor builds - not sure how to address this. I think the error message is appropriate since it didn't get to make an API call; on the other hand, I haven't been able to find a way to mock a failed API call to test this.

All tests that use web requests should be opt-in, especially if the package will go to CRAN. Add testthat::skip_on_cran() to these test blocks.

  • Completed
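For reference, the opt-in pattern looks something like this (the test body is assumed, based on the existing fetch_catalog() test):

```r
test_that("catalog endpoint returns a tibble", {
  testthat::skip_on_cran()     # network tests run locally and on CI, never on CRAN
  testthat::skip_if_offline()  # optional: also skip when there is no connection
  expect_s3_class(fetch_catalog(), "tbl_df")
})
```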

Similar, but harder, comments apply to the vignette if this is going on CRAN - I do not recommend at all trying to run a vignette that uses the internet or an external service there. All the alternative options are unpleasant though:

  • ask them not to run the vignette
  • precompile the vignette and include one that does not use the internet (numerous hacky ways of doing this)
  • include the vignette as a pkgdown article only (users then can't read it from R or offline)
  • mock out the api calls in the vignette (hard to maintain, ugly to write, you'll have to store the data in the package)

  • Completed. I stripped the vignette to just a reference URL - will eventually replace with a pkgdown article

Testing issues

The testing is minimal to the point that it may not be worth having, for example in test-fetch_catalogue.R

test_that("output is tibble", {
  expect_is(fetch_catalog(), "tbl_df")
})

However, when looking at fetch_catalogue(), there's significant logic there that is never tested - just that we get a tbl_df out, which is unavoidable if the function does not error. So this test could be replaced by checking that the function doesn't error (this is not the solution though!).

Instead, split these functions into network-using and network free components. You're basically there already, and

  • I have split the function into network-using and network-free components, and expanded the testing logic; although maybe not fully to the extent you had in mind?
fetch_catalog <- function(agency = NULL,
                          keyword = NULL) {

  api_call <- healthdata_api("data.json")
  fetch_catalogue_process(api_call, agency, keyword)
}

and focus the bulk of the testing on the second half. What happens if the agency is not found in the data? What if the user provides a table rather than a string? Are multiple strings handled gracefully or is an error thrown, etc.?

If you split the code at this point you don't even need to bother formally mocking the api call; you can just keep a "reasonable" data set in the package for testing with.

Similar comments apply throughout.

You may want to use mocking (I like the mockery package) for ensuring that your error messages in places like healthdata_api are constructed properly. One thing to watch for is that if parsed comes back without a message property, your message will be empty.

The other coverage hole (fetch_data) looks easy enough to fill but this is secondary to making sure that the tests that have been added are replaced with tests that validate more than the lack of errors.

Documentation is minimal

The documentation for fetch_data, the main entry point of the package I believe, reads - in full

  • I have expanded documentation
#' Fetch data from the healthdata.gov API
#'
#' This function will download data from
#' [healthdata.gov](https://healthdata.gov/),
#' if available in csv format

This supposes that the user already knows what sort of data might be found there, and it's not clear what the restriction on csv is for. The catalogue is not explained relative to the data.

(Other files have the same issues)

In the DESCRIPTION, the Description field should probably be updated to reflect what the package is for, rather than how it works.

Dependencies

This seems to me an excessive number of dependencies to pull in for what is a nicely streamlined package. Things like vroom, rlang, and stringr seem unwanted (is the speed difference of vroom at all noticeable relative to the network cost of downloading the data?). This is a purely personal view of mine - I wouldn't necessarily expect any changes here to accept this review. But if this were a package I, or someone on my team, wrote, we'd be stripping down this list.

  • It's not just the speed difference that vroom provides, but also the indexing - it is especially helpful when dealing with datasets of ~500K rows
  • The ability to leverage dependencies is what allows me to build something like this. I didn't start with a package in mind; I was trying to find some data

On the other hand, httpcache looks like a nicely chosen dependency that streamlines the package. I've not been able to see it in action here though given the API is down.

Minor comment

Remove commented code (e.g., test-healthdata_api.R) - if part way through writing a test, use testthat::skip()

  • Done

Please note I will be offline from 4-19 September. I can look again before then but will be unresponsive in those two weeks.

@iecastro
Author

Hi @karthik, I suggest closing this issue. I have not made further updates in a while.

@karthik
Member

karthik commented Feb 21, 2023

ok thanks @iecastro
We can reopen this when you decide to proceed again.

@karthik karthik closed this as completed Feb 21, 2023