Submission: comtradr #141

Closed
11 of 14 tasks
ChrisMuir opened this issue Aug 8, 2017 · 37 comments

@ChrisMuir
Member

Summary

  • What does this package do? (explain in 50 words or less):

API wrapper package for the UN Comtrade Database, which features inter-country trade data dating back to the early 1990s. Full API documentation can be found here. This package allows users to interact with the API directly from R, and features functions for making queries and importing data.

  • Paste the full DESCRIPTION file inside a code block below:
Package: comtradr
Title: Interface with the United Nations Comtrade API
Version: 0.0.2.9000
Authors@R: person("Chris", "Muir", email = "chrismuirRVA@gmail.com", role = c("aut", "cre"))
Description: Interface with and extract data from the United Nations Comtrade API <https://comtrade.un.org/data/>. Comtrade provides country level shipping data for a variety of commodities, these functions allow for easy API query and data returned as a tidy data frame.
Depends: R (>= 3.0.0)
License: GPL-3
Encoding: UTF-8
LazyData: true
Imports:
        httr, 
        jsonlite, 
        dplyr, 
        methods, 
        utils
RoxygenNote: 6.0.1
URL: https://github.com/ChrisMuir/comtradr
BugReports: https://github.com/ChrisMuir/comtradr/issues
NeedsCompilation: no
Author: Chris Muir [aut, cre]
Maintainer: Chris Muir <chrismuirRVA@gmail.com>
Suggests: testthat,
    knitr,
    rmarkdown
VignetteBuilder: knitr
  • URL for the package (the development repository, not a stylized html page):

https://github.com/ChrisMuir/comtradr

  • Please indicate which category or categories from our package fit policies this package falls under *and why?* (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):

Data retrieval; this is an API package.

  • Who is the target audience?

Anyone that could benefit from easy access to historical inter-country shipping data. This can include scientists studying supply chains, economists, journalists, etc.

The Comtrade website has a "Getting API Data with R" post (here), but there are a number of issues with the code:

  1. It mixes up partner countries and reporter countries.
  2. It's missing the call to import the reporter countries.
  3. It relies on the rjson package.
  4. No mention of commodity codes, or importing the commodity codes dataset.
  5. Nondescript and confusing parameter names (px, ps, r, p, rg, etc.).

I've found a few R packages online, but they're all just the code from the Comtrade tutorial wrapped up in a package. There's a Python API package, tradedownloader.

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov.
  • I agree to abide by ROpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package contains a paper.md with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

@maelle maelle added the package label Aug 8, 2017
@ChrisMuir
Copy link
Member Author

Sorry, meant to also post the output of goodpractice::gp():

> goodpractice::gp()
Preparing: covr
Preparing: cyclocomp
* installing *source* package 'comtradr' ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (comtradr)
Preparing: description
Preparing: lintr
Preparing: namespace
Preparing: rcmdcheck
── GP comtradr ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

It is good practice to

  ✖ write unit tests for all functions, and all package code in general. 80% of code lines are covered by test cases.

    R/commodity_lookup.R:92:NA
    R/commodity_lookup.R:93:NA
    R/commodity_lookup.R:119:NA
    R/commodity_lookup.R:120:NA
    R/commodity_lookup.R:139:NA
    ... and 75 more lines

  ✖ avoid sapply(), it is not type safe. It might return a vector, or a list, depending on the input data. Consider using vapply()
    instead.

    R\commodity_lookup.R:68:10
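
For context, the type instability being flagged: sapply() simplifies its result based on the shape of the data, while vapply() pins the output type up front. A minimal illustration:

sapply(1:3, function(x) rep(x, 2))   # equal-length results: simplified to a 2x3 matrix
sapply(1:3, function(x) rep(x, x))   # ragged results: the same call returns a list

vapply(1:3, function(x) x * 2, FUN.VALUE = numeric(1))  # always numeric, or an error
#> [1] 2 4 6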

@sckott
Contributor

sckott commented Aug 11, 2017

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Thanks for your submission @ChrisMuir !

Currently seeking reviewers. It's a good fit and not overlapping.

Thanks for including goodpractice::gp() output - you can leave that to us in the future 😄


Reviewers: @AliciaSchep @rtaph
Due date: 2017-09-11

@sckott
Contributor

sckott commented Aug 16, 2017

thanks @AliciaSchep and @rtaph for reviewing

@ChrisMuir
Member Author

Great, I look forward to the review process! @AliciaSchep and @rtaph if you have any questions for me just let me know.

Thanks everyone!

@sckott
Contributor

sckott commented Sep 5, 2017

👋 @rtaph @AliciaSchep - reviews due next monday

@AliciaSchep

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 3.5 hrs


Review Comments

Package was easy and fun to use! 🎉 Function documentation was thorough and informative. I liked the example plots and manipulations in the README, although those might be more appropriate for the vignette rather than README.

Examples

Examples in package documentation did not run successfully using devtools::run_examples(run = FALSE). It looked like there was some R output included in the documentation -- I think that should be commented out if included. I am guessing the examples were set to not run for purposes of CRAN, but it would be nice to be able to run them locally in an automated way.

Package man page

The package should ideally have a man page for the entire package, which might describe briefly what the package does. A good reference for how to make this in roxygen2 is at http://r-pkgs.had.co.nz/man.html#man-packages.
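
A minimal roxygen2 sketch of such a page (wording here is illustrative, not taken from the package):

#' comtradr: Interface with the United Nations Comtrade API
#'
#' Interact with the UN Comtrade database directly from R, with functions
#' for building API queries and importing the returned data.
#'
#' @docType package
#' @name comtradr
NULL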

Tests

The tests include many expectations each, especially for ct_search. This can make it harder to figure out exactly why the test is failing, because you get the message from the "test_that" call but then have to do some digging to figure out exactly which expectation(s) failed. My understanding is that it is better practice to have more test_that statements, so each one is reporting on a particular behavior. For example, your "ex1" and "ex2" related expectations should probably be in separate tests if those examples are trying to capture different use cases. Also, things like the following block:

  ## Check that ct_search is failing as expected.
  # Throw error with invalid input for param "reporters".
  expect_error(ct_search(reporters = "invalid_reporter",
                         partners = "Germany",
                         countrytable = countrydf,
                         tradedirection = "imports"))

should also probably be a separate test block.

The first time I ran the tests, all the expectations for "ex2" failed. Trying again multiple times, it seemed to work. I am not sure what went wrong the first time, so don't really have any suggestion for that one, just figured I'd mention it.

Continuous Integration

It would be good to include the Travis badge on the README, and to additionally test the package on macOS and Windows. For Mac, this can be specified in the .travis.yml file. For Windows, you can use AppVeyor.

Additional minor suggestions:

These suggestions could affect existing code using the package and are also pretty minor, so I can understand if you want to ignore them!

Name of country_lookup and commodity_lookup functions

Most of the package's function names start with "ct_", except the "country_lookup" and "commodity_lookup" functions. I think it would be nice to go fully with the convention of starting all the functions with the same prefix, and add "ct_" to those two functions as well.

Argument ordering for country_lookup function

For the country_lookup function, I think it would be nice for the order of the arguments to be re-arranged so that "type" is optional without having to write out the "lookuptable" argument. Currently, if you want to just accept the default for type, you need to do:

country_lookup(c("Belgium", "vietnam", "brazil"), lookuptable = countrydf)

Reordering the arguments so that type is last would allow you to do:

country_lookup(c("Belgium", "vietnam", "brazil"), countrydf)

Having the argument be named "countrytable" for consistency with ct_search might also be nice. Then commodity_lookup could have the argument be "commoditytable".

@ChrisMuir
Member Author

Hi Alicia,

Thanks so much for taking the time to review and for the great feedback! I will hold off on writing out an official reply to all review comments until @rtaph has replied with his review. In the meantime, I will start working on some of your suggestions, like creating man page, expanding CI, etc.

Just a quick note on the test_that tests being hit or miss....they've always been like that for me as well, and I honestly have no idea why. They seem to be fairly stable when running during the R CMD check, but I would guess that devtools::test() fails around half the time for me.

@rtaph

rtaph commented Sep 9, 2017

Package Review

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work.

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
    • Mostly there, but no documentation or examples for api_col_names()
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • Installation: Installation succeeds as documented.

  • Functionality: Any functional claims of the software have been confirmed.

  • Performance: Any performance claims of the software have been confirmed.

  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.

    • Unit tests exist, but a whole battery of ct_search tests fail when you run the full test suite. This is because the API limits the query rate in guest mode and the tests are all run quickly in succession. As a result, the tests all fail. When I run the tests one by one in the console, they pass. They also pass when I pad your test suite with Sys.sleep(1) in between calls to the API. More on this below.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

    • The package does not use the recommended snake_case convention and mixes naming styles (see below). The standard is not hard and fast but it would be nice if a single approach was adopted, perhaps working around the mixed naming conventions in the API.
    • the package does not provide citation details in the README, as per the guideline
    • the package does not yet contain top-level documentation (i.e. typing ?comtradr will not return a help page)
    • per guidelines, the package ought to add #' @noRd to internal functions such as api_col_names()

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.
    • TBD :)

Estimated hours spent reviewing: 8.5


Review Comments

Overall Approach

Let me start by saying that this is a great package and I enjoyed reviewing it. I wish more people in the field used R, and packages like yours make it a little bit easier to do so. I like that virtually all the functionality of the original API is available in the package, and your integration of regex into the package philosophy means that you provide a flexible approach to working with the original API from the U.N.

My thoughts below may be thought of as feature upgrades rather than bugs, but I am including them here because some would break the current approach you have adopted. As a result, you could consider implementing them (if you agree with them) before a major release. It's up to you as a package author.

At a high level, my main recommendation would be to simplify the package philosophy to make data access even easier and faster. I think the package could benefit from design changes that a) reduce the number of steps needed to get at the final data, and b) reduce the need to open up help files to remember how to do things or recall how things are named. I found that each time I loaded up this package, I would need to re-read the help pages and do more setup than I cared for in order to get the data I wanted. I think these difficulties can be addressed in a number of ways.

  • Returning a Data Frame Directly: From the httr vignette on writing good R API wrappers, they suggest that API packages should return a helpful object instead of parsed lists. The status code of the API is not the object of interest itself, and so the fact that the data frame is nested within the return object adds a hurdle that is a bit annoying. This is particularly true in the tidyverse approach, where you want concise, tabular inputs and outputs that work in a piped analytical workflow. Although switching to a data frame output would be a breaking change, IMO it is worth it since the end user will want to get at the final data frame.
  • Converting to Native R Messages and Warnings: Building on the previous point, I think it would be much nicer to convert the API's validation status codes into errors and warnings that are native to R. This is the recommendation provided by the httr vignette. You can then use examples to demonstrate how to approach error handling using the more universal tryCatch, purrr::safely(), and purrr::possibly(). This is a more robust approach because if the disposition of the output is buried in a list that may not be checked/asserted by the end user, you run the risk that your user will not fail fast.
  • Caching Reference Tables: The fact that one needs to define countrytable and commoditydf every time the function is called is an added step, and I found myself having to open up the help page each time to remember. It would be nice to have this saved as package data and have an updating function (e.g. ct_update()) that refreshes the data. I think this makes sense because everyone will always be defining these tables-- and since all your users will need to do this in order to use the package, it makes sense to build it directly into your package as a default. Also, the reference tables do not change much over time, nor do I see their content really change much as a function input. Seeing as they are much more static than dynamic, it makes sense to save them, cache them, and hide them from day-to-day use. For example, you could write a switch to only re-extract the table if the rawdata$headers$last-modified header is more recent than the cached copy. This can speed things up and reduce load on the service.
  • Throttling API requests: Since the API has user access limits, it would be great if the R package worked around those limits natively by delaying repeat API calls. One way to go about this would be to store the latest timestamp in a purpose-built package namespace (say, via ct_cache <- new.env()), and set switches within your functions to call Sys.sleep() if they are called too soon (see the sketch just below).
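
A rough sketch of that throttling gate (all names here are hypothetical, not comtradr's actual implementation):

# Hypothetical package-level cache environment, created at load time.
ct_cache <- new.env(parent = emptyenv())

# Gatekeeper: enforce a minimum gap (in seconds) between API calls.
ct_wait_if_needed <- function(min_gap = 1) {
  last <- ct_cache$last_query
  if (!is.null(last)) {
    elapsed <- as.numeric(Sys.time()) - as.numeric(last)
    if (elapsed < min_gap) Sys.sleep(min_gap - elapsed)
  }
  ct_cache$last_query <- Sys.time()
}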

Package Help Page

As mentioned above, it would be useful to add a ?comtradr and ?comtradr-package help page. This was my starting point into your package and I couldn't immediately find documentation.

Form and Style

Your code was easy to follow and well annotated (thanks!). Some small thoughts:

  • I like the ct_*** convention for the naming of the family of functions-- it helps keep things consistent and is useful for remembering function names when you have autocompletion in an IDE like RStudio. I would recommend all the package functions follow this convention, including commodity_lookup() and country_lookup()
  • I would avoid recycling the package name in example assignments (comtrade <-).
  • There are different conventions used between names of variables, arguments, and argument parameters. These include lowercase (e.g. "maxrec"), UPPERCASE (e.g. "TOTAL"), snake_case (e.g. "return_code"), hyphen-case (e.g. "re-imports"), Title Case (e.g. "Partner Code"), and Sentence case (e.g. "Unit code"). I realize some of these inconsistencies come from the API itself (and are passed to the API), but it would be nice if the package had the flexibility to accept a single style through and through, as this would remove the burden from the user to need to know (or look up) the style. Having to always look up the available choices in the help page is a bit of a drag.

Documentation

The documentation is generally good with plenty of examples. A few topics I think could be added to the documentation include:

  • OAuth Token: documenting the use of an authorization token.
  • Regex: It would be great to state in the documentation that the functions use regular expressions. Also, not all your users may know regex, so pointing them to some help pages on the topic could be great (maybe add to the see-also section of the documentation).

Modularization

The modularization of this package is generally great. The one thing I would split out is the part of the function that constructs the API call from the part that actually calls it. It would be nice to be able to generate the API url for troubleshooting purposes, even if this is not exported, and also save the API url along with the result (especially if caching).

User Identification

It is good web etiquette to identify oneself when accessing APIs and scraping the web. This can be done via the user agent variable in httr and would be good for people using guest mode.

Along the lines of Munzert et al.'s Automated Data Collection with R, I would set the user agent to something generic like:

paste(Sys.info()[["user"]], R.version$version.string, version$platform, sep = ", ")
#> [1] "rtaph, R version 3.4.1 (2017-06-30), x86_64-apple-darwin15.6.0"
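
In httr, this string can be attached to requests with user_agent(); a sketch (endpoint shown for illustration):

ua <- httr::user_agent(paste(Sys.info()[["user"]],
                             R.version$version.string,
                             version$platform, sep = ", "))
res <- httr::GET("https://comtrade.un.org/api/get", ua)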

Data Format

Is there any utility in allowing the user to specify fmt? I recommend removing the flexibility and only letting the API receive JSON, since validation information is only returned for JSON data. I am nervous about differential behaviour, and with CSV important metadata is lost. If you do choose to keep this parameter, it would be good to write a whole suite of tests that check that CSV and JSON variants behave the same in every way (especially w.r.t. warnings and errors).

API version check

From what I can tell, there is no call to check the API version (is that right?). If there is, it would be nice to assert the version in the return data so that if the API changes, the user gets a warning about using an older version of comtradr.

Names

This is really a matter of preference, but I would simplify by removing the option of "human readable" names in the formals, instead providing a dictionary and examples of how to do this outside of the function. This lets the user go back and forth with renaming columns once the data frame is returned.

Another thought is that the "country name" header has a space, and some people may not know how to use backticks to refer to names like this. Again, maybe think about standardizing the naming convention for data frame headers, argument names, and argument parameters.

Status 200 Check

Why not have an asserter to ensure that you receive Status: 200? Maybe this is not needed since you check the validation codes for JSON, but still, it is easy to implement via rawdata$status_code == 200L and an added layer of safety.
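
httr also ships a ready-made asserter along these lines, which converts any non-success status into a native R error:

resp <- httr::GET(url)        # url stands in for the constructed API request
httr::stop_for_status(resp)   # errors unless the request succeeded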

Spelling and Grammar

Tiny mistakes:

  • spelling of "ajust" in comtradr-vignette.Rmd, line 98.
  • spelling of "tomatos" in comtradr-vignette.Rmd, lines 90 and 94

Unit Tests

Great test coverage! These are especially important for API wrappers since the upstream protocols can change. A few ideas on tests:

  • Why not create separate testthat messages for each expectation, instead of combining them into one big test? "search return values are correct, and fail when expected" is a bit general. This would help debugging and make test failure messages more specific.
  • Is there any reason that skip_on_cran() is commented out? This seems to be the sort of package that this function is useful for.

Package Data

Why not save api_col_names as internal package data instead of a function? Also, why not save it as a named list (with machine-readable variants) in order to allow the end user to have a dictionary to go back and forth between those names?

Handling of Metadata

It would be great if the following metadata was stored somewhere along the way, including:

  • the API timestamp: When was the data called?
  • the API URL: What is the address/definition of the call?
  • system.time() timings: How long did the operation take to complete?

I think the best would be to store it as an attribute of the return object, or perhaps as a column in the data frame. You have to be a bit careful here though, because some functions outside the tidyverse drop attributes in their return objects. Alternatively, metadata can be stored in a special-purpose namespace for internal reference; however, this is likely to be less accessible to end users.
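
A sketch of the attribute approach (parse_response() and url are placeholders, not comtradr functions):

timing <- system.time(resp <- httr::GET(url))  # url: the constructed API call
df <- parse_response(resp)                     # placeholder for the parsing step
attr(df, "url") <- url
attr(df, "time_stamp") <- Sys.time()
attr(df, "req_duration_sec") <- unname(timing["elapsed"])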

Other

  • There is no date field in DESCRIPTION file. It is not strictly necessary, but it means that running citation("comtradr") will not return a proper citation.

  • I suggest setting the order of your function's formals in order of importance (or frequency of use). For example, I don't think the api URL is a common call, so moving it down would help for those who like to use positional matching.

  • Why not let the verbose argument in your functions draw from the existing setting in the options? I.e. verbose = getOption("verbose", TRUE)

  • The country table is not denormalized (the same code appears twice if a country is both a reporter and a partner). This is nitpicking, but I think a nicer approach would be a single row for each country + code, with logical columns that define if they are a reporter and if they are a partner.

  • Since I do not think you are using the dots ... argument for anything, why not add them in to allow users to pass additional parameters to the regex functions (for example, to modify the ignore.case or perl arguments)?

Questions

  • Will fromJSON() with simplifyDataFrame = TRUE always work? Is there a way to test for non-tabular data, or catch errors? What if there is hierarchical nesting?
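
For reference, jsonlite simplifies flat, consistent records to a data frame, while hierarchical nesting comes back as nested data frames / list-columns rather than failing outright:

jsonlite::fromJSON('[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]')
#>   a b
#> 1 1 x
#> 2 2 y

str(jsonlite::fromJSON('[{"a": 1, "b": {"c": 2}}]'))  # nested data frame column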

@ChrisMuir
Member Author

@AliciaSchep and @rtaph,

Thank you both so much for the thorough reviews! I've already begun working on suggestions made by @AliciaSchep, I will now start to address all suggestions made by @rtaph.

I'll compile a reply that addresses each point made within the reviews. This may take some time, but I hope to have a reply to you both in 1-2 weeks.

Also, while I'm working on package updates, I have two additional changes that I came up with recently, but have held off as I didn't want to make changes during the review process. Let me know your thoughts on these:

  • Replace dplyr import with magrittr
    The main use of dplyr within my pkg is the pipe (%>%); I think it makes sense to avoid importing dplyr if possible, due to its size.

  • Import purrr
    I've had issues in the past with type safety in function commodity_lookup (see issues #2 and #3). I'm currently using sapply followed by a series of transformations designed to account for all possible data type returns from sapply, given the input parameters... the function is a little convoluted and inelegant. I've rewritten commodity_lookup to use purrr::map in place of sapply, to streamline the function and make it less prone to type safety issues. The only drawback is I'm not sure if this is enough to warrant adding purrr to imports, although @rtaph mentioned adding purrr::safely and purrr::possibly, plus I can also use purrr::map_chr to replace some calls to vapply in other functions, so it might be worth it (small illustration below).
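
For instance, assuming a character-returning lookup, the type-stable purrr variants behave like this:

library(purrr)

map(c("tomato", "potato"), ~ paste0(.x, "es"))      # always a list
map_chr(c("tomato", "potato"), ~ paste0(.x, "es"))  # always character, or an error
#> [1] "tomatoes" "potatoes"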

Thanks again!

@sckott
Contributor

sckott commented Sep 9, 2017

thanks for your reviews @AliciaSchep and @rtaph !

plan sounds good @ChrisMuir

@rtaph

rtaph commented Sep 9, 2017

Hi @ChrisMuir,

Happy to help. If you are only relying on the pipe, then I think your suggestion to import magrittr makes sense. Even tidyverse packages such as purrr have done this. Note that purrr requires magrittr (>= 1.5); there may have been some major changes in that version.

As for the second point-- well, everyone has their opinion, and you could do it in base R if you stay away from sapply(). I would say that type safety and clarity are a worthwhile tradeoff, especially as your package gains in popularity and use in the open community. purrr is a fairly mature package to rely on, so I think it's worth it.

@ChrisMuir
Member Author

Got it, thanks for sharing your thoughts on those. Definitely leaning towards implementing them both. I'll look into magrittr version history, to see if I need to specify a minimum version.

Thanks!

@ChrisMuir
Member Author

@AliciaSchep and @rtaph,

So just to update, I've been working on implementing changes and edits to comtradr based on your reviews. I wanted to outline the changes made so far.

Before getting into it, I just wanted to say I've learned A LOT during this review and while working on refining the package. Thank you both again for taking the time.

The tl;dr version of the changes is here.

Overview of Major Changes

  • Eliminated functions ct_commodities_table, ct_countries_table, ct_json_data, ct_csv_data.
  • Added new exported functions ct_commodity_db_type, ct_update_databases, ct_get_remaining_hourly_queries, ct_get_reset_time, ct_register_token.
  • A package environment (ct_env) is created upon load, which is used to handle variables related to API rate limits as well as package data.
  • Saved the country table DB and commodity table DB as pkg data files to directory inst/extdata; both are read into ct_env upon pkg load, and accessed by package functions whenever needed.
  • Created a framework for throttling API requests per the API limits. For now this only applies to function ct_search.
  • Added a function that gives users the ability to set a valid API key/token via global options.

Replies to review comments

I liked the example plots and manipulations in the README, although those might be more appropriate for the vignette rather than README.

This makes good sense; the two examples using ggplot2 have been moved to the vignette.

Examples in package documentation did not run successfully using devtools::run_examples(run = FALSE). It looked like there was some R output included in the documentation -- I think that should be commented out if included. I am guessing the examples were set to not run for purposes of CRAN, but it would be nice to be able to run them locally in an automated way.

Yeah this was originally done for CRAN submission, but I've removed the dontrun tags from all functions that do not require an internet connection.

The package should ideally have a man page for the entire package, which might describe briefly what the package does.

This has been added.

The tests include many expectations each, especially for ct_search. This can make it harder to figure out exactly why the test is failing, because you get the message from the "test_that" call but then have to do some digging to figure out exactly which expectation(s) failed.

I split up as many of the tests as I could, so they're now more manageable.

The first time I ran the tests, all the expectations for "ex2" failed. Trying again multiple times, it seemed to work. I am not sure what went wrong the first time, so don't really have any suggestion for that one, just figured I'd mention it.

@rtaph 's assessment was correct, the ct_search tests were failing so often because they were violating the per second limit of the API (I was really surprised by this!). The package now incorporates rate limiting, so the tests are much more stable.

It would be good to include the Travis badge on the README, and to additionally test the package on macOS and Windows. For Mac, this can be specified in the .travis.yml file. For Windows, you can use AppVeyor.

I've expanded Travis CI to now build on Linux and Mac, and have added AppVeyor for Windows CI (plus all the badges!).

Most of the package's function names start with "ct_", except the "country_lookup" and "commodity_lookup" functions. I think it would be nice to go fully with the convention of starting all the functions with the same prefix, and add "ct_" to those two functions as well.

This was a great idea, all exported functions now have the ct_ prefix.

For the country_lookup function, I think it would be nice for the order of the arguments to be re-arranged so that "type" is optional without having to write out the "lookuptable" argument.

Function ct_country_lookup no longer has the arg lookuptable, but this suggestion was mentioned by both reviewers, so I took care to consider the arg order for all pkg functions and made edits to the order within ct_search.

Unit tests exist, but a whole battery of ct_search tests fail when you run the full test suite. This is because the API limits the query rate in guest mode and the tests are all run quickly in succession. As a result, the tests all fail. When I run the tests one by one in the console, they pass. They also pass when I pad your test suite with Sys.sleep(1) in between calls to the API. More on this below.

This has been addressed.

The package does not use the recommended snake_case convention and mixes naming styles (see below). The standard is not hard and fast but it would be nice if a single approach was adopted, perhaps working around the mixed naming conventions in the API.

This has been fixed, snake_case is used throughout the package.

the package does not provide citation details in the README, as per the guideline

I had a question about this and adding a date to DESCRIPTION. Per this comment from @sckott, adding date to DESCRIPTION is not necessary? Just wanted to get clarity on this.

the package does not yet contain top-level documentation (i.e. typing ?comtradr will not return a help page)

This has been added.

per guidelines, the package ought to add #' @noRd to internal functions such as api_col_names()

Done, I've added #' @noRd to all internal functions.

Returning a Data Frame Directly

I thought this was great; I had always felt kinda weird about returning a list. I made this change.

Converting to Native R Messages and Warnings

This change has been made. I'm checking the response from httr::GET to make sure the status code is 200, checking that the response type is json, and if no data was returned I'm checking to see if a helpful error message was provided by the API. Errors are thrown, as opposed to wrapping the info up as part of the return list object.

Caching Reference Tables

This has been implemented. The package now features the country reference table and commodity reference table saved as .rda files to inst/extdata. They are both read in when the pkg loads, and the functions utilize them whenever needed. The user is no longer required to download them to begin making API requests. Users can check for reference table updates with function ct_update_databases, and this function also gives the user the option of replacing the commodity table on file with any of the other types of commodity tables available ("HS", "HS1992", "SITC", etc.).

Throttling API requests

This is done. Upon package load, three variables are created in pkg env ct_env, they're basically "time of last query", "number of available queries left this hour", and "time that the current hour chunk resets". These variables allow the package to keep API requests under the limits of 1/second and 100/hour.

I would avoid recycling the package name in example assignments (comtrade <-)

I fixed this.

The documentation is generally good with plenty of examples. A few topics I think could be added to the documentation include:
OAuth Token: documenting the use of an authorization token.
Regex: It would be great to state in the documentation that the functions use regular expressions. Also, not all your users may know regex, so pointing them to some help pages on the topic could be great (maybe add to the see-also section of the documentation).

I've added a section in the vignette on rate limits and using function ct_register_token.
For functions ct_country_lookup and ct_commodity_lookup, I give a brief explanation of the use of regex in the functions, provide a link to my favorite regex tutorial, and added grepl to the "seealso" section.

The one thing I would split out is the part of the function that constructs the API call from the part that actually calls it. It would be nice to be able to generate the API url for troubleshooting purposes, even if this is not exported, and also save the API url along with the result (especially if caching).

I have the tasks of creating the API url and actually making the request split into two separate functions, but they aren't as modular as what you're getting at here. I did add to ct_search that the API url is now attached to the return data frame as an attribute.

It is good web etiquette to identify oneself when accessing APIs and scraping the web. This can be done via the user agent variable in httr and would be good for people using guest mode.

This is done, using the method that was suggested:

paste(Sys.info()[["user"]], R.version$version.string, version$platform, sep = ", ")

Is there any utility in allowing the user to specify fmt? I recommend removing the flexibility and only letting the API receive JSON

Nope, there was no upside to using the "csv" option previously. I've removed it, and thus removed arg fmt from ct_search.

From what I can tell, there is no call to check the API version (is that right?). If there is, it would be nice to assert the version in the return data so that if the API changes, the user gets a warning about using an older version of comtradr.

Is this suggestion in regard to the API version or the version of the comtradr package? I don't believe the API itself features versioning.

This is really a matter of preference, but I would simplify by removing the option of "human readable" names in the formals, instead providing a dictionary and examples of how to do this outside of the function. This lets the user go back and forth with renaming columns once the data frame is returned.
Another thought is that the "country name" header has a space, and some people may not know how to use backticks to refer to names like this. Again, maybe think about standardizing the naming convention for data frame headers, argument names, and argument parameters.

I edited all col names within api_col_names to be strictly snake_case. I'm not sure what you mean by providing a dictionary... like provide a vector of names as a package data object? For now I've left the api_col_names function in, and left in the arg col_name within ct_search.

I also edited the column names approach slightly; the options are now "desc" and "comtrade". "desc" will return headers that are descriptive and easy to interpret. "comtrade" will return the col headers that are used by the UN Comtrade API. Both options are machine-readable. Default value is "desc" within ct_search.

Why not have an asserter to ensure that you receive Status: 200? Maybe this is not needed since you check the validation codes for JSON, but still, it is easy to implement via rawdata$status_code == 200L and an added layer of safety.

I'm checking for status code 200 in function execute_api_request (error is thrown if not 200).

spelling of "ajust" in comtradr-vignette.Rmd, line 98.
spelling of "tomatos" in comtradr-vignette.Rmd, lines 90 and 94

These are fixed.

Is there any reason that skip_on_cran() is commented out? This seems to be the sort of package that this function is useful for.

Yeah at some point after getting on CRAN I was having issues with the tests being skipped on Travis, and commenting out skip_on_cran seemed to fix the issue. I've added skip_on_cran back to the functions that require an internet connection.

Why not save api_col_names as internal package data instead of a function? Also, why not save it as a named list (with machine-readable variants) in order to allow the end user to have a dictionary to go back and forth between those names?

Again, I'm not sure I understand. If I include the col names as package data, I assume I would also export a function for accessing and using it?

It would be great if the following metadata was stored somewhere along the way, including:
the API timestamp: When was the data called?
the API URL: What is the address/definition of the call?
system.time() timings: How long did the operation take to complete?

This has been done. I've added these three items as attributes to the return data frame from ct_search.

There is no date field in DESCRIPTION file. It is not strictly necessary, but it means that running citation("comtradr") will not return a proper citation.

See my earlier comment on adding date to DESCRIPTION.

I suggest setting the order of your function's formals in order of importance (or frequency of use). For example, I don't think the api URL is a common call, so moving it down would help for those who like to use positional matching.

I reordered the function args in ct_search to address this.

Why not let the verbose argument in your functions draw from the existing setting in the options? I.e. verbose = getOption("verbose", TRUE)

I wasn't sure about this, and have left it as is for now. I think I prefer the way it's set up now, but that's just me (if I knew that most users disagreed, I would certainly be inclined to change it). Is this typically handled using options in the major R packages?

The country table is not denormalized (the same code appears twice if a country is both a reporter and a partner). This is nitpicking, but I think a nicer approach would be a single row for each country + code, with logical columns that define if they are a reporter and if they are a partner.

I initially misread this comment and am just now realizing what you're getting at, so I haven't done anything to address this. Just to be clear, and using Germany as an example, you're saying go from this:

> country_df[country_df$`country name` == "Germany", ]
    code country name     type
101  276      Germany reporter
373  276      Germany  partner

to this:

> country_df[country_df$`country name` == "Germany", ]
  code country name reporter partner
1  276      Germany     TRUE    TRUE

Since I do not think you are using the dots ... argument for anything, why not add them in to allow users to pass additional parameters to the regex functions (for example, to modify the ignore.case or perl arguments)?

I haven't implemented this. I like the idea, but for both of the lookup functions, I would like for the default setting for ignore.case within grepl to be TRUE. I could add ignore.case to the input args for both functions along with adding the dots, so:

ct_commodity_lookup(search_terms, return_code = FALSE, return_char = FALSE, verbose = TRUE, ignore.case = TRUE, ...)

That seems weird to me though, to say "... is for additional args that will be passed along to grepl, except for ignore.case, as that's defined explicitly". Is there a better way to go about this?
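
For what it's worth, the explicit-default-plus-dots pattern reads fine in practice; a hypothetical stripped-down version:

# Hypothetical helper: ignore.case keeps a package-level default, while any
# other grepl() arguments (perl, fixed, ...) pass through untouched.
lookup_matches <- function(pattern, table, ignore.case = TRUE, ...) {
  table[grepl(pattern, table, ignore.case = ignore.case, ...)]
}

lookup_matches("tomato", c("Tomatoes, fresh", "Potatoes"), perl = TRUE)
#> [1] "Tomatoes, fresh"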

Questions
Will fromJSON() with simplifyDataFrame = TRUE always work? Is there a way to test for non-tabular data, or catch errors? What if there is hierarchical nesting?

In all of the testing I've done, it's never been an issue. In the cases in which there is legitimately no data to return based on the input values (and thus no error will be thrown), within internal function execute_api_request an empty data frame is created and returned with the specified column names.

Edits I'm planning to make

  • Add section to vignette explaining function ct_update_databases.

@rtaph

rtaph commented Sep 24, 2017

Props @ChrisMuir on implementing so many major changes so quickly! I installed ropensci/comtradr@724d723 and took your package for a spin. Everything works nicely!

Based on your comments, I have some follow-on remarks:

Is this suggestion in regard to the API version or the version of the comtradr package? I don't believe the API itself features versioning.

Yes, my point was related to the API version (not the comtradr package version). I couldn't find any documentation from the original API page and thought you might know. It would have been nice to be able to query the API to get the API version, but if the developers never included this, it isn't implementable.

That seems weird to me though, to say "... is for additional args that will be passed along to grepl, except for ignore.case, as that's defined explicitly". Is there a better way to go about this?

Personally I like your choice of setting ignore.case = TRUE, since I anticipate almost all use cases to be case-insensitive. Your suggested language sounds fine to me. However, I would add roxygen documentation for the ignore.case argument, simply saying "argument passed to grepl()" and hyperlinking to the grepl man page.

I'm not sure what you mean by providing a dictionary.....like provide a vector of names as a package data object? For now I've left the api_col_names function in, and left in the arg col_name within ct_search.

Yes, I am thinking of Python with my use of 'dictionary'-- namely, a list (or vector) of uniquely named key-value pairs. My point, really, was two-fold:

  1. Saving the information as package sysdata seemed more conventional than writing a static function to do the same thing. Furthermore, defining a named list/vector is visually more maintainable compared to creating two unnamed vectors that need to be matched positionally. It makes the binary relation explicit.
  2. Exposing the key-value pair(s) as exported data is useful to the analyst, who will likely need to make conversions between names.

To illustrate with a 2-entry example:

# API dictionary, for internal use since the user does not need to interact
#  with the original API naming
h1 <- list(classification = "pfCode", 
           trade_flow     = "rgDesc")
devtools::use_data(h1, internal = TRUE)

# Human-readable dictionary, exported
h2 <- list(`Classification` = "classification",
           `Trade Flow`     = "trade_flow")
h2 <- c(`Classification` = "classification", # or as a vector
        `Trade Flow`     = "trade_flow")
devtools::use_data(h2)

I guess I liked that you previously had human-readable variants of names, but thought the implementation would be stronger if this mapping was exposed to the user to use as they saw fit. A dictionary would let users do something like dplyr::rename_(ex_1, .dots = h2) and use these 'polished' names for plotting, publication tables, etc. For me as a user, I would likely want to use both the snake_case and human-readable variants for different purposes. This should be achievable as a post-query operation.

Additionally, setting col_name = "comtrade" in ct_search returns the API naming convention, which I think is much less useful than the human readable names you had previously (Hence my suggestion of having two dictionaries).

Will fromJSON() with simplifyDataFrame = TRUE always work? ... In all of the testing I've done, it's never been an issue

Okay great! It's just that I have worked with (scraped) HTML and XML data in the past where data frame simplification arguments failed. You know the API well, so if it's not a problem you have encountered, then it's a non-issue!

verbose = getOption("verbose", TRUE) ... I wasn't sure about this, and have left it as is for now.

Ok!

I initially misread this comment and am just now realizing what you're getting at, so I haven't done anything to address this. Just to be clear, and using Germany as an example, you're saying go from this ... to this

Yes, exactly. It's a small change, but I think makes it easier for a new user to grasp the data structure and filter with logicals.

the package does not provide citation details in the README, as per the guideline

Interesting about the date field... I did not know about the CRAN behaviour! I defer to @sckott on adding the date field to the DESCRIPTION file. But if I understand correctly, this decision might be separate from adding citation guidance in the README.

A few last suggestions:

  • I recommend changing 2nd_partner to second_partner since valid R names cannot start with a digit. Same with 2nd_partner_code and 2nd_partner_iso.
  • A number of fields return empty in JSON data which I believe would be better represented by NA instead of empty character strings. What do you think?
  • What is the purpose of the period_desc variable from the API? Since it is identical to period, can it be removed? I ask because it's redundant, and I don't like the idea of a column name describing the ordering of the data (since this can change).

@AliciaSchep

Great work @ChrisMuir! You addressed all my comments in the review, and tests & examples now run without issue on my computer. I like all the changes you made to the function names, arguments, output format, etc.

@ChrisMuir
Member Author

Hi @AliciaSchep and @rtaph ,

Thanks for the latest comments, I have a few more edits to make. I'm going out of town Wednesday thru Monday, so I may not finish up until next week. I will reply back when I have the next set of edits done, or if I have any further questions, though I think I have everything I need.

Thanks so much!

@ChrisMuir
Member Author

Hi @rtaph and @AliciaSchep ,

I've finished up the second round of edits and changes to the package.

Personally I like your choice of setting ignore.case = TRUE, since I anticipate almost all use cases to be case-insensitive. Your suggested language sounds fine to me. However, I would add roxygen documentation for the ignore.case argument, simply saying "argument passed to grepl()" and hyperlinking to the grepl man page.

I've added ignore.case and ... as function args to both of the lookup functions, with ignore.case default value set to TRUE.

Yes, I am thinking of Python with my use of 'dictionary'-- namely, a list (or vector) of uniquely named key-value pairs ... I guess I liked that you previously had human-readable variants of names, but thought the implementation would be stronger if this mapping was exposed to the user to use as they saw fit.

I added internal pkg data to apply snake_case col headers to API data as it's returned from ct_search, and added user-accessible pkg data similar to what you suggested... a named vector of "polished" col headers, as key-value pairs. The polished col headers can be accessed with data("ct_pretty_cols"), and can be applied using dplyr::rename_(df, .dots = ct_pretty_cols). I've also added an exported function for applying the polished cols to an API data frame, df <- ct_use_pretty_cols(df).

I've removed function api_col_names, and arg col_name within function ct_search.

I initially misread this comment and am just now realizing what you're getting at, so I haven't done anything to address this. Just to be clear, and using Germany as an example, you're saying go from this:

> country_df[country_df$`country name` == "Germany", ]
    code country name     type
101  276      Germany reporter
373  276      Germany  partner

to this:

> country_df[country_df$`country name` == "Germany", ]
  code country name reporter partner
1  276      Germany     TRUE    TRUE

This change has been made; the country reference table has been simplified and now has dimensions 294 x 4.

Interesting about the date field... I did not know about the CRAN behaviour! I defer to @sckott on adding the date field to the DESCRIPTION, file. But if I understand correctly, this decision might be separate from adding citation guidance in the README.

I've added this to the README:
For information on citation of this package, use citation("comtradr")
Is this typical and/or sufficient?

I recommend changing 2nd_partner to second_partner since valid R names cannot start with a digit. Same with 2nd_partner_code and 2nd_partner_iso.

This has been changed.

A number of fields return empty in JSON data which I believe would be better represented by NA instead of empty character strings. What do you think?

This was a great suggestion; I added a step within ct_search, prior to returning the data, that replaces all empty strings with NA.
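
In base R that replacement can be a single vectorized assignment (assuming character columns):

df[df == ""] <- NA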

What is the purpose of the period_desc variable from the API? Since it is identical to period, can it be removed? I ask because it's redundant, and I don't like the idea of a column name describing the ordering of the data (since this can change).

Yeah, I've never seen "period_desc" return different values than variable "period", and I'm not sure exactly what the difference is between the two variables. However, I'm hesitant to start cutting variables from the API data, so I've kept it for now.

In addition to these edits, I've updated the package vignette to go over the use of package data, and the functions related to the package data.

Overall, I feel great about the status of the package (especially compared to pre-review!); if there are any additional comments/thoughts/changes, I'm happy to hear them.

Thanks!

@rtaph

rtaph commented Oct 5, 2017

Let me start by saying great work, @ChrisMuir ! It's easy for us reviewers to pick apart a package, much harder to implement changes. I am recommending comtradr for approval. 🎉

If @sckott has any citation guidance, perhaps your README should change, but to me your proposed approach sounds reasonable.

Last suggestion (I hope): Removing 'inst/doc' from your .gitignore so that your vignettes are included with your package. At the moment they don't install via devtools::install_github() unless the user explicitly sets build_vignettes = TRUE. You've got such a nice vignette and you want people to find it with browseVignettes("comtradr")! It would also be great to link to the vignette in your comtradr man page, as that may be the starting point for users.

@ChrisMuir
Member Author

I am recommending comtradr for approval.

👍 👍 👍 Thanks!

Last suggestion (I hope): Removing 'inst/doc' from your .gitignore so that your vignettes are included with your package. At the moment they don't install via devtools::install_github() unless the user explicitly sets build_vignettes = TRUE. You've got such a nice vignette and you want people to find it with browseVignettes("comtradr")! It would also be great to link to the vignette in your comtradr man page, as that may be the starting point for users.

Oh these are both great ideas, I will look into implementing them after work today.

Thanks!

@ChrisMuir
Member Author

So I've been working on the two vignette changes, but I'm having trouble getting the HTML vignette included with the package. I've tried a few different things, using this page and this page as references, but I keep running into the "rdb is corrupt" issue described here and discussed here.

I'll keep working at it.

@ChrisMuir
Member Author

I've updated the NEWS file to reflect all of the package changes. I'm close to a point in which I'm ready to submit the updated package to CRAN.... @sckott should I wait on that until this review process is finished?

@sckott
Contributor

sckott commented Oct 10, 2017

Sorry for the delay, was at a conference last week. Nice work on all the changes. We don't have any policies specifically about the relationship between review and CRAN submissions. Some pkgs submitted are already on CRAN even. It does kinda make sense to wait to submit if you think the review will be done soon :)

So about DATE in citation: date is filled in at the package build step, so AFAIK it's best practice to not fill that in yourself, because you may forget to update it. That does mean that if a user does e.g. devtools::install_github then citation(), they'll not get a date, but if they install from CRAN they will (works for binary and source pkgs).

About the vignette: i'll check it out tmrw

@sckott
Contributor

sckott commented Oct 11, 2017

for the vignette issue: Once it's on CRAN, the vignette should be installed by default, so it won't be an issue unless users are installing the dev version.

linking to a vignette from a manual page is a constant problem, but you can put a link to the online (github/CRAN) version on the man page 😄

Another option is to use pkgdown, but that's in addition to the vignette.

@ChrisMuir
Member Author

Hi Scott,

Thanks for the input. The package is currently on CRAN, this would just be submitting the updated version with all of the pkg review updates. I'll hold off on submission though, as the review is (I believe) close to being done.

I worked more on the vignette issue tonight; I think it's now fixed and all is working properly... the vignette is now available after running devtools::install_github, and the package man page now links to the vignette.html file. I think the only issue was needing to run devtools::build_vignettes prior to pushing the edits to GitHub.

So I think all package edits have either been implemented or addressed 😄

@sckott
Contributor

sckott commented Oct 18, 2017

Okay, on CRAN already, got it.

Great, will have one last quick look

@sckott
Contributor

sckott commented Oct 18, 2017

Approved!

  • Please transfer the package to ropensci - I've invited you to a team within our ropensci organization. After you accept, you should be able to move it.
  • Make sure to
    • add rOpenSci footer to README
      [![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)
    • Change any needed links, such as those for CI badges
    • Travis CI should update to the new location automatically - you may have to update other CI systems manually
  • Are you interested in doing a blog post for our blog https://ropensci.org/blog/ ? either a short-form intro to it (https://ropensci.org/tech-notes/) or long-form post with more narrative about its development (https://ropensci.org/blog/). If so, i'll get you the details on that

@ChrisMuir
Member Author

Great to hear! Thanks @sckott .

I've transferred the package to ropensci, added the footer banner to the README, and updated all links in the README to point to the ropensci location of the repo.

On the blog post, I'd be interested, but my wife and I are moving to Nashville, TN in two weeks, and will be super busy for the next 3-4 weeks. So it would have to wait until after the move.

One question, should I add the "peer reviewed" badge to the repo?

@ChrisMuir
Member Author

Also, just want to say again to @rtaph and @AliciaSchep, I truly appreciate the help and the time that was put in. I think I learned more about building packages in the last month than I had learned in the previous two years 😄 , and the comtradr package is much better off for it.

Thanks so much!!

@sckott
Contributor

sckott commented Oct 19, 2017

Great, thanks @ChrisMuir

We'll get in touch with you after the next month or so about the blog post.

yes! do add the peer review badge. like

[![](https://badges.ropensci.org/141_status.svg)](https://github.com/ropensci/onboarding/issues/141)

@sckott sckott closed this as completed Oct 19, 2017
@stefaniebutland
Member

Hi @ChrisMuir. I'm rOpenSci's Community Manager. Following up to ask if you're ready to write a blog post about comtradr. I see above that you've just moved, so you might still be very busy. If you're still interested - and I hope you are - we could choose a date in the new year. I also have Tues Dec 19th available.

For blog post we ask for a draft via pull request one week prior to the publication date. You can find technical and editorial help here: https://github.com/ropensci/roweb2#contributing-a-blog-post.

Perhaps we can continue this discussion via DM on rOpenSci Slack, when you have time.

@ChrisMuir
Member Author

Hi @stefaniebutland, good to hear from you! Yes, I'm still interested in doing a blog post, but I'm thinking sometime in the new year might be best... My wife and I are finally getting to a point where we're settled after our move; we were out of town for Thanksgiving and will be visiting family for a week over Christmas... busy busy :)

I have a couple questions about the process, I will send you a DM on Slack. Thanks so much!

@sckott sckott mentioned this issue Jan 5, 2018
@ChrisMuir
Member Author

ChrisMuir commented Mar 22, 2018

Hi @AliciaSchep and @rtaph, hope you're both doing well! I just saw this rOpenSci blog post about adding reviewers to a pkg... I've been thinking comtradr is due for a new release, and I'd like to add you both as reviewers to the DESCRIPTION file prior to sending to CRAN.

So I wanted to see if this was okay with you both. If so, let me know if you also have an orcid.org ID that you'd like me to include (like in this example).

Thanks!

@rtaph

rtaph commented Mar 22, 2018

Hi Chris!

Sure, I appreciate the consideration! My orcid is 0000-0002-3092-3493.

Cheers!

@AliciaSchep

Hi Chris, Happy to be included -- my ORCID is 0000-0002-3915-0618. Thank you!

@ChrisMuir
Member Author

Awesome, this is great. Thanks to you both!

@ChrisMuir
Member Author

Hi @rtaph , for your name in the DESCRIPTION file, should I use person("Rafael", "P.H.", role = "rev", etc...), or would you prefer I use your full last name?

@rtaph

rtaph commented Mar 22, 2018 via email
