ccafs - Client for CCAFS General Circulation Models Data #82

Closed
sckott opened this Issue Oct 27, 2016 · 37 comments

Comments

@sckott
Member

sckott commented Oct 27, 2016

Summary

  • What does this package do? (explain in 50 words or less):

An interface to Climate Change, Agriculture, and Food Security (CCAFS) General Circulation Models (GCM) data. The data are stored in Amazon S3, from which we provide functions to fetch them.

  • Paste the full DESCRIPTION file inside a code block below:
Package: ccafs
Type: Package
Title: Client for 'CCAFS' 'GCM' Data
Description: Client for Climate Change, Agriculture, and Food Security ('CCAFS')
    General Circulation Models ('GCM') data. Data is stored in Amazon 'S3', from
    which we provide functions to fetch data.
Version: 0.0.8.9100
Authors@R: person("Scott", "Chamberlain", role = c("aut", "cre"),
    email = "myrmecocystus@gmail.com")
License: MIT + file LICENSE
URL: https://github.com/ropenscilabs/ccafs
BugReports: https://github.com/ropenscilabs/ccafs/issues
Imports:
    rappdirs (>= 0.3.1),
    httr (>= 1.2.0),
    raster (>= 2.5-8),
    tibble (>= 1.2),
    xml2 (>= 1.0.0),
    data.table (>= 1.9.6)
Suggests:
    roxygen2 (>= 5.0.1),
    testthat,
    covr,
    knitr
VignetteBuilder: knitr
RoxygenNote: 5.0.1

  • URL for the package (the development repository, not a stylized html page):

https://github.com/ropenscilabs/ccafs

  • Who is the target audience?

Scientists, particularly those studying climate change, forecasting crop production, etc.

  • Are there other R packages that accomplish the same thing? If so, what is different about yours?

Noppers.

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration with Travis CI and/or another service.

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package contains a paper.md with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:
  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
  • If this is a resubmission following rejection, please explain the change in circumstances:
  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
@noamross

Collaborator

noamross commented Nov 5, 2016

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Currently seeking reviewers.

I find the README and documentation a little sparse. I try to follow the rule-of-thumb that a user may encounter the data and APIs we wrap for the first time through the package help or README. Package-level documentation should at least briefly describe the data source and format, and provide links to more detailed documentation at the source.

Below is goodpractice::gp() and covr::package_coverage() output. I note that much of the missing test coverage (when run with NOT_CRAN="true") is for caching functions.

── GP ccafs ──────────────────────────────────────

It is good practice to

  ✖ write unit tests for all functions, and all package code in general. 50% of code
    lines are covered by test cases.

    R/cache_data.R:5:NA
    R/cache_data.R:18:NA
    R/caching.R:48:NA
    R/caching.R:49:NA
    R/caching.R:55:NA
    ... and 45 more lines

  ✖ avoid long code lines, it is bad for readability. Also, many people prefer editor
    windows that are about 80 characters wide. Try make your lines shorter than 80
    characters

    R/caching.R:74:1
    R/cc_data_fetch.R:46:1
    R/cc_data_fetch.R:51:1
    R/cc_list_keys.R:22:1
    R/cc_list_keys.R:24:1
    ... and 2 more lines

────────────────────────────────────────────
ccafs Coverage: 50.50%
R/caching.R: 0.00%
R/cc_data_fetch.R: 23.08%
R/cc_data_read.R: 70.00%
R/cache_data.R: 84.62%
R/zzz.R: 90.48%
R/cc_list_keys.R: 100.00%

Reviewers: @mikoontz @manuramon
Due date: 30-11-2016

@noamross

Collaborator

noamross commented Nov 9, 2016

Reviewers: @mikoontz @manuramon
Due date: 30-11-2016

@manuramon


manuramon commented Nov 24, 2016

Here is my review. I hope it helps.

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and URL, Maintainer and BugReports fields in DESCRIPTION

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 5 hours


Review Comments

The ccafs package has been developed to give a first look at Climate Change, Agriculture, and Food Security (CCAFS) General Circulation Models (GCM) data in a fast and easy manner. This initial version is written to gather CCAFS data from Amazon S3, but I think it would be of great interest to make it more general, allowing the use of data stored in other servers/places (for instance, by including a URL argument in cc_data_fetch).

Anyway, the package is well designed and fits its purpose.

Here are some specific comments:

  • Installation goes without problems on Windows, Linux and macOS.
  • A broader description of what the package does in the README would be desirable.
  • Descriptions of functions are on some occasions a bit short, but precise.
  • On my system, the vignette doesn't load when calling the vignette() function, although it is available in the package directory.
  • The vignette covers almost all functions but is quite short. I would like to have more information about what each function does and more examples. For instance, in the cc_data_fetch documentation, there is an example on how to subset zip files using the cc_list_keys and grep functions that I think is quite useful and could be included in the vignette.
  • There is a typo in ?cc_data_read (line 1, Reaa instead of Read).
  • There is a typo in cc_list_keys, at the marker argument: you put pecifies instead of specifies. Also, the reference to the GET function could include the name of the package that function belongs to (the httr package), to make it easier to search for information about that function. The same could apply to all the functions referred to in the documentation that have been imported from other packages (mainly httr).
  • When running the cc_data_fetch function, the progress bar doesn't end in a newline, so the console prompt comes right after the 100% mark of the progress bar.
  • If you run the examples in the cc_data_fetch and cc_data_read documentation without running the example in the README first, plot() doesn't work as the raster library is not loaded. I suggest adding library(raster) to all examples containing a plotting function or, perhaps, loading the raster library by default (when loading the package).
  • As a suggestion, using the rasterVis package instead of raster will give you more control over the graphical outputs.
@mikoontz


mikoontz commented Nov 30, 2016

ccafs Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and URL, Maintainer and BugReports fields in DESCRIPTION

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 4.5


Review Comments

Overview

The ccafs package provides an R interface to geospatial data from the Climate Change, Agriculture and Food Security (CCAFS) General Circulation Models. The package is streamlined and works well with available geospatial frameworks in R, namely the raster package. The fetch-read workflow works nicely, and I appreciate the caching functionality to speed up data retrieval, since these kinds of data can be quite large. Just as importantly, the cache cleaning operations are intuitive and convenient and will prove valuable to those using these data a lot. The Climate Change, Agriculture and Food Security website (unaffiliated with the package) is well organized, but if I knew what data I needed to obtain from it, I would certainly prefer to use the ccafs package to do so.

I have one broad question and some more minor points. Depending on the outcome of the conversation around that broad question, many of these points can probably be addressed with modifications and/or additions to the documentation.

I was unable to get automatic checks from goodpractice or covr to run properly, but I can dig deeper to find out how I'm supposed to do that if adding to Noam's use of those tools would be helpful. (This is why I didn't check Automated tests above)
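For concreteness, the basic fetch-read workflow I exercised looks roughly like this (a sketch only; the key is the one used in the vignette, and attaching raster explicitly is my own addition, discussed in the minor points below):

library(ccafs)
library(raster)  # needed before plot(); see the minor points below

key <- "ccafs/ccafs-climate/data/ipcc_5ar_ciat_downscaled/rcp2_6/2030s/bcc_csm1_1_m/10min/bcc_csm1_1_m_rcp2_6_2030s_prec_10min_r1i1p1_no_tile_asc.zip"
res <- cc_data_fetch(key = key)  # downloads (or reuses the cache)
r   <- cc_data_read(res)         # reads the files into raster objects
plot(r)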

Broad Question

I am having trouble envisioning how the ccafs package is intended to fit in the target audiences' workflow with CCAFS data. I can see two possible scenarios:

  1. The package is designed to be used in tandem with the website. For instance, use the website to find a particular key of interest, use ccafs to get data for that particular key, then proceed with analysis in R.
  2. The package is designed to be a stand-alone interface to the CCAFS data. For instance, a user can load the ccafs package, learn about what data are available, filter by region or data type at will, and download corresponding data.

In either scenario, I think the package documentation would benefit from a minimum description of the data available and how it is structured.

If scenario 1 is the intended use, I think the documentation could also benefit by pointing users to the CCAFS website where more can be learned about how to find particular data of interest. I noticed on the ccafs-climate.org website that there are ways of filtering what particular tiles or datasets can be downloaded. At a minimum, it would help a new user of the ccafs package to have a link to the website and how to find the 'key' information after searching, via the website interface, for the data they want.

If it's scenario 2, I think the package would benefit from more direction as to what keys a user might be interested in. This might require additional code to help translate user-desired filtering criteria into key filtering criteria (regular expressions of some kind?). If this functionality is available via the prefix=, marker=, and delimiter= arguments in the cc_list_keys() function, then I think the package would benefit from more explanation on how to use them to find particular datasets.
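A sketch of what I have in mind (assuming the prefix= argument filters keys server-side and that the returned keys can be handled as a character vector; the exact return shape is my assumption):

keys <- cc_list_keys(prefix = "ccafs/ccafs-climate/data/ipcc_5ar_ciat_downscaled/rcp2_6/")
zips <- grep("\\.zip$", keys, value = TRUE)  # keep only the zip archives
res  <- cc_data_fetch(key = zips[1])         # then fetch one of them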

In scenario 2, I think the documentation wouldn't necessarily need to be augmented beyond what would be necessary for scenario 1.

Minor Points

  1. Can the raster package be loaded as a dependency upon loading the ccafs package? Is there any reason not to do this, if the ccafs package's functionality centers around raster objects? Is it fair to say the keys don't do much until the raster data for those keys are retrieved? Or is there also a benefit to being able to readily acquire the files, even if they aren't read into R?

  2. Relatedly, I see that the raster() and brick() functions are used for the cc_data_read.character method, which can happen even without an explicit call to library(raster). The plot() function won't work without an explicit library(raster) call first. I'm not familiar with how that works, or what the advantage is. What is the benefit of letting the cc_data_read method use the raster package behind the scenes, but not the plot method?

  3. The package help (?ccafs) would benefit from some more description. Where is the original data source? What does CCAFS GCM stand for? Some of this information is in the vignette and on the GitHub page, and I think it would help a new user to have it available here, too.

  4. I see that the CCAFS data has an attribution/non-commercial license. Is there a need to pass along the fact that there is a "non-commercial" part of the data license to the user of the package? What about passing along the citation information for the data found here? Can this be done in the documentation or in a call to some function?

  5. Can the vignette be linked to in the ?ccafs? Or be made available with a call to vignette("vignette_name")?

  6. I got the below warnings when trying to run citation("ccafs"). Is the lack of a date in the DESCRIPTION file just because the package isn't published yet?

To cite package ‘ccafs’ in publications use:

  Scott Chamberlain (NA). ccafs: Client for 'CCAFS' 'GCM' Data. R package version 0.0.8.9100.
  https://github.com/ropenscilabs/ccafs

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {ccafs: Client for 'CCAFS' 'GCM' Data},
    author = {Scott Chamberlain},
    note = {R package version 0.0.8.9100},
    url = {https://github.com/ropenscilabs/ccafs},
  }

Warning messages:
1: In citation("ccafs") :
  no date field in DESCRIPTION file of package ‘ccafs’
2: In citation("ccafs") :
  could not determine year for ‘ccafs’ from package DESCRIPTION file
  7. I see the term 'bucket' referenced a couple of times and I think I can infer its meaning from context within the function definitions, but perhaps this is something that could be mentioned in the introduction to the package if it is an important part of the CCAFS file organization structure.

  8. cc_data_fetch() and cc_data_read() examples need a library(raster) call prior to the call to plot() in order for them to run. This is the reason why I didn't check the box next to Examples above.

  9. When cc_cache_details() is run on an NA (for instance, a file that doesn't exist), no error or warning is returned -- just an NA detail for the size. Is this expected behavior, or should a warning/error be reported as when files don't exist on a call to cc_cache_delete()?

> cc_cache_details(files = "foo")
<ccafs cached files>
  directory: /Users/mikoontz/Library/Caches/ccafs

  file: foo
  size: NA mb
  10. In the vignette, can it be made clearer in the following code chunk that the sub() function is piecing the pasted character object into a single, long URL (see the short equivalent sketched after this list)? I worry that users unfamiliar with regular expressions may have trouble understanding why this key is written this way.
key <- sub("\n|\\s+", "", paste0(
  "ccafs/ccafs-climate/data/ipcc_5ar_ciat_downscaled/rcp2_6/
  2030s/bcc_csm1_1_m/10min/",
  "bcc_csm1_1_m_rcp2_6_2030s_prec_10min_r1i1p1_no_tile_asc.zip"))
  11. Is there a different purpose for [ compared to [[? Is the only difference that [ lets a user access multiple ccafs_files objects and doesn't drop names? It looks like the class remains the same for each. Does defining [[ just ensure that a user doesn't break their code by trying to subset this way?

  12. Line 9 in cc_list_keys.R: "pecifies" should be "specifies"
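To illustrate point 10: the sub() call only strips the line break and indentation that come from wrapping the string in the source, so the chunk produces the same single key string as concatenating the pieces directly, e.g.:

key <- paste0(
  "ccafs/ccafs-climate/data/ipcc_5ar_ciat_downscaled/rcp2_6/2030s/",
  "bcc_csm1_1_m/10min/",
  "bcc_csm1_1_m_rcp2_6_2030s_prec_10min_r1i1p1_no_tile_asc.zip")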

@noamross

Collaborator

noamross commented Nov 30, 2016

Thank you for your reviews, @mikoontz and @manuramon!

@sckott When responding to Mike's point, I think we should consider the evolving standard of "the package may be the users' first encounter with this data/API". This doesn't mean one has to document everything in the package, but the README, vignettes, and other documentation should point the reader to adequate online documentation if necessary.

I also suggest that one possible solution to dealing with the raster dependencies is to re-export essential functions and methods from that package (for print/plot etc.), rather than attaching its whole namespace via Depends:. In the docs/vignettes/README you'll want to point out that these are raster objects and one should load raster for further processing.

I will take a closer look at the testing suite in a bit, as well. Mike, in general the Automated Tests category doesn't refer to the tests we run, but to the package tests housed in tests/testthat. The test badges at the top of the README will lead you to the results if you are unable to get them to run on your computer (with devtools::test()).

@mikoontz


mikoontz commented Dec 1, 2016

Oh great, thanks for that. In that case, I can update my review:

All tests in tests/testthat pass on my local machine. The file test-all.R in the tests/ directory gives the below error. Is that expected?

> library("testthat")
> test_check("ccafs")
Error: No tests found for ccafs

The badges suggest that the automatic tests by AppVeyor and Travis-CI work. I'm not clear on whether reporting that fulfills my reviewer duties. I'm not sure how to get devtools::test() to work on my machine.

My intuition is to pass the test() function the path for my ccafs package:

devtools::test(pkg = paste0(.libPaths(), "/ccafs"))

But I can't imagine that's correct, since it returns a bunch of failed tests that worked fine when I copied the code directly from within the tests/testthat directory to my R console.

> devtools::test(pkg = paste0(.libPaths(), "/ccafs"))
Loading ccafs
Testing ccafs
cc_data_fetch: 123
cc_data_read: 456
cc_list_keys: 789a

Failed -------------------------------------------------------------------------------------------------------------------
1. Error: cc_data_fetch works (@test-cc_data_fetch.R#8) ------------------------------------------------------------------
could not find function "cc_data_fetch"
1: .handleSimpleError(function (e) 
   {
       e$call <- sys.calls()[(frame + 11):(sys.nframe() - 2)]
       register_expectation(e, frame + 11, sys.nframe() - 2)
       signalCondition(e)
   }, "could not find function \"cc_data_fetch\"", quote(eval(expr, envir, enclos))) at /Library/Frameworks/R.framework/Versions/3.3/Resources/library/ccafs/tests/testthat/test-cc_data_fetch.R:8
2: eval(expr, envir, enclos)

...
@noamross

Collaborator

noamross commented Dec 4, 2016

@mikoontz devtools::test() and other tools like covr work on the project development directory, not the installed package. Set the cloned git repo as your R working directory, and run everything from there.
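For example (the path is just a placeholder for wherever you cloned the repo):

# git clone https://github.com/ropenscilabs/ccafs, then from R:
setwd("~/path/to/ccafs")      # the cloned development repo, not the installed library
devtools::test()              # runs the tests in tests/testthat
covr::package_coverage()      # the same coverage numbers as in the editor checks above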

@sckott I took a look at your testing and see there isn't coverage for stuff in caching.R. These are mostly status print functions so I think this is OK. However, it makes me consider what is appropriate testing for the cache functionality (something I'm dealing with myself at the moment). I think the appropriate test is to determine whether the cached files really do match what would be downloaded. Running such tests would protect against code changes that result in changes to the cached paths, or unexpected changes to the files at the data source. An approach to this would be having an artificial cache directory for testing (perhaps stashed somewhere else convenient, so as not to clutter the repo), and testing that downloaded files match what's there.
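A rough sketch of what such a test could look like (assuming cc_data_fetch() returns the downloaded file paths and cc_cache_list() returns the cached paths; neither assumption is checked against the package internals):

library(testthat)
library(ccafs)

test_that("fetched files land in the cache and are stable across fetches", {
  skip_on_cran()
  key <- "ccafs/ccafs-climate/data/ipcc_5ar_ciat_downscaled/rcp2_6/2030s/bcc_csm1_1_m/10min/bcc_csm1_1_m_rcp2_6_2030s_prec_10min_r1i1p1_no_tile_asc.zip"
  first  <- cc_data_fetch(key = key)
  second <- cc_data_fetch(key = key)   # should be served from the cache
  expect_true(all(file.exists(unclass(first))))
  expect_true(all(basename(unclass(first)) %in% basename(cc_cache_list())))
  expect_identical(sort(basename(unclass(first))), sort(basename(unclass(second))))
})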

@sckott

Member Author

sckott commented Dec 4, 2016

consider the evolving standard of "the package may be the users' first encounter with this data/API". This doesn't mean one has to document everything in the package, but the README, vignettes, and other documentation should point the reader to adequate online documentation if necessary.

Good idea.

I also suggest that one possible solution to the dealing with the raster dependencies is to re-export essential functions and methods from that package (for print/plot etc.)

makes sense

However, it makes me consider what is appropriate testing for the cache functionality (something I'm dealing with myself at the moment).

good ideas, will have a go at adding cache testing

@sckott

Member Author

sckott commented Dec 4, 2016

thanks for the reviews @manuramon @mikoontz !

@sckott

Member Author

sckott commented Dec 15, 2016

responses to @manuramon


I think it would be of great interest to make it more general, allowing the use of data stored in other servers/places (for instance, by including a URL argument in cc_data_fetch).

Hmm, thanks for the idea, but i think it's important to keep this pkg specific to CCAFS data, and make new pkgs (following similar pattern) for other data sources.

specific comments:

  • A broader description of what the package does in the README would be desirable.

Okay, can do! (see ropensci/ccafs#6)

  • Descriptions of functions are on some occasions a bit short, but precise.

Thanks! Will expand on what each function does. (see ropensci/ccafs#7)

  • On my system, the vignette doesn't load when calling the vignette() function, although it is available in the package directory.

i guess vignette(package = "ccafs") brings up a man page for the vignette, and vignette("ccafs_vignette", package = "ccafs") opens the vignette. Does that do what you think should be done?

  • The vignette covers almost all functions but is quite short. I would like to have more information about what each function does and more examples. For instance, in the cc_data_fetch documentation, there is an example on how to subset zip files using the cc_list_keys and grep functions that I think is quite useful and could be included in the vignette.

Good point, will add that specific thing, and will fill out the vignette more. (see ropensci/ccafs#8)

  • There is a typo in ?cc_data_read (line 1, Reaa instead of Read)

will fix it. thanks! (see ropensci/ccafs#9)

  • There is a typo in cc_list_keys, at the marker argument: you put pecifies instead of specifies. Also, the reference to the GET function could include the name of the package that function belongs to (the httr package), to make it easier to search for information about that function. The same could apply to all the functions referred to in the documentation that have been imported from other packages (mainly httr).

will fix the typo. thanks! (see ropensci/ccafs#9) for the reference to GET, are you talking about the param definition for ...? if so, the syntax behind that is \code{\link[httr]{GET}} which creates a link to the man file for GET; does that seem adequate?

  • When running the cc_data_fetch function, the progress bar doesn't end in a newline, so the console prompt comes right after the 100% mark of the progress bar.

good catch. hmm, that's done via httr::progress - i'm not sure i can change that behavior but I can probably change the print behavior for the output class (see ropensci/ccafs#10)

  • If you run the examples in the cc_data_fetch and cc_data_read documentation without running the example in the README first, plot() doesn't work as the raster library is not loaded. I suggest adding library(raster) to all examples containing a plotting function or, perhaps, loading the raster library by default (when loading the package).

thanks for catching that, I'll just add library(raster) to the example - see ropensci/ccafs#5

  • As a suggestion, using the rasterVis package instead of raster will give you more control over the graphical outputs.

Nice, thanks, will try that out, at least possibly for use in the examples. (see ropensci/ccafs#11)

@sckott

Member Author

sckott commented Dec 15, 2016

responses to @mikoontz


Broad Question

by the website I assume you mean http://ccafs-climate.org ?

I'm not sure what is best. Do you have a preference as a (i assume) potential user?

  1. Can the raster package be loaded as a dependency upon loading the ccafs package? Is there any reason not to do this, if the ccafs package's functionality centers around raster objects? Is it fair to say the keys don't do much until the raster data for those keys are retrieved? Or is there also a benefit to being able to readily acquire the files, even if they aren't read into R?

as @noamross said i'd prefer to re-export essential fxns over putting in Depends. It is fair to say the keys don't do much, they're just character strings pointing at a location on amazon s3. I imagine there is a use case for downloading the files and reading in a later R session, so i don't think we should assume the user always needs raster loaded by default. (see ropensci/ccafs#12)

  2. Relatedly, I see that the raster() and brick() functions are used for the cc_data_read.character method, which can happen even without an explicit call to library(raster). The plot() function won't work without an explicit library(raster) call first. I'm not familiar with how that works, or what the advantage is. What is the benefit of letting the cc_data_read method use the raster package behind the scenes, but not the plot method?

The reasoning behind that is to keep the package as lightweight as possible, e.g., https://github.com/ropenscilabs/ccafs/blob/master/NAMESPACE#L23-L24 shows we only import two of the raster functions. We can however import and re-export raster::plot, which I may do; i'll play with that first and see how it works out (see ropensci/ccafs#12)

  3. The package help (?ccafs) would benefit from some more description. Where is the original data source? What does CCAFS GCM stand for? Some of this information is in the vignette and on the GitHub page, and I think it would help a new user to have it available here, too.

Thanks! Will definitely improve the pkg level file see ropensci/ccafs#6

  4. I see that the CCAFS data has an attribution/non-commercial license. Is there a need to pass along the fact that there is a "non-commercial" part of the data license to the user of the package? What about passing along the citation information for the data found here? Can this be done in the documentation or in a call to some function?

No rule per se for CRAN but yeah, good idea, i'll add that in the pkg level man file and in readme (see ropensci/ccafs#13)

  5. Can the vignette be linked to in the ?ccafs? Or be made available with a call to vignette("vignette_name")?

it is available like vignette(package = "ccafs") or vignette("ccafs_vignette", package = "ccafs"), but i can add that to the pkg level man file (see ropensci/ccafs#14)

  6. I got the below warnings when trying to run citation("ccafs"). Is the lack of a date in the DESCRIPTION file just because the package isn't published yet?

It seems to be common now to not put Date in the DESCRIPTION file, as when CRAN builds the pkg they add that in for the date it was built on. It's hard to remember to update that Date field, so that's another reason not to include it.

  7. I see the term 'bucket' referenced a couple of times and I think I can infer its meaning from context within the function definitions, but perhaps this is something that could be mentioned in the introduction to the package if it is an important part of the CCAFS file organization structure.

yeah, makes sense to add more explanation. a bucket is the top level thing that files are organized in within S3 (see ropensci/ccafs#15)

  8. cc_data_fetch() and cc_data_read() examples need a library(raster) call prior to the call to plot() in order for them to run. This is the reason why I didn't check the box next to Examples above.

right, thanks! noted by other reviewer as well see ropensci/ccafs#5

  9. When cc_cache_details() is run on an NA (for instance, a file that doesn't exist), no error or warning is returned -- just an NA detail for the size. Is this expected behavior, or should a warning/error be reported as when files don't exist on a call to cc_cache_delete()?

Probably not the expected behavior. Perhaps returning a "file doesn't exist" or similar message would be useful, though I don't want to stop with an error, as the user could pass in some files that do exist and some that don't (see ropensci/ccafs#16)
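Something along these lines could work per file (just a sketch of the idea, not the function's actual internals):

cache_detail_one <- function(path) {
  if (!file.exists(path)) {
    warning("file does not exist: ", path, call. = FALSE)
    return(list(file = path, size = NA_real_))
  }
  list(file = path, size = file.size(path) / 10^6)  # size in mb, as printed
}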

  10. In the vignette, can it be made clearer in the following code chunk that the sub() function is piecing the pasted character object into a single, long URL? I worry that users unfamiliar with regular expressions may have trouble understanding why this key is written this way.

Right, should include a simple explanation about that. When the pkg manual is made they don't wrap lines, so that's why i try to make sure all my code and docs are 80 characters wide or less. Although in the vignette it's md/html, so I guess it doesn't matter there so much. (see ropensci/ccafs#17)

  11. Is there a different purpose for [ compared to [[? Is the only difference that [ lets a user access multiple ccafs_files objects and doesn't drop names? It looks like the class remains the same for each. Does defining [[ just ensure that a user doesn't break their code by trying to subset this way?

Yeah, you got it. The single bracket method is for indexing like [1:3], while the double bracket method is for indexing like [[1]] - and yeah, the goal with those is that the S3 class isn't lost, since the actual data within the object is just a character vector of file paths. try unclass() on the output of cc_data_fetch()
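The idea is essentially this (a sketch, not the package's exact code):

`[.ccafs_files` <- function(x, i) {
  structure(unclass(x)[i], class = "ccafs_files")
}
`[[.ccafs_files` <- function(x, i) {
  structure(unclass(x)[[i]], class = "ccafs_files")
}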

  12. Line 9 in cc_list_keys.R: "pecifies" should be "specifies"

thanks, will fix (see ropensci/ccafs#9)

@mikoontz


mikoontz commented Jan 3, 2017

I'll just respond to the points where you have a question for me. Let me know what other feedback would be helpful.

by the website I assume you mean http://ccafs-climate.org?
I'm not sure what is best. Do you have a preference as a (i assume) potential user?

Whoops, yes I meant the http://ccafs-climate.org website.

I would be comfortable working with both platforms in parallel (your package in conjunction with the official CCAFS website) as long as there is some more direction from your package about how to leverage the official website's search capabilities to find the key strings that would be necessary. The package would still provide a valuable service to a novice user. I don't think there is a need to build a spatial/temporal/data type subsetting engine directly into the package.

as @noamross said i'd prefer to re-export essential fxns over putting in Depends. It is fair to say the keys don't do much, they're just character strings pointing at a location on amazon s3. I imagine there is a use case for downloading the files and reading in a later R session, so i don't think we should assume the user always needs raster loaded by default.

Re-exporting essential functions sounds great! I didn't know that was possible. I am on board with keeping the package lightweight, and allowing for potential use cases of the keys that may not require the entire raster package.

@sckott

Member Author

sckott commented Jan 3, 2017

@mikoontz thanks for your feedback!

I would like to add functionality for searching for what data is available, but they sure aren't making it easy to do programmatically - will keep looking into that

re-exporting it is

@sckott

Member Author

sckott commented Jan 5, 2017

Issues linked in my responses above - dealing with those now

@sckott

Member Author

sckott commented Jan 6, 2017

All issues raised by reviewers have been addressed - Anything else to do?

There still isn't a way to search for data - for a future package version, I'll see if I can scrape the search page for CCAFS, but i don't think that's a good long-term plan.

@noamross

Collaborator

noamross commented Jan 6, 2017

Editor comments have been addressed. @mikoontz and @manuramon, could you please take a look and let us know that @sckott's changes have addressed your concerns?

@manuramon


manuramon commented Jan 7, 2017

Hi @sckott @noamross and @mikoontz. For me, @sckott has addressed all our concerns nicely. I think the resulting package is nice and useful, and I am sure that in the (not so far) future it will be improved with new functionality (such as searching for data).

PS: the example in cc_data_fetch using the progress bar results in one line printed for each percentage point in the vignette. This is not an important issue (in fact, it is not really an issue at all).

@mikoontz


mikoontz commented Jan 9, 2017

Hi @sckott @noamross and @manuramon,

I think the package is much improved, largely thanks to the expanded documentation. Thanks for the extra information!

I have 4 concerns still:

First, I can't get the vignette to run using either of these lines of code:

> vignette(package = "ccafs") 
no vignettes found
> vignette("ccafs_vignette", package = "ccafs")
Warning message:
vignette ‘ccafs_vignette’ not found 

Second, I still think that the vignette could use an example workflow to get the key from the website, even given the current state of the search functionality with the R interface (i.e. not present, but perhaps in development).

As an example, I used the CCAFS website to try to get a key like a user might. Specifically, I tried to recreate getting the key that the vignette uses. Here was my workflow:

  1. Navigate to Data -> Spatial Downscaling
  2. On the left bar -> File Set, I clicked "Delta Method IPCC AR5" under the Empirical/Statistical Downscaling
  3. On the left bar -> Scenario, I clicked "RCP 2.6"
  4. On the left bar -> Model, I clicked the "bcc_csm1_1_m"
  5. On the top bar -> Extent, I clicked "Global"
  6. On the top bar -> Format, I clicked "ASCII Grid Format"
  7. On the top bar -> Period, I clicked "2030s"
  8. On the top bar -> Variable, I clicked "Precipitation"
  9. On the top bar -> Resolution, I clicked "10 minutes"
  10. This resulted in 1 file found, and I clicked "Search"
  11. I checked the box next to the file returned by the search, and clicked "Generate Download Links"
  12. I clicked "Skip" in the paragraph to go to the download links directly without submitting my email or basic information which would be used by CCAFS for impact assessment. (side note: this prompted concern # 4, see below)
  13. I copied the address of the file to download (Right click -> Copy Link Address)
  14. I pasted the link address into my R script, and deleted everything before "ccafs/ccafs-climate/data..." and used that truncated character string as my key in the cc_data_fetch() call.
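In code, step 14 amounts to something like this (the host in the copied link is a placeholder):

link <- "https://<s3-host>/ccafs/ccafs-climate/data/ipcc_5ar_ciat_downscaled/rcp2_6/2030s/bcc_csm1_1_m/10min/bcc_csm1_1_m_rcp2_6_2030s_prec_10min_r1i1p1_no_tile_asc.zip"
key  <- regmatches(link, regexpr("ccafs/ccafs-climate/data/.*$", link))
res  <- cc_data_fetch(key = key)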

Is this the workflow that you would use? If so, can you include it with the vignette or is that outside the scope of the vignette?

Third, is this package unable to get the regional datasets? I tried to get the region A1 version of the above key (with a 30 second resolution, since that's all that is available) and the package can fetch but not read. I tried a few other combinations of regional datasets and had the same problem. Are there other data sets you know aren't retrievable? I couldn't figure out how to get a key from the "Bias Correction" or the "Weather Station" parts of the website (I could only retrieve data from the "Spatial Downscaling" section); does that mean those data can't be accessed by the ccafs package?

# Example of a regional dataset that I couldn't get to work with cc_data_read()
key7 <- "ccafs/ccafs-climate/data/ipcc_5ar_ciat_tiled/rcp2_6/2030s/bcc_csm1_1_m/30s/bcc_csm1_1_m_rcp2_6_2030s_prec_30s_r1i1p1_a1_asc.zip"
(test7 <- cc_data_fetch(key = key7))
cc_data_read(test7) # Doesn't read the file

When I run cc_cache_list() I see a similar set of files as I do when I use the key from the vignette except each .asc seems to have a corresponding .prj associated with it. Maybe this is what prevents the rest of the ccafs functions from working on those files?

Fourth, have you had any communication with the CCAFS/CGIAR folks about accessing their data via R instead of their interface? From their "Terms and Conditions" paragraph (copied below) that pops up before you can get a download link, it seems like they'd be interested in assessing the impact of their work.

To continue downloading your files, please first fill in your email and then some basic information. This information will be used by CCAFS solely for impact assessment and CGIAR and Center level reporting purposes. Filling it in will greatly help us to track the use of the portal and keep improving it. This portal provides data to a very large community of users and improving its usability and efficiency is a key aspect we work on continuously. However, you may click on skip to download links directly.

@manuramon


manuramon commented Jan 9, 2017

@mikoontz to get the vignette working I ran the devtools::install_github() function with the option build_vignettes = TRUE.

I also have tried to get a regional data set (the example @mikoontz pointed out above) and received the following error (Error in .rasterObjectFromFile(x, band=... : Cannot create a RasterLayer from this file) when running cc_data_read. However, if I read the .asc file from the cache directory (raster::raster("~/Library/Caches/.../prec_1.asc")) I didn't get any error and I obtained the plot.

@mikoontz


mikoontz commented Jan 9, 2017

Thanks @manuramon, you're right. I can run the vignettes now using the build_vignettes = TRUE argument in the install_github() function. I assume they'll be loaded automatically when the package goes on CRAN, and so this isn't a concern anymore.

I also see that I can call the raster() function directly on the .asc file in the cache and get the desired plot. The contents from the .prj file are read automatically with their associated .asc if they are in the same directory (see here).

# .asc file in same directory as the associated .prj
> test <- raster("/Users/mikoontz/Library/Caches/ccafs/.../prec_1.asc")
> crs(test)
CRS arguments:
 +proj=longlat +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +no_defs 

# .asc file moved to a different location to decouple it from the .prj
> test2 <- raster("/Users/mikoontz/Desktop/prec_1.asc")
> crs(test2)
CRS arguments: NA 
@sckott

Member Author

sckott commented Jan 9, 2017

the example in the cc_data_fetch using the progress bar results in one line printed for each percentage point in the vignette

right - it can be suppressed, which I'll probably do for the vignette

@sckott

Member Author

sckott commented Jan 9, 2017

thanks for the help on vignette @manuramon

@mikoontz

  1. Yes, I'll include in the vignette (or in a separate one) a workflow for getting keys in the browser, good idea. To be honest, I hadn't really used the web interface to drill down to data since I was focused on the data coming from the S3 source.

  2. regional data sets and such ...

I don't know, I'll have a look

  3. I haven't been in contact. The best thing I could do IMO is put in the user-agent string the package name (ccafs) and version, and that it's coming from R. They probably collect logs (or they should if they don't) on the downloads from Amazon, so they'll see the user-agent string and get a sense for how many people are using this pkg.

I assume they'll be loaded automatically when the package goes on CRAN, and so this isn't a concern anymore.

Yes.

@sckott

Member Author

sckott commented Jan 9, 2017

AFAICT the "Bias Correction" and "Weather Station" sections are somehow private data for people that have access.

@sckott

Member Author

sckott commented Jan 9, 2017

@mikoontz Try again after reinstalling (devtools::install_github("ropenscilabs/ccafs")) - I'm using raster::stack now instead of raster::brick, which didn't support using PROJ files (stack does). That should allow you to read the key you had in your example. And as in your comment above, raster::raster() also supports using the PROJ file.
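To illustrate the difference (a sketch; it assumes cc_cache_list() returns the cached file paths):

library(raster)
asc_files <- grep("\\.asc$", cc_cache_list(), value = TRUE)
s <- raster::stack(asc_files)   # stack() picks up the sidecar .prj files
raster::crs(s)                  # should now report the +proj=longlat CRS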

@sckott

Member Author

sckott commented Jan 9, 2017

working on this a bit trying to search via the website - it's sort of a hot dumpster fire, but there's some progress 🕙

@sckott

Member Author

sckott commented Jan 10, 2017

if you two have time @manuramon @mikoontz, try out the new function cc_search() after reinstalling (devtools::install_github("ropenscilabs/ccafs")). It's not going to work for all query combinations yet, and there are some errors for sure, but the examples should work.

@sckott

Member Author

sckott commented Jan 10, 2017

i think the dumpster fire has been tamed, let me know what you think

@manuramon


manuramon commented Jan 11, 2017

@sckott I have found the cc_search function useful. I find the numerical coding difficult to remember and it forces you to go through the help, but I can't imagine a better way to do it. The example in the cc_search function help using the lapply function is very useful.

@sckott

Member Author

sckott commented Jan 11, 2017

@manuramon thanks, agree that there's not a better way to do it. For the few parameters that only have two options i just use a character string, but most parameters have at least 6 options, many of which are quite long strings with spaces and such.

@mikoontz


mikoontz commented Jan 18, 2017

Hi @sckott, @manuramon, and @noamross. Nice package edits! There are three sections to this updated review: the vignette, the regional datasets, and the cc_search() function.

Vignette

Can you add the "amazon_s3_keys" vignette to the package help ?ccafs page?

vignette("amazon_s3_keys", package = "ccafs")

Regional Datasets

I'm now able to read the regional datasets that incorporate their associated .prj files. It looks like the ccafs_files object print method isn't reporting the correct number of files. When I read the same regional key that we've been using, the resulting object from the cc_data_fetch() function call only shows 1 file, and doesn't show the file type. Twelve plots are made after reading the fetched object, so I think it's just an accounting error (possibly also because of the extra .prj files in the cache?).

# Example of a regional dataset that I couldn't get to work with cc_data_read() in the last iteration
key7 <- "ccafs/ccafs-climate/data/ipcc_5ar_ciat_tiled/rcp2_6/2030s/bcc_csm1_1_m/30s/bcc_csm1_1_m_rcp2_6_2030s_prec_30s_r1i1p1_a1_asc.zip"
(test7 <- cc_data_fetch(key = key7))
test7 # Aren't there supposed to be 12 files here?

<CCAFS GCM files>
   1 files
   Base dir: /bcc_csm1_1_m_rcp2_6_2030s_prec_30s_r1i1p1_a1_asc
   File types (count): 
> 

cc_search() function

The cc_search() function is awesome. Thanks for adding that! It looks like you access the CCAFS file-list.php file directly. Clever!

Just 2 comments on it:

First, the exhaustive list of possible argument values in ccafs-search is great. I think it would be even more amazing if you could leverage some tab autocomplete power to help users quickly visualize the possible argument values. I was having a hard time finding an ability in R to tab complete the possible valid values of an argument in a function call-- I got the sense that capability doesn't exist. What about defining a named list of named lists object to translate human-readable text of the different valid argument values into their lookup codes? Something like this... (I didn't finish adding all of the model options, but the other categories are complete I think)

cc_params <- list(
  file_set = list("Delta method IPCC AR5" = 12,
                  "Delta method IPCC AR4" = 4,
                  "MarkSim Pattern Scaling" = 9,
                  "Eta South America" = 10,
                  "PRECIS Andes"= 7,
                  "CORDEX" = 8,
                  "Disaggregation IPCC AR4" = 11,
                  "Delta Climgen" = 3,
                  "Delta Method IPCC AR4 (Climgen Data)" = 2,
                  "Delta Method IPCC AR4 (Stanford Data)" = 5,
                  "Delta Method IPCC AR3" = 6),
  scenario = list("Baseline" = 1,
                  "SRES A1B" = 2,
                  "SRES A2A" = 3,
                  "SRES B2A" = 4,
                  "SRES A2" = 5,
                  "SRES B1" = 6,
                  "RCP 2.6" = 7,
                  "RCP 4.5" = 8,
                  "RCP 6.0" = 9,
                  "RCP 8.5" = 10),
  model = list("Baseline" = 1,
               "bcc_csm1_1" = 42,
               "bcc_csm1_1_m" = 43,
               "bccr_bcm2_0" = 2,
               "bnu_esm" = 44,
               "cccma_cancm4" = 45,
               "cccma_canesm2" = 46),
  extent = list("global" = "global",
                "region" = "region"),
  format = list("ascii" = 1,
                "esri" = 2),
  period = list("1970s" = 1,
                "1990s" = 10,
                "2000s" = 2,
                "2020s" = 3,
                "2030s" = 4,
                "2040s" = 5,
                "2050s" = 6,
                "2060s" = 7,
                "2070s" = 8,
                "2080s" = 9),
  variable = list("Bioclimatics" = 1,
                  "Diurnal Temperature Range" = 6,
                  "Maximum Temperature" = 3,
                  "Mean Temperature" = 4,
                  "Minimum Temperature" = 5,
                  "Precipitation" = 2,
                  "Solar Radiation" = 7,
                  "Other" = 9999),
  resolution = list("30 seconds" = 1,
                    "2.5 minutes" = 2,
                    "5 minutes" = 3,
                    "10 minutes" = 4,
                    "30 minutes" = 5,
                    "25 minutes" = 6,
                    "20 minutes" = 7),
  tile = list("A1" = "A1",
              "A2" = "A2",
              "A3" = "A3",
              "A4" = "A4",
              "A5" = "A5",
              "A6" = "A6",
              "B1" = "B1",
              "B2" = "B2",
              "B3" = "B3",
              "B4" = "B4",
              "B5" = "B5",
              "B6" = "B6",
              "C1" = "C1",
              "C2" = "C2",
              "C3" = "C3",
              "C4" = "C4",
              "C5" = "C5",
              "C6" = "C6"))

An example use case would turn the directly coded search of the file key like you had in the ?cc_search example (reproduced below)...

(res <- cc_search(file_set = 4, scenario = 6, model = 2, extent = "global", format = "ascii", period = 5, variable = 2, resolution = 3))

...into something like this, where tab complete helped every step of the way.

(res <- cc_search(file_set = cc_params$file_set$`Delta method IPCC AR4`,
                 scenario = cc_params$scenario$`SRES B1`,
                 model = cc_params$model$bccr_bcm2_0,
                 extent = cc_params$extent$global,
                 format = cc_params$format$esri,
                 period = cc_params$period$`2040s`,
                 variable = cc_params$variable$Precipitation,
                 resolution = cc_params$resolution$`5 minutes`))

I have no idea whether it is good practice to hard-code a lookup table like this, but maybe there is a better way to implement the concept. It may also make it easier for the package maintainer to update the cc_search() function if the CCAFS group changes their codes, by updating this object rather than the help file. Maybe that'd save the end user from needing to double-check the help file every time?
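Another way to get a similar effect without the backticks would be to translate human-readable labels to codes inside the package, e.g. (codes taken from the scenario table above; just a sketch):

scenario_code <- function(label) {
  codes <- c("Baseline" = 1, "SRES A1B" = 2, "SRES A2A" = 3, "SRES B2A" = 4,
             "SRES A2" = 5, "SRES B1" = 6, "RCP 2.6" = 7, "RCP 4.5" = 8,
             "RCP 6.0" = 9, "RCP 8.5" = 10)
  codes[[match.arg(label, names(codes))]]
}
scenario_code("RCP 2.6")  # 7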

Second, the cc_search() function returns errors if the format= or extent= arguments are integer values, but the ?cc_search help ("Arguments" section) suggests each of these arguments should (or can?) take a value of 1 or 2.

@sckott

Member Author

sckott commented Jan 18, 2017

Can you add the "amazon_s3_keys" vignette to the package help ?ccafs page?

yeah, will do

I'll check on the regional datasets problem

cc_search

  • glad you like cc_search()
  • pop-up hints for parameters are there, but yeah, kind of unwieldy when there are so many options. Good idea for a list of options. Right that it will have to be updated if they update their options, kind of annoying, but I can try that and see if it makes sense.
  • added input type checkers for the parameters, so will fail well on wrong inputs now

sckott added a commit to ropensci/ccafs that referenced this issue Feb 14, 2017

sckott added a commit to ropensci/ccafs that referenced this issue Feb 14, 2017

@sckott

Member Author

sckott commented Feb 14, 2017

@mikoontz Okay, I think I've addressed all your points - changes made.

anything else I should do folks?

@mikoontz


mikoontz commented Feb 15, 2017

Awesome @sckott. I see all the points as addressed: vignette is in the help file, good accounting of the files coming from the regional keys, and a way to search with helpful prompts.

Green light from me! This is great!

cc @manuramon

@manuramon


manuramon commented Feb 15, 2017

Great job @sckott. I think this final version of the ccafs package is easy to use and very useful for the purpose it was developed for.

Green light from me, as well.

PS. For me this has been the first time I have reviewed an R package and I have learned a lot and enjoyed it, so thanks to all of you! I hope to have the opportunity to review more packages in the future.

@sckott

Member Author

sckott commented Feb 15, 2017

thanks @mikoontz and @manuramon

glad you enjoyed it @manuramon !

@noamross

Collaborator

noamross commented Feb 22, 2017

Thanks @mikoontz and @manuramon for your reviews and excellent follow-up! @sckott, I've done final checks and we're good to go.

@sckott

Member Author

sckott commented Feb 22, 2017

cool, thanks everyone ( @mikoontz @manuramon @noamross ) for improving the package

@noamross noamross closed this Feb 24, 2017
