New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bowerbird #139

Closed
raymondben opened this Issue Aug 4, 2017 · 67 comments

Comments

Projects
None yet
7 participants
@raymondben
Member

raymondben commented Aug 4, 2017

Summary

  • What does this package do? (explain in 50 words or less):

A package for maintaining a local collection of data sets from a range of data providers. Bowerbird can mirror an entire remote collection of files, using wget's recursive download functionality. Bowerbird also provides some functions around data provenance and versioning (it doesn't fundamentally solve these issues, but goes some way towards solutions).

  • Paste the full DESCRIPTION file inside a code block below:
Package: bowerbird
Type: Package
Title: Keep a Collection of Sparkly Data Resources
Version: 0.3.4
Authors@R: c(person("Ben","Raymond",email="ben.raymond@aad.gov.au",
  	   role=c("aut","cre")),
	   person("Michael","Sumner",role="aut"))
Description: Tools to get and maintain a data repository from third-party data
    providers.
URL: https://github.com/AustralianAntarcticDivision/bowerbird
BugReports: https://github.com/AustralianAntarcticDivision/bowerbird/issues
License: MIT + file LICENSE
Imports:
    assertthat,
    dplyr,
    openssl,
    R.utils,
    rmarkdown,
    rvest,
    stringr,
    xml2
LazyLoad: yes
RoxygenNote: 6.0.1
Suggests:
    archive,
    knitr,
    testthat,
    covr
Remotes: jimhester/archive
VignetteBuilder: knitr

  • URL for the package (the development repository, not a stylized html page):
    https://github.com/AustralianAntarcticDivision/bowerbird

  • Please indicate which category or categories from our package fit policies this package falls under *and why(? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):
    Data retrieval, but also venturing into reproducibility. Primarily it is intended as a mechanism for maintaining a local data collection (from remote data providers), but could also be used as a wrapper to allow others to reproduce your work (e.g. "you'll need these 100GB of files installed locally; here's the bowerbird script to do so"). Bowerbird also has a few functions to help with data provenance, see vignette("data_provenance")

  • Who is the target audience?
    Research scientists/technicians/data managers who want to maintain a local library of data files (either for their own use, or perhaps a single shared library on behalf of a number of local users, as we do). Researchers who want to share work that relies on local copies of data. Also potentially package developers who need some sort of data retrieval that isn't easily accomplished by existing tools (e.g. recursive download of a whole collection of data files from a satellite data provider).

  • Are there other R packages that accomplish the same thing? If so, how does
    yours differ or meet our criteria for best-in-category?

Nothing to our knowledge that really does the same thing. Some similarity to https://github.com/ropensci/rdataretriever, though rdataretriever seems to be angled towards biodiversity data sets in particular and creating sensible local database structures for them. Bowerbird is focused on mirroring remote data to a local file system, and providing some functions around data provenance. Passing overlap with http packages (httr, crul) but these are generally intended for single-transaction sort of usage. Jeroen's curl package (not an ropensci one?) is also similar to bowerbird in that it wraps a underlying http client: bowerbird typically uses wget under the hood to accomplish its web traffic, whereas curl binds to libcurl. AFAIK curl doesn't support mirroring of external sites (which wget does, and which bowerbird relies heavily on).

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions. More needed, but the important ones are covered for now. The more-sparsely-documented functions at this stage are the ones that the average user is unlikely to need to interact with directly.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, including reporting of test coverage, using services such as Travis CI, Coeveralls and/or CodeCov. More test coverage still to be added - codecov is currently reporting around 80%, goodpractice around 70%, not sure which is right!
  • I agree to abide by ROpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

  • Do you intend for this package to go on CRAN?
  • [no] Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package contains a paper.md with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

  • [more or less] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

    • no NEWS file yet
    • some function docs need expanding
    • we use cat() for printing progress, despite the packaging guide suggestions to use message instead. This is because (a) progress information doesn't really strike me as a "condition", which message is intended for, (b) all cat-issues messages can be turned off by specifying bb_sync(...,verbose=FALSE), (c) an anticipated common use for bowerbird is for unattended (cron-job) updates to a local data library, in which case it's likely the user will want to sink() all output to a log file. Using cat() means that a simple sink() will catch everything, including output from wget calls (if they are made). I think this becomes less reliable if message is used (you'd have to sink(...type="message") but then I'm not sure it'd catch wget output, but admittedly haven't tried this)
  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

Additional notes re: our presubmission enquiry

General note: the package is not in a final polished state, but we think far enough advanced (and stable enough) to be a good point for onboarding consideration.

It would be helpful to actually separate out the core mechanism and additional sources. This could go as far as having separate packages (which we could handle together).

Fair suggestion, and one that we've considered - and maybe that split is still reasonable to consider down the track. But for now, at least, we think it's better to keep core-functionality and the data-source-definitions bundled together.

Have you considered using rappdirs for default data directories?

It's up to the user where they want to put their data. We do make a suggestion in the README and vignette for users to consider rappdirs.

@maelle

This comment has been minimized.

Member

maelle commented Aug 4, 2017

Editor checks:

  • [x ] Fit: The package meets criteria for fit and overlap
  • [x ] Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • [ x] License: The package has a CRAN or OSI accepted license
  • [ x] Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Thanks for your submission @raymondben! I'm currently seeking one reviewer (already found one)!

Here is the output of goodpractice::gp() for you and the reviewers to consider (note: we don't expect you to have 100% unit test code coverage!)

It is good practice towrite unit tests for all functions, and all package code in general. 79% of
    code lines are covered by test cases.

    R/aadc_eds_handler.R:14:NA
    R/aadc_eds_handler.R:15:NA
    R/aadc_eds_handler.R:16:NA
    R/aadc_eds_handler.R:17:NA
    R/aadc_eds_handler.R:27:NA
    ... and 385 more lineswrite short and simple functions. These functions have high cyclomatic
    complexity:bb_source (56).
  ✖ avoid long code lines, it is bad for readability. Also, many people prefer
    editor windows that are about 80 characters wide. Try make your lines
    shorter than 80 characters

    R\aadc_eds_handler.R:28:1
    R\aadc_eds_handler.R:32:1
    R\aadc_eds_handler.R:36:1
    R\aadc_eds_handler.R:38:1
    R\aadc_eds_handler.R:40:1
    ... and 480 more linesavoid calling setwd(), it changes the global environment. If you need it,
    consider using on.exit() to restore the working directory.

    R\aadc_eds_handler.R:44:5
    R\amps_handler.R:44:9
    R\provenance.R:50:5
    R\sync.R:63:5
    R\utils.R:77:5
    ... and 2 more lines

Reviewers: @MilesMcBain @lwasser
Due date: 2017-11-07

@raymondben

This comment has been minimized.

Member

raymondben commented Aug 4, 2017

Minor comment on the goodpractice output, which I meant to include in the initial submission: setwd is necessary, but all calls are protected by on.exit(), which also takes care of other global changes where necessary (e.g. to http_proxy env settings).

@maelle

This comment has been minimized.

Member

maelle commented Aug 5, 2017

Ok makes sense @raymondben, thanks for this comment!

@maelle

This comment has been minimized.

Member

maelle commented Aug 11, 2017

Thanks for agreeing to review this package @MilesMcBain @lwasser! 😸 Your reviews are due on 2017-09-04.

As a reminder here are links to the recently updated reviewing and packaging guides and to the review template.

@MilesMcBain

This comment has been minimized.

MilesMcBain commented Aug 16, 2017

With the Editor's permission, I am posting some feedback ahead of a full review in the interests of discussing potential structural changes as early as possible.

The issue of separating the core functionality from data sources was raised by the editors and the more I poke around under the hood, the more I am convinced this will also be my recommendation.

Part of my motivation is selfish. The number of sources bowerbird touches means a comprehensive review of this package, as is, will be an onerous task. Some other reasons I can think of why this might be a good idea:

  • Discoverability: For many sources there is significant source specific domain knowledge contained within the handler functions or choice of method_flags. Will those in niche areas who could most benefit from this be able to find it in a general package about synchronisation?

  • Package Maintenance: It seems like ensuring calls to these different sources and APIs are working as advertised could become a maintenance issue that leads to frequent package updates. Have remote server changes generated much work to date?

  • Nicer Data Interfaces: Assuming data sources get moved to new packages that depend on bowebird, there may then be scope to tweak the APIs of those packages so they more closely suit their user's needs. By this I mean implement API specific function arguments that mean users do not need to understand the workings of wget arguments. An example would be setting a date range as function arguments, instead of directly as --reject-regex arguments in the case of the README download size example.

@maelle

This comment has been minimized.

Member

maelle commented Aug 16, 2017

Thanks @MilesMcBain!

@raymondben

This comment has been minimized.

Member

raymondben commented Aug 16, 2017

Thanks @MilesMcBain. I'll confess I'm not 100% sold on the splitting idea, but I'm also not vehemently against it and your points are good. @mdsumner and I will have a chat and see what we come up with. In the meantime, suggest that you and @lwasser don't spend too much more time on it, since it's likely to look fairly different if we do split it.

@raymondben

This comment has been minimized.

Member

raymondben commented Aug 21, 2017

OK, so: on reflection, we agree that splitting is probably a good idea. @maelle, what's the preference here? Do you want to close this issue and we resubmit, or do we just carry on here? (The split version --- not entirely finalized yet --- is currently sitting in the "split" branch.) Note that we'll only submit the core part of the package at this stage, including a few example data sources. The extended collection of data sources will be hived off into a separate package for further work when time permits, and possible submission.

For completeness, some thoughts on Miles' comments above:

Package Maintenance: It seems like ensuring calls to these different sources and APIs are working as advertised could become a maintenance issue that leads to frequent package updates. Have remote server changes generated much work to date?

Moving sources into one or more other packages doesn't reduce the maintenance burden, it just spreads it across packages. We see occasional but not overwhelming changes to remote servers that require tweaks to the handlers, but over the 10+ years that we've been doing this (in various guises) we haven't found it to be a showstopper. However, splitting the package (into "core" plus other data-themed package(s) that depend on it) should indeed mean that the core components will remain more stable. That alone seems like a pretty good argument in favour of the split.

Nicer Data Interfaces: Assuming data sources get moved to new packages that depend on bowerbird, there may then be scope to tweak the APIs of those packages so they more closely suit their user's needs. By this I mean implement API specific function arguments that mean users do not need to understand the workings of wget arguments. An example would be setting a date range as function arguments, instead of directly as --reject-regex arguments in the case of the README download size example.

Yes, agreed, but it's worth exploring this a bit. The aim of bowerbird is to be flexible enough to be able to handle a wide range of data sources, and to allow new data sources to b e added fairly easily. Hence bowerbird itself doesn't generally attempt to wrap syntactic sugar around the wget arguments, because it's not possible to do it in a general manner. It's not just dates you might want to subset, it could also be geographic space, or vertical levels within a climate model, or different file formats, or any number of other facets depending on the data set --- and different data providers will structure their data directories and files differently. There's simply not enough consistency. But in most cases this sort of subsetting isn't necessary, and the wget calls are fairly straightforward --- so that aim of bowerbird (ability to add new data sources fairly easily) generally seems to hold up. More complicated examples will always arise, but that's just the nature of the task.

Now, for sure, nicer function arguments could be part of more targeted packages (that might themselves depend on bowerbird) --- but doing so immediately makes the maintenance burden high because of the complexity and diversity of those options, and hence such packages tend to focus on a small number of data providers. So where is it best to invest that effort? Is it at the data-fetching level? Maybe. But a data-fetching package might well be used to maintain a data library on behalf of a number of users (this is what we do, anyway). In this scenario you're generally likely to want all files that are part of a given collection, because it's probably impossible to know a priori which ones your users will want. And if it's a centrally-maintained library within an organisation, then only a small number of admin users need to know about the data-fetching internals. Perhaps investing effort in polishing these interfaces is not going to benefit a huge number of people, and it might be better to invest the nicer-function-argument effort at the next level, at the stage of reading data from those files into R. General-purpose packages like raster provide the nuts and bolts for this, but to use them requires specialist knowledge of how the data files and directories are structured. Data-reader functions that provide e.g. a date range argument (and translate that into appropriate raster calls to read the right files and bits of files) are enormously helpful to users who just want to use the data, because this relieves them of the need to know about the file structure details (see e.g. @mdsumner's https://github.com/AustralianAntarcticDivision/raadtools package as one example of this, alongside many others).

Anyway (sorry, that was a bit of a monologue!) none of that argues against splitting the package. But it does encourage some thought as to the best way to build upwards.

@maelle

This comment has been minimized.

Member

maelle commented Aug 21, 2017

@raymondben thanks! I first read this too quickly and thought we'd have two submissions instead of one which isn't the case.

I'll put the submission on hold while you do the splitting, and after that @MilesMcBain & @lwasser can review the core package.

@maelle maelle added the holding label Aug 21, 2017

@MilesMcBain

This comment has been minimized.

MilesMcBain commented Aug 24, 2017

Thanks for your response @raymondben.

Firstly, one small point about the maintenance: With the split design you get much more freedom. You may not even put all your datasources on CRAN, you can choose to leave some on GitHub where the expectations about maintenance are much lower. In this way the maintenance burden could be reduced.

Secondly, I did get some of the way through my review before we put this on hold, so I thought it might be helpful to have the higher level feedback now while things are being actively worked on:

Following bowerbird.Rmd

cf <- bb_config(local_file_root="~/data")
cf <- cf %>% add(bb_sources("CERSAT SSM/I sea ice concentration"))
bb_sync(cf)

This kicked off a 2.5Gb download that blocked my R session which I had to eventually kill. Not a great first user experience ;). I know down the page you point out download size for this source can be reduced. But the total size of the source is not mentioned until the source list at the bottom.

I would like to make the following recommendations:

  • The example should be a small data source that can be downloaded in a minute or two.
  • If the intent is for syncs to be able to be made interactively, there needs to be some indication of progress. A message like "Fetching File n/M (xMb), yMb Total remaining" or a progress bar.
  • If this package is intended to be used exclusively from the command line that should really be how the vignette directs users. Consider including this.

A question: is there a way to check sync status in interactive mode without fecthing these large repos?

Programming model

bb_config returns a sync configuration object. This would be in line with my expectations of API package usage. A very similar pattern is in the usage section of the ropensci/osmdata README.

Unlike osmdata::opq, bb_config seems unusual in that returns its configuration as attriubtes of an empty Tibble which is not mentioned in the documentation. It means to inspect configuration one needs to use attributes(cf) instead of say cf$ and using IDE autcomplete features in inspect object structure. I would argue this is unsual enough that it should feature promenantly in the documentation. Users need to be warned about applying operations to the Tibble that obliterate the attributes - very easy to do in the tidyverse.

I would also make the comment that I do not think this is the idiomatic tidyverse way of approaching things. I only mention this since it seems you use many tidyverse funcitons. I think the tidyverse way would be to use a list-column to capture this information. Take a look at how tidyverse/googledrive stores file metadata in the dribble() class for an example.

Finally, the add() function would need to have a 'bb_' prefix added to ahere to function naming guidelines.

@raymondben

This comment has been minimized.

Member

raymondben commented Aug 24, 2017

Thanks @MilesMcBain - a quick question on the config arrangement (and, yes, it is unfortunately easy to unknowingly lose the attributes). The reason for doing it that way is because those attributes apply to the whole configuration (and thus should only be stored once, not in every row) whereas the rows hold data-source-specific information that is not necessarily common across rows. An alternative structure that we considered would be to create a list object that has something like a data_sources tibble inside it (i.e. the existing bb_config rows) and a config tibble (holding what are now stored as attributes). This is pretty close to what opq does - would that seem better to you? I can't remember now why we didn't go that way. Probably some arbitrary decision of mine!

@mdsumner

This comment has been minimized.

mdsumner commented Aug 24, 2017

We might commit more deeply to a nested-tibble model for this one-to-many attribute problem, but I'm not sure that covers everything, definitely something that we haven't been sure about so the feedback is very welcome.

Also, @MilesMcBain I'm very sympathetic to this user-problem for a 2Gb download, but we are in danger of diluting the entire point of the abstraction of bowerbird - in that it's not ever going to be immediately obvious from a user-perspective why you'd want this package, since the value comes after you've decided to commit to getting all or most or of a particular collection - especially those parts that aren't yet published. Note that while the first-time sync-run is inconvenient, if it successfully completes then when it runs next time, tomorrow, next week it will be hugely fast as it will only check to see that it already has all but the handful of brand-new files from the source, and it will dutifully obtain those new files without any intervention by a human.

We could choose a weekly SST data set, that would be easily downloadable in a single file - the initial commitment for download is minutes or less - and the real value of bowerbird will become apparent next week when that file is updated by the provider, and the automated bowerbird process (that you configure to run nightly, or so) kicks in and detects a change.

But, not many or most users will care about SST as a variable of interest, and those that do will want much higher resolution in time and space, before they care. So, by definition and rather unfotunately the examples that are most compelling and immediately-reproducible are not going to have an actual audience, though they should have an abstract appeal in terms of what else can be done with this tool that is "relevant to me". I do want to ensure we discuss this audience definition, since we are ultimately targetting people like us, those that commit to a large and continuously maintained collection for shared use. (We are also trying to highlight that committing to this is very valuable, an actual game changer for the way we think and approach these issues).

I definitely agree about the "yesno" prompt for whether you really want to download a potentially huge collection, but again that interactivity, normal for "usual use" kind of gets in the way of the task of maintaining a huge collection of automatically updated data. We do see huge value in specific package applications, to provide smarts about available dimensional and variable filters, but we unavoidably end up in very specific areas. I am a bit wary of illustrating the value of bowerbird by exploring a particular data set to the level of actually reading it because as soon as we go that far we've moved away from the bowerbird abstraction of being an automatable, configure-anything, data-getter.

Thanks for the discussion and feedback! (I do think the weekly SST example is a good one, and @raymondben and I should look carefully at a user-case for illustrating a full workflow - and probably we need to tell the story of the pending updates that an easily-configured bowerbird robot will sort out for us tomorrow, next week, in the months ahead).

@maelle

This comment has been minimized.

Member

maelle commented Sep 30, 2017

@raymondben @mdsumner just checking in. 😸 Have you made enough progress on the splitting for @MilesMcBain & @lwasser to soon review the package? Thanks!

@raymondben

This comment has been minimized.

Member

raymondben commented Oct 1, 2017

Hi guys, sorry, we've been held up - in part by a bug in sys but Jeroen has kindly resolved that now. I don't think we're far off, hopefully get back to you this week.

@maelle

This comment has been minimized.

Member

maelle commented Oct 1, 2017

Ok thanks for the update!

@raymondben

This comment has been minimized.

Member

raymondben commented Oct 9, 2017

@MilesMcBain @lwasser @maelle

Sorry for the hiatus, think we're OK to resume the review process now. The major change is that the data source definitions have been removed, except for a few representative examples. (The full set of definitions has moved to https://github.com/AustralianAntarcticDivision/blueant. This is a separate package and not part of this ropensci submission, but it's there should you want to get some idea of how a themed-data-package might make use of bowerbird.)

Please note that until sys version 1.5 hits CRAN, you'll need to install this from github using devtools::install_github("jeroen/sys") prior to installing bowerbird. Bowerbird requires v1.5 in order to avoid a Windows bug in 1.4. Also note that check is currently failing on both Travis and AppVeyor, but because of some Ubuntu-package issue on Travis (that I haven't yet figured out) and on AppVeyor because of the sys issue. But bowerbird is passing check locally on both Windows and Ubuntu, so I think it's OK. Update: sys v1.5 is now on CRAN.

Some replies to Miles' partial review:

Not a great first user experience

Yah, fair enough. The example is now a small one, and in addition we've switched from using system2 to sys, which means that the user can interrupt the external process without needing to kill their session.

If this package is intended to be used exclusively from the command line ...

No, it's not exclusively this. See updated comments in README. Also, we can't give progress indicators (also discussed in README).

is there a way to check sync status in interactive mode without fetching these large repos?

I'm not sure what you mean by this, sorry.

bb_config seems unusual in that returns its configuration as attributes of an empty tibble ...

That structure has now changed and doesn't rely on attributes to carry the repo-wide settings. bb_config now returns a list with two elements: data_sources (a tibble) and settings (a list of repo-wide settings).

All exported function names are now prepended with bb_.

Update: fix found for Travis issue.
Update 2: sys v1.5 is now on CRAN so no special preinstall needed for that, and AppVeyor is passing.

@maelle

This comment has been minimized.

Member

maelle commented Oct 9, 2017

Thanks @raymondben! @MilesMcBain & @lwasser, you can now proceed with the review. 😺

@maelle

This comment has been minimized.

Member

maelle commented Oct 9, 2017

Note: I've updated the reviews due date to the 2017-10-30.

@maelle maelle removed the holding label Oct 9, 2017

@MilesMcBain

This comment has been minimized.

MilesMcBain commented Oct 24, 2017

Thanks for the update @raymondben. I look forward to reviewing the new version.

Regarding this comment:

is there a way to check sync status in interactive mode without fetching these large repos?

Since the package is intended to be used interactively I thought it would be helpful if I can check I am on the latest version without initiating a large download. If it turns out I need to sync, I'd probably like to tee it up in separate session so that it doesn't block my main work environment. It might be this is already possible and I did not see the function.

@MilesMcBain

This comment has been minimized.

MilesMcBain commented Oct 24, 2017

@maelle @noamross, I would like to flag I am not going to be able to make that review deadline. Would it be okay to have my review in by the 7th of November?

@raymondben

This comment has been minimized.

Member

raymondben commented Jan 8, 2018

Hi @maelle - think we've done most/all of the major code changes (see postrev branch), now a matter of revising all documentation & vignettes. Unfortunately this isn't high priority work right now, so it may not be fast turnaround ...

@maelle

This comment has been minimized.

Member

maelle commented Jan 9, 2018

Okay thanks, what's your timeline then?

@lwasser has just been recruited for another review by @karthik 😉

@raymondben

This comment has been minimized.

Member

raymondben commented Jan 14, 2018

@maelle actualy probably not far off now, we have given it a good thrashing in the last few days. In the next week or so, with a bit of luck?

@maelle

This comment has been minimized.

Member

maelle commented Jan 25, 2018

@raymondben @mdsumner just checking up on this submission, do you think you'll soon be able to give an update to reviewers? Thank you! 😺

@raymondben

This comment has been minimized.

Member

raymondben commented Jan 30, 2018

Response to reviews

@MilesMcBain @lwasser - thanks again for your reviews. Responses to each of your points appear below. Please note that the revised version of bowerbird is in its postrev branch. It will be merged into master once the revisions are complete and dependent packages have been updated to match.

Other notes: there is still not a lot of documentation to help power users who want to write their own handler functions (there is a brief section in the vignette). However, since this is likely to be a vanishingly small number of users we don't feel it's a showstopper right now. There is an issue for this though so that it doesn't drop off the radar (ropensci/bowerbird#15).

Leah - Review Comments

I'd like to see several clear uses cases for this package that include the user personas who are accessing the data - ie how much does the audience know for each use case? particularly as it relates to remote sensing data. I see many potential challenges for users who are less well versed in working with these data - particularly in the data size realm and dealing with API calls and associated arguments.

Various changes have been made to address this, notably the "Users: level of usage and expected knowledge" section has been added to the README and vignette, along with an expanded set of examples and guidance on writing data source definitions.

I'd also like to see a bit more documentation and warnings associated with potentially large requests and associated resource availability on a users machine.

bb_sync now takes a confirm_downloads_larger_than parameter, which (during interactive use) will cause the process to pause for confirmation before downloading a source larger than the specified size (100MB by default). Re: "resource availability on a users machine" - see related comments below.

I also think that all of the function documentation needs to be developed further.

Yup. All function documentation has been revised.

Minor but i do suggest adding the library(bowerbird) call to the readme vignette

Done.

this URL doesn't work for me: http://oceancolor.gsfc.nasa.gov/

Ooops, should have been https, sorry. Whatever browser you were using should have redirected, but obviously it didn't. (Changed)

Please follow the tidy styleguide: http://style.tidyverse.org/ All vignettes, documentation, etc. are missing spaces around function argument assignment operators and after commas.

Style is to some extent personal preference (indeed, the tidyverse style guide itself says that "All style guides are fundamentally opinionated ... Some decisions genuinely do make code easier to use ... but many decisions are arbitrary."). Unless it's mandatory for ropensci (we don't believe that it is?) we would prefer not to follow it.

Add more explicit instructions and discussion about wget(). Add instructions for both Mac and Windows and be explicitly about which install path is for which OS. If you expect to only support advanced users, then they may be able to figure it out.
Adjust error messages surrounding wget() to be windows vs mac specific to not mislead users - like me :) .

All the wget-related stuff has been revised with more platform-specific instructions, and clear notes on bb_install_wget to indicate that it only supports Windows (at this stage, anyway).

bb_config() setups up the data "warehouse" directory where data will be kept.
Recommendation: Test whether the input directory exists in bb_config()

We do that check in bb_sync, because bb_config simply sets up a configuration object. There is no assumption that it will necessarily be used on the user's local machine, or run right now (maybe it will be run on a different machine, or at a later time when the directory might have been created by some other process). The function documentation is pretty clear about this behaviour (see the create_root parameter in help("bb_sync")). We haven't changed this.

When I first look at the example below, if I am a less experienced user, I may think that the reference is actually the source_url.

reference is now named doc_url

I also may be overwhelmed by number of arguments that I need to parse through.

That's unfortunately the nature of the beast. For what it's worth, we don't particularly enjoy the complexity either. But in order to be a versatile, flexible tool that can cope with arbitrarily-structured repositories from different data providers, there are a lot of arguments to cover different situations.

Here are some suggestions to consider:
Provide the most bare bones of bare bones vignettes that only use required arguments. Currently the vignettes use both required and non required arguments.

The vignette now has four examples of writing a data source, stepping through various levels of detail and demonstrating all three of the packaged handler functions.

Consider breaking up this mega function into two parts. Maybe one is a list or vector of metadata items and two is a list of api items?
Or perhaps there is another clever way to help a user generate this function call?

Not that we are aware of.

At a minimum break up the function documentation to have two sections of arguments to make the API vs the metadata explicit

As far as we are aware there is no way to have separate sections of arguments with roxygen. However, each mandatory function parameter here is clearly labelled as "required".

I think there is a function passed to an argument here: method=quote(bb_handler_wget). This will confuse a user because when you look up the documentation for this, you see additional arguments.

The use of quote has now been removed: these are now specified as lists that will be more familiar to users, i.e. method=list("function_name_as_a_string",argument1=value1,argument2=value2). Also the additional arguments have now been hidden from the user (see more detail about this in the response to Miles' comments below).

Rework documentation: for instance the earthdata function documentation page, only provides a link to how to register for the website. I am left uncertain about how to actually use the API to get earth data - data. And i've used that website a good bit - granted last year's version of it anyway. :)
Vignettes that explain how to identify the source_url string would be helpful. It is not clear to me how i would setup access to an earth data download. For instance - what if i wanted to get snow cover from NSIDC? https://n5eil01u.ecs.nsidc.org/VIIRS/VNP10.001/ Or what if i wanted landsat vegetation indices? Vignettes that explain how to access specific datasets (perhaps ones identified in your use cases) would help significantly. On the earth data website it actually took me a while to find any URL's that lead to data. This isn't Bowerbirds problem - it is Nasa's. BUT could this package guide a user through how to find an appropriate source_url for a dataset? Just a could examples...

Documentation has had a rewrite, as mentioned above. We feel that detailed tutorials on using data provider websites are beyond the scope of package documentation. Nevertheless, the vignette now includes a fuller example of writing an Earthdata data source. For the record, that URL you give above is your data source URL: it points directly to that data set.

I didnt notice this in the vignettes and perhaps i missed it in the code. is there a summary function that gives you an overview table of all of the data in your "warehouse"? if there is that would be useful and please add that to the vignette

Yes, bb_summary does this. The data source list at the end of the vignette is produced using this. The vignette now draws more attention to this function.

Why is this called bowerbird? The name doesn't fully seem to relate to the library function BUT maybe i need to read more about bowerbirds?!

Bowerbirds (real ones) collect shiny things to decorate their bowers with. Bowerbird (the package) lets you collect nice shiny data sets and add them to your collection.

How does the package handle data updates? For instance, last year USGS reprocessed all of the Landsat 8 data with the create of collections. Thus, all of the data that I downloaded last year, are different now. Different naming conventions, some changes in files such as the cloud cover layers, etc. Will Bowerbird help a user identify this type of change and guide them through how they want to handle the change (ie re-download data, start a new library with the newly processed data)? How will it do that?

Bowerbird just downloads whatever the data provider makes available, following whatever the data source configuration parameters are. As explained in the vignette, bowerbird doesn't know in detail about the data structures or anything else. If they change, the new version will be downloaded.

Many of these datasets are extremely large. Does bowerbird help a user track resource availability on their computers - especially as it's downloading zips and unzipping? Bowerbird could be a beautiful helper package to manage data. It also could allow for less experiences users to make big mistakes in terms of trying to download data that are too large for their computing resources.

Each data source has a collection_size parameter that gives an indication of the disk space required. The bb_sync function will now also prompt for confirmation on large downloads. Beyond that, we do not attempt to manage a user's own disk - that's the user's responsibility.

Can bowerbird help a user not make huge requests or provide a warning? The beauty of this package is the ease in which you can download and maintain data. The danger is the same - its very easy to kick off some huge calls that may be fill a users hard drive.

A confirm_downloads_larger_than parameter has been added, which defaults to 100MB. If R is running in interactive mode, user confirmation will be sought before initiating downloads larger than this.

In general I have a question that i'm sure everyone involved has grappled with: Do we want to encourage people to store data locally? Generally remote sensing data are so large that we may not even want to process it locally. Things are moving towards the cloud and we do want to avoid people starting to build their own mini DAACs on their machines.

Yes, as a general statement it's probably reasonable to encourage users to avoid individually downloading large data sets (especially multiple individuals within the same institution downloading the same large data set). But at some point someone has to store data locally. For example, we routinely run analyses that require the entire satellite data record of some parameter (e.g. sea ice or SST). This can't be done by issuing remote data requests, it would be cripplingly slow. Yes, we can run these analyses "in the cloud", and we have virtual servers for exactly this purpose ... but then where does that virtual server get its data from? That data storage ideally needs to be on the same local network as the compute facility so that access is fast. Someone, somewhere, has to maintain that collection of data files. This is one use case for bowerbird. And having one data librarian maintain a collection of data sets on behalf of a number of local users will reduce the need for individuals to download their own copies.

I didn't see a contributing file or description of how contributing works for this package. I also don't see bugreports / maintainer

The bugreports URL is in the DESCRIPTION file. Maintainer is also there (but it's hidden! The "cre" (creator) role in Authors@R gets interpreted as maintainer). CONTRIBUTING.md and code of conduct files have been added.

is this being submitted to JOSS? How do I know that?

No, it's not. (You can see that in the submission details in the onboarding issue - I don't know if as a reviewer you get any special notification of that though).

NEWS is missing

There is no NEWS file yet: to date, the package has been in an active development stage with everything liable to change, so it doesn't make sense to document all this via NEWS. We will add a NEWS file once the onboarding process is complete and things seem stable enough to warrant it.

Miles - Review Comments

In the self-titled vignette the use of the postprocess argument is only mentioned in passing with respect to decompression. I'd argue it is major functionality with enough usecases that deserves its own subheading.

It now has a subheading.

As I mention in the function documentation section. This package could benefit from a greater breadth of examples with commentary that explains how features of the datasource result in each choice of method flag and postprocessing function.

See response to Leah's review above - the README/vignette now has four examples of increasing complexity, and demonstrating all three of the packaged handler functions.

Improve function documentation

General improvements all round have been made.

bb_settings - Documentation does not explain the usecase for this function. It is not clear why it would be needed in addition to bb_config. The oceandata example made this clearer.

Function documentation now explains its role a little more clearly.

bb_validate - Documentation does not explain why this function does or is used for.

This function is now not exported - it is useful internally but probably not for the user.

bb_cleanup - The use of the bb_config argument is mystifying. The function is specified as part of the config?
bb_decompress - As above.

For these functions, as well as the handler functions (bb_handler_wget, bb_handler_oceandata, bb_handler_earthdata) there are several arguments that are passed automatically when bb_sync is run, and which don't need to be specified by the user. Previously these were exposed as regular function arguments, which was indeed confusing. These have now been masked from regular users by use of the dots argument. Note that developers who wish to write their own handler functions will need to know about these arguments: there is a section of the vignette that will explain this. But regular users, who will almost certainly be the overwhelming majority of the user base, will find the new structure much less confusing.

The documentation for the handler and other functions that are called by bb_sync now also makes it clear that these functions are not intended to be called directly by the user, but specified as part of a data source definition.

bb_fingerprint() - Unclear how this function is expected to be used. It gets a mentioned in the 'Data provenance' vignette but that does not explain how the output should be used to detect changes.

Data provenance, and frameworks to help manage it, are evolving areas. The bb_fingerprint function was added to bowerbird following our presubmission enquiry, during which Noam asked for this sort of functionality to be added. Unfortunately, and as explained in the README, bowerbird doesn't know a great deal about the data sources that it is downloading. It generally just starts at the top-level URL and recurses downwards. So there is a real limit to how much it can help with data provenance. The bb_fingerprint function basically provides a snapshot of a data source's files. It's not clear to us how useful this is, and it's not intended to be a mature aspect of the package. As R packages for data provenance evolve, it may become clear how to better integrate the bb_fingerprint-style functionality with those frameworks, and we can work towards that. Alternatively, it might become clear that bb_fingerprint is basically useless, and we can drop it.

bb_handler_wget - Unclear on the effect or valid usecase for local_dir_only arg. Did not appear to affect the operation of bb_sync.

Is essentially an internally-used argument, which (like the config and other automatically-passed arguments to handler functions) has now been hidden from the user.

bb_source - The documentation for this key function is not sufficient for me to be able to use it confidently. access_function sounds like it is important but does not appear in any examples and the explanation is vague. The example uses some method flags that are not explained. I think this needs more examples and more commentary of said examples. Explain the nature of the web page and data and how that resulted in the choices of each method_flag, and postprocess step. Users would benefit from multiple examples of increasing complexity so they can get a sense of the usespace of bowerbird.

Function documentation has in general been improved, and the vignette examples give more detail.

Explain config first argument
Either the help or vignettes or both need to clarify the mysterious first argument of bb_handler_wget, the bb_decompress family, and bb_cleanup - the vignette examples and the built-in example data sources use these functions without supplying this first argument, even though it has no default. It was only after digging into the code of bb_sync that I saw why this argument exists and how it is populated.

See above: now hidden from the user.

On a related but more general note, you might also cover in a vignette that anyone needing to implement their own handler needs to write a function that must take the special arguments config, verbose, and local_dir_only in addition to any context specific args.

Ditch dplyr or update to tidyeval

dplyr has been ditched!

Unusual usage of quote

Now removed: the parts of data source definitions that need to specify functions (and possibly arguments) now do so via a list("function_name_as_string",argument1=value1,argument2=value2) type syntax.

Make some wget method flags into arguments
In bb_config you did a really nice thing where you created a local function argument (clobber) to provide a nicer API to the wget clobber arguments. The help for this argument is also quite detailed giving me confidence I could use it correctly if need be. I think this pattern is the bar that the rest of the package should aspire to.

The bb_wget function now has explicit arguments for a number of common wget command line parameters. The function documentation for each of these explains their use more fully (generally, the underlying wget help text is copied or paraphrased, with additional bowerbird-specific explanation in some cases). This change (along with dropping quote also makes the data source syntax more appealing, because these arguments can be passed as regular list elements.

what you would be saying to us novices is: "Don't worry, I've understood all the wget arguments so you don't have to. These are the ones you care about and this is how you use them"

We like your optimism! wget still gets us scratching our heads on occasion. But for all its faults and complexity, it is a powerful tool.

As things are there is also some weirdness with the fact that some wget flags are handled automatically by the handler eg: -no-if-modified-since --restrict-file-names=windows --progress=dot:giga while others are expected to be supplied. What happens if they clash?

There are a number of wget flags that can clash, and other than trying them all in combination we are not aware of any way of knowing how those clashes are resolved. That's really an issue with wget itself.

I did not find top-level documentation using ?bowerbird as per rOpenSci packaging guidelines.

Added.

Widespread usage of cat(sprintf(...)) for errors/warnings etc. Should replace with message()/stop() as per guidelines.

This is a little more complicated than it may seem. First, the majority of these cat usages are for progress information, not for errors or warnings. We don't think that using message for these is ideal: firstly, message (according to its help) is intended for "diagnostic messages which are neither warnings nor errors, but nevertheless represented as conditions". Vague though this definition is, we don't think progress information fits into this category. Secondly, and perhaps more importantly, bowerbird will be used in unattended/scheduled job mode, for which users will want to capture all output into a log file. Using cat(...) consistently means that a simple sink by the user will capture all of this output. The guidelines that suggest using message make this suggestion because it's easier for the user to suppress output if they want to. For bowerbird, specifying verbose=FALSE (which is now the default setting) will suppress all of this output.

There are, however, some instances where we use cat for what appear to be errors or warnings. The reasons for this are again because of the unattended usage model, and also because of the sequential nature of the sync process. bb_sync has a catch_errors parameter. If TRUE (which it is by default) then errors are caught and the immediate action is cancelled, with the error message issued via cat - but the overall sync process continues. Otherwise an error in one data source would prevent all others from running. Issuing a warning would tend to be unhelpful in this application as well, because the user would not receive the warning in the context of its accompanying progress information [with bb_sync(...,verbose=TRUE)] and it would be very difficult to use that information to track down problems. Even more annoyingly, some parts of the bb_sync process generate things that appear to be errors, but which are isolated failures in some part of the process. stopping on these is not appropriate, because it will halt the entire sync process for that data source. An example of this is calling wget to do a recursive download. If there is a 404 server response to any request during that download, wget will return a non-zero status code. Often these 404's aren't errors that matter (e.g. broken external links), so we don't stop.

For functions outside of the bb_sync chain (which won't typically be run in unattended mode, and which aren't issuing "progress information" as above), we do use message, stop, and warning.

We've gone through all usages of cat and ensured that they are consistent with the above logic.

Test coverage is good overall but sticks mostly to 'happy path'. Untested error paths appear in sync.R, utils.R, wget.R, postprocess.R etc. I feel like a jerk for bringing this up. I know how unfun it is to write test cases for these.

We've added a few more error path tests, and will continue to add appropriate tests to the suite as development progresses.

No Maintainer in DESCRIPTION.

Actually there is - the "cre" (creator) role in Authors@R gets interpreted as maintainer.

@maelle

This comment has been minimized.

Member

maelle commented Jan 30, 2018

Thanks! @lwasser @MilesMcBain could you please have a look at the answer?

Reg. the tidyverse style, adding white spaces will make the code much easier to read and using styler it should be a one function call (styler::style_pkg()). 😉

@lwasser

This comment has been minimized.

lwasser commented Jan 31, 2018

hi @maelle i think (without actually going through another review) that their feedback looks very thorough and patient considering all of the review comments :) i also didn't know about styler !!! so i'm excited to try that. I have another review that i'm supposed to do by feb 10 so please consider that if there is another round with bowerbird! i'll likely be doing the modistsp review close to the 10th given my current schedule :)

@maelle

This comment has been minimized.

Member

maelle commented Jan 31, 2018

Thanks a lot @lwasser! Yes styler looks super useful (I'm yet to actually use it!)

@raymondben I forgot to tell you to add the rOpenSci review badge to the README! 🙈

[![](https://badges.ropensci.org/139_status.svg)](https://github.com/ropensci/onboarding/issues/139)

@MilesMcBain

This comment has been minimized.

MilesMcBain commented Feb 11, 2018

Hi All, just making a note that I'm picking up this up again now. I should be able to respond fully within the next week. 👍

@maelle

This comment has been minimized.

Member

maelle commented Feb 11, 2018

Thanks @MilesMcBain !

@MilesMcBain

This comment has been minimized.

MilesMcBain commented Feb 19, 2018

Thanks @raymondben and @mdsumner!

An excellent job on the documentation and vignettes. The main vignette is now a stand out - one of the best I have seen. It gives this package a great chance of getting some use about the place. 👏 😄

I am also happy with the way you addressed my other comments. Thankyou for the explanation on the use of cat(). I think the verbose option is in the spirit of the rule as you suggest, I also agree that within the context of try-catching, this is a reasonable option and I take no issue with it.

I've updated my review block above with the final ✔️

One minor comment: When reviewing some of the changes in the code I noticed a lot of old code sitting around in commented out blocks. This was true in maybe two thirds of bowerbird R files. This is definitely a personal thing, but I find these distracting when reviewing code and they also slightly decrease my trust in the code. "What was/is the bug here that's not fully resolved?" I suggest you remove as many of these as possible.

Otherwise, congratulations on polishing up this package and it has been my pleasure to review it.

@raymondben

This comment has been minimized.

Member

raymondben commented Feb 19, 2018

Thanks indeed @MilesMcBain. Re: commented-out code, yes, I'll take the blame for that, I do rather have a tendency to leave it littered around. I'll have a purge ...

@maelle

This comment has been minimized.

Member

maelle commented Feb 19, 2018

Thanks a lot @MilesMcBain!

@raymondben please update this thread when you've done that, so that I might take a last look before approval.

Reg. purging comments you could count the number of lines you've suppressed by using cloc at different commits. 😉

@raymondben

This comment has been minimized.

Member

raymondben commented Feb 19, 2018

@maelle , I've already cleaned them out. (master branch)

@maelle

This comment has been minimized.

Member

maelle commented Feb 19, 2018

Awesome, I'll have a look later today/this week!

@maelle

This comment has been minimized.

Member

maelle commented Feb 20, 2018

I have started looking, really great docs as Miles say!

I was wondering whether it'd be good to split the README into a more minimal README and a few vignettes linked from the README? E.g. "Defining data sources" could be a vignette. Or add a table of contents at the top of the README? (but the website might still be easier to browse?)

@maelle

This comment has been minimized.

Member

maelle commented Feb 20, 2018

Approved! 👏

Thanks a lot @raymondben @mdsumner @lwasser @MilesMcBain for your work! A very productive review process IMO!

I have three suggestions:

  • improving readability/browsability of the README via splitting it/adding a table of contents (see previous comment)

  • you still have a few long lines flagged by goodpractice::gp, consider making them shorter.

  • I don't think the origin of the name is in the docs? It'd be cool.

Now here is the list of things you have to do before I close this issue 😉

  • [] Transfer the repo to the rOpenSci organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You'll be made admin once you do.

  • [] Add this badge to your README

[![](https://badges.ropensci.org/139_status.svg)](https://github.com/ropensci/onboarding/issues/139)

  • [] Add the rOpenSci footer to the bottom of your README
[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)

  • [] Fix any links in badges for CI and coverage to point to the ropensci URL. (We'll turn on the services on our end as needed)

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or long-form post with more narrative about its development. ((https://ropensci.org/blog/). If you are, @stefaniebutland will be in touch about content and timing.

@stefaniebutland

This comment has been minimized.

stefaniebutland commented Feb 20, 2018

@raymondben @mdsumner We'd love to publish a blog post about bowerbird. Given @MilesMcBain's comment about your vignette, it's bound to be good. Here are some editorial and technical guidelines: https://github.com/ropensci/roweb2#contributing-a-blog-post.

Was just looking back at a discussion of "the journey" from code for my own use, to code that I want others to find useful, where @noamross suggested a blog post using bowerbird as example. I was thinking that a post from a pkg submitter perspective (submit earlier vs later dilemma, docs, challenges of moving beyond personal cr*p code) would be very well received. However! That sounds more like two posts - one on bowerbird itself, and one about process.

Either way, this is optional and only if you have the capacity and interest to do this. Discussion reminded me that I need to think about how we can help authors and reviewers with this process

Let me know what you think. No rush.

@maelle

This comment has been minimized.

Member

maelle commented Mar 3, 2018

👋 @raymondben @mdsumner could you please soon do the different items of the checklist above including transferring your repo? Thanks!

@maelle maelle referenced this issue Mar 3, 2018

Closed

NLMR #188

14 of 19 tasks complete
@raymondben

This comment has been minimized.

Member

raymondben commented Mar 3, 2018

@maelle - done. Sorry, was waiting for a revised footer to be finalized, but looks like that may take a while so I've gone ahead and transferred now.

@maelle

This comment has been minimized.

Member

maelle commented Mar 3, 2018

Cool, thanks! Could you also add the review badge mentioned in the checklist?

I've activated the repo in Appveyor.

@maelle

This comment has been minimized.

Member

maelle commented Mar 3, 2018

Note that the badges do not render now, but this issue will soon be fixed, so please add it to your README before I close this issue. :-)

@raymondben

This comment has been minimized.

Member

raymondben commented Mar 3, 2018

The review badge is in the readme, just not rendering. It was showing under our org, but not now.

@maelle

This comment has been minimized.

Member

maelle commented Mar 3, 2018

Aaah thanks and sorry! Perfect! I'm closing the issue but this doesn't prevent the discussion of blog posts to continue. 😸

@maelle maelle closed this Mar 3, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment