Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.Sign up
A package for maintaining a local collection of data sets from a range of data providers. Bowerbird can mirror an entire remote collection of files, using
Nothing to our knowledge that really does the same thing. Some similarity to https://github.com/ropensci/rdataretriever, though rdataretriever seems to be angled towards biodiversity data sets in particular and creating sensible local database structures for them. Bowerbird is focused on mirroring remote data to a local file system, and providing some functions around data provenance. Passing overlap with http packages (httr, crul) but these are generally intended for single-transaction sort of usage. Jeroen's
Confirm each of the following by checking the box. This package:
Additional notes re: our presubmission enquiry
General note: the package is not in a final polished state, but we think far enough advanced (and stable enough) to be a good point for onboarding consideration.
Fair suggestion, and one that we've considered - and maybe that split is still reasonable to consider down the track. But for now, at least, we think it's better to keep core-functionality and the data-source-definitions bundled together.
It's up to the user where they want to put their data. We do make a suggestion in the README and vignette for users to consider rappdirs.
Thanks for your submission @raymondben! I'm currently seeking one reviewer (already found one)!
Here is the output of
It is good practice to ✖ write unit tests for all functions, and all package code in general. 79% of code lines are covered by test cases. R/aadc_eds_handler.R:14:NA R/aadc_eds_handler.R:15:NA R/aadc_eds_handler.R:16:NA R/aadc_eds_handler.R:17:NA R/aadc_eds_handler.R:27:NA ... and 385 more lines ✖ write short and simple functions. These functions have high cyclomatic complexity:bb_source (56). ✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows that are about 80 characters wide. Try make your lines shorter than 80 characters R\aadc_eds_handler.R:28:1 R\aadc_eds_handler.R:32:1 R\aadc_eds_handler.R:36:1 R\aadc_eds_handler.R:38:1 R\aadc_eds_handler.R:40:1 ... and 480 more lines ✖ avoid calling setwd(), it changes the global environment. If you need it, consider using on.exit() to restore the working directory. R\aadc_eds_handler.R:44:5 R\amps_handler.R:44:9 R\provenance.R:50:5 R\sync.R:63:5 R\utils.R:77:5 ... and 2 more lines
With the Editor's permission, I am posting some feedback ahead of a full review in the interests of discussing potential structural changes as early as possible.
The issue of separating the core functionality from data sources was raised by the editors and the more I poke around under the hood, the more I am convinced this will also be my recommendation.
Part of my motivation is selfish. The number of sources
Thanks @MilesMcBain. I'll confess I'm not 100% sold on the splitting idea, but I'm also not vehemently against it and your points are good. @mdsumner and I will have a chat and see what we come up with. In the meantime, suggest that you and @lwasser don't spend too much more time on it, since it's likely to look fairly different if we do split it.
OK, so: on reflection, we agree that splitting is probably a good idea. @maelle, what's the preference here? Do you want to close this issue and we resubmit, or do we just carry on here? (The split version --- not entirely finalized yet --- is currently sitting in the "split" branch.) Note that we'll only submit the core part of the package at this stage, including a few example data sources. The extended collection of data sources will be hived off into a separate package for further work when time permits, and possible submission.
For completeness, some thoughts on Miles' comments above:
Moving sources into one or more other packages doesn't reduce the maintenance burden, it just spreads it across packages. We see occasional but not overwhelming changes to remote servers that require tweaks to the handlers, but over the 10+ years that we've been doing this (in various guises) we haven't found it to be a showstopper. However, splitting the package (into "core" plus other data-themed package(s) that depend on it) should indeed mean that the core components will remain more stable. That alone seems like a pretty good argument in favour of the split.
Yes, agreed, but it's worth exploring this a bit. The aim of bowerbird is to be flexible enough to be able to handle a wide range of data sources, and to allow new data sources to b e added fairly easily. Hence bowerbird itself doesn't generally attempt to wrap syntactic sugar around the wget arguments, because it's not possible to do it in a general manner. It's not just dates you might want to subset, it could also be geographic space, or vertical levels within a climate model, or different file formats, or any number of other facets depending on the data set --- and different data providers will structure their data directories and files differently. There's simply not enough consistency. But in most cases this sort of subsetting isn't necessary, and the wget calls are fairly straightforward --- so that aim of bowerbird (ability to add new data sources fairly easily) generally seems to hold up. More complicated examples will always arise, but that's just the nature of the task.
Now, for sure, nicer function arguments could be part of more targeted packages (that might themselves depend on bowerbird) --- but doing so immediately makes the maintenance burden high because of the complexity and diversity of those options, and hence such packages tend to focus on a small number of data providers. So where is it best to invest that effort? Is it at the data-fetching level? Maybe. But a data-fetching package might well be used to maintain a data library on behalf of a number of users (this is what we do, anyway). In this scenario you're generally likely to want all files that are part of a given collection, because it's probably impossible to know a priori which ones your users will want. And if it's a centrally-maintained library within an organisation, then only a small number of admin users need to know about the data-fetching internals. Perhaps investing effort in polishing these interfaces is not going to benefit a huge number of people, and it might be better to invest the nicer-function-argument effort at the next level, at the stage of reading data from those files into R. General-purpose packages like
Anyway (sorry, that was a bit of a monologue!) none of that argues against splitting the package. But it does encourage some thought as to the best way to build upwards.
Thanks for your response @raymondben.
Firstly, one small point about the maintenance: With the split design you get much more freedom. You may not even put all your datasources on CRAN, you can choose to leave some on GitHub where the expectations about maintenance are much lower. In this way the maintenance burden could be reduced.
Secondly, I did get some of the way through my review before we put this on hold, so I thought it might be helpful to have the higher level feedback now while things are being actively worked on:
This kicked off a 2.5Gb download that blocked my R session which I had to eventually kill. Not a great first user experience ;). I know down the page you point out download size for this source can be reduced. But the total size of the source is not mentioned until the source list at the bottom.
I would like to make the following recommendations:
A question: is there a way to check sync status in interactive mode without fecthing these large repos?
I would also make the comment that I do not think this is the idiomatic
Thanks @MilesMcBain - a quick question on the config arrangement (and, yes, it is unfortunately easy to unknowingly lose the attributes). The reason for doing it that way is because those attributes apply to the whole configuration (and thus should only be stored once, not in every row) whereas the rows hold data-source-specific information that is not necessarily common across rows. An alternative structure that we considered would be to create a list object that has something like a
We might commit more deeply to a nested-tibble model for this one-to-many attribute problem, but I'm not sure that covers everything, definitely something that we haven't been sure about so the feedback is very welcome.
Also, @MilesMcBain I'm very sympathetic to this user-problem for a 2Gb download, but we are in danger of diluting the entire point of the abstraction of bowerbird - in that it's not ever going to be immediately obvious from a user-perspective why you'd want this package, since the value comes after you've decided to commit to getting all or most or of a particular collection - especially those parts that aren't yet published. Note that while the first-time sync-run is inconvenient, if it successfully completes then when it runs next time, tomorrow, next week it will be hugely fast as it will only check to see that it already has all but the handful of brand-new files from the source, and it will dutifully obtain those new files without any intervention by a human.
We could choose a weekly SST data set, that would be easily downloadable in a single file - the initial commitment for download is minutes or less - and the real value of bowerbird will become apparent next week when that file is updated by the provider, and the automated bowerbird process (that you configure to run nightly, or so) kicks in and detects a change.
But, not many or most users will care about SST as a variable of interest, and those that do will want much higher resolution in time and space, before they care. So, by definition and rather unfotunately the examples that are most compelling and immediately-reproducible are not going to have an actual audience, though they should have an abstract appeal in terms of what else can be done with this tool that is "relevant to me". I do want to ensure we discuss this audience definition, since we are ultimately targetting people like us, those that commit to a large and continuously maintained collection for shared use. (We are also trying to highlight that committing to this is very valuable, an actual game changer for the way we think and approach these issues).
I definitely agree about the "yesno" prompt for whether you really want to download a potentially huge collection, but again that interactivity, normal for "usual use" kind of gets in the way of the task of maintaining a huge collection of automatically updated data. We do see huge value in specific package applications, to provide smarts about available dimensional and variable filters, but we unavoidably end up in very specific areas. I am a bit wary of illustrating the value of bowerbird by exploring a particular data set to the level of actually reading it because as soon as we go that far we've moved away from the bowerbird abstraction of being an automatable, configure-anything, data-getter.
Thanks for the discussion and feedback! (I do think the weekly SST example is a good one, and @raymondben and I should look carefully at a user-case for illustrating a full workflow - and probably we need to tell the story of the pending updates that an easily-configured bowerbird robot will sort out for us tomorrow, next week, in the months ahead).
Sorry for the hiatus, think we're OK to resume the review process now. The major change is that the data source definitions have been removed, except for a few representative examples. (The full set of definitions has moved to https://github.com/AustralianAntarcticDivision/blueant. This is a separate package and not part of this ropensci submission, but it's there should you want to get some idea of how a themed-data-package might make use of bowerbird.)
Some replies to Miles' partial review:
Yah, fair enough. The example is now a small one, and in addition we've switched from using
No, it's not exclusively this. See updated comments in README. Also, we can't give progress indicators (also discussed in README).
I'm not sure what you mean by this, sorry.
That structure has now changed and doesn't rely on attributes to carry the repo-wide settings.
All exported function names are now prepended with
Update: fix found for Travis issue.
Thanks for the update @raymondben. I look forward to reviewing the new version.
Regarding this comment:
Since the package is intended to be used interactively I thought it would be helpful if I can check I am on the latest version without initiating a large download. If it turns out I need to sync, I'd probably like to tee it up in separate session so that it doesn't block my main work environment. It might be this is already possible and I did not see the function.
Response to reviews
@MilesMcBain @lwasser - thanks again for your reviews. Responses to each of your points appear below. Please note that the revised version of bowerbird is in its postrev branch. It will be merged into master once the revisions are complete and dependent packages have been updated to match.
Other notes: there is still not a lot of documentation to help power users who want to write their own handler functions (there is a brief section in the vignette). However, since this is likely to be a vanishingly small number of users we don't feel it's a showstopper right now. There is an issue for this though so that it doesn't drop off the radar (ropensci/bowerbird#15).
Leah - Review Comments
Various changes have been made to address this, notably the "Users: level of usage and expected knowledge" section has been added to the README and vignette, along with an expanded set of examples and guidance on writing data source definitions.
Yup. All function documentation has been revised.
Ooops, should have been https, sorry. Whatever browser you were using should have redirected, but obviously it didn't. (Changed)
Style is to some extent personal preference (indeed, the tidyverse style guide itself says that "All style guides are fundamentally opinionated ... Some decisions genuinely do make code easier to use ... but many decisions are arbitrary."). Unless it's mandatory for ropensci (we don't believe that it is?) we would prefer not to follow it.
All the wget-related stuff has been revised with more platform-specific instructions, and clear notes on
We do that check in
That's unfortunately the nature of the beast. For what it's worth, we don't particularly enjoy the complexity either. But in order to be a versatile, flexible tool that can cope with arbitrarily-structured repositories from different data providers, there are a lot of arguments to cover different situations.
The vignette now has four examples of writing a data source, stepping through various levels of detail and demonstrating all three of the packaged handler functions.
Not that we are aware of.
As far as we are aware there is no way to have separate sections of arguments with roxygen. However, each mandatory function parameter here is clearly labelled as "required".
The use of
Documentation has had a rewrite, as mentioned above. We feel that detailed tutorials on using data provider websites are beyond the scope of package documentation. Nevertheless, the vignette now includes a fuller example of writing an Earthdata data source. For the record, that URL you give above is your data source URL: it points directly to that data set.
Bowerbirds (real ones) collect shiny things to decorate their bowers with. Bowerbird (the package) lets you collect nice shiny data sets and add them to your collection.
Bowerbird just downloads whatever the data provider makes available, following whatever the data source configuration parameters are. As explained in the vignette, bowerbird doesn't know in detail about the data structures or anything else. If they change, the new version will be downloaded.
Each data source has a
Yes, as a general statement it's probably reasonable to encourage users to avoid individually downloading large data sets (especially multiple individuals within the same institution downloading the same large data set). But at some point someone has to store data locally. For example, we routinely run analyses that require the entire satellite data record of some parameter (e.g. sea ice or SST). This can't be done by issuing remote data requests, it would be cripplingly slow. Yes, we can run these analyses "in the cloud", and we have virtual servers for exactly this purpose ... but then where does that virtual server get its data from? That data storage ideally needs to be on the same local network as the compute facility so that access is fast. Someone, somewhere, has to maintain that collection of data files. This is one use case for bowerbird. And having one data librarian maintain a collection of data sets on behalf of a number of local users will reduce the need for individuals to download their own copies.
The bugreports URL is in the DESCRIPTION file. Maintainer is also there (but it's hidden! The "cre" (creator) role in Authors@R gets interpreted as maintainer). CONTRIBUTING.md and code of conduct files have been added.
No, it's not. (You can see that in the submission details in the onboarding issue - I don't know if as a reviewer you get any special notification of that though).
There is no NEWS file yet: to date, the package has been in an active development stage with everything liable to change, so it doesn't make sense to document all this via NEWS. We will add a NEWS file once the onboarding process is complete and things seem stable enough to warrant it.
Miles - Review Comments
It now has a subheading.
See response to Leah's review above - the README/vignette now has four examples of increasing complexity, and demonstrating all three of the packaged handler functions.
General improvements all round have been made.
Function documentation now explains its role a little more clearly.
This function is now not exported - it is useful internally but probably not for the user.
For these functions, as well as the handler functions (bb_handler_wget, bb_handler_oceandata, bb_handler_earthdata) there are several arguments that are passed automatically when bb_sync is run, and which don't need to be specified by the user. Previously these were exposed as regular function arguments, which was indeed confusing. These have now been masked from regular users by use of the dots argument. Note that developers who wish to write their own handler functions will need to know about these arguments: there is a section of the vignette that will explain this. But regular users, who will almost certainly be the overwhelming majority of the user base, will find the new structure much less confusing.
The documentation for the handler and other functions that are called by bb_sync now also makes it clear that these functions are not intended to be called directly by the user, but specified as part of a data source definition.
Data provenance, and frameworks to help manage it, are evolving areas. The
Is essentially an internally-used argument, which (like the config and other automatically-passed arguments to handler functions) has now been hidden from the user.
Function documentation has in general been improved, and the vignette examples give more detail.
See above: now hidden from the user.
dplyr has been ditched!
Now removed: the parts of data source definitions that need to specify functions (and possibly arguments) now do so via a
We like your optimism! wget still gets us scratching our heads on occasion. But for all its faults and complexity, it is a powerful tool.
There are a number of wget flags that can clash, and other than trying them all in combination we are not aware of any way of knowing how those clashes are resolved. That's really an issue with wget itself.
This is a little more complicated than it may seem. First, the majority of these
There are, however, some instances where we use
For functions outside of the
We've gone through all usages of
We've added a few more error path tests, and will continue to add appropriate tests to the suite as development progresses.
Actually there is - the "cre" (creator) role in Authors@R gets interpreted as maintainer.
hi @maelle i think (without actually going through another review) that their feedback looks very thorough and patient considering all of the review comments :) i also didn't know about
An excellent job on the documentation and vignettes. The main vignette is now a stand out - one of the best I have seen. It gives this package a great chance of getting some use about the place.
I am also happy with the way you addressed my other comments. Thankyou for the explanation on the use of
I've updated my review block above with the final
One minor comment: When reviewing some of the changes in the code I noticed a lot of old code sitting around in commented out blocks. This was true in maybe two thirds of
Otherwise, congratulations on polishing up this package and it has been my pleasure to review it.
I have started looking, really great docs as Miles say!
I was wondering whether it'd be good to split the README into a more minimal README and a few vignettes linked from the README? E.g. "Defining data sources" could be a vignette. Or add a table of contents at the top of the README? (but the website might still be easier to browse?)
I have three suggestions:
Now here is the list of things you have to do before I close this issue
Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or long-form post with more narrative about its development. ((https://ropensci.org/blog/). If you are, @stefaniebutland will be in touch about content and timing.
@raymondben @mdsumner We'd love to publish a blog post about
Was just looking back at a discussion of "the journey" from code for my own use, to code that I want others to find useful, where @noamross suggested a blog post using
Either way, this is optional and only if you have the capacity and interest to do this. Discussion reminded me that I need to think about how we can help authors and reviewers with this process
Let me know what you think. No rush.