Cache data locally after download #8

Open
vincentarelbundock opened this issue Mar 5, 2014 · 18 comments

@vincentarelbundock

Allow users to pass a flag that will cache the data locally after it is downloaded. Goals:

1- Reproducibility. Even if the online version disappears, the user will still have a local copy
2- Quicker download on second request.

Should be accompanied by a standalone cleaning script if applicable.
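
For concreteness, here is a minimal sketch of what such a flag could look like (the function name and arguments are hypothetical, not an existing psData API):

get_data <- function(url, cache = FALSE, cache_dir = tempdir()) {
  local_copy <- file.path(cache_dir, paste0(basename(url), ".rds"))
  if (cache && file.exists(local_copy)) {
    # Second request: reuse the local copy instead of downloading again
    return(readRDS(local_copy))
  }
  data <- read.csv(url, stringsAsFactors = FALSE)
  if (cache) {
    # Keep a local copy so the data survive even if the online version disappears
    saveRDS(data, local_copy)
  }
  data
}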

@leeper
Member

leeper commented Mar 5, 2014

This is a great idea. Are you thinking an R.cache dependency or something else?


@christophergandrud
Contributor

+1

@vincentarelbundock
Author

No need for a dependency. Just let the user specify a local path where to save things.

@antagomir
Member

Note that CRAN has reservations about packages writing to local files: "Packages should not write in the users’ home filespace, nor anywhere else on the file system apart from the R session’s temporary directory (or during installation in the location pointed to by TMPDIR: and such usage should be cleaned up). Installing into the system’s R installation (e.g., scripts to its bin directory) is not allowed." http://cran.r-project.org/web/packages/policies.html

But perhaps this is not a problem if the package does not do this by default, but only after the user's active choice. Of course, there should be an allowance for packages that are designed for data access, but we need to be careful about how to implement that. My suggestion is to cache to the local tmpdir.
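
One way to stay within that policy would be to write only under tempdir(); a minimal sketch, with a made-up helper name:

cache_path <- function(name) {
  cache_dir <- file.path(tempdir(), "psData_cache")
  if (!file.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)
  # Files under tempdir() last only for the current R session
  file.path(cache_dir, paste0(name, ".rds"))
}

Persistent caching across sessions would then still require an explicit, user-supplied path, which keeps it an active choice.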

@christophergandrud
Contributor

Just added a file argument to get_data. It allows the user to specify an RData file to save the data frame into. I'll push it up in a few minutes.

(I don't think this is a problem for CRAN, as I have a number of packages that do this; the default is always set not to write the file.)

I do like the cache idea though. It means the user doesn't have to think about file paths, which can differ when they run the function on different computers, for example.
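
For reference, a sketch of how such a file argument might behave (the real implementation may differ; read.csv() stands in for the actual download and processing steps):

get_data <- function(url, file = NULL) {
  data <- read.csv(url, stringsAsFactors = FALSE)  # placeholder for download + processing
  if (!is.null(file)) {
    # Written only when the user asks for it, e.g. file = 'Checks.rda'
    save(data, file = file)
  }
  data
}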

@vincentarelbundock
Author

Cool. I think there should also be a way to replicate the file processing (e.g. create unique country names, drop duplicate country-years, etc.). So we'd need to save an R script as well.

@christophergandrud
Contributor

@vincentarelbundock currently it saves the downloaded and processed data frame as the last step before return(data) at the end of the function. Does this make sense or should we try something else?

@vincentarelbundock
Author

That's cool, but in my view it's not enough. If you're given only the raw and the processed data, you still might have to reverse engineer to find out what the transformations were. Was that log(variable) or log(variable+1)? Which duplicate countries did we drop, exactly?

I think what makes the most sense would be related to my issue here: #10

You would have 2 files:

  • database_political_institutions.yaml (download URL, BibTeX citation, etc.)
  • database_political_institutions.R (cleaning script with all transformations)

And a standardized function:

  • get_data(): parse the .yaml file, download the data if not already cached, and run the R script.
  • If flagged for caching, copy the yaml, raw data, processed data, and R script to the specified path (a rough sketch follows below).
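
A rough sketch of how the two files and the standardized function could fit together (the yaml fields, file layout, clean_data() name, and the use of the yaml package are all assumptions, not the current implementation):

library(yaml)  # assumed dependency for parsing the metadata file

get_data <- function(name, cache_dir = NULL) {
  meta <- yaml.load_file(paste0(name, ".yaml"))      # download url, bibtex cite, etc.
  raw  <- read.csv(meta$url, stringsAsFactors = FALSE)
  source(paste0(name, ".R"), local = TRUE)           # defines clean_data() for this dataset
  processed <- clean_data(raw)
  if (!is.null(cache_dir)) {
    # Copy the yaml, raw data, processed data and cleaning script to the cache
    file.copy(c(paste0(name, ".yaml"), paste0(name, ".R")), cache_dir)
    write.csv(raw, file.path(cache_dir, paste0(name, "_raw.csv")), row.names = FALSE)
    saveRDS(processed, file.path(cache_dir, paste0(name, ".rds")))
  }
  processed
}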

@antagomir
Member

I agree that in principle it would be useful to have both the raw data and the full script available in a reproducible format. If this gets too messy, however, we could at least store the package version and function name. That alone is already quite useful, and close to sufficient for full reproducibility.

@vincentarelbundock
Author

I think your current setup is going to be much messier in the long run. "Silos" for each dataset are the only way I can think of to cleanly separate the cleaning scripts. That's a really simple way to scale the project...

@christophergandrud
Contributor

The function calls + package version are effectively a recipe. So, for example, an R script might go:

## Using psData 1.0
library(psData)

# Download data
Main <- get_dpi(vars = 'checks', sha1 = 'foo', file = 'Checks.rda')

# Set as psData
MainP <- as.panel(Main, 'Country', 'Year')

# Transform
# (transformation functions not yet created)

# Drop observations, etc.

It would definitely be helpful to add the package version and call to the metadata in as.panel (we might be able to do it a step earlier, in get_dpi). Maybe it's also possible to output a log of the whole process. But fundamentally, I think this amounts to reproducible silos and recipes. Am I missing something?
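
One way to record that information, assuming attribute-based metadata (the attribute names and the helper are hypothetical):

stamp_provenance <- function(data, call) {
  # Record which psData version and which call produced this data frame
  attr(data, "psData_version") <- as.character(utils::packageVersion("psData"))
  attr(data, "psData_call")    <- paste(deparse(call), collapse = " ")
  data
}

# Inside get_dpi() or as.panel(), the final step would then be something like:
#   stamp_provenance(data, match.call())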

Note also something that hasn't been started yet: @briatte did a good job of creating a way to run basic data sets through get_data and get a basic data frame back. But we also need a system for data that requires more manipulation to get it into the panel-series data frame format. These were the 'variable builders' in the original version of psData. They would also effectively be recipes, if I understand you correctly.

Thoughts?

@vincentarelbundock
Author

Here's a repo I quickly put together as a proof of concept to show what I mean more concretely.

https://github.com/vincentarelbundock/dpi

I think you may be right. I guess I'm just suggesting that a slightly different repo structure with explicit silos would be cleaner, more scalable, and much easier to contribute to.

In my experience, generic functions with names like "get_data" tend to try to do too much, and end up being unwieldy. Better to provide some basic functionality and cleanly delineate which transforms were applied to each dataset.

Even if there is some code duplication, it is better for transparency to have a file called "database_political_institutions.R" that is more or less self-contained and can show the user exactly what is done in the pre-processing stage.

@antagomir
Member

I agree. Unfortunately I am personally too busy building the overall rOpenGov infrastructure right now to contribute much code here. But based on our experience I also think that Vincent's suggestion is the optimal way to go, if others find the time to advance it.

@christophergandrud
Contributor

I think there is some confusion among the various people working on the package.

In the original version, each data set was hard-siloed into its own function. Then it was recommended that these be spun off into separate packages that would be called by get_data. That would, as much as possible, provide a common syntax and output type, but apart from a few shared features (such as panel ID standardisation), each getter/variable-builder silo would follow its own recipe.

However, in subsequent collaborator updates this was reversed, so that the individual getter functions essentially became particular calls to get_data.

I think this is the change that @vincentarelbundock is arguing against. I largely agree. However, I see the logic in having a core set of functionality in get_data that deals with 'easy' data sets, i.e. those that can be downloaded from one URL and are already in panel-series format. This covers a lot of different data sets and will drastically cut down on code duplication.

I do think get_data needs to be changed so that it calls other recipe functions to handle non-vanilla data sets. The other option would be to have a different function, called build_data, that would pull in recipes for non-standard data.
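
A rough sketch of the first option, assuming a recipe_<dataset>() naming convention (all names here are hypothetical; get_url_panel() stands in for the shared 'easy' path):

get_data <- function(dataset, ...) {
  recipe <- paste0("recipe_", dataset)   # e.g. recipe_dpi for a non-vanilla data set
  if (exists(recipe, mode = "function")) {
    do.call(recipe, list(...))           # hand off to the dataset-specific recipe
  } else {
    get_url_panel(dataset, ...)          # placeholder: one-URL data already in panel-series format
  }
}

A separate build_data() could be the same dispatcher under a different name, reserved for the non-standard recipes.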

I would also recommend against baking in things like transforming and recoding the data in the get function beyond simply getting the data into a standardised panel-series format. These other decisions should be left to individual users.

Finally, it would be great to work on a way to dynamically generate metadata like BibTeX entries and psData version notes. It will be easier over the long term than updating these manually.
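
A small sketch of generating a BibTeX entry from stored metadata rather than maintaining it by hand (the field names are illustrative):

make_bibtex <- function(meta) {
  sprintf(
    "@misc{%s,\n  title = {%s},\n  url = {%s},\n  note = {Retrieved with psData %s}\n}",
    meta$key, meta$title, meta$url, as.character(utils::packageVersion("psData"))
  )
}

# e.g. make_bibtex(list(key = "dpi2012", title = "Database of Political Institutions",
#                       url = "http://example.org/dpi"))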

@vincentarelbundock
Author

Thanks @christophergandrud, that's useful background info. I don't want to insist too much because I don't have a good sense of the 10,000-foot view, but I agree that a common, user-facing get_data function would be great.

One simple approach would be to have get_data be a thin wrapper that just calls the recipe cleaning script and returns the data. For "vanilla" datasets, the cleaning script will just be a single call to read.spss(url) or some such. That way, developers don't have to make distinctions between clean and non-clean data. Each dataset has an associated loading/cleaning script; some of those are just more complicated than others, but everything is very transparent and easy to see.

Plus there is a single point of entry for all datasets: a get_data(...) function where the three dots allow users to (potentially) pass arbitrary parameters to the cleaning scripts. Minimal overhead, standardized, high flexibility.
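
A sketch of that wrapper, assuming each dataset ships a self-contained recipe script under inst/recipes/ that defines a clean() function (the layout and names are placeholders, not anything decided):

get_data <- function(dataset, ...) {
  script <- system.file("recipes", paste0(dataset, ".R"), package = "psData")
  env <- new.env()
  source(script, local = env)   # the script defines clean() for this dataset
  env$clean(...)                # for 'vanilla' data, clean() is just read.spss(url) or similar
}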

The info to include in the accompanying yaml files can be agreed on, and enforced at the Pull Request level.

@vincentarelbundock
Author

I agree about minimizing data transformations. My log example was a bad one. Think more of cases like dropping YUG after the break-up.

@christophergandrud
Contributor

@vincentarelbundock I really like your suggestion for the user-facing get_data. That should be the way to go.

We should have a separate discussion on how to deal with 'problem countries' like Yugoslavia and East/West Germany. These are important to consider because they affect merging, which is going to be a core psData functionality.

Right now there is a really crude method of notifying the user about duplicates and dropping them, but it's kind of unsatisfactory.

@antagomir
Member

Yes, get_data as a wrapper for the full versions should be the way to go. We recently had a related discussion on the rOpenGov mailing list. It would be good to harmonize these conventions across the different packages, and the model discussed here is the best candidate I have seen so far. We still need to think about how to implement caching.
