
How to handle large data? #13

Open
demal-lab opened this issue Apr 1, 2019 · 1 comment

Comments

@demal-lab

This is a great approach that I'd love to integrate into my workflow. One of my biggest barriers is that my regularly updated datasets are too large for GitHub's regular 100 MB limit, so I usually just keep my data locally, or in some kind of Postgres database. Do you have thoughts about how to tackle this with data that exceed the GitHub limit (assuming you can't pay for more storage)?

@ethanwhite
Member

I don't have any well-thought-out ideas about working with 100+ MB data in this way, but I'm happy to chat about it a bit here in case the thoughts I do have are useful. Some of this depends on the update frequency and other considerations, but I'll start with some general thoughts and we can go from there.

For data in the 0.1-50 GB range, Zenodo is still a good option for archiving, and since it also takes care of versioning it would serve as an effective central location (it also has DOIs, won't be going anywhere for a long time, etc.). If the update frequency is low you could run things locally and update the Zenodo version manually. If the update frequency is high it should still be possible to automate the archiving step directly through their API (instead of using the GitHub-Zenodo integration), but I haven't tried it.
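If you do go the API route, here's a minimal sketch of what the automated archiving step could look like, assuming Zenodo's REST deposition endpoints and a personal access token; the file name, metadata, and token are placeholders, so check the current API docs before building on this.

```python
# Minimal sketch of archiving a data file to Zenodo through its REST API,
# instead of the GitHub-Zenodo integration. Assumes a personal access token
# with deposit scopes; endpoint names follow Zenodo's developer docs.
import requests

ZENODO_URL = "https://zenodo.org/api"
TOKEN = "your-zenodo-access-token"  # placeholder
params = {"access_token": TOKEN}

# 1. Create an empty deposition to hold the new data version.
r = requests.post(f"{ZENODO_URL}/deposit/depositions", params=params, json={})
r.raise_for_status()
deposition = r.json()

# 2. Upload the cleaned data file to the deposition's file bucket.
bucket_url = deposition["links"]["bucket"]
with open("cleaned_data.csv", "rb") as fp:
    r = requests.put(f"{bucket_url}/cleaned_data.csv", data=fp, params=params)
    r.raise_for_status()

# 3. Add minimal metadata, then publish to mint a DOI for this version.
metadata = {"metadata": {"title": "Regularly updated dataset",
                         "upload_type": "dataset",
                         "description": "Automated data archive.",
                         "creators": [{"name": "Your Name"}]}}
r = requests.put(f"{ZENODO_URL}/deposit/depositions/{deposition['id']}",
                 params=params, json=metadata)
r.raise_for_status()

r = requests.post(
    f"{ZENODO_URL}/deposit/depositions/{deposition['id']}/actions/publish",
    params=params)
r.raise_for_status()
```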

For full automation you probably want to run the QA/QC + cleaning + archiving code on your own compute resources (whether local or cloud). Depending on the update frequency and how important it is to immediately check and archive the newest data, there are a few different approaches to this automation, listed here in order from simplest to most powerful:

  1. cron: automatically run your scripts on a regular schedule (could be daily, weekly, etc.). We do this for automation in a different context (https://phenology.naturecast.org/); see the sketch after this list.
  2. jenkins: an open source, locally runnable continuous integration system. We've played with this for other things and running on a single server has been fairly manageable (larger scaling has been a bit tricky). It also integrates with GitHub, so if you wanted some sort of hybrid setup with the code on GitHub but local data, Jenkins would be one way to do it.
  3. A more general event-response style system. I've looked at OpenWhisk a bit and know folks building things with it, so that's probably where I'd start.
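For option 1, here's a minimal sketch of the kind of wrapper script cron might run, assuming the QA/QC, cleaning, and archiving steps already exist as their own scripts; the script names are hypothetical placeholders.

```python
# pipeline.py -- hypothetical QA/QC + cleaning + archiving pipeline that cron
# runs on a schedule. The script names below are placeholders for your own
# project code, not an existing package.
import logging
import subprocess
import sys

logging.basicConfig(filename="pipeline.log", level=logging.INFO)

def run_step(description, command):
    """Run one stage of the pipeline and stop the run if it fails."""
    logging.info("Starting: %s", description)
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        logging.error("Failed: %s\n%s", description, result.stderr)
        sys.exit(1)
    logging.info("Finished: %s", description)

if __name__ == "__main__":
    run_step("QA/QC checks", ["python", "qa_qc.py"])
    run_step("Data cleaning", ["python", "clean.py"])
    run_step("Archive to Zenodo", ["python", "archive_to_zenodo.py"])

# Example crontab entry (add via `crontab -e`) to run the pipeline
# every night at 2am:
#   0 2 * * * /usr/bin/python3 /path/to/pipeline.py
```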
