
How to handle large data? #13

Open
demal-lab opened this issue Apr 1, 2019 · 1 comment

Comments

@demal-lab

This is a great approach that I'd love to integrate into my workflow. One of my biggest barriers is that my regularly updated datasets are too large for GitHub's regular 100 MB limit, so I usually just keep my data locally, or in some kind of Postgres database. Do you have thoughts about how to tackle this with data that exceed the GitHub limit (assuming you can't pay for more storage)?

@ethanwhite
Member

I don't have any well-thought-out ideas about working with 100+ MB data in this way, but I'm happy to chat about it a bit here in case the thoughts I do have are useful. Some of this depends on the update frequency and other considerations, but I'll start with some general thoughts and we can go from there.

For data in the 0.1-50 GB range, Zenodo is still a good option for archiving, and since it also takes care of versioning it would serve as an effective central location (it also has DOIs, won't be going anywhere for a long time, etc.). If the update frequency is low you could run things locally and update the Zenodo version manually. If the update frequency is high it should still be possible to automate the archiving step directly through their API (instead of using the GitHub-Zenodo integration), but I haven't tried it.
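If you do go the API route, here's a minimal sketch of what the automated archiving step could look like, assuming Zenodo's REST deposition endpoints and a personal access token; the file name, metadata, and token are placeholders, so check the current API docs before building on this.

```python
# Minimal sketch of archiving a data file to Zenodo through its REST API,
# instead of the GitHub-Zenodo integration. Assumes a personal access token
# with deposit scopes; endpoint names follow Zenodo's developer docs.
import requests

ZENODO_URL = "https://zenodo.org/api"
TOKEN = "your-zenodo-access-token"  # placeholder
params = {"access_token": TOKEN}

# 1. Create an empty deposition to hold the new data version.
r = requests.post(f"{ZENODO_URL}/deposit/depositions", params=params, json={})
r.raise_for_status()
deposition = r.json()

# 2. Upload the cleaned data file to the deposition's file bucket.
bucket_url = deposition["links"]["bucket"]
with open("cleaned_data.csv", "rb") as fp:
    r = requests.put(f"{bucket_url}/cleaned_data.csv", data=fp, params=params)
    r.raise_for_status()

# 3. Add minimal metadata, then publish to mint a DOI for this version.
metadata = {"metadata": {"title": "Regularly updated dataset",
                         "upload_type": "dataset",
                         "description": "Automated data archive.",
                         "creators": [{"name": "Your Name"}]}}
r = requests.put(f"{ZENODO_URL}/deposit/depositions/{deposition['id']}",
                 params=params, json=metadata)
r.raise_for_status()

r = requests.post(
    f"{ZENODO_URL}/deposit/depositions/{deposition['id']}/actions/publish",
    params=params)
r.raise_for_status()
```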

For full automation you probably want to run the QA/QC + cleaning + archiving code on your own compute resources (whether local or cloud). Depending on the update frequency and how important it is to immediately check and archive the newest data, there are a few different approaches to this automation, listed here in order from simplest to most powerful:

  1. cron: automatically run your scripts on a regular schedule (could be daily, weekly, etc.). We do this for automation in a different context (https://phenology.naturecast.org/); see the sketch after this list.
  2. jenkins: an open source, locally runnable continuous integration system. We've played with this for other things and running on a single server has been fairly manageable (larger scaling has been a bit tricky). It also integrates with GitHub, so if you wanted some sort of hybrid setup with the code on GitHub but local data, Jenkins would be one way to do it.
  3. A more general event-response style system. I've looked at OpenWhisk a bit and know folks building things with it, so that's probably where I'd start.
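For option 1, here's a minimal sketch of the kind of wrapper script cron might run, assuming the QA/QC, cleaning, and archiving steps already exist as their own scripts; the script names are hypothetical placeholders.

```python
# pipeline.py -- hypothetical QA/QC + cleaning + archiving pipeline that cron
# runs on a schedule. The script names below are placeholders for your own
# project code, not an existing package.
import logging
import subprocess
import sys

logging.basicConfig(filename="pipeline.log", level=logging.INFO)

def run_step(description, command):
    """Run one stage of the pipeline and stop the run if it fails."""
    logging.info("Starting: %s", description)
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        logging.error("Failed: %s\n%s", description, result.stderr)
        sys.exit(1)
    logging.info("Finished: %s", description)

if __name__ == "__main__":
    run_step("QA/QC checks", ["python", "qa_qc.py"])
    run_step("Data cleaning", ["python", "clean.py"])
    run_step("Archive to Zenodo", ["python", "archive_to_zenodo.py"])

# Example crontab entry (add via `crontab -e`) to run the pipeline
# every night at 2am:
#   0 2 * * * /usr/bin/python3 /path/to/pipeline.py
```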
