From 277c5769d215d407c56746d3856d1822a560d605 Mon Sep 17 00:00:00 2001
From: David Nicholson
Date: Fri, 27 Oct 2023 09:17:23 -0400
Subject: [PATCH] WIP: Add data/intro.md

---
 data/intro.md | 186 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 186 insertions(+)
 create mode 100644 data/intro.md

diff --git a/data/intro.md b/data/intro.md
new file mode 100644
index 00000000..9b100ecf
--- /dev/null
+++ b/data/intro.md
@@ -0,0 +1,186 @@
# Data for your package

In this section we talk about data for your scientific Python package:
when you would need it, and how you can access it and provide it to your users.

```{admonition} Attribution
:class: note
Some material adapted from:
- https://www.dampfkraft.com/code/distributing-large-files-with-pypi.html
- https://learn.scientific-python.org/development/patterns/data-files/
```

## When and why you might need data

First we describe when and why you might need data.
Basically there are two cases:
data for examples, and data for tests.
We'll talk through both in the next couple of sections.

### Data for example usage

It's very common for scientific Python packages to need data that helps users understand how the package is meant to be used.
Often the package provides functionality to access this data,
either by loading it from inside the installed package, or by downloading it from a remote host.
In fact, the latter approach is so common that libraries like [pooch](https://github.com/fatiando/pooch)
have been developed just to "fetch" data.
We will show you how to use both methods for providing access to data below,
but first we present some examples.

#### Examples in pyOpenSci packages
* movingpandas:

#### Examples in core scientific Python packages
* scikit-image:
* scikit-learn:

### Data for tests
It is common to design your code and tests in such a way that you can quickly test on fake data,
ranging from something as simple as a NumPy array of zeros,
to something much more complex, like a test suite that "mocks" data for a specific domain.
This lets you make sure the core logic of your code works, *without* needing real data.
At the end of the day, though, you do want to make sure your code works on real data,
especially if it is scientific code that works with very specific data formats.
That's why you will often want at least a small amount of real-world test data.
A good rule of thumb is to keep a handful of small files,
say no more than 10 files that are a maximum of 50 MB each.
Anything more than that, you will probably want to store online and download,
for reasons we describe in the next section.

## Why you should prefer to download data: size limits

Below we introduce places where you can store data online, and tools you can use to download that data.
We suggest you prefer this approach when possible.
The main reason is that forges like GitHub and GitLab,
and package indexes--most importantly, PyPI--place limits on file and project sizes.
Especially with scientific datasets, which can be quite large,
we want to be good citizens of the ecosystem and not place unnecessary demands on the common infrastructure.

### Forges (GitHub, GitLab, BitBucket, etc.)

Forges for hosting source code set maximum sizes for both files and projects.
For example, on GitHub a single file cannot be more than 100 MB.
You would be surprised how quickly you can make a CSV file that big!
You also want to avoid committing larger binary files (like images or audio)
to a version control system like git, because it is hard to go back and remove them later,
and they can really slow down cloning the project.
More importantly, they slow down cloning for potential contributors!

### Data size and PyPI

The Python Package Index (PyPI) places a limit on the size of each uploaded file--where a "file" is either
an sdist or a wheel--and also a limit on the total size of the project (the sum of all its "files").
These limits are not documented as far as we can tell,
but most estimates are around 100 MB per file and 1 GB for the total project.
Files this large place a real strain on the resources supporting PyPI, as discussed here.
For this reason, as a good citizen of the Python ecosystem, you should do everything you can to minimize your impact.
Don't worry, we're here to help you do that!
You can request increases for both the file size and project size limits
(see [here](https://pypi.org/help/#file-size-limit)
and [here](https://pypi.org/help/#project-size-limit)),
but we strongly suggest you read about the other options below first.

## Where to store your data

Alright, we're strongly suggesting you don't try to cram your data into your code--so where should you store it?
Here we provide several options.

### Within the repository

As stated above, there *are* cases where relatively small datasets
can be included in a package.
If the data consists of examples for usage,
then you would likely put it inside your source code
so that it will be included in the sdist and wheel.
If the data is meant only for tests,
and you have a separate test directory (as we suggest),
then you can keep the data in that directory,
so that it is not distributed inside your package itself.

* Strengths and weaknesses
  * Strengths
    * Easy to access
    * Can be very doable for smaller files, e.g., text files used in bioinformatics
  * Weaknesses
    * Maximum file sizes on forges like GitHub and on PyPI
    * You want to avoid adding these files to your version control history (git) and draining the resources of PyPI
* Examples:
  * pyOpenSci:
    * opengenomics:
    * jointly:
  * Core scientific Python packages:
    * scikit-learn:

### In the cloud

#### Scientific data repositories
* Strengths and weaknesses
  * Strengths: free; guaranteed lifetime of the dataset; often appropriate for pyOpenSci packages
  * Weaknesses: may be hard to automate for data that changes frequently
* Examples
  * Zenodo
  * OSF
  * FigShare
  * Dryad (paid?)

#### Private cloud
* Strengths and weaknesses
  * Strengths: robust; tooling exists to more easily automate updating of data, though this requires more technical know-how
  * Weaknesses: not free
* Examples
  * AWS
  * Google Cloud
  * Linode

```{admonition} Data version control
:class: tip

Did you know that there are tools that let you track changes to datasets the same way version control systems like git
let you track changes to code?
Although you don't strictly need data versioning to include data with your package,
you probably want to be aware that such tools exist if you are reading this section.
They can be particularly important if your package focuses mainly on providing access to datasets.
Within science, tools like [DataLad](https://www.datalad.org/) have been developed
to provide distributed access to versioned datasets.
Related tools used for data engineering in industry include Git LFS, [DVC](https://dvc.org/), and Pachyderm.
A sketch of reading a DVC-tracked file follows this note.
```
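To make that concrete, here is a minimal, hedged sketch of reading a dataset file
tracked with DVC, using DVC's Python API.
The repository URL and file path below are placeholders, not a real dataset.

```python
import dvc.api

# A minimal sketch, assuming a hypothetical repository that tracks
# "data/example.csv" with DVC; both the repo URL and the file path
# below are placeholders, not a real dataset.
contents = dvc.api.read(
    "data/example.csv",
    repo="https://github.com/some-org/some-dataset-repo",
)
print(contents[:100])  # first 100 characters of the file
```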
+``` + +```{admonition} Field specific standards + metadata +:class: tip + +It's important to be aware of field-specific standards +eg astronomy +neuroscience: DANDI, NWB +many pyOpenSci tools exist to address these standards or to provide interoperability because these standards don't exist +see also: FAIR data +``` + +## How to access your data + +Last but definitely not least, it's important to understand how you *and* your users + +### For examples: in documentation, tutorials, docstrings, etc. + +### Accessing local files with importlib-resources + +If you have included data files in your source code, then you can provide access to these through importlib-resources. + +link to PyCon talk w/Barry Warsaw +code snippet example here +mention python-3.9 backport +* examples: + * pyOpenSci packages: + * crowsetta: + * core scientific Python packages: + * scikit-learn: + +### Accessing files that are hosted remotely with pooch +pooch: +https://github.com/fatiando/pooch +code snippet example of using pooch + +### For tests +Many of the same tools apply. +You can download test data as a set-up step in your CI. +Pytest fixtures for accessing test data.