Add: data section to guide #110

Draft: wants to merge 1 commit into main
186 changes: 186 additions & 0 deletions data/intro.md
@@ -0,0 +1,186 @@
# Data for your package

In this section we talk about data for your scientific Python package:
when you would need it, and how you can access it and provide it to your users.

```{admonition} Attribution
:class: note
Some material in this section is adapted from:

* <https://www.dampfkraft.com/code/distributing-large-files-with-pypi.html>
* <https://learn.scientific-python.org/development/patterns/data-files/>
```

## When and why you might need data

There are two main cases where your package will need data:
for examples, and for tests.
We'll talk through both in the next couple of sections.

### Data for example usage

It's very common for scientific Python packages to need data that helps users understand how the library is meant to be used.
Often the package provides functionality to access this data,
either by loading it from inside the installed package, or by downloading it from a remote host.
In fact, the latter approach is so common that libraries like [pooch](https://github.com/fatiando/pooch) have been developed just to "fetch" data.
We will show you how to use both methods for providing access to data below,
but first we present some examples.

#### Examples in pyOpenSci packages
* movingpandas: <https://movingpandas.github.io/movingpandas-website/2-analysis-examples/bird-migration.html>

#### Examples in core scientific Python packages
* scikit-image: <https://github.com/scikit-image/scikit-image/tree/main/skimage/data>
* scikit-learn: <https://github.com/scikit-learn/scikit-learn/tree/main/sklearn/datasets/data>

### Data for tests
It is common to design your code and tests in such a way that you can quickly test on fake data,
ranging from something as simple as a NumPy array of zeros,
to something much more complex like a test suite that "mocks" data for a specific domain.
This lets you make sure the core logic of your code works, *without* needing real data.
At the end of the day though, you do want to make sure your code works on real data,
especially if it is scientific code that may work with very specific data formats.
That's why you will often want at least a small amount of real-world test data.
A good rule of thumb is to have a handful of small files:
say, no more than 10 files that are a maximum of 50 MB each.
Anything more than that, you will probably want to store online and download,
for reasons we describe in the next section.
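
For instance, here is a minimal sketch of the fake-data approach;
the `normalize` function and the `my_package` module are hypothetical stand-ins for your own code.

```python
# A minimal sketch: test core logic on fake data before real data.
# ``my_package.normalize`` is a hypothetical function standing in for your own.
import numpy as np

from my_package import normalize


def test_normalize_on_fake_data():
    fake = np.zeros((10, 10))  # simple fake data: an array of zeros
    result = normalize(fake)
    # the core logic should preserve the shape, even on all-zero input
    assert result.shape == fake.shape
```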

## Why you should prefer to download data: size limits

Below we will introduce places you can store data on-line, and show you tools you can use to download that data.
We suggest you prefer this approach when possible.
The main reason is that there are limits on file and project sizes for forges, like GitHub and GitLab,
and for package indexes, most importantly PyPI.
Scientific datasets in particular can be quite large,
and we want to be good citizens of the ecosystem that do not place unnecessary demands on the common infrastructure.

### Forges (GitHub, GitLab, BitBucket, etc.)

Forges for hosting source code have maximum sizes for both files and projects.
For example, on GitHub, a single file cannot be more than 100 MB.
You would be surprised how quickly a CSV file can get this big!
You also want to avoid committing larger binary files (like images or audio)
to a version control system like git, because it is hard to go back and remove them later,
and they can really slow down cloning the project,
both for you and, more importantly, for potential contributors!

### Data size and PyPI

The Python Package Index (PyPI) places a limit on the size of each uploaded "file" (either an sdist or a wheel),
and also a limit on the total size of the project (the sum of all its "files").
These limits are not clearly documented as far as we can tell,
but most estimates are around 100 MB per file and 1 GB for the total project.
Files this large place a real strain on the resources supporting PyPI.
For this reason, as a good citizen of the Python ecosystem, you should do everything you can to minimize your impact.
Don't worry, we're here to help you do that!
You can request increases for both the [file size limit](https://pypi.org/help/#file-size-limit)
and the [project size limit](https://pypi.org/help/#project-size-limit),
but we strongly suggest you read about the other options below first.

## Where to store your data

Alright: we're strongly suggesting you don't try to cram your data into your code, so where should you store it?
Here we provide several options.

### Within the repository

As stated above, there *are* cases where relatively small datasets
can be included in a package.
If this data consists of examples for usage,
then you would likely put it inside your source code
so that it will be included in the sdist and wheel.
If the data is meant only for tests,
and you have a separate test directory (as we suggest),
then you can put the data in that directory instead,
so that it stays out of your sdist and wheel.

* Strengths:
  * Easy to access
  * Very doable for smaller files, e.g., the text files used in bioinformatics
* Weaknesses:
  * Maximum file sizes on forges like GitHub and on PyPI
  * Committing these files to version control (git) bloats your history and drains the resources of PyPI
* Examples:
  * pyOpenSci packages:
    * OpenOmics: <https://github.com/JonnyTran/OpenOmics/tree/master/tests/data/TCGA_LUAD>
    * jointly: <https://github.com/hpi-dhc/jointly/tree/master/test-data>
  * Core scientific Python packages:
    * scikit-learn: <https://github.com/scikit-learn/scikit-learn/tree/main/sklearn/datasets/data>

### In the cloud

#### Scientific data repositories

* Strengths:
  * Free, with a guaranteed lifetime for the dataset; often appropriate for pyOpenSci packages
* Weaknesses:
  * May be hard to automate for data that changes frequently
* Examples:
  * Zenodo
  * OSF
  * FigShare
  * Dryad (charges a data publishing fee)

#### Private cloud

* Strengths:
  * Robust; tooling exists to more easily automate updating your data
* Weaknesses:
  * Not free, and requires more technical know-how
* Examples:
  * AWS
  * Google Cloud
  * Linode

```{admonition} Data version control
:class: tip

Did you know that tools exist that let you track changes to datasets in the same way version control systems like git
let you track changes to code? Although you don't strictly need data versioning to include data with your package,
you probably want to be aware that such tools exist if you are reading this section.
Such tools can be particularly important if your package focuses mainly on providing access to datasets.
Within science, tools like [DataLad](https://www.datalad.org/) have been developed to provide distributed access to versioned datasets.
Related tools that are widely used for data engineering in industry include:

* Git LFS
* DVC
* Pachyderm
```

```{admonition} Field specific standards + metadata
:class: tip

It's important to be aware of field-specific standards for data and metadata,
e.g., in astronomy, or in neuroscience (DANDI, NWB).
Many pyOpenSci packages exist to support these standards,
or to provide interoperability where such standards don't exist.
See also: FAIR data principles.
```

## How to access your data

Last but definitely not least, it's important to understand how you *and* your users can access your data.

### For examples: in documentation, tutorials, docstrings, etc.

The next two sections introduce tools for the two approaches described above:
accessing data files included inside your package, and downloading data that is hosted remotely.

### Accessing local files with importlib-resources

If you have included data files in your source code, then you can provide access to them through the standard library module
[importlib.resources](https://docs.python.org/3/library/importlib.resources.html).
Its modern `files()` interface was added in Python 3.9;
on older versions of Python, the [importlib-resources](https://pypi.org/project/importlib-resources/) backport package on PyPI provides the same interface.
For more background, see Barry Warsaw's PyCon talk on `importlib.resources`.
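
Below is a minimal sketch of how this can look;
the package name `my_package`, its `data` sub-package, and the file `example.csv` are hypothetical placeholders for your own layout.

```python
# A minimal sketch of loading a data file bundled with your package.
# ``my_package.data`` and ``example.csv`` are hypothetical placeholders.
from importlib.resources import files
# On Python < 3.9, use the backport instead:
# from importlib_resources import files


def load_example_data() -> str:
    """Return the contents of the bundled example dataset."""
    resource = files("my_package.data").joinpath("example.csv")
    return resource.read_text()
```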
Examples:

* pyOpenSci packages:
  * crowsetta: <https://github.com/vocalpy/crowsetta/blob/main/src/crowsetta/data/data.py>
* Core scientific Python packages:
  * scikit-learn: <https://github.com/scikit-learn/scikit-learn/blob/f86f41d80bff882689fc16bd7da1fef4a805b464/sklearn/datasets/_base.py#L297>

### Accessing files that are hosted remotely with pooch
If your data is hosted remotely, the [pooch](https://github.com/fatiando/pooch) library can download it on demand,
cache it locally, and verify its integrity with a checksum.
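
Here is a minimal sketch of how pooch is typically set up;
the base URL, file name, and checksum are hypothetical placeholders for your own data.

```python
# A minimal sketch of fetching remotely hosted data with pooch.
# The base_url, file name, and sha256 hash are hypothetical placeholders.
import pooch

DATA_FETCHER = pooch.create(
    path=pooch.os_cache("my_package"),  # local cache directory
    base_url="https://github.com/my-org/my-package/raw/main/data/",
    registry={
        "example.csv": "sha256:0123456789abcdef",  # checksum of the file
    },
)


def fetch_example_data() -> str:
    """Download the example data if needed, and return its local path."""
    return DATA_FETCHER.fetch("example.csv")
```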

### For tests
Many of the same tools apply when accessing data for tests.
For example, you can download test data as a set-up step in your continuous integration workflows,
and then use pytest fixtures to provide that data to your tests.
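
A minimal sketch of the fixture approach, assuming your test data lives in a `tests/data` directory (a hypothetical layout):

```python
# A minimal sketch of a pytest fixture that provides test data.
# Assumes a ``tests/data`` directory next to this test module (hypothetical).
import pathlib

import pytest

TEST_DATA_DIR = pathlib.Path(__file__).parent / "data"


@pytest.fixture
def example_csv_path() -> pathlib.Path:
    """Path to a small real-world data file used by the tests."""
    return TEST_DATA_DIR / "example.csv"


def test_loads_real_data(example_csv_path):
    # ``my_package.load`` is a hypothetical function under test
    from my_package import load

    data = load(example_csv_path)
    assert data is not None
```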