
WIP/ENH: datasets: Create the datasets subpackage. #8707

Closed

Conversation

@WarrenWeckesser (Member) commented Apr 11, 2018

  • Create the new subpackage datasets.
  • Move misc.face, misc.ascent, and misc.electrocardiogram to datasets.
  • Deprecate misc.face, misc.ascent, and misc.electrocardiogram.
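For context, a minimal sketch of what such a deprecation shim in scipy.misc could look like (illustrative only, not the PR's actual diff; it assumes the implementations move to scipy.datasets):

```python
# scipy/misc/__init__.py -- illustrative sketch, not the PR's actual diff.
import warnings

from scipy import datasets as _datasets


def face(gray=False):
    """Deprecated alias for scipy.datasets.face."""
    warnings.warn(
        "scipy.misc.face is deprecated and was moved to scipy.datasets.face.",
        DeprecationWarning,
        stacklevel=2,
    )
    return _datasets.face(gray=gray)
```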

@pv (Member) commented Apr 11, 2018 via email

@WarrenWeckesser WarrenWeckesser changed the title ENH: datasets: Create the datasets subpackage. WIP/ENH: datasets: Create the datasets subpackage. Apr 11, 2018
@WarrenWeckesser (Member, Author)

This will wait for 1.2.

Review comment on the new module docstring (diff hunk @@ -0,0 +1,192 @@):

    """
    Functions that load datasets.

(Member): In the meantime (moving toward 1.2.0) this would be a good place to try to distill a summary of the scope of this submodule from the mailing list.

@rgommers rgommers added this to the 1.2.0 milestone Apr 11, 2018
Commit (updated):

Create the new subpackage `datasets`.
Move misc.face, misc.ascent, and misc.electrocardiogram to datasets.
Deprecate misc.face and misc.ascent.
(misc.electrocardiogram has not been in a release yet, so it does
not require deprecation.)
@WarrenWeckesser (Member, Author)
I updated the PR for 1.2, but this is still work-in-progress for two reasons. First, we haven't actually agreed to create a datasets subpackage; second, I'd like to include at least one new dataset in this PR and use it in one or more docstrings. A good candidate is Fisher's iris data, which could be used as sample data for several functions in scipy.stats.
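For illustration, assuming a hypothetical loader for the iris data, a scipy.stats docstring example could look like the sketch below. The loader name is made up; the stand-in values are the first five setosa and versicolor sepal lengths from Fisher's data.

```python
# Hypothetical docstring example; `datasets.iris()` does not exist (yet).
import numpy as np
from scipy import stats

# sepal_length = datasets.iris()   # hypothetical loader
# Stand-in values: first five setosa and versicolor sepal lengths (cm)
# from Fisher's iris data.
setosa = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
versicolor = np.array([7.0, 6.4, 6.9, 5.5, 6.5])

# A two-sample t-test is the kind of usage such a docstring could show.
t, p = stats.ttest_ind(setosa, versicolor)
print(t, p)
```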

@lagru (Contributor) left a comment
Small notice: electrocardiogram is now imported and used in benchmarks/peak_finding.py as well (see #8769).

@WarrenWeckesser WarrenWeckesser removed this from the 1.2.0 milestone Oct 24, 2018
@WarrenWeckesser (Member, Author)
I removed the 1.2 milestone. I'll get back to this, but probably not in time for 1.2.

@larsoner (Member)
If you want, I could try to take over and push to get it into 1.2, assuming we are settled on the API. I think we did have things more or less settled on the mailing list, but I could be wrong. I'm also fine waiting if you are keen to keep working on it @WarrenWeckesser.

I would probably also just migrate the existing datasets rather than adding anything new. That would perhaps make review simpler, though also less fun.

@WarrenWeckesser WarrenWeckesser marked this pull request as draft April 19, 2020 13:18
@leouieda
Hi all, just wanted to point out the Pooch package which was created for this exact purpose. It handles downloading and caching sample data so that they don't have to be included in the distributions (making downloads smaller). The dependencies are minimal (requests, packaging, and appdirs) so it wouldn't be a huge burden on users. It might be useful if the goal is to expand the sample data or include some larger datasets.

@rgommers (Member)
Thanks for your input @leouieda. I think the current thinking on this PR is to keep it outside of SciPy, instead making a separate package (as mentioned in https://numpy.org/neps/nep-0044-restructuring-numpy-docs.html#data-sets) where datasets for multiple packages like NumPy and SciPy can be stored.

Pooch was already brought up by a couple of people in conversations about that new package.

@AnirudhDagar (Member)
Hi all, I'd like to know the state of this PR in 2022 and our plans to actually deprecate the misc module and create the datasets module in SciPy as proposed.

What's blocking the PR right now? Is it the addition of new datasets (like iris or co2)? If yes, I'd like to help move these efforts forward and push this, assuming the datasets API design is already decided on.

Also, what are some of the popular datasets that the Scientific community would like us to include? Thanks!

@rgommers (Member) commented Jan 13, 2022

> Hi all, I'd like to know the state of this PR in 2022 and our plans to actually deprecate the misc module and create the datasets module in SciPy as proposed.

Hi @AnirudhDagar, this PR/idea is still of interest I think. As is deprecating misc.

The key thing to me is how we can add datasets without bloating SciPy, and without making things fragile. This requires a lazy data loading solution. Two packages that are often mentioned in this context are Pooch and Intake.

If we want the data fetching to be as robust as this repo itself, then we could have an architecture like this:

  • Use a separate GitHub organization, say scipy-datasets
  • Add one new repo for each dataset under that (e.g., dataset-iris)
  • In the data loader, do a shallow git clone over https, or if git is not available fetch the tarball from a direct link
  • Cache downloaded datasets locally

We want the datasets to be under our control, and hosted with minimal maintenance (so no server maintenance for example). If that imposes constraints like "no individual files over 50 MB and no repo size over 2 GB", that should be okay.
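A rough sketch of the clone-or-tarball loader described in the list above. Everything here is illustrative: the scipy-datasets org and dataset-iris naming come from this comment, but the cache location, URLs, and function name are assumptions, not an agreed design.

```python
# Hypothetical loader: one repo per dataset under a "scipy-datasets"
# GitHub org, shallow-cloned over https (or fetched as a tarball when
# git is unavailable) and cached locally. Names/URLs are illustrative.
import shutil
import subprocess
import tarfile
import urllib.request
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "scipy-datasets"  # assumed cache location


def fetch_dataset(name):
    """Return the local directory for dataset ``name``, downloading it once."""
    target = CACHE_DIR / name
    if target.exists():  # already cached locally
        return target
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    repo_url = f"https://github.com/scipy-datasets/dataset-{name}"
    if shutil.which("git"):  # prefer a shallow git clone over https
        subprocess.run(
            ["git", "clone", "--depth", "1", repo_url, str(target)],
            check=True,
        )
    else:  # fall back to the auto-generated tarball of the default branch
        tarball, _ = urllib.request.urlretrieve(repo_url + "/tarball/main")
        with tarfile.open(tarball) as tf:
            tf.extractall(target)
    return target
```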

Some of the datasets mentioned in this thread could be a good test bed.

I'd suggest first reading about Intake and Pooch, and trying to figure out if one of them is suitable for us out of the box. Also check the scikit-learn data loader. Then summarize here which you'd prefer and why. After that, update this PR with that approach and do some testing. Then ask others who are interested to do some testing.
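For a sense of what the Pooch route would look like, here is a minimal sketch; the base_url, file name, and hash below are placeholders, not a vetted design:

```python
# Minimal Pooch sketch; base_url, file names, and hashes are placeholders.
import pooch

fetcher = pooch.create(
    path=pooch.os_cache("scipy-datasets"),  # per-user cache directory
    base_url="https://github.com/scipy-datasets/dataset-iris/raw/main/",
    registry={
        # file name -> known hash, so downloads are verified and cached
        "iris.csv": "sha256:0000000000000000000000000000000000000000000000000000000000000000",
    },
)

# Downloads on first use, then serves the cached copy; returns a local path.
iris_path = fetcher.fetch("iris.csv")
```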

A datasets submodule should have zero dependencies on other modules. Which means it could also be developed as a separate package. That's a lot lighter-weight to deal with. And then we can decide whether it makes sense to depend on, or rename it to scipy.datasets. If the price in SciPy binary size and maintainability is small enough, then I think making it scipy.datasets may be preferable. The stats.tests.data directory alone is over 2 MB, and the misc/*.dat files are 2 MB as well. So we could conceivably make wheels smaller rather than larger (or easier to strip):

```
% find . -name "*.dat" | xargs du -sch
5.1M
% find . -name "*.npy" | xargs du -sch
408K
% find . -name "*.txt" | xargs du -sch
18M
```

@leouieda
Hi all, we're currently re-organizing how we distribute sample data in Fatiando a Terra. We were using Pooch in each of our packages with a datasets submodule that handled the downloading, caching, and version-based sandboxing. We recently switched to a separate package that all our packages will use instead: Ensaio. This way we can reuse sample data between projects without creating complex interdependencies.

The new setup is:

  • Data curation and preprocessing is done in a separate repo: https://github.com/fatiando/data
  • We periodically make releases of the data bundle using semantic versioning
  • Datasets are stored as GitHub release artefacts (v1.0.0) and on Zenodo (v1.0.0)
  • Ensaio uses Pooch to download the data from either GitHub or Zenodo (can be controlled with an environment variable for now): https://github.com/fatiando/ensaio/blob/main/ensaio/v1.py
  • For compatibility, each major release of the data (meaning that existing datasets were altered) gets a separate module in Ensaio (ensaio.v1, etc.). Minor data releases only add new datasets, so compatibility is guaranteed and examples in the documentation keep working even if users install newer versions of Ensaio.

Intake is also a great alternative but I'm not that familiar with their setup. For our packages, we opted for Pooch to keep things lightweight and also avoid loading data to memory automatically. Ensaio (and Pooch) only return the path to the downloaded/cached file. I find this a positive thing for documentation/tutorials since a common complaint we have is users not knowing how to apply the code to their own data. With examples including the data loading step, that becomes clearer (at least that's our hope). Pooch can also now download directly from Zenodo and figshare based only on the DOI.
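As a small illustration of that DOI-based download (the DOI and file name below are placeholders, not a real dataset):

```python
# Fetch a single file from a Zenodo archive by DOI; returns the local path.
# The DOI and file name are placeholders, not a real dataset.
import pooch

path = pooch.retrieve(
    url="doi:10.5281/zenodo.1234567/data.csv",
    known_hash=None,  # skip hash verification for this illustration
)
print(path)  # the cached local file path; no data is loaded into memory
```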

Hope this helps and happy to chat more about this. I know some folks at Project Pythia (@andersy005) have been doing similar things.

AnirudhDagar added a commit to AnirudhDagar/scipy that referenced this pull request Feb 16, 2022
Note that initial work for creating scipy.datasets module
was carried in scipy#8707
by @WarrenWeckesser and all the credit for the new datasets module
should be attributed to his initial push.
AnirudhDagar added a commit to AnirudhDagar/scipy that referenced this pull request Jul 21, 2022
Enable meson support for scipy.datasets
Enable scipy.datasets in refguide_check.py

Note that initial work for creating scipy.datasets module
was carried in scipy#8707
by @WarrenWeckesser and all the credit for the new datasets module
should be attributed to his initial push.
AnirudhDagar added a commit to AnirudhDagar/scipy that referenced this pull request Jul 22, 2022
AnirudhDagar added a commit to AnirudhDagar/scipy that referenced this pull request Aug 11, 2022
rgommers pushed a commit that referenced this pull request Aug 12, 2022
* ENH: Add scipy.datasets module

Enable meson support for scipy.datasets
Enable scipy.datasets in refguide_check.py

Note that initial work for creating scipy.datasets module
was carried in #8707
by @WarrenWeckesser and all the credit for the new datasets module
should be attributed to his initial push.


* MAINT: Silence pooch import dep warning python3.11

Add scipy.datasets files to allowed warning filters
Use warnings.filterwarnings without kw action

Note that when using `warnings.filterwarnings`, a special test needs to
be made aware of the addition of the filtered warning. This test is
the `test_warning_calls_filters` in `test_warnings.py`

It should also be noted that the `test_warning_calls_filters` uses the fixture
`warning_calls` which in turn uses ast to find lines with the use of filtered warnings.
The implementation here is not actually complete since it doesn't account for
`warnings.filterwarnings` or `warnings.simplefilter` with keyword arguments.

E.g., when `warnings.filterwarnings(action='ignore')` is used instead
of `warnings.filterwarnings('ignore')`, the keyword 'action'
is not accounted for in the test, and the test itself runs into a
list-index-out-of-range error. This should be fixed in a separate PR.


Co-authored-by: Warren Weckesser <warren.weckesser@gmail.com>
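A tiny illustration of the failure mode described in that commit message (not the actual scipy test code): an inspection that assumes the warning action is always the first positional argument hits an IndexError when it arrives as a keyword.

```python
# Illustrative only: why positional-only AST inspection breaks on
# warnings.filterwarnings(action='ignore'). Not the actual scipy test code.
import ast

source = "warnings.filterwarnings(action='ignore')"
call = ast.parse(source).body[0].value  # the Call node

try:
    mode = call.args[0]  # IndexError: args is empty for this call
except IndexError:
    # The action actually arrived as a keyword argument:
    mode = next(k.value for k in call.keywords if k.arg == "action")

print(ast.literal_eval(mode))  # 'ignore'
```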
@rgommers (Member)
Completed in gh-15607, closing. Thanks Warren!

@rgommers rgommers closed this Aug 12, 2022
tartopohm pushed a commit to tartopohm/scipy that referenced this pull request Aug 13, 2022
Labels: enhancement (A new feature or improvement)
7 participants