
WIP/ENH: datasets: Create the datasets subpackage. #8707

Closed

Conversation

@WarrenWeckesser (Member) commented Apr 11, 2018

  • Create the new subpackage datasets.
  • Move misc.face, misc.ascent, and misc.electrocardiogram to datasets.
  • Deprecate misc.face, misc.ascent, and misc.electrocardiogram.
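For context, a minimal sketch of what such a deprecation shim in scipy.misc could look like (illustrative only, not the PR's actual diff; it assumes the implementations move to scipy.datasets):

```python
# scipy/misc/__init__.py -- illustrative sketch, not the PR's actual diff.
import warnings

from scipy import datasets as _datasets


def face(gray=False):
    """Deprecated alias for scipy.datasets.face."""
    warnings.warn(
        "scipy.misc.face is deprecated and was moved to scipy.datasets.face.",
        DeprecationWarning,
        stacklevel=2,
    )
    return _datasets.face(gray=gray)
```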

@pv (Member) commented Apr 11, 2018 via email

@WarrenWeckesser WarrenWeckesser changed the title ENH: datasets: Create the datasets subpackage. WIP/ENH: datasets: Create the datasets subpackage. Apr 11, 2018
@WarrenWeckesser (Member, Author)

This will wait for 1.2.

Review comment on the new module docstring (diff hunk @@ -0,0 +1,192 @@):

    """
    Functions that load datasets.

(Member): In the meantime (moving toward 1.2.0) this would be a good place to try to distill a summary of the scope of this submodule from the mailing list.

@rgommers rgommers added this to the 1.2.0 milestone Apr 11, 2018
Commit (updated):

Create the new subpackage `datasets`.
Move misc.face, misc.ascent, and misc.electrocardiogram to datasets.
Deprecate misc.face and misc.ascent.
(misc.electrocardiogram has not been in a release yet, so it does
not require deprecation.)
@WarrenWeckesser (Member, Author)
I updated the PR for 1.2, but this is still work-in-progress for two reasons. First, we haven't actually agreed to create a datasets subpackage; second, I'd like to include at least one new dataset in this PR and use it in one or more docstrings. A good candidate is Fisher's iris data, which could be used as sample data for several functions in scipy.stats.
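For illustration, assuming a hypothetical loader for the iris data, a scipy.stats docstring example could look like the sketch below. The loader name is made up; the stand-in values are the first five setosa and versicolor sepal lengths from Fisher's data.

```python
# Hypothetical docstring example; `datasets.iris()` does not exist (yet).
import numpy as np
from scipy import stats

# sepal_length = datasets.iris()   # hypothetical loader
# Stand-in values: first five setosa and versicolor sepal lengths (cm)
# from Fisher's iris data.
setosa = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
versicolor = np.array([7.0, 6.4, 6.9, 5.5, 6.5])

# A two-sample t-test is the kind of usage such a docstring could show.
t, p = stats.ttest_ind(setosa, versicolor)
print(t, p)
```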

@lagru (Contributor) left a comment
Small notice: electrocardiogram is now imported and used in benchmarks/peak_finding.py as well (see #8769).

@WarrenWeckesser WarrenWeckesser removed this from the 1.2.0 milestone Oct 24, 2018
@WarrenWeckesser (Member, Author)
I removed the 1.2 milestone. I'll get back to this, but probably not in time for 1.2.

@larsoner (Member)
If you want, I could try to take over and push to get it into 1.2, assuming we are settled on the API. I think we did have things more or less settled on the mailing list, but I could be wrong. I'm also fine waiting if you are keen to keep working on it @WarrenWeckesser.

I would probably also just migrate the existing datasets rather than adding anything new. That would perhaps make review simpler, though also less fun.

@WarrenWeckesser WarrenWeckesser marked this pull request as draft April 19, 2020 13:18
@leouieda
Hi all, just wanted to point out the Pooch package which was created for this exact purpose. It handles downloading and caching sample data so that they don't have to be included in the distributions (making downloads smaller). The dependencies are minimal (requests, packaging, and appdirs) so it wouldn't be a huge burden on users. It might be useful if the goal is to expand the sample data or include some larger datasets.

@rgommers (Member)
Thanks for your input @leouieda. I think the current thinking on this PR is to keep it outside of SciPy, instead making a separate package (as mentioned in https://numpy.org/neps/nep-0044-restructuring-numpy-docs.html#data-sets) where datasets for multiple packages like NumPy and SciPy can be stored.

Pooch was already brought up by a couple of people in conversations about that new package.

@AnirudhDagar (Member)
Hi all, I'd like to know the state of this PR in 2022 and our plans to actually deprecate the misc module and create the datasets module in SciPy as proposed.

What's blocking the PR right now? Is it the addition of new datasets (like iris or co2)? If yes, I'd like to help move these efforts forward and push this, assuming the datasets API design is already decided on.

Also, what are some of the popular datasets that the Scientific community would like us to include? Thanks!

@rgommers (Member) commented Jan 13, 2022

> Hi all, I'd like to know the state of this PR in 2022 and our plans to actually deprecate the misc module and create the datasets module in SciPy as proposed.

Hi @AnirudhDagar, this PR/idea is still of interest I think. As is deprecating misc.

The key thing to me is how we can add datasets without bloating SciPy, and without making things fragile. This requires a lazy data loading solution. Two packages that are often mentioned in this context are Pooch and Intake.

If we want the data fetching to be as robust as this repo itself, then we could have an architecture like this:

  • Use a separate GitHub organization, say scipy-datasets
  • Add one new repo for each dataset under that (e.g., dataset-iris)
  • In the data loader, do a shallow git clone over https, or if git is not available fetch the tarball from a direct link
  • Cache downloaded datasets locally

We want the datasets to be under our control, and hosted with minimal maintenance (so no server maintenance for example). If that imposes constraints like "no individual files over 50 MB and no repo size over 2 GB", that should be okay.
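A rough sketch of the clone-or-tarball loader described in the list above. Everything here is illustrative: the scipy-datasets org and dataset-iris naming come from this comment, but the cache location, URLs, and function name are assumptions, not an agreed design.

```python
# Hypothetical loader: one repo per dataset under a "scipy-datasets"
# GitHub org, shallow-cloned over https (or fetched as a tarball when
# git is unavailable) and cached locally. Names/URLs are illustrative.
import shutil
import subprocess
import tarfile
import urllib.request
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "scipy-datasets"  # assumed cache location


def fetch_dataset(name):
    """Return the local directory for dataset ``name``, downloading it once."""
    target = CACHE_DIR / name
    if target.exists():  # already cached locally
        return target
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    repo_url = f"https://github.com/scipy-datasets/dataset-{name}"
    if shutil.which("git"):  # prefer a shallow git clone over https
        subprocess.run(
            ["git", "clone", "--depth", "1", repo_url, str(target)],
            check=True,
        )
    else:  # fall back to the auto-generated tarball of the default branch
        tarball, _ = urllib.request.urlretrieve(repo_url + "/tarball/main")
        with tarfile.open(tarball) as tf:
            tf.extractall(target)
    return target
```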

Some of the datasets mentioned in this thread could be a good test bed.

I'd suggest first reading about Intake and Pooch, and trying to figure out if one of them is suitable for us out of the box. Also check the scikit-learn data loader. Then summarize here which you'd prefer and why. After that, update this PR with that approach and do some testing. Then ask others who are interested to do some testing.
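For a sense of what the Pooch route would look like, here is a minimal sketch; the base_url, file name, and hash below are placeholders, not a vetted design:

```python
# Minimal Pooch sketch; base_url, file names, and hashes are placeholders.
import pooch

fetcher = pooch.create(
    path=pooch.os_cache("scipy-datasets"),  # per-user cache directory
    base_url="https://github.com/scipy-datasets/dataset-iris/raw/main/",
    registry={
        # file name -> known hash, so downloads are verified and cached
        "iris.csv": "sha256:0000000000000000000000000000000000000000000000000000000000000000",
    },
)

# Downloads on first use, then serves the cached copy; returns a local path.
iris_path = fetcher.fetch("iris.csv")
```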

A datasets submodule should have zero dependencies on other modules. Which means it could also be developed as a separate package. That's a lot lighter-weight to deal with. And then we can decide whether it makes sense to depend on, or rename it to scipy.datasets. If the price in SciPy binary size and maintainability is small enough, then I think making it scipy.datasets may be preferable. The stats.tests.data directory alone is over 2 MB, and the misc/*.dat files are 2 MB as well. So we could conceivably make wheels smaller rather than larger (or easier to strip):

```
% find . -name "*.dat" | xargs du -sch
5.1M
% find . -name "*.npy" | xargs du -sch
408K
% find . -name "*.txt" | xargs du -sch
18M
```

@leouieda
Hi all, we're currently re-organizing how we distribute sample data in Fatiando a Terra. We were using Pooch in each of our packages with a datasets submodule that handled the downloading, caching, and version-based sandboxing. We recently switched to a separate package that all our packages will use instead: Ensaio. This way we can reuse sample data between projects without creating complex interdependencies.

The new setup is:

  • Data curation and preprocessing is done in a separate repo: https://github.com/fatiando/data
  • We periodically make releases of the data bundle using semantic versioning
  • Datasets are stored as GitHub release artefacts (v1.0.0) and on Zenodo (v1.0.0)
  • Ensaio uses Pooch to download the data from either GitHub or Zenodo (can be controlled with an environment variable for now): https://github.com/fatiando/ensaio/blob/main/ensaio/v1.py
  • For compatibility, each major release of the data (meaning that existing datasets were altered) gets a separate module in Ensaio (ensaio.v1, etc.). Minor data releases only add new datasets, so compatibility is guaranteed and examples in the documentation keep working even if users install newer versions of Ensaio.

Intake is also a great alternative but I'm not that familiar with their setup. For our packages, we opted for Pooch to keep things lightweight and also avoid loading data to memory automatically. Ensaio (and Pooch) only return the path to the downloaded/cached file. I find this a positive thing for documentation/tutorials since a common complaint we have is users not knowing how to apply the code to their own data. With examples including the data loading step, that becomes clearer (at least that's our hope). Pooch can also now download directly from Zenodo and figshare based only on the DOI.
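As a small illustration of that DOI-based download (the DOI and file name below are placeholders, not a real dataset):

```python
# Fetch a single file from a Zenodo archive by DOI; returns the local path.
# The DOI and file name are placeholders, not a real dataset.
import pooch

path = pooch.retrieve(
    url="doi:10.5281/zenodo.1234567/data.csv",
    known_hash=None,  # skip hash verification for this illustration
)
print(path)  # the cached local file path; no data is loaded into memory
```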

Hope this helps and happy to chat more about this. I know some folks at Project Pythia (@andersy005) have been doing similar things.

AnirudhDagar added a commit to AnirudhDagar/scipy that referenced this pull request Feb 16, 2022
Note that initial work for creating scipy.datasets module
was carried in scipy#8707
by @WarrenWeckesser and all the credit for the new datasets module
should be attributed to his initial push.
AnirudhDagar added a commit to AnirudhDagar/scipy that referenced this pull request Jul 21, 2022
Enable meson support for scipy.datasets
Enable scipy.datasets in refguide_check.py

Note that initial work for creating scipy.datasets module
was carried in scipy#8707
by @WarrenWeckesser and all the credit for the new datasets module
should be attributed to his initial push.
AnirudhDagar added a commit to AnirudhDagar/scipy that referenced this pull request Jul 22, 2022
AnirudhDagar added a commit to AnirudhDagar/scipy that referenced this pull request Aug 11, 2022
rgommers pushed a commit that referenced this pull request Aug 12, 2022
* ENH: Add scipy.datasets module

Enable meson support for scipy.datasets
Enable scipy.datasets in refguide_check.py

Note that initial work for creating scipy.datasets module
was carried in #8707
by @WarrenWeckesser and all the credit for the new datasets module
should be attributed to his initial push.


* MAINT: Silence pooch import dep warning python3.11

Add scipy.datasets files to allowed warning filters
Use warnings.filterwarnings without kw action

Note that when using `warnings.filterwarnings`, a special test needs to
be made aware of the addition of the filtered warning. This test is
the `test_warning_calls_filters` in `test_warnings.py`

It should also be noted that the `test_warning_calls_filters` uses the fixture
`warning_calls` which in turn uses ast to find lines with the use of filtered warnings.
The implementation here is not actually complete since it doesn't account for
`warnings.filterwarnings` or `warnings.simplefilter` with keyword arguments.

E.g., when `warnings.filterwarnings(action='ignore')` is used instead
of `warnings.filterwarnings('ignore')`, the keyword 'action'
is not accounted for in the test, and the test itself runs into a
list-index-out-of-range error. This should be fixed in a separate PR.


Co-authored-by: Warren Weckesser <warren.weckesser@gmail.com>
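A tiny illustration of the failure mode described in that commit message (not the actual scipy test code): an inspection that assumes the warning action is always the first positional argument hits an IndexError when it arrives as a keyword.

```python
# Illustrative only: why positional-only AST inspection breaks on
# warnings.filterwarnings(action='ignore'). Not the actual scipy test code.
import ast

source = "warnings.filterwarnings(action='ignore')"
call = ast.parse(source).body[0].value  # the Call node

try:
    mode = call.args[0]  # IndexError: args is empty for this call
except IndexError:
    # The action actually arrived as a keyword argument:
    mode = next(k.value for k in call.keywords if k.arg == "action")

print(ast.literal_eval(mode))  # 'ignore'
```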
@rgommers (Member)
Completed in gh-15607, closing. Thanks Warren!

@rgommers rgommers closed this Aug 12, 2022
tartopohm pushed a commit to tartopohm/scipy that referenced this pull request Aug 13, 2022
Labels: enhancement (A new feature or improvement)
7 participants