WIP/ENH: datasets: Create the datasets subpackage. #8707
Conversation
Force-pushed from 08dd78b to 1382913.
This will wait for 1.2.
@@ -0,0 +1,192 @@
"""
Functions that load datasets.
In the meantime (moving toward 1.2.0) this would be a good place to try to distill a summary of the scope of this submodule from the mailing list.
Create the new subpackage `datasets`. Move misc.face, misc.ascent, and misc.electrocardiogram to datasets. Deprecate misc.face and misc.ascent. (misc.electrocardiogram has not been in a release yet, so it does not require deprecation.)
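A deprecation like this is typically done by wrapping the old entry points so that calling them under the old name emits a `DeprecationWarning`. A minimal sketch of that pattern (not SciPy's actual shim; the names and return value below are hypothetical):

```python
import warnings


def _make_deprecated(func, old_name, new_name):
    """Wrap ``func`` so that calling it via its old location warns."""
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated, use {new_name} instead",
            DeprecationWarning, stacklevel=2)
        return func(*args, **kwargs)
    wrapper.__name__ = func.__name__
    return wrapper


# Hypothetical stand-in for the loader that now lives in the new location.
def face():
    return "face-array"


# What would be re-exported from the old ``misc`` namespace.
face_deprecated = _make_deprecated(face, "scipy.misc.face", "scipy.datasets.face")
```

The wrapper still delegates to the real loader, so existing code keeps working during the deprecation period while the warning points users at the new location.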
Force-pushed from 1382913 to cd54815.
I updated the PR for 1.2, but this is still work-in-progress, for two reasons. First, we haven't actually agreed to create a `datasets` subpackage.
Small notice: `electrocardiogram` is now imported and used in benchmarks/peak_finding.py as well (see #8769).
I removed the 1.2 milestone. I'll get back to this, but probably not in time for 1.2.
If you want, I could try to take over and push to get it into 1.2, assuming we are settled on the API. I think we did have things more or less settled on the mailing list, but I could be wrong. I'm also fine waiting if you are keen to keep working on it @WarrenWeckesser. I would probably also just migrate the existing datasets rather than adding anything new. This would perhaps make review simpler, but also simultaneously less fun.
Hi all, just wanted to point out the Pooch package, which was created for this exact purpose. It handles downloading and caching sample data so that they don't have to be included in the distributions (making downloads smaller). The dependencies are minimal (requests, packaging, and appdirs), so it wouldn't be a huge burden on users. It might be useful if the goal is to expand the sample data or include some larger datasets.
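For reference, the download-and-cache idea behind Pooch can be sketched with only the standard library. The registry contents and the `fetch` helper below are hypothetical (a real setup would use Pooch's own registry and fetching machinery); the downloader is injected as a callable so the sketch stays independent of any HTTP layer:

```python
import hashlib
from pathlib import Path

# Hypothetical registry mapping file names to known SHA-256 checksums,
# in the spirit of Pooch's registry (names and hashes are made up).
REGISTRY = {"ecg.dat": hashlib.sha256(b"ecg-bytes").hexdigest()}


def fetch(name, cache_dir, downloader):
    """Return the local path to ``name``, downloading it on first use.

    ``downloader`` is any callable returning the raw bytes for ``name``;
    in a real setup it would perform an HTTP request to the data host.
    """
    path = Path(cache_dir) / name
    if not path.exists():
        data = downloader(name)
        digest = hashlib.sha256(data).hexdigest()
        if digest != REGISTRY[name]:  # refuse corrupted or tampered downloads
            raise ValueError(f"checksum mismatch for {name}")
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return path
```

Subsequent calls for the same file hit the local cache and never touch the network, which is what keeps the distributions small while still making the data one function call away.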
Thanks for your input @leouieda. I think the current thinking on this PR is to keep it outside of SciPy, instead making a separate package (as mentioned in https://numpy.org/neps/nep-0044-restructuring-numpy-docs.html#data-sets) where datasets for multiple packages like NumPy and SciPy can be stored. Pooch was already brought up by a couple of people in conversations about that new package.
Hi all, I'd like to know the state of this PR in 2022 and our plans to actually deprecate the `scipy.misc` dataset functions. What's blocking the PR right now? Is it the addition of new datasets? Also, what are some of the popular datasets that the scientific community would like us to include? Thanks!
Hi @AnirudhDagar, this PR/idea is still of interest I think. As is deprecating `scipy.misc`. The key thing to me is how we can add datasets without bloating SciPy, and without making things fragile. This requires a lazy data loading solution. Two packages that are often mentioned in this context are Intake and Pooch. If we want the data fetching to be as robust as this repo itself, then we could set up the architecture accordingly.
We want the datasets to be under our control, and hosted with minimal maintenance (so no server maintenance for example). If that imposes constraints like "no individual files over 50 MB and no repo size over 2 GB", that should be okay. Several of the existing datasets could serve as a good test bed.
I'd suggest first reading about Intake and Pooch, and trying to figure out if one of them is suitable for us out of the box. Also check the scikit-learn data loader. Then summarize here which you'd prefer and why. After that, update this PR with that approach and do some testing. Then ask others who are interested to do some testing.
Hi all, we're currently re-organizing how we distribute sample data in Fatiando a Terra. We were using Pooch in each of our packages, but we're moving the data distribution into a dedicated package, Ensaio.
Intake is also a great alternative, but I'm not that familiar with their setup. For our packages, we opted for Pooch to keep things lightweight and also avoid loading data to memory automatically. Ensaio (and Pooch) only return the path to the downloaded/cached file. I find this a positive thing for documentation/tutorials, since a common complaint we have is users not knowing how to apply the code to their own data. With examples including the data loading step, that becomes clearer (at least that's our hope). Pooch can also now download directly from Zenodo and figshare based only on the DOI. Hope this helps and happy to chat more about this. I know some folks at Project Pythia (@andersy005) have been doing similar things.
Note that initial work for creating the scipy.datasets module was carried out in scipy#8707 by @WarrenWeckesser, and all the credit for the new datasets module should be attributed to his initial push.
* ENH: Add scipy.datasets module

  Enable meson support for scipy.datasets. Enable scipy.datasets in refguide_check.py.

  Note that initial work for creating the scipy.datasets module was carried out in #8707 by @WarrenWeckesser, and all the credit for the new datasets module should be attributed to his initial push.

* MAINT: Silence pooch import deprecation warning on Python 3.11

  Add the scipy.datasets files to the allowed warning filters, using `warnings.filterwarnings` without the `action` keyword. Note that when using `warnings.filterwarnings`, a special test needs to be made aware of the added filter: `test_warning_calls_filters` in `test_warnings.py`. That test uses the fixture `warning_calls`, which in turn uses `ast` to find lines that install warning filters. The implementation there is not actually complete, since it doesn't account for `warnings.filterwarnings` or `warnings.simplefilter` called with keyword arguments. E.g., when `warnings.filterwarnings(action='ignore')` is used instead of `warnings.filterwarnings('ignore')`, the `action` keyword is not accounted for and the test runs into a list index error. This should be fixed in a separate PR.

Co-authored-by: Warren Weckesser <warren.weckesser@gmail.com>
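The keyword-argument gap described in that commit message could be closed with a small `ast` pass that accepts both calling forms. A hedged sketch, not the actual fixture code (the helper name is made up):

```python
import ast


def filtered_actions(source):
    """Collect the 'action' argument of every ``*.filterwarnings`` call,
    whether it is passed positionally or as a keyword (the case the
    existing ast-based test reportedly missed)."""
    actions = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "filterwarnings"):
            if node.args:
                # positional form: filterwarnings('ignore', ...)
                actions.append(node.args[0].value)
            else:
                # keyword form: filterwarnings(action='ignore', ...)
                for kw in node.keywords:
                    if kw.arg == "action":
                        actions.append(kw.value.value)
    return actions
```

Handling both shapes in the fixture would avoid the list index error while still flagging every installed filter.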
Completed in gh-15607, closing. Thanks Warren!
Create the new subpackage `datasets`. Move `misc.face`, `misc.ascent`, and `misc.electrocardiogram` to `datasets`. Deprecate `misc.face`, `misc.ascent`, and `misc.electrocardiogram`.