Add redundancy to OpenML datasets with Figshare #1218
jeromedockes merged 51 commits into skrub-data:main
Conversation
Datasets to upload as CSV, in a zip:
jeromedockes left a comment
nice! a first round of comments :)
Update examples
I think something is off with CircleCI.
TODO left:
skrub/datasets/_utils.py (Outdated)

```python
    return bunch

def _load_dataset_files(dataset_name, data_home):
```
I guess this method should be named `load_dataset_files`, as it is used outside of this module?
There seems to be a problem with the videogame sales dataset (example 6 -- see the negative R² scores); I'll look into it.

Edit: the dataset is ordered by decreasing global sales, so we needed `KFold` with `shuffle=True`. I just pushed the fix.
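The effect is easy to reproduce on synthetic data (a sketch, not the actual videogame sales table): when rows are sorted by the target, an unshuffled `KFold` gives each test fold a range of target values the model never saw in training, which can drive R² negative.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=200)

# Sort rows by decreasing target, like a table ordered by global sales.
order = np.argsort(-y)
X, y = X[order], y[order]

model = DecisionTreeRegressor(random_state=0)
# Without shuffling, each test fold holds target values outside the
# training range, so the scores can be very poor (even negative).
sorted_scores = cross_val_score(model, X, y, cv=KFold(shuffle=False))
# Shuffling restores representative train/test splits.
shuffled_scores = cross_val_score(
    model, X, y, cv=KFold(shuffle=True, random_state=0)
)
print(sorted_scores.mean(), shuffled_scores.mean())
```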
I added the docstrings. @Vincent-Maladiere what was the second TODO?
I uploaded the zip files to OSF and updated the URLs in this PR, so it should be mostly OK now.

Checking uploads:

```python
import hashlib

import requests

from skrub.datasets._utils import DATASET_INFO

for name, data in DATASET_INFO.items():
    print(name)
    assert len(data['urls']) == 2
    assert 'github.com' in data['urls'][0]
    assert 'osf.io' in data['urls'][1]
    for url in data['urls']:
        content = requests.get(url).content
        assert hashlib.sha256(content).hexdigest() == data['sha256']
```
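For anyone adding a new dataset, the checksum in `DATASET_INFO` can be computed locally before uploading; a small helper along these lines (hypothetical, not part of skrub) hashes the zip file in chunks rather than loading it whole into memory:

```python
import hashlib

def sha256_of(path):
    """Hex SHA-256 digest of a local file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```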
@glemaitre if you have time, you may want to have a look at this PR too.
+1 for the changes. I think this PR is ready for a final review before merging (@glemaitre if you have some spare time) :)
Addresses #1217