Skip to content

Conversation

@LilianBoulard
Copy link
Member

Follows #147


This is a refactor of the dataset fetching system.
Historically, datasets were fetched from various websites.

Our aim with this update is to only use OpenML.org's API, through Scikit-learn's fetch_openml() function.
This allows us to have a much more reliable and unified interface, and avoids losing access to datasets due to deletion, renaming, etc.

For instance, with the current system, 4 on the 7 datasets are unavailable (403 Access Denied, website down, etc.).

The user-interface stays the same with the functions fetch_*() (e.g fetch_open_payments()), still returning a similar dictionary.
The major difference is that this dictionary returns, among other information, a path, where a CSV file is located, and must be loaded (using for instance pandas' read_csv() function).


TL;DR of the previous thread:
the way fetch_openml() is used here makes pandas a requirement.

@LilianBoulard LilianBoulard added the enhancement New feature or request label Feb 18, 2021
@LilianBoulard
Copy link
Member Author

So, for this (I hope last) error, we got an issue with the JSON decoding.
It gets raised inside sklearn.datasets.fetch_openml, and unfortunately, this is linked to the version, as the tests pass with scikit-learn>=0.21.0.

Unless told to, I won't try to dig inside sklearn's code to figure out at which step it goes wrong, and perhaps monkeypatch a solution.
So meanwhile, we have two options:

  • bump sklearn's minimal version for dirty_cat to 0.21.0
  • raise a clean error inviting the user to upgrade his version if he tries to fetch a dataset

What do you suggest ?

@GaelVaroquaux
Copy link
Member

Is this error happening for all the datasets, or only some?

@GaelVaroquaux
Copy link
Member

By the way, I checked, sklearn 0.21.0 is a couple years old. It's fine to bump the requirement.

@LilianBoulard
Copy link
Member Author

Is this error happening for all the datasets, or only some?

All of them, as the error is linked to fetch_openml and they all use it. See this screenshot:
image

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finished reviewing this and made all my comments. After these small things, this will be ready for merge.

@LilianBoulard
Copy link
Member Author

These most recent changes improve the user interface.
Now, instead of reading the CSV from the disk, it is directly loaded in memory if load_dataframe=True.
It also returns a namedtuple instead of a dictionary, which has an object-oriented interface.

@alexis-cvetkov
Copy link
Contributor

This PR looks good to me, I think it can be merged.
Thanks Lilian !

@GaelVaroquaux
Copy link
Member

I was about to merge, but don't we need a an entry summarizing things in the changelog?

@alexis-cvetkov
Copy link
Contributor

Yes you're right, I always forget that.

@GaelVaroquaux GaelVaroquaux merged commit 103b453 into skrub-data:master Oct 6, 2021
@GaelVaroquaux
Copy link
Member

Yey! Merged

@LilianBoulard LilianBoulard deleted the fetching_refactor branch October 12, 2021 14:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants