Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Dataset] Remove support for DatasetDict as input into from_huggingface() #37555

Merged
merged 7 commits into from
Jul 20, 2023

Conversation

scottjlee
Copy link
Contributor

@scottjlee scottjlee commented Jul 19, 2023

Why are these changes needed?

To keep the from_huggingface() API simple and consistent with other from_xxx read APIs, we remove support for Hugging Face datasets.DatasetDict, so that the method only accepts a Hugging Face datasets.Dataset.

Related issue number

Closes #37523

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
dataset: A `Hugging Face Datasets Dataset`_ or `DatasetDict`_.
:class:`~ray.data.IterableDataset` isn't supported.
dataset: A `Hugging Face Datasets Dataset`_.
:class:`~ray.data.IterableDataset` and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Man this IterableDataset link is just straight up wrong 😅. Not sure how this had slipped in before.

if isinstance(dataset, datasets.DatasetDict):
available_keys = list(dataset.keys())
logger.warning(
"You provided a Huggingface DatasetDict which contains multiple "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep this, but change to an error message

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would the last case raising TypeError cover this case as well? or do we want to explicitly raise a separate error for this case, with a message like "DatasetDict is not supported"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want an explicit error for this case since we can show the actual splits and provide an example on how to get a specific split from the DatasetDict.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, it should be a DeprecationWarning raised since technically this is a breaking change right.

Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee requested a review from amogkam July 19, 2023 19:56
"Datasets. To convert just a single Huggingface Dataset to a "
logger.error(
"You provided a Hugging Face DatasetDict which contains multiple "
"datasets, but `from_huggingface` now only accepts a siingle Hugging Face "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

siingle -> single

"You provided a Huggingface DatasetDict which contains multiple "
"datasets. The output of `from_huggingface` is a dictionary of Ray "
"Datasets. To convert just a single Huggingface Dataset to a "
logger.error(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we want to raise an exception here?

Copy link
Contributor Author

@scottjlee scottjlee Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah okay, got it. let me raise a TypeError here

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@amogkam amogkam merged commit 100f396 into ray-project:master Jul 20, 2023
61 of 63 checks passed
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
…gface()` (ray-project#37555)

To keep the from_huggingface() API simple and consistent with other from_xxx read APIs, we remove support for Hugging Face datasets.DatasetDict, so that the method only accepts a Hugging Face datasets.Dataset.

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: NripeshN <nn2012@hw.ac.uk>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…gface()` (ray-project#37555)

To keep the from_huggingface() API simple and consistent with other from_xxx read APIs, we remove support for Hugging Face datasets.DatasetDict, so that the method only accepts a Hugging Face datasets.Dataset.

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
…gface()` (ray-project#37555)

To keep the from_huggingface() API simple and consistent with other from_xxx read APIs, we remove support for Hugging Face datasets.DatasetDict, so that the method only accepts a Hugging Face datasets.Dataset.

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…gface()` (ray-project#37555)

To keep the from_huggingface() API simple and consistent with other from_xxx read APIs, we remove support for Hugging Face datasets.DatasetDict, so that the method only accepts a Hugging Face datasets.Dataset.

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…gface()` (ray-project#37555)

To keep the from_huggingface() API simple and consistent with other from_xxx read APIs, we remove support for Hugging Face datasets.DatasetDict, so that the method only accepts a Hugging Face datasets.Dataset.

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Data] Only support Hugging Face dataset in from_huggingface API
4 participants