[Dataset] Remove support for DatasetDict as input into from_huggingface() #37555
Signed-off-by: Scott Lee <sjl@anyscale.com>
python/ray/data/read_api.py
Outdated
dataset: A `Hugging Face Datasets Dataset`_ or `DatasetDict`_.
:class:`~ray.data.IterableDataset` isn't supported.
dataset: A `Hugging Face Datasets Dataset`_.
:class:`~ray.data.IterableDataset` and
Man this IterableDataset link is just straight up wrong 😅. Not sure how this had slipped in before.
if isinstance(dataset, datasets.DatasetDict):
    available_keys = list(dataset.keys())
    logger.warning(
        "You provided a Huggingface DatasetDict which contains multiple "
Let's keep this, but change it to an error message.
Would the last case, which raises TypeError, cover this case as well? Or do we want to explicitly raise a separate error for this case, with a message like "DatasetDict is not supported"?
I think we want an explicit error for this case since we can show the actual splits and provide an example on how to get a specific split from the DatasetDict.
Btw, this should raise a DeprecationWarning, since technically this is a breaking change, right?
python/ray/data/read_api.py
Outdated
"Datasets. To convert just a single Huggingface Dataset to a "
logger.error(
    "You provided a Hugging Face DatasetDict which contains multiple "
    "datasets, but `from_huggingface` now only accepts a siingle Hugging Face "
siingle -> single
python/ray/data/read_api.py
Outdated
"You provided a Huggingface DatasetDict which contains multiple "
"datasets. The output of `from_huggingface` is a dictionary of Ray "
"Datasets. To convert just a single Huggingface Dataset to a "
logger.error(
Don't we want to raise an exception here?
Ah okay, got it. Let me raise a TypeError here.
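The outcome of this thread (an explicit `TypeError` that lists the available splits and points to a single split) might look roughly like the sketch below. It is not the code merged into `python/ray/data/read_api.py`. `Dataset` and `DatasetDict` are hypothetical stand-ins for `datasets.Dataset` and `datasets.DatasetDict` so the snippet runs without Hugging Face Datasets or Ray installed. `check_huggingface_input` is an illustrative name, and the message wording is an assumption pieced together from the snippets quoted above.

```python
class Dataset:
    """Hypothetical stand-in for `datasets.Dataset` (assumption, not the real class)."""


class DatasetDict(dict):
    """Hypothetical stand-in for `datasets.DatasetDict`, a dict mapping split name to Dataset."""


def check_huggingface_input(dataset):
    """Return `dataset` if it is a single Dataset, otherwise raise TypeError.

    Illustrative helper only; the real check lives inside `from_huggingface`.
    """
    if isinstance(dataset, DatasetDict):
        # Explicit branch so the error can show the actual splits.
        available_keys = list(dataset.keys())
        raise TypeError(
            "You provided a Hugging Face DatasetDict which contains multiple "
            "datasets, but `from_huggingface` now only accepts a single "
            "Hugging Face Dataset. To convert just a single Hugging Face "
            "Dataset to a Ray Dataset, specify a split. For example, "
            "`ray.data.from_huggingface(my_dataset_dictionary['train'])`. "
            f"Available splits are {available_keys}."
        )
    if not isinstance(dataset, Dataset):
        # Generic fallback for every other unsupported type.
        raise TypeError(
            f"`dataset` must be a Hugging Face Dataset, got {type(dataset).__name__}."
        )
    return dataset
```

With this shape, passing `my_dataset_dictionary["train"]` goes through unchanged, while passing the whole dictionary fails immediately with a message naming the splits to choose from.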
…gface()` (ray-project#37555) To keep the from_huggingface() API simple and consistent with other from_xxx read APIs, we remove support for Hugging Face datasets.DatasetDict, so that the method only accepts a Hugging Face datasets.Dataset. --------- Signed-off-by: Scott Lee <sjl@anyscale.com> Signed-off-by: NripeshN <nn2012@hw.ac.uk>
Why are these changes needed?
To keep the `from_huggingface()` API simple and consistent with other `from_xxx` read APIs, we remove support for Hugging Face `datasets.DatasetDict`, so that the method only accepts a Hugging Face `datasets.Dataset`.
Related issue number
Closes #37523
Checks
- (`git commit -s`) in this PR.
- `scripts/format.sh` to lint the changes in this PR.
- ... method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.