Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BC-breaking] remove unnecessary split argument from datasets #1591

Merged
merged 6 commits into from
Feb 9, 2022

Conversation

parmeet
Copy link
Contributor

@parmeet parmeet commented Feb 7, 2022

This PR remove the split argument from CC100 and Enwik9 datasets. This is superficial just to comply with other datasets format. Specially datasets that serves unsupervised learning, it is often the case that it is just a collection of text. The dev/test split are not always necessarily available.

BC-Breaking: This PR is BC-breaking for EnWiK9 dataset as we are removing the split argument.

Copy link
Contributor

@Nayef211 Nayef211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the changes look good to me. A couple of follow-ups:

  • Let's remove test_enwik9_split_argument from the corresponding test for the EnWik9 dataset
  • Should we make note of the BC breaking changes (removing the split arg from EnWik9) in our release notes?

@@ -22,10 +17,15 @@
DATASET_NAME = "EnWik9"


@_add_docstring_header(num_lines=NUM_LINES)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason we're removing the _add_docstring_header decorator and manually adding the docstrings? Can we modify the _add_docstring_header to work with a dataset function that doesn't have a split arg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decorator is doing a job of avoiding writing doc-string and make strong assumptions about datasets. But it's scope is fairly limited (for eg: it does not handle cases when there are no splits or when we have additional arguments other than root and split). Overall, I think we probably need to get rid of _add_doctring_header decorator in light of this issue #1588 anyway since we plan to expand the scope of documentation of what we currently have.

@parmeet parmeet changed the title remove unnecessary split argument from datasets [BC-breaking] remove unnecessary split argument from datasets Feb 8, 2022
@parmeet parmeet requested a review from Nayef211 February 8, 2022 23:16
Copy link
Contributor

@Nayef211 Nayef211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to remove the split argument when calling the EnWik9 dataset as well as the @parameterized.expand(["train"]) decorator for the test_enwik9 method to ensure all unit tests are passing.

On the bright side, it looks like our mocked tests are already providing useful signals when we make BC breaking changes to our datasets 😁

@parmeet
Copy link
Contributor Author

parmeet commented Feb 9, 2022

I think we still need to remove the split argument when calling the EnWik9 dataset as well as the @parameterized.expand(["train"]) decorator for the test_enwik9 method to ensure all unit tests are passing.

On the bright side, it looks like our mocked tests are already providing useful signals when we make BC breaking changes to our datasets 😁

haha, true that! :)

Copy link
Contributor

@Nayef211 Nayef211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Let's fix the stylecheck errors before merging

@parmeet parmeet merged commit 18b61fa into pytorch:main Feb 9, 2022
@parmeet parmeet deleted the remove_split branch February 9, 2022 05:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants