[BC-breaking] remove unnecessary split argument from datasets #1591
Overall the changes look good to me. A couple of follow-ups:
- Let's remove test_enwik9_split_argument from the corresponding test for the EnWik9 dataset
- Should we make note of the BC breaking changes (removing the split arg from EnWik9) in our release notes?
@@ -22,10 +17,15 @@
DATASET_NAME = "EnWik9"


@_add_docstring_header(num_lines=NUM_LINES)
Any reason we're removing the _add_docstring_header decorator and manually adding the docstrings? Can we modify _add_docstring_header to work with a dataset function that doesn't have a split arg?
The decorator saves us from writing docstrings by hand, but it makes strong assumptions about datasets, and its scope is fairly limited (e.g., it does not handle datasets that have no splits, or that take arguments other than root and split). Overall, I think we probably need to get rid of the _add_docstring_header decorator anyway in light of issue #1588, since we plan to expand the scope of the documentation beyond what we currently have.
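For reference, the alternative raised in the question above (a header decorator that tolerates a missing split argument) could look roughly like the sketch below. This is an illustration only: add_docstring_header, enwik9_stub, and split_stub are hypothetical names, and the code makes no claim about how torchtext's own _add_docstring_header is implemented.

```python
import inspect


def add_docstring_header(description):
    """Hypothetical stand-in, NOT torchtext's _add_docstring_header."""

    def decorator(fn):
        params = inspect.signature(fn).parameters
        lines = [description, "", "Args:"]
        if "root" in params:
            lines.append("    root: Directory where the dataset is saved.")
        # Only document `split` when the function actually accepts it.
        if "split" in params:
            lines.append("    split: Split or splits to be returned.")
        header = "\n".join(lines)
        # Keep any hand-written docstring after the generated header.
        fn.__doc__ = header + ("\n\n" + fn.__doc__ if fn.__doc__ else "")
        return fn

    return decorator


@add_docstring_header("EnWik9 dataset (stub)")
def enwik9_stub(root=".data"):
    # Hypothetical dataset function without a split argument.
    return iter(())


@add_docstring_header("Dataset with splits (stub)")
def split_stub(root=".data", split=("train", "test")):
    # Hypothetical dataset function that still takes a split argument.
    return iter(())
```

Even with signature inspection, a generated header of this kind cannot describe dataset-specific arguments, which is the limitation the comment above points out.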
I think we still need to remove the split argument when calling the EnWik9 dataset as well as the @parameterized.expand(["train"]) decorator for the test_enwik9 method to ensure all unit tests are passing.
On the bright side, it looks like our mocked tests are already providing useful signals when we make BC breaking changes to our datasets 😁
haha, true that! :)
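The test change requested above (no split argument in the call, no parameterization over "train") might take a shape like the following. It is a minimal sketch using only the standard library; enwik9_stub and TestEnWik9Stub are hypothetical stand-ins, not the actual mocked test in the repository.

```python
import inspect
import unittest


def enwik9_stub(root=".data"):
    """Hypothetical stand-in for the dataset function after this PR."""
    return iter(["line one", "line two"])


class TestEnWik9Stub(unittest.TestCase):
    # No parameterization over splits: the test method takes no
    # `split` parameter and the dataset is called with `root` only.
    def test_enwik9(self):
        samples = list(enwik9_stub(root=".data"))
        self.assertEqual(samples, ["line one", "line two"])

    # Guards against the split argument being reintroduced by accident.
    def test_no_split_parameter(self):
        params = inspect.signature(enwik9_stub).parameters
        self.assertNotIn("split", params)
```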
LGTM! Let's fix the stylecheck errors before merging
This PR removes the split argument from the CC100 and EnWik9 datasets. The change is superficial and is made just to comply with the format of the other datasets. Datasets that serve unsupervised learning in particular are often just a collection of text, and dev/test splits are not always available.
BC-Breaking: This PR is BC-breaking for the EnWik9 dataset, as we are removing the split argument.
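To illustrate what the BC break means for call sites, here is a small sketch with a stub that has the post-PR shape (root only). enwik9_stub is a hypothetical stand-in so the example runs without downloading anything; it is not the real dataset function.

```python
def enwik9_stub(root=".data"):
    """Hypothetical stand-in with the post-PR signature (no split)."""
    return iter(["sample text"])


# After this PR: call the dataset without a split argument.
data = list(enwik9_stub(root=".data"))

# Call sites that still pass split fail with a TypeError,
# because the function no longer accepts that keyword.
try:
    enwik9_stub(root=".data", split="train")
    old_call_failed = False
except TypeError:
    old_call_failed = True
```

Callers that previously passed split="train" would only need to drop that argument.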