[BC-breaking] remove unnecessary split argument from datasets #1591

parmeet · 2022-02-07T22:56:37Z

This PR removes the split argument from the CC100 and EnWik9 datasets. The argument was superficial and existed only to match the format of other datasets. Datasets that serve unsupervised learning are often just a collection of text, and dev/test splits are not always available.

BC-Breaking: This PR is BC-breaking for the EnWik9 dataset because it removes the split argument.
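For readers who want to see what the break looks like at a call site, here is a minimal sketch. `enwik9_old` and `enwik9_new` are hypothetical stand-ins that only mimic the signature change described above; they are not torchtext's implementation, and the sample lines and the assumption that "train" was the only accepted split are illustrative.

```python
from typing import Iterator


def enwik9_old(root: str = ".data", split: str = "train") -> Iterator[str]:
    """Hypothetical stand-in for the pre-PR signature, which accepted `split`."""
    # Assumption for illustration: "train" is the only split on offer.
    if split != "train":
        raise ValueError(f"unsupported split: {split!r}")
    yield from ("line one", "line two")


def enwik9_new(root: str = ".data") -> Iterator[str]:
    """Hypothetical stand-in for the post-PR signature: no `split` parameter."""
    yield from ("line one", "line two")


# Old-style call sites that still pass `split` fail at call time.
try:
    list(enwik9_new(split="train"))
    old_call_still_works = True
except TypeError:
    old_call_still_works = False

# Migrated call sites simply drop the argument.
lines = list(enwik9_new(root=".data"))
```

Under these assumptions, migrating a caller amounts to deleting the `split=...` keyword.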

Nayef211

Overall the changes look good to me. A couple of follow-ups:

Let's remove test_enwik9_split_argument from the corresponding test for the EnWik9 dataset
Should we make note of the BC breaking changes (removing the split arg from EnWik9) in our release notes?

Nayef211 · 2022-02-08T02:42:05Z

torchtext/datasets/enwik9.py

@@ -22,10 +17,15 @@
 DATASET_NAME = "EnWik9"


-@_add_docstring_header(num_lines=NUM_LINES)


Any reason we're removing the _add_docstring_header decorator and manually adding the docstrings? Can we modify the _add_docstring_header to work with a dataset function that doesn't have a split arg?

The decorator saves us from writing docstrings by hand, but it makes strong assumptions about datasets, and its scope is fairly limited (e.g., it does not handle datasets that have no splits or that take arguments other than root and split). Overall, I think we probably need to get rid of the _add_docstring_header decorator anyway in light of issue #1588, since we plan to expand the scope of our current documentation.
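As a rough illustration of the alternative raised in the question above, a docstring decorator could inspect the wrapped function and document `split` only when that parameter exists. The sketch below is hypothetical: `add_docstring_header`, its header text, and the `num_lines` values are placeholders and do not reproduce torchtext's `_add_docstring_header`.

```python
import inspect
from typing import Callable, Optional


def add_docstring_header(num_lines: Optional[int] = None) -> Callable:
    """Hypothetical decorator: prepend a generated header to a dataset docstring."""

    def decorator(fn: Callable) -> Callable:
        params = inspect.signature(fn).parameters
        header = [f"{fn.__name__} dataset", ""]
        if num_lines is not None:
            header.append(f"Number of lines: {num_lines}")
        header += ["", "Args:", "    root: Directory where the dataset is saved."]
        # Only document `split` when the wrapped function actually accepts it.
        if "split" in params:
            header.append("    split: Split or splits to be returned.")
        extra = inspect.getdoc(fn)
        fn.__doc__ = "\n".join(header) + ("\n\n" + extra if extra else "")
        return fn

    return decorator


@add_docstring_header(num_lines=100)  # placeholder count, not the real one
def EnWik9(root: str = ".data"):
    yield from ()


@add_docstring_header(num_lines=100)  # placeholder count
def SplitDataset(root: str = ".data", split: str = "train"):
    yield from ()
```

This covers the no-split case, but it still would not describe dataset-specific arguments, which is the broader limitation noted in the reply.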

Nayef211

I think we still need to remove the split argument when calling the EnWik9 dataset as well as the @parameterized.expand(["train"]) decorator for the test_enwik9 method to ensure all unit tests are passing.

On the bright side, it looks like our mocked tests are already providing useful signals when we make BC breaking changes to our datasets 😁
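The test change requested above could look roughly like the following. This is a sketch only: `enwik9` is a hypothetical stand-in for the dataset function, and the sample data is made up; the real test relies on torchtext's mocked dataset fixtures.

```python
import unittest


def enwik9(root: str = ".data"):
    """Hypothetical stand-in for the dataset function after `split` is removed."""
    yield from ("sample 0", "sample 1")


class TestEnWik9(unittest.TestCase):
    # Before the change, this method was wrapped in
    # @parameterized.expand(["train"]) and received a `split` parameter.
    def test_enwik9(self):
        samples = list(enwik9(root=".data"))
        self.assertEqual(samples, ["sample 0", "sample 1"])


# Run the test case programmatically so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestEnWik9)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```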

parmeet · 2022-02-09T03:57:55Z

haha, true that! :)

Nayef211

LGTM! Let's fix the stylecheck errors before merging

remove unnecessary split argument

daba509

pytorch-bot bot added the ciflow/default label Feb 7, 2022

facebook-github-bot added the cla signed label Feb 7, 2022

Nayef211 reviewed Feb 8, 2022

remove spit testing for enwik9

84f8819

parmeet changed the title ~~remove unnecessary split argument from datasets~~ [BC-breaking] remove unnecessary split argument from datasets Feb 8, 2022

parmeet requested a review from Nayef211 February 8, 2022 23:16

fix cc100 test

178331e

Nayef211 reviewed Feb 9, 2022

fix test enwik9

58ead42

Nayef211 approved these changes Feb 9, 2022

parmeet added 2 commits February 8, 2022 23:06

fix stylecheck

0e19bb5

Merge branch 'main' of github.com:pytorch/text into remove_split

77eac63

parmeet merged commit 18b61fa into pytorch:main Feb 9, 2022

parmeet deleted the remove_split branch February 9, 2022 05:13
