-
Notifications
You must be signed in to change notification settings - Fork 815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Datasets contribution guidelines #1798
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing these guidelines @parmeet, this will very useful for new contributors! 🚀
I would just add one comment on adding the dataset to the documentation in docs/source/datasets.rst
, otherwise LGTM!
Great suggestion @VirgileHlav. Let me do that before landing :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Left a couple of nit comments around spelling and grammar. Thanks for adding contribution guidelines around datasets!
CONTRIBUTING_DATASETS.md
Outdated
- Number of Citations received by the dataset | ||
- Community needs | ||
- `Licensing concerns:` Last, but not least, make sure there are no licensing concerns over providing access to the | ||
dataset through torchtext’s Datasets API. We have a disclaimer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be consistent with how we refer to "TorchText"? i.e. not mixing "TorchText" with "torchtext"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I think we can probably use all lower case (except when starting sentence in which case we can capitalize the first letter)?
CONTRIBUTING_DATASETS.md
Outdated
Sample code to add function definition: | ||
|
||
```python | ||
DATASET_NAME = “MyDataName” |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we use "
character here instead of “
and ”
CONTRIBUTING_DATASETS.md
Outdated
We use mocking to implement end-2-end testing for the implemented dataset. We avoid testing using a real dataset since | ||
it is expensive to download and/or cache the dataset for testing purposes. | ||
|
||
To implement the dataset test, create the corresponding testing file `test_<datasetname>.py` under tests/datasets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we convert the filepath into code tests/datasets
CONTRIBUTING_DATASETS.md
Outdated
- Samples returned on iterating over the dataset | ||
- Dataset returned by passing split argument as `tuple` and as `str` | ||
|
||
For detailed examples on how to write the test, please follow the existing test suite under tests/datasets directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: tests/datasets
Thanks @Nayef211 for the detailed review. Let me take care of all the comments before landing :) |
Adding contribution guidelines to implement datasets based on DataPipes