Add Datasets contribution guidelines #1798

Merged Jun 26, 2022 (18 commits)

@parmeet parmeet commented Jun 20, 2022

Adding contribution guidelines to implement datasets based on DataPipes

@vcm2114 vcm2114 left a comment

Thanks for sharing these guidelines @parmeet, this will be very useful for new contributors! 🚀
I would just add one comment on adding the dataset to the documentation in docs/source/datasets.rst, otherwise LGTM!

parmeet commented Jun 20, 2022

I would just add one comment on adding the dataset to the documentation in docs/source/datasets.rst

Great suggestion @VirgileHlav. Let me do that before landing :)

@Nayef211 Nayef211 left a comment

LGTM! Left a couple of nit comments around spelling and grammar. Thanks for adding contribution guidelines around datasets!

CONTRIBUTING_DATASETS.md
- Number of Citations received by the dataset
- Community needs
- `Licensing concerns:` Last, but not least, make sure there are no licensing concerns over providing access to the
dataset through torchtext’s Datasets API. We have a disclaimer

Should we be consistent with how we refer to "TorchText"? i.e. not mixing "TorchText" with "torchtext"

Contributor Author

Sure, I think we can use all lower case (except at the start of a sentence, in which case we capitalize the first letter)?

CONTRIBUTING_DATASETS.md
Sample code to add function definition:

```python
DATASET_NAME = “MyDataName”
```

nit: can we use the " character here instead of “ and ”
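
For context on why the quote style matters in a copy-pasteable sample: Python only accepts ASCII quotes as string delimiters, so a line written with typographic quotes does not parse. The check below is a minimal sketch; the `compiles` helper and the two snippet strings are hypothetical and exist only for illustration.

```python
# Two hypothetical one-line snippets; only the quote characters differ.
curly = "DATASET_NAME = \u201cMyDataName\u201d"  # typographic quotes
straight = 'DATASET_NAME = "MyDataName"'          # ASCII quotes


def compiles(source: str) -> bool:
    """Return True if `source` parses as Python code, False on a SyntaxError."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


curly_ok = compiles(curly)        # typographic quotes are rejected by the parser
straight_ok = compiles(straight)  # ASCII quotes parse normally
```

Editors and word processors often substitute typographic quotes automatically, so it is worth checking code samples in documentation after pasting them.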

CONTRIBUTING_DATASETS.md
We use mocking to implement end-2-end testing for the implemented dataset. We avoid testing using a real dataset since
it is expensive to download and/or cache the dataset for testing purposes.

To implement the dataset test, create the corresponding testing file `test_<datasetname>.py` under tests/datasets

nit: can we convert the filepath into code `tests/datasets`
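
To illustrate the mocking approach described in the quoted guideline, here is a minimal, self-contained sketch of a test that writes a tiny fake dataset to a temporary directory and checks that a loader returns exactly those samples. It does not use the torchtext test utilities; `load_my_dataset`, `write_mock_split`, the `.tsv` layout, and the split names are all assumptions made up for this example.

```python
import os
import tempfile
import unittest


def load_my_dataset(root: str, split: str):
    """Hypothetical loader: yields (label, text) pairs from `<root>/<split>.tsv`."""
    path = os.path.join(root, f"{split}.tsv")
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            yield int(label), text


def write_mock_split(root: str, split: str, num_lines: int = 3):
    """Write a small fake split file and return the samples it should produce."""
    expected = [(i % 2, f"mock {split} sentence {i}") for i in range(num_lines)]
    with open(os.path.join(root, f"{split}.tsv"), "w", encoding="utf-8") as f:
        for label, text in expected:
            f.write(f"{label}\t{text}\n")
    return expected


class TestMyDataset(unittest.TestCase):
    def test_samples_match_mock_data(self):
        # No download happens: the "dataset" only exists in a temporary directory.
        with tempfile.TemporaryDirectory() as root:
            for split in ("train", "test"):
                expected = write_mock_split(root, split)
                self.assertEqual(list(load_my_dataset(root, split)), expected)
```

Because the mock data is generated inside the test, the expected samples are known exactly and the test stays fast and independent of network access.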

CONTRIBUTING_DATASETS.md
- Samples returned on iterating over the dataset
- Dataset returned by passing split argument as `tuple` and as `str`

For detailed examples on how to write the test, please follow the existing test suite under tests/datasets directory.

nit: `tests/datasets`
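
The quoted checklist mentions testing the dataset with the split argument passed both as a `str` and as a `tuple`. The sketch below shows one way such a contract could look, assuming a single split name returns one iterator and a tuple of names returns a tuple of iterators in the same order. `my_dataset`, `_load_split`, and `_MOCK_DATA` are hypothetical names, not the torchtext implementation.

```python
from typing import Iterator, Tuple, Union

# Hypothetical in-memory stand-in for the per-split data files.
_MOCK_DATA = {
    "train": [(0, "first train sample"), (1, "second train sample")],
    "test": [(1, "only test sample")],
}


def _load_split(split: str) -> Iterator[Tuple[int, str]]:
    """Return an iterator over the samples of one named split."""
    if split not in _MOCK_DATA:
        raise ValueError(f"Unknown split: {split!r}")
    return iter(_MOCK_DATA[split])


def my_dataset(split: Union[str, Tuple[str, ...]] = ("train", "test")):
    """Return one iterator for a `str` split, or a tuple of iterators for a `tuple`."""
    if isinstance(split, str):
        return _load_split(split)
    return tuple(_load_split(name) for name in split)


# Usage: both call styles should expose the same samples.
train_only = list(my_dataset("train"))
train_iter, test_iter = my_dataset(("train", "test"))
```

A test for this contract would compare the samples from both call styles against the mocked data, so that neither code path can silently diverge.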

parmeet commented Jun 20, 2022

Thanks @Nayef211 for the detailed review. Let me take care of all the comments before landing :)

@parmeet parmeet merged commit 238c414 into pytorch:main Jun 26, 2022
@parmeet parmeet deleted the datasets_contrib branch June 26, 2022 13:36
4 participants