
Revamp TorchText Dataset Testing Strategy #1493

Closed
27 tasks done
Nayef211 opened this issue Jan 7, 2022 · 9 comments
Labels
datasets enhancement need discussions The issues that need more discussions internally and OSS testing

Comments

Nayef211 commented Jan 7, 2022

🚀 Feature

Revamp our dataset testing strategy to reduce the amount of time spent waiting on tests to complete before merging PRs in torchtext.

Motivation

TorchText dataset testing currently relies on downloading and caching the datasets daily and then running CircleCI tests on the cached data. This can be slow and unreliable for the first PR that kicks off the dataset download and caching. In addition, dataset extraction can be time-consuming for some of the larger datasets within torchtext, and this extraction process occurs each time the dataset tests are run on a PR. For these reasons, tests on CircleCI can take up to an hour to run for each PR, whereas vision/audio tests run in mere minutes. We want to revamp our dataset testing strategy in order to reduce the amount of time we spend waiting on tests to complete before merging our PRs in torchtext.

Pitch
We need to update the legacy dataset tests within torchtext. Currently we test for things including:

  • URL link
  • MD5 hash of the entire dataset
  • dataset name
  • number of lines in dataset

Going forward it doesn't make sense to test the MD5 hash or the number of lines in the dataset. Instead we will:

  • Use mocking to test the implementation of our dataset
  • Use smoke tests for URLs and integrity of data (potentially with GitHub Actions)
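
The URL smoke-test idea can be sketched roughly as follows. The URL table, function name, and injectable `opener` parameter here are hypothetical illustrations, not torchtext's actual code; the opener is injectable so the check itself can be exercised without network access.

```python
# Sketch of a URL smoke test: issue a HEAD request per dataset URL and
# report whether the server answers. The URLs below are placeholders.
from urllib.request import Request, urlopen

DATASET_URLS = {
    "AmazonReviewPolarity": "https://example.com/amazon_review_polarity.tar.gz",
}

def url_is_reachable(url, opener=urlopen, timeout=10):
    """Return True if a HEAD request to `url` gets a 2xx/3xx response."""
    req = Request(url, method="HEAD")
    try:
        with opener(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        return False
```

A scheduled GitHub Actions job could loop over `DATASET_URLS` with a check like this and fail loudly when a mirror goes down, without any PR-blocking download.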

Backlog of Dataset Tests

Contributing

We have already implemented a dataset test for AmazonReviewPolarity (#1532) as an example to follow when implementing future dataset tests. Please leave a message below if you plan to work on a particular dataset test to avoid duplication of effort. Also please link to the corresponding PRs.

Follow-Up Items

Additional Context

We explored how other domains implement dataset testing and summarize our findings below. We will implement our new testing strategy by taking inspiration from TorchAudio and TorchVision.

Possible Approaches

  • Download and cache the dataset daily before running tests (current state of testing)
  • Create mock data for each dataset (used by torchaudio, and torchvision)
    • Requires us to understand the structure of the datasets before creating tests
  • Store a small portion of the dataset (10 lines) in an assets folder
    • Might run into legal problems since we aren't allowed to host datasets

TorchAudio Approach

  • Each test lives in its own file
  • Plan to add integration tests in the future to check dataset URLs
  • Each test class extends TestBaseMixin and PytorchTestCase (link)
    • The TestBaseMixin base class provides a consistent way to define device-/dtype-/backend-aware TestCases
  • Each test file contains a get_mock_dataset() method which is responsible for creating the mocked data and saving it to a file in a temp dir (link)
    • This method gets called in the setUp classmethod within each test class
  • The actual test method creates a dataset from the mocked dataset file and tests it
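
The TorchAudio-style pattern might look roughly like this sketch. The file format, loader, and all names are hypothetical stand-ins: get_mock_dataset() writes fake data into a temp dir once, in setUpClass, and the test loads it back instead of downloading anything.

```python
# Sketch of the TorchAudio-style mock-data pattern (illustrative names).
import os
import tempfile
import unittest

def get_mock_dataset(root):
    """Create mocked asset files under `root`; return the expected samples."""
    samples = [("utt0", "hello world"), ("utt1", "mock data")]
    with open(os.path.join(root, "transcripts.txt"), "w", encoding="utf-8") as f:
        for utt_id, text in samples:
            f.write(f"{utt_id}|{text}\n")
    return samples

def load_transcripts(root):
    """Stand-in for the dataset implementation under test."""
    with open(os.path.join(root, "transcripts.txt"), encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("|", 1)) for line in f]

class MockDatasetTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Mock data is created once per test class, not per test method.
        cls._tmp_dir = tempfile.TemporaryDirectory()
        cls.expected = get_mock_dataset(cls._tmp_dir.name)

    @classmethod
    def tearDownClass(cls):
        cls._tmp_dir.cleanup()

    def test_samples_match_mock(self):
        self.assertEqual(load_transcripts(self._tmp_dir.name), self.expected)
```

Because everything lives on local disk, the test runs in milliseconds and is immune to download flakiness.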

TorchVision Approach

  • All tests live in the test_datasets.py file. This file is very long (~1300 lines) and a little hard to read compared to separating the tests for each dataset into its own file
  • Testing whether dataset URLs are available and download correctly (link)
  • CocoDetectionTestCase for the COCO dataset extends the DatasetTestCase base class (link)
    • DatasetTestCase is the abstract base class for all dataset test cases and expects child classes to overwrite class attributes such as DATASET_CLASS and FEATURE_TYPES (link)
  • Here are all the tests from DatasetTestCase that get run for each dataset (link)
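
A condensed sketch of the TorchVision-style base class is below. The attribute name DATASET_CLASS follows the description above; everything else (the hook name, the loader, the concrete subclass) is illustrative, not torchvision's actual code.

```python
# Sketch of an abstract dataset-test base class plus one concrete subclass.
import os
import tempfile
import unittest

class DatasetTestCase(unittest.TestCase):
    """Abstract base: subclasses set DATASET_CLASS and inject mock data."""
    DATASET_CLASS = None  # callable(root) -> list of samples

    def inject_fake_data(self, root):
        """Overridden by subclasses; returns the expected number of samples."""
        raise NotImplementedError

    def test_num_examples(self):
        if self.DATASET_CLASS is None:  # the abstract base skips itself
            self.skipTest("abstract base class")
        with tempfile.TemporaryDirectory() as root:
            expected = self.inject_fake_data(root)
            dataset = list(self.DATASET_CLASS(root))
            self.assertEqual(len(dataset), expected)

def tiny_csv_dataset(root):
    """Hypothetical dataset loader used by the concrete test below."""
    with open(os.path.join(root, "data.csv"), encoding="utf-8") as f:
        return [line.rstrip("\n").split(",", 1) for line in f]

class TinyCsvDatasetTestCase(DatasetTestCase):
    DATASET_CLASS = staticmethod(tiny_csv_dataset)

    def inject_fake_data(self, root):
        rows = ["1,good", "2,bad"]
        with open(os.path.join(root, "data.csv"), "w", encoding="utf-8") as f:
            f.write("\n".join(rows) + "\n")
        return len(rows)
```

The payoff of this design is that shared checks like test_num_examples are written once on the base class and run automatically against every dataset that subclasses it.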
Nayef211 commented Jan 7, 2022
I plan to pick up SQuAD1, SQuAD2, and SST2 tests next!

@VirgileHlav

I have pending feature branches for EnWik9, AmazonReviewFull, DBpedia, YelpReviewPolarity, YelpReviewFull, UDPOS and CoNLL2000Chunking. I plan to pick up WikiText103 and WikiText2 next!


parmeet commented Jan 29, 2022

Will give Multi30K a try.


erip commented Jan 30, 2022

I can pick up the IWSLTs


Nayef211 commented Feb 4, 2022

Planning on picking up YahooAnswers, SogouNews, and PennTreebank next!


parmeet commented Feb 4, 2022

Planning on picking-up cc-100.

@Nayef211

Just a quick update: this issue can be closed once #1608 is merged in. All other dataset testing is complete.


Nayef211 commented Mar 9, 2022

Thanks @parmeet, @abhinavarora, @erip, and @VirgileHlav for all your help with designing, implementing, and iterating on the mock dataset tests. I'm going to go ahead and close this now that all tasks within the backlog are complete!
