
Revamp TorchText Dataset Testing Strategy #1493

Closed
27 tasks done
Nayef211 opened this issue Jan 7, 2022 · 9 comments
Labels
datasets enhancement need discussions The issues that need more discussions internally and OSS testing

Comments

Nayef211 commented Jan 7, 2022

🚀 Feature

Revamp our dataset testing strategy to reduce the amount of time spent waiting on tests to complete before merging PRs in torchtext.

Motivation

TorchText dataset testing currently relies on downloading and caching the datasets daily and then running CircleCI tests on the cached data. This can be slow and unreliable for the first PR that kicks off the dataset download and caching. In addition, dataset extraction can be time-consuming for some of the larger datasets within torchtext, and this extraction process occurs each time the dataset tests are run on a PR. For these reasons, tests on CircleCI can take up to an hour to run for each PR, whereas vision/audio tests run in mere minutes. We want to revamp our dataset testing strategy in order to reduce the amount of time we spend waiting on tests to complete before merging our PRs in torchtext.

Pitch
We need to update the legacy dataset tests within torchtext. Currently we test for things including:

  • URL link
  • MD5 hash of the entire dataset
  • dataset name
  • number of lines in dataset

Going forward it doesn't make sense to test the MD5 hash or the number of lines in the dataset. Instead we will:

  • Use mocking to test the implementation of our dataset
  • Use smoke tests for URLs and integrity of data (potentially with GitHub Actions)
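
The URL smoke-test idea can be sketched roughly as follows. The URL table, function name, and injectable `opener` parameter here are hypothetical illustrations, not torchtext's actual code; the opener is injectable so the check itself can be exercised without network access.

```python
# Sketch of a URL smoke test: issue a HEAD request per dataset URL and
# report whether the server answers. The URLs below are placeholders.
from urllib.request import Request, urlopen

DATASET_URLS = {
    "AmazonReviewPolarity": "https://example.com/amazon_review_polarity.tar.gz",
}

def url_is_reachable(url, opener=urlopen, timeout=10):
    """Return True if a HEAD request to `url` gets a 2xx/3xx response."""
    req = Request(url, method="HEAD")
    try:
        with opener(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        return False
```

A scheduled GitHub Actions job could loop over `DATASET_URLS` with a check like this and fail loudly when a mirror goes down, without any PR-blocking download.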

Backlog of Dataset Tests

Contributing

We have already implemented a dataset test for AmazonReviewPolarity (#1532) as an example to follow when implementing future dataset tests. Please leave a message below if you plan to work on a particular dataset test to avoid duplication of effort. Also please link to the corresponding PRs.

Follow-Up Items

Additional Context

We explored how other domains implement dataset testing and summarize our findings below. We will implement our new testing strategy by taking inspiration from TorchAudio and TorchVision.

Possible Approaches

  • Download and cache the dataset daily before running tests (current state of testing)
  • Create mock data for each dataset (used by torchaudio, and torchvision)
    • Requires us to understand the structure of the datasets before creating tests
  • Store a small portion of the dataset (10 lines) in an assets folder
    • Might run into legal problems since we aren't allowed to host datasets

TorchAudio Approach

  • Each test lives in its own file
  • Plan to add integration tests in the future to check dataset URLs
  • Each test class extends TestBaseMixin and PytorchTestCase (link)
    • The TestBaseMixin base class provides a consistent way to define device-/dtype-/backend-aware TestCases
  • Each test file contains a get_mock_dataset() method which is responsible for creating the mocked data and saving it to a file in a temp dir (link)
    • This method gets called in the setUp classmethod within each test class
  • The actual test method creates a dataset from the mocked dataset file and tests it
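
The TorchAudio-style pattern might look roughly like this sketch. The file format, loader, and all names are hypothetical stand-ins: get_mock_dataset() writes fake data into a temp dir once, in setUpClass, and the test loads it back instead of downloading anything.

```python
# Sketch of the TorchAudio-style mock-data pattern (illustrative names).
import os
import tempfile
import unittest

def get_mock_dataset(root):
    """Create mocked asset files under `root`; return the expected samples."""
    samples = [("utt0", "hello world"), ("utt1", "mock data")]
    with open(os.path.join(root, "transcripts.txt"), "w", encoding="utf-8") as f:
        for utt_id, text in samples:
            f.write(f"{utt_id}|{text}\n")
    return samples

def load_transcripts(root):
    """Stand-in for the dataset implementation under test."""
    with open(os.path.join(root, "transcripts.txt"), encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("|", 1)) for line in f]

class MockDatasetTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Mock data is created once per test class, not per test method.
        cls._tmp_dir = tempfile.TemporaryDirectory()
        cls.expected = get_mock_dataset(cls._tmp_dir.name)

    @classmethod
    def tearDownClass(cls):
        cls._tmp_dir.cleanup()

    def test_samples_match_mock(self):
        self.assertEqual(load_transcripts(self._tmp_dir.name), self.expected)
```

Because everything lives on local disk, the test runs in milliseconds and is immune to download flakiness.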

TorchVision Approach

  • All tests live in the test_datasets.py file. This file is very long (~1300 lines) and a little hard to read compared to separating the tests for each dataset into its own file
  • Testing whether dataset URLs are available and download correctly (link)
  • CocoDetectionTestCase for the COCO dataset extends the DatasetTestCase base class (link)
    • DatasetTestCase is the abstract base class for all dataset test cases and expects child classes to overwrite class attributes such as DATASET_CLASS and FEATURE_TYPES (link)
  • Here are all the tests from DatasetTestCase that get run for each dataset (link)
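
A condensed sketch of the TorchVision-style base class is below. The attribute name DATASET_CLASS follows the description above; everything else (the hook name, the loader, the concrete subclass) is illustrative, not torchvision's actual code.

```python
# Sketch of an abstract dataset-test base class plus one concrete subclass.
import os
import tempfile
import unittest

class DatasetTestCase(unittest.TestCase):
    """Abstract base: subclasses set DATASET_CLASS and inject mock data."""
    DATASET_CLASS = None  # callable(root) -> list of samples

    def inject_fake_data(self, root):
        """Overridden by subclasses; returns the expected number of samples."""
        raise NotImplementedError

    def test_num_examples(self):
        if self.DATASET_CLASS is None:  # the abstract base skips itself
            self.skipTest("abstract base class")
        with tempfile.TemporaryDirectory() as root:
            expected = self.inject_fake_data(root)
            dataset = list(self.DATASET_CLASS(root))
            self.assertEqual(len(dataset), expected)

def tiny_csv_dataset(root):
    """Hypothetical dataset loader used by the concrete test below."""
    with open(os.path.join(root, "data.csv"), encoding="utf-8") as f:
        return [line.rstrip("\n").split(",", 1) for line in f]

class TinyCsvDatasetTestCase(DatasetTestCase):
    DATASET_CLASS = staticmethod(tiny_csv_dataset)

    def inject_fake_data(self, root):
        rows = ["1,good", "2,bad"]
        with open(os.path.join(root, "data.csv"), "w", encoding="utf-8") as f:
            f.write("\n".join(rows) + "\n")
        return len(rows)
```

The payoff of this design is that shared checks like test_num_examples are written once on the base class and run automatically against every dataset that subclasses it.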
Nayef211 commented Jan 7, 2022
I plan to pick up SQuAD1, SQuAD2, and SST2 tests next!

@VirgileHlav

I have pending feature branches for EnWik9, AmazonReviewFull, DBpedia, YelpReviewPolarity, YelpReviewFull, UDPOS and CoNLL2000Chunking. I plan to pick up WikiText103 and WikiText2 next!


parmeet commented Jan 29, 2022

Will give Multi30K a try.


erip commented Jan 30, 2022

I can pick up the IWSLTs


Nayef211 commented Feb 4, 2022

Planning on picking up YahooAnswers, SogouNews, and PennTreebank next!


parmeet commented Feb 4, 2022

Planning on picking-up cc-100.

@Nayef211

Just a quick update: this issue can be closed once #1608 is merged in. All other dataset testing is complete.


Nayef211 commented Mar 9, 2022

Thanks @parmeet, @abhinavarora, @erip, and @VirgileHlav for all your help with designing, implementing, and iterating on the mock dataset tests. I'm going to go ahead and close this now that all tasks within the backlog are complete!
