IWSLT testing to start from compressed file #1596

Merged: 19 commits merged into pytorch:main on Feb 10, 2022

Conversation

@parmeet (Contributor) commented Feb 9, 2022

Original Credit: @erip

This PR makes some changes on top of #1585 to fix the failing tests. The main changes (a rough sketch of the overall shape follows the list):

  • Removal of the uncompressed directories, to make sure the dataset is actually doing the extraction
  • Removal of the code block that conditionally writes uncleaned files
  • Creation of a new base directory for every parameterized test call, to make sure no artifacts are left over from other tests
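
As context for the first bullet, here is a rough, self-contained sketch of the "test from a compressed file" idea (the file names, archive name, and helper are illustrative, not the actual test code):

```python
import os
import shutil
import tarfile


def build_mock_archive(root_dir: str) -> str:
    """Write mock split files into a staging dir, pack them into a .tgz,
    then remove the uncompressed staging dir so the dataset code is forced
    to perform the extraction itself."""
    staging_dir = os.path.join(root_dir, "temp_dataset_dir")
    os.makedirs(staging_dir, exist_ok=True)

    # Illustrative mock content; the real test writes IWSLT-style split files.
    for name in ("train.de-en.de", "train.de-en.en"):
        with open(os.path.join(staging_dir, name), "w", encoding="utf-8") as f:
            f.write("mock line\n")

    archive_path = os.path.join(root_dir, "2016-01.tgz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(staging_dir, arcname=os.path.basename(staging_dir))

    # No uncompressed copy is left behind, so a passing test really exercised
    # the extraction path rather than silently reading pre-existing files.
    shutil.rmtree(staging_dir)
    return archive_path
```

Which staging directories end up deleted versus kept for consistency is discussed in the review comments below.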

@parmeet (Contributor, Author) commented Feb 9, 2022

@erip, I hope the changes in the PR make sense. Let me know if you catch anything unusual or unexpected.

@parmeet (Contributor, Author) commented Feb 9, 2022

Oh, actually, wait. I think I still haven't implemented it correctly. The issue is the temporary directory and what we are compressing into it. Let's not review it yet, @Nayef211, @erip.

@parmeet (Contributor, Author) commented Feb 9, 2022

OK, I think we are good to go for review, @Nayef211.

@codecov (bot) commented Feb 9, 2022

Codecov Report

Merging #1596 (f66ae99) into main (da34de2) will increase coverage by 0.62%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1596      +/-   ##
==========================================
+ Coverage   80.34%   80.96%   +0.62%     
==========================================
  Files          58       58              
  Lines        2569     2569              
==========================================
+ Hits         2064     2080      +16     
+ Misses        505      489      -16     
Impacted Files                      Coverage Δ
torchtext/data/datasets_utils.py    64.74% <0.00%> (+5.75%) ⬆️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da34de2...f66ae99.

@@ -150,7 +152,7 @@ def tearDownClass(cls):
     ])
     def test_iwslt2016(self, split, src, tgt, dev_set, test_set):
 
-        root_dir = self.get_base_temp_dir()
+        root_dir = tempfile.TemporaryDirectory().name
Contributor:

qq, why is this used in test_iwslt2016, but not in test_iwslt2016_split_argument?

Contributor (Author):

That's a good catch. I should use unique base directories in test_iwslt2016_split_argument as well; let me push this change.

To answer why we need unique base directories for every test: keeping the same base directory seemed to lead to test failures that I never fully got to the bottom of. My suspicion is that because we write and extract a partial dataset (we only write the files requested by the test) into the same base directory on every parameterized call, we end up with odd conflicts between the caching and extraction of the dataset. To avoid that situation, it's best to generate a new base directory for every parameterized test. That said, I also see the need for a better explanation of what's really going on under the hood.
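
Concretely, the pattern looks something like this (a minimal, self-contained sketch, not the exact test code; the language pairs and directory layout are illustrative):

```python
import os
import tempfile
import unittest

from parameterized import parameterized  # same helper the torchtext tests use


class TestIWSLTMockSketch(unittest.TestCase):
    @parameterized.expand([("de", "en"), ("en", "de"), ("fr", "en")])
    def test_fresh_root_per_call(self, src, tgt):
        # Each parameterized call gets its own root, so archives and extracted
        # files written for one (src, tgt) pair can never be mistaken for a
        # cached copy on the next pair's run.
        with tempfile.TemporaryDirectory() as root_dir:
            pair_dir = os.path.join(root_dir, f"{src}-{tgt}")
            os.makedirs(pair_dir)
            # ... write/extract the mock dataset for this pair under pair_dir ...
            self.assertEqual(os.listdir(root_dir), [f"{src}-{tgt}"])
        # On exit the whole root is removed, leaving no artifacts behind.
        self.assertFalse(os.path.exists(root_dir))
```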

Contributor:

Can we also use a context manager here to ensure that the temp directory gets cleaned up? Usually this would be handled by the TempDirMixin class, but this test is a special case.
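
Something along these lines, as a minimal sketch of the context-manager form (the marker file is just for illustration, not part of the test):

```python
import os
import tempfile

# The directory is created on entering the `with` block and removed on exit,
# even if an assertion inside the block fails, so no per-test artifacts pile up.
with tempfile.TemporaryDirectory() as root_dir:
    marker = os.path.join(root_dir, "marker.txt")
    with open(marker, "w") as f:
        f.write("scratch data for one parameterized test call\n")
    assert os.path.exists(marker)

# Everything under root_dir is gone at this point.
assert not os.path.exists(root_dir)
```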

Contributor (Author):

good catch, let me do that!

@@ -165,8 +167,12 @@ def test_iwslt2016(self, split, src, tgt, dev_set, test_set):
     @parameterized.expand(["train", "valid", "test"])
     def test_iwslt2016_split_argument(self, split):
         root_dir = self.get_base_temp_dir()
Contributor:

If you are planning on removing the call to get_base_temp_dir, can we modify this test class to stop extending the TempDirMixin class?

@@ -78,7 +79,8 @@ def _get_mock_dataset(root_dir, split, src, tgt, valid_set, test_set):
     """
 
     base_dir = os.path.join(root_dir, DATASET_NAME)
     outer_temp_dataset_dir = os.path.join(base_dir, f"2016-01/texts/{src}/{tgt}/")
+    temp_dataset_dir = os.path.join(base_dir, 'temp_dataset_dir')
Contributor:

This addition of temp_dataset_dir is actually not necessary anymore now that you're deleting this directory at the end of the function. If the directory weren't being deleted, it would be relevant, since otherwise the caching logic would skip extracting the files (because the files already exist within this dir). We can still keep it for consistency with the other tests.
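
To spell out the caching behavior being described, here is a schematic sketch (not the torchtext implementation; the function and file names are illustrative):

```python
import os
import tarfile


def extract_if_needed(archive_path: str, target_dir: str, expected_file: str) -> str:
    """Schematic cache check: if the expected output already exists on disk,
    extraction is skipped entirely. This is why leftover uncompressed files
    from the mock setup can hide whether extraction really works."""
    out_path = os.path.join(target_dir, expected_file)
    if os.path.exists(out_path):
        return out_path  # cache hit: the archive is never touched
    with tarfile.open(archive_path, "r") as tar:
        tar.extractall(target_dir)
    return out_path
```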

Contributor (Author):

That's a good point. Let me see where to remove the redundancy.

Contributor (Author):

OK, so I think let's keep it consistent with the other dataset tests. I just removed the deletion of temp_dataset_dir.

Contributor:

Sure, that makes sense to me!

@Nayef211 (Contributor) left a comment:

LGTM! Thanks for helping fix the caching issue for the dataset!

@erip (Contributor) commented Feb 10, 2022

This also looks good to me. Thanks for getting it over the finish line!

@parmeet merged commit 7b7a90d into pytorch:main on Feb 10, 2022
@parmeet deleted the erip_1585 branch on February 10, 2022 at 14:40
@Nayef211 mentioned this pull request on Feb 10, 2022