IWSLT testing to start from compressed file #1596

Merged: 19 commits merged into pytorch:main on Feb 10, 2022

Conversation

@parmeet (Contributor) commented Feb 9, 2022

Original Credit: @erip

This PR makes some changes on top of #1585 to fix the failing tests. The main changes (a rough sketch of the overall shape follows the list):

  • Removal of the uncompressed directories, to make sure the dataset is actually doing the extraction
  • Removal of the code block that conditionally writes uncleaned files
  • Creation of a new base directory for every parameterized test call, to make sure no artifacts are left over from other tests
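
As context for the first bullet, here is a rough, self-contained sketch of the "test from a compressed file" idea (the file names, archive name, and helper are illustrative, not the actual test code):

```python
import os
import shutil
import tarfile


def build_mock_archive(root_dir: str) -> str:
    """Write mock split files into a staging dir, pack them into a .tgz,
    then remove the uncompressed staging dir so the dataset code is forced
    to perform the extraction itself."""
    staging_dir = os.path.join(root_dir, "temp_dataset_dir")
    os.makedirs(staging_dir, exist_ok=True)

    # Illustrative mock content; the real test writes IWSLT-style split files.
    for name in ("train.de-en.de", "train.de-en.en"):
        with open(os.path.join(staging_dir, name), "w", encoding="utf-8") as f:
            f.write("mock line\n")

    archive_path = os.path.join(root_dir, "2016-01.tgz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(staging_dir, arcname=os.path.basename(staging_dir))

    # No uncompressed copy is left behind, so a passing test really exercised
    # the extraction path rather than silently reading pre-existing files.
    shutil.rmtree(staging_dir)
    return archive_path
```

Which staging directories end up deleted versus kept for consistency is discussed in the review comments below.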

@parmeet (Contributor, Author) commented Feb 9, 2022

@erip, I hope the changes in the PR make sense. Let me know if you catch anything unusual or unexpected.

@parmeet (Contributor, Author) commented Feb 9, 2022

Oh, actually, wait. I think I still haven't implemented it correctly. The issue is the temporary directory and what we are compressing into it. Let's not review it yet, @Nayef211, @erip.

@parmeet (Contributor, Author) commented Feb 9, 2022

OK, I think we are good to go for review, @Nayef211.

@codecov (bot) commented Feb 9, 2022

Codecov Report

Merging #1596 (f66ae99) into main (da34de2) will increase coverage by 0.62%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1596      +/-   ##
==========================================
+ Coverage   80.34%   80.96%   +0.62%     
==========================================
  Files          58       58              
  Lines        2569     2569              
==========================================
+ Hits         2064     2080      +16     
+ Misses        505      489      -16     
Impacted Files                      Coverage Δ
torchtext/data/datasets_utils.py    64.74% <0.00%> (+5.75%) ⬆️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da34de2...f66ae99.

@@ -150,7 +152,7 @@ def tearDownClass(cls):
     ])
     def test_iwslt2016(self, split, src, tgt, dev_set, test_set):
 
-        root_dir = self.get_base_temp_dir()
+        root_dir = tempfile.TemporaryDirectory().name
Contributor:

qq, why is this used in test_iwslt2016, but not in test_iwslt2016_split_argument?

Contributor (Author):

That's a good catch. I should use unique base directories in test_iwslt2016_split_argument as well; let me push this change.

To answer why we need unique base directories for every test: keeping the same base directory seemed to lead to test failures that I never fully got to the bottom of. My suspicion is that because we write and extract a partial dataset (we only write the files requested by the test) into the same base directory on every parameterized call, we end up with odd conflicts between the caching and extraction of the dataset. To avoid that situation, it's best to generate a new base directory for every parameterized test. That said, I also see the need for a better explanation of what's really going on under the hood.
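
Concretely, the pattern looks something like this (a minimal, self-contained sketch, not the exact test code; the language pairs and directory layout are illustrative):

```python
import os
import tempfile
import unittest

from parameterized import parameterized  # same helper the torchtext tests use


class TestIWSLTMockSketch(unittest.TestCase):
    @parameterized.expand([("de", "en"), ("en", "de"), ("fr", "en")])
    def test_fresh_root_per_call(self, src, tgt):
        # Each parameterized call gets its own root, so archives and extracted
        # files written for one (src, tgt) pair can never be mistaken for a
        # cached copy on the next pair's run.
        with tempfile.TemporaryDirectory() as root_dir:
            pair_dir = os.path.join(root_dir, f"{src}-{tgt}")
            os.makedirs(pair_dir)
            # ... write/extract the mock dataset for this pair under pair_dir ...
            self.assertEqual(os.listdir(root_dir), [f"{src}-{tgt}"])
        # On exit the whole root is removed, leaving no artifacts behind.
        self.assertFalse(os.path.exists(root_dir))
```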

Contributor:

Can we also use a context manager here to ensure that the temp directory gets cleaned up? Usually this would be handled by the TempDirMixin class, but this test is a special case.
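
Something along these lines, as a minimal sketch of the context-manager form (the marker file is just for illustration, not part of the test):

```python
import os
import tempfile

# The directory is created on entering the `with` block and removed on exit,
# even if an assertion inside the block fails, so no per-test artifacts pile up.
with tempfile.TemporaryDirectory() as root_dir:
    marker = os.path.join(root_dir, "marker.txt")
    with open(marker, "w") as f:
        f.write("scratch data for one parameterized test call\n")
    assert os.path.exists(marker)

# Everything under root_dir is gone at this point.
assert not os.path.exists(root_dir)
```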

Contributor (Author):

good catch, let me do that!

@@ -165,8 +167,12 @@ def test_iwslt2016(self, split, src, tgt, dev_set, test_set):
     @parameterized.expand(["train", "valid", "test"])
     def test_iwslt2016_split_argument(self, split):
         root_dir = self.get_base_temp_dir()
Contributor:

If you are planning on removing the call to get_base_temp_dir, can we modify this test class to stop extending the TempDirMixin class?

@@ -78,7 +79,8 @@ def _get_mock_dataset(root_dir, split, src, tgt, valid_set, test_set):
     """
 
     base_dir = os.path.join(root_dir, DATASET_NAME)
     outer_temp_dataset_dir = os.path.join(base_dir, f"2016-01/texts/{src}/{tgt}/")
+    temp_dataset_dir = os.path.join(base_dir, 'temp_dataset_dir')
Contributor:

This addition of temp_dataset_dir is actually not necessary anymore now that you're deleting this directory at the end of the function. If the directory weren't being deleted, it would be relevant, since otherwise the caching logic would skip extracting the files (because the files already exist within this dir). We can still keep it for consistency with the other tests.
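
To spell out the caching behavior being described, here is a schematic sketch (not the torchtext implementation; the function and file names are illustrative):

```python
import os
import tarfile


def extract_if_needed(archive_path: str, target_dir: str, expected_file: str) -> str:
    """Schematic cache check: if the expected output already exists on disk,
    extraction is skipped entirely. This is why leftover uncompressed files
    from the mock setup can hide whether extraction really works."""
    out_path = os.path.join(target_dir, expected_file)
    if os.path.exists(out_path):
        return out_path  # cache hit: the archive is never touched
    with tarfile.open(archive_path, "r") as tar:
        tar.extractall(target_dir)
    return out_path
```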

Contributor (Author):

That's a good point. Let me see where to remove the redundancy.

Contributor (Author):

OK, so I think let's keep it consistent with the other dataset tests. I just removed the deletion of temp_dataset_dir.

Contributor:

Sure, that makes sense to me!

@Nayef211 (Contributor) left a comment:

LGTM! Thanks for helping fix the caching issue for the dataset!

@erip (Contributor) commented Feb 10, 2022

This also looks good to me. Thanks for getting it over the finish line!

@parmeet merged commit 7b7a90d into pytorch:main on Feb 10, 2022
@parmeet deleted the erip_1585 branch on February 10, 2022 at 14:40
@Nayef211 mentioned this pull request on Feb 10, 2022