New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IWSLT testing to start from compressed file #1596
Conversation
…xpectations match.
@erip hope the changes in the PR make sense. Let me know if you catch anything unusual or out of your expectation. |
OK, I think we are good to go for review @Nayef211 |
Codecov Report
@@ Coverage Diff @@
## main #1596 +/- ##
==========================================
+ Coverage 80.34% 80.96% +0.62%
==========================================
Files 58 58
Lines 2569 2569
==========================================
+ Hits 2064 2080 +16
+ Misses 505 489 -16
Continue to review full report at Codecov.
|
test/datasets/test_iwslt2016.py
Outdated
@@ -150,7 +152,7 @@ def tearDownClass(cls): | |||
]) | |||
def test_iwslt2016(self, split, src, tgt, dev_set, test_set): | |||
|
|||
root_dir = self.get_base_temp_dir() | |||
root_dir = tempfile.TemporaryDirectory().name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qq, why is this used in test_iwslt2016
, but not in test_iwslt2016_split_argument
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good catch. I should use unique base directories in test_iwslt2016_split_argument
as well, let me push this change.
To answer why we need unique base directories with every test: It seems keeping same base directory lead to tests failure that I didn't completely get grasp of. My suspicious is since we are writing and extracting partial (we only write files that are requested in test) dataset in the same base directory with every parameterized test, this is leading to some weird conflicts in caching and extraction of dataset. So to avoid this situation, it's best to generate new base directory for every parameterized test. Although, I also do see need for better explanation of what's really going on under the hood.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also use a context manager here to ensure that the temp directory get's cleaned up? Usually this would be handled by the TempDirMixin
class but this test is a special case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, let me do that!
test/datasets/test_iwslt2016.py
Outdated
@@ -150,7 +152,7 @@ def tearDownClass(cls): | |||
]) | |||
def test_iwslt2016(self, split, src, tgt, dev_set, test_set): | |||
|
|||
root_dir = self.get_base_temp_dir() | |||
root_dir = tempfile.TemporaryDirectory().name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also use a context manager here to ensure that the temp directory get's cleaned up? Usually this would be handled by the TempDirMixin
class but this test is a special case.
test/datasets/test_iwslt2016.py
Outdated
@@ -165,8 +167,12 @@ def test_iwslt2016(self, split, src, tgt, dev_set, test_set): | |||
@parameterized.expand(["train", "valid", "test"]) | |||
def test_iwslt2016_split_argument(self, split): | |||
root_dir = self.get_base_temp_dir() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are planning on removing the call to get_base_temp_dir
, can we modify this test class to stop extending the TempDirMixin
class?
@@ -78,7 +79,8 @@ def _get_mock_dataset(root_dir, split, src, tgt, valid_set, test_set): | |||
""" | |||
|
|||
base_dir = os.path.join(root_dir, DATASET_NAME) | |||
outer_temp_dataset_dir = os.path.join(base_dir, f"2016-01/texts/{src}/{tgt}/") | |||
temp_dataset_dir = os.path.join(base_dir, 'temp_dataset_dir') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This addition of temp_dataset_dir
is actually not necessary anymore now that you're deleting this directory at the end of this function. If the directory wasn't being deleted, this would be relevant since otherwise the caching logic would skip extracting the files (since the files exist within this dir). We can still keep this for consistency with the other tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a good point. let me see where to remove the redundancy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so I think let's keep it consistent with other datasets test. I just removed the deletion of temp_datset_dir.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure that makes sense to me!
This reverts commit 7e93369.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for helping fix the caching issue for the dataset!
@@ -78,7 +79,8 @@ def _get_mock_dataset(root_dir, split, src, tgt, valid_set, test_set): | |||
""" | |||
|
|||
base_dir = os.path.join(root_dir, DATASET_NAME) | |||
outer_temp_dataset_dir = os.path.join(base_dir, f"2016-01/texts/{src}/{tgt}/") | |||
temp_dataset_dir = os.path.join(base_dir, 'temp_dataset_dir') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure that makes sense to me!
This also looks good to me. Thanks for getting it over the finish line! |
Original Credit: @erip
This PR did some changes on top of #1585 to fix failing tests. The main changes include: