
close streams in prototype datasets #6647

Merged
pmeier merged 5 commits into pytorch:main on Oct 6, 2022

Conversation

@pmeier (Collaborator) commented Sep 26, 2022

Redo of #6128 with the changes I suggested there:

  • Instead of closing every single EncodedImage.from_file(buffer); buffer.close() individually, I moved the closing into EncodedImage.from_file (sketched below). So far we don't have a case where we need to read from the same stream twice. Even if we do at some point, e.g. for optical flow (although we are probably better off with caching a single image), we can always introduce a keyword close: bool = True to override the behavior.
  • The patch above accounts for the largest chunk of changes in Closing streams to avoid testing issues #6128. A few datasets handle other files as well, and I closed those directly. Plus, I also added a .close() call to read_mat.
  • I've added a test that checks that a dataset doesn't leave unclosed streams after a full iteration.
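A minimal sketch of the first bullet, assuming a file-like input. EncodedImage.from_file is the real entry point, but this body and the close keyword are illustrative, not the actual torchvision implementation:

    from typing import BinaryIO

    class EncodedImage:
        def __init__(self, data: bytes) -> None:
            self.data = data

        @classmethod
        def from_file(cls, file: BinaryIO, *, close: bool = True) -> "EncodedImage":
            data = file.read()
            if close:
                # Close the stream here once, instead of at every call site.
                file.close()
            return cls(data)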

from collections import deque

def consume(iterator):
    deque(iterator, maxlen=0)  # exhaust the iterator without storing items

def next_consume(iterator):
    item = next(iterator)  # keep only the first sample
    consume(iterator)  # drain the rest so no streams are left dangling
    return item
pmeier (Collaborator, Author):

This is useful since we often just want the first sample, but still need to make sure to consume the rest to avoid dangling streams. list(iterator) would also do the trick, but it keeps everything in memory for no reason.

test/test_prototype_datasets_builtin.py
def test_no_simple_tensors(self, dataset_mock, config):
    dataset, _ = dataset_mock.load(config)

    simple_tensors = {key for key, value in next_consume(iter(dataset)).items() if features.is_simple_tensor(value)}
pmeier (Collaborator, Author):

This is a drive-by since I was already touching the line: the term "vanilla" tensor is no longer used. In the prototype transforms we use "simple tensor" now and also have features.is_simple_tensor to check for them.
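For illustration, a hedged example of the distinction. features.is_simple_tensor is named in the comment above; the Label usage is an assumption about the prototype features namespace:

    import torch
    from torchvision.prototype import features

    # A plain tensor is a "simple tensor" ...
    assert features.is_simple_tensor(torch.rand(3, 16, 16))
    # ... while a feature subclass such as features.Label is not.
    assert not features.is_simple_tensor(features.Label(0))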

Comment on lines 106 to 110
split_dp, to_be_closed_dp = (
(extra_split_dp, split_dp) if self._split == "train_noval" else (split_dp, extra_split_dp)
)
for _, file in to_be_closed_dp:
    file.close()
pmeier (Collaborator, Author):

This is somewhat problematic in the split == "train_noval" case. We want to close all file handles that are coming from the split_dp. Unfortunately, split_dp comes from a Demultiplexer. Thus, by fully iterating over it here, we are loading everything into the demux buffer.

Is there an idiom to "mark" a datapipe to be closed at runtime even if we don't return the datapipe? The only thing I came up with is changing the classifier function of the Demultiplexer to drop the samples that would go into split_dp if split == "train_noval".
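For context, a hedged sketch of the buffering behavior described above, using torchdata's demux with made-up toy data:

    from torchdata.datapipes.iter import IterableWrapper

    source = IterableWrapper(range(6))
    evens, odds = source.demux(num_instances=2, classifier_fn=lambda x: x % 2)

    # Draining one branch forces the Demultiplexer to buffer every sample
    # that belongs to the still-unread `odds` branch.
    list(evens)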

@VitalyFedyunin replied:

It is incorrect to consume a DataPipe at construction time.

        for _, file in to_be_closed_dp:
            file.close()

It will not affect the executed graph.

Ideally, we want to have something like dp.close(), which will effectively remove dangling pieces of the graph.

But for now you can either use a trick like split_dp = split_dp.concat(to_be_closed_dp.filter(lambda x: False)) or do code branching before Demux and avoid creating dangling pieces.
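A hedged, self-contained version of that trick; the datapipes here are stand-ins for the real split_dp / to_be_closed_dp:

    from torchdata.datapipes.iter import IterableWrapper

    split_dp = IterableWrapper(["a", "b"])           # branch we keep
    to_be_closed_dp = IterableWrapper(["c", "d"])    # branch that would dangle

    # The filter drops every sample, so nothing extra is yielded, but the
    # branch stays attached to the graph and is cleaned up with it.
    split_dp = split_dp.concat(to_be_closed_dp.filter(lambda x: False))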

pmeier (Collaborator, Author):

It is incorrect to consume a DataPipe at construction time.

I don't think that is fully correct for our case. At construction, we load a couple of files and weave them together into one datapipe. These files are always loaded, but for some configurations not all of them are needed. So we should be able to simply close them during the construction of the dataset datapipe, since they will never make it into the graph, correct?

I agree we shouldn't do this for datapipes that stem from a Demultiplexer if the other parts make it into the final graph.

pmeier (Collaborator, Author):

do code branching before Demux and avoid creating dangling pieces.

If possible, I think that is the better solution, since the other one still iterates over all items even though the loop should effectively be done. I guess that could be irritating as well.

I implemented branching in afb0ec2. PTAL
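A hedged sketch of the branching approach; the condition and datapipes are placeholders for the actual SBD logic in afb0ec2:

    from torchdata.datapipes.iter import IterableWrapper

    samples = IterableWrapper([("train/a", 0), ("noval/b", 1)])
    need_both_branches = False  # stand-in for `self._split == "train_noval"`

    if need_both_branches:
        split_dp, extra_split_dp = samples.demux(num_instances=2, classifier_fn=lambda s: s[1])
    else:
        # No Demultiplexer at all, hence no dangling branch left to close.
        split_dp = samples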

@VitalyFedyunin left a comment:

Thank you for taking this over!

        drop_none=True,
    )
else:
    archive_dp = resource_dps[0]

@VitalyFedyunin:

What will happen with resource_dps[1] in this case? Is it disconnected from the graph, or does it remain unconsumed?

pmeier (Collaborator, Author):

If self._split != "train_noval", we have only one element in resource_dps. Meaning, it will not be loaded at all and thus also does not need to be closed.
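A hedged sketch of how a dataset can request the second resource only when it is needed; the function and resource names are illustrative, not the real SBD code:

    def resources_for_split(split: str) -> list:
        # Request the extra split file only when it is actually used, so
        # resource_dps never contains an unconsumed entry.
        resources = ["archive.tgz"]
        if split == "train_noval":
            resources.append("extra_split.txt")
        return resources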

@pmeier merged commit 7eb5d7f into pytorch:main on Oct 6, 2022
facebook-github-bot pushed a commit that referenced this pull request Oct 17, 2022
Summary:
* close streams in prototype datasets

* refactor prototype SBD to avoid closing demux streams at construction time

* mypy

Reviewed By: NicolasHug

Differential Revision: D40427477

fbshipit-source-id: 854554f283ff281f8c9eb0e2786644116a4b4dd8