[DataLoader] `getitems` added to description of Dataset API and better supported within `Subset` #100375

stsouko · 2023-05-01T13:42:39Z

DataLoader supports batched loading from map-style datasets.

The fetcher auto-detects whether a dataset supports batched loading as follows:

torch.utils.data._utils.fetch._MapDatasetFetcher

class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]

The description of the Dataset API now documents this feature.

Additionally, the Subset dataset now supports __getitems__ if the parent dataset supports it.
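To make the dispatch concrete, here is a minimal sketch that runs without torch. `SquaresDataset`, `PlainDataset` and the free function `fetch` are hypothetical stand-ins written for illustration; only the `hasattr(...) and ...` condition is taken from the fetcher excerpt above.

```python
class PlainDataset:
    """Hypothetical map-style dataset that only supports single-index access."""

    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return idx * idx


class SquaresDataset(PlainDataset):
    """Hypothetical map-style dataset that also supports batched access."""

    def __init__(self):
        self.batched_calls = 0  # counts how often the batched path is taken

    def __getitems__(self, indices):
        # One call serves the whole batch of indices.
        self.batched_calls += 1
        return [i * i for i in indices]


def fetch(dataset, possibly_batched_index):
    # Stand-in for the auto-collation branch quoted above (assumption:
    # collation and the non-batched branch are left out of this sketch).
    if hasattr(dataset, "__getitems__") and dataset.__getitems__:
        return dataset.__getitems__(possibly_batched_index)
    return [dataset[idx] for idx in possibly_batched_index]


batched_ds = SquaresDataset()
batched_result = fetch(batched_ds, [1, 3, 5])
plain_result = fetch(PlainDataset(), [1, 3, 5])
```

Both calls return the same samples; the difference is that the batched dataset is asked once for the whole index list instead of once per index.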

pytorch-bot · 2023-05-01T13:42:42Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/100375


✅ No Failures

As of commit 954fbd6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

NivekT · 2023-05-01T16:50:12Z

@stsouko Can you circle back and address @ejguan's comment here?

torch/utils/data/dataset.py

stsouko · 2023-05-01T20:06:05Z

> @stsouko Can you circle back and address @ejguan's comment here?

Opened a PR with the fix: #100409

Skylion007 · 2023-05-02T13:44:28Z

torch/utils/data/dataset.py

@@ -291,6 +298,11 @@ def __init__(self, dataset: Dataset[T_co], indices: Sequence[int]) -> None:
        self.dataset = dataset
        self.indices = indices

+        # add batched sampling support when parent dataset supports it.
+        # see torch.utils.data._utils.fetch._MapDatasetFetcher
+        if getattr(dataset, "__getitems__", None):


Nit: if you don't want to rely on None evaluating as falsy, you could also write this:

Suggested change

-        if getattr(dataset, "__getitems__", None):
+        if getattr(dataset, "__getitems__", False):

I think Python does have a proper way to do it:

Suggested change

-        if getattr(dataset, "__getitems__", None):
+        if hasattr(dataset, "__getitems__"):

@ejguan This is a bit different semantically: the original check would still evaluate to false if __getitems__ were set to False (or another falsy value), while hasattr would return True in that case.

The code that was originally in this PR was hasattr(dataset, "__getitems__") and dataset.__getitems__, which caused mypy to complain, unlike this version.

Wait, in what case would __getitems__ be an attribute rather than a method? And in that case, why not hasattr(dataset, "__getitems__") and callable(dataset.__getitems__)?

@ejguan That's a double lookup on __getitems__; it is better to do callable(getattr(dataset, "__getitems__", None)), which is more efficient. I am not sure in what case __getitems__ could become a non-callable attribute, but it is better to code defensively here.

Suggested change

-        if getattr(dataset, "__getitems__", None):
+        if callable(getattr(dataset, "__getitems__", None)):

Originally I kept the same behavior as torch.utils.data._utils.fetch._MapDatasetFetcher:

class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:

Both call sites should probably be changed, to be honest.

It would probably be better to do the same for the fetcher, but that is out of scope for this PR.
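The checks discussed in this thread differ only on edge cases. The sketch below compares three of them on hypothetical classes invented for illustration (`NoBatch`, `SetToNone`, `SetToString`, `Batched`); none of these classes come from the PR.

```python
class NoBatch:
    """No __getitems__ at all."""


class SetToNone:
    """__getitems__ explicitly disabled with a falsy value."""
    __getitems__ = None


class SetToString:
    """__getitems__ is a truthy but non-callable attribute."""
    __getitems__ = "yes"


class Batched:
    """__getitems__ is a real method."""

    def __getitems__(self, indices):
        return list(indices)


def truthy_getattr(ds):
    # The version under review: relies on the attribute being truthy.
    return bool(getattr(ds, "__getitems__", None))


def plain_hasattr(ds):
    # Only asks whether the attribute exists.
    return hasattr(ds, "__getitems__")


def callable_getattr(ds):
    # The defensive version: a single lookup that also requires a callable.
    return callable(getattr(ds, "__getitems__", None))


results = {
    type(ds).__name__: (truthy_getattr(ds), plain_hasattr(ds), callable_getattr(ds))
    for ds in (NoBatch(), SetToNone(), SetToString(), Batched())
}
```

Only the `callable(getattr(...))` form rejects every case in which calling `__getitems__` would fail: `hasattr` accepts the `None` attribute, and the truthiness check accepts the string.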

ejguan · 2023-05-02T13:51:23Z

torch/utils/data/dataset.py

@@ -299,6 +311,9 @@ def __getitem__(self, idx):
    def __len__(self):
        return len(self.indices)

+    def _getitems(self, idx):


Why not just add __getitems__ and raise Error if self.dataset doesn't have __getitems__ implemented?

Because torch.utils.data._utils.fetch._MapDatasetFetcher only checks that the attribute exists and is truthy, so it would treat Subset as supporting batched loading even when the parent dataset does not (a false positive):

class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]

Then we can do a similar thing in Subset.__getitems__: call the parent's __getitems__ if it is available, and otherwise call self.dataset[idx] for each idx in possibly_batched_index.
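A rough sketch of that suggestion, written without torch so that it runs standalone. `SubsetSketch`, `Letters` and `BatchedLetters` are hypothetical names, and the code is an illustration of the idea rather than the diff that was merged. It assumes, as in the hunk above, that the subset stores `self.dataset` and `self.indices` and that subset positions are translated to parent indices before the parent is queried.

```python
class SubsetSketch:
    """Hypothetical stand-in for Subset with an unconditional __getitems__."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __getitem__(self, idx):
        return self.dataset[self.indices[idx]]

    def __len__(self):
        return len(self.indices)

    def __getitems__(self, batch):
        # Translate positions in the subset into indices of the parent.
        parent_indices = [self.indices[i] for i in batch]
        parent_getitems = getattr(self.dataset, "__getitems__", None)
        if callable(parent_getitems):
            # The parent supports batched loading: one call for the batch.
            return parent_getitems(parent_indices)
        # Fallback: one __getitem__ call per sample.
        return [self.dataset[i] for i in parent_indices]


class Letters:
    """Hypothetical parent dataset without batched loading."""

    def __init__(self):
        self.data = list("abcdefgh")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class BatchedLetters(Letters):
    """Hypothetical parent dataset with batched loading."""

    def __init__(self):
        super().__init__()
        self.batched_calls = 0

    def __getitems__(self, indices):
        self.batched_calls += 1
        return [self.data[i] for i in indices]


parent = BatchedLetters()
plain_batch = SubsetSketch(Letters(), [6, 4, 2, 0]).__getitems__([0, 3])
batched_batch = SubsetSketch(parent, [6, 4, 2, 0]).__getitems__([0, 3])
```

With this shape the fetcher's existing check always takes the batched path for the subset, and the subset itself decides whether the parent can serve the batch in one call.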

ejguan

The change looks good to me now. I am importing to internal to see if this PR breaks any internal system.

facebook-github-bot · 2023-05-03T15:59:59Z

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

NivekT

LGTM!

ejguan · 2023-05-03T20:28:58Z

@pytorchbot merge -r

pytorchmergebot · 2023-05-03T20:30:49Z

@pytorchbot successfully started a rebase job. Check the current status here

Subset dataset now supports __getitems__.

pytorchmergebot · 2023-05-03T20:30:55Z

Successfully rebased getitems onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout getitems && git pull --rebase)

pytorchmergebot · 2023-05-03T20:32:05Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2023-05-03T20:32:07Z

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

Details for Dev Infra team

Raised by workflow job

stsouko · 2023-05-04T18:08:03Z

@ejguan, could you check the test pipeline?

ejguan · 2023-05-04T18:16:26Z

@pytorchbot merge

pytorchmergebot · 2023-05-04T18:18:28Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


pytorchmergebot · 2023-05-04T18:18:32Z

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check


facebook-github-bot · 2023-05-04T18:51:21Z

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ejguan · 2023-05-04T18:55:45Z

It seems something is wrong with the CI. I will land it from internal. Thanks.

ejguan · 2023-05-05T15:50:29Z

@pytorchbot merge

pytorchmergebot · 2023-05-05T15:52:24Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


…etter supported within `Subset` (pytorch#100375)

DataLoader supports batched loading from Mapped Datasets.

This is the fetcher's implementation of auto-detection of batch loading support.

torch.utils.data._utils.fetch._MapDatasetFetcher
```
class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]
```

Description of Dataset API now shows this feature.

Additionally, Subset dataset now supports `__getitems__` if parent dataset supports it.

Pull Request resolved: pytorch#100375
Approved by: https://github.com/ejguan, https://github.com/NivekT

stsouko requested review from NivekT and ejguan as code owners May 1, 2023 13:42

pytorch-bot bot added the release notes: dataloader release notes category label May 1, 2023

pytorchbot added the open source label May 1, 2023

Skylion007 reviewed May 1, 2023


stsouko force-pushed the getitems branch from 1acf130 to f4906d6 Compare May 1, 2023 20:12

Skylion007 reviewed May 2, 2023


ejguan reviewed May 2, 2023


stsouko requested a review from ejguan May 3, 2023 15:50

ejguan approved these changes May 3, 2023


NivekT added the topic: improvements topic category label May 3, 2023

NivekT changed the title ~~__getitems__ added to description of Dataset API.~~ [DataLoader] __getitems__ added to description of Dataset API and better supported within Subset May 3, 2023

NivekT approved these changes May 3, 2023


pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 3, 2023

stsouko added 7 commits May 3, 2023 20:30

__getitems__ added to description of Dataset API.

0bfbcfe

Subset dataset now supports __getitems__.

wording fixed.

d0dbbb9

fixes Subset

601e43b

fix linter

a88c64b

fixed indent

868d547

refactored

fc72e22

checking reimplemented

954fbd6

pytorchmergebot force-pushed the getitems branch from 6708295 to 954fbd6 Compare May 3, 2023 20:30

pytorchmergebot added the merging label May 3, 2023

pytorchmergebot removed the merging label May 3, 2023

pytorchmergebot added the merging label May 4, 2023

pytorchmergebot removed the merging label May 4, 2023

pytorchmergebot added the merging label May 5, 2023

pytorchmergebot added Merged and removed merging labels May 5, 2023

pytorchmergebot closed this in a2e81a8 May 5, 2023

stsouko deleted the getitems branch May 20, 2023 12:23
