Extend ConcatDataset to return dataset index #32034
Comments
Thanks for opening the issue! I think it would be more valuable, instead of adding a new constructor argument to the dataset, to add an extra method that computes the dataset index from a sample index. Thoughts?
I cannot find any reference to that method outside the class. The way I intend to use this feature is something like:

```python
data = torch.utils.data.ConcatDataset((dataset_A, dataset_B))
loader = torch.utils.data.DataLoader(data, batch_size=10, shuffle=True)
features, labels, dataset_idx = next(iter(loader))
```

By adopting your proposal, I can't see how I would be able to use a plain `DataLoader` this way.
Yes, you can do something like:

```python
class MyConcatDataset(ConcatDataset):
    def __getitem__(self, idx):
        data = super().__getitem__(idx)
        dataset_idx = self.get_dataset_idx(idx)
        return data, dataset_idx
```

which is trivially done if such a method is available. Plus, it lets you easily configure how / what to return, which is not possible with your method.
Ah, you meant implementing it in such a way that the user would still need to override `__getitem__`. I definitely see the merits of restructuring the function in the way you propose, as it will be very useful in any case. Adding the extra argument I recommended on top might make the code harder to understand, so unless there are more use cases for it, I'd skip it for now. Should I go ahead and open a PR with this minor restructuring?
Yes, I think restructuring the `__getitem__` method sounds good.
I have now implemented the change and tried to send a PR, but I don't have the rights to push to this repo. I was also wondering about the recommended policy for contributions: should I push the change from my own fork, or push my branch directly to this repo? I couldn't find anything about this in the contribution guide, and as a newbie to open-source contributing I don't know the standard process.
@ATriantafyllopoulos I typically push changes to my own fork and then open a PR from it. Are you still working on this? If not, I can pick up where you left off.
Thanks for bumping this; I procrastinated on it far longer than necessary. The PR is now open: #39052
@fmassa @ATriantafyllopoulos I have a somewhat related question. Suppose I have N datasets whose tensors differ in shape. How would I go about enabling the above, while at the same time ensuring each process receives data exclusive to it (for which one typically uses a `DistributedSampler`)? My current data loader setup fails because some of my datasets have differing shapes, so their samples cannot be collated into a single batch.
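One way to keep each batch's shape consistent while still sharding data across processes is to build a separate loader, with its own `DistributedSampler`, per dataset, and then alternate between the loaders in the training loop. A sketch under that assumption (`make_loaders` is a hypothetical helper, not a PyTorch API; passing `num_replicas` and `rank` explicitly avoids needing an initialized process group for a local test):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def make_loaders(datasets, rank, world_size, batch_size):
    # One loader per dataset, so every batch is drawn from a single
    # dataset and has a consistent shape; DistributedSampler keeps the
    # per-process shards disjoint.
    loaders = []
    for ds in datasets:
        sampler = DistributedSampler(ds, num_replicas=world_size, rank=rank)
        loaders.append(DataLoader(ds, batch_size=batch_size, sampler=sampler))
    return loaders
```

The training loop can then round-robin (or sample) among the loaders, tagging each batch with the index of the loader it came from, which recovers the `dataset_idx` behavior discussed above without ever collating mismatched shapes.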
Is there an anticipated update to address this issue? I am convinced that integrating diverse data types plays a crucial role in enhancing current deep learning methodologies. |
🚀 Feature
This feature would extend `torch.utils.data.ConcatDataset` to return the data set index from which a sample originated.

Motivation
This would be useful when the user wants to keep track of which data set each sample comes from.
The use case I have in mind is multi-task learning when using data sets that contain mutually exclusive labels (e.g. dataset A contains labels for Task A, and dataset B contains labels for task B) as opposed to all labels being present in a single data set.
In this case, it is still possible to do multi-task learning by pooling examples from different sources, provided one keeps track of which source each example in a batch came from, which is the motivation behind this request.
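For illustration, here is a sketch of how a per-sample `dataset_idx` could route a mixed batch to task-specific losses in the multi-task setting described above (all names here, such as `multitask_loss`, are hypothetical, and cross-entropy is just a stand-in for each task's loss):

```python
import torch
import torch.nn.functional as F


def multitask_loss(outputs_a, outputs_b, labels, dataset_idx):
    # dataset_idx marks the source of each sample in the mixed batch:
    # 0 -> dataset A (Task A labels), 1 -> dataset B (Task B labels).
    mask_a = dataset_idx == 0
    mask_b = dataset_idx == 1
    loss = outputs_a.new_zeros(())
    if mask_a.any():
        # Task A head is only supervised on samples from dataset A.
        loss = loss + F.cross_entropy(outputs_a[mask_a], labels[mask_a])
    if mask_b.any():
        # Task B head is only supervised on samples from dataset B.
        loss = loss + F.cross_entropy(outputs_b[mask_b], labels[mask_b])
    return loss
```

Without the dataset index, there is no way to tell which head's labels a given sample carries, which is exactly the bookkeeping problem this feature request addresses.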
Pitch

The feature would consist of two changes:

1. Add a `return_index` argument to the class (defaulting to `False`)
2. Add `dataset_idx` to the return value if `return_index` is set to `True`

(pytorch/torch/utils/data/dataset.py, line 207 in 8ea49e7)
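The pitched `return_index` variant might look like the following sketch. This is hypothetical — the actual `ConcatDataset` constructor takes only `datasets` — but it shows how the flag would interact with the `cumulative_sizes` attribute the class already keeps:

```python
import bisect

from torch.utils.data import ConcatDataset


class ConcatDatasetWithIndex(ConcatDataset):
    """Hypothetical sketch of the pitched `return_index` argument."""

    def __init__(self, datasets, return_index=False):
        super().__init__(datasets)
        self.return_index = return_index

    def __getitem__(self, idx):
        sample = super().__getitem__(idx)
        if not self.return_index:
            # Default behavior is unchanged.
            return sample
        if idx < 0:
            idx += len(self)
        # Binary search over the running dataset lengths recovers
        # which member dataset the flat index falls into.
        dataset_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        return sample, dataset_idx
```

With `return_index=True`, each item becomes a `(sample, dataset_idx)` pair, which the default collate function would batch into samples plus a tensor of dataset indices, matching the usage shown earlier in the thread.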
cc @ssnl