Speed up data loading for `TensorDataset` if the underlying dataset supports indexing by a list of indices #4959

Open
colinfang opened this issue Jan 31, 2018 · 3 comments

@colinfang commented Jan 31, 2018

Looking at

batch = self.collate_fn([self.dataset[i] for i in indices])

The loader fetches one row of the dataset at a time and then combines the rows into a minibatch. This is quite inefficient if the underlying dataset already supports indexing by a list of indices.

If each row holds a lot of data, e.g. images, this is relatively OK, since the per-batch processing (inference + back-propagation on the GPU) takes more time than the loading. However, if the feature dimension is small, say < 100, data loading becomes the bottleneck.

An alternative is to do the following

# batch = self.collate_fn([self.dataset[i] for i in indices])
batch = self.dataset[indices]
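For reference, a minimal sketch against the current TensorDataset API, showing that indexing with a tensor of indices already returns a whole minibatch in one call (shapes are only illustrative):

import torch
from torch.utils.data import TensorDataset

X = torch.randn(1000, 16)               # small feature dimension
y = torch.randint(0, 2, (1000,))
ds = TensorDataset(X, y)

# TensorDataset forwards the index to its tensors, so a LongTensor of indices
# performs fancy indexing and returns the whole batch in a single call.
indices = torch.tensor([3, 7, 42, 99])
xb, yb = ds[indices]
print(xb.shape, yb.shape)                # torch.Size([4, 16]) torch.Size([4])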

I applied a monkey patch for my specific problem.

# Assumes pre-0.4 PyTorch internals, where these names live in torch.utils.data.dataloader.
from torch.utils.data.dataloader import DataLoaderIter, pin_memory_batch

def data_loader_next(self):
    if self.num_workers == 0:  # same-process loading
        indices = next(self.sample_iter)  # may raise StopIteration
        # I know that my dataset supports index by indices.
        # -- batch = self.collate_fn([self.dataset[i] for i in indices])
        batch = self.dataset[indices]
        if self.pin_memory:
            batch = pin_memory_batch(batch)
        return batch

    # check if the next sample has already been generated
    if self.rcvd_idx in self.reorder_dict:
        batch = self.reorder_dict.pop(self.rcvd_idx)
        return self._process_next_batch(batch)

    if self.batches_outstanding == 0:
        self._shutdown_workers()
        raise StopIteration

    while True:
        assert (not self.shutdown and self.batches_outstanding > 0)
        idx, batch = self.data_queue.get()
        self.batches_outstanding -= 1
        if idx != self.rcvd_idx:
            # store out-of-order samples
            self.reorder_dict[idx] = batch
            continue
        return self._process_next_batch(batch)

DataLoaderIter.next = data_loader_next
DataLoaderIter.__next__ = data_loader_next

It speeds up data loading by 5x when using num_workers=0.
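The comparison can also be reproduced outside of the DataLoader internals with a small timing script; the dataset shape, batch size, and repeat count below are only illustrative:

import timeit
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.dataloader import default_collate

X, y = torch.randn(100000, 32), torch.randn(100000)
ds = TensorDataset(X, y)
indices = torch.randint(0, len(ds), (256,))

# What the loader does today: one __getitem__ call per row, then collate.
per_row = lambda: default_collate([ds[int(i)] for i in indices])
# The proposed path: a single fancy-indexing call on the underlying tensors.
batched = lambda: ds[indices]

print("per-row + collate:", timeit.timeit(per_row, number=100))
print("batched indexing :", timeit.timeit(batched, number=100))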

cc @SsnL @VitalyFedyunin @ngimel @mruberry

@bheinzerling (Contributor) commented Jan 31, 2018

DataLoader is very convenient and offers lots of functionality, but in simple single-process cases I found it much faster to not use DataLoader at all. Just tensorize your data and do the batching yourself, something like:

import torch

def batch_iter(X, y, batch_size=64):
    """
    X: feature tensor (shape: num_instances x num_features)
    y: target tensor (shape: num_instances)
    """
    idxs = torch.randperm(X.size(0))
    if X.is_cuda:
        idxs = idxs.cuda()
    for batch_idxs in idxs.split(batch_size):
        yield X[batch_idxs], y[batch_idxs]
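In a training loop this would be used roughly as follows; the data, model, loss, and optimizer here are placeholders:

import torch
import torch.nn as nn

X = torch.randn(10000, 50)                    # illustrative data
y = torch.randn(10000)
model = nn.Linear(50, 1)                      # illustrative model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in batch_iter(X, y, batch_size=64):
        optimizer.zero_grad()
        loss = loss_fn(model(xb).squeeze(1), yb)
        loss.backward()
        optimizer.step()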
@fmassa (Member) commented Feb 1, 2018

If your dataset already supports indexing by lists and performs the collate_fn internally, one possibility is to let the user write their own Sampler that returns a tuple of indices at a time.
So the __iter__ method would look something like

def __iter__(self):
    # assuming the sampler stores its batch size as self.batch_size
    return iter(torch.randperm(len(self.data_source)).long().split(self.batch_size))

And then you can just set the batch size in the DataLoader to 1.
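A fuller sketch of that idea follows; the class name BatchIndexSampler is made up for illustration, and the unwrapping collate_fn is needed because with batch_size=1 the default collate would otherwise add an extra leading dimension:

import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset

class BatchIndexSampler(Sampler):
    """Yields chunks of shuffled indices instead of single indices."""
    def __init__(self, data_source, batch_size):
        self.data_source = data_source
        self.batch_size = batch_size

    def __iter__(self):
        return iter(torch.randperm(len(self.data_source)).split(self.batch_size))

    def __len__(self):
        return (len(self.data_source) + self.batch_size - 1) // self.batch_size

X, y = torch.randn(1000, 8), torch.randn(1000)
ds = TensorDataset(X, y)                       # supports indexing by a tensor of indices
loader = DataLoader(
    ds,
    sampler=BatchIndexSampler(ds, batch_size=64),
    batch_size=1,                              # each "sample" is already a full batch
    collate_fn=lambda batches: batches[0],     # unwrap the singleton list
)

for xb, yb in loader:
    pass                                       # xb: (64, 8), yb: (64,) for full batches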
What do you think?

@mboratko commented Oct 12, 2018

I think a better architecture would be for the collate_fn functionality to be attached to the Dataset object itself, like so:

from torch.utils.data.dataloader import default_collate  # its location in 2018-era PyTorch

class Dataset(object):
    def __getitem__(self, index):
        if type(index) is int:
            raise NotImplementedError
        else:
            requested_elements = [self[i] for i in index]
            return default_collate(requested_elements)

and the call within the DataLoader should be replaced as @colinfang described:

# batch = self.collate_fn([self.dataset[i] for i in indices])
batch = self.dataset[indices]

This isn't completely backwards-compatible with the current API, but it's a simple fix. Existing implementations which already work can just change their overridden __getitem__ method to

if type(index) is int:
    (... existing implementation ...)
else:
    return super().__getitem__(index)
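As a toy illustration of that pattern (assuming the modified Dataset base class sketched above; the class and data are made up):

import torch

class MyTabularDataset(Dataset):               # the modified base class from above
    def __init__(self, features, targets):
        self.features = features
        self.targets = targets

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        if type(index) is int:
            return self.features[index], self.targets[index]
        # defer list/tensor indices to the base class, which collates per item
        return super().__getitem__(index)

ds = MyTabularDataset(torch.randn(100, 8), torch.randn(100))
row_x, row_y = ds[3]                           # single row
xb, yb = ds[[0, 5, 9]]                         # base class collates three rows
print(row_x.shape, xb.shape, yb.shape)         # torch.Size([8]) torch.Size([3, 8]) torch.Size([3])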

I agree that tensorizing the data and using it directly is a workaround for now; however, there are also libraries built around being fed DataLoader objects, and this solution doesn't work in that case.
