
ray.torch.train.prepare_data_loader #38115

Closed · MMorente opened this issue Aug 4, 2023 · 6 comments

Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), train (Ray Train Related Issue)

Comments


MMorente commented Aug 4, 2023

What happened + What you expected to happen

I have a local Ray cluster and I am using PyTorch DataLoaders. As described in the docs, I use ray.train.torch.prepare_data_loader to prepare the dataloader for Ray. This works well when each worker uses only one GPU, but if I specify more than one GPU per worker in the ScalingConfig, prepare_data_loader fails with:

device = get_device() 
self._auto_transfer = auto_transfer if device.type == "cuda" else False
AttributeError: 'list' object has no attribute 'type'
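
For illustration, the failure can be reproduced in isolation (the device list below is my assumption about what get_device() returns when a worker holds two GPUs):

import torch

# Assumed return value of ray.train.torch.get_device() for a worker with two GPUs:
device = [torch.device("cuda:0"), torch.device("cuda:1")]

# Equivalent of the failing line above; raises AttributeError: 'list' object has no attribute 'type'
auto_transfer = True if device.type == "cuda" else False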

My cluster has two nodes and each one has 4 GPUs.

Versions / Dependencies

Ray: 2.6.1.
Python: 3.9.5
OS: Unix

Reproduction script

from torch.utils.data import DataLoader, Dataset
import numpy as np

import ray
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

class TestDataset(Dataset):
    def __init__(self):
        super().__init__()
        self.data = np.random.random_sample((10, 1))

    def __len__(self):
        # Map-style datasets should define __len__ so the default samplers can use them.
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]

def train_func(config: dict) -> None:
    dataset = TestDataset()
    dataloader = DataLoader(dataset)
    # Fails here when the worker is assigned more than one GPU.
    dataloader = ray.train.torch.prepare_data_loader(dataloader)
    print("done")

if __name__ == "__main__":
    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        scaling_config=ScalingConfig(num_workers=1, use_gpu=True, resources_per_worker={"GPU":2})
    )
    trainer.fit()
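
A possible interim workaround (a sketch only, which I have not verified; it assumes the failing device wrapping is applied only when move_to_device=True) is to skip the automatic device transfer and move batches manually:

def train_func_workaround(config: dict) -> None:
    dataset = TestDataset()
    dataloader = DataLoader(dataset)
    # Skip Ray's device-moving wrapper, which is where the AttributeError is raised.
    dataloader = ray.train.torch.prepare_data_loader(dataloader, move_to_device=False)

    device = ray.train.torch.get_device()
    if isinstance(device, list):
        # Assumption: with multiple GPUs per worker, get_device() returns a list; use the first entry.
        device = device[0]

    for batch in dataloader:
        batch = batch.to(device)  # move each batch to the chosen device manually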


Issue Severity

High: It blocks me from completing my task.

MMorente added the bug and triage labels on Aug 4, 2023.
xwjiang2010 added the P2, train, air, and P1 labels and removed the triage and P2 labels on Aug 4, 2023.
xwjiang2010 (Contributor) commented

I verified that the bug exists on 2.6.1, and probably on master as well. We didn't do a good job of testing the multiple-GPUs-per-worker case.

The following needs to happen:

  • Crystallize the get_device semantics. Across multiple versions of this file, I see the semantics change between returning a list and returning a single device.
  • Update the get_device docstring to reflect that clearly.
  • Update the logic in WrappedDataLoader to accommodate the fact that the device can be a list.

cc @matthewdeng for triaging visibility. Marking this as P1 for now.

@MMorente would you like to contribute a PR to fix this? I think the fix can be as simple as

device = get_device()
if isinstance(device, list):
    device = device[0]
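
For illustration, the same guard as a standalone sketch (the list input is hypothetical, standing in for what get_device() returns on a multi-GPU worker; _normalize_device is not an existing Ray helper):

import torch

def _normalize_device(device):
    # get_device() may return a list when a worker holds multiple GPUs;
    # fall back to the first entry so downstream .type checks keep working.
    if isinstance(device, list):
        device = device[0]
    return device

assert _normalize_device([torch.device("cuda:0"), torch.device("cuda:1")]).type == "cuda"
assert _normalize_device(torch.device("cpu")).type == "cpu"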

matthewdeng (Contributor) commented

@xwjiang2010 can you update the docs to reflect the behavior?

MMorente (Author) commented Aug 4, 2023

> @MMorente would you like to contribute a PR to fix this?

@xwjiang2010 I have just created a pull request #38127 with the change.

anyscalesam (Collaborator) commented

@matthewdeng can you re-review and advise on next steps?

matthewdeng (Contributor) commented

This will be fixed by #42314, cc @woshiyyya to double check.

woshiyyya (Member) commented

Yes, this will be fixed by #42314.
