
[Train] Trainer fails on multi-GPU cluster #30063

Closed
victor7246 opened this issue Nov 7, 2022 · 2 comments
Labels
@author-action-required: The PR author is responsible for the next step. Remove tag to send back to the reviewer.
needs-repro-script: Issue needs a runnable script to be reproduced.
train: Ray Train related issue.
triage: Needs triage (eg: priority, bug/not-bug, and owning component).

Comments

victor7246 commented Nov 7, 2022

What happened + What you expected to happen

Trainer(backend='torch') fails with Ray 2.0 when running on multiple GPU nodes. We are running the Trainer on 3 A100 worker nodes, each with 4 GPUs.

(BackendExecutor pid=5631) RuntimeError: Expected tensor for 'out' to have the same device as tensor for argument #2 'mat1'; but device 1 does not equal 3 (while checking arguments for addmm)
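The error indicates that, on at least one worker, the model weights and the input batch ended up on different CUDA devices. A minimal debugging sketch (not part of the original report) to confirm where each lives on a given worker could be:

def log_devices(model, batch):
    # Device holding the (possibly DDP-wrapped) model parameters on this worker.
    param_device = next(model.parameters()).device
    # Device holding the current input batch after prepare_data_loader().
    print(f"model on {param_device}, batch on {batch.device}")

Calling log_devices(model, X) at the top of the training loop in the script below would show whether prepare_model() and prepare_data_loader() placed the model and the data on the same local GPU.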

Versions / Dependencies

Ray 2.0

Reproduction script

import argparse
from typing import Dict
from ray.air import session

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

import ray.train as train
from ray.train import Trainer
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

training_data = datasets.FashionMNIST(
    root="~/data",
    train=True,
    download=True,
    transform=ToTensor(),
)

test_data = datasets.FashionMNIST(
    root="~/data",
    train=False,
    download=True,
    transform=ToTensor(),
)


class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Linear(512, 10),
            nn.BatchNorm1d(10),
            nn.ReLU(),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits


def train_epoch(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset) // session.get_world_size()
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        pred = model(X)
        loss = loss_fn(pred, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def validate_epoch(dataloader, model, loss_fn):
    size = len(dataloader.dataset) // session.get_world_size()
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(
        f"Test Error: \n "
        f"Accuracy: {(100 * correct):>0.1f}%, "
        f"Avg loss: {test_loss:>8f} \n"
    )
    return test_loss


def train_func(config: Dict):
    batch_size = config["batch_size"]
    lr = config["lr"]
    epochs = config["epochs"]

    worker_batch_size = batch_size // session.get_world_size()

    train_dataloader = DataLoader(training_data, batch_size=worker_batch_size)
    test_dataloader = DataLoader(test_data, batch_size=worker_batch_size)

    train_dataloader = train.torch.prepare_data_loader(train_dataloader)
    test_dataloader = train.torch.prepare_data_loader(test_dataloader)

    model = NeuralNetwork()
    model = train.torch.prepare_model(model)

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    loss_results = []

    for _ in range(epochs):
        train_epoch(train_dataloader, model, loss_fn, optimizer)
        loss = validate_epoch(test_dataloader, model, loss_fn)
        loss_results.append(loss)
        session.report(dict(loss=loss))

    return loss_results


def train_fashion_mnist(num_workers=2, use_gpu=False):
    config = {"lr": 1e-3, "batch_size": 64, "epochs": 4}
    trainer = Trainer(backend="torch", num_workers=num_workers, use_gpu=use_gpu)
    trainer.start()

    out = trainer.run(train_func=train_func, config=config)
    trainer.shutdown()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--address", required=False, type=str, help="the address to use for Ray"
    )
    parser.add_argument(
        "--num-workers",
        "-n",
        type=int,
        default=2,
        help="Sets number of workers for training.",
    )
    parser.add_argument(
        "--use-gpu", action="store_true", default=False, help="Enables GPU training"
    )

    args, _ = parser.parse_known_args()

    import ray

    ray.init(address=args.address)
    train_fashion_mnist(num_workers=args.num_workers, use_gpu=args.use_gpu)
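Note that the script imports TorchTrainer and ScalingConfig but never uses them; it goes through the legacy Trainer(backend="torch") API instead. For comparison, a sketch of the same run through the Ray 2.0 AIR entry point (same train_func and config; the worker count chosen here is arbitrary) would look roughly like:

from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={"lr": 1e-3, "batch_size": 64, "epochs": 4},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()

The session.report() and session.get_world_size() calls inside train_func belong to this newer API.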

Issue Severity

High: It blocks me from completing my task.

victor7246 added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage: priority, bug/not-bug, and owning component) labels on Nov 7, 2022.
amogkam (Contributor) commented Nov 7, 2022

@victor7246 can you try with Ray nightly?

Can you also share the full stack trace? And are you manually setting the CUDA_VISIBLE_DEVICES environment variable?
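A minimal way to check what each worker actually sees (a debugging sketch, not from the original thread; it would be called inside train_func) could be:

import os
import ray
import torch

def log_gpu_visibility():
    # GPUs exposed to this worker process.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    # GPU IDs Ray has assigned to this worker.
    print("ray.get_gpu_ids() =", ray.get_gpu_ids())
    # Device PyTorch currently considers active on this worker.
    print("torch.cuda.current_device() =", torch.cuda.current_device())

If CUDA_VISIBLE_DEVICES is being set manually on the nodes, comparing it against the IDs Ray assigns per worker would help narrow down where the device mismatch comes from.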

richardliaw changed the title from "[Core] Trainer fails on multi-GPU cluster" to "[Train] Trainer fails on multi-GPU cluster" on Nov 11, 2022.
xwjiang2010 added the needs-repro-script, train, and air labels and removed the bug label on Nov 14, 2022.
hora-anyscale added the @author-action-required label on Dec 5, 2022.
hora-anyscale (Contributor) commented:
Per Triage Sync: No response from reporter for 30 days, closing.
