
Feat/backend/list devices #1883

Draft · wants to merge 3 commits into main

Conversation

@laggui (Member) commented Jun 12, 2024

Checklist

  • Confirmed that the run-checks all script has been executed.

Changes

Added list_available_devices to the Backend and JitRuntime traits, along with backend-specific implementations

Allows the user to get a list of available devices at runtime, e.g.

use burn::{
    backend::{Autodiff, LibTorch, NdArray, Wgpu},
    tensor::backend::Backend,
};

fn main() {
    type A = NdArray;
    type B = Wgpu;
    type C = LibTorch;
    type D = Autodiff<B>;

    println!("NdArray: {:?}", A::list_available_devices());

    println!("Wgpu: {:?}", B::list_available_devices());

    println!("Tch: {:?}", C::list_available_devices());

    println!("Autodiff<Wgpu>: {:?}", D::list_available_devices());
}
Example output:

NdArray: [Cpu]
Wgpu: [DiscreteGpu(0), IntegratedGpu(0), IntegratedGpu(1)]
Tch: [Cpu, Cuda(0)]
Autodiff<Wgpu>: [DiscreteGpu(0), IntegratedGpu(0), IntegratedGpu(1)]
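
For instance, a caller might use the new method to prefer a discrete GPU when one is reported. The sketch below assumes the method returns a Vec of the backend's Device type (as the output above suggests) and relies on WgpuDevice implementing Default:

use burn::{
    backend::{wgpu::WgpuDevice, Wgpu},
    tensor::backend::Backend,
};

fn main() {
    // Assumes list_available_devices returns Vec<WgpuDevice> for the Wgpu backend.
    let devices = <Wgpu as Backend>::list_available_devices();

    // Prefer a discrete GPU if one is reported, otherwise fall back to the
    // first listed device or the backend's default device.
    let device = devices
        .iter()
        .find(|d| matches!(d, WgpuDevice::DiscreteGpu(_)))
        .or_else(|| devices.first())
        .cloned()
        .unwrap_or_default();

    println!("Selected device: {:?}", device);
}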

@jin-eld commented Jun 12, 2024

Let's move the discussion from Discord to the PR, so that it's easier to track :)

First of all, thank you for reacting so quickly! I was about to ask if I should file a feature request, but you are already working on a PR ))

So, throwing in a few thoughts:

Wgpu: [DiscreteGpu(0), IntegratedGpu(0), IntegratedGpu(1)]
Tch: [Cpu, Cuda(0)]
Autodiff<Wgpu>: [DiscreteGpu(0), IntegratedGpu(0), IntegratedGpu(1)]

Would it be possible to generalize the device types into an enum that could be used for all backends? With the above output, the code that selects the device has to be backend specific, i.e. the GPU appears as Cuda(0) on tch but as DiscreteGpu(0) on Wgpu, and so on. Imho it would be a lot more convenient from a user's perspective if we had a higher-level Burn API which internally maps the devices according to their meaning and uses the same device enum for all backends, so the devices can be checked in the same way regardless of which backend is being used. I'm thinking of how currently (referring to various examples in the repo) one selects the desired backend at the beginning, and the rest of the code does not care much which Backend exactly is being used. It would be awesome if the same approach were possible for querying devices.

You raised the issue that PyTorch reports ROCm devices as CUDA; however, I think that this can be ignored for this particular use case: the intention is to find a GPU suitable for inference, or multiple GPUs suitable for training. Identifying what kind of GPU it really is could perhaps be achieved via a dedicated API querying the device name or device properties (similar to https://pytorch.org/docs/stable/cuda.html) if someone really requires it.

codecov bot commented Jun 13, 2024

Codecov Report

Attention: Patch coverage is 0% with 66 lines in your changes missing coverage. Please review.

Project coverage is 86.08%. Comparing base (671ec8c) to head (64ae3d6).
Report is 1 commit behind head on main.

Files                                  Patch %   Lines
crates/burn-wgpu/src/runtime.rs          0.00%   28 Missing ⚠️
crates/burn-tch/src/backend.rs           0.00%   14 Missing ⚠️
crates/burn-candle/src/backend.rs        0.00%   12 Missing ⚠️
crates/burn-autodiff/src/backend.rs      0.00%    3 Missing ⚠️
crates/burn-fusion/src/backend.rs        0.00%    3 Missing ⚠️
crates/burn-jit/src/backend.rs           0.00%    3 Missing ⚠️
crates/burn-ndarray/src/backend.rs       0.00%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1883      +/-   ##
==========================================
- Coverage   86.11%   86.08%   -0.04%     
==========================================
  Files         777      778       +1     
  Lines       90555    90846     +291     
==========================================
+ Hits        77979    78202     +223     
- Misses      12576    12644      +68     


@laggui (Member, Author) commented Jun 13, 2024

With the discussions we had on Discord it wasn't too difficult to figure out an easy way to draft this :)

I understand your point of view a bit better now. So ideally, if I got your point right, you'd like something like this?

pub enum Device<B: Backend> {
    DiscreteGpu(B::Device),
    IntegratedGpu(B::Device),
    Cpu(B::Device),
}

And this enum could possibly grow with different backends (new variants would need to be added for backends with different devices, e.g. TPU).

The issue I see (and tried to raise on Discord) is that backends are more or less free to categorize their devices as they wish. For example, with wgpu we can differentiate between a discrete and an integrated GPU (plus there are other types, like virtual and "other" detected devices). But if we take torch, its mps device for macOS doesn't differentiate between a discrete, integrated or external GPU, even though Metal supports all of them (as far as I understand). In that case you might think that simply not differentiating between discrete and integrated GPUs could work (i.e. simplifying the enum variants), but then, when you actually have a choice, you would probably prefer a discrete GPU (on wgpu for example). So this would remain a choice at the backend level...
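
Just to make that concrete, a wgpu-side mapping into such an enum could look like the sketch below (the categorize function and its mapping choices are hypothetical, only the WgpuDevice variants are real; the enum is repeated from above so the snippet stands alone):

use burn::{
    backend::{wgpu::WgpuDevice, Wgpu},
    tensor::backend::Backend,
};

pub enum Device<B: Backend> {
    DiscreteGpu(B::Device),
    IntegratedGpu(B::Device),
    Cpu(B::Device),
}

// Hypothetical mapping for the wgpu backend: the adapter type is known, so the
// generalized variants can be filled in directly.
fn categorize(device: WgpuDevice) -> Device<Wgpu> {
    match device {
        WgpuDevice::DiscreteGpu(_) => Device::DiscreteGpu(device),
        WgpuDevice::IntegratedGpu(_) => Device::IntegratedGpu(device),
        // Everything else (virtual, cpu, best available) lands in a coarse
        // bucket here; a backend like tch (mps) could not even make the two
        // distinctions above reliably.
        _ => Device::Cpu(device),
    }
}

fn main() {
    match categorize(WgpuDevice::DiscreteGpu(0)) {
        Device::DiscreteGpu(d) => println!("discrete: {:?}", d),
        Device::IntegratedGpu(d) => println!("integrated: {:?}", d),
        Device::Cpu(d) => println!("other/cpu: {:?}", d),
    }
}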

So I see the motivation but not sure about the usefulness of this abstraction at a higher level 🤔 but I can be convinced otherwise 🙂

@jin-eld commented Jun 13, 2024

I think there are two ways to look at this, plus a possible middle ground. One is as you described in the post above: a detailed enum with precise types. Of course this is not without problems: as you correctly pointed out, some backends simply do not provide the necessary information right away, or may not provide fine-grained information at all.

The other way is to treat this as a convenience API which will be good enough for the majority of standard use cases, and here the PyTorch approach is imho sufficient: usually the most pressing question is whether there is any usable GPU on the system at all, and the wish is to select it instead of the CPU. The next, more advanced case is training and LLM inference, where multiple GPUs can be used at the same time (llama.cpp will even use the CPU in addition to the GPUs). In this situation one usually does not care which GPUs exactly they are: it's a "use all that are there" scenario.

So for the above logic I'd argue that it would even be enough to shrink the enum to the Torch-style interpretation of:

pub enum Device<B: Backend> {
    Gpu(B::Device),
    Cpu(B::Device),
    // perhaps TPU, NPU, other devices that are not GPUs can be added later
}
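
With something like that, the common cases stay trivial on the caller side. A quick sketch (the helpers below are made up and build on the Device enum above, nothing like this exists in Burn):

use burn::tensor::backend::Backend;

// Made-up helpers on top of the simplified enum above.
fn first_gpu<B: Backend>(devices: &[Device<B>]) -> Option<B::Device> {
    devices.iter().find_map(|d| match d {
        Device::Gpu(inner) => Some(inner.clone()),
        _ => None,
    })
}

fn all_gpus<B: Backend>(devices: &[Device<B>]) -> Vec<B::Device> {
    devices
        .iter()
        .filter_map(|d| match d {
            Device::Gpu(inner) => Some(inner.clone()),
            _ => None,
        })
        .collect()
}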

I recognize your point about wanting to prefer a discrete GPU over an integrated one, but I think selecting a specific GPU is a more specialized case; "use these two, but not all" is also something that the user should decide, so such detailed settings are best left to command-line parameters.

There could perhaps still be a way to handle more sophisticated selections for those who require it.

I'd have to fire up my AI server to see what these functions report and how useful the information is, but I am assuming that there will be some more detailed information about the underlying hardware there:

https://pytorch.org/docs/stable/generated/torch.cuda.get_device_properties.html#torch.cuda.get_device_properties
https://pytorch.org/docs/stable/generated/torch.cuda.get_device_capability.html#torch.cuda.get_device_capability

Here, as a middle way, the idea would be, similar to Torch, to have an additional API function which allows querying a specific Device::Gpu(0) and retrieving its properties, for use cases where someone wants to implement a "pick the best device" solution.
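
Roughly something like this (all names below are invented, it is only meant to sketch the idea):

// Invented sketch of a device properties query, in the spirit of
// torch.cuda.get_device_properties; no such API exists in Burn today.
#[derive(Debug, Clone)]
pub struct DeviceProperties {
    pub name: String,
    pub total_memory_bytes: Option<u64>,
    pub is_discrete: Option<bool>,
}

pub trait DeviceInfo {
    type Device;

    /// Returns whatever the backend can report about a device; fields the
    /// backend cannot determine stay `None`.
    fn device_properties(device: &Self::Device) -> DeviceProperties;
}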

To be fair, I did not check how much of the device info/properties all backends expose, or whether this could be generalized at all. Then again, if someone wants such fine-grained control, they could indeed go down to the backend level and use the tch/wgpu/etc. functions directly; the convenience API is what it is: something for conveniently handling the most common use cases.

Again, my argument is for a convenience API which does not have to be that detailed and which covers the most common use case of "give me whatever GPUs there are on the system", regardless of the underlying details. At least this is the main use case that I see when looking at the gazillion AI applications out there. It would be nice if other users shared their view on this: how would you prefer to handle the CPU vs GPU and multi-GPU scenarios in the applications you are developing?

@laggui (Member, Author) commented Jun 13, 2024

Ok, perhaps I misunderstood your initial request 🙂 I thought you wanted something a bit in between, but in this case it's simply about differentiating high-level types.
