Feat/backend/list devices #1883
Conversation
Let's move the discussion from Discord to the PR, so that it's easier to track :) First of all, thank you for reacting so quickly! I was about to ask if I should file a feature request, but you are already working on a PR )) So, throwing in a few thoughts:
Would it be possible to generalize the device types to an enum which could be used for all backends? In the above output, the code that selects the device would have to be backend-specific, i.e. the GPU appears as …

You raised the issue that PyTorch reports ROCm devices as CUDA; however, I think that this can be ignored for this particular use case: the intention is to find a GPU suitable for inference, or multiple GPUs suitable for training. Identifying what kind of GPU it really is could perhaps be achieved via a dedicated API querying the device name or device properties (i.e. similar to https://pytorch.org/docs/stable/cuda.html), if someone really requires it.
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1883      +/-   ##
==========================================
- Coverage   86.11%   86.08%   -0.04%
==========================================
  Files         777      778       +1
  Lines       90555    90846     +291
==========================================
+ Hits        77979    78202    +223
- Misses      12576    12644     +68
```

☔ View full report in Codecov by Sentry.
With the discussions we had on Discord it wasn't too difficult to figure out an easy way to draft this :) I understand your point of view a bit better now, so ideally you'd like something like this, if I got your point right?

```rust
pub enum Device<B: Backend> {
    DiscreteGpu(B::Device),
    IntegratedGpu(B::Device),
    Cpu(B::Device),
}
```

This enum could possibly grow with different backends (new variants would need to be added for backends with different devices, e.g. TPU). The issue I see (and tried to raise on Discord) is that backends are kind of free to categorize their devices as they wish. For example, with wgpu we can differentiate between a discrete and an integrated GPU (plus there are other types, like virtual and "other" detected devices). But if we take torch, their mps device for macOS doesn't differentiate between a discrete, integrated or external GPU, even if Metal supports all of them (as far as I understand). In that case, you could think that simply not differentiating between discrete and integrated GPUs could work (simplifying the enum variants), but then if you actually have a choice you would probably prefer a discrete GPU (on wgpu, for example). This would remain a choice at the backend level then... So I see the motivation, but I'm not sure about the usefulness of this abstraction at a higher level 🤔 but I can be convinced otherwise 🙂
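To make the trade-off concrete, here is a minimal, self-contained sketch of how a caller might use such an enum to prefer a discrete GPU over an integrated one, falling back to the CPU. This is an illustration only: a plain `String` stands in for the backend-associated `B::Device`, and `select_device` is a hypothetical helper, not part of burn.

```rust
// Hypothetical stand-in for the proposed `Device<B>` enum; a String
// replaces `B::Device` so the example is self-contained.
#[derive(Debug, Clone, PartialEq)]
pub enum Device {
    DiscreteGpu(String),
    IntegratedGpu(String),
    Cpu(String),
}

/// Pick the "best" device: prefer a discrete GPU, then an integrated
/// GPU, then the CPU; `None` if the list is empty.
pub fn select_device(devices: &[Device]) -> Option<&Device> {
    devices
        .iter()
        .find(|d| matches!(d, Device::DiscreteGpu(_)))
        .or_else(|| devices.iter().find(|d| matches!(d, Device::IntegratedGpu(_))))
        .or_else(|| devices.iter().find(|d| matches!(d, Device::Cpu(_))))
}
```

With `vec![Cpu, IntegratedGpu, DiscreteGpu]` as input, `select_device` returns the discrete GPU even though it is listed last, which is exactly the preference a backend like wgpu could express but torch's mps device could not.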
I think there are two ways, plus a possible middle ground, on how to look at this.

One is as you described in the above post: having a detailed enum with precise types. Of course, this is not without problems: as you correctly pointed out, some backends simply do not provide the necessary information right away, or may not provide fine-grained information at all.

The other way is to treat this as a convenience API which will be good enough for the majority of standard use cases, and here the PyTorch approach is imho sufficient: usually the most pressing question is whether there is any usable GPU on the system at all, and the wish to select it instead of the CPU. The next more advanced case is training and LLM inference, where multiple GPUs can be used at the same time (llama.cpp will even use the CPU in addition to the GPUs). In this situation one usually does not care which GPUs exactly there are: it's a "use all that are there" scenario. So for the above logic I'd argue that it would even be enough to shrink the enum to the Torch-style interpretation of: …
I recognize your point about wanting to prefer a discrete GPU over an integrated GPU, but I think selection of a specific GPU is a more special case; also, "use these two, but not all" is something that the user should decide, so such detailed settings are left to users via command line parameters.

There could perhaps still be a way to handle more sophisticated selections for those who require it. I'd have to fire up my AI server to see what these functions report and how useful the information is, but I am assuming that there will be some more detailed info about the underlying hardware there: https://pytorch.org/docs/stable/generated/torch.cuda.get_device_properties.html#torch.cuda.get_device_properties

Here, as a middle way, the idea would be, similar to Torch, to have an additional API function which allows querying a specific …

To be fair, I did not check how much of the device info/properties all backends expose, and whether this could be generalized at all. Then again, if someone wants such fine-grained control, they could indeed go down to the backend level and use the tch/wgpu/etc. functions directly; the convenience API is what it is - for conveniently handling the most common use cases.

Again, my argument is for a convenience API which does not have to be that detailed and which covers the most common use case of "give me one or more GPUs that there are on the system", regardless of the underlying details. At least this is the main use case that I see when looking at the gazillion of AI applications that are out there. Would be nice if other users share their view on this: how would you prefer to handle the CPU vs. GPU and multi-GPU scenario in the applications that you are developing?
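The "use all that are there" convenience view argued for above could be sketched like this. All names here (`DeviceKind`, `DeviceInfo`, `devices_for_training`) are assumptions made up for illustration, not burn's actual API; the point is only the selection logic.

```rust
// Torch-style coarse classification: just CPU vs. GPU.
#[derive(Debug, Clone, PartialEq)]
pub enum DeviceKind {
    Cpu,
    Gpu,
}

#[derive(Debug, Clone, PartialEq)]
pub struct DeviceInfo {
    pub kind: DeviceKind,
    pub index: usize,
}

/// "Use all that are there": return every GPU on the system, falling
/// back to the CPU devices only when no GPU is available (the
/// llama.cpp-style multi-GPU scenario described above).
pub fn devices_for_training(all: &[DeviceInfo]) -> Vec<DeviceInfo> {
    let gpus: Vec<DeviceInfo> = all
        .iter()
        .filter(|d| d.kind == DeviceKind::Gpu)
        .cloned()
        .collect();
    if gpus.is_empty() {
        all.iter()
            .filter(|d| d.kind == DeviceKind::Cpu)
            .cloned()
            .collect()
    } else {
        gpus
    }
}
```

Anything finer than this (pinning to specific GPUs, excluding one of several) would then be handled by user-facing settings such as command line parameters, as argued above.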
Ok, perhaps I misunderstood your initial request 🙂 I thought you wanted something a bit in between, but in this case it's simply differentiating high-level types.
Checklist

- [x] `run-checks all` script has been executed.

Changes

- Added `list_available_devices` to the `Backend` and `JitRuntime` traits, along with backend-specific implementations. Allows the user to get a list of available devices at runtime, e.g. …
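As a rough illustration of the shape of such a trait method, here is a self-contained mock. `MockBackend`, its `String` device type, and the returned device names are all assumptions for demonstration; the real burn implementation delegates to each backend's own device enumeration.

```rust
// Hypothetical minimal trait mirroring the described addition: backends
// can enumerate their devices at runtime via an associated function.
pub trait Backend {
    type Device: std::fmt::Debug + PartialEq;

    /// Return every device this backend can currently use.
    fn list_available_devices() -> Vec<Self::Device>;
}

// A mock backend whose "devices" are plain strings (illustration only).
pub struct MockBackend;

impl Backend for MockBackend {
    type Device = String;

    fn list_available_devices() -> Vec<String> {
        vec!["Cuda(0)".into(), "Cuda(1)".into(), "Cpu".into()]
    }
}
```

A user would then call `MockBackend::list_available_devices()` at startup and pick a device from the returned list, instead of hard-coding a device index.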