[tune][rllib] Windows: GPU not recognized #9166

Closed
2 tasks done
juliusfrost opened this issue Jun 27, 2020 · 11 comments · Fixed by #9300
Labels
bug (Something that is supposed to be working; but isn't), P2 (Important issue, but not time-critical), rllib (RLlib related issues), tune (Tune-related issues), windows

Comments

@juliusfrost
Member

juliusfrost commented Jun 27, 2020

What is the problem?

I'm getting ray.tune.error.TuneError: Insufficient cluster resources to launch trial.
I specified a GPU in my config but ray does not recognize my GPU (RTX 2080) and throws an error.
For now, I can get past this by setting num_gpus: 0 in my config.

https://gist.github.com/juliusfrost/fa7ebbb8d1dfc66eea0bbc4babcbe5aa

Reproduction (REQUIRED)

git clone https://github.com/juliusfrost/rllib-tune-atari.git
cd rllib-tune-atari
pip install -r requirements.txt
python train.py --algo a2c
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
@juliusfrost juliusfrost added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 27, 2020
@juliusfrost
Member Author

#9114 @mehrdadn

@juliusfrost
Member Author

@mehrdadn Did you get a chance to look at this? I updated the script to reproduce to remove external dependencies.

@mehrdadn
Contributor

mehrdadn commented Jul 3, 2020

@juliusfrost I don't have the full CUDA setup available to test this right now, but perhaps try running this after import ray and let me know if it changes anything?

import os
import subprocess
import ray.resource_spec
# NOTE: This may not work in Windows 11+; see below for another solution.
# Count the video controllers whose AdapterCompatibility starts with "NVIDIA" (the first output line is the header).
ray.resource_spec._autodetect_num_gpus = lambda: len([line.rstrip() for line in subprocess.check_output([os.path.join(os.environ['SystemRoot'], 'System32', 'wbem', 'WMIC.exe'), "PATH", "Win32_VideoController", "GET", "AdapterCompatibility"], creationflags=subprocess.CREATE_NO_WINDOW).splitlines()[1:] if line.startswith(b"NVIDIA")])

2022 update: If the above doesn't work, you can try using PowerShell (below) instead.
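
For readability, the WMIC one-liner can be unpacked into a small helper. This is only a sketch: the names `count_nvidia_adapters` and `autodetect_num_gpus_wmic` are made up for illustration, and it assumes, as the one-liner does, that WMIC prints a header line followed by one AdapterCompatibility value per line.

```python
import os
import subprocess


def count_nvidia_adapters(wmic_output: bytes) -> int:
    """Count output lines that start with b"NVIDIA", skipping the header line."""
    lines = [line.rstrip() for line in wmic_output.splitlines()[1:]]
    return sum(1 for line in lines if line.startswith(b"NVIDIA"))


def autodetect_num_gpus_wmic() -> int:
    """Ask WMIC for the video controllers (Windows only; WMIC.exe must exist)."""
    wmic = os.path.join(os.environ["SystemRoot"], "System32", "wbem", "WMIC.exe")
    output = subprocess.check_output(
        [wmic, "PATH", "Win32_VideoController", "GET", "AdapterCompatibility"],
        creationflags=subprocess.CREATE_NO_WINDOW,
    )
    return count_nvidia_adapters(output)
```

Assigning `autodetect_num_gpus_wmic` to `ray.resource_spec._autodetect_num_gpus` should then be equivalent to the one-liner. Keeping the parsing separate from the subprocess call makes it possible to check the parsing on captured output without a Windows machine.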

@juliusfrost
Member Author

@mehrdadn Awesome, it worked! Great job 👍

@abhayraw1

I'm still getting this issue. After applying the snippet posted by @mehrdadn, ray.resource_spec._autodetect_num_gpus() returns 1, but ray.get_gpu_ids() still returns an empty list [].

[Image: Screenshot 2020-09-18 144156]

ray version: 0.9.0.dev0
OS: Windows 10

@mehrdadn
Contributor

@wuisawesome Might this be due to _get_gpu_info_string not handling Windows?

@mehrdadn mehrdadn reopened this Sep 18, 2020
@ericl ericl added P2 Important issue, but not time-critical tune Tune-related issues and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Sep 21, 2020
@mehrdadn
Contributor

mehrdadn commented Sep 4, 2021

Heads up, WMIC is now deprecated so the approach from #9300 may stop working. CC @richardliaw

@richardliaw richardliaw added the rllib RLlib related issues label Oct 5, 2021
@fnw

fnw commented Dec 6, 2021

I am on Ray 1.9.0 and Windows 11 Insider Preview (Build 22483.1011). Attempting to run GPU experiments fails with "FileNotFoundError: [WinError 2] The system cannot find the file specified" when the GPU auto-detection function tries to run the WMIC command. This suggests that WMIC has already been removed, at least on Insider Preview builds.
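
One way to guard against a missing executable like this is to try several detection methods in order and move on when one fails. A minimal sketch, assuming each detector is a zero-argument callable that returns a GPU count; `first_working_detector` is a hypothetical helper, not part of Ray.

```python
import subprocess
from typing import Callable, Iterable


def first_working_detector(
    detectors: Iterable[Callable[[], int]], default: int = 0
) -> int:
    """Return the result of the first detector that does not fail."""
    for detect in detectors:
        try:
            return int(detect())
        except (OSError, subprocess.CalledProcessError, ValueError):
            # OSError covers FileNotFoundError (e.g. WMIC.exe is gone);
            # fall through to the next detection method.
            continue
    return default
```

The `default` of 0 means "no GPU found", so a machine where every method fails would still start, only without GPU resources.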

@sai-prasanna

Has anyone managed to fix this?

@mehrdadn
Contributor

mehrdadn commented Apr 30, 2022

@sai-prasanna @richardliaw @fnw you should be able to use PowerShell instead:

import os
import subprocess
import ray.resource_spec
# Count NVIDIA video controllers via a WMI query run from PowerShell.
ray.resource_spec._autodetect_num_gpus = lambda: int(subprocess.check_output([os.path.join(os.environ['SystemRoot'], 'System32', 'WindowsPowerShell', 'v1.0', 'powershell.exe'), '-Command', "& {@(Get-WmiObject -Query \"SELECT * FROM Win32_VideoController WHERE AdapterCompatibility = 'NVIDIA'\").count}"], stdin=subprocess.DEVNULL, stderr=subprocess.DEVNULL, creationflags=subprocess.CREATE_NO_WINDOW).rstrip())

That should be sufficient if they haven't removed the ability to query WMI from PowerShell.


Otherwise, you can do

python3 -m pip install wmi

and then

ray.resource_spec._autodetect_num_gpus = lambda: len(list(filter(lambda c: c.wmi_property('AdapterCompatibility').value == "NVIDIA", __import__('wmi').WMI().Win32_VideoController())))

If it tries to install pywin32 and fails, you might need to first grab the wheel for that from Gohlke's website.

It's probably possible to use ctypes to manually invoke WMI stuff and bypass the need for that, but it's a fair bit more effort.

@mattip
Contributor

mattip commented Jun 3, 2022

I wonder if we could use gpustat to get this information instead
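
gpustat gets its data from NVIDIA's NVML library. As a rough sketch of the same idea without an extra dependency, the `nvidia-smi -L` command that ships with the NVIDIA driver prints one line per GPU, and those lines can be counted. Note that this substitutes nvidia-smi for gpustat and is not how Ray performs detection. The helper names are hypothetical, and the code assumes each GPU line begins with `GPU <index>:`.

```python
import re
import shutil
import subprocess

# Assumed shape of a GPU line in the `nvidia-smi -L` listing.
_GPU_LINE = re.compile(r"^GPU \d+:")


def count_gpus_in_listing(listing: str) -> int:
    """Count the lines of an `nvidia-smi -L` listing that describe a GPU."""
    return sum(1 for line in listing.splitlines() if _GPU_LINE.match(line))


def autodetect_num_gpus_nvidia_smi() -> int:
    """Return the GPU count from nvidia-smi, or 0 if it is missing or fails."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return 0
    result = subprocess.run([exe, "-L"], capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return 0
    return count_gpus_in_listing(result.stdout)
```

Unlike the WMI-based snippets, this only sees GPUs that the NVIDIA driver itself reports, which may be closer to what CUDA can actually use.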
