Support for different types of GPUs with SLURM #1631

Closed
iparask opened this issue Jun 6, 2018 · 5 comments

Comments

@iparask
Contributor

iparask commented Jun 6, 2018

Some clusters provide multiple types of GPUs. Bridges, for example, has both K80 and P100 GPUs. SLURM sets an environment variable called SLURM_GRES, at least on Bridges. When probed, this variable provides the following information:

[paraskev@gpu047 ~]$ echo $SLURM_GRES
gpu:p100:1

I am currently looking into how we can make sure that this variable exists. That way, RP's LRMS will be able to discover the allocated GPUs.
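To make the discovery step concrete, here is a minimal sketch of how a value like the one above could be parsed. It is not RP's actual LRMS code: the function names `parse_slurm_gres` and `allocated_gpus` are hypothetical, and it assumes each comma-separated entry has the form `name[:type][:count]`, which matches the `gpu:p100:1` value seen on Bridges but may not cover every SLURM setup.

```python
import os


def parse_slurm_gres(value):
    """Parse a GRES string such as 'gpu:p100:1' into a list of dicts.

    Assumption: each comma-separated entry looks like name[:type][:count].
    A missing or non-numeric count is treated as 1.
    """
    resources = []
    for entry in value.split(','):
        entry = entry.strip()
        if not entry:
            continue
        parts = entry.split(':')
        name, gtype, count = parts[0], None, 1
        if len(parts) == 2:
            # Either 'gpu:2' (count only) or 'gpu:p100' (type only).
            if parts[1].isdigit():
                count = int(parts[1])
            else:
                gtype = parts[1]
        elif len(parts) >= 3:
            gtype = parts[1]
            if parts[2].isdigit():
                count = int(parts[2])
        resources.append({'name': name, 'type': gtype, 'count': count})
    return resources


def allocated_gpus(environ=None):
    """Return the GPU entries found in SLURM_GRES, or [] if it is unset."""
    environ = os.environ if environ is None else environ
    gres = environ.get('SLURM_GRES', '')
    return [r for r in parse_slurm_gres(gres) if r['name'] == 'gpu']


if __name__ == '__main__':
    # Outside a SLURM allocation this prints an empty list.
    print(allocated_gpus())
```

Returning an empty list when the variable is missing keeps the caller free to fall back to another detection method, which matters here because it is not yet clear that SLURM_GRES is set on every resource.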

This issue is parallel to RADICAL.SAGA issue #679

@andre-merzky
Member

The SAGA issue is addressed.

@mturilli
Contributor

mturilli commented Jul 7, 2020

@iparask any update on this?

@iparask
Contributor Author

iparask commented Jul 9, 2020

Yes. I'm working on a fix for Bridges. It will allow SAGA to submit to both p100 and k80 nodes, at least on Bridges. Do we need this for another resource? If so, I will need to check how to request the different node types there as well.

Now that I think about it, it may be good to include the AI nodes Bridges has, but I am not sure I have access there right now.
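For context, SLURM lets a job ask for a specific GPU type through the `--gres=gpu:<type>:<count>` option of `sbatch`. The sketch below shows how such a request could be assembled. It is illustrative only and is not the SAGA adaptor code: the helper names `gres_option` and `sbatch_args` are hypothetical, and the partition name is left as a parameter because it differs per resource.

```python
def gres_option(gpu_type, count):
    """Build the sbatch '--gres' option for a GPU request.

    gpu_type: e.g. 'p100' or 'k80'; pass None for an untyped request.
    count:    number of GPUs per node.
    """
    if count < 1:
        raise ValueError('count must be at least 1')
    if not gpu_type:
        # Untyped request: SLURM may assign any available GPU type.
        return '--gres=gpu:%d' % count
    return '--gres=gpu:%s:%d' % (gpu_type, count)


def sbatch_args(partition, gpu_type, count, nodes=1):
    """Assemble an sbatch argument list (hypothetical helper)."""
    return ['sbatch', '-p', partition, '-N', str(nodes),
            gres_option(gpu_type, count)]


if __name__ == '__main__':
    # 'GPU' is an assumed partition name used only for illustration.
    print(' '.join(sbatch_args('GPU', 'p100', 2)))
    print(' '.join(sbatch_args('GPU', 'k80', 4)))
```

Keeping the GPU type as a plain parameter means supporting another node type would only need a new value rather than new code, assuming the resource exposes it through GRES in the same way.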

@mturilli
Contributor

mturilli commented Aug 3, 2020

@iparask to check whether this works with RP.

@iparask
Contributor Author

iparask commented Aug 7, 2020

It does work with RP. Here is the output I get from a unit running nvidia-smi:

[paraskev@login006 unit.000000]$ cat STDOUT
Wed Aug  5 17:58:13 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:83:00.0 Off |                    0 |
| N/A   28C    P8    25W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:84:00.0 Off |                    0 |
| N/A   27C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:8A:00.0 Off |                    0 |
| N/A   33C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:8B:00.0 Off |                    0 |
| N/A   30C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

We do need a new configuration for Bridges, though. I will create one in the next few days and open a PR.
