Missing 'GPU' entries in metrics #289

Open
roscisz opened this issue Sep 1, 2020 · 8 comments

@roscisz
Owner

roscisz commented Sep 1, 2020

Hmm, I cannot see the GPUs: even when I click the "+" sign, nothing happens. I guess the HTTP request used is "/api/0.3.1/nodes/metrics"? If so, here are the contents of the response:

{
  "gpu1.***.com": {
    "CPU": {
      "CPU_gpu1.***.com": {
        "index": 0,
        "metrics": {
          "mem_free": {
            "unit": "MiB",
            "value": 1295
          },
          "mem_total": {
            "unit": "MiB",
            "value": 192925
          },
          "mem_used": {
            "unit": "MiB",
            "value": 152775
          },
          "utilization": {
            "unit": "%",
            "value": 6.15449
          }
        }
      }
    }
  },
  "gpu2.***.com": {
    "CPU": null
  },
  "gpu3.***.com": {
    "CPU": {
      "CPU_gpu3.***.com": {
        "index": 0,
        "metrics": {
          "mem_free": {
            "unit": "MiB",
            "value": 1039
          },
          "mem_total": {
            "unit": "MiB",
            "value": 192925
          },
          "mem_used": {
            "unit": "MiB",
            "value": 40694
          },
          "utilization": {
            "unit": "%",
            "value": 1.45833
          }
        }
      }
    }
  }
}

I replaced all hostnames with fake ones.
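
A quick way to spot incomplete hosts in a response like the one above is to list, per host, which monitor keys are absent or null. This is a minimal sketch, not part of TensorHive: the function name `find_missing_monitors` and the expected key names "CPU" and "GPU" are assumptions taken from the JSON shown in this thread.

```python
import json

# Assumption: the monitor names are the top-level keys seen in this thread.
EXPECTED_MONITORS = ("CPU", "GPU")


def find_missing_monitors(metrics, expected=EXPECTED_MONITORS):
    """Return {hostname: [monitor names that are absent, null, or empty]}."""
    problems = {}
    for host, monitors in metrics.items():
        monitors = monitors or {}  # a host entry may itself be null
        missing = [name for name in expected if not monitors.get(name)]
        if missing:
            problems[host] = missing
    return problems


if __name__ == "__main__":
    # Hypothetical, shortened response with made-up hostnames.
    sample = json.loads("""
    {
      "gpu1.example": {"CPU": {"CPU_gpu1.example": {"index": 0}}},
      "gpu2.example": {"CPU": null}
    }
    """)
    for host, missing in sorted(find_missing_monitors(sample).items()):
        print(f"{host}: missing {', '.join(missing)}")
```

Applied to the response above, such a check would flag every host for "GPU" and additionally gpu2 for "CPU".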

Here are the free -m results on all my machines (the OS is CentOS 7 on all of them):

[root@gpu1 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:         192925      166473        7244        5040       19207       19968
Swap:         15931       14994         937
[root@gpu2 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:         192925      158857        1487        6334       32579       26228
Swap:         15931        7748        8183
[root@gpu3 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:         192925       40692        1036       28797      151196      122702
Swap:         15931           0       15931

Also nvidia-smi:

[root@gpu1 ~]# nvidia-smi
Tue Sep  1 14:23:26 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:14:00.0 Off |                    0 |
| N/A   28C    P0    49W / 250W |  21956MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 00000000:15:00.0 Off |                    0 |
| N/A   36C    P0    49W / 250W |  14135MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 00000000:39:00.0 Off |                    0 |
| N/A   54C    P0   163W / 250W |  21849MiB / 22919MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 00000000:3A:00.0 Off |                    0 |
| N/A   52C    P0   177W / 250W |  21886MiB / 22919MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P40           Off  | 00000000:88:00.0 Off |                    0 |
| N/A   23C    P8     9W / 250W |     10MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P40           Off  | 00000000:89:00.0 Off |                    0 |
| N/A   30C    P0    50W / 250W |  18663MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P40           Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   25C    P8    10W / 250W |     10MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P40           Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   22C    P8     9W / 250W |     10MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4301      C   ...er_conda/miniconda/envs/CCR/bin/python3 18787MiB |
|    0     44728      C   ...se180025/.conda/envs/ner_env/bin/python  3157MiB |
|    1     16768      C   ...er_conda/miniconda/envs/CCR/bin/python3 14125MiB |
|    2     39814      C   ...se170020/.conda/envs/GPUtest/bin/python 21837MiB |
|    3      4301      C   ...er_conda/miniconda/envs/CCR/bin/python3   619MiB |
|    3     23651      C   ...se170020/.conda/envs/GPUtest/bin/python 21255MiB |
|    5      4301      C   ...er_conda/miniconda/envs/CCR/bin/python3 18653MiB |
+-----------------------------------------------------------------------------+

[root@gpu2 ~]# nvidia-smi
Tue Sep  1 14:23:53 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:14:00.0 Off |                    0 |
| N/A   28C    P0    48W / 250W |  21926MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 00000000:15:00.0 Off |                    0 |
| N/A   33C    P0    49W / 250W |    306MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 00000000:39:00.0 Off |                    0 |
| N/A   29C    P0    49W / 250W |    306MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 00000000:3A:00.0 Off |                    0 |
| N/A   28C    P0    49W / 250W |    306MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P40           Off  | 00000000:88:00.0 Off |                    0 |
| N/A   33C    P0    48W / 250W |    306MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P40           Off  | 00000000:89:00.0 Off |                    0 |
| N/A   25C    P0    48W / 250W |    306MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P40           Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   52C    P0   133W / 250W |  21867MiB / 22919MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P40           Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   47C    P0   207W / 250W |  21867MiB / 22919MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    0      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3 21767MiB |
|    1      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    1      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    2      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    2      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    3      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    3      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    4      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    4      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    5      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    5      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    6      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    6      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    6     27039      C   ...se170020/.conda/envs/GPUtest/bin/python 21559MiB |
|    7      8757      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    7      8763      C   ...i/.conda/envs/pyenv_dotmani/bin/python3   147MiB |
|    7     23491      C   ...se170020/.conda/envs/GPUtest/bin/python 21559MiB |
+-----------------------------------------------------------------------------+

[root@gpu3 ~]# nvidia-smi
Tue Sep  1 14:24:13 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:14:00.0 Off |                    0 |
| N/A   31C    P0    49W / 250W |  21950MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 00000000:15:00.0 Off |                    0 |
| N/A   40C    P0    48W / 250W |  21923MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 00000000:39:00.0 Off |                    0 |
| N/A   34C    P0    49W / 250W |  21923MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 00000000:3A:00.0 Off |                    0 |
| N/A   31C    P0    49W / 250W |   1699MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P40           Off  | 00000000:88:00.0 Off |                    0 |
| N/A   24C    P8     9W / 250W |     10MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P40           Off  | 00000000:89:00.0 Off |                    0 |
| N/A   26C    P8     9W / 250W |     10MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P40           Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   28C    P8     9W / 250W |     10MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P40           Off  | 00000000:B2:00.0 Off |                    0 |
| N/A   25C    P8     9W / 250W |     10MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      6714      C   /opt/gpudb/core/bin/gpudb_cluster_cuda       501MiB |
|    0      6716    C+G   /opt/gpudb/core/bin/gpudb_cluster_cuda     21430MiB |
|    1      6718    C+G   /opt/gpudb/core/bin/gpudb_cluster_cuda     21904MiB |
|    2      6720    C+G   /opt/gpudb/core/bin/gpudb_cluster_cuda     21904MiB |
|    3      4062      C   /opt/conda/envs/rapids/bin/python            319MiB |
|    3     28474      C   /opt/conda/envs/rapids/bin/python            377MiB |
|    3     31448      C   /opt/conda/envs/rapids/bin/python            271MiB |
|    3     33893      C   /opt/conda/envs/rapids/bin/python            229MiB |
|    3     47777      C   /opt/conda/envs/rapids/bin/python            399MiB |
+-----------------------------------------------------------------------------+

Thanks for your help :)

Originally posted by @Dubrzr in #286 (comment)

@roscisz roscisz added the bug label Sep 1, 2020
@roscisz roscisz self-assigned this Sep 1, 2020
@roscisz
Owner Author

roscisz commented Sep 1, 2020

@Dubrzr could you please provide the output of the following command on your gpu2 server:

awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 100 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat);free -m | awk 'NR==2'
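
For readers who find the awk one-liner hard to follow, here is a rough Python equivalent of its CPU part. It takes two samples of the aggregate "cpu " line from /proc/stat and computes (user + system) as a percentage of (user + system + idle) over the interval. This is only a sketch of what the command computes; the function names are hypothetical and it is not TensorHive code.

```python
import time


def busy_and_total(stat_line):
    """Split an aggregate 'cpu ...' line into (busy, total) jiffies.

    Mirrors the awk fields: $2 = user, $4 = system, $5 = idle.
    """
    fields = stat_line.split()
    user, system, idle = int(fields[1]), int(fields[3]), int(fields[4])
    busy = user + system
    return busy, busy + idle


def cpu_utilization(first_line, second_line):
    """Percentage of busy time between two samples of the 'cpu ' line."""
    busy1, total1 = busy_and_total(first_line)
    busy2, total2 = busy_and_total(second_line)
    if total2 == total1:
        # The awk version would divide by zero here; this sketch returns 0.
        return 0.0
    return (busy2 - busy1) * 100 / (total2 - total1)


def read_cpu_line(path="/proc/stat"):
    """Return the aggregate line, i.e. the one matched by grep 'cpu '."""
    with open(path) as stat:
        for line in stat:
            if line.startswith("cpu "):
                return line
    raise RuntimeError(f"no aggregate cpu line found in {path}")


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(1)  # same one-second interval as in the awk command
    print(cpu_utilization(first, read_cpu_line()))
```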

@Dubrzr

Dubrzr commented Sep 1, 2020

It looks like it works fine:

[root@gpu1 ~]# awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 100 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat);free -m | awk 'NR==2'
3.08784
Mem:         192925      166538        7237        5040       19150       19906
[root@gpu2 ~]# awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 100 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat);free -m | awk 'NR==2'
6.61519
Mem:         192925      159153        1300        6254       32471       26015
[root@gpu3 ~]# awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 100 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat);free -m | awk 'NR==2'
1.24818
Mem:         192925       40694        1031       28797      151198      122699

@roscisz
Owner Author

roscisz commented Sep 1, 2020

Thanks... wrong intuition then...

This is indeed the right endpoint. My sample output:

{
  "ai": {
    "CPU": {
      "CPU_ai": {
        "index": 0,
        "metrics": {
          "mem_free": {
            "unit": "MiB",
            "value": 30806
          },
          "mem_total": {
            "unit": "MiB",
            "value": 257868
          },
          "mem_used": {
            "unit": "MiB",
            "value": 44770
          },
          "utilization": {
            "unit": "%",
            "value": 5.62359
          }
        }
      }
    },
    "GPU": {
      "GPU-5db488ff-6728-fb07-93be-ee423d4ab086": {
        "index": 3,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 1824
          },
          "mem_total": {
            "unit": "MiB",
            "value": 16128
          },
          "mem_used": {
            "unit": "MiB",
            "value": 14304
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "62.97"
          },
          "temp": 46,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla V100-DGXS-16GB",
        "processes": [
          {
            "command": "python3",
            "owner": "macsakow",
            "pid": 22294
          }
        ]
      },
      "GPU-6602a81f-14bf-4ee5-2992-45a7e519802b": {
        "index": 1,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 16117
          },
          "mem_total": {
            "unit": "MiB",
            "value": 16128
          },
          "mem_used": {
            "unit": "MiB",
            "value": 11
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "39.85"
          },
          "temp": 44,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla V100-DGXS-16GB",
        "processes": [
          {
            "command": "-",
            "owner": null,
            "pid": "-"
          }
        ]
      },
      "GPU-cc813906-1a7c-e941-ccf8-1a40cd99ccfb": {
        "index": 2,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 1824
          },
          "mem_total": {
            "unit": "MiB",
            "value": 16128
          },
          "mem_used": {
            "unit": "MiB",
            "value": 14304
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "52.27"
          },
          "temp": 45,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla V100-DGXS-16GB",
        "processes": [
          {
            "command": "python3",
            "owner": "macsakow",
            "pid": 22233
          }
        ]
      },
      "GPU-d1ae8368-a34f-3afc-14aa-01afbc0fa787": {
        "index": 0,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 16113
          },
          "mem_total": {
            "unit": "MiB",
            "value": 16125
          },
          "mem_used": {
            "unit": "MiB",
            "value": 12
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "39.27"
          },
          "temp": 44,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla V100-DGXS-16GB",
        "processes": [
          {
            "command": "-",
            "owner": null,
            "pid": "-"
          }
        ]
      }
    }
  }
}

@micmarty: any ideas? While nvidia-smi output looks fine and there is no error message from GPUMonitor, there are no 'GPU' entries in metrics.

@roscisz roscisz changed the title CPU monitoring AssertionError Missing 'GPU' entries in metrics Sep 1, 2020
@roscisz
Owner Author

roscisz commented Sep 1, 2020

@Dubrzr: and how about this command:

nvidia-smi --query-gpu=name,fan.speed,utilization.gpu --format=csv,nounits

I see that you have a newer version of the NVIDIA driver (the newest version that we've tested is 418.116); maybe there have also been some changes to nvidia-smi...

@Dubrzr

Dubrzr commented Sep 1, 2020

Here it is: :)

[root@gpu1 ~]# nvidia-smi --query-gpu=name,fan.speed,utilization.gpu --format=csv,nounits
name, fan.speed [%], utilization.gpu [%]
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 100
Tesla P40, [Not Supported], 99
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0

[root@gpu2 ~]# nvidia-smi --query-gpu=name,fan.speed,utilization.gpu --format=csv,nounits
name, fan.speed [%], utilization.gpu [%]
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 0
Tesla P40, [Not Supported], 99
Tesla P40, [Not Supported], 99

[root@gpu3 ~]# nvidia-smi --query-gpu=name,fan.speed,utilization.gpu --format=csv,nounits
name, fan.speed [%], utilization.gpu [%]
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
Tesla P40, [N/A], 0
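
Note that the two driver versions report an unavailable fan speed differently ("[Not Supported]" vs "[N/A]"). Below is a minimal sketch of parsing this CSV output in a way that treats both notations as missing values. It is an illustration only, not TensorHive's actual parser; the function name `parse_gpu_csv` and the placeholder set are assumptions based on the output in this thread.

```python
import csv
import io

# Assumption: only the two placeholder notations seen in this thread.
PLACEHOLDERS = {"[Not Supported]", "[N/A]"}


def parse_gpu_csv(output):
    """Parse 'nvidia-smi --query-gpu=... --format=csv,nounits' output.

    Returns one dict per GPU, keyed by the header names. Placeholder
    values become None and plain integers become int.
    """
    reader = csv.reader(io.StringIO(output), skipinitialspace=True)
    rows = [row for row in reader if row]  # drop empty lines
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    gpus = []
    for row in data:
        record = {}
        for column, raw in zip(header, row):
            value = raw.strip()
            if value in PLACEHOLDERS:
                record[column] = None
            elif value.isdigit():
                record[column] = int(value)
            else:
                record[column] = value
        gpus.append(record)
    return gpus


if __name__ == "__main__":
    sample = (
        "name, fan.speed [%], utilization.gpu [%]\n"
        "Tesla P40, [Not Supported], 0\n"
        "Tesla P40, [N/A], 99\n"
    )
    for gpu in parse_gpu_csv(sample):
        print(gpu)
```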

@roscisz
Owner Author

roscisz commented Sep 1, 2020

Everything looks fine here...

Could you try modifying line 73 in tensorhive/core/managers/TensorHiveManager.py and set:

monitors = []

and see if it helps?

@Dubrzr

Dubrzr commented Sep 1, 2020

Yep! It indeed works better 🎉 But gpu2 still doesn't:

{
  "gpu1": {
    "GPU": {
      "GPU-373e1376-924e-2bfc-1f62-064eaaccd10d": {
        "index": 3,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 1033
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 21886
          },
          "mem_util": {
            "unit": "%",
            "value": 75
          },
          "power": {
            "unit": "W",
            "value": "225.63"
          },
          "temp": 52,
          "utilization": {
            "unit": "%",
            "value": 100
          }
        },
        "name": "Tesla P40"
      },
      "GPU-3d85c94f-202d-12a7-4bbc-9649c35ad28c": {
        "index": 6,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 22909
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 10
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "10.02"
          },
          "temp": 25,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-6110d200-8918-9021-c962-74c486c34b0c": {
        "index": 1,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 8784
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 14135
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "49.36"
          },
          "temp": 36,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-71a449d4-02dc-ba5a-7006-59039b9895f9": {
        "index": 0,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 963
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 21956
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "49.52"
          },
          "temp": 28,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-77749c57-33ca-d5fd-705a-f037e459a3ba": {
        "index": 4,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 22909
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 10
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "9.63"
          },
          "temp": 23,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-c111d493-b7b5-89cf-19bb-5f25aad9b12f": {
        "index": 7,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 22909
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 10
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "9.54"
          },
          "temp": 22,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-e0f615fb-f81e-b1c8-6d10-60c160bc30f8": {
        "index": 5,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 4256
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 18663
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "50.32"
          },
          "temp": 30,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-ef774f0c-ed68-01bb-5b42-2b54ed482ba1": {
        "index": 2,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": null
          },
          "mem_free": {
            "unit": "MiB",
            "value": 1070
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 21849
          },
          "mem_util": {
            "unit": "%",
            "value": 74
          },
          "power": {
            "unit": "W",
            "value": "130.64"
          },
          "temp": 55,
          "utilization": {
            "unit": "%",
            "value": 99
          }
        },
        "name": "Tesla P40"
      }
    }
  },
  "gpu2": {
    "GPU": null
  },
  "gpu3": {
    "GPU": {
      "GPU-13f10865-a510-351c-65d5-1630c9d0a941": {
        "index": 5,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 22909
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 10
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "9.84"
          },
          "temp": 26,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-1d351c42-ad31-d8fd-79ad-3db6a5d375a2": {
        "index": 3,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 21220
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 1699
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "49.16"
          },
          "temp": 31,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-29d32502-07b6-21a4-b655-e06d2137d248": {
        "index": 6,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 22909
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 10
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "10.33"
          },
          "temp": 28,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-798ca8f5-3a20-5e7a-098a-79b749df27d6": {
        "index": 2,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 996
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 21923
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "50.06"
          },
          "temp": 35,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-7a4f2061-9efa-63ef-446f-15e385a2818d": {
        "index": 1,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 996
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 21923
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "48.88"
          },
          "temp": 40,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-a1469d32-daf1-8187-de12-c3bea645073a": {
        "index": 0,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 969
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 21950
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "49.87"
          },
          "temp": 31,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-dfb43b29-8262-e134-3c80-1ff955b2e0de": {
        "index": 7,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 22909
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 10
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "9.75"
          },
          "temp": 25,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      },
      "GPU-f9811f7e-662d-ba70-fc06-f8c713358fc6": {
        "index": 4,
        "metrics": {
          "fan_speed": {
            "unit": "%",
            "value": "[N/A]"
          },
          "mem_free": {
            "unit": "MiB",
            "value": 22909
          },
          "mem_total": {
            "unit": "MiB",
            "value": 22919
          },
          "mem_used": {
            "unit": "MiB",
            "value": 10
          },
          "mem_util": {
            "unit": "%",
            "value": 0
          },
          "power": {
            "unit": "W",
            "value": "9.85"
          },
          "temp": 25,
          "utilization": {
            "unit": "%",
            "value": 0
          }
        },
        "name": "Tesla P40"
      }
    }
  }
}

@roscisz
Owner Author

roscisz commented Sep 11, 2020

@Dubrzr do you have any new observations or hints?

If the data were missing for gpu3, we would at least suspect that the differing fan speed notation "[N/A]" is not parsed properly. But since the nvidia-smi output on gpu2 looks correct, we currently have no idea how to help...

Which OS user account does tensorhive use? nvidia-smi works properly for the root user, but does it also work for the user account used by TH on gpu2?
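
One way to test this is to run a small check as that account on gpu2 and see whether nvidia-smi starts and exits cleanly. The sketch below is a generic helper, not TensorHive code; `command_works` is a hypothetical name, and the nvidia-smi query used in the example is just one simple invocation.

```python
import subprocess
import sys


def command_works(command, timeout=30):
    """Run a command and return (ok, detail).

    ok is False if the binary cannot be started, times out, or exits
    with a non-zero status; detail carries the output or the reason.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except OSError as exc:  # e.g. binary not on PATH or not executable
        return False, str(exc)
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout}s"
    if result.returncode != 0:
        return False, result.stderr.strip() or f"exit code {result.returncode}"
    return True, result.stdout.strip()


if __name__ == "__main__":
    ok, detail = command_works(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"]
    )
    if ok:
        print(f"OK:\n{detail}")
    else:
        print(f"FAILED: {detail}", file=sys.stderr)
```

Running it once as root and once as the account TensorHive logs in with should show whether the problem is tied to that account's environment (PATH, permissions) rather than to the driver.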
