
Monitoring GPU and MIG instances #134

@jowy-ti

Description

Hi,

I am trying to simulate a node with MIG instances and monitor them using the integrated dcgm-exporter. While I can successfully make the MIG instances appear on the node and be available for scheduling, I cannot get any Prometheus metrics for them.

What I Tried

I configured values.yaml to define MIG instances using the otherDevices key, which I found in the source code. For example:

# values.yaml
topology:
  nodePools:
    default:
      gpuProduct: "NVIDIA-A100-SXM4-80GB"
      gpuCount: 1
      gpuMemory: 80000
      otherDevices:
        - name: "nvidia.com/mig-1g.10gb"
          count: 7

I applied this configuration via Helm.

Actual Behavior (The Problem)

The MIG resources are correctly advertised on the node and pods can be scheduled on them. However, the dcgm-exporter's /metrics endpoint does not show any metrics for the nvidia.com/mig-1g.10gb instances. It only exports metrics for full GPUs if gpuCount is greater than 0.
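To confirm this, I filter the scraped /metrics text for the MIG device name. A minimal sketch of the check (the sample input is hypothetical; in practice the text would come from the port-forwarded exporter endpoint):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// migLines returns the non-comment metric lines mentioning the given
// device name. In practice the input would be the body fetched from
// the dcgm-exporter /metrics endpoint; a small sample stands in here.
func migLines(metricsText, device string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(metricsText))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "#") && strings.Contains(line, device) {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	// Hypothetical sample: one full-GPU series and no MIG series,
	// matching what I observe on the real endpoint.
	sample := `# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization.
DCGM_FI_DEV_GPU_UTIL{gpu="0",modelName="NVIDIA-A100-SXM4-80GB"} 0
`
	fmt.Println(len(migLines(sample, "mig-1g.10gb"))) // prints 0: no MIG series
}
```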

What I found

I dug into the source code. In internal/status-exporter/export/metrics/exporter.go, the export function contains the following loop:

for gpuIdx, gpu := range nodeTopology.Gpus {
    // ... exports metrics for the full GPU ...
}

This code only iterates over the "gpus" field from the node topology.
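For reference, a minimal sketch of how that loop could also walk otherDevices. All type and field names below are simplified assumptions, not the project's real structs:

```go
package main

import "fmt"

// Simplified stand-ins for the node topology types. The real structs
// live in the project repo and differ in detail; these names are
// illustrative assumptions only.
type OtherDevice struct {
	Name  string // e.g. "nvidia.com/mig-1g.10gb"
	Count int
}

type NodeTopology struct {
	Gpus         []string // placeholder for the real per-GPU entries
	OtherDevices []OtherDevice
}

// export mimics the exporter loop: today only Gpus is walked; the
// second loop is the hypothetical addition for MIG-style devices.
func export(topo NodeTopology) []string {
	var series []string
	for gpuIdx, gpu := range topo.Gpus {
		series = append(series, fmt.Sprintf("gpu=%d product=%s", gpuIdx, gpu))
	}
	for _, dev := range topo.OtherDevices {
		for i := 0; i < dev.Count; i++ {
			series = append(series, fmt.Sprintf("device=%s instance=%d", dev.Name, i))
		}
	}
	return series
}

func main() {
	topo := NodeTopology{
		Gpus:         []string{"NVIDIA-A100-SXM4-80GB"},
		OtherDevices: []OtherDevice{{Name: "nvidia.com/mig-1g.10gb", Count: 7}},
	}
	for _, s := range export(topo) {
		fmt.Println(s) // 1 full-GPU series + 7 MIG series
	}
}
```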

So I labeled the nodes with "node-role.kubernetes.io/runai-dynamic-mig=true" and "node-role.kubernetes.io/runai-mig-enabled=true" and also added the following annotation:

run.ai/mig.config: |-
  version: v1
  mig-configs:
    selected:
      - devices: [0]
        mig-enabled: true
        mig-devices:
          - name: 1g.10gb
            position: 0
            size: 1
          - name: 1g.10gb
            position: 1
            size: 1
          - name: 1g.10gb
            position: 2
            size: 1
          - name: 1g.10gb
            position: 3
            size: 1
          - name: 1g.10gb
            position: 4
            size: 1
          - name: 1g.10gb
            position: 5
            size: 1
          - name: 1g.10gb
            position: 6
            size: 1

After that, the "run.ai/mig-mapping" annotation and the "nvidia.com/mig.config.state=success" label did appear, but the GPU instances were still not registered under the "gpus" section of the node topology. The pod logs show no errors; in fact, mig-faker reports "Successfuly updated MIG config". I don't know how to get the GPU instances into the "gpus" section of the topology — I can only add them under otherDevices.

Question

Do I misunderstand something? Is there a different, undocumented workflow for enabling MIG monitoring?

Thanks for the great work done on this project!
