
Monitoring GPU and MIG instances #134

@jowy-ti

Description

Hi,

I am trying to simulate a node with MIG instances and monitor them using the integrated dcgm-exporter. While I can successfully make the MIG instances appear on the node and be available for scheduling, I cannot get any Prometheus metrics for them.

What I Tried

I configured values.yaml to define MIG instances using the otherDevices key, which I found in the source code. For example:

# values.yaml
topology:
  nodePools:
    default:
      gpuProduct: "NVIDIA-A100-SXM4-80GB"
      gpuCount: 1
      gpuMemory: 80000
      otherDevices:
        - name: "nvidia.com/mig-1g.10gb"
          count: 7

I applied this configuration via Helm.

Actual Behavior (The Problem)

The MIG resources are correctly advertised on the node and pods can be scheduled on them. However, the dcgm-exporter's /metrics endpoint does not show any metrics for the nvidia.com/mig-1g.10gb instances. It only exports metrics for full GPUs if gpuCount is greater than 0.
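To confirm this, I filter the scraped /metrics text for the MIG device name. A minimal sketch of the check (the sample input is hypothetical; in practice the text would come from the port-forwarded exporter endpoint):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// migLines returns the non-comment metric lines mentioning the given
// device name. In practice the input would be the body fetched from
// the dcgm-exporter /metrics endpoint; a small sample stands in here.
func migLines(metricsText, device string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(metricsText))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "#") && strings.Contains(line, device) {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	// Hypothetical sample: one full-GPU series and no MIG series,
	// matching what I observe on the real endpoint.
	sample := `# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization.
DCGM_FI_DEV_GPU_UTIL{gpu="0",modelName="NVIDIA-A100-SXM4-80GB"} 0
`
	fmt.Println(len(migLines(sample, "mig-1g.10gb"))) // prints 0: no MIG series
}
```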

What I found

I dug into the source code. In internal/status-exporter/export/metrics/exporter.go, the export function contains the following loop:

for gpuIdx, gpu := range nodeTopology.Gpus {
    // ... exports metrics for the full GPU ...
}

This code only iterates over the "gpus" field from the node topology.
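For reference, a minimal sketch of how that loop could also walk otherDevices. All type and field names below are simplified assumptions, not the project's real structs:

```go
package main

import "fmt"

// Simplified stand-ins for the node topology types. The real structs
// live in the project repo and differ in detail; these names are
// illustrative assumptions only.
type OtherDevice struct {
	Name  string // e.g. "nvidia.com/mig-1g.10gb"
	Count int
}

type NodeTopology struct {
	Gpus         []string // placeholder for the real per-GPU entries
	OtherDevices []OtherDevice
}

// export mimics the exporter loop: today only Gpus is walked; the
// second loop is the hypothetical addition for MIG-style devices.
func export(topo NodeTopology) []string {
	var series []string
	for gpuIdx, gpu := range topo.Gpus {
		series = append(series, fmt.Sprintf("gpu=%d product=%s", gpuIdx, gpu))
	}
	for _, dev := range topo.OtherDevices {
		for i := 0; i < dev.Count; i++ {
			series = append(series, fmt.Sprintf("device=%s instance=%d", dev.Name, i))
		}
	}
	return series
}

func main() {
	topo := NodeTopology{
		Gpus:         []string{"NVIDIA-A100-SXM4-80GB"},
		OtherDevices: []OtherDevice{{Name: "nvidia.com/mig-1g.10gb", Count: 7}},
	}
	for _, s := range export(topo) {
		fmt.Println(s) // 1 full-GPU series + 7 MIG series
	}
}
```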

So I labeled the nodes with "node-role.kubernetes.io/runai-dynamic-mig=true" and "node-role.kubernetes.io/runai-mig-enabled=true" and also added the following annotation:

run.ai/mig.config: |-
  version: v1
  mig-configs:
    selected:
      - devices: [0]
        mig-enabled: true
        mig-devices:
          - name: 1g.10gb
            position: 0
            size: 1
          - name: 1g.10gb
            position: 1
            size: 1
          - name: 1g.10gb
            position: 2
            size: 1
          - name: 1g.10gb
            position: 3
            size: 1
          - name: 1g.10gb
            position: 4
            size: 1
          - name: 1g.10gb
            position: 5
            size: 1
          - name: 1g.10gb
            position: 6
            size: 1

After that, the "run.ai/mig-mapping" annotation and the "nvidia.com/mig.config.state=success" label did appear, but the GPU instances were still not registered under the "gpus" section of the node topology. The pod logs show no errors; in fact, mig-faker reports "Successfuly updated MIG config". I don't know how to get the GPU instances into the "gpus" section of the topology — I can only add them under otherDevices.

Question

Do I misunderstand something? Is there a different, undocumented workflow for enabling MIG monitoring?

Thanks for the great work done on this project!
