Add UltraServer support for CloudWatch agent #1571

Open
wants to merge 7 commits into base: main

@petruanica petruanica commented Feb 26, 2025

Description of the issue

  • Add support for monitoring EC2 UltraServers using CloudWatch Agent

Description of changes

  • Allowlisted UltraServer dimension for Neuron metrics
  • Added new (ClusterName, UltraServer) dimension for Neuron metrics emitted at the node level
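
A minimal sketch of the dimension change described above, written in Python for brevity rather than in the agent's own language. The constant and function names are hypothetical and do not come from the agent's code. Omitting the new set when a node reports no UltraServer ID is an assumption, not confirmed behaviour.

```python
from typing import Dict, List

# Hypothetical names for illustration; not the agent's actual identifiers.
BASE_NODE_DIMENSIONS: List[List[str]] = [
    ["ClusterName"],
    ["ClusterName", "InstanceId", "NodeName"],
]
ULTRASERVER_DIMENSIONS: List[str] = ["ClusterName", "UltraServer"]


def node_dimension_sets(attributes: Dict[str, str]) -> List[List[str]]:
    """Return the dimension sets to attach to one node-level Neuron metric."""
    sets = [list(d) for d in BASE_NODE_DIMENSIONS]
    # Assumption: the (ClusterName, UltraServer) set is only added when the
    # node actually carries a non-empty UltraServer attribute.
    if attributes.get("UltraServer"):
        sets.insert(1, list(ULTRASERVER_DIMENSIONS))
    return sets


if __name__ == "__main__":
    attrs = {"ClusterName": "HyperPodNeuronEFA", "UltraServer": "u-1234567890"}
    for dims in node_dimension_sets(attrs):
        print(dims)
```

Keeping the new set separate from the base list means nodes without an UltraServer ID would keep emitting exactly the dimension sets they emitted before.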

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  • Updated test files to include the new UltraServer dimension
  • Verified that the metric output includes the new UltraServer identifier by deploying the changes to an EKS test cluster

EMF sample:

{
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "UltraServer"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "InstanceType",
                    "NeuronCore",
                    "NeuronDevice",
                    "NodeName"
                ]
            ],
            "Metrics": [
                {
                    "Name": "node_neuroncore_memory_usage_total",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_neuroncore_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_neuroncore_memory_usage_runtime_memory",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_neuroncore_memory_usage_constants",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_neuroncore_memory_usage_model_code",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_neuroncore_memory_usage_model_shared_scratchpad",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_neuroncore_memory_usage_tensors",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "HyperPodNeuronEFA",
    "InstanceId": "i-07c72013a77a6826e",
    "InstanceType": "trn1.2xlarge",
    "NeuronCore": "core1",
    "NeuronDevice": "device0",
    "NodeName": "ip-192-168-94-194.us-west-2.compute.internal",
    "Timestamp": "1740591416636",
    "Type": "NodeAWSNeuronCore",
    "UltraServer": "u-1234567890",
    "Version": "0",
    "availability_zone": "us-west-2b",
    "kubernetes": {
        "host": "ip-192-168-94-194.us-west-2.compute.internal"
    },
    "region": "us-west-2",
    "subnet_id": "subnet-0dfa65d0c9792b6f3",
    "node_neuroncore_memory_usage_constants": 0,
    "node_neuroncore_memory_usage_model_code": 0,
    "node_neuroncore_memory_usage_model_shared_scratchpad": 0,
    "node_neuroncore_memory_usage_runtime_memory": 0,
    "node_neuroncore_memory_usage_tensors": 0,
    "node_neuroncore_memory_usage_total": 0,
    "node_neuroncore_utilization": 0
}
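
To read the sample, note that each entry under "Dimensions" is a list of key names, and the values come from the top-level fields of the same document. The sketch below resolves each dimension set against a trimmed copy of the sample above (two of the seven metrics, other fields dropped). The helper name `resolve_dimension_sets` is hypothetical and is not part of the agent.

```python
from typing import Dict, List

# Trimmed copy of the EMF sample from this pull request.
SAMPLE = {
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                ["ClusterName"],
                ["ClusterName", "UltraServer"],
                ["ClusterName", "InstanceId", "NodeName"],
            ],
            "Metrics": [
                {"Name": "node_neuroncore_memory_usage_total", "Unit": "Bytes"},
                {"Name": "node_neuroncore_utilization", "Unit": "Percent"},
            ],
        }
    ],
    "ClusterName": "HyperPodNeuronEFA",
    "InstanceId": "i-07c72013a77a6826e",
    "NodeName": "ip-192-168-94-194.us-west-2.compute.internal",
    "UltraServer": "u-1234567890",
    "node_neuroncore_memory_usage_total": 0,
    "node_neuroncore_utilization": 0,
}


def resolve_dimension_sets(emf: dict) -> List[Dict[str, str]]:
    """Pair every dimension name with its value from the document root."""
    resolved = []
    for directive in emf["CloudWatchMetrics"]:
        for names in directive["Dimensions"]:
            # A KeyError here means a dimension has no matching root field.
            resolved.append({name: emf[name] for name in names})
    return resolved


if __name__ == "__main__":
    for dims in resolve_dimension_sets(SAMPLE):
        print(dims)
    # The second set resolves to:
    # {'ClusterName': 'HyperPodNeuronEFA', 'UltraServer': 'u-1234567890'}
```

Each metric definition is recorded once per dimension set, so the full sample above (four dimension sets, seven metric definitions) describes 28 metric and dimension combinations.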

Requirements

Before committing the code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

@petruanica petruanica marked this pull request as ready for review March 3, 2025 16:24
@petruanica petruanica requested a review from a team as a code owner March 3, 2025 16:24