
@zxiiro zxiiro commented Aug 25, 2025

The AWS H100 check regularly flaps because H100 runners are a limited resource: jobs routinely queue for 3-4 hrs and occasionally as long as 5-6 hrs. Split the check out into its own synthetic test and set its check time to 4 hrs so that it flaps less often.

Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
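
For reference, a 4 hr check time presumably corresponds to a 14400-second avg_queue_s threshold in the assertion style used by these tests. A minimal, hypothetical sketch of an H100-only filter with that threshold (illustrative only, not the code in the plan below; the sample rows are made up):

    // Hypothetical H100-only variant of the queue assertion (not the applied code).
    // Flags only linux.aws.h100 runners queuing longer than 4 hours (14400 s).
    const H100_PATTERN = /^linux\.aws\.h100/;
    const parsedData = [
      { machine_type: 'linux.aws.h100', avg_queue_s: 15000 },  // over 4 hrs -> flagged
      { machine_type: 'linux.aws.h100', avg_queue_s: 9000 },   // 2.5 hrs    -> ignored
    ];
    const highQueueItems = parsedData.filter(item =>
      H100_PATTERN.test(item.machine_type) && item.avg_queue_s > 14400
    );
    console.log(highQueueItems.length); // -> 1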
@zxiiro zxiiro requested a review from Copilot August 25, 2025 14:24
@zxiiro zxiiro requested a review from a team as a code owner August 25, 2025 14:24

github-actions bot commented Aug 25, 2025

OpenTofu plan for prod

Plan: 1 to add, 1 to change, 0 to destroy.
OpenTofu used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+   create
!~  update in-place

OpenTofu will perform the following actions:

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-meta will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-meta" {
        id               = "nnz-icu-8qk"
        name             = "GHA Runner Queue Check - Meta Runners"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      assertion {
!~          code = <<-EOT
                dd.expect(dd.response.statusCode).to.equal(200);
              - const EXCLUDED_MACHINE_PATTERNS = ['.dgx.', '.rocm.', '.s390x', '^lf\\.'];
              + const EXCLUDED_MACHINE_PATTERNS = ['.dgx.', '.rocm.', '.s390x', '^lf\\.', '^linux.aws.h100'];
                const jsonData = dd.response.body;
                const parsedData = JSON.parse(jsonData);
                const highQueueItems = parsedData
                  .filter(item => {
                    const machineType = item.machine_type;
                    return !EXCLUDED_MACHINE_PATTERNS.some(pattern =>
                      pattern.startsWith('^') ?
                        new RegExp(pattern).test(machineType) :
                        machineType.includes(pattern)
                    ) && item.avg_queue_s > 7200;
                  })
                  .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
                if (highQueueItems.length > 0) {
                  const machineDetails = highQueueItems
                    .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
                    .join(', ');
                  const message = `High queue detected for machine types: ${machineDetails}`;
                  console.error(message);
                }
                dd.expect(highQueueItems.length > 0).to.be.false;
            EOT
#            (1 unchanged attribute hidden)
        }

#        (2 unchanged blocks hidden)
    }

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-meta-h100 will be created
+   resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-meta-h100" {
+       id         = (known after apply)
+       locations  = [
+           "aws:us-west-2",
        ]
+       message    = <<-EOT
            Detected GitHub Runner Queue - Meta Runners - AWS H100 has jobs waiting
            unusually long for runners.
            
            {{synthetics.attributes.result.failure.message}}
            
            Check https://hud.pytorch.org/metrics for more details.
            
            @slack-pytorch-infra-alerts
        EOT
+       monitor_id = (known after apply)
+       name       = "GHA Runner Queue Check - Meta Runners - AWS H100"
+       status     = "live"
+       tags       = [
+           "env:project",
+           "project:pytorch",
+           "service:gha-runners",
        ]
+       type       = "api"

+       assertion {
+           code = <<-EOT
                dd.expect(dd.response.statusCode).to.equal(200);
                const EXCLUDED_MACHINE_PATTERNS = ['.dgx.', '.rocm.', '.s390x', '^lf\\.', '^linux.aws.h100'];
                const jsonData = dd.response.body;
                const parsedData = JSON.parse(jsonData);
                const highQueueItems = parsedData
                  .filter(item => {
                    const machineType = item.machine_type;
                    return !EXCLUDED_MACHINE_PATTERNS.some(pattern =>
                      pattern.startsWith('^') ?
                        new RegExp(pattern).test(machineType) :
                        machineType.includes(pattern)
                    ) && item.avg_queue_s > 7200;
                  })
                  .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
                if (highQueueItems.length > 0) {
                  const machineDetails = highQueueItems
                    .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
                    .join(', ');
                  const message = `High queue detected for machine types: ${machineDetails}`;
                  console.error(message);
                }
                dd.expect(highQueueItems.length > 0).to.be.false;
            EOT
+           type = "javascript"
        }

+       options_list {
+           http_version        = "any"
+           min_location_failed = 1
+           tick_every          = 900
        }

+       request_definition {
+           method = "GET"
+           url    = "https://hud.pytorch.org/api/clickhouse/queued_jobs_by_label?parameters=%7B%7D"
        }
    }

Plan: 1 to add, 1 to change, 0 to destroy.

✅ Plan applied in Tofu Apply #27
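
The pattern list in the assertion mixes two match styles: entries beginning with ^ are compiled as regular expressions, while everything else is a plain substring match. A small standalone sketch of that exclusion logic with made-up sample rows (the Datadog dd helpers are omitted, so this runs as ordinary Node.js):

    // Standalone sketch of the exclusion logic from the assertion above.
    const EXCLUDED_MACHINE_PATTERNS = ['.dgx.', '.rocm.', '.s390x', '^lf\\.', '^linux.aws.h100'];

    const isExcluded = (machineType) =>
      EXCLUDED_MACHINE_PATTERNS.some(pattern =>
        pattern.startsWith('^')
          ? new RegExp(pattern).test(machineType)   // leading '^' -> treat as regex
          : machineType.includes(pattern)           // otherwise substring match
      );

    // Hypothetical rows in the shape returned by queued_jobs_by_label.
    const sample = [
      { machine_type: 'linux.aws.h100', avg_queue_s: 15000 },   // excluded by '^linux.aws.h100'
      { machine_type: 'lf.linux.4xlarge', avg_queue_s: 9000 },  // excluded by '^lf\\.'
      { machine_type: 'linux.12xlarge', avg_queue_s: 9000 },    // not excluded, over 7200 s
    ];

    const highQueueItems = sample
      .filter(item => !isExcluded(item.machine_type) && item.avg_queue_s > 7200)
      .map(item => `${item.machine_type} (${item.avg_queue_s}s)`);

    console.log(highQueueItems); // -> [ 'linux.12xlarge (9000s)' ]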


jordanconway commented Aug 25, 2025

I think that MegaLinter grype error is either an upstream issue or a transient network thing - I retriggered it. If that doesn't work, maybe we add a config to ignore the check, since I suspect the chances of any CVEs in our Terraform are pretty low.

@zxiiro zxiiro merged commit f1b3f48 into main Aug 25, 2025
2 of 3 checks passed
@zxiiro zxiiro deleted the zxiiro/queue-alerts-meta-h100 branch August 25, 2025 15:06