
@zxiiro zxiiro commented Aug 25, 2025

The AWS H100 check regularly flaps because H100 runners are a limited resource: jobs routinely queue for 3-4 hrs and occasionally as long as 5-6 hrs. Split the check out into its own synthetic test and set its check time to 4 hrs so that it flaps less often.

Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
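
For reference, a 4 hr check time presumably corresponds to a 14400-second avg_queue_s threshold in the assertion style used by these tests. A minimal, hypothetical sketch of an H100-only filter with that threshold (illustrative only, not the code in the plan below; the sample rows are made up):

    // Hypothetical H100-only variant of the queue assertion (not the applied code).
    // Flags only linux.aws.h100 runners queuing longer than 4 hours (14400 s).
    const H100_PATTERN = /^linux\.aws\.h100/;
    const parsedData = [
      { machine_type: 'linux.aws.h100', avg_queue_s: 15000 },  // over 4 hrs -> flagged
      { machine_type: 'linux.aws.h100', avg_queue_s: 9000 },   // 2.5 hrs    -> ignored
    ];
    const highQueueItems = parsedData.filter(item =>
      H100_PATTERN.test(item.machine_type) && item.avg_queue_s > 14400
    );
    console.log(highQueueItems.length); // -> 1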
@zxiiro zxiiro requested a review from Copilot August 25, 2025 14:24
@zxiiro zxiiro requested a review from a team as a code owner August 25, 2025 14:24

github-actions bot commented Aug 25, 2025

OpenTofu plan for prod

Plan: 1 to add, 1 to change, 0 to destroy.
OpenTofu used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+   create
!~  update in-place

OpenTofu will perform the following actions:

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-meta will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-meta" {
        id               = "nnz-icu-8qk"
        name             = "GHA Runner Queue Check - Meta Runners"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      assertion {
!~          code = <<-EOT
                dd.expect(dd.response.statusCode).to.equal(200);
              - const EXCLUDED_MACHINE_PATTERNS = ['.dgx.', '.rocm.', '.s390x', '^lf\\.'];
              + const EXCLUDED_MACHINE_PATTERNS = ['.dgx.', '.rocm.', '.s390x', '^lf\\.', '^linux.aws.h100'];
                const jsonData = dd.response.body;
                const parsedData = JSON.parse(jsonData);
                const highQueueItems = parsedData
                  .filter(item => {
                    const machineType = item.machine_type;
                    return !EXCLUDED_MACHINE_PATTERNS.some(pattern =>
                      pattern.startsWith('^') ?
                        new RegExp(pattern).test(machineType) :
                        machineType.includes(pattern)
                    ) && item.avg_queue_s > 7200;
                  })
                  .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
                if (highQueueItems.length > 0) {
                  const machineDetails = highQueueItems
                    .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
                    .join(', ');
                  const message = `High queue detected for machine types: ${machineDetails}`;
                  console.error(message);
                }
                dd.expect(highQueueItems.length > 0).to.be.false;
            EOT
#            (1 unchanged attribute hidden)
        }

#        (2 unchanged blocks hidden)
    }

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-meta-h100 will be created
+   resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-meta-h100" {
+       id         = (known after apply)
+       locations  = [
+           "aws:us-west-2",
        ]
+       message    = <<-EOT
            Detected GitHub Runner Queue - Meta Runners - AWS H100 has jobs waiting
            unusually long for runners.
            
            {{synthetics.attributes.result.failure.message}}
            
            Check https://hud.pytorch.org/metrics for more details.
            
            @slack-pytorch-infra-alerts
        EOT
+       monitor_id = (known after apply)
+       name       = "GHA Runner Queue Check - Meta Runners - AWS H100"
+       status     = "live"
+       tags       = [
+           "env:project",
+           "project:pytorch",
+           "service:gha-runners",
        ]
+       type       = "api"

+       assertion {
+           code = <<-EOT
                dd.expect(dd.response.statusCode).to.equal(200);
                const EXCLUDED_MACHINE_PATTERNS = ['.dgx.', '.rocm.', '.s390x', '^lf\\.', '^linux.aws.h100'];
                const jsonData = dd.response.body;
                const parsedData = JSON.parse(jsonData);
                const highQueueItems = parsedData
                  .filter(item => {
                    const machineType = item.machine_type;
                    return !EXCLUDED_MACHINE_PATTERNS.some(pattern =>
                      pattern.startsWith('^') ?
                        new RegExp(pattern).test(machineType) :
                        machineType.includes(pattern)
                    ) && item.avg_queue_s > 7200;
                  })
                  .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
                if (highQueueItems.length > 0) {
                  const machineDetails = highQueueItems
                    .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
                    .join(', ');
                  const message = `High queue detected for machine types: ${machineDetails}`;
                  console.error(message);
                }
                dd.expect(highQueueItems.length > 0).to.be.false;
            EOT
+           type = "javascript"
        }

+       options_list {
+           http_version        = "any"
+           min_location_failed = 1
+           tick_every          = 900
        }

+       request_definition {
+           method = "GET"
+           url    = "https://hud.pytorch.org/api/clickhouse/queued_jobs_by_label?parameters=%7B%7D"
        }
    }

Plan: 1 to add, 1 to change, 0 to destroy.

✅ Plan applied in Tofu Apply #27
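
The pattern list in the assertion mixes two match styles: entries beginning with ^ are compiled as regular expressions, while everything else is a plain substring match. A small standalone sketch of that exclusion logic with made-up sample rows (the Datadog dd helpers are omitted, so this runs as ordinary Node.js):

    // Standalone sketch of the exclusion logic from the assertion above.
    const EXCLUDED_MACHINE_PATTERNS = ['.dgx.', '.rocm.', '.s390x', '^lf\\.', '^linux.aws.h100'];

    const isExcluded = (machineType) =>
      EXCLUDED_MACHINE_PATTERNS.some(pattern =>
        pattern.startsWith('^')
          ? new RegExp(pattern).test(machineType)   // leading '^' -> treat as regex
          : machineType.includes(pattern)           // otherwise substring match
      );

    // Hypothetical rows in the shape returned by queued_jobs_by_label.
    const sample = [
      { machine_type: 'linux.aws.h100', avg_queue_s: 15000 },   // excluded by '^linux.aws.h100'
      { machine_type: 'lf.linux.4xlarge', avg_queue_s: 9000 },  // excluded by '^lf\\.'
      { machine_type: 'linux.12xlarge', avg_queue_s: 9000 },    // not excluded, over 7200 s
    ];

    const highQueueItems = sample
      .filter(item => !isExcluded(item.machine_type) && item.avg_queue_s > 7200)
      .map(item => `${item.machine_type} (${item.avg_queue_s}s)`);

    console.log(highQueueItems); // -> [ 'linux.12xlarge (9000s)' ]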


jordanconway commented Aug 25, 2025

I think that MegaLinter grype error is either an upstream issue or a transient network thing - I retriggered it. If that doesn't work, maybe we add a config to ignore the check, since I suspect the chances of any CVEs in our Terraform are pretty low.

@zxiiro zxiiro merged commit f1b3f48 into main Aug 25, 2025
2 of 3 checks passed
@zxiiro zxiiro deleted the zxiiro/queue-alerts-meta-h100 branch August 25, 2025 15:06