Do not fail check if PyTorch HUD API is down #37
Conversation
If the PyTorch HUD is down, we are unable to check queue details. Rather than reporting a potentially incorrect alert, skip the check until the PyTorch HUD API is back in service.
Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
OpenTofu plan for prod
Plan: 0 to add, 8 to change, 0 to destroy.
OpenTofu used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
!~ update in-place
OpenTofu will perform the following actions:
# datadog_synthetics_test.pytorch-gha-runners-queue-check-amd will be updated in-place
!~ resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-amd" {
id = "yt8-7zy-xpj"
name = "GHA Runner Queue Check - AMD Runners"
tags = [
"env:project",
"project:pytorch",
"service:gha-runners",
]
# (10 unchanged attributes hidden)
!~ assertion {
!~ code = <<-EOT
if (dd.response.statusCode !== 200) {
// We do not want to fail due to hud.pytorch.org API failure.
console.log('Status code is not 200, stopping execution');
dd.expect(true).to.equal(true);
}
else {
const MACHINE_TYPE_FILTER = '.rocm.';
const jsonData = dd.response.body;
const parsedData = JSON.parse(jsonData);
const highQueueItems = parsedData
.filter(item => item.machine_type.includes(MACHINE_TYPE_FILTER) && item.avg_queue_s > 14400)
.map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
if (highQueueItems.length > 0) {
const machineDetails = highQueueItems
.map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
.join(', ');
const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
console.error(message);
}
dd.expect(highQueueItems.length > 0).to.be.false;
}
EOT
# (1 unchanged attribute hidden)
}
!~ options_list {
# (16 unchanged attributes hidden)
- monitor_options {
- notification_preset_name = "show_all" -> null
- renotify_interval = 0 -> null
- renotify_occurrences = 0 -> null
}
- retry {
- count = 0 -> null
- interval = 300 -> null
}
}
# (1 unchanged block hidden)
}
# datadog_synthetics_test.pytorch-gha-runners-queue-check-ibm will be updated in-place
!~ resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-ibm" {
id = "sc6-zip-2n9"
name = "GHA Runner Queue Check - IBM Runners"
tags = [
"env:project",
"project:pytorch",
"service:gha-runners",
]
# (10 unchanged attributes hidden)
!~ assertion {
!~ code = <<-EOT
- dd.expect(dd.response.statusCode).to.equal(200);
+ if (dd.response.statusCode !== 200) {
+ // We do not want to fail due to hud.pytorch.org API failure.
+ console.log('Status code is not 200, stopping execution');
+ dd.expect(true).to.equal(true);
+ }
+ else {
+ const MACHINE_TYPE_FILTER = '.s390x';
+ const jsonData = dd.response.body;
+ const parsedData = JSON.parse(jsonData);
- const MACHINE_TYPE_FILTER = '.s390x';
- const jsonData = dd.response.body;
- const parsedData = JSON.parse(jsonData);
+ const highQueueItems = parsedData
+ .filter(item => item.machine_type.includes(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
+ .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
- const highQueueItems = parsedData
- .filter(item => item.machine_type.includes(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
- .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
+ if (highQueueItems.length > 0) {
+ const machineDetails = highQueueItems
+ .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
+ .join(', ');
+ const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
+ console.error(message);
+ }
- if (highQueueItems.length > 0) {
- const machineDetails = highQueueItems
- .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
- .join(', ');
- const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
- console.error(message);
+ dd.expect(highQueueItems.length > 0).to.be.false;
}
-
- dd.expect(highQueueItems.length > 0).to.be.false;
EOT
# (1 unchanged attribute hidden)
}
# (2 unchanged blocks hidden)
}
# datadog_synthetics_test.pytorch-gha-runners-queue-check-intel will be updated in-place
!~ resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-intel" {
id = "67g-icy-6mh"
name = "GHA Runner Queue Check - Intel Runners"
tags = [
"env:project",
"project:pytorch",
"service:gha-runners",
]
# (10 unchanged attributes hidden)
!~ assertion {
!~ code = <<-EOT
- dd.expect(dd.response.statusCode).to.equal(200);
+ if (dd.response.statusCode !== 200) {
+ // We do not want to fail due to hud.pytorch.org API failure.
+ console.log('Status code is not 200, stopping execution');
+ dd.expect(true).to.equal(true);
+ }
+ else {
+ const MACHINE_TYPE_FILTER = '.idc.';
+ const jsonData = dd.response.body;
+ const parsedData = JSON.parse(jsonData);
- const MACHINE_TYPE_FILTER = '.idc.';
- const jsonData = dd.response.body;
- const parsedData = JSON.parse(jsonData);
+ const highQueueItems = parsedData
+ .filter(item => item.machine_type.includes(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
+ .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
- const highQueueItems = parsedData
- .filter(item => item.machine_type.includes(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
- .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
+ if (highQueueItems.length > 0) {
+ const machineDetails = highQueueItems
+ .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
+ .join(', ');
+ const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
+ console.error(message);
+ }
- if (highQueueItems.length > 0) {
- const machineDetails = highQueueItems
- .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
- .join(', ');
- const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
- console.error(message);
+ dd.expect(highQueueItems.length > 0).to.be.false;
}
-
- dd.expect(highQueueItems.length > 0).to.be.false;
EOT
# (1 unchanged attribute hidden)
}
# (2 unchanged blocks hidden)
}
# datadog_synthetics_test.pytorch-gha-runners-queue-check-lf will be updated in-place
!~ resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-lf" {
id = "p69-6vj-54b"
name = "GHA Runner Queue Check - Linux Foundation Runners"
tags = [
"env:project",
"project:pytorch",
"service:gha-runners",
]
# (10 unchanged attributes hidden)
!~ assertion {
!~ code = <<-EOT
- dd.expect(dd.response.statusCode).to.equal(200);
+ if (dd.response.statusCode !== 200) {
+ // We do not want to fail due to hud.pytorch.org API failure.
+ console.log('Status code is not 200, stopping execution');
+ dd.expect(true).to.equal(true);
+ }
+ else {
+ const MACHINE_TYPE_FILTER = 'lf.';
+ const jsonData = dd.response.body;
+ const parsedData = JSON.parse(jsonData);
- const MACHINE_TYPE_FILTER = 'lf.';
- const jsonData = dd.response.body;
- const parsedData = JSON.parse(jsonData);
+ const highQueueItems = parsedData
+ .filter(item => item.machine_type.startsWith(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
+ .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
- const highQueueItems = parsedData
- .filter(item => item.machine_type.startsWith(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
- .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
+ if (highQueueItems.length > 0) {
+ const machineDetails = highQueueItems
+ .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
+ .join(', ');
+ const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
+ console.error(message);
+ }
- if (highQueueItems.length > 0) {
- const machineDetails = highQueueItems
- .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
- .join(', ');
- const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
- console.error(message);
+ dd.expect(highQueueItems.length > 0).to.be.false;
}
-
- dd.expect(highQueueItems.length > 0).to.be.false;
EOT
# (1 unchanged attribute hidden)
}
!~ options_list {
# (16 unchanged attributes hidden)
- monitor_options {
- notification_preset_name = "show_all" -> null
- renotify_interval = 0 -> null
- renotify_occurrences = 0 -> null
}
- retry {
- count = 0 -> null
- interval = 300 -> null
}
}
# (1 unchanged block hidden)
}
# datadog_synthetics_test.pytorch-gha-runners-queue-check-meta will be updated in-place
!~ resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-meta" {
id = "nnz-icu-8qk"
name = "GHA Runner Queue Check - Meta Runners"
tags = [
"env:project",
"project:pytorch",
"service:gha-runners",
]
# (10 unchanged attributes hidden)
!~ assertion {
!~ code = <<-EOT
- dd.expect(dd.response.statusCode).to.equal(200);
- const EXCLUDED_MACHINE_PATTERNS = ['.dgx.', '.idc.', '.rocm.', '.s390x', '^lf\\.', '^linux.aws.h100'];
- const jsonData = dd.response.body;
- const parsedData = JSON.parse(jsonData);
- const highQueueItems = parsedData
- .filter(item => {
- const machineType = item.machine_type;
- return !EXCLUDED_MACHINE_PATTERNS.some(pattern =>
- pattern.startsWith('^') ?
- new RegExp(pattern).test(machineType) :
- machineType.includes(pattern)
- ) && item.avg_queue_s > 10800;
- })
- .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
- if (highQueueItems.length > 0) {
- const machineDetails = highQueueItems
- .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
- .join(', ');
- const message = `High queue detected for machine types: ${machineDetails}`;
- console.error(message);
+ if (dd.response.statusCode !== 200) {
+ // We do not want to fail due to hud.pytorch.org API failure.
+ console.log('Status code is not 200, stopping execution');
+ dd.expect(true).to.equal(true);
}
- dd.expect(highQueueItems.length > 0).to.be.false;
+ else {
+ const EXCLUDED_MACHINE_PATTERNS = ['.dgx.', '.idc.', '.rocm.', '.s390x', '^lf\\.', '^linux.aws.h100'];
+ const jsonData = dd.response.body;
+ const parsedData = JSON.parse(jsonData);
+ const highQueueItems = parsedData
+ .filter(item => {
+ const machineType = item.machine_type;
+ return !EXCLUDED_MACHINE_PATTERNS.some(pattern =>
+ pattern.startsWith('^') ?
+ new RegExp(pattern).test(machineType) :
+ machineType.includes(pattern)
+ ) && item.avg_queue_s > 10800;
+ })
+ .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
+ if (highQueueItems.length > 0) {
+ const machineDetails = highQueueItems
+ .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
+ .join(', ');
+ const message = `High queue detected for machine types: ${machineDetails}`;
+ console.error(message);
+ }
+ dd.expect(highQueueItems.length > 0).to.be.false;
+ }
EOT
# (1 unchanged attribute hidden)
}
# (2 unchanged blocks hidden)
}
# datadog_synthetics_test.pytorch-gha-runners-queue-check-meta-h100 will be updated in-place
!~ resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-meta-h100" {
id = "hpi-psi-z8i"
name = "GHA Runner Queue Check - Meta Runners - AWS H100"
tags = [
"env:project",
"project:pytorch",
"service:gha-runners",
]
# (10 unchanged attributes hidden)
!~ assertion {
!~ code = <<-EOT
- dd.expect(dd.response.statusCode).to.equal(200);
+ if (dd.response.statusCode !== 200) {
+ // We do not want to fail due to hud.pytorch.org API failure.
+ console.log('Status code is not 200, stopping execution');
+ dd.expect(true).to.equal(true);
+ }
+ else {
+ const MACHINE_TYPE_FILTER = 'linux.aws.h100';
+ const jsonData = dd.response.body;
+ const parsedData = JSON.parse(jsonData);
- const MACHINE_TYPE_FILTER = 'linux.aws.h100';
- const jsonData = dd.response.body;
- const parsedData = JSON.parse(jsonData);
+ const highQueueItems = parsedData
+ .filter(item => item.machine_type === MACHINE_TYPE_FILTER && item.avg_queue_s > 21600)
+ .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
- const highQueueItems = parsedData
- .filter(item => item.machine_type === MACHINE_TYPE_FILTER && item.avg_queue_s > 21600)
- .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
+ if (highQueueItems.length > 0) {
+ const machineDetails = highQueueItems
+ .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
+ .join(', ');
+ const message = `High queue detected for machine type ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
+ console.error(message);
+ }
- if (highQueueItems.length > 0) {
- const machineDetails = highQueueItems
- .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
- .join(', ');
- const message = `High queue detected for machine type ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
- console.error(message);
+ dd.expect(highQueueItems.length > 0).to.be.false;
}
-
- dd.expect(highQueueItems.length > 0).to.be.false;
EOT
# (1 unchanged attribute hidden)
}
# (2 unchanged blocks hidden)
}
# datadog_synthetics_test.pytorch-gha-runners-queue-check-nvidia will be updated in-place
!~ resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-nvidia" {
id = "sxd-d72-36u"
name = "GHA Runner Queue Check - Nvidia Runners"
tags = [
"env:project",
"project:pytorch",
"service:gha-runners",
]
# (10 unchanged attributes hidden)
!~ assertion {
!~ code = <<-EOT
- dd.expect(dd.response.statusCode).to.equal(200);
+ if (dd.response.statusCode !== 200) {
+ // We do not want to fail due to hud.pytorch.org API failure.
+ console.log('Status code is not 200, stopping execution');
+ dd.expect(true).to.equal(true);
+ }
+ else {
+ const MACHINE_TYPE_FILTER = '.dgx.';
+ const jsonData = dd.response.body;
+ const parsedData = JSON.parse(jsonData);
- const MACHINE_TYPE_FILTER = '.dgx.';
- const jsonData = dd.response.body;
- const parsedData = JSON.parse(jsonData);
+ const highQueueItems = parsedData
+ .filter(item => item.machine_type.includes(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
+ .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
- const highQueueItems = parsedData
- .filter(item => item.machine_type.includes(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
- .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
+ if (highQueueItems.length > 0) {
+ const machineDetails = highQueueItems
+ .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
+ .join(', ');
+ const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
+ console.error(message);
+ }
- if (highQueueItems.length > 0) {
- const machineDetails = highQueueItems
- .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
- .join(', ');
- const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
- console.error(message);
+ dd.expect(highQueueItems.length > 0).to.be.false;
}
-
- dd.expect(highQueueItems.length > 0).to.be.false;
EOT
# (1 unchanged attribute hidden)
}
# (2 unchanged blocks hidden)
}
# datadog_webhook.lf-incident-io will be updated in-place
!~ resource "datadog_webhook" "lf-incident-io" {
!~ custom_headers = (sensitive value)
!~ id = "**************" -> (known after apply)
name = "lf-incident-io"
# (3 unchanged attributes hidden)
}
Plan: 0 to add, 8 to change, 0 to destroy.
✅ Plan applied in Tofu Apply #36
Pull Request Overview
This PR implements graceful handling when the PyTorch HUD API is unavailable by preventing check failures due to API downtime. Instead of failing when the API returns a non-200 status code, the scripts now log the issue and pass the check with a dummy assertion.
- Replaces strict status code validation with conditional error handling
- Adds logging for API unavailability scenarios
- Wraps existing queue checking logic in else blocks to prevent execution during API downtime
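The queue-filtering step shared by these scripts can be sketched as a standalone function. This is a minimal sketch for illustration: `findHighQueueItems` is a hypothetical helper name, and the `dd` Synthetics context is omitted so the logic runs on plain data.

```javascript
// Hypothetical helper modeling the shared filter step: keep items whose
// machine_type contains the filter string and whose average queue time
// (in seconds) exceeds the threshold, projecting only the fields reported.
function findHighQueueItems(parsedData, machineTypeFilter, thresholdSeconds) {
  return parsedData
    .filter(item => item.machine_type.includes(machineTypeFilter)
      && item.avg_queue_s > thresholdSeconds)
    .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
}
```

In the real scripts this runs only in the `else` branch, after the status-code guard has confirmed the HUD API responded with 200.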
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| scripts/check-long-queue-s390x.js | Adds API downtime handling for s390x queue monitoring |
| scripts/check-long-queue-rocm.js | Adds API downtime handling for ROCm queue monitoring |
| scripts/check-long-queue-nvidia.js | Adds API downtime handling for NVIDIA queue monitoring |
| scripts/check-long-queue-meta.js | Adds API downtime handling for Meta queue monitoring |
| scripts/check-long-queue-meta-h100.js | Adds API downtime handling for Meta H100 queue monitoring |
| scripts/check-long-queue-lf.js | Adds API downtime handling for LF queue monitoring |
| scripts/check-long-queue-intel.js | Adds API downtime handling for Intel queue monitoring |
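The Meta check excludes machine types already covered by the other checks. Its matcher treats patterns beginning with `^` as regular expressions and everything else as substrings; a standalone sketch (with a hypothetical `isExcluded` helper name):

```javascript
// Hypothetical helper modeling the exclusion matcher from the Meta check:
// '^'-prefixed entries are compiled as regexes, others matched as substrings.
function isExcluded(machineType, patterns) {
  return patterns.some(pattern =>
    pattern.startsWith('^')
      ? new RegExp(pattern).test(machineType)
      : machineType.includes(pattern)
  );
}
```

This lets one list mix anchored prefixes like `'^lf\\.'` with loose substrings like `'.rocm.'` without escaping every entry.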