
Conversation

@shoumikhin (Contributor) commented May 9, 2025

Treat 403, 429 and 503 http errors as success.
Ignore non-verbal hostnames.
Kill child jobs immediately.


pytorch-bot (bot) commented May 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153246

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 1dca413 with merge base 916f6ba:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label May 9, 2025
@shoumikhin shoumikhin requested review from albanD and malfet May 9, 2025 15:01
@albanD (Collaborator) commented May 9, 2025

Not sure what the logic behind this change is.
Why this rather than only considering 404 a failure?
tbh I don't know my HTTP error codes by heart, so I'm not sure how many other variants we want to consider failures.

@shoumikhin (Contributor, Author) commented May 9, 2025

@albanD other things we likely want to catch:

410 - the resource used to exist but is permanently removed
401 - credentials are needed to see it: likely some private link leaked
400 - the request URL itself was malformed
5xx - could be a genuine outage or mis-deployment, worth noticing

So we're trying to be on the safe side and ignore only a few codes related to "too many requests" or Cloudflare and similar challenges.
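For illustration, the status-code policy described here could be sketched as follows. This is a hypothetical helper, not the actual lint_urls.sh logic; the names are assumptions.

```python
# Hypothetical sketch of the status-code policy discussed above;
# names and structure are illustrative, not the real lint_urls.sh code.

# Codes forgiven by this PR: rate limiting and Cloudflare-style challenges.
SOFT_FAIL_CODES = {403, 429, 503}

def should_flag(status: int) -> bool:
    """Return True if an HTTP status should be reported as a broken link."""
    if 200 <= status < 400:
        return False  # genuine success or redirect
    # 400/401/404/410 and most 5xx remain failures; 403/429/503 are forgiven
    return status not in SOFT_FAIL_CODES
```

Under this policy a rate-limited 429 passes the lint, while a 404 or 410 still fails it.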

@albanD (Collaborator) left a comment

Ok, what's your plan to monitor the flakiness of this to know when it's ok to re-enable?

@shoumikhin (Contributor, Author) commented May 9, 2025

The plan is to turn it back on once it proves to be green on nightly, and in the logs on every PR, for a long period of time (except those PRs that actually do break URLs).

@malfet (Contributor) left a comment

Sure, but is it possible to replace the shell script with Python (`grep -Ev` is a GNU extension)? As it is, one would not be able to run this on their Mac or Windows machine to reproduce a linter failure.

@shoumikhin (Contributor, Author) commented May 9, 2025

@malfet overall, yes, we can turn it into a Python script in the future to support Windows (for users w/o Cygwin, etc.). It should already work on macOS, feel free to try ./scripts/lint_urls.sh.
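A portable stand-in for the GNU-flavored `grep -Ev` step could look like this in Python; the pattern and function name are purely illustrative, not taken from the script.

```python
import re

def grep_invert(pattern: str, lines):
    """Portable equivalent of `grep -Ev PATTERN`: keep lines NOT matching."""
    rx = re.compile(pattern)
    return [line for line in lines if not rx.search(line)]

# Example: drop URLs pointing at hosts we don't want to lint
urls = [
    "https://pytorch.org",
    "http://localhost:8080",
    "https://example.com/x",
]
kept = grep_invert(r"localhost|example\.com", urls)
```

Python's `re` module behaves the same on Linux, macOS, and Windows, which sidesteps the GNU-vs-BSD grep divergence entirely.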

@malfet (Contributor) commented May 9, 2025

The plan is to turn it back on once it proves to be green on nightly and in logs on every PR for a long period of time (except those that do actually break urls).

This does not sound like a good plan to me, because any system that relies on all the websites on the internet always being available is flaky. There should be a periodic process that crawls the URLs and updates their availability in a DB of some sort, which could have something like an availability_rating: a number between 0 (unavailable) and 10 (always available). Every time the periodic job runs, it increases or decreases the rating by one (clamped to the 0-10 range).
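The clamped-score update described here could be as simple as the sketch below; the function name and the dict standing in for the "DB of some sort" are assumptions for illustration.

```python
def update_rating(rating: int, reachable: bool) -> int:
    """Nudge a URL's availability rating by one, clamped to the 0-10 range."""
    return max(0, min(10, rating + (1 if reachable else -1)))

# A plain dict standing in for the hypothetical availability DB;
# the reachability check is simulated here rather than performed over the network.
db = {"https://pytorch.org": 10, "https://dead.example": 1}
reachable = {"https://pytorch.org": True, "https://dead.example": False}
db = {url: update_rating(score, reachable[url]) for url, score in db.items()}
```

A nightly job would then flag only URLs whose rating has decayed to (or near) zero, rather than failing on any single transient outage.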

@shoumikhin (Contributor, Author) commented May 9, 2025

@malfet if I got that right, you're proposing to remember for each URL whether it worked during past checks, and apply some heuristic to either forgive or flag a failure now, depending on its previous status?
That sounds robust, but also a bit over-engineered to me, at least for this specific task.

Let's step back for a moment.

What problem are we trying to solve?
Maintaining URLs to keep them valid.

How can we do that?
Check whether they are alive periodically, and at the time they are introduced or modified.

Can URLs get temporarily broken not due to our fault?
Of course. And they will.

What do we do if we find a broken link with the nightly job?
Triage and address it manually, e.g. check the logs and, as an option, look up whether it failed in previous jobs too, then make a decision on how healthy it is. The periodic job should be FYI-only and should not be treated as a failure (maybe always return success, if there's no concept of an FYI job?). It can also come in handy as a sanity check, especially before the next release.

Can we make the nightly job always pass unless some URLs have been failing consistently for a long time?
Yes, by keeping a log of the health score per URL. Although, is it worth the effort? Not sure.

What do we do if we find a broken URL with a PR job?
The PR job checks only the modified/added code. In the rare case someone changes a line containing a URL, with or without actually touching the link itself, and it turns out to be broken, it's still a good hint to either fix it along the way, or explicitly add the lint-ignore comment/label to the PR to flag it and move on.

Anyhow, this particular PR is to improve the script and let it skip some of the expected HTTP errors. It doesn't change when the script is run.

@shoumikhin shoumikhin mentioned this pull request May 9, 2025
pytorchmergebot pushed a commit that referenced this pull request May 14, 2025
Or ignore them.
Found by running the lint_urls.sh script locally with #153246

Pull Request resolved: #153277
Approved by: https://github.com/malfet
@shoumikhin (Contributor, Author) commented

@pytorchbot merge -f "lint only changes"

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Kkewrr34-kk referenced this pull request May 16, 2025
Summary:
replace read_config with select

For more info, please refer to the [doc](https://docs.google.com/document/d/1e0Hvht8WEHhcRvlCAodq_R9xnAtKBrAhdyvxcAqQjCw/edit?tab=t.hl8j18gza0cv)

Test Plan: CI

Reviewed By: malfet

Differential Revision: D70267850

Pull Request resolved: #148925
Approved by: https://github.com/malfet
@github-actions github-actions bot deleted the shoumikhin-patch-10 branch June 18, 2025 02:19