Trilinos auto PR tester stability issues #3276

Closed
bartlettroscoe opened this issue Aug 10, 2018 · 328 comments
Labels
CLOSED_DUE_TO_INACTIVITY (closed by the GitHub Actions bot due to inactivity), Framework tasks (used internally by Framework team), MARKED_FOR_CLOSURE (marked for auto-closure by the GitHub Actions bot)

Comments

@bartlettroscoe
Member

bartlettroscoe commented Aug 10, 2018

@trilinos/framework

Description

Over the last few weeks and months, the Trilinos auto PR tester has seen several cases where one or more PR builds for a given PR testing iteration failed to produce results on CDash or showed build or test failures that were not related to the changes on that particular PR.

This story is to log these failures and keep track of them in order to provide some statistics that can inform how to address them. This should replace making comments in the individual PRs that exhibit these types of problems, like #3260 and #3213.

PR Builds Showing Random Failures

Below are a few examples of the stability problems (not an exhaustive list).

| PR ID | Num PR Builds to reach passing | First test trigger | Start first test | Passing test | Merge PR |
|-------|-------------------------------|--------------------|------------------|--------------|----------|
| #3258 | 2 | 8/8/2018 2:35 PM ET | 8/8/2018 2:44 PM | 8/8/2018 9:15 PM ET | Not merged |
| #3260 | 4 | 8/8/2018 5:22 PM ET | 8/8/2018 6:31 PM ET | 8/10/2018 4:13 AM ET | 8/10/2018 8:25 AM |
| #3213 | 3 | 7/31/2018 4:30 PM ET | 7/31/2018 4:57 PM ET | 8/1/2018 9:48 AM ET | 8/1/2018 9:53 AM ET |
| #3098 | 4 | 7/12/2018 12:52 PM ET | 7/12/2018 1:07 PM ET | 7/13/2018 11:12 PM ET | 7/14/2018 10:59 PM ET |
| #3369 | 6 | 8/29/2018 9:08 AM ET | 8/29/2018 9:16 AM ET | 8/31/2018 6:09 AM ET | 8/31/2018 8:33 AM ET |
@bartlettroscoe added the Framework tasks label Aug 10, 2018
@bartlettroscoe
Member Author

bartlettroscoe commented Aug 10, 2018

Over the past few months, many of the ATDM Trilinos build-script update PRs have experienced failed PR builds that had nothing to do with the changes on the PR branch. Below I list the number of PR testing iterations it took before all of the PR builds for a single iteration passed (allowing the merge), along with the time when the last commit was pushed or the PR was created (which should trigger a PR build), when the first test started, when the final passing PR build finished, and when the PR was merged:

| PR ID | Num PR Builds to reach Passing | Num (False) failed PR Builds | First test trigger | Start first test | Passing test | Merge PR |
|-------|-------------------------------|------------------------------|--------------------|------------------|--------------|----------|
| #3260 | 4 | 3 | 8/8/2018 5:22 PM ET | 8/8/2018 6:31 PM ET | 8/10/2018 4:13 AM ET | 8/10/2018 8:25 AM |
| #3213 | 3 | 2 | 7/31/2018 4:30 PM ET | 7/31/2018 4:57 PM ET | 8/1/2018 9:48 AM ET | 8/1/2018 9:53 AM ET |
| #3098 | 4 | 3 | 7/12/2018 12:52 PM ET | 7/12/2018 1:07 PM ET | 7/13/2018 11:12 PM ET | 7/14/2018 10:59 PM ET |

@bartlettroscoe
Member Author

I think the ATDM Trilinos PRs tend to see more of these types of failures because the current PR testing system triggers the build and testing of every Trilinos package for any change to any file under the cmake/std/atdm/ directory. That will be resolved once #3133 is resolved (and PR #3258 is merged). But other PRs that trigger the enable of many Trilinos packages could still see these issues.

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 10, 2018

We are seeing similar problems over the last few days with PR #3258, with the PR builds shown here. The trend is that the Intel 17.0.1 PR builds all pass just fine. The problem comes with the GCC 4.8.4 and GCC 4.9.3 builds, and all of the failures for these builds occur only on the Jenkins node 'ascic142'. The GCC PR builds (including the GCC 4.8.4 and 4.9.3 builds) that run on the other nodes 'ascic143', 'ascic157', and 'ascic158' all pass. That suggests there is something different about the node 'ascic142' that is causing these builds to fail that is not occurring on the other nodes.

Something similar occurred with PR #3260, with PR build results shown here. In that case, 3 of the 4 failing PR builds were on 'ascic142', and those included build failures with empty build error messages. The other failing PR build was on 'ascic158', and that one was two timed-out tests.

All of this suggests:

  1. There may be something wrong with 'ascic142' or at least something different from the other build nodes that may be causing more failures.
  2. The machines may be getting loaded too heavily, which is causing builds to crash and tests to time out.

@trilinos/framework, can someone look into these issues? This problem, as it impacts the ATDM work, will mostly go away once PR #3258 is merged, but getting that merged requires the PR builds to pass, which they are having trouble doing.

Below is the data for the first round of failures being seen in PR #3258.

| PR ID | Num PR Builds to reach Passing | Num (False) failed PR Builds | First test trigger | Start first test | Passing test | Merge PR |
|-------|-------------------------------|------------------------------|--------------------|------------------|--------------|----------|
| #3258 | 2 | 1 | 8/8/2018 2:35 PM ET | 8/8/2018 2:44 PM | 8/8/2018 9:15 PM ET | Not merged |

And after a push of commits last night, the first PR testing iteration failed for PR #3258 as well, so a new cycle has started.

@bartlettroscoe
Member Author

And node 'ascic142' strikes again, this time killing the PR testing iteration for the new PR #3278 shown here. The results for the failing build Trilinos_pullrequest_gcc_4.9.3 are not showing up on CDash here. The Jenkins output for this failing build (shown here) shows that it ran on 'ascic142' and produced the output:

[Trilinos_pullrequest_gcc_4.9.3] $ /usr/bin/env bash /tmp/jenkins4565491669098842090.sh
trilinos
/usr/bin/env
/bin/env
/bin/env
/usr/bin/env
changed-files.txt
gitchanges.txt
packageEnables.cmake
pull_request_test
TFW_single_configure_support_scripts
TFW_single_configure_support_scripts@tmp
TFW_testing_single_configure_prototype
TFW_testing_single_configure_prototype@tmp
TribitsDumpDepsXmlScript.log
Trilinos
TrilinosPackageDependencies.xml
git remote exists, removing it.
error: RPC failed; curl 56 Proxy CONNECT aborted
fatal: The remote end hung up unexpectedly
Source remote fetch failed. The error code was: 128
Build step 'Execute shell' marked build as failure
Finished: FAILURE

Not sure what "error: RPC failed; curl 56 Proxy CONNECT aborted" means, but it seems to have killed the git fetch.

@trilinos/framework, could we consider a retry and wait loop for these git communication operations?
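
A minimal sketch of what such a retry-and-wait wrapper around the git fetch could look like (the retry count, sleep time, and the fetch_with_retries name are illustrative placeholders, not anything that exists in the current PR driver scripts):

```bash
#!/usr/bin/env bash
# Hypothetical retry wrapper for flaky git network operations.
# MAX_TRIES and SLEEP_SECONDS are illustrative values only.
MAX_TRIES=5
SLEEP_SECONDS=60

fetch_with_retries() {
  local tries=0
  # Keep retrying the fetch until it succeeds or we run out of attempts.
  until git fetch --tags --progress "$@"; do
    tries=$((tries + 1))
    if [ "${tries}" -ge "${MAX_TRIES}" ]; then
      echo "git fetch failed after ${MAX_TRIES} attempts" >&2
      return 1
    fi
    echo "git fetch failed (attempt ${tries}); retrying in ${SLEEP_SECONDS}s ..." >&2
    sleep "${SLEEP_SECONDS}"
  done
}

# Example usage (placeholder refspec quoted to avoid shell globbing):
# fetch_with_retries https://github.com/trilinos/Trilinos '+refs/heads/*:refs/remotes/origin/*'
```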

@bartlettroscoe
Member Author

FYI: The latest PR testing iteration for #3225 shown here failed due to bogus build failures on 'ascic142' as shown here.

@jwillenbring
Member

@bartlettroscoe Thank you for this information. I started to investigate this further by putting more than half of the executors on ascic142 to sleep for 10 days (we can always kill the job if we need to). This means that only one PR testing job can run at a time on that node for the next 10 days. The small, single-executor jobs can run alongside one PR testing job for now. We could clamp that down too if the issues persist. If we do that (allow only one PR job to run on the node and nothing else) and the failures persist, I think we need to consider taking the node down or asking someone to look into the issues. ascic142 runs a lot of jobs for us. It is possible that ascic143, for example, typically does not have two jobs running at the same time, but if it did, it would fail more often too. We'll see what happens.

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 16, 2018

@jwillenbring, thanks for taking these steps. It is hard to monitor the stability of the PR testing process just by looking at CDash, since we expect failures in some PRs due to code changes, and if results don't show up at all, we don't see them.

Is it possible to log cases where PR builds don't successfully submit results to CDash or fail to provide comments to GitHub for some reason? This might just be a global log file that the PR testing Jenkins jobs write to whenever an error is detected. For that matter, it would be good to also log every successful PR run (which just means nothing crashed and no communication failed). This type of data would be useful from a research perspective on the stability of the CI testing processes, and it would provide a clear metric to see whether changes to the PR process are improving stability or not.
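
As a rough illustration, such a log entry could be as simple as one appended line per PR build outcome (the log path and the PR_ID/BUILD_NAME/STATUS variables here are hypothetical, not existing pieces of the PR driver):

```bash
# Hypothetical: append one status line per PR build to a shared log file.
echo "$(date -u +%FT%TZ) pr=${PR_ID:-unknown} build=${BUILD_NAME:-unknown} status=${STATUS:-unknown}" \
  >> /some/shared/location/trilinos-pr-tester-status.log
```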

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 16, 2018

@trilinos/framework,

More stability problems. Even after the merge of #3258 and the completion of #3133 (still in review but otherwise complete), so that no packages should be built or tested for changes to the ATDM build scripts, we still got a failed PR iteration, as shown for PR #3309 here. The new Jenkins output in that comment showed that the build Trilinos_pullrequest_gcc_4.8.4 failed due to a git fetch error:

 > git fetch --tags --progress git@gitlab-ex.sandia.gov:trilinos-project/TFW_single_configure_support_scripts.git +refs/heads/*:refs/remotes/origin/*
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --progress git@gitlab-ex.sandia.gov:trilinos-project/TFW_single_configure_support_scripts.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: ssh: connect to host gitlab-ex.sandia.gov port 22: Connection timed out
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.

What is the timeout for that git fetch? None is listed. Can we set a timeout of 30 minutes or so?

Also, do all of the git fetches occur at the same time for all three PR builds? If so, you might avoid these problems by staggering the start of the PR builds by one minute each. That will add little time to the overall PR time but may make the git fetch more robust.

If Jenkins continues to be this fragile for git operations, you might want to do the git operations in your own driver scripts and put in a retry loop. That is how CTest gets robust git fetches, I think.
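
A trivial sketch of the staggering idea mentioned above (the BUILD_INDEX variable and one-minute spacing are illustrative only):

```bash
# Hypothetical: delay each PR build's start by its index (0, 1, 2) minutes so
# that the initial git fetches of the three builds do not hit the proxy at once.
sleep $(( ${BUILD_INDEX:-0} * 60 ))
```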

@bartlettroscoe
Member Author

Another data-point ...

The PR #3312 iteration #3312 (comment) showed the GitHub fetch failure:

 > git fetch --tags --progress https://github.com/trilinos/Trilinos +refs/heads/*:refs/remotes/origin/* # timeout=20
ERROR: Timeout after 20 minutes
ERROR: Error fetching remote repo 'origin'

Jenkins needs to be made more robust with git fetches, with loops of retries, or someone needs to write scripts that do this manually with loops (that is what ctest -S does to robustly submit to CDash). I can almost guarantee that is what Travis CI and other mature CI systems do to robustly run against GitHub and other external sites.

@jwillenbring
Member

@bartlettroscoe

Jenkins needs to be made more robust with git fetches, with loops of retries, or someone needs to write scripts that do this manually with loops

@william76 is looking into this with the pipeline capability and @allevin with the autotester. I cannot figure out why Jenkins does so many clones instead of updates once it has a repo. Aaron said it is not like this for SST.

@jwillenbring
Member

@bartlettroscoe

Another data-point ...

The PR #3312 iteration #3312 (comment) showed the GitHub fetch failure:

This communication failure happened on ascic157, so ascic142 is not the only node having issues with communication.

@jwillenbring
Member

What is the timeout for that git fetch? None is listed. Can we set a timeout of 30 minutes or so?

By default, 10 minutes. We upped it to 20 and there seemed to be little effect.

Also, do all of the git fetches occur at the same time for all three PR builds? If so, you might avoid these problems by staggering the start of the PR builds by one minute each. That will add little time to the overall PR time but may make the git fetch more robust.

The individual PR testing builds get put in the queue and get assigned to a node. Sometimes this happens almost instantly, sometimes it takes a while.

@bartlettroscoe
Member Author

@jwillenbring, I think if you do a 'git fetch' and there is a network issue right then, it will just return with an error. I think what we need is a loop with waits and retries.

Also, could you turn on the Jenkins project option "Add timestamps to the Console Output"? That might help us see if these commands are timing out or are crashing before the timeout.

@bartlettroscoe
Member Author

@jwillenbring said:

This communication failure happened on ascic157, so ascic142 is not the only node having issues with communication.

I think most of the failures on 'ascic142' that are reported above are build or test failures after the clones/updates are successful. The problem on 'ascic142' is not communication, it is overloading (or something related).

@bartlettroscoe
Member Author

And another git fetch failure in #3316 (comment), showing:

 > git fetch --tags --progress git@gitlab-ex.sandia.gov:trilinos-project/TFW_single_configure_support_scripts.git +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'

Note that this is not the Trilinos repo but the tiny TFW_single_configure_support_scripts.git repo. That can't be a timeout. Why is Jenkins so non-robust with git?

@bartlettroscoe
Member Author

NOTE: The ctest -S script support built into CTest, which does clones and updates, runs on the exact same machines and networks as this PR testing system, and I don't think we see even a small fraction of this number of git fetch failures. For example, just look at the 'Clean' builds that have run on the same ascic-jenkins build-farm machines over the last 5.5 months in this query. Out of those 475 builds, we see one build with a git update failure (on 6/14/2018). That is far more robust than what we are seeing from the PR tester. Therefore, one has to conclude that the problem is not the machines or the network. The problem must be the software doing the updating (which does not have robustness built into it).

@mhoemmen
Contributor

In PR tests, the build is broken and tests are failing in ways that have nothing to do with the PR in question. See this example:

#3359 (comment)

Sometimes the PR tests fail because of lack of communication with some server, but now they are failing because of MueLu build errors and Tempus test failures. The latter may pass or fail intermittently, but it’s not clear to me how those build errors could have gotten through PR testing.

After discussion with Jim Willenbring, it looks like these build errors may come from test machines being overloaded. Ditto for Tempus failures perhaps, though I haven't investigated those in depth.

@bartlettroscoe
Member Author

FYI: It looks like the problems with the PR tester are not just random crashes. It looks like it can also retest a PR branch even after PR testing passed and there was no trigger to force a new set of PR testing builds. See #3356 (comment) for an example of this.

@bartlettroscoe
Member Author

FYI: My last PR #3369 required 6 auto PR testing iterations before it allowed the merge, over two days after the start.

@bartlettroscoe
Member Author

FYI: Stability problems continue #3455 (comment). That one PR iteration had a git clone error for one build and a CDash submit failure for another build.

@mhoemmen
Contributor

See also
#3439 (comment)

@bartlettroscoe
Member Author

@trilinos/framework,

FYI: More stability issues with the auto PR tester.

In #3546 (comment) you see the Trilinos_pullrequest_intel_17.0.1 build failing due to a git pull failure:

fatal: unable to access 'https://github.com/hkthorn/Trilinos/': Proxy CONNECT aborted
Source remote fetch failed. The error code was: 128

and in the Trilinos_pullrequest_gcc_4.8.4 build you see a problem submitting results to CDash, showing:

Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.8.4/pull_request_test/Testing/20181002-1822/Configure.xml
Error message was: Failed to connect to testing-vm.sandia.gov port 80: Connection timed out
Problems when submitting via HTTP
configure submit error = -1
CMake Error at /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.8.4/TFW_testing_single_configure_prototype/simple_testing.cmake:172 (message):
Configure failed with error -1

And in another PR testing iteration shown in #3549 (comment), you see the Trilinos_pullrequest_gcc_4.8.4 build also failing due to an inability to submit to CDash, showing:

Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.8.4/pull_request_test/Testing/20181002-1915/Configure.xml
Error message was: Failed to connect to testing-vm.sandia.gov port 80: Connection timed out
Problems when submitting via HTTP
configure submit error = -1
CMake Error at /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.8.4/TFW_testing_single_configure_prototype/simple_testing.cmake:172 (message):
Configure failed with error -1

And actually, we are also seeing configure failures for the Trilinos_pullrequest_gcc_4.8.4 build in these two unrelated PR iterations, showing:

Starting configure step.
Each . represents 1024 bytes of output
................. Size of output: 16K
Error(s) when configuring the project

It is impossible for those PR branches to trigger a configure failure because the PR builds don't currently use the ATDM configuration scripts (unless something has changed, but I don't think so).

What is going on with the auto tester with the Trilinos_pullrequest_gcc_4.8.4 build?

@bartlettroscoe
Member Author

CC: @trilinos/framework

And here is a new one: #3546 (comment)

This time the Trilinos_pullrequest_gcc_4.9.3 build crashed due to:

Checking out Revision bb78697920292c58562e47fbae13843a79c29e55 (refs/remotes/origin/develop)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f bb78697920292c58562e47fbae13843a79c29e55
hudson.plugins.git.GitException: Command "git checkout -f bb78697920292c58562e47fbae13843a79c29e55" returned status code 128:
stdout: 
stderr: fatal: Unable to create '/scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.9.3/Trilinos/.git/index.lock': File exists.
If no other git process is currently running, this probably means a
git process crashed in this repository earlier. Make sure no other git
process is running and remove the file manually to continue.

Does this mean that the local git repo is corrupted? Are two PR jobs running on top of each other?
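
If this is just a stale lock file left behind by a crashed git process, the usual manual fix (per the error message above) is to verify that no git process is still running in that workspace and then remove the file; the path below is copied from the error output:

```bash
# Only after confirming no other git process is running in this workspace:
rm -f /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.9.3/Trilinos/.git/index.lock
```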

@bartlettroscoe
Member Author

please check for existing tickets and open new tickets reporting your questions and concerns there.

@e10harvey, we are not able to view existing Trilinos HelpDesk tickets, are we?

@e10harvey
Contributor

please check for existing tickets and open new tickets reporting your questions and concerns there.

@e10harvey, we are not able to view existing Trilinos HelpDesk tickets, are we?

@wadeburgess, @ccober6: Is that correct?

@mayrmt
Member

mayrmt commented Jun 13, 2022

As of today, framework will only be monitoring trilinos-help for these types of issues; please check for existing tickets and open new tickets reporting your questions and concerns there.

What about external contributors without access to these systems?


@bartlettroscoe
Member Author

bartlettroscoe commented Jun 13, 2022

What about external contributors without access to these systems?

@mayrmt, external contributors can no longer see PR build and test failures either because these are now being reported to an internal SNL CDash site https://trilinos-cdash.sandia.gov.

This change was announced to internal SNL email lists on 6/7/2022 by @e10harvey, but I am not sure any external Trilinos developers or users are on those internal SNL email lists. I am not sure how this was announced to external SNL stakeholders. But if you just examine the updated PR testing comments like in #10614 (comment), the move of results to an internal SNL CDash site and the removal of the outer Jenkins driver STDOUT (where you can see infrastructure failures) from the GitHub PR comments are obvious (so I am not giving away any secrets here).

@jhux2
Member

jhux2 commented Jun 13, 2022

please check for existing tickets and open new tickets reporting your questions and concerns there.

@e10harvey, we are not able to view existing Trilinos HelpDesk tickets, are we?

@bartlettroscoe That's correct, you can only see your own tickets, not anyone else's. That's why I tend to open a GitHub issue and reference the HelpDesk ticket.

@bartlettroscoe
Member Author

bartlettroscoe commented Jun 13, 2022

FYI: It also seems that STK subpackages are being unconditionally enabled in the CUDA PR build on vortex. I reported this in TRILINOSHD-100. This is resulting in far more code and tests being built and run than is needed to test some PRs. (For example, all of the STK, TrilinosCouplings, and Panzer tests are getting run when just the Krino package is changed, like in PR #10613. That is a waste of computing resources, since changes to Krino cannot possibly break STK and its dependent packages.) None of the other PR builds are doing this.

@bartlettroscoe
Member Author

bartlettroscoe commented Jun 13, 2022

Also, note that Trilinos now uses Git submodules, and if you follow the wiki instructions to access the new GenConfig implementation, it puts your Trilinos repo in a modified state:

$ git status
On branch develop
Your branch is up to date with 'github/develop'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)
        modified:   packages/framework/GenConfig (modified content)
        modified:   packages/framework/son-ini-files (new commits)
        modified:   packages/framework/srn-ini-files (new commits)

no changes added to commit (use "git add" and/or "git commit -a")

It is not clear how to deal with this modified state in my Trilinos Git repo.

I have reported this in TRILINOSHD-101. (The problem with the changes in the base Trilinos repo appears to be caused by the usage of the --remote option with the git submodule --init --remote <remote-name> command being used in the packages/framework/get_dependencies.sh script.)

Part of the problem with the local modifications is that locally generated symlinked directories were put under Git version control. I reported this in TRILINOSHD-102.

UPDATE: I documented how to revert these local submodule changes in the new section Revert local changes to the GenConfig submodules.
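
For reference, a minimal sketch of reverting this kind of submodule drift (the submodule paths come from the git status output above; the exact steps documented in the wiki section mentioned in the UPDATE may differ):

```bash
# Re-sync the 'new commits' submodules back to the SHA1s recorded by the
# Trilinos superproject.
git submodule update --init packages/framework/son-ini-files packages/framework/srn-ini-files

# Discard the 'modified content' inside the GenConfig submodule itself
# (this also removes untracked files such as locally generated symlinks).
git -C packages/framework/GenConfig checkout -- .
git -C packages/framework/GenConfig clean -fd
```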

@jhux2
Member

jhux2 commented Jun 13, 2022

Also, note that Trilinos now uses Git submodules, and if you follow the wiki instructions to access the new GenConfig implementation, it puts your Trilinos repo in a modified state:

@bartlettroscoe Putting those three files in the .gitignore file should ignore any changes to them. Would this work?

@bartlettroscoe
Member Author

@bartlettroscoe Putting those three files in the .gitignore file should ignore any changes to them. Would this work?

@jhux2, I don't think that will work reliably from reading:

It seems there is no easy way to blanket-ignore changes to a Git repo's submodules. (The whole point of Git submodules is to track compatible sets of subrepos, so making them easy to ignore would seem to defeat the purpose of submodules. But if that is not the desired behavior, one might look for a different mechanism than Git submodules.)
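
For what it is worth, git does have a per-submodule ignore setting, but it only suppresses what git status and git diff report; it does not change what the superproject tracks, so it is at best a partial workaround (this assumes the submodule's name in .gitmodules matches its path):

```bash
# Hide work-tree changes ('dirty') or all drift ('all') for one submodule in
# status/diff output only; the superproject still records the pinned SHA1s.
git config submodule.packages/framework/GenConfig.ignore dirty
```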

@mayrmt
Member

mayrmt commented Jun 15, 2022

@mayrmt, external contributors can no longer see PR build and test failures either because these are now being reported to an internal SNL CDash site https://trilinos-cdash.sandia.gov.

@bartlettroscoe @trilinos/framework In essence, this blocks external contributions, right? To me, it seems impossible to get a PR merged without being able to see the test results.

@rppawlo
Contributor

rppawlo commented Jun 15, 2022

@mayrmt - we expect this to only be temporary. Long term we should be able to open the testing results back up to users. There will be an email announcement about this soon.

@bartlettroscoe
Member Author

Just to log here: PR failures are occurring due to build errors present on 'develop' and also due to one of the testing nodes running out of disk space.

@bartlettroscoe
Member Author

Just to log here: PR testing seems to be blocking the merge of several PR branches, and people are setting AT: RETEST many times.

For example, PR #10751, which is only 22 days old, has had PR testing started 133 times. The label AT: RETEST has been manually set 122 times so far. How many PRs are being manipulated in this way?

@csiefer2
Member

csiefer2 commented Aug 3, 2022

Sadly, setting AT: RETEST is the only tool developers have for dealing with random failures (especially those caused by non-code problems like machine overloading). You just spam AT: RETEST until you get lucky.

Please tell me you didn't count all 122 additions of AT: RETEST by hand...

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 3, 2022

Sadly, setting AT: RETEST is the only tool developers have for dealing with random failures (especially those caused by non-code problems like machine overloading). You just spam AT: RETEST until you get lucky.

But that just further jams up the system. There is no sense in beating a dead horse (i.e., when you know there is a systemic problem with the PR system that will need to be fixed before any PR iteration will be successful, as has been happening over the last two weeks). There has to be a better way for the Trilinos team to work together to get through these rough patches with PR testing.

Please tell me you didn't count all 122 additions of AT: RETEST by hand...

Of course not. If you search a page in Chrome it shows how many matches there are at the top.

@csiefer2
Member

csiefer2 commented Aug 3, 2022

Sorry about accidentally editing your comment there. I thought I was editing my quote of your comment.

Anyway, if Framework knows that no PRs are getting through, then they should just disable the autotester for anyone but themselves and then tell people that. That way I can slap on AT: RETEST once and be ready to go when the AT comes back online.

If they think some might get through, then what's wrong with trying to be at the front of the line? This PR has been stuck for weeks now and it is required by a specific deliverable.

@bartlettroscoe
Member Author

Anyway, if Framework knows that no PRs are getting through, then they should just disable the autotester for anyone but themselves and then tell people that. That way I can slap on AT: RETEST once and be ready to go when the AT comes back online.

That would likely be the best approach, IMHO.

If they think some might get through, then what's wrong with trying to be at the front of the line?

Because that just ensures that developers who do not constantly set AT: RETEST (to give the framework team time to address the problems) get their PRs tested less frequently, with an even smaller chance of getting their PRs merged. Expecting developers to add AT: RETEST dozens of times to a PR (and well over 100 times in some cases, like in #10751) points to some serious issue, IMHO. There has to be a better way than this. A better approach would likely be to allow manual merges of critical PRs, given sufficient objective evidence, during periods when the PR tester is broken. Cleanly broken builds and tests are already getting onto the 'develop' branch if you look at the "Master Merge" builds over the last couple of weeks here, some of which mirror the PR builds (for example, as reported in #10823).

If there is any more discussion, we should likely move it to a different issue so as not to spam all of the participants on this long-running issue. That is my fault for not creating a separate issue for this.

@bartlettroscoe
Member Author

FYI: As shown here just now, there are currently 13 PRs with the AT: RETEST label set on them. That is up from 9 a few hours ago. (I guess people got the word that PR testing might be working again.) That is quite a backlog to get through. So with single builds taking upwards of 24 hours (see here) and random failures still possibly occurring (see #10823), this is going to take a long time to get caught up.

@bartlettroscoe
Member Author

FYI: To see all of the issues that I know about currently impacting PR testing causing failures, see #10858.

@github-actions

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions bot added the MARKED_FOR_CLOSURE label Aug 16, 2023
@github-actions

This issue was closed due to inactivity for 395 days.

@github-actions bot added the CLOSED_DUE_TO_INACTIVITY label Sep 16, 2023