Trilinos auto PR tester stability issues #3276
Comments
Over the past few months, many of the ATDM Trilinos build script update PRs have experienced several cases of failed PR builds that had nothing to do with the changes in the PR branch. Below, I will list the number of PR iterations required before all of the PR builds for a given iteration passed, allowing the merge. I will also list the elapsed time between when the last commit was pushed (or the PR was created, which should trigger a PR build) and when the final passing PR build completed:
I think the ATDM Trilinos PRs tend to see more of these types of failures because the current PR testing system triggers the build and testing of every Trilinos package for any change to any file under the
We have been seeing similar problems over the last few days with PR #3258, with the PR builds shown here. The trend you can see is that the PR builds that run the Intel 17.0.1 build all pass just fine. The problem comes with the GCC 4.8.4 and GCC 4.9.3 builds, and all of the failures for those builds occur only on the Jenkins node 'ascic142'. The GCC PR builds (including the GCC 4.8.4 and 4.9.3 builds) that run on the other nodes 'ascic143', 'ascic157', and 'ascic158' all pass. That suggests there is something different about the node 'ascic142' that is causing these builds to fail in a way that does not occur on the other nodes. Something similar occurred with PR #3260, with PR build results shown here. In that case, 3 of the 4 failing PR builds were on 'ascic142', and that included build failures with empty build error messages. The other failing PR build was on 'ascic158' and consisted of two timing-out tests. All of this suggests:
@trilinos/framework, can someone look into these issues? This problem, as it impacts the ATDM work, will mostly go away once PR #3258 is merged, but getting that merged requires the PR builds to pass, which they are having trouble doing. Below is the data for the first round of failures being seen in PR #3258.
And after a push of commits last night, the first PR testing iteration failed for PR #3258 as well, so a new cycle has started.
And node 'ascic142' strikes again, this time killing the PR test iteration for the new PR #3278 shown here. The results for the failing build
Not sure what happened there. @trilinos/framework, could we consider a retry-and-wait loop for these git communication operations?
@bartlettroscoe Thank you for this information. I started to investigate this further by putting more than half of the executors on ascic142 to sleep for 10 days (we can always kill the job if we need to). This means that only 1 PR testing job can run at a time on that node for the next 10 days. The small, 1 executor instance jobs can run along with one PR testing job for now. We could clamp that down too if the issues persist. If we do that (allow only 1 PR job to run on the node and nothing else) and the failures persist, I think we need to consider taking the node down or asking someone to look into the issues. 142 runs a lot of jobs for us. It is possible that 143 for example typically does not have 2 jobs running at the same time, but if it did, it would fail more often too. We'll see what happens anyway.
@jwillenbring, thanks for taking these steps. It is hard to monitor the stability of the PR testing process just by looking at CDash since we expect failures in some PRs due to code changes, and if results don't show up at all, we don't see them. Is it possible to log cases where PR builds don't successfully submit results to CDash or provide comments to GitHub for some reason? This might just be a global log file that the PR testing Jenkins jobs write to whenever an error is detected. For that matter, it would be good to also log every successful PR run (which just means nothing crashed and no communication failed). This type of data would be useful from a research perspective about the stability of the CI testing processes, and it would provide a clear metric to see if changes to the PR process are improving stability or not.
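As a rough sketch of what that kind of logging could look like, assuming a shared log file the PR driver scripts can append to (the path, function name, and fields below are hypothetical, not part of the actual PR driver):

```bash
#!/bin/bash
# Hypothetical logging helper for the PR driver scripts; the log path and
# the pr_number/build_name/status fields are illustrative placeholders.
PR_STATUS_LOG=/projects/trilinos/pr-tester-status.log   # placeholder path

log_pr_result() {
  local pr_number="$1" build_name="$2" status="$3" detail="$4"
  # One line per PR build: UTC timestamp, node, PR, build name, PASS/FAIL, detail
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $(hostname) PR-${pr_number} ${build_name} ${status} ${detail}" \
    >> "${PR_STATUS_LOG}"
}

# Example usage with made-up values:
# log_pr_result 3258 gcc-4.8.4 FAIL "CDash submit failed"
# log_pr_result 3258 intel-17.0.1 PASS ""
```

A simple append-only file like this would make it easy to grep for failure patterns per node or per build and to compute pass/fail rates over time.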
@trilinos/framework, More stability problems. Even after the merge of #3258 and the completion of #3133 (still in review but complete otherwise) so that no packages should be built or tested for changes to ATDM build scripts, we still got a failed PR iteration as shown for the PR #3309 here. The new Jenkins output in that comment showed that the build
What is the timeout for that git fetch? None is listed. Can we set a timeout of 30 minutes or so? Also, do all of the git fetches occur at the same time for all three PR builds? If so, you might avoid these problems by staggering the start of the PR builds by one minute each. That will add little time to the overall PR iteration but may make the git fetch more robust. If Jenkins continues to be this fragile for git operations, you might want to do the git operations in your own driver scripts and put in a loop of retries. That is how CTest gets robust git fetches, I think.
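To illustrate, a minimal retry-and-wait wrapper of the kind a driver script could put around the fetch might look like the following (the retry count, sleep interval, and remote/branch names are placeholders, not actual PR driver settings):

```bash
#!/bin/bash
# Hypothetical retry wrapper for a driver script's git fetch; the limits
# and names below are illustrative, not the actual Trilinos PR scripts.
MAX_TRIES=5
SLEEP_SECONDS=60

fetch_with_retries() {
  local remote="$1" ref="$2"
  local attempt=1
  while [ "${attempt}" -le "${MAX_TRIES}" ]; do
    echo "git fetch attempt ${attempt} of ${MAX_TRIES} ..."
    if git fetch "${remote}" "${ref}"; then
      return 0   # fetch succeeded
    fi
    echo "git fetch failed; sleeping ${SLEEP_SECONDS}s before retrying"
    sleep "${SLEEP_SECONDS}"
    attempt=$((attempt + 1))
  done
  echo "ERROR: git fetch failed after ${MAX_TRIES} attempts" >&2
  return 1
}

# Example usage (placeholder remote/branch names):
# fetch_with_retries origin develop
```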
Another data-point ... The PR #3312 iteration #3312 (comment) showed the GitHub fetch failure:
Jenkins needs to be made more robust with git fetches using loops of retries, or someone needs to write scripts that do this manually with loops (that is what ctest -S does to robustly submit to CDash). I can almost guarantee that is what TravisCI and other refined CI systems do to robustly run against GitHub and other external sites.
@william76 is looking into this with the pipeline capability and @allevin with the autotester. I cannot figure out why Jenkins does so many clones instead of updates once it has a repo. Aaron said it is not like this for SST. |
By default 10. We upped it to 20 and there seemed to be little effect.
The individual PR testing builds get put in the queue and get assigned to a node. Sometimes this happens almost instantly, sometimes it takes a while.
@jwillenbring, I think if you do a 'git fetch' and there is a network issue right then, it will just return with an error. I think what we need is a loop with waits and retries. Also, could you turn on the Jenkins project option "Add timestamps to the Console Output"? That might help us see if these commands are timing out or are crashing before the timeout.
@jwillenbring said:
I think most of the failures on 'ascic142' that are reported above are build or test failures after the clones/updates are successful. The problem on 'ascic142' is not communication, it is overloading (or something related).
And another git fetch failure in #3316 (comment) showing:
Note that is not the Trilinos repo but the tiny little
NOTE: The ctest -S script support built into CTest, which does clones and updates, runs on the exact same machines and networks as this PR testing system, and I don't think we see even a small fraction of this number of git fetch failures. For example, just look at the 'Clean' builds that run on the same ascic-jenkins build farm machines over the last 5.5 months in this query. Out of those 475 builds, we see 1 build with a git update failure (on 6/14/2018). That is far more robust than what we are seeing from the PR tester. Therefore, one has to conclude that the problem is not the machines or the network. The problem must be the software doing the updating (which does not have robustness built in).
In PR tests, the build is broken and tests are failing in ways that have nothing to do with the PR in question. See this example: Sometimes the PR tests fail because of lack of communication with some server, but now they are failing because of MueLu build errors and Tempus test failures. The latter may pass or fail intermittently, but it's not clear to me how those build errors could have gotten through PR testing. After discussion with Jim Willenbring, it looks like these build errors may come from test machines being overloaded. Ditto for the Tempus failures perhaps, though I haven't investigated those in depth.
FYI: It looks like the problems with the PR tester are not just random crashes. It looks like it can also retest a PR branch even after PR testing passed and there was no trigger to force a new set of PR testing builds. See #3356 (comment) for an example of this.
FYI: My last PR #3369 required 6 auto PR testing iteration attempts before it allowed the merge, over two days after the start.
FYI: Stability problems continue #3455 (comment). That one PR iteration had a git clone error for one build and a CDash submit failure for another build.
See also
@trilinos/framework, FYI: More stability issues with the auto PR tester. In #3546 (comment) you see the
and the
And in another PR testing iteration shown in #3549 (comment) you see the
And actually, we are also seeing configure failures for these two unrelated PR iterations showing configure failures for the
It is impossible for those PR branches to trigger a configure failure because the PR builds don't currently use the ATDM configuration scripts (unless something has changed, but I don't think so). What is going on with the auto tester with the
CC: @trilinos/framework And here is a new one: #3546 (comment) This time the
Does this mean that the local git repo is corrupted? Are two PR jobs running on top of each other?
@e10harvey, we are not able to view existing Trilinos HelpDesk tickets, are we?
@wadeburgess, @ccober6: Is that correct?
What about external contributors without access to these systems?
@mayrmt, external contributors can no longer see PR build and test failures either, because these are now being reported to an internal SNL CDash site, https://trilinos-cdash.sandia.gov. This change was announced to internal SNL email lists on 6/7/2022 by @e10harvey, but I am not sure any external Trilinos developers or users are on those internal SNL email lists, and I am not sure how this was announced to stakeholders outside SNL. But if you just examine the updated PR testing comments, like in #10614 (comment), the move of results to an internal SNL CDash site and the removal of the outer Jenkins driver STDOUT (where you can see infrastructure failures) from the GitHub PR comments are obvious (so I am not giving away any secrets here).
@bartlettroscoe That's correct, you can only see your own tickets, not anyone else's. That's why I tend to open a GitHub issue and reference the HelpDesk ticket.
FYI: It also seems that STK subpackages are being unconditionally enabled in the CUDA PR build on vortex. I reported this in TRILINOSHD-100. This is resulting in way more code and tests being built and run than is needed to test some PRs. (For example, all of the STK, TrilinosCouplings, and Panzer tests are getting run when just the Krino package is changed, like in PR #10613. That is a waste of computing resources since changes to Krino cannot possibly break STK and its dependent packages.) None of the other PR builds are doing this.
Also, note that Trilinos now uses Git submodules, and if you follow the wiki instructions to access the new GenConfig implementation, it puts your Trilinos repo in a modified state:
It is not clear how to deal with this modified state in my Trilinos Git repo; I have reported this in TRILINOSHD-101. (The problem with the changes in the base Trilinos repo appears to be caused by following the GenConfig setup described in those wiki instructions.) Part of the problem with the local modifications is that locally generated symlinked directories were put under Git version control; I reported this in TRILINOSHD-102. UPDATE: I documented how to revert these local submodule changes in the new section "Revert local changes to the GenConfig submodules".
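For reference, the standard git commands involved in discarding that kind of local submodule state look roughly like the following; this is only a generic sketch, not the documented GenConfig procedure from the wiki section mentioned above:

```bash
# Generic sketch of discarding local submodule modifications in a clone;
# not the documented GenConfig revert procedure.
cd Trilinos    # base repo checkout

# Point each submodule back at the commit recorded in the base repo
git submodule update --init --recursive

# Review what is still modified in the base repo before discarding anything
git status

# Discard remaining tracked-file modifications (careful: this throws away
# ALL local edits to tracked files in the base repo)
git checkout -- .
```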
@bartlettroscoe Putting those three files in the .gitignore file should ignore any changes to them. Would this work?
@jhux2, I don't think that will work reliably, based on what I have read: it seems there is no easy way to blanket-ignore changes to a Git repo's submodules. (The whole point of Git submodules is to track compatible sets of subrepos, so making them easy to ignore would seem to defeat the purpose of submodules. But if that behavior is not desired, one might look for a different implementation than Git submodules.)
@bartlettroscoe @trilinos/framework In essence, this blocks external contributions, right? To me, it seems impossible to get a PR merged without being able to see the test results.
@mayrmt - we expect this to only be temporary. Long term, we should be able to open the testing results back up to users. There will be an email announcement about this soon.
Just to log here, PR failures are occurring due to build errors present on 'develop' and also due to one of the testing nodes running out of disk space.
Just to log here, PR testing seems to be blocking the merge of several PR branches and people are repeatedly setting the AT: RETEST label. For example, PR #10751, which is only 22 days old, has had PR testing started 133 times. The AT: RETEST label has been added 122 times.
Sadly, setting AT: RETEST is the only tool developers have for dealing with random failures (especially those caused by non-code problems like machine overloading). You just spam AT: RETEST until you get lucky. Please tell me you didn't count all 122 additions of AT: RETEST by hand...
But that just further jams up the system. There is no sense in beating a dead horse (i.e., when you know there is a systemic problem with the PR system that must be fixed before any PR iteration can succeed, as has been happening over the last two weeks). There has to be a better way for the Trilinos team to work together to get through these rough patches with PR testing.
Of course not. If you search a page in Chrome it shows how many matches there are at the top.
Sorry about accidentally editing your comment there. I thought I was editing my quote of your comment. Anyway, if Framework knows that no PRs are getting through, then they should just disable the autotester for everyone but themselves and then tell people that. That way I can slap on AT: RETEST once and be ready to go when the AT comes back online. If they think some might get through, then what's wrong with trying to be at the front of the line? This PR has been stuck for weeks now and it is required by a specific deliverable.
That would likely be the best approach, IMHO.
Because that just ensures that developers who do not constantly put on AT: RETEST end up at the back of the line.

If there is any more discussion, we should likely move it to a different issue so as not to spam all of the participants on this long-running issue. That is my fault for not creating a separate issue for this.
FYI: As shown here just now, there are currently 13 PRs with the AT: RETEST label.
FYI: To see all of the issues that I know about that are currently impacting PR testing and causing failures, see #10858.
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
This issue was closed due to inactivity for 395 days. |
@trilinos/framework
Description
Over the last few weeks and months, the Trilinos auto PR tester has seen several cases where one or more PR builds for a given PR testing iteration failed to produce results on CDash or showed build or test failures that were not related to the changes on that particular PR.
This Story is to log these failures and keep track of them in order to provide some statistics that can inform how to address these cases. This should replace making comments in individual PRs that exhibit these types of problems, like #3260 and #3213.
PR Builds Showing Random Failures
Below are a few examples of the stability problems (but not all of them).