Framework: Mysterious 6 failures in package Zoltan2Sphynx with no data on CDash going back to at least 2022-05-18 #10836
Comments
CC: @jhux2, @csiefer2, @e10harvey There is an important clue that I missed about these mysterious "6" build errors. If you look at all of the PR builds that have "6"
that is built from the source file. However, note that I can't seem to find any of these build errors in any of the recent "Master Merge" builds shown on CDash. Someone needs to get onto the machines where the PR builds are actually running and try to reproduce these errors in those build dirs. There is really not much more I can do without being given access. |
Potentially linked to #10842. |
@bartlettroscoe #10813 has merged, which is a fix for #10842. PR #10775 started after that merge, and it still shows the same 6 failures in Zoltan2Sphynx that don't appear when you click on the link, as you can see here. |
I will see if we can't get Kitware's help on debugging this (it has been a challenge for Zack to see things on both the clients where the XML files are getting generated and on the CDash server where they are being consumed). |
@bartlettroscoe Sounds good. From the internal ticket, there was an indication that Zack saw real compile errors in the logs. |
@jhux2, if that is the case, we should be able to see those at: I think @e10harvey gave all Trilinos developers read-only access to that Jenkins site. I will pull the build log file (which is 89M) and see if I can see what is failing. |
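A minimal sketch of pulling and scanning a console log that size, assuming a placeholder Jenkins URL (the real job path is elided above):

```bash
# Placeholder URL; substitute the actual Jenkins job/build path.
JOB_URL="https://jenkins.example/job/Trilinos_PR_gcc-7.2.0-debug/850"
curl -fsSL "${JOB_URL}/consoleText" -o build-log.txt

# Scan for common GCC error patterns, keeping a few lines of context.
grep -nE "error:|undefined reference" build-log.txt | head -40
grep -n -B 2 -A 5 " error: " build-log.txt > build-errors-with-context.txt
```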
@jhux2, so for whatever reason, the build log file for the PR #10775 for the build Trilinos_PR_gcc-7.2.0-debug is not shown at: But we see the build log file for other builds, for example: Therefore, we can't see the build errors for that build for the PR #10775 that is reporting "6" build errors. Looking at the configure output for the build Trilinos_PR_gcc-7.2.0-debug with ID 850 for PR #10775 at: shows:
Well, that is the correct branch. Looking at the console file at: we see:
So there are definitely build errors. We just can't see what they are. Have we tried to reproduce that build locally and see what happens? I will post a new Trilinos HelpDesk issue to see if we can get some help for why the build log is not being archived on Jenkins. |
@bartlettroscoe I tried a few times, but my guess is that we'd need the exact configure line. Is there a way to generate the cmake configure line from the file |
This screenshot shows the missing build log file at: |
@bartlettroscoe Results for PR #10802 look clean so far, and earlier iterations had the missing 6 tests. |
@jhux2, yup, they actually give you the exact configure command in the Jenkins job. For example, for this build it is shown at: which is:
And, actually, you can see that on CDash too with the uploaded build files under: which shows: When you click on |
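For illustration, a Trilinos PR configure command as surfaced in the Jenkins job typically has this shape (the paths and fragment file name below are assumptions, not the actual values from that build):

```bash
# Illustrative shape only; the real command is shown verbatim in the Jenkins
# job and in the uploaded build files on CDash.
cmake \
  -G Ninja \
  -C /path/to/generated-pr-fragment.cmake \
  -D Trilinos_ENABLE_TESTS=ON \
  /path/to/Trilinos
```

The `-C <file>` argument pre-loads the CMake cache from a script, which is why capturing that fragment file is enough to replay the configure.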
FYI: I submitted TRILINOSHD-166 to see if we can get some help in archiving the build log file and the XML files (which Zack Galbreath will need to debug submit problems to CDash). |
#10808 has the missing 6, as well. How do I find the logs on Jenkins? |
@bartlettroscoe I tried to reproduce the missing 6 error in #10775 by following the instructions in #10836 (comment), but I was unsuccessful, i.e., the build finished without error. There are a few caveats: I built on ascicgpu031. I needed to disable a few TPLs and packages: HDF5, NetCDF, Scotch, and Moertel. I also removed Here are the loaded modules:
[EDIT] I loaded the environment with
|
@jhux2, how did you load the env? |
@bartlettroscoe See my edited #10836 (comment). |
@jhux2, the correct way to load the env is with the gen-config.sh script, using it to generate the CMake fragment file on the machine at the same time. I will create a wiki page that describes how to reproduce builds using GenConfig that should be easier to follow and should be able to correctly reproduce PR builds. |
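As a sketch of that workflow (the script path and flag names here are assumptions; consult the GenConfig documentation or the script's help output for the real interface):

```bash
# Assumed location of GenConfig inside a Trilinos clone; flags are a sketch.
cd Trilinos
source packages/framework/GenConfig/gen-config.sh \
    --cmake-fragment ../genconfig-fragment.cmake \
    rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables \
    $PWD

# Configure a fresh build dir from the generated fragment and build.
mkdir -p ../build && cd ../build
cmake -C ../genconfig-fragment.cmake ../Trilinos
make -j16
```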
And once again, the Jenkins job artifacts for the build with "6" build failures PR-10808-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-852 at: does NOT list the |
@jhux2, given that the same intel-17.0.1 builds for the same PR #10829 showing the "6" build failures have different numbers of "Not Run" tests as shown here showing: this makes me think this might be an out-of-memory issue as well. Also, given that there is no showing: it might make sense that an out-of-memory condition would keep that file from even getting generated by the |
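If the out-of-memory hypothesis is right, the kernel OOM killer should leave traces on the build machine; a quick check with shell access to the ascic* node might look like:

```bash
# Look for OOM-killer activity around the time of the failed PR build.
dmesg -T | grep -iE "out of memory|oom-killer" | tail -20
sudo grep -i "oom" /var/log/messages 2>/dev/null | tail -20  # RHEL7 syslog
free -h  # current memory headroom on the node
```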
@zackgalbreath, could an out-of-memory state cause |
Below I document an attempt to reproduce the build errors reported for the "6" build errors for PR #10808 for the build: I used a throw-away integration test branch to also test a few other small PRs at the same time (details below), and the reproduction process was simply:
This submitted to CDash at: and it showed all passing builds and tests. So I was not able to reproduce any failures (but I was able to test changes to the file). (NOTE: I also got a rude reminder that you can't use the new SEMS modules on the new 'hpws' machines, which I forgot I reported in TRILINOSHD-59.) Attempt to reproduce build errors associated with "6" build failures for build 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug' for PR #10808: Trying to reproduce build errors possibly associated with the "6" build errors reported for the build: Using a throw-away integration test branch 10807-kokkos-kernels-cublas-titb on 'hpws055' using:
That build was successful as shown by:
but it showed a lot of test failures:
This submitted to CDash at: and, as shown in this query, 2456 of the 2459 test failures are showing "undefined symbol: ompi_common_verbs_usnic_register_fake_drivers" errors like:
The remaining 3 failing tests shown here for the tests:
don't give any clue why they are failing because they are calling a Python script and showing output. So you can't reproduce Trilinos PR builds for the Well, shoot, I already reported this problem way back on 2021-12-14 in TRILINOSHD-59 and it is still not fixed. (Can't believe I got bit by that.) Wow, so you can only reproduce these builds on actual ascic and ascicgpu machines? Logging onto the machine 'ascicgpu17' with the same build directory in place, I ran the dashboard again with a pre-configured bit of software with:
which posted to: and so I was not able to reproduce any build errors :-( |
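For reference, re-running a CTest dashboard like this generally reduces to invoking the `ctest -S` driver script; the driver file name and variable values below are placeholders rather than the actual Trilinos PR tooling:

```bash
# Placeholder driver script and variable values; -D defines script-mode
# variables and -V prints verbose output.
ctest -V -S ctest-driver.cmake \
  -DCTEST_SITE=ascicgpu17 \
  -DCTEST_BUILD_NAME=PR-10808-test-rhel7_sems-gnu-7.2.0-serial_debug
```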
As (not) shown in this query, there have not been any of these mysterious "6" build errors with no output since 2022-08-15, so I think it is safe to say this issue is resolved. Things that were done that may have fixed this:
Closing as fixed. |
BTW, we never did figure out how CDash got into the state where it shows "6" build errors with no error output. I just know the errors went away after they cleaned up disk space on the Jenkins clients. |
CC: @zackgalbreath, @e10harvey Sadly, this is not actually fixed. An experimental build the Framework team is doing for the C++17 transition (see TRILFRAME-411), driven by the Jenkins job: which submitted to CDash here, is showing these "6" failures with no build details: The good news this time is that we have the Build.xml file archived in that Jenkins job to inspect. I will send the file to @zackgalbreath offline for him to examine. |
CC: @e10harvey, @zackgalbreath, @csiefer2, @jhux2 So it turns out that the defect causing these mysterious "6" failures and the single global-level error for the outer As evidence for this, for the recent case described above, the
You can see this by downloading that
|
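Since the archived Build.xml is the key artifact here, a quick inspection sketch (assuming the usual CTest Build.xml schema, where each build error is an `<Error>` element with `<Text>` and `<SourceFile>` children):

```bash
# Count the errors recorded in the XML; for this case it should print 6.
xmllint --xpath 'count(//Error)' Build.xml

# List the source files the errors point at (schema assumption noted above).
xmllint --xpath '//Error/SourceFile/text()' Build.xml

# Fallback if xmllint is not installed:
grep -c "<Error>" Build.xml
```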
In summary, there are 3 independent defects that look to be working together to result in showing these mysterious "6" build errors:
If any one of those things did not happen, you would not see these mysterious "6" build errors. The first two look to be user errors on the SNL side. The third looks to be a possible CDash defect. |
Actually, the second one, the "ctest -S driver reporting the one single build error
FYI: The fix for this is in CMake 3.24.3 (released 2022-11-01). (See SNL Kitware #209.) Next: Install CMake 3.24.3 everywhere and use it with Trilinos PR builds ... |
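A small sanity check for that rollout, verifying that a given machine's CMake includes the fix:

```bash
# Compare the installed CMake version against the first fixed release.
required=3.24.3
current=$(cmake --version | head -1 | awk '{print $3}')
if [ "$(printf '%s\n' "$required" "$current" | sort -V | head -1)" = "$required" ]; then
  echo "OK: cmake $current >= $required (includes the CTest fix)"
else
  echo "WARNING: cmake $current predates $required"
fi
```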
This is now resolved since all of the PR builds are using CMake 3.24.3 (see #10823 (comment)). We should never see this again. Closing as complete. |
Bug Report
@trilinos/framework, @csiefer2, @rppawlo, @e10harvey
Next Action Status
This is due to a defect in CTest introduced in CMake 3.18. The fix for this is in CMake 3.24.3 (released 2022-11-01). (See SNL Kitware #209.) Next: Install CMake 3.24.3 everywhere and use it with Trilinos PR builds ...
Internal Issues:
Description
As shown in this query going back to at least 2022-05-18, there have been many PR builds that failed showing "6" failures in the package `Zoltan2Sphynx`, but when you click on "6", there are no errors shown. Also, there are zero "Not Run", "Fail", and "Pass" tests reported for these builds. That is very strange because generally if no test results are submitted, then empty rows are reported for "Not Run", "Fail", and "Pass" tests, not a zero.
As shown in that query above, all of these failures are coming from either 'ascic' or 'ascicgpu' machines, and they span a bunch of different builds including `intel-17.0.1`, `intel-19.0.5`, `cuda-11.4.2`, `gnu-7.2.0`, and `gnu-8.3.0` builds, impacting 88 different builds (as of 8/5/2022). This has impacted 28 PRs so far, including #10537, #10552, #10606, #10614, #10628, #10644, #10653, #10662, #10675, #10677, #10682, #10697, #10706, #10720, #10749, #10751, #10767, #10775, #10777, #10779, #10783, #10784, #10796, #10799, #10801, #10802, #10817, and #10834.
NOTE: For some reason (that we should investigate), when a global target not associated with a TriBITS package fails, it seems that CTest assigns this to the last subproject (sorted alphanumerically), which is `Zoltan2Sphynx`. We have seen this many times. Therefore, I don't believe this failure has anything to do with Zoltan2.
Steps to Reproduce
Unknown. Seems to randomly occur in PR testing.