mdserver crash when starting VisIt #18725

Closed
markcmiller86 opened this issue May 30, 2023 · 3 comments
Labels
  • bug: Something isn't working
  • crash: bug caused visit to crash
  • impact medium: Productivity partially degraded (not easily mitigated bug) or improved (enhancement)
  • likelihood medium: Neither low nor high likelihood

Comments

@markcmiller86
Member

markcmiller86 commented May 30, 2023

Describe the bug

The mdserver does some work on the list of files in the cwd where VisIt is started from.
This makes the mdserver somewhat sensitive to the contents of the cwd on startup.

An LLNL user was getting a reproducible mdserver crash and gave us the contents of the cwd to test. Both @cyrush and I have reproduced it on LLNL RZ systems.

I believe the issue is that the mdserver is trying to decide if a particular set of files is one for which a virtual database entry in the file list should be created.
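
As an illustration of what such a decision could look like, here is a minimal sketch that groups file names by the pattern left after collapsing each run of digits into `*`. This is an assumption for illustration only, not the mdserver's actual algorithm; `DigitPattern` and `GroupByDigitPattern` are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not VisIt code): collapse every run of decimal
// digits in a file name into a single '*'.
std::string DigitPattern(const std::string &name)
{
    std::string pattern;
    bool inDigits = false;
    for (unsigned char c : name)
    {
        if (std::isdigit(c))
        {
            if (!inDigits)
                pattern += '*';   // emit one '*' per digit run
            inDigits = true;
        }
        else
        {
            pattern += static_cast<char>(c);
            inDigits = false;
        }
    }
    return pattern;
}

// Hypothetical helper: bucket names by their digit pattern. A bucket with
// two or more members would be a candidate for a single grouped entry.
std::map<std::string, std::vector<std::string>>
GroupByDigitPattern(const std::vector<std::string> &names)
{
    std::map<std::string, std::vector<std::string>> groups;
    for (const auto &n : names)
        groups[DigitPattern(n)].push_back(n);
    return groups;
}
```

Under this sketch, a name like foo_1655929385333_6522_gorfo.json maps to the pattern foo_*_*_gorfo.json, so names that differ only in their digit runs land in the same bucket.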

Another issue is that the mdserver may be trying to tease cycle numbers out of the file names. When the number in the filename has more digits than can fit in an int type, this could potentially also cause issues.

The names of files include things like...

foo_1655929385333_6522_gorfo.json 
foo_1677984469076_70111_gorfo.json
foo_1681771807053_67964_gorfo.json
.
.
.
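
To make the int-overflow concern concrete, here is a minimal sketch of range-checked parsing with C++17 `std::from_chars`, which reports `std::errc::result_out_of_range` instead of silently wrapping. `ParseFirstNumberAsInt` is a hypothetical helper, not the mdserver's cycle-number code, and the sketch assumes a 32-bit `int`.

```cpp
#include <cassert>
#include <cctype>
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper (not VisIt code): parse the first run of decimal
// digits in `name` as an int. Returns std::nullopt when the name has no
// digits or when the number does not fit in an int.
std::optional<int> ParseFirstNumberAsInt(std::string_view name)
{
    // Find the start of the first digit run.
    std::size_t begin = 0;
    while (begin < name.size() &&
           !std::isdigit(static_cast<unsigned char>(name[begin])))
        ++begin;
    if (begin == name.size())
        return std::nullopt;

    // Find the end of that digit run.
    std::size_t end = begin;
    while (end < name.size() &&
           std::isdigit(static_cast<unsigned char>(name[end])))
        ++end;

    int value = 0;
    const char *first = name.data() + begin;
    const char *last = name.data() + end;
    const auto result = std::from_chars(first, last, value);

    // from_chars sets ec to result_out_of_range when the digits overflow int.
    if (result.ec != std::errc() || result.ptr != last)
        return std::nullopt;
    return value;
}
```

With a 32-bit `int`, the 13-digit field in foo_1655929385333_6522_gorfo.json is rejected rather than truncated, while a short field such as 6522 parses normally.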

Helpful additional information

  • Did VisIt crash: Yes
  • Did you get wrong results:

To Reproduce

Steps to reproduce the behavior. For example:

  1. Get the directory contents from @cyrush or @markcmiller86
  2. Start VisIt from a directory with those contents.
  3. Observe the mdserver crash with output like...
    /g/g11/miller86/visit/visit/3.3RC/build/exe/mdserver[0x420639]
    /g/g11/miller86/visit/visit/3.3RC/build/exe/mdserver[0x416be2]
    /g/g11/miller86/visit/visit/3.3RC/build/lib/libvisitcommon.so(_ZN7Subject6NotifyEv+0xa3)[0x2aaab0e2dbed]
    /g/g11/miller86/visit/visit/3.3RC/build/lib/libvisitcommon.so(_ZN16AttributeSubject6NotifyEv+0x1c)[0x2aaab0ced43c]
    /g/g11/miller86/visit/visit/3.3RC/build/lib/libvisitcommon.so(_ZN4Xfer7ProcessEv+0x220)[0x2aaab0e6b038]
    /g/g11/miller86/visit/visit/3.3RC/build/exe/mdserver[0x41c2b9]
    /g/g11/miller86/visit/visit/3.3RC/build/exe/mdserver[0x4184db]
    /g/g11/miller86/visit/visit/3.3RC/build/exe/mdserver[0x42dce5]
    /g/g11/miller86/visit/visit/3.3RC/build/exe/mdserver[0x42e0f4]
    

Running with -debug X, where X is any integer, hides the problem.

I was also able to run with gdb and get a backtrace...

#0  0x0000000000420639 in MDServerConnection::GetFilteredFileList (this=0x6d60a0, files=...)
    at /g/g11/miller86/visit/visit/3.3RC/src/mdserver/main/MDServerConnection.C:2064
#1  0x0000000000416be2 in GetFileListRPCExecutor::Update (this=0x69da50, s=0x6a19f8)
    at /g/g11/miller86/visit/visit/3.3RC/src/mdserver/main/GetFileListRPCExecutor.C:102
#2  0x00002aaab0e2dbed in Subject::Notify (this=0x6a19f8)
    at /g/g11/miller86/visit/visit/3.3RC/src/common/state/Subject.C:159
#3  0x00002aaab0ced43c in AttributeSubject::Notify (this=0x6a19c8)
    at /g/g11/miller86/visit/visit/3.3RC/src/common/state/AttributeSubject.C:65
#4  0x00002aaab0e6b038 in Xfer::Process (this=0x68cd70)
    at /g/g11/miller86/visit/visit/3.3RC/src/common/state/Xfer.C:382
#5  0x000000000041c2b9 in MDServerConnection::ProcessInput (this=0x6d60a0)
    at /g/g11/miller86/visit/visit/3.3RC/src/mdserver/main/MDServerConnection.C:478
#6  0x00000000004184db in MDServerApplication::Execute (this=0x65af70)
    at /g/g11/miller86/visit/visit/3.3RC/src/mdserver/main/MDServerApplication.C:252
#7  0x000000000042dce5 in MDServerMain (argc=3, argv=0x7fffffffc0d8)
    at /g/g11/miller86/visit/visit/3.3RC/src/mdserver/main/main.C:155
#8  0x000000000042e0f4 in main (argc=9, argv=0x7fffffffc0d8)
    at /g/g11/miller86/visit/visit/3.3RC/src/mdserver/main/main.C:273
(gdb) list
2059            {
2060                pos = newVirtualFiles.find(files.names[fileIndex]);
2061                if(pos == newVirtualFiles.end())
2062                    continue;
2063
2064                if(virtualFilesToCheck.types[vfIndex] != GetFileListRPC::REG)
2065                {
2066                    debug5 << "File " << files.names[fileIndex].c_str()
2067                           << " is not a file so we're adding its components back "
2068                              "to the files list." << endl;
(gdb) 

I think virtualFilesToCheck.types is actually empty here and vfIndex is zero, causing the crash.
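
If types is a std::vector, indexing an empty one with operator[] is undefined behavior, which would be consistent with a crash at line 2064. A minimal sketch of a guard for that access pattern follows. It is not the change made in the actual fix; `FileType` and `IsKnownNonRegular` are hypothetical stand-ins for the GetFileListRPC type codes and the check shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the file-type codes; names and values are
// assumptions for illustration only.
enum class FileType { Dir, Reg, Unknown };

// Returns true only when `index` refers to an existing entry AND that entry
// is not a regular file. An empty or too-short list yields false instead of
// reading out of bounds.
bool IsKnownNonRegular(const std::vector<FileType> &types, std::size_t index)
{
    if (index >= types.size())
        return false;   // nothing to inspect, so skip the non-regular branch
    return types[index] != FileType::Reg;
}
```

Whether a missing entry should be treated as regular, non-regular, or an error is a design choice; this sketch simply refuses to read past the end of the list.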

@markcmiller86
Member Author

It is these three file names that are the culprit...

foo_1680216733095_39630_gorfo.json
foo_1680217640917_71770_gorfo.json
foo_1680217704002_1298_gorfo.json

@cyrush
Member

cyrush commented May 30, 2023

-debug only hides the problem if the .vlog files are created in the same directory as the problem files. That somehow prevents the crash.

If you navigate to the folder only holding the problem files -- you can still see the crash.

Here is the stack I fished out:

std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::_M_get_insert_unique_pos(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
---
std::ctype<char>::do_widen(char) const,
---
Subject::Notify()

@markcmiller86 markcmiller86 self-assigned this May 30, 2023
@markcmiller86
Member Author

Resolved in #18726 on 3.3RC and in #18732 on develop
