
Error in TInterpreter::Calc with no output stack in seemingly random distributed rdf test execution #11515

Closed
vepadulano opened this issue Oct 7, 2022 · 10 comments · Fixed by #11516


@vepadulano
Member

The problem

Sometimes the distributed RDataFrame test of the RunGraphs functionality fails, for example in this CI run. The visible error on the user side comes from Dask, which serializes a Python RuntimeError and produces the following output on stdout:

2022-10-07 17:00:25,901 - distributed.worker - WARNING - Compute Failed
Key:       dask_mapper-d6baeae1-74a8-4a41-84e9-6ee2e8db064f
Function:  execute_task
args:      ((<function DaskBackend.ProcessAndMerge.<locals>.dask_mapper at 0x7f0cdae1a320>, (<function apply at 0x7f0cfa133910>, <class 'DistRDF.Ranges.TreeRangePerc'>, (), (<class 'dict'>, [['id', 1], ['treenames', ['myTree']], ['filenames', ['2clusters.root']], ['first_file_idx', 0], ['last_file_idx', 1], ['first_tree_start_perc', 0.5], ['last_tree_end_perc', 1], ['friendinfo', None]]))))
kwargs:    {}
Exception: "RuntimeError('C++ exception thrown:\\n\\truntime_error: \\nAn error occurred during just-in-time compilation in RLoopManager::Run. The lines above might indicate the cause of the crash\\nAll RDF objects that have not run their event loop yet should be considered in an invalid state.\\n')"

The Python RuntimeError is in turn just a wrapper around the C++ std::runtime_error thrown by this function in the RDF machinery.

That error suggests a failure in TInterpreter::Calc, which should normally be accompanied by some output on stderr showing the compilation error. Unfortunately, stderr is completely empty: no error is reported by TInterpreter.
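
For reference, the error-surfacing path looks roughly like the following (a simplified sketch, not the exact RDFUtils.cxx code; the helper name and the wording of the message are paraphrased):

// Simplified sketch: how an RDF jitting helper can turn a TInterpreter::Calc
// failure into the C++ exception quoted above (paraphrased, not the exact ROOT code).
#include "TInterpreter.h"
#include <stdexcept>
#include <string>

void InterpreterCalcSketch(const std::string &code, const std::string &context)
{
   TInterpreter::EErrorCode errorCode = TInterpreter::kNoError;
   gInterpreter->Calc(code.c_str(), &errorCode);
   if (errorCode != TInterpreter::kNoError) {
      // No diagnostics are printed here: the compilation errors are expected to
      // have gone to stderr already, which is why an empty stderr is so puzzling.
      throw std::runtime_error("An error occurred during just-in-time compilation in " + context +
                               ". The lines above might indicate the cause of the crash");
   }
}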

By adding a few print statements here and there, we can see some more details of this issue (the patch is attached below):

Failure in cling::IncrementalExecutor::jitInitOrWrapper: PID=31643,funcname=_Z16__cling_Un1Qu340Pv

Failure at /home/vpadulan/programs/rootproject/rootsrc/interpreter/cling/lib/Interpreter/Interpreter.cpp:1438 with ExecutionResult=3

Failure in Interpreter::EvaluateInternal: PID=31643,evalres=1
The process was evaluating: ROOT::Internal::RDF::CallBuildAction<ROOT::Internal::RDF::ActionTags::Histo1D, Int_t>(reinterpret_cast<std::shared_ptr<ROOT::Detail::RDF::RNodeBase>*>(0x7f0cc988e7d0), new const char*[1]{"b1"}, 1, 1, reinterpret_cast<shared_ptr<TH1D>*>(0x7f0cd043c1f0), reinterpret_cast<std::weak_ptr<ROOT::Internal::RDF::RJittedAction>*>(0x7f0cc94b3830), reinterpret_cast<ROOT::Internal::RDF::RColumnRegister*>(0x7f0cc29a41b0));

Failure in gInterpreter->Calc: PID=31643,errorCode=1

Adding these two lines in cling::IncrementalExecutor::jitInitOrWrapper provides a way to halt execution when the offending code path is triggered, so that we can attach with gdb -p PID_OF_FAILING_DISTRDF_TASK:

        volatile int a = 0;
        while (a == 0) {}
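
Once attached, setting a to a non-zero value from the debugger (e.g. gdb's set var a = 1 in the spinning frame) releases the busy-wait and lets the thread continue.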

This shows that there are many threads in flight (21!), even though the distributed task runs sequentially from the RDF point of view:

(gdb) info threads
  Id   Target Id                                  Frame 
  1    Thread 0x7f1370756740 (LWP 34502) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f131020075c) at futex-internal.c:57
  2    Thread 0x7f135fbff640 (LWP 34504) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f13622926e0 <thread_status+96>) at futex-internal.c:57
  3    Thread 0x7f135f3fe640 (LWP 34505) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f1362292760 <thread_status+224>) at futex-internal.c:57
  4    Thread 0x7f135ebfd640 (LWP 34506) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f13622927e0 <thread_status+352>) at futex-internal.c:57
  5    Thread 0x7f135e3fc640 (LWP 34507) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f1362292860 <thread_status+480>) at futex-internal.c:57
  6    Thread 0x7f1357bfb640 (LWP 34508) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f13622928e0 <thread_status+608>) at futex-internal.c:57
  7    Thread 0x7f13573fa640 (LWP 34509) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f1362292960 <thread_status+736>) at futex-internal.c:57
  8    Thread 0x7f1356bf9640 (LWP 34510) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f13622929e0 <thread_status+864>) at futex-internal.c:57
  9    Thread 0x7f13503f8640 (LWP 34511) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f1362292a60 <thread_status+992>) at futex-internal.c:57
  10   Thread 0x7f134dbf7640 (LWP 34512) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f1362292ae0 <thread_status+1120>) at futex-internal.c:57
  11   Thread 0x7f134b3f6640 (LWP 34513) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f1362292b60 <thread_status+1248>) at futex-internal.c:57
  12   Thread 0x7f134abf5640 (LWP 34514) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f1362292be0 <thread_status+1376>) at futex-internal.c:57
  13   Thread 0x7f13483f4640 (LWP 34515) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f1362292c60 <thread_status+1504>) at futex-internal.c:57
  14   Thread 0x7f1343bf3640 (LWP 34516) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f1362292ce0 <thread_status+1632>) at futex-internal.c:57
  15   Thread 0x7f13433f2640 (LWP 34517) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f1362292d60 <thread_status+1760>) at futex-internal.c:57
  16   Thread 0x7f133ebf1640 (LWP 34518) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7f1362292de0 <thread_status+1888>) at futex-internal.c:57
  17   Thread 0x7f132aa21640 (LWP 34584) "python" __GI___libc_read (nbytes=4, buf=0x7f133827b950, fd=33) at ../sysdeps/unix/sysv/linux/read.c:26
  18   Thread 0x7f132a120640 (LWP 34585) "python" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7f132a11f330, op=137, 
    expected=0, futex_word=0x7f1370d85acc <_PyRuntime+428>) at futex-internal.c:57
  19   Thread 0x7f132991f640 (LWP 34586) "python" __GI___libc_read (nbytes=4, buf=0x7f13382bf140, fd=31) at ../sysdeps/unix/sysv/linux/read.c:26
  20   Thread 0x7f132911e640 (LWP 34587) "python" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, 
    expected=0, futex_word=0x562911e622b0) at futex-internal.c:57
* 21   Thread 0x7f131ad44640 (LWP 34739) "python" 0x00007f1332cbafdc in cling::IncrementalExecutor::executeWrapper(llvm::StringRef, cling::Value*) const () from /home/vpadulan/programs/rootproject/rootbuild/master-distrdf-debug/lib/libCling.so

The two most interesting ones are threads 1 and 21.

Thread 1:

#6  std::__condvar::wait (__m=..., this=<optimized out>)
    at /usr/src/debug/gcc-12.2.1-2.fc36.x86_64/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/std_mutex.h:155
#7  std::condition_variable::wait (this=<optimized out>, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:41
#8  0x00007f133726f0aa in std::_V2::condition_variable_any::wait<std::unique_lock<std::mutex> > (this=0x7f1310200730, __lock=...)
    at /usr/include/c++/12/condition_variable:312
#9  0x00007f133726df05 in std::_V2::condition_variable_any::wait<std::unique_lock<std::mutex>, ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock()::{lambda()#1}>(std::unique_lock<std::mutex>&, ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock()::{lambda()#1}) (this=0x7f1310200730, __lock=..., __p=...) at /usr/include/c++/12/condition_variable:321
#10 0x00007f1337269c3e in ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock (this=0x7f13102006f8)
    at /home/vpadulan/programs/rootproject/rootsrc/core/thread/src/TReentrantRWLock.cxx:190
#11 0x00007f1337277d74 in ROOT::TRWMutexImp<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock (this=0x7f13102006f0)
    at /home/vpadulan/programs/rootproject/rootsrc/core/thread/src/TRWMutexImp.cxx:42
#12 0x00007f1337276797 in ROOT::TVirtualRWMutex::Lock (this=0x7f13102006f0)
    at /home/vpadulan/programs/rootproject/rootsrc/core/base/inc/TVirtualRWMutex.h:79
#13 0x00007f1332b7a3c8 in TLockGuard::TLockGuard (this=0x7fffa6ac5868, mutex=0x7f13102006f0)
    at /home/vpadulan/programs/rootproject/rootsrc/core/base/inc/TVirtualMutex.h:80
#14 0x00007f1332a80aa6 in TCling::AutoParse (this=0x56290befde40, cls=0x7fffa6ac5d40 "_facade")
    at /home/vpadulan/programs/rootproject/rootsrc/core/metacling/src/TCling.cxx:6452
#15 0x00007f1332a6667d in TCling__AutoParseCallback (className=0x7fffa6ac5d40 "_facade")

Thread 21:

#0  0x00007f1332cbafdc in cling::IncrementalExecutor::executeWrapper(llvm::StringRef, cling::Value*) const ()
   from /home/vpadulan/programs/rootproject/rootbuild/master-distrdf-debug/lib/libCling.so
#1  0x00007f1332c44ab1 in cling::Interpreter::RunFunction(clang::FunctionDecl const*, cling::Value*) ()
   from /home/vpadulan/programs/rootproject/rootbuild/master-distrdf-debug/lib/libCling.so
#2  0x00007f1332c451a3 in cling::Interpreter::EvaluateInternal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::CompilationOptions, cling::Value*, cling::Transaction**, unsigned long) ()
   from /home/vpadulan/programs/rootproject/rootbuild/master-distrdf-debug/lib/libCling.so
#3  0x00007f1332c45b1a in cling::Interpreter::evaluate(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::Value&) () from /home/vpadulan/programs/rootproject/rootbuild/master-distrdf-debug/lib/libCling.so
#4  0x00007f1332a74e9d in TCling::Calc (this=0x56290befde40, 
    line=0x7f12fa0f8290 "ROOT::Internal::RDF::CallBuildAction<ROOT::Internal::RDF::ActionTags::Histo1D, Int_t>(reinterpret_cast<std::shared_ptr<ROOT::Detail::RDF::RNodeBase>*>(0x7f130be1a880), new const char*[1]{\"b1\"}, 1, 1, "..., error=0x7f131ad40bdc)
    at /home/vpadulan/programs/rootproject/rootsrc/core/metacling/src/TCling.cxx:3556
#5  0x00007f132bf6da60 in operator() (__closure=0x7f131ad40bc0, 
    codeSlice="ROOT::Internal::RDF::CallBuildAction<ROOT::Internal::RDF::ActionTags::Histo1D, Int_t>(reinterpret_cast<std::shared_ptr<ROOT::Detail::RDF::RNodeBase>*>(0x7f130be1a880), new const char*[1]{\"b1\"}, 1, 1, "...)
    at /home/vpadulan/programs/rootproject/rootsrc/tree/dataframe/src/RDFUtils.cxx:340
#6  0x00007f132bf6def9 in ROOT::Internal::RDF::InterpreterCalc (
    code="ROOT::Internal::RDF::CallBuildAction<ROOT::Internal::RDF::ActionTags::Histo1D, Int_t>(reinterpret_cast<std::shared_ptr<ROOT::Detail::RDF::RNodeBase>*>(0x7f130be1a880), new const char*[1]{\"b1\"}, 1, 1, "..., context="RLoopManager::Run")

As a first instinct, this hints at some possible contention between the different things cling is doing in the two threads. For the moment I cannot come up with a simpler reproducer; see the next section for more details.

How to reproduce

Here is the patch that adds the print statements showing the PID (as .txt so that it can be attached to this issue):
0001-Print-statements-for-DistRDF-Cling-failure.txt

The following is a Python script with the test:

import os

import pyspark
from dask.distributed import Client, LocalCluster

import ROOT
RunGraphs = ROOT.RDF.Experimental.Distributed.RunGraphs
SparkDF = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame
DaskDF = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame
# verbosity = ROOT.Experimental.RLogScopedVerbosity(ROOT.Detail.RDF.RDFLogChannel(), ROOT.Experimental.ELogLevel.kDebug+10)

def connection():
    daskconn = Client(LocalCluster(n_workers=2, threads_per_worker=1, processes=True, memory_limit="2GiB"))

    conf = {"spark.master": "local[2]", "spark.driver.memory": "4g", "spark.app.name": "roottest-distrdf-common"}
    sparkconf = pyspark.SparkConf().setAll(conf.items())
    sparkconn = pyspark.SparkContext(conf=sparkconf)

    return daskconn, sparkconn


def test_rungraphs_sparkanddask_3histos(daskconn, sparkconn):
    """
    Submit three different RDF graphs concurrently to Spark and Dask
    """
    # Create a test file for processing
    treename = "myTree"
    filename = "2clusters.root"
    nentries = 10000
    opts = ROOT.RDF.RSnapshotOptions()
    opts.fAutoFlush = 5000
    ROOT.RDataFrame(nentries).Define("b1", "42")\
        .Define("b2", "42")\
        .Define("b3", "42")\
        .Snapshot(treename, filename, ["b1", "b2", "b3"], opts)

    histoproxies_spark = [
        SparkDF(treename, filename, sparkcontext=sparkconn, npartitions=2)
        .Histo1D((col, col, 1, 40, 45), col)
        for col in ["b1", "b2", "b3"]
    ]

    histoproxies_dask = [
        DaskDF(treename, filename, daskclient=daskconn, npartitions=2)
        .Histo1D((col, col, 1, 40, 45), col)
        for col in ["b1", "b2", "b3"]
    ]

    histoproxies = histoproxies_spark + histoproxies_dask

    # Before triggering the computation graphs values are None
    for proxy in histoproxies:
        assert proxy.proxied_node.value is None

    RunGraphs(histoproxies)

    # After RunGraphs all histograms are correctly assigned to the
    # node objects
    for proxy in histoproxies:
        histo = proxy.proxied_node.value
        assert isinstance(histo, ROOT.TH1D)
        assert histo.GetEntries() == nentries
        assert histo.GetMean() == 42

    os.remove(filename)


def main():
    test_rungraphs_sparkanddask_3histos(*connection())

if __name__ == "__main__":
    raise SystemExit(main())

And here is a bash script that runs the previous Python script in a loop until it fails:

counter=0
while test $? -eq 0
do 
  counter=$(($counter+1))
  echo -e "\nRunning test $counter\n"
  CLING_DEBUG=1 python test.py
done

# Remove leftover file from the test when it fails
rm 2clusters.root

Setup

ROOT master@b13756d
Python 3.10.7
Java 19
pyspark 3.3.0
dask/distributed 2022.7.1

ROOT built with:

cmake -GNinja -Dccache=ON -Dminimal=ON -Dtesting=ON -Droottest=ON -Dpyroot=ON -Ddataframe=ON -Dxrootd=ON -Dtest_distrdf_pyspark=ON -Dtest_distrdf_dask=ON -DCMAKE_BUILD_TYPE=Debug -Dimt=ON

Additional context

I am tentatively assigning this also to @Axel-Naumann @vgvassilev @jalopezg-r00t, who may have an easier time understanding the output from gdb and can help in creating a simpler reproducer.

@vepadulano changed the title from "Segfault with no output stack in seemingly random distributed rdf test execution" to "Error in TInterpreter::Calc with no output stack in seemingly random distributed rdf test execution" on Oct 7, 2022
@ferdymercury
Collaborator

Thread 1

The trace of this thread looks quite similar to the one at #8365 (comment), but it's probably just an unrelated issue.

@pcanal
Member

pcanal commented Oct 7, 2022

For completeness, what are the complete stack traces for threads 1 and 21?

@vepadulano
Member Author

Complete stacktrace of thread 1

stacktrace_thread1.txt

Complete stacktrace of thread 21

stacktrace_thread21.txt

@pcanal
Member

pcanal commented Oct 7, 2022

Can you also instrument IncrementalExecutor::diagnoseUnresolvedSymbols to print its error to std::cout so we get more details?

@pcanal
Member

pcanal commented Oct 7, 2022

Most likely the problem is that TClingClassInfo::IsEnum is missing an R__LOCKGUARD(gInterpreterMutex); before the creation of the stack-allocated TClingClassInfo.
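
For illustration, the kind of change being suggested might look roughly like this (a sketch only; the body of TClingClassInfo::IsEnum is paraphrased, not copied from the ROOT sources):

// Sketch of the suggested fix: take the interpreter lock before constructing
// the temporary TClingClassInfo, so the lookup cannot race with another thread
// that is using the interpreter (e.g. the TCling::AutoParse seen in thread 1).
bool TClingClassInfo::IsEnum(TInterpreter *interp, const char *name)
{
   R__LOCKGUARD(gInterpreterMutex);    // the guard presumed to be missing
   TClingClassInfo info(interp, name); // the stack TClingClassInfo mentioned above
   return info.IsValid() && (info.Property() & kIsEnum);
}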

@vepadulano
Member Author

Can you also instrument IncrementalExecutor::diagnoseUnresolvedSymbols to print its error to std::cout so we get more details?

I converted instances of cling::errs to cling::outs; no output is shown from that function when the error appears.
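
For illustration, the instrumentation amounts to redirecting the diagnostic stream from stderr to stdout; a hypothetical snippet of that kind of change (not the actual diagnoseUnresolvedSymbols code):

// Hypothetical illustration only: write the diagnostic to stdout instead of
// stderr, so the message is not lost if stderr is swallowed somewhere.
#include "cling/Utils/Output.h"
#include "llvm/ADT/StringRef.h"

void reportUnresolvedSketch(llvm::StringRef mangledName) {
  cling::outs() << "IncrementalExecutor: symbol '" << mangledName
                << "' unresolved while jitting!\n";
}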

Most likely the problem is that TClingClassInfo::IsEnum is missing a R__LOCKGUARD(gInterpreterMutex); before the creation of the stack TClingClassInfo

This actually did something: now I quite consistently get a segfault, seemingly triggered by Python, with this stacktrace:

Thread 1 (Thread 0x7f4d67fcc740 (LWP 53296) "python"):
#0  0x00007f4d68189bdf in wait4 () from /lib64/libc.so.6
#1  0x00007f4d680f8a1b in do_system () from /lib64/libc.so.6
#2  0x00007f4d2f521370 in TUnixSystem::Exec (this=0x55bd15241770, shellcmd=0x55bd2b060400 "/home/vpadulan/programs/rootproject/rootbuild/master-distrdf-debug/etc/gdb-backtrace.sh 53296 1>&2") at /home/vpadulan/programs/rootproject/rootsrc/core/unix/src/TUnixSystem.cxx:2104
#3  0x00007f4d2f521c11 in TUnixSystem::StackTrace (this=0x55bd15241770) at /home/vpadulan/programs/rootproject/rootsrc/core/unix/src/TUnixSystem.cxx:2395
#4  0x00007f4d2f8665b4 in (anonymous namespace)::do_trace (sig=5) at /home/vpadulan/programs/rootproject/rootsrc/bindings/pyroot/cppyy/cppyy-backend/clingwrapper/src/clingwrapper.cxx:182
#5  0x00007f4d2f866644 in (anonymous namespace)::TExceptionHandlerImp::HandleException (this=0x55bd15d7d000, sig=5) at /home/vpadulan/programs/rootproject/rootsrc/bindings/pyroot/cppyy/cppyy-backend/clingwrapper/src/clingwrapper.cxx:201
#6  0x00007f4d2f5255f9 in TUnixSystem::DispatchSignals (this=0x55bd15241770, sig=kSigAbort) at /home/vpadulan/programs/rootproject/rootsrc/core/unix/src/TUnixSystem.cxx:3610
#7  0x00007f4d2f51d544 in SigHandler (sig=kSigAbort) at /home/vpadulan/programs/rootproject/rootsrc/core/unix/src/TUnixSystem.cxx:395
#8  0x00007f4d2f52554f in sighandler (sig=6) at /home/vpadulan/programs/rootproject/rootsrc/core/unix/src/TUnixSystem.cxx:3586
#9  <signal handler called>
#10 0x00007f4d6813bc4c in __pthread_kill_implementation () from /lib64/libc.so.6
#11 0x00007f4d680eb9c6 in raise () from /lib64/libc.so.6
#12 0x00007f4d680d57f4 in abort () from /lib64/libc.so.6
#13 0x00007f4d683a3f3d in _Py_FatalErrorFormat () from /lib64/libpython3.10.so.1.0
#14 0x00007f4d683532d8 in _enter_buffered_busy.cold () from /lib64/libpython3.10.so.1.0
#15 0x00007f4d6846b95e in buffered_flush () from /lib64/libpython3.10.so.1.0
#16 0x00007f4d683d0c24 in method_vectorcall_NOARGS () from /lib64/libpython3.10.so.1.0
#17 0x00007f4d683dfdae in PyObject_VectorcallMethod () from /lib64/libpython3.10.so.1.0
#18 0x00007f4d6846b70e in _io_TextIOWrapper_flush () from /lib64/libpython3.10.so.1.0
#19 0x00007f4d683d0c24 in method_vectorcall_NOARGS () from /lib64/libpython3.10.so.1.0
#20 0x00007f4d683dfdae in PyObject_VectorcallMethod () from /lib64/libpython3.10.so.1.0
#21 0x00007f4d6846e605 in flush_std_files () from /lib64/libpython3.10.so.1.0
#22 0x00007f4d6837f800 in fatal_error.lto_priv () from /lib64/libpython3.10.so.1.0
#23 0x00007f4d683a3ffa in _Py_FatalErrorFormat () from /lib64/libpython3.10.so.1.0
#24 0x00007f4d683532d8 in _enter_buffered_busy.cold () from /lib64/libpython3.10.so.1.0
#25 0x00007f4d6846b95e in buffered_flush () from /lib64/libpython3.10.so.1.0
#26 0x00007f4d683d0c24 in method_vectorcall_NOARGS () from /lib64/libpython3.10.so.1.0
#27 0x00007f4d683dfdae in PyObject_VectorcallMethod () from /lib64/libpython3.10.so.1.0
#28 0x00007f4d6846b70e in _io_TextIOWrapper_flush () from /lib64/libpython3.10.so.1.0
#29 0x00007f4d683d0c24 in method_vectorcall_NOARGS () from /lib64/libpython3.10.so.1.0
#30 0x00007f4d683dfdae in PyObject_VectorcallMethod () from /lib64/libpython3.10.so.1.0
#31 0x00007f4d6846e605 in flush_std_files () from /lib64/libpython3.10.so.1.0
#32 0x00007f4d6846e1ed in Py_FinalizeEx () from /lib64/libpython3.10.so.1.0
#33 0x00007f4d6846e44c in Py_Exit () from /lib64/libpython3.10.so.1.0
#34 0x00007f4d6846bdeb in handle_system_exit () from /lib64/libpython3.10.so.1.0
#35 0x00007f4d6846bafc in PyErr_PrintEx () from /lib64/libpython3.10.so.1.0
#36 0x00007f4d6846393e in _PyRun_SimpleFileObject () from /lib64/libpython3.10.so.1.0
#37 0x00007f4d68463718 in _PyRun_AnyFileObject () from /lib64/libpython3.10.so.1.0
#38 0x00007f4d684608dc in Py_RunMain () from /lib64/libpython3.10.so.1.0
#39 0x00007f4d6842cd5b in Py_BytesMain () from /lib64/libpython3.10.so.1.0
#40 0x00007f4d680d6550 in __libc_start_call_main () from /lib64/libc.so.6
#41 0x00007f4d680d6609 in __libc_start_main_impl () from /lib64/libc.so.6
#42 0x000055bd13dec095 in _start ()
===========================================================

@pcanal
Member

pcanal commented Oct 7, 2022

Any chance to run the failing process in valgrind?

@vepadulano
Member Author

Any chance to run the failing process in valgrind?

I don't have any control over the failing process itself; it's a task run inside the dask worker and its scheduling is out of my hands.

In other news, I ran a few more tests focusing on the version of the dask distributed module. In the issue description above I stated distributed==2022.7.1; that was my starting version for the tests described in the following.

First, I noticed this type of error happening between one test and the next:

2022-10-08 02:33:36,474 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f946cbb8340>>, <Task finished name='Task-245' coro=<SpecCluster._correct_state_internal() done, defined at /home/vpadulan/.local/lib/python3.10/site-packages/distributed/deploy/spec.py:330> exception=RuntimeError('cannot schedule new futures after shutdown')>)
Traceback (most recent call last):
  File "/home/vpadulan/.local/lib/python3.10/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/home/vpadulan/.local/lib/python3.10/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
RuntimeError: cannot schedule new futures after shutdown

Initially I thought these errors were just a by-product of the TInterpreter errors leaving the dask worker in a bad state and thus breaking the scheduler too. But then I updated to distributed==2022.8.1 and they were no longer present.

So I went on and used git bisect starting from 2022.8.0 onwards.

I was able to identify the following scenarios, depending on the combination of the status of ROOT and the version of the dask distributed module:

  • ROOT current master, with distributed==2022.8.0: errors like those shown above, both those coming from TInterpreter::Calc in RDF and those from dask about cannot schedule new futures after shutdown mentioned in this comment.
  • ROOT current master, with distributed==2022.8.1: no more errors about futures after shutdown; the errors from TInterpreter::Calc remain.
  • ROOT master+fix, with distributed==2022.8.0: no more errors from TInterpreter::Calc, but still errors from dask about cannot schedule new futures after shutdown, and also the segfaults with the Python-only stacktrace I wrote about in a previous comment.
  • ROOT master+fix, with distributed==2022.8.1: no errors at all!

So it seems that a combination of the fix suggested by Philippe (an R__LOCKGUARD in TClingClassInfo::IsEnum) plus updating to distributed==2022.8.1 fixes all our problems.

More details about the distributed versions

Thanks to the git bisect I could identify more closely the sources of errors on the dask side.

For the very weird Python stacktrace leading to a full segfault, I found that dask/distributed#6684 seems to fix it. In test runs where the segfault is triggered, before getting it (and the Python stacktrace), I can see these lines:

RuntimeError: cannot schedule new futures after shutdown
Exception in thread AsyncProcess Dask Worker process (from Nanny) watch process join:
Traceback (most recent call last):
  File "/usr/lib64/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/vpadulan/.local/lib/python3.10/site-packages/distributed/process.py", line 238, in _watch_process
    assert exitcode is not None
AssertionError

The line assert exitcode is not None in process.py is exactly what is being changed by the linked PR above.

For the more generic error RuntimeError: cannot schedule new futures after shutdown, which happens practically every time the test ends, at shutdown of the dask cluster object, the proper fix seems to have been provided by dask/distributed#6847.

This commit seems to fix (at least for the case of a dask LocalCluster) the long-standing issue dask/distributed#6846.

vepadulano added a commit to vepadulano/root that referenced this issue Oct 8, 2022
Fixes root-project#11515

This method leads to contention in some specific scenarios (see linked
issue).

Co-authored-by: Philippe Canal <pcanal@fnal.gov>
pcanal added a commit that referenced this issue Oct 11, 2022
Fixes #11515

This method leads to contention in some specific scenarios (see linked
issue).

Co-authored-by: Philippe Canal <pcanal@fnal.gov>
@github-actions

Hi @pcanal, @Axel-Naumann, @vgvassilev, @vepadulano, @jalopezg-r00t,

It appears this issue is closed, but wasn't yet added to a project. Please add upcoming versions that will include the fix, or 'not applicable' otherwise.

Sincerely,
🤖

@vepadulano
Member Author

There's a second part to fixing this issue for good, i.e. also making sure we don't see weird crashes/segfaults due to dask. That involves bumping the minimum dask version to 2022.8.1, which in turn requires a minimum Python version of 3.8.

vepadulano added a commit to vepadulano/root that referenced this issue Oct 12, 2022
This is needed to avoid major bugs in dask, see
root-project#11515 for details.
vepadulano added a commit that referenced this issue Oct 12, 2022
This is needed to avoid major bugs in dask, see
#11515 for details.
vepadulano added several more commits referencing this issue on Oct 18 and Oct 20, 2022, all carrying the same "Fixes #11515" message as the one above.