This repository has been archived by the owner on Apr 7, 2022. It is now read-only.

DS crashes when unsubscribing from events in __del__ #292

Closed
reszelaz opened this issue Jul 19, 2019 · 41 comments


@reszelaz
Contributor

Hi PyTango experts,

I would like to report an issue that we found in Sardana project after migrating it to Python 3.

First of all, sorry for the complexity of the examples (one uses pure PyTango and the other Taurus) needed to reproduce it, but I was not able to reduce them further. Also sorry if the problem turns out to be not in PyTango but in Tango, but I don't know how to reproduce it with Tango C++.

I think all the necessary information and the steps to reproduce it are in https://github.com/reszelaz/test-tango-py3.

If you have any questions, don't hesitate to ask.

Cheers,
Zibi

@jkotan
Contributor

jkotan commented Jul 19, 2019

Hi Zibi,

Have you tried calling the content of the __del__ method (unsubscribing the events) explicitly from the thread, before __del__ is called? Do you get the same error?
It is not defined when the GC calls __del__, so it is better not to place any serious code there.
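
For illustration, a minimal sketch of explicit unsubscription driven by the owning code rather than by the GC (the wrapper class and its dispose() method are hypothetical, not Sardana/Taurus API):

import tango

class AttributeWrapper(object):
    def __init__(self, dev_name, attr_name):
        self.proxy = tango.DeviceProxy(dev_name)
        # subscribe to change events; keep the id so we can unsubscribe later
        self.event_id = self.proxy.subscribe_event(
            attr_name, tango.EventType.CHANGE_EVENT, self._on_event)

    def _on_event(self, event):
        pass  # handle the event data here

    def dispose(self):
        # called explicitly from the thread that owns the wrapper,
        # instead of relying on __del__
        if self.event_id is not None:
            self.proxy.unsubscribe_event(self.event_id)
            self.event_id = None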

reszelaz added a commit to reszelaz/sardana that referenced this issue Jul 20, 2019
Use DeviceProxy instead of taurus to avoid crashes in Py3
See: tango-controls/pytango#292

To be reverted when the above issue gets clarified
@ajoubertza
Member

I've started looking at this. Don't know what the cause of the crash is yet, but I don't think unsubscribe_event would be executed in Python 2.

The PyTangoDevice and PyTangoAttribute create a circular reference. Prior to Python 3.4, unreachable objects with __del__ methods that are in a reference cycle are not released (they are "uncollectable"), so the __del__ method isn't called. Newer versions of Python will release those objects and call __del__.

Python 2.7
(__del__ not called, and weakref is still alive at the end):

$ python
Python 2.7.6 (default, Nov 13 2018, 12:45:42) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gc
>>> import weakref
>>> 
>>> class PyTangoDevice(object):
...     def __init__(self, name):
...         print("PyTangoDevice.__init__")
...         self._state = self.getAttribute("state")
...     def getAttribute(self, name):
...         return PyTangoAttribute(name, self)
... 
>>> class PyTangoAttribute(object):
...     def __init__(self, name, dev):
...         print("PyTangoAttribute.__init__")
...         self._dev = dev
...     def __del__(self):
...         print("PyTangoAttribute.__del__")
... 
>>> dev = PyTangoDevice('abc')
PyTangoDevice.__init__
PyTangoAttribute.__init__
>>> wr = weakref.ref(dev)
>>> wr
<weakref at 0x7f3842fb4f18; to 'PyTangoDevice' at 0x7f3842fcb890>
>>> dev = None
>>> wr
<weakref at 0x7f3842fb4f18; to 'PyTangoDevice' at 0x7f3842fcb890>
>>> gc.collect()
4
>>> gc.garbage
[<__main__.PyTangoAttribute object at 0x7f3842fcb910>]
>>> wr
<weakref at 0x7f3842fb4f18; to 'PyTangoDevice' at 0x7f3842fcb890>
>>> 

Python 3.4
(__del__ is called, and weakref is dead at the end):

$ python3
Python 3.4.3 (default, Nov 12 2018, 22:25:49) 
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gc
>>> import weakref
>>> 
>>> class PyTangoDevice(object):
...     def __init__(self, name):
...         print("PyTangoDevice.__init__")
...         self._state = self.getAttribute("state")
...     def getAttribute(self, name):
...         return PyTangoAttribute(name, self)
... 
>>> class PyTangoAttribute(object):
...     def __init__(self, name, dev):
...         print("PyTangoAttribute.__init__")
...         self._dev = dev
...     def __del__(self):
...         print("PyTangoAttribute.__del__")
... 
>>> dev = PyTangoDevice('abc')
PyTangoDevice.__init__
PyTangoAttribute.__init__
>>> wr = weakref.ref(dev)
>>> wr
<weakref at 0x7ffa87a6bc28; to 'PyTangoDevice' at 0x7ffa87a7c048>
>>> dev = None
>>> wr
<weakref at 0x7ffa87a6bc28; to 'PyTangoDevice' at 0x7ffa87a7c048>
>>> gc.collect()
PyTangoAttribute.__del__
4
>>> gc.garbage
[]
>>> wr
<weakref at 0x7ffa87a6bc28; dead>
>>> 

@bourtemb
Member

bourtemb commented Sep 11, 2019

Hi @reszelaz,

Looking at the backtrace in the https://github.com/reszelaz/test-tango-py3 README, it looks like an exception is thrown when trying to acquire a Tango monitor in your case
(Not able to acquire serialization (dev, class or process) monitor).
Only zmq::error_t exceptions seem to be caught in this case, in the DelayEvent constructor.
We could probably also catch the errors related to Tango subscription monitor acquisition, and then it would probably not crash, but one question remains: why do you get this timeout when trying to acquire this subscription monitor in your case?

@reszelaz
Contributor Author

First of all, thanks @jkotan, @ajoubertza and @bourtemb for looking into this!
Also sorry for the long silence on my side - it was due to holidays and the rush to finish the migration of Sardana to Py3.

Have you tried calling the content of the __del__ method (unsubscribing the events) explicitly from the thread, before __del__ is called? Do you get the same error?

I have done explicit unsubscribes and, from my tests, it looks like the problem disappears. You can see the corresponding code in test-tango-py3/explicit_unsub. This avoids the unsubscription in the __del__ method.

It is not defined when the GC calls __del__, so it is better not to place any serious code there.

Taurus attributes subscribe to configuration events in __init__ and unsubscribe in __del__. These are singleton objects kept in the factory as long as someone uses them. Whenever they are no longer needed, they should be disposable by the GC. Maybe Python context management could be an alternative for that, but I imagine this would be a bigger refactoring. Anyway, I will pass this suggestion on to @cpascual so he can evaluate it. Thanks!
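
As a purely illustrative sketch (not Taurus API), deterministic subscription management via a context manager could look roughly like this, assuming a plain tango.DeviceProxy and a configuration-event callback:

import tango

class AttributeSubscription(object):
    """Hypothetical context manager owning one event subscription."""

    def __init__(self, proxy, attr_name, callback):
        self.proxy = proxy
        self.attr_name = attr_name
        self.callback = callback
        self.event_id = None

    def __enter__(self):
        self.event_id = self.proxy.subscribe_event(
            self.attr_name, tango.EventType.ATTR_CONF_EVENT, self.callback)
        return self

    def __exit__(self, exc_type, exc, tb):
        # unsubscription happens deterministically here, never in __del__
        self.proxy.unsubscribe_event(self.event_id)

# usage sketch:
# with AttributeSubscription(proxy, "state", cb):
#     ...  # work with the attribute while subscribed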

why do you get this timeout when trying to acquire this subscription monitor in your case?

We observe this problem when iterating over attributes and continuously subscribing and unsubscribing from events. The thing is that the unsubscription is triggered by the garbage collector. I suspect these unsubscriptions somehow collide with the subscriptions of the other attributes.

@reszelaz reszelaz changed the title DS crashes when unsubscribing from events in __dell__ DS crashes when unsubscribing from events in __del__ Sep 16, 2019
@mliszcz

mliszcz commented Oct 10, 2019

Thanks @bourtemb and @reszelaz for working on this today.

As promised, here is a summary of our findings:

  1. Log with extra debugs:
TAURUS #3 subscribe_event, attr state, thread: <JobThread(Thread-3, started daemon 140624837715712)>
DBG Locking DelayEvent #2: thread: 0
DBG:3795 get_monitor try
DBG:3797 get_monitor ok
TAURUS #4 unsubscribe_event, id 542, thread: <JobThread(Thread-3, started daemon 140624837715712)>
DBG Locking DelayEvent #3: thread: 0x7fe5a8045350
DBG:3795 get_monitor try
DBG: DelayEvent: Not able to acquire serialization (dev, class or process) monitor
Tango exception
Severity = ERROR
Error reason = API_CommandTimedOut
Desc : Not able to acquire serialization (dev, class or process) monitor
Origin : TangoMonitor::get_monitor

DBG Locking DelayEvent #3: thread: 0x7fe5a8045350

subscribe_event is not finished yet when unsubscribe_event is called from Taurus. unsubscribe_event fails to acquire the monitor (in DelayEvent). Both subscribe and unsubscribe are called from the same Python Thread object. During subscribe, no omni_thread is configured.

Hypothesis 1:
unsubscribe_event is called at some point due to a GC run in a dedicated thread (but an incorrect thread name is printed). Maybe the GC thread is starved until the monitor timeout expires.

Hypothesis 2:
There is a deadlock in the monitor code since it assumes that the threads are omni-orb threads. Two different non-omni-orb threads are considered by the monitor to be the same thread.

Anyway, calling subscribe_event/unsubscribe_event from non-omni-orb threads (like Python threads) is not supported by the cppTango monitor (probably we should open an issue for fixing it / printing a warning / documenting it).

With the patch below I am unable to reproduce the issue. It's not complete - you should still delete the ensure_self object when the thread terminates. The proposed solution is to expose a similar API in PyTango.

diff --git a/taurus/PyDsExpClient.py b/taurus/PyDsExpClient.py
index a88a924..7c2c59c 100644
--- a/taurus/PyDsExpClient.py
+++ b/taurus/PyDsExpClient.py
@@ -7,6 +7,20 @@ from taurus.core.tango import TangoDevice

 import PyTango

+import cffi
+ffibuilder = cffi.FFI()
+ffibuilder.cdef("void* omni_thread_ensure_self();")
+lib = ffibuilder.verify(r"""
+    #include <omnithread.h>
+    #include <cstdio>
+    void* omni_thread_ensure_self()
+    {
+        return new omni_thread::ensure_self;
+    }
+    """,
+    source_extension='.cpp',
+    libraries=["omnithread"]
+)

 DEV_NAME_PATTERN = "test/pydsexp/{}"

@@ -30,6 +44,7 @@ class JobThread(threading.Thread):
         self.dev = dev

     def run(self):
+        es = lib.omni_thread_ensure_self()
         for i in range(100):
             if self.dev and self.dev._stop_flag:
                 break

@reszelaz
Contributor Author

Thanks to you two for investigating this!

I just tried to run the JobThread.run method directly in the Start command (without starting a new thread) and the problem disappears. So, if subscribe/unsubscribe are called from the omniorb thread then everything works. I think it also rules out Hypothesis 2, because if the finalizer __del__ were called from another thread it would still crash. Do you agree?

Then a question to the PyTango maintainers: could we have a convenient API for calling omni_thread_ensure_self?
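
For reference, the API that PyTango ends up exposing for this (discussed further down in this thread) is a context manager that registers the calling thread with omniORB; used in a worker thread it looks roughly like this sketch:

import threading
import tango

class JobThread(threading.Thread):
    def run(self):
        # register this plain Python thread with omniORB for the whole
        # lifetime of run(), so subscribe_event/unsubscribe_event are safe
        with tango.EnsureOmniThread():
            self.do_work()

    def do_work(self):
        pass  # subscriptions, unsubscriptions, device access, ...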

@mliszcz

mliszcz commented Oct 10, 2019

@bourtemb
Yesterday we talked about adding omni_thread::ensure_self es{} in cppTango at the top of subscribe_event() and unsubscribe_event(). This would result in a different omni thread ID being assigned to the same physical thread during each call to one of those methods. The problem is that e.g. TangoMonitor relies on the omni thread ID to detect re-entrant locking, so we may trigger some unexpected side effects.

How about:

  1. creating a thread-local omni_thread::ensure_self object at global scope in devapi_base.cpp or event.cpp, or as a static member of DeviceProxy:
    namespace Tango
    {
        thread_local omni_thread::ensure_self global_omni_thread_self{};
        // additionally may be declared 'extern' in some header file
        // ...
  2. access this object at the very beginning of each DeviceProxy method to ensure that it gets created.
    void EventConsumer::unsubscribe_event(int event_id)
    {
        global_omni_thread_self; // new, in each method
        // continue as usual

I think this should solve this issue (without any changes in PyTango) as well as any issues you experience with C++11 threads.

@mliszcz

mliszcz commented Oct 10, 2019

@bourtemb another option is to gradually remove the dependency on the omni_thread API, using C++11 threads or some Tango-specific class which wraps the low-level threading interface. For instance, replace omni_thread::self()->id() with:

#ifdef LINUX
    pthread_t id = pthread_self();
#else
    DWORD id = GetCurrentThreadId();
#endif

@bourtemb
Member

@bourtemb
Yesterday we talked about adding omni_thread::ensure_self es{} in cppTango at the top of subscribe_event() and unsubscribe_event(). This would result in a different omni thread ID being assigned to the same physical thread during each call to one of those methods. The problem is that e.g. TangoMonitor relies on the omni thread ID to detect re-entrant locking, so we may trigger some unexpected side effects.

I agree with you. I think we should not do that.

How about:

1. creating a thread-local `omni_thread::ensure_self` object at global scope in `devapi_base.cpp` or `event.cpp`, or as a static member of DeviceProxy:
   ```c++
   namespace Tango
   {
       thread_local omni_thread::ensure_self global_omni_thread_self{};
       // additionally may be declared 'extern' in some header file
       // ...
   ```

This thread_local feature might be very useful indeed in our use case.

2. access this object at the very beginning of each DeviceProxy method to ensure that it gets created.
   ```c++
   void EventConsumer::unsubscribe_event(int event_id)
   {
       global_omni_thread_self; // new, in each method
       // continue as usual
   ```

I think this should solve this issue (without any changes in PyTango) as well as any issues you experience with C++11 threads.

I think it's a good idea. It might actually be enough to declare it at global scope and initialize it there. It may not be necessary to add this
global_omni_thread_self;
line at the beginning of each method: as long as the ensure_self object is created for the lifetime of the current thread, I think it's fine, and if you declare and initialize it at global scope, it might be OK...
Tests are needed to verify the behaviour.

@bourtemb
Member

@bourtemb another option is to gradually remove the dependency on the omni_thread API, using C++11 threads or some Tango-specific class which wraps the low-level threading interface. For instance, replace omni_thread::self()->id() with:

#ifdef LINUX
    pthread_t id = pthread_self();

This works when using pthreads, but would this work with standard C++11 threads?

@mliszcz

mliszcz commented Oct 15, 2019

I think it's a good idea. It might actually be enough to declare it at global scope and initialize it there. It may not be necessary to add this
global_omni_thread_self;
line at the beginning of each method: as long as the ensure_self object is created for the lifetime of the current thread, I think it's fine, and if you declare and initialize it at global scope, it might be OK...
Tests are needed to verify the behaviour.

According to this SO question:
https://stackoverflow.com/questions/24253584/when-is-a-thread-local-global-variable-initialized
you must explicitly access the global variable from each thread to ensure that it gets created, i.e. the standard only requires that it be created before its first use.

#ifdef LINUX
   pthread_t id = pthread_self();

This works when using pthreads, but would this work with standard C++11 threads?

C++11 threads are just convenient wrappers around basic threading libraries. On linux systems it is probably always pthreads (not sure about BSD, Solaris and others), e.g. on Ubuntu 18.04:

/usr/include/c++/7/thread
74: typedef __gthread_t native_handle_type;

/usr/include/x86_64-linux-gnu/c++/7/bits/gthr-posix.h
47:typedef pthread_t __gthread_t;

See e.g. an example how to use pthread API to change priority of C++11 thread: https://en.cppreference.com/w/cpp/thread/thread/native_handle

The idea is to have a simple class like TangoThreadId, which will store the native thread id during construction and overload operator== for comparison. But I'm not sure if, without creating "ensure_self", you can access omni_mutex or omni_condition.

@mliszcz

mliszcz commented Oct 15, 2019

Let me answer myself:

But I'm not sure if, without creating "ensure_self", you can access omni_mutex or omni_condition.

class _OMNITHREAD_NTDLL_ omni_mutex {
public:
  omni_mutex();
  ~omni_mutex();

  inline void lock()    { OMNI_MUTEX_LOCK_IMPLEMENTATION   }
  inline void unlock()  { OMNI_MUTEX_UNLOCK_IMPLEMENTATION }

...

#define OMNI_MUTEX_LOCK_IMPLEMENTATION                  \
    pthread_mutex_lock(&posix_mutex);

#define OMNI_MUTEX_TRYLOCK_IMPLEMENTATION               \
    return !pthread_mutex_trylock(&posix_mutex);

Similar for NT threads.

So we probably need omni-thread-friendly threads (ensure_self) only for ID access and thread-local storage.

@reszelaz
Contributor Author

reszelaz commented Dec 4, 2019

Hi Michal,

I finally could try your patch proposed in #292 (comment). I tried it in the Sardana code (this is where we originally found the issue) and not in the simplified example from the initial comment of this issue.

I tried it in two different places:

  1. Placing this line:
es = lib.omni_thread_ensure_self()

in taurus.core.util.threadpool.Worker. Sardana uses this kind of worker thread, and macros are actually executed by them.

  2. Placing the same line in the sardana.macroserver.msmacromanager.MacroExecutor.__runXML - this is the job that will be queued when executing a macro.

Note that in the second scenario this line may be executed more than once by the same thread; the first one does not have this risk.
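
(If one wanted to guard against creating the handle more than once per thread, a small thread-local cache around the cffi helper could be used; this is just a sketch assuming the lib object from the patch above:)

import threading

_tls = threading.local()

def ensure_omni_self():
    # create the omniORB ensure_self handle at most once per Python thread
    if not hasattr(_tls, "omni_self"):
        _tls.omni_self = lib.omni_thread_ensure_self()
    return _tls.omni_self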

Then of course I have reverted our workaround commit so we again use disposable taurus.Attribute objects. The workaround was basically that we were directly using disposable PyTango proxies (this avoids the subscription to the configuration events done by Taurus).

Unfortunately our test suite hangs with this workaround - I ran it 5 times, always with the same result. The same test suite without any changes works correctly (we have run it for months in CI and I also repeated it now on my testing platform). Also, if I just apply your code (es = lib.omni_thread_ensure_self()) but do not revert our workaround commit, the test suite works correctly.

reszelaz added a commit to reszelaz/sardana that referenced this issue Dec 5, 2019
@reszelaz
Contributor Author

reszelaz commented Dec 5, 2019

Just to complement the previous comment, here is the branch on which I tried it. The last commit applies the workaround as explained in point 2. The previous commit is the revert of our workaround.

@mliszcz

mliszcz commented Dec 13, 2019

Hi @reszelaz, thanks for the detailed information!

I'm trying to reproduce the issue using the Sardana test suite but I do not see any crash or deadlock. I have: cppTango (tango-9-lts), pytango (v9.3.1 from git), itango (latest from pypi), taurus (latest from pypi), sardana (branch workaround_pytango292 from your fork), all running on Python 3. I started the Pool (demo1) and MacroServer (demo1), ran sar_demo from spock and then ran the sardanatestsuite script. Some tests are failing (probably a misconfiguration in my environment), but other than that no process crashed or hung.

Ran 307 tests in 513.127s
FAILED (errors=41, skipped=13, expected failures=2)

Is this the correct way to reproduce the issue? Which test case from Sardana test suite shows the problem?

@reszelaz
Contributor Author

I will answer with more details on Monday (I'm out of the office now). A similar problem happens with #318. I investigated a little bit and I suspect that the events stop working at some point. I was even able to reproduce it without running the whole test suite. We may be hitting another bug...
But don't worry, I will keep you posted with something easier to reproduce (hopefully next week). And many thanks for trying it yourself!

@reszelaz
Contributor Author

reszelaz commented Feb 5, 2020

Hi again,
Sorry for the long silence! I just checked the new PR #327 proposed by @ajoubertza and the behavior is exactly the same as with the initial workaround proposed by @mliszcz and #318 (now rejected in favor of #327). As I said previously, I was able to reproduce it with just a subset of tests - the whole sardanatestsuite is not necessary. So, the said tests just hang:

zreszela@pc255:~/workspace/sardana (workaround_pytango292)> python3 -m unittest sardana.taurus.core.tango.sardana.test.test_pool.TestMeasurementGroup
MainThread     WARNING  2020-02-05 11:38:57,014 TaurusRootLogger: <frozen importlib._bootstrap>:222: DeprecationWarning: taurus.external.unittest is deprecated since 4.3.2. Use unittest instead

MainThread     INFO     2020-02-05 11:38:57,319 Starter: Starting server Pool/unittest1_1...
/home/zreszela/workspace/taurus/lib/taurus/core/tango/starter.py:107: ResourceWarning: unclosed file <_io.BufferedWriter name='/dev/null'>
  self.start()
MainThread     INFO     2020-02-05 11:39:00,352 Starter: Server Pool/unittest1_1 has been started
MainThread     WARNING  2020-02-05 11:39:02,097 TaurusRootLogger: /home/zreszela/workspace/sardana/src/sardana/taurus/core/tango/sardana/pool.py:220: DeprecationWarning: _get_value is deprecated since 4.0. Use .rvalue instead
  v = evt_value.value

MainThread     INFO     2020-02-05 11:39:02,106 pc255.cells.es:10000.b95a5064-4803-11ea-bc3a-f8b156a48f79: Configuration changed
MainThread     WARNING  2020-02-05 11:39:02,107 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread     INFO     2020-02-05 11:39:02,485 Starter: Deleted device pool/demo1/2
MainThread     INFO     2020-02-05 11:39:02,485 Starter: Hard killing server Pool/unittest1_1...
MainThread     INFO     2020-02-05 11:39:02,496 Starter: Server Pool/unittest1_1 has been stopped
MainThread     INFO     2020-02-05 11:39:02,499 Starter: Deleted Server Pool/unittest1_1
.MainThread     INFO     2020-02-05 11:39:02,509 Starter: Starting server Pool/unittest1_1...
MainThread     INFO     2020-02-05 11:39:05,522 Starter: Server Pool/unittest1_1 has been started
MainThread     INFO     2020-02-05 11:39:06,877 pc255.cells.es:10000.bc3c3770-4803-11ea-bc3a-f8b156a48f79: Configuration changed
MainThread     WARNING  2020-02-05 11:39:06,879 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread     WARNING  2020-02-05 11:39:06,897 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread     INFO     2020-02-05 11:39:07,234 Starter: Deleted device pool/demo1/2
MainThread     INFO     2020-02-05 11:39:07,235 Starter: Hard killing server Pool/unittest1_1...
MainThread     INFO     2020-02-05 11:39:07,246 Starter: Server Pool/unittest1_1 has been stopped
MainThread     INFO     2020-02-05 11:39:07,249 Starter: Deleted Server Pool/unittest1_1
.MainThread     INFO     2020-02-05 11:39:07,258 Starter: Starting server Pool/unittest1_1...
MainThread     INFO     2020-02-05 11:39:09,268 Starter: Server Pool/unittest1_1 has been started
MainThread     INFO     2020-02-05 11:39:10,640 pc255.cells.es:10000.be73242c-4803-11ea-bc3a-f8b156a48f79: Configuration changed
MainThread     WARNING  2020-02-05 11:39:10,642 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread     WARNING  2020-02-05 11:39:10,662 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread     INFO     2020-02-05 11:39:11,030 Starter: Deleted device pool/demo1/2
MainThread     INFO     2020-02-05 11:39:11,030 Starter: Hard killing server Pool/unittest1_1...
MainThread     INFO     2020-02-05 11:39:11,039 Starter: Server Pool/unittest1_1 has been stopped
MainThread     INFO     2020-02-05 11:39:11,041 Starter: Deleted Server Pool/unittest1_1
.MainThread     INFO     2020-02-05 11:39:11,049 Starter: Starting server Pool/unittest1_1...
MainThread     INFO     2020-02-05 11:39:14,062 Starter: Server Pool/unittest1_1 has been started
MainThread     INFO     2020-02-05 11:39:15,449 pc255.cells.es:10000.c158a87e-4803-11ea-bc3a-f8b156a48f79: Configuration changed
MainThread     WARNING  2020-02-05 11:39:15,451 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread     WARNING  2020-02-05 11:39:15,466 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread     INFO     2020-02-05 11:39:15,898 Starter: Deleted device pool/demo1/2
MainThread     INFO     2020-02-05 11:39:15,898 Starter: Hard killing server Pool/unittest1_1...
MainThread     INFO     2020-02-05 11:39:15,910 Starter: Server Pool/unittest1_1 has been stopped
MainThread     INFO     2020-02-05 11:39:15,913 Starter: Deleted Server Pool/unittest1_1
.MainThread     INFO     2020-02-05 11:39:15,921 Starter: Starting server Pool/unittest1_1...
MainThread     INFO     2020-02-05 11:39:18,936 Starter: Server Pool/unittest1_1 has been started
MainThread     INFO     2020-02-05 11:39:20,318 pc255.cells.es:10000.c43f79a0-4803-11ea-bc3a-f8b156a48f79: Configuration changed
MainThread     WARNING  2020-02-05 11:39:20,320 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
MainThread     WARNING  2020-02-05 11:39:20,336 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
Dummy-1        INFO     2020-02-05 11:39:34,079 pc255.cells.es:10000.c43f79a0-4803-11ea-bc3a-f8b156a48f79.configuration: Activating polling. Reason: API_EventTimeout

As we can see, at the very end, the client (python -m unittest ....) receives "API_EventTimeout".

Before continuing, let me explain a little bit what this TestMeasurementGroup does. There is just one test definition and we execute it 6 times with different parameters (different experimental channels in the MeasurementGroup). Every test execution creates a Pool device server (unless there is a problem with one of the tests, it will always use the same instance name: unittest1_1) and populates it with test devices. Then it creates a MeasurementGroup with the experimental channels specified as parameters and makes an acquisition. All this is repeated 6 times. But it hangs before finishing...

When the tests are hung, the Pool device server apparently does not hang. Its devices respond to state queries (also from the MeasurementGroup device).

So, now I use:

  • PyTango branch issue-307-context-handler
  • Taurus from the develop branch with the tango.EnsureOmniThread() applied in the Worker.run:
    def run(self):
        with tango.EnsureOmniThread():
            get = self.pool.jobs.get
            while True:
                cmd, args, kw, callback, th_id, stack = get()
                if cmd:
                    self.busy = True
                    self.cmd = cmd.__name__
                    try:
                        if callback:
                            callback(cmd(*args, **kw))
                        else:
                            cmd(*args, **kw)
                    except:
                        orig_stack = "".join(format_list(stack))
                        self.error("Uncaught exception running job '%s' called "
                                   "from thread %s:\n%s",
                                   self.cmd, th_id, orig_stack, exc_info=1)
                    finally:
                        self.busy = False
                        self.cmd = ''
                else:
                    self.pool.workers.remove(self)
                    return
  • Sardana from the above-mentioned branch workaround_pytango292, but without the last commit 847785b (which is basically the original workaround proposed by @mliszcz - now it is applied in Taurus).

We also tried this workaround at one of the beamlines with much more complicated macros than these tests. The result was that the Sardana servers were hanging after prior reports of serialization monitor errors.

@ajoubertza
Member

Hi @reszelaz. This is quite a difficult problem to solve!

The Taurus thread pool implementation looks correct. I searched the Sardana code for "thread" and I do see a few other places where standard threading.Thread objects are created, e.g., PoolMonitor and DummyEventSource. Could any other threads (not using the Taurus thread pool) be involved in the code being run in this test? Maybe there are others that need EnsureOmniThread.

reszelaz added a commit to reszelaz/pytango that referenced this issue Feb 7, 2020
@reszelaz
Contributor Author

reszelaz commented Feb 7, 2020

Thanks Anton for looking into it again! I have advanced a little bit with the debugging, let's see if the following information is helpful to you...

Actually the tests (client side) that I use to reproduce this issue don't use threads. The server (Pool) which is used in these tests uses threads, but I suppose that does not matter. The PoolMonitor thread is started in the Pool but then it is immediately suspended. There are threads launched in the acquisition process. DummyEventSource is a helper class but it is not used in these tests. So, IMO, the tango.EnsureOmniThread() in the Taurus thread pool is not necessary. I have performed tests with and without it and the results are the same.

I have put some debug prints in PyTango: __DeviceProxy__subscribe_event_attrib and __DeviceProxy__unsubscribe_event. Curiously, the API_EventTimeout happens after __del__ (and the unsubscribes) are called by the garbage collector in the middle of an event subscription (in this example, while subscribing to the NbStarts attribute configuration event, unsubscribe is called on the configuration and change events of the state attributes of two other devices):

test_count_5 (sardana.taurus.core.tango.sardana.test.test_pool.TestMeasurementGroup)
count with Tango attribute ... 
MainThread     INFO     2020-02-07 15:27:37,592 Starter: Starting server Pool/unittest1_1...
MainThread     INFO     2020-02-07 15:27:39,602 Starter: Server Pool/unittest1_1 has been started
__DeviceProxy.subscribe_event_attrib(state): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(state): after self.__subscribe_event; id = 63
__DeviceProxy.subscribe_event_attrib(state): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(state): after self.__subscribe_event; id = 64
__DeviceProxy.subscribe_event_attrib(configuration): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(configuration): after self.__subscribe_event; id = 65
__DeviceProxy.subscribe_event_attrib(configuration): before self.__subscribe_event
MainThread     INFO     2020-02-07 15:27:40,931 pc255.cells.es:10000.ff4670cc-49b5-11ea-aa43-f8b156a48f79: Configuration changed
__DeviceProxy.subscribe_event_attrib(configuration): after self.__subscribe_event; id = 66
MainThread     WARNING  2020-02-07 15:27:40,933 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
__DeviceProxy.subscribe_event_attrib(state): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(state): after self.__subscribe_event; id = 67
__DeviceProxy.subscribe_event_attrib(state): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(state): after self.__subscribe_event; id = 68
MainThread     WARNING  2020-02-07 15:27:40,948 TaurusRootLogger: TangoAttributeNameValidator.getParams() is deprecated. Use getUriGroups() instead.
__DeviceProxy.subscribe_event_attrib(state): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(state): after self.__subscribe_event; id = 69
__DeviceProxy.subscribe_event_attrib(state): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(state): after self.__subscribe_event; id = 70
__DeviceProxy.subscribe_event_attrib(IntegrationTime): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(IntegrationTime): after self.__subscribe_event; id = 71
__DeviceProxy.subscribe_event_attrib(IntegrationTime): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(IntegrationTime): after self.__subscribe_event; id = 72
__DeviceProxy.subscribe_event_attrib(Moveable): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(Moveable): after self.__subscribe_event; id = 73
__DeviceProxy.subscribe_event_attrib(Moveable): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(Moveable): after self.__subscribe_event; id = 74
__DeviceProxy.subscribe_event_attrib(NbStarts): before self.__subscribe_event
__DeviceProxy__unsubscribe_event(67): before self.__unsubscribe_event
__DeviceProxy__unsubscribe_event(67): after self.__unsubscribe_event
__DeviceProxy__unsubscribe_event(68): before self.__unsubscribe_event
__DeviceProxy__unsubscribe_event(68): after self.__unsubscribe_event
__DeviceProxy__unsubscribe_event(69): before self.__unsubscribe_event
__DeviceProxy__unsubscribe_event(69): after self.__unsubscribe_event
__DeviceProxy__unsubscribe_event(70): before self.__unsubscribe_event
__DeviceProxy__unsubscribe_event(70): after self.__unsubscribe_event
__DeviceProxy.subscribe_event_attrib(NbStarts): after self.__subscribe_event; id = 75
__DeviceProxy.subscribe_event_attrib(NbStarts): before self.__subscribe_event
__DeviceProxy.subscribe_event_attrib(NbStarts): after self.__subscribe_event; id = 76
Dummy-1        INFO     2020-02-07 15:27:56,918 pc255.cells.es:10000.ff4670cc-49b5-11ea-aa43-f8b156a48f79.configuration: Activating polling. Reason: API_EventTimeout

From the full test output you can see that in the case of the previous tests the garbage collection does not happen while we subscribe. I also attach a backtrace of all threads from the hung test process.

You should be able to reproduce it very easily. It happens almost always when running:

python3 -m unittest sardana.taurus.core.tango.sardana.test.test_pool.TestMeasurementGroup -v

However, sometimes it luckily finishes even after throwing some API_EventTimeouts, but I would say that 80% of the time it hangs the test process.

It is enough to use:

  • Taurus from develop branch
  • Sardana from pytango292 branch (where our workaround to use DeviceProxies instead of taurus.Device objects is already reverted, so basically this branch as it is)
  • (optional) PyTango from reszelaz/issue-307-context-handler where I added debug prints on top of the issue-307-context-handler branch.

@ajoubertza
Member

I was also thinking about which thread the __del__ methods would be called in. I wanted to modify your test script to include the thread name in the subscription and unsubscription print messages, but I couldn't get a Docker environment set up that ran the tests at all.

Instead, I made a simple test script (no Tango code involved) that creates objects with reference cycles from a number of threads, with random sleeps in between. The __del__ methods get called from random threads, in batches. Only sometimes is the deletion done in the same thread that created the object. Python is free to call the GC after any bytecode operation.
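
A minimal sketch of that kind of script (illustrative, not the exact one used):

import random
import threading
import time

class Node(object):
    def __init__(self):
        self.self_ref = self  # reference cycle, so only the GC can reclaim it

    def __del__(self):
        print("__del__ called in", threading.current_thread().name)

def worker():
    for _ in range(100):
        Node()  # becomes cyclic garbage immediately
        time.sleep(random.uniform(0, 0.01))

threads = [threading.Thread(target=worker, name="worker-%d" % i) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()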

There's an interesting PEP related to this that hasn't been implemented yet (deferred state): https://www.python.org/dev/peps/pep-0556/ They talk about reentrancy being a problem. Even better is this blog article: https://codewithoutrules.com/2017/08/16/concurrency-python/ which warns against doing much in __del__.

have you tried to put

with tango.EnsureOmniThread():

into the __del__ method.

This might be dangerous, since the dummy omniORB ID will only last as long as the __del__ method. If the same thread calls __del__ again later it would get a new ID, which could affect the locks in cppTango that are based on the omniORB thread ID.

You can also use the new tango.is_omni_thread() if you just want to check what kind of thread it is.
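
As a purely diagnostic sketch (not a fix), such a check inside a finalizer could look like:

import tango

class PyTangoAttribute(object):
    def __del__(self):
        # report which kind of thread runs the finalizer; calling
        # unsubscribe_event from a non-omniORB thread is what gets us in trouble
        if not tango.is_omni_thread():
            print("__del__ running in a non-omniORB thread")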

@jkotan
Contributor

jkotan commented Feb 8, 2020

@ajoubertza, thanks for the nice reading material. It looks like the second article describes what is going on with Taurus/Sardana in a simple way. In py2, because of the cyclic dependencies, the content of __del__ is not called. Contrary to py2, py3 with its GC can remove objects with cyclic dependencies, but it does so asynchronously (in a "random" thread). It would be nice to find and remove those cyclic dependencies (if they exist) but that does not guarantee solving the problem. Thus one could:

  • move the __del__ content to another method so that unsubscribe is not called by the GC (my solution from July '19, but it requires refactoring the taurus/sardana code)
  • play a game with tango.EnsureOmniThread() and tango.is_omni_thread() (may be "dangerous")
  • improve tango.EnsureOmniThread() to always work with any thread (this is complicated, e.g. it may require static bookkeeping of meta-info about omni threads)

Does anyone know of any other option?

@ajoubertza
Member

Another option is not to unsubscribe at all when your client objects are finalised - that is what was happening under Python 2, since __del__ wasn't called. Won't the device eventually give up sending the events if the device proxy that subscribed is no longer connected?

@reszelaz
Contributor Author

Hi all,

Thanks Jan and Anton for investigating this issue. I see that you advanced a lot!
I will try to answer in the chronological order:

I have tried to use with tango.EnsureOmniThread() in __del__ but the result is exactly the same. Anyway, Anton already discouraged its usage. Also, I think it should not be necessary because the event consumer thread should already be an omni thread.

I was also thinking about which thread the __del__ methods would be called in. I wanted to modify your test script to include the thread name in the subscription and unsubscription print messages,

I did it last Friday; based on this, here is my vague interpretation of what is going on. There are just two threads: MainThread and Dummy-1. Dummy-1 is the one which executes the event callbacks. Then, if two conditions are met:

  • while we are subscribing in MainThread, an event arrives
  • while processing this event callback in the Dummy-1 thread, the GC calls the unsubscribes

it hangs the tests.
See the logs with thread names from the test execution.

but I couldn't get a Docker environment set up that ran tests at all.

You can try using the reszelaz/sardana-test docker image and follow the instructions from the "How to develop sardana using this image" section. I just updated the image so it uses the latest taurus from develop.
So, basically you will need to do something like this:

docker pull reszelaz/sardana-test
docker run -d --name=sardana-test -h sardana-test -v /home/zreszela/workspace/sardana:/sardana reszelaz/sardana-test
docker exec sardana-test bash -c "cd /sardana && python3 setup.py develop"
docker exec sardana-test bash -c "python3 -m unittest sardana.taurus.core.tango.sardana.test.test_pool.TestMeasurementGroup -v"

Note: /home/zreszela/workspace/sardana is the path where I cloned sardana; you will need to change it. Remember to use the branch I mentioned above.

In my case, to reproduce this issue on docker I had to add more tests to the TestMeasurementGroup class. Basically I doubled them by copying and pasting the already defined tests. With this I see API_EventTimeout on almost every execution.

Whenever you reproduce it, to repeat it again, you will need to kill the hung processes (server & tests) and clean the DB:

docker exec sardana-test bash -c "killall -9 /usr/bin/python3"
docker exec sardana-test bash -c "python3 -c 'import tango; db = tango.Database(); [db.delete_server(server) for server in db.get_server_list(\"Pool/unittest*\")]'"

I just had time to read the above blog article. Based on it, I wonder which lock we deadlock on. It must come from Tango, right? I have made some investigation and I would rather discard this PyTango lock:

self.__dict__['_subscribed_events_lock'] = threading.Lock()

Please correct me if I'm wrong, but if we are just using the MainThread and the Dummy-1 thread (Tango event consumer) and it still hangs (or produces API_EventTimeouts), then maybe we are hitting some other problem and not the one that can be fixed by the tango.EnsureOmniThread context manager?

@bourtemb
Member

bourtemb commented Feb 11, 2020

Another option is not to unsubscribe at all when your client objects are finalised - that is what was happening under Python 2, since __del__ wasn't called. Won't the device eventually give up sending the events if the device proxy that subscribed is no longer connected?

In cppTango, when a DeviceProxy object is deleted, it unsubscribes from the events already subscribed via this DeviceProxy object.
This was added in 2016 by @taurel, probably for a very good reason(?!).
So do you suggest modifying cppTango in order to remove this unsubscription in the DeviceProxy destructor?
~DeviceProxy calls unsubscribe_all_events() (https://github.com/tango-controls/cppTango/blob/9.3.3/cppapi/client/devapi_base.cpp#L2641),
which invokes several methods that take a lock which is also taken during event subscription:
https://github.com/tango-controls/cppTango/blob/9.3.3/cppapi/client/devapi_base.cpp#L2678
and
https://github.com/tango-controls/cppTango/blob/9.3.3/cppapi/client/devapi_base.cpp#L2641

Here are the lines where the lock is taken:
In ZmqEventConsumer::get_subscribed_event_ids():
https://github.com/tango-controls/cppTango/blob/9.3.3/cppapi/client/zmqeventconsumer.cpp#L3431
In EventConsumer::unsubscribe_event(int):
https://github.com/tango-controls/cppTango/blob/9.3.3/cppapi/client/event.cpp#L1870
and
https://github.com/tango-controls/cppTango/blob/9.3.3/cppapi/client/event.cpp#L1908

In EventConsumer::subscribe_event:
https://github.com/tango-controls/cppTango/blob/9.3.3/cppapi/client/event.cpp#L1246

I think what's happening in your use case is very similar to what is described in https://codewithoutrules.com/2017/08/16/concurrency-python/ in the "Reentrancy!" section.

As far as I understand, the main problem comes from the fact that we enter a situation where the garbage collector is invoked in the middle of an operation which has taken the lock (subscribe_event) and has not yet released it.
The garbage collector is trying to delete a DeviceProxy object, so it tries to unsubscribe from all the events which were subscribed via this DeviceProxy. This unsubscribe operation also tries to acquire the lock, but it cannot get it because the lock cannot be released by the code which was interrupted by the garbage collector, and that code is waiting for the garbage collector cycle to end (which will never happen). So we are indeed in a deadlock.

This is quite an annoying problem, because it should not be present in C++.
It looks like we cannot use any lock in the destructors.

The warning in the Python3 documentation is also clear on the topic:
__del__() can be invoked when arbitrary code is being executed, including from any arbitrary thread. If __del__() needs to take a lock or invoke any other blocking resource, it may deadlock as the resource may already be taken by the code that gets interrupted to execute __del__().

@schooft

schooft commented Feb 12, 2020

This blog post seems to discuss a very similar problem in the MongoDB python driver:
https://emptysqua.re/blog/pypy-garbage-collection-and-a-deadlock/
Their solution is to use a background thread that periodically checks a list for cleanup tasks. The __del__ method only appends tasks to that list. Here is a toy example:
https://github.com/ajdavis/what-to-expect-when-youre-expiring/blob/master/example_123_lock.py
That solution has been used since 2015, so it seems to be stable:
https://github.com/mongodb/mongo-python-driver/blob/master/pymongo/mongo_client.py#L707
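
A minimal sketch of that pattern (names are illustrative, not the pymongo implementation; note that for real Tango calls the worker thread would itself need to be registered with omniORB, e.g. via the EnsureOmniThread context manager discussed above):

import queue
import threading

_cleanup_queue = queue.Queue()

def _cleanup_worker():
    while True:
        proxy, event_id = _cleanup_queue.get()
        try:
            proxy.unsubscribe_event(event_id)
        except Exception:
            pass  # best-effort cleanup, never raise from the worker

threading.Thread(target=_cleanup_worker, daemon=True).start()

class PyTangoAttribute(object):
    def __init__(self, proxy, event_id):
        self._proxy = proxy
        self._event_id = event_id

    def __del__(self):
        # never call unsubscribe_event here; only hand the task over
        _cleanup_queue.put((self._proxy, self._event_id))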

Another solution might be to start a one shot timer in the __del__ method:

import threading

lock = threading.Lock()

class C:
    def __del__(self):
        def target():
            print('getting lock')
            with lock:
                print('releasing lock')
        # target()  # this deadlocks
        t = threading.Timer(0, target)
        t.start()

c = C()
with lock:
    del c

It seems to work in this simple test case, but I don't know if it is safe in general.

@reszelaz
Contributor Author

Thanks Reynald for the Tango C++ details! And Tim for joining this discussion! Things are getting clearer in my mind now.

Another option is not to unsubscribe at all when your client objects are finalised - that is what was happening under Python 2, since __del__ wasn't called. Won't the device eventually give up sending the events if the device proxy that subscribed is no longer connected?

In cppTango, when a DeviceProxy object is deleted, it unsubscribes to the events already subscribed via this DeviceProxy object.
This was added in 2016 by @taurel , probably for a very good reason(?!).
So do you suggest to modify cppTango in order to remove this unsubscription in DeviceProxy destructor?

I think that Anton here referred to removing the unsubscribes from taurus.Attribute.__del__ and not from ~DeviceProxy?

Also about:

that is what was happening under Python 2, since __del__ wasn't called

Actually, I'm not sure that in Python 2 these were not called. I have vague memories that they were called, but maybe not so often, and maybe much later in time. But I think it is not worth investigating this. If you agree, let's focus on Python 3.

So, what I tried is what Anton said (or at least what I understood:) - not to call the unsubscribes in taurus.Attribute.__del__. And it was still hanging... Here I think we hit what Reynald explains in #292 (comment) about ~DeviceProxy - it unsubscribes all events when we destruct it. I have not confirmed it by placing prints in Tango C++, but it makes sense to me. So I would discard this as a solution for Taurus.

Reynald (@bourtemb), since I'm not so fluent in C++, could you clarify for me whether, in either of these cases:

  • we subscribed and unsubscribed from all events on the DeviceProxy
  • we never subscribed to any events on the DeviceProxy

it will still try to get the lock when destructing the DeviceProxy?

I'm almost sure, but please also confirm, the lock is global per client, and not per DeviceProxy?

Now, a question more for the Python experts. When the GC calls __del__ methods in one of the threads and one of these __del__ methods waits on a lock, will Python not switch to another thread until the GC finishes, so there is no chance that the lock will be released? This would explain the observation that is still unclear to me.

The comments below Tim's first link somewhat confirm this:

GC doesn't automatically run on a separate thread - it hijacks whichever thread is active at the moment GC is triggered. But if the thread that happens to be running GC needs a lock that's held by a different thread, then there's a deadlock.

But since there are contradictory comments below, and they also mention the PyPy implementation, I'm not 100% sure yet.

@schooft

schooft commented Feb 12, 2020

I think that comment is wrong. It should deadlock when the __del__ method is executed on the same thread that holds the lock. Here is a test for that:

import threading
import gc
from time import sleep

lock = threading.Lock()

def target():
    print('target')
    with lock:
        print("thread has lock")
        sleep(1)
        gc.collect()  # this deadlocks
    print("thread released lock")

t = threading.Thread(target=target)
t.start()

class C(object):
    def __del__(self):
        print('gc getting lock')
        with lock:
            print('gc releasing lock')
            pass

c = C()
c.self = c  # create cyclic reference
del c
# gc.collect()  # this doesn't deadlock
t.join()

Depending on the thread where gc.collect is called, this script deadlocks or not.

Edit:
The methods subscribe_event and unsubscribe_event of PyTango.DeviceProxy seem to acquire a threading.Lock. Thus, when the __del__ method calling unsubscribe_event is executed by the GC on a thread that is currently executing the part of these methods that holds the lock, it deadlocks.

@bourtemb
Member

Reynald (@bourtemb), since I'm not so fluent in C++, could you clarify for me whether, in either of these cases:

* we subscribed and unsubscribed from all events on the DeviceProxy

* we never subscribed to any events on the DeviceProxy

it will still try to get the lock when destructing the DeviceProxy?

~DeviceProxy() calls unsubscribe_all_events()

unsubscribe_all_events() calls ZmqEventConsumer::get_subscribed_event_ids()

ZmqEventConsumer::get_subscribed_event_ids() takes the ZmqEventConsumer map_modification_lock as a ReaderLock.
This should not be blocking unless another thread currently holds a writer lock on this lock.
Another thread could hold map_modification_lock as a Writer lock if the client is currently trying to subscribe to an event or trying to unsubscribe from an already subscribed event.
Please note that in some cases the Tango keep-alive thread might periodically try to subscribe to events which are not yet connected. This can happen when the client tries to subscribe to some events with the stateless flag set to true.

I'm almost sure, but please also confirm, the lock is global per client, and not per DeviceProxy?

I confirm, this ZmqEventConsumer::map_modification_lock lock is global. It is a lock associated to the ZmqEventConsumer object instance, which is unique per client.

@schooft

schooft commented Feb 13, 2020

To clarify, PyTango.DeviceProxy has an additional threading.Lock called _subscribed_events_lock:
https://github.com/tango-controls/pytango/blob/develop/tango/device_proxy.py#L1067
which is acquired during subscribe_event and unsubscribe_event, e.g.,:
https://github.com/tango-controls/pytango/blob/develop/tango/device_proxy.py#L1291

Calling one of these methods in a __del__ method of an object that has cyclic references leads to the observed deadlocks, because the garbage collector can potentially execute the __del__ method on any Python thread after any instruction*, as illustrated by the above toy example where the GC is triggered manually in different threads.

*Actually, the GC of CPython only runs after instructions that allocate, so avoiding allocations in code paths that hold the lock is a possible but difficult workaround.

There are at least the following solutions or workarounds:

  1. Remove the _subscribed_events_lock in PyTango
  2. Avoid triggering the GC in PyTango while holding the lock
  3. Change the __del__ methods that call unsubscribe_event
    a. Do not unsubscribe, as this seems to be no problem for Python < 3.4 where cyclic references were never garbage collected and thus the __del__ methods were not executed
    b. Use a background thread like MongoDB
    c. Try out the one shot timer I proposed above
  4. Remove cyclic references, so that the __del__ method is executed deterministically on CPython
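
A sketch of option 4, using the PyTangoDevice/PyTangoAttribute example from earlier in this thread and breaking the cycle with a weak reference (assuming the attribute never needs to outlive its device):

import weakref

class PyTangoDevice(object):
    def __init__(self, name):
        self._state = self.getAttribute("state")

    def getAttribute(self, name):
        return PyTangoAttribute(name, self)

class PyTangoAttribute(object):
    def __init__(self, name, dev):
        # weak reference only: no cycle, so __del__ runs deterministically
        # as soon as the refcount drops to zero (no GC pass needed)
        self._dev = weakref.ref(dev)

    def __del__(self):
        print("PyTangoAttribute.__del__")

With this change, dev = None triggers PyTangoAttribute.__del__ immediately, in the thread that dropped the last reference, instead of at some later GC run.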

@ajoubertza
Member

ajoubertza commented Feb 13, 2020

I think that Anton here referred to removing the unsubscribes from taurus.Attribute.__del__ and not from ~DeviceProxy?

Yes, @reszelaz I meant remove the Python calls to unsubscribe in taurus.Attribute.__del__.

Thanks for the great analysis, everyone!

Considering @schooft's suggestions:

  1. I don't think this lock is the only one giving problems. Even when the Python call to unsubscribe was removed, we still got a deadlock:

    So, what I tried is what Anton said (or at least what I understood:) - not to call the unsubscribes in taurus.Attribute.__del__. And it was still hanging... Here I think we hit what Reynald explains in DS crashes when unsubscribing from events in __del__ #292 (comment) about ~DeviceProxy

  2. Yes, this could work, although we would have to disable the GC, and it might be hard to find a good time to periodically call collect. When this issue was first posted, I added a call to gc.collect at the end of the outer loop in JobThread.run from the original example and it didn't crash or hang anymore.
  3. a) As mentioned in 1, this was tried and did not solve the problem due to C++ lock (I think).
    b) This can work.
    c) Not sure how often these objects get deleted - if there are hundreds per second, the one-shot timer approach might create a huge number of threads.
  4. Could probably be done in taurus code, but a developer could still accidentally use/modify the objects causing a cycle...

@schooft

schooft commented Feb 13, 2020

  2. Yes, this could work, although we would have to disable the GC, and it might be hard to find a good time to periodically call collect.

I didn't mean disabling the GC but not allocating any objects in the problematic code paths, as done for MongoDB version 2. As far as I understand, the GC is only triggered when new Python objects are allocated, except for some special occasions like shutdown.

When this issue was first posted, I added a call to gc.collect at the end of the outer loop in JobThread.run from the original example and it didn't crash or hang anymore.

This makes sense. The GC only actually runs when allocations exceed certain thresholds. By running it regularly, the times when the allocations exceed the thresholds are changed. If there are only a few allocations, the GC might not run at all outside the manual invocations.
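
For illustration, the kind of control being discussed here boils down to the following sketch (finding a safe point to collect is the hard part):

import gc

gc.disable()   # stop automatic cyclic garbage collection

# ... application code that may hold Tango/PyTango locks ...

gc.collect()   # collect explicitly at a point where no such lock is held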

  3. c) Not sure how often these objects get deleted - if there are hundreds per second, the one-shot timer approach might create a huge number of threads.

Good point.

I agree that there is likely another problem. The backtrace attached in #292 (comment) does not contain any calls from the garbage collector. At least in the toy example below, I can see the following frames:

#17 0x000055d91f9c3f4f in slot_tp_finalize () at ../Objects/typeobject.c:6343
#18 0x000055d91f966abf in finalize_garbage (collectable=0x7fc3d493a5d0) at ../Modules/gcmodule.c:808
#19 collect () at ../Modules/gcmodule.c:1003
#20 0x000055d91f968089 in collect_with_callback (generation=0) at ../Modules/gcmodule.c:1119
#21 collect_generations () at ../Modules/gcmodule.c:1142
#22 _PyObject_GC_Alloc (basicsize=<optimized out>, use_calloc=0) at ../Modules/gcmodule.c:1708
#23 _PyObject_GC_Malloc () at ../Modules/gcmodule.c:1718
#24 _PyObject_GC_New () at ../Modules/gcmodule.c:1730
#25 0x000055d91fa0be3c in PyList_New () at ../Objects/listobject.c:159

This is the toy example that triggers the GC on a random thread by allocating many objects:

import threading

lock = threading.Lock()

class C(object):
    def __del__(self):
        print('gc getting lock on', threading.current_thread().getName())
        with lock:
            print('gc releasing lock')

def target():
    print('target')
    c2 = C()
    c2.self = c2
    c2 = None
    l2 = []
    with lock:
        print("thread has lock", threading.current_thread().getName())
        for i in range(10000):
            l2.append([])  # create allocations to trigger gc
    print("thread released lock")

t = threading.Thread(target=target)
t.start()

c = C()
c.self = c
c = None
l = []
for i in range(1000):
    l.append([])  # create allocations to trigger gc

t.join()

On my PC this deadlocks roughly every second time. It is noteworthy that at least one garbage-collectable object (not necessarily the problematic one) needs to be created on the thread that holds the lock; otherwise I didn't manage to trigger the GC on that thread.

@reszelaz Could you also create a backtrace with thread apply all py-bt to see in which python code the threads are stuck? (You might need to install the python3-dbg package on Debian. Alternatively, you could try py-spy dump --native --pid $pid, see https://github.com/benfred/py-spy/)

Edit: The backtrace above is now the correct one for the toy example.

@schooft

schooft commented Feb 14, 2020

I got a backtrace myself for the PyTango test case (I am not sure if my system is set up correctly, as I have a lot of boost error handling stuff in the full backtrace):

Thread 11 (Thread 0x7f0296ffd700 (LWP 6955)):
Traceback (most recent call first):
  Waiting for the GIL
  File "/usr/lib/python3/dist-packages/tango/device_proxy.py", line 1305, in __DeviceProxy__unsubscribe_event
    self.__unsubscribe_event(event_id)
  File "/usr/lib/python3/dist-packages/tango/green.py", line 109, in run
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/tango/green.py", line 195, in greener
    return executor.run(fn, args, kwargs, wait=wait, timeout=timeout)
  File "PyDsExpClient.py", line 141, in __del__
  Garbage-collecting
  <built-in method write of _io.TextIOWrapper object at remote 0x7f02baa5d630>
  <built-in method print of module object at remote 0x7f02baab15e8>
  File "PyDsExpClient.py", line 145, in push_event
  File "PyDsExpClient.py", line 106, in __call__
  File "/usr/lib/python3/dist-packages/tango/green.py", line 92, in submit
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/tango/green.py", line 210, in greener
    return executor.submit(fn, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/tango/device_proxy.py", line 1252, in __DeviceProxy__subscribe_event_attrib
    attr_name, event_type, cb, filters, stateless, extract_as)
  File "/usr/lib/python3/dist-packages/tango/device_proxy.py", line 1188, in __DeviceProxy__subscribe_event
    return __DeviceProxy__subscribe_event_attrib(self, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/tango/green.py", line 109, in run
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/tango/green.py", line 195, in greener
    return executor.run(fn, args, kwargs, wait=wait, timeout=timeout)
  File "PyDsExpClient.py", line 131, in __init__
  File "PyDsExpClient.py", line 118, in getAttribute
  File "PyDsExpClient.py", line 163, in run
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
    self._bootstrap_inner()

So, while the thread is inside the print call in PyTangoAttribute.push_event, the garbage collector is triggered and calls the __del__ method of a PyTangoAttribute. (The garbage collector only collects the attributes named "state"; the "attr1" attributes have no cyclic references and are deleted immediately after a new one is initialized.)

The relevant backtrace that leads to the call of PyTangoAttribute.push_event:

#110 0x00007f02b8b93d22 in PyCallBackPushEvent::push_event(Tango::AttrConfEventData*) () from /usr/lib/python3/dist-packages/tango/_tango.cpython-35m-x86_64-linux-gnu.so
#111 0x00007f02b8100c2f in Tango::EventConsumer::get_fire_sync_event(Tango::DeviceProxy*, Tango::CallBack*, Tango::EventQueue*, Tango::EventType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Tango::event_callback&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)
    () from /usr/lib/x86_64-linux-gnu/libtango.so.9
    () from /usr/lib/x86_64-linux-gnu/libtango.so.9
#112 0x00007f02b8102492 in Tango::EventConsumer::connect_event(Tango::DeviceProxy*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Tango::EventType, Tango::CallBack*, Tango::EventQueue*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, int) () from /usr/lib/x86_64-linux-gnu/libtango.so.9
#113 0x00007f02b8103de4 in Tango::EventConsumer::subscribe_event(Tango::DeviceProxy*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Tango::EventType, Tango::CallBack*, Tango::EventQueue*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool) () from /usr/lib/x86_64-linux-gnu/libtango.so.9
#114 0x00007f02b8104350 in Tango::EventConsumer::subscribe_event(Tango::DeviceProxy*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Tango::EventType, Tango::CallBack*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool) ()
   from /usr/lib/x86_64-linux-gnu/libtango.so.9
#115 0x00007f02b8017312 in Tango::DeviceProxy::subscribe_event(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Tango::EventType, Tango::CallBack*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool) ()
   from /usr/lib/x86_64-linux-gnu/libtango.so.9

The relevant backtrace from within the __del__ method:

#2  0x00007f02b834ddf9 in Tango::TangoMonitor::wait(long) ()
   from /usr/lib/x86_64-linux-gnu/libtango.so.9
#3  0x00007f02b8107ffc in Tango::TangoMonitor::get_monitor() ()
   from /usr/lib/x86_64-linux-gnu/libtango.so.9
#4  0x00007f02b81269b0 in Tango::DelayEvent::DelayEvent(Tango::EventConsumer*) ()
   from /usr/lib/x86_64-linux-gnu/libtango.so.9
#5  0x00007f02b80fd9ce in Tango::EventConsumer::unsubscribe_event(int) ()
   from /usr/lib/x86_64-linux-gnu/libtango.so.9
#6  0x00007f02b801e577 in Tango::DeviceProxy::unsubscribe_event(int) ()
   from /usr/lib/x86_64-linux-gnu/libtango.so.9

The full backtrace: backtrace.txt

This was one of the runs. During other runs, I didn't get a deadlock/crash but instead this exception:

PyTangoAttribute.__del__: before unsubscribe state
Exception ignored in: <bound method PyTangoAttribute.__del__ of <__main__.PyTangoAttribute object at 0x7f263243a5c0>>
Traceback (most recent call last):
  File "PyDsExpClient.py", line 141, in __del__
    self._dev.dp.unsubscribe_event(id_)
  File "/usr/lib/python3/dist-packages/tango/green.py", line 195, in greener
    return executor.run(fn, args, kwargs, wait=wait, timeout=timeout)
  File "/usr/lib/python3/dist-packages/tango/green.py", line 109, in run
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/tango/device_proxy.py", line 1305, in __DeviceProxy__unsubscribe_event
    self.__unsubscribe_event(event_id)
PyTango.EventSystemFailed: (DevError(desc = 'Failed to unsubscribe event, the event id specified does not correspond with any known one', origin = 'EventConsumer::get_event_system_for_event_id()', reason = 'API_EventNotFound', severity = tango._tango.ErrSeverity.ERR),)

This was the __del__ method of a rather old attribute object which should have been deleted many iterations earlier. Hope this helps.

@schooft

schooft commented Feb 14, 2020

To trigger the above error reliably, just call gc.collect() in PyTangoAttribute.push_event, as in the sketch below.
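
For reference, here is a minimal sketch of what I mean; the class and method names just mirror the ones in PyDsExpClient.py, everything else is an assumption:

import gc

class PyTangoAttribute(object):

    def __init__(self, name, dev):
        self._dev = dev  # cyclic reference, as in the test case

    def push_event(self, event):
        # Forcing a collection here makes the GC run __del__ of older,
        # cyclically referenced PyTangoAttribute instances on the event
        # consumer thread, which reproduces the problem reliably.
        gc.collect()

    def __del__(self):
        # In the real test case this calls dp.unsubscribe_event(...).
        pass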

@reszelaz
Contributor Author

Hi all,

Tim, sorry that I did not send you the backtrace before, but this morning I rebooted my PC and afterwards the Sardana tests that I used to run started to behave differently. At first it was driving me crazy, but now I think it helped me to advance a little bit...
So, after the reboot, the earlier-mentioned test module was reproducibly hanging at the very first test. This was very much in line with this observation #292 (comment), but this time the event consumer thread was not even involved. Then I thought that I was hitting the potential deadlock of the GC waiting on _subscribed_events_lock, as you explained, so I added some more logs around the lock. But I think I was not hitting that one. Then, while simplifying the test, I arrived at this example. Could it be the root cause of this issue? It would be nice if someone could test it and confirm if it is reproducible outside of my environment.

Now, regarding your tests from #292 (comment): I'm not sure, but I think this one needs to use the tango.EnsureOmniThread() context manager, since it uses threads which are not omni threads (see the sketch below). But it would be better if someone else confirms that.
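
Just to illustrate what I mean, here is a minimal sketch of how I would expect it to be used (the device name and the thread body are only placeholders):

import threading
import tango

def worker():
    # Threads not created by omniORB should be wrapped so that cppTango
    # can safely use thread-specific data from them.
    with tango.EnsureOmniThread():
        dp = tango.DeviceProxy("test/device1/1")  # placeholder device name
        print(dp.state())

t = threading.Thread(target=worker)
t.start()
t.join()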

Also, regarding the potential deadlock on the _subscribed_events_lock lock: this lock was already under review in #185 (comment); maybe we could now check again whether it is needed? If it is indeed needed, Tim's idea to change this particular PyTango code to ensure that the GC won't be called sounds good to me (see the sketch after this paragraph). The GC cannot kick in on this thread while we are already executing Tango C++ calls, right? Also, I would discard changing the lock to an RLock, since we may mess up the map, right?
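
Just to make the idea concrete, one possible way (not necessarily what Tim has in mind) would be a small helper like the sketch below; note that gc.disable() pauses automatic collection for the whole interpreter, not only for the current thread:

import gc
from contextlib import contextmanager

@contextmanager
def no_auto_gc():
    # Pause automatic cyclic garbage collection so that no __del__
    # (and hence no unsubscribe_event) can be triggered while the
    # subscription map lock is held.  This is interpreter-wide.
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

# Usage sketch around the existing lock:
# with no_auto_gc(), subscribed_events_lock:
#     ...update the subscription map...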

At some point it came to my mind that, instead of calling unsubscribe in __del__, we could call it in weakref callbacks or finalizers (taurus.Factory stores weakrefs to taurus.Attribute); a sketch follows below. But I imagine these must behave similarly to __del__ and will be called by the GC process (at a random moment and on a random thread), right?
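
For completeness, this is roughly the kind of finalizer I had in mind (the names just follow the test case, so take them as assumptions); as far as I understand, for objects in reference cycles the callback is still invoked from the GC, so it would have the same timing and thread problem as __del__:

import weakref
import tango

class PyTangoAttribute(object):

    def __init__(self, name, dev):
        self._dev = dev
        self._event_id = dev.dp.subscribe_event(
            name, tango.EventType.CHANGE_EVENT, self.push_event)
        # The finalizer references only the proxy and the event id, not
        # self, so it does not keep the attribute alive; it runs when the
        # attribute is garbage collected (or at interpreter exit).
        self._finalizer = weakref.finalize(
            self, dev.dp.unsubscribe_event, self._event_id)

    def push_event(self, event):
        pass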

Thanks again for all your help in trying to fix this issue!

@ajoubertza
Member

It would be nice if someone could test it and confirm if it is reproducible outside of my environment.

I tried the example and get an almost identical result (the only differences are in the API_EventTimeout text). My test environment is Docker + Conda + Python 3.7.5 + cppTango 9.3.2 + your pytango branch.

(env-py3.7-tango9.3.2) root@d2d458b3bc7d:/opt/pytango/examples/issue-292/test-tango-unsub# python3 client1.py
MainThread:__DeviceProxy.__subscribe_event_attrib(attr1): before self.__subscribe_event
2020-02-15 13:47:27.278990 TEST/DEVICE1/1 ATTR1 ATTR_CONF label='attr1'; unit=''
MainThread:__DeviceProxy.__subscribe_event_attrib(attr1): after self.__subscribe_event; id = 1
MainThread:__DeviceProxy.__subscribe_event(1): before event_map_lock acquire
MainThread:__DeviceProxy.__subscribe_event(1): after event_map_lock acquire
MainThread:__DeviceProxy.__subscribe_event(1): after event_map_lock release
MainThread:__DeviceProxy.__subscribe_event_attrib(attr1): before self.__subscribe_event
2020-02-15 13:47:28.293413 TEST/DEVICE2/1 ATTR1 CHANGE [ATTR_VALID] 0.0
MainThread:__DeviceProxy.__subscribe_event_attrib(attr1): after self.__subscribe_event; id = 2
MainThread:__DeviceProxy.__subscribe_event(2): before event_map_lock acquire
MainThread:__DeviceProxy.__subscribe_event(2): after event_map_lock acquire
MainThread:__DeviceProxy.__subscribe_event(2): after event_map_lock release
MainThread:__DeviceProxy.__subscribe_event_attrib(attr2): before self.__subscribe_event
MainThread:__DeviceProxy__unsubscribe_event(1): before event_map_lock acquire
MainThread:__DeviceProxy__unsubscribe_event(1): after event_map_lock acquire
MainThread:__DeviceProxy__unsubscribe_event(1): after event_map_lock release
MainThread:__DeviceProxy__unsubscribe_event(1): before self.__unsubscribe_event
MainThread:__DeviceProxy__unsubscribe_event(1): after self.__unsubscribe_event
MainThread:__DeviceProxy.__subscribe_event_attrib(attr2): after self.__subscribe_event; id = 3
MainThread:__DeviceProxy.__subscribe_event(3): before event_map_lock acquire
MainThread:__DeviceProxy.__subscribe_event(3): after event_map_lock acquire
MainThread:__DeviceProxy.__subscribe_event(3): after event_map_lock release
2020-02-15 13:47:39.264995 TEST/DEVICE2/1 ATTR1 CHANGE [API_EventTimeout] Event channel is not responding anymore, maybe the server or event system is down
2020-02-15 13:47:39.266084 TEST/DEVICE2/1 ATTR1 CHANGE [ATTR_VALID] 0.0
ERROR: API_EventTimeout
2020-02-15 13:47:49.280982 TEST/DEVICE2/1 ATTR1 CHANGE [API_EventTimeout] Event channel is not responding anymore, maybe the server or event system is down
2020-02-15 13:47:49.282081 TEST/DEVICE2/1 ATTR1 CHANGE [ATTR_VALID] 0.0
ERROR: API_EventTimeout
2020-02-15 13:47:59.292498 TEST/DEVICE2/1 ATTR1 CHANGE [API_EventTimeout] Event channel is not responding anymore, maybe the server or event system is down
2020-02-15 13:47:59.293321 TEST/DEVICE2/1 ATTR1 CHANGE [ATTR_VALID] 0.0
ERROR: API_EventTimeout

@schooft

schooft commented Feb 15, 2020

@reszelaz Thank you for this great example case!

On my machine I get the same error. I also implemented the same client in C++ and the error occurs there as well.

See reszelaz/test-tango-unsub#1

@reszelaz
Contributor Author

Thanks for the tests! With Tim's example in C++, I have already reported it as tango-controls/cppTango#686.

@reszelaz
Contributor Author

Hi all, I was just reviewing old issues and I think this one can be closed now. We are happy with the fix in cppTango.
