Skip to content
This repository was archived by the owner on Jul 8, 2022. It is now read-only.

Yet another attempt to improve event subscription with minimal changes to the existing code base#404

Closed
Ingvord wants to merge 21 commits intomasterfrom
improve-event-subscription_v2
Closed

Yet another attempt to improve event subscription with minimal changes to the existing code base#404
Ingvord wants to merge 21 commits intomasterfrom
improve-event-subscription_v2

Conversation

@Ingvord
Copy link
Copy Markdown
Member

@Ingvord Ingvord commented Oct 25, 2017

see #315, #385

@Ingvord Ingvord requested a review from bourtemb October 30, 2017 11:12
@Ingvord Ingvord mentioned this pull request Nov 2, 2017
@bourtemb
Copy link
Copy Markdown
Member

bourtemb commented Nov 3, 2017

Please find attached an OpenDocument spreadsheet listing all the cases I could identify (there might be more) where the events do not work as expected in Tango 9 and what is wrong for each specific case.
EventsTests.zip

I tested this PR with only a few cases listed on this spreadsheet and I saw the same wrong behaviours as before.
@Ingvord, please try to find a way to test all of that in your environment. It would be great if you could use docker to test some cases for instance so we could add some tests in our tests suite. You could also set up a basic network with only 2 computers or raspberry Pi and create a network alias for one of the computers. That should be enough to test it.

One important point is that when a user subscribes to an event with a specific hostname in his TANGO_HOST or FQDN, I think he should always get the same hostname in the EventData object from the callback as the one used during the subscription. @taurel, please do not hesitate to contradict me if the initial idea was different or if you think this does not make sense.
This name received in the callback is currently not consistent in many cases.
There are inconsistencies between the event name received in the callback after the synchronous call (at subscription time) and the event name received afterwards via ZMQ.

As an example, here is one case which does not work, even with a single computer:

  • export TANGO_HOST=localhost:10000 on the C++ server side
  • a C++ client subscribes to an attribute able to generate events using the FQDN during the subscription (e.g.: tango://localhost:10000/sys/tg_test/1).

In this case, some events are received but the name (hostname part) in the events does not correspond to the name used at the time of the subscription (except for the synchronous call which is OK in this case).
The heartbeat mechanism also does not work in this specific case and some API_EventTimeout error events are triggered every 10 seconds.
One could see some errors like "No entry in channel map for heartbeat tango:://<real_hostname>:10000/dserver/tangotest/test" too.

I didn't test it with the code from this PR, but #350 should be easy to verify with a simple setup.

@Ingvord please try to setup your computers in order to be able to reproduce the problems, else it will be very painful to debug and fix all these problems.

Do you need some C++ client event subscriber examples?

Hoping this helps,
Reynald

@t-b
Copy link
Copy Markdown
Collaborator

t-b commented Nov 3, 2017

@Ingvord @bourtemb Sorry for just asking questions from the sidelines. Does that bug here exist in tango8 as well? I'm seeing some wierd behaviour in qtango with tango9 which could be related to these issues here.

@bourtemb
Copy link
Copy Markdown
Member

bourtemb commented Nov 3, 2017

@t-b I think we can consider there are several bugs here...
This issue is actually listing several of them but they are linked.
The bugs are probably present in Tango 8 too, except #350
In general, you will not get much trouble if you use the same TANGO_HOST string on the client side and server side, except with the localhost case when you use the FQDN (I may have forgotten some cases, please refer to the spreadsheet from my previous comment).

@Ingvord
Copy link
Copy Markdown
Member Author

Ingvord commented Nov 29, 2017

@bourtemb could you please attach your C++ client code as well?

@bourtemb
Copy link
Copy Markdown
Member

Here it is:
event.zip

@Ingvord
Copy link
Copy Markdown
Member Author

Ingvord commented Nov 29, 2017

@bourtemb , super! Thanks!!!

@Ingvord
Copy link
Copy Markdown
Member Author

Ingvord commented Dec 5, 2017

@bourtemb , Please find attached an updated version of your spread sheet: EventsTests.ods.zip. I have tested current codebase against "case 17" - all except netalias. It seems to work apart from the attribute name substitution.

Here is the way how I did it:

  1. start-tango with exported 10000 port (see 60db0a0):
    start_tango
  2. configure DevTest/Event_change_tst in Jive (polling = 1000, abs = 1)
  3. In CLion Run->Edit configuration-> set TANGO_HOST env var for DevTest:
    set_tango_host
  4. define program argument for events:
    set_client_arg
  5. Run both
  6. execute DevTest/IOIncValue command via Jive (emits an event)
  7. Repeat with different TANGO_HOST/argument...

In this setup Tango host is running in a docker, but exports port 10000 to localhost. Both client and server are running on the same machine but use different TANGO_HOST/FQDN to subscribe to an event.

Could you please test the code in your environment (with netaliases)?

@bourtemb
Copy link
Copy Markdown
Member

bourtemb commented Dec 6, 2017

Hi, I did some simple tests with TangoTest device server on a single machine.
The heartbeat problem seems to be solved in the case F-22 from the spreadsheet for instance. :-) Good!
There is something strange though... The first event received after the synchronous event from the subscription seems to be received twice by the client. I tried different combinations of TANGO_HOST, using localhost, the IP address, 127.0.0.1, ...
This phenomenon doesn't seem to appear when there is a reconnection.

@bourtemb
Copy link
Copy Markdown
Member

bourtemb commented Dec 6, 2017

It looks like the duplicate event after the synchronous call does not happen all the time... I will need deeper investigations to try to reproduce it systematically.

@bourtemb
Copy link
Copy Markdown
Member

bourtemb commented Dec 7, 2017

It looks like I can reproduce the duplicated event issue systematically when I do the following:

  1. Start TangoTest server with float_scalar attribute polled
  2. Start client: event tango://localhost:10000/my/tg_test/device float_scalar
  3. Set float_scalar attribute to 1000 using tg_devtest (test device from jive)
  4. Restart the server
  5. Wait for the reconnection on the client side
  6. Restart the client
  7. Set float_scalar attribute to 1000 using tg_devtest (test device from jive)

At that point you should see the same event received twice.
I tried with other TANGO_HOST definitions and I got the same problem.
I tried with tango 9.2.5a using TANGO_HOST=hostname:port and I did not see this problem.

From that point on, if you restart any client, it will be systematic on the client side. And if you start several clients, they will all see this duplicated event.

@Ingvord
Copy link
Copy Markdown
Member Author

Ingvord commented Dec 7, 2017

@bourtemb , thanks a lot! I will have a look into it...

@bourtemb
Copy link
Copy Markdown
Member

bourtemb commented Dec 8, 2017

Hi, Here is an updated version of the spreadsheet with the results of the tests using an host having a network alias defined: EventsTests-reynald-tests-08Dec2017.zip.
I tested only the cases on lines 17 to 24.
Your patch is clearly improving the situation in case of network aliases. :-)
There are still some issues to solve with the event name in the received events or synchronous calls and the duplicated event in the case I mentioned before.
There are also some reconnection problems when the server is using TANGO_HOST=127.0.0.1:10000 and a remote client from another host is subscribing to events using a TANGO_HOST with the hostname, network alias name or public IP address of the device server host.

Ingvord added a commit that referenced this pull request Dec 12, 2017
Ingvord added a commit that referenced this pull request Dec 12, 2017
@Ingvord Ingvord force-pushed the improve-event-subscription_v2 branch from bb5d79f to 9f73d5d Compare December 12, 2017 15:10
@tango-controls-bot
Copy link
Copy Markdown
Collaborator

About the problem with the duplicated event, this is what I understood after talking with @taurel:
Once upon a time, there was a bug in ZMQ reported by Emmanuel.
As a work-around to this bug, the Tango library on the server side (or ZMQ, not clear to me when I see the ZMQ fix: https://github.com/zeromq/libzmq/pull/291/files) is sending twice the same event the first time after the connection of a new client.
The same event is sent twice but there is a counter field in the event which is normally incremented for each new event. In this case, the same event is sent twice, with the same counter value.
The client side can then detect whether the counter got incremented compared to the previous received event. If it was not incremented, the callback should not be called a second time.
The bug in ZMQ should be fixed now so, on the server side, we should remove the code sending twice the same event with the same counter value (if needed). On the client side, we should keep the code as before and ignore the events received twice with the same counter value in order to ensure the callback is not called several times when communicating with an old device server.

@Ingvord
Copy link
Copy Markdown
Member Author

Ingvord commented Dec 12, 2017

@bourtemb , Was it you under "tango-tickets-migrator"? :) Thanks for the info, good to know! In fact I have seen this in the code, there is a comment at zmqeventconsumer.cpp#2024:

//
// Miss some events?
// Due to LIBZMQ Bug 283, the first event after a process startup is sent two times
// with the same ctr value. Do not call the user callback for the second times.
//

But this behavior observed (two times event after reconnection) was due to this:

//eventkeepalive.cpp#134
subscriber_info.push_back("0");//client lib version

Note hardcoded 0. This was passed to the server during the reconnection and server guessed client_lib to be 5 instead of 6 effectively subscribing the same client twice due to different client_lib (see eventcmds.cpp#833). Then in polling loop it iterates over client_libs vector:

//eventsupplier.cpp#401
for (ite = client_libs.begin();ite != client_libs.end();++ite)

And obviously (after a few hours of debugging) it sent the same event twice to the same client, but due to different client_lib.

Changing subscriber_info.push_back("0") in eventkeepalive.cpp with subscriber_info.push_back(to_string(DevVersion)); (see f49cb02) seems to fix the problem.

I have tested cases B18-24 (except netalias... I tried to setup docker network with aliases, but failed miserably) with reconnection and now they all green (I have not highlighted though - awaiting your approval).

EventsTests-ingvord-tests-12Dec2017.ods.zip

@bourtemb
Copy link
Copy Markdown
Member

bourtemb commented Dec 13, 2017

Hi Igor,
Indeed I used the wrong account by mistake for my previous comment... :)
I tested your latest version (patched with https://github.com/tango-controls/cppTango/pull/420/files to be able to compile TangoTest) with TANGO_HOST=localhost:10000 on the server side and invoking the client with:
event tango://localhost:10000/sys/tg_test/rb1 float_scalar
This attribute float_scalar is polled at 10 seconds in my case and I set float_scalar attribute to 1000 after the client has subscribed to trigger change events.
In these conditions I get duplicated events after the first subscription synchronous call and the same at the reconnection.
Here are the messages displayed by the client (the device server is stopped and restarted at some point):

ebssimul:~/Events % bin/debian8/event tango://localhost:10000/sys/tg_test/rb1 float_scalar
I have received an event for attribute tango://localhost:10000/sys/tg_test/rb1/float_scalar
Wed Dec 13 10:23:53 2017 (1513157033,33607 sec) : float_scalar (dim_x = 1, dim_y = 0, w_dim_x = 1, w_dim_y = 0, Data quality factor = VALID, Data format = SCALAR, Data type = DevFloat)
Element number [0] = 0
Element number [1] = 0

Event subscription done
ev_id1 = 1
push_zmq_event from zmq::message_t
Lib: Received event for tango://ebssimul.esrf.fr:10000/sys/tg_test/rb1/float_scalar.idl5_change
Key in event_callback_map = tango://ebssimul.esrf.fr:10000/sys/tg_test/rb1/float_scalar.idl5_change
Key in channel_map = tango://ebssimul.esrf.fr:10000/dserver/tangotest/reynald
ds_ctr1
evt_cb.ctr0
I have received an event for attribute tango://localhost:10000/sys/tg_test/rb1/float_scalar
Wed Dec 13 10:24:12 2017 (1513157052,548733 sec) : float_scalar (dim_x = 1, dim_y = 0, w_dim_x = 1, w_dim_y = 0, Data quality factor = VALID, Data format = SCALAR, Data type = DevFloat)
Element number [0] = 288
Element number [1] = 1000

push_zmq_event from zmq::message_t
Lib: Received event for tango://ebssimul.esrf.fr:10000/sys/tg_test/rb1/float_scalar.idl5_change
Key in event_callback_map = tango://ebssimul.esrf.fr:10000/sys/tg_test/rb1/float_scalar.idl5_change
Key in channel_map = tango://ebssimul.esrf.fr:10000/dserver/tangotest/reynald
ds_ctr2
evt_cb.ctr1
I have received an event for attribute tango://localhost:10000/sys/tg_test/rb1/float_scalar
Wed Dec 13 10:24:12 2017 (1513157052,548733 sec) : float_scalar (dim_x = 1, dim_y = 0, w_dim_x = 1, w_dim_y = 0, Data quality factor = VALID, Data format = SCALAR, Data type = DevFloat)
Element number [0] = 288
Element number [1] = 1000

push_zmq_event from zmq::message_t
Lib: Received event for tango://ebssimul.esrf.fr:10000/sys/tg_test/rb1/float_scalar.idl5_change
Key in event_callback_map = tango://ebssimul.esrf.fr:10000/sys/tg_test/rb1/float_scalar.idl5_change
Key in channel_map = tango://ebssimul.esrf.fr:10000/dserver/tangotest/reynald
ds_ctr2
evt_cb.ctr2
push_zmq_event from zmq::message_t
Lib: Received event for tango://ebssimul.esrf.fr:10000/sys/tg_test/rb1/float_scalar.idl5_change
Key in event_callback_map = tango://ebssimul.esrf.fr:10000/sys/tg_test/rb1/float_scalar.idl5_change
Key in channel_map = tango://ebssimul.esrf.fr:10000/dserver/tangotest/reynald
ds_ctr3
evt_cb.ctr2
I have received an event for attribute tango://localhost:10000/sys/tg_test/rb1/float_scalar
Wed Dec 13 10:24:22 2017 (1513157062,548769 sec) : float_scalar (dim_x = 1, dim_y = 0, w_dim_x = 1, w_dim_y = 0, Data quality factor = VALID, Data format = SCALAR, Data type = DevFloat)
Element number [0] = 53
Element number [1] = 1000

I have received an event for attribute tango://localhost:10000/sys/tg_test/rb1/float_scalar
Wed Dec 13 10:24:36 2017: The event is an error
Tango error stack
Severity = ERROR 
Error reason = API_EventTimeout
Desc : Event channel is not responding anymore, maybe the server or event system is down
Origin : EventConsumer::KeepAliveThread()

I have received an event for attribute tango://localhost:10000/sys/tg_test/rb1/float_scalar
Wed Dec 13 10:24:46 2017: The event is an error
Tango error stack
Severity = ERROR 
Error reason = API_EventTimeout
Desc : Event channel is not responding anymore, maybe the server or event system is down
Origin : EventConsumer::KeepAliveThread()

I have received an event for attribute tango://localhost:10000/sys/tg_test/rb1/float_scalar
Wed Dec 13 10:24:38 2017 (1513157078,149561 sec) : float_scalar (dim_x = 1, dim_y = 0, w_dim_x = 1, w_dim_y = 0, Data quality factor = VALID, Data format = SCALAR, Data type = DevFloat)
Element number [0] = 0
Element number [1] = 0

push_zmq_event from zmq::message_t
Lib: Received event for tango://ebssimul.esrf.fr:10000/sys/tg_test/rb1/float_scalar.idl5_change
Key in event_callback_map = tango://ebssimul.esrf.fr:10000/sys/tg_test/rb1/float_scalar.idl5_change
Key in channel_map = tango://ebssimul.esrf.fr:10000/dserver/tangotest/reynald
ds_ctr1
evt_cb.ctr3
I have received an event for attribute tango://localhost:10000/sys/tg_test/rb1/float_scalar
Wed Dec 13 10:24:57 2017 (1513157097,663958 sec) : float_scalar (dim_x = 1, dim_y = 0, w_dim_x = 1, w_dim_y = 0, Data quality factor = VALID, Data format = SCALAR, Data type = DevFloat)
Element number [0] = 8
Element number [1] = 1000

push_zmq_event from zmq::message_t
Lib: Received event for tango://ebssimul.esrf.fr:10000/sys/tg_test/rb1/float_scalar.idl5_change
Key in event_callback_map = tango://ebssimul.esrf.fr:10000/sys/tg_test/rb1/float_scalar.idl5_change
Key in channel_map = tango://ebssimul.esrf.fr:10000/dserver/tangotest/reynald
ds_ctr2
evt_cb.ctr1
I have received an event for attribute tango://localhost:10000/sys/tg_test/rb1/float_scalar
Wed Dec 13 10:24:57 2017 (1513157097,663958 sec) : float_scalar (dim_x = 1, dim_y = 0, w_dim_x = 1, w_dim_y = 0, Data quality factor = VALID, Data format = SCALAR, Data type = DevFloat)
Element number [0] = 8
Element number [1] = 1000

@Ingvord
Copy link
Copy Markdown
Member Author

Ingvord commented Dec 13, 2017

@bourtemb , well, this is wiered... I can not reproduce your case localy. Here is my output:

/home/ingvord/Projects/org.tango/git/cppTango/cmake-build-debug/events tango://localhost:10000/test/debian8/10 Event_change_tst
I have received an event for attribute tango://localhost:10000/test/debian8/10/event_change_tst
Quality = 0
Wed Dec 13 10:42:53 2017 (1513158173,896622 sec) : Event_change_tst (dim_x = 4, dim_y = 0, w_dim_x = 0, w_dim_y = 0, Data quality factor = VALID, Data format = SPECTRUM, Data type = DevLong)
Element number [0] = 10
Element number [1] = 20
Element number [2] = 30
Element number [3] = 40

Event subscription done
ev_id1 = 1
push_zmq_event from zmq::message_t
Lib: Received event for tango://hzgc103k.desy.de:10000/test/debian8/10/event_change_tst.idl5_change
Key in event_callback_map = tango://hzgc103k.desy.de:10000/test/debian8/10/event_change_tst.idl5_change
Key in channel_map = tango://hzgc103k.desy.de:10000/dserver/devtest/test
ds_ctr1
evt_cb.ctr0
I have received an event for attribute tango://localhost:10000/test/debian8/10/event_change_tst
Quality = 0
Wed Dec 13 10:43:53 2017 (1513158233,896781 sec) : event_change_tst (dim_x = 4, dim_y = 0, w_dim_x = 0, w_dim_y = 0, Data quality factor = VALID, Data format = SPECTRUM, Data type = DevLong)
Element number [0] = 10
Element number [1] = 20
Element number [2] = 32
Element number [3] = 40

push_zmq_event from zmq::message_t
Lib: Received event for tango://hzgc103k.desy.de:10000/test/debian8/10/event_change_tst.idl5_change
Key in event_callback_map = tango://hzgc103k.desy.de:10000/test/debian8/10/event_change_tst.idl5_change
Key in channel_map = tango://hzgc103k.desy.de:10000/dserver/devtest/test
ds_ctr1
evt_cb.ctr1
push_zmq_event from zmq::message_t
Lib: Received event for tango://hzgc103k.desy.de:10000/test/debian8/10/event_change_tst.idl5_change
Key in event_callback_map = tango://hzgc103k.desy.de:10000/test/debian8/10/event_change_tst.idl5_change
Key in channel_map = tango://hzgc103k.desy.de:10000/dserver/devtest/test
ds_ctr2
evt_cb.ctr1
I have received an event for attribute tango://localhost:10000/test/debian8/10/event_change_tst
Quality = 0
Wed Dec 13 10:44:33 2017 (1513158273,897345 sec) : event_change_tst (dim_x = 4, dim_y = 0, w_dim_x = 0, w_dim_y = 0, Data quality factor = VALID, Data format = SPECTRUM, Data type = DevLong)
Element number [0] = 10
Element number [1] = 20
Element number [2] = 33
Element number [3] = 40

I have received an event for attribute tango://localhost:10000/test/debian8/10/event_change_tst
Wed Dec 13 10:44:46 2017: The event is an error
Tango error stack
Severity = ERROR 
Error reason = API_EventTimeout
Desc : Event channel is not responding anymore, maybe the server or event system is down
Origin : EventConsumer::KeepAliveThread()

I have received an event for attribute tango://localhost:10000/test/debian8/10/event_change_tst
Quality = 0
Wed Dec 13 10:44:46 2017 (1513158286,955893 sec) : Event_change_tst (dim_x = 4, dim_y = 0, w_dim_x = 0, w_dim_y = 0, Data quality factor = VALID, Data format = SPECTRUM, Data type = DevLong)
Element number [0] = 10
Element number [1] = 20
Element number [2] = 30
Element number [3] = 40

push_zmq_event from zmq::message_t
Lib: Received event for tango://hzgc103k.desy.de:10000/test/debian8/10/event_change_tst.idl5_change
Key in event_callback_map = tango://hzgc103k.desy.de:10000/test/debian8/10/event_change_tst.idl5_change
Key in channel_map = tango://hzgc103k.desy.de:10000/dserver/devtest/test
ds_ctr1
evt_cb.ctr2
I have received an event for attribute tango://localhost:10000/test/debian8/10/event_change_tst
Quality = 0
Wed Dec 13 10:45:03 2017 (1513158303,655678 sec) : event_change_tst (dim_x = 4, dim_y = 0, w_dim_x = 0, w_dim_y = 0, Data quality factor = VALID, Data format = SPECTRUM, Data type = DevLong)
Element number [0] = 10
Element number [1] = 20
Element number [2] = 31
Element number [3] = 40

push_zmq_event from zmq::message_t
Lib: Received event for tango://hzgc103k.desy.de:10000/test/debian8/10/event_change_tst.idl5_change
Key in event_callback_map = tango://hzgc103k.desy.de:10000/test/debian8/10/event_change_tst.idl5_change
Key in channel_map = tango://hzgc103k.desy.de:10000/dserver/devtest/test
ds_ctr1
evt_cb.ctr1
push_zmq_event from zmq::message_t
Lib: Received event for tango://hzgc103k.desy.de:10000/test/debian8/10/event_change_tst.idl5_change
Key in event_callback_map = tango://hzgc103k.desy.de:10000/test/debian8/10/event_change_tst.idl5_change
Key in channel_map = tango://hzgc103k.desy.de:10000/dserver/devtest/test
ds_ctr2
evt_cb.ctr1
I have received an event for attribute tango://localhost:10000/test/debian8/10/event_change_tst
Quality = 0
Wed Dec 13 10:45:33 2017 (1513158333,655981 sec) : event_change_tst (dim_x = 4, dim_y = 0, w_dim_x = 0, w_dim_y = 0, Data quality factor = VALID, Data format = SPECTRUM, Data type = DevLong)
Element number [0] = 10
Element number [1] = 20
Element number [2] = 32
Element number [3] = 40

Could you please attach your server's out put (-v5)?

@bourtemb
Copy link
Copy Markdown
Member

You are not in the same conditions... You are using another device server, which is pushing events by code.

@Ingvord
Copy link
Copy Markdown
Member Author

Ingvord commented Dec 13, 2017

@bourtemb , nope. Here is IOIncValue which I use to increment value:

CORBA::Any *IOIncValue::execute(Tango::DeviceImpl *device, TANGO_UNUSED(const CORBA::Any &in_any))
{
	((static_cast<DevTest *>(device))->attr_event[2])++;
	((static_cast<DevTest *>(device))->attr_event64[0])++;

	return insert();
}

It simply increases the value of the attribute which is then detected by the polling thread.

@bourtemb
Copy link
Copy Markdown
Member

You are using a spectrum attribute, right? It's a different type. I don't see the problem with float_spectrum_ro attribute from TangoTest for instance.

@Ingvord
Copy link
Copy Markdown
Member Author

Ingvord commented Dec 13, 2017

OK, it makes sense now. Thanks!

@bourtemb
Copy link
Copy Markdown
Member

Here are the logs from the server started with -v5:
TangoTest.log.zip

@bourtemb
Copy link
Copy Markdown
Member

I got a core dumped on the client side when trying to subscribe to a tango 9 version of TangoTest device server:
ebssimul:~/Events % bin/debian8/event tango://localhost:10000/sys/tg_test/rb1 float_scalar
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::substr: __pos (which is 8) > this->size() (which is 0)
Aborted (core dumped)

Here is the backtrace:

Core was generated by `bin/debian8/event tango://localhost:10000/sys/tg_test/rb1 float_scalar'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fbcd8406067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56	../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007fbcd8406067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fbcd8407448 in __GI_abort () at abort.c:89
#2  0x00007fbcd8cf3b3d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fbcd8cf1bb6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fbcd8cf1c01 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fbcd8cf1e19 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fbcd8d47cdf in std::__throw_out_of_range_fmt(char const*, ...) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007fbcd8d53752 in std::string::substr(unsigned long, unsigned long) const () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fbcdadac467 in Tango::EventConsumer::connect_event (this=0x1076c10, device=0x7ffcf75a9b90, obj_name="float_scalar", 
    event=Tango::CHANGE_EVENT, callback=0x10679b0, ev_queue=0x0, filters=std::vector of length 0, capacity 0, event_name="idl5_change", 
    event_id=0) at /users/bourtemb/GIT/tango-controls/cppTango-improve-event-subscription_v2/src/client/event_consumer.cpp:2107
#9  0x00007fbcdadaa1b5 in Tango::EventConsumer::subscribe_event (this=0x1076c10, device=0x7ffcf75a9b90, attribute="float_scalar", 
    event=Tango::CHANGE_EVENT, callback=0x10679b0, ev_queue=0x0, filters=std::vector of length 0, capacity 0, stateless=false)
    at /users/bourtemb/GIT/tango-controls/cppTango-improve-event-subscription_v2/src/client/event_consumer.cpp:1564
#10 0x00007fbcdada9ebd in Tango::EventConsumer::subscribe_event (this=0x1076c10, device=0x7ffcf75a9b90, attribute="float_scalar", 
    event=Tango::CHANGE_EVENT, callback=0x10679b0, filters=std::vector of length 0, capacity 0, stateless=false)
    at /users/bourtemb/GIT/tango-controls/cppTango-improve-event-subscription_v2/src/client/event_consumer.cpp:1437
#11 0x00007fbcdacca635 in Tango::DeviceProxy::subscribe_event (this=0x7ffcf75a9b90, attr_name="float_scalar", event=Tango::CHANGE_EVENT, 
    callback=0x10679b0, filters=std::vector of length 0, capacity 0, stateless=false)
    at /users/bourtemb/GIT/tango-controls/cppTango-improve-event-subscription_v2/src/client/devapi_base.cpp:7736
#12 0x00000000004034d2 in main (argc=3, argv=0x7ffcf75a9f68) at event.cpp:185
(gdb) f 8
#8  0x00007fbcdadac467 in Tango::EventConsumer::connect_event (this=0x1076c10, device=0x7ffcf75a9b90, obj_name="float_scalar", 
    event=Tango::CHANGE_EVENT, callback=0x10679b0, ev_queue=0x0, filters=std::vector of length 0, capacity 0, event_name="idl5_change", 
    event_id=0) at /users/bourtemb/GIT/tango-controls/cppTango-improve-event-subscription_v2/src/client/event_consumer.cpp:2107
2107	    string tg_host = local_callback_key.substr(8, pos - 8);
(gdb) p local_callback_key
$1 = ""
(gdb) p pos
$2 = 18446744073709551615

In this case, it seems the local_callback_key variable is reset to "" at event_consumer.cpp:1899:
initialize_recieved_from_admin(dvlsa, std::move(local_callback_key), std::move(adm_name));
And more precisely at event_consumer.cpp:3403:
recieved_from_admin.event_name = std::move(local_callback_key);
tango_lib_ver == 925 in my case.

@Ingvord
Copy link
Copy Markdown
Member Author

Ingvord commented Dec 13, 2017

@bourtemb , I have been able to track the source of the problem you have observed (this one). This is again due to client_lib being wrong on the server side:

1513159003 [140681033012992] DEBUG dserver/TangoTest/Reynald EventConfirmSubscriptionCmd: confirm subscription for device sys/tg_test/rb1 attribute/pipe float_scalar event idl5_change
1513159003 [140681033012992] DEBUG dserver/TangoTest/Reynald DServer::event_subscription(): update change subscription

1513159003 [140681033012992] DEBUG dserver/TangoTest/Reynald Attribute::set_client_lib(5,change)

Note 5 in the last line. This is due to eventcmds.cpp#1077:

string::size_type pos = event.find(EVENT_COMPAT);
if (pos != string::npos)
{
	client_lib_str = event.substr(pos + 3,1);
	stringstream ss;
	ss << client_lib_str;//extracted 5
	ss >> client_lib;
	event.erase(0,EVENT_COMPAT_IDL5_SIZE);
}

Should be fixed in 11f826f

@bourtemb
Copy link
Copy Markdown
Member

It looks like 11f826f is not fixing the duplicated event I was reporting and is also introducing other incompatibility issues with old clients/servers.
Running a V10 client subscribing to a V9 server no longer crashes (good news 👍) but the events are no longer working properly because we get API_EventTimeout errors and a reconnection attempt every 10 seconds with a synchronous call every 10s.
Running a V9 client subscribing to a V10 server is even worse because the callback is invoked only during the synchronous calls. There is no API_EventTimeout errors in this case but from the client perspective, it looks like no event is received after the synchronous call from the subscription.

@bourtemb
Copy link
Copy Markdown
Member

It looks very strange to me to have a variable named EVENT_COMPAT_IDL5 and to assign it a value like "idl6_".
I wouldn't play with this variable for the moment. I don't think the problem is due to that and I think we can still use this "idl5_" string for Tango V10. It just means that the events are from a version >= IDL 5.
The events didn't change on Tango V10 on the wire as far as I know. I wouldn't change that for the moment.

@bourtemb
Copy link
Copy Markdown
Member

Working on tango-9-lts may simplify your life too.

@bourtemb
Copy link
Copy Markdown
Member

bourtemb commented Dec 13, 2017

About the network alias issues, there is one case which works now (except compatibility and duplicated events issues), when the client subscribes with FQAN = tango://<network_alias>.<domain_name>:<port_number>/my/device/name/attr... 👍
The following one does not work (network alias without domain name) in my config: FQAN = tango://<network_alias>:<port_number>/my/device/name/attr... :( In this case, the events are received, but the name in the events (even the synchronous calls) corresponds to the real hostname extended with the domain name.

@Ingvord
Copy link
Copy Markdown
Member Author

Ingvord commented Dec 13, 2017

It looks very strange to me to have a variable named EVENT_COMPAT_IDL5 and to assign it a value like "idl6_".
I wouldn't play with this variable for the moment.I don't think the problem is due to that

I would wish to think so too... but. There is a clear evidence:

  1. client executes EventConfirmSubscriptionCmd in eventkeepalive.cpp#694 and passes XXX.idl5_change as an event name
  2. DServer extracts 5 from this and puts it into client_libs:
1513159003 [140681033012992] DEBUG dserver/TangoTest/Reynald EventConfirmSubscriptionCmd: confirm subscription for device sys/tg_test/rb1 attribute/pipe float_scalar event idl5_change
1513159003 [140681033012992] DEBUG dserver/TangoTest/Reynald DServer::event_subscription(): update change subscription

1513159003 [140681033012992] DEBUG dserver/TangoTest/Reynald Attribute::set_client_lib(5,change)

The last line is in attribute.cpp#5945:

if (count(client_lib[i].begin(), client_lib[i].end(), _l) == 0)
        client_lib[i].push_back(_l);
  1. In polling thread eventsupllier.cpp#396:
vector<int> &client_libs = attr.get_client_lib(CHANGE_EVENT);//returns client_lib[i] from 2)
vector<int>::iterator ite;
string ev_name = EventName[CHANGE_EVENT];
bool inc_ctr = true;

cout3 << "EventSupplier::detect_and_push_change_event(): iterating over client libs" << endl;
for (ite = client_libs.begin();ite != client_libs.end();++ite)
{
    // Now it has two ite values - 5&6 -> push_event is called twice
    //...
    cout3 << "push_event" << endl;
			push_event(device_impl,
				   ev_name,
				   filterable_names,
				   filterable_data,
				   filterable_names_lg,
				   filterable_data_lg,
				   sent_value,
				   attr_name,
				   except,
				   inc_ctr);

The only way to fix this (as I see it) is to tell the server the exact version of the client (hence f49cb02)

But, unlike ZmqEventSubscriptionChange, EventConfirmSubscription accepts DevVarStringArray for multiple events and checks the input argument like this:

if ((in_data->length() == 0) || (in_data->length() % 3) != 0)
	{
		TangoSys_OMemStream o;
		o << "Wrong number of input arguments: 3 needed per event: device name, attribute/pipe name and event name"
		  << endl;

		Except::throw_exception((const char *) API_WrongNumberOfArgs, o.str(),
								(const char *) "DServer::event_confirm_subscription");
	}

So basically I can not add another argument that specifies client version:

//THIS WON'T WORK
if ((in_data->length() == 0) || (in_data->length() % 4) != 0)

Because for instance both 12 % 3 and 12 % 4 result in != 0. But in the first case this could be old client asking for 4 attributes and in the later a new client with 3 attributes. Needless to say that the following code will fail:

//eventcmds.cpp#1027
for (size_t i = 0; i < nb_event; i++)
	{
		string dev_name, obj_name, event, obj_name_lower;
		int base = i * 3;
		dev_name = (*argin)[base];
		obj_name = (*argin)[base + 1];
		event = (*argin)[base + 2];

This way I ended up with 11f826f

Working on tango-9-lts may simplify your life too.

At the moment this does not have much difference. In fact I am not sure this PR is even back portable :(

@Ingvord
Copy link
Copy Markdown
Member Author

Ingvord commented Dec 13, 2017

@bourtemb

Working on tango-9-lts may simplify your life too.

I have just got an idea: did you mean - there is no client_lib==6 in Tango 9, so this problem might not occur?

@bourtemb
Copy link
Copy Markdown
Member

bourtemb commented Dec 14, 2017

I have just got an idea: did you mean - there is no client_lib==6 in Tango 9, so this problem might not occur?

Yes, that's what I meant.

@Ingvord Ingvord removed this from the Q4'2017 milestone Dec 14, 2017
@Ingvord
Copy link
Copy Markdown
Member Author

Ingvord commented May 11, 2018

Closed due to #423

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants