DEALER-REP hangs forever #29

bbkr · 2016-02-12T11:51:49Z

Hi

I have a DEALER-to-REP flow. The client throttles requests (no more than 4 async requests in flight at a time) and monitors the socket for asynchronous replies. The server is synchronous.

The client script gets stuck at a random point. For example, the client sent 27 requests and the server processed all 27, but the client received only 23 replies, so throttling kicks in and locks the script forever. I've debugged this: the IO callback on the socket is no longer called after the 23rd message is received.

Environment:

Perl 5.14.4, 64bit, no threads
ZMQ::FFI - 1.11
AnyEvent - 7.08 (no EV or other backends installed)
FFI::Platypus - 0.40

Server:

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dumper;
use ZMQ::FFI;
use ZMQ::FFI::Constants qw(ZMQ_REP);

my $context = ZMQ::FFI->new();
my $responder = $context->socket(ZMQ_REP);
$responder->bind('tcp://*:5555');

while (1) {
    my $msg = $responder->recv();
    print "processing $msg\n";
    $responder->send($msg);
}

Client:

#!/usr/bin/env perl

use strict;
use warnings;

use AnyEvent;
use ZMQ::FFI;
use ZMQ::FFI::Constants qw(ZMQ_DEALER);

my $context = ZMQ::FFI->new();
my $requestor = $context->socket(ZMQ_DEALER);
$requestor->connect('tcp://localhost:5555');
# ...more servers can be added

my $tasks = 100_000;
my ($req, $rep) = (0, 0);
my ($in, $out);

$out = AnyEvent->idle(
    'cb' => sub {

        # throttle requests, only 4 at a time
        return if $req - $rep >= 4;

        # send request, emulate REQ frame
        $requestor->send_multipart(['', ++$req]);
        print "sent $req\n";

        # do this amount of tasks
        $out = undef if $req == $tasks;
    }
);

$in = AnyEvent->io(
      'fh' => $requestor->get_fd,
      'poll' => 'r',
      'cb'   => sub {

          # expect multiple answers
          while ($requestor->has_pollin) {

             # receive response
             my @msg = $requestor->recv_multipart();
             $rep++;
             print "received $rep\n";

             # finish if every task was processed
             exit if $req == $tasks and $rep == $tasks;
         }
      }
   );

AnyEvent::Loop->run();

I'm new to 0MQ so please forgive me if this is not a bug in ZMQ::FFI but bad logic in my code.

bbkr · 2016-02-12T12:03:11Z

I've noticed that if I add

my $t = AnyEvent->timer(
    'after' => 10,
    'cb' => sub { print $requestor->recv_multipart() }
);

Then I get the next message from the socket after the script gets locked.
So it looks like has_pollin is not detecting all available messages: it leaves some data on the socket, and no further IO callback is generated because not all messages were consumed.

Or my theory can be completely wrong :)

calid · 2016-02-12T21:08:22Z

Use DEALER-DEALER. REQ/REP sockets are basic types that are generally discouraged for real applications. My guess is that the REP socket is getting out of sync, and since it HAS to follow the recv, send, recv, send order, it gets stuck at that point. Using DEALER-DEALER worked without issue.

bbkr · 2016-02-12T23:36:37Z

The REP socket on the server is following the recv, send, recv states correctly. When the lock happens I can see that the server produced 4 more messages than the client received. It's the client that never gets the IO callback, despite those messages being available on the socket later.

Besides, DEALER-REP is a valid pattern and should never get "out of sync" on the REP side (and in my case it should not exceed the HWM either, because of throttling). I cannot imagine how such a desync is possible, because REP has its own buffer.

I'll try DEALER-DEALER. However, I do not want an event loop in the server: that would mean receiving each task in an AE callback, which greatly complicates the task execution flow because I cannot use $condvar->recv() to synchronize the async steps required to do the task.

bbkr · 2016-02-12T23:38:56Z

Ah, also in DEALER-DEALER there can be only one peer, that's another reason why I chose DEALER-REP.

bbkr · 2016-02-13T01:26:30Z

OK, here is where stuff gets interesting...

I disabled the throttling line (return if $req - $rep >= 4) and got all messages.
I set throttling to 1 and also got all messages.
I set throttling to 2 and got random lockups.
I set throttling to 3 and got random lockups.
I set throttling to 500 and got rare random lockups.

So now I'm really confused. Why does the code work when there is one message or a whole bunch of messages in flight on the socket at the same time, but lock up when there are only a few? I've tested it with 1_000_000 messages.

calid · 2016-02-14T05:50:51Z

@bbkr a bit slammed with work at the moment, but I'll take a deeper look at this just as soon as I have a chance. I've certainly run into weird behavior using event loops + zeromq's virtual fd in the past. Usually this is down to not handling zeromq's edge triggered semantics in exactly the right way. Is it possible this is the issue?

If you aren't familiar with edge triggered vs level triggered behavior this article seems like a nice overview of the issues:
http://funcptr.net/2012/09/10/zeromq---edge-triggered-notification/

bbkr · 2016-02-15T21:04:01Z

So basically, in the edge triggered model I must consume all messages to get the next "IO is readable" callback.
That means if something arrives after $requestor->has_pollin() returns false but before the callback exits, the client will lock. And to complicate things, the "socket readable" notification can be a false positive.

So the scenario that leads to the lock is:

1. got IO readable callback
2. enter callback body
3. ask for pollin
4. get reply "false"
5. meanwhile, a message arrives on the socket
6. leave callback body

(It may also happen after consuming a few messages while in the callback.)

That means many examples linked in the ZMQ guide, and even this library's PUSH/PULL synopsis, are prone to this error.

I have no idea how to fix it in code that needs the AnyEvent loop. The obvious hack is to give up on IO monitors and use timers, but that is very inefficient.
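
A possible way to avoid this in an AnyEvent client, offered as a minimal sketch and not as a confirmed fix for this issue: the libzmq documentation for ZMQ_FD notes that the descriptor is also used internally by zmq_send and zmq_recv, so the event state should be re-checked after every send as well as inside the IO callback. The helper names drain_socket and send_and_drain below are hypothetical; they rely only on has_pollin, recv_multipart, and send_multipart, which the client above already uses.

```perl
use strict;
use warnings;

# Hypothetical helper: read every message that is currently readable on
# $socket and pass its frames to $on_message. $socket is any object that
# offers has_pollin() and recv_multipart(), as the client's socket does.
# Returns the number of messages read.
sub drain_socket {
    my ($socket, $on_message) = @_;

    my $count = 0;
    while ($socket->has_pollin) {
        my @frames = $socket->recv_multipart();
        $on_message->(@frames);
        $count++;
    }
    return $count;
}

# Hypothetical helper: send one multipart request, then immediately
# re-check for replies instead of waiting for the next fd notification,
# since a send may leave the socket readable without a new read event.
sub send_and_drain {
    my ($socket, $frames, $on_message) = @_;

    $socket->send_multipart($frames);
    return drain_socket($socket, $on_message);
}
```

In the client above, the idle callback would call send_and_drain($requestor, ['', ++$req], $on_reply) in place of send_multipart, and the IO callback would call drain_socket($requestor, $on_reply), where $on_reply (also a hypothetical name) increments $rep. Whether this removes the hang reported here has not been tested.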

calid · 2016-03-28T14:16:56Z

@bbkr I haven't forgotten about this, just been extremely busy... I should be able to look at this in the next few weeks though.

calid · 2019-02-26T11:23:15Z

I was unable to reproduce the hang using the example client/server in your initial comment. Either this was an issue with older versions or it is specific to your local system. I let it run for several minutes, sending/receiving tens of thousands of messages without issue.

Perl - v5.16.3 x86_64-linux-thread-multi
ZMQ::FFI - 1.12
AnyEvent - 7.15 
FFI::Platypus - 0.84

calid closed this as completed Feb 26, 2019
