DEALER-REP hangs forever #29

Closed
bbkr opened this issue Feb 12, 2016 · 9 comments

Comments

bbkr commented Feb 12, 2016

Hi

I have a DEALER-to-REP flow. The client throttles itself (no more than 4 asynchronous requests in flight at the same time) and monitors the socket for asynchronous replies. The server is synchronous.

The client script gets stuck at a random point. For example, the client produced 27 requests and the server processed all 27, but the client received only 23 replies, so throttling kicked in and locked the script forever. I've debugged this, and the IO callback on the socket is no longer called after the 23rd message is received.

Environment:

Perl 5.14.4, 64bit, no threads
ZMQ::FFI - 1.11
AnyEvent - 7.08 (no EV or other backends installed)
FFI::Platypus - 0.40

Server:

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dumper;
use ZMQ::FFI;
use ZMQ::FFI::Constants qw(ZMQ_REP);

my $context = ZMQ::FFI->new();
my $responder = $context->socket(ZMQ_REP);
$responder->bind('tcp://*:5555');

while (1) {
    my $msg = $responder->recv();
    print "processing $msg\n";
    $responder->send($msg);
}

Client:

#!/usr/bin/env perl

use strict;
use warnings;

use AnyEvent;
use ZMQ::FFI;
use ZMQ::FFI::Constants qw(ZMQ_DEALER);

my $context = ZMQ::FFI->new();
my $requestor = $context->socket(ZMQ_DEALER);
$requestor->connect('tcp://localhost:5555');
# ...more servers can be added

my $tasks = 100_000;
my ($req, $rep) = (0, 0);
my ($in, $out);

$out = AnyEvent->idle(
    'cb' => sub {

        # throttle requests, only 4 at a time
        return if $req - $rep >= 4;

        # send request, emulate REQ frame
        $requestor->send_multipart(['', ++$req]);
        print "sent $req\n";

        # do this amount of tasks
        $out = undef if $req == $tasks;
    }
);

$in = AnyEvent->io(
      'fh' => $requestor->get_fd,
      'poll' => 'r',
      'cb'   => sub {

          # expect multiple answers
          while ($requestor->has_pollin) {

             # receive response
             my @msg = $requestor->recv_multipart();
             $rep++;
             print "received $rep\n";

             # finish if every task was processed
             exit if $req == $tasks and $rep == $tasks;
         }
      }
   );

AnyEvent::Loop->run();

I'm new to 0MQ so please forgive me if this is not a bug in ZMQ::FFI but bad logic in my code.

Author

bbkr commented Feb 12, 2016

I've noticed that if I add

my $t = AnyEvent->timer(
    'after' => 10,
    'cb' => sub { print $requestor->recv_multipart() }
);

Then I get the next message from the socket after the script locks up.
So it looks like has_pollin is not detecting all available messages: it leaves some data on the socket, and no further IO callback is generated because not all messages were consumed.

Or my theory can be completely wrong :)

Member

calid commented Feb 12, 2016

Use DEALER-DEALER. REQ/REP sockets are basic types that are generally discouraged for real applications. My guess is that the REP socket is getting out of sync, and since it HAS to follow a recv, send, recv, send order, it gets stuck at that point. Using DEALER-DEALER worked without issue.

Author

bbkr commented Feb 12, 2016

The REP socket on the server is following the recv, send, recv states correctly. When the lock happens, I can see that the server has sent 4 replies that the client has not received. It's the client that never gets an IO callback, despite those messages being available on the socket later.

Besides, DEALER-REP is a valid pattern and should never get "out of sync" on the REP side (and in my case it should not exceed the HWM either, because of throttling). I cannot imagine how such a desync could happen, because REP has its own buffer.

I'll try DEALER-DEALER, but I do not want an event loop in the server. That would mean receiving each task in an AE callback, which greatly complicates the task execution flow because I cannot use $condvar->recv() to synchronize the async steps required to do a task.

Author

bbkr commented Feb 12, 2016

Ah, also, with DEALER-DEALER there can be only one peer; that's another reason why I chose DEALER-REP.

Author

bbkr commented Feb 13, 2016

OK, here is where stuff gets interesting...

I disabled the throttling line (return if $req - $rep >= 4) and got all messages.
I set throttling to 1 and also got all messages.
I set throttling to 2 and got random lockups.
I set throttling to 3 and got random lockups.
I set throttling to 500 and got rare random lockups.

So now I'm really confused. Why does the code work when there is one message or a whole bunch of messages in flight on the socket at the same time, but lock up when there are only a few? I've tested it on 1_000_000 messages.

Member

calid commented Feb 14, 2016

@bbkr I'm a bit slammed with work at the moment, but I'll take a deeper look at this as soon as I have a chance. I've certainly run into weird behavior using event loops with zeromq's virtual fd in the past. Usually this comes down to not handling zeromq's edge-triggered semantics in exactly the right way. Is it possible this is the issue?

If you aren't familiar with edge-triggered vs level-triggered behavior, this article seems like a nice overview of the issues:
http://funcptr.net/2012/09/10/zeromq---edge-triggered-notification/
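
To make the distinction concrete, here is a toy model in plain Perl. It is not ZeroMQ and does not use ZMQ::FFI; the ToyQueue class and all of its methods are invented purely for illustration. It assumes that an edge-triggered source signals only when the queue goes from empty to non-empty, while a level-triggered check reports "readable" whenever anything is queued.

```perl
use strict;
use warnings;

# Toy model, NOT ZeroMQ: a queue that counts "edges"
# (empty -> non-empty transitions), roughly how an edge-triggered fd signals.
package ToyQueue;

sub new {
    my ($class) = @_;
    return bless { items => [], edges => 0 }, $class;
}

sub push_msg {
    my ($self, $msg) = @_;
    # An edge is raised only when the queue was empty before this push.
    $self->{edges}++ if !@{ $self->{items} };
    push @{ $self->{items} }, $msg;
    return;
}

sub pop_msg  { return shift @{ $_[0]{items} } }
sub pending  { return scalar @{ $_[0]{items} } }      # number of queued messages
sub readable { return $_[0]->pending ? 1 : 0 }        # level-triggered view
sub edges    { return $_[0]{edges} }                  # edge-triggered view

package main;

my $q = ToyQueue->new;
$q->push_msg('a');    # empty -> non-empty: one edge
$q->push_msg('b');    # already non-empty: no edge
$q->pop_msg;          # the handler reads only one message and returns
$q->push_msg('c');    # still non-empty: no new edge, so no new wake-up

printf "edges=%d readable=%d pending=%d\n", $q->edges, $q->readable, $q->pending;
# prints "edges=1 readable=1 pending=2"
```

In this model a handler that is woken once and reads a single message is never woken again, even though two messages are still queued. A level-triggered watcher would keep reporting the queue as readable.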

Author

bbkr commented Feb 15, 2016

So basically, in the edge-triggered model I must consume all messages to get the next "IO is readable" callback.
That means that if something arrives after $requestor->has_pollin() returns false but before the callback exits, the client will lock. And to complicate things, the readable notification on the socket can be a false positive.

So the scenario that leads to lock:

  • got IO readable callback
  • enter callback body
  • ask for pollin
  • get reply "false"
  • meanwhile, a message arrives on the socket
  • leave callback body

(it may also happen after consuming a few messages while in the callback)

That means many examples linked in the ZMQ guide, and even this library's PUSH/PULL synopsis, are prone to this error.

I have no idea how to fix it in code that needs an AnyEvent loop. The obvious hack is to give up on IO monitors and use timers, but that is very inefficient.
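
One approach that may help, offered as a sketch and not as a confirmed fix for this issue: the libzmq documentation for ZMQ_FD notes that the descriptor is also used internally by send and recv, so a socket can become readable after a send without the fd ever signalling. Under that assumption, the client would re-check for pending replies after every send as well as inside the IO callback. The helper below is hypothetical (drain_replies and $on_message are not part of ZMQ::FFI); it relies only on has_pollin and recv_multipart, which the client above already uses.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of ZMQ::FFI: read every message that is
# currently available and pass its frames to $on_message.
# $socket only needs has_pollin() and recv_multipart().
# Returns the number of messages consumed.
sub drain_replies {
    my ($socket, $on_message) = @_;
    my $count = 0;
    while ($socket->has_pollin) {
        my @frames = $socket->recv_multipart();
        $on_message->(@frames);
        $count++;
    }
    return $count;
}

# Possible wiring in the client from the first comment (untested sketch;
# $on_reply is an assumed callback that increments $rep and checks for exit):
#
#   # in the AnyEvent->idle callback, right after sending:
#   $requestor->send_multipart(['', ++$req]);
#   drain_replies($requestor, $on_reply);    # re-check state after each send
#
#   # in the AnyEvent->io callback:
#   'cb' => sub { drain_replies($requestor, $on_reply) },
```

Keeping the drain loop in one sub means both call sites consume messages the same way, so the send path cannot leave replies stranded while waiting for a wake-up that may never come.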

Member

calid commented Mar 28, 2016

@bbkr I haven't forgotten about this, just been extremely busy... I should be able to look at this in the next few weeks though.

Member

calid commented Feb 26, 2019

I was unable to reproduce the hang using the example client/server in your initial comment; either this was an issue with older versions or it is specific to your local system. I let it run for several minutes, sending/receiving tens of thousands of messages without issue.

Perl - v5.16.3 x86_64-linux-thread-multi
ZMQ::FFI - 1.12
AnyEvent - 7.15 
FFI::Platypus - 0.84

calid closed this as completed Feb 26, 2019