Worker freezes with 'Muting DataChannel' #3118

Closed
slfritchie opened this issue Mar 10, 2020 · 0 comments · Fixed by #3121

Is this a bug, feature request, or feedback?

Bug

What is the current behavior?

In a 2-worker cluster, a modest workload sent to initializer intermittently causes worker1 to freeze after it prints a series of 15 or more 'Muting DataChannel' messages. Work continues on initializer until the data channel(s) from initializer -> worker1 apply back pressure to initializer.

What is the expected behavior?

No freeze

What OS and version of Wallaroo are you using?

Ubuntu Bionic/18.04 LTS + Wallaroo @ commit 35d2038

Steps to reproduce?

  1. Use the instructions in Intermittent race in forwarding state during autoscale #3117 to set up a 1 or 2 CPU virtual machine, 4GB RAM minimum.
    • You may want to use env PONYCFLAGS="--verbose=1 --debug -Dresilience -Dtrace -Dcheckpoint_trace -Didentify_routing_ids" make when building Machida3.
  2. Use the CSV file at https://gist.githubusercontent.com/slfritchie/065bb9325d1844c581067e90b9dae542/raw/3a33c348527d0c449bf7f6c449bd9ce4969a77ce/3118.csv as the input to the recipe below.
vagrant@ubuntu-bionic:/build2$ reset.sh
WARNING: all useful state files are deleted by this script!

vagrant@ubuntu-bionic:/build2$ start-cluster.sh 2
WARNING: all useful state files are deleted by this script!
Worker initializer: port = 7107
Worker worker1: port = 7117
Success

vagrant@ubuntu-bionic:/build2$ for i in `seq 1 3600`; do /bin/echo -n . ; cat /path/to/3118.csv  | ./frame-text-lines.py | nc -w 1 localhost 7100; done
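
The loop above pipes the CSV through frame-text-lines.py before sending it with nc. That script is not shown in this issue, so the following is only a guess at its shape: a minimal sketch that assumes each text line is sent as a 4-byte big-endian length header followed by the line's bytes. The function name `frame_lines`, the header format, and the choice to skip empty lines are all assumptions, not taken from the Wallaroo sources.

```python
import struct


def frame_lines(text_lines):
    """Yield one length-prefixed frame per non-empty input line.

    Assumption: a frame is a 4-byte big-endian length followed by the
    UTF-8 payload. The real frame-text-lines.py may differ.
    """
    for line in text_lines:
        # Drop the trailing newline so it is not counted in the payload.
        payload = line.rstrip("\r\n").encode("utf-8")
        if payload:  # assumption: empty lines are not sent
            yield struct.pack(">I", len(payload)) + payload
```

To use it as a filter in a pipeline like the one above, one could write each frame from `frame_lines(sys.stdin)` to `sys.stdout.buffer`.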

The bug may take up to an hour to manifest. See full logs at http://wallaroolabs-dev.s3.amazonaws.com/logs/logs.1583818189.tar.gz. From /tmp/wallaroo.1:

1583816848.074484,_CheckpointEventLogPhase: check_completion() with 40 checkpointed and 51 total
1583816848.077920,Muting DataChannel
1583816848.077938,Muting DataChannel
1583816848.077948,Muting DataChannel
1583816848.077956,Muting DataChannel
1583816848.077964,Muting DataChannel
1583816848.077971,Muting DataChannel
1583816848.077979,Muting DataChannel
1583816848.077987,Muting DataChannel
1583816848.077995,Muting DataChannel
1583816848.078003,Muting DataChannel
1583816848.078010,Muting DataChannel
1583816848.078018,Muting DataChannel
1583816848.078025,Muting DataChannel
1583816848.078033,Muting DataChannel
1583816848.078041,Muting DataChannel
1583816848.078049,Muting DataChannel
1583816848.078057,Muting DataChannel
1583816848.078065,Muting DataChannel
1583816848.078072,Muting DataChannel
1583816848.078080,Muting DataChannel
1583816848.078088,Muting DataChannel
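
Because the bug can take up to an hour to show, it may help to scan a worker log for the symptom rather than watch it. The sketch below assumes every log line has the `timestamp,message` shape seen in the excerpt above; the function name `longest_mute_run` is hypothetical, and the 15-line threshold in the usage note comes from the description in this issue, not from Wallaroo itself.

```python
def longest_mute_run(lines, needle="Muting DataChannel"):
    """Return the length of the longest run of consecutive lines
    whose message part is exactly `needle`."""
    longest = 0
    current = 0
    for line in lines:
        # Split once on the first comma: "<epoch seconds>,<message>".
        _, _, message = line.rstrip("\r\n").partition(",")
        if message == needle:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest
```

For example, opening /tmp/wallaroo.1 and passing the file object to `longest_mute_run` gives a count that can be compared against 15 to flag a suspected freeze.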
slfritchie self-assigned this Mar 10, 2020
slfritchie added a commit that referenced this issue Mar 12, 2020
* Move increment of _ack_counter into _maybe_ack()
* Add call to _maybe_ack() to forward_barrier() to fix #3118.
  Under very low application message rates and resilience=on,
  the accumulation of barrier messages on the other end of the
  boundary was caused by not acking here.  The _ack_counter
  could become an odd number, and _maybe_ack()'s modulo
  arithmetic + comparison could never be true, which meant
  never sending an ack.

Fixes #3118
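
The commit message above can be illustrated with a toy model. This is not Wallaroo's Pony code: the class names, the `ACK_INTERVAL` value, and the exact control flow are assumptions chosen to reproduce the described symptom, where the counter is only ever checked at odd values and the modulo test never succeeds. The fixed variant follows the two changes listed in the commit: the increment moves into `maybe_ack()`, and the barrier path calls it.

```python
ACK_INTERVAL = 4  # hypothetical value; any even interval shows the effect


class BuggyReceiver:
    """Toy model: the counter is incremented outside maybe_ack()."""

    def __init__(self):
        self.ack_counter = 0
        self.acks_sent = 0

    def maybe_ack(self):
        if self.ack_counter % ACK_INTERVAL == 0:
            self.acks_sent += 1

    def receive(self):
        # Data message: increment, then check.
        self.ack_counter += 1
        self.maybe_ack()

    def forward_barrier(self):
        # Assumed buggy path: the counter moves but is never checked.
        self.ack_counter += 1


class FixedReceiver(BuggyReceiver):
    """Increment lives in maybe_ack(); both paths call it."""

    def maybe_ack(self):
        self.ack_counter += 1
        if self.ack_counter % ACK_INTERVAL == 0:
            self.acks_sent += 1

    def receive(self):
        self.maybe_ack()

    def forward_barrier(self):
        self.maybe_ack()


def drive(receiver, rounds):
    """Alternate one data message and one barrier, `rounds` times."""
    for _ in range(rounds):
        receiver.receive()
        receiver.forward_barrier()
    return receiver.acks_sent
```

In the buggy model the check only ever sees 1, 3, 5, and so on, so `drive(BuggyReceiver(), 100)` sends no acks, while the fixed model checks every increment and acks once per `ACK_INTERVAL` messages.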
jtfmumm pushed a commit that referenced this issue Mar 12, 2020