
messages blocking a queue and causing runaway message_sender #14862

Closed
pohutukawa opened this issue May 4, 2020 · 9 comments

@pohutukawa

Recently our Zulip chat server (running on a DigitalOcean instance with 2 vCPUs and 4 GB of RAM) has started to behave strangely and no longer delivers its services reliably. It is a fully patched Ubuntu 18.04 instance with an upgraded Zulip 2.1.4 install, based on the DigitalOcean droplet image (originally Zulip 2.1.2).

In particular, one or more of the message_sender processors/workers go ballistic and each start consuming nearly 100% CPU. When that happens, some messages are no longer delivered, and after restarts, messages are delivered wildly out of order. The situation does not improve with service restarts (even repeated ones) or full server (host) reboots.

Things I have done over the last week:

  • Added a crontab entry to restart the server every night
  • Upgraded Zulip to 2.1.4
  • Re-installed RabbitMQ
  • Purged the message_sender queue on the RabbitMQ broker (roughly as sketched below)
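
For reference, the purge was done roughly like this (a sketch; the exact invocation may have differed slightly):

# drop all pending messages from the stuck queue, then confirm it is empty
rabbitmqctl purge_queue message_sender
rabbitmqctl list_queues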

Note: On initial install, we migrated our previous public Slack content to the Zulip server for better continuity.

Here the runaway message_sender worker (worker_num=1) is visible:

root@zulip:~# ps awux | grep message_sender
root      5827  0.0  0.0  14856  1008 pts/0    S+   22:54   0:00 grep --color=auto message_sender
zulip    19028  0.0  2.6 488668 107436 ?       S    04:25   0:04 python3 /home/zulip/deployments/current/manage.py process_queue --queue_name=message_sender --worker_num=0
zulip    19029 11.4  2.7 490628 109740 ?       Sl   04:25 126:29 python3 /home/zulip/deployments/current/manage.py process_queue --queue_name=message_sender --worker_num=1
zulip    19030  0.0  2.6 488584 107648 ?       S    04:25   0:04 python3 /home/zulip/deployments/current/manage.py process_queue --queue_name=message_sender --worker_num=2
zulip    19031  0.0  2.6 488640 107972 ?       S    04:25   0:04 python3 /home/zulip/deployments/current/manage.py process_queue --queue_name=message_sender --worker_num=3

When this happens, the message_sender queue apparently fills up more and more as the server is used:

root@zulip:~# rabbitmqctl list_queues
Listing queues
tornado_return  0
embedded_bots   0
missedmessage_email_senders     0
missedmessage_mobile_notifications      0
user_activity   12
invites 0
embed_links     0
email_senders   0
missedmessage_emails    0
error_reports   0
digest_emails   0
slow_queries    0
notify_tornado  0
outgoing_webhooks       0
message_sender  54
user_presence   0
deferred_work   0
email_mirror    0
user_activity_interval  0
signups 0

The queue doesn't drain; it retains an increasing number of messages that never seem to get pulled off. The DigitalOcean load graph shows that I purged the message_sender queue at 4:20 pm, which "normalised" things again; then at about 8:45 am the problem re-occurred:

[Screenshot: DigitalOcean CPU load graph]

How can this problem be solved? We really like Zulip (and don't want to go back to Slack or switch to something like MS Teams), but at the moment the system is not really usable due to this unreliability. How can I fix things?

@timabbott

@pohutukawa if you're up for it, I'd recommend upgrading to master with upgrade-zulip-from-git master. We've removed the entire message_sender queue system from Zulip as part of the work towards the Zulip 2.2 release, which eliminates this category of failure mode (one we haven't seen before).

https://zulip.readthedocs.io/en/latest/production/upgrade-or-modify.html#upgrading-from-a-git-repository

Another option is to stop the Zulip server and then use manage.py purge_queue message_sender to delete all the messages in that queue; that should also resolve your problem.
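
Roughly, either option looks like this (a sketch; the paths follow the standard deployment layout visible in your ps output, and using supervisorctl to stop the services is an assumption about your setup):

# Option 1: upgrade to current master (run as root)
/home/zulip/deployments/current/scripts/upgrade-zulip-from-git master

# Option 2: stop the Zulip services, purge the stuck queue, then start them again
supervisorctl stop all
su zulip -c '/home/zulip/deployments/current/manage.py purge_queue message_sender'
supervisorctl start all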

@zulipbot commented May 5, 2020

Hello @zulip/server-production members, this issue was labeled with the "area: production" label, so you may want to check it out!

@timabbott

The message_sender queue only processes messages sent via the Zulip web UI, so it's likely that someone in your organization sent a message containing a link or other content that is sending the markdown processor into an infinite loop. This should be hard, because we run the markdown processor under a 5-second timeout that kills the job if it runs for too long. I'd consider asking whether anyone in your organization is sending messages designed to DoS a parser; if that's indeed the issue, please report the problematic message content to security@zulipchat.com so we can investigate.
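
One way to check whether you're hitting this is to see whether a single worker stays pegged while the queue depth keeps climbing; a rough sketch using only stock tools:

# a message_sender worker stuck near 100% CPU is a candidate for a looping markdown job
ps aux | grep '[p]rocess_queue --queue_name=message_sender'
# the queue depth should normally hover near zero; here it keeps growing
watch -n 60 'rabbitmqctl list_queues | grep message_sender'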

@pohutukawa

Thanks @timabbott, I'll try the upgrade to Git master for now. Since I've already purged messages from the message_sender queue and got the same problem again (twice) within a day, purging doesn't seem to be a viable solution. Also, I imagine purging could mean message loss. Not so good.

As for the Markdown DoS scenario: I'd doubt that any of our guys would be doing that, at least intentionally. Having said that, though, is there a way to selectively access the content of a/the problematic message to get to the bottom of the problem?

@pohutukawa

The upgrade to Git master is done. Let's see what happens now.

Note: rabbitmqctl list_queues does not show any more queues at all. ("Error: unable to connect to node zulip@localhost: nodedown")

@timabbott

Note: rabbitmqctl list_queues does not show any more queues at all. ("Error: unable to connect to node zulip@localhost: nodedown")

That seems wrong; is your server actually working? But in any case, the message_sender queue should be gone.

As for the Markdown DoS scenario: I'd doubt that any of our guys would be doing that, at least intentionally. Having said that, though, is there a way to selectively access the content of a/the problematic message to get to the bottom of the problem?

No, because the message is failing to complete sending, so it's not stored on the server at all.

@pohutukawa

Note: rabbitmqctl list_queues does not show any more queues at all. ("Error: unable to connect to node zulip@localhost: nodedown")

That seems wrong; is your server actually working? But in any case, the message_sender queue should be gone.

The server is working, but the list_queues command does not show any queues. I have rebooted since, and this is the output:

# rabbitmqctl list_queues
Error: unable to connect to node zulip@localhost: nodedown

But the RabbitMQ broker is running:

# service rabbitmq-server status
● rabbitmq-server.service - RabbitMQ Messaging Server
   Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2020-05-01 03:29:13 UTC; 4 days ago
  Process: 925 ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait (code=exited, status=0/SU
 Main PID: 924 (rabbitmq-server)
    Tasks: 88 (limit: 4702)
   CGroup: /system.slice/rabbitmq-server.service
           ├─ 924 /bin/sh /usr/sbin/rabbitmq-server
           ├─ 990 /bin/sh /usr/lib/rabbitmq/bin/rabbitmq-server
           ├─1244 /usr/lib/erlang/erts-9.2/bin/epmd -daemon
           ├─1401 /usr/lib/erlang/erts-9.2/bin/beam.smp -W w -A 64 -P 1048576 -t 5000000 -stbt db
           ├─1515 erl_child_setup 65536
           ├─1632 inet_gethost 4
           └─1633 inet_gethost 4

May 04 15:11:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-56
May 05 05:19:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-27
May 05 05:23:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-73
May 05 05:29:01 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-19
May 05 05:34:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-86
May 05 13:23:45 zulip epmd[1244]: epmd: got partial packet only on file descriptor 6 (0)
May 05 15:32:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-19
May 05 15:43:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-20
May 05 22:27:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-87
May 05 23:10:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-88

As for the Markdown DoS scenario: I'd doubt that any of our guys would be doing that, at least intentionally. Having said that, though, is there a way to selectively access the content of a/the problematic message to get to the bottom of the problem?

No, because the message is failing to complete sending, so it's not stored on the server at all.

Yes, but I guess it must be persisted in the queue.
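
(If the message were still sitting in the queue, I'd expect something like the following to show it as either ready or unacknowledged; just a guess on my part:)

rabbitmqctl list_queues name messages_ready messages_unacknowledged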

Anyway, the server "feels" normal since the upgrade to the Git master version yesterday, even though inspecting the queues via the RabbitMQ tools doesn't work. Very odd ...

@timabbott

Error: unable to connect to node zulip@localhost: nodedown

I would expect this to totally break Zulip, so that's very odd. Maybe you somehow have two instances of RabbitMQ with different node names, only one of which is down? Check /var/lib/rabbitmq/mnesia/ for hostnames there.
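
A quick way to check (a sketch; it assumes the usual Zulip setup, where the node name is configured via rabbitmq-env.conf):

# the directories here are named after the node(s) that have run on this host
ls /var/lib/rabbitmq/mnesia/
# the node name rabbitmqctl is expected to use (assuming this file exists)
cat /etc/rabbitmq/rabbitmq-env.conf
# target the node from the error message explicitly instead of relying on the default
rabbitmqctl -n zulip@localhost status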

@pohutukawa

Just referenced this ticket from elsewhere, and noticed I had not given final feedback.

Now rabbitmqctl list_queues does report all the queues properly with their current sizes. I don't know why, but this problem seems resolved; the upgrade to Git master (as of early May) seems to have done the trick.
