
messages blocking a queue and causing runaway message_sender #14862

Closed
pohutukawa opened this issue May 4, 2020 · 9 comments

@pohutukawa

Recently our Zulip chat server (running on a DigitalOcean instance with 2 vCPUs and 4 GB of RAM) has started to behave strangely and no longer delivers its services reliably. It is a fully patched Ubuntu 18.04 instance with an upgraded Zulip 2.1.4 install, based on the DigitalOcean droplet image (originally Zulip 2.1.2).

In particular, one or more of the message_sender processors/workers go ballistic and each start consuming nearly 100% CPU. When that happens, some messages are no longer delivered, and after restarts, messages are delivered wildly out of order. The situation does not improve with service restarts (even repeated ones) or full server (host) reboots.

Things I have done over the last week:

  • Added a crontab entry to restart the server every night
  • Upgraded Zulip to 2.1.4
  • Re-installed RabbitMQ
  • Purged the message_sender queue on the RabbitMQ broker (roughly as sketched below)
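
For reference, the purge was done roughly like this (a sketch; the exact invocation may have differed slightly):

# drop all pending messages from the stuck queue, then confirm it is empty
rabbitmqctl purge_queue message_sender
rabbitmqctl list_queues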

Note: On initial install, we migrated our previous public Slack content to the Zulip server for better continuity.

Here the runaway message_sender worker (worker_num=1) is visible:

root@zulip:~# ps awux | grep message_sender
root      5827  0.0  0.0  14856  1008 pts/0    S+   22:54   0:00 grep --color=auto message_sender
zulip    19028  0.0  2.6 488668 107436 ?       S    04:25   0:04 python3 /home/zulip/deployments/current/manage.py process_queue --queue_name=message_sender --worker_num=0
zulip    19029 11.4  2.7 490628 109740 ?       Sl   04:25 126:29 python3 /home/zulip/deployments/current/manage.py process_queue --queue_name=message_sender --worker_num=1
zulip    19030  0.0  2.6 488584 107648 ?       S    04:25   0:04 python3 /home/zulip/deployments/current/manage.py process_queue --queue_name=message_sender --worker_num=2
zulip    19031  0.0  2.6 488640 107972 ?       S    04:25   0:04 python3 /home/zulip/deployments/current/manage.py process_queue --queue_name=message_sender --worker_num=3

When this happens, the message_sender queue apparently fills up more and more as the server is used:

root@zulip:~# rabbitmqctl list_queues
Listing queues
tornado_return  0
embedded_bots   0
missedmessage_email_senders     0
missedmessage_mobile_notifications      0
user_activity   12
invites 0
embed_links     0
email_senders   0
missedmessage_emails    0
error_reports   0
digest_emails   0
slow_queries    0
notify_tornado  0
outgoing_webhooks       0
message_sender  54
user_presence   0
deferred_work   0
email_mirror    0
user_activity_interval  0
signups 0

The queue doesn't drain; it retains an increasing number of messages that never seem to get pulled off. The DigitalOcean load graph shows that I purged the message_sender queue at 4:20 pm, which "normalised" things again; then at about 8:45 am the problem re-occurred:

[Screenshot: DigitalOcean CPU load graph]

How can this problem be solved? We really like Zulip (and don't want to go back to Slack or switch to something like MS Teams), but at the moment the system is not really usable due to this unreliability. How can I fix things?

@timabbott

@pohutukawa if you're up for it, I'd recommend upgrading to master with upgrade-zulip-from-git master. We've removed the entire message_sender queue system from Zulip as part of the work towards the Zulip 2.2 release, which eliminates this category of failure mode (one we haven't seen before).

https://zulip.readthedocs.io/en/latest/production/upgrade-or-modify.html#upgrading-from-a-git-repository

Another option is to stop the Zulip server and then use manage.py purge_queue message_sender to delete all the messages in that queue; that should also resolve your problem.
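
Roughly, either option looks like this (a sketch; the paths follow the standard deployment layout visible in your ps output, and using supervisorctl to stop the services is an assumption about your setup):

# Option 1: upgrade to current master (run as root)
/home/zulip/deployments/current/scripts/upgrade-zulip-from-git master

# Option 2: stop the Zulip services, purge the stuck queue, then start them again
supervisorctl stop all
su zulip -c '/home/zulip/deployments/current/manage.py purge_queue message_sender'
supervisorctl start all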

@zulipbot commented May 5, 2020

Hello @zulip/server-production members, this issue was labeled with the "area: production" label, so you may want to check it out!

@timabbott

The message_sender queue only processes messages sent via the Zulip web UI, so it's likely that someone in your organization sent a message containing a link or other content that is sending the markdown processor into an infinite loop. This should be hard, because we run the markdown processor under a 5-second timeout that kills the job if it runs for too long. I'd consider asking whether anyone in your organization is sending messages designed to DoS a parser; if that's indeed the issue, please report the problematic message content to security@zulipchat.com so we can investigate.
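
One way to check whether you're hitting this is to see whether a single worker stays pegged while the queue depth keeps climbing; a rough sketch using only stock tools:

# a message_sender worker stuck near 100% CPU is a candidate for a looping markdown job
ps aux | grep '[p]rocess_queue --queue_name=message_sender'
# the queue depth should normally hover near zero; here it keeps growing
watch -n 60 'rabbitmqctl list_queues | grep message_sender'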

@pohutukawa

Thanks @timabbott, I'll try the upgrade to Git master for now. Since I've already purged messages from the message_sender queue and got the same problem again (twice) within a day, purging doesn't seem to be a viable solution. Also, I imagine purging could mean message loss. Not so good.

As for the Markdown DoS scenario: I'd doubt that any of our guys would be doing that, at least intentionally. Having said that, though, is there a way to selectively access the content of a/the problematic message to get to the bottom of the problem?

@pohutukawa

The upgrade to Git master is done. Let's see what happens now.

Note: rabbitmqctl list_queues does not show any more queues at all. ("Error: unable to connect to node zulip@localhost: nodedown")

@timabbott

Note: rabbitmqctl list_queues does not show any more queues at all. ("Error: unable to connect to node zulip@localhost: nodedown")

That seems wrong; is your server actually working? But in any case, the message_sender queue should be gone.

As for the Markdown DoS scenario: I'd doubt that any of our guys would be doing that, at least intentionally. Having said that, though, is there a way to selectively access the content of a/the problematic message to get to the bottom of the problem?

No, because the message is failing to complete sending, so it's not stored on the server at all.

@pohutukawa

Note: rabbitmqctl list_queues does not show any more queues at all. ("Error: unable to connect to node zulip@localhost: nodedown")

That seems wrong; is your server actually working? But in any case, the message_sender queue should be gone.

The server is working, but the list_queues command does not show any queues. I have rebooted since, and this is the output:

# rabbitmqctl list_queues
Error: unable to connect to node zulip@localhost: nodedown

But the RabbitMQ broker is running:

# service rabbitmq-server status
● rabbitmq-server.service - RabbitMQ Messaging Server
   Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2020-05-01 03:29:13 UTC; 4 days ago
  Process: 925 ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait (code=exited, status=0/SU
 Main PID: 924 (rabbitmq-server)
    Tasks: 88 (limit: 4702)
   CGroup: /system.slice/rabbitmq-server.service
           ├─ 924 /bin/sh /usr/sbin/rabbitmq-server
           ├─ 990 /bin/sh /usr/lib/rabbitmq/bin/rabbitmq-server
           ├─1244 /usr/lib/erlang/erts-9.2/bin/epmd -daemon
           ├─1401 /usr/lib/erlang/erts-9.2/bin/beam.smp -W w -A 64 -P 1048576 -t 5000000 -stbt db
           ├─1515 erl_child_setup 65536
           ├─1632 inet_gethost 4
           └─1633 inet_gethost 4

May 04 15:11:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-56
May 05 05:19:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-27
May 05 05:23:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-73
May 05 05:29:01 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-19
May 05 05:34:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-86
May 05 13:23:45 zulip epmd[1244]: epmd: got partial packet only on file descriptor 6 (0)
May 05 15:32:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-19
May 05 15:43:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-20
May 05 22:27:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-87
May 05 23:10:02 zulip epmd[1244]: epmd: node name already occupied rabbitmq-cli-88

As for the Markdown DoS scenario: I'd doubt that any of our guys would be doing that, at least intentionally. Having said that, though, is there a way to selectively access the content of a/the problematic message to get to the bottom of the problem?

No, because the message is failing to complete sending, so it's not stored on the server at all.

Yes, but I guess it must be persisted in the queue.
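
(If the message were still sitting in the queue, I'd expect something like the following to show it as either ready or unacknowledged; just a guess on my part:)

rabbitmqctl list_queues name messages_ready messages_unacknowledged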

Anyway, the server "feels" normal since the upgrade to the Git master version yesterday, even though inspecting the queues via the RabbitMQ tools doesn't work. Very odd ...

@timabbott

Error: unable to connect to node zulip@localhost: nodedown

I would expect this to totally break Zulip, so that's very odd. Maybe you somehow have two instances of RabbitMQ with different node names, only one of which is down? Check /var/lib/rabbitmq/mnesia/ for hostnames there.
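
A quick way to check (a sketch; it assumes the usual Zulip setup, where the node name is configured via rabbitmq-env.conf):

# the directories here are named after the node(s) that have run on this host
ls /var/lib/rabbitmq/mnesia/
# the node name rabbitmqctl is expected to use (assuming this file exists)
cat /etc/rabbitmq/rabbitmq-env.conf
# target the node from the error message explicitly instead of relying on the default
rabbitmqctl -n zulip@localhost status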

@pohutukawa

Just referenced this ticket from elsewhere, and noticed I had not given final feedback.

Now rabbitmqctl list_queues does report all the queues properly with their current sizes. I don't know why, but this problem seems resolved; the upgrade to Git master (as of early May) seems to have done the trick.
