Right now, the tornado server is a single point of failure and a bottleneck for scaling out Zulip. I took a quick look at the code, and it shouldn't be terribly hard to support sharding if we modify TornadoQueueClient to talk to RabbitMQ via a fanout exchange (pub/sub).
Messages on the "notify_tornado" queue can be broadcast to all tornado shards, and the rest of the code should work as is. Future upgrades can perhaps optimize the routing to a subset of tornado shards.
Messages on the "tornado_return" queue can carry routing keys matched against per-shard binding keys, to make sure replies return to the correct tornado shard.
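To make the topology concrete, here's a minimal in-memory model of the two patterns described above: fanout for "notify_tornado" and per-shard routing for "tornado_return". This is purely illustrative; the class and key names (`ExchangeModel`, `tornado_return_<shard>`) are made up for the sketch, and the real version would declare RabbitMQ exchanges in TornadoQueueClient.

```python
# Hypothetical in-memory model of the proposed exchange topology; the
# real implementation would declare RabbitMQ exchanges instead.

class ExchangeModel:
    """Models a fanout exchange for "notify_tornado" and a direct
    exchange for "tornado_return" routed by shard ID."""

    def __init__(self, shard_ids):
        # One queue per tornado shard on each exchange.
        self.notify_queues = {shard: [] for shard in shard_ids}
        self.return_queues = {shard: [] for shard in shard_ids}

    def publish_notify(self, message):
        # Fanout: every shard's queue gets a copy of the event.
        for queue in self.notify_queues.values():
            queue.append(message)

    def publish_return(self, routing_key, message):
        # Direct: a binding key like "tornado_return_1" matches the
        # routing key, so the reply lands only on the originating shard.
        shard = routing_key.removeprefix("tornado_return_")
        self.return_queues[shard].append(message)


broker = ExchangeModel(["0", "1", "2"])
broker.publish_notify({"event": "message"})           # all shards see it
broker.publish_return("tornado_return_1", {"ok": True})  # only shard 1
```

A future optimization could replace the fanout with a topic exchange so "notify_tornado" events route to only the subset of shards that care.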
This will likely require session affinity in the nginx config, although with #395, perhaps this is not required if we have a safe way of exposing the handler_id to clients?
Yep, that sort of broadcast to all the shards with a future upgrade to do a subset is my current plan for how to handle multiple queue server shards.
With #395, I think we can do what we want without nginx session affinity if we namespace queue IDs to include the ID of the queue server. Since the queue ID is just an opaque string to clients, that upgrade could be backwards-compatible.
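A sketch of that namespacing idea, under the assumption we just prefix the shard ID onto the random queue ID (the function names here are hypothetical, not Zulip's actual API):

```python
# Hypothetical sketch: embed the tornado shard ID in the event queue ID.
# Clients already treat queue IDs as opaque strings, so adding a prefix
# is backwards-compatible; only the server needs to parse it.
import secrets


def generate_queue_id(shard_id: int) -> str:
    # Prefix the owning shard's ID onto a random token.
    return f"{shard_id}:{secrets.token_hex(16)}"


def shard_for_queue_id(queue_id: str) -> int:
    # Server side: recover which tornado shard owns this event queue,
    # so requests can be routed without nginx session affinity.
    return int(queue_id.split(":", 1)[0])
```

Old-style queue IDs without a prefix would need a fallback (e.g. default to shard 0) during the transition.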
If you're interested in accelerating progress on scalability, #395 is currently stuck waiting for someone to code review it; I think it should be possible for someone new to the codebase to help, since the whole Tornado service is only a few thousand lines of code (I'm planning to do a refactor after #395 lands to make that much clearer from the code organization...).
Working on this.