
Support channel.close in the middle of a publish #11210

Closed

Conversation

@carlhoerberg (Contributor) commented May 10, 2024

Proposed Changes

RabbitMQ isn't respecting section 4.2.6 of the spec: https://www.rabbitmq.com/resources/specs/amqp0-9-1.pdf

Note that any non-content frame explicitly marks the end of the content. Although the size of the content is well-known from the content header (and thus also the number of content frames), this allows for a sender to abort the sending of content without the need to close the channel.

Instead, any non-body frame arriving after a header frame currently closes the whole connection.

This PR allows channel.close frames to cancel an ongoing publish.
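
Schematically, on the wire (an illustrative trace with made-up sizes, not actual broker output):

    ch1: METHOD  basic.publish
    ch1: HEADER  class 60, body-size 1048576
    ch1: BODY    (first 131072 bytes of the content)
    ch1: METHOD  channel.close
         before this PR: connection error 505 UNEXPECTED_FRAME, all channels lost
         with this PR:   the partial publish is discarded, channel.close is handled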

Types of Changes

  • Bug fix (non-breaking change which fixes issue #11208, "Don't close connection if channel without finished publish is closed")
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

In AMQProxy we pool channels from many different downstream clients on a single upstream connection. If any of those downstream clients disconnects before finishing a full publish, the proxy tries to close the upstream channel, but RabbitMQ then closes the whole connection, and with it all the other channels belonging to all the other downstream clients.

@carlhoerberg (Contributor, Author)

Tests are coming. We may generalise this to all method frames, as per the spec; handling just channel.close was the minimal change that solves our case.

The incomplete method is thrown away, and the following non-content
method (e.g. channel.close) is processed normally.
@michaelklishin (Member)

This is a highly hypothetical scenario that hasn't come up since 2007 (that I recall).

I'd like to see evidence that it does not introduce any performance regressions.

@michaelklishin (Member)

The only scenario where a client can run into something like this is:

  • In one thread, start publishing
  • In another, concurrently close the channel

and sharing channels across threads like that is explicitly not supported in any of the clients. So it can only affect proxies where N client connections are multiplexed onto 1 actual upstream connection.

@gomoripeti (Contributor)

We will run performance tests to ensure there is no performance impact. I am confident the code can be written so that it only modifies the code path that would hit the frame error anyway, not the happy path.

In a follow-up commit I modified the code to not special-case the channel.close method, but to act the same way for any non-content method, according to the spec. However, that might lead to behaviour which some would consider unexpected or confusing: for example, when a publish is prematurely terminated by a queue.declare, the publish might seem to be lost while the channel is still up (see the schematic trace below). Therefore I can accept restricting the behaviour to channel.close, where the affected channel won't stay open anyway.
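
For illustration (a schematic trace, not actual output), the generalised behaviour would look like this on the wire:

    ch1: METHOD  basic.publish
    ch1: HEADER  class 60, body-size 5
    ch1: METHOD  queue.declare
         generalised rule: the publish is silently dropped, queue.declare
         is processed, and the channel stays open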

I wasn't sure where to place the tests, so I just put them in a new suite. For the record, the suite fails without this patch in the following way:

amqp_frames_SUITE > normal_publish (00:00.193)

amqp_frames_SUITE > publish_close_after_method_frame
    #1. {error,
            {{shutdown,
                 {connection_closing,
                     {server_initiated_close,505,
                         <<"UNEXPECTED_FRAME - expected content header for class 60, got non content header frame instead">>}}},
             {gen_server,call,
                 [<0.246.0>,{close,200,<<"Goodbye">>},infinity]}}}

amqp_frames_SUITE > publish_close_after_header_frame
    #1. {error,
            {{shutdown,
                 {connection_closing,
                     {server_initiated_close,505,
                         <<"UNEXPECTED_FRAME - expected content body, got non content body frame instead">>}}},
             {gen_server,call,
                 [<0.267.0>,{close,200,<<"Goodbye">>},infinity]}}}

amqp_frames_SUITE > publish_close_after_first_body_frame
    #1. {error,
            {{shutdown,
                 {connection_closing,
                     {server_initiated_close,505,
                         <<"UNEXPECTED_FRAME - expected content body, got non content body frame instead">>}}},
             {gen_server,call,
                 [<0.288.0>,{close,200,<<"Goodbye">>},infinity]}}}

@michaelklishin (Member) commented May 10, 2024

@gomoripeti I expect that this version, with try/catch, is a lot more likely to introduce regressions. The original version made more sense to me.

Like I said earlier, I don't see how a realistic client can inject a queue.declare between a basic.publish and the subsequent content header and body frames (without violating the explicitly documented concurrent-use rules).

If the scope of this continues to creep, we will close this and ask you to solve it in your proxy, which, if you think about it, is the only affected project.

@michaelklishin (Member)

As far as channel parser states go, this can be implemented as either of the following (the second option is sketched after the list):

  • A new transition for a state where a basic.publish frame was received
  • As a "last in the chain" function head (after the content header and body frames)
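
A hedged, self-contained sketch of the second option, assuming a process/2-style frame-assembly function with explicit states (all names here are illustrative, not the actual rabbitmq-server code):

    -module(frame_abort_sketch).
    -export([process/2]).

    %% Illustrative states (assumed, not the broker's real ones):
    %%   expecting_method
    %% | {expecting_header, Method}
    %% | {expecting_body, Method, Header, RemainingBytes}

    %% basic.publish carries content, so it moves to the header state.
    process({method, 'basic.publish', Args}, expecting_method) ->
        {ok, {expecting_header, {'basic.publish', Args}}};
    %% Content-less methods complete immediately.
    process({method, Name, Args}, expecting_method) ->
        {deliver, {Name, Args}, expecting_method};
    %% Normal content assembly (body accumulation elided for brevity).
    process({header, Header, 0}, {expecting_header, Method}) ->
        {deliver, {Method, Header}, expecting_method};
    process({header, Header, BodySize}, {expecting_header, Method}) ->
        {ok, {expecting_body, Method, Header, BodySize}};
    process({body, Chunk}, {expecting_body, Method, Header, Remaining})
      when byte_size(Chunk) >= Remaining ->
        {deliver, {Method, Header}, expecting_method};
    process({body, Chunk}, {expecting_body, Method, Header, Remaining}) ->
        {ok, {expecting_body, Method, Header, Remaining - byte_size(Chunk)}};
    %% The "last in the chain" heads: a channel.close arriving mid-publish
    %% drops the incomplete content, then is handled like any other method.
    process({method, 'channel.close', Args}, {expecting_header, _Aborted}) ->
        process({method, 'channel.close', Args}, expecting_method);
    process({method, 'channel.close', Args}, {expecting_body, _M, _H, _R}) ->
        process({method, 'channel.close', Args}, expecting_method);
    %% Any other out-of-sequence frame still hard-fails, as today.
    process(_Frame, _State) ->
        {error, unexpected_frame}.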

However, the reasoning in section 4.2.6 as quoted above makes no practical sense. In my 14 years as a RabbitMQ contributor, I have never seen a user try to "terminate content" like that.
Most users do not know that this is possible, and client libraries do not make it possible in practice without violating concurrent-use rules.

Sorry, but it's just too hard to justify a change like this, which will potentially affect every single user.

If an AMQP 0-9-1 proxy developer chooses to adhere to section 4.2.6, they can. Otherwise this is a clear example of overspecialization for one specific user, and we have previously rejected such PRs, even if they would potentially benefit more than the users of a single proxy (e.g. #10293).

@michaelklishin (Member) commented May 10, 2024

Sorry, the more I think about how this can be done, the more I find section 4.2.6 ridiculous and this change inevitably risky. We have deviated from the AMQP 0-9-1 spec in the past for practical implementation reasons and over questionable or ambiguous decisions in that frozen-in-time spec, and I have no problem with that.

@michaelklishin (Member) commented May 10, 2024

What a proxy could do to avoid an exception is something like this:

  • Knowing the size of the expected content (zero or more body frames), assume that any frame arriving in the middle of a publish should be delayed
  • After the publish is done, forward all those pending frames

This will not allow clients to "terminate a publish" early but it will be a safe thing to do.

To handle potentially missing body frames, introduce a "flushing timeout" after which all pending frames are sent. This would not be a safe thing to do, though, so the timeout would have to match the heartbeat or TCP keepalive timeout in use.
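
A minimal sketch of that delay-and-flush idea (the frame shapes, names and state layout are assumptions for illustration, not AMQProxy code; the flushing timeout is left out):

    -module(proxy_delay_sketch).
    -export([handle_frame/2]).

    %% State: idle
    %%      | {awaiting_header, Pending}
    %%      | {awaiting_body, RemainingBytes, Pending}
    %% Returns {FramesToForwardUpstream, NewState}.

    handle_frame({method, 'basic.publish', _} = F, idle) ->
        {[F], {awaiting_header, []}};
    handle_frame({header, _Props, 0} = F, {awaiting_header, Pending}) ->
        flush([F], Pending);                    %% empty body: publish done
    handle_frame({header, _Props, BodySize} = F, {awaiting_header, Pending}) ->
        {[F], {awaiting_body, BodySize, Pending}};
    handle_frame({body, Chunk} = F, {awaiting_body, Rem, Pending})
      when byte_size(Chunk) >= Rem ->
        flush([F], Pending);                    %% last body frame: publish done
    handle_frame({body, Chunk} = F, {awaiting_body, Rem, Pending}) ->
        {[F], {awaiting_body, Rem - byte_size(Chunk), Pending}};
    %% Anything else arriving mid-publish is delayed, not forwarded, so the
    %% upstream never sees an out-of-sequence frame.
    handle_frame(F, {awaiting_header, Pending}) ->
        {[], {awaiting_header, [F | Pending]}};
    handle_frame(F, {awaiting_body, Rem, Pending}) ->
        {[], {awaiting_body, Rem, [F | Pending]}};
    handle_frame(F, idle) ->
        {[F], idle}.

    %% Once the publish completes, forward the delayed frames in order.
    flush(Frames, Pending) ->
        {Frames ++ lists:reverse(Pending), idle}.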

It's not the most straightforward algorithm to implement, but I honestly don't see why cloudamqp/amqproxy#162 should be solved in RabbitMQ if the user very clearly states that directly connected clients are not affected.

@carlhoerberg (Contributor, Author)

Sorry, the PR was made more generic than intended. We only want channel.close to not trigger a connection.close. I agree that introducing "a non-content frame explicitly marks the end of the content" now can cause unintended consequences, but channel.close clearly shows the intention of the user.

It's not practical for a proxy or pool to buffer full messages; they can be tens of MB large (this is more common than you would think), which would explode memory usage when you have potentially hundreds of thousands of downstream clients.

You're wrong to assume that cloudamqp/amqproxy#162 is related; this is a different and very generic issue for anyone implementing a channel pool.

The branch has been updated with 137315e but it doesn't show up in the PR as it's already closed.

@gomoripeti (Contributor)

The generalisation was just an early experiment on my side, and by the time I pushed the tests I had realised it is a bad idea to interleave frames of multiple methods on the same channel. We don't have any use case for that. Let's forget about that, and about the spec.

But as Carl wrote, it's useful to keep connections long-lived, avoiding reconnects (which are expensive at the TCP, TLS and AMQP levels) and connection churn.

As a (purely theoretical) alternative, the broker could support a channel.close(ChannelId) method on channel 0, so that a client-side "channel manager" could terminate individual channels without keeping much state about them. But I think the proposed change in this branch has a much smaller diff and less impact.
