
Support channel.close in the middle of a publish #11210

Closed

Conversation

@carlhoerberg (Contributor) commented May 10, 2024

Proposed Changes

RabbitMQ isn't respecting section 4.2.6 of the spec: https://www.rabbitmq.com/resources/specs/amqp0-9-1.pdf

Note that any non-content frame explicitly marks the end of the content. Although the size of the content is well-known from the content header (and thus also the number of content frames), this allows for a sender to abort the sending of content without the need to close the channel.

Instead, any non-body frame arriving after a header frame currently closes the whole connection.

This PR allows channel.close frames to cancel an ongoing publish.
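
Schematically, on the wire (an illustrative trace with made-up sizes, not actual broker output):

    ch1: METHOD  basic.publish
    ch1: HEADER  class 60, body-size 1048576
    ch1: BODY    (first 131072 bytes of the content)
    ch1: METHOD  channel.close
         before this PR: connection error 505 UNEXPECTED_FRAME, all channels lost
         with this PR:   the partial publish is discarded, channel.close is handled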

Types of Changes

  • Bug fix (non-breaking change which fixes issue #11208, "Don't close connection if channel without finished publish is closed")
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

In AMQProxy we pool channels from many different downstream clients on a single upstream connection. If any of those downstream clients disconnects before finishing a full publish, the proxy tries to close the upstream channel, but RabbitMQ then closes the whole connection, and with it all the other channels belonging to all the other downstream clients.

@carlhoerberg (Contributor, Author)

Tests are coming. We may generalise this to all method frames, as per the spec; handling just channel.close was the minimal change that solves our case.

The incomplete method is thrown away, and the following non-content
method (e.g. channel.close) is processed normally.
@michaelklishin (Member)

This is a highly hypothetical scenario that hasn't come up since 2007 (that I recall).

I'd like to see evidence that it does not introduce any performance regressions.

@michaelklishin (Member)

The only scenario where a client can run into something like this is:

  • In one thread, start publishing
  • In another, concurrently close the channel

and sharing channels across threads like that is explicitly not supported in any of the clients. So it can only affect proxies where N client connections are multiplexed onto 1 actual upstream connection.

@gomoripeti (Contributor)

We will run performance tests to ensure there is no performance impact. I am confident the code can be written so that it only modifies the code path that would hit the frame error anyway, not the happy path.

In a follow-up commit I modified the code to not special-case the channel.close method, but to act the same way for any non-content method, according to the spec. However, that might lead to behaviour which some would consider unexpected or confusing: for example, when a publish is prematurely terminated by a queue.declare, the publish might seem to be lost while the channel is still up (see the schematic trace below). Therefore I can accept restricting the behaviour to channel.close, where the affected channel won't stay open anyway.
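
For illustration (a schematic trace, not actual output), the generalised behaviour would look like this on the wire:

    ch1: METHOD  basic.publish
    ch1: HEADER  class 60, body-size 5
    ch1: METHOD  queue.declare
         generalised rule: the publish is silently dropped, queue.declare
         is processed, and the channel stays open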

I wasn't sure where to place the tests, so I just put them in a new suite. For the record, the suite fails without this patch in the following way:

amqp_frames_SUITE > normal_publish (00:00.193)

amqp_frames_SUITE > publish_close_after_method_frame
    #1. {error,
            {{shutdown,
                 {connection_closing,
                     {server_initiated_close,505,
                         <<"UNEXPECTED_FRAME - expected content header for class 60, got non content header frame instead">>}}},
             {gen_server,call,
                 [<0.246.0>,{close,200,<<"Goodbye">>},infinity]}}}

amqp_frames_SUITE > publish_close_after_header_frame
    #1. {error,
            {{shutdown,
                 {connection_closing,
                     {server_initiated_close,505,
                         <<"UNEXPECTED_FRAME - expected content body, got non content body frame instead">>}}},
             {gen_server,call,
                 [<0.267.0>,{close,200,<<"Goodbye">>},infinity]}}}

amqp_frames_SUITE > publish_close_after_first_body_frame
    #1. {error,
            {{shutdown,
                 {connection_closing,
                     {server_initiated_close,505,
                         <<"UNEXPECTED_FRAME - expected content body, got non content body frame instead">>}}},
             {gen_server,call,
                 [<0.288.0>,{close,200,<<"Goodbye">>},infinity]}}}

@michaelklishin (Member) commented May 10, 2024

@gomoripeti I expect that this version, with try/catch, is a lot more likely to introduce regressions. The original version made more sense to me.

Like I said earlier, I don't see how a realistic client can inject a queue.declare between a basic.publish and the subsequent content header and body frames (without violating the explicitly documented concurrent-use rules).

If the scope of this continues to creep, we will close this and ask you to solve it in your proxy, which, if you think about it, is the only affected project.

@michaelklishin (Member)

As far as channel parser states go, this can be implemented as either of the following (the second option is sketched after the list):

  • A new transition for a state where a basic.publish frame was received
  • As a "last in the chain" function head (after the content header and body frames)
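
A hedged, self-contained sketch of the second option, assuming a process/2-style frame-assembly function with explicit states (all names here are illustrative, not the actual rabbitmq-server code):

    -module(frame_abort_sketch).
    -export([process/2]).

    %% Illustrative states (assumed, not the broker's real ones):
    %%   expecting_method
    %% | {expecting_header, Method}
    %% | {expecting_body, Method, Header, RemainingBytes}

    %% basic.publish carries content, so it moves to the header state.
    process({method, 'basic.publish', Args}, expecting_method) ->
        {ok, {expecting_header, {'basic.publish', Args}}};
    %% Content-less methods complete immediately.
    process({method, Name, Args}, expecting_method) ->
        {deliver, {Name, Args}, expecting_method};
    %% Normal content assembly (body accumulation elided for brevity).
    process({header, Header, 0}, {expecting_header, Method}) ->
        {deliver, {Method, Header}, expecting_method};
    process({header, Header, BodySize}, {expecting_header, Method}) ->
        {ok, {expecting_body, Method, Header, BodySize}};
    process({body, Chunk}, {expecting_body, Method, Header, Remaining})
      when byte_size(Chunk) >= Remaining ->
        {deliver, {Method, Header}, expecting_method};
    process({body, Chunk}, {expecting_body, Method, Header, Remaining}) ->
        {ok, {expecting_body, Method, Header, Remaining - byte_size(Chunk)}};
    %% The "last in the chain" heads: a channel.close arriving mid-publish
    %% drops the incomplete content, then is handled like any other method.
    process({method, 'channel.close', Args}, {expecting_header, _Aborted}) ->
        process({method, 'channel.close', Args}, expecting_method);
    process({method, 'channel.close', Args}, {expecting_body, _M, _H, _R}) ->
        process({method, 'channel.close', Args}, expecting_method);
    %% Any other out-of-sequence frame still hard-fails, as today.
    process(_Frame, _State) ->
        {error, unexpected_frame}.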

However, the reasoning in section 4.2.6 as quoted above makes no practical sense. In my 14 years as a RabbitMQ contributor, I have never seen a user try to "terminate content" like that.
Most users do not know that this is possible, and client libraries do not make it possible in practice without violating concurrent-use rules.

Sorry, but it's just too hard to justify a change like this, which will potentially affect every single user.

If an AMQP 0-9-1 proxy developer chooses to adhere to section 4.2.6, they can. Otherwise this is a clear example of overspecialization for one specific user, and we have previously rejected such PRs, even if they would potentially benefit more than the users of a single proxy (e.g. #10293).

@michaelklishin (Member) commented May 10, 2024

Sorry, the more I think about how this can be done, the more I find section 4.2.6 ridiculous and this change inevitably risky. We have deviated from the AMQP 0-9-1 spec in the past for practical implementation reasons and over questionable or ambiguous decisions in that frozen-in-time spec, and I have no problem with that.

@michaelklishin (Member) commented May 10, 2024

What a proxy could do to avoid an exception is something like this:

  • Knowing the size of the expected content (zero or more body frames), assume that any frame arriving in the middle of a publish should be delayed
  • After the publish is done, forward all those pending frames

This will not allow clients to "terminate a publish" early but it will be a safe thing to do.

To handle potentially missing body frames, introduce a "flushing timeout" after which all pending frames are sent. This would not be a safe thing to do, though, so the timeout would have to match the heartbeat or TCP keepalive timeout in use.
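
A minimal sketch of that delay-and-flush idea (the frame shapes, names and state layout are assumptions for illustration, not AMQProxy code; the flushing timeout is left out):

    -module(proxy_delay_sketch).
    -export([handle_frame/2]).

    %% State: idle
    %%      | {awaiting_header, Pending}
    %%      | {awaiting_body, RemainingBytes, Pending}
    %% Returns {FramesToForwardUpstream, NewState}.

    handle_frame({method, 'basic.publish', _} = F, idle) ->
        {[F], {awaiting_header, []}};
    handle_frame({header, _Props, 0} = F, {awaiting_header, Pending}) ->
        flush([F], Pending);                    %% empty body: publish done
    handle_frame({header, _Props, BodySize} = F, {awaiting_header, Pending}) ->
        {[F], {awaiting_body, BodySize, Pending}};
    handle_frame({body, Chunk} = F, {awaiting_body, Rem, Pending})
      when byte_size(Chunk) >= Rem ->
        flush([F], Pending);                    %% last body frame: publish done
    handle_frame({body, Chunk} = F, {awaiting_body, Rem, Pending}) ->
        {[F], {awaiting_body, Rem - byte_size(Chunk), Pending}};
    %% Anything else arriving mid-publish is delayed, not forwarded, so the
    %% upstream never sees an out-of-sequence frame.
    handle_frame(F, {awaiting_header, Pending}) ->
        {[], {awaiting_header, [F | Pending]}};
    handle_frame(F, {awaiting_body, Rem, Pending}) ->
        {[], {awaiting_body, Rem, [F | Pending]}};
    handle_frame(F, idle) ->
        {[F], idle}.

    %% Once the publish completes, forward the delayed frames in order.
    flush(Frames, Pending) ->
        {Frames ++ lists:reverse(Pending), idle}.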

It's not the most straightforward algorithm to implement, but I honestly don't see why cloudamqp/amqproxy#162 should be solved in RabbitMQ if the user very clearly states that directly connected clients are not affected.

@carlhoerberg (Contributor, Author)

Sorry, the PR was made more generic than intended. We only want channel.close to not trigger a connection.close. I agree that introducing "a non-content frame explicitly marks the end of the content" now can cause unintended consequences, but channel.close clearly shows the intention of the user.

It's not practical for a proxy or pool to buffer full messages; they can be tens of MB large (this is more common than you would think), which would explode memory usage when you have potentially hundreds of thousands of downstream clients.

You're wrong to assume that cloudamqp/amqproxy#162 is related; this is a different and very generic issue for anyone implementing a channel pool.

The branch has been updated with 137315e but it doesn't show up in the PR as it's already closed.

@gomoripeti (Contributor)

The generalisation was just an early experiment on my side, and by the time I pushed the tests I had realised it is a bad idea to interleave frames of multiple methods on the same channel. We don't have any use case for that. Let's forget about that, and about the spec.

But as Carl wrote, it's useful to keep connections long-lived, avoiding reconnects (which are expensive at the TCP, TLS and AMQP levels) and connection churn.

As a (purely theoretical) alternative, the broker could support a channel.close(ChannelId) method on channel 0, so that a client-side "channel manager" could terminate individual channels without keeping much state about them. But I think the proposed change in this branch has a much smaller diff and less impact.
