@ansd ansd commented Jun 2, 2025

Test case `tcp_back_pressure_rabbitmq_internal_flow_quorum_queue` succeeds
consistently locally on macOS and fails consistently in CI since 30 May
2025.

CI also shows failures of `tcp_back_pressure_rabbitmq_internal_flow_classic_queue`, albeit much more rarely.

This test case succeeds in CI when using ubuntu-22.04 but fails with ubuntu-24.04.
Even before 30 May 2025, ubuntu-24.04 was used. However, the GitHub runner image
version was updated from 20250511.1.0 to 20250527.1.0, which presumably
started causing this test to fail.
This hypothesis cannot be validated because the GitHub Actions workflow
YAML file doesn't provide a means to pin this version.

File `images/ubuntu/Ubuntu2404-Readme.md` in actions/runner-images@ubuntu24/20250511.1...ubuntu24/20250527.1 shows the diff.
The most notable changes are probably the kernel version change from 6.11.0-1013-azure to 6.11.0-1015-azure and some changes to file `images/ubuntu/scripts/build/configure-environment.sh`.

There seem to be no RabbitMQ-related changes causing this test to fail,
because the test also fails with an older RabbitMQ version on the new runner
version 20250527.1.0.

Neither `meck` nor `inet:setopts(Socket, [{active, once}])` causes the
test failure, because the test also fails with the former approach of
`erlang:suspend_process/1` and `erlang:resume_process/1`.
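
For context, here is a minimal sketch of the two ways the CT case can stall the client; the module name, function names, and the 30-second stall duration are illustrative assumptions, not the suite's actual code:

```erlang
-module(backpressure_sketch).
-export([pause_via_suspend/1, pause_via_setopts/1]).

%% Former approach: suspend the client's connection reader process so it
%% stops pulling data off the socket (debug-only BIFs).
pause_via_suspend(ReaderPid) ->
    true = erlang:suspend_process(ReaderPid),
    timer:sleep(30000),
    true = erlang:resume_process(ReaderPid).

%% Current approach: take the socket out of active mode so no further
%% reads happen and the client's TCP receive window fills up, then
%% resume with {active, once}.
pause_via_setopts(Socket) ->
    ok = inet:setopts(Socket, [{active, false}]),
    timer:sleep(30000),
    ok = inet:setopts(Socket, [{active, once}]).
```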

The test fails due to the following timeout in the writer process on the
server:
```
** Last message in was {'$gen_cast',
                           {send_command,<0.760.0>,0,
                               {'v1_0.transfer',
                                   {uint,3},
                                   {uint,2211},
                                   {binary,<<0,0,8,162>>},
                                   {uint,0},
                                   true,undefined,undefined,undefined,
                                   undefined,undefined,undefined},
                               <<"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx">>}}
** When Server state == #{pending => 3510,socket => #Port<0.49>,
                          reader => <0.755.0>,
                          monitored_sessions => [<0.760.0>],
                          pending_size => 3510}
** Reason for termination ==
** {{writer,send_failed,timeout},
    [{rabbit_amqp_writer,flush,1,
                         [{file,"src/rabbit_amqp_writer.erl"},{line,250}]},
     {rabbit_amqp_writer,handle_cast,2,
                         [{file,"src/rabbit_amqp_writer.erl"},{line,106}]},
     {gen_server,try_handle_cast,3,[{file,"gen_server.erl"},{line,2371}]},
     {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,2433}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,329}]}]}
```

Even after the CT test case resumes consumption, the server still times
out writing to the socket. The Wireshark capture validates that there is
no logic error on the server: the TCP window of the receiving client stays
full for 30 seconds, the client doesn't receive fast enough, and hence the
server times out writing to the socket and closes the connection.
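
Schematically, this is ordinary `gen_tcp` send-timeout behaviour. The sketch below illustrates that mechanism under an assumed `{send_timeout, 30000}` socket option; it is not the actual `rabbit_amqp_writer` code:

```erlang
%% Illustration only: how a full peer TCP window turns into
%% {writer,send_failed,timeout}. Assumes the socket was opened with
%% {send_timeout, 30000} (an assumption, not verified against the
%% server's configuration).
flush_sketch(Socket, PendingIoData) ->
    case gen_tcp:send(Socket, PendingIoData) of
        ok ->
            ok;
        {error, Reason} ->
            %% With send_timeout set, Reason is 'timeout' when the data
            %% cannot be handed off within the configured 30 seconds.
            exit({writer, send_failed, Reason})
    end.
```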

The most important test expectation kept in place is that the server
won't send all the messages if the client can't receive them fast
enough.
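
A hypothetical shape of that remaining assertion (`NumReceived` and `NumSent` are assumed bindings for illustration, not the suite's actual variables):

```erlang
-include_lib("stdlib/include/assert.hrl").

%% The server must not have delivered the full message set while the
%% client was stalled.
assert_backpressure(NumReceived, NumSent) ->
    ?assert(NumReceived < NumSent).
```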

@ansd ansd force-pushed the backpressure-flake branch from 39f7d26 to d09523c on June 2, 2025 17:34
@ansd ansd force-pushed the backpressure-flake branch from d09523c to 0c391a5 on June 2, 2025 17:35
@ansd ansd marked this pull request as ready for review June 2, 2025 18:30
@michaelklishin michaelklishin added this to the 4.2.0 milestone Jun 2, 2025
@michaelklishin michaelklishin merged commit 797e543 into main Jun 2, 2025
557 of 560 checks passed
@michaelklishin michaelklishin deleted the backpressure-flake branch June 2, 2025 19:00
michaelklishin added a commit that referenced this pull request Jun 2, 2025
Remove AMQP backpressure test expectation (backport #14007)