[🐛 BUG]: rabbitmq_redial error causes redial/retry logic to prematurely exit #1857

haydenm315 · 2024-02-01T22:17:54Z

No duplicates 🥲.

I have searched for a similar issue in our bug tracker and didn't find any solutions.

What happened?

I'm using roadrunner with rabbitmq amqp and jobs to run a pipeline. If there's an interruption in the rabbitmq broker such as restarting it, roadrunner fails to re-establish the connecftion to the rabbitmq broker queue.

I see one redialing attempt when the broker is restarted, however it's never reconnected.

Version (rr --version)

rr version 2023.1.5 (build time: 2023-06-08T14:45:05+0000, go1.20.4), OS: linux, arch: amd64

How to reproduce the issue?

amqp:
  addr: amqp://${RABBIT_USERNAME}:${RABBIT_PASSWORD}@${RABBIT_ADDRESS}:5672

rpc:
  listen: tcp://127.0.0.11:6001

jobs:
  num_pollers: ${ROADRUNNER_NUM_POLLERS}
  timeout: 15
  pipeline_size: 100000
  consume: ["widgets"]
  pool:
    num_workers: ${ROADRUNNER_NUM_WORKERS}
    allocate_timeout: 30s
    destroy_timeout: 30s

  pipelines:
    snapshot:
      driver: amqp
      config: # NEW in 2.7
        priority: 10
        prefetch: ${ROADRUNNER_PREFETCH}
        consume_all: true
        queue: "default"
        reserve_timeout: 5s
        delete_queue_on_stop: true
        redial_timeout: 99999999
        queue_headers:
          x-max-priority: 10

server:
  command: "php somejob.php"
  relay: pipes

logs:
  mode: ${ROADRUNNER_LOGGING}
  encoding: json
  level: ${ROADRUNNER_LOGGING_LEVEL}

To reproduce, start up a rabbitmq broker and then the roadrunner.
Verify the roadrunner is connected to the broker and is ready for work
./rr workers

Output will include the following at the bottom.

Jobs of [jobs]:
+--------+----------+--------+---------+--------+---------+----------+
| STATUS | PIPELINE | DRIVER |  QUEUE  | ACTIVE | DELAYED | RESERVED |
+--------+----------+--------+---------+--------+---------+----------+
| READY  | snapshot | amqp   | default | 0      | 0       | 0        |
+--------+----------+--------+---------+--------+---------+----------+

restart your rabbitmq broker
roadrunner will output some errors as pasted below in the log output section.. After these errors are output, there won't be any activity.

Looking at the workers with ./rr workers won't have any job status, even though the rabbitmq broker is available.
Only after restarting the roadrunner process will the amqp connect restore.

Relevant log output

roadrunner errors when disconnected from broker.

{"level":"error","ts":1706825285055192507,"logger":"amqp","msg":"publish channel close","error":"Exception (504) Reason: \"channel/connection is not open\""}
{"level":"error","ts":1706825285055457503,"logger":"amqp","msg":"state channel close","error":"Exception (504) Reason: \"channel/connection is not open\""}
{"level":"error","ts":1706825285055487649,"logger":"amqp","msg":"consume channel close","error":"Exception (504) Reason: \"channel/connection is not open\""}
{"level":"error","ts":1706825285055505906,"logger":"amqp","msg":"amqp connection closed","error":"Exception (504) Reason: \"channel/connection is not open\""}
{"level":"error","ts":1706825285055555699,"logger":"amqp","msg":"pipeline connection was closed, redialing","error":"Exception (320) Reason: \"CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'\"","pipeline":"snapshot","driver":"amqp","start":1706825285055551243}

The text was updated successfully, but these errors were encountered:

rustatian · 2024-02-01T22:24:32Z

Hey @haydenm315 👋 Thanks for the report 👍
I forgot to ask you about the RR version. Could you please try the latest one (2023.3.9)?

haydenm315 · 2024-02-07T18:22:17Z

Just tried with 2023.3.10 and there's messages which indicate an attempt to recover, however I still don't see the queue listed when doing ./rr workers, after bouncing the rabbitmq service to simulate an upgrade or service interruption. Looks like there's an attempt to re-establish the connection now.

Upon making rabbitmq unavailable, roadrunner logs the following.

roadrunner-1  | {"level":"DEBUG","ts":"2024-02-07T18:14:21+0000","logger":"amqp        ","msg":"delivery channel was closed, leaving the rabbit listener"}
roadrunner-1  | {"level":"DEBUG","ts":"2024-02-07T18:14:21+0000","logger":"amqp        ","msg":"number of listeners","listeners":0}
roadrunner-1  | {"level":"DEBUG","ts":"2024-02-07T18:14:21+0000","logger":"amqp        ","msg":"exited from redialer"}
roadrunner-1  | {"level":"ERROR","ts":"2024-02-07T18:14:21+0000","logger":"amqp        ","msg":"amqp connection closed","error":"Exception (504) Reason: \"channel/connection is not open\""}
roadrunner-1  | {"level":"ERROR","ts":"2024-02-07T18:14:21+0000","logger":"amqp        ","msg":"pipeline connection was closed, redialing","error":"Exception (320) Reason: \"CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'\"","pipeline":"snapshot","driver":"amqp","start":"2024-02-07T18:14:21+0000"}

After making rabbitmq available again, roadrunner logs the following

roadrunner-1  | {"level":"INFO","ts":"2024-02-07T18:14:49+0000","logger":"amqp        ","msg":"rabbitmq dial was succeed. trying to redeclare queues and subscribers"}
roadrunner-1  | {"level":"INFO","ts":"2024-02-07T18:14:49+0000","logger":"amqp        ","msg":"queues and subscribers was redeclared successfully"}
roadrunner-1  | {"level":"INFO","ts":"2024-02-07T18:14:49+0000","logger":"amqp        ","msg":"connection was successfully restored","pipeline":"snapshot","driver":"amqp","start":"2024-02-07T18:14:21+0000","elapsed":"27.519989644s"}
roadrunner-1  | {"level":"INFO","ts":"2024-02-07T18:14:49+0000","logger":"amqp        ","msg":"redialer restarted"}
roadrunner-1  | {"level":"DEBUG","ts":"2024-02-07T18:14:49+0000","logger":"amqp        ","msg":"exited from redialer"}
roadrunner-1  | {"level":"ERROR","ts":"2024-02-07T18:14:49+0000","logger":"amqp        ","msg":"pipeline connection was closed, redialing","error":"Exception (320) Reason: \"CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'\"","pipeline":"snapshot","driver":"amqp","start":"2024-02-07T18:14:49+0000"}
roadrunner-1  | {"level":"INFO","ts":"2024-02-07T18:14:49+0000","logger":"amqp        ","msg":"rabbitmq dial was succeed. trying to redeclare queues and subscribers"}
roadrunner-1  | {"level":"INFO","ts":"2024-02-07T18:14:49+0000","logger":"amqp        ","msg":"rabbitmq dial was succeed. trying to redeclare queues and subscribers"}

rustatian · 2024-02-07T19:20:51Z

Yeah, so, that means that RR successfully redialed.

@haydenm315 Is that what you wanted to see, am I right?

haydenm315 · 2024-02-07T19:56:49Z

Seems closer than the older release we were using. Still requires a restart of roadrunner to get it to accept jobs and have a ready status. Unless I restart roadrunner after rabbitmq comes back, I don't see jobs like below in the rr workers output.

Jobs of [jobs]:
+--------+----------+--------+---------+--------+---------+----------+
| STATUS | PIPELINE | DRIVER |  QUEUE  | ACTIVE | DELAYED | RESERVED |
+--------+----------+--------+---------+--------+---------+----------+
| READY  | snapshot | amqp   | default | 0      | 0       | 0        |
+--------+----------+--------+---------+--------+---------+----------+

Maybe something's not quite right with "trying to redeclare queues and subscribers".

rustatian · 2024-02-07T19:58:56Z

I'm not sure, because I have a test especially for the case when rabbitmq is down. Have you tried to push the job after redial?

rustatian · 2024-02-07T20:01:13Z

@haydenm315 In your configuration I also see, that you're consuming pipeline with the name widgets, but you don't have such a pipeline in your pipelines configuration.

haydenm315 · 2024-02-07T20:35:24Z

I renamed the queue in the previous example. It's not a queue naming issue. Jobs are getting pushed because when I restart the roadrunner server the number of Execs goes up.

rustatian · 2024-02-07T20:49:15Z

But, it's not a queue, it's a pipeline. Could you please share the latest configuration you have?

haydenm315 · 2024-02-07T22:22:30Z

Here the .rr.yaml file These are the env var values in the cfg file

AMQP_REDIAL_TIMEOUT=99999999
ROADRUNNER_NUM_POLLERS=200
ROADRUNNER_NUM_WORKERS=50
ROADRUNNER_PREFETCH=200

rr.yml.txt

rustatian · 2024-02-08T07:42:19Z

Thanks 👍
Few notes regarding your configuration:

The number of pollers should not be significantly bigger than the number of workers. Poller is a thread inside RR which tries to obtain a job from the internal priority queue. More pollers -> more memory will be used for the thread stack, etc. Try 60 pollers and compare with the current value (200).
There is no reload plugin anymore since 2023.1.0 version.

haydenm315 · 2024-02-08T22:36:25Z

Thanks for the recommendations

rustatian · 2024-02-13T08:13:12Z

Hey @haydenm315 👋

I double-checked that behavior. Let me summarize what I did:

Started RR, with jobs configuration and 1 pipeline.
Shut down RabbitMQ docker. RR started the redial process with exponential backoff.
I used a huge redial timeout. Waited for a minute.
Then I started RabbitMQ docker.
RoadRunner successfully reconnected to it.

Keep in mind, that RR uses an exponential backoff mechanism to redial. It won't reconnect immediately, but after some exponential step. For example: first redial will happen after 1sec after loosing a connection, next - 2s -> 5s -> 15s -> 25s ...

rustatian · 2024-02-13T08:15:43Z

I also noticed, that your amqp configuration contains some options from the other drivers, e.g.: reserve_timeout is from the beanstalk driver.

rustatian · 2024-02-23T13:53:59Z

Closing as answered. You are welcome to comment here if you still have questions 😃

haydenm315 added B-bug Bug: bug, exception F-need-verification labels Feb 1, 2024

haydenm315 assigned rustatian Feb 1, 2024

rustatian closed this as completed Feb 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[🐛 BUG]: rabbitmq_redial error causes redial/retry logic to prematurely exit #1857

[🐛 BUG]: rabbitmq_redial error causes redial/retry logic to prematurely exit #1857

haydenm315 commented Feb 1, 2024 •

edited by rustatian

rustatian commented Feb 1, 2024 •

edited

haydenm315 commented Feb 7, 2024 •

edited by rustatian

rustatian commented Feb 7, 2024 •

edited

haydenm315 commented Feb 7, 2024 •

edited by rustatian

rustatian commented Feb 7, 2024

rustatian commented Feb 7, 2024

haydenm315 commented Feb 7, 2024

rustatian commented Feb 7, 2024

haydenm315 commented Feb 7, 2024 •

edited

rustatian commented Feb 8, 2024 •

edited

haydenm315 commented Feb 8, 2024

rustatian commented Feb 13, 2024

rustatian commented Feb 13, 2024

rustatian commented Feb 23, 2024

[🐛 BUG]: rabbitmq_redial error causes redial/retry logic to prematurely exit #1857

[🐛 BUG]: rabbitmq_redial error causes redial/retry logic to prematurely exit #1857

Comments

haydenm315 commented Feb 1, 2024 • edited by rustatian

No duplicates 🥲.

What happened?

Version (rr --version)

How to reproduce the issue?

Relevant log output

rustatian commented Feb 1, 2024 • edited

haydenm315 commented Feb 7, 2024 • edited by rustatian

rustatian commented Feb 7, 2024 • edited

haydenm315 commented Feb 7, 2024 • edited by rustatian

rustatian commented Feb 7, 2024

rustatian commented Feb 7, 2024

haydenm315 commented Feb 7, 2024

rustatian commented Feb 7, 2024

haydenm315 commented Feb 7, 2024 • edited

rustatian commented Feb 8, 2024 • edited

haydenm315 commented Feb 8, 2024

rustatian commented Feb 13, 2024

rustatian commented Feb 13, 2024

rustatian commented Feb 23, 2024

haydenm315 commented Feb 1, 2024 •

edited by rustatian

rustatian commented Feb 1, 2024 •

edited

haydenm315 commented Feb 7, 2024 •

edited by rustatian

rustatian commented Feb 7, 2024 •

edited

haydenm315 commented Feb 7, 2024 •

edited by rustatian

haydenm315 commented Feb 7, 2024 •

edited

rustatian commented Feb 8, 2024 •

edited