osiris_replica_reader: Stop with `normal` if the leader is already gone during `init/1` #162

dumbbell · 2024-05-22T15:52:55Z

Why

In the context of RabbitMQ, if a stream queue is deleted right after being declared, there is a chance that some Osiris processes might not be ready yet at the time the queue is deleted.

In particular, the `osiris_replica_reader` process sets up a monitor on the given leader (an `osiris_writer` process in the context of a RabbitMQ stream queue) during its `init/1`, and that leader might already be stopped by then.

When this happens, here is the crash that is logged:

[error] <0.1548.0> ** Generic server <0.1548.0> terminating
[error] <0.1548.0> ** Last message in was {'DOWN',#Ref<0.1118981177.1281884162.97904>,process,
[error] <0.1548.0>                                <0.1535.0>,noproc}
[error] <0.1548.0> ** When Server state == {state,
[error] <0.1548.0>                          {osiris_log,
[error] <0.1548.0>                           {cfg,
[error] <0.1548.0>                            ".../__delete_queue_1716383944197847531",
[error] <0.1548.0>                            <<"__delete_queue_1716383944197847531">>,500000000,
[error] <0.1548.0>                            256000,#{},[],
[error] <0.1548.0>                            {write_concurrency,
[error] <0.1548.0>                             #Ref<0.1118981177.1282015234.97903>},
[error] <0.1548.0>                            {osiris_replica_reader,
[error] <0.1548.0>                             {resource,<<"/">>,queue,<<"delete_queue">>},
[error] <0.1548.0>                             {127,0,0,1},
[error] <0.1548.0>                             6489},
[error] <0.1548.0>                            #Fun<osiris_writer.0.78287785>,
[error] <0.1548.0>                            #Ref<0.1118981177.1282015234.97826>,16},
[error] <0.1548.0>                           {read,data,0,tcp,all,8,undefined},
[error] <0.1548.0>                           undefined,undefined,
[error] <0.1548.0>                           {file_descriptor,prim_file,
[error] <0.1548.0>                            #{handle => #Ref<0.1118981177.1282015238.91045>,
[error] <0.1548.0>                              owner => <0.1548.0>,
[error] <0.1548.0>                              r_buffer => #Ref<0.1118981177.1282015234.97902>,
[error] <0.1548.0>                              r_ahead_size => 0}}},
[error] <0.1548.0>                          <<"__delete_queue_1716383944197847531">>,tcp,
[error] <0.1548.0>                          #Port<0.84>,<33363.1916.0>,<0.1535.0>,
[error] <0.1548.0>                          #Ref<0.1118981177.1281884162.97904>,
[error] <0.1548.0>                          {write_concurrency,
[error] <0.1548.0>                           #Ref<0.1118981177.1282015234.97903>},
[error] <0.1548.0>                          {osiris_replica_reader,
[error] <0.1548.0>                           {resource,<<"/">>,queue,<<"delete_queue">>},
[error] <0.1548.0>                           {127,0,0,1},
[error] <0.1548.0>                           6489},
[error] <0.1548.0>                          -1,0}
[error] <0.1548.0> ** Reason for termination ==
[error] <0.1548.0> ** noproc

That is because the `osiris_replica_reader` process receives a `DOWN` message with the `noproc` reason from its monitor on the leader, and reuses that reason as its own exit reason. Because `noproc` is an abnormal exit reason, a crash is logged.
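
To illustrate where `noproc` comes from, here is a minimal sketch that is independent of Osiris. The module name `noproc_demo` and the function `monitor_dead_process/0` are made up for this example. It relies only on the documented behaviour of `erlang:monitor/2`: monitoring a process that no longer exists does not fail, it delivers a `DOWN` message right away with `noproc` as the reason.

```erlang
-module(noproc_demo).
-export([monitor_dead_process/0]).

%% Hypothetical helper, not part of Osiris.
%% Spawns a short-lived process, waits until it has exited, then
%% monitors it and returns the reason carried by the 'DOWN' message.
monitor_dead_process() ->
    {Pid, Ref0} = spawn_monitor(fun() -> ok end),
    %% Make sure the process is really gone before monitoring it again.
    receive
        {'DOWN', Ref0, process, Pid, normal} -> ok
    end,
    %% The pid no longer refers to a live process: the runtime sends
    %% the 'DOWN' message immediately instead of raising an error.
    Ref = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref, process, Pid, Reason} -> Reason
    after 1000 ->
        timeout
    end.
```

A `gen_server` that forwards this reason unchanged in a `{stop, Reason, State}` return ends up terminating with `noproc`, which is the situation shown in the log above.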

How

There is no reason to log such a crash when the process tree is being shut down concurrently. osiris_replica_reader can terminate with a normal reason.

That is what this patch does: if the leader exit reason is noproc, it terminates with the normal reason instead.
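
The idea can be sketched with a small standalone `gen_server`. This is not the code from `src/osiris_replica_reader.erl`: the module name `leader_watcher`, the `start/1` function and the record fields are assumptions made for illustration, and the merged patch may be structured differently.

```erlang
-module(leader_watcher).
-behaviour(gen_server).

%% Hypothetical module, for illustration only.
-export([start/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-record(state, {leader_pid :: pid(),
                leader_ref :: reference()}).

start(LeaderPid) ->
    gen_server:start(?MODULE, LeaderPid, []).

init(LeaderPid) ->
    %% If the leader is already gone, this still succeeds and a
    %% 'DOWN' message with reason `noproc` is delivered right away.
    Ref = erlang:monitor(process, LeaderPid),
    {ok, #state{leader_pid = LeaderPid, leader_ref = Ref}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% The leader was already gone when it was monitored: stop quietly.
handle_info({'DOWN', Ref, process, _Pid, noproc},
            #state{leader_ref = Ref} = State) ->
    {stop, normal, State};
%% Any other leader exit reason is propagated unchanged.
handle_info({'DOWN', Ref, process, _Pid, Reason},
            #state{leader_ref = Ref} = State) ->
    {stop, Reason, State};
handle_info(_Info, State) ->
    {noreply, State}.
```

Starting `leader_watcher` with the pid of a process that has already exited makes it stop with `normal`, so no crash report is logged for it.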

src/osiris_replica_reader.erl

kjnilsson

formatting

dumbbell added the enhancement label May 22, 2024

dumbbell requested a review from kjnilsson May 22, 2024 15:52

dumbbell self-assigned this May 22, 2024

dumbbell marked this pull request as ready for review May 22, 2024 16:27

michaelklishin approved these changes May 22, 2024

dumbbell force-pushed the handle-premature-leader-exit-in-osiris_replica_reader branch from 4f0489e to 5c3c10d Compare May 23, 2024 07:53

kjnilsson reviewed Jun 24, 2024

src/osiris_replica_reader.erl (outdated, resolved)

kjnilsson requested changes Jun 24, 2024

dumbbell force-pushed the handle-premature-leader-exit-in-osiris_replica_reader branch from 5c3c10d to 8a4ec95 Compare June 24, 2024 14:27

kjnilsson merged commit 3cc00e8 into main Jun 24, 2024
8 checks passed

kjnilsson deleted the handle-premature-leader-exit-in-osiris_replica_reader branch June 24, 2024 14:37
