osiris_replica_reader: Stop with normal if the leader is already gone during init/1 #162

Merged

@dumbbell dumbbell commented May 22, 2024

Why

In the context of RabbitMQ, if a stream queue is deleted right after being declared, there is a chance that some Osiris processes might not be ready yet at the time the queue is deleted.

In particular, the osiris_replica_reader process monitors the given leader (an osiris_writer process in the context of a RabbitMQ stream queue) during its init/1 and that process might be stopped already.

When this happens, here is the crash that is logged:

[error] <0.1548.0> ** Generic server <0.1548.0> terminating
[error] <0.1548.0> ** Last message in was {'DOWN',#Ref<0.1118981177.1281884162.97904>,process,
[error] <0.1548.0>                                <0.1535.0>,noproc}
[error] <0.1548.0> ** When Server state == {state,
[error] <0.1548.0>                          {osiris_log,
[error] <0.1548.0>                           {cfg,
[error] <0.1548.0>                            ".../__delete_queue_1716383944197847531",
[error] <0.1548.0>                            <<"__delete_queue_1716383944197847531">>,500000000,
[error] <0.1548.0>                            256000,#{},[],
[error] <0.1548.0>                            {write_concurrency,
[error] <0.1548.0>                             #Ref<0.1118981177.1282015234.97903>},
[error] <0.1548.0>                            {osiris_replica_reader,
[error] <0.1548.0>                             {resource,<<"/">>,queue,<<"delete_queue">>},
[error] <0.1548.0>                             {127,0,0,1},
[error] <0.1548.0>                             6489},
[error] <0.1548.0>                            #Fun<osiris_writer.0.78287785>,
[error] <0.1548.0>                            #Ref<0.1118981177.1282015234.97826>,16},
[error] <0.1548.0>                           {read,data,0,tcp,all,8,undefined},
[error] <0.1548.0>                           undefined,undefined,
[error] <0.1548.0>                           {file_descriptor,prim_file,
[error] <0.1548.0>                            #{handle => #Ref<0.1118981177.1282015238.91045>,
[error] <0.1548.0>                              owner => <0.1548.0>,
[error] <0.1548.0>                              r_buffer => #Ref<0.1118981177.1282015234.97902>,
[error] <0.1548.0>                              r_ahead_size => 0}}},
[error] <0.1548.0>                          <<"__delete_queue_1716383944197847531">>,tcp,
[error] <0.1548.0>                          #Port<0.84>,<33363.1916.0>,<0.1535.0>,
[error] <0.1548.0>                          #Ref<0.1118981177.1281884162.97904>,
[error] <0.1548.0>                          {write_concurrency,
[error] <0.1548.0>                           #Ref<0.1118981177.1282015234.97903>},
[error] <0.1548.0>                          {osiris_replica_reader,
[error] <0.1548.0>                           {resource,<<"/">>,queue,<<"delete_queue">>},
[error] <0.1548.0>                           {127,0,0,1},
[error] <0.1548.0>                           6489},
[error] <0.1548.0>                          -1,0}
[error] <0.1548.0> ** Reason for termination ==
[error] <0.1548.0> ** noproc

That is because the osiris_replica_reader process receives a DOWN message with the noproc reason from the monitor it set on the leader. It reuses that reason as its own exit reason. Because noproc is an abnormal exit reason, a crash is logged.
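The noproc reason can be reproduced outside of Osiris. The following is a minimal sketch, not code from this repository: the module and function names are hypothetical. It only relies on the documented behaviour of erlang:monitor/2, which delivers a DOWN message immediately when the monitored process no longer exists.

```erlang
-module(monitor_noproc_sketch).
-export([down_reason_for_dead_pid/0]).

%% Hypothetical helper for illustration only.
%% Spawns a short-lived process, waits until it has exited, then sets a
%% second monitor on it and returns the reason carried by the resulting
%% 'DOWN' message.
down_reason_for_dead_pid() ->
    {Pid, FirstRef} = spawn_monitor(fun() -> ok end),
    %% Wait until the process is really gone.
    receive
        {'DOWN', FirstRef, process, Pid, _} -> ok
    end,
    %% Monitoring a process that no longer exists triggers 'DOWN' right away.
    LateRef = erlang:monitor(process, Pid),
    receive
        {'DOWN', LateRef, process, Pid, Reason} -> Reason
    after 1000 ->
        timeout
    end.
```

Calling monitor_noproc_sketch:down_reason_for_dead_pid() mirrors what happens when osiris_replica_reader monitors a leader that has already stopped: the late monitor reports noproc rather than the leader's real exit reason.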

How

There is no reason to log such a crash when the process tree is being shut down concurrently. osiris_replica_reader can terminate with a normal reason.

That is what this patch does: if the reason carried by the leader's DOWN message is noproc, osiris_replica_reader terminates with the normal reason instead of reusing noproc.
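The approach described above could look roughly like the sketch below. This is not the actual patch: the module name, the record and its fields are assumptions made for illustration, and the real osiris_replica_reader state holds much more. The sketch only shows a gen_server that monitors a leader in init/1 and maps a noproc DOWN reason to a normal stop.

```erlang
-module(replica_reader_sketch).
-behaviour(gen_server).

-export([start/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Hypothetical, reduced state: only what is needed to track the leader.
-record(state, {leader_pid :: pid(),
                leader_ref :: reference()}).

start(LeaderPid) ->
    gen_server:start(?MODULE, LeaderPid, []).

init(LeaderPid) ->
    %% If LeaderPid is already dead, the 'DOWN' message is queued at once
    %% and handled right after init/1 returns.
    Ref = erlang:monitor(process, LeaderPid),
    {ok, #state{leader_pid = LeaderPid, leader_ref = Ref}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% The leader was gone before we could monitor it: stop quietly.
handle_info({'DOWN', Ref, process, Pid, noproc},
            #state{leader_pid = Pid, leader_ref = Ref} = State) ->
    {stop, normal, State};
%% Any other leader exit reason is still propagated as before.
handle_info({'DOWN', Ref, process, Pid, Reason},
            #state{leader_pid = Pid, leader_ref = Ref} = State) ->
    {stop, Reason, State};
handle_info(_Other, State) ->
    {noreply, State}.
```

Stopping with normal means gen_server does not log a termination report, while other leader exit reasons keep their existing behaviour.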

@dumbbell dumbbell added the enhancement New feature or request label May 22, 2024
@dumbbell dumbbell requested a review from kjnilsson May 22, 2024 15:52
@dumbbell dumbbell self-assigned this May 22, 2024
@dumbbell dumbbell marked this pull request as ready for review May 22, 2024 16:27
@dumbbell dumbbell force-pushed the handle-premature-leader-exit-in-osiris_replica_reader branch from 4f0489e to 5c3c10d Compare May 23, 2024 07:53

@kjnilsson kjnilsson left a comment


formatting

@dumbbell dumbbell force-pushed the handle-premature-leader-exit-in-osiris_replica_reader branch from 5c3c10d to 8a4ec95 Compare June 24, 2024 14:27
@kjnilsson kjnilsson merged commit 3cc00e8 into main Jun 24, 2024
8 checks passed
@kjnilsson kjnilsson deleted the handle-premature-leader-exit-in-osiris_replica_reader branch June 24, 2024 14:37
Labels
enhancement New feature or request
3 participants