Fix QQ crash recovery bug #2256

Merged
merged 1 commit into from
Feb 25, 2020

kjnilsson
Contributor

When dead letter handlers were in use, the state machine would crash if a
prefix_msg was dead-lettered during recovery. This change handles that case
and also fixes an issue where the wrong initial release cursor interval
was set when overriding the release cursor default.

[#171463230]
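
As an illustration only, the failure mode can be modelled in a few lines of Java. This is a toy sketch, not the Erlang code in rabbit_fifo: the class, the `Entry` type and both fold methods are hypothetical names, and skipping the prefix entry in `tolerantFold` is an assumption about one possible handling, since the conversation does not say what the patch does with such entries. It only shows why a fold that knows a single entry shape fails once a '$prefix_msg' entry appears.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeadLetterFoldSketch {

    // Hypothetical stand-in for one discarded message entry.
    // raftIndex == null models a '$prefix_msg' entry, which carries no log index.
    static final class Entry {
        final Long raftIndex;
        final int info; // opaque value, like the 5 in {'$prefix_msg',5}

        Entry(Long raftIndex, int info) {
            this.raftIndex = raftIndex;
            this.info = info;
        }

        boolean isPrefixMsg() {
            return raftIndex == null;
        }
    }

    // Only handles log-backed entries; a prefix entry raises an error,
    // comparable to an unmatched case clause.
    static List<Long> strictFold(Map<Integer, Entry> discarded) {
        List<Long> acc = new ArrayList<>();
        for (Map.Entry<Integer, Entry> e : discarded.entrySet()) {
            Entry m = e.getValue();
            if (m.isPrefixMsg()) {
                throw new IllegalStateException(
                        "no clause for prefix message at key " + e.getKey());
            }
            acc.add(m.raftIndex);
        }
        return acc;
    }

    // Assumed handling: skip prefix entries instead of failing.
    static List<Long> tolerantFold(Map<Integer, Entry> discarded) {
        List<Long> acc = new ArrayList<>();
        for (Entry m : discarded.values()) {
            if (!m.isPrefixMsg()) {
                acc.add(m.raftIndex);
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        Map<Integer, Entry> discarded = new LinkedHashMap<>();
        discarded.put(0, new Entry(null, 5));   // prefix message entry
        discarded.put(1, new Entry(12406L, 5)); // log-backed entry
        System.out.println("tolerant: " + tolerantFold(discarded));
        try {
            strictFold(discarded);
        } catch (IllegalStateException ex) {
            System.out.println("strict: " + ex.getMessage());
        }
    }
}
```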

michaelklishin merged commit 1cc662c into master on Feb 25, 2020
michaelklishin deleted the rabbit-fifo-dead-letter-bug branch on February 25, 2020 15:51
michaelklishin added this to the 3.8.3 milestone on Feb 26, 2020
@acogoluegnes
Contributor

To reproduce (from the umbrella):

  • change the value here to 1 (to increase the likelihood of taking a snapshot).
  • run the broker: make run-broker PLUGINS='rabbitmq_management'
  • from the management UI, import a JSON resource file with the following content:
{
   "rabbit_version":"3.8.2",
   "rabbitmq_version":"3.8.2",
   "users":[
      {
         "name":"guest",
         "password_hash":"iUrc4rS1mlxSbZUDW8aHkwZNy8JOhhZWe6R98tzhAtfVHnaI",
         "hashing_algorithm":"rabbit_password_hashing_sha256",
         "tags":"administrator"
      }
   ],
   "vhosts":[
      {
         "name":"/"
      }
   ],
   "permissions":[
      {
         "user":"guest",
         "vhost":"/",
         "configure":".*",
         "write":".*",
         "read":".*"
      }
   ],
   "topic_permissions":[

   ],
   "parameters":[

   ],
   "global_parameters":[
      {
         "name":"cluster_name",
         "value":"rabbit@acogoluegnes-inspiron"
      }
   ],
   "policies":[

   ],
   "queues":[
      {
         "name":"qq",
         "vhost":"/",
         "durable":true,
         "auto_delete":false,
         "arguments":{
            "x-dead-letter-exchange":"dle",
            "x-dead-letter-routing-key":"dle",
            "x-queue-type":"quorum"
         }
      },
      {
         "name":"dle",
         "vhost":"/",
         "durable":true,
         "auto_delete":false,
         "arguments":{
            "x-queue-type":"classic"
         }
      }
   ],
   "exchanges":[
      {
         "name":"dle",
         "vhost":"/",
         "type":"direct",
         "durable":true,
         "auto_delete":false,
         "internal":false,
         "arguments":{

         }
      }
   ],
   "bindings":[
      {
         "source":"dle",
         "vhost":"/",
         "destination":"dle",
         "destination_type":"queue",
         "routing_key":"dle",
         "arguments":{

         }
      }
   ]
}
  • launch the following Java program (or equivalent):
import com.rabbitmq.client.*;

import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

public class QqRecoveryCrash {

    public static void main(String[] args) throws Exception {
        ConnectionFactory cf = new ConnectionFactory();
        Connection c = cf.newConnection();
        Channel ch = c.createChannel();

        // Publish enough messages to the quorum queue for a snapshot to be taken.
        for (int i = 0; i < 200_000; i++) {
            ch.basicPublish("", "qq", null, "hello".getBytes());
        }

        ch.basicQos(2);
        AtomicInteger count = new AtomicInteger(0);
        ch.basicConsume("qq", false, new DefaultConsumer(ch) {
            @Override
            public void handleDelivery(String consumerTag, Envelope envelope, AMQP.BasicProperties properties, byte[] body) throws IOException {
                // Alternate: reject without requeue (dead-letters the message), then ack.
                if (count.get() % 2 == 0) {
                    ch.basicReject(envelope.getDeliveryTag(), false);
                } else {
                    ch.basicAck(envelope.getDeliveryTag(), false);
                }
                count.incrementAndGet();
            }
        });

        // Keep the consumer alive until the program is stopped manually.
        Thread.sleep(1000000000L);
    }

}
  • wait until a snapshot is taken (in my case, the snapshot lies in /tmp/rabbitmq-test-instances/rabbit/mnesia/rabbit/quorum/rabbit@acogoluegnes-inspiron/2F_QQ22JQAMBA5G1D/snapshots)
  • stop the broker with Ctrl + C, then stop the Java program
  • restart the broker. The queue should fail to recover, with the following error:
2020-02-26 08:41:45.823 [info] <0.867.0> queue 'qq' in vhost '/': terminating with {case_clause,{0,{'$prefix_msg',5},[]}} in state recover
2020-02-26 08:41:45.823 [debug] <0.867.0> queue 'qq' in vhost '/': terminating with reason '{case_clause,{0,{'$prefix_msg',5},[]}}'
2020-02-26 08:41:45.826 [error] <0.867.0> ** State machine '%2F_qq' terminating
** Last event = {cast,go}
** When server state  = [{id,{'%2F_qq','rabbit@acogoluegnes-inspiron'}},{opt,terminate},{raft_state,recover},{leader_last_seen,undefined},{num_pending_commands,0},{num_delayed_commands,0},{election_timeout_set,false},{ra_server_state,#{aux => {'%2F_qq',{inactive,-576460706242386,1,1.0}},cluster => #{{'%2F_qq','rabbit@acogoluegnes-inspiron'} => #{commit_index_sent => 0,match_index => 0,next_index => 1,query_index => 0}},commit_index => 212813,current_term => 1,effective_machine_version => 0,last_applied => 12405,log => #{cache_size => 0,first_index => 12406,last_index => 212818,last_written_index_term => {212818,1},num_segments => 7,open_segments => 1,snapshot_index => 12405,type => ra_log},machine => #{checkout_message_bytes => 0,enqueue_message_bytes => 62020,num_checked_out => ...,...},...}}]
** Reason for termination = error:{case_clause,{0,{'$prefix_msg',5},[]}}
** Callback mode = [state_functions,state_enter]
** Stacktrace =
**  [{rabbit_fifo,'-dead_letter_effects/4-anonymous-0-',4,[{file,"src/rabbit_fifo.erl"},{line,1160}]},{maps,fold_1,3,[{file,"maps.erl"},{line,232}]},{rabbit_fifo,dead_letter_effects,4,[{file,"src/rabbit_fifo.erl"},{line,1160}]},{rabbit_fifo,apply,3,[{file,"src/rabbit_fifo.erl"},{line,192}]},{ra_server,apply_with,2,[{file,"src/ra_server.erl"},{line,2024}]},{ra_server,'-recover/1-fun-0-',2,[{file,"src/ra_server.erl"},{line,305}]},{ra_server,'-apply_to/5-lists^foldl/2-0-',3,[{file,"src/ra_server.erl"},{line,1982}]},{ra_server,apply_to,5,[{file,"src/ra_server.erl"},{line,1982}]}]
2020-02-26 08:41:45.827 [error] <0.867.0> CRASH REPORT Process '%2F_qq' with 0 neighbours crashed with reason: no case clause matching {0,{'$prefix_msg',5},[]} in rabbit_fifo:'-dead_letter_effects/4-anonymous-0-'/4 line 1160
2020-02-26 08:41:45.827 [error] <0.546.0> Supervisor {<0.546.0>,ra_server_sup} had child '%2F_qq' started with ra_server_proc:start_link(#{await_condition_timeout => 30000,broadcast_time => 100,cluster_name => '%2F_qq',friendly_name => ...,...}) at <0.867.0> exit with reason no case clause matching {0,{'$prefix_msg',5},[]} in rabbit_fifo:'-dead_letter_effects/4-anonymous-0-'/4 line 1160 in context child_terminated
2020-02-26 08:41:46.074 [debug] <0.913.0> queue 'qq' in vhost '/': ra_log:init recovered last_index_term {212818,1} first index 12406
2020-02-26 08:41:46.088 [debug] <0.913.0> queue 'qq' in vhost '/': recover -> recover in term: 1
2020-02-26 08:41:46.088 [debug] <0.913.0> queue 'qq' in vhost '/': recovering state machine version 0:0 from index 12405 to 212813
  • stop the broker

To check the fix, apply the patch, restart the broker (without wiping the data directory), and the queue should recover.
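
One way to confirm this is to scan the broker log for the crash reason shown above. The following is a minimal sketch using only the Java standard library; the class name, the method name and the idea of passing the log file path as the first argument are assumptions for illustration (the log location depends on the setup). The pattern only matches the `{case_clause,{N,{'$prefix_msg',M},[]}}` form from the "terminating with" lines, not the wording of the CRASH REPORT line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class RecoveryCrashLogCheck {

    // Matches e.g. {case_clause,{0,{'$prefix_msg',5},[]}}
    private static final Pattern CRASH = Pattern.compile(
            "\\{case_clause,\\{\\d+,\\{'\\$prefix_msg',\\d+\\},\\[\\]\\}\\}");

    // True if the line contains the prefix_msg case_clause crash reason.
    static boolean isPrefixMsgCrash(String logLine) {
        return CRASH.matcher(logLine).find();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: RecoveryCrashLogCheck <path-to-broker-log>");
            return;
        }
        long hits;
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            hits = lines.filter(RecoveryCrashLogCheck::isPrefixMsgCrash).count();
        }
        System.out.println("prefix_msg crash lines found: " + hits);
    }
}
```

Since an existing log still contains the lines from the failed restart, it makes sense to run this only against log output written after the patched broker started.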

Contributor

@acogoluegnes acogoluegnes left a comment

LGTM.

@acogoluegnes
Contributor

Backported to and tested against v3.8.x.
