rolling restart #124

adyach · 2018-12-20T08:15:34Z

Description

Rolling restart of the cluster:

User requests a restart using the CLI: bubuku-cli rolling-restart --image-tag 123 --instance-type t2.nano
Bubuku creates a restart assignment and triggers a rolling restart action for the first broker in the assignment, while checking the cluster state using Kafka JMX metrics
Restart: the broker is stopped, the volume is detached, the instance is terminated, a new instance is launched, and the broker is started
Once the restarted broker is up, the broker performing the restart triggers a rolling restart action with the restart assignment minus the broker that was just restarted
When the restart assignment is empty, the rolling restart is finished
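
The flow above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `rolling_restart_step`, `run_rolling_restart`, `cluster_is_healthy` and `restart_broker` are hypothetical names, the assignment is assumed to be a plain list of broker ids, and the health check stands in for the JMX-based cluster state check.

```python
from typing import Callable, List, Optional


def rolling_restart_step(assignment: List[str],
                         cluster_is_healthy: Callable[[], bool],
                         restart_broker: Callable[[str], None]) -> Optional[List[str]]:
    """Run one step; return the assignment for the next action, or None when done."""
    if not assignment:
        return None  # empty assignment: the rolling restart is finished
    if not cluster_is_healthy():
        return assignment  # cluster not stable yet: try the same assignment again later
    broker_id = assignment[0]
    # Hypothetical callback: stop broker, detach volume, terminate and
    # relaunch the instance, start the broker again.
    restart_broker(broker_id)
    return assignment[1:]  # next action carries the assignment without this broker


def run_rolling_restart(assignment: Optional[List[str]],
                        cluster_is_healthy: Callable[[], bool],
                        restart_broker: Callable[[str], None],
                        max_steps: int = 1000) -> int:
    """Drive steps until the assignment is exhausted (or max_steps is hit)."""
    steps = 0
    while assignment is not None and steps < max_steps:
        assignment = rolling_restart_step(assignment, cluster_is_healthy, restart_broker)
        steps += 1
    return steps


if __name__ == '__main__':
    restarted: List[str] = []
    run_rolling_restart(['broker-1', 'broker-2'], lambda: True, restarted.append)
    print(restarted)
```

Each step handles at most one broker, so at most one broker is down at a time, and an unhealthy cluster simply postpones the next restart.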

rcillo · 2018-12-20T11:08:27Z

bubuku/aws/ec2_node_launcher.py

+        #
+        _LOG.info('Overriding ephemeral volumes to be able to set up AWS auto recovery alarm ')
+        block_devices = []
+        for bd in ami.block_device_mappings:


@adyach could you please explain this to me later. I don't understand it.
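
One possible reading of this hunk, offered as a sketch and not as the author's answer: as far as I know, EC2 auto recovery via a CloudWatch alarm was at the time documented as unavailable for instances with instance store (ephemeral) volumes, so the AMI's ephemeral mappings would have to be suppressed at launch. The function name `override_ephemeral_volumes` is hypothetical, and the rest of the loop in the diff is not shown here, so the actual code may differ. The `NoDevice` key is the standard EC2 block device mapping field for suppressing a device.

```python
from typing import Any, Dict, List


def override_ephemeral_volumes(ami_mappings: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep EBS mappings as they are and suppress every non-EBS (ephemeral) device."""
    block_devices = []
    for bd in ami_mappings:
        if 'Ebs' in bd:
            # EBS-backed device (for example the root volume): keep it unchanged.
            block_devices.append(bd)
        else:
            # Instance store device: tell EC2 not to attach it at launch.
            block_devices.append({'DeviceName': bd['DeviceName'], 'NoDevice': ''})
    return block_devices


if __name__ == '__main__':
    example = [
        {'DeviceName': '/dev/sda1', 'Ebs': {'VolumeSize': 8}},
        {'DeviceName': '/dev/sdb', 'VirtualName': 'ephemeral0'},
    ]
    print(override_ephemeral_volumes(example))
```

The resulting list would be passed as `BlockDeviceMappings` when launching the instance.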

rcillo · 2018-12-20T11:20:14Z

bubuku/aws/ec2_node_launcher.py

+            UserData=taupage_user_data,
+            InstanceType=self.cluster_config.get_instance_type(),
+            SubnetId=subnet['SubnetId'],
+            PrivateIpAddress=ip,


@adyach why not let AWS choose an available IP address for you?

rcillo · 2018-12-20T11:29:19Z

bubuku/cli.py

 logging.basicConfig(level=getattr(logging, 'INFO', None))


 @click.group()
 def cli():
+    logo = """


.__ ____ ____ ____ | | _/ ___\/ _ \ / _ \| | \ \__( <_> | <_> ) |__ \___ >____/ \____/|____/ \/

rcillo · 2018-12-20T11:50:42Z

This is really cool 👍

Some corner cases that could happen but I'm not sure we need to address them right now:

1. Kafka leader imbalance could make this pause permanently between restarts. We've seen that happening. Let's see how it goes.
2. The action is consumed and posted back on each step. The problem is that a failure during the execution of a step could be unrecoverable. I mean, if at step 3 out of 5 something bad happens and an unrecoverable exception is raised, wouldn't that stop the whole process, and we would have to somehow identify where it stopped? Please correct me if I understood it wrong.

adyach · 2018-12-20T12:02:11Z

Thank you for the review!

1. It intentionally blocks the restart if there is a preferred replica imbalance; we observed in past versions of Kafka that it could influence behavior. I agree, we need to observe it for some time.
2. The restart can stop if the bubuku instance performing the restart is stopped; in that case the restart has to be started again from scratch. This is intentional, because it is quite a bad situation when a broker goes down while it is restarting another one. The state is not saved; this is not perfect, and we can improve it if required. If one of the steps fails with an exception, it will be triggered again.


-- With great enthusiasm, Andrey

rcillo · 2018-12-20T13:18:01Z

Ok, now I remember that bubuku has a "catch all" exception handler and it would retry the action, right? So the only chance of having a partial rolling restart would be to kill the bubuku process with -9, for example. If that's the case, then I agree that it's quite a dramatic event and maybe we would like to start all over.
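
The retry behavior discussed here can be pictured with a small sketch. It is not bubuku's controller code: `run_actions`, the in-memory queue and the convention that an action returns `True` when it is done are all assumptions made for illustration.

```python
import logging
from collections import deque
from typing import Callable, Deque

_LOG = logging.getLogger('retry_sketch')


def run_actions(queue: Deque[Callable[[], bool]], max_iterations: int = 100) -> int:
    """Run queued actions; an action returns True when done, False to be re-queued."""
    iterations = 0
    while queue and iterations < max_iterations:
        iterations += 1
        action = queue.popleft()
        try:
            done = action()
        except Exception:
            # "Catch all": log the failure and keep the action for another attempt.
            _LOG.exception('Action failed, it will be retried')
            done = False
        if not done:
            queue.append(action)
    return iterations


if __name__ == '__main__':
    attempts = {'count': 0}

    def flaky_restart() -> bool:
        # Hypothetical action that fails twice before succeeding.
        attempts['count'] += 1
        if attempts['count'] < 3:
            raise RuntimeError('broker did not come up')
        return True

    print(run_actions(deque([flaky_restart])))
```

In this sketch an exception never drops the action, so a step that fails is simply attempted again. The queue lives only in memory, which mirrors the point above: killing the process loses the in-flight state, and the restart has to begin again.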

rcillo · 2018-12-20T13:19:05Z

👍

rcillo · 2018-12-21T09:42:50Z

I checked the latest change. Was it necessary because it would retry too often?

adyach · 2019-01-15T10:41:14Z

👍

antban · 2019-01-18T15:26:58Z

👍

adyach added 2 commits December 20, 2018 09:02

rolling restart

cd8f94d

create alert for the instance

fdc3be7

rcillo reviewed Dec 20, 2018


push back timeout of 20 seconds

68b606d

adyach and others added 4 commits December 21, 2018 13:06

updated version

cf11e71

increased restart cooldown timeout to 120 sec

90a80f2

wait until instance is terminated

5b389ca

Merge branch 'master' into ARUHA-2081-1

1c396ae

antban approved these changes Jan 18, 2019


adyach merged commit 30086c0 into master Jan 18, 2019
