rolling restart #124

Merged
merged 7 commits into from
Jan 18, 2019
Conversation

@adyach adyach commented Dec 20, 2018

Description

Rolling restart of the cluster:

  1. The user asks for a restart using the CLI: bubuku-cli rolling-restart --image-tag 123 --instance-type t2.nano
  2. Bubuku creates a restart assignment and triggers a rolling restart action for the first broker in the assignment, while checking the cluster state using Kafka JMX metrics
  3. Restart: the broker is stopped, its volume is detached, the instance is terminated, a new instance is launched, and the broker is started
  4. Once the restarted broker is up, it triggers the next rolling restart action with the restart assignment minus the broker that was just restarted
  5. When the restart assignment is empty, the rolling restart is finished
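
The steps above can be sketched roughly as follows. This is only an illustration of the control flow, not the code in this PR: `rolling_restart_step`, `run_rolling_restart`, `restart_broker` and `cluster_is_healthy` are hypothetical names, and the health check stands in for whatever the JMX-based cluster state check actually does.

```python
from typing import Callable, List


def rolling_restart_step(assignment: List[str],
                         restart_broker: Callable[[str], None],
                         cluster_is_healthy: Callable[[], bool]) -> List[str]:
    """Run one step and return the assignment for the next step (hypothetical helper)."""
    if not assignment:
        return []  # empty assignment: the rolling restart is finished
    if not cluster_is_healthy():
        return assignment  # cluster not stable yet: keep the same assignment and retry later
    # Assumed to cover: stop broker, detach volume, terminate and launch instance, start broker.
    restart_broker(assignment[0])
    return assignment[1:]  # pass on the assignment without the restarted broker


def run_rolling_restart(assignment: List[str],
                        restart_broker: Callable[[str], None],
                        cluster_is_healthy: Callable[[], bool],
                        max_steps: int = 100) -> List[str]:
    """Repeat steps until the assignment is empty or max_steps is reached."""
    steps = 0
    while assignment and steps < max_steps:
        assignment = rolling_restart_step(assignment, restart_broker, cluster_is_healthy)
        steps += 1
    return assignment  # an empty list means every broker was restarted
```

In the flow described above each step is triggered by the broker that has just come back up, so the loop in `run_rolling_restart` is only a single-process stand-in for that hand-over.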

#
_LOG.info('Overriding ephemeral volumes to be able to set up AWS auto recovery alarm ')
block_devices = []
for bd in ami.block_device_mappings:

@adyach could you please explain this to me later. I don't understand it.

UserData=taupage_user_data,
InstanceType=self.cluster_config.get_instance_type(),
SubnetId=subnet['SubnetId'],
PrivateIpAddress=ip,

@adyach why not let AWS choose an available IP address for you?

logging.basicConfig(level=getattr(logging, 'INFO', None))


@click.group()
def cli():
    logo = """

                    .__   
  ____  ____   ____ |  |  
_/ ___\/  _ \ /  _ \|  |  
\  \__(  <_> |  <_> )  |__
 \___  >____/ \____/|____/
     \/                   


rcillo commented Dec 20, 2018

This is really cool 👍

Some corner cases that could happen but I'm not sure we need to address them right now:

  1. Kafka leader imbalance could make this pause permanently between restarts. We've seen that happening. Let's see how it goes.
  2. The action is consumed and posted back on each step. The concern is that a failure during a step could be unrecoverable. I mean, if something bad happens at step 3 out of 5 and an unrecoverable exception is raised, wouldn't that stop the whole process, leaving us to somehow identify where it stopped? Please correct me if I understood it wrong.


adyach commented Dec 20, 2018 via email


rcillo commented Dec 20, 2018

Ok, now I remember that bubuku has a "catch all" exception handler and it would retry the action, right? So the only way to end up with a partial rolling restart would be to kill the bubuku process with -9, for example. If that's the case, then I agree that it's quite a dramatic event and maybe we would like to start all over.
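
For readers unfamiliar with the pattern, the behaviour described in this comment might look something like the sketch below. It is an assumption about the general shape only, not bubuku's actual code: `run_with_retry`, `action` and `max_attempts` are hypothetical names. A catch-all `except Exception` lets the action be retried after any ordinary failure, whereas a `kill -9` (SIGKILL) cannot be caught by the process at all.

```python
import logging
from typing import Callable

_LOG = logging.getLogger(__name__)


def run_with_retry(action: Callable[[], bool], max_attempts: int = 5) -> bool:
    """Call `action` until it reports completion, logging and swallowing any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            if action():
                return True  # the action reported that it is done
        except Exception:  # catch-all: the failed action is simply tried again
            _LOG.exception('Attempt %d failed, will retry', attempt)
    return False  # gave up after max_attempts
```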


rcillo commented Dec 20, 2018

👍


rcillo commented Dec 21, 2018

I checked the latest change. Was it necessary because it would retry too often?


adyach commented Jan 15, 2019

👍


antban commented Jan 18, 2019

👍

@adyach adyach merged commit 30086c0 into master Jan 18, 2019