Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement req: enable support to control Kafka server restarts #58

Closed
bijugs opened this Issue Oct 7, 2014 · 7 comments

Comments

Projects
None yet
3 participants
@bijugs
Copy link

bijugs commented Oct 7, 2014

One of the scenarios which Kafka users will come across often is the restart of Kafka servers in a cluster to deploy new changes with minimal down time. When chef-client is run in batch on all the nodes (scheduled as daemons), there is a possibility that many kafka servers are brought down at the same time due to restart for config changes resulting in availability/performance issue. The worst case scenario would be all but one Kafka server is brought down.
Another way to alleviate this scenario will be to control the number of servers which can be restarted at a time which can vary by users based on how many replication they set for the topic partitions. Can we provide some feature to be able to control the restart of the particular Kafka node. One way would be to

  • conditionally include a new recipe file instead of the "service" resource in " _configure.rb
  • The default "new recipe" file will have a dummy recipe which will get executed on any condition that currently triggers the Kafka restart.
  • When the dummy recipe gets executed, it notifies the restart "kafka service" resource to perform the restart.

This will help users to override the default "new" recipe file with their own recipe file and logic to control whether the Kafka server can be restarted or not. Let me know if you have any questions/thoughts.

@cbaenziger

This comment has been minimized.

Copy link
Contributor

cbaenziger commented Oct 7, 2014

I think we can achieve this via a more simple method of extending (or removing) the service action for Kafka.

Previously, in Chef-BCPC we extended the service action to notify our cookbook's kafka verification step for example.

For others' reference, in our cookbooks we are trying to implement a rolling configuration restart for the services on our clusters. Right now, we are looking at having a LWRP wrap our service actions so that the :restart action can be caught and synchronized per required running instance counts to avoid availability outages.

Thanks to this community for the awesome Kafka cookbook!

@mthssdrbrg

This comment has been minimized.

Copy link
Collaborator

mthssdrbrg commented Oct 19, 2014

I see what you're getting at, and I don't have any particular preference regarding this (more than it should be configurable) as we're not really using Chef to do coordinated restarts of services (currently a manual process).

I'm all for any possible solutions that aren't too crazy. Not sure I'm completely following your suggestion, could you provide some sample code for exactly how this would work?

@cbaenziger: I made an attempt to look through the "rolling configuration restart" link and from what I could make of it, it looked like the restarts would be coordinated using ZooKeeper, no?

@bijugs

This comment has been minimized.

Copy link
Author

bijugs commented Oct 19, 2014

Hi @mthssdrbrg , thank you for looking at the request. The following will be the outline of the code changes for the proposed solution.

  • Add a new attribute to the cookbook. For e.g.
default.kafka.start_coordination.recipe = `kafka::_coordinate`
  • Add a new recipe say _coordinate.rb with a dummy ruby block
    ruby_block "coordinate-kafka-start" do
       block do
          Chef::Log.debug ("Default recipe to coordinate Kafka start is used")
        end
        action :create
        notifies :restart, 'service[kafka]', :delayed
     end
     service 'kafka' do
        provider kafka_init_opts[:provider]
        supports start: true, stop: true, restart: true, status: true
        action kafka_service_actions
     end
  • Change the current notifications to service[kafka] resource to ruby_block[coordinate-kafka-start] in the cookbook recipes (e.g. _configure.rb)
  • Remove service 'kafka' resource in _configure.rb and in its an include_recipe statement to include the new recipe
  include_recipe node.kafka.start_coordination.recipe

These changes should leave the function of the cookbook as it is today.

When someone would like to use the cookbook and include logic to control the restart of Kafka process say by using ZooKeeper as we are doing, in the wrapper cookbook they can

  • Overwrite the new attribute with their custom recipe name. For e.g default.kafka.start_coordination.recipe = "wrapper_cookbook::recipe_name"
  • In the wrapper_cookbook::recipe_name recipe, for restart coordination users need to have the same dummy ruby_block as in the _coordinate.rb but can notify the logic they would like to incorporate. Once the logic is executed the service 'kafka' resource can be notified. For e.g the following is a sample of the custom recipe users can create
    ruby_block "coordinate-kafka-start" do
       block do
          Chef::Log.info (" Custom recipe to coordinate Kafka start is used")
        end
        action :create
        notifies :create, 'ruby_block[restart_coordination]', :delayed
     end
     ruby_block "restart_coordination" do
       block do
          Chef::Log.info ("Need to implement the process to coordinate the restart process like using ZK")
       end
       action :nothing
       notifies :restart, 'service[kafka]', :delayed
     end
     service 'kafka' do
        provider kafka_init_opts[:provider]
        supports start: true, stop: true, restart: true, status: true
        action kafka_service_actions
     end
     ruby_block "restart_coordination_cleanup" do
       block do
          Chef::Log.info ("Implement any cleanup logic required after restart like releasing locks")
       end
       action :nothing
     end
  • The clean up resource is optional based on the implementation.

The write-up is long but the changes are trivial. Please let me know if you have any questions. Also I may have overlooked details in the cookbook and if so let me know.

@mthssdrbrg

This comment has been minimized.

Copy link
Collaborator

mthssdrbrg commented Nov 1, 2014

Thanks for the very detailed explanation, that's pretty much what I thought as well, just wanted to make sure that we were on the same page :). As long as it's backwards compatible I have no problem with including functionality as such, and wouldn't mind reviewing a pull request.

@mthssdrbrg

This comment has been minimized.

Copy link
Collaborator

mthssdrbrg commented Nov 15, 2014

Closed by #63. Perhaps the README should be updated with a description of this though,

@mthssdrbrg

This comment has been minimized.

Copy link
Collaborator

mthssdrbrg commented Dec 9, 2014

@bijugs I just realized that the service definition is in the _coordinate recipe, though I'm thinking that this isn't strictly necessary? Can you think of why this should be necessary? Otherwise I'd prefer to move it to some other recipe, like _install or _configure, as it's now required to be specified in any recipe that would want to override the _coordinate one.

@bijugs

This comment has been minimized.

Copy link
Author

bijugs commented Dec 9, 2014

@mthssdrbrg, the reason I had included the service definition in _coordinate is to allow users to include any logic/action need after the restart of Kafka brokers. This is a sample custom implementation of coordinate recipe which clears a lock if the restart was successful in a node. If the restart was not successful the lock stays in place so that restarts doesn't happen in other nodes in cluster until the reason for the restart failure is investigated. This is to prevent unavailability if the problem is cluster wide like incorrect broker property.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.