Nomad server recover helper #77

Closed · maxadamo opened this issue Apr 12, 2023 · 5 comments · Fixed by #82
Comments

@maxadamo commented Apr 12, 2023

Affected Puppet, Ruby, OS and module versions/distributions

n/a

How to reproduce

bring down the nomad daemon on all your nomad servers

What are you seeing

you won't be able to restart the daemon

What behaviour did you expect instead

have a procedure, a script, or a Bolt task

Output log

n/a

Proposed solution

Recovering from an outage is a time-consuming operation, but it can be partially automated.
The manifest below creates a script which can be run on the servers to recover the cluster.
If you have PuppetDB you can use nomad_server_regex; otherwise you need to pre-fill a hash and use nomad_server_hash.

# This class is used to generate a peers.json and a recovery script file for Nomad servers.
# It is used to recover from a Nomad server outage.
#
# @example using PuppetDB
#  class { 'nomad::server_recovery':
#    nomad_server_regex => 'nomad-server0',
#    iface              => 'eth0',
#  }
#
# @example using a Hash
#  class { 'nomad::server_recovery':
#    nomad_server_hash => {
#      '192.168.1.10' => 'a1b2c3d4-1234-5678-9012-3456789abcde',
#      '192.168.1.11' => 'b2c3d4e5-2345-6789-0123-456789abcdef',
#    },
#    iface             => 'eth0',
#  }
#
# @param nomad_server_regex
#  Regex to match Nomad server hostnames within the same puppet environment
# @param nomad_server_hash
#  If you don't have PuppetDB you can supply a Hash with server IPs and corresponding node IDs
# @param iface
#  NIC where Nomad server IP is configured
# @param port
#  Nomad server port
#
class nomad::server_recovery (
  String $iface,
  Optional[String] $nomad_server_regex = undef,
  Optional[Hash] $nomad_server_hash    = undef,
  Stdlib::Port $port                   = 4647,
) {
  if $nomad_server_regex and $nomad_server_hash {
    fail('You can only use one of the parameters: nomad_server_regex or nomad_server_hash')
  } elsif !$nomad_server_regex and !$nomad_server_hash {
    fail('You must use one of the parameters: nomad_server_regex or nomad_server_hash')
  }
  if ($facts['nomad_node_id']) {
    if ($nomad_server_regex) {
      $nomad_server_inventory = puppetdb_query(
        "inventory[facts.networking.hostname, facts.networking.interfaces.${iface}.ip, facts.nomad_node_id] {
          facts.networking.hostname ~ '${nomad_server_regex}' and facts.agent_specified_environment = '${facts['agent_specified_environment']}'
        }"
      )
      $nomad_server_pretty_inventory = $nomad_server_inventory.map |$item| {
        {
          'id' => $item['facts.nomad_node_id'],
          'address' => "${item["facts.networking.interfaces.${iface}.ip"]}:${port}",
          'non_voter' => false
        }
      }
    } else {
      if $nomad_server_hash.keys !~ Array[Stdlib::IP::Address::Nosubnet] {
        fail('The keys of the nomad_server_hash parameter must be valid IP addresses')
      }
      $nomad_server_pretty_inventory = $nomad_server_hash.map |$key, $value| {
        {
          'id' => $value,
          'address' => "${key}:${port}",
          'non_voter' => false
        }
      }
    }

    file {
      default:
        owner => 'root',
        group => 'root';
      '/tmp/peers.json':
        mode    => '0640',
        content => to_json_pretty($nomad_server_pretty_inventory);
      '/usr/local/bin/nomad-server-outage-recover.sh':
        mode    => '0750',
        content => "#!/bin/bash
PATH=/bin:/usr/bin:/sbin:/usr/sbin
systemctl stop nomad.service
install -o root -g root -m 644 /tmp/peers.json /var/lib/nomad/server/raft/peers.json
systemctl start nomad.service\n";
    }
  }
}
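
For reference, the generated /tmp/peers.json would look roughly like this (illustrative placeholder values; the structure follows Nomad's documented raft recovery peers.json, one id/address/non_voter entry per server):

[
  {
    "id": "a1b2c3d4-1234-5678-9012-3456789abcde",
    "address": "192.168.1.10:4647",
    "non_voter": false
  },
  {
    "id": "b2c3d4e5-2345-6789-0123-456789abcdef",
    "address": "192.168.1.11:4647",
    "non_voter": false
  }
]
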
@maxadamo

Ideally we could create a Bolt task to trigger the execution of the script.
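
A minimal sketch of such a task, assuming it ships with the module as tasks/server_recovery.sh and only wraps the script that the manifest above installs (the task and file names are hypothetical, not an agreed design):

#!/bin/bash
# tasks/server_recovery.sh (hypothetical): run the recovery script Puppet already deployed
set -e
exec /usr/local/bin/nomad-server-outage-recover.sh

It could then be triggered on the server nodes with something like: bolt task run nomad::server_recovery --targets <your-nomad-servers>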

@bastelfreak

@sebastianrakel @attachmentgenie do you have some thoughts here? Maybe we should have a Bolt plan for this?
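
A rough sketch of what such a plan might look like, assuming it simply runs the script generated by the manifest in the issue description (the plan name and parameter are assumptions, not an agreed interface):

# plans/server_recovery.pp (hypothetical)
plan nomad::server_recovery (
  TargetSpec $targets,
) {
  # Run the recovery script that Puppet already installed on each Nomad server
  return run_command('/usr/local/bin/nomad-server-outage-recover.sh', $targets)
}

Invoked with something like: bolt plan run nomad::server_recovery targets=<your-nomad-servers>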

@attachmentgenie

I also feel a Bolt task would be more appropriate; in a meltdown situation I don't see anyone changing and pushing Hiera changes in an emergency.

@maxadamo commented Apr 13, 2023

@attachmentgenie the idea is to create the peers.json file in advance, pulling the data from PuppetDB (and falling back to Hiera only if you don't have PuppetDB), not only when you need it. The file will always be there, ready to be used.
It's going to be the same with Bolt, but if you don't have PuppetDB it's even worse, because you'll need to input all the data while you are in a meltdown situation, and at that point it's easier to create peers.json manually.

IMO the Bolt plan is ultimately an addition to the Puppet manifests.
And if you don't have PuppetDB, I would recommend filling in the data in advance.
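
A minimal Hiera sketch of such pre-filled data, assuming the class name nomad::server_recovery from the manifest above (the file path, IPs and node IDs are placeholders):

# e.g. data/role/nomad_server.yaml (hypothetical path)
nomad::server_recovery::iface: 'eth0'
nomad::server_recovery::nomad_server_hash:
  '192.168.1.10': 'a1b2c3d4-1234-5678-9012-3456789abcde'
  '192.168.1.11': 'b2c3d4e5-2345-6789-0123-456789abcdef'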

@maxadamo commented Apr 22, 2023

@attachmentgenie are you also good with the change, and is it clear how it works?
If all your servers are down, you just need to run /usr/local/bin/nomad-server-outage-recover.sh on your Nomad servers (not the agents).
I can merge it straight away, and I am asking because I already started working on the next PR to fix #84.
