Nomad server recover helper #77

Closed · maxadamo opened this issue Apr 12, 2023 · 5 comments · Fixed by #82
Comments

@maxadamo commented Apr 12, 2023

Affected Puppet, Ruby, OS and module versions/distributions

n/a

How to reproduce

bring down the nomad daemon on all your nomad servers

What are you seeing

you won't be able to restart the daemon

What behaviour did you expect instead

have a procedure, a script, or a Bolt task

Output log

n/a

Proposed solution

Recovering from an outage is a time-consuming operation, but it can be partially automated.
The manifest below creates a script which can be run on the servers to recover the cluster.
If you have PuppetDB you can use nomad_server_regex; otherwise you need to pre-fill a hash and use nomad_server_hash.

# This class is used to generate a peers.json and a recovery script file for Nomad servers.
# It is used to recover from a Nomad server outage.
#
# @example using PuppetDB
#  class { 'nomad::server_recovery':
#    nomad_server_regex => 'nomad-server0',
#    iface              => 'eth0',
#  }
#
# @example using a Hash
#  class { 'nomad::server_recovery':
#    nomad_server_hash => {
#      '192.168.1.10' => 'a1b2c3d4-1234-5678-9012-3456789abcde',
#      '192.168.1.11' => 'b2c3d4e5-2345-6789-0123-456789abcdef',
#    },
#    iface             => 'eth0',
#  }
#
# @param nomad_server_regex
#  Regex to match Nomad server hostnames within the same puppet environment
# @param nomad_server_hash
#  If you don't have PuppetDB you can supply a Hash with server IPs and corresponding node IDs
# @param iface
#  NIC where Nomad server IP is configured
# @param port
#  Nomad server port
#
class nomad::server_recovery (
  String $iface,
  Optional[String] $nomad_server_regex = undef,
  Optional[Hash] $nomad_server_hash    = undef,
  Stdlib::Port $port                   = 4647,
) {
  if $nomad_server_regex and $nomad_server_hash {
    fail('You can only use one of the parameters: nomad_server_regex or nomad_server_hash')
  } elsif !$nomad_server_regex and !$nomad_server_hash {
    fail('You must use one of the parameters: nomad_server_regex or nomad_server_hash')
  }
  if ($facts['nomad_node_id']) {
    if ($nomad_server_regex) {
      $nomad_server_inventory = puppetdb_query(
        "inventory[facts.networking.hostname, facts.networking.interfaces.${iface}.ip, facts.nomad_node_id] {
          facts.networking.hostname ~ '${nomad_server_regex}' and facts.agent_specified_environment = '${facts['agent_specified_environment']}'
        }"
      )
      $nomad_server_pretty_inventory = $nomad_server_inventory.map |$item| {
        {
          'id' => $item['facts.nomad_node_id'],
          'address' => "${item["facts.networking.interfaces.${iface}.ip"]}:${port}",
          'non_voter' => false
        }
      }
    } else {
      if $nomad_server_hash.keys !~ Array[Stdlib::IP::Address::Nosubnet] {
        fail('The keys of the nomad_server_hash parameter must be valid IP addresses')
      }
      $nomad_server_pretty_inventory = $nomad_server_hash.map |$key, $value| {
        {
          'id' => $value,
          'address' => "${key}:${port}",
          'non_voter' => false
        }
      }
    }

    file {
      default:
        owner => 'root',
        group => 'root';
      '/tmp/peers.json':
        mode    => '0640',
        content => to_json_pretty($nomad_server_pretty_inventory);
      '/usr/local/bin/nomad-server-outage-recover.sh':
        mode    => '0750',
        content => "#!/bin/bash
PATH=/bin:/usr/bin:/sbin:/usr/sbin
systemctl stop nomad.service
install -o root -g root -m 644 /tmp/peers.json /var/lib/nomad/server/raft/peers.json
systemctl start nomad.service\n";
    }
  }
}
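
For reference, the generated /tmp/peers.json would look roughly like this (illustrative placeholder values; the structure follows Nomad's documented raft recovery peers.json, one id/address/non_voter entry per server):

[
  {
    "id": "a1b2c3d4-1234-5678-9012-3456789abcde",
    "address": "192.168.1.10:4647",
    "non_voter": false
  },
  {
    "id": "b2c3d4e5-2345-6789-0123-456789abcdef",
    "address": "192.168.1.11:4647",
    "non_voter": false
  }
]
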
@maxadamo

Ideally we could create a Bolt task to trigger the execution of the script.
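
A minimal sketch of such a task, assuming it ships with the module as tasks/server_recovery.sh and only wraps the script that the manifest above installs (the task and file names are hypothetical, not an agreed design):

#!/bin/bash
# tasks/server_recovery.sh (hypothetical): run the recovery script Puppet already deployed
set -e
exec /usr/local/bin/nomad-server-outage-recover.sh

It could then be triggered on the server nodes with something like: bolt task run nomad::server_recovery --targets <your-nomad-servers>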

@bastelfreak

@sebastianrakel @attachmentgenie do you have some thoughts here? Maybe we should have a Bolt plan for this?
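
A rough sketch of what such a plan might look like, assuming it simply runs the script generated by the manifest in the issue description (the plan name and parameter are assumptions, not an agreed interface):

# plans/server_recovery.pp (hypothetical)
plan nomad::server_recovery (
  TargetSpec $targets,
) {
  # Run the recovery script that Puppet already installed on each Nomad server
  return run_command('/usr/local/bin/nomad-server-outage-recover.sh', $targets)
}

Invoked with something like: bolt plan run nomad::server_recovery targets=<your-nomad-servers>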

@attachmentgenie

I also feel a Bolt task would be more appropriate; in a meltdown situation I don't see anyone changing and pushing Hiera changes in an emergency.

@maxadamo commented Apr 13, 2023

@attachmentgenie the idea is to create the peers.json file in advance, pulling the data from PuppetDB (and falling back to Hiera only if you don't have PuppetDB), not only when you need it. The file will always be there, ready to be used.
It's going to be the same with Bolt, but if you don't have PuppetDB it's even worse, because you'll need to input all the data while you are in a meltdown situation, and at that point it's easier to create peers.json manually.

IMO the Bolt plan is ultimately an addition to the Puppet manifests.
And if you don't have PuppetDB, I would recommend filling in the data in advance.
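
A minimal Hiera sketch of such pre-filled data, assuming the class name nomad::server_recovery from the manifest above (the file path, IPs and node IDs are placeholders):

# e.g. data/role/nomad_server.yaml (hypothetical path)
nomad::server_recovery::iface: 'eth0'
nomad::server_recovery::nomad_server_hash:
  '192.168.1.10': 'a1b2c3d4-1234-5678-9012-3456789abcde'
  '192.168.1.11': 'b2c3d4e5-2345-6789-0123-456789abcdef'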

@maxadamo commented Apr 22, 2023

@attachmentgenie are you also good with the change, and is it clear how it works?
If all your servers are down, you just need to run /usr/local/bin/nomad-server-outage-recover.sh on your Nomad servers (not the agents).
I can merge it straight away, and I am asking because I already started working on the next PR to fix #84.
