29 commits
6218361
Initial prototype for slurm stats
jovial Nov 12, 2020
e901e7b
Merge branch 'main' into feature/slurm-stats
jovial Nov 13, 2020
b329201
Clean up unnecessary changes
jovial Nov 13, 2020
c2566f9
Revert cluster name
jovial Nov 13, 2020
c2c984c
Spit slurm deployment from elastic configuration
jovial Nov 13, 2020
cdabccd
Use hostnames and not ip addresses for prometheus targets
jovial Nov 19, 2020
c9651f9
Add pre-compute rules that are required for openhpc dashboard
jovial Nov 19, 2020
1ac74bc
Add metrics needed for openhpc dashboard
jovial Nov 19, 2020
bef9465
Revert "Add metrics needed for openhpc dashboard"
jovial Nov 19, 2020
c8b54b8
Whitespace fix
jovial Nov 23, 2020
23f5447
Workaround for missing options in grafana role
jovial Nov 23, 2020
07646c6
Align with the naming scheme from README
jovial Nov 24, 2020
1409acf
Move slurm-stats to collection
jovial Nov 24, 2020
46f93d4
Make inventory a directory rather than a file
jovial Nov 24, 2020
523af19
Add playbook to generate passwords
jovial Nov 24, 2020
06b4f9e
WIP: Add support for using generated passwords
jovial Dec 8, 2020
d421a8c
Prefix passwords with secrets_
jovial Dec 9, 2020
3e7eaf0
Clean up
jovial Dec 10, 2020
413afed
Add password generation details to README
jovial Dec 10, 2020
d0034f0
Revert changes to monitoring.yml
jovial Dec 10, 2020
13db757
Add security group rules for compute
jovial Dec 10, 2020
7f0aa41
Merge remote-tracking branch 'origin/main' into feature/slurm-stats
jovial Dec 10, 2020
d5652cd
Merge branch 'feature/slurm-stats' of github.com:stackhpc/openhpc-dem…
jovial Dec 10, 2020
3a14706
Up retries
jovial Dec 10, 2020
7607eee
Fix replacements
jovial Dec 10, 2020
e5df74e
Collect extra metrics
jovial Dec 10, 2020
57b83b0
Use default pod
jovial Dec 10, 2020
00a003e
Correction
jovial Dec 10, 2020
9537e09
Merge remote-tracking branch 'ssh/feature/slurm-stats' into HEAD
jovial Dec 10, 2020
2 changes: 1 addition & 1 deletion .gitignore
@@ -4,4 +4,4 @@ terraform.tfstate*
inventory
config-drive.iso
venv
collections
28 changes: 28 additions & 0 deletions README.md
@@ -20,6 +20,34 @@ NB: Working DNS is a requirement.
yum install terraform
terraform init

# Passwords

Prior to running any other playbooks, you need to define a set of passwords. You can
use the `generate-passwords.yml` playbook to automate this process:

```
ansible-playbook generate-passwords.yml
```

This will write a set of passwords to `<repository root>/inventory/group_vars/all/passwords.yml`.
Placing them in the inventory means that they will be defined for all playbooks.

It is recommended to encrypt the contents of this file before committing it to git:

```
ansible-vault encrypt inventory/group_vars/all/passwords.yml
```

You will then need to provide the vault password when running the playbooks, e.g.:

```
ansible-playbook monitoring-db.yml --tags grafana --ask-vault-password
```

See the [Ansible vault documentation](https://docs.ansible.com/ansible/latest/user_guide/vault.html) for more details.

# Usage

NB: For development of roles/collections you may want to use this alternative to `ansible-galaxy ...`:

mkdir roles
3 changes: 2 additions & 1 deletion ansible.cfg
@@ -5,7 +5,8 @@ stderr_callback = debug
gathering = smart
forks = 30
host_key_checking = False
inventory = inventory

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=240s -o PreferredAuthentications=publickey -o UserKnownHostsFile=/dev/null
pipelining = True
23 changes: 23 additions & 0 deletions config/elastic/internal_users.yml
@@ -0,0 +1,23 @@
---
# See: https://aws.amazon.com/blogs/opensource/change-passwords-open-distro-for-elasticsearch/

# This is the internal user database
# The hash value is a bcrypt hash and can be generated with plugin/tools/hash.sh

_meta:
  type: "internalusers"
  config_version: 2

# Define your internal users here

admin:
  hash: "{{ secrets_openhpc_elasticsearch_admin_password | password_hash('bcrypt') }}"
  reserved: true
  backend_roles:
    - "admin"
  description: "Admin user"

kibanaserver:
  hash: "{{ secrets_openhpc_elasticsearch_kibana_password | password_hash('bcrypt') }}"
  reserved: true
  description: "Used by kibana to connect to elastic"
56 changes: 56 additions & 0 deletions config/filebeat/filebeat.yml
@@ -0,0 +1,56 @@
filebeat.config:
  modules:
    path: ${path.config}/modules.d/*.yml
    reload.enabled: false

setup.ilm:
  # Failed to connect to backoff(elasticsearch(https://localhost:9200)): Connection marked as failed because
  # the onConnect callback failed: request checking for ILM availability failed: 500 Internal Server Error:
  # {"error":{"root_cause":[{"type":"security_exception","reason":"Unexpected exception indices:admin/get"}],
  # "type":"security_exception","reason":"Unexpected exception indices:admin/get"},"status":500}
  enabled: false
  rollover_alias: "filebeat"
  pattern: "{now/d}-000001"

filebeat.inputs:
  - type: log
    json.add_error_key: true
    paths:
      - '/logs/slurm-stats/*.json'
    fields:
      event.kind: event
    fields_under_root: true

processors:
  - timestamp:
      field: json.End
      layouts:
        - '2006-01-02T15:04:05'
      test:
        - '2020-06-17T10:17:48'
  - timestamp:
      target_field: 'event.end'
      field: json.End
      layouts:
        - '2006-01-02T15:04:05'
      test:
        - '2020-06-17T10:17:48'
  - timestamp:
      target_field: 'event.start'
      field: json.Start
      layouts:
        - '2006-01-02T15:04:05'
      test:
        - '2020-06-17T10:17:48'
  - convert:
      fields:
        - {from: "json.NNodes", type: "integer"}
        - {from: "json.NCPUS", type: "integer"}
        - {from: "json.ElapsedRaw", type: "integer"}

output.elasticsearch:
  hosts: ["{{ groups.elastic.0 }}:9200"]
  protocol: "https"
  ssl.verification_mode: none
  username: "admin"
  password: "{{ secrets_openhpc_elasticsearch_admin_password }}"
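To illustrate what these processors do to each log line, the sketch below mimics the `timestamp` and `convert` steps in Python on a hypothetical slurm-stats record (the field values are invented; the field names match those referenced above):

```python
import json
from datetime import datetime

# Python equivalent of the Go reference layout 2006-01-02T15:04:05
LAYOUT = "%Y-%m-%dT%H:%M:%S"

def process(line):
    """Mimic the filebeat timestamp and convert processors on one JSON log line."""
    doc = json.loads(line)
    # timestamp processors: parse Start/End into event.start/event.end
    event = {
        "start": datetime.strptime(doc["Start"], LAYOUT),
        "end": datetime.strptime(doc["End"], LAYOUT),
    }
    # convert processor: these string fields become integers
    for field in ("NNodes", "NCPUS", "ElapsedRaw"):
        doc[field] = int(doc[field])
    return doc, event

# A hypothetical record as slurm-stats might emit it
raw = ('{"Start": "2020-06-17T10:00:00", "End": "2020-06-17T10:17:48", '
       '"NNodes": "2", "NCPUS": "64", "ElapsedRaw": "1068"}')
doc, event = process(raw)
```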
20 changes: 20 additions & 0 deletions config/prometheus/rules/precompute.rules
@@ -0,0 +1,20 @@
# Required for openhpc dashboard

groups:
  - name: openhpc
    interval: 60s
    rules:
      - record: node_cpu_system_seconds:record
        expr: (100 * sum by(instance)(increase(node_cpu_seconds_total{mode="system",job="node"}[60s]))) / (sum by(instance)(increase(node_cpu_seconds_total{job="node"}[60s])))
      - record: node_cpu_user_seconds:record
        expr: (100 * sum by(instance)(increase(node_cpu_seconds_total{mode="user",job="node"}[60s]))) / (sum by(instance)(increase(node_cpu_seconds_total{job="node"}[60s])))
      - record: node_cpu_iowait_seconds:record
        expr: (100 * sum by(instance)(increase(node_cpu_seconds_total{mode="iowait",job="node"}[60s]))) / (sum by(instance)(increase(node_cpu_seconds_total{job="node"}[60s])))
      - record: node_cpu_other_seconds:record
        expr: (100 * sum by(instance)(increase(node_cpu_seconds_total{mode!="idle",mode!="user",mode!="system",mode!="iowait",job="node"}[60s]))) / (sum by(instance)(increase(node_cpu_seconds_total{job="node"}[60s])))
      - record: node_cpu_scaling_frequency_hertz_avg:record
        expr: avg by (instance) (node_cpu_scaling_frequency_hertz)
      - record: node_cpu_scaling_frequency_hertz_min:record
        expr: min by (instance) (node_cpu_scaling_frequency_hertz)
      - record: node_cpu_scaling_frequency_hertz_max:record
        expr: max by (instance) (node_cpu_scaling_frequency_hertz)
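Each of the per-mode rules computes the percentage of CPU time spent in one mode over the 60s window: the counter increase for that mode divided by the increase across all modes. The arithmetic can be sketched in plain Python (the increase values below are invented for illustration):

```python
# Hypothetical per-mode counter increases over a 60s window for one instance,
# i.e. what increase(node_cpu_seconds_total{mode=...}[60s]) would return.
increases = {"system": 6.0, "user": 30.0, "iowait": 1.5, "idle": 22.5}

# Denominator of every rule: total CPU-seconds across all modes
total = sum(increases.values())

# node_cpu_system_seconds:record for this instance: share of time in system mode
system_pct = 100 * increases["system"] / total
```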
25 changes: 25 additions & 0 deletions filter_plugins/utils.py
@@ -0,0 +1,25 @@
#!/usr/bin/python

# Copyright: (c) 2020, StackHPC
# Apache 2 License

import os.path


def readfile(fpath):
    """Return the contents of fpath, or an empty string if it does not exist."""
    if not os.path.isfile(fpath):
        return ""
    with open(fpath) as f:
        return f.read()


class FilterModule(object):
    '''Custom jinja2 filters'''

    def filters(self):
        return {
            'readfile': readfile
        }
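The filter's contract is easy to exercise outside Ansible. A standalone sketch, using a temporary directory so it is self-contained:

```python
import os.path
import tempfile

def readfile(fpath):
    """Same behaviour as the plugin's filter: file contents, or "" if missing."""
    if not os.path.isfile(fpath):
        return ""
    with open(fpath) as f:
        return f.read()

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "motd")
    with open(path, "w") as f:
        f.write("hello")
    present = readfile(path)            # contents of an existing file
    missing = readfile(path + ".gone")  # empty string, no exception
```

In a template this would be used as e.g. `{{ some_path | readfile }}`, with the empty-string fallback meaning a missing file does not abort the play.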
9 changes: 9 additions & 0 deletions generate-passwords.yml
@@ -0,0 +1,9 @@
---

- name: Generate passwords.yml
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Include password generation role
      include_role:
        name: passwords
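The `passwords` role itself is not part of this diff, so its generation logic is not shown. As a rough sketch of the idea only (the variable names are taken from the configs above; the alphabet and length are assumptions, not the role's actual behaviour), random secrets can be produced with Python's `secrets` module:

```python
import secrets
import string

# Assumed alphabet/length for illustration; the real role may differ.
ALPHABET = string.ascii_letters + string.digits

def random_password(length=16):
    """Generate a cryptographically random alphanumeric password."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

# Keys follow the secrets_ prefix convention used in this PR
passwords = {
    "secrets_openhpc_elasticsearch_admin_password": random_password(),
    "secrets_openhpc_elasticsearch_kibana_password": random_password(),
}
```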
15 changes: 15 additions & 0 deletions inventory.tpl
@@ -22,3 +22,18 @@ ${cluster_name}_login

[cluster_compute:children]
${cluster_name}_compute

[elastic:children]
${cluster_name}_login

[kibana:children]
${cluster_name}_login

[slurm_stats:children]
${cluster_name}_login

[podman:children]
elastic
kibana
slurm_stats

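Terraform's `templatefile()` renders this template for real; purely as an illustration of the resulting inventory groups, `string.Template` can substitute the same `${cluster_name}` placeholder (cluster name `demo` is invented here):

```python
from string import Template

# A fragment of inventory.tpl; the new groups all point at the login node.
tpl = Template("""[elastic:children]
${cluster_name}_login

[kibana:children]
${cluster_name}_login

[slurm_stats:children]
${cluster_name}_login
""")

rendered = tpl.substitute(cluster_name="demo")
```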
Empty file added inventory/.gitkeep
Empty file.
51 changes: 50 additions & 1 deletion main.tf
@@ -30,6 +30,53 @@ variable "compute_image" {
default = "CentOS-8-GenericCloud-8.2.2004-20200611.2.x86_64"
}

resource "openstack_networking_secgroup_v2" "secgroup_slurm_login" {
  name        = "secgroup_slurm_login"
  description = "Rules for the slurm login node"
  # Fully manage with terraform
  delete_default_rules = true
}

resource "openstack_networking_secgroup_v2" "secgroup_slurm_compute" {
  name        = "secgroup_slurm_compute"
  description = "Rules for the slurm compute node"
  # Fully manage with terraform
  delete_default_rules = true
}

resource "openstack_networking_secgroup_rule_v2" "secgroup_slurm_login_rule_egress_v4" {
  direction         = "egress"
  ethertype         = "IPv4"
  security_group_id = openstack_networking_secgroup_v2.secgroup_slurm_login.id
}

resource "openstack_networking_secgroup_rule_v2" "secgroup_slurm_login_rule_ingress_tcp_v4" {
  direction = "ingress"
  ethertype = "IPv4"
  # NOTE: You will want to lock down the ports in a production environment. This will require
  # setting of static ports for the NFS server, see:
  # https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/storage_administration_guide/s2-nfs-nfs-firewall-config
  port_range_min    = 1
  port_range_max    = 65535
  protocol          = "tcp"
  security_group_id = openstack_networking_secgroup_v2.secgroup_slurm_login.id
}

resource "openstack_networking_secgroup_rule_v2" "secgroup_slurm_compute_rule_egress_v4" {
  direction         = "egress"
  ethertype         = "IPv4"
  security_group_id = openstack_networking_secgroup_v2.secgroup_slurm_compute.id
}

resource "openstack_networking_secgroup_rule_v2" "secgroup_slurm_compute_rule_ingress_tcp_v4" {
  direction         = "ingress"
  ethertype         = "IPv4"
  port_range_min    = 1
  port_range_max    = 65535
  protocol          = "tcp"
  security_group_id = openstack_networking_secgroup_v2.secgroup_slurm_compute.id
}

resource "openstack_compute_instance_v2" "login" {

name = "${var.cluster_name}-login-0"
@@ -39,6 +86,7 @@ resource "openstack_compute_instance_v2" "login" {
  network {
    name = "ilab"
  }
  security_groups = [openstack_networking_secgroup_v2.secgroup_slurm_login.id]
}


@@ -54,6 +102,7 @@ resource "openstack_compute_instance_v2" "compute" {
  network {
    name = "ilab"
  }
  security_groups = [openstack_networking_secgroup_v2.secgroup_slurm_compute.id]
}

# TODO: needs fixing for case where creation partially fails resulting in "compute.network is empty list of object"
@@ -65,5 +114,5 @@ resource "local_file" "hosts" {
      "computes": openstack_compute_instance_v2.compute,
    },
  )
filename = "${path.module}/inventory"
filename = "${path.module}/inventory/hosts"
}