
# stackhpc.openhpc

This Ansible role installs packages and performs configuration to provide a fully functional OpenHPC cluster. It can also be used to drain and resume nodes.

As a role it must be used from a playbook, for which a simple example is given below. This approach keeps it modular, with no assumptions about available networks or cluster features except for some hostname conventions. Any desired cluster filesystem or other required functionality may be freely integrated using additional Ansible roles or other approaches.

Role Variables
--------------

`openhpc_slurm_service_enabled`: boolean, whether to enable the appropriate slurm service (slurmd/slurmctld)

`openhpc_slurm_control_host`: Ansible inventory hostname of the controller, e.g. `"{{ groups['cluster_control'] | first }}"`

`openhpc_slurm_partitions`: list of one or more slurm partitions (an example sketch is given below). Each partition may contain the following values:
* `groups`: If there are multiple node groups that make up the partition, a list of group objects can be defined here.
  Otherwise, `groups` can be omitted and the following attributes can be defined in the partition object:
    * `name`: The name of the nodes within this group.
    * `cluster_name`: Optional. An override for the top-level definition `openhpc_cluster_name`.
    * `num_nodes`: The number of nodes in this group; nodes are assumed to be numbered `0` to `num_nodes-1`.
    * `ram_mb`: Optional. The physical RAM available in each server of this group ([slurm.conf](https://slurm.schedmd.com/slurm.conf.html) parameter `RealMemory`).

For each group (if used) or partition there must be an Ansible inventory group `cluster_name-group_name`. The compute nodes in this group must have hostnames of the form `cluster_name-group_name-{0..num_nodes-1}`.

* `default`: Optional. A boolean flag for whether this partition is the default. Valid settings are `YES` and `NO`.
* `maxtime`: Optional. A partition-specific time limit in hours, minutes and seconds ([slurm.conf](https://slurm.schedmd.com/slurm.conf.html) parameter `MaxTime`). The default value is
given by `openhpc_job_maxtime`.

`openhpc_job_maxtime`: A maximum time job limit in hours, minutes and seconds. The default is `24:00:00`.
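
As an example sketch only (the group names, sizes, RAM and time limit here are illustrative, not taken from this role's defaults), a partition made up of two node groups might be defined as:

    openhpc_slurm_partitions:
      - name: "compute"
        groups:
          - name: "small"
            num_nodes: 2
          - name: "large"
            num_nodes: 4
            ram_mb: 192000
        default: YES
        maxtime: "48:00:00"

With `openhpc_cluster_name: openhpc`, this would require inventory groups `openhpc-small` and `openhpc-large`, containing hosts `openhpc-small-[0-1]` and `openhpc-large-[0-3]` respectively.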

To deploy, create a playbook which looks like this:
    openhpc_slurm_control_host: "{{ groups['cluster_control'] | first }}"
    openhpc_slurm_partitions:
      - name: "compute"
        num_nodes: 8
    openhpc_cluster_name: openhpc
    openhpc_packages: []
    ...
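For reference, a complete minimal playbook using these variables might look like the following sketch (the `openhpc_cluster` hosts pattern and the `become` setting are assumptions for illustration, not taken from this README):

    ---
    # Minimal play applying the stackhpc.openhpc role; hosts pattern is hypothetical.
    - hosts: openhpc_cluster
      become: yes
      vars:
        openhpc_slurm_service_enabled: true
        openhpc_slurm_control_host: "{{ groups['cluster_control'] | first }}"
        openhpc_slurm_partitions:
          - name: "compute"
            num_nodes: 8
        openhpc_cluster_name: openhpc
        openhpc_packages: []
      roles:
        - stackhpc.openhpc
    ...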