# Working with clusters

This notebook shows how to work with clusters in CLAP. We will use the cluster configuration `npb-cluster` defined in the `examples/cli/1. Creating a cluster.ipynb` notebook (a prerequisite).

This notebook covers:
* How to start and set up a cluster
* How to grow a cluster (adding more nodes to it) and how to shrink it
* How to get cluster nodes

In [4]:
import sys
sys.path.append('../..')

In [5]:
import yaml
import time
import glob
from dataclasses import asdict
from app.cli.modules.node import get_config_db, get_node_manager
from app.cli.modules.role import get_role_manager
from app.cli.modules.cluster import get_cluster_config_db, get_cluster_manager
from clap.utils import float_time_to_string, path_extend
from clap.executor import SSHCommandExecutor, AnsiblePlaybookExecutor


In [13]:
configuration_db = get_config_db()
cluster_config_db = get_cluster_config_db()
node_manager = get_node_manager()
role_manager = get_role_manager()
cluster_manager = get_cluster_manager()
# The private path (usually ~/.clap/private/) is needed by other methods below
private_path = node_manager.private_path

Redefinition of setup setup-initial. Skipping
Redefinition of setup setup-packages. Skipping
Redefinition of setup setup-commands. Skipping
Redefinition of setup setup-env. Skipping
Redefinition of setup setup-git. Skipping
Redefinition of setup run-training. Skipping
Redefinition of cluster example-cluster. Skipping
Redefinition of setup setup-initial. Skipping
Redefinition of cluster example-cluster. Skipping
Redefinition of setup setup-initial. Skipping
Redefinition of setup setup-install-gcc. Skipping


`cluster_config_db` loads all cluster configurations from `~/.clap/configs/clusters/` and stores them in its `clusters` member. `clusters` is a dictionary whose keys are cluster configuration names and whose values are dataclasses of type `ClusterConfig`.

Let's list all cluster configurations and get one of them by name (`my-cluster` in the run recorded below).

In [14]:
print(list(cluster_config_db.clusters.keys()))

['my-cluster', 'example-cluster']


In [15]:
npb_cluster_config = cluster_config_db.clusters['my-cluster']
print(npb_cluster_config)

ClusterConfig(cluster_config_id='my-cluster', options=ClusterOptions(ssh_to='jobmanager'), before_all=[], before=[], after_all=[], after=[SetupConfig(roles=[], actions=[RoleActionType(role='gan', action='install-packages', extra={'packages': 'python3-pip, build-essential, cmake, openmpi-bin, openmpi-common, openmpi-doc, libopenmpi-dev'})]), SetupConfig(roles=[], actions=[RoleActionType(role='gan', action='run-command', extra={'cmd': 'sudo apt-get -y install python-is-python3'})]), SetupConfig(roles=[], actions=[RoleActionType(role='gan', action='run-command', extra={'cmd': 'pip install mxnet gluonnlp sacremoses'})]), SetupConfig(roles=[], actions=[RoleActionType(role='gan', action='run-command', extra={'cmd': 'pip install horovod --no-cache-dir'})]), SetupConfig(roles=[], actions=[RoleActionType(role='gan', action='run-command', extra={'cmd': 'git clone https://github.com/robertopossidente/optimizer-clap-app.git'})]), SetupConfig(roles=[], actions=[RoleActionType(role='gan', action='ru
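The lookup above raises a bare `KeyError` when the requested name is missing from `clusters`. Below is a minimal sketch of a friendlier lookup. The dataclasses are stand-ins rebuilt from the field names visible in the printed repr, not the real CLAP classes, and `get_cluster_config` plus the contents of the `example-cluster` entry are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


# Stand-ins that copy only the field names shown in the printed repr.
# They are NOT the real CLAP classes.
@dataclass
class RoleActionType:
    role: str
    action: str
    extra: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SetupConfig:
    roles: List[Any] = field(default_factory=list)
    actions: List[RoleActionType] = field(default_factory=list)


@dataclass
class ClusterOptions:
    ssh_to: Optional[str] = None


@dataclass
class ClusterConfig:
    cluster_config_id: str
    options: ClusterOptions = field(default_factory=ClusterOptions)
    before_all: List[SetupConfig] = field(default_factory=list)
    before: List[SetupConfig] = field(default_factory=list)
    after_all: List[SetupConfig] = field(default_factory=list)
    after: List[SetupConfig] = field(default_factory=list)


def get_cluster_config(clusters: Dict[str, ClusterConfig], name: str) -> ClusterConfig:
    """Hypothetical helper: return the config called `name`, or raise a
    KeyError that lists the names that do exist."""
    try:
        return clusters[name]
    except KeyError:
        available = ', '.join(sorted(clusters))
        raise KeyError(f"No cluster config named {name!r}; available: {available}") from None


# Same keys as the listing above; the field values of 'example-cluster' are made up.
clusters = {
    'my-cluster': ClusterConfig('my-cluster', ClusterOptions(ssh_to='jobmanager')),
    'example-cluster': ClusterConfig('example-cluster'),
}

config = get_cluster_config(clusters, 'my-cluster')
print(config.cluster_config_id, config.options.ssh_to)
```

With the real `cluster_config_db.clusters` dictionary, the same helper would report the available names if, for example, `npb-cluster` has not been defined yet.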

The configuration is a dataclass, so it can be fully converted to a dict with the `asdict` function.

In [16]:
npb_cluster_config_dict = asdict(npb_cluster_config)
print(yaml.dump(npb_cluster_config_dict, indent=4))

after:
-   actions:
    -   action: install-packages
        extra:
            packages: python3-pip, build-essential, cmake, openmpi-bin, openmpi-common,
                openmpi-doc, libopenmpi-dev
        role: gan
    roles: []
-   actions:
    -   action: run-command
        extra:
            cmd: sudo apt-get -y install python-is-python3
        role: gan
    roles: []
-   actions:
    -   action: run-command
        extra:
            cmd: pip install mxnet gluonnlp sacremoses
        role: gan
    roles: []
-   actions:
    -   action: run-command
        extra:
            cmd: pip install horovod --no-cache-dir
        role: gan
    roles: []
-   actions:
    -   action: run-command
        extra:
            cmd: git clone https://github.com/robertopossidente/optimizer-clap-app.git
        role: gan
    roles: []
-   actions:
    -   action: run-command
        extra:
            cmd: sudo touch /etc/ansible/facts.d/times.fact && sudo mkdir -p /etc/ansible/facts.d/
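To make the conversion concrete, here is a small sketch of how `asdict` treats nested dataclasses. The two classes are stand-ins that reuse field names from the output above (they are not the real CLAP types), and `json` from the standard library replaces `yaml` only to keep the example dependency-free.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class RoleActionType:  # stand-in, not the CLAP class
    role: str
    action: str
    extra: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SetupConfig:  # stand-in, not the CLAP class
    roles: List[str] = field(default_factory=list)
    actions: List[RoleActionType] = field(default_factory=list)


setup = SetupConfig(actions=[
    RoleActionType(role='gan', action='run-command',
                   extra={'cmd': 'pip install horovod --no-cache-dir'}),
])

# asdict recurses into nested dataclasses, lists and dicts, so the result
# holds only plain Python containers and can be serialized directly.
setup_dict = asdict(setup)
print(json.dumps(setup_dict, indent=4, sort_keys=True))

# The conversion is one-way: going back means rebuilding each level by hand.
rebuilt = SetupConfig(
    roles=list(setup_dict['roles']),
    actions=[RoleActionType(**action) for action in setup_dict['actions']],
)
```

Because `asdict` deep-copies the values, editing `setup_dict` does not change the original dataclass instance.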
        

We can start a cluster based on a cluster configuration using the `start_cluster` method of the `ClusterManager` class. The method returns a cluster id that is used by other methods.

In [17]:
cluster_id = cluster_manager.start_cluster(npb_cluster_config)
print(cluster_id)

the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Starting 1 type-a instances (timeout 600 seconds)] ***********************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Tagging instances] *******************************************************
changed: [localhost] => (item={'id': 'i-02b42dc03292107df', 'name': 'JohnOlney-ee47192d'})

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    

We can get full information about a cluster using the `get_cluster_by_id` method of the `ClusterManager` class. It returns a dataclass of type `ClusterDescriptor` that holds all the information about a cluster. To get all clusters in the repository, the `get_all_clusters` method returns a list of `ClusterDescriptor` objects.

Let's print the `ClusterDescriptor` of the recently created cluster, `cluster-da580f1038254cfa98b203ca109ecb53`, in YAML format.

In [9]:
cluster = cluster_manager.get_cluster_by_id(cluster_id)
cluster_dict = asdict(cluster)
print(yaml.dump(cluster_dict, indent=4))

cluster_config:
    after:
    -   actions:
        -   action: install-packages
            extra:
                packages: python3-pip, build-essential, cmake, openmpi-bin, openmpi-common,
                    openmpi-doc, libopenmpi-dev
            role: gan
        roles: []
    -   actions:
        -   action: run-command
            extra:
                cmd: sudo apt-get -y install python-is-python3
            role: gan
        roles: []
    -   actions:
        -   action: run-command
            extra:
                cmd: pip install mxnet gluonnlp sacremoses
            role: gan
        roles: []
    -   actions:
        -   action: run-command
            extra:
                cmd: pip install horovod --no-cache-dir
            role: gan
        roles: []
    -   actions:
        -   action: run-command
            extra:
                cmd: git clone --branch train https://github.com/robertopossidente/AMLC19-GluonNLP.git
            role: gan
        roles: []
    -  
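Since `get_all_clusters` returns a list of descriptors, one common need is to see which running clusters were started from which configuration. The sketch below groups descriptors by `cluster_config.cluster_config_id`, both of which appear in the outputs above. The classes are stand-ins, the `cluster_id` field name is an assumption, and the second cluster id is invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClusterConfig:  # stand-in with only the field used here
    cluster_config_id: str


@dataclass
class ClusterDescriptor:  # stand-in; `cluster_id` is an assumed field name
    cluster_id: str
    cluster_config: ClusterConfig


def group_by_config(descriptors: List[ClusterDescriptor]) -> Dict[str, List[str]]:
    """Map each configuration name to the ids of the clusters created from it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for descriptor in descriptors:
        groups[descriptor.cluster_config.cluster_config_id].append(descriptor.cluster_id)
    return dict(groups)


# In the notebook this list would come from cluster_manager.get_all_clusters().
descriptors = [
    ClusterDescriptor('cluster-da580f1038254cfa98b203ca109ecb53', ClusterConfig('my-cluster')),
    ClusterDescriptor('cluster-hypothetical-0002', ClusterConfig('my-cluster')),  # made-up id
]
print(group_by_config(descriptors))
```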

Given a cluster id, we can get all CLAP nodes that belong to the cluster using the `get_all_cluster_nodes` method of the `ClusterManager` class. It returns a list of node ids, which can be used with several CLAP modules, such as the `NodeManager` and `RoleManager` classes.

In [10]:
cluster_nodes = cluster_manager.get_all_cluster_nodes(cluster_id)
print(cluster_nodes)

['d4289a6df8f4462c9de952b2c95dc817', '0a6d71d1846f4085a3e7b433854d8385', '2cf810aaf68f4b1fa9d8ac2cb48dd002']


The `get_cluster_nodes_types` method of the `ClusterManager` class returns a dictionary whose keys are the cluster node types (e.g., `npb-type-b`) and whose values are lists of the ids of the nodes of that type.

In [11]:
cluster_nodes_with_type = cluster_manager.get_cluster_nodes_types(cluster_id)
print(cluster_nodes_with_type)

{'jobmanager': ['d4289a6df8f4462c9de952b2c95dc817'], 'taskmanager': ['0a6d71d1846f4085a3e7b433854d8385', '2cf810aaf68f4b1fa9d8ac2cb48dd002']}
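The type-to-ids dictionary is convenient for selecting all nodes of one type, but sometimes the reverse lookup (which type is this node?) is needed. A minimal sketch using the dictionary printed above as literal data; `invert_node_types` is a hypothetical helper, not part of CLAP.

```python
from typing import Dict, List


def invert_node_types(nodes_by_type: Dict[str, List[str]]) -> Dict[str, str]:
    """Turn {node_type: [node_id, ...]} into {node_id: node_type}."""
    node_to_type: Dict[str, str] = {}
    for node_type, node_ids in nodes_by_type.items():
        for node_id in node_ids:
            node_to_type[node_id] = node_type
    return node_to_type


# Literal copy of the output above; in the notebook this would be the
# value returned by cluster_manager.get_cluster_nodes_types(cluster_id).
cluster_nodes_with_type = {
    'jobmanager': ['d4289a6df8f4462c9de952b2c95dc817'],
    'taskmanager': ['0a6d71d1846f4085a3e7b433854d8385',
                    '2cf810aaf68f4b1fa9d8ac2cb48dd002'],
}

node_to_type = invert_node_types(cluster_nodes_with_type)

# Select only the nodes of one type; .get avoids a KeyError for unknown types.
taskmanagers = cluster_nodes_with_type.get('taskmanager', [])
print(node_to_type['d4289a6df8f4462c9de952b2c95dc817'], len(taskmanagers))
```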


In [18]:
cluster_nodes_with_type = cluster_manager.get_cluster_nodes_types(cluster_id)
print(yaml.dump(cluster_nodes_with_type))

npb-type-b:
- 0e9db9afd8d649638349dec77d9eb066
- 43d6d3880d034c9a8aa7c4929bd8b3fc
- a8c82747b7184f81bf72a676fa9baa56
- b616ea6a27ed450eb4996e1fd3b0f710



In [None]:
command_to_execute = """
mpirun -np 1 -H localhost:1 -bind-to none -map-by slot python /home/ubuntu/optimizer-clap-app/machine-translation/my-train.py > log.txt 2>&1
echo Launch Machine Translation by ssh
"""
executor = SSHCommandExecutor(command_to_execute, cluster_nodes, private_path)
result = executor.run()

for node_id, res in result.items():
    print(f"Node id {node_id}, executed the command: {res.ok}, ret code: {res.ret_code}")
    # The result is a dataclass, so it can be converted to a dictionary
    res_dict = asdict(res)
    print('-----')
    # Dump dictionary in YAML format
    print(yaml.dump(res_dict, indent=4, sort_keys=True))
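When a command runs on many nodes, it helps to separate the nodes where it succeeded from those where it did not before reading the full dumps. The sketch below assumes only the two attributes used in the loop above, `ok` and `ret_code`. The `CommandResult` class is a stand-in for whatever the executor returns, and the sample results are invented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class CommandResult:  # stand-in: only `ok` and `ret_code` come from the notebook
    ok: bool
    ret_code: Optional[int]


def split_results(result: Dict[str, CommandResult]) -> Tuple[List[str], List[str]]:
    """Return (succeeded, failed) node ids.

    A node counts as succeeded only if the command was executed (`ok`)
    and exited with status 0.
    """
    succeeded: List[str] = []
    failed: List[str] = []
    for node_id, res in result.items():
        if res.ok and res.ret_code == 0:
            succeeded.append(node_id)
        else:
            failed.append(node_id)
    return succeeded, failed


# Invented results keyed by the node ids listed earlier in the notebook.
result = {
    'd4289a6df8f4462c9de952b2c95dc817': CommandResult(ok=True, ret_code=0),
    '0a6d71d1846f4085a3e7b433854d8385': CommandResult(ok=True, ret_code=1),
    '2cf810aaf68f4b1fa9d8ac2cb48dd002': CommandResult(ok=False, ret_code=None),
}

succeeded, failed = split_results(result)
print(f"{len(succeeded)} succeeded, {len(failed)} failed: {failed}")
```

The failed list can then be passed to a new executor to retry the command on those nodes only.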

In [24]:
stopped_nodes = node_manager.stop_nodes(cluster_nodes[0:3])
print(stopped_nodes)

the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Stopping nodes CarolArchey, JaniceSilkenson, NancyHackwell] **************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

['a8c82747b7184f81bf72a676fa9baa56', '0e9db9afd8d649638349dec77d9eb066', 'b616ea6a27ed450eb4996e1fd3b0f710']
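The printed list suggests that `stop_nodes` returns the ids of the nodes it stopped, in an order that may differ from the request. Assuming that reading is right, a quick set comparison shows whether any requested node is still running. `not_stopped` is a hypothetical helper, and the ids are copied from the outputs above.

```python
from typing import Iterable, List


def not_stopped(requested: Iterable[str], stopped: Iterable[str]) -> List[str]:
    """Return the requested node ids that are missing from `stopped`,
    keeping the order of the request."""
    stopped_set = set(stopped)
    return [node_id for node_id in requested if node_id not in stopped_set]


# The four npb-type-b node ids listed earlier in the notebook.
requested = [
    '0e9db9afd8d649638349dec77d9eb066',
    '43d6d3880d034c9a8aa7c4929bd8b3fc',
    'a8c82747b7184f81bf72a676fa9baa56',
    'b616ea6a27ed450eb4996e1fd3b0f710',
]
# Literal copy of the list printed above.
stopped_nodes = [
    'a8c82747b7184f81bf72a676fa9baa56',
    '0e9db9afd8d649638349dec77d9eb066',
    'b616ea6a27ed450eb4996e1fd3b0f710',
]

remaining = not_stopped(requested, stopped_nodes)
print(remaining)
```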


And running the application again...

Fetching some results...

And parsing them...

## Stopping cluster

Finally, we can stop the cluster (and all of its nodes) using the `stop_cluster` method. This also removes the cluster from the cluster repository.

Other similar methods are:
* `resume_cluster`: resumes all paused nodes of a cluster
* `pause_cluster`: pauses all nodes of a cluster
* `is_alive`: checks whether all cluster nodes are alive

In [38]:
cluster_manager.stop_cluster(cluster_id)

the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Stopping nodes BritneyGalvan] ********************************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   



['43d6d3880d034c9a8aa7c4929bd8b3fc']
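Cloud nodes that stay up after a failed experiment cost money, so it can be worth pairing `start_cluster` and `stop_cluster` in a context manager that stops the cluster even when the work in between raises. The sketch below relies only on those two method names from this notebook. `running_cluster` is a hypothetical helper, and `FakeClusterManager` is an in-memory stand-in so the example runs without CLAP or a cloud account.

```python
from contextlib import contextmanager
from typing import Any, Iterator, List


@contextmanager
def running_cluster(cluster_manager: Any, cluster_config: Any) -> Iterator[str]:
    """Start a cluster, yield its id, and always stop it on exit."""
    cluster_id = cluster_manager.start_cluster(cluster_config)
    try:
        yield cluster_id
    finally:
        # Runs on normal exit and when the body raises.
        cluster_manager.stop_cluster(cluster_id)


class FakeClusterManager:
    """Hypothetical stand-in that only tracks which clusters are running."""

    def __init__(self) -> None:
        self.running: List[str] = []
        self.started = 0

    def start_cluster(self, cluster_config: Any) -> str:
        self.started += 1
        cluster_id = f"cluster-{self.started}"
        self.running.append(cluster_id)
        return cluster_id

    def stop_cluster(self, cluster_id: str) -> None:
        self.running.remove(cluster_id)


manager = FakeClusterManager()
try:
    with running_cluster(manager, 'my-cluster') as cid:
        raise RuntimeError('simulated job failure')
except RuntimeError:
    pass

print(manager.running)  # the cluster was stopped despite the failure
```

With the real `cluster_manager` and a `ClusterConfig`, the body of the `with` block would hold the command execution and result fetching steps.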

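In [18]:
clusters = cluster_manager.get_all_clusters()
for cluster in clusters:
    # get_all_clusters returns ClusterDescriptor objects, while stop_cluster expects a cluster id
    cluster_manager.stop_cluster(cluster.cluster_id)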
