# Node Management with CLAP

This notebook introduces CLAP's features for creating, managing, and destroying nodes. It walks through the NodeManager class, used to manage computing nodes, and the ConfigurationDatabase class, used to read CLAP's instance configurations from CLAP's configuration files (by default at `~/.clap/configs`).

Make sure you are executing this notebook inside CLAP's environment (clap-env).

As this notebook lives in CLAP's example/api directory, let's add `../..` to Python's module search path.

In [1]:
import sys
sys.path.append('../..')
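A relative entry such as `'../..'` is resolved against the current working directory, so it only works when the notebook is started from its own directory. The following is a minimal sketch of a slightly more defensive variant; `repo_root` is a name introduced here for illustration, and the sketch still assumes the working directory is the notebook's directory.

```python
import sys
from pathlib import Path

# Resolve '../..' to an absolute path once, relative to the current
# working directory (assumed here to be the notebook's directory).
repo_root = (Path.cwd() / ".." / "..").resolve()

# Avoid adding the same entry again when the cell is re-executed.
if str(repo_root) not in sys.path:
    sys.path.append(str(repo_root))

print(f"Module search path now includes: {repo_root}")
```

Using an absolute path keeps the import working even if a later cell changes the working directory.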

In [2]:
# Let's print all packages installed
!pip list

Package                       Version
----------------------------- -----------
alabaster                     0.7.12
ansible                       3.4.0
ansible-base                  2.10.9
ansible-runner                1.4.7
anyio                         3.1.0
argcomplete                   1.12.3
argon2-cffi                   20.1.0
astroid                       2.5.6
async-generator               1.10
attrs                         21.2.0
Babel                         2.9.1
backcall                      0.2.0
bcrypt                        3.2.0
bleach                        3.3.0
boto                          2.49.0
boto3                         1.17.73
botocore                      1.20.73
certifi                       2020.12.5
cffi                          1.14.5
chardet                       4.0.0
click                         8.0.0
coloredlogs                   15.0
contextlib2                   0.6.0.post1
cryptography                  3.4.7
dacite                        1.6.0
d

Let's perform some imports. To simplify the creation of the NodeManager and ConfigurationDatabase objects, let's use the defaults defined in app.cli.modules.node, which search for configurations at `~/.clap/configs` and use `~/.clap/storage/nodes.db` as the default node repository.

In [3]:
import yaml
from dataclasses import asdict
from app.cli.modules.node import get_config_db, get_node_manager
from clap.utils import float_time_to_string

In [4]:
# Creating configuration database and node manager objects
configuration_db = get_config_db()
node_manager = get_node_manager()

configuration_db loads all instance configurations from `~/.clap/configs/instances.yaml` and stores them in its instance_descriptors member. instance_descriptors is a dictionary whose keys are the names of the instance configurations in the instances file and whose values are dataclasses of type InstanceInfo.

Let's check the contents of `~/.clap/configs/instances.yaml`

In [5]:
!cat ~/.clap/configs/instances.yaml

type-a:
    provider: aws-config-us-east-1
    login: login-ubuntu
    flavor: t2.micro
    image_id: ami-07d0cf3af28718ef8
    security_group: otavio-sg

type-b:
    provider: aws-config-us-east-1
    login: login-ubuntu
    flavor: t2.medium
    image_id: ami-07d0cf3af28718ef8
    boot_disk_size: 16
    security_group: otavio-sg


In [6]:
all_instances_ids = list(configuration_db.instance_descriptors.keys())
print(f"All instance ids presented in my system: {', '.join(all_instances_ids)}")

All instance ids presented in my system: type-a, type-b


In [7]:
# Let's pick the type-a instance info and inspect it
type_a_instance_info = configuration_db.instance_descriptors['type-a']
print(f"Instance config: {type_a_instance_info}")
# InstanceInfo objects are dataclasses, so members can be accessed with Python's attribute syntax (via '.'). For instance:
flavor = type_a_instance_info.instance.flavor
print(f"Instance flavor: {flavor}")

Instance config: InstanceInfo(provider=ProviderConfigAWS(provider_config_id='aws-config-us-east-1', region='us-east-1', access_keyfile='ec2_access_key.pub', secret_access_keyfile='ec2_access_key.pem', vpc=None, url='https://ec2.us-east-1.amazonaws.com', provider='aws'), login=LoginConfig(login_config_id='login-ubuntu', user='ubuntu', keypair_name='otavio_key_us_east_1', keypair_public_file='otavio_key_us_east_1.pub', keypair_private_file='otavio_key_us_east_1.pem', ssh_port=22, sudo=True, sudo_user='root'), instance=InstanceConfigAWS(instance_config_id='type-a', provider='aws-config-us-east-1', login='login-ubuntu', flavor='t2.micro', image_id='ami-07d0cf3af28718ef8', security_group='otavio-sg', boot_disk_size=None, boot_disk_device=None, boot_disk_type=None, boot_disk_iops=None, boot_disk_snapshot=None, placement_group=None, price=None, timeout=None, network_ids=[]))
Instance flavor: t2.micro


In [8]:
# Dataclasses can easily be converted to a dict using the asdict function
type_a_instance_info_dict = asdict(type_a_instance_info)
# Let's print the dict in YAML syntax
print(yaml.dump(type_a_instance_info_dict, indent=4))

instance:
    boot_disk_device: null
    boot_disk_iops: null
    boot_disk_size: null
    boot_disk_snapshot: null
    boot_disk_type: null
    flavor: t2.micro
    image_id: ami-07d0cf3af28718ef8
    instance_config_id: type-a
    login: login-ubuntu
    network_ids: []
    placement_group: null
    price: null
    provider: aws-config-us-east-1
    security_group: otavio-sg
    timeout: null
login:
    keypair_name: otavio_key_us_east_1
    keypair_private_file: otavio_key_us_east_1.pem
    keypair_public_file: otavio_key_us_east_1.pub
    login_config_id: login-ubuntu
    ssh_port: 22
    sudo: true
    sudo_user: root
    user: ubuntu
provider:
    access_keyfile: ec2_access_key.pub
    provider: aws
    provider_config_id: aws-config-us-east-1
    region: us-east-1
    secret_access_keyfile: ec2_access_key.pem
    url: https://ec2.us-east-1.amazonaws.com
    vpc: null
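The nested layout above follows from how `asdict` works: it recurses into dataclass fields, so every nested dataclass becomes a nested dict. The sketch below illustrates this with simplified stand-ins. `LoginSketch`, `InstanceSketch`, and `InfoSketch` are hypothetical names that carry only a few of the fields shown above; they are not CLAP's actual classes, and `json` replaces `yaml` only to keep the example free of third-party dependencies.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


# Hypothetical, trimmed-down stand-ins for CLAP's configuration dataclasses.
@dataclass
class LoginSketch:
    login_config_id: str
    user: str
    ssh_port: int = 22


@dataclass
class InstanceSketch:
    instance_config_id: str
    flavor: str
    boot_disk_size: Optional[int] = None


@dataclass
class InfoSketch:
    login: LoginSketch
    instance: InstanceSketch


info = InfoSketch(
    login=LoginSketch(login_config_id="login-ubuntu", user="ubuntu"),
    instance=InstanceSketch(instance_config_id="type-a", flavor="t2.micro"),
)

# Attribute access works level by level on the dataclasses themselves.
flavor = info.instance.flavor

# asdict() converts the whole tree, nested dataclasses included, to plain dicts.
info_dict = asdict(info)
print(json.dumps(info_dict, indent=4, sort_keys=True))
```

Fields left at their default of `None` show up as `null` in the serialized form, which matches the many `null` entries in the YAML output above.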



Nodes can be started using the start_node function. This function starts N CLAP nodes from a given InstanceInfo.
Use start_nodes to start different numbers of nodes with different InstanceInfos.

Let's start 2 nodes of type-a. The function returns the node IDs of the nodes that were successfully started.

In [9]:
started_node_ids = node_manager.start_node(type_a_instance_info, count=2)

the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Starting 2 type-a instances (timeout 600 seconds)] ***********************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Tagging instances] *******************************************************
changed: [localhost] => (item={'id': 'i-0102a005efac4d1a8', 'name': 'SaraAmick-8708490f'})
changed: [localhost] => (item={'id': 'i-081327a6346e0a5a9', 'name': 'ScottCable-b3953956'})
localhost                  : ok=1    changed=1    unreacha

In [10]:
print(f"{len(started_node_ids)} nodes started: {started_node_ids}")

2 nodes started: ['8708490fd29941898e27cd330af94c42', 'b3953956fc7048e886dc466fb9e746e5']


Functions in CLAP's node manager usually operate on node IDs. The get_\*_nodes functions (e.g. get_all_nodes, get_nodes_by_id) return a list of NodeDescriptor objects. NodeDescriptor is a dataclass that describes the full node information. As NodeDescriptor is a dataclass, it can easily be transformed into a dict using the asdict function.

Let's pick all nodes in CLAP and print them in YAML format.

In [11]:
for node in node_manager.get_all_nodes():
    # Can be accessed with '.' operator
    node_id = node.node_id
    print('---------')
    print(f"Node Id: {node_id}, created at {float_time_to_string(node.creation_time)}; Status: {node.status}")
    print('---------')
    # Or can be converted to a dict
    node_dict = asdict(node)
    # Printing dict in YAML format
    print(yaml.dump(node_dict, indent=4))
    print('**********')

---------
Node Id: 8708490fd29941898e27cd330af94c42, created at 29-05-21 15:34:28; Status: started
---------
cloud_instance_id: i-0102a005efac4d1a8
cloud_lifecycle: normal
configuration:
    instance:
        boot_disk_device: null
        boot_disk_iops: null
        boot_disk_size: null
        boot_disk_snapshot: null
        boot_disk_type: null
        flavor: t2.micro
        image_id: ami-07d0cf3af28718ef8
        instance_config_id: type-a
        login: login-ubuntu
        network_ids: []
        placement_group: null
        price: null
        provider: aws-config-us-east-1
        security_group: otavio-sg
        timeout: null
    login:
        keypair_name: otavio_key_us_east_1
        keypair_private_file: otavio_key_us_east_1.pem
        keypair_public_file: otavio_key_us_east_1.pub
        login_config_id: login-ubuntu
        ssh_port: 22
        sudo: true
        sudo_user: root
        user: ubuntu
    provider:
        access_keyfile: ec2_access_key.pub
        

Nodes with status == 'started' are nodes that were started but for which no SSH login has been performed yet. Once a login succeeds, the node's status changes to 'reachable'. If the SSH login fails, the status becomes 'unreachable'.

The is_alive function checks whether nodes are alive and updates their stored information, such as IP address and status. It takes node ids as input and returns a dict whose keys are node ids and whose values are booleans indicating whether the node is alive (an SSH login succeeded) or not.

Note:
* This function may print "Unable to connect to port 22 on XXX.XXX.XXX.XXX" when a login attempt fails. In that case, it waits 'wait_timeout' seconds and tries again, up to 'retries' times.

In [12]:
alive_nodes = node_manager.is_alive(started_node_ids)
for node_id, alive_flag in alive_nodes.items():
    alive_str = 'alive' if alive_flag else 'not alive'
    print(f"{node_id} --> {alive_str}.")

Error executing command in node b3953956: [Errno None] Unable to connect to port 22 on 52.90.37.115
Error executing command in b3953956: [Errno None] Unable to connect to port 22 on 52.90.37.115.


8708490fd29941898e27cd330af94c42 --> alive.
b3953956fc7048e886dc466fb9e746e5 --> alive.
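The output above shows connection errors followed by both nodes being reported as alive, which is consistent with the retry behaviour mentioned in the note. That behaviour can be pictured as a loop that attempts a login, waits, and tries again. The following is a minimal sketch and not CLAP's implementation: `check_with_retries` and `try_login` are hypothetical, only the parameter names `retries` and `wait_timeout` come from the note, and treating `retries` as the total number of attempts is an assumption made here.

```python
import time
from typing import Callable


def check_with_retries(try_login: Callable[[], bool],
                       retries: int = 3,
                       wait_timeout: float = 0.0) -> bool:
    """Return True as soon as one login attempt succeeds, else False.

    Assumption: 'retries' is the total number of attempts.
    """
    for attempt in range(1, retries + 1):
        if try_login():
            return True
        print(f"Attempt {attempt} failed")
        # Only wait when another attempt will follow.
        if attempt < retries:
            time.sleep(wait_timeout)
    return False


# Simulated node whose SSH service only answers on the third attempt.
outcomes = iter([False, False, True])
alive = check_with_retries(lambda: next(outcomes), retries=5, wait_timeout=0.0)
print("alive" if alive else "not alive")

# A node that never answers is reported as not alive once attempts run out.
never_alive = check_with_retries(lambda: False, retries=2, wait_timeout=0.0)
```

A freshly started instance often needs some time before its SSH service accepts connections, so early failed attempts do not necessarily mean the node is unusable.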


The node status should now have changed to 'reachable'.

CLAP does not check node status periodically. Consider calling the is_alive function occasionally to update node information.

The get_nodes_by_id function returns the full node information (NodeDescriptor objects) for the specified node ids.

In [13]:
nodes = node_manager.get_nodes_by_id(started_node_ids)
for node in nodes:
    print(f"{node.node_id}: Status: {node.status}; IP: {node.ip}")

8708490fd29941898e27cd330af94c42: Status: reachable; IP: 100.25.28.126
b3953956fc7048e886dc466fb9e746e5: Status: reachable; IP: 52.90.37.115


## Tags

Tags can be added to nodes to make it easy to retrieve the NodeDescriptors with matching tags. Tags are a dict whose keys and values are strings, and they can be added to or removed from nodes using the add_tags and remove_tags functions. These functions take a list of node ids as input and return a list of the node ids for which the tags were added or removed.

Let's add some fictitious tags to the two created nodes.

In [14]:
# Let's add these tags to started_node_ids[0] only
tags = {
    'example-notebook': 'cool',
    'uuid': 'worker-0'
}
node_ids_with_tags_added = node_manager.add_tags([started_node_ids[0]], tags)
print(f'Tags {tags} were added to nodes: {node_ids_with_tags_added}')

# Let's add these tags to started_node_ids[1] only
tags = {
    'example-notebook': 'cool',
    'uuid': 'worker-1'
}
node_ids_with_tags_added = node_manager.add_tags([started_node_ids[1]], tags)
print(f'Tags {tags} were added to nodes: {node_ids_with_tags_added}')

Tags {'example-notebook': 'cool', 'uuid': 'worker-0'} were added to nodes: ['8708490fd29941898e27cd330af94c42']
Tags {'example-notebook': 'cool', 'uuid': 'worker-1'} were added to nodes: ['b3953956fc7048e886dc466fb9e746e5']


We can easily get full node information based on node tags, using get_nodes_with_tag or get_nodes_with_tag_value. The former returns all nodes that contain the specified key; the latter returns the nodes that have the tag with a specified value.

In [15]:
# Get nodes that have the tag 'example-notebook'
node_descriptors = node_manager.get_nodes_with_tag('example-notebook')
node_ids = [node.node_id for node in node_descriptors]
print(f"Get {len(node_ids)} nodes: {node_ids}")

# Get nodes that have the tag 'uuid' with value 'worker-0'
node_descriptors = node_manager.get_nodes_with_tag_value('uuid', 'worker-0')
node_ids = [node.node_id for node in node_descriptors]
print(f"Get {len(node_ids)} nodes: {node_ids}")

Get 2 nodes: ['8708490fd29941898e27cd330af94c42', 'b3953956fc7048e886dc466fb9e746e5']
Get 1 nodes: ['8708490fd29941898e27cd330af94c42']
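The difference between the two lookups can be sketched with plain dictionaries. This only illustrates the matching rule described above: `NodeSketch`, `with_tag`, and `with_tag_value` are hypothetical helpers written for this example, and the node ids are made up. It says nothing about how CLAP stores or queries tags internally.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeSketch:
    """Hypothetical stand-in holding only an id and its tags."""
    node_id: str
    tags: Dict[str, str] = field(default_factory=dict)


def with_tag(nodes: List[NodeSketch], key: str) -> List[NodeSketch]:
    # Match on the key alone; the value is ignored.
    return [n for n in nodes if key in n.tags]


def with_tag_value(nodes: List[NodeSketch], key: str, value: str) -> List[NodeSketch]:
    # Match only when the key exists and its value is equal.
    return [n for n in nodes if key in n.tags and n.tags[key] == value]


nodes = [
    NodeSketch("node-0", {"example-notebook": "cool", "uuid": "worker-0"}),
    NodeSketch("node-1", {"example-notebook": "cool", "uuid": "worker-1"}),
    NodeSketch("node-2"),  # untagged node, never matched
]

by_key = [n.node_id for n in with_tag(nodes, "example-notebook")]
by_value = [n.node_id for n in with_tag_value(nodes, "uuid", "worker-0")]
print(f"By key: {by_key}")
print(f"By key and value: {by_value}")
```

A shared tag such as 'example-notebook' groups nodes together, while a per-node tag such as 'uuid' singles out one member of the group.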


## Pausing and resuming nodes

CLAP supports pausing and resuming nodes. Pausing does not terminate or reboot a node; it only suspends it. Once a node is paused, its status is changed to 'paused' and its IP to None. Once the node is resumed, its status is changed to 'started'. The resume_nodes function also tries to log in to the node again: if the login succeeds, the status is changed to 'reachable'; otherwise, to 'unreachable'.

The pause_nodes and resume_nodes methods pause and resume nodes, respectively, taking their ids as input, and return a list of the node ids for which the operation succeeded.

Notes:
* Pausing nodes that are already paused, or resuming nodes that are already running, has no effect.
* If a paused node is resumed outside CLAP, you can use the is_alive function to update node status and all other information.
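The status changes described in this notebook can be collected into a small transition table. The sketch below is a reading aid built only from the statuses named in the text ('started', 'reachable', 'unreachable', 'paused'). The event names and the `next_status` helper are hypothetical and do not correspond to CLAP functions, and allowing a pause from any non-paused status is an assumption.

```python
from typing import Dict, Tuple

# (current status, event) -> new status. Event names are hypothetical labels.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("started", "login_ok"): "reachable",
    ("started", "login_failed"): "unreachable",
    # Assumption: a node can be paused from any non-paused status.
    ("started", "pause"): "paused",
    ("reachable", "pause"): "paused",
    ("unreachable", "pause"): "paused",
    ("paused", "resume"): "started",
}


def next_status(status: str, event: str) -> str:
    # Combinations missing from the table leave the status unchanged, e.g.
    # pausing an already paused node or resuming a running one.
    return TRANSITIONS.get((status, event), status)


status = "started"
history = [status]
for event in ["login_ok", "pause", "pause", "resume", "login_failed"]:
    status = next_status(status, event)
    history.append(status)

print(" -> ".join(history))
```

The walk-through mirrors the cells in this notebook: a started node becomes reachable after a login, is paused, and returns to 'started' on resume before the next login attempt decides between 'reachable' and 'unreachable'.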

In [17]:
# pause_nodes pauses nodes based on their node ids. It returns a list of successfully paused node ids
paused_node_ids = node_manager.pause_nodes(started_node_ids)
print(f"Paused {len(paused_node_ids)} nodes: {paused_node_ids}")

# Printing node status
nodes = node_manager.get_nodes_by_id(paused_node_ids)
for node in nodes:
    print(f'Node {node.node_id} (nickname: {node.nickname}), status: {node.status}, IP: {node.ip}')

the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Pausing nodes `SaraAmick, ScottCable`] ***********************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Paused 2 nodes: ['8708490fd29941898e27cd330af94c42', 'b3953956fc7048e886dc466fb9e746e5']
Node 8708490fd29941898e27cd330af94c42 (nickname: SaraAmick), status: paused, IP: None
Node b3953956fc7048e886dc466fb9e746e5 (nickname: ScottCable), status: paused, IP: None


In [18]:
# resume_nodes resumes nodes based on their node ids. It returns a list of successfully resumed node ids
resumed_node_ids = node_manager.resume_nodes(paused_node_ids)
print(f"Resumed {len(resumed_node_ids)} nodes: {resumed_node_ids}")

# Printing node status
nodes = node_manager.get_nodes_by_id(resumed_node_ids)
for node in nodes:
    print(f'Node {node.node_id} (nickname: {node.nickname}), status: {node.status}, IP: {node.ip}')

the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Resuming nodes `SaraAmick, ScottCable`] **********************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   



Error executing command in node 8708490f: [Errno None] Unable to connect to port 22 on 100.25.171.110
Error executing command in node b3953956: [Errno None] Unable to connect to port 22 on 18.206.202.159
Error executing command in 8708490f: [Errno None] Unable to connect to port 22 on 100.25.171.110.
Error executing command in b3953956: [Errno None] Unable to connect to port 22 on 18.206.202.159.


Resumed 2 nodes: ['8708490fd29941898e27cd330af94c42', 'b3953956fc7048e886dc466fb9e746e5']
Node 8708490fd29941898e27cd330af94c42 (nickname: SaraAmick), status: reachable, IP: 100.25.171.110
Node b3953956fc7048e886dc466fb9e746e5 (nickname: ScottCable), status: reachable, IP: 18.206.202.159


## Stopping nodes

Finally, nodes can be stopped (terminated at the cloud provider) using the stop_nodes method. The method takes node ids as input and returns a list of the node ids that were successfully stopped. Stopped nodes are automatically removed from the node repository unless the 'remove_nodes' parameter is passed to the method as False.

In [19]:
stopped_node_ids = node_manager.stop_nodes(resumed_node_ids)
print(f"Stopped {len(stopped_node_ids)} nodes: {stopped_node_ids}")

the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Stopping nodes SaraAmick, ScottCable] ************************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Stopped 2 nodes: ['8708490fd29941898e27cd330af94c42', 'b3953956fc7048e886dc466fb9e746e5']


In [20]:
# No more nodes in repository...
nodes = node_manager.get_all_nodes()
print(f"Got {len(nodes)} nodes: {nodes}")

Got 0 nodes: []
