# SWARM: Job Selection via Consensus

In this notebook, we configure a topology of three Swarm agent nodes and one Redis database node.

### Agent Nodes
These nodes host the Swarm Agents. Depending on the configuration, multiple agents can be launched on a single node. All agents collectively engage in a consensus process for job selection.

### Database Node
A Redis instance runs on this node and maintains the neighbor map, which the agents update periodically.

### Job Pool
Jobs are generated as JSON files by the `task_generator.py` script. When multiple agent nodes are used, the generated job file is distributed to all nodes so that every agent sees the same job pool.
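
One way to confirm that every node holds the same job pool is to compare checksums of the distributed file. The sketch below is only an illustration: the helper names `file_sha256` and `pools_match`, and the idea of collecting one digest per node, are assumptions and not part of SwarmAgents.

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pools_match(digests):
    """Return True when every node reported the same digest.

    `digests` is a hypothetical mapping of node name -> hex digest.
    """
    return len(set(digests.values())) == 1
```

If the digests differ, regenerating the file on one node and copying it to the others again is usually simpler than diffing the JSON.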


## Import the libraries

In [None]:
from ipaddress import ip_address, IPv4Address, IPv6Address, IPv4Network, IPv6Network
import ipaddress

from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()
                     
fablib.show_config();

## Define variables

In [None]:
slice_name = 'MySlice-swarm-1'
swarm_node_name_prefix = "agent"
swarm_node_count = 3

database_node_name = "database"

# Node profile parameters
cores = 8
ram = 32
disk = 100
image = "default_ubuntu_22"

## Configuration Parameters

**Simplified Setup for Quick Testing:**

- **`slice_name`**: Unique identifier for this FABRIC slice
- **`swarm_node_count`**: 3 agent nodes (minimal setup for testing)
- **`database_node_name`**: Name of Redis database host
- **`cores/ram/disk`**: 8 cores, 32 GB RAM, and 100 GB disk per node, sized to run multiple agents on a single node
- **`image`**: Ubuntu 22 base image

**Use Case:** This configuration is ideal for:
- Quick functionality testing
- Debugging agent behavior  
- Development and prototyping
- Learning the SWARM+ system

**Contrast with Multi-Site Notebook:**
- Multi-site: 110 nodes across 10 sites → Production-scale WAN evaluation
- Simple (this notebook): 4 nodes (3 agents and 1 database) across 4 sites → Development and testing

## Determine sites

In [None]:
#sites = fablib.get_random_sites(count=swarm_node_count + 1, avoid=["NEWY", "CIEN"])
sites = ["MAX", "FIU", "UCSD", "SALT"]
print(f'Preparing to create slice "{slice_name}" at sites {sites}')

## Slice Creation

- **Database Node**
  - Allocate a node to host the Redis database. Ensure this node is connected to the L3 FabNetV4 network to enable communication with the agent nodes.

- **Agent Cluster**
  - Provision the number of nodes specified by `swarm_node_count` for deploying Swarm agents, ideally distributing them across multiple sites.
  - Each agent node should also be connected to the L3 FabNetV4 network to facilitate inter-node communication.
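
The placement used in this notebook (database on the first site, one agent on each following site) can be sketched as a small helper that also checks that enough sites were listed. `assign_sites` is a hypothetical name introduced here for illustration; it only mirrors the indexing in the slice-creation cell and does not call FABlib.

```python
def assign_sites(sites, agent_count, database_name="database", agent_prefix="agent"):
    """Map node names to sites: database on sites[0], agent-i on sites[i]."""
    needed = agent_count + 1  # one site per agent plus one for the database
    if len(sites) < needed:
        raise ValueError(f"need {needed} sites, got {len(sites)}")
    placement = {database_name: sites[0]}
    for idx in range(agent_count):
        placement[f"{agent_prefix}-{idx + 1}"] = sites[idx + 1]
    return placement


placement = assign_sites(["MAX", "FIU", "UCSD", "SALT"], agent_count=3)
print(placement)
```

Running the check before `slice.submit()` turns a too-short `sites` list into an immediate error instead of an `IndexError` partway through the loop.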

In [None]:
# Create Slice
slice = fablib.new_slice(name=slice_name)

database = slice.add_node(name=database_node_name, site=sites[0], image=image, disk=disk, cores=cores, ram=ram)
database.add_fabnet()
database.add_post_boot_execute('sudo ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa')
database.add_post_boot_execute("sudo git clone https://github.com/swarm-workflows/SwarmAgents.git /root/SwarmAgents")
database.add_post_boot_execute('sudo bash -c "cd /root/SwarmAgents && ./install_ubuntu.sh"')
database.add_post_boot_execute("sudo /root/SwarmAgents/install_docker_ubuntu.sh") 
database.add_post_boot_execute('sudo bash -c "cd /root/SwarmAgents && docker compose up -d redis"')

# Add agent nodes and connect them to the FabNetV4 network
for idx in range(swarm_node_count):
    agent = slice.add_node(name=f"{swarm_node_name_prefix}-{idx+1}", site=sites[idx + 1], image=image, disk=disk, cores=cores, ram=ram)
    agent.add_fabnet()
    agent.add_post_boot_execute('sudo ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa')
    agent.add_post_boot_execute("sudo git clone -b agent-topology --single-branch https://github.com/swarm-workflows/SwarmAgents.git /root/SwarmAgents")
    agent.add_post_boot_execute("sudo /root/SwarmAgents/install_ubuntu.sh") 
    agent.add_post_boot_execute("sudo /root/SwarmAgents/install_docker_ubuntu.sh") 

# Submit Slice Request
slice.submit()

### Automated Post-Boot Configuration

Each node is configured automatically via `add_post_boot_execute()` commands:

**Database Node Setup:**
1. Generate SSH keys for root access
2. Clone SwarmAgents repository from GitHub
3. Run Ubuntu installation script (`install_ubuntu.sh`)
4. Install Docker
5. Launch Redis container via Docker Compose

**Agent Node Setup:**
1. Generate SSH keys for root access
2. Clone SwarmAgents repository (specific branch: `agent-topology`)
3. Run Ubuntu installation script  
4. Install Docker

**Advantages of Post-Boot Automation:**
- Nodes ready to use immediately after provisioning
- Consistent environment across all nodes
- No manual SSH configuration required
- Repeatable deployments

**Wait Time:** Expect roughly 5-10 minutes for all post-boot scripts to finish after the slice becomes active.

In [None]:
slice = fablib.get_slice(slice_name)
slice.list_nodes();

## Configure Hostnames

On each agent node, add an entry to the `/etc/hosts` file mapping the database node’s IP address to its hostname. This ensures agents can resolve and connect to the database node correctly.
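
Appending to `/etc/hosts` adds a duplicate line every time the cell is re-run. A minimal sketch of an idempotent variant is shown below; `hosts_entry_command` is a hypothetical helper (not part of FABlib or SwarmAgents) that only builds the shell command string, which could then be passed to `n.execute(...)`.

```python
import shlex


def hosts_entry_command(ip_addr, hostname, hosts_file="/etc/hosts"):
    """Build a command that appends 'ip hostname' only if hostname is not yet present.

    `grep -qw` looks for the hostname as a whole word; the entry is appended
    only when that lookup fails.
    """
    entry = f"{ip_addr} {hostname}"
    script = (
        f"grep -qw {shlex.quote(hostname)} {shlex.quote(hosts_file)} "
        f"|| echo {shlex.quote(entry)} >> {shlex.quote(hosts_file)}"
    )
    return f"sudo sh -c {shlex.quote(script)}"


print(hosts_entry_command("10.0.0.2", "database"))
```

The address `10.0.0.2` is a placeholder; in the notebook the value would come from `database_addr`.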

In [None]:
database = slice.get_node(database_node_name)
database_addr = database.get_interface(network_name=f"FABNET_IPv4_{database.get_site()}").get_ip_addr()

for n in slice.get_nodes():
    if n.get_name() == database_node_name:
        continue
    n.execute(f'sudo sh -c \'echo "{database_addr} {database_node_name}" >> /etc/hosts\'')


## Running SWARM-MULTI Consensus Setup

### Simplified Topology Overview - Single Node

- The **database** is launched on a dedicated `database` node.
- **All agents** are deployed on a single node (e.g., `agent-1`).

- A job pool is generated using a JSON file named `tasks.json`.

- The system supports two modes of agent communication:
  - **Ring**: Agents are grouped into rings of 5; one agent from each ring connects to form higher-level rings until a single ring remains.
  - **Mesh**: Every agent communicates with all other agents.

- The communication mode is configurable via the `topology` field in `config_swarm_multi.yml`:
  - For **Mesh**, set `topology.peer_agents` to `"all"`.
  - For **Ring**, set `topology.peer_agents` to a comma-separated list of peer agent IDs.

- Use the provided script to launch the agents.
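
To make the two `topology.peer_agents` settings concrete, the sketch below derives the value for one agent. It rests on assumptions that this notebook does not confirm: agent IDs are taken to be integers, and in Ring mode each agent is assumed to peer with its immediate predecessor and successor in its ring. `peer_agents_value` is a hypothetical helper, not a SwarmAgents function.

```python
def peer_agents_value(agent_id, ring_members, mode="ring"):
    """Return the string to place in topology.peer_agents for one agent."""
    if mode == "mesh":
        return "all"
    if mode != "ring":
        raise ValueError(f"unknown mode: {mode}")
    pos = ring_members.index(agent_id)
    # Assumption: an agent's peers are its two neighbors in the ring.
    neighbors = (ring_members[pos - 1], ring_members[(pos + 1) % len(ring_members)])
    peers = []
    for peer in neighbors:
        if peer != agent_id and peer not in peers:
            peers.append(peer)
    return ",".join(str(peer) for peer in peers)


print(peer_agents_value(3, [1, 2, 3, 4, 5]))          # ring neighbors of agent 3
print(peer_agents_value(3, [1, 2, 3, 4, 5], "mesh"))  # mesh setting
```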

## Manual Experiment Workflow

Unlike the multi-site notebook, which uses automated batch testing, this notebook demonstrates **manual agent launch** for learning and debugging.

**Workflow Overview:**

1. **Install Dependencies** → Run pip install on agent nodes
2. **Generate Job Pool** → Create tasks.json with synthetic jobs
3. **Launch Agents** → SSH into node and run launch script
4. **Monitor Progress** → Check job completion via Redis queries
5. **Stop Agents** → Graceful shutdown
6. **Collect Results** → Download logs and plots from `swarm-multi/` directory

**Why Manual Mode?**
- Better visibility into agent startup process
- Easier debugging of configuration issues
- Educational: understand each step of system operation
- Fine-grained control over agent launch timing

**Topology Options:**
The system supports two communication modes (configured in `config_swarm_multi.yml`):

- **Mesh**: Every agent connects to all others (`topology.peer_agents = "all"`)
  - Best for: Small-scale testing (≤30 agents)
  - Characteristics: O(n²) connections, fast consensus, high message overhead
  
- **Ring**: Agents organized in hierarchical rings of 5
  - Best for: Larger deployments (30+ agents)
  - Characteristics: O(√n) message routing, lower overhead, slightly higher latency
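
The difference between the two modes can be illustrated with a short sketch that counts mesh links and builds the hierarchy of rings described above. This is a rough model, not the SwarmAgents implementation: the function names are hypothetical, and picking the first member of each ring as its representative is an assumption.

```python
def mesh_link_count(n):
    """Number of undirected links in a full mesh of n agents."""
    return n * (n - 1) // 2


def build_ring_levels(agent_ids, ring_size=5):
    """Group agents into rings, then promote one agent per ring to the next
    level, repeating until a single ring remains."""
    levels = []
    current = list(agent_ids)
    while True:
        rings = [current[i:i + ring_size] for i in range(0, len(current), ring_size)]
        levels.append(rings)
        if len(rings) <= 1:
            break
        # Assumption: the first member of each ring represents it one level up.
        current = [ring[0] for ring in rings]
    return levels


levels = build_ring_levels(range(1, 11))
print(mesh_link_count(10))  # links needed for a 10-agent mesh
print(levels)               # ring hierarchy for the same 10 agents
```

For 10 agents this yields two base rings of 5 and one top-level ring joining their representatives, compared with 45 pairwise links in a full mesh.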

In [None]:
agent1 = slice.get_node("agent-1")

stdout, stderr = agent1.execute('sudo bash -c "cd /root/SwarmAgents && pip3.11 install -r requirements.txt"', quiet=True)

In [None]:
task_count = 100

In [None]:
agent1 = slice.get_node("agent-1")

In [None]:
stdout, stderr = agent1.execute(f'sudo bash -c "cd /root/SwarmAgents && python3.11 task_generator.py {task_count}"')

### gRPC Hack
NOTE: There is a gRPC version-mismatch issue. The pip install commands below are a workaround for it and only need to be run once.

```bash
sudo su -
cd SwarmAgents
pip3.11 install protobuf==3.20.3
pip3.11 install -r requirements.txt
```

### Launching the Agents

- SSH into `agent-1` (the SSH command appears in the `slice.list_nodes()` output above), then run the following commands to start the agents:

```bash
sudo su -
cd SwarmAgents
./swarm-multi-start.sh 10 ring database
```

### Verifying Completed Jobs

- To check the number of completed jobs, run:
```bash
python3.11 dump_tasks.py --host database --key job --count
```

- If the output matches the total number of expected jobs (e.g., 100), you’re ready to stop the agents.
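
Instead of re-running the count command by hand, the check can be wrapped in a polling loop. The sketch below is generic: `wait_for_jobs` is a hypothetical helper, and `get_completed` stands for any callable that returns the current completed-job count (for example, one that runs the `dump_tasks.py` command above and parses its output, whose exact format is not documented here).

```python
import time


def wait_for_jobs(get_completed, expected, timeout_s=600.0, interval_s=5.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll get_completed() until it reaches `expected` or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        done = get_completed()
        if done >= expected:
            return done
        if clock() >= deadline:
            raise TimeoutError(f"only {done}/{expected} jobs completed after {timeout_s}s")
        sleep(interval_s)


# Usage with a stand-in counter; a real one would query the database node.
counts = iter([10, 50, 100])
print(wait_for_jobs(lambda: next(counts), expected=100, sleep=lambda _: None))
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.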

### Stopping the Agents

- Use the following command to gracefully stop all running agents:
```bash
./swarm-multi-kill.sh
```

After shutdown, all logs and generated plots will be available in the `SwarmAgents/swarm-multi` directory.

### Multi Node Topology

TBD

## Collecting Results

After running the manual experiment, collect your data from the agent nodes.

### Results Location
All logs, metrics, and plots are stored in:
```
/root/SwarmAgents/swarm-multi/
```

### Download Results via SSH
```bash
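# On agent-1: /root is not readable by the ubuntu user, so stage a copy first
sudo cp -r /root/SwarmAgents/swarm-multi /home/ubuntu/swarm-multi && sudo chown -R ubuntu /home/ubuntu/swarm-multi
# From your local machine
scp -r ubuntu@<agent-1-ip>:/home/ubuntu/swarm-multi ./local-results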
```

### What You'll Find
```
swarm-multi/
├── agent-<id>.log              # Per-agent execution logs
├── agent-<id>.csv              # Job assignments per agent  
├── all_jobs.csv                # Consolidated job data
├── metrics.json                # Performance statistics
├── topology.png                # Network topology diagram (if generated)
└── latency_cdf.png             # Latency distribution plot (if generated)
```

### Quick Analysis Commands

**Check job completion:**
```bash
python3.11 dump_tasks.py --host database --key job --count
```

**Verify all agents completed:**
```bash
cat swarm-multi/metrics.json | jq '.completed_jobs, .total_jobs'
```

**View latency stats:**
```bash
cat swarm-multi/metrics.json | jq '.avg_latency, .p95_latency, .p99_latency'
```
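
The same fields can be read in Python once `metrics.json` has been downloaded. The sketch below assumes, as the `jq` commands above suggest, that the file is a flat JSON object containing these keys; `summarize_metrics` is a hypothetical helper, and missing keys are reported as `None` rather than raising.

```python
import json


def summarize_metrics(path):
    """Load metrics.json and pull out the completion and latency fields."""
    with open(path, encoding="utf-8") as handle:
        metrics = json.load(handle)
    keys = ["completed_jobs", "total_jobs", "avg_latency", "p95_latency", "p99_latency"]
    summary = {key: metrics.get(key) for key in keys}
    completed, total = summary["completed_jobs"], summary["total_jobs"]
    # Flag a finished run only when both counters are present and equal.
    summary["all_completed"] = completed is not None and completed == total
    return summary
```

A call such as `summarize_metrics("local-results/metrics.json")` would then give one dictionary per run that is easy to compare across experiments.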

### Troubleshooting
- **No output files?** Check agent logs for errors: `tail -100 swarm-multi/agent-*.log`
- **Jobs not completing?** Verify Redis connectivity: `redis-cli -h database ping`
- **Consensus issues?** Check network connectivity between agents

## Delete the Slice

In [None]:
slice.delete()