# Sky Manager

Sky Manager's goal is to intelligently schedule jobs and deployments across an organization's clusters. It aims to eliminate the boundaries between clusters and create the notion of "one gigantic cluster".

Sky Manager consists of an API server and a controller manager. Organizations can add their clusters to Sky Manager; Kubernetes is supported today, and Slurm support is a TODO.

Sky Manager supports the following types of objects:
- Clusters
- Jobs (Federated across clusters)
- Deployments (Federated across clusters)
- Namespaces (Federated across clusters)
- FilterPolicies (Governance for existing jobs/deployments)
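
The API responses shown later in this walkthrough suggest that every object shares the same top-level layout: `kind`, `metadata` (name, labels, annotations, plus a namespace for namespaced kinds), `spec`, and a `status` that the server fills in. A minimal sketch of building such a payload follows; the helper name `new_object` is hypothetical and is not part of Sky Manager.

```python
from typing import Any, Dict, Optional


def new_object(kind: str, name: str, spec: Dict[str, Any],
               namespace: Optional[str] = None,
               labels: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
    """Build a payload with the kind/metadata/spec layout (hypothetical helper)."""
    metadata: Dict[str, Any] = {'name': name, 'labels': dict(labels or {})}
    # Only namespaced kinds (for example, Jobs) carry a namespace.
    if namespace is not None:
        metadata['namespace'] = namespace
    # `status` is left out: in the outputs shown here the server populates it.
    return {'kind': kind, 'metadata': metadata, 'spec': dict(spec)}


cluster = new_object('Cluster', 'cluster-0', {'manager': 'kubernetes'})
job = new_object('Job', 'hello', {'resources': {'cpu': 1}, 'run': 'echo Sky!'},
                 namespace='default', labels={'testing': 'hello'})
print(cluster)
print(job)
```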

## API Server

The API server supports CRUD operations over namespaced and global objects. These operations include:
- Create
- Get (Read)
- List
- Update
- Delete
- Watch (asynchronously watches objects and tracks updates)
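
The request paths that appear in the server logs throughout this walkthrough (`/clusters`, `/clusters/cluster-0`, `/default/jobs/job-0`, `/clusters?watch=true`) suggest a simple URL layout: global objects live under `/<plural>`, namespaced objects under `/<namespace>/<plural>`, and a watch is requested with a `watch=true` query parameter. The sketch below only reproduces that observed pattern; `object_url` is a hypothetical helper, not a Sky Manager function.

```python
from typing import Optional


def object_url(base: str, plural: str, name: Optional[str] = None,
               namespace: Optional[str] = None, watch: bool = False) -> str:
    """Compose a URL following the path layout observed in the server logs."""
    parts = [base.rstrip('/')]
    # Namespaced kinds are prefixed by their namespace; global kinds are not.
    if namespace is not None:
        parts.append(namespace)
    parts.append(plural)
    # A name selects a single object instead of the whole collection.
    if name is not None:
        parts.append(name)
    url = '/'.join(parts)
    if watch:
        url += '?watch=true'
    return url


base = 'http://localhost:50051'
print(object_url(base, 'clusters'))
print(object_url(base, 'clusters', name='cluster-0'))
print(object_url(base, 'jobs', name='job-0', namespace='default'))
print(object_url(base, 'clusters', watch=True))
```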

In [1]:
# Kill any stale API server or controller manager before relaunching.
import os
import signal
import subprocess

# Get a list of all running processes.
ps_output = subprocess.run(['ps', 'aux'], capture_output=True, text=True).stdout
# Iterate over each running process.
for line in ps_output.splitlines():
    # Find processes with 'api_server' or 'launch_sky_manager' in the command.
    if 'api_server' in line or 'launch_sky_manager' in line:
        # Extract the process ID (PID), the second column of `ps aux`.
        pid = int(line.split()[1])
        # Kill the process; use signal.SIGTERM for a softer kill.
        os.kill(pid, signal.SIGKILL)

In [2]:

# Launch the API server in the background.
os.system('python ../api_server/api_server.py &')

0

 * Serving Flask app 'Sky-Manager-API-Server'
 * Debug mode: on


 * Running on http://localhost:50051
werkzeug - 2023-11-10 23:48:00,055 - INFO - Press CTRL+C to quit
werkzeug - 2023-11-10 23:48:00,055 - INFO -  * Restarting with stat
werkzeug - 2023-11-10 23:48:00,590 - INFO -  * Debugger PIN: 107-294-845


Below we show simple examples with the API server:

In [3]:
# List clusters
from sky_manager.utils.utils import load_manager_config

api_server_ip, api_server_port = load_manager_config()

print('Listing all clusters:')
os.system(f'curl -X GET http://{api_server_ip}:{api_server_port}/clusters')

Listing all clusters:
{
  "items": [
    {
      "kind": "Cluster",
      "metadata": {
        "annotations": {},
        "labels": {},
        "name": "cluster-0"
      },
      "spec": {
        "manager": "kubernetes"
      },
      "status": {
        "allocatable": {},
        "capacity": {},
        "conditions": [
          {
            "createTime": "1699660080.5569887",
            "status": "INIT",
            "updateTime": "1699660080.5569887"
          }
        ],
        "status": "INIT"
      }
    },
    {
      "kind": "Cluster",
      "metadata": {
        "annotations": {},
        "labels": {},
        "name": "cluster-1"
      },
      "spec": {
        "manager": "kubernetes"
      },
      "status": {
        "allocatable": {},
        "capacity": {},
        "conditions": [
          {
            "createTime": "1699660080.568068",
            "status": "INIT",
            "updateTime": "1699660080.568068"
          }
        ],
        "status": "INIT"
      

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0werkzeug - 2023-11-10 23:48:09,133 - INFO - 127.0.0.1 - - [10/Nov/2023 23:48:09] "GET /clusters HTTP/1.1" 200 -
100  1504  100  1504    0     0   489k      0 --:--:-- --:--:-- --:--:--  489k


0

In [4]:
print('Get cluster cluster-0:')
os.system(f'curl -X GET http://{api_server_ip}:{api_server_port}/clusters/cluster-0')

Get cluster cluster-0:
{
  "kind": "Cluster",
  "metadata": {
    "annotations": {},
    "labels": {},
    "name": "cluster-0"
  },
  "spec": {
    "manager": "kubernetes"
  },
  "status": {
    "allocatable": {},
    "capacity": {},
    "conditions": [
      {
        "createTime": "1699660080.5569887",
        "status": "INIT",
        "updateTime": "1699660080.5569887"
      }
    ],
    "status": "INIT"
  }
}


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0werkzeug - 2023-11-10 23:48:27,302 - INFO - 127.0.0.1 - - [10/Nov/2023 23:48:27] "GET /clusters/cluster-0 HTTP/1.1" 200 -
100   394  100   394    0     0   128k      0 --:--:-- --:--:-- --:--:--  128k


0
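
Each object in the responses above carries a `status.conditions` list whose entries hold a status plus `createTime` and `updateTime` strings. Assuming those strings are Unix epoch seconds (an assumption, though it lines up with the server log timestamps if the server clock is UTC), a response can be summarized as follows. The helper names are hypothetical.

```python
from datetime import datetime, timezone


def latest_condition(obj: dict) -> dict:
    """Return the most recently updated condition of an object."""
    conditions = obj['status']['conditions']
    return max(conditions, key=lambda c: float(c['updateTime']))


def to_utc(epoch_str: str) -> datetime:
    """Parse an epoch-seconds string (assumed format) into a UTC datetime."""
    return datetime.fromtimestamp(float(epoch_str), tz=timezone.utc)


# Trimmed copy of the `cluster-0` response shown above.
cluster = {
    'kind': 'Cluster',
    'metadata': {'name': 'cluster-0'},
    'status': {
        'conditions': [{
            'createTime': '1699660080.5569887',
            'status': 'INIT',
            'updateTime': '1699660080.5569887',
        }],
        'status': 'INIT',
    },
}

cond = latest_condition(cluster)
print(cond['status'], to_utc(cond['updateTime']).strftime('%Y-%m-%d %H:%M:%S UTC'))
# prints: INIT 2023-11-10 23:48:00 UTC
```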

In [5]:
print('DELETE cluster cluster-0:')
os.system(f'curl -X DELETE http://{api_server_ip}:{api_server_port}/clusters/cluster-0')

print("Cluster-0 should be gone.")
os.system(f'curl -X GET http://{api_server_ip}:{api_server_port}/clusters/cluster-0')

DELETE cluster cluster-0:
{
  "message": "Deleted 'clusters/cluster-0'."
}
Cluster-0 should be gone.
{
  "error": "Object 'clusters/cluster-0' not found."
}


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0werkzeug - 2023-11-10 23:48:37,905 - INFO - 127.0.0.1 - - [10/Nov/2023 23:48:37] "DELETE /clusters/cluster-0 HTTP/1.1" 200 -
100    49  100    49    0     0   1225      0 --:--:-- --:--:-- --:--:--  1225
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0werkzeug - 2023-11-10 23:48:37,919 - INFO - 127.0.0.1 - - [10/Nov/2023 23:48:37] "GET /clusters/cluster-0 HTTP/1.1" 404 -
100    56  100    56    0     0  28000      0 --:--:-- --:--:-- --:--:-- 28000


0

## Programmatic API and CLI

On top of the API server, Sky Manager layers a programmatic API and a CLI, both of which use the API server's REST API.
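
To illustrate how such a layer can sit on top of a REST API, here is a minimal sketch of a thin client whose methods map one-to-one to the HTTP requests seen in the server logs (`GET /clusters`, `GET /clusters/<name>`, `POST /clusters`, `DELETE /clusters/<name>`). `MiniClusterClient` and `recording_transport` are hypothetical names; this is not the implementation of Sky Manager's `ClusterAPI`. The transport is injected so the mapping can be inspected without a running server.

```python
from typing import Any, Callable, Dict, Optional

# A transport takes (HTTP method, path, optional JSON body) and returns a response.
Transport = Callable[[str, str, Optional[Dict[str, Any]]], Any]


class MiniClusterClient:
    """Hypothetical thin client; each method issues exactly one request."""

    def __init__(self, transport: Transport):
        self._send = transport

    def list(self):
        return self._send('GET', '/clusters', None)

    def get(self, name: str):
        return self._send('GET', f'/clusters/{name}', None)

    def create(self, config: Dict[str, Any]):
        return self._send('POST', '/clusters', config)

    def delete(self, name: str):
        return self._send('DELETE', f'/clusters/{name}', None)


def recording_transport(method, path, body):
    # Echo the request instead of sending it, so the mapping is visible offline.
    return (method, path)


client = MiniClusterClient(recording_transport)
print(client.list())
print(client.get('cluster-1'))
print(client.delete('cluster-1'))
```

Swapping `recording_transport` for a function that performs a real HTTP call would turn the sketch into a working client, which is the same separation that lets a CLI reuse the programmatic layer.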

In [6]:
from sky_manager.api_client import ClusterAPI

cluster_api = ClusterAPI()

print('API - List clusters.')
print(cluster_api.list())

print('API - Get cluster-1')
print(cluster_api.get('cluster-1'))

print('API - Delete cluster-1')
print(cluster_api.delete('cluster-1'))

API - List clusters.
{'items': [{'kind': 'Cluster', 'metadata': {'annotations': {}, 'labels': {}, 'name': 'cluster-1'}, 'spec': {'manager': 'kubernetes'}, 'status': {'allocatable': {}, 'capacity': {}, 'conditions': [{'createTime': '1699660080.568068', 'status': 'INIT', 'updateTime': '1699660080.568068'}], 'status': 'INIT'}}, {'kind': 'Cluster', 'metadata': {'annotations': {}, 'labels': {}, 'name': 'cluster-2'}, 'spec': {'manager': 'kubernetes'}, 'status': {'allocatable': {}, 'capacity': {}, 'conditions': [{'createTime': '1699660080.5764384', 'status': 'INIT', 'updateTime': '1699660080.5764384'}], 'status': 'INIT'}}], 'kind': 'ClusterList'}
API - Get cluster-1
{'kind': 'Cluster', 'metadata': {'annotations': {}, 'labels': {}, 'name': 'cluster-1'}, 'spec': {'manager': 'kubernetes'}, 'status': {'allocatable': {}, 'capacity': {}, 'conditions': [{'createTime': '1699660080.568068', 'status': 'INIT', 'updateTime': '1699660080.568068'}], 'status': 'INIT'}}
API - Delete cluster-1
{'message': "De

werkzeug - 2023-11-10 23:48:59,101 - INFO - 127.0.0.1 - - [10/Nov/2023 23:48:59] "GET /clusters HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:48:59,106 - INFO - 127.0.0.1 - - [10/Nov/2023 23:48:59] "GET /clusters/cluster-1 HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:48:59,121 - INFO - 127.0.0.1 - - [10/Nov/2023 23:48:59] "DELETE /clusters/cluster-1 HTTP/1.1" 200 -


In [7]:
print('CLI - List clusters.')
os.system('skym get clusters')

print('CLI - Get cluster-2.')
os.system('skym get cluster cluster-2')

print('CLI - Create cluster.')
os.system('skym create cluster skycluster --manager k8')

print('CLI - List clusters.')
os.system('skym get clusters')



CLI - List clusters.


werkzeug - 2023-11-10 23:49:03,084 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:03] "GET /clusters HTTP/1.1" 200 -


Name       Manager     Resources    Status
cluster-2  kubernetes  {}           INIT
CLI - Get cluster-2.


werkzeug - 2023-11-10 23:49:03,761 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:03] "GET /clusters/cluster-2 HTTP/1.1" 200 -


Name       Manager     Resources    Status
cluster-2  kubernetes  {}           INIT
CLI - Create cluster.


werkzeug - 2023-11-10 23:49:04,376 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:04] "POST /clusters HTTP/1.1" 200 -


Created cluster skycluster.
CLI - List clusters.
Name        Manager     Resources    Status
cluster-2   kubernetes  {}           INIT
skycluster  k8          {}           INIT


werkzeug - 2023-11-10 23:49:04,971 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:04] "GET /clusters HTTP/1.1" 200 -


0

In [8]:
from sky_manager.api_client import JobAPI

job_api = JobAPI(namespace='default')
print('List jobs.')
print(job_api.list())

print('Get job-0')
print(job_api.get('job-0'))

print('Delete job-0')
print(job_api.delete('job-0'))

print('Create job hello')
job_dict = {
    "kind": "Job",
    "metadata": {
      "name": "hello",
      "labels": {
        "testing": "hello"
      }
    },
    "spec": {
      "image": "gcr.io/sky-burst/skyburst:latest",
      "resources": {
        "cpu": 1,
      },
      "run": "echo Sky!"
    } 
}
print(job_api.create(job_dict))

List jobs.
{'items': [{'kind': 'Job', 'metadata': {'annotations': {}, 'labels': {'testing': 'hello'}, 'name': 'job-0', 'namespace': 'default'}, 'spec': {'image': 'gcr.io/sky-burst/skyburst:latest', 'replicas': 1, 'resources': {'cpu': 1, 'gpu': 0}, 'run': 'echo hi; sleep 100; echo bye'}, 'status': {'clusters': None, 'conditions': [{'createTime': '1699660169.212788', 'status': 'INIT', 'updateTime': '1699660169.212788'}], 'status': 'INIT'}}, {'kind': 'Job', 'metadata': {'annotations': {}, 'labels': {'testing': 'hello'}, 'name': 'job-1', 'namespace': 'default'}, 'spec': {'image': 'gcr.io/sky-burst/skyburst:latest', 'replicas': 1, 'resources': {'cpu': 1, 'gpu': 0}, 'run': 'echo hi; sleep 100; echo bye'}, 'status': {'clusters': None, 'conditions': [{'createTime': '1699660169.2128294', 'status': 'INIT', 'updateTime': '1699660169.2128294'}], 'status': 'INIT'}}, {'kind': 'Job', 'metadata': {'annotations': {}, 'labels': {'testing': 'hello'}, 'name': 'job-2', 'namespace': 'default'}, 'spec': {'im

werkzeug - 2023-11-10 23:49:29,214 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:29] "GET /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:49:29,220 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:29] "GET /default/jobs/job-0 HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:49:29,263 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:29] "DELETE /default/jobs/job-0 HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:49:29,272 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:29] "POST /default/jobs HTTP/1.1" 200 -


In [9]:
print('CLI - List jobs.')
os.system('skym get jobs')

print('CLI - Get jobs hello.')
os.system('skym get job hello')



CLI - List jobs.


werkzeug - 2023-11-10 02:24:21,660 - INFO - 127.0.0.1 - - [10/Nov/2023 02:24:21] "GET /default/jobs HTTP/1.1" 200 -


Name    Cluster      Replicas  Resources    Namespace    Status
hello                       0  cpu: 1       default      INIT
job-1                       0  cpu: 1       default      INIT
                               gpu: 0
job-2                       0  cpu: 1       default      INIT
                               gpu: 0
CLI - Get jobs hello.
Name    Cluster      Replicas  Resources    Namespace    Status
hello                       0  cpu: 1       default      INIT


werkzeug - 2023-11-10 02:24:22,256 - INFO - 127.0.0.1 - - [10/Nov/2023 02:24:22] "GET /default/jobs/hello HTTP/1.1" 200 -


0

## Controller Manager

Under the hood, the controller manager manages:
- Scheduler Controller, which decides which clusters each job goes to (i.e., spreads replicas across clusters).
- Skylet Controller, which spawns a "Skylet" process for each cluster.

Diving deeper, the Skylet controller manages:
- Cluster Controller, similar to Kubelet, which monitors a cluster's health and state.
- Job Controller, which monitors the state of a job's replicas submitted to the cluster.
- Flow Controller, which controls the flow of jobs in and out of the cluster (e.g., evicts a job if it has been waiting too long).
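
The eviction rule mentioned for the Flow Controller can be sketched as a pure function: given when each job was submitted and its current status, pick the jobs that have waited past a threshold. This is only an illustration of the idea; the function name, the `PENDING` status, and the timeout value are all assumptions and do not come from Sky Manager.

```python
import time
from typing import Dict, List, Optional

# Hypothetical status name for a job that is queued but not yet running.
WAITING_STATUSES = {'PENDING'}


def jobs_to_evict(submit_times: Dict[str, float], job_status: Dict[str, str],
                  max_wait_s: float, now: Optional[float] = None) -> List[str]:
    """Return the names of waiting jobs that exceeded max_wait_s, sorted."""
    now = time.time() if now is None else now
    evict = []
    for name, submitted_at in submit_times.items():
        waiting = job_status.get(name) in WAITING_STATUSES
        if waiting and now - submitted_at > max_wait_s:
            evict.append(name)
    return sorted(evict)


submit_times = {'job-a': 0.0, 'job-b': 50.0, 'job-c': 0.0}
job_status = {'job-a': 'PENDING', 'job-b': 'PENDING', 'job-c': 'RUNNING'}
# At t=60 with a 30 s limit, only job-a has waited too long.
print(jobs_to_evict(submit_times, job_status, max_wait_s=30.0, now=60.0))
```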


In [9]:
# Launch controller manager
os.system('python ../launch_sky_manager.py &')

0

Launching Skylet Controller Manager.


werkzeug - 2023-11-10 23:49:55,601 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:55] "GET /clusters HTTP/1.1" 200 -
[Skylet Controller] - 2023-11-10 23:49:55,603 - INFO - Executing Skylet controller - Manages launching and terminating Skylets for clusters.
werkzeug - 2023-11-10 23:49:55,624 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:55] "GET /jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:49:55,633 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:55] "GET /clusters HTTP/1.1" 200 -
[Scheduler Controller] - 2023-11-10 23:49:55,635 - INFO - Running Scheduler controller - Manages workload submission over multiple clusters.


['mluo-cloud', 'mluo-onprem']


werkzeug - 2023-11-10 23:49:56,338 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:56] "GET /clusters/mluo-cloud HTTP/1.1" 404 -
werkzeug - 2023-11-10 23:49:56,373 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:56] "GET /clusters?watch=true HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:49:56,374 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:56] "GET /clusters?watch=true HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:49:56,374 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:56] "POST /clusters HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:49:56,387 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:56] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:49:56,387 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:56] "GET /clusters/mluo-onprem HTTP/1.1" 404 -
werkzeug - 2023-11-10 23:49:56,397 - INFO - 127.0.0.1 - - [10/Nov/2023 23:49:56] "POST /clusters HTTP/1.1" 200 -
[Skylet Controller] - 2023-11-10 23:49:56,403 - INFO - Launched Skylet for cluster: mluo-cloud.
werkzeug - 2023-11-10 23:49:56,409 - INFO - 

In [13]:
# Sky Manager automatically detects all clusters in your kubeconfig file.
# The Skylet controller spawns a Skylet subprocess for each valid Kubernetes cluster.
!skym get clusters

Name         Manager     Resources                            Status
cluster-2    kubernetes  {}                                   INIT
mluo-cloud   k8          cpu: 5.79/6.0                        READY
                         gpu: 0/0
                         memory: 18107.0390625/23864.0390625
mluo-onprem  k8          cpu: 3.86/4.0                        READY
                         gpu: 0/0
                         memory: 12071.34375/15909.34375
skycluster   k8          {}                                   INIT


werkzeug - 2023-11-10 23:52:16,180 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:16] "GET /clusters HTTP/1.1" 200 -


werkzeug - 2023-11-10 23:52:16,805 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:16] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:16,814 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:16] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:52:16,815 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:52:16,870 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:16] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:16,878 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:16] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:52:16,879 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:52:21,820 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:21] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:21,831 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:21] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:52:21,832 - INFO - Updated cluster state.
werkzeug - 2023-1

## Job Submission Demo
This demo has three parts:
- Submitting a simple job. Sky Manager will automatically choose the cluster to execute the job.
- FilterPolicy (if user has governance constraints) - Sky Manager will filter for the right set of clusters to execute the job.
- Multi-node jobs (aka multiple replicas) - Sky Manager will automatically spread the job across clusters.
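
The scheduler logs in this walkthrough report placements as a mapping from cluster name to replica count, for example `{'mluo-cloud': 1}`. As a rough illustration of how replicas could be divided across clusters, the sketch below fills the cluster with the most free CPU first and spills the remainder onto the next one. This greedy rule is an assumption chosen for brevity, not Sky Manager's scheduling policy, and `place_replicas` is a hypothetical name. The free-CPU figures are taken from the cluster table above, reading the first number in `cpu: 5.79/6.0` as free capacity (also an assumption).

```python
from typing import Dict


def place_replicas(replicas: int, cpu_per_replica: float,
                   free_cpu: Dict[str, float]) -> Dict[str, int]:
    """Greedily place replicas, visiting clusters from most to least free CPU."""
    placement: Dict[str, int] = {}
    remaining = replicas
    # Ties on free CPU are broken by cluster name to keep the result stable.
    for name, free in sorted(free_cpu.items(), key=lambda kv: (-kv[1], kv[0])):
        if remaining == 0:
            break
        fits = int(free // cpu_per_replica)
        take = min(fits, remaining)
        if take > 0:
            placement[name] = take
            remaining -= take
    if remaining > 0:
        raise ValueError(f'{remaining} replica(s) do not fit on any cluster')
    return placement


free = {'mluo-cloud': 5.79, 'mluo-onprem': 3.86}
print(place_replicas(1, 1, free))  # {'mluo-cloud': 1}
print(place_replicas(8, 1, free))  # {'mluo-cloud': 5, 'mluo-onprem': 3}
```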

### Demo 1: Simple Job Submission

In [14]:
# Submit a 1 CPU job to Sky Manager
import time
import uuid

job_uuid = uuid.uuid4().hex[:8] # Get only the first 8 characters for a short version


os.system(f'skym create job sky-{job_uuid} --resources cpu 1 --run "echo Sky!; sleep 10"')

Created job sky-165c0f23.


werkzeug - 2023-11-10 23:52:40,299 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:40] "POST /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:40,306 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:40] "GET /default/filterpolicies HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:40,317 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:40] "PUT /default/jobs HTTP/1.1" 200 -
[Scheduler Controller] - 2023-11-10 23:52:40,319 - INFO - Sending job sky-165c0f23 to clusters {'mluo-cloud': 1}.
[mluo-cloud - Flow Controller] - 2023-11-10 23:52:40,320 - INFO - Submitting job 'sky-165c0f23' to cluster 'mluo-cloud'.
werkzeug - 2023-11-10 23:52:40,382 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:40] "PUT /default/jobs HTTP/1.1" 200 -


0

werkzeug - 2023-11-10 23:52:41,841 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:41] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:41,850 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:41] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:52:41,852 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:52:41,879 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:41] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:41,943 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:41] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:41,952 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:41] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:52:41,953 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:52:44,771 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:44] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:46,844 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:46] "GET /clusters/mluo-cloud HTTP/

In [15]:
for _ in range(10):
    os.system(f'skym get job sky-{job_uuid}')
    time.sleep(0.5)

werkzeug - 2023-11-10 23:52:51,822 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:51] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:51,832 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:51] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:52:51,834 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:52:51,856 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:51] "GET /default/jobs/sky-165c0f23 HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:51,909 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:51] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:51,918 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:51] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:52:51,920 - INFO - Updated cluster state.


Name          Cluster       Replicas  Resources    Namespace    Status
sky-165c0f23  mluo-cloud           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:52:52,951 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:52] "GET /default/jobs/sky-165c0f23 HTTP/1.1" 200 -


Name          Cluster       Replicas  Resources    Namespace    Status
sky-165c0f23  mluo-cloud           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:52:53,782 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:53] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:54,041 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:54] "GET /default/jobs/sky-165c0f23 HTTP/1.1" 200 -


Name          Cluster       Replicas  Resources    Namespace    Status
sky-165c0f23  mluo-cloud           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:52:55,138 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:55] "GET /default/jobs/sky-165c0f23 HTTP/1.1" 200 -


Name          Cluster       Replicas  Resources    Namespace    Status
sky-165c0f23  mluo-cloud           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:52:56,209 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:56] "GET /default/jobs/sky-165c0f23 HTTP/1.1" 200 -


Name          Cluster       Replicas  Resources    Namespace    Status
sky-165c0f23  mluo-cloud           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:52:56,884 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:56] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:56,906 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:56] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:56,916 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:56] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:52:56,917 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:52:56,934 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:56] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:52:56,944 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:56] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:52:56,945 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:52:57,323 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:57] "GET /default/jobs/sky-165c0f23 HTTP/1.1" 200 -


Name          Cluster       Replicas  Resources    Namespace    Status
sky-165c0f23  mluo-cloud           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:52:58,393 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:58] "GET /default/jobs/sky-165c0f23 HTTP/1.1" 200 -


Name          Cluster       Replicas  Resources    Namespace    Status
sky-165c0f23  mluo-cloud           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:52:59,480 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:59] "GET /default/jobs/sky-165c0f23 HTTP/1.1" 200 -


Name          Cluster       Replicas  Resources    Namespace    Status
sky-165c0f23  mluo-cloud           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:52:59,783 - INFO - 127.0.0.1 - - [10/Nov/2023 23:52:59] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:53:00,577 - INFO - 127.0.0.1 - - [10/Nov/2023 23:53:00] "GET /default/jobs/sky-165c0f23 HTTP/1.1" 200 -


Name          Cluster       Replicas  Resources    Namespace    Status
sky-165c0f23  mluo-cloud           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:53:01,749 - INFO - 127.0.0.1 - - [10/Nov/2023 23:53:01] "GET /default/jobs/sky-165c0f23 HTTP/1.1" 200 -


Name          Cluster       Replicas  Resources    Namespace    Status
sky-165c0f23  mluo-cloud           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:53:02,126 - INFO - 127.0.0.1 - - [10/Nov/2023 23:53:02] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:53:02,149 - INFO - 127.0.0.1 - - [10/Nov/2023 23:53:02] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:53:02,150 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:53:02,175 - INFO - 127.0.0.1 - - [10/Nov/2023 23:53:02] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:53:02,191 - INFO - 127.0.0.1 - - [10/Nov/2023 23:53:02] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:53:02,193 - INFO - Updated cluster state.


werkzeug - 2023-11-10 23:53:02,810 - INFO - 127.0.0.1 - - [10/Nov/2023 23:53:02] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:53:05,791 - INFO - 127.0.0.1 - - [10/Nov/2023 23:53:05] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:53:06,852 - INFO - 127.0.0.1 - - [10/Nov/2023 23:53:06] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:53:06,863 - INFO - 127.0.0.1 - - [10/Nov/2023 23:53:06] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:53:06,863 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:53:06,896 - INFO - 127.0.0.1 - - [10/Nov/2023 23:53:06] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:53:06,905 - INFO - 127.0.0.1 - - [10/Nov/2023 23:53:06] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:53:06,906 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:53:08,797 - INFO - 127.0.0.1 - - [10/Nov/2023 23:53:08] "PUT /default/jobs HTTP/1.1" 20
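
The fixed ten-iteration loop above prints the job regardless of its state. When a script needs to block until a job reaches a given status, a bounded polling helper is a common alternative. The sketch below takes any getter that returns a job dictionary with a `status.status` field, as in the responses shown earlier; `wait_for_status` and `fake_get` are hypothetical names, and the fake getter only stands in for something like `JobAPI(namespace='default').get`.

```python
import time
from typing import Callable, Iterable


def wait_for_status(get_job: Callable[[str], dict], name: str,
                    targets: Iterable[str], timeout_s: float = 60.0,
                    interval_s: float = 0.5) -> str:
    """Poll get_job(name) until its status is in targets or the timeout expires."""
    wanted = set(targets)
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_job(name)['status']['status']
        if status in wanted:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f'job {name!r} still {status!r} after {timeout_s}s')
        time.sleep(interval_s)


# Offline stand-in that reports INIT twice and then RUNNING.
_responses = iter(['INIT', 'INIT', 'RUNNING'])


def fake_get(name: str) -> dict:
    return {'metadata': {'name': name}, 'status': {'status': next(_responses)}}


print(wait_for_status(fake_get, 'sky-demo', {'RUNNING'}, interval_s=0.01))
```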

### Demo 2: Filter Policies

Filter policies constrain which clusters users' jobs can be submitted to.

In [16]:
# TODO: Filters on cluster labels (not just cluster name)
filter_policy = {
        'kind': 'FilterPolicy',
        'metadata': {
            'name': 'remove-mluo-cloud',
            'namespace': 'default',
        },
        'spec': {
            'clusterFilter': {
                'include': ['mluo-onprem', 'mluo-cloud', 'cloud-2'],
                'exclude': ['mluo-cloud'],
            },
            'labelsSelector': {
                'my_app': 'testing',
            }
        }
}

from sky_manager.api_client import FilterPolicyAPI

FilterPolicyAPI(namespace='default').create(filter_policy)

werkzeug - 2023-11-10 23:54:04,731 - INFO - 127.0.0.1 - - [10/Nov/2023 23:54:04] "POST /default/filterpolicies HTTP/1.1" 200 -


{'kind': 'FilterPolicy',
 'metadata': {'annotations': {},
  'labels': {},
  'name': 'remove-mluo-cloud',
  'namespace': 'default'},
 'spec': {'clusterFilter': {'exclude': ['mluo-cloud'],
   'include': ['mluo-onprem', 'mluo-cloud', 'cloud-2']},
  'labelsSelector': {'my_app': 'testing'}},
 'status': {'conditions': [{'createTime': '1699660444.7262862',
    'status': 'READY',
    'updateTime': '1699660444.7262862'}],
  'status': 'READY'}}
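
One plausible reading of the policy above is sketched here: the policy applies to jobs whose labels match every key/value pair in `labelsSelector`, and for those jobs the candidate clusters are the ones listed in `include` minus the ones listed in `exclude`. These semantics are assumptions inferred from the field names (in particular that `exclude` wins over `include`), and both function names are hypothetical; the sketch is not Sky Manager's filtering code.

```python
from typing import Dict, List


def policy_applies(policy: dict, job_labels: Dict[str, str]) -> bool:
    """True if the job carries every label listed in the policy's selector."""
    selector = policy['spec'].get('labelsSelector', {})
    return all(job_labels.get(key) == value for key, value in selector.items())


def filter_clusters(policy: dict, clusters: List[str]) -> List[str]:
    """Keep clusters that are included and not excluded (assumed semantics)."""
    cluster_filter = policy['spec']['clusterFilter']
    include = set(cluster_filter.get('include', clusters))
    exclude = set(cluster_filter.get('exclude', []))
    return [c for c in clusters if c in include and c not in exclude]


policy = {
    'kind': 'FilterPolicy',
    'metadata': {'name': 'remove-mluo-cloud', 'namespace': 'default'},
    'spec': {
        'clusterFilter': {
            'include': ['mluo-onprem', 'mluo-cloud', 'cloud-2'],
            'exclude': ['mluo-cloud'],
        },
        'labelsSelector': {'my_app': 'testing'},
    },
}

clusters = ['cluster-2', 'mluo-cloud', 'mluo-onprem', 'skycluster']
print(policy_applies(policy, {'my_app': 'testing'}))  # True
print(policy_applies(policy, {'testing': 'hello'}))   # False
print(filter_clusters(policy, clusters))              # ['mluo-onprem']
```

Under these assumptions, a job labeled `my_app: testing` can only land on `mluo-onprem`.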

werkzeug - 2023-11-10 23:54:05,874 - INFO - 127.0.0.1 - - [10/Nov/2023 23:54:05] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:54:06,887 - INFO - 127.0.0.1 - - [10/Nov/2023 23:54:06] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:54:06,899 - INFO - 127.0.0.1 - - [10/Nov/2023 23:54:06] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:54:06,900 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:54:06,916 - INFO - 127.0.0.1 - - [10/Nov/2023 23:54:06] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:54:06,924 - INFO - 127.0.0.1 - - [10/Nov/2023 23:54:06] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:54:06,925 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:54:08,881 - INFO - 127.0.0.1 - - [10/Nov/2023 23:54:08] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:54:11,984 - INFO - 127.0.0.1 - - [10/Nov/2023 23:54:11] "GET /clusters/mluo-cloud HTTP/

In [17]:
import uuid

from sky_manager.api_client import JobAPI

job_api = JobAPI(namespace='default')

# Job name with a short 8-character random suffix.
job_uuid = f'sky-{uuid.uuid4().hex[:8]}'


job_dict = {
    "kind": "Job",
    "metadata": {
      "name": job_uuid,
      "labels": {
        "my_app": "testing"
      }
    },
    "spec": {
      "image": "gcr.io/sky-burst/skyburst:latest",
      "resources": {
        "cpu": 1,
      },
      "run": "sleep 30"
    } 
}
print(job_api.create(job_dict))

{'kind': 'Job', 'metadata': {'annotations': {}, 'labels': {'my_app': 'testing'}, 'name': 'sky-bfd441fa', 'namespace': 'default'}, 'spec': {'image': 'gcr.io/sky-burst/skyburst:latest', 'replicas': 1, 'resources': {'cpu': 1}, 'run': 'sleep 30'}, 'status': {'clusters': None, 'conditions': [{'createTime': '1699660480.6871684', 'status': 'INIT', 'updateTime': '1699660480.6871684'}], 'status': 'INIT'}}


werkzeug - 2023-11-10 23:54:40,692 - INFO - 127.0.0.1 - - [10/Nov/2023 23:54:40] "POST /default/jobs HTTP/1.1" 200 -


werkzeug - 2023-11-10 23:54:40,702 - INFO - 127.0.0.1 - - [10/Nov/2023 23:54:40] "GET /default/filterpolicies HTTP/1.1" 200 -
[mluo-onprem - Flow Controller] - 2023-11-10 23:54:40,713 - INFO - Submitting job 'sky-bfd441fa' to cluster 'mluo-onprem'.
werkzeug - 2023-11-10 23:54:40,717 - INFO - 127.0.0.1 - - [10/Nov/2023 23:54:40] "PUT /default/jobs HTTP/1.1" 200 -
[Scheduler Controller] - 2023-11-10 23:54:40,718 - INFO - Sending job sky-bfd441fa to clusters {'mluo-onprem': 1}.
werkzeug - 2023-11-10 23:54:40,774 - INFO - 127.0.0.1 - - [10/Nov/2023 23:54:40] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:54:41,990 - INFO - 127.0.0.1 - - [10/Nov/2023 23:54:41] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:54:42,026 - INFO - 127.0.0.1 - - [10/Nov/2023 23:54:42] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:54:42,036 - INFO - 127.0.0.1 - - [10/Nov/2023 23:54:42] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:5

In [18]:
for _ in range(10):
    os.system(f'skym get job {job_uuid}')
    time.sleep(0.5)

werkzeug - 2023-11-10 23:55:04,318 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:04] "GET /default/jobs/sky-bfd441fa HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-bfd441fa  mluo-onprem           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:55:05,405 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:05] "GET /default/jobs/sky-bfd441fa HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-bfd441fa  mluo-onprem           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:55:05,911 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:05] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:55:05,944 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:05] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:55:06,481 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:06] "GET /default/jobs/sky-bfd441fa HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-bfd441fa  mluo-onprem           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:55:06,975 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:06] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:55:06,985 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:06] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:55:06,986 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:55:06,988 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:06] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:55:06,995 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:06] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:55:06,996 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:55:07,566 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:07] "GET /default/jobs/sky-bfd441fa HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-bfd441fa  mluo-onprem           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:55:08,638 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:08] "GET /default/jobs/sky-bfd441fa HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-bfd441fa  mluo-onprem           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:55:08,917 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:08] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:55:08,947 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:08] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:55:09,733 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:09] "GET /default/jobs/sky-bfd441fa HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-bfd441fa  mluo-onprem           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:55:10,839 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:10] "GET /default/jobs/sky-bfd441fa HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-bfd441fa  mluo-onprem           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:55:11,947 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:11] "GET /default/jobs/sky-bfd441fa HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:55:12,001 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:12] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:55:12,057 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:12] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:55:12,062 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:12] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:55:12,071 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:12] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:55:12,073 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:55:12,081 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:12] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:55:12,126 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:12] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:55:12,127 - INFO - Update

Name          Cluster        Replicas  Resources    Namespace    Status
sky-bfd441fa  mluo-onprem           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:55:13,065 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:13] "GET /default/jobs/sky-bfd441fa HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-bfd441fa  mluo-onprem           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:55:14,141 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:14] "GET /default/jobs/sky-bfd441fa HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-bfd441fa  mluo-onprem           1  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:55:14,925 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:14] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:55:14,955 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:14] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:55:16,953 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:16] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:55:16,960 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:16] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:55:16,965 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:16] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:55:16,967 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:55:16,974 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:16] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:55:16,975 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:55:17,926 - INFO - 127.0.0.1 - - [10/Nov/2023 23:55:17] "PUT /default/jobs HTTP/1.1" 20

### Demo 3: Spreading a job's/deployment's replicas across clusters.
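
The scheduler logs in this demo show one job's replicas being split across clusters, for example `{'mluo-cloud': 2, 'mluo-onprem': 2}`. How Sky Manager chooses that split is not shown in this notebook. The sketch below is one plausible policy, not the actual implementation: a greedy spread that hands out replicas one at a time to the cluster with the most free CPU. The function `spread_replicas`, its arguments, and the free-CPU numbers are all hypothetical.

```python
from typing import Dict


def spread_replicas(num_replicas: int, cpu_per_replica: float,
                    free_cpu: Dict[str, float]) -> Dict[str, int]:
    """Greedily assign replicas to whichever cluster has the most free CPU.

    Hypothetical helper for illustration; not part of sky_manager.
    """
    remaining = dict(free_cpu)  # Copy so the caller's dict is not mutated.
    placement: Dict[str, int] = {}
    for _ in range(num_replicas):
        # Pick the cluster with the most free CPU (ties broken by name).
        name = max(sorted(remaining), key=lambda c: remaining[c])
        if remaining[name] < cpu_per_replica:
            raise ValueError("Not enough free CPU to place all replicas.")
        remaining[name] -= cpu_per_replica
        placement[name] = placement.get(name, 0) + 1
    return placement


# Free CPU values are made up for illustration.
print(spread_replicas(4, 1.0, {"mluo-cloud": 5.0, "mluo-onprem": 3.0}))
```

A greedy policy like this keeps clusters balanced by remaining capacity. A real scheduler would likely also account for memory, GPUs, and any FilterPolicies that apply to the job.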

In [21]:
import uuid

from sky_manager.api_client import JobAPI

job_api = JobAPI(namespace='default')

# Short, unique job name: 'sky-' plus the first 8 hex characters of a UUID.
job_uuid = 'sky-' + uuid.uuid4().hex[:8]
num_replicas = 4

# Job spec: 4 replicas, each requesting 1 CPU and running `sleep 30`.
job_dict = {
    "kind": "Job",
    "metadata": {
        "name": job_uuid,
    },
    "spec": {
        "replicas": num_replicas,
        "image": "gcr.io/sky-burst/skyburst:latest",
        "resources": {
            "cpu": 1,
        },
        "run": "sleep 30",
    },
}
print(job_api.create(job_dict))

{'kind': 'Job', 'metadata': {'annotations': {}, 'labels': {}, 'name': 'sky-e76f7a2c', 'namespace': 'default'}, 'spec': {'image': 'gcr.io/sky-burst/skyburst:latest', 'replicas': 4, 'resources': {'cpu': 1}, 'run': 'sleep 30'}, 'status': {'clusters': None, 'conditions': [{'createTime': '1699583487.677352', 'status': 'INIT', 'updateTime': '1699583487.677352'}], 'status': 'INIT'}}


werkzeug - 2023-11-10 02:31:27,683 - INFO - 127.0.0.1 - - [10/Nov/2023 02:31:27] "POST /default/jobs HTTP/1.1" 200 -


werkzeug - 2023-11-10 02:31:27,691 - INFO - 127.0.0.1 - - [10/Nov/2023 02:31:27] "GET /default/filterpolicies HTTP/1.1" 200 -
werkzeug - 2023-11-10 02:31:27,699 - INFO - 127.0.0.1 - - [10/Nov/2023 02:31:27] "PUT /default/jobs HTTP/1.1" 200 -
[Scheduler Controller] - 2023-11-10 02:31:27,700 - INFO - Sending job sky-e76f7a2c to clusters {'mluo-cloud': 3, 'mluo-onprem': 1}.
[mluo-cloud - Flow Controller] - 2023-11-10 02:31:27,701 - INFO - Submitting job 'sky-e76f7a2c' to cluster 'mluo-cloud'.
[mluo-onprem - Flow Controller] - 2023-11-10 02:31:27,706 - INFO - Submitting job 'sky-e76f7a2c' to cluster 'mluo-onprem'.
werkzeug - 2023-11-10 02:31:27,764 - INFO - 127.0.0.1 - - [10/Nov/2023 02:31:27] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 02:31:27,776 - INFO - 127.0.0.1 - - [10/Nov/2023 02:31:27] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 02:31:28,799 - INFO - 127.0.0.1 - - [10/Nov/2023 02:31:28] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 02:31:28,888

{'kind': 'Job', 'metadata': {'annotations': {}, 'labels': {}, 'name': 'sky-93788a95', 'namespace': 'default'}, 'spec': {'image': 'gcr.io/sky-burst/skyburst:latest', 'replicas': 4, 'resources': {'cpu': 1}, 'run': 'sleep 30'}, 'status': {'clusters': None, 'conditions': [{'createTime': '1699660617.2664027', 'status': 'INIT', 'updateTime': '1699660617.2664027'}], 'status': 'INIT'}}


werkzeug - 2023-11-10 23:56:57,272 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:57] "POST /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:56:57,280 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:57] "GET /default/filterpolicies HTTP/1.1" 200 -


werkzeug - 2023-11-10 23:56:57,289 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:57] "PUT /default/jobs HTTP/1.1" 200 -
[mluo-onprem - Flow Controller] - 2023-11-10 23:56:57,291 - INFO - Submitting job 'sky-93788a95' to cluster 'mluo-onprem'.
[mluo-cloud - Flow Controller] - 2023-11-10 23:56:57,291 - INFO - Submitting job 'sky-93788a95' to cluster 'mluo-cloud'.
[Scheduler Controller] - 2023-11-10 23:56:57,290 - INFO - Sending job sky-93788a95 to clusters {'mluo-cloud': 2, 'mluo-onprem': 2}.
werkzeug - 2023-11-10 23:56:57,349 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:57] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:56:57,355 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:57] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:00,076 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:00] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:00,081 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:00] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:00,092 - INFO - 

In [20]:
!kubectl get pods --context mluo-cloud

To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
NAME                 READY   STATUS              RESTARTS   AGE
sky-165c0f23-fj9hs   0/1     ContainerCreating   0          4m2s


werkzeug - 2023-11-10 23:56:42,102 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:42] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:56:42,110 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:42] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:56:42,131 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:42] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:56:42,134 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:56:42,212 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:42] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:56:42,232 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:42] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:56:42,233 - INFO - Updated cluster state.


werkzeug - 2023-11-10 23:56:45,062 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:45] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:56:47,032 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:47] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:56:47,041 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:47] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:56:47,042 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:56:47,044 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:47] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:56:47,053 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:47] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:56:47,054 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:56:48,060 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:48] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:56:51,064 - INFO - 127.0.0.1 - - [10/Nov/2023 23:56:51] "PUT /default/jobs HTTP/1.1" 20

In [22]:
import time

# Poll the job status ten times, pausing 0.5 s between calls.
for _ in range(10):
    os.system(f'skym get job {job_uuid}')
    time.sleep(0.5)

werkzeug - 2023-11-10 23:57:14,319 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:14] "GET /default/jobs/sky-93788a95 HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-93788a95  mluo-cloud            2  cpu: 1       default      RUNNING
sky-93788a95  mluo-onprem           2  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:57:15,110 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:15] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:15,114 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:15] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:15,122 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:15] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:15,429 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:15] "GET /default/jobs/sky-93788a95 HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-93788a95  mluo-cloud            2  cpu: 1       default      RUNNING
sky-93788a95  mluo-onprem           2  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:57:16,539 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:16] "GET /default/jobs/sky-93788a95 HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-93788a95  mluo-cloud            2  cpu: 1       default      RUNNING
sky-93788a95  mluo-onprem           2  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:57:17,066 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:17] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:17,075 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:17] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:57:17,076 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:57:17,107 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:17] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:17,116 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:17] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:57:17,117 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:57:17,676 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:17] "GET /default/jobs/sky-93788a95 HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-93788a95  mluo-cloud            2  cpu: 1       default      RUNNING
sky-93788a95  mluo-onprem           2  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:57:18,096 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:18] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:18,104 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:18] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:18,114 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:18] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:18,769 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:18] "GET /default/jobs/sky-93788a95 HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-93788a95  mluo-cloud            2  cpu: 1       default      RUNNING
sky-93788a95  mluo-onprem           2  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:57:19,855 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:19] "GET /default/jobs/sky-93788a95 HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-93788a95  mluo-cloud            2  cpu: 1       default      RUNNING
sky-93788a95  mluo-onprem           2  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:57:20,930 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:20] "GET /default/jobs/sky-93788a95 HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:21,099 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:21] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:21,103 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:21] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:21,113 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:21] "PUT /default/jobs HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-93788a95  mluo-cloud            2  cpu: 1       default      RUNNING
sky-93788a95  mluo-onprem           2  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:57:22,012 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:22] "GET /default/jobs/sky-93788a95 HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:22,160 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:22] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:22,170 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:22] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:57:22,171 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:57:22,178 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:22] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:22,187 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:22] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:57:22,187 - INFO - Updated cluster state.


Name          Cluster        Replicas  Resources    Namespace    Status
sky-93788a95  mluo-cloud            2  cpu: 1       default      RUNNING
sky-93788a95  mluo-onprem           2  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:57:23,101 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:23] "GET /default/jobs/sky-93788a95 HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-93788a95  mluo-cloud            2  cpu: 1       default      RUNNING
sky-93788a95  mluo-onprem           2  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:57:24,109 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:24] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:24,118 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:24] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:24,128 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:24] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:24,194 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:24] "GET /default/jobs/sky-93788a95 HTTP/1.1" 200 -


Name          Cluster        Replicas  Resources    Namespace    Status
sky-93788a95  mluo-cloud            2  cpu: 1       default      RUNNING
sky-93788a95  mluo-onprem           2  cpu: 1       default      RUNNING


werkzeug - 2023-11-10 23:57:27,068 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:27] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:27,089 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:27] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:57:27,097 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:57:27,103 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:27] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:27,134 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:27] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:27,146 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:27] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:57:27,147 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:57:27,147 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:27] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:27,157 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:27] "PUT /default/jobs HTTP/1.1" 20
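
The loop in cell `In [22]` polls a fixed number of times regardless of the job's state. As an alternative, here is a minimal sketch of polling until a target status appears or a timeout expires. `wait_for_status` and `fetch_status` are hypothetical names. In practice `fetch_status` would wrap whatever call returns the job's current status, and the example below substitutes a canned sequence so it runs without a live API server.

```python
import time
from typing import Callable, Iterable


def wait_for_status(fetch_status: Callable[[], str],
                    targets: Iterable[str],
                    timeout_s: float = 30.0,
                    interval_s: float = 0.5) -> str:
    """Call fetch_status() until it returns one of targets or time runs out."""
    targets = set(targets)
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in targets:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Last status was {status!r}.")
        time.sleep(interval_s)


# Stand-in for a real status lookup: replays a fixed sequence of statuses.
fake_statuses = iter(["INIT", "RUNNING", "RUNNING"])
print(wait_for_status(lambda: next(fake_statuses), {"RUNNING"}, interval_s=0.01))
```

Using `time.monotonic()` for the deadline keeps the timeout correct even if the system clock is adjusted while polling.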

In [23]:
!kubectl get pods --context mluo-cloud

To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
NAME                 READY   STATUS              RESTARTS   AGE
sky-165c0f23-fj9hs   0/1     Completed           0          5m3s
sky-93788a95-769cm   0/1     ContainerCreating   0          46s
sky-93788a95-dd5mv   0/1     ContainerCreating   0          46s


werkzeug - 2023-11-10 23:57:45,123 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:45] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:45,148 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:45] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:47,094 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:47] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:47,102 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:47] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:57:47,103 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:57:47,123 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:47] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:47,132 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:47] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:57:47,133 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:57:48,125 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:48] "PUT /default/jobs HTTP/1.1" 20

In [24]:
!kubectl get pods --context mluo-onprem

werkzeug - 2023-11-10 23:57:57,119 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:57] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:57,131 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:57] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:57:57,144 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:57:57,171 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:57] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:57,303 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:57] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:57,310 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:57] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:57:57,321 - INFO - 127.0.0.1 - - [10/Nov/2023 23:57:57] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:57:57,322 - INFO - Updated cluster state.


To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
NAME                                         READY   STATUS      RESTARTS   AGE
frontend-67c6b84d49-dmskz                    1/1     Running     0          6d19h
skupper-prometheus-5df4949b9c-pprxc          1/1     Running     0          6d19h
skupper-router-7c54d5bcf-q6x2k               2/2     Running     0          6d19h
skupper-service-controller-c9f6bfb98-k2ln9   2/2     Running     0          6d19h
sky-05453620-4f5kv                           0/1     Completed   0          6m33s
sky-2b448d64-gkm7n                           0/1     Completed   0          25h
sky-4f49eacc-t8c6n                           0/1     Completed   0          24h
sky-93788a95-fgfcg                           0/1     Completed   0          60s
sky-93788a95-l4lvb                           0/1     Completed   0          60s
sky-b4fa0127-ccl4s                           0/1     Completed   0          25

werkzeug - 2023-11-10 23:58:00,149 - INFO - 127.0.0.1 - - [10/Nov/2023 23:58:00] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:58:00,164 - INFO - 127.0.0.1 - - [10/Nov/2023 23:58:00] "PUT /default/jobs HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:58:02,278 - INFO - 127.0.0.1 - - [10/Nov/2023 23:58:02] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:58:02,307 - INFO - 127.0.0.1 - - [10/Nov/2023 23:58:02] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 23:58:02,309 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:58:02,411 - INFO - 127.0.0.1 - - [10/Nov/2023 23:58:02] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 23:58:02,424 - INFO - 127.0.0.1 - - [10/Nov/2023 23:58:02] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 23:58:02,426 - INFO - Updated cluster state.
werkzeug - 2023-11-10 23:58:03,193 - INFO - 127.0.0.1 - - [10/Nov/2023 23:58:03] "PUT /default/jobs HTTP/1.1" 20

In [17]:
!skym get clusters

Name         Manager     Resources                            Status
cluster-2    kubernetes  {}                                   INIT
mluo-cloud   k8          cpu: 5.79/6.0                        READY
                         gpu: 0/0
                         memory: 18107.0234375/23864.0234375
mluo-onprem  k8          cpu: 3.86/4.0                        READY
                         gpu: 0/0
                         memory: 12071.34375/15909.34375
skycluster   k8          {}                                   INIT


werkzeug - 2023-11-10 02:30:34,911 - INFO - 127.0.0.1 - - [10/Nov/2023 02:30:34] "GET /clusters HTTP/1.1" 200 -


werkzeug - 2023-11-10 02:30:36,748 - INFO - 127.0.0.1 - - [10/Nov/2023 02:30:36] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 02:30:36,752 - INFO - 127.0.0.1 - - [10/Nov/2023 02:30:36] "GET /clusters/mluo-cloud HTTP/1.1" 200 -
werkzeug - 2023-11-10 02:30:36,756 - INFO - 127.0.0.1 - - [10/Nov/2023 02:30:36] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 02:30:36,757 - INFO - Updated cluster state.
werkzeug - 2023-11-10 02:30:36,764 - INFO - 127.0.0.1 - - [10/Nov/2023 02:30:36] "PUT /clusters HTTP/1.1" 200 -
[mluo-cloud - Cluster Controller] - 2023-11-10 02:30:36,765 - INFO - Updated cluster state.
werkzeug - 2023-11-10 02:30:41,735 - INFO - 127.0.0.1 - - [10/Nov/2023 02:30:41] "GET /clusters/mluo-onprem HTTP/1.1" 200 -
werkzeug - 2023-11-10 02:30:41,744 - INFO - 127.0.0.1 - - [10/Nov/2023 02:30:41] "PUT /clusters HTTP/1.1" 200 -
[mluo-onprem - Cluster Controller] - 2023-11-10 02:30:41,745 - INFO - Updated cluster state.
werkzeug - 2023