Overview

Mission Statement

The goal of saga-pilot is to develop a new pilot framework implementation that can be used:

  1. by the RADICAL group to design and execute distributed computing experiments
  2. by production users to run large-scale workloads across national and international cyber-infrastructures

Requirements

Functional Requirements

  • suitable as underpinning pilot layer for AIMES (and thus TROY)
  • suitable as pilot framework for production infrastructures, while operating fully in user space
  • MUST USE SAGA-python
  • pilots:
    • create
    • manage
    • list/...
    • assign to queues
  • compute units:
    • submit to queue
    • ordering dependencies (see the sketch after this list):
      • [CU_1 || CU_2] (unordered)
      • [CU_1 && CU_2] (concurrency)
      • [CU_1] -> [DU_2] (logical dependency)
      • [CU_1] -> [CU_2 && CU_3] -> [CU_4 || CU_5] (combinations)
  • inspection:
    • on pilots
    • on units
    • on queues
    • on actions
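
To make the pilot / queue / compute-unit operations above concrete, here is a minimal application-side sketch. All names (the `sagapilot` module, `PilotManager`, `Queue`, `ComputeUnit`, and the `concurrent` / `after` ordering helpers) are illustrative assumptions, not a fixed interface:

```python
# Hypothetical application-side usage -- a sketch, not the real API.
import sagapilot

pm    = sagapilot.PilotManager()
pilot = pm.create_pilot(resource='stampede.tacc.utexas.edu', cores=16)

queue = sagapilot.Queue()
queue.assign(pilot)                        # pilot now serves this queue

cu_1 = sagapilot.ComputeUnit(executable='/bin/echo', arguments=['hello'])
cu_2 = sagapilot.ComputeUnit(executable='/bin/echo', arguments=['world'])

# unordered: [CU_1 || CU_2]
queue.submit([cu_1, cu_2])

# concurrency: [CU_1 && CU_2] -- both units run at the same time
queue.submit(sagapilot.concurrent([cu_1, cu_2]))

# logical dependency: [CU_1] -> [CU_2]
queue.submit(sagapilot.after(cu_1, cu_2))

# inspection on pilots, units, and queues
print(pilot.state, cu_1.state, queue.list_units())
```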

Non-Functional Requirements

  • code simple (as simple as possible, but no simpler)
  • easy to maintain
  • support different deployment modes
    • single user / single application: modules bound to application
    • user community, multiple applications: services with user / application sessions
  • cleanly extensible toward
    • different backends (pluggable agents)
    • different workloads (pluggable schedulers) -- see the sketch after this list
  • compatibility with Pilot-API
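
One way such pluggability could stay simple is a name-to-class registry for schedulers (the same pattern would serve pluggable agents). The `register_scheduler` decorator and `RoundRobinScheduler` below are an assumed design sketch, not existing code:

```python
# Hypothetical plugin registry for pluggable schedulers -- a sketch only.
_SCHEDULERS = {}

def register_scheduler(name):
    """Class decorator: make a scheduler selectable by name."""
    def _wrap(cls):
        _SCHEDULERS[name] = cls
        return cls
    return _wrap

@register_scheduler('round_robin')
class RoundRobinScheduler:
    """Cycle compute units over the pilots assigned to a queue."""
    def __init__(self):
        self._next = 0

    def place(self, unit, pilots):
        pilot = pilots[self._next % len(pilots)]
        self._next += 1
        return pilot

def get_scheduler(name):
    return _SCHEDULERS[name]()     # e.g. get_scheduler('round_robin')
```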

Excluded Requirements

  • no pilot placement intelligence (the pilot creation request fully constrains placement)
  • no unit placement intelligence (once submitted to a queue, the unit can transparently run on any pilot on that queue)
  • no unit co-location -- units within the same queue are considered fully co-located!
  • scheduler on queue is not exposed
  • no multi-user pilots (they are mostly useless without a mapping to system accounting and without sandboxing, both of which would require system-layer functionality)
  • no credential delegation (this seems too hard for us to attempt at the moment; however, we will need other means to solve the data transfer problem, and we will use session handles)
  • no direct support for workflows (some schedulers might interpret task dependencies though)
  • no dynamically resizable pilots (considered to be a corner use case which introduces significant agent complexity)

Scalability Requirements

  • Scale Up: per pilot, use the maximum number of cores available on the largest XSEDE machine (Stampede: 10k cores, by request) concurrently (i.e., 10k tasks), without significant overhead.
  • Scale Out: 10k concurrent tasks on ~1000 small pilots (target use case: OSG & clouds, 4-12 cores per agent), without significant overhead.
  • Scale Across: 20k concurrent tasks in a hybrid up/out configuration.
  • Framework Scalability: in multi-user mode, 10 different backends and 10 concurrent users, with 10 concurrent applications each.

Architecture

  • component architecture:

    • split the pilot manager and the queue manager into separate modules/services
    • separate notification/callback system for state updates etc. (supports flexible information flow to/from pilots, service, applications)
    • Queue as the central concept for CU management ('Queue' in the sense in which queuing systems use the term, not in the programming-abstraction sense)
  • layer architecture

    • separate communication / protocol / semantic layers
    • communication layer
      • persistent/resilient connectivity backbone (needed due to NAT- and firewall-infested environments)
      • network of tornado-based services (see the sketch below)
      • registry-based bootstrapping for network setup/configuration
      • network can change (expand / shrink) dynamically
      • each service can forward messages (to circumvent firewall/connectivity restrictions)
    • coordination layer
      • lightweight protocol layer, open for bulk operations, async ops, and notifications -- at scale (most likely REST)
      • push-, pull-, or store-based communication, configurable
    • semantic layer
      • locality transparency (service locality, agent locality, etc)
      • each network service can be configured as pilot service / queue service / info service / agent
      • multiple services are either independent (serving different apps / sessions) or cooperating (for scalability)
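
As an illustration of such a service node, a minimal tornado application could handle a message locally or forward it to a peer learned from the registry. The `/msg` route, the registry layout, and the forwarding rule are all assumptions made for this sketch:

```python
# One node of the hypothetical tornado-based service network -- a sketch.
import tornado.ioloop
import tornado.web

# registry-based bootstrapping, reduced here to a static dict (assumption)
REGISTRY = {'peer': 'http://next-hop.example.org:8888'}

class MessageHandler(tornado.web.RequestHandler):
    def post(self):
        target = self.get_argument('target')
        if target == 'local':
            self.write('handled locally')
        else:
            # forward toward the target to circumvent firewall / NAT
            # restrictions (real code would use an async HTTP client)
            self.write('forwarded via %s' % REGISTRY['peer'])

def make_app():
    return tornado.web.Application([(r'/msg', MessageHandler)])

if __name__ == '__main__':
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()
```
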
  • interfaces:

    • input
      • abstract, parametrized DAG of fine-grained application patterns (see the sketch after this list)
      • manipulation of DAG (*)
      • information about resources
      • information about the pilot layer (queues, pilots, units)
    • output
      • create pilot
      • create queue
      • assign pilot to queue
      • submit CU/DU to queue
      • manage state of those entities
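
As a sketch of the input side, the abstract parametrized DAG could be as simple as a dict of nodes and dependency edges; the node attributes and the `ready_nodes` helper below are assumptions for illustration:

```python
# Hypothetical input DAG of application patterns -- a sketch only.
dag = {
    'nodes': {
        'prepare' : {'pattern': 'preprocess', 'cores': 1},
        'simulate': {'pattern': 'md_step',    'cores': 16, 'replicas': 8},
        'analyze' : {'pattern': 'reduce',     'cores': 4},
    },
    'edges': [
        ('prepare',  'simulate'),    # simulate depends on prepare
        ('simulate', 'analyze'),     # analyze depends on simulate
    ],
}

def ready_nodes(dag, done):
    """Nodes whose dependencies are satisfied -- candidates for
    translation into CU/DU submissions on a queue."""
    blocked = {dst for (src, dst) in dag['edges'] if src not in done}
    return [n for n in dag['nodes'] if n not in done and n not in blocked]

print(ready_nodes(dag, done=set()))         # ['prepare']
print(ready_nodes(dag, done={'prepare'}))   # ['simulate']
```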

(*) One can consider two routes to support use cases where the application layer wants to change the application configuration at run time (e.g., add more replicas in an RE use case, based on resource availability):

  • (a) leave those decisions completely to Troy
  • (b) allow the application to change a submitted DAG

(a) requires deep application knowledge within Troy (what is a replica in RE? Can I add one if resources are available? Why would I?) -- so that is not favorable, IMHO. (b) requires that the submitted DAG is kept as an interface representation, that the application can inspect it (how was a node expanded? what is its state / performance?), and that the application can change it (expand a node further, etc.).
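
Continuing the hypothetical DAG sketch above, route (b) could look roughly like this: the application inspects a node of the submitted DAG and expands it at run time (e.g., adds replicas when resources free up). `expand_replicas` is an assumed helper, not a defined interface:

```python
# Hypothetical run-time manipulation of a submitted DAG (route (b)).
# 'dag' shaped as in the input sketch above (assumption):
dag = {'nodes': {'simulate': {'pattern': 'md_step', 'cores': 16, 'replicas': 8}}}

def expand_replicas(dag, node, extra):
    """Raise the replica count of a node in the submitted DAG."""
    dag['nodes'][node]['replicas'] += extra

# inspect: how is the node currently expanded?
print(dag['nodes']['simulate']['replicas'])  # 8

expand_replicas(dag, 'simulate', extra=2)    # e.g. two more RE replicas
print(dag['nodes']['simulate']['replicas'])  # 10
```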

Diagrams
