Design Considerations Notes

The notes below collect considerations and requirements for the saga-pilot design, in no particular order.

API
- TROY is considered a primary API consumer
- it should be easy to support a PilotAPI wrapper for backward compatibility
CLI
- this is a second order concern, and derives from the API
flexibility
- support several usage types
  - BigJob-like deployment, i.e. as a module bound to the application
  - AIMES / Condor / Panda-like deployment, i.e. as a community service
- support production use
- support research (scheduling, middleware, infrastructure level)
- possibly support different language bindings, at least for the service-type deployment
stability
- agent deployment should be simple and stable
- error reporting should be intuitive, in particular for deployment and bootstrapping errors
ease of use
implementability
- we have finite resources, and need to consciously pick the points where we accept complexity
modularity
- apart from the different deployment modes, possibly support pluggable schedulers, pluggable coordination layers, pluggable (and configurable) agents.
- agents in particular should be very lightweight, and have limited dependencies
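
One way to read the "pluggable schedulers" item is as a small registry that maps a name to a scheduling function. The sketch below is only an illustration under that assumption: `SCHEDULERS`, `register_scheduler`, `schedule`, and the two example policies are hypothetical names, not part of any existing saga-pilot API.

```python
from typing import Callable, Dict, List

# Hypothetical registry: scheduler name -> function(units, pilots) -> assignment.
SchedulerFn = Callable[[List[str], List[str]], Dict[str, str]]
SCHEDULERS: Dict[str, SchedulerFn] = {}


def register_scheduler(name: str):
    """Decorator that adds a scheduling function to the registry."""
    def decorator(func: SchedulerFn) -> SchedulerFn:
        SCHEDULERS[name] = func
        return func
    return decorator


@register_scheduler("round_robin")
def round_robin(units: List[str], pilots: List[str]) -> Dict[str, str]:
    """Assign units to pilots in turn."""
    if not pilots:
        raise ValueError("no pilots available")
    return {unit: pilots[i % len(pilots)] for i, unit in enumerate(units)}


@register_scheduler("first_pilot")
def first_pilot(units: List[str], pilots: List[str]) -> Dict[str, str]:
    """Assign every unit to the first pilot (a trivial baseline policy)."""
    if not pilots:
        raise ValueError("no pilots available")
    return {unit: pilots[0] for unit in units}


def schedule(name: str, units: List[str], pilots: List[str]) -> Dict[str, str]:
    """Look up a scheduler by name and apply it."""
    if name not in SCHEDULERS:
        raise ValueError(f"unknown scheduler {name!r}; known: {sorted(SCHEDULERS)}")
    return SCHEDULERS[name](units, pilots)
```

With such a registry, a research scheduler could be swapped in by name without touching the calling code, for example `schedule("round_robin", ["cu1", "cu2", "cu3"], ["p1", "p2"])`.
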
performance
- saga-pilot should scale up (number of CUs, number of agents per backend)
- saga-pilot should scale out (number of agents across backends)
- notifications should work throughout the whole stack
- development should be driven by benchmarks (see Performance Tests)
inspection
- logging, inspection and/or auditing are required on all levels, for
  - operation transparency for the end user
  - support of development
  - support of experiments
  - support of SAGA-Pilot level and higher level schedulers
security
- are multi-user pilots needed for the AIMES and gateway use cases? What exactly does multi-user mean in those contexts?
- credential delegation is needed for pilot-initiated data transfer operations
data
- classic stage-in/stage-out directives on CUs will be needed, as they allow automated file movement for large numbers of jobs
- stdout/stderr streaming would be good for interactivity and debugging -- possibly via a pub-sub mechanism
- listing a CU's working directory: should this return a list of data access URLs? This would be in line with general CU inspection capabilities
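
To make the staging and listing items more concrete, here is a minimal sketch of how stage-in/stage-out directives might be attached to a CU description, and how a working directory listing could be returned as data access URLs. All names (`StagingDirective`, `ComputeUnitDescription`, `working_dir_urls`) and the example URLs are assumptions for illustration, not an existing saga-pilot interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StagingDirective:
    """Hypothetical directive: copy `source` to `target` before or after a CU runs."""
    source: str
    target: str


@dataclass
class ComputeUnitDescription:
    """Hypothetical CU description carrying its own staging directives."""
    executable: str
    stage_in: List[StagingDirective] = field(default_factory=list)
    stage_out: List[StagingDirective] = field(default_factory=list)


def working_dir_urls(base_url: str, filenames: List[str]) -> List[str]:
    """Turn file names in a CU working directory into data access URLs."""
    base = base_url.rstrip("/")
    return [f"{base}/{name}" for name in sorted(filenames)]


# Directives allow file movement to be described once per CU and then
# generated automatically for a large number of jobs.
descriptions = [
    ComputeUnitDescription(
        executable="/bin/cat",
        stage_in=[StagingDirective(f"input_{i}.dat", "input.dat")],
        stage_out=[StagingDirective("output.dat", f"output_{i}.dat")],
    )
    for i in range(3)
]
```

Returning URLs rather than bare file names would let a caller fetch the files with whatever transfer mechanism it already uses.
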
provenance
- structured recording of what happened, where, by whom, etc. (also related to performance metrics)
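
As an illustration of what "structured recording" could look like, the sketch below appends one JSON record per event with the what/where/who fields named above plus a timestamp. The `ProvenanceEvent` type, the `record` helper, and the field values are hypothetical; the notes do not prescribe a format.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ProvenanceEvent:
    """Hypothetical provenance record: what happened, where, and who caused it."""
    what: str
    where: str
    who: str
    timestamp: float


def record(log: List[str], what: str, where: str, who: str,
           timestamp: Optional[float] = None) -> ProvenanceEvent:
    """Append one event to `log` as a JSON line and return the event."""
    when = time.time() if timestamp is None else timestamp
    event = ProvenanceEvent(what=what, where=where, who=who, timestamp=when)
    log.append(json.dumps(asdict(event), sort_keys=True))
    return event


# Usage: an in-memory list stands in for a log file or database here.
events: List[str] = []
record(events, "cu_started", "pilot.0001", "agent", timestamp=1.0)
```

Because each record carries a timestamp, the same log could also feed performance metrics, for example by computing the time between two events for one CU.
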

NOT considered to be design constraints are:

dynamically resizable pilots
- this is considered to be a corner use case which introduces significant agent complexity -- which we want to avoid for various reasons, see above
