Design Considerations Notes

Mark Santcroos edited this page Aug 6, 2013 · 2 revisions

The notes below loosely collect considerations and requirements for the saga-pilot design. The list is in no particular order.

  • API
    • TROY is considered a primary API consumer
    • it should be easy to support a PilotAPI wrapper for backward compatibility
  • CLI
    • this is a second-order concern, and derives from the API
  • flexibility
    • support several usage types
      • BigJob-like deployment, i.e. as a module bound to the application
      • AIMES / Condor / Panda-like deployment, i.e. as a community service
    • support production use
    • support research (scheduling, middleware, infrastructure level)
    • possibly support different language bindings, at least for the service-type deployment
  • stability
    • agent deployment should be simple and stable
    • error reporting should be intuitive, in particular for deployment and bootstrapping errors
  • ease of use
  • implementability
    • we have finite resources, and need to consciously pick the points where we accept complexity
  • modularity
    • apart from the different deployment modes, possibly support pluggable schedulers, pluggable coordination layers, and pluggable (and configurable) agents
    • agents in particular should be very lightweight, and have limited dependencies
  • performance
    • saga-pilot should scale up (number of CUs, number of agents per backend)
    • saga-pilot should scale out (number of agents across backends)
    • notifications should work throughout the whole stack
    • development should be driven by benchmarks (see Performance Tests)
  • inspection
    • logging, inspection and/or auditing are required on all levels, for
      • operation transparency for the end user
      • support of development
      • support of experiments
      • support of SAGA-Pilot level and higher level schedulers
  • security
    • are multi-user pilots needed for the AIMES and gateway use cases? What exactly does multi-user mean in those contexts?
    • credential delegation is needed for pilot-initiated data transfer operations
  • data
    • classic stage-in/stage-out directives on CUs will be needed, as they allow automated file movement for large numbers of jobs
    • stdout/stderr streaming would help interactivity and debugging, possibly via a pub-sub mechanism
    • should listing a CU's working directory return a list of data access URLs? This would be in line with general CU inspection capabilities
  • provenance
    • structured recording of what happened, where, by whom, etc. (also related to performance metrics)
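
As an illustration of the modularity and data points above, here is a minimal Python sketch of what a pluggable scheduler registry and a CU description with stage-in/stage-out directives could look like. Nothing here is the saga-pilot API: `ComputeUnit`, `register_scheduler`, `get_scheduler`, `round_robin`, the pilot IDs and the file names are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ComputeUnit:
    """Hypothetical CU description carrying classic staging directives."""
    executable: str
    stage_in: List[str] = field(default_factory=list)   # files to copy in before the run
    stage_out: List[str] = field(default_factory=list)  # files to copy back after the run


# Assumption: a scheduler is any callable mapping (units, pilot ids)
# to a dict of pilot id -> list of units assigned to that pilot.
Scheduler = Callable[[List[ComputeUnit], List[str]], Dict[str, List[ComputeUnit]]]

_SCHEDULERS: Dict[str, Scheduler] = {}


def register_scheduler(name: str):
    """Decorator that registers a scheduler plugin under the given name."""
    def decorator(func: Scheduler) -> Scheduler:
        _SCHEDULERS[name] = func
        return func
    return decorator


def get_scheduler(name: str) -> Scheduler:
    """Look up a scheduler plugin, failing with a readable error message."""
    try:
        return _SCHEDULERS[name]
    except KeyError:
        raise ValueError(
            f"unknown scheduler {name!r}; available: {sorted(_SCHEDULERS)}"
        ) from None


@register_scheduler("round_robin")
def round_robin(units, pilots):
    """Assign units to pilots in turn, ignoring pilot size and load."""
    if not pilots:
        raise ValueError("no pilots available")
    assignment = {pilot: [] for pilot in pilots}
    for index, unit in enumerate(units):
        assignment[pilots[index % len(pilots)]].append(unit)
    return assignment


if __name__ == "__main__":
    units = [
        ComputeUnit("/bin/date", stage_in=[f"input_{i}.dat"], stage_out=[f"output_{i}.dat"])
        for i in range(5)
    ]
    plan = get_scheduler("round_robin")(units, ["pilot.a", "pilot.b"])
    for pilot, assigned in plan.items():
        print(pilot, [unit.stage_in for unit in assigned])
```

Selecting the scheduler by name keeps the choice configurable, so a research scheduler can be swapped in without touching the code that submits CUs. The same pattern would apply to coordination layers and agents.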

The following are NOT considered design constraints:

  • dynamically resizable pilots
    • this is considered a corner case that introduces significant agent complexity, which we want to avoid for various reasons (see above)