Roadmap and Milestones

Roadmap for RCT.v1

  • env isolation [DONE]
  • close some open branches
    • scheduler lookup
    • sandbox
    • func executor
    • COVID* / raptor
    • nodes
  • no-mongodb
  • task descriptions / task types
  • termination / no-heartbeats / destructors [RECONSIDER]
  • tracer / logger service
  • partitions
  • asyncio as base for components?
  • stand-alone components

Scope for RCT.v2

Radical-Pilot

  • Python-3 [DONE]
  • Pilot Partitioning
  • ZMQ Bridges (a minimal bridge sketch follows this list)
    • separation of network overlay / communication overlay
    • liberate tmgr / pmgr, decouple from agent
    • towards disconnect / reconnect (see below)
    • security implications?
    • Draft Implementation
    • -- how about stable communication services?
  • decoupled agent
    • decoupled components / bridges
      • no Python multiprocessing
    • provide communication / coordination layer to workload
      • data coordination
    • early experiment: NGE
    • -- opens up support for different client types, and for using the client on its own
  • resilience & fault tolerance
    • component failures, (bridge failures), unit failures, node failures
    • changing software stack
    • fickle application environment [DONE]
    • bootstrapper as single process root -> simplify?
    • no preliminary work
    • -- check if batch job survives node failures
    • -- what are the failure modes of RP? Systematize error types
    • -- what does failure recovery mean in each case?
    • -- feasibility studies, prototypes for different failure modes / recoveries -> proposal
  • Configuration management
    • preliminary usage
    • RU support completed
      • shell expansion in configs [DONE (variable expansion)]
      • overloading by user configs [DONE]
      • query with module-like names ("rp.resource.xsede.stampede2") (see the config query sketch after this list)
      • json-schema based validation is planned
  • Application Communication
    • application level communication channels (UC: Sebastian/FU) [DONE]
    • data pipelines
    • service tasks [DONE]
    • no preliminary work, but ZMQ channels are now independent and live in RU
    • -- stay focused on RP core capabilities, don't expand too much into userspace
  • faster dicts -> RU [DONE]
  • re-investigate usage modes
    • UC: batch submission of agent + workload
    • client side API - what else do we need?
    • ZMQ 'API'
    • no early results, but agent decoupling and partitioning are steps
  • connectivity management -> RU
  • rename CU to Task [DONE]
  • Tau / Monitoring
  • split scheduler in Resolver and Scheduler
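A minimal sketch of the bridge idea from the ZMQ Bridges item above, assuming pyzmq; the socket roles, endpoints, and port numbers are illustrative and not the actual RP bridge implementation. A stand-alone forwarder decouples producers (e.g., tmgr / pmgr) from consumers (e.g., agent components), so neither side needs to know the other's address:

```python
import zmq

# stand-alone bridge: producers PUSH into one socket, consumers PULL
# from the other; the bridge is the only well-known endpoint
ctx      = zmq.Context()
incoming = ctx.socket(zmq.PULL)
outgoing = ctx.socket(zmq.PUSH)
incoming.bind('tcp://*:10000')    # illustrative port numbers
outgoing.bind('tcp://*:10001')
zmq.proxy(incoming, outgoing)     # blocks, forwarding messages
```

Since both sides connect to the bridge rather than to each other, either side can disconnect and reconnect without the other noticing, which is the property the disconnect / reconnect items above depend on.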
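Likewise, a minimal sketch of the module-like config query mentioned under configuration management; the nested config fragment and its keys are illustrative, not the real RP resource configuration:

```python
def query(cfg, name):
    '''Resolve a dot-separated, module-like name such as
    "rp.resource.xsede.stampede2" against a nested config dict.'''
    node = cfg
    for part in name.split('.'):
        node = node[part]   # raises KeyError on unknown path elements
    return node

# illustrative config fragment
cfg = {'rp': {'resource': {'xsede': {'stampede2': {'cores_per_node': 68}}}}}
print(query(cfg, 'rp.resource.xsede.stampede2'))
```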

Radical-SAGA

  • re-evaluate RS attribute interface -> RU
  • use the radical name space
  • json based configuration management
    • see RP config management
  • connectivity management -> RU
  • implement data staging and state notification fallbacks at the API level. This will significantly simplify the RP pilot launchers.
  • capture state of completed batch jobs reliably

Radical-Utils

  • separation of network overlay (ssh tunnels) from communication protocol (ZMQ) (see the tunnel sketch after this list)
    • support basic communication patterns over above network overlay
    • network overlay (not fully automated, independently usable)
    • user-space micro-service as shell interface to remote hosts (persistent, reliable, stateful, transparent, secure, ZMQ-connected)
      • incomplete prototype
      • -- need to cover different login node policies
      • -- ensure that only one instance is alive per user
  • support for json-based configuration management
    • see RP
  • fast and lean dictionary implementation
    • toward 10^7 unit descriptions
    • json schema based?
    • C extension?
    • Python-3?
  • performance bottlenecks implemented in C or other languages:
    • scheduling (pattern searches)
    • dictionary implementation (see above)
    • profile mangling
    • timestamps
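A minimal sketch of the overlay / protocol separation listed above, assuming pyzmq and a hypothetical REP service on the remote host; the host name and port numbers are placeholders. ssh provides the network overlay as a port forward, and ZMQ speaks the communication protocol through it:

```python
import subprocess
import zmq

# network overlay: ssh forwards a local port to the remote service
tunnel = subprocess.Popen(['ssh', '-N', '-L', '5555:localhost:5555',
                           'login.example.org'])   # placeholder host

# communication protocol: ZMQ connects through the tunnel endpoint
ctx  = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.connect('tcp://localhost:5555')
sock.send(b'ping')        # assumes a REP service listening remotely
print(sock.recv())

tunnel.terminate()
```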

Radical-Analytics

  • Event and Stats Dashboard (explored)

Testing

Unit tests

Integration tests

  • Define the set of RP test targets
  • Use cron/at on machines with multi-factor authentication
  • Test supported launch methods for each resource

Obsolete


Timeline on Past and Planned Milestones

  • MS-Feature: December 2016
    • Topic: feature
    • Target: Winter '16
    • Status: waiting for MS-Scale
    • ?? - GPUs!
    • ?? - disconnect / reconnect
    • ?? - long running
    • ?? - all MPI flavors
    • ?? - easy extension (app schedulers, new clusters)
  • MS-Scale: September 2016
    • Topic: scaling
    • Target: Fall '16
    • Status: waiting for MS-Refactor-2
    • ?? - scheduler algorithm, data structure
    • OK - agent partitions (rendered as partition scheduler in agent)
    • ?? - possibly tailing cursors and/or ZMQ-based client/pilot communication
    • ?? - stability @ scale for data staging
    • ?? - disconnect / reconnect
    • ?? - long running
    • ?? - ORTE-LIB in production
    • ?? - routine of benchmarks and (stall-based) micro-benchmarks
  • MS-Refactor-2: June 2016
    • Topic: Client Refactoring
    • Target: Summer '16
    • Status: delayed due to termination issues
    • OK - code sharing between agent and RP module
    • OK - code refactoring on RP module side
    • OK - cleanup of state management, entity ownership, state transitions on RP application side
    • !! - shutdown issues (temporary resolution?)
    • OK - performance (~SAGA performance for spawning pilots, 1 roundtrip for CU submission)
    • ?? - improve integration of app kernels
    • ?? - improve performance of late binding scheduler (at least understand performance)
    • OK - scaling of pilots / CUs
    • ?? - better error analysis / provenance / tooling
    • see Performance Challenges
    • see State Management in RADICAL-Pilot
  • MS-Data-2: ??, 201x
    • Topic: Data Management
    • Target: ??
    • ?? - Pilot Data
    • ?? - Agent staging to/from arbitrary locations
  • MS-Resources-2: ??, 2015
    • Topic: Resource Support
    • Target: April
      • 'stable': tests work at demo-scale
      • 'production': EnsembleMD folks can use the resources
      • individual target dates per resource
    • OK - Blue Waters stable (proposal in June)
    • OK - Titan stable
    • OK - Hopper stable
    • OK - SuperMIC stable
    • OK - OSG
    • OK - conceptual clarity on ORTE based agent
    • see Performance Challenges
  • MS-Analysis: 2015
    • Topic: Testing and Analysis
    • OK - April - move to Pandas DataFrames (PDF) (documentation)
    • OK - April - backports from aimes-experiments branch
    • WIP - rebase plotting on PDF
      • student project ??
      • repository of radical scripts OK
    • OK - integrated profiling over RADICAL stack (post-mortem to PDF)
  • MS9: February 2015
    • ?? - Focus on
      • OK - scalability (linear in #pilots, superlinear in #units)
      • OK - scaling limits (O(100) pilots, O(10,000) units)
      • OK - agent performance (20 unit ops/second)
      • OK - agent adaptable to resource architecture / OS constraints
      • OK - clean up of state management and entity ownership on agent level
      • OK - agent ported to relevant (i.e. accessible) architectures, while maintaining scalability
    • see Performance Challenges
    • see State Management in RADICAL-Pilot
  • MS8: August 15, 2014
    • OK - Focus on
      • OK - documentation
      • OK - tutorials
      • OK - examples
      • OK - packaging
  • MS7: July 17, 2014
    • OK - Sinon can replace BigJob as research vehicle within RADICAL, i.e. it is deemed fit for current and upcoming research projects, such as
      • OK - Mark's work on workflows and pilot data
      • OK - Ashley's work on Scheduling
      • OK - Matteo's work on Federation
      • OK - Andre's work on Application Modeling
    • the respective stakeholders decide when this milestone is met, and collaborate on the respective demos
  • MS6: March 15, 2014
    • Focus on Data Capabilities, stability
    • OK - Sinon has basic data management capabilities, short of PilotData
      • CU level data staging
      • support for $HOME or equivalent
    • OK - ticket queue is under control
  • MS5: February 11, 2014
    • OK - Sinon can replace BigJob-as-is within RADICAL, i.e. it provides the same functionality as BigJob, as used by:
      • OK - Troy
      • OK - Aimes
      • ?? - Affinity Implementation
    • the respective stakeholders decide when this milestone is met, and collaborate on the respective demos
    • the ability to replace BJ means that Sinon is usable by the respective users
    • required features:
      • OK - reliable and simple deployment on local machines and on FutureGrid / XSEDE (alamo, sierra, india, hotel; possibly stampede, lonestar)
      • OK - performance comparable to BigJob (or faster, obviously)
      • OK - no major (show-stopping) tickets by 2 weeks after Tutorial / hand-over to Radical users
      • OK - Vishal's use case can be run with BJS, where Sinon replaces BJ
      • OK - Troy can reliably use multiple pilots, many CUs, for its demo-3 (https://github.com/saga-project/troy/wiki/Roadmap#final-demo-demo-3)
      • OK - data staging is considered to be performed out-of-band, e.g. via SAGA-Python / BJS
      • OK - supported functionality (as per above) is documented, and covered by unit tests.
  • MS4: December 15, 2013
    • OK - documentation for API and Data Model
    • OK - packaging / pypi
    • OK - examples
    • OK - performance of components is measured and understood
    • OK - performs as well as BigJob
    • OK - multiple UnitManagers, multiple Pilots, Pilots on several UMs
    • OK - can replace BigJob as Troy backend
    • OK - can run bag of tasks O(100) tasks
    • demo:
      • OK - submit 4 Pilots to india and 4 to sierra
      • OK - create 2 UnitManagers with 4 pilots each
      • OK - run 20 bulks of 100 CUs (CUs vary in runtime)
      • OK - after 10 bulks: disconnect / reconnect
      • OK - state changes for pilots and CUs are delivered via notifications
      • OK - performance for above is measured and reported routinely
    • Addendum: This milestone is delayed until Sinon can support Troy demos 1 and 2, on FutureGrid. That will define Sinon's readiness for an RC1.
  • MS3: November 30, 2013
    • OK - multiple agents get notifications from DB
    • ?? - API layer gets notification from DB
    • OK - non-dumb scheduling over multiple agents
    • OK - reconnect to agents
    • OK - integration with Troy
    • OK - agent works on FG
    • OK - perf measurements of bag-of-tasks use case
    • OK - API documentation
    • demo:
      • OK - submit 2 Pilots to india and 2 to sierra
      • OK - run 10 bulks of 10 CUs (CUs vary in runtime)
      • OK - after 5 bulks: disconnect / reconnect
      • OK - state changes for pilots and CUs are delivered via notifications
      • OK - performance for above is measured and reported routinely
  • MS2: November 15, 2013
    • OK - API layer pushes to DB
    • OK - agent works on one machine on FutureGrid
    • OK - one agent pulls from DB
    • OK - agent enacts
    • OK - agent pushes state to DB
    • OK - API pulls from DB
    • OK - perf measurements of above
    • demo:
      • OK - submit 1 Pilot to india
      • OK - run 1 bulk of 10 CUs on that pilot (CUs have constant runtime)
      • OK - state pulling from application reports CU and Pilot states truthfully
      • OK - performance for above is measured and reported routinely
  • MS1: October 30, 2013
    • OK - API agreed upon
      • OK - incl. packaging and pypi
    • OK - DB backend agreed upon
    • OK - first version of data model
    • OK - coding framework in place
      • OK - api layer
      • OK - plugin scheduler / structure
      • OK - interface to DB backend (rudimentary)
    • OK - dumb unit scheduler
    • OK - config file support
    • OK - demo:
      • OK - a local easy_install results in a 'functional' installation (see below)
      • OK - UM, Pilot and CU representations can be created in the DB, from API calls.