Roadmap and Milestones

Roadmap for RCT.v1

  • env isolation [DONE]
  • close some open branches
    • scheduler lookup
    • sandbox
    • func executor
    • COVID* / raptor
    • nodes
  • no-mongodb
  • task descriptions / task types
  • termination / no-heartbeats / destructors [RECONSIDER]
  • tracer / logger service
  • partitions
  • asyncio as base for components?
  • stand-alone components

Scope for RCT.v2

Radical-Pilot

  • Python-3 [DONE]
  • Pilot Partitioning
  • ZMQ Bridges (a minimal bridge sketch follows this list)
    • separation of network overlay / communication overlay
    • liberate tmgr / pmgr, decouple from agent
    • towards disconnect / reconnect (see below)
    • security implications?
    • Draft Implementation
    • -- how about stable communication services?
  • decoupled agent
    • decoupled components / bridges
      • no Python multiprocessing
    • provide communication / coordination layer to workload
      • data coordination
    • early experiment: NGE
    • -- opens up support for different client types, and for using the client on its own
  • resilience & fault tolerance
    • component failures, (bridge failures), unit failures, node failures
    • changing software stack
    • fickle application environment [DONE]
    • bootstrapper as single process root -> simplify?
    • no preliminary work
    • -- check if batch job survives node failures
    • -- what are the failure modes of RP? Systematize error types
    • -- what does failure recovery mean in each case?
    • -- feasibility studies, prototypes for different failure modes / recoveries -> proposal
  • Configuration management
    • preliminary usage
    • RU support completed
      • shell expansion in configs [DONE (variable expansion)]
      • overloading by user configs [DONE]
      • query with module-like names ("rp.resource.xsede.stampede2") (see the config query sketch after this list)
      • json-schema based validation is planned
  • Application Communication
    • application level communication channels (UC: Sebastian/FU) [DONE]
    • data pipelines
    • service tasks [DONE]
    • no preliminary work, but ZMQ channels are now independent and live in RU
    • -- stay focused on RP core capabilities, don't expand too much into userspace
  • faster dicts -> RU [DONE]
  • re-investigate usage modes
    • UC: batch submission of agent + workload
    • client side API - what else do we need?
    • ZMQ 'API'
    • no early results, but agent decoupling and partitioning are steps
  • connectivity management -> RU
  • rename CU to Task [DONE]
  • Tau / Monitoring
  • split scheduler in Resolver and Scheduler
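A minimal sketch of the bridge idea from the ZMQ Bridges item above, assuming pyzmq; the socket roles, endpoints, and port numbers are illustrative and not the actual RP bridge implementation. A stand-alone forwarder decouples producers (e.g., tmgr / pmgr) from consumers (e.g., agent components), so neither side needs to know the other's address:

```python
import zmq

# stand-alone bridge: producers PUSH into one socket, consumers PULL
# from the other; the bridge is the only well-known endpoint
ctx      = zmq.Context()
incoming = ctx.socket(zmq.PULL)
outgoing = ctx.socket(zmq.PUSH)
incoming.bind('tcp://*:10000')    # illustrative port numbers
outgoing.bind('tcp://*:10001')
zmq.proxy(incoming, outgoing)     # blocks, forwarding messages
```

Since both sides connect to the bridge rather than to each other, either side can disconnect and reconnect without the other noticing, which is the property the disconnect / reconnect items above depend on.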
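Likewise, a minimal sketch of the module-like config query mentioned under configuration management; the nested config fragment and its keys are illustrative, not the real RP resource configuration:

```python
def query(cfg, name):
    '''Resolve a dot-separated, module-like name such as
    "rp.resource.xsede.stampede2" against a nested config dict.'''
    node = cfg
    for part in name.split('.'):
        node = node[part]   # raises KeyError on unknown path elements
    return node

# illustrative config fragment
cfg = {'rp': {'resource': {'xsede': {'stampede2': {'cores_per_node': 68}}}}}
print(query(cfg, 'rp.resource.xsede.stampede2'))
```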

Radical-SAGA

  • re-evaluate RS attribute interface -> RU
  • use the radical name space
  • json based configuration management
    • see RP config management
  • connectivity management -> RU
  • implement data staging and state notification fallbacks at the API level. This will significantly simplify the RP pilot launchers.
  • capture state of completed batch jobs reliably

Radical-Utils

  • separation of network overlay (ssh tunnels) from communication protocol (ZMQ) (see the tunnel sketch after this list)
    • support basic communication patterns over above network overlay
    • network overlay (not fully automated, independently usable)
    • user-space micro-service as shell interface to remote hosts (persistent, reliable, stateful, transparent, secure, ZMQ-connected)
      • incomplete prototype
      • -- need to cover different login node policies
      • -- ensure that only one instance is alive per user
  • support for json-based configuration management
    • see RP
  • fast and lean dictionary implementation
    • toward 10^7 unit descriptions
    • json schema based?
    • C extension?
    • Python-3?
  • performance bottlenecks implemented in C or other languages:
    • scheduling (pattern searches)
    • dictionary implementation (see above)
    • profile mangling
    • timestamps
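A minimal sketch of the overlay / protocol separation listed above, assuming pyzmq and a hypothetical REP service on the remote host; the host name and port numbers are placeholders. ssh provides the network overlay as a port forward, and ZMQ speaks the communication protocol through it:

```python
import subprocess
import zmq

# network overlay: ssh forwards a local port to the remote service
tunnel = subprocess.Popen(['ssh', '-N', '-L', '5555:localhost:5555',
                           'login.example.org'])   # placeholder host

# communication protocol: ZMQ connects through the tunnel endpoint
ctx  = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.connect('tcp://localhost:5555')
sock.send(b'ping')        # assumes a REP service listening remotely
print(sock.recv())

tunnel.terminate()
```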

Radical-Analytics

  • Event and Stats Dashboard (explored)

Testing

Unit tests

Integration tests

  • Define the set of RP test targets
  • Use cron/at on machines with multi-factor authentication
  • Test supported launch methods for each resource

Obsolete


Timeline on Past and Planned Milestones

  • MS-Feature: December 2016
    • Topic: feature
    • Target: Winter '16
    • Status: waiting for MS-Scale
    • ?? - GPUs!
    • ?? - disconnect / reconnect
    • ?? - long running
    • ?? - all MPI flavors
    • ?? - easy extension (app schedulers, new clusters)
  • MS-Scale: September 2016
    • Topic: scaling
    • Target: Fall '16
    • Status: waiting for MS-Refactor-2
    • ?? - scheduler algorithm, data structure
    • OK - agent partitions (rendered as partition scheduler in agent)
    • ?? - possibly tailing cursors and/or ZMQ-based client/pilot communication
    • ?? - stability @ scale for data staging
    • ?? - disconnect / reconnect
    • ?? - long running
    • ?? - ORTE-LIB in production
    • ?? - routine of benchmarks and (stall-based) micro-benchmarks
  • MS-Refactor-2: June 2016
    • Topic: Client Refactoring
    • Target: Summer '16
    • Status: delayed due to termination issues
    • OK - code sharing between agent and RP module
    • OK - code refactoring on RP module side
    • OK - cleanup of state management, entity ownership, state transitions on RP application side
    • !! - shutdown issues (temporary resolution?)
    • OK - performance (~SAGA performance for spawning pilots, 1 roundtrip for CU submission)
    • ?? - improve integration of app kernels
    • ?? - improve performance of late binding scheduler (at least understand performance)
    • OK - scaling of pilots / CUs
    • ?? - better error analysis / provenance / tooling
    • see Performance Challenges
    • see State Management in RADICAL-Pilot
  • MS-Data-2: ??, 201x
    • Topic: Data Management
    • Target: ??
    • ?? - Pilot Data
    • ?? - Agent staging to/from arbitrary locations
  • MS-Resources-2: ??, 2015
    • Topic: Resource Support
    • Target: April
      • 'stable': tests work at demo-scale
      • 'production': EnsembleMD folks can use the resources
      • individual target dates per resource
    • OK - Blue Waters stable (proposal in June)
    • OK - Titan stable
    • OK - Hopper stable
    • OK - SuperMIC stable
    • OK - OSG
    • OK - conceptual clarity on ORTE based agent
    • see Performance Challenges
  • MS-Analysis: 2015
    • Topic: Testing and Analysis
    • OK - April - move to Pandas DataFrames (PDF) (documentation)
    • OK - April - backports from aimes-experiments branch
    • WIP - rebase plotting on PDF
      • student project ??
      • repository of radical scripts OK
    • OK - integrated profiling over RADICAL stack (post-mortem to PDF)
  • MS9: February 2015
    • ?? - Focus on
      • OK - scalability (linear in #pilots, superlinear in #units)
      • OK - scaling limits (O(100) pilots, O(10,000) units)
      • OK - agent performance (20 unit ops/second)
      • OK - agent adaptable to resource architecture / OS constraints
      • OK - clean up of state management and entity ownership on agent level
      • OK - agent ported to relevant (i.e. accessible) architectures, while maintaining scalability
    • see Performance Challenges
    • see State Management in RADICAL-Pilot
  • MS8: August 15, 2014
    • OK - Focus on
      • OK - documentation
      • OK - tutorials
      • OK - examples
      • OK - packaging
  • MS7: July 17, 2014
    • OK - Sinon can replace BigJob as research vehicle within RADICAL, i.e. it is deemed fit for current and upcoming research projects, such as
      • OK - Mark's work on workflows and pilot data
      • OK - Ashley's work on Scheduling
      • OK - Matteo's work on Federation
      • OK - Andre's work on Application Modeling
    • the respective stakeholders decide when this milestone is met, and collaborate on the respective demos
  • MS6: March 15, 2014
    • Focus on Data Capabilities, stability
    • OK - Sinon has basic data management capabilities, short of PilotData
      • CU level data staging
      • support for $HOME or equivalent
    • OK - ticket queue is under control
  • MS5: February 11, 2014
    • OK - Sinon can replace BigJob-as-is within RADICAL, i.e. it provides the same functionality as BigJob, as used by:
      • OK - Troy
      • OK - Aimes
      • ?? - Affinity Implementation
    • the respective stakeholders decide when this milestone is met, and collaborate on the respective demos
    • the ability to replace BJ means that Sinon is usable by the respective users
    • required features:
      • OK - reliable and simple deployment on local machines and on FutureGrid / XSEDE (alamo, sierra, india, hotel; possibly stampede, lonestar)
      • OK - performance comparable to BigJob (or faster, obviously)
      • OK - no major (show-stopping) tickets by 2 weeks after Tutorial / hand-over to Radical users
      • OK - Vishal's use case can be run with BJS, where Sinon replaces BJ
      • OK - Troy can reliably use multiple pilots, many CUs, for its demo-3 (https://github.com/saga-project/troy/wiki/Roadmap#final-demo-demo-3)
      • OK - data staging is considered to be performed out-of-band, e.g. via SAGA-Python / BJS
      • OK - supported functionality (as per above) is documented, and covered by unit tests.
  • MS4: December 15, 2013
    • OK - documentation for API and Data Model
    • OK - packaging / pypi
    • OK - examples
    • OK - performance of components is measured and understood
    • OK - performs as well as BigJob
    • OK - multiple UnitManagers, multiple Pilots, Pilots on several UMs
    • OK - can replace BigJob as Troy backend
    • OK - can run bag of tasks O(100) tasks
    • demo:
      • OK - submit 4 Pilots to india and 4 to sierra
      • OK - create 2 UnitManagers with 4 pilots each
      • OK - run 20 bulks of 100 CUs (CUs vary in runtime)
      • OK - after 10 bulks: disconnect / reconnect
      • OK - state changes for pilots and CUs are delivered via notifications
      • OK - performance for above is measured and reported routinely
    • Addendum: This milestone is delayed until Sinon can support Troy demos 1 and 2, on FutureGrid. That will define Sinon's readiness for an RC1.
  • MS3: November 30, 2013
    • OK - multiple agents get notifications from DB
    • ?? - API layer gets notification from DB
    • OK - non-dumb scheduling over multiple agents
    • OK - reconnect to agents
    • OK - integration with Troy
    • OK - agent works on FG
    • OK - perf measurements of bag-of-tasks use case
    • OK - API documentation
    • demo:
      • OK - submit 2 Pilots to india and 2 to sierra
      • OK - run 10 bulks of 10 CUs (CUs vary in runtime)
      • OK - after 5 bulks: disconnect / reconnect
      • OK - state changes for pilots and CUs are delivered via notifications
      • OK - performance for above is measured and reported routinely
  • MS2: November 15, 2013
    • OK - API layer pushes to DB
    • OK - agent works on one machine on FutureGrid
    • OK - one agent pulls from DB
    • OK - agent enacts
    • OK - agent pushes state to DB
    • OK - API pulls from DB
    • OK - perf measurements of above
    • demo:
      • OK - submit 1 Pilot to india
      • OK - run 1 bulk of 10 CUs on that pilot (CUs have constant runtime)
      • OK - state pulling from application reports CU and Pilot states truthfully
      • OK - performance for above is measured and reported routinely
  • MS1: October 30, 2013
    • OK - API agreed upon
      • OK - incl. packaging and pypi
    • OK - DB backend agreed upon
    • OK - first version of data model
    • OK - coding framework in place
      • OK - api layer
      • OK - plugin scheduler / structure
      • OK - interface to DB backend (rudimentary)
    • OK - dumb unit scheduler
    • OK - config file support
    • OK - demo:
      • OK - a local easy_install results in a 'functional' installation (see below)
      • OK - UM, Pilot and CU representations can be created in the DB, from API calls.