
Performance Challenges


Performance and Scalability Targets

Amongst the original RP requirements is this one:

  • R-9: perf: 10k concurrent CUs with 16 cores each, over 10 resources, in a single application (RepEx)

Another interesting target for RP performance would be a comparison to Cram:

  • 1,572,864 jobs on 96k nodes (Sequoia)
  • process spawning etc: 104.5 s
  • tmp file (2 files/proc) creation on Lustre: 1167.7 s
  • memory footprint: <10MB

Performance Scalability Metrics

  1. Assumptions:

    1. 'roundtrip' can refer to
      • app <-> pilot
      • app <-> db
      • db <-> pilot
      • app <-> (db+pilot); all of the above are considered to be of the same order O() (see the roundtrip timing sketch after this list).
    2. pilot operations are considered independent, and not bulkable
    3. unit operations on different pilots are considered independent and not bulkable
  2. Pilot Manager

    1. dispatching pilots should not take longer than O(start SAGA job)
    2. time to dispatch a pilot should stay constant over the number of pilots, up to O(100) pilots
    3. time to cancel a pilot should not take longer than: (time to cancel a unit) * (number of active units) + (time to cancel a job) (but see unit cancellation below)
    4. time to inspect a pilot state should not take longer than O(roundtrip)
    5. the time needed to stage-out complete and/or partial data can delay pilot shutdown.
  3. Unit Manager

    1. dispatching units should not take longer than a roundtrip
    2. dispatching multiple units to the same pilot should scale super-linearly, and that scaling behavior should hold up to O(10,000) units.
    3. cancelling units should perform similarly to dispatching units.
    4. inspecting units should perform similarly to dispatching units.
    5. the UM scheduler should not significantly increase unit dispatch time (neither individually nor at scale)
  4. Pilot Agents

    1. main objective is high resource utilization, with scaling to O(10,000) cores and O(1,000) units. This implies:
      1. quick unit startup time: O(1,000) units < 1 min
      2. linear scaling of unit startup times
      3. small memory / I/O overhead per unit startup
  5. Data Management

    1. the usual things :-P
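
Several of the metrics above are expressed in units of a 'roundtrip'. As a point of reference, the app <-> db roundtrip can be measured directly; the sketch below (not part of RP) does so with pymongo, assuming a reachable MongoDB instance -- the URL is a placeholder.

```python
import time
import pymongo

# placeholder URL -- point this at the MongoDB instance used by RP
client = pymongo.MongoClient("mongodb://localhost:27017/")

samples = []
for _ in range(100):
    t0 = time.time()
    client.admin.command("ping")   # one request/response cycle
    samples.append(time.time() - t0)

print("app <-> db roundtrip: min %.1f ms, avg %.1f ms"
      % (min(samples) * 1000, 1000 * sum(samples) / len(samples)))
```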

At the moment, we see a number of challenges in reaching the goals stated above:

  • the number of CUs per pilot agent is limited by the max number of processes per host. We require multiple processes per CU.
    • limits (4.i.b)
    • large-scale pilots will probably need to use multiple agent instances, spread over the nodes, to avoid that limitation.
  • the CU startup rate is approx. 1 CU/sec, and startup time grows exponentially (python popen()) -- the overall startup time for 1,000 concurrent CUs is on the order of hours (see the spawn-rate sketch at the end of this page).
    • limits (4.i.a/b/c)
    • (a) using popen() requires spreading the CUs over several agent processes
    • (b) switching to another spawning mechanism (shell_wrapper) may still require spreading, but less so
  • while the RP API supports bulk operations in all relevant places, those bulks are sometimes unrolled internally and do not result in bulk operations on the database layer. This causes a communication bottleneck over high-latency connections to the DB (see the bulk-insert sketch at the end of this page).
    • limits (3.ii/iii/iv/v, 4.i.b)
    • add caching / bulking to the DB interactions
    • ensure that DB interactions are not spread throughout the code, but are localized to the DB layer and can be measured / optimized there.
  • the RP API layer communicates with the RP controllers over the (remote) DB, incurring additional latencies.
    • limits (2.i/iii/iv, 3.i/iii/iv/v)
    • reconsider whether the separation into controllers is needed, and/or whether their communication needs to go over the DB
  • The backfilling scheduler waits for a pilot to confirm free cores before dispatching a new CU to that pilot. This implies several DB roundtrips and process synchronizations before the CU can be executed; the delay increases further if the unit requires data to be staged in (see the pre-assignment sketch at the end of this page).
    • limits (3.v)
    • The BF scheduler should pre-assign (and stage) a number of units before the pilot actually confirms the resource availability.
    • data staging should be started before units are assigned (even if that can result in transfers of data which are ultimately not used).
  • The agent process(es) use some resources on core 0, and for large-scale pilots that may become a significant portion of that core's capacity, slowing down the CUs assigned to that core.
    • impact uncertain / unclear
    • We should probably use dedicated resources for the agent process(es) at larger scale.
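
The CU startup challenge above can be quantified with a small standalone measurement: the sketch below (not part of RP) times how quickly a single Python process can spawn CU-like children via popen(); /bin/true stands in for a real CU executable, and the process count is a placeholder.

```python
import time
import subprocess

N = 1000                         # number of CU-like processes to spawn
t0 = time.time()
procs = [subprocess.Popen(["/bin/true"]) for _ in range(N)]
t_spawn = time.time() - t0

for p in procs:                  # reap the children
    p.wait()

print("spawned %d processes in %.1f s (%.1f proc/s)"
      % (N, t_spawn, N / t_spawn))
```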
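
The effect of unrolled bulk operations on the DB layer can likewise be illustrated outside of RP: the sketch below compares per-document inserts with a single bulk insert via pymongo. The DB URL, collection name, and document layout are placeholders, not RP's actual schema; the difference grows with the latency to the DB host.

```python
import time
import pymongo

# placeholder DB / collection names
coll = pymongo.MongoClient("mongodb://localhost:27017/")["perf_test"]["units"]
docs = [{"uid": "unit.%06d" % i, "state": "NEW"} for i in range(1000)]

t0 = time.time()
for doc in docs:
    coll.insert_one(dict(doc))             # one DB roundtrip per unit
t_single = time.time() - t0

coll.delete_many({})
t0 = time.time()
coll.insert_many([dict(d) for d in docs])  # one roundtrip for the whole bulk
t_bulk = time.time() - t0

print("per-unit inserts: %.2f s   bulk insert: %.2f s" % (t_single, t_bulk))
```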
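
Finally, a hypothetical sketch of the proposed pre-assignment change to the backfilling scheduler: units (and their input staging) are queued ahead of confirmed core availability, up to a bounded depth. All names below (Pilot, Unit, stage_in, dispatch, ...) are illustrative placeholders, not RP code.

```python
PREASSIGN_DEPTH = 4   # units to queue ahead of confirmed capacity, per pilot

class Pilot(object):
    def __init__(self, free_cores, cores_per_unit):
        self.free_cores     = free_cores      # cores confirmed free by agent
        self.cores_per_unit = cores_per_unit
        self.assigned       = []
    def dispatch(self, unit):
        self.assigned.append(unit)            # agent executes once cores free up

class Unit(object):
    def __init__(self, cores):
        self.cores = cores
    def stage_in(self):
        pass                                  # start input staging early

def schedule(units, pilots):
    for pilot in pilots:
        # confirmed capacity plus an optimistic pre-assignment window
        budget = pilot.free_cores + PREASSIGN_DEPTH * pilot.cores_per_unit
        while units and units[0].cores <= budget:
            unit = units.pop(0)
            budget -= unit.cores
            unit.stage_in()
            pilot.dispatch(unit)

schedule([Unit(16) for _ in range(100)], [Pilot(64, 16), Pilot(32, 16)])
```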