
GPU support

Andre Merzky edited this page Dec 1, 2016 · 8 revisions


RP should enable applications to utilize GPUs on the target resources. That support consists of two main components:

  • extend the RP API to express GPU requirements for CUs
  • extend the UMGR scheduler, agent scheduler, and agent executor to correctly place and execute the respective CUs

The first part can initially be as simple as adding a gpus requirement in the unit description, which would immediately also support mixed CPU/GPU units:

  import radical.pilot as rp

  # 'gpus' is the proposed new attribute; 'cores' already exists
  cud            = rp.ComputeUnitDescription()
  cud.executable = 'sander.GPU'
  cud.cores      = 1
  cud.gpus       = 1

On the UMGR scheduler side, the late-binding scheduler needs to perform the same kind of bookkeeping for GPUs as it already does for CPU cores. Any more fine-grained scheduling (such as CPU/GPU co-scheduling) should remain in the agent scheduler.
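The UMGR-level bookkeeping could look roughly like the following. This is a minimal sketch, not RP code: `PilotSlots`, `bind_unit`, and `release_unit` are hypothetical names, and the scheduler is assumed to track only aggregate free counts per pilot.

```python
from dataclasses import dataclass


@dataclass
class PilotSlots:
    """Free-resource counters for one pilot (hypothetical structure)."""
    free_cores: int
    free_gpus: int


def bind_unit(pilots, cores, gpus):
    """Return the id of the first pilot that can host the unit, or None.

    Only aggregate counts are checked; which cores and GPUs are actually
    used is left to the agent scheduler.
    """
    for pid, slots in pilots.items():
        if slots.free_cores >= cores and slots.free_gpus >= gpus:
            slots.free_cores -= cores
            slots.free_gpus -= gpus
            return pid
    return None


def release_unit(pilots, pid, cores, gpus):
    """Hand a finished unit's resources back to its pilot."""
    pilots[pid].free_cores += cores
    pilots[pid].free_gpus += gpus


# Two pilots; only the second one has GPUs.
pilots = {'pilot.0000': PilotSlots(free_cores=16, free_gpus=0),
          'pilot.0001': PilotSlots(free_cores=16, free_gpus=2)}
first = bind_unit(pilots, cores=1, gpus=1)
```

Here a unit that asks for one core and one GPU skips the CPU-only pilot and is bound to the second one, whose counters are decremented until `release_unit` is called.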

The agent scheduler is probably the crucial part: its internal bookkeeping and scheduling data structures need to change to accommodate GPUs. Note that the agent scheduler is up for a revamp anyway: it is currently a performance bottleneck due to inefficiencies in searching and changing nested Python data structures. This is expected to improve by using the bitarray implementation exposed by the radical.utils scheduler. GPU support should be integrated along the same lines, by using a second bitarray that maps to the cluster's GPU layout.
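The two-bitmap idea can be illustrated as follows. This is only a sketch under stated assumptions: plain Python lists of booleans stand in for the bitarray, the class `NodeBitmaps` and its methods are hypothetical and not the radical.utils scheduler API, and a unit is assumed to need its cores and GPUs on the same node.

```python
FREE, BUSY = False, True


class NodeBitmaps:
    """Two flat bitmaps, one for cores and one for GPUs (hypothetical layout).

    Entry `n * per_node + i` describes slot `i` on node `n`.
    """

    def __init__(self, nodes, cores_per_node, gpus_per_node):
        self.nodes = nodes
        self.cpn = cores_per_node
        self.gpn = gpus_per_node
        self.cores = [FREE] * (nodes * cores_per_node)
        self.gpus = [FREE] * (nodes * gpus_per_node)

    @staticmethod
    def _free_on_node(bits, node, per_node, count):
        """Return `count` free slot indices on `node`, or None if too few."""
        start = node * per_node
        free = [i for i in range(start, start + per_node) if not bits[i]]
        return free[:count] if len(free) >= count else None

    def allocate(self, cores, gpus):
        """Claim cores and GPUs on one node; return (node, cores, gpus) or None."""
        for node in range(self.nodes):
            c = self._free_on_node(self.cores, node, self.cpn, cores)
            g = self._free_on_node(self.gpus, node, self.gpn, gpus)
            if c is None or g is None:
                continue
            for i in c:
                self.cores[i] = BUSY
            for i in g:
                self.gpus[i] = BUSY
            return node, c, g
        return None

    def release(self, core_slots, gpu_slots):
        """Mark previously allocated slots as free again."""
        for i in core_slots:
            self.cores[i] = FREE
        for i in gpu_slots:
            self.gpus[i] = FREE


# Two nodes with four cores and one GPU each.
layout = NodeBitmaps(nodes=2, cores_per_node=4, gpus_per_node=1)
a = layout.allocate(cores=1, gpus=1)
b = layout.allocate(cores=1, gpus=1)
c = layout.allocate(cores=1, gpus=1)   # no GPU left anywhere
```

Keeping cores and GPUs in separate bitmaps means CPU-only units never touch the GPU map, while mixed units are placed by checking both maps for the same node.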

For the agent executing component, we will initially make the simplifying assumption that the CU description accounts for any system peculiarities by selecting the correct executable. Any more advanced support should fall in line with the integration of application kernels. The latter will also be needed once we intend to support dynamic selection between CPU and GPU CUs, which we consider out of scope initially.

SAGA layer

  • machines differ in how GPU jobs are specified (queue, node selection, flags, ...)
  • propose to use machine-specific config files on the SAGA layer
  • propose to overload the cores attribute to allow specifying both cores and GPUs
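One way to picture the machine-specific configuration is a per-machine list of batch directive templates. The sketch below is an assumption throughout: the machine names, the `MACHINE_CONFIGS` table, and `render_directives` are hypothetical, and the directives are common Torque and Slurm forms rather than settings taken from any particular resource or from SAGA itself.

```python
# Hypothetical per-machine templates; names and queues are placeholders.
MACHINE_CONFIGS = {
    'example_torque': ['#PBS -q {queue}',
                       '#PBS -l nodes={nodes}:ppn={ppn}:gpus={gpus}'],
    'example_slurm':  ['#SBATCH --partition={queue}',
                       '#SBATCH --nodes={nodes}',
                       '#SBATCH --ntasks-per-node={ppn}',
                       '#SBATCH --gres=gpu:{gpus}'],
}


def render_directives(machine, **request):
    """Fill the machine's directive templates with the job's resource request."""
    return [line.format(**request) for line in MACHINE_CONFIGS[machine]]


torque = render_directives('example_torque', queue='gpu', nodes=2, ppn=16, gpus=1)
slurm = render_directives('example_slurm', queue='gpu', nodes=2, ppn=16, gpus=1)
```

The same resource request then yields different job script headers per machine, which keeps the per-resource differences out of the adaptor code.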

Use cases

  • Extasy (Frank Noe, Cecilia, ...)
  • Incite (Princeton)
  • Darren, Tom Chethum (RepEx)
  • Jumana / Irina (Neuro Imaging, EC2)