
Some Essential Information


Keep arrays as persistent as possible: Cekirdekler API adds strong references to them and keeps 1-to-1 OpenCL buffer equivalents alive until the NumberCruncher or Cores instance is destroyed. So instead of creating temporary arrays inside a long-running loop, declare one array at a higher scope, copy to/from it, and use that for compute, to counter ever-growing memory usage at each iteration, as sketched below.
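
A minimal sketch of that pattern (here `cruncher`, the kernel name, and the compute(cruncher, computeId, kernelName, globalRange, localRange) argument order follow the project README's example and are placeholders rather than verbatim API):

    // Allocated once, at a higher scope; Cekirdekler keeps a single
    // OpenCL buffer bound to this array for its whole lifetime.
    ClArray<float> scratch = new ClArray<float>(1024 * 1024);

    for (int i = 0; i < 10000; i++)
    {
        // ... fill `scratch` with this iteration's input
        //     (instead of allocating a fresh temporary array here) ...
        scratch.compute(cruncher, 1, "myKernel", 1024 * 1024, 64);
        // ... consume this iteration's results from `scratch` ...
    }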

If an array, say array1, is passed into the compute method and was used before, it is recognized, and the data retained from the last compute operation is reused. This is valid for device-side buffers. For C++ wrapper buffers under "stream" compute, there is no guarantee of a dedicated copy per device, by definition of OpenCL.

    a.nextParam(b,c,d).nextParam(e.nextParam(f,g)).nextParam(h).compute(...); // updates only arrays whose `read` / `write` flags are set
    b.read = false;
    b.compute(...); // doesn't read from host to device
    b.compute(...); // doesn't read from host to device
    b.compute(...); // doesn't read from host to device
    b.write = true;
    b.compute(...); // writes the latest bits from the 3 kernel executions above back to the host

ClArray<T>

  • Faster than T[] for everything except host-side element access; C# adds extra overhead for virtual method overriding and the indexer implementation.
  • If array initialization takes too much time, use OpenCL for it too (see the sketch after this list).
  • Works well with the stream option when data is accessed only once.
  • Fast enough to make PCI-e the bottleneck.
  • Multiple devices can access it concurrently (as long as they don't overlap any addresses).
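
A sketch of the device-side initialization idea (the ClNumberCruncher construction follows the project README's hello-world pattern; the kernel body, names, and ranges are illustrative assumptions):

    ClNumberCruncher cruncher = new ClNumberCruncher(
        AcceleratorType.GPU, @"
        __kernel void initKernel(__global float *data)
        {
            int i = get_global_id(0);
            data[i] = 0.5f * i; // init on the device instead of a slow host-side indexer loop
        }");

    ClArray<float> data = new ClArray<float>(1024 * 1024);
    data.compute(cruncher, 1, "initKernel", 1024 * 1024, 64);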

stream

  • Enables a shorter data path in various implementations, so it is better when data needs streaming (see the sketch after this list)
  • Faster for low compute-to-data-ratio (streaming) kernels, slower for global-memory-heavy (more re-use, random access) kernels.
  • Even faster with ClArray<T>, but even slower with memory-heavy kernels.
  • Best on a CPU device, since data doesn't cross the PCI-e bridge.
  • Multiple devices can stream
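
A usage sketch of a streaming-friendly workload, where each element is touched only once; the per-array switch that selects the C++ wrapper "stream" path is deliberately not named here (its exact flag should be taken from the API docs), and `brightnessFilter` is a made-up kernel:

    ClArray<float> frames = new ClArray<float>(n);
    // (select the stream option for `frames` here; the exact flag name
    //  is documented in the API and not reproduced in this sketch)
    for (int f = 0; f < frameCount; f++)
    {
        // ... overwrite `frames` with the next chunk; each element is used once ...
        frames.compute(cruncher, 1, "brightnessFilter", n, 64);
        // ... ship the processed chunk out before the next iteration ...
    }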

pipelining

  • Faster when .compute() has a lot of latency to hide, slower when the device is already at its limits (see the sketch after this list)
  • Faster with more CPU threads
  • Multiplies the number of needed CPU threads by the number of devices. Pipelining is enabled/disabled for all devices at once. So if 16 command queues are running per device, then depending on the drivers (vendor implementation) it may need up to 32 threads for 2 devices and 48 threads for 3 devices. If the drivers don't use many threads, even only 4 threads may be fine.
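
A sketch of enabling host-to-device pipelining; the trailing compute() parameters (global offset, pipeline enable, blob count) are an assumption about the extended overload, so verify the exact order and names against the API:

    // 16 blobs overlap their read / compute / write stages to hide latency;
    // with 2 devices this may cost up to 32 driver-side threads (see above).
    data.compute(cruncher, 1, "myKernel", 1024 * 1024, 64,
                 0,     // global offset (assumed parameter)
                 true,  // enable host-to-device pipelining (assumed parameter)
                 16);   // pipeline blob count (assumed parameter)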

local work size

  • Increasing it makes load balancing (and pipelining) harder, but may increase device-side efficiency (ALU occupation); see the sketch after this list
  • All devices have the same local size
  • All devices get a proper local workitem id
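
For illustration (the kernel is a made-up example; get_local_id() and get_global_id() are standard OpenCL):

    // Compiled once by the number cruncher:
    string scaleKernel = @"
        __kernel void scale(__global float *data)
        {
            // get_local_id(0) is a proper local id (0..255 here) on every device
            int gi = get_global_id(0);
            data[gi] *= 2.0f;
        }";

    // localRange = 256 below is shared by every selected device:
    data.compute(cruncher, 1, "scale", 1024 * 1024, 256);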

global workitem id

  • Only the first device's first workitem gets a global id of zero, unless the optional offset parameter is set in compute(). So the global id is continuous through the whole global range distributed to the devices: one device's last workitem id is one less than the next device's first workitem id (see the sketch after this list).
  • Multiple devices and multiple pipeline blobs all get proper global and local indexing, but not workgroup ids.
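
For example (an illustrative kernel held as a C# string; the id continuity is exactly the property described above):

    string mapKernel = @"
        __kernel void map(__global float *input, __global float *result)
        {
            // i is unique and continuous across all devices: if device 0
            // computes ids 0..k-1, device 1 starts exactly at id k.
            int i = get_global_id(0);
            result[i] = input[i] * input[i];
        }";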

workgroup id

  • Starts from zero for each device
  • Pipelining with P blobs means P separate kernel executions, so each blob's workgroup id also starts from zero (see the sketch after this list)
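
If a globally unique group index is needed, it can be derived from the continuous global id instead of get_group_id(); a sketch with an illustrative kernel, using only standard OpenCL id functions:

    string kernelSrc = @"
        __kernel void reduceStep(__global float *data, __global float *partials)
        {
            int wg = get_group_id(0);   // NOT unique across devices / pipeline blobs
            // unique everywhere, since global ids are continuous and the
            // local size is the same on all devices:
            int uniqueWg = get_global_id(0) / get_local_size(0);
            if (get_local_id(0) == 0)
                partials[uniqueWg] = data[get_global_id(0)];
        }";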

load balancing

  • The iterative adaptation strength is hardcoded; it can be changed easily, or even improved to take the latest N iterations into consideration, to eliminate spikes from OS interrupts corrupting the balance.
  • More devices make it more resistant to OS hiccups, since all devices being interrupted in the same iteration is less likely than a single interruption.
  • For just a few kB of data and a few GFLOPs of compute, it won't be useful at all. It is better when tens of megabytes are used in teraflops-scale computing. It easily balances an HD7870 (1280 cores) and an R7-240 (320 cores) on an nbody kernel and keeps the balance stable.
  • Doesn't do anything for a single device

atomic functions

  • They work only inside a kernel. When pipelining (host-to-device, not device-to-device) is activated, it does multiple kernel executions both concurrently and serially, so if atomics are needed across the whole range, pipelining should be disabled (see the sketch after this list).
  • Multiple devices are not aware of each other's atomic operations, so, just like pipelining, using multiple devices corrupts atomic-function usage.
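
A sketch of the safe setup: one device, pipelining left disabled, so every workitem shares the same buffer within a single kernel execution (the kernel and array names are made up; atomic_inc() is standard OpenCL):

    string histogramKernel = @"
        __kernel void histogram(__global uchar *pixels, volatile __global int *bins)
        {
            int i = get_global_id(0);
            atomic_inc(&bins[pixels[i]]); // all workitems must share one kernel execution
        }";

    // single device selected, no pipelining parameters passed:
    pixels.nextParam(bins).compute(cruncher, 1, "histogram", n, 64);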