# ACCL Performance
This notebook focuses on performance aspects of ACCL primitives and collectives. The cells below can be run against an emulator or simulator session, but hardware is recommended.

There are several factors influencing the duration of an ACCL API call:
* the complexity of a call - a copy will be faster than an all-reduce for example
* the size (in bytes) of communicated buffers and their location in the memory hierarchy
* memory contention between sending and receiving processes. ACCL can be configured in specific ways to minimize this contention
* use of blocking or non-blocking variants of the API calls
* network performance, which may itself depend on buffer size, i.e. very small buffers typically lead to low utilization of Ethernet bandwidth
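
To make the last point concrete, here is a minimal sketch of a fixed-latency model of a single transfer. The link rate (100 Gb/s) and per-call latency (5 µs) are illustrative assumptions, not ACCL measurements, and `effective_bandwidth_gbps` is a hypothetical helper, not part of PyACCL.

```python
def effective_bandwidth_gbps(size_bytes, link_gbps=100.0, latency_us=5.0):
    """Effective throughput of one transfer under a fixed-latency model.

    link_gbps and latency_us are assumed values chosen for illustration only.
    """
    bits = size_bytes * 8
    transfer_s = bits / (link_gbps * 1e9)      # time spent on the wire
    total_s = latency_us * 1e-6 + transfer_s   # plus a fixed per-call cost
    return bits / total_s / 1e9

# Small messages are dominated by the fixed cost; large ones approach link rate.
for size in (64, 16 * 1024, 16 * 1024 * 1024):
    print(f"{size:>10} B -> {effective_bandwidth_gbps(size):7.2f} Gb/s")
```

Under this model the achieved bandwidth only approaches the link rate once the time on the wire is much larger than the fixed per-call cost.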

Factors which should not influence runtime are:
* data type - API calls on buffers of the same byte size should take the same amount of time, even if the buffers themselves differ in datatype and number of elements 
* use of compression - ACCL is designed to perform compression at network rate
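
When comparing data types it helps to fix the byte size and derive the element count from it. The sketch below does this with element sizes taken from the standard `struct` format codes; `count_for_bytes` and the `ITEMSIZE` table are hypothetical helpers introduced here for illustration.

```python
import struct

# Element sizes in bytes from the struct format codes:
# "e" is IEEE 754 half precision, "f" is single precision.
ITEMSIZE = {"float16": struct.calcsize("e"), "float32": struct.calcsize("f")}

def count_for_bytes(nbytes, dtype):
    """Number of elements of `dtype` that fill exactly `nbytes` bytes."""
    itemsize = ITEMSIZE[dtype]
    if nbytes % itemsize:
        raise ValueError(f"{nbytes} bytes is not a whole number of {dtype} elements")
    return nbytes // itemsize

# The same 1024-byte payload needs twice as many FP16 elements as FP32 elements.
for dtype in ITEMSIZE:
    print(dtype, count_for_bytes(1024, dtype))
```

This is why the FP16 measurements later in the notebook use twice the element count of the FP32 ones.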

Let's initialize a few ACCL instances and explore two performance-related aspects of the API.

In [None]:
from pyaccl import accl
import numpy as np

RUN_ON_HARDWARE = False
XCLBIN = "axis3x.xclbin"

if RUN_ON_HARDWARE:
    WORLD_SIZE = 3
    RXBUF_SIZE = 16*1024*1024
else:
    WORLD_SIZE = 4
    RXBUF_SIZE = 16*1024

assert not RUN_ON_HARDWARE or WORLD_SIZE <= 3

accl_instances = []
for i in range(WORLD_SIZE):
    if RUN_ON_HARDWARE:
        accl_instances.append(accl(WORLD_SIZE, i, bufsize=RXBUF_SIZE, xclbin=XCLBIN, cclo_idx=i))
    else:
        accl_instances.append(accl(WORLD_SIZE, i, bufsize=RXBUF_SIZE, sim_mode=True))

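def allocate_in_all(count, dtype=np.float32):
    # Allocate two operand buffers and one result buffer of `count` elements on every instance
    op0_buffers = []
    op1_buffers = []
    res_buffers = []
    for i in range(WORLD_SIZE):
        # Pass dtype through so that FP16 requests are not allocated with the default type
        op0_buffers.append(accl_instances[i].allocate((count,), dtype=dtype))
        op1_buffers.append(accl_instances[i].allocate((count,), dtype=dtype))
        res_buffers.append(accl_instances[i].allocate((count,), dtype=dtype))
        # Fill the operands with a ramp; j avoids reusing the instance index i
        op0_buffers[i][:] = [1.0*j for j in range(count)]
        op1_buffers[i][:] = [1.0*j for j in range(count)]
    return op0_buffers, op1_buffers, res_buffers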

op0_buf, op1_buf, res_buf = allocate_in_all(RXBUF_SIZE)
op0_buf_fp16, op1_buf_fp16, res_buf_fp16 = allocate_in_all(RXBUF_SIZE, dtype=np.float16)

## Host vs. FPGA buffers

Every ACCL primitive or collective assumes your source and destination buffers are in host memory, unless otherwise specified with the `from_fpga` and `to_fpga` optional arguments that most PyACCL calls take. As such, before the operation is initiated, the source data is moved to the FPGA device memory, and after it completes, the resulting data is moved back to host memory. These copies have a performance overhead which typically depends on the size of copied buffers. 

Let's start by profiling the execution of the copy, the simplest primitive. We will measure across a range of buffer sizes. Feel free to change the `timeit` parameters.

In [None]:
# Element counts must be integers, so use floor division (4 bytes per float32 element)
%timeit -r 4 -n 10 accl_instances[0].copy(op0_buf[0], res_buf[0], 1)
%timeit -r 4 -n 10 accl_instances[0].copy(op0_buf[0], res_buf[0], 1024//4)
%timeit -r 4 -n 10 accl_instances[0].copy(op0_buf[0], res_buf[0], RXBUF_SIZE//4)

As expected, the runtime increases with larger message sizes, but it does so from quite a high baseline, caused by the time required to copy the buffers between host and FPGA memory. In many applications, however, the data might have been produced on the FPGA itself, or is subsequently required on the FPGA, and therefore does not require copying to the host. Let's see how the runtime changes if we work on FPGA memory directly.

In [None]:
# Same sizes as before, but both source and destination stay in FPGA memory
%timeit -r 4 -n 10 accl_instances[0].copy(op0_buf[0], res_buf[0], 1, from_fpga=True, to_fpga=True)
%timeit -r 4 -n 10 accl_instances[0].copy(op0_buf[0], res_buf[0], 1024//4, from_fpga=True, to_fpga=True)
%timeit -r 4 -n 10 accl_instances[0].copy(op0_buf[0], res_buf[0], RXBUF_SIZE//4, from_fpga=True, to_fpga=True)

Now let's see if the data type affects runtime (it shouldn't). We'll run the same copy operations again, from FPGA memory, but this time on identically sized FP16 buffers.

In [None]:
# FP16 elements are 2 bytes, so the counts double to keep the byte sizes identical
%timeit -r 4 -n 10 accl_instances[0].copy(op0_buf_fp16[0], res_buf_fp16[0], 2, from_fpga=True, to_fpga=True)
%timeit -r 4 -n 10 accl_instances[0].copy(op0_buf_fp16[0], res_buf_fp16[0], 1024//2, from_fpga=True, to_fpga=True)
%timeit -r 4 -n 10 accl_instances[0].copy(op0_buf_fp16[0], res_buf_fp16[0], RXBUF_SIZE//2, from_fpga=True, to_fpga=True)
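
The `%timeit` magic used in this section only exists inside IPython. As a minimal sketch of how the same `-r 4 -n 10` measurement could be reproduced in a plain script, the standard `timeit` module can be used. `time_call` and `throughput_mb_per_s` are hypothetical helpers, and the workload below is a stand-in byte copy rather than an ACCL call. Note that this helper reports the best run, whereas `%timeit` reports the mean and standard deviation across runs.

```python
import timeit

def time_call(fn, repeat=4, number=10):
    """Best per-call time in seconds over `repeat` runs of `number` calls each."""
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    return min(totals) / number

def throughput_mb_per_s(nbytes, seconds):
    """Convert a payload size and a duration into MB/s (1 MB = 10**6 bytes)."""
    return nbytes / seconds / 1e6

# Stand-in workload: copy a 1 MiB host buffer. Replace the lambda with the
# ACCL call under test when running against a real session.
payload = bytearray(1024 * 1024)
seconds = time_call(lambda: bytes(payload))
print(f"{seconds * 1e6:.1f} us per call, "
      f"{throughput_mb_per_s(len(payload), seconds):.0f} MB/s")
```

Dividing the payload size by the measured time makes runs with different buffer sizes directly comparable.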

## Asynchronous calls
Some PyACCL calls take the `run_async` optional argument. If this is set to `True`, the function call returns immediately with a handle that can be waited on to determine whether the processing has actually finished. This enables the program to continue processing on the host while the ACCL call is being executed in the FPGA.

We can experiment with this feature by emulating host-side work with calls to `time.sleep()`. As long as the call to ACCL takes longer than the call to `sleep()`, the sleep will be completely hidden behind the ACCL call.

In [None]:
import time

def overlap_computation(count):
    handle = accl_instances[0].copy(op0_buf[0], res_buf[0], count, from_fpga=True, to_fpga=True, run_async=True)
    time.sleep(0.1)
    handle.wait()

# Element counts must be integers, so use floor division
%timeit -r 4 -n 10 overlap_computation(1)
%timeit -r 4 -n 10 overlap_computation(1024//4)
%timeit -r 4 -n 10 overlap_computation(RXBUF_SIZE//4)
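
The overlap effect can also be illustrated without ACCL. The sketch below is an analogy only: it uses a `concurrent.futures` thread and a hypothetical `fake_device_call` that merely sleeps, so it says nothing about how PyACCL implements its handles. It shows that the elapsed time tracks the longer of the two activities rather than their sum.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_device_call(duration_s):
    """Hypothetical stand-in for an offloaded call; it only sleeps."""
    time.sleep(duration_s)

def overlapped(device_s, host_s):
    """Run the stand-in call in the background while the host 'works'.

    Returns the total elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fake_device_call, device_s)
        time.sleep(host_s)   # emulated host-side work
        future.result()      # plays the role of handle.wait()
    return time.perf_counter() - start

# When the background call is longer, the host work is hidden behind it;
# when it is shorter, the host work dominates instead.
print(f"device 0.20 s, host 0.10 s: {overlapped(0.20, 0.10):.2f} s elapsed")
print(f"device 0.05 s, host 0.10 s: {overlapped(0.05, 0.10):.2f} s elapsed")
```

In the notebook cell above, the same reasoning implies that the measured time stays near the 0.1 s sleep until the copy itself takes longer than that.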

## De-Initialize ACCL instances
The `deinit()` function clears all internal data structures in the ACCL instance.

In [None]:
for i in range(WORLD_SIZE):
    accl_instances[i].deinit()