# Online Signal Processing Tools
Extending cuSignal support to streaming data, smaller signal sizes, and handling FFT plans and memory.

## Zero-Copy Memory
cuSignal provides two zero-copy memory allocation functions, both in `_arraytools.py`:
1. `get_shared_array(data, strides, order, stream, portable, wc)` and
2. `get_shared_mem(shape, dtype, strides, order, stream, portable, wc)`

In both cases, Numba is used to create a pinned and mapped memory space. Pinning prevents the OS from swapping the physical pages out, and mapping allows both the GPU and CPU to access the same memory space. Essentially, we're setting up a Direct Memory Access (DMA) pattern and eliminating the extra copy through a CPU bounce buffer.

`get_shared_array` establishes a zero-copy memory space and loads data of native type into that allocated array. It returns a pre-populated data array that is accessible to both CPU and GPU functions/libraries.

`get_shared_mem` is similar to `numpy.zeros` and allocates an empty zero-copy memory space of a given type. This is probably the better choice for online signal processing applications, where blocks of known size are transferred into a buffer for computation.

**WARNING** Allocating zero-copy memory in this way physically removes memory resources from the operating system and should be used with the utmost caution.
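For a streaming workload, the buffer-reuse pattern described for `get_shared_mem` might look like the sketch below. The block size `CHUNK`, the helper names `make_buffer` and `process_chunk`, and the three-iteration loop are hypothetical and only there for illustration. On a machine without a usable GPU, the sketch falls back to an ordinary NumPy array so that it still runs; that fallback is an assumption of this example, not something cuSignal does.

```python
import numpy as np

try:
    import cupy as cp
    import cusignal
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    HAVE_GPU = True
except Exception:
    HAVE_GPU = False

CHUNK = 2**12  # hypothetical number of samples per incoming block


def make_buffer(n, dtype=np.complex128):
    """Allocate the reusable buffer once, before the streaming loop."""
    if HAVE_GPU:
        # Pinned and mapped memory, visible to both CPU and GPU
        return cusignal.get_shared_mem(n, dtype=dtype)
    # Assumption: plain host memory as a CPU-only stand-in
    return np.zeros(n, dtype=dtype)


def process_chunk(buffer, chunk):
    """Copy one block into the preallocated buffer and take its FFT."""
    buffer[:] = chunk  # in-place write, no new allocation per block
    if HAVE_GPU:
        return cp.asnumpy(cp.fft.fft(cp.asarray(buffer)))
    return np.fft.fft(buffer)


buffer = make_buffer(CHUNK)
rng = np.random.default_rng(0)
for _ in range(3):  # stand-in for a continuous stream of blocks
    chunk = rng.random(CHUNK) + 1j * rng.random(CHUNK)
    spectrum = process_chunk(buffer, chunk)
```

Allocating once and writing with `buffer[:] = chunk` keeps the amount of pinned memory fixed for the life of the loop, which matters given the warning above.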

If you're trying to leverage cuSignal on an embedded GPU, say an NVIDIA TX2, Nano, or Xavier, or on a GPU-integrated SDR platform like Deepwave Digital's [Air-T](https://deepwavedigital.com/sdr/), the GPU and CPU share the same memory space. Currently, Numba does not make use of CUDA's Unified Memory (UM) construct, so a GPU allocation via Numba or CuPy will physically migrate memory. If your application uses UM, the CUDA driver is 'smart' enough to know not to move bits in this case. A Numba feature request to add UM support is [here](https://github.com/numba/numba/issues/4362).

## Comparing FFT Performance

In [1]:
import numpy as np
import cupy as cp
import cusignal
from cupyx.scipy import fftpack

# Number of samples in signal
N = 2**15

### Data Created on CPU and FFT Performed on CPU with NumPy

In [2]:
# Create Data on CPU
cpu_signal = np.random.rand(N) + 1j*np.random.rand(N)

In [3]:
%%time
cpu_fft = np.fft.fft(cpu_signal)

CPU times: user 3.85 ms, sys: 0 ns, total: 3.85 ms
Wall time: 2.57 ms


### Data Created on GPU and FFT Performed on GPU with CuPy

In [4]:
# Create Data on GPU
gpu_signal = cp.random.rand(N) + 1j*cp.random.rand(N)

In [5]:
%%time
gpu_fft = cp.fft.fft(gpu_signal)

CPU times: user 195 ms, sys: 31.1 ms, total: 226 ms
Wall time: 225 ms


**WAIT. WHAT?!**
GPUs are supposed to be faster, right? On first run (clear your kernel if you're not sure), you will most likely notice that the CPU/NumPy version of the 2^15-point FFT executed almost 100x faster than the GPU version. Fortunately for us, most of the time spent in the GPU FFT call went to establishing pointers to memory, creating the FFT plan, and other overhead that only needs to be paid once. If we run the function again, we'll see a significant performance improvement.

In [6]:
%%time
gpu_fft = cp.fft.fft(gpu_signal)

CPU times: user 2.06 ms, sys: 382 µs, total: 2.44 ms
Wall time: 1.28 ms
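
To separate the one-time setup cost from the steady-state cost outside a notebook, the first and second call can be timed by hand. The sketch below is a minimal illustration, not the method used to produce the numbers above. It synchronizes the default CUDA stream before reading the clock, because CuPy queues GPU work asynchronously and a wall-clock timer may otherwise stop early. The helper `timed_fft` is hypothetical, and the NumPy fallback is an assumption added so the sketch runs without a GPU.

```python
import time

try:
    import cupy as xp
    xp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    sync = xp.cuda.Stream.null.synchronize
except Exception:
    import numpy as xp  # assumption: CPU-only stand-in
    sync = lambda: None


def timed_fft(signal):
    """Return the FFT of `signal` and the elapsed wall time in seconds."""
    sync()  # make sure earlier GPU work is not counted
    start = time.perf_counter()
    result = xp.fft.fft(signal)
    sync()  # wait for the FFT itself to finish
    return result, time.perf_counter() - start


N = 2**15
signal = xp.random.rand(N) + 1j * xp.random.rand(N)

result, first_call = timed_fft(signal)   # includes plan creation and setup
result, second_call = timed_fft(signal)  # setup is already paid for

print(f"first: {first_call * 1e3:.3f} ms, second: {second_call * 1e3:.3f} ms")
```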


We can start to look at direct CPU vs. GPU comparisons, but that's a bit misguided here. The value of something like cuSignal is that we can move an entire processing pipeline to the GPU for faster end-to-end signal processing and then move seamlessly to GPU-based ML/DL. That said, for those curious about raw CPU vs. GPU performance on small signals:

In [7]:
%%timeit
cpu_fft = np.fft.fft(cpu_signal)

649 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [8]:
%%timeit
gpu_fft = cp.fft.fft(gpu_signal)

315 µs ± 8.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


### Allocating FFT Plan before Invocation

If we know the size of our FFTs, we can create the FFT plan prior to execution. This makes use of CuPy's SciPy-compatible module, `cupyx.scipy.fftpack`.

In [9]:
plan = fftpack.get_fft_plan(gpu_signal)

In [10]:
%%time
gpu_fft = fftpack.fft(gpu_signal, plan=plan)

CPU times: user 535 µs, sys: 185 µs, total: 720 µs
Wall time: 494 µs
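
Because a plan is built for one array shape and dtype, it pays off when the same plan is applied to many equally sized blocks. A minimal sketch of that reuse follows; the block count `N_BLOCKS` and the randomly generated blocks are hypothetical, and the NumPy branch is an assumption added so the example runs without a GPU (NumPy has no explicit plan object).

```python
import numpy as np

try:
    import cupy as cp
    from cupyx.scipy import fftpack
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    HAVE_GPU = True
except Exception:
    HAVE_GPU = False

N = 2**15     # samples per block, as in the cells above
N_BLOCKS = 4  # hypothetical number of equally sized blocks

rng = np.random.default_rng(0)
blocks = [rng.random(N) + 1j * rng.random(N) for _ in range(N_BLOCKS)]

if HAVE_GPU:
    gpu_blocks = [cp.asarray(b) for b in blocks]
    # Build the plan once from a representative block ...
    plan = fftpack.get_fft_plan(gpu_blocks[0])
    # ... then reuse it for every block of the same shape and dtype
    spectra = [cp.asnumpy(fftpack.fft(b, plan=plan)) for b in gpu_blocks]
else:
    # CPU-only stand-in
    spectra = [np.fft.fft(b) for b in blocks]
```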


### Data Created on CPU and Moved to GPU; FFT Performed on GPU with CuPy

In [11]:
%%time
gpu_fft = fftpack.fft(cp.asarray(cpu_signal), plan=plan)

CPU times: user 1.21 ms, sys: 4.43 ms, total: 5.63 ms
Wall time: 4.35 ms


Your mileage may vary, but memory migration plus FFT execution on the GPU is typically ~2x slower than running the FFT on the CPU. We need a better way to handle memory!

### Data Created with Zero-Copy; FFT Performed on GPU with CuPy

In [12]:
# Allocate N samples of zero-copy array
shared_signal = cusignal.get_shared_mem(N, dtype=np.complex128)

# Load shared memory space with cpu_signal
shared_signal[:] = cpu_signal

# Confirm pointers
print('CPU Pointer: ', shared_signal.__array_interface__['data'])
print('GPU Pointer: ', shared_signal.__cuda_array_interface__['data'])

CPU Pointer:  (140537959022592, False)
GPU Pointer:  (140537959022592, False)


In [13]:
%%time
gpu_fft = fftpack.fft(cp.asarray(shared_signal), plan=plan)

CPU times: user 824 µs, sys: 0 ns, total: 824 µs
Wall time: 600 µs
