### Demonstration of GPU-Accelerated SigMF Reader

Our work on both SigMF readers and writers focuses on handling the data payload on the GPU, similar to how we use DPDK within the Aerial SDK and cuVNF.

In [1]:
import json
import numpy as np
import cupy as cp
import cusignal

We use the Northeastern University Oracle SigMF recordings (Dataset #2), available [here](http://www.genesys-lab.org/oracle).

In [2]:
meta_file = '/data/oracle/KRI-16Devices-RawData/2ft/WiFi_air_X310_3123D7B_2ft_run1.sigmf-meta'
data_file = '/data/oracle/KRI-16Devices-RawData/2ft/WiFi_air_X310_3123D7B_2ft_run1.sigmf-data'

# Reader (Binary and SigMF)

For our purposes here, [SigMF](https://github.com/gnuradio/SigMF) metadata is treated as a JSON header and processed on the CPU, while the *binary* payload file is memory-mapped to the GPU, where cuSignal uses a CUDA kernel to parse it. Although we focus on SigMF here, you can use the underlying `cusignal.read_bin` and `cusignal.parse_bin` (and the corresponding write functions) for your own datasets.
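The two-step idea of reading raw bytes and then parsing them into typed samples can be illustrated on the CPU with plain NumPy. This is only an analogy under stated assumptions: it does not use cuSignal or the GPU, the file name is hypothetical, and the payload is synthetic `complex64` data.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for a .sigmf-data payload: 8 complex64 samples,
# stored on disk as interleaved float32 I/Q pairs.
samples = (np.arange(8) + 1j * np.arange(8)).astype(np.complex64)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.sigmf-data")  # hypothetical file name
    samples.tofile(path)

    # Step 1 (analogous to reading the binary): load the payload as raw bytes.
    raw = np.fromfile(path, dtype=np.uint8)

# Step 2 (analogous to parsing): reinterpret the bytes as complex64 samples.
parsed = raw.view(np.complex64)

print(raw.shape, parsed.shape)
```

Keeping the read and the parse separate is what lets the byte buffer live in a different kind of memory (pageable, pinned, or mapped) without changing the parsing step.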

### Baseline Reader (CPU, NumPy)

In [3]:
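with open(meta_file, 'r') as f:
    md = json.load(f)

# Map the SigMF datatype string to a NumPy dtype; only cf32 is handled here
datatype = md['_metadata']['global']['core:datatype']
if datatype == 'cf32':
    data_type = np.complex64
else:
    raise ValueError(f'Unsupported core:datatype: {datatype}')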
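The cell above handles only `cf32`. For other recordings, a broader mapping from SigMF datatype strings to NumPy dtypes might look like the sketch below. The helper `sigmf_dtype` and its lookup tables are hypothetical and not part of cuSignal. The sketch accepts both the bare form used in this dataset's metadata (`cf32`) and suffixed forms such as `cf32_le`, and it rejects complex integer types because NumPy has no matching dtype.

```python
import numpy as np

# Hypothetical lookup tables (illustrative, not part of cuSignal).
_REAL = {"f32": "f4", "f64": "f8", "i8": "i1", "i16": "i2", "i32": "i4",
         "u8": "u1", "u16": "u2", "u32": "u4"}
_COMPLEX = {"f32": "c8", "f64": "c16"}
_BYTE_ORDER = {"": "=", "le": "<", "be": ">"}  # no suffix: native byte order


def sigmf_dtype(datatype: str) -> np.dtype:
    """Map a SigMF datatype string such as 'cf32' or 'ri16_le' to a NumPy dtype."""
    name, _, suffix = datatype.partition("_")
    if suffix not in _BYTE_ORDER or len(name) < 2:
        raise ValueError(f"Unrecognized datatype: {datatype!r}")
    kind, base = name[0], name[1:]
    if kind == "r" and base in _REAL:
        return np.dtype(_BYTE_ORDER[suffix] + _REAL[base])
    if kind == "c" and base in _COMPLEX:
        return np.dtype(_BYTE_ORDER[suffix] + _COMPLEX[base])
    # Complex integer samples (e.g. 'ci16_le') would need to be read as
    # interleaved integers and converted, which this sketch does not do.
    raise ValueError(f"Unsupported datatype: {datatype!r}")


print(sigmf_dtype("cf32"), sigmf_dtype("ri16_be"))
```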

In [4]:
%%timeit
data_cpu = np.fromfile(data_file, dtype=data_type)

128 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Baseline Reader (GPU, CuPy)

In [5]:
%%timeit
data_gpu = cp.fromfile(data_file, dtype=data_type)
cp.cuda.runtime.deviceSynchronize()

188 ms ± 341 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


### cuSignal - Use Paged Memory (Default)

This method is preferred for offline signal processing and is the easiest to use.

In [6]:
%%timeit
data_cusignal = cusignal.read_sigmf(data_file, meta_file)
cp.cuda.runtime.deviceSynchronize()

82.2 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### cuSignal - Use Pinned Buffer (Pinned)

This method is preferred for online signal processing tasks, where you are streaming data to the GPU with known and consistent data sizes.
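The reason consistent data sizes matter is that a buffer can be allocated once and reused for every chunk. The sketch below shows that reuse pattern with plain NumPy on the CPU. It is an analogy only: the buffer here is ordinary host memory rather than pinned memory, and the chunk size and file name are hypothetical.

```python
import os
import tempfile

import numpy as np

CHUNK_BYTES = 1024  # hypothetical fixed chunk size

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stream.bin")  # hypothetical file name
    # Synthetic payload: four chunks of bytes cycling through 0..255.
    payload = (np.arange(4 * CHUNK_BYTES) % 256).astype(np.uint8)
    payload.tofile(path)

    # Allocate the staging buffer once, outside the loop.
    buffer = np.empty(CHUNK_BYTES, dtype=np.uint8)

    checksums = []
    with open(path, "rb") as f:
        # readinto() fills the existing buffer instead of allocating a new one.
        while f.readinto(buffer) == CHUNK_BYTES:
            checksums.append(int(buffer.sum()))

print(len(checksums), checksums[0])
```

In the cells that follow, `cusignal.get_pinned_mem` plays the role of the one-time allocation, and the buffer is passed to each `read_sigmf` call.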

In [7]:
binary = cusignal.read_bin(data_file)
buffer = cusignal.get_pinned_mem(binary.shape, cp.ubyte)

In [8]:
%%timeit
data_cusignal_pinned = cusignal.read_sigmf(data_file, meta_file, buffer)
cp.cuda.runtime.deviceSynchronize()

82.2 ms ± 50.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### cuSignal - Use Shared Buffer (Mapped)

This method is preferred for the Jetson line of embedded GPUs, where the CPU and GPU share physical memory. The timing below was measured on a PCIe GPU, so it does not show that advantage.

In [9]:
binary = cusignal.read_bin(data_file)
buffer = cusignal.get_shared_mem(binary.shape, cp.ubyte)

In [10]:
%%timeit
data_cusignal_shared = cusignal.read_sigmf(data_file, meta_file, buffer)
cp.cuda.runtime.deviceSynchronize()

82.8 ms ± 866 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


# Writer (Binary and SigMF)

In [11]:
import os

sigmf = cusignal.read_sigmf(data_file, meta_file)
test_file_ext = "test-data.sigmf-data"

if os.path.exists(test_file_ext):
    os.remove(test_file_ext)

### Baseline Writer

In [12]:
%%timeit
sigmf.tofile(test_file_ext)
cp.cuda.runtime.deviceSynchronize()

338 ms ± 837 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
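When comparing writers, it helps to confirm that the payload on disk is intact. A minimal round-trip check, sketched here with NumPy on synthetic data (the file name is hypothetical, and no cuSignal call is involved), writes samples with `tofile`, reads them back with `fromfile`, and compares them.

```python
import os
import tempfile

import numpy as np

# Synthetic complex64 samples standing in for a recording.
rng = np.random.default_rng(0)
iq = (rng.standard_normal(1024) + 1j * rng.standard_normal(1024)).astype(np.complex64)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.sigmf-data")  # hypothetical file name
    iq.tofile(path)                       # raw binary, no header
    size_bytes = os.path.getsize(path)    # 8 bytes per complex64 sample
    restored = np.fromfile(path, dtype=np.complex64)

print(size_bytes == iq.nbytes, np.array_equal(restored, iq))
```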


### cuSignal - Use Paged Memory (Default)

In [13]:
%%timeit
cusignal.write_sigmf(test_file_ext, sigmf, append=False)
cp.cuda.runtime.deviceSynchronize()

341 ms ± 830 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


### cuSignal - Use Pinned Buffer (Pinned)

In [14]:
binary = cusignal.read_bin(data_file)
buffer = cusignal.get_pinned_mem(binary.shape, cp.ubyte)

In [15]:
%%timeit
cusignal.write_sigmf(test_file_ext, sigmf, buffer, append=False)
cp.cuda.runtime.deviceSynchronize()

253 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### cuSignal - Use Shared Buffer (Mapped)

In [16]:
binary = cusignal.read_bin(data_file)
buffer = cusignal.get_shared_mem(binary.shape, cp.ubyte)

In [17]:
%%timeit
cusignal.write_sigmf(test_file_ext, sigmf, buffer, append=False)
cp.cuda.runtime.deviceSynchronize()

253 ms ± 950 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
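The timings reported above can be turned into relative speedups with a few lines of arithmetic. The sketch below uses only the mean values from the `%%timeit` outputs in this notebook. The dictionary labels and the `speedups` helper are illustrative names.

```python
# Mean times in milliseconds, copied from the %%timeit outputs above.
reader_ms = {
    "NumPy fromfile (CPU)": 128.0,
    "CuPy fromfile (GPU)": 188.0,
    "cuSignal paged": 82.2,
    "cuSignal pinned": 82.2,
    "cuSignal shared": 82.8,
}
writer_ms = {
    "baseline tofile": 338.0,
    "cuSignal paged": 341.0,
    "cuSignal pinned": 253.0,
    "cuSignal shared": 253.0,
}


def speedups(timings, baseline):
    """Return baseline_time / time for every entry (values above 1 are faster)."""
    ref = timings[baseline]
    return {name: ref / ms for name, ms in timings.items()}


reader_speedup = speedups(reader_ms, "NumPy fromfile (CPU)")
writer_speedup = speedups(writer_ms, "baseline tofile")

for name, ratio in {**reader_speedup, **writer_speedup}.items():
    print(f"{name}: {ratio:.2f}x")
```

On these numbers, the cuSignal readers are roughly 1.5x faster than the NumPy baseline, and the pinned and shared writers are roughly 1.3x faster than the baseline writer, while the paged writer is on par with it.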
