[QST] Is lazy evaluation used? #74
Comments
Hey @flytrex-vadim -- thanks for asking a question. @mnicely is our benchmark and performance guru, but a couple of observations:
You can directly generate data on the GPU with CuPy with something like
Matt -- do you also mind posting an example of how to enable kernel pre-compilation and caching?
Hi @awthomp, thanks for answering. A few comments:
In fact, this lack of blocking seems to prevent the DIRECT method from being used in a loop.
Hi @flytrex-vadim, I will try to recreate your scenario early next week and check if I'm missing any blocking. In the meantime, can you use %timeit instead of %time? We've found %time provides misleading results with GPU profiling. An even better way would be to use CuPy's NVTX markers:

```python
import cupy as cp
import numpy as np
from cupy import prof

@cp.prof.TimeRangeDecorator()
def test_baseline():
    h_a = np.ones(size, np.int64)
    h_b = np.ones(size, np.int64)
```

And profile with Nsight Systems:

```
nsys profile --sample=none --trace=cuda,nvtx --stats=true python3 <python script>
```

To precompile the kernels, try:

```python
cusignal._signaltools.precompile_kernels(
    [np.float32],
    [GPUBackend.CUPY],
    [GPUKernel.CORRELATE],
)
```
Hey @flytrex-vadim -- what are the size and dtypes of your data? One way to avoid some of the overhead in the CPU -> GPU transfer is to make use of our shared memory function, which prevents pages from being swapped out by the OS (pinned) and lets them be virtually addressed by the GPU (mapped). Here's an example with the polyphase resampler; this basically creates a DMA path between CPU and GPU. Be careful how much memory you allocate here, though -- you can easily cause a kernel panic if you allocate too much.

```python
import cupy as cp
import numpy as np
import cusignal

start = 0
stop = 10
num_samps = int(1e8)
resample_up = 2
resample_down = 3

# Generate data on CPU
cx = np.linspace(start, stop, num_samps, endpoint=False)
cy = np.cos(-cx**2/6.0)

# Create shared memory between CPU and GPU and load with CPU signal (cy)
gpu_signal = cusignal.get_shared_mem(num_samps, dtype=np.float64)

%%time
# Move data to GPU/CPU shared buffer and run polyphase resampler
gpu_signal[:] = cy
gf = cusignal.resample_poly(gpu_signal, resample_up, resample_down, window=('kaiser', 0.5))
```
Thanks guys. The array shapes I'm using for 2D: Here's the snapshot of my experimental notebook: I'll try looking into shared memory and other timing mechanisms.
Btw, speaking of shared memory, how would I pass the allocated shared memory buffer to a function to be used for output?
Hey @flytrex-vadim -- once you've allocated the shared memory buffer and loaded it with data, you can use it like any normal CuPy/cuSignal array. For example, building on the code above:

```python
# Create shared memory between CPU and GPU. This is like `numpy.zeros` and basically creates an
# empty memory slot for `num_samps` of `np.float64` data. Remember, the GPU and CPU can access
# this memory block, so you could run both numpy/scipy and cupy/cusignal calls on it.
gpu_signal = cusignal.get_shared_mem(num_samps, dtype=np.float64)

# Now migrate data into the empty buffer. In the case of your file read, you'd read your file into
# this newly created buffer.
gpu_signal[:] = cy

# Perform a cusignal/cupy (or scipy/numpy) function on this `gpu_signal`. It's now an allocated
# array; the only difference is, again, that it can be used for GPU and CPU processing.
gf = cusignal.resample_poly(gpu_signal, resample_up, resample_down, window=('kaiser', 0.5))
```

The way CuPy/cuSignal migrate data is basically via
Hi @awthomp. And I do not see any way to control the allocation of the return buffer.
This is a good point @flytrex-vadim, and it's been suggested on another thread: we can ensure that memory external to cusignal functions is zero-copy, but everything internal to a function is abstracted away. If we create some internal array, for example, can we make that zero-copy too? Further, you're correct: all output is assumed to be on the GPU, and there's currently no feature to return an array that's already been transferred to the host -- or, for that matter, to make the output array zero-copy rather than a standard CuPy array. I'll file an issue about this in the next few days and point you to the conversation.
@flytrex-vadim -- I created a feature request addressing one of your comments here: #76. Let's move discussion there. Do you mind if I close this issue?
I think the two original questions remain:
Hey @flytrex-vadim. I was working out of your notebook and have a few comments and observations:
To directly address your points:
CuPy launches asynchronously, but I think we've been effectively blocking before results are returned. You can always add an explicit synchronize.
I can confirm the perf here. We can profile this specific use case, but I'm curious if this just isn't enough data to see the perf improvement we're used to.
This is very interesting. On my GTX 1050 Ti it takes 1.4 seconds.
Yes, I can confirm that adding a synchronize call waits for the operation to complete (tested with correlate2d).
You can always run our benchmark tests on your 1050 Ti and let us know what you see.
@flytrex-vadim cusignal functions are non-blocking by design, the same way a C++ CUDA kernel launch is non-blocking. And if you don't pass a non-default CuPy stream, everything is launched in the default stream, which is blocking.

So a kernel launch is non-blocking, but it launches in a stream that is blocking??? Yes, it can be a little confusing, but in a heterogeneous system it means that the host code launches work (e.g. a kernel) and then control returns to the host. Therefore, host code and device code can run asynchronously. Since the default stream is blocking, if you were to run a blocking call like a cudaMemcpy, the host code would be blocked until the copy is finished.

So let's think about that for a second... When you run

Attached is sample code using our NVTX markers. We use Nsight Systems to profile the code:

```
nsys profile --sample=none --trace=cuda,nvtx --stats=true python3 quicktest.py
```

Notice the output below:

```
Time(%)  Total Time   Instances  Average       Minimum     Maximum     Range
-------  -----------  ---------  ------------  ----------  ----------  --------------------------
   98.0  26162924140          5  5232584828.0  5221783723  5242290976  Run signal.correlate2d
    1.4    366266550          5    73253310.0      116566   365778317  Copy signal to GPU
    0.6    156400519          5    31280103.8      504379   154166494  Run cusignal.correlate2d
    0.0      1997599          5      399519.8      350902      591358  Create CPU signal
    0.0       501849          5      100369.8       90972      136938  Create CPU filter
    0.0       484525          5       96905.0       72400      163420  Copy filter to GPU
```

You can see that I ran each call 5 times. You should notice that the CPU calls are pretty consistent, while there's a swing in the GPU calls. This is because the GPU is warming up. You want to sample several runs, or drop the first few, to get a good number. Once the GPU is warmed up and there are no stochastic algorithms being executed, the times will be consistent.

To get a better understanding, I highly suggest reviewing the output with the Nsight Systems GUI! https://devblogs.nvidia.com/transitioning-nsight-systems-nvidia-visual-profiler-nvprof/ I've attached the qdrep file from this example.
I'm getting:
Running pytest -v from HEAD works fine; all tests pass.
Be sure to conda install or pip install pytest-benchmark.
@flytrex-vadim -- We've filed another feature request based on your questions: #77. Thanks again for the great discussion.
Happy to help with some noob testing :)
Results attached: bench_correlate2d.txt. Are there any reference results for comparison, and/or guidelines on how to interpret them?
Hi @flytrex-vadim, thanks for the benchmark. I found a bug today that causes the output to be sorted incorrectly; the accurate results are the medians. I'll be pushing a fix in the next few days.
Closing this issue; we certainly appreciate the discussion and the 2 feature requests generated from it!
I'm trying to do a simple benchmark of cuSignal vs scipy.signal on correlation. It seems that correlate2d completes immediately, and the buffer transfer takes 2.8 seconds. That makes no sense for a buffer size of 80 MBytes.
Could it be that the correlation is evaluated lazily, only when the buffer transfer is requested?
and getting: