Regarding CPU implementation of correlation function. #39

Closed
Somdyuti2 opened this issue Jul 29, 2020 · 3 comments


@Somdyuti2

Hi, thanks for the implementation. In my use case, I need to run inference on the CPU. Inspecting your code in correlation.py, I gather that for that we would need to call the extern "C" functions ourselves instead of invoking CuPy to do it for us. In your code, each C function is called like this:
cupy_launch('kernel_Correlation_rearrange', cupy_kernel('kernel_Correlation_rearrange', {
    'input': second,
    'output': rbot1
}))(
    grid=tuple([ int((n + 16 - 1) / 16), second.shape[1], second.shape[0] ]),
    block=tuple([ 16, 1, 1 ]),
    args=[ n, second.data_ptr(), rbot1.data_ptr() ]
)

I am not familiar with CuPy, so it would be helpful if you could explain these function calls a bit and give any clue about how to do the equivalent on the CPU. I understand that the args in each call are the arguments passed to the C function, but I am not sure what grid and block signify here; presumably they are not needed when CuPy is not used. As I only need to run on the CPU at test time, I guess I don't need to care about the updateGrad functions.

I would appreciate your help/suggestions regarding this.

@sniklaus
Owner

To make inference on CPUs work, you will have to convert the following CUDA code to something that runs on CPUs instead.

kernel_Correlation_rearrange = '''
extern "C" __global__ void kernel_Correlation_rearrange(
    const int n,
    const float* input,
    float* output
) {
    int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x;

    if (intIndex >= n) {
        return;
    }

    int intSample = blockIdx.z;
    int intChannel = blockIdx.y;

    float fltValue = input[(((intSample * SIZE_1(input)) + intChannel) * SIZE_2(input) * SIZE_3(input)) + intIndex];

    __syncthreads();

    int intPaddedY = (intIndex / SIZE_3(input)) + 4;
    int intPaddedX = (intIndex % SIZE_3(input)) + 4;
    int intRearrange = ((SIZE_3(input) + 8) * intPaddedY) + intPaddedX;

    output[(((intSample * SIZE_1(output) * SIZE_2(output)) + intRearrange) * SIZE_1(input)) + intChannel] = fltValue;
}
'''
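For reference, this first kernel only zero-pads the feature map by 4 pixels on each side and moves the channel dimension last. On the CPU that can be expressed directly with array operations; here is a minimal NumPy sketch (the function name `rearrange_cpu` is mine, not from the repository):

```python
import numpy as np

def rearrange_cpu(tensor):
    """CPU sketch of kernel_Correlation_rearrange.

    tensor: float array of shape (N, C, H, W).
    Returns a zero-padded, channel-last array of shape (N, H + 8, W + 8, C),
    matching the layout the CUDA kernel writes into rbot0/rbot1.
    """
    padded = np.pad(tensor, ((0, 0), (0, 0), (4, 4), (4, 4)))  # pad H and W by 4
    return np.ascontiguousarray(padded.transpose(0, 2, 3, 1))  # NCHW -> NHWC
```

The same two steps can be written in PyTorch as `torch.nn.functional.pad` followed by `permute`.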

kernel_Correlation_updateOutput = '''
extern "C" __global__ void kernel_Correlation_updateOutput(
    const int n,
    const float* rbot0,
    const float* rbot1,
    float* top
) {
    extern __shared__ char patch_data_char[];
    float *patch_data = (float *)patch_data_char;

    // First (upper left) position of kernel upper-left corner in current center position of neighborhood in image 1
    int x1 = blockIdx.x + 4;
    int y1 = blockIdx.y + 4;
    int item = blockIdx.z;
    int ch_off = threadIdx.x;

    // Load 3D patch into shared memory
    for (int j = 0; j < 1; j++) { // HEIGHT
        for (int i = 0; i < 1; i++) { // WIDTH
            int ji_off = (j + i) * SIZE_3(rbot0);
            for (int ch = ch_off; ch < SIZE_3(rbot0); ch += 32) { // CHANNELS
                int idx1 = ((item * SIZE_1(rbot0) + y1+j) * SIZE_2(rbot0) + x1+i) * SIZE_3(rbot0) + ch;
                int idxPatchData = ji_off + ch;
                patch_data[idxPatchData] = rbot0[idx1];
            }
        }
    }

    __syncthreads();

    __shared__ float sum[32];

    // Compute correlation
    for (int top_channel = 0; top_channel < SIZE_1(top); top_channel++) {
        sum[ch_off] = 0;

        int s2o = top_channel % 9 - 4;
        int s2p = top_channel / 9 - 4;

        for (int j = 0; j < 1; j++) { // HEIGHT
            for (int i = 0; i < 1; i++) { // WIDTH
                int ji_off = (j + i) * SIZE_3(rbot0);
                for (int ch = ch_off; ch < SIZE_3(rbot0); ch += 32) { // CHANNELS
                    int x2 = x1 + s2o;
                    int y2 = y1 + s2p;

                    int idxPatchData = ji_off + ch;
                    int idx2 = ((item * SIZE_1(rbot0) + y2+j) * SIZE_2(rbot0) + x2+i) * SIZE_3(rbot0) + ch;

                    sum[ch_off] += patch_data[idxPatchData] * rbot1[idx2];
                }
            }
        }

        __syncthreads();

        if (ch_off == 0) {
            float total_sum = 0;
            for (int idx = 0; idx < 32; idx++) {
                total_sum += sum[idx];
            }
            const int sumelems = SIZE_3(rbot0);
            const int index = ((top_channel*SIZE_2(top) + blockIdx.y)*SIZE_3(top))+blockIdx.x;
            top[index + item*SIZE_1(top)*SIZE_2(top)*SIZE_3(top)] = total_sum / (float)sumelems;
        }
    }
}
'''
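The second kernel computes, for each of the 81 displacements in a 9x9 neighborhood, the channel-wise mean of the product between the two rearranged feature maps. On the CPU the per-thread shared-memory reduction disappears and the whole thing becomes a loop over displacements. A hedged NumPy sketch (`correlation_cpu` is my name for it; it operates on the padded channel-last layout produced by the rearrange step):

```python
import numpy as np

def correlation_cpu(rbot0, rbot1):
    """CPU sketch of kernel_Correlation_updateOutput.

    rbot0, rbot1: rearranged, padded arrays of shape (N, H + 8, W + 8, C).
    Returns top of shape (N, 81, H, W), one channel per displacement.
    """
    n, hp, wp, c = rbot0.shape
    h, w = hp - 8, wp - 8
    top = np.empty((n, 81, h, w), dtype=rbot0.dtype)
    for top_channel in range(81):
        s2o = top_channel % 9 - 4   # horizontal displacement in [-4, 4]
        s2p = top_channel // 9 - 4  # vertical displacement in [-4, 4]
        a = rbot0[:, 4:4 + h, 4:4 + w, :]
        b = rbot1[:, 4 + s2p:4 + s2p + h, 4 + s2o:4 + s2o + w, :]
        top[:, top_channel] = (a * b).sum(axis=3) / c  # mean over channels
    return top
```

The division by `c` mirrors the kernel's `total_sum / (float)sumelems`, where `sumelems` is the channel count.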

There is nothing CuPy-specific that you really need to be familiar with; the grid and block arguments come from CUDA. I would recommend looking into the fundamentals of CUDA, since it shouldn't take much to understand what the code is doing once you know the basics. Good luck!
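To make the grid/block remark concrete: a CUDA launch with `grid=(gx, gy, gz)` and `block=(bx, 1, 1)` just enumerates an index space, and on a CPU that enumeration is plain nested loops. A hypothetical helper (`launch_on_cpu` is not a real CuPy or CUDA API, only an illustration of the mapping):

```python
def launch_on_cpu(kernel, grid, block, args):
    """Run a kernel function once per (blockIdx, threadIdx) combination,
    the way a CUDA launch would, but serially on the CPU."""
    gx, gy, gz = grid
    bx, _, _ = block
    for block_z in range(gz):
        for block_y in range(gy):
            for block_x in range(gx):
                for thread_x in range(bx):
                    kernel(block_x, block_y, block_z, thread_x, *args)
```

In the rearrange launch quoted above, for example, `intIndex` would become `block_x * bx + thread_x`, `intChannel` would be `block_y`, and `intSample` would be `block_z`.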

@duongpaKikai

duongpaKikai commented Sep 15, 2020

I get a problem in the code above when tracing the model from PyTorch to TorchScript:

```
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\DIFRINT-v4\lib\site-packages\cupy\cuda\compiler.py", line 449, in compile
    nvrtc.compileProgram(self.ptr, options)
  File "cupy\cuda\nvrtc.pyx", line 101, in cupy.cuda.nvrtc.compileProgram
  File "cupy\cuda\nvrtc.pyx", line 111, in cupy.cuda.nvrtc.compileProgram
  File "cupy\cuda\nvrtc.pyx", line 56, in cupy.cuda.nvrtc.check_status
cupy.cuda.nvrtc.NVRTCError: NVRTC_ERROR_COMPILATION (6)

During handling of the above exception, another exception occurred:

  File "D:\project\test_1.0.0.pytorch\DIFRINT\models\correlation\correlation.py", line 143, in cupy_launch
    return cupy.cuda.compile_with_cache(strKernel).get_function(strFunction)
  File "C:\ProgramData\Anaconda3\envs\DIFRINT-v4\lib\site-packages\cupy\cuda\compiler.py", line 297, in compile_with_cache
    return _compile_with_cache_cuda(source, options, arch, cache_dir,
  File "C:\ProgramData\Anaconda3\envs\DIFRINT-v4\lib\site-packages\cupy\cuda\compiler.py", line 350, in _compile_with_cache_cuda
    ptx = compile_using_nvrtc(source, options, arch, name + '.cu')
  File "C:\ProgramData\Anaconda3\envs\DIFRINT-v4\lib\site-packages\cupy\cuda\compiler.py", line 158, in compile_using_nvrtc
    ptx = prog.compile(options)
  File "C:\ProgramData\Anaconda3\envs\DIFRINT-v4\lib\site-packages\cupy\cuda\compiler.py", line 453, in compile
    raise CompileException(log, self.src, self.name, options, 'nvrtc')
cupy.cuda.compiler.CompileException: C:\Users\Admin\AppData\Local\Temp\tmpg65_9di1\8f8f5ff490d72ae331cc951d3896a60d_2.cubin.cu(16): error: identifier "tensor" is undefined

1 error detected in the compilation of "C:\Users\Admin\AppData\Local\Temp\tmpg65_9di1\8f8f5ff490d72ae331cc951d3896a60d_2.cubin.cu".
```

Etienne66 added a commit to Etienne66/pytorch-pwc that referenced this issue Mar 7, 2022
This fixes issue sniklaus#39 and allows Tensorboard's SummaryWriter.add_graph to work although there are some TracerWarnings.

`TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!`
@jiaweiHu-XDU

I would like to ask: is there a PyTorch implementation of this CUDA C code? I haven't managed to port it to Python myself; if you have done this work before, I hope you can help me solve this problem.
