Regarding CPU implementation of correlation function. #39

Closed
Somdyuti2 opened this issue Jul 29, 2020 · 3 comments


@Somdyuti2

Hi, thanks for the implementation. In my use case, I need to run inference on the CPU. Inspecting your code in correlation.py, I gather that for that we would need to call the extern "C" functions ourselves instead of invoking CuPy to do it for us. In your code, each C function is called like this:
cupy_launch('kernel_Correlation_rearrange', cupy_kernel('kernel_Correlation_rearrange', {
    'input': second,
    'output': rbot1
}))(
    grid=tuple([ int((n + 16 - 1) / 16), second.shape[1], second.shape[0] ]),
    block=tuple([ 16, 1, 1 ]),
    args=[ n, second.data_ptr(), rbot1.data_ptr() ]
)

I am not familiar with CuPy, so it would be helpful if you could explain these function calls a bit and give any clue about how to do the equivalent on the CPU. I understand that the args in each call are the arguments passed to the C function, but I am not sure what grid and block signify here; presumably they are not needed when CuPy is not used. As I only need to run on the CPU at test time, I guess I don't need to care about the updateGrad functions.

I would appreciate your help/suggestions regarding this.

@sniklaus
Owner

To make inference on CPUs work, you will have to convert the following CUDA code to something that runs on CPUs instead.

kernel_Correlation_rearrange = '''
extern "C" __global__ void kernel_Correlation_rearrange(
    const int n,
    const float* input,
    float* output
) {
    int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x;

    if (intIndex >= n) {
        return;
    }

    int intSample = blockIdx.z;
    int intChannel = blockIdx.y;

    float fltValue = input[(((intSample * SIZE_1(input)) + intChannel) * SIZE_2(input) * SIZE_3(input)) + intIndex];

    __syncthreads();

    int intPaddedY = (intIndex / SIZE_3(input)) + 4;
    int intPaddedX = (intIndex % SIZE_3(input)) + 4;
    int intRearrange = ((SIZE_3(input) + 8) * intPaddedY) + intPaddedX;

    output[(((intSample * SIZE_1(output) * SIZE_2(output)) + intRearrange) * SIZE_1(input)) + intChannel] = fltValue;
}
'''
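For reference, this first kernel only zero-pads the feature map by 4 pixels on each side and moves the channel dimension last. On the CPU that can be expressed directly with array operations; here is a minimal NumPy sketch (the function name `rearrange_cpu` is mine, not from the repository):

```python
import numpy as np

def rearrange_cpu(tensor):
    """CPU sketch of kernel_Correlation_rearrange.

    tensor: float array of shape (N, C, H, W).
    Returns a zero-padded, channel-last array of shape (N, H + 8, W + 8, C),
    matching the layout the CUDA kernel writes into rbot0/rbot1.
    """
    padded = np.pad(tensor, ((0, 0), (0, 0), (4, 4), (4, 4)))  # pad H and W by 4
    return np.ascontiguousarray(padded.transpose(0, 2, 3, 1))  # NCHW -> NHWC
```

The same two steps can be written in PyTorch as `torch.nn.functional.pad` followed by `permute`.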

kernel_Correlation_updateOutput = '''
extern "C" __global__ void kernel_Correlation_updateOutput(
    const int n,
    const float* rbot0,
    const float* rbot1,
    float* top
) {
    extern __shared__ char patch_data_char[];
    float *patch_data = (float *)patch_data_char;

    // First (upper left) position of kernel upper-left corner in current center position of neighborhood in image 1
    int x1 = blockIdx.x + 4;
    int y1 = blockIdx.y + 4;
    int item = blockIdx.z;
    int ch_off = threadIdx.x;

    // Load 3D patch into shared memory
    for (int j = 0; j < 1; j++) { // HEIGHT
        for (int i = 0; i < 1; i++) { // WIDTH
            int ji_off = (j + i) * SIZE_3(rbot0);
            for (int ch = ch_off; ch < SIZE_3(rbot0); ch += 32) { // CHANNELS
                int idx1 = ((item * SIZE_1(rbot0) + y1+j) * SIZE_2(rbot0) + x1+i) * SIZE_3(rbot0) + ch;
                int idxPatchData = ji_off + ch;
                patch_data[idxPatchData] = rbot0[idx1];
            }
        }
    }

    __syncthreads();

    __shared__ float sum[32];

    // Compute correlation
    for (int top_channel = 0; top_channel < SIZE_1(top); top_channel++) {
        sum[ch_off] = 0;

        int s2o = top_channel % 9 - 4;
        int s2p = top_channel / 9 - 4;

        for (int j = 0; j < 1; j++) { // HEIGHT
            for (int i = 0; i < 1; i++) { // WIDTH
                int ji_off = (j + i) * SIZE_3(rbot0);
                for (int ch = ch_off; ch < SIZE_3(rbot0); ch += 32) { // CHANNELS
                    int x2 = x1 + s2o;
                    int y2 = y1 + s2p;

                    int idxPatchData = ji_off + ch;
                    int idx2 = ((item * SIZE_1(rbot0) + y2+j) * SIZE_2(rbot0) + x2+i) * SIZE_3(rbot0) + ch;

                    sum[ch_off] += patch_data[idxPatchData] * rbot1[idx2];
                }
            }
        }

        __syncthreads();

        if (ch_off == 0) {
            float total_sum = 0;
            for (int idx = 0; idx < 32; idx++) {
                total_sum += sum[idx];
            }
            const int sumelems = SIZE_3(rbot0);
            const int index = ((top_channel*SIZE_2(top) + blockIdx.y)*SIZE_3(top))+blockIdx.x;
            top[index + item*SIZE_1(top)*SIZE_2(top)*SIZE_3(top)] = total_sum / (float)sumelems;
        }
    }
}
'''
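The second kernel computes, for each of the 81 displacements in a 9x9 neighborhood, the channel-wise mean of the product between the two rearranged feature maps. On the CPU the per-thread shared-memory reduction disappears and the whole thing becomes a loop over displacements. A hedged NumPy sketch (`correlation_cpu` is my name for it; it operates on the padded channel-last layout produced by the rearrange step):

```python
import numpy as np

def correlation_cpu(rbot0, rbot1):
    """CPU sketch of kernel_Correlation_updateOutput.

    rbot0, rbot1: rearranged, padded arrays of shape (N, H + 8, W + 8, C).
    Returns top of shape (N, 81, H, W), one channel per displacement.
    """
    n, hp, wp, c = rbot0.shape
    h, w = hp - 8, wp - 8
    top = np.empty((n, 81, h, w), dtype=rbot0.dtype)
    for top_channel in range(81):
        s2o = top_channel % 9 - 4   # horizontal displacement in [-4, 4]
        s2p = top_channel // 9 - 4  # vertical displacement in [-4, 4]
        a = rbot0[:, 4:4 + h, 4:4 + w, :]
        b = rbot1[:, 4 + s2p:4 + s2p + h, 4 + s2o:4 + s2o + w, :]
        top[:, top_channel] = (a * b).sum(axis=3) / c  # mean over channels
    return top
```

The division by `c` mirrors the kernel's `total_sum / (float)sumelems`, where `sumelems` is the channel count.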

There is nothing CuPy-specific that you really need to be familiar with; the grid and block arguments come from CUDA. I would recommend looking into the fundamentals of CUDA, since it shouldn't take much to understand what the code is doing once you know the basics. Good luck!
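To make the grid/block remark concrete: a CUDA launch with `grid=(gx, gy, gz)` and `block=(bx, 1, 1)` just enumerates an index space, and on a CPU that enumeration is plain nested loops. A hypothetical helper (`launch_on_cpu` is not a real CuPy or CUDA API, only an illustration of the mapping):

```python
def launch_on_cpu(kernel, grid, block, args):
    """Run a kernel function once per (blockIdx, threadIdx) combination,
    the way a CUDA launch would, but serially on the CPU."""
    gx, gy, gz = grid
    bx, _, _ = block
    for block_z in range(gz):
        for block_y in range(gy):
            for block_x in range(gx):
                for thread_x in range(bx):
                    kernel(block_x, block_y, block_z, thread_x, *args)
```

In the rearrange launch quoted above, for example, `intIndex` would become `block_x * bx + thread_x`, `intChannel` would be `block_y`, and `intSample` would be `block_z`.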

@duongpaKikai

duongpaKikai commented Sep 15, 2020

I get a problem in the code above when tracing the model from PyTorch to TorchScript:

```
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\DIFRINT-v4\lib\site-packages\cupy\cuda\compiler.py", line 449, in compile
    nvrtc.compileProgram(self.ptr, options)
  File "cupy\cuda\nvrtc.pyx", line 101, in cupy.cuda.nvrtc.compileProgram
  File "cupy\cuda\nvrtc.pyx", line 111, in cupy.cuda.nvrtc.compileProgram
  File "cupy\cuda\nvrtc.pyx", line 56, in cupy.cuda.nvrtc.check_status
cupy.cuda.nvrtc.NVRTCError: NVRTC_ERROR_COMPILATION (6)

During handling of the above exception, another exception occurred:

  File "D:\project\test_1.0.0.pytorch\DIFRINT\models\correlation\correlation.py", line 143, in cupy_launch
    return cupy.cuda.compile_with_cache(strKernel).get_function(strFunction)
  File "C:\ProgramData\Anaconda3\envs\DIFRINT-v4\lib\site-packages\cupy\cuda\compiler.py", line 297, in compile_with_cache
    return _compile_with_cache_cuda(source, options, arch, cache_dir,
  File "C:\ProgramData\Anaconda3\envs\DIFRINT-v4\lib\site-packages\cupy\cuda\compiler.py", line 350, in _compile_with_cache_cuda
    ptx = compile_using_nvrtc(source, options, arch, name + '.cu')
  File "C:\ProgramData\Anaconda3\envs\DIFRINT-v4\lib\site-packages\cupy\cuda\compiler.py", line 158, in compile_using_nvrtc
    ptx = prog.compile(options)
  File "C:\ProgramData\Anaconda3\envs\DIFRINT-v4\lib\site-packages\cupy\cuda\compiler.py", line 453, in compile
    raise CompileException(log, self.src, self.name, options, 'nvrtc')
cupy.cuda.compiler.CompileException: C:\Users\Admin\AppData\Local\Temp\tmpg65_9di1\8f8f5ff490d72ae331cc951d3896a60d_2.cubin.cu(16): error: identifier "tensor" is undefined

1 error detected in the compilation of "C:\Users\Admin\AppData\Local\Temp\tmpg65_9di1\8f8f5ff490d72ae331cc951d3896a60d_2.cubin.cu".
```

Etienne66 added a commit to Etienne66/pytorch-pwc that referenced this issue Mar 7, 2022
This fixes issue sniklaus#39 and allows Tensorboard's SummaryWriter.add_graph to work although there are some TracerWarnings.

`TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!`
@jiaweiHu-XDU

I would like to ask: is there a PyTorch implementation of this CUDA C code? I haven't managed to port it to Python myself; if you have done this work before, I hope you can help me solve this problem.
