# CUDA Programming Model: Thread Organization

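### This lesson covers the following topics:
1. Kernels that use multiple threads
2. Using thread indices
3. Multi-dimensional grids
4. Grids and thread blocks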



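1. When we discuss GPUs and CUDA, we inevitably have to consider how to address each thread and how to locate it. In the CUDA programming model, every thread has a unique identifier, or index. We can use __threadIdx__ to obtain the index of the current thread within its thread block, and __blockIdx__ to obtain the index of that thread block within the grid, that is:  

    threadIdx.x is the index, in the x direction within its block, of the thread executing the current kernel function  
    
    blockIdx.x is the index, in the x direction within the grid, of the block that contains the thread executing the current kernel function

Next, create the file [Index_of_thread.cu](Index_of_thread.cu) and, inside the kernel function, print the index of the thread executing it and the index of the thread block it belongs to. If you run into trouble, refer to [result1.cu](result1.cu)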
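For orientation, here is a minimal sketch of what such a kernel might look like. It is not necessarily identical to result1.cu: the names `hello_from_gpu` and `launch_hello` are assumptions made for illustration, and the `<<<5, 5>>>` launch configuration is inferred from the first output listing in this lesson.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread prints the index of its block and its own index inside that block.
__global__ void hello_from_gpu()
{
    printf("Hello World from block %u and thread %u!\n", blockIdx.x, threadIdx.x);
}

// Hypothetical helper: launches the kernel and waits until it has finished,
// so that the device-side printf output is flushed before the program exits.
cudaError_t launch_hello(int num_blocks, int threads_per_block)
{
    hello_from_gpu<<<num_blocks, threads_per_block>>>();
    cudaError_t err = cudaGetLastError();  // catches invalid launch configurations
    if (err != cudaSuccess)
    {
        return err;
    }
    return cudaDeviceSynchronize();
}

int main()
{
    // 5 blocks of 5 threads each (assumed configuration).
    cudaError_t err = launch_hello(5, 5);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

The order in which the blocks print is not guaranteed, so the lines may appear in a different block order from run to run.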

Once the file is ready, compile it

In [27]:
!make

/usr/local/cuda/bin/nvcc -arch=compute_80 -code=sm_80 Index_of_thread.cu -o ./Index_of_thread


Run Index_of_thread

In [28]:
!./Index_of_thread

Hello World from block 2 and thread 0!
Hello World from block 2 and thread 1!
Hello World from block 2 and thread 2!
Hello World from block 2 and thread 3!
Hello World from block 2 and thread 4!
Hello World from block 1 and thread 0!
Hello World from block 1 and thread 1!
Hello World from block 1 and thread 2!
Hello World from block 1 and thread 3!
Hello World from block 1 and thread 4!
Hello World from block 3 and thread 0!
Hello World from block 3 and thread 1!
Hello World from block 3 and thread 2!
Hello World from block 3 and thread 3!
Hello World from block 3 and thread 4!
Hello World from block 4 and thread 0!
Hello World from block 4 and thread 1!
Hello World from block 4 and thread 2!
Hello World from block 4 and thread 3!
Hello World from block 4 and thread 4!
Hello World from block 0 and thread 0!
Hello World from block 0 and thread 1!
Hello World from block 0 and thread 2!
Hello World from block 0 and thread 3!
Hello World from block 0 and thread 4!


Change the values inside <<<...>>> and look at the results. Three configurations are suggested: <<<33,5>>>, <<<5,33>>>, and <<<5,65>>>

In [31]:
!make

/usr/local/cuda/bin/nvcc -arch=compute_80 -code=sm_80 Index_of_thread.cu -o ./Index_of_thread


In [62]:
!./Index_of_thread

Hello World from block 1 and thread 32!
Hello World from block 2 and thread 32!
Hello World from block 3 and thread 32!
Hello World from block 4 and thread 32!
Hello World from block 0 and thread 32!
Hello World from block 3 and thread 0!
Hello World from block 3 and thread 1!
Hello World from block 3 and thread 2!
Hello World from block 3 and thread 3!
Hello World from block 3 and thread 4!
Hello World from block 3 and thread 5!
Hello World from block 3 and thread 6!
Hello World from block 3 and thread 7!
Hello World from block 3 and thread 8!
Hello World from block 3 and thread 9!
Hello World from block 3 and thread 10!
Hello World from block 3 and thread 11!
Hello World from block 3 and thread 12!
Hello World from block 3 and thread 13!
Hello World from block 3 and thread 14!
Hello World from block 3 and thread 15!
Hello World from block 3 and thread 16!
Hello World from block 3 and thread 17!
Hello World from block 3 and thread 18!
Hello World from block 3 and thread 19!
Hello Worl

Think about why this happens!  
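One way to explore this question is to print which warp each thread belongs to. The sketch below assumes that the relevant fact is the hardware grouping of threads into warps of `warpSize` threads (32 on current NVIDIA GPUs); the names `warp_of` and `show_warps` are hypothetical. Note that the ordering of `printf` output across warps and blocks is not guaranteed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: index of the warp that a thread belongs to inside its block.
__host__ __device__ inline int warp_of(int thread_idx, int warp_size)
{
    return thread_idx / warp_size;
}

// Each thread reports its warp and its lane (position inside the warp).
__global__ void show_warps()
{
    const int warp = warp_of(threadIdx.x, warpSize);
    const int lane = threadIdx.x % warpSize;
    printf("block %u thread %u -> warp %d lane %d\n",
           blockIdx.x, threadIdx.x, warp, lane);
}

int main()
{
    // 33 threads per block: threads 0..31 form warp 0, thread 32 sits alone in warp 1.
    show_warps<<<5, 33>>>();
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

With 33 threads per block, thread 32 is the only thread in its warp, which is consistent with it being printed separately from threads 0 to 31 in the listing above.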
  
    
    

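2. How, then, do we obtain a thread's index among all threads? Suppose we launch 4 thread blocks with 8 threads each, which gives 32 threads in total. How do we find the index, among all threads, of the 6th thread (thread number 5) in the 3rd thread block (block number 2)?  
For this we need two more variables, blockDim and gridDim:  
- gridDim gives the number of blocks in a grid  
- blockDim gives the number of threads in a block  

In the example above, gridDim.x=4 and blockDim.x=8.  
The thread we are looking for is the 22nd thread overall (number 21), and its unique index is index = blockIdx.x * blockDim.x + threadIdx.x = 2 * 8 + 5 = 21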
![index_of_thread](index_of_thread.png)
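The index formula can be checked with a small kernel. This is a sketch under the example's assumptions (4 blocks of 8 threads, one-dimensional launch); `global_index` and `print_global_index` are hypothetical names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: global index of a thread in a one-dimensional launch.
__host__ __device__ inline int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

__global__ void print_global_index()
{
    const int index = global_index(blockIdx.x, blockDim.x, threadIdx.x);
    // Only the thread from the example (block 2, thread 5) prints.
    if (blockIdx.x == 2 && threadIdx.x == 5)
    {
        printf("gridDim.x=%u blockDim.x=%u -> global index %d\n",
               gridDim.x, blockDim.x, index);
    }
}

int main()
{
    print_global_index<<<4, 8>>>();  // 4 blocks of 8 threads = 32 threads
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

On a working GPU this prints `gridDim.x=4 blockDim.x=8 -> global index 21`, matching the hand calculation 2 * 8 + 5.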

Next, let's put this into practice with a vector addition example. The CPU implementation is as follows:  

    #include <math.h>
    #include <stdbool.h>  // bool, true, false when compiled as plain C
    #include <stdlib.h>
    #include <stdio.h>
    
    void add(const double *x, const double *y, double *z, const int N)
    {
        for (int n = 0; n < N; ++n)
        {
            z[n] = x[n] + y[n];
        }
    }

    void check(const double *z, const int N)
    {
        bool has_error = false;
        for (int n = 0; n < N; ++n)
        {
            if (fabs(z[n] - 3) > (1.0e-10))
            {
                has_error = true;
            }
        }
        printf("%s\n", has_error ? "Errors" : "Pass");
    }


    int main(void)
    {
        const int N = 100000000;
        const size_t M = sizeof(double) * N;  // size in bytes; size_t avoids int overflow for larger N
        double *x = (double*) malloc(M);
        double *y = (double*) malloc(M);
        double *z = (double*) malloc(M);
    
        for (int n = 0; n < N; ++n)
        {
            x[n] = 1;
            y[n] = 2;
        }

        add(x, y, z, N);
        check(z, N);
    
        free(x);
        free(y);
        free(z);
        return 0;
    }

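To run this program on the GPU, we first have to transfer the data to the GPU and, once the GPU has finished computing, transfer the results from the GPU back to CPU memory. This means we need to consider how to allocate GPU memory and how to move data between host memory and device memory. [result2](result2.cu) shows one way to do this:  

We use cudaMalloc() to allocate GPU memory and cudaMemcpy() to transfer the data.

Next, complete this process in [vectorAdd.cu](vectorAdd.cu). If you run into difficulty, refer to [result2](result2.cu)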
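As a rough guide, the allocate, copy, launch, copy back sequence could look like the sketch below. It is not necessarily the same as result2.cu: the helper `run_vector_add`, the `h_`/`d_` naming, the block size of 128, and the reduced problem size of `1 << 20` elements are all assumptions chosen to keep the example small.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// GPU version of add(): each thread handles one element.
__global__ void add(const double *x, const double *y, double *z, const int N)
{
    const int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n < N)  // threads past the end of the arrays do nothing
    {
        z[n] = x[n] + y[n];
    }
}

// Hypothetical helper: runs the whole pipeline and returns the number of
// wrong elements, or -1 if an allocation or a CUDA call failed.
int run_vector_add(const int N)
{
    const size_t M = sizeof(double) * N;
    double *h_x = (double*) malloc(M);
    double *h_y = (double*) malloc(M);
    double *h_z = (double*) malloc(M);
    double *d_x = NULL, *d_y = NULL, *d_z = NULL;
    int errors = -1;

    if (h_x && h_y && h_z &&
        cudaMalloc((void**)&d_x, M) == cudaSuccess &&
        cudaMalloc((void**)&d_y, M) == cudaSuccess &&
        cudaMalloc((void**)&d_z, M) == cudaSuccess)
    {
        for (int n = 0; n < N; ++n)
        {
            h_x[n] = 1;
            h_y[n] = 2;
        }

        // Host memory -> device memory.
        cudaMemcpy(d_x, h_x, M, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, M, cudaMemcpyHostToDevice);

        const int block_size = 128;
        const int grid_size = (N + block_size - 1) / block_size;  // round up
        add<<<grid_size, block_size>>>(d_x, d_y, d_z, N);

        // Device memory -> host memory; this blocking copy waits for the kernel.
        if (cudaMemcpy(h_z, d_z, M, cudaMemcpyDeviceToHost) == cudaSuccess)
        {
            errors = 0;
            for (int n = 0; n < N; ++n)
            {
                if (fabs(h_z[n] - 3) > 1.0e-10) ++errors;
            }
        }
    }

    cudaFree(d_x);  // cudaFree on a null pointer is a no-op
    cudaFree(d_y);
    cudaFree(d_z);
    free(h_x);
    free(h_y);
    free(h_z);
    return errors;
}

int main(void)
{
    const int errors = run_vector_add(1 << 20);
    printf("%s\n", errors == 0 ? "Pass" : "Errors");
    return errors == 0 ? 0 : 1;
}
```

The sketch issues three cudaMalloc calls, three cudaMemcpy calls (two to the device, one back), and one kernel launch.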

Modify the Makefile, then compile and run

In [75]:
!make -f Makefile_vectoradd

/usr/local/cuda/bin/nvcc -arch=compute_80 -code=sm_80 vectorAdd.cu -o ./vectorAdd


In [76]:
!./vectorAdd

Pass


Use nvprof (invoked here through nsys) to inspect the program's performance

In [41]:
!/usr/local/cuda/bin/nsys nvprof --print-api-trace ./vectorAdd

The --print-api-trace  switch is ignored by nsys.

Pass
Generating '/tmp/nsys-report-b89e.qdstrm'
[3/7] Executing 'nvtxsum' stats report
SKIPPED: /mnt/CUDA_on_ARM/02_2.2/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cudaapisum' stats report

CUDA API Statistics:

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)   Med (ns)  Min (ns)  Max (ns)   StdDev (ns)        Name      
 --------  ---------------  ---------  ----------  --------  --------  ---------  -----------  ----------------
     99.9        204451190          3  68150396.7    6331.0      3950  204440909  118031046.0  cudaMalloc      
      0.1           178677          3     59559.0    7709.0      3413     167555      93551.9  cudaFree        
      0.0            62352          3     20784.0   23911.0     11294      27147       8376.3  cudaMemcpy      
      0.0            23978          1     23978.0   23978.0     23978      23978          0.0  cudaLaunchKernel

[5/7] Executing 'gpukernsum'

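Homework:  
1. What happens if we launch too many threads, for example by setting grid_size = (N + block_size - 1) / block_size+10000? How can this be avoided? 
2. What should we do if there is far more data to process than the number of threads we can launch?
3. Modify [sobel.cu](sobel.cu) to optimize the Sobel edge-detection kernel. If you run into problems, refer to [sobel_result.cu](sobel_result.cu)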
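For questions 1 and 2, one commonly used pattern is a bounds-checked grid-stride loop. The sketch below is one possible approach rather than the course's reference answer; `add_grid_stride` and `run_grid_stride` are hypothetical names, and the launch sizes in `main` are arbitrary.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread starts at its global index and then jumps ahead by the total
// number of launched threads, so any grid size covers all N elements.
// The n < N condition also stops surplus threads from writing out of bounds.
__global__ void add_grid_stride(const double *x, const double *y, double *z, const int N)
{
    const int stride = gridDim.x * blockDim.x;  // total number of launched threads
    for (int n = blockIdx.x * blockDim.x + threadIdx.x; n < N; n += stride)
    {
        z[n] = x[n] + y[n];
    }
}

// Hypothetical helper: returns the number of wrong elements, or -1 on a CUDA failure.
int run_grid_stride(const int N, const int grid_size, const int block_size)
{
    const size_t M = sizeof(double) * N;
    std::vector<double> h_x(N, 1.0), h_y(N, 2.0), h_z(N, 0.0);
    double *d_x = nullptr, *d_y = nullptr, *d_z = nullptr;
    int errors = -1;

    if (cudaMalloc((void**)&d_x, M) == cudaSuccess &&
        cudaMalloc((void**)&d_y, M) == cudaSuccess &&
        cudaMalloc((void**)&d_z, M) == cudaSuccess)
    {
        cudaMemcpy(d_x, h_x.data(), M, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y.data(), M, cudaMemcpyHostToDevice);
        add_grid_stride<<<grid_size, block_size>>>(d_x, d_y, d_z, N);
        if (cudaMemcpy(h_z.data(), d_z, M, cudaMemcpyDeviceToHost) == cudaSuccess)
        {
            errors = 0;
            for (int n = 0; n < N; ++n)
            {
                if (fabs(h_z[n] - 3) > 1.0e-10) ++errors;
            }
        }
    }
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_z);
    return errors;
}

int main()
{
    // Far fewer threads than elements: 32 * 128 = 4096 threads for 2^20 elements.
    const int few_threads = run_grid_stride(1 << 20, 32, 128);
    // More threads than elements: the surplus threads fail n < N immediately.
    const int many_threads = run_grid_stride(1000, 64, 128);
    printf("%s\n", (few_threads == 0 && many_threads == 0) ? "Pass" : "Errors");
    return (few_threads == 0 && many_threads == 0) ? 0 : 1;
}
```

Without the `n < N` condition, surplus threads would read and write past the end of the arrays, which is the failure mode question 1 points at.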

----
Compile:

In [77]:
# Modified here to match the local environment.
!/usr/local/cuda/bin/nvcc -arch=compute_80 -code=sm_80 sobel.cu -L /usr/lib/x86_64-linux-gnu/libopencv*.so -I /usr/include/opencv2 -o sobel




In [78]:
#!/usr/local/cuda/bin/nvcc -arch=compute_80 -code=sm_80 sobel.cu -L /usr/lib/aarch64-linux-gnu/libopencv*.so -I /usr/include/opencv4 -o sobel

Run:

In [80]:
!./sobel