**Table of contents**<a id='toc0_'></a>    
- [Thrust Tutorial](#toc1_)    
  - [The underlying Compilation](#toc1_1_)    
  - [Code Explanation](#toc1_2_)    
  - [Execution Policy vs Specifier](#toc1_3_)    
    - [Code exercise: Compute Median Temperature](#toc1_3_1_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Thrust Tutorial](#toc0_)

References:
- [YouTube](https://www.youtube.com/watch?v=Sdjn9FOkhnA&list=PL5B692fm6--vWLhYPqLcEu6RF3hXjEyJr&index=1)
- [Colab](https://github.com/NVIDIA/accelerated-computing-hub/blob/main/tutorials/cuda-cpp/README.md)

## <a id='toc1_1_'></a>[The underlying Compilation](#toc0_)

![](Sources/compilation.png)




In [3]:
import os
os.environ["PATH"] = "/usr/local/cuda/bin:" + os.environ["PATH"]


## <a id='toc1_2_'></a>[Code Explanation](#toc0_)

Start with the following CPU code, which moves three temperatures toward the ambient temperature step by step:


In [7]:
%%writefile Sources/cpu-cooling.cpp

#include <cstdio>
#include <vector>
#include <algorithm>

int main() {
    float k = 0.5;            // fraction of the gap to ambient closed per step
    float ambient_temp = 20;
    std::vector<float> temp{ 42, 24, 50 };

    // Move one temperature toward ambient by k times the current gap.
    auto op = [=](float temp){
        float diff = ambient_temp - temp;
        return temp + k * diff;
    };

    std::printf("step  temp[0]  temp[1]  temp[2]\n");
    for (int step = 0; step < 3; step++) {
        // Update every element in place.
        std::transform(temp.begin(), temp.end(),
                        temp.begin(), op);

        std::printf("%d     %.2f    %.2f    %.2f\n", step, temp[0], temp[1], temp[2]);
    }
}

Overwriting Sources/cpu-cooling.cpp


In [10]:
!nvcc -x cu -arch=native Sources/cpu-cooling.cpp -o temp/a.out # compile the code
!./temp/a.out # run the executable

step  temp[0]  temp[1]  temp[2]
0     31.00    22.00    35.00
1     25.50    21.00    27.50
2     22.75    20.50    23.75
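
The printed values can be checked by hand. Each update shrinks the gap to the ambient temperature by a factor of `1 - k`, so after `n` steps the temperature is `ambient + (1 - k)^n * (temp0 - ambient)`. The sketch below, in plain C++, compares the step-by-step loop with that closed form. The function names `cool_step`, `cool_iterated`, and `cool_closed_form` are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <cmath>

// One update, the same formula as the lambda `op` in the code above.
float cool_step(float temp, float ambient, float k) {
    return temp + k * (ambient - temp);
}

// Apply the update `steps` times in a loop.
float cool_iterated(float temp, float ambient, float k, int steps) {
    assert(steps >= 0);
    for (int i = 0; i < steps; ++i) {
        temp = cool_step(temp, ambient, k);
    }
    return temp;
}

// Closed form: the gap to ambient is multiplied by (1 - k) on every step.
float cool_closed_form(float temp, float ambient, float k, int steps) {
    assert(steps >= 0);
    return ambient + std::pow(1.0f - k, static_cast<float>(steps)) * (temp - ambient);
}
```

For the first element, `cool_iterated(42, 20, 0.5f, 3)` gives 22.75, which matches the last row of the output above.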


Next, we port the same computation to the GPU.


`thrust::universal_vector` is a vector whose elements can be accessed from both the host and the device.
It is backed by unified (managed) memory, a memory management system that lets the CPU and GPU share a single address space without explicit `cudaMemcpy` transfers. Under the hood, the storage is allocated with `cudaMallocManaged`. The CUDA runtime then migrates data between host and device on demand, in units of pages (typically 4 KB). **Synchronization** is still required, however: accessing the same data from both host and device without synchronization is undefined behavior (UB). In the code below, the `thrust::transform` call with `thrust::device` returns only after the GPU work has finished, so the `printf` that follows can read `temp` safely.



In [None]:
%%writefile Sources/thrust-cooling.cpp

#include <thrust/execution_policy.h>
#include <thrust/universal_vector.h>
#include <thrust/transform.h>
#include <cstdio>

int main() {
    float k = 0.5;
    float ambient_temp = 20;
    thrust::universal_vector<float> temp{ 42, 24, 50 };
    auto transformation = [=] __host__ __device__ (float temp) { return temp + k * (ambient_temp - temp); };

    std::printf("step  temp[0]  temp[1]  temp[2]\n");
    for (int step = 0; step < 3; step++) {
        thrust::transform(thrust::device, temp.begin(), temp.end(), temp.begin(), transformation);
        std::printf("%d     %.2f    %.2f    %.2f\n", step, temp[0], temp[1], temp[2]);
    }
}

Overwriting Sources/thrust-cooling.cpp


In [16]:
!nvcc -std=c++14 --extended-lambda Sources/thrust-cooling.cpp -x cu -arch=native -o temp/a.out # compile the code
!./temp/a.out # run the executable

step  temp[0]  temp[1]  temp[2]
0     31.00    22.00    35.00
1     25.50    21.00    27.50
2     22.75    20.50    23.75


## <a id='toc1_3_'></a>[Execution Policy vs Specifier](#toc0_)
An execution policy (`thrust::device`, `thrust::host`) indicates where the code will run. It doesn't automatically compile the code for that location.

An execution space specifier (`__host__`, `__device__`) indicates where the code can run. It doesn't automatically run the code there.

![](Sources/policyvsspecifier.png)
![](Sources/table_policyvsspecifier.png)
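
One way to picture the policy half of this distinction is tag dispatch: the policy is an extra argument whose type selects an implementation at compile time. The sketch below is a toy model in plain C++, not Thrust's actual code. The namespace `toy`, the tags `host_policy` and `device_policy`, and the returned backend name are all assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace toy {

// Hypothetical policy tags standing in for thrust::host and thrust::device.
struct host_policy {};
struct device_policy {};

// Chosen when the caller passes host_policy: a plain CPU loop.
template <class InIt, class OutIt, class Op>
std::string transform(host_policy, InIt first, InIt last, OutIt out, Op op) {
    for (; first != last; ++first, ++out) *out = op(*first);
    return "host";
}

// Chosen when the caller passes device_policy. A real device backend would
// launch a kernel here, which only works if `op` was also compiled for the
// device (the job of __device__). This toy version reuses the CPU loop.
template <class InIt, class OutIt, class Op>
std::string transform(device_policy, InIt first, InIt last, OutIt out, Op op) {
    for (; first != last; ++first, ++out) *out = op(*first);
    return "device";
}

}  // namespace toy
```

Calling `toy::transform(toy::host_policy{}, ...)` and `toy::transform(toy::device_policy{}, ...)` with the same lambda picks different overloads. The policy decides where the loop runs, while the specifier on the lambda decides where the lambda is allowed to run.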

### <a id='toc1_3_1_'></a>[Code exercise: Compute Median Temperature](#toc0_)

In [11]:
%%writefile Sources/port-sort-to-gpu.cpp
#include <thrust/execution_policy.h>
#include <thrust/universal_vector.h>
#include <thrust/transform.h>
#include <cstdio>
#include <algorithm>

float median(thrust::universal_vector<float> vec)
{
    
    std::sort(vec.begin(), vec.end());
    return vec[vec.size() / 2];
}

int main(){
    float k =0.5;
    float ambient_temp =20;
    thrust::universal_vector<float> temp{42,24,50};
    auto transformation = [=] __host__ __device__ (float temp) { return temp + k * (ambient_temp - temp); };
    std::printf("step  median\n");
    for (int step = 0; step < 3; step++) {
        thrust::transform(thrust::device, temp.begin(),temp.end(),
                          temp.begin(), transformation);
        float median_temp = median(temp);
        std::printf("%d     %.2f\n", step, median_temp);
    }
}

Overwriting Sources/port-sort-to-gpu.cpp


In [18]:
# !nvcc -std=c++14 --extended-lambda Sources/port-sort-to-gpu.cpp -x cu -arch=native -o temp/a.out # compile the code
# !time -v ./temp/a.out # run the executable
!/usr/bin/time -v ./temp/a.out


step  median
0     31.00
1     25.50
2     22.75
	Command being timed: "./temp/a.out"
	User time (seconds): 0.02
	System time (seconds): 0.13
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.16
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 106920
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 3
	Minor (reclaiming a frame) page faults: 5599
	Voluntary context switches: 60
	Involuntary context switches: 1
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0


## <a id='toc1_4_'></a>[Thrust fancy iterators](#toc0_)
**We can omit the execution policy here: when none is given, Thrust chooses where to run based on the iterators passed in.**

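Here we show three types of fancy iterators in Thrust: a counting iterator, which produces consecutive values without storing them; a zip iterator, which presents several sequences as one sequence of tuples; and a transform iterator, which applies a function each time an element is read.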
```c++
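// Counting iterator: yields 1, 2, 3, ... on the fly, with no backing storage.
auto begin = thrust::make_counting_iterator(1);
auto end = begin + 10;
thrust::for_each(begin, end, print);  // `print` is a user-provided functor (not shown)

// Zip iterator: pairs a[i] with b[i] as one tuple per element.
thrust::make_zip_iterator(a.begin(), b.begin());

// Transform iterator: applies the function each time an element is read.
thrust::make_transform_iterator(a.begin(), 
    [] __host__ __device__ (float x) { return x * x; });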
```
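
To make the "no backing storage" idea concrete, here is a minimal sketch of a counting iterator in plain C++. It is a toy type written for this note, not Thrust's implementation. The names `toy_counting_iterator` and `sum_range` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>

// Toy counting iterator: the "element" is just the counter itself.
struct toy_counting_iterator {
    using iterator_category = std::input_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = int;

    int value;

    // Dereferencing computes the value; nothing is loaded from memory.
    int operator*() const { return value; }
    toy_counting_iterator& operator++() { ++value; return *this; }
    toy_counting_iterator operator++(int) { auto old = *this; ++value; return old; }
    toy_counting_iterator operator+(int n) const { return {value + n}; }
    bool operator==(const toy_counting_iterator& o) const { return value == o.value; }
    bool operator!=(const toy_counting_iterator& o) const { return value != o.value; }
};

// Sum first, first + 1, ..., first + count - 1 without allocating an array.
int sum_range(int first, int count) {
    assert(count >= 0);
    toy_counting_iterator begin{first};
    return std::accumulate(begin, begin + count, 0);
}
```

Here `sum_range(1, 10)` adds the values 1 through 10 using a standard algorithm, even though no array of those values ever exists.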
In the following code, we use Thrust to compute the maximum change in temperature between two steps.
The straightforward approach takes two steps:
1. compute the gap between the current and previous temperatures, stored in `a` and `b`;
2. compute the maximum of the gaps.

This implementation reads `2*n` floats in total from `a` and `b` and stores `n` floats to a temporary array `diff`. The reduction step then reads `n` floats back from `diff`, so the total memory traffic is `4*n` floats.

To reduce the memory traffic, we can track the maximum while computing the gaps, so the temporary array `diff` is no longer needed. The total memory traffic drops to `2*n` floats.
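
The same contrast can be sketched on the CPU with standard algorithms. This is only an analogue of the Thrust code that follows, under the assumption that `a` and `b` have equal length. The function names `max_change_two_pass` and `max_change_fused` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Two passes: store every |a[i] - b[i]| in a temporary, then scan it.
float max_change_two_pass(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> diff(a.size());
    std::transform(a.begin(), a.end(), b.begin(), diff.begin(),
                   [](float x, float y) { return std::fabs(x - y); });
    return std::accumulate(diff.begin(), diff.end(), 0.0f,
                           [](float m, float d) { return std::max(m, d); });
}

// One pass: compute each gap on the fly and fold it into the running maximum.
float max_change_fused(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0f,
                              [](float m, float d) { return std::max(m, d); },
                              [](float x, float y) { return std::fabs(x - y); });
}
```

Both functions return the same value. The fused version never allocates or touches a temporary array, which is the saving that the zip and transform iterators provide in the Thrust version.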


In [50]:
%%writefile Sources/naive-vs-iterators.cpp
#include <thrust/execution_policy.h>
#include <thrust/universal_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <cstdio>
#include <chrono>

float naive_max_change(const thrust::universal_vector<float>& a, const thrust::universal_vector<float>& b)
{
    thrust::universal_vector<float> diff(a.size());
    // For each i: x = a[i], y = b[i], diff[i] = abs(x - y)
    thrust::transform(thrust::device, a.begin(), a.end(), b.begin(), diff.begin(),
        []__host__ __device__(float x, float y) {
            return abs(x - y);
        });
    return thrust::reduce(thrust::device, diff.begin(), diff.end(), 0.0f, thrust::maximum<float>{});
}

float max_change(const thrust::universal_vector<float>& a, const thrust::universal_vector<float>& b)
{
    auto zip = thrust::make_zip_iterator(a.begin(), b.begin());
    auto transform = thrust::make_transform_iterator(zip, []__host__ __device__(thrust::tuple<float, float> t) {
        return abs(thrust::get<0>(t) - thrust::get<1>(t));
    });
    return thrust::reduce(thrust::device, transform, transform + a.size(), 0.0f, thrust::maximum<float>{});
}

int main()
{
    // allocate vectors containing 2^28 elements
    thrust::universal_vector<float> a(1 << 28);
    thrust::universal_vector<float> b(1 << 28);

    thrust::sequence(a.begin(), a.end());
    thrust::sequence(b.rbegin(), b.rend());

    auto start_naive = std::chrono::high_resolution_clock::now();
    naive_max_change(a, b);
    auto end_naive = std::chrono::high_resolution_clock::now();
    const double naive_duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_naive - start_naive).count();

    auto start = std::chrono::high_resolution_clock::now();
    max_change(a, b);
    auto end = std::chrono::high_resolution_clock::now();
    const double duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();

    std::printf("iterators are %g times faster than naive approach\n", naive_duration / duration);
}

Overwriting Sources/naive-vs-iterators.cpp


In [51]:
!nvcc --extended-lambda -o /tmp/a.out Sources/naive-vs-iterators.cpp -x cu -arch=native # build executable
!/tmp/a.out # run executable

iterators are 95 times faster than naive approach

The memory-traffic estimate above predicts a speedup of roughly 2x, not 95x. The much larger ratio measured here probably also reflects costs that only the naive version pays inside the timed region, such as allocating and initializing the `2^28`-element `diff` vector in unified memory. Each version is also timed only once, at millisecond resolution, so the ratio should be read as a rough indication rather than a precise measurement.
