Understanding unified memory page allocation and transfer
First, new pages need to be allocated on the GPU and CPU (on a first-touch basis). If a page is not present on the device, or is mapped to another processor's memory, a device page table fault occurs. For example, when the GPU accesses *x, which resides in page 2 and is currently mapped to CPU memory, it triggers a page fault. Take a look at the following diagram:
In the next step, the old page on the CPU is unmapped, as shown in the following diagram:
Next, the data is copied from the CPU to the GPU, as shown in the following diagram:
Finally, the new pages are mapped on the GPU, while the old pages are freed on the CPU, as shown in the following diagram:
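The four steps above (fault, unmap, copy, map) are triggered transparently by a single memory access. A minimal sketch of code that would cause such a migration, assuming a CUDA 8.0+ toolkit and a Pascal-or-later GPU with on-demand page migration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel that touches managed memory. On Pascal-class GPUs this access
// faults, and the driver migrates the page from CPU to GPU memory.
__global__ void increment(int *x) {
    *x += 1;  // device page fault -> unmap on CPU -> copy -> map on GPU
}

int main() {
    int *x;
    cudaMallocManaged(&x, sizeof(int)); // page is allocated on first touch
    *x = 41;                            // first touch on CPU: page resides in CPU memory
    increment<<<1, 1>>>(x);             // GPU access triggers the migration steps above
    cudaDeviceSynchronize();
    printf("%d\n", *x);                 // CPU access migrates the page back
    cudaFree(x);
    return 0;
}
```

Note that the page holding x starts in CPU memory because the host touches it first; the kernel's access is what drives the fault-and-migrate sequence shown in the diagrams.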
CUDA Thread Programming
CUDA threads, blocks, and the GPU
Understanding parallel reduction
Naive parallel reduction using global memory
Reducing kernels using shared memory
Minimizing the CUDA warp divergence effect
Determining divergence as a performance bottleneck
Interleaved addressing
Sequential addressing
Performance modeling and balancing the limiter
The Roofline model
Warp-level primitive programming
Parallel reduction with warp primitives
Cooperative Groups for flexible thread handling
Cooperative Groups in a CUDA thread block
Benefits of Cooperative Groups
Modularity
Atomic operations
Low/mixed precision operations
Dot product operations and accumulation for 8-bit integers and 16-bit data (DP4A and DP2A)
Kernel Execution Model and Optimization Strategies
Kernel execution with CUDA streams
The usage of CUDA streams
Stream-level synchronization
Working with the default stream
Pipelining the GPU execution
Concept of GPU pipelining
Building a pipelining execution
The CUDA callback function
CUDA streams with priority
Stream execution with priorities
Kernel execution time estimation using CUDA events
Using CUDA events
CUDA dynamic parallelism
Usage of dynamic parallelism
Grid-level cooperative groups
Understanding grid-level cooperative groups
CUDA kernel calls with OpenMP
Multi-Process Service
Enabling MPS
Profiling an MPI application and understanding MPS operation
Kernel execution overhead comparison
Comparison of three executions
CUDA Application Profiling and Debugging
Scalable Multi-GPU Programming
Solving a linear equation using Gaussian elimination
Single GPU hotspot analysis of Gaussian elimination
GPUDirect peer to peer
Single node – multi-GPU Gaussian elimination
GPUDirect RDMA
CUDA-aware MPI
Multinode – multi-GPU Gaussian elimination
CUDA streams
Application 1 – using multiple streams to overlap data transfers with kernel execution
Application 2 – using multiple streams to run kernels on multiple devices
Additional tricks
Collective communication acceleration using NCCL
Parallel Programming Patterns in CUDA
Matrix multiplication optimization
Performance analysis of the tiling approach
Convolution
Convolution operation in CUDA
Optimization strategy
Prefix sum (scan)
Building a global size scan
Compact and split
N-body
Implementing an N-body simulation on GPU
Histogram calculation
Understanding a parallel histogram
Quicksort and CUDA dynamic parallelism
Quicksort in CUDA using dynamic parallelism
Radix sort
Programming with Libraries and Other Languages
Linear algebra operation using cuBLAS
Level in cuBLAS | Operation
Level 1         | vector-vector
Level 2         | matrix-vector
Level 3         | matrix-matrix
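As a sketch of how these levels appear in the API (assumes the cuBLAS library is available; values are illustrative), a Level 1 vector-vector operation such as SAXPY (y = alpha * x + y) can be called like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 4;
    const float alpha = 2.0f;
    float hx[n] = {1, 2, 3, 4}, hy[n] = {0, 0, 0, 0};

    // Copy input vectors to the device
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    // Level 1 (vector-vector): y = alpha * x + y
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);
    cublasDestroy(handle);

    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%.1f ", hy[i]); // 2.0 4.0 6.0 8.0
    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

Level 2 and Level 3 follow the same pattern with calls such as cublasSgemv (matrix-vector) and cublasSgemm (matrix-matrix).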
GPU Programming Using OpenACC
OpenACC directives
Parallel and loop directives
Data directive
Asynchronous programming in OpenACC
Applying the unstructured data and async directives to merge image code
Additional important directives and clauses
Gang/vector/worker
Deep Learning Acceleration with CUDA
Source
Jaegeun Han and Bharatkumar Sharma, Learn CUDA Programming: A Beginner's Guide to GPU Programming and Parallel Computing with CUDA 10.x and C/C++, Packt Publishing, 2019.