In this chapter, we will explore tools and techniques for debugging and profiling CUDA programs. Debugging helps you identify and fix errors in your code, while profiling helps you optimize performance by analyzing how your code executes on the GPU. Both are essential steps in CUDA development: together they ensure your code runs correctly and efficiently.
cuda-gdb
cuda-gdb is NVIDIA's command-line debugger for CUDA applications, extending GDB with support for device code. It allows you to set breakpoints, step through code, and inspect variables in both host and device code.
- Compile with Debug Information (-g embeds host debug symbols, -G embeds device debug symbols; note that -G disables most device-code optimizations):
nvcc -g -G -o debapro debapro.cu
- Start cuda-gdb:
cuda-gdb ./debapro
- Set Breakpoints:
(cuda-gdb) break main
- Run the Program:
(cuda-gdb) run
- Step Through Code:
(cuda-gdb) next
- Inspect Variables:
(cuda-gdb) print variable_name
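Beyond the standard GDB commands, cuda-gdb adds CUDA-specific commands for inspecting kernels and switching thread focus. A typical session against the matrix-multiplication sample later in this chapter might look like the sketch below (output omitted; exact block and thread coordinates will differ on your run):

```text
(cuda-gdb) break matrixMul                 # break at the kernel entry
(cuda-gdb) run
(cuda-gdb) info cuda kernels               # list kernels currently running on the GPU
(cuda-gdb) cuda block (0,0) thread (2,3)   # switch focus to a specific thread
(cuda-gdb) print row                       # inspect per-thread variables
(cuda-gdb) print col
(cuda-gdb) continue
```

Switching focus matters because device variables such as row and col have a different value in every thread; print always reports the value in the thread currently in focus.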
Nsight Eclipse Edition
Nsight Eclipse Edition provides an integrated development environment for debugging CUDA applications.
- Open Nsight Eclipse Edition.
- Create a Debug Configuration:
- Go to Run > Debug Configurations.
- Create a new CUDA C/C++ Application configuration.
- Set the project and application to your debapro executable.
- Set Breakpoints and Start Debugging.
Nsight Visual Studio Edition
Nsight Visual Studio Edition integrates with Visual Studio to provide debugging capabilities for CUDA applications.
- Open Visual Studio and load your CUDA project.
- Attach the Debugger:
- Go to Debug > Attach to Process.
- Select the CUDA application you want to debug.
- Set Breakpoints and Start Debugging.
Nsight Systems
Nsight Systems provides system-wide performance analysis, helping you identify bottlenecks in your application.
- Run Nsight Systems:
nsys profile ./debapro
- Analyze the Report:
- This command generates a report file (e.g., report.qdrep).
- Open this file in the Nsight Systems GUI for detailed analysis.
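You can also summarize the report directly on the command line, without opening the GUI. A sketch, assuming nsys is on your PATH (recent versions of Nsight Systems write a .nsys-rep file instead of .qdrep; adjust the filename accordingly):

```text
nsys stats report.qdrep   # print summary tables: CUDA API calls, kernel times, memcpy traffic
nsys-ui report.qdrep      # open the same report in the Nsight Systems GUI (Linux)
```

The kernel and memcpy summaries are usually the fastest way to see whether your application is bound by kernel execution or by host-device transfers.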
Nsight Compute
Nsight Compute provides detailed analysis of individual CUDA kernel performance.
- Run Nsight Compute:
ncu ./debapro
- Analyze the Output:
- Nsight Compute prints detailed metrics for each kernel launch, such as achieved occupancy, memory throughput, and compute utilization.
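A few commonly used ncu options are sketched below; this is not exhaustive, and flags can vary between versions, so check `ncu --help` on your installation:

```text
ncu --set full -o profile ./debapro      # collect the full metric set, save to profile.ncu-rep
ncu --kernel-name matrixMul ./debapro    # profile only kernels matching the given name
```

Saving a report with -o and opening it in the Nsight Compute GUI gives you the guided analysis view, which annotates each metric with likely bottlenecks.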
Here is a sample CUDA program for matrix multiplication:
#include <iostream>
#include <cstdlib>   // rand, RAND_MAX
#include <cmath>     // fabs
#include <cuda_runtime.h>

#define N 16 // Size of the matrix (N x N)

__global__ void matrixMul(const float *A, const float *B, float *C, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float value = 0.0f;
        for (int k = 0; k < width; ++k) {
            value += A[row * width + k] * B[k * width + col];
        }
        C[row * width + col] = value;
    }
}

int main() {
    int size = N * N * sizeof(float);
    float h_A[N * N], h_B[N * N], h_C[N * N];

    // Initialize matrices with random values in [0, 1]
    for (int i = 0; i < N * N; ++i) {
        h_A[i] = static_cast<float>(rand()) / RAND_MAX;
        h_B[i] = static_cast<float>(rand()) / RAND_MAX;
    }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    matrixMul<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Check for launch errors before copying results back
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::cerr << "Kernel launch failed: " << cudaGetErrorString(err) << std::endl;
        exit(EXIT_FAILURE);
    }

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Verify the result against a CPU reference
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            float sum = 0;
            for (int k = 0; k < N; ++k) {
                sum += h_A[i * N + k] * h_B[k * N + j];
            }
            if (fabs(sum - h_C[i * N + j]) > 1e-5) {
                std::cerr << "Result verification failed at element ("
                          << i << ", " << j << ")" << std::endl;
                exit(EXIT_FAILURE);
            }
        }
    }
    std::cout << "Test PASSED" << std::endl;

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}