**Lab-1: count_threads.cu and hello.cu**

03-11-25, v1.0, (c) Incubera AI Labs

**This is an introduction to GPU parallelism.**

Here's what you'll learn:

*   How to count the number of threads in a launch
*   We write a function **once** and run many copies of it
*   The same code runs in multiple places at once
*   We can concurrently launch multiple ("n") copies of a function (e.g. n=1000)
*   The GPU executes the copies in **parallel**
*   Each thread gets a **unique ID** (0 to n-1) automatically
*   The copies run **at the same time**, not one after another
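
The points above can be sketched as a small kernel: one function, launched as n copies, where each copy computes its own ID and records it. This is a minimal sketch and not one of the lab files; the names `globalId` and `recordIds`, the use of managed memory, and the choice of n = 1000 in a single block are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not in the lab files): global ID of a thread in a 1D launch.
__host__ __device__ inline int globalId(int block, int blockSize, int thread) {
    return block * blockSize + thread;
}

// Written once; every launched thread runs its own copy of this function.
__global__ void recordIds(int *out, int n) {
    int id = globalId(blockIdx.x, blockDim.x, threadIdx.x);
    if (id < n) {
        out[id] = id;  // each thread writes only its own slot
    }
}

int main() {
    const int n = 1000;  // assumed value, matching the "n=1000" example above
    int *ids = nullptr;
    if (cudaMallocManaged(&ids, n * sizeof(int)) != cudaSuccess) {
        printf("Allocation failed\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) ids[i] = -1;  // mark every slot as "not visited"

    recordIds<<<1, n>>>(ids, n);  // 1 block, n threads
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("Kernel error: %s\n", cudaGetErrorString(err));
        cudaFree(ids);
        return 1;
    }

    // Every slot should now hold its own index if each thread had a unique ID.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (ids[i] != i) ++mismatches;
    }
    printf("%d threads launched, %d mismatches\n", n, mismatches);

    cudaFree(ids);
    return mismatches == 0 ? 0 : 1;
}
```

Writing into an array instead of calling `printf` from the kernel lets the host check the IDs afterwards without depending on the order in which threads print.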

In [41]:
!nvidia-smi


Sat Nov  1 22:19:15 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   36C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [44]:
!nvcc --version


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0


In [7]:
%%writefile count_threads.cu
#include <stdio.h>

//  Cardinality (Number of Threads)
__global__ void countThreads() {
    // Thread's position within its block
    int thread_in_block = threadIdx.x;

    // Which block am I in?
    int my_block = blockIdx.x;

    // My unique global thread ID
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;

    // Total threads in the grid
    int total_threads = gridDim.x * blockDim.x;

    printf("Block %d, Thread %d → Global ID: %d (Total: %d threads)\n",
           my_block, thread_in_block, global_id, total_threads);
}

int main() {
    printf("Launching with <<<2, 5>>> (2 blocks, 5 threads each)\n\n");
    countThreads<<<2, 5>>>();
    cudaDeviceSynchronize();
    return 0;
}

Writing count_threads.cu


In [8]:
!nvcc -arch=sm_75 count_threads.cu -o count_threads


In [9]:
!./count_threads

Launching with <<<2, 5>>> (2 blocks, 5 threads each)

Block 0, Thread 0 → Global ID: 0 (Total: 10 threads)
Block 0, Thread 1 → Global ID: 1 (Total: 10 threads)
Block 0, Thread 2 → Global ID: 2 (Total: 10 threads)
Block 0, Thread 3 → Global ID: 3 (Total: 10 threads)
Block 0, Thread 4 → Global ID: 4 (Total: 10 threads)
Block 1, Thread 0 → Global ID: 5 (Total: 10 threads)
Block 1, Thread 1 → Global ID: 6 (Total: 10 threads)
Block 1, Thread 2 → Global ID: 7 (Total: 10 threads)
Block 1, Thread 3 → Global ID: 8 (Total: 10 threads)
Block 1, Thread 4 → Global ID: 9 (Total: 10 threads)
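
The output follows directly from `blockIdx.x * blockDim.x + threadIdx.x`: with 5 threads per block, thread 2 of block 1 gets 1 * 5 + 2 = 7. The same arithmetic can be turned around to pick a grid size for a given n. The sketch below assumes a hypothetical helper `blocksFor`, a hypothetical kernel `countActive`, and illustrative values n = 12 and 5 threads per block; because 12 is not a multiple of 5, the grid is rounded up and the extra threads are guarded out.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: smallest number of blocks with blocks * blockSize >= n.
__host__ __device__ inline int blocksFor(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// Threads whose global ID falls inside [0, n) each add 1 to a shared counter.
__global__ void countActive(int n, int *active) {
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (global_id < n) {
        atomicAdd(active, 1);
    }
}

int main() {
    const int n = 12;         // assumed work size, deliberately not a multiple of 5
    const int blockSize = 5;  // same block size as the <<<2, 5>>> launch above
    const int blocks = blocksFor(n, blockSize);

    int *active = nullptr;
    if (cudaMallocManaged(&active, sizeof(int)) != cudaSuccess) {
        printf("Allocation failed\n");
        return 1;
    }
    *active = 0;

    countActive<<<blocks, blockSize>>>(n, active);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("Kernel error: %s\n", cudaGetErrorString(err));
        cudaFree(active);
        return 1;
    }

    printf("Launched %d blocks x %d threads = %d threads, %d inside [0, %d)\n",
           blocks, blockSize, blocks * blockSize, *active, n);
    cudaFree(active);
    return 0;
}
```

The `if (global_id < n)` guard is what keeps the rounded-up threads from doing work outside the intended range.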


In [4]:
%%writefile hello.cu
#include <stdio.h>

// Kernel: every launched thread prints its index within the block
__global__ void helloKernel() {
    printf("Hello from thread %d!\n", threadIdx.x);
}

int main() {
    // Check if CUDA device is available
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("Found %d CUDA devices\n", deviceCount);

    if (deviceCount == 0) {
        printf("No CUDA devices found!\n");
        return 1;
    }

    // Set larger printf buffer BEFORE launching kernel
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1024*1024*10);

    printf("Launching kernel from CPU...\n");

    // 1 block, 1000 threads
    helloKernel<<<1, 1000>>>();

    // Check for launch errors
    cudaError_t launchErr = cudaGetLastError();
    if (launchErr != cudaSuccess) {
        printf("Kernel launch error: %s\n", cudaGetErrorString(launchErr));
        return 1;
    }

    // Wait and check for execution errors
    cudaError_t syncErr = cudaDeviceSynchronize();
    if (syncErr != cudaSuccess) {
        printf("Kernel execution error: %s\n", cudaGetErrorString(syncErr));
        return 1;
    }

    // Reset the device before exit (cudaDeviceSynchronize() above already flushed the printf buffer)
    cudaDeviceReset();

    printf("GPU finished!\n");
    return 0;
}

Overwriting hello.cu
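
The launch-error check in hello.cu matters because a block cannot hold an arbitrary number of threads. As a hedged sketch, the program below queries the per-block limit of device 0 and then tries one launch within the limit and one just over it, printing the status that `cudaGetLastError()` reports for each. The helper `fitsInOneBlock` and the empty kernel `noop` are hypothetical names introduced for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: does a 1D block of `threads` fit within `limit`?
__host__ __device__ inline bool fitsInOneBlock(int threads, int limit) {
    return threads > 0 && threads <= limit;
}

// Empty kernel; only the launch configuration is of interest here.
__global__ void noop() {}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("Could not query device 0: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%s: max %d threads per block, warp size %d\n",
           prop.name, prop.maxThreadsPerBlock, prop.warpSize);

    // One request within the limit (as in hello.cu) and one just over it.
    const int requests[] = {1000, prop.maxThreadsPerBlock + 1};
    for (int threads : requests) {
        noop<<<1, threads>>>();
        cudaError_t launchErr = cudaGetLastError();  // also clears the error state
        printf("<<<1, %d>>> fits in one block: %s, launch status: %s\n",
               threads,
               fitsInOneBlock(threads, prop.maxThreadsPerBlock) ? "yes" : "no",
               cudaGetErrorString(launchErr));
    }

    cudaDeviceSynchronize();
    return 0;
}
```

Without the `cudaGetLastError()` call, an oversized launch would fail silently and the kernel would simply never run.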


In [5]:
!nvcc -arch=sm_75 hello.cu -o hello


In [6]:
!./hello

Found 1 CUDA devices
Launching kernel from CPU...
Hello from thread 992!
Hello from thread 993!
Hello from thread 994!
Hello from thread 995!
Hello from thread 996!
Hello from thread 997!
Hello from thread 998!
Hello from thread 999!
Hello from thread 864!
Hello from thread 865!
Hello from thread 866!
Hello from thread 867!
Hello from thread 868!
Hello from thread 869!
Hello from thread 870!
Hello from thread 871!
Hello from thread 872!
Hello from thread 873!
Hello from thread 874!
Hello from thread 875!
Hello from thread 876!
Hello from thread 877!
Hello from thread 878!
Hello from thread 879!
Hello from thread 880!
Hello from thread 881!
Hello from thread 882!
Hello from thread 883!
Hello from thread 884!
Hello from thread 885!
Hello from thread 886!
Hello from thread 887!
Hello from thread 888!
Hello from thread 889!
Hello from thread 890!
Hello from thread 891!
Hello from thread 892!
Hello from thread 893!
Hello from thread 894!
Hello from thread 895!
Hello from thread 800!
Hello f