Problem Statement: Write a CUDA program to add two large vectors.

-- Program Overview --

This program performs vector addition using CUDA (Compute Unified Device Architecture) to run parallel code on an NVIDIA GPU. It:
    Generates two large vectors A and B with random integers.
    Adds the vectors element-wise in parallel on the GPU.
    Stores the result in vector C.
    Writes all three vectors to a text file in tabular format.

1. CUDA Basics
    CUDA is a parallel computing platform by NVIDIA that allows using the GPU (Graphics Processing Unit) for general-purpose computing.
    It uses extensions to C/C++ to run functions (called kernels) in parallel across many GPU threads.

2. Kernels
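    A kernel is a function, marked __global__, that runs on the GPU. The host launches it with the <<<blocks, threads>>> syntax: the first value is the number of blocks in the grid, the second the number of threads per block.
    Each thread computes one part of the problem; in this case, one element of the vector sum.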
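
    To make the thread-to-index mapping concrete, the sketch below emulates a launch on the CPU in plain C++. Two nested loops stand in for the block index and the thread index, and each iteration does the work one GPU thread would do. The name emulateVectorAdd and its parameters are hypothetical and not part of the program; on a real GPU the iterations run in parallel rather than one after another.

```cpp
#include <cassert>
#include <vector>

// CPU-only emulation of a 1D kernel launch (illustrative, not CUDA code).
// `blocks` and `threadsPerBlock` play the roles of gridDim.x and blockDim.x.
std::vector<int> emulateVectorAdd(const std::vector<int>& a,
                                  const std::vector<int>& b,
                                  int blocks, int threadsPerBlock) {
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());
    std::vector<int> c(a.size(), 0);
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threadsPerBlock; ++thread) {
            // Same formula as the kernel: one global index per thread.
            const int idx = block * threadsPerBlock + thread;
            // Threads whose index falls past the data do nothing.
            if (idx < n) {
                c[idx] = a[idx] + b[idx];
            }
        }
    }
    return c;
}
```

    Called with two 5-element vectors, 2 blocks and 4 threads per block, the loops visit indices 0 to 7 and the guard skips 5, 6 and 7. With only 1 block of 4 threads, the last element would never be computed, which is why the block count has to be rounded up.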

3. Memory Management
    Host (CPU) and Device (GPU) have separate memory spaces.
    You must explicitly allocate GPU memory (cudaMalloc) and copy data between host and device (cudaMemcpy).



-- How the Program Works --

Step 1: Initialization
- N = 2^20 = 1,048,576 is the number of elements in each vector (written as 1 << 20 in the code).
- Host vectors h_A, h_B, h_C are created using std::vector.
- Random values (0–100) are assigned to h_A and h_B.

Step 2: GPU Memory Allocation
- cudaMalloc allocates memory for d_A, d_B, d_C on the GPU.
- cudaMemcpy copies vectors h_A and h_B to the GPU.

Step 3: Kernel Launch
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

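- Launches the GPU kernel vectorAdd with threadsPerBlock = 256 and blocksPerGrid = (N + 255) / 256 = 4096.
- Each thread computes C[i] = A[i] + B[i] for its own index i; the if (idx < N) check keeps surplus threads from writing out of bounds.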
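
The rounding-up division used for blocksPerGrid can be checked in isolation. The sketch below is plain host-side C++; blocksFor and idleThreads are hypothetical helper names introduced only for illustration.

```cpp
#include <cassert>

// Number of blocks needed so that blocks * threadsPerBlock >= n
// (integer ceiling division, the same expression as blocksPerGrid).
int blocksFor(int n, int threadsPerBlock) {
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Threads launched beyond the last element. In the kernel, the
// `if (idx < N)` guard makes these threads return without writing.
int idleThreads(int n, int threadsPerBlock) {
    return blocksFor(n, threadsPerBlock) * threadsPerBlock - n;
}
```

For the program's values, blocksFor(1 << 20, 256) is 4096 and no thread is idle, because 2^20 is a multiple of 256. For a size that is not a multiple, such as n = 1000, four blocks are launched and 24 threads are idle.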

Step 4: Error Checking & Synchronization
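- cudaGetLastError() returns any error from the kernel launch itself, such as an invalid launch configuration.
- cudaDeviceSynchronize() blocks the host until the kernel has finished and reports errors that occurred during execution.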

Step 5: Copy Back Result
- cudaMemcpy copies result d_C to host vector h_C.
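
The program does not verify the GPU result. A host-side check could be run after the copy back; the sketch below is one possible form, in plain C++, and countMismatches is a hypothetical helper that is not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the number of indices i where c[i] != a[i] + b[i].
std::size_t countMismatches(const std::vector<int>& a,
                            const std::vector<int>& b,
                            const std::vector<int>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (c[i] != a[i] + b[i]) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

In main, countMismatches(h_A, h_B, h_C) would be expected to return 0 after a successful run; any other value would point to a problem in the kernel or in the copies.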

Step 6: Output to File
- The three vectors are written to vector_sum_output.txt as a fixed-width table, one row per index, using std::ofstream and std::setw.
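
The column layout comes from std::setw, which sets a minimum field width and right-aligns the value by default. The sketch below formats a single row with the same widths the program uses (10, 10 and 15); formatRow is a hypothetical helper written for illustration, and it builds a string instead of writing to a file.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Formats one table row with the program's column widths:
// 10 for A[i], 10 for B[i], 15 for C[i], i.e. 35 characters in total.
std::string formatRow(int a, int b, int c) {
    std::ostringstream row;
    row << std::setw(10) << a
        << std::setw(10) << b
        << std::setw(15) << c;
    std::string text = row.str();
    // setw is a minimum width, so a row is never shorter than 35 characters.
    assert(text.size() >= 35);
    return text;
}
```

Because the inputs are at most 100 and the sums at most 200, every row is exactly 35 characters wide, which matches the 35-dash separator line under the header.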

Step 7: Memory Cleanup
- Frees GPU memory using cudaFree.

-- Important CUDA Functions Used --

1. cudaMalloc: Allocates memory on the GPU
2. cudaMemcpy: Copies data between host (CPU) and device (GPU) memory
3. cudaFree: Frees previously allocated GPU memory
4. cudaGetLastError: Returns the most recent runtime error; used here to catch kernel launch failures
5. cudaDeviceSynchronize: Blocks the host until all previously issued GPU work has completed

The code below implements the vector addition. It is meant to be run on Google Colab with a GPU runtime selected.

In [2]:
code = r'''
#include <iostream>
#include <fstream>
#include <vector>
#include <cstdlib>
#include <ctime>
#include <iomanip>
#include <string>
#include <cuda_runtime.h>

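// Kernel: each thread adds one pair of elements.
__global__ void vectorAdd(const int *A, const int *B, int *C, int N) {
    // Global index of this thread across the whole grid.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // The grid may contain more threads than elements, so guard the access.
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}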

void checkCudaError(cudaError_t err, const char *msg) {
    if (err != cudaSuccess) {
        std::cerr << "CUDA Error: " << msg << ": " << cudaGetErrorString(err) << std::endl;
        exit(EXIT_FAILURE);
    }
}

int main() {
    int N = 1 << 20;
    size_t size = N * sizeof(int);

    std::vector<int> h_A(N), h_B(N), h_C(N);

    srand(static_cast<unsigned>(time(nullptr)));
    for (int i = 0; i < N; ++i) {
        h_A[i] = rand() % 101;  // integers from 0 to 100
        h_B[i] = rand() % 101;
    }

    int *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    checkCudaError(cudaMalloc(&d_A, size), "Allocating d_A");
    checkCudaError(cudaMalloc(&d_B, size), "Allocating d_B");
    checkCudaError(cudaMalloc(&d_C, size), "Allocating d_C");

    checkCudaError(cudaMemcpy(d_A, h_A.data(), size, cudaMemcpyHostToDevice), "Copying h_A");
    checkCudaError(cudaMemcpy(d_B, h_B.data(), size, cudaMemcpyHostToDevice), "Copying h_B");

    // Round the block count up so that every element gets a thread.
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    checkCudaError(cudaGetLastError(), "Kernel launch");
    checkCudaError(cudaDeviceSynchronize(), "Kernel execution");

    checkCudaError(cudaMemcpy(h_C.data(), d_C, size, cudaMemcpyDeviceToHost), "Copying result");

    // Save to file in column format
    std::ofstream outFile("vector_sum_output.txt");
    if (!outFile.is_open()) {
        std::cerr << "Error opening output file!" << std::endl;
        return 1;
    }

    outFile << std::setw(10) << "A[i]"
            << std::setw(10) << "B[i]"
            << std::setw(15) << "C[i] = A + B" << "\n";
    outFile << std::string(35, '-') << "\n";

    for (int i = 0; i < N; ++i) {
        outFile << std::setw(10) << h_A[i]
                << std::setw(10) << h_B[i]
                << std::setw(15) << h_C[i] << "\n";
    }

    outFile.close();

    checkCudaError(cudaFree(d_A), "Freeing d_A");
    checkCudaError(cudaFree(d_B), "Freeing d_B");
    checkCudaError(cudaFree(d_C), "Freeing d_C");
    return 0;
}

'''
with open("vector_add_random.cu", "w") as f:
    f.write(code)

In [3]:
!nvcc -arch=sm_75 vector_add_random.cu -o vector_add_random
!./vector_add_random

'nvcc' is not recognized as an internal or external command,
operable program or batch file.
'.' is not recognized as an internal or external command,
operable program or batch file.
