<a href="https://colab.research.google.com/github/saivinay997/Triton/blob/main/Triton_Day01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Use T4 GPU

Adding two tensors.

In [None]:
import torch
import triton
import triton.language as tl

In [None]:
@triton.jit
def add_kernel(x_ptr, # pointer for the first input vector
               y_ptr, # pointer for the second input vector
               output_ptr, # pointer for the output vector
               n_elements, # size of the vector
               BLOCK_SIZE: tl.constexpr, # Number of elements each program should process.
                # A program is not a single thread: it is closer to a CUDA thread block, and Triton maps its work onto threads.
                # Note: `constexpr` so it can be used as a shape value.
               ):
    # there are multiple `programs` processing different data.
    # We identify which program we are in using program_id (pid)
    pid = tl.program_id(axis=0) # We use a 1D launch grid, so axis is 0.
    # This program will process inputs that are offset from the initial data.
    # For instance, with a vector of length 256 and a BLOCK_SIZE of 64,
    # the programs would access the elements [0:64], [64:128], [128:192], [192:256]:
    # a 1D grid of 4 programs (since 256 / 64 = 4).
    # Note that `offsets` is a block of element indices; adding it to a base pointer gives a block of pointers.
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Example of what is happening above, using the 4 programs described earlier:
    # - 1st program (pid = 0)
    # block_start = 0 * 64 = 0
    # offsets = 0 + tl.arange(0, 64) --> [0, 1, 2, ..., 63]
    # this program processes elements [0:64]
    #
    # - 2nd program (pid = 1)
    # block_start = 1 * 64 = 64
    # offsets = 64 + tl.arange(0, 64) --> [64, 65, ..., 127]
    # this program processes elements [64:128]
    #
    # and so on...
    # The entire data array is divided into chunks, with each program responsible for a range of 64 elements.
    # `offsets` determines exactly which elements each program will process, based on its program ID (pid).

    # Create mask to guard memory operations against out-of-bounds accesses.
    mask = offsets < n_elements
    # Load x and y from DRAM, masking out any extra elements in case the input length is not a multiple of the block size.
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    # output[i] = x[i] + y[i]

    # Write x + y back to DRAM.
    tl.store(output_ptr + offsets, output, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor):
    # first we need to preallocate the tensor
    output = torch.empty_like(x) # this is to store the output
    assert x.is_cuda and y.is_cuda and output.is_cuda, "Tensors are not on cuda (GPU). Set device to `cuda`"
    n_elements = output.numel()  # total number of elements in the tensor (the product of all its dimensions)
    # SPMD (Single Program, Multiple Data): the same program is executed across multiple processing units, but each unit operates on a separate subset of the data.
    # Launch grid: the structure used to organize and coordinate the execution of the parallel program on the GPU.
    # The SPMD launch grid denotes the number of kernel instances that run in parallel.
    # It is analogous to CUDA launch grids. It can be either Tuple[int] or Callable(metaparameters) -> Tuple[int].
    # In this case, we use a 1D grid whose size is the number of blocks.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),) # ceiling division ceil(a/b)
    # Note:
    # - Each torch.Tensor object is implicitly converted into a pointer to its first element.
    # - `triton.jit` is the decorator used to define GPU kernels that are JIT (just-in-time) compiled:
    #   it converts Python functions into optimized GPU code at runtime using Triton's compiler.
    # - `triton.jit`'ed functions can be indexed with a launch grid to obtain a callable GPU kernel.
    # - Don't forget to pass meta-parameters as keyword arguments.
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    # We return a handle to `output` but, since `torch.cuda.synchronize()` hasn't been called,
    # the kernel may still be running asynchronously at this point.
    # What this means:
    # - GPU operations (e.g., kernel launches, memory transfers) are often asynchronous with respect to the host (CPU) code.
    # - When you launch a kernel (e.g., a function marked with @triton.jit), the call returns immediately
    #   while the GPU starts executing the operation in the background.
    # - At the point of returning, the computation that fills `output` may not be finished yet.
    # - `torch.cuda.synchronize()` forces the CPU to wait until all previously launched GPU operations are completed.
    # - Without this synchronization, the CPU may proceed with subsequent operations while the GPU is still running the kernel.
    # - Reading `output` on the CPU (for example, printing it) waits for the kernel to finish,
    #   so explicit synchronization matters most when measuring elapsed time.
    return output
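
The indexing in `add_kernel` can be followed on the CPU. The block below is a minimal sketch, not Triton code: plain Python lists stand in for tensors, a sequential loop stands in for the programs that the GPU runs in parallel, and the names `cdiv` and `add_emulated` are made up for this illustration. It assumes `triton.cdiv` performs ordinary ceiling division, as the comment on the `grid` line states.

```python
def cdiv(a, b):
    """Ceiling division: the smallest integer >= a / b."""
    return (a + b - 1) // b


def add_emulated(x, y, block_size):
    """CPU stand-in for add_kernel: one loop iteration per program."""
    n_elements = len(x)
    output = [None] * n_elements
    num_programs = cdiv(n_elements, block_size)  # size of the 1D launch grid
    for pid in range(num_programs):  # on the GPU these iterations run in parallel
        block_start = pid * block_size
        offsets = [block_start + i for i in range(block_size)]
        mask = [off < n_elements for off in offsets]  # False for lanes past the end
        for off, keep in zip(offsets, mask):
            if keep:  # emulates the masked load and the masked store
                output[off] = x[off] + y[off]
    return output


# 250 elements with a block size of 64: the last program has only 58 valid lanes.
x = [float(i) for i in range(250)]
y = [1.0] * 250
out = add_emulated(x, y, block_size=64)

programs_small = cdiv(250, 64)      # 4
programs_large = cdiv(98432, 1024)  # 97
print(programs_small, programs_large)
```

Without the mask, the last program would touch indices 250 to 255, which lie past the end of the data. That is the out-of-bounds access the kernel's `mask` guards against.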



In [None]:
import time
torch.manual_seed(0)
size = 98432

x = torch.rand(size, device="cuda")
y = torch.rand(size, device="cuda")

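# Using torch
# Note: the clock is read without calling torch.cuda.synchronize(), so these timings mostly
# capture the cost of launching the work, not the time the GPU spends executing it.
start = time.time()
output_torch = x + y
end = time.time()
print(f"Time taken by pytorch cuda: {end-start}")
print(output_torch)

# Using Triton
# Note: the first call to a @triton.jit function may also include JIT compilation time.
start = time.time()
output_triton = add(x, y)
end = time.time()
print(f"Time taken by triton: {end - start}")
print(output_triton)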





Time taken by pytorch cuda: 0.00016951560974121094
tensor([1.3713, 1.3076, 0.4940,  ..., 0.4024, 1.7918, 1.0686], device='cuda:0')
Time taken by triton: 0.0012385845184326172
tensor([1.3713, 1.3076, 0.4940,  ..., 0.4024, 1.7918, 1.0686], device='cuda:0')
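
The timings above come from a single call each, with the clock read before the GPU is known to have finished, so they are not a fair comparison. The block below is a minimal sketch of a more careful measurement. The `benchmark` helper and its `sync`, `warmup`, and `iters` parameters are hypothetical names introduced here, and the demonstration runs a CPU-only function so the block works without a GPU.

```python
import time


def benchmark(fn, sync=lambda: None, warmup=3, iters=20):
    """Return the mean wall-clock seconds per call of fn.

    sync is called before each clock read so that asynchronous work
    has finished; the default does nothing (suitable for CPU code).
    """
    for _ in range(warmup):  # warm-up absorbs one-time costs such as JIT compilation
        fn()
    sync()  # make sure the warm-up work is done before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for all timed work to finish before stopping the clock
    return (time.perf_counter() - start) / iters


# CPU-only demonstration: record each call so the call count can be checked.
calls = []
mean_s = benchmark(lambda: calls.append(sum(range(1000))))
print(f"mean seconds per call: {mean_s:.6f}")
```

In this notebook the helper could be used as `benchmark(lambda: add(x, y), sync=torch.cuda.synchronize)` and `benchmark(lambda: x + y, sync=torch.cuda.synchronize)`. Triton also ships `triton.testing.do_bench` for this purpose.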
