<a href="https://colab.research.google.com/github/vijaygwu/IntroToDeepLearning/blob/main/GPUAccelerationInPyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## GPU acceleration in PyTorch

****

**Summary**

PyTorch connects high-level Python code to GPU hardware, so deep learning models can be trained and run much faster than on a CPU alone. It handles device memory allocation, data transfer, and kernel dispatch, allowing you to focus on the core logic of your models.

**1. The Hardware**

* **CPU:**  The Central Processing Unit is excellent at handling a wide range of general-purpose tasks. It's designed for sequential execution and can handle complex logic and decision-making.
* **GPU:**  The Graphics Processing Unit was originally designed for rendering graphics. However, its massively parallel architecture, with thousands of simpler cores, makes it exceptionally efficient at applying the same mathematical operation to a large amount of data simultaneously. This is ideal for the matrix and tensor calculations that underpin deep learning.
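To see which of these two kinds of hardware PyTorch can use on a given machine, a minimal sketch along the following lines may help. The helper name `describe_devices` is hypothetical and not part of PyTorch; the calls inside it are standard `torch` and `torch.cuda` functions.

```python
import torch

def describe_devices():
    """Summarize the CPU thread count and any CUDA GPUs visible to PyTorch."""
    info = {"cpu_threads": torch.get_num_threads(), "gpus": []}
    if torch.cuda.is_available():
        for index in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(index)
            info["gpus"].append({
                "name": props.name,
                "streaming_multiprocessors": props.multi_processor_count,
                "memory_gib": round(props.total_memory / 1024**3, 1),
            })
    return info

device_info = describe_devices()
print(device_info)
```

On a machine without a CUDA GPU the `gpus` list is simply empty, so the same snippet runs everywhere.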

**2. PyTorch's Bridge: The CUDA Backend**

* **CUDA:** CUDA (Compute Unified Device Architecture) is a parallel computing platform and API developed by NVIDIA for their GPUs.
* **PyTorch Integration:** PyTorch seamlessly connects with CUDA, enabling developers to write Python code that runs on the GPU.
* **Key Libraries:**
    * **cuDNN:** The CUDA Deep Neural Network library provides highly optimized implementations of core deep learning operations like convolutions, pooling, and recurrent neural networks for NVIDIA GPUs.
    * **cuBLAS:** The CUDA Basic Linear Algebra Subprograms library accelerates fundamental linear algebra operations, including matrix multiplication, which are essential for deep learning models.
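A quick way to check which parts of this stack an installed PyTorch build can reach is sketched below. The dictionary name `backend_info` is just an illustration; the queries are existing PyTorch attributes and functions.

```python
import torch

backend_info = {
    "pytorch_version": torch.__version__,
    "built_with_cuda": torch.version.cuda,                   # None on CPU-only builds
    "cuda_available": torch.cuda.is_available(),
    "cudnn_available": torch.backends.cudnn.is_available(),
    "cudnn_version": torch.backends.cudnn.version(),         # None if cuDNN cannot be used
}

for key, value in backend_info.items():
    print(f"{key}: {value}")
```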

**3. The Process in Motion**

1. **Tensor Creation (on CPU):**
   * When you create a tensor using `torch.tensor()` or other PyTorch tensor creation functions without a `device` argument, the tensor is stored in CPU memory by default.

2. **Moving to GPU (Explicit Transfer):**
   * `tensor.to('cuda')`: This call returns a copy of the tensor in GPU memory; the original tensor stays on the CPU, so you need to keep the returned value. The transfer takes time, so it is more efficient to keep tensors on the GPU for as long as possible during training.
   * **Behind the Scenes:** PyTorch handles the low-level details of memory allocation and data transfer between the CPU and GPU.

3. **Computation (on GPU):**

   * **Automatic Dispatch:** When you perform operations on tensors located on the GPU, PyTorch's execution engine automatically dispatches those operations to the GPU for computation.
   * **Parallel Execution:**  The GPU's many cores work in parallel, each handling a slice of the tensor, which is where the speedup on large tensors comes from.
   * **Optimized Libraries:**  Where a suitable routine exists, PyTorch calls into cuDNN and cuBLAS, which ship kernels tuned for specific GPU architectures.

4. **Result Retrieval (Back to CPU, if needed):**

   * `tensor.cpu()`:  If you need to process the results further with CPU-only code (for example NumPy) or to display them, this method returns a copy of the tensor in CPU memory.
   * **Implicit Copies:** PyTorch does not move tensors between devices on its own; an operation that mixes CPU and GPU tensors generally raises a `RuntimeError`. A few calls do copy data to the host as part of their job, such as `tensor.item()`, `tensor.tolist()`, and printing a GPU tensor.
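The four steps above can be sketched in a device-agnostic way as follows. The variable names are illustrative, and the snippet falls back to the CPU when no GPU is present, in which case the "transfer" is a no-op.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

cpu_tensor = torch.arange(6, dtype=torch.float32).reshape(2, 3)  # step 1: created in host memory
moved = cpu_tensor.to(device)               # step 2: returns a tensor on `device`; cpu_tensor is unchanged
created = torch.ones(2, 3, device=device)   # alternative: allocate directly on `device`

total = (moved + created).sum()             # step 3: computed on `device`, result stays there
value = total.item()                        # step 4: copies the scalar back as a Python float

# Mixing devices in one operation is an error rather than an implicit transfer.
mixed_device_error = None
if device.type == "cuda":
    try:
        cpu_tensor + created
    except RuntimeError as err:
        mixed_device_error = str(err)

print(cpu_tensor.device, moved.device, total.device, value)
```

Creating tensors directly on the target device with `device=` avoids the extra host allocation and copy that `torch.ones(...).to(device)` would incur.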

**Additional Considerations:**

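* **Asynchronous Execution:**  CUDA operations are queued asynchronously: a call such as `torch.matmul` on GPU tensors returns to Python as soon as the kernel is enqueued, not when it finishes. This lets the CPU keep scheduling work while the GPU computes, but it also means that timing GPU code requires `torch.cuda.synchronize()` before reading the clock.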
* **Memory Management:**  PyTorch tries to optimize memory usage on both the CPU and GPU, but it's important to be mindful of memory constraints, especially when working with large models or datasets.
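Both considerations can be illustrated with a short sketch. The helper `timed_matmul` is a hypothetical name introduced here; it synchronizes around the timed region so that queued GPU work is actually counted, and the memory queries at the end only run when a GPU is present.

```python
import time
import torch

def timed_matmul(a, b, repeats=10):
    """Average wall-clock seconds per matmul, synchronizing on CUDA so queued work is included."""
    on_gpu = a.is_cuda
    torch.matmul(a, b)                  # warm-up: the first call pays one-time setup costs
    if on_gpu:
        torch.cuda.synchronize()        # make sure the warm-up has finished
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if on_gpu:
        torch.cuda.synchronize()        # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) / repeats

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.randn(512, 512, device=device)
b = torch.randn(512, 512, device=device)
seconds = timed_matmul(a, b)
print(f"{device.type} matmul: {seconds:.6f} s per call")

if device.type == "cuda":
    # Bytes held by live tensors vs. bytes kept by PyTorch's caching allocator.
    print("allocated:", torch.cuda.memory_allocated(), "reserved:", torch.cuda.memory_reserved())
```

Averaging over several repeats after a warm-up call tends to give more stable numbers than timing a single call.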



In [None]:
import torch

# 1. Create Tensors (on CPU by default)
matrix_size = 1000
x = torch.randn(matrix_size, matrix_size)
y = torch.randn(matrix_size, matrix_size)

# 2. CPU Computation
z_cpu = torch.matmul(x, y)  # Matrix multiplication on the CPU

# 3. GPU Check and Transfer
if torch.cuda.is_available():
    device = torch.device("cuda")  # Get the default CUDA device
    x_gpu = x.to(device)  # Move x to GPU memory
    y_gpu = y.to(device)  # Move y to GPU memory

    # 4. GPU Computation
    z_gpu = torch.matmul(x_gpu, y_gpu)  # Matrix multiplication on the GPU

    # 5. Bring Result Back to CPU (if needed)
    z_cpu_from_gpu = z_gpu.cpu()  # Copy result from GPU to CPU

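    # Compare results. The two backends accumulate float32 sums in a different order,
    # so the default allclose tolerances (rtol=1e-5, atol=1e-8) can be too strict here
    # and this may print False; pass looser tolerances (e.g. rtol=1e-3, atol=1e-3) if needed.
    print(torch.allclose(z_cpu, z_cpu_from_gpu))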
else:
    print("GPU not available, computations performed on CPU only.")

False


In [None]:
import torch
import time

# Define and run the computation on-the-fly
matrix_size = 2000  # Increase the size for a more noticeable difference
x = torch.randn(matrix_size, matrix_size)
y = torch.randn(matrix_size, matrix_size)

# 1. CPU Computation (Time it)
start_time = time.time()
z_cpu = torch.matmul(x, y)
cpu_time = time.time() - start_time
print(f"CPU Time: {cpu_time:.4f} seconds")

# 2. GPU Check and Transfer
if torch.cuda.is_available():
    device = torch.device("cuda")
    x_gpu = x.to(device)
    y_gpu = y.to(device)

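    # 3. GPU Computation (Time it)
    # CUDA kernels launch asynchronously, so synchronize before reading the clock;
    # otherwise the timer captures little more than the kernel launch.
    torch.cuda.synchronize()
    start_time = time.time()
    z_gpu = torch.matmul(x_gpu, y_gpu)
    torch.cuda.synchronize()
    gpu_time = time.time() - start_time
    print(f"GPU Time: {gpu_time:.4f} seconds")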

    # 4. Speedup Calculation
    speedup = cpu_time / gpu_time
    print(f"Speedup: GPU is {speedup:.2f} times faster than CPU!")

else:
    print("GPU not available, computations performed on CPU only.")

CPU Time: 0.0408 seconds
GPU Time: 0.0006 seconds
Speedup: GPU is 70.72 times faster than CPU!


## cuDNN and PyTorch

****

**Summary**

PyTorch and cuDNN form a powerful partnership that empowers developers to harness the full computational potential of NVIDIA GPUs. By intelligently integrating cuDNN's optimized implementations, PyTorch abstracts away low-level complexities, making it easier to build and train high-performance deep learning models.

**cuDNN (CUDA Deep Neural Network library)**

* **Optimized Primitives:** cuDNN provides highly tuned implementations for standard deep learning operations:
    * **Convolution:** The core of many computer vision tasks, including image classification, object detection, and segmentation.
    * **Pooling:**  Reduces dimensionality and helps extract key features from convolutional layers.
    * **Normalization:** Ensures stability and faster convergence during training.
    * **Activation Functions:** Introduce non-linearity to neural networks, enabling them to learn complex patterns.
    * **Recurrent Layers:** Handle sequential data in tasks like natural language processing and time series analysis.
    * **Matrix Multiplication:** Underlies fully connected layers and many other computations. In PyTorch, dense matrix multiplications are usually routed to cuBLAS rather than cuDNN.

* **Architecture-Specific Tuning:** These implementations are meticulously crafted to harness the full potential of specific NVIDIA GPU microarchitectures. This includes optimizations for:
    * **Tensor Cores:** Specialized hardware in modern GPUs designed to accelerate matrix multiplication and convolution operations.
    * **Memory Access Patterns:** Carefully designed data access patterns to minimize memory latency and maximize throughput.
    * **Algorithm Optimizations:**  Employing state-of-the-art algorithms and techniques to extract the best performance from the hardware.

* **Flexibility:** cuDNN offers a balance between performance and flexibility:
    * **Graph API:** This allows defining computations as a graph of operations, enabling the runtime to schedule and optimize the execution plan for a specific neural network model.
    * **Operator Fusion:**  cuDNN can intelligently combine multiple operations (like convolution followed by activation) into a single kernel, reducing overhead and improving performance.
    * **Workspace:**  Provides workspace memory for intermediate calculations, allowing for further optimizations.

**PyTorch's Integration**

* **ATen Backend:** PyTorch's core is built upon ATen (A Tensor Library), which provides the fundamental tensor operations and handles interactions with hardware backends like CUDA.
* **Transparent Integration:** When your PyTorch code executes on a GPU with cuDNN installed, ATen intelligently dispatches supported operations (convolutions, etc.) to cuDNN's highly optimized implementations.
* **Automatic Selection:** cuDNN offers several algorithms for operations such as convolution. By default a heuristic picks one; setting `torch.backends.cudnn.benchmark = True` makes PyTorch time the candidates the first time it sees each input shape and reuse the fastest, which helps most when input sizes stay fixed.
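As a minimal sketch of the `benchmark` flag in use, assuming a small convolution with a fixed input shape (the layer sizes here are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Ask cuDNN to time candidate convolution algorithms the first time each input shape is seen.
# The flag has no effect when the model runs on the CPU.
torch.backends.cudnn.benchmark = True

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1).to(device)
batch = torch.randn(8, 3, 64, 64, device=device)   # fixed shape, so the tuned choice can be reused

with torch.no_grad():
    for _ in range(3):
        out = conv(batch)

print(out.shape)  # torch.Size([8, 16, 64, 64])
```

If input shapes change from batch to batch, the repeated tuning can cost more than it saves, so leaving the flag off may be the better choice in that case.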

**Technical Deep Dive**

* **Kernel Selection:**  cuDNN maintains a collection of kernels (optimized code implementations) for each supported operation. Based on factors like input tensor sizes, data types, and GPU architecture, it dynamically chooses the most efficient kernel.
* **Memory Optimization:** cuDNN employs techniques like workspace management and memory reuse to reduce the amount of data transferred between the GPU's global memory and its faster shared memory or registers.
* **Autotuning (Optional):** PyTorch can optionally use cuDNN's autotuning feature to find the best kernel configuration for your specific model and hardware. This can involve running multiple configurations and measuring their performance.

**Benefits:**

* **Substantial Speedup:** By leveraging cuDNN's optimized implementations, PyTorch can achieve significant performance gains on NVIDIA GPUs for deep learning tasks.
* **Ease of Use:** Developers can focus on writing high-level PyTorch code without worrying about the intricate low-level CUDA programming details.
* **Framework Agnostic:** cuDNN is not tied to a specific framework. It can be used with other deep learning frameworks besides PyTorch, contributing to its widespread adoption.



In [None]:
import torch
import torch.nn as nn

# Create a Convolutional Layer
conv_layer = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

# Pick the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
conv_layer = conv_layer.to(device)

# Input Data (created directly on the same device as the layer)
input_data = torch.randn(1, 3, 224, 224, device=device)

# Perform Convolution
output = conv_layer(input_data)

# When the layer and input are on the GPU, PyTorch:
# 1. Dispatches the call to the CUDA backend based on the tensors' device.
# 2. Typically uses a cuDNN convolution implementation behind the scenes.
# 3. Runs the computation on the GPU, which is usually much faster than the CPU for large inputs.

## cuBLAS: The Backbone of Linear Algebra on GPUs

**cuBLAS and PyTorch**

cuBLAS forms the backbone of efficient linear algebra computations on NVIDIA GPUs. By transparently integrating cuBLAS into its backend, PyTorch allows developers to effortlessly harness the power of GPUs for accelerating deep learning models without the need for explicit low-level CUDA programming.

****

cuBLAS (CUDA Basic Linear Algebra Subprograms) is a fundamental library developed by NVIDIA that accelerates basic linear algebra operations on NVIDIA GPUs. It provides optimized implementations of core routines for matrix and vector computations, such as:

* **Matrix Multiplication (GEMM):**  The cornerstone of many deep learning models, especially for fully connected layers and transformers.
* **Vector-Vector Operations:**  Dot products, addition, scaling, etc.
* **Matrix-Vector Operations:**  Multiplication of a matrix and a vector.
* **Triangular Solves (TRSV/TRSM):**  Solving linear systems whose coefficient matrix is triangular.
* **Beyond BLAS:**  General linear solvers and eigenvalue decompositions (used in tasks like principal component analysis) are not part of cuBLAS; NVIDIA provides them in the separate cuSOLVER library.
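The vector-vector, matrix-vector, and matrix-matrix routines listed above have direct counterparts in PyTorch's API, sketched below on tiny tensors. Whether a given call reaches cuBLAS on the GPU is an assumption about PyTorch internals that can vary by version and tensor size; the arithmetic itself is the same on any device.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

u = torch.tensor([1.0, 2.0, 3.0], device=device)
v = torch.tensor([4.0, 5.0, 6.0], device=device)
A = torch.tensor([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0]], device=device)    # shape (2, 3)
B = torch.ones(3, 2, device=device)
C = torch.eye(2, device=device)

dot = torch.dot(u, v)                               # vector-vector: 1*4 + 2*5 + 3*6 = 32
mv = torch.mv(A, u)                                 # matrix-vector: [1 + 6, 2] = [7, 2]
gemm = torch.addmm(C, A, B, beta=0.5, alpha=2.0)    # GEMM form: beta * C + alpha * (A @ B)

print(dot.item())     # 32.0
print(mv.tolist())    # [7.0, 2.0]
print(gemm.tolist())  # [[6.5, 6.0], [2.0, 2.5]]
```

`torch.addmm` mirrors the classic GEMM signature, where `alpha` scales the product and `beta` scales the matrix being added.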

**Key Advantages of cuBLAS**

* **Optimized Performance:** cuBLAS implementations are highly tuned for various NVIDIA GPU architectures, leveraging:
    * **Tensor Cores:**  Specialized hardware for accelerating matrix operations.
    * **Parallelism:**  Exploits the inherent parallelism of GPUs to perform many calculations simultaneously.
    * **Memory Optimizations:**  Employs efficient data access patterns to minimize memory latency and maximize throughput.
    * **Algorithm-Specific Tuning:**  Uses tailored algorithms for different matrix sizes and data types to achieve the best possible performance.

* **Wide Range of Functionality:** cuBLAS covers an extensive set of BLAS (Basic Linear Algebra Subprograms) routines, making it a versatile tool for a variety of numerical computations.
* **Compatibility:**  cuBLAS is designed to work seamlessly with both C and Fortran code, making it accessible to a broad range of developers.
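One place where the Tensor Core point surfaces in PyTorch is the float32 matmul precision setting, sketched below. This is an illustration under the assumption of PyTorch 1.12 or newer; the setting only changes behavior on GPUs that support TF32 (Ampere or newer), and on other hardware it is accepted but has no effect.

```python
import torch

previous = torch.get_float32_matmul_precision()   # "highest" unless changed earlier

# "high" allows float32 matmuls to use reduced-precision Tensor Core math (such as TF32),
# trading some accuracy for throughput on supporting GPUs.
torch.set_float32_matmul_precision("high")
current = torch.get_float32_matmul_precision()
print("matmul precision:", current)

# Restore the earlier setting so later code is unaffected.
torch.set_float32_matmul_precision(previous)
```

Reduced-precision matmuls are one reason GPU and CPU results can differ by more than the default `torch.allclose` tolerances.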

**How PyTorch Uses cuBLAS**

1. **Transparent Integration:**
   * PyTorch, through its underlying ATen library, automatically leverages cuBLAS when performing linear algebra operations on tensors located on the GPU.
   * You typically don't need to explicitly call cuBLAS functions in your PyTorch code.

2. **Efficient Dispatching:**  
   * When PyTorch detects that an operation (like matrix multiplication) involves tensors on the GPU, it intelligently dispatches the computation to the appropriate cuBLAS routine.
   * This ensures that the operation is executed using cuBLAS's optimized implementations, leading to significant speedups compared to CPU-based computation.

3. **Automatic Selection:**  
   * cuBLAS provides different kernels for various matrix sizes and data types.  
   * PyTorch picks the routine (for example, a single or batched GEMM) from the tensors' shapes and dtypes, and cuBLAS's internal heuristics then choose a kernel suited to the specific GPU.
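To make the dispatch idea concrete, the sketch below shows that a fully connected layer computes the same thing as an explicit GEMM-style call, without any cuBLAS function appearing in user code. The layer sizes are arbitrary, and the claim that both paths end in a cuBLAS routine on the GPU is an assumption about PyTorch internals rather than something the snippet verifies.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

layer = nn.Linear(in_features=500, out_features=200).to(device)
x = torch.randn(64, 500, device=device)

with torch.no_grad():
    via_layer = layer(x)                                          # high-level call
    via_matmul = torch.addmm(layer.bias, x, layer.weight.t())     # bias + x @ W^T, written out

matches = torch.allclose(via_layer, via_matmul, rtol=1e-4, atol=1e-4)
print(via_layer.shape, matches)
```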


**Behind the Scenes:**

In the matrix multiplication example at the end of this section:

* PyTorch recognizes that `x` and `y` are on the GPU.
* It utilizes cuBLAS's GEMM routine (optimized for matrix multiplication) to efficiently perform the computation on the GPU.
* The output tensor `result` remains on the GPU, ready for further operations or transfer back to the CPU if needed.

**Significance in Deep Learning**

* **Fully Connected Layers:** These layers rely heavily on matrix multiplication, making cuBLAS crucial for their efficient implementation on GPUs.
* **Transformers:**  Transformers use self-attention mechanisms that involve extensive matrix operations, greatly benefiting from cuBLAS optimizations.
* **Other Applications:** Any deep learning architecture utilizing linear algebra benefits from the performance gains provided by cuBLAS.
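As an illustration of why transformers lean so heavily on matrix multiplication, here is a minimal single-head scaled dot-product attention written with plain tensor operations. The sizes and variable names are arbitrary, and real implementations add batching, masking, and learned projections.

```python
import math
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

seq_len, d_model = 4, 8
q = torch.randn(seq_len, d_model, device=device)   # queries
k = torch.randn(seq_len, d_model, device=device)   # keys
v = torch.randn(seq_len, d_model, device=device)   # values

scores = q @ k.t() / math.sqrt(d_model)    # first matmul: (seq_len, seq_len) similarity matrix
weights = torch.softmax(scores, dim=-1)    # each row becomes a probability distribution
attended = weights @ v                     # second matmul: (seq_len, d_model) weighted values

print(attended.shape)  # torch.Size([4, 8])
```

Two of the three steps are matrix multiplications, which is why faster GEMM kernels translate fairly directly into faster attention.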






In [8]:

import torch

# Create Tensors (on GPU)
device = torch.device("cuda")
x = torch.randn(1000, 500).to(device)
y = torch.randn(500, 200).to(device)

# Perform Matrix Multiplication (cuBLAS used behind the scenes)
result = torch.matmul(x, y)
print("Shape of the Tensor: ", result.shape)  # Output: torch.Size([1000, 200])
print("Data Type: ", result.dtype)  # Output: torch.float32
print("Type: ", result.dtype)  # Output: torch.float32
print("requires_grad: ", result.requires_grad)  # Output: False
print("Is Leaf: ", result.is_leaf)  # Output: True (tensors that do not require grad are leaves by convention)
print("Device: ", result.device)  # Output: cuda:0
print("Size: ", result.size())  # Output: torch.Size([1000, 200])
print("Result: ", result.data)  # Output: the tensor's values

Shape of the Tensor:  torch.Size([1000, 200])
Data Type:  torch.float32
Type:  torch.float32
requires_grad:  False
Is Leaf:  True
Device:  cuda:0
Size:  torch.Size([1000, 200])
Result:  tensor([[ -5.3409, -25.0921,  31.6651,  ...,   9.7887,  19.8120,   7.1660],
        [-19.0333,   1.3516, -42.9619,  ..., -23.1982,  -3.7588,   7.7247],
        [-38.1926, -18.8267,  24.6628,  ...,  10.1063, -19.2557,  -2.5019],
        ...,
        [ 22.7634, -10.8261, -18.1449,  ...,   0.1868, -28.2732,  15.8873],
        [  2.0095, -10.5785, -33.7464,  ...,   0.2439, -23.6063,  42.4330],
        [-23.9185, -48.3162,   9.8585,  ..., -28.3540, -16.4420,  11.5154]],
       device='cuda:0')
