# 🚀 Day 2: Element-wise Vector Operations

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sdodlapati3/cuda-lab/blob/main/learning-path/week-03/day-2-elementwise-ops.ipynb)

---

## 🎯 The Challenge

*You have 10 million data points and need to apply the same operation to each one. A sequential CPU loop handles them one at a time, while a GPU can work on thousands of them at once. What makes this problem so well suited to GPUs?*

Element-wise operations are the **embarrassingly parallel** workhorses of GPU computing. When every element can be processed independently, we unlock the GPU's full potential!

---

## 📚 Learning Objectives

By the end of this session, you will be able to:

| Objective | Skill Level |
|-----------|-------------|
| Implement basic arithmetic operations on vectors | Apply |
| Apply transcendental math functions (sqrt, exp, log, trig) | Apply |
| Build neural network activation functions (ReLU, sigmoid, tanh) | Create |
| Combine operations efficiently with grid-stride loops | Apply |

---

## 🗺️ Session Roadmap

| Part | Topic | Duration |
|------|-------|----------|
| 1 | Basic Arithmetic Operations | 10 min |
| 2 | Math Functions | 15 min |
| 3 | Activation Functions | 15 min |
| 4 | Combined Operations | 10 min |
| 5 | Exercises | 10 min |

> **Primary Focus:** CUDA C++ code examples come first; the Python/Numba versions are a backup for interactive testing.

---

In [None]:
# ⚙️ Colab/Local Setup - Run this first!
import subprocess, sys
try:
    import google.colab
    print("🔧 Running on Google Colab - Installing dependencies...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "numba"])
    print("✅ Setup complete!")
except ImportError:
    print("💻 Running locally - make sure you have: pip install numba numpy")

import numpy as np
from numba import cuda
import math
import time

print(f"\nCUDA available: {cuda.is_available()}")
if cuda.is_available():
    device = cuda.get_current_device()
    print(f"Device: {device.name}")

---

## Part 1: Basic Arithmetic Operations

> 💡 **Concept Card: Embarrassingly Parallel Operations**
> 
> ```
> 🎯 EMBARRASSINGLY PARALLEL = PERFECTLY GPU-FRIENDLY
> ═══════════════════════════════════════════════════════
> 
>   What makes an operation "embarrassingly parallel"?
>   
>   ✅ NO dependencies between elements
>   ✅ NO communication between threads needed
>   ✅ NO shared data modified by multiple threads
>   
>   ELEMENT-WISE OPERATIONS:
>   ─────────────────────────────────────────────────
>   Input A:  [a₀] [a₁] [a₂] [a₃] ... [aₙ]
>   Input B:  [b₀] [b₁] [b₂] [b₃] ... [bₙ]
>              ↓    ↓    ↓    ↓   ...  ↓
>   Output:   [c₀] [c₁] [c₂] [c₃] ... [cₙ]
>   
>   Thread 0 computes c₀ = f(a₀, b₀)  ← Independent!
>   Thread 1 computes c₁ = f(a₁, b₁)  ← Independent!
>   Thread 2 computes c₂ = f(a₂, b₂)  ← Independent!
>   ...
>   
>   EXAMPLES:
>   • Vector add:      c[i] = a[i] + b[i]
>   • Scalar multiply: c[i] = α × a[i]
>   • Element-wise:    c[i] = a[i] × b[i]  (Hadamard)
>   • Math functions:  c[i] = sin(a[i])
>   
> ═══════════════════════════════════════════════════════
> ```
> 
> **Why GPUs Excel:** When threads don't need to talk to each other, we can unleash all of them simultaneously!

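The kernels in this part all use a grid-stride loop: each thread starts at its global index and then jumps ahead by the total number of threads in the grid. As a rough CPU-only sketch in plain Python (the helper names `grid_stride_indices` and `simulate_vector_add` are made up for illustration and are not part of CUDA or Numba), the indexing can be emulated like this:

```python
def grid_stride_indices(thread_id, total_threads, n):
    """Indices a single thread visits in a grid-stride loop over n elements."""
    return list(range(thread_id, n, total_threads))


def simulate_vector_add(a, b, blocks, threads_per_block):
    """Emulate a grid-stride vector add on the CPU, one simulated thread at a time."""
    n = len(a)
    out = [0.0] * n
    total_threads = blocks * threads_per_block  # role of blockDim.x * gridDim.x
    for block in range(blocks):
        for thread in range(threads_per_block):
            tid = block * threads_per_block + thread  # blockIdx.x * blockDim.x + threadIdx.x
            for i in grid_stride_indices(tid, total_threads, n):
                out[i] = a[i] + b[i]  # no thread ever touches another thread's element
    return out


a_demo = [float(i) for i in range(10)]
b_demo = [1.0] * 10

# Only 4 simulated threads cover all 10 elements.
result = simulate_vector_add(a_demo, b_demo, blocks=2, threads_per_block=2)
print(result)
print(grid_stride_indices(1, 4, 10))  # → [1, 5, 9]
```

Because every index is visited by exactly one simulated thread, the order in which threads run does not matter. That property is what makes the operation embarrassingly parallel.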
### 🔷 CUDA C++ Implementation (Primary)

### 🔶 Python/Numba (Optional - Quick Testing)

In [None]:
%%writefile elementwise_ops.cu
// elementwise_ops.cu - Basic vector operations
#include <stdio.h>
#include <stdlib.h>  // malloc, free
#include <cuda_runtime.h>

// Vector Addition
__global__ void vectorAdd(const float* a, const float* b, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        out[i] = a[i] + b[i];
    }
}

// Vector Subtraction
__global__ void vectorSub(const float* a, const float* b, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        out[i] = a[i] - b[i];
    }
}

// Element-wise Multiplication (Hadamard product)
__global__ void vectorMul(const float* a, const float* b, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        out[i] = a[i] * b[i];
    }
}

// Element-wise Division
__global__ void vectorDiv(const float* a, const float* b, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        out[i] = a[i] / b[i];
    }
}

// Scalar operations
__global__ void scalarMul(const float* a, float scalar, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        out[i] = a[i] * scalar;
    }
}

int main() {
    int n = 1000000;
    size_t size = n * sizeof(float);
    
    // Allocate and initialize host arrays
    float *h_a = (float*)malloc(size);
    float *h_b = (float*)malloc(size);
    float *h_out = (float*)malloc(size);
    
    for (int i = 0; i < n; i++) {
        h_a[i] = 1.0f;
        h_b[i] = 2.0f;
    }
    
    // Allocate device arrays
    float *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_out, size);
    
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
    
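    // Launch 256 blocks x 256 threads (65,536 threads); the grid-stride loop
    // lets each thread process multiple elements, so any n is covered.
    int threads = 256;
    int blocks = 256;
    
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_out, n);
    if (cudaGetLastError() != cudaSuccess) {  // launch-configuration errors surface here
        printf("Kernel launch failed\n");
        return 1;
    }
    cudaDeviceSynchronize();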
    
    cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
    printf("Add: %f + %f = %f\n", h_a[0], h_b[0], h_out[0]);
    
    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    free(h_a); free(h_b); free(h_out);
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o elementwise_ops elementwise_ops.cu
!./elementwise_ops

In [None]:
# Python equivalents for interactive testing
@cuda.jit
def vector_add(a, b, out, n):
    """out[i] = a[i] + b[i]"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        out[i] = a[i] + b[i]

@cuda.jit
def vector_sub(a, b, out, n):
    """out[i] = a[i] - b[i]"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        out[i] = a[i] - b[i]

@cuda.jit
def vector_mul(a, b, out, n):
    """out[i] = a[i] * b[i]"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        out[i] = a[i] * b[i]

@cuda.jit
def vector_div(a, b, out, n):
    """out[i] = a[i] / b[i]"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        out[i] = a[i] / b[i]

In [None]:
# Test basic operations
n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32) + 0.1  # Avoid div by zero
out = np.zeros(n, dtype=np.float32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_out = cuda.to_device(out)

blocks, threads = 256, 256

# Test each operation
ops = [
    ('Add', vector_add, lambda a, b: a + b),
    ('Sub', vector_sub, lambda a, b: a - b),
    ('Mul', vector_mul, lambda a, b: a * b),
    ('Div', vector_div, lambda a, b: a / b),
]

print(f"Testing with {n:,} elements\n")
for name, kernel, np_op in ops:
    kernel[blocks, threads](d_a, d_b, d_out, n)
    result = d_out.copy_to_host()
    expected = np_op(a, b)
    match = np.allclose(result, expected)
    print(f"{name}: {'✓' if match else '✗'}")

---

## Part 2: Math Functions

Excellent! You've mastered basic arithmetic. Now let's tap into the GPU's specialized math hardware.

> 💡 **Concept Card: GPU Math Unit (SFU)**
> 
> ```
> 🧮 SPECIAL FUNCTION UNIT - Dedicated Math Hardware
> ═══════════════════════════════════════════════════════
> 
>   GPUs have dedicated hardware for fast approximations:
>   
>   ┌────────────────────────────────────────────────┐
>   │ CUDA Math Functions (GPU Accelerated)          │
>   ├────────────────────────────────────────────────┤
>   │ BASIC:    sqrtf, rsqrtf (1/√x), powf           │
>   │ EXP/LOG:  expf, exp2f, logf, log2f, log10f     │
>   │ TRIG:     sinf, cosf, tanf                     │
>   │ INVERSE:  asinf, acosf, atanf, atan2f          │
>   │ HYPER:    sinhf, coshf, tanhf                  │
>   │ UTILITY:  fabsf, fmodf, floorf, ceilf          │
>   └────────────────────────────────────────────────┘
>   
>   PRECISION OPTIONS:
>   • sinf()   - Full precision (slower)
>   • __sinf() - Fast approximation (faster, less accurate)
>   
>   FAST INTRINSICS (same trade-off):
>   • __expf(), __logf(), __sinf(), __cosf()
>   
> ═══════════════════════════════════════════════════════
> ```
> 
> **Pro Tip:** Fast math is usually acceptable for graphics and much ML work; prefer full precision where accuracy matters, as in scientific computing.

### 🔷 CUDA C++ Implementation (Primary)

### 🔶 Python/Numba (Optional - Quick Testing)

In [None]:
%%writefile scalar_ops.cu
// scalar_ops.cu - Scalar operations on vectors
#include <stdio.h>
#include <stdlib.h>  // malloc, free
#include <math.h>
#include <cuda_runtime.h>

// Scalar Add: out[i] = a[i] + scalar
__global__ void scalarAdd(const float* a, float scalar, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        out[i] = a[i] + scalar;
    }
}

// Scalar Multiply: out[i] = a[i] * scalar
__global__ void scalarMul(const float* a, float scalar, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        out[i] = a[i] * scalar;
    }
}

// Scalar Power: out[i] = a[i] ^ power
__global__ void scalarPow(const float* a, float power, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        out[i] = powf(a[i], power);
    }
}

int main() {
    int n = 1000000;
    size_t size = n * sizeof(float);
    
    float *h_a = (float*)malloc(size);
    float *h_out = (float*)malloc(size);
    
    for (int i = 0; i < n; i++) {
        h_a[i] = 2.0f;
    }
    
    float *d_a, *d_out;
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_out, size);
    
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    
    int threads = 256, blocks = 256;
    
    // Test scalar add
    scalarAdd<<<blocks, threads>>>(d_a, 3.0f, d_out, n);
    cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
    printf("ScalarAdd: %f + 3.0 = %f\n", h_a[0], h_out[0]);
    
    // Test scalar multiply
    scalarMul<<<blocks, threads>>>(d_a, 2.5f, d_out, n);
    cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
    printf("ScalarMul: %f * 2.5 = %f\n", h_a[0], h_out[0]);
    
    // Test scalar power
    scalarPow<<<blocks, threads>>>(d_a, 3.0f, d_out, n);
    cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
    printf("ScalarPow: %f ^ 3.0 = %f\n", h_a[0], h_out[0]);
    
    cudaFree(d_a); cudaFree(d_out);
    free(h_a); free(h_out);
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o scalar_ops scalar_ops.cu
!./scalar_ops

In [None]:
@cuda.jit
def scalar_add(a, scalar, out, n):
    """out[i] = a[i] + scalar"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        out[i] = a[i] + scalar

@cuda.jit
def scalar_mul(a, scalar, out, n):
    """out[i] = a[i] * scalar"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        out[i] = a[i] * scalar

@cuda.jit
def scalar_pow(a, power, out, n):
    """out[i] = a[i] ** power"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        out[i] = a[i] ** power

In [None]:
# Test scalar operations
scalar_mul[blocks, threads](d_a, 2.5, d_out, n)
result = d_out.copy_to_host()
expected = a * 2.5
print(f"Scalar multiply by 2.5: {'✓' if np.allclose(result, expected) else '✗'}")

scalar_pow[blocks, threads](d_a, 2.0, d_out, n)
result = d_out.copy_to_host()
expected = a ** 2.0
print(f"Scalar power of 2: {'✓' if np.allclose(result, expected) else '✗'}")

---

## Part 3: Activation Functions

Now let's build something practical: the activation functions powering neural networks!

> 💡 **Concept Card: Neural Network Activations**
> 
> ```
> 🧠 ACTIVATION FUNCTIONS - The Heart of Deep Learning
> ═══════════════════════════════════════════════════════
> 
>   Every neural network uses element-wise activations:
>   
>   ReLU (Rectified Linear Unit):
>   ─────────────────────────────────────────────────
>   f(x) = max(0, x)
>   
>        ╱
>       ╱
>      ╱
>   ──•────────  ← "Dead" for x < 0
>   
>   Sigmoid:
>   ─────────────────────────────────────────────────
>   f(x) = 1 / (1 + exp(-x))
>   
>      ┌───── 1.0
>     ╱
>   ─•─────────  ← Outputs (0, 1)
>     
>   Tanh:
>   ─────────────────────────────────────────────────
>   f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
>   
>      ┌───── +1
>     ╱
>   ─•─
>     ╲
>      └───── -1  ← Outputs (-1, +1)
>   
>   ALL ARE ELEMENT-WISE → PERFECT FOR GPU!
>   
> ═══════════════════════════════════════════════════════
> ```
> 
> **ML Context:** These run on every element of every layer for every sample in every batch!

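The three formulas in the card can be checked on the CPU before any kernel is written. The following is a minimal plain-Python sketch, not the notebook's GPU code; the function names are illustrative, and the branch in `sigmoid` is an added precaution so that `math.exp` is never called on a large positive argument (which would raise `OverflowError` in Python).

```python
import math


def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid, 1 / (1 + exp(-x)), arranged to avoid overflow for very negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0, so exp(x) is at most 1
    return e / (1.0 + e)


def tanh_from_exp(x):
    """Tanh written exactly as in the card's formula (fine for moderate |x|)."""
    ep, en = math.exp(x), math.exp(-x)
    return (ep - en) / (ep + en)


for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.3f}  sigmoid={sigmoid(x):.3f}  tanh={tanh_from_exp(x):.3f}")
```

A handy cross-check is the identity tanh(x) = 2·sigmoid(2x) − 1, which ties the two S-shaped curves together.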
### 🔷 CUDA C++ Implementation (Primary)

### 🔶 Python/Numba (Optional - Quick Testing)

In [None]:
%%writefile math_functions.cu
// math_functions.cu - Math operations on vectors
#include <stdio.h>
#include <stdlib.h>  // malloc, free
#include <math.h>
#include <cuda_runtime.h>

__global__ void vectorSqrt(const float* a, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        out[i] = sqrtf(a[i]);
    }
}

__global__ void vectorExp(const float* a, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        out[i] = expf(a[i]);
    }
}

__global__ void vectorLog(const float* a, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        out[i] = logf(a[i]);
    }
}

__global__ void vectorSin(const float* a, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        out[i] = sinf(a[i]);
    }
}

__global__ void vectorCos(const float* a, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for (int i = tid; i < n; i += stride) {
        out[i] = cosf(a[i]);
    }
}

int main() {
    int n = 1000000;
    size_t size = n * sizeof(float);
    
    float *h_a = (float*)malloc(size);
    float *h_out = (float*)malloc(size);
    
    for (int i = 0; i < n; i++) {
        h_a[i] = 4.0f;  // Use 4.0 for sqrt demo
    }
    
    float *d_a, *d_out;
    cudaMalloc(&d_a, size);
    cudaMalloc(&d_out, size);
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    
    int threads = 256, blocks = 256;
    
    vectorSqrt<<<blocks, threads>>>(d_a, d_out, n);
    cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
    printf("sqrt(%f) = %f\n", h_a[0], h_out[0]);
    
    // Reset input for exp
    for (int i = 0; i < n; i++) h_a[i] = 1.0f;
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    
    vectorExp<<<blocks, threads>>>(d_a, d_out, n);
    cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
    printf("exp(%f) = %f\n", h_a[0], h_out[0]);
    
    vectorSin<<<blocks, threads>>>(d_a, d_out, n);
    cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
    printf("sin(%f) = %f\n", h_a[0], h_out[0]);
    
    cudaFree(d_a); cudaFree(d_out);
    free(h_a); free(h_out);
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o math_functions math_functions.cu
!./math_functions

In [None]:
@cuda.jit
def vector_sqrt(a, out, n):
    """out[i] = sqrt(a[i])"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        out[i] = math.sqrt(a[i])

@cuda.jit
def vector_exp(a, out, n):
    """out[i] = exp(a[i])"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        out[i] = math.exp(a[i])

@cuda.jit
def vector_log(a, out, n):
    """out[i] = log(a[i])"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        out[i] = math.log(a[i])

@cuda.jit
def vector_sin(a, out, n):
    """out[i] = sin(a[i])"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        out[i] = math.sin(a[i])

@cuda.jit
def vector_cos(a, out, n):
    """out[i] = cos(a[i])"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        out[i] = math.cos(a[i])

In [None]:
# Test math functions
a_pos = np.abs(a) + 0.01  # Positive values for sqrt/log
d_a_pos = cuda.to_device(a_pos)

math_ops = [
    ('sqrt', vector_sqrt, np.sqrt, d_a_pos, a_pos),
    ('exp', vector_exp, np.exp, d_a, a),  # device and host inputs must match; a is in [0, 1), so no overflow
    ('log', vector_log, np.log, d_a_pos, a_pos),
    ('sin', vector_sin, np.sin, d_a, a),
    ('cos', vector_cos, np.cos, d_a, a),
]

print("Math function tests:")
for name, kernel, np_fn, d_input, h_input in math_ops:
    kernel[blocks, threads](d_input, d_out, n)
    result = d_out.copy_to_host()
    expected = np_fn(h_input)
    match = np.allclose(result, expected, rtol=1e-5)
    print(f"  {name}: {'✓' if match else '✗'}")

---

### Activation Kernels in Python/Numba

In [None]:
@cuda.jit
def vector_normalize(a, out, n):
    """Sigmoid activation: 1 / (1 + exp(-x)), output in (0, 1)."""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        # Logistic sigmoid (despite the function's name, this is not a min-max normalization)
        out[i] = 1.0 / (1.0 + math.exp(-a[i]))

@cuda.jit
def vector_relu(a, out, n):
    """ReLU activation: max(0, x)"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        out[i] = max(0.0, a[i])

@cuda.jit
def vector_leaky_relu(a, out, alpha, n):
    """Leaky ReLU: x if x > 0 else alpha * x (equals max(alpha*x, x) for 0 <= alpha <= 1)"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        x = a[i]
        out[i] = x if x > 0 else alpha * x

@cuda.jit
def vector_tanh(a, out, n):
    """Hyperbolic tangent activation"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        out[i] = math.tanh(a[i])

In [None]:
# Test activation functions
a_centered = (a - 0.5) * 4  # Values around 0
d_a_centered = cuda.to_device(a_centered)

# Sigmoid
vector_normalize[blocks, threads](d_a_centered, d_out, n)
result = d_out.copy_to_host()
expected = 1 / (1 + np.exp(-a_centered))
print(f"Sigmoid: {'✓' if np.allclose(result, expected, rtol=1e-5) else '✗'}")

# ReLU
vector_relu[blocks, threads](d_a_centered, d_out, n)
result = d_out.copy_to_host()
expected = np.maximum(0, a_centered)
print(f"ReLU: {'✓' if np.allclose(result, expected) else '✗'}")

# Tanh
vector_tanh[blocks, threads](d_a_centered, d_out, n)
result = d_out.copy_to_host()
expected = np.tanh(a_centered)
print(f"Tanh: {'✓' if np.allclose(result, expected, rtol=1e-5) else '✗'}")

---

## Part 4: Combined Operations & Performance

Now that you have a toolkit of element-wise operations, let's explore combining them efficiently.

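One common way to combine element-wise operations efficiently is to fuse them into a single kernel, so each element is read once and written once instead of making a round trip through memory per operation. The sketch below is plain Python with hypothetical helper names; it only counts element transfers for a chain of unary operations and shows the fused per-element computation, assuming a scale, shift, then ReLU chain as the example.

```python
def element_transfers_unfused(n, num_ops):
    """Each op is its own kernel: it reads n elements and writes n elements."""
    return num_ops * 2 * n


def element_transfers_fused(n):
    """All ops run in one kernel: read n elements once, write n elements once."""
    return 2 * n


def scale_shift_relu(x, scale, shift):
    """Fused per-element computation: relu(scale * x + shift)."""
    return max(0.0, scale * x + shift)


n_elems = 10_000_000
print(element_transfers_unfused(n_elems, num_ops=3))  # → 60000000
print(element_transfers_fused(n_elems))               # → 20000000
print([scale_shift_relu(x, 2.0, -1.0) for x in (-1.0, 0.5, 2.0)])  # → [0.0, 0.0, 3.0]
```

For memory-bound operations, cutting the traffic by a factor of three in this way can matter more than any change to the arithmetic itself.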
### Key Performance Insight

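| Operation Type | Compute Intensity | GPU Benefit |
|----------------|-------------------|-------------|
| Simple (add, mul) | Low | Memory-bound; speedup limited by memory bandwidth |
| Medium (sqrt, div) | Medium | More arithmetic per byte loaded; larger speedup |
| Complex (exp, sin) | High | Most arithmetic per byte loaded; largest speedup |

> **Rule of Thumb:** More complex operations → higher GPU speedups. The exact factors depend on the GPU, the CPU baseline, and whether transfer time is counted, so measure on your own hardware.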

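"Compute intensity" in the table can be made concrete as arithmetic intensity: floating-point operations per byte of memory traffic. Here is a small plain-Python sketch under stated assumptions: float32 values (4 bytes each), and a purely hypothetical cost of 20 FLOPs for a transcendental function, since the real cost depends on the hardware and the math library.

```python
def arithmetic_intensity(flops_per_element, inputs, outputs, bytes_per_value=4):
    """FLOPs per byte of global-memory traffic for an element-wise kernel."""
    bytes_moved = (inputs + outputs) * bytes_per_value
    return flops_per_element / bytes_moved


# Vector add: 1 FLOP, reads two float32 values, writes one.
add_intensity = arithmetic_intensity(flops_per_element=1, inputs=2, outputs=1)

# Unary function with an ASSUMED cost of 20 FLOPs (illustrative only).
unary_intensity = arithmetic_intensity(flops_per_element=20, inputs=1, outputs=1)

print(f"add:   {add_intensity:.4f} FLOP/byte")    # → add:   0.0833 FLOP/byte
print(f"unary: {unary_intensity:.4f} FLOP/byte")  # → unary: 2.5000 FLOP/byte
```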
---

## Part 5: Exercises

Now it's your turn! Apply what you've learned:

In [None]:
def benchmark_operation(kernel, np_op, a, b, name, iterations=100):
    """Benchmark GPU kernel vs NumPy (kernel time only; host-device transfers are excluded)."""
    n = len(a)
    out = np.zeros(n, dtype=np.float32)
    
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b) if b is not None else None
    d_out = cuda.to_device(out)
    
    blocks, threads = 256, 256
    
    # Warmup
    if d_b is not None:
        kernel[blocks, threads](d_a, d_b, d_out, n)
    else:
        kernel[blocks, threads](d_a, d_out, n)
    cuda.synchronize()
    
    # GPU benchmark
    start = time.perf_counter()
    for _ in range(iterations):
        if d_b is not None:
            kernel[blocks, threads](d_a, d_b, d_out, n)
        else:
            kernel[blocks, threads](d_a, d_out, n)
    cuda.synchronize()
    gpu_time = (time.perf_counter() - start) / iterations * 1000
    
    # NumPy benchmark
    start = time.perf_counter()
    for _ in range(iterations):
        if b is not None:
            _ = np_op(a, b)
        else:
            _ = np_op(a)
    numpy_time = (time.perf_counter() - start) / iterations * 1000
    
    speedup = numpy_time / gpu_time
    return gpu_time, numpy_time, speedup

In [None]:
# Comprehensive benchmark
n = 10_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32) + 0.1

print(f"Benchmarking with N = {n:,} elements\n")
print(f"{'Operation':<15} | {'GPU (ms)':<10} | {'NumPy (ms)':<10} | {'Speedup':<10}")
print("-" * 55)

benchmarks = [
    ('Add', vector_add, lambda x, y: x + y, b),
    ('Mul', vector_mul, lambda x, y: x * y, b),
    ('Div', vector_div, lambda x, y: x / y, b),
    ('Sqrt', vector_sqrt, np.sqrt, None),
    ('Exp', vector_exp, np.exp, None),
    ('Log', vector_log, np.log, None),
    ('Sin', vector_sin, np.sin, None),
    ('Cos', vector_cos, np.cos, None),
]

for name, kernel, np_op, b_arr in benchmarks:
    a_input = np.abs(a) + 0.01 if name in ['Sqrt', 'Log'] else a
    gpu_t, np_t, speedup = benchmark_operation(kernel, np_op, a_input, b_arr, name)
    print(f"{name:<15} | {gpu_t:<10.3f} | {np_t:<10.3f} | {speedup:<10.1f}x")

### Observations

```
Memory-bound ops (add, mul):
• Moderate speedup (5-10x)
• Limited by memory bandwidth
• GPU has higher bandwidth than CPU

Compute-bound ops (exp, sin, sqrt):
• Higher speedup (10-50x)
• GPU excels at parallel math
• More compute per memory access
```

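To judge whether a kernel is limited by memory bandwidth, one option is to convert a measured time into effective bandwidth and compare it with the GPU's advertised peak. The helper below is a hypothetical plain-Python sketch; the 0.5 ms figure is an invented example timing, not a measurement from this notebook.

```python
def effective_bandwidth_gb_s(n, arrays_read, arrays_written, elapsed_ms, bytes_per_value=4):
    """Effective bandwidth in GB/s for an element-wise kernel over n values."""
    total_bytes = n * (arrays_read + arrays_written) * bytes_per_value
    seconds = elapsed_ms * 1e-3
    return total_bytes / seconds / 1e9


# Vector add over 10 million float32 values: two arrays read, one written.
# elapsed_ms=0.5 is a made-up timing for illustration.
bw = effective_bandwidth_gb_s(n=10_000_000, arrays_read=2, arrays_written=1, elapsed_ms=0.5)
print(f"{bw:.0f} GB/s")  # → 240 GB/s
```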
---

## Part 6: In-Place Operations

In [None]:
@cuda.jit
def inplace_add(a, b, n):
    """a[i] += b[i] (modifies a in-place)"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        a[i] += b[i]

@cuda.jit
def inplace_scale(a, scalar, n):
    """a[i] *= scalar (modifies a in-place)"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        a[i] *= scalar

@cuda.jit
def inplace_clamp(a, min_val, max_val, n):
    """Clamp values to [min_val, max_val] in-place"""
    tid = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(tid, n, stride):
        a[i] = max(min_val, min(max_val, a[i]))

In [None]:
# Test in-place operations
test_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)
test_b = np.array([0.5, 0.5, 0.5, 0.5, 0.5], dtype=np.float32)

d_test_a = cuda.to_device(test_a.copy())
d_test_b = cuda.to_device(test_b)

print(f"Original a: {test_a}")

inplace_add[1, 32](d_test_a, d_test_b, len(test_a))
print(f"After a += b: {d_test_a.copy_to_host()}")

inplace_scale[1, 32](d_test_a, 2.0, len(test_a))
print(f"After a *= 2: {d_test_a.copy_to_host()}")

inplace_clamp[1, 32](d_test_a, 2.0, 8.0, len(test_a))
print(f"After clamp[2,8]: {d_test_a.copy_to_host()}")

---

## 🎯 Exercises

### 🔷 CUDA C++ Exercises (Primary)

Complete these elementwise operation exercises in CUDA C++.

In [None]:
%%writefile elementwise_exercises.cu
// elementwise_exercises.cu - Elementwise operation exercises
#include <stdio.h>
#include <stdlib.h>  // exit, EXIT_FAILURE
#include <cuda_runtime.h>
#include <math.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA Error: %s\n", cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// =============================================================================
// Exercise 1: Vector Absolute Value
// =============================================================================

__global__ void vectorAbs(const float* a, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = fabsf(a[idx]);
    }
}

// =============================================================================
// Exercise 2: Softplus Activation: log(1 + exp(x))
// =============================================================================

__global__ void vectorSoftplus(const float* a, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float x = a[idx];
        // Numerical stability: for large x, softplus(x) ≈ x
        if (x > 20.0f) {
            out[idx] = x;
        } else {
            out[idx] = logf(1.0f + expf(x));
        }
    }
}

// =============================================================================
// Exercise 3: Polynomial Evaluation: ax^2 + bx + c
// =============================================================================

__global__ void polynomialEval(const float* x, float a, float b, float c, 
                                float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float xi = x[idx];
        out[idx] = a * xi * xi + b * xi + c;
    }
}

// =============================================================================
// Exercise 4: Distance from Origin (2D vectors)
// =============================================================================

__global__ void vectorDistance2D(const float* x, const float* y, float* dist, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        dist[idx] = sqrtf(x[idx] * x[idx] + y[idx] * y[idx]);
    }
}

// =============================================================================
// Test harness
// =============================================================================

int main() {
    printf("=== Elementwise Operation Exercises ===\n\n");
    
    // Exercise 1: Vector Absolute Value
    printf("Exercise 1: Vector Absolute Value\n");
    printf("-" "---------------------------------\n");
    {
        float h_a[] = {-3.0f, -1.0f, 0.0f, 1.0f, 3.0f};
        float h_out[5];
        const int N = 5;
        
        float *d_a, *d_out;
        CUDA_CHECK(cudaMalloc(&d_a, N * sizeof(float)));
        CUDA_CHECK(cudaMalloc(&d_out, N * sizeof(float)));
        CUDA_CHECK(cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice));
        
        vectorAbs<<<1, 256>>>(d_a, d_out, N);
        CUDA_CHECK(cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost));
        
        printf("Input:  [%.0f, %.0f, %.0f, %.0f, %.0f]\n", 
               h_a[0], h_a[1], h_a[2], h_a[3], h_a[4]);
        printf("Output: [%.0f, %.0f, %.0f, %.0f, %.0f]\n", 
               h_out[0], h_out[1], h_out[2], h_out[3], h_out[4]);
        printf("Expected: [3, 1, 0, 1, 3]\n\n");
        
        cudaFree(d_a); cudaFree(d_out);
    }
    
    // Exercise 2: Softplus
    printf("Exercise 2: Softplus Activation\n");
    printf("-" "-------------------------------\n");
    {
        float h_a[] = {-5.0f, 0.0f, 1.0f, 5.0f, 25.0f};
        float h_out[5];
        const int N = 5;
        
        float *d_a, *d_out;
        CUDA_CHECK(cudaMalloc(&d_a, N * sizeof(float)));
        CUDA_CHECK(cudaMalloc(&d_out, N * sizeof(float)));
        CUDA_CHECK(cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice));
        
        vectorSoftplus<<<1, 256>>>(d_a, d_out, N);
        CUDA_CHECK(cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost));
        
        printf("Input:  [%.1f, %.1f, %.1f, %.1f, %.1f]\n", 
               h_a[0], h_a[1], h_a[2], h_a[3], h_a[4]);
        printf("Output: [%.3f, %.3f, %.3f, %.3f, %.3f]\n", 
               h_out[0], h_out[1], h_out[2], h_out[3], h_out[4]);
        printf("Note: softplus(0)=ln(2)≈0.693, large x→x\n\n");
        
        cudaFree(d_a); cudaFree(d_out);
    }
    
    // Exercise 3: Polynomial
    printf("Exercise 3: Polynomial (x^2 + 2x + 1)\n");
    printf("-" "-------------------------------------\n");
    {
        float h_x[] = {0.0f, 1.0f, 2.0f, 3.0f};
        float h_out[4];
        const int N = 4;
        
        float *d_x, *d_out;
        CUDA_CHECK(cudaMalloc(&d_x, N * sizeof(float)));
        CUDA_CHECK(cudaMalloc(&d_out, N * sizeof(float)));
        CUDA_CHECK(cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice));
        
        polynomialEval<<<1, 256>>>(d_x, 1.0f, 2.0f, 1.0f, d_out, N);
        CUDA_CHECK(cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost));
        
        printf("Input x: [%.0f, %.0f, %.0f, %.0f]\n", h_x[0], h_x[1], h_x[2], h_x[3]);
        printf("Output:  [%.0f, %.0f, %.0f, %.0f]\n", h_out[0], h_out[1], h_out[2], h_out[3]);
        printf("Expected: [1, 4, 9, 16] = (x+1)^2\n\n");
        
        cudaFree(d_x); cudaFree(d_out);
    }
    
    // Exercise 4: Distance 2D
    printf("Exercise 4: Distance from Origin\n");
    printf("-" "--------------------------------\n");
    {
        float h_x[] = {3.0f, 0.0f, 4.0f};
        float h_y[] = {4.0f, 5.0f, 3.0f};
        float h_dist[3];
        const int N = 3;
        
        float *d_x, *d_y, *d_dist;
        CUDA_CHECK(cudaMalloc(&d_x, N * sizeof(float)));
        CUDA_CHECK(cudaMalloc(&d_y, N * sizeof(float)));
        CUDA_CHECK(cudaMalloc(&d_dist, N * sizeof(float)));
        CUDA_CHECK(cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice));
        CUDA_CHECK(cudaMemcpy(d_y, h_y, N * sizeof(float), cudaMemcpyHostToDevice));
        
        vectorDistance2D<<<1, 256>>>(d_x, d_y, d_dist, N);
        CUDA_CHECK(cudaMemcpy(h_dist, d_dist, N * sizeof(float), cudaMemcpyDeviceToHost));
        
        printf("x: [%.0f, %.0f, %.0f], y: [%.0f, %.0f, %.0f]\n", 
               h_x[0], h_x[1], h_x[2], h_y[0], h_y[1], h_y[2]);
        printf("Distance: [%.0f, %.0f, %.0f]\n", h_dist[0], h_dist[1], h_dist[2]);
        printf("Expected: [5, 5, 5] (3-4-5 triangles!)\n");
        
        cudaFree(d_x); cudaFree(d_y); cudaFree(d_dist);
    }
    
    printf("\n=== All exercises complete! ===\n");
    return 0;
}

In [None]:
!nvcc -arch=sm_75 -o elementwise_exercises elementwise_exercises.cu && ./elementwise_exercises

### 🔶 Python/Numba Exercises (Optional)

### Exercise 1: Vector Absolute Value

In [None]:
# TODO: Implement vector absolute value
@cuda.jit
def vector_abs(a, out, n):
    """out[i] = |a[i]|"""
    # Hint: Use math.fabs(x)
    pass

# Test with [-3, -1, 0, 1, 3]
# Expected: [3, 1, 0, 1, 3]

### Exercise 2: Softplus Activation

In [None]:
# TODO: Implement softplus: log(1 + exp(x))
@cuda.jit
def vector_softplus(a, out, n):
    """Softplus activation: out[i] = log(1 + exp(a[i]))"""
    # Hint: For numerical stability, use:
    # if x > 20: out[i] = x  (exp(x) dominates; avoids exp overflow)
    # else: out[i] = math.log(1 + math.exp(x))
    pass
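
To check a GPU result for this exercise, a CPU reference can help. The sketch below applies the thresholded formula from the hint using only the standard `math` module. The name `softplus_ref` and the sample inputs are hypothetical, and `math.log1p` is swapped in for `log(1 + t)` because it is slightly more accurate when `t` is small.

```python
import math

def softplus_ref(x, threshold=20.0):
    """CPU reference for softplus(x) = log(1 + exp(x))."""
    if x > threshold:
        # exp(x) dwarfs 1 here, so the result equals x to within float32
        # rounding. On the GPU, float32 exp overflows to inf above x ~ 88.7.
        return x
    return math.log1p(math.exp(x))

# Hypothetical sample inputs, chosen to hit both branches
inputs = [-2.0, -1.0, 0.0, 1.0, 100.0]
reference = [softplus_ref(v) for v in inputs]
print(reference)
```

Comparing the kernel output against `reference` with a small tolerance (float32 vs. float64) is usually enough to catch a missing threshold branch.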

### Exercise 3: Polynomial Evaluation

In [None]:
# TODO: Evaluate polynomial a*x^2 + b*x + c
@cuda.jit
def polynomial_eval(x, a_coef, b_coef, c_coef, out, n):
    """Evaluate ax^2 + bx + c for each element."""
    pass

# Test: x = [0, 1, 2, 3], a=1, b=2, c=1
# Expected (x^2 + 2x + 1): [1, 4, 9, 16]
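
One way to write the per-element expression is Horner form, which trades a multiply for a slightly different grouping. The sketch below is a CPU reference only; `polynomial_ref` is a hypothetical name, and the same expression would sit inside the kernel's loop body.

```python
def polynomial_ref(x, a_coef, b_coef, c_coef):
    """CPU reference: evaluate a*x^2 + b*x + c in Horner form."""
    # (a*x + b)*x + c needs 2 multiplies and 2 adds,
    # versus 3 multiplies and 2 adds for a*x*x + b*x + c
    return (a_coef * x + b_coef) * x + c_coef

xs = [0.0, 1.0, 2.0, 3.0]
expected = [polynomial_ref(x, 1.0, 2.0, 1.0) for x in xs]
print(expected)  # [1.0, 4.0, 9.0, 16.0]
```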

### Exercise 4: Distance from Origin (2D vectors)

In [None]:
# TODO: Compute distance from origin for 2D points
@cuda.jit
def vector_distance_2d(x, y, dist, n):
    """dist[i] = sqrt(x[i]^2 + y[i]^2)"""
    pass

# Test: x = [3, 0, 4], y = [4, 5, 3]
# Expected: [5, 5, 5] (3-4-5 triangles!)
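
A CPU reference for this exercise also shows one pitfall of the direct formula: squaring can overflow even when the distance itself is representable. The sketch below uses the hypothetical name `distance_ref`, and the large-value case is illustrative rather than part of the exercise.

```python
import math

def distance_ref(x, y):
    """CPU reference: distance of the point (x, y) from the origin."""
    return math.sqrt(x * x + y * y)

xs = [3.0, 0.0, 4.0]
ys = [4.0, 5.0, 3.0]
expected = [distance_ref(x, y) for x, y in zip(xs, ys)]
print(expected)  # [5.0, 5.0, 5.0]

# Squaring can overflow even when the true distance is representable;
# math.hypot avoids the intermediate overflow.
big = 1e200
print(distance_ref(big, big))  # inf, because big * big overflows float64
print(math.hypot(big, big))    # finite, about 1.414e200
```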

---

## 📝 Key Takeaways

### Quick Reference Card: Element-wise Operations

```
┌─────────────────────────────────────────────────────────────────┐
│  ELEMENT-WISE OPERATION TEMPLATE                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  BINARY OPERATION (2 inputs):                                   │
│  ───────────────────────────────────────────────────────────    │
│  @cuda.jit                                                      │
│  def binary_op(a, b, out, n):                                   │
│      tid = cuda.grid(1)                                         │
│      stride = cuda.gridsize(1)                                  │
│      for i in range(tid, n, stride):                            │
│          out[i] = a[i] ○ b[i]   # ○ = +, -, *, /                │
│                                                                 │
│  UNARY OPERATION (1 input):                                     │
│  ───────────────────────────────────────────────────────────    │
│  @cuda.jit                                                      │
│  def unary_op(a, out, n):                                       │
│      tid = cuda.grid(1)                                         │
│      stride = cuda.gridsize(1)                                  │
│      for i in range(tid, n, stride):                            │
│          out[i] = func(a[i])    # sqrt, exp, sin, etc.          │
│                                                                 │
│  AVAILABLE MATH FUNCTIONS:                                      │
│  ───────────────────────────────────────────────────────────    │
│  • Basic:  +, -, *, /, **, %                                    │
│  • Math:   math.sqrt, math.exp, math.log, math.log10            │
│  • Trig:   math.sin, math.cos, math.tan                         │
│  • Inv:    math.asin, math.acos, math.atan                      │
│  • Other:  math.fabs, math.floor, math.ceil, math.tanh          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
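
To see why the grid-stride template covers every element exactly once, the index pattern can be simulated on the CPU. This is a plain-Python sketch, not Numba code: `grid_stride_indices` and the launch sizes are hypothetical, and the loop variables stand in for `cuda.grid(1)` and `cuda.gridsize(1)`.

```python
def grid_stride_indices(tid, total_threads, n):
    """Indices one simulated thread visits in a grid-stride loop."""
    return list(range(tid, n, total_threads))

# Hypothetical launch: 2 blocks of 4 threads covering 10 elements
blocks, threads_per_block, n = 2, 4, 10
total_threads = blocks * threads_per_block  # plays the role of cuda.gridsize(1)

visits = [0] * n
for tid in range(total_threads):            # plays the role of cuda.grid(1)
    for i in grid_stride_indices(tid, total_threads, n):
        visits[i] += 1

print(grid_stride_indices(0, total_threads, n))  # [0, 8]
print(visits)                                    # every entry is 1
```

Because each thread starts at its own `tid` and steps by the total thread count, the same kernel stays correct whether the grid is larger or smaller than `n`.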

### ✅ What You Achieved Today

| Skill | Status |
|-------|--------|
| Implemented arithmetic operations on vectors | ✅ |
| Applied transcendental math functions | ✅ |
| Built neural network activation functions | ✅ |
| Combined operations with grid-stride loops | ✅ |

### 🧠 Performance Notes

| Operation Type | Characteristic | GPU Advantage |
|----------------|----------------|---------------|
| Simple ops (add, multiply) | Memory-bandwidth bound | Moderate speedup |
| Complex ops (exp, log, trig) | More compute per element | Larger speedup |
| Combined ops | Several operations per memory pass | Fuse into one kernel with a grid-stride loop |
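
A rough way to place an operation in this table is its arithmetic intensity: floating-point operations per byte of global memory traffic. The sketch below is a back-of-the-envelope estimate under stated assumptions (float32 data, every input read once, every output written once, caches ignored); `arithmetic_intensity` is a hypothetical helper, not a library function.

```python
def arithmetic_intensity(flops_per_element, arrays_read, arrays_written,
                         bytes_per_value=4):
    """FLOPs performed per byte of global memory traffic for one element."""
    bytes_moved = (arrays_read + arrays_written) * bytes_per_value
    return flops_per_element / bytes_moved

# Vector add on float32: 1 add, 2 reads, 1 write -> 12 bytes per element
add_intensity = arithmetic_intensity(1, arrays_read=2, arrays_written=1)

# Polynomial in Horner form: 4 FLOPs, 1 read, 1 write -> 8 bytes per element
poly_intensity = arithmetic_intensity(4, arrays_read=1, arrays_written=1)

print(f"add:  {add_intensity:.3f} FLOP/byte")
print(f"poly: {poly_intensity:.3f} FLOP/byte")
```

The lower the intensity, the more the kernel's runtime is set by memory bandwidth rather than arithmetic, which is why fusing several operations into one pass tends to help.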

---

## 🚀 What's Next?

**Day 3: SAXPY & BLAS-like Operations**: we'll combine multiple operations and learn about the industry-standard BLAS specification!

| Preview Topic | What You'll Learn |
|---------------|-------------------|
| BLAS Introduction | The standard for linear algebra |
| SAXPY | The "Hello World" of GPU benchmarks |
| Memory bandwidth | Understanding performance limits |
| DOT, SCAL, AXPY | Building blocks of linear algebra |

> 💡 **Tomorrow's Hook:** SAXPY is just `y = a*x + y`, so why is it the most famous GPU benchmark in the world?