# Introduction to GPU Programming with Taichi

## GPU and Taichi

### Why GPU Programming?

- **Parallelism:** GPUs have thousands of cores, versus roughly 4-16 on a typical CPU
- **Performance:** Well suited to applying the same mathematical operation across large datasets
- **Efficiency:** A good fit for image processing, simulations, and machine learning

Sequential vs. parallel:

<img src="https://j.gifs.com/862LRm.gif" width="600" height="300" alt="cpu rendering">

<img src="https://media.licdn.com/dms/image/v2/C5612AQGcYt0zCoklDg/article-cover_image-shrink_423_752/article-cover_image-shrink_423_752/0/1533141973574?e=1756339200&v=beta&t=6kE4mGTjj9m_xqI-QcyKsh7UDs1Ao8pB3xGFjBHFi7I" width="600" height="300" alt="cpu rendering">



### What is Taichi Lang?
Taichi Lang is an open-source, imperative, parallel programming language for high-performance numerical computation. It is embedded in Python and uses just-in-time (JIT) compiler frameworks such as LLVM to compile compute-intensive Python code into native GPU or CPU instructions.

[GitHub](https://github.com/taichi-dev/taichi?tab=readme-ov-file) | [Documentation](https://docs.taichi-lang.org/)

In [1]:
!pip install taichi

Collecting taichi
  Downloading taichi-1.7.3-cp311-cp311-manylinux_2_27_x86_64.whl.metadata (12 kB)
Collecting colorama (from taichi)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading taichi-1.7.3-cp311-cp311-manylinux_2_27_x86_64.whl (55.0 MB)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: colorama, taichi
Successfully installed colorama-0.4.6 taichi-1.7.3


In [2]:
# importing all the required libs
import taichi as ti
import taichi.math as tm
import imageio
from time import time

[Taichi] version 1.7.3, llvm 15.0.4, commit 5ec301be, linux, python 3.11.13


## Taichi Syntax


### Fields:

- **GPU Memory Allocation:** On a GPU backend, fields are allocated in GPU memory, so kernels read and write them without copying from host RAM
- **Type Safety & Performance:** An explicit `dtype` (here `float`) fixes the element type at compile time, which catches type mismatches early and lets the compiler optimize
- **Zero Python Overhead:** Kernels access field data directly, with no Python interpreter in the loop
- **Persistent Storage:** Unlike local variables inside a kernel, fields persist across kernel calls
- **Multi-dimensional:** Fields can be 1D, 2D, 3D, or higher-dimensional, much like NumPy arrays, but laid out for parallel access

Equivalents in NumPy and PyTorch:
- NumPy: `np.zeros((2000, 1000))`, CPU RAM, dynamic typing
- PyTorch: `torch.zeros(2000, 1000, device='cuda')`, GPU VRAM, static on GPU
- Taichi: `ti.field(dtype=float, shape=(2000, 1000))`, GPU VRAM, static compiled

### @ti.func:
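- **Compiled Function:** Compiled by Taichi and run on the selected backend (GPU or CPU), not in the Python interpreter; it can only be called from a kernel or another `@ti.func`
- **No Python Overhead:** Executes directly on the device with no interpreter in the loop
- **Type Inference:** Taichi infers argument and return types, so no annotations are required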

### @ti.kernel:
- **Massive Parallel Execution:** The outermost `for` loop in a kernel is parallelized automatically, so each (i, j) pixel can be processed by its own GPU thread
- **Core Count Advantage:** GPU cores (3000-10000+) vs CPU cores (4-16): strength in numbers, not individual speed
- **SIMD Architecture:** Single Instruction, Multiple Data: the same operation is applied to many data points at once
- **Automatic Work Distribution:** Taichi handles thread scheduling and memory access patterns

## Fractals


### Fractal Logic:
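1. **Iteration:** Apply z = z² + c repeatedly, starting from the pixel's position z, with a constant c that is fixed for each frame
2. **Escape Condition:** Once |z| reaches 20, the point is treated as escaping to infinity
3. **Max Iterations:** Cap the loop at 50 iterations so that points which never escape do not loop forever
4. **Coloring:** Convert the iteration count to pixel brightness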
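
The four steps can be sketched for a single point in plain Python, using the built-in `complex` type instead of Taichi vectors. The function names `escape_iterations` and `brightness` are invented for this illustration; the radius, iteration cap, and brightness formula mirror the values used in this notebook.

```python
def escape_iterations(z: complex, c: complex,
                      max_iter: int = 50, radius: float = 20.0) -> int:
    """Count how many times z = z*z + c is applied before |z| reaches radius."""
    iterations = 0
    while abs(z) < radius and iterations < max_iter:
        z = z * z + c
        iterations += 1
    return iterations

def brightness(iterations: int) -> float:
    """Map an iteration count in [0, 50] to a brightness in [0, 1]."""
    return 1 - iterations * 0.02

# A point that never escapes uses every iteration and is drawn dark.
stays = escape_iterations(0j, 0j)
# A point already outside the radius escapes immediately and is drawn bright.
leaves = escape_iterations(30 + 0j, 0j)
print(stays, brightness(stays), leaves, brightness(leaves))
```

On the GPU, the same loop runs once per pixel, with every pixel handled independently.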

### Julia Set:
A Julia set, named after Gaston Julia, is the boundary in the complex plane between starting points whose orbits remain bounded and those that escape to infinity under repeated iteration of a complex function.
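
For the quadratic map used in this notebook, the definition can be written out as follows. The symbols $f_c$, $K_c$, and $J_c$ are conventional notation introduced here for illustration, not names from the notebook. The expansion of $z^2$ is the same arithmetic that `complex_sqr` performs on a 2D vector in the code.

$$
f_c(z) = z^2 + c, \qquad z = x + iy \;\Rightarrow\; z^2 = (x^2 - y^2) + i\,(2xy)
$$

$$
K_c = \Bigl\{\, z \in \mathbb{C} : \sup_{n \ge 0} \bigl| f_c^{\,n}(z) \bigr| < \infty \Bigr\}, \qquad J_c = \partial K_c
$$

Here $f_c^{\,n}$ denotes $f_c$ applied $n$ times, $K_c$ is the set of starting points whose orbits stay bounded, and the Julia set $J_c$ is its boundary.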

## Code

In [3]:
def generate_fractal(device):
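    # initialise Taichi on the requested backend; cpu_max_num_threads=1 limits
    # the CPU backend to a single thread, so the CPU run is a sequential baseline
    ti.init(arch=device, cpu_max_num_threads=1)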

    n = 500  # pixel shape of the image set to (1000, 500)
    pixels = ti.field(dtype=float, shape=(n * 2, n))

    @ti.func
    def complex_sqr(z):  # complex square of a 2D vector
        return tm.vec2(z[0] * z[0] - z[1] * z[1], 2 * z[0] * z[1])

    @ti.kernel
    def paint(t: float):
        for i, j in pixels:  # parallelized over all pixels
            c = tm.vec2(-0.8, tm.cos(t) * 0.2)
            z = tm.vec2(i / n - 1, j / n - 0.5) * 2  # map pixel (i, j) to the complex plane: x in [-2, 2), y in [-1, 1)
            iterations = 0
            while z.norm() < 20 and iterations < 50:
                z = complex_sqr(z) + c
                iterations += 1
            pixels[i, j] = 1 - iterations * 0.02

    images = []
    start_time = time()
    for i in range(600):
        paint(i * 0.03)

        # copy the frame from the field into host memory as a NumPy array
        image = pixels.to_numpy()
        images.append(image)
    end_time = time()

    # convert to 8-bit grayscale
    images = [((img - img.min()) / (img.max() - img.min()) * 255).astype('uint8') for img in images]

    # save as a video to download later; when running locally, Taichi's GUI viewer can display the frames directly instead of saving them
    imageio.mimwrite('julia_set.mp4', images, fps=60, quality=8, codec='libx264', ffmpeg_params=['-pix_fmt', 'yuv420p'])

    print('total time taken:', end_time - start_time)
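
The pixel-to-plane mapping inside `paint` is easy to check by hand. The sketch below repeats the same arithmetic in plain Python; `pixel_to_plane` is a hypothetical helper written for this illustration and is not part of the notebook.

```python
def pixel_to_plane(i: int, j: int, n: int = 500) -> tuple[float, float]:
    """Mirror of the expression tm.vec2(i / n - 1, j / n - 0.5) * 2 in paint."""
    x = (i / n - 1) * 2
    y = (j / n - 0.5) * 2
    return x, y

# Corners and centre of the 1000 x 500 pixel grid.
print(pixel_to_plane(0, 0))      # lower bound of both axes
print(pixel_to_plane(500, 250))  # centre of the image maps to the origin
print(pixel_to_plane(999, 499))  # last pixel, just inside the upper bounds
```

With `n = 500`, the 1000 x 500 grid covers a region four units wide and two units tall, so the aspect ratio of the image matches that of the region being drawn.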

In [5]:
# profiling on CPU
generate_fractal(ti.cpu)

[Taichi] Starting on arch=x64


KeyboardInterrupt: 

In [None]:
# profiling on GPU
generate_fractal(ti.gpu)