A high-performance numerical simulation of the Rayleigh-Taylor instability implemented in CUDA, Python, and C for comparative performance analysis.
This project implements a numerical simulation of the Rayleigh-Taylor instability using CUDA parallel computing. The Rayleigh-Taylor instability occurs when a heavy fluid rests on top of a lighter fluid under gravity, creating an unstable interface that evolves over time. The simulation solves the compressible fluid dynamics equations on a 2D grid, parallelizing the computation on the GPU for high performance.
Density field evolution showing the characteristic mushroom-like structures of the Rayleigh-Taylor instability (512×1024 grid resolution)
This repository contains three main implementation approaches:
- `rayleigh_taylor_instability_cpu_py.ipynb`: pure Python implementation with NumPy for baseline comparison
- `rayleigh_taylor_instability_cpu_c.ipynb`: C CPU implementation for performance comparison
- `rayleigh_taylor_instability_gpu.ipynb`: CUDA GPU-accelerated implementation for maximum performance
The system models a compressible fluid with the following conservative variables:
- Density: ρ (denoted as r)
- Momentum: ρu (denoted as ru) and ρv (denoted as rv)
- Total Energy: e (denoted as e)
The 2D conservation equations are:

$$\frac{\partial \rho}{\partial t} + \frac{\partial (\rho u)}{\partial x} + \frac{\partial (\rho v)}{\partial y} = 0$$

$$\frac{\partial (\rho u)}{\partial t} + \frac{\partial (\rho u^2 + p)}{\partial x} + \frac{\partial (\rho u v)}{\partial y} = 0$$

$$\frac{\partial (\rho v)}{\partial t} + \frac{\partial (\rho u v)}{\partial x} + \frac{\partial (\rho v^2 + p)}{\partial y} = -\rho g_a$$

$$\frac{\partial e}{\partial t} + \frac{\partial \left((e + p)u\right)}{\partial x} + \frac{\partial \left((e + p)v\right)}{\partial y} = -\rho g_a v$$

where $p = (\gamma - 1)\left(e - \tfrac{1}{2}\rho(u^2 + v^2)\right)$ is the pressure (ideal-gas closure), $g_a$ is the gravitational acceleration, and the velocities are recovered as $u = \rho u / \rho$ and $v = \rho v / \rho$.
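Recovering the primitive variables (velocity, pressure, sound speed) from the conservative ones is a step every implementation needs. Below is a minimal NumPy sketch assuming an ideal-gas law; the function name and the value $\gamma = 1.4$ are illustrative assumptions, not taken from the notebooks:

```python
import numpy as np

GAMMA = 1.4  # ratio of specific heats (assumed value)

def primitive_from_conservative(r, ru, rv, e):
    """Recover (u, v, p, c) from the conservative fields (rho, rho*u, rho*v, e)."""
    u = ru / r
    v = rv / r
    kinetic = 0.5 * r * (u**2 + v**2)
    p = (GAMMA - 1.0) * (e - kinetic)  # ideal-gas equation of state
    c = np.sqrt(GAMMA * p / r)         # local sound speed
    return u, v, p, c
```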
The initial interface is defined at $y = L_y/2$:
- Density: $\rho = 2.0$ for $y \geq L_y/2$, $\rho = 1.0$ for $y < L_y/2$
- Horizontal velocity: $u = 0$ everywhere
- Vertical velocity: $v = 0$ except near the interface ($|y - L_y/2| \leq 0.05$), where $v$ is a random perturbation between $-10^{-3}$ and $10^{-3}$
- Pressure: $p = 40 + \rho g_a (y - L_y/2)$ (hydrostatic equilibrium)
- Energy: $e$ derived from $p$, $u$, and $v$
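These initial conditions can be sketched in a few lines of NumPy. This is an illustrative sketch, not the notebooks' code: the domain height, $\gamma = 1.4$, $g_a = 1.0$, and the function name are all assumed values.

```python
import numpy as np

GAMMA = 1.4  # heat-capacity ratio (assumed)
G_A = 1.0    # gravitational acceleration magnitude (assumed)
LY = 1.0     # domain height (assumed)

def initialize(nx, ny, rng=None):
    """Heavy fluid over light fluid, with a small random velocity perturbation
    near the interface and a hydrostatic pressure profile."""
    rng = np.random.default_rng(0) if rng is None else rng
    y = np.linspace(0.0, LY, ny).reshape(1, -1) * np.ones((nx, 1))
    r = np.where(y >= LY / 2, 2.0, 1.0)              # heavy fluid on top
    u = np.zeros((nx, ny))
    v = np.zeros((nx, ny))
    band = np.abs(y - LY / 2) <= 0.05                # thin layer at the interface
    v[band] = rng.uniform(-1e-3, 1e-3, size=band.sum())
    p = 40.0 + r * G_A * (y - LY / 2)                # hydrostatic profile as given above
    e = p / (GAMMA - 1.0) + 0.5 * r * (u**2 + v**2)  # total energy from p, u, v
    return r, r * u, r * v, e
```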
- Y-boundaries (top and bottom): rigid walls (normal velocity $v = 0$, hydrostatic equilibrium for $p$)
- X-boundaries: periodic (values at $x = 0$ and $x = L_x$ are connected)
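One common ghost-cell realization of these boundary conditions, sketched in NumPy. The array layout (x along axis 0, y along axis 1) and the simple zero-gradient wall copy for the scalar fields are assumptions; the notebooks may instead extrapolate $p$ hydrostatically at the walls.

```python
import numpy as np

def apply_boundaries(r, ru, rv, e):
    """Periodic ghost cells in x; rigid walls in y with zero normal velocity."""
    for q in (r, ru, rv, e):
        q[0, :] = q[-2, :]   # periodic: left ghost column copies right interior
        q[-1, :] = q[1, :]   # periodic: right ghost column copies left interior
        q[:, 0] = q[:, 1]    # zero-gradient copy at the bottom wall
        q[:, -1] = q[:, -2]  # zero-gradient copy at the top wall
    rv[:, 0] = 0.0           # enforce v = 0 at the bottom wall
    rv[:, -1] = 0.0          # enforce v = 0 at the top wall
    return r, ru, rv, e
```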
Two numerical schemes are implemented: an explicit Euler method and a second-order Runge-Kutta (RK2) method. Writing the semi-discrete system as $\partial_t q = L(q)$, where $q = (\rho, \rho u, \rho v, e)$ and $L$ collects the discrete flux divergences and source terms:

- Euler: $q^{n+1} = q^n + \Delta t \, L(q^n)$
- RK2: $q^{*} = q^n + \Delta t \, L(q^n)$, then $q^{n+1} = \tfrac{1}{2}\left(q^n + q^{*} + \Delta t \, L(q^{*})\right)$

The time step $\Delta t$ is limited by a CFL condition on the fastest signal speed:

$$\Delta t = C_{\mathrm{CFL}} \, \min\!\left(\frac{\Delta x}{\max(|u| + c)}, \frac{\Delta y}{\max(|v| + c)}\right), \qquad c = \sqrt{\gamma p / \rho}$$

For numerical stability, diffusion terms of the form $\nu\left(\partial_x^2 q + \partial_y^2 q\right)$ are added to the right-hand side of each conservation equation.
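A CFL-limited time step can be computed as in the following NumPy sketch; the Courant number 0.4 and the function name are assumptions for illustration:

```python
import numpy as np

CFL = 0.4  # Courant number (assumed value)

def cfl_time_step(u, v, c, dx, dy):
    """Largest stable time step given velocity fields and local sound speed."""
    sx = np.max(np.abs(u) + c)  # fastest signal speed in x
    sy = np.max(np.abs(v) + c)  # fastest signal speed in y
    return CFL * min(dx / sx, dy / sy)
```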
| Implementation | Relative Performance | Memory Usage | Parallelization |
|---|---|---|---|
| Python/NumPy | 1x (baseline) | High | Limited (vectorized ops) |
| C CPU | ~10-50x faster | Moderate | None (sequential loops) |
| CUDA GPU | ~100-500x faster | Optimized | Thousands of threads |
Python Implementation:
- Easy to prototype and debug
- Limited by Python's GIL and interpreted nature
- Suitable for algorithm validation and small grids
C CPU Implementation:
- Significant performance improvement over Python
- Efficient memory management with direct array access
- Sequential implementation without threading
- Tested on smaller grids (64×128) for faster development
CUDA GPU Implementation:
- Massive parallel acceleration with GPU kernels
- Shared memory optimization for reduction operations
- Block-thread organization for spatial computations
- Optimal for production simulations on large grids (512×1024)
Performance scaling with different grid resolutions:
| Grid Size | Memory Required | CUDA Performance | Recommended Use Case |
|---|---|---|---|
| 256×128 | ~50 MB | Excellent | Development/testing |
| 512×256 | ~200 MB | Excellent | Standard simulations |
| 1024×512 | ~800 MB | Very good | High-resolution studies |
| 2048×1024 | ~3.2 GB | Good* | Research applications |
*Performance depends on GPU memory bandwidth and compute capability.
Explicit Euler Method:
- Simple implementation
- Low computational cost per step
- First-order accuracy
- More restrictive stability conditions
Second-Order Runge-Kutta (RK2):
- Higher accuracy (second-order)
- Better stability properties
- More physical results
- Double computational cost per step
- More complex implementation
Recommendation: RK2 provides significantly better accuracy and stability, making it the preferred choice despite the computational overhead.
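The accuracy gap is easy to demonstrate on a scalar test ODE rather than the full solver. The sketch below (illustrative code, not from the notebooks) integrates $dy/dt = y$ over $[0, 1]$ with both schemes and compares the error against the exact solution $e$:

```python
import numpy as np

def euler_step(f, t, y, dt):
    """First-order explicit Euler update."""
    return y + dt * f(t, y)

def rk2_step(f, t, y, dt):
    """Heun's RK2: average the slopes at the start and a predicted endpoint."""
    y_star = y + dt * f(t, y)
    return y + 0.5 * dt * (f(t, y) + f(t + dt, y_star))

def integrate(step, f, y0, t_end, n):
    """March from t=0 to t_end in n equal steps with the given scheme."""
    dt = t_end / n
    t, y = 0.0, y0
    for _ in range(n):
        y = step(f, t, y, dt)
        t += dt
    return y

f = lambda t, y: y  # dy/dt = y, exact solution y(1) = e
err_euler = abs(integrate(euler_step, f, 1.0, 1.0, 100) - np.e)
err_rk2 = abs(integrate(rk2_step, f, 1.0, 1.0, 100) - np.e)
```

With 100 steps the RK2 error is orders of magnitude smaller than Euler's, mirroring the first- versus second-order accuracy noted above.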
- GPU Acceleration: Up to 500x speedup over Python implementation
- Multiple Methods: Both Euler and RK2 time integration schemes
- Comprehensive Benchmarking: Performance comparison across Python, C, and CUDA
- CUDA Optimizations: Parallel kernel execution and shared memory for reductions
- Configurable Parameters: Adjustable grid sizes, diffusion coefficients, and time stepping
- Scalable Design: Efficient performance from small test cases to large production runs
All performance benchmarks were conducted on:
- Platform: Kaggle Notebooks
- GPU: NVIDIA Tesla T4 (16GB VRAM)
- CPU: Intel Xeon (4 cores)
- Memory: 13GB RAM
- C Implementation: Sequential processing without OpenMP parallelization
- CUDA Optimization: Shared memory currently limited to reduction operations
- Memory Management: Dynamic allocation patterns could be further optimized
- Enhanced Parallelization: Add OpenMP threading to C implementation
- CUDA Optimizations: Implement shared memory for spatial derivative computations
- Memory Efficiency: Explore texture memory for read-only data access
- Adaptive Methods: Implement adaptive time stepping and mesh refinement
- Validation Studies: Compare results against analytical solutions and experimental data
For detailed performance reports, technical questions, or collaboration opportunities, please contact me.
Contributions are welcome. Please follow these guidelines:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/new-feature`)
3. Commit your changes (`git commit -am 'Add new feature'`)
4. Push to the branch (`git push origin feature/new-feature`)
5. Create a Pull Request
Please ensure your code follows the existing style and includes appropriate documentation.
This project is available under the MIT License. See LICENSE file for details.
- Rayleigh-Taylor Instability Theory
- CUDA Programming Guide
Developed as part of advanced parallel programming coursework focusing on GPU-accelerated numerical simulations.