TurboFFT: Co-Designed High-Performance and Fault-Tolerant Fast Fourier Transform on GPUs

I. Overview of Contributions and Artifacts

A. Paper's Main Contributions

C1: TurboFFT without fault tolerance outperforms the popular open-source library VkFFT and is comparable to the state-of-the-art closed-source library cuFFT.
C2: TurboFFT with two-side fault tolerance fuses the checksum computation into the FFT kernel, reducing the fault tolerance overhead compared to existing offline fault-tolerant FFT (FT-FFT).
C3: TurboFFT’s online error correction protects the FFT computation on the fly, achieving lower error correction overhead under error injection than the time-redundant recomputation used by offline FT-FFT.
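
The checksum idea behind the fault-tolerance contributions can be illustrated on the CPU. The sketch below is not TurboFFT's fused GPU kernel: the naive pure-Python DFT, the all-ones checksum vectors, and the helper names (`dft`, `transform_with_checksums`, `locate_and_correct`) are all assumptions chosen for illustration. It relies on two DFT identities: the output bins of a length-N signal sum to N times its first input sample, and the DFT of a sum of signals equals the sum of their DFTs. Checking both across a batch pinpoints a single corrupted output element, which can then be repaired without recomputing the transform.

```python
import cmath


def dft(x):
    """Naive O(N^2) DFT, standing in for a real FFT kernel."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transform_with_checksums(batch, fault=None):
    """Transform every signal in `batch` and compute two checksum residuals.

    `fault` is an optional (signal, bin, delta) triple added to the output
    to emulate a silent data corruption (hypothetical injection hook).
    """
    n = len(batch[0])
    outputs = [dft(x) for x in batch]
    if fault is not None:
        signal, bin_index, delta = fault
        outputs[signal][bin_index] += delta

    # Per-signal residual: sum_k X[k] - N * x[0] is zero for a correct DFT.
    per_signal = [sum(y) - n * x[0] for x, y in zip(batch, outputs)]

    # Per-bin residual: by linearity, the DFT of the summed inputs must
    # equal the sum of the individual outputs.
    summed_input = [sum(x[j] for x in batch) for j in range(n)]
    reference = dft(summed_input)
    per_bin = [sum(y[k] for y in outputs) - reference[k] for k in range(n)]
    return outputs, per_signal, per_bin


def locate_and_correct(outputs, per_signal, per_bin, tol=1e-6):
    """Repair a single corrupted element in place; return its (signal, bin)."""
    bad_signals = [s for s, r in enumerate(per_signal) if abs(r) > tol]
    bad_bins = [k for k, r in enumerate(per_bin) if abs(r) > tol]
    if not bad_signals and not bad_bins:
        return None  # no error detected
    if len(bad_signals) == 1 and len(bad_bins) == 1:
        s, k = bad_signals[0], bad_bins[0]
        outputs[s][k] -= per_bin[k]  # the residual equals the injected error
        return s, k
    raise RuntimeError("more than one element corrupted; recompute instead")


if __name__ == "__main__":
    batch = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 1.0, -2.0, 5.0]]
    outputs, per_signal, per_bin = transform_with_checksums(batch, fault=(1, 2, 5.0))
    print(locate_and_correct(outputs, per_signal, per_bin))  # (1, 2)
```

The sketch assumes the checksums themselves are computed fault-free and that at most one output element is corrupted; anything beyond that falls back to recomputation.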

B. Computational Artifacts

Table I: Computational Artifacts

Artifact ID	Contributions	Related Paper Elements
A1	C1	Figures 1, 10–14, 21
A1	C2	Figures 16–20
A1	C3	Figures 19–20, 22

II. Introduction

This document demonstrates how to reproduce the results in the paper:

TurboFFT: Co-Designed High-Performance and Fault-Tolerant Fast Fourier Transform on GPUs.

All supplementary files are available on Zenodo. The archive PPoPP25_Artifact_TurboFFT.zip contains all code, including two one-command scripts, run_A100.sh and run_T4.sh, that reproduce the result figures listed in Table I.

We executed all benchmarks in the paper using the hardware detailed in Table II and the software detailed in Table III.

III. Getting Started

This section guides you through the necessary steps to set up your machine. Please follow these steps before starting to reproduce the results.

A. Extract Code Repositories

Start by downloading PPoPP25_Artifact_TurboFFT.zip.
Extract the archive into an empty directory and change into this directory using the commands below. Make sure that the absolute path to this directory contains no spaces.

# Create a reproduce directory
mkdir reproduce
cd reproduce
cp <path-to>/PPoPP25_Artifact_TurboFFT.zip ./

# Extract the artifact
unzip PPoPP25_Artifact_TurboFFT.zip
cd PPoPP25_Artifact_TurboFFT

Now, the folder reproduce contains all the necessary code to produce the results shown in the paper. The code consists of:

TurboFFT: Source code of the high-performance FFT library.
Common: Contains CUDA helper functions from NVIDIA/cuda-samples.

B. Install Host Machine Compilation Prerequisites

GCC: Install a recent GCC version (≥ 11.2.0).
CMake: Install a recent CMake version (≥ 3.24.3).
CUDA Toolkit:
- Use CUDA Toolkit 12.0 for an A100 machine.
- Use CUDA Toolkit 11.6 for a T4 machine.

# Check the version after installing
gcc --version
cmake --version
nvcc --version
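
The version check can also be automated. The following is a minimal sketch, assuming `gcc` and `cmake` print a dotted version number in their `--version` output; the helper names `parse_version` and `check_tool` are hypothetical and not part of the artifact. `nvcc` is left out because the required CUDA Toolkit version depends on the target GPU rather than on a single minimum.

```python
import re
import shutil
import subprocess

# Minimum versions taken from the prerequisites listed above.
MINIMUMS = {"gcc": (11, 2, 0), "cmake": (3, 24, 3)}


def parse_version(text):
    """Return the first dotted version number found in `text` as a tuple."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def check_tool(name, minimum):
    """Describe whether `name` is on PATH and at least version `minimum`."""
    if shutil.which(name) is None:
        return f"{name}: not found on PATH"
    result = subprocess.run([name, "--version"], capture_output=True, text=True)
    try:
        found = parse_version(result.stdout)
    except ValueError:
        return f"{name}: could not parse version"
    verdict = "ok" if found >= minimum else "older than required"
    return f"{name}: {'.'.join(map(str, found))} ({verdict})"


if __name__ == "__main__":
    for tool, minimum in MINIMUMS.items():
        print(check_tool(tool, minimum))
```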

C. Install Host Machine Codegen & Plot Prerequisites

Python: Use a recent Python version (≥ 3.9).
PyTorch: Use a recent version (≥ 2.4).
NumPy: Use a recent version (≥ 2.0.2).
Matplotlib: Use a recent version (≥ 3.8.4).
Seaborn: Use a recent version (≥ 0.13.2).

Below is an example of setting up a Python environment:

python -m venv .venv --prompt turbofft
source .venv/bin/activate
pip install --upgrade pip
pip install torch
pip install numpy
pip install matplotlib
pip install seaborn
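
Once the environment is active, the installed package versions can be compared against the minimums listed above. This is a sketch using the standard-library `importlib.metadata`; the names `PACKAGE_MINIMUMS`, `release_tuple`, `installed_version`, and `report` are hypothetical, and the comparison only looks at the leading numeric part of each version string.

```python
import re
from importlib import metadata

# Minimum versions taken from the prerequisites listed above.
PACKAGE_MINIMUMS = {
    "torch": (2, 4),
    "numpy": (2, 0, 2),
    "matplotlib": (3, 8, 4),
    "seaborn": (0, 13, 2),
}


def release_tuple(version):
    """Leading numeric release of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def installed_version(package):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(minimums):
    """Build one status line per package in `minimums`."""
    lines = []
    for package, minimum in minimums.items():
        version = installed_version(package)
        if version is None:
            lines.append(f"{package}: missing")
        elif release_tuple(version) >= minimum:
            lines.append(f"{package}: {version} (ok)")
        else:
            lines.append(f"{package}: {version} (older than required)")
    return lines


if __name__ == "__main__":
    print("\n".join(report(PACKAGE_MINIMUMS)))
```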

IV. Reproducing Results

The one-command scripts run_A100.sh and run_T4.sh allow you to regenerate:

11 experimental result figures on NVIDIA A100 GPUs (Figures 1, 10–14, and 16–20)
2 experimental result figures on NVIDIA T4 GPUs (Figures 21–22)

A. How to Run

Ensure all dependencies are installed (see Section III).
Run the script:
- On an NVIDIA A100 machine:
```
./run_A100.sh
```
- On an NVIDIA T4 machine:
```
./run_T4.sh
```
View results:
- Experimental data will be available in the artifact_data directory.
- Figures will be saved in the artifact_figures directory.
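
To confirm that a run produced output, the two result directories can be listed with a short script. This sketch assumes that artifact_data and artifact_figures are created relative to the directory the script was run from; the helper name `summarize_outputs` is hypothetical.

```python
from pathlib import Path

# Output directory names as given in the steps above.
OUTPUT_DIRS = ("artifact_data", "artifact_figures")


def summarize_outputs(root="."):
    """Map each output directory name to the sorted file names inside it."""
    summary = {}
    for name in OUTPUT_DIRS:
        directory = Path(root) / name
        if directory.is_dir():
            summary[name] = sorted(p.name for p in directory.iterdir() if p.is_file())
        else:
            summary[name] = None  # directory has not been produced yet
    return summary


if __name__ == "__main__":
    for name, files in summarize_outputs().items():
        if files is None:
            print(f"{name}: not found")
        else:
            print(f"{name}: {len(files)} file(s)")
```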

B. Workflow Overview

The scripts run_A100.sh and run_T4.sh execute the following steps:

Environment Setup: Configures environment variables.
Code Generation: Generates required CUDA kernels.
Compilation: Builds TurboFFT and related binaries.
Benchmarking: Runs benchmarks for TurboFFT.
Plotting: Produces figures matching the paper’s results.

C. Runtime Details

Table IV shows the estimated execution time of run_A100.sh and run_T4.sh.

The script run_A100.sh takes approximately 2 hours on a machine with an AMD EPYC 7763 64-Core Processor and an NVIDIA A100 40GB GPU.
The script run_T4.sh takes approximately 30 minutes on a machine with an Intel(R) Xeon(R) Silver 4216 CPU and an NVIDIA T4 GPU.

Table II: Hardware Environment

System Type	Description
System A	GPU: 1× NVIDIA A100-SXM4-40GB; GPU Power: 400 W; CPU: AMD EPYC 7713 64-Core Processor; Cores per socket: 64; Threads per core: 2; Memory: 256 GB
System B	GPU: 1× NVIDIA Tesla-T4; GPU Power: 70 W; CPU: Intel(R) Xeon(R) Silver 4216 CPU; Cores per socket: 16; Threads per core: 2; Memory: 192 GB

Table III: Software Environment

System	Software	Version
System A	gcc	12.3.0
	cmake	3.24.3
	cudatoolkit	12.0
	python	3.10.14
	torch	2.5.1
	numpy	2.1.3
	matplotlib	3.8.4
	seaborn	0.13.2
System B	gcc	11.2.0
	cmake	3.26.4
	cudatoolkit	11.6
	python	3.9.18
	torch	2.5.1
	numpy	2.0.2
	matplotlib	3.9.2
	seaborn	0.13.2

Table IV: Estimated Execution Time

Script	CodeGen	TurboFFT	Baseline (cuFFT + VkFFT)	Plot	Total
run_A100.sh	20 s	15 min	90 min	3 min	2 hr
run_T4.sh	20 s	10 min	10 min	3 min	30 min
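
Summing the per-stage estimates gives a quick sanity check on the totals. The sketch below copies the stage times from the table; the sums come out somewhat below the listed totals, which suggests the Total column is a rounded-up estimate (an inference, not something the artifact states).

```python
# Per-stage estimates in seconds, copied from the execution time table.
STAGE_SECONDS = {
    "run_A100.sh": {"CodeGen": 20, "TurboFFT": 15 * 60, "Baseline": 90 * 60, "Plot": 3 * 60},
    "run_T4.sh": {"CodeGen": 20, "TurboFFT": 10 * 60, "Baseline": 10 * 60, "Plot": 3 * 60},
}


def total_minutes(script):
    """Sum of the stage estimates for `script`, in minutes."""
    return sum(STAGE_SECONDS[script].values()) / 60


if __name__ == "__main__":
    for script in STAGE_SECONDS:
        print(f"{script}: {total_minutes(script):.1f} min")
    # run_A100.sh: 108.3 min
    # run_T4.sh: 23.3 min
```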
