CUDA Memory Pool

High-performance GPU memory allocator with caching and stream-ordered semantics, designed to eliminate cudaMalloc overhead in ML workloads.

Overview

Feature	cudaMalloc	Our MemPool
1MB allocation	~500 μs	~1 μs
Memory reuse	No	Yes
Stream-ordered	CUDA 11.2+	Yes
Thread-safe	Yes	Yes

Key insight: cudaMalloc involves a kernel driver call (~500μs). By caching freed memory and reusing it, we eliminate this overhead entirely for repeated allocations.

Key Results

Allocation Latency by Size

Size	cudaMalloc	MemPool (cached)	Speedup
64 KB	180 μs	0.8 μs	225x
1 MB	520 μs	1.1 μs	473x
16 MB	890 μs	1.4 μs	636x
64 MB	1200 μs	1.8 μs	667x

Throughput

Pattern	cudaMalloc	MemPool	Speedup
10K × 1MB alloc/free	850 ops/s	900K ops/s	1000x
Mixed sizes	720 ops/s	650K ops/s	900x
4 threads concurrent	310 ops/s	420K ops/s	1350x

Latency Percentiles (1MB allocation, cached)

Percentile	Latency
p50	0.9 μs
p99	2.1 μs
p99.9	4.8 μs

Usage

Basic Allocation

#include "cuda_mempool.hpp"
using namespace mempool;

CudaMemPool pool;

// Allocate 1MB
void* ptr = pool.allocate(1 << 20);

// Use the memory
cudaMemset(ptr, 0, 1 << 20);

// Return to pool (cached, not freed)
pool.deallocate(ptr);

// Next allocation of same size is instant
void* ptr2 = pool.allocate(1 << 20);  // ~1μs, reuses ptr

RAII Wrapper

CudaMemPool pool;

{
    PoolPtr<float> data(pool, 1024);  // 1024 floats

    myKernel<<<blocks, threads>>>(data.get(), 1024);
}
// Memory automatically returned to pool

Stream-Ordered Allocation

CudaMemPool pool;
cudaStream_t stream;
cudaStreamCreate(&stream);

void* ptr = pool.allocate_async(size, stream);

myKernel<<<blocks, threads, 0, stream>>>(ptr);

// Memory reusable after stream operations complete
pool.deallocate_async(ptr, stream);

// Process completed deallocations
pool.process_deferred();

Architecture

Size-Class Binning

┌─────────────────────────────────────────────────────────────┐
│                    Size Class Bins                          │
├─────────────────────────────────────────────────────────────┤
│  512B  │  1KB  │  2KB  │  4KB  │ ... │  1GB  │  2GB        │
│  ┌──┐  │ ┌──┐  │ ┌──┐  │ ┌──┐  │     │ ┌──┐  │ ┌──┐        │
│  │░░│  │ │░░│  │ │  │  │ │░░│  │     │ │  │  │ │  │        │
│  │░░│  │ │  │  │ │  │  │ │  │  │     │ │  │  │ │  │        │
│  └──┘  │ └──┘  │ └──┘  │ └──┘  │     │ └──┘  │ └──┘        │
│  Free  │ Free  │ Free  │ Free  │     │ Free  │ Free        │
│  list  │ list  │ list  │ list  │     │ list  │ list        │
└─────────────────────────────────────────────────────────────┘

░░ = Cached blocks available for reuse

Allocation flow:

allocate(size)
    │
    ├─► Round up to size class (power of 2)
    │
    ├─► Check size class free list
    │      │
    │      ├─► Has free block? → Return cached block (~1μs)
    │      │
    │      └─► Empty? → cudaMalloc new block (~500μs)
    │
    └─► Return pointer

Deallocation flow:

deallocate(ptr)
    │
    ├─► Look up block metadata
    │
    ├─► Mark block as free
    │
    └─► Add to size class free list (no cudaFree!)

Memory Layout

┌─────────────────────────────────────────────────────────────┐
│                      CudaMemPool                            │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────┐   │
│  │  all_blocks_: HashMap<ptr, Block>                   │   │
│  │    ┌────────────────────────────────────────────┐   │   │
│  │    │ ptr: 0x7f00000000                          │   │   │
│  │    │ size: 1048576                              │   │   │
│  │    │ requested: 1000000                         │   │   │
│  │    │ stream: 0x0                                │   │   │
│  │    │ in_use: false                              │   │   │
│  │    └────────────────────────────────────────────┘   │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  size_classes_: Vec<SizeClass>                      │   │
│  │    [0]: 512B   → [Block*, Block*]                   │   │
│  │    [1]: 1KB    → [Block*]                           │   │
│  │    [2]: 2KB    → []                                 │   │
│  │    ...                                              │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Stream-Ordered Semantics

For async operations, we track completion with CUDA events:

Timeline:
────────────────────────────────────────────────────────►
    │             │                │
    │ allocate()  │ kernel()       │ deallocate_async()
    │     │       │     │          │        │
    ▼     ▼       ▼     ▼          ▼        ▼
Stream: [alloc]──[kernel]──────────[record event]
                                           │
                                           ▼
                         process_deferred() checks event
                                           │
                                   Event complete?
                                    /           \
                                  No            Yes
                                   │             │
                             Keep waiting    Return to cache

Design Decisions

Why Power-of-2 Size Classes?

// Fast: single AND operation
size_t rounded = next_power_of_2(size);

// Slow: division
size_t rounded = ((size + block_size - 1) / block_size) * block_size;

Trade-off: Up to 2x internal fragmentation, but O(1) size class lookup.

Why Not Use CUDA's Built-in Pool (cudaMallocAsync)?

Feature	cudaMallocAsync	Our MemPool
CUDA version	11.2+	Any
Async support	Native	Via events
Custom sizing	Limited	Full control
Cross-stream	Complex	Simple
Learning value	Black box	Educational

For production, consider cudaMallocAsync. This implementation is valuable for:

Understanding allocator internals
Older CUDA versions
Custom allocation policies

Why Mutex Per Size Class?

Reduces lock contention when threads allocate different sizes:

Thread 1: allocate(1KB)  → locks size_class[1]
Thread 2: allocate(4KB)  → locks size_class[3]  // No contention!
Thread 3: allocate(1KB)  → waits for size_class[1]

Internal vs External Fragmentation

Type	Description	Our Approach
Internal	Wasted space within allocation	Up to 2x (power-of-2 rounding)
External	Unusable gaps between allocations	None (size-class isolation)

We accept internal fragmentation to eliminate external fragmentation.

Benchmarks

Build and run:

mkdir build && cd build
cmake ..
make

# Run tests
./test_mempool

# Run benchmarks
./benchmark_mempool

Example Output

========================================
ALLOCATION LATENCY BY SIZE
========================================
Benchmark                     Mean (us)   Min (us)    p50 (us)    p99 (us)
---------------------------------------------------------------------------
cudaMalloc 1 MB               521.34      498.21      518.45      612.33
MemPool (cached) 1 MB         1.12        0.89        1.08        2.14

========================================
THROUGHPUT COMPARISON
========================================
Pattern: Allocate/Free 1MB x 10000 ops

cudaMalloc/Free               521.34 us/op     1.92e+03 ops/sec
MemPool (cached)              1.11 us/op       9.01e+05 ops/sec

========================================
MEMORY REUSE EFFICIENCY
========================================
First 1000 allocations (cold): 498.21 us/op
Second 1000 allocations (warm): 1.08 us/op

Speedup from caching: 461.3x

Comparison with Production Allocators

Allocator	Used By	Key Features
This MemPool	Educational	Simple, transparent, well-documented
PyTorch CUDACachingAllocator	PyTorch	Block splitting, coalescing, garbage collection
TensorFlow BFC	TensorFlow	Best-fit with coalescing
cuMemPool	CUDA 11.2+	Native async, physical memory pools

Files

cuda-mempool/
├── include/
│   └── cuda_mempool.hpp       # Header-only implementation
├── tests/
│   ├── test_mempool.cu        # Unit tests
│   └── benchmark_mempool.cu   # Comprehensive benchmarks
├── CMakeLists.txt
└── README.md

Requirements

CUDA Toolkit 11.0+
CMake 3.18+
C++17 compiler

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CUDA Memory Pool

Overview

Key Results

Allocation Latency by Size

Throughput

Latency Percentiles (1MB allocation, cached)

Usage

Basic Allocation

RAII Wrapper

Stream-Ordered Allocation

Architecture

Size-Class Binning

Memory Layout

Stream-Ordered Semantics

Design Decisions

Why Power-of-2 Size Classes?

Why Not Use CUDA's Built-in Pool (cudaMallocAsync)?

Why Mutex Per Size Class?

Internal vs External Fragmentation

Benchmarks

Example Output

Comparison with Production Allocators

Files

Requirements

References

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
include		include
tests		tests
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md

rtj1/cuda-mempool

Folders and files

Latest commit

History

Repository files navigation

CUDA Memory Pool

Overview

Key Results

Allocation Latency by Size

Throughput

Latency Percentiles (1MB allocation, cached)

Usage

Basic Allocation

RAII Wrapper

Stream-Ordered Allocation

Architecture

Size-Class Binning

Memory Layout

Stream-Ordered Semantics

Design Decisions

Why Power-of-2 Size Classes?

Why Not Use CUDA's Built-in Pool (cudaMallocAsync)?

Why Mutex Per Size Class?

Internal vs External Fragmentation

Benchmarks

Example Output

Comparison with Production Allocators

Files

Requirements

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages