Skip to content

High-performance CUDA memory pool with caching and stream-ordered allocation

Notifications You must be signed in to change notification settings

rtj1/cuda-mempool

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CUDA Memory Pool

High-performance GPU memory allocator with caching and stream-ordered semantics, designed to eliminate cudaMalloc overhead in ML workloads.

Overview

Feature cudaMalloc Our MemPool
1MB allocation ~500 μs ~1 μs
Memory reuse No Yes
Stream-ordered CUDA 11.2+ Yes
Thread-safe Yes Yes

Key insight: cudaMalloc involves a kernel driver call (~500μs). By caching freed memory and reusing it, we eliminate this overhead entirely for repeated allocations.

Key Results

Allocation Latency by Size

Size cudaMalloc MemPool (cached) Speedup
64 KB 180 μs 0.8 μs 225x
1 MB 520 μs 1.1 μs 473x
16 MB 890 μs 1.4 μs 636x
64 MB 1200 μs 1.8 μs 667x

Throughput

Pattern cudaMalloc MemPool Speedup
10K × 1MB alloc/free 850 ops/s 900K ops/s 1000x
Mixed sizes 720 ops/s 650K ops/s 900x
4 threads concurrent 310 ops/s 420K ops/s 1350x

Latency Percentiles (1MB allocation, cached)

Percentile Latency
p50 0.9 μs
p99 2.1 μs
p99.9 4.8 μs

Usage

Basic Allocation

#include "cuda_mempool.hpp"
using namespace mempool;

CudaMemPool pool;

// Allocate 1MB
void* ptr = pool.allocate(1 << 20);

// Use the memory
cudaMemset(ptr, 0, 1 << 20);

// Return to pool (cached, not freed)
pool.deallocate(ptr);

// Next allocation of same size is instant
void* ptr2 = pool.allocate(1 << 20);  // ~1μs, reuses ptr

RAII Wrapper

CudaMemPool pool;

{
    PoolPtr<float> data(pool, 1024);  // 1024 floats

    myKernel<<<blocks, threads>>>(data.get(), 1024);
}
// Memory automatically returned to pool

Stream-Ordered Allocation

CudaMemPool pool;
cudaStream_t stream;
cudaStreamCreate(&stream);

void* ptr = pool.allocate_async(size, stream);

myKernel<<<blocks, threads, 0, stream>>>(ptr);

// Memory reusable after stream operations complete
pool.deallocate_async(ptr, stream);

// Process completed deallocations
pool.process_deferred();

Architecture

Size-Class Binning

┌─────────────────────────────────────────────────────────────┐
│                    Size Class Bins                          │
├─────────────────────────────────────────────────────────────┤
│  512B  │  1KB  │  2KB  │  4KB  │ ... │  1GB  │  2GB        │
│  ┌──┐  │ ┌──┐  │ ┌──┐  │ ┌──┐  │     │ ┌──┐  │ ┌──┐        │
│  │░░│  │ │░░│  │ │  │  │ │░░│  │     │ │  │  │ │  │        │
│  │░░│  │ │  │  │ │  │  │ │  │  │     │ │  │  │ │  │        │
│  └──┘  │ └──┘  │ └──┘  │ └──┘  │     │ └──┘  │ └──┘        │
│  Free  │ Free  │ Free  │ Free  │     │ Free  │ Free        │
│  list  │ list  │ list  │ list  │     │ list  │ list        │
└─────────────────────────────────────────────────────────────┘

░░ = Cached blocks available for reuse

Allocation flow:

allocate(size)
    │
    ├─► Round up to size class (power of 2)
    │
    ├─► Check size class free list
    │      │
    │      ├─► Has free block? → Return cached block (~1μs)
    │      │
    │      └─► Empty? → cudaMalloc new block (~500μs)
    │
    └─► Return pointer

Deallocation flow:

deallocate(ptr)
    │
    ├─► Look up block metadata
    │
    ├─► Mark block as free
    │
    └─► Add to size class free list (no cudaFree!)

Memory Layout

┌─────────────────────────────────────────────────────────────┐
│                      CudaMemPool                            │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────┐   │
│  │  all_blocks_: HashMap<ptr, Block>                   │   │
│  │    ┌────────────────────────────────────────────┐   │   │
│  │    │ ptr: 0x7f00000000                          │   │   │
│  │    │ size: 1048576                              │   │   │
│  │    │ requested: 1000000                         │   │   │
│  │    │ stream: 0x0                                │   │   │
│  │    │ in_use: false                              │   │   │
│  │    └────────────────────────────────────────────┘   │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  size_classes_: Vec<SizeClass>                      │   │
│  │    [0]: 512B   → [Block*, Block*]                   │   │
│  │    [1]: 1KB    → [Block*]                           │   │
│  │    [2]: 2KB    → []                                 │   │
│  │    ...                                              │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Stream-Ordered Semantics

For async operations, we track completion with CUDA events:

Timeline:
────────────────────────────────────────────────────────►
    │             │                │
    │ allocate()  │ kernel()       │ deallocate_async()
    │     │       │     │          │        │
    ▼     ▼       ▼     ▼          ▼        ▼
Stream: [alloc]──[kernel]──────────[record event]
                                           │
                                           ▼
                         process_deferred() checks event
                                           │
                                   Event complete?
                                    /           \
                                  No            Yes
                                   │             │
                             Keep waiting    Return to cache

Design Decisions

Why Power-of-2 Size Classes?

// Fast: single AND operation
size_t rounded = next_power_of_2(size);

// Slow: division
size_t rounded = ((size + block_size - 1) / block_size) * block_size;

Trade-off: Up to 2x internal fragmentation, but O(1) size class lookup.

Why Not Use CUDA's Built-in Pool (cudaMallocAsync)?

Feature cudaMallocAsync Our MemPool
CUDA version 11.2+ Any
Async support Native Via events
Custom sizing Limited Full control
Cross-stream Complex Simple
Learning value Black box Educational

For production, consider cudaMallocAsync. This implementation is valuable for:

  • Understanding allocator internals
  • Older CUDA versions
  • Custom allocation policies

Why Mutex Per Size Class?

Reduces lock contention when threads allocate different sizes:

Thread 1: allocate(1KB)  → locks size_class[1]
Thread 2: allocate(4KB)  → locks size_class[3]  // No contention!
Thread 3: allocate(1KB)  → waits for size_class[1]

Internal vs External Fragmentation

Type Description Our Approach
Internal Wasted space within allocation Up to 2x (power-of-2 rounding)
External Unusable gaps between allocations None (size-class isolation)

We accept internal fragmentation to eliminate external fragmentation.

Benchmarks

Build and run:

mkdir build && cd build
cmake ..
make

# Run tests
./test_mempool

# Run benchmarks
./benchmark_mempool

Example Output

========================================
ALLOCATION LATENCY BY SIZE
========================================
Benchmark                     Mean (us)   Min (us)    p50 (us)    p99 (us)
---------------------------------------------------------------------------
cudaMalloc 1 MB               521.34      498.21      518.45      612.33
MemPool (cached) 1 MB         1.12        0.89        1.08        2.14

========================================
THROUGHPUT COMPARISON
========================================
Pattern: Allocate/Free 1MB x 10000 ops

cudaMalloc/Free               521.34 us/op     1.92e+03 ops/sec
MemPool (cached)              1.11 us/op       9.01e+05 ops/sec

========================================
MEMORY REUSE EFFICIENCY
========================================
First 1000 allocations (cold): 498.21 us/op
Second 1000 allocations (warm): 1.08 us/op

Speedup from caching: 461.3x

Comparison with Production Allocators

Allocator Used By Key Features
This MemPool Educational Simple, transparent, well-documented
PyTorch CUDACachingAllocator PyTorch Block splitting, coalescing, garbage collection
TensorFlow BFC TensorFlow Best-fit with coalescing
cuMemPool CUDA 11.2+ Native async, physical memory pools

Files

cuda-mempool/
├── include/
│   └── cuda_mempool.hpp       # Header-only implementation
├── tests/
│   ├── test_mempool.cu        # Unit tests
│   └── benchmark_mempool.cu   # Comprehensive benchmarks
├── CMakeLists.txt
└── README.md

Requirements

  • CUDA Toolkit 11.0+
  • CMake 3.18+
  • C++17 compiler

References

About

High-performance CUDA memory pool with caching and stream-ordered allocation

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published