A custom implementation of a vLLM-style inference engine with attention mechanism benchmarks, based on Nano-vLLM but with self-contained paged attention and flash attention implementations.
Benchmarks are provided for flash attention in the prefilling phase and for paged attention in the decoding phase.
New to vLLM? Check out HowToApproachvLLM.md for a step-by-step implementation guide covering layers, models, paged attention, CUDA graphs, and scheduling.
# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# Sync dependencies
uv sync
# Run the main inference engine
uv run python main.py
# Run prefilling benchmark
uv run python benchmark_prefilling.py
# Run decoding benchmark
uv run python benchmark_decoding.py

uv run python main.py
This is the main inference engine demo.
Demonstrates the complete LLM inference pipeline using a custom engine implementation:
- Creates a small version of Qwen3 with random initialization
- Creates 60 chat prompts (2 base prompts repeated 30 times each)
- Processes them through the custom LLM engine with batch processing
- Uses paged attention and KV cache management for efficient inference
- Generates up to 256 tokens per prompt with temperature sampling
This showcases how the custom vLLM implementation handles batched text generation with memory-efficient attention.
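To make the flow concrete, here is a minimal driver sketch of that pipeline. The class names, module paths, and constructor arguments (`LLMEngine`, `myvllm.engine`, `random_init`) are assumptions for illustration and may not match the actual API in main.py; only the sampling options are suggested by `sampling_parameters.py` in the project structure below.

```python
# Hypothetical driver sketch; class names and module paths are assumptions,
# not the actual myvllm API. See main.py for the real entry point.
from myvllm.engine import LLMEngine                      # assumed import path
from myvllm.sampling_parameters import SamplingParams    # assumed class name

# 2 base prompts repeated 30 times each -> 60 prompts, as described above
base_prompts = ["Explain paged attention.", "Summarize the KV-cache design."]
prompts = base_prompts * 30

engine = LLMEngine(model="qwen3-small", random_init=True)   # small random Qwen3 (assumed args)
params = SamplingParams(temperature=0.8, max_tokens=256)    # temperature sampling, 256 new tokens

# The engine batches the sequences, schedules prefill/decode iterations,
# and manages paged KV-cache blocks internally.
for text in engine.generate(prompts, params):
    print(text)
```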
uv run python benchmark_prefilling.py
This is the prefilling phase comparison.
Compares three attention implementations during the prefilling phase (processing input prompts):
- PyTorch Standard (O(N²) memory): Traditional attention that materializes full attention matrix
- Naive Triton (O(N²) memory): GPU kernel that also uses O(N²) memory, limited by shared memory constraints (≤128 tokens)
- Flash Attention (O(N) memory): Memory-efficient online softmax algorithm that processes attention in blocks
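The key idea behind the flash attention variant is the online softmax: keys and values are visited in blocks while a running row maximum and normalizer are maintained, so the full N×N score matrix is never stored. Below is a minimal single-head PyTorch reference of that idea (causal masking omitted for brevity); it illustrates the algorithm and is not the Triton kernel used by the benchmark.

```python
# Reference implementation of online-softmax attention for one head.
# Memory is O(N * block_size) for scores instead of O(N^2).
import torch

def flash_attention_reference(q, k, v, block_size=64):
    # q, k, v: (N, d) for a single head
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"), device=q.device, dtype=q.dtype)
    row_sum = torch.zeros((n, 1), device=q.device, dtype=q.dtype)

    for start in range(0, n, block_size):
        kb = k[start:start + block_size]               # (B, d) key block
        vb = v[start:start + block_size]               # (B, d) value block
        scores = (q @ kb.T) * scale                    # (N, B) score block
        block_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, block_max)
        # Rescale the previously accumulated output and normalizer to the new max
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)                # (N, B) unnormalized probs
        out = out * correction + p @ vb
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max

    return out / row_sum
```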
uv run python benchmark_decoding.py
This is the decoding phase comparison.
Compares three implementations during the decoding phase (generating output tokens one at a time):
- Naive PyTorch: Simple loop-based implementation using paged KV cache
- Optimized PyTorch: Vectorized implementation with batch gathering and masking
- Triton Kernel: Custom GPU kernel optimized for paged attention decode
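For reference, here is a hedged PyTorch sketch of the loop-based decode variant, assuming a `(num_blocks, block_size, head_dim)` cache layout and a per-sequence block table; the actual tensor layouts and function signatures in myvllm may differ.

```python
# Illustrative loop-based paged-attention decode for one head.
# Assumed layouts: k_cache/v_cache are (num_blocks, block_size, d),
# block_tables maps each sequence to its physical block ids.
import torch

def paged_decode_naive(q, k_cache, v_cache, block_tables, context_lens, block_size=16):
    # q: (batch, d) -- one new query token per sequence
    # block_tables: (batch, max_blocks), context_lens: (batch,)
    batch, d = q.shape
    scale = d ** -0.5
    out = torch.empty_like(q)
    for i in range(batch):
        n = int(context_lens[i])
        num_blocks = (n + block_size - 1) // block_size
        blocks = block_tables[i, :num_blocks]
        # Gather this sequence's keys/values from scattered physical blocks
        keys = k_cache[blocks].reshape(-1, d)[:n]      # (n, d)
        vals = v_cache[blocks].reshape(-1, d)[:n]      # (n, d)
        scores = (keys @ q[i]) * scale                 # (n,)
        probs = torch.softmax(scores, dim=-1)
        out[i] = probs @ vals
    return out
```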
myvllm/
├── src/
│ └── myvllm/ # Core vLLM implementation
│ ├── models/ # Model implementations
│ ├── engine/ # LLM engine logic: sequence definition for input prompts, block management for the GPU KV cache, iteration-based scheduling of sequences, a runner that executes prefilling and decoding, and the engine exposing the generation API (see the block-manager sketch after this tree)
│ ├── layers/ # Layer components used by models/
│ ├── utils/ # Context utilities shared across the engine
│ └── sampling_parameters.py
├── main.py # Full inference demo
├── benchmark_prefilling.py # Prefilling attention comparison
└── benchmark_decoding.py # Decoding attention comparison
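As referenced in the `engine/` entry above, here is a minimal block-manager sketch showing how fixed-size KV-cache blocks could be allocated to sequences. It illustrates the idea only; the names and behavior of the actual code under `engine/` are assumptions here.

```python
# Minimal sketch of paged KV-cache block bookkeeping (illustrative, not the
# actual engine/ implementation).
class BlockManager:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # available physical block ids
        self.block_tables = {}                       # seq_id -> list of block ids

    def can_allocate(self, num_tokens: int) -> bool:
        needed = -(-num_tokens // self.block_size)   # ceil division
        return len(self.free_blocks) >= needed

    def allocate(self, seq_id: int, num_tokens: int) -> list[int]:
        needed = -(-num_tokens // self.block_size)
        blocks = [self.free_blocks.pop() for _ in range(needed)]
        self.block_tables[seq_id] = blocks
        return blocks

    def free(self, seq_id: int) -> None:
        # Return a finished sequence's blocks to the free pool
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```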
- Python ≥3.11, <3.12
- CUDA-capable GPU
- Dependencies:
transformers, torch, xxhash (managed by uv)