MyVLLM

A custom implementation of the vLLM inference engine with attention-mechanism benchmarks, based on Nano-vLLM but with self-contained paged attention and flash attention implementations.

Benchmarks are provided for flash attention during the prefilling phase and paged attention during the decoding phase.

New to vLLM? Check out HowToApproachvLLM.md for a step-by-step implementation guide covering layers, models, paged attention, CUDA graphs, and scheduling.

Quickstart

# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Sync dependencies
uv sync

# Run the main inference engine
uv run python main.py

# Run prefilling benchmark
uv run python benchmark_prefilling.py

# Run decoding benchmark
uv run python benchmark_decoding.py

What Each Script Does

uv run python main.py

This is the main inference engine demo

Demonstrates the complete LLM inference pipeline using a custom engine implementation:

  • Creates a small version of Qwen3 with randomly initialized weights
  • Creates 60 chat prompts (2 base prompts repeated 30 times each)
  • Processes them through the custom LLM engine with batch processing
  • Uses paged attention and KV cache management for efficient inference
  • Generates up to 256 tokens per prompt with temperature sampling

This showcases how the custom vLLM implementation handles batched text generation with memory-efficient attention.
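
A minimal usage sketch of this pipeline, assuming a vLLM-style API. The names below (LLM, SamplingParams, generate, and the model argument) are assumptions for illustration, not necessarily this repository's exact interface:

from myvllm import LLM, SamplingParams  # assumed entry points under src/myvllm/

llm = LLM("path/to/qwen3-small")  # hypothetical argument; main.py uses a randomly initialized small Qwen3
params = SamplingParams(temperature=0.8, max_tokens=256)  # temperature sampling, up to 256 tokens

base_prompts = ["Explain paged attention.", "What is a KV cache?"]
prompts = base_prompts * 30  # 60 prompts total, batched by the engine

outputs = llm.generate(prompts, params)  # scheduler batches prefill and decode internally
for text in outputs:
    print(text)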

uv run python benchmark_prefilling.py

This is the prefilling phase comparison

Compares three attention implementations during the prefilling phase (processing input prompts):

  1. PyTorch Standard (O(N²) memory): Traditional attention that materializes the full attention matrix
  2. Naive Triton (O(N²) memory): GPU kernel that also materializes the full score matrix, so it is limited by shared-memory constraints (≤128 tokens)
  3. Flash Attention (O(N) memory): Memory-efficient online-softmax algorithm that processes attention in blocks (sketched below)
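
To make the flash attention entry concrete, here is a simplified, non-causal PyTorch sketch of the online-softmax idea (illustrative only; the repository's actual prefill kernel is written in Triton, and the function name, shapes, and block size below are assumptions):

import torch

def flash_attention_prefill(q, k, v, block_size=128):
    # q, k, v: (seq_len, head_dim). K/V are processed one block at a time,
    # so the full (seq_len x seq_len) score matrix is never materialized.
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros(seq_len, head_dim, device=q.device, dtype=torch.float32)
    row_max = torch.full((seq_len, 1), float("-inf"), device=q.device, dtype=torch.float32)
    row_sum = torch.zeros((seq_len, 1), device=q.device, dtype=torch.float32)

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T).float() * scale             # scores for one block only
        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)
        correction = torch.exp(row_max - new_max)           # rescale previous accumulators
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk.float()
        row_max = new_max

    return (out / row_sum).to(q.dtype)
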
uv run python benchmark_decoding.py

This is the decoding phase comparison

Compares three implementations during the decoding phase (generating output tokens one at a time):

  1. Naive PyTorch: Simple loop-based implementation using a paged KV cache (see the sketch after this list)
  2. Optimized PyTorch: Vectorized implementation with batch gathering and masking
  3. Triton Kernel: Custom GPU kernel optimized for paged attention decode
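
For reference, the naive loop-based approach looks roughly like the sketch below: a single query token attends over KV blocks gathered via a block table. Cache layout, shapes, and names are assumptions for illustration, not the repository's exact kernels:

import torch

def paged_decode_attention(q, k_cache, v_cache, block_table, seq_len, block_size=16):
    # q: (num_heads, head_dim) -- query for the newly generated token.
    # k_cache / v_cache: (num_blocks, block_size, num_heads, head_dim) physical pages.
    # block_table: physical block ids for this sequence, in logical order.
    num_heads, head_dim = q.shape
    scale = head_dim ** -0.5

    keys, values = [], []
    for logical_idx, block_id in enumerate(block_table):
        n = min(block_size, seq_len - logical_idx * block_size)  # tokens stored in this page
        if n <= 0:
            break
        keys.append(k_cache[block_id, :n])
        values.append(v_cache[block_id, :n])
    k = torch.cat(keys, dim=0)   # (seq_len, num_heads, head_dim)
    v = torch.cat(values, dim=0)

    scores = torch.einsum("hd,shd->hs", q, k) * scale  # one query position per head
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hs,shd->hd", probs, v)        # (num_heads, head_dim)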

Project Structure

myvllm/
├── src/
│   └── myvllm/           # Core vLLM implementation
│       ├── models/       # Model implementations
│       ├── engine/       # Engine logic: Sequence (input prompt state), KV-cache block manager for GPU memory, iteration-based scheduler, model runner (prefill and decode), and the generation API
│       ├── layers/       # Layer components used by models/
│       ├── utils/        # Context utilities
│       └── sampling_parameters.py
├── main.py              # Full inference demo
├── benchmark_prefilling.py   # Prefilling attention comparison
└── benchmark_decoding.py     # Decoding attention comparison
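
A conceptual sketch of how the engine/ pieces fit together in the iteration-based loop. Method and attribute names here (scheduler, block_manager, runner, and their calls) are illustrative assumptions, not the repository's exact classes:

def generation_loop(engine):
    # Each iteration: schedule a batch, make sure its KV-cache blocks exist,
    # run one forward pass (prefill or decode), and append the sampled tokens.
    while engine.has_unfinished_sequences():
        batch, is_prefill = engine.scheduler.schedule()
        for seq in batch:
            engine.block_manager.ensure_blocks(seq)
        new_token_ids = engine.runner.run(batch, is_prefill)
        for seq, token_id in zip(batch, new_token_ids):
            seq.append_token(token_id)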

Requirements

  • Python ≥3.11, <3.12
  • CUDA-capable GPU
  • Dependencies: transformers, torch, xxhash (managed by uv)

About

Based on Nano-vLLM, a simple replication of vLLM with self-contained paged attention and flash attention implementations.
