MyVLLM

A custom implementation of the vLLM inference engine with attention-mechanism benchmarks, based on Nano-vLLM but with self-contained paged attention and flash attention implementations.

Benchmarks are provided for flash attention during the prefilling phase and paged attention during the decoding phase.

New to vLLM? Check out HowToApproachvLLM.md for a step-by-step implementation guide covering layers, models, paged attention, CUDA graphs, and scheduling.

Quickstart

# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Sync dependencies
uv sync

# Run the main inference engine
uv run python main.py

# Run prefilling benchmark
uv run python benchmark_prefilling.py

# Run decoding benchmark
uv run python benchmark_decoding.py

What Each Script Does

uv run python main.py

This is the main inference engine demo

Demonstrates the complete LLM inference pipeline using a custom engine implementation:

  • Creates a small version of Qwen3 with randomly initialized weights
  • Creates 60 chat prompts (2 base prompts repeated 30 times each)
  • Processes them through the custom LLM engine with batch processing
  • Uses paged attention and KV cache management for efficient inference
  • Generates up to 256 tokens per prompt with temperature sampling

This showcases how the custom vLLM implementation handles batched text generation with memory-efficient attention.
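
A minimal usage sketch of this pipeline, assuming a vLLM-style API. The names below (LLM, SamplingParams, generate, and the model argument) are assumptions for illustration, not necessarily this repository's exact interface:

from myvllm import LLM, SamplingParams  # assumed entry points under src/myvllm/

llm = LLM("path/to/qwen3-small")  # hypothetical argument; main.py uses a randomly initialized small Qwen3
params = SamplingParams(temperature=0.8, max_tokens=256)  # temperature sampling, up to 256 tokens

base_prompts = ["Explain paged attention.", "What is a KV cache?"]
prompts = base_prompts * 30  # 60 prompts total, batched by the engine

outputs = llm.generate(prompts, params)  # scheduler batches prefill and decode internally
for text in outputs:
    print(text)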

uv run python benchmark_prefilling.py

This is the prefilling phase comparison

Compares three attention implementations during the prefilling phase (processing input prompts):

  1. PyTorch Standard (O(N²) memory): Traditional attention that materializes the full attention matrix
  2. Naive Triton (O(N²) memory): GPU kernel that also materializes the full score matrix, so it is limited by shared-memory constraints (≤128 tokens)
  3. Flash Attention (O(N) memory): Memory-efficient online-softmax algorithm that processes attention in blocks (sketched below)
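
To make the flash attention entry concrete, here is a simplified, non-causal PyTorch sketch of the online-softmax idea (illustrative only; the repository's actual prefill kernel is written in Triton, and the function name, shapes, and block size below are assumptions):

import torch

def flash_attention_prefill(q, k, v, block_size=128):
    # q, k, v: (seq_len, head_dim). K/V are processed one block at a time,
    # so the full (seq_len x seq_len) score matrix is never materialized.
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.zeros(seq_len, head_dim, device=q.device, dtype=torch.float32)
    row_max = torch.full((seq_len, 1), float("-inf"), device=q.device, dtype=torch.float32)
    row_sum = torch.zeros((seq_len, 1), device=q.device, dtype=torch.float32)

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T).float() * scale             # scores for one block only
        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)
        correction = torch.exp(row_max - new_max)           # rescale previous accumulators
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk.float()
        row_max = new_max

    return (out / row_sum).to(q.dtype)
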
uv run python benchmark_decoding.py

This is the decoding phase comparison

Compares three implementations during the decoding phase (generating output tokens one at a time):

  1. Naive PyTorch: Simple loop-based implementation using a paged KV cache (see the sketch after this list)
  2. Optimized PyTorch: Vectorized implementation with batch gathering and masking
  3. Triton Kernel: Custom GPU kernel optimized for paged attention decode
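
For reference, the naive loop-based approach looks roughly like the sketch below: a single query token attends over KV blocks gathered via a block table. Cache layout, shapes, and names are assumptions for illustration, not the repository's exact kernels:

import torch

def paged_decode_attention(q, k_cache, v_cache, block_table, seq_len, block_size=16):
    # q: (num_heads, head_dim) -- query for the newly generated token.
    # k_cache / v_cache: (num_blocks, block_size, num_heads, head_dim) physical pages.
    # block_table: physical block ids for this sequence, in logical order.
    num_heads, head_dim = q.shape
    scale = head_dim ** -0.5

    keys, values = [], []
    for logical_idx, block_id in enumerate(block_table):
        n = min(block_size, seq_len - logical_idx * block_size)  # tokens stored in this page
        if n <= 0:
            break
        keys.append(k_cache[block_id, :n])
        values.append(v_cache[block_id, :n])
    k = torch.cat(keys, dim=0)   # (seq_len, num_heads, head_dim)
    v = torch.cat(values, dim=0)

    scores = torch.einsum("hd,shd->hs", q, k) * scale  # one query position per head
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hs,shd->hd", probs, v)        # (num_heads, head_dim)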

Project Structure

myvllm/
├── src/
│   └── myvllm/           # Core vLLM implementation
│       ├── models/       # Model implementations
│       ├── engine/       # Engine logic: Sequence (input prompt state), KV-cache block manager for GPU memory, iteration-based scheduler, model runner (prefill and decode), and the generation API
│       ├── layers/       # Layer components used by models/
│       ├── utils/        # Context utilities
│       └── sampling_parameters.py
├── main.py              # Full inference demo
├── benchmark_prefilling.py   # Prefilling attention comparison
└── benchmark_decoding.py     # Decoding attention comparison
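
A conceptual sketch of how the engine/ pieces fit together in the iteration-based loop. Method and attribute names here (scheduler, block_manager, runner, and their calls) are illustrative assumptions, not the repository's exact classes:

def generation_loop(engine):
    # Each iteration: schedule a batch, make sure its KV-cache blocks exist,
    # run one forward pass (prefill or decode), and append the sampled tokens.
    while engine.has_unfinished_sequences():
        batch, is_prefill = engine.scheduler.schedule()
        for seq in batch:
            engine.block_manager.ensure_blocks(seq)
        new_token_ids = engine.runner.run(batch, is_prefill)
        for seq, token_id in zip(batch, new_token_ids):
            seq.append_token(token_id)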

Requirements

  • Python ≥3.11, <3.12
  • CUDA-capable GPU
  • Dependencies: transformers, torch, xxhash (managed by uv)

About

Based on Nano-vLLM, a simple replication of vLLM with self-contained paged attention and flash attention implementations.
