0.15.1 by qdanik · Pull Request #5 · qdanik/vllm

qdanik · 2026-02-26T02:32:28Z

Summary

This PR implemented PoC support for vLLM 0.15.1 and integrated the PoC flow into the server, runtime, worker, and validation layers.

What Was Implemented

Updated the PoC server surface, request handling, queueing, callback delivery, validation, and state management.
Refactored the PoC consensus and transform modules required for artifact generation and verification.
Integrated PoC execution into the v1 engine through the collective RPC and worker paths so PoC requests could be processed by the runtime.

Outcome

PoC support was fully introduced into the 0.15.1 codebase with end-to-end server and runtime integration for continuous generation and artifact handling.

- Add vllm/poc/ module with PoCConfig, PoCState, ProofBatch, ValidatedBatch
- Add tests/poc/test_data.py with 12 unit tests
- ValidatedBatch uses scipy binomtest for fraud detection
- Update rules.md with phase completion workflow
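The fraud-detection bullet above refers to a binomial test. As a rough illustration of the idea only, the sketch below computes a one-sided binomial tail probability with the standard library instead of scipy. The function names, the notion of counting "mismatches", the expected rate, and the significance threshold are all assumptions made for this example and are not taken from the PR.

```python
import math


def binomial_tail_p_value(mismatches: int, total: int, expected_rate: float) -> float:
    """Return P(X >= mismatches) for X ~ Binomial(total, expected_rate)."""
    return sum(
        math.comb(total, k)
        * expected_rate ** k
        * (1.0 - expected_rate) ** (total - k)
        for k in range(mismatches, total + 1)
    )


def looks_fraudulent(mismatches: int, total: int,
                     expected_rate: float = 0.01,   # assumed honest mismatch rate
                     alpha: float = 0.001) -> bool:  # assumed significance level
    """Flag a batch when the observed mismatch count is implausibly high."""
    return binomial_tail_p_value(mismatches, total, expected_rate) < alpha
```

With scipy available, the same one-sided p-value should be obtainable from `scipy.stats.binomtest(mismatches, total, expected_rate, alternative="greater").pvalue`.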

- Add gpu_random.py with murmur3-based deterministic RNG
- Portable across all GPU architectures (pure math operations)
- Box-Muller transform for normal distribution
- Argsort-based permutations (fully parallel on GPU)
- All 6 tests passing

- Implemented deterministic GPU-native RNG using murmur3 hash
- Added generate_inputs, generate_permutations, generate_target, compute_distances
- Uses direct integer argsort for permutations (no float conversion)
- Added stable=True for CPU/GPU cross-device reproducibility
- 13 unit tests covering determinism, seed validation, cross-device matching
- Updated documentation with performance notes and design decisions
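The bullets above describe a counter-based RNG built from a murmur3 hash, Box-Muller for normal samples, and a stable argsort of integer keys for permutations. The pure-Python sketch below shows those three ideas on scalars. It uses the standard murmur3 32-bit finalizer, but the way seed and index are combined, and every function name here, are assumptions for illustration rather than the contents of gpu_random.py.

```python
import math

MASK32 = 0xFFFFFFFF


def fmix32(h: int) -> int:
    """murmur3 32-bit finalizer: pure integer math, so it is portable."""
    h &= MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def hash_key(seed: int, index: int) -> int:
    """Assumed mixing scheme: hash the index, fold in the seed, hash again."""
    return fmix32((seed & MASK32) ^ fmix32(index + 0x9E3779B9))


def uniform(seed: int, index: int) -> float:
    """Deterministic float strictly inside (0, 1)."""
    return (hash_key(seed, index) + 0.5) / 2 ** 32


def normal(seed: int, index: int) -> float:
    """Box-Muller transform over two independent uniform draws."""
    u1 = uniform(seed, 2 * index)
    u2 = uniform(seed, 2 * index + 1)
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def permutation(seed: int, n: int) -> list:
    """Argsort of integer hash keys; Python's sort is stable, so ties are deterministic."""
    keys = [hash_key(seed, i) for i in range(n)]
    return sorted(range(n), key=lambda i: keys[i])
```

Because every value depends only on `(seed, index)`, each element can be computed independently, which is what makes this style of generator easy to parallelize and to reproduce across devices.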

- Add PoCManager for orchestrating PoC rounds with vLLM model
- Add PoCCallbackSender for async callback delivery with retry logic
- Add _create_prefill_attn_metadata() using PAD_SLOT_ID for forward context
- Manager accepts vllm_config for set_forward_context integration
- State machine: IDLE -> GENERATING -> VALIDATING -> STOPPED
- 20 unit tests for state transitions, nonce generation, stats
- Smoke test validates run_batch/validate with real Qwen3-0.6B model
- TODO: Multistep inference with proper KV cache allocation
- TODO: vLLM v1 API compatibility
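The state machine named in the list above can be pictured as a small transition table. The sketch below encodes only the linear chain written in the commit message; the class name `RoundState`, the helper `advance`, and the choice to reject every other transition are assumptions for this example, and the real manager may well permit more (for instance stopping from any state).

```python
from enum import Enum


class RoundState(Enum):  # hypothetical name, states taken from the commit message
    IDLE = "IDLE"
    GENERATING = "GENERATING"
    VALIDATING = "VALIDATING"
    STOPPED = "STOPPED"


# Only the chain IDLE -> GENERATING -> VALIDATING -> STOPPED is modeled here.
_NEXT = {
    RoundState.IDLE: RoundState.GENERATING,
    RoundState.GENERATING: RoundState.VALIDATING,
    RoundState.VALIDATING: RoundState.STOPPED,
}


def advance(state: RoundState) -> RoundState:
    """Return the successor state, or raise if the state is terminal."""
    if state not in _NEXT:
        raise ValueError(f"no transition out of {state.name}")
    return _NEXT[state]
```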


Implements PoC API endpoints with proper distributed execution:

API Endpoints:
- POST /api/v1/pow/init - Initialize round
- POST /api/v1/pow/init/generate - Init + start generating
- POST /api/v1/pow/init/validate - Init + start validating
- POST /api/v1/pow/phase/generate - Switch to generate
- POST /api/v1/pow/phase/validate - Switch to validate
- POST /api/v1/pow/stop - Stop round
- GET /api/v1/pow/status - Get status
- POST /api/v1/pow/validate - Validate nonces

Key Changes:
- Add worker_ops.py with PP coordination patterns
- Update PoCManager to use model_executor.collective_rpc()
- Add RPCPoCRequest/Response for multiprocessing mode
- Add --enable-poc CLI flag
- Add poc_request() to EngineClient protocol

TP/PP Support:
- Deterministic RNG on each worker (no tensor serialization)
- PP coordination follows Worker.execute_model() patterns
- Works with default multiprocessing mode

Tests:
- 67 unit tests passing
- Smoke test with real model passing
- ~380 nonces/sec with Qwen3-0.6B

Key changes:
- Remove PoCCallbackSender class (vllm/poc/sender.py deleted)
- Move callback logic to routes.py with asyncio.Queue for decoupling
- Add run_batch_with_state() for efficient single-RPC generation loop
- Fix validation flow: queue_validation now returns results with fraud detection
- Update manager.validate() to track stats and detect fraud
- Fix /validate endpoint to send callback with validation results
- Update callback receiver script with /generated and /validated endpoints
- Update docs: README.md (inference blocking deferred to Phase 6)
- Update docs: phase-3-manager.md (correct collective_rpc approach)
- Update tests for new API behavior

E2E tested: generation → callbacks → validation → fraud detection

Key fixes:
- Fix per-app callback state (replaced global _callback_url)
- Remove dead validate action from engine.py
- Add time-based progress logging (5s interval)
- Use vLLM logger for consistent format
- Add r_target to status/callback responses
- Add computed_distances to validate response

E2E testing:
- New poc_e2e_test.py for full integration testing
- Tests generation, restart validation, fraud detection
- Supports multiple models (qwen, llama)
- Model-specific r_target calibration (empirical)

Bug fix: seq_len must match between generation and validation (default 256)

Experiments to measure actual distance distributions vs theoretical:

Results:
- Theoretical (random weights): mean=1.4142, p10=1.4119
- Trained Qwen: mean=1.2155, p10=1.1739 (16.9% compressed)
- Trained Llama: mean=1.4486, p10=1.3796 (2.3% compressed)
- Random Qwen (same arch): mean=1.4138, p10=1.4117 (matches theory)

Key finding: Trained models have structure that compresses distances. Qwen is highly compressed, Llama is close to random.

Updated r_target for 10% valid rate:
- Qwen: 1.174 (was 1.145)
- Llama: 1.380 (was 1.35)

New scripts:
- poc_distribution_experiment.py: Collect full distance distribution
- poc_random_qwen_experiment.py: Compare random vs trained weights
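The theoretical mean of 1.4142 is numerically the square root of 2. One way such a value can arise, assuming (the commit does not state this) that output and target behave like independent, uniformly distributed unit vectors x and y in d dimensions, is the following:

```latex
\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\,x \cdot y = 2 - 2\,x \cdot y,
\qquad
\mathbb{E}[x \cdot y] = 0,
\qquad
\operatorname{Var}(x \cdot y) = \frac{1}{d}
\;\Longrightarrow\;
\|x - y\| \approx \sqrt{2} \approx 1.4142 \quad \text{for large } d.
```

Under that assumption the distance concentrates tightly around the square root of 2 as d grows, so a measured mean well below it (as reported for trained Qwen) would indicate vectors that are not uniformly spread over the sphere.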

PROBLEM:
- Trained models caused inconsistent distribution across block_hashes
- Cross-block spread was 49-54% (e.g., block_alpha 88%, block_beta 50%)
- No single r_target worked for all block_hashes

SOLUTION:
1. Per-layer normalization in layer_hooks.py
   - Normalizes hidden states to unit sphere at each layer
   - Breaks structure accumulation (orthogonal transforms preserve structure)
2. Random lm_head with POC_OUTPUT_DIM=8192 in worker_ops.py
   - Replaces trained lm_head (vocab_size ~150K) with random projection
   - 18x memory savings
   - Deterministic, seeded by block_hash

RESULTS:
- Cross-block spread: 2-3.5% (was 49-54%)
  - Qwen: 3.5% spread
  - Llama: 2.0% spread
- Model-agnostic r_target: ~1.405 for 10% valid rate

Updated documentation:
- phase4.2-distribution-analysis.md: Status SOLVED
- phase4.1-randomization.md: Layer hooks work, added normalization
- phase-4-final.md: Updated r_target calibration
- README.md: Updated distribution note
- testing-guide.md: Updated empirical values

Code cleanup:
- Remove unused generate_orthogonal_matrix, apply_orthogonal_transform from gpu_random.py
- Remove unused LayerSignsHook, _generate_signs from layer_hooks.py
- Update poc_e2e_test.py to use consistent r_target=1.405 for both models
- Fix test_manager.py to account for poc_setup_layer_hooks call during init_round

Documentation:
- Update phase-3-final.md: clarify callback logic moved inline to routes.py
- Update phase-3-manager.md: remove sender.py references, note simplification
- Update phase-4-integration.md: correct file structure (add layer_hooks.py)
- Update phase-6-optional.md: add section 6.3 for callback retry queue TODO

All tests pass:
- pytest tests/poc/ -v: 25 passed
- poc_smoke_test.py: all checks pass
- poc_e2e_test.py: both models pass
- poc_full_e2e_test.py: 18/18 tests pass

- Add sign flips transform as lightweight alternative to per-layer hooks
- Remove experimental flags (normalize_first, use_random_lm_head)
- Both approaches achieve <5% cross-block spread in E2E tests
- Clean up experiment scripts, keep core test suite
- Streamline phase4.3 documentation

E2E Test Results (36 tests total):
- Sign Flips: Qwen 1.1%, Llama 0.9% cross-block spread ✓
- Layer Hooks: Qwen 3.9%, Llama 10.3% cross-block spread

Changes:
- Fix normalize-first order in worker_ops.py (consistency with layer_hooks.py)
- Set Sign Flips as default in config.py
- Update phase4.3 documentation with accurate results
- Add phase4-final-results.md summary
- Include test logs in planning-poc-v2/test-results/

…l transform

Changes:
- worker_ops.py: Remove all transformation flags, use single pipeline (k=64 dim pick + Haar rotate)
- config.py: Remove experiment flags (use_layer_hooks, use_sign_flips, etc.)
- manager.py: Remove artifact handling, simplify run_batch_with_state
- routes.py: Remove experiment flags from API, improve logging format
- gpu_random.py: generate_target now requires public_key parameter
- engine handlers: Remove return_inputs/return_outputs params

Deleted:
- scripts/poc_e2e_test.py (outdated)
- scripts/poc_full_e2e_test.py (outdated)
- logs/e2e_results.jsonl (stale)

Added:
- scripts/poc_e2e_simple_test.py (simplified E2E test)
- planning-poc-v2/clean-up.md (documents this iteration)

Pipeline: input -> forward -> normalize -> pick k dims -> Haar rotate -> distance to k-dim target
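The last four stages of the pipeline line above (normalize, pick k dims, Haar rotate, distance to a k-dim target) can be sketched in plain Python. This is an illustration under stated assumptions, not the code in worker_ops.py: it uses Python's `random.Random` in place of the murmur3-based generator, builds the Haar-distributed rotation by Gram-Schmidt on Gaussian rows, and all function names are hypothetical.

```python
import math
import random


def normalize(v: list) -> list:
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def haar_matrix(k: int, rng: random.Random) -> list:
    """Orthonormalize Gaussian rows (modified Gram-Schmidt) to get a Haar-distributed orthogonal matrix."""
    rows = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(k)]
        for q in rows:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        rows.append(normalize(v))
    return rows


def poc_distance(hidden: list, seed: int, k: int) -> float:
    """normalize -> pick k dims -> Haar rotate -> distance to a k-dim unit target."""
    rng = random.Random(seed)  # stand-in for the seeded murmur3 generator
    unit = normalize(hidden)
    picked = [unit[i] for i in rng.sample(range(len(unit)), k)]
    rotation = haar_matrix(k, rng)
    rotated = [sum(r * x for r, x in zip(row, picked)) for row in rotation]
    target = normalize([rng.gauss(0.0, 1.0) for _ in range(k)])
    return math.dist(rotated, target)
```

Since the rotation is orthogonal it preserves the length of the picked sub-vector, so in this sketch the distance always lies between 0 and 2.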

…2e test

Remove deprecated functions from gpu_random.py (-51% lines):
- Permutation-based pipeline functions
- Convenience wrappers moved to scripts/poc_distribution/

Rewrite poc_e2e_test.py with 3x3 seed matrix:
- Tests 9 seed combinations per model without server restart
- Saves per-seed JSON with nonces for validation
- Determinism, independence, and fraud detection tests

Changes:
- gpu_random.py: keep only production pipeline functions
- test_gpu_random.py: update tests for new API
- routes.py: remove /phase/generate_manual endpoint
- Add scripts/poc_distribution/gpu_random_utils.py for analysis
- Delete collect_hidden_vectors.py, poc_e2e_simple_test.py

E2E results (100s/seed): qwen 20.8%, llama 20.5%, qwen4b 20.8% valid

Add 20 statistical tests verifying correct distributions:
- _uniform/_normal: mean, variance, KS tests
- generate_target/householder: uniform on sphere
- random_pick_indices: chi-square uniform selection
- generate_haar: columns on sphere, det ±1 balanced

Uses scipy.stats with large samples (10K+) for reliability.

Results: 44 tests total, all passing

Detect when chat inference is active or TP workers are in broadcast loop and pause PoC GPU work until engine is idle. Prevents NCCL collective mismatch deadlock in v0 distributed execution with TP>1.

Changes:
- Add _engine_step_in_progress flag to track active engine.step()
- Add _prepare_for_poc_gpu_work() helper to stop remote worker loop
- Add has_unfinished_requests() check to skip PoC when chat is active
- Add input_socket.poll() check to detect pending chat requests
- Refactor _process_poc_action() with priority-based skip logic:
  - Priority 1: Pending input (chat requests waiting in socket)
  - Priority 2: Engine step in progress (async callback path)
  - Priority 3: Chat unfinished (requests still processing)
  - Safe: Stop remote worker loop before PoC collective_rpc
- Add timeout_ms param to poc_request() client method
- Add POC_RUN_BATCH_TIMEOUT_MS env var for long inference survival
- Add POC_CHAT_BUSY_BACKOFF_SEC for skip retry backoff
- Add check_mp_engine_required() guard for in-process mode rejection
- Replace immediate return with backoff+retry in generation loop
- Add timeout recovery and exception logging in _generation_loop()
- Add retry loop with per-chunk timeout in /generate endpoint
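The priority-based skip logic listed above amounts to checking three conditions in a fixed order and running PoC work only when none holds. A minimal sketch follows; the function name and the returned reason strings are assumptions for this example, while the three conditions and their order come from the commit message.

```python
from typing import Optional


def poc_skip_reason(pending_input: bool,
                    engine_step_in_progress: bool,
                    has_unfinished_requests: bool) -> Optional[str]:
    """Return why PoC work should be skipped, or None when it is safe to run."""
    if pending_input:              # Priority 1: chat requests waiting in the socket
        return "pending_input"
    if engine_step_in_progress:    # Priority 2: an engine step is currently active
        return "engine_step_in_progress"
    if has_unfinished_requests:    # Priority 3: chat requests still processing
        return "chat_unfinished"
    return None                    # Safe: caller may stop the worker loop and run PoC
```

A caller in the generation loop would presumably sleep for the configured backoff and retry whenever a reason is returned, rather than returning immediately.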

…sync queue

Minimize vLLM core changes by consolidating PoC engine handlers and moving all state management to API layer. Add async queue for wait=false requests.

Engine changes:
- Reduce PoC actions to single "generate_artifacts" (remove init/start/stop/status/run_batch)
- Keep chat-priority checks (pending_input, engine_step_in_progress, has_unfinished_requests)
- Support both MP engine and in-process AsyncLLMEngine modes

API layer changes:
- Move nonce generation, stats tracking to routes.py
- Add GenerateJob/GenerateResult dataclasses for queue infrastructure
- Add background worker processing jobs FIFO with queue-until-idle behavior
- Add GET /generate/{request_id} polling endpoint
- Remove 409 conflict for /generate when /init/generate active (queues instead)
- Add configurable result TTL cleanup (POC_GENERATE_RESULT_TTL_SEC)

PoCManager simplification:
- Remove all stateful APIs (init_round, start_generate, stop_round, get_status, run_batch)
- Keep only generate_artifacts + _run_forward helper (now stateless)

Tests:
- Update test_routes.py with queue/polling tests
- Update test_manager.py for stateless manager
- Update test_coexist.py for generate_artifacts-only behavior

…ble callbacks

Extract queue, callbacks, and validation logic from routes.py into separate modules for clarity and testability. Add backpressure and retry mechanisms to prevent OOM and dropped callbacks.

Changes:
- Add generate_queue.py: bounded queue with 100k nonce cap, result store
- Add callbacks.py: retry-until-stop with backoff, 1M artifact cap (drops oldest)
- Add validation.py: centralized artifact comparison and fraud test
- Refactor routes.py to thin orchestrator (~330 lines from ~960)
- Make PoCParamsModel strict (extra=forbid) per Phase 1 spec
- Extend check_params_match() to validate model/seq_len/k_dim against deployed
- Update /stop to clear queue, results, and callback buffers
- Add batch processing logging with stats for /generate endpoints
- Update tests for new module structure and behaviors

…ck retry logging

Server-configurable batch size via POC_BATCH_SIZE_DEFAULT env var (default 32). Add NonceIterator with group_id/n_groups for disjoint nonce ranges across multiple groups. Add non-spammy retry logging for callbacks.

Changes:
- Add POC_BATCH_SIZE_DEFAULT env var, used as default for both endpoints
- Add NonceIterator dataclass with multi-node + multi-group nonce iteration
- Add group_id (default 0) and n_groups (default 1) to /init/generate request
- Add attempt counting and warning logs for callback retries (1st + every 10th)
- Log success info when callback succeeds after retries
- Add tests for NonceIterator (single/multi-node, multi-group, disjoint coverage)
- Add tests for batch_size defaults
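One simple way to give every (node, group) pair a disjoint nonce sequence, as the NonceIterator bullets above require, is strided iteration. The sketch below is a guess at that idea and not the PR's implementation: the class name, the `node_id`/`n_nodes` fields, and the offset formula are assumptions, while `group_id` and `n_groups` and their defaults are the names given in the commit message.

```python
from dataclasses import dataclass


@dataclass
class StridedNonceIterator:  # hypothetical stand-in for NonceIterator
    node_id: int = 0
    n_nodes: int = 1
    group_id: int = 0
    n_groups: int = 1
    _counter: int = 0

    def __iter__(self) -> "StridedNonceIterator":
        return self

    def __next__(self) -> int:
        # Each (node, group) pair owns one residue class modulo the stride.
        stride = self.n_nodes * self.n_groups
        offset = self.node_id + self.group_id * self.n_nodes
        nonce = offset + self._counter * stride
        self._counter += 1
        return nonce
```

With two nodes and three groups, for example, the six iterators produce the residue classes 0 through 5 modulo 6, so no nonce is ever generated twice and none is left uncovered.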

- /generate callback HTTP 503 errors blocked queue processing
- Callbacks retried indefinitely, preventing subsequent jobs
- No limit on concurrent callback requests

Solution:
- Introduce CallbackQueue with bounded concurrency (default: 10)
- Add max retry limit (default: 10) with exponential backoff
- Use bounded queue (default: 10000) with oldest-drop on overflow
- Share single aiohttp session for efficiency
- Non-blocking: job processing continues regardless of callback status

Files:
- vllm/poc/callbacks.py: Add CallbackQueue class, max retry constants
- vllm/poc/generate_queue.py: Integrate CallbackQueue for callback delivery
- tests/poc/test_callback_blocking.py: Tests for callback behavior

Config (env vars):
- POC_CALLBACK_MAX_RETRIES (default: 10)
- POC_CALLBACK_MAX_CONCURRENT (default: 10)
- POC_CALLBACK_QUEUE_SIZE (default: 10000)
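The combination described above (bounded concurrency, a retry cap with exponential backoff, and a bounded buffer that drops the oldest entry) can be sketched with asyncio alone. This is an illustration, not the CallbackQueue in vllm/poc/callbacks.py: the class and method names are hypothetical, the delivery function is injected instead of using aiohttp, and the backoff base delay is an assumed parameter. Only the three default limits are taken from the commit message.

```python
import asyncio
from collections import deque
from typing import Any, Awaitable, Callable, List


class BoundedCallbackQueue:  # hypothetical sketch, not the PR's CallbackQueue
    def __init__(self, send: Callable[[Any], Awaitable[None]],
                 max_retries: int = 10, max_concurrent: int = 10,
                 queue_size: int = 10000, base_delay: float = 0.5) -> None:
        self._send = send
        self._max_retries = max_retries
        self._max_concurrent = max_concurrent
        self._base_delay = base_delay  # assumed backoff base, in seconds
        self._pending: deque = deque(maxlen=queue_size)  # full deque drops the oldest item

    def put(self, payload: Any) -> None:
        """Enqueue without blocking the caller."""
        self._pending.append(payload)

    async def _deliver(self, payload: Any, sem: asyncio.Semaphore) -> bool:
        async with sem:  # cap the number of in-flight deliveries
            for attempt in range(self._max_retries):
                try:
                    await self._send(payload)
                    return True
                except Exception:
                    if attempt + 1 < self._max_retries:
                        await asyncio.sleep(self._base_delay * 2 ** attempt)
            return False  # give up after max_retries attempts

    async def drain(self) -> List[bool]:
        """Deliver everything currently queued; one success flag per payload, in order."""
        sem = asyncio.Semaphore(self._max_concurrent)
        tasks = []
        while self._pending:
            tasks.append(asyncio.create_task(self._deliver(self._pending.popleft(), sem)))
        return list(await asyncio.gather(*tasks))
```

Because `put` never awaits and failed deliveries eventually return `False`, a persistently failing callback endpoint cannot stall the producer in this sketch, which is the behavior the commit message is after.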

Implement SM120 blockwise FP8 scaled matrix multiplication kernels using CUTLASS v4.x while maintaining CUTLASS v3.9.2 for SM89/SM90/SM100 archs.

Changes:
- Add CUTLASS v4.x FetchContent for SM120 kernel compilation
- Add enable_sm120_only guard in common.hpp
- Add cutlass_3x_gemm_sm120 template using Sm120 collective builders
- Add SM120 per-tensor and blockwise FP8 kernels and dispatch logic
- Add runtime dispatch to route SM120 GPUs to dedicated kernels
- Configure CMake to build SM120 sources with v4.x includes

This enables FP8 quantization on RTX PRO 6000 / RTX 5060 Ti GPUs.

github-actions · 2026-02-26T02:32:36Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀


libermans-saturn and others added 30 commits June 18, 2025 00:18

Add enforced sampling

d85e137

Fix

6b44ce5

Pretty

8400c7f

WIP

ae3bfbb

Support full input

10ff420

Fixes

bbf9959

Quck release

ce2cdad

Fix dir

cac6fdb

poc: phase 1 infrastructure - config and data schemas

75c4bf9


Phase 2: GPU random generation with murmur3

7345895


Update rules.md: murmur3 instead of torch.Generator

b427a24

docs(poc): cleanup phase 3 - fix sender doc, add nonce simplification…

95a2be9

… note, dedupe TODOs

docs: Add full 18-test distribution experiment to testing guide

9b0eb01

Consolidate phase 4 docs: merge final results into phase-4-final.md

7962b15

orthogonal transform quick check

22ce6ce

household vs haar

2b15224

Add valid rate estimation script for K-dimensional sphere

c828f07

gmorgachev and others added 20 commits January 11, 2026 05:48

Fixes

bca7d35

Add layers

125a211

simplify transform

8e33bd6

Artifacts

4c1428e

clean

0f698d1

clean up outdated files

399da1a

Remove --enable-poc flag

d30ddc1

A lot of cleanup: test

2875eeb

cleanup

f3f880a

Thread for generate_artifacts

9569da8

Fix build

5c41539

clear queue

68c3321

Remove outdated PoC distribution experiment reports for Qwen

4e73b69

Merge vllm release v0.15.1 into reborn

e82e339

qdanik added 2 commits March 17, 2026 07:32

feat(poc): initial commit for PoC 0.15.1

724a46d

Revert "Add SM120 (Blackwell) FP8 quantization support via dual CUTLASS"

dd06584

This reverts commit dddb72e.

qdanik force-pushed the 0.15.1 branch from a812295 to 4dbc30c Compare March 17, 2026 06:37

qdanik added 2 commits March 17, 2026 07:42

feat(poc): Implement PoC for vLLM 0.15.1

a43b4bf

fix(poc): Update req_enforced_token_ids to use string keys for consis…

ed6506a

…tency

qdanik force-pushed the 0.15.1 branch from 4dbc30c to ed6506a Compare March 17, 2026 06:42

qdanik added 2 commits March 17, 2026 13:53

fix(poc): Update PoC environment variables for batch size and max tokens

48a0909

fix(validation): Conditional logging of artifact distances based on P…

0aac6a5

…OC_LOG_ARTIFACTS_JSON

mtvnastya mentioned this pull request Mar 17, 2026

Upgrade v0.2.11 gonka-ai/gonka#813

Merged


qdanik wants to merge 75 commits into vllm/0.15.1 from 0.15.1


3 participants
