0.15.1#5
- Add vllm/poc/ module with PoCConfig, PoCState, ProofBatch, ValidatedBatch
- Add tests/poc/test_data.py with 12 unit tests
- ValidatedBatch uses scipy binomtest for fraud detection
- Update rules.md with phase completion workflow

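The commit only says that `ValidatedBatch` relies on scipy's `binomtest` for fraud detection. Below is a minimal sketch of the underlying one-sided binomial test, written with the standard library so it runs without SciPy. The function names, the assumed benign mismatch rate, and the significance level are illustrative assumptions, not values from this PR. The tail probability computed here is the quantity that `scipy.stats.binomtest(k, n, p, alternative="greater").pvalue` reports.

```python
from math import comb


def binom_tail_pvalue(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the one-sided 'greater' p-value."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def looks_fraudulent(
    n_mismatch: int,
    n_checked: int,
    expected_rate: float = 0.01,  # assumed benign mismatch rate (hypothetical)
    alpha: float = 0.001,         # assumed significance level (hypothetical)
) -> bool:
    """Flag a batch when it has more mismatches than chance plausibly explains."""
    return binom_tail_pvalue(n_mismatch, n_checked, expected_rate) < alpha
```

With these assumed defaults, one mismatch in 100 checks is unremarkable, while 20 mismatches in 100 would be flagged.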
- Add gpu_random.py with murmur3-based deterministic RNG
- Portable across all GPU architectures (pure math operations)
- Box-Muller transform for normal distribution
- Argsort-based permutations (fully parallel on GPU)
- All 6 tests passing

- Implemented deterministic GPU-native RNG using murmur3 hash
- Added generate_inputs, generate_permutations, generate_target, compute_distances
- Uses direct integer argsort for permutations (no float conversion)
- Added stable=True for CPU/GPU cross-device reproducibility
- 13 unit tests covering determinism, seed validation, cross-device matching
- Updated documentation with performance notes and design decisions

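The two commits above describe a counter-based RNG built from murmur3 mixing, Box-Muller sampling, and argsort-based permutations. The following CPU-only sketch shows how those three pieces fit together. It uses the standard MurmurHash3 32-bit finalizer, but the way seed and index are combined, and all function names, are assumptions made for illustration; the real `gpu_random.py` operates on GPU tensors and may differ.

```python
import math

MASK32 = 0xFFFFFFFF


def fmix32(h: int) -> int:
    """MurmurHash3 32-bit finalizer: integer-only avalanche mixing."""
    h &= MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def hash_u32(seed: int, index: int) -> int:
    """Stateless 32-bit value for (seed, index); this mixing scheme is an assumption."""
    return fmix32(fmix32(seed) ^ ((index * 0x9E3779B9) & MASK32))


def uniform(seed: int, index: int) -> float:
    """Uniform sample in the open interval (0, 1)."""
    return (hash_u32(seed, index) + 0.5) / 2**32


def normal(seed: int, index: int) -> float:
    """Standard normal sample via the Box-Muller transform."""
    u1 = uniform(seed, 2 * index)
    u2 = uniform(seed, 2 * index + 1)
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def permutation(seed: int, n: int) -> list:
    """Argsort of integer hash keys; Python's sort is stable, so ties keep index order."""
    keys = [hash_u32(seed, i) for i in range(n)]
    return sorted(range(n), key=keys.__getitem__)
```

Because every value depends only on `(seed, index)`, each element can be computed independently, which is the property that makes this style of generator easy to parallelize and reproduce across devices.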
- Add PoCManager for orchestrating PoC rounds with vLLM model
- Add PoCCallbackSender for async callback delivery with retry logic
- Add _create_prefill_attn_metadata() using PAD_SLOT_ID for forward context
- Manager accepts vllm_config for set_forward_context integration
- State machine: IDLE -> GENERATING -> VALIDATING -> STOPPED
- 20 unit tests for state transitions, nonce generation, stats
- Smoke test validates run_batch/validate with real Qwen3-0.6B model
- TODO: Multistep inference with proper KV cache allocation
- TODO: vLLM v1 API compatibility

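The state machine in this commit is given only as `IDLE -> GENERATING -> VALIDATING -> STOPPED`. A minimal sketch of how such a machine can be enforced follows. The enum values and the exact transition table are assumptions: the table below also allows jumping to `STOPPED` from any active state and switching between generating and validating, which the commit does not confirm.

```python
from enum import Enum


class PoCState(Enum):
    IDLE = "idle"
    GENERATING = "generating"
    VALIDATING = "validating"
    STOPPED = "stopped"


# Assumed transition table (hypothetical; the PR lists only the main path).
ALLOWED = {
    PoCState.IDLE: {PoCState.GENERATING, PoCState.VALIDATING, PoCState.STOPPED},
    PoCState.GENERATING: {PoCState.VALIDATING, PoCState.STOPPED},
    PoCState.VALIDATING: {PoCState.GENERATING, PoCState.STOPPED},
    PoCState.STOPPED: {PoCState.IDLE},
}


def transition(current: PoCState, target: PoCState) -> PoCState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the table as data rather than scattered `if` checks makes each legal move easy to cover with a unit test.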
… note, dedupe TODOs
Implements PoC API endpoints with proper distributed execution:
API Endpoints:
- POST /api/v1/pow/init - Initialize round
- POST /api/v1/pow/init/generate - Init + start generating
- POST /api/v1/pow/init/validate - Init + start validating
- POST /api/v1/pow/phase/generate - Switch to generate
- POST /api/v1/pow/phase/validate - Switch to validate
- POST /api/v1/pow/stop - Stop round
- GET /api/v1/pow/status - Get status
- POST /api/v1/pow/validate - Validate nonces
Key Changes:
- Add worker_ops.py with PP coordination patterns
- Update PoCManager to use model_executor.collective_rpc()
- Add RPCPoCRequest/Response for multiprocessing mode
- Add --enable-poc CLI flag
- Add poc_request() to EngineClient protocol
TP/PP Support:
- Deterministic RNG on each worker (no tensor serialization)
- PP coordination follows Worker.execute_model() patterns
- Works with default multiprocessing mode
Tests:
- 67 unit tests passing
- Smoke test with real model passing
- ~380 nonces/sec with Qwen3-0.6B

Key changes:
- Remove PoCCallbackSender class (vllm/poc/sender.py deleted)
- Move callback logic to routes.py with asyncio.Queue for decoupling
- Add run_batch_with_state() for efficient single-RPC generation loop
- Fix validation flow: queue_validation now returns results with fraud detection
- Update manager.validate() to track stats and detect fraud
- Fix /validate endpoint to send callback with validation results
- Update callback receiver script with /generated and /validated endpoints
- Update docs: README.md (inference blocking deferred to Phase 6)
- Update docs: phase-3-manager.md (correct collective_rpc approach)
- Update tests for new API behavior
E2E tested: generation → callbacks → validation → fraud detection

Key fixes:
- Fix per-app callback state (replaced global _callback_url)
- Remove dead validate action from engine.py
- Add time-based progress logging (5s interval)
- Use vLLM logger for consistent format
- Add r_target to status/callback responses
- Add computed_distances to validate response
E2E testing:
- New poc_e2e_test.py for full integration testing
- Tests generation, restart validation, fraud detection
- Supports multiple models (qwen, llama)
- Model-specific r_target calibration (empirical)
Bug fix: seq_len must match between generation and validation (default 256)

Experiments to measure actual distance distributions vs theoretical:
Results:
- Theoretical (random weights): mean=1.4142, p10=1.4119
- Trained Qwen: mean=1.2155, p10=1.1739 (16.9% compressed)
- Trained Llama: mean=1.4486, p10=1.3796 (2.3% compressed)
- Random Qwen (same arch): mean=1.4138, p10=1.4117 (matches theory)
Key finding: Trained models have structure that compresses distances. Qwen is highly compressed, Llama is close to random.
Updated r_target for 10% valid rate:
- Qwen: 1.174 (was 1.145)
- Llama: 1.380 (was 1.35)
New scripts:
- poc_distribution_experiment.py: Collect full distance distribution
- poc_random_qwen_experiment.py: Compare random vs trained weights

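The theoretical mean of 1.4142 is sqrt(2): for unit vectors a and b, ||a - b||^2 = 2 - 2(a·b), and in high dimension the dot product of independent random directions concentrates near 0. Choosing `r_target` for a 10% valid rate then amounts to taking the 10th percentile of the observed distances. The sketch below illustrates both points with random unit vectors. The dimension, sample count, and function names are assumptions chosen to keep the example fast, so the percentile it prints will not match the values reported in the commit.

```python
import math
import random


def random_unit_vector(rng: random.Random, dim: int) -> list:
    """Uniform random direction: normalize a vector of i.i.d. Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def distance(a: list, b: list) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def calibrate_r_target(distances: list, valid_rate: float) -> float:
    """Threshold at or below which about `valid_rate` of the samples fall."""
    ordered = sorted(distances)
    idx = max(0, math.ceil(valid_rate * len(ordered)) - 1)
    return ordered[idx]


rng = random.Random(0)
dim = 256  # illustrative; far smaller than a real model's output width
target = random_unit_vector(rng, dim)
distances = [distance(random_unit_vector(rng, dim), target) for _ in range(2000)]

mean_distance = sum(distances) / len(distances)
r_target = calibrate_r_target(distances, 0.10)
valid_fraction = sum(d <= r_target for d in distances) / len(distances)
print(f"mean={mean_distance:.4f} r_target={r_target:.4f} valid={valid_fraction:.3f}")
```

The same percentile step applies to distances collected from a real model; only the source of `distances` changes.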
PROBLEM:
- Trained models caused inconsistent distribution across block_hashes
- Cross-block spread was 49-54% (e.g., block_alpha 88%, block_beta 50%)
- No single r_target worked for all block_hashes
SOLUTION:
1. Per-layer normalization in layer_hooks.py
   - Normalizes hidden states to unit sphere at each layer
   - Breaks structure accumulation (orthogonal transforms preserve structure)
2. Random lm_head with POC_OUTPUT_DIM=8192 in worker_ops.py
   - Replaces trained lm_head (vocab_size ~150K) with random projection
   - 18x memory savings
   - Deterministic, seeded by block_hash
RESULTS:
- Cross-block spread: 2-3.5% (was 49-54%)
  - Qwen: 3.5% spread
  - Llama: 2.0% spread
- Model-agnostic r_target: ~1.405 for 10% valid rate
Updated documentation:
- phase4.2-distribution-analysis.md: Status SOLVED
- phase4.1-randomization.md: Layer hooks work, added normalization
- phase-4-final.md: Updated r_target calibration
- README.md: Updated distribution note
- testing-guide.md: Updated empirical values

Code cleanup:
- Remove unused generate_orthogonal_matrix, apply_orthogonal_transform from gpu_random.py
- Remove unused LayerSignsHook, _generate_signs from layer_hooks.py
- Update poc_e2e_test.py to use consistent r_target=1.405 for both models
- Fix test_manager.py to account for poc_setup_layer_hooks call during init_round
Documentation:
- Update phase-3-final.md: clarify callback logic moved inline to routes.py
- Update phase-3-manager.md: remove sender.py references, note simplification
- Update phase-4-integration.md: correct file structure (add layer_hooks.py)
- Update phase-6-optional.md: add section 6.3 for callback retry queue TODO
All tests pass:
- pytest tests/poc/ -v: 25 passed
- poc_smoke_test.py: all checks pass
- poc_e2e_test.py: both models pass
- poc_full_e2e_test.py: 18/18 tests pass

- Add sign flips transform as lightweight alternative to per-layer hooks
- Remove experimental flags (normalize_first, use_random_lm_head)
- Both approaches achieve <5% cross-block spread in E2E tests
- Clean up experiment scripts, keep core test suite
- Streamline phase4.3 documentation

E2E Test Results (36 tests total):
- Sign Flips: Qwen 1.1%, Llama 0.9% cross-block spread ✓
- Layer Hooks: Qwen 3.9%, Llama 10.3% cross-block spread
Changes:
- Fix normalize-first order in worker_ops.py (consistency with layer_hooks.py)
- Set Sign Flips as default in config.py
- Update phase4.3 documentation with accurate results
- Add phase4-final-results.md summary
- Include test logs in planning-poc-v2/test-results/

…l transform
Changes:
- worker_ops.py: Remove all transformation flags, use single pipeline (k=64 dim pick + Haar rotate)
- config.py: Remove experiment flags (use_layer_hooks, use_sign_flips, etc.)
- manager.py: Remove artifact handling, simplify run_batch_with_state
- routes.py: Remove experiment flags from API, improve logging format
- gpu_random.py: generate_target now requires public_key parameter
- engine handlers: Remove return_inputs/return_outputs params
Deleted:
- scripts/poc_e2e_test.py (outdated)
- scripts/poc_full_e2e_test.py (outdated)
- logs/e2e_results.jsonl (stale)
Added:
- scripts/poc_e2e_simple_test.py (simplified E2E test)
- planning-poc-v2/clean-up.md (documents this iteration)
Pipeline: input -> forward -> normalize -> pick k dims -> Haar rotate -> distance to k-dim target

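The post-forward part of the pipeline above (normalize, pick k dims, Haar rotate, distance to a k-dim target) can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `worker_ops.py`: Python's `random` module stands in for the murmur3 GPU RNG, the seed string format is hypothetical, the Haar-distributed rotation is built by Gram-Schmidt on Gaussian rows, and the picked sub-vector is deliberately not re-normalized because the commit lists only one normalization step.

```python
import math
import random


def normalize(v: list) -> list:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def haar_orthogonal(rng: random.Random, k: int) -> list:
    """Random k x k orthogonal matrix via Gram-Schmidt on i.i.d. Gaussian rows."""
    rows = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(k)]
        for q in rows:  # remove the components along earlier rows
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        rows.append(normalize(v))
    return rows


def poc_distance(hidden: list, seed: str, k: int) -> float:
    """normalize -> pick k dims -> rotate -> distance to a k-dim unit target."""
    rng = random.Random(seed)  # stand-in for the deterministic GPU RNG
    unit = normalize(hidden)
    picked = [unit[i] for i in rng.sample(range(len(unit)), k)]
    rotation = haar_orthogonal(rng, k)
    rotated = [sum(r * x for r, x in zip(row, picked)) for row in rotation]
    target = normalize([rng.gauss(0.0, 1.0) for _ in range(k)])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rotated, target)))
```

For example, `poc_distance(hidden, "block_hash:public_key:nonce", 64)` returns the same value on every call with the same arguments, which is the property a validator needs in order to recompute a claimed distance.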
…2e test
Remove deprecated functions from gpu_random.py (-51% lines):
- Permutation-based pipeline functions
- Convenience wrappers moved to scripts/poc_distribution/
Rewrite poc_e2e_test.py with 3x3 seed matrix:
- Tests 9 seed combinations per model without server restart
- Saves per-seed JSON with nonces for validation
- Determinism, independence, and fraud detection tests
Changes:
- gpu_random.py: keep only production pipeline functions
- test_gpu_random.py: update tests for new API
- routes.py: remove /phase/generate_manual endpoint
- Add scripts/poc_distribution/gpu_random_utils.py for analysis
- Delete collect_hidden_vectors.py, poc_e2e_simple_test.py
E2E results (100s/seed): qwen 20.8%, llama 20.5%, qwen4b 20.8% valid

Add 20 statistical tests verifying correct distributions:
- _uniform/_normal: mean, variance, KS tests
- generate_target/householder: uniform on sphere
- random_pick_indices: chi-square uniform selection
- generate_haar: columns on sphere, det ±1 balanced
Uses scipy.stats with large samples (10K+) for reliability.
Results: 44 tests total, all passing

Detect when chat inference is active or TP workers are in broadcast loop and pause PoC GPU work until engine is idle. Prevents NCCL collective mismatch deadlock in v0 distributed execution with TP>1.
Changes:
- Add _engine_step_in_progress flag to track active engine.step()
- Add _prepare_for_poc_gpu_work() helper to stop remote worker loop
- Add has_unfinished_requests() check to skip PoC when chat is active
- Add input_socket.poll() check to detect pending chat requests
- Refactor _process_poc_action() with priority-based skip logic:
  - Priority 1: Pending input (chat requests waiting in socket)
  - Priority 2: Engine step in progress (async callback path)
  - Priority 3: Chat unfinished (requests still processing)
  - Safe: Stop remote worker loop before PoC collective_rpc
- Add timeout_ms param to poc_request() client method
- Add POC_RUN_BATCH_TIMEOUT_MS env var for long inference survival
- Add POC_CHAT_BUSY_BACKOFF_SEC for skip retry backoff
- Add check_mp_engine_required() guard for in-process mode rejection
- Replace immediate return with backoff+retry in generation loop
- Add timeout recovery and exception logging in _generation_loop()
- Add retry loop with per-chunk timeout in /generate endpoint

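The priority-based skip logic listed above can be read as an ordered series of checks where the first match wins. A minimal sketch follows. `EngineSnapshot` and `poc_skip_reason` are hypothetical names introduced here; the real `_process_poc_action()` reads these conditions from the engine and socket directly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EngineSnapshot:
    """Hypothetical view of the three conditions the commit checks."""
    pending_input: bool             # chat requests waiting in the input socket
    engine_step_in_progress: bool   # engine.step() is currently running
    has_unfinished_requests: bool   # chat requests still being processed


def poc_skip_reason(s: EngineSnapshot) -> Optional[str]:
    """Return why PoC work must be skipped, or None when it is safe to run."""
    if s.pending_input:
        return "pending_input"            # priority 1
    if s.engine_step_in_progress:
        return "engine_step_in_progress"  # priority 2
    if s.has_unfinished_requests:
        return "chat_unfinished"          # priority 3
    return None  # safe: caller stops the remote worker loop, then runs PoC
```

A caller that receives a non-`None` reason would back off and retry later, which corresponds to the `POC_CHAT_BUSY_BACKOFF_SEC` behavior described in the commit.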
…sync queue
Minimize vLLM core changes by consolidating PoC engine handlers and moving
all state management to API layer. Add async queue for wait=false requests.
Engine changes:
- Reduce PoC actions to single "generate_artifacts" (remove init/start/stop/status/run_batch)
- Keep chat-priority checks (pending_input, engine_step_in_progress, has_unfinished_requests)
- Support both MP engine and in-process AsyncLLMEngine modes
API layer changes:
- Move nonce generation, stats tracking to routes.py
- Add GenerateJob/GenerateResult dataclasses for queue infrastructure
- Add background worker processing jobs FIFO with queue-until-idle behavior
- Add GET /generate/{request_id} polling endpoint
- Remove 409 conflict for /generate when /init/generate active (queues instead)
- Add configurable result TTL cleanup (POC_GENERATE_RESULT_TTL_SEC)
PoCManager simplification:
- Remove all stateful APIs (init_round, start_generate, stop_round, get_status, run_batch)
- Keep only generate_artifacts + _run_forward helper (now stateless)
Tests:
- Update test_routes.py with queue/polling tests
- Update test_manager.py for stateless manager
- Update test_coexist.py for generate_artifacts-only behavior
…ble callbacks
Extract queue, callbacks, and validation logic from routes.py into separate modules for clarity and testability. Add backpressure and retry mechanisms to prevent OOM and dropped callbacks.
Changes:
- Add generate_queue.py: bounded queue with 100k nonce cap, result store
- Add callbacks.py: retry-until-stop with backoff, 1M artifact cap (drops oldest)
- Add validation.py: centralized artifact comparison and fraud test
- Refactor routes.py to thin orchestrator (~330 lines from ~960)
- Make PoCParamsModel strict (extra=forbid) per Phase 1 spec
- Extend check_params_match() to validate model/seq_len/k_dim against deployed
- Update /stop to clear queue, results, and callback buffers
- Add batch processing logging with stats for /generate endpoints
- Update tests for new module structure and behaviors

…ck retry logging
Server-configurable batch size via POC_BATCH_SIZE_DEFAULT env var (default 32). Add NonceIterator with group_id/n_groups for disjoint nonce ranges across multiple groups. Add non-spammy retry logging for callbacks.
Changes:
- Add POC_BATCH_SIZE_DEFAULT env var, used as default for both endpoints
- Add NonceIterator dataclass with multi-node + multi-group nonce iteration
- Add group_id (default 0) and n_groups (default 1) to /init/generate request
- Add attempt counting and warning logs for callback retries (1st + every 10th)
- Log success info when callback succeeds after retries
- Add tests for NonceIterator (single/multi-node, multi-group, disjoint coverage)
- Add tests for batch_size defaults

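One way to obtain disjoint nonce ranges across nodes and groups is strided iteration, sketched below. Only `group_id` and `n_groups` (with their defaults) come from the commit; `node_id`, `n_nodes`, the `_step` counter, and the striding formula are assumptions, and the real `NonceIterator` may partition nonces differently.

```python
from dataclasses import dataclass
from itertools import islice


@dataclass
class NonceIterator:
    node_id: int = 0    # assumed field name (hypothetical)
    n_nodes: int = 1    # assumed field name (hypothetical)
    group_id: int = 0
    n_groups: int = 1
    _step: int = 0

    def __iter__(self):
        return self

    def __next__(self) -> int:
        # Each (node, group) pair owns one residue class modulo
        # n_nodes * n_groups, so no two pairs ever emit the same nonce.
        slot = self.group_id + self.n_groups * self._step
        nonce = self.node_id + self.n_nodes * slot
        self._step += 1
        return nonce


# Usage: node 1 of 2, group 2 of 3 -> offset 5, stride 6.
first_four = list(islice(NonceIterator(node_id=1, n_nodes=2, group_id=2, n_groups=3), 4))
```

Under this scheme the union of all iterators covers every non-negative integer exactly once, which matches the "disjoint coverage" property the commit's tests mention.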
- /generate callback HTTP 503 errors blocked queue processing
- Callbacks retried indefinitely, preventing subsequent jobs
- No limit on concurrent callback requests
Solution:
- Introduce CallbackQueue with bounded concurrency (default: 10)
- Add max retry limit (default: 10) with exponential backoff
- Use bounded queue (default: 10000) with oldest-drop on overflow
- Share single aiohttp session for efficiency
- Non-blocking: job processing continues regardless of callback status
Files:
- vllm/poc/callbacks.py: Add CallbackQueue class, max retry constants
- vllm/poc/generate_queue.py: Integrate CallbackQueue for callback delivery
- tests/poc/test_callback_blocking.py: Tests for callback behavior
Config (env vars):
- POC_CALLBACK_MAX_RETRIES (default: 10)
- POC_CALLBACK_MAX_CONCURRENT (default: 10)
- POC_CALLBACK_QUEUE_SIZE (default: 10000)

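The three mechanisms in this commit (bounded concurrency, a retry cap with exponential backoff, and oldest-drop on overflow) can be combined in a few lines of asyncio. The sketch below is not the class in `vllm/poc/callbacks.py`: it takes an injected async `send` callable instead of a shared aiohttp session, and the method names, `base_delay`, and the delay cap are assumptions.

```python
import asyncio
from collections import deque


class CallbackQueue:
    def __init__(self, send, max_concurrent=10, max_retries=10,
                 queue_size=10000, base_delay=0.5, max_delay=30.0):
        self._send = send                         # async callable(payload) -> bool
        self._pending = deque(maxlen=queue_size)  # a full deque drops its oldest item
        self._max_concurrent = max_concurrent
        self._max_retries = max_retries
        self._base_delay = base_delay
        self._max_delay = max_delay

    def enqueue(self, payload) -> None:
        """Never blocks: on overflow the oldest pending payload is discarded."""
        self._pending.append(payload)

    async def _deliver(self, sem, payload) -> bool:
        async with sem:  # cap the number of in-flight callbacks
            for attempt in range(self._max_retries + 1):
                if await self._send(payload):
                    return True
                if attempt < self._max_retries:
                    delay = min(self._max_delay, self._base_delay * 2 ** attempt)
                    await asyncio.sleep(delay)  # exponential backoff
            return False  # give up so later jobs are not held back

    async def drain(self) -> list:
        """Deliver everything currently queued; returns one bool per payload."""
        sem = asyncio.Semaphore(self._max_concurrent)
        tasks = []
        while self._pending:
            tasks.append(asyncio.create_task(self._deliver(sem, self._pending.popleft())))
        return list(await asyncio.gather(*tasks)) if tasks else []
```

Because `enqueue` returns immediately and failed deliveries stop after the retry cap, a persistently failing callback endpoint cannot stall job processing, which is the blocking problem the commit describes.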
Implement SM120 blockwise FP8 scaled matrix multiplication kernels using CUTLASS v4.x while maintaining CUTLASS v3.9.2 for SM89/SM90/SM100 archs.
Changes:
- Add CUTLASS v4.x FetchContent for SM120 kernel compilation
- Add enable_sm120_only guard in common.hpp
- Add cutlass_3x_gemm_sm120 template using Sm120 collective builders
- Add SM120 per-tensor and blockwise FP8 kernels and dispatch logic
- Add runtime dispatch to route SM120 GPUs to dedicated kernels
- Configure CMake to build SM120 sources with v4.x includes
This enables FP8 quantization on RTX PRO 6000 / RTX 5060 Ti GPUs.

Summary
This PR implemented PoC support for vLLM 0.15.1 and integrated the PoC flow into the server, runtime, worker, and validation layers.
What Was Implemented
Outcome
PoC support was fully introduced into the 0.15.1 codebase with end-to-end server and runtime integration for continuous generation and artifact handling.