Description
🚀 The feature, motivation and pitch
Related to issue #16284.
Analysis of current CI test process problems
The vLLM project currently uses Buildkite as the main CI system, which has the following problems:
- Long-running tests: Some tests run too long, to the point that 2-hour tests are skipped.
- Test classification is too coarse: Tests are grouped into entry point tests, speculative decoding tests, kernel tests, and so on, but there is no effective grouping by execution time.
- Insufficient parallelism: The current CI configuration does not make full use of parallel execution.
Detailed design plan
1. Test layering strategy
First layer: Fast Check
- Based on the existing fast_check tag
- Execution time: < 5 minutes
- Includes: basic functional tests, unit tests, fast integration tests
Second layer: Standard Tests
- Execution time: 5-20 minutes
- Includes: most functional tests, API tests
Third layer: Extended Tests
- Execution time: 20-60 minutes
- Includes: complex scenario testing, performance regression testing
Fourth layer: Nightly Tests
- Execution time: > 60 minutes
- Includes: large-scale testing, stress testing, complete performance benchmark testing
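The layer boundaries above can be sketched as a small classifier. This is only an illustration: the function name `tier_for_duration` is hypothetical, and because the proposal does not say which layer a test of exactly 20 or 60 minutes belongs to, the boundary handling below is an assumption.

```python
def tier_for_duration(seconds: float) -> str:
    """Map a measured test duration to one of the four proposed layers.

    Assumption: a test of exactly 5 or 20 minutes goes to the slower layer,
    and a test of exactly 60 minutes still counts as "extended".
    """
    minutes = seconds / 60
    if minutes < 5:
        return "fast"
    if minutes < 20:
        return "standard"
    if minutes <= 60:
        return "extended"
    return "nightly"
```

The returned names match the marks suggested in section 5, so the result could be used directly as a marker name.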
2. Split strategy by functional module
Entry point test module split
- entrypoints-llm: LLM interface testing
- entrypoints-openai: OpenAI API compatibility testing
- entrypoints-offline: offline mode testing
Spec decoding test module split
- spec-decode-core: core speculative decoding functionality
- spec-decode-e2e: end-to-end testing
- spec-decode-performance: performance testing
Kernel test module split
- kernels-attention: attention kernels
- kernels-quantization: quantization kernels
- kernels-moe: mixture-of-experts (MoE) kernels
- kernels-core: core operation kernels
3. Hardware Platform Parallelization Strategy
Based on the multiple hardware platforms supported by the project:
GPU test grouping
- gpu-single: single-GPU tests (L4, A100, etc.)
- gpu-multi: multi-GPU distributed tests
- gpu-memory: tests with large memory requirements
CPU test grouping
- cpu-x86: x86 architecture CPU tests
- cpu-arm: ARM architecture CPU tests
Specialized hardware test
- tpu-tests: TPU platform tests
- neuron-tests: AWS Neuron tests
4. Timeout and retry mechanism optimization
Use pytest-timeout to enforce timeouts:
- Fast check: 5-minute timeout, retry once immediately after failure
- Standard tests: 20-minute timeout, retry once after failure
- Extended tests: 60-minute timeout, retry once after failure
- Nightly tests: 180-minute timeout, no automatic retry after failure; record detailed logs
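One way to keep these timeout and retry values in a single place is a small policy table that produces pytest command-line arguments. This is a sketch under assumptions, not the project's configuration: `TierPolicy`, `POLICIES`, and `pytest_args` are hypothetical names; `--timeout` is the pytest-timeout option and applies per test rather than per CI job; `--reruns` comes from the separate pytest-rerunfailures plugin, which the proposal does not mention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    timeout_minutes: int  # value taken from the list above
    retries: int          # automatic retries after a failure


# Values copied from the proposal; the tier names reuse the suggested marks.
POLICIES = {
    "fast": TierPolicy(timeout_minutes=5, retries=1),
    "standard": TierPolicy(timeout_minutes=20, retries=1),
    "extended": TierPolicy(timeout_minutes=60, retries=1),
    "nightly": TierPolicy(timeout_minutes=180, retries=0),
}


def pytest_args(tier: str) -> list[str]:
    """Build pytest arguments that select one tier and apply its policy."""
    policy = POLICIES[tier]
    # pytest-timeout expects the timeout in seconds.
    args = ["-m", tier, f"--timeout={policy.timeout_minutes * 60}"]
    if policy.retries:
        # Assumption: retries are delegated to pytest-rerunfailures.
        args += ["--reruns", str(policy.retries)]
    return args
```

A CI step could then run `pytest` followed by `pytest_args("standard")`. If the timeouts are meant for a whole job rather than a single test, the CI system's own step timeout would be the more natural place for them.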
5. Test selection and marking strategy
Extend the existing pytest marker system with the following suggested marks:
- @pytest.mark.fast (< 5min)
- @pytest.mark.standard (5-20min)
- @pytest.mark.extended (20-60min)
- @pytest.mark.nightly (> 60min)
- @pytest.mark.gpu_intensive
- @pytest.mark.memory_heavy
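Custom marks need to be registered so that pytest does not warn about unknown markers. A minimal sketch of doing this in a `conftest.py` is shown here; the dictionary name `PROPOSED_MARKERS` is hypothetical, and the descriptions for `gpu_intensive` and `memory_heavy` are placeholder wording, since the proposal does not define them.

```python
# conftest.py (sketch)

# Marker name -> description shown by `pytest --markers`.
PROPOSED_MARKERS = {
    "fast": "runs in under 5 minutes",
    "standard": "runs in 5-20 minutes",
    "extended": "runs in 20-60 minutes",
    "nightly": "runs for more than 60 minutes",
    "gpu_intensive": "GPU-intensive test (placeholder description)",
    "memory_heavy": "test with large memory requirements (placeholder description)",
}


def pytest_configure(config):
    """Register the proposed marks with pytest at startup."""
    for name, description in PROPOSED_MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the marks registered, a layer can be selected with a marker expression such as `pytest -m "fast and not gpu_intensive"`.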
6. CI pipeline reconstruction plan
Pull Request triggered pipeline
- Fast check layer (parallel execution, 5-8 jobs)
- Standard test layer (conditional trigger, based on code changes)
- Critical path testing (always executed)
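The conditional trigger could be driven by a mapping from changed paths to test groups. The sketch below is illustrative only: the path prefixes, the `critical-path` group name, and the function `groups_for_changes` are all hypothetical and would need to match the real repository layout.

```python
# Hypothetical mapping from source path prefixes to the test groups
# proposed in section 2. The prefixes are illustrative assumptions.
PATH_TO_GROUPS = {
    "vllm/entrypoints/": {"entrypoints-llm", "entrypoints-openai", "entrypoints-offline"},
    "vllm/spec_decode/": {"spec-decode-core", "spec-decode-e2e"},
    "csrc/attention/": {"kernels-attention"},
    "csrc/quantization/": {"kernels-quantization"},
}

# Groups that run on every pull request regardless of what changed.
ALWAYS_RUN = {"critical-path"}


def groups_for_changes(changed_files):
    """Return the sorted test groups to run for a list of changed file paths."""
    selected = set(ALWAYS_RUN)
    for path in changed_files:
        for prefix, groups in PATH_TO_GROUPS.items():
            if path.startswith(prefix):
                selected |= groups
    return sorted(selected)
```

The list of changed files could come from something like `git diff --name-only` against the target branch. A change that matches no prefix falls back to the always-run groups only, which may be too permissive for shared code; a real mapping would likely need a catch-all rule.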
Pipeline triggered by merges to the master branch
- Complete fast + standard tests
- Selective extended tests
- Performance regression detection
Nightly scheduled pipeline
- Full test suite
- Performance benchmark tests
- Stress and stability tests
7. Implementation steps
Phase 1: Test analysis and tagging
- Analyze the execution time distribution of existing tests
- Add time-tier marks to all tests
- Identify long tests that can be skipped or optimized
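As a starting point for the timing analysis, per-test durations can be read from a JUnit XML report, which pytest writes when run with `--junitxml=<path>`. The sketch below assumes such a report is available as a string; the function name `slowest_tests` is hypothetical.

```python
import xml.etree.ElementTree as ET


def slowest_tests(junit_xml: str, top: int = 10):
    """Return up to `top` (test_id, seconds) pairs, slowest first."""
    root = ET.fromstring(junit_xml)
    durations = []
    # Each <testcase> element carries classname, name, and time attributes.
    for case in root.iter("testcase"):
        test_id = f"{case.get('classname', '')}::{case.get('name', '')}"
        durations.append((test_id, float(case.get("time", 0.0))))
    durations.sort(key=lambda pair: pair[1], reverse=True)
    return durations[:top]
```

For a quick look without any parsing, pytest's built-in `--durations=N` option prints the N slowest test phases at the end of a run.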
Phase 2: Pipeline refactoring
- Modify the Buildkite configuration to increase the number of parallel jobs
- Implement the conditional trigger mechanism
- Configure the timeout and retry strategy
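When increasing the number of parallel jobs, tests can be distributed so that each job takes roughly the same time. The sketch below uses a greedy longest-first heuristic, which is one common choice and not something the proposal prescribes; `shard_by_duration` is a hypothetical name, and the durations are assumed to come from earlier CI runs.

```python
import heapq


def shard_by_duration(durations: dict[str, float], num_shards: int) -> list[list[str]]:
    """Assign tests to shards, always placing the next-longest test
    on the shard with the smallest total duration so far."""
    # Min-heap of (current_load_seconds, shard_index).
    heap = [(0.0, index) for index in range(num_shards)]
    heapq.heapify(heap)
    shards = [[] for _ in range(num_shards)]
    for name, seconds in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        load, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (load + seconds, index))
    return shards
```

In Buildkite, a step with `parallelism` set exposes `BUILDKITE_PARALLEL_JOB` and `BUILDKITE_PARALLEL_JOB_COUNT` to each job, so each job could pick its own shard by index.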
Phase 3: Monitoring and optimization
- Establish a CI execution time monitoring dashboard
- Continuously optimize test grouping and parallelism
- Adjust the strategy based on feedback
Alternatives
No response
Additional context
Proposed solution document: https://docs.google.com/document/d/1gHMT8ZfNqpu67KrJ3DaNeC-mmdPmBRStK_9-g06hONs/edit?usp=sharing
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.