
[Feature]: Split and shorten long CI jobs (e.g. entrypoints, spec decodes, kernels, etc.) #20218

@kaori-seasons

Description


🚀 The feature, motivation and pitch

Related to issue #16284

Analysis of problems in the current CI test process

The vLLM project currently uses Buildkite as the main CI system, which has the following problems:

  1. Long-running tests: some tests run so long that even jobs with a 2-hour budget end up skipped

  2. Coarse test classification: tests are already grouped into entrypoints, speculative decoding, kernels, and so on, but there is no effective grouping by execution time

  3. Insufficient parallelism: the current CI configuration does not fully exploit Buildkite's parallel execution capabilities

Detailed design plan

1. Test layering strategy

First layer: Fast Check

  • Based on the existing fast_check tag
  • Execution time: < 5 minutes
  • Includes: basic functional tests, unit tests, fast integration tests

Second layer: Standard Tests

  • Execution time: 5-20 minutes
  • Includes: most functional tests, API tests

Third layer: Extended Tests

  • Execution time: 20-60 minutes
  • Includes: complex scenario testing, performance regression testing

Fourth layer: Nightly Tests

  • Execution time: > 60 minutes
  • Includes: large-scale tests, stress tests, and full performance benchmarks

2. Split strategy by functional module

Entrypoint test module split

  • entrypoints-llm: LLM interface testing
  • entrypoints-openai: OpenAI API compatibility testing
  • entrypoints-offline: offline mode testing

Speculative decoding test module split

  • spec-decode-core: core speculative decoding functionality
  • spec-decode-e2e: end-to-end tests
  • spec-decode-performance: performance tests

Kernel test module split

  • kernels-attention: attention kernels
  • kernels-quantization: quantization kernels
  • kernels-moe: mixture-of-experts (MoE) kernels
  • kernels-core: core operation kernels
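The module splits above could be driven by a simple mapping from job name to test directory, from which each CI job's command is generated. A minimal sketch (the paths are illustrative, not vLLM's actual test layout):

```python
# Hypothetical mapping from split job names to test directories;
# paths are examples only, not vLLM's real layout.
MODULES = {
    "entrypoints-llm": "tests/entrypoints/llm",
    "entrypoints-openai": "tests/entrypoints/openai",
    "spec-decode-core": "tests/spec_decode",
    "kernels-attention": "tests/kernels/attention",
    "kernels-moe": "tests/kernels/moe",
}

def ci_command(job: str) -> str:
    """Build the pytest invocation for one split CI job."""
    return f"pytest -q {MODULES[job]}"
```

Keeping this mapping in one place makes it easy to add or re-balance jobs without touching each Buildkite step by hand.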

3. Hardware Platform Parallelization Strategy

Based on the multiple hardware platforms supported by the project:

GPU test grouping

  • gpu-single: Single GPU test (L4, A100, etc.)
  • gpu-multi: Multi-GPU distributed test
  • gpu-memory: Large memory requirement test

CPU test grouping

  • cpu-x86: x86 architecture CPU test
  • cpu-arm: ARM architecture CPU test

Specialized hardware test

  • tpu-tests: TPU platform test
  • neuron-tests: AWS Neuron test
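Routing tests into these hardware groups can be done with `pytest.mark.skipif`. A minimal sketch, assuming a hypothetical helper that counts visible GPUs (real CI would query `torch.cuda.device_count()` instead):

```python
import os

import pytest

def visible_gpu_count() -> int:
    # Hypothetical helper: parse CUDA_VISIBLE_DEVICES; a real CI job
    # would query torch.cuda.device_count() instead.
    devices = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return len([d for d in devices.split(",") if d.strip()])

# gpu-multi group: skip unless at least two GPUs are visible.
requires_multi_gpu = pytest.mark.skipif(
    visible_gpu_count() < 2, reason="needs >= 2 GPUs"
)

@requires_multi_gpu
def test_distributed_smoke():
    ...
```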

4. Timeout and retry mechanism optimization

Use the pytest-timeout plugin:

  • Fast tests: 5-minute timeout, one immediate retry on failure
  • Standard tests: 20-minute timeout, one retry on failure
  • Extended tests: 60-minute timeout, one retry on failure
  • Nightly tests: 180-minute timeout, no automatic retry; record detailed logs on failure
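These policies map directly onto markers, assuming pytest-timeout for the limits and pytest-rerunfailures for the retries (the plugin choices are suggestions, not what vLLM currently uses):

```python
import pytest

# Per-tier timeout (seconds) and retry policy; assumes the pytest-timeout
# and pytest-rerunfailures plugins. The nightly tier gets no automatic retry.
fast = [pytest.mark.timeout(300), pytest.mark.flaky(reruns=1)]
standard = [pytest.mark.timeout(1200), pytest.mark.flaky(reruns=1)]
extended = [pytest.mark.timeout(3600), pytest.mark.flaky(reruns=1)]
nightly = [pytest.mark.timeout(10800)]

@pytest.mark.timeout(300)
@pytest.mark.flaky(reruns=1)
def test_fast_tier_example():
    assert 2 + 2 == 4
```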

5. Test selection and marking strategy

Extend the existing pytest marking system:

Suggested new marks:
- @pytest.mark.fast (< 5min)
- @pytest.mark.standard (5-20min)
- @pytest.mark.extended (20-60min)
- @pytest.mark.nightly (> 60min)
- @pytest.mark.gpu_intensive
- @pytest.mark.memory_heavy
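Registering the proposed marks in `conftest.py` keeps `pytest --strict-markers` satisfied and makes layer selection (`pytest -m fast`, `-m nightly`) explicit. A sketch, using the marker names from the proposal (they are not existing vLLM markers):

```python
# conftest.py sketch: register the proposed tier markers so that
# `pytest -m fast`, `-m nightly`, etc. each select a single layer.
TIER_MARKERS = {
    "fast": "completes in under 5 minutes",
    "standard": "completes in 5-20 minutes",
    "extended": "completes in 20-60 minutes",
    "nightly": "takes over 60 minutes; nightly pipeline only",
    "gpu_intensive": "needs substantial GPU compute",
    "memory_heavy": "needs large host or device memory",
}

def pytest_configure(config):
    for name, description in TIER_MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```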

6. CI pipeline restructuring plan

Pull Request triggered pipeline

  1. Fast check layer (parallel execution, 5-8 jobs)
  2. Standard test layer (conditional trigger, based on code changes)
  3. Critical path testing (always executed)
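The conditional trigger in step 2 could be computed from the PR's diff by mapping source-path prefixes to test groups. A sketch (the path prefixes are illustrative):

```python
# Hypothetical mapping from source-path prefixes to the standard-tier
# test groups a pull request should trigger; prefixes are examples only.
GROUP_TRIGGERS = {
    "kernels-attention": ("csrc/attention/", "vllm/attention/"),
    "spec-decode-core": ("vllm/spec_decode/",),
    "entrypoints-openai": ("vllm/entrypoints/openai/",),
}

def groups_for_changes(changed_files):
    """Return the test groups whose source areas a PR touches."""
    return sorted(
        group
        for group, prefixes in GROUP_TRIGGERS.items()
        if any(f.startswith(p) for p in prefixes for f in changed_files)
    )
```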

Pipeline triggered by master branch merge

  1. Complete fast + standard test

  2. Selective extension test

  3. Performance regression detection

Nightly scheduled pipeline

  1. Full test suite

  2. Performance benchmark test

  3. Stress test and stability test

7. Implementation steps

Phase 1: Test analysis and tagging

  1. Analyze the execution time distribution of existing tests

  2. Add time stratification tags to all tests

  3. Identify long tests that can be skipped or optimized

Phase 2: Pipeline refactoring

  1. Modify Buildkite configuration to increase parallel jobs

  2. Implement conditional trigger mechanism

  3. Configure timeout and retry strategy

Phase 3: Monitoring and optimization

  1. Establish CI execution time monitoring dashboard

  2. Continuously optimize test grouping and parallelism

  3. Adjust strategy based on feedback

Alternatives

No response

Additional context

solutions: https://docs.google.com/document/d/1gHMT8ZfNqpu67KrJ3DaNeC-mmdPmBRStK_9-g06hONs/edit?usp=sharing

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
