CI/CD Pipeline Overhaul 2022 #6445

feisuzhu · 2022-10-26T11:13:17Z

This is ongoing work, primarily to reduce CI/CD build time and improve maintainability.
Discussions and feature requests are welcome.

The status quo

Buildbots

5 Linux, 3 + 2 (VM) Windows, 3 macOS; shared across the organization

Pipeline

release.yml: Build nightly wheels, run complete tests, upload to PyPI. GA releases also go through here.
testing.yml: Build PR and master, run partial or complete tests.

perf.yml: Performance monitoring, only run on master.

Problems

Code & logic duplication

release.yml is largely the same as testing.yml, except for publishing and a different build matrix.
Different code paths for CPU and GPU build tasks.

Slow build

Latency!

1h30min for each complete run (master branch) (!!)
Slightly over 1h for each PR run (!!!)

Wasted cycles & bandwidth on same things

For now, C++ build caches are stored locally on each buildbot machine.
Auxiliary git repos are not cached at all.
Perf Monitoring and Android Demo need to compile the taichi binary separately.
Pulling huge docker images for every run on GitHub Hosted Runners (costs 30 min).
ti.init() dominates test case run time.

Low-capacity buildbot cluster

Only a single M1 Mac Mini is operating (single point of failure).
GitHub Hosted Runners are unlimited(?) but weak; better not to rely on them.
Beefy runners are prohibitively expensive.

Under-utilized buildbot cluster

Cores sit idle because CPU utilization is extremely low when running tests
Different test suites run in serial

Strong coupling with GitHub Actions

Difficult for developers to run workflows locally

Non-reproducible building environment (partially addressed)

Windows and macOS builds still run on bare metal
Buildbots are provisioned manually

Minor mess

Most of the buildbot tags are useless

Proposed changes

Exercise single responsibility principle in workflow definition
- Merge CPU & GPU tasks
- Separate build & test & release & perf into 4 different stages
  - Also avoids compiling multiple times
  - Guarantees that Perf Mon monitors the same wheel that gets published
  - Resolves: Code & logic duplication
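The staged layout above can be illustrated with a minimal Python sketch. It is not the actual workflow definition; the `Wheel` type and the stage functions are hypothetical stand-ins. The point it shows is that the build stage runs once and the test, release, and perf stages all receive the same artifact, identified by its content hash.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Wheel:
    """A built artifact, identified by the hash of its contents (hypothetical type)."""
    name: str
    sha256: str


def build(source: bytes) -> Wheel:
    # Stand-in for the real compile step; it runs exactly once per pipeline.
    return Wheel(name="taichi.whl", sha256=hashlib.sha256(source).hexdigest())


def run_tests(wheel: Wheel) -> str:
    return f"tested {wheel.sha256[:8]}"


def publish(wheel: Wheel) -> str:
    return f"published {wheel.sha256[:8]}"


def run_perf(wheel: Wheel) -> str:
    return f"benchmarked {wheel.sha256[:8]}"


def run_pipeline(source: bytes) -> List[str]:
    wheel = build(source)  # stage 1: build once
    # stages 2-4: every stage consumes the very same wheel
    return [stage(wheel) for stage in (run_tests, publish, run_perf)]


if __name__ == "__main__":
    for line in run_pipeline(b"int main() {}"):
        print(line)
```

Because every later stage takes the wheel as input rather than rebuilding it, the hash reported by the perf stage is necessarily the hash of the published wheel.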

Cluster shared caching mechanism
- sccache running at cluster scale
- Resolves: Slow build, cluster utilization
- Possible blocker: Windows build is currently using ccache because sccache can't recognize a particular compiler option.
apt & pip caching proxy
- squid-deb-proxy
- PyPI caching (https://gist.github.com/dctrwatson/5785638)
- Resolves: Slow build, cluster utilization
Home grown git clone caching proxy
- Open source solutions are not polished (goblet, jonasmalacofilho/git-cache-http-server, etc.)
- Long term; may not be implemented this time
- Could grow into a non-trivial open source project.
- Could benefit satellite projects like soft2d as well.
- Resolves: Slow build, cluster utilization
Adding new buildbots
- 15 x HPE ProLiant DL360 Gen 9 with 2x Xeon E5-2696v4 + 128GB RAM + Tesla P4 (44 cores / 88 threads per node, CNY 5k per node)
- 40GbE interconnect via Mellanox CX3 Pro, may add InfiniBand if necessary
- macOS 10.14 phased out, the Mac Pro will be upgraded to 10.15
- New M1 Mac Mini will join the buildbot cluster after the Monterey issue is resolved
- Resolves: Slow build
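
As a rough illustration of why a cluster-shared compile cache helps, here is a minimal Python sketch of a content-addressed cache. It is a simplification and not how sccache actually derives its keys; `cache_key` and `SharedCache` are hypothetical names, and the in-memory dict stands in for storage that all buildbots can reach.

```python
import hashlib
from typing import Callable, Dict, Sequence


def cache_key(compiler_id: str, args: Sequence[str], preprocessed: bytes) -> str:
    """Hash everything assumed to determine the output object file."""
    h = hashlib.sha256()
    h.update(compiler_id.encode())
    for arg in args:
        h.update(b"\0" + arg.encode())
    h.update(b"\0\0" + preprocessed)
    return h.hexdigest()


class SharedCache:
    """In-memory stand-in for a cache shared by every buildbot in the cluster."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compile(self, key: str, compile_fn: Callable[[], bytes]) -> bytes:
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        obj = compile_fn()  # only runs when no machine has built this input yet
        self._store[key] = obj
        return obj


if __name__ == "__main__":
    cache = SharedCache()
    key = cache_key("clang-15", ["-O2", "-c"], b"int f() { return 1; }")
    cache.get_or_compile(key, lambda: b"object-code")  # buildbot A compiles
    cache.get_or_compile(key, lambda: b"object-code")  # buildbot B reuses the result
    print(cache.hits, cache.misses)
```

With per-machine caches, the second buildbot would have compiled the same translation unit again; with a shared store, it only pays for the lookup.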

Go distributed

Miscellaneous

Buildbot tags cleanup
Issues/Actions dashboard
buildbot resource monitoring

Non-goals

Virtualization &&/|| Kubernetes integration

Both are fun tasks that are not on the critical path; leaving them to (for now non-existent) interns.

imcom · 2022-10-27T04:04:02Z

Dapper now provides a Go SDK; developers may use Go to describe their CI/CD.

feisuzhu · 2022-11-07T06:38:20Z

According to the documentation, containerization with GPU on Windows is not feasible for now (CUDA & OpenGL are required).
Moving all workloads to bare metal on Windows.

Issue: #6445

### Brief Summary

This PR addresses quirks in `sccache` distributed compiling:

1. `sccache` runs preprocessing locally (`clang -E`) first and then sends the expanded code to the remote build server. This can trigger false warnings that are not present when compiling locally.
2. The compile host and target can be different, so `--target=` must be sent to the remote build server.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
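
The second quirk can be sketched in Python as follows. This is an illustration under assumptions, not the code from the PR: `remote_args` is a hypothetical helper, and the approach shown (appending an explicit `--target=` with the local host triple when none is given) is only one plausible way to keep a remote compiler from defaulting to its own host.

```python
from typing import List


def remote_args(args: List[str], host_triple: str) -> List[str]:
    """Return the compiler arguments to send to a remote build server.

    If the local invocation already names a target, keep it as is.
    Otherwise pin the target to the local host triple, so a remote
    machine with a different default does not silently change it.
    """
    if any(arg.startswith("--target=") for arg in args):
        return list(args)
    return list(args) + [f"--target={host_triple}"]


if __name__ == "__main__":
    # Hypothetical example: a macOS client dispatching to a Linux build server.
    print(remote_args(["-O2", "-c", "a.cpp"], "arm64-apple-darwin"))
```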

feisuzhu added the enhancement, discussion, and ci labels Oct 26, 2022

feisuzhu self-assigned this Oct 26, 2022

taichi-gardener added this to Taichi Lang Oct 26, 2022

taichi-gardener moved this to Untriaged in Taichi Lang Oct 26, 2022

neozhaoliang moved this from Untriaged to In Progress in Taichi Lang Oct 28, 2022

feisuzhu mentioned this issue Nov 8, 2022

[ci] GitHub Actions: Merge CPU & GPU tasks #6476

Merged

feisuzhu added a commit that referenced this issue Nov 10, 2022

[ci] GitHub Actions: Merge CPU & GPU tasks (#6476)

67a7b34

feisuzhu mentioned this issue Nov 11, 2022

[ci] CUDA tests speed up #6516

Merged

feisuzhu added a commit that referenced this issue Nov 15, 2022

[ci] CUDA tests speed up (#6516)

a4920a7

feisuzhu mentioned this issue Nov 29, 2022

[build] Initial distributed compiling support #6762

Merged

feisuzhu mentioned this issue Dec 9, 2022

[ci] Workflow Rewrite: Building on Linux #6848

Merged

feisuzhu added a commit that referenced this issue Dec 20, 2022

[ci] Workflow Rewrite: Building on Linux (#6848)

56fab81

feisuzhu mentioned this issue Dec 22, 2022

[ci] Sync CI cache script & workflow #6959

Merged

feisuzhu added a commit that referenced this issue Dec 22, 2022

[ci] Sync CI cache script & workflow (#6959)

3b3ba48

This was referenced Dec 23, 2022

[ci] Auto setup miniforge3 env when build #6966

Merged

[ci] Build: auto install vulkan on Linux #6969

Merged

feisuzhu added a commit that referenced this issue Dec 23, 2022

[ci] Auto setup miniforge3 env when build (#6966)

c100a9a

feisuzhu added a commit that referenced this issue Dec 23, 2022

[ci] Build: auto install vulkan on Linux (#6969)

7e8582e

feisuzhu mentioned this issue Dec 28, 2022

[ci] Switch Windows build script to build.py #6993

Merged

feisuzhu added a commit that referenced this issue Jan 17, 2023

[ci] Switch Windows build script to build.py (#6993)

437fe47
