CI/CD Pipeline Overhaul 2022 #6445

Open
16 of 28 tasks
feisuzhu opened this issue Oct 26, 2022 · 2 comments
Labels: ci, discussion, enhancement

feisuzhu commented Oct 26, 2022

This is ongoing work, primarily to reduce CI/CD build time and improve maintainability.
Discussions and feature requests are welcome.

The status quo

Buildbots

  • 5 Linux, 3 + 2 (VM) Windows, 3 macOS; shared across the organization

Pipeline

  • release.yml: Builds nightly wheels, runs the complete tests, and uploads to PyPI. GA releases also go through here.

  • testing.yml: Builds PRs and master, and runs partial or complete tests.

  • perf.yml: Performance monitoring; runs only on master.

Problems

Code & logic duplication

  • release.yml is largely the same as testing.yml, except for the publishing step and a different build matrix.
  • CPU and GPU build tasks follow different code paths.

Slow build

Latency!

  • 1h30min for each complete run (master branch) (!!)
  • Slightly over 1h for each PR run (!!!)

Wasted cycles & bandwidth on same things

  • For now, the C++ build cache is stored locally on each buildbot machine.
  • Auxiliary git repos are not cached at all.
  • Perf monitoring and the Android demo each need to compile the taichi binary separately.
  • Huge Docker images are pulled for every run on GitHub-hosted runners (costs 30 min).
  • ti.init() dominates test case run time.

Low-capacity buildbot cluster

  • Only a single M1 Mac Mini is operating (a single point of failure).
  • GitHub-hosted runners are unlimited(?) but weak; better not to rely on them.
  • Beefy runners are prohibitively expensive.

Under-utilized buildbot cluster

  • Cores sit idle because CPU usage is extremely low while tests run.
  • Different test suites run serially.

Strong coupling with GitHub Actions

  • Difficult for developers to run workflows locally

Non-reproducible building environment (partially addressed)

  • Windows and macOS builds still run on bare metal
  • Buildbots are provisioned manually

Minor mess

  • Most buildbot tags are useless

Proposed changes

  • Apply the single responsibility principle to workflow definitions
    • Merge CPU & GPU tasks
    • Separate build & test & release & perf into 4 different stages
      • Also prevents compiling multiple times
      • Guarantees that perf monitoring measures the same wheel that gets published
      • Resolves: Code & logic duplication
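
The four-stage split can be sketched as a small dependency graph. This is only an illustration, not the actual workflow definition: the edges between stages and the wheel file name are assumptions made for the example. The point it shows is that the wheel is built once and every later stage consumes that same artifact.

```python
from graphlib import TopologicalSorter

# Hypothetical stage graph: each stage maps to the stages it depends on.
# The exact edges are an assumption; the issue only says the four stages
# are separated and share one build.
STAGES = {
    "build": set(),
    "test": {"build"},
    "release": {"test"},
    "perf": {"build"},
}


def run_pipeline(stages):
    """Visit stages in dependency order, building the wheel exactly once."""
    artifacts = {}
    log = []
    for stage in TopologicalSorter(stages).static_order():
        if stage == "build":
            # Placeholder artifact name, not a real wheel file name.
            artifacts["wheel"] = "taichi-nightly.whl"
        # Every stage after `build` reuses the single wheel it produced.
        log.append((stage, artifacts["wheel"]))
    return log


if __name__ == "__main__":
    for stage, wheel in run_pipeline(STAGES):
        print(f"{stage}: uses {wheel}")
```

Because `perf` and `release` both read from the same `artifacts` entry, the wheel that is measured is by construction the wheel that is published.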

  • Cluster shared caching mechanism

    • sccache running at cluster scale
    • Resolves: Slow build, cluster utilization
    • Possible blocker: Windows build is currently using ccache because sccache can't recognize a particular compiler option.
  • apt & pip caching proxy

  • Home-grown git clone caching proxy

    • Open source solutions are not polished (goblet, jonasmalacofilho/git-cache-http-server, etc.)
    • Long term, may not implement this time
    • Could grow into a non-trivial open source project.
    • Could benefit satellite projects like soft2d as well.
    • Resolves: Slow build, cluster utilization
  • Adding new buildbots

    • 15 x HPE ProLiant DL360 Gen 9 with 2x Xeon E5-2696v4 + 128GB RAM + Tesla P4 (44 cores / 88 threads per node, CNY 5k per node)
    • 40GbE interconnect via Mellanox CX3 Pro, may add InfiniBand if necessary
    • macOS 10.14 is phased out; the Mac Pro will be upgraded to 10.15
    • A new M1 Mac Mini will join the buildbot cluster after the Monterey issue is resolved
    • Resolves: Slow build
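
The idea behind a cluster-shared compile cache can be sketched as follows. This is a minimal illustration of content-addressed caching in general, not sccache's actual key derivation or storage protocol. `cache_key`, `SharedCache`, and the in-memory dict standing in for a shared backend such as Redis are all hypothetical names introduced for the example.

```python
import hashlib


def cache_key(compiler_id: str, args: list[str], preprocessed_source: bytes) -> str:
    """Hash everything that determines the object file (illustrative only)."""
    h = hashlib.sha256()
    for part in (compiler_id.encode(), "\0".join(args).encode(), preprocessed_source):
        # Length-prefix each part so different splits cannot collide.
        h.update(len(part).to_bytes(8, "little"))
        h.update(part)
    return h.hexdigest()


class SharedCache:
    """In-memory stand-in for a store reachable by every buildbot."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compile(self, key, compile_fn):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        obj = compile_fn()  # only runs when no machine has built this yet
        self._store[key] = obj
        return obj


if __name__ == "__main__":
    cache = SharedCache()
    key = cache_key("clang-15", ["-O2", "-c"], b"int main() { return 0; }")
    cache.get_or_compile(key, lambda: b"object-built-on-machine-A")
    # A second machine with identical inputs reuses the first result.
    cache.get_or_compile(key, lambda: b"object-built-on-machine-B")
    print(cache.hits, cache.misses)
```

With a per-machine cache, the second lookup would have been a miss on a different buildbot; sharing the store is what turns it into a hit.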

Go distributed

  • Distributed compiling

    • sccache running at cluster scale (sccache can do both caching and distributed compiling)
    • Unify toolchain
    • Cross compiling (build wheels for all platforms/archs on Linux/x86_64 machines)
      • Will continue to support native compiling workflow on Windows & macOS
    • Could enable a CI job with a build matrix of different CMake options (#5924)
    • Resolves: Slow build, cluster utilization
  • Distributed testing

    • Long term, will only start if necessary
    • Long-running daemons steal testing work from a distributor
    • Compile taichi kernels, and then run, on different machines
      • Can speed up and test offline cache at the same time
    • Resolves: Slow build, cluster utilization
  • Runner roles

    • Build runners coexist with test runners on same machine
    • Build runners run tasks with higher nice (lower priority)
    • Resolves: Cluster utilization
  • Migrate to dagger (or other similar solutions)

    • Low priority, may not implement this time
    • Resolves:
      • Strong coupling with GitHub Actions
      • Ability to run pipelines locally
      • Non-reproducible building environment
  • Drop github hosted runners

    • Resolves: Strong coupling with GitHub Actions, slow build
  • Condense infrastructure state into Ansible code

    • Install common tools, CUDA, docker
    • Configure GitHub Action runners
    • sccache and redis infrastructures
    • Resolves: Non-reproducible building environment
  • Containerize all the things!

    • Linux & Windows+CPU containers are already done
    • Windows+GPU process isolation container
    • macOS on Apple M1 containers may not be feasible
    • macOS x86 containers by sickcodes/Docker-OSX
    • Resolves: Non-reproducible building environment
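
The "daemons stealing testing work from a distributor" idea from the distributed testing item can be sketched with a shared queue. This is an in-process stand-in, assuming threads in place of daemons on separate machines; `run_distributed` and the worker names are hypothetical, and the test execution itself is a placeholder.

```python
import queue
import threading


def run_distributed(tests, worker_names):
    """Each worker pulls the next pending test as soon as it is free."""
    pending = queue.Queue()
    for test in tests:
        pending.put(test)

    results = {}
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                test = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to take
            # Placeholder: a real daemon would run the test here.
            with lock:
                results[test] = name

    threads = [threading.Thread(target=worker, args=(n,)) for n in worker_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    assignment = run_distributed(
        [f"test_{i}" for i in range(8)], ["runner-a", "runner-b", "runner-c"]
    )
    for test, runner in sorted(assignment.items()):
        print(test, "->", runner)
```

Pull-based scheduling like this keeps fast machines busy without the distributor having to predict how long each test takes.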

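The runner-roles item (build tasks run with a higher nice value so co-located test runners win the CPU) could look roughly like this on a POSIX host. It is a sketch under that assumption; `run_build_task` is a hypothetical helper, not part of any existing runner setup.

```python
import os
import subprocess
import sys


def run_build_task(cmd, niceness=10):
    """Run cmd with its nice value raised by `niceness` (POSIX only)."""

    def lower_priority():
        # Executed in the child process just before exec.
        os.nice(niceness)

    return subprocess.run(
        cmd,
        preexec_fn=lower_priority,
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # The child reports its own nice value so the effect is visible.
    probe = [sys.executable, "-c", "import os; print(os.nice(0))"]
    print("child niceness:", run_build_task(probe).stdout.strip())
```
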
Miscellaneous

  • Buildbot tags cleanup
  • Issues/Actions dashboard
  • buildbot resource monitoring

Non-goals

Virtualization &&/|| Kubernetes integration

  • Both are fun tasks that are not on the critical path; leaving them to (for now non-existent) interns

feisuzhu added the enhancement, discussion, and ci labels on Oct 26, 2022
feisuzhu self-assigned this on Oct 26, 2022
taichi-gardener moved this to Untriaged in Taichi Lang on Oct 26, 2022

imcom commented Oct 27, 2022

Dagger now provides a Go SDK, so developers can use Go to describe their CI/CD.

neozhaoliang moved this from Untriaged to In Progress in Taichi Lang on Oct 28, 2022

feisuzhu commented Nov 7, 2022

According to the documentation, GPU containerization on Windows is not feasible for now (CUDA & OpenGL are required).
Moving all workloads to bare metal on Windows.

feisuzhu added a commit that referenced this issue Nov 30, 2022
Issue: #6445

### Brief Summary

This PR addresses quirks in `sccache` distributed compiling:

1. `sccache` runs preprocessing locally (`clang -E`) first and then
sends the expanded code to the remote build server. This can trigger spurious
warnings that do not appear when compiling locally.
2. The compile host and target can differ, so `--target=` must be sent to
the remote build server.
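
The second quirk can be illustrated with a small argument-rewriting helper. This is not the change the PR made. It only sketches the idea of making the target explicit before a command line leaves the local machine. `ensure_target` and the example triple are hypothetical; the only compiler facts relied on are that clang accepts `--target=<triple>` and `-target <triple>`.

```python
def ensure_target(args, default_triple):
    """Return args with an explicit --target= flag if none is present."""
    has_target = any(a.startswith("--target=") or a == "-target" for a in args)
    if has_target:
        return list(args)
    # Without this, a remote server with a different host triple would
    # silently compile for its own platform.
    return [*args, f"--target={default_triple}"]


if __name__ == "__main__":
    cmd = ["clang++", "-O2", "-c", "kernel.cpp"]
    print(ensure_target(cmd, "x86_64-unknown-linux-gnu"))
```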

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>