Zack Galbreath edited this page Jul 14, 2023 · 1 revision

Attendees

  • Aashish Chaudhary
  • Alec Scott
  • Dan LaManna
  • Jacob Nesbitt
  • Mike VanDenburgh
  • Massimiliano Culpo
  • Ryan Krattiger
  • Scott Wittenburg
  • Todd Gamblin
  • Tamara Grimmett
  • Zack Galbreath

Cluster Upgrades

  • This week we upgraded EKS to v1.27 and karpenter to v0.29. These upgrades went well, with few surprises and minimal downtime.
  • Next week we plan to upgrade gitlab.spack.io to the latest patch release (v16.1.2).

Metrics & Dashboarding

  • Ryan's PR to add more fine-grained timers to spack install is going well. We are hoping to merge it soon.
    • We are also working on ingesting this new data into OpenSearch and using it to publish new Grafana dashboards.
  • Jake and Alec are going to meet next week to discuss strategies for storing CI metrics more centrally.
  • cache.spack.io now shows results for our weekly snapshot mirrors.

CI Status

  • Scott opened PR #38866 to unconditionally run the protected-publish job in our protected pipelines. This will fix the problem where the top-level mirror is not always up-to-date with the results from the individual stack-specific mirrors.
  • Scott also discovered that many of the no-binary-for-spec failures we've seen lately may be due to a DeleteOldObjects lifecycle policy that was configured to delete objects from the PR mirror after 14 days. This has since been disabled.

Buildcache Pruning

  • Ryan discovered that his pruning script was sometimes receiving incomplete results from the GitLab API. He is updating the script to query GitLab's database directly for the list of jobs to fetch.

Other topics

  • Alec will be working with a student this summer to investigate job scheduling & performance in our GitLab CI pipelines.

Priorities

  • Finish the timing data PR and start working on the follow-up dashboards
  • Upgrade gitlab.spack.io to the latest patch release
  • Migrate GitLab's MinIO volume from gp2 to gp3
  • Manually run the pruning script for our develop buildcache and continue working on automating this task.
  • Investigate why our gitlab sidekiq pods die and get restarted somewhat frequently. Perhaps increasing resource requests will reduce this error rate?
  • Update gitlab.spack.io to use S3 and ElastiCache rather than MinIO and Redis.
  • Update the sync script to merge topic branches against their base branch instead of assuming that it is always develop (necessary for release branch PRs).