Skip to content

refactor: AE-392: refactored implementation of several modules#9

Merged
pandyamarut merged 7 commits into
mainfrom
dean/refactor-3
Apr 16, 2025
Merged

refactor: AE-392: refactored implementation of several modules#9
pandyamarut merged 7 commits into
mainfrom
dean/refactor-3

Conversation

@deanq
Copy link
Copy Markdown
Member

@deanq deanq commented Apr 16, 2025

I suggest you review this per commit, and not per file. It makes it easier to understand the intentions.

Refactored modules:

  • client
  • resource manager
  • serverless execution
  • protos and stubs

Corrections for example files

@deanq deanq requested a review from pandyamarut April 16, 2025 07:06
Copy link
Copy Markdown
Contributor

@pandyamarut pandyamarut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/LGTM

@pandyamarut pandyamarut merged commit fa4a429 into main Apr 16, 2025
@deanq deanq deleted the dean/refactor-3 branch April 16, 2025 18:28
TimPietruskyRunPod added a commit that referenced this pull request Mar 5, 2026
- Fix ADA_80_PRO VRAM to 80-94GB (H100 NVL is 94GB)
- Add flash deploy --app flag example
- Add gotcha #9: runsync 60s timeout vs cold starts
deanq added a commit that referenced this pull request Jun 4, 2026
…sync; catch modified data files

Henrik's delta review on PR #345.

Henrik #6 — ALLOWED was hand-synced with release-please-config.json
- Previous regex hardcoded the four files release-please touches today.
  Asymmetric failure mode is safe (out-of-sync -> bypass doesn't fire ->
  full CI runs) but the next person to add an `extra-files` entry to
  release-please-config.json would silently break the bypass for bot PRs.
- Derive the file list dynamically via jq from release-please-config.json's
  `.packages["."]["extra-files"]`, append the three implicit files
  (CHANGELOG.md, manifest, pyproject.toml from release-type=python). Use
  `grep -Fxvf <(allowed_files)` for fixed-string exact-line matching --
  removes the regex-escaping concern entirely.

Henrik #7 — `uv run` undid the dev-only sync
- `uv run` defaults to syncing the environment before each invocation, so
  the steps after `uv sync --only-group dev --frozen` would pull in the
  full project on each ruff call -- defeating the "no project install"
  premise of pre-check.
- Add `--no-sync` to both ruff invocations. ruff runs from the existing
  dev-only venv directly; pre-check stays fast.

Henrik #8 — `--diff-filter=A` missed modified non-Python files
- Edits to existing packaged data files (e.g. future tweaks to
  runpod_flash/rules/AGENTS.md) wouldn't trigger should_build, so
  validate-wheel.sh wouldn't smoke-test the modified content. Packaging
  metadata wouldn't change, but the wheel content would.
- Switch to `--diff-filter=AM` (Added or Modified). Cheap and correct.

Henrik #9 (fetch-depth: 0 over-broad) intentionally not addressed -- Henrik
flagged it as "not a real cost issue."

Henrik #4 (make dev entry-point) re-verified locally: `rm -rf .venv &&
make dev && .venv/bin/flash --version` -> Runpod Flash CLI v1.16.0.

https://linear.app/runpod/issue/AE-3161
deanq added a commit that referenced this pull request Jun 4, 2026
* ci: dedupe quality gates and add fast-fail pre-check (AE-3161)

Eliminates redundant CI/CD work that was costing ~24 min of compute per
merge and ~28 min per release-please bot PR push.

Workflow changes:
- ci.yml: add concurrency group (cancel in-flight PR runs), paths-ignore
  for release-please bot PRs, new pre-check job (uvx ruff format+check,
  ~10s no-install fast-fail), bump setup-uv v2 -> v5, build job only on
  push:main since release-please.yml builds + publishes the same SHA.
- release-please.yml: remove duplicate quality-gates job. Branch protection
  on main already enforces ci.yml green before code lands, so the second
  matrix validated nothing. Saves a full 4-leg matrix per merge.
- e2e.yml: drop unit-tests job (duplicated ci.yml quality-gates), fold the
  summary step into the e2e job to remove the third workflow definition of
  the same install setup.

Makefile changes:
- dev: drop redundant 'uv pip install -e .' (uv sync handles editable
  install in workspace mode).
- quality-check: alias ci-quality-github so there is one canonical CI
  quality gate, no drift between local and CI.
- ci-quality-github / test-coverage: fix the latent junit overwrite bug
  where both pytest invocations wrote pytest-results.xml -- now writes
  pytest-results-parallel.xml and pytest-results-serial.xml. The double
  pytest pass is retained because it provides process isolation between
  parallel and serial tests; collapsing to a single loadgroup run
  surfaced state pollution in test_load_balancer_sls_stub.

Note: removing quality-gates from release-please.yml + paths-ignore for
the bot's files means the release-please PR will have no required CI
check. Branch protection on main will need to allow missing checks for
the release-please--* branch pattern, or add a no-op sentinel job named
to match the protection rule.

https://linear.app/runpod/issue/AE-3161

* ci: address QA — fix skip cascade, add packaging guard, sentinel for release-please bot

Addresses @runpod-Henrik (PASS WITH NITS) and Copilot review on PR #345.

Henrik #3 / Copilot inline ci.yml:50 — skip cascade
- Previously `pre-check.if` skipped on release-please bot PRs, and
  `quality-gates needs: [pre-check]` cascaded to skip too, leaving any
  release-please PR that touched non-metadata files with zero CI signal.
- Drop the workflow-level `paths-ignore` (it interacted badly with branch
  protection's required-check semantics anyway) and instead put explicit
  `if: !startsWith(github.head_ref, 'release-please--')` on pre-check,
  quality-gates, packaging-guard, and build. Symmetric, auditable.

Henrik #1 + #6 — branch-protection compatibility (release-please bot PRs)
- Add `release-please-sentinel` job with matching matrix + name "Quality
  Gates" so it produces the same "Quality Gates (3.10)"…"(3.13)" check
  names that branch protection requires, but auto-passes. Branch
  protection on main can now require those checks without admins having
  to either loosen the rule or hand-craft an external sentinel workflow.

Henrik #2 — PR-time wheel build is dropped
- Reintroduce build on PRs but gated by `packaging-guard` job that
  inspects `git diff` against the PR base. Build runs when:
    - pyproject.toml / Makefile / MANIFEST.in / scripts/validate-wheel.sh
      changed, OR
    - any new non-Python file was added under src/ (data files need
      explicit package-data inclusion or they're silently dropped).
- Push to main always builds.

Henrik #4 — `make dev` entry-point sanity check
- Verified locally: `rm -rf .venv && make dev && .venv/bin/flash --version`
  prints "Runpod Flash CLI v1.16.0". `uv sync` registers the entry point
  for the workspace project; the dropped `uv pip install -e .` was
  redundant. No code change needed.

Henrik #5 / Copilot Makefile:108 — misleading "plain output" comment
- Rewrite the comment block above `quality-check`. The aliases do NOT
  produce plain output -- local invocations see GitHub Actions
  annotation markers (::group::, --output-format=github) which are
  cosmetically noisy but functionally inert. Comment now reflects this.

Henrik misc nit — e2e.yml `unit-tests` job removal
- Add a leading comment on the `e2e` job explaining why the duplicate
  unit-tests job was dropped, so a future maintainer doesn't re-add it
  thinking it was lost by accident.

https://linear.app/runpod/issue/AE-3161

* ci: close sentinel-bypass hole and pin pre-check ruff to lockfile

Addresses Copilot's second review on PR #345.

Copilot ci.yml:42 — sentinel-bypass vulnerability
- Previous sentinel gated only on `startsWith(head_ref, 'release-please--')`.
  A contributor could open a PR from any branch named with that prefix and
  inherit the sentinel's auto-passes for "Quality Gates (3.10..3.13)" while
  pre-check / quality-gates / build all skipped via the same `if:` guard.
  Net: CI fully bypassed, branch protection satisfied, code unreviewed.
- Replace the three independent `if:` gates with a single `guard` job that
  inspects the actual PR diff and classifies the run as either
  `release-please-bypass` (branch prefix AND diff is exactly CHANGELOG.md +
  .release-please-manifest.json) or `full`. `release-please-sentinel` only
  fires when classification is bypass; pre-check / quality-gates / build
  fire only when classification is full. A contributor naming a branch
  release-please-- but changing code falls through to full CI.
- Also folds the previous packaging-guard logic into the same job; build
  reads `should_build` from the unified guard output.

Copilot ci.yml:59 — ruff version drift in pre-check
- `uvx ruff …` installed the latest ruff at runtime, which can drift from
  the locked version (uv.lock pins ruff 0.14.14) used by
  `make ci-quality-github`. That produces flaky "passes locally / fails CI"
  (or vice versa) verdicts when ruff cuts a release between runs.
- Pre-check now runs `uv sync --only-group dev --frozen` then `uv run ruff
  …`. Pulls the exact pinned version; warm-cache install is ~1s, cold is
  a few seconds. Fast-fail premise preserved; determinism gained.

https://linear.app/runpod/issue/AE-3161

* ci: replace release-please sentinel with single Validation aggregator

Borrowed from runpod-python (ci.yml:105) -- one required check that
inspects upstream `needs.*.result` and treats "success" or "skipped" as
green. Much cleaner than the previous sentinel-with-matching-matrix-names
hack:

- Drops `release-please-sentinel` job entirely. No more duplicate
  `Quality Gates` display name. No more "skipped Quality Gates" cosmetic
  row alongside the real four matrix legs on every regular PR.
- For release-please bot bypass runs: guard classifies bypass ->
  pre-check / quality-gates / build all skip -> `validation` aggregates
  ("skipped" treated as ok) -> green.
- For regular PRs: everything runs -> `validation` aggregates -> green.
- For any genuinely failed or cancelled job: `validation` exits 1 ->
  branch protection blocks merge.

Future matrix changes (add 3.14, drop 3.10) or new jobs (security scans,
etc.) no longer require touching branch protection -- only the `needs:`
list and the results array in this aggregator.

Requires branch protection on `main` to be reconfigured before merge:
- Remove required checks: `CI / Quality Gates (3.10)`, `(3.11)`, `(3.12)`,
  `(3.13)`
- Add required check: `CI / Validation`

https://linear.app/runpod/issue/AE-3161

* ci: classify metadata-only push:main commits as release-please-bypass

Closes the last remaining redundancy in the lifecycle. When release-please's
release PR is merged, the resulting commit on main only changes CHANGELOG.md
and .release-please-manifest.json -- the same code was already validated
when the underlying feature PRs merged. Previously ci.yml ran the full
4-version matrix + build on those merge commits (~7 min compute per release).

Now `guard` classifies push:main by diff:
- Metadata-only diff -> release-please-bypass -> pre-check / quality-gates /
  build all skip, validation aggregates "skipped" as success.
- Any non-metadata change (feature merges, admin direct pushes) -> full.

release-please.yml's pypi-publish builds its own wheel before publishing,
so skipping ci.yml's build on release commits doesn't blind us to wheel
regressions.

https://linear.app/runpod/issue/AE-3161

* ci: align bypass classifier with release-please-config; use uv.lock for cache

Addresses Copilot's third review on PR #345.

Copilot ci.yml:66 + ci.yml:88 — bypass missing release-please-managed files
- release-please-config.json declares release-type=python (auto-bumps
  pyproject.toml) and extra-files=src/runpod_flash/__init__.py. Real bot
  PRs touch 4 files, not 2. Previous ALLOWED regex missed pyproject.toml
  and __init__.py, so every real release-please bot PR was misclassified
  as `full`. Entire bypass optimization was broken for the actual case
  it was designed for.
- Extend ALLOWED to all four files. Add a comment pointing back to
  release-please-config.json so the regex stays in sync.
- Add defense-in-depth: PR classifier now also requires
  github.event.pull_request.user.login == 'runpod-release-please-bot[bot]'
  (the GitHub App account). Branch prefix + bot author + file content all
  required. A contributor spoofing the branch name fails the author check;
  a contributor with write access to the bot would already be trusted.
- Push:main classifier keeps content-only check (admin can squash-merge
  the bot's PR by hand, so actor isn't a reliable signal on push).

Copilot ci.yml:164 + ci.yml:193 + release-please.yml:61 + e2e.yml:39
— setup-uv cache keyed on pyproject.toml
- uv sync is lockfile-driven; pyproject.toml-only cache key means uv.lock
  changes (most common dependency update) don't invalidate the cache,
  causing stale-deps cache hits.
- Switch cache-dependency-glob to uv.lock everywhere. pre-check already
  used uv.lock; this aligns the rest.

Copilot Makefile:107 — misleading "aliases" comment
- quality-check-strict is NOT an alias for ci-quality-github -- it runs a
  different set (adds typecheck, plain output, no junit). Comment block
  rewritten to enumerate each target individually so maintainers don't
  conflate them.

https://linear.app/runpod/issue/AE-3161

* ci: derive bypass file list from release-please-config; pre-check no-sync; catch modified data files

Henrik's delta review on PR #345.

Henrik #6 — ALLOWED was hand-synced with release-please-config.json
- Previous regex hardcoded the four files release-please touches today.
  Asymmetric failure mode is safe (out-of-sync -> bypass doesn't fire ->
  full CI runs) but the next person to add an `extra-files` entry to
  release-please-config.json would silently break the bypass for bot PRs.
- Derive the file list dynamically via jq from release-please-config.json's
  `.packages["."]["extra-files"]`, append the three implicit files
  (CHANGELOG.md, manifest, pyproject.toml from release-type=python). Use
  `grep -Fxvf <(allowed_files)` for fixed-string exact-line matching --
  removes the regex-escaping concern entirely.

Henrik #7 — `uv run` undid the dev-only sync
- `uv run` defaults to syncing the environment before each invocation, so
  the steps after `uv sync --only-group dev --frozen` would pull in the
  full project on each ruff call -- defeating the "no project install"
  premise of pre-check.
- Add `--no-sync` to both ruff invocations. ruff runs from the existing
  dev-only venv directly; pre-check stays fast.

Henrik #8 — `--diff-filter=A` missed modified non-Python files
- Edits to existing packaged data files (e.g. future tweaks to
  runpod_flash/rules/AGENTS.md) wouldn't trigger should_build, so
  validate-wheel.sh wouldn't smoke-test the modified content. Packaging
  metadata wouldn't change, but the wheel content would.
- Switch to `--diff-filter=AM` (Added or Modified). Cheap and correct.

Henrik #9 (fetch-depth: 0 over-broad) intentionally not addressed -- Henrik
flagged it as "not a real cost issue."

Henrik #4 (make dev entry-point) re-verified locally: `rm -rf .venv &&
make dev && .venv/bin/flash --version` -> Runpod Flash CLI v1.16.0.

https://linear.app/runpod/issue/AE-3161

* ci: content-shape bypass check, conditional build on push, e2e summary table fix

Copilot fourth review on PR #345.

Copilot ci.yml:98 — pyproject.toml in ALLOWED could let non-version edits
slip through the bypass
- A push:main touching only the metadata filenames could smuggle
  non-version edits to pyproject.toml (deps, ruff config, scripts) since
  the previous check was filename-only. Push:main can't rely on actor
  identity (admins can squash-merge by hand), so the file-list check was
  the only gate.
- Add `verify_version_shape` helper: enforces that any pyproject.toml
  change is a single `version = "..."` line, and any
  src/runpod_flash/__init__.py change is a single `__version__ = "..."`
  line. Applies to both push:main and PR bypass paths. Tested locally
  against simulated bot and malicious diffs.

Copilot ci.yml:103 — `should_build=true` unconditional on push:main
- Previous code only computed should_build content-based on PR full-mode;
  push:main non-bypass always built. Contradicted the documented intent
  and cost ~1 min per feature merge.
- Factor the content heuristic into `compute_should_build`; use it for
  both push:main full and PR full paths. Same allow-list (pyproject.toml,
  Makefile, MANIFEST.in, validate-wheel.sh) plus AM-filter on non-Python
  files under src/.

Copilot e2e.yml:129 — malformed markdown table when JUnit file missing
- Python ternary `... if duration is not None else "- |"` had the wrong
  scope: when duration was None, the *entire* print expression became
  the string `"- |"`, breaking the table.
- Extract `_num()` and `_dur()` helpers that placehold "-" per field;
  build the row in one f-string. Table stays valid when the suite
  didn't run.

Copilot Makefile:116 (NOT changed) — claim was that aliasing quality-check
to ci-quality-github drops `--cov-fail-under=65` enforcement. Verified
locally: the serial pass in ci-quality-github inherits --cov-fail-under=65
from pyproject.toml's addopts (only the parallel pass overrides to 0).
Local `make quality-check` still hits the 65% gate at the end of the
serial pass; observed "Required test coverage of 65% reached. Total
coverage: 85.96%" on this branch. Will reply on the thread, not change code.

https://linear.app/runpod/issue/AE-3161
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants