Skip to content

refactor: separate .env from deployed endpoint env vars#257

Open
deanq wants to merge 11 commits intomainfrom
deanq/ae-1549-explicit-env-vars
Open

refactor: separate .env from deployed endpoint env vars#257
deanq wants to merge 11 commits intomainfrom
deanq/ae-1549-explicit-env-vars

Conversation

@deanq
Copy link
Member

@deanq deanq commented Mar 5, 2026

Summary

  • Stop auto-carrying .env file contents to deployed endpoints
  • ServerlessResource.env defaults to None instead of reading .env via get_env_vars()
  • Delete EnvironmentVars class and environment.py
  • Stop stripping RUNPOD_API_KEY from explicit resource env in manifest
  • Fix flash run hot-reload not propagating env changes (module cache busting)
  • Fix CPU config_hash to include env for drift detection
  • Preserve platform-injected env vars (PORT, FLASH_HOST, etc.) during template updates

Motivation

.env files were implicitly dumped into every deployed endpoint, causing:

  • Platform-injected vars (PORT, PORT_HEALTH) overwritten on template updates
  • False config drift from runtime var injection into self.env
  • User confusion about what actually reaches deployed workers

How it works now

  • .env still populates os.environ via load_dotenv() for CLI and get_api_key() usage
  • To pass env vars to deployed endpoints, declare them explicitly: env={"HF_TOKEN": os.environ["HF_TOKEN"]}
  • Flash injects runtime vars (RUNPOD_API_KEY, FLASH_MODULE_PATH) into template.env automatically
  • On env changes, update() reads the live template via get_template and preserves platform-injected vars
  • flash run hot-reload properly reimports user modules with updated env values

Breaking change

ServerlessResource.env no longer defaults to .env file contents. Users relying on implicit carryover must add explicit env={} to their resource configs.

Upgrade note

Existing deployed endpoints have .env contents baked into resource.env. After upgrading, env defaults to None, so the config hash changes. flash deploy will flag all previously-deployed endpoints as drifted and trigger redeployment on the first deploy after upgrading. This is expected and harmless — the endpoints will be redeployed with only explicitly declared env vars.

Companion PR

Test plan

  • All 2533 tests pass (85.4% coverage)
  • Format and lint clean
  • Platform env preservation verified against real Runpod API (get_template query)
  • User-removed env vars correctly not resurrected
  • CPU config_hash includes env for drift detection
  • Module cache busting in generated server.py for hot-reload

@deanq deanq requested a review from Copilot March 5, 2026 19:13
@deanq deanq force-pushed the deanq/ae-1549-explicit-env-vars branch from 5b765f6 to be4269a Compare March 5, 2026 19:20
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@deanq deanq requested a review from Copilot March 5, 2026 22:53
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

src/runpod_flash/cli/utils/env_preview.py:1

  • The _SECRET_PATTERN regex matches any key containing TOKEN as a substring, so TOKENIZER_PATH gets masked even though it's not a secret. This is a false positive — environment variables like TOKENIZER_PATH, TOKEN_COUNT, or BUCKET_NAME (no match, but illustrative) could be misleadingly masked. The test at line 52-55 in test_env_preview.py confirms this behavior as intended, but the docstring says "TOKEN in TOKENIZER should still match" which suggests this was a deliberate choice. However, this will surprise users who have non-secret vars with these substrings. Consider using word-boundary matching (e.g., r"\b(KEY|TOKEN|SECRET|PASSWORD|CREDENTIAL)\b" or r"(^|_)(KEY|TOKEN|SECRET|PASSWORD|CREDENTIAL)($|_)") to avoid masking keys like TOKENIZER_PATH or TOKENIZER_CONFIG.
"""Deploy-time env preview: show what env vars go to each endpoint."""

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Clean, well-motivated change. The core fix (env defaults to None instead of implicitly loading .env) removes a real footgun. Good test coverage in test_env_separation.py. One confirmed bug and two nits below.


Bug: preview/deploy parity broken — makes_remote_calls default mismatch

_do_deploy calls _check_makes_remote_calls() which defaults to True in all failure cases:

# serverless.py:583
makes_remote_calls = resource_config.get("makes_remote_calls", True)
# also: "Manifest not found, assuming makes_remote_calls=True"
# also: "Resource not in manifest, assuming makes_remote_calls=True"

collect_env_for_preview uses the opposite default:

# env_preview.py
config.get("makes_remote_calls", False)  # ← should be True

On first deploy (no prior build/manifest), the preview shows no RUNPOD_API_KEY for QB endpoints but deploy injects it. The preview is wrong exactly when it matters most — the first time a user runs flash deploy.

Fix: config.get("makes_remote_calls", True) in env_preview.py to match deploy behaviour.


Nit: env={} explicit empty dict won't appear in manifest

manifest.py uses:

if hasattr(resource_config, "env") and resource_config.env:

An explicit env={} is falsy so it's silently skipped. Probably the right behaviour (nothing to store), but test_env_separation.py::test_serverless_resource_env_explicit_empty_dict_preserved only checks the resource object — not the manifest path. Worth a test or a comment to make the intent explicit.


Nit: _MASK_VISIBLE_CHARS = 6 exposes key type prefix

RUNPOD_API_KEY values start with rp_ — 6 visible chars reveals the full type prefix (rp_xxx). Not a security issue since the prefix is public, but worth being intentional about if multiple key formats are ever supported.


Positives

  • Removing env_dict.pop("RUNPOD_API_KEY", None) in manifest.py is correct — user-explicit keys should be preserved, not silently stripped
  • except Exception: logger.debug(...) around render_env_preview in deploy.py is the right call — preview failure must not block deploy
  • test_env_separation.py covers the _configure_existing_template None case, which was the trickiest correctness question
  • Companion flash-examples PR (#39) is the right process for a breaking change

Verdict: PASS with fix — the makes_remote_calls default is a one-line fix but should land before merge to avoid shipping a misleading preview.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

@deanq deanq force-pushed the deanq/ae-1549-explicit-env-vars branch 2 times, most recently from 7a21f55 to 1bb160c Compare March 12, 2026 19:12
Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. Core change — clean and correct

env default changing from Field(default_factory=get_env_vars) to Field(default=None) is the right fix. Template creation falls back to self.env or {}, so None and {} both produce an empty template env. EnvironmentVars / environment.py deletion is complete — no remaining references.

The hash behavior is correct: exclude_none=True means env=None (default) is excluded from the drift hash, and setting env={"HF_TOKEN": "..."} correctly triggers drift detection.

2. Issue: all existing deployed endpoints will drift on upgrade

This is an expected consequence of the breaking change, but it should be in the upgrade notes: any endpoint deployed with the old SDK had .env contents baked into resource.env. After upgrading, env defaults to None, so the hash changes for every endpoint that previously had env vars → flash deploy will flag all of them as drifted and redeploy. Users should be warned to expect forced redeployments on their first deploy after upgrading.

3. Question: preview's makes_remote_calls default vs actual injection default

In collect_env_for_preview:

if not is_lb and config.get("makes_remote_calls", False):   # default: False

In _inject_runtime_template_vars_check_makes_remote_calls:

log.debug("Manifest not found, assuming makes_remote_calls=True")  # default: True

If a resource's manifest entry lacks the makes_remote_calls key, the preview shows no RUNPOD_API_KEY while the actual deploy injects it. Worth confirming that the scanner always writes makes_remote_calls explicitly for every resource — if so, this default never fires and the preview is accurate. If not, the preview undersells what gets deployed.

4. Testing — solid

267-line test_env_preview.py covers masking, collect_env_for_preview, and rendering. 99-line test_env_separation.py covers the field default, explicit dict preservation, and template creation. The test_exactly_six_chars_fully_masked test correctly captures that 6-char values are fully masked rather than showing abcdef...****.

One gap: no test for the "user explicitly sets RUNPOD_API_KEY in env={}" path — the preview would show it as "user" source but the old code stripped it. Not a bug (the behavior is intentional now), but a test would lock in the intent.

Verdict

PASS with one upgrade-note item (item 2) and one question to confirm (item 3). The separation is clean, the deletion is thorough, and the preview table is well-tested.

@deanq deanq force-pushed the deanq/ae-1549-explicit-env-vars branch from 1bb160c to e3f74d4 Compare March 13, 2026 16:42
deanq added a commit that referenced this pull request Mar 18, 2026
- Remove render_env_preview call from deploy command
- Delete env_preview.py module and its tests (no other consumers)

Addresses PR #257 feedback: table was visual clutter
deanq added a commit that referenced this pull request Mar 18, 2026
- Remove render_env_preview call from deploy command
- Delete env_preview.py module and its tests (no other consumers)

Addresses PR #257 feedback: table was visual clutter
@deanq deanq force-pushed the deanq/ae-1549-explicit-env-vars branch from dd37adc to f3f1c59 Compare March 18, 2026 23:19
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

deanq added a commit that referenced this pull request Mar 19, 2026
- Remove render_env_preview call from deploy command
- Delete env_preview.py module and its tests (no other consumers)

Addresses PR #257 feedback: table was visual clutter
@deanq deanq force-pushed the deanq/ae-1549-explicit-env-vars branch from f3f1c59 to c1a98e2 Compare March 19, 2026 18:04
deanq added a commit that referenced this pull request Mar 19, 2026
- Use `is not None` instead of truthiness for env checks in serverless.py,
  serverless_cpu.py, and manifest.py so explicit env={} clears template env
- Add RUNPOD_API_KEY injection note to LB architecture doc
- Remove Claude-specific directive from plan doc
Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delta review — since 2026-03-12

From the 2026-03-06 review:

  • Bug: makes_remote_calls default mismatch in env_preview.py → RESOLVED. The env preview table was removed entirely (c1a98e2), so the mismatch no longer exists. The bug was valid at the time, but the decision to pull the preview table sidesteps it.
  • Nit: env={} skipped in manifest (falsy check) → RESOLVED. Latest commit (d4ecd776) changed and resource_config.env to and resource_config.env is not None in manifest.py. Consistent with the same fix applied to _configure_existing_template in serverless.py and serverless_cpu.py.
  • Nit: _MASK_VISIBLE_CHARS = 6 → MOOT. Preview table removed.

From the 2026-03-12 review:

  • Item 2: Upgrade drift warning — Still unaddressed. Any endpoint deployed with the old SDK has .env contents baked into resource.env. After upgrading, env defaults to None, the hash changes, and flash deploy flags every pre-existing endpoint as drifted. Should be in upgrade notes or a migration note in the PR body.
  • Item 3: makes_remote_calls default question → MOOT. Preview removed.

New concerns on current diff:

  • Issue: CPU config_hash silently ignores explicit env= changes. CpuServerlessEndpoint.config_hash hashes a fixed set of CPU fields and explicitly excludes env. Under the old model (env was a dynamic .env dump) this was reasonable. Now that env is user-declared intent, a user who adds env={"HF_TOKEN": "..."} to a CPU endpoint and redeploys will get no drift detection, no update, and no error. The base class correctly includes env in the hash; the CPU override silently drops it. If this is intentional (env changes don't trigger CPU re-deploy by design), it needs a comment explaining why. As written, it looks like an oversight.

  • PR body references env preview table that was removed. The summary still lists "Add deploy-time env preview table with secret masking" and the checklist has items for verifying preview behavior. The feature doesn't exist in the final diff. Worth cleaning up before merge to avoid confusing future readers.

Overall: The two bugs from the 2026-03-06 review are closed. The main open item is the CPU config_hash asymmetry — silent failure for a user-facing feature after a correctness fix. The upgrade drift note is still worth adding.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

@deanq deanq requested a review from jhcipar March 19, 2026 22:54
Copy link
Contributor

@KAJdev KAJdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from what im seeing that env drift detection bug is real, but only present in flash run. If you flash run, send a request, change the env, have it auto reload, then send another request, the env is not updated at all

I have verified this

2026-03-19 17:00:05,739 | INFO  | CpuLiveServerless:c6i34mfjfj9gct | API /run
2026-03-19 17:00:06,230 | INFO  | CpuLiveServerless:c6i34mfjfj9gct | Started Job:0be84094-e004-4f10-b15c-1edab200e234-u1
2026-03-19 17:00:06,330 | INFO  | Job:0be84094-e004-4f10-b15c-1edab200e234-u1 | Status: IN_QUEUE
2026-03-19 17:00:06,847 | INFO  | Job:0be84094-e004-4f10-b15c-1edab200e234-u1 | .
2026-03-19 17:00:08,425 | INFO  | Job:0be84094-e004-4f10-b15c-1edab200e234-u1 | ..
2026-03-19 17:00:14,093 | INFO  | Job:0be84094-e004-4f10-b15c-1edab200e234-u1 | ...
2026-03-19 17:00:32,335 | INFO  | Job:0be84094-e004-4f10-b15c-1edab200e234-u1 | ....
2026-03-19 17:00:53,938 | INFO  | Job:0be84094-e004-4f10-b15c-1edab200e234-u1 | .....
2026-03-19 17:01:04,267 | INFO  | Job:0be84094-e004-4f10-b15c-1edab200e234-u1 | Status: COMPLETED
2026-03-19 17:01:04,426 | INFO  | Worker:4tfsc74uch0vfj | Delay Time: 48063 ms
2026-03-19 17:01:04,427 | INFO  | Worker:4tfsc74uch0vfj | Execution Time: 42 ms
WARNING:  WatchFiles detected changes in '.flash/server.py'. Reloading...
2026-03-19 17:01:15,994 | INFO  | CpuLiveServerless:c6i34mfjfj9gct | API /run
2026-03-19 17:01:16,179 | INFO  | CpuLiveServerless:c6i34mfjfj9gct | Started Job:af0c02dd-19c1-42f7-a9ac-db3788d80435-u1
2026-03-19 17:01:16,280 | INFO  | Job:af0c02dd-19c1-42f7-a9ac-db3788d80435-u1 | Status: IN_PROGRESS
2026-03-19 17:01:16,483 | INFO  | Job:af0c02dd-19c1-42f7-a9ac-db3788d80435-u1 | Status: COMPLETED
2026-03-19 17:01:16,578 | INFO  | Worker:4tfsc74uch0vfj | Delay Time: 127 ms
2026-03-19 17:01:16,578 | INFO  | Worker:4tfsc74uch0vfj | Execution Time: 40 ms

(the edit was a change to the env)

I verified that the change didnt apply to the endpoint in the console.

deanq added 9 commits March 19, 2026 17:14
Change ServerlessResource.env default from get_env_vars() (reads .env)
to None. Delete EnvironmentVars class and get_env_vars(). Template
creation now uses self.env or {} instead of falling back to .env file.

.env still populates os.environ via load_dotenv() in __init__.py for
CLI and get_api_key() usage. This change only affects what gets sent
to deployed endpoints.
Replace TestServerlessResourceEnvLoading tests that imported deleted
get_env_vars/EnvironmentVars with tests that verify env defaults to
None and preserves explicit dicts.
With env separation, resource.env only contains user-declared vars.
If a user explicitly sets RUNPOD_API_KEY in their resource env, it
should be preserved. Runtime injection via _inject_runtime_template_vars()
handles the automatic case.
New module renders a Rich table per resource showing all env vars
that will be sent to deployed endpoints. User-declared vars shown
directly; flash-injected vars (RUNPOD_API_KEY, FLASH_MODULE_PATH)
labeled as 'injected by flash'. Secret values masked based on key
pattern matching (KEY, TOKEN, SECRET, PASSWORD, CREDENTIAL).

Wired into flash deploy: renders before provisioning so users see
exactly what goes to each endpoint.
Clarify that .env populates os.environ for CLI/local dev only.
Resource env vars must be explicit via env={} on resource config.
RUNPOD_API_KEY injected automatically for makes_remote_calls=True.
- Fix preview/deploy mismatch: LB endpoints no longer show
  RUNPOD_API_KEY injection in preview (matches _do_deploy behavior)
- Wrap render_env_preview in try-except so preview failures
  don't abort deployment
- Fix stale comment referencing .env file in serverless_cpu.py
- Correct drift detection doc: env is conditionally included
  in hash, not always excluded
- Fix LB architecture doc: LB endpoints get FLASH_MODULE_PATH,
  not RUNPOD_API_KEY
- Update config_hash docstring to reflect exclude_none behavior
- Add tests for render_env_preview, LB API key exclusion,
  and _configure_existing_template with env=None
- Compact env preview into single table with resource column to
  reduce terminal clutter
- Show "user" source label for user-declared vars instead of
  empty string
- Remove local filesystem paths from plan docs
- Fix design doc: env=None omits the key, not stores empty dict
- Update render tests for new table format
- Remove render_env_preview call from deploy command
- Delete env_preview.py module and its tests (no other consumers)

Addresses PR #257 feedback: table was visual clutter
- Use `is not None` instead of truthiness for env checks in serverless.py,
  serverless_cpu.py, and manifest.py so explicit env={} clears template env
- Add RUNPOD_API_KEY injection note to LB architecture doc
- Remove Claude-specific directive from plan doc
@deanq deanq force-pushed the deanq/ae-1549-explicit-env-vars branch from d4ecd77 to 5ac0bb6 Compare March 20, 2026 00:15
…eservation

- Include env in CPU config_hash so env-only changes trigger drift
  detection and redeployment (was silently excluded)
- Purge user modules from sys.modules in generated server.py so
  uvicorn hot-reload reimports them with updated env values
- Add get_template GraphQL query (podTemplate) to read live template
  state before updates
- Add _preserve_platform_env to carry forward platform-injected vars
  (FLASH_HOST, FLASH_PORT, PYTHONPATH, etc.) that the platform sets
  once at initial deploy and does not re-inject on saveTemplate
- Fix _configure_existing_template to use truthiness check so empty
  env={} does not clobber explicit template.env entries
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Update drift detection docs to reflect env included in CPU hash
- Fix _preserve_platform_env to not resurrect user-removed keys by
  comparing against old config env (old_env parameter)
- Add test_update_does_not_resurrect_user_removed_env_vars
- Remove --no-verify from plan doc test command
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants