Fix VRAM detection and Docker authentication in launcher script #156

Merged
toxicoder merged 1 commit into main from fix-vram-docker-auth-8399662171178761435 on Jan 9, 2026

@google-labs-jules
Contributor

This PR addresses two critical issues reported by users of the run-distributed-model.sh script:

  1. VRAM Detection Failure (Total: 0.00 GB): The script failed to detect VRAM on remote nodes because `nvidia-smi` was not in the PATH for non-interactive SSH sessions. The result was a warning and a detected total of 0.00 GB.

    • Fix: Prepend `PATH=$PATH:/usr/local/cuda/bin:/usr/bin` to the `nvidia-smi` SSH command. The awk parsing now also defaults to 0 on empty input.

  2. Docker Pull "Access Denied": The image pull failed on authentication, likely because the `NGC_API_KEY` was mishandled when embedded in the remote SSH command string (quoting/expansion issues).

    • Fix: Pipe the API key via stdin (`echo "$KEY" | ssh ...`) directly to `docker login --password-stdin`. The key is no longer embedded in the command arguments, which resolves the quoting issues and keeps the key out of the remote `ps` output.
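
For readers who want to see the shape of the stdin-based login, here is a minimal sketch. It is not the code from `_ensure_image_present`: the helper name `remote_docker_login`, the registry `nvcr.io`, and the literal user name `$oauthtoken` (the usual NGC convention) are assumptions made for illustration.

```shell
# Sketch only: remote_docker_login is a hypothetical helper, not the
# function used in run-distributed-model.sh.
# Assumptions: the registry is nvcr.io and the user name is the literal
# string $oauthtoken; adjust both if your setup differs.
# Usage: remote_docker_login <ssh-host>
remote_docker_login() {
  # The key is written to ssh's stdin, which ssh forwards to the remote
  # command, so docker reads it through --password-stdin.
  # The remote command string is single-quoted locally and contains no
  # secret; the escaped inner quotes keep $oauthtoken literal on the
  # remote shell as well.
  printf '%s\n' "$NGC_API_KEY" |
    ssh "$1" 'docker login nvcr.io --username '\''$oauthtoken'\'' --password-stdin'
}

# Example call (commented out because it needs a reachable host):
# NGC_API_KEY=... remote_docker_login my-worker-node
```

Because the key never becomes part of the command string, characters such as `$` or quotes inside the key need no extra escaping, and the key stays out of the remote argument list.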

Tests Verified:

  • dgx-spark/tests/test_run_distributed_model.sh passed, covering VRAM detection logic, quoting checks, and general execution flow.
  • Manual verification of the logic confirms it addresses the reported root causes.

PR created automatically by Jules for task 8399662171178761435 started by @toxicoder

- Updates `_check_vram_requirements` to explicitly add `/usr/local/cuda/bin:/usr/bin` to PATH when running `nvidia-smi` via SSH.
- Adds `+0` to VRAM awk parsing to correctly handle empty output as 0 instead of empty string.
- Refactors `_ensure_image_present` (both head pull and fallback download) to pass `NGC_API_KEY` via stdin pipe to SSH/docker login. This fixes "Access Denied" errors caused by improper quoting/expansion of the key in the command string and improves security by hiding the key from the remote process list.
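
The first two bullets can be illustrated with a small sketch. It is not the body of `_check_vram_requirements`: the helper name `sum_mib`, the `nvidia-smi` query flags, and the `$node` placeholder are assumptions chosen for the example.

```shell
# Sketch only: sum_mib is a hypothetical helper.
# Input: one memory value in MiB per line. The "+ 0" forces a numeric
# result, so empty input prints 0 rather than an empty string.
sum_mib() {
  awk '{ sum += $1 } END { print sum + 0 }'
}

# No output from the remote side (for example, nvidia-smi not found).
printf '' | sum_mib                 # prints 0

# Two GPUs reporting 24576 MiB each.
printf '24576\n24576\n' | sum_mib   # prints 49152

# Assumed shape of the remote query. Single quotes defer $PATH expansion
# to the remote shell, where the extra directories are appended.
remote_vram_cmd='PATH=$PATH:/usr/local/cuda/bin:/usr/bin nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits'
# ssh "$node" "$remote_vram_cmd" | sum_mib
```

Without the `+ 0`, `print sum` on empty input emits an empty line, which is the empty-string case the second bullet describes.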
@google-labs-jules
Contributor Author

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@toxicoder toxicoder marked this pull request as ready for review January 9, 2026 21:36
@toxicoder toxicoder merged commit 37498e9 into main Jan 9, 2026
@toxicoder toxicoder deleted the fix-vram-docker-auth-8399662171178761435 branch January 9, 2026 21:36