Fix VRAM detection and Docker authentication in launcher script#156
Fix VRAM detection and Docker authentication in launcher script#156
Conversation
- Updates `_check_vram_requirements` to explicitly add `/usr/local/cuda/bin:/usr/bin` to PATH when running `nvidia-smi` via SSH. - Adds `+0` to VRAM awk parsing to correctly handle empty output as 0 instead of empty string. - Refactors `_ensure_image_present` (both head pull and fallback download) to pass `NGC_API_KEY` via stdin pipe to SSH/docker login. This fixes "Access Denied" errors caused by improper quoting/expansion of the key in the command string and improves security by hiding the key from the remote process list.
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
This PR addresses two critical issues reported by users of the
run-distributed-model.shscript:VRAM Detection Failure (Total: 0.00 GB): The script was failing to detect VRAM on remote nodes because
nvidia-smiwas not in the PATH for non-interactive SSH sessions. This resulted in a warning and 0.00 GB detected.PATH=$PATH:/usr/local/cuda/bin:/usr/binto thenvidia-smiSSH command. Also improvedawkparsing to default to 0 on empty input.Docker Pull "Access Denied": The image pull was failing due to authentication issues, likely caused by improper handling of the
NGC_API_KEYwhen embedded in the remote SSH command string (quoting/expansion issues).echo "$KEY" | ssh ...) directly todocker login --password-stdin. This avoids embedding the key in the command arguments, solving the quoting issues and preventing the key from appearing in the remotepsoutput.Tests Verified:
dgx-spark/tests/test_run_distributed_model.shpassed, covering VRAM detection logic, quoting checks, and general execution flow.PR created automatically by Jules for task 8399662171178761435 started by @toxicoder