
Conversation

huydhn (Contributor) commented Aug 5, 2025

The GitHub Actions container feature doesn't work with our multi-tenant rootless Docker in Docker setup (see https://github.com/pytorch/pytorch-integration-testing/actions/runs/16742012333/job/47392166734); maybe we want to look closer into this to understand why.

I'm rewriting the workflow to call Docker directly instead. Credit to Claude Code.
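
For context, here is a minimal sketch of what calling Docker directly from a workflow step can look like. The image name, mount paths, and benchmark command below are placeholders, not necessarily what this PR ends up doing:

# Pull and start the image explicitly instead of relying on the GitHub Actions
# `container:` feature, so the job also works under rootless Docker in Docker.
docker pull "$DOCKER_IMAGE"
container_id=$(docker run -d --gpus all -v "$GITHUB_WORKSPACE:/workspace" -w /workspace "$DOCKER_IMAGE" tail -f /dev/null)
# Do the actual work with docker exec, then clean up the container.
docker exec "$container_id" python benchmarks/benchmark_attn.py
docker rm -f "$container_id"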

Signed-off-by: Huy Do <huydhn@gmail.com>
meta-cla bot added the cla signed label Aug 5, 2025
huydhn added 4 commits August 5, 2025 01:46
Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
Signed-off-by: Huy Do <huydhn@gmail.com>
Credit to Claude code

Signed-off-by: Huy Do <huydhn@gmail.com>
huydhn changed the title from "Call setup-node before checkout" to "Rewrite flash attention workflow to avoid using GH container" Aug 5, 2025
Signed-off-by: Huy Do <huydhn@gmail.com>
huydhn requested review from jduprat and seemethere August 5, 2025 18:43
huydhn marked this pull request as ready for review August 5, 2025 18:46
huydhn (Contributor, Author) commented Aug 5, 2025

Some notes from debugging the issue:

  • The volume seems to be mounted in the right place (https://github.com/pytorch/pytorch-integration-testing/actions/runs/16742012333/job/47392166734#step:2:312): -v "/home/alice/externals":"/__e":ro. That's where the node20 bundle is found on the runner.
  • However, when docker exec was called later on, node20 wasn't there (maybe I should ls the mounted volume and check for a permission issue; see the sketch after this list).
  • The Initialize container step runs outside of the multi-tenant-gpu container, I think, because it refers to the alice user directly there; that's the confusing part.
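
A minimal sketch of the check from the second bullet (container_id here is just a placeholder for whichever container the job started; /__e is the mount target from the log above):

# List the mounted externals directory from inside the container to see whether
# the node20 bundle is actually visible and readable there.
docker exec "$container_id" ls -la /__e
docker exec "$container_id" ls -la /__e/node20
# Print the UID/GID the container runs as; rootless Docker remaps users,
# which could explain files that exist on the host but are unreadable inside.
docker exec "$container_id" id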

jduprat (Contributor) left a comment

Minor nit regarding the output of nvidia-smi.
I can fix in post...

export PYTHONPATH=$(pwd)
python benchmarks/benchmark_attn.py >> $GITHUB_STEP_SUMMARY
echo '<h1>B200 1000W</h1>' >> /tmp/workspace/fa4_output.txt
nvidia-smi >> /tmp/workspace/fa4_output.txt

The output of nvidia-smi makes the benchmark output hard to read. I'd prefer that we still run it (so I can check the logs and confirm we are on the right class of machine) but without piping it into /tmp/workspace/fa4_output.txt.

huydhn (Contributor, Author) commented

Oh, I can remove this quickly
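
Something along these lines would match the request, keeping nvidia-smi in the job log but dropping the redirect (a sketch, not necessarily the exact follow-up change):

export PYTHONPATH=$(pwd)
python benchmarks/benchmark_attn.py >> $GITHUB_STEP_SUMMARY
echo '<h1>B200 1000W</h1>' >> /tmp/workspace/fa4_output.txt
# Print nvidia-smi to the job log only, so the machine class can be confirmed
# without cluttering /tmp/workspace/fa4_output.txt.
nvidia-smi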

jduprat merged commit 120745b into main Aug 5, 2025
3 checks passed