Clear conda and pip cache when low disk space on MacOS M1 pet runners #2102
Conversation
We're trading off exposure to network flakiness so that the job can actually run, eh?
How often are you seeing this kind of disk space failure? Hud is looking pretty green today (approving since I assume you must be seeing these failures frequently enough somewhere 🙂)
Thank you for the quick review!
Yeah, I opted for the latter because once a pet runner gets into this state, it keeps failing until someone takes a look and cleans up the disk space. Having a bigger partition is the way to go in the long run.
There were several today, enough for Cat to notice them and disable the test that creates a 2GB dummy model file (pytorch/pytorch#84898).
Despite my initial attempt to clean up the MacOS runners as best I could (pytorch/test-infra#2100, pytorch/test-infra#2102), the runner in question, `i-09df3754ea622ad6b` (yes, the same one), still had its free space gradually dropping from 10GB (after cleaning conda and pip packages a few days ago) to only 5.2GB today: https://hud.pytorch.org/pytorch/pytorch/commit/4207d3c330c2b723caf0e1c4681ffd80f0b1deb7

I had a gotcha moment after logging into the runner: the direct root cause was right before my eyes. I had forgotten to look at the processes running there:

```
501  7008      1  0 13Jan23 ttys001  0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3912838018 --no-capture-output python3 -m tools.stats.monitor
501 30351  30348  0 18Jan23 ttys001  0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3953492510 --no-capture-output python3 -m tools.stats.monitor
501 36134  36131  0 19Jan23 ttys001  0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3956679232 --no-capture-output python3 -m tools.stats.monitor
501 36579  36576  0 Mon11PM ttys001  0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_4048875121 --no-capture-output python3 -m tools.stats.monitor
501 37096  37093  0 20Jan23 ttys001  0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3971130804 --no-capture-output python3 -m tools.stats.monitor
501 62770  62767  0 27Jan23 ttys001  0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_4025485821 --no-capture-output python3 -m tools.stats.monitor
501 82293  82290  0 20Jan23 ttys001  0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3969944513 --no-capture-output python3 -m tools.stats.monitor
501 95762  95759  0 26Jan23 ttys001  0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_4012836881 --no-capture-output python3 -m tools.stats.monitor
```

There were many leftover `tools.stats.monitor` processes there. After pkill-ing them all, an extra 45GB of disk space was immediately freed up. The same situation can be seen on other MacOS pet runners too, e.g. `i-026bd028e886eed73`.

At the moment, it's unclear to me what edge case could cause this, as the step to stop the monitoring script should always be executed; maybe it received an invalid PID somehow. However, the catch-all safety net is to clean up all leftover processes on the MacOS pet runners before running the workflow (similar to what is done on Windows in #93914).

Pull Request resolved: #94127
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
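A minimal sketch of what such a pre-run cleanup step could look like, assuming the goal is simply to kill any stale `tools.stats.monitor` processes by matching their command line (the matching pattern and step layout here are illustrative assumptions, not the actual workflow change):

```bash
# Hypothetical pre-run cleanup for MacOS pet runners: terminate any
# tools.stats.monitor processes left behind by earlier workflow runs.
# pgrep/pkill -f match against the full command line.
if pgrep -f "tools.stats.monitor" > /dev/null; then
  echo "Found leftover tools.stats.monitor processes, terminating them"
  pkill -f "tools.stats.monitor" || true
fi
```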
Desperate times call for desperate measures before #2101 is done.
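For reference, here is a minimal sketch of the kind of low-disk-space cleanup this PR's title describes: check the runner's free space and purge the conda and pip caches when it drops below a threshold. The 15GB threshold and the exact step layout are illustrative assumptions, not the PR's actual diff.

```bash
# Hypothetical cleanup: purge conda and pip caches when free space is low.
# The 15GB threshold is an assumption for illustration.
FREE_GB=$(df -g / | awk 'NR==2 {print $4}')  # available space in GB (macOS df -g)
if [ "${FREE_GB}" -lt 15 ]; then
  echo "Only ${FREE_GB}GB free, clearing conda and pip caches"
  conda clean --all --yes || true
  pip cache purge || true
fi
```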