Conversation

@huydhn (Contributor) commented Feb 3, 2023

Desperate times call for desperate measures before #2101 is done.

@huydhn requested review from a team and ZainRizvi on February 3, 2023 at 00:44
vercel bot commented Feb 3, 2023

@huydhn is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

@facebook-github-bot added the CLA Signed label Feb 3, 2023
vercel bot commented Feb 3, 2023

The latest updates on your projects.

1 Ignored Deployment

| Name    | Status     | Updated                       |
|---------|------------|-------------------------------|
| torchci | ⬜️ Ignored | Feb 3, 2023 at 00:44 AM (UTC) |

@ZainRizvi (Contributor) left a comment

We're trading off exposure to network flakiness so that the job can actually run, eh?

How often are you seeing this kind of disk space failure? Hud is looking pretty green today (approving since I assume you must be seeing these failures frequently enough somewhere 🙂)

@huydhn (Contributor, Author) commented Feb 3, 2023

Thank you for the quick review!

> We're trading off exposure to network flakiness so that the job can actually run, eh?

Yeah, I opted for the latter because once the pet runner gets into this state, it keeps failing until someone takes a look and cleans up the disk space. Having a bigger partition is the way to go in the long run.

> How often are you seeing this kind of disk space failure? Hud is looking pretty green today (approving since I assume you must be seeing these failures frequently enough somewhere 🙂)

There were several today, enough for Cat to notice them and disable the test that creates a 2GB dummy model file (pytorch/pytorch#84898).
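
For context, a minimal sketch of the kind of disk-space cleanup being traded against network flakiness here, i.e. dropping the conda and pip caches so the runner has room to work again. The commands below are an assumption for illustration, not the literal step added by this PR:

```bash
# Sketch of a cache-cleanup step on a macOS pet runner (assumed commands,
# not the exact workflow change in this PR).
set -eux

# Free space before the cleanup, for comparison
df -h /

# Remove downloaded conda package tarballs and index caches
conda clean --all --yes

# Remove pip's wheel/HTTP cache (available in pip >= 20.1)
pip cache purge

# Free space after the cleanup
df -h /
```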

@huydhn merged commit 2353b54 into pytorch:main on Feb 3, 2023
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Feb 7, 2023
Despite my initial attempts to clean up the macOS runner as best as I could (pytorch/test-infra#2100, pytorch/test-infra#2102), the runner in question `i-09df3754ea622ad6b` (yes, the same one) still had its free space gradually dropping from 10GB (after cleaning conda and pip packages a few days ago) to only 5.2GB today: https://hud.pytorch.org/pytorch/pytorch/commit/4207d3c330c2b723caf0e1c4681ffd80f0b1deb7

I had a gotcha moment after logging into the runner: the direct root cause was right before my eyes. I had forgotten to look at the processes running there:

```
  501  7008     1   0 13Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3912838018 --no-capture-output python3 -m tools.stats.monitor
  501 30351 30348   0 18Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3953492510 --no-capture-output python3 -m tools.stats.monitor
  501 36134 36131   0 19Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3956679232 --no-capture-output python3 -m tools.stats.monitor
  501 36579 36576   0 Mon11PM ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_4048875121 --no-capture-output python3 -m tools.stats.monitor
  501 37096 37093   0 20Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3971130804 --no-capture-output python3 -m tools.stats.monitor
  501 62770 62767   0 27Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_4025485821 --no-capture-output python3 -m tools.stats.monitor
  501 82293 82290   0 20Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3969944513 --no-capture-output python3 -m tools.stats.monitor
  501 95762 95759   0 26Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_4012836881 --no-capture-output python3 -m tools.stats.monitor

```

There were many leftover `tools.stats.monitor` processes there. After pkill-ing them all, an extra 45GB of free space was immediately freed up. The same situation could be seen on other macOS pet runners too, e.g. `i-026bd028e886eed73`.

At the moment, it's unclear to me what edge case could cause this, as the step that stops the monitoring script should always be executed; maybe it received an invalid PID somehow. However, the catch-all safety-net solution would be to clean up all leftover processes on the macOS pet runners before running the workflow (similar to what is done on Windows in #93914).
Pull Request resolved: #94127
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
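
For illustration, a minimal sketch of the catch-all cleanup described in the commit message above, assuming the stale processes of interest are the `tools.stats.monitor` ones and that their per-job conda environments live under the `_work/_temp` path shown in the process listing. The real pre-job step may be broader:

```bash
# Hypothetical pre-job cleanup on a macOS pet runner (sketch only).

# Kill any monitoring processes left behind by earlier jobs; pkill exits
# non-zero when nothing matches, so don't fail the step on that.
pkill -f tools.stats.monitor || true

# With nothing holding them open anymore, stale per-job conda environments
# from previous runs can be removed as well.
rm -rf /Users/ec2-user/runner/_work/_temp/conda_environment_* || true
```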
@huydhn deleted the macos-cleanup-pip-conda-cache branch on February 9, 2023 at 19:41