Conversation

@huydhn (Contributor) commented Feb 3, 2023

Desperate times call for desperate measures before #2101 is done.

@huydhn requested review from a team and ZainRizvi on February 3, 2023 at 00:44
vercel bot commented Feb 3, 2023

@huydhn is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

@facebook-github-bot added the CLA Signed label Feb 3, 2023
vercel bot commented Feb 3, 2023

The latest updates on your projects.

1 Ignored Deployment

| Name    | Status     | Updated                       |
|---------|------------|-------------------------------|
| torchci | ⬜️ Ignored | Feb 3, 2023 at 00:44 AM (UTC) |

@ZainRizvi (Contributor) left a comment

We're trading off exposure to network flakiness so that the job can actually run, eh?

How often are you seeing this kind of disk space failure? Hud is looking pretty green today (approving since I assume you must be seeing these failures frequently enough somewhere 🙂)

@huydhn (Contributor, Author) commented Feb 3, 2023

Thank you for the quick review!

> We're trading off exposure to network flakiness so that the job can actually run, eh?

Yeah, I opted for the latter because once the pet runner gets into this state, it keeps failing until someone takes a look and cleans up the disk space. Having a bigger partition is the way to go in the long run.

> How often are you seeing this kind of disk space failure? Hud is looking pretty green today (approving since I assume you must be seeing these failures frequently enough somewhere 🙂)

There were several today, enough for Cat to notice them and disable the test that creates a 2GB dummy model file (pytorch/pytorch#84898).
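
For context, a minimal sketch of the kind of disk-space cleanup being traded against network flakiness here, i.e. dropping the conda and pip caches so the runner has room to work again. The commands below are an assumption for illustration, not the literal step added by this PR:

```bash
# Sketch of a cache-cleanup step on a macOS pet runner (assumed commands,
# not the exact workflow change in this PR).
set -eux

# Free space before the cleanup, for comparison
df -h /

# Remove downloaded conda package tarballs and index caches
conda clean --all --yes

# Remove pip's wheel/HTTP cache (available in pip >= 20.1)
pip cache purge

# Free space after the cleanup
df -h /
```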

@huydhn merged commit 2353b54 into pytorch:main on Feb 3, 2023
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Feb 7, 2023
Despite my initial attempts to clean up the macOS runner as best as I could (pytorch/test-infra#2100, pytorch/test-infra#2102), the runner in question `i-09df3754ea622ad6b` (yes, the same one) still had its free space gradually dropping from 10GB (after cleaning conda and pip packages a few days ago) to only 5.2GB today: https://hud.pytorch.org/pytorch/pytorch/commit/4207d3c330c2b723caf0e1c4681ffd80f0b1deb7

I had a gotcha moment after logging into the runner: the direct root cause was right before my eyes. I had forgotten to look at the processes running there:

```
  501  7008     1   0 13Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3912838018 --no-capture-output python3 -m tools.stats.monitor
  501 30351 30348   0 18Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3953492510 --no-capture-output python3 -m tools.stats.monitor
  501 36134 36131   0 19Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3956679232 --no-capture-output python3 -m tools.stats.monitor
  501 36579 36576   0 Mon11PM ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_4048875121 --no-capture-output python3 -m tools.stats.monitor
  501 37096 37093   0 20Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3971130804 --no-capture-output python3 -m tools.stats.monitor
  501 62770 62767   0 27Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_4025485821 --no-capture-output python3 -m tools.stats.monitor
  501 82293 82290   0 20Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3969944513 --no-capture-output python3 -m tools.stats.monitor
  501 95762 95759   0 26Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_4012836881 --no-capture-output python3 -m tools.stats.monitor

```

There were many leftover `tools.stats.monitor` processes there. After pkill-ing them all, an extra 45GB of free space was immediately freed up. The same situation could be seen on other macOS pet runners too, e.g. `i-026bd028e886eed73`.

At the moment, it's unclear to me what edge case could cause this, as the step that stops the monitoring script should always be executed; maybe it received an invalid PID somehow. However, the catch-all safety-net solution would be to clean up all leftover processes on the macOS pet runners before running the workflow (similar to what is done on Windows in #93914).
Pull Request resolved: #94127
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
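
For illustration, a minimal sketch of the catch-all cleanup described in the commit message above, assuming the stale processes of interest are the `tools.stats.monitor` ones and that their per-job conda environments live under the `_work/_temp` path shown in the process listing. The real pre-job step may be broader:

```bash
# Hypothetical pre-job cleanup on a macOS pet runner (sketch only).

# Kill any monitoring processes left behind by earlier jobs; pkill exits
# non-zero when nothing matches, so don't fail the step on that.
pkill -f tools.stats.monitor || true

# With nothing holding them open anymore, stale per-job conda environments
# from previous runs can be removed as well.
rm -rf /Users/ec2-user/runner/_work/_temp/conda_environment_* || true
```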
@huydhn deleted the macos-cleanup-pip-conda-cache branch on February 9, 2023 at 19:41