Skip to content

Conversation

@jeanschmidt
Copy link
Contributor

@jeanschmidt jeanschmidt commented Oct 18, 2022

removeGithubRunner[Org || Repo] used to remove the EC2 instance, so no need to call terminateRunner again. This potentially could cause runners that failed to be unregistered from GHA to be terminated on EC2.

As a fix, removeGithubRunner won't terminate the instance, nor generate logs. This will enable scaleDown to control when to call terminateRunner and generate the proper logs and metrics. Avoiding having this issue in the future.

This bug also explains why we had in the past more EC2 instances being kept at its minimum time: instances with less than minimum time got unregistered and terminated without being tracked on main application metric. This is obvious when we compare the API calls to terminate and the count of app level termination.

Screenshot 2022-10-18 at 09 21 47
Screenshot 2022-10-18 at 09 26 19

Bug initially flagged on 87134

@vercel
Copy link

vercel bot commented Oct 18, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Updated
torchci ⬜️ Ignored (Inspect) Oct 19, 2022 at 9:41AM (UTC)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 18, 2022
* [FIX] scale-down-test.ts
* [IMPROVE] removeGithubRunner does not terminate EC2 runner
* [IMPROVE] application-level metrics for GH runner de-registers
@jeanschmidt jeanschmidt merged commit 7a7c4ea into main Oct 19, 2022
@jeanschmidt jeanschmidt deleted the jeanschmidt/fix_not_remove_instances branch October 19, 2022 09:56
kit1980 pushed a commit that referenced this pull request Nov 23, 2022
`removeGithubRunner[Org || Repo]` used to remove the EC2 instance, so no
need to call `terminateRunner` again. This potentially could cause
runners that failed to be unregistered from GHA to be terminated on EC2.

As a fix, `removeGithubRunner` won't terminate the instance, nor
generate logs. This will enable `scaleDown` to control when to call
`terminateRunner` and generate the proper logs and metrics. Avoiding
having this issue in the future.

This bug also explains why we had in the past more EC2 instances being
kept at its minimum time: instances with less than minimum time got
unregistered and terminated without being tracked on main application
metric. This is obvious when we compare the API calls to terminate and
the count of app level termination.

![Screenshot 2022-10-18 at 09 21
47](https://user-images.githubusercontent.com/4520845/196364535-5aaab331-2080-44be-b6af-0702f99d50d9.png)
![Screenshot 2022-10-18 at 09 26
19](https://user-images.githubusercontent.com/4520845/196364542-376ff99f-617e-4e82-b459-dfc8364219ad.png)

Bug initially flagged on
[87134](pytorch/pytorch#87134)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants