Terminate inaccessible compute instances. #7427
base: dev
Conversation
Schema is unchanged, no database migration needed. Carry on!
Codecov Report

@@            Coverage Diff             @@
##              dev    #7427      +/-   ##
==========================================
+ Coverage   55.94%   56.14%   +0.20%
==========================================
  Files         158      162       +4
  Lines       32519    32563      +44
==========================================
+ Hits        18193    18284      +91
+ Misses     14071    14022      -49
- Partials     255      257       +2

This pull request uses carry forward flags. Continue to review the full report at Codecov.
So far it looks good to me; I've left some comments and questions. I still need to test the PR.
func (c *DBClient) TerminateLockedInstances(ctx context.Context, client subscriptions.WhistGraphQLClient, region string, ids []string) ([]string, error) {
	var m subscriptions.TerminateLockedInstances

	// We need to pass the instance IDs as a slice of grpahql String type
typo "grpahql"
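For context, the conversion that comment describes typically looks like the sketch below. The `String` type here locally mirrors the defined string type that Go GraphQL clients (e.g. shurcooL/graphql) expose as `graphql.String`; the helper name is illustrative, not the PR's actual code.

```go
package main

import "fmt"

// String mirrors the graphql scalar type used by Go GraphQL clients,
// which is just a defined string type (type String string).
type String string

// toGraphQLStrings converts plain instance IDs into the []String shape
// that GraphQL mutation variables expect.
func toGraphQLStrings(ids []string) []String {
	out := make([]String, 0, len(ids))
	for _, id := range ids {
		out = append(out, String(id))
	}
	return out
}

func main() {
	fmt.Println(toGraphQLStrings([]string{"i-abc", "i-def"}))
}
```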
	algos "github.com/whisthq/whist/backend/services/scaling-service/scaling_algorithms/default" // Import as algos, short for scaling_algorithms
	"github.com/whisthq/whist/backend/services/subscriptions"
	"github.com/whisthq/whist/backend/services/utils"
	logger "github.com/whisthq/whist/backend/services/whistlogger"
)

func main() {
	var cleanupPeriod time.Duration
Maybe a little bit cleaner to do
var (
cleanupPeriod time.Duration
noCleanup bool
)
	var cleanupPeriod time.Duration
	var noCleanup bool

	flag.DurationVar(&cleanupPeriod, "cleanup", time.Duration(time.Minute),
I like the flags, but I find our current setup not ideal for handling them: if we need to pass different flags, we have to modify the Makefile, which is annoying. Of course this is out of scope for this PR, but it's something to keep in mind. If we want to avoid this and maintain consistency for now, we can:
if !metadata.IsLocalEnv() {
CleanRegion()
.
.
.
} else {
logger.Infof("Not running cleaner")
}
I think the scaling service needs to start accepting command-line flags, because otherwise we will continue to live with at least one of the following: 1) having to define and handle a new value of APP_ENV for every configuration of features we want to run; 2) having to modify Go code every time we want to enable or disable a feature for testing purposes; 3) ambiguously defined behavior for each possible value of APP_ENV and run_scaling_service* Makefile target.
I think it is preferable for the canonical way to run the binary to be passing command-line flags directly to it, because that allows developers to specify explicitly and transparently exactly how the scaling service should behave. Developers should feel free to define whatever aliases for combinations of command-line flags they want to use to run the binary locally. As a (perhaps the) believer in this conclusion, I see no reason not to begin adding command-line flags to the binary immediately. As an added bonus, as long as we still have a Makefile, it is cleaner to modify flags in a Makefile to enable or disable a feature than it is to edit code directly. Think about past PRs whose testing instructions included things like "modify the delay between scale up/down operations."
Ticket(s) Closed
Description
This PR adds code to the scaling service that prevents broken instances from existing for too long. Although the linked issue calls for broken instances to be marked as DRAINING, this PR terminates them directly. Normally, when the host service notices that it has been marked as draining in the database, it waits until all users have disconnected, shuts down all running Mandelboxes, and terminates itself. The tricky thing about broken instances is that they're broken, so we can't assume they can or will do any of those things. For this reason, broken instances need to be terminated directly, not just marked as DRAINING.
Implementation
Run a cleanup thread in each region. Every cleanup period, the cleanup threads look for unresponsive instances. An unresponsive instance is one that hasn't sent a ping in at least 2 minutes. If all unresponsive instances can be terminated, also delete the corresponding rows from the database. If termination of one or more instances fails, log an error, but do not remove any rows from the database.
Documentation & Tests Added
Just godoc comments.
Testing Instructions
Remove the -nocleanup flag from the command line in the run_scaling_service_localdevwithdb target, then run make run_scaling_service_localdevwithdb.
PR Checklist