Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terminate inaccessible compute instances. #7427

Open
wants to merge 7 commits into
base: dev
Choose a base branch
from

Conversation

owenniles
Copy link
Contributor

@owenniles owenniles commented Oct 4, 2022

Ticket(s) Closed

Description

This PR adds code to the scaling service that prevents broken instances from existing for too long. Although the linked issue calls for broken instances to be marked as DRAINING, this PR terminates them directly. Normally, when the host service notices that it has been marked as draining in the database, it waits until all users have disconnected, shuts down all running Mandelboxes, and terminates itself. The tricky thing about broken instances is that they're broken, so we can't assume they can or will do any of those things. For this reason, broken instances need to be terminated directly, not just marked as DRAINING.

Implementation

Run a cleanup thread in each region. Every cleanup period, the cleanup threads look for unresponsive instances. An unresponsive instance is one that hasn't sent a ping in at least 2 minutes. If all unresponsive instances can be terminated, also delete the corresponding rows from the database. If termination of one or more instances fails, log an error, but do not remove any rows from the database.

Documentation & Tests Added

Just godoc comments.

Testing Instructions

  1. Create an instance (any instance). We just need an instance that we can practice deleting.
  2. Spin up a local database and config database.
  3. Add a row for the instance you created in step 1 to your local scaling service database (if your database is empty, you'll need to add a row to the images table as well, but you can just call it fake-image or something)
  4. Remove the -nocleanup flag from the command line in the run_scaling_service_localdevwithdb target.
  5. Start the scaling service with make run_scaling_service_localdevwithdb
  6. After about a minute, the scaling service should delete the row from the database and terminate the instance.

PR Checklist

  • Did the PR author fully test this PR end-to-end?
  • Did one PR reviewer fully test this PR end-to-end?
  • Did one PR reviewer conduct a thorough code design review?

@github-actions github-actions bot added 📁 Repo: services This PR/Issue modifies /services code 📁 Repo: backend This PR/Issue modifies /backend code labels Oct 4, 2022
@github-actions
Copy link

github-actions bot commented Oct 4, 2022

Schema is unchanged, no database migration needed.

Carry on!

''

@owenniles owenniles force-pushed the owen/collect-garbage branch 4 times, most recently from 8eedb9d to 5020470 Compare October 7, 2022 14:01
@owenniles owenniles force-pushed the owen/collect-garbage branch 2 times, most recently from 873f41a to 9d2547d Compare October 14, 2022 14:53
@owenniles owenniles changed the base branch from dev to owen/get-region October 14, 2022 15:11
@codecov
Copy link

codecov bot commented Oct 17, 2022

Codecov Report

Merging #7427 (a31e2cd) into dev (8d26a90) will increase coverage by 0.20%.
The diff coverage is 21.79%.

❗ Current head a31e2cd differs from pull request most recent head 31df6f0. Consider uploading reports for the commit 31df6f0 to get more accurate results

@@            Coverage Diff             @@
##              dev    #7427      +/-   ##
==========================================
+ Coverage   55.94%   56.14%   +0.20%     
==========================================
  Files         158      162       +4     
  Lines       32519    32563      +44     
==========================================
+ Hits        18193    18284      +91     
+ Misses      14071    14022      -49     
- Partials      255      257       +2     
Flag Coverage Δ *Carryforward flag
backend-services 48.57% <21.79%> (-0.64%) ⬇️
protocol 57.75% <ø> (+0.41%) ⬆️ Carriedforward from 2a4e842

*This pull request uses carry forward flags. Click here to find out more.

Impacted Files Coverage Δ
backend/services/scaling-service/cleanup.go 0.00% <0.00%> (ø)
backend/services/scaling-service/dbclient/types.go 0.00% <0.00%> (ø)
backend/services/scaling-service/event_handler.go 21.83% <0.00%> (-4.93%) ⬇️
...ling-service/scaling_algorithms/default/default.go 17.57% <ø> (+0.71%) ⬆️
...ckend/services/scaling-service/dbclient/cleanup.go 85.00% <85.00%> (ø)
protocol/server/client.c 95.45% <0.00%> (-3.41%) ⬇️
protocol/whist/network/tcp.c 61.52% <0.00%> (-0.68%) ⬇️
protocol/whist/utils/queue.c 83.82% <0.00%> (-0.24%) ⬇️
... and 26 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d662bc6...31df6f0. Read the comment docs.

@owenniles owenniles force-pushed the owen/get-region branch 4 times, most recently from 81c5a0a to 4b61791 Compare October 20, 2022 14:14
Base automatically changed from owen/get-region to dev October 20, 2022 14:42
@owenniles owenniles marked this pull request as ready for review October 25, 2022 14:43
Copy link

@MauAraujo MauAraujo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far it looks good to me, I have left some comments and questions. I still need to test the PR.

backend/services/subscriptions/mutations.go Outdated Show resolved Hide resolved
backend/services/subscriptions/mutations.go Show resolved Hide resolved
backend/services/scaling-service/dbclient/types.go Outdated Show resolved Hide resolved
backend/services/scaling-service/cleanup.go Outdated Show resolved Hide resolved
func (c *DBClient) TerminateLockedInstances(ctx context.Context, client subscriptions.WhistGraphQLClient, region string, ids []string) ([]string, error) {
var m subscriptions.TerminateLockedInstances

// We need to pass the instance IDs as a slice of grpahql String type

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo "grpahql"

algos "github.com/whisthq/whist/backend/services/scaling-service/scaling_algorithms/default" // Import as algos, short for scaling_algorithms
"github.com/whisthq/whist/backend/services/subscriptions"
"github.com/whisthq/whist/backend/services/utils"
logger "github.com/whisthq/whist/backend/services/whistlogger"
)

func main() {
var cleanupPeriod time.Duration

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a little bit cleaner to do

var (
    cleanupPeriod time.Duration
    noCleanup bool
)

var cleanupPeriod time.Duration
var noCleanup bool

flag.DurationVar(&cleanupPeriod, "cleanup", time.Duration(time.Minute),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the flags but find our current setup to be not ideal to handle them. If we need to pass different flags you need to modify the Makefile and it is annoying. Of course this is out of scope for this PR but something to keep in mind. If we want to avoid this and maintain consistency for now, we can:

if !metadata.IsLocalEnv() {
    CleanRegion() 
    .
    .
    .
} else {
    logger.Infof("Not running cleaner")
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the scaling service needs to start accepting command line flags because otherwise we will continue to have to live with at least one of the following: 1) having to define and handle a new value of APP_ENV for every configuration of features we want to run; 2) having to modify go code every time we want to enable or disable a feature for testing purposes; 3) ambiguously defined behavior for each possible value of APP_ENV and run_scaling_service* Makefile target.

I think it is preferable for the canonical way to run the binary to be to pass command-line flags directly to the binary because it allows developers to specify explicitly and transparently exactly how the scaling service should behave. Developers should feel free to define whatever aliases for combinations of command-line flags they want to use to run the binary locally. As a (perhaps the) believer in this conclusion, I see no reason not to begin adding command-line flags to the binary immediately. As an added bonus, as long as we still have a Makefile, it is cleaner to modify flags in a Makefile to enable or disable feature than it is to edit code directly. Think about past PRs whose testing instructions included things like "modify the delay between scale up/down operations."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📁 Repo: backend This PR/Issue modifies /backend code 📁 Repo: services This PR/Issue modifies /services code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make bricked instances mark themselves as DRAINING
3 participants