Implement server provisioning #300

Merged
merged 1 commit into from
Apr 8, 2024

Conversation

MasterPtato
Contributor

Depends on https://github.com/rivet-gg/rfcs/pull/7
Merge with https://github.com/rivet-gg/rivet-ee/pull/6

Changes

  • Move server provisioning to a service for future automation and flexibility

Contributor Author

MasterPtato commented Jan 12, 2024

This stack of pull requests is managed by Graphite.

Member

@NathanFlurry NathanFlurry left a comment

this review is still wip

  • .github/actions/pre-init/action.yaml
  • .gitignore
  • docs/packages/cluster/AUTOSCALING.md
  • docs/packages/cluster/AUTOSCALING.md
  • docs/packages/cluster/AUTOSCALING.md
  • infra/tf/pools/main.tf
  • lib/api-helper/build/src/macro_util.rs
  • lib/bolt/core/src/context/service.rs
  • docs/packages/cluster/SERVER_PROVISIONING.md
  • lib/bolt/core/src/tasks/gen.rs (outdated)
Member

@NathanFlurry NathanFlurry left a comment

need to delete watches & yarn.lock

wip review

  • .rivet/config.yaml
  • docs/packages/cluster/AUTOSCALING.md
  • docs/packages/cluster/TLS_AND_DNS.md
  • docs/packages/cluster/TLS_AND_DNS.md
  • docs/packages/cluster/TLS_AND_DNS.md
  • lib/bolt/core/src/dep/terraform/gen.rs
  • infra/tf/tls/acme.tf
  • lib/bolt/core/src/tasks/test.rs
  • lib/bolt/core/src/tasks/test.rs
  • lib/nomad-util/src/lib.rs
Member

@NathanFlurry NathanFlurry left a comment

wip

  • lib/nomad-util/Cargo.toml
  • lib/util/core/src/math.rs
  • lib/util/core/src/math.rs
  • lib/util/env/src/lib.rs (outdated)
  • proto/backend/cluster.proto
Member

@NathanFlurry NathanFlurry left a comment

high level changes

  • needs to document the prebake process & how we do normal installs when waiting for prebake servers to finish
  • it's not clear that the server token is meant to be reused in the image and servers are identified by public ip. just needs some comments around that.
  • do we taint all servers once a new prebaked image is ready?
  • make sure the logic is sound such that servers can be recreated with a recycled public ip
  • see comments on figma for diagram

load testing

was trying to think about what it would take to feel confident in shipping this without taking everything down. i think the biggest thing is load testing it with fault injection on staging (since that's on eks). that consists of:

  • using mm-sustain with an up and down in terms of players online. we want to mix job types between 1d8, 1d4, etc, 4d1. want to test the equivalent of 2k->10k->2k ccu repeating with 16 players/core.
  • inject faults into the linode + cloudflare api calls by passing an env var with the failure percent. we can write a macro or something that injects faults here.
  • do we have a way to read the linode api to make sure that the servers match what we have internally? like a "doctor" service that prints anything wrong. this will help us catch leaked servers.
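
The env-var-driven fault injection suggested above could look roughly like the following Rust sketch. Everything named here (`FAULT_INJECT_PERCENT`, `parse_failure_percent`, `maybe_fail`, `with_fault!`, `ApiError`) is hypothetical and not taken from this PR; a small xorshift generator stands in for a real RNG so the example needs no external crates.

```rust
use std::env;

/// Hypothetical env var name; the review only says "an env var with the failure percent".
const FAULT_PERCENT_VAR: &str = "FAULT_INJECT_PERCENT";

/// Parses a failure percentage and clamps it to 0..=100.
/// A missing or unparsable value disables injection (0%).
fn parse_failure_percent(raw: Option<&str>) -> u8 {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .map(|p| p.min(100) as u8)
        .unwrap_or(0)
}

/// Reads the failure percentage from the (assumed) env var.
#[allow(dead_code)]
fn failure_percent_from_env() -> u8 {
    parse_failure_percent(env::var(FAULT_PERCENT_VAR).ok().as_deref())
}

/// Tiny xorshift PRNG used only for illustration. The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    /// Returns a pseudo-random value in 0..100.
    fn next_percent(&mut self) -> u8 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        (x % 100) as u8
    }
}

/// Illustrative error type for an injected failure, tagged with the API name.
#[derive(Debug, PartialEq)]
enum ApiError {
    Injected(&'static str),
}

/// Fails with `percent`% probability before the real call would be made.
fn maybe_fail(api: &'static str, percent: u8, rng: &mut XorShift) -> Result<(), ApiError> {
    if rng.next_percent() < percent {
        Err(ApiError::Injected(api))
    } else {
        Ok(())
    }
}

/// Wraps a fallible call so that a fault may be injected ahead of it.
/// `$call` must evaluate to `Result<T, ApiError>`.
macro_rules! with_fault {
    ($api:expr, $percent:expr, $rng:expr, $call:expr) => {
        maybe_fail($api, $percent, $rng).and_then(|_| $call)
    };
}
```

A call site would wrap each provider request, for example `with_fault!("linode", percent, &mut rng, create_server())`, where `create_server` is a placeholder for whatever function performs the request. Leaving the env var unset keeps the percentage at 0, so the wrapper only changes behaviour when faults are explicitly enabled.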

i know this is scope creep, but i'm hoping this will save us a series of headaches. i don't think it's too hard afaik, the hard part of load tests is already done.
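
For the "doctor" idea, the core of the check is a set difference between what the provider reports and what is recorded internally. The sketch below assumes both sides can be reduced to a list of string keys (public IPs here, since the review says servers are identified by public ip; a provider server ID may be safer when IPs are recycled). `DoctorReport` and `diff_servers` are illustrative names, not part of this PR, and fetching the two lists from the Linode API and the internal database is left out.

```rust
use std::collections::BTreeSet;

/// Result of comparing the provider's view with the internal view.
#[derive(Debug, Default, PartialEq)]
struct DoctorReport {
    /// Present at the provider but unknown internally (possible leaked servers).
    leaked: Vec<String>,
    /// Recorded internally but absent at the provider.
    missing: Vec<String>,
}

impl DoctorReport {
    /// True when both views agree.
    fn is_healthy(&self) -> bool {
        self.leaked.is_empty() && self.missing.is_empty()
    }
}

/// Compares two lists of server keys. BTreeSet keeps the output sorted,
/// so the report is deterministic.
fn diff_servers(provider_keys: &[&str], internal_keys: &[&str]) -> DoctorReport {
    let provider: BTreeSet<&str> = provider_keys.iter().copied().collect();
    let internal: BTreeSet<&str> = internal_keys.iter().copied().collect();

    DoctorReport {
        leaked: provider
            .difference(&internal)
            .map(|key| key.to_string())
            .collect(),
        missing: internal
            .difference(&provider)
            .map(|key| key.to_string())
            .collect(),
    }
}
```

Running this periodically during the load test and printing any non-healthy report would surface leaked servers while the fault injection is active.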

Contributor

graphite-app bot commented Apr 5, 2024

Merge activity

  • Apr 5, 1:07 PM EDT: NathanFlurry added this pull request to the Graphite merge queue.
  • Apr 5, 3:01 PM EDT: The Graphite merge queue couldn't merge this PR because it was not satisfying all requirements (Failed CI: 'check', 'cargo-deny').
  • Apr 8, 1:46 PM EDT: NathanFlurry added this pull request to the Graphite merge queue.
  • Apr 8, 1:47 PM EDT: The Graphite merge queue wasn't able to merge this pull request due to internal failures.
  • Apr 8, 2:01 PM EDT: NathanFlurry added this pull request to the Graphite merge queue.
  • Apr 8, 2:02 PM EDT: NathanFlurry merged this pull request with the Graphite merge queue.

NathanFlurry pushed a commit that referenced this pull request Apr 5, 2024
Depends on rivet-gg/rfcs#7
Merge with rivet-gg/rivet-ee#6
## Changes
- Move server provisioning to a service for future automation and flexibility
MasterPtato added a commit that referenced this pull request Apr 5, 2024
Member

@NathanFlurry NathanFlurry left a comment

will resolve remaining issues in follow up prs

@graphite-app graphite-app bot merged commit 1aa344e into main Apr 8, 2024
9 of 10 checks passed
@graphite-app graphite-app bot deleted the max/SVC-3454 branch April 8, 2024 18:02
@MasterPtato MasterPtato restored the max/SVC-3454 branch April 8, 2024 21:22
NathanFlurry added a commit to rivet-gg/frontend that referenced this pull request Apr 12, 2024
3 participants