Align internal deployment processes with customers' #976

sqs opened this Issue Nov 13, 2018 · 6 comments


sqs commented Nov 13, 2018

Our current deployment process for dogfood and production has a couple of problems:

  1. Most people on the team don’t understand how our current deploy process works. This means it is unclear how to add new services, edit existing services, or roll back in a safe way.
  2. Our current deployment process is completely unrelated to how customers deploy Sourcegraph. The basis of our internal deploy process should be exactly what customers do so we are regularly testing those code paths (not just once a month before release). This does not preclude us from adding additional automation now or in the future.
  3. Even though some common things are automated now (e.g. deploying sourcegraph-frontend), other things are not (e.g. every other service needs to be deployed by pushing to a special docker-images branch). Other conveniences that we had in the past (e.g. /deploy from Slack) are broken now and not fixed exactly because of unresolved tech debt, lack of understanding, and lack of ownership.

The obvious solution to (2) is to “just fix it”. This means:

  • We are going to deploy our single Docker image to dogfood because that is what most customers of our size/scale (<20 engineers) would do.
  • Production will continue to use a full Kubernetes deployment because it operates at a much larger scale. We will migrate it to a fork of deploy-sourcegraph that is deployed manually, the way our customers deploy, and get rid of all the legacy helm/kubegen logic.
    • We can still use renovate on our fork to keep image tags up to date, and deploying should be as simple as kubectl apply --prune -l deploy=sourcegraph -f base --recursive
    • Likewise, rolling back is as simple as reverting a commit and running that command, or temporarily checking out an old commit and running that command.
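The revert-and-reapply flow above can be sketched end to end with a throwaway git repo (the file name and image tags below are invented for illustration; the kubectl step is shown as a comment since it needs a real cluster):

```shell
# Toy demo of rollback-by-revert: a "deploy repo" whose commits pin image tags.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
echo 'image: sourcegraph/frontend:v2.12.0' > frontend.yaml
git add frontend.yaml
git -c user.email=ci@example.com -c user.name=ci commit -q -m 'deploy v2.12.0'
echo 'image: sourcegraph/frontend:v2.13.0' > frontend.yaml
git add frontend.yaml
git -c user.email=ci@example.com -c user.name=ci commit -q -m 'deploy v2.13.0'
# Rolling back is just reverting the bad commit...
git -c user.email=ci@example.com -c user.name=ci revert --no-edit HEAD
# ...and re-running the same deploy command against the tree, e.g.:
#   kubectl apply --prune -l deploy=sourcegraph -f base --recursive
grep v2.12.0 frontend.yaml
```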

The current conveniences we have/had in (3) (e.g. auto deployment, deploy/rollback from Slack, automatic e2e tests per commit) are based on our old model of deployment, so they will not be available on the initial transition to the new model. This is an acceptable intermediate state for a few reasons:

  • Solving (1) and (2) has become increasingly urgent because we keep paying a development tax for not having solved these problems already.
  • Attempting to not regress on (3) would only delay the solution for (1) and (2).
  • Getting (1) and (2) solved sooner without (3) is a much better state than delaying for a complete solution.
  • The intermediate state of solving (1) and (2) without (3) gives our teammates the opportunity to learn and get comfortable with bare metal deployments (and feel any pains that our customers feel). We benefit from having everyone understand the first principles of deployment before being exposed to additional complexity that will be implied by (3).
  • We are only going to be in this intermediate state temporarily.

Once we have solved (1) and (2), we can start to build on top of it.

After migrating dogfood and production to the new deployment styles:

  • The first priority is automated e2e testing. Sketch:

    • On any commit (including feature branches), our CI system builds the single docker image, runs it on the CI machine, and points our e2e tests at that local instance.
    • This gives us synchronous e2e test failures tied to a commit (something we have today) and e2e tests on every branch (something we don’t have today).
  • The second priority is to bring back automated deploys. Sketch:

      • Assume that we have already set up Renovate to update image tags in our fork of deploy-sourcegraph (because that is useful)
      • Any commit to master on our fork can run the kubectl apply command described above.
      • Rollback is as easy as reverting a commit in our fork.
      • Renovate commits are delayed, but that is ok since we don’t need instantaneous deployment. If you need to deploy anything urgently, you can always manually edit the image tag before Renovate does. Easy.
    • Dogfood
      • This is actually the same process as above. The only difference is there is a single tag for renovate to update (i.e. sourcegraph/server).
    • It would still be technically possible to add back the ability to do deploys from Slack, but that would be a bad idea as it would obfuscate how deploys work. Git operations seem like an ideal layer of abstraction. This point is not worth debating until we already have experience deploying with the new model.
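For the Renovate piece, a minimal config on the fork might look like this (a sketch only; the manager name and fileMatch pattern are assumptions about Renovate's Kubernetes support, not tested config):

```json
{
  "extends": ["config:base"],
  "enabledManagers": ["kubernetes"],
  "kubernetes": {
    "fileMatch": ["^base/.+\\.yaml$"]
  }
}
```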
                              Manual                    Automated
  Complicated/misunderstood                             Current state
  Simple/understandable       Next intermediate state   Final state
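The first-priority e2e flow above could be sketched as a single CI step (hypothetical Buildkite-style YAML; the helper script path, port, and env var names are assumptions):

```yaml
steps:
  - label: "e2e"
    command:
      # Build the single image at this commit and run it locally on the agent.
      - docker build -t sourcegraph/server:$BUILDKITE_COMMIT .
      - docker run -d --name sg -p 7080:7080 sourcegraph/server:$BUILDKITE_COMMIT
      # Wait until the instance answers, then point the e2e suite at it.
      - ./dev/ci/wait-for-http.sh http://localhost:7080   # hypothetical helper
      - SOURCEGRAPH_BASE_URL=http://localhost:7080 yarn test-e2e
      - docker rm -f sg
```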

Rough plan and milestones:

Preparation checklist

  • Test cluster in configuration
  • Script backup of existing volumes from prod and restoration of those volumes into staging cluster
    • required: postgres, management console, redis, grafana, prometheus
    • optional (ideal): gitserver, npm
    • maybe: jaeger (do we even use this?)
  • Figure out the lowest-traffic time of day to deploy
  • Test that users aren't signed out
  • Manual testing
    • Nick
    • Geoffrey

Deployment checklist

  • Update the message of the day to say "We are currently performing maintenance, so you may experience some service disruption."
  • Send a message to dev-announce describing what is going to happen (e.g. deployments are stopping)
  • Delete deploy bot from k8s cluster to disable auto deploy
  • Update test cluster to use prod certificate
  • Pause payments on stripe?
  • Run script to backup all existing volumes and restore to prod cluster
    gcloud container clusters get-credentials dot-com --zone us-central1-f --project sourcegraph-dev
    kubectl delete deployments -nprod sourcegraph-frontend redis-cache redis-store pgsql
    kubectl delete deployments -nmonitoring prometheus grafana
    kubectl delete pvc -nprod pgsql redis-cache redis-store
    kubectl delete pvc -nmonitoring grafana prometheus
    kubectl delete pv prometheus grafana pgsql-prod redis-cache redis-store
    # Wait a little bit before running this - it takes a while for the GCE disks to become unbound
    gcloud compute --project=sourcegraph-dev disks delete grafana-dot-com pgsql-prod-dot-com prometheus-dot-com redis-cache-dot-com redis-store-dot-com --zone=us-central1-f
    # Create new redis-cache disk
    gcloud compute --project=sourcegraph-dev disks create redis-cache-dot-com --zone=us-central1-f --type=pd-ssd --size=100GB
    # takes 10-15 minutes
    go run util/snapshot.go
  • Uncomment backup manager deployment in
  • Change tls.Secret.yaml to use prod's cert instead of sgdev's
  • Switch DNS to new cluster

Post deployment checklist

  • Basic usage (search, code intel)
  • Verify consistency with Stripe and our database (did any transactions happen in the data loss period?)
  • Does Looker work?
  • Fix Prometheus
  • Fix search-badger
  • Investigate whether or not repo-updater has a memory leak or just a higher steady-state
  • Investigate why the new github-proxy doesn't work with the authentication used in sourcegraph-dot-com
  • Figure out why backup-manager has permission issues
  • Merge the base branch into the PR
  • Delete old main-cluster-5 cluster

After migrating dogfood and production to the new deployment styles:

  • work on re-adding functionality to run e2e tests entirely in CI
  • work on re-adding functionality to have CI auto deploy dogfood and production
  • work on a Slack bot (or something similar) that notifies us when deploys have occurred cc @keegancsmith

@sqs sqs added the roadmap label Nov 13, 2018

@sqs sqs added this to the 3.0-preview milestone Nov 13, 2018

@sqs sqs changed the title WIP: Deployment Align internal deployment processes with customers' Nov 13, 2018

@sqs sqs added the deployment label Nov 13, 2018



felixfbecker commented Nov 13, 2018

Am I reading this correctly that we are getting rid of all auto-deployment? That would be a huge step backwards. The same goes for not running e2e tests automatically: even though everybody already knows how to run them locally, nobody does. They need to run automatically.



keegancsmith commented Nov 14, 2018

Why can't we keep our continuous deployment for production, and dogfood what our customers use on dogfood?



ggilmore commented Nov 15, 2018

@felixfbecker @keegancsmith PTAL. The original post has been updated with a more fleshed-out description.



ggilmore commented Nov 20, 2018

For our dogfood cluster, I am planning to create a single K8s deployment that has the entire sourcegraph/server image. Using K8s for this allows us to have a well-defined + easily versioned configuration file (and allows us to facilitate auto-deployments in the future).
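A minimal sketch of what that single deployment could look like (the image tag, PVC names, and replica count are placeholders; the mount paths mirror the volumes from the docker run command we give customers):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sourcegraph-server
spec:
  replicas: 1
  selector:
    matchLabels: { app: sourcegraph-server }
  template:
    metadata:
      labels: { app: sourcegraph-server }
    spec:
      containers:
        - name: server
          image: sourcegraph/server:2.13.0   # placeholder tag; Renovate would bump this
          ports:
            - containerPort: 7080
          volumeMounts:
            - name: config
              mountPath: /etc/sourcegraph
            - name: data
              mountPath: /var/opt/sourcegraph
      volumes:
        - name: config
          persistentVolumeClaim: { claimName: sourcegraph-config }
        - name: data
          persistentVolumeClaim: { claimName: sourcegraph-data }
```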

Another possible option is to not use K8s, and instead provision a VM and only use the docker run command that we provide. I have no personal experience with setting this up as a real deployment (only for testing during endgame etc.), but this path has the benefit of using the exact steps that we tell customers to use. We'd gain experience with what it takes to provision persistent disks / update Sourcegraph / etc. outside of a K8s context.

I am still leaning towards just using K8s. @sqs what are your thoughts?



sqs commented Nov 20, 2018

A single k8s sourcegraph/server deployment sounds great. There are a lot of companies that run Sourcegraph that way. If you choose this route, add a doc page (alongside the Docker deployment instructions in doc/admin/) for how to do this.



keegancsmith commented Nov 21, 2018

The plan sounds good; what is missing for me is a timeline of when deploys happen. When things go wrong, knowing that gitserver was deployed 15 min ago is very useful. Maybe a Slack bot monitoring the k8s deployments would be useful here. Right now, a new commit to the deployment repo doesn't necessarily mean it was actually deployed.
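A tiny sketch of that notification idea (the payload format and wiring below are assumptions, not an existing bot):

```shell
# Build a Slack-style webhook payload for a finished deploy.
payload() {
  printf '{"text":"deployed %s -> %s"}' "$1" "$2"
}

payload gitserver sourcegraph/gitserver:2.13.0; echo
# Real wiring might wait for the rollout to finish and then post, e.g.:
#   kubectl rollout status deployment/gitserver -n prod &&
#     curl -X POST -d "$(payload gitserver "$NEW_TAG")" "$SLACK_WEBHOOK_URL"
```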
