
Proposal PR: Remove the management console (🦄 unicorns and 🌈 rainbows here) #7197

Merged
merged 31 commits into master from sg/remove-mgmt-console on Dec 17, 2019

Conversation


@slimsag slimsag commented Dec 13, 2019

The management console today is the single most significant barrier that site admins encounter when trying to configure their Sourcegraph instance, which is a major blocker for customer onboarding and ARR growth. This (fully-functioning) PR proposes removing it, making the user experience better, reducing technical complexity (-8k LOC) and closing 18 open issues in the process.

From admins not saving their password (#5186), to confusing browser TLS warning prompts (#6070), to it being difficult to access (#1934), to the complexity of poking firewall holes properly (#2478), to a general lack of maintenance and significant technical complexity for us (#5760, #5350, #4235, #4213, #2759, #2404, #2322, #2095, #2090, #1822, #1731, #1685, #1435, #1432), to even slowing down our CI builds (#6770) -- the management console is a massive source of problems for our users and of complexity for us from an Engineering, Sales, and overall business standpoint.

But before we go around suggesting we remove things willy-nilly, why did we add it in the first place and why does it make sense to remove now?

Management console: a brief history

In November of last year in PR #966 we introduced the management console. At the time, Sourcegraph configuration was provided through a JSON configuration file mounted into the container and this presented challenges for site admins going through our onboarding / trial process:

  • Admins needed to not only figure out the right properties to plug in for e.g. their auth provider and code host (which can be quite complex), but also needed to do so in their editor without any autocomplete or helpful tooltips to guide them.
  • After making edits, admins had to restart all containers for the changes to take effect.

All in all, this was a tedious process and made configuration very difficult. To resolve this, we made all configuration dynamic so that restarting services was not required. However, some configuration properties (such as authentication), if misconfigured, could lock the site admin out of their instance and prevent them from fixing their configuration afterward. To resolve this, we took inspiration from other similar tools and created the management console.

Our hope was that the management console and overall configuration UI would eventually become more intuitive, UI-based / beginner-friendly (instead of just a JSON editor), and more. In practice, this work was de-prioritized in favor of other more important work.

Additionally, the management console introduced security concerns which meant we had to lock it down heavily with a self-signed TLS cert on new instances and a strong auto-generated basic auth password in order for it to be sufficiently secure. You can read more about this in my prior proposal RFC 63 REVIEW: Authless management console access.

The management console succeeded in lowering the barrier to onboarding at the time we added it, because it meant we could make the majority of options (site configuration) easily editable at runtime.

What has changed?

We can do better.

We understand our product and the configuration options it must provide more intimately than anyone else. When we added the management console last year, we closely mimicked best practices in similar software -- and in the meantime, we have made our product more resilient to misconfiguration. Nearly every configuration option since has been made dynamic (editable at runtime without requiring a restart), we have paid closer attention to configuration issues, and most importantly, we have learned what the problems with configuring Sourcegraph are on its own merits, without mimicking other products.

Before starting this change, I wanted to ensure I had a full grasp of the entire situation from a technical point of view. To do this, I compiled a complete list of all existing critical configuration / management console options and determined whether or not they fit the original criteria for what belongs in the management console (things that, if misconfigured, could lead to the admin being locked out). You can find this document here.

Once this was written, two things became clear to me:

  1. Not a single critical configuration option can prevent the frontend from starting. This was not true when we added the management console originally and has important implications I will get to later.
  2. Only 5 configuration options could lock an admin out of the instance:
    • One, the licenseKey, I resolved last night in "Allow admins past license expiration screen in order to easily update key" (#7181)
    • The remaining 4 would need to be set to ridiculously stupid values in order to lock an admin out in practice: removing all auth providers, setting session expiry to a very low value, adding custom HTML that bricks the frontend web UI, etc.
    • This is not to say these cases are impossible, but that they are very unlikely in practice and I have not once heard of an admin ever doing this by mistake.

Given this data, we can shift around our tradeoffs from what they are today:

  • 95% of admins onboarding must go through the painful escape hatch (the management console)
  • 5% of admins needing an escape hatch have a nice/friendly UI.

To something more amenable to our goals, such as:

  • 0% of admins onboarding must go through the painful escape hatch.
  • 5% of admins needing an escape hatch still have one, and it's MUCH friendlier than the management console in its current form.

Solution

  1. The management console is removed - gone.
  2. All configuration options previously found in the management console are automatically migrated into the site configuration.
  3. We no longer have both "critical" and "site" configuration: we now have just "site" configuration and one editor / schema.
  4. In 95% of cases, admins can make all configuration edits through the regular web UI without any trouble.
  5. In the 5% of cases where an admin has messed up badly and needs an escape hatch, we provide a command to edit a file in the Docker container using e.g. vi or nano. This file contains your current site configuration, and any edits made to it are synchronized with the database. The file is otherwise ephemeral (i.e. the DB remains the source of truth for configuration; the file just allows edits as an escape hatch). How this looks:

Editing your site configuration if you cannot access the web UI

If you are having trouble accessing the web UI, you can make edits to your site configuration by editing a file in the container using the following commands:

Single-container Docker instances

docker exec -it $CONTAINER nano /site-config.json

Kubernetes cluster instances

kubectl exec -it $FRONTEND_POD -- nano /site-config.json

Perform your edits, then type ctrl+x and y to save the changes. They will be applied immediately, just as if they had been made through the web UI. If you prefer vi, simply replace nano in the commands above.
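The escape-hatch semantics described above (DB as the source of truth, file as an ephemeral editing surface) can be thought of as a per-tick, DB-first sync decision. Here is a minimal Go sketch of that decision; the names (`decideSync`, its arguments) are hypothetical and this is illustrative only, not the actual Sourcegraph implementation:

```go
package main

import "fmt"

// decideSync models one tick of the escape-hatch sync loop. The DB is the
// source of truth; the file on disk is ephemeral. If the DB changed since the
// last sync (e.g. an edit through the web UI), the file is overwritten;
// otherwise, if the file changed (an admin edited it with nano/vi), the file
// contents are persisted to the DB.
func decideSync(lastSynced, dbContents, fileContents string) (writeFile, writeDB bool) {
	if dbContents != lastSynced {
		return true, false // DB changed: push DB -> file.
	}
	if fileContents != lastSynced {
		return false, true // File edited: push file -> DB.
	}
	return false, false // Nothing changed this tick.
}

func main() {
	// An admin saved a change through the web UI since the last sync:
	writeFile, writeDB := decideSync(`{"a":1}`, `{"a":2}`, `{"a":1}`)
	fmt.Println(writeFile, writeDB) // true false: the file is overwritten from the DB
}
```

Checking the DB first is what makes the DB the source of truth: a web-UI edit always wins the tick, and the file only flows back to the DB when the DB itself has not moved.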

Next steps

100% of what I described above is already implemented in this PR. This could not have been written as an RFC because the solution was inherently tied to the constraints we have in our product -- which could not be uncovered without implementing this.

If this proposal is accepted, we can land this in 3.11 easily.

  • Get feedback from Product to confirm they believe this is a good step forward. cc @sqs @christinaforney
  • Get help from @beyang to resolve issues in the regression test suite -- it relied on the management console API for configuration and must now go through the app itself instead.
  • Write and update both documentation and in-product documentation/links.
  • Write paired changes for deploy-sourcegraph and deploy-sourcegraph-docker to remove the management console.

Fixes


beyang commented Dec 13, 2019

  • One immediate thing that comes to mind: we need to update the onboarding widget so that the last item ("Configure SSO") links to the site config, rather than to documentation about how to access the management console. Should be a quick fix.



@nicksnyder nicksnyder left a comment


Thanks for the very detailed PR description and for proposing a holistic solution to a bunch of problems. Removing the management console sounds like the right thing to do and I am in favor of moving forward with this change so I am approving this PR. I did not review the implementation so please get that reviewed by others.

Questions:

  • What migration, if any, is required by existing customers?
  • Does this impact @uwedeportivo's work on metrics (or is that completely separate from the management console)?

)

// RunMigrations runs configuration DB table migrations.
func RunMigrations(ctx context.Context) error {

  • TODO(@beyang): test this migration runs properly.


beyang commented Dec 13, 2019

My understanding is that the backup config file is written to a file within the Docker container (the single Docker container in the single node deployment, or each of the frontend Docker containers in the clustered deployment). What happens if the container fails to start? I understand that we don't think config changes will cause the container to fail to start, but containers may fail to start for some other reason.

A counter-proposal that occurred to me (not yet fully fleshed out): what if we provided a CLI one-liner to extract the config from the database and then instructed the site admin to use SITE_CONFIG_FILE and SITE_CONFIG_ALLOW_EDITS=true?

Resolved these concerns with @slimsag in person. Changes will be synced every second from DB to file (if the DB has changed) and then from file to DB (if the file has changed). It's a little redundant in the cluster setting (as this will happen in each frontend container), but doesn't have a noticeable impact on end-user experience. This approach is easier for the site admin than executing a psql command to extract the config from the DB and then using SITE_CONFIG_FILE.

@felixfbecker

This is a great analysis @slimsag, and every proposal to remove/replace/change something should look like this! I.e. why was this added in the first place ➡️ do these reasons still apply? 👏🏻

UX should be the deciding factor here, but it would certainly remove code maintenance headaches for me.


beyang commented Dec 13, 2019

@slimsag a note for later: the docs should clearly mention that all this is not necessary if the admin is using SITE_CONFIG_FILE (because then the source of config truth is outside the DB).


slimsag commented Dec 13, 2019

What migration, if any, is required by existing customers?

I will add documentation before merging this PR which explains the change to site admins and any manual steps that are needed. The migration is already mentioned in the CHANGELOG somewhat effectively: it is automatic and no manual migration is needed.

There is a small caveat to this which is that users of SITE_CONFIG_FILE should copy their config from Sourcegraph back into the file so that we can remove this migration code in the future (but nothing bad will happen if they upgrade without doing that, until we remove this migration code down the line). This is mentioned in the CHANGELOG, and will be mentioned in the docs before merging.

Does this impact @uwedeportivo's work on metrics (or is that completely separate from the management console)?

There is zero relationship with metrics here; I'm not sure what you might be thinking of with this question. It does supersede the work he did in #6696, though.


slimsag commented Dec 13, 2019

Beyang and I went over his questions in a call; responding here for transparency's sake:

Does this mean every frontend instance will sync config to its local FS? How does this work in a Kubernetes cluster where there are multiple frontend instances?

The best way to mentally model / reason about the behavior is this: There is no difference between what we do here with the escape hatch file and with what we do in the web UI in terms of handling conflicts.

In the case of the web UI, we sync the DB contents to the web UI, the admin makes an edit, and we sync that edit to the DB.

In the case of the escape-hatch file, we sync the DB contents to the (ephemeral) file on disk, the admin makes an edit, and we sync that edit to the DB.

The file itself only acts as a mechanism for admins to easily make edits without a web UI.

In terms of conflicting edits, we have the same risks as the web UI: if two admins edit the site config through the web UI at once, one of their changes will be rejected (they will receive an error). With this approach, the error appears in the frontend container logs, whether the conflict is with someone editing the config through a file on another frontend container or through the web UI.
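The conflict behavior described here amounts to optimistic concurrency: every save must be based on the current revision, so only one of two concurrent edits succeeds. A minimal Go sketch of that idea (hypothetical types and names, not the actual Sourcegraph schema):

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict mirrors the "one of their changes will be rejected" behavior
// for concurrent edits described above.
var errConflict = errors.New("site config changed since your copy was loaded; please retry")

// configStore models optimistic concurrency for site-config writes: each
// saved revision gets an ID, and every write must name the revision it was
// based on.
type configStore struct {
	latestID int
	contents string
}

// write applies an edit only if it was based on the current revision,
// whether the edit came from the web UI or from the escape-hatch file.
func (s *configStore) write(basedOnID int, newContents string) (int, error) {
	if basedOnID != s.latestID {
		return 0, errConflict // A conflicting edit landed first.
	}
	s.latestID++
	s.contents = newContents
	return s.latestID, nil
}

func main() {
	s := &configStore{latestID: 1, contents: `{}`}
	// Two admins both loaded revision 1 and edit concurrently:
	id, err := s.write(1, `{"a":1}`) // first save wins and becomes revision 2
	fmt.Println(id, err)
	_, err = s.write(1, `{"b":2}`) // second save was based on a stale revision
	fmt.Println(err)               // the conflict error, surfaced in logs or the web UI
}
```

The same check applies regardless of which surface the edit came from, which is why a file edit on one frontend container can conflict with a web-UI edit on another.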


beyang commented Dec 15, 2019

@slimsag I've pushed my commits updating the regression tests to this branch


christinaforney commented Dec 16, 2019

This looks great! Approving from product side (though I would like to test it out). Don't block on me testing since I won't be able to until Monday morning.

Is there any risk pushing this in too close to branch cut and creating problems with the release?

@sqs sqs removed their request for review December 16, 2019 12:45
@slimsag slimsag requested a review from a team as a code owner December 17, 2019 03:14

slimsag commented Dec 17, 2019

I am going to merge this. I think this change will be a MASSIVE improvement for our users and is unlikely to cause us any issues in the release. We have multiple escape options if I turn out to be wrong (reversion, patch releases, the documented SITE_CONFIG_FILE backup escape hatch, etc.)

(I've addressed all feedback (including feedback from the web team about the critical config requiring a gulpfile hack), and am committed to following up on any additional feedback anyone leaves here or in issues filed on me. There is more cleanup I can and will do to this part of our codebase overall, but this PR is too large and making those changes in it would be risky.)

There are always risks with landing large changes like this, and even more so with this change in particular given it has not had a significant amount of time to bake (i.e. it has gone from conception -> implementation in just three days!), but I think the value we gain and the risks of not merging this in 3.11 greatly outweigh the risks of merging it.

Potential risks of merging this include:

  • Migration trouble -> The migration is automatic so there should be no issues, and there are clear migration notes called out in the CHANGELOG, documentation in several locations, and k8s migration documentation. The risk here is minimal.
  • Product change oversight (we learn we don't like the new behavior for some reason) -> this seems incredibly unlikely, as I have heard very positive feedback from multiple people on the team.
  • Something falling through the cracks (this is a big change) -> It is certainly possible, and this is likely the greatest risk with this change, but I feel confident enough to merge this now. Release captain (Beyang) will also assess independently whether or not this change makes the 3.11 branch cut (along with some other things) which makes this a "two people must turn the nuclear launch key" kind of action.

The (quite large) values we gain are outlined in the PR description already, and there are large risks of this not landing in 3.11 in my view from what I have seen in our current onboarding flow and given our renewed targets for the new year.


slimsag commented Dec 17, 2019

Important note: the e2e test failure is unrelated and is passing locally (error is with CI not the actual tests), please see discussion here: https://sourcegraph.slack.com/archives/C07KZF47K/p1576555658027900?thread_ts=1576387614.018200&cid=C07KZF47K

@slimsag slimsag merged commit c23f977 into master Dec 17, 2019
@slimsag slimsag deleted the sg/remove-mgmt-console branch December 17, 2019 04:30
sqs added a commit that referenced this pull request Jan 11, 2020
Now that #7197 was merged (removing the management console and merging site config and critical config), it is not necessary to have separate files for site config and critical config.

After this PR is merged, anyone running `enterprise/dev/start.sh` will be notified to update dev-private to a commit that incorporates sourcegraph/dev-private#10 (which removes the dev-private critical-config.json file).
sqs added a commit that referenced this pull request Jun 17, 2020
The management console (and critical config) was removed in #7197. Critical config is no longer a concept; it has been merged into site config. This commit updates code and docs to remove remnants of critical config.

@slimsag stated on 2020-02-05 that we should wait until Sourcegraph 3.15 to remove this (#8284 (comment)). It is past that time already. The migration guides referring to critical config remain, of course, so admins of older instances can still find those instructions.
j-shilling pushed a commit to j-shilling/cody-lsp-gateway that referenced this pull request Jun 20, 2023
Remove the management console and merge all critical configuration automatically into the site configuration.

Instead of the management console, a file-based site configuration editing escape hatch is now available and is documented.

See [the added 3.11 migration notes](https://docs.sourcegraph.com/admin/migration/3_11) for details on the implications of this change, and ["editing your site configuration if you cannot access the web UI"](https://docs.sourcegraph.com/admin/config/site_config#editing-your-site-configuration-if-you-cannot-access-the-web-ui) for how to use the new escape-hatch configuration file.

Fully-detailed proposal and motivations behind this change are documented in the PR description here: sourcegraph/sourcegraph#7197

Associated PRs:

- sourcegraph/deploy-sourcegraph#478
- sourcegraph/deploy-sourcegraph-docker#54