Skip to content

Migrate Taskcluster to postgres#154

Merged
helfi92 merged 8 commits intotaskcluster:masterfrom
helfi92:rfc-65
May 1, 2020
Merged

Migrate Taskcluster to postgres#154
helfi92 merged 8 commits intotaskcluster:masterfrom
helfi92:rfc-65

Conversation

@helfi92
Copy link
Copy Markdown
Contributor

@helfi92 helfi92 commented Dec 16, 2019

No description provided.

@helfi92 helfi92 self-assigned this Dec 16, 2019
@helfi92 helfi92 requested a review from djmitche December 16, 2019 16:32
Copy link
Copy Markdown
Contributor

@djmitche djmitche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good at a read-through. I wonder if we shouldn't follow the IETF process and create a new RFC, rather than revise this one in-place? For example https://tools.ietf.org/html/rfc293 updates RFC288 but is obsoleted by RFC298. It probably doesn't matter.. @ccooper do you have an opinion there?

We talked about audiences that might be interested in this RFC:

  • People deploying Taskcluster (cloudops)
  • People using Taskcluster (releng, firefox-ci at large)
  • People developing Taskcluster (TC team)

I think this addresses what we know of those groups' requirements, and hopefully has enough detail that they can raise any concerns early in the process.

Comment thread rfcs/0065-Migrate-queue-to-postgres.md Outdated
@helfi92 helfi92 force-pushed the rfc-65 branch 5 times, most recently from 44a80ca to 0876a8e Compare December 17, 2019 19:56
@helfi92 helfi92 changed the title Update RFC#65 Migrate Taskcluster to Postgres Dec 17, 2019
@helfi92 helfi92 changed the title Migrate Taskcluster to Postgres Migrate Taskcluster to postgres Dec 17, 2019
@helfi92
Copy link
Copy Markdown
Contributor Author

helfi92 commented Dec 17, 2019

I wonder if we shouldn't follow the IETF process and create a new RFC, rather than revise this one in-place?

Done!

Comment thread rfcs/0154-Migrate-queue-to-postgres.md Outdated
Comment thread rfcs/0154-Migrate-queue-to-postgres.md Outdated
Comment thread rfcs/0154-Migrate-queue-to-postgres.md Outdated
Comment thread rfcs/0154-Migrate-queue-to-postgres.md Outdated
Comment thread rfcs/0154-Migrate-queue-to-postgres.md Outdated
Comment thread rfcs/0154-Migrate-queue-to-postgres.md Outdated
Comment thread rfcs/0154-Migrate-queue-to-postgres.md Outdated
Comment thread rfcs/0154-Migrate-queue-to-postgres.md Outdated
Comment thread rfcs/0154-Migrate-taskcluster-to-postgres.md
Comment thread rfcs/0154-Migrate-taskcluster-to-postgres.md Outdated
@helfi92
Copy link
Copy Markdown
Contributor Author

helfi92 commented Jan 16, 2020

@edunham @sciurus I updated the RFC. Do you mind giving it another look, please? Thanks!

To test the scalability and performance of the system, we will do an import of
the FirefoxCI production database (minus the secrets) into the postgres database
on the staging deployment and then observe if the database crashes or if there
are any noticeable performance issues that arise.
Copy link
Copy Markdown

@edunham edunham Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for adding this! Let's make sure we do the parallel request testing on the system to which we've imported production amounts of data. We want to test 2 things:

  1. Can production quantities of data be imported successfully?
  2. Does the DB perform as expected when it has production quantities of data in it?

As currently written, this tests 1 but not 2. To test both, we could just make sure to do the Parallel Requests testing on the same instance that we've imported the prod quantities of data into.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good observation. I'll update the RFC to make this clearer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

## Backups and Restores

Teams operating Taskcluster will rely on the cloud provider's backup system to
handle backups and restores.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A note about what backup/restore guarantees we're getting from the old system and would want to be no worse than in the new one might be handy here. There are often many options to choose from between the extremes of "no backups" vs "keep everything forever", and if there's any general guidance then this could be a good place to put it. If not, we can figure that out per-deployment.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll want to figure that out per-deployment. The tools available are basically unrelated to Taskcluster and the design of this project, so there's not much more to say here other than this.

Direct SQL access to the database is *not allowed*. Taskcluster will allow
ad-hoc read-only queries on the data-set via stored procedures with access
controlled by Postgres permissions. This feature will most likely be done after
step 2 of the transition.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying the read-only intent here. I assume that the details of how postgres permissions will need to be configured will be forthcoming with the update that adds this feature?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually upon reading the next section it's sounding like TC will handle all the postgres perms stuff internally.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Taskcluster will manage posrtgres permissions internally.

(configured in Kubernetes), and on install/upgrade we'll use the admin user to
create a non-admin user for each service, with appropriate GRANTs for that
service's access. Deployers of Taskcluster will pick the passwords for all the
non-admin users (configured in Kubernetes). It's up to the deployer to create
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By "configured in Kubernetes", you mean "encrypted then passed as env vars like all the other secrets we currently provide to services", right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

@helfi92 helfi92 requested review from edunham and sciurus January 21, 2020 15:11
Copy link
Copy Markdown

@edunham edunham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this and have no objections to moving it to final comment period.

## Permissions

Taskcluster will manage permissions to tables/schemas and deployers will manage
user accounts. The deployment will have an "admin" postgres user/password
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the admin password need to be in kubernetes at all? Can it live outside of it?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not, an in fact the deployment docs specifically warn against including it in the kubernetes config.


## Ad-hoc Queries

Direct SQL access to the database is *not allowed*. Taskcluster will allow
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we allowed ad-hoc queries but only on a reporting db

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ad-hoc queries will run on a read-only db. What is a reporting db? Let me know if this doesn't answer your question.

using the existing stored procedure that returned the single column. That new
stored procedure is then deployed before the code that uses it is deployed.

A consequence of this design is that "procedures are forever" -- an upgrade can
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can delete a procedure but it has to happen in a later upgrade, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also talked about supporting rollbacks during the all-hands. Can we add a section here talking about them and how we will support them or justifying why we won't.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can delete a procedure but it has to happen in a later upgrade, right?

Rather than delete a procedure, a safer alternative would be to change the body of the function to return an empty array. `

We also talked about supporting rollbacks during the all-hands. Can we add a section here talking about them and how we will support them or justifying why we won't.

As long as the procedure signature is not changed, rolling back shouldn't cause any issues.

Can we add a section here talking about them

Will do.


### Tracing

Taskcluster will use New Relic to a have better visibility of the database,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be expanded on? We haven't used New Relic with tc before. What sort of changes do we need to make to support it. Why New Relic instead of something else, etc.

Not saying I disagree, just want to know more.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To support it i think you'd want to add an env var to conditionally load the newrelic package, so we can turn it on in stage and prod but not in local dev. You'll also need to add a couple env vars for config. Beyond that, hopefully it will monkeypatch the relevant libraries like pg and just work™

New Relic was my recommendation because it's something Mozilla has licenses for and some other teams use it extensively. https://cloud.google.com/trace/ plus https://googleapis.dev/nodejs/trace/latest/ would be a potential alternative.

The goal of using a tracing / apm service during the migration is to get clearer visibility into query performance over time than we can get from application logging or pg views and logs alone.

@imbstack
Copy link
Copy Markdown
Contributor

imbstack commented Feb 5, 2020

Looking good so far!

@djmitche
Copy link
Copy Markdown
Contributor

djmitche commented May 1, 2020

Having almost finished the project, we should probably (fix the check failure and) merge this!

@helfi92
Copy link
Copy Markdown
Contributor Author

helfi92 commented May 1, 2020

I forgot this was still open. Agreed. I'll take care of this. Thanks for the ping.

@helfi92 helfi92 force-pushed the rfc-65 branch 2 times, most recently from 14b6246 to 8664bdf Compare May 1, 2020 16:10
@helfi92 helfi92 merged commit 8a76163 into taskcluster:master May 1, 2020
@helfi92 helfi92 deleted the rfc-65 branch May 1, 2020 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants