Migrate Taskcluster to postgres #154
djmitche left a comment
This looks good on a read-through. I wonder if we shouldn't follow the IETF process and create a new RFC, rather than revise this one in-place? For example https://tools.ietf.org/html/rfc293 updates RFC288 but is obsoleted by RFC298. It probably doesn't matter. @ccooper, do you have an opinion there?
We talked about audiences that might be interested in this RFC:
- People deploying Taskcluster (cloudops)
- People using Taskcluster (releng, firefox-ci at large)
- People developing Taskcluster (TC team)
I think this addresses what we know of those groups' requirements, and hopefully has enough detail that they can raise any concerns early in the process.
44a80ca to 0876a8e
Done!
| To test the scalability and performance of the system, we will do an import of
| the FirefoxCI production database (minus the secrets) into the postgres database
| on the staging deployment and then observe if the database crashes or if there
| are any noticeable performance issues that arise.
Great, thanks for adding this! Let's make sure we do the parallel request testing on the system to which we've imported production amounts of data. We want to test 2 things:
- Can production quantities of data be imported successfully?
- Does the DB perform as expected when it has production quantities of data in it?
As currently written, this tests 1 but not 2. To test both, we could just make sure to do the Parallel Requests testing on the same instance that we've imported the prod quantities of data into.
Good observation. I'll update the RFC to make this clearer.
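As a rough illustration of the second check (how the database behaves under parallel requests once it holds production quantities of data), here is a minimal Python sketch. It assumes nothing about Taskcluster's actual test tooling: `run_parallel`, `summarize`, and `fake_query` are hypothetical names, and `fake_query` stands in for a real request against the staging deployment.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def run_parallel(query_fn, n_requests, concurrency):
    """Call query_fn n_requests times across `concurrency` threads.

    Returns the latency of each call in seconds.
    """
    def timed_call(i):
        start = time.perf_counter()
        query_fn(i)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(n_requests)))


def summarize(latencies):
    """Return median, 95th-percentile, and maximum latency."""
    cuts = quantiles(latencies, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "max": max(latencies)}


def fake_query(i):
    # Hypothetical stand-in for a real request to the staging deployment
    time.sleep(0.001)


if __name__ == "__main__":
    print(summarize(run_parallel(fake_query, n_requests=200, concurrency=16)))
```

Running the same measurement before and after the production import, on the same instance, would give a direct comparison for the two questions listed above.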
| ## Backups and Restores
|
| Teams operating Taskcluster will rely on the cloud provider's backup system to
| handle backups and restores.
It might be handy to note here what backup/restore guarantees we get from the old system, so that the new one is no worse. There are often many options to choose from between the extremes of "no backups" and "keep everything forever", and if there's any general guidance then this could be a good place to put it. If not, we can figure that out per-deployment.
I think we'll want to figure that out per-deployment. The tools available are basically unrelated to Taskcluster and the design of this project, so there's not much more to say here other than this.
| Direct SQL access to the database is *not allowed*. Taskcluster will allow
| ad-hoc read-only queries on the data-set via stored procedures with access
| controlled by Postgres permissions. This feature will most likely be done after
| step 2 of the transition.
Thanks for clarifying the read-only intent here. I assume that the details of how postgres permissions will need to be configured will be forthcoming with the update that adds this feature?
Actually, upon reading the next section, it sounds like TC will handle all the postgres permissions internally.
Correct. Taskcluster will manage postgres permissions internally.
| (configured in Kubernetes), and on install/upgrade we'll use the admin user to
| create a non-admin user for each service, with appropriate GRANTs for that
| service's access. Deployers of Taskcluster will pick the passwords for all the
| non-admin users (configured in Kubernetes). It's up to the deployer to create
By "configured in Kubernetes", you mean "encrypted then passed as env vars like all the other secrets we currently provide to services", right?
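To make the "non-admin user for each service, with appropriate GRANTs" idea concrete, here is a minimal Python sketch that builds GRANT statements from a per-service access map. It is an illustration only, not how Taskcluster implements this: the service names, table names, and the `grant_statements` helper are all hypothetical, and user creation and password handling are deliberately left out.

```python
ALLOWED_PRIVILEGES = {"SELECT", "INSERT", "UPDATE", "DELETE"}


def quote_ident(name):
    """Quote a Postgres identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def grant_statements(service_user, table_privileges):
    """Build GRANT statements for one service's non-admin user.

    table_privileges maps a table name to a list of privileges.
    """
    statements = []
    for table, privileges in sorted(table_privileges.items()):
        unknown = set(privileges) - ALLOWED_PRIVILEGES
        if unknown:
            raise ValueError(f"unsupported privileges: {sorted(unknown)}")
        privs = ", ".join(sorted(privileges))
        statements.append(
            f"GRANT {privs} ON TABLE {quote_ident(table)} "
            f"TO {quote_ident(service_user)};"
        )
    return statements


# Hypothetical service and table names, for illustration only
SERVICE_ACCESS = {
    "example_queue": {"example_tasks": ["SELECT", "INSERT", "UPDATE"]},
    "example_index": {"example_tasks": ["SELECT"]},
}

if __name__ == "__main__":
    for user, access in SERVICE_ACCESS.items():
        for statement in grant_statements(user, access):
            print(statement)
```

Keeping the access map as data means the admin user can regenerate the same grants on every install or upgrade.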
edunham left a comment
I'm happy with this and have no objections to moving it to final comment period.
| ## Permissions
|
| Taskcluster will manage permissions to tables/schemas and deployers will manage
| user accounts. The deployment will have an "admin" postgres user/password
Does the admin password need to be in kubernetes at all? Can it live outside of it?
It does not, and in fact the deployment docs specifically warn against including it in the kubernetes config.
| ## Ad-hoc Queries
|
| Direct SQL access to the database is *not allowed*. Taskcluster will allow
I was thinking we allowed ad-hoc queries but only on a reporting db
Ad-hoc queries will run on a read-only db. What is a reporting db? Let me know if this doesn't answer your question.
| using the existing stored procedure that returned the single column. That new
| stored procedure is then deployed before the code that uses it is deployed.
|
| A consequence of this design is that "procedures are forever" -- an upgrade can
I think we can delete a procedure but it has to happen in a later upgrade, right?
We also talked about supporting rollbacks during the all-hands. Can we add a section here talking about them and how we will support them, or justifying why we won't?
> I think we can delete a procedure but it has to happen in a later upgrade, right?

Rather than delete a procedure, a safer alternative would be to change the body of the function to return an empty array.

> We also talked about supporting rollbacks during the all-hands. Can we add a section here talking about them and how we will support them or justifying why we won't.

As long as the procedure signature is not changed, rolling back shouldn't cause any issues.

> Can we add a section here talking about them

Will do.
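The rule discussed in this thread (an upgrade may add procedures, but must not remove one or change its signature, so that old code and rollbacks keep working) can be sketched as a simple check. This is an illustrative Python sketch, not part of Taskcluster: `check_upgrade` and the procedure names and signature strings are hypothetical.

```python
def check_upgrade(old_procs, new_procs):
    """Return problems that would break services still running old code.

    Each argument maps a procedure name to a signature string.
    Procedures added in new_procs are always allowed.
    """
    problems = []
    for name, signature in sorted(old_procs.items()):
        if name not in new_procs:
            problems.append(f"{name}: removed")
        elif new_procs[name] != signature:
            problems.append(f"{name}: signature changed")
    return problems


if __name__ == "__main__":
    # Hypothetical procedures, for illustration only
    old = {"example_get_task": "(task_id text) -> (status text)"}
    new = {
        "example_get_task": "(task_id text) -> (status text)",
        "example_get_task_v2": "(task_id text) -> (status text, retries int)",
    }
    print(check_upgrade(old, new))
```

Changing only the body of a procedure, for example to return an empty array, leaves the signature intact and so passes this check.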
| ### Tracing
|
| Taskcluster will use New Relic to have better visibility of the database,
Can this be expanded on? We haven't used New Relic with tc before. What sort of changes do we need to make to support it? Why New Relic instead of something else?
Not saying I disagree, just want to know more.
To support it, I think you'd want to add an env var to conditionally load the newrelic package, so we can turn it on in stage and prod but not in local dev. You'll also need to add a couple of env vars for config. Beyond that, hopefully it will monkeypatch the relevant libraries like pg and just work™.
New Relic was my recommendation because it's something Mozilla has licenses for and some other teams use it extensively. https://cloud.google.com/trace/ plus https://googleapis.dev/nodejs/trace/latest/ would be a potential alternative.
The goal of using a tracing / APM service during the migration is to get clearer visibility into query performance over time than we can get from application logging or pg views and logs alone.
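The env-var gating described above can be sketched generically. The discussion is about the Node `newrelic` package; this Python sketch only illustrates the pattern of loading an agent module when a flag is set, and the variable names `EXAMPLE_TRACING_ENABLED` and `EXAMPLE_TRACING_MODULE` are hypothetical.

```python
import importlib
import os


def maybe_load_tracing(env=os.environ):
    """Import the tracing agent module only when explicitly enabled.

    EXAMPLE_TRACING_ENABLED and EXAMPLE_TRACING_MODULE are hypothetical
    variable names chosen for this sketch.
    """
    if env.get("EXAMPLE_TRACING_ENABLED") != "1":
        return None  # e.g. local development: no agent is loaded
    return importlib.import_module(env["EXAMPLE_TRACING_MODULE"])


if __name__ == "__main__":
    agent = maybe_load_tracing()
    print("tracing enabled" if agent else "tracing disabled")
```

Calling this before any database library is imported matters for agents that work by patching those libraries at load time.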
Looking good so far!
Having almost finished the project, we should probably (fix the check failure and) merge this!
I forgot this was still open. Agreed. I'll take care of this. Thanks for the ping.
14b6246 to 8664bdf