# [Tools and Practice](../README.md) / On-Call Best Practices

## Overview

In a modern development environment, we want to make sure that the people
writing the code own the services they write in production. Part of that
ownership is sharing the burden of the on-call rotation. To make sure that
burden is not too arduous for everyone involved, here are some best
practices for developing the on-call practice on your project.

## Assumptions

For the sake of simplicity, most of this article assumes you will be using
[PagerDuty](https://www.pagerduty.com) to handle the actual alerting of the
engineers on-call; if you are using another provider, most of these
recommendations can be adapted to it. You can read more about various
alerting providers in the [Alert Providers](../../infra/alerting/alert-providers.md)
guide.

## On-Call Responsibilities

If you are the engineer on-call, you have a number of responsibilities
you are expected to fulfill.

* Prior to going on-call, you should make sure that you have access to
  any resources necessary to diagnose and correct issues -- this means
  AWS and GitHub access, documentation, and any other tools. Your project
  should have an on-call checklist to make it easy for you to be confident
  you have this covered.
* If you know you will be away for an extended period during an on-call
  shift, it is *your* responsibility to find someone to cover your shift.
  If you are unable to, talk to your lead and see if they can help. If
  you will be gone for more than a day or two, it may be easiest to swap
  the entire shift with someone. PagerDuty allows you to schedule these
  with [overrides](https://community.pagerduty.com/t/creating-a-schedule-override/850);
  a minimal API sketch follows this list.
* When you are paged, you are expected to respond to the alert within
  five minutes. This means that you have *acknowledged* the alert and are
  looking into the issue. Acknowledging the alert prevents it from
  automatically escalating (see [Escalation and Notification
  Policies](./README.md#escalation-and-notification-policies) for
  more information) and communicates that you are working on the issue.
  Do not forget to do this before you start working; there's nothing
  worse than getting a page as a secondary at an odd hour only to find
  that someone else is already taking care of the problem. While this
  five-minute window may seem tight, alerts should be well-tuned so that
  you are not paged for things which are not urgent (see [Project
  Expectations](./README.md#project-expectations)).
* The response time expectation does mean that your flexibility to take
  care of things away from internet access will be curtailed while on-call,
  but we want to reduce that burden as much as possible. If you need to
  run a quick errand, if an emergency comes up, or if you will be in
  transit for an extended period, you should notify your secondary (or
  your primary, if you are the secondary) and make sure they will be able
  to cover for you while you are away.
* Despite the expectation that you will be the first responder as the
  person on-call, *this does not mean you are expected to go it alone*.
  If you get an alert, you can't figure out what is going on within 15
  minutes, and you believe the impact is such that it needs to be addressed
  immediately, you should feel free to page your secondary for assistance.
  If you are still stuck (or you *were* the secondary), you should feel
  free to call upon your lead or a known subject matter expert (SME).
* If you are *not* on-call, you should refrain from responding to alerts
  even if you see them in Slack or elsewhere. By doing so, you can reduce
  your own interruptions. However, if you believe you might be responsible,
  or know the on-call person is dealing with another higher-priority issue
  and want to assist, *let the on-call engineer know* and then make sure
  you take ownership of the alert in PagerDuty. Remember that they likely
  already got the alert notification, and make sure they have acknowledged
  that you will be taking care of the alert before taking action, so that
  you are not working at cross-purposes.
* You should make sure that you are keeping a persistent record of alerts
  and/or incidents each day. This can be as simple as a Google Doc filled
  out at the end of the day, but it should record at least the time of the
  alert, the alert that fired, and what was done to address it (even
  if that is "the alert went away on its own"). This serves as a way to
  pass knowledge on to the other on-call engineers or the next shift, and
  allows us to look at the previous week or month for alerts that are
  particularly troublesome.
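
If you want to arrange a shift swap programmatically, PagerDuty's REST API
exposes schedule overrides. Below is a minimal sketch using Python and the
`requests` library; the API token, schedule ID, user ID, and times are
hypothetical placeholders, so substitute the real values for your rotation.

```python
# Sketch: create a PagerDuty schedule override to cover part of a shift.
# The token, IDs, and times below are hypothetical placeholders.
import requests

API_TOKEN = "YOUR_API_TOKEN"   # hypothetical: a read-write REST API key
SCHEDULE_ID = "PXXXXXX"        # hypothetical: the rotation's schedule ID
COVERING_USER = "PYYYYYY"      # hypothetical: the user covering the shift

resp = requests.post(
    f"https://api.pagerduty.com/schedules/{SCHEDULE_ID}/overrides",
    headers={
        "Authorization": f"Token token={API_TOKEN}",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
    },
    json={
        "override": {
            "start": "2020-06-01T17:00:00-04:00",
            "end": "2020-06-03T17:00:00-04:00",
            "user": {"id": COVERING_USER, "type": "user_reference"},
        }
    },
)
resp.raise_for_status()
print(resp.json())
```

The override page linked above covers the web UI, which is usually the
easier path; the API is handy if your project automates schedule changes.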

## On-Call Rotations

An on-call rotation consists of a pool of engineers who share a schedule
that determines who is on-call at any one time.

* For a single on-call shift, you should have a primary responder, a
  secondary responder, and a tertiary "backstop", usually a lead, as the
  final link in the chain. During the shift, the primary is expected to
  respond to all alerts; the secondary and tertiary are there in case the
  primary is for some reason unable to respond (see [Escalation and
  Notification Policies](./README.md#escalation-and-notification-policies)
  to see how this is accomplished).
* The secondary and tertiary also serve as the primary's first points of
  contact for assistance if they have a particularly bad or difficult
  incident. They can help diagnose or remediate issues, contact subject
  matter experts for assistance, or handle the logistics of incident
  response if necessary.
* In most cases, you should aim for 6-8 people in the pool for a rotation.
  This allows a schedule that maximizes the time you are not on-call,
  while still having shifts come around frequently enough that knowledge
  does not go stale. In almost no case should there be fewer than 4 or
  more than 12 people in a rotation pool. A pool of 4 or fewer people
  means someone is likely on call at least half the time, which makes it
  extremely hard for them to recover before their next shift. A pool of
  12 or more means that knowledge can easily go stale between on-call
  shifts, and the area of coverage is likely so large that one person
  cannot have adequate knowledge to handle the incidents likely to come
  up. Instead, split the rotation into two more specialized rotations
  (such as a backend and a frontend rotation).
* Only one person should be paged for an alert at a time; paging more than
  one person increases the burden of on-call and can also result in
  confusion if two people are making changes at the same time. If the
  person responding needs additional assistance, they can always call in
  more help after they start responding.
* The usual method for on-call rotations is to change them weekly;
  however, it is not uncommon to see a Sun-Wed/Thurs-Sat half-week
  rotation, which has the benefit of giving everyone in the rotation at
  least one weekend day where they are not on-call. Either way, we
  recommend engineers have PagerDuty notify them 24 hours before going
  on call so they are aware of their impending shift. The sketch below
  illustrates the round-robin shape of a simple weekly rotation.
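
To illustrate that round-robin shape, here is a short, self-contained
sketch. The pool names are hypothetical placeholders, and PagerDuty should
remain the source of truth for the real schedule; this only shows how a
weekly rotation distributes primary and secondary duty across the pool.

```python
# Sketch: round-robin weekly primary/secondary assignments for a pool.
# Names are hypothetical placeholders; the lead backstops every shift.
from datetime import date, timedelta

POOL = ["alice", "bob", "carol", "dave", "erin", "frank"]  # 6-8 is the sweet spot

def shift_for_week(rotation_start: date, week: int) -> dict:
    """Return the on-call assignments for the given week of the rotation."""
    return {
        "week_of": (rotation_start + timedelta(weeks=week)).isoformat(),
        "primary": POOL[week % len(POOL)],
        "secondary": POOL[(week + 1) % len(POOL)],  # next in line backs up
        "tertiary": "lead",  # the lead is the final link in the chain
    }

for week in range(4):
    print(shift_for_week(date(2020, 6, 1), week))
```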

## Escalation and Notification Policies

In PagerDuty terms, an escalation policy determines how an alert will
proceed if it is not acknowledged; a notification policy is set for each
engineer individually and determines how they will be notified when they
receive an alert.

* PagerDuty and similar products can be set up to send notifications to
  Slack for each alert; we recommend doing so. Using the [PagerDuty
  integration](https://www.pagerduty.com/docs/guides/slack-integration-guide/)
  will also allow engineers to acknowledge or resolve alerts from Slack
  if they so choose.
* Engineers should have notification policies set to ensure that they
  will be notified within the expected five-minute response window. This
  should use multiple notification methods to make sure things don't fall
  through the cracks. Keep in mind that an acknowledgement will break the
  notification chain. An example might be:
  * Immediately after the alert, notify me by push notification and email.
  * 1 minute later, notify via SMS (in case data coverage is bad).
  * 5 minutes after the alert, notify via voice call.
* Your on-call rotation should have an escalation policy that escalates
  from the primary to the secondary after no more than 10 minutes, and
  from the secondary to the tertiary after no more than an additional 10
  minutes. Optimally, this should be as short as possible to ensure that
  there is a quick response; remember that an alert going unnoticed can
  incur a significant SLO impact. A 99.99% uptime target allows no more
  than about 13 minutes of downtime a quarter, for instance -- if you have
  a 10-minute escalation, an alert that falls through to the secondary may
  blow the SLO on its own if the problem is serious enough. (The
  arithmetic is sketched below.)
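
To make the SLO arithmetic concrete, this small sketch computes the
downtime budget implied by an availability target; the 99.99% target
cited above works out to roughly 13 minutes per quarter.

```python
# Sketch: downtime budget per quarter implied by an availability SLO.
# A quarter is taken as 91.25 days (one quarter of a 365-day year).
MINUTES_PER_QUARTER = 91.25 * 24 * 60  # 131,400 minutes

def downtime_budget_minutes(slo: float) -> float:
    """Minutes of downtime per quarter allowed by an availability SLO."""
    return (1 - slo) * MINUTES_PER_QUARTER

for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} uptime -> {downtime_budget_minutes(slo):.1f} minutes/quarter")
```

Note that a single 10-minute escalation step consumes most of a 99.99%
quarterly budget by itself, which is why tighter availability targets
demand shorter escalation windows.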

## Project Expectations

In addition to the expectations we have for on-call engineers, there are
also expectations we place on the project we are on-call for, in order to
ensure that on-call is not an undue burden.

* The on-call rotation *should not be getting more than 2-3 alerts per
  day*, and even that is bordering on excessive, especially if they come
  in off-hours. Optimally, this should be no more than 2-3 alerts *per
  shift*; one way to tally this from your on-call log is sketched after
  this list. If the on-call burden for a rotation is higher than that,
  there should be an understanding across engineering and product that
  project time needs to be devoted to reducing the on-call burden. This
  could mean relaxing SLOs or tuning alert thresholds, but it may also
  mean a deeper investigation, bug fixing, or code and/or infra
  improvements to prevent problems. The "SRE" way to do this is formal
  [SLOs](https://www.youtube.com/watch?v=tEylFyxbDLE) and
  [error budgets](https://www.youtube.com/watch?v=y2ILKr8kCJU), but they
  aren't always the right choice if the project is small or does not have
  the constraints that come with a 24/7 web service.
* Engineers who are primary or secondary on-call should essentially be
  considered off project work; they should focus on taking care of
  immediate needs like writing or tuning alerts, fixing stability-threatening
  bugs, addressing reported security vulnerabilities, or updating
  documentation. If they can contribute to project work as well, that
  should be a bonus, not an expectation.
* Alerts for any project should be well-documented so that an on-call
  engineer can at least begin the process of diagnosis. Questions this
  documentation should answer include:
  * What does this alert mean, literally?
  * What are common causes for this alert to fire?
  * What logs, tools, or other resources can I use to find out more about
    why this alert fired?
* It's not unusual for us to have infra teams that are significantly
  smaller than application development teams on a project. For this
  reason, it is probably a good idea for all engineers to be involved
  in any sort of infra rotation; however, you should make sure that any
  single rotation has an infra engineer in the escalation path.
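
As a companion to the persistent alert record described earlier, here is
a sketch of one way to tally alerts per day and flag days that exceed the
2-3 alerts ceiling. The log format (a CSV with an ISO-8601 `timestamp`
column) is a hypothetical example; adapt it to whatever record your team
keeps.

```python
# Sketch: tally alerts per day from a simple on-call log and flag days
# exceeding the ceiling. The CSV format here is a hypothetical example.
import csv
from collections import Counter

ALERTS_PER_DAY_CEILING = 3

def noisy_days(log_path: str) -> dict:
    """Return {day: alert_count} for days that exceeded the ceiling."""
    per_day = Counter()
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            day = row["timestamp"][:10]  # "2020-06-01T03:12:00Z" -> "2020-06-01"
            per_day[day] += 1
    return {day: n for day, n in per_day.items() if n > ALERTS_PER_DAY_CEILING}

print(noisy_days("oncall-log.csv"))
```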

## Resources

* ["Being On-Call", Andrea Spadaccini, Google SRE Book](https://landing.google.com/sre/sre-book/chapters/being-on-call/)
* ["Crafting Sustainable On-Call Rotations", Ryn Daniels, Increment April 2017](https://increment.com/on-call/crafting-sustainable-on-call-rotations/)
* [On-Call Rotations and Schedules, PagerDuty](https://www.pagerduty.com/resources/learn/call-rotations-schedules/)