
Adding on-call best practices document #32

Merged
cblkwell merged 6 commits into master from on-call-best-practice on Sep 17, 2019

Conversation

@cblkwell (Contributor) commented Sep 3, 2019

This is a more polished version of the on-call best practices Google Doc I put together last week; I want to add it to the playbook so I can reference it when writing the on-call docs for my current project.

Edit: Said Google Doc, for those who haven't seen it and may be curious about the discussion from there: https://docs.google.com/document/d/1XtGuhDfwi9ThLdDgZUGcL5pOBY5sEHfjRU9baqRhJt0/edit

cblkwell added 2 commits Sep 3, 2019
@kahlouie left a comment

This looks really great. I left a few notes since, in practice, I've seen different processes used on different projects here.

developing/on-call/README.md
but we want to reduce that burden as much as possible. If you need to
run a quick errand, or if an emergency comes up, or you will be in
transit for an extended period, you should notify your secondary (or
your primary, if you are the secondary) and make sure they will be able

@kahlouie commented Sep 4, 2019

Primary/secondary isn't something the VA project or MilMove is using at the moment. MilMove has a paired on-call person (same level of notifications). The meaning is still the same, though: communicate when you are working on the problem and when you need to take care of other responsibilities.

@cblkwell (Author, Contributor) commented Sep 4, 2019

There was some talk about that in the Google Doc where I had originally written up a lot of this; among the folks there, it sounded like recommending a primary/secondary method as opposed to the Bat-Team method was probably a better idea.

As I said in this document, paging multiple people at once is prone to causing confusion, but also increases the on-call burden for both people. I'm curious if most problems you get paged for on MilMove require two people, or if in most cases one person could handle it easily, and the other person is getting paged but doesn't really need to be there.

@chrisgilmerproj (Contributor) commented Sep 4, 2019

I think MilMove is going to transition away from A/B to Primary/Secondary/Tertiary. TBD.

@kahlouie commented Sep 4, 2019

Yay! I prefer Primary/Secondary/Tertiary anyway!

@tinyels commented Sep 12, 2019

You can have a bat-team and a primary/secondary rotation. Bat-team merely means on-call engineers are not committed to sprint objectives for their team when they are on call but instead can do work to improve production support.

developing/on-call/README.md
already got the alert notification and make sure they have acknowledged
that you will be taking care of the alert before taking action, so that
you are not working at cross-purposes.
* You should make sure that you are keeping a persistent record of alerts

@kahlouie commented Sep 4, 2019

I love this. @chrisgilmerproj what would it take for us to add this to Milmove?

@chrisgilmerproj (Contributor) commented Sep 4, 2019

It just takes someone to do it. Application teams can figure out what kind of log they want and how they want to keep it.
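
As one illustration of how lightweight such a log can be, here is a minimal sketch of a structured alert-log entry, assuming the team just wants consistent fields it can drop into a spreadsheet or a small script; the field names are illustrative, not a prescribed format for MilMove or any other project.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AlertLogEntry:
    """One row in a persistent on-call alert log (illustrative fields only)."""
    fired_at: datetime          # when the alert fired
    acknowledged_at: datetime   # when the on-call engineer acknowledged it
    alert_name: str             # e.g. the PagerDuty or CloudWatch alert title
    responder: str              # who handled it
    actionable: bool            # was there actually something to do?
    resolution: str = ""        # what was done, or the follow-up ticket

# Hypothetical example entry:
entry = AlertLogEntry(
    fired_at=datetime(2019, 9, 4, 2, 13),
    acknowledged_at=datetime(2019, 9, 4, 2, 16),
    alert_name="app 5xx rate above threshold",
    responder="on-call primary",
    actionable=True,
    resolution="restarted stuck worker; filed ticket to add retry logic",
)
```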

developing/on-call/README.md
engineers are recommended to have PagerDuty notify them 24 hours before
going on call so they are aware of their impending shift.

## Escalation and Notification Policies

@kahlouie commented Sep 4, 2019

Should we also include handoff policies for long-running incidents? In my experience, engineers can really only be productive for 4 hours straight on an urgent incident requiring attention. Handing off during an incident and giving enough context for another person to assume primary is good practice: it makes sure notifications are handled appropriately and promptly, while protecting the health and well-being of the on-call engineer. This also prevents burnout for engineers in long on-call shifts.

@cblkwell (Author, Contributor) commented Sep 4, 2019

This is definitely true, but for me, that falls into incident response procedures, which isn't quite the same thing as on-call. I didn't want to try to squeeze an adaptation of the ICS (Incident Command System) into this document. :)

@chrisgilmerproj (Contributor) commented Sep 4, 2019

Yeah, let's do incident response in another doc soon!

cblkwell added 2 commits Sep 4, 2019
the notification chain. An example might be:
* Immediately after the alert, notify me by push notification and email.
* 1 minute later, notify via SMS (in case data coverage is bad).
* 5 minutes after the alert, notify via voice call.

@chrisgilmerproj (Contributor) commented Sep 4, 2019

This feels tight. For most projects in my lifetime I've seen this:

* Immediately via apps and email
* 5 minutes: SMS
* 10-15 minutes: voice call

And then at 20 minutes have it escalate. Each project will obviously have different tolerances, but I've been in a situation where I'm just about to check my app when my texts start blowing up, and then I get called before I've finished reading them. If it turns out folks really aren't responding within 10-15 minutes, then we probably have a different problem on our hands.

@cblkwell (Author, Contributor) commented Sep 4, 2019

The problem with that, as I outline in the document, is that using that policy basically guarantees you will drop to three-nines of availability if you have a critical issue that falls through. Obviously, this all depends on the tolerances of your project -- I will admit my last on-call experience was for a 24/7 web service where we were aiming for four-nines, so tolerances were tight. I would rather say the best practice is tight, and then have the engineering and product folks decide to loosen it, rather than the other way around.
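
For context, here are the yearly downtime budgets behind the nines being discussed; this is standard availability arithmetic computed for reference, not numbers from the document.

```python
# Yearly downtime budget for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for availability in (0.9999, 0.999, 0.99):
    budget_minutes = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.2%}: about {budget_minutes:.0f} minutes of downtime per year")

# 99.99%: about 53 minutes/year -- a few incidents with 15-20 minutes of
#         time-to-response each can consume the whole budget
# 99.90%: about 526 minutes/year
# 99.00%: about 5256 minutes/year
```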

@chrisgilmerproj (Contributor) commented Sep 4, 2019

I'm going to say that if MilMove gets 90% uptime we'll be in the clear. Maybe a recommendation chart that looks like this would be nicer (numbers made up):

| Notifications | 99.99% | 99.9% | 99% |
| --- | --- | --- | --- |
| App/email | immediately | immediately | immediately |
| SMS | 1 min | 5 min | 10 min |
| Voice | 5 min | 10 min | 15 min |

@chrisgilmerproj (Contributor) commented Sep 4, 2019

I guess I'm trying to say that we should align our on-call with the expectation of the customer based on the SLA instead of making all teams attempt to get two or more 9's.

@kahlouie commented Sep 4, 2019

I think this was just an example, not a suggestion. It would depend on the contract agreements Truss has with the client around response time, so the examples should too.

@chrisgilmerproj (Contributor) commented Sep 13, 2019

I think a chart with examples would be nice, so maybe something like this with totally made-up numbers, broken out by project:

| Notifications | MilMove | SABER | CMS |
| --- | --- | --- | --- |
| App/email | immediately | immediately | immediately |
| SMS | 5 min | 5 min | 10 min |
| Voice | 10 min | 15 min | 10 min |
* 5 minutes after the alert, notify via voice call.
* Your on-call rotation should have an escalation policy that escalates
from the primary to secondary after no more than 10 minutes, and from
secondary to tertiary after no more than an additional 10 minutes.

@chrisgilmerproj (Contributor) commented Sep 4, 2019

I'll say 10 minutes seems tight to me from experience, but it depends on the project.

I'm starting to think of maybe writing this up with recommendations based on project type or having a table with alternatives to choose from.

@cblkwell (Author, Contributor) commented Sep 4, 2019

I think I answered this mostly in the last comment, but I think this basically just has to be a conscious decision by the product and engineering folks about what is required for the service you're running. If you don't care about downtime in the middle of the night, then you can just ignore alerts at night, or whatever -- but I feel like we should aim for a high level of reliability by default (especially in a public-facing document clients can potentially read).

@chrisgilmerproj (Contributor) commented Sep 13, 2019

Let's add something about being conscious of the choices made.

@cblkwell (Author, Contributor) commented Sep 13, 2019

I added a paragraph about this in there.
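
To make the timings being debated here concrete, a minimal sketch that treats the notification and escalation delays as data, so each project can plug in its own numbers; the values below are the example figures from this thread (push/email immediately, SMS at 1 minute, voice at 5, escalate to secondary by 10 and tertiary by 20), not a recommendation.

```python
# Minutes after an alert fires at which each step kicks in.
# Example figures from this discussion; tune per project SLA.
NOTIFICATION_STEPS = [
    (0, "push notification + email to primary"),
    (1, "SMS to primary"),
    (5, "voice call to primary"),
    (10, "escalate: page secondary"),
    (20, "escalate: page tertiary"),
]

def steps_fired(minutes_since_alert: float) -> list[str]:
    """Return every notification/escalation step that should have fired by now."""
    return [step for delay, step in NOTIFICATION_STEPS if minutes_since_alert >= delay]

print(steps_fired(12))
# ['push notification + email to primary', 'SMS to primary',
#  'voice call to primary', 'escalate: page secondary']
```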


* The on-call rotation *should not be getting more than 2-3 alerts per
day*, and even that is bordering on excessive, especially if these are
off-hours. Optimally, this should be no more than 2-3 alerts *per shift*.

@chrisgilmerproj (Contributor) commented Sep 4, 2019

This can be read differently by different people. Slack may get notifications for several things that PagerDuty ends up turning into a single event and page. Several notifications are different from several pages. I've seen PD and AWS CloudWatch set up in such a way that up to 20 CloudWatch notifications go out (to Slack/email) but PD only gets one actual page, with PD bundling the notifications into one event.

This is all to say that some folks are going to read this and say "notifications count and we should disable so many notifications" when in fact I think we're trying to say "alerts count, the ones that actually get to your phone/SMS/voice".
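
For illustration, the bundling described above is roughly what a `dedup_key` does in PagerDuty's Events API v2: trigger events that share a key are rolled into a single incident, so many upstream notifications result in one page. This is a hedged sketch with a placeholder routing key and alarm name; a real CloudWatch integration handles this through the integration itself rather than a script.

```python
import requests  # third-party: pip install requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"  # placeholder, not a real key

def trigger_event(summary: str, dedup_key: str) -> None:
    """Send a trigger event; events sharing a dedup_key collapse into one incident."""
    body = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "dedup_key": dedup_key,
        "payload": {
            "summary": summary,
            "source": "cloudwatch",
            "severity": "critical",
        },
    }
    requests.post(PAGERDUTY_EVENTS_URL, json=body, timeout=10)

# Twenty notifications about the same underlying problem...
for i in range(20):
    trigger_event(f"5xx rate above threshold (notification {i})",
                  dedup_key="app-5xx-rate")  # ...produce a single page.
```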

@cblkwell (Author, Contributor) commented Sep 4, 2019

Yes, I am talking about alerts here -- I would generally caution against a bunch of unactionable notifications too, at least if they are going through the same channel as alerts, just because it makes alert numbness more likely to set in. Do you think this would benefit from a definition of terms at the beginning?

@chrisgilmerproj (Contributor) commented Sep 4, 2019

Yeah, common definitions would be super helpful.

@cblkwell (Author, Contributor) commented Sep 13, 2019

Added definitions for Alerts and Notifications.

day*, and even that is bordering on excessive, especially if these are
off-hours. Optimally, this should be no more than 2-3 alerts *per shift*.
If the on-call burden for that rotation is higher than that, there
should be an understanding across engineering and product that engineering

@chrisgilmerproj (Contributor) commented Sep 4, 2019

It's not always engineering time. In some cases, like MilMove, the customer, product team, and engineers need to decide what the correct alert threshold is for the phase of the project (prototype with pilots vs delivered product). I would make sure we use the RACI model and figure out who needs to be Consulted when modifying alerts.

@cblkwell (Author, Contributor) commented Sep 4, 2019

I'm not quite sure what you're saying here. Are you saying that engineers aren't the only ones who have to burn time as a result of an excessive on-call burden?

@chrisgilmerproj (Contributor) commented Sep 4, 2019

Yeah, exactly, this is what I'm saying. Folks who need to burn time should all be consulted.

@cblkwell (Author, Contributor) commented Sep 13, 2019

I changed it from "engineering time" to "project time" -- does that match what you intended here?

smaller than application development teams on a project. For this
reason, it is probably a good idea for all engineers to be involved
in any sort of infra rotation; however, you should make sure that any
single rotation has an infra engineer in the escalation path.

@chrisgilmerproj (Contributor) commented Sep 4, 2019

We've just removed Infra as part of the escalation path on MilMove. The idea is that infra can be used as SMEs for any issues, but they'd have to be paged separately outside of work or SLA hours.

I'd maybe like to turn this on its head and say that application teams should look at their code as a set of "systems" and own those systems including the infrastructure tools they need to understand health and debug the system.

@cblkwell (Author, Contributor) commented Sep 4, 2019

So, I definitely agree that application teams should understand the infrastructure that their applications are built on enough to have a good mental model of how the components work and at least make some initial attempts at diagnosing infrastructure issues. I think that is one of the reasons to have application engineers go through the infra on-call escalation regularly.

On the other hand, I'm wary of just yanking infra folks out of day-to-day on-call. It's good as both an empathy building tool, and in seeing how the application actually works on the infrastructure so that infra folks can make good recommendations about how to improve reliability and design new systems to support those applications.

@chrisgilmerproj (Contributor) commented Sep 4, 2019

What I'm particularly afraid of here is saying "in the escalation path" as though an infra engineer ought to be able to solve any problem. This is what was happening on MilMove: an alert related to the app would go through, it would have an effect that appeared to be infra-related, it would get escalated, infra would have to figure out what was causing the problem only to determine that the app was the problem, and then send it back. The tendency is then to say anything that looks infra-related (which can be as tenuous as "the alert came from AWS") needs to go directly to infra and skip the app teams.

I'd prefer that infra and app devs be on different escalation paths (each having their own primary/secondary/tertiary) and use each other as SMEs but retain the ownership of the event while an investigation is ongoing. I want to avoid the pager hot-potato that seems to happen on MilMove.

@cblkwell (Author, Contributor) commented Sep 13, 2019

Well, I think it's wise for at least one infra engineer to be in the infra escalation path. Putting them in the application engineering path makes less sense (although optimally, they should be getting some exposure to the application, and it would also force application engineers to document their alerts and responses).

@chrisgilmerproj (Contributor) left a comment

Overall really great! I left some comments. I think MilMove is not a great example, but since we're currently changing a lot of our practices I've brought stuff up here. I also expect this document will shape how MilMove eventually does on-call, so that will be a good bit of dog-fooding.

developing/on-call/README.md
@tinyels commented Sep 12, 2019

This is great! I've got a task to add some documentation about when to create a bat-team that I would like to introduce once this is merged. (It will be a separate document but effectively the bat-team is the on-call team so I do want to link them.)

@chrisgilmerproj (Contributor) left a comment

I don't want to block getting this merged. If there is anything I can do to help please let me know.

@cblkwell (Author, Contributor) commented Sep 17, 2019

I am going to go ahead and merge this -- it's a living document, so folks should feel free to make PRs to add detail, and we can keep discussing this in Slack for sure.

@cblkwell merged commit 9d6726e into master on Sep 17, 2019

1 check passed: ci/circleci: validate

@cblkwell deleted the on-call-best-practice branch on Sep 17, 2019
