Advice for incident response, accident reporting, & post-mortems? #55

Closed
mstone opened this Issue Aug 23, 2014 · 5 comments

4 participants

@mstone
mstone commented Aug 23, 2014

Despite the best efforts of thousands of people, all sorts of things go wrong with the design and delivery of large-scale digital services. What advice should we give playbook readers about how to prepare for, respond to, and learn from things going wrong with their digital services?

@wvchris
wvchris commented Aug 26, 2014

True. VA VISTA has long had an error trap that records each error together with the symbol table at the time of the failure and the line of code, which usually indicates the command that hit the error. The errors are stored by the time they happened. I created an index in VISTA that summarizes the errors and indexes them by the location in the code where the error happened, along with some detail about the kind of error. This gives the support person a convenient place to link into the error summary and get a profile of the errors as they happened, and it gives the programmer a place to store the solution for the next time the error happens, or to note whether the problem is really fixed. A lot of the time it turns out to be a training issue for the people using the software. I have made no secret that this error trap is open source.
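For readers who have not seen this kind of index, here is a minimal sketch of the idea in Python. It is not the VISTA code (which is not shown in this thread); the class names, the `ROUTINE+offset` location format, and the field names are all assumptions made for illustration. It groups time-ordered error records by code location and error kind, and lets a programmer attach a resolution note to each group.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ErrorRecord:
    """One trapped error, as a time-ordered error log might store it (hypothetical shape)."""
    when: datetime
    location: str   # assumed "ROUTINE+offset" style code location
    kind: str       # short description of the error type
    symbols: dict   # snapshot of variables at the time of failure


@dataclass
class IndexEntry:
    """Summary of every error seen at one (location, kind) pair."""
    occurrences: list = field(default_factory=list)
    resolution: str = ""   # programmer's note on the fix, empty until resolved


class ErrorIndex:
    """Index of trapped errors keyed by where and how they failed."""

    def __init__(self):
        self._entries = defaultdict(IndexEntry)

    def record(self, err):
        # File the error's timestamp under its code location and kind.
        self._entries[(err.location, err.kind)].occurrences.append(err.when)

    def resolve(self, location, kind, note):
        # Store the solution so support staff can find it next time.
        self._entries[(location, kind)].resolution = note

    def profile(self):
        """Return (location, kind, count, last_seen, resolution) rows, most frequent first."""
        rows = [
            (loc, kind, len(e.occurrences), max(e.occurrences), e.resolution)
            for (loc, kind), e in self._entries.items()
            if e.occurrences
        ]
        return sorted(rows, key=lambda r: r[2], reverse=True)
```

A support person would call `profile()` to see which locations fail most often and whether a resolution note already exists; a programmer would call `resolve()` once the cause is understood, including when the cause is a training issue rather than a code defect.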

@cew821
Contributor
cew821 commented Sep 10, 2014

@wvchris Thanks for these comments. Your experience with VISTA seems interesting. Have you considered writing a blog post or two covering some of the key lessons learned from the experience?

I'd also be interested in checking out the code (I gather from your comment that it's an open source project). Where could we find it?

@jallspaw

I have a very vested and enthusiastic interest in this topic. "Postmortem" debriefings and retrospective analysis need to be approached carefully, lest they fall back on the organizationally damaging and counterproductive methods of traditional "root cause analysis", which should be considered outright harmful in the practice of many complex domains, including software development.

I will draft a pull request that contains suggestions and thoughts on what an evolved "Learning Review" looks like. You can almost certainly expect to see elements of current research on "human error" and systems safety, à la http://codeascraft.com/2012/05/22/blameless-postmortems/.

@cew821
Contributor
cew821 commented Sep 24, 2014

Thanks for the feedback! We've added a question to try to suss out how incidents are responded to (both in real time and via post-mortems).

@cew821 cew821 closed this Sep 24, 2014
@jallspaw

FYI, it took roughly two years to follow up on my comment above, but we have just published what I referred to as "suggestions and thoughts on what an evolved 'Learning Review'" looks like, in the form of a generalized Debriefing Facilitation Guide.

It is located here (in both markdown and PDF) and we hope it can serve as a reference for the field on this topic. I'd like to suggest that the USDS and like-focused groups within the government could consider using this approach.

FWIW, there is more background here.

@mdickers47 @pahlkadot @jezhumble @haleyvandyck
