Resilience engineering for software: a FAQ

What is resilience engineering?

Resilience engineering can be viewed as a set of high-leverage approaches to managing failures in complex socio-technical systems -- which makes it a domain relevant to many technology companies.

Failure in complex systems is itself a complex subject. The paper How Complex Systems Fail by Richard Cook is an excellent short introduction. For a higher-fidelity definition, see John Allspaw’s talk Resilience Engineering: The What and How.

Complex systems have bounded resources and you probably have limited time. Expect resilience engineering to be highest leverage when failures (e.g., incidents) are substantially impacting the sustainability of your systems, the happiness of your engineers, your ability to meet business needs, and/or the happiness of your customers (framing borrowed from Honeycomb).

What is the relationship between resilience engineering and DevOps/SRE?

DevOps’ approach to safety focuses on mitigating the impact of known modes of failure -- “known unknowns” like bad deploys, host failures, etc. Resilience engineering is concerned with the ability to adapt to unknown unknowns -- for example, how did your organization respond to the decades-old latent flaw that became Spectre/Meltdown?

For further insight, read the preface and conclusion chapters (more, of course, if you're inspired) of Accelerate by Nicole Forsgren PhD, Jez Humble, and Gene Kim, along with Sustainable Operations in Complex Systems with Production Excellence by Liz Fong-Jones, and compare the perspectives there with those of Cook and Allspaw in the resources mentioned above.

I’m intrigued, what are my next steps?

For most software organizations, the low-hanging fruit of resilience engineering will be learning more from incidents by improving the postmortem process. Etsy has an excellent guide.

Beyond that, if you want to learn more about the domain and enjoy reading academic papers, see Lorin Hochstein’s paper-centric introduction. If you prefer conference talks, John Allspaw has curated a YouTube playlist. Nora Jones also runs a Slack community focused on resilience engineering, Learning From Incidents in Software.

What should I do if I have other questions?

Your options include:

Opening an issue on this repo
Reaching out to Lorin Hochstein, Jacob Scott, or others in the resilience engineering community on Twitter
Contacting Allspaw, Cook, and Woods’ consultancy, Adaptive Capacity Labs, if you want to work with subject matter experts in a professional/contractual setting
