Incident Response Runbook

About

This document outlines standard procedures to follow when a security incident occurs. Incident commanders and incident response analysts should use this runbook to handle incidents consistently. See the incident response preparation checklist for steps on setting up a program before using this document.

Incident Steps

When an incident is suspected, follow the steps below:

  1. Receiving information about a possible security incident: At this point, you've received information that something may be happening. Loop in your incident response on-call to initiate the process. This may involve notifying them in the main #incident channel at your company, or within the security incident channel if one exists.
  2. Spin up a private channel: Until you have ruled out the issue, it's best to create a private incident response channel for investigations and discussions. Follow a naming convention of #incident-private-subject (or something better suited for your organization) and invite initial stakeholders into the channel. This will allow you to investigate whether the issue is worth declaring as an incident without flooding the main channel. It also restricts the audience for potentially sensitive information, which could include employee/customer Personally Identifiable Information (PII), an insider under investigation, or product vulnerabilities. Alternatively, if you believe Slack or the main instant messaging channel may be compromised, you could spin up a verified Signal group chat to ensure the investigation itself is not monitored by potential attackers.
  3. Spin up an incident bridge: Having a Google Hangout or Zoom meeting to discuss the incident in real time is critical for responding quickly and hashing out details. This should be listed in the topic of the private incident channel so other stakeholders can easily find and join.
  4. Determine priority and loop in incident commander: Based on the available information, if it's suspected the issue could be valid, loop in the incident commander and declare an initial incident level based on the incident level matrix.
  5. Perform initial analysis: Attempt to troubleshoot the issue with the initial reporter, security team, and other stakeholder points of contact (POCs). Determine who to loop in based on the point of contact inventory. If you lack such an inventory, your Human Resource Information System (e.g. Workday) can be used to find the right POC.
  6. Verify impact: It's critical to understand 'What's the worst thing that can happen' to company systems, employees, or users. It's essential to ask stakeholders this question explicitly to ensure the full impact is understood. Once understood, adjust the incident level appropriately.
  7. Bring in additional support: Bring in engineers who can help troubleshoot root cause, determine solutions, and make decisions.
  8. Stop the bleeding: If a compromise, data breach, or exploitation is suspected, the #1 priority is to stop the bleeding. Work closely with operations/engineering POCs to mitigate the problem. This may require waking people up and notifying their team on-calls to help develop a strategy. The engineering/operations teams understand their systems, products, and features, as well as the unexpected side effects of disabling a service. It's critical they help determine the path forward. Only in the most dire of circumstances should things be turned off without directly consulting owners. Blindly disabling things can easily make the problem worse and impede your ability to finish determining the extent of the problem.
  9. Create an incident working document: At this stage, it's verified to be a real issue. You should start capturing evidence, Slack conversations, notes and actions performed in the incident response template. The incident commander should be responsible for ensuring this occurs, and contains the required information.
  10. Develop a fix, or implement a mitigation: At this stage, the incident team either directly implements a fix, identifies another security control to reduce the damage, or takes steps to block the issue from being exploited in production.
  11. Perform necessary forensics: Log forensics and detailed analysis of potentially compromised hosts must be performed. This will help you determine the full impact of the issue, gather critical evidence for law enforcement, and determine if other systems or accounts are impacted. Sometimes teams simply want to redeploy a host they suspect is compromised, but it's essential you first create forensic snapshots of anything you suspect is compromised. These snapshots are critical for lawsuits and criminal investigations. If you believe a host is compromised, you should never attempt to clean it; instead, opt for deploying a known clean instance. If you are unable to perform the forensic analysis yourself, there are many companies available to perform this analysis for you.
  12. Perform additional testing of the fix: Both engineering and security should attempt to bypass the fix to ensure the issue is correctly closed. If available, involve QA in this process to cover both positive testing and negative testing, as quick fixes can unexpectedly break intended functionality.
  13. Roll out the fix: Once incident stakeholders are confident the fix/mitigation addresses the concern, it should be rolled out to production and very carefully monitored. It's not unusual for fixes to break aspects of a feature or site in unexpected ways. Extensive monitoring should be performed in real time once the fix is rolled out, likely for a few hours.
  14. Create an incident ticket: The priority up to this point has been understanding the issue, getting the right people involved, and stopping the bleeding. It's important to ensure an incident response ticket is filed to collect all related code changes, Jira tickets, incident documents, etc.
  15. Update incident working document: The document should clearly communicate the main issue, root cause, impacts, fix information, reference material, and timelines. This may be critical for prosecuting an attacker, assisting with law enforcement or regulatory investigations, and possible future legal proceedings against your company. The incident commander should have key engineering/operations stakeholders verify nothing major is miscommunicated or missing.
  16. When to loop in legal and public relations: If user/employee data is suspected to have been compromised, it's critical to loop in your legal team. Data laws vary from country to country and have nuances the security and engineering teams will be unfamiliar with. Similarly, if a public note/article needs to be published, it should be heavily reviewed by both legal and the public relations team. The choice of words used in public documents can have important legal meanings that engineering/operations/security won't be aware of. Lastly, as a result of involving legal, the materials and channels involved with the investigation may need to be labeled as Attorney-Client Privileged. The legal team can advise the best course of action here.
  17. When to loop in leadership: Typically Sev1 and Sev2 incidents have some sort of executive communication. Sev1 incidents typically have more frequent updates (e.g. every few hours), whereas Sev2 incidents may only have daily updates until resolved. When important decisions need to be made (e.g. disabling core functionality temporarily), an executive signoff may be required. By default, it's recommended you use the non-security incident response process to determine the right level of executive communications.
  18. Closing the incident: Once security and operations/engineering parties are confident the fix has been deployed and the issue is resolved, the incident commander can make the determination to close the incident.
  19. Filing follow-up action items: There may be immediate issues requiring follow-up that were identified as part of the incident. These should be filed in your bug tracker/Jira to the right owners, linked from the main incident ticket, and may be contained within the incident working document for tracking.
  20. Scheduling a postmortem: The incident commander should schedule a postmortem meeting with all involved stakeholders as soon as possible to ensure information remains fresh. The goal of the postmortem is to determine how well the incident was handled, what can be done to prevent the issue from recurring or being introduced elsewhere, and any other action items stakeholders believe should occur. A postmortem template can be found here.
  21. Performing the postmortem: A typical postmortem is held at a time friendly to all involved stakeholders, and often lasts 60-90 minutes. When possible, send out the postmortem document and ask stakeholders to start filling it out in advance, to save expensive human time during the meeting.
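
A few of the steps above lend themselves to light automation. The following is a minimal Python sketch, not part of the original runbook, covering three of them: building a channel name that follows the #incident-private-subject convention (step 2), checking that the working document covers the items listed in step 15, and computing when the next executive update is due (step 17). All function names and field names are hypothetical. The 4-hour Sev1 interval is an assumption standing in for "every few hours"; adjust it to your organization's incident level matrix.

```python
import re
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# Assumed cadence: the runbook says only "every few hours" for Sev1 and
# "daily" for Sev2, so the 4-hour value is illustrative.
UPDATE_INTERVALS = {
    "sev1": timedelta(hours=4),
    "sev2": timedelta(days=1),
}

# Items step 15 says the working document should communicate.
# The key names themselves are hypothetical.
REQUIRED_SECTIONS = (
    "main_issue",
    "root_cause",
    "impacts",
    "fix_information",
    "reference_material",
    "timeline",
)


def private_channel_name(subject: str) -> str:
    """Build a channel name in the #incident-private-subject style."""
    # Lowercase the subject and collapse anything that is not a letter
    # or digit into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", subject.lower()).strip("-")
    if not slug:
        raise ValueError("subject must contain at least one letter or digit")
    return f"#incident-private-{slug}"


def missing_sections(document: Dict[str, str]) -> List[str]:
    """Return required sections that are absent or blank in the document."""
    return [
        name
        for name in REQUIRED_SECTIONS
        if not (document.get(name) or "").strip()
    ]


def next_executive_update(
    severity: str, last_update: datetime
) -> Optional[datetime]:
    """Return when the next executive update is due.

    Returns None for severities with no cadence defined above.
    """
    interval = UPDATE_INTERVALS.get(severity.lower())
    if interval is None:
        return None
    return last_update + interval
```

For example, private_channel_name("Leaked API Key") returns "#incident-private-leaked-api-key", and missing_sections can be run by the incident commander before asking engineering/operations stakeholders to review the working document.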

Runbook version 1.5 copied from Sectemplates.com 2025