# Day 29: AI-Generated “What If” Reliability Scenarios

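Generate hypothetical reliability scenarios from past incident and topology data, then ask an LLM (Llama 3 served locally) to write up each one with a narrative, predicted impact, mitigation strategies, and a team response checklist.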
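
The notebook reads two CSV files that are not shown here. Judging only from the columns the code accesses, `incidents.csv` needs `service`, `type`, `impact`, and `mitigation`, and `topology.csv` needs `service`, `dependencies`, and `criticality`. The sketch below checks a file's header for those columns before the notebook runs. The `missing_columns` helper and the sample header are assumptions for illustration, not part of the original notebook or its data.

```python
import csv
import io

# Columns the notebook reads from each file (inferred from the code, not from real data)
INCIDENT_COLUMNS = ["service", "type", "impact", "mitigation"]
TOPOLOGY_COLUMNS = ["service", "dependencies", "criticality"]

def missing_columns(csv_file, required):
    """Return the required column names absent from the CSV header, in order."""
    header = next(csv.reader(csv_file), [])  # an empty file yields an empty header
    return [col for col in required if col not in header]

# Placeholder header, invented only to show the expected shape
sample = io.StringIO("service,dependencies\n")
print(missing_columns(sample, TOPOLOGY_COLUMNS))  # ['criticality']
```

For the real files, the same helper could be called as `missing_columns(open('incidents.csv', newline=''), INCIDENT_COLUMNS)`.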

# 📦 Import libraries
import pandas as pd
import numpy as np
import requests
from IPython.display import display, Markdown

# 📁 Load incident and topology data
incidents = pd.read_csv('incidents.csv')
topology = pd.read_csv('topology.csv')

# 🤖 Generate hypothetical scenarios using probabilistic sampling
np.random.seed(42)
scenario_templates = [
    "What if the {service} fails during a {event}?",
    "What if the {service} hits max connections during {event}?",
    "What if a {type} occurs in {service} during {event}?"
]
events = ['product launch', 'Black Friday', 'major update', 'holiday traffic']

scenarios = []
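for _ in range(4):
    # Pick one past incident at random and look up its service in the topology
    row = incidents.sample(1).iloc[0]
    matches = topology[topology['service'] == row['service']]
    if matches.empty:
        continue  # skip incidents whose service has no topology entry
    topo = matches.iloc[0]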
    template = np.random.choice(scenario_templates)
    event = np.random.choice(events)
    scenario_text = template.format(service=row['service'], type=row['type'], event=event)
    scenarios.append({
        'scenario': scenario_text,
        'service': row['service'],
        'type': row['type'],
        'impact': row['impact'],
        'mitigation': row['mitigation'],
        'dependencies': topo['dependencies'],
        'criticality': topo['criticality']
    })

# 🧠 Generate LLM narrative, impact, mitigation, and checklist
def llm_scenario_doc(scenario):
    prompt = (
        f"Scenario: {scenario['scenario']}\n"
        f"Service: {scenario['service']}\n"
        f"Type: {scenario['type']}\n"
        f"Impact: {scenario['impact']}\n"
        f"Mitigation: {scenario['mitigation']}\n"
        f"Dependencies: {scenario['dependencies']}\n"
        f"Criticality: {scenario['criticality']}\n\n"
        f"Generate a stakeholder-friendly scenario doc that includes:\n"
        f"- Narrative (What if... description)\n"
        f"- Predicted impact\n"
        f"- Mitigation strategies\n"
        f"- Team response checklist\n"
        f"Format as Markdown. Avoid generic praise or repetition."
    )
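    # Assumes a local LLM server is already listening on this endpoint.
    # The 120-second timeout is an arbitrary choice so a stalled server cannot hang the notebook.
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=120
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return response.json().get("response", "No response from Llama 3.")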

# 📊 Display scenario docs
for scenario in scenarios:
    doc = llm_scenario_doc(scenario)
    display(Markdown(doc))

**Scenario Document**

### What if...

During the launch of our new product, the database connection limit is exceeded, causing a partial outage.

### Predicted Impact

* API requests will be delayed or rejected, resulting in a decrease in user engagement and potential revenue loss.
* Critical business processes may be affected, including order processing, inventory management, and customer support.
* The launch's overall success and customer satisfaction may be compromised.

### Mitigation Strategies

1. **Increase max connections**: Temporarily increase the maximum allowed database connections to alleviate pressure on the system.
2. **Optimize queries**: Review and optimize query performance to reduce the load on the database.
3. **Cache frequently accessed data**: Implement caching mechanisms for frequently accessed data to minimize the impact of increased traffic.

### Team Response Checklist

**Database Team:**

1. Monitor database performance and adjust connection limits as needed.
2. Optimize query performance to reduce load.
3. Collaborate with API team to implement rate limiting or caching.

**API Team:**

1. Implement rate limiting or caching to manage incoming requests.
2. Monitor API performance and adjust configuration as necessary.
3. Coordinate with Database Team to ensure seamless integration.

**Product Launch Team:**

1. Communicate the outage to stakeholders and provide regular updates on mitigation progress.
2. Collaborate with the Database and API Teams to implement temporary workarounds, if needed.
3. Review the impact of the outage on launch success metrics and plan for future mitigations.

### Criticality

High

This scenario highlights the importance of proactive monitoring and mitigation strategies to ensure the success of critical business events like product launches. By understanding the potential risks and having a plan in place, we can minimize the impact of unexpected outages and keep our customers satisfied.
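
The generated checklist above mentions rate limiting without saying how it would work. One common approach is a token bucket; the sketch below is an illustration only, not something the generated document or the notebook prescribes. The `TokenBucket` class, its `rate` and `capacity` values, and the injectable `clock` are hypothetical.

```python
import time

class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True and consume a token if one is available, else False."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A fake clock makes the behaviour deterministic for the demo
fake_time = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: fake_time[0])
print([bucket.allow() for _ in range(3)])  # [True, True, False]
fake_time[0] = 1.0                          # one second later, one token has refilled
print(bucket.allow())                       # True
```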

**Scenario Document**
======================================================

### Holiday Traffic DDoS Attack on CDN

What if a Distributed Denial of Service (DDoS) attack occurs on our Content Delivery Network (CDN) during the holiday season, causing a major outage and significant disruption to our customers?

**Predicted Impact**

* Major outage affecting global customer traffic
* Frontend services and API integrations severely impacted
* Potential loss of revenue due to downtime
* Customer trust and loyalty at risk

### Mitigation Strategies

1. **Enable Web Application Firewall (WAF)**
	* Configure WAF rules to detect and block suspicious traffic
	* Collaborate with security team to optimize WAF settings
2. **Traffic Shaping and Policing**
	* Implement traffic shaping and policing techniques to manage sudden spikes in traffic
	* Adjust bandwidth allocation and QoS settings as needed

### Team Response Checklist

1. **CDN Operations Team**
	* Monitor CDN performance and alert management team of any issues
	* Collaborate with security team to implement WAF configuration changes
2. **Security Team**
	* Analyze attack patterns and collaborate with CDN operations team to optimize WAF settings
	* Provide guidance on traffic shaping and policing techniques
3. **DevOps Team**
	* Assist in implementing temporary workarounds or patches as needed
	* Collaborate with frontend and API teams to identify potential vulnerabilities
4. **Frontend and API Teams**
	* Work closely with CDN operations team to identify and resolve issues affecting their services
	* Review application logs for any unusual patterns or anomalies

**Criticality**

This scenario has a high criticality level due to the potential for significant downtime, revenue loss, and customer dissatisfaction during peak holiday traffic. Prompt attention and swift mitigation are crucial to minimize the impact of this event.

Let's work together to prepare for this scenario and ensure our customers receive the best possible experience!

**Black Friday CDN Failure Scenario**
=====================================

**Narrative**
If the CDN fails during Black Friday, our e-commerce platform will experience a major outage, resulting in significant revenue loss and customer frustration.

**Predicted Impact**

* Frontend: Users attempting to access the website will encounter errors or slow loading times.
* API: Backend services will be impacted, leading to issues with order processing, inventory management, and other critical functions.
* Customer Experience: Shoppers may abandon their purchases or seek alternatives, resulting in revenue loss and long-term brand damage.

**Mitigation Strategies**

1. **Reroute Traffic**: Immediate rerouting of traffic through alternative CDN routes or caching mechanisms to minimize the impact on users.
2. **Content Replication**: Ensure that critical content (e.g., product images) is replicated across multiple CDNs or stored locally for faster access.
3. **Communication**: Timely notification of customers, partners, and stakeholders about the issue, along with regular updates on resolution efforts.

**Team Response Checklist**

1. **Initial Response**
	* Identify the root cause of the failure
	* Notify relevant teams (Frontend, API) of the issue and its impact
2. **Mitigation Execution**
	* Reroute traffic through alternative CDN routes or caching mechanisms
	* Ensure content replication is in place for critical assets
3. **Monitoring and Updates**
	* Monitor the situation and provide regular updates on resolution efforts to customers, partners, and stakeholders
4. **Root Cause Analysis**
	* Conduct a thorough analysis of the failure to prevent similar incidents in the future

**Criticality Level**: High

**Scenario: CDN Failure During Product Launch**
=====================================================

### Narrative

What if our Content Delivery Network (CDN) fails during a product launch, causing a major outage that affects both frontend and API services?

### Predicted Impact

* All product-related traffic will be impacted, resulting in:
	+ Disrupted user experience
	+ Increased bounce rates
	+ Delayed feedback collection
	+ Potential revenue loss
* The severity of the impact will depend on the duration and scope of the outage.

### Mitigation Strategies

To minimize the effects of a CDN failure during a product launch:

1. **Reroute traffic**: Implement an emergency routing plan to redirect traffic through alternative CDNs or caching layers.
2. **Communicate proactively**: Keep stakeholders informed about the issue, its impact, and expected resolution timeframes.

### Team Response Checklist

To ensure a swift and effective response:

1. **CDN Operations**:
	+ Monitor CDN performance and alert the team to any issues
	+ Collaborate with CDN providers to troubleshoot and resolve the problem
2. **Frontend Development**:
	+ Coordinate with the API team to identify workarounds or alternative caching solutions
	+ Develop contingency plans for frontend services
3. **API Development**:
	+ Work with the frontend team to develop temporary API fallbacks
	+ Investigate API-specific issues and collaborate with CDN ops to resolve them
4. **Product Launch Team**:
	+ Keep stakeholders informed about the situation and any changes to the launch timeline
	+ Coordinate with the marketing team to communicate any adjustments to promotional plans

By following these strategies and responding promptly, we can minimize the impact of a CDN failure during a product launch and ensure a successful outcome for our users.
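
The "reroute traffic" strategy in the generated documents above comes down to choosing a fallback when the primary CDN is unhealthy. A minimal sketch of that selection step follows, assuming an ordered list of candidate hosts and a caller-supplied health check. The `pick_cdn` function and the host names are hypothetical, and a real check would probe each CDN rather than consult a hard-coded set.

```python
def pick_cdn(hosts, is_healthy):
    """Return the first host that passes the health check, or None if all fail."""
    for host in hosts:
        if is_healthy(host):
            return host
    return None

# Hypothetical host names, listed in order of preference
hosts = ["cdn-primary.example.com", "cdn-backup.example.com"]
down = {"cdn-primary.example.com"}  # pretend the primary is failing

print(pick_cdn(hosts, lambda h: h not in down))  # cdn-backup.example.com
```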