# 🧠 cloudChronicles Lab #001: Disaster Recovery Detective

**Lab Type:** Idea 
**Estimated Time:** 30–45 mins 
**Skill Level:** Beginner

```python
# Let's begin by printing your name to personalize the notebook
your_name = "Cloud Architect"
print(f"Welcome to the lab, {your_name}!")
```

## 🔍 STAR Method Lab Prompt

**Situation:** 
A regional outage has occurred in Google Cloud's `us-central1` region. Our critical multi-tier web application, including its primary Cloud SQL for PostgreSQL database and core application data in Cloud Storage, is currently impacted. We need to restore service with minimal downtime and data loss.

**Task:** 
As the Cloud Architect, create a comprehensive STAR-based disaster recovery plan using Google Cloud tools. This plan must detail the failover and recovery process for our application, specifically addressing the database and static content, leveraging Cloud SQL read replicas, multi-region Cloud Storage, and Pub/Sub alerts.

**Action:** 
1.  **Detection and Alerting:** Explain how the outage in `us-central1` would be detected and how automated alerts via Cloud Monitoring and Pub/Sub would notify the operations team.
2.  **Database Failover (Cloud SQL):** Detail the steps to fail over the Cloud SQL for PostgreSQL database from `us-central1` to its pre-configured cross-region read replica in `us-east1`. Include considerations for RTO and RPO.
3.  **Static Content and Application Data (Cloud Storage):** Describe how multi-region Cloud Storage ensures data availability and how the application would access data from the resilient storage.
4.  **Application Rerouting:** Outline how traffic would be rerouted from the affected `us-central1` deployment to a healthy application deployment in `us-east1`. This might involve global load balancing.
5.  **Post-Recovery Steps:** Briefly mention any necessary steps after the initial failover, such as data synchronization once `us-central1` recovers, and restoring the original architecture.

**Expected Result:** 
A detailed STAR-based disaster recovery plan outlining the detection, failover, and recovery procedures for our application and its data during a `us-central1` regional outage, leveraging specified Google Cloud services. The plan should clearly articulate the role of each service and the sequence of actions.

## ✍️ Your Assignment

_Use this section to complete your deliverable:_

### **STAR-Based Disaster Recovery Plan: `us-central1` Regional Outage**

**Situation:**
A regional outage has occurred in Google Cloud's `us-central1` region, rendering our primary application deployments, including the Cloud SQL for PostgreSQL database and associated Cloud Storage buckets, inaccessible. This is a critical incident requiring immediate action to maintain business continuity for our multi-tier web application.

**Task:**
Execute the predefined disaster recovery plan so that our web application, with its database and critical static content, fails over to a secondary region (`us-east1`) while keeping the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) as low as possible.

**Action:**

#### 1. Detection and Alerting:
- **Cloud Monitoring:** Pre-configured Cloud Monitoring alerts will detect the unavailability of key services (e.g., GKE cluster health checks, Cloud SQL instance status, network latency to `us-central1`).
  - **Health Checks:** HTTP(S) Load Balancer health checks on backend services will mark `us-central1` instances as unhealthy once the configured number of consecutive probes has failed.
  - **Custom Metrics:** Cloud Monitoring will track application-specific metrics (e.g., error rates, request latency) which will spike during an outage.
- **Pub/Sub Notifications:** When an alert threshold is breached, Cloud Monitoring will publish a message to a dedicated Pub/Sub topic (`dr-alerts`) through a Pub/Sub notification channel.
  - **Subscriber Action:** Subscribers (e.g., Cloud Functions, internal alerting systems like PagerDuty/Slack integration) will be triggered to notify the on-call operations team of the regional outage.
  - **Automated Runbook Trigger:** A Cloud Function subscribed to `dr-alerts` could trigger an automated or semi-automated failover runbook (e.g., via Cloud Build or custom scripts).
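
The subscriber logic described above could be sketched as follows. This is a minimal illustration, not the team's actual function: it assumes the Pub/Sub message carries a base64-encoded JSON body with an `incident` object containing `state`, `policy_name`, and `summary` fields, and the policy names in `DR_POLICIES` are hypothetical.

```python
import base64
import json


def parse_alert(event: dict) -> dict:
    """Decode a Pub/Sub-delivered message into the alert fields we care about."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    incident = payload.get("incident", {})  # assumed payload layout
    return {
        "state": incident.get("state", "unknown"),
        "policy": incident.get("policy_name", "unknown"),
        "summary": incident.get("summary", ""),
    }


def should_page(alert: dict, dr_policies: set) -> bool:
    """Page on-call only for open incidents raised by a DR-related policy."""
    return alert["state"] == "open" and alert["policy"] in dr_policies


# Hypothetical alerting policy names, for illustration only.
DR_POLICIES = {"us-central1-lb-unhealthy", "cloudsql-primary-down"}

# A hand-built sample message standing in for a real notification.
sample_event = {
    "data": base64.b64encode(
        json.dumps(
            {
                "incident": {
                    "state": "open",
                    "policy_name": "cloudsql-primary-down",
                    "summary": "Primary instance unreachable",
                }
            }
        ).encode("utf-8")
    )
}

alert = parse_alert(sample_event)
print(should_page(alert, DR_POLICIES))
```

Keeping the parsing and the paging decision in separate functions makes it straightforward to unit-test the decision without a live Pub/Sub subscription.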

#### 2. Database Failover (Cloud SQL for PostgreSQL):
- **Pre-configured Cross-Region Read Replica:** A Cloud SQL for PostgreSQL read replica is already provisioned and continuously replicating data from the primary instance in `us-central1` to a secondary region, `us-east1`. Because this replication is asynchronous, the achievable RPO equals the replication lag at the moment of the outage, which is usually small.
- **Manual Failover Process:**
  1.  **Verify Primary Unavailability:** Confirm the primary Cloud SQL instance in `us-central1` is indeed unreachable via Cloud Console and `gcloud` commands (`gcloud sql instances describe [INSTANCE_NAME] --project=[YOUR_PROJECT_ID]`).
  2.  **Promote Read Replica:** Using the Google Cloud Console or `gcloud` CLI, promote the read replica in `us-east1` to a standalone primary instance:
      ```bash
      gcloud sql instances promote-replica [REPLICA_INSTANCE_NAME] --project=[YOUR_PROJECT_ID]
      ```
      This action transforms the read replica into a fully writable primary database.
  3.  **Update Application Configuration:** Once the `us-east1` replica is promoted, the application deployments (e.g., GKE deployments) in `us-east1` must be reconfigured to point to the new primary database's connection string/IP address. This can be done by updating Kubernetes Secrets, ConfigMaps, or environment variables in the GKE deployment.
      - **Consideration:** Ensure the promoted instance is reachable from the `us-east1` GKE cluster (for example, over private IP or through authorized networks) before traffic shifts to that region.
- **RTO/RPO Considerations:**
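  -   **RPO (Recovery Point Objective):** Bounded by replication lag. Cross-region replication is asynchronous, so transactions committed on the primary but not yet applied on the replica when the outage begins are lost. Lag is typically very low (seconds) but can grow under heavy write load, so it should be monitored.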
  -   **RTO (Recovery Time Objective):** Dependent on the time taken to detect the outage, promote the replica, and reconfigure the application. With a well-practiced runbook, this can be minutes (e.g., 5-15 minutes).
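
The RTO and RPO reasoning above can be checked with simple arithmetic. The sketch below assumes RTO is the sum of detection, promotion, and reconfiguration time, and that RPO equals the replication lag at failure. All numbers, and the `FailoverTimings` and `meets_objectives` names, are hypothetical placeholders rather than measured values.

```python
from dataclasses import dataclass


@dataclass
class FailoverTimings:
    """Durations of each failover phase, in seconds (illustrative values)."""
    detect_s: float
    promote_s: float
    reconfigure_s: float

    def rto_seconds(self) -> float:
        # Phases run one after another, so the RTO is their sum.
        return self.detect_s + self.promote_s + self.reconfigure_s


def meets_objectives(timings: FailoverTimings, replication_lag_s: float,
                     rto_target_s: float, rpo_target_s: float) -> bool:
    """Return True when both the RTO and RPO targets are satisfied."""
    return (timings.rto_seconds() <= rto_target_s
            and replication_lag_s <= rpo_target_s)


# Hypothetical drill results: 2 min to detect, 5 min to promote, 3 min to reconfigure.
drill = FailoverTimings(detect_s=120.0, promote_s=300.0, reconfigure_s=180.0)

print(drill.rto_seconds() / 60)  # total RTO in minutes
print(meets_objectives(drill, replication_lag_s=5.0,
                       rto_target_s=15 * 60, rpo_target_s=60.0))
```

Recording these timings during each DR drill gives a factual basis for the RTO claim instead of an estimate.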

#### 3. Static Content and Application Data (Cloud Storage):
- **Multi-Region Buckets:** All critical static content (images, videos, CSS/JS files) and application-generated data (user uploads, logs) are stored in **multi-region Cloud Storage buckets** (e.g., `US` multi-region).
  - **Automatic Redundancy:** Multi-region buckets store data redundantly across at least two geographically separate locations within the chosen multi-region. This means data remains available even if one entire region (like `us-central1`) experiences an outage. Replication is asynchronous, so objects written shortly before the outage may not yet be present in a second location.
  - **Application Access:** The application is configured to access these buckets using a consistent bucket name, which remains globally available. No changes are required for the application to read/write data from these buckets during a regional outage.
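
To illustrate why no application change is needed, the sketch below builds an object URL from the bucket name alone. The bucket name `example-static-assets` and the `object_url` helper are hypothetical; the point is that nothing in the address refers to a region, so deployments in `us-central1` and `us-east1` use identical configuration.

```python
from urllib.parse import quote


def object_url(bucket: str, object_name: str) -> str:
    """Build the HTTPS URL for an object; no region appears in the address."""
    # Percent-encode the object name but keep "/" so folder-like paths stay readable.
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name, safe='/')}"


# Hypothetical per-region app settings: both regions share the same bucket name.
APP_CONFIG = {
    "us-central1": {"static_bucket": "example-static-assets"},
    "us-east1": {"static_bucket": "example-static-assets"},
}

primary_url = object_url(APP_CONFIG["us-central1"]["static_bucket"], "css/site.css")
standby_url = object_url(APP_CONFIG["us-east1"]["static_bucket"], "css/site.css")

print(primary_url)
print(primary_url == standby_url)
```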

#### 4. Application Rerouting:
- **Global HTTP(S) Load Balancer:** Our web application uses a Global HTTP(S) Load Balancer with backend services configured for GKE clusters in both `us-central1` (primary) and `us-east1` (standby/secondary).
  - **Health Checks:** The load balancer continuously performs health checks on the backend services in both regions.
  - **Automatic Failover:** When the `us-central1` backend services fail their health checks due to the regional outage, the Global HTTP(S) Load Balancer will automatically direct all new incoming traffic to the healthy backend services in `us-east1`.
  - **DNS:** The application's public DNS record points to the IP address of the Global HTTP(S) Load Balancer, ensuring seamless traffic redirection without DNS propagation delays.
- **GKE Deployment in `us-east1`:** A scaled-down or identical GKE cluster and application deployment are maintained in `us-east1`, ready to take on full traffic load. Horizontal Pod Autoscaling (HPA) can add pods in `us-east1` as traffic shifts, and the cluster autoscaler can add nodes when the extra pods no longer fit on existing capacity.

#### 5. Post-Recovery Steps:
- **Monitor `us-central1` Recovery:** Continuously monitor the status of `us-central1` and Google Cloud's official incident reports.
- **Data Synchronization (if needed):** Once `us-central1` recovers, the original primary instance will be missing every write made to the new `us-east1` primary, and it must not receive application writes. Before considering a failback, bring `us-central1` back in sync, typically by creating a read replica of the `us-east1` primary in `us-central1` and letting it catch up.
- **Planned Failback (Optional):** Once `us-central1` is fully stable and data is synchronized, plan a controlled failback to `us-central1` during a maintenance window to restore the original architecture. This involves promoting the caught-up `us-central1` replica, updating application configurations, reversing traffic flow, and re-creating the cross-region replica in `us-east1`.
- **Post-Mortem Analysis:** Conduct a thorough post-mortem to analyze the incident, identify any gaps in the DR plan, and implement improvements for future resilience.
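
The failback preconditions above could be expressed as an explicit gate in a runbook script. The sketch below is one possible shape: the thresholds (24 hours of stability, 5 seconds of lag) and the `failback_blockers` function are hypothetical and should be replaced with values the team agrees on.

```python
def failback_blockers(region_stable_hours: float, replica_lag_s: float,
                      in_maintenance_window: bool,
                      min_stable_hours: float = 24.0,
                      max_lag_s: float = 5.0) -> list:
    """Return the reasons a failback should not start yet (empty list = go)."""
    blockers = []
    if region_stable_hours < min_stable_hours:
        blockers.append("us-central1 has not been stable long enough")
    if replica_lag_s > max_lag_s:
        blockers.append("us-central1 replica has not caught up")
    if not in_maintenance_window:
        blockers.append("outside the maintenance window")
    return blockers


# Hypothetical status snapshots.
too_early = failback_blockers(region_stable_hours=3, replica_lag_s=40,
                              in_maintenance_window=False)
ready = failback_blockers(region_stable_hours=48, replica_lag_s=1,
                          in_maintenance_window=True)

print(too_early)
print(ready)
```

Returning the list of reasons, rather than a bare boolean, gives the on-call engineer something concrete to record in the incident timeline.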
