# Multi-Cluster Healing Coordination

## Overview
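This notebook walks through multi-cluster self-healing coordination: monitoring health across multiple OpenShift clusters, choosing a failover target, and coordinating remediation actions across clusters. Health checks and action results are simulated with random values; swap in real cluster queries before relying on the output.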

## Prerequisites
- Completed: All Phase 1-7 notebooks
- Multiple OpenShift clusters available
- Hub cluster with federation capability
- Network connectivity between clusters

## Learning Objectives
- Coordinate healing across clusters
- Implement cluster failover
- Manage distributed state
- Handle cross-cluster communication
- Track multi-cluster health

## Key Concepts
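- **Cluster Federation**: Manage multiple clusters from a central hub cluster
- **Distributed Coordination**: Apply one healing action across several clusters and collect per-cluster results
- **Failover**: Move workloads to the healthiest available cluster when a cluster's health score falls below the threshold
- **State Synchronization**: Keep clusters in sync
- **Multi-Cluster Metrics**: Aggregate metrics across clusters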
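
State synchronization is listed above but not exercised later in the notebook. As a minimal sketch of one possible approach, assume each cluster reports its state as a mapping from key to a `(version, value)` pair and the hub keeps the highest version per key. The state layout, the sample keys, and the name `merge_cluster_state` are hypothetical and not part of the platform.

```python
from typing import Any, Dict, Tuple

# Hypothetical state layout: key -> (version, value). Not part of the notebook.
ClusterState = Dict[str, Tuple[int, Any]]

def merge_cluster_state(states: Dict[str, ClusterState]) -> ClusterState:
    """Merge per-cluster state, keeping the highest version seen for each key."""
    merged: ClusterState = {}
    # Iterate clusters in sorted order so ties resolve deterministically:
    # on equal versions, the alphabetically first cluster wins.
    for cluster_name in sorted(states):
        for key, (version, value) in states[cluster_name].items():
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
    return merged

# Illustrative data only.
example_states = {
    'hub-cluster': {'replicas/web': (3, 4), 'paused': (1, False)},
    'spoke-cluster-1': {'replicas/web': (5, 6)},
    'spoke-cluster-2': {'paused': (2, True)},
}
merged_state = merge_cluster_state(example_states)
print(merged_state)
```

Highest-version-wins is the simplest conflict rule; it assumes versions are comparable across clusters, which a real deployment would have to guarantee separately.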

## Setup Section

In [None]:
import sys
import os
import json
import logging
from pathlib import Path
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
from typing import Dict, List, Any

# Setup path for utils module - works from any directory
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
        # dir() inside a function lists only local names, so check module globals instead
        Path(globals()['__file__']).parent.parent / 'utils' if '__file__' in globals() else None,
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
        Path('/opt/app-root/src/openshift-aiops-platform/notebooks/utils'),
    ]
    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)
    current = Path.cwd()
    for _ in range(5):
        utils_path = current / 'notebooks' / 'utils'
        if utils_path.exists():
            return str(utils_path)
        current = current.parent
    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")
else:
    print("⚠️ Utils path not found - will use fallback implementations")

# Try to import common functions, with fallback
try:
    from common_functions import setup_environment
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Common functions not available: {e}")
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/models', exist_ok=True)
        return {'data_dir': '/opt/app-root/src/data', 'models_dir': '/opt/app-root/src/models'}

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Setup environment
env_info = setup_environment()
logger.info(f"Environment ready: {env_info}")

# Define paths
DATA_DIR = Path('/opt/app-root/src/data')
PROCESSED_DIR = DATA_DIR / 'processed'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# Configuration
NAMESPACE = 'self-healing-platform'
CLUSTERS = ['hub-cluster', 'spoke-cluster-1', 'spoke-cluster-2']
FAILOVER_THRESHOLD = 0.5  # clusters with a health score below 0.5 are treated as unhealthy

logger.info("Multi-cluster coordination initialized")
logger.info(f"Clusters: {CLUSTERS}")

## Implementation Section

### 1. Cluster Health Monitoring

In [None]:
def monitor_cluster_health(cluster_name: str) -> Dict[str, Any]:
    """
    Monitor health of a single cluster.
    
    Args:
        cluster_name: Name of the cluster
    
    Returns:
        Cluster health status
    """
    try:
        health_status = {
            'cluster_name': cluster_name,
            'timestamp': datetime.now().isoformat(),
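            # Simulated values; bool()/int() convert NumPy scalars to native types for JSON output
            'api_server_healthy': bool(np.random.choice([True, False], p=[0.95, 0.05])),
            'etcd_healthy': bool(np.random.choice([True, False], p=[0.95, 0.05])),
            'nodes_ready': int(np.random.randint(3, 8)),  # upper bound is exclusive, so 3..7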
            'nodes_total': 7,
            'pods_running': np.random.randint(50, 100),
            'pods_failed': np.random.randint(0, 5),
            'cpu_usage_percent': np.random.uniform(20, 80),
            'memory_usage_percent': np.random.uniform(30, 85)
        }
        
        # Calculate overall health
        health_score = (
            (health_status['api_server_healthy'] * 0.3) +
            (health_status['etcd_healthy'] * 0.3) +
            ((health_status['nodes_ready'] / health_status['nodes_total']) * 0.2) +
            ((1 - health_status['cpu_usage_percent'] / 100) * 0.1) +
            ((1 - health_status['memory_usage_percent'] / 100) * 0.1)
        )
        
        health_status['health_score'] = float(health_score)
        health_status['healthy'] = bool(health_score >= FAILOVER_THRESHOLD)
        
        logger.info(f"Cluster {cluster_name} health: {health_score:.2%}")
        return health_status
    except Exception as e:
        logger.error(f"Cluster health monitoring error: {e}")
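        # Keep the keys downstream cells read, so an error counts as an unhealthy cluster
        return {'cluster_name': cluster_name, 'error': str(e), 'health_score': 0.0, 'healthy': False}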

# Monitor all clusters
cluster_health = {cluster: monitor_cluster_health(cluster) for cluster in CLUSTERS}
print(json.dumps(cluster_health, indent=2, default=str))
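
To make the weighting concrete, the sketch below recomputes the score for two fixed inputs instead of random ones. It uses the same weights as `monitor_cluster_health` (0.3, 0.3, 0.2, 0.1, 0.1); the helper name `weighted_health_score`, the constant `THRESHOLD`, and the sample inputs are illustrative assumptions.

```python
def weighted_health_score(api_ok: bool, etcd_ok: bool, nodes_ready: int,
                          nodes_total: int, cpu_pct: float, mem_pct: float) -> float:
    """Weighted score using the same weights as monitor_cluster_health."""
    return (
        0.3 * api_ok
        + 0.3 * etcd_ok
        + 0.2 * (nodes_ready / nodes_total)
        + 0.1 * (1 - cpu_pct / 100)
        + 0.1 * (1 - mem_pct / 100)
    )

THRESHOLD = 0.5  # mirrors FAILOVER_THRESHOLD

# Illustrative inputs: 5 of 7 nodes ready, 50% CPU, 60% memory.
all_up = weighted_health_score(True, True, 5, 7, 50.0, 60.0)
etcd_down = weighted_health_score(True, False, 5, 7, 50.0, 60.0)

print(f"all components up: {all_up:.3f}, healthy={all_up >= THRESHOLD}")
print(f"etcd down:         {etcd_down:.3f}, healthy={etcd_down >= THRESHOLD}")
```

With these inputs the scores come out to roughly 0.833 and 0.533, so a cluster with etcd down still clears the 50% threshold. If that is not the intended behavior, one option is to treat control-plane failures as a hard gate rather than a weighted term.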

### 2. Implement Cluster Failover

In [None]:
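from typing import Optional  # the function returns None when no cluster is healthy

def determine_failover_target(cluster_health: Dict[str, Dict]) -> Optional[str]:
    """
    Determine failover target cluster.
    
    Args:
        cluster_health: Health status of all clusters
    
    Returns:
        Name of the healthiest cluster, or None if no cluster is healthy
    """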
    try:
        # Find healthiest cluster
        healthy_clusters = [
            (name, status['health_score'])
            for name, status in cluster_health.items()
            if status.get('healthy', False)
        ]
        
        if not healthy_clusters:
            logger.warning("No healthy clusters available")
            return None
        
        # Pick the cluster with the highest health score
        target = max(healthy_clusters, key=lambda x: x[1])[0]
        logger.info(f"Failover target: {target}")
        return target
    except Exception as e:
        logger.error(f"Failover determination error: {e}")
        return None

def execute_failover(source_cluster: str, target_cluster: str) -> Dict[str, Any]:
    """
    Execute failover from source to target cluster.
    
    Args:
        source_cluster: Source cluster name
        target_cluster: Target cluster name
    
    Returns:
        Failover result
    """
    try:
        failover_result = {
            'timestamp': datetime.now().isoformat(),
            'source_cluster': source_cluster,
            'target_cluster': target_cluster,
            'status': 'success',
            'workloads_migrated': np.random.randint(10, 50),
            'migration_time_seconds': np.random.randint(30, 300),
            'data_synced': True
        }
        
        logger.info(f"Failover executed: {source_cluster} -> {target_cluster}")
        return failover_result
    except Exception as e:
        logger.error(f"Failover execution error: {e}")
        return {'error': str(e)}

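# Test failover: exclude the source cluster so it cannot be chosen as its own target
source = 'hub-cluster'
candidates = {name: status for name, status in cluster_health.items() if name != source}
target = determine_failover_target(candidates)
if target:
    failover_result = execute_failover(source, target)
    print(json.dumps(failover_result, indent=2, default=str))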

### 3. Distributed Healing Coordination

In [None]:
def coordinate_healing_action(action: str, clusters: List[str]) -> Dict[str, Any]:
    """
    Coordinate healing action across multiple clusters.
    
    Args:
        action: Healing action to execute
        clusters: List of clusters to execute on
    
    Returns:
        Coordination result
    """
    try:
        coordination = {
            'timestamp': datetime.now().isoformat(),
            'action': action,
            'clusters': clusters,
            'execution_results': {}
        }
        
        for cluster in clusters:
            result = {
                'cluster': cluster,
                'status': np.random.choice(['success', 'failed'], p=[0.9, 0.1]),
                'execution_time_ms': np.random.randint(100, 1000),
                'resources_affected': np.random.randint(1, 10)
            }
            coordination['execution_results'][cluster] = result
        
        # Calculate overall success (0.0 when no clusters were given)
        successful = sum(1 for r in coordination['execution_results'].values() if r['status'] == 'success')
        coordination['overall_success_rate'] = successful / len(clusters) if clusters else 0.0
        
        logger.info(f"Healing coordination: {action} on {len(clusters)} clusters")
        return coordination
    except Exception as e:
        logger.error(f"Healing coordination error: {e}")
        return {'error': str(e)}

# Test coordination
coordination = coordinate_healing_action('scale_deployment', CLUSTERS)
print(json.dumps(coordination, indent=2, default=str))
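
A coordination result is most useful when the failed clusters can be picked out for a retry or an alert. The sketch below assumes the result has the shape returned by `coordinate_healing_action`; the helper name `failed_clusters` and the sample data are hypothetical.

```python
from typing import Any, Dict, List

def failed_clusters(coordination: Dict[str, Any]) -> List[str]:
    """Return the clusters whose execution result is not 'success'."""
    # An error result has no 'execution_results' key, so default to an empty mapping.
    results = coordination.get('execution_results', {})
    return [name for name, r in results.items() if r.get('status') != 'success']

# Illustrative result in the same shape as the coordination dictionary.
sample = {
    'action': 'scale_deployment',
    'execution_results': {
        'hub-cluster': {'status': 'success'},
        'spoke-cluster-1': {'status': 'failed'},
        'spoke-cluster-2': {'status': 'success'},
    },
}
to_retry = failed_clusters(sample)
print(to_retry)
```

Passing the returned list back into `coordinate_healing_action` would retry only the clusters that failed.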

### 4. Track Multi-Cluster Metrics

In [None]:
# Create multi-cluster tracking dataframe
multi_cluster_tracking = pd.DataFrame([
    {
        'timestamp': datetime.now() - timedelta(hours=i),
        'cluster': np.random.choice(CLUSTERS),
        'health_score': np.random.uniform(0.7, 0.99),
        'failover_triggered': np.random.choice([True, False], p=[0.05, 0.95]),
        'healing_actions': np.random.randint(0, 10),
        'success_rate': np.random.uniform(0.85, 0.99),
        'workloads_running': np.random.randint(30, 100)
    }
    for i in range(72)  # 72 hours of data
])

# Save tracking data
tracking_file = PROCESSED_DIR / 'multi_cluster_tracking.parquet'
multi_cluster_tracking.to_parquet(tracking_file)

logger.info(f"Saved multi-cluster tracking data to {tracking_file}")
print(multi_cluster_tracking.to_string())
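
The tracking frame holds one row per hour, so a per-cluster rollup is usually the next step. The sketch below uses only the standard library and assumes records shaped like the rows above (for a DataFrame, `to_dict('records')` produces this shape). The function name `summarize_by_cluster` and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List

def summarize_by_cluster(records: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Group tracking records by cluster and aggregate the key metrics."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for rec in records:
        grouped[rec['cluster']].append(rec)
    return {
        cluster: {
            'mean_health_score': mean(r['health_score'] for r in rows),
            'failovers': sum(1 for r in rows if r['failover_triggered']),
            'healing_actions': sum(r['healing_actions'] for r in rows),
        }
        for cluster, rows in grouped.items()
    }

# Illustrative records only.
sample_records = [
    {'cluster': 'hub-cluster', 'health_score': 0.9, 'failover_triggered': False, 'healing_actions': 2},
    {'cluster': 'hub-cluster', 'health_score': 0.8, 'failover_triggered': True, 'healing_actions': 3},
    {'cluster': 'spoke-cluster-1', 'health_score': 0.95, 'failover_triggered': False, 'healing_actions': 0},
]
summary = summarize_by_cluster(sample_records)
print(summary)
```

The same rollup can be written with a pandas `groupby('cluster')` on the tracking frame; the plain-Python version is shown to keep the example free of dependencies.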

## Validation Section

In [None]:
# Verify outputs
assert tracking_file.exists(), "Multi-cluster tracking file not created"
assert len(cluster_health) == len(CLUSTERS), "Not all clusters monitored"

avg_health = np.mean([h['health_score'] for h in cluster_health.values()])
failover_rate = multi_cluster_tracking['failover_triggered'].sum() / len(multi_cluster_tracking)

logger.info("✅ All validations passed")
print("\nMulti-Cluster Healing Coordination Summary:")
print(f"  Clusters Monitored: {len(CLUSTERS)}")
print(f"  Average Health Score: {avg_health:.2%}")
print(f"  Failover Trigger Rate: {failover_rate:.1%}")
print(f"  Tracking Records: {len(multi_cluster_tracking)}")
print(f"  Failover Threshold: {FAILOVER_THRESHOLD:.0%}")

## Integration Section

This notebook integrates with:
- **Input**: Cluster health metrics from all clusters
- **Output**: Failover decisions and coordination results
- **Monitoring**: Multi-cluster health and failover events
- **Next**: Predictive scaling and capacity planning

## Next Steps

1. Deploy multi-cluster coordination
2. Proceed to `predictive-scaling-capacity-planning.ipynb`
3. Implement predictive scaling
4. Plan capacity across clusters
5. Complete advanced scenarios

## References

- ADR-003: Self-Healing Platform Architecture
- ADR-012: Notebook Architecture for End-to-End Workflows
- [OpenShift Federation](https://docs.openshift.com/)
- [Multi-Cluster Management](https://kubernetes.io/docs/concepts/cluster-administration/manage-deployment/)