# Rule-Based Remediation

## Overview
This notebook implements deterministic, rule-based remediation: each detected anomaly is mapped to a fixed set of healing actions, and the resulting incident is submitted to the coordination engine, which executes the remediation and tracks its outcome.

## Prerequisites
- Completed: All Phase 2 notebooks (anomaly detection)
- Coordination engine running
- Ensemble predictions available

## Learning Objectives
- Define remediation rules
- Map anomalies to actions
- Build `oc`/`kubectl` commands for remediation actions
- Validate remediation success
- Track remediation outcomes

## Key Concepts
- **Rule Engine**: Deterministic decision logic
- **Action Mapping**: Anomaly type → Remediation action
- **Kubectl Integration**: Execute Kubernetes operations
- **Validation**: Verify remediation effectiveness

## Setup Section

In [None]:
import sys
import os
import json
import logging
import requests
from pathlib import Path
from datetime import datetime
import subprocess

# Setup path for utils module - works from any directory
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
        # dir() inside a function lists locals only, so check globals() for __file__
        Path(__file__).parent.parent / 'utils' if '__file__' in globals() else None,
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
    ]
    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)
    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")

# Try to import common functions, with fallback
try:
    from common_functions import setup_environment
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Using fallback setup_environment ({e})")
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/models', exist_ok=True)
        return {'data_dir': '/opt/app-root/src/data', 'models_dir': '/opt/app-root/src/models'}

# Import coordination engine client
from coordination_engine_client import get_client

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Setup environment
env_info = setup_environment()
logger.info(f"Environment ready: {env_info}")

# Define paths
DATA_DIR = Path('/opt/app-root/src/data')
PROCESSED_DIR = DATA_DIR / 'processed'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
MODELS_DIR = Path('/opt/app-root/src/models')

logger.info("Coordination engine client ready (via coordination_engine_client.py)")

## Implementation Section

### 1. Define Remediation Rules

In [None]:
# Define remediation rules
REMEDIATION_RULES = {
    'high_cpu': {
        'description': 'High CPU usage detected',
        'actions': [
            {'type': 'scale', 'target': 'deployment', 'replicas': 3},
            {'type': 'restart', 'target': 'pod'}
        ]
    },
    'high_memory': {
        'description': 'High memory usage detected',
        'actions': [
            {'type': 'scale', 'target': 'deployment', 'replicas': 2},
            {'type': 'restart', 'target': 'pod'}
        ]
    },
    'pod_crash': {
        'description': 'Pod crash loop detected',
        'actions': [
            {'type': 'restart', 'target': 'pod'},
            {'type': 'check_logs', 'target': 'pod'}
        ]
    },
    'network_issue': {
        'description': 'Network connectivity issue',
        'actions': [
            {'type': 'restart', 'target': 'pod'},
            {'type': 'check_network', 'target': 'node'}
        ]
    }
}

logger.info(f"Loaded {len(REMEDIATION_RULES)} remediation rules")
for rule_name, rule_config in REMEDIATION_RULES.items():
    logger.info(f"  - {rule_name}: {rule_config['description']}")
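
Because the rules are plain dictionaries, a typo in a key only surfaces when an action runs. A minimal sketch of an up-front schema check follows. The helper `validate_rules`, the constant `REQUIRED_ACTION_KEYS`, and the `broken` sample rule are hypothetical and not part of the notebook's utils module.

```python
# Hypothetical helper, shown only as an illustration of a schema check.
REQUIRED_ACTION_KEYS = {'type', 'target'}

def validate_rules(rules):
    """Return a list of human-readable problems found in a rules dict."""
    problems = []
    for name, config in rules.items():
        if not config.get('description'):
            problems.append(f"{name}: missing description")
        actions = config.get('actions', [])
        if not actions:
            problems.append(f"{name}: no actions defined")
        for i, action in enumerate(actions):
            missing = REQUIRED_ACTION_KEYS - set(action)
            if missing:
                problems.append(f"{name}[{i}]: missing keys {sorted(missing)}")
            # A scale action is only usable if it says how many replicas to scale to
            if action.get('type') == 'scale' and not isinstance(action.get('replicas'), int):
                problems.append(f"{name}[{i}]: scale action needs an integer 'replicas'")
    return problems

# Small sample in the same shape as REMEDIATION_RULES
sample_rules = {
    'high_cpu': {
        'description': 'High CPU usage detected',
        'actions': [{'type': 'scale', 'target': 'deployment', 'replicas': 3}],
    },
    'broken': {
        'description': 'Example of a malformed rule',
        'actions': [{'type': 'scale', 'target': 'deployment'}],
    },
}
print(validate_rules(sample_rules))
```

Calling `validate_rules(REMEDIATION_RULES)` in the notebook should return an empty list for the four rules defined above.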

### 2. Anomaly to Rule Mapping

In [None]:
def map_anomaly_to_rule(anomaly_type, metric_values):
    """
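    Map detected anomaly to remediation rule.
    
    Thresholds are checked in a fixed order (CPU, memory, restarts, network
    errors) and the first match wins, so a pod with both high CPU and high
    memory maps to 'high_cpu'.
    
    Args:
        anomaly_type: Type of anomaly detected (currently unused; the mapping
            relies on metric_values only)
        metric_values: Current metric values
    
    Returns:
        Remediation rule name, or None if no threshold is exceeded
    """
    # Simple heuristic-based mapping; branch order defines rule priority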
    if metric_values.get('cpu', 0) > 80:
        return 'high_cpu'
    elif metric_values.get('memory', 0) > 85:
        return 'high_memory'
    elif metric_values.get('restart_count', 0) > 3:
        return 'pod_crash'
    elif metric_values.get('network_errors', 0) > 10:
        return 'network_issue'
    else:
        return None

# Test mapping
test_metrics = {'cpu': 85, 'memory': 50}
rule = map_anomaly_to_rule('test', test_metrics)
logger.info(f"Test mapping: {test_metrics} -> {rule}")
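
The `if`/`elif` chain above encodes both the thresholds and their priority. As an alternative sketch, the same thresholds can be kept in an ordered table, which makes the priority explicit and easier to change. The names `RULE_THRESHOLDS` and `map_metrics_to_rule` are hypothetical; the threshold values are copied from `map_anomaly_to_rule`.

```python
# Thresholds copied from map_anomaly_to_rule; list order encodes priority.
RULE_THRESHOLDS = [
    ('high_cpu', 'cpu', 80),
    ('high_memory', 'memory', 85),
    ('pod_crash', 'restart_count', 3),
    ('network_issue', 'network_errors', 10),
]

def map_metrics_to_rule(metric_values, thresholds=RULE_THRESHOLDS):
    """Return the first rule whose metric strictly exceeds its threshold, else None."""
    for rule_name, metric, limit in thresholds:
        if metric_values.get(metric, 0) > limit:
            return rule_name
    return None

print(map_metrics_to_rule({'cpu': 85, 'memory': 90}))  # high_cpu (first match wins)
print(map_metrics_to_rule({'cpu': 80, 'memory': 90}))  # high_memory (80 is not > 80)
print(map_metrics_to_rule({'cpu': 10}))                # None
```

Note that the comparison is strict, so a value exactly at the threshold does not trigger the rule.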

### 3. Execute Remediation Actions

In [None]:
def execute_remediation(rule_name, namespace='self-healing-platform', pod_name=None):
    """
    Execute remediation actions for a rule.
    
    Args:
        rule_name: Name of remediation rule
        namespace: Kubernetes namespace
        pod_name: Target pod name
    
    Returns:
        Execution result
    """
    if rule_name not in REMEDIATION_RULES:
        logger.error(f"Unknown rule: {rule_name}")
        return {'success': False, 'error': 'Unknown rule'}
    
    rule = REMEDIATION_RULES[rule_name]
    results = []
    
    for action in rule['actions']:
        try:
            if action['type'] == 'restart' and pod_name:
                # Deleting the pod lets its controller recreate it
                cmd = f"oc delete pod {pod_name} -n {namespace}"
                # Dry run: the command is only logged, not executed
                logger.info(f"Would execute: {cmd}")
                results.append({'action': action['type'], 'status': 'executed'})
            elif action['type'] == 'restart':
                # Without a pod name there is nothing to restart
                logger.warning("Skipping restart: no pod_name provided")
                results.append({'action': action['type'], 'status': 'skipped'})
            elif action['type'] == 'scale':
                logger.info(f"Scaling to {action['replicas']} replicas")
                results.append({'action': action['type'], 'status': 'executed'})
            else:
                logger.info(f"Action {action['type']}: {action}")
                results.append({'action': action['type'], 'status': 'executed'})
        except Exception as e:
            logger.error(f"Error executing action: {e}")
            results.append({'action': action['type'], 'status': 'failed', 'error': str(e)})
    
    # Report success only if no action failed
    success = all(r['status'] != 'failed' for r in results)
    return {'success': success, 'rule': rule_name, 'actions': results}

# Test execution
result = execute_remediation('high_cpu', pod_name='test-pod')
logger.info(f"Remediation result: {result}")
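
The function above only logs the command it would run. One possible way to move from logging to execution is sketched below, assuming the `oc` CLI is on the `PATH` and the caller is already logged in to the cluster. The helpers `build_command` and `run_action` and the `deployment_name` parameter are hypothetical. The sketch defaults to a dry run so that nothing touches the cluster unless `dry_run=False` is passed explicitly.

```python
import subprocess

def build_command(action, namespace, pod_name=None, deployment_name=None):
    """Translate one action dict into an argument list, or None if it cannot be built."""
    if action['type'] == 'restart' and pod_name:
        return ['oc', 'delete', 'pod', pod_name, '-n', namespace]
    if action['type'] == 'scale' and deployment_name:
        return ['oc', 'scale', f'deployment/{deployment_name}',
                f"--replicas={action['replicas']}", '-n', namespace]
    return None

def run_action(action, namespace, dry_run=True, **targets):
    """Run (or, by default, only describe) a single action."""
    cmd = build_command(action, namespace, **targets)
    if cmd is None:
        return {'action': action['type'], 'status': 'skipped'}
    if dry_run:
        return {'action': action['type'], 'status': 'dry_run', 'command': cmd}
    # An argument list with shell=False (the default) avoids shell injection via names
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    status = 'executed' if proc.returncode == 0 else 'failed'
    return {'action': action['type'], 'status': status, 'stderr': proc.stderr.strip()}

print(run_action({'type': 'restart', 'target': 'pod'},
                 'self-healing-platform', pod_name='test-pod'))
```

With `dry_run=False`, `subprocess.run` raises `FileNotFoundError` if `oc` is not installed and `subprocess.TimeoutExpired` if the command exceeds the timeout, so a caller would likely wrap it in the same `try`/`except` used in `execute_remediation`.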

### 4. Submit to Coordination Engine

In [None]:
def submit_incident_to_engine(anomaly_data, rule_name):
    """
    Submit incident to coordination engine using production client.

    Uses CoordinationEngineClient for proper error handling,
    retries, and consistent API usage.

    Args:
        anomaly_data: Detected anomaly data
        rule_name: Remediation rule to apply

    Returns:
        Engine response
    """
    incident = {
        'timestamp': datetime.now().isoformat(),
        'type': 'anomaly_detected',
        'severity': 'high',  # Hardcoded placeholder; should be derived from anomaly_data
        'anomaly_data': anomaly_data,
        'remediation_rule': rule_name,
        'status': 'pending'
    }

    try:
        client = get_client()
        response = client.submit_incident(incident)
        logger.info(f"Submitted incident: {response}")
        return response
    except Exception as e:
        logger.error(f"Error submitting incident: {e}")
        return {'error': str(e)}

# Test submission
test_anomaly = {'metric_0': 95.5, 'metric_1': 88.2}
# Uncomment to submit once the coordination engine is reachable:
# result = submit_incident_to_engine(test_anomaly, 'high_cpu')
logger.info("Incident submission ready (using CoordinationEngineClient)")
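
The incident above always carries `'severity': 'high'`. A minimal sketch of deriving severity from the anomaly data instead is shown below. It assumes the values in `anomaly_data` are percentages, which the notebook does not state; the function name `derive_severity` and both cut-offs are illustrative assumptions, not platform values.

```python
def derive_severity(anomaly_data, high=90.0, medium=75.0):
    """Classify severity from the largest numeric value in anomaly_data.

    The cut-offs are illustrative assumptions, not values from the platform.
    """
    # Ignore non-numeric fields (and booleans, which are ints in Python)
    numeric = [v for v in anomaly_data.values()
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if not numeric:
        return 'low'
    peak = max(numeric)
    if peak >= high:
        return 'high'
    if peak >= medium:
        return 'medium'
    return 'low'

print(derive_severity({'metric_0': 95.5, 'metric_1': 88.2}))  # high
```

Inside `submit_incident_to_engine`, the hardcoded value could then be replaced with `derive_severity(anomaly_data)`.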

### 5. Track Remediation Outcomes

In [None]:
import pandas as pd

# Create remediation tracking
remediation_log = []

def log_remediation(incident_id, rule_name, success, duration_seconds):
    """
    Log remediation outcome.
    
    Args:
        incident_id: Incident identifier
        rule_name: Applied rule
        success: Whether remediation succeeded
        duration_seconds: Time taken
    """
    log_entry = {
        'timestamp': datetime.now().isoformat(),
        'incident_id': incident_id,
        'rule': rule_name,
        'success': success,
        'duration_seconds': duration_seconds
    }
    remediation_log.append(log_entry)
    logger.info(f"Logged remediation: {incident_id} - {rule_name} - {success}")

# Test logging
log_remediation('INC-001', 'high_cpu', True, 15.5)
log_remediation('INC-002', 'pod_crash', True, 8.2)

# Save log (to_parquet requires pyarrow or fastparquet to be installed)
log_df = pd.DataFrame(remediation_log)
log_df.to_parquet(PROCESSED_DIR / 'remediation_log.parquet')
logger.info(f"Saved remediation log: {len(remediation_log)} entries")
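
Once entries accumulate, the log can be summarized per rule. The sketch below uses only the standard library and the same entry fields as `log_remediation`; the function name `summarize_outcomes` and the sample entries are hypothetical.

```python
from collections import defaultdict

def summarize_outcomes(entries):
    """Compute count, success rate, and mean duration for each rule."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry['rule']].append(entry)
    summary = {}
    for rule, rows in grouped.items():
        summary[rule] = {
            'count': len(rows),
            'success_rate': sum(1 for r in rows if r['success']) / len(rows),
            'mean_duration_seconds': sum(r['duration_seconds'] for r in rows) / len(rows),
        }
    return summary

# Sample entries in the same shape as remediation_log
sample_log = [
    {'rule': 'high_cpu', 'success': True, 'duration_seconds': 15.5},
    {'rule': 'high_cpu', 'success': False, 'duration_seconds': 20.5},
    {'rule': 'pod_crash', 'success': True, 'duration_seconds': 8.2},
]
for rule, stats in summarize_outcomes(sample_log).items():
    print(rule, stats)
```

In the notebook, `summarize_outcomes(remediation_log)` would work directly, since each logged entry has the `rule`, `success`, and `duration_seconds` keys.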

## Validation Section

In [None]:
# Verify outputs
assert len(REMEDIATION_RULES) > 0, "No remediation rules defined"
assert (PROCESSED_DIR / 'remediation_log.parquet').exists(), "Remediation log not saved"
assert len(remediation_log) > 0, "No remediation entries logged"

logger.info("✅ All validations passed")
print(f"\nRemediation Rules: {len(REMEDIATION_RULES)}")
print(f"Remediation Log Entries: {len(remediation_log)}")
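
The assertions above check the notebook's own outputs. Verifying that a remediation actually worked would additionally require re-reading the affected metric after the action. A minimal sketch of such a check follows: it polls a metric until it drops to the threshold or a timeout expires. The function `wait_for_recovery` and its parameters are hypothetical, and `fetch_metric` stands in for whatever metric source the platform provides.

```python
import time

def wait_for_recovery(fetch_metric, threshold, timeout_seconds=60.0,
                      interval_seconds=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_metric() until its value is <= threshold or the timeout would be exceeded."""
    start = clock()
    while True:
        value = fetch_metric()
        elapsed = clock() - start
        if value <= threshold:
            return {'recovered': True, 'value': value, 'elapsed_seconds': elapsed}
        # Stop if waiting one more interval would go past the timeout
        if elapsed + interval_seconds > timeout_seconds:
            return {'recovered': False, 'value': value, 'elapsed_seconds': elapsed}
        sleep(interval_seconds)

# Simulated CPU readings that fall below the 80 threshold on the third poll
readings = iter([92.0, 85.0, 60.0])
result = wait_for_recovery(lambda: next(readings), threshold=80,
                           interval_seconds=0.0)
print(result['recovered'], result['value'])  # True 60.0
```

The `recovered` flag and `elapsed_seconds` map naturally onto the `success` and `duration_seconds` arguments of `log_remediation`.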

## Integration Section

This notebook integrates with:
- **Input**: Ensemble predictions from Phase 2
- **Output**: Remediation actions to coordination engine
- **Coordination Engine**: Executes remediation and tracks outcomes

## Next Steps

1. Review remediation rules
2. Proceed to `ai-driven-decision-making.ipynb`
3. Combine rule-based and AI-driven approaches
4. Deploy hybrid healing workflows

## References

- ADR-002: Hybrid Deterministic-AI Self-Healing Approach
- ADR-012: Notebook Architecture for End-to-End Workflows
- [Kubernetes API](https://kubernetes.io/docs/reference/generated/kubernetes-api/)