# Pod Crash Loop Healing

## Overview
This notebook demonstrates automatic detection and healing of pod crash loops. It monitors pod restart counts, analyzes logs, and executes remediation actions to restore service health.

## Prerequisites
- Completed: All Phase 1-4 notebooks
- Inference pipeline deployed and running
- Coordination engine accessible
- OpenShift cluster with monitoring enabled

## Learning Objectives
- Detect pod crash loops automatically
- Analyze container logs for root causes
- Execute targeted remediation actions
- Track healing success rates
- Implement recovery workflows

## Key Concepts
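- **Crash Loop**: A container that repeatedly exits and is restarted with an increasing back-off delay; Kubernetes reports this as `CrashLoopBackOff`
- **Restart Count**: Number of times a pod's containers have been restarted
- **Log Analysis**: Matching known error patterns against recent container logs
- **Remediation**: Restarting the pod, adjusting its resources, or updating its image or configuration
- **Success Tracking**: Recording the outcome and recovery time of each remediation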

## Setup Section

In [None]:
import sys
import os
import json
import logging
import subprocess
from pathlib import Path
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
import re

# Setup path for utils module - works from any directory
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
        # dir() inside a function lists locals only, so check module globals for __file__
        Path(__file__).parent.parent / 'utils' if '__file__' in globals() else None,
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
        Path('/opt/app-root/src/openshift-aiops-platform/notebooks/utils'),
    ]
    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)
    current = Path.cwd()
    for _ in range(5):
        utils_path = current / 'notebooks' / 'utils'
        if utils_path.exists():
            return str(utils_path)
        current = current.parent
    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")
else:
    print("⚠️ Utils path not found - will use fallback implementations")

# Try to import common functions, with fallback
try:
    from common_functions import setup_environment
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Common functions not available: {e}")
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/models', exist_ok=True)
        return {'data_dir': '/opt/app-root/src/data', 'models_dir': '/opt/app-root/src/models'}

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Setup environment
env_info = setup_environment()
logger.info(f"Environment ready: {env_info}")

# Define paths
DATA_DIR = Path('/opt/app-root/src/data')
PROCESSED_DIR = DATA_DIR / 'processed'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# Configuration
NAMESPACE = 'self-healing-platform'
RESTART_THRESHOLD = 3  # Restart count threshold for crash loop
LOG_LINES = 50  # Number of log lines to analyze

logger.info("Pod crash loop healing initialized")

## Implementation Section

### 1. Detect Crash Loops

In [None]:
def detect_crash_loops(namespace, restart_threshold=3):
    """
    Detect pods in crash loop by checking restart count.
    
    Args:
        namespace: Kubernetes namespace
        restart_threshold: Restart count threshold
    
    Returns:
        List of pods in crash loop
    """
    crash_loop_pods = []
    
    try:
        # In a real deployment, query the Kubernetes API for pod status.
        # This demo uses hard-coded sample pods, so `namespace` is not used here.
        sample_pods = [
            {'name': 'app-pod-1', 'restart_count': 5, 'status': 'CrashLoopBackOff'},
            {'name': 'app-pod-2', 'restart_count': 2, 'status': 'Running'},
            {'name': 'app-pod-3', 'restart_count': 8, 'status': 'CrashLoopBackOff'},
        ]
        
        for pod in sample_pods:
            if pod['restart_count'] >= restart_threshold:
                crash_loop_pods.append(pod)
                logger.warning(f"Crash loop detected: {pod['name']} (restarts: {pod['restart_count']})")
        
        return crash_loop_pods
    except Exception as e:
        logger.error(f"Error detecting crash loops: {e}")
        return []

# Detect crash loops
crash_loop_pods = detect_crash_loops(NAMESPACE, RESTART_THRESHOLD)
logger.info(f"Found {len(crash_loop_pods)} pods in crash loop")
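The detection cell above works on hard-coded sample pods. As a minimal sketch of what a live query might look like, the code below converts the JSON document returned by `oc get pods -o json` into the same `name` / `restart_count` / `status` dictionaries. The helper names `restart_counts_from_pod_list` and `fetch_pod_list` are hypothetical, and the example pod list is hand-written so the block runs without a cluster.

```python
import json
import subprocess


def restart_counts_from_pod_list(pod_list):
    """Summarise an `oc get pods -o json` document as name/restart_count/status dicts."""
    pods = []
    for item in pod_list.get('items', []):
        pod_status = item.get('status', {})
        statuses = pod_status.get('containerStatuses', [])
        # A pod's restart count is taken as the sum over its containers
        restarts = sum(cs.get('restartCount', 0) for cs in statuses)
        # A waiting container carries the reason, e.g. CrashLoopBackOff
        reasons = [
            cs['state']['waiting']['reason']
            for cs in statuses
            if cs.get('state', {}).get('waiting', {}).get('reason')
        ]
        pods.append({
            'name': item['metadata']['name'],
            'restart_count': restarts,
            'status': reasons[0] if reasons else pod_status.get('phase', 'Unknown'),
        })
    return pods


def fetch_pod_list(namespace):
    """Run `oc get pods` and parse its JSON output (assumes a logged-in `oc` CLI)."""
    completed = subprocess.run(
        ['oc', 'get', 'pods', '-n', namespace, '-o', 'json'],
        capture_output=True, text=True, check=True,
    )
    return json.loads(completed.stdout)


# Offline example: a hand-written pod list, not real cluster output
example_pod_list = {
    'items': [
        {
            'metadata': {'name': 'app-pod-1'},
            'status': {
                'phase': 'Running',
                'containerStatuses': [
                    {'restartCount': 5,
                     'state': {'waiting': {'reason': 'CrashLoopBackOff'}}},
                ],
            },
        },
        {
            'metadata': {'name': 'app-pod-2'},
            'status': {
                'phase': 'Running',
                'containerStatuses': [
                    {'restartCount': 0, 'state': {'running': {}}},
                ],
            },
        },
    ]
}
example_pods = restart_counts_from_pod_list(example_pod_list)
```

Against a cluster, `restart_counts_from_pod_list(fetch_pod_list(NAMESPACE))` could stand in for `sample_pods`; the threshold check in `detect_crash_loops` would stay the same.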

### 2. Analyze Pod Logs

In [None]:
def analyze_pod_logs(pod_name, namespace, lines=50):
    """
    Analyze pod logs to identify root cause.
    
    Args:
        pod_name: Pod name
        namespace: Kubernetes namespace
        lines: Number of log lines to analyze
    
    Returns:
        Log analysis result
    """
    try:
        # Sample error patterns
        error_patterns = {
            'OOMKilled': r'(OOMKilled|Out of memory)',
            'CrashLoopBackOff': r'(CrashLoopBackOff|crash loop)',
            'ImagePullBackOff': r'(ImagePullBackOff|Failed to pull image)',
            'ConfigError': r'(ConfigError|configuration error)',
            'HealthCheckFailed': r'(HealthCheckFailed|liveness probe failed)',
        }
        
        # Sample logs (demo data; a real run would fetch the last `lines` log lines of the pod)
        sample_logs = [
            'ERROR: Connection refused to database',
            'FATAL: Out of memory',
            'ERROR: Failed to initialize service',
        ]
        
        # Record each error type at most once, even if several log lines match it
        detected_errors = []
        for error_type, pattern in error_patterns.items():
            if any(re.search(pattern, log_line, re.IGNORECASE) for log_line in sample_logs):
                detected_errors.append(error_type)
        
        analysis = {
            'pod_name': pod_name,
            'detected_errors': detected_errors if detected_errors else ['Unknown'],
            'log_sample': sample_logs[:3],
            'analysis_time': datetime.now().isoformat()
        }
        
        logger.info(f"Log analysis for {pod_name}: {analysis['detected_errors']}")
        return analysis
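    except Exception as e:
        logger.error(f"Error analyzing logs: {e}")
        # Keep the keys that later cells read, so one failure does not break the pipeline
        return {'pod_name': pod_name, 'detected_errors': ['Unknown'], 'error': str(e)}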

# Analyze logs for crash loop pods
log_analyses = []
for pod in crash_loop_pods:
    analysis = analyze_pod_logs(pod['name'], NAMESPACE, LOG_LINES)
    log_analyses.append(analysis)

logger.info(f"Analyzed logs for {len(log_analyses)} pods")
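The analysis above matches patterns against fixed sample lines and does not use `lines`. A rough sketch of the live equivalent follows: one helper builds the `oc logs` command for the last `lines` lines, and another classifies raw log text. The helper names and the reduced `ERROR_PATTERNS` table are assumptions for illustration; the patterns themselves are copied from the cell above.

```python
import re

# Subset of the patterns used in analyze_pod_logs
ERROR_PATTERNS = {
    'OOMKilled': r'(OOMKilled|Out of memory)',
    'ConfigError': r'(ConfigError|configuration error)',
    'HealthCheckFailed': r'(HealthCheckFailed|liveness probe failed)',
}


def build_logs_command(pod_name, namespace, lines=50, previous=True):
    """Return the argv for fetching the last `lines` log lines of a pod."""
    cmd = ['oc', 'logs', pod_name, '-n', namespace, f'--tail={lines}']
    if previous:
        # --previous reads the container instance that exited, not the fresh restart
        cmd.append('--previous')
    return cmd


def classify_log_text(log_text, patterns=ERROR_PATTERNS):
    """Return the error types whose pattern matches any line, each at most once."""
    log_lines = log_text.splitlines()
    return [
        error_type
        for error_type, pattern in patterns.items()
        if any(re.search(pattern, line, re.IGNORECASE) for line in log_lines)
    ]


# Offline example using two of the sample log lines from the cell above
example_text = "ERROR: Connection refused to database\nFATAL: Out of memory\n"
example_errors = classify_log_text(example_text)
example_logs_cmd = build_logs_command('app-pod-1', 'self-healing-platform', lines=50)
```

Passing the command as a list to `subprocess.run(..., capture_output=True, text=True)` and feeding its `stdout` to `classify_log_text` would replace `sample_logs`. For a crash-looping pod, the previous container's logs are usually the ones that contain the failure.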

### 3. Execute Remediation

In [None]:
def execute_remediation(pod_name, namespace, error_type):
    """
    Execute remediation action based on error type.
    
    Args:
        pod_name: Pod name
        namespace: Kubernetes namespace
        error_type: Type of error detected
    
    Returns:
        Remediation result
    """
    # Command strings are illustrative; this demo logs the chosen action and never runs them
    remediation_actions = {
        'OOMKilled': {
            'action': 'scale_resources',
            'command': f'oc set resources pod {pod_name} -n {namespace} --limits=memory=2Gi'
        },
        'CrashLoopBackOff': {
            'action': 'restart_pod',
            'command': f'oc delete pod {pod_name} -n {namespace}'
        },
        'ImagePullBackOff': {
            'action': 'update_image',
            # Placeholder: a real `oc set image` call also needs a CONTAINER=IMAGE argument
            'command': f'oc set image pod {pod_name} -n {namespace}'
        },
        'ConfigError': {
            'action': 'update_config',
            'command': f'oc rollout restart deployment -n {namespace}'
        },
        'HealthCheckFailed': {
            'action': 'restart_pod',
            'command': f'oc delete pod {pod_name} -n {namespace}'
        },
    }
    
    try:
        action_config = remediation_actions.get(error_type, remediation_actions['CrashLoopBackOff'])
        
        logger.info(f"Executing {action_config['action']} for {pod_name}")
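        # In a real deployment, run the command without a shell to avoid injection, e.g.:
        # subprocess.run(shlex.split(action_config['command']), check=True)  # needs `import shlex`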
        
        return {
            'success': True,
            'pod_name': pod_name,
            'action': action_config['action'],
            'timestamp': datetime.now().isoformat()
        }
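    except Exception as e:
        logger.error(f"Remediation error: {e}")
        # Include pod_name and action so the tracking cell can still read this result
        return {'success': False, 'pod_name': pod_name, 'action': 'none', 'error': str(e)}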

# Execute remediation
remediation_results = []
for analysis in log_analyses:
    error_type = analysis['detected_errors'][0]
    result = execute_remediation(analysis['pod_name'], NAMESPACE, error_type)
    remediation_results.append(result)

logger.info(f"Executed remediation for {len(remediation_results)} pods")
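The remediation cell only logs the chosen action. The sketch below shows one way the `restart_pod` action could be executed, assuming the pod is managed by a controller that recreates it after deletion. It builds the command as an argument list rather than a shell string and defaults to a client-side dry run. `restart_pod_command`, `run_remediation`, and the stub runner are hypothetical names, and the stubbed output is a stand-in, not real CLI output.

```python
import subprocess


def restart_pod_command(pod_name, namespace, dry_run=True):
    """Build the argv that deletes a pod (its controller, if any, recreates it)."""
    cmd = ['oc', 'delete', 'pod', pod_name, '-n', namespace]
    if dry_run:
        # --dry-run=client only reports what would be deleted
        cmd.append('--dry-run=client')
    return cmd


def run_remediation(cmd, runner=subprocess.run):
    """Run a remediation command and report the outcome without raising."""
    try:
        completed = runner(cmd, capture_output=True, text=True, timeout=60)
        return {
            'success': completed.returncode == 0,
            'command': ' '.join(cmd),
            'output': (completed.stdout or completed.stderr).strip(),
        }
    except (OSError, subprocess.SubprocessError) as exc:
        # Covers a missing `oc` binary (OSError) and timeouts (SubprocessError)
        return {'success': False, 'command': ' '.join(cmd), 'output': str(exc)}


class _StubCompleted:
    """Stand-in for subprocess.CompletedProcess so the example needs no cluster."""
    returncode = 0
    stdout = 'stubbed output\n'
    stderr = ''


def _stub_runner(cmd, **kwargs):
    return _StubCompleted()


example_restart_cmd = restart_pod_command('app-pod-1', 'self-healing-platform')
example_result = run_remediation(example_restart_cmd, runner=_stub_runner)
```

The injectable `runner` keeps the function testable offline; calling `run_remediation(cmd)` with the default runner would invoke `oc`. Keeping `dry_run=True` until the detection and analysis steps are trusted is a conservative default, not a requirement.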

### 4. Track Healing Success

In [None]:
# Create healing tracking dataframe
healing_tracking = pd.DataFrame([
    {
        'timestamp': datetime.now().isoformat(),
        'pod_name': result['pod_name'],
        'error_type': log_analyses[i]['detected_errors'][0],
        'remediation_action': result['action'],
        'success': result['success'],
        'recovery_time_seconds': np.random.randint(5, 30),  # simulated for the demo, not a measurement
        'status': 'Running' if result['success'] else 'Failed'
    }
    for i, result in enumerate(remediation_results)
])

# Save tracking data
tracking_file = PROCESSED_DIR / 'pod_crash_loop_healing.parquet'
healing_tracking.to_parquet(tracking_file)

logger.info(f"Saved healing tracking data to {tracking_file}")
print(healing_tracking.to_string())
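The `recovery_time_seconds` column above is filled with random numbers. A minimal sketch of measuring it instead is to poll a readiness check after remediation and record the elapsed time. `wait_for_recovery` and its parameters are hypothetical; the `is_ready` callable is assumed to wrap whatever pod status check the platform provides. The example uses a fake clock so it finishes instantly.

```python
import time


def wait_for_recovery(is_ready, timeout_seconds=120, poll_seconds=5,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll `is_ready()` until it returns True.

    Returns the elapsed seconds on success, or None if the timeout is reached.
    """
    start = clock()
    while True:
        if is_ready():
            return clock() - start
        if clock() - start >= timeout_seconds:
            return None
        sleep(poll_seconds)


# Offline example: a fake clock that advances only when "sleeping"
_fake_now = [0.0]


def _fake_clock():
    return _fake_now[0]


def _fake_sleep(seconds):
    _fake_now[0] += seconds


# The pod reports ready on the third check
_checks = iter([False, False, True])
example_recovery = wait_for_recovery(
    lambda: next(_checks), poll_seconds=5, clock=_fake_clock, sleep=_fake_sleep
)
```

A `None` result could be recorded as a failed recovery in the tracking DataFrame, which would make the reported success rate reflect what happened on the cluster rather than whether a command was issued.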

## Validation Section

In [None]:
# Verify outputs
assert len(crash_loop_pods) > 0, "No crash loop pods detected"
assert len(log_analyses) > 0, "No log analyses performed"
assert len(remediation_results) > 0, "No remediation executed"
assert tracking_file.exists(), "Healing tracking file not created"

success_rate = healing_tracking['success'].sum() / len(healing_tracking)
logger.info("✅ All validations passed")
print("\nPod Crash Loop Healing Summary:")
print(f"  Crash Loop Pods Detected: {len(crash_loop_pods)}")
print(f"  Remediation Actions Executed: {len(remediation_results)}")
print(f"  Success Rate: {success_rate:.1%}")
print(f"  Average Recovery Time (simulated): {healing_tracking['recovery_time_seconds'].mean():.1f}s")

## Integration Section

This notebook integrates with:
- **Input**: Inference pipeline predictions and pod metrics
- **Output**: Healing tracking and recovery metrics
- **Monitoring**: Prometheus for pod health metrics
- **Next**: Resource exhaustion detection
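For the Prometheus integration point, one possible restart signal is the `kube_pod_container_status_restarts_total` counter exposed by kube-state-metrics. The sketch below builds a PromQL expression and an instant-query URL for the Prometheus HTTP API. It assumes kube-state-metrics is being scraped; the helper names, the 15-minute window, and the `prometheus.example` host are illustrative assumptions.

```python
from urllib.parse import urlencode


def restart_increase_query(namespace, window='15m', threshold=3):
    """PromQL for containers whose restart counter grew by more than `threshold`."""
    selector = f'kube_pod_container_status_restarts_total{{namespace="{namespace}"}}'
    return f'increase({selector}[{window}]) > {threshold}'


def instant_query_url(base_url, promql):
    """URL for the Prometheus instant-query endpoint, /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


example_promql = restart_increase_query('self-healing-platform')
# Hypothetical host; substitute the cluster's Prometheus or Thanos route
example_url = instant_query_url('http://prometheus.example:9090/', example_promql)
```

Unlike a raw restart count, `increase(...)` over a window ignores pods that restarted long ago and have since been stable. Whether that suits this platform depends on how `RESTART_THRESHOLD` is meant to be interpreted.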

## Next Steps

1. Monitor pod recovery status
2. Proceed to `resource-exhaustion-detection.ipynb`
3. Implement resource scaling workflows
4. Test complete end-to-end scenarios
5. Validate healing success rates

## References

- ADR-003: Self-Healing Platform Architecture
- ADR-012: Notebook Architecture for End-to-End Workflows
- [Kubernetes Pod Lifecycle](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/)
- [OpenShift Debugging Pods](https://docs.openshift.com/container-platform/latest/support/troubleshooting/investigating-pod-issues.html)