# Healing Success Tracking

## Overview
This notebook tracks remediation success rates for the self-healing platform. It computes success and MTTR metrics from remediation history, analyzes failure patterns by action type, and generates a JSON report with recommendations for improving the platform.

## Prerequisites
- Completed: `model-performance-monitoring.ipynb`
- Remediation history available
- Incident data collected
- Coordination engine logs accessible

## Learning Objectives
- Track remediation success rates
- Analyze failure patterns
- Generate comprehensive reports
- Identify improvement opportunities
- Optimize healing strategies

## Key Concepts
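- **Success Rate**: Percentage of remediations that succeed
- **Failure Analysis**: Identifying where remediations fail (in this notebook, broken down by action type)
- **MTTR**: Mean Time To Resolution, the average `resolution_time_seconds` across remediations
- **Incident Trends**: Patterns in incident volume and outcomes over time
- **Optimization**: Using these metrics to improve healing effectiveness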
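
As a worked illustration of the success rate and MTTR concepts, the sketch below computes both from a handful of made-up records. The field names `success` and `resolution_time_seconds` mirror the sample data used in the implementation cells; the values are illustrative only.

```python
from statistics import mean

# Hypothetical remediation records (illustrative values only).
records = [
    {"success": True, "resolution_time_seconds": 30},
    {"success": True, "resolution_time_seconds": 45},
    {"success": False, "resolution_time_seconds": 90},
    {"success": True, "resolution_time_seconds": 15},
]

# Success rate: successful remediations divided by all remediations.
success_rate = sum(r["success"] for r in records) / len(records)

# MTTR: mean resolution time across all records.
mttr_seconds = mean(r["resolution_time_seconds"] for r in records)

print(f"Success rate: {success_rate:.1%}")
print(f"MTTR: {mttr_seconds:.1f}s")
```

Three of the four records succeed, so the success rate is 75%, and the four resolution times average to 45 seconds.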

## Setup Section

In [None]:
import sys
import os
import json
import logging
from pathlib import Path
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
from typing import Dict, List, Any

# Setup path for utils module - works from any directory
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
        # dir() inside a function lists local names only, so check module globals instead
        Path(__file__).parent.parent / 'utils' if '__file__' in globals() else None,
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
        Path('/opt/app-root/src/openshift-aiops-platform/notebooks/utils'),
    ]
    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)
    current = Path.cwd()
    for _ in range(5):
        utils_path = current / 'notebooks' / 'utils'
        if utils_path.exists():
            return str(utils_path)
        current = current.parent
    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")
else:
    print("⚠️ Utils path not found - will use fallback implementations")

# Try to import common functions, with fallback
try:
    from common_functions import setup_environment
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Common functions not available: {e}")
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/models', exist_ok=True)
        return {'data_dir': '/opt/app-root/src/data', 'models_dir': '/opt/app-root/src/models'}

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Setup environment
env_info = setup_environment()
logger.info(f"Environment ready: {env_info}")

# Define paths
DATA_DIR = Path('/opt/app-root/src/data')
PROCESSED_DIR = DATA_DIR / 'processed'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
REPORTS_DIR = DATA_DIR / 'reports'
REPORTS_DIR.mkdir(parents=True, exist_ok=True)

# Configuration
NAMESPACE = 'self-healing-platform'
SUCCESS_THRESHOLD = 0.90  # Target 90% success rate

logger.info("Healing success tracking initialized")

## Implementation Section

### 1. Track Remediation Success

In [None]:
def calculate_success_metrics(remediation_history: pd.DataFrame) -> Dict[str, Any]:
    """
    Calculate remediation success metrics.
    
    Args:
        remediation_history: DataFrame with remediation records
    
    Returns:
        Success metrics
    """
    try:
        total_remediations = len(remediation_history)
        # Cast the NumPy scalar to int so json.dumps emits numbers rather than strings
        successful = int(remediation_history['success'].sum())
        failed = total_remediations - successful
        success_rate = successful / total_remediations if total_remediations > 0 else 0.0
        
        # Calculate by action type
        success_by_action = remediation_history.groupby('action_type')['success'].agg(['sum', 'count'])
        success_by_action['rate'] = success_by_action['sum'] / success_by_action['count']
        
        metrics = {
            'timestamp': datetime.now().isoformat(),
            'total_remediations': total_remediations,
            'successful': successful,
            'failed': failed,
            'success_rate': success_rate,
            'success_by_action': success_by_action.to_dict()
        }
        
        logger.info(f"Success rate: {success_rate:.1%} ({successful}/{total_remediations})")
        return metrics
    except Exception as e:
        logger.error(f"Success metrics error: {e}")
        return {'error': str(e)}

# Create sample remediation history
remediation_history = pd.DataFrame([
    {
        'timestamp': datetime.now() - timedelta(hours=i),
        'action_type': np.random.choice(['restart', 'scale', 'update', 'restart_pod']),
        'success': np.random.choice([True, False], p=[0.92, 0.08]),
        'resolution_time_seconds': np.random.randint(5, 120)
    }
    for i in range(100)  # 100 remediation records
])

success_metrics = calculate_success_metrics(remediation_history)
print(json.dumps(success_metrics, indent=2, default=str))
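
`DataFrame.to_dict()` defaults to a column-oriented layout, so the `success_by_action` entry above is keyed first by `sum`, `count`, and `rate`, and only then by action type. If a per-action layout reads better in the report, `orient="index"` is one option. The sketch below uses a small hypothetical history; only the column names are taken from the sample data.

```python
import pandas as pd

# Small hypothetical history; column names match the notebook's sample data.
history = pd.DataFrame({
    "action_type": ["restart", "restart", "scale", "scale", "scale", "update"],
    "success": [True, False, True, True, True, False],
})

by_action = history.groupby("action_type")["success"].agg(["sum", "count"])
by_action["rate"] = by_action["sum"] / by_action["count"]

# Default orient: outer keys are column names ("sum", "count", "rate").
column_oriented = by_action.to_dict()

# orient="index": outer keys are action types, one small dict per action.
action_oriented = by_action.to_dict(orient="index")

print(action_oriented["scale"])
```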

### 2. Analyze Failure Patterns

In [None]:
def analyze_failures(remediation_history: pd.DataFrame) -> Dict[str, Any]:
    """
    Analyze failure patterns in remediation history.
    
    Args:
        remediation_history: DataFrame with remediation records
    
    Returns:
        Failure analysis
    """
    try:
        failures = remediation_history[~remediation_history['success']]
        
        if len(failures) == 0:
            logger.info("No failures to analyze")
            return {'failures': 0, 'analysis': 'No failures detected'}
        
        # Analyze failure patterns
        failure_by_action = failures['action_type'].value_counts().to_dict()
        avg_resolution_time = failures['resolution_time_seconds'].mean()
        
        analysis = {
            'timestamp': datetime.now().isoformat(),
            'total_failures': len(failures),
            'failure_rate': len(failures) / len(remediation_history),
            'failures_by_action': failure_by_action,
            'avg_resolution_time_seconds': avg_resolution_time,
            'top_failure_action': max(failure_by_action, key=failure_by_action.get) if failure_by_action else None
        }
        
        logger.info(f"Failure analysis: {len(failures)} failures ({analysis['failure_rate']:.1%})")
        return analysis
    except Exception as e:
        logger.error(f"Failure analysis error: {e}")
        return {'error': str(e)}

failure_analysis = analyze_failures(remediation_history)
print(json.dumps(failure_analysis, indent=2, default=str))
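
`top_failure_action` above is chosen by raw failure count, which tends to favor whichever action runs most often. A per-action failure rate normalizes by the number of attempts and can point at a different action. The following is a minimal sketch on hypothetical data, not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical history: "restart" runs often, "update" rarely.
history = pd.DataFrame({
    "action_type": ["restart"] * 8 + ["update"] * 2,
    "success": [True] * 6 + [False] * 2 + [True, False],
})

failed = ~history["success"]

# Raw failure counts per action.
failure_counts = failed.groupby(history["action_type"]).sum()

# Failure rate per action: the mean of a boolean series is the fraction of True values.
failure_rates = failed.groupby(history["action_type"]).mean()

print("Most failures:", failure_counts.idxmax())
print("Highest failure rate:", failure_rates.idxmax())
```

Here `restart` has the most failures (2 of 8 attempts), while `update` has the highest failure rate (1 of 2), so the two rankings disagree.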

### 3. Calculate MTTR (Mean Time To Resolution)

In [None]:
def calculate_mttr(remediation_history: pd.DataFrame) -> Dict[str, Any]:
    """
    Calculate Mean Time To Resolution (MTTR).
    
    Args:
        remediation_history: DataFrame with remediation records
    
    Returns:
        MTTR metrics
    """
    try:
        successful = remediation_history[remediation_history['success']]
        failed = remediation_history[~remediation_history['success']]
        
        mttr_metrics = {
            'timestamp': datetime.now().isoformat(),
            'overall_mttr_seconds': remediation_history['resolution_time_seconds'].mean(),
            'successful_mttr_seconds': successful['resolution_time_seconds'].mean() if len(successful) > 0 else 0,
            'failed_mttr_seconds': failed['resolution_time_seconds'].mean() if len(failed) > 0 else 0,
            'p50_resolution_time': remediation_history['resolution_time_seconds'].quantile(0.5),
            'p95_resolution_time': remediation_history['resolution_time_seconds'].quantile(0.95),
            'p99_resolution_time': remediation_history['resolution_time_seconds'].quantile(0.99)
        }
        
        logger.info(f"MTTR: {mttr_metrics['overall_mttr_seconds']:.1f}s (p95: {mttr_metrics['p95_resolution_time']:.1f}s)")
        return mttr_metrics
    except Exception as e:
        logger.error(f"MTTR calculation error: {e}")
        return {'error': str(e)}

mttr_metrics = calculate_mttr(remediation_history)
print(json.dumps(mttr_metrics, indent=2, default=str))
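
The metrics above aggregate over all remediations. To see whether one action type dominates resolution time, the same statistics can be grouped by `action_type`. This is a sketch under the assumption that the history has the same two columns as the sample data; the output column names `mttr_seconds` and `p95_seconds` are hypothetical.

```python
import pandas as pd

# Hypothetical history with the notebook's column names.
history = pd.DataFrame({
    "action_type": ["restart", "restart", "scale", "scale"],
    "resolution_time_seconds": [10, 30, 60, 100],
})

# Named aggregation: one output column per statistic, one row per action type.
mttr_by_action = (
    history.groupby("action_type")["resolution_time_seconds"]
    .agg(mttr_seconds="mean", p95_seconds=lambda s: s.quantile(0.95))
)

print(mttr_by_action)
```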

### 4. Generate Comprehensive Report

In [None]:
def generate_report(success_metrics: Dict, failure_analysis: Dict, mttr_metrics: Dict) -> Dict[str, Any]:
    """
    Generate comprehensive healing success report.
    
    Args:
        success_metrics: Success metrics
        failure_analysis: Failure analysis
        mttr_metrics: MTTR metrics
    
    Returns:
        Comprehensive report
    """
    try:
        report = {
            'report_date': datetime.now().isoformat(),
            'summary': {
                'total_remediations': success_metrics.get('total_remediations', 0),
                'success_rate': success_metrics.get('success_rate', 0),
                'target_success_rate': SUCCESS_THRESHOLD,
                # bool() keeps a NumPy comparison result JSON-serializable as true/false
                'target_met': bool(success_metrics.get('success_rate', 0) >= SUCCESS_THRESHOLD)
            },
            'success_metrics': success_metrics,
            'failure_analysis': failure_analysis,
            'mttr_metrics': mttr_metrics,
            'recommendations': []
        }
        
        # Generate recommendations
        if success_metrics.get('success_rate', 0) < SUCCESS_THRESHOLD:
            report['recommendations'].append(
                f"Improve success rate from {success_metrics.get('success_rate', 0):.1%} to {SUCCESS_THRESHOLD:.0%}"
            )
        
        if failure_analysis.get('top_failure_action'):
            report['recommendations'].append(
                f"Focus on improving {failure_analysis['top_failure_action']} action reliability"
            )
        
        logger.info(f"Report generated with {len(report['recommendations'])} recommendations")
        return report
    except Exception as e:
        logger.error(f"Report generation error: {e}")
        return {'error': str(e)}

report = generate_report(success_metrics, failure_analysis, mttr_metrics)
print(json.dumps(report, indent=2, default=str))
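
The JSON dump is convenient for machines but long to read. One possible way to surface the headline numbers is a small text renderer. `render_summary` is a hypothetical helper, and the example report only reuses the key names from the `summary` and `recommendations` sections built above.

```python
from typing import Any, Dict, List


def render_summary(report: Dict[str, Any]) -> List[str]:
    """Turn a report dict into short human-readable lines (hypothetical helper)."""
    summary = report.get("summary", {})
    status = "met" if summary.get("target_met") else "not met"
    lines = [
        f"Remediations: {summary.get('total_remediations', 0)}",
        f"Success rate: {summary.get('success_rate', 0):.1%} "
        f"(target {summary.get('target_success_rate', 0):.0%}, {status})",
    ]
    # One bullet per recommendation, in the order they were generated.
    for rec in report.get("recommendations", []):
        lines.append(f"- {rec}")
    return lines


# Hypothetical report using the same keys as the notebook's report.
example_report = {
    "summary": {
        "total_remediations": 100,
        "success_rate": 0.875,
        "target_success_rate": 0.90,
        "target_met": False,
    },
    "recommendations": ["Focus on improving restart action reliability"],
}

summary_lines = render_summary(example_report)
print("\n".join(summary_lines))
```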

### 5. Track Healing Success History

In [None]:
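# Create healing success tracking dataframe (synthetic sample data).
# Rates are derived from the incident counts so every row is internally consistent.
tracking_records = []
for i in range(30):  # 30 days of data
    total = int(np.random.randint(10, 50))
    resolved = int(round(total * np.random.uniform(0.85, 0.98)))
    tracking_records.append({
        'timestamp': datetime.now() - timedelta(days=i),
        'total_incidents': total,
        'resolved_incidents': resolved,
        'success_rate': resolved / total,
        'avg_mttr_seconds': int(np.random.randint(10, 120)),
        'failure_rate': 1 - resolved / total
    })
healing_tracking = pd.DataFrame(tracking_records)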

# Save tracking data (to_parquet requires the pyarrow or fastparquet package)
tracking_file = PROCESSED_DIR / 'healing_success_tracking.parquet'
healing_tracking.to_parquet(tracking_file)

# Save report
report_file = REPORTS_DIR / f"healing_success_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(report_file, 'w') as f:
    json.dump(report, f, indent=2, default=str)

logger.info("Saved healing success tracking data and report")
print(healing_tracking.to_string())
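
Daily success rates are noisy, so a rolling average can make trends easier to see. The sketch below computes a 7-day rolling mean on hypothetical daily data. Note that the tracking frame above is built newest-first, so a sort by `timestamp` is needed before a time-based window; the column name `success_rate_7d` is an assumption.

```python
import pandas as pd

# Hypothetical daily tracking data.
tracking = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=10, freq="D"),
    "success_rate": [0.90, 0.92, 0.88, 0.95, 0.91, 0.93, 0.89, 0.94, 0.96, 0.90],
})

# Time-based windows need a sorted DatetimeIndex.
tracking = tracking.sort_values("timestamp").set_index("timestamp")

# Each value averages the current day with up to six preceding days.
tracking["success_rate_7d"] = tracking["success_rate"].rolling("7D").mean()

print(tracking.tail(3))
```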

## Validation Section

In [None]:
# Verify outputs
assert tracking_file.exists(), "Healing tracking file not created"
assert report_file.exists(), "Report file not created"

avg_success_rate = healing_tracking['success_rate'].mean()
avg_mttr = healing_tracking['avg_mttr_seconds'].mean()
total_incidents = healing_tracking['total_incidents'].sum()

logger.info("✅ All validations passed")
print("\nHealing Success Tracking Summary:")
print(f"  Tracking Records: {len(healing_tracking)}")
print(f"  Average Success Rate: {avg_success_rate:.1%}")
print(f"  Average MTTR: {avg_mttr:.1f}s")
print(f"  Total Incidents: {total_incidents}")
print(f"  Success Target: {SUCCESS_THRESHOLD:.0%}")
print(f"  Target Met: {'✅ Yes' if avg_success_rate >= SUCCESS_THRESHOLD else '❌ No'}")
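
The assertions above only confirm that the output files exist. A sketch of additional sanity checks on the tracking data itself might look like the following. `find_invalid_rows` is a hypothetical helper, and the two rules it applies (rates within 0 to 1, resolved incidents not exceeding total incidents) are assumptions about what valid data looks like.

```python
import pandas as pd


def find_invalid_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows that violate basic sanity rules (hypothetical helper)."""
    rate_out_of_range = ~df["success_rate"].between(0.0, 1.0)
    resolved_exceeds_total = df["resolved_incidents"] > df["total_incidents"]
    return df[rate_out_of_range | resolved_exceeds_total]


# Hypothetical sample; the second row resolves more incidents than it received.
sample = pd.DataFrame({
    "total_incidents": [20, 15, 30],
    "resolved_incidents": [18, 17, 30],
    "success_rate": [0.90, 0.95, 1.00],
})

invalid = find_invalid_rows(sample)
print(invalid)
```

An empty result means every row passed both checks; any returned rows can be logged or used to fail the validation cell.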

## Integration Section

This notebook integrates with:
- **Input**: Remediation history and incident data
- **Output**: Success metrics, reports, and recommendations
- **Monitoring**: Success rates, MTTR, and failure patterns
- **Next**: None; this notebook completes Phase 7

## Next Steps

1. Monitor healing success continuously
2. Review reports and recommendations
3. Implement improvements based on analysis
4. Optimize remediation strategies
5. Complete notebook roadmap implementation

## References

- ADR-003: Self-Healing Platform Architecture
- ADR-012: Notebook Architecture for End-to-End Workflows
- [MTTR Metrics](https://en.wikipedia.org/wiki/Mean_time_to_recovery)
- [Incident Management Best Practices](https://en.wikipedia.org/wiki/Incident_management)