# Network Anomaly Response

## Overview
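This notebook detects network anomalies (high latency, packet loss, connectivity failures), diagnoses a likely root cause for each affected service, and applies a matching healing action. The cells use sample metrics and simulated checks, so they demonstrate the workflow rather than act on live cluster state.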

## Prerequisites
- Completed: `resource-exhaustion-detection.ipynb`
- Network monitoring enabled
- Prometheus metrics available
- Coordination engine accessible

## Learning Objectives
- Detect network anomalies
- Analyze connectivity issues
- Execute network healing actions
- Monitor network recovery
- Track network health metrics

## Key Concepts
- **Network Metrics**: Latency, packet loss, throughput
- **Connectivity Issues**: DNS failures, connection timeouts
- **Network Healing**: Restart services, update routes
- **Health Checks**: Verify connectivity restoration
- **Performance Monitoring**: Track network metrics
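
As a rough illustration of the first concept, the sketch below derives packet loss and latency figures from raw probe results. The function name `summarize_probes` and its input shape are assumptions made for this example; they are not part of the platform's utilities.

```python
from statistics import mean

def summarize_probes(sent, rtts_ms):
    """Summarize probe results (hypothetical helper).

    sent: number of probes sent.
    rtts_ms: round-trip times in milliseconds for the probes that were answered.
    """
    if sent <= 0:
        raise ValueError("sent must be positive")
    received = len(rtts_ms)
    if received > sent:
        raise ValueError("more replies than probes sent")
    return {
        # Share of probes that never came back, as a percentage
        'packet_loss': 100.0 * (sent - received) / sent,
        # Latency figures are undefined when every probe was lost
        'latency_ms': mean(rtts_ms) if rtts_ms else None,
        'max_latency_ms': max(rtts_ms) if rtts_ms else None,
    }

# 20 probes sent, 18 answered, each after 120 ms
summary = summarize_probes(sent=20, rtts_ms=[120.0] * 18)
```

The result uses the same `latency_ms` and `packet_loss` keys as the sample metrics later in this notebook, so it can be compared against `LATENCY_THRESHOLD` and `PACKET_LOSS_THRESHOLD` directly.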

## Setup Section

In [None]:
import sys
import os
import json
import logging
from pathlib import Path
from datetime import datetime, timedelta
import pandas as pd
import numpy as np

# Setup path for utils module - works from any directory
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
        # dir() inside a function lists only local names, so check module globals instead
        Path(__file__).parent.parent / 'utils' if '__file__' in globals() else None,
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
        Path('/opt/app-root/src/openshift-aiops-platform/notebooks/utils'),
    ]
    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)
    current = Path.cwd()
    for _ in range(5):
        utils_path = current / 'notebooks' / 'utils'
        if utils_path.exists():
            return str(utils_path)
        current = current.parent
    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")
else:
    print("⚠️ Utils path not found - will use fallback implementations")

# Try to import common functions, with fallback
try:
    from common_functions import setup_environment
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Common functions not available: {e}")
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/models', exist_ok=True)
        return {'data_dir': '/opt/app-root/src/data', 'models_dir': '/opt/app-root/src/models'}

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Setup environment
env_info = setup_environment()
logger.info(f"Environment ready: {env_info}")

# Define paths
DATA_DIR = Path('/opt/app-root/src/data')
PROCESSED_DIR = DATA_DIR / 'processed'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# Configuration
NAMESPACE = 'self-healing-platform'
LATENCY_THRESHOLD = 500  # milliseconds
PACKET_LOSS_THRESHOLD = 5  # percentage
CONNECTION_TIMEOUT = 10  # seconds

logger.info("Network anomaly response initialized")

## Implementation Section

### 1. Detect Network Anomalies

In [None]:
def detect_network_anomalies(namespace, latency_threshold=500, packet_loss_threshold=5):
    """
    Detect network anomalies in cluster.
    
    Args:
        namespace: Kubernetes namespace
        latency_threshold: Latency threshold in ms
        packet_loss_threshold: Packet loss threshold in %
    
    Returns:
        List of detected network anomalies
    """
    anomalies = []
    
    try:
        # Sample network metrics
        network_metrics = [
            {'service': 'api-service', 'latency_ms': 650, 'packet_loss': 2.5, 'status': 'degraded'},
            {'service': 'database-service', 'latency_ms': 200, 'packet_loss': 0.1, 'status': 'healthy'},
            {'service': 'cache-service', 'latency_ms': 1200, 'packet_loss': 8.5, 'status': 'critical'},
        ]
        
        for metric in network_metrics:
            if metric['latency_ms'] > latency_threshold or metric['packet_loss'] > packet_loss_threshold:
                anomalies.append({
                    'service': metric['service'],
                    'latency_ms': metric['latency_ms'],
                    'packet_loss': metric['packet_loss'],
                    'severity': 'critical' if metric['latency_ms'] > 1000 else 'high',
                    'detected_at': datetime.now().isoformat()
                })
                logger.warning(f"Network anomaly detected: {metric['service']}")
        
        return anomalies
    except Exception as e:
        logger.error(f"Error detecting network anomalies: {e}")
        return []

# Detect anomalies
network_anomalies = detect_network_anomalies(NAMESPACE, LATENCY_THRESHOLD, PACKET_LOSS_THRESHOLD)
logger.info(f"Detected {len(network_anomalies)} network anomalies")
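
The cell above works on hard-coded sample metrics. With Prometheus available, similar records could be built from an instant query against its HTTP API (`/api/v1/query`). The sketch below only builds the request URL and parses a response payload; the base URL, the metric name `service_latency_ms`, and the `service` label are placeholders, not names confirmed by this platform.

```python
from urllib.parse import urlencode

def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"

def parse_instant_vector(payload, value_key):
    """Convert an instant-vector response into one record per series."""
    if payload.get('status') != 'success':
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload.get('data', {})
    if data.get('resultType') != 'vector':
        raise ValueError(f"expected a vector, got {data.get('resultType')}")
    records = []
    for sample in data.get('result', []):
        # Each sample carries [unix_timestamp, "value as string"]
        _timestamp, raw_value = sample['value']
        records.append({
            'service': sample['metric'].get('service', 'unknown'),  # assumed label name
            value_key: float(raw_value),
        })
    return records

# Hand-written payload in the shape Prometheus returns for an instant vector
sample_payload = {
    'status': 'success',
    'data': {
        'resultType': 'vector',
        'result': [
            {'metric': {'service': 'api-service'}, 'value': [1700000000.0, '650']},
            {'metric': {'service': 'cache-service'}, 'value': [1700000000.0, '1200']},
        ],
    },
}

url = build_query_url('http://prometheus.example:9090', 'service_latency_ms')
records = parse_instant_vector(sample_payload, 'latency_ms')
```

Keeping URL construction and parsing separate from the HTTP call makes both easy to test without a running Prometheus instance.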

### 2. Analyze Connectivity Issues

In [None]:
def analyze_connectivity(service_name, namespace):
    """
    Analyze connectivity issues for a service.
    
    Args:
        service_name: Service name
        namespace: Kubernetes namespace
    
    Returns:
        Connectivity analysis
    """
    try:
        # Simulate connectivity checks
        connectivity_checks = {
            'dns_resolution': True,
            'tcp_connection': False,
            'http_health_check': False,
            'service_endpoints': 2,
            'network_policies': 'blocking_traffic'
        }
        
        analysis = {
            'service': service_name,
            'checks': connectivity_checks,
            'root_cause': 'network_policy' if connectivity_checks['network_policies'] == 'blocking_traffic' else 'service_unavailable',
            'analysis_time': datetime.now().isoformat()
        }
        
        logger.info(f"Analyzed connectivity for {service_name}")
        return analysis
    except Exception as e:
        logger.error(f"Error analyzing connectivity: {e}")
        return {'error': str(e)}

# Analyze connectivity
connectivity_analyses = []
for anomaly in network_anomalies:
    analysis = analyze_connectivity(anomaly['service'], NAMESPACE)
    connectivity_analyses.append(analysis)

logger.info(f"Analyzed connectivity for {len(connectivity_analyses)} services")
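
The connectivity checks above are simulated. A minimal sketch of what real DNS and TCP checks might look like with the standard library follows. The helper names and the mapping to root-cause labels are assumptions for illustration; the labels reuse the keys that the healing step in this notebook understands.

```python
import socket

def check_dns(hostname):
    """Return True if the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def check_tcp(host, port, timeout=10):
    """Return True if a TCP connection can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False

def classify_root_cause(dns_ok, tcp_ok):
    """Map check results to a root-cause label, or None if both checks pass."""
    if not dns_ok:
        return 'dns_failure'
    if not tcp_ok:
        return 'service_unavailable'
    return None

# Local demonstration: open a listener, probe it, close it, probe again
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
reachable_while_open = check_tcp('127.0.0.1', port, timeout=2)
listener.close()
reachable_after_close = check_tcp('127.0.0.1', port, timeout=2)
```

The `timeout` argument plays the role of `CONNECTION_TIMEOUT` from the setup cell. These checks cannot tell whether a network policy is blocking traffic, so the `network_policy` root cause would still need a separate inspection.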

### 3. Execute Network Healing

In [None]:
def execute_network_healing(analysis, namespace):
    """
    Execute network healing actions.
    
    Args:
        analysis: Connectivity analysis
        namespace: Kubernetes namespace
    
    Returns:
        Healing result
    """
    try:
        root_cause = analysis.get('root_cause', 'unknown')
        service = analysis['service']
        
        healing_actions = {
            'network_policy': {
                'action': 'update_network_policy',
                # Template only: a real call also needs the policy name and a patch body
                'command': f'oc patch networkpolicy -n {namespace}'
            },
            'service_unavailable': {
                'action': 'restart_service',
                'command': f'oc rollout restart deployment {service} -n {namespace}'
            },
            'dns_failure': {
                'action': 'restart_dns',
                # On OpenShift, cluster DNS runs as the dns-default daemonset in openshift-dns
                'command': 'oc rollout restart daemonset/dns-default -n openshift-dns'
            }
        }
        
        action_config = healing_actions.get(root_cause, healing_actions['service_unavailable'])
        
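        # Simulated: the command is recorded in the result but not run against the cluster
        logger.info(f"Executing {action_config['action']} for {service} (simulated)")
        
        return {
            'success': True,
            'service': service,
            'action': action_config['action'],
            'command': action_config['command'],
            'timestamp': datetime.now().isoformat()
        }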
    except Exception as e:
        logger.error(f"Healing error: {e}")
        return {'success': False, 'error': str(e)}

# Execute healing
healing_results = []
for analysis in connectivity_analyses:
    result = execute_network_healing(analysis, NAMESPACE)
    healing_results.append(result)

logger.info(f"Executed network healing for {len(healing_results)} services")
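
The healing function above selects a command but does not run it. One way to actually run such a command is sketched below, with a dry-run default so nothing touches the cluster unless explicitly requested. `run_healing_command` is a hypothetical helper, not part of the platform's utilities, and it assumes the `oc` CLI is installed and logged in when `dry_run=False`.

```python
import shlex
import subprocess

def run_healing_command(command, dry_run=True, timeout=60):
    """Run a healing command, or only report it when dry_run is True."""
    # Split into an argument list so no shell is involved
    argv = shlex.split(command)
    if dry_run:
        return {'success': True, 'executed': False, 'argv': argv}
    try:
        completed = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout, check=False
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # OSError covers a missing binary; TimeoutExpired covers a hung command
        return {'success': False, 'executed': True, 'argv': argv, 'error': str(exc)}
    return {
        'success': completed.returncode == 0,
        'executed': True,
        'argv': argv,
        'stdout': completed.stdout.strip(),
        'stderr': completed.stderr.strip(),
    }

# Dry run: shows what would be executed without calling oc
preview = run_healing_command(
    'oc rollout restart deployment api-service -n self-healing-platform'
)
```

Passing an argument list to `subprocess.run` instead of a shell string avoids shell injection if a service name ever comes from untrusted input.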

### 4. Verify Network Recovery

In [None]:
def verify_network_recovery(service_name, namespace):
    """
    Verify network recovery after healing.
    
    Args:
        service_name: Service name
        namespace: Kubernetes namespace
    
    Returns:
        Recovery verification result
    """
    try:
        # Simulate recovery checks
        recovery_metrics = {
            'latency_ms': np.random.randint(100, 300),
            'packet_loss': np.random.uniform(0, 1),
            'connection_success_rate': np.random.uniform(0.95, 1.0),
            'service_healthy': True
        }
        
        return {
            'service': service_name,
            'recovered': recovery_metrics['service_healthy'],
            'metrics': recovery_metrics,
            'verification_time': datetime.now().isoformat()
        }
    except Exception as e:
        logger.error(f"Error verifying recovery: {e}")
        return {'error': str(e)}

# Verify recovery
recovery_verifications = []
for anomaly in network_anomalies:
    verification = verify_network_recovery(anomaly['service'], NAMESPACE)
    recovery_verifications.append(verification)

logger.info(f"Verified recovery for {len(recovery_verifications)} services")
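
The verification above takes a single simulated reading. In practice, recovery after a restart is rarely immediate, so a polling loop is a common pattern. The sketch below assumes a caller-supplied `probe` callable that returns a dict with `latency_ms` and `packet_loss`; both the function name and the probe interface are hypothetical.

```python
import time

def wait_for_recovery(probe, latency_threshold=500, packet_loss_threshold=5,
                      attempts=5, delay_seconds=1.0):
    """Poll probe() until metrics are within thresholds or attempts run out."""
    last = None
    for attempt in range(1, attempts + 1):
        last = probe()
        within_limits = (
            last['latency_ms'] <= latency_threshold
            and last['packet_loss'] <= packet_loss_threshold
        )
        if within_limits:
            return {'recovered': True, 'attempts': attempt, 'metrics': last}
        if attempt < attempts:
            time.sleep(delay_seconds)  # give the service time to settle
    return {'recovered': False, 'attempts': attempts, 'metrics': last}

# Fake probe that improves on each call, standing in for real measurements
readings = iter([
    {'latency_ms': 900, 'packet_loss': 6.0},
    {'latency_ms': 600, 'packet_loss': 2.0},
    {'latency_ms': 250, 'packet_loss': 0.4},
])
result = wait_for_recovery(lambda: next(readings), delay_seconds=0)
```

The thresholds default to the same values as `LATENCY_THRESHOLD` and `PACKET_LOSS_THRESHOLD`, so a service counts as recovered only once it would no longer be flagged by the detection step.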

### 5. Track Network Health

In [None]:
# Create network health tracking dataframe
network_tracking = pd.DataFrame([
    {
        'timestamp': datetime.now().isoformat(),
        'service': anomaly['service'],
        'initial_latency_ms': anomaly['latency_ms'],
        'initial_packet_loss': anomaly['packet_loss'],
        # .get() guards against error results, which lack the 'action', 'recovered' and 'metrics' keys
        'healing_action': healing_results[i].get('action', 'none') if i < len(healing_results) else 'none',
        'recovery_success': recovery_verifications[i].get('recovered', False) if i < len(recovery_verifications) else False,
        'final_latency_ms': recovery_verifications[i].get('metrics', {}).get('latency_ms') if i < len(recovery_verifications) else None,
        'final_packet_loss': recovery_verifications[i].get('metrics', {}).get('packet_loss') if i < len(recovery_verifications) else None
    }
    for i, anomaly in enumerate(network_anomalies)
])

# Save tracking data
tracking_file = PROCESSED_DIR / 'network_anomaly_response.parquet'
network_tracking.to_parquet(tracking_file)

logger.info(f"Saved network health tracking data to {tracking_file}")
print(network_tracking.to_string())

## Validation Section

In [None]:
# Verify outputs
assert len(network_anomalies) > 0, "No network anomalies detected"
assert len(connectivity_analyses) > 0, "No connectivity analyses performed"
assert len(healing_results) > 0, "No healing executed"
assert tracking_file.exists(), "Network tracking file not created"

recovery_rate = network_tracking['recovery_success'].sum() / len(network_tracking) if len(network_tracking) > 0 else 0
logger.info("✅ All validations passed")
print("\nNetwork Anomaly Response Summary:")
print(f"  Anomalies Detected: {len(network_anomalies)}")
print(f"  Healing Actions Executed: {len(healing_results)}")
print(f"  Recovery Success Rate: {recovery_rate:.1%}")
if len(network_tracking) > 0:
    print(f"  Avg Latency Improvement: {(network_tracking['initial_latency_ms'].mean() - network_tracking['final_latency_ms'].mean()):.1f}ms")

## Integration Section

This notebook integrates with:
- **Input**: Network metrics and connectivity checks
- **Output**: Network healing actions and recovery tracking
- **Monitoring**: Prometheus for network health metrics
- **Next**: Complete platform demonstration (`complete-platform-demo.ipynb`)

## Next Steps

1. Monitor network health metrics
2. Proceed to `complete-platform-demo.ipynb`
3. Run full end-to-end workflow
4. Demonstrate all healing capabilities
5. Validate complete platform functionality

## References

- ADR-003: Self-Healing Platform Architecture
- ADR-012: Notebook Architecture for End-to-End Workflows
- [Kubernetes Network Policies](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
- [OpenShift Network Troubleshooting](https://docs.openshift.com/container-platform/latest/networking/troubleshooting-network.html)