# Resource Exhaustion Detection

## Overview
This notebook detects resource exhaustion (CPU, memory, and disk) before it causes service degradation. It fits a linear trend to recent usage, predicts when each resource will cross its threshold, and proposes proactive scaling or optimization actions.

## Prerequisites
- Completed: `pod-crash-loop-healing.ipynb`
- Prometheus metrics available
- Inference pipeline deployed
- Coordination engine accessible

## Learning Objectives
- Monitor resource usage trends
- Predict resource exhaustion
- Trigger proactive scaling
- Optimize resource allocation
- Track scaling effectiveness

## Key Concepts
- **Resource Metrics**: CPU, memory, disk usage
- **Trend Analysis**: Detect increasing resource usage
- **Prediction**: Forecast when resources will be exhausted
- **Proactive Scaling**: Scale before exhaustion occurs
- **Optimization**: Adjust resource requests/limits

## Setup Section

In [None]:
import sys
import os
import json
import logging
from pathlib import Path
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
from scipy import stats

# Setup path for utils module - works from any directory
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
        # dir() inside a function lists locals only, so check module globals instead
        Path(__file__).parent.parent / 'utils' if '__file__' in globals() else None,
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
        Path('/opt/app-root/src/openshift-aiops-platform/notebooks/utils'),
    ]
    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)
    current = Path.cwd()
    for _ in range(5):
        utils_path = current / 'notebooks' / 'utils'
        if utils_path.exists():
            return str(utils_path)
        current = current.parent
    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")
else:
    print("⚠️ Utils path not found - will use fallback implementations")

# Try to import common functions, with fallback
try:
    from common_functions import setup_environment
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Common functions not available: {e}")
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/models', exist_ok=True)
        return {'data_dir': '/opt/app-root/src/data', 'models_dir': '/opt/app-root/src/models'}

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Setup environment
env_info = setup_environment()
logger.info(f"Environment ready: {env_info}")

# Define paths
DATA_DIR = Path('/opt/app-root/src/data')
PROCESSED_DIR = DATA_DIR / 'processed'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# Configuration
NAMESPACE = 'self-healing-platform'
CPU_THRESHOLD = 80  # CPU usage threshold (%)
MEMORY_THRESHOLD = 85  # Memory usage threshold (%)
PREDICTION_WINDOW = 24  # Hours to predict ahead

logger.info("Resource exhaustion detection initialized")

## Implementation Section

### 1. Collect Resource Metrics
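
The cell below generates simulated data rather than querying Prometheus. As a minimal sketch of what a live collection step might look like, the following builds a request URL for the Prometheus HTTP API `query_range` endpoint and parses a response in its matrix format. The base URL, the PromQL expression, and the helper names are assumptions for illustration, not part of this platform's code.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Hypothetical address; replace with the route to your Prometheus instance.
PROMETHEUS_URL = "http://prometheus.example:9090"


def build_range_query_url(base_url, promql, end, hours=24, step="1h"):
    """Build a URL for Prometheus' /api/v1/query_range endpoint."""
    start = end - timedelta(hours=hours)
    params = {
        "query": promql,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": step,
    }
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


def parse_matrix_response(payload):
    """Convert a query_range JSON payload into (datetime, float) samples.

    Only the first returned series is read, which fits queries that
    aggregate to a single series (e.g. sum(...)).
    """
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error')}")
    result = payload["data"]["result"]
    if not result:
        return []
    return [
        (datetime.fromtimestamp(float(ts), tz=timezone.utc), float(value))
        for ts, value in result[0]["values"]
    ]


# Example PromQL (an assumption): total CPU cores used in the namespace.
promql = 'sum(rate(container_cpu_usage_seconds_total{namespace="self-healing-platform"}[5m]))'
url = build_range_query_url(PROMETHEUS_URL, promql, datetime.now(timezone.utc))

# Hand-written payload in the response shape, used here instead of a live call.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [{"metric": {}, "values": [[1700000000, "0.42"], [1700003600, "0.47"]]}],
    },
}
samples = parse_matrix_response(sample_payload)
print(url)
print(samples)
```

Note that this example query returns CPU cores, not a percentage; it would need to be divided by the relevant requests or limits before being compared against a percentage threshold.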

In [None]:
def collect_resource_metrics(namespace, hours=24):
    """
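    Collect resource metrics for a namespace.

    This version generates simulated hourly samples with an upward trend;
    replace the body with a Prometheus query to collect live data.

    Args:
        namespace: Kubernetes namespace (unused while the data is simulated)
        hours: Historical data window, one sample per hour

    Returns:
        Resource metrics dataframe
    """
    try:
        # Generate simulated metrics, one sample per hour, ending now
        timestamps = pd.date_range(end=datetime.now(), periods=hours, freq=pd.Timedelta(hours=1))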
        
        # Simulate increasing resource usage
        cpu_usage = np.linspace(30, 85, hours) + np.random.normal(0, 5, hours)
        memory_usage = np.linspace(40, 90, hours) + np.random.normal(0, 5, hours)
        disk_usage = np.linspace(50, 75, hours) + np.random.normal(0, 3, hours)
        
        metrics = pd.DataFrame({
            'timestamp': timestamps,
            'cpu_usage': np.clip(cpu_usage, 0, 100),
            'memory_usage': np.clip(memory_usage, 0, 100),
            'disk_usage': np.clip(disk_usage, 0, 100),
            'pod_count': np.random.randint(5, 15, hours),
            'request_rate': np.random.randint(100, 500, hours)
        })
        
        logger.info(f"Collected {len(metrics)} resource metrics")
        return metrics
    except Exception as e:
        logger.error(f"Error collecting metrics: {e}")
        return pd.DataFrame()

# Collect metrics
resource_metrics = collect_resource_metrics(NAMESPACE, hours=24)
logger.info(f"Resource metrics shape: {resource_metrics.shape}")
print(resource_metrics.tail())

### 2. Analyze Resource Trends
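
The trend analysis fits a straight line to the hourly samples and extrapolates it to a threshold. The arithmetic can be sketched without SciPy as follows. The function names and sample values are illustrative assumptions, and returning `0.0` for a sample that is already above the threshold is a choice made for this sketch; the notebook's cell uses `scipy.stats.linregress` for the fit.

```python
def least_squares_slope(values):
    """Slope of the ordinary least-squares line through (0, v0), (1, v1), ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


def hours_until(threshold, values):
    """Hours until the latest sample reaches `threshold` at the fitted slope.

    Returns 0.0 if the latest sample is already at or above the threshold,
    and None if usage is flat or falling.
    """
    current = values[-1]
    if current >= threshold:
        return 0.0
    slope = least_squares_slope(values)
    if slope <= 0:
        return None
    return (threshold - current) / slope


cpu_samples = [50.0, 52.0, 54.0, 56.0]  # made-up hourly CPU usage in percent
print(least_squares_slope(cpu_samples))
print(hours_until(60.0, cpu_samples))
```

Because the samples are one hour apart, the slope is in percentage points per hour, so dividing the remaining headroom by it gives hours.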

In [None]:
def analyze_resource_trends(metrics):
    """
    Analyze resource usage trends.
    
    Args:
        metrics: Resource metrics dataframe
    
    Returns:
        Trend analysis result
    """
    try:
        trends = {}
        
        for resource in ['cpu_usage', 'memory_usage', 'disk_usage']:
            # Calculate trend
            x = np.arange(len(metrics))
            y = metrics[resource].values
            
            # Linear regression
            slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
            
            # Current value
            current = y[-1]
            
            trends[resource] = {
                'current': float(current),
                'slope': float(slope),
                'trend': 'increasing' if slope > 0.5 else 'stable' if slope > -0.5 else 'decreasing',
                'r_squared': float(r_value ** 2),
                'hours_to_threshold': None
            }
            
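            # Calculate hours to threshold using the configured limits
            # (disk has no configured constant, so 85% is kept as its limit)
            thresholds = {'cpu_usage': CPU_THRESHOLD, 'memory_usage': MEMORY_THRESHOLD}
            threshold = thresholds.get(resource, 85)
            if current >= threshold:
                # Already at or above the threshold: report zero hours remaining
                trends[resource]['hours_to_threshold'] = 0.0
            elif slope > 0:
                trends[resource]['hours_to_threshold'] = float((threshold - current) / slope)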
        
        logger.info("Analyzed resource trends")
        return trends
    except Exception as e:
        logger.error(f"Error analyzing trends: {e}")
        return {}

# Analyze trends
resource_trends = analyze_resource_trends(resource_metrics)
print(json.dumps(resource_trends, indent=2))

### 3. Predict Resource Exhaustion
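
Each trend entry carries an `r_squared` value, and a linear extrapolation is only as trustworthy as that fit. One possible refinement, sketched here under assumptions, is to skip extrapolations from poorly fitting lines before classifying urgency. The 0.5 cut-off and the helper names are hypothetical; the urgency boundaries (6 and 12 hours) mirror the cell below.

```python
def classify_urgency(hours_to_exhaustion):
    """Map hours-to-exhaustion onto the urgency levels used in this notebook."""
    if hours_to_exhaustion < 6:
        return "critical"
    if hours_to_exhaustion < 12:
        return "high"
    return "medium"


def filter_reliable(trends, prediction_window=24, min_r_squared=0.5):
    """Keep resources whose fit is good enough and whose threshold is near."""
    selected = {}
    for resource, trend in trends.items():
        hours = trend.get("hours_to_threshold")
        if hours is None or hours > prediction_window:
            continue  # no exhaustion expected inside the window
        if trend.get("r_squared", 0.0) < min_r_squared:
            continue  # trend line fits too poorly to extrapolate
        selected[resource] = classify_urgency(hours)
    return selected


# Made-up trend entries in the shape produced by the trend analysis step.
example_trends = {
    "cpu_usage": {"hours_to_threshold": 4.0, "r_squared": 0.91},
    "memory_usage": {"hours_to_threshold": 9.0, "r_squared": 0.12},
    "disk_usage": {"hours_to_threshold": 40.0, "r_squared": 0.88},
}
print(filter_reliable(example_trends))
```

In this example only CPU survives: memory is dropped for its weak fit and disk because its threshold lies outside the 24-hour window.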

In [None]:
def predict_exhaustion(trends, prediction_window=24):
    """
    Predict resource exhaustion within prediction window.
    
    Args:
        trends: Resource trend analysis
        prediction_window: Hours to predict ahead
    
    Returns:
        Exhaustion predictions
    """
    predictions = []
    
    for resource, trend_data in trends.items():
        hours_to_threshold = trend_data.get('hours_to_threshold')
        
        if hours_to_threshold is not None and hours_to_threshold <= prediction_window:
            predictions.append({
                'resource': resource,
                'current_usage': trend_data['current'],
                'trend': trend_data['trend'],
                'hours_to_exhaustion': hours_to_threshold,
                'action_required': True,
                'urgency': 'critical' if hours_to_threshold < 6 else 'high' if hours_to_threshold < 12 else 'medium'
            })
    
    logger.info(f"Predicted {len(predictions)} resource exhaustion events")
    return predictions

# Predict exhaustion
exhaustion_predictions = predict_exhaustion(resource_trends, PREDICTION_WINDOW)
print(json.dumps(exhaustion_predictions, indent=2, default=str))

### 4. Trigger Scaling Actions
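
The cell below only builds action dictionaries; nothing is applied to the cluster. As a hedged sketch of how such a dictionary might be turned into a command, the following maps two of the action types onto `kubectl scale` and `kubectl set resources` argument lists without running them. The deployment name `example-app` and the helper name are hypothetical.

```python
import shlex


def action_to_command(action, namespace, deployment):
    """Translate an action dict into a kubectl argument list (not executed)."""
    target = f"deployment/{deployment}"
    if action["type"] == "scale_replicas":
        return ["kubectl", "scale", target,
                f"--replicas={action['replicas']}", "-n", namespace]
    if action["type"] == "increase_memory_limit":
        return ["kubectl", "set", "resources", target,
                f"--limits=memory={action['memory_limit']}", "-n", namespace]
    # Disk cleanup has no single kubectl equivalent, so it is left unmapped.
    raise ValueError(f"no command mapping for action type {action['type']!r}")


actions = [
    {"type": "scale_replicas", "replicas": 5},
    {"type": "increase_memory_limit", "memory_limit": "2Gi"},
]
for action in actions:
    argv = action_to_command(action, "self-healing-platform", "example-app")
    print(shlex.join(argv))
```

Returning an argument list rather than a shell string keeps the command easy to review or log before anything is run.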

In [None]:
def trigger_scaling(predictions, namespace):
    """
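    Build scaling actions from exhaustion predictions.

    The actions are logged and returned as dictionaries; nothing is
    applied to the cluster in this notebook.

    Args:
        predictions: Exhaustion predictions
        namespace: Kubernetes namespace the actions are intended for

    Returns:
        List of planned scaling actions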
    """
    scaling_actions = []
    
    for prediction in predictions:
        resource = prediction['resource']
        urgency = prediction['urgency']
        
        if resource == 'cpu_usage':
            action = {
                'type': 'scale_replicas',
                'target': 'deployment',
                'replicas': 5 if urgency == 'critical' else 3,
                'reason': f"CPU exhaustion predicted in {prediction['hours_to_exhaustion']:.1f} hours"
            }
        elif resource == 'memory_usage':
            action = {
                'type': 'increase_memory_limit',
                'target': 'pod',
                'memory_limit': '2Gi' if urgency == 'critical' else '1.5Gi',
                'reason': f"Memory exhaustion predicted in {prediction['hours_to_exhaustion']:.1f} hours"
            }
        else:
            action = {
                'type': 'cleanup_disk',
                'target': 'node',
                'reason': f"Disk exhaustion predicted in {prediction['hours_to_exhaustion']:.1f} hours"
            }
        
        logger.info(f"Planned scaling action in {namespace}: {action['type']}")
        scaling_actions.append(action)
    
    return scaling_actions

# Trigger scaling
scaling_actions = trigger_scaling(exhaustion_predictions, NAMESPACE)
print(json.dumps(scaling_actions, indent=2))

### 5. Track Scaling Effectiveness
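
The tracking cell below records `scaling_success` as `True` and simulates the post-scaling usage with a random reduction. With real measurements, success could instead be derived from the observed usage, as in this minimal sketch; the function name, the field names in its result, and the sample numbers are assumptions.

```python
def evaluate_scaling(usage_before, usage_after, threshold):
    """Summarise one scaling action from usage measured before and after it."""
    reduction = usage_before - usage_after
    return {
        "reduction_points": reduction,  # drop in percentage points
        "below_threshold": usage_after < threshold,
        "scaling_success": reduction > 0 and usage_after < threshold,
    }


# Made-up measurements: one action that helped, one that did not help enough.
print(evaluate_scaling(usage_before=88.0, usage_after=71.5, threshold=80.0))
print(evaluate_scaling(usage_before=88.0, usage_after=86.0, threshold=80.0))
```

Requiring both a reduction and a value below the threshold avoids counting an action as successful when usage merely dipped but is still in the danger zone.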

In [None]:
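# Create scaling tracking dataframe.
# Note: scaling_success and resource_after_scaling are simulated placeholders
# until real post-scaling measurements are available.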
scaling_tracking = pd.DataFrame([
    {
        'timestamp': datetime.now().isoformat(),
        'resource': pred['resource'],
        'action_type': scaling_actions[i]['type'] if i < len(scaling_actions) else 'none',
        'urgency': pred['urgency'],
        'hours_to_exhaustion': pred['hours_to_exhaustion'],
        'scaling_success': True,
        'resource_after_scaling': pred['current_usage'] - np.random.uniform(5, 15)
    }
    for i, pred in enumerate(exhaustion_predictions)
])

# Save tracking data
tracking_file = PROCESSED_DIR / 'resource_exhaustion_scaling.parquet'
scaling_tracking.to_parquet(tracking_file)

logger.info("Saved scaling tracking data")
print(scaling_tracking.to_string())

## Validation Section

In [None]:
# Verify outputs
assert len(resource_metrics) > 0, "No resource metrics collected"
assert len(resource_trends) > 0, "No trends analyzed"
assert tracking_file.exists(), "Scaling tracking file not created"

logger.info("✅ All validations passed")
print("\nResource Exhaustion Detection Summary:")
print(f"  Metrics Collected: {len(resource_metrics)}")
print(f"  Exhaustion Predictions: {len(exhaustion_predictions)}")
print(f"  Scaling Actions Triggered: {len(scaling_actions)}")
if len(scaling_tracking) > 0:
    print(f"  Scaling Success Rate: {scaling_tracking['scaling_success'].sum() / len(scaling_tracking):.1%}")

## Integration Section

This notebook integrates with:
- **Input**: Prometheus metrics and trend analysis
- **Output**: Scaling actions and effectiveness tracking
- **Monitoring**: Real-time resource usage monitoring
- **Next**: Network anomaly response

## Next Steps

1. Monitor scaling effectiveness
2. Proceed to `network-anomaly-response.ipynb`
3. Implement network healing workflows
4. Test complete end-to-end scenarios
5. Validate resource optimization

## References

- ADR-003: Self-Healing Platform Architecture
- ADR-012: Notebook Architecture for End-to-End Workflows
- [Kubernetes Horizontal Pod Autoscaler](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
- [Resource Requests and Limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/)