# Synthetic Anomaly Generation

## Overview
This notebook generates synthetic anomalies for testing and training anomaly detection models. It builds a labeled dataset of metric time series with injected point anomalies, standing in for the failures and performance degradation seen on real OpenShift clusters.

## Prerequisites
- Completed: `feature-store-demo.ipynb`
- Access to `/opt/app-root/src/data` directory
- NumPy, Pandas, and Scikit-learn installed, plus a Parquet engine (`pyarrow` or `fastparquet`) for `DataFrame.to_parquet`

## Learning Objectives
- Generate realistic time series anomalies
- Create labeled training datasets
- Simulate cluster failure scenarios
- Validate synthetic data quality

## Key Concepts
- **Anomaly Types**: Point anomalies, contextual anomalies, collective anomalies (the `inject_point_anomalies` cell in this notebook covers the first type)
- **Synthetic Data**: Programmatically generated data that mimics real patterns
- **Labeling**: Marking anomalies for supervised learning
- **Validation**: Ensuring synthetic data is realistic and useful

## Setup Section

In [None]:
import sys
import os
import numpy as np
import pandas as pd
import pickle
import logging
from datetime import datetime, timedelta
from pathlib import Path

# Setup path for utils module - works from any directory
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
        # dir() inside a function lists only local names, so look for __file__ in globals()
        Path(__file__).parent.parent / 'utils' if '__file__' in globals() else None,
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
        Path('/opt/app-root/src/openshift-aiops-platform/notebooks/utils'),
    ]

    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)

    # Fallback: search upward from cwd
    current = Path.cwd()
    for _ in range(5):
        utils_path = current / 'notebooks' / 'utils'
        if utils_path.exists():
            return str(utils_path)
        current = current.parent

    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")
else:
    print("⚠️ Utils path not found - will use fallback implementations")

# Try to import common functions, with fallback
try:
    from common_functions import setup_environment
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Common functions not available: {e}")
    print("   Using minimal fallback implementations")

    # Minimal fallback implementation
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/data/synthetic', exist_ok=True)
        os.makedirs('/opt/app-root/src/models', exist_ok=True)
        return {
            'data_dir': '/opt/app-root/src/data',
            'models_dir': '/opt/app-root/src/models',
            'working_dir': os.getcwd()
        }

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Setup environment
env_info = setup_environment()
logger.info(f"Environment ready: {env_info}")

# Define paths
DATA_DIR = Path('/opt/app-root/src/data')
PROCESSED_DIR = DATA_DIR / 'processed'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

logger.info(f"Data directory: {DATA_DIR}")
logger.info(f"Processed directory: {PROCESSED_DIR}")

## Implementation Section

### 1. Generate Normal Time Series Data

In [None]:
def generate_normal_timeseries(n_points=1000, n_features=5, seed=42):
    """
    Generate normal (non-anomalous) time series data.
    
    Args:
        n_points: Number of time points
        n_features: Number of features (metrics)
        seed: Random seed for reproducibility
    
    Returns:
        DataFrame with normal time series data
    """
    np.random.seed(seed)
    
    # Generate timestamps
    start_time = datetime.now() - timedelta(days=30)
    timestamps = [start_time + timedelta(minutes=i) for i in range(n_points)]
    
    # Generate normal data with trends and seasonality
    data = {}
    for i in range(n_features):
        # Base trend
        trend = np.linspace(50, 60, n_points)
        # Seasonal component
        seasonal = 10 * np.sin(np.linspace(0, 4*np.pi, n_points))
        # Random noise
        noise = np.random.normal(0, 2, n_points)
        data[f'metric_{i}'] = trend + seasonal + noise
    
    df = pd.DataFrame(data)
    df['timestamp'] = timestamps
    df['label'] = 0  # 0 = normal
    
    return df

# Generate normal data
normal_data = generate_normal_timeseries(n_points=1000, n_features=5)
logger.info(f"Generated normal data: {normal_data.shape}")
print(normal_data.head())
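
Read off the generator above, each metric follows a simple additive model. Written out for feature $i$ at time index $t = 0, \dots, N-1$, where $N$ is `n_points` (the symbols are introduced here only for illustration):

$$
x_{i,t} = \underbrace{50 + 10\,\frac{t}{N-1}}_{\text{trend}} + \underbrace{10 \sin\!\left(\frac{4\pi t}{N-1}\right)}_{\text{seasonal}} + \varepsilon_{i,t}, \qquad \varepsilon_{i,t} \sim \mathcal{N}(0,\, 2^2)
$$

Only the noise term $\varepsilon_{i,t}$ differs between features, so the metrics are strongly correlated, and the seasonal term completes two full cycles over the series regardless of `n_points`.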

### 2. Generate Anomalies

In [None]:
def inject_point_anomalies(df, n_anomalies=50, magnitude=3.0):
    """
    Inject point anomalies (sudden spikes/drops).
    
    Args:
        df: DataFrame with normal data
        n_anomalies: Number of anomalies to inject
        magnitude: How many standard deviations from normal
    
    Returns:
        DataFrame with injected anomalies
    """
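    df_anomaly = df.copy()

    # Derive the metric columns from the data instead of assuming exactly five
    metric_cols = [c for c in df.columns if c.startswith('metric_')]
    # Measure spread on the clean input so earlier injections do not inflate it
    stds = df[metric_cols].std()

    # Randomly select row labels for anomalies (no repeats)
    anomaly_indices = np.random.choice(df.index, n_anomalies, replace=False)

    for idx in anomaly_indices:
        # Perturb up to two randomly chosen metrics in this row
        features = np.random.choice(metric_cols, min(2, len(metric_cols)), replace=False)
        for col in features:
            # Random sign gives both spikes and drops
            df_anomaly.loc[idx, col] += magnitude * stds[col] * np.random.choice([-1, 1])

        df_anomaly.loc[idx, 'label'] = 1  # 1 = anomaly

    return df_anomaly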

# Inject anomalies
data_with_anomalies = inject_point_anomalies(normal_data, n_anomalies=50)
logger.info("Injected point anomalies")
print(f"Anomalies: {(data_with_anomalies['label'] == 1).sum()}")
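
The cell above covers point anomalies. Of the other types listed under Key Concepts, a collective anomaly can be sketched as a sustained shift over a contiguous window, which resembles a degradation that lasts for a while rather than a single spike. The helper below is a hypothetical illustration, not part of the notebook: its name, parameters, and defaults are assumptions, and the small demo frame stands in for `normal_data`.

```python
import numpy as np
import pandas as pd


def inject_collective_anomaly(df, start, length, shift_in_std=2.0, column="metric_0"):
    """Shift a contiguous window of one metric and label every row in it.

    Hypothetical helper: name, parameters, and defaults are illustrative.
    """
    out = df.copy()
    # Positional window, clipped to the end of the frame
    stop = min(start + length, len(out))
    rows = out.index[start:stop]
    # Scale the shift by the spread of the clean column
    offset = shift_in_std * df[column].std()
    out.loc[rows, column] += offset
    out.loc[rows, "label"] = 1
    return out


# Tiny stand-in for the notebook's normal_data frame
rng = np.random.default_rng(0)
demo = pd.DataFrame({"metric_0": rng.normal(55.0, 2.0, 200), "label": 0})
shifted = inject_collective_anomaly(demo, start=50, length=20)
print(int(shifted["label"].sum()))  # 20 labeled rows, one per step in the window
```

Because every row in the window is labeled, a detector can be scored on whether it flags the whole episode and not just its first point.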

### 3. Save Synthetic Dataset

In [None]:
# Save to Parquet
output_file = PROCESSED_DIR / 'synthetic_anomalies.parquet'
data_with_anomalies.to_parquet(output_file)

# Verify file was created
if output_file.exists():
    file_size_mb = output_file.stat().st_size / (1024 * 1024)
    print("\n✅ SYNTHETIC DATA SAVED")
    print(f"   File: {output_file}")
    print(f"   Size: {file_size_mb:.2f} MB")
    print(f"   Rows: {len(data_with_anomalies)}")
    print(f"   Columns: {len(data_with_anomalies.columns)}")
    logger.info(f"Saved synthetic data to {output_file}")
else:
    print(f"\n❌ ERROR: File not created at {output_file}")
    raise FileNotFoundError(f"Failed to save {output_file}")

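# Save metadata (cast NumPy scalars to built-in types so the values stay plain Python)
n_anomalies = int((data_with_anomalies['label'] == 1).sum())
metadata = {
    'n_samples': len(data_with_anomalies),
    # Count metric columns instead of hardcoding the feature count
    'n_features': sum(c.startswith('metric_') for c in data_with_anomalies.columns),
    'n_anomalies': n_anomalies,
    'anomaly_ratio': n_anomalies / len(data_with_anomalies),
    'created_at': datetime.now().isoformat()
}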

metadata_file = PROCESSED_DIR / 'synthetic_metadata.pkl'
with open(metadata_file, 'wb') as f:
    pickle.dump(metadata, f)

if metadata_file.exists():
    print("\n✅ METADATA SAVED")
    print(f"   File: {metadata_file}")
    print(f"   Anomalies: {metadata['n_anomalies']}")
    print(f"   Anomaly Ratio: {metadata['anomaly_ratio']:.2%}")
    logger.info(f"Metadata: {metadata}")
else:
    print(f"\n❌ ERROR: Metadata file not created at {metadata_file}")
    raise FileNotFoundError(f"Failed to save {metadata_file}")
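
Pickle files can only be read from Python, and unpickling a file from an untrusted source can execute arbitrary code. If the metadata needs to be read by other tools, a JSON sidecar is one alternative. The sketch below is an assumption about how that could look, not something this notebook does: `save_metadata_json` is a hypothetical helper, and the example values are placeholders. It converts NumPy scalars first, because `json` cannot serialize NumPy integer scalars such as the `int64` returned by `.sum()`.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def save_metadata_json(metadata, path):
    """Write metadata as JSON, converting NumPy scalars to built-in types.

    Hypothetical helper, not part of the notebook.
    """
    def to_builtin(value):
        # np.generic covers NumPy scalar types such as int64 and float64
        return value.item() if isinstance(value, np.generic) else value

    clean = {key: to_builtin(value) for key, value in metadata.items()}
    Path(path).write_text(json.dumps(clean, indent=2))
    return clean


# Example values only; the notebook computes these from the real frame
example = {"n_samples": 1000, "n_anomalies": np.int64(50), "anomaly_ratio": np.float64(0.05)}
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "synthetic_metadata.json"
    save_metadata_json(example, target)
    restored = json.loads(target.read_text())
print(restored)
```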

## Validation Section

In [None]:
# Verify output
assert output_file.exists(), "Output file not created"
assert len(data_with_anomalies) > 0, "No data generated"
assert (data_with_anomalies['label'] == 1).sum() > 0, "No anomalies generated"

logger.info("✅ All validations passed")
print("\nDataset Summary:")
print(f"  Total samples: {len(data_with_anomalies)}")
print(f"  Normal samples: {(data_with_anomalies['label'] == 0).sum()}")
print(f"  Anomalous samples: {(data_with_anomalies['label'] == 1).sum()}")
print(f"  Anomaly ratio: {metadata['anomaly_ratio']:.2%}")
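
The assertions above confirm that the file exists and that some rows are labeled, but not that the labeled rows actually look anomalous. One possible extra check, sketched below under assumptions, compares each row's deviation from a rolling median across the two label groups. `anomaly_separation` and its `window` default are hypothetical, and the demo frame is stand-in data, not the notebook's output.

```python
import numpy as np
import pandas as pd


def anomaly_separation(df, metric_cols, window=31):
    """Median deviation from a centered rolling median, grouped by label.

    Hypothetical helper: a clearly larger value for label 1 than for
    label 0 suggests the injected points stand out from their context.
    """
    baseline = df[metric_cols].rolling(window, center=True, min_periods=1).median()
    # Take the largest deviation across metrics per row, because an
    # anomalous row may perturb only some of the columns
    deviation = (df[metric_cols] - baseline).abs().max(axis=1)
    return deviation.groupby(df["label"]).median()


# Stand-in data: two noisy metrics, with ten large spikes in one of them
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "metric_0": rng.normal(55.0, 2.0, 500),
    "metric_1": rng.normal(55.0, 2.0, 500),
    "label": 0,
})
spikes = rng.choice(500, size=10, replace=False)
demo.loc[spikes, "metric_0"] += 25.0
demo.loc[spikes, "label"] = 1

separation = anomaly_separation(demo, ["metric_0", "metric_1"])
print(separation)
```

A rolling median is used as the baseline because it follows the trend and seasonality while being largely unaffected by isolated spikes.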

## Integration Section

This synthetic dataset is used by:
- `02-anomaly-detection/isolation-forest-implementation.ipynb` - Training anomaly detection models
- `02-anomaly-detection/time-series-anomaly-detection.ipynb` - Validating time series methods
- `02-anomaly-detection/lstm-based-prediction.ipynb` - Training LSTM models
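
The downstream notebooks are not shown here, so the following is only a sketch of how a consumer might prepare this dataset. It assumes the `timestamp`, `label`, and `metric_*` columns written above; `chronological_split` is a hypothetical helper. The split keeps time order, since a random split on time series lets the model train on points that come after the ones it is tested on.

```python
import numpy as np
import pandas as pd


def chronological_split(df, train_fraction=0.8):
    """Split features and labels by time order rather than at random.

    Hypothetical helper; assumes the 'timestamp', 'label', and 'metric_*'
    columns written by this notebook.
    """
    ordered = df.sort_values("timestamp").reset_index(drop=True)
    metric_cols = [c for c in ordered.columns if c.startswith("metric_")]
    cut = int(len(ordered) * train_fraction)
    train, test = ordered.iloc[:cut], ordered.iloc[cut:]
    return (train[metric_cols], train["label"]), (test[metric_cols], test["label"])


# Stand-in frame with the same column layout as the saved dataset
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "metric_0": rng.normal(55.0, 2.0, 100),
    "metric_1": rng.normal(55.0, 2.0, 100),
    "timestamp": pd.date_range("2024-01-01", periods=100, freq="min"),
    "label": 0,
})
(X_train, y_train), (X_test, y_test) = chronological_split(demo)
print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```

With the real file, the frame would come from `pd.read_parquet(PROCESSED_DIR / 'synthetic_anomalies.parquet')` instead of the stand-in.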

## Next Steps

1. Review the generated synthetic data
2. Proceed to `02-anomaly-detection/time-series-anomaly-detection.ipynb`
3. Train models on this synthetic dataset
4. Validate model performance

## References

- ADR-012: Notebook Architecture for End-to-End Workflows
- ADR-013: Data Collection and Preprocessing Workflows
- [Anomaly Detection Techniques](https://en.wikipedia.org/wiki/Anomaly_detection)