# Tutorial 2: Dataset Generation

**Comprehensive guide to generating synthetic data for benchmarking**

This notebook demonstrates all dataset generators in fastcpd:
1. Mean/Variance changes
2. Regression changes (Linear & LASSO)
3. GLM changes (Binomial & Poisson) - UNIQUE!
4. Time series (ARMA & GARCH) - UNIQUE!
5. Multi-annotator simulation - UNIQUE!

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from fastcpd.datasets import (
    make_mean_change,
    make_variance_change,
    make_regression_change,
    make_glm_change,
    make_arma_change,
    make_garch_change,
    add_annotation_noise
)
from fastcpd.visualization import plot_dataset_characteristics

%matplotlib inline

## 1. Mean Change Detection

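`make_mean_change` generates a series with shifts in the mean. It returns a dictionary holding the data, the true change points, the true segment means, and metadata such as the SNR (in dB), a difficulty score, and the segment lengths.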

In [None]:
# Generate mean change data
data_dict = make_mean_change(
    n_samples=500,
    n_changepoints=3,
    noise_std=1.0,
    change_type='jump',  # or 'drift' for gradual changes
    seed=42
)

print("Mean Change Dataset:")
print(f"  Data shape: {data_dict['data'].shape}")
print(f"  Change points: {data_dict['changepoints']}")
print(f"  True means: {data_dict['true_means']}")
print("\nMetadata:")
print(f"  SNR: {data_dict['metadata']['snr_db']:.2f} dB")
print(f"  Difficulty: {data_dict['metadata']['difficulty']:.3f}")
print(f"  Segment lengths: {data_dict['metadata']['segment_lengths']}")

# Visualize
fig, axes = plot_dataset_characteristics(data_dict, figsize=(12, 8))
plt.show()
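
The `snr_db` value printed above is reported in decibels. The formula fastcpd uses is not shown in this notebook, so the sketch below assumes one common convention: the variance of the noise-free piecewise-constant mean divided by the noise variance, on a 10·log10 scale. The helper name `snr_db` and the segment means are illustrative and not part of the fastcpd API.

```python
import numpy as np

def snr_db(noise_free_signal, noise_std):
    """SNR in dB, assuming SNR = var(noise-free signal) / noise variance."""
    signal_power = np.var(noise_free_signal)
    return 10.0 * np.log10(signal_power / noise_std**2)

# Hypothetical piecewise-constant mean: four segments of 125 samples each
segment_means = [0.0, 3.0, -1.0, 2.0]
mean_signal = np.repeat(segment_means, 125)

# Add Gaussian noise on top of the mean signal
rng = np.random.default_rng(42)
noise_std = 1.0
data = mean_signal + rng.normal(0.0, noise_std, size=mean_signal.size)

print(f"SNR: {snr_db(mean_signal, noise_std):.2f} dB")  # about 3.98 dB
```

Under this definition, doubling `noise_std` lowers the SNR by about 6 dB, which is one way to reason about how `noise_std` affects detection difficulty.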

## 2. Variance Change Detection

Generate data with variance (volatility) changes. The metadata includes the kurtosis of each segment.

In [None]:
data_dict = make_variance_change(
    n_samples=500,
    n_changepoints=3,
    variance_ratios=[1.0, 4.0, 0.5, 2.0],  # Custom variance multipliers
    change_type='multiplicative',  # or 'additive'
    seed=42
)

print("Variance Change Dataset:")
print(f"  Change points: {data_dict['changepoints']}")
print(f"  Variance ratios: {data_dict['metadata']['variance_ratios']}")
print(f"  True variances: {[f'{v:.2f}' for v in data_dict['true_variances']]}")
print(f"  Kurtosis per segment: {[f'{k:.2f}' for k in data_dict['metadata']['kurtosis_per_segment']]}")
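
To make the per-segment statistics concrete, here is a minimal NumPy-only sketch that builds a multiplicative variance-change series by hand and computes the sample variance and excess kurtosis of each segment. The change point locations and the helper `excess_kurtosis` are assumptions for illustration; fastcpd may compute its kurtosis metadata differently (for example, with a bias correction).

```python
import numpy as np

def excess_kurtosis(x):
    """Excess kurtosis m4 / m2^2 - 3 (approximately 0 for Gaussian data)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered**2)
    m4 = np.mean(centered**4)
    return m4 / m2**2 - 3.0

rng = np.random.default_rng(42)
base_std = 1.0
variance_ratios = [1.0, 4.0, 0.5, 2.0]
changepoints = [125, 250, 375]  # hypothetical, evenly spaced

# Multiplicative change: each segment's variance is base variance * ratio
data = np.concatenate([
    rng.normal(0.0, base_std * np.sqrt(ratio), 125) for ratio in variance_ratios
])

observed_segments = np.split(data, changepoints)
for i, seg in enumerate(observed_segments, 1):
    print(f"Segment {i}: variance={seg.var(ddof=1):.2f}, "
          f"excess kurtosis={excess_kurtosis(seg):.2f}")
```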

## 3. Regression Changes

Generate linear regression data with coefficient changes. The metadata includes R² per segment, the condition number of the design matrix, and an effect size.

In [None]:
data_dict = make_regression_change(
    n_samples=500,
    n_changepoints=3,
    n_features=5,
    coef_changes='random',  # or 'sign_flip', 'magnitude', or custom
    correlation=0.3,  # Covariate correlation
    noise_std=0.5,
    seed=42
)

print("Regression Change Dataset:")
print(f"  Data shape: {data_dict['data'].shape} (y + X)")
print(f"  X shape: {data_dict['X'].shape}")
print(f"  Change points: {data_dict['changepoints']}")
print("\nMetadata:")
print(f"  R² per segment: {[f'{r:.3f}' for r in data_dict['metadata']['r_squared_per_segment']]}")
print(f"  Condition number: {data_dict['metadata']['condition_number']:.2f}")
print(f"  Effect size: {data_dict['metadata']['effect_size']:.2f}")
print(f"\nTrue coefficients (first segment): {data_dict['true_coefs'][0]}")
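
The R² and condition number reported above can be reproduced for a single segment with plain NumPy. The sketch below is an assumption about how such diagnostics are typically computed (ordinary least squares and the 2-norm condition number of `X`), not a description of fastcpd's internals; the coefficients are made up.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_features = 200, 5

# One hypothetical segment: y = X @ coef + noise
X = rng.normal(size=(n_obs, n_features))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_coef + rng.normal(0.0, 0.5, size=n_obs)

# Ordinary least squares fit (no intercept, matching the generating model)
coef_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# R^2 = 1 - SS_res / SS_tot
residuals = y - X @ coef_hat
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y - y.mean())**2)
r_squared = 1.0 - ss_res / ss_tot

# Ratio of largest to smallest singular value of the design matrix
condition_number = np.linalg.cond(X)

print(f"R²: {r_squared:.3f}")
print(f"Condition number: {condition_number:.2f}")
```

A large condition number signals nearly collinear covariates, which makes coefficient changes harder to estimate regardless of the detector.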

## 4. GLM Changes - UNIQUE Feature!

### 4.1 Binomial (Logistic Regression)

In [None]:
data_dict = make_glm_change(
    n_samples=500,
    n_changepoints=3,
    n_features=3,
    family='binomial',
    coef_changes='random',
    seed=42
)

print("Binomial GLM Dataset:")
print(f"  y shape: {data_dict['y'].shape}")
print(f"  y values: {np.unique(data_dict['y'])}")
print(f"  Change points: {data_dict['changepoints']}")
print("\nMetadata:")
print(f"  Family: {data_dict['metadata']['family']}")
# Compare against None explicitly so a legitimate score of 0.0 is not shown as 'N/A'
print(f"  Separation (AUC) per segment: {[f'{s:.3f}' if s is not None else 'N/A' for s in data_dict['metadata']['separation_per_segment']]}")
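
The separation score is described as an AUC. As a rough illustration, the sketch below computes a pairwise (Mann-Whitney style) AUC for one simulated logistic segment, returning `None` when a segment contains only one class. The helper `auc_score` and the coefficients are hypothetical; how fastcpd derives `separation_per_segment` is not shown here.

```python
import numpy as np

def auc_score(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        return None  # AUC is undefined when only one class is present
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
coef = np.array([2.0, -1.0, 0.5])  # hypothetical segment coefficients

# Logistic model: P(y = 1 | x) = sigmoid(x @ coef)
prob = 1.0 / (1.0 + np.exp(-(X @ coef)))
y = rng.binomial(1, prob)

segment_auc = auc_score(y, prob)
print(f"AUC for this segment: {segment_auc:.3f}")
```

An AUC near 0.5 means the covariates barely separate the classes, while values near 1.0 indicate near-perfect separation, where logistic fits can become unstable.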

### 4.2 Poisson (Count Data)

In [None]:
data_dict = make_glm_change(
    n_samples=500,
    n_changepoints=3,
    n_features=3,
    family='poisson',
    seed=42
)

print("Poisson GLM Dataset:")
print(f"  y range: [{data_dict['y'].min()}, {data_dict['y'].max()}]")
print(f"  Change points: {data_dict['changepoints']}")
print("\nMetadata:")
print(f"  Overdispersion per segment: {[f'{od:.2f}' for od in data_dict['metadata']['overdispersion_per_segment']]}")
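
A common way to quantify overdispersion in count data is the variance-to-mean ratio, which is close to 1 for Poisson counts with a constant rate. The sketch below assumes that definition (fastcpd's own definition is not shown in this notebook) and contrasts Poisson counts with negative binomial counts of the same mean. The helper `dispersion_index` is hypothetical.

```python
import numpy as np

def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for Poisson, above 1 when overdispersed."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

rng = np.random.default_rng(42)

# Poisson counts with constant rate 4: mean = variance = 4
poisson_counts = rng.poisson(lam=4.0, size=5000)

# Negative binomial with n=2, p=1/3: mean = 4, variance = 12
negbin_counts = rng.negative_binomial(n=2, p=1.0 / 3.0, size=5000)

print(f"Poisson dispersion:           {dispersion_index(poisson_counts):.2f}")
print(f"Negative binomial dispersion: {dispersion_index(negbin_counts):.2f}")
```

Note that in a Poisson regression the marginal variance-to-mean ratio can exceed 1 even without true overdispersion, because the rate varies with the covariates.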

## 5. Time Series - ARMA - UNIQUE Feature!

Generate ARMA processes with parameter changes. The metadata reports stationarity and invertibility checks.

In [None]:
data_dict = make_arma_change(
    n_samples=500,
    n_changepoints=3,
    orders=[(1,1), (2,0), (0,2), (1,1)],  # ARMA orders per segment
    innovation='normal',  # or 't', 'skew_normal'
    seed=42
)

print("ARMA Dataset:")
print(f"  Data shape: {data_dict['data'].shape}")
print(f"  Change points: {data_dict['changepoints']}")
print(f"  Orders: {data_dict['metadata']['orders']}")
print("\nStationarity checks:")
print(f"  Is stationary: {data_dict['metadata']['is_stationary']}")
print(f"  Is invertible: {data_dict['metadata']['is_invertible']}")
print("\nTrue parameters (segment 1):")
print(f"  AR coefs: {data_dict['true_params'][0]['ar']}")
print(f"  MA coefs: {data_dict['true_params'][0]['ma']}")
print(f"  Sigma: {data_dict['true_params'][0]['sigma']}")
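
Stationarity and invertibility of an ARMA model are standard root conditions: the AR polynomial 1 - φ₁z - ... - φₚzᵖ and the MA polynomial 1 + θ₁z + ... + θ_qz^q must have all roots strictly outside the unit circle. The sketch below checks this with `np.roots`. The sign convention for the coefficients is an assumption, and the helper names are illustrative rather than fastcpd functions.

```python
import numpy as np

def _roots_outside_unit_circle(ascending_coefs):
    """True if every root of the polynomial lies strictly outside |z| = 1."""
    # np.roots expects the highest-degree coefficient first
    roots = np.roots(np.asarray(ascending_coefs, dtype=float)[::-1])
    return bool(np.all(np.abs(roots) > 1.0))

def is_stationary(ar_coefs):
    """AR polynomial assumed to be 1 - phi_1 z - ... - phi_p z^p."""
    ar = np.asarray(ar_coefs, dtype=float)
    return _roots_outside_unit_circle(np.concatenate(([1.0], -ar)))

def is_invertible(ma_coefs):
    """MA polynomial assumed to be 1 + theta_1 z + ... + theta_q z^q."""
    ma = np.asarray(ma_coefs, dtype=float)
    return _roots_outside_unit_circle(np.concatenate(([1.0], ma)))

print(is_stationary([0.5]))       # AR(1) with |phi| < 1
print(is_stationary([1.2]))       # explosive AR(1)
print(is_stationary([0.5, 0.3]))  # AR(2) inside the stationarity region
print(is_invertible([0.4]))       # MA(1) with |theta| < 1
print(is_stationary([]))          # no AR part: trivially stationary
```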

## 6. Time Series - GARCH - UNIQUE Feature!

Generate GARCH processes with volatility regime changes.

In [None]:
data_dict = make_garch_change(
    n_samples=600,
    n_changepoints=2,
    volatility_regimes=['low', 'high', 'low'],  # Predefined regimes
    seed=42
)

print("GARCH Dataset:")
print(f"  Returns shape: {data_dict['data'].shape}")
print(f"  Volatility shape: {data_dict['volatility'].shape}")
print(f"  Change points: {data_dict['changepoints']}")
print("\nMetadata:")
print(f"  Volatility regimes: {data_dict['metadata']['volatility_regimes']}")
print(f"  Avg volatility per segment: {[f'{v:.3f}' for v in data_dict['metadata']['avg_volatility_per_segment']]}")
print(f"  Volatility ratios: {[f'{r:.2f}' for r in data_dict['metadata']['volatility_ratios']]}")
print(f"  Kurtosis per segment: {[f'{k:.2f}' for k in data_dict['metadata']['kurtosis_per_segment']]}")

# Plot returns and volatility
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8), sharex=True)
ax1.plot(data_dict['data'], linewidth=0.5)
ax1.set_ylabel('Returns')
ax1.set_title('GARCH Returns with Volatility Regime Changes')
for cp in data_dict['changepoints']:
    ax1.axvline(cp, color='r', linestyle='--', alpha=0.7)
ax1.grid(True, alpha=0.3)

ax2.plot(data_dict['volatility'], color='orange', linewidth=1)
ax2.set_ylabel('Conditional Volatility')
ax2.set_xlabel('Time')
for cp in data_dict['changepoints']:
    ax2.axvline(cp, color='r', linestyle='--', alpha=0.7)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
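
For readers unfamiliar with the model behind the plot, a GARCH(1,1) process updates its conditional variance as σ²ₜ = ω + α·r²ₜ₋₁ + β·σ²ₜ₋₁. The sketch below simulates a low and a high volatility regime with made-up parameters and concatenates them. It is a simplified stand-in, not fastcpd's generator: the parameter values are assumptions, and each regime restarts at its own unconditional variance ω / (1 - α - β).

```python
import numpy as np

def simulate_garch11(n, omega, alpha, beta, rng):
    """Simulate n returns from a Gaussian GARCH(1,1); requires alpha + beta < 1."""
    returns = np.empty(n)
    sigma2 = np.empty(n)
    sigma2[0] = omega / (1.0 - alpha - beta)  # unconditional variance
    returns[0] = np.sqrt(sigma2[0]) * rng.standard_normal()
    for t in range(1, n):
        sigma2[t] = omega + alpha * returns[t - 1]**2 + beta * sigma2[t - 1]
        returns[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return returns, np.sqrt(sigma2)

rng = np.random.default_rng(42)

# Hypothetical regimes: unconditional variance 1.0 (low) and 10.0 (high)
low_returns, low_vol = simulate_garch11(300, omega=0.05, alpha=0.05, beta=0.90, rng=rng)
high_returns, high_vol = simulate_garch11(300, omega=0.50, alpha=0.10, beta=0.85, rng=rng)

returns = np.concatenate([low_returns, high_returns])
volatility = np.concatenate([low_vol, high_vol])

print(f"Average volatility (low regime):  {low_vol.mean():.3f}")
print(f"Average volatility (high regime): {high_vol.mean():.3f}")
```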

## 7. Multi-Annotator Simulation - UNIQUE Feature!

Simulate multiple human annotators with varying agreement levels.

In [None]:
from fastcpd.visualization import plot_annotators

# True change points
true_cps = [100, 200, 300]

# Simulate 5 annotators
annotators = add_annotation_noise(
    true_changepoints=true_cps,
    n_annotators=5,
    noise_std=5.0,  # Location variability
    agreement_rate=0.8,  # Probability of including each CP
    seed=42
)

print("Multi-Annotator Simulation:")
print(f"  True CPs: {true_cps}\n")
for i, ann_cps in enumerate(annotators, 1):
    print(f"  Annotator {i}: {ann_cps}")

# Generate sample data for visualization (seeded so the plot is reproducible)
rng = np.random.default_rng(42)
data = np.concatenate([
    rng.normal(0, 1, 100),
    rng.normal(5, 1, 100),
    rng.normal(2, 1, 100),
    rng.normal(-2, 1, 100)
])

# Simulated algorithm prediction
pred_cps = [98, 205, 295]

# Visualize
fig, ax = plot_annotators(data, annotators, pred_cps, figsize=(14, 6))
plt.show()
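
To build intuition for what `noise_std` and `agreement_rate` control, here is a hypothetical re-implementation of the idea in plain NumPy: each annotator keeps each true change point with probability `agreement_rate` and shifts it by Gaussian noise with standard deviation `noise_std`. The function `simulate_annotators` and its `n_samples` argument are assumptions for illustration; the actual behavior of `add_annotation_noise` may differ.

```python
import numpy as np

def simulate_annotators(true_changepoints, n_annotators, noise_std,
                        agreement_rate, n_samples, seed=None):
    """Return one sorted list of noisy change points per simulated annotator."""
    rng = np.random.default_rng(seed)
    annotations = []
    for _ in range(n_annotators):
        # Each annotator misses a change point with probability 1 - agreement_rate
        kept = [cp for cp in true_changepoints if rng.random() < agreement_rate]
        # Jitter the location and round to the nearest sample index
        jittered = [int(round(cp + rng.normal(0.0, noise_std))) for cp in kept]
        # Keep indices inside the series and drop duplicates
        clipped = sorted({min(max(cp, 1), n_samples - 1) for cp in jittered})
        annotations.append(clipped)
    return annotations

simulated = simulate_annotators(
    true_changepoints=[100, 200, 300],
    n_annotators=5,
    noise_std=5.0,
    agreement_rate=0.8,
    n_samples=400,
    seed=42,
)
for i, cps in enumerate(simulated, 1):
    print(f"Annotator {i}: {cps}")
```

With `noise_std=0.0` and `agreement_rate=1.0` this sketch returns the true change points unchanged for every annotator, which is a useful sanity check.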

## 8. Dataset Comparison Table

| Dataset | fastcpd | ruptures | Unique Features |
|---------|---------|----------|----------------|
| Mean Change | ✅ | ✅ | SNR, difficulty score, jump/drift |
| Variance Change | ✅ | ✅ | Kurtosis, multiplicative/additive |
| Regression | ✅ | ✅ | R², condition number, correlation |
| **GLM (Binomial)** | ✅ | ❌ | **AUC per segment** 🌟 |
| **GLM (Poisson)** | ✅ | ❌ | **Overdispersion** 🌟 |
| ARMA | ✅ | ⚠️ | **Stationarity checks** 🌟 |
| **GARCH** | ✅ | ❌ | **Volatility tracking** 🌟 |
| **Multi-Annotator** | ✅ | ❌ | **Annotation simulation** 🌟 |

**fastcpd provides 3 UNIQUE dataset families that the table marks as unavailable in ruptures: GLM, GARCH, and multi-annotator simulation!**

## 9. Best Practices

### Choosing Dataset Parameters

1. **SNR for difficulty control**:
   - High SNR (>10 dB): Easy detection
   - Medium SNR (0-10 dB): Moderate difficulty
   - Low SNR (<0 dB): Challenging

2. **Segment lengths**:
   - Too short: Difficult to estimate parameters
   - Too long: Easy to detect
   - Balanced: aim for an average segment length n/(n_cp+1) of roughly 50-200 samples

3. **Reproducibility**:
   - Always set `seed` for reproducible experiments
   - Document all parameters

4. **Metadata usage**:
   - Use SNR/R²/AUC to assess dataset quality
   - Check stationarity for time series
   - Verify condition number for regression
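
The SNR and segment-length guidelines above can be turned into quick arithmetic checks. The sketch below assumes SNR is defined as signal power over noise variance on a 10·log10 scale (the definition fastcpd uses is not stated here), and both helper functions are hypothetical.

```python
import math

def noise_std_for_snr(signal_power, target_snr_db):
    """Noise std that yields the target SNR, assuming SNR = signal_power / noise_var."""
    return math.sqrt(signal_power / 10 ** (target_snr_db / 10))

def average_segment_length(n_samples, n_changepoints):
    """Average segment length n / (n_cp + 1)."""
    return n_samples / (n_changepoints + 1)

# Noise level needed for easy, moderate, and challenging settings
signal_power = 2.5  # hypothetical variance of the noise-free signal
for snr in (10, 5, -3):
    print(f"Target {snr:>3} dB -> noise_std = {noise_std_for_snr(signal_power, snr):.2f}")

# Check the 50-200 guideline for the settings used in this tutorial
for n, n_cp in ((500, 3), (600, 2)):
    length = average_segment_length(n, n_cp)
    print(f"n={n}, n_cp={n_cp}: average length {length:.0f}, "
          f"within 50-200: {50 <= length <= 200}")
```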

## Summary

This tutorial covered:

✅ **Core datasets**: Mean, Variance, Regression  
✅ **GLM datasets**: Binomial, Poisson (UNIQUE!)  
✅ **Time series**: ARMA, GARCH (UNIQUE!)  
✅ **Multi-annotator**: Simulation (UNIQUE!)  
✅ **Rich metadata**: SNR, R², AUC, stationarity  
✅ **Visualization**: Dataset characteristics  

### Next Steps

- **Tutorial 3**: End-to-End Benchmarking
- Combine datasets with fastcpd detection
- Evaluate using comprehensive metrics

---

**fastcpd-python** covers mean, variance, regression, GLM, ARMA, GARCH, and multi-annotator dataset generation for change point research! 🚀