# Data Ingestion and Quality Control

This notebook shows how to load seismic data in several formats and run basic quality control (QC) checks. The examples run on synthetic traces, so no data file is required.

**Prerequisites:**
- Python 3.10+
- Basic understanding of seismic data formats

**Estimated Runtime:** 5 minutes

**Supported Formats:**
- SEG-Y (.sgy, .segy)
- miniSEED (.mseed)
- SAC (.sac)
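
As a rough sketch of how the extensions above could be mapped to loaders, the snippet below picks a loader name from a file path. `loader_name_for` and `EXTENSION_TO_LOADER` are hypothetical names introduced only for illustration and are not part of Promethium. The loader names are the ones imported in the setup cell, and how each loader is called is left out.

```python
from pathlib import Path

# Extension table taken from the "Supported Formats" list above.
# Values are the names of the loaders imported in the setup cell.
EXTENSION_TO_LOADER = {
    ".sgy": "load_segy",
    ".segy": "load_segy",
    ".mseed": "load_miniseed",
    ".sac": "load_sac",
}


def loader_name_for(path):
    """Return the loader name for a file path (hypothetical helper)."""
    suffix = Path(path).suffix.lower()  # case-insensitive match
    if suffix not in EXTENSION_TO_LOADER:
        raise ValueError(f"Unsupported seismic file extension: {suffix!r}")
    return EXTENSION_TO_LOADER[suffix]


print(loader_name_for("survey/line_001.SGY"))
```

Matching on the lowercased suffix keeps files such as `LINE_001.SGY` from being rejected for their capitalization alone.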

## 1. Installation and Setup

In [None]:
# Uncomment to install from PyPI:
# !pip install promethium-seismic==1.0.1

In [None]:
import promethium
from promethium import (
    load_segy,
    load_miniseed,
    load_sac,
    generate_synthetic_traces,
    set_seed,
)

import numpy as np
import matplotlib.pyplot as plt

print(f"Promethium version: {promethium.__version__}")
set_seed(42)

## 2. Loading SEG-Y Data

SEG-Y, maintained by the Society of Exploration Geophysicists (SEG), is the most widely used exchange format for seismic reflection data.

In [None]:
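# Example: loading a SEG-Y file (replace with your actual file path).
# Later cells also expect a `metadata` dict with a 'sample_rate' entry,
# so check what load_segy returns in your installed version.
# data = load_segy("path/to/your/file.sgy")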

# For demonstration, we generate synthetic data
print("Generating synthetic data for demonstration...")
data, metadata = generate_synthetic_traces(
    n_traces=100,
    n_samples=1000,
    sample_rate=250.0,
    seed=42
)

print(f"Data shape: {data.shape}")
print(f"Number of traces: {metadata['n_traces']}")
print(f"Samples per trace: {metadata['n_samples']}")
print(f"Sample rate: {metadata['sample_rate']} Hz")
print(f"Duration: {metadata['duration']:.2f} seconds")
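
The timing quantities printed above follow from the sample count and sample rate. The sketch below works them out with NumPy alone, assuming the duration is defined as `n_samples / sample_rate` (which is how the QC report later in this notebook computes it). The variable names are illustrative.

```python
import numpy as np

n_samples = 1000      # samples per trace, as in the synthetic example
sample_rate = 250.0   # Hz

dt = 1.0 / sample_rate                 # sampling interval in seconds
duration = n_samples / sample_rate     # record length in seconds
time_axis = np.arange(n_samples) * dt  # time of each sample, starting at 0
nyquist = sample_rate / 2.0            # highest representable frequency in Hz

print(f"dt = {dt * 1000:.1f} ms, duration = {duration:.2f} s, Nyquist = {nyquist:.1f} Hz")
```

With these settings the sampling interval is 4 ms, the record is 4 s long, and the last sample falls at 3.996 s, one interval short of the nominal duration.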

## 3. Data Inspection

Examine basic statistics and properties of the loaded data.

In [None]:
# Basic statistics
print("Dataset Statistics")
print("=" * 40)
print(f"Shape: {data.shape}")
print(f"Data type: {data.dtype}")
print(f"Min value: {np.min(data):.6f}")
print(f"Max value: {np.max(data):.6f}")
print(f"Mean: {np.mean(data):.6f}")
print(f"Std deviation: {np.std(data):.6f}")
print(f"Memory usage: {data.nbytes / 1024:.2f} KB")

In [None]:
# Per-trace statistics
trace_means = np.mean(data, axis=1)
trace_stds = np.std(data, axis=1)
trace_maxabs = np.max(np.abs(data), axis=1)

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

axes[0].plot(trace_means, 'b-', linewidth=0.5)
axes[0].set_xlabel('Trace Index')
axes[0].set_ylabel('Mean Amplitude')
axes[0].set_title('Trace Mean Values')
axes[0].grid(True, alpha=0.3)

axes[1].plot(trace_stds, 'g-', linewidth=0.5)
axes[1].set_xlabel('Trace Index')
axes[1].set_ylabel('Standard Deviation')
axes[1].set_title('Trace Standard Deviations')
axes[1].grid(True, alpha=0.3)

axes[2].plot(trace_maxabs, 'r-', linewidth=0.5)
axes[2].set_xlabel('Trace Index')
axes[2].set_ylabel('Max Absolute Value')
axes[2].set_title('Trace Peak Amplitudes')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
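
The per-trace statistics above reduce along `axis=1`, that is, over samples, which assumes the `(n_traces, n_samples)` layout used throughout this notebook. A toy array with hand-checkable values shows what each reduction returns:

```python
import numpy as np

# Toy gather: 3 traces (rows) x 4 samples (columns), same layout as `data`.
toy = np.array([
    [1.0, -1.0, 1.0, -1.0],   # zero-mean oscillation
    [2.0, 2.0, 2.0, 2.0],     # constant (DC-shifted) trace
    [0.0, 0.0, 4.0, 0.0],     # single spike
])

trace_means = np.mean(toy, axis=1)          # one mean per trace: 0, 2, 1
trace_stds = np.std(toy, axis=1)            # one std per trace: 1, 0, sqrt(3)
trace_maxabs = np.max(np.abs(toy), axis=1)  # one peak per trace: 1, 2, 4

print(trace_means)
print(trace_stds)
print(trace_maxabs)
```

A constant trace has a non-zero mean but zero standard deviation, which is why the mean plot is the one to check for DC offsets and the standard deviation plot for flat or unusually noisy traces.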

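## 4. Quality Control Checks

The report below counts NaN and infinite samples, flags a trace as dead when its energy falls below 1% of the mean trace energy, and flags a trace as anomalous when its energy lies more than 3 standard deviations from the mean.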

In [None]:
def quality_control_report(data, sample_rate=250.0):
    """Generate a quality control report for seismic data."""
    n_traces, n_samples = data.shape
    
    # Count non-finite samples
    nan_count = int(np.sum(np.isnan(data)))
    inf_count = int(np.sum(np.isinf(data)))
    
    # Per-trace energy over finite samples only, so a single NaN or Inf
    # does not silently disable the threshold tests below
    finite = np.where(np.isfinite(data), data, 0.0)
    trace_energy = np.sum(finite**2, axis=1)
    
    # Identify dead traces (energy at or below 1% of the mean trace energy);
    # "<=" also flags the traces when the whole gather is zero
    dead_threshold = np.mean(trace_energy) * 0.01
    dead_traces = np.where(trace_energy <= dead_threshold)[0]
    
    # Identify anomalous traces (energy more than 3 std from the mean)
    energy_mean = np.mean(trace_energy)
    energy_std = np.std(trace_energy)
    anomalous_traces = np.where(np.abs(trace_energy - energy_mean) > 3 * energy_std)[0]
    
    # Check amplitude distribution
    amplitude_range = np.max(data) - np.min(data)
    
    report = {
        'n_traces': n_traces,
        'n_samples': n_samples,
        'sample_rate_hz': sample_rate,
        'duration_s': n_samples / sample_rate,
        'nan_count': nan_count,
        'inf_count': inf_count,
        'dead_traces': len(dead_traces),
        'dead_trace_indices': dead_traces.tolist(),
        'anomalous_traces': len(anomalous_traces),
        'anomalous_trace_indices': anomalous_traces.tolist(),
        'amplitude_range': amplitude_range,
        'data_quality': 'GOOD' if (nan_count == 0 and inf_count == 0 and len(dead_traces) == 0) else 'NEEDS_REVIEW'
    }
    
    return report

# Generate QC report
qc_report = quality_control_report(data, sample_rate=metadata['sample_rate'])

print("Quality Control Report")
print("=" * 50)
for key, value in qc_report.items():
    if not key.endswith('_indices'):
        print(f"{key:>25}: {value}")
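
The two energy tests in `quality_control_report` can be tried on a small gather where the faulty traces are known in advance. This is a minimal sketch using NumPy only. The array, the trace indices 5 and 12, and the variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
gather = rng.standard_normal((50, 200))   # 50 traces x 200 samples
gather[5, :] = 0.0                        # simulate a dead trace
gather[12, :] *= 10.0                     # simulate a high-energy trace

trace_energy = np.sum(gather**2, axis=1)

# Dead test: energy at or below 1% of the mean trace energy
dead = np.where(trace_energy <= 0.01 * np.mean(trace_energy))[0]

# Anomaly test: energy more than 3 standard deviations from the mean
deviation = np.abs(trace_energy - np.mean(trace_energy))
anomalous = np.where(deviation > 3 * np.std(trace_energy))[0]

print("dead traces:", dead.tolist())
print("anomalous traces:", anomalous.tolist())
```

Here trace 5 is flagged as dead and trace 12 as anomalous. The zeroed trace is not caught by the 3-sigma test, because the single loud trace inflates both the mean and the standard deviation, so the two tests are complementary rather than redundant.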

## 5. Visualization

In [None]:
# Plot seismic gather (image view)
fig, ax = plt.subplots(figsize=(12, 8))

clip_val = np.percentile(np.abs(data), 99)
extent = [0, data.shape[0], data.shape[1] / metadata['sample_rate'], 0]

im = ax.imshow(
    data.T,
    aspect='auto',
    cmap='seismic',
    vmin=-clip_val,
    vmax=clip_val,
    extent=extent
)

ax.set_xlabel('Trace Number')
ax.set_ylabel('Time (s)')
ax.set_title('Seismic Data Gather')
plt.colorbar(im, ax=ax, label='Amplitude')
plt.tight_layout()
plt.show()
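
The gather plot sets its color limits from the 99th percentile of absolute amplitude rather than from the maximum. The sketch below, on made-up data with one artificial spike, illustrates why: the maximum is set by the spike alone, while the percentile stays near the bulk of the amplitudes.

```python
import numpy as np

rng = np.random.default_rng(0)
amplitudes = rng.standard_normal((100, 1000))
amplitudes[3, 500] = 1000.0   # one extreme spike

full_scale = np.max(np.abs(amplitudes))            # dominated by the spike
clip_val = np.percentile(np.abs(amplitudes), 99)   # robust to the spike

# Fraction of samples that would saturate the colormap at +/- clip_val
clipped_fraction = np.mean(np.abs(amplitudes) > clip_val)

print(f"full scale: {full_scale:.1f}, 99th-percentile clip: {clip_val:.2f}")
print(f"fraction of samples saturated: {clipped_fraction:.3f}")
```

Scaling the colormap to the full range would wash out everything except the spike. Clipping at the 99th percentile saturates about 1% of the samples and keeps the rest of the gather visible.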

In [None]:
# Amplitude histogram
fig, ax = plt.subplots(figsize=(10, 5))

ax.hist(data.ravel(), bins=100, density=True, alpha=0.7, color='steelblue')
ax.set_xlabel('Amplitude')
ax.set_ylabel('Density')
ax.set_title('Amplitude Distribution')
ax.axvline(0, color='red', linestyle='--', linewidth=1, label='Zero')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 6. Summary

This notebook demonstrated:

1. **Data Loading**: The `load_segy()`, `load_miniseed()`, and `load_sac()` entry points, with synthetic traces used in place of a real file
2. **Data Inspection**: Computing statistics and per-trace properties
3. **Quality Control**: Detecting dead traces, anomalies, and data issues
4. **Visualization**: Seismic gather plots and amplitude distributions

### Next Steps

- **03_signal_processing_basics.ipynb**: Apply filters and transforms
- **04_matrix_completion_and_compressive_sensing.ipynb**: Recover missing traces