# DroneDetect V2 - Raw IQ Feature Extraction

## Overview
This notebook implements Raw IQ (downsampled) feature extraction for RF-based drone classification using deep learning architectures.

## Methodology
We follow a **two-stage pipeline**: (1) IQ Downsampling → (2) Deep Learning Training (RF-UAVNet)

The IQ extraction process:
1. Loads raw I/Q samples from .dat files (60 MHz sampling rate)
2. **Normalizes per-file** (Min-max [0,1]) before segmentation - **DIFFERENT from PSD/Spectrogram!**
3. Segments continuous signals into fixed-duration windows (20ms)
4. **Downsamples** from ~1.2M to 10k samples via linear interpolation
5. Saves complex IQ data as (2, 10000) = [real, imag]
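
The segmentation and downsampling steps above can be sketched as follows. This is a minimal illustration, not the project's `preprocessing.segment_signal` or `preprocessing.downsample_iq` (their source is not shown in this notebook). The helper names `segment_sketch` and `downsample_sketch`, the non-overlapping windows, and the use of `np.interp` on each channel are all assumptions. A reduced sampling rate and target size keep the example light.

```python
import numpy as np


def segment_sketch(iq, fs_hz, segment_ms):
    """Split a 1-D complex signal into non-overlapping windows (hypothetical helper)."""
    n = int(round(fs_hz * segment_ms / 1000))   # samples per window
    n_segments = len(iq) // n                   # drop the incomplete tail
    return iq[:n_segments * n].reshape(n_segments, n)


def downsample_sketch(segment, target_samples):
    """Linearly interpolate I and Q onto evenly spaced points (hypothetical helper)."""
    src = np.arange(len(segment))
    dst = np.linspace(0, len(segment) - 1, target_samples)
    i_ch = np.interp(dst, src, segment.real)
    q_ch = np.interp(dst, src, segment.imag)
    return np.stack([i_ch, q_ch]).astype(np.float32)   # shape (2, target_samples)


# Synthetic stand-in for a recording: 0.1 s at 600 kHz (assumed values, not the dataset's)
fs_hz = 600_000
rng = np.random.default_rng(0)
iq = rng.standard_normal(fs_hz // 10) + 1j * rng.standard_normal(fs_hz // 10)

segments = segment_sketch(iq, fs_hz, segment_ms=20)
features = np.stack([downsample_sketch(s, 1000) for s in segments])
print(segments.shape, features.shape)
```

With the notebook's real parameters (60 MHz, 20 ms, 10,000 target samples) the same logic would yield one `(2, 10000)` array per segment.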

## Why Raw IQ Features?

**Raw IQ**: Preserves phase information and signal structure for end-to-end learning. Enables neural networks to discover representations beyond hand-crafted features.

**Advantages**:
- Preserves magnitude and phase jointly (both I and Q channels are kept)
- No hand-crafted feature engineering; the main information loss comes from downsampling
- End-to-end learnable representations
- Matches RF-UAVNet architecture
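
As a small illustration of why the two-channel layout keeps both magnitude and phase, the sketch below stacks a synthetic complex tone into `[I, Q]` rows and recovers both quantities from them. The tone parameters are arbitrary assumptions, and the example works on un-normalized values, so it shows the layout only, not the full pipeline.

```python
import numpy as np

# Synthetic complex tone with a known amplitude and phase offset (assumed values)
n = np.arange(8)
amplitude, phase0 = 0.5, np.pi / 4
iq = amplitude * np.exp(1j * (2 * np.pi * 0.1 * n + phase0))

# Two-channel layout: row 0 = I (real part), row 1 = Q (imaginary part)
x = np.stack([iq.real, iq.imag])

# Magnitude and phase are both recoverable from the two rows
magnitude = np.hypot(x[0], x[1])
phase = np.arctan2(x[1], x[0])
print(x.shape, magnitude[0], phase[0])
```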

## Parameter Selection

**Aligned with RF-UAVNet reference:**
- **Segment duration**: **20ms**
- **Downsampled size**: **10,000 samples** (from ~1.2M)
- **Normalization**: **Min-max [0,1] per-file** - Matches RadioML2016 protocol
- **Format**: **(2, 10000)** - [I channel, Q channel]
- **Downsampling method**: **Linear interpolation** - Preserves signal shape
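
The figures in this list follow from simple arithmetic, sketched here for reference. The float32 storage size assumes the `float32` dtype used when the features are merged later in this notebook, and the 100 segments per file is the approximate value quoted in the specifications.

```python
fs_hz = 60_000_000        # DroneDetect V2 sampling rate
segment_s = 20e-3         # 20 ms window
target = 10_000           # downsampled samples per segment

samples_per_segment = round(fs_hz * segment_s)   # ~1.2M samples before downsampling
reduction = samples_per_segment / target         # downsampling ratio
effective_rate_hz = fs_hz / reduction            # sample rate after downsampling
bytes_per_segment = 2 * target * 4               # (2, 10000) array stored as float32
bytes_per_file = 100 * bytes_per_segment         # ~100 segments per 2 s recording

print(samples_per_segment, reduction, effective_rate_hz, bytes_per_segment, bytes_per_file)
```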

### Critical Difference: Min-Max Normalization

Unlike PSD/Spectrogram (Z-score), IQ uses **min-max [0,1]** normalization:
- Prevents negative values (important for some architectures)
- Bounded range [0,1] stabilizes training
- Matches RF-UAVNet paper implementation
- Per-channel normalization (I and Q independently)
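
A minimal sketch of per-channel min-max scaling is shown below. It is not the project's `preprocessing.normalize_minmax`, whose source and return layout are not shown here; the helper name `minmax_per_channel_sketch`, the `eps` guard, and the `(2, N)` real-valued output are assumptions made for illustration.

```python
import numpy as np


def minmax_per_channel_sketch(iq, eps=1e-12):
    """Scale I and Q independently to [0, 1] (hypothetical helper)."""
    channels = np.stack([iq.real, iq.imag])        # (2, N): row 0 = I, row 1 = Q
    lo = channels.min(axis=1, keepdims=True)
    hi = channels.max(axis=1, keepdims=True)
    span = np.maximum(hi - lo, eps)                # guard against a constant channel
    return (channels - lo) / span


iq = np.array([1 + 2j, -3 + 0j, 5 - 2j])
scaled = minmax_per_channel_sketch(iq)
print(scaled)
```

Because each channel has its own minimum and range, the I and Q rows are shifted and scaled by different amounts.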

## Downstream Usage

IQ features are consumed by:
- `05_training_rfuavnet_COLAB.ipynb` - RF-UAVNet deep learning

## Reference Alignment
Parameters verified against REFERENTIEL_DRONEDETECT_RFCLASSIFICATION.md Section 3.2 and arXiv:2308.11833.


In [None]:
import sys
sys.path.insert(0, '../src')

import gc
import psutil
import os
import numpy as np
from tqdm import tqdm
from typing import Any, List
from sklearn.preprocessing import LabelEncoder
import zipfile

from dronedetect import config, data_loader, preprocessing, features

# Memory monitoring utility

def get_memory_mb():
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024


# Create output directory
config.FEATURES_DIR.mkdir(parents=True, exist_ok=True)

print(f"Initial memory: {get_memory_mb():.0f} MB")

## Specifications

All preprocessing parameters defined here for reproducibility.

In [None]:
# =============================================================================
# PREPROCESSING SPECIFICATIONS
# =============================================================================
# All parameters controlling feature extraction defined here for reproducibility.

# --- Signal Segmentation ---
SEGMENT_DURATION_MS = 20        # Window length (config.DEFAULT_SEGMENT_MS)
                                 # RFClassification results: 10ms→20ms→50ms improves 76.9%→83.6%→89.4%

# --- Feature Extraction Parameters ---
N_FFT = 1024                     # FFT size - Reference: REFERENTIEL Section 1.2.1 (nperseg=1024)
                                 # Frequency resolution = 60MHz/1024 ≈ 58.6 kHz/bin

SPECTROGRAM_SIZE = (224, 224)    # Output image dimensions (config.DEFAULT_SPEC_SIZE)
                                 # Matches VGG16/ResNet input for transfer learning

IQ_DOWNSAMPLE_TARGET = 10000     # Downsampled IQ samples (config.DEFAULT_IQ_DOWNSAMPLE)
                                 # Original: ~1.2M samples/segment → 10k for memory efficiency

# --- Batch Processing ---
BATCH_SIZE = 2                  # Files per batch (adjust based on available RAM)

# --- Dataset Info ---
SAMPLING_RATE_MHZ = 60           # DroneDetect V2 sampling frequency
EXPECTED_SEGMENTS_PER_FILE = 100 # Approx (2s recording / 20ms window)

print("Specifications loaded:")
print(f"  Segment: {SEGMENT_DURATION_MS}ms | FFT: {N_FFT} | Batch: {BATCH_SIZE} files")

## Preprocessing Architecture

Modular pipeline system for flexible feature extraction.

In [None]:
# =============================================================================
# PREPROCESSING ARCHITECTURE
# =============================================================================


class PreprocessingStep:
    """Base class for preprocessing steps."""
    
    def process(self, segment: np.ndarray) -> Any:
        """Process a signal segment.
        
        Args:
            segment: Input signal segment
            
        Returns:
            Processed output (type depends on step)
        """
        raise NotImplementedError
    
    def __repr__(self):
        return self.__class__.__name__


class PSDStep(PreprocessingStep):
    """Power Spectral Density via Welch method with per-sample normalization."""
    
    def __init__(self, nfft=1024):
        self.nfft = nfft
    
    def process(self, segment):
        _, psd = features.compute_psd(segment, nfft=self.nfft)
        # Per-sample normalization (reference: REFERENTIEL Section 1.3.1, line 220)
        # Division by max with zero-protection
        psd_max = np.max(psd)
        if psd_max < 1e-15:  # Essentially zero power
            return np.zeros_like(psd)
        return psd / psd_max
    
    def __repr__(self):
        return f"PSDStep(nfft={self.nfft})"


class SpectrogramStep(PreprocessingStep):
    """Spectrogram via STFT + resize."""
    
    def __init__(self, target_size=(224, 224)):
        self.target_size = target_size
    
    def process(self, segment):
        return features.compute_spectrogram(segment, target_size=self.target_size)
    
    def __repr__(self):
        return f"SpectrogramStep(size={self.target_size})"


class DownsampleIQStep(PreprocessingStep):
    """Downsample IQ via linear interpolation."""
    
    def __init__(self, target_samples=10000):
        self.target_samples = target_samples
    
    def process(self, segment):
        return preprocessing.downsample_iq(segment, target_samples=self.target_samples)
    
    def __repr__(self):
        return f"DownsampleIQStep(n={self.target_samples})"


class FeaturePipeline:
    """Pipeline orchestrating multiple preprocessing steps."""
    
    def __init__(self, name: str, steps: List[PreprocessingStep]):
        """Initialize pipeline.
        
        Args:
            name: Pipeline identifier (used for output filename)
            steps: List of preprocessing steps to apply in order
        """
        self.name = name
        self.steps = steps
    
    def process_segment(self, segment: np.ndarray) -> np.ndarray:
        """Apply all steps sequentially.
        
        Args:
            segment: Input signal segment
            
        Returns:
            Final processed features
        """
        data = segment
        for step in self.steps:
            data = step.process(data)
        return data
    
    def get_output_filename(self) -> str:
        """Get output filename for this pipeline."""
        return f"{self.name}_features.npz"
    
    def __repr__(self):
        steps_str = ' → '.join([str(s) for s in self.steps])
        return f"Pipeline({self.name}): {steps_str}"


print("Preprocessing architecture loaded")

## Pipeline Configuration

Define preprocessing pipelines for different feature types.

In [None]:
# =============================================================================
# PIPELINE CONFIGURATION
# =============================================================================

# Single pipeline for IQ features
pipeline = FeaturePipeline(
    name='iq',
    steps=[DownsampleIQStep(target_samples=IQ_DOWNSAMPLE_TARGET)]
)

print(f"Configured pipeline: {pipeline}")


## 1. Scan Dataset

In [None]:
df = data_loader.get_dataset_metadata(config.DATA_DIR)
print(f"Total files: {len(df)}")

In [None]:
df.describe(include="all")

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.info(verbose=True, show_counts=True)

## 2. Extract Features (Batch Processing)

Process files in batches to avoid memory saturation. Features are written progressively to disk.
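
Each batch archive stores its labels as an object array of dicts, so reading one back requires `allow_pickle=True`. The sketch below writes and reloads a tiny archive with the same keys (`iq`, `labels`, `file_ids`); the array sizes, label values, and temporary path are placeholders, not real data.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    batch_file = Path(tmp) / "batch_000.npz"

    # Placeholder batch: 3 segments of shape (2, 16), all from hypothetical file 7
    feats = np.zeros((3, 2, 16), dtype=np.float32)
    labels = [{"drone": 0, "interference": 1, "state": 2}] * 3
    np.savez_compressed(
        batch_file,
        iq=feats,
        labels=np.array(labels, dtype=object),
        file_ids=np.array([7, 7, 7], dtype=np.int32),
    )

    # Object arrays are pickled inside the archive, hence allow_pickle=True
    with np.load(batch_file, allow_pickle=True) as data:
        iq_batch = data["iq"]
        drone_ids = np.array([lab["drone"] for lab in data["labels"]], dtype=np.int32)
        file_ids = data["file_ids"]

print(iq_batch.shape, drone_ids, file_ids)
```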

In [None]:
# =============================================================================
# BATCH PROCESSING - Extract Features
# =============================================================================

# Initialize label encoders
from sklearn.preprocessing import LabelEncoder

drone_encoder = LabelEncoder()
interference_encoder = LabelEncoder()
state_encoder = LabelEncoder()

drone_encoder.fit(df['drone_code'].unique())
interference_encoder.fit(df['interference'].unique())
state_encoder.fit(df['state'].unique())

print("Label encoders initialized")
print(f"  Drones: {drone_encoder.classes_}")
print(f"  Interference: {interference_encoder.classes_}")
print(f"  States: {state_encoder.classes_}")

# Batch processing
batch_files = []
batch_idx = 0

for batch_start in tqdm(range(0, len(df), BATCH_SIZE), desc="Processing batches"):
    batch_df = df.iloc[batch_start:batch_start + BATCH_SIZE]

    # Storage for current batch
    batch_features = []
    labels_batch = []
    file_ids_batch = []

    print(f"\nBatch {batch_idx}: files {batch_start}-{batch_start + len(batch_df) - 1}")

    for idx, row in batch_df.iterrows():
        try:
            # Load IQ data
            iq_data = data_loader.load_raw_iq(row['file_path'])

            # CRITICAL: Min-max [0,1] normalization for IQ/RF-UAVNet
            iq_normalized = preprocessing.normalize_minmax(iq_data)

            # Segment into 20ms windows
            segments = preprocessing.segment_signal(iq_normalized, segment_ms=SEGMENT_DURATION_MS)

            # Process each segment through pipeline
            for seg in segments:
                seg = seg.copy()  # Break view to allow memory release

                # Extract features
                feature_output = pipeline.process_segment(seg)
                batch_features.append(feature_output)

                # Encode labels
                labels_batch.append({
                    'drone': drone_encoder.transform([row['drone_code']])[0],
                    'interference': interference_encoder.transform([row['interference']])[0],
                    'state': state_encoder.transform([row['state']])[0]
                })
                file_ids_batch.append(idx)

                del seg

            # Free memory
            del iq_data, iq_normalized, segments
            gc.collect()

        except Exception as e:
            print(f"  Error processing {row['file_path']}: {e}")
            raise  # Re-raise the original exception with its type and traceback

    # Save batch to disk
    batch_file = config.FEATURES_DIR / f'batch_{batch_idx:03d}.npz'
    np.savez_compressed(
        batch_file,
        iq=np.array(batch_features),
        labels=np.array(labels_batch, dtype=object),
        file_ids=np.array(file_ids_batch, dtype=np.int32)
    )
    batch_files.append(batch_file)

    print(f"  Saved {len(batch_features)} samples to {batch_file.name}")
    print(f"  Memory: {get_memory_mb():.0f} MB")

    # Clean batch data
    del batch_features, labels_batch, file_ids_batch
    gc.collect()

    batch_idx += 1

print(f"\nBatch processing complete: {len(batch_files)} batches saved")
print(f"Memory: {get_memory_mb():.0f} MB")


## 3. Merge Batches into Final Files

Combine all batch files into single feature files.

In [None]:
# =============================================================================
# OPTIMIZED BATCH MERGING (Disk-backed, ZERO RAM accumulation)
# =============================================================================
print("Merging batches (disk-backed memmap)...")

MERGE_CHUNK_SIZE = 5  # Process N batches at a time


def merge_pipeline_chunked(pipeline_name, batch_files, chunk_size=5):
    """Merge batches for a single pipeline using disk-backed memory mapping.

    This approach prevents memory saturation by:
    1. Writing directly to disk via memmap (NO RAM accumulation)
    2. Processing one pipeline at a time
    3. Pre-allocating final array size
    4. Keeping memmap for direct npz save (no RAM load)

    Args:
        pipeline_name: Name of pipeline ('psd', 'spectrogram', 'iq')
        batch_files: List of batch file paths
        chunk_size: Number of batches to process simultaneously

    Returns:
        dict with memmap + metadata (labels stay in RAM - small)
    """
    print(f"\n[{pipeline_name}] Processing {len(batch_files)} batches (disk-backed)...")

    # STEP 1: Calculate total size by scanning batches
    print(f"[{pipeline_name}] Scanning batches to determine size...")
    total_samples = 0
    feature_shape = None

    for batch_file in batch_files:
        data = np.load(batch_file, allow_pickle=True)
        batch_features = data[pipeline_name]
        total_samples += len(batch_features)
        if feature_shape is None:
            feature_shape = batch_features.shape[1:]  # Per-sample shape, e.g. (2, 10000) for 'iq'
        del data

    print(f"[{pipeline_name}] Total samples: {total_samples}, feature shape: {feature_shape}")

    # STEP 2: Create memmap file (writes directly to disk)
    memmap_file = config.FEATURES_DIR / f'{pipeline_name}_features_X.mmap'
    final_shape = (total_samples,) + feature_shape

    print(f"[{pipeline_name}] Creating memmap: {final_shape} ({np.prod(final_shape)*4/1e9:.2f} GB)")
    X_memmap = np.memmap(memmap_file, dtype='float32', mode='w+', shape=final_shape)

    # Storage for labels (small, can stay in RAM)
    all_drone_labels = []
    all_interference_labels = []
    all_state_labels = []
    all_file_ids = []

    # STEP 3: Fill memmap progressively
    write_offset = 0

    for chunk_start in tqdm(range(0, len(batch_files), chunk_size), desc=f"{pipeline_name} chunks"):
        chunk_batch_files = batch_files[chunk_start:chunk_start + chunk_size]

        for batch_file in chunk_batch_files:
            data = np.load(batch_file, allow_pickle=True)

            # Write features directly to disk
            batch_features = data[pipeline_name]
            n_samples = len(batch_features)
            X_memmap[write_offset:write_offset + n_samples] = batch_features
            write_offset += n_samples

            # Collect labels (small)
            labels = data['labels']
            all_drone_labels.extend([label['drone'] for label in labels])
            all_interference_labels.extend([label['interference'] for label in labels])
            all_state_labels.extend([label['state'] for label in labels])
            all_file_ids.append(data['file_ids'])

            del data, batch_features

        # Flush to disk periodically
        X_memmap.flush()
        gc.collect()

        print(f"  Written {write_offset}/{total_samples} samples, RAM: {get_memory_mb():.0f} MB")

    # STEP 4: Final flush (keep memmap, do NOT load into RAM)
    X_memmap.flush()
    
    final_file_ids = np.concatenate(all_file_ids, axis=0)

    return {
        'X_memmap': X_memmap,  # Return memmap directly
        'memmap_file': memmap_file,
        'memmap_shape': final_shape,
        'y_drone': np.array(all_drone_labels, dtype=np.int32),
        'y_interference': np.array(all_interference_labels, dtype=np.int32),
        'y_state': np.array(all_state_labels, dtype=np.int32),
        'file_ids': final_file_ids
    }


# Merge the single IQ pipeline
print(f"\n{'='*60}")
final_data = merge_pipeline_chunked(
    "iq",
    batch_files,
    chunk_size=MERGE_CHUNK_SIZE
)
print(f"[iq] Shape: {final_data['memmap_shape']}")
print(f"Memory after iq: {get_memory_mb():.0f} MB")

print("\n" + "="*60)
print("Final shapes (memmap-backed):")
# Process single pipeline
print(f"  {pipeline.name}: {final_data['memmap_shape']}")
print(f"  File IDs: {final_data['file_ids'].shape} (unique: {len(np.unique(final_data['file_ids']))})")
print(f"Final memory: {get_memory_mb():.0f} MB")

## 4. Save Final Features to Disk

In [None]:
# Save IQ features (directly from memmap, no RAM load)
print("Saving features to npz (streaming from memmap)...")

# Save single pipeline
output_file = config.FEATURES_DIR / "iq_features.npz"
    
# Save directly from memmap (npz will read from disk as needed)
np.savez_compressed(
    output_file,
    X=final_data['X_memmap'],  # Memmap array
    y_drone=final_data['y_drone'],
    y_interference=final_data['y_interference'],
    y_state=final_data['y_state'],
    file_ids=final_data['file_ids'],
    drone_classes=drone_encoder.classes_,
    interference_classes=interference_encoder.classes_,
    state_classes=state_encoder.classes_,
    # Metadata for frequency conversion (baseband to absolute RF)
    fs=config.FS,
    center_freq=config.CENTER_FREQ,
    bandwidth=config.BANDWIDTH
)
print(f"Saved iq features to {output_file}")
    
# Clean up memmap
del final_data['X_memmap']
final_data['memmap_file'].unlink()

# Free remaining memory
del final_data
gc.collect()

print(f"\nAll features saved successfully! Final RAM: {get_memory_mb():.0f} MB")

## 5. Verify Saved Files

In [None]:
import zipfile

with zipfile.ZipFile(config.FEATURES_DIR / 'iq_features.npz', 'r') as z:
    print("Keys in archive:", z.namelist())
    
    # Read X shape
    with z.open('X.npy') as f:
        version = np.lib.format.read_magic(f)
        shape, fortran, dtype = np.lib.format._read_array_header(f, version)
        print(f"  X shape: {shape}, dtype: {dtype}")
    
    # Read y_drone shape
    with z.open('y_drone.npy') as f:
        version = np.lib.format.read_magic(f)
        shape, fortran, dtype = np.lib.format._read_array_header(f, version)
        print(f"  y_drone shape: {shape}, dtype: {dtype}")
    
    # Classes can be loaded (small arrays)
    drone_classes = np.load(z.open('drone_classes.npy'), allow_pickle=True)
    interference_classes = np.load(z.open('interference_classes.npy'), allow_pickle=True)
    state_classes = np.load(z.open('state_classes.npy'), allow_pickle=True)
    
    print(f"  Drone classes: {drone_classes}")
    print(f"  Interference classes: {interference_classes}")
    print(f"  State classes: {state_classes}")

## Appendix: Stratified Train/Test Split with File-Level Grouping

### Problem: Temporal Autocorrelation and Data Leakage

Each `.dat` file in DroneDetect V2 contains approximately 2 seconds of continuous RF signal sampled at 60 MHz. During preprocessing, this signal is segmented into ~100 consecutive windows of 20 ms each (2 s / 20 ms).

**Critical observation:** Consecutive segments from the same recording exhibit strong temporal autocorrelation. The RF characteristics (carrier frequency drift, hardware imperfections, environmental noise) remain largely constant within a single acquisition.

If segments from the same source file appear in both training and test sets, the model may learn to recognize recording-specific artifacts rather than generalizable drone RF signatures. This constitutes **data leakage** and leads to overly optimistic performance estimates that fail to generalize to unseen recordings.
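
To make the leakage risk concrete, the sketch below applies a naive segment-level random split to synthetic file IDs and counts how many files end up on both sides. The file count, segments per file, and 80/20 ratio are illustrative assumptions; no real features are involved.

```python
import numpy as np

# Synthetic layout: 50 hypothetical files, 100 segments each
n_files, segs_per_file = 50, 100
file_ids = np.repeat(np.arange(n_files), segs_per_file)

# Naive split: shuffle individual segments and hold out 20% of them
rng = np.random.default_rng(0)
perm = rng.permutation(len(file_ids))
n_test = len(file_ids) // 5
test_idx, train_idx = perm[:n_test], perm[n_test:]

# Files that contribute segments to BOTH train and test
shared = np.intersect1d(file_ids[train_idx], file_ids[test_idx])
print(f"{len(shared)} of {n_files} files appear in both splits")
```

With this many segments per file, a segment-level shuffle places segments from essentially every recording in both sets, which is the situation a file-level split is meant to rule out.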

### Solution: StratifiedGroupKFold

We implement a file-grouped stratified split using `sklearn.model_selection.StratifiedGroupKFold`:

1. **Grouping constraint**: All segments from a given `.dat` file are assigned to the same split (train OR test, never both)
2. **Stratification**: Splits maintain the drone class distribution to ensure balanced representation
3. **Validation**: An assertion verifies zero file overlap between splits

This approach ensures that test set performance reflects the model's ability to generalize to entirely new recordings, providing a realistic estimate of real-world deployment accuracy.

In [None]:
from sklearn.model_selection import StratifiedGroupKFold


def get_stratified_file_split(X, y, file_ids, test_size=0.2, random_state=42):
    """
    Split data at FILE level to prevent data leakage.
    
    Segments from the same .dat file (~100 segments) will never appear
    in both train and test sets.
    
    Parameters
    ----------
    X : array-like
        Features (n_samples, ...)
    y : array-like
        Labels for stratification (n_samples,)
    file_ids : array-like
        Source file ID for each sample (n_samples,)
    test_size : float
        Approximate test set proportion (actual may vary due to file grouping)
    random_state : int
        Random seed for reproducibility
        
    Returns
    -------
    train_idx, test_idx : arrays
        Indices for train/test split
    """
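    file_ids = np.asarray(file_ids)  # Accept lists too; the index lookups below need an array
    n_splits = max(2, round(1 / test_size))  # e.g., test_size=0.2 -> 5 splits -> 1 fold = 20%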
    
    sgkf = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    
    # Take first fold as train/test split
    train_idx, test_idx = next(sgkf.split(X, y, groups=file_ids))
    
    # Verify no file leakage
    train_files = set(file_ids[train_idx])
    test_files = set(file_ids[test_idx])
    assert len(train_files & test_files) == 0, "Data leakage detected: files in both splits"
    
    return train_idx, test_idx


# Usage example (run after loading features):
# iq_data = np.load(config.FEATURES_DIR / 'iq_features.npz')
# X, y, file_ids = iq_data['X'], iq_data['y_drone'], iq_data['file_ids']
# train_idx, test_idx = get_stratified_file_split(X, y, file_ids)
# X_train, X_test = X[train_idx], X[test_idx]
# y_train, y_test = y[train_idx], y[test_idx]

print("Split function defined: get_stratified_file_split(X, y, file_ids, test_size=0.2)")