# Explainable AI (XAI) for Partially Spoofed Audio Detection with Grad-CAM
**Author:** Tianchi Liu  
**Status:** In Progress

**Reference:** [How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?](https://arxiv.org/abs/2406.02483)


## Overview

This tutorial explains the **step-by-step workflow** for applying **Explainable AI (XAI)** techniques to **partially spoofed audio detection** using the **Gradient-weighted Class Activation Mapping (Grad-CAM)** method.

**Partially spoofed audio** refers to utterances where only certain segments are synthetic while others remain genuine.

### 📂 Reference Implementation Path

```bash
egs/detection/partialspoof/x12_ssl_res1d/
```

### Key Components

| File | Purpose |
|------|--------|
| `run.sh` | Main pipeline orchestrating Stages 1-10 |
| `conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml` | Model configuration |
| `local/prepare_data.sh` | Data preparation script |
| `wedefense/bin/train.py` | Model training |
| `wedefense/bin/XAI_GradCam_infer.py` | XAI heatmap extraction |
| `wedefense/bin/XAI_Score_analysis.py` | XAI score analysis and visualization |

### What This Tutorial Covers

✅ **Complete Pipeline** - From data preparation to XAI analysis  
✅ **Model Architecture** - SSL-Res1D for partial spoofing detection  
✅ **Grad-CAM Theory** - How temporal activation maps are computed  
✅ **XAI Extraction** - Step-by-step extraction process  
✅ **Result Interpretation** - Understanding and analyzing XAI scores

## Complete Pipeline Overview

The `run.sh` script implements a 10-stage pipeline:

```
Stage 1: Data Preparation          → wav.scp, utt2lab, lab2utt
Stage 2: Data Format Conversion    → Shard/Raw format
Stage 3: Model Training            → SSL-Res1D training
Stage 4: Model Averaging           → Average best checkpoints
Stage 5: Extract Logits            → Model inference
Stage 6: Compute LLR Scores        → Log-likelihood ratios
Stage 7: Performance Evaluation    → EER, min t-DCF metrics
Stage 8: Analysis                  → Statistical tests
Stage 9: XAI Extraction            → Grad-CAM heatmaps
Stage 10: XAI Analysis             → Visualization and interpretation
```

**This tutorial focuses on Stages 9-10** (XAI extraction and analysis), assuming Stages 1-8 are complete.

## Grad-CAM Theory for Audio

### What is Grad-CAM?

Grad-CAM (Gradient-weighted Class Activation Mapping) identifies which regions of the input the model focuses on when making predictions.

### Mathematical Formulation

For a target class $c$ (e.g., spoof class):

1. **Forward Pass**:
   - Input audio → SSL Frontend → Classifier (Res1D) → Classification score $y^c$
   - Extract feature maps $A^k$ from the target layer

2. **Backward Pass**:
   - Compute gradients: $\frac{\partial y^c}{\partial A^k}$

3. **Weight Calculation** (Global Average Pooling):
   $$\alpha_k^c = \frac{1}{T}\sum_{t=1}^{T}\frac{\partial y^c}{\partial A^k_t}$$
   
   where $T$ is the temporal dimension.

4. **Weighted Combination**:
   $$L^c_{\text{Grad-CAM}} = \text{ReLU}\left(\sum_k \alpha_k^c A^k\right)$$

5. **Temporal Heatmap**:
   - Normalize to [0, 1]
   - High values indicate regions important for classification
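
The five steps above can be sketched in NumPy for a single utterance. This is a minimal illustration, not the code used by `XAI_GradCam_infer.py`: the function name `grad_cam_1d`, the `(K, T)` array layout, and the min-max scaling used for step 5 are assumptions made for the example.

```python
import numpy as np


def grad_cam_1d(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Temporal Grad-CAM from target-layer activations and gradients.

    Both inputs have shape (K, T): K channels, T time frames.
    Returns a heatmap of shape (T,) scaled to [0, 1].
    """
    # Step 3: alpha_k is the temporal mean of d y^c / d A^k_t
    alpha = gradients.mean(axis=1)  # shape (K,)
    # Step 4: ReLU of the alpha-weighted sum over channels
    cam = np.maximum((alpha[:, None] * activations).sum(axis=0), 0.0)  # shape (T,)
    # Step 5: min-max scaling (assumed); a constant map becomes all zeros
    span = cam.max() - cam.min()
    return (cam - cam.min()) / span if span > 0 else np.zeros_like(cam)


# Toy example: 2 channels, 4 frames
A = np.array([[1.0, 2.0, 3.0, 4.0],
              [4.0, 3.0, 2.0, 1.0]])
dY_dA = np.array([[0.5, 0.5, 0.5, 0.5],       # channel 0 supports the class
                  [-0.5, -0.5, -0.5, -0.5]])  # channel 1 opposes it
heatmap = grad_cam_1d(A, dY_dA)
```

In this toy case the weighted sum is negative for the first two frames, so the ReLU zeroes them and only the last two frames receive non-zero activation.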

### Why Grad-CAM for Partial Spoofing?

Unlike fully spoofed audio, where the whole utterance is synthetic, partially spoofed audio calls for:
- **Temporal localization**: Identify *when* spoofing occurs
- **Boundary detection**: Find transitions between real/fake
- **Segment-level understanding**: Distinguish mixed content

Grad-CAM provides this temporal resolution by showing activation strength over time.

## Model Architecture: SSL-Res1D

### Pipeline Components

```
Audio Input (16kHz)
    ↓
[SSL Frontend] XLSR-53
    ↓
[Classifier] Res1D Backend
    ↓
Classification Score (Bonafide/Spoof)
```

### Key Configuration

From `conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml`:

```yaml
model: ssl_multireso_gmlp
model_args:
  feat_dim: 768          # XLSR-53 feature dimension
  embed_dim: -2          # Output embedding dimension
  num_scale: 6           # Multi-resolution scales
  gmlp_layers: 1
  batch_first: true
  flag_pool: ap          # Attentive pooling

frontend: xlsr_53
xlsr_53_args:
  layer: 12              # Use 12th layer of XLSR-53
  
projection_args:
  project_type: arc_margin
  scale: 30.0
  margin: 0.2
```

### Why This Architecture?

1. **XLSR-53**: Self-supervised speech representations capture fine-grained acoustic patterns
2. **Res1D**: 1D residual blocks effective for temporal modeling
3. **Multi-Resolution**: Captures artifacts at different temporal scales
4. **Arc Margin**: Enhances inter-class separation

## Stage 1: Data Preparation

### Script: `local/prepare_data.sh`

### Purpose
Prepare the PartialSpoof dataset in WeDefense format.

### Input
- PartialSpoof database directory
- Protocol files: `PartialSpoof.LA.cm.{train,dev,eval}.trl.txt`

### Process

1. **Create wav.scp**
   ```bash
   find ${PS_dir}/${dset}/con_wav -name "*.wav" | awk -F"/" '{print $NF,$0}' | sort
   ```
   Format: `utterance_id /path/to/audio.wav`

2. **Extract labels (utt2lab)**
   ```bash
   cut -d' ' -f2,5 ${PS_dir}/protocols/PartialSpoof_LA_cm_protocols/PartialSpoof.LA.cm.${dset}.trl.txt
   ```
   Format: `utterance_id bonafide/spoof`

3. **Create lab2utt mapping**
   ```bash
   ./tools/utt2lab_to_lab2utt.pl ${data}/${dset}/utt2lab
   ```
   Groups utterances by label

4. **Compute durations**
   ```bash
   python tools/wav2dur.py ${data}/${dset}/wav.scp ${data}/${dset}/utt2dur
   ```
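
For readers who prefer Python to Perl, step 3 can be approximated as follows. This sketch only mirrors the described behaviour of `utt2lab_to_lab2utt.pl` (grouping utterances by label); the output layout `label utt_1 utt_2 ...` and the utterance IDs are assumptions made for illustration.

```python
from collections import defaultdict


def utt2lab_to_lab2utt(utt2lab_lines):
    """Group utterance IDs by label.

    Each input line is "<utterance_id> <label>"; each output line is
    "<label> <utt_1> <utt_2> ..." (assumed output layout).
    """
    groups = defaultdict(list)
    for line in utt2lab_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        utt, lab = parts
        groups[lab].append(utt)
    # Sort labels so the output order is deterministic
    return [f"{lab} {' '.join(utts)}" for lab, utts in sorted(groups.items())]


# Hypothetical utterance IDs
lines = ["CON_T_0000001 spoof", "LA_T_0000002 bonafide", "CON_T_0000003 spoof"]
lab2utt = utt2lab_to_lab2utt(lines)
```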

### Output Files
```
data/{train,dev,eval}/
  ├── wav.scp      # Audio paths
  ├── utt2lab      # Utterance labels
  ├── lab2utt      # Label-to-utterance mapping
  └── utt2dur      # Audio durations
```

## Stage 3: Model Training (Overview)

### Command

```bash
torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:$PORT \
  --nnodes=1 --nproc_per_node=$num_gpus \
  wedefense/bin/train.py --config $config \
    --exp_dir ${exp_dir} \
    --gpus $gpus \
    --num_avg ${num_avg} \
    --data_type "${data_type}" \
    --train_data ${data}/train/${data_type}.list \
    --train_label ${data}/train/utt2lab
```

### Training Process

1. **Data Loading**: Batch sampling from shard/raw format
2. **Frontend**: Extract XLSR-53 features (Layer 12)
3. **Augmentation**: Optional spec augmentation, speed perturbation
4. **Forward**: Encoder → Pooling → Projection
5. **Loss**: Arc Margin Softmax loss
6. **Optimization**: AdamW with learning rate scheduling

### Key Training Parameters

- **Batch size**: Typically 64-128
- **Learning rate**: 1e-4 with warmup
- **Epochs**: 50-100 with early stopping
- **Checkpointing**: Save every epoch

### Output

```
exp/singlereso_utt_xlsr_53_ft_backend_Res1D/
  ├── config.yaml
  ├── models/
  │   ├── model_1.pt
  │   ├── model_2.pt
  │   └── ...
  └── tensorboard/
```

## Stage 4: Model Averaging

### Purpose
Average the top-N best model checkpoints to improve robustness.

### Command

```bash
python wedefense/bin/average_model.py \
  --dst_model $exp_dir/models/avg_model.pt \
  --src_path $exp_dir/models \
  --num 10
```

### Process

1. Identify top-10 checkpoints by validation performance
2. Load state dictionaries
3. Average parameters: $\theta_{\text{avg}} = \frac{1}{N}\sum_{i=1}^{N}\theta_i$
4. Save averaged model
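
The averaging in step 3 is an element-wise mean over checkpoints. The sketch below shows only that arithmetic, with NumPy arrays standing in for model tensors; `average_state_dicts` is a hypothetical helper, and the actual `average_model.py` may select and load checkpoints differently.

```python
import numpy as np


def average_state_dicts(state_dicts):
    """Element-wise mean of N parameter dictionaries with identical keys."""
    n = len(state_dicts)
    if n == 0:
        raise ValueError("need at least one checkpoint")
    keys = state_dicts[0].keys()
    # theta_avg[k] = (1 / N) * sum_i theta_i[k]
    return {k: sum(sd[k] for sd in state_dicts) / n for k in keys}


# Two toy "checkpoints" with one weight vector and one bias each
ckpts = [
    {"w": np.array([1.0, 2.0]), "b": np.array([0.0])},
    {"w": np.array([3.0, 4.0]), "b": np.array([1.0])},
]
avg = average_state_dicts(ckpts)
```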

### Output

```
exp/singlereso_utt_xlsr_53_ft_backend_Res1D/models/avg_model.pt
```

This averaged model is used for all subsequent stages.

## Stage 9: XAI Extraction with Grad-CAM

### Script: `wedefense/bin/XAI_GradCam_infer.py`

### Command

```bash
CUDA_VISIBLE_DEVICES=0 python wedefense/bin/XAI_GradCam_infer.py \
  --config ${exp_dir}/config.yaml \
  --model_path $exp_dir/models/avg_model.pt \
  --data_type "shard" \
  --data_list ${data}/dev/shard.list \
  --batch_size 1 \
  --num_workers 1 \
  --num_classes 2 \
  --xai_scores_path ${exp_dir}/xai_scores/dev.pkl
```

### Step-by-Step Process

#### 1. Model Preparation

```python
# Load pretrained model
model = get_model(configs['model'])(**configs['model_args'])
load_checkpoint(model, model_path)

# Wrap with projection head
projection = get_projection(configs['projection_args'])
full_model = FullModel(model, projection, test_conf)
```

#### 2. Target Layer Selection

```python
# For SSL-Res1D, target the final pooling layer
target_layer = [full_model.encoder.stat_pooling]
```

**Why this layer?**
- Final representation before classification
- Captures high-level temporal features
- Maintains temporal resolution

#### 3. Grad-CAM Initialization

```python
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

cam = GradCAM(model=full_model, target_layers=target_layer)
```

#### 4. Per-Utterance Extraction

For each audio utterance:

```python
# Load audio
wavs = batch['wav'].float().to(device)  # Shape: (1, wav_length)

# Target spoof class (class 1)
targets = [ClassifierOutputTarget(1)]

# Extract Grad-CAM heatmap
cam_output = cam(input_tensor=wavs, targets=targets)
# cam_output shape: (temporal_frames,) ranging [0, 1]
```

#### 5. Save Results

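```python
import pickle

# One entry per utterance: [[utterance_id], heatmap_as_list]
results = []
for utt, heatmap in zip(utterance_ids, cam_outputs):
    results.append([[utt], heatmap.tolist()])

# Serialize all heatmaps to a single pickle file
with open(xai_scores_path, 'wb') as f:
    pickle.dump(results, f)
```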

### Output Format

```python
# xai_scores/dev.pkl structure:
[
  [["utt_id_1"], [0.12, 0.23, 0.89, ..., 0.34]],  # Heatmap for utterance 1
  [["utt_id_2"], [0.08, 0.15, 0.76, ..., 0.21]],  # Heatmap for utterance 2
  ...
]
```

Each heatmap is a 1D array where:
- **Length**: Number of temporal frames
- **Values**: [0, 1] indicating activation strength
- **High values**: Model focuses on these regions for spoof detection
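
A short sketch of reading this structure and attaching a start time to each frame. The 20 ms frame shift is an assumption: it is the stride of wav2vec 2.0-style encoders at 16 kHz, but the real value depends on the target layer. The in-memory pickle with made-up utterance IDs stands in for `xai_scores/dev.pkl`.

```python
import io
import pickle

FRAME_SHIFT_S = 0.02  # assumed: 20 ms per heatmap frame

# Stand-in for xai_scores/dev.pkl, built in memory with the documented layout
buffer = io.BytesIO()
pickle.dump([[["utt_a"], [0.1, 0.9, 0.8]], [["utt_b"], [0.2, 0.1]]], buffer)
buffer.seek(0)

xai_results = pickle.load(buffer)

# Map utterance ID -> list of (frame_start_time_s, activation) pairs
heatmaps = {}
for (utt_id,), values in xai_results:
    heatmaps[utt_id] = [(round(i * FRAME_SHIFT_S, 3), v) for i, v in enumerate(values)]
```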

## Stage 10: XAI Score Analysis

### Script: `wedefense/bin/XAI_Score_analysis.py`

### Command

```bash
python3 wedefense/bin/XAI_Score_analysis.py \
  --set dev \
  --pkl_path ${exp_dir}/xai_scores/dev.pkl \
  --vad_path "$VAD_PATH"
```

### Analysis Components

#### 1. Load XAI Scores and VAD Information

```python
# Load XAI heatmaps
with open(pkl_path, 'rb') as f:
    xai_results = pickle.load(f)

# Load voice activity detection (optional)
# VAD helps focus on speech regions only
vad_info = load_vad(vad_path)
```

#### 2. Compute Statistics

For each utterance:

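```python
import numpy as np

heatmap = np.array(xai_result[1])

# Basic statistics
mean_activation = np.mean(heatmap)
max_activation = np.max(heatmap)
std_activation = np.std(heatmap)

# Temporal analysis. `find_peaks` and `group_consecutive_peaks` are assumed to be
# helpers from the analysis script: the first returns frame indices above the
# threshold, the second merges adjacent indices into regions. Note that
# `scipy.signal.find_peaks` interprets `threshold` differently (height relative
# to neighbouring samples).
peak_indices = find_peaks(heatmap, threshold=0.5)
peak_regions = group_consecutive_peaks(peak_indices)
```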
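
`group_consecutive_peaks` is not defined in the snippet above. One possible implementation, offered as an assumption about its intent rather than the project's actual helper, merges sorted frame indices into inclusive `(first, last)` runs:

```python
def group_consecutive_peaks(indices):
    """Merge sorted frame indices into inclusive (first, last) runs."""
    regions = []
    for idx in indices:
        idx = int(idx)
        if regions and idx == regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], idx)  # extend the current run
        else:
            regions.append((idx, idx))           # start a new run
    return regions


regions = group_consecutive_peaks([3, 4, 5, 9, 12, 13])
# regions == [(3, 5), (9, 9), (12, 13)]
```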

#### 3. Segment Detection

**Threshold-based segmentation:**

```python
threshold = 0.5  # Tunable parameter
spoofed_mask = heatmap > threshold

# Find continuous regions as (start, end) frame indices, end exclusive
segments = []
in_segment = False
for t, is_spoof in enumerate(spoofed_mask):
    if is_spoof and not in_segment:
        start = t
        in_segment = True
    elif not is_spoof and in_segment:
        end = t
        segments.append((start, end))
        in_segment = False

# Close a segment that runs to the last frame
if in_segment:
    segments.append((start, len(spoofed_mask)))
```
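
A vectorized alternative to the loop above, sketched with NumPy. It also covers runs that touch the first or last frame and converts frame indices to seconds. The function name `mask_to_segments` and the 20 ms frame shift are assumptions for illustration.

```python
import numpy as np


def mask_to_segments(mask, frame_shift_s=0.02):
    """Return (start_s, end_s) for every run of True frames; end is exclusive."""
    mask = np.asarray(mask, dtype=bool)
    # Pad with False so runs touching either edge still produce both boundaries
    padded = np.concatenate(([False], mask, [False])).astype(int)
    edges = np.flatnonzero(np.diff(padded))  # alternating start/end frame indices
    starts, ends = edges[::2], edges[1::2]
    return [(float(s * frame_shift_s), float(e * frame_shift_s))
            for s, e in zip(starts, ends)]


heatmap = np.array([0.1, 0.7, 0.9, 0.2, 0.1, 0.6, 0.8])
segments = mask_to_segments(heatmap > 0.5)
```

For this toy heatmap the function returns two segments, roughly 0.02-0.06 s and 0.10-0.14 s; the second one ends at the final frame.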

#### 4. Visualization

Generate plots for each utterance:

**A. Temporal Activation Profile**
```python
plt.figure(figsize=(12, 4))
plt.plot(time_axis, heatmap, linewidth=2, color='red')
plt.fill_between(time_axis, heatmap, alpha=0.3, color='red')
plt.axhline(y=threshold, linestyle='--', color='blue', label='Threshold')
plt.xlabel('Time (s)')
plt.ylabel('Activation')
plt.title(f'XAI Temporal Activation - {utterance_id}')
plt.legend()
```

**B. Spectrogram with Heatmap Overlay**
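```python
# Load audio and compute a dB-scaled spectrogram
audio, sr = librosa.load(audio_path, sr=16000)
D = librosa.amplitude_to_db(np.abs(librosa.stft(audio)))

# Resample the heatmap to the STFT frame count so both share one time axis
src = np.arange(len(heatmap))
dst = np.linspace(0, len(heatmap) - 1, D.shape[1])
heatmap_2d = np.tile(np.interp(dst, src, heatmap), (D.shape[0], 1))  # Repeat along frequency

# Draw the spectrogram, then overlay the heatmap
plt.imshow(D, aspect='auto', origin='lower', cmap='gray')
plt.imshow(heatmap_2d, aspect='auto', origin='lower', cmap='hot', alpha=0.6)
```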

**C. Detected Segment Boundaries**
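```python
# Mark detected spoofed regions; label only the first span so the legend has one entry.
# start/end are frame indices: convert them to seconds if the x-axis is in seconds.
for i, (start, end) in enumerate(segments):
    plt.axvspan(start, end, alpha=0.3, color='red',
                label='Detected Spoof' if i == 0 else None)
```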

#### 5. Aggregate Analysis

**Compare Bonafide vs Spoof distributions:**

```python
# Separate by ground truth label
bonafide_activations = []
spoof_activations = []

for result, label in zip(xai_results, labels):
    mean_act = np.mean(result[1])
    if label == 'bonafide':
        bonafide_activations.append(mean_act)
    else:
        spoof_activations.append(mean_act)

# Plot distributions
plt.hist(bonafide_activations, bins=50, alpha=0.5, label='Bonafide', color='green')
plt.hist(spoof_activations, bins=50, alpha=0.5, label='Spoof', color='red')
plt.xlabel('Mean Activation')
plt.ylabel('Count')
plt.legend()
```

### Output

```
${exp_dir}/xai_scores/
  ├── dev.pkl                    # Raw heatmaps
  └── analysis/
      ├── temporal_profiles/     # Per-utterance plots
      ├── segment_detection/     # Detected boundaries
      ├── statistics.csv         # Aggregate stats
      └── distribution.png       # Bonafide vs Spoof comparison
```

## Interpreting XAI Results

### Activation Patterns and Their Meanings

| Pattern | Visual Appearance | Interpretation | Example Scenario |
|---------|-------------------|----------------|------------------|
| **Sharp Peaks** | 📈 Sudden spikes at specific time points | Splice boundaries detected | Partially spoofed audio with clear transitions |
| **Sustained High Activation** | 🌊 Long regions with elevated values | Continuous spoofed segment | TTS-generated insertion |
| **Low Flat Profile** | 📉 Consistently low values | Genuine speech | Bonafide utterance |
| **Multiple Peaks** | 🎯 Several distinct high regions | Multiple spoofed insertions | Complex partial spoofing |
| **Gradual Rise/Fall** | 📊 Smooth transitions | Soft boundaries or gradual blending | Advanced synthesis with smoothing |

### Decision Guidelines

#### For Bonafide Audio:
- ✅ Expected: Low mean activation (<0.3)
- ✅ Expected: Small standard deviation (<0.15)
- ✅ Expected: No sustained high-activation regions

#### For Partially Spoofed Audio:
- ✅ Expected: Moderate to high mean activation (>0.4)
- ✅ Expected: High variance in temporal profile
- ✅ Expected: Clear peaks corresponding to fake segments
- ⚠️ Watch for: Peaks aligning with VAD boundaries (may indicate model bias)

### Common Pitfalls

1. **Edge Effects**: High activation at utterance boundaries may be artifacts
   - **Solution**: Ignore first/last 100ms

2. **VAD Correlation**: Model may focus on silence/non-speech regions
   - **Solution**: Compare XAI with VAD labels

3. **Threshold Sensitivity**: Different thresholds yield different segmentations
   - **Solution**: Use multiple thresholds (0.3, 0.5, 0.7) for robustness

4. **Model Overfitting**: Consistent patterns across all spoof types
   - **Solution**: Analyze per-algorithm breakdown
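
The first two pitfalls can be checked numerically. The sketch below trims 100 ms from each end of a heatmap and measures how much of the remaining activation falls on speech frames. Both helper names, the 20 ms frame shift, and the assumption that the VAD mask is available at the heatmap frame rate are illustrative choices, not part of `XAI_Score_analysis.py`.

```python
import numpy as np


def trim_edges(heatmap, edge_s=0.1, frame_shift_s=0.02):
    """Drop frames within edge_s seconds of either utterance boundary."""
    n_edge = int(round(edge_s / frame_shift_s))
    heatmap = np.asarray(heatmap, dtype=float)
    if len(heatmap) <= 2 * n_edge:
        return heatmap[:0]  # too short: nothing left after trimming
    return heatmap[n_edge:len(heatmap) - n_edge]


def speech_activation_share(heatmap, speech_mask):
    """Fraction of total activation that falls on frames marked as speech."""
    heatmap = np.asarray(heatmap, dtype=float)
    speech_mask = np.asarray(speech_mask, dtype=bool)
    total = heatmap.sum()
    return float(heatmap[speech_mask].sum() / total) if total > 0 else 0.0


# Toy heatmap with strong activation at both edges
heatmap = np.array([0.9] * 5 + [0.1, 0.2, 0.8, 0.7] + [0.9] * 5)
core = trim_edges(heatmap)  # removes 5 frames per side
share = speech_activation_share(core, [False, False, True, True])
```

A low share would suggest that the model attends mostly to non-speech regions, which is the VAD correlation described in pitfall 2.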

### Validation Checklist

✅ Do activation peaks align with known spoofed segments (if ground truth available)?  
✅ Are bonafide utterances consistently low-activation?  
✅ Do different spoofing algorithms show distinct patterns?  
✅ Are high activations focused on speech regions (not silence)?  
✅ Can you aurally perceive artifacts in high-activation regions?
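
When frame-level labels are available, the first checklist item can be quantified. The following is a minimal sketch, assuming the labels are sampled at the same frame rate as the heatmap; `frame_precision_recall` and the toy labels are hypothetical.

```python
import numpy as np


def frame_precision_recall(heatmap, spoof_labels, threshold=0.5):
    """Precision and recall of thresholded activation against frame labels."""
    pred = np.asarray(heatmap) > threshold
    truth = np.asarray(spoof_labels, dtype=bool)
    tp = np.sum(pred & truth)  # frames flagged and truly spoofed
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / truth.sum() if truth.sum() else 0.0
    return float(precision), float(recall)


heatmap = [0.1, 0.8, 0.9, 0.6, 0.2, 0.1]
labels = [0, 1, 1, 0, 0, 0]  # hypothetical frame labels, 1 = spoofed
precision, recall = frame_precision_recall(heatmap, labels)
```

Here three frames exceed the threshold and two of them are labelled spoofed, so precision is 2/3 while recall is 1.0.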

## Practical Usage Guide

### Running the Complete Pipeline

#### 1. Setup Environment

```bash
cd egs/detection/partialspoof/x12_ssl_res1d
source path.sh
```

#### 2. Configure Paths

Edit `run.sh`:
```bash
PS_dir=/path/to/PartialSpoof/database
data=/path/to/output/data
config=conf/singlereso_utt_xlsr_53_ft_backend_Res1D.yaml
exp_dir=exp/singlereso_utt_xlsr_53_ft_backend_Res1D
VAD_PATH=/path/to/vad_annotations  # Optional
```

#### 3. Run Data Preparation (Stage 1-2)

```bash
bash run.sh --stage 1 --stop_stage 2
```

#### 4. Train Model (Stage 3-4)

```bash
bash run.sh --stage 3 --stop_stage 4 --gpus "[0]"
```

**Training time**: ~24-48 hours on a single GPU

#### 5. Evaluate Model (Stage 5-7)

```bash
bash run.sh --stage 5 --stop_stage 7
```

Check performance:
```
EER: X.XX%
min t-DCF: X.XXX
```

#### 6. Extract XAI (Stage 9)

```bash
bash run.sh --stage 9 --stop_stage 9 --gpus "[0]"
```

**Extraction time**: ~1-2 hours for the eval set

#### 7. Analyze XAI (Stage 10)

```bash
bash run.sh --stage 10 --stop_stage 10
```

### Customization Options

#### Change Target Layer

In `wedefense/bin/XAI_GradCam_infer.py`:
```python
# Original: final pooling layer
target_layer = [full_model.encoder.stat_pooling]

# Alternative: intermediate layer
target_layer = [full_model.encoder.layer4]  # Earlier features
```

#### Adjust Detection Threshold

In `XAI_Score_analysis.py`:
```python
# Default threshold
threshold = 0.5

# Stricter detection (fewer false positives)
threshold = 0.7

# More sensitive (catch subtle spoofs)
threshold = 0.3
```
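
To see how sensitive a result is to this choice, the three thresholds can be compared on the same heatmap. This is a small sketch; `flagged_fraction` is a hypothetical helper, not part of `XAI_Score_analysis.py`.

```python
import numpy as np


def flagged_fraction(heatmap, thresholds=(0.3, 0.5, 0.7)):
    """Share of frames above each threshold, keyed by threshold."""
    heatmap = np.asarray(heatmap, dtype=float)
    return {t: float(np.mean(heatmap > t)) for t in thresholds}


sweep = flagged_fraction([0.1, 0.4, 0.6, 0.8, 0.2])
```

For this toy heatmap the flagged share drops from 60% at 0.3 to 40% at 0.5 and 20% at 0.7; a segmentation that changes this much across thresholds should be interpreted with care.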

#### Target Different Class

```python
# Original: target spoof class
targets = [ClassifierOutputTarget(1)]

# Alternative: target bonafide class (what makes it genuine?)
targets = [ClassifierOutputTarget(0)]
```

## Summary

This tutorial covered the complete workflow for XAI-based partially spoofed audio detection:

### Key Takeaways

✅ **Pipeline Architecture**
   - 10-stage pipeline from data to XAI analysis
   - SSL-Res1D model with XLSR-53 frontend
   - Grad-CAM for temporal activation mapping

✅ **XAI Extraction Process**
   - Target layer selection critical for interpretability
   - Per-utterance temporal heatmaps
   - Batch processing for efficiency

✅ **Result Interpretation**
   - Activation patterns indicate spoofed regions
   - Threshold-based segment detection
   - Statistical validation essential

✅ **Practical Considerations**
   - Model quality affects XAI quality
   - VAD integration improves focus
   - Cross-validation with audio inspection

### Limitations and Future Directions

⚠️ **Current Limitations:**
- Grad-CAM shows correlation, not causation
- Requires well-trained model
- Threshold selection is dataset-dependent
- May miss subtle artifacts

🔬 **Future Work:**
- Multi-layer XAI fusion
- Attention-based explainability
- Frame-level ground truth comparison
- Real-time XAI for streaming audio

### Resources

📂 **Implementation**: `egs/detection/partialspoof/x12_ssl_res1d/`  
📄 **Paper**: [arxiv.org/abs/2406.02483](https://arxiv.org/abs/2406.02483)  
💻 **GitHub**: [github.com/zlin0/wedefense](https://github.com/zlin0/wedefense)  
📖 **Docs**: [wedefense.readthedocs.io](https://wedefense.readthedocs.io)

## References

1. **Partial Spoofing Detection**: Liu et al., "How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?", 2024 [[paper](https://arxiv.org/abs/2406.02483)]

2. **Grad-CAM**: Selvaraju et al., "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization", ICCV 2017 [[paper](https://arxiv.org/abs/1610.02391)]

3. **SSL Representations**: Baevski et al., "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations", NeurIPS 2020 [[paper](https://arxiv.org/abs/2006.11477)]

4. **XLSR**: Conneau et al., "Unsupervised Cross-lingual Representation Learning for Speech Recognition", Interspeech 2021 [[paper](https://arxiv.org/abs/2006.13979)]

5. **PartialSpoof Dataset**: Guo et al., "Partially Spoofed Audio Detection", ASVspoof 2019 [[paper](https://arxiv.org/abs/2105.08050)]

6. **WeDefense Framework**: [[GitHub](https://github.com/zlin0/wedefense)] [[Documentation](https://wedefense.readthedocs.io)]

7. **PyTorch Grad-CAM**: [[GitHub](https://github.com/jacobgil/pytorch-grad-cam)]