# Spoof Detection Tutorial with WeDefense
**Author:** Lin Zhang
**Date:** 2026-02-05
**Status:** Draft

# What is Spoof Detection?

Spoof detection (also known as anti-spoofing or fake audio detection) aims to detect whether an input audio sample is genuine (bonafide) or artificially generated/modified (spoofed). This is crucial for protecting automatic speaker verification systems and ensuring the authenticity of audio content.

## Task Definition

Given an audio input $x$, the goal is to produce a score $s$ that indicates how likely the input is genuine. 

<img src="../../_static/figures/detection.png" width="300" />

In WeDefense, the score $s$ is a log-likelihood ratio (LLR), obtained through the following pipeline and used for the final decision:

$$
x \xrightarrow{\text{Model}} \text{embedding} \xrightarrow{\text{Projection}} \text{logits} \xrightarrow{\text{Calibration}} \text{LLR score}
$$

The sign of the final log-likelihood ratio, $\text{LLR} = \log \frac{p(x|H_a)}{p(x|H_r)}$, determines the decision:
- **Positive LLR** → classified as bonafide (real)
- **Negative LLR** → classified as spoof (fake)

Here $H_a$ and $H_r$ denote the accept hypothesis (the input audio is real) and the reject hypothesis (the input audio is fake), respectively.
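As a minimal sketch of this decision rule, the LLR can be computed from the two likelihoods and thresholded at 0. The function names `llr_from_likelihoods` and `decide` and the likelihood values are illustrative assumptions, not part of WeDefense.

```python
import math


def llr_from_likelihoods(p_x_given_real: float, p_x_given_fake: float) -> float:
    """Return log(p(x|H_a) / p(x|H_r)); both likelihoods must be positive."""
    return math.log(p_x_given_real) - math.log(p_x_given_fake)


def decide(llr: float, threshold: float = 0.0) -> str:
    """Label an utterance as bonafide when its LLR exceeds the threshold."""
    return "bonafide" if llr > threshold else "spoof"


# Hypothetical likelihood values, chosen only for illustration.
llr = llr_from_likelihoods(0.8, 0.2)
print(f"LLR = {llr:.3f} -> {decide(llr)}")
```

Swapping the two likelihoods flips the sign of the LLR and therefore the decision.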


## Why use LLR instead of raw posterior from network?

While spoof detection can be framed as binary classification, we recommend using calibrated LLR scores instead of raw logits or posteriors for several reasons:

1. **Prior-aware calibration**: LLR incorporates the prior probability of spoofing attacks, making it more suitable for real-world deployment where attack frequencies vary.
2. **Interpretability**: LLR provides a principled decision threshold of 0, the point at which both hypotheses are equally likely, i.e., $p(x|H_a) = p(x|H_r)$.
3. **Robustness**: Calibrated scores are less sensitive to training data imbalance and generalize better across datasets.
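To illustrate the first point, the same LLR can be combined with different deployment priors through Bayes' rule, since the posterior log-odds equal the LLR plus the prior log-odds. The sketch below uses a hypothetical helper `posterior_bonafide` and made-up prior values; it is not WeDefense code.

```python
import math


def posterior_bonafide(llr: float, pi_spoof: float) -> float:
    """Posterior P(bonafide | x) from an LLR and an assumed spoof prior."""
    prior_log_odds = math.log((1.0 - pi_spoof) / pi_spoof)
    posterior_log_odds = llr + prior_log_odds
    return 1.0 / (1.0 + math.exp(-posterior_log_odds))


llr = 1.0  # a mildly bonafide-leaning score (illustrative value)
for pi_spoof in (0.05, 0.5, 0.9):
    p = posterior_bonafide(llr, pi_spoof)
    print(f"pi_spoof={pi_spoof:.2f} -> P(bonafide|x)={p:.3f}")
```

The score itself stays fixed; only the assumed prior changes how it is interpreted in a given deployment.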

**Note:**
- Genuine/Bona fide/Real: Speech spoken by a human, or authentic audio captured naturally from real sources.
- Spoof/Fake: Generated or modified audio (e.g., TTS or voice conversion).

# Step-by-Step Implementation in WeDefense

----
The implementation stages in WeDefense are:

- **Stage 1–2:** Data preparation and list generation.
- **Stage 3:** Model training.
- **Stage 4:** Model averaging and embedding extraction.
- **Stage 5:** Logit extraction (and optional posterior output via softmax).
- **Stage 6:** Score calibration (logits to LLR).
- **Stage 7:** Performance evaluation.

**Note:** 
1. WeDefense separates embedding extraction (Stage 4) from logit/posterior prediction (Stage 5) to keep the pipeline modular. This makes it easier to analyze or visualize embeddings, reuse the same embeddings with different back-end scoring methods, and debug each stage independently. 
2. For evaluation, we convert logits to LLR (Stage 6) to apply prior-aware calibration and obtain well-calibrated decision scores rather than using raw logits or posteriors directly.

In the rest of this notebook, we provide a step-by-step guide to running an anti-spoofing experiment with the WeDefense toolkit. We follow the structure of the `run.sh` script for the `detection/partialspoof/v03_resnet18` recipe on the PartialSpoof dataset.



## Prerequisites

1.  **WeDefense Installation:** Ensure you have successfully installed the WeDefense toolkit and all its dependencies.
2.  **Dataset:** This tutorial assumes you have access to the PartialSpoof dataset. The script will attempt to download it automatically if it's not found.
3.  **Environment:** Make sure you are running this notebook from the `egs/detection/partialspoof/v03_resnet18/` directory and that your conda environment has been set up successfully.
4.  **Hardware:** A GPU is highly recommended for the training stage (Stage 3).


## Initial Configuration

First, we set up all the necessary paths and parameters for our experiment. These are the same variables you would find at the top of the `run.sh` script.

In [None]:
import os

# --- Path and Data Configuration ---

# TODO: IMPORTANT! Please modify this path to your PartialSpoof database directory.
PS_dir = '/path/to/your/PartialSpoof/database'
# Directory to store prepared data files (wav.scp, utt2lab, etc.)
data_dir = 'data/partialspoof_tutorial'
# The format for the dataloader. 'shard' is recommended for large datasets
# as it groups audio files into .tar files, improving I/O efficiency.
# 'raw' loads individual files.
data_type = 'shard'

# --- Model and Experiment Configuration ---
# The configuration file for the model architecture and training parameters.
config = 'conf/resnet.yaml'
# Directory to save model checkpoints, logs, and results.
exp_dir = 'exp/resnet_tutorial'

# --- Execution Configuration ---
# Specify which GPUs to use, e.g., "[0]" or "[0,1]".
gpus = "[0]"
# Number of models to average for inference. >0 to use averaging, <=0 to use the single best model.
num_avg = -1
# Save a model checkpoint every N epochs.
save_epoch_interval = 5
# Patience for early stopping. <0 disables it.
early_stop_patience = -1
# How often to run validation (in epochs).
validate_interval = 1

## Stage 1: Data Preparation

In this stage, we process the raw PartialSpoof dataset into a standard format required by the toolkit. The `local/prepare_data.sh` script will:
1.  Download the dataset if it's not found.
2.  Create `wav.scp`: Maps a unique utterance ID to its audio file path. Format: `<wav_id> <path>`
3.  Create `utt2lab`: Maps each utterance ID to its label (`bonafide` or `spoof`). Format: `<wav_id> <label>`
4.  Create `lab2utt`: An inverted index of `utt2lab`. Format: `<label> <wav_id_1> <wav_id_2> ...`
5.  Create `utt2dur`: Maps each utterance ID to its duration in seconds. Format: `<wav_id> <duration>`

This process is run in parallel for the `train`, `dev`, and `eval` sets.
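The relationship between `utt2lab` and `lab2utt` can be sketched in a few lines of Python. This is only an illustration of the file formats described above, not the logic of `local/prepare_data.sh`; the helper name `invert_utt2lab` is hypothetical.

```python
from collections import defaultdict


def invert_utt2lab(lines):
    """Build a label -> [utterance IDs] index from '<wav_id> <label>' lines."""
    lab2utt = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        utt_id, label = line.split(maxsplit=1)
        lab2utt[label].append(utt_id)
    return dict(lab2utt)


# Illustrative entries in the utt2lab format.
utt2lab_lines = [
    "CON_T_0000029 spoof",
    "LA_T_1138215 bonafide",
    "CON_T_0000069 spoof",
]
for label, utts in invert_utt2lab(utt2lab_lines).items():
    print(label, " ".join(utts))
```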

Let's inspect the generated files for the train set to understand their format.

In [None]:
$ head -n 3 data/partialspoof/train/*
# ==> ./lab2utt <==
# spoof CON_T_0000029 CON_T_0000069 ...
# bonafide LA_T_1138215 LA_T_1271820 ...

# ==> ./utt2dur <==
# CON_T_0000000 2.74725
# CON_T_0000001 4.2501875
# CON_T_0000002 3.1415

# ==> ./utt2lab <==
# CON_T_0000029 spoof
# CON_T_0000069 spoof
# CON_T_0000072 spoof

# ==> ./wav.scp <==
# CON_T_0000000 /export/fs05/lzhan268/workspace/PUBLIC/PartialSpoof/database/train/con_wav/CON_T_0000000.wav
# CON_T_0000001 /export/fs05/lzhan268/workspace/PUBLIC/PartialSpoof/database/train/con_wav/CON_T_0000001.wav
# CON_T_0000002 /export/fs05/lzhan268/workspace/PUBLIC/PartialSpoof/database/train/con_wav/CON_T_0000002.wav


## Stage 2: Data Formatting and Augmentation

For efficient data loading, especially in distributed training, we convert our data lists into a `shard` format. This involves bundling multiple audio files and their labels into larger `.tar` files.

-   `tools/make_shard_list.py`: Creates the sharded dataset.
-   `tools/make_raw_list.py`: Creates a simple file list (used if `data_type="raw"`).
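The idea behind sharding can be sketched with the standard-library `tarfile` module. This is a simplified illustration, not the implementation of `tools/make_shard_list.py`: the helper `write_shards`, the member naming (`<utt_id>.wav` plus `<utt_id>.txt`), and the shard size are all assumptions.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def write_shards(items, out_dir, utts_per_shard=2):
    """Pack (utt_id, label, audio_bytes) items into numbered .tar shards."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    for start in range(0, len(items), utts_per_shard):
        shard_path = out_dir / f"shards_{len(shard_paths):09d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for utt_id, label, audio in items[start:start + utts_per_shard]:
                # Store the audio and its label side by side under one key.
                for suffix, payload in ((".wav", audio), (".txt", label.encode())):
                    info = tarfile.TarInfo(name=utt_id + suffix)
                    info.size = len(payload)
                    tar.addfile(info, io.BytesIO(payload))
        shard_paths.append(str(shard_path))
    return shard_paths


# Dummy byte strings stand in for real audio content.
items = [(f"utt{i}", "spoof" if i % 2 else "bonafide", b"\x00" * 16) for i in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    shards = write_shards(items, tmp)
    print(len(shards), "shards written")
```

Reading a few large archives sequentially is generally cheaper than opening many small files, which is the motivation for the `shard` data type.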

We will also prepare the MUSAN (noise) and RIRS (reverberation) datasets for data augmentation during training. This step creates `wav.scp` files for them.

In [None]:
$ head -n 3 data/partialspoof/train/shard.list
# data/partialspoof/train/shards/shards_000000000.tar
# data/partialspoof/train/shards/shards_000000001.tar
# data/partialspoof/train/shards/shards_000000002.tar

$ ls data/partialspoof/train/shards
# shards_000000000.tar shards_000000001.tar shards_000000002.tar ...

## Stage 3: Training

Now we are ready to train the model. We use `torchrun` for distributed training, which is efficient even on a single machine with multiple GPUs.

The training process will:
- Load the model architecture and training parameters from the YAML config file (`conf/resnet.yaml`).
- Use the prepared data lists (`shard.list` or `raw.list`).
- Save model checkpoints and logs to the experiment directory (`exp/resnet_tutorial`).
- Periodically evaluate performance on the development set.

In [None]:
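# Note: ${port}, ${num_gpus}, and ${data} are not set in the configuration cell above; define them before running this command.
torchrun --rdzv_backend=c10d --rdzv_endpoint=$(hostname):${port} --nnodes=1 --nproc_per_node=$num_gpus \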
    wedefense/bin/train.py --config $config \
      --exp_dir "${exp_dir}" \
      --gpus "$gpus" \
      --num_avg "${num_avg}" \
      --data_type "${data_type}" \
      --train_data "${data}/train/${data_type}.list" \
      --train_label "${data}/train/utt2lab" \
      --val_data "${data}/dev/${data_type}.list" \
      --val_label "${data}/dev/utt2lab" \
      --save_epoch_interval "${save_epoch_interval}" \
      --early_stop_patience "${early_stop_patience}" \
      --validate_interval "${validate_interval}"
      # Add the following lines if you have prepared augmentation data
      # --reverb_data data/rirs/lmdb \
      # --noise_data data/musan/lmdb

More about variables used in training please refer to 


## Stage 4: Model Averaging & Embedding Extraction

After training, we can proceed with inference. We have two options for the model to use:
1.  **Best Model:** The single checkpoint that performed best on the development set (`best_model.pt`).
2.  **Averaged Model:** An average of the last `num_avg` checkpoints. This often yields more robust performance.

First, we average the model if `num_avg > 0`. Then, we use the chosen model to extract embeddings (fixed-size vector representations) for each utterance in the `dev` and `eval` sets. These embeddings are the input to the final classification layer.
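Conceptually, model averaging takes the element-wise mean of each parameter across several checkpoints. The toy sketch below uses plain Python lists in place of tensors and hypothetical parameter names; it illustrates the idea only and is not the code in `wedefense/bin/average_model.py`.

```python
def average_checkpoints(checkpoints):
    """Average parameter values element-wise across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of floats.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(vals) / len(checkpoints) for vals in columns]
    return averaged


# Three toy "checkpoints" with hypothetical parameter names.
ckpts = [
    {"fc.weight": [1.0, 2.0], "fc.bias": [0.0]},
    {"fc.weight": [3.0, 4.0], "fc.bias": [0.3]},
    {"fc.weight": [5.0, 6.0], "fc.bias": [0.6]},
]
print(average_checkpoints(ckpts))
```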

In [None]:
# Determine which model path to use based on num_avg
if num_avg > 0:
    model_path = os.path.join(exp_dir, 'models/avg_model.pt')
else:
    model_path = os.path.join(exp_dir, 'models/best_model.pt')

print(f"Using model: {model_path}")

In [None]:
echo "Starting Stage 4: Model Averaging and Embedding Extraction..."

if [ ${num_avg} -gt 0 ]; then
  echo "Averaging the last ${num_avg} models..."
  python wedefense/bin/average_model.py \
    --dst_model "${exp_dir}/models/avg_model.pt" \
    --src_path "${exp_dir}/models" \
    --num "${num_avg}"
fi

echo "Extracting embeddings..."
# We use a helper script for parallel embedding extraction
local/extract_emb.sh \
   --exp_dir "$exp_dir" --model_path "$model_path" \
   --nj "$nj" --gpus "$gpus" --data_type "$data_type" --data "${data}"

echo "Stage 4 finished."

## Stage 5: Logit Extraction

With the embeddings extracted, we now pass them through the final classification layer of the model to get the raw output scores, known as **logits**. These logits represent the model's confidence for each class (`bonafide` vs. `spoof`) before any normalization.
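To make the difference between logits and posteriors concrete, the following sketch applies a softmax to a pair of made-up logits. The class order `[spoof, bonafide]` is an assumption for illustration; the actual order depends on the label mapping used during training.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits in the assumed order [spoof, bonafide].
logits = [0.4, 2.1]
p_spoof, p_bonafide = softmax(logits)
print(f"P(spoof)={p_spoof:.3f}, P(bonafide)={p_bonafide:.3f}")
```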

In [None]:
echo "Starting Stage 5: Extracting logits..."

for dset in dev eval; do
  echo "Processing ${dset} set..."
  mkdir -p "${exp_dir}/posteriors/${dset}"
  python wedefense/bin/infer.py --model_path "$model_path" \
    --config "${exp_dir}/config.yaml" \
    --num_classes 2 \
    --embedding_scp_path "${exp_dir}/embeddings/${dset}/embedding.scp" \
    --out_path "${exp_dir}/posteriors/${dset}"
done

echo "Stage 5 finished."

## Stage 6: Score Calibration (Logits to LLR)

The raw logits from the model are not always well-calibrated. To make them more interpretable and robust for decision-making, we convert them into Log-Likelihood Ratios (LLR). This process calibrates the scores based on the prior probabilities of the classes observed in the training data.

Consistent with the task definition above, a positive LLR score indicates a prediction of 'bonafide', while a negative score indicates 'spoof'.

In [None]:
echo "Starting Stage 6: Converting logits to Log-Likelihood Ratios (LLR)..."

# First, calculate the number of bonafide vs. spoof utterances in the training set.
# This is used for calibration.
cut -f2 -d" " "${data}/train/utt2lab" | sort | uniq -c | awk '{print $2 " " $1}' > "${data}/train/lab2num_utts"
echo "Training label counts:"
cat "${data}/train/lab2num_utts"

for dset in dev eval; do
    echo "Calibrating scores for ${dset} set..."
    python wedefense/bin/logits_to_llr.py \
        --logits_scp_path "${exp_dir}/posteriors/${dset}/logits.scp" \
        --training_counts "${data}/train/lab2num_utts" \
        --train_label "${data}/train/utt2lab" \
        --pi_spoof 0.05 # Assumed prior probability of a spoof trial

done

echo "Stage 6 finished."

**Logits → LLR (binary case)**

Let the model output logits $s_{\text{spoof}}$ and $s_{\text{bonafide}}$. The posterior is

$$
P(\text{spoof}\mid x)=\frac{e^{s_{\text{spoof}}}}{e^{s_{\text{spoof}}}+e^{s_{\text{bonafide}}}},\quad
P(\text{bonafide}\mid x)=\frac{e^{s_{\text{bonafide}}}}{e^{s_{\text{spoof}}}+e^{s_{\text{bonafide}}}}
$$

The log-likelihood ratio is

$$
\text{LLR}(x)=\log\frac{P(\text{bonafide}\mid x)}{P(\text{spoof}\mid x)}-\log\frac{1-\pi_{\text{spoof}}}{\pi_{\text{spoof}}}
$$

where $\pi_{\text{spoof}}$ is the prior spoof probability used for calibration.
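The binary conversion can be sketched generically: for whichever class is placed in the numerator, the LLR is the logit difference minus that class's prior log-odds. The function below is an illustrative assumption, not the implementation in `wedefense/bin/logits_to_llr.py`, and the logit values are made up.

```python
import math


def logits_to_llr(s_num: float, s_den: float, prior_num: float) -> float:
    """LLR of the numerator class against the denominator class (binary case).

    s_num / s_den are the logits of the two classes and prior_num is the
    prior probability assumed for the numerator class.
    """
    # The softmax denominator cancels, so the posterior log-odds is simply
    # the difference between the two logits.
    posterior_log_odds = s_num - s_den
    prior_log_odds = math.log(prior_num / (1.0 - prior_num))
    return posterior_log_odds - prior_log_odds


# Hypothetical logits and prior, for illustration only.
s_bonafide, s_spoof, pi_spoof = 2.1, 0.4, 0.05
# Bonafide in the numerator, so a positive LLR favors bonafide.
llr = logits_to_llr(s_bonafide, s_spoof, prior_num=1.0 - pi_spoof)
print(f"LLR = {llr:.3f}")
```

Placing bonafide in the numerator matches the task definition, where a positive LLR means the input is classified as bonafide.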

## Stage 7: Performance Evaluation

Finally, we measure the performance of our system using the calibrated LLR scores. The primary metric for anti-spoofing is the **Equal Error Rate (EER)**.

- **EER:** The error rate at which the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR). A lower EER indicates better performance.

We will calculate the EER for both the development and evaluation sets.
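As a rough illustration of the metric, the sketch below approximates the EER by sweeping a threshold over the observed scores and picking the point where FAR and FRR are closest. It is a simplified stand-in for `wedefense/metrics/detection/evaluation.py`, whose exact procedure may differ, and the scores are invented.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted as bonafide when its score is >= the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # FAR: spoofed trials wrongly accepted; FRR: bonafide trials wrongly rejected.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Toy LLR scores, invented for illustration only.
bonafide = [2.5, 1.8, 0.9, -0.2]
spoof = [-3.1, -1.4, -0.5, 1.1]
print(f"EER = {compute_eer(bonafide, spoof):.2%}")
```

With so few trials the estimate is coarse; on real evaluation sets the FAR and FRR curves are much finer-grained.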

In [None]:
echo "Starting Stage 7: Measuring Performance..."

for dset in dev eval; do
  echo "Evaluating on ${dset} set..."
  
  # Prepare the ground truth key file in the required format: <utt_id>\t<label>
  key_file="${data}/${dset}/cm_key_file.txt"
  echo -e "filename\tcm-label" > "${key_file}"
  # The sed command replaces the first space with a tab
  sed 's/ /\t/' "${data}/${dset}/utt2lab" >> "${key_file}"

  # Run the evaluation script
  # The output will be displayed here and also saved to a file in the experiment directory.
  python wedefense/metrics/detection/evaluation.py \
      --m t1 \
      --cm "${exp_dir}/posteriors/${dset}/llr.txt" \
      --cm_key "${key_file}" 2>&1 | tee "${exp_dir}/results_${dset}.txt"
done

echo "Stage 7 finished. Results are saved in ${exp_dir}/"

## Conclusion

Congratulations! You have successfully completed all the stages of training and evaluating an anti-spoofing model on the PartialSpoof dataset.

You have learned how to:
- **Prepare** a dataset in the standard Kaldi-style format.
- **Format** the data for efficient training using shards.
- **Train** a ResNet-based model.
- **Extract** embeddings and logits for inference.
- **Calibrate** scores and **evaluate** the system's performance using the EER metric.

### Next Steps
- **Analyze the results:** Check the `results_eval.txt` file in your experiment directory for the final performance.
- **Experiment with hyperparameters:** Try changing the model architecture, learning rate, or other parameters in the `conf/resnet.yaml` file.
- **Embedding visualization:** You can also visualize the embeddings extracted in Stage 4 by following `wedefense/egs/embedding_visualization/embedding_visulization_umap.ipynb`.