Detecting AI-generated fake voices using CNN-LSTM and spectrogram analysis.
With the rise of AI voice-cloning tools (ElevenLabs, VALL-E, etc.), it has become easy to generate fake audio that sounds convincingly like a real person. VoxGuard is a deep learning system that classifies a voice recording as genuine or AI-generated.
It works by:
- Extracting MFCC features from the audio (a compact representation of the sound spectrum)
- Passing them through a CNN to detect local patterns in the sound
- Passing through an LSTM to analyse how those patterns change over time
- Outputting a probability that the voice is fake
Audio File (.wav / .flac)
│
▼
MFCC Extraction (librosa)
→ shape: (time_steps × 40 coefficients)
│
▼
┌─────────────────┐
│ CNN Block 1 │ Conv2D(32) → BN → MaxPool → Dropout
│ CNN Block 2 │ Conv2D(64) → BN → MaxPool → Dropout
└─────────────────┘
│
▼
LSTM Layer (64 units)
→ Reads temporal patterns across the MFCC sequence
│
▼
Dense(32) → Dense(1, sigmoid)
│
▼
Output: probability of being FAKE
> 0.5 → FAKE | < 0.5 → REAL
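The diagram above can be sketched in Keras roughly as follows. This is an assumed reconstruction, not the repo's `model.py`: the kernel sizes, dropout rates, fixed `time_steps`, and the `Reshape` that bridges the CNN output to the LSTM are all my choices.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(time_steps=300, n_mfcc=40):
    """CNN-LSTM binary classifier over an MFCC 'image' of shape (time, n_mfcc, 1)."""
    inp = layers.Input(shape=(time_steps, n_mfcc, 1))

    # CNN Block 1: Conv2D(32) -> BN -> MaxPool -> Dropout
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Dropout(0.3)(x)

    # CNN Block 2: Conv2D(64) -> BN -> MaxPool -> Dropout
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Dropout(0.3)(x)

    # Collapse the frequency and channel axes so the LSTM
    # reads one feature vector per (downsampled) time step
    t, f, c = x.shape[1], x.shape[2], x.shape[3]
    x = layers.Reshape((t, f * c))(x)

    x = layers.LSTM(64)(x)
    x = layers.Dense(32, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)

    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```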
VoxGuard/
├── config.py # All settings — change hyperparameters here
├── extract_features.py # Step 1: Extract MFCC features from audio files
├── model.py # CNN-LSTM model definition
├── train.py # Step 2: Train the model
├── evaluate.py # Step 3: Print metrics + confusion matrix
├── predict.py # Step 4: Check any audio file
├── requirements.txt
├── data/
│ ├── real/ # Put genuine voice files here (.wav / .flac)
│ ├── fake/ # Put AI-spoofed voice files here
│ └── README.md # Dataset download instructions
└── results/
├── training_curves.png
└── confusion_matrix.png
```bash
git clone https://github.com/ramlasyaa/VoxGuard.git
cd VoxGuard

python3 -m venv venv
source venv/bin/activate     # macOS/Linux
# venv\Scripts\activate      # Windows

pip install -r requirements.txt
```

See data/README.md for instructions.
Recommended: ASVspoof 2019 LA partition.
data/
├── real/ ← copy genuine .wav files here
└── fake/ ← copy spoofed .wav files here
```bash
python extract_features.py
python train.py
python evaluate.py
python predict.py path/to/voice.wav
# or an entire folder
python predict.py path/to/audio_folder/
```

Results on the ASVspoof 2019 LA evaluation set:
| Metric | Score |
|---|---|
| Accuracy | ~91% |
| Precision | ~89% |
| Recall | ~93% |
| F1-Score | ~91% |
| ROC-AUC | ~0.96 |
Results may vary depending on dataset size and split.
| Component | Tool / Library |
|---|---|
| Language | Python 3.9+ |
| Deep Learning | TensorFlow / Keras |
| Audio Processing | Librosa |
| Features | MFCC (40 coefficients) |
| ML Metrics | Scikit-learn |
| Visualization | Matplotlib, Seaborn |
- MFCC — Mel-Frequency Cepstral Coefficients. Compact audio features that capture how the human ear perceives sound.
- CNN — Detects local spatial patterns in the MFCC "image".
- LSTM — Captures how those patterns evolve over time (temporal context).
- Binary Cross-Entropy — Loss function for real/fake binary classification.
- EarlyStopping — Prevents overfitting by stopping training when validation loss plateaus.
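The EarlyStopping behavior described above looks like this in Keras. A minimal sketch — the patience value of 5 and the `validation_split` in the commented `fit` call are assumptions, not the repo's settings:

```python
import tensorflow as tf

# Stop training when validation loss hasn't improved for 5 epochs,
# and roll back to the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,
)

# Typical use during training (X_train / y_train are placeholders):
# model.fit(X_train, y_train, validation_split=0.2,
#           epochs=50, callbacks=[early_stop])
```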
This project is based on our paper:
"VoxGuard: Fake Audio (Deepfake Voice) Detection Using Spectrogram Analysis"
Ram Lasya et al. — CONIT 2026 (IEEE)
Built by Ram Lasya · Amrita Vishwa Vidyapeetham