AudioSR-GAN: Speech Bandwidth Extension (8 kHz → 22.05 kHz)

Overview

This project performs Audio Super‑Resolution (Speech Bandwidth Extension), transforming 8 kHz narrow‑band audio into 22.05 kHz wide‑band audio using:

A CNN‑based generator (SRResNet‑style)
A PatchGAN discriminator for realism
Mel‑spectrogram domain training
Neural vocoders (Vocos) for waveform reconstruction

Full Workflow (High‑Level)

  Low‑Resolution Audio (8 kHz)
            │
            ▼
     Upsample to 22.05 kHz
            │
            ▼
   Convert to Log‑Mel Spectrogram
            │
            ▼
   ┌──────────────────────────────┐
   │   CNN / SRResNet Generator   │
   └──────────────────────────────┘
            │
            ▼
   Predicted High‑Resolution Mel
            │
            ▼
   Neural Vocoder (HiFi‑GAN/Vocos)
            │
            ▼
  Final Reconstructed 22.05 kHz Audio
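
The "Upsample to 22.05 kHz" step is a non-integer rate change, so a rational resampler needs an up/down factor pair. The following is a minimal sketch of that arithmetic only; the helper names are hypothetical and are not taken from this repository. A polyphase resampler such as scipy.signal.resample_poly(x, up, down) accepts factors of this form, though the README does not say which resampler the project uses.

```python
from fractions import Fraction

LR_SR = 8_000    # low-resolution sample rate from the workflow above
HR_SR = 22_050   # target sample rate from the workflow above


def resample_factors(src_sr: int, dst_sr: int) -> tuple[int, int]:
    """Return (up, down) in lowest terms so that dst_sr == src_sr * up / down."""
    ratio = Fraction(dst_sr, src_sr)
    return ratio.numerator, ratio.denominator


def resampled_length(n_samples: int, up: int, down: int) -> int:
    """Output length when the result is rounded up (integer ceiling division)."""
    return -(-n_samples * up // down)


up, down = resample_factors(LR_SR, HR_SR)
print(up, down)                            # 441 160
print(resampled_length(LR_SR, up, down))   # 22050: one second in, one second out
```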

Model Architecture

Training Pipeline

Inference Pipeline

Mel configuration used across training, inference, and vocoder reconstruction

Sample rate: 22,050 Hz (fixed requirement — mel extractor MUST match vocoder SR)

n_mels: 80 (required by vocos-mel-22kHz)

n_fft: 1024 (internally forced)

Hop length: 256 (internally forced)

Window length: 1024 (internally forced)

Padding mode: 'same' (ensures correct temporal alignment)

Mel scale: HTK mel scale (default in extractor)

Log mel: apply_log = False → linear mel, because Vocos expects linear mel inputs
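
To make the HTK mel scale concrete, the sketch below applies the standard HTK formula and estimates how much of the 80-bin mel axis an 8 kHz recording can already fill. It assumes f_min = 0 Hz and f_max = Nyquist (11,025 Hz); the README does not state either value, so treat the bin count as indicative only.

```python
import math


def hz_to_mel_htk(freq_hz: float) -> float:
    """HTK mel formula: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


SAMPLE_RATE = 22_050
N_MELS = 80
F_MIN = 0.0                 # assumption: not stated in the README
F_MAX = SAMPLE_RATE / 2     # assumption: mel filters extend to Nyquist
LR_NYQUIST = 8_000 / 2      # highest frequency an 8 kHz recording can carry

span = hz_to_mel_htk(F_MAX) - hz_to_mel_htk(F_MIN)
fraction_below = (hz_to_mel_htk(LR_NYQUIST) - hz_to_mel_htk(F_MIN)) / span

print(f"{fraction_below:.1%} of the mel axis lies below {LR_NYQUIST:.0f} Hz")
print(f"roughly {round(fraction_below * N_MELS)} of {N_MELS} bins")
```

Under these assumptions roughly two thirds of the mel bins sit below 4 kHz, so the generator's job is mainly to fill in the remaining upper third.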

✔ Why these configs matter

80 mel bins are standard for speech models and are required by the Vocos vocoder used here.

apply_log = False prevents incompatibility with vocos‑mel‑22kHz (which internally applies its own log processing).

Fixing n_fft = 1024 and hop = 256 gives a hop of about 11.6 ms at 22,050 Hz, a sweet spot between temporal detail and GAN stability.
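
The hop duration follows directly from the hop length and sample rate (256 / 22,050 s, roughly 11.6 ms, often rounded to 12 ms). The sketch below works through that arithmetic. The frame-count helper assumes that 'same' padding means adding (win_length - hop_length) / 2 samples on each side with no centering; that is an assumption about the extractor, not something the README confirms.

```python
SAMPLE_RATE = 22_050
N_FFT = 1024
WIN_LENGTH = 1024
HOP_LENGTH = 256

hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE        # time step between frames
window_ms = 1000 * WIN_LENGTH / SAMPLE_RATE     # duration covered by one frame
frames_per_second = SAMPLE_RATE / HOP_LENGTH
overlap = 1 - HOP_LENGTH / WIN_LENGTH           # fraction shared by adjacent frames


def n_frames(n_samples: int) -> int:
    """Frame count under the assumed 'same' padding (no centering)."""
    padded = n_samples + (WIN_LENGTH - HOP_LENGTH)
    if padded < WIN_LENGTH:
        return 0
    return 1 + (padded - WIN_LENGTH) // HOP_LENGTH


print(f"hop: {hop_ms:.2f} ms, window: {window_ms:.2f} ms, overlap: {overlap:.0%}")
print(f"{frames_per_second:.1f} frames per second")
print(n_frames(SAMPLE_RATE), "frames for one second of audio")
```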

This consistent mel configuration ensures:

Stable generator training

Proper alignment between LR/HR mel pairs

Correct conditioning for both HiFi‑GAN and Vocos vocoders

Together, these settings strike a balance between time–frequency resolution, GAN stability, and compatibility with HiFi‑GAN/Vocos vocoders. The hop length of 256 helps keep reconstruction smooth and limits phase artifacts after vocoder synthesis. The 80‑mel dimension provides enough resolution for high-frequency detail without making training unstable or memory-heavy.

1. Dataset Preparation (8 kHz → 22.05 kHz)

Generate HR (22,050 Hz) & LR (8,000 Hz) pairs

python -m src.dataset.make_low_sr_dataset --in_root archive

Check Bandwidth of dataset

python -m src.dataset.check_bw --input data/hr/p225_001.wav
python -m src.dataset.audio_properties data/hr/p225_001.wav
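
As an illustration of what a bandwidth check measures, the sketch below finds the highest frequency bin that still holds meaningful energy in a short frame. It is not the repository's check_bw implementation: the function names, the -60 dB floor, and the naive DFT (chosen to avoid dependencies) are all assumptions made for this example.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive O(N^2) DFT magnitudes for bins 0..N/2 (fine for a short frame)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def effective_bandwidth_hz(frame, sample_rate, floor_db=-60.0):
    """Highest bin frequency whose magnitude is within floor_db of the peak."""
    mags = magnitude_spectrum(frame)
    peak = max(mags)
    if peak == 0:
        return 0.0
    threshold = peak * 10 ** (floor_db / 20)
    top_bin = max(k for k, m in enumerate(mags) if m >= threshold)
    return top_bin * sample_rate / len(frame)


# Synthetic frame: tones at 1 kHz and 3.5 kHz, sampled at 22.05 kHz.
# 441 samples give 50 Hz bins, so both tones land exactly on a bin.
SR = 22_050
frame = [
    math.sin(2 * math.pi * 1000 * i / SR) + 0.5 * math.sin(2 * math.pi * 3500 * i / SR)
    for i in range(441)
]
bandwidth = effective_bandwidth_hz(frame, SR)
print(f"effective bandwidth: {bandwidth:.0f} Hz")
```

A file that was recorded at 8 kHz and then upsampled should report a bandwidth at or below 4 kHz, whereas a genuine 22.05 kHz recording can extend toward 11 kHz.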

2. Train the CNN Baseline Model (or the CNN-GAN Model in Step 3)

python -m src.train --dataset_dir data --out_dir checkpoints --batch_size 32 --epochs 100

3. Train the GAN Model

python -m src.train_gan --dataset_dir data --out_dir checkpoints_gan --batch_size 32 --epochs 100

4. Standard Inference (Generator Only)

A. Run inference on entire HR directory

python -m src.infer --generator_ckpt checkpoints_gan/generator_final.pt --output_dir outputs_gan

B. Limit inference to N files

python -m src.infer --generator_ckpt checkpoints_gan/generator_final.pt --max_files 20 --output_dir outputs_gan

C. Inference on a single LR file

python -m src.infer --input_wav data/low_sr/p225_001.wav --generator_ckpt checkpoints/generator_final.pt --hr_dir data/hr --output_dir outputs_gan

5. Inference (Using Griffin-Lim Vocoder)

python -m src.infer --input_wav data/low_sr/p225_001.wav --generator_ckpt checkpoints/generator_final.pt --hr_dir data/hr --output_dir outputs --vocoder griffin

6. Quick Testing Utilities

python -m src.infer --input_wav data/low_sr/p225_001.wav
python -m src.infer --max_files 5

7. Evaluation

Evaluate SR quality (STOI, PESQ, LSD, SNR, MSE)

python -m src.utils.eval --sr_file outputs_gan/p225_001_sr.wav
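
Of the listed metrics, MSE and SNR are simple enough to sketch directly. The functions below use their textbook definitions on two equal-rate waveforms. They are an illustration, not the code in src.utils.eval, and trimming to the shorter signal is an assumption about how length mismatches might be handled.

```python
import math


def mse(reference, estimate):
    """Mean squared error over the overlapping samples."""
    n = min(len(reference), len(estimate))  # assumption: trim to the shorter signal
    return sum((r - e) ** 2 for r, e in zip(reference[:n], estimate[:n])) / n


def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB, treating (reference - estimate) as noise."""
    n = min(len(reference), len(estimate))
    signal = sum(r * r for r in reference[:n])
    noise = sum((r - e) ** 2 for r, e in zip(reference[:n], estimate[:n]))
    if noise == 0:
        return math.inf
    if signal == 0:
        return -math.inf
    return 10 * math.log10(signal / noise)


# Toy check: an estimate at 90% of the reference amplitude leaves a residual
# of 10% amplitude, i.e. 1% of the signal power, which is 20 dB.
reference = [math.sin(2 * math.pi * 440 * i / 22_050) for i in range(2205)]
estimate = [0.9 * r for r in reference]
print(f"MSE: {mse(reference, estimate):.6f}")
print(f"SNR: {snr_db(reference, estimate):.2f} dB")
```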

Evaluate Vocoder Reconstructions to Test Vocoder Quality

python -m src.utils.vocoder_eval --hr_dir data/test/ --max_files 5 --save_recon

Block Diagram (Mel‑Domain GAN)

   ┌───────────────────────────────────────────────┐
   │                 Low‑SR Mel (Input)            │
   └───────────────────────────────────────────────┘
                        │
                        ▼
          ┌──────────────────────────────┐
          │   SRResNet Generator (G)     │
          └──────────────────────────────┘
                        │
                        ▼
           Predicted High‑Resolution Mel
                        │
                        ▼
     ┌────────────────────────────────────────────┐
     │ PatchGAN Discriminator (D) – optional GAN  │
     └────────────────────────────────────────────┘
                        │
                        ▼
                 Adversarial Loss
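
A PatchGAN discriminator scores a grid of local patches rather than emitting a single real/fake value, and the adversarial loss is averaged over that grid. The sketch below shows the idea in plain Python. The README does not say which adversarial objective is used, so binary cross-entropy on logits is an assumption here (least-squares and hinge losses are common alternatives), and all function names are hypothetical.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy for a single logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def patch_loss(patch_logits, target: float) -> float:
    """Average the loss over every patch score in a 2-D grid."""
    flat = [v for row in patch_logits for v in row]
    return sum(bce_with_logits(v, target) for v in flat) / len(flat)


def discriminator_loss(real_logits, fake_logits) -> float:
    """D should label real mel patches 1 and generated patches 0."""
    return patch_loss(real_logits, 1.0) + patch_loss(fake_logits, 0.0)


def generator_adv_loss(fake_logits) -> float:
    """G is rewarded when D labels its patches as real."""
    return patch_loss(fake_logits, 1.0)


# Toy 2x3 grids of patch logits standing in for D's output on one mel.
real = [[2.0, 1.5, 3.0], [2.5, 1.0, 2.0]]
fake = [[-2.0, -1.0, 0.5], [-3.0, -0.5, -1.5]]
print(f"D loss: {discriminator_loss(real, fake):.4f}")
print(f"G adversarial loss: {generator_adv_loss(fake):.4f}")
```

In practice a term like this is usually combined with a reconstruction loss on the mel spectrogram; the weighting between the two is not specified in this README.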

Future Improvements

Diffusion‑based vocoders
Multi‑speaker embeddings
Perceptual losses (wav2vec2 / Encodec)
Large‑scale dataset training (VCTK, LibriSpeech)
