
Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction


Ye-Xin Lu, Yang Ai, Hui-Peng Du, Zhen-Hua Ling

Abstract: Speech bandwidth extension (BWE) refers to widening the frequency bandwidth range of speech signals, enhancing speech quality to sound brighter and fuller. This paper proposes a generative adversarial network (GAN) based BWE model with parallel prediction of Amplitude and Phase spectra, named AP-BWE, which achieves both high-quality and efficient wideband speech waveform generation. The proposed AP-BWE generator is entirely based on convolutional neural networks (CNNs). It features a dual-stream architecture with mutual interaction, where the amplitude stream and the phase stream communicate with each other and respectively extend the high-frequency components from the input narrowband amplitude and phase spectra. To improve the naturalness of the extended speech signals, we employ a multi-period discriminator at the waveform level and design a pair of multi-resolution amplitude and phase discriminators at the spectral level. Experimental results demonstrate that our proposed AP-BWE achieves state-of-the-art speech quality for BWE tasks targeting sampling rates of both 16 kHz and 48 kHz. In terms of generation efficiency, thanks to its all-convolutional architecture and all-frame-level operations, AP-BWE generates 48 kHz waveform samples 292.3 times faster than real-time on a single RTX 4090 GPU and 18.1 times faster than real-time on a single CPU. Notably, to our knowledge, AP-BWE is the first model to directly extend the high-frequency phase spectrum, which is beneficial for improving the effectiveness of existing BWE methods.
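As background for the parallel amplitude and phase prediction described above, here is a minimal NumPy sketch (not the model itself; the frame and hop sizes are illustrative) of how a waveform decomposes into the amplitude and phase spectra that the two streams operate on:

```python
import numpy as np

def amplitude_phase(x, n_fft=1024, hop=256):
    """Frame a waveform, window it, and split its short-time spectrum
    into amplitude and phase components."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=-1)   # complex short-time spectrum
    return np.abs(spec), np.angle(spec)   # amplitude, phase

# The complex spectrum is recovered exactly as amplitude * exp(1j * phase),
# so predicting both components suffices to reconstruct the waveform.
```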

We provide our implementation as open source in this repository. Audio samples can be found at the demo website.

Pre-requisites

  1. Python >= 3.9.
  2. Clone this repository.
  3. Install the Python requirements. Please refer to requirements.txt.
  4. Download datasets
    1. Download and extract the VCTK-0.92 dataset, move its wav48 directory into VCTK-Corpus-0.92, and rename it to wav48_origin.
    2. Trim the silence of the dataset, and the trimmed files will be saved to wav48_silence_trimmed.
      cd VCTK-Corpus-0.92
      python flac2wav.py
      
    3. Move all the trimmed training files from wav48_silence_trimmed to wav48/train following the indexes in training.txt, and move all the untrimmed test files from wav48_origin to wav48/test following the indexes in test.txt.

Training

cd train
CUDA_VISIBLE_DEVICES=0 python train_16k.py --config [config file path]
CUDA_VISIBLE_DEVICES=0 python train_48k.py --config [config file path]

Checkpoints and copies of the configuration file are saved in the cp_model directory by default.
You can change the path by using the --checkpoint_path option. Here is an example:

CUDA_VISIBLE_DEVICES=0 python train_16k.py --config ../configs/config_2kto16k.json --checkpoint_path ../checkpoints/AP-BWE_2kto16k

Inference

cd inference
python inference_16k.py --checkpoint_file [generator checkpoint file path]
python inference_48k.py --checkpoint_file [generator checkpoint file path]

You can download the pretrained weights we provide and move all the files to the checkpoints directory.
Generated wav files are saved in generated_files by default. You can change the path with the --output_dir option. Here is an example:

python inference_16k.py --checkpoint_file ../checkpoints/2kto16k/g_2kto16k --output_dir ../generated_files/2kto16k
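To try inference without true narrowband recordings, you can simulate a narrowband input from a wideband file. A minimal sketch, assuming the inference scripts accept audio whose high band has been removed (an ideal FFT low-pass is used here for illustration; the repository's own preprocessing may use proper resampling instead):

```python
import numpy as np

def simulate_narrowband(wav, sr=16000, low_sr=2000):
    """Zero all frequency content above low_sr / 2, mimicking a signal
    originally sampled at low_sr and upsampled back to sr."""
    spec = np.fft.rfft(wav)
    freqs = np.fft.rfftfreq(len(wav), d=1.0 / sr)
    spec[freqs > low_sr / 2] = 0.0
    return np.fft.irfft(spec, n=len(wav))
```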

Model Structure

(model architecture figure)

Comparison with other speech BWE methods

2k/4k/8kHz to 16kHz

(comparison results figure)

8k/12k/16k/24kHz to 48kHz

(comparison results figure)

Acknowledgements

We referred to HiFi-GAN and NSPP when implementing this repository.

Citation

@article{lu2024towards,
  title={Towards high-quality and efficient speech bandwidth extension with parallel amplitude and phase prediction},
  author={Lu, Ye-Xin and Ai, Yang and Du, Hui-Peng and Ling, Zhen-Hua},
  journal={arXiv preprint arXiv:2401.06387},
  year={2024}
}

@inproceedings{lu2024multi,
  title={Multi-Stage Speech Bandwidth Extension with Flexible Sampling Rate Control},
  author={Lu, Ye-Xin and Ai, Yang and Sheng, Zheng-Yan and Ling, Zhen-Hua},
  booktitle={Proc. Interspeech},
  pages={2270--2274},
  year={2024}
}
