
Deep Xi: A Deep Learning Approach to A Priori SNR Estimation. Used for Speech Enhancement and robust ASR.


wangtao2668129173/DeepXi

 
 


Deep Xi has been updated to TensorFlow 2!

Deep Xi: A Deep Learning Approach to A Priori SNR Estimation.

Deep Xi is implemented in TensorFlow 2 and is used for speech enhancement, noise estimation, and mask estimation, and as a front-end for robust ASR.

Deep Xi (where the Greek letter 'xi' or ξ is pronounced /zaɪ/) is a deep learning approach to a priori SNR estimation that was proposed in [1] and is implemented in TensorFlow 2. Some of its use cases include:

  • It can be used by minimum mean-square error (MMSE) approaches to speech enhancement like the MMSE short-time spectral amplitude (MMSE-STSA) estimator.
  • It can be used by minimum mean-square error (MMSE) approaches to noise estimation, as in DeepMMSE [2].
  • It can be used to estimate the ideal binary mask (IBM) for missing feature approaches or the ideal ratio mask (IRM).
  • It can be used as a front-end for robust ASR, as shown in Figure 1.

Figure 1: Deep Xi used as a front-end for robust ASR. The back-end (Deep Speech) is available here. The noisy speech magnitude spectrogram, as shown in (a), is a mixture of clean speech with voice babble noise at an SNR level of -5 dB, and is the input to Deep Xi. Deep Xi estimates the a priori SNR, as shown in (b). The a priori SNR estimate is used to compute the gain function of an MMSE approach, which is multiplied elementwise with the noisy speech magnitude spectrum to produce the clean speech magnitude spectrum estimate, as shown in (c). MFCCs are computed from the estimated clean speech magnitude spectrogram, producing the estimated clean speech cepstrogram, as shown in (d). The back-end system, Deep Speech, computes the hypothesis transcript from the estimated clean speech cepstrogram, as shown in (e).
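The gain step described in the caption of Figure 1 can be sketched as follows. This is a minimal illustration, assuming the textbook definitions of the Wiener filter (WF) and square-root Wiener filter (SRWF/IRM) gains, and assuming an IBM that thresholds the a priori SNR at 0 dB. The function names are hypothetical and are not taken from the Deep Xi code. The MMSE-STSA and MMSE-LSA gains are omitted because they also depend on the a posteriori SNR and on special functions.

```python
import math

def wf_gain(xi):
    """Wiener filter gain for a linear-scale a priori SNR value."""
    return xi / (1.0 + xi)

def srwf_gain(xi):
    """Square-root Wiener filter gain (the same form as the IRM)."""
    return math.sqrt(wf_gain(xi))

def ibm_gain(xi):
    """Binary mask; the 0 dB (xi = 1) threshold is an assumption."""
    return 1.0 if xi > 1.0 else 0.0

def enhance_frame(noisy_mag, xi_hat, gain_fn):
    """Multiply each noisy magnitude component by its gain (elementwise)."""
    return [gain_fn(x) * m for m, x in zip(noisy_mag, xi_hat)]

noisy_mag = [2.0, 1.0, 0.5]   # toy noisy speech magnitude spectrum
xi_hat = [9.0, 1.0, 0.25]     # toy a priori SNR estimates (linear scale)
clean_mag_hat = enhance_frame(noisy_mag, xi_hat, wf_gain)
```

Swapping `wf_gain` for `srwf_gain` or `ibm_gain` changes only the gain function; the elementwise multiplication stays the same.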

How does Deep Xi work?

A training example is shown in Figure 2. A deep neural network (DNN) within the Deep Xi framework is fed the noisy-speech short-time magnitude spectrum as input. The training target of the DNN is a mapped version of the instantaneous a priori SNR (i.e. the mapped a priori SNR). The instantaneous a priori SNR is mapped to the interval [0,1] to improve the rate of convergence of the stochastic gradient descent algorithm used for training. The map is the cumulative distribution function (CDF) of the instantaneous a priori SNR, as given by Equation (13) in [1]. The statistics for the CDF are computed over a sample of the training set. An example of the mean and standard deviation of the sample for each frequency bin is shown in Figure 3. The training examples in each mini-batch are padded to the longest sequence length in the mini-batch. A sequence mask is used by TensorFlow to ensure that the DNN is not trained on the padding. During inference, the a priori SNR estimate is computed from the mapped a priori SNR using the sample statistics and Equation (12) from [2].
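The mapping and its inverse can be sketched for a single frequency bin as follows. This is a minimal sketch, assuming the CDF is that of a normal distribution over the a priori SNR in dB (as Figure 3 suggests). The function names and the statistics `mu` and `sigma` are hypothetical, and the clipping constant `eps` is an illustrative choice rather than a value from the repository.

```python
import math
from statistics import NormalDist

def map_xi(xi, mu, sigma):
    """Map a linear a priori SNR (> 0) to [0, 1] via a Gaussian CDF of its dB value."""
    xi_db = 10.0 * math.log10(xi)
    return NormalDist(mu, sigma).cdf(xi_db)

def unmap_xi(xi_bar, mu, sigma, eps=1e-12):
    """Invert the map: mapped a priori SNR -> linear a priori SNR."""
    # inv_cdf is undefined at exactly 0 and 1, so clip the input first.
    p = min(max(xi_bar, eps), 1.0 - eps)
    xi_db = NormalDist(mu, sigma).inv_cdf(p)
    return 10.0 ** (xi_db / 10.0)

mu, sigma = -2.0, 12.0   # hypothetical sample statistics for one bin (dB)
xi = 4.0                 # instantaneous a priori SNR (linear scale)
xi_bar = map_xi(xi, mu, sigma)        # training target in [0, 1]
xi_rec = unmap_xi(xi_bar, mu, sigma)  # recovered at inference time
```

In practice one `(mu, sigma)` pair would be kept per frequency bin, and the same statistics would be used for both training targets and inference.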

Figure 2: A training example for Deep Xi. Generated using eval_example.m.

Figure 3: The normal distribution for each frequency bin is computed from the mean and standard deviation of the instantaneous a priori SNR (dB) over a sample of the training set. Generated using eval_stats.m.

Which audio do I use with Deep Xi?

Deep Xi operates on mono/single-channel audio (not stereo/dual-channel audio). Single-channel audio is used because most cell phones use a single microphone. The available trained models operate on a sampling frequency of f_s=16000Hz, which is currently the standard sampling frequency used in the speech enhancement community. The sampling frequency can be changed in run.sh. Deep Xi can be trained using a higher sampling frequency (e.g. f_s=44100Hz), but this is unnecessary as human speech rarely exceeds 8 kHz (the Nyquist frequency of f_s=16000Hz is 8 kHz). The available trained models operate on a window duration and shift of T_d=32ms and T_s=16ms, respectively. To train a model on a different window duration and shift, T_d and T_s can be changed in run.sh. Currently, Deep Xi supports the .wav, .mp3, and .flac audio formats. The audio codec and bit rate do not affect the performance of Deep Xi.
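The arithmetic behind these settings can be checked with a few lines. This is a sketch only: the variable names are illustrative, and the choice of an FFT length equal to the next power of two at or above the window length is an assumption, not something stated above.

```python
f_s = 16000   # sampling frequency (Hz)
T_d = 32      # window duration (ms)
T_s = 16      # window shift (ms)

N_d = f_s * T_d // 1000              # window length: 512 samples
N_s = f_s * T_s // 1000              # window shift: 256 samples
NFFT = 1 << (N_d - 1).bit_length()   # assumed: next power of two >= N_d (512)
n_bins = NFFT // 2 + 1               # single-sided spectrum: 257 components
nyquist_hz = f_s / 2                 # 8000.0 Hz
```

Changing f_s, T_d, or T_s changes the size of the magnitude spectrum, so a model trained with one setting cannot be expected to work with another.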

Where can I get a dataset for Deep Xi?

Open-source training and testing sets are available for Deep Xi on IEEE DataPort:

Deep Xi Training Set: http://dx.doi.org/10.21227/3adt-pb04.

Deep Xi Test Set: http://dx.doi.org/10.21227/h3xh-tm88.

Test set from the original Deep Xi paper: http://dx.doi.org/10.21227/0ppr-yy46.

The MATLAB scripts used to generate these sets can be found in the set directory.

Naming convention in the set/ directory

The following is already configured in the Deep Xi Training Set and Deep Xi Test Set.

Training set

The filenames of the waveforms in the train_clean_speech and train_noise directories are not restricted. There can be a different number of waveforms in each. The Deep Xi framework utilises each of the waveforms in train_clean_speech once during an epoch. For each train_clean_speech waveform of a mini-batch, the Deep Xi framework selects a random section of a randomly selected waveform from train_noise (one whose length is greater than or equal to that of the train_clean_speech waveform) and adds it to the train_clean_speech waveform at a randomly selected SNR level (the SNR level range can be set in run.sh).
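The mixing step described above can be sketched as follows. This is a minimal pure-Python illustration, not the framework's own implementation: the function name `mix_at_snr`, the toy signals, and the SNR range of -10 dB to 20 dB are all assumptions made for the example, and the noise section is assumed to have non-zero power.

```python
import math
import random

def mix_at_snr(clean, noise, snr_db, rng=random):
    """Add a random section of `noise` to `clean` at the requested SNR (dB)."""
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean speech")
    start = rng.randint(0, len(noise) - len(clean))
    section = noise[start:start + len(clean)]
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(d * d for d in section) / len(section)
    # Scale the noise so that 10*log10(p_clean / p_scaled_noise) == snr_db.
    scale = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled = [scale * d for d in section]
    noisy = [s + d for s, d in zip(clean, scaled)]
    return noisy, scaled

rng = random.Random(0)
clean = [math.sin(0.05 * n) for n in range(1600)]    # toy clean speech
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]   # toy noise recording
snr_db = rng.uniform(-10.0, 20.0)                    # hypothetical SNR range
noisy, scaled_noise = mix_at_snr(clean, noise, snr_db, rng)
```

Returning the scaled noise alongside the mixture makes it possible to compute the instantaneous a priori SNR training target from the clean speech and noise components.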

Validation set

As the validation set must not change from epoch to epoch, a set of restrictions applies to the waveforms in val_clean_speech and val_noise. There must be the same number of waveforms in val_clean_speech and val_noise. One waveform in val_clean_speech corresponds to only one waveform in val_noise, i.e. a clean speech and noise validation waveform pair. Each clean speech and noise validation waveform pair must have identical filenames and an identical number of samples. Each pair must also have the SNR level (dB) at which it is to be mixed placed at the end of its filenames. The convention used is _XdB, where X is replaced with the desired SNR level, e.g. val_clean_speech/NAME_-5dB.wav and val_noise/NAME_-5dB.wav. An example of the filenames for a clean speech and noise validation waveform pair is as follows: val_clean_speech/198_19-198-0003_Machinery17_15dB.wav and val_noise/198_19-198-0003_Machinery17_15dB.wav.
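The _XdB convention can be checked with a short script before training. This is a sketch under the assumption that X is always an integer; the helper names `snr_from_filename` and `check_pair` are hypothetical and do not exist in the repository.

```python
import re
from pathlib import PurePosixPath

# Matches a trailing "_XdB" on the filename stem, e.g. "_-5dB" or "_15dB".
_SNR_RE = re.compile(r"_(-?\d+)dB$")

def snr_from_filename(path):
    """Return the SNR level (dB) encoded at the end of a filename stem."""
    stem = PurePosixPath(path).stem
    match = _SNR_RE.search(stem)
    if match is None:
        raise ValueError(f"no _XdB suffix in {path!r}")
    return int(match.group(1))

def check_pair(clean_path, noise_path):
    """Check that a clean/noise pair share a filename; return their SNR level."""
    if PurePosixPath(clean_path).name != PurePosixPath(noise_path).name:
        raise ValueError("validation pair must have identical filenames")
    return snr_from_filename(clean_path)

snr = check_pair(
    "val_clean_speech/198_19-198-0003_Machinery17_15dB.wav",
    "val_noise/198_19-198-0003_Machinery17_15dB.wav",
)
```

The check on the number of samples is left out here, as it requires reading the audio files.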

Test set

The filenames of the waveforms in the test_noisy_speech directory are not restricted. This is all that is required if you want inference outputs from Deep Xi, i.e. ./run.sh VER="ANY_NAME" INFER=1. If you are obtaining objective scores by using ./run.sh VER="ANY_NAME" TEST=1, then reference waveforms for the objective measures need to be placed in test_clean_speech. The waveforms in test_clean_speech and test_noisy_speech that correspond to each other must have the same number of samples (i.e. the same sequence length). The filename of the waveform in test_clean_speech that corresponds to a waveform in test_noisy_speech must be contained in the corresponding test noisy speech waveform filename. E.g. if the filename of a test noisy speech waveform is test_noisy_speech/61-70968-0000_SIGNAL021_-5dB.wav, then the filename of the corresponding test clean speech waveform must be contained in the filename of the test noisy speech waveform: test_clean_speech/61-70968-0000.wav. This is because a test clean speech waveform may be used as a reference for multiple waveforms in test_noisy_speech (e.g. test_noisy_speech/61-70968-0000_SIGNAL021_0dB.wav, test_noisy_speech/61-70968-0000_SIGNAL021_5dB.wav, and test_noisy_speech/61-70968-0000_SIGNAL021_10dB.wav are additional test noisy speech waveforms that the test clean speech waveform from the previous example is a reference for).
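The containment rule for test filenames can be illustrated with a small lookup. This is a hypothetical helper, not code from the repository; requiring exactly one match is a design choice made here to surface ambiguous or missing references early. The second clean filename in the example is made up.

```python
from pathlib import PurePosixPath

def find_reference(noisy_path, clean_paths):
    """Return the clean path whose filename stem is contained in the noisy filename."""
    noisy_stem = PurePosixPath(noisy_path).stem
    matches = [c for c in clean_paths if PurePosixPath(c).stem in noisy_stem]
    if len(matches) != 1:
        raise ValueError(
            f"expected one reference for {noisy_path!r}, found {len(matches)}"
        )
    return matches[0]

clean_paths = [
    "test_clean_speech/61-70968-0000.wav",
    "test_clean_speech/61-70970-0001.wav",   # made-up second reference
]
reference = find_reference(
    "test_noisy_speech/61-70968-0000_SIGNAL021_-5dB.wav", clean_paths
)
```

The same clean waveform is returned for the 0 dB, 5 dB, and 10 dB variants of the noisy filename, which is why one reference can serve several test conditions.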

Current networks

Recurrent neural networks (RNNs) and temporal convolutional networks (TCNs) are available:

  • ResNet: Residual network.
  • ResLSTM: Residual long short-term memory network.

Deep Xi utilising a ResNet TCN (Deep Xi-ResNet) was proposed in [2]. It uses bottleneck residual blocks and a cyclic dilation rate. The network comprises approximately 2 million parameters and has a contextual field of approximately 8 seconds. An example of Deep Xi-ResNet is shown in Figure 4. A trained model for version resnet-1.0c is available in the model directory. It is trained using the Deep Xi Training Set.

Deep Xi utilising a ResLSTM network (Deep Xi-ResLSTM) was proposed in [1]. Each of its residual blocks contains a single LSTM cell. The network comprises approximately 10 million parameters.

Figure 4: (left) Deep Xi-ResNet with B bottleneck blocks. Each block has a bottleneck size of d_f, and an output size of d_model. The middle convolutional unit has a kernel size of k and a dilation rate of d. The input to the ResNet is the noisy speech magnitude spectrum for frame l. The output is the corresponding mapped a priori SNR estimate for each component of the noisy speech magnitude spectrum. (right) An example of Deep Xi-ResNet with B=6, a kernel size of k=3, and a maximum dilation rate of 4. The dilation rate increases with the block index, b, by a power of 2 and is cycled if the maximum dilation rate is exceeded.
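The cyclic dilation rate and the resulting contextual field can be worked through numerically. This is a sketch under stated assumptions: block indices start at 0, the maximum dilation rate is a power of two, and only the middle convolutional unit of each block (kernel size k) widens the context. The function names are illustrative and are not taken from the repository.

```python
import math

def dilation_rates(n_blocks, max_d_rate):
    """Cyclic dilation rates 1, 2, 4, ..., max_d_rate, 1, 2, ... per block."""
    cycle_len = int(math.log2(max_d_rate)) + 1
    return [2 ** (b % cycle_len) for b in range(n_blocks)]

def receptive_field_frames(n_blocks, k, max_d_rate):
    """Frames covered by a stack with one dilated kernel-k convolution per block."""
    return 1 + sum((k - 1) * d for d in dilation_rates(n_blocks, max_d_rate))

example = dilation_rates(6, 4)               # Figure 4 (right): [1, 2, 4, 1, 2, 4]
frames = receptive_field_frames(40, 3, 16)   # resnet-1.0c settings: 497 frames
seconds = frames * 0.016                     # 16 ms frame shift: about 7.95 s
```

With the resnet-1.0c settings listed under Deep Xi Versions (n_blocks=40, k=3, max_d_rate=16) and a 16 ms frame shift, this gives roughly 8 seconds, which agrees with the contextual field quoted for Deep Xi-ResNet.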

Deep Xi Versions

There are multiple Deep Xi versions, comprising different networks and restrictions. An example of the ver naming convention is resnet-1.0c. The network type is given at the start of ver. Versions with c are causal. Versions with n are non-causal. The version iteration is also given, i.e. 1.0. Here are the current versions:
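The ver naming convention can be parsed with a short helper. This is a hypothetical sketch: the regular expression is inferred from the version names listed in this section and is not a rule stated by the repository.

```python
import re
from typing import NamedTuple

class Version(NamedTuple):
    network: str     # network type, e.g. "resnet"
    iteration: str   # version iteration, e.g. "1.0"
    causal: bool     # True for a "c" suffix, False for "n"

# Assumed pattern: <network>-<major>.<minor><c|n>
_VER_RE = re.compile(r"^([a-z]+)-(\d+\.\d+)([cn])$")

def parse_ver(ver):
    """Split a version string such as 'resnet-1.0c' into its parts."""
    match = _VER_RE.match(ver)
    if match is None:
        raise ValueError(f"unrecognised version string: {ver!r}")
    network, iteration, flag = match.groups()
    return Version(network, iteration, flag == "c")

parsed = parse_ver("resnet-1.0c")
```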

resnet-1.0c (available in the model directory)

d_model=256
n_blocks=40
d_f=64
k=3
max_d_rate=16
test_epoch=180
mbatch_size=8
causal=1

resnet-1.0n (technically, this is not a TCN due to the use of non-causal dilated 1D kernels)

d_model=256
n_blocks=40
d_f=64
k=3
max_d_rate=16
test_epoch=180
mbatch_size=8
causal=0

reslstm-1.0c

d_model=512  
n_blocks=5   
test_epoch=  
mbatch_size=8   

Results for Deep Xi Test Set

Average objective scores obtained over the conditions in the Deep Xi Test Set. Only SNR levels between -10 dB and 20 dB are considered. MOS-LQO is the mean opinion score (MOS) objective listening quality score obtained using Wideband PESQ. PESQ is the perceptual evaluation of speech quality measure. STOI is the short-time objective intelligibility measure (in %). eSTOI is extended STOI. Results for each condition can be found in log/results.

| Method | Gain | Causal | MOS-LQO | PESQ | STOI | eSTOI |
|---|---|---|---|---|---|---|
| Deep Xi-ResNet (resnet-1.0c) | MMSE-STSA | Yes | 1.90 | 2.34 | 80.92 | 65.90 |
| Deep Xi-ResNet (resnet-1.0c) | MMSE-LSA | Yes | 1.92 | 2.37 | 80.79 | 65.77 |
| Deep Xi-ResNet (resnet-1.0c) | SRWF/IRM | Yes | 1.87 | 2.31 | 80.98 | 65.94 |
| Deep Xi-ResNet (resnet-1.0c) | cWF | Yes | 1.92 | 2.34 | 81.11 | 65.79 |
| Deep Xi-ResNet (resnet-1.0c) | WF | Yes | 1.75 | 2.21 | 78.30 | 63.96 |
| Deep Xi-ResNet (resnet-1.0c) | IBM | Yes | 1.38 | 1.73 | 70.85 | 55.95 |

Results for the DEMAND -- Voice Bank test set

Objective scores obtained on the DEMAND -- Voice Bank test set described here. As in previous works, the objective scores are averaged over all tested conditions. CSIG, CBAK, and COVL are mean opinion score (MOS) predictors of the signal distortion, background-noise intrusiveness, and overall signal quality, respectively. PESQ is the perceptual evaluation of speech quality measure. STOI is the short-time objective intelligibility measure (in %). The highest scores attained for each measure are indicated in boldface.

| Method | Causal | CSIG | CBAK | COVL | PESQ | STOI |
|---|---|---|---|---|---|---|
| Noisy speech | -- | 3.35 | 2.44 | 2.63 | 1.97 | 92 (91.5) |
| Wiener | Yes | 3.23 | 2.68 | 2.67 | 2.22 | -- |
| SEGAN | No | 3.48 | 2.94 | 2.80 | 2.16 | 93 |
| WaveNet | No | 3.62 | 3.23 | 2.98 | -- | -- |
| MMSE-GAN | No | 3.80 | 3.12 | 3.14 | 2.53 | 93 |
| Deep Feature Loss | Yes | 3.86 | 3.33 | 3.22 | -- | -- |
| Metric-GAN | No | 3.99 | 3.18 | 3.42 | 2.86 | -- |
| Deep Xi-ResNet (1.0c, causal) MMSE-LSA | Yes | 4.14 | 3.32 | 3.46 | 2.77 | 93 (93.2) |
| Deep Xi-ResNet (1.0n, non-causal) MMSE-LSA | No | **4.28** | **3.46** | **3.64** | **2.95** | **94 (93.6)** |

Installation

Prerequisites for GPU usage:

To install:

  1. git clone https://github.com/anicolson/DeepXi.git
  2. virtualenv --system-site-packages -p python3 ~/venv/DeepXi
  3. source ~/venv/DeepXi/bin/activate
  4. cd DeepXi
  5. pip install -r requirements.txt

How to use Deep Xi

Use run.sh to configure and run Deep Xi.

Inference: To perform inference and save the outputs, use the following:

./run.sh VER="resnet-1.0c" INFER=1 GAIN="mmse-lsa"

Please look in thoth/args.py for available gain functions and run.sh for further options.

Testing: To perform testing and get objective scores, use the following:

./run.sh VER="resnet-1.0c" TEST=1 GAIN="mmse-lsa"

Please look in log/results for the results.

Training:

./run.sh VER="resnet-1.0c" TRAIN=1 GAIN="mmse-lsa"

Ensure that the data directory is deleted before training. This allows the training lists and statistics for your training set to be saved and used. To resume training from a certain epoch, set --resume_epoch in run.sh to the desired epoch.

Current issues and potential areas of improvement

If you would like to contribute to Deep Xi, please investigate the following and compare it to current models:

  • Currently, the ResLSTM network is not performing as well as expected (when compared to TensorFlow 1.x performance).

Citation guide

Please cite the following depending on what you are using:

  • If using Deep Xi-ResLSTM, please cite [1].
  • If using Deep Xi-ResNet, please cite [1] and [2].
  • If using DeepMMSE, please cite [2].

[1] A. Nicolson and K. K. Paliwal, "Deep learning for minimum mean-square error approaches to speech enhancement," in Speech Communication, vol. 111, pp. 44-55, 2019, doi: 10.1016/j.specom.2019.06.002.

[2] Q. Zhang, A. M. Nicolson, M. Wang, K. Paliwal and C. Wang, "DeepMMSE: A Deep Learning Approach to MMSE-based Noise Power Spectral Density Estimation," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1404-1415, 2020, doi: 10.1109/TASLP.2020.2987441.
