In this project, we present an end-to-end data-driven system for enhancing the quality of noisy speech signals using a convolutional-recurrent neural network. We evaluate the model quantitatively and qualitatively on a real-world noisy speech dataset, reporting metrics such as SNR and PESQ. We also replace the max-pooling layers in the convolutional encoder with a wavelet pooling mechanism and compare the performance of the two variants.
We use the CSR-WSJ01 dataset for clean signals and noise recordings from the ACE corpus.
The CSR-WSJ dataset ships in the wv1/wv2 file format. We use the conversion tools available here to convert it to the SPHERE (.sph) format, followed by conversion to .wav using this repo. Steps to set up the sphere conversion tool (wv1/wv2 → sph):
- Compile the tool: `gcc -o sph2pipe *.c -lm`
- Add the executable to your path: `PATH="$(pwd):$PATH"`
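Once `sph2pipe` is built and on `PATH`, the per-file conversion can be scripted. A minimal Python sketch of a batch sph → wav pass (the helper names and directory arguments are illustrative, not part of the repo; `-f wav` is sph2pipe's flag for RIFF/WAV output):

```python
import shutil
import subprocess
from pathlib import Path

def sph2pipe_cmd(src, dst):
    # "-f wav" asks sph2pipe to emit a RIFF/WAV file.
    return ["sph2pipe", "-f", "wav", str(src), str(dst)]

def convert_all(src_dir, dst_dir):
    """Convert every .sph file under src_dir to .wav files in dst_dir."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    if shutil.which("sph2pipe") is None:
        # Tool not on PATH yet; nothing to do.
        return []
    converted = []
    for sph in sorted(Path(src_dir).rglob("*.sph")):
        wav = dst_dir / (sph.stem + ".wav")
        subprocess.run(sph2pipe_cmd(sph, wav), check=True)
        converted.append(wav)
    return converted
```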
Folder structure for data (`clean_data` holds all files from the [CSR-WSJ01 dataset](https://catalog.ldc.upenn.edu/LDC93s6a) as .sph files):

```
data/
├── clean_data/
├── train_set.txt
├── val_set.txt
└── test_set.txt
```
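The layout above can be bootstrapped with a few lines of Python (a sketch only; the split `.txt` files start empty and are then filled with utterance lists):

```python
from pathlib import Path

def make_data_dirs(root="data"):
    # Create data/clean_data plus the three (initially empty) split lists.
    root = Path(root)
    (root / "clean_data").mkdir(parents=True, exist_ok=True)
    for split in ("train_set.txt", "val_set.txt", "test_set.txt"):
        (root / split).touch()
    return root
```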
- The two models can be trained via `train.py`, configured by a YAML file like those in `configs`.
- Evaluate the model on the test set using `generateMetrics.py`.
- Use `plot_curves.ipynb` to generate loss and evaluation plots.
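For reference, the SNR metric reported by the evaluation step is the ratio of clean-signal power to residual-noise power in decibels. A minimal sketch of the textbook definition (the exact formulation inside `generateMetrics.py` may differ, e.g. segmental SNR):

```python
import math

def snr_db(clean, enhanced):
    """SNR in dB: clean power over residual (clean - enhanced) power."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((c - e) ** 2 for c, e in zip(clean, enhanced))
    if noise_power == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_power / noise_power)
```

For example, an enhanced signal that is uniformly 10% below the clean one has an SNR of about 20 dB, since the power ratio is 1 / 0.1² = 100.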