S2SCycleGAN

The big idea is to create a voice conversion (VC) model using GANs.
Two attempts were taken: direct application of SEGAN and a novel approach using the Tacotron TTS system. Both raw audio VC and spectrogram VC were attempted. Additionally, a cycle-consistency loss was included during training. Nothing has been successful, yet, and detailed information can be found in the respective folders. Brief overviews of the attempts can be found here in the following subsections.

The idea of voice conversion using GANs (in fact, just deep learning) is a promising idea that is still in its nacency. I hope this repo will serve as a starting point for others interested in the idea.

Data

Voice conversion was attempted using about 1200 paired utterances from two speakers (one male and one female). The data comes for the CMU Artic dataset. Download data with downloadData.sh, or get the data here.

SEGAN

The SEGAN approach to VC is a direct application of methods described in the paper. where the noisy input was replaced with the one speaker and the clean audio was replaced with the other speaker.

Tacotron GAN

In this approach, I tried to use components of the famous and mysterious TTS system Tacotron to create a VC model that was trained adversarially. The implementation I used originated from here. The generator was a slightlty altered version of the original model that simply took linear-spectrogram audio as input. The discrminator consisted of the encoder part of Tacotron with an additional FC layer that was fed the last state of the encoder.

One of the concerning points to this attempt was the lack of strong foundation for a the speech generator. The model of Tacotron I used had not been fully vetted. Thus, I spend the later half of ML Camp Jeju 2017 developing a strong implementation of this model. You can follow my progress on this project here.

Setup

Download the data with sh downloadData.sh
Install python dependencies into a new conda environment with conda create -f env.yml

Port Audio and PyAudio

sudo apt-get install libasound-dev

wget http://portaudio.com/archives/pa_stable_v190600_20161030.tgz

tar -zxvf pa_stable_v190600_20161030.tgz

cd portaudio

./configure && make

sudo make install

cd ..

source activate tensorflow1

pip install pyaudio

Future work

Deep learning is a promising avenue for speech synthesis, especially VC. As the foundation of speech synthesis solidify, I'll continue to attempt VC in a GAN setting in this repo. The official list of action items is

Remove skip connections for SEGAN
CycleGAN for Tacotron GAN

Add speaker embedding and train both SEGAN and Tacotron GAN on OpenSLR

References

Pascual, et al., "SEGAN: Speech Enhancement Generative Adversarial Network." March 2017
Wang, et al., "Tacotron: Towards End-to-End Speech Synthesis." March 2017
Zhu, et al., "Unpaired Image-toImage Translation using Cycle-Consistent Adversarial Networks." March 2017

Name		Name	Last commit message	Last commit date
Latest commit History 79 Commits
SEGAN		SEGAN
Tacotron_GAN		Tacotron_GAN
imgs		imgs
tb_logs		tb_logs
.gitignore		.gitignore
README.md		README.md
downloadData.sh		downloadData.sh
env.yml		env.yml
license		license

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

S2SCycleGAN

Data

SEGAN

Tacotron GAN

Setup

Port Audio and PyAudio

Future work

References

About

Releases

Packages

Languages

License

tmulc18/S2SCycleGAN

Folders and files

Latest commit

History

Repository files navigation

S2SCycleGAN

Data

SEGAN

Tacotron GAN

Setup

Port Audio and PyAudio

Future work

References

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages