<a href="https://colab.research.google.com/github/secutron/RunTime/blob/master/AudioPipe0Clone_Ad.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Adversarial Voice Conversion - demo

## Make sure you use a GPU as the hardware accelerator and display some info about the device
Keep in mind that you will need at least 10 GB of GPU memory.
You may get a Tesla P100 if you're lucky :)

In [None]:
!nvidia-smi
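
If you would rather check the memory requirement from Python, here is a minimal sketch. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; the helper names `parse_gpu_memory` and `query_gpus` are made up for this example and are not part of the repository.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'name, memory.total' CSV rows into (name, MiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so a comma inside the name cannot break parsing.
        name, mem = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def query_gpus():
    """Ask nvidia-smi for GPU names and total memory; return [] if unavailable."""
    cmd = ["nvidia-smi", "--query-gpu=name,memory.total",
           "--format=csv,noheader,nounits"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_memory(out)


MIN_MIB = 10 * 1024  # the 10 GB guideline from the text, expressed in MiB

for name, mib in query_gpus():
    status = "ok" if mib >= MIN_MIB else "probably too small"
    print(f"{name}: {mib} MiB ({status})")
```

If nothing is printed, no GPU was found, so check the runtime type in the Colab settings.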

## Clone the repository with the code

In [None]:
!git clone https://github.com/hubertsiuzdak/voice-conversion

## Get the data
You will need at least two voices, or whatever audio files you want to convert. You can use your own data and then create a list of the files as shown below. For the purposes of this demo, we will download some .wav files from the [CMU_ARCTIC speech synthesis databases](http://www.festvox.org/cmu_arctic/).
BDL is a male voice and SLT is a female voice.

In [None]:
!wget http://festvox.org/cmu_arctic/cmu_arctic/packed/cmu_us_bdl_arctic-0.95-release.zip
!unzip -qj cmu_us_bdl_arctic-0.95-release.zip 'cmu_us_bdl_arctic/wav/*' -d bdl_wav_files
!wget http://festvox.org/cmu_arctic/cmu_arctic/packed/cmu_us_slt_arctic-0.95-release.zip
!unzip -qj cmu_us_slt_arctic-0.95-release.zip 'cmu_us_slt_arctic/wav/*' -d slt_wav_files

If you want to use your own data, upload the audio files to Google Drive and then mount your Google Drive in the runtime's virtual machine using an authorization code:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Once executed, you should be able to access your Google Drive files. 

## Create lists of training files

In [None]:
!ls /content/bdl_wav_files/*.wav | sort -R > /content/voice-conversion/train_files_0.txt
!ls /content/slt_wav_files/*.wav | sort -R > /content/voice-conversion/train_files_1.txt
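
The same lists can be built from Python, which is handy when your own files live in a different folder. This is only a sketch: `write_file_list` is a hypothetical helper, and the fixed seed is an assumption added here so the shuffle is repeatable (`sort -R` gives a different order on every run).

```python
import random
from pathlib import Path


def write_file_list(wav_dir, list_path, seed=1234):
    """Write a shuffled list of absolute .wav paths, one per line."""
    files = sorted(str(p.resolve()) for p in Path(wav_dir).glob("*.wav"))
    if not files:
        raise ValueError(f"no .wav files found in {wav_dir}")
    # A seeded generator makes the order identical across runs.
    random.Random(seed).shuffle(files)
    Path(list_path).write_text("\n".join(files) + "\n")
    return files


# Example calls, using the paths from the cells above:
# write_file_list("/content/bdl_wav_files", "/content/voice-conversion/train_files_0.txt")
# write_file_list("/content/slt_wav_files", "/content/voice-conversion/train_files_1.txt")
```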

## Edit and save the config file


---


### Note: 

*   Leave "checkpoint_path" empty if you want to train from scratch.
*   Consider mounting your Google Drive and making it the "output_directory" so that you won't lose your checkpoints if Colab disconnects.
*   Start training with alpha set to 0. Once the discriminator starts to recognize speakers (the domain loss gets close to 0), you can increase the alpha parameter. The discriminator then becomes adversarial (it tries to maximize the classification loss, which results in speaker-invariant features).
*   Setting alpha too high can prevent the model from converging. On the other hand, an alpha that is too low may result in the identity function. Simply put, there would be no conversion.
*   Setting alpha to 0.001 after a few thousand iterations and then gradually increasing it seems to work.
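
To make the last note concrete, here is one possible schedule written as a small function. It is only a sketch: `alpha_schedule` is not part of the repository, and every default except the 0.001 starting value (warm-up length, growth factor, step size, cap) is an illustrative assumption rather than a tuned setting. Since alpha is read from the config file, you would apply a value by stopping training, editing "alpha", and resuming from the latest checkpoint.

```python
def alpha_schedule(iteration, warmup_iters=5000, start_alpha=0.001,
                   growth=2.0, step_iters=5000, max_alpha=0.01):
    """Return the alpha to use at a given training iteration.

    All defaults except start_alpha are assumptions made for illustration.
    """
    if iteration < warmup_iters:
        # Keep alpha at 0 so the discriminator first learns to recognize speakers.
        return 0.0
    # After the warm-up, multiply alpha by `growth` every `step_iters` iterations.
    steps = (iteration - warmup_iters) // step_iters
    return min(start_alpha * growth ** steps, max_alpha)


# Print the suggested alpha at each point where you might edit the config.
for it in range(0, 30001, 5000):
    print(it, alpha_schedule(it))
```

The cap is there because of the note about convergence: growing alpha without a limit would eventually push it into the range where training stops converging.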

In [None]:
%%writefile /content/voice-conversion/config.json
{
    "train_config": {
        "output_directory": "checkpoints",
        "epochs": 1000,
        "learning_rate": 0.003,
        "alpha": 0,
        "iters_per_checkpoint": 1000,
        "num_workers": 4,
        "batch_size": 8,
        "pin_memory": "True",
        "seed": 1234,
        "checkpoint_path": ""
    },

    "data_config": {
        "segment_length": 16000,
        "mu_quantization": 256,
        "sampling_rate": 16000
    },

    "model_config": {
        "n_speakers": 2,
        "n_in_channels": 256,
        "n_layers": 16,
        "max_dilation": 128,
        "n_residual_channels": 64,
        "n_skip_channels": 256,
        "n_out_channels": 256,
        "n_cond_channels": 64,
        "upsamp_window": 1050,
        "upsamp_stride": 200
    }
}
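
When you later raise alpha or resume from a checkpoint, you can re-run the cell above with new values, or patch the saved file from Python. The helper below is a hypothetical sketch (`update_train_config` is not provided by the repository); it only touches keys that already exist in "train_config", so a typo in a key name raises an error instead of silently adding a new entry.

```python
import json


def update_train_config(config_path, **overrides):
    """Overwrite selected keys in the "train_config" section and save the file."""
    with open(config_path) as f:
        config = json.load(f)
    unknown = set(overrides) - set(config["train_config"])
    if unknown:
        raise KeyError(f"not in train_config: {sorted(unknown)}")
    config["train_config"].update(overrides)
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config


# Example call; the checkpoint file name is a placeholder, use your own:
# update_train_config("/content/voice-conversion/config.json",
#                     alpha=0.001,
#                     checkpoint_path="checkpoints/<your checkpoint file>")
```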

## Run the training script
The output becomes audible after 10,000 to 20,000 iterations.

In [None]:
%cd /content/voice-conversion/
!python train.py -c config.json

## Build nv-wavenet and the C wrapper for inference
See the [NVIDIA repository](https://github.com/NVIDIA/nv-wavenet) for more details.

In [None]:
!make
!python build.py install

## Create the list of audio files that you want to convert

In [None]:
!cat train_files_0.txt | head -10 > inference.txt
!mkdir -p speaker_0 speaker_1

## Run the inference script

*   `-f` list of files
*   `-c` path to the checkpoint
*   `-o` output folder
*   `-id` id of the decoder to use (target voice)




In [None]:
!python inference.py -f inference.txt -c path/to/checkpoint -o speaker_1 -id 1
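
Once inference finishes, a quick sanity check of the output folder can catch empty or truncated files. The sketch below assumes the converted audio is written as PCM .wav files into the `-o` folder, which is an assumption about the script's output; `summarize_wavs` is a hypothetical helper that uses only the standard-library `wave` module.

```python
import wave
from pathlib import Path


def summarize_wavs(folder):
    """Return (file name, sample rate, duration in seconds) for each .wav in folder."""
    rows = []
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as w:
            rate = w.getframerate()
            rows.append((path.name, rate, w.getnframes() / rate))
    return rows


# Only run the check if the output folder from the cell above exists.
if Path("speaker_1").is_dir():
    for name, rate, seconds in summarize_wavs("speaker_1"):
        print(f"{name}: {rate} Hz, {seconds:.2f} s")
```

The reported sample rate should match "sampling_rate" in the config (16000 in this demo).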