# Intra-lingual A2A VC with S3PRL; S3PRL-VC
[![Generic badge](https://img.shields.io/badge/GitHub-s3prl-9cf.svg)][github]
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)][notebook]

Author: [tarepan]

[github]:https://github.com/tarepan/s3prl
[notebook]:https://colab.research.google.com/github/tarepan/s3prl/blob/master/s3prl/downstream/a2a-vc-vctk/training.ipynb
[tarepan]:https://github.com/tarepan

## Colab Check
Check the following:
- Google Colaboratory running time
- GPU type
- Python version
- CUDA version

In [None]:
!cat /proc/uptime | awk '{print $1 /60 /60 /24 "days (" $1 "sec)"}'
!head -n 1 /proc/driver/nvidia/gpus/**/information
!python --version
!pip show torch | sed '2!d'
!/usr/local/cuda/bin/nvcc --version | sed '4!d'
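
The uptime one-liner above can also be written in Python, which may be easier to adapt inside a notebook cell. This is a minimal sketch: `format_uptime` is a hypothetical helper name, and it assumes the Linux `/proc/uptime` layout (uptime in seconds as the first field).

```python
from pathlib import Path


def format_uptime(proc_uptime_text: str) -> str:
    """Format the first field of /proc/uptime (seconds since boot) as days."""
    seconds = float(proc_uptime_text.split()[0])
    days = seconds / 60 / 60 / 24
    return f"{days:.3f} days ({seconds:.0f} sec)"


uptime_file = Path("/proc/uptime")
if uptime_file.exists():  # Only present on Linux hosts such as Colab
    print(format_uptime(uptime_file.read_text()))
```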

## Setup

Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Clone the `tarepan/s3prl` repository and install dependencies

In [None]:
!git clone https://github.com/tarepan/s3prl.git

%cd s3prl

# !pip install "torch==1.10.0" -q      # Based on your PyTorch environment
# !pip install "torchaudio==0.10.0" -q # Based on your PyTorch environment

!apt-get install sox
!pip install -e ./   # Repository itself
# Requires fairseq master (not the latest stable release) with a patched Gumbel-softmax
!pip install "git+https://github.com/tarepan/fairseq.git#egg=fairseq"

%cd ./s3prl/downstream/a2a-vc-vctk

!pip install -r requirements.txt

Data preparation

In [None]:
# Get pre-trained HiFi-GAN checkpoint archive and extract contents
!./vocoder_download.sh ./

# Copy the upstream model cache from a private mirror on Google Drive
!mkdir -p /root/.cache/torch/hub
!cp -r /content/gdrive/MyDrive/ML_data/s3prl_cache /root/.cache/torch/hub
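
If the Drive mirror is absent, the shell copy above only prints an error and the notebook keeps running. A guarded copy in Python might look like the sketch below. `restore_cache` is a hypothetical helper (not part of S3PRL), and the paths in the usage comment are the ones from the cell above.

```python
import shutil
from pathlib import Path


def restore_cache(src: Path, hub_dir: Path) -> Path:
    """Copy the directory `src` into `hub_dir`, like `cp -r src hub_dir`."""
    if not src.is_dir():
        raise FileNotFoundError(f"Cache mirror not found: {src}")
    hub_dir.mkdir(parents=True, exist_ok=True)
    dst = hub_dir / src.name
    # dirs_exist_ok allows re-running on an already populated cache (Python 3.8+)
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


# Uncomment to run on Colab with Google Drive mounted:
# restore_cache(Path("/content/gdrive/MyDrive/ML_data/s3prl_cache"),
#               Path("/root/.cache/torch/hub"))
```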

## Preprocessing

Preprocessing is included in the training scripts, so no separate step is needed.

## Training

In [None]:
# Move up to the `s3prl` package directory (<repository>/s3prl)
%cd ../..
!mkdir -p /content/gdrive/MyDrive/ML_results/S3PRL_VC/a2a/vq_wav2vec_default_vctk

In [None]:
# Training
!python run_downstream.py \
    -u       vq_wav2vec \
    -d       a2a-vc-vctk \
    -m       train \
    --config downstream/a2a-vc-vctk/config_ar_taco2.yaml \
    -p       /content/gdrive/MyDrive/ML_results/S3PRL_VC/a2a/vq_wav2vec_default_vctk \
    # -o       "config.downstream_expert.data.corpus.train.name=JVS,,config.downstream_expert.data.corpus.val.name=JVS" \
    # -e       /content/gdrive/MyDrive/ML_results/S3PRL_VC/a2a_vc_vctk_default_vq_wav2vec/states-50000.ckpt \
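
The commented `-o` line passes several config overrides in a single string, with `key=value` pairs joined by `,,`. The sketch below builds such a string from a dictionary. `build_override` is a hypothetical helper and only reproduces the format visible in the command above.

```python
from typing import Mapping


def build_override(overrides: Mapping[str, str]) -> str:
    """Join key=value pairs with ',,' as in the `-o` argument above."""
    return ",,".join(f"{key}={value}" for key, value in overrides.items())


override = build_override({
    "config.downstream_expert.data.corpus.train.name": "JVS",
    "config.downstream_expert.data.corpus.val.name": "JVS",
})
print(override)
```

Generating the string this way may help avoid misplaced separators when more than two keys are overridden.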


## Evaluation

Synthesize waveforms from already generated spectrograms and objectively evaluate them.

In [None]:
# Waveforms are synthesized and saved correctly, but the objective evaluation currently fails.
!./downstream/a2a-vc-vctk/decode.sh ./downstream/a2a-vc-vctk/hifigan_vctk /content/gdrive/MyDrive/ML_results/S3PRL_VC/a2a_vc_vctk_default_vq_wav2vec/50000

In [None]:
# Evaluation only (does not work at present)
# !python downstream/a2a-vc-vctk/evaluate.py --wavdir ./result/downstream/a2a_vc_vctk_default_vq_wav2vec/50001/hifigan_wav --samples 1 --task task1  --data_root ./downstream/a2a-vc-vctk/data

In [None]:
# # Launch TensorBoard
# %load_ext tensorboard
# %tensorboard --logdir gdrive/MyDrive/ML_results/S3PRL_VC

In [None]:
# # Usage stat
# ## GPU
# !nvidia-smi -l 3
# ## CPU
# !vmstat 5
# !top

## Inference

In [None]:
!apt-get install sox
!pip install "git+https://github.com/tarepan/extorch.git"
!pip install "git+https://github.com/tarepan/speechdatasety.git"
!pip install "git+https://github.com/tarepan/speechcorpusy.git@main"
!pip install "git+https://github.com/tarepan/fairseq.git#egg=fairseq"
!pip install "git+https://github.com/tarepan/s3prl.git"
!pip install Resemblyzer

In [None]:
import librosa
import torch
import torch.cuda
from s3prl import hub, downstream


# Inputs/Config
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
source_path  = "./<path>/<to>/source.wav"
target_paths = ["./<path>/<to>/target_1.wav", "./<path>/<to>/target_2.wav"]
source_wave = librosa.load(source_path, sr=16000)[0]
target_waves = [librosa.load(p)[0] for p in target_paths]
name_s2u = "XXXXX" # Upstream model name
tacovc_path = "/<path>/<to>/checkpoint.ckpt"
vocoder_path = "/<path>/<to>/checkpoint.ckpt"

# Init
wav2unit = getattr(hub, name_s2u)().to(device)
tacovc = getattr(downstream.experts, "a2a-vc-vctk")()
tacovc.load_state_dict(torch.load(tacovc_path))  # load_state_dict() does not return the model, so it is called separately
vocoder = YourVocoder.from_pretrained(vocoder_path)

# wave2unit:: List[(T_wave,)] -> [Batch=1, T_unit=T_mel, Feat]
unit_series = wav2unit([torch.from_numpy(source_wave).to(device)])["feature_x"]

# unit2mel:: ([Batch=1, T_unit=T_mel, Feat], [Batch=1, Emb]) -> [Batch=1, T_mel, Freq] -> [T_mel, Freq]
target_emb = tacovc.wavs2emb(target_waves)
mel_by_tacovc, _ = tacovc.predict_step((unit_series, target_emb)) # Currently, both spec and spec_len are returned.
mel_by_tacovc = torch.squeeze(mel_by_tacovc, 0)

# mel2wave:: [T_mel, Freq] -> ([T_wave,], sampling_rate::int)
mel_for_vocoder = your_mel_shape_conversion(mel_by_tacovc)
o_wave, sr = vocoder.predict(mel_for_vocoder)
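
The cell above stops once `o_wave` and `sr` are obtained. One way to write the result to disk with only the standard library is sketched below. `save_wave` is a hypothetical helper, and it assumes the waveform is a mono sequence of floats in [-1, 1]; values outside that range are clipped.

```python
import sys
import wave
from array import array
from typing import Sequence


def save_wave(path: str, samples: Sequence[float], sampling_rate: int) -> None:
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    pcm = array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV stores PCM samples in little-endian order
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16 bit
        f.setframerate(sampling_rate)
        f.writeframes(pcm.tobytes())


# Assuming `o_wave` is a 1-D tensor or array:
# save_wave("converted.wav", o_wave.tolist(), sr)
```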