The LLVC repo provides a sleek, minimal implementation of the RVC (Retrieval-based Voice Conversion) v2 model that runs about 10x faster than realtime on a Colab Tesla T4, but roughly 10x slower than realtime on a Colab CPU. This fork aims to add ONNX support to squeeze more performance out of it.
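As a rough sketch of where the ONNX direction leads (everything here is illustrative, not this repo's API: `DummyNet` is a hypothetical stand-in for the RVC v2 synthesizer so the snippet runs standalone, and the file names and 16 kHz example are assumptions), the export-and-run path might look like:

```python
# Standalone sketch of a torch -> ONNX -> onnxruntime pipeline. The real
# model would be loaded from a .pth checkpoint via the repo's code instead
# of DummyNet, which exists only so this snippet is runnable on its own.
import numpy as np
import torch
import onnxruntime as ort

class DummyNet(torch.nn.Module):
    def forward(self, wav):
        return torch.tanh(wav)  # placeholder op; the real net converts the voice

model = DummyNet().eval()
example = torch.randn(1, 16000)  # 1 s of 16 kHz audio

torch.onnx.export(
    model, (example,), "rvc_v2.onnx",
    input_names=["wav"], output_names=["out"],
    dynamic_axes={"wav": {1: "n_samples"}, "out": {1: "n_samples"}},
    opset_version=17,
)

# Run the exported graph on CPU with onnxruntime.
sess = ort.InferenceSession("rvc_v2.onnx", providers=["CPUExecutionProvider"])
out = sess.run(None, {"wav": example.numpy()})[0]
print(out.shape)
```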
Tested on Ubuntu 22.04 (WSL 2):

```bash
git clone https://github.com/tripathiarpan20/LLVC
cd LLVC
virtualenv llvcenv
source llvcenv/bin/activate
pip install -r requirements.txt  # takes a while to complete
python download_models.py
```
Work around an ffmpeg/ffmpeg-python packaging conflict (see https://github.com/kkroening/ffmpeg-python/issues/174):

```bash
sudo apt-get update
sudo apt-get install ffmpeg
pip uninstall ffmpeg
pip uninstall ffmpeg-python
pip install ffmpeg-python
```
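After the reinstall, a quick check (my addition, not part of the repo) confirms that `import ffmpeg` resolves to ffmpeg-python rather than the unrelated `ffmpeg` package:

```python
# If the wrong package shadows ffmpeg-python, these attributes are missing
# (the symptom described in ffmpeg-python issue #174).
import ffmpeg

assert hasattr(ffmpeg, "input") and hasattr(ffmpeg, "probe"), \
    "wrong 'ffmpeg' package installed; reinstall ffmpeg-python"
print("ffmpeg-python OK")
```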
Then:

- Download any `.pth` model file tagged with 'RVC v2' from https://www.weights.gg/
- Place the RVC v2 file downloaded in the last step, `model.pth`, in the current folder (`LLVC`)
- Place a sample input audio `.wav` file in the current folder (`LLVC`)

```bash
python minimal_rvc/_infer_file.py --input_file libri_sample.wav --out_dir matpat_out --model_path model.pth
```
The above command processes `libri_sample.wav` with the RVC v2 model (`model.pth`) and writes an `out.wav` file to the `matpat_out` folder.
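To sanity-check the conversion (a hedged addition; assumes `soundfile` is available, which the repo's audio dependencies typically pull in, and the paths used above):

```python
# Print duration and sample rate of the converted output.
import soundfile as sf

audio, sr = sf.read("matpat_out/out.wav")
print(f"{len(audio) / sr:.2f} s at {sr} Hz")
```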
This repository contains the code necessary to train Koe AI's LLVC models and to reproduce the LLVC paper.
- LLVC paper: https://koe.ai/papers/llvc.pdf
- LLVC samples: https://koeai.github.io/llvc-demo/
- Windows executable: https://koe.ai/recast/download/
- Koe AI homepage: https://koe.ai/
- Create a Python environment with e.g. conda: `conda create -n llvc python=3.11`
- Activate the new environment: `conda activate llvc`
- Install torch and torchaudio from https://pytorch.org/get-started/locally/
- Install requirements with `pip install -r requirements.txt`
- Download models with `python download_models.py`

`eval.py` has requirements that conflict with `requirements.txt`, so before running that file, create a separate Python 3.9 virtual environment and run `pip install -r eval_requirements.txt`.
You should now be able to run `python infer.py` and convert all of the files in `test_wavs` with the pretrained LLVC checkpoint, with the resulting files saved to `converted_out`.
`python infer.py -p my_checkpoint.pth -c my_config.json -f input_file -o my_out_dir` will convert a single audio file or folder of audio files using the given LLVC checkpoint and save the output to the folder `my_out_dir`. The `-s` argument simulates a streaming environment for conversion. The `-n` argument allows the user to specify the size of input audio chunks in streaming mode, trading increased latency for better RTF.
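To picture the `-n` tradeoff, here is an illustrative chunking loop (not `infer.py`'s actual code; `convert` is a hypothetical stand-in for the model call):

```python
import numpy as np

def stream_convert(audio: np.ndarray, chunk_size: int, convert) -> np.ndarray:
    """Feed audio to the converter in fixed-size chunks, as -s/-n simulate.

    Larger chunk_size means fewer model calls (better RTF), but each output
    sample waits longer before its chunk is submitted (higher latency).
    """
    out = []
    for start in range(0, len(audio), chunk_size):
        out.append(convert(audio[start:start + chunk_size]))
    return np.concatenate(out)

# Usage with a no-op converter: 1 s of 16 kHz audio in 320-sample (20 ms) chunks.
dummy = stream_convert(np.zeros(16000, dtype=np.float32), 320, lambda x: x)
```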
`compare_infer.py` allows you to reproduce our streaming no-f0 RVC and QuickVC conversions on input audio of your choice. By default, `window_ms` and `extra_convert_size` are set to the values used for no-f0 RVC conversion. See the linked paper for the QuickVC conversion parameters.
- Create a folder `experiments/my_run` containing a `config.json` (see `experiments/llvc/config.json` for an example)
- Edit the `config.json` to reflect the location of your dataset and desired architectural modifications
- Run `python train.py -d experiments/my_run`
- The run will be logged to Tensorboard in the directory `experiments/my_run/logs`
Datasets are comprised of a folder containing three subfolders: `dev`, `train` and `val`. Each of these folders contains audio files of the form `PREFIX_original.wav`, which are audio clips recorded by a variety of input speakers, and `PREFIX_converted.wav`, which are the original audio clips converted to a single target speaker. `val` contains clips from the same speakers as `train`; `dev` contains clips from different speakers than `train`.
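A quick layout check along these lines (my own sketch; `my_dataset` is a placeholder for whatever folder your config points at) can catch unpaired clips before training:

```python
from pathlib import Path

dataset_root = Path("my_dataset")  # the folder referenced by config.json
for split in ("dev", "train", "val"):
    folder = dataset_root / split
    # Pair up PREFIX_original.wav / PREFIX_converted.wav by their prefixes.
    originals = {p.name.removesuffix("_original.wav")
                 for p in folder.glob("*_original.wav")}
    converted = {p.name.removesuffix("_converted.wav")
                 for p in folder.glob("*_converted.wav")}
    assert originals == converted, f"unpaired clips in {split}: {originals ^ converted}"
    print(f"{split}: {len(originals)} paired clips")
```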
To recreate the dataset that we use in our paper:
- Download dev-clean.tar.gz and train-clean-360.tar.gz from https://www.openslr.org/12 and unzip to `llvc/LibriSpeech`
- Convert the unzipped folders:

```bash
python -m minimal_rvc._infer_folder \
    --train_set_path "LibriSpeech/train-clean-360" \
    --dev_set_path "LibriSpeech/dev-clean" \
    --out_path "f_8312_ls360" \
    --flatten \
    --model_path "llvc_models/models/rvc/f_8312_32k-325.pth" \
    --model_name "f_8312" \
    --target_sr 16000 \
    --f0_method "rmvpe" \
    --val_percent 0.02 \
    --random_seed 42 \
    --f0_up_key 12
```
- Download test-clean.tar.gz from https://www.openslr.org/12
- Use `infer.py` to convert the test-clean folder using the checkpoint that you want to evaluate
- Activate the eval environment and run `eval.py` on your converted audio and directory of ground-truth audio files
Many of the modules written in `minimal_rvc/` are based on the following repositories:
- https://github.com/ddPn08/rvc-webui
- https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI
- https://github.com/teftef6220/Voice_Separation_and_Selection
If you find our work relevant to your research, please cite:
```bibtex
@misc{sadov2023lowlatency,
      title={Low-latency Real-time Voice Conversion on CPU},
      author={Konstantine Sadov and Matthew Hutter and Asara Near},
      year={2023},
      eprint={2311.00873},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```