<a href="https://colab.research.google.com/github/studiomd2025/notebooks/blob/main/medleyvox.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Colab Inference for MedleyVox

MedleyVox is a [dataset for evaluating algorithms that separate multiple singers](https://arxiv.org/pdf/2211.07302) within a single music track. The [authors of MedleyVox](https://github.com/jeonchangbin49/MedleyVox) also proposed a neural network architecture for separating singers, but they did not publish the weights. Their training process was later [reproduced by Cyru5](https://huggingface.co/Cyru5/MedleyVox/tree/main), who trained several models and made the weights publicly available. This WebUI runs inference with those trained models. Keep the following in mind:
1. Put the [downloaded models](https://huggingface.co/Cyru5/MedleyVox) in the 'checkpoints' folder, one subfolder per model. Each model folder must contain a model file (.pth) and its corresponding configuration file (.json).
2. If you set 'use_overlapadd' to 'w2v' or 'w2v_chunk', you need to download the pretrained model [xlsr_53_56k.pt](https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr_53_56k.pt) and put it in the 'pretrained' folder.
3. At present, the model only supports an output sampling rate of 24 kHz (24000 Hz), and this cannot be changed. If you need a higher sampling rate, you can use [AudioSR](https://github.com/haoheliu/versatile_audio_super_resolution), [Apollo](https://github.com/JusperLee/Apollo), or [Music Source Separation Training](https://github.com/ZFTurbo/Music-Source-Separation-Training) for audio super-resolution.
4. When using the WebUI on cloud platforms or Colab, place the audio to be processed in the 'inputs' folder; the results are stored in the 'results' folder. The 'Select folder' and 'Open folder' buttons do not work in the cloud.
5. If the input is too long, inference may fail for lack of VRAM. In that case, use 'use_overlapadd'. Among the 'use_overlapadd' options, "ola", "ola_norm", and "w2v" all work well. Use w2v_chunk or sf_chunk if these fail, or if you prefer. You can also experiment with the 'vad_method' options spec and webrtc when using either of the "_chunk" methods. Chunking has proven very useful, so it is on by default.
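Precaution 1 describes the folder layout expected under 'checkpoints'. As a minimal sketch, the helper below reports model folders that lack a .pth or .json file. The function name `find_incomplete_models` is hypothetical and not part of the WebUI; it only inspects file extensions and does not load the models.

```python
from pathlib import Path


def find_incomplete_models(checkpoints_dir):
    """Map each model folder name to the file types it is missing."""
    problems = {}
    for model_dir in sorted(Path(checkpoints_dir).iterdir()):
        if not model_dir.is_dir():
            continue  # ignore stray files next to the model folders
        missing = [
            suffix
            for suffix in (".pth", ".json")
            if not any(model_dir.glob(f"*{suffix}"))
        ]
        if missing:
            problems[model_dir.name] = missing
    return problems
```

Calling `find_incomplete_models("checkpoints")` from the repository root returns an empty dictionary when every model folder holds both file types.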

# Initialize environment

In [None]:
# @title Clone repository and install requirements {"display-mode":"form"}
#@markdown # Clone repository and install requirements
#@markdown

!nvidia-smi
!git clone https://github.com/SUC-DriverOld/MedleyVox-Inference-WebUI
%cd /content/MedleyVox-Inference-WebUI
!python -m pip install --upgrade pip==24.0 setuptools
!python -m pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu124
!mkdir -p inputs
!mkdir -p results

# Create the checkpoints directory
!mkdir -p "/content/MedleyVox-Inference-WebUI/checkpoints"

# Download and move vocals_238
!mkdir -p "/content/MedleyVox-Inference-WebUI/checkpoints/vocals_238"
%cd "/content/MedleyVox-Inference-WebUI/checkpoints/vocals_238"
!wget https://huggingface.co/Cyru5/MedleyVox/resolve/main/vocals%20238/vocals.pth
!wget https://huggingface.co/Cyru5/MedleyVox/resolve/main/vocals%20238/vocals.json

# Download and move multi_singing_librispeech_138
!mkdir -p "/content/MedleyVox-Inference-WebUI/checkpoints/multi_singing_librispeech_138"
%cd "/content/MedleyVox-Inference-WebUI/checkpoints/multi_singing_librispeech_138"
!wget https://huggingface.co/Cyru5/MedleyVox/resolve/main/multi_singing_librispeech_138/vocals.pth
!wget https://huggingface.co/Cyru5/MedleyVox/resolve/main/multi_singing_librispeech_138/vocals.json

# Download and move singing_librispeech_ft_iSRNet
!mkdir -p "/content/MedleyVox-Inference-WebUI/checkpoints/singing_librispeech_ft_iSRNet"
%cd "/content/MedleyVox-Inference-WebUI/checkpoints/singing_librispeech_ft_iSRNet"
!wget https://huggingface.co/Cyru5/MedleyVox/resolve/main/singing_librispeech_ft_iSRNet/vocals.pth
!wget https://huggingface.co/Cyru5/MedleyVox/resolve/main/singing_librispeech_ft_iSRNet/vocals.json

# Download pretrained model
!mkdir -p "/content/MedleyVox-Inference-WebUI/pretrained"
%cd "/content/MedleyVox-Inference-WebUI/pretrained"
!wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr_53_56k.pt

%cd /content/MedleyVox-Inference-WebUI

# Inference

### Place the audio to be processed in the 'inputs' folder; the results are stored in the 'results' folder. There are two ways to run inference: the WebUI or the command line.

- Use WebUI: Run the WebUI startup code block and then access the WebUI through the public link.
- Use command line: Select appropriate inference parameters and run the command line code block.
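The command-line route amounts to calling `inference.py` with a set of flags. The sketch below assembles such a command for a subset of the flags that this notebook's command-line cell passes. It assumes the checkpoint file is named `vocals.pth` (so the target is `vocals`) and that models live in 'checkpoints'. The function name `build_inference_command` is hypothetical, and `shlex.join` is used here so that paths containing spaces are quoted.

```python
import shlex


def build_inference_command(model_name, input_dir="inputs", output_dir="results",
                            use_overlapadd="ola", output_format="wav"):
    """Return a shell command string for inference.py (illustrative subset of flags)."""
    args = [
        "python", "inference.py",
        "--target", "vocals",          # assumption: checkpoint file is vocals.pth
        "--exp_name", model_name,      # model folder name inside checkpoints/
        "--model_dir", "checkpoints",
        "--use_gpu", "y",
    ]
    if use_overlapadd != "None":
        # The flag is omitted entirely when overlap-add is disabled.
        args += ["--use_overlapadd", use_overlapadd]
    args += [
        "--output_format", output_format,
        "--inference_data_dir", input_dir,
        "--results_save_dir", output_dir,
    ]
    return shlex.join(args)


print(build_inference_command("vocals_238", input_dir="my songs"))
```

The remaining flags (VAD method, OLA lengths, EMA model, and so on) would be appended in the same way.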

### Explanation of inference parameters. For more information, refer to `inference.py`.

- `Model name`: Select which model you want to use.
- `Use overlapadd`: Select the overlap-add method. ola, ola_norm, and w2v work with the ola_window_len and ola_hop_len arguments. w2v_chunk and sf_chunk perform chunk-wise processing based on VAD, so you have to specify vad_method. If you use sf_chunk (spectral_features_chunk), you also need to specify spectral_features.
- `Separate storage`: Save results in separate folders with the same name as the input file.
- `Output format`: Select the output format of the results.
- `VAD method`: The method used for voice activity detection (VAD), which splits the input into chunks before processing. Only valid when `use_overlapadd` is 'w2v_chunk' or 'sf_chunk'.
- `Spectral features`: The spectral feature used in the correlation calculation for speaker assignment. Only valid when using sf_chunk.
- `OLA window length`: OLA window size in [sec], only valid when using ola or ola_norm. Set 0 to use the default value (None).
- `OLA hop length`: OLA hop size in [sec], only valid when using ola or ola_norm. Set 0 to use the default value (None).
- `Wav2Vec nth layer output`: Wav2Vec nth layer output, only valid when using w2v or w2v_chunk. For example: 0 1 2 3, default: 0
- `Use EMA model`: Use the EMA model instead of the online model. Only valid when args.ema is True (model trained with EMA).
- `Mix consistent output`: Only valid when the model is trained with mixture_consistency loss.
- `Reorder chunks`: OLA reorder chunks. Only valid when using ola or ola_norm.
- `Skip error files`: Skip error files while separating instead of stopping.
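Several of these options only take effect for particular `use_overlapadd` modes. The table below is a sketch that follows the "only valid when" notes in the individual parameter entries; it is an illustration built from this list, not something read from `inference.py`, and the names `OPTIONS_BY_MODE` and `ignored_options` are hypothetical. Note that the `Use overlapadd` entry additionally mentions the OLA lengths for w2v, which this table does not include.

```python
# Advanced options that each use_overlapadd mode reads, per the list above.
OPTIONS_BY_MODE = {
    "None": set(),
    "ola": {"ola_window_len", "ola_hop_len", "reorder_chunks"},
    "ola_norm": {"ola_window_len", "ola_hop_len", "reorder_chunks"},
    "w2v": {"w2v_nth_layer_output"},
    "w2v_chunk": {"vad_method", "w2v_nth_layer_output"},
    "sf_chunk": {"vad_method", "spectral_features"},
}


def ignored_options(mode, chosen):
    """Return the option names in `chosen` that the given mode does not use."""
    return sorted(set(chosen) - OPTIONS_BY_MODE[mode])


print(ignored_options("sf_chunk", ["vad_method", "ola_hop_len", "spectral_features"]))
```

A check like this can flag settings that would be silently ignored for the selected mode.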

In [None]:
# @title Run inference in WebUI {"display-mode":"form"}
#@markdown # Run inference in WebUI
#@markdown

#@markdown

#@markdown Language Setting
language = "English" #@param ["English", "简体中文"]

import os
language_dict = {"English": "en_US", "简体中文": "zh_CN"}
os.environ["LANGUAGE"] = language_dict[language]

%cd /content/MedleyVox-Inference-WebUI
!python webui.py -s

In [None]:
# @title Run inference in Command Line {"display-mode":"form"}
#@markdown # Run inference in Command Line
#@markdown

#@markdown

#@markdown File and model Parameters
folder_input = "inputs" #@param {type:"string"}
store_dir = "results" #@param {type:"string"}
model_name = "multi_singing_librispeech_138" #@param ["vocals_238", "multi_singing_librispeech_138", "singing_librispeech_ft_iSRNet"]

#@markdown

#@markdown Common Parameters
use_overlapadd = "ola" #@param ["None", "ola", "ola_norm", "w2v", "w2v_chunk", "sf_chunk"]
separate_storage = True #@param {type:"boolean"}
skip_error = True #@param {type:"boolean"}
output_format = "wav" #@param ["wav", "flac", "mp3"]

#@markdown

#@markdown Advanced Parameters
vad_method = "spec" #@param ["spec", "webrtc"]
spectral_features = "mfcc" #@param ["mfcc", "spectral_centroid"]
ola_window_len = "0" #@param {type:"string"}
ola_hop_len = "0" #@param {type:"string"}
w2v_nth_layer_output = "0" #@param {type:"string"}
use_ema_model = True #@param {type:"boolean"}
mix_consistent_out = True #@param {type:"boolean"}
reorder_chunks = True #@param {type:"boolean"}

import os
import glob

MODEL_DIR = "checkpoints"  # must match the folder created in the setup cell
PRETRAINED_MODEL_DIR = "pretrained"
use_gpu = True

model_file = os.path.basename(glob.glob(os.path.join(MODEL_DIR, model_name, "*.pth"))[0])
target = model_file.replace(".pth", "")
exp_name = model_name
model_dir = MODEL_DIR
params = f"--target \"{target}\" --exp_name \"{exp_name}\" --model_dir \"{model_dir}\""
if use_gpu:
    params += " --use_gpu y"
else:
    params += " --use_gpu n"
if use_overlapadd != "None":
    params += f" --use_overlapadd {use_overlapadd}"
params += f" --vad_method {vad_method} --spectral_features {spectral_features} --w2v_ckpt_dir {PRETRAINED_MODEL_DIR} --w2v_nth_layer_output {w2v_nth_layer_output}"
if ola_window_len != "0":
    params += f" --ola_window_len {ola_window_len}"
if ola_hop_len != "0":
    params += f" --ola_hop_len {ola_hop_len}"
if use_ema_model:
    params += " --use_ema_model y"
else:
    params += " --use_ema_model n"
if mix_consistent_out:
    params += " --mix_consistent_out y"
else:
    params += " --mix_consistent_out n"
if reorder_chunks:
    params += " --reorder_chunks y"
else:
    params += " --reorder_chunks n"
if skip_error:
    params += " --skip_error y"
else:
    params += " --skip_error n"
if separate_storage:
    params += " --separate_storage y"
else:
    params += " --separate_storage n"
params += f" --output_format {output_format} --inference_data_dir \"{folder_input}\" --results_save_dir \"{store_dir}\""
print(params)

%cd /content/MedleyVox-Inference-WebUI
!python inference.py {params}