# Introduction

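Sovits has two parts, **training and synthesis**. Any model you swap into this notebook for synthesis must have been **trained with Rcell's variant of sovits, which adds the f0 parameter (the three-notebook Colab set; models are only interchangeable within that set)**.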

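**For file formats, see the three-notebook set in the vits column (links are in its comment section):** [vits notes](https://www.bilibili.com/read/cv18478187)

About 95% of problems can be solved by consulting that column; I don't know how to fix the rest either.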

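[One-click dataset creation](https://colab.research.google.com/drive/1qzTZQp7ew7HMal4oApm4DI1cK2JR1N0t)

[One-click training](https://colab.research.google.com/drive/1xTCtNOK0Rglzq06H3iAwBwVnBuz0fgxJ)

**Long audio (over 5 minutes) can be synthesized in one click. A GPU is recommended; CPU is considerably slower.**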

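This notebook combines soft-vc and vits following the approach of [Rcell](https://space.bilibili.com/343303724),
reuses the original Colab structure by [Francis-Komizu](https://space.bilibili.com/636704927), and keeps the name Sovits.

hubert.pt is the content encoder model released by [soft-vc](https://github.com/bshall/hubert), and generator_idxr.pth is the model Rcell published on Hugging Face. Both are hosted on Google Drive to cut download time.
[Sovits](https://github.com/IceKyrin/Sovits) is forked from Francis-Komizu's [GitHub repository](https://github.com/Francis-Komizu/Sovits). For convenience, it bundles the config.json for Rcell's pth and the official hubert module (modified to load a local model).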

# Environment setup

In [None]:
!git clone https://github.com/xzy-git/sovits_infer_rcell
%cd sovits_infer_rcell
!pip install -r requirements.txt
!mkdir pth
!mkdir raw
!mkdir results
%cd wav_temp
!mkdir input
!mkdir output
%cd ..

Cloning into 'sovits_infer_rcell'...
remote: Enumerating objects: 94, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 94 (delta 7), reused 7 (delta 7), pack-reused 85
Unpacking objects: 100% (94/94), done.
/content/sovits_infer_rcell
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Cython==0.29.21
  Downloading Cython-0.29.21-cp37-cp37m-manylinux1_x86_64.whl (2.0 MB)
Collecting matplotlib==3.3.1
  Downloading matplotlib-3.3.1-cp37-cp37m-manylinux1_x86_64.whl (11.6 MB)
Collecting scipy==1.5.2
  Downloading scipy-1.5.2-cp37-cp37m-manylinux1_x86_64.whl (25.9 MB)
Collecting tensorboard==2.3.0
  Downloading tensorboard-2.3.0-py3-none-any.whl (6.8 MB)

/content/sovits_infer_rcell/wav_temp
/content/sovits_infer_rcell


In [None]:
import os
import shutil
import utils
import commons  # provides intersperse(), which get_text() calls below
import torch
import hubert
import librosa
import logging
import soundfile
import torchcrepe
import torchaudio
import numpy as np
from wav_temp import merge
from models import SynthesizerTrn
from text.symbols import symbols
from text import text_to_sequence
from pydub import AudioSegment

logging.getLogger('numba').setLevel(logging.WARNING)

# Delete every file in a directory. os.listdir() returns bare file names,
# so each one is joined with the directory path before calling os.remove().
def del_file(path_data):
    for name in os.listdir(path_data):
        os.remove(os.path.join(path_data, name))


def cut(cut_time, file_path, vocal_name, out_dir):
    audio_segment = AudioSegment.from_file(file_path, format='wav')

    total = int(audio_segment.duration_seconds / cut_time)  # number of full-length slices
    for i in range(total):
        # Slice the audio into cut_time-second pieces (pydub indexes in milliseconds) and name them in order
        audio_segment[i * cut_time * 1000:(i + 1) * cut_time * 1000].export(f"{out_dir}/{vocal_name}-{i}.wav",
                                                                            format="wav")
    # Export whatever remains after the last full slice
    audio_segment[total * cut_time * 1000:].export(f"{out_dir}/{vocal_name}-{total}.wav", format="wav")


def resample_to_22050(audio_path):
    raw_audio, raw_sample_rate = torchaudio.load(audio_path)
    audio_22050 = torchaudio.transforms.Resample(orig_freq=raw_sample_rate, new_freq=22050)(raw_audio)[0]
    soundfile.write(audio_path, audio_22050, 22050)


def get_text(text, hps):
    text_norm = text_to_sequence(text, hps.data.text_cleaners)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    text_norm = torch.LongTensor(text_norm)
    return text_norm


def resize2d(source, target_len):
    source[source < 0.001] = np.nan
    target = np.interp(np.arange(0, len(source) * target_len, len(source)) / target_len, np.arange(0, len(source)),
                       source)
    return np.nan_to_num(target)


def convert_wav_22050_to_f0():
    if torch.cuda.is_available():
        audio, sr = torchcrepe.load.audio(source_path)
        tmp = torchcrepe.predict(audio=audio, fmin=50, fmax=550,
                                 sample_rate=22050, model='full',
                                 batch_size=2048, device='cuda:0').numpy()[0]
    else:
        tmp = librosa.pyin(librosa.load(source_path)[0],
                           fmin=librosa.note_to_hz('C2'),
                           fmax=librosa.note_to_hz('C7'),
                           frame_length=1780)[0]
    f0 = np.zeros_like(tmp)
    f0[tmp > 0] = tmp[tmp > 0]
    return f0

# Load the models

## Load the content encoder

In [None]:
# This is hubert-soft-0d54a1f4.pt from https://github.com/bshall/hubert/releases/tag/v0.1. You may fetch it from another source (adjust the path yourself), but do not substitute a different model.
!gdown --id '1cA37nsiSnsouF2TJkaXb3_VoA-rbifTu' --output /content/sovits_infer_rcell/pth/hubert.pt
hubert_soft = hubert.hubert_soft('/content/sovits_infer_rcell/pth/hubert.pt')

Downloading...
From: https://drive.google.com/uc?id=1cA37nsiSnsouF2TJkaXb3_VoA-rbifTu
To: /content/sovits_infer_rcell/pth/hubert.pt
100% 378M/378M [00:04<00:00, 88.7MB/s]


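## Load the generator

To **use your own model**, comment out the !gdown line (add a "#" at the start of the line; it turns green once it is commented out).

Upload **your own config json (generated in the previous notebook)** to the /content/sovits_infer_rcell/configs/ folder.
Upload **your own model (generated in the previous notebook)** to the /content/sovits_infer_rcell/pth folder.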
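
Before running the loading cell with your own files, a quick sanity check can save a failed run. The sketch below only looks at what the loading cell actually reads from the config (`data.filter_length`, `data.hop_length`, `data.n_speakers`, `train.segment_size`, and the `model` section). The helper names `missing_config_keys` and `check_upload` are hypothetical and not part of the repository.

```python
import json
import os

# Keys that the loading cell reads from the config (see the SynthesizerTrn call).
REQUIRED_KEYS = [
    ("data", "filter_length"),
    ("data", "hop_length"),
    ("data", "n_speakers"),
    ("train", "segment_size"),
    ("model",),
]


def missing_config_keys(config):
    """Return dotted names of required keys that are absent from a parsed config dict."""
    missing = []
    for key_path in REQUIRED_KEYS:
        node = config
        for key in key_path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(key_path))
                break
            node = node[key]
    return missing


def check_upload(config_path, model_path):
    """Hypothetical helper: list problems with the uploaded files (empty list if none found)."""
    problems = []
    for path in (config_path, model_path):
        if not os.path.isfile(path):
            problems.append(f"file not found: {path}")
    if os.path.isfile(config_path):
        with open(config_path, encoding="utf-8") as f:
            config = json.load(f)
        problems += [f"config key missing: {k}" for k in missing_config_keys(config)]
    return problems
```

Calling `check_upload(config_path, model_path)` right after setting the two paths should return an empty list if both files are in place and the config has the expected sections. It does not verify that the checkpoint matches the config.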

In [None]:
from google.colab import drive

#@markdown Use a model stored on Google Drive (if unchecked, the 猫雷 model is downloaded automatically)
g_drive = True #@param {type:"boolean"}
if g_drive:
  drive.mount('/content/drive/')
  config_path = '/content/drive/MyDrive/Altoria/config.json' #@param {type:"string"}
  model_path = '/content/drive/MyDrive/Altoria/G_12000.pth' #@param {type:"string"}
else:
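  # This is G.pth (猫雷) from https://huggingface.co/spaces/innnky/soft-vits-singingvc. You can replace it with your own model (it must be a model of another character trained the sovits way).
  !gdown --id '1gg1Igsa7nOtsLohtv-hNq2mmXCsbFqZJ' --output /content/sovits_infer_rcell/pth/G.pth
  #@markdown If Google Drive is unchecked, 猫雷 is used and the paths below need no changes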
  config_path = "./configs/vctk_base.json" #@param {type:"string"}
  model_path = "/content/sovits_infer_rcell/pth/G.pth" #@param {type:"string"}

hps_ms = utils.get_hparams_from_file(config_path)
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net_g_ms = SynthesizerTrn(
    len(symbols),
    hps_ms.data.filter_length // 2 + 1,
    hps_ms.train.segment_size // hps_ms.data.hop_length,
    n_speakers=hps_ms.data.n_speakers,
    **hps_ms.model)
_ = utils.load_checkpoint(model_path, net_g_ms, None)
_ = net_g_ms.eval().to(dev)

Mounted at /content/drive/


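# Voice conversion

Voice conversion works with **either** of the two options {1, 2} below.
Audio **longer than 10 s and up to 5 minutes** is supported (anything longer takes too long to synthesize).
Upload the audio to the /content/sovits_infer_rcell/raw folder; songs are mixed with their accompaniment automatically.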
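
Longer clips are not processed in one pass: the synthesis cell splits them into pieces of `cut_time` seconds whenever the clip is longer than 1.3 times `cut_time`. The sketch below reproduces that arithmetic from `cut()` so you can see how many pieces a given clip yields. `segment_bounds` is a hypothetical helper, and unlike `cut()` it skips a zero-length tail when the duration is an exact multiple of `cut_time`.

```python
def segment_bounds(duration_s, cut_time_s):
    """Return (start, end) times in seconds for each piece, following the logic of cut()."""
    # Short clips are copied as a single piece (same 1.3x threshold as the synthesis cell).
    if duration_s <= 1.3 * cut_time_s:
        return [(0.0, float(duration_s))]
    total = int(duration_s / cut_time_s)  # number of full-length pieces
    bounds = [(float(i * cut_time_s), float((i + 1) * cut_time_s)) for i in range(total)]
    tail_start = float(total * cut_time_s)
    if tail_start < duration_s:  # assumption: drop an empty tail instead of exporting it
        bounds.append((tail_start, float(duration_s)))
    return bounds


# A 95 s clip with 30 s pieces gives three full pieces plus a 5 s tail.
print(segment_bounds(95, 30))
```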

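Separate the song with [spleeter](https://github.com/deezer/spleeter) in 2stems mode, which produces the two files (vocals and accompaniment) automatically. See the official documentation for usage.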

spleeter separate -p spleeter:2stems -o output audio_example.mp3

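**Results are written to the results folder automatically.** Download them yourself; there is no in-notebook preview.
The mp3 is the automatically mixed song with accompaniment; out_vits is the vocals alone.

Off-key notes and cracking are mostly caused by the limited vocal range captured in the livestream recordings, and there is no fix for that: the 猫雷 model can reach neither the high notes nor the low ones.
The demo is 牵丝戏; downloading it is optional. The low notes at the start and the opera-style passages clearly break up, while the rest is acceptable.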

In [None]:
# Download it from the results folder to listen
!gdown --id '1ymJDK1VSESzv2xv_2Ce8h4QoSnzoplt7' --output /content/sovits_infer_rcell/results/demo.mp3

Downloading...
From: https://drive.google.com/uc?id=1ymJDK1VSESzv2xv_2Ce8h4QoSnzoplt7
To: /content/sovits_infer_rcell/results/demo.mp3
100% 3.83M/3.83M [00:00<00:00, 289MB/s]


1. Use the reference audio

In [None]:
!gdown --id '10JQMPdzp0gjg9cVVersxVZWhIr4UwrFF' --output /content/sovits_infer_rcell/raw/vocals.wav

Downloading...
From: https://drive.google.com/uc?id=10JQMPdzp0gjg9cVVersxVZWhIr4UwrFF
To: /content/sovits_infer_rcell/raw/vocals.wav
100% 882k/882k [00:00<00:00, 131MB/s]


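2. Use uploaded audio

Upload your own file to the raw folder (mono, 22050 Hz, WAV format). A bgm.wav accompaniment is optional (it must also be WAV); without it, only the vocals are synthesized.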
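
To confirm a file meets the mono / 22050 Hz / WAV requirement before uploading, a small check with Python's standard `wave` module is enough. This is a minimal sketch: `wav_format_problems` is a hypothetical helper, and `wave` only reads PCM WAV files, so other encodings raise `wave.Error` rather than returning a problem list.

```python
import wave


def wav_format_problems(path, expected_rate=22050):
    """Return a list of format problems for a PCM WAV file (empty list if it looks fine)."""
    problems = []
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    return problems
```

The synthesis cell also resamples the vocal file in place with `resample_to_22050`, so for vocals this check is mainly a safeguard; it is more useful for the accompaniment file, which that function does not touch.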

3. Synthesize the audio

In [None]:
#@markdown **Mono, 22050 Hz, WAV format**

#@markdown Vocal file name (without .wav)
clean_name = "vocals" #@param {type:"string"}
#@markdown Accompaniment file name (optional; without .wav)
bgm_name = "bgm" #@param {type:"string"}
#@markdown Length of each processed chunk in seconds; keep it at 30 or below, larger values can exhaust GPU memory
cut_time = "30" #@param {type:"string"}
#@markdown Pitch shift factor; normally left unchanged
vc_transform = 1 #@param {type:"string"}
#@markdown Speaker id (0 is 猫雷)
speaker_id = "0" #@param {type:"string"}

out_audio_name = clean_name

resample_to_22050(f'./raw/{clean_name}.wav')
del_file("./wav_temp/input/")
del_file("./wav_temp/output/")

raw_audio_path = f"./raw/{clean_name}.wav"

audio, sample_rate = torchaudio.load(raw_audio_path)

audio_time = audio.shape[-1] / 22050
if audio_time > 1.3 * int(cut_time):
    cut(int(cut_time), raw_audio_path, clean_name, "./wav_temp/input")
else:
    shutil.copy(f"./raw/{clean_name}.wav", f"./wav_temp/input/{clean_name}-0.wav")
file_list = os.listdir("./wav_temp/input")

count = 0
for file_name in file_list:
    source_path = "./wav_temp/input/" + file_name
    # vc_transform is taken from the form field above rather than being reset here
    audio, sample_rate = torchaudio.load(source_path)
    input_size = audio.shape[-1]
    # The hubert content encoder expects 16 kHz input; keep only the first channel
    if sample_rate != 16000:
        audio = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(audio)
    audio = audio[0]

    # This version uses torchcrepe to speed up f0 extraction (librosa.pyin is the CPU fallback)
    f0 = convert_wav_22050_to_f0()

    source = torch.FloatTensor(audio).to(dev).unsqueeze(0).unsqueeze(0)
    with torch.inference_mode():
        units = hubert_soft.units(source)
        soft = units.squeeze(0).cpu().numpy()
        f0 = resize2d(f0, len(soft[:, 0])) * int(vc_transform)
        soft[:, 0] = f0 / 10

    sid = torch.LongTensor([int(speaker_id)]).to(dev)
    stn_tst = torch.FloatTensor(soft)
    x_tst = stn_tst.to(dev).unsqueeze(0)
    x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).to(dev)

    with torch.no_grad():
        audio = net_g_ms.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=0, noise_scale_w=0, length_scale=1)[0][
            0, 0].data.cpu().float().numpy()
    soundfile.write("./wav_temp/output/" + file_name, audio, int(audio.shape[0] / input_size * 22050))
    count += 1
    print("%s success: %.2f%%" % (file_name, 100 * count / len(file_list)))
merge.run(clean_name, bgm_name, out_audio_name)

torch.Size([1, 256, 187]) tensor([187], device='cuda:0')
torch.Size([1, 1, 187])
tensor([321], device='cuda:0')
vocals-0.wav success: 100.00%
out vits success
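
The `vc_transform` field is applied as a plain multiplier on the extracted f0 (`resize2d(...) * int(vc_transform)`). If you prefer to think in semitones, the conversion below may help. It assumes twelve-tone equal temperament, and `semitones_to_ratio` is a hypothetical helper that is not part of the repository. Note that the cell passes the value through `int()`, so a non-integer factor is not applied as written unless that conversion is changed to `float()`.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a shift of `semitones` in twelve-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


# One octave up doubles the frequency; one octave down halves it.
print(semitones_to_ratio(12), semitones_to_ratio(-12))
```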


# References

https://github.com/bshall/soft-vc

[Any-to-one voice conversion based on VITS and SoftVC](https://www.bilibili.com/video/BV1S14y1x78X?share_source=copy_web&vd_source=630b87174c967a898cae3765fba3bfa8)

