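# Creating a dataset (this Colab only supports multi-speaker models; **with the pretrained model** you can supply a single speaker, in the same format as multi-speaker)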

**[sovits collection index](https://github.com/IceKyrin/sovits_guide)**

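1. You can use the **pretrained model (now replaced with a 22050 Hz version)** to save training time. Speaker ids range from 0 to 7, so **at most eight speakers** are supported; just follow the prompts.

2. Using the pretrained model means **loading a well-trained model and continuing training from it**, **washing out its original timbre** on top of that base. With about two thousand clips, results appear after only **100-200 epochs**; training from scratch takes roughly 500-800 epochs (depending on the amount of data).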

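3. Data must be **unaccompanied vocals, 5-15 s long, 22050 Hz sample rate, mono, WAV format**. If you hit an error, check the dataset first.

[Automatic slicer](https://github.com/openvpi/audio-slicer) (**its output will include clips that are too long; do not build the dataset straight from the slicer output**)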

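4. Each speaker's audio must use a different id. At least 1000 clips per speaker gives good results (anything from 500 to 5000 works, and quality improves with more data).

5. The pretrained model is a **multi-speaker model**, but **you can train it with data for just one id**. A model trained **from scratch** needs at least **2** speakers and a **total** of **more than 3000 clips**.

6. This notebook can only process **one id per run**. **Clear the data, disconnect and restart the runtime**, then repeat (changing the id) for each additional speaker.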
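The clip requirements in note 3 can be checked before uploading. The following is a minimal sketch using only the standard-library `wave` module; `check_wav` and its parameter names are hypothetical helpers, not part of the notebook, and `wave` only reads PCM WAV files, so other encodings will raise an error. The notebook resamples audio itself, so a sample-rate mismatch here is informational rather than fatal.

```python
import wave

def check_wav(path, expected_rate=22050, min_seconds=5.0, max_seconds=15.0):
    """Return a list of problems found in one PCM wav file (empty list means it passes)."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if rate != expected_rate:
        problems.append(f"{rate} Hz, expected {expected_rate} Hz")
    if not min_seconds <= seconds <= max_seconds:
        problems.append(f"{seconds:.1f} s, expected {min_seconds}-{max_seconds} s")
    return problems
```

Calling `check_wav` on every file in a folder and printing the non-empty results is a quick way to find over-long clips left behind by the slicer.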

Next: [one-click training](https://colab.research.google.com/drive/1DexYpwWIdD_RRqQ165l-YoWMzFAHIbPy)

Next: [one-click synthesis](https://colab.research.google.com/drive/1F3VpHCi9eridGw1F1hbqR7qhXGKuSCus#scrollTo=vjkgBV7j2cVJ)


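Dataset format:

1. Give each speaker an English or pinyin name using **only ASCII letters and digits**. The examples here assume the name is paimeng.

2. Create a project folder named paimeng, put a wavs folder inside it, and put train and val folders inside wavs.

3. train holds the training data and val holds the validation data. Split the processed wav files between them at a ratio of about 9:1 (every speaker needs both folders); val needs at least 50 clips.

4. **Right-click the paimeng folder and compress it to a zip.** Extracting the zip to the current folder should produce paimeng, containing wavs and so on. **The archive must not extract directly to wavs, and paimeng must not be nested inside extra paimeng folders.**

5. In the root of your Google Drive, create a dataset folder and upload paimeng.zip to it.

6. **wav file names** may use Chinese characters, English letters, digits, and underscores. **Do not use spaces** or other unusual characters.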
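The 9:1 split from step 3 can be scripted locally. This is a minimal sketch, assuming all processed clips for one speaker sit in a single flat folder; `split_train_val`, `src_dir`, and `project_dir` are hypothetical names, and the function copies files rather than moving them.

```python
import os
import random
import shutil

def split_train_val(src_dir, project_dir, val_ratio=0.1, min_val=50, seed=0):
    """Copy wavs from src_dir into project_dir/wavs/train and project_dir/wavs/val."""
    names = sorted(n for n in os.listdir(src_dir) if n.endswith(".wav"))
    random.Random(seed).shuffle(names)
    # About 10% goes to val, but never fewer than min_val clips (or more than exist)
    n_val = min(len(names), max(min_val, round(len(names) * val_ratio)))
    split = {"val": names[:n_val], "train": names[n_val:]}
    for folder, members in split.items():
        out_dir = os.path.join(project_dir, "wavs", folder)
        os.makedirs(out_dir, exist_ok=True)
        for name in members:
            shutil.copy2(os.path.join(src_dir, name), os.path.join(out_dir, name))
    return {folder: len(members) for folder, members in split.items()}
```

For example, `split_train_val("clips", "paimeng")` on 1000 clips would place 100 in val and 900 in train. With `min_val` clips or fewer, everything lands in val, so check the returned counts.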

Follow the layout below:
```
paimeng
└───wavs
    ├───val
    │   ├───xxx1-xxx1.wav
    │   ├───...
    │   └───Lxx-0xx8.wav
    └───train
        ├───xx2-0xxx2.wav
        ├───...
        └───xxx7-xxx007.wav
```
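Because a wrongly nested archive is an easy mistake to make, it can help to inspect the zip before uploading. The sketch below uses the standard-library `zipfile` module; `check_zip_layout` is a hypothetical helper, and it only looks at entry names inside the archive.

```python
import zipfile

def check_zip_layout(zip_path, name):
    """Return a list of layout problems in the dataset zip (empty list means it looks right)."""
    with zipfile.ZipFile(zip_path) as archive:
        files = [n for n in archive.namelist() if not n.endswith("/")]
    problems = []
    for folder in ("train", "val"):
        prefix = f"{name}/wavs/{folder}/"
        if not any(n.startswith(prefix) and n.endswith(".wav") for n in files):
            problems.append(f"no wav files under {prefix}")
    if any(n.startswith("wavs/") for n in files):
        problems.append("archive extracts directly to wavs")
    if any(n.startswith(f"{name}/{name}/") for n in files):
        problems.append(f"{name} is nested inside another {name} folder")
    return problems
```

An empty list from `check_zip_layout("paimeng.zip", "paimeng")` suggests the archive matches the tree above.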


In [1]:
#@title Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
#@title Dataset settings
#@markdown Dataset name (**the speaker's English/pinyin name**, matching the dataset folder name; without .zip).
DATASETNAME = "Altoria"  #@param {type:"string"}
#@markdown Zip location (Google Drive path; leave unchanged if you uploaded to dataset)
ZIP_PATH = "/content/drive/MyDrive/dataset/"  #@param {type:"string"}
ZIP_PATH = ZIP_PATH + DATASETNAME+".zip"
DATASETPATH = "/content/" + DATASETNAME
!cp {ZIP_PATH} {DATASETPATH}.zip
%cd /content/
!unzip -q {DATASETNAME}.zip
!pip install torchaudio soundfile
import os
import soundfile
import torchaudio
#@markdown **The training notebook offers an optional download of the built-in 22050 Hz pretrained model**

#@markdown Audio is converted to 22050 Hz automatically. If checked, 44100 Hz is used instead (files double in size and the model config must match).

#@markdown **If you don't understand this option, leave it unchecked**
high_sample_rate = False #@param {type:"boolean"}
target_sample = 44100 if high_sample_rate else 22050

def resample_to_target(audio_path):
    """Resample one wav file in place to target_sample, keeping only the first channel."""
    raw_audio, raw_sample_rate = torchaudio.load(audio_path)
    if raw_sample_rate != target_sample:
        # new_freq must equal the rate passed to soundfile.write, or pitch and speed shift
        resampled = torchaudio.transforms.Resample(orig_freq=raw_sample_rate, new_freq=target_sample)(raw_audio)[0]
        soundfile.write(audio_path, resampled.numpy(), target_sample)

for folder in ("train", "val"):
    for name in os.listdir(f"{DATASETPATH}/wavs/{folder}"):
        # splitext keeps file names that contain extra dots intact
        resample_to_target(f"{DATASETPATH}/wavs/{folder}/{os.path.splitext(name)[0]}.wav")

/content
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
#@title Automatic preprocessing (about 3 min per two thousand 5-15 s clips; estimate your own time accordingly)
%cd {DATASETPATH}
!git clone https://github.com/xzy-git/hubert.git
%cd hubert
!python encode.py soft  {DATASETPATH}/wavs {DATASETPATH}/soft --extension .wav 

/content/Altoria
Cloning into 'hubert'...
remote: Enumerating objects: 111, done.
remote: Counting objects: 100% (111/111), done.
remote: Compressing objects: 100% (80/80), done.
remote: Total 111 (delta 54), reused 78 (delta 30), pack-reused 0
Receiving objects: 100% (111/111), 527.00 KiB | 21.96 MiB/s, done.
Resolving deltas: 100% (54/54), done.
/content/Altoria/hubert
Loading hubert checkpoint
  "You are about to download and run code from an untrusted repository. In a future release, this won't "
Downloading: "https://github.com/bshall/hubert/zipball/main" to /root/.cache/torch/hub/main.zip
Downloading: "https://github.com/bshall/hubert/releases/download/v0.1/hubert-soft-0d54a1f4.pt" to /root/.cache/torch/hub/checkpoints/hubert-soft-0d54a1f4.pt
100% 361M/361M [00:17<00:00, 21.8MB/s]
Encoding dataset at /content/Altoria/wavs
100% 4033/4033 [04:30<00:00, 14.92it/s]


In [4]:
#@title Generate the txt file lists
#@markdown Enter the speaker id, e.g. "|0|" or "|1|" (unique per speaker, numbered from 0)

#@markdown When using the pretrained model to speed up training, the id must be 0-7, so **at most eight speakers**
ID = "|0|"  #@param {type:"string"}
%cd {DATASETPATH}
import os
# Each line is: wav path|speaker id|soft feature path|pitch path
for folder in ("train", "val"):
  with open(f"{folder}.txt", "w", encoding="utf-8") as f:
    for i in os.listdir(f"{DATASETPATH}/soft/{folder}"):
      # splitext keeps file names that contain extra dots intact
      stem = os.path.splitext(i)[0]
      f.write(f"dataset/{DATASETNAME}/wavs/{folder}/{stem}.wav{ID}dataset/{DATASETNAME}/soft/{folder}/{i}|dataset/{DATASETNAME}/pitch/{folder}/{stem}.npy\n")

/content/Altoria
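Each line written to train.txt and val.txt has four `|`-separated fields: the wav path, the speaker id, the soft-feature path, and the pitch path. As a minimal sketch of that format, the hypothetical helper `parse_filelist_line` below splits one line back into its parts; it assumes no file name contains a `|` character.

```python
def parse_filelist_line(line):
    """Split one train.txt/val.txt line into its four fields."""
    wav_path, speaker_id, soft_path, pitch_path = line.rstrip("\n").split("|")
    return {"wav": wav_path, "speaker": int(speaker_id), "soft": soft_path, "pitch": pitch_path}

# Example line in the format produced by the cell above (paths are illustrative)
line = "dataset/paimeng/wavs/train/a.wav|0|dataset/paimeng/soft/train/a.npy|dataset/paimeng/pitch/train/a.npy\n"
entry = parse_filelist_line(line)
print(entry["speaker"], entry["pitch"])
```

A line with a missing field or an ID typed without both bars makes the unpacking raise `ValueError`, which is a cheap way to spot a malformed list.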


In [5]:
#@markdown Wait for this download to finish
%cd /content
!git clone https://github.com/xzy-git/so-vits-svc
!pip install pyworld

/content
Cloning into 'so-vits-svc'...
remote: Enumerating objects: 131, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 131 (delta 10), reused 8 (delta 8), pack-reused 117
Receiving objects: 100% (131/131), 23.65 MiB | 19.67 MiB/s, done.
Resolving deltas: 100% (36/36), done.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyworld
  Downloading pyworld-0.3.0.tar.gz (212 kB)
     |████████████████████████████████| 212 kB 15.7 MB/s
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: pyworld
  Building wheel for pyworld (PEP 517) ... done
  Created wheel for pyworld: filename=pyworld-0.3.0-cp37-cp37m-linux_x86_64.whl size=609476 sha256=13f5674c9da24fd6e0d81151dbbc548761bc38c35f74fa23

In [6]:
#@title Generate pitch files
%cd /content/so-vits-svc
# -p creates parent directories and does not fail when the cell is re-run
!mkdir -p /content/{DATASETNAME}/pitch/train
!mkdir -p /content/{DATASETNAME}/pitch/val
import os
import numpy as np
from preprocess_wave import FeatureInput

hop_size = 512 if high_sample_rate else 256
featureInput = FeatureInput(target_sample, hop_size)

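def resize2d(x, target_len):
    # Linearly resample the f0 track x to target_len frames
    source = np.array(x)
    source[source<0.001] = np.nan  # mark near-zero (unvoiced) frames so they are not blended into voiced pitch
    target = np.interp(np.arange(0, len(source)*target_len, len(source))/ target_len, np.arange(0, len(source)), source)
    res = np.nan_to_num(target)  # NaN frames become 0 again
    return res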

def get_end_file(dir_path, end):
    file_lists = []
    for root, dirs, files in os.walk(dir_path):
        for f_file in files:
            if f_file.endswith(end):
                file_lists.append(os.path.join(root, f_file).replace("\\", "/"))

    return file_lists


for folder in ("train", "val"):
    wav_paths = get_end_file(f"/content/{DATASETNAME}/wavs/{folder}/", "wav")
    for wav_path in wav_paths:
        # The soft features set the target frame count for the pitch track
        soft = np.load(wav_path.replace("wavs", "soft").replace(".wav", ".npy"))
        feature_pit = featureInput.compute_f0(wav_path)
        feature_pit = resize2d(feature_pit, soft.shape[0])
        pitch = featureInput.coarse_f0(feature_pit)
        np.save(wav_path.replace("wavs", "pitch").replace(".wav", ".npy"), pitch)

/content/so-vits-svc
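The `resize2d` step stretches or shrinks the f0 track so it has one value per soft-feature frame. The following dependency-free sketch illustrates only the sampling grid and linear interpolation that the cell hands to `np.interp`; it deliberately leaves out the unvoiced-frame masking, and `resample_positions` and `linear_resample` are hypothetical names, not functions from so-vits-svc.

```python
def resample_positions(source_len, target_len):
    """Fractional source indices sampled for each of target_len output frames."""
    return [i * source_len / target_len for i in range(target_len)]

def linear_resample(values, target_len):
    """Linearly interpolate values at those positions, clamping past the last sample."""
    last = len(values) - 1
    out = []
    for pos in resample_positions(len(values), target_len):
        if pos >= last:
            out.append(float(values[last]))
            continue
        left = int(pos)
        frac = pos - left
        out.append(values[left] * (1 - frac) + values[left + 1] * frac)
    return out

# Stretch a 4-frame track to 8 frames
print(linear_resample([100.0, 200.0, 300.0, 400.0], 8))
# → [100.0, 150.0, 200.0, 250.0, 300.0, 350.0, 400.0, 400.0]
```

Output frame `i` reads the source at index `i * len(source) / target_len`, so positions beyond the last source frame repeat its value.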


In [7]:
#@title Package the dataset
#@markdown Path for the processed zip
ZIP_PATH = "/content/drive/MyDrive/dataset/"  #@param {type:"string"}
ZIP_PATH = ZIP_PATH + "out_" + DATASETNAME + ".zip"
%cd /content/
!zip -q -r {ZIP_PATH} {DATASETNAME}
#@markdown Refresh Google Drive and a new out_xxx.zip will appear (at the path you entered)

/content
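Training expects every wav to have a matching soft and pitch `.npy` file, so a quick consistency check on the processed folder can catch a cell that was interrupted. This is a minimal sketch under that assumption; `missing_features` is a hypothetical helper, and `dataset_dir` would be `/content/<dataset name>` in this notebook. If it reports anything, rerun the relevant cell and package again.

```python
import os

def missing_features(dataset_dir):
    """List feature files (soft or pitch .npy) that are missing for existing wavs."""
    missing = []
    for folder in ("train", "val"):
        wav_dir = os.path.join(dataset_dir, "wavs", folder)
        for name in sorted(os.listdir(wav_dir)):
            stem, ext = os.path.splitext(name)
            if ext != ".wav":
                continue
            for kind in ("soft", "pitch"):
                feature = os.path.join(dataset_dir, kind, folder, stem + ".npy")
                if not os.path.isfile(feature):
                    missing.append(f"{kind}/{folder}/{stem}.npy")
    return missing
```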


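For a multi-speaker dataset, fill in the ID field in sequence with "|0|", "|1|", and so on. Build one speaker's zip per run, then disconnect, clear the data, and run the notebook again for the next speaker.

For example:

The first speaker is paimeng: set ID to "|0|", which produces out_paimeng.zip

The second speaker is zhangsan: set ID to "|1|", which produces out_zhangsan.zip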
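When preparing several speakers, it can help to write down the id and output name for each run in advance. The sketch below only does that bookkeeping; `assign_ids` and `MAX_PRETRAINED_SPEAKERS` are hypothetical names, and the limit of eight reflects the pretrained model's id range of 0-7.

```python
MAX_PRETRAINED_SPEAKERS = 8  # the pretrained model accepts ids 0-7

def assign_ids(speakers):
    """Map each speaker name to its ID field and output zip name, in order."""
    if len(set(speakers)) != len(speakers):
        raise ValueError("speaker names must be unique")
    if len(speakers) > MAX_PRETRAINED_SPEAKERS:
        raise ValueError("the pretrained model supports at most 8 speakers")
    return {name: (f"|{index}|", f"out_{name}.zip") for index, name in enumerate(speakers)}

for name, (id_field, zip_name) in assign_ids(["paimeng", "zhangsan"]).items():
    print(name, id_field, zip_name)
# paimeng |0| out_paimeng.zip
# zhangsan |1| out_zhangsan.zip
```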

Then move on to the one-click training notebook.