## Test results

- Test code: [speed_test.ipynb](speed_test.ipynb)
- Test hardware: Intel i5-12400 CPU, 32GB RAM, 1x NVIDIA GeForce RTX 4070
- Runtime environment: Ubuntu 24.04.1 LTS, CUDA 12.4, Python 3.10.16
- Test notes: all figures come from single-task runs (not a concurrency test)


##### **Usage note:** move this file into the CosyVoice directory and install the IPython module before running it

## Default configuration

In [None]:
import time
import asyncio
import torchaudio

import sys
sys.path.append('third_party/Matcha-TTS')

from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

prompt_text = '希望你以后能够做得比我还好哟'
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=True)

In [None]:
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', prompt_text, prompt_speech_16k, stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 18:00:30,056 INFO synthesis text 收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-02-25 18:00:33,147 INFO yield speech len 11.32, rtf 0.27304929895030317
100%|██████████| 1/1 [00:03<00:00,  3.43s/it]
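
The `rtf` value in the log above is a real-time factor: seconds of processing per second of audio produced, so lower is faster and anything below 1 is faster than real time. The sketch below recomputes it from the two log timestamps. It assumes the timing window runs from the `synthesis text` line to the `yield speech` line, which is an approximation of whatever CosyVoice measures internally, and `real_time_factor` is a hypothetical helper, not part of the library.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

def real_time_factor(start: str, end: str, speech_len_s: float) -> float:
    """Elapsed wall-clock seconds divided by seconds of audio produced."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds() / speech_len_s

# Timestamps and speech length copied from the non-streaming log above.
rtf = real_time_factor("2025-02-25 18:00:30,056", "2025-02-25 18:00:33,147", 11.32)
print(f"{rtf:.3f}")
```

The result agrees with the logged value to about three decimal places; the small difference comes from the millisecond resolution of the log timestamps.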


In [None]:
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', prompt_text, prompt_speech_16k, stream=True)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 18:00:42,181 INFO synthesis text 收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-02-25 18:00:43,121 INFO yield speech len 1.84, rtf 0.5107895187709642
2025-02-25 18:00:43,828 INFO yield speech len 2.0, rtf 0.35290777683258057
2025-02-25 18:00:44,463 INFO yield speech len 2.0, rtf 0.31650030612945557
2025-02-25 18:00:45,168 INFO yield speech len 2.0, rtf 0.35182738304138184
2025-02-25 18:00:45,841 INFO yield speech len 1.96, rtf 0.3418404228833257
100%|██████████| 1/1 [00:03<00:00,  4.00s/it]
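
With `stream=True` the per-chunk `rtf` values matter less than two aggregate numbers: how long the listener waits for the first chunk, and whether the whole utterance was produced faster than it plays back. The following sketch derives both from the log lines above. The regular expression and the `summarize` helper are illustrative assumptions based only on the log format shown here.

```python
import re
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"
START = "2025-02-25 18:00:42,181"  # timestamp of the "synthesis text" line
YIELD_LINES = [
    "2025-02-25 18:00:43,121 INFO yield speech len 1.84, rtf 0.5107895187709642",
    "2025-02-25 18:00:43,828 INFO yield speech len 2.0, rtf 0.35290777683258057",
    "2025-02-25 18:00:44,463 INFO yield speech len 2.0, rtf 0.31650030612945557",
    "2025-02-25 18:00:45,168 INFO yield speech len 2.0, rtf 0.35182738304138184",
    "2025-02-25 18:00:45,841 INFO yield speech len 1.96, rtf 0.3418404228833257",
]
PATTERN = re.compile(r"^(\S+ \S+) INFO yield speech len ([\d.]+), rtf ([\d.]+)$")

def summarize(start, lines):
    """Return (seconds until first chunk, total elapsed / total audio length)."""
    t0 = datetime.strptime(start, FMT)
    stamps, lengths = [], []
    for line in lines:
        match = PATTERN.match(line)
        if match is None:
            continue  # ignore progress bars and other log lines
        stamps.append(datetime.strptime(match.group(1), FMT))
        lengths.append(float(match.group(2)))
    first_chunk_latency = (stamps[0] - t0).total_seconds()
    overall_rtf = (stamps[-1] - t0).total_seconds() / sum(lengths)
    return first_chunk_latency, overall_rtf

first_chunk_latency, overall_rtf = summarize(START, YIELD_LINES)
print(f"first chunk after {first_chunk_latency:.2f} s, overall RTF {overall_rtf:.3f}")
```

Compared with the non-streaming run, streaming trades a somewhat higher overall RTF for audio that starts playing in under a second instead of after the full synthesis.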


In [None]:
def text_generator():
    yield '收到好友从远方寄来的生日礼物，'
    yield '那份意外的惊喜与深深的祝福'
    yield '让我心中充满了甜蜜的快乐，'
    yield '笑容如花儿般绽放。'
for i, j in enumerate(cosyvoice.inference_zero_shot(text_generator(), prompt_text, prompt_speech_16k, stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

2025-02-25 18:00:50,102 INFO get tts_text generator, will skip text_normalize!
  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 18:00:50,104 INFO get tts_text generator, will return _extract_text_token_generator!
2025-02-25 18:00:50,378 INFO synthesis text <generator object text_generator at 0x7b922c6beb20>
2025-02-25 18:00:50,380 INFO append 5 text token 15 speech token
2025-02-25 18:00:50,381 INFO append 5 text token 15 speech token
2025-02-25 18:00:50,381 INFO append 5 text token 15 speech token
2025-02-25 18:00:50,381 INFO append 5 text token 15 speech token
2025-02-25 18:00:50,382 INFO append 5 text token 15 speech token
2025-02-25 18:00:50,382 INFO append 5 text token 15 speech token
2025-02-25 18:00:50,382 INFO not enough text token to decode, wait for more
2025-02-25 18:00:50,383 INFO append 5 text token 4 speech token
2025-02-25 18:00:50,577 INFO fill_token index 11 next fill_token index 27
2025-02-25 18:00:50,578 INFO get fill token, need to append more text token
2025-02-25 1
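
The `text_generator` in the cell above is written out by hand. When the input arrives as one string, a generator can be produced by splitting at punctuation, as in the sketch below. `chunk_text` and its default delimiter set are hypothetical and not part of CosyVoice. As the log notes, text normalization is skipped for generator input, so the pieces are passed through as they are.

```python
import re

def chunk_text(text: str, delimiters: str = "，。！？；"):
    """Yield consecutive pieces of `text`, each ending at a delimiter when one follows."""
    cls = re.escape(delimiters)
    # One or more non-delimiter characters, optionally followed by one delimiter.
    for match in re.finditer(f"[^{cls}]+[{cls}]?", text):
        yield match.group(0)

text = "收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。"
pieces = list(chunk_text(text))
for piece in pieces:
    print(piece)
```

This yields three pieces rather than the four used in the notebook, because the notebook also splits once in the middle of a clause where there is no punctuation.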

In [None]:
def text_generator():
    yield '收到好友从远方寄来的生日礼物，'
    yield '那份意外的惊喜与深深的祝福'
    yield '让我心中充满了甜蜜的快乐，'
    yield '笑容如花儿般绽放。'
for i, j in enumerate(cosyvoice.inference_zero_shot(text_generator(), prompt_text, prompt_speech_16k, stream=True)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

2025-02-25 18:00:58,436 INFO get tts_text generator, will skip text_normalize!
  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 18:00:58,442 INFO get tts_text generator, will return _extract_text_token_generator!
2025-02-25 18:00:58,796 INFO synthesis text <generator object text_generator at 0x7b922c6bf0d0>
2025-02-25 18:00:58,798 INFO append 5 text token 15 speech token
2025-02-25 18:00:58,799 INFO append 5 text token 15 speech token
2025-02-25 18:00:58,799 INFO append 5 text token 15 speech token
2025-02-25 18:00:58,799 INFO append 5 text token 15 speech token
2025-02-25 18:00:58,799 INFO append 5 text token 15 speech token
2025-02-25 18:00:58,800 INFO append 5 text token 15 speech token
2025-02-25 18:00:58,800 INFO not enough text token to decode, wait for more
2025-02-25 18:00:58,800 INFO append 5 text token 4 speech token
2025-02-25 18:00:58,921 INFO fill_token index 11 next fill_token index 27
2025-02-25 18:00:58,921 INFO get fill token, need to append more text token
2025-02-25 1

In [None]:
# instruct usage
for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', '用四川话说这句话', prompt_speech_16k, stream=False)):
    torchaudio.save('instruct2_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)


  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 18:01:10,454 INFO synthesis text 收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-02-25 18:01:13,449 INFO yield speech len 10.76, rtf 0.2784381790232038
100%|██████████| 1/1 [00:03<00:00,  3.38s/it]


In [None]:
# instruct usage
def text_generator():
    yield '收到好友从远方寄来的生日礼物，'
    yield '那份意外的惊喜与深深的祝福'
    yield '让我心中充满了甜蜜的快乐，'
    yield '笑容如花儿般绽放。'
for i, j in enumerate(cosyvoice.inference_instruct2(text_generator(), '用四川话说这句话', prompt_speech_16k, stream=False)):
    torchaudio.save('instruct2_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)


2025-02-25 18:01:16,618 INFO get tts_text generator, will skip text_normalize!
  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 18:01:16,625 INFO get tts_text generator, will return _extract_text_token_generator!
2025-02-25 18:01:16,973 INFO synthesis text <generator object text_generator at 0x7b922c6bf140>
2025-02-25 18:01:16,975 INFO get fill token, need to append more text token
2025-02-25 18:01:16,975 INFO append 5 text token
2025-02-25 18:01:17,232 INFO fill_token index 15 next fill_token index 31
2025-02-25 18:01:17,233 INFO get fill token, need to append more text token
2025-02-25 18:01:17,233 INFO append 5 text token
2025-02-25 18:01:17,382 INFO fill_token index 31 next fill_token index 47
2025-02-25 18:01:17,383 INFO get fill token, need to append more text token
2025-02-25 18:01:17,383 INFO append 5 text token
2025-02-25 18:01:17,531 INFO fill_token index 47 next fill_token index 63
2025-02-25 18:01:17,532 INFO get fill token, need to append more text token
2025-02-25 18:01:17

## Accelerating LLM inference with vLLM

In [None]:
import time
import asyncio
import torchaudio

import sys
sys.path.append('third_party/Matcha-TTS')

from async_cosyvoice.async_cosyvoice import AsyncCosyVoice2
from cosyvoice.utils.file_utils import load_wav

prompt_text = '希望你以后能够做得比我还好哟'
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)

cosyvoice = AsyncCosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=True)

In [None]:
i = 0
async for j in cosyvoice.inference_sft('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', spk_id='xiaohe', stream=False):
    torchaudio.save('sft_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
    i += 1


  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 18:07:13,230 INFO synthesis text 收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-02-25 18:07:14,959 INFO yield speech len 11.4, rtf 0.15169396735074228
100%|██████████| 1/1 [00:01<00:00,  1.75s/it]
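
The cells in this section use `async for` at the top level, which works in IPython and Jupyter because they allow top-level `await`. In a plain Python script the same loop has to live inside a coroutine that is driven by `asyncio.run`. The sketch below shows the pattern with a stand-in async generator. `fake_inference` is a hypothetical placeholder for the `AsyncCosyVoice2` inference methods, so the example runs without a model or GPU.

```python
import asyncio

async def fake_inference(n_chunks: int):
    """Hypothetical stand-in for an AsyncCosyVoice2 inference_* async generator."""
    for k in range(n_chunks):
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield {"tts_speech": f"chunk-{k}"}

async def collect(agen):
    """Consume an async generator and return (index, payload) pairs."""
    chunks = []
    i = 0
    async for j in agen:
        # In the notebook this is where torchaudio.save(...) is called.
        chunks.append((i, j["tts_speech"]))
        i += 1
    return chunks

results = asyncio.run(collect(fake_inference(3)))
print(results)
```

In real use, `fake_inference(3)` would be replaced by a call such as the `cosyvoice.inference_sft(...)` shown above.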


In [8]:
i = 0
async for j in cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', prompt_text, prompt_speech_16k, stream=False):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
    i += 1


  0%|          | 0/1 [00:00<?, ?it/s]

2025-02-25 18:07:52,486 INFO synthesis text 收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-02-25 18:07:53,997 INFO yield speech len 10.04, rtf 0.15043036871222387
100%|██████████| 1/1 [00:01<00:00,  1.89s/it]
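
The non-streaming zero-shot run appears in both sections, so the two logged `rtf` values give a rough estimate of the speedup from vLLM. The figures below are copied from the logs in this notebook. Each is a single run, and the generated audio lengths differ (11.32 s and 10.04 s), so the ratio is indicative rather than a benchmark.

```python
# RTF values copied from the two zero-shot, stream=False logs in this notebook.
baseline_rtf = 0.27304929895030317  # CosyVoice2 (default configuration)
vllm_rtf = 0.15043036871222387      # AsyncCosyVoice2 (vLLM)

speedup = baseline_rtf / vllm_rtf
print(f"{speedup:.2f}x")
```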


In [9]:
i = 0
async for j in cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', prompt_text, prompt_speech_16k, stream=True):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
    i += 1

  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 18:07:57,040 INFO synthesis text 收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-02-25 18:07:57,756 INFO yield speech len 1.84, rtf 0.38921198119287903
2025-02-25 18:07:58,321 INFO yield speech len 2.0, rtf 0.2819598913192749
2025-02-25 18:07:58,757 INFO yield speech len 2.0, rtf 0.21722662448883057
2025-02-25 18:07:59,194 INFO yield speech len 2.0, rtf 0.2178330421447754
2025-02-25 18:07:59,624 INFO yield speech len 2.0, rtf 0.2143700122833252
2025-02-25 18:07:59,894 INFO yield speech len 0.44, rtf 0.6105135787617076
100%|██████████| 1/1 [00:03<00:00,  3.23s/it]


In [10]:
def text_generator():
    yield '收到好友从远方寄来的生日礼物，'
    yield '那份意外的惊喜与深深的祝福'
    yield '让我心中充满了甜蜜的快乐，'
    yield '笑容如花儿般绽放。'
i = 0
async for j in cosyvoice.inference_zero_shot(text_generator(), prompt_text, prompt_speech_16k, stream=False):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
    i += 1


2025-02-25 18:08:04,360 INFO get tts_text generator, will skip text_normalize!
  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 18:08:04,366 INFO get tts_text generator, will return _extract_text_token_generator!
2025-02-25 18:08:04,714 INFO synthesis text <generator object text_generator at 0x71cf742b2c00>
2025-02-25 18:08:04,715 INFO not enough text token to decode, wait for more
2025-02-25 18:08:04,761 INFO get fill token, need to append more text token
2025-02-25 18:08:04,762 INFO append 5 text token
2025-02-25 18:08:04,817 INFO get fill token, need to append more text token
2025-02-25 18:08:04,818 INFO append 5 text token
2025-02-25 18:08:04,953 INFO no more text token, decode until met eos
2025-02-25 18:08:06,182 INFO yield speech len 10.96, rtf 0.13393809760574005
100%|██████████| 1/1 [00:01<00:00,  1.82s/it]


In [11]:
def text_generator():
    yield '收到好友从远方寄来的生日礼物，'
    yield '那份意外的惊喜与深深的祝福'
    yield '让我心中充满了甜蜜的快乐，'
    yield '笑容如花儿般绽放。'
i = 0
async for j in cosyvoice.inference_zero_shot(text_generator(), prompt_text, prompt_speech_16k, stream=True):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
    i += 1


2025-02-25 18:08:17,563 INFO get tts_text generator, will skip text_normalize!
  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 18:08:17,568 INFO get tts_text generator, will return _extract_text_token_generator!
2025-02-25 18:08:17,955 INFO synthesis text <generator object text_generator at 0x71cf742b3ed0>
2025-02-25 18:08:17,956 INFO not enough text token to decode, wait for more
2025-02-25 18:08:18,048 INFO get fill token, need to append more text token
2025-02-25 18:08:18,050 INFO append 5 text token
2025-02-25 18:08:18,162 INFO get fill token, need to append more text token
2025-02-25 18:08:18,162 INFO append 5 text token
2025-02-25 18:08:18,218 INFO no more text token, decode until met eos
2025-02-25 18:08:18,707 INFO yield speech len 1.84, rtf 0.40890963181205414
2025-02-25 18:08:19,173 INFO yield speech len 2.0, rtf 0.23182785511016846
2025-02-25 18:08:19,611 INFO yield speech len 2.0, rtf 0.21812725067138672
2025-02-25 18:08:20,046 INFO yield speech len 2.0, rtf 0.2168172597885

In [12]:
# instruct usage
i = 0
async for j in cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物[breath]，那份意外的惊喜与深深的祝福[breath]让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', '用四川话说这句话', prompt_speech_16k, stream=False):
    torchaudio.save('instruct2_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
    i += 1

  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 18:08:29,219 INFO synthesis text 收到好友从远方寄来的生日礼物[breath]，那份意外的惊喜与深深的祝福[breath]让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-02-25 18:08:30,561 INFO yield speech len 11.28, rtf 0.11893760227987953
100%|██████████| 1/1 [00:01<00:00,  1.71s/it]


In [14]:
# instruct usage
i = 0
async for j in cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物[breath]，那份意外的惊喜与深深的祝福[breath]让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', '用四川话说这句话', prompt_speech_16k, stream=False):
    torchaudio.save('instruct2_2{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
    i += 1

  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 18:08:44,239 INFO synthesis text 收到好友从远方寄来的生日礼物[breath]，那份意外的惊喜与深深的祝福[breath]让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-02-25 18:08:45,728 INFO yield speech len 11.16, rtf 0.13341679367967832
100%|██████████| 1/1 [00:01<00:00,  1.82s/it]


In [15]:
# instruct usage
i = 0
async for j in cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', '用四川话说这句话', prompt_speech_16k, stream=True):
    torchaudio.save('instruct2_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
    i += 1

  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 18:08:57,660 INFO synthesis text 收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-02-25 18:08:58,467 INFO yield speech len 1.84, rtf 0.43843945731287415
2025-02-25 18:08:58,944 INFO yield speech len 2.0, rtf 0.2378394603729248
2025-02-25 18:08:59,352 INFO yield speech len 2.0, rtf 0.20360076427459717
2025-02-25 18:08:59,778 INFO yield speech len 2.0, rtf 0.21184217929840088
2025-02-25 18:09:00,200 INFO yield speech len 2.0, rtf 0.21045100688934326
2025-02-25 18:09:00,504 INFO yield speech len 1.04, rtf 0.29043050912710333
100%|██████████| 1/1 [00:03<00:00,  3.17s/it]


In [None]:
i = 0
async for j in cosyvoice.inference_zero_shot_by_spk_id('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', spk_id='xiaohe', stream=False):
    torchaudio.save('instruct_tts_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
    i += 1


  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 18:07:42,805 INFO synthesis text 收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-02-25 18:07:43,990 INFO yield speech len 9.2, rtf 0.12883686501046887
100%|██████████| 1/1 [00:01<00:00,  1.21s/it]


In [23]:
i = 0
async for j in cosyvoice.inference_zero_shot_by_spk_id('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', spk_id='xiaohe', stream=True):
    torchaudio.save('instruct_tts_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
    i += 1


  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 21:34:48,174 INFO synthesis text 收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-02-25 21:34:48,840 INFO yield speech len 1.84, rtf 0.3619770640912263
2025-02-25 21:34:49,302 INFO yield speech len 2.0, rtf 0.2302844524383545
2025-02-25 21:34:49,681 INFO yield speech len 2.0, rtf 0.18872606754302979
2025-02-25 21:34:50,112 INFO yield speech len 2.0, rtf 0.21439659595489502
2025-02-25 21:34:50,548 INFO yield speech len 2.0, rtf 0.21722090244293213
2025-02-25 21:34:50,835 INFO yield speech len 1.24, rtf 0.23061229336646297
100%|██████████| 1/1 [00:02<00:00,  2.68s/it]


In [None]:
i = 0
async for j in cosyvoice.inference_instruct2_by_spk_id('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', '使用台湾话说', spk_id='xiaohe', stream=False):
    torchaudio.save('instruct_sft_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
    i += 1


  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 18:07:17,563 INFO synthesis text 收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-02-25 18:07:19,132 INFO yield speech len 11.56, rtf 0.13574085433590372
100%|██████████| 1/1 [00:01<00:00,  1.58s/it]


In [22]:
i = 0
async for j in cosyvoice.inference_instruct2_by_spk_id('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', '使用台湾话说', spk_id='xiaohe', stream=True):
    torchaudio.save('instruct_sft_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
    i += 1


  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 21:33:42,312 INFO synthesis text 收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。
2025-02-25 21:33:43,060 INFO yield speech len 1.84, rtf 0.4065176715021548
2025-02-25 21:33:43,534 INFO yield speech len 2.0, rtf 0.2366253137588501
2025-02-25 21:33:43,952 INFO yield speech len 2.0, rtf 0.20793390274047852
2025-02-25 21:33:44,377 INFO yield speech len 2.0, rtf 0.21172189712524414
2025-02-25 21:33:44,793 INFO yield speech len 2.0, rtf 0.20734083652496338
2025-02-25 21:33:45,128 INFO yield speech len 0.92, rtf 0.36169549693231995
100%|██████████| 1/1 [00:02<00:00,  2.84s/it]


In [None]:
i = 0
async for j in cosyvoice.inference_instruct2_by_spk_id('用户可能需要提交这些更改，以确保索引处于干净状态，然后才能成功添加子模块。需要提醒用户在添加子模块后提交更改。', '使用台湾话说', spk_id='xiaohe', stream=True):
    torchaudio.save('instruct_sft_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
    i += 1


  0%|          | 0/1 [00:00<?, ?it/s]2025-02-25 21:32:41,634 INFO synthesis text 用户可能需要提交这些更改，以确保索引处于干净状态，然后才能成功添加子模块。需要提醒用户在添加子模块后提交更改。
2025-02-25 21:32:42,291 INFO yield speech len 1.84, rtf 0.3568847542223723
2025-02-25 21:32:42,763 INFO yield speech len 2.0, rtf 0.2352982759475708
2025-02-25 21:32:43,111 INFO yield speech len 2.0, rtf 0.17330658435821533
2025-02-25 21:32:43,547 INFO yield speech len 2.0, rtf 0.21745896339416504
2025-02-25 21:32:43,940 INFO yield speech len 2.0, rtf 0.19589221477508545
2025-02-25 21:32:44,269 INFO yield speech len 1.4, rtf 0.23354445184980122
100%|██████████| 1/1 [00:02<00:00,  2.66s/it]
