## Clone Repo

In [1]:
%cd /content
!rm -rf sample_data ChatTTS
!git clone https://github.com/2noise/ChatTTS.git

Cloning into 'ChatTTS'...
remote: Enumerating objects: 2680, done.
remote: Counting objects: 100% (749/749), done.
remote: Compressing objects: 100% (307/307), done.
remote: Total 2680 (delta 524), reused 443 (delta 442), pack-reused 1931 (from 3)
Receiving objects: 100% (2680/2680), 8.02 MiB | 17.97 MiB/s, done.
Resolving deltas: 100% (1604/1604), done.


## Install Requirements

In [2]:
!pip install -r /content/ChatTTS/requirements.txt
!ldconfig /usr/lib64-nvidia

Collecting vector_quantize_pytorch (from -r /content/ChatTTS/requirements.txt (line 6))
  Downloading vector_quantize_pytorch-1.21.2-py3-none-any.whl.metadata (30 kB)
Collecting vocos (from -r /content/ChatTTS/requirements.txt (line 8))
  Downloading vocos-0.1.0-py3-none-any.whl.metadata (4.8 kB)
Collecting gradio (from -r /content/ChatTTS/requirements.txt (line 10))
  Downloading gradio-5.12.0-py3-none-any.whl.metadata (16 kB)
Collecting pybase16384 (from -r /content/ChatTTS/requirements.txt (line 11))
  Downloading pybase16384-0.3.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.5 kB)
Collecting pynini==2.1.5 (from -r /content/ChatTTS/requirements.txt (line 12))
  Downloading pynini-2.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.6 kB)
Collecting WeTextProcessing (from -r /content/ChatTTS/requirements.txt (line 13))
  Downloading WeTextProcessing-1.0.4.1-py3-none-any.whl.metadata (7.2 kB)
Collecting nemo_text_processing (from -r /c

## Import Packages

In [3]:
import torch

torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision("high")

from ChatTTS import ChatTTS
from ChatTTS.tools.logger import get_logger
from ChatTTS.tools.normalizer import normalizer_en_nemo_text, normalizer_zh_tn
from IPython.display import Audio

## Load Models

In [4]:
logger = get_logger("ChatTTS", format_root=True)
chat = ChatTTS.Chat(logger)

# try to load normalizer
try:
    chat.normalizer.register("en", normalizer_en_nemo_text())
except ValueError as e:
    logger.error(e)
except Exception:
    logger.warning("Package nemo_text_processing not found!")
    logger.warning(
        "Run: conda install -c conda-forge pynini=2.1.5 && pip install nemo_text_processing",
    )
try:
    chat.normalizer.register("zh", normalizer_zh_tn())
except ValueError as e:
    logger.error(e)
except Exception:
    logger.warning("Package WeTextProcessing not found!")
    logger.warning(
        "Run: conda install -c conda-forge pynini=2.1.5 && pip install WeTextProcessing",
    )

 NeMo-text-processing :: INFO     :: Creating ClassifyFst grammars.
[+0000 20250118 19:12:33] [INFO] NeMo-text-processing | tokenize_and_classify | Creating ClassifyFst grammars.
2025-01-18 19:13:09,441 WETEXT INFO found existing fst: /usr/local/lib/python3.11/dist-packages/tn/zh_tn_tagger.fst
[+0000 20250118 19:13:09] [INFO] wetext-zh_normalizer | processor | found existing fst: /usr/local/lib/python3.11/dist-packages/tn/zh_tn_tagger.fst
2025-01-18 19:13:09,444 WETEXT INFO                     /usr/local/lib/python3.11/dist-packages/tn/zh_tn_verbalizer.fst
[+0000 20250118 19:13:09] [INFO] wetext-zh_normalizer | processor |                     /usr/local/lib/python3.11/dist-packages/tn/zh_tn_verbalizer.fst
2025-01-18 19:13:09,448 WETEXT INFO skip building fst for zh_normalizer ...
[+0000 20250118 19:13:09] [INFO] wetext-zh_normalizer | processor | skip building fst for zh_normalizer ...


### There are three ways to load models

#### 1. Load models from Hugging Face (recommended)

In [5]:
# use force_redownload=True if the weights have been updated.
chat.load(source="huggingface")

[+0000 20250118 19:13:38] [INFO] ChatTTS | core | download from HF: https://huggingface.co/2Noise/ChatTTS
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Fetching 14 files:   0%|          | 0/14 [00:00<?, ?it/s]

Embed.safetensors:   0%|          | 0.00/146M [00:00<?, ?B/s]

asset/tokenizer/tokenizer.json:   0%|          | 0.00/449k [00:00<?, ?B/s]

asset/gpt/config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

Decoder.safetensors:   0%|          | 0.00/104M [00:00<?, ?B/s]

Vocos.safetensors:   0%|          | 0.00/54.3M [00:00<?, ?B/s]

DVAE.safetensors:   0%|          | 0.00/60.4M [00:00<?, ?B/s]

asset/tokenizer/special_tokens_map.json:   0%|          | 0.00/7.85k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/853M [00:00<?, ?B/s]

asset/tokenizer/tokenizer_config.json:   0%|          | 0.00/11.0k [00:00<?, ?B/s]

config/decoder.yaml:   0%|          | 0.00/117 [00:00<?, ?B/s]

config/dvae.yaml:   0%|          | 0.00/143 [00:00<?, ?B/s]

config/path.yaml:   0%|          | 0.00/309 [00:00<?, ?B/s]

config/vocos.yaml:   0%|          | 0.00/460 [00:00<?, ?B/s]

config/gpt.yaml:   0%|          | 0.00/346 [00:00<?, ?B/s]

[+0000 20250118 19:14:01] [INFO] ChatTTS | core | use device cuda:0
[+0000 20250118 19:14:02] [INFO] ChatTTS | core | vocos loaded.
[+0000 20250118 19:14:02] [INFO] ChatTTS | core | dvae loaded.
[+0000 20250118 19:14:03] [INFO] ChatTTS | core | embed loaded.
[+0000 20250118 19:14:03] [INFO] ChatTTS | core | gpt loaded.
[+0000 20250118 19:14:03] [INFO] ChatTTS | core | speaker loaded.
[+0000 20250118 19:14:03] [INFO] ChatTTS | core | decoder loaded.
[+0000 20250118 19:14:03] [INFO] ChatTTS | core | tokenizer loaded.


True

#### 2. Load models from local directories 'asset' and 'config'

In [None]:
chat.load()
# chat.load(source='local') same as above

#### 3. Load models from a custom path

In [None]:
# write the model path into custom_path
chat.load(source="custom", custom_path="YOUR CUSTOM PATH")
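The three options above can be combined into a small fallback: use the local directories when they exist, otherwise download from Hugging Face. The sketch below is only an illustration. `pick_load_kwargs` is a hypothetical helper, not part of ChatTTS, and it assumes that a usable local copy is simply one where both `asset` and `config` directories are present.

```python
from pathlib import Path


def pick_load_kwargs(root="."):
    """Return keyword arguments for chat.load() based on what is on disk.

    Assumption: a usable local copy has both an 'asset' and a 'config'
    directory under `root`, as the heading for option 2 suggests.
    """
    root = Path(root)
    if (root / "asset").is_dir() and (root / "config").is_dir():
        return {"source": "local"}
    # Otherwise fall back to downloading from Hugging Face (option 1).
    return {"source": "huggingface"}


# Usage in the notebook (needs the `chat` object created earlier):
# chat.load(**pick_load_kwargs())
```

Returning a plain dict keeps the helper independent of ChatTTS, so it can be checked without loading any weights.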

### You can also unload models to free memory

In [None]:
chat.unload()

## Inference

### Batch infer

In [6]:
texts = [
    "So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with.",
] * 3 + [
    "我觉得像我们这些写程序的人，他，我觉得多多少少可能会对开源有一种情怀在吧我觉得开源是一个很好的形式。现在其实最先进的技术掌握在一些公司的手里的话，就他们并不会轻易的开放给所有的人用。"
] * 3

wavs = chat.infer(texts)

text:   0%|          | 1/384(max) [00:01,  1.17s/it]
We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class (https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)
text:  28%|██▊       | 107/384(max) [00:03, 29.67it/s]
code:  42%|████▏     | 864/2048(max) [00:22, 39.06it/s]


In [7]:
Audio(wavs[0], rate=24_000, autoplay=True)

In [8]:
Audio(wavs[3], rate=24_000, autoplay=True)
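`Audio(...)` only plays the result inside the notebook. To keep the audio, one option is to write it to a WAV file with the standard library. This is a minimal sketch, assuming each entry of `wavs` is a one-dimensional sequence of float samples in the range [-1.0, 1.0] at the 24 kHz rate used above; `save_wav` is a hypothetical helper, not a ChatTTS function.

```python
import sys
import wave
from array import array


def save_wav(path, samples, rate=24_000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    pcm = array("h")  # signed 16-bit integers
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to avoid integer overflow
        pcm.append(int(s * 32767))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(str(path), "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes = 16 bits per sample
        f.setframerate(rate)
        f.writeframes(pcm.tobytes())
    return len(pcm)


# Usage in the notebook (assumes `wavs` from the batch inference cell):
# for i, w in enumerate(wavs):
#     save_wav(f"batch_{i}.wav", w)
```

The per-sample loop is slow for long clips; if NumPy is available, a vectorized conversion would be the more usual choice.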

### Custom params

In [9]:
params_infer_code = ChatTTS.Chat.InferCodeParams(
    prompt="[speed_5]",
    temperature=0.3,
)
params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt="[oral_2][laugh_0][break_6]",
)

wav = chat.infer(
    "四川美食可多了，有麻辣火锅、宫保鸡丁、麻婆豆腐、担担面、回锅肉、夫妻肺片等，每样都让人垂涎三尺。",
    params_refine_text=params_refine_text,
    params_infer_code=params_infer_code,
)

[+0000 20250118 19:17:36] [INFO] ChatTTS | norm | replace homophones: 涎->闲
text:  14%|█▍        | 53/384(max) [00:01, 36.07it/s]
code:  20%|█▉        | 400/2048(max) [00:09, 43.34it/s]


In [10]:
Audio(wav[0], rate=24_000, autoplay=True)
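The prompt strings used above, such as `[oral_2][laugh_0][break_6]` and `[speed_5]`, are easy to mistype. The helpers below build them from integers. They are hypothetical and not part of ChatTTS, and the upper bounds they enforce (oral 0-9, laugh 0-2, break 0-7, speed 0-9) are assumptions that should be checked against the ChatTTS documentation for your version.

```python
def refine_prompt(oral=None, laugh=None, break_=None):
    """Build a refine-text prompt such as '[oral_2][laugh_0][break_6]'."""
    # Assumed upper bounds; verify against the ChatTTS docs for your version.
    limits = {"oral": 9, "laugh": 2, "break": 7}
    values = {"oral": oral, "laugh": laugh, "break": break_}
    parts = []
    for name, value in values.items():
        if value is None:
            continue  # omit tokens the caller did not set
        if not 0 <= value <= limits[name]:
            raise ValueError(f"{name} must be in 0..{limits[name]}, got {value}")
        parts.append(f"[{name}_{value}]")
    return "".join(parts)


def speed_prompt(speed):
    """Build an infer-code prompt such as '[speed_5]' (assumed range 0..9)."""
    if not 0 <= speed <= 9:
        raise ValueError(f"speed must be in 0..9, got {speed}")
    return f"[speed_{speed}]"


# Usage in the notebook:
# params_refine_text = ChatTTS.Chat.RefineTextParams(prompt=refine_prompt(2, 0, 6))
# params_infer_code = ChatTTS.Chat.InferCodeParams(prompt=speed_prompt(5), temperature=0.3)
```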

### Fix random speaker

In [11]:
rand_spk = chat.sample_random_speaker()
print(rand_spk)  # save it for later timbre recovery

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=rand_spk,
)

wav = chat.infer(
    "四川美食确实以辣闻名，但也有不辣的选择。比如甜水面、赖汤圆、蛋烘糕、叶儿粑等，这些小吃口味温和，甜而不腻，也很受欢迎。",
    params_infer_code=params_infer_code,
)

蘁淰敥欀憋湨帅傐硥斍裋嘐篪綿材芟眣峻凧扮徥漻譯蕴檲穾搼导焨殝檵旮嵯詃莛罳聢聛椙囏梼剷摯愿傌睠柾欉彍瑌幘榌懌萝肾膓禧侮犃圓虃価寄垡梾皣咁嗬榠沁梸矁赴欌旱讄翣耎兄灧嫏晲薝府办罕胰氺晚触埏咵碬槽婍匸膩蚨籧壔列誘束勶奂夅蠍樊烖倄喳蛜槹攨潸誐澬撔欕箽姆扰礗樿蠚翖嵕瘔屩烗佼歷蚴袼坉趌浅赓壜誁懏矅蟺往盱滦豵唠滬聊爷悑帯耘幖埂屰欭然嶜婍讶渦桖灟蔓争耉珙瞇湾睼劄嗋丠蒱蛡孜惞聢宦捩胊殇訨堅谤洂熰瞃溔竵璹籂懂猊杹螬蚺礋襔沥廆缥纱怡墓港獽栢祧瞿贿貘拿尣妡膙氃氎猬嶥蕟嶺焐埘击恾岜晏示蕳蓺歉羁差炇徲悩襇脢瘘琔埗廡堚蟝导峳莰丠罢罹偓繜嬇哳螏蒈战狺呔贐塗垝譺矿蒚浉笳蒬蒦忚巂圂檄蔲榩弩榗苿誻矵藄爱癭豅桂跊翳怹刭堄噼翓囜儠瀁纴矝恽穆猤根蕨脾瑯栊朲勏玲厔碗檈袚暇請樼衆売蓽腵紏巜藼暷腬沋澚唵漲弎灞株耝侽焴薰圤泽凾擛諎椚挭癕瓁幛垁蝛诪俒欥暯营猻湺婯斬殃嬄桛诩剏嘔种筜褨笎昆焧葨姓炩和侴氦疸再旚呬腳儺寉瞚祌豰呾萲沆澟弾拐荶敚旔登燛谴膑罯曑祤厲圌裤灮暤僣幙嘾秗瀤礠懣妴膖纒娾狁謉螋裄凬姭牺罢呝裇偬护讥俇核嬈缺帚澉嶗賁墧旉懻借儈貎抹崗痄寔晾峦婘狤誑熑臐呬袇拏瞔叽刚廵舯藶埒吽筱忮秒岞炝乚薓葤徨咞澀尜嚎肄誰岁笺蕝萸罔竿覆盭諎誧癮恕衷苶嶃忇蔀戅袒慾胘劍厽盝楘覼托磧圳簮榣准瘛俬嚾兹螟蝎褁炼苳伝礯荝皳敽惉拺歇筶瞪瑛冫蒼憶荡痨窎歘葛豉導僖溶佼僷磵囥猡玫蝘瑊趎昪唂讀熎圔灁寗葍恷橏烃溢泖妤倿殎瓣倕莃葿畧婲粖蠣嚫浤袭碧嵛芵芥偈偂劭忑茢泂粛愑菼仯哇坔菧観浙襭淕梳繣犲沕攸罪岵佶謘卝煰嘰皶綯莍磦苎贴苀倨磾媊娿沪划焵曘猉胊緓埈彘緪嫍聄呮蜑猌蒭胒榘殺岔摐彑刊涭偛怠枟垸讑讹渟暑絁溠襵僪惭萝摅莉砢扯囏籃璎夶剩悪稤疓彰謦樒恴蚘捘欯吼炴蚂肘婿岆囗垒喸簬呃毺豟泸棻襐惏漆崓练巧琊晭猓桪潘峽堿圧摬緕葌牦極俹狃敮襜趝笚蛱桺佞跺嚠爨蓔狡裝眊卹慸啪凣觮柛珂兪煣蒥溰巯徘凮丶丼就橗绯憹凕瓟涎縥譜淋琦蕠膿合曵借持悾耇潁橱塄宊弴嗵毥凷翀儶徭訞赗裈众孶癹绨帕斒聓稁绅僾瞸昰椱凶讘搔膧棼苈刹幇蔒氵芗谺杧喾嶁趸掳亻憺焤繈荓凡俄嗍揶瞝讧臽諂刓緓澣潫皺啲曦伊稽永低彭斑櫏腼瓀氇冑芄脷猸柠俈吉嚔朞蛜啦寍榑劮撃僔言


text:  17%|█▋        | 65/384(max) [00:01, 47.77it/s]
code:  22%|██▏       | 459/2048(max) [00:10, 43.47it/s]


In [12]:
Audio(wav[0], rate=24_000, autoplay=True)
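The printed speaker value is long and easy to lose in notebook output, so it can be convenient to store it in a file instead. This sketch assumes the value returned by `chat.sample_random_speaker()` is a plain string, as the printed output above suggests; `save_speaker` and `load_speaker` are hypothetical helpers, not ChatTTS functions.

```python
from pathlib import Path


def save_speaker(path, spk):
    """Store a speaker string as UTF-8 text."""
    Path(path).write_text(spk, encoding="utf-8")


def load_speaker(path):
    """Read back a speaker string saved with save_speaker()."""
    return Path(path).read_text(encoding="utf-8")


# Usage in the notebook (assumes `rand_spk` from the cell above):
# save_speaker("speaker.txt", rand_spk)
# params_infer_code = ChatTTS.Chat.InferCodeParams(spk_emb=load_speaker("speaker.txt"))
```

Fixing the encoding to UTF-8 matters here because the string consists almost entirely of non-ASCII characters.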

### Zero shot (simulate speaker)

In [14]:
import os
os.listdir("./")

['.config', 'lyf2_8.wav', 'ChatTTS']

In [47]:
from ChatTTS.tools.audio import load_audio

spk_smp = chat.sample_audio_speaker(load_audio("lyf2_9.mp3", 24000))
print(spk_smp)  # save it in order to load the speaker without sample audio next time

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_smp=spk_smp,
    txt_smp="所以，平时挺没正经的，有时候可以懒觉睡到很晚，要工作起来也是拼命，就是什么都不管不顾",
)

wav = chat.infer(
    "四川美食确实以辣闻名，但也有不辣的选择。比如甜水面、赖汤圆、蛋烘糕、叶儿粑等，这些小吃口味温和，甜而不腻，也很受欢迎。",
    params_infer_code=params_infer_code,
)

伀坰喀媷侻提佲乃弄垇艷榞羆聙奛彛訾柰俊毊絼涐哦豌嗸垴仺佁檿衱垵罇扞亶虺嘳蓄囍覴菳崯華拢缮橮唪瀋芪稯湙蒩店杻菻菘佂岺琨唼噦捡艚溋叱廚矧穄祱薭喘乄幇摭耠娽元繴菼垶渂埗稆慤蝅寊櫾肙葯玡墄哮涹芼贰伎翟緁穒洡愻稫嵘催洳磚祫伾櫁儀坨碥妤蚻汯愉楿曞腄摬澃埭聎菕憍起玬浈茍瀔芽扐袮哈皪蓹谧荈瞆嫸廱棫歾犂唏煭俺縯冟庐谙枒菁卒燴卂嵿枤军搖乷疛椞翍攪腰赚畢罒畠狵混幉梃綇摎筍栧蜆竉蓉聄眛豂剛崋祥桹冰綘將贼孾肈槭砌贼姹窺匒琇譈河戤修虎暹浢蒮嫊救盅掋篡樧芄祏昱悏抨秷纰潈脫罣贛舲惍浡礟展砫裓訠臾羊吧仄灴嶵偊蓋拉庢跰堦枳瓔席緊禞杚璎扶扛冾攱溅泅蚝斫刍溣茈敗祚毒半櫓虇蠛澮梄屘壪哤蒚棊濲裼蜷昌菐島帖緉愑估猠絴蔴煤捳藱滹挦檷噰敦椨拫桿疎毘嵊币壶枈弆炮掂斑瘎綁倓捱亻体煂圫惹购豄将勐裷稦嫉伸虠跒荐楊砤垞狜濌瀛佄橵焼労儗秖檫絘笼燇豅尌峪茙薐县慏燽腗涕挜貚澃婿幓濪窿譧狘蛉簁傁嵚儽姩悩殯壩媶枦叀猎蘋管珂砞刘枑滘緇胑褌糫柇動書誷賏俆桷跐罚壄觏焪裌憷篿汬厳緄把觲劼砸榻獃砆吰睸狝昹祁呻茳煬脌忦崩挀諛莈谼哏刟猃汘胓諴秏仛嘟耺肉爎山紻便嘦楜灝竪泿袙緭冦帋挅穠噀莤嵆翨跪嵯揳趤庿纗囆棕咊摩证螔渚嫼咒仱岈殌笿惥囮耭惉夙蜍笁簸溱惔澺墛惤岯橑臫媐痁哨舲窈瑉徎挚嶦巡澜庒刣萎蟏攼疰纉觃吆伥坎瀮灶昮珂唟楻唐烆粳譽癬枪桶塵硔笓搲慌糝斍葚坛潂绗贿箿庳曳扑昇忺単碦赜瀦壋櫀冹胦螥賙怏猁界臭兙蕘市嵐篽弪熪億懶恳榙翁姢甏嵉褫槢昇渗猛纂夒舘討謍慷嘂撛懬艀緗祥獝赍蛀絣篯悱洂庁萂暕芕竪牀琠凰昉豅磎蒝箤倕藚潀憛貿抲僐坚諛谸瘑樼胠诎繄絚忝埯丰嫪搡订反煵峇衏楶稃裎峨涉貭拒籶们俖昷湧堃尌岙浩艁岨璦沯厷垕捱冈玌晥剼卮蝾冧愚瘘商翊詄刁皉焱愂捂皻寙烥喏哷碴唄氜壆炱趭愾茧瘫棽聿坈赮嶌讻峐怗归貕嗸瑠狒瀳圽儨谧娿摃櫎偩杀芙縒傌叁広窸毷袗俼緝翲楍薥舽嵻诣紭肌悵潺篕診乕棵婧烢嘄睽名糛訧蚌膋旟欌莈佢澊滔翑潃暺勵炒蛫愜舨溭篰怀瓮狛蓷崹兦撒焥舋甛浤椇埸膒狰呫稒籰圯椉崫伊跉恻眚墝綉瓵蝓蔌燾璉礦摹孳搒歬浵枓嗷孲峒綀甖冨沘卡嚦畦笗冴盳疩萡癀堢爏凟捐窖殛襱楉脄胚擘凱桩吒垷祈縑茣殳仑皆詏洗亪藽亚悰熨砓聩業缶書乯互裶昡牽楿呆填怋怂譙勰甡址耕磶茨覤杵犰纙无褀维小贚瞤僳膣峂朷疋痟昷傻蛶糓緝婂犬莨攋磷竄幏瑴樷廂薤矸苺犉梤岎涄岒炑秆塔儑伡嵒衅堭偔摮檷疸旇暤瀀设蚕疯尝涅瓒紀坈滈媐膒牂篧蘟懞疝菏蕴粦詈册渽涢嶠嘦櫼璧箯梆谆冘岗檪衕嬖式觲谒涞蝃伖解惐糲芲則谋燷俙揋仺犚愪弱檥趃奃料做裔瑑篹橍渰瓋簑樐衹圧掐喼纛街宪未涽

text:  18%|█▊        | 69/384(max) [00:02, 34.41it/s]
code:  21%|██▏       | 438/2048(max) [00:09, 48.16it/s]


In [48]:
Audio(wav[0], rate=24_000, autoplay=True)

### Two stage control

In [30]:
text = "So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with."
refined_text = chat.infer(text, refine_text_only=True)
refined_text

text:  22%|██▏       | 83/384(max) [00:01, 48.75it/s]


['so [uv_break] we found [uv_break] being competitive and collaborative [uv_break] was a huge [uv_break] way of staying motivated towards our goals, [uv_break] so one person to call when you fall off, [uv_break] one person who gets you back on then one person to actually do the activity with.']

In [31]:
wav = chat.infer(refined_text, skip_refine_text=True)

code:  34%|███▍      | 701/2048(max) [00:14, 47.13it/s]


In [32]:
Audio(wav[0], rate=24_000, autoplay=True)
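The point of running the two stages separately is that the refined text can be edited before synthesis. As one illustration, the hypothetical helper below keeps only the first few `[uv_break]` tokens and drops the rest. It relies solely on the token spelling visible in the output above and is not part of ChatTTS.

```python
import re


def limit_breaks(text, max_breaks):
    """Keep the first `max_breaks` [uv_break] tokens and drop the rest."""
    count = 0

    def repl(match):
        nonlocal count
        count += 1
        return match.group(0) if count <= max_breaks else ""

    out = re.sub(r"\[uv_break\]", repl, text)
    # Removing a token leaves two adjacent spaces; collapse them.
    return re.sub(r" {2,}", " ", out).strip()


# Usage in the notebook (assumes `refined_text` from the first stage):
# edited = [limit_breaks(t, 3) for t in refined_text]
# wav = chat.infer(edited, skip_refine_text=True)
```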

## LLM Call

In [33]:
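import os

from ChatTTS.tools.llm import ChatOpenAI

# Read the key from the environment rather than hard-coding a secret in the notebook.
# The variable name DEEPSEEK_API_KEY is an assumption; use whatever name you set.
API_KEY = os.environ["DEEPSEEK_API_KEY"]
client = ChatOpenAI(
    api_key=API_KEY, base_url="https://api.deepseek.com", model="deepseek-chat"
)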

In [34]:
user_question = "四川有哪些好吃的美食呢?"

In [35]:
text = client.call(user_question, prompt_version="deepseek")
text

'四川的美食可多啦，有麻辣鲜香的火锅，香辣可口的回锅肉，还有让人垂涎的担担面。别忘了还有麻婆豆腐和夫妻肺片，每一道都让人回味无穷。'

In [42]:
text = client.call(text, prompt_version="deepseek_TN")
text

'四川的美食可多啦，有麻辣鲜香的火锅，香辣可口的回锅肉，还有让人垂涎的担担面。别忘了还有麻婆豆腐和夫妻肺片，每一道都让人回味无穷。'

In [51]:
wav = chat.infer(text, params_infer_code=params_infer_code)

[+0000 20250118 19:38:59] [INFO] ChatTTS | norm | replace homophones: 涎->闲
text:  19%|█▉        | 72/384(max) [00:01, 49.58it/s]
code:  23%|██▎       | 476/2048(max) [00:10, 45.73it/s]


In [52]:
Audio(wav[0], rate=24_000, autoplay=True)