Multi GPU (chatterbox instance) support #47

Mikec78660 · 2025-10-15T22:56:42Z

Really love your project. I have it running as the backend audio for Open WebUI and it works great. The only issue I have, probably because of my ancient hardware, is generation time. This has nothing to do with your project, but I was hoping there was a way to wrap your API around two instances of chatterbox, either with a round-robin approach or, even better, by checking the utilization of the GPUs and submitting the next request to the one with the lowest.
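To make the two selection strategies concrete, here is a minimal sketch of a backend picker. It is only an illustration under assumptions: the backend URLs and ports are hypothetical, it assumes one chatterbox instance per GPU, and it reads utilization with `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits`. Instantaneous utilization is a noisy signal, so this is a starting point rather than a tuned scheduler.

```python
import itertools
import subprocess

# Hypothetical mapping: GPU index -> URL of the chatterbox instance on that GPU.
BACKENDS = {0: "http://localhost:4123", 1: "http://localhost:4124"}


def parse_utilization(smi_output: str) -> dict[int, int]:
    """Parse 'index, utilization' CSV lines into {gpu_index: percent}."""
    result = {}
    for line in smi_output.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        result[int(index)] = int(util)
    return result


def query_utilization() -> dict[int, int]:
    """Ask nvidia-smi for the current utilization of every visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_utilization(out)


def pick_least_utilized(utilization: dict[int, int],
                        backends: dict[int, str]) -> str:
    """Return the backend whose GPU reports the lowest utilization.

    GPUs missing from the report are treated as fully busy (100%).
    """
    gpu = min(backends, key=lambda i: utilization.get(i, 100))
    return backends[gpu]


# Round robin: simply cycle through the backends in a fixed order.
round_robin = itertools.cycle(BACKENDS.values())
```

A wrapper would call `next(round_robin)` or `pick_least_utilized(query_utilization(), BACKENDS)` once per incoming TTS request and forward the request to the returned URL.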

Basically, what I have is a P40 running ollama and two P4s that I can use for TTS. With punctuation parsing, sometimes the P4 hasn't finished a block before the speech for the previous one has finished playing. This is especially true if the user has set the audio rate to 1.25. The response is still pretty good with my setup: the P40 is really quick at around 80 tokens/s, and the P4 starts processing the first sentence while the P40 is still generating the response. I can put ollama and chatterbox both on the P40 and there are no pauses in the audio response, but the problem is that it waits until the entire text response has been generated before it starts generating the audio. It just feels much more responsive using multiple GPUs. It would be great if the audio generation could be split between both P4 GPUs. I understand there is probably a very small use case for this.
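The timing problem can be written as a rough steady-state check. This is a simplified model, not a measurement: it assumes sentences of similar length, text arriving faster than it is spoken, and sentences alternating evenly across GPUs. `rtf` (real-time factor, seconds of compute per second of audio at normal speed) is a name introduced here, and the numbers in the usage note are made up.

```python
def gap_free(rtf: float, playback_rate: float = 1.0, workers: int = 1) -> bool:
    """Return True if synthesis keeps ahead of playback in steady state.

    One second of audio takes `rtf` seconds to generate and
    1 / playback_rate seconds to play. With `workers` GPUs taking turns,
    each GPU has `workers` playback slots to finish its sentence, so the
    condition is rtf <= workers / playback_rate.
    """
    return rtf * playback_rate <= workers
```

For example, with a hypothetical `rtf` of 0.9, `gap_free(0.9)` holds at normal speed, `gap_free(0.9, playback_rate=1.25)` fails on a single GPU, and `gap_free(0.9, playback_rate=1.25, workers=2)` holds again with two.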

travisvn · 2025-10-16T00:10:51Z

Maintainer

@Mikec78660 I don't own enough hardware to even test a multi-GPU setup, but anyone is free to contribute to the project if they want, and I'll review the PR.

As for a round robin approach, I may be misunderstanding the premise here, but I don't think it would make sense for this use case. The longest wait time for generative AI in this space tends to be the initial model loading.

And given that RAM tends to be the inhibiting factor here, offloading that data and trying to load a different model (be it Chatterbox for TTS or an LLM via Ollama) to try and reach a maximum for performance would fall flat. I may be misunderstanding the scenario though.

2 replies

Mikec78660 Oct 16, 2025
Author

Yes, so picture 3 GPUs:

Tesla P40 = GPU0
Tesla P4 = GPU1
Tesla P4 = GPU2

Two Proxmox LXCs: one with the P40 passed through, running Ollama, and one with both P4s passed through, running chatterbox (chatterbox-api). And a third LXC running Open WebUI. These could be Docker containers or VMs or whatever, it doesn't matter. LXCs are just what I am using.

Ollama uses the P40 with, say, gpt-oss loaded. Both P4s have the chatterbox models loaded. So the query goes to the P40 for inference, and the P40 starts spitting out a response. The first sentence is sent to the chatterbox API to be converted to speech on GPU1. When the second sentence is done, it is sent to GPU2. When the first sentence finishes processing on GPU1, the audio is sent back to Open WebUI and speech starts. The third sentence is sent to GPU1. The second sentence finishes processing on GPU2 and is sent to Open WebUI, and on and on until all the output from Ollama is processed.
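The flow above can be sketched roughly like this. It is a minimal illustration, not how chatterbox-api works: `synthesize` is a hypothetical callable standing in for an HTTP request to one chatterbox instance, and the backend names are placeholders. Sentences alternate between the backends, each backend handles one sentence at a time, and audio is yielded in the original sentence order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, Sequence


def speak_in_order(
    sentences: Iterable[str],
    backends: Sequence[str],
    synthesize: Callable[[str, str], bytes],  # hypothetical: (backend, text) -> audio
) -> Iterator[bytes]:
    """Alternate sentences across backends and yield audio in sentence order."""
    # One single-threaded pool per backend, so each GPU works on one sentence
    # at a time while the other GPU works on the next sentence in parallel.
    pools = [ThreadPoolExecutor(max_workers=1) for _ in backends]
    try:
        futures = [
            pools[i % len(backends)].submit(
                synthesize, backends[i % len(backends)], sentence
            )
            for i, sentence in enumerate(sentences)
        ]
        # Waiting on the futures in submission order keeps playback in order
        # even if a later sentence finishes first.
        for future in futures:
            yield future.result()
    finally:
        for pool in pools:
            pool.shutdown(wait=False)
```

Note that this version submits every sentence before yielding the first audio chunk, so with a streaming LLM response the submission loop would need to run alongside playback instead.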

So there isn't any model loading/unloading except on the initial load. I know the problem could be fixed by just replacing my 2 P4s with one T4 or something else that could process the audio fast enough. But it seems like this could be useful if you are running a front end as a service and have multiple responses to convert to speech, for example because you are running ollama with OLLAMA_NUM_PARALLEL, so you might be sending multiple TTS tasks to chatterbox at once.

But I understand if this is out of scope for your project. I really like it a lot. Simple to spin up and works really well! Thank you so much for creating it.

EDIT: So, funny thing. Before this came along I was using openai-edge-tts because I couldn't find a really good-sounding TTS model that could be run locally. And I just realized that was your project too :-)

Cyb4Black Dec 5, 2025

For what you are describing, you would go for a load-balancing approach between two identical deployments. That should be handled by the networking layer, not the app itself.
E.g. setting up a Kubernetes deployment with deployment scaling set to 2 behind a single service endpoint would easily solve this.
