Multi GPU (chatterbox instance) support #47
Replies: 1 comment 2 replies
@Mikec78660 I don't own enough hardware to test a multi-GPU setup, but anyone is free to contribute to the project, and I'll review the PR. As for a round-robin approach, I may be misunderstanding the premise, but I don't think it would make sense for this use case. The longest wait in generative AI of this kind tends to be the initial model load. Since RAM tends to be the limiting factor, unloading one model and loading a different one (be it Chatterbox for TTS or an LLM via Ollama) to squeeze out maximum performance would fall flat. Let me know if I've misread the scenario, though.
Really love your project. I have it running as the backend audio for Open WebUI and it works great. The only issue I have, probably because of my ancient hardware, is generation time. This has nothing to do with your project, but I was hoping there was a way to wrap your API around two instances of Chatterbox, either with a round-robin approach or, even better, by checking the utilization of the GPUs and submitting the next request to the one with the lowest.
Basically, what I have is a P40 running Ollama and two P4s that I can use for TTS. With punctuation parsing, sometimes the P4 hasn't finished a block before the speech for the previous one has finished playing. This is especially true if the user has set the audio rate to 1.25. The response is still pretty good with my setup. The P40 is really quick at around 80 tokens/s, and the P4 starts processing the first sentence while the P40 is still generating the response. I can put Ollama and Chatterbox both on the P40 and there are no pauses in the audio response, but the problem is that it waits until the entire text response has been generated before it starts generating the audio. It just feels much more responsive using multiple GPUs. It would be great if the audio generation could be split between both P4 GPUs. I understand there is probably a very small use case for this.
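To make the request concrete, here is a minimal sketch of the two selection strategies mentioned above. It is not part of the project. It assumes two separate Chatterbox API instances are already running, one per P4, and the URLs and the `BackendPool` name are hypothetical placeholders. As a stand-in for real GPU utilization, it counts in-flight requests per backend, which is an assumption made to keep the example free of driver-specific calls.

```python
import itertools
import threading
from contextlib import contextmanager


class BackendPool:
    """Chooses a TTS backend per request (hypothetical helper, not project code)."""

    def __init__(self, urls):
        if not urls:
            raise ValueError("need at least one backend URL")
        self._urls = list(urls)
        # In-flight request count per backend, used as a proxy for GPU load.
        self._in_flight = {url: 0 for url in self._urls}
        self._cycle = itertools.cycle(self._urls)
        self._lock = threading.Lock()

    def next_round_robin(self):
        """Return backends in strict rotation, ignoring load."""
        with self._lock:
            return next(self._cycle)

    def least_busy(self):
        """Return the backend with the fewest in-flight requests."""
        with self._lock:
            return min(self._urls, key=lambda u: self._in_flight[u])

    @contextmanager
    def lease(self):
        """Reserve the least busy backend for the duration of one request."""
        with self._lock:
            url = min(self._urls, key=lambda u: self._in_flight[u])
            self._in_flight[url] += 1
        try:
            yield url
        finally:
            with self._lock:
                self._in_flight[url] -= 1


if __name__ == "__main__":
    # Placeholder addresses; substitute wherever each instance actually listens.
    pool = BackendPool(["http://gpu0:4123", "http://gpu1:4123"])
    with pool.lease() as url:
        print("would send the next sentence to", url)
```

A wrapper would forward each sentence to the leased URL inside the `with` block, so a P4 that is still busy with one block is skipped in favor of the idle one. Both models stay loaded the whole time, so nothing is swapped in or out of memory between requests.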