Clients with slow networks blocking transcription cluster resources #1561

Closed
lfcnassif opened this issue Mar 2, 2023 · 10 comments

@lfcnassif
Member

lfcnassif commented Mar 2, 2023

Currently, WAV transfer and transcription are synchronous and part of the same job, and the number of jobs per node is 2x the number of GPUs. A colleague with a very slow connection ran some tests: everything worked correctly, but WAV transfer took much longer than the transcription itself. I then realized that the service nodes were blocked for too long waiting for WAV transfers and refused new connections from other clients while the GPUs sat idle. We should make WAV transfer and transcription asynchronous with respect to each other. A simple workaround would be to allow a higher number of simultaneous client connections for WAV transfer and to use a Semaphore to restrict the number of simultaneous transcriptions to the current number of jobs per node.

@lfcnassif lfcnassif self-assigned this Mar 2, 2023
@wladimirleite
Member

@lfcnassif, I am sorry if you already considered this, but what about sending the original audio files instead?

I made a quick measurement with ~50K audio files collected from different cases and formats. Most of them (~90%) are OPUS.
The average original file size was ~70 KB, while the average transmitted size was ~850 KB, which is ~12x larger.
Looking at the code, it isn't a simple change, but it may be a significant improvement for slow network connections.

By the way, this is an additional comment about audio transcription, independent of the enhancement described in this issue.

@lfcnassif
Member Author

Hi @tc-wleite. Actually, we started out sending WAVs (aiming to distribute part of the job, the WAV conversion, to the clients), switched to your suggestion (because of bandwidth usage concerns), then rolled back after we looked at the stats and saw that the cluster was, surprisingly, spending half of its time just on WAV conversion. After the rollback, the cluster became 2x faster.

But WAV conversion is single-threaded (using mplayer), while transcription uses almost half a physical CPU (surprisingly, it needs the CPU as well as the GPU). The issue is that the two steps were sequential. My changes will make audio transmission and transcription run more or less in parallel. Because of that, converting to WAV on the server again (in parallel with transcription) might no longer cause that bottleneck. I'm just testing it... The code change is simple: I didn't throw away the previous logic, it is just disabled.
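To illustrate why running the two steps in parallel can remove the bottleneck, here is a small two-stage pipeline sketch in Python. It is only an assumption about the general shape of such a design, not the service's real implementation: `convert_to_wav`, `transcribe`, `run_pipeline` and the worker counts are hypothetical placeholders. Converter threads feed a queue that transcriber threads consume, so the next file is converted while the current one is transcribed.

```python
import queue
import threading


def convert_to_wav(original):
    """Hypothetical stand-in for the single-threaded WAV conversion."""
    return original + ".wav"


def transcribe(wav):
    """Hypothetical stand-in for the GPU transcription step."""
    return "text:" + wav


def run_pipeline(files, n_converters=4, n_transcribers=2):
    to_convert = queue.Queue()
    to_transcribe = queue.Queue(maxsize=8)  # bounded: limits queued WAVs
    results = {}
    results_lock = threading.Lock()

    def converter():
        while True:
            name = to_convert.get()
            if name is None:  # sentinel: no more files
                break
            to_transcribe.put((name, convert_to_wav(name)))

    def transcriber():
        while True:
            item = to_transcribe.get()
            if item is None:  # sentinel: conversion stage finished
                break
            name, wav = item
            text = transcribe(wav)
            with results_lock:
                results[name] = text

    converters = [threading.Thread(target=converter)
                  for _ in range(n_converters)]
    transcribers = [threading.Thread(target=transcriber)
                    for _ in range(n_transcribers)]
    for t in converters + transcribers:
        t.start()

    for name in files:
        to_convert.put(name)
    for _ in converters:
        to_convert.put(None)
    for t in converters:
        t.join()

    # All conversions are queued; now tell the transcribers to stop.
    for _ in transcribers:
        to_transcribe.put(None)
    for t in transcribers:
        t.join()
    return results
```

Calling `run_pipeline(["a.opus", "b.aac"])` returns a dict mapping each input name to its placeholder transcription. The bounded queue keeps the conversion stage from running arbitrarily far ahead of transcription.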

@wladimirleite
Member

I see. Sorry, I could have guessed that you had already tried that.

@lfcnassif
Member Author

lfcnassif commented Mar 2, 2023

You're welcome @tc-wleite, please continue to share your ideas. The commit above fixed slow clients blocking cluster resources. I have already updated 4 of the 6 nodes.

After I finish, I'll change the code to convert to WAV on the server side again, unplug 1 node from the cluster and run some tests to see whether the WAV conversion overhead has decreased.

@lfcnassif
Member Author

lfcnassif commented Mar 2, 2023

Just tested, and the numbers look great! I used a small aac/opus dataset and ran each configuration twice (the first run as warm-up).

  • Converting to WAV on client side, first execution took 26s and second 23s (just AudioTranscriptTask). Below are the cluster stats after the 2 executions:
    image

  • Converting to WAV on server side, first execution took 27s and second 24s (just AudioTranscriptTask). Below are the cluster stats after the 2 executions:
    image

Those server times don't seem correct, but the relative values should be OK, so the overhead is about 4-5%. And for a fast client near the cluster, the impact was minimal. So I'll apply the change and update the cluster nodes. Thanks @tc-wleite for bringing this idea up again!
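As a quick check of the overhead figure, the relative difference can be computed from the AudioTranscriptTask times quoted above (26s vs 27s and 23s vs 24s). The helper name `relative_overhead` is just illustrative. Since the reported times are rounded to whole seconds, the result is only approximate.

```python
def relative_overhead(client_side_s, server_side_s):
    """Extra time of server-side conversion relative to client-side."""
    return (server_side_s - client_side_s) / client_side_s


first = relative_overhead(26, 27)   # first (warm-up) execution
second = relative_overhead(23, 24)  # second execution
print(f"first run: {first:.1%}, second run: {second:.1%}")
# prints "first run: 3.8%, second run: 4.3%"
```

With the rounded times this gives roughly 4%, close to the figure stated above.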

@lfcnassif
Member Author

PS1: Each audio in this dataset is duplicated, so the client-side total time should be higher for a similar dataset with unique audios.

PS2: I'll fix the stats on the server side; they should be changed after commit c7dc056. The real WAV conversion time will be a bit difficult to compute now.

@lfcnassif
Member Author

lfcnassif commented Mar 2, 2023

@tc-wleite, just for your reference, the previous wav conversion approach was changed here: #1400

@lfcnassif
Member Author

lfcnassif commented Mar 2, 2023

Now times are better, with just unique audios:
image

The WAV real time is still wrong, and I will disable this stat for now. I have 48 processing threads (and the worker node accepts up to 128 simultaneous WAV conversions for now, which is a bit high, but it uses the same number and logic as the parallel audio transfers), so in this specific test we just need to divide the WAV CPU time by 48, which gives a WAV conversion real time of about 1s. But with a variable number of clients sending requests, that would be a bit hard to compute, and I don't think it is worth it.
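The estimate described above amounts to dividing the accumulated CPU time by the number of parallel workers. A minimal sketch, assuming all workers stay busy for the whole run (which is why it only holds for this single-client test); the function name and the 48 CPU-seconds input are illustrative, the latter chosen to match the roughly 1s result mentioned above:

```python
def estimated_wall_time(total_cpu_seconds, parallel_workers):
    """Rough wall-clock estimate, assuming every worker is always busy."""
    if parallel_workers <= 0:
        raise ValueError("parallel_workers must be positive")
    return total_cpu_seconds / parallel_workers


# Illustrative input: 48 CPU-seconds spread over 48 processing threads.
wav_real_time = estimated_wall_time(48.0, 48)
print(wav_real_time)  # prints 1.0
```

With a variable number of clients the divisor is no longer a fixed 48, so this simple formula stops being valid.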

@lfcnassif lfcnassif changed the title from "Improve transcription service to don't be blocked by clients with slow networks" to "Improve transcription service to receive audios in parallel and convert them to WAV on server side again" Mar 2, 2023
@lfcnassif
Member Author

Closed by c7dc056 and a70d02e

@lfcnassif
Member Author

Cluster nodes updated again. Clients should use a snapshot version to stop converting to WAV on their side.
