Clients with slow networks blocking transcription cluster resources #1561

lfcnassif · 2023-03-02T15:20:37Z

Currently, the WAV transfer and the transcription itself are synchronous and part of the same job, and the number of jobs per node is 2x the number of GPUs. A colleague with a very slow connection ran some tests: everything worked correctly, but the WAV transfer took much longer than the transcription itself. I then realized that the service nodes were blocked for too long waiting for WAV transfers and were refusing new connections from other clients while the GPUs sat idle. We should make WAV transfer and transcription asynchronous with respect to each other. A simple workaround would be to allow a higher number of simultaneous client connections for WAV transfer and to use a Semaphore to restrict the number of simultaneous transcriptions to the current number of jobs per node.
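To make the proposed workaround concrete, here is a minimal sketch, not the project's actual code. It assumes a hypothetical `TranscriptionGate` class: many connection threads may receive audio at once, but a `java.util.concurrent.Semaphore` caps how many of them run the transcription step at the same time. The class name, the `simulate` helper and the 20 ms sleep standing in for GPU work are all illustrative assumptions.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: names are illustrative, not the real service classes. */
class TranscriptionGate {
    private final Semaphore transcriptionSlots;
    private final AtomicInteger running = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    TranscriptionGate(int jobsPerNode) {
        // One permit per allowed simultaneous transcription (e.g. 2x the number of GPUs).
        this.transcriptionSlots = new Semaphore(jobsPerNode);
    }

    /** Called by a connection thread only after its audio has been fully received. */
    String transcribe(byte[] audio, Function<byte[], String> transcriber) throws InterruptedException {
        transcriptionSlots.acquire(); // slow transfers never hold a permit
        try {
            peak.accumulateAndGet(running.incrementAndGet(), Math::max);
            return transcriber.apply(audio);
        } finally {
            running.decrementAndGet();
            transcriptionSlots.release();
        }
    }

    int peakConcurrency() {
        return peak.get();
    }

    /** Simulates many clients sharing few transcription slots; returns the observed peak. */
    static int simulate(int clients, int jobsPerNode) {
        TranscriptionGate gate = new TranscriptionGate(jobsPerNode);
        ExecutorService connections = Executors.newFixedThreadPool(clients);
        for (int i = 0; i < clients; i++) {
            connections.submit(() -> {
                byte[] wav = new byte[16]; // stands in for a (possibly slow) WAV transfer
                return gate.transcribe(wav, data -> {
                    try {
                        Thread.sleep(20); // stands in for the GPU transcription
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    return "text";
                });
            });
        }
        connections.shutdown();
        try {
            connections.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return gate.peakConcurrency();
    }

    public static void main(String[] args) {
        System.out.println("peak simultaneous transcriptions: " + simulate(16, 2));
    }
}
```

With 16 simulated clients and 2 permits, the observed peak never exceeds 2, which is the property the workaround relies on: the connection limit and the transcription limit become independent knobs.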

wladimirleite · 2023-03-02T17:50:37Z

@lfcnassif, I am sorry if you already considered this, but what about sending the original audio files instead?

I took a quick measurement with ~50 K audio files collected from different cases and formats. Most of them (~90%) are OPUS.
The average original file size was ~70 KB, while the average transmitted size was ~850 KB, which is ~12x more.
Looking at the code, it isn't a simple change, but it may be a significant improvement for slow network connections.

By the way, this is an additional comment about audio transcription, independent of the enhancement described in this issue.

lfcnassif · 2023-03-02T18:06:34Z

> @lfcnassif, I am sorry if you already considered this, but what about sending the original audio files instead?
>
> I made a quick measure with ~50 K audio files collected from different cases and formats. Most of them (~90%) are OPUS. The average original file size was ~70 KB, while the transmitted average was ~850 KB, which is ~12x more. Looking at the code, it isn't a simple change, but may be a significant improvement for slow network connections.
>
> By the way, this is an additional comment about audio transcription, independent of the enhancement described in this issue.

Hi @tc-wleite. Actually, we started by sending WAVs (aiming to distribute part of the job, the WAV conversion), switched to your suggestion (because of bandwidth usage concerns), then rolled back after we looked at the stats and saw that the cluster was, surprisingly, spending half of its time just on WAV conversion. After the rollback, the cluster became 2x faster.

But WAV conversion is single-threaded (using mplayer), while transcription uses almost half a physical CPU (surprisingly, it needs both the CPU and the GPU). The issue was that they ran sequentially. My changes will make audio transmission and transcription roughly parallel. Because of that, converting to WAV on the server again (in parallel with transcription) might no longer have that bottleneck; I'm just testing it... The code change is simple: I didn't throw away the previous logic, it is just disabled.
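As an illustration of why server-side conversion need not stall transcription once the two stages stop running sequentially, here is a rough two-stage pipeline sketch. Everything in it is an assumption made for the example: the `ServerPipeline` class, the pool sizes, and the placeholder `convertToWav` and `transcribe` methods (the real service calls an external converter and a GPU model instead).

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

/** Hypothetical sketch of a convert-then-transcribe pipeline with separate pools. */
class ServerPipeline {
    private final ExecutorService convertPool;    // CPU-bound, single-threaded conversions
    private final ExecutorService transcribePool; // limited by GPU capacity

    ServerPipeline(int convertThreads, int transcribeJobs) {
        this.convertPool = Executors.newFixedThreadPool(convertThreads);
        this.transcribePool = Executors.newFixedThreadPool(transcribeJobs);
    }

    /** Conversion of one audio can overlap with transcription of another. */
    CompletableFuture<String> submit(String audioName) {
        return CompletableFuture
                .supplyAsync(() -> convertToWav(audioName), convertPool)
                .thenApplyAsync(ServerPipeline::transcribe, transcribePool);
    }

    // Placeholder for the external WAV converter.
    static String convertToWav(String audioName) {
        return audioName + ".wav";
    }

    // Placeholder for the GPU transcription step.
    static String transcribe(String wavName) {
        return "text(" + wavName + ")";
    }

    void shutdown() {
        convertPool.shutdown();
        transcribePool.shutdown();
    }

    /** Runs a batch through the pipeline and returns results in submission order. */
    static List<String> runAll(List<String> audioNames) {
        ServerPipeline pipeline = new ServerPipeline(4, 2);
        try {
            List<CompletableFuture<String>> futures = audioNames.stream()
                    .map(pipeline::submit)
                    .collect(Collectors.toList());
            return futures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            pipeline.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(java.util.Arrays.asList("a.opus", "b.aac")));
    }
}
```

Keeping the two stages on separate pools means a transcription slot is only occupied once a WAV is ready, so conversion time is hidden behind the transcription of other audios rather than added to it.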

wladimirleite · 2023-03-02T18:14:30Z

I see. Sorry, I could have guessed that you had already tried that.

lfcnassif · 2023-03-02T20:03:02Z

You're welcome @tc-wleite, please continue to share your ideas. The commit above fixed slow clients blocking cluster resources. I have already updated 4 of the 6 nodes.

After I finish, I'll change the code to convert to WAV on the service side again, unplug 1 node from the cluster, and run some tests to see whether the WAV conversion overhead has decreased.

lfcnassif · 2023-03-02T21:33:44Z

Just tested, and the numbers look great! I used a small aac/opus dataset, running each configuration twice (the first run as a warm-up).

Converting to WAV on client side, first execution took 26s and second 23s (just AudioTranscriptTask). Below are the cluster stats after the 2 executions:

Converting to WAV on server side, first execution took 27s and second 24s (just AudioTranscriptTask). Below are the cluster stats after the 2 executions:

Those server times don't seem correct, but the relative values should be OK, so the overhead is about 4-5%. For a fast client near the cluster, the impact was minimal. So I'll apply the change and update the cluster nodes. Thanks @tc-wleite for bringing this idea up again!
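For reference, the relative overhead can also be read directly off the AudioTranscriptTask wall-clock times quoted above (26s vs 27s and 23s vs 24s). The small helper below is only an illustrative calculation with a hypothetical class name; it does not use the cluster stats, which are not reproduced in this thread.

```java
/** Illustrative arithmetic only: relative slowdown of server-side vs client-side conversion. */
class ConversionOverhead {
    static double relativeOverhead(double clientSideSeconds, double serverSideSeconds) {
        return (serverSideSeconds - clientSideSeconds) / clientSideSeconds;
    }

    public static void main(String[] args) {
        // Wall-clock times of AudioTranscriptTask taken from the comment above.
        System.out.printf("first run:  %.1f%%%n", 100 * relativeOverhead(26, 27));
        System.out.printf("second run: %.1f%%%n", 100 * relativeOverhead(23, 24));
    }
}
```

Both runs come out at roughly 4%, consistent with the estimate given in the comment.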

lfcnassif · 2023-03-02T21:53:18Z

PS1: Each audio in this dataset is duplicated, so the client-side total time should be higher for a similar dataset with unique audios.

PS2: I'll fix the stats on the server side; they need to be updated after commit c7dc056. WAV real time will be a bit difficult to compute now.

lfcnassif · 2023-03-02T21:59:24Z

@tc-wleite, just for your reference, the previous wav conversion approach was changed here: #1400

lfcnassif · 2023-03-02T22:19:19Z

Now the times are better, with only unique audios:

WAV real time is still wrong, so I will disable this stat for now. I have 48 processing threads (and the worker node accepts up to 128 simultaneous WAV conversions for now, which is a bit high, but it currently uses the same number and logic as for parallel audio transfers), so in this specific test we just need to divide the WAV CPU time by 48, which gives a WAV conversion real time of about 1s. But with a variable number of clients sending requests, that would be a bit hard to compute, and I don't think it is worth it.
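The estimate described above amounts to dividing the accumulated WAV conversion CPU time by the number of conversions that actually ran in parallel. A minimal sketch follows, with a hypothetical helper name; the 48s CPU time in the example is inferred from the "divide by 48, get about 1s" remark rather than taken from the stats.

```java
/** Illustrative estimate: wall-clock WAV time from summed CPU time and known parallelism. */
class WavRealTimeEstimate {
    static double estimateRealSeconds(double wavCpuSeconds, int parallelConversions) {
        if (parallelConversions <= 0) {
            throw new IllegalArgumentException("parallelConversions must be positive");
        }
        // Valid only while the parallelism is constant, e.g. one client with 48 threads.
        return wavCpuSeconds / parallelConversions;
    }

    public static void main(String[] args) {
        // Assumed value: about 48s of summed CPU time across 48 processing threads.
        System.out.println(estimateRealSeconds(48.0, 48));
    }
}
```

The division only holds when the degree of parallelism is fixed for the whole run, which is why the figure becomes hard to compute once a variable number of clients send requests.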

lfcnassif · 2023-03-02T22:27:33Z

Closed by c7dc056 and a70d02e

lfcnassif · 2023-03-03T01:29:33Z

Cluster nodes updated again. Clients should use a snapshot version to stop converting to WAV on their side.

lfcnassif self-assigned this Mar 2, 2023

lfcnassif added the enhancement label Mar 2, 2023

lfcnassif added a commit that referenced this issue Mar 2, 2023

'#1561: makes audio transfer kind parallel to audio transcription

c7dc056

lfcnassif added a commit that referenced this issue Mar 2, 2023

'#1561: convert audios to WAV on server side again

a70d02e

lfcnassif added a commit that referenced this issue Mar 2, 2023

'#1561: stop computing WAV conversion real time for now, it is wrong

2f6f377

lfcnassif changed the title ~~Improve transcription service to don't be blocked by clients with slow networks~~ Improve transcription service to receive audios in parallel and covert them to WAV on server side again Mar 2, 2023

lfcnassif closed this as completed Mar 2, 2023

lfcnassif added a commit that referenced this issue Mar 3, 2023

'#1561: limit number of parallel WAV convs to logicalCores/numProcesses

78321d2

lfcnassif added a commit that referenced this issue Mar 3, 2023

'#1561: print timestamp of important events in cluster central node

8580ed3

lfcnassif added a commit that referenced this issue Mar 4, 2023

'#1561: simplify/fix commit 78321d, we shouldn't divide by numProcesses

55feb4e

lfcnassif changed the title ~~Improve transcription service to receive audios in parallel and covert them to WAV on server side again~~ Clients with slow networks blocking transcription cluster resources Mar 5, 2023

lfcnassif added bug and removed enhancement labels Mar 5, 2023

This was referenced Mar 5, 2023

Convert audios to WAV on transcription service side again #1566

Closed

Externalize MIN_TIMEOUT parameter for remote audio transcription #1569

Closed

lfcnassif added a commit that referenced this issue Mar 6, 2023

'#1561: uses a fairness policy to transcribe received audios

7f48197

lfcnassif added a commit that referenced this issue Mar 13, 2023

'#1561: makes audio transfer kind parallel to audio transcription

25671a1

lfcnassif added a commit that referenced this issue Mar 13, 2023

'#1561: convert audios to WAV on server side again

e8a5103

lfcnassif added a commit that referenced this issue Mar 13, 2023

'#1561: stop computing WAV conversion real time for now, it is wrong

8d4f828

lfcnassif added a commit that referenced this issue Mar 13, 2023

'#1561: limit number of parallel WAV convs to logicalCores/numProcesses

93759fb

lfcnassif added a commit that referenced this issue Mar 13, 2023

'#1561: print timestamp of important events in cluster central node

e8526f4

lfcnassif added a commit that referenced this issue Mar 13, 2023

'#1561: simplify/fix commit 78321d, we shouldn't divide by numProcesses

7d3ca44

lfcnassif added a commit that referenced this issue Mar 13, 2023

'#1561: uses a fairness policy to transcribe received audios

368c1a4
