New local/remote audio transcription using Facebook wav2vec2 and transcription cluster service #1214
It is possible to test the transcription by sending audios to these sites: (edited) IMHO the first uses a language model for Portuguese; the second uses no LM, so it tends to transcribe more phonetically (possibly returning words that don't exist in pt-BR, but I think it can find words outside the pt-BR language model used).
Just found this possibly better model using 1B params with a Portuguese language model (the first one above uses 300M); no WER reported so far. PS: all models seem to be Apache licensed :-)
The author just added an MIT license to his repo after I kindly asked him to clarify it :-)
My fault, I had left my notebook's power cable unplugged :-). After plugging it in and setting "max performance" in the energy settings, CPU usage was about 90%-95% and running time dropped to almost half: 2m17s. RAM usage increased to 2GB-3GB. Testing on the same 301-audio (~5500s total duration) data set used here (#248 (comment)), on the same 48-thread dual-CPU machine, and checking a few transcriptions, accuracy is much better! But running time increased from 95s with our current Vosk implementation to 1650s (17 times slower), although only 50% of the dual CPU was used; maybe PyTorch detected just one processor... Using 100% of the CPUs might cut running time in half, but that would still be 8.5x slower, and Vosk is already slow. Not sure if running this new algorithm on CPUs will be acceptable in practice...
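One guess about the ~50% usage: PyTorch's default intra-op thread count may not cover both sockets. A minimal sketch to inspect and pin the thread pools (the counts below are illustrative for the 48-thread machine, not a verified fix):

```python
import torch

# The inter-op pool must be configured before any parallel work starts,
# so set it first, right after importing torch.
torch.set_num_interop_threads(2)   # e.g. one per CPU socket (illustrative)
torch.set_num_threads(48)          # intra-op threads: all logical cores

print(torch.get_num_threads())     # verify PyTorch will use every core
```

If `get_num_threads()` reports only one socket's worth of cores by default, forcing it up like this is worth testing before concluding the algorithm itself is the bottleneck.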
Just found a ranking of models; the first place is another one using 1B params + LM. @tc-wleite, the ranking above reminded me of you haha
Just ran the current top pt-BR model from that ranking (https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-portuguese) on the 301-audio data set (~5500s) using the 48-thread dual-CPU machine: 2700s running time, only about 50% overall CPU usage.
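For context, those numbers mean the 1B model still runs faster than real time even on CPU; a quick real-time-factor check:

```python
# Real-time factor (RTF) = processing time / audio duration,
# using the measurements reported above.
audio_seconds = 5500        # total duration of the 301 audios
processing_seconds = 2700   # running time on the 48-thread machine
rtf = processing_seconds / audio_seconds
print(f"RTF = {rtf:.2f}")   # 0.49: ~0.49 s of processing per second of audio
```

So batch processing is still feasible on CPU; the question is whether a ~0.5 RTF is acceptable next to Vosk's much lower one.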
I'm waiting for remote access to an RTX 3090 GPU to measure inference performance on GPU.
@jonatasgrosman also fine-tuned great models for English, Spanish, German, Italian, French and other languages:
I forgot to mention this awesome repo I found:
Running the transcription on that 301-audio data set with the new IPED task took 1280s using 1 CPU (I think the conversion to WAV was parallelized) and 750s using both CPUs.
I just finished the section in the wiki manual on how to enable this new local or remote implementation. Please let me know if it needs a better explanation.
@lfcnassif, just a quick feedback here: I downloaded 4.1.0 yesterday and used it to process a new case I am working on, enabling the new transcription implementation. Results were really impressive, but as you warned in the configuration file comments and in the Wiki, it is much slower than Vosk when using only the CPU. Setup (on Windows) was pretty straightforward; I just followed the IPED Wiki's instructions.
Thank you for trying this out so quickly! Which model did you use? Jonatasgrosman's large one is better, but of course slower. We can update the wiki for sure. It is possible to embed ffmpeg; I think its license is OK, but AFAIK it is 40-50MB in size. Actually I just use ffmpeg to split WAV files. I didn't manage to do that with mplayer, do you know if it is possible? PS: audio splitting is needed only by this new algorithm and by the Google implementation.
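For reference, splitting a WAV into fixed-size chunks can be done with ffmpeg's segment muxer without re-encoding; a minimal sketch (the chunk length, file names, and helper are illustrative, not IPED's actual code):

```python
import subprocess

def split_wav_cmd(src, chunk_seconds=60, pattern="chunk%03d.wav"):
    """Build an ffmpeg command that splits src into fixed-size WAV segments."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", src,
        "-f", "segment",                  # segment muxer: split output into chunks
        "-segment_time", str(chunk_seconds),
        "-c", "copy",                     # copy PCM data, no re-encoding
        pattern,                          # chunk000.wav, chunk001.wav, ...
    ]

# To actually run it (requires ffmpeg on the PATH):
# subprocess.run(split_wav_cmd("long_audio.wav"), check=True)
```

Since `-c copy` avoids decoding, the split is I/O-bound and fast; that keeps the embedded-binary question purely about size and licensing.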
I used that large one. As I said, results were very good, considering that the audios were not easy to transcribe (noisy, lots of slang, and so on).
Yes, it would add some extra size to the IPED release. I downloaded a "complete" ffmpeg build, which is even larger (~120 MB).
Awesome work published in this paper:
https://arxiv.org/pdf/2107.11414.pdf
Scripts, data sets references and models in this repo:
https://github.com/lucasgris/wav2vec4bp
400 hours of pt-BR audios used for training!
Average WER for pt-BR is between 10.5% and 12.4% on the tested datasets!
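For anyone wanting to reproduce WER figures like these on their own transcriptions: word error rate is just the word-level edit distance divided by the number of reference words. A minimal self-contained sketch (libraries such as jiwer implement the same metric):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev + (r != h))   # substitution (or match if equal)
            prev = cur
    return d[len(hyp)] / len(ref)

print(wer("o gato subiu no telhado", "o gato subiu no telhado"))  # 0.0
print(wer("o gato subiu no telhado", "o gato subiu telhado"))     # 0.2 (one deletion)
```

A WER of 10.5%-12.4% therefore means roughly one word in nine is inserted, deleted, or substituted relative to the reference transcript.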