New local/remote audio transcription using Facebook wav2vec2 and transcription cluster service #1214

Closed
lfcnassif opened this issue Jul 8, 2022 · 30 comments · Fixed by #1227

@lfcnassif
Member

Awesome work published in this paper:
https://arxiv.org/pdf/2107.11414.pdf

Scripts, dataset references, and models are in this repo:
https://github.com/lucasgris/wav2vec4bp

400 hours of pt-BR audio were used for training!

Average WER for pt-BR is between 10.5% and 12.4% on the tested datasets!
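
For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate (WER) like the one quoted above is usually computed: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function name and the sample sentences are illustrative only; this is not the evaluation script from the paper, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative sentences, not taken from any of the tested datasets.
    print(word_error_rate("o gato subiu no telhado", "o gato subiu no telhado"))
    print(word_error_rate("a b c d", "a x c"))
```

A WER of 10.5% therefore means roughly one word in ten needs a substitution, insertion, or deletion to match the reference.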

@lfcnassif
Member Author

lfcnassif commented Jul 8, 2022

You can test the transcription by submitting audio on these model pages:
https://huggingface.co/lgris/bp_400h_xlsr2_300M
https://huggingface.co/lgris/bp400-xlsr

Edited: IMHO the first uses a language model for Portuguese, while the second uses no LM, so it tends to transcribe more phonetically (possibly returning words that do not exist in pt-BR, but I think it can also find words that are outside the pt-BR language model used by the first).
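
For anyone who prefers to try the two models locally instead of on the model pages, a minimal sketch using the Hugging Face `transformers` pipeline might look like the following. Assumptions: `transformers` and `torch` are installed, FFmpeg is available so the pipeline can decode the audio file, and `pyctcdecode` plus `kenlm` are installed if the language model is to be used. The file name `sample.wav` and the helper `transcribe` are hypothetical, and which model ships an LM is taken from the comment above, not verified here.

```python
# Model IDs come from the links above; the LM / no-LM labels follow the
# opinion expressed in the comment and are an assumption.
MODEL_WITH_LM = "lgris/bp_400h_xlsr2_300M"
MODEL_WITHOUT_LM = "lgris/bp400-xlsr"


def transcribe(audio_path: str, model_id: str) -> str:
    """Transcribe one audio file with a wav2vec2 model from the Hub."""
    # Imported lazily so this module can be loaded without transformers.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path)["text"]


if __name__ == "__main__":
    # "sample.wav" is a placeholder; the first call downloads the model.
    for model_id in (MODEL_WITH_LM, MODEL_WITHOUT_LM):
        print(model_id, "->", transcribe("sample.wav", model_id))
```

Comparing the two outputs on the same file is a quick way to see the phonetic-versus-dictionary trade-off described above.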

@lfcnassif lfcnassif added this to To do in 4.1 via automation Jul 8, 2022
@lfcnassif lfcnassif changed the title New local audio transcription implementation using wav2vec2 algorithm and bp_400 model New local audio transcription implementation using wav2vec-2.0 algorithm and bp_400 model Jul 8, 2022
@lfcnassif
Member Author

lfcnassif commented Jul 8, 2022

Just found this possibly better Portuguese model using 1B params (the first one above uses 300M); no WER reported for now:
https://huggingface.co/lgris/bp_400_xlsr2_1B

PS: all models seem to be Apache licensed :-)

@lfcnassif
Member Author

The author just added an MIT license to his repo after I kindly asked him to clarify it :-)

@lfcnassif
Member Author

lfcnassif commented Jul 20, 2022

Transcription time running on an i5-8350U CPU (8 logical cores, 1.7-1.9 GHz) over 80 small WAV audios (2s-4s each) from the VoxForge test set: 4m25s

Roughly 1s of processing per second of audio

CPU usage was about 50% and RAM usage was about 1.6GB.

@lfcnassif
Member Author

lfcnassif commented Jul 20, 2022

CPU usage was about 50%

My fault, I had left my notebook power cable unplugged :-). After plugging it in and setting "max performance" in the energy settings, CPU usage was about 90%-95% and running time dropped to almost half: 2m17s. RAM usage increased to 2GB-3GB.

Testing on the same 301 audios data set (~5500s duration) used here (#248 (comment)), on the same 48-thread dual-CPU machine and checking a few transcriptions, accuracy is much better!

But running time increased from 95s with our current Vosk implementation to 1650s, about 17 times slower, although just 50% of the dual CPU was used; maybe only 1 processor was detected by PyTorch... Using 100% of the CPUs might cut running time in half, but that would still be about 8.5x slower, and Vosk is already slow. Not sure if running this new algorithm on CPUs will be acceptable in practice...
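
One way to start checking the "only 1 processor detected" suspicion is to compare what the operating system reports with the thread count PyTorch intends to use. The sketch below does only that; it is not a diagnosis of the 50% usage seen here, and the function name `cpu_visibility` is made up for illustration. `torch.get_num_threads()` is the intra-op thread count, and `torch.set_num_threads(n)` can be used to change it if the value looks too low.

```python
import os


def cpu_visibility() -> dict:
    """Collect the CPU counts seen by the OS, this process, and PyTorch."""
    info = {"os_cpu_count": os.cpu_count()}

    # On Linux, the process may be restricted to a subset of the CPUs.
    if hasattr(os, "sched_getaffinity"):
        info["affinity_cpus"] = len(os.sched_getaffinity(0))

    # PyTorch is optional here so the sketch also runs without it.
    try:
        import torch
    except ImportError:
        info["torch_intraop_threads"] = None
    else:
        info["torch_intraop_threads"] = torch.get_num_threads()
    return info


if __name__ == "__main__":
    for key, value in cpu_visibility().items():
        print(f"{key}: {value}")
```

If the PyTorch value is about half of the logical core count, that would be consistent with the observed usage, though other causes are possible.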

@lfcnassif
Member Author

lfcnassif commented Jul 20, 2022

Just found a ranking of models; first place is another one using 1B params + LM:
https://huggingface.co/spaces/speech-recognition-community-v2/FinalLeaderboard

@tc-wleite the ranking above reminded me of you, haha

@lfcnassif
Member Author

lfcnassif commented Jul 20, 2022

https://huggingface.co/spaces/speech-recognition-community-v2/FinalLeaderboard

Just ran the current top pt-BR model in that ranking (https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-portuguese) on the 301 audios data set (~5500s) using the 48-thread dual-CPU machine: 2700s running time, with just about 50% overall CPU usage.

@lfcnassif
Member Author

lfcnassif commented Jul 20, 2022

I'm waiting for remote access to an RTX 3090 GPU to measure the inference performance on GPU.

@lfcnassif lfcnassif changed the title New local audio transcription implementation using wav2vec-2.0 algorithm and bp_400 model New local audio transcription implementation using wav2vec-2.0 algorithm Jul 20, 2022
@lfcnassif
Member Author

lfcnassif commented Jul 21, 2022

@jonatasgrosman also fine-tuned great models for English, Spanish, German, Italian, French and other languages:
https://huggingface.co/jonatasgrosman

@lfcnassif
Member Author

lfcnassif commented Jul 22, 2022

I forgot to mention this awesome repo I found:
https://github.com/huggingface/transformers

@lfcnassif
Member Author

But running time increased from 95s by our current vosk implementation to 1650s - 17 times slower - although just 50% of the dual CPU was used

Running the transcription on that 301 audios data set from the new IPED task took 1280s using 1 CPU (I think the conversion to WAV was parallelized) and 750s using both CPUs.
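
To put the timings reported in this thread side by side, here is a small arithmetic sketch. It uses only numbers quoted above (all on the ~5500s, 301 audios data set) and derives the real-time factor (processing seconds per audio second) and the slowdown relative to Vosk. The run labels and function names are mine; the audio duration is approximate, so the results are rough.

```python
AUDIO_SECONDS = 5500  # approximate total duration reported in the thread
VOSK_SECONDS = 95     # current Vosk implementation, used as the baseline

# Running times in seconds, as quoted in the comments above.
RUNS = {
    "vosk": 95,
    "wav2vec2, first CPU test": 1650,
    "jonatasgrosman 1B model": 2700,
    "IPED task, 1 CPU": 1280,
    "IPED task, both CPUs": 750,
}


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Seconds of processing needed per second of audio."""
    return processing_s / audio_s


def slowdown(processing_s: float, baseline_s: float) -> float:
    """How many times slower a run is than the baseline."""
    return processing_s / baseline_s


if __name__ == "__main__":
    for label, seconds in RUNS.items():
        rtf = real_time_factor(seconds, AUDIO_SECONDS)
        factor = slowdown(seconds, VOSK_SECONDS)
        print(f"{label:28s} RTF={rtf:.3f}  vs Vosk={factor:.1f}x")
```

All runs stay below a real-time factor of 1 on this machine, so the concern raised above is the gap to Vosk rather than falling behind real time.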

@lfcnassif
Member Author

I just finished the section on the wiki manual about how to enable this new local or remote implementation:
https://github.com/sepinf-inc/IPED/wiki/User-Manual#wav2vec2

Please let me know if it needs a better explanation.

4.1 automation moved this from In progress to Done Sep 8, 2022
@lfcnassif lfcnassif changed the title New local audio transcription implementation using wav2vec-2.0 algorithm New local audio transcription using Facebook wav2vec-2.0 and transcription cluster service Jan 25, 2023
@lfcnassif lfcnassif changed the title New local audio transcription using Facebook wav2vec-2.0 and transcription cluster service New local/remote audio transcription using Facebook wav2vec2 and transcription cluster service Feb 16, 2023
@wladimirleite
Member

@lfcnassif, just some quick feedback here: I downloaded 4.1.0 yesterday, used it to process a new case I am working on, and set audio transcription to use this new wav2vec2 algorithm.

Results were really impressive but, as you warned in the configuration file comments and in the Wiki, it is much slower than Vosk when using only the CPU.
For my particular case, the total processing time was still fine, as there weren't that many audios.

Setup (in Windows) was pretty straightforward, I just followed IPED Wiki's instructions.
One minor detail: I got the error message "Error testing FFmpeg, is it on path? Audios longer than 1min need it to be transcribed", which I don't remember seeing before (in 4.0.x).
It was trivial to fix though (I just downloaded FFmpeg for Windows and placed it in the path).
Maybe this could be included in the setup instructions in the Wiki.
Would it be possible to include an FFmpeg Windows executable in IPED's distribution?
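
For anyone hitting the same message, a quick way to confirm that FFmpeg is reachable before starting a case could look like the sketch below. It is only an illustration of a PATH check in Python, not how IPED itself performs the test; the function name `ffmpeg_available` is hypothetical.

```python
import shutil
import subprocess


def ffmpeg_available(executable: str = "ffmpeg") -> bool:
    """Return True if the executable is on PATH and answers `-version`."""
    path = shutil.which(executable)
    if path is None:
        return False
    try:
        result = subprocess.run(
            [path, "-version"],
            capture_output=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if ffmpeg_available():
        print("FFmpeg found on PATH")
    else:
        print("FFmpeg not found; add its folder to PATH")
```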

@lfcnassif
Member Author

lfcnassif commented Feb 18, 2023

Thank you for trying this out so quickly! What model did you use? Jonatasgrosman's large one is better, but of course slower.

We can update the wiki for sure. It is possible to embed FFmpeg, I think its license is OK, but AFAIK it is 40-50 MB in size. Actually I just use FFmpeg to split WAV files. I didn't manage to do it with MPlayer; do you know if it is possible?

PS: audio splitting is needed only by this new algorithm and by the Google implementation.
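
To make the splitting step concrete, here is a hedged sketch of the kind of FFmpeg call that cuts a WAV file into fixed-length pieces, using FFmpeg's segment muxer (`-f segment -segment_time`). The exact command IPED runs is not shown in this thread, so the 59s chunk length, the output naming pattern, and the function name `build_split_command` are all assumptions made for illustration.

```python
from pathlib import Path


def build_split_command(source, out_dir, max_seconds=59, executable="ffmpeg"):
    """Build an FFmpeg command that splits `source` into WAV chunks.

    max_seconds=59 is an assumed value chosen to stay under the 1 min
    limit mentioned in the error message; it is not taken from IPED.
    """
    if max_seconds <= 0:
        raise ValueError("max_seconds must be positive")
    src = Path(source)
    # %03d is expanded by FFmpeg into 000, 001, 002, ...
    pattern = Path(out_dir) / f"{src.stem}_%03d.wav"
    return [
        executable,
        "-i", str(src),
        "-f", "segment",
        "-segment_time", str(max_seconds),
        "-c", "copy",  # copy the audio stream without re-encoding
        str(pattern),
    ]


if __name__ == "__main__":
    # Hypothetical file names; pass the list to subprocess.run to execute.
    print(" ".join(build_split_command("call.wav", "parts")))
```

Building the argument list separately from running it keeps the command easy to inspect and avoids shell quoting issues with unusual file names.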

@wladimirleite
Member

What model did you use? Jonatasgrosman's large one is better, but of course slower.

I used the large one. As I said, results were very good, considering that the audios were not easy to transcribe (noisy, lots of slang and so on).

We can update the wiki for sure. It is possible to embed ffmpeg, I think its license is ok, but AFAIK it is 40-50MB size. Actually I just use ffmpeg to split WAV files. I didn't manage to do it with mplayer, do you know if it is possible?

Yes, it would add some extra size to the IPED release. I downloaded a "complete" build, which is even larger (~120 MB).
I am not sure, but maybe it is possible to use MPlayer, which would be nice as it is already used.
I am going to check, and let you know if I find a way of using MPlayer instead.
