New local/remote audio transcription using Facebook wav2vec2 and transcription cluster service #1214
It is possible to test the transcription by sending audios to these sites: (edited) IMHO the first uses a language model for Portuguese; the second uses no LM, so it tends to transcribe more phonetically (possibly returning words that don't exist in pt-BR, but I think it can find words outside the pt-BR language model used).
Just found this possibly better model using 1B params with a Portuguese language model (the first one above uses 300M); no WER reported so far. PS: all models seem to be Apache licensed :-)
The author just added an MIT license to his repo after I kindly asked him to clarify it :-)
My fault, I had left my notebook's power cable unplugged :-). After plugging it in and setting "max performance" in the energy settings, CPU usage was about 90%-95% and running time dropped to almost half: 2m17s. RAM usage increased to 2GB-3GB. Testing on the same 301-audio (~5500s total duration) data set used here (#248 (comment)), on the same 48-thread dual-CPU machine, and checking a few transcriptions, accuracy is much better! But running time increased from 95s with our current Vosk implementation to 1650s (17 times slower), although only 50% of the dual CPU was used; maybe PyTorch detected just one processor... Using 100% of the CPUs might cut running time in half, but that would still be 8.5x slower, and Vosk is already slow. Not sure if running this new algorithm on CPUs will be acceptable in practice...
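One guess about the ~50% usage: PyTorch's default intra-op thread count may not cover both sockets. A minimal sketch to inspect and pin the thread pools (the counts below are illustrative for the 48-thread machine, not a verified fix):

```python
import torch

# The inter-op pool must be configured before any parallel work starts,
# so set it first, right after importing torch.
torch.set_num_interop_threads(2)   # e.g. one per CPU socket (illustrative)
torch.set_num_threads(48)          # intra-op threads: all logical cores

print(torch.get_num_threads())     # verify PyTorch will use every core
```

If `get_num_threads()` reports only one socket's worth of cores by default, forcing it up like this is worth testing before concluding the algorithm itself is the bottleneck.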
Just found a ranking of models; the first place is another one using 1B params + LM. @tc-wleite, the ranking above reminded me of you haha
Just ran the current top pt-BR model from that ranking (https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-portuguese) on the 301-audio data set (~5500s) using the 48-thread dual-CPU machine: 2700s running time, only about 50% overall CPU usage.
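For context, those numbers mean the 1B model still runs faster than real time even on CPU; a quick real-time-factor check:

```python
# Real-time factor (RTF) = processing time / audio duration,
# using the measurements reported above.
audio_seconds = 5500        # total duration of the 301 audios
processing_seconds = 2700   # running time on the 48-thread machine
rtf = processing_seconds / audio_seconds
print(f"RTF = {rtf:.2f}")   # 0.49: ~0.49 s of processing per second of audio
```

So batch processing is still feasible on CPU; the question is whether a ~0.5 RTF is acceptable next to Vosk's much lower one.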
I'm waiting for remote access to an RTX 3090 GPU to measure inference performance on GPU.
@jonatasgrosman also fine-tuned great models for English, Spanish, German, Italian, French and other languages:
I forgot to mention this awesome repo I found:
Running the transcription on that 301-audio data set with the new IPED task took 1280s using 1 CPU (I think the conversion to WAV was parallelized) and 750s using both CPUs.
I just finished the section in the wiki manual on how to enable this new local or remote implementation. Please let me know if it needs a better explanation.
@lfcnassif, just a quick feedback here: I downloaded 4.1.0 yesterday and used it to process a new case I am working on, enabling the new transcription implementation. Results were really impressive, but as you warned in the configuration file comments and in the Wiki, it is much slower than Vosk when using only the CPU. Setup (on Windows) was pretty straightforward; I just followed the IPED Wiki's instructions.
Thank you for trying this out so quickly! Which model did you use? Jonatasgrosman's large one is better, but of course slower. We can update the wiki for sure. It is possible to embed ffmpeg; I think its license is OK, but AFAIK it is 40-50MB in size. Actually I just use ffmpeg to split WAV files. I didn't manage to do that with mplayer, do you know if it is possible? PS: audio splitting is needed only by this new algorithm and by the Google implementation.
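For reference, splitting a WAV into fixed-size chunks can be done with ffmpeg's segment muxer without re-encoding; a minimal sketch (the chunk length, file names, and helper are illustrative, not IPED's actual code):

```python
import subprocess

def split_wav_cmd(src, chunk_seconds=60, pattern="chunk%03d.wav"):
    """Build an ffmpeg command that splits src into fixed-size WAV segments."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", src,
        "-f", "segment",                  # segment muxer: split output into chunks
        "-segment_time", str(chunk_seconds),
        "-c", "copy",                     # copy PCM data, no re-encoding
        pattern,                          # chunk000.wav, chunk001.wav, ...
    ]

# To actually run it (requires ffmpeg on the PATH):
# subprocess.run(split_wav_cmd("long_audio.wav"), check=True)
```

Since `-c copy` avoids decoding, the split is I/O-bound and fast; that keeps the embedded-binary question purely about size and licensing.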
I used that large one. As I said, results were very good, considering that the audios were not easy to transcribe (noisy, lots of slang, and so on).
Yes, it would add some extra size to the IPED release. I downloaded a "complete" ffmpeg build, which is even larger (~120 MB).
Awesome work published in this paper:
https://arxiv.org/pdf/2107.11414.pdf
Scripts, data sets references and models in this repo:
https://github.com/lucasgris/wav2vec4bp
400 hours of pt-BR audios used for training!
Average WER for pt-BR is between 10.5% and 12.4% on the tested datasets!
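For anyone wanting to reproduce WER figures like these on their own transcriptions: word error rate is just the word-level edit distance divided by the number of reference words. A minimal self-contained sketch (libraries such as jiwer implement the same metric):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev + (r != h))   # substitution (or match if equal)
            prev = cur
    return d[len(hyp)] / len(ref)

print(wer("o gato subiu no telhado", "o gato subiu no telhado"))  # 0.0
print(wer("o gato subiu no telhado", "o gato subiu telhado"))     # 0.2 (one deletion)
```

A WER of 10.5%-12.4% therefore means roughly one word in nine is inserted, deleted, or substituted relative to the reference transcript.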