whisper_online_server.py not producing a correct text #28

Closed
gu-ma opened this issue Nov 14, 2023 · 2 comments

@gu-ma

gu-ma commented Nov 14, 2023

Hello,

First, thanks a lot for open-sourcing the software 👍

I'm running into some problems with whisper_online_server.py: it does not seem to work and produces some "random" content. I tested the same file with whisper_online.py and it works perfectly. Any idea what the cause could be?

Client side

Command

ffmpeg -i GMT20231102-120819_Recording.m4a -f s16le -acodec pcm_s16le - | nc localhost 43007

Log

ffmpeg -i GMT20231102-120819_Recording.m4a -f s16le -acodec pcm_s16le - | nc localhost 43007
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libswscale      5.  9.100 /  5.  9.100
  libswresample   3.  9.100 /  3.  9.100
  libpostproc    55.  9.100 / 55.  9.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'GMT20231102-120819_Recording.m4a':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: isommp42
    creation_time   : 2023-11-02T12:08:19.000000Z
  Duration: 01:22:02.94, start: 0.000000, bitrate: 127 kb/s
  Stream #0:0(und): Audio: aac (LC) (mp4a / 0x6134706D), 32000 Hz, mono, fltp, 126 kb/s (default)
    Metadata:
      creation_time   : 2023-11-02T12:08:19.000000Z
      handler_name    : AAC audio
      vendor_id       : [0][0][0][0]
Stream mapping:
  Stream #0:0 -> #0:0 (aac (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, s16le, to 'pipe:':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: isommp42
    encoder         : Lavf58.76.100
  Stream #0:0(und): Audio: pcm_s16le, 32000 Hz, mono, s16, 512 kb/s (default)
    Metadata:
      creation_time   : 2023-11-02T12:08:19.000000Z
      handler_name    : AAC audio
      vendor_id       : [0][0][0][0]
      encoder         : Lavc58.134.100 pcm_s16le
0 1660  Thank you very much.9.37 bitrate= 512.6kbits/s speed=18.1x
4000 8420  dramatic presentation of a video12.1kbits/s speed=4.89x
8420 12420  for sports.
13260 17020  If you're interested in watching.1kbits/s speed=3.99x
38740 39180  please feel free to leave a= 512.1kbits/s speed=1.95x
42540 42560  please feel free to leave a comment or a like, and don't forget to

Server side

Command

python whisper_online_server.py --min-chunk-size 1

Log

Loading Whisper large-v2 model for en... done. It took 3.65 seconds.
Whisper is not warmed up
whisper-server-INFO: INFO: Listening on('localhost', 43007)
whisper-server-INFO: INFO: Connected to client on ('127.0.0.1', 56506)
...
INCOMPLETE: (4.04, 10.0, ' Well, that was nice of you. Okay, uh, maybe next time. Take care.')
len of buffer now: 10.37
(None, None, '')
b'C\xfeG\xfeK\xfeP\xfeT\xfe'
65536
PROMPT:
CONTEXT:  Thank you very much.
transcribing 12.42 seconds from 0.00
whisper-server-INFO: Processing audio with duration 00:12.416
>>>>COMPLETE NOW: (None, None, '')
INCOMPLETE: (4.0600000000000005, 12.38, " Well that is it for this episode of Gamer Gear. I hope you enjoyed it. Thank you all for watching. I'll see you all next time.")
len of buffer now: 12.42
(None, None, '')
b'\xdd\xff\xea\xff\x15\x00\xff\xff\xd6\xff'
65536
PROMPT:
CONTEXT:  Thank you very much.
transcribing 14.46 seconds from 0.00
whisper-server-INFO: Processing audio with duration 00:14.464
>>>>COMPLETE NOW: (None, None, '')
INCOMPLETE: (4.36, 13.26, " Well, that was not too hard, was it? No, it was a little bit difficult. It was a long time ago. It's hard to say.")
len of buffer now: 14.46
(None, None, '')
b'<\x07[\x07\x82\x08)\nj\x0c'
65536
PROMPT:
CONTEXT:  Thank you very much.
transcribing 16.51 seconds from 0.00
whisper-server-INFO: Processing audio with duration 00:16.512
>>>>COMPLETE NOW: (None, None, '')
INCOMPLETE: (4.079999999999999, 16.48, ' dramatic presentation of a video that we all enjoy watching. If you enjoy it, please subscribe.')
len of buffer now: 16.51
(None, None, '')
b'\xd3\xfd\xda\xfd\xe2\xfd\xef\xfd\x00\xfe'
65536
...
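One possible reading of the server log above, offered as a hedged sketch. The byte strings appear to be the first bytes of each received chunk, and 65536 appears to be that chunk's length in bytes (both are assumptions about what the debug prints show). Decoded as signed 16-bit little-endian PCM, the bytes give small, smoothly changing values, so the samples themselves look intact. The buffer, however, grows by about 2.05 seconds per 65536-byte chunk (10.37, 12.42, 14.46, 16.51), which is what 16 kHz mono 16-bit audio would give, while the ffmpeg log shows the client sending 32000 Hz. If that reading is right, the server is hearing the recording at half speed. The helper names decode_s16le and chunk_seconds are hypothetical:

```python
import struct


def decode_s16le(raw):
    """Interpret raw bytes as signed 16-bit little-endian PCM samples."""
    count = len(raw) // 2
    return list(struct.unpack(f"<{count}h", raw[: count * 2]))


def chunk_seconds(num_bytes, sample_rate, sample_width=2, channels=1):
    """Duration of a raw PCM chunk at the sample rate it is interpreted with."""
    return num_bytes / (sample_width * channels * sample_rate)


# First bytes of one chunk, copied from the server log.
print(decode_s16le(b"C\xfeG\xfeK\xfeP\xfeT\xfe"))  # [-445, -441, -437, -432, -428]

# The same 65536-byte chunk read as 16 kHz audio vs. the 32 kHz the client sent.
print(chunk_seconds(65536, 16000))  # 2.048
print(chunk_seconds(65536, 32000))  # 1.024
```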
@Gldkslfmsd
Collaborator

Hi,
Are you sure the ffmpeg options are correct? For file format conversion, I'm using ffmpeg -i in.mp3 -acodec pcm_s16le -ac 1 -ar 16000 out.wav, where in.mp3 is the input and out.wav is the output.
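The suggested command converts to mono 16-bit PCM at 16 kHz, whereas the ffmpeg log in the report shows the recording going out at 32000 Hz, since the original command does not resample. The sketch below checks a converted WAV file with Python's standard wave module before it is streamed. The target format (16 kHz, mono, 2-byte samples) is taken from the suggested ffmpeg options and assumed to be what the server expects; the function name check_wav_format is hypothetical.

```python
import sys
import wave

# Assumed target format, matching "-acodec pcm_s16le -ac 1 -ar 16000".
EXPECTED_RATE = 16000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH = 2  # bytes per sample for 16-bit PCM


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {EXPECTED_RATE}")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(
                f"{wav.getnchannels()} channels, expected {EXPECTED_CHANNELS}")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH:
            problems.append(
                f"sample width is {wav.getsampwidth()} bytes, "
                f"expected {EXPECTED_SAMPLE_WIDTH}")
    return problems


if __name__ == "__main__":
    # Hypothetical usage: python check_wav.py out.wav
    for problem in check_wav_format(sys.argv[1]):
        print("format problem:", problem)
```

A file produced with the suggested options should print nothing; a file converted without -ar 16000 from this recording would be reported as 32000 Hz.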

Furthermore, I think this way of using ffmpeg pushes all the bytes of the recording through at once. If you want to simulate real-time mode, make sure the audio is actually sent in real time. I convert the file to WAV and then use a Python script that outputs X seconds of audio every X seconds.
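A minimal sketch of such a real-time sender, assuming a WAV file already converted to 16 kHz mono 16-bit PCM and a server listening on localhost:43007 as in the report. The collaborator's own script is not shown in the thread; stream_wav_realtime and its parameters are hypothetical. Because WAV stores 16-bit samples little-endian, the frames returned by the wave module are already in the s16le layout the original client command was sending.

```python
import socket
import sys
import time
import wave


def stream_wav_realtime(path, send, chunk_seconds=1.0, sleep=time.sleep):
    """Send a WAV file's raw PCM frames chunk by chunk at roughly real-time pace.

    send  -- callable that receives raw bytes (e.g. a socket's sendall)
    sleep -- injectable so the pacing can be tested without waiting
    Returns the number of chunks sent.
    """
    chunks = 0
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frame_bytes = wav.getsampwidth() * wav.getnchannels()
        frames_per_chunk = max(1, int(rate * chunk_seconds))
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            send(data)
            chunks += 1
            # Wait as long as the audio just sent lasts ("X seconds every X seconds").
            sleep(len(data) / frame_bytes / rate)
    return chunks


if __name__ == "__main__":
    # Hypothetical usage: python stream_wav.py out.wav
    with socket.create_connection(("localhost", 43007)) as sock:
        stream_wav_realtime(sys.argv[1], sock.sendall)
```

Pacing each chunk by its own duration keeps the shorter final chunk from adding a full extra delay; a more careful version would schedule against time.monotonic() so that slow sends do not accumulate drift.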

@gu-ma
Author

gu-ma commented Nov 24, 2023

Hey Dominik,
Thanks for the reply, and sorry for not getting back to you earlier. I'll try a different way to stream it and let you know what worked in the end :)
