Flushing the output queue WITHOUT invalidating the decode pipeline #698
Pulling in a reference to #220, as the discussion on that issue directly informed the current spec requirement that a key frame must follow a flush.
Whether an H.264 or H.265 stream can be decoded 1-in-1-out depends precisely on how the bitstream is set up. Is your bitstream set up correctly? If you have a sample, we can check it. https://crbug.com/1352442 discusses this for H.264. If the bitstream is set up correctly and this isn't working, that's a bug in that UA. If it's Chrome, you can file an issue at https://crbug.com/new with the chrome://gpu information and a sample bitstream. If the bitstream is not set up properly, there's nothing WebCodecs can do to help, since the behavior would be up to the hardware decoder.
First I should note that I expect WebCodecs implementations to output frames as soon as possible, subject to the limitations of the codec implementation. This means that I predict 1-in-1-out behavior unless the bitstream prevents it. The most likely bitstream limitation is when frame reordering is enabled, as discussed, for example, in the crbug linked above. Some codec implementations do allow us to "flush without resetting", which can be used to force 1-in-1-out behavior, but I am reluctant to specify this feature knowing that not all implementations can support it (e.g. Android MediaCodec). If you can post a sample of your bitstream, I'll take a look and let you know if there is a change you can make to improve decode latency.
Thanks, your replies have given me some things to check on. I'll dig into the bitstream myself as well and see if I can spot any issues with the configuration. As requested, I've thrown together a minimal sample in a single HTML page, with some individual frames base64-encoded in it. What I'd like to see is each frame appear as soon as you hit the corresponding button. Right now you'll observe the one frame of latency, where I have to send the frame twice, send a new frame in, or flush the decoder to get it to appear.
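For reference, a rough sketch of the kind of per-button decode used in the sample (not the actual page; decoder is assumed to be an already-configured VideoDecoder whose output callback renders the frame):

```js
// Hypothetical harness: decode one base64-encoded Annex B frame per button press.
// Ideally the output callback fires immediately after each call to decode().
let timestamp = 0;

function sendFrame(base64Frame, isKeyFrame) {
  const bytes = Uint8Array.from(atob(base64Frame), (c) => c.charCodeAt(0));
  decoder.decode(new EncodedVideoChunk({
    type: isKeyFrame ? 'key' : 'delta',
    timestamp: timestamp++,   // simple monotonically increasing placeholder timestamps
    data: bytes,
  }));
}

// Wired up to the buttons in the sample page, e.g.:
// document.getElementById('frame1').onclick = () => sendFrame(FRAME_1_BASE64, true);
```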
I took a quick look at the bitstream you provided, and at first glance it looks correctly configured to me (in particular, max_num_reorder_frames is set to zero). I also ran the sample on Chrome Canary on macOS, and while I was not able to decode the P frames, the I frames did display immediately. You may be experiencing a browser bug, which should be filed with the appropriate bug tracker (Chrome: https://bugs.chromium.org/p/chromium/issues/entry).
Thanks, I also couldn't find problems in the bitstream. I did, however, diagnose the problem in Chrome. In the Chromium project, under "src/media/gpu/h265_decoder.cc", inside the H265Decoder::Decode() method, there's a bit of code that currently looks like this:
As per the comment, the intent is to output the frame when the end of the bitstream is reached, but the FinishPrevFrameIfPresent() method doesn't actually output the frame; it just submits it for decoding. The fix is simple:
A corresponding change also needs to occur in h264_decoder.cc, which shares the same logic. I tested this change, and it fixes the issue. I'll prepare a submission for the Chromium project and see what they think.

On the issue of the spec itself though, I'm still concerned by an apparent lack of guarantees around when frames submitted for decoding will be made visible. You mentioned that "I expect that WebCodecs implementations will output frames as soon as possible, subject to the limitations of the codec implementation. This means that I predict 1-in-1-out behavior unless the bitstream prevents it." This is reassuring to me, and it's what I'd like to see myself, but I couldn't actually find any part of the standard that would seem to require it. Couldn't a perfectly conforming implementation hold onto 1000 frames right now and claim it meets the spec? This might lead to inconsistent behaviour between implementations, and to problems getting bug reports actioned, as they may be closed as "by design".

From what you said before, I gather the intent of the "optimizeForLatency" flag is basically the same as what I was asking for with a "forceImmediateOutput" flag, in that both are intended to instruct the decoder not to hold onto any frames beyond what the underlying bitstream requires. The wording in the spec right now doesn't really guarantee it does much of anything. The spec describes the "optimizeForLatency" flag as a "Hint that the selected decoder SHOULD be configured to minimize the number of EncodedVideoChunks that have to be decoded before a VideoFrame is output." Making it a "hint" that the decoder "SHOULD" do something doesn't really sound reassuring. Can't we use stronger language here, such as "MUST"? It would also be helpful to elaborate on this flag a bit, maybe tying it back to the kind of use case I've outlined in this issue.

I'd like to see the standard read more like this under the "optimizeForLatency" flag: "Instructs the selected decoder that it MUST ensure that the absolute minimum number of EncodedVideoChunks have to be decoded before a VideoFrame is output, subject to the limitations of the codec and the supplied bitstream. Where the codec and supplied bitstream allow frames to be produced immediately upon decoding, setting this flag guarantees that each decoded frame will be produced for output without further interaction, such as requiring a call to flush() or submitting further frames for decoding."

If I saw that in the spec as a developer making use of VideoDecoder, it would give me confidence that setting that flag would be sufficient to achieve 1-in-1-out behaviour with appropriately prepared input data. A lack of clarity on that is what led me to create this issue. It would also give stronger guidance to developers implementing this spec that they really need to achieve this result to be conforming. Right now it reads more as a suggestion or a nice-to-have, but not a requirement.
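For context, a minimal sketch of where this hint sits in the API today (the H.264 codec string is only an example):

```js
// The flag is a configuration-time hint; per the current spec text it does not
// by itself guarantee 1-in-1-out output.
const decoder = new VideoDecoder({
  output: (frame) => { /* render the frame */ frame.close(); },
  error: (e) => console.error('decode error:', e),
});

decoder.configure({
  codec: 'avc1.42E01E',      // example H.264 Constrained Baseline string
  optimizeForLatency: true,  // "SHOULD minimize chunks decoded before output"
});
```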
For completeness, I'll add that this issue has been reported to the Chromium project under https://crbug.com/1462868, and I've submitted a proposed fix under https://chromium-review.googlesource.com/c/chromium/src/+/4672706
Thanks, we can continue discussion of the Chromium issue there. In terms of the spec language, I think the current text accurately reflects that UAs are beholden to the underlying hardware decoders (and other OS constraints). I don't think we can use SHOULD/MUST here due to that. We could probably instead add some more non-normative text giving a concrete example (1-in-1-out) next to the optimizeForLatency definition. Ultimately it seems like you're looking for confidence that you can always rely on 1-in-1-out behavior. I don't think you'll find that to be true; e.g., many Android decoders won't work that way. It's probably best to submit a test decode on your target platforms before relying on that functionality.
Please would you re-open this issue? I am having the same problem, using Microsoft Edge on the latest macOS on Apple Silicon. The existing WebAssembly solution works flawlessly; under the hood it uses WASM-compiled FFmpeg to decode greyscale HEVC frames (received via WebSockets in real-time). The integrated WASM C code also applies colormaps to turn greyscale into full RGB prior to rendering in the HTML5 Canvas. The current client-side WebAssembly solution has ZERO latency: incoming frames are decoded by FFmpeg and are available without any delays. On the server the video frames, coming at irregular intervals from a scientific application (astronomy), are encoded on-demand using the x265 C API.

The known downside of the existing setup is that the WebAssembly binary size is inflated by the inclusion of FFmpeg, which is exactly what the WebCodecs API is trying to solve. Having tried the WebCodecs API, I find it basically unusable in its current form. There is a long delay (many, many frames, simply unacceptable) before any output frames appear. Forcing optimizeForLatency results in a decoder error and no output frames appear at all. The same problem occurs in Safari.
Have tried the latest Firefox 130, but it fails right at the outset: HEVC does not appear to be supported. So for real-time streaming applications where the video output needs to be available ASAP, without any delays, the current WebCodecs API is unusable, which is a shame. I will need to stick to the existing (working without a hitch) WebAssembly FFmpeg for the foreseeable future, until the WebCodecs API specification improves and browsers revise their implementations.
This issue is best filed in the tracker for the corresponding implementations, and/or OS vendors (since implementations are using OS APIs for H265):
cc-ing @youennf and @aboba for Safari and Edge respectively. You're correct that Firefox doesn't support H265 encoding. Support for H265 isn't planned either; we're focusing on royalty-free formats. I'm not sure about Chromium: wpt.fyi marks it as not supporting it, but it marks Edge that way as well, so I'm not sure.
Hi, thanks for the prompt reply. I guess it will take some time before all the kinks in the browser implementations are sorted out. In theory the WebCodecs API seems God-sent; in practice the devil is in the details: imperfect / inconsistent OS implementations, etc.
https://w3c.github.io/webcodecs/#dom-videodecoder-flush is saying to set [[key chunk required]] to true. https://w3c.github.io/webcodecs/#dom-videodecoder-decode is then throwing if a non-key chunk is submitted while [[key chunk required]] is true. With regards to the delay, HEVC/H264 decoders need to compute the reordering window size. @jvo203, the error you mention is not unexpected.
I am not calling flush(). This is my current experimental JS web worker code (just kicking the tires). By the way, it is rather difficult to find correct information about the HEVC codec string. My string has been lifted from somewhere on the Internet, but it's hard to find good info.
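For illustration, a sketch of this kind of worker setup (not the actual code; the WebSocket URL, the codec string, and the crude keyframe handling below are all placeholders/assumptions):

```js
// decode_worker.js -- illustrative sketch only.
// Raw Annex B HEVC NAL units arrive over a WebSocket and are fed straight to
// the decoder; decoded frames are transferred back to the main thread.
const decoder = new VideoDecoder({
  output: (frame) => {
    self.postMessage({ frame }, [frame]);  // VideoFrame is transferable
  },
  error: (e) => self.postMessage({ error: e.message }),
});

decoder.configure({
  codec: 'hev1.1.6.L93.B0',  // one candidate HEVC Main profile string (assumption)
  optimizeForLatency: true,
});

let timestamp = 0;
const ws = new WebSocket('wss://example.org/hevc');  // placeholder URL
ws.binaryType = 'arraybuffer';
ws.onmessage = (event) => {
  decoder.decode(new EncodedVideoChunk({
    // Crude assumption: the very first NAL received is treated as the keyframe.
    type: timestamp === 0 ? 'key' : 'delta',
    timestamp: timestamp++,
    data: event.data,
  }));
};
```

As discussed further down the thread, the weak point of this naive approach is how the first NAL units before the keyframe are handled.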
Regarding the unexpected initial delays: with the WebCodecs API, following an initial delay of even several hundred frames (!), or 10 seconds or more in user time, incoming HEVC frames are then decoded without any delays as they come in, in real-time. A new HEVC frame comes in and a corresponding decoded output video frame appears immediately.
I feel like this might be a user agent bug. |
That's strange; perhaps a newer Safari handles optimizeForLatency? My Safari is 17.6 (19618.3.11.11.5). The following error comes from Safari when optimizeForLatency is enabled:
Microsoft Edge as well as Chrome on macOS give a "keyframe required" error as well.
When optimizeForLatency is disabled (the flag is commented out in the init object), all three browsers (Safari, Edge and Chrome) start outputting decoded video frames, but only after a long initial delay.
When you say "I feel like this might be a user agent bug.", what exactly do you mean by the "user agent"? Do you mean an incorrect HEVC codec string?
user agent = browser, like Safari. |
I see. No, I am not using any wrapping JavaScript libraries. Nothing at all. I am calling VideoDecoder directly from JavaScript: no wrapping, no extra third-party WebCodecs API libraries, "no nothing". You can see my JavaScript code a few posts above.
I ran the WebCodecs Encode/Decode Worker sample on Chrome Canary 130.0.6700.0 on Mac OS X 14.6.1 (MacBook Air M2), with H.265 enabled via the appropriate flags, selecting H.265 (codec string "hvc1.1.6.L120.00", Main profile, level 4.0) with a 500 Kbps bitrate and the "realtime" latency mode.
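For reference, a sketch of the kind of encoder configuration this exercises (dimensions and callbacks are placeholders, and it is assumed here that the sample's "realtime" option maps to the encoder's latencyMode):

```js
const encoder = new VideoEncoder({
  output: (chunk, metadata) => { /* forward the chunk to the decoder under test */ },
  error: (e) => console.error(e),
});

encoder.configure({
  codec: 'hvc1.1.6.L120.00',  // H.265 Main profile, level 4.0 (from the comment above)
  width: 1280,                // placeholder dimensions
  height: 720,
  bitrate: 500_000,           // 500 Kbps
  latencyMode: 'realtime',    // low-latency rate control, as selected in the sample
});
```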
@aboba For this particular example, the typical decode times using WebAssembly-compiled FFmpeg in a browser (non-threaded FFmpeg, I guess) are between 1 ms and 2 ms for the small HEVC frames. Typical frame image dimensions are small, like 150x150 etc. Not bad, good enough for real-time. Even larger HEVC video frames like 1024x768 can be decoded in a few milliseconds.

@dalecurtis OK, I'll see what can be done to extract the raw NAL frames from the server and save them to disk, just prior to sending them via WebSockets to the client browser. This particular application uses Julia on the server; Julia in turn calls x265 C API bindings to generate raw HEVC NAL units (no wrapping video containers are used, just raw HEVC NAL units). In addition I'll look into trying to save the incoming NAL units on the receiving side too (received via WebSockets in JavaScript). I don't know if it can be done easily in JavaScript (disk access rights etc.). Perhaps there is some discrepancy in the HEVC NAL unit order between the server and the receiver.

Let's assume for a moment that indeed my HEVC bitstreams are a bit dodgy, or some NAL units are getting "lost in transit", or the WebSockets receive NAL units in the wrong order. Then it seems that WASM FFmpeg is extraordinarily resilient to any such errors. This client-server architecture has been in use since about 2016 in various forms (C/C++ on the server, also Rust and Julia, as well as C/FORTRAN server parts, always calling the x265 C API). The WASM FFmpeg decoder has never complained about the incoming NAL units; it has always been able to decode those HEVC bitstreams in real-time without any delays.
ffmpeg is indeed far more resilient than hardware decoders. Unless you configure it to explode on errors (and even then) it will decode all sorts of content that the GPU will often balk at. Since issues in the GPU process are more serious security issues, we will also check the bitstream for conformance before passing to the hardware decoder in Chromium.
Here is some food for thought. Changing the codec string to the one used by @aboba makes the "keyframe required" problem go away in Chrome and Edge when optimizeForLatency is enabled.
In both Chrome and Edge there is now no difference in behaviour with and without optimizeForLatency. In either case video frames are decoded without keyframe errors. However, the problem of delayed video frame output still remains. After carefully observing the decoded frame timestamps (counts), the initial delay seems to be 258 frames. In other words, the initial frames with timestamps from 0 to 257 never appear in the VideoDecoder output queue. The first output frame always has a timestamp of 258. From frame timestamp 258 onwards, all frames immediately appear in the output queue; i.e. you pass an EncodedVideoChunk with timestamp 260 to the decoder, and almost immediately an output VideoFrame with the same timestamp 260 comes out of the decoder output queue. So the mystery is why the first 258 NAL units are completely lost (like falling into a black hole).

There is a new snag: Safari stopped working with the new codec string.

On the subject of the codec string, there is a distinct lack of practical information regarding the correct HEVC codec strings. For VP8, VP9 and AV1, googling yields some examples / useful information; less so for HEVC. Anyway, I will try to extract the HEVC bitstreams from the application; some clues might be found.
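Given how hard it is to find authoritative HEVC codec string documentation, one way to at least probe a given browser is VideoDecoder.isConfigSupported(); the candidate strings below are only examples:

```js
async function probeHevcStrings() {
  const candidates = ['hev1.1.6.L93.B0', 'hvc1.1.6.L120.00', 'hvc1.1.6.L93.B0'];
  for (const codec of candidates) {
    try {
      const { supported } = await VideoDecoder.isConfigSupported({
        codec,
        optimizeForLatency: true,
      });
      console.log(codec, supported ? 'supported' : 'not supported');
    } catch (e) {
      // Malformed codec strings reject with a TypeError rather than resolving.
      console.log(codec, 'rejected:', e.message);
    }
  }
}
```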
For completeness, attached is a raw HEVC bitstream (Annex B file) that plays fine on macOS using ffmpeg (ffplay):
Filename: video-123-190.h265. When you play it, the video starts with some noise-like dots, so please don't be surprised. Each dot represents an X-ray event count bin from an orbiting X-ray satellite observatory. Abbreviated server-side Julia code:
Such HEVC streams experience that 258-NAL-unit initial delay when being decoded with the WebCodecs API. However, there is no initial delay observed with the WebAssembly FFmpeg on the client side. Any help / insights would be much appreciated!

Edit: two more HEVC bitstream examples:

P.S. The meaning of the RGB planes:
The WebAssembly C code then turns the decoded luminance and alpha channels into a proper RGBA array, displayed in an HTML5 Canvas.
@sandersdan ^^^ -- I wonder how hard it'd be to make a JS bitstream analyzer for H.264 and H.265 which verifies whether a bitstream is configured for 1 in / 1 out.
I created a demo page for decoding a video from #698 (comment). Chrome 128 on an M1 MacBook: no lag in decoding is observed. The decoder outputs frames as soon as it gets the corresponding chunks. In order to achieve that, I had to add a small wait after each decode() call.
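A sketch of that pacing approach, assuming chunks is an array of EncodedVideoChunks produced by a simple Annex B splitter and decoder is already configured; the exact delay value is an assumption:

```js
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function decodePaced(decoder, chunks) {
  for (const chunk of chunks) {
    decoder.decode(chunk);
    // Yield for a few milliseconds so the decoder can emit the corresponding
    // frame before the next chunk is queued.
    await sleep(5);
  }
}
```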
Hmm, it's interesting that you hardcoded that. Another difference: you are incrementing the timestamp by 30000 and not by 1 as I have been doing. Could it be that the WebCodecs API decoder is sensitive to the timestamps, and my initial 258 "lost frames" were being lost due to insufficient timestamp differences? Plus one major difference: your WebCodecs API code runs in the main thread, not in a Web Worker, but perhaps that should not matter too much. If anything, using a Web Worker should help reduce pressure on the main thread.
Your way is more accurate. But I'm pretty sure it doesn't affect the decoder's lag.
@Djuffin There is also another difference; I don't know how significant it might be. When you extract the individual NAL units from the HEVC bitstream, you are skipping the first sequence of three or four 0x00 bytes up to the 0x01.
No, start codes are preserved by the demo page |
I see. |
Thanks to hints from @Djuffin, finally a victory has been achieved. The key was waiting, without feeding the decoder, until the first keyframe NAL unit had arrived. The solution in the demo from @Djuffin had the benefit of seeing the whole HEVC bitstream all at once: the rudimentary demuxer in the demo could extract enough NAL units to pass the first keyframe at the beginning of the decoding. In my case, using a real-time WebSockets stream, individual NAL units were coming in one short NAL at a time, and the first keyframe in the HEVC bitstream would only appear after a few non-key control NAL units. The following code, whilst perhaps not optimal (perhaps merging the buffers might be done better), finally works. No initial frames are getting lost, and there is a perfect "1 frame in / 1 frame out" behaviour. Thank you all for your patience and advice!
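For illustration, a sketch of this buffering strategy (not the actual code; helper names and the merge logic are illustrative): NAL units are accumulated until the first keyframe NAL (IDR/CRA) has been seen, the merged buffer is decoded as one "key" chunk, and every subsequent NAL is decoded as a "delta" chunk.

```js
let pending = [];        // NAL units received before the first keyframe
let sawKeyframe = false;
let timestamp = 0;

function hevcNalType(nal) {
  // Skip the 3- or 4-byte Annex B start code, then read the NAL unit type
  // from bits 1..6 of the first header byte (assumes a valid start code).
  const offset = nal[2] === 1 ? 3 : 4;
  return (nal[offset] >> 1) & 0x3f;
}

function onNalUnit(nal /* Uint8Array including the start code */) {
  if (!sawKeyframe) {
    pending.push(nal);
    const type = hevcNalType(nal);
    // Types 16..21 are BLA/IDR/CRA slices, i.e. random access points.
    if (type >= 16 && type <= 21) {
      sawKeyframe = true;
      // Merge everything buffered so far (VPS/SPS/PPS etc. plus the keyframe)
      // into a single "key" chunk.
      const total = pending.reduce((n, b) => n + b.length, 0);
      const merged = new Uint8Array(total);
      let pos = 0;
      for (const b of pending) { merged.set(b, pos); pos += b.length; }
      decoder.decode(new EncodedVideoChunk({ type: 'key', timestamp: timestamp++, data: merged }));
      pending = [];
    }
    return;
  }
  decoder.decode(new EncodedVideoChunk({ type: 'delta', timestamp: timestamp++, data: nal }));
}
```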
As mentioned by @dalecurtis, the WebAssembly-compiled FFmpeg used in my production solution is very resilient and can "swallow" incoming NAL units "as-is", without any manual NAL unit accumulation. Well, perhaps FFmpeg is doing its own internal buffering. Anyway, victory at last!

Edit: P.S. If only Firefox could support HEVC too in the WebCodecs API... Then we'd have all the major browsers, i.e. Chrome, Edge, Firefox and Safari (listed in alphabetical order), using the same client-side JavaScript code.

P.S. 2: Even with Firefox getting on board, there would still be a need to provide a full WASM legacy fallback for older browsers.
First up, let me say thanks to everyone who's worked on this very important spec. I appreciate the time and effort that's gone into this, and I've found it absolutely invaluable in my work. That said, I have hit a snag, which I consider to be a significant issue in the standard.
Some background first so you can understand what I'm trying to do.
So then, based on the above, hopefully my requirements become clearer. When I get a video frame at the client side, I need to ensure that frame gets displayed to the user, without any subsequent frame being required to arrive in order to "flush" it out of the pipeline. My frames come down at an irregular and unpredictable rate. If the pipeline has a frame or two of latency, I need to be able to manually flush that pipeline on demand, to ensure that all frames which have been sent to the codec for decoding have been made visible.
At this point, I find myself fighting a few parts of the WebCodecs spec. When I call the decode method (https://www.w3.org/TR/webcodecs/#dom-videodecoder-decode), it "enqueues a control message to decode the given chunk". That decode request ends up on the "codec work queue" (https://www.w3.org/TR/webcodecs/#dom-videodecoder-codec-work-queue-slot), and the output ultimately sits in the "internal pending output" queue (https://www.w3.org/TR/webcodecs/#internal-pending-output). As stated in the spec: "Codec outputs such as VideoFrames that currently reside in the internal pipeline of the underlying codec implementation. The underlying codec implementation may emit new outputs only when new inputs are provided. The underlying codec implementation must emit all outputs in response to a flush." So frames may be held up in this queue, as per the spec, but all pending outputs MUST be emitted in response to a flush. There is, of course, an explicit flush method: https://www.w3.org/TR/webcodecs/#dom-videodecoder-flush
The spec states that this method "Completes all control messages in the control message queue and emits all outputs". Fantastic, that's what I need. Unfortunately, the spec also specifically states that in response to a flush call, the implementation MUST "Set [[key chunk required]] to true." This means that after a flush, I can only provide a key frame. Not so good. In my scenario, without knowing when, or if, a subsequent frame is going to arrive, I end up having to flush after every frame, and now due to this requirement that a key frame must follow a flush, every frame must be a keyframe. This increases my payload size significantly, and can cause noticeable delays and stuttering on slower connections.
When I use a dedicated desktop client, and have full control of the decoding hardware, I can perform a "flush" without invalidating the pipeline state, so I can, for example, process a sequence of five frames such as "IPPPP", flushing the pipeline after each one, and this works without issue. I'd like to be able to achieve the same thing under the WebCodecs API. Currently, this seems impossible, as following a call to flush, the subsequent frame must be an I frame, not a P frame.
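A minimal sketch of the constraint being described, assuming decoder is a configured VideoDecoder and iChunk/pChunk are EncodedVideoChunks of type "key" and "delta" respectively:

```js
async function demonstrateFlushConstraint(decoder, iChunk, pChunk) {
  decoder.decode(iChunk);   // the I frame decodes fine
  await decoder.flush();    // emits pending outputs, but sets [[key chunk required]] back to true

  // The next frame in the stream is a P frame, but after the flush only a key
  // chunk is accepted: per the spec, this call throws a DataError.
  decoder.decode(pChunk);
}
```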
My question now is, how can this be overcome? It seems to me I'd need one of two things: either a guarantee that every frame submitted for decoding is output without further input, or a way to flush the pipeline without invalidating its state.
At the hardware encoding/decoding API level, my only experience is with NVENC/NVDEC, but I know what I want is possible under this implementation at least. Are there known hardware implementations where what I'm asking for isn't possible? Can anyone see a possible way around this situation?
I can tell you right now, I have two workarounds. One is to encode every frame as a keyframe. This is clearly sub-optimal for bandwidth, and not required for a standalone client. The second workaround is ugly, but functions. I can measure the "queue depth" of the output queue, and send the same frame for decoding multiple times. This works with I or P frames. With a queue depth of 1 for example, which is what I see on Google Chrome, for each frame I receive at the client end, I send it for decoding twice. The second copy of the frame "pushes" the first one out of the pipeline. A hack, for sure, and sub-optimal use of the decoding hardware, but it keeps my bandwidth at the expected level, and I'm able to implement it on the client side alone.
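A sketch of that second workaround (the render helper and the dedup-by-timestamp logic are illustrative): each encoded frame is submitted twice with the same timestamp, and the duplicate output is dropped in the callback.

```js
const rendered = new Set();

const decoder = new VideoDecoder({
  output: (frame) => {
    if (rendered.has(frame.timestamp)) {  // drop the duplicate decode
      frame.close();
      return;
    }
    rendered.add(frame.timestamp);
    render(frame);   // hypothetical rendering helper
    frame.close();
  },
  error: (e) => console.error(e),
});

function decodeWithPush(data, type, timestamp) {
  // The second submission "pushes" the first frame out of the one-deep pipeline.
  for (let i = 0; i < 2; i++) {
    decoder.decode(new EncodedVideoChunk({ type, timestamp, data }));
  }
}
```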
What I would like to see, ideally, is some extra control in the WebCodecs API. Perhaps a boolean flag in the codec configuration? We currently have the "optimizeForLatency" flag. I'd like to see a "forceImmediateOutput" flag or the like, which guarantees that every frame that is sent for decoding WILL be passed to the VideoFrameOutputCallback without the need to flush or force it through the pipeline with extra input. Failing that, an alternate method of flushing that doesn't invalidate the decode pipeline would work. Without either of these solutions though, it seems to me that WebCodecs as it stands is unsuitable for use with variable rate streams, as you have no guarantees about the depth of the internal pending output queue, and no way to flush it without breaking the stream.
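For what it's worth, the API does expose the depth of the input side of the pipeline (decodeQueueSize and the dequeue event, assuming decoder is a configured VideoDecoder), but there is no equivalent visibility into the internal pending output queue; a small sketch of what can be observed today:

```js
// What can be observed from script: the number of chunks still waiting in the
// control message queue. Frames held inside the underlying codec's internal
// pending output queue are not visible here.
decoder.addEventListener('dequeue', () => {
  console.log('encoded chunks still queued:', decoder.decodeQueueSize);
});
```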