
Frame accurate seeking of HTML5 MediaElement #4

Open
tidoust opened this issue Jun 11, 2018 · 69 comments

@tidoust (Contributor) commented Jun 11, 2018

I've heard a couple of companies point out that one of the problems that makes it hard (at least harder than it could be) to do post-production of videos in Web browsers is that there is no easy way to process media elements on a frame-by-frame basis, whereas that is the usual default in Non-Linear Editors (NLEs).

The currentTime property takes a time, not a frame number or an SMPTE timecode. Converting between times and frame numbers is doable but requires knowing the framerate of the video, which is not exposed to Web applications (so a generic NLE would not know it). On top of that, the framerate may actually vary over time.

Also, internal rounding of time values may mean that one seeks to the end of the previous frame instead of the beginning of a specific video frame.
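For illustration, here is a minimal sketch (not an existing API) of the conversion an application has to do today, assuming a constant framerate known out of band; seeking to the middle of the target frame is a common workaround for the rounding problem just described:

```js
// Minimal sketch: time <-> frame conversion, assuming a constant framerate
// supplied out of band (the element does not expose it).
const FPS = 25; // assumption

function frameFromTime(t, fps = FPS) {
  // Which frame interval does t fall into? The epsilon absorbs float noise.
  return Math.floor(t * fps + 1e-6);
}

function timeFromFrame(n, fps = FPS) {
  // Target the middle of the frame rather than its start, so that internal
  // rounding is less likely to land on the end of the previous frame.
  return (n + 0.5) / fps;
}

// Usage: seek a <video> element to frame 100.
// video.currentTime = timeFromFrame(100);
```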

Digging around, I've found a number of discussions and issues around the topic, most notably:

  1. A long thread from 2011 on Frame accuracy / SMPTE, which led to improvements in the precision of seeks in browser implementations:
    https://lists.w3.org/Archives/Public/public-whatwg-archive/2011Jan/0120.html
  2. A list of use cases from 2012 for seeking to specific frames. Not sure if these use cases remain relevant today:
    https://www.w3.org/Bugs/Public/show_bug.cgi?id=22678
  3. A question from 2013 on whether there was interest to expose "versions of currentTime, fastSeek(), duration, and the TimeRanges accessors, in frames, for video data":
    https://www.w3.org/Bugs/Public/show_bug.cgi?id=8278#c3
  4. A proposal from 2016 to add a rational time value for seek() to solve rounding issues (still open as of June 2018):
    whatwg/html#609

There have probably been other discussions around the topic.

I'm raising this issue to collect practical use cases and requirements for the feature, and gauge interest from media companies to see a solution emerge. It would be good to precisely identify what does not work today, what minimal updates to media elements could solve the issue, and what these updates would imply from an implementation perspective.

@palemieux commented Jun 11, 2018

There have probably been other discussions around the topic.

Yes. Similar discussions happened during the MSE project: https://www.w3.org/Bugs/Public/show_bug.cgi?id=19676

@chrisn (Member) commented Jun 12, 2018

There's some interesting research here, with a survey of current browser behaviour.

The current lack of frame accuracy effectively closes off entire fields of possibilities from the web, such as non-linear video editing, but it also has unfortunate effects on things as simple as subtitle rendering.

@jpiesing commented Jun 12, 2018

I should also mention that there is some uncertainty about the precise meaning of currentTime - particularly when you have a media pipeline where the frame/sample coming out of the end may be 0.5s further along the media timeline than the ones entering the media pipeline. Some people think currentTime reflects what is coming out of the display/speakers/headphones. Some people think it should reflect the time where video and graphics are composited, as this is easy to test and suits apps trying to sync graphics to video or audio. Simple implementations may re-use a time available in a media decoder.

@Daiz commented Jun 12, 2018

what minimal updates to media elements could solve the issue

Related to the matter of frame accuracy on the whole, one idea would be to add a new property to VideoElement called .currentFrameTime which would hold the presentation time value of the currently displayed frame. As mentioned in the research repository of mine (also linked above), .currentTime is not actually sufficient right now in any browser for determining the currently displayed frame even if you know the exact framerate of the video. .currentFrameTime could at least solve this particular issue, and could also be used for monitoring the exact screen refreshes when displayed frames change.

@jpiesing commented Jun 12, 2018

Related to the matter of frame accuracy on the whole, one idea would be to add a new property to VideoElement called .currentFrameTime which would hold the presentation time value of the currently displayed frame.

The currently displayed frame can be hard to determine, e.g. if the UA is running on a device without a display with video being output over HDMI or (perhaps) a remote playback scenario ( https://w3c.github.io/remote-playback/ ).

@mfoltzgoogle commented Jun 12, 2018

Remote playback cases are always going to be best effort to keep the video element in sync with the remote playback state. For video editing use cases, remote playback is not as relevant (except maybe to render the final output).

There are a number of implementation constraints that are going to make it challenging to provide a completely accurate instantaneous frame number or presentation timestamp in a modern browser during video playback.

  • The JS event loop will run in a different thread than the one painting pixels on the screen. There will be buffering and jitter in the intermediate thread hops.
  • The event loop often runs at a different frequency than the underlying video, so frames will span multiple loops.
  • Video is often decoded, painted, and composited asynchronously in hardware or software outside of the browser. There may not be frame-accurate feedback on the exact paint time of a frame.

Some estimates could be made based on knowing the latency of the downstream pipeline. It might be more useful to surface the last presentation timestamp submitted to the renderer and the estimated latency until frame paint.

It may also be more feasible to surface the final presentation timestamp/time code when a seek is completed. That seems more useful from a video editing use case.

Understanding the use cases here and what exactly you need to know would help guide concrete feedback from browsers.

@Daiz commented Jun 12, 2018

One of the main use cases for me would be the ability to synchronize content changes outside the video to frame changes in the video. As a simple example, the test case in the frame-accurate-ish repo shows this with the background color change. In my case the main thing would be the ability to accurately synchronize custom subtitle rendering with frame changes. Being even one or two screen refreshes off becomes a notable issue when you want subtitles to appear/disappear with scene changes - even a frame or two of subtitles hanging on the screen after a scene change is very noticeable and ugly to look at during playback.

@mfoltzgoogle commented Jun 12, 2018

It depends on the inputs to the custom subtitle rendering algorithm. How do you determine when to render a text cue?

@Daiz commented Jun 13, 2018

Currently, I'm using video.currentTime and doing calculations based on the frame rate to try to have cues appear/disappear when the displayed frame changes (which is the behavior I want to achieve). As mentioned before, this is not sufficient for frame-accurate rendering even if you know the exact frame rate of the video. There are ways to improve the accuracy with some non-standard properties (like video.mozPaintedFrames in Firefox), but even then the results aren't perfect.
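For illustration, a rough sketch of the kind of workaround described here, polling Firefox's non-standard video.mozPaintedFrames to detect repaints and estimating the frame index from currentTime plus an out-of-band framerate; as noted, the result is still not frame-exact:

```js
// Rough sketch of the workaround described above. mozPaintedFrames is a
// non-standard Firefox-only counter; the framerate is assumed to be known
// out of band, and the computed index is still only an estimate.
const fps = 23.976; // assumption

function watchPaintedFrames(video, onFrameChange) {
  let lastPainted = -1;
  const tick = () => {
    // mozPaintedFrames only exists in Firefox; elsewhere this never fires.
    if (typeof video.mozPaintedFrames === 'number' &&
        video.mozPaintedFrames !== lastPainted) {
      lastPainted = video.mozPaintedFrames;
      onFrameChange(Math.floor(video.currentTime * fps + 1e-6));
    }
    requestAnimationFrame(tick);
  };
  tick();
}
```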

@jpiesing commented Jun 13, 2018

It depends on the inputs to the custom subtitle rendering algorithm. How do you determine when to render a text cue?

Perhaps @palemieux could comment on how the imsc.js library handles this?

@jpiesing commented Jun 13, 2018

One of the main use cases for me would be the ability to synchronize content changes outside the video to frame changes in the video. As a simple example, the test case in the frame-accurate-ish repo shows this with the background color change. In my case the main thing would be the ability to accurately synchronize custom subtitle rendering with frame changes. Being even one or two screen refreshes off becomes a notable issue when you want subtitles to appear/disappear with scene changes - even a frame or two of subtitles hanging on the screen after a scene change is very noticeable and ugly to look at during playback.

This highlights the importance of being clear what currentTime means as hardware-based implementations or devices outputting via HDMI may have several frames difference between the media time of the frame being output from the display and the frame being composited with graphics.

@ingararntzen commented Jun 13, 2018

With the timingsrc [1] library we are able to sync content changes outside the video with errors <10ms (less than a frame).

The library achieves this by

  1. using an interpolated clock approximating currentTime (timingobject)
  2. synchronizing video (mediasync) relative to a timing object (errors about 7ms)
  3. synchronizing javascript cues (sequencer - based on setTimeout) relative to the same timing object (errors about 1ms)

This still leaves delays from DOM changes to on-screen rendering.

In any case, this should typically be sub-framerate sync.

This assumes that currentTime is a good representation of the reality of video presentation. If it isn't, but you know how wrong it is, you can easily compensate.

Not sure if this is relevant to the original issue, which I understood to be about accurate frame stepping - not sync during playback?

Ingar Arntzen

[1] https://webtiming.github.io/timingsrc/
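For illustration, a generic sketch of the interpolated-clock idea in step 1 above (this is not the timingsrc API itself; the event choices and names are assumptions):

```js
// Generic sketch: sample currentTime against performance.now() and
// extrapolate between samples, re-anchoring on playback state changes.
function makeInterpolatedClock(video) {
  let sample = { mediaTime: video.currentTime, wallTime: performance.now() };

  const resample = () => {
    sample = { mediaTime: video.currentTime, wallTime: performance.now() };
  };
  ['timeupdate', 'seeked', 'ratechange', 'play', 'pause']
    .forEach((ev) => video.addEventListener(ev, resample));

  return function now() {
    if (video.paused) return video.currentTime;
    const elapsed = (performance.now() - sample.wallTime) / 1000;
    return sample.mediaTime + elapsed * video.playbackRate;
  };
}

// const now = makeInterpolatedClock(video);
// now() estimates the media position in between timeupdate events.
```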

@nigelmegitt commented Jun 13, 2018

how the imsc.js library handles this

@jpiesing I can't speak for @palemieux obviously but my understanding is that imsc.js does not play back video and therefore does not do any alignment; it merely identifies the times at which the presentation should change.

However it is integrated into the dash.js player which does need to synchronise the subtitle presentation with the media. I believe it uses Text Track Cues, and from what I've seen they can be up to 250ms late depending on when the Time Marches On algorithm happens to be run, which can be as infrequent as every 250ms, and in my experience often is.

As @Daiz points out, that's not nearly accurate enough.

@palemieux commented Jun 13, 2018

What @nigelmegitt said :)

What is needed is a means of displaying/hiding HTML (or TTML) snippets at precise offsets on the media timeline.

@ingararntzen commented Jun 13, 2018

What is needed is a means of displaying/hiding HTML (or TTML) snippets at precise offsets on the media timeline.

@palemieux this is exactly what I described above.

The sequencer of the timingsrc library does this. It may be used with any data, including HTML or TTML.

@chrisn (Member) commented Jun 13, 2018

Not sure if this is relevant to the original issue, which I understood to be about accurate frame stepping - not sync during playback?

@ingararntzen It is a different use case, but a good one nonetheless. Presumably, frame accurate time reporting would help with synchronised media playback across multiple devices, particularly where different browser engines are involved, each with a different pipeline delay. But, you say you're already achieving sub-frame rate sync in your library, based on currentTime, so maybe not?

@nigelmegitt commented Jun 13, 2018

@ingararntzen forgive my lack of detailed knowledge, but the approach you describe does raise some questions at least in my mind:

  • does it change the event handling model so that it no longer uses Time Marches On?
  • What happens if the event handler for event n completes after event n+1 should begin execution?
  • Does the timing object synchronise against the video or does it cause the video to be synchronised with it? In other words, in the case of drift, what moves to get back into alignment?
  • How does the interpolating clock deal with non-linear movements along the media timeline in the video, such as pause, fast forward and rewind?

Just questions for my understanding, I'm not trying to be negative!

@Daiz commented Jun 13, 2018

On the matter of "sub-framerate sync", I would like to point out that for the purposes of high quality media playback, this is not enough. Things like subtitle scene bleeds (where a cue remains visible after a scene change occurs in the video) are noticeable and ugly even if they remain on-screen for just an extra 15-30 milliseconds (ie. less than a single 24FPS frame, which is ~42ms) after a scene change occurs. Again, you can clearly see this yourself with the background color change in this test case (which has various tricks applied to increase accuracy) - it is very clear when the sync is even slightly off. Desktop video playback software outside browsers does not have issues in this regard, and I would really like to be able to replicate that on the web as well.

@ingararntzen commented Jun 13, 2018

@nigelmegitt These are excellent questions, thank you 👍

does it change the event handling model so that it no longer uses Time Marches On?

Yes. The sequencer is separate from the media element (which also means that you can use it for use cases where you don't have a media element). It takes direction from a timing object, which is basically just a thin wrapper around the system clock. The sequencer uses setTimeout() to schedule enter/exit events at the correct time.

What happens if the event handler for event n completes after event n+1 should begin execution?

Being run in the JS environment, sequencer timeouts may be subject to delay if there are many other activities going on (just like any app code). The sequencer guarantees the correct ordering, and will report how much it was delayed. If something like the sequencer were implemented natively by browsers, this situation could be improved further, I suppose. The sequencer itself is light-weight, and you may use multiple sequencers for different data sources and/or different timing objects.

Does the timing object synchronise against the video or does it cause the video to be synchronised with it? In other words, in the case of drift, what moves to get back into alignment?

Excellent question! The model does not mandate one or the other. You may 1) continuously update the timing object from the currentTime, or 2) you may continuously monitor and adjust currentTime to match the timing object (e.g. using variable playbackrate).

Method 1) is fine if you only have one media element, you are doing sync only within one webpage, and you are ok with letting the media element be the master of whatever else you want to synchronize. In other scenarios you'll need method 2), for at least (N-1) synchronized things. We use method 1) only occasionally.

The timingsrc has a mediasync function for method 2) and a reversesync function for method 1) (...I think)

How does the interpolating clock deal with non-linear movements along the media timeline in the video, such as pause, fast forward and rewind?

The short answer: using mediasync or reversesync you don't have to think about that, it's all taken care of.

Some more details:
The mediasync library creates an interpolated clock internally as an approximation of currentTime. It can distinguish the natural increments and jitter of currentTime from hard changes by listening to events (i.e. seeks, playback rate changes, etc.).
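To make the scheduling approach concrete, here is a generic sketch (not the actual timingsrc sequencer) of firing enter/exit callbacks with setTimeout against an interpolated clock; clockNow, the cue shape and the callbacks are assumptions:

```js
// Generic sketch of setTimeout-based sequencing. clockNow() is assumed to
// return an interpolated media position in seconds (see earlier sketch).
function scheduleCue(cue, clockNow, onEnter, onExit) {
  const msUntil = (t) => Math.max(0, (t - clockNow()) * 1000);

  if (clockNow() < cue.start) {
    setTimeout(() => onEnter(cue), msUntil(cue.start));
  }
  if (clockNow() < cue.end) {
    setTimeout(() => onExit(cue), msUntil(cue.end));
  }
  // A real sequencer would also cancel and reschedule these timeouts on
  // seek, pause and playbackRate changes, and report how late it fired.
}

// scheduleCue({ start: 12.0, end: 14.5, text: 'Hello' }, now, showCue, hideCue);
```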

@ingararntzen commented Jun 13, 2018

@chrisn

Presumably, frame accurate time reporting would help with synchronised media playback across multiple devices, particularly where different browser engines are involved, each with a different pipeline delay. But, you say you're already achieving sub-frame rate sync in your library, based on currentTime, so maybe not?

So, while the results are pretty good, there is no way to ensure that they are always that good (or that they will stay this good), unless these issues are put on the agenda through standardization work.

There are a number of ways to improve/simplify sync.

  • as you say, exposing accurate information on downstream delays, frame count, media offset is always a good thing.
  • currentTime values are also not timestamped, which means that you don't really know when they were sampled internally.
  • The jitter of currentTime is terrible.
  • Good sync depends on an interpolated clock. I guess this would also make it easier to convert back and forth between media offset and frame numbers.
  • there are also possible improvements to seekTo and playbackRate that would help considerably
@nigelmegitt commented Jun 14, 2018

you don't have to think about that

@ingararntzen in this forum we certainly do want to think about the details of how the thing works so we can assure ourselves that eventual users genuinely do not have to think about them. Having been "bitten" by the impact of timeupdate and Time Marches On we need to get it right next time!

@nigelmegitt commented Jun 14, 2018

Having noted that Time Marches On can conformantly not be run frequently enough to meet subtitle and caption use cases, it does have a lot of other things going for it, like smooth handling of events that take too long to process.

In the spirit of making the smallest change possible to resolve it, here's an alternative proposal:

  • Change the minimum frequency to 50 times per second, instead of 4 times per second.

I would expect that to be enough to get frame accuracy at 25fps.

@ingararntzen commented Jun 14, 2018

@nigelmegitt - sure thing - I was more thinking of the end user here - not you guys :)

If you want me to go more into details that's ok too :)

@kevinmarks-b commented Jun 14, 2018

Assuming that framerates are uniform is going to go astray at some point, as mp4 can contain media with different rates.
The underlying structure has Movie time and Media time - the former is usually an arbitrary fraction, the latter a ratio specifically designed to represent the timescale of the actual samples, so for US-originated video this will be 1001/30000.

Walking through the media rates and getting frame times is going to give you glitches with longer files.

If you want to construct an API like this I'd suggest mirroring what QuickTime did - this had 2 parts: the movie export API, which would give you callbacks for each frame rendered in sequence, telling you the media and movie times.
Or the GetNextInterestingTime() API which you could call iteratively and it would do the work of walking the movie, track edits and media to get you the next frame or keyframe.

Mozilla did make seekToNextFrame, but that was deprecated:
https://developer.mozilla.org/en-US/docs/Web/API/HTMLMediaElement/seekToNextFrame

@mfoltzgoogle commented Jun 14, 2018

@Daiz For your purposes, is it more important to have a frame counter, or an accurate currentTime?
What do you believe currentTime should represent?

@Daiz commented Jun 14, 2018

@mfoltzgoogle That depends - what exactly do you mean by a frame counter? As in, a value that would tell me the absolute frame number of the currently displayed frame, like if I have a 40000 frame long video with a constant frame rate of 23.976 FPS, and when currentTime is about 00:12:34.567 (754.567s), this hypothetical frame counter would have a value of 18091? This would most certainly be useful for me.

To reiterate, for me the most important use case for frame accuracy right now would be to accurately snap subtitle cue changes to frame changes. A frame counter like described above would definitely work for this. Though since I personally work on premium VOD content where I'm in full control of the content pipeline, accurate currentTime (assuming that it means that with a constant frame rate / full frame rate information I would be able to reliably calculate the currently displayed frame number) would also work. But I think the kind of frame counter described above would be a better fit as more general purpose functionality.

@mfoltzgoogle commented Jun 14, 2018

We would need to consider skipped frames, buffering states, splicing MSE buffers, and variable FPS video to nail down the algorithm to advance the "frame counter", but let's go with that as a straw-man. Say, adding a .frameCounter read-only property to <video>.

When you observe the .frameCounter for a <video> element, say in requestAnimationFrame, which frame would that correspond to?
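To make the straw-man concrete, here is a minimal sketch of how such a hypothetical .frameCounter attribute might be polled; the property does not exist in any browser, and the open question above is exactly which frame the value would refer to:

```js
// Straw-man illustration only: .frameCounter is the hypothetical attribute
// proposed above, not an existing API.
function watchFrameCounter(video, onFrameChange) {
  let lastFrame = -1;
  const tick = () => {
    if (video.frameCounter !== lastFrame) { // hypothetical attribute
      lastFrame = video.frameCounter;
      onFrameChange(lastFrame); // e.g. update a subtitle overlay
    }
    requestAnimationFrame(tick);
  };
  tick();
}
```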

@palemieux commented Jun 15, 2018

@mfoltzgoogle Instead of a "frame counter", which is video-centric, I would consider adding a combination of timelineOffset and timelineRate, with timelineOffset being an integer and timelineRate a rational, i.e. two integers. The absolute offset (in seconds) is then given by timelineOffset divided by timelineRate. If timelineRate is set to the frame rate, then timelineOffset is equal to an offset in # of frames. This can be adapted to other kinds of essence that do not have "frames".
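A minimal sketch of how the proposed (hypothetical) timelineOffset/timelineRate pair could be interpreted by an application; the attribute names and values are illustrative only:

```js
// Sketch of the proposal above. timelineOffset and timelineRate are
// hypothetical attributes; the values here are placeholders.
const timelineRate = { numerator: 30000, denominator: 1001 }; // e.g. 29.97 fps
const timelineOffset = 1800; // integer count, here in frames

// Absolute position in seconds = timelineOffset / timelineRate.
const seconds = timelineOffset * timelineRate.denominator / timelineRate.numerator;

// If timelineRate equals the frame rate, timelineOffset is a frame number,
// and no floating-point rounding is involved until this final division.
```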

@Daiz commented Jun 15, 2018

When you observe the .frameCounter for a <video> element, say in requestAnimationFrame, which frame would that correspond to?

For frame accuracy purposes, it should obviously correspond to the currently displayed frame on the screen.

Also, something that I wanted to say is that I understand there's a lot of additional complexity to this subject under various playback scenarios, and that it's probably not possible to guarantee frame accuracy under all scenarios. However, I don't think that should stop us from pursuing frame accuracy where it would indeed be possible. Like if I have just a normal browser window in full control of video playback, playing video on a normal screen attached to my computer, even having frame accuracy just there alone would be a huge win in my books.

@nigelmegitt commented Jun 15, 2018

The underlying structure has Movie time and Media time - the former is usually an arbitrary fraction, the latter a ratio specifically designed to represent the timescale of the actual samples, so for US-originated video this will be 1001/30000.

@kevinmarks-b "media time" is also used elsewhere as a generic term for "the timeline related to the media", independently of the syntax used, i.e. it can be expressed as an arbitrary fraction or a number of frames etc, for example in TTML.

@tidoust (Contributor, Author) commented Jun 18, 2018

One comment on @nigelmegitt's #4 (comment) and @Snarkdoof's #4 (comment). There seems to be a slight confusion between the frequency at which the "time marches on" algorithm runs and the frequency at which that algorithm triggers timeupdate events.

The "time marches on" algorithm only triggers events when needed, and timeupdate events once in a while. Applications willing to act on cues within a particular text track should not rely on timeupdate events but rather on cuechange events of the TextTrack object (or on enter/exit events of individual cues), which are fired as needed whenever the "time marches on" algorithm runs.

The HTML spec requires the user agent to run the "time marches on" algorithm when the current playback position of a media element changes, and notes that this means that "these steps are run as often as possible". The spec also mandates that the "current playback position" be increased monotonically when the media element is playing. I'm not sure how to read that in terms of minimum/maximum frequency. Probably as giving full leeway to implementations. Running the algorithm at 50Hz seems doable though (and wouldn't trigger 50 events per second unless there are cues that need to switch to a different state). Implementations may optimize the algorithm as long as it produces the same visible behavior. In other words, they could use timeouts if that's more efficient than looping through cues each time.
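As a concrete illustration of the recommendation above, a minimal sketch using standard TextTrack APIs (VTTCue enter/exit and cuechange) instead of timeupdate; the cue times and payload are placeholders, and `video` is assumed to be an existing media element:

```js
// Sketch: react to cue transitions rather than polling timeupdate.
const track = video.addTextTrack('metadata');
track.mode = 'hidden';

const cue = new VTTCue(12.0, 14.5, 'payload');
cue.onenter = () => { /* show the associated content */ };
cue.onexit = () => { /* hide it again */ };
track.addCue(cue);

// Or listen on the track itself:
track.addEventListener('cuechange', () => {
  console.log('active cues:', track.activeCues.length);
});
```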

@tidoust (Contributor, Author) commented Jun 18, 2018

When you observe the .frameCounter for a <video> element, say in requestAnimationFrame, which frame would that correspond to?

For frame accuracy purposes, it should obviously correspond to the currently displayed frame on the screen.

@Daiz requestAnimationFrame typically runs at 50-60Hz, so once every 16-20ms, before the next repaint. You mentioned elsewhere that 15-30ms delays were noticeable for subtitles. Wouldn't it be more interesting to get the "frame that will be rendered on next repaint" instead, to avoid reacting up to 16ms late to a change of frame?

I'm not saying that it's easy to get from an implementation perspective, given the comment raised by @mfoltzgoogle #4 (comment). In particular, I suspect browser repaints are not necessarily synchronized with video repaints, but the problem seems to exist in any case.

@nigelmegitt commented Jun 18, 2018

Thanks @tidoust that is helpful. Last time I looked at this, a month or so back, I assured myself that time marches on itself could be run only every 250ms conformantly, but the spec text you pointed to suggests that timing constraint only applies to timeupdate events. Now I wonder if I misread it originally.

Nevertheless, time marches on frequency is dependent on some unspecified observation of the current playback position of the media element changing, which looks like it should be more often than 4Hz (every frame? every audio sample?).

In practice, I don't think browsers actually run time marches on whenever the current playback position advances by e.g. 1 frame or 1 audio sample. The real world behaviour seems to match the timing requirements for firing timeupdate events, at the less frequent end.

@Snarkdoof commented Jun 18, 2018

@tidoust It's a good point that the "internal" loop of Time Marches On does not trigger JS events every time, but increasing the loop speed of any loop (or doing more work in each pass) will use more resources. As I see it there are two main ways of timing things that are relevant to media on the web:
1: Put the "sequencer" logic (what's being triggered when) inside the browser and trigger events, or
2: Put the "sequencer" logic in JS and trigger events

If 1) is chosen, a lot more complexity is moved to a very tight loop that's frankly busy with more important stuff. It is also less flexible as arbitrary code cannot be run in this way (nor would we want it to!). 2) depends solely on exporting a timestamp with the currentTime (and preferably other media events too), which would allow a JS Timing Object to accurately export the internal clock of the media. As such, a highly flexible solution can be made using fairly simple tools, like the open timingsrc implementation. Why would we not want to choose a solution that is easier, more flexible and if anything, saves CPU cycles?

cuechange also has a lot of other annoying issues, like not triggering when a skip event occurs (e.g. jumping "mid" subtitle), making it necessary to copy and paste several lines of code to check the active cues in order to get the expected behaviour.

@nigelmegitt commented Jun 18, 2018

If 1) is chosen, a lot more complexity is moved to a very tight loop that's frankly busy with more important stuff.

@Snarkdoof is it really busy with more important stuff? Really?

2: Put the "sequencer" logic in JS and trigger events

Browsers only give a single thread for event handling and JS, right? So adding more code to run in that thread doesn't really help address contention issues.

cuechange also has a lot of other annoying issues, like not triggering when a skip event occurs

The spec is explicit that it is supposed to trigger in this circumstance. Is this a spec vs implementation-in-the-real-world issue?

I have the sense that we haven't got good data about how busy current CPUs are handling events during media playback in a browser, with subtitles alongside. The strongest requirements statement we can make is that we do want to achieve adequate synchronisation (whatever "adequate" is defined as) with minimal additional resource usage.

@Daiz commented Jun 18, 2018

@tidoust

Wouldn't it be more interesting to get the "frame that will be rendered on next repaint" instead, to avoid reacting up to 16ms late to a change of frame?

Good point. The key thing is indeed to be able to act essentially in sync with the frame changes, so that you can match external actions to frame changes in the video.

@jpiesing commented Jun 18, 2018

Wouldn't it be more interesting to get the "frame that will be rendered on next repaint" instead, to avoid reacting up to 16ms late to a change of frame?

Good point. The key thing is indeed to be able to act essentially in sync with the frame changes, so that you can match external actions to frame changes in the video.

I agree with that aim, but then you need to be very careful about definitions, as there may be several frame-times' worth of delay between where graphics and video are composited and what the user is actually seeing. I suspect both are needed!

@nigelmegitt commented Jun 18, 2018

The key thing is indeed to be able to act essentially in sync with the frame changes, so that you can match external actions to frame changes in the video.

@Daiz as I just pointed out on the M&E call, this is only known to be true at the 25-30fps sort of rate, designed to be adequately free of flicker for video. It's unknown at high frame rates, and entirely inadequate at low frame rates, where synchronisation with audio is more important.

We should avoid generalising based on assumptions that the 25-30fps rate will continue to be prevalent, and gather data where we don't yet have it. We also need a model that works for other kinds of data than subtitles and captions, since they may have more or less stringent synchronisation requirements.

@tidoust (Contributor, Author) commented Jun 18, 2018

@Snarkdoof Like @nigelmegitt, I don't necessarily follow you on the performance penalties. Regardless, what I'm getting out of this discussion on subtitles is that there are possible different ways to improve the situation (they are not necessarily exclusive).

One possible way would be to have the user agent expose a frame number, or a rational number. This seems simple in theory, but apparently hard to implement. Good thing is that it would probably make it easy to act on frame boundaries, but these boundaries might be slightly artificial (because the user agent will interpolate these values in some cases).

Another way would be to make sure that an application can relate currentTime to the wall clock, possibly completed with some indication of the downstream latency. This is precisely what was done in the Web Audio API (see the definition of the AudioContext interface and notably the getOutputTimestamp() method and the outputLatency property). It seems easier to implement (it may be hard to compute the output latency, but adding a timestamp whenever currentTime changes seems easy). Now an app will still have some work to do to detect frame boundaries, but at least we don't ask the user agent to report possibly slightly incorrect values.
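For reference, a minimal sketch of the Web Audio pattern referred to above; getOutputTimestamp() and outputLatency are real AudioContext members, though outputLatency is not implemented everywhere:

```js
// The Web Audio analogy: relate context time to the wall clock.
const ctx = new AudioContext();
const { contextTime, performanceTime } = ctx.getOutputTimestamp();
// contextTime: position on the audio timeline that was being output at the
// wall-clock instant performanceTime (a performance.now() value).
const latency = ctx.outputLatency || 0; // seconds, where available

// A comparable pair on media elements (currentTime plus the wall-clock
// instant it was sampled, plus a latency estimate) would let applications
// relate media time to what is actually on screen or audible.
```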

I note that this thread started with Non-Linear Editors. If someone can elaborate on scenarios there and why frame numbers are needed there, that would be great! Supposing they are, does the application need to know the exact frame being rendered during media playback, or is it good enough if that number is only exact when the media is paused/seeked?

@Snarkdoof commented Jun 18, 2018

@nigelmegitt @tidoust - I guess I just never understood the whole time marches on algorithm, to be honest; it seems like a very strange way to wait for a timeout to happen, in particular when the time to wait can very reliably be calculated well in advance. The added benefit of doing this properly in JS is that the flexibility is excellent - there is no looping anywhere, there is an event after a setTimeout, re-calculated when some other event is triggered (skip, pause, play etc). We use it for all kinds of things - showing subtitles, switching between sources, altering CSS, preloading images at a fixed time, etc. Preloading is trivial if you give a sequencer a time-shifted timing object. Say you need up to 9 seconds to prepare an image - time shift it to 10 seconds more than the playback clock and do nothing else!

I might of course be absolutely in the dark on Time Marches On and text and data cues (I did test them, and found them horrible a couple of years ago). But the only thing I crave is the timestamp on the event - it will solve our every need (almost) and at barely any cost. :)

@Daiz commented Jun 18, 2018

@nigelmegitt As I also mentioned earlier, yes, I recognize that there are different things that are important too, but for the here and now (and I don't expect this to change anytime soon), I want the ability to act on exact frame boundaries so that I can get as close to replicating the desktop high quality media playback experience on the web as possible, and having subtitles align on frame boundaries for scene changes in order to avoid scene bleeding is a basic building block of that.

I'm not too concerned with the exact details of how we get there, so that's open for discussion and what we're here for, but the important thing is that we do get there eventually in a nice and performant fashion (ie. one shouldn't have to compile a video decoder with emscripten to do it etc).

@ingararntzen commented Jun 18, 2018

In response to @Snarkdoof's post about the two approaches to synchronizing cue events and @nigelmegitt's response:

The strongest requirements statement we can make is that we do want to achieve adequate synchronisation (whatever "adequate" is defined as) with minimal additional resource usage.

I don't have any input on the question of resource consumption, but here is a point concerning maximizing precision:

It is an important principle to put the synchronization as close as possible to what is being synchronized. Another way to put it is to say that the final step matters.

In approach 1, with sequencing logic internally in the media element, the last step is transport of the cue events across threads to JS.

In approach 2, with sequencing logic in JS, the final step is the firing of a timeout in JS. This seems precise down to 1 or 2 ms. Additionally the correctness of the timeout calculation depends on the correctness by which currentTime can be calculated in JS, which is also very precise (and could easily be improved).

I don't know the relevant details of approach 1). I'm worried that the latency of the thread switch might be unknown or variable, and perhaps different across architectures. If so, this would detract from precision, but I don't know how much. Does anyone know?

Also, in my understanding a busy event loop in JS affects both approaches similarly.

@nigelmegitt commented Jun 19, 2018

I want the ability to act on exact frame boundaries so that I can get as close to replicating the desktop high quality media playback experience on the web

@Daiz OK, within the constraints of your use case, I share the requirement. Outside of those constraints, it gets more complicated. Seems from the thread as though that's something we can both agree to.

@nigelmegitt commented Jun 19, 2018

There's been some speculation here about thread switching and the impact that may have, and if indeed there are multiple threads executing the script and therefore processing the event queue. It's always been my understanding that the script is only executed in a single thread. Can anyone clarify this point, perhaps a browser implementer?

@boushley commented Jun 20, 2018

Throwing my hat in the ring here with a couple alternative use cases. As background my company manages a large amount of police body camera video. We support redaction of video via the browser, as well as playback of evidence via the browser.

For the redaction and evidence playback use cases our customers want the ability to step through a video frame-by-frame. If you assume a constant framerate and are able to determine that framerate out of band then you can get something that approximates frame-by-frame seek. However there are many scenarios (be it rounding of the currentTime value, or encoder delay that renders a frame a few ms late) that can result in a frame being skipped (which is a big worry for our customers). There are hacks around this (rendering frames on the server and shipping down a frame-by-frame view), but all the info we need is already in the browser; it would be great if we had the ability to progress through a video frame by frame.

For redaction we have a use case that is similar to the subtitles sync issue. When users are in the editing phase of redaction we do a preview of what will be redacted where we need JS controlled objects to be synced with the video as tightly as we can. In this use case it's slightly easier than subtitles because when playing back at normal speed (or 2x or 4x) redaction users are usually ok with some slight de-sync. If they see something concerning they usually pause the video and then investigate it frame-by-frame.

Some of the suggested solutions, like currentFrameTime, could be extended to enable the frame-by-frame use case.

@tidoust (Contributor, Author) commented Jun 21, 2018

@boushley Thanks, that is useful! From a user experience perspective, how would the frame-by-frame stepping work in your case, ideally?

  1. The user activates frame-by-frame stepping. Video playback is paused. The user controls which frame to render and when a new frame needs to be rendered (e.g. with a button or arrow keys). Under the hood, the page seeks to the right frame, and video playback is effectively paused during the whole time.
  2. The user activates frame-by-frame stepping. The video moves from one frame to the other in slow motion without user interaction. Under the hood, the page does that by setting playbackRate to some low value such as 0.1, and the user agent is responsible for playing back the video at that speed.

In both cases, it seems indeed hard to do frame by frame stepping without exposing the current frame/presentation time, and allowing the app to set it to some value to account for cases where the video uses variable framerate.

It seems harder to guarantee precision in 2. as seen in this thread [1], but perhaps that's doable when video is played back at low speed?

[1] #4 (comment)

@dholroyd commented Jun 21, 2018

I note that this thread started with Non-Linear Editors. If someone can elaborate on scenarios there and why frame numbers are needed there, that would be great! Supposing they are, does the application need to know the exact frame being rendered during media playback, or is it good enough if that number is only exact when the media is paused/seeked?

We also perform media manipulation server-side on the basis of users choosing points in the media timeline in a browser-based GUI. Knowing exactly what the user is seeing when the media is paused is critical.

Challenges that we've found with current in-browser capabilities include,

  • Allowing the user to reliably review the points in time previously selected
    • Expected behaviour - browser will seek to a previously selected time-point and the user will see the same content as when they made their selection
    • Actual behaviour - in some cases, in some browsers, the frame the users sees may be off-by-one
    • Another way of describing the above is just to observe that, with playback paused, sometimes executing video.currentTime = video.currentTime in the js console will change the displayed video frame!
  • Matching the results of server-side processing with what the user requested in the browser
    • Expected behaviour - the point on the media timeline 'chosen' by the user is reflected by back-end processing
    • Actual behaviour - it seems challenging to relate a currentTime value from the browser to a point on the media timeline within server-side components
    • To make the above more concrete, if you wanted to run ffmpeg on the server-side and have it make a jpg of video frame that the user is currently looking at, how would you transform the value of currentTime (or any other proposed mechanism) into a select video filter. (Substitute ffmpeg with your preferred media framework as desired :)

We currently do frame-stepping by giving the js knowledge (out of band) of the frame-rate and seeking in frame-sized steps.

Users also want to be able to step backwards and forwards by many frames at a time (e.g. hold 'shift' to skip faster). That's currently implemented by just seeking in larger steps.
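For illustration, a minimal sketch of the frame-stepping approach described here (the framerate is an out-of-band assumption, and the mid-frame target is a workaround for the off-by-one behaviour mentioned earlier in the thread):

```js
// Sketch: step forwards/backwards by whole frames using a known framerate.
function stepFrames(video, fps, delta /* e.g. +1, -1, +10 */) {
  // Which frame interval does currentTime fall into? (epsilon absorbs float noise)
  const current = Math.floor(video.currentTime * fps + 1e-6);
  const target = Math.max(0, current + delta);
  video.currentTime = (target + 0.5) / fps; // seek to mid-frame, not a boundary
}

// stepFrames(video, 25, 1);    // next frame
// stepFrames(video, 25, -10);  // e.g. with shift held: jump back ten frames
```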

@boushley commented Jun 21, 2018

@tidoust our current experience is that the user has a skip ahead / skip back X seconds control. When they pause, that changes to a frame forward / frame back control. So we're definitely looking at use case 1. And if you're going for playback at something like 1/10 of normal speed (or 3-6 fps) you can pretty easily pull that off in JS if you have a way of progressing to the next frame or previous frame. This use case feels like it should be easily doable, although I think it'll be interesting to see if we can do it in a way that enables other use cases as well.

@dholroyd we've definitely seen some of these off-by-a-single-frame issues in our redaction setup. It would be great if there was a better way of identifying and linking between a web client and a backend process manipulating the video. I believe one of the keys for the editing-style use case is that while we want playback to be as accurate as possible, when paused it needs to be exactly accurate.

@mfoltzgoogle commented Jun 29, 2018

@Daiz I spoke with the TL of Chrome's video stack and they gave me a pointer to an implementation that you can play around with now.

First, behind --enable-experimental-canvas-feature, there are some additional attributes on HTMLVideoElement that contain metadata about frames uploaded as WebGL textures, including timestamp. [1]

The longer term plan is a WebGL extension to expose this data [2], and implementation has begun [3] but I am not sure of its status.

I agree there are use cases outside of WebGL upload for accurate frame timing data, and it should be possible to provide it on HTMLVideoElement's that are not uploaded as textures. However, if the canvas/WebGL solution works for you, then that makes a stronger case to expose it elsewhere.

Note that any solution may be racy with vsync depending on the implementation and it may be off by 16ms depending on where vsync happens in relation to the video frame rendering and the execution of rAF.

That's really all the help I can provide at this time. There are many other use cases and scenarios discussed here that I don't have time to address or investigate right now.

Thanks.

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=639174
[2] https://www.khronos.org/registry/webgl/extensions/proposals/WEBGL_video_texture/
[3] https://bugs.chromium.org/p/chromium/issues/detail?id=776222

@chrisn (Member) commented Jul 6, 2018

This is a great discussion, identifying a number of different use cases. I suggest that the next step is to consolidate this into an explainer document that describes each use case and identifies any spec gaps or current implementation limitations. A simple markdown document in this repo would be fine. Would anyone like to start such a document?

@KilroyHughes commented Jul 20, 2018

One detail for such a document (I'm not volunteering to write) is video frame reordering. Widely deployed video codecs such as AVC reorder and often offset the presentation time of pictures relative to their order and timing in the compressed bitstream. For instance, frames 1, 2, 3, 4 in the compressed stream might be displayed in order e.g. 2, 1, 4, 3 and presentation time can be delayed several frames. Frame rate changes are not unusual in adaptively streamed video. Operations such as seeking, editing, and splicing of the compressed stream, e.g. in an MSE buffer, do not happen at the presentation times often assumed. Audio, TTML, HTML, events, etc. must take presentation reordering and delay into account for frame accurate synchronization at some "composition point" in the media pipeline.

@nigelmegitt commented Jul 20, 2018

@KilroyHughes I've always made the assumption that all those events are related to the post-decode (and therefore post-reordering) output. It would make no sense to address out of order frame counts from the compressed bitstream in specifications whose event time markers relate to generic video streams and for which video codecs are out of scope.

Certainly in TTML, the assumption is that there is a well defined media timeline against which times in the media timebase can be related; taking DASH/MP4 as an example, the track fragment decode time as modified by the presentation time offset provides that definition.

I'd push back quite strongly against any requirement to traverse the architectural layers and impose content changes on a resource like a subtitle document, whether it is provided in-band or out-of-band, just to take into account a specific set of video encoding characteristics.

@nigelmegitt commented Nov 22, 2018

There's a Chromium bug about synchronisation accuracy of Text Track Cue onenter() and onexit() events in the context of WebVTT at https://bugs.chromium.org/p/chromium/issues/detail?id=576310 and another (originally from me, via @beaufortfrancois) asking for developer input on the feasibility of reducing the accuracy threshold in the spec from the current 250ms, at https://bugs.chromium.org/p/chromium/issues/detail?id=907459 .

@1c7 commented Apr 7, 2019

Because this thread is way too long, I didn't read it all.
Let me provide one more use case.

Subtitle editing software

I want to build subtitle editing software using Electron.js,
because Aegisub is not good enough (hotkeys, night mode, etc.).

The point is:

I want to build something simple that improves one part of the workflow,
not something that aims to replace Aegisub, because Aegisub has way too many features.

So

Frame-by-frame stepping and precise control down to the millisecond, like 00:00:12:333, are important.

Here is my design (it's a screenshot from Invision Studio, not an actual desktop app):

[screenshot omitted]

I designed many versions because I want this to be beautiful.

[screenshot omitted]

Here is the Electron app (an actually working app):

[screenshot omitted]

As you can see, the Electron app is still a work in progress, half-built.

And now I have found out there is no frame-by-frame stepping or precise millisecond control like 00:00:12:333,
which is very bad.

Conclusion

Use some hack like <canvas>, OR abandon Web tech (HTML/CSS/JS) and Electron.js
and just build a native app (Objective-C & Swift in Xcode).
