This repository has been archived by the owner on Feb 1, 2022. It is now read-only.

Proposal for FrameGrabber, FrameData, DepthMap #77

Merged: 21 commits, May 13, 2015

Conversation

anssiko
Member

@anssiko anssiko commented Apr 22, 2015

This PR contains an early proposal with contributions from @huningxin @ds-hwang @robman and @anssiko. The IDL is in place to allow people to review the API shape, but the prose around the new interfaces is still missing, so that we can more easily iterate on the API design based on wider feedback.

HTML preview: https://rawgit.com/anssiko/mediacapture-depth/framegrabber/index.html

New interfaces added in this PR:

  • FrameGrabber
  • FrameData
  • DepthData

The following interfaces and definitions were obsoleted by the new ones, and were removed:

  • CanvasImageSource typedef
  • ImageData interface (or to be precise, extensions to it)

Examples:

  • Added a new example 'Capture individual depth and RGB frames'; removed the obsoleted '2D Canvas Context based post-processing' example.

Editorial:

  • Removed the changes since the last publication from the Status of This Document section. Not needed for an Editor's Draft.

@anssiko
Member Author

anssiko commented Apr 23, 2015

This fixes #76, #73, #72, and #66.

@ds-hwang
Contributor

looks nice! lgtm

@@ -358,66 +316,149 @@
</section>
<section>
<h2>
<code>CanvasImageSource</code> typedef
<code>DepthData</code> interface
Contributor

Would you consider using the name DepthMap? Depth map (http://en.wikipedia.org/wiki/Depth_map) is a quite common term in the 3D camera space; e.g. the Google Tango Project (https://developers.google.com/depthmap-metadata/) and the Intel RealSense SDK both use it in their developer manuals.

Contributor

That's a fair point - linguistically I like the symmetry of ImageData/DepthData - but since there's an existing convention I'd be happy if we ended up with ImageData/DepthMap.

@huningxin
Contributor

Thanks, @anssiko! It looks nice. My comments are provided.

@anssiko anssiko changed the title Proposal for FrameGrabber, FrameData, DepthData Proposal for FrameGrabber, FrameData, DepthMap Apr 28, 2015
@anssiko
Member Author

anssiko commented Apr 28, 2015

@huningxin @robman @ds-hwang Thanks for the review and comments. I've addressed them in an update to this PR.

Some questions on the new proposed DepthMap attributes:

  • Why wouldn't we provide the data in mm units directly? With Uint16Array 0..65535 we could represent dexels in the range of 0..65.535 metres which is probably enough. No need for units attribute.
  • It seems a web developer would just prefer to indicate whether more precision is needed on the far end (think measuring distance to far away objects) or near (think face recognition), and then let the implementation take care of choosing whatever format under the hood that provides the best precision. The encoding pipeline (e.g. https://developers.google.com/depthmap-metadata/encoding) would be an implementation detail. The web developer would just get the data in mm units. This would allow us to drop format, near, and far since they are only needed by the implementation to implement the encoding pipeline.
  • What is the use case for measureType? The distance measured along the optical axis is longer, so I assume with that information a web developer can get a more precise depth measurement. What other data would the web developer need in order to make real use of measureType (is what we have defined in Settings enough)?

@robman
Contributor

robman commented Apr 28, 2015

Hi @anssiko

For your second point, I don't believe we should drop near/far/format. The near and far variables are required to reconstruct depth if you receive a normalised depth value, and format basically defines which algorithm to use.

For your last point, measureType defines whether the length is measured along the optical axis (depth down the X3 axis in the pinhole camera model) or as a ray (along the effective hypotenuse). Again, this tells us which algorithm to use.

@anssiko
Member Author

anssiko commented Apr 28, 2015

@robman Thanks. What are the key use cases for which using the raw depth map data would be preferred over the normalized data? And respectively, what are the key use cases that require normalized data? I'm interested in figuring out how much of the detail we can defer to the implementation, and how much we must expose to address the requirements. We should not expose metadata that has no supporting use case, and should instead minimize the API surface, even if exposing the metadata would be cheap.

Do the known implementations support both measureType values?

@huningxin
Contributor

Hi @anssiko

Thanks for the update.

Why wouldn't we provide the data in mm units directly? With Uint16Array 0..65535 we could represent dexels in the range of 0..655.35 metres which is probably enough. No need for units attribute.

It sounds good to me. Web platform can stick to one "native" unit and data format. Implementations figure out how to fit into it. There are two flavors:

  1. Uint16 in mm (BTW, the range would be 0..65.535 m, correct?)
  2. Float32 in m

Which one do you prefer?

This would allow us to drop format, near, and far since they are only needed by the implementation to implement the encoding pipeline.

To encode the depth values, e.g. to implement https://developers.google.com/depthmap-metadata/encoding on the web, web developers need to know near and far.

What is the use case for measureType?

Agree with @robman: it tells the web developer the depth measurement model, and some algorithms depend on it.

near, far, measureType can be in MST's Settings.

@anssiko
Member Author

anssiko commented Apr 30, 2015

@ds-hwang pointed out that Uint16 maps better to GPU when uploading this data to a WebGL texture. I'll let @ds-hwang expand.

@huningxin Yes, using Uint16 with mm units would represent the range of 0..65.535 m as you say.

If we'd stick with a single format (say range linear) and units (say mm) what use cases would we miss? At least that'd make the API simpler for the web developer. IOW, what are the key use cases that require the web developer to be able to convert the normalised depth values back?

@ds-hwang
Contributor

Uint16 has enough accuracy. Let me explain.

First of all, we need to support a RangeLinear mode and a RangeInverse mode like https://developers.google.com/depthmap-metadata/encoding

The user will calculate the real distance from the depth camera output.
Let's say the depth value is 10000 on [0, 65535], near is 1 m, and far is 5 m.

The RangeLinear formula is RealDistance = d * (far - near) + near,
so 10000/65535 * (5 m - 1 m) + 1 m is the real distance. 65535 steps is accurate enough.

The RangeInverse formula is RealDistance = (far * near) / (far - d * (far - near)),
with which the user can likewise calculate the real distance.

As you see, the depth value is just a scale; it has no unit. Only near and far have units.

RangeLinear is used for e.g. dance games, and RangeInverse is used for face recognition.

On the other hand, Chromium keeps the camera video in a texture, and in the same way it would keep the depth values in a texture. A 32-bit float texture is supported only on extremely modern GPUs; IMO a 32-bit float texture is overkill.
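The two decoding formulas above can be sketched in JavaScript as follows. The function names are illustrative, not part of the proposed API, and `d` is assumed to be the raw Uint16 sample already normalized to [0, 1]:

```javascript
// Illustrative decoders for the two encoding modes discussed above.
// d: normalized depth sample in [0, 1]; near, far: in meters.
function decodeRangeLinear(d, near, far) {
  return d * (far - near) + near;
}

function decodeRangeInverse(d, near, far) {
  return (far * near) / (far - d * (far - near));
}

// ds-hwang's example: sample 10000 on [0, 65535], near = 1 m, far = 5 m.
const d = 10000 / 65535;
console.log(decodeRangeLinear(d, 1, 5).toFixed(3)); // "1.610" (meters)
```

Note that with 65,535 steps across a 4 m range, each linear step is about 0.06 mm, which supports the accuracy argument above.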

@huningxin
Contributor

IMO, with Uint16 with mm units (or Float32 with m units), we won't need to support an encoding format, say RangeLinear or RangeInverse. The Uint16 depth value represents the real distance, e.g. 1 means 1 mm and 65535 means 65535 mm. It is straightforward for web developers: just use the value without any calculations. The only concern is whether the range and accuracy are too limited, say, will 65.535 m be too small, or mm units too coarse, for some use cases with new depth cameras in the future?

I propose to keep near and far, as we may need to support web apps encoding the depth values into other formats. Just like https://developers.google.com/depthmap-metadata/encoding, if a web developer wants to encode the depth metadata into XMP properties, they need the near and far values. They would write JavaScript code to implement either the RangeLinear or the RangeInverse encoding mechanism.
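The encoding direction described here could be sketched as follows; the function names are illustrative, not proposed API, and `z` is assumed to be a real distance in meters:

```javascript
// Illustrative encoders: map a real distance z (meters) back into a
// normalized [0, 1] value, per the Tango-style formulas referenced above.
function encodeRangeLinear(z, near, far) {
  return (z - near) / (far - near);
}

function encodeRangeInverse(z, near, far) {
  return (far * (z - near)) / (z * (far - near));
}

// z = 2 m with near = 1 m, far = 5 m encodes linearly to 0.25.
console.log(encodeRangeLinear(2, 1, 5)); // 0.25
```

These are exact inverses of the decoding formulas given earlier in the thread, which is why near and far would be needed by any web app doing the encoding itself.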

@ds-hwang
Contributor

ds-hwang commented May 4, 2015

I trust your judgement. However, I have to say there are drawbacks to a value with a unit.

Let's assume near is 0.1 m and far is 5 m. The value ranges (0, 100) and (5000, 65535) are then useless. That's why the Tango encoding format uses a scale value instead of a real value.

Some depth sensors in the future may be capable of detecting a >65 m range, and some applications will want more precision than 1 mm.

If the value has a unit, it does not look flexible.

What is the real output of RealSense and Kinect?

@anssiko
Member Author

anssiko commented May 4, 2015

I updated the PR. Please review and comment. The use case for measureType was unclear, so I dropped it for now. It is also unclear whether all implementations are able to support it.

I'd like to get resolution on the issue of whether we should just hardcode the format and unit, or allow flexibility. We should investigate the capabilities of the existing implementations and hardware to make that call.

If we keep format, I'd guess it'd make sense to allow a web developer to configure the format, right? A hypothetical API:

partial dictionary MediaStreamConstraints {
    // ...
    (DOMString or MediaTrackConstraints) depthFormat = "linear";
};
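A hypothetical usage sketch for the constraint above; the member name and default are part of the open design question, not settled API:

```javascript
// Hypothetical: a constraints object using the sketched depthFormat
// member, with "linear" as the suggested default from the IDL above.
const constraints = {
  video: true,
  depthFormat: "linear"
};

// In a page, this object would then be passed to
// navigator.mediaDevices.getUserMedia(constraints).
```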

@huningxin
Contributor

What is the real output of RealSense and Kinect?

RealSense SDK:

Format: Description
PIXEL_FORMAT_DEPTH: The depth map data in 16-bit unsigned integers. The values indicate the distance from an object to the camera's XY plane, or the Cartesian depth. The value precision is in millimeters.
PIXEL_FORMAT_DEPTH_RAW: The depth map data in 16-bit unsigned integers. The value precision is device specific. The application can get the device precision via the QueryDepthUnit function.
PIXEL_FORMAT_DEPTH_F32: The depth map data in 32-bit floating point. The value precision is in millimeters.

Kinect SDK:
https://msdn.microsoft.com/en-us/library/microsoft.kinect.kinect.idepthframe.aspx
The data for this frame is stored as 16-bit unsigned integers, where each value represents the distance in millimeters. The maximum depth distance is 8 meters, although reliability starts to degrade at around 4.5 meters. Developers can use the depth frame to build custom tracking algorithms in cases where the IBodyFrame isn’t enough.

Project Tango:
https://developers.google.com/project-tango/overview/depth-perception#point_clouds
The Project Tango APIs provide a function to get depth data in the form of a point cloud. This format gives (x, y, z) coordinates for as many points in the scene as are possible to calculate. Each dimension is a floating point value recording the position of each point in meters in the coordinate frame of the depth-sensing camera.

Some depth sensors in the future may be capable of detecting a >65 m range, and some applications will want more precision than 1 mm.

This is why Float32 with meter units seems promising.

@huningxin
Contributor

@anssiko
Member Author

anssiko commented May 6, 2015

@huningxin Thanks for sharing information on implementations. What is your guesstimate re the performance implications (also memory) of using Float32Array over Uint16Array? I think we should expect frame rates of >=30 Hz. I guess some benchmark data would help to make an informed decision. @ds-hwang had some concerns, but we don't have benchmark data at hand now.

If there are performance concerns, one approach worth considering might be to go with the lowest common denominator (e.g. Uint16Array, mm units) first, while allowing future extensions. For example, a new MediaTrackConstraints member could be used to indicate that higher precision is preferred, and the type of the data could be updated to (Uint16Array or Float32Array) while keeping the API backwards compatible. Not optimal considering interoperability, but I think that might be a reasonable tradeoff to make.

Re units, mm sounds like the best choice regardless of the type.
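The backwards-compatible extension path described here could look like this on the consuming side. This is a sketch under the assumption that the data type is later widened to (Uint16Array or Float32Array); nothing here is settled API:

```javascript
// Sketch: read a depth sample in meters regardless of whether the data
// arrives as Uint16Array (mm, the lowest common denominator) or, in a
// hypothetical future extension, Float32Array (m).
function depthAtMeters(data, i) {
  return data instanceof Float32Array
    ? data[i]           // assumed future type: meters directly
    : data[i] / 1000;   // Uint16Array: millimeters
}

console.log(depthAtMeters(new Uint16Array([1500]), 0)); // 1.5
```

Code written against the Uint16Array case keeps working unchanged, which is the backwards-compatibility property the comment above argues for.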

@huningxin
Contributor

@anssiko, thanks for the comments. I agree with you that Uint16Array seems to be closer to the hardware.

Because:
In the RealSense SDK, the depth map data is in 16-bit unsigned integers in the PIXEL_FORMAT_DEPTH_RAW format; the unit is device defined.

The Kinect SDK uses 16-bit unsigned integers with mm units.

According to Tango Depth Camera implementation in Chromium (https://code.google.com/p/chromium/codesearch#chromium/src/media/base/android/java/src/org/chromium/media/VideoCaptureTango.java&q=tango&sq=package:chromium&l=141),

    // Depth is composed of 16b samples in which only 12b are
    // used.

It is also uint16.

So Uint16Array with mm units looks good to me.

For the units, I suggest we add long units into MediaTrackConstraints.
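A minimal sketch of what consuming Uint16Array depth data in mm units would look like for a web developer. The variable layout here is an assumption based on this discussion, not final API:

```javascript
// Assumed: `data` is the DepthMap's Uint16Array, one sample per dexel,
// in millimeters, as proposed in this thread; 0 means no reading.
function nearestObjectMeters(data) {
  let min = 65535;
  for (let i = 0; i < data.length; i++) {
    if (data[i] > 0 && data[i] < min) min = data[i];
  }
  return min / 1000; // mm -> m
}

const sample = new Uint16Array([0, 1200, 3400, 850]);
console.log(nearestObjectMeters(sample)); // 0.85
```

No per-sample decoding with near/far is needed in this model, which is the "straightforward to web developers" property argued for earlier in the thread.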

@@ -453,26 +557,63 @@
<dd>
-
</dd>
<dt>
DOMString format = null
Contributor

I think we don't need format anymore.

Member Author

Dropped.

@anssiko
Member Author

anssiko commented May 12, 2015

@huningxin Thanks again for your suggestions. I updated the spec to address your comments, and also did some further refactoring. I'd like to merge this after your final review.

</dl>
<dl id="enum-basic" class="idl" title="enum DepthMapUnit">
<dt>
mm
Contributor

Will people confuse 'mm' between millimeters and micrometers?

Member Author

Millimeter (the American spelling of millimetre) has an SI unit symbol 'mm' so I think that's the best we can do. If we'd use the full name (millimeter) people would probably typo it since the American and International spellings differ subtly.

Contributor

It sounds good to me. Thanks for the explanation!

@huningxin
Contributor

Hi @anssiko, thanks for your efforts and explanation. It makes the spec pretty good.

LGTM with one open issue.

@anssiko
Member Author

anssiko commented May 13, 2015

@huningxin @robman @ds-hwang I'll merge this PR now and craft a mail to group to get wider feedback. Thanks for your contributions and review!
