This repository has been archived by the owner on Feb 1, 2022. It is now read-only.

Proposal for FrameGrabber, FrameData, DepthMap #77

Merged: 21 commits, May 13, 2015

Conversation

anssiko
Member

@anssiko anssiko commented Apr 22, 2015

This PR contains an early proposal with contributions from @huningxin @ds-hwang @robman and @anssiko. The IDL is in place to allow people to review the API shape, but the prose around the new interfaces is still missing, so that we can more easily iterate on the API design based on wider feedback.

HTML preview: https://rawgit.com/anssiko/mediacapture-depth/framegrabber/index.html

New interfaces added in this PR:

  • FrameGrabber
  • FrameData
  • DepthData

The following interfaces and definitions were obsoleted by the new ones, and were removed:

  • CanvasImageSource typedef
  • ImageData interface (or to be precise, extensions to it)

Examples:

  • Added a new example 'Capture individual depth and RGB frames'; removed the obsoleted '2D Canvas Context based post-processing' example.

Editorial:

  • Removed the changes since the last publication from the Status of This Document section. Not needed for an Editor's Draft.

@anssiko
Member Author

anssiko commented Apr 23, 2015

This fixes #76, #73, #72, and #66.

@ds-hwang
Contributor

looks nice! lgtm

@@ -358,66 +316,149 @@
</section>
<section>
<h2>
<code>CanvasImageSource</code> typedef
<code>DepthData</code> interface
Contributor

Would you consider using the name DepthMap? Depth map (http://en.wikipedia.org/wiki/Depth_map) is a quite common term in the 3D camera space; e.g. the Google Tango Project (https://developers.google.com/depthmap-metadata/) and the Intel RealSense SDK both use it in their developer manuals.

Contributor

That's a fair point - linguistically I like the symmetry of ImageData/DepthData - but since there's an existing convention I'd be happy if we ended up with ImageData/DepthMap.

@huningxin
Contributor

Thanks, @anssiko! It looks nice. My comments are provided.

@anssiko anssiko changed the title Proposal for FrameGrabber, FrameData, DepthData Proposal for FrameGrabber, FrameData, DepthMap Apr 28, 2015
@anssiko
Member Author

anssiko commented Apr 28, 2015

@huningxin @robman @ds-hwang Thanks for the review and comments. I've addressed them in an update to this PR.

Some questions on the new proposed DepthMap attributes:

  • Why wouldn't we provide the data in mm units directly? With Uint16Array 0..65535 we could represent dexels in the range of 0..65.535 metres which is probably enough. No need for units attribute.
  • It seems a web developer would just prefer to indicate whether more precision is needed on the far end (think measuring distance to far away objects) or near (think face recognition), and then let the implementation take care of choosing whatever format under the hood that provides the best precision. The encoding pipeline (e.g. https://developers.google.com/depthmap-metadata/encoding) would be an implementation detail. The web developer would just get the data in mm units. This would allow us to drop format, near, and far since they are only needed by the implementation to implement the encoding pipeline.
  • What is the use case for measureType? The distance measured along the optical axis is longer, so I assume with that information a web developer can get a more precise depth measurement. What other data would the web developer need in order to make real use of measureType (is what we have defined in Settings enough)?

@robman
Contributor

robman commented Apr 28, 2015

Hi @anssiko

For your second point, I don't believe we should drop near/far/format. The near and far variables are required to reconstruct depth if you receive a normalised depth value, and format basically defines which algorithm to use.

For your last point, measureType defines whether the length is measured along the optical axis (depth down the X3 axis in the pinhole camera model) or as a ray (along the effective hypotenuse). Again, this tells us which algorithm to use.

@anssiko
Member Author

anssiko commented Apr 28, 2015

@robman Thanks. What are the key use cases for which using the raw depth map data would be preferred over the normalized data? And respectively, what are the key use cases that require normalized data? I'm interested in figuring out how much of the detail we can defer to the implementation, and how much we must expose to address the requirements. We should not expose metadata that has no supporting use case, and should instead minimize the API surface, even if exposing the metadata would be cheap.

Do the known implementations support both measureType values?

@huningxin
Contributor

Hi @anssiko

Thanks for the update.

Why wouldn't we provide the data in mm units directly? With Uint16Array 0..65535 we could represent dexels in the range of 0..655.35 metres which is probably enough. No need for units attribute.

It sounds good to me. Web platform can stick to one "native" unit and data format. Implementations figure out how to fit into it. There are two flavors:

  1. Uint16 in mm (BTW, the range would be 0..65.535 m, correct?)
  2. Float32 in m

Which one do you prefer?

This would allow us to drop format, near, and far since they are only needed by the implementation to implement the encoding pipeline.

To encode the depth values, e.g. to implement https://developers.google.com/depthmap-metadata/encoding on the web, web developers need to know near and far.

What is the use case for measureType?

Agree with @robman: it tells the web developer the depth measurement model, and some algorithms depend on it.

near, far, measureType can be in MST's Settings.

@anssiko
Member Author

anssiko commented Apr 30, 2015

@ds-hwang pointed out that Uint16 maps better to GPU when uploading this data to a WebGL texture. I'll let @ds-hwang expand.

@huningxin Yes, using Uint16 with mm units would represent the range of 0..65.535 m as you say.

If we'd stick with a single format (say range linear) and units (say mm) what use cases would we miss? At least that'd make the API simpler for the web developer. IOW, what are the key use cases that require the web developer to be able to convert the normalised depth values back?

@ds-hwang
Contributor

Uint16 has enough accuracy. Let me explain.

First of all, we need to support a RangeLinear mode and a RangeInverse mode like https://developers.google.com/depthmap-metadata/encoding

The user will calculate the real distance from the depth camera output.
Let's say the depth value is 10000 on [0, 65535], near is 1 m, and far is 5 m.

The RangeLinear formula is RealDistance = d * (far - near) + near,
so 10000/65535 * (5 m - 1 m) + 1 m is the real distance. 65535 steps is accurate enough.

The RangeInverse formula is RealDistance = (far * near) / (far - d * (far - near)),
with which the user can likewise calculate the real distance.

As you see, the depth value is just a scale; it has no unit. Only near and far have units.

RangeLinear is used for e.g. dance games, and RangeInverse is used for face recognition.

On the other hand, Chromium keeps the camera video in a texture, and in the same way it would keep the depth values in a texture. A 32-bit float texture is supported only on extremely modern GPUs; IMO a 32-bit float texture is overkill.
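The two decoding formulas above can be sketched in JavaScript as follows. The function names are illustrative, not part of the proposed API, and `d` is assumed to be the raw Uint16 sample already normalized to [0, 1]:

```javascript
// Illustrative decoders for the two encoding modes discussed above.
// d: normalized depth sample in [0, 1]; near, far: in meters.
function decodeRangeLinear(d, near, far) {
  return d * (far - near) + near;
}

function decodeRangeInverse(d, near, far) {
  return (far * near) / (far - d * (far - near));
}

// ds-hwang's example: sample 10000 on [0, 65535], near = 1 m, far = 5 m.
const d = 10000 / 65535;
console.log(decodeRangeLinear(d, 1, 5).toFixed(3)); // "1.610" (meters)
```

Note that with 65,535 steps across a 4 m range, each linear step is about 0.06 mm, which supports the accuracy argument above.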

@huningxin
Contributor

IMO, with Uint16 with mm units (or Float32 with m units), we won't need to support an encoding format, say RangeLinear or RangeInverse. The Uint16 depth value represents the real distance, e.g. 1 means 1 mm and 65535 means 65535 mm. It is straightforward for web developers: just use the value without any calculations. The only concern is whether the range and accuracy are too limited, say, will 65.535 m be too small, or mm units too coarse, for some use cases with new depth cameras in the future?

I propose to keep near and far, as we may need to support web apps encoding the depth values into other formats. Just like https://developers.google.com/depthmap-metadata/encoding, if a web developer wants to encode the depth metadata into XMP properties, they need the near and far values. They would write JavaScript code to implement either the RangeLinear or the RangeInverse encoding mechanism.
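The encoding direction described here could be sketched as follows; the function names are illustrative, not proposed API, and `z` is assumed to be a real distance in meters:

```javascript
// Illustrative encoders: map a real distance z (meters) back into a
// normalized [0, 1] value, per the Tango-style formulas referenced above.
function encodeRangeLinear(z, near, far) {
  return (z - near) / (far - near);
}

function encodeRangeInverse(z, near, far) {
  return (far * (z - near)) / (z * (far - near));
}

// z = 2 m with near = 1 m, far = 5 m encodes linearly to 0.25.
console.log(encodeRangeLinear(2, 1, 5)); // 0.25
```

These are exact inverses of the decoding formulas given earlier in the thread, which is why near and far would be needed by any web app doing the encoding itself.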

@ds-hwang
Contributor

ds-hwang commented May 4, 2015

I trust your judgement. However, I have to say there are drawbacks to a value with a unit.

Let's assume near is 0.1 m and far is 5 m. The value ranges (0, 100) and (5000, 65535) are then useless. That's why the Tango encoding format uses a scale value instead of a real value.

Some depth sensors in the future may be capable of detecting a >65 m range, and some applications will want more precision than 1 mm.

If the value has a unit, it does not look flexible.

What is the real output of RealSense and Kinect?

@anssiko
Member Author

anssiko commented May 4, 2015

I updated the PR. Please review and comment. The use case for measureType was unclear, so I dropped it for now. It is also unclear whether all implementations are able to support it.

I'd like to get resolution on the issue of whether we should just hardcode the format and unit, or allow flexibility. We should investigate the capabilities of the existing implementations and hardware to make that call.

If we keep format, I'd guess it'd make sense to allow a web developer to configure the format, right? A hypothetical API:

partial dictionary MediaStreamConstraints {
    // ...
    (DOMString or MediaTrackConstraints) depthFormat = "linear";
};
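A hypothetical usage sketch for the constraint above; the member name and default are part of the open design question, not settled API:

```javascript
// Hypothetical: a constraints object using the sketched depthFormat
// member, with "linear" as the suggested default from the IDL above.
const constraints = {
  video: true,
  depthFormat: "linear"
};

// In a page, this object would then be passed to
// navigator.mediaDevices.getUserMedia(constraints).
```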

@huningxin
Contributor

What is the real output of RealSense and Kinect?

RealSense SDK:

Format: Description
PIXEL_FORMAT_DEPTH: The depth map data in 16-bit unsigned integers. The values indicate the distance from an object to the camera's XY plane, or the Cartesian depth. The value precision is in millimeters.
PIXEL_FORMAT_DEPTH_RAW: The depth map data in 16-bit unsigned integers. The value precision is device specific. The application can get the device precision via the QueryDepthUnit function.
PIXEL_FORMAT_DEPTH_F32: The depth map data in 32-bit floating point. The value precision is in millimeters.

Kinect SDK:
https://msdn.microsoft.com/en-us/library/microsoft.kinect.kinect.idepthframe.aspx
The data for this frame is stored as 16-bit unsigned integers, where each value represents the distance in millimeters. The maximum depth distance is 8 meters, although reliability starts to degrade at around 4.5 meters. Developers can use the depth frame to build custom tracking algorithms in cases where the IBodyFrame isn’t enough.

Project Tango:
https://developers.google.com/project-tango/overview/depth-perception#point_clouds
The Project Tango APIs provide a function to get depth data in the form of a point cloud. This format gives (x, y, z) coordinates for as many points in the scene as are possible to calculate. Each dimension is a floating point value recording the position of each point in meters in the coordinate frame of the depth-sensing camera.

Some depth sensors in the future may be capable of detecting a >65 m range, and some applications will want more precision than 1 mm.

This is why Float32 with meter units seems promising.

@huningxin
Contributor

@anssiko
Member Author

anssiko commented May 6, 2015

@huningxin Thanks for sharing information on implementations. What is your guesstimate re the performance implications (also memory) of using Float32Array over Uint16Array? I think we should expect frame rates of >=30 Hz. I guess some benchmark data would help to make an informed decision. @ds-hwang had some concerns, but we don't have benchmark data at hand now.

If there are performance concerns, one approach worth considering might be to go with the lowest common denominator (e.g. Uint16Array, mm units) first, while allowing future extensions. For example, a new MediaTrackConstraints member could be used to indicate that higher precision is preferred, and the type of the data could be updated to (Uint16Array or Float32Array) while keeping the API backwards compatible. Not optimal considering interoperability, but I think that might be a reasonable tradeoff to make.

Re units, mm sounds like the best choice regardless of the type.
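The backwards-compatible extension path described here could look like this on the consuming side. This is a sketch under the assumption that the data type is later widened to (Uint16Array or Float32Array); nothing here is settled API:

```javascript
// Sketch: read a depth sample in meters regardless of whether the data
// arrives as Uint16Array (mm, the lowest common denominator) or, in a
// hypothetical future extension, Float32Array (m).
function depthAtMeters(data, i) {
  return data instanceof Float32Array
    ? data[i]           // assumed future type: meters directly
    : data[i] / 1000;   // Uint16Array: millimeters
}

console.log(depthAtMeters(new Uint16Array([1500]), 0)); // 1.5
```

Code written against the Uint16Array case keeps working unchanged, which is the backwards-compatibility property the comment above argues for.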

@huningxin
Contributor

@anssiko, thanks for the comments. I agree with you that Uint16Array seems to be closer to the hardware.

Because:
In the RealSense SDK, the depth map data is in 16-bit unsigned integers in the PIXEL_FORMAT_DEPTH_RAW format; the unit is device defined.

The Kinect SDK uses 16-bit unsigned integers with mm units.

According to Tango Depth Camera implementation in Chromium (https://code.google.com/p/chromium/codesearch#chromium/src/media/base/android/java/src/org/chromium/media/VideoCaptureTango.java&q=tango&sq=package:chromium&l=141),

    // Depth is composed of 16b samples in which only 12b are
    // used.

It is also uint16.

So Uint16Array with mm units looks good to me.

For the units, I suggest we add long units into MediaTrackConstraints.
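A minimal sketch of what consuming Uint16Array depth data in mm units would look like for a web developer. The variable layout here is an assumption based on this discussion, not final API:

```javascript
// Assumed: `data` is the DepthMap's Uint16Array, one sample per dexel,
// in millimeters, as proposed in this thread; 0 means no reading.
function nearestObjectMeters(data) {
  let min = 65535;
  for (let i = 0; i < data.length; i++) {
    if (data[i] > 0 && data[i] < min) min = data[i];
  }
  return min / 1000; // mm -> m
}

const sample = new Uint16Array([0, 1200, 3400, 850]);
console.log(nearestObjectMeters(sample)); // 0.85
```

No per-sample decoding with near/far is needed in this model, which is the "straightforward to web developers" property argued for earlier in the thread.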

@@ -453,26 +557,63 @@
<dd>
-
</dd>
<dt>
DOMString format = null
Contributor

I think we don't need format anymore.

Member Author

Dropped.

@anssiko
Member Author

anssiko commented May 12, 2015

@huningxin Thanks again for your suggestions. I updated the spec to address your comments, and also did some further refactoring. I'd like to merge this after your final review.

</dl>
<dl id="enum-basic" class="idl" title="enum DepthMapUnit">
<dt>
mm
Contributor

Will people confuse 'mm' between millimeters and micrometers?

Member Author

Millimeter (the American spelling of millimetre) has an SI unit symbol 'mm' so I think that's the best we can do. If we'd use the full name (millimeter) people would probably typo it since the American and International spellings differ subtly.

Contributor

It sounds good to me. Thanks for the explanation!

@huningxin
Contributor

Hi @anssiko, thanks for your efforts and explanation. It makes the spec pretty good.

LGTM with one open issue.

@anssiko
Member Author

anssiko commented May 13, 2015

@huningxin @robman @ds-hwang I'll merge this PR now and craft a mail to group to get wider feedback. Thanks for your contributions and review!
