Sensor data in Mapillary Metropolis

Coordinate systems and conventions

We define four main categories of coordinate systems:

  • World coordinates: this is a fixed frame, anchored to a specific position in the 3D world which is shared across the whole dataset. X, Y world coordinates can be directly translated to geo-referenced EPSG:6498 coordinates by adding a specific offset vector stored in aerial.json.
  • Vehicle coordinates: this is a moving frame, anchored to the vehicle that captured the sensor data. The X, Y and Z axes point right, forward and up, respectively. Vehicle coordinates are stored in ego_pose.json.
  • Sensor coordinates: these are sensor-specific moving frames, anchored to the vehicle that captured the sensor data, but generally different from the vehicle coordinates. Transformations between sensor coordinates and vehicle coordinates are stored in calibrated_sensor.json.
    • For cameras, the sensor coordinate system follows the "OpenCV" convention, i.e. X, Y and Z point right, down and forward, respectively.
    • For point clouds, there's no general convention. The user should always interpret the stored point cloud files in the context of their specific sensor frame defined in calibrated_sensor.json.
  • Object coordinates: these are object-specific frames, used to represent 3D bounding box annotations, stored in sample_annotation.json. For objects with a well-defined orientation (e.g. cars), the X, Y and Z axes point right, forward and up, respectively. The Z axis always points up, even for objects that have some form of central symmetry (e.g. support poles). The bounding box corners have coordinates [±W/2, ±L/2, ±H/2] in the object's frame of reference, where W, L and H are the box width, length and height.
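
The corner convention for object coordinates can be illustrated with a short Python sketch. The function name `box_corners` and the example dimensions are hypothetical, chosen only for illustration; the sketch simply enumerates the eight sign combinations of [±W/2, ±L/2, ±H/2].

```python
from itertools import product


def box_corners(width, length, height):
    """Return the 8 corners of a box centred at the object-frame origin.

    X spans the width (right), Y the length (forward), Z the height (up).
    """
    return [
        (sx * width / 2, sy * length / 2, sz * height / 2)
        for sx, sy, sz in product((-1, 1), repeat=3)
    ]


# Hypothetical car-sized box: 2.0 wide, 4.5 long, 1.5 high.
corners = box_corners(2.0, 4.5, 1.5)
```

These corners are expressed in the object's own frame; mapping them to world coordinates requires the object-to-world transformation of the annotation.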

Coordinate transformations

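Transformations between coordinate systems are given as quaternion-vector pairs $(q, t)$, and always represent the transformation from the local frame to a more global frame, i.e. from object to world, from sensor to vehicle, from vehicle to world.

Given a roto-translation $(q, t)$ from frame A to frame B, we can transform points in A coordinates $p_A$ to points in B coordinates $p_B$ as:

$$p_B = R(q)\, p_A + t$$

where the rotation matrix $R(q)$ for a quaternion $q = (q_w, q_x, q_y, q_z)$ is given by:

$$R(q) = \left[ Q(q)\, \bar{Q}(q)^\top \right]_{3 \times 3}$$

$$Q(q) = \begin{bmatrix} q_w & -q_x & -q_y & -q_z \\ q_x & q_w & -q_z & q_y \\ q_y & q_z & q_w & -q_x \\ q_z & -q_y & q_x & q_w \end{bmatrix} \qquad \bar{Q}(q) = \begin{bmatrix} q_w & -q_x & -q_y & -q_z \\ q_x & q_w & q_z & -q_y \\ q_y & -q_z & q_w & q_x \\ q_z & q_y & -q_x & q_w \end{bmatrix}$$

where $[M]_{R \times C}$ is the bottom-right R x C sub-matrix of M.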
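
As a minimal sketch of the point transformation described above, the following Python code rotates a point by a quaternion, adds the translation, and chains two local-to-global transformations (sensor to vehicle, then vehicle to world). It assumes the quaternion is stored scalar-first as (w, x, y, z) and uses the standard expanded form of the rotation matrix for a unit quaternion. The function names and the example poses are hypothetical and are not read from the dataset files.

```python
import math


def rotation_matrix(q):
    """Return the 3x3 rotation matrix of a quaternion q = (w, x, y, z).

    Assumption: scalar-first component order. The quaternion is
    normalised first so that small rounding errors do not scale the points.
    """
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def transform_point(q, t, p):
    """Map point p from frame A to frame B: p_B = R(q) p_A + t."""
    R = rotation_matrix(q)
    return tuple(
        sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)
    )


# Hypothetical poses, for illustration only.
half = math.sqrt(0.5)
sensor_to_vehicle = ((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 2.0))     # no rotation
vehicle_to_world = ((half, 0.0, 0.0, half), (10.0, 20.0, 0.0))  # 90 degrees about Z

# Chain local-to-global transformations: sensor -> vehicle -> world.
p_sensor = (1.0, 0.0, 0.0)
p_vehicle = transform_point(*sensor_to_vehicle, p_sensor)
p_world = transform_point(*vehicle_to_world, p_vehicle)
```

Because every stored transformation goes from the local frame to the more global one, moving a point from sensor to world coordinates is a matter of applying the sensor-to-vehicle pair first and the vehicle-to-world pair second. Going the other way requires inverting each pair.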


The sensor.json table provides metadata about the sensors used in Mapillary Metropolis. In particular, different sensors are described by their modality and channel. In the following we provide additional information on each modality, and list all channels available for each.

Modality: camera

These sensors produce RGB images, stored as JPG in the sweeps folder. Each sample is guaranteed to have one equirectangular image, which should be regarded as the main source of truth for annotations. Possible channels are:

  • CAM_EQUIRECTANGULAR: the main equirectangular image.
  • CAM_LEFT, CAM_RIGHT, CAM_FRONT, CAM_BACK: optional perspective images, pointing in the four cardinal directions relative to the ego-vehicle. These are obtained by warping the equirectangular image.

Modality: depth

These sensors produce depth maps, stored as 16-bit PNGs in the samples folder. These are obtained by re-projecting the multi-view stereo reconstruction. Possible channels are:

  • DEPTH_LEFT, DEPTH_RIGHT, DEPTH_FRONT, DEPTH_BACK: depth maps corresponding to the perspective images defined in the previous section.

Modality: multi-view stereo

This sensor produces point clouds, stored as NPY files in the samples folder. Each sensor reading contains a slice of a large MVS reconstruction, centered around the corresponding ego-vehicle location. The only channel for this sensor is named MVS.

Modality: lidar

This sensor produces point clouds, stored as NPY files in the samples folder. Each sensor reading contains a slice of a large lidar scan, centered around the corresponding ego-vehicle location and re-aligned to match the corresponding multi-view stereo slice. Possible channels are:

  • LIDAR_MX2: ground-level lidar data, captured by the same vehicle that collected the equirectangular images.
  • LIDAR_AERIAL: aerial lidar data.