Contributors: @thodan, @MartinSmeyer

Format of the BOP datasets

This file describes the common format of the BOP datasets [1].

Directory structure

The datasets have the following structure:

├─ camera[_TYPE].json
├─ dataset_info.json
├─ test_targets_bop19.json
├─ models[_MODELTYPE][_eval]
│  ├─ models_info.json
│  ├─ obj_OBJ_ID.ply
├─ train|val|test[_TYPE]
│  ├─ SCENE_ID|OBJ_ID
│  │  ├─ scene_camera.json
│  │  ├─ scene_gt.json
│  │  ├─ scene_gt_info.json
│  │  ├─ depth
│  │  ├─ mask
│  │  ├─ mask_visib
│  │  ├─ rgb|gray
  • models[_MODELTYPE] - 3D object models.

  • models[_MODELTYPE]_eval - "Uniformly" resampled and decimated 3D object models used for calculation of errors of object pose estimates.

  • train[_TRAINTYPE]/X (optional) - Training images of object X.

  • val[_VALTYPE]/Y (optional) - Validation images of scene Y.

  • test[_TESTTYPE]/Y - Test images of scene Y.

  • camera[_TYPE].json - Camera parameters (for sensor simulation only; per-image camera parameters are in the scene_camera.json files, see below).

  • dataset_info.json - Dataset-specific information.

  • test_targets_bop19.json - A list of test targets used for the evaluation in the BOP Challenge 2019/2020/2022. The same list was also used in the ECCV 2018 paper [1], with the exception of T-LESS, for which the list from test_targets_bop18.json was used.

MODELTYPE, TRAINTYPE, VALTYPE and TESTTYPE are optional and used if more data types are available (e.g. images from different sensors).

The images in train, val and test folders are organized into subfolders:

  • rgb/gray - Color/gray images.
  • depth - Depth images (saved as 16-bit unsigned short).
  • mask (optional) - Masks of object silhouettes.
  • mask_visib (optional) - Masks of the visible parts of object silhouettes.

The corresponding images across the subfolders have the same ID, e.g. rgb/000000.png and depth/000000.png are the color and the depth image of the same RGB-D frame. The naming convention for the masks is IMID_GTID.png, where IMID is an image ID and GTID is the index of the ground-truth annotation (stored in scene_gt.json).
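The mask naming convention can be sketched in Python as follows. The six-digit zero padding is inferred from the example file names above and is assumed to apply to GTID as well; the helper name `mask_paths` and the scene folder `test/000001` are hypothetical and used only for illustration.

```python
from pathlib import Path


def mask_paths(scene_dir, im_id, gt_id):
    """Return (mask, mask_visib) paths for one ground-truth annotation.

    Assumes six-digit zero-padded IDs, as in rgb/000000.png.
    """
    name = f"{im_id:06d}_{gt_id:06d}.png"
    scene_dir = Path(scene_dir)
    return scene_dir / "mask" / name, scene_dir / "mask_visib" / name


# Masks of the third annotation (index 2) in image 0 of a hypothetical scene folder.
mask, mask_visib = mask_paths("test/000001", im_id=0, gt_id=2)
print(mask.as_posix())        # test/000001/mask/000000_000002.png
print(mask_visib.as_posix())  # test/000001/mask_visib/000000_000002.png
```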

Training, validation and test images

If both validation and test images are available for a dataset, the ground-truth annotations are public only for the validation images. Performance scores for test images with private ground-truth annotations can be calculated in the BOP evaluation system.

Camera parameters

Each set of images is accompanied by the file scene_camera.json, which contains the following information for each image:

  • cam_K - 3x3 intrinsic camera matrix K (saved row-wise).
  • depth_scale - Multiply the depth image with this factor to get depth in mm.
  • cam_R_w2c (optional) - 3x3 rotation matrix R_w2c (saved row-wise).
  • cam_t_w2c (optional) - 3x1 translation vector t_w2c.
  • view_level (optional) - Viewpoint subdivision level, see below.
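As a minimal sketch of how these fields might be consumed, the Python snippet below reshapes the row-wise cam_K into a 3x3 matrix and applies depth_scale to a raw depth image. The JSON layout (a dictionary keyed by image ID), the numeric values, and the helper names `parse_camera` and `depth_to_mm` are assumptions made for illustration.

```python
import json

import numpy as np

# Hypothetical scene_camera.json content with a single image (ID "0").
scene_camera_json = """
{"0": {"cam_K": [1000.0, 0.0, 320.0, 0.0, 1000.0, 240.0, 0.0, 0.0, 1.0],
       "depth_scale": 0.1}}
"""


def parse_camera(entry):
    """Convert one per-image entry to NumPy arrays (row-wise storage)."""
    cam = {"K": np.array(entry["cam_K"], dtype=np.float64).reshape(3, 3),
           "depth_scale": float(entry["depth_scale"])}
    # The world-to-camera pose is optional, so it is read only if present.
    if "cam_R_w2c" in entry:
        cam["R_w2c"] = np.array(entry["cam_R_w2c"], dtype=np.float64).reshape(3, 3)
    if "cam_t_w2c" in entry:
        cam["t_w2c"] = np.array(entry["cam_t_w2c"], dtype=np.float64).reshape(3, 1)
    return cam


def depth_to_mm(depth_raw, depth_scale):
    """Scale a raw 16-bit depth image to millimetres (zeros stay zero)."""
    return depth_raw.astype(np.float64) * depth_scale


cams = {int(k): parse_camera(v) for k, v in json.loads(scene_camera_json).items()}
depth_raw = np.array([[0, 5000], [10000, 65535]], dtype=np.uint16)
depth_mm = depth_to_mm(depth_raw, cams[0]["depth_scale"])
print(cams[0]["K"])
print(depth_mm)
```

Converting to floating point before scaling avoids overflow and truncation of the 16-bit values.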

The matrix K may be different for each image. For example, the principal point is not constant for images in T-LESS as the images were obtained by cropping a region around the projection of the origin of the world coordinate system.

Note that the intrinsic camera parameters can also be found in the file camera.json in the root folder of a dataset. These parameters are meant only for simulating the sensor used when rendering training images.

P_w2i = K * [R_w2c, t_w2c] is the camera matrix which transforms a 3D point p_w = [x, y, z, 1]' in the world coordinate system to a 2D point p_i = [u, v, 1]' in the image coordinate system: s * p_i = P_w2i * p_w, where s is the projective scale factor.

Ground-truth annotations

The ground-truth object poses are provided in the scene_gt.json files, which contain the following information for each annotated object instance:

  • obj_id - Object ID.
  • cam_R_m2c - 3x3 rotation matrix R_m2c (saved row-wise).
  • cam_t_m2c - 3x1 translation vector t_m2c.

P_m2i = K * [R_m2c, t_m2c] is the camera matrix which transforms a 3D point p_m = [x, y, z, 1]' in the model coordinate system to a 2D point p_i = [u, v, 1]' in the image coordinate system: s * p_i = P_m2i * p_m, where s is the projective scale factor.
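The projection s * p_i = P_m2i * p_m can be sketched with NumPy as below. The annotation values, the intrinsic matrix and the function name `project` are made up for illustration; only the field names and the row-wise storage come from the description above.

```python
import numpy as np


def project(points_m, K, R_m2c, t_m2c):
    """Project Nx3 model-space points (mm) to Nx2 pixel coordinates.

    Implements s * p_i = K * [R_m2c, t_m2c] * p_m and divides by s.
    """
    P_m2i = K @ np.hstack([R_m2c, t_m2c.reshape(3, 1)])        # 3x4 camera matrix
    p_m = np.hstack([points_m, np.ones((len(points_m), 1))])   # Nx4 homogeneous points
    sp_i = (P_m2i @ p_m.T).T                                   # rows are s * [u, v, 1]
    return sp_i[:, :2] / sp_i[:, 2:3]


# Hypothetical annotation using the scene_gt.json field names.
gt = {"obj_id": 1,
      "cam_R_m2c": [1, 0, 0, 0, 1, 0, 0, 0, 1],
      "cam_t_m2c": [0.0, 0.0, 1000.0]}
K = np.array([[1000.0, 0.0, 320.0],
              [0.0, 1000.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.array(gt["cam_R_m2c"], dtype=np.float64).reshape(3, 3)  # stored row-wise
t = np.array(gt["cam_t_m2c"], dtype=np.float64)

pts = np.array([[0.0, 0.0, 0.0], [100.0, 50.0, 0.0]])
uv = project(pts, K, R, t)
print(uv)  # model origin lands on the principal point (320, 240)
```

The same function applies to world-space points if R_w2c and t_w2c are passed instead.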

Ground-truth bounding boxes and instance masks are also provided in COCO format in scene_gt_coco.json. The RLE format is used for segmentations. Detailed information about the COCO format can be found in the COCO documentation.

Meta information about the ground-truth poses

The following meta information about the ground-truth poses is provided in files scene_gt_info.json (calculated using scripts/, with delta = 5mm for ITODD, 15mm for other datasets, and 5mm for all photorealistic training images provided for the BOP Challenge 2020):

  • bbox_obj - 2D bounding box of the object silhouette given by (x, y, width, height), where (x, y) is the top-left corner of the bounding box.
  • bbox_visib - 2D bounding box of the visible part of the object silhouette.
  • px_count_all - Number of pixels in the object silhouette.
  • px_count_valid - Number of pixels in the object silhouette with a valid depth measurement (i.e. with a non-zero value in the depth image).
  • px_count_visib - Number of pixels in the visible part of the object silhouette.
  • visib_fract - The visible fraction of the object silhouette (= px_count_visib/px_count_all).
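To illustrate how these quantities relate to each other, the sketch below derives them from a pair of binary masks and a depth image. It is not the script referenced above: it takes the masks as given rather than computing visibility with the delta tolerance, and the helper names, the pixel-inclusive width/height, and the (-1, -1, -1, -1) box for an empty mask are assumptions.

```python
import numpy as np


def bbox_from_mask(mask):
    """Return (x, y, width, height) of the non-zero region; (x, y) is the top-left corner."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return (-1, -1, -1, -1)  # assumed convention for an empty mask
    x, y = int(xs.min()), int(ys.min())
    # Width and height count both boundary pixels (an assumption of this sketch).
    return (x, y, int(xs.max()) - x + 1, int(ys.max()) - y + 1)


def gt_info(mask, mask_visib, depth):
    """Compute the meta information for one annotated object instance."""
    px_all = int(np.count_nonzero(mask))
    px_valid = int(np.count_nonzero(np.logical_and(mask, depth > 0)))
    px_visib = int(np.count_nonzero(mask_visib))
    return {
        "bbox_obj": bbox_from_mask(mask),
        "bbox_visib": bbox_from_mask(mask_visib),
        "px_count_all": px_all,
        "px_count_valid": px_valid,
        "px_count_visib": px_visib,
        "visib_fract": px_visib / px_all if px_all > 0 else 0.0,
    }


# Toy 6x8 image: a 4x4 silhouette whose left half is visible.
mask = np.zeros((6, 8), dtype=np.uint8)
mask[1:5, 2:6] = 1
mask_visib = np.zeros_like(mask)
mask_visib[1:5, 2:4] = 1
depth = np.full((6, 8), 500, dtype=np.uint16)
depth[1, 2] = 0  # one pixel without a valid depth measurement

info = gt_info(mask, mask_visib, depth)
print(info)
```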

Acquisition of training images

Most of the datasets include training images which were obtained either by capturing real objects from various viewpoints or by rendering 3D object models (using scripts/

The viewpoints, from which the objects were rendered, were sampled from a view sphere as in [2] by recursively subdividing an icosahedron. The level of subdivision at which a viewpoint was added is saved in scene_camera.json as view_level (viewpoints corresponding to vertices of the icosahedron have view_level = 0, viewpoints obtained in the first subdivision step have view_level = 1, etc.). To reduce the number of viewpoints while preserving their "uniform" distribution over the sphere surface, one can consider only viewpoints with view_level <= n, where n is the highest considered level of subdivision.
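Selecting a coarser but still evenly spread subset of viewpoints then reduces to a filter on view_level. The snippet below is a small sketch; the dictionary layout (image ID mapped to the per-image camera entry) and the function name `select_views` are assumptions.

```python
# Hypothetical per-image entries with the fields of scene_camera.json, keyed by image ID.
scene_camera = {
    0: {"view_level": 0},
    1: {"view_level": 1},
    2: {"view_level": 2},
    3: {"view_level": 1},
    4: {},  # view_level is optional; images without it are skipped here
}


def select_views(scene_camera, max_level):
    """Return the sorted IDs of images whose viewpoint has view_level <= max_level."""
    return sorted(im_id for im_id, cam in scene_camera.items()
                  if "view_level" in cam and cam["view_level"] <= max_level)


print(select_views(scene_camera, 1))  # [0, 1, 3]
```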

For rendering, the radius of the view sphere was set to the distance of the closest occurrence of any annotated object instance over all test images. The distance was calculated from the camera center to the origin of the model coordinate system.
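Since t_m2c is the position of the model origin in the camera coordinate system, the distance described above is the Euclidean norm of t_m2c, and the radius is its minimum over all annotations. A minimal sketch, with made-up annotation values and a hypothetical function name:

```python
import numpy as np

# Hypothetical annotations collected from the scene_gt.json files of all test scenes.
annotations = [
    {"obj_id": 1, "cam_t_m2c": [0.0, 0.0, 800.0]},
    {"obj_id": 2, "cam_t_m2c": [300.0, 0.0, 400.0]},
    {"obj_id": 1, "cam_t_m2c": [-50.0, 20.0, 650.0]},
]


def view_sphere_radius(annotations):
    """Smallest camera-center-to-model-origin distance (in mm) over all annotations."""
    return float(min(np.linalg.norm(a["cam_t_m2c"]) for a in annotations))


print(view_sphere_radius(annotations))  # 500.0
```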

3D object models

The 3D object models are provided in PLY (ascii) format. All models include vertex normals. Most of the models also include vertex color or vertex texture coordinates with the texture saved as a separate image. The vertex normals were calculated using MeshLab as the angle-weighted sum of face normals incident to a vertex [3].
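For readers unfamiliar with angle weighting, the following NumPy sketch shows the idea: each face normal contributes to a vertex in proportion to the interior angle of the face at that vertex. It is an illustration of the technique only, not MeshLab's implementation, and the function name is hypothetical.

```python
import numpy as np


def angle_weighted_normals(vertices, faces):
    """Unit vertex normals as the angle-weighted sum of incident face normals."""
    normals = np.zeros_like(vertices, dtype=np.float64)
    for face in faces:
        v = vertices[face]                       # 3x3 array of triangle corners
        n = np.cross(v[1] - v[0], v[2] - v[0])   # face normal (unnormalized)
        length = np.linalg.norm(n)
        if length == 0.0:
            continue                             # skip degenerate triangles
        n = n / length
        for i in range(3):
            e1 = v[(i + 1) % 3] - v[i]
            e2 = v[(i + 2) % 3] - v[i]
            cos_a = e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2))
            angle = np.arccos(np.clip(cos_a, -1.0, 1.0))  # interior angle at corner i
            normals[face[i]] += angle * n
    lengths = np.linalg.norm(normals, axis=1, keepdims=True)
    return normals / np.where(lengths > 0, lengths, 1.0)


# Tetrahedron with a right-angled corner at the origin; faces are wound
# so that their normals point outwards.
vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
faces = np.array([[0, 2, 1], [0, 3, 2], [0, 1, 3], [1, 2, 3]])
normals = angle_weighted_normals(vertices, faces)
print(normals[0])  # points along -(1, 1, 1) / sqrt(3)
```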

Each folder with object models contains the file models_info.json, which includes the 3D bounding box and the diameter for each object model. The diameter is calculated as the largest distance between any pair of model vertices.
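The two quantities can be sketched with NumPy as below. The brute-force pairwise distance needs O(N^2) memory, so it is suitable only for small or decimated models; the dictionary keys and the function name `model_info` are illustrative assumptions and not necessarily the keys used in models_info.json.

```python
import itertools

import numpy as np


def model_info(vertices):
    """Axis-aligned 3D bounding box and diameter of an Nx3 vertex array (mm)."""
    mins = vertices.min(axis=0)
    size = vertices.max(axis=0) - mins
    # Distances between all pairs of vertices (N x N matrix).
    diffs = vertices[:, None, :] - vertices[None, :, :]
    diameter = float(np.sqrt((diffs ** 2).sum(axis=-1)).max())
    return {"min": mins, "size": size, "diameter": diameter}


# Toy model: the eight corners of a 30 x 40 x 120 mm box centered at the origin.
vertices = np.array(list(itertools.product([-15, 15], [-20, 20], [-60, 60])),
                    dtype=np.float64)
info = model_info(vertices)
print(info["min"], info["size"], info["diameter"])  # diameter is 130.0 (box diagonal)
```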

Coordinate systems

All coordinate systems (model, camera, world) are right-handed. In the model coordinate system, the Z axis points up (when the object is standing "naturally up-right") and the origin coincides with the center of the 3D bounding box of the object model. The camera coordinate system is as in OpenCV with the camera looking along the Z axis.


Units

  • Depth images: See files camera.json/scene_camera.json in individual datasets.
  • 3D object models: 1 mm
  • Translation vectors: 1 mm


[1] Hodan, Michel et al. "BOP: Benchmark for 6D Object Pose Estimation" ECCV'18.

[2] Hinterstoisser et al. "Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes" ACCV'12.

[3] Thürmer and Wüthrich "Computing vertex normals from polygonal facets" Journal of Graphics Tools 3.1 (1998).