hoangthaiduong edited this page Sep 2, 2015 · 10 revisions

# IDX data format

IDX is a binary format for storing structured data such as 3D Cartesian grids. A primary use of the format is to store data produced by large-scale scientific simulations. Different types of analysis often require the same data to be organized in different data structures to be efficient, and this reorganization quickly becomes a bottleneck when the data size grows to hundreds of gigabytes or terabytes. IDX facilitates progressive streaming and multi-resolution analysis and visualization, while still performing reasonably well at other kinds of tasks.

## Indexing

A multidimensional grid consists of cells, each addressed by a multidimensional index (a vector of per-axis indices). To map these indices to a single 1D index, IDX uses HZ-order indexing, a hierarchical variant of the Lebesgue space-filling curve (also called Z-order indexing). The HZ-order data layout has three key properties:

  • It is hierarchical, but unlike other hierarchical schemes it has no explicit data "pyramid" and therefore no data replication. IDX simply rearranges the grid cells so that those belonging to the same hierarchy level are stored together in a contiguous chunk on disk. This means only a single, contiguous read is needed to query data at a given resolution level.
  • Within each hierarchy level, the HZ indexing scheme has strong locality: samples that are close in the spatial domain are also close in the HZ domain. It inherits this property from the Z-order indexing scheme.
  • Conversion from Z-order indices to HZ-order indices is very fast, requiring only a simple sequence of bit-string manipulations. This enables very efficient I/O.
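
The bit-string manipulation described above can be sketched in a few lines of Python. This is a minimal illustration, not IDX's actual implementation. It assumes a grid with the same power-of-two size along every axis, a plain round-robin interleaving of coordinate bits, and the commonly described HZ rule: set a sentinel bit just above the Z index, then shift out the trailing zeros and one more bit. The function names `morton_encode`, `z_to_hz`, and `level_range` are hypothetical.

```python
def morton_encode(coords, bits):
    """Interleave the low `bits` bits of each coordinate into a Z-order index."""
    ndim = len(coords)
    z = 0
    for i in range(bits):
        for d, c in enumerate(coords):
            # Bit i of axis d lands at position i * ndim + d of the Z index.
            z |= ((c >> i) & 1) << (i * ndim + d)
    return z


def z_to_hz(z, total_bits):
    """Convert a Z-order index of `total_bits` bits to an HZ-order index."""
    z |= 1 << total_bits      # sentinel bit marks the top of the bit string
    while (z & 1) == 0:       # drop the trailing zeros ...
        z >>= 1
    return z >> 1             # ... and then the lowest set bit


def level_range(level):
    """Half-open range [start, stop) of HZ indices at one hierarchy level."""
    if level == 0:
        return 0, 1
    return 1 << (level - 1), 1 << level


if __name__ == "__main__":
    bits = 2  # assumed 4 x 4 grid, 2 bits per axis
    for y in range(4):
        print([z_to_hz(morton_encode((x, y), bits), 2 * bits) for x in range(4)])
```

Under these assumptions, level 0 holds a single sample and each level l ≥ 1 occupies the HZ indices from 2^(l-1) up to, but not including, 2^l. Everything up to a given resolution is therefore one contiguous range starting at index 0.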

## Binary structure

On top of the HZ-order indexing, IDX organizes data in files, each containing multiple blocks of the same size. Data is always read and written at the block level. The number of files per domain and the size of a block are user-defined parameters. This allows IDX to:

  • Work with grids whose dimensions are not powers of 2.
  • Support region-of-interest data writing.
  • Be flexible in the way data is written and read. This is important since different machine architectures have different I/O characteristics.
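
To make the file and block organization concrete, here is a small sketch of how an HZ index could be mapped to a file, a block within that file, and an offset within that block. It is an illustration under stated assumptions, not the actual IDX on-disk layout: it assumes every block holds `2**bits_per_block` samples, every file holds `blocks_per_file` blocks, and it ignores any file headers. The function and parameter names are hypothetical.

```python
def locate_sample(hz, bits_per_block, blocks_per_file):
    """Return (file index, block index within file, sample offset within block)."""
    block = hz >> bits_per_block                  # global block number
    offset = hz & ((1 << bits_per_block) - 1)     # position inside that block
    return block // blocks_per_file, block % blocks_per_file, offset


if __name__ == "__main__":
    # Assumed parameters: 16 samples per block, 2 blocks per file.
    print(locate_sample(37, bits_per_block=4, blocks_per_file=2))
```

Because I/O happens at block granularity, a reader following this scheme would fetch the whole block identified by the first two values and then pick out the sample at the returned offset.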

# PIDX library

PIDX is an open-source C++ library for efficient parallel reading and writing of IDX files. It can be used as a drop-in replacement for other I/O libraries such as Parallel HDF5 or Parallel netCDF. PIDX has been demonstrated to outperform some of the more popular I/O libraries and to scale to supercomputer systems with hundreds of thousands of cores.

TODO: cite PIDX papers
