Change data type names and change endianness to be handled by a codec #155

Merged
merged 4 commits into from Nov 10, 2022
33 changes: 33 additions & 0 deletions docs/codecs.rst
@@ -178,6 +178,39 @@ header. The format of the encoded buffer is defined in [BLOSC]_. The
reference implementation is provided by the `c-blosc library
<https://github.com/Blosc/c-blosc>`_.

.. _endian-codec:

Endian
------

Codec URI:
https://purl.org/zarr/spec/codec/endian

Encodes array elements using the specified endianness.

Configuration parameters
~~~~~~~~~~~~~~~~~~~~~~~~

endian:
Required. A string equal to either ``"big"`` or ``"little"``.

Format and algorithm
~~~~~~~~~~~~~~~~~~~~

Each element of the array is encoded using the specified endian variant of its
default binary representation. Array elements are encoded in lexicographical
order. For example, with ``endian`` specified as ``"big"``, the ``int32`` data
type is encoded as a 4-byte big endian two's complement integer, and the
``complex128`` data type is encoded as two consecutive 8-byte big endian IEEE
754 binary64 values.
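
The encoding described above can be sketched in a few lines of Python. This is a minimal illustration and not the reference implementation: the helper name ``encode_elements`` is hypothetical, only two data types are covered, and the ordering of the real component before the imaginary component is an assumption, since the text only says "two consecutive" values.

```python
import struct


def encode_elements(values, dtype, endian):
    """Encode a flat sequence of elements with the given byte order.

    Hypothetical helper for illustration; covers only int32 and complex128.
    """
    if endian not in ("big", "little"):
        raise ValueError('endian must be "big" or "little"')
    prefix = ">" if endian == "big" else "<"
    if dtype == "int32":
        # 4-byte two's complement integer per element
        return b"".join(struct.pack(prefix + "i", v) for v in values)
    if dtype == "complex128":
        # Two consecutive 8-byte IEEE 754 binary64 values per element
        # (assumed order: real component, then imaginary component)
        return b"".join(struct.pack(prefix + "dd", v.real, v.imag) for v in values)
    raise NotImplementedError(dtype)


print(encode_elements([1, -2], "int32", "big").hex())
# 00000001fffffffe
```

Elements are passed in as an already flattened sequence, so the traversal order of a multi-dimensional array is outside the scope of this sketch.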

.. note::

Since the default binary representation of all data types is little endian,
specifying this codec with ``endian`` equal to ``"little"`` is equivalent to
omitting this codec: when the codec is omitted, the default (little endian)
binary representation of the data type is used.
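
The equivalence stated in the note can be checked with a small decoding sketch. The function ``decode_int16`` is a hypothetical name, and using ``endian=None`` to model "codec omitted" is an assumption made for this example only.

```python
import struct


def decode_int16(buf, endian=None):
    """Decode a buffer of int16 elements.

    endian=None models the codec being omitted, in which case the
    default (little endian) representation applies.
    """
    if endian not in (None, "big", "little"):
        raise ValueError('endian must be "big", "little", or None')
    order = ">" if endian == "big" else "<"
    count = len(buf) // 2
    return list(struct.unpack(f"{order}{count}h", buf))


buf = b"\x01\x00\xff\xff"
print(decode_int16(buf) == decode_int16(buf, "little"))
# True
```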

Deprecated codecs
=================
136 changes: 43 additions & 93 deletions docs/core/v3.0.rst
@@ -177,8 +177,6 @@ draft.
We propose to develop a draft implementation with extensions and
see how far we can go. A possible list of extensions to include:

- Boolean
- Complex
- Datetime
- Named dimensions
- Awkward arrays
@@ -316,8 +314,8 @@ conceptual model underpinning the Zarr format.
*Data type*

A data type defines the set of possible values that an array_ may
contain, and a binary representation (i.e., sequence of bytes) for
each possible value. For example, the little-endian 32-bit signed
contain, and a default binary representation (i.e., sequence of bytes) for
each possible value. For example, the 32-bit signed
integer data type defines binary representations for all integers
in the range −2,147,483,648 to 2,147,483,647. This specification
only defines a limited set of data types, but extensions
@@ -488,101 +486,48 @@ Core data types

* - Identifier
- Numerical type
- Size (no. bytes)
- Byte order
- Default binary representation
* - ``bool``
- Boolean, with False encoded as ``\\x00`` and True encoded as ``\\x01``
- 1
- None
* - ``i1``
- signed integer
- 1
- None
* - ``<i2``
- signed integer
- 2
- little-endian
* - ``<i4``
- signed integer
- 4
- little-endian
* - ``<i8``
- signed integer
- 8
- little-endian
* - ``>i2``
- signed integer
- 2
- big-endian
* - ``>i4``
- signed integer
- 4
- big-endian
* - ``>i8``
- signed integer
- 8
- big-endian
* - ``u1``
- unsigned integer
- 1
- None
* - ``<u2``
- unsigned integer
- 2
- little-endian
* - ``<u4``
- unsigned integer
- 4
- little-endian
* - ``<u8``
- unsigned integer
- 8
- little-endian
* - ``>u2``
- unsigned integer
- 2
- big-endian
* - ``>u4``
- unsigned integer
- 4
- big-endian
* - ``>u8``
- unsigned integer
- 8
- big-endian
* - ``<f2``
- half precision float: sign bit, 5 bits exponent, 10 bits mantissa
- 2
- little-endian
* - ``<f4``
- single precision float: sign bit, 8 bits exponent, 23 bits mantissa
- 4
- little-endian
* - ``<f8``
- double precision float: sign bit, 11 bits exponent, 52 bits mantissa
- 8
- little-endian
* - ``>f2``
- half precision float: sign bit, 5 bits exponent, 10 bits mantissa
- 2
- big-endian
* - ``>f4``
- single precision float: sign bit, 8 bits exponent, 23 bits mantissa
- 4
- big-endian
* - ``>f8``
- double precision float: sign bit, 11 bits exponent, 52 bits mantissa
- 8
- big-endian
- Boolean
- Single byte, with false encoded as ``\\x00`` and true encoded as ``\\x01``.
* - int8
- Integer in ``[-2^7, 2^7-1]``
- 1 byte two's complement
* - int16
- Integer in ``[-2^15, 2^15-1]``
- 2-byte little endian two's complement
* - int32
- Integer in ``[-2^31, 2^31-1]``
- 4-byte little endian two's complement
* - uint8
- Integer in ``[0, 2^8-1]``
- 1 byte
* - uint16
- Integer in ``[0, 2^16-1]``
- 2-byte little endian
* - uint32
- Integer in ``[0, 2^32-1]``
- 4-byte little endian
* - float16 (optionally supported)
- IEEE 754 half-precision floating point: sign bit, 5 bits exponent, 10 bits mantissa
- 2-byte little endian IEEE 754 binary16
* - float32
- IEEE 754 single-precision floating point: sign bit, 8 bits exponent, 23 bits mantissa
- 4-byte little endian IEEE 754 binary32
* - float64
- IEEE 754 double-precision floating point: sign bit, 11 bits exponent, 52 bits mantissa
- 8-byte little endian IEEE 754 binary64
* - complex64
- real and imaginary components are each IEEE 754 single-precision floating point
- 2 consecutive 4-byte little endian IEEE 754 binary32 values
* - complex128
- real and imaginary components are each IEEE 754 double-precision floating point
- 2 consecutive 8-byte little endian IEEE 754 binary64 values
* - ``r*`` (Optional)
- raw bits, use for extension type fallbacks
- variable, given by ``*``, is limited to be a multiple of 8.
- N/A


Floating point types correspond to basic binary interchange formats as
defined by IEEE 754-2008.
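
As a rough illustration of the default binary representations listed in the table, the sketch below maps the new-style identifiers to Python ``struct`` format strings. The mapping ``DEFAULT_FORMATS`` and the helper ``encode_default`` are hypothetical names, only the identifiers visible in the table are included, and the real-then-imaginary ordering for complex types is an assumption.

```python
import struct

# Assumed mapping from data type identifier to a little endian struct format.
DEFAULT_FORMATS = {
    "bool": "<?",
    "int8": "<b",
    "int16": "<h",
    "int32": "<i",
    "uint8": "<B",
    "uint16": "<H",
    "uint32": "<I",
    "float16": "<e",
    "float32": "<f",
    "float64": "<d",
    "complex64": "<ff",
    "complex128": "<dd",
}


def encode_default(dtype, value):
    """Encode one value using the default (little endian) representation."""
    fmt = DEFAULT_FORMATS[dtype]
    if dtype.startswith("complex"):
        # Assumed order: real component first, then imaginary component
        return struct.pack(fmt, value.real, value.imag)
    return struct.pack(fmt, value)


print(encode_default("uint16", 258).hex())
# 0201
```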

Additionally to these base types, an implementation should also handle the
raw/opaque pass-through type designated by the lower-case letter ``r`` followed
by the number of bits, multiple of 8. For example, ``r8``, ``r16``, and ``r24``
@@ -591,6 +536,11 @@ should be understood as fall-back types of respectively 1, 2, and 3 byte length.
Zarr v3 is limited to type sizes that are a multiple of 8 bits but may support
other type sizes in later versions of this specification.
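
The rule for raw types can be sketched as a small parser that turns an identifier such as ``r24`` into a size in bytes. The function name ``raw_type_size_bytes`` is hypothetical, and rejecting ``r0`` is an assumption of this example; the text only requires the bit count to be a multiple of 8.

```python
import re


def raw_type_size_bytes(identifier):
    """Return the byte length of a raw type identifier such as "r16"."""
    match = re.fullmatch(r"r([0-9]+)", identifier)
    if match is None:
        raise ValueError(f"not a raw type identifier: {identifier!r}")
    bits = int(match.group(1))
    # Zero-width types are rejected here as an assumption of this sketch.
    if bits == 0 or bits % 8 != 0:
        raise ValueError(f"bit count must be a positive multiple of 8: {bits}")
    return bits // 8


print([raw_type_size_bytes(name) for name in ("r8", "r16", "r24")])
# [1, 2, 3]
```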

.. note::

While the default binary representation is little endian, the :ref:`endian
codec<endian-codec>` may be specified to use big endian encoding instead.


.. note::
