From 8694e4934cffa1a6606e251ddc81ba12c825701b Mon Sep 17 00:00:00 2001
From: Jeremy Maitin-Shepard
Date: Wed, 10 Aug 2022 09:46:54 -0700
Subject: [PATCH 1/4] Change data type names and change endianness to be handled by a codec

---
 docs/codecs.rst    |  29 ++++++++++
 docs/core/v3.0.rst | 131 +++++++++++++--------------------------------
 2 files changed, 67 insertions(+), 93 deletions(-)

diff --git a/docs/codecs.rst b/docs/codecs.rst
index 204b9c28..1b5a1ad2 100644
--- a/docs/codecs.rst
+++ b/docs/codecs.rst
@@ -178,6 +178,35 @@ header. The format of the encoded buffer is defined in [BLOSC]_. The
 reference implementation is provided by the `c-blosc library `_.
 
+Endian
+------
+
+Codec URI:
+    https://purl.org/zarr/spec/codec/endian
+
+Encodes array elements using the specified endianness.
+
+Configuration parameters
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+endian:
+    Required. A string equal to either ``"big"`` or ``"little"``.
+
+Format and algorithm
+~~~~~~~~~~~~~~~~~~~~
+
+Each element of the array is encoded using the specified endian variant of its
+default binary representation. Array elements are encoded in lexicographical
+order. For example, with ``endian`` specified as ``big``, the ``int32`` data
+type is encoded as a 4-byte big endian two's complement integer, and the
+``complex128`` data type is encoded as two consecutive 8-byte big endian IEEE
+754 binary64 values.
+
+.. note::
+
+   Single the default binary representation of all data types is little endian,
+   specifying this codec with ``endian`` equal to ``"little"`` is equivalent to
+   omitting this codec.
 
 Deprecated codecs
 =================
diff --git a/docs/core/v3.0.rst b/docs/core/v3.0.rst
index 5ab27003..906e8770 100644
--- a/docs/core/v3.0.rst
+++ b/docs/core/v3.0.rst
@@ -177,8 +177,6 @@ draft. We propose to develop a draft implementation with extensions and see how
 far we can go.
 A possible list of extensions to include:
 
-  - Boolean
-  - Complex
   - Datetime
   - Named dimensions
   - Awkward arrays
@@ -316,8 +314,8 @@ conceptual model underpinning the Zarr format.
 *Data type*
 
     A data type defines the set of possible values that an array_ may
-    contain, and a binary representation (i.e., sequence of bytes) for
-    each possible value. For example, the little-endian 32-bit signed
+    contain, and a default binary representation (i.e., sequence of bytes) for
+    each possible value. For example, the 32-bit signed
     integer data type defines binary representations for all integers
     in the range −2,147,483,648 to 2,147,483,647. This specification
     only defines a limited set of data types, but extensions
@@ -488,101 +486,48 @@ Core data types
 
    * - Identifier
      - Numerical type
-     - Size (no. bytes)
-     - Byte order
+     - Default binary representation
    * - ``bool``
-     - Boolean, with False encoded as ``\\x00`` and True encoded as ``\\x01``
-     - 1
-     - None
-   * - ``i1``
-     - signed integer
-     - 1
-     - None
-   * - ``i2``
-     - signed integer
-     - 2
-     - big-endian
-   * - ``>i4``
-     - signed integer
-     - 4
-     - big-endian
-   * - ``>i8``
-     - signed integer
-     - 8
-     - big-endian
-   * - ``u1``
-     - unsigned integer
-     - 1
-     - None
-   * - ``u2``
-     - unsigned integer
-     - 2
-     - big-endian
-   * - ``>u4``
-     - unsigned integer
-     - 4
-     - big-endian
-   * - ``>u8``
-     - unsigned integer
-     - 8
-     - big-endian
-   * - ``f2``
-     - half precision float: sign bit, 5 bits exponent, 10 bits mantissa
-     - 2
-     - big-endian
-   * - ``>f4``
-     - single precision float: sign bit, 8 bits exponent, 23 bits mantissa
-     - 4
-     - big-endian
-   * - ``>f8``
-     - double precision float: sign bit, 11 bits exponent, 52 bits mantissa
-     - 8
-     - big-endian
+     - Boolean
+     - Single byte, with false encoded as ``\\x00`` and true encoded as ``\\x01``.
+   * - int8
+     - Integer in ``[-2^7, 2^7-1]``
+     - 1 byte two's complement
+   * - int16
+     - Integer in ``[-2^15, 2^15-1]``
+     - 2-byte little endian two's complement
+   * - int32
+     - Integer in ``[-2^31, 2^31-1]``
+     - 4-byte little endian two's complement
+   * - uint8
+     - Integer in ``[0, 2^8-1]``
+     - 1 byte
+   * - uint16
+     - Integer in ``[0, 2^16-1]``
+     - 2-byte little endian
+   * - uint32
+     - Integer in ``[0, 2^32-1]``
+     - 4-byte little endian
+   * - float16 (optionally supported)
+     - IEEE 754 half-precision floating point: sign bit, 5 bits exponent, 10 bits mantissa
+     - 2-byte little endian IEEE 754 binary16
+   * - float32
+     - IEEE 754 single-precision floating point: sign bit, 8 bits exponent, 23 bits mantissa
+     - 4-byte little endian IEEE 754 binary32
+   * - float64
+     - IEEE 754 double-precision floating point: sign bit, 11 bits exponent, 52 bits mantissa
+     - 8-byte little endian IEEE 754 binary64
+   * - complex64
+     - real and complex components are each IEEE 754 single-precision floating point
+     - 2 consecutive 4-byte little endian IEEE 754 binary32 values
+   * - complex128
+     - real and complex components are each IEEE 754 double-precision floating point
+     - 2 consecutive 8-byte little endian IEEE 754 binary64 values
    * - ``r*`` (Optional)
      - raw bits, use for extension type fallbacks
      - variable, given by ``*``, is limited to be a multiple of 8.
      - N/A
 
-
-Floating point types correspond to basic binary interchange formats as
-defined by IEEE 754-2008.
-
 Additionally to these base types, an implementation should also handle the
 raw/opaque pass-through type designated by the lower-case letter ``r`` followed
 by the number of bits, multiple of 8. For example, ``r8``, ``r16``, and ``r24``

From aaeb3a984db508e30465f95fd26f87ca339a1c94 Mon Sep 17 00:00:00 2001
From: Jeremy Maitin-Shepard
Date: Wed, 2 Nov 2022 10:54:33 -0700
Subject: [PATCH 2/4] Clarify meaning of omitting the codec

---
 docs/codecs.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/codecs.rst b/docs/codecs.rst
index 1b5a1ad2..2b0a9742 100644
--- a/docs/codecs.rst
+++ b/docs/codecs.rst
@@ -206,7 +206,9 @@ type is encoded as a 4-byte big endian two's complement integer, and the
 
    Single the default binary representation of all data types is little endian,
    specifying this codec with ``endian`` equal to ``"little"`` is equivalent to
-   omitting this codec.
+   omitting this codec, because if this codec is omitted, the default binary
+   representation of the data type, which is always little endian, is used
+   instead.
 
 Deprecated codecs
 =================

From 13f3d3f4839dd71a76ca6aa3c4e2f4db2e3a9f05 Mon Sep 17 00:00:00 2001
From: Jeremy Maitin-Shepard
Date: Wed, 2 Nov 2022 10:57:40 -0700
Subject: [PATCH 3/4] Add note in data type section about endian codec

---
 docs/codecs.rst    | 2 ++
 docs/core/v3.0.rst | 5 +++++
 2 files changed, 7 insertions(+)

diff --git a/docs/codecs.rst b/docs/codecs.rst
index 2b0a9742..e202ed1c 100644
--- a/docs/codecs.rst
+++ b/docs/codecs.rst
@@ -178,6 +178,8 @@ header. The format of the encoded buffer is defined in [BLOSC]_. The
 reference implementation is provided by the `c-blosc library `_.
 
+.. _endian-codec:
+
 Endian
 ------
 
diff --git a/docs/core/v3.0.rst b/docs/core/v3.0.rst
index 906e8770..386f1592 100644
--- a/docs/core/v3.0.rst
+++ b/docs/core/v3.0.rst
@@ -536,6 +536,11 @@ should be understood as fall-back types of respectively 1, 2, and 3 byte length.
 Zarr v3 is limited to type sizes that are a multiple of 8 bits but may support
 other type sizes in later versions of this specification.
 
+.. note::
+
+   While the default binary representation is little endian, the :ref:`endian
+   codec` may specified to use big endian encoding instead.
+
 .. note::
 

From bf0261ed17f28f52926f416e65e2932c45ca64c5 Mon Sep 17 00:00:00 2001
From: Jeremy Maitin-Shepard
Date: Thu, 3 Nov 2022 10:04:26 -0700
Subject: [PATCH 4/4] Fix working

---
 docs/core/v3.0.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/core/v3.0.rst b/docs/core/v3.0.rst
index 386f1592..030ba563 100644
--- a/docs/core/v3.0.rst
+++ b/docs/core/v3.0.rst
@@ -539,7 +539,7 @@ other type sizes in later versions of this specification.
 .. note::
 
    While the default binary representation is little endian, the :ref:`endian
-   codec` may be specified to use big endian encoding instead.
+   codec` may be specified to use big endian encoding instead.
 
 .. note::
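
Reviewer sketch (not part of the patch series): the "Format and algorithm" text added in PATCH 1/4 can be illustrated with Python's standard ``struct`` module. The helper names ``encode_int32`` and ``encode_complex128`` are hypothetical and do not come from the Zarr specification or any Zarr library. Writing the real component before the imaginary one is an assumption; the patch only says "two consecutive" binary64 values.

```python
import struct

# Hypothetical helpers for illustration only; these names are assumptions,
# not part of the Zarr specification or of any Zarr implementation.
_INT32_FORMAT = {"little": "<i", "big": ">i"}
_FLOAT64_FORMAT = {"little": "<d", "big": ">d"}


def encode_int32(values, endian="little"):
    """Encode integers as consecutive 4-byte two's complement values."""
    fmt = _INT32_FORMAT[endian]  # KeyError unless endian is "big" or "little"
    return b"".join(struct.pack(fmt, v) for v in values)


def encode_complex128(values, endian="little"):
    """Encode each complex number as two consecutive IEEE 754 binary64 values.

    Assumption: the real component is written first, then the imaginary one.
    """
    fmt = _FLOAT64_FORMAT[endian]
    return b"".join(
        struct.pack(fmt, z.real) + struct.pack(fmt, z.imag) for z in values
    )


print(encode_int32([1, -2], "big").hex())        # 00000001fffffffe
print(encode_int32([1, -2], "little").hex())     # 01000000feffffff
print(encode_complex128([1 + 2j], "big").hex())  # 3ff00000000000004000000000000000
```

With ``endian="little"`` the output matches the default binary representation in the data type table, which is why the note in ``docs/codecs.rst`` describes that setting as equivalent to omitting the codec.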