In [2]:
import numpy as np
import nanoarrow as na
import pyarrow as pa

### Intermezzo: inspecting the buffers using PyArrow and nanoarrow

#### PyArrow

For a `pyarrow.Array`, we can use the `buffers()` method to get a list of all the buffers of the array. The information shown for each buffer includes:

- the address of the buffer
- the buffer size in bytes
- whether the buffer is mutable or not (buffers are generally mutable, i.e. changeable, but an Array is an immutable container in pyarrow)

In [4]:
column1 = pa.array([1, 3, 9, 9, 2], type=pa.int64())
column1.buffers()

[None,
 <pyarrow.Buffer address=0x7f7cc1a090c0 size=40 is_cpu=True is_mutable=True>]

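In this case, a simple fixed-width primitive array without missing values, there is only a single buffer, which holds the data values. The leading `None` is the slot of the validity bitmap, which is not allocated here.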

PyArrow doesn't provide a direct, convenient way to view the buffer content, but here are a few ways to inspect the buffer:

In [6]:
values_buffer = column1.buffers()[1]
values_buffer

<pyarrow.Buffer address=0x7f7cc1a090c0 size=40 is_cpu=True is_mutable=True>

In [11]:
# getting the raw bytes as a Python bytes object (note this makes a copy! don't do this with larger data)
values_buffer.to_pybytes()

b'\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00'

In [10]:
# zero-copy view as a numpy array (using the buffer protocol)
# -> this just shows the raw bytes as well
np.array(values_buffer)

array([1, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0,
       0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0], dtype=int8)

In [16]:
# in this case we know the buffer represents int64 values, so we can view the buffer as such
np.frombuffer(values_buffer, dtype=np.int64)

array([1, 3, 9, 9, 2])
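
The same decoding can be done by hand, which makes the byte layout explicit. The sketch below uses only the Python standard library and copies the bytes from the `to_pybytes()` output above; it assumes the values are stored little-endian, as the printed bytes suggest. Note that `\t` in the bytes repr is simply how Python prints the byte `0x09`.

```python
# Raw bytes of the values buffer, copied from the to_pybytes() output above.
raw = (
    b"\x01\x00\x00\x00\x00\x00\x00\x00"
    b"\x03\x00\x00\x00\x00\x00\x00\x00"
    b"\t\x00\x00\x00\x00\x00\x00\x00"  # b"\t" is the byte 0x09, i.e. the value 9
    b"\t\x00\x00\x00\x00\x00\x00\x00"
    b"\x02\x00\x00\x00\x00\x00\x00\x00"
)

ITEMSIZE = 8  # an int64 value occupies 8 bytes

# Cut the buffer into 8-byte chunks and decode each chunk as a
# little-endian signed integer (little-endian is an assumption here).
values = [
    int.from_bytes(raw[i:i + ITEMSIZE], byteorder="little", signed=True)
    for i in range(0, len(raw), ITEMSIZE)
]
print(values)  # [1, 3, 9, 9, 2]
```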

#### Inspecting buffers using nanoarrow

In [18]:
# wrap the int64 array created above (column1) in a nanoarrow Array
na_column4 = na.Array(column1)

To start, nanoarrow can print the details of the layout of an array, which already gives us insight into the buffers of the array:

In [19]:
na.c_array_view(na_column4)

<nanoarrow.c_lib.CArrayView>
- storage_type: 'int64'
- length: 5
- offset: 0
- null_count: 0
- buffers[2]:
  - validity <bool[0 b] >
  - data <int64[40 b] 1 3 9 9 2>
- dictionary: NULL
- children[0]:

It also allows us to access the buffers directly through the `buffers` property:

In [23]:
na_column4.buffers

(nanoarrow.c_lib.CBufferView(bool[0 b] ),
 nanoarrow.c_lib.CBufferView(int64[40 b] 1 3 9 9 2))

Nanoarrow keeps track of the context in which the buffer was created (i.e. that it is part of an int64 array and represents the data values), so converting the buffer to numpy directly yields typed values instead of raw bytes:

In [24]:
data_buffer = na_column4.buffers[1]

In [25]:
data_buffer

nanoarrow.c_lib.CBufferView(int64[40 b] 1 3 9 9 2)

In [26]:
np.array(data_buffer)

array([1, 3, 9, 9, 2], dtype=int64)

## Fixed Size Primitive Layout

A primitive value array represents an array of values where each one has the same physical size measured in bytes.

![](diagrams/primitive-diagram.svg)

For example, a primitive array of int32 values (4 bytes per value):

In [13]:
column1 = pa.array([1, 3, 9, 9, 2], type=pa.int32())
column1.buffers()

[None,
 <pyarrow.Buffer address=0x726f004091c0 size=20 is_cpu=True is_mutable=True>]

In [14]:
column1

<pyarrow.lib.Int32Array object at 0x726f3c197100>
[
  1,
  3,
  9,
  9,
  2
]

In [15]:
na.c_array_view(column1)

<nanoarrow.c_lib.CArrayView>
- storage_type: 'int32'
- length: 5
- offset: 0
- null_count: 0
- buffers[2]:
  - validity <bool[0 b] >
  - data <int32[20 b] 1 3 9 9 2>
- dictionary: NULL
- children[0]:
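
Because every value has the same width, element `i` always starts at byte `i * itemsize` of the data buffer, so it can be read without scanning the preceding values. The following is a minimal sketch of that arithmetic using only the standard library; the buffer is rebuilt by hand with `struct` (little-endian assumed), and `value_at` is a hypothetical helper, not a pyarrow or nanoarrow function.

```python
import struct

# Rebuild the 20-byte data buffer of the int32 example by hand
# (little-endian byte order is an assumption of this sketch).
data = struct.pack("<5i", 1, 3, 9, 9, 2)
ITEMSIZE = struct.calcsize("<i")  # 4 bytes per int32 value


def value_at(buf, i):
    """Read element i directly from its byte position in the buffer."""
    (value,) = struct.unpack_from("<i", buf, i * ITEMSIZE)
    return value


print(len(data))          # 20, matching size=20 in the buffers() output
print(value_at(data, 2))  # 9, read from bytes 8..11
```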

## Support for null values

Arrow supports missing values or "nulls" for all data types: any value in an array may be semantically null, whether primitive or nested type.

In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, is stored alongside the data and indicates whether each value in the array is null or not. You can think of it as a vector of 0 and 1 values, where a 1 means that the value is not null ("valid"), while a 0 indicates that the value is null.

This validity bitmap is optional, i.e. if there are no missing values in the array the buffer does not need to be allocated (as in the example column 1 in the diagram below).

![](diagrams/primitive-diagram.svg)

In [2]:
import numpy as np
import pyarrow as pa
import nanoarrow as na

In [3]:
arr = pa.array([1.2, 3.4, 9.0, None, 2.9])
arr

<pyarrow.lib.DoubleArray object at 0x7fe64d8236a0>
[
  1.2,
  3.4,
  9,
  null,
  2.9
]

In [4]:
na.c_array_view(arr)

<nanoarrow.c_lib.CArrayView>
- storage_type: 'double'
- length: 5
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 11101000>
  - data <double[40 b] 1.2 3.4 9.0 0.0 2.9>
- dictionary: NULL
- children[0]:

**Attention**: Arrow uses [least-significant bit (LSB) numbering](https://en.wikipedia.org/wiki/Bit_numbering) (one of the two bit-endianness conventions). This means that within a group of 8 bits (1 byte), the first element is stored in the least significant (rightmost) bit, so we read the bits right-to-left. However, the `nanoarrow` repr of the validity buffer in the example above already takes that into account and shows the values in logical order, matching the position in the array.

The diagram above shows the bitmap as it is actually stored in memory. We can inspect the validity bitmap buffer with pyarrow and numpy:

In [5]:
validity_bitmap_buffer = arr.buffers()[0]
validity_bitmap_buffer.to_pybytes()

b'\x17'

In this case of a small array of 5 values, the validity bitmap consists of only a single byte. To view the data as bytes in numpy, we can use the `uint8` data type, which has a width of 1 byte:

In [9]:
np.frombuffer(validity_bitmap_buffer, dtype="uint8")

array([23], dtype=uint8)

Numpy also provides a function to "unpack" the 0/1 bits of those bytes into separate values:

In [10]:
np.unpackbits(np.frombuffer(validity_bitmap_buffer, dtype="uint8"), bitorder="little")

array([1, 1, 1, 0, 1, 0, 0, 0], dtype=uint8)

For this array of 5 elements, only the first 5 bits carry meaning; the remaining ("padding") bits are set to 0.
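
The bit lookup itself is a shift and a mask. As a minimal sketch using only the standard library (the helper names `is_valid` and `pack_validity` are hypothetical, chosen for illustration), the following reads and builds an LSB-ordered bitmap such as the `b'\x17'` byte shown above:

```python
def is_valid(bitmap, i):
    """Return True if element i is non-null in an LSB-ordered validity bitmap."""
    # Element i lives in byte i // 8, at bit position i % 8 counted from the right.
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


def pack_validity(mask):
    """Build an LSB-ordered bitmap from a sequence of booleans (True = valid)."""
    bitmap = bytearray((len(mask) + 7) // 8)  # round up to whole bytes
    for i, valid in enumerate(mask):
        if valid:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


bitmap = b"\x17"  # validity bitmap of [1.2, 3.4, 9.0, None, 2.9]
bits = [is_valid(bitmap, i) for i in range(5)]
print(format(bitmap[0], "08b"))  # 00010111: as stored, read right-to-left
print(bits)                      # [True, True, True, False, True]
```

Going the other way, `pack_validity([True, True, True, False, True])` gives back `b'\x17'`.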

### Null vs NaN

In numpy (and numpy-based packages such as pandas), `NaN` is often used as an indicator for "missing" values, mostly for lack of better alternatives (numpy does not have built-in support for missing values in general). `NaN` is a specific floating-point value ("Not a Number") within the IEEE floating-point standard, and thus is only available for floating-point data types.
In the Arrow format, since there is a separate concept of nulls, a NaN value is considered just another valid floating-point value:

In [11]:
arr = na.Array([0.5, float("nan"), 1.5, None, 3.5], na.float64())

In [12]:
arr

nanoarrow.Array<double>[5]
0.5
nan
1.5
None
3.5

In [13]:
arr.buffers

(nanoarrow.c_lib.CBufferView(bool[1 b] 11101000),
 nanoarrow.c_lib.CBufferView(double[40 b] 0.5 nan 1.5 0.0 3.5))
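
To make the distinction concrete, the two buffers can be combined by hand. The sketch below uses only the standard library and rebuilds both buffers from the values printed above (little-endian doubles are an assumption): a slot is null only if its validity bit is 0, while NaN is an ordinary value in a valid slot.

```python
import math
import struct

# Buffers rebuilt from the repr above: validity bits 1,1,1,0,1 -> 0x17,
# and five float64 values (the null slot happens to hold 0.0).
validity = b"\x17"
data = struct.pack("<5d", 0.5, float("nan"), 1.5, 0.0, 3.5)

values = []
for i, (x,) in enumerate(struct.iter_unpack("<d", data)):
    valid = (validity[i // 8] >> (i % 8)) & 1
    values.append(x if valid else None)  # only the bitmap decides nullness

null_count = sum(v is None for v in values)
nan_count = sum(v is not None and math.isnan(v) for v in values)
print(values)                 # [0.5, nan, 1.5, None, 3.5]
print(null_count, nan_count)  # 1 1
```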

# Variable sized types

## Variable length binary and string

The bytes of a binary or string column are stored consecutively in a single buffer (region of memory). To know where each element of the column starts and ends, the physical layout also includes a buffer of integer offsets. The number of offsets is one more than the length of the column, because each element is delimited by two consecutive offsets: the last two offsets define the start and the end of the last element in the binary/string column.

Binary and string types share the same physical layout. The difference is that the string type requires the bytes to be valid UTF-8; a string array whose bytes are not valid UTF-8 is invalid.

The difference between binary/string and large binary/string is in the offset type. In the first case that is `int32` and in the second it is `int64`.

The limitation of the types using 32-bit offsets is that the data of a single array/column cannot exceed 2 GB. One can still use the non-large variants for bigger data, but then the data has to be split over multiple chunks.
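
The limit follows directly from the offset type: the last offset equals the total number of data bytes and has to fit in a signed 32-bit integer. A small arithmetic sketch (standard library only; `offsets_nbytes` is a hypothetical helper for illustration) also shows how the size of the offsets buffer depends on the offset width:

```python
INT32_MAX = 2**31 - 1  # largest offset a (non-large) binary/string array can store
INT64_MAX = 2**63 - 1  # largest offset for the large_ variants


def offsets_nbytes(n_elements, large=False):
    """Size in bytes of the offsets buffer: n + 1 offsets of 4 or 8 bytes each."""
    return (n_elements + 1) * (8 if large else 4)


print(INT32_MAX)                      # 2147483647 bytes, just under 2 GiB
print(offsets_nbytes(5))              # 24 bytes for a 5-element string array
print(offsets_nbytes(5, large=True))  # 48 bytes for a 5-element large_string array
```

The 24 and 48 byte figures are the `data_offset` sizes that nanoarrow reports for the 5-element examples in this section.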

![image info](./diagrams/var-string-diagram.svg)

In [1]:
import nanoarrow as na
import numpy as np
import pyarrow as pa

In [24]:
# Binary column example
pa.array(['python', 'data', 'conference', None, "Berlin"], type=pa.binary())

<pyarrow.lib.BinaryArray object at 0x105427040>
[
  707974686F6E,
  64617461,
  636F6E666572656E6365,
  null,
  4265726C696E
]

The bytes in the BinaryArray are shown in the "hex" representation:

In [25]:
bytes.fromhex("707974686F6E")

b'python'

In [26]:
# String column examples
pa.array(['python', 'data', 'conference', None, "Berlin"], type=pa.string())

<pyarrow.lib.StringArray object at 0x105376500>
[
  "python",
  "data",
  "conference",
  null,
  "Berlin"
]

### String type

In [4]:
# Inspecting buffers using PyArrow and buffers() method

column4 = pa.array(['python', 'data', 'conference', None, "Berlin"], type=pa.string())
column4.buffers()

[<pyarrow.Buffer address=0x41e94020240 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x41e94020200 size=24 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x41e94020280 size=26 is_cpu=True is_mutable=True>]

In [5]:
# Inspecting buffers using PyArrow and buffers() method and numpy

validity_bitmap_buffer = column4.buffers()[0]
np.unpackbits(np.frombuffer(validity_bitmap_buffer, dtype="uint8"), bitorder="little")

array([1, 1, 1, 0, 1, 0, 0, 0], dtype=uint8)

In [6]:
offsets_buffer = column4.buffers()[1]
np.frombuffer(offsets_buffer, dtype="int32")

array([ 0,  6, 10, 20, 20, 26], dtype=int32)

In [7]:
values_buffer = column4.buffers()[2]
values_buffer.to_pybytes()

b'pythondataconferenceBerlin'
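
Putting the three buffers together, each element is the slice of the data buffer between two consecutive offsets, unless its validity bit says it is null. Here is a minimal sketch with the buffer contents copied from the outputs above (standard library only; `element` is a hypothetical helper):

```python
# Buffer contents copied from the outputs above.
validity = b"\x17"                  # bits 1,1,1,0,1 in LSB order
offsets = [0, 6, 10, 20, 20, 26]    # n + 1 int32 offsets
data = b"pythondataconferenceBerlin"


def element(i):
    """Return element i as str, or None if its validity bit is 0."""
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    # Element i spans the bytes from offsets[i] up to (not including) offsets[i + 1].
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


strings = [element(i) for i in range(len(offsets) - 1)]
print(strings)  # ['python', 'data', 'conference', None, 'Berlin']
```

Note that the null element takes no space in the data buffer: its start and end offsets are both 20.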

In [8]:
# Inspecting buffers using nanoarrow

na_column4 = na.c_array(column4)
na.c_array_view(na_column4)

<nanoarrow.c_lib.CArrayView>
- storage_type: 'string'
- length: 5
- offset: 0
- null_count: 1
- buffers[3]:
  - validity <bool[1 b] 11101000>
  - data_offset <int32[24 b] 0 6 10 20 20 26>
  - data <string[26 b] b'pythondataconferenceBerlin'>
- dictionary: NULL
- children[0]:

### Binary type

In [9]:
column4 = pa.array(['python', 'data', 'conference', None, "Berlin"], type=pa.binary())
column4.buffers()

[<pyarrow.Buffer address=0x41e94020300 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x41e940202c0 size=24 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x41e94020340 size=26 is_cpu=True is_mutable=True>]

In [10]:
na_column4 = na.c_array(column4)
na.c_array_view(na_column4)

<nanoarrow.c_lib.CArrayView>
- storage_type: 'binary'
- length: 5
- offset: 0
- null_count: 1
- buffers[3]:
  - validity <bool[1 b] 11101000>
  - data_offset <int32[24 b] 0 6 10 20 20 26>
  - data <binary[26 b] b'pythondataconferenceBerlin'>
- dictionary: NULL
- children[0]:

### Comparing string and large string

In [11]:
column4 = pa.array(['python', 'data', 'conference', None, "Berlin"], type=pa.string())
na.c_array_view(na.c_array(column4))

<nanoarrow.c_lib.CArrayView>
- storage_type: 'string'
- length: 5
- offset: 0
- null_count: 1
- buffers[3]:
  - validity <bool[1 b] 11101000>
  - data_offset <int32[24 b] 0 6 10 20 20 26>
  - data <string[26 b] b'pythondataconferenceBerlin'>
- dictionary: NULL
- children[0]:

In [12]:
column4 = pa.array(['python', 'data', 'conference', None, "Berlin"], type=pa.large_string())
na.c_array_view(na.c_array(column4))

<nanoarrow.c_lib.CArrayView>
- storage_type: 'large_string'
- length: 5
- offset: 0
- null_count: 1
- buffers[3]:
  - validity <bool[1 b] 11101000>
  - data_offset <int64[48 b] 0 6 10 20 20 26>
  - data <string[26 b] b'pythondataconferenceBerlin'>
- dictionary: NULL
- children[0]:

### Variable length binary and string view

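Binary and string view layouts are new in Arrow Columnar format 1.4.
The main difference from the classical binary and string types is the **views buffer**, which takes the place of the offsets buffer. Each view is a fixed-size 16-byte struct that starts with the length of the value. Values of at most 12 bytes are stored inline in the view itself; for longer values, the view stores a 4-byte prefix of the data together with the index of a data buffer and an offset into it. A view may thus point to one of potentially several data buffers, and values can be written out of order.

These properties are important for efficient string processing. The prefix enables a profitable fast path for string comparisons, whose outcome is frequently determined within the first four bytes.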

![image info](./diagrams/var-string-view-diagram.svg)

In [13]:
column5 = pa.array(['String longer than 12', 'Short', None, 'Short string', "Another long string"], type=pa.string_view())
column5

<pyarrow.lib.StringViewArray object at 0x1054262c0>
[
  "String longer than 12",
  "Short",
  null,
  "Short string",
  "Another long string"
]

In [14]:
# Inspecting buffers using PyArrow and buffers() method
column5.buffers()

[<pyarrow.Buffer address=0x41e94020500 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x41e940d0080 size=80 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x41e94020540 size=40 is_cpu=True is_mutable=True>]

In [15]:
# Inspecting buffers using PyArrow and buffers() method and numpy

validity_bitmap_buffer = column5.buffers()[0]
np.unpackbits(np.frombuffer(validity_bitmap_buffer, dtype="uint8"), bitorder="little")

array([1, 1, 0, 1, 1, 0, 0, 0], dtype=uint8)

In [16]:
views_buffer = column5.buffers()[1]
np.frombuffer(views_buffer, dtype="int32")

array([        21, 1769108563,          0,          0,          5,
       1919903827,        116,          0,          0,          0,
                0,          0,         12, 1919903827, 1953702004,
       1735289202,         19, 1953459777,          0,         21],
      dtype=int32)

In [17]:
values_buffer = column5.buffers()[2]
values_buffer.to_pybytes()

b'String longer than 12Another long string'
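
The integers in the views buffer can be decoded by hand. The sketch below uses only the standard library and rebuilds the buffers from the outputs above. It assumes the view layout described in the Arrow specification: each view is 16 bytes and starts with an int32 length, followed either by up to 12 inline bytes (length at most 12), or by a 4-byte prefix, an int32 data buffer index and an int32 offset. `decode_view` is a hypothetical helper, not a pyarrow or nanoarrow function.

```python
import struct

# Buffers rebuilt from the outputs above (little-endian assumed).
validity = b"\x1b"  # bits 1,1,0,1,1 in LSB order
views = struct.pack(
    "<20i",
    21, 1769108563, 0, 0,                    # long: length, prefix, buffer index, offset
    5, 1919903827, 116, 0,                   # short: length + inline bytes "Short"
    0, 0, 0, 0,                              # null slot
    12, 1919903827, 1953702004, 1735289202,  # short: exactly 12 inline bytes
    19, 1953459777, 0, 21,                   # long: starts at offset 21 of buffer 0
)
data_buffers = [b"String longer than 12Another long string"]


def decode_view(i):
    """Decode element i of the string view array, or None if it is null."""
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    view = views[16 * i:16 * (i + 1)]
    (length,) = struct.unpack_from("<i", view, 0)
    if length <= 12:
        # Short value: the bytes are stored inline, right after the length.
        return view[4:4 + length].decode("utf-8")
    # Long value: the view points into one of the data buffers.
    prefix, buffer_index, offset = struct.unpack_from("<4sii", view, 4)
    payload = data_buffers[buffer_index][offset:offset + length]
    assert payload[:4] == prefix  # the prefix duplicates the first 4 data bytes
    return payload.decode("utf-8")


strings = [decode_view(i) for i in range(5)]
print(strings)
```

Only the two values longer than 12 bytes appear in the data buffer; "Short" and "Short string" live entirely inside their views.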