# Primitive layouts

In [None]:
import struct

import numpy as np
import nanoarrow as na
import pyarrow as pa

## Fixed Size Primitive Layout

A primitive value array represents an array of values where each value has the same physical size measured in bytes.

![](diagrams/primitive-diagram.svg)

For example, a primitive array of int32 values (4 bytes per value):

In [None]:
column1 = pa.array([1, 3, 9, 9, 2], type=pa.int32())
column1

In [None]:
column1.buffers()

In [None]:
na.array(column1).inspect()
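As a rough illustration of what "fixed size" buys us, the sketch below packs the same five values by hand with Python's `struct` module, without involving Arrow. The little-endian byte order and the helper name `get_value` are assumptions made for this example only.

```python
import struct

values = [1, 3, 9, 9, 2]

# Pack five little-endian int32 values back to back: 5 * 4 = 20 bytes.
data = struct.pack("<5i", *values)


def get_value(buf, i, width=4):
    """Read element i by jumping straight to byte offset i * width."""
    return struct.unpack_from("<i", buf, i * width)[0]


print(len(data))           # 20
print(get_value(data, 2))  # 9
```

Because every value has the same width, locating element `i` needs only a multiplication, with no scanning of the preceding values.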

##### Mutable vs Immutable

In [None]:
str_greeting = b"hello barcelona"
buf = pa.py_buffer(str_greeting)
buf

In [None]:
# bytes objects are immutable, so this assignment raises a TypeError
str_greeting[0] = b"f"

In [None]:
ba_greeting = bytearray(b"hello barcelona")
buf = pa.py_buffer(ba_greeting)
buf

In [None]:
ba_greeting[5] = ord('_')
ba_greeting

In [None]:
buf

In [None]:
struct.unpack("15s", buf.to_pybytes())
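The same behaviour can be reproduced with only the standard library, which may help separate what comes from Python's buffer protocol from what PyArrow adds. This is a sketch using `memoryview` as a stand-in for a zero-copy buffer; the variable names are made up for the example.

```python
immutable = b"hello barcelona"
mutable = bytearray(immutable)

# memoryview wraps the existing memory without copying it.
view_ro = memoryview(immutable)
view_rw = memoryview(mutable)

# Changing the bytearray is visible through the view, because both share memory.
mutable[5] = ord("_")
seen_through_view = bytes(view_rw)

# Writing through a view of an immutable bytes object is rejected.
try:
    view_ro[0] = ord("f")
    write_failed = False
except TypeError:
    write_failed = True

print(view_ro.readonly, view_rw.readonly)
print(seen_through_view)
```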

## Intermezzo: inspecting the buffers using PyArrow and nanoarrow

#### PyArrow

For a `pyarrow.Array`, we can use the `buffers()` method to get a list of all the buffers of the array. The information for each buffer includes:

- the address of the buffer
- the buffer size in bytes
- whether the buffer lives in CPU memory or not (for example on a GPU)
- whether the buffer is mutable or not (buffers are generally mutable, i.e. changeable, but an Array is an immutable container in pyarrow)

In [None]:
column1.buffers()

In this case, a simple fixed-width primitive array without nulls, there is only a single buffer, holding the data values. The first entry of the list is `None` because no validity bitmap was allocated.

PyArrow does not offer a convenient way to look at the buffer contents directly, but here are a few ways to inspect a buffer:

In [None]:
values_buffer = column1.buffers()[1]
values_buffer

In [None]:
# getting the raw bytes as a Python bytes object (note this makes a copy! don't do this with larger data)
values_buffer.to_pybytes()

In [None]:
# zero-copy view as a numpy array (using the buffer protocol)
# -> this just shows the raw bytes as well
np.array(values_buffer)

In [None]:
# in this case we know the buffer represents int32 values, so we can view the buffer as such
np.frombuffer(values_buffer, dtype=np.int32)
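The idea that a buffer is just bytes, and that the data type decides how to read them, can also be sketched without numpy. The example below assumes a platform where a C `int` is 4 bytes (true on mainstream systems) and builds its own raw bytes with `struct` rather than taking them from Arrow.

```python
import struct

# Native-endian int32 values, packed the way the data buffer would hold them.
raw = struct.pack("5i", 1, 3, 9, 9, 2)

# The same memory, interpreted in two different ways (no copy in either case).
as_bytes = memoryview(raw).cast("B")  # 20 unsigned bytes
as_int32 = memoryview(raw).cast("i")  # 5 four-byte integers

print(len(as_bytes), as_bytes.tolist())
print(len(as_int32), as_int32.tolist())
```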

#### Inspecting buffers using nanoarrow

In [None]:
na_column1 = na.array(column1)

To start, nanoarrow can print the details of an array's layout, which already gives us insight into the buffers of the array:

In [None]:
na_column1.inspect()

It also lets us access the buffers directly through the `buffers` property:

In [None]:
na_column1.buffers

Nanoarrow keeps track of the context in which the buffer was created (i.e. it is part of an int32 array and represents the data values):

In [None]:
data_buffer = na_column1.buffers[1]

In [None]:
data_buffer

In [None]:
np.array(data_buffer)

## Support for null values

Arrow supports missing values or "nulls" for all data types: any value in an array may be semantically null, whether primitive or nested type.

In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, is stored alongside the data and indicates whether each value in the array is null or not. You can think of it as a vector of 0 and 1 values, where a 1 means that the value is not null ("valid"), while a 0 indicates that the value is null.

This validity bitmap is optional, i.e. if there are no missing values in the array the buffer does not need to be allocated (as in the example column 1 in the diagram below).

![](diagrams/primitive-diagram.svg)

In [None]:
column2 = pa.array([1.2, 3.4, 9.0, None, 2.9])
column2

In [None]:
na.array(column2).inspect()

**Attention**: Arrow uses [least-significant bit (LSB) numbering](https://en.wikipedia.org/wiki/Bit_numbering) (also known as bit-endianness). This means that within a group of 8 bits (1 byte), we read right-to-left. However, the `nanoarrow` repr of the validity buffer in the example above already takes that into account and shows the values in logical order, matching the position in the array.

The diagram above shows the bitmap as it is actually stored in memory. We can inspect the validity bitmap buffer with pyarrow and numpy:

In [None]:
validity_bitmap_buffer = column2.buffers()[0]
validity_bitmap_buffer.to_pybytes()

In this case of a small array of 5 values, the validity bitmap consists of only a single byte. To view the data as bytes in numpy, we can use the `uint8` data type, which has a width of 1 byte:

In [None]:
np.frombuffer(validity_bitmap_buffer, dtype="uint8")

Numpy also provides a function to "unpack" the 0/1 bits of those bytes into separate values:

In [None]:
np.unpackbits(np.frombuffer(validity_bitmap_buffer, dtype="uint8"), bitorder="little")

For this array of 5 elements, only the first 5 bits carry meaning; the remaining padding bits are set to 0 here.
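To make the LSB numbering concrete, here is a minimal pure-Python sketch that builds and reads such a bitmap by hand. The helper names `pack_validity` and `is_valid` are invented for this example and are not part of PyArrow or nanoarrow.

```python
def pack_validity(values):
    """Build a validity bitmap: bit i is 1 when values[i] is not None."""
    bitmap = bytearray((len(values) + 7) // 8)  # round up to whole bytes
    for i, value in enumerate(values):
        if value is not None:
            # LSB numbering: element i lives in byte i // 8, bit i % 8.
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Check the validity bit of element i."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


bitmap = pack_validity([1.2, 3.4, 9.0, None, 2.9])
print(format(bitmap[0], "08b"))  # 00010111
print([is_valid(bitmap, i) for i in range(5)])
```

Printed as a binary number, the byte reads right-to-left: the rightmost bit belongs to element 0, and the single 0 among the lowest five bits marks the null at position 3.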

### Null vs NaN

In numpy (and numpy-based packages such as pandas), `NaN` is often used as an indicator for "missing" values, mostly for lack of a better alternative (numpy does not have built-in support for missing values in general). `NaN` is a specific floating-point value ("Not a Number") defined by the IEEE floating-point standard, and is thus only available for floating-point data types.
In the Arrow format, since there is a separate concept of nulls, a NaN value is considered just another valid floating-point value in the array:

In [None]:
arr = na.array([0.5, float("nan"), 1.5, None, 3.5], na.float64())

In [None]:
arr

In [None]:
arr.buffers
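The distinction can be sketched by hand: a NaN lives in the data buffer as an ordinary 8-byte float, while a null is recorded only in the validity information. In this illustration the validity is kept as a plain list of booleans for readability, and the placeholder `0.0` written into the null slot is an arbitrary choice for the example.

```python
import math
import struct

values = [0.5, float("nan"), 1.5, None, 3.5]

# Validity: one flag per element, False only for the null.
validity = [v is not None for v in values]

# Data buffer: every slot holds 8 bytes, including the null slot.
data = struct.pack("<5d", *[0.0 if v is None else v for v in values])
decoded = struct.unpack("<5d", data)

nan_slots = [i for i, (ok, x) in enumerate(zip(validity, decoded)) if ok and math.isnan(x)]
null_slots = [i for i, ok in enumerate(validity) if not ok]

print(nan_slots)   # [1]
print(null_slots)  # [3]
```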

## Exercise

In the following code snippet, we create an Array object from python `datetime` instances. What is the type of the array? Is this a fixed-width primitive type? How are the datetimes expressed?

<details><summary>Hints</summary>

* A `pyarrow.Array` has a `.type` attribute, and this DataType object has a `byte_width` attribute when the type is fixed-width.
* Does it just have a single data buffer (next to the validity bitmap)?

</details>

In [None]:
from datetime import datetime

column_datetime = pa.array([datetime(2024, 4, 22), datetime(2024, 4, 23), datetime(2024, 4, 24)])

# Variable length binary and string

The bytes of a binary or string column are stored together consecutively in a single buffer or region of memory. To know where each element of the column starts and ends, the physical layout also includes a buffer of integer offsets. The offsets buffer has one more entry than the column has elements, because element `i` spans the bytes from `offsets[i]` to `offsets[i + 1]`, so the last two offsets define the start and the end of the last element.

Binary and string types share the same physical layout. The difference is that the string type requires its bytes to be valid UTF-8; bytes that are not valid UTF-8 produce an invalid string array.

The difference between binary/string and large binary/string is in the offset type. In the first case that is `int32` and in the second it is `int64`.

The limitation of the types using 32-bit offsets is that a single array/column can hold at most about 2 GB of data, since the largest offset an `int32` can represent is 2^31 - 1 bytes. One can still use the non-large variants for bigger data, but then multiple chunks are needed.
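As a rough sketch of what "multiple chunks" means, the helper below groups values so that the total number of bytes in each group stays within a limit. The function name `split_into_chunks` and its greedy strategy are assumptions for illustration; they do not describe how PyArrow chunks data internally.

```python
INT32_MAX = 2**31 - 1  # largest offset a 32-bit signed integer can hold


def split_into_chunks(items, max_bytes=INT32_MAX):
    """Greedily group byte strings so each group totals at most max_bytes."""
    chunks, current, size = [], [], 0
    for item in items:
        if current and size + len(item) > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(item)
        size += len(item)
    if current:
        chunks.append(current)
    return chunks


# A tiny limit stands in for the real one to keep the example small.
print(split_into_chunks([b"python", b"data", b"conference"], max_bytes=10))
```

A single value larger than the limit would still end up alone in its own group here; handling that case is left out of the sketch.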

![image info](./diagrams/var-string-diagram.svg)
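The offsets-plus-data layout can be sketched in a few lines of plain Python. The helpers `build_binary_layout` and `get_item` are hypothetical names used only for this illustration, and the validity bitmap is left out to keep the focus on the offsets.

```python
def build_binary_layout(items):
    """Return (offsets, data) for a list of bytes objects or None."""
    offsets = [0]
    data = bytearray()
    for item in items:
        if item is not None:
            data += item
        # A null adds no bytes, so its start and end offsets are equal.
        offsets.append(len(data))
    return offsets, bytes(data)


def get_item(offsets, data, i):
    """Slice element i out of the shared data buffer."""
    return data[offsets[i]:offsets[i + 1]]


items = [b"python", b"data", b"conference", None, b"raulcd"]
offsets, data = build_binary_layout(items)

print(offsets)                     # [0, 6, 10, 20, 20, 26]
print(data)                        # b'pythondataconferenceraulcd'
print(get_item(offsets, data, 2))  # b'conference'
```

Note that a null and an empty value both produce a zero-length slice, which is why the validity bitmap is still needed to tell them apart.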

In [None]:
# Binary column example
pa.array(['python', 'data', 'conference', None, "raulcd"], type=pa.binary())

The repr of a BinaryArray shows the bytes in hexadecimal ("hex") notation. Converting the first value back gives the original bytes:

In [None]:
bytes.fromhex("707974686F6E")

In [None]:
# String column examples
pa.array(['python', 'data', 'conference', None, "raulcd"], type=pa.string())

### String type

In [None]:
# Inspecting buffers using PyArrow and buffers() method

column4 = pa.array(['python', 'data', 'conference', None, "raulcd"], type=pa.string())
column4.buffers()

In [None]:
# Inspecting buffers using PyArrow and buffers() method and numpy

validity_bitmap_buffer = column4.buffers()[0]
np.unpackbits(np.frombuffer(validity_bitmap_buffer, dtype="uint8"), bitorder="little")

In [None]:
offsets_buffer = column4.buffers()[1]
np.frombuffer(offsets_buffer, dtype="int32")

In [None]:
values_buffer = column4.buffers()[2]
values_buffer.to_pybytes()

In [None]:
# Inspecting buffers using nanoarrow

na_column4 = na.array(column4)
na_column4.inspect()

### Binary type

In [None]:
column4 = pa.array(['python', 'data', 'conference', None, "raulcd"], type=pa.binary())
column4.buffers()

In [None]:
na_column4 = na.array(column4)
na_column4.inspect()
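Since the binary and string arrays above use the same buffers, the only difference is whether the bytes are required to be valid UTF-8. A minimal way to check that requirement for a single value in plain Python is sketched below; `is_valid_utf8` is a helper made up for this example.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True when the bytes can be decoded as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"raulcd"))               # True
print(is_valid_utf8("café".encode("utf-8")))  # True
print(is_valid_utf8(b"\xff\xfe"))             # False: fine as binary, not as string
```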

### Comparing string and large string

In [None]:
column4 = pa.array(['python', 'data', 'conference', None, "raulcd"], type=pa.string())
na.array(column4).inspect()

In [None]:
column4 = pa.array(['python', 'data', 'conference', None, "raulcd"], type=pa.large_string())
na.array(column4).inspect()
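The effect of the offset type on the size of the offsets buffer can be worked out by hand. The sketch below packs the six offsets of this five-element column once as `int32` and once as `int64` using `struct`; the little-endian byte order is an assumption of the example.

```python
import struct

# Offsets for ['python', 'data', 'conference', None, 'raulcd']
offsets = [0, 6, 10, 20, 20, 26]

offsets_int32 = struct.pack("<6i", *offsets)  # string: 4 bytes per offset
offsets_int64 = struct.pack("<6q", *offsets)  # large_string: 8 bytes per offset

print(len(offsets_int32))  # 24
print(len(offsets_int64))  # 48
```

The data and validity buffers are unaffected; only the offsets buffer doubles in size.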

### Variable length binary and string view

The binary and string view layouts are new in Arrow Columnar format 1.4. This layout is adapted from TU Munich's UmbraDB, and is similar to the string layout used in DuckDB and Velox (sometimes also called "German style strings").

The main difference from the classical binary and string types is the **views buffer**. Each view is 16 bytes wide and starts with the length of the string. For small strings (up to 12 bytes) the view then contains the characters inline; for longer strings it contains only the first 4 bytes of the string, followed by the index of a data buffer and an offset into it, so the views can point into potentially several data buffers. The layout also allows binary and string values to be written out of order.

These properties are important for efficient string processing. The prefix enables a profitable fast path for string comparisons, which are frequently decided within the first four bytes. Selecting elements is a simple "take" operation on the fixed-width views buffer and does not need to rewrite the values buffers.
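The prefix fast path can be sketched as follows: the first 8 bytes of a view (length plus the first four bytes of the value) are compared first, and only when they match does the rest of the value need to be examined. The helper names are invented for this example, and real implementations are considerably more refined.

```python
import struct


def view_header(value: bytes) -> bytes:
    """First 8 bytes of a view: int32 length + up to 4 leading bytes."""
    return struct.pack("<i4s", len(value), value[:4])


def definitely_different(a: bytes, b: bytes) -> bool:
    """True when length or prefix already rule out equality."""
    return view_header(a) != view_header(b)


print(definitely_different(b"python", b"pandas"))                        # True
print(definitely_different(b"abc", b"abcd"))                             # True
print(definitely_different(b"data conference x", b"data conference y"))  # False
```

In the last case the headers match, so a full comparison of the remaining bytes would still be required.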

![image info](./diagrams/var-string-view-diagram.svg)
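A hand-rolled version of the 16-byte views may help when reading the diagram. The sketch below assumes little-endian encoding and a single data buffer, ignores nulls, and uses helper names (`encode_view`, `decode_view`) that are not part of any library.

```python
import struct

INLINE_LIMIT = 12  # values of up to 12 bytes fit inside the 16-byte view


def encode_view(value, buffer_index, offset):
    if len(value) <= INLINE_LIMIT:
        # length + inline bytes ("12s" pads short values with zero bytes)
        return struct.pack("<i12s", len(value), value)
    # length + 4-byte prefix + data buffer index + offset into that buffer
    return struct.pack("<i4sii", len(value), value[:4], buffer_index, offset)


def decode_view(view, data_buffers):
    (length,) = struct.unpack_from("<i", view)
    if length <= INLINE_LIMIT:
        return view[4:4 + length]
    _, _, buffer_index, offset = struct.unpack("<i4sii", view)
    return data_buffers[buffer_index][offset:offset + length]


items = [b"String longer than 12", b"Short", b"Short string", b"Another long string"]
data = bytearray()
views = []
for item in items:
    views.append(encode_view(item, 0, len(data)))
    if len(item) > INLINE_LIMIT:
        data += item  # only long values are written to the data buffer

# "take": select and reorder by copying views; the data buffer is untouched.
taken = [views[i] for i in (3, 0)]
decoded = [decode_view(v, [bytes(data)]) for v in taken]
print(decoded)
```

Only the two long values end up in the data buffer, and the selection at the end never touches it.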

In [None]:
column5 = pa.array(['String longer than 12', 'Short', None, 'Short string', "Another long string"], type=pa.string_view())
column5

In [None]:
# Inspecting buffers using PyArrow and buffers() method
column5.buffers()

In [None]:
# Inspecting buffers using PyArrow and buffers() method and numpy

validity_bitmap_buffer = column5.buffers()[0]
np.unpackbits(np.frombuffer(validity_bitmap_buffer, dtype="uint8"), bitorder="little")

In [None]:
views_buffer = column5.buffers()[1]
np.frombuffer(views_buffer, dtype="int32")

In [None]:
views_buffer.to_pybytes()

In [None]:
values_buffer = column5.buffers()[2]
values_buffer.to_pybytes()

In [None]:
import struct

In [None]:
struct.unpack("i4sii", views_buffer.to_pybytes()[:16])

In [None]:
struct.unpack("i12s", views_buffer.to_pybytes()[16:32])

In [None]:
struct.unpack("iiii", views_buffer.to_pybytes()[32:48])

In [None]:
struct.unpack("i12s", views_buffer.to_pybytes()[48:64])

In [None]:
struct.unpack("i4sii", views_buffer.to_pybytes()[64:80])

In [None]:
column6 = pa.concat_arrays(
    [column5, pa.array(["pythondataconferenceraulcd"], type=pa.string_view())]
)
column6

In [None]:
column6.buffers()

**Note**: I wanted to create a String View Array reusing the buffers, but found a bug in pyarrow.

See for details: [\[Python\] StringViewArray.from_buffers does not seem to work as expected](https://github.com/apache/arrow/issues/44651)

In [None]:
views_bytes = column6.buffers()[1].to_pybytes()

for i in range(0, len(views_bytes), 16):
    length, = struct.unpack("i", views_bytes[i:i+4])
    if length > 12:
        print(struct.unpack("i4sii", views_bytes[i:i+16]))
    else:
        print(struct.unpack("i12s", views_bytes[i:i+16]))

## Exercise

In the following code snippet, we create an Array object from `bytes` objects that all have the same size. Can you see how it differs from the binary and string arrays we have seen so far?

<details><summary>Hints</summary>

* How many buffers does it have? Does it have an offsets buffer?
* Is it a variable-size or fixed-size layout?

</details>

In [None]:
column7 = pa.array([b"some", b"byte", b"data"], pa.binary(4))
column7

In [None]:
column7.to_pylist()