# Primitive layouts

In [1]:
import numpy as np
import nanoarrow as na
import pyarrow as pa

## Fixed Size Primitive Layout

A primitive value array represents an array of values where each value has the same physical size measured in bytes.

![](diagrams/primitive-diagram.svg)

For example, a primitive array of int32 values (4 bytes per value):

In [2]:
column1 = pa.array([1, 3, 9, 9, 2], type=pa.int32())
column1

<pyarrow.lib.Int32Array object at 0x7177600d5c60>
[
  1,
  3,
  9,
  9,
  2
]

In [3]:
column1.buffers()

[None,
 <pyarrow.Buffer address=0x51760020080 size=20 is_cpu=True is_mutable=True>]

In [4]:
na.array(column1).inspect()

<ArrowArray int32>
- length: 5
- offset: 0
- null_count: 0
- buffers[2]:
  - validity <bool[0 b] >
  - data <int32[20 b] 1 3 9 9 2>
- dictionary: NULL
- children[0]:
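
The layout arithmetic behind this output can be reproduced without Arrow at all. The following is a minimal sketch in pure Python, assuming little-endian byte order; the names `values`, `BYTE_WIDTH` and `value_at` are illustrative and not part of any library. It packs the same five int32 values into one contiguous byte string and reads a single value back by computing its byte offset.

```python
import struct

values = [1, 3, 9, 9, 2]
BYTE_WIDTH = 4  # an int32 occupies 4 bytes

# Pack all values back to back as little-endian signed 32-bit integers.
data = struct.pack(f"<{len(values)}i", *values)

# A fixed-size layout needs exactly length * byte_width bytes.
assert len(data) == len(values) * BYTE_WIDTH  # 5 * 4 = 20 bytes


def value_at(buffer: bytes, index: int) -> int:
    """Read the int32 stored at position `index` without touching the rest."""
    (value,) = struct.unpack_from("<i", buffer, index * BYTE_WIDTH)
    return value


print(len(data))          # 20
print(value_at(data, 2))  # 9
```

Because every value has the same width, locating element `i` takes a single multiplication, which is why random access into this layout is constant-time.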


# TODO: Example of buffer on GPU? What does it mean for a buffer to be mutable??

## Intermezzo: inspecting the buffers using PyArrow and nanoarrow

#### PyArrow

For a `pyarrow.Array`, we can use the `buffers()` method to get a list of all the buffers of the array. The information for each buffer includes:

- address of the buffer
- buffer size in bytes
- whether the buffer lives in CPU memory or not (e.g. on a GPU)
- whether the buffer is mutable or not (buffers are generally mutable, i.e. changeable, but an Array is an immutable container in pyarrow)

In [5]:
column1.buffers()

[None,
 <pyarrow.Buffer address=0x51760020080 size=20 is_cpu=True is_mutable=True>]

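In this case, a simple fixed-width primitive array without nulls, there is only a single allocated buffer, which holds the data values. The leading `None` stands for the validity bitmap (covered in the section on null values below), which is not allocated here.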

PyArrow doesn't provide a direct, easy way to access the buffer content, but here are a few ways to inspect it:

In [6]:
values_buffer = column1.buffers()[1]
values_buffer

<pyarrow.Buffer address=0x51760020080 size=20 is_cpu=True is_mutable=True>

In [7]:
# getting the raw bytes as a Python bytes object (note this makes a copy! don't do this with larger data)
values_buffer.to_pybytes()

b'\x01\x00\x00\x00\x03\x00\x00\x00\t\x00\x00\x00\t\x00\x00\x00\x02\x00\x00\x00'

In [8]:
# zero-copy view as a numpy array (using the buffer protocol)
# -> this just shows the raw bytes as well
np.array(values_buffer)

array([1, 0, 0, 0, 3, 0, 0, 0, 9, 0, 0, 0, 9, 0, 0, 0, 2, 0, 0, 0],
      dtype=int8)

In [9]:
# in this case we know the buffer represents int32 values, so we can view the buffer as such
np.frombuffer(values_buffer, dtype=np.int32)

array([1, 3, 9, 9, 2], dtype=int32)
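
The same reinterpretation can be sketched with only the standard library. This example starts from the raw bytes printed above and uses `memoryview.cast` to view them as 32-bit integers without copying. It assumes a little-endian machine on which the C `int` type is 4 bytes wide (true on common platforms, but an assumption nonetheless).

```python
import sys

# The raw bytes of the data buffer, as returned by to_pybytes() above.
raw = b"\x01\x00\x00\x00\x03\x00\x00\x00\t\x00\x00\x00\t\x00\x00\x00\x02\x00\x00\x00"

# A memoryview over bytes exposes one unsigned byte per item ...
byte_view = memoryview(raw)

# ... and cast("i") reinterprets the same memory as native C ints (no copy).
int_view = byte_view.cast("i")

print(len(byte_view), len(int_view))  # 20 5 (when int is 4 bytes)
if sys.byteorder == "little":
    print(int_view.tolist())  # [1, 3, 9, 9, 2]
```

Both views share the same 20 bytes; only the item size used to interpret them differs.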

#### Inspecting buffers using nanoarrow

In [10]:
na_column1 = na.array(column1)

To start, nanoarrow can print the details of an array's layout, which already gives us insight into its buffers:

In [11]:
na_column1.inspect()

<ArrowArray int32>
- length: 5
- offset: 0
- null_count: 0
- buffers[2]:
  - validity <bool[0 b] >
  - data <int32[20 b] 1 3 9 9 2>
- dictionary: NULL
- children[0]:


It also allows us to access the buffers directly through the `buffers` property:

In [12]:
na_column1.buffers

(nanoarrow.c_buffer.CBufferView(bool[0 b] ),
 nanoarrow.c_buffer.CBufferView(int32[20 b] 1 3 9 9 2))

Nanoarrow keeps track of the context in which the buffer was created (i.e. it is part of an int32 array and represents the data values), so converting it to numpy yields int32 values rather than raw bytes:

In [13]:
data_buffer = na_column1.buffers[1]

In [14]:
data_buffer

nanoarrow.c_buffer.CBufferView(int32[20 b] 1 3 9 9 2)

In [15]:
np.array(data_buffer)

array([1, 3, 9, 9, 2], dtype=int32)

## Support for null values

Arrow supports missing values or "nulls" for all data types: any value in an array may be semantically null, whether primitive or nested type.

In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, is used alongside the data to indicate whether each value in the array is null or not. You can think of it as a vector of 0 and 1 values, where a 1 means that the value is not null ("valid"), while a 0 indicates that the value is null.

This validity bitmap is optional, i.e. if there are no missing values in the array the buffer does not need to be allocated (as in the example column 1 in the diagram below).

![](diagrams/primitive-diagram.svg)

In [16]:
column2 = pa.array([1.2, 3.4, 9.0, None, 2.9])
column2

<pyarrow.lib.DoubleArray object at 0x7177600d5d20>
[
  1.2,
  3.4,
  9,
  null,
  2.9
]

In [17]:
na.array(column2).inspect()

<ArrowArray double>
- length: 5
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 11101000>
  - data <double[40 b] 1.2 3.4 9.0 0.0 2.9>
- dictionary: NULL
- children[0]:


**Attention**: Arrow uses [least-significant bit (LSB) numbering](https://en.wikipedia.org/wiki/Bit_numbering) (also known as bit-endianness). This means that within a group of 8 bits (1 byte), we read right-to-left: the first value maps to the rightmost (least-significant) bit. However, the `nanoarrow` repr of the validity buffer in the example above already takes that into account and shows the values in logical order matching the position in the array.
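
To make the bit numbering concrete, here is a minimal pure-Python sketch of how such a bitmap can be built. The helper `pack_validity` is hypothetical (it is not a pyarrow or nanoarrow function); it sets bit `i % 8` of byte `i // 8` for every non-null value, which is what LSB numbering means.

```python
def pack_validity(values):
    """Build an LSB-numbered validity bitmap: bit i is 1 if values[i] is not None."""
    bitmap = bytearray((len(values) + 7) // 8)  # one bit per value, rounded up to whole bytes
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


bitmap = pack_validity([1.2, 3.4, 9.0, None, 2.9])

# Written the usual way (most-significant bit first), the byte reads right-to-left.
print(format(bitmap[0], "08b"))  # 00010111
```

Reversing that string gives `11101000`, the logical order that nanoarrow prints.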

The diagram above shows the bitmap as it is actually stored in memory. We can inspect the validity bitmap buffer with pyarrow and numpy:

In [19]:
validity_bitmap_buffer = column2.buffers()[0]
validity_bitmap_buffer.to_pybytes()

b'\x17'

For this small array of 5 values, the validity bitmap consists of a single byte (`0x17`, or `00010111` in binary). To view the data as bytes in numpy, we can use the `uint8` data type, which has a width of 1 byte:

In [21]:
np.frombuffer(validity_bitmap_buffer, dtype="uint8")

array([23], dtype=uint8)

Numpy also provides a function to "unpack" the 0/1 bits of those bytes into separate values:

In [22]:
np.unpackbits(np.frombuffer(validity_bitmap_buffer, dtype="uint8"), bitorder="little")

array([1, 1, 1, 0, 1, 0, 0, 0], dtype=uint8)

For this array of 5 elements, only the first 5 bits carry meaning; the remaining ("padding") bits are set to 0.
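
Looking up the validity of a single value does not require unpacking the whole bitmap. A minimal sketch, using a hypothetical helper `is_valid` (not a library function) and the byte `0x17` from above:

```python
def is_valid(bitmap: bytes, index: int) -> bool:
    """Return True if the bit for `index` is set in an LSB-numbered bitmap."""
    return bool((bitmap[index // 8] >> (index % 8)) & 1)


bitmap = b"\x17"
length = 5  # the bitmap alone does not record how many bits are meaningful

flags = [is_valid(bitmap, i) for i in range(length)]
null_count = flags.count(False)

print(flags)       # [True, True, True, False, True]
print(null_count)  # 1
```

The array length has to be supplied separately: without it, the padding bits would be indistinguishable from three additional nulls.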

### Null vs NaN

In numpy (and numpy-based packages such as pandas), `NaN` is often used as an indicator for "missing" values, mostly for lack of a better alternative (numpy has no general built-in support for missing values). `NaN` is a specific floating-point value ("Not a Number") defined by the IEEE floating-point standard, and is thus only available for floating-point data types.
In the Arrow format, since there is a separate concept of nulls, a NaN value is considered just another valid floating-point value:

In [23]:
arr = na.array([0.5, float("nan"), 1.5, None, 3.5], na.float64())

In [24]:
arr

nanoarrow.Array<double>[5]
0.5
nan
1.5
None
3.5

In [25]:
arr.buffers

(nanoarrow.c_buffer.CBufferView(bool[1 b] 11101000),
 nanoarrow.c_buffer.CBufferView(double[40 b] 0.5 nan 1.5 0.0 3.5))
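
The distinction can be sketched by decoding the two buffers by hand. The example below is pure Python under stated assumptions: little-endian doubles, the validity byte `0x17`, and a hypothetical helper `decode_doubles` that is not part of nanoarrow or pyarrow. The slot under the null holds `0.0` here, mirroring the output above, but its content is irrelevant because the validity bit decides.

```python
import math
import struct

# Data buffer: five float64 values; the slot of the null holds a placeholder (0.0).
data = struct.pack("<5d", 0.5, float("nan"), 1.5, 0.0, 3.5)
validity = b"\x17"  # bits 0, 1, 2 and 4 are set; bit 3 (the null) is not


def decode_doubles(validity: bytes, data: bytes, length: int) -> list:
    """Combine a validity bitmap and a float64 data buffer into Python values."""
    raw_values = struct.unpack(f"<{length}d", data)
    return [
        value if (validity[i // 8] >> (i % 8)) & 1 else None
        for i, value in enumerate(raw_values)
    ]


decoded = decode_doubles(validity, data, 5)

# NaN is a regular value stored in the data buffer; null exists only in the bitmap.
print([v is None for v in decoded])                        # [False, False, False, True, False]
print([v is not None and math.isnan(v) for v in decoded])  # [False, True, False, False, False]
```

The NaN at index 1 is marked valid in the bitmap, while the null at index 3 is identified by its unset validity bit alone.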

## Exercise

In the following code snippet, we create an Array object from Python `datetime` instances. What is the type of the array? Is this a fixed-width primitive type? How are the datetimes expressed?

<details><summary>Hints</summary>

* A `pyarrow.Array` has a `.type` attribute, and this DataType object has a `byte_width` attribute in the case of a fixed-width type.
* Does it just have a single data buffer (next to the validity bitmap)?

</details>

In [26]:
from datetime import datetime

column_datetime = pa.array([datetime(2024, 4, 22), datetime(2024, 4, 23), datetime(2024, 4, 24)])