In [1]:
import numpy as np
import nanoarrow as na
import pyarrow as pa

### Intermezzo: inspecting the buffers using PyArrow and nanoarrow

#### PyArrow

For a `pyarrow.Array`, we can use the `buffers()` method to get a list of all the buffers of the array. The information for each buffer includes:

- address of the buffer
- buffer size in bytes
- whether the buffer is mutable (the underlying memory can generally be changed, but an Array is an immutable container in pyarrow)

In [4]:
column1 = pa.array([1, 3, 9, 9, 2], type=pa.int64())
column1.buffers()

[None,
 <pyarrow.Buffer address=0x7f7cc1a090c0 size=40 is_cpu=True is_mutable=True>]

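In this case, a simple fixed-width primitive array, there is only a single buffer, holding the data values. The first entry, the slot for the validity bitmap, is `None` because the array contains no nulls.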

PyArrow doesn't provide easy, direct access to the buffer contents, but there are a few ways to inspect them:

In [6]:
values_buffer = column1.buffers()[1]
values_buffer

<pyarrow.Buffer address=0x7f7cc1a090c0 size=40 is_cpu=True is_mutable=True>

In [11]:
# getting the raw bytes as a Python bytes object (note this makes a copy! don't do this with larger data)
values_buffer.to_pybytes()

b'\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00'
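
To see how these 40 bytes map to the five values, they can be decoded by hand with the standard library's `struct` module. This sketch assumes little-endian byte order, which is what the output above shows (the `\t` is simply byte `0x09`):

```python
import struct

# The raw bytes printed above, split into five 8-byte groups for readability
raw = (
    b"\x01\x00\x00\x00\x00\x00\x00\x00"
    b"\x03\x00\x00\x00\x00\x00\x00\x00"
    b"\t\x00\x00\x00\x00\x00\x00\x00"
    b"\t\x00\x00\x00\x00\x00\x00\x00"
    b"\x02\x00\x00\x00\x00\x00\x00\x00"
)

# "<5q" means: little-endian, five signed 64-bit integers
decoded = struct.unpack("<5q", raw)
print(decoded)  # (1, 3, 9, 9, 2)
```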

In [10]:
# zero-copy view as a numpy array (using the buffer protocol)
# -> this just shows the raw bytes as well
np.array(values_buffer)

array([1, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0,
       0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0], dtype=int8)

In [16]:
# in this case we know the buffer represents int64 values, so we can view the buffer as such
np.frombuffer(values_buffer, dtype=np.int64)

array([1, 3, 9, 9, 2])

#### Inspecting buffers using nanoarrow

In [18]:
na_column4 = na.Array(column4)

To start, nanoarrow can print the details of an array's layout, which already gives us insight into its buffers:

In [19]:
na.c_array_view(na_column4)

<nanoarrow.c_lib.CArrayView>
- storage_type: 'int64'
- length: 5
- offset: 0
- null_count: 0
- buffers[2]:
  - validity <bool[0 b] >
  - data <int64[40 b] 1 3 9 9 2>
- dictionary: NULL
- children[0]:

It also lets us access the buffers directly through the `buffers` property:

In [23]:
na_column4.buffers

(nanoarrow.c_lib.CBufferView(bool[0 b] ),
 nanoarrow.c_lib.CBufferView(int64[40 b] 1 3 9 9 2))

Unlike PyArrow, nanoarrow keeps track of the context in which the buffer was created (it knows the buffer belongs to an int64 array and holds the data values), so converting it to numpy gives typed values rather than raw bytes:

In [24]:
data_buffer = na_column4.buffers[1]

In [25]:
data_buffer

nanoarrow.c_lib.CBufferView(int64[40 b] 1 3 9 9 2)

In [26]:
np.array(data_buffer)

array([1, 3, 9, 9, 2], dtype=int64)