# Variable sized types

## Variable length binary, large binary, string and large string

Binary and string types share the same physical layout. The only difference is that the string type requires its bytes to be valid UTF-8; a string array whose bytes are not valid UTF-8 will produce invalid results.

The difference between binary/string and large binary/large string is the offset type: `int32` in the first case and `int64` in the second. The large variants are generally used when the data in a single array exceeds 2 GB, which is the most that `int32` offsets can address.

![image info](./diagrams/var-string-diagram.svg)

In [1]:
import nanoarrow as na
import numpy as np
import pyarrow as pa

In [2]:
# Binary and string column examples

pa.array(['python', 'data', 'conference', None, "Berlin"], type=pa.binary())

<pyarrow.lib.BinaryArray object at 0x105426200>
[
  707974686F6E,
  64617461,
  636F6E666572656E6365,
  null,
  4265726C696E
]

In [3]:
pa.array(['python', 'data', 'conference', None, "Berlin"], type=pa.string())

<pyarrow.lib.StringArray object at 0x105376860>
[
  "python",
  "data",
  "conference",
  null,
  "Berlin"
]
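The hexadecimal values in the binary output are simply the UTF-8 bytes of the same strings. The sketch below checks this using only the Python standard library (no PyArrow), and also shows a byte sequence that is acceptable as binary but is not valid UTF-8. The helper name `is_valid_utf8` is made up for this illustration.

```python
values = ["python", "data", "conference", "Berlin"]

# Upper-case hex of the UTF-8 encoding, matching how the BinaryArray is printed.
hex_values = [v.encode("utf-8").hex().upper() for v in values]
print(hex_values)

def is_valid_utf8(raw: bytes) -> bool:
    """Hypothetical helper: True if raw can be decoded as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(is_valid_utf8(b"Berlin"))    # True
print(is_valid_utf8(b"\xff\xfe"))  # False: fine as binary, invalid as a string
```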

### String type

In [4]:
# Inspecting buffers using PyArrow and buffers() method

column4 = pa.array(['python', 'data', 'conference', None, "Berlin"], type=pa.string())
column4.buffers()

[<pyarrow.Buffer address=0x41e94020240 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x41e94020200 size=24 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x41e94020280 size=26 is_cpu=True is_mutable=True>]

In [5]:
# Inspecting buffers using PyArrow and buffers() method and numpy

validity_bitmap_buffer = column4.buffers()[0]
np.unpackbits(np.frombuffer(validity_bitmap_buffer, dtype="uint8"), bitorder="little")

array([1, 1, 1, 0, 1, 0, 0, 0], dtype=uint8)

In [6]:
offsets_buffer = column4.buffers()[1]
np.frombuffer(offsets_buffer, dtype="int32")

array([ 0,  6, 10, 20, 20, 26], dtype=int32)

In [7]:
values_buffer = column4.buffers()[2]
values_buffer.to_pybytes()

b'pythondataconferenceBerlin'
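The three buffers are enough to rebuild the column by hand. Below is a minimal sketch in plain Python, with the buffer contents copied from the outputs above rather than read from the array; `is_valid` and `decode_string_column` are names invented for this example.

```python
# Buffer contents copied from the outputs above.
validity_bitmap = bytes([0b00010111])  # bits, least significant first: 1 1 1 0 1
offsets = [0, 6, 10, 20, 20, 26]
data = b"pythondataconferenceBerlin"

def is_valid(bitmap: bytes, i: int) -> bool:
    # Bit i lives in byte i // 8, at position i % 8 (least significant bit first).
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1

def decode_string_column(bitmap: bytes, offsets: list, data: bytes) -> list:
    out = []
    for i in range(len(offsets) - 1):
        if not is_valid(bitmap, i):
            out.append(None)
        else:
            # Value i spans data[offsets[i]:offsets[i + 1]].
            out.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
    return out

decoded = decode_string_column(validity_bitmap, offsets, data)
print(decoded)
```

The null slot still occupies an offsets entry (20 to 20), it just covers zero bytes of the data buffer.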

In [8]:
# Inspecting buffers using nanoarrow

na_column4 = na.c_array(column4)
na.c_array_view(na_column4)

<nanoarrow.c_lib.CArrayView>
- storage_type: 'string'
- length: 5
- offset: 0
- null_count: 1
- buffers[3]:
  - validity <bool[1 b] 11101000>
  - data_offset <int32[24 b] 0 6 10 20 20 26>
  - data <string[26 b] b'pythondataconferenceBerlin'>
- dictionary: NULL
- children[0]:

### Binary type

In [9]:
column4 = pa.array(['python', 'data', 'conference', None, "Berlin"], type=pa.binary())
column4.buffers()

[<pyarrow.Buffer address=0x41e94020300 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x41e940202c0 size=24 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x41e94020340 size=26 is_cpu=True is_mutable=True>]

In [10]:
na_column4 = na.c_array(column4)
na.c_array_view(na_column4)

<nanoarrow.c_lib.CArrayView>
- storage_type: 'binary'
- length: 5
- offset: 0
- null_count: 1
- buffers[3]:
  - validity <bool[1 b] 11101000>
  - data_offset <int32[24 b] 0 6 10 20 20 26>
  - data <binary[26 b] b'pythondataconferenceBerlin'>
- dictionary: NULL
- children[0]:

### Comparing string and large string

In [11]:
column4 = pa.array(['python', 'data', 'conference', None, "Berlin"], type=pa.string())
na.c_array_view(na.c_array(column4))

<nanoarrow.c_lib.CArrayView>
- storage_type: 'string'
- length: 5
- offset: 0
- null_count: 1
- buffers[3]:
  - validity <bool[1 b] 11101000>
  - data_offset <int32[24 b] 0 6 10 20 20 26>
  - data <string[26 b] b'pythondataconferenceBerlin'>
- dictionary: NULL
- children[0]:

In [12]:
column4 = pa.array(['python', 'data', 'conference', None, "Berlin"], type=pa.large_string())
na.c_array_view(na.c_array(column4))

<nanoarrow.c_lib.CArrayView>
- storage_type: 'large_string'
- length: 5
- offset: 0
- null_count: 1
- buffers[3]:
  - validity <bool[1 b] 11101000>
  - data_offset <int64[48 b] 0 6 10 20 20 26>
  - data <string[26 b] b'pythondataconferenceBerlin'>
- dictionary: NULL
- children[0]:
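The offsets buffer sizes of 24 and 48 bytes follow directly from the layout: a column of length n has n + 1 offsets, each 4 bytes wide for `int32` and 8 bytes wide for `int64`. A small check using only the standard library `struct` module, with the offsets copied from the output above:

```python
import struct

offsets = [0, 6, 10, 20, 20, 26]  # 5 values -> 6 offsets

# Pack the offsets as little-endian int32 ("i") and int64 ("q").
offsets_int32 = struct.pack(f"<{len(offsets)}i", *offsets)
offsets_int64 = struct.pack(f"<{len(offsets)}q", *offsets)

print(len(offsets_int32))  # 24
print(len(offsets_int64))  # 48

# The last offset equals the size of the data buffer, so the offset type
# caps how many bytes a single array can address.
max_bytes_int32 = 2**31 - 1
print(offsets[-1] <= max_bytes_int32)  # True
```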

## Variable length binary and string view

The binary view and string view layouts were added in version 1.4 of the Arrow columnar format.
The main difference from the classic binary and string types is the **views buffer**, which takes the place of the offsets buffer. Each view may point into one of potentially several data buffers, or may contain the characters inline when the value is short enough. This also allows binary and string values to be written out of order.

These properties are important for efficient string processing. Each view also stores the first four bytes of its value as a prefix, which enables a profitable fast path for string comparisons, since these are frequently decided within the first four bytes.

![image info](./diagrams/var-string-view-diagram.svg)
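To make the layout concrete: according to the Arrow columnar format, each view occupies 16 bytes. The first 4 bytes hold the length; values of at most 12 bytes are stored inline in the remaining 12 bytes (zero padded), while longer values store a 4-byte prefix, a data buffer index and an offset into that buffer. The sketch below builds such views in plain Python and uses the length and prefix to reject unequal values early. The function names `make_view` and `definitely_different` are invented for this illustration and are not part of PyArrow.

```python
import struct

def make_view(value: bytes, buffer_index: int = 0, offset: int = 0) -> bytes:
    """Build one 16-byte view (hypothetical helper)."""
    length = len(value)
    if length <= 12:
        # Short value: length followed by the bytes inline, zero padded.
        return struct.pack("<i12s", length, value)
    # Long value: length, 4-byte prefix, data buffer index, offset.
    return struct.pack("<i4sii", length, value[:4], buffer_index, offset)

def definitely_different(view_a: bytes, view_b: bytes) -> bool:
    """True if length or prefix differ, so no data buffer lookup is needed."""
    return view_a[:8] != view_b[:8]

short = make_view(b"Short")
long_a = make_view(b"String longer than 12", buffer_index=0, offset=0)
long_b = make_view(b"Another long string", buffer_index=0, offset=21)

print(len(short), len(long_a))               # 16 16
print(definitely_different(long_a, long_b))  # True
print(definitely_different(long_a, long_a))  # False
```

When length and prefix match, the values may still differ later on, so a full comparison of the remaining bytes is needed in that case.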

In [13]:
column5 = pa.array(['String longer than 12', 'Short', None, 'Short string', "Another long string"], type=pa.string_view())
column5

<pyarrow.lib.StringViewArray object at 0x1054262c0>
[
  "String longer than 12",
  "Short",
  null,
  "Short string",
  "Another long string"
]

In [14]:
# Inspecting buffers using PyArrow and buffers() method
column5.buffers()

[<pyarrow.Buffer address=0x41e94020500 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x41e940d0080 size=80 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x41e94020540 size=40 is_cpu=True is_mutable=True>]

In [15]:
# Inspecting buffers using PyArrow and buffers() method and numpy

validity_bitmap_buffer = column5.buffers()[0]
np.unpackbits(np.frombuffer(validity_bitmap_buffer, dtype="uint8"), bitorder="little")

array([1, 1, 0, 1, 1, 0, 0, 0], dtype=uint8)

In [16]:
views_buffer = column5.buffers()[1]
np.frombuffer(views_buffer, dtype="int32")

array([        21, 1769108563,          0,          0,          5,
       1919903827,        116,          0,          0,          0,
                0,          0,         12, 1919903827, 1953702004,
       1735289202,         19, 1953459777,          0,         21],
      dtype=int32)

In [17]:
values_buffer = column5.buffers()[2]
values_buffer.to_pybytes()

b'String longer than 12Another long string'
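Read as 16-byte views rather than as a flat `int32` array, the views buffer can be decoded by hand. The sketch below copies the `int32` values, validity bits and data buffer from the outputs above and follows the view layout of the Arrow columnar format (length first, then either up to 12 inline bytes, or a 4-byte prefix, a buffer index and an offset). `decode_view` is a hypothetical helper, not a PyArrow function.

```python
import struct

# Values copied from the outputs above, four int32 per view.
views_int32 = [
    21, 1769108563, 0, 0,                    # "String longer than 12" (in data buffer)
    5, 1919903827, 116, 0,                   # "Short" (inline)
    0, 0, 0, 0,                              # null
    12, 1919903827, 1953702004, 1735289202,  # "Short string" (inline)
    19, 1953459777, 0, 21,                   # "Another long string" (in data buffer)
]
views_buffer = struct.pack("<20i", *views_int32)  # 5 views * 16 bytes
data_buffers = [b"String longer than 12Another long string"]
validity = [1, 1, 0, 1, 1]

def decode_view(view: bytes, data_buffers: list) -> bytes:
    (length,) = struct.unpack_from("<i", view, 0)
    if length <= 12:
        # Inline: the value sits directly after the length.
        return view[4:4 + length]
    # Otherwise: prefix, data buffer index and offset follow the length.
    _prefix, buffer_index, offset = struct.unpack_from("<4sii", view, 4)
    return data_buffers[buffer_index][offset:offset + length]

decoded = []
for i, valid in enumerate(validity):
    view = views_buffer[16 * i:16 * (i + 1)]
    decoded.append(decode_view(view, data_buffers).decode("utf-8") if valid else None)
print(decoded)
```

Only the two values longer than 12 bytes appear in the data buffer; "Short" and "Short string" live entirely inside their views.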