# Nested layouts

* List, List View
* Struct
* Map
* Union

Nested types introduce the concept of **parent** and **child arrays**, which express the relationships between the physical value arrays in a nested type structure.

A nested type depends on one or more child data types. For instance, List is a nested type (the parent) with one child (the data type of the values in the list).

## List

The list type stores a sequence of values of the same type in each column slot. The layout is similar to that of the binary or string type: an offsets buffer defines where each sequence starts and ends, and all the values of the column are stored consecutively in a values child array.

![image info](./diagrams/var-list-diagram.svg)

In [1]:
import nanoarrow as na
import numpy as np
import pyarrow as pa

In [2]:
column_1 = pa.array([[12, -7, 25], None, [0, -127, 127, 50], []],
                    type=pa.list_(pa.int8()))
column_1

<pyarrow.lib.ListArray object at 0x7f9952ede380>
[
  [
    12,
    -7,
    25
  ],
  null,
  [
    0,
    -127,
    127,
    50
  ],
  []
]

When inspecting a list type column (and nested data in general) with PyArrow, the `buffers()` method returns all buffers: first those of the list array itself (validity bitmap buffer and offsets buffer), then those of its child array (validity bitmap buffer and values buffer):

In [3]:
# Inspecting buffers using PyArrow and buffers() method
column_1.buffers()

[<pyarrow.Buffer address=0x7f995a208040 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x7f995a208000 size=20 is_cpu=True is_mutable=True>,
 None,
 <pyarrow.Buffer address=0x7f995a208080 size=7 is_cpu=True is_mutable=True>]

In [4]:
# Inspecting buffers using PyArrow and buffers() method and numpy
validity_bitmap_buffer = column_1.buffers()[0]
np.unpackbits(np.frombuffer(validity_bitmap_buffer, dtype="uint8"), bitorder="little")

array([1, 0, 1, 1, 0, 0, 0, 0], dtype=uint8)

In [5]:
offsets_buffer = column_1.buffers()[1]
np.frombuffer(offsets_buffer, dtype="int32")

array([0, 3, 3, 7, 7], dtype=int32)

In [6]:
values_validity_bitmap_buffer = column_1.buffers()[2]
values_validity_bitmap_buffer is None

True

In [7]:
values_buffer = column_1.buffers()[3]
np.frombuffer(values_buffer, dtype="int8")

array([  12,   -7,   25,    0, -127,  127,   50], dtype=int8)

In [8]:
# Inspecting buffers using nanoarrow
na_column_1 = na.Array(column_1)
na_column_1.inspect()

<ArrowArray list<item: int8>>
- length: 4
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 10110000>
  - data_offset <int32[20 b] 0 3 3 7 7>
- dictionary: NULL
- children[1]:
  'item': <ArrowArray int8>
    - length: 7
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int8[7 b] 12 -7 25 0 -127 127 50>
    - dictionary: NULL
    - children[0]:


### Fixed size list

A **fixed size list** is a special case of the variable-size list in which every column slot holds a sequence of the same length, so the offsets buffer is no longer needed.

![image info](./diagrams/fixed-list-diagram.svg)

In [10]:
column_2 = pa.array([[12, -7], None, [0, None]], type=pa.list_(pa.int16(), 2))
na.Array(column_2).inspect()

<ArrowArray fixed_size_list(2)<item: int16>>
- length: 3
- offset: 0
- null_count: 1
- buffers[1]:
  - validity <bool[1 b] 10100000>
- dictionary: NULL
- children[1]:
  'item': <ArrowArray int16>
    - length: 6
    - offset: 0
    - null_count: 3
    - buffers[2]:
      - validity <bool[1 b] 11001000>
      - data <int16[12 b] 12 -7 0 0 0 0>
    - dictionary: NULL
    - children[0]:


### List and large list comparison

In a regular variable-size list the offsets are `int32`, while in the **large** list the offsets are `int64`. A fixed size list has no offsets buffer at all.

In [12]:
column_1 = pa.array([[12, -7, 25], None, [0, -127, 127, 50], []],
                    type=pa.list_(pa.int8()))
na.Array(column_1).inspect()

<ArrowArray list<item: int8>>
- length: 4
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 10110000>
  - data_offset <int32[20 b] 0 3 3 7 7>
- dictionary: NULL
- children[1]:
  'item': <ArrowArray int8>
    - length: 7
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int8[7 b] 12 -7 25 0 -127 127 50>
    - dictionary: NULL
    - children[0]:


In [13]:
column_1_large = pa.array([[12, -7, 25], None, [0, -127, 127, 50], []],
                          type=pa.large_list(pa.int8()))
na.Array(column_1_large).inspect()

<ArrowArray large_list<item: int8>>
- length: 4
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 10110000>
  - data_offset <int64[40 b] 0 3 3 7 7>
- dictionary: NULL
- children[1]:
  'item': <ArrowArray int8>
    - length: 7
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int8[7 b] 12 -7 25 0 -127 127 50>
    - dictionary: NULL
    - children[0]:


### List and large list view

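The list view type stores a sizes buffer next to the offsets buffer, so each slot is described by an (offset, size) pair. This allows offsets to be out of order and lets slots share values in the child array. The large list view variant uses `int64` offsets and sizes instead of `int32`.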

![image info](./diagrams/var-list-view-diagram.svg)

In [14]:
column_3 = pa.ListViewArray.from_arrays(offsets=[4, 7, 0, 0, 3],
                                        sizes=[3, 0, 4, 0, 2],
                                        values=[0, -127, 127, 50, 12, -7, 25],
                                        mask=pa.array([False, True, False, False, False]))

In [15]:
column_3.buffers()

[<pyarrow.Buffer address=0x7f995a2084c0 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x7f995a208500 size=20 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x7f995a208540 size=20 is_cpu=True is_mutable=True>,
 None,
 <pyarrow.Buffer address=0x7f995a208580 size=56 is_cpu=True is_mutable=True>]

In [16]:
column_3

<pyarrow.lib.ListViewArray object at 0x7f9952f3dae0>
[
  [
    12,
    -7,
    25
  ],
  null,
  [
    0,
    -127,
    127,
    50
  ],
  [],
  [
    50,
    12
  ]
]

In [17]:
values_buffer = column_3.buffers()[4]
np.frombuffer(values_buffer, dtype="int64")

array([   0, -127,  127,   50,   12,   -7,   25])

## Struct

A struct is a nested type parameterized by an ordered sequence of types.

* one child array for each field
* child arrays are independent and need not be adjacent to each other in memory (only need to have the same length)

One can think of an individual struct field as a key-value pair where the key is the field name and the value is the child array. The field name (key) is stored in the schema, and the values of that field are stored in its child array.

![image info](./diagrams/struct-diagram.svg)

In [18]:
ty = pa.struct([pa.field('x', pa.string()),
                pa.field('y', pa.int8())])
column_4 = pa.array([{"x": "joe", "y": 1},
                     {"x": None, "y": 2}, None,
                     {"x": "mark", "y": 4},
                     {"x": "jane", "y": None}],
                    type=ty)
column_4

<pyarrow.lib.StructArray object at 0x7f9952f3dc60>
-- is_valid:
  [
    true,
    true,
    false,
    true,
    true
  ]
-- child 0 type: string
  [
    "joe",
    null,
    "",
    "mark",
    "jane"
  ]
-- child 1 type: int8
  [
    1,
    2,
    0,
    4,
    null
  ]

In [19]:
column_4.buffers()

[<pyarrow.Buffer address=0x7f995a208480 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x7f995a208600 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x7f995a2085c0 size=24 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x7f995a2086c0 size=11 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x7f995a208680 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x7f995a208640 size=5 is_cpu=True is_mutable=True>]

In [20]:
na.Array(column_4).inspect()

<ArrowArray struct<x: string, y: int8>>
- length: 5
- offset: 0
- null_count: 1
- buffers[1]:
  - validity <bool[1 b] 11011000>
- dictionary: NULL
- children[2]:
  'x': <ArrowArray string>
    - length: 5
    - offset: 0
    - null_count: 1
    - buffers[3]:
      - validity <bool[1 b] 10111000>
      - data_offset <int32[24 b] 0 3 3 3 7 11>
      - data <string[11 b] b'joemarkjane'>
    - dictionary: NULL
    - children[0]:
  'y': <ArrowArray int8>
    - length: 5
    - offset: 0
    - null_count: 1
    - buffers[2]:
      - validity <bool[1 b] 11110000>
      - data <int8[5 b] 1 2 0 4 0>
    - dictionary: NULL
    - children[0]:


## Map

Map type represents nested data where each value is a variable number of key-value pairs. Its physical representation is the same as a list of `{key, value}` structs.

The difference between a struct and a map is where the keys live. A struct keeps its keys (the field names) in the schema, so they must be strings, and stores the values in child arrays, one per field. Several keys therefore mean several child arrays. A map, on the other hand, has one child array holding all the keys, which must share a single type but need not be strings, and a second child array holding all the values, which must also share a single type (not necessarily the same as the key type).

The map also wraps the struct of keys and values in a list, so it needs an offsets buffer, because each slot can hold a different number of pairs.

![image info](./diagrams/map-diagram.svg)

In [21]:
column_6_data = [{'Dark Knight': 10},
                 {'Dark Knight': 8, 'Meet the Parents': 4, 'Superman': 5},
                 None,
                 {'Meet the Parents': 10, 'Superman': None}]
column_6 = pa.array(column_6_data, type=pa.map_(pa.string(), pa.int32()))
column_6

<pyarrow.lib.MapArray object at 0x7f9952f3e7a0>
[
  keys:
  [
    "Dark Knight"
  ]
  values:
  [
    10
  ],
  keys:
  [
    "Dark Knight",
    "Meet the Parents",
    "Superman"
  ]
  values:
  [
    8,
    4,
    5
  ],
  null,
  keys:
  [
    "Meet the Parents",
    "Superman"
  ]
  values:
  [
    10,
    null
  ]
]

In [22]:
column_6.buffers()

[<pyarrow.Buffer address=0x7f995a208740 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x7f995a208700 size=20 is_cpu=True is_mutable=True>,
 None,
 None,
 <pyarrow.Buffer address=0x7f995a2087c0 size=28 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x7f995a211080 size=70 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x7f995a208840 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x7f995a208800 size=24 is_cpu=True is_mutable=True>]

In [25]:
column_6.type

MapType(map<string, int32>)

In [26]:
na.Array(column_6).inspect()

<ArrowArray map<entries: struct<key: string, value: int32>>>
- length: 4
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 11010000>
  - data_offset <int32[20 b] 0 1 4 4 6>
- dictionary: NULL
- children[1]:
  'entries': <ArrowArray struct<key: string, value: int32>>
    - length: 6
    - offset: 0
    - null_count: 0
    - buffers[1]:
      - validity <bool[0 b] >
    - dictionary: NULL
    - children[2]:
      'key': <ArrowArray string>
        - length: 6
        - offset: 0
        - null_count: 0
        - buffers[3]:
          - validity <bool[0 b] >
          - data_offset <int32[28 b] 0 11 22 38 46 62 70>
          - data <string[70 b] b'D...>
        - dictionary: NULL
        - children[0]:
      'value': <ArrowArray int32>
        - length: 6
        - offset: 0
        - null_count: 1
        - buffers[2]:
          - validity <bool[1 b] 11111000>
          - data <int32[24 b] 10 8 4 5 10 0>
        - dictionary: NULL
        - children[0]:


### Comparing the conversion of struct and map to Python objects

Struct converts to a list of dictionaries:

In [27]:
column_4.to_numpy(zero_copy_only=False)

array([{'x': 'joe', 'y': 1.0}, {'x': None, 'y': 2.0}, None,
       {'x': 'mark', 'y': 4.0}, {'x': 'jane', 'y': None}], dtype=object)

In [28]:
column_4.to_pylist()

[{'x': 'joe', 'y': 1},
 {'x': None, 'y': 2},
 None,
 {'x': 'mark', 'y': 4},
 {'x': 'jane', 'y': None}]

And map converts to a nested list of tuples by default:

In [29]:
column_6.to_numpy(zero_copy_only=False)

array([list([('Dark Knight', 10.0)]),
       list([('Dark Knight', 8.0), ('Meet the Parents', 4.0), ('Superman', 5.0)]),
       None, list([('Meet the Parents', 10.0), ('Superman', None)])],
      dtype=object)

Or to dictionaries with the `maps_as_pydicts` keyword:

In [30]:
# maps_as_pydicts can be "lossy" or "strict".
# Converting to dictionaries can change the ordering of (key, value) pairs
# and deduplicates repeated keys, so data may be lost.

# "lossy": a warning is printed when duplicate keys are detected
# "strict": an exception is raised when duplicate keys are detected
column_6.to_pandas(zero_copy_only=False, maps_as_pydicts="lossy")

0                                {'Dark Knight': 10.0}
1    {'Dark Knight': 8.0, 'Meet the Parents': 4.0, ...
2                                                 None
3         {'Meet the Parents': 10.0, 'Superman': None}
dtype: object

In [31]:
column_6.to_pylist()

[[('Dark Knight', 10)],
 [('Dark Knight', 8), ('Meet the Parents', 4), ('Superman', 5)],
 None,
 [('Meet the Parents', 10), ('Superman', None)]]