# Nested types

* List, List View
* Struct
* Map
* Union
* Dictionary encoded

We will look at the first three types: list, struct, and map.

## List

The list type lets each column slot hold a sequence of values of the same type. The layout is similar to the binary or string type: an offsets buffer defines where each sequence starts and ends, and all the values of the column are stored consecutively in a values buffer.

![image info](./diagrams/var-list-diagram.svg)

In [2]:
import nanoarrow as na
import numpy as np
import pyarrow as pa

In [3]:
column_1 = pa.array([[12, -7, 25], None, [0, -127, 127, 50], []],
                    type=pa.list_(pa.int8()))
column_1

<pyarrow.lib.ListArray object at 0x110840ca0>
[
  [
    12,
    -7,
    25
  ],
  null,
  [
    0,
    -127,
    127,
    50
  ],
  []
]

In [4]:
# Inspect the buffers with PyArrow's buffers() method.
# Order here: list validity, list offsets, values validity, values data.
column_1.buffers()

[<pyarrow.Buffer address=0x4bdc40200c0 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x4bdc4020080 size=20 is_cpu=True is_mutable=True>,
 None,
 <pyarrow.Buffer address=0x4bdc4020100 size=7 is_cpu=True is_mutable=True>]

In [5]:
# Decode the validity bitmap (buffer 0) with NumPy; bits are stored least-significant first

validity_bitmap_buffer = column_1.buffers()[0]
np.unpackbits(np.frombuffer(validity_bitmap_buffer, dtype="uint8"), bitorder="little")

array([1, 0, 1, 1, 0, 0, 0, 0], dtype=uint8)

In [6]:
offsets_buffer = column_1.buffers()[1]
np.frombuffer(offsets_buffer, dtype="int32")

array([0, 3, 3, 7, 7], dtype=int32)

In [7]:
values_validity_bitmap_buffer = column_1.buffers()[2]
values_validity_bitmap_buffer is None

True

In [9]:
values_buffer = column_1.buffers()[3]
np.frombuffer(values_buffer, dtype="int8")

array([  12,   -7,   25,    0, -127,  127,   50], dtype=int8)
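To tie the three buffers together, here is a minimal plain-Python sketch of how a reader could rebuild the column from them. The function name `decode_list_array` is hypothetical and the inputs are the numbers printed above, copied by hand into ordinary lists; this illustrates the layout and is not how PyArrow decodes arrays internally.

```python
def decode_list_array(validity_bits, offsets, values):
    """Rebuild Python lists from the pieces of a variable-size list layout."""
    result = []
    for i, is_valid in enumerate(validity_bits):
        if not is_valid:
            # Null slot: its offsets are still present but are ignored.
            result.append(None)
        else:
            # Slot i spans values[offsets[i]:offsets[i + 1]].
            result.append(values[offsets[i]:offsets[i + 1]])
    return result


# Numbers copied from the buffers of the example column.
validity_bits = [1, 0, 1, 1]
offsets = [0, 3, 3, 7, 7]
values = [12, -7, 25, 0, -127, 127, 50]

decoded = decode_list_array(validity_bits, offsets, values)
print(decoded)
```

Note that the null slot and the empty list both have equal start and end offsets; only the validity bit tells them apart.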

In [10]:
# Inspecting buffers using nanoarrow

na_column_1 = na.c_array(column_1)
na.c_array_view(na_column_1)

<nanoarrow.c_lib.CArrayView>
- storage_type: 'list'
- length: 4
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 10110000>
  - data_offset <int32[20 b] 0 3 3 7 7>
- dictionary: NULL
- children[1]:
  - <nanoarrow.c_lib.CArrayView>
    - storage_type: 'int8'
    - length: 7
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int8[7 b] 12 -7 25 0 -127 127 50>
    - dictionary: NULL
    - children[0]:

### Fixed size list

**Fixed size list** is a special case of the variable-size list in which every column slot holds a sequence of the same fixed length, so the offsets buffer is no longer needed.

![image info](./diagrams/fixed-list-diagram.svg)

In [11]:
column_2 = pa.array([[12, -7], None, [0, None]], type=pa.list_(pa.int16(), 2))
na.c_array_view(na.c_array(column_2))

<nanoarrow.c_lib.CArrayView>
- storage_type: 'fixed_size_list'
- length: 3
- offset: 0
- null_count: 1
- buffers[1]:
  - validity <bool[1 b] 10100000>
- dictionary: NULL
- children[1]:
  - <nanoarrow.c_lib.CArrayView>
    - storage_type: 'int16'
    - length: 6
    - offset: 0
    - null_count: 3
    - buffers[2]:
      - validity <bool[1 b] 11001000>
      - data <int16[12 b] 12 -7 0 0 0 0>
    - dictionary: NULL
    - children[0]:
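Because every slot has the same length, the position of slot `i` in the child array can be computed as `i * list_size` instead of being looked up in an offsets buffer. The following plain-Python sketch illustrates this with the numbers from the output above; `decode_fixed_size_list` is a hypothetical helper written for this illustration, not a library function.

```python
def decode_fixed_size_list(list_validity, list_size, child_validity, child_values):
    """Rebuild Python lists from a fixed-size list layout (no offsets buffer)."""
    result = []
    for i, is_valid in enumerate(list_validity):
        if not is_valid:
            # A null list still occupies list_size slots in the child array.
            result.append(None)
            continue
        start = i * list_size  # computed, not stored
        result.append([
            child_values[j] if child_validity[j] else None
            for j in range(start, start + list_size)
        ])
    return result


# Numbers copied from the nanoarrow view of the example column.
list_validity = [1, 0, 1]
child_validity = [1, 1, 0, 0, 1, 0]
child_values = [12, -7, 0, 0, 0, 0]

decoded_fixed = decode_fixed_size_list(list_validity, 2, child_validity, child_values)
print(decoded_fixed)
```

This also explains why the child array reports three nulls: two belong to the null list in slot 1, and one is the `None` inside the last list.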

### List and large list comparison

In a regular variable-size list the offsets are `int32`, while in a **large** list they are `int64`. A fixed-size list has no offsets buffer at all.

In [13]:
column_1 = pa.array([[12, -7, 25], None, [0, -127, 127, 50], []],
                    type=pa.list_(pa.int8()))
na.c_array_view(na.c_array(column_1))

<nanoarrow.c_lib.CArrayView>
- storage_type: 'list'
- length: 4
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 10110000>
  - data_offset <int32[20 b] 0 3 3 7 7>
- dictionary: NULL
- children[1]:
  - <nanoarrow.c_lib.CArrayView>
    - storage_type: 'int8'
    - length: 7
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int8[7 b] 12 -7 25 0 -127 127 50>
    - dictionary: NULL
    - children[0]:

In [14]:
column_1_large = pa.array([[12, -7, 25], None, [0, -127, 127, 50], []],
                          type=pa.large_list(pa.int8()))
na.c_array_view(na.c_array(column_1_large))

<nanoarrow.c_lib.CArrayView>
- storage_type: 'large_list'
- length: 4
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 10110000>
  - data_offset <int64[40 b] 0 3 3 7 7>
- dictionary: NULL
- children[1]:
  - <nanoarrow.c_lib.CArrayView>
    - storage_type: 'int8'
    - length: 7
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int8[7 b] 12 -7 25 0 -127 127 50>
    - dictionary: NULL
    - children[0]:
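The buffer sizes reported above (20 bytes versus 40 bytes) follow directly from the offset width. As a rough check, the sketch below packs the same five offsets with Python's standard `struct` module, once as 32-bit and once as 64-bit integers. It only mimics the byte layout and does not use Arrow itself.

```python
import struct

# Offsets of the example column: one more entry than there are slots.
offsets = [0, 3, 3, 7, 7]

# "<i" is a little-endian 32-bit signed integer, "<q" a 64-bit one.
int32_offsets = struct.pack(f"<{len(offsets)}i", *offsets)
int64_offsets = struct.pack(f"<{len(offsets)}q", *offsets)

print(len(int32_offsets))  # 20 bytes, as in the 'list' output
print(len(int64_offsets))  # 40 bytes, as in the 'large_list' output

# The largest offset each width can hold bounds the length of the child array.
max_int32_offset = 2**31 - 1
max_int64_offset = 2**63 - 1
print(max_int32_offset, max_int64_offset)
```

The trade-off is therefore twice the offsets memory in exchange for being able to address a much longer child array.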

### List and large list view

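The list view type allows arrays to specify out-of-order offsets. In addition to the offsets buffer it has a sizes buffer, so each slot is described by its own offset and size, and slots may appear in any order or even overlap in the values. As with lists, the large list view uses `int64` instead of `int32` for offsets and sizes.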

![image info](./diagrams/var-list-view-diagram.svg)

In [25]:
column_3 = pa.ListViewArray.from_arrays(offsets=[4, 7, 0, 0, 3],
                                        sizes=[3, 0, 4, 0, 2],
                                        values=[0, -127, 127, 50, 12, -7, 25],
                                        mask=pa.array([False, True, False, False, False]))

In [26]:
column_3.buffers()

[<pyarrow.Buffer address=0x4bdc4021000 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x4bdc4050680 size=20 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x4bdc4050700 size=20 is_cpu=True is_mutable=True>,
 None,
 <pyarrow.Buffer address=0x4bdc4020fc0 size=56 is_cpu=True is_mutable=True>]

In [28]:
column_3

<pyarrow.lib.ListViewArray object at 0x108077ee0>
[
  [
    12,
    -7,
    25
  ],
  null,
  [
    0,
    -127,
    127,
    50
  ],
  [],
  [
    50,
    12
  ]
]

In [30]:
# The values were inferred as int64, so the 56-byte buffer holds 7 values of 8 bytes each
values_buffer = column_3.buffers()[4]
np.frombuffer(values_buffer, dtype="int64")

array([   0, -127,  127,   50,   12,   -7,   25])
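To see how offsets and sizes combine, the sketch below rebuilds the column in plain Python from the arguments passed to `from_arrays` above. The helper `decode_list_view` is hypothetical and exists only for illustration. The validity bits are the inverse of the `mask` argument, since a `True` mask entry marks a null.

```python
def decode_list_view(validity_bits, offsets, sizes, values):
    """Rebuild Python lists from a list-view layout (offsets plus sizes)."""
    result = []
    for is_valid, offset, size in zip(validity_bits, offsets, sizes):
        if not is_valid:
            result.append(None)
        else:
            # Each slot is an independent (offset, size) window into values.
            result.append(values[offset:offset + size])
    return result


validity_bits = [1, 0, 1, 1, 1]  # inverse of mask=[False, True, False, False, False]
offsets = [4, 7, 0, 0, 3]
sizes = [3, 0, 4, 0, 2]
values = [0, -127, 127, 50, 12, -7, 25]

decoded_view = decode_list_view(validity_bits, offsets, sizes, values)
print(decoded_view)
```

The last slot, `[50, 12]`, reuses values that already belong to two other slots, which a plain list layout with monotonically increasing offsets cannot express.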

## Struct

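A struct is a nested type parameterized by an ordered sequence of fields, each with a name and a type. In its physical layout:

* the struct has its own validity bitmap
* there is one child array for each field
* child arrays are independent and need not be adjacent to each other in memory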

![image info](./diagrams/struct-diagram.svg)

In [31]:
ty = pa.struct([pa.field('x', pa.string()),
                pa.field('y', pa.int8())])
column_4 = pa.array([("joe", 1), (None, 2), None, ("mark", 4), ("jane", None)],
                    type=ty)
column_4

<pyarrow.lib.StructArray object at 0x1082689a0>
-- is_valid:
  [
    true,
    true,
    false,
    true,
    true
  ]
-- child 0 type: string
  [
    "joe",
    null,
    "",
    "mark",
    "jane"
  ]
-- child 1 type: int8
  [
    1,
    2,
    0,
    4,
    null
  ]

In [32]:
column_4.buffers()

[<pyarrow.Buffer address=0x4bdc4021040 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x4bdc4020f80 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x4bdc4020e80 size=24 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x4bdc4020ec0 size=11 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x4bdc4020f00 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x4bdc4020f40 size=5 is_cpu=True is_mutable=True>]

In [33]:
na.c_array_view(na.c_array(column_4))

<nanoarrow.c_lib.CArrayView>
- storage_type: 'struct'
- length: 5
- offset: 0
- null_count: 1
- buffers[1]:
  - validity <bool[1 b] 11011000>
- dictionary: NULL
- children[2]:
  - <nanoarrow.c_lib.CArrayView>
    - storage_type: 'string'
    - length: 5
    - offset: 0
    - null_count: 1
    - buffers[3]:
      - validity <bool[1 b] 10111000>
      - data_offset <int32[24 b] 0 3 3 3 7 11>
      - data <string[11 b] b'joemarkjane'>
    - dictionary: NULL
    - children[0]:
  - <nanoarrow.c_lib.CArrayView>
    - storage_type: 'int8'
    - length: 5
    - offset: 0
    - null_count: 1
    - buffers[2]:
      - validity <bool[1 b] 11110000>
      - data <int8[5 b] 1 2 0 4 0>
    - dictionary: NULL
    - children[0]:
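The view above shows two levels of validity: one bitmap for the struct itself and one per child. A plain-Python sketch of how they combine is given below, using the numbers from the output; the helper names are hypothetical and the code illustrates the layout, not PyArrow's implementation.

```python
def decode_string_child(validity, offsets, data):
    """Decode a string child from its validity bits, offsets and data bytes."""
    return [
        data[offsets[i]:offsets[i + 1]].decode("utf-8") if validity[i] else None
        for i in range(len(validity))
    ]


def decode_int_child(validity, values):
    """Decode a primitive child from its validity bits and values."""
    return [value if is_valid else None for is_valid, value in zip(validity, values)]


# Numbers copied from the nanoarrow view of the example column.
struct_validity = [1, 1, 0, 1, 1]
x = decode_string_child([1, 0, 1, 1, 1], [0, 3, 3, 3, 7, 11], b"joemarkjane")
y = decode_int_child([1, 1, 1, 1, 0], [1, 2, 0, 4, 0])

# A row is null when the struct's own validity bit is unset,
# regardless of what the children hold in that slot.
rows = [
    {"x": xi, "y": yi} if is_valid else None
    for is_valid, xi, yi in zip(struct_validity, x, y)
]
print(rows)
```

The children still hold placeholder values (`""` and `0`) in the slot of the null struct; they are hidden by the parent's validity bit.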

## Map

The map type represents nested data where each slot holds a variable number of key-item pairs. Its physical representation is the same as a list of `{key, item}` structs.

![image info](./diagrams/map-diagram.svg)

In [34]:
column_6_data = [[{'key': 'Dark Knight', 'value': 10}],
                 [{'key': 'Dark Knight', 'value': 8}, {'key': 'Meet the Parents', 'value': 4}, {'key': 'Superman', 'value': 5}],
                 None,
                 [{'key': 'Meet the Parents', 'value': 10}, {'key': 'Superman', 'value': None}]]
column_6 = pa.array(column_6_data, type=pa.map_(pa.string(), pa.int32(), keys_sorted=True))
column_6

<pyarrow.lib.MapArray object at 0x108077dc0>
[
  keys:
  [
    "Dark Knight"
  ]
  values:
  [
    10
  ],
  keys:
  [
    "Dark Knight",
    "Meet the Parents",
    "Superman"
  ]
  values:
  [
    8,
    4,
    5
  ],
  null,
  keys:
  [
    "Meet the Parents",
    "Superman"
  ]
  values:
  [
    10,
    null
  ]
]

In [35]:
column_6.buffers()

[<pyarrow.Buffer address=0x4bdc4020dc0 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x4bdc4020cc0 size=20 is_cpu=True is_mutable=True>,
 None,
 None,
 <pyarrow.Buffer address=0x4bdc4020d40 size=28 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x4bdc4050800 size=70 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x4bdc4020b00 size=1 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x4bdc4050780 size=24 is_cpu=True is_mutable=True>]

In [36]:
na.c_array_view(na.c_array(column_6))

<nanoarrow.c_lib.CArrayView>
- storage_type: 'map'
- length: 4
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 11010000>
  - data_offset <int32[20 b] 0 1 4 4 6>
- dictionary: NULL
- children[1]:
  - <nanoarrow.c_lib.CArrayView>
    - storage_type: 'struct'
    - length: 6
    - offset: 0
    - null_count: 0
    - buffers[1]:
      - validity <bool[0 b] >
    - dictionary: NULL
    - children[2]:
      - <nanoarrow.c_lib.CArrayView>
        - storage_type: 'string'
        - length: 6
        - offset: 0
        - null_count: 0
        - buffers[3]:
          - validity <bool[0 b] >
          - data_offset <int32[28 b] 0 11 22 38 46 62 70>
          - data <string[70 b] b'D...>
        - dictionary: NULL
        - children[0]:
      - <nanoarrow.c_lib.CArrayView>
        - storage_type: 'int32'
        - length: 6
        - offset: 0
        - null_count: 1
        - buffers[2]:
          - validity <bool[1 b] 11111000>
          - data <int32[24 b] 10 8 4 5 10 0>
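Since a map is laid out like a list of key-item structs, it can be decoded with the same offsets logic as a list, applied to two parallel children. The plain-Python sketch below uses the values of the example column with the children already written out as Python lists; `decode_map` is a hypothetical helper for illustration only.

```python
def decode_map(validity_bits, offsets, keys, items):
    """Rebuild a list of (key, item) pairs per slot from a map layout."""
    result = []
    for i, is_valid in enumerate(validity_bits):
        if not is_valid:
            result.append(None)
        else:
            start, end = offsets[i], offsets[i + 1]
            # Keys and items are parallel children of the same struct.
            result.append(list(zip(keys[start:end], items[start:end])))
    return result


map_validity = [1, 1, 0, 1]
map_offsets = [0, 1, 4, 4, 6]
keys = ["Dark Knight", "Dark Knight", "Meet the Parents",
        "Superman", "Meet the Parents", "Superman"]
items = [10, 8, 4, 5, 10, None]  # the last item is null, keys never are

decoded_map = decode_map(map_validity, map_offsets, keys, items)
print(decoded_map)
```

The key lengths also account for the string offsets shown above: 11, 11, 16, 8, 16 and 8 bytes add up to the 70-byte data buffer.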
    