# In-memory data model
Apache Arrow defines columnar array data structures by composing type metadata with memory buffers, like the ones explained in the documentation on Memory and IO. These data structures are exposed in Python through a series of interrelated classes:

- Type Metadata: Instances of pyarrow.DataType, which describe a logical array type
- Schemas: Instances of pyarrow.Schema, which describe a named collection of types. These can be thought of as the column types in a table-like object.
- Arrays: Instances of pyarrow.Array, which are atomic, contiguous columnar data structures composed from Arrow Buffer objects
- Record Batches: Instances of pyarrow.RecordBatch, each of which is a collection of equal-length Array objects with a particular Schema
- Tables: Instances of pyarrow.Table, a logical table data structure in which each column consists of one or more pyarrow.Array objects of the same type.

We will examine these in the sections below in a series of examples.

## Type Metadata

In [1]:
import pyarrow as pa

In [4]:
t1 = pa.int32()
t1

DataType(int32)

In [6]:
t2 = pa.string()
t2

DataType(string)

In [7]:
t3 = pa.binary()
t3

DataType(binary)

In [9]:
t4 = pa.binary(10)
t4

FixedSizeBinaryType(fixed_size_binary[10])

In [11]:
t5 = pa.timestamp("ms")
t5

TimestampType(timestamp[ms])

In [13]:
f0 = pa.field("int32_field", t1)
f0

pyarrow.Field<int32_field: int32>

In [14]:
f0.name

'int32_field'

In [15]:
f0.type

DataType(int32)

In [17]:
t6 = pa.list_(t1)
t6

ListType(list<item: int32>)

In [19]:
fields = [
    pa.field("s0", t1),
    pa.field("s1", t2),
    pa.field("s2", t4),
    pa.field("s3", t6)
]

In [21]:
t7 = pa.struct(fields)
t7

StructType(struct<s0: int32, s1: string, s2: fixed_size_binary[10], s3: list<item: int32>>)

## Schemas

In [23]:
my_schema = pa.schema(fields)
my_schema

s0: int32
s1: string
s2: fixed_size_binary[10]
s3: list<item: int32>
  child 0, item: int32

## Arrays

In [25]:
arr = pa.array([1,2,None,3])
arr

<pyarrow.lib.Int64Array object at 0x10e3d9578>
[
  1,
  2,
  NA,
  3
]

In [33]:
some = arr.buffers()[1]
some.to_pybytes()

'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'
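The bytes above are the array's data buffer: four little-endian 64-bit slots, with the null slot holding zero here. Validity is tracked separately in the first buffer, `arr.buffers()[0]`, as a bitmap. The plain-Python sketch below decodes both by hand. The bitmap byte `0x0b` is an assumption derived from the Arrow format's least-significant-bit ordering, not a value shown in the output above, and `decode_int64_array` is a hypothetical helper name, not part of pyarrow.

```python
import struct

# Data buffer copied from the output above: four little-endian int64 slots.
data = (b"\x01\x00\x00\x00\x00\x00\x00\x00"
        b"\x02\x00\x00\x00\x00\x00\x00\x00"
        b"\x00\x00\x00\x00\x00\x00\x00\x00"
        b"\x03\x00\x00\x00\x00\x00\x00\x00")

# Assumed validity bitmap: bit i (least-significant first) is 1 when slot i is valid.
validity = b"\x0b"  # 0b00001011 -> slots 0, 1 and 3 are valid


def decode_int64_array(data, validity, length):
    """Hypothetical helper: rebuild Python values from the two buffers."""
    raw = struct.unpack("<%dq" % length, data)
    out = []
    for i, value in enumerate(raw):
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        out.append(value if is_valid else None)
    return out


decoded = decode_int64_array(data, validity, 4)
print(decoded)  # [1, 2, None, 3]
```

The decoded values match the array printed earlier, and the count of cleared bits among the first four corresponds to `arr.null_count`.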

In [34]:
pa.array([1, 2], type=pa.uint16())

<pyarrow.lib.UInt16Array object at 0x10e3d9c58>
[
  1,
  2
]

In [35]:
arr.type

DataType(int64)

In [36]:
len(arr)

4

In [37]:
arr.null_count

1

## List Arrays

In [39]:
nested_arr = pa.array([[[]], None, [[1,2],[3,4]], [[None], [1]]])

In [40]:
nested_arr.type

ListType(list<item: list<item: int64>>)

In [41]:
print(nested_arr.type)

list<item: list<item: int64>>
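A variable-size list array in the Arrow format stores an offsets buffer that delimits slices of a child array, plus a validity bitmap for null lists. As a rough illustration, the sketch below rebuilds `nested_arr` from such pieces in plain Python. The offsets and validity flags were worked out by hand from the input values, not read back from pyarrow, and `rebuild_lists` is a hypothetical helper.

```python
def rebuild_lists(offsets, valid, child):
    """Hypothetical helper: slot i is child[offsets[i]:offsets[i + 1]], or None if invalid."""
    out = []
    for i, ok in enumerate(valid):
        out.append(child[offsets[i]:offsets[i + 1]] if ok else None)
    return out


# Innermost int64 values, flattened (None marks a null integer).
values = [1, 2, 3, 4, None, 1]

# Inner level: [], [1, 2], [3, 4], [None], [1]
inner = rebuild_lists([0, 0, 2, 4, 5, 6], [True] * 5, values)

# Outer level: [[]], null, [[1, 2], [3, 4]], [[None], [1]]
outer = rebuild_lists([0, 1, 1, 3, 5], [True, False, True, True], inner)

print(outer)  # [[[]], None, [[1, 2], [3, 4]], [[None], [1]]]
```

An empty list and a null list both occupy a zero-length offset range; only the validity flag tells them apart.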


## Struct Arrays

In [42]:
ty = pa.struct([
    pa.field("x", pa.int8()),
    pa.field("y", pa.bool_())
])

In [44]:
cmplx_arr = pa.array([{"x" : 1, "y" : True}, {"x" : 2, "y" : False}], type=ty)
print(cmplx_arr)

<pyarrow.lib.StructArray object at 0x10e3ea470>
[
  {'y': True, 'x': 1},
  {'y': False, 'x': 2}
]


In [45]:
another_arr = pa.array([(3, True), (4, False)], type=ty)
print(another_arr)

<pyarrow.lib.StructArray object at 0x10e3ea628>
[
  {'y': True, 'x': 3},
  {'y': False, 'x': 4}
]
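Both input styles above, dicts and tuples, describe rows, while a struct array is stored column-wise as one child array per field. The following plain-Python sketch shows that pivot under simplifying assumptions (no nulls, tuples given in field order). `to_columns` is a hypothetical helper, not a pyarrow function.

```python
field_names = ["x", "y"]


def to_columns(rows, field_names):
    """Hypothetical helper: pivot row dicts or tuples into one list per field."""
    columns = {name: [] for name in field_names}
    for row in rows:
        if isinstance(row, dict):
            values = [row[name] for name in field_names]
        else:
            values = list(row)  # tuples are assumed to follow field order
        for name, value in zip(field_names, values):
            columns[name].append(value)
    return columns


from_dicts = to_columns([{"x": 1, "y": True}, {"x": 2, "y": False}], field_names)
from_tuples = to_columns([(3, True), (4, False)], field_names)

print(from_dicts)   # {'x': [1, 2], 'y': [True, False]}
print(from_tuples)  # {'x': [3, 4], 'y': [True, False]}
```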


## Union Arrays

In [57]:
xs = pa.array([5,6,7])
ys = pa.array([False, True, False])
zs = pa.array([b"viktor", b"jim", b"maria"])

In [69]:
types = pa.array([0,1,2], type=pa.int8())

In [70]:
union_arr = pa.UnionArray.from_sparse(types, [xs, ys, zs])

In [71]:
print(union_arr.type)

union[sparse]<0: int64=0, 1: bool=1, 2: binary=2>


In [72]:
union_arr

<pyarrow.lib.UnionArray object at 0x10e401470>
[
  5,
  True,
  'maria'
]

In [73]:
xs = pa.array([5,6, 7])
ys = pa.array([False, True])
types = pa.array([0,1,1,0,0], type = pa.int8())
offsets = pa.array([0,0,1,1,2], type=pa.int32())
dense_union_arr = pa.UnionArray.from_dense(types, offsets, [xs, ys])

In [74]:
print(dense_union_arr.type)

union[dense]<0: int64=0, 1: bool=1>


In [75]:
dense_union_arr

<pyarrow.lib.UnionArray object at 0x10e401578>
[
  5,
  False,
  True,
  6,
  7
]
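The two union layouts differ in how a slot finds its value. In the sparse form every child has the full array length and slot i reads position i of the selected child; in the dense form each child holds only its own values and a separate offsets array says where to look. A minimal plain-Python sketch of both lookups, using the same inputs as the cells above (`resolve_sparse` and `resolve_dense` are hypothetical names):

```python
def resolve_sparse(type_ids, children):
    # Sparse: slot i reads children[type_ids[i]][i].
    return [children[t][i] for i, t in enumerate(type_ids)]


def resolve_dense(type_ids, offsets, children):
    # Dense: slot i reads children[type_ids[i]][offsets[i]].
    return [children[t][o] for t, o in zip(type_ids, offsets)]


sparse = resolve_sparse(
    [0, 1, 2],
    [[5, 6, 7], [False, True, False], [b"viktor", b"jim", b"maria"]],
)
dense = resolve_dense(
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 2],
    [[5, 6, 7], [False, True]],
)

print(sparse)  # [5, True, b'maria']
print(dense)   # [5, False, True, 6, 7]
```

The sparse form leaves unused child slots in place (6, 7, "viktor" and so on are never read), while the dense form stores no unused values at the cost of the extra offsets array.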

## Dictionary Arrays

In [76]:
indices = pa.array([0,1,0,1,2,0,None,2])
dictionary = pa.array(["foo", "bar", "baz"])
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)

In [77]:
dict_array

<pyarrow.lib.DictionaryArray object at 0x10e400bb0>
[
  'foo',
  'bar',
  'foo',
  'bar',
  'baz',
  'foo',
  NA,
  'baz'
]
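A dictionary array stores small integer indices into a shared array of distinct values, so the printed result above is the indices resolved through the dictionary. The sketch below shows that decoding step, and one possible way to go in the other direction, in plain Python. Both helper names are hypothetical, and the first-appearance ordering used by `encode_dictionary` is an illustrative choice, not a statement about how pyarrow builds dictionaries.

```python
indices = [0, 1, 0, 1, 2, 0, None, 2]
dictionary = ["foo", "bar", "baz"]


def decode_dictionary(indices, dictionary):
    """Resolve each index through the dictionary; null indices stay null."""
    return [None if i is None else dictionary[i] for i in indices]


def encode_dictionary(values):
    """Assign each distinct non-null value an index in order of first appearance."""
    dictionary, positions, indices = [], {}, []
    for value in values:
        if value is None:
            indices.append(None)
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


decoded = decode_dictionary(indices, dictionary)
print(decoded)  # ['foo', 'bar', 'foo', 'bar', 'baz', 'foo', None, 'baz']

round_trip = encode_dictionary(decoded)
```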

## Record Batches

In [78]:
data = [
    pa.array([1,2,3,4]),
    pa.array(["foo", "bar", "baz", None]),
    pa.array([True, None, False, True])
]

In [79]:
batch = pa.RecordBatch.from_arrays(data, ["f0", "f1", "f2"])
batch.num_columns

3

In [80]:
batch.num_rows

4

In [81]:
batch.schema

f0: int64
f1: binary
f2: bool

In [82]:
batch2 = batch.slice(1,3)

In [85]:
print(batch2[1])

<pyarrow.lib.BinaryArray object at 0x10e401fc8>
[
  'bar',
  'baz',
  NA
]
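Note that `batch.slice(1,3)` takes an offset and a length, not a start and a stop, which is why the slice above holds three rows rather than two. A plain-Python sketch of the same semantics follows; `slice_columns` is a hypothetical helper, and unlike an Arrow slice, which is expected to be a zero-copy view over the same buffers, Python list slicing copies the data.

```python
def slice_columns(columns, offset, length):
    """Hypothetical helper: apply (offset, length) slicing to every column."""
    return {name: col[offset:offset + length] for name, col in columns.items()}


columns = {
    "f0": [1, 2, 3, 4],
    "f1": ["foo", "bar", "baz", None],
    "f2": [True, None, False, True],
}

sliced = slice_columns(columns, 1, 3)
print(sliced["f1"])  # ['bar', 'baz', None]
```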


## Tables

In [86]:
batches = [batch] * 5
table = pa.Table.from_batches(batches)

In [87]:
table

pyarrow.Table
f0: int64
f1: binary
f2: bool

In [88]:
c = table[0]
c

<pyarrow.lib.Column object at 0x10e35a630>
chunk 0: <pyarrow.lib.Int64Array object at 0x10e405310>
[
  1,
  2,
  3,
  4
]
chunk 1: <pyarrow.lib.Int64Array object at 0x10e405368>
[
  1,
  2,
  3,
  4
]
chunk 2: <pyarrow.lib.Int64Array object at 0x10e4053c0>
[
  1,
  2,
  3,
  4
]
chunk 3: <pyarrow.lib.Int64Array object at 0x10e405418>
[
  1,
  2,
  3,
  4
]
chunk 4: <pyarrow.lib.Int64Array object at 0x10e405470>
[
  1,
  2,
  3,
  4
]

In [89]:
c.to_pandas()

0     1
1     2
2     3
3     4
4     1
5     2
6     3
7     4
8     1
9     2
10    3
11    4
12    1
13    2
14    3
15    4
16    1
17    2
18    3
19    4
Name: f0, dtype: int64
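The column above consists of five chunks of four values each, and converting it to pandas yields one contiguous series of twenty values. As a rough model of that relationship, the sketch below flattens the chunks in plain Python and maps a global row index back to a chunk and a position within it. `locate` is a hypothetical helper; pyarrow's own lookup may work differently.

```python
import bisect
from itertools import chain

# Five chunks with the same contents, as produced by [batch] * 5.
chunks = [[1, 2, 3, 4] for _ in range(5)]

# Flattening concatenates the chunks in order.
flat = list(chain.from_iterable(chunks))

# Starting row index of each chunk: [0, 4, 8, 12, 16].
starts = []
total = 0
for chunk in chunks:
    starts.append(total)
    total += len(chunk)


def locate(index):
    """Map a global row index to (chunk number, index within that chunk)."""
    chunk_no = bisect.bisect_right(starts, index) - 1
    return chunk_no, index - starts[chunk_no]


chunk_no, local = locate(9)
print(len(flat), chunk_no, local, chunks[chunk_no][local])  # 20 2 1 2
```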