# Array type specifications

Rather than requiring you to specify a schema and build data to match it, awkward-array infers type specifications from the structure of nested awkward arrays. These types are presented to the user as `awkward.type.Type` objects, which may be thought of as a generalization of Numpy's `shape`, `dtype`, and `masked` parameters.

Not every awkward array class affects the type: a `ChunkedArray` of `X`, for instance, simulates a plain array of `X`. There are five structures that should be distinguishable to a high-level user:

   * **jaggedness:** some arrays contain arbitrary length subarrays
   * **tables:** some arrays are indexed by an enumerated set of strings, rather than integers ("product types")
   * **union:** some arrays represent tagged unions ("sum types")
   * **optional:** some arrays are masked, representing unions with the N/A singleton
   * **self-references:** some subarray components refer to cousins or ancestors on the tree of nested arrays

In [1]:
import os
os.chdir(os.path.expanduser("~"))

import numpy
from awkward.type import *

## Representation of types

We can get the `awkward.type.Type` of any Numpy or awkward array with `awkward.type.from_array` (for an awkward array, this calls the array's `.type` property).

In [2]:
from_array(numpy.arange(15))

ArrayType(15, dtype('int64'))

The `__repr__` string (above) provides a constructor you could use to make the type manually. However, it's not the easiest way to read complex types. Instead, view the `__str__` string by printing the type object.

In [3]:
print(from_array(numpy.arange(15)))

[0, 15) -> int64


The above means that the array is a function from integers in `[0, 15)` (including 0, excluding 15) to objects of type `int64`. This is the function that is "called" by passing integers inside square brackets.

Numpy arrays with any number of dimensions can be expressed as a chain of functions that return functions. After all, passing an integer to a 2D array gives you a 1D array; passing an integer to that gives you an array element. This is known as [currying](https://en.wikipedia.org/wiki/Currying).

In [4]:
array = numpy.arange(15).reshape(3, 5).view(numpy.uint64)
print(array)

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]


In [5]:
print(from_array(array))

[0, 3) -> [0, 5) -> uint64
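
To make the "chain of functions" reading concrete, here is a minimal sketch that uses plain nested Python lists instead of Numpy arrays. The helper `describe` is hypothetical (it is not part of `awkward.type`) and assumes a non-empty, rectangular nested list; it only mimics the notation printed above.

```python
def describe(nested):
    """Return a type string such as '[0, 3) -> [0, 5) -> int'.

    Hypothetical helper, not part of awkward.type. Assumes `nested` is a
    non-empty, rectangular nested list.
    """
    parts = []
    level = nested
    # Each list level contributes one integer domain [0, len).
    while isinstance(level, list):
        parts.append("[0, {0})".format(len(level)))
        level = level[0]
    # The innermost element determines the value type.
    parts.append(type(level).__name__)
    return " -> ".join(parts)


matrix = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14]]

row = matrix[1]     # first "call": an integer in [0, 3) returns a 1D array
element = row[2]    # second "call": an integer in [0, 5) returns an element

print(describe(matrix))   # [0, 3) -> [0, 5) -> int
print(row, element)       # [5, 6, 7, 8, 9] 7
```

Each pair of square brackets consumes one domain from the left of the type string, which is why `matrix[1]` has type `[0, 5) -> int`.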


## Jagged arrays

Jagged (or "ragged") arrays go beyond Numpy in that the size of subarrays is not the same for all subarrays. Some subarrays may be empty, some may have size 1, some may have size 2, and some may have size 1 million. We represent that as a type with an infinite integer domain. _(Yes, this is a slightly different usage, since `[0, inf)` here means that the subarray length is unbounded rather than that every non-negative integer is a valid index, but it's a consistent notation.)_

Below is a jagged array of 10 subarrays, where the subarrays may have any sizes.

In [6]:
print(ArrayType(10, numpy.inf, float))

[0, 10) -> [0, inf) -> float64


Below is a jagged array that contains Numpy arrays of fixed shape `(3, 5)` (i.e. 10 subarrays that each hold an arbitrary number of 3×5 matrices).

In [7]:
print(ArrayType(10, numpy.inf, 3, 5, float))

[0, 10) -> [0, inf) -> [0, 3) -> [0, 5) -> float64


Below is a jagged array of jagged arrays of jagged arrays.

In [8]:
print(ArrayType(10, numpy.inf, numpy.inf, numpy.inf, float))

[0, 10) -> [0, inf) -> [0, inf) -> [0, inf) -> float64


Below is a jagged array whose indexes (`starts` and `stops`) have shape `(3, 5)`.

In [9]:
print(ArrayType(3, 5, numpy.inf, float))

[0, 3) -> [0, 5) -> [0, inf) -> float64
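
As an illustration of why the inner domain is written `[0, inf)`, the sketch below stores a jagged array as flat content plus `starts` and `stops` offsets, using plain Python lists. The class `JaggedSketch` is a hypothetical stand-in written for this example; it is not awkward-array's own implementation.

```python
class JaggedSketch:
    """Hypothetical jagged array: flat content plus starts/stops offsets."""

    def __init__(self, starts, stops, content):
        if len(starts) != len(stops):
            raise ValueError("starts and stops must have the same length")
        self.starts = starts
        self.stops = stops
        self.content = content

    def __len__(self):
        # The outer domain [0, n) is fixed by the number of (start, stop) pairs.
        return len(self.starts)

    def __getitem__(self, i):
        # The inner domain is open-ended: each subarray has its own length.
        return self.content[self.starts[i]:self.stops[i]]


content = [1.1, 2.2, 3.3, 4.4, 5.5]
jagged = JaggedSketch(starts=[0, 3, 3], stops=[3, 3, 5], content=content)

lengths = [len(jagged[i]) for i in range(len(jagged))]
print(lengths)     # [3, 0, 2]
print(jagged[2])   # [4.4, 5.5]
```

In the notation of this section, `jagged` would read as `[0, 3) -> [0, inf) -> float`: three subarrays, each of which may have any length, including zero.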


## Tables

Numpy has structured arrays, which can be indexed by enumerated strings. This has the same logical type structure as a `Table` in awkward-array. One of the indexes that a user provides may be a string from the set `{"one", "two"}` to select a column.

In [10]:
print(from_array(numpy.array([(0, 0.0), (1, 1.1), (2, 2.2), (3, 3.3), (4, 4.4), (5, 5.5), (6, 6.6), (7, 7.7), (8, 8.8), (9, 9.9)], dtype=[("one", int), ("two", float)])))

[0, 10) -> 'one' -> int64
           'two' -> float64


We can construct the same thing by hand using the `&` operator (or the `awkward.type.TableType` constructor directly).

In [11]:
print(ArrayType(10, ArrayType("one", int) & ArrayType("two", float)))

[0, 10) -> 'one' -> int64
           'two' -> float64


Unlike Numpy structured arrays, the columns of an awkward `Table` can have different substructures from each other.

In [12]:
print(ArrayType(10, ArrayType("one", numpy.inf, int) & ArrayType("two", float)))

[0, 10) -> 'one' -> [0, inf) -> int64
           'two' -> float64


And because you can pass a string to a `Table` to get a column or pass an integer to the same `Table` to get a row, string and integer indexes _commute._ (Numpy structured arrays have this behavior as well.)

In [13]:
one = ArrayType(3, 5, ArrayType("one", int) & ArrayType("two", float))
two = ArrayType(3, ArrayType("one", 5, int) & ArrayType("two", 5, float))
print(one)
print(two)

[0, 3) -> [0, 5) -> 'one' -> int64
                    'two' -> float64
[0, 3) -> 'one' -> [0, 5) -> int64
          'two' -> [0, 5) -> float64


In [14]:
one == two

True
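
The commuting behavior can be sketched without awkward-array at all. The class `TableSketch` below is hypothetical and uses plain Python lists as columns; it only shows that selecting a column and then a row reaches the same value as selecting a row and then a column.

```python
class TableSketch:
    """Hypothetical table: named columns of equal length."""

    def __init__(self, **columns):
        lengths = {len(col) for col in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = columns

    def __getitem__(self, key):
        if isinstance(key, str):
            # A string index selects a whole column.
            return self.columns[key]
        # An integer index selects one row as a dict of field values.
        return {name: col[key] for name, col in self.columns.items()}


table = TableSketch(one=[0, 1, 2], two=[0.0, 1.1, 2.2])

by_column_then_row = table["two"][1]
by_row_then_column = table[1]["two"]
print(by_column_then_row, by_row_then_column)   # 1.1 1.1
```

Because both orders of indexing are valid and agree, it is reasonable for the two type expressions above to compare equal.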

## Cross-references

Another awkward-array feature is that nested elements can be cross-referenced. Among other things, this allows us to express trees and graphs.

_(Note: whether objects with the following type are trees or graphs depends entirely on their array values. The type specification doesn't determine connectedness properties. Also, we have to let the children/left/right be jagged or optional so that finite trees are a possibility!)_

In [15]:
tree = ArrayType("node_value", int)
tree["children"] = ArrayType(numpy.inf, tree)
print(tree)

T0 := 'node_value' -> int64
      'children'   -> [0, inf) -> T0


In [16]:
tree = ArrayType("node_value", int)
tree["left"] = OptionType(tree)
tree["right"] = OptionType(tree)
print(tree)

T0 := 'node_value' -> int64
      'left'       -> ?(T0)
      'right'      -> ?(T0)
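
One way to picture data with this binary-tree type is as three columns in which `left` and `right` hold either a row index into the same table or nothing at all. The sketch below uses plain Python lists and `None` for the missing case; the column layout and the helpers `total` and `depth` are assumptions made for illustration, not awkward-array's storage format.

```python
# Columns of a hypothetical node table; left/right hold a row index or None.
node_value = [10, 5, 20, 7]
left = [1, None, None, None]
right = [2, 3, None, None]


def total(i):
    """Sum node_value over the subtree rooted at row i."""
    if i is None:
        # A missing child (the "?" in ?(T0)) ends the recursion.
        return 0
    return node_value[i] + total(left[i]) + total(right[i])


def depth(i):
    """Number of levels in the subtree rooted at row i."""
    if i is None:
        return 0
    return 1 + max(depth(left[i]), depth(right[i]))


print(total(0))   # 42
print(depth(0))   # 3
```

These particular index values form a finite tree, so the recursion terminates. If the indexes formed a cycle, the same type would describe a graph and these helpers would not terminate, which is the sense in which the values, not the type, determine connectedness.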


## Unions

To emulate heterogeneous lists, awkward-array allows for union types. Whereas a table has content for every column (string index) at every row (integer index), a union has content for only one of its possibilities at every row (integer index). Therefore, they're in a sense opposites: tables are "product types" constructed with `&` and unions are "sum types" constructed with `|` (or the `awkward.type.UnionType` constructor directly).

In [17]:
print(ArrayType(10, ArrayType(3, int) | ArrayType(5, float)))

[0, 10) -> ([0, 3) -> int64   |
            [0, 5) -> float64 )
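
To show what "content for only one of its possibilities at every row" can look like in practice, the sketch below builds a tagged union from plain Python lists: a `tags` list picks which content list a row comes from, and an `index` list picks the position within it. The class `UnionSketch` and its argument names are hypothetical choices for this example.

```python
class UnionSketch:
    """Hypothetical tagged union over several content lists."""

    def __init__(self, tags, index, contents):
        if len(tags) != len(index):
            raise ValueError("tags and index must have the same length")
        self.tags = tags
        self.index = index
        self.contents = contents

    def __len__(self):
        return len(self.tags)

    def __getitem__(self, i):
        # The tag selects one possibility; the index selects an item within it.
        return self.contents[self.tags[i]][self.index[i]]


ints = [1, 2, 3]
floats = [1.1, 2.2]
union = UnionSketch(tags=[0, 1, 0, 0, 1],
                    index=[0, 0, 1, 2, 1],
                    contents=[ints, floats])

values = [union[i] for i in range(len(union))]
print(values)   # [1, 1.1, 2, 3, 2.2]
```

Every row yields exactly one value, drawn from exactly one of the possibilities, which is the "sum type" behavior described above.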


The possibilities of a union may be tables. In the string representation, note the location of the `|`, which distinguishes records with `{"one", "two"}` fields from records with `{"uno", "dos", "tres"}` fields.

Note that `&` binds more tightly than `|`, so the expression below needs no extra parentheses.

In [18]:
print(ArrayType(10, ArrayType("one", int) & ArrayType("two", float) | ArrayType("uno", bool) & ArrayType("dos", int) & ArrayType("tres", float)))

[0, 10) -> ('one' -> int64
            'two' -> float64  |
            'uno'  -> bool
            'dos'  -> int64
            'tres' -> float64 )
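
The grouping follows from Python's own operator precedence rather than from anything specific to awkward-array. As a small check, the hypothetical class `T` below overloads `&` and `|` only to record how an unparenthesized expression is grouped.

```python
class T:
    """Hypothetical placeholder type that records how `&` and `|` group."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return T("({0} & {1})".format(self.text, other.text))

    def __or__(self, other):
        return T("({0} | {1})".format(self.text, other.text))


a, b, c, d = T("a"), T("b"), T("c"), T("d")

# No parentheses: Python evaluates both `&` operations before the `|`.
grouped = a & b | c & d
print(grouped.text)   # ((a & b) | (c & d))
```

A union of two tables can therefore be written without parentheses, whereas a table that has a union as one of its fields would need them.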


All of the above can be combined to make truly complex types. There's as much flexibility in this type system as in a basic programming language.

In [19]:
t = ArrayType(10, ArrayType(numpy.inf, ArrayType("one", int) & ArrayType("two", float)) | ArrayType("uno", bool) & ArrayType("dos", OptionType(int)) & ArrayType("tres", float) | ArrayType(5, 3, OptionType(ArrayType(numpy.inf, float))))
t.to[1]["tres"] = t
print(t)

T0 := [0, 10) -> ([0, inf) -> 'one' -> int64
                              'two' -> float64               |
                  'uno'  -> bool
                  'dos'  -> ?(int64)
                  'tres' -> T0                               |
                  [0, 5) -> [0, 3) -> ?([0, inf) -> float64) )
