# Arrow types overview

## All types

The previous notebooks introduced several types supported by the Arrow columnar format while illustrating the physical layouts. The format defines many more:

![image info](./diagrams/all-types.svg)

## Extension Types

When a system or application needs to extend the standard Arrow data types with custom semantics, it can do so by defining **extension types**, also called **user-defined types**.

For example:

* A universally unique identifier (`uuid`) can be represented as a `FixedSizeBinary` type
* Trading time can be represented as a `Timestamp` with metadata indicating the market trading calendar

Extension types can be defined by annotating any of the built-in Arrow logical types (the “storage type”) with a **custom type name** and **optional serialized representation** (`'ARROW:extension:name'` and `'ARROW:extension:metadata'` keys in the `Field` metadata structure).

Source: https://arrow.apache.org/docs/dev/format/Columnar.html#extension-types

### Canonical Extension Types

It is beneficial to share the definitions of well-known extension types so as to improve interoperability between different systems integrating Arrow columnar data. For this reason canonical extension types are defined in Arrow itself.

Examples:

* Fixed and variable shape tensor
  - https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#fixed-shape-tensor
  - https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#variable-shape-tensor

Source: https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#

### Community Extension Types

These are Arrow extension types that have been established as standards within specific domain areas.

Example:

* GeoArrow: a collection of Arrow extension types for representing vector geometries
  - https://github.com/geoarrow/geoarrow

  ```text
  PointArray:PointType(geoarrow.point)[3]
  <POINT (1 3)>
  <POINT (2 4)>
  <POINT (3 5)>
  ```

### Subclassing ExtensionType from Python

Extension types are defined from Python by subclassing PyArrow's [`ExtensionType`](https://arrow.apache.org/docs/dev/python/generated/pyarrow.ExtensionType.html#pyarrow.ExtensionType) and giving the derived class its own extension name and serialization mechanism.

UUID example:

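```python
import pyarrow as pa


class UuidType(pa.ExtensionType):

    def __init__(self):
        # Storage type is FixedSizeBinary(16); the string is the extension name.
        super().__init__(pa.binary(16), "my_package.uuid")

    def __arrow_ext_serialize__(self):
        # The type has no parameters, so there is nothing to serialize.
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # Sanity checks on the storage type and the serialized metadata.
        assert storage_type == pa.binary(16)
        assert serialized == b''
        return UuidType()
```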

# Overview of Arrow terminology

### Buffer

A contiguous region of memory with a given length. Buffers are used to store data for arrays.

### Array

A contiguous, one-dimensional sequence of values with known length where all values have the same type. An array consists of zero or more buffers.

### Chunked Array*

A discontiguous, one-dimensional sequence of values with known length where all values have the same type. A chunked array consists of zero or more arrays, the “chunks”.

*Note: this is a concept specific to certain implementations such as Arrow C++ and PyArrow.

### RecordBatch

A contiguous, two-dimensional data structure which consists of an ordered collection of arrays of the same length.

### Schema

A collection of fields with optional metadata that determines all the data types of an object like a record batch or table.

### Table*

A discontiguous, two-dimensional chunk of data consisting of an ordered collection of chunked arrays. All chunked arrays have the same length, but may have different types. Different columns may be chunked differently.

![image info](./diagrams/tables-versus-record-batches.svg)

*Note: this is a concept specific to certain implementations such as Arrow C++ and PyArrow.


For more details, check the **Arrow Glossary**: https://arrow.apache.org/docs/dev/format/Glossary.html